
This is now more or less functional on the test data.

* Add readme
* Clean up comments and syntax
master
William Davis 10 months ago
parent
commit
cf47ce959b
  1. 14
      README.md
  2. 25
      main.awk

14
README.md

@@ -0,0 +1,14 @@
# Oligonucleotide format
This is a script that takes messy stdin and outputs a CSV appropriate for ordering primers from Sigma-Aldrich.
## To do
* Broaden gene name field regex
* Better sequence regex
* Handle different field separator types (copying from Excel is the expected use case)
* Handle junk data like extraneous lines
* Fix output format accuracy
* Handle forward and reverse primers appropriately
* Is there a better way to match fields than a for loop (`for (i = 1; i <= NF; i++)`)?
* Lint
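
A note on two of the to-do items above (the gene name regex and the field separator), as a minimal sketch rather than the script's actual implementation. It assumes tab-separated input, which is what a copy from Excel usually produces, and it matches the header with a regex on the lowercased field instead of an exact comparison. The function name `find_name_column` is hypothetical. As far as I know, awk has no built-in field search, so the loop over `NF` is the usual idiom.

```sh
# Hypothetical helper: print the index of the first header column whose
# name contains "gene" or "name", ignoring case.
# Assumption: fields are separated by tabs (typical for pasted Excel cells).
find_name_column() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++)
        if (tolower($i) ~ /gene|name/) { print i; exit }
      exit   # header had no matching column: print nothing
    }'
}

# Example: the header has the name in the second column.
printf 'ID\tGene Name\tSequence\nx1\tactB\tatcgatcgatcg\n' | find_name_column
```

Using `tolower()` avoids relying on `IGNORECASE`, which is a gawk extension and has no effect in other awk implementations.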

25
main.awk

@@ -1,22 +1,13 @@
#!/usr/bin/awk -f
# this is a script that takes messy stdin and outputs a csv appropriate for ordering primers from Sigma Aldrich
# check if first line contains word "Gene" (case insensitive) or "Name"
# if so set var to that column number
# for each subsequent line put namevar field into gene name spot (?) and regex anything that looks like a gene seq into the seq spot
# for each line check if namevar is a duplicate to output either F or R
# FS should be set to whatever happens when someone copy pastes an excel sheet
# Oligo_Name, 5'Mod, Sequence, 3'Mod, Scale(µmole), Purification, Format, Concentration(µM), Number of Tubes (Default = 1 tube)
BEGIN { IGNORECASE = 1 }
BEGIN { FS=","; namevar="GENE" }
-NR == 1 && /gene/ { # too many {}{}}{}{{} ?
-for ( i=1;i<=NF;i++) { # there has to be a better way
-if($i=="gene") { # should be a regex
-namecolumn=i
-}
-}
+NR == 1 && /gene/ {
+for ( i=1;i<=NF;i++) # there has to be a better way
+if ( $i=="gene" ) namecolumn=i
 }
-{ print $namecolumn } # check assignment first
-END {
-print "gene name column in input is: " namecolumn
+NR != 1 {
+for ( i=1;i<=NF;i++) if ( $i ~ /[atcg]{6,}/ ) seq=$i
+print $namecolumn ",," seq ",,scale,Desalt,Tubes,,1"
 }
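
One thing to watch in the `NR != 1` rule: `seq` is never cleared, so a row with no matching field prints the sequence left over from the previous row. The sketch below is one possible way to handle that, along with the "F or R" idea from the script's comments. It is not the committed code. Its assumptions: the name is in column 1 (`namecolumn=1`), the first occurrence of a name is the forward primer and any later one is the reverse, and the `_F`/`_R` suffixes and the function name `format_primers` are hypothetical. The output columns are copied unchanged from the `print` statement above.

```sh
# Hypothetical sketch: emit one order line per data row that has a sequence.
# Assumptions: comma-separated input, name in column 1, first occurrence
# of a name is forward (_F), any later occurrence is reverse (_R).
format_primers() {
  awk -F ',' -v namecolumn=1 '
    NR == 1 { next }     # skip the header row
    {
      seq = ""           # reset so a row without a sequence cannot reuse the previous one
      for (i = 1; i <= NF; i++)
        # whole field must be 6 or more bases; written without {6,} because
        # some awk implementations do not support interval expressions
        if (i != namecolumn && tolower($i) ~ /^[acgt][acgt][acgt][acgt][acgt][acgt]+$/)
          seq = $i
      if (seq == "") next          # junk line: nothing that looks like a sequence
      suffix = (seen[$namecolumn]++ ? "_R" : "_F")
      print $namecolumn suffix ",," seq ",,scale,Desalt,Tubes,,1"
    }'
}

printf 'gene,seq\nactB,ATCGATCGAT\nactB,ggccttaagg\njunk,none\n' | format_primers
```

Anchoring the pattern with `^` and `$` is stricter than the original `/[atcg]{6,}/`, which matches any field that merely contains six bases in a row. Whether that is the right trade-off depends on how messy the pasted sequences are.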
