The Genbank parsing problem
Oh BioRuby, you are just as capricious and inconsistent as Ruby itself.
You can parse GenBank files with BioRuby like so:
require 'bio'

puts "Parsing seqs ..."
Bio::FlatFile.auto("foo.genbank").each_entry { |gb|
  puts "Sequence '#{gb.to_biosequence.entry_id}'"
}
puts "Finished."
which prints the ID of every sequence in the file. However, if the file ends with blank lines, i.e. after the GenBank record terminator (//):
2161 agagcccgaa ttgatgcacg gattgatttc gaatctggaa ggataaagaa agaggaattc
2221 gctgagatca tgaagacctg ttccaccatt gaagacctca gacggcaaaa atag
//
<blank line here>
as the Fetch functionality in BioRuby produces, then BioRuby reads these as additional, empty records:
Parsing seqs ...
Sequence 'CY011043'
Sequence ''
Finished.
Nice. You can route around this by trimming the trailing blank lines before handing the data to the parser:
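require 'stringio'

puts "Parsing seqs ..."
data = File.open("foo.genbank", "rb") { |f| f.read }
# Use rstrip, not rstrip!: the bang form returns nil when there is nothing to strip
buffer = StringIO.new(data.rstrip, 'rb')
Bio::FlatFile.auto(buffer).each_entry { |gb|
  puts "Sequence '#{gb.to_biosequence.entry_id}'"
}
puts "Finished."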
to give:
Parsing seqs ...
Sequence 'CY011043'
Finished.
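If you do this in more than one place, the trimming step can be pulled out into a small helper. What follows is only a sketch: trimmed_io is a made-up name, not part of BioRuby, and the sample strings are stand-ins rather than real GenBank records. It uses only the standard library and shows why the non-bang rstrip is the safer choice: rstrip! returns nil when the text has no trailing whitespace, so the parser would never receive the file contents.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): wrap flat-file text in a
# StringIO with any trailing whitespace and blank lines removed.
def trimmed_io(text)
  # String#rstrip always returns a String, even if nothing was stripped.
  StringIO.new(text.rstrip)
end

# Stand-in data: one record terminator followed by blank lines, and one without.
with_blanks    = "LOCUS       CY011043\n//\n\n\n"
without_blanks = "LOCUS       CY011043\n//"

# Both inputs end up with the same trimmed contents.
puts trimmed_io(with_blanks).read == trimmed_io(without_blanks).read

# The bang form signals "nothing changed" by returning nil.
puts without_blanks.dup.rstrip!.inspect
```

The resulting object could then be passed to Bio::FlatFile.auto in place of the buffer above.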

