Genbank files
Some of the finer and surprising points of parsing Genbank files.
Recently, I had cause to take a large number of Genbank-formatted files and modify them, adding attributes and changing the name and id of the sequence. Here are some unexpected 'features' I encountered. The parsing was done with Biopython, and some of these issues are naturally Python-centric. Positions are given in normal 1-based (not computer science zero-based) form, i.e. the first character is position 1.
LOCUS       BTV8_Netherlands_2006   1981 bp    DNA     linear   VRL 15-JUL-2008
ACCESSION   AM498054
DEFINITION  Bluetongue virus 8 complete viral segment 4 ...
Biopython extracts an id (accession) and a name (title) for the SeqRecord it constructs from any sequence file it parses. Unfortunately, what they are depends on the actual format. By experimentation, for Genbank, the name is taken from the LOCUS field on the first line, and the id is taken from the VERSION field. ACCESSION (oddly) didn't seem to be used.
The first line of the file has very strict positional requirements:
- It must start with 'LOCUS' and then space padding up to 12 characters
- The name follows (starting at position 13) and must not contain any space, but can contain punctuation and other strange characters. The positional constraints of the header line thus limit the length of the name.
- To describe length, 'bp' or 'aa' must appear at position 42. This is preceded by the actual length and a space.
- Thus the length is right-aligned in its position. All other fields are left-aligned.
- The sequence type (e.g. 'DNA') appears at position 48, but can be missing (blank).
- The sequence shape (e.g. 'linear', 'circular') appears at position 56, but can be missing.
- The division code (e.g. 'VRL') appears at position 65 and is compulsory.
- The date appears at position 69.
There is an earlier version of Genbank, with different fixed positions. Seriously.
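To make the positional rules for the current layout concrete, here is a minimal pure-Python sketch that builds and then re-reads a LOCUS line using only the positions listed above. It is not how Biopython does it; the function names `format_locus` and `parse_locus` are made up for illustration, and the field widths (8 characters for the type, 9 for the shape, 3 for the division) are my assumption, inferred from where the next field starts.

```python
def format_locus(name, length, units="bp", mol_type="DNA",
                 shape="linear", division="VRL", date="15-JUL-2008"):
    """Build a LOCUS line from the 1-based positions described above."""
    if " " in name:
        raise ValueError("locus name must not contain spaces")
    length_text = str(length)
    # The name (from position 13) and the right-aligned length share
    # positions 13-40 and need at least one space between them.
    if len(name) + 1 + len(length_text) > 28:
        raise ValueError("name and length do not fit in positions 13-40")
    line = "LOCUS".ljust(12)                          # positions 1-12
    line += name + length_text.rjust(28 - len(name))  # positions 13-40
    line += " " + units.ljust(6)                      # space at 41, units at 42
    line += mol_type.ljust(8)                         # type at 48 (may be blank)
    line += shape.ljust(9)                            # shape at 56 (may be blank)
    line += division.ljust(4)                         # division at 65
    line += date                                      # date at 69
    return line


def parse_locus(line):
    """Split a LOCUS line back into its fields by fixed position."""
    if not line.startswith("LOCUS") or line[5:12].strip():
        raise ValueError("line must start with 'LOCUS' padded to position 12")
    units = line[41:43]
    if units not in ("bp", "aa"):
        raise ValueError("expected 'bp' or 'aa' at position 42")
    # Exactly two tokens are expected here: the name and the length.
    name, length = line[12:40].split()
    return {
        "name": name,
        "length": int(length),
        "units": units,
        "mol_type": line[47:55].strip(),  # assumed width: 8 characters
        "shape": line[55:64].strip(),     # assumed width: 9 characters
        "division": line[64:67],          # assumed width: 3 characters
        "date": line[68:].strip(),
    }


example = format_locus("BTV8_Netherlands_2006", 1981)
print(example)
print(parse_locus(example))
```

With 'LOCUS' padded to 12 characters and the length ending at position 40, the name and length together get 28 characters, which is where the limit on name length comes from.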
Subsequent fields in Genbank are less rigorous about positioning.
Additional fields can be inserted into the file (e.g. 'AWESOME Very awesome.'). For the most part, parsers skip unrecognised fields.
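As a rough illustration of what skipping unrecognised fields might look like, here is a small sketch of a tolerant header reader. It assumes that a field keyword starts in the first column and that continuation lines are indented; `KNOWN_KEYS` and `read_header_fields` are hypothetical names, and the key set is only an illustrative subset, not what any real parser uses.

```python
# Illustrative subset of header keywords; a real parser knows many more.
KNOWN_KEYS = {"LOCUS", "DEFINITION", "ACCESSION", "VERSION"}


def read_header_fields(lines, known=KNOWN_KEYS):
    """Collect known top-level fields, silently skipping unknown ones."""
    fields = {}
    current = None  # the known field being read, or None while skipping
    for line in lines:
        if not line.strip():
            continue
        if line[0] != " ":
            # A keyword in the first column starts a new field.
            key, _, value = line.partition(" ")
            current = key if key in known else None
            if current is not None:
                fields[current] = value.strip()
        elif current is not None:
            # An indented line continues the current known field.
            fields[current] += " " + line.strip()
        # Indented lines under an unknown field are dropped with it.
    return fields


header = [
    "DEFINITION  Bluetongue virus 8 complete viral",
    "            segment 4",
    "AWESOME     Very awesome.",
    "            (made-up continuation line)",
    "ACCESSION   AM498054",
]
print(read_header_fields(header))
```

The unknown AWESOME field and its continuation line are dropped, while the fields on either side are read as usual.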

