Hitchhikers guide to BioPython: SeqRecords

For the novice, more-than-raw sequences.

(Previously published on BiocodersHub.)

Previously I'd spoken about how Biopython represents sequence data with the Seq class. But there is also the SeqRecord class:

  • A Seq is just raw sequence data and information about what type of sequence it is.
  • A SeqRecord is a Seq and all the other information that is associated with that sequence: identifiers, annotataions, features and so on.

It can be confusing working out if a Biopython function uses or returns Seqs or SeqRecords. As a rule of thumb if a function uses or produces anything more than the raw sequence data, or retrieves the sequence information from a file or database, it uses a SeqRecord. Thus the I/O functions all read and write SeqRecords.

To create a SeqRecord you need a Seq:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> s = Seq("MKQHKAMIVALIVICITAVVAALV", IUPAC.protein)
>>> from Bio.SeqRecord import SeqRecord
>>> sr = SeqRecord (s)

This produce the simplest possible SeqRecord, being a simple sequence with no identifiers or annotations.

The main optional arguments for the SeqRecord constructor are:

  • id (string): the primary identifier for the sequence, usually a database accession number.
  • name (string): the title of the sequence. Often the same as the id, but sometimes a clone name or similar.
  • description (string): free text notes associated with the sequence. Often, additional information associated that is not covered by the other fields is included in ther description is a structured format.

These also are the names of respective object members the arguments are assigned to:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqRecord import SeqRecord
>>> sr = SeqRecord (Seq("MKQHKAMIVALIVI", IUPAC.protein), id="A73456",
name="E. coli phosphatase")
>>> sr.id
'A73456'
>>> sr.name
'E. coli phosphatase'
>>> sr.description
'<unknown description>'

If no value is given for the id, name and description of a SeqFeature, those members default to the warning message ‘<unknown FOO>’. This can create problems if you try to test whether the member is set or equal to another if you assume that an unset field is an empty string or false:

for s in seqs:
if (not s.id):
   # this doesn't work
   print "%s has no id"

By wide convention in metadata, identifiers or IDs are required and unique, at least locally, while names or titles are optional and duplicates. In practice, Biopython is unlikely to malfunction or care about either. However duplicate or missing IDs or names may cause dependent programs to complain or crash and create confusion for anyone reading the output. So it’s advisable to use both and make them unique.

There are three additional SeqRecord optional arguments (and members) for SeqRecord:

dbxrefs (list): This is simply a list of reference or accession numbers to the sequence in other databases (i.e. not the one that you’ve retreived the sequence from). It is often empty, but when present the entries are simply strings:

>>> sr.dbxrefs
["ASAP:13298", "GI:16131885", "GeneID:948570"]

The database references are often of the form database_name:id_in_database and manipulating the database references is as list as changing the strings in the dbxrefs list:

>>> sr.dbxrefs.append ('PrmDb:12345')

annotations (dictionary): Additional information about the sequence not associated with or linked to a location in the sequence as key-values where the key is the type of information. This tends to be more unstructured information like the organism and taxonomy involved, the date of modification, etc.:

>>> sr.annotations
{'accessions': ['U49845'],
'data_file_division': 'PLN',
'date': '21-JUN-1999',
'gi': '1293613',
'keywords': [''],
'organism': 'Saccharomyces cerevisiae',
'references': [Reference(title='Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae', ...),
Reference(title='Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein', ...),
Reference(title='Direct Submission', ...)],
'sequence_version': 1,
'source': "Saccharomyces cerevisiae (baker's yeast)",
'taxonomy': ['Eukaryota',
'Fungi',
'Ascomycota',
'Saccharomycotina',
'Saccharomycetes',
'Saccharomycetales',
'Saccharomycetaceae',
'Saccharomyces']}

Note that the keys are almost always lowercase_separated_by_underscores. There’s little agreement about standard key names, although the records you get from one database or source will tend to agree with each other. You’ll also notice there’s a great deal of heterogeneity in what’s used for an annotation value. Some are plain text (‘organism’), others are a list of strings (‘taxonomy’, ‘keywords’) and some use special Biopython classes to represent complex information (‘references’). This means that accessing or modifying the annotations can be easy or complicated:

>>> if 'Saccharomycetales' in sr.annotations['taxonomy']:
   print 'correct taxa'
correct taxa

# add a reference to the annotations
>>> new_ref = Reference (title="The 5' end of the ..." ...)
>>> sr.annotations.references.append (new_ref)

# find out what the date is, where the format may not be consistent
>>> import datetime
>>> try:
   mod_date = datetime.strptime(sr.annotations['date'], '%d-%b-%Y')
except ValueError, err:
   # if date in wrong format
   ...

A consequence of storing annotations in a dictionary is that there cannot be two annotations of the same type. This is generally true but but occasionally files will occur that do have this problem.

features (list): A list of SeqFeatures for data that is associated with specific location in the sequence. This is a more complicated tiopic that we’ll discuss in the next installment.

(In more recent version of Biopython, SeqRecords have an additional member, letter_annotations. This is a dictionary of properties for every location along the sequence, and usually used quality scores (e.g. Section 16.1.5) or secondary structure information (e.g. from Stockholm/PFAM alignment files).)

Previously, I noted how the right way to do sequence IO was with Bio.SeqIO. There are two other methods that can be useful for quickly dumping a SeqRecord to the screen or a file. ‘print’ will throw a reasonably readable form of the SeqRecord data to the screen:

>>> print sr2
ID: NM_001083539.1
Name: NM_001083539
Description: Homo sapiens killer cell immunoglobulin-like receptor, three domains, short cytoplasmic tail, 1 (KIR3DS1), mRNA.
Number of features: 26
...

And the 'format' method can be used to choose a specific formal format:

>>> print sr_2.format('fasta')
>NM_001083539.1 Homo sapiens killer cell immunoglobulin-like receptor, three domains, short cytoplasmic tail, 1 (KIR3DS1), mRNA.
CCGGCAGCACCATGTTGCTCATGGTCGTCAGCATGGCGTGTGTTGGGTTGTTCTTGGTCC

It’s worth paying attention to how the fields in different file formats map to the SeqRecord object. For example, the VERSION, LOCUS and DEFINITION line in a Genbank file become the id, name and description field of a SeqRecord respectively. When written out to a Fasta file, the id becomes the first entry on the accession line, with the description appended. The name is lost, as are the annotations and features. The moral here is probably to stick to using as few formats as possible and ones that are complex enough to support the data you are interested in.

Back in the previous installment, I also noted that Seqs were basically strings and could do just about anything strings could:

>>> s = Seq ('ACGTACGT')
>>> 'ACG' in s
True
>>> s[0]
'A'
>>> s[:5]
Seq('ACGTA', Alphabet())

The same holds for SeqRecords in many regards. They can be sliced, indexed and queried:

>>> sr = SeqRecord (s)
>>> sr
SeqRecord(seq=Seq('ACGTACGT', Alphabet()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])
>>> 'ACG' in sr
True
>>> sr[0]
'A'
>>> sr[:5]
SeqRecord(seq=Seq('ACGTA', Alphabet()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])

SeqRecords can even be concatenated:

>>> sr[:3] + sr[5:]
SeqRecord(seq=Seq('ACGCGT', Alphabet()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])