Consensus in BioRuby

Explaining the ill-explained ways to obtain a consensus sequence in BioRuby.

In BioRuby, alignments are equipped with several methods for obtaining consensus sequences. Unfortunately, these have terse descriptions which point you at the BioPerl documentation, with the added bonus of not quite working like the BioPerl equivalents.

First, let's create a very simple alignment, where everything agrees except the last sequence which leads with a differing character and ends with a gap:

>> require 'bio'
=> true
>> aln =['acgt', 'acgt','acgt', 'ccg-'])
=> #"acgt", 1=>"acgt", 2=>"acgt", 3=>"ccg-"}, serial3

First consensus_iupac produces a "true" consensus sequence across all members. If sequences differ, the consensus sequence has an ambiguous character that sums these differences:

>> aln.consensus_iupac() => "mcg?"

Note the first position is m (a or c), but where gaps exist, like the final position, the final character is marked gnomically as ?.

An extra complication is the handling of gaps. The option :gap_mode takes the values 0, 1 or -1, which correspond to treating gaps like a character (the default), any gaps at that site appearing in the output (i.e. effectively stripping no-gap characters) and stripping all gaps before calculating the consensus:

>> aln.consensus_iupac(:gap_mode=>0)
=> "mcg?"
>> aln.consensus_iupac(:gap_mode=>1)
=> "mcg-"
>> aln.consensus_iupac(:gap_mode=>-1)
=> "mcgt"

consensus_string uses a threshold to calculate the consensus character at that site: if the most frequent residue meets the threshold, it is selected. Gap characters are treated as above:

>> aln.consensus_string(threshold=0.5)
=> "acgt"
>> aln.consensus_string(threshold=0.5, :gap_mode=>0)
=> "acgt"
>> aln.consensus_string(threshold=0.5, :gap_mode=>1)
=> "acg-"
>> aln.consensus_string(threshold=0.5, :gap_mode=>-1)
=> "acgt"