"Checking gene names & identities in Biomart"

I need to check some gene names and get the sequences for them. Ordinarily, I would prefer to use Python for this sort of thing. Unfortunately, the Python interfaces to Biomart appear to have been written by insane people, so R it is.

What is Biomart? What is a biomart? A biomart is a way of providing remote access to large genomic datasets and associated information, and Biomart is one of the popular instances of it:

  • Biomart: http://central.biomart.org
  • Cosmic http://www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview
  • Ensembl http://www.ensembl.org/biomart/martview
  • Intogen http://biomart.intogen.org/

Under the hood, individual biomart servers may connect to a whole bunch of other servers, but this is invisible to users. Different servers may also specialise in different datasets, so you will need to discover and select the service you are using. You can use this data to answer queries like:

  • "Give me all the genes for this species"
  • "What's the name for this gene in this other nomenclature"
  • "Show me all the genes associated with this region"

There is a web interface to every biomart ("MartView"), but the winning feature of marts is programmatic access over the web, which is what I'll be demonstrating here.

First load the `biomaRt library and see what marts are available at the default address:

library ('biomaRt')
head (listMarts())
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 90
## 2   ENSEMBL_MART_MOUSE      Mouse strains 90
## 3     ENSEMBL_MART_SNP  Ensembl Variation 90
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 90

A little bit of nomenclature:

  • A mart is a standalone service, so a biomart server may have several marts.
  • Each mart may contain a number of datasets that can be queried
  • Each datasets consists of records with attributes (columns)
  • A record will have a value for each attribute
  • A query can use a set of filters to control and narrow its search

So, lets select Ensembl and see what datasets it has:

ensembl <- useMart('ensembl')
head (listDatasets (ensembl))
##                     dataset                   description    version
## 1  cfamiliaris_gene_ensembl         Dog genes (CanFam3.1)  CanFam3.1
## 2     mmulatta_gene_ensembl    Macaque genes (Mmul_8.0.1) Mmul_8.0.1
## 3  rnorvegicus_gene_ensembl          Rat genes (Rnor_6.0)   Rnor_6.0
## 4 ptroglodytes_gene_ensembl Chimpanzee genes (CHIMP2.1.4) CHIMP2.1.4
## 5       oaries_gene_ensembl        Sheep genes (Oar_v3.1)   Oar_v3.1
## 6  nleucogenys_gene_ensembl        Gibbon genes (Nleu1.0)    Nleu1.0

Select the human dataset. Alternatively, you can select the mart and dataset in one go:

mart <- useDataset("hsapiens_gene_ensembl", mart=ensembl)
mart <- useMart (biomart="ensembl", dataset="hsapiens_gene_ensembl")

You can see what sort of fields are available to filter upon:

filters = listFilters (mart)
head (filters)
##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start
attribs = listAttributes (mart)
head (attribs)
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

Queries are built with the getBM call. For example, to retreive genes by their gene symbol:

results <- getBM (attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="hgnc_symbol", values=c("CPVL", "PYGL"), mart=mart)
results
##   ensembl_gene_id hgnc_symbol
## 1 ENSG00000106066        CPVL
## 2 ENSG00000100504        PYGL

To put this in plain English:

  • using the mart we selected earlier
  • return the ensembl_gene_id and hgnc_symbol columns for every record
  • where the hgnc_symbol column must be CPVL or PYGL.

So attributes are what we want to know and filters are how we restrict our search to a set of values.

Note that if the gene symbol does not exist in the set, the results will not include a result for it:

results <- getBM (attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="hgnc_symbol", values=c("CPVL", "BOGUS"), mart=mart)
results
##   ensembl_gene_id hgnc_symbol
## 1 ENSG00000106066        CPVL

A query that has no results will return an empty frame:

results <- getBM (attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="hgnc_symbol", values=c("BOGUS"), mart=mart)
results
## [1] ensembl_gene_id hgnc_symbol    
## <0 rows> (or 0-length row.names)

Wonderful. Now let's do something real. Given a list of obselete or synonym gene names, extract the current and official ones. First, I'll connect to a different biomart in a slightly different way:

library ('biomaRt')
listMarts (host='www.genenames.org')
## Error in listMarts(host = "www.genenames.org"): Unexpected format to the list of available marts.
## Please check the following URL manually, and try ?listMarts for advice.
## http://www.genenames.org:80/biomart/martservice?type=registry&requestid=biomaRt
mart <- useMart (host='www.genenames.org', biomart='g4public')
## Error in listMarts(host = host, path = path, port = port, includeHosts = TRUE, : Unexpected format to the list of available marts.
## Please check the following URL manually, and try ?listMarts for advice.
## http://www.genenames.org:80/biomart/martservice?type=registry&requestid=biomaRt
head (listDatasets (mart))
##                     dataset                   description    version
## 1  cfamiliaris_gene_ensembl         Dog genes (CanFam3.1)  CanFam3.1
## 2     mmulatta_gene_ensembl    Macaque genes (Mmul_8.0.1) Mmul_8.0.1
## 3  rnorvegicus_gene_ensembl          Rat genes (Rnor_6.0)   Rnor_6.0
## 4 ptroglodytes_gene_ensembl Chimpanzee genes (CHIMP2.1.4) CHIMP2.1.4
## 5       oaries_gene_ensembl        Sheep genes (Oar_v3.1)   Oar_v3.1
## 6  nleucogenys_gene_ensembl        Gibbon genes (Nleu1.0)    Nleu1.0
mart <- useMart (host='www.genenames.org', biomart='g4public', dataset='hgnc')
## Error in listMarts(host = host, path = path, port = port, includeHosts = TRUE, : Unexpected format to the list of available marts.
## Please check the following URL manually, and try ?listMarts for advice.
## http://www.genenames.org:80/biomart/martservice?type=registry&requestid=biomaRt

I chose this 'mart because I wanted to query names. Also the default central mart was flaky the day I did this, which seems to be a common-enough occurence.

Next find the allowed attributes and filters:

head (listFilters (mart))
##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start
head (listAttributes (mart))
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

Load or details the names to query on:

# the first two are synonyms, the third and the last a primary and current identifier
query_symbols <- c ('BHLHB2', 'BRP44L', 'BOGUS', 'CPVL')

Then build and submit the query. Note that symbols that are correct (i.e. current and the primary symbol) and those that are unrecognised do not return any results:

attributes <- c(
  "gd_app_sym", "gd_app_name", 
  "gd_prev_sym",
  'md_ensembl_id',
  'f_gd_prev_sym'
)
filters <- 'f_gd_prev_sym' # or 'f_gd_aliases'
values <- query_symbols

results <- getBM (attributes=attributes, filters=filters, values=values, mart=mart)
## Error in getBM(attributes = attributes, filters = filters, values = values, : Invalid attribute(s): gd_app_sym, gd_app_name, gd_prev_sym, md_ensembl_id, f_gd_prev_sym 
## Please use the function 'listAttributes' to get valid attribute names
head (results)
## [1] ensembl_gene_id hgnc_symbol    
## <0 rows> (or 0-length row.names)

A more complicated example: get upstream, downstream and coding regions for a set of genes:

library(tools)
library(biomaRt)

# initialize mart:
ensembl <- useMart('ensembl') 
mart <- useDataset("hsapiens_gene_ensembl",mart=ensembl)

# get the central sequence
# note that seqinr masks the biomart function, lovely
pep_seqs <- biomaRt::getSequence (id=as.vector (query_symbols),
 type="hgnc_symbol", seqType="peptide", mart = mart)
pep_seqs <- pep_seqs[!pep_seqs$peptide=='Sequence unavailable',]
pep_seqs <- pep_seqs[!duplicated(pep_seqs$hgnc_symbol),]

# get the upstream sequence
upstr_seqs <- biomaRt::getSequence (id=as.vector (query_symbols),
 type="hgnc_symbol", seqType="coding_gene_flank", upstream=5000, mart = mart)
upstr_seqs <- upstr_seqs[!upstr_seqs$coding_gene_flank=='Sequence unavailable',]
upstr_seqs <- upstr_seqs[!duplicated(upstr_seqs$hgnc_symbol),]

# get the downstream sequence
downstr_seqs <- biomaRt::getSequence (id=as.vector (query_symbols),
 type="hgnc_symbol", seqType="coding_gene_flank", downstream=5000, mart=mart)
downstr_seqs <- downstr_seqs[!downstr_seqs$coding_gene_flank=='Sequence unavailable',]
downstr_seqs <- downstr_seqs[!duplicated(downstr_seqs$hgnc_symbol),]

Some notes:

  • Biomart will usually return multiple sequences for each gene - because there are multiple sequences in the database. Some are partial, some are from abberant chromosomes, etc. That's why I deduplicate above and why you will have to filter stuff out according to your own preferences.
  • Remember, if there's no hit, nothing will be returned.
  • For reasons unclear to me, some were returned with 'Sequence unavailable'.

And here's an absolute hack. As said, sometimes Biomart can be flaky and I had a huge problem trying to download a set of several hundred sequences when the service would barf once every 100 or so. So I wrote a fallback function to retry any failed queries:

# "get_seq_fxn" does the biomart fetching and
# returns the results it gets

repeat {
   message ("trying ", ids[1])
   results <- try (get_seq_fxn())
   if (class (results) == "data.frame") return (results)
   print ("errored out")
   Sys.sleep (0.1)
}

References