Monday, 27 July 2009


I use R and Bioconductor for most of my work, and I am increasingly replacing things I would previously have done in Perl with R. One example of this is the Bioconductor package biomaRt.

As the name suggests, it gives you access to BioMart from R. BioMart is a system for querying large online databases such as Ensembl. For example, you may want to convert gene IDs from Entrez IDs to gene symbols, or retrieve the 5 kb upstream of the transcription start site for a list of genes. There are lots of things you can do with it.

biomaRt lets you do all this from within R. This is particularly appealing to me because I do differential gene expression analysis in R, so my gene lists are already in R objects and I can retrieve lots of information about them directly. Maybe I want all the GO annotations for a gene list, or a list of any SNPs within the coding regions.

Anyway, it is pretty useful, and the documentation isn't bad either.

To give a brief example of how it works:

    library(biomaRt)
    library(xtable)

    # Entrez gene IDs to look up
    ids <- c("7157", "3845")

    # Connect to the Ensembl mart and select the human gene dataset
    ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

    # Retrieve 5 kb of sequence upstream of each transcript and save it as FASTA
    seqs <- getSequence(id = ids, type = "entrezgene", seqType = "transcript_flank", upstream = 5000, mart = ensembl)
    exportFASTA(sequences = seqs, file = "example.fas")

    # Fetch basic gene information and write it out as an HTML table
    results <- getGene(id = ids, type = "entrezgene", mart = ensembl)
    print(xtable(results), type = "html", file = "Example.html")

This code retrieves the 5 kb upstream of the transcription start sites of the two genes in the 'ids' vector (which could be much longer) and writes the sequences to a FASTA file. It then generates an HTML output file with information about these genes. Simple and effective.

The functions
  • listAttributes(ensembl)
  • listFilters(ensembl)
can be used to list the attributes you can retrieve and the filters you can restrict a query by.
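
Once you know the attribute and filter names, they plug straight into getBM(), biomaRt's general query function. The following is only a minimal sketch of the ID conversion and GO lookup mentioned earlier. The names "entrezgene_id", "hgnc_symbol" and "go_id" are assumptions about what the current Ensembl release calls these fields (older releases used "entrezgene", as in the example above), so check them against listAttributes() and listFilters() first. It also needs a network connection to the Ensembl BioMart server.

```r
library(biomaRt)

# Connect to the Ensembl mart and select the human gene dataset
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Entrez gene IDs to look up
ids <- c("7157", "3845")

# Assumed field name; older Ensembl releases called this "entrezgene".
# Confirm with listFilters(ensembl) and listAttributes(ensembl).
id_field <- "entrezgene_id"

# One row per gene / GO term combination
annot <- getBM(attributes = c(id_field, "hgnc_symbol", "go_id"),
               filters    = id_field,
               values     = ids,
               mart       = ensembl)

head(annot)
```

The result is an ordinary data frame, so it can be merged with a table of differential expression results in the usual way.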

You can also access lots of other databases, not just Ensembl as shown here.
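
To see what is on offer, something along these lines should work. It is a rough sketch that assumes the default BioMart host is reachable; which marts and datasets come back depends on that host and on the release it is serving.

```r
library(biomaRt)

# List the marts (databases) available on the default host
marts <- listMarts()
print(marts)

# Connect to a mart without choosing a dataset, then list its datasets
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
head(datasets)
```

The values in the 'dataset' column are what you pass as the dataset argument of useMart().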

