Thursday, 7 July 2011

Things I would tell a budding bioinformatician to learn.

I recently read Ewan Birney's blog post, which echoed a lot of my own thoughts about the use of statistics in computational biology. I thought I would compile my own similar list, but for bioinformatics / computational biology in general. I have not been in the field as long as Ewan and I certainly still have a lot to learn, particularly about statistics given my biological background, but I have learnt some things over the last ten years that, like Ewan, I wish someone had told me long ago. The points are in no particular order.

A little introduction:
My first degree was in Genetics, which was very molecular with very little statistics, apart from an impenetrable module taught by the maths department, who could barely contain their disdain for biologists. It was only when I started my Ph.D. in the evolution of regulatory sequences that I started to get more interested in bioinformatics and statistics. A lot of my research was into the population genetics of regulatory sequences, looking at within and between species variation. I decided that I wanted to simulate some evolutionary processes to test some hypotheses. Luckily someone suggested that I learn Perl. So I did. Which brings me to point one.

1) Learn a scripting language.
I really don't think it matters which one and I will not get into the pros and cons here. A scripting language such as Perl or Python will serve you well in numerous tasks that would take hours or even be impossible with pointing and clicking in Excel or dragging and dropping files around. Perl probably wasn't the best choice for writing simulations, but it worked and it was easy to learn. Within a week or two of sitting down with the Llama book and then the Camel book (the standard books from O'Reilly) I could write code to do what I wanted. Looking back it was hideous, ugly and inefficient code, but it worked and probably did the same as beautiful, elegant and efficient code would do, just slower. I had multiple versions of the scripts that did slightly different things, on different machines, some with bugs fixed, some not. I sometimes accidentally broke my code and spent hours trying to get it back to how it was before. Which brings me onto point number two, version control.
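As a rough illustration of the kind of small task a scripting language handles well, here is a minimal sketch in Python that reads FASTA-formatted sequences and reports the length and GC content of each. The function names (`read_fasta`, `gc_content`) and the example records are made up for this sketch and are not from the scripts described above.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up example records; in practice you would pass an opened file instead.
EXAMPLE_FASTA = """>seq1 hypothetical promoter
ATGCGC
GGTA
>seq2
ATATAT
"""

if __name__ == "__main__":
    for name, seq in read_fasta(StringIO(EXAMPLE_FASTA)):
        print(f"{name}\t{len(seq)}\t{gc_content(seq):.2f}")
```

The same few lines work unchanged on a file with thousands of sequences, which is exactly where pointing and clicking stops being an option.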

2) Learn a Version Control system.
I got into version control quite late, so I wish someone had impressed the value of it upon me earlier. Again, it doesn't matter which one you use. I use subversion (SVN) and git, but the finer points of each tool are well described elsewhere. It basically allows you to keep a history of your work and revert to any previous version. You can also branch a project, so maybe you need to make some major changes for a certain task. You could copy your script with a new name and constantly forget which versions have had bugs fixed and which have not, or you could create a branch in subversion and merge in any bug fixes. The other benefits are that, if you use a remote server, it also acts as a backup and allows you to work on the same script on multiple machines, say at work and at home. The learning curve is minimal, perhaps an hour to learn the basics and a week of regular use to commit the commands to memory and have it as part of your routine. You can also use Dropbox as an easy version control system, but it doesn't have the features of a real version control system.

3) Learn R.
I initially started to learn R as I was drawn by its plotting functions. I was using Excel or some other tools to generate images and was frustrated by all the clicking and messing about to get a plot that looked like crap anyway, never mind when the data set changed or was expanded and I had to start again. To begin with I was doing analysis in Perl, outputting to text files which I would process in Excel, save to csv, then load into R for plotting. Slowly I got used to R and it began to take over more and more of the things I had done in Excel, then more and more of the things I had done in Perl. Now I live in R. I would never tell anyone R is easy; it is not. It is a pain and even now catches me out occasionally with some of its idiosyncrasies. Still, I love R. Once you get into its way of thinking it is fast, elegant and so empowering. I wish someone had told me long ago to make some time for it and learn R. I never make a figure in anything else, which brings me on to point four.

4) You can't beat a good figure. 
A lot of bioinformatics is about finding patterns, separating the signal from the noise. Our brains are pretty good at this, too good really, as we can often see things that are not really there, like the face of Jesus on a piece of toast. However, on the whole our eyes are far better at finding patterns than any number of fancy algorithms. The key to this is visualization. In a recent post, Jan Aerts says "statistics is about proving what you expect, while visualization is about discovering what you didn't expect". I completely agree with this; visualization for me is about exploring data, looking for patterns to suggest the next analysis. The key tool for visualization for me is R and particularly the ggplot2 package, which provides a very flexible and intuitive way of interacting with data. But the central piece of advice should be obvious: look at your data, only then will you see things that you were not expecting to find. The other benefit of good visualization is in communicating your ideas and discoveries with others, something which is also aided by the next point.

5) Learn LaTeX (and Sweave).
This point is probably more controversial, as lots of people get along fine without LaTeX. I started to learn LaTeX as part of learning Sweave, which is a combination of R and LaTeX. Sweave enables you to combine code, documentation, analysis and visualizations in one place. The power of this is in reproducible research, which enables someone else (or a future you) to repeat exactly what you did and get the same results. I also find the ability to concentrate on content and to be able to write documents and presentations on any machine, with nothing more than a text editor, very liberating. I now make all my presentations in beamer (the LaTeX equivalent of Powerpoint) and use LaTeX for most documents, even if not using Sweave. I wish someone had told me about LaTeX while I was writing up my Ph.D. as it would have made my life so much easier and my Ph.D. much prettier too. Because LaTeX is text based it also works very well with version control systems and Dropbox. I can be editing a document at work, go home, then open my laptop and continue editing from where I was up to. It is essentially a language of its own, so it does take a while to get used to and be able to write without constantly referring to a tutorial. There are some great editors available, my favorite being TeXShop on Mac, although I mainly use Emacs now, but that is a topic for another post.

6) Know enough statistics to know what you do not know.  
Statistics can seem a complicated business, but it isn't a black box of voodoo magic. Most of the statistical methods I use are to determine how likely it is that some feature of my data could be just due to chance, or to see if two things really are different, or if there is a good chance they are actually just two different samples from the same source. In my humble experience you can normally get a feel for the significance of data from some visualizations, and statistics are a way of formalizing those observations. For many genomics problems a simple permutation test can be very informative. You just need to think about your data and the questions you want to ask. I use binomial and hypergeometric tests quite a lot, along with Kolmogorov–Smirnov and the occasional linear model. I think it is really important to have a good understanding of probability, distributions and variance, and to understand the assumptions behind various tests, such as normality. But importantly I talk to statisticians as much as I can and know the limits of my little tool box of tests. It is far too easy to get carried away by a Google search and R and end up with something that looks fancy, but that you don't understand and could well be invalid for your data. It is even easier to get the tails on your tests wrong and get a false positive/negative result. So learn as much as you can about statistics by reading and using it, but ask once you feel you are getting out of your depth. Finally, if something doesn't look significant when you plot it, it probably isn't.
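To make the permutation test idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the question is whether two groups differ in their mean, and it is two-sided (it counts shuffles that are at least as extreme in either direction). The function name `permutation_test`, its parameters and the example measurements are all made up for illustration, so treat it as a sketch rather than a recommended implementation.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the fraction of label shufflings whose
    absolute difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is the same as shuffling group labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up measurements for two hypothetical sets of sequences.
    set_one = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5]
    set_two = [4.1, 4.3, 3.9, 4.6, 4.2, 4.4]
    print(permutation_test(set_one, set_two))
```

The appeal is that the only assumption is that the labels are exchangeable under the null hypothesis, so there is no normality to worry about. Whether a one-sided or two-sided version is appropriate still depends on the question being asked.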

7) Learn to work at the command line. 
This may well be the most important of the points here. It is certainly the 'gateway' skill. Almost all of the other points require you to be working at the command line (in a terminal). This is often one of the most off-putting things to people from non-computational backgrounds, and quite rightly. There is no help menu, or indeed any menu, to click around and look for things that look like what you want. You need to know what you are doing. But you can learn, and in a very short space of time you can become amazingly productive. There are loads of books and online tutorials available and it isn't really that hard. cd to change directory, ls to list, mkdir to make a directory, cp to copy etc. Not exactly cryptic. Admittedly it gets more complex as you get more productive, but for loops, grep, xargs etc. are not too difficult really. If I could learn it, anyone can. Once you have, it opens up a whole world of working.

I hope this post made some sense; it was useful for me to think about some of the tools I use and to weigh up the time it takes to learn something new compared to the time it will save. I also found it interesting that the things here are not specific to bioinformatics. The field changes, technologies come and go, but people will always need to know how to manipulate, visualize and analyse data.


  1. This post should be the basis for reshaping of university curricula. I would add only one point: 8) learn the basics of relational databases

  2. I absolutely agree about relational databases and it is one of my secret shames that I am not more fluent in MySQL or similar. I suspect, if I rewrote this post in a couple of years, I would add that as another point and wonder how I managed for so long without more expertise in this area.

  3. Thanks for expressing this so well; I've already passed it along. I'd only suggest two ideas. First, I've found SQLite a useful DB that's easy to work with because there's no admin work to speak of. Second, J is the scripting tool that I use pretty much all the time I'm not working in R.

  4. Much agreed. One more thing to add to the list: learn how to test code that you write. (R has RUnit and testthat.)

    Also, have you come across Software Carpentry? It's my default place to point people at for basic scientific computing skills.

  5. This is very similar in spirit to what I wrote in a PLoS Computational Biology article back in 2009.

    These points can't be made enough. Great post!

  6. Good post, I agree with your list but, like the Marcin comment above, would add one: databases. Every bix project I've ever worked on has used a relational database, so a good understanding of the relational model and SQL is essential in my view. The core for my work is a good scripting language, a relational database and a good analysis package like R. Judicious use of all three as appropriate will serve most purposes.

  7. Thanks for the comment Mick, again I agree that for many people a relational database would be on this list. Just for me, although I use them, they are not as essential as the other points. I might make an edit and add SQL to make the post more relevant to a wider audience though, thanks for the comments.

  8. Hi Stew, yes, it depends on your type of work. SQLite is a great little tool, very easy to learn, and plays very nicely with Perl, Python, and shell.

  9. I love the post. Thanks a lot!

  10. Hi Stew,
    Just to say that your website is really really nice and your posts are always excellent. Some tips and commands you post are "saving my neck". Please, keep doing this very good job!