Thursday, 16 August 2012

Randomize lines in two files, keeping relative order


I recently wanted to randomize the lines in two files while keeping the lines paired up between the files (so line N of file1 still corresponds to line N of file2). So that I can remember how to do this next time, I will post it here.
# one random sort key per line of file1
while read -r line;do echo $RANDOM;done <file1.txt >randomOrder.txt
# prepend the keys to both files and sort on the key
paste randomOrder.txt file1.txt file2.txt |sort -k1n >sorted.txt
# split the shuffled columns back out into the original files
cut -f 2 sorted.txt >file1.txt
cut -f 3 sorted.txt >file2.txt
rm -f sorted.txt
rm -f randomOrder.txt

Thursday, 28 July 2011

Perl One Liner: Delete files with wrong number of lines

Just a quick perl one-liner for future reference. I needed to delete some text files that didn't have the correct number of lines, as they would break a downstream R script that parses the results.
for f in *.txt;do perl -ne 'END{unlink $ARGV unless $.==200}' ${f} ;done
I always forget that $ARGV is the variable for the input file name in a one-liner.

Thursday, 7 July 2011

Things I would tell a budding bioinformatician to learn.

I recently read Ewan Birney's blog post, which I found echoed a lot of my own thoughts about the use of statistics in computational biology. I thought I would compile my own similar list, but for bioinformatics / computational biology in general. I have not been in the field as long as Ewan, and I certainly still have a lot to learn, particularly about statistics due to my biological background, but I have learnt some things over the last ten years that, like Ewan, I wish someone had told me long ago. The points are in no particular order.

Thursday, 16 December 2010

LSF: Using job arrays



Our cluster uses the LSF job scheduler. One feature that I find useful is the ability to create job arrays. These are similar jobs that differ in just one respect, such as the input file or a parameter, or they could be identical, such as for simulations or modelling. The main benefit of job arrays, at least for me, is the ability to control the number of jobs running, and to change it on the fly. For example, I might need to run 500 jobs, but I can only run 240 jobs at any one time on my group's queue on the cluster, and filling it would stop others in my group from getting anything done. So I can use a job array to submit 500 jobs but only allow 100 to run at any time; when one finishes, another starts, until they are all finished.


Another benefit of job arrays is that they are submitted as a single job, so a job array with a thousand parts is submitted instantly, whereas submitting a thousand separate jobs would take a long time.


Below is an example bash script that does a distributed sort. It is designed to show how to use job arrays and dependencies, not necessarily how best to do sorting.

### Generate a random big file that we want to sort, 10 Million lines
perl -e 'for (1..1E7){printf("%.0f\n",rand()*1E7)};' > bigFile
### Split the file up into chunks with 10,000 lines in each chunk
split -a 3 -d -l 10000 bigFile split
### rename the chunks from split000-split999 to Split1-Split1000 (a 1-1000 scheme, not 0-999)
for f in split*;do mv ${f} $(echo ${f} |perl -ne 'm/split(0*)(\d+)/g;print "Split",$2+1,"\n";');done
### submit a job array, allowing 50 jobs to run at any one time
ID=$(bsub -J "sort[1-1000]%50" "sort -n Split\$LSB_JOBINDEX >Split\$LSB_JOBINDEX.sorted" |perl -ne 'm/<(\d+)>/;print "$1"')
### merge the sorted files together once all the jobs are finished, using the -w dependency
ID2=$(bsub -w "done($ID)" "sort -n -m *.sorted >bigFile.sorted" |perl -ne 'm/<(\d+)>/;print "$1"')
### Delete the temp files, waits for the merge to finish first
bsub -w "done($ID2)" "rm -f Split*"

The main point is that the jobs differ only in the value passed to them through the $LSB_JOBINDEX environment variable. Each job gets a different value, taken from the range specified in the square brackets, [1-1000] in this case. There is also notation for stepping through indices, such as 10,20,30, and you can also just specify a list of numbers, such as 1,5,10,22,999.
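As a quick sketch of those notations (the job name and command here are just placeholders; check the bsub documentation on your system for the exact syntax):

### run indices 10,20,30,...,100 using start-end:step notation
bsub -J "example[10-100:10]" "echo chunk \$LSB_JOBINDEX"
### run an explicit list of indices
bsub -J "example[1,5,10,22,999]" "echo chunk \$LSB_JOBINDEX"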


The hard part is mapping this simple number to something useful for your task. In this case it was easy, as I used split to name the files with sequential numbers, but perhaps you have 500 data-sets you want to perform the same analysis on. In that case you can either rename the data-sets with a sequential naming scheme, or use a lookup table to associate each input file with a number, and have your analysis script convert the number from $LSB_JOBINDEX into an input filename or parameter.
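For example, a rough sketch of the lookup-table approach might look like this (inputs.txt, the data file names and run_analysis.sh are all made-up names):

### build a lookup table with one input file per line; line N corresponds to array index N
ls -1 dataset*.txt > inputs.txt
### each job pulls out its own line with sed and passes it to a (hypothetical) analysis script
bsub -J "analysis[1-500]%50" 'INPUT=$(sed -n "${LSB_JOBINDEX}p" inputs.txt); ./run_analysis.sh "$INPUT"'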
The key point in the distributed-sort example is the %50 notation, which chooses how many jobs are allowed to run at any one time. This can be changed with bmod, for example:


bmod -J "%100" JOBID

This would now allow 100 jobs to run simultaneously, rather than 50. Notice also the use of the perl one-liner (I am sure awk would work too) to capture the job ID and store it ready to use as a dependency for the next step. This is another benefit of the job array: there is just one job ID, which makes modifying and killing jobs much easier.

You can monitor the status of job arrays with the -A flag to bjobs (bjobs -A), which will show you how many jobs are pending, running, done, exited, etc.

If you want to check the progress of a particular job you can do a bpeek using its job ID and array index, e.g. bpeek 1234542[101]. The same notation works for bkill and bjobs.

Friday, 10 December 2010

R: Basic R Skills - Splitting and Plotting

I am giving a short R course next year, so I am going to make a series of blog posts to help get my thoughts and example code in order. The aim is to introduce people with little or no experience of R to the language with self-contained examples. The order of the posts is not going to reflect any order in the course, just what I feel like doing at the time.

This first post is going to deal with splitting and plotting data. It is common to have data in a form where you want to split the values in one column based on the values in another column. Maybe you want to split an experimental result by age or gender, for example, to see whether there is a difference in the distribution of results between males and females. The example code below goes through one such hypothetical example.
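A minimal sketch of that kind of example, using simulated data and made-up variable names (not necessarily the exact code behind the figure):

## simulated data: one measurement and one gender label per subject
set.seed(1)
result <- c(rnorm(50, mean = 10), rnorm(50, mean = 12))
gender <- rep(c("Male", "Female"), each = 50)

## split the results by gender and compare the two distributions
byGender <- split(result, gender)
plot(density(byGender$Male), main = "Result split by gender")
lines(density(byGender$Female), lty = 2)
legend("topright", legend = names(byGender), lty = c(2, 1))

## the same grouping expressed with the formula (~) notation
boxplot(result ~ factor(gender))
t.test(result ~ factor(gender))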



The figure shows the output you should get from running the code. Essentially the example is designed to illustrate the split function and the ~ (tilde) character. 


The split function will do what it says: split a vector of data (A) based on another vector (B). It returns a list, with each element of the list containing all of the elements in A that correspond to one of the unique values in B. For example:



A <- c(1,2,3,4)
B <- c("X","Y","X","Y")
sp <- split(A,B)
sp
$X
[1] 1 3
$Y
[1] 2 4
Now we have a list, and we can operate on each element of the list using the apply functions, such as lapply:

lapply(sp,sum)
$X
[1] 4
$Y
[1] 6

There are lots of different apply functions; a good introduction is here.
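For example, sapply works like lapply but simplifies the result to a named vector (continuing the sp example from above):

sapply(sp, sum)
X Y 
4 6 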

The other main way of splitting is using the ~ (tilde) operator. In my head I always read this as 'given', so plot(A ~ B) is "plot A given B". This is an example of the formula notation in R, but here we are using it very simply. It essentially does the same thing as split.

Note: You actually need to do plot(A ~ factor(B)) if B isn't already a factor.

Lots of functions support this formula notation, such as t.test in the example; for others you can use the lapply and split version, as with density in the example.
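As a small self-contained illustration of the two approaches (the data here are made up):

score <- c(5, 7, 6, 9, 8, 10)
group <- c("A", "A", "A", "B", "B", "B")
## formula notation: "score given group"
t.test(score ~ factor(group))
## the equivalent split-and-apply approach
lapply(split(score, group), density)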

I also mention the aggregate function, which essentially does the same thing as lapply and split, but seems slower on large datasets.
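A quick aggregate example, using the same made-up score and group data as above:

aggregate(score ~ group, data = data.frame(score, group), FUN = mean)
  group score
1     A     6
2     B     9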

Wednesday, 8 December 2010

R: Using RColorBrewer to colour your figures in R

RColorBrewer is an R package that uses the work from http://colorbrewer2.org/ to help you choose sensible colour schemes for figures in R. For example, if you are making a boxplot with eight boxes, or drawing six lines on an x-y plot, what colours would you use so that they are easy to distinguish and easy to look up on a key? RColorBrewer helps you to do this.

Below is some example R code that generates a few plots, coloured by RColorBrewer.
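A rough sketch of the sort of plots it can colour (simulated data; the palette names are just examples):

library(RColorBrewer)

## a boxplot with eight boxes, one colour per box from the "Set2" palette
groups <- rep(LETTERS[1:8], each = 20)
values <- rnorm(160, mean = rep(1:8, each = 20))
boxplot(values ~ factor(groups), col = brewer.pal(8, "Set2"))

## six lines on an x-y plot, coloured from the "Dark2" palette
cols <- brewer.pal(6, "Dark2")
matplot(matrix(cumsum(rnorm(600)), ncol = 6), type = "l", lty = 1, col = cols,
        xlab = "x", ylab = "y")
legend("topleft", legend = paste("line", 1:6), col = cols, lty = 1)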



The colours are split into three groups: sequential, diverging, and qualitative.


  1. Sequential - Light colours for low data, dark for high data
  2. Diverging -  Light colours for mid-range data, low and high contrasting dark colours
  3. Qualitative - Colours designed to give maximum visual difference between classes
The main function is brewer.pal, to which you simply give the number of colours you want and the name of the palette; you can see the available palettes by running display.brewer.all().

There are limits on the number of colours you can get, but if you want to extend the Sequential or Diverging palettes you can do so with the colorRampPalette command, for example:

colorRampPalette(brewer.pal(9,"Blues"))(100)

This will generate 100 colours based on the 9 from the 'Blues' palette. See image below for a contrast.




Thursday, 9 September 2010

Quick Tip: PUT - scp files from remote machine to local machine

It is often desirable to have a quick look at a file from a remote machine, such as an HPC cluster, on your desktop machine. I use MacFUSE and SSHFS for this, but sometimes I like to move a file without having to find my current location in a Finder window. I have written a quick wrapper around scp, with a line to determine the IP address of my local machine, stored as an environment variable.

Add the following lines to your .bashrc on the remote machine.


export DESKTOP=$(last -100 |grep $(whoami) |head -n 1 |perl -ane 'print $F[2]')
put(){
if [ -z "$1" ]; then
    echo "put - Sends specified files to Desktop of local machine";
    echo "usage: put filesToSend";
else
    find "$@" |xargs -I % scp % $DESKTOP:~/Desktop
fi
}

Then reload the file with source .bashrc. You should now be able to type put fileName, and the file will appear on your desktop, as long as you have your RSA keys set up; otherwise it will ask for a password. It works for multiple files too, such as *.png.

I find it useful for quickly viewing pdfs and images.