Friday 12 May 2017

Bioinformatics: The Age of Illumina

Illumina has so dominated Bioinformatics since 2008 that it has started to feel like the default. It seems normal that much of our field is building software and diagnostic assays on top of their short read technology.

Our facility recently got a PacBio, so I started to think about read lengths and what they mean. We had a massive price drop on reads, which moved from 30 to 250 bases in a decade, but that's still relatively short. PacBio reads are expensive, but they're so much longer that they completely resolve problems which previously could only be solved through statistics and money.

With short reads we have uncertainty about genotypes - does someone have a different sequence at this location? At the moment we map short reads against a scaffold genome, then use population frequencies to filter to rare variants. This method is entirely based on the relative cost of read lengths.

We also have uncertainty about isoforms: we count reads across exon junctions then perform complicated statistics. Given long enough reads we could just count them.

So how much are we living in the age of Illumina? It's hard to tell when you're so immersed in it. I feel that while much of the difficulty of the field might disappear were read length to get much longer, the core of it would continue to be useful and necessary.

Core Bioinformatics

Once short read problems are resolved, what use are we?

  • What protein levels are in this cell or patient?
  • What is the genotype of this patient or cell? How does it compare with earlier times, or vs healthy people, or people with the same disease?

Biology Band Names

While sitting in a biology talk, I often hear a phrase and think "That sounds like a great name for a Death Metal band".

Some examples:

  • Death Receptor
  • Death Domain
  • Pro-death Pathway
  • Disruptor Peptide
  • Mutually antagonistic interactions
  • Death inducer obliterator protein 1

Any more?

Monday 18 July 2016

3 simple rules for bioinformatics file formats

1. Don't create a new file format when an existing one would work.
2. Put version information in a header.
3. Structure your data so it's possible to read with a computer.
#1 is to maximise interoperability - every time you create a new file format, you slow someone down as they need to write new parsers or convert it to something they can use.
Need a list of genomic ranges? Put them in a bed file so I can use my existing scripts or visualise them in a genome browser. Why do so many variant callers have their own format, or write VCF files that don't obey the spec?
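As a rough illustration of how little code an existing format needs, here is a minimal Python sketch that writes and reads the first three BED columns (chromosome, start, end, tab-separated). The function names `write_bed` and `read_bed` and the example ranges are made up for this post, and the sketch ignores the optional extra BED columns.

```python
import io


def write_bed(ranges, handle):
    """Write (chrom, start, end) tuples as tab-separated BED lines."""
    for chrom, start, end in ranges:
        handle.write(f"{chrom}\t{start}\t{end}\n")


def read_bed(handle):
    """Read the first three BED columns, skipping blank, comment, track and browser lines."""
    ranges = []
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        ranges.append((chrom, int(start), int(end)))
    return ranges


# Invented example ranges, round-tripped through an in-memory file
ranges = [("chr1", 100, 200), ("chrX", 5000, 5100)]
buffer = io.StringIO()
write_bed(ranges, buffer)
buffer.seek(0)
loaded = read_bed(buffer)
print(loaded)
```

Anything that already speaks BED (scripts, genome browsers) can consume that output without a new parser.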
#2 is so you know what a file contains
Gene annotations are critical for a huge amount of work. So, which version of the constantly updating annotations is contained in this GTF file? I have no idea...
FASTQ files store sequencing reads, with letters representing different quality scores. Illumina took this file format, which has no header to store a format version, and made 4 different versions of it in which the letters map to subtly different things. Don't be like Illumina.
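To make the ambiguity concrete, here is a minimal Python sketch that decodes the same quality string under two of the ASCII offsets used in the wild (33 and 64). The function name `decode_quality` and the quality string are invented for illustration, and the sketch ignores the older Solexa scoring scheme.

```python
def decode_quality(quality_string, offset):
    """Convert a FASTQ quality string to integer scores for a given ASCII offset."""
    return [ord(char) - offset for char in quality_string]


quality = "hhhh"  # made-up quality line, not taken from a real file

as_offset_33 = decode_quality(quality, 33)  # Sanger-style encoding
as_offset_64 = decode_quality(quality, 64)  # older Illumina-style encoding

print(as_offset_33)
print(as_offset_64)
```

Same letters, very different scores, and nothing in the file tells you which reading is the right one.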
#3 is to allow others to build on your knowledge
Biology is way too complicated for a single human brain to understand... so let's store our understanding in human language so computers can't understand it either!
A lot of biologists spend their whole careers working on a handful of genes, then writing up that information in journals written for other biologists to read. Storing information about one gene in English is fine. Storing information about 20,000 genes in plain English is madness.
Storing data in an unstructured way makes asking questions such as "find me all of the genes that are like X" difficult if not impossible.
Since scientists are rewarded for writing papers, not for keeping data open, structured and up to date, public efforts stagnate and fall out of date. This has left open a business model of paying an army of people to read and summarise the literature in a database. Too bad it's now proprietary information.

Sunday 29 March 2015

Slides: Workshop on Source control, git merge walkthroughs

I gave a workshop recently on Git and Source control. Slides are here: http://www.slideshare.net/DavidLawrence10/git-workshop-sourcecontrol

The slides aren't great on their own, as it's mostly a workshop - you have to make sure everyone does the typing.

If you don't use source control, it's almost certainly the biggest gain in software productivity you can get for the amount of effort it takes to learn.

It seems there are a huge number of Git tutorials, but I wanted to add a bit of a source control intro, and walk people through probably the hardest thing they will do (merging git branches). I mentioned that this is probably the hardest thing, and that 99% of life is easy.

When teaching Git, one of the questions is whether you should teach "the minimum to get started" or spend a bit of time explaining the basics, so that people get a correct mental model.

For newbies, I think you need to know:
  • Motivation for using source control
  • Hashes
  • File diffs
  • Directed Acyclic Graphs
So I ran people through the use of md5sum, diff etc.
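For anyone who wasn't at the workshop, the same two ideas can be sketched in Python with the standard library's hashlib and difflib. This is only an illustration with made-up file contents, not the workshop material itself (and Git uses SHA-1 rather than MD5 by default).

```python
import difflib
import hashlib


def md5_hex(text):
    """Return the MD5 digest of some text as a hex string, like md5sum prints."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two invented versions of a small file that differ in one character
old_text = "ACGT\nTTGA\n"
new_text = "ACGT\nTTGC\n"

# A one-character change gives a completely different hash
old_hash = md5_hex(old_text)
new_hash = md5_hex(new_text)
print(old_hash, new_hash)

# A unified diff shows exactly which lines changed
diff_lines = list(difflib.unified_diff(
    old_text.splitlines(keepends=True),
    new_text.splitlines(keepends=True),
    fromfile="old.txt",
    tofile="new.txt",
))
print("".join(diff_lines))
```

The hash tells you that something changed, and the diff tells you what.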

My friend Paul helped people out, and also played the part of Git on a whiteboard - showing how the commit graph is updated.

I think Git is a pretty great solution, and once you understand it's a filesystem database with a directed acyclic graph of changesets to text files, you will treat it like that, and everything will be fine.
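That mental model fits in a few lines of Python. The sketch below is a toy, not Git's real object format: the `make_commit` and `ancestors` functions and the way commit ids are built are assumptions made purely for illustration. Each commit is stored under a hash of its content and points at its parents, and a merge is just a commit with two parents.

```python
import hashlib


def make_commit(store, message, parents=()):
    """Store a toy commit keyed by a hash of its message and parent ids."""
    payload = message + "|" + ",".join(parents)
    commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    store[commit_id] = {"message": message, "parents": tuple(parents)}
    return commit_id


def ancestors(store, commit_id):
    """Walk parent links and return every commit reachable from commit_id."""
    seen = set()
    stack = [commit_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(store[current]["parents"])
    return seen


store = {}
root = make_commit(store, "initial commit")
feature = make_commit(store, "add feature", [root])
bugfix = make_commit(store, "fix bug", [root])
merge = make_commit(store, "merge feature into bugfix", [bugfix, feature])

print(len(ancestors(store, merge)))
```

Because a commit's id depends on its parents' ids, links can only point backwards in history, which is why the graph stays acyclic.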

I ran out of time before I could do a group pull/push from Bitbucket on the internet, and only briefly showed SmartGit. I finished off with a Gource visualisation of changes to our work repo. Here's a YouTube video of Gource.

We will see in time if it worked, and whether they start using source control.

Thursday 12 March 2015

Pipelines and Slapstick: What silent film can teach us about data processing


Watch the first minute and 20 seconds of this clip of Charlie Chaplin's silent 1936 film Modern Times (go on - it's hilarious!):


There are some useful things to learn here, and not just about the futility of life.

The steps on the assembly line must be performed in order: first tightening, then hammering. The output from one step is input for the next, so each is only as fast as the person in front, and slowdowns ripple through the line.

Pipelines are used a lot in computing - in both hardware (CPUs and graphics cards) and software (Unix pipes, data processing such as Bioinformatics). Our pipelines have similar properties. How long does it take to get a result? The time it takes to pass through all the stages. How fast can the whole line go (i.e. what speed is the conveyor belt)? The speed of the slowest stage (or worker).

In the video it turns out that Chaplin's character (The Little Tramp) is the bottleneck of the entire factory. For example the large man behind Chaplin is a tireless and efficient worker, but becomes useless when starved of input from upstream.

It's obvious (and funny) here, but not always so easy to find the little tramp slowing down your software pipelines.

Without careful profiling, it would be easy to spot the low rate of hammering and then try to optimise that part of the process. But no increase in hammering efficiency would improve total throughput, nor would adding more hammerers.
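The arithmetic behind this is small enough to sketch in Python. The stage names and timings below are invented, and the model assumes a simple linear pipeline with one worker per stage: latency is the sum of the stage times, throughput is set by the slowest stage.

```python
def latency(stage_seconds):
    """Time for one item to pass through every stage in turn."""
    return sum(stage_seconds.values())


def bottleneck(stage_seconds):
    """Name of the slowest stage."""
    return max(stage_seconds, key=stage_seconds.get)


def throughput(stage_seconds):
    """Items per second once the line is full: limited by the slowest stage."""
    return 1.0 / max(stage_seconds.values())


# Invented timings (seconds per item): the tramp tightens slowly, hammering is quick
line = {"tighten": 2.0, "hammer": 0.5}

faster_hammer = dict(line, hammer=0.25)   # optimise the wrong stage
faster_tighten = dict(line, tighten=1.0)  # optimise the bottleneck

print(bottleneck(line), latency(line), throughput(line))
print(throughput(faster_hammer), throughput(faster_tighten))
```

Doubling the hammer speed shaves a little off latency but leaves throughput untouched, while speeding up the tightening step doubles it.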

It's easy (and fun) to spend months rewriting your hammer modules in a different language or upgrading the hammer servers, only to see no improvement at all.

Could we fix this pipeline by using software? Firstly we could add some buffering, such as in Unix pipes. At his quickest, the tramp was faster than the pipeline, so it may have been possible for him to build up a bit of completed work, so he could scratch his nose (ie have high variance in time to complete the step), without stalling everyone downstream.
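Here is a small deterministic Python sketch of that idea, under simplifying assumptions of my own: two workers, made-up step times, and a buffer that never fills up (a real Unix pipe buffer is bounded). It compares a lock-step belt, where every cycle waits for the slower worker, with a buffered hand-off, where finished work can pile up between them.

```python
def lockstep_total(producer_times, consumer_time):
    """Total time when both workers advance together, one item per belt cycle."""
    total = producer_times[0]  # first cycle: only the producer has work
    for step_time in producer_times[1:]:
        total += max(step_time, consumer_time)  # belt waits for the slower worker
    return total + consumer_time  # last cycle: consumer finishes the final item


def buffered_total(producer_times, consumer_time):
    """Total time when finished items wait in an unbounded buffer between workers."""
    producer_done = 0.0
    consumer_done = 0.0
    for step_time in producer_times:
        producer_done += step_time
        # The consumer starts an item once it is free and the item exists
        consumer_done = max(consumer_done, producer_done) + consumer_time
    return consumer_done


# Invented timings: the tramp is usually quick (1s) but scratches his nose once (5s)
tramp_times = [1, 1, 1, 1, 1, 1, 1, 5]
hammer_time = 2

print(lockstep_total(tramp_times, hammer_time))
print(buffered_total(tramp_times, hammer_time))
```

With these numbers the buffered line finishes sooner, because the work banked while the tramp was quick covers the nose scratch.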

Rather than shutting down the entire factory if a step falls behind, our software steps can just block and wait for data. But software usually has problems of its own compared to the physical world. When the conveyor belt stops, widgets sit there, ready to go as soon as the belt starts moving again. If we only keep the data in RAM (ie a standard Unix pipe from stdout to stdin), a crash in a software step will destroy one or all of the widgets, requiring the entire pipeline to be re-run.

It's possible to fix this by storing intermediate data to files or a database, but I/O is very slow. Slowing the slowest step slows everything, but it also means that slowing a non-slowest step doesn't affect throughput at all (it only introduces a slight increase in latency), provided it doesn't become the new slowest step.

I recently identified a bottleneck in one of my pipelines, which was the step that inserts data into a particular table of my database. I was able to improve this by having earlier steps process the data into Postgres binary files so they could be quickly inserted via the COPY command. Even though the total amount of CPU work (and cores) increased for earlier steps in the pipeline, because the bottleneck improved (and the bottleneck didn't shift to the extra processing) - the total throughput increased.
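The general shape of that change can be sketched in Python. To keep it short this sketch uses COPY's tab-separated text format rather than the binary format I actually used, and the rows, the function names and the `variants` table are all hypothetical. The earlier steps do the formatting work, so the database step only has to stream a ready-made buffer.

```python
import io


def copy_escape(value):
    """Render one value in PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    text = str(value)
    # Escape backslashes first so the escapes added afterwards are not doubled
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def rows_to_copy_buffer(rows):
    """Pre-format rows as tab-separated COPY text in an in-memory file."""
    buffer = io.StringIO()
    for row in rows:
        buffer.write("\t".join(copy_escape(value) for value in row) + "\n")
    buffer.seek(0)
    return buffer


# Invented rows standing in for the output of an earlier pipeline step
rows = [("chr1", 100, "A", None), ("chr2", 250, "G\tT", "note")]
copy_buffer = rows_to_copy_buffer(rows)
print(copy_buffer.getvalue())

# With psycopg2 and a hypothetical `variants` table, the load step would be roughly:
#   cursor.copy_expert("COPY variants FROM STDIN", copy_buffer)
```

The formatting costs extra CPU upstream, which is fine as long as it stays cheaper than the insert step it is relieving.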

To continue with the tortured analogies, it may be worth it to hire someone to scratch the tramp's nose or shoo away the flies..... (ok, ok, I'm done.)

BioGraphServ - Bioinformatics Graph Server

I've created a webapp for quickly and easily generating graphs and analysis.
Drag & drop small files (BED, expression CSV and VCF files) onto the page to upload, and it will generate some graphs and analysis. This can be further customised and downloaded in different image formats.
A quick overview (with screenshots) can be found here:


An SVD (similar to PCA) plot.


A diagram of chromosome regions, generated from a .bed file.

Source available under a Creative Commons Attribution licence, paper to be released one day...

Monday 27 October 2014

Custom Scripts

Quite often a biologist will come to me with a paper and ask me to recreate an analysis on their data.

"Sure", I say, and start to skim through the methods for the bioinformatics section.

... were upregulated blah blah ... <PageDown>
... p<0.001 ... <PageDown>
... Cultivated blah blah nanomolar ... <PageDown>
"were sequenced on" (ok here it is) ... "analysed with custom bioinformatics scripts" ....

And there it is, the rage again!

Custom scripts! Of course! It's all so clear now!

What "custom scripts" (and only custom scripts) in your methods tells me is you did something critical to produce your results, and you won't tell me how or what.

So by now I've long stopped thinking about your experiment, and am now wondering what was going on in your head to make you think this is acceptable.

Are all programs the same to you? So stupefyingly dull that they are not worth more than a few words to describe? A mere matter of programming, how could it be wrong! Or is it the other way? Code is magical, mysterious and beyond human comprehension?

Or maybe you think all science should be written up like this?

Microscopy advances summarised as "custom optical methods". PCR written up in the 80s as "DNA was copied using custom molecular biology methods"? Assays as custom measuring methods, antibodies - custom rabbit methods?

Oh, and people that write "Custom Python", "Custom Perl" or "Custom R Scripts" as if that makes it better?

Thanks, now I know what language the code you're not showing me is written in? Fuck you with custom Fucking methods!