Nature Genetics has recently published this interesting work on the repeatability of gene expression analysis. That the microarray is a powerful tool for studying a number of biological phenomena is nowadays obvious. So is the fact that a microarray study should be repeatable, since it is an instance of the experimental approach that science is based on. So, those who try to verify the repeatability of this kind of experiment are certainly welcome (sorry for the repetitions… 😉 ). This holds even when one considers only a part of it, as the authors do, by focusing on the possibility of redoing the analysis, starting from either the raw data or the processed (i.e., normalised) data.
What we learn from the paper is, IMHO, that the situation is better than one would have expected, but there is still quite some room for improvement.
They started from an initial list of 20 papers, chose a single conclusion from each of them (i.e., a single figure or table) and tried to arrive at the same one by redoing the reported analysis. An important part of the test is that they purposely didn’t contact the original authors; rather, they tried to get all they needed from the paper and from the data published in ArrayExpress or GEO, hence exploiting the available data annotations.
The result is that 8 out of 20 studies could be reproduced, but only two of them could be redone, as they say, “in principle”. It astonished me to read that the main cause of such still-low reproducibility was not the lack of data annotations, or bad analysis descriptions, but (ugh!) the fact that data were missing!
To me, this mainly means that compliance with checklist standards, such as MIAME (or others listed under MIBBI), although a valuable step forward, is not enough. I hardly see how a paper submission without at least the raw values for expression intensities (or, alternatively, the images) is useful to the scientific community. Not to mention other details (the paper’s authors cite software name and version; I would add the use of ontology terms and recognisable links like PMIDs), which are more and more important if we want to do good things with emerging technologies, such as bio-ontologies and the Semantic Web.
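Just to make the point concrete, here is a minimal sketch of the kind of automatic check a repository or journal could run on a submission. Everything in it is an assumption of mine: the `Submission` class, its field names and the `missing_items` function are hypothetical, and they do not correspond to MIAME, ArrayExpress, GEO or any real submission tool.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical summary of what accompanies a submitted paper."""
    has_raw_intensities: bool = False
    has_images: bool = False
    software: dict = field(default_factory=dict)      # tool name -> version string
    ontology_terms: list = field(default_factory=list)
    pmids: list = field(default_factory=list)


def missing_items(sub: Submission) -> list:
    """Return the items discussed above that the submission lacks."""
    problems = []
    # Raw intensities or, alternatively, the images: at least one is needed.
    if not (sub.has_raw_intensities or sub.has_images):
        problems.append("raw intensities or images")
    # Every tool must be listed together with a non-empty version.
    if not sub.software or not all(sub.software.values()):
        problems.append("software name and version")
    if not sub.ontology_terms:
        problems.append("ontology terms")
    if not sub.pmids:
        problems.append("PMIDs")
    return problems
```

A submission with only processed data and a tool listed without its version would fail all four checks, which is roughly the situation the paper complains about.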
I know, reporting biological data is like paying taxes: when you have to do it as an individual, it’s annoying, you don’t get an immediate advantage from it, and of course nobody likes paying taxes or annotating experiments per se. The (known) fact is that this individual annoyance contributes to the benefit of the whole community and, at some point, of every one of us (isn’t the community made of individuals?). The part about Biology is well explained by the editorial in the same NatGen issue. As in the case of taxes, a carrot-and-stick policy helps. On the one hand, there is the right policy of demanding that data related to a submitted paper be annotated and sent to public repositories. On the other hand, educating people on the value of public annotated data is just as important. So is, for instance, involving biologists in building curated biological databases, which can typically be repaid with at least a name on the papers (the ultimate goal of any scientist…).
Maybe there is yet more that could be done. Someone (public funders?) could encourage the introduction of more automation inside software tools like LIMSs (for instance, BASE could integrate ontology lookup services, such as BioPortal or EBI’s OLS). Those who work on/with reporting standards could focus more on end-user tools that read/write the standards (OK, here I am biased…). Existing standards that are essentially based on object-oriented models, such as MAGE-TAB, could allow links to descriptions defined in syntaxes that depend much less on predefined schemas/structures; e.g., it would be nice to be able to attach RDF fragments to a set of structured spreadsheets (ahem… biased again… by the way, did I mention ISA-TAB? And the DCTHERA directory?).
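To give an idea of what attaching RDF to a spreadsheet might look like, here is a small sketch that turns one row of a tab-delimited sheet into N-Triples lines. It is only an illustration under my own assumptions: the `http://example.org/` namespace, the column names and the `row_to_ntriples` function are made up, and this is not how MAGE-TAB or ISA-TAB actually handle it.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # placeholder namespace, an assumption


def escape_literal(text):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(row, id_column, base=BASE):
    """Describe one spreadsheet row as N-Triples, one triple per filled cell."""
    subject = f"<{base}sample/{quote(row[id_column], safe='')}>"
    lines = []
    for column, value in row.items():
        if column == id_column or not value:
            continue  # the ID is the subject; empty cells carry no statement
        predicate = f"<{base}column/{quote(column, safe='')}>"
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"  # treat URLs as links to other resources
        else:
            obj = f'"{escape_literal(value)}"'  # everything else is a plain literal
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


def sheet_to_ntriples(text, id_column):
    """Convert a whole tab-delimited sheet, given as a string."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [line for row in reader for line in row_to_ntriples(row, id_column)]
```

The nice thing about this direction is that the spreadsheet stays as it is for the biologist, while the triples can be extended with whatever extra description one needs, without changing the sheet's columns.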
There are some aspects of describing biological investigations that could be formalised more than has been done so far. For instance, the OBI ontology or the EXPO ontology are starting points that might be useful for describing biological protocols and experimental design more precisely. The whole statistical analysis could also potentially be described in a way that would allow it to be re-run automatically, as is possible with Taverna workflows.
Furthermore, there are things that can be done after the data and papers are submitted. Although reproducing the experimental activity described in articles is time-consuming (as reported by the NatGen paper) and, again, not very interesting individually for most people, a way should be found so that someone does it from time to time (again, public projects?). Related to this, on the emerging WWW you see a lot of examples where users can provide feedback by voting on the items they’re viewing (goods, hotels, eBay sellers, etc.). This creates user-based rankings that are often pretty reliable and useful. Why not use the same approach and add something like "vote for this paper" or "vote for this experiment" to the web page describing a biological entity? Why not allow users to annotate web-published content with tags, ontology terms and the like? myExperiment is a well-known example of that. Similar collaboration is applied in several emerging collaborative data-curation projects, which are often implemented by means of wikis and semantic wikis. Several are mentioned in this other editorial. Last but not least, there is the Swan project.
Anyway, coming back to the repeatability paper, 8/20, for complex technologies such as microarrays and other -omics, is not so bad after all. I suspect that verifying the reproducibility of the bio-materials would be interesting too. But that’s another story, and the paper is worth reading anyway. As a final note, the PubMed links to the paper report this previous work, which should be interesting as well.