I wish I had written this post earlier, but unfortunately the work it is about has kept me away from almost everything else over the last few months.
Besides, now that the EBI Linked Data have been out for two months, I have the advantage of being able to talk about them with some experience in the field. James initially wrote a good perspective post about it, highlighting how our organisation's embrace of the Semantic Web can benefit end users. We recently enjoyed meeting interested people and interacting with them during the tutorial we gave at SWAT4LS 2013, as well as during the hackathon that followed. More is coming with our RDFApp contest.
As in other cases, our linked data set is backed by an existing repository, the Biosamples Database (aka BioSD to friends), which is already available in the form of a human-oriented web application, REST web services and file dumps. BioSD is a large collection of data whose rationale is to provide a single point of access focused on the biological material used in experimental activity, so that other, technology-specific and assay-specific repositories can link to it or use it to expose part of their information. This has several advantages, such as:
- specific data are more searchable when one starts searching from features of interest like a particular organism or disease state
- online publishing of sample-related data can simply be left to us, avoiding duplication of work and information
- we are aiming at standardising and harmonising sample representation, especially regarding the use of ontologies in sample attribute types and values.
That said, it is natural to want to add RDF and standard ontologies to this rationale. This gives further benefits, namely the typical ones brought about by Semantic Web technologies, like better data integration, or easier data access for third-party applications. For instance, I needed just a couple of SPARQL constructs to search for samples annotated with types and sub-types of a given CHEBI substance, a result I achieved by combining our SPARQL endpoint and the endpoint from Bioportal. OK, not very easy for a bench biologist (though things are improving for them too), yet much easier for a programmer used to walking ontology trees with Perl or uploading huge database dumps to MySQL.
I think it's worth talking about how we have modelled such data as RDF. The starting point has been what you can see in the BioSD interface, or in the SampleTab format. The main entities we deal with are sample groups and samples. Because we want to make data providers' lives easy, we allow them to be liberal in defining a sample and its attributes. Essentially, anything that can somehow be associated with the biological materials used or usable for a biomedical study can be a sample in BioSD, including a cell culture, the description of a cell line, a human patient, a piece of tissue, or soil taken for environmental studies. Similarly, a sample attribute can be as simple as a pair of value/type strings, which can optionally be enriched with things like ontology terms and units. The core of the schema we have designed for BioSD/RDF is not much different from these basic elements. As is common in these cases, we have framed it into a small application ontology, which links a number of existing, wider ontologies.
As you may have guessed, our data acquisition policy might lead to data that are too simple or too noisy. On the one hand, we are trying to introduce some manual curation of the data that we import from external sources, or that are produced afresh by data providers. On the other hand, our BioSD/RDF converter relies on Zooma, another project from our Functional Genomics Team. Zooma is a text mining tool that does a sort of indirect crowdsourcing, by digesting and counting manual annotations in existing biomedical repositories (such as Gene Expression Atlas), so that it can use them to infer which ontology terms a pair of strings is likely to be associated with. For instance, it might take the labels 'organism', 'mus-musculus' and return NCBITaxon_10090. All the sample attributes that come from BioSD are enriched this way, if ontology annotations are not explicitly provided by the original submitter. The result is obviously not perfect and I aim to exploit a few traditional text mining tools as well in the future. Nevertheless, we're happy enough with the first tests. For example, the query mentioned above shows that it is now relatively easy to find samples related to ontology-identifiable chemical compounds.
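To make the "counting manual annotations" idea more concrete, here is a minimal Java sketch of frequency-based lookup. It is only an illustration under my own assumptions and is not Zooma's actual algorithm or API: the class name AnnotationVotes, its methods and the lower-casing normalisation are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: pick the ontology term most often used for a type/value pair. */
public class AnnotationVotes {
    // key: normalised "type<TAB>value"; value: term URI -> number of times curators used it
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Assumption: trimming and lower-casing is enough to make equivalent labels match.
    private static String key(String type, String value) {
        return type.trim().toLowerCase(Locale.ROOT) + "\t" + value.trim().toLowerCase(Locale.ROOT);
    }

    /** Records one existing manual annotation taken from some curated repository. */
    public void observe(String type, String value, String termUri) {
        counts.computeIfAbsent(key(type, value), k -> new HashMap<>())
              .merge(termUri, 1, Integer::sum);
    }

    /** Returns the most frequently observed term, or empty if the pair was never seen. */
    public Optional<String> mostLikelyTerm(String type, String value) {
        Map<String, Integer> votes = counts.get(key(type, value));
        if (votes == null) {
            return Optional.empty();
        }
        // Ties are broken arbitrarily in this sketch.
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        String mouse = "http://purl.obolibrary.org/obo/NCBITaxon_10090";
        AnnotationVotes votes = new AnnotationVotes();
        votes.observe("organism", "mus-musculus", mouse);
        votes.observe("Organism", "Mus-musculus", mouse);
        System.out.println(votes.mostLikelyTerm("organism", "mus-musculus").orElse("no suggestion"));
    }
}
```

A real tool would also need to weigh how trustworthy each source is and to handle pairs it has never seen, which this sketch simply reports as "no suggestion".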
As EBI, we have paid attention to the creation of a common URI schema across all the datasets that we export. For instance, all the OWL individuals from the BioSD datasets are named like 'http://rdf.ebi.ac.uk/resource/biosamples/sample/ACCESSION'. Achieving that can be tricky at times, especially if one wants to assign the same URI to multiple original records that are known to represent the same real-world entity. As an example, we're reasonably sure that within a given data submission (e.g., all the samples about an experiment) all the attributes of type 'organism' and value 'strain A mouse' refer to the same real entity. In such a case we assign a URI like 'http://rdf.ebi.ac.uk/resource/biosamples/exp-prop-val/S-BSD-123#d53edc...', where 'S-BSD-123' is an accession for the submission and the 'd53edc...' bit is a hash (e.g., MD5) of the string 'organism strain A mouse'. Hashing long and unpredictable strings is a common trick to obtain something that is, in practice, one-to-one with the string and convenient to use as a URI fragment. This way, all the mouse-A annotations within the submission will be mapped onto the same URI. We are left with the fact that string pairs like 'mus-musculus/organism' are likely to mean the same thing in any other context. Unfortunately we cannot always detect that automatically, but the ontology annotations provided by Zooma are very helpful: all the attributes of this type will have different URIs, but they will also have rdf:type NCBITaxon_10090.
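The hashing trick can be sketched in a few lines of Java. This is a minimal illustration, not the actual BioSD converter code: the class and method names are hypothetical, and so are the exact string that gets hashed (type, a space, then value, with no further normalisation) and the choice of UTF-8.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of submission-scoped URIs for attribute values. */
public class PropertyValueUri {
    // Namespace taken from the example URI in the text.
    static final String BASE = "http://rdf.ebi.ac.uk/resource/biosamples/exp-prop-val/";

    /** Returns the lower-case hexadecimal MD5 digest of the text (always 32 characters). */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            // %032x keeps the leading zeros that BigInteger would otherwise drop
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    /** Same submission, type and value always give the same URI. */
    static String uriFor(String submissionAccession, String type, String value) {
        // Assumption: the hashed string is simply "type value".
        return BASE + submissionAccession + "#" + md5Hex(type + " " + value);
    }

    public static void main(String[] args) {
        System.out.println(uriFor("S-BSD-123", "organism", "strain A mouse"));
    }
}
```

Because the submission accession is part of the URI, the same type/value pair in a different submission gets a different URI, which matches the cautious, per-submission merging described above.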
By the way, we use URIs with '...#xxx' tails for those resources that are too likely not to be preserved from one data release to another. Sample attributes are such a case, since data curation might cause them to be removed, added, or replaced. These hash-symbol tails are one example of the kind of URI conventions we have agreed on across all the EBI data sets.
Inside the RDF conversion factory
Now comes the geekier part. BioSD was born as a relational database, mostly due to the simplicity of its data model. However, we try not to deal directly with SQL and the relational view of the data. One reason is that we support data submission through a tabular format, and one of the best ways to parse files in such a format is to map them onto the objects of an object-oriented programming language. Another reason is that, albeit simple, our model is complex enough to make it desirable to deal with it via an object model representation. In fact, tasks like data validation and tracking sample treatment pipelines (i.e., graphs of 'derived from' relations that tell you how samples were obtained from initial sources) are easier to manage with such a model.
For similar reasons, we chose not to map BioSD data directly from the Oracle database where they are originally stored (using something like D2RQ). Rather, we wanted to map JavaBean objects and their properties (including links to other JavaBeans) onto RDF statements. Now, there are several Java libraries to do that, to name but a few: JenaBeans, RDFBeans, Java Architecture for OWL Binding (JAOB). However, they all use the same approach as JPA or JAXB: they require that you clutter your existing code with Java annotations. We don't like that, because we would be forced to touch generic classes that we might want to use beyond the BioSD project, and one particular RDF mapping in BioSD is not necessarily good everywhere.
Despite having done my homework, I couldn't find anything like Java2RDF, the small library that I started writing for the BioSD->RDF conversion, while designing it as general-purpose code. At least, I don't know of anything like it that is written in Java and well documented. Beyond Java, SemanticWebPogos is based on an approach very similar to Java2RDF, but it is written in Groovy. I understand that Groovy and Java can easily interoperate, so maybe I'll take a closer look at that in the future.
This is an example of how our library works (look for testCompleteExample()). As you can see, it is pretty easy to define a mapping in a very declarative way. Moreover, because you are still programming in Java, you can be very flexible in plugging custom code into more common mappings. An example of that is the attribute value mapper in the BioSD converter, which has to apply a number of atypical procedures, such as asking Zooma for mappings or RDFizing numbers and units.
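To give a flavour of what an annotation-free, declarative mapping can look like, here is a small self-contained sketch. Please note that this is not the Java2RDF API: BeanMapper, its methods, the Sample bean and the example.org class URI are all hypothetical names made up for this illustration, and the output is simplified to plain N-Triples-like strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ExternalMappingSketch {

    /** A plain bean that stays free of RDF-specific annotations (hypothetical). */
    public static class Sample {
        private final String accession;
        private final String name;

        public Sample(String accession, String name) {
            this.accession = accession;
            this.name = name;
        }

        public String getAccession() { return accession; }
        public String getName() { return name; }
    }

    /** Describes, outside the bean class, how one bean type becomes triples. */
    public static class BeanMapper<T> {
        private static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

        private final Function<T, String> uriGenerator;
        private final String classUri;
        private final Map<String, Function<T, String>> literals = new LinkedHashMap<>();

        public BeanMapper(Function<T, String> uriGenerator, String classUri) {
            this.uriGenerator = uriGenerator;
            this.classUri = classUri;
        }

        /** Registers a getter whose value is emitted as a plain literal. */
        public BeanMapper<T> literal(String predicateUri, Function<T, String> getter) {
            literals.put(predicateUri, getter);
            return this;
        }

        /** Returns one statement per line; null property values are skipped. */
        public List<String> map(T bean) {
            String subject = "<" + uriGenerator.apply(bean) + ">";
            List<String> triples = new ArrayList<>();
            triples.add(subject + " <" + RDF_TYPE + "> <" + classUri + "> .");
            for (Map.Entry<String, Function<T, String>> e : literals.entrySet()) {
                String value = e.getValue().apply(bean);
                if (value == null) continue;
                // Minimal escaping only (backslash and double quote)
                String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
                triples.add(subject + " <" + e.getKey() + "> \"" + escaped + "\" .");
            }
            return triples;
        }
    }

    /** The mapping itself: declared in one place, separate from Sample. */
    static BeanMapper<Sample> sampleMapper() {
        return new BeanMapper<Sample>(
                s -> "http://rdf.ebi.ac.uk/resource/biosamples/sample/" + s.getAccession(),
                "http://example.org/terms/Sample") // hypothetical class URI
            .literal("http://www.w3.org/2000/01/rdf-schema#label", Sample::getName);
    }

    public static void main(String[] args) {
        Sample sample = new Sample("ACCESSION", "mouse liver tissue");
        sampleMapper().map(sample).forEach(System.out::println);
    }
}
```

The point of the design is that Sample knows nothing about RDF, so a different project could reuse the same class with a completely different mapper, and custom code can be plugged in wherever a lambda is accepted.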
If you're going to delve into Java2RDF, please be merciful about the quality of its documentation: I'm already aware of this problem and I will find time in the coming weeks to fix it.
If you have been brave enough to read this post up to here, you might also be interested in going through the details and getting an idea of how we plan to extend this work during the coming weeks.
Java2RDF needs better documentation and code improvements. For instance, I'd like to abstract it away from its current dependency on OWL API, since you may want to use a different RDF/OWL framework, such as Jena or Sesame. Currently we access OWL API only via a few general facilities, so a plug-in approach shouldn't be difficult to add.
On the operations side, we are completing a revamp of the software infrastructure used to run BioSD. This will include regular data set releases, both on the end-user web site and at the SPARQL endpoint, which also means that the RDF will be more up to date with respect to the web site.
We drafted other possible future developments during the SWAT4LS hackathon. We would like to showcase how RDF and SPARQL can be useful with our data. For instance, we're playing with geotagged data, mainly to show that RDF data can be extracted from a triple store and transformed into some other application-required format, namely KML, to achieve nice results quickly, such as showing entities on a map. A more advanced use case we're thinking of is improving sample searching by leveraging SPARQL, for example by asking the user for a set of desired features (organism, age, dosage, etc.) and looking for the samples or sample groups that are most similar to the given input. This would be especially useful if it could work over multiple sample-related data sets, so that they can be used smoothly behind the same interface. We also worked on the integration of data similar to ours during the hackathon, using this approach.
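As a rough idea of the KML step, the sketch below turns a few rows (as one might get back from a SPARQL SELECT returning a label, a latitude and a longitude) into a KML document. It is only an illustration: the GeoRow class, the sample row and the way rows are obtained are my assumptions, and the triple store query itself is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: turn label/latitude/longitude rows into a KML document. */
public class KmlSketch {

    /** One result row, e.g. from a SPARQL SELECT (assumed shape). */
    public static class GeoRow {
        final String label;
        final double latitude;
        final double longitude;

        public GeoRow(String label, double latitude, double longitude) {
            this.label = label;
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    /** Escapes the characters that are special in XML text content. */
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a KML 2.2 document with one Placemark per row. */
    static String toKml(List<GeoRow> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<kml xmlns=\"http://www.opengis.net/kml/2.2\">\n<Document>\n");
        for (GeoRow row : rows) {
            sb.append("<Placemark><name>").append(xmlEscape(row.label)).append("</name>")
              .append("<Point><coordinates>")
              // KML expects longitude first, then latitude
              .append(String.format(Locale.ROOT, "%.5f,%.5f", row.longitude, row.latitude))
              .append("</coordinates></Point></Placemark>\n");
        }
        sb.append("</Document>\n</kml>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up row, for illustration only
        List<GeoRow> rows = Arrays.asList(new GeoRow("Soil sample <A&B>", 52.25, 0.5));
        System.out.print(toKml(rows));
    }
}
```

The resulting file can be opened by any KML-aware map viewer. The only detail that regularly trips people up is the coordinate order, which is longitude before latitude.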
Further future developments are... well, who knows? One great thing about the linked data philosophy is that consumers can contribute creative data applications that providers have never thought of. So, have a try with the BioSD endpoint or other EBI datasets and let us know!
Disclaimer: the views expressed here are the author's own and do not represent the EMBL-EBI position.