Some Intro Reads about Lightweight Data Schemas for Life Sciences

[Figure: a mix of standards. Examples of standards; see https://tinyurl.com/y2oerj4u for the image sources.]

I’m writing this short post mostly as a quick note on a couple of things I’m working on, after being asked for this kind of background a couple of times.

We’re living in a world of data more than ever, which makes it even more important to publish data following good data sharing principles, such as the FAIR principles. These ease the job of finding and accessing data, and make them reusable by means of interoperability and good practices, such as clear licences published along with the data.

A Long History of Data Standardisation

For a long time, practitioners from artificial intelligence, knowledge representation, formal logic and other academic fields have been working on a standardisation approach based on the ideas of the semantic web and linked data. Essentially, the idea is to reuse web technologies (eg, the HTTP protocol to exchange documents), concepts (eg, the powerful idea of the hyperlink to relate entities to one another) and practices (eg, working together in an open way under a web standardisation body like the W3C).

In the life science domain, this has produced great results in some cases, while in others, things could have gone better (to say the least…). For example, GeneOntology has been a major success, even though description logics, the modelling approach it is based on, are often seen as too complicated to deal with.

Going light

While formal ontologies are still very important in this domain, the world is also moving to something else. A couple of years ago, a group of organisations, mostly search engine providers, with Google as the usual leader, got together and built on years of experience in both linked data and similar areas, such as microformats. The result has been schema.org.

This kind of evolution is well-described in this article. schema.org puts together a couple of ideas.

One is about defining a general and very simple tool to model data: a sort of “ontology”, but much simpler and more lightweight.

The other is about defining it in a way that integrates with existing technologies, namely, allowing schema.org annotations in HTML web pages. This is a simple, yet powerful means to extend the existing web: take a technology that allows us to publish human-readable documents (and hence formatting, layout, etc) and add a way to describe the data that the documents are about, so that search engines like Google can easily understand them. Plenty of articles describe how this could be useful to make your own site more visible to Google, and maybe get a good ranking (example).

Other uses of schema.org are possible. For instance, Google is also able to digest data descriptions based on schema.org and make your dataset findable in its recently launched dataset search resource, which, if you’re an open data publisher of any kind, is pretty cool.
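
To give a rough idea of what such a description looks like, here is a minimal Python sketch that builds a schema.org Dataset annotation as JSON-LD and wraps it in the script element commonly used to embed it in an HTML page. The function names are made up for this example, and every value (name, URL, keywords) is a placeholder, not a real dataset. The properties a search engine actually requires may differ, so check its guidelines before relying on this.

```python
import json


def dataset_jsonld(name, description, url, license_url, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": list(keywords),
    }


def as_script_tag(jsonld):
    """Wrap a JSON-LD dictionary in a script element for embedding in HTML."""
    body = json.dumps(jsonld, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset: all values below are placeholders.
example = dataset_jsonld(
    name="Example wheat gene expression dataset",
    description="Placeholder description of an openly published dataset.",
    url="https://example.org/datasets/wheat-expression",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["wheat", "gene expression"],
)

print(as_script_tag(example))
```

The printed snippet would go into the HTML of the page that presents the dataset, next to the human-readable content.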

Lightweight Schemas for the Biology Community

schema.org is very general; indeed, it might be too generic for specific application domains. So, it has been natural for several communities to start working on domain-specific extensions. In the life science community, bioschemas is one of the most relevant projects.

Their goals are similar to the ones schema.org has, ie, providing a simple language to describe biological data and support use cases like informal annotation of web pages and powering data search applications.

They have a simple and clear description of that. More details are available in their training portal.

And for Agriculture

The above is where I was when, a while ago, I started to work on a couple of projects: publishing plant biology data coming from our Knetminer platform (presentation here) and, later, publishing a variety of agriculture data, including those from Knetminer and the Designing Future Wheat project.

The result has mainly been the (still in progress) Agrischemas project.

As you can imagine, the idea of Agrischemas is to reuse bioschemas (and hence, schema.org) as much as possible to model data in agriculture, agronomy, farming and related domains, while providing further extensions for the not-so-many cases where we need domain-specific types.

Additionally, we (ie, the Knetminer team and other collaborators) aim to reuse other data models as much as we can (eg, MIAPPE, BrAPI).

So far, we have managed to use this approach with a couple of data sets, namely Knetminer and gene expression data coming from the EBI’s Gene Expression Atlas. All the data are available on our SPARQL endpoint (if you’re new to SPARQL, you might want to look at SPARQL in a nutshell or the SPARQL Tutorial by Jena).
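
For those who have never talked to a SPARQL endpoint, here is a small sketch of how a query can be prepared in Python with the standard library only. It follows the usual SPARQL protocol convention of sending the query in a `query` URL parameter and asking for JSON results. The endpoint address is a placeholder (it is not our endpoint), and the query is a generic illustration that lists things having a schema.org name, not a query tailored to our data.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: replace it with the SPARQL endpoint you want to query.
ENDPOINT = "https://example.org/sparql"

# Illustrative query: the use of schema:name is an assumption for this example.
QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?entity ?name
WHERE {
  ?entity schema:name ?name .
}
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Prepare a GET request carrying the query and asking for JSON results."""
    params = urlencode({"query": query})
    return Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Nothing is sent over the network here. To actually run the query, you would pass the request to `urllib.request.urlopen()` and parse the response body as JSON.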

While we’re in the early days, we’ve outlined a possible way to use these data. For instance, we have shown simple use cases (this and this) where data modelled as above are fetched by Python scripts, in a Jupyter notebook.
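
As a rough illustration of what such a script has to deal with (this is not the code from our notebooks), the sketch below turns a response in the standard SPARQL JSON results format into a plain list of Python dictionaries. The sample response is invented for the example, including its variable names and URIs.

```python
import json

# Invented sample in the SPARQL JSON results format (not real data).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["gene", "name"]},
  "results": {"bindings": [
    {"gene": {"type": "uri", "value": "https://example.org/gene/1"},
     "name": {"type": "literal", "value": "Example gene A"}},
    {"gene": {"type": "uri", "value": "https://example.org/gene/2"}}
  ]}
}
"""


def bindings_to_rows(response):
    """Flatten SPARQL JSON results into one dictionary per result row."""
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        # A variable can be unbound in a row (eg, with OPTIONAL): use None.
        row = {
            var: (binding[var]["value"] if var in binding else None)
            for var in variables
        }
        rows.append(row)
    return rows


rows = bindings_to_rows(json.loads(SAMPLE_RESPONSE))
for row in rows:
    print(row)
```

A list of rows like this is easy to hand over to the usual notebook tools, for instance to build a table.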

The future

More is to come in the coming weeks. For example, apart from completing the above work, an idea we’d like to explore is bridging SPARQL with GraphQL, a pretty good query language for JSON-based web services (or, APIs). The rationale for using GraphQL is that many developers, especially front-end developers, are more proficient with querying web APIs and dealing with JSON than they are with SPARQL and linked data (which remain useful to model and integrate data).
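
Just to make the idea concrete, here is a toy sketch of one piece of such a bridge: reshaping the flat rows that a SPARQL query returns into the nested JSON that a GraphQL client would expect. It isn’t a design we have settled on, and it doesn’t use any GraphQL library. The `nest_rows` helper, the field names and the sample rows are all invented for this example.

```python
def nest_rows(rows, parent_key, parent_fields, child_name, child_fields):
    """Group flat rows by parent_key, collecting child fields in a list."""
    parents = {}
    for row in rows:
        # Create the parent object the first time its key is seen.
        parent = parents.setdefault(
            row[parent_key],
            {**{field: row[field] for field in parent_fields}, child_name: []},
        )
        child = {field: row[field] for field in child_fields}
        # Skip duplicates that may come from the flat result set.
        if child not in parent[child_name]:
            parent[child_name].append(child)
    return list(parents.values())


# Invented flat rows, shaped like the output of a SPARQL SELECT query.
flat_rows = [
    {"gene": "g1", "geneName": "Example gene A", "condition": "drought"},
    {"gene": "g1", "geneName": "Example gene A", "condition": "heat"},
    {"gene": "g2", "geneName": "Example gene B", "condition": "drought"},
]

# GraphQL responses conventionally put results under a top-level "data" key.
response = {
    "data": {
        "genes": nest_rows(
            flat_rows, "gene", ["gene", "geneName"], "conditions", ["condition"]
        )
    }
}
print(response)
```

A real bridge would also need to translate the GraphQL query itself into SPARQL, which is the harder part and is not covered here.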


Written by

I'm the owner of this personal site. I'm a geek, I'm specialised in hacking software to deal with heaps of data, especially life science data. I'm interested in data standards, open culture (open source, open data, anything about openness and knowledge sharing), IT and society, and more. Further info here.
