The lack of industry-wide data standards could hold back innovation in life sciences R&D, suggest Dr Jarek Tomczak and Gabrielle Whittick, Consultants at the Pistoia Alliance
Against the backdrop of increasing digitisation in R&D, it’s no surprise that technologies such as artificial intelligence (AI), machine learning (ML) and deep learning (DL) are being widely explored.
Life science research, and chemistry in particular, has a long history with AI, starting with the DENDRAL expert system developed more than 50 years ago.
Today, the interest is focused on identifying novel compounds with desired properties and finding efficient ways to synthesise and analyse them. Such innovations are bringing the Lab of the Future (LotF), which aims to modernise lab environments through embracing technology, data and automation, closer to becoming a reality.
However, because AI relies on high-quality data for accurate outcomes and predictions, industry-wide data standards are essential if the LotF is going to bring tangible benefits.
LotF advancements mean more instrument and device data are being captured, which, alongside the interest in personal genomics and the digitisation of healthcare records, puts even greater demands on data sharing systems. Just last year, the US FDA warned that discouraging data sharing could hold back progress in clinical trials.
This has led many life science companies to build their own tools to manage and store data — but such tools don’t work in other organisations, stunting collaboration and therefore progress.
By contrast, there are many off-the-shelf tools from a range of vendors to choose from, but each of these also uses different formatting and storage methods.
And in life sciences R&D, no one can go it alone: to really make breakthroughs, collaboration and data sharing are key. Adopting an open, unified data format to help standardise data industry-wide is critical – making data not only more discoverable and shareable, but tools interoperable.
A quick glance back at some of the last decade’s R&D breakthroughs shows the true value data sharing can bring. During the 2014 outbreak of Ebola, for instance, it was reported that data sharing helped scientists quickly trace the virus’s origins and control the epidemic, accelerating the development of new cures and therapies.
To make this kind of co-operation more common, data needs to be standardised, so stakeholders can work together to produce better outcomes. The Pistoia Alliance recently partnered with the Allotrope Foundation and several major pharmaceutical companies to launch MethodDB to centralise and standardise experiment descriptions.
This saves scientists time when reproducing experiments on different instruments, improving not only the outcomes of individual experiments, but supporting the development of new therapies.
Data standards are also vital if we’re to accelerate adoption of (and get more value from) the latest AI, ML and DL technologies. Advancing the LotF undoubtedly requires high levels of automation, and for algorithms to draw accurate conclusions, it’s vital the data fed into them are of the highest quality.
Also key is that data are processed using a single unified format so they can be harmonised – ensuring patterns unnoticed by the naked eye can be spotted and important insights aren’t missed.
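To illustrate the point about harmonisation, the sketch below is a minimal, hypothetical example – the field names, units and instrument formats are invented for illustration and are not taken from any real standard. It shows how records from two instruments that report the same measurement differently can be mapped onto one shared schema before analysis:

```python
def harmonise(record):
    """Map hypothetical instrument-specific fields onto one shared schema.

    Instrument 1 reports temperature in Celsius ("temp_C"); instrument 2
    in Fahrenheit ("temp_F"). Both are converted to kelvin so a single
    downstream analysis step can see all the data.
    """
    if "temp_C" in record:
        return {"sample": record["sample"],
                "temperature_K": record["temp_C"] + 273.15}
    if "temp_F" in record:
        return {"sample": record["sample"],
                "temperature_K": (record["temp_F"] - 32) * 5 / 9 + 273.15}
    raise ValueError("unrecognised record format")

raw = [
    {"sample": "A1", "temp_C": 25.0},   # instrument 1's native format
    {"sample": "A2", "temp_F": 98.6},   # instrument 2's native format
]
unified = [harmonise(r) for r in raw]
```

Once every record shares the same fields and units, pattern-finding tools only need to understand one format rather than one per vendor.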
Having established why we need to standardise data to advance the LotF, the real question is how? Because experiments rely on high-quality, structured and preferably annotated data from a range of sources and systems, data collection must be done using a consistent format.
This is why the Pistoia Alliance launched the Unified Data Model (UDM) project in 2018; it delivers an open data format for the exchange of experimental information about compound synthesis and testing. The project team has been working to improve the UDM during the past 18 months, launched Version 6.0 in February 2020, and is now supporting organisations in adoption.
The UDM helps to mitigate the effects of experimental data not being readily available by ensuring they are stored in a usable electronic format. If data from laboratory experiments are recorded on paper or in an Excel document, they can’t be shared easily with researchers in the same room, let alone other organisations.
The UDM also makes it easier to access old data and compare data sets. Even when data are stored electronically, data sets often require extensive pre-processing and conversion, because some of the most popular file formats used in life sciences research don’t enforce enough content structure.
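As a rough sketch of why enforced structure matters, consider the toy record below. The element and attribute names are hypothetical – they are not the actual UDM schema – but they show how a machine-readable reaction record can be queried directly, with no manual pre-processing:

```python
import xml.etree.ElementTree as ET

# Hypothetical, UDM-like structured record (invented names, not the real schema).
record = """
<REACTION id="rx-001">
  <REACTANT name="benzaldehyde" amount="1.0" unit="mmol"/>
  <REACTANT name="aniline" amount="1.1" unit="mmol"/>
  <PRODUCT name="N-benzylideneaniline" yield="87"/>
</REACTION>
"""

def summarise(xml_text):
    """Pull reactant names and product yields straight out of the record."""
    root = ET.fromstring(xml_text)
    reactants = [r.get("name") for r in root.iter("REACTANT")]
    yields = {p.get("name"): float(p.get("yield"))
              for p in root.iter("PRODUCT")}
    return reactants, yields

reactants, yields = summarise(record)
```

The same experiment written up as free text or an ad-hoc spreadsheet would need bespoke parsing for every source; a shared, structured format makes the extraction step write-once.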
This version of the UDM also improves the semantics and validation of data, which is vital since, in R&D, acronyms and terms are frequently interchanged. A unified data model that provides alignment on terminology helps reduce the likelihood of trends and wider patterns being missed in experiments where a word can mean more than one thing.
The UDM is not only about the exchange of comprehensive reaction data, but about the mineability of such data as well. The model is designed to capture information relevant for retrosynthesis, reaction product prediction and reaction condition optimisation.
In scientific endeavour, organisations see far better outcomes when working together than in isolation. Collaboration is essential if we want to see more breakthroughs in treatments, but with more data being created and more organisations building their own tools which aren’t compatible with others’, sharing experimental data has become increasingly challenging.
This is why a universal format is essential; it means scientists can not only spend less time formatting data and more time finding new treatments, but can more easily share information to improve clinical outcomes.
Ultimately, if society is to benefit from the advancements that the LotF is promising, data must be as discoverable and as shareable as possible – which is why an open, ready-to-use UDM is essential in scientific research.