Harnessing dark data in early stage drug development

In 1966, anticipating the impact of rapid advances in computing, a Time magazine article hypothesised: “By 2000, the machines will be producing so much that everyone in the US will, in effect, be independently wealthy. How to use leisure time will be a major problem.” Graeme Dennis, Commercial Director of Preclinical Pharma, IDBS, reports

The arrival of the integrated circuit appeared to address the nascent industry’s primary obstacles, computer size and cost, and the microcomputer was born shortly thereafter. Neither this history-making advance nor Time’s spectacular prediction anticipated the growth in data that is collected but utilised only once, or never: what is commonly termed “dark data.”

Local storage, an explosion in the number of users of all levels of computer literacy, and well-intended (and necessary) data security measures leave swathes of data inaccessible. A Veritas study suggests dark data may account for half of all information stored.

Big Pharma is increasingly turning to third-party R&D firms to accelerate the early stages of drug development. Today, the most effective of these firms are technology-focused, with a cultural mindset that recognises the power of data. The expectation is that this work is done faster and more cheaply, while also maintaining legal and regulatory compliance.

To combat these pressures, contract research or manufacturing organisations are turning to ever more advanced data strategies to streamline their own processes.

For instance, method execution and sample management approaches with origins in manufacturing and quality control (QC) have migrated upstream into drug development. The rigour provided by these systems is desirable, although pharma has chafed at their inability to capture deviations and exceptions as flexibly as needed.


For example, if unexpected samples arrive at a CRO for bioanalysis as part of a larger anticipated group, they must be accounted for from the moment of arrival. The use cases upon which legacy LIMS (laboratory information management systems) were developed don’t anticipate a situation such as this, as work requests were raised and filled in-house. The scientific informatics software industry has stepped in to satisfy these requirements where LIMS or LES (laboratory execution systems) may have been misapplied.

This simple observation exposes one of the primary reasons why dark data continues to accumulate: there is no completely suitable system in which to house it. For instance, many in vivo preclinical study results generated by contract research organisations reside in portable document formats (unsuited to parsing) in email (an unshared environment by design).

This choice is dictated by convenience. But not only is the data unshared and non-sharable, it’s invisible to the organisation. Even if exposed, many scientists would not consider using or reinterpreting data without significant context regarding how, when, by whom and under what precise conditions it was gathered. They will instead redo the study.

When less is more

The point has been made that the antidote to the dark data morass is not simply to expose it all.1 At the other extreme lies a “hoarding” phenomenon, wherein data is retained regardless of quality or significance, “just in case.” So, our goal, as Culkin describes, is not simply to find (and find value in) what is stored, but also to store less dark data. Don’t assume that because storage is cheap, it’s best to retain everything. After all, the fact that much of the data persisted unexposed may itself speak to its limited usefulness.

Once data has been exposed, one particular area of focus is the move toward providing contextually rich data. This is critical. Legacy methods of recording and collating data are prone to error and, further, fail to provide the full context in which the data was captured.

Sometimes called data provenance, these conditions frequently dictate the reusability or applicability of data for interpretation. Instrumentation, lab location and conditions, as well as materials used, are some of the many factors that should be considered throughout the drug development lifecycle.

Frequently, sample origin, transport conditions and custody are also essential. Capturing this information relies on an advanced informatics infrastructure and significant forethought in system design.

Three strategies can help bring dark data under stewardship:

  • the adoption of a data maturation model for existing data
  • examining integration approaches that couple data deposition with data acquisition
  • committing to a scientific data strategy that genuinely transforms an organisation to a “data first” culture.

Recognising data maturity

Classifying data according to its level of maturity can help to set an organisation on the path to successful data stewardship; such an approach permits the assignment of resources, effort and priority to align with broader company goals.

For instance, data may be considered fully dark or sequestered, shared (perhaps on a shared drive or SharePoint), structured (stored in a database) and ultimately standardised (both structured and harmonised with internal or industry standards). Stratifying data this way, and then functionally, perhaps according to most active project, candidate or biological relevance, can constrain the required effort to something manageable.
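The maturity ladder above lends itself to a simple triage exercise. The sketch below uses the article’s four levels; the scoring and the `triage` helper are illustrative assumptions, not a standard model.

```python
from enum import IntEnum

class DataMaturity(IntEnum):
    DARK = 0          # sequestered: local files, email attachments
    SHARED = 1        # visible on a shared drive or SharePoint
    STRUCTURED = 2    # stored in a database with a defined schema
    STANDARDISED = 3  # structured and harmonised with internal/industry standards

def triage(datasets: dict[str, DataMaturity],
           threshold: DataMaturity = DataMaturity.STRUCTURED) -> list[str]:
    """Return dataset names below the target maturity, least mature first."""
    return sorted((name for name, level in datasets.items() if level < threshold),
                  key=lambda name: datasets[name])
```

Combined with a functional cut (most active project or candidate first), such a list constrains the remediation effort to something manageable.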

Exploring integration approaches

The nearer to the moment of acquisition data can be captured, with structure and context, the less likely it is to become sequestered. Vendor assessments must raise integration capabilities early and often in the evaluation of candidates. It is insufficient to accept a promise of a robust API as evidence of viable integration. Sometimes data pipelining tools, database replication or scripting close these gaps. The best solutions eliminate the gap via integration.
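The coupling of deposition with acquisition can be sketched very simply: every reading is written as a structured record the moment it is produced, so it never lands in an unshared spreadsheet or inbox. The function and record shape here are illustrative, not a vendor API.

```python
import json
import pathlib

def acquire_and_deposit(readings, store: pathlib.Path) -> int:
    """Deposit each reading as a structured JSON record at acquisition time.

    Returns the number of records written.
    """
    store.mkdir(parents=True, exist_ok=True)
    for i, value in enumerate(readings):
        record = {"sequence": i, "value": value}
        (store / f"reading_{i:04d}.json").write_text(json.dumps(record))
    return len(list(store.glob("*.json")))
```

The design point is that structure is imposed at write time; there is no later, optional "curation" step for the data to skip.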

Embracing a data first cultural shift

It is increasingly understood that an organisation's top asset, after its talent, is its data. And yet it is rarely treated as such. A data first strategy acknowledges this priority, socialises it and provides the tools that enable success, software or otherwise.

The FAIR data principles, suggesting that data be findable, accessible, interoperable and reusable, provide broadly accepted guidelines under which such a programme may be implemented, stressing identifier quality, open protocols, controlled vocabularies and full provenance.

Such a programme succeeds when it is both visibly endorsed by senior leadership and prioritised at the bench. Invite both champions and sceptics to participate in the process! They will provide some of your most valuable input.

Like introns, the unexpressed portions of the human genome, some dark data can be expected to hold great meaning and significance. It has indisputably been acquired, and maintained, at some cost, and no biological process persists at a net loss.

By revealing this previously hidden information, optimal conditions and processes can be replicated, reducing workload and increasing efficiency. Error detection, for example, becomes much easier when it is possible to pinpoint where and when a certain context changed. By rolling back to this point, it is possible to pick up development from a later stage, rather than starting the entire process again. In essence, the dark data has shed light on a new way forward.


  1. http://tiny.cc/otf38y.