Revolutionising toxicogenomics with big data

Genestack CEO, Dr Misha Kapushesky, explains how metadata can unlock pharmaceutical assay results for deeper analysis

A new dose response application is making it possible for toxicologists to investigate how genes respond to particular chemical compounds at different doses, and to perform benchmark dose analyses at the gene and pathway levels. This is creating the opportunity to model the potential adverse effects of chemical compounds without the need for animal testing.

The increasing availability and lower cost of genomics and other omics data mean that data generation is no longer the limiting step. The main bottleneck for computational toxicologists is the lack of tools that allow them to find relevant data within their own organisations and to put private results into a wider context using emerging public data sets. Driving this requirement is the increasing adoption of adverse outcome pathways (AOPs) for risk assessment (Figure 1).1 An AOP is a conceptual framework that can be used to link in vitro assay results to whole animal effects in a pathway context.

Determining mode of action

Genomics provides a good proxy for measuring the impact on the whole organism. By taking a cell line and exposing it to a chemical, you can measure gene expression and use this as a marker for dose response. By looking at different doses of chemicals and at how they influence gene expression, it is possible to ask questions such as: if we know the mode of action of compound A and we can see that compound B elicits the same response, can we conclude that it has a similar mode of action? If compounds A and B have a similar chemical structure, can we predict the same response? These are some of the big questions in toxicology. To start to explore them, there needs to be a better mechanism to organise knowledge.

Improved searches will unlock the data

This is where the challenge emerges. In recent years, the toxicology departments of large consumer product companies have acquired a wealth of data to investigate the modes of action of chemical compounds and identify off-target effects. These data are diverse and may include transcriptomics, proteomics, methylation and other assay data; they are also stored in different formats and in different repositories.

In addition, these data have the potential to help toxicologists to understand toxicity events across the scales of biological organisation, but it is difficult for scientists without a computer science background to find the relevant data themselves and perform suitable analyses. If the infrastructure for data management were improved and tools were available to support the analysis, this would significantly increase the efficiency of the process and allow many more chemicals to be screened.

I first became aware of these types of data management issues when I joined the European Bioinformatics Institute in 2002. It was clear then that functional genomics scientists lacked the structure and tools required to do their work. So, I set about creating an infrastructure that would support the team. This included building a large expression data repository that would support various queries and interface with other data sources and applications. These types of tools and frameworks have since been developed by Genestack to support pharma and other organisations that have large, complex data sets.

One of these projects has been to support computational toxicologists at one of the largest consumer goods companies in the world in developing a secure cloud-based environment to store, analyse and browse toxicogenomic data. Users are now able to upload data to one centralised location and browse through hundreds of thousands of public, private and shared data sets.

The game-changing technology that has made this possible is a powerful method of metadata management that can be used to describe data of all types: projects, experiments, studies or individual items such as a chemical structure or dose. The indexing is achieved by applying descriptors to the different types of data. If it is a sequencing assay, for example, you can define its type, whether it is array or VCF data, its source (such as a private or public repository) and its attributes.
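To make the idea of descriptors concrete, here is a minimal sketch in Python. It is an illustration only, not the Genestack implementation: the record class, the descriptor keys (assay_type, data_format, source, compound, dose_uM) and the study names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A data set plus its metadata descriptors (hypothetical schema)."""
    name: str
    descriptors: dict = field(default_factory=dict)


def search(records, **criteria):
    """Return the records whose descriptors match every given key/value pair."""
    return [
        record for record in records
        if all(record.descriptors.get(key) == value
               for key, value in criteria.items())
    ]


# Illustrative records only; names and values are made up.
records = [
    DatasetRecord("study-001", {"assay_type": "transcriptomics",
                                "data_format": "array",
                                "source": "private",
                                "compound": "compound A",
                                "dose_uM": 10.0}),
    DatasetRecord("study-002", {"assay_type": "transcriptomics",
                                "data_format": "array",
                                "source": "public",
                                "compound": "compound B",
                                "dose_uM": 1.0}),
    DatasetRecord("study-003", {"assay_type": "sequencing",
                                "data_format": "VCF",
                                "source": "public"}),
]

# Find every public transcriptomics data set, whatever its compound or dose.
public_transcriptomics = search(records,
                                assay_type="transcriptomics",
                                source="public")
```

Because every data set carries the same kind of descriptors, private and public records can be queried together with one call, which is the point of indexing by metadata rather than by file location.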

The other element of the technology is the use of ontologies: controlled vocabularies based on curated lists of agreed-upon terms to describe genomic elements, processes and interactions. By using an ontology, all synonyms of a term, such as Homo sapiens, H. sapiens, Homo sapiens sapiens or human, can automatically be taken into account, and the way the data are described stays consistent.
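As a rough sketch of how synonym handling might work, the Python below maps free-text terms to one canonical term. The synonym table here is a tiny hand-written stand-in for a real curated ontology, and the function names are hypothetical.

```python
# Hand-written stand-in for a curated ontology: canonical term -> synonyms.
SYNONYMS = {
    "Homo sapiens": ["H. sapiens", "Homo sapiens sapiens", "human"],
}


def build_lookup(synonyms):
    """Build a case-insensitive map from every known term to its canonical form."""
    lookup = {}
    for canonical, alternatives in synonyms.items():
        for term in [canonical, *alternatives]:
            lookup[term.casefold()] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def normalise(term):
    """Return the canonical term for a free-text label, or None if it is unknown."""
    return LOOKUP.get(term.strip().casefold())
```

With this in place, a search for "human" and a search for "H. sapiens" resolve to the same canonical term, so both return the same data sets. Returning None for unknown terms lets a curator review them rather than silently guessing.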

Application to toxicology

The Genestack platform now allows toxicologists at the client company to perform data analysis to investigate how genes and pathways respond to particular chemicals, and to conduct benchmark dose analyses. For example, this new dose response application could be used to identify the concentration of a compound above which a specific gene or pathway starts to show a significant response.
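The idea of a response threshold can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used on the platform: it assumes the benchmark response is the control mean plus one control standard deviation, handles only increases in expression, and interpolates linearly between tested doses, whereas benchmark dose software typically fits parametric dose-response models. All names and numbers are hypothetical.

```python
from statistics import mean, stdev


def benchmark_dose(doses, responses, control, bmr_sd=1.0):
    """Estimate the dose at which the response first reaches the benchmark response.

    doses     -- tested doses in ascending order
    responses -- mean expression of one gene at each dose
    control   -- replicate expression values for the untreated control
    bmr_sd    -- benchmark response, in control standard deviations (assumption)

    Returns None if the response never reaches the benchmark.
    """
    baseline = mean(control)
    threshold = baseline + bmr_sd * stdev(control)
    prev_dose, prev_response = 0.0, baseline
    for dose, response in zip(doses, responses):
        if response >= threshold:
            if response == prev_response:
                return dose
            # Linear interpolation between the two doses that bracket the threshold.
            fraction = (threshold - prev_response) / (response - prev_response)
            return prev_dose + fraction * (dose - prev_dose)
        prev_dose, prev_response = dose, response
    return None


# Made-up example: expression rises clearly only at the highest dose.
control_values = [1.0, 1.2, 0.8]
estimate = benchmark_dose([1.0, 10.0, 100.0], [1.05, 1.1, 2.1], control_values)
```

Running the same calculation for every gene in a pathway, and summarising the per-gene estimates, gives a pathway-level view of where the response starts.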

The scientist can then look for other compounds that exhibit a similar response. These data can come from experiments conducted internally or from publicly available data sets that have been processed using the Genestack technology.
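One simple way to look for compounds with a similar response is to compare expression signatures by correlation. The Python sketch below ranks a small library of signatures against a query using Pearson correlation. It is an assumption about how such a comparison could be done, not a description of the platform's method, and the compound names and values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression signatures."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def rank_similar(query, library):
    """Rank library compounds from most to least similar to the query signature."""
    scores = [(name, pearson(query, signature))
              for name, signature in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented log fold changes for the same four genes, in the same order.
compound_a = [1.0, -0.5, 2.0, 0.0]
library = {
    "compound B": [0.9, -0.4, 1.8, 0.1],    # close to compound A
    "compound C": [-1.0, 0.5, -2.0, 0.0],   # opposite of compound A
}
ranked = rank_similar(compound_a, library)
```

A high positive score suggests a similar response and a strongly negative score an opposing one. Either is only a starting point for asking whether two compounds share a mode of action.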

To validate the pipelines generated on the platform, we have reanalysed two large public toxicogenomic data sets and compared our results with the published analyses. The Connectivity Map data set from the Broad Institute is a functional look-up table of the genome that can be used to determine the cellular effects of a given compound, and the LINCS L1000 data set from the NIH is a comprehensive resource for gene expression changes observed in human cell lines. The results from the Genestack analysis were consistent with the published analyses.

The Holy Grail: toxicology in vitro

A better understanding of how chemicals impact biological activities and how these are associated with adverse outcomes creates the opportunity to revolutionise toxicity testing. This will move it away from a system based on whole animal testing to one primarily based on in vitro methods that evaluate changes in biological processes using cells, cell lines and cellular components.

To achieve this, we need improved data management and tools that empower toxicologists to access, search and analyse data without the need for sophisticated programming skills. The availability of such a centralised data and metadata management system has moved the industry a step closer to this goal.