Scalable methods to query data heterogeneity
- A book chapter presents the main challenges of integrating and reasoning with life science data, and surveys how Semantic Web technologies are a relevant framework for addressing these issues. 31
- During Cécile Beust’s internship, we converted the Comparative Toxicogenomics Database (CTD) into the BioPAX format. We expect to use the result as an input for Cadbiom in order to build large-scale biological dynamic networks based on guarded transitions. The generated models could then be analyzed with reachability queries in order to identify causal signatures of environmental exposures associated with the occurrence of chronic liver diseases. 42
- Many studies focus on phenotype observation of species of interest, with additional data on experimental conditions and on metagenomics for biodiversity. We designed a data schema that (1) avoids the unnecessary duplication of data engineering for each study, (2) provides a common repository of queries and (3) in the long run supports combining data from multiple studies. We deployed this data model using AskOmics. We validated our approach by integrating the data from a previous article and reproducing the analysis. This work will serve as a foundation for our future contribution to the DeepImpact project. 25, 48
Improving reusability along the data life cycle: a Regulatory Circuits Case Study [ O. Dameron, X. Garnier, M. Louarn, A. Siegel] 16
- The data of many life science studies are provided in specific, non-standard formats. This hampers the reuse of the studies’ data in other pipelines, the reuse of the pipelines’ results in other studies, and the enrichment of the data with additional information. We designed a modular RDF representation of the Regulatory Circuits data, the sample-specific and the tissue-specific networks, and the corresponding metadata. The result is available on Zenodo 14. It supports biologically relevant SPARQL queries. It allows easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates their reuse in other studies.
- Information on protein-protein interactions is collected in numerous primary databases. Several meta-databases aggregate primary databases to provide more exhaustive datasets. Redundancy occurs in meta-databases when some publications reporting protein-protein interactions have been annotated with different precision levels by multiple primary databases. We proposed a precise definition of explicit and implicit redundancy, and showed that both can be easily detected using Semantic Web technologies. We applied this process to a dataset from the APID meta-database and showed that while explicit redundancies were detected by the APID aggregation process, about 15% of APID entries are implicitly redundant and should not be taken into account when presenting confidence-related metrics. Finally, we built a “reproducible interactome” with interactions that have been reproduced by multiple methods or publications. The size of the reproducible interactome was drastically impacted by removing redundancies for both yeast (-59%) and human (-56%), and we showed that this is largely due to implicit redundancies.
- This work was also a part of the habilitation thesis “From homogeneous data to heterogeneous data in systems biology” defended by Emmanuelle Becker 33
- Molecular complexes play a major role in the regulation of biological pathways. The Biological Pathway Exchange format (BioPAX) facilitates the integration of data sources describing interactions, some of which involve complexes. The BioPAX specification explicitly forbids a complex from having a component that is itself a complex. However, we observed that the well-curated Reactome pathway database contains such recursive complexes of complexes. We proposed reproducible and semantically rich SPARQL queries for identifying and fixing invalid complexes in BioPAX databases, and evaluated the consequences of fixing these non-conformities in the Reactome database 27. Overall, this method improved the conformity and the automated analysis of the graph by repairing the topology of the complexes in the graph. This will make it possible to apply further reasoning methods to more consistent data.
- A large quantity of experimental data can be generated on tissues and cells by using complementary high-throughput techniques like transcriptomics, proteomics and metabolomics, as well as by targeted analyses for specific molecules. This results in a large amount of multimodal data. Each modality can be statistically analyzed to produce lists of differentially expressed molecules between experimental conditions. We hypothesized that considering the different levels of omics as a whole will help understand biological systems, and introduced a methodology to map the results of high-throughput transcriptomic and high-throughput metabolomic data onto a graph representing metabolism (interactions and their regulation) 26. This graph highlights the links between small molecules and proteins, and thus might be the appropriate system to integrate both metabolomic and transcriptomic data. This work opens new perspectives to integrate proteomic/transcriptomic and metabolomic data simultaneously, and to find networks between these molecules or potential common upstream regulators.
- This work was also a part of the habilitation thesis “From homogeneous data to heterogeneous data in systems biology” defended by Emmanuelle Becker 33
- Although the BioPAX standard has been widely adopted by the community to describe biological pathways, no computational method is capable of studying the dynamics of the networks described in the large-scale BioPAX resources. To solve this issue, our Cadbiom framework was designed to automatically transcribe the biological systems knowledge of large-scale BioPAX networks into discrete models. The framework then identifies the trajectories that explain a biological phenotype (e.g., all the biomolecules that are activated to induce the expression of a gene) 22.
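The repair of recursive complexes described above for BioPAX data (complexes whose components include other complexes) can be illustrated with a minimal sketch. The work itself relies on SPARQL queries over the BioPAX graph; the Python below is only an illustration of the flattening idea on a plain dictionary, and all names (`flatten_complex`, `invalid_complexes`, `repair`, the `EXAMPLE` data) are hypothetical. It also ignores stoichiometry, which an actual repair would have to handle.

```python
from typing import Dict, List, Optional, Set


def flatten_complex(
    complex_id: str,
    components: Dict[str, List[str]],
    _path: Optional[Set[str]] = None,
) -> List[str]:
    """Return the non-complex (leaf) components of a complex, in first-seen order.

    `components` maps each complex identifier to its direct components.
    Any identifier that is not a key of `components` is treated as a leaf
    (protein, small molecule, ...). This is an assumption of the sketch.
    """
    path = set() if _path is None else _path
    if complex_id in path:
        raise ValueError(f"cyclic complex definition at {complex_id!r}")
    path.add(complex_id)
    leaves: List[str] = []
    for part in components[complex_id]:
        if part in components:
            # Nested complex: replace it by its own leaf components.
            sub = flatten_complex(part, components, path)
        else:
            sub = [part]
        for leaf in sub:
            if leaf not in leaves:
                leaves.append(leaf)
    path.remove(complex_id)
    return leaves


def invalid_complexes(components: Dict[str, List[str]]) -> List[str]:
    """Complexes that directly contain at least one other complex."""
    return sorted(
        cid for cid, parts in components.items()
        if any(part in components for part in parts)
    )


def repair(components: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy in which every complex only has leaf components."""
    return {cid: flatten_complex(cid, components) for cid in components}


# Hypothetical toy data: ComplexA contains ComplexB, which contains ComplexC.
EXAMPLE = {
    "ComplexA": ["ProteinX", "ComplexB"],
    "ComplexB": ["ProteinY", "ComplexC"],
    "ComplexC": ["ProteinZ", "ProteinX"],
}

if __name__ == "__main__":
    print("invalid:", invalid_complexes(EXAMPLE))
    print("repaired:", repair(EXAMPLE))
```

On the toy data, `ComplexA` and `ComplexB` are reported as invalid, and after `repair` no complex contains another complex. Whether nested complexes should be flattened in place, as here, or kept as separate entities is a modelling choice that this sketch does not settle.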
Addressing barriers in comprehensiveness, accessibility, reusability, interoperability and reproducibility of computational models in systems biology. [ A. Siegel] 18
- Computational models are often employed in systems biology to study the dynamic behaviours of complex systems. With the rise in the number of computational models, finding ways to improve the reusability of these models and their ability to reproduce virtual experiments becomes critical. Correct and effective model annotation in community-supported and standardised formats is necessary for this improvement. Here, we present recent efforts toward a common framework for annotated, accessible, reproducible and interoperable computational models in biology, and discuss key challenges of the field.
Metabolism: from protein sequences to systems ecology
Modeling proteins with crossing dependencies [ F. Coste] 47
- In collaboration with H. Talibart and M. Carpentier (ISYEB, Muséum national d’Histoire naturelle), we proposed a new Potts model inference method that is considerably faster, which makes it possible to represent and align deeper and larger alignments, and that can use pseudocounts to improve the robustness of the inference with respect to sampling variations. This method has been implemented in the PP suite 47.
Deep attention networks for enzyme class predictions [ N. Buton, F. Coste, Y. Le Cunff] 43
- We studied the benefit of Transformer deep neural networks for the functional annotation of sequences by focusing on the prediction of enzymatic classes. Our EnzBert transformer models, trained to predict enzyme commission (EC) numbers by specialization of a protein language model, were able to significantly outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. We also showed that the attention of Transformers provides an interesting built-in mechanism for the interpretability of these predictions, by proposing a simple aggregation of the attention maps which was on par with, or better than, other classical interpretability methods for predicting the enzymatic sites of enzymes 43.
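The text above does not specify how the attention maps are aggregated. As a purely illustrative sketch, one simple choice is assumed here: average, over all layers and heads, the attention that each sequence position receives, and rank positions by that score. The function names and the toy matrices are hypothetical and do not describe the EnzBert implementation.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def aggregate_attention(maps: Sequence[Matrix]) -> List[float]:
    """Mean attention received by each position, over all maps and queries.

    `maps` holds one L x L matrix per (layer, head) pair, where
    maps[k][i][j] is the attention paid by query position i to key position j.
    """
    if not maps:
        raise ValueError("at least one attention map is required")
    length = len(maps[0])
    scores = [0.0] * length
    for matrix in maps:
        if len(matrix) != length or any(len(row) != length for row in matrix):
            raise ValueError("all attention maps must be L x L with the same L")
        for row in matrix:
            for j, weight in enumerate(row):
                scores[j] += weight
    # Normalize by the number of (map, query) pairs that were summed.
    norm = len(maps) * length
    return [score / norm for score in scores]


def top_positions(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k positions with the highest aggregated attention."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Two hypothetical 2 x 2 attention maps (rows sum to 1).
    toy_maps = [
        [[0.1, 0.9], [0.2, 0.8]],
        [[0.5, 0.5], [0.3, 0.7]],
    ]
    scores = aggregate_attention(toy_maps)
    print(scores, top_positions(scores, 1))
```

Positions with the highest aggregated score would then be the candidates to compare against annotated enzymatic sites. Other aggregations (last layer only, maximum over heads) are equally plausible.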
Combining knowledge-based and sequence comparison approaches to elucidate metabolic functions 34
- The taxonomic diversity within an environmental sample is most often identified through targeted marker gene sequencing, or amplicon sequencing. We designed a new method that estimates the metabolic capacities of a wild organism from the taxonomy estimated for its amplicon sequence. The method selects taxonomically close annotated genomes in UniProt, then estimates clusters of shared enzymes to identify the core proteome of the taxon. The core proteome can be considered as a proxy for the metabolic capacities of the wild organism. Unlike other approaches in this field, our method takes taxonomic assignments as inputs, not exclusively 16S rRNA amplicons, and it outputs a metabolic network instead of a metabolic profile. The method, implemented in the Esmecata tool, and its application to a biogas reactor are described in Arnaud Belcour’s PhD thesis 34.
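The core-proteome step of this method can be sketched as follows: given, for each reference genome of a taxon, the set of protein clusters it contains, keep the clusters found in at least a given fraction of the genomes. This is a minimal sketch under stated assumptions: the clustering itself is taken as given, and the `min_fraction` threshold, the function name and the toy identifiers are hypothetical, not Esmecata's actual parameters.

```python
from typing import Dict, List, Set


def core_clusters(
    genome_clusters: Dict[str, Set[str]],
    min_fraction: float = 0.95,
) -> List[str]:
    """Clusters present in at least `min_fraction` of the reference genomes.

    `genome_clusters` maps a genome identifier to the set of protein
    cluster identifiers observed in that genome.
    """
    if not genome_clusters:
        raise ValueError("at least one reference genome is required")
    if not 0.0 < min_fraction <= 1.0:
        raise ValueError("min_fraction must be in (0, 1]")
    counts: Dict[str, int] = {}
    for clusters in genome_clusters.values():
        for cluster in clusters:
            counts[cluster] = counts.get(cluster, 0) + 1
    n_genomes = len(genome_clusters)
    return sorted(
        cluster for cluster, count in counts.items()
        if count >= min_fraction * n_genomes
    )


if __name__ == "__main__":
    # Hypothetical taxon with three annotated reference genomes.
    taxon = {
        "genome1": {"c1", "c2", "c3"},
        "genome2": {"c1", "c2"},
        "genome3": {"c1", "c3", "c4"},
    }
    print(core_clusters(taxon, min_fraction=1.0))  # strict core
    print(core_clusters(taxon, min_fraction=0.6))  # relaxed core
```

A strict threshold keeps only clusters shared by every genome, while a relaxed one tolerates incomplete annotations at the cost of including functions that may be absent from the wild organism.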
Insights into the potential for mutualistic and harmful host-microbe interactions affecting brown alga freshwater acclimation 13
- Microbes can modify their hosts’ stress tolerance, thus potentially enhancing their ecological range. An example of such interactions is Ectocarpus subulatus, one of the few freshwater-tolerant brown algae. This tolerance is partially due to its (un)cultivated microbiome. The biological station of Roscoff investigated this phenomenon by modifying the microbiome of laboratory-grown E. subulatus using mild antibiotic treatments, which affected its ability to grow in low salinity. Low salinity acclimation of these algal-bacterial associations was then compared, including a study at the metabolic scale using the tool designed in the Dyliss team and reviewed in 32. Gene expression of the host and metabolite profiles were affected almost exclusively in the freshwater-intolerant algal-bacterial communities, and vitamin K synthesis is one possible bacterial service missing specifically in freshwater-intolerant cultures in low salinity. Together, these results provide two promising hypotheses to be examined by future targeted experiments.
Microbial genomics: from cells to genes (and back to cells) 20
- The rumen harbours countless microorganisms that have established a multiplicity of relationships to efficiently digest complex nutrients, which are essential for the host’s health, growth and performance. Recent studies using omics-based techniques have revealed that changes in rumen microbiota are associated with changes in ruminants’ production and health parameters. This review advocates for the benefits of switching from traditional studies of rumen microbes using anaerobic culture-based techniques to molecular techniques applied to microbial cultures. The paper provides a comprehensive review of current advances in molecular methods to identify novel rumen microbes and discusses how culturing and mathematics could enhance our understanding of rumen microbiology.
Input contributions on metabolic outputs: application to human diets 23
- The public availability of human microbiome datasets makes it possible to apply diets to the metabolism of these human microbiomes in order to model the behavior of the organisms. We automated an approach (nAIO) that determines, for each input nutrient in the network, the percentages that are distributed among the different outputs when the organism is forced to evolve in a given diet. The nAIO is computed through the inversion of a large-scale matrix and is combined with linear optimization problems. We applied this method to all known bacterial networks that come from studies of the gut microbiota and are stored in the Virtual Metabolic Human database. The calculation of nAIOs shows that computation times do not depend on the size of the network but rather on the selected diet. The nAIO calculation also shows that for some bacteria the nAIOs are independent of diet. For these bacteria the nAIOs can be used to make predictions that result in a linear relationship between the inputs of the system and its outputs.
Regulation and signaling: detecting complex and discriminant signatures of phenotypes
Learning Boolean controls in regulated metabolic networks: a case-study [ A. Siegel, K. Thuillier] 21
- Many techniques have been developed to infer Boolean regulations from a prior knowledge network and experimental data. Existing methods are able to reverse-engineer Boolean regulations for transcriptional and signaling networks, but they fail to infer regulations that control metabolic networks. We present a novel approach to infer Boolean rules for metabolic regulation from time series data and a prior knowledge network. Our method is based on a combination of answer set programming and linear programming and generates candidate Boolean regulations that can reproduce the given data when coupled to the metabolic network. We evaluated our approach on a core regulated metabolic network and show how the quality of the predictions depends on the available kinetic, fluxomics or transcriptomics time series data.
Discrete modeling for integration and analysis of large-scale signaling networks [ A. Siegel, N. Théret] 22
- The computation of sets of biological entities implicated in phenotypes is hampered by the complex nature of controllers acting in competitive or cooperative combinations. The identification of controllers relies on computational methods for dynamical systems, which require the biological information about the interactions to be translated into a formal language. We used the biopax2cadbiom method to create Cadbiom models from three biological pathway databases (KEGG, PID and ACSN). The Cadbiom framework then identifies the trajectories that explain a biological phenotype (e.g., all the biomolecules that are activated to induce the expression of a gene). The comparative analysis of these models highlighted the diversity of molecules in the sets of biological entities that can explain the same phenotype. The application of our framework to the search for biomolecules regulating the epithelial-mesenchymal transition not only confirmed known pathways in the control of epithelial or mesenchymal cell markers but also highlighted new pathways for transient states.
- Hepatic stellate cells produce a wide variety of molecules involved in ECM remodeling, such as adamalysins 84. However, the discovery of new functions for these proteins is limited by experimental approaches that are difficult to implement because of the proteins’ structure and biochemical features. In that context, we developed an original framework combining the identification of small modules in conserved regions independent of known domains with the concepts of phylogenomics (association of conservation and phenotype gained concurrently during evolution). We estimated the phylogenetic history of ADAMTS and ADAMTS-like proteins in nine bilaterian species including human, which suggests the emergence of ADAMTSL and papilin within the ADAMTS. A dataset of 447 protein-protein interactions (PPI) with the 26 ADAMTS-TSL human paralogs was constructed, and we estimated an ancestral scenario for PPI appearances along our bilaterian tree. We found 45 ancestors displaying a co-appearance of conserved module signatures and PPI. We identified convergent appearances of PPI with COMP and CCN2, and we showed that distinct signatures of the ADAMTS7, ADAMTS3 and ADAMTS4 ancestors could be involved in those interactions. Finally, we obtained a signature that is discontinuous along the primary sequence but folds into a contiguous three-dimensional region in the hyalectanase sub-group of ADAMTS, and that is putatively involved in the ACAN and VCAN interactions. The resulting evolutionary model of motif signatures and protein-protein interaction signatures of the ADAMTS family is validated by data from the literature and provides biologists with many new potential functional motifs, freely available on iTOL. Olivier Dennler defended his PhD thesis 35 and an article is under consideration by an international journal.
Creation of predictive functional signaling networks [ M. Bougueon, N. Théret] 30.
The rule-based model approach. A Kappa model for hepatic stellate cells activation by TGFB1 30
- Kappa is a site graph rewriting language. It offers a rule-centric approach, inspired by chemistry, where interaction rules locally modify the state of a system that is defined as a graph of components, connected or not. In this case study, the components are occurrences of hepatic stellate cells in different states, and occurrences of the protein TGFB1. The protein TGFB1 induces different behaviors of hepatic stellate cells, thereby contributing either to tissue repair or to fibrosis. Better understanding the overall behavior of the mechanisms involved in these processes is a key issue for identifying markers and therapeutic targets likely to promote the resolution of fibrosis at the expense of its progression.
Characterizing gene structure with grammatical languages and conservation information [ C. Belleannée, S. Blanquart, O. Dameron, N. Guillaudeux] 12
- Based on syntactic models and graph formalisms, we compared the splicing structures of 2167 triplets of orthologous genes shared by human, mouse and dog. This resulted in the prediction of 6861 new coding transcripts (i.e. putative proteins) in these species, mainly for dog, an emergent model species. Every predicted transcript shares an identical exonic structure with a coding transcript already known in another species, hence defining them as orthologs. Additionally, we identified a set of 253 gene triplets with strictly conserved exonic structures in human, mouse and dog, and thus expressing the same proteome (i.e. the same isoform coding transcripts). These genes express a total of 879 groups of orthologous isoforms, such that in each group, the same splicing structure is shared by the gene in each of the three species. Although these genes express the same proteome, we showed that the expressed transcriptomes may be different, due to the gene’s propensity to express distinct alternatively transcribed mRNAs encoding the same protein 12.
- In the context of our participation in the IPL NeuroMarker project, a joint study with Institut du Cerveau (Inserm/CNRS/Sorbonne Université) at the Pitié-Salpêtrière hospital and the Aramis team (Inria Paris) evidenced a signature of four microRNAs in presymptomatic and symptomatic subjects with frontotemporal dementia and amyotrophic lateral sclerosis associated with a C9orf72 mutation 73. To critically assess the discriminative power of this signature and compare it with other available signatures, this study was followed by a validation study that assessed the discriminative power of the different signatures on an independent cohort 15. This large-scale reproducibility analysis of previously identified microRNA signatures for frontotemporal degeneration and amyotrophic lateral sclerosis mostly confirmed the signature identified in 73 as discriminative for patients and presymptomatic carriers of the C9orf72 mutation.
- In recent years, many approaches have been developed for modeling disease progression from data, most of them requiring longitudinal data. Among the proposed models, only the event-based models (EBMs) allow inferring a disease progression score from cross-sectional data. However, EBMs have been applied to a relatively small number of variables (typically 10-50) and it is not known whether they perform well in higher dimensions. In the context of the IPL NeuroMarker, we proposed a new model using cross-sectional multimodal data based on variational autoencoders, in three steps: (1) estimation of the latent space; (2) definition, in the latent space, of a curve describing the progression of the disease; and (3) estimation of the progression score of an individual by orthogonal projection of its coordinates in the latent space onto the main trajectory curve. When applied to frontotemporal degeneration and amyotrophic lateral sclerosis, the proposed method was more efficient than EBMs 14, 24, 28.
- Virgilio Kmetzsch also defended his PhD thesis “Multimodal analysis of neuroimaging and transcriptomic data in genetic frontotemporal dementia” 36, and this work was also part of the habilitation thesis “From homogeneous data to heterogeneous data in systems biology” defended by Emmanuelle Becker 33.
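Step (3) of the progression model described above, the orthogonal projection of an individual's latent coordinates onto the trajectory curve, can be sketched as follows. This is a minimal illustration and not the published implementation: it assumes that the curve is approximated by a polyline in the latent space and that the score is the normalized arc length of the closest projected point. The function name and the toy coordinates are hypothetical, and the variational autoencoder of steps (1) and (2) is not included.

```python
import math
from typing import Sequence

Point = Sequence[float]


def progression_score(point: Point, curve: Sequence[Point]) -> float:
    """Normalized arc-length position (0 to 1) of the projection of `point`.

    `curve` is a polyline (ordered list of latent-space points) standing in
    for the disease trajectory. The point is projected orthogonally onto
    every segment (clamped to the segment ends) and the closest projection
    is kept.
    """
    if len(curve) < 2:
        raise ValueError("the curve needs at least two points")
    seg_lengths = [math.dist(a, b) for a, b in zip(curve, curve[1:])]
    total = sum(seg_lengths)
    if total == 0:
        raise ValueError("the curve has zero length")
    best_dist = math.inf
    best_arc = 0.0
    travelled = 0.0
    for a, b, length in zip(curve, curve[1:], seg_lengths):
        if length > 0:
            # Parameter of the orthogonal projection on the line (a, b),
            # clamped to [0, 1] so that the foot stays on the segment.
            dot = sum((p - ai) * (bi - ai) for p, ai, bi in zip(point, a, b))
            t = max(0.0, min(1.0, dot / length ** 2))
        else:
            t = 0.0
        foot = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        dist = math.dist(point, foot)
        if dist < best_dist:
            best_dist = dist
            best_arc = travelled + t * length
        travelled += length
    return best_arc / total


if __name__ == "__main__":
    # Hypothetical 2-D latent space and an L-shaped trajectory of length 2.
    trajectory = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    print(progression_score((0.5, 0.3), trajectory))
    print(progression_score((1.2, 0.5), trajectory))
```

A score near 0 places the individual at the start of the trajectory and a score near 1 at its end. A finer polyline, or a parametric curve with a numerical minimization, would be needed for a smooth trajectory.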
Signatures of mutants of the enzyme EXOSC10/Rrp6 [E. Becker] 19
- The conserved 3′-5′ exoribonuclease EXOSC10/Rrp6 is required for gametogenesis, brain development, erythropoiesis and blood cell enhancer function. The human ortholog is essential for mitosis in cultured cancer cells. Little is known, however, about the role of Exosc10 during embryo development and organogenesis. The transcriptional landscape of EXOSC10 mutants was investigated to explain its essentiality for the eight-cell embryo/morula transition 19.
Signature of Crohn’s disease symptoms from microbiota profiles [Y. Le Cunff] 11
- Standard approaches to describe patients’ microbiota in Crohn’s disease (CD) consist in comparing them with the microbiota of control individuals. In this work, we instead focused on distinguishing subgroups of microbiota profiles within a novel CD cohort studied at the Rennes C.H.U. We used unsupervised clustering techniques to highlight the existence of three microbiota subprofiles, each linked with a different severity of symptoms. Moreover, we showed that these groups are largely stable over time. Finally, using differential abundance analysis, we identified key species which could act as signatures for CD evolution over time. 11