Scalable methods to query data heterogeneity
- The use of ontologies and standardized formats facilitates interoperability but does not address all the challenges of integrating heterogeneous data types and sources. The Biological Pathway Exchange format (BioPAX) facilitates the integration of data sources describing interactions that involve molecular complexes, which play a major role in the regulation of biological pathways. However, databases using the BioPAX format frequently contain redundant molecular complexes with identical properties but distinct identifiers. Furthermore, these databases often contain invalid complexes whose components are themselves complexes, resulting in a recursive representation that differs from the flat representation required by the format specification. Such non-conformity and redundancy alter the graph, which affects subsequent analyses. We proposed reproducible and semantically rich SPARQL queries for identifying and fixing non-conformity and redundancy in BioPAX databases, and evaluated the consequences of fixing them in the Reactome database. First, we showed that they introduce genericity problems, as redundant representations mask implicit redundancies. Importantly, we also measured how these non-conformities lead to structures that artificially modify the topology of the graph, increasing the path length between graph nodes and compromising the analysis of the interaction network 17.
- This corrected graph can further be used as input to link transcriptomic and metabolomic data measured to understand the inter-individual variability of feed efficiency in growing pigs. We describe the identification of modules of co-expressed genes associated with feed efficiency, and their connections with metabolite and fatty acid concentrations, in 16. Our study establishes a link between transcriptome and metabolome data, revealing connections between immunity and fatty acid composition 18. These transcriptomic and metabolic data were further mapped onto the corrected Reactome BioPAX graph, revealing that co-expressed transcripts can be connected in the corrected BioPAX graph 21.
- These results formed the basis of the PhD thesis of Camille Juigné, “Integration and analysis of heterogeneous biological data through multilayer graph exploitation to gain deeper insights into feed efficiency variations in growing pigs”, co-supervised by the IRISA Dyliss Team and the INRAE Pegase Team (F. Gondret), and defended on December 1st, 2023 21.
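The two defects discussed in the first item above (complexes nested inside complexes, and complexes that duplicate one another) can be illustrated on a toy example. The sketch below is plain Python, not the SPARQL queries of 17. The dictionary layout, the identifiers and the function names are hypothetical, and the comparison only looks at component sets, ignoring stoichiometry, cellular location and the other properties a real BioPAX comparison would have to consider.

```python
from collections import defaultdict

# Toy data: complex identifier -> list of direct components.
# Identifiers are invented for illustration; they are not Reactome entries.
complexes = {
    "C1": ["P1", "P2"],
    "C2": ["C1", "P3"],        # non-conforming: one component is itself a complex
    "C3": ["P2", "P1"],        # same components as C1, different identifier
    "C4": ["P1", "P2", "P3"],  # same components as C2 once C2 is flattened
}

def nested_complexes(complexes):
    """Identifiers of complexes having at least one complex as a direct component."""
    return sorted(cid for cid, comps in complexes.items()
                  if any(c in complexes for c in comps))

def flatten(cid, complexes, path=frozenset()):
    """Set of non-complex components reachable from `cid` (flat representation)."""
    if cid in path:
        raise ValueError(f"cyclic complex definition involving {cid}")
    path = path | {cid}
    leaves = set()
    for comp in complexes[cid]:
        if comp in complexes:
            leaves |= flatten(comp, complexes, path)
        else:
            leaves.add(comp)
    return leaves

def redundant_groups(complexes):
    """Groups of identifiers whose flattened component sets are identical."""
    groups = defaultdict(list)
    for cid in complexes:
        groups[frozenset(flatten(cid, complexes))].append(cid)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

print(nested_complexes(complexes))
print(redundant_groups(complexes))
```

In this toy data the duplication between C2 and C4 only becomes visible after flattening, which is one way to picture how a non-conforming representation can hide a redundancy.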
Identifying meaningful query modules from a collection of SPARQL queries. [ O. Dameron, A. Regnier] 24
- Creating SPARQL queries requires users to acquire a precise understanding of the data schema of the SPARQL endpoint, which is typically tedious. This task can be facilitated by relying on a collection of previous queries that can be adapted or combined to create new ones. We developed a method to identify query modules, i.e. portions of SPARQL queries that are shared among multiple queries, and investigated whether these modules correspond to biologically relevant notions, and how they can be combined to create new queries 24.
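The notion of a query module as a portion shared among several queries can be pictured with a deliberately naive sketch. It is not the method of 24, which is not detailed here. It assumes that each query has already been reduced to a set of triple-pattern strings with consistently named variables; the example predicates, the `shared_modules` function and the `min_patterns` threshold are all hypothetical.

```python
from itertools import combinations

# Toy collection: query name -> set of triple patterns (invented vocabulary).
queries = {
    "q1": {"?g a :Gene", "?g :encodes ?p", "?p :hasFunction ?f"},
    "q2": {"?g a :Gene", "?g :encodes ?p", "?p :locatedIn ?c"},
    "q3": {"?p :locatedIn ?c", "?c rdfs:label ?name"},
}

def shared_modules(queries, min_patterns=2):
    """Map each candidate module (patterns common to a pair of queries)
    to the sorted list of all queries that contain it."""
    modules = {}
    for (_, a), (_, b) in combinations(sorted(queries.items()), 2):
        common = frozenset(a & b)
        if len(common) >= min_patterns:
            modules[common] = sorted(
                name for name, patterns in queries.items() if common <= patterns
            )
    return modules

for module, users in shared_modules(queries).items():
    print(sorted(module), "used by", users)
```

A real implementation would need to compare patterns up to variable renaming and to work on the parsed query algebra rather than on strings; the sketch only conveys what "shared among multiple queries" means.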
Metabolism: from protein sequences to systems ecology
Phylogenetic inference of functional sequence modules in ADAMTS-TSL proteins [ C. Belleannée, S. Blanquart, F. Coste, O. Dennler, N. Théret] 14.
- The vast majority of proteins have a modular multidomain organization. Domains are conserved building blocks of proteins that are widely used to characterize and predict protein functions. However, the organization of multidomain proteins is highly complex, and their biological role is generally not the sum of the functions attributed to each domain. To address this problem, new methods are needed to better identify functional signatures. Here, we developed a framework based on partial local multiple alignments (to find conserved sequence modules) and phylogenetic reconciliation methods (to integrate the evolution of species, genes, sequence modules, and protein-protein interactions). Applying our framework to the search for functional sequence modules in extracellular matrix proteins from the ADAMTS (A Disintegrin-like and Metalloproteinase with ThromboSpondin motif) and ADAMTSL (ADAMTS-like) families highlighted sequence signatures potentially involved in protein-protein interactions 14.
Modeling proteins with crossing dependencies [ F. Coste] 23
- The PP suite 63 makes it possible to align pairs of Potts models, for a comparison of protein sequences that takes into account the coevolution of residues in the sequences. Yet, because of their intrinsic overparametrization, and because of sampling biases that could not be easily handled, the inferred Potts models to be aligned are often not comparable. We studied these issues, first by searching for a relevant canonical form of Potts models to remove unwanted parameter divergence, and then by exploring how explicit covariance-based methods that are able to overcome these sampling issues could be adapted to directly infer comparable models 23.
- We studied the value of Transformer deep neural networks for the functional annotation of sequences by focusing on the prediction of enzymatic classes. Our EnzBert transformer models, trained to predict enzyme commission (EC) numbers by specialization of a protein language model, were able to significantly outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. We also showed that the attention of Transformers provides an interesting built-in mechanism for the interpretability of these predictions, by proposing a simple aggregation of the attention maps which was on par with, or better than, other classical interpretability methods for predicting the enzymatic sites of enzymes 12. This work was part of Nicolas Buton’s Ph.D., defended in October 2023 20, and was presented by Nicolas at JOBIM 2022 37.
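To give an idea of what aggregating attention maps into per-residue scores can look like, here is a minimal sketch. The aggregation actually used in 12 is not specified in this report, so the choice made here (uniform average over all layers and heads, then mean attention received by each position) is an assumption, as are the function names and the toy numbers.

```python
def residue_scores(attention):
    """Per-position score from attention[layer][head][i][j], the weight
    that query position i puts on key position j.
    Score of j = attention received by j, averaged over maps and queries."""
    n_maps = sum(len(layer) for layer in attention)
    length = len(attention[0][0])
    scores = [0.0] * length
    for layer in attention:
        for head in layer:
            for row in head:
                for j, weight in enumerate(row):
                    scores[j] += weight
    return [s / (n_maps * length) for s in scores]

def top_positions(scores, k):
    """Indices of the k highest-scoring positions (candidate sites)."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy input: 1 layer, 2 heads, sequence of length 3 (each row sums to 1).
attention = [[
    [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.1, 0.7, 0.2]],
    [[0.3, 0.4, 0.3], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]],
]]

scores = residue_scores(attention)
print(scores, top_positions(scores, 1))
```

With real models the maps would come from the Transformer's attention outputs, and the highest-scoring positions would be compared against annotated enzymatic sites.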
Comparison of metabolic networks based on heterogeneous annotation sets [ A. Belcour, S. Blanquart, J. Got, P. Hamon-Giraud, V. Mataigne, A. Siegel] 11
- Comparative analysis of genome-scale metabolic networks (GSMNs) may yield important information on the biology, evolution, and adaptation of species. However, it is impeded by the high heterogeneity of the quality and completeness of structural and functional genome annotations, which may bias the results of such comparisons. To address this issue, we developed AuCoMe, a pipeline to automatically reconstruct homogeneous GSMNs from a heterogeneous set of annotated genomes without discarding available manual annotations. We tested AuCoMe with three data sets, one bacterial, one fungal, and one algal, and showed that it successfully reduces technical biases while capturing the metabolic specificities of each organism. Our results also point out shared and divergent metabolic traits among evolutionarily distant algae, underlining the potential of AuCoMe to accelerate the broad exploration of metabolic evolution across the tree of life.
Dynamic genome-based metabolic modeling of the predominant cellulolytic rumen bacterium Fibrobacter succinogenes S85 [ J. Got, A. Siegel] 15
- Fibrobacter succinogenes is a cellulolytic bacterium that plays an essential role in the degradation of plant fibers in the rumen ecosystem. It converts cellulose polymers into intracellular glycogen and the fermentation metabolites succinate, acetate, and formate. We developed dynamic models of F. succinogenes S85 metabolism on glucose, cellobiose, and cellulose on the basis of a network reconstruction done with the automatic reconstruction of metabolic model workspace. The accuracy of the models was acceptable in simulating F. succinogenes carbohydrate metabolism with an average coefficient of variation of the root mean squared error of 19%. The resulting models are useful resources for investigating the metabolic capabilities of F. succinogenes S85, including the dynamics of metabolite production. Such an approach is a key step toward the integration of omics microbial information into predictive models of rumen metabolism.
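For readers unfamiliar with the accuracy measure quoted above, the coefficient of variation of the root mean squared error is commonly computed as the RMSE divided by the mean of the observations, expressed as a percentage. The sketch below uses that common definition; whether 15 normalizes in exactly this way is an assumption, and the numbers are invented, not taken from the study.

```python
from math import sqrt

def cv_rmse(observed, predicted):
    """Coefficient of variation of the RMSE, in percent:
    100 * RMSE(observed, predicted) / mean(observed)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and of equal length")
    n = len(observed)
    rmse = sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    mean_observed = sum(observed) / n
    return 100.0 * rmse / mean_observed

# Invented time series of a metabolite concentration (arbitrary units).
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(round(cv_rmse(observed, predicted), 2))
```

Because the error is scaled by the mean observation, the measure allows variables with different units or magnitudes, such as different metabolites, to be averaged together.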
Regulation and signaling: detecting complex and discriminant signatures of phenotypes
A rule-based multiscale model of hepatic stellate cell plasticity 19
- Hepatic stellate cells (HSCs) are the source of extracellular matrix (ECM), whose overproduction leads to fibrosis, a condition that impairs liver functions in all chronic liver diseases. Understanding the dynamics of HSCs is required to develop new therapeutic approaches. In this work, we used the Kappa graph rewriting language to develop the first rule-based model describing the dynamics of HSCs during liver fibrosis and its reversion. Kappa offers a rule-centric approach where interaction rules locally modify the state of a system that is defined as a graph of components, connected or not. HSCs are modeled as agents presenting seven cell physiological states and interacting with TGFB1 molecules that regulate HSC activation and the secretion of type I collagen, the main component of the ECM. We introduced counters to scale the intermediate steps between cell states, and tokens to describe TGFB1 and type I collagen quantities, thereby greatly reducing the computational cost. Simulation studies revealed the critical role of the HSC inactivation process during fibrosis progression and reversion. We further demonstrated the model’s sensitivity to TGFB1 parameters, suggesting its adaptability to a variety of pathophysiological conditions in which the TGFB1 release associated with the inflammatory response differs. Using new experimental data from a mouse model of CCl4-induced liver fibrosis, we validated the predicted ECM dynamics. Our model further predicts the accumulation of inactivated HSCs (iHSCs) during chronic liver disease. By analyzing RNA sequencing data from patients with non-alcoholic steatohepatitis (NASH) associated with liver fibrosis, we confirmed this accumulation, identifying iHSCs as novel markers of fibrosis progression. Overall, our study provides the first model of HSC dynamics in chronic liver disease that can be used to explore the regulatory role of iHSCs in liver homeostasis, and the model can also be generalized to fibroblasts during repair and fibrosis in other tissues.