Baptiste Bauvin, Thibaud Godon, Guillaume Bachelot, Claudia Carpentier, Riikka Huusaari, Maxime Desraspe, Juho Rousu, Caroline Quach-Thanh, and Jacques Corbeil
Extracting a COVID-19 Signature from a Multi-Omic Dataset
We present a multi-omic signature for COVID-19 developed through a comprehensive Quebec initiative that established an extensive dataset of COVID-19 positive and negative patient samples. Moving beyond traditional symptomatic studies that rely on limited descriptors, our research integrates clinical, proteomic, and metabolomic data to classify COVID-19 status using thousands of features. Our multi-view machine learning approach extracts distinctive COVID-19 signatures from multi-omic data. By applying ensemble methods, we developed accurate and interpretable models for high-dimensional data (containing significantly more features than samples), achieving 89% ± 5% balanced accuracy. Through our novel feature relevance methodology, we identified condensed 12- and 50-feature signatures that enhanced classification accuracy by at least 3% compared to the original feature set. This approach successfully extracted and interpreted a robust multi-omic signature characterizing COVID-19-positive individuals from a large, complex dataset, representing a significant advancement in COVID-19 biomarker discovery.
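Balanced accuracy, the metric reported above, is the mean of per-class recall, which keeps a majority class from dominating the score. The following is a minimal Python sketch on hypothetical toy labels (not data from the study); the function name is illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        # Indices of the samples whose true label is c
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 8 negatives and 2 positives, one positive missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(accuracy, balanced)
```

On this toy set, plain accuracy is 9/10 while balanced accuracy is (8/8 + 1/2) / 2, so the missed positive weighs much more heavily in the balanced score.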
Over a decade ago, longitudinal multiomics analysis was pioneered for early disease detection and individually tailored precision health interventions. However, high sample processing costs, expansive multiomics measurements, and complex data analysis have made this approach to precision/personalized medicine impractical. Here we describe, in a case report, a more practical approach that uses fewer measurements, annual sampling, and faster decision making. We also show how this approach offers promise to detect an exceedingly rare and potentially fatal condition before it fully manifests. Specifically, we describe how longitudinal multiomics monitoring (LMOM) helped detect a precancerous pancreatic tumor and led to a successful surgical intervention. The patient, enrolled in an annual blood-based LMOM since 2018, had dramatic changes in the June 2021 and 2022 annual metabolomics and proteomics results that prompted further clinical diagnostic testing for pancreatic cancer. Using abdominal magnetic resonance imaging, a 2.6 cm lesion in the tail of the patient’s pancreas was detected. The tumor fluid from an aspiration biopsy had carcinoembryonic antigen levels 10,000 times the normal level. After the tumor was surgically resected, histopathological findings confirmed it was a precancerous pancreatic tumor. Postoperative omics testing indicated that most metabolite and protein levels returned to the patient’s 2018 levels. This case report illustrates the potential of blood LMOM for precision/personalized medicine, and new ways of thinking about medical innovation for a potentially life-saving early diagnosis of pancreatic cancer. Blood LMOM warrants future programmatic translational research with the goals of precision medicine, and individually tailored cancer diagnoses and treatments.
We introduce MetaboAnalyst version 6.0 as a unified platform for processing, analyzing, and interpreting data from targeted as well as untargeted metabolomics studies using liquid chromatography–mass spectrometry (LC–MS). The two main objectives in developing version 6.0 are to support tandem MS (MS2) data processing and annotation, as well as to support the analysis of data from exposomics studies and related experiments. Key features of MetaboAnalyst 6.0 include: (i) a significantly enhanced Spectra Processing module with support for MS2 data and the asari algorithm; (ii) an MS2 Peak Annotation module based on comprehensive MS2 reference databases with fragment-level annotation; (iii) a new Statistical Analysis module dedicated to handling complex study designs with multiple factors or phenotypic descriptors; (iv) a Causal Analysis module for estimating metabolite–phenotype causal relations based on two-sample Mendelian randomization; and (v) a Dose-Response Analysis module for benchmark dose calculations. In addition, we have also improved MetaboAnalyst’s visualization functions, updated its compound database and metabolite sets, and significantly expanded its pathway analysis support to around 130 species. MetaboAnalyst 6.0 is freely available at https://www.metaboanalyst.ca.
Wang F, Pasin D, Skinnider MA, Liigand J, Kleis JN, Brown D, Oler E, Sajed T, Gautam V, Harrison S, Greiner R, Foster LJ, Dalsgaard PW, Wishart DS. (Anal Chem. 2023 Dec 19;95(50):18326-18334. doi: 10.1021/acs.analchem.3c02413. Epub 2023 Dec 4. PMID: 38048435; PMCID: PMC10733899.)
Deep Learning-Enabled MS/MS Spectrum Prediction Facilitates Automated Identification Of Novel Psychoactive Substances.
Abstract
The market for illicit drugs has been reshaped by the emergence of more than 1100 new psychoactive substances (NPS) over the past decade, posing a major challenge to the forensic and toxicological laboratories tasked with detecting and identifying them. Tandem mass spectrometry (MS/MS) is the primary method used to screen for NPS within seized materials or biological samples. The most contemporary workflows necessitate labor-intensive and expensive MS/MS reference standards, which may not be available for recently emerged NPS on the illicit market. Here, we present NPS-MS, a deep learning method capable of accurately predicting the MS/MS spectra of known and hypothesized NPS from their chemical structures alone. NPS-MS is trained by transfer learning from a generic MS/MS prediction model on a large data set of MS/MS spectra. We show that this approach enables a more accurate identification of NPS from experimentally acquired MS/MS spectra than any existing method. We demonstrate the application of NPS-MS to identify a novel derivative of phencyclidine (PCP) within an unknown powder seized in Denmark without the use of any reference standards. We anticipate that NPS-MS will allow forensic laboratories to identify more rapidly both known and newly emerging NPS. NPS-MS is available as a web server at https://nps-ms.ca/, which provides MS/MS spectra prediction capabilities for given NPS compounds. Additionally, it offers MS/MS spectra identification against a vast database comprising approximately 8.7 million predicted NPS compounds from DarkNPS and 24.5 million predicted ESI-QToF-MS/MS spectra for these compounds.
Knox C, Wilson M, Klinger CM, Franklin M, Oler E, Wilson A, Pon A, Cox J, Chin NEL, Strawbridge SA, Garcia-Patino M, Kruger R, Sivakumaran A, Sanford S, Doshi R, Khetarpal N, Fatokun O, Doucet D, Zubkowski A, Rayat DY, Jackson H, Harford K, Anjum A, Zakir M, Wang F, Tian S, Lee B, Liigand J, Peters H, Wang RQR, Nguyen T, So D, Sharp M, da Silva R, Gabriel C, Scantlebury J, Jasinski M, Ackerman D, Jewison T, Sajed T, Gautam V, Wishart DS. (Nucleic Acids Res. 2024 Jan 5;52(D1):D1265-D1275. doi: 10.1093/nar/gkad976. PMID: 37953279; PMCID: PMC10767804.)
DrugBank 6.0: the DrugBank Knowledgebase for 2024
First released in 2006, DrugBank (https://go.drugbank.com) has grown to become the ‘gold standard’ knowledge resource for drug, drug-target and related pharmaceutical information. DrugBank is widely used across many diverse biomedical research and clinical applications, and averages more than 30 million views/year. Since its last update in 2018, we have been actively enhancing the quantity and quality of the drug data in this knowledgebase. In this latest release (DrugBank 6.0), the number of FDA approved drugs has grown from 2646 to 4563 (a 72% increase), the number of investigational drugs has grown from 3394 to 6231 (a 38% increase), the number of drug-drug interactions increased from 365 984 to 1 413 413 (a 300% increase), and the number of drug-food interactions expanded from 1195 to 2475 (a 200% increase). In addition to this notable expansion in database size, we have added thousands of new, colorful, richly annotated pathways depicting drug mechanisms and drug metabolism. Likewise, existing datasets have been significantly improved and expanded, by adding more information on drug indications, drug-drug interactions, drug-food interactions and many other relevant data types for 11 891 drugs. We have also added experimental and predicted MS/MS spectra, 1D/2D-NMR spectra, CCS (collision cross section), RT (retention time) and RI (retention index) data for 9464 of DrugBank’s 11 710 small molecule drugs. These and other improvements should make DrugBank 6.0 even more useful to a much wider research audience ranging from medicinal chemists to metabolomics specialists to pharmacologists.
Maternal pathological conditions such as infections and chronic diseases, along with unexpected events during labor, can lead to life-threatening perinatal outcomes. These outcomes can have irreversible consequences throughout an individual’s entire life. Urinary metabolomics can provide valuable insights into early physiological adaptations in healthy newborns, as well as metabolic disturbances in premature infants or infants with birth complications. In the present study, we measured 180 metabolites and metabolite ratios in the urine of 13 healthy (hospital-discharged) and 38 critically ill newborns (admitted to the neonatal intensive care unit (NICU)). We used an in-house-developed targeted tandem mass spectrometry (MS/MS)-based metabolomic assay (TMIC Mega) combining liquid chromatography (LC-MS/MS) and flow injection analysis (FIA-MS/MS) to quantitatively analyze up to 26 classes of compounds. Average urinary concentrations (and ranges) for 167 different metabolites from 38 critically ill NICU newborns during their first 24 h of life were determined. Similar sets of urinary values were determined for the 13 healthy newborns. These reference data have been uploaded to the Human Metabolome Database. Urinary concentrations and ranges of 37 metabolites are reported for the first time for newborns. Significant differences were found in the urinary levels of 44 metabolites between healthy newborns and those admitted at the NICU. Metabolites such as acylcarnitines, amino acids and derivatives, biogenic amines, sugars, and organic acids are dysregulated in newborns with bronchopulmonary dysplasia (BPD), asphyxia, or newborns exposed to SARS-CoV-2 during the intrauterine period. Urine can serve as a valuable source of information for understanding metabolic alterations associated with life-threatening perinatal outcomes.
PathBank (https://pathbank.org) and its predecessor database, the Small Molecule Pathway Database (SMPDB), have been providing comprehensive metabolite pathway information for the metabolomics community since 2010. Over the past 14 years, these pathway databases have grown and evolved significantly to meet the needs of the metabolomics community and respond to continuing changes in computing technology. This year’s update, PathBank 2.0, brings a number of important improvements and upgrades that should make the database more useful and more appealing to a larger cross-section of users. In particular, these improvements include: (i) a significant increase in the number of primary or canonical pathways (from 1720 to 6951); (ii) a massive increase in the total number of pathways (from 110 234 to 605 359); (iii) significant improvements to the quality of pathway diagrams and pathway descriptions; (iv) a strong emphasis on drug metabolism and drug mechanism pathways; (v) making most pathway images more slide-compatible and manuscript-compatible; (vi) adding tools to support better pathway filtering and selecting through a more complete pathway taxonomy; (vii) adding pathway analysis tools for visualizing and calculating pathway enrichment. Many other minor improvements and updates to the content, the interface and general performance of the PathBank website have also been made. Overall, we believe these upgrades and updates should greatly enhance PathBank’s ease of use and its potential applications for interpreting metabolomics data.
R Kpanou, P Dallaire, E Rousseau, J Corbeil (BMC bioinformatics 25 (1), 47)
Learning self-supervised molecular representations for drug–drug interaction prediction
Authors: Rogia Kpanou, Patrick Dallaire, Elsa Rousseau, Jacques Corbeil
Publication date: 2024/1/30
Description:
Drug–drug interactions (DDI) are a critical concern in healthcare due to their potential to cause adverse effects and compromise patient safety. Supervised machine learning models for DDI prediction need to be optimized to learn abstract, transferable features, and generalize to larger chemical spaces, primarily due to the scarcity of high-quality labeled DDI data. Inspired by recent advances in computer vision, we present SMR–DDI, a self-supervised framework that leverages contrastive learning to embed drugs into a scaffold-based feature space. Molecular scaffolds represent the core structural motifs that drive pharmacological activities, making them valuable for learning informative representations. Specifically, we pre-trained SMR–DDI on a large-scale unlabeled molecular dataset. We generated augmented views for each molecule via SMILES enumeration and optimized the embedding process through …
Sample Boosting Algorithm (SamBA) - An interpretable greedy ensemble classifier based on local expertise for fat data
Bauvin B et al. (PMLR 216:130–140, 2023)
Abstract
Ensemble methods are a very diverse family of algorithms with a wide range of applications. One of the most commonly used is boosting, with the prominent Adaboost. Adaboost relies on greedily learning base classifiers that rectify the error from previous iterations. Then, it combines them through a weighted majority vote, based on their quality on the entire learning set. In this paper, we propose a supervised binary classification framework that propagates the local knowledge acquired during the boosting iterations to the prediction function. Based on this general framework, we introduce SamBA, an interpretable greedy ensemble method designed for fat datasets, with a large number of dimensions and a small number of samples. SamBA learns local classifiers and combines them, using a similarity function, to optimize its efficiency in data extraction. We provide a theoretical analysis of SamBA, yielding convergence and generalization guarantees. In addition, we highlight SamBA’s empirical behavior in an extensive experimental analysis on both real biological and generated datasets, comparing it to state-of-the-art ensemble methods and similarity-based approaches.
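The weighted majority vote that the abstract attributes to Adaboost can be sketched as follows. This is a minimal, textbook-style discrete AdaBoost with decision stumps on a hypothetical one-dimensional toy set. It is not SamBA, which additionally uses a similarity function to keep the vote local, and all function names are illustrative.

```python
import math


def train_stump(X, y, w):
    """Return the (error, feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign


def adaboost(X, y, n_rounds=3):
    """Discrete AdaBoost for labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-12)  # guard against log(1/0) for a perfect stump
        if err >= 0.5:
            break  # no stump is better than chance on the current weights
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correct ones lose weight
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted majority vote over all base classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: no single threshold separates the two classes
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, n_rounds=3)
print([predict(ensemble, row) for row in X])
```

Each base classifier's weight here depends on its error over the whole reweighted training set, which is the global notion of quality that the abstract contrasts with SamBA's local expertise.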
Integrating and Reporting Full Multi-View Supervised Learning Experiments Using SuMMIT
Bauvin B et al. (PMLR 183:139–150, 2022)
Abstract
SuMMIT (Supervised Multi Modal Integration Tool) is a software offering many functionalities for running, tuning, and analyzing experiments of supervised classification tasks specifically designed for multi-view data sets. SuMMIT is part of a platform that aggregates multiple tools to deal with multiview datasets such as scikit-multimodallearn (Benielli et al., 2021) or MAGE (Bauvin et al., 2021). This paper presents use cases of SuMMIT, including hyper-parameters optimization, demonstrating the usefulness of such a platform for dealing with the complexity of multi-view benchmarking on an imbalanced dataset. SuMMIT is powered by Python3 and based on scikit-learn, making it easy to use and extend by plugging one’s own specific algorithms, score functions or adding new features. By using continuous integration, we encourage collaborative development.
Keywords: Multimodal, Supervised, Classification, Benchmarking, Python, Reproducible Research, Modularity, Explainability, Interpretability
I have rarely been as enthusiastic about a new research direction. We call them GFlowNets, for Generative Flow Networks. They live somewhere at the intersection of reinforcement learning, deep generative models and energy-based probabilistic modelling. They are also related to variational models and inference and I believe open new doors for non-parametric Bayesian modelling, generative active learning, and unsupervised or self-supervised learning of abstract representations to disentangle both the explanatory causal factors and the mechanisms that relate them. What I find exciting is that they open so many doors, but in particular for implementing the system 2 inductive biases I have been discussing in many of my papers and talks since 2017, that I argue are important to incorporate causality and deal with out-of-distribution generalization in a rational way. They allow neural nets to model distributions over data structures like graphs (for example molecules, as in the NeurIPS paper, or explanatory and causal graphs, in current and upcoming work), to sample from them as well as to estimate all kinds of probabilistic quantities (like free energies, conditional probabilities on arbitrary subsets of variables, or partition functions) which otherwise look intractable.
Flexible protein database based on amino acid k-mers
Déraspe M. et al (Sci Rep 2022 Jun 1;12(1):9101)
Identification of proteins is one of the most computationally intensive steps in genomics studies. It usually relies on aligners that do not accommodate rich information on proteins and require additional pipelining steps for protein identification. We introduce kAAmer, a protein database engine based on amino-acid k-mers that provides efficient identification of proteins while supporting the incorporation of flexible annotations on these proteins. Moreover, the database is built to be used as a microservice, to be hosted and queried remotely.
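To illustrate the general idea of identifying proteins through shared amino-acid k-mers, here is a toy in-memory index. It is only a sketch under simplifying assumptions: the sequences, identifiers, and function names are hypothetical, and kAAmer's actual storage engine, scoring, and API are not described in this excerpt.

```python
from collections import Counter, defaultdict


def kmers(sequence, k):
    """Yield every overlapping k-mer of an amino-acid sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(proteins, k):
    """Map each k-mer to the set of protein identifiers that contain it."""
    index = defaultdict(set)
    for protein_id, seq in proteins.items():
        for kmer in kmers(seq, k):
            index[kmer].add(protein_id)
    return index


def query(index, sequence, k):
    """Rank proteins by the number of distinct query k-mers they share."""
    hits = Counter()
    for kmer in set(kmers(sequence, k)):
        for protein_id in index.get(kmer, ()):
            hits[protein_id] += 1
    return hits.most_common()


# Hypothetical toy database of three short sequences
proteins = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTLLVAGQR",
    "P3": "GGSAYIAKQW",
}
index = build_index(proteins, k=4)
results = query(index, "TAYIAKQ", k=4)
print(results)
```

Because the lookup is a dictionary access per k-mer rather than an alignment, the query cost depends on the length of the query and not on the number of proteins stored, which is the kind of trade-off a k-mer database makes.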
One fundamental task in genomics is the identification and annotation of DNA coding regions that translate into proteins via a genetic code. Protein databases increase in size as new variants, orthologous and novel genes, often found in metagenomics studies, are being sequenced. This is particularly true within the microbial world where bacterial proteomes’ diversity follows their rapid evolution. For instance, UniProtKB (Swiss-Prot/TrEMBL) [1] and NCBI RefSeq [2] contain over 100 million bacterial proteins and that number is increasing rapidly.
Identification of proteins often relies on accurate, but slow, alignment software such as BLAST or hidden Markov model (HMM) profile-based software [3,4]. Although other approaches (such as DIAMOND [5]) have considerably improved the speed of searching proteins in large datasets, from a database standpoint much can be done to offer a more versatile experience. One such approach would be to expose the database as a permanent service, which can make use of computational resources for increased performance (e.g. memory mapping) and leverage the cloud for remote analyses via an HTTP API. Another approach would be to extend the result set with comprehensive information on protein targets to facilitate subsequent genomics and metagenomics analysis pipelines.
Wishart D. et al (Nucleic Acids Res 2022 Jan 7;50(D1):D622-D631.)
HMDB 5.0: the Human Metabolome Database for 2022
Abstract
The Human Metabolome Database or HMDB (https://hmdb.ca) has been providing comprehensive reference information about human metabolites and their associated biological, physiological and chemical properties since 2007. Over the past 15 years, the HMDB has grown and evolved significantly to meet the needs of the metabolomics community and respond to continuing changes in internet and computing technology. This year’s update, HMDB 5.0, brings a number of important improvements and upgrades to the database. These should make the HMDB more useful and more appealing to a larger cross-section of users. In particular, these improvements include: (i) a significant increase in the number of metabolite entries (from 114 100 to 217 920 compounds); (ii) enhancements to the quality and depth of metabolite descriptions; (iii) the addition of new structure, spectral and pathway visualization tools; (iv) the inclusion of many new and much more accurately predicted spectral data sets, including predicted NMR spectra, more accurately predicted MS spectra, predicted retention indices and predicted collision cross section data and (v) enhancements to the HMDB’s search functions to facilitate better compound identification.
Kpanou R et al. (BMC Bioinformatics, 2021 Oct 4;22(1):477).
On the robustness of generalization of drug–drug interaction models
Abstract
Background: Deep learning methods are a proven commodity in many fields and endeavors. One of these endeavors is predicting the presence of adverse drug–drug interactions (DDIs). The models generated can predict, with reasonable accuracy, the phenotypes arising from the drug interactions using their molecular structures. Nevertheless, this task requires improvement to be truly useful. Given the complexity of the predictive task, an extensive benchmarking on structure-based models for DDIs prediction was performed to evaluate their drawbacks and advantages.
Results: We rigorously tested various structure-based models that predict drug interactions using different splitting strategies to simulate different real-world scenarios. In addition to the effects of different training and testing setups on the robustness and generalizability of the models, we then explore the contribution of traditional approaches such as multitask learning and data augmentation.
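The excerpt does not name the splitting strategies that were compared, so the sketch below only illustrates why the choice of split matters. It contrasts two assumed examples: a random split over drug pairs, where the same drug can appear in both training and test pairs, and a stricter split in which every test pair involves at least one drug never seen during training. Drug names and function names are hypothetical.

```python
import random
from itertools import combinations


def random_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle the pairs and hold out a fraction of them."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def unseen_drug_split(pairs, test_fraction=0.2, seed=0):
    """Hold out a set of drugs; train only on pairs with no held-out drug."""
    rng = random.Random(seed)
    drugs = sorted({d for pair in pairs for d in pair})
    rng.shuffle(drugs)
    n_held = max(1, int(len(drugs) * test_fraction))
    held_out = set(drugs[:n_held])
    train = [p for p in pairs if not (set(p) & held_out)]
    test = [p for p in pairs if set(p) & held_out]
    return train, test, held_out


# Hypothetical toy set: every pair among six placeholder drugs
pairs = list(combinations(["A", "B", "C", "D", "E", "F"], 2))

r_train, r_test = random_split(pairs)
u_train, u_test, held_out = unseen_drug_split(pairs)

print(len(r_train), len(r_test))
print(len(u_train), len(u_test), sorted(held_out))
```

A model evaluated with the second split cannot rely on having memorized anything about the held-out drugs, which typically makes it the harder and more conservative test of generalization.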
Kothari C. et al. (Sci Rep, 2020 Jun 26;10(1):10464.)
Machine learning analysis identifies genes differentiating triple negative breast cancers
Triple negative breast cancer (TNBC) is one of the most aggressive forms of breast cancer (BC) with the highest mortality due to a high rate of relapse, resistance, and lack of an effective treatment. Various molecular approaches have been used to target TNBC but with little success. Here, using machine learning algorithms, we analyzed the available BC data from the Cancer Genome Atlas Network (TCGA) and have identified two potential genes, TBC1D9 (TBC1 domain family member 9) and MFGE8 (Milk Fat Globule-EGF Factor 8 Protein), that could successfully differentiate TNBC from non-TNBC, irrespective of their heterogeneity. TBC1D9 is under-expressed in TNBC as compared to non-TNBC patients, while MFGE8 is over-expressed. Overexpression of TBC1D9 is associated with a better prognosis whereas overexpression of MFGE8 correlates with a poor prognosis. Protein–protein interaction analysis by affinity purification mass spectrometry (AP-MS) and proximity biotinylation (BioID) experiments identified a role for TBC1D9 in maintaining cellular integrity, whereas MFGE8 would be involved in various tumor survival processes. These promising genes could serve as biomarkers for TNBC and deserve further investigation as they have the potential to be developed as therapeutic targets for TNBC.
Mass spectrometry is a valued method to evaluate the metabolomics content of a biological sample. The recent advent of rapid ionization technologies such as Laser Diode Thermal Desorption (LDTD) and Direct Analysis in Real Time (DART) has rendered high-throughput mass spectrometry possible. It is used for large-scale comparative analysis of populations of samples. In practice, many factors resulting from the environment, the protocol, and even the instrument itself, can lead to minor discrepancies between spectra, rendering automated comparative analysis difficult. In this work, a sequence/pipeline of algorithms to correct variations between spectra is proposed. The algorithms correct multiple spectra by identifying peaks that are common to all and, from those, compute a spectrum-specific correction. We show that these algorithms increase comparability within large datasets of spectra, facilitating comparative analysis, such as machine learning.
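The two steps named in the abstract, finding peaks common to all spectra and deriving a spectrum-specific correction from them, can be sketched in a deliberately simplified form. The assumptions here are not taken from the paper: spectra are plain lists of m/z values, the first spectrum serves as the reference, and the correction is a single constant shift per spectrum. All names and values are hypothetical.

```python
from statistics import median


def nearest(peaks, target):
    """Return the peak closest to a target m/z value."""
    return min(peaks, key=lambda mz: abs(mz - target))


def common_peaks(spectra, tolerance):
    """Peaks of the first spectrum that have a match in every other spectrum."""
    reference = spectra[0]
    return [mz for mz in reference
            if all(abs(nearest(s, mz) - mz) <= tolerance for s in spectra[1:])]


def correct(spectra, tolerance=0.05):
    """Shift each spectrum by its median deviation from the common peaks."""
    anchors = common_peaks(spectra, tolerance)
    if not anchors:
        raise ValueError("no peak is shared by all spectra")
    corrected = []
    for s in spectra:
        # Spectrum-specific correction: one constant shift (an assumption)
        shift = median(nearest(s, a) - a for a in anchors)
        corrected.append([mz - shift for mz in s])
    return anchors, corrected


# Hypothetical toy spectra with small, spectrum-wide m/z offsets
spectra = [
    [100.00, 200.00, 300.00, 412.10],
    [100.02, 200.02, 300.02, 350.50],
    [99.97, 199.97, 299.97],
]
anchors, corrected = correct(spectra)
print(anchors)
```

After correction, the three shared peaks line up across spectra to within floating-point error, while peaks present in only one spectrum are shifted along with the rest of that spectrum.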
Plante PL. et al. (Anal Chem, 2019 Apr 16;91(8):5191-5199)
Predicting Ion Mobility Collision Cross-Sections Using a Deep Neural Network: DeepCCS
Abstract
Untargeted metabolomic measurements using mass spectrometry are a powerful tool for uncovering new small molecules with environmental and biological importance. The small molecule identification step, however, still remains an enormous challenge due to fragmentation difficulties or unspecific fragment ion information. Current methods to address this challenge are often dependent on databases or require the use of nuclear magnetic resonance (NMR), which have their own difficulties. The use of the gas-phase collision cross section (CCS) values obtained from ion mobility spectrometry (IMS) measurements was recently demonstrated to reduce the number of false positive metabolite identifications. While promising, the amount of empirical CCS information currently available is limited, thus predictive CCS methods need to be developed. In this article, we expand upon current experimental IMS capabilities by predicting the CCS values using a deep learning algorithm. We successfully developed and trained a prediction model for CCS values requiring only information about a compound’s SMILES notation and ion type. The use of data from five different laboratories using different instruments allowed the algorithm to be trained and tested on more than 2400 molecules. The resulting CCS predictions were found to achieve a coefficient of determination of 0.97 and median relative error of 2.7% for a wide range
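The two evaluation metrics quoted in this abstract, the coefficient of determination and the median relative error, can be computed as in the sketch below. The measured and predicted values are hypothetical numbers chosen for illustration, not DeepCCS outputs, and the function names are illustrative.

```python
from statistics import mean, median


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mu) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def median_relative_error(observed, predicted):
    """Median of |predicted - observed| / observed."""
    return median(abs(p - o) / o for o, p in zip(observed, predicted))


# Hypothetical measured vs. predicted CCS values
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [102.0, 147.0, 204.0, 245.0]

r2 = r_squared(observed, predicted)
mre = median_relative_error(observed, predicted)
print(round(r2, 5), round(mre, 4))
```

The median is less sensitive than the mean to a handful of badly predicted molecules, which is one reason to report it alongside the coefficient of determination.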
Interpretable genotype-to-phenotype classifiers with performance guarantees
Drouin A. et al (Sci Rep, 2019 Mar 11;9(1):4071)
Understanding the relationship between the genome of a cell and its phenotype is a central problem in precision medicine. Nonetheless, genotype-to-phenotype prediction comes with great challenges for machine learning algorithms that limit their use in this setting. The high dimensionality of the data tends to hinder generalization and challenges the scalability of most learning algorithms. Additionally, most algorithms produce models that are complex and difficult to interpret. We alleviate these limitations by proposing strong performance guarantees, based on sample compression theory, for rule-based learning algorithms that produce highly interpretable models. We show that these guarantees can be leveraged to accelerate learning and improve model interpretability. Our approach is validated through an application to the genomic prediction of antimicrobial resistance, an important public health concern. Highly accurate models were obtained for 12 species and 56 antibiotics, and their interpretation revealed known resistance mechanisms, as well as some potentially new ones. An open-source disk-based implementation that is both memory and computationally efficient is provided with this work. The implementation is turnkey, requires no prior knowledge of machine learning, and is complemented by comprehensive tutorials.
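To give a sense of what a highly interpretable rule-based model can look like in this setting, the sketch below greedily learns a short conjunction of k-mer presence/absence rules. The abstract does not specify the algorithms, so this greedy procedure, the utility function, and the toy genomes are assumptions made for illustration only; they are not the paper's implementation.

```python
def rejects(rule, genome):
    """A rule (kmer, must_be_present) rejects a genome that violates it."""
    kmer, must_be_present = rule
    return (kmer in genome) != must_be_present


def learn_conjunction(genomes, labels, max_rules=3, penalty=1.0):
    """Greedily pick rules that reject many negatives and few positives."""
    all_kmers = sorted(set().union(*genomes))
    candidates = [(k, flag) for k in all_kmers for flag in (True, False)]
    negatives = {i for i, y in enumerate(labels) if y == 0}
    positives = {i for i, y in enumerate(labels) if y == 1}
    rules = []
    while negatives and len(rules) < max_rules:
        def utility(rule):
            rejected_neg = sum(rejects(rule, genomes[i]) for i in negatives)
            rejected_pos = sum(rejects(rule, genomes[i]) for i in positives)
            return rejected_neg - penalty * rejected_pos
        best = max(candidates, key=utility)
        if utility(best) <= 0:
            break  # no remaining rule helps
        rules.append(best)
        # Samples rejected by the new rule are settled; drop them
        negatives = {i for i in negatives if not rejects(best, genomes[i])}
        positives = {i for i in positives if not rejects(best, genomes[i])}
    return rules


def predict(rules, genome):
    """Predict 1 (e.g. resistant) only if the genome satisfies every rule."""
    return int(not any(rejects(r, genome) for r in rules))


# Hypothetical genomes represented as sets of k-mers; 1 = resistant
genomes = [{"AAAA", "ACGT"}, {"CCCC", "ACGT"}, {"AAAA", "CCCC"}, {"GGGG"}]
labels = [1, 1, 0, 0]

rules = learn_conjunction(genomes, labels)
print(rules)
print([predict(rules, g) for g in genomes])
```

The learned model is a single readable statement ("resistant if k-mer ACGT is present"), which is the kind of output that can be checked directly against known resistance mechanisms.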