In this work, we curate the Global Natural Product Social Molecular Networking (GNPS) Drug Library to enable data science strategies on drug readouts. GNPS Drug Library contains a collection of MS/MS reference spectra for drugs and their metabolites/analogs along with standardized, machine-readable pharmacologic metadata for each reference spectrum, encompassing exposure source, pharmacologic class, therapeutic indication, and mechanism of action. This resource enables data science analysis to empirically - and retroactively - determine drug exposures using untargeted metabolomics, supporting research across disciplines.
The creation of this library involved three key steps: (1) collecting MS/MS spectra of drugs and their metabolites from publicly available MS/MS reference libraries; (2) finding MS/MS spectra analogs of those drugs in publicly accessible untargeted metabolomics data - to enhance coverage of the metabolized versions of drugs; and (3) linking each reference MS/MS spectrum to controlled-vocabulary metadata - the key component of this resource that facilitates efficient data interpretation (Fig. 1a).
The reference MS/MS spectra of drugs and their known metabolites were collected from two of the largest open-access mass spectral libraries, namely the GNPS Spectral Library and MSnLib. To identify drug spectra from these two resources, we first performed metadata enrichment for all compounds in GNPS and MSnLib by structural and name search against PubChem (for synonyms), DrugCentral, the Broad Institute Drug Repurposing Hub databases, ChEMBL (for pharmacologic information), and DrugBank (for pharmacologic information and the Anatomical Therapeutic Chemical Classification code). Based on the enriched metadata regarding clinical phases, all MS/MS spectra of drugs and compounds in clinical trials were compiled into the centralized GNPS Drug reference library (see method details in Supplementary Text 1). This represents all publicly accessible drug MS/MS spectra, covering 4723 unique drugs (represented by 99,122 MS/MS spectra) analyzed with diverse instrumentation and collision energies. The compound names in the GNPS Drug Library were automatically curated and set to the first synonym in PubChem. We point out that the term "drug" is used here in a broad sense, as the GNPS Drug Library includes not only prescribed and over-the-counter medications but also compounds currently in clinical trials, drugs that have been withdrawn, as well as substances with potential for abuse (e.g., cocaine, fentanyl).
To ensure coverage of drug metabolites, we performed a second "partial name match" to include previously missed metabolites. For example, by searching for the name "venlafaxine" (an antidepressant), we obtained reference spectra for five of its metabolites, including "N-desmethylvenlafaxine", "O-desmethylvenlafaxine", "N,O-didesmethylvenlafaxine", "N,N-didesmethylvenlafaxine", and "venlafaxine N-oxide" that were not captured in the previous database matching. This approach allowed us to add reference spectra for metabolites of 110 drugs (2080 MS/MS spectra). Lastly, we also included the MS/MS spectra collected in the development of dmCCS, a collision cross-section database for drugs and their metabolites where human liver microsomes and S9 fractions were used for in vitro generation of drug metabolites. In total, we collected metabolite spectra for 470 drugs in the GNPS Drug Library (represented by 2080 MS/MS spectra; Fig. 1a).
The extensive collection effort yielded drug metabolite reference spectra for 10% of all the drugs included in the GNPS Drug Library, leaving most drugs without spectra for their metabolites. To overcome this challenge, we hypothesized that unannotated drug metabolites are present in public untargeted metabolomics datasets. We further hypothesized that spectral alignment strategies can be used to find the modified versions of the drugs. In other words, public untargeted metabolomics datasets could be used to create a reference library of candidate drug metabolites that will facilitate the drug exposure readout in future studies.
Using MS/MS spectral alignment with two computational methods, repository-scale molecular networking and the fast Mass Spectrometry Search Tool (fastMASST) with analog search, we retrieved all MS/MS spectra analogous to drugs from all data in MetaboLights, Metabolomics Workbench, and GNPS/MassIVE, three of the largest public repositories for metabolomics data (covering ~3500 LC-MS/MS datasets). These spectra represent drug-related molecules potentially derived from metabolism (host or microbiome), abiotic processes, or adducts of the drugs from MS measurements. We obtained analogous MS/MS spectra for 24.6% of the 103,209 reference spectra for drugs and drug metabolites (>19 million drug-analog spectral pairs).
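Spectral alignment for analog search is commonly scored with a modified cosine, in which fragment peaks may match either directly or after shifting by the precursor mass difference. The following is a minimal sketch of that idea, not the implementation used by molecular networking or fastMASST. The function name, the square-root intensity scaling, the greedy peak assignment, and the 0.02 m/z tolerance are all assumptions made for illustration.

```python
import math


def _normalize(spectrum):
    """Square-root scale intensities, then scale the vector to unit length."""
    scaled = [(mz, math.sqrt(intensity)) for mz, intensity in spectrum]
    norm = math.sqrt(sum(i * i for _, i in scaled))
    return [(mz, i / norm) for mz, i in scaled] if norm else []


def modified_cosine(spec_a, prec_a, spec_b, prec_b, tol=0.02):
    """Greedy modified cosine between two centroided MS/MS spectra.

    spec_a, spec_b: lists of (m/z, intensity); prec_a, prec_b: precursor m/z.
    Peaks pair up if they agree directly or after shifting by the precursor
    mass difference (the candidate modification). Tolerance is an assumption.
    """
    shift = prec_b - prec_a
    a, b = _normalize(spec_a), _normalize(spec_b)
    candidates = []
    for ia, (mz_a, int_a) in enumerate(a):
        for ib, (mz_b, int_b) in enumerate(b):
            direct = abs(mz_b - mz_a) <= tol
            shifted = abs(mz_b - mz_a - shift) <= tol
            if direct or shifted:
                candidates.append((int_a * int_b, ia, ib))
    # Greedy one-to-one assignment, best-scoring pairs first.
    candidates.sort(reverse=True)
    used_a, used_b, score = set(), set(), 0.0
    for product, ia, ib in candidates:
        if ia not in used_a and ib not in used_b:
            used_a.add(ia)
            used_b.add(ib)
            score += product
    return score


# Toy spectra (not real data): the "analog" carries a +15.995 modification
# on two of its three fragments.
drug = [(100.0, 50.0), (150.0, 100.0), (200.0, 30.0)]
analog = [(100.0, 50.0), (165.995, 100.0), (215.995, 30.0)]
with_shift = modified_cosine(drug, 250.0, analog, 265.995)
without_shift = modified_cosine(drug, 250.0, analog, 250.0)
```

In this toy case the shifted matching recovers all three fragment pairs, whereas matching without the precursor shift pairs only the unmodified fragment. This is why analog search can retrieve modified versions of a drug that a plain library match would miss.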
While testing the propagated analog library, we identified the need for additional filters to enhance its relevance to drug exposure (Fig. 1b, c). First, it is generally not possible to determine the source of an exogenously supplied chemical that can also be produced endogenously or derived from the diet. Consequently, structural analogs of drugs with endogenous or dietary sources were excluded from the propagated drug analog library (e.g., analogous MS/MS spectra of testosterone used to treat hypogonadism and for gender-affirming care, or caffeine used as a stimulant drug, were excluded). Second, propagated analogs with uncommon or unexplained mass offsets (precursor mass difference between the propagated analog and the paired drug) were excluded. Common mass offsets were collected from UNIMOD, from a community-curated list of explainable delta masses (Supplementary Data 1), and from the Host Gut Microbiota Metabolism Xenobiotics Database, and were manually curated for those relevant to drug metabolism (e.g., 14.02 Da, methylation; 176.03 Da, glucuronidation) or mass spectrometry adducts (e.g., 17.03 Da, ammonium adduct; see Supplementary Data 2 for the 156 mass offsets that were included). Third, since drugs within the same pharmacologic family often have similar structures, they can be retrieved as analogs of each other through spectral alignments. Therefore, we excluded MS/MS spectra with matches to the whole public GNPS Library from the propagated analog annotations. For example, a propagated analog of quinapril, an angiotensin converting enzyme (ACE) inhibitor, had a spectral match to ramipril, another ACE inhibitor (Fig. 1c). Excluding these analog annotations ensures that they do not overwrite library matches of known drugs and metabolites. Finally, we tested the propagated drug analog library against 20 public LC-MS/MS datasets to filter out analogs with unrealistically high detection frequencies.
The selected datasets represent a broad range of human tissue types and biofluids, including fecal (n = 5), breast milk (n = 2), plasma (n = 3), skin (n = 1), and brain (n = 1), as well as multiple mouse tissues (n = 8). Here, we observed analogs of tocofersolan (a synthetic vitamin E derivative), iloprost (a synthetic prostacyclin mimetic), desonide (a synthetic topical corticosteroid), medroxyprogesterone (a synthetic progestin), and vidarabine (an adenosine analog used as an antiviral) in >50% of the human fecal samples from the American Gut Project (n = 1993 individuals), a cohort of the general population. The connected drugs for these analogs are derivatives of endogenous or food-derived molecules and are unlikely to be used by more than half of the population. Therefore, these analogs cannot be confidently linked to drug exposures and were excluded. Analogs of polidocanol (a synthetic long-chain fatty alcohol used as an anesthetic) were observed in >70% of 2463 human milk samples. They are likely surfactants/contaminants containing polyethylene glycol structural units and thus were excluded from the propagated drug analog library (Fig. 1c).
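The detection-frequency filter can be illustrated with a short sketch. The text reports the observed prevalences (>50% and >70%) but does not state a single cutoff, so the `max_fraction` parameter, the function name, and the toy detections below are assumptions for illustration only.

```python
def high_prevalence_analogs(detections, n_samples, max_fraction=0.5):
    """Return analogs detected in more than `max_fraction` of the samples.

    detections: dict mapping analog ID -> set of sample IDs with a match.
    The cutoff is an assumed parameter, not a value prescribed by the text.
    """
    return sorted(
        analog
        for analog, samples in detections.items()
        if len(samples) / n_samples > max_fraction
    )


# Toy data (hypothetical analog IDs and sample IDs).
toy_detections = {
    "analog_A": {f"s{i}" for i in range(7)},  # 7 of 10 samples
    "analog_B": {"s0", "s3"},                 # 2 of 10 samples
}
flagged = high_prevalence_analogs(toy_detections, n_samples=10)
print(flagged)
```

Analogs flagged this way would then be reviewed before exclusion. In the text, that decision also rested on whether the connected drug derives from an endogenous or food-derived molecule.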
After the filtering steps, propagated analogs of 1277 drugs (12,455 clustered non-duplicated MS/MS spectra) were collected in the final drug analog library. We observed that 63% of the propagated analogs occurred at least once with the corresponding parent drugs in the same data file, highlighting the relevance of the propagated drug analogs to drug exposures (Fig. S1). The most common mass offsets in the drug analog library correspond to a gain or loss of 14.02 Da, which can be interpreted as (de)methylation, followed by a gain of 17.03 Da (NH₃, ammonium adduct), a gain of 15.99 Da (oxygen, oxidative metabolism), a loss of 18.01 Da (H₂O, dehydration), and a loss of 28.03 Da (C₂H₄, (de)ethylation; Fig. 1d). Based on peak shape correlation and fragment matching analysis, 33% of the propagated drug analogs could be other ion forms of the parent drug, including isotopes (5%), adducts (17%), or in-source fragments (11%), while the remaining 67% are likely drug metabolites or structural analogs (Fig. 1d, S1). Although the propagated analogs include non-biological ion forms such as adducts and in-source fragments, we retain them in the library to enhance detection sensitivity. Their presence still signals drug exposure even if they do not represent metabolic derivatives. To extend structural hypotheses for the drug analogs that we found, we employed the newly developed ModiFinder, which leverages the shifted MS/MS fragment peaks in the MS/MS alignment to predict the most likely location of the structural modifications (Fig. 1e, S2). The source (as in-source fragments, adducts, or structural analogs) and the availability of ModiFinder prediction for each drug analog are provided in the GNPS Drug Library metadata (see Data Availability section).
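As a minimal sketch of how a precursor mass offset can be mapped to an explainable modification, the snippet below checks a delta mass against a handful of the offsets named in the text. The table holds only six entries, not the 156 curated offsets of Supplementary Data 2. The monoisotopic masses are standard values, while the function name and the 0.01 Da tolerance are assumptions.

```python
# Monoisotopic masses (Da) for a few offsets named in the text.
KNOWN_OFFSETS = {
    "CH2": 14.01565,      # (de)methylation
    "O": 15.99491,        # oxidation / deoxygenation
    "NH3": 17.02655,      # ammonium adduct
    "H2O": 18.01056,      # hydration / dehydration
    "C2H4": 28.03130,     # (de)ethylation
    "C6H8O6": 176.03209,  # glucuronidation
}


def explain_offset(analog_mz, drug_mz, tol=0.01):
    """Return (direction, formula) for an explainable offset, else None.

    Assumes both precursors are the same ion type and charge state 1, so
    that the m/z difference equals the mass difference.
    """
    delta = analog_mz - drug_mz
    for formula, mass in KNOWN_OFFSETS.items():
        if abs(abs(delta) - mass) <= tol:
            return ("gain" if delta > 0 else "loss", formula)
    return None


# The omeprazole pair quoted in the text: m/z 346.122 -> m/z 330.127.
print(explain_offset(330.127, 346.122))
```

An analog whose offset returns `None` under the full curated table would be dropped by the second filter described above.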
Connecting drug detections to their therapeutic indications typically requires expert knowledge and/or extensive literature searches. The GNPS Drug Library addresses this challenge by providing controlled-vocabulary metadata together with the specific drug annotations. This allows users to annotate all drugs in an untargeted metabolomics dataset and directly obtain a table with exposure sources, pharmacologic classes, therapeutic indications, and mechanisms of action of the drugs, with their structures and names in a data science-ready format (Fig. 1f, S3). In particular, the "exposure source" information categorizes the drugs into a combination of five classes, namely medical, endogenous, food, personal care, and industrial sources. This scheme was developed from the source categorizations in the Chemical Functional Ontology (ChemFOnt) database and curated manually - by parsing of web pages and scientific literature - to increase compound coverage and improve accuracy and consistency. This categorization allows distinguishing endogenous or food-sourced molecules from those used only as medications. Examples include deoxycholic acid, an endogenous molecule also used for liver disease, and lactitol, a food sweetener also used as a laxative. Using the GNPS Drug Library metadata, such annotations can be separated from those molecules used exclusively as drugs, which have distinct exposure implications.
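The separation of exclusively medical drugs from those with endogenous or food sources can be sketched as a simple metadata lookup. The record layout and field names below are hypothetical and do not reflect the library's actual column headers. The three example entries restate classifications given in the text.

```python
# Hypothetical metadata records; field names are illustrative assumptions.
DRUG_METADATA = {
    "quinapril": {
        "exposure_source": {"medical"},
        "pharmacologic_class": "ACE inhibitor",
    },
    "lactitol": {
        "exposure_source": {"medical", "food"},
        "pharmacologic_class": None,  # not stated in the text
    },
    "deoxycholic acid": {
        "exposure_source": {"medical", "endogenous"},
        "pharmacologic_class": None,  # not stated in the text
    },
}


def annotate(detected_names, metadata):
    """Attach metadata to each detected drug name; unknown names are skipped."""
    return {name: metadata[name] for name in detected_names if name in metadata}


def exclusively_medical(annotated):
    """Keep drugs whose only recorded exposure source is medical use."""
    return sorted(
        name
        for name, record in annotated.items()
        if record["exposure_source"] == {"medical"}
    )


table = annotate(
    ["quinapril", "lactitol", "deoxycholic acid", "unannotated feature"],
    DRUG_METADATA,
)
print(exclusively_medical(table))
```

Storing the exposure source as a set reflects the statement that each drug is assigned a combination of the five classes rather than a single one.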
Through structural and name matches, we then extracted the pharmacologic classes of 900 drugs from the U.S. Food and Drug Administration (FDA) and the therapeutic areas, therapeutic indications, and mechanisms of action for 3894 drugs from the Broad Institute Drug Repurposing Hub. However, we noticed substantial variability in the extracted information (e.g., inconsistent therapeutic areas assigned to drugs within the same pharmacological class; the sulfonamide antimicrobials sulfamethizole, sulfamethazine, and sulfacetamide were categorized as infectious disease, gastroenterology, and ophthalmology, respectively), or insufficient metadata for several drugs (e.g., common therapeutic indications missing). Therefore, this metadata was further manually curated by expert clinical pharmacologists to enhance and clean up the information retrieved from databases. This manual curation increased the metadata coverage to 4560 drugs. Drugs without associated metadata are typically those that have been withdrawn from the market (e.g., indoprofen), were in drug development but never marketed (e.g., tarafenacin), or are under development but do not yet have regulatory approval (e.g., firsocostat).
In total, 735 drugs in the GNPS Drug Library (38,001 spectra) were identified with endogenous or dietary sources. The final metadata of the drug library covers 27 unique therapeutic areas, 571 pharmacological classes, 920 therapeutic indications, and 823 mechanisms of action (Fig. 1f, S3). Therapeutic areas of neurology/psychiatry, infectious disease, and cardiology have the highest number of included drugs (Fig. 1f) and reference spectra (Fig. S3). We note that these counts reflect the availability of the reference spectra but not the prevalence of these drugs in the general population. Combining the exposure source and therapeutic area, we noticed that fewer drugs related to infection and neurology/psychiatry have endogenous or food sources, while higher proportions of drugs used for gastroenterology (e.g., deoxycholic acid, riboflavin) and dermatology (e.g., salicylic acid, nicotinamide) are endogenous and/or food-derived molecules.
The GNPS Drug Library detects drugs known to have been consumed, with sensitivity enhanced by the inclusion of drug metabolites and propagated analogs. To demonstrate this, we analyzed two pharmacokinetic datasets where healthy individuals received specific probe drugs followed by time-series sampling. In the first study, 10 participants received a single oral dose of diphenhydramine. The drug was not detected in plasma and skin samples before administration, but was detected in all individuals post-administration over the course of 24 h (Fig. S4a, b). In plasma, detection frequencies peaked at 1-2 h (Fig. S4a, b), aligning with the reported time to maximum concentration (~2 h) for diphenhydramine. In skin, peak detection occurred at 10-12 h (Fig. S4a, b), reflecting the delayed deposition to skin compared to plasma for orally administered drugs. In the second study, 14 participants received a cocktail of oral probe drugs, namely caffeine, midazolam, and omeprazole. The parent drugs were detected in plasma from 100% (caffeine), 46% (midazolam), and 100% (omeprazole) of participants within 8 h post-administration, but were not detected in fecal samples (Fig. S4c, e). Inclusion of drug metabolites and propagated analogs improved detection rates: midazolam detection rose to 69% in plasma and 7.1% in feces. Similarly, for omeprazole, fecal detection increased from 0% to 21.4% when considering its metabolites and analogs (Fig. S4d, f). Notably, at the 8 h time point, 61.5% of participants exhibited detectable omeprazole only through its metabolites or analogs in plasma - highlighting the value of including derivative forms in drug exposure assessments. Together, these results underscore that drug detection is both biofluid- and time-dependent, and that metabolites and propagated analogs enhance detection sensitivity across sample types and time points.
More broadly, the results emphasize the need to establish drug exposures empirically in the context of the analyzed samples, as clinical records do not account for drug distributions and rarely consider the time between drug intake and sample collection.
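The gain in detection rate from including derivative forms amounts to counting a participant as exposed when any of the parent drug, its metabolites, or its propagated analogs is annotated. A minimal sketch follows, using toy detections rather than the study's measurements. The feature labels and function name are hypothetical.

```python
def detection_rate(detections, features):
    """Fraction of participants with at least one of `features` detected.

    detections: dict mapping participant ID -> set of annotated feature names.
    """
    wanted = set(features)
    hits = sum(1 for found in detections.values() if found & wanted)
    return hits / len(detections)


# Toy detections (not study data); feature labels are hypothetical.
plasma = {
    "P1": {"midazolam"},
    "P2": {"midazolam analog"},
    "P3": {"midazolam", "midazolam metabolite"},
    "P4": set(),
}
parent_only = detection_rate(plasma, ["midazolam"])
with_derivatives = detection_rate(
    plasma, ["midazolam", "midazolam metabolite", "midazolam analog"]
)
print(parent_only, with_derivatives)
```

In the toy data, participant P2 is counted as exposed only once analogs are considered. This mirrors the case in which a parent drug has fallen below detection while its derivatives remain observable.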
Connected with public untargeted metabolomics data, the GNPS Drug Library can reveal distinct drug exposure profiles among individuals in different disease, age, and sex groups. For different disease studies, we used the human disease ontology identifier (DOID) curated in ReDU, a controlled-vocabulary metadata for public metabolomics datasets. Samples from individuals with inflammatory bowel disease, Kawasaki disease, and dental caries were characterized by high detection frequencies of antibiotics (Fig. S5a). Skin swabs from patients with psoriasis - who are at increased risk for fungal infections - were characterized by the presence of antifungal agents. Samples from people with human immunodeficiency virus (HIV) showed a high frequency of antivirals, while samples from individuals with Alzheimer's disease contained cardiology and neurology/psychiatry drugs, aligning with expected medication use in this aging population given the well-established link between Alzheimer's and cardiovascular disease (Fig. S5a).
To investigate drug exposures among different age and sex groups, we profiled 1993 fecal samples from the American Gut Project, with participants from the United States (US), Europe, and Australia, aged 46 ± 18 years (range 3-93; 53% female). A total of 89 different drugs were detected; the most frequently detected pharmacologic classes included histamine-1 receptor antagonist (allergy), angiotensin II-receptor blocker (cardiology), ACE inhibitor (cardiology), beta-adrenergic receptor inhibitor (cardiology), statin (lipid-lowering), non-steroidal anti-inflammatory drug (NSAID; analgesics), and selective serotonin reuptake inhibitor (SSRI; antidepressant), which matches the most commonly prescribed drug classes in these regions (Fig. 1g). There were more drugs per individual noted in the US cohort compared to the European and Australian cohorts (chi-square test; χ² (8, n = 1903) = 44, p = 5.6 × 10⁻⁷, Fig. S5b). When connected with age and sex information, the drug detection agrees with the expected usage patterns of different drug classes (Fig. 1h). For example, cardiovascular drugs were detected more frequently with increasing age, while analgesics, antihistamines, and antibiotics were detected across all ages. We also observed that analgesics, such as NSAIDs and paracetamol, were more frequently detected in females (chi-square test; χ² (1, n = 1958) = 15.4, p = 8.5 × 10⁻⁵), consistent with the literature, and that drugs for sexual dysfunction were detected predominantly in males. Overall, empirical drug readout using untargeted metabolomics, facilitated by the GNPS Drug Library, demonstrated good specificity among individuals with different diseases, ages, and sexes. We note that the re-analysis of public metabolomics datasets is based on MS/MS matches, thus corresponding to level 2/3 confidence according to the 2007 Metabolomics Standards Initiative.
Owing to practical challenges in accessing sample extracts for the >20 studies discussed in the above analyses, we were not able to further validate the annotations with analytical standards.
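As a consistency check on the chi-square tests reported above, the upper-tail probability can be recomputed from the stated statistic and degrees of freedom. The sketch below uses only the Python standard library and closed forms that hold for one degree of freedom or an even number of degrees of freedom. The function name is an assumption, and the inputs are the rounded statistics from the text, so the results may differ slightly from the reported p-values.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable, using closed forms.

    df == 1:   erfc(sqrt(x / 2))
    even df:   exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
    Other odd df are not handled in this sketch.
    """
    half = x / 2.0
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df % 2 == 0:
        series = sum(half**k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * series
    raise ValueError("this sketch handles only df = 1 or even df")


# Rounded statistics quoted in the text.
p_cohort = chi2_upper_tail(44.0, 8)   # drugs per individual by region
p_sex = chi2_upper_tail(15.4, 1)      # analgesic detection by sex
print(f"{p_cohort:.2e}", f"{p_sex:.2e}")
```

With these rounded inputs, the two tail probabilities come out on the order of 10⁻⁷ and 10⁻⁵, respectively.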
The GNPS Drug Library can enable the discovery of previously uncharacterized drug metabolites. To illustrate this, we analyzed fecal samples from the HIV Neurobehavioral Research Center (HNRC) cohort (n = 322; age 55 ± 12 years), which included both people with HIV (n = 222) and people without HIV (n = 100). Among the 17,729 unique MS/MS spectra detected, 643 were annotated with the GNPS Drug Library. After removing drugs that could be from endogenous or food sources (because we cannot assess whether they were given as a medication) and grouping annotations of drugs, metabolites, or analogs, 175 unique drugs remained. Antiretroviral drugs (ARVs; drugs for the treatment of HIV), cardiovascular drugs, and drugs for anxiety and depression were the most frequently detected categories (Fig. 2a, S6a). Although antiretroviral therapy (ART) - the combination of multiple ARVs to treat HIV - has led to high rates of viral suppression, individuals living with HIV continue to experience disproportionately high rates of depression and cardiovascular disease. This is reflected in the frequent detection of antidepressants and cardiovascular drugs in the analyzed samples.
Interestingly, 38% of the drugs were annotated together with their metabolites or analogs. The drug metabolites/analogs often co-occurred with the parent drugs, validating the relevance of drug metabolites/analog detections to exposures of the drugs (Fig. 2a). For example, darunavir (an ARV) was observed with 22 analogs (Fig. S6b). Peak shape analysis indicated that five of the darunavir analogs are adducts or in-source fragments (as judged by overlapping retention times), while the others represent unknown metabolites or derivatives of this drug (Fig. S6c, d). For the analogs that are not in-source fragments, 78-100% (median 98%) of their occurrences were together with the darunavir parent drug (Fig. S6b). The detection of darunavir analogs without the parent drug likely reflects complete metabolism at the time of sample collection, which, again, underscores the value of including drug metabolites and analogs in untargeted metabolomic screening to enhance the sensitivity and accuracy of drug exposure detection.
To further investigate the potential metabolic sources of the observed drug analogs, we cultured darunavir and 12 other drugs with a defined and complex synthetic microbial community of 111 bacterial species commonly found in the human gut. To focus on drug analogs that are most relevant to this cohort, we included all drugs observed with three or more metabolites/analogs that were present in >10% of samples (10 drugs in total). Omeprazole, loratadine, and terbinafine were additionally included because their analogs were frequently observed in samples without the respective parent drugs (Fig. 2a, Supplementary Data 3). Shared analogs were observed for 11 of the 13 drugs between the fecal samples and the microbial incubations. Among them, metabolites/transformation products were observed for 5 drugs (ritonavir, atorvastatin, abacavir, quetiapine, and omeprazole; Fig. 2b, S7), while the rest of the analogs were adducts or in-source fragments based on peak shape correlations. The ritonavir, atorvastatin, abacavir, and quetiapine analogs increased in intensity with increased microbial incubation time (Fig. S7a-f), indicating gut microbial metabolism as a possible origin of the analogs and consistent with their observation in fecal samples. The omeprazole analog (m/z 330.127) appeared to be an abiotic transformation product because it was already present in the cultures at t = 0, and its intensity decreased with increased incubation time (Fig. S7g-i). This is consistent with the fast activation of omeprazole (m/z 346.122), a proton-pump inhibitor and a prodrug, to the reactive sulphenamide product (m/z 330.127) at low pH. Rapid photolysis and hydrolysis of omeprazole have also been reported in abiotic environments with a major deoxygenation transformation product (m/z 330.127).
The GNPS Drug Library enables sample stratification based on drug profiles, facilitating the discovery of connections between drug exposures and metabolic changes. N-acyl lipids are a class of signaling molecules made by host-associated microbiota that play important roles in the immune system, memory function, and insulin regulation of the human body. Our recent work found that levels of histamine N-acyl lipids varied with HIV serostatus. Specifically, we observed elevated levels of histamine-C2:0, histamine-C3:0, and histamine-C6:0 in people living with HIV compared to those without HIV in the HNRC cohort. To explore whether these differences were associated with drug exposures, we further stratified the samples based on their ARV exposure profiles. As anticipated, the ARV profiles aligned closely with HIV serostatus (Fig. S8a). High intensities of different ARVs were observed in fecal samples from people with HIV, while ARVs were only occasionally observed in people without HIV with low intensities. ARVs observed in people without HIV include tenofovir, atazanavir, and raltegravir, which are commonly prescribed for HIV prophylaxis (Fig. S8a). To control for HIV serostatus and investigate the effects of ARV exposure, we focused on samples from people with HIV (n = 222) and further stratified them based on ARV co-occurrences. Hierarchical clustering revealed four distinct ARV exposure groups that corresponded well with common combination antiretroviral therapy (cART) regimens (Fig. 2c). For instance, Group 1 (n = 47), characterized by lamivudine, abacavir, and dolutegravir exposures, corresponded to the dolutegravir/abacavir/lamivudine treatment regimen. Group 2 (n = 64) with emtricitabine, darunavir, ritonavir, and cobicistat exposures, agreed with the darunavir/ritonavir regimen and the darunavir/cobicistat/emtricitabine/tenofovir regimen. 
Group 3 (n = 52), characterized by emtricitabine and bictegravir exposures, may be related to the bictegravir/emtricitabine/tenofovir treatment regimen. Group 4 (n = 59) exhibited either only emtricitabine exposure or no detectable ARV, potentially due to poor adherence, severe comorbidities, HIV elite control, or ARVs that are not included in the GNPS Drug Library or not amenable to LC-MS/MS detection (Fig. 2c).
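Stratification by ARV co-occurrence can be sketched with a small agglomerative clustering of per-sample drug sets. The text does not state the distance metric or linkage rule used, so the Jaccard distance and average linkage below are assumptions. The sample IDs and toy profiles are likewise hypothetical, and the drug combinations simply echo the regimens named above.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty profiles are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster_profiles(profiles, n_clusters):
    """Average-linkage agglomerative clustering of samples by ARV sets.

    profiles: dict mapping sample ID -> set of detected ARVs.
    Returns a list of clusters, each a list of sample IDs.
    """
    clusters = [[sample] for sample in profiles]

    def linkage(c1, c2):
        total = sum(
            jaccard_distance(profiles[x], profiles[y]) for x in c1 for y in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy profiles (hypothetical samples).
toy_profiles = {
    "s1": {"lamivudine", "abacavir", "dolutegravir"},
    "s2": {"lamivudine", "abacavir", "dolutegravir"},
    "s3": {"emtricitabine", "bictegravir"},
    "s4": {"emtricitabine", "bictegravir"},
    "s5": {"emtricitabine", "darunavir", "ritonavir", "cobicistat"},
    "s6": {"emtricitabine", "darunavir", "cobicistat"},
}
groups = cluster_profiles(toy_profiles, n_clusters=3)
print(groups)
```

This pairwise implementation is written for readability and scales poorly. For cohort-sized data a library routine such as SciPy's hierarchical clustering would be the more practical choice.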
Notably, histamine-C2:0 - previously associated with HIV serostatus - along with eight other N-acyl lipids, showed significant differences across these four ARV exposure groups (Kruskal-Wallis test, p-value < 0.05; see specific p-values in Fig. 2d). This suggests that variation in ARV exposure, beyond HIV serostatus alone, may influence the levels of microbial-derived N-acyl lipids. We emphasize that these patterns would likely remain obscured without empirical drug readouts from untargeted metabolomics. Clinical metadata may not document exposures to individual drugs and often do not provide quantitative information on the exposure levels. For example, metadata for the HNRC cohort on current ARV usage, which is based on self-reports, documented drug usage as "ARV-naïve" (never received ARV), "no ARV" (no current ARV use), "non-HAART" (currently using fewer than three ARVs), and "HAART" (currently using three or more ARVs). Based on these classifications, no significant differences were observed for the N-acyl lipids detected in these samples (Fig. S8b). Without the empirical drug readout, enabled by the GNPS Drug Library, the effects of drugs on microbial N-acyl lipid levels would be overlooked.
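For readers unfamiliar with the Kruskal-Wallis test used to compare lipid levels across the four exposure groups, the following standard-library sketch computes the H statistic from pooled ranks and an asymptotic p-value from the chi-square distribution with k - 1 degrees of freedom. It omits the tie correction and uses toy values, so it illustrates the test rather than reproducing the analysis. All function names are assumptions.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for chi-square with integer df, via closed-form series."""
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups of numeric values (no tie correction)."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3.0 * (n + 1)
    return h, chi2_upper_tail(h, len(groups) - 1)


# Toy intensities (not study data) for three well-separated groups.
h_stat, p_value = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h_stat, p_value)
```

In practice a maintained implementation such as `scipy.stats.kruskal`, which also applies the tie correction, would be preferable to this hand-rolled version.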