Genome-wide mapping of RNA-protein associations through sequencing - Nature Biotechnology

Despite significant advances, identifying the entire network of RNA-protein associations remains a formidable challenge. The human genome comprises ~40,000 annotated genes, with ~20,000 protein-coding and ~20,000 non-coding genes. Generating a comprehensive pairwise RNA-protein interaction map is daunting due to the vast number of potential interactions, estimated at 40,000 × 20,000 candidate pairs, not even accounting for splicing variants. Existing technologies are limited in capacity, typically focusing on one RNA or RBP at a time and operating on a 'one-to-many' mapping scale. In response to this limitation, we propose a 'many-to-many' mapping strategy with PRIM-seq (protein-RNA interaction mapping by sequencing), which concurrently identifies RNA-associating proteins and their associated RNAs at the genome scale. PRIM-seq does not require specific reagents to target individual proteins or RNAs, allowing for the de novo identification of RNA-protein associations.

The fundamental concept behind PRIM-seq is to transform each RNA-protein association into a chimeric DNA sequence, where a part of this chimeric sequence reflects the protein and the other part the associated RNA. These chimeric DNA sequences are sequenced by high-throughput sequencing and used for decoding the RNA-protein associations. These chimeric sequences are created by two steps. First, a protein library is created where each protein is conjugated with its own mRNA (Fig. 1a and Extended Data Fig. 1a). Second, mRNA is converted to cDNA, creating a library of cDNA-labeled proteins. These proteins are allowed to interact with the transcriptome retrieved from the cells of interest. Protein-associated RNA is ligated with the cDNA of the associating protein, creating a cDNA-RNA chimeric sequence. These cDNA-RNA chimeric sequences are converted to DNA sequences for sequencing (Fig. 1b).

PRIM-seq comprises two experimental modules and a bioinformatics module. The first experimental module utilizes the recently developed SMART-display technology, which labels tens of thousands of proteins with their own mRNA to generate a library of RNA-labeled proteins (Fig. 1a). Briefly, SMART-display extracts mRNAs from input cells, removes the poly(A) tails, appends a puromycin-conjugated linker to the 3' ends of these mRNAs and translates the mRNAs to yield proteins that are covalently linked with the mRNAs (mRNA-linker-protein). A key advantage of SMART-display is its capability to create the entire library of mRNA-labeled proteins in a one-pot procedure, eliminating the need for RNA-, gene- or protein-specific reactions.

The second experimental module is called REILIS (reverse transcription, incubation, ligation and sequencing). REILIS transforms each RNA-protein pair into a chimeric sequence that includes an 'RNA-end read' and a 'protein-end read' representing the associating RNA and protein, respectively (Fig. 1b, Extended Data Fig. 1 and Fig. 2). REILIS takes two inputs: a library of mRNA-labeled proteins generated by SMART-display and an RNA library. REILIS converts each protein's mRNA label into a cDNA label (Fig. 2a,b), incubates the protein library with an RNA library to allow for RNA-protein association (Fig. 2c), ligates the RNA and the cDNA label of each RNA-protein pair into a chimeric sequence to form an RNA-linker-cDNA structure (Fig. 2d) and subjects this chimeric sequence to paired-end sequencing (Fig. 2e-i and Extended Data Fig. 1). The cDNA fraction of this chimeric sequence reflects the protein (protein end), and the RNA fraction reflects the associated RNA (RNA end; Fig. 2i). We note that REILIS is designed to capture RNA-protein pairs from RNA-protein complexes, which are crosslinked by formaldehyde; therefore, REILIS (and PRIM-seq) is not intended exclusively for detecting direct protein-RNA binding.

The bioinformatics module resolves the RNA-protein pairs from the chimeric sequences (Fig. 1c). This module performs three main functions. First, it identifies the gene pair mapped with the two reads of a read pair. Second, it determines the protein end and the RNA end. Our experimental design ensures that any sequencing read from the 5' end is either the sense strand of the RNA or the antisense strand of the cDNA (Extended Data Fig. 2a). Leveraging this fact, the bioinformatics module assigns the read mapped to the sense strand as the RNA end and the read mapped to the antisense strand as the protein end. Read pairs where both reads are mapped to the sense or the antisense strand are excluded from downstream analysis. The retained read pairs are termed 'chimeric read pairs.'

The third function of the bioinformatics module is to identify associating RNA-protein pairs using statistical tests. The input for these tests is the set of chimeric read pairs. The null hypothesis for each gene pair is that the presence of reads originating from one gene is independent of the presence of reads originating from the other gene. A chi-square test is performed for each gene pair (Extended Data Fig. 2b), and Bonferroni-Hochberg (BH) correction is applied to account for multiple hypothesis testing. An RNA-protein pair is identified when the BH-corrected P value is less than 0.05 and the number of read pairs mapped to this gene pair is at least four times the expected number of read pairs (number of read pairs >4X, where X is the expected number of read pairs). We have implemented these data processing and statistical analyses into an open-source software package called PRIMseqTools (Extended Data Fig. 2c), which is available at PRIMseqTools GitHub repository.

To derive an RNA-protein association network from human cells, we generated PRIM-seq libraries from human embryonic kidney cells (HEK293T) and human lymphoblast cells (K562) (Supplementary Table 1). Comparison of biological replicate libraries within HEK293T and K562 revealed overlaps between replicate libraries (odds ratio (OR) > 2,241.3, P < 1 × 10, chi-square test; Supplementary Fig. 1). Furthermore, the overlaps increased as the threshold for calling RNA-protein associations was raised (Supplementary Fig. 1), indicating good reproducibility of these replicate libraries.

We merged the sequencing data from all the PRIM-seq libraries into a single PRIM-seq dataset. Applying PRIMseqTools to this dataset revealed 365,094 RNA-protein associations (BH-corrected P < 0.05 and number of read pairs >4X), involving 11,311 proteins and 7,248 RNAs (Fig. 3a and Supplementary Fig. 2a,b). We call this network the Human RNA-Protein Association Network (HuRPA) and refer to its involved proteins and RNAs as HuRPA proteins and HuRPA RNAs. We developed a web interface to query, visualize and download HuRPA (https://sysbiocomp.ucsd.edu/prim/).

We initiated our analysis by examining specific RBPs on selected target RNAs. RPS3, FUS, SRSF1 and TAF15 displayed highly concordant binding profiles between PRIM-seq and eCLIP data, revealing both broad similarities and site-specific nuances (Extended Data Fig. 3). For instance, on the TARDBP RNA, eCLIP data indicate that all four RBPs except SRSF1 bind at the 5' end of exon 6 (arrow 1; Extended Data Fig. 3a) and exon 5 (arrow 2), whereas only FUS binds to exon 3 (arrow 3). PRIM-seq successfully replicated these distinct binding patterns, illustrating its ability to resolve fine-scale differences in RNA-protein interactions. However, eCLIP also detected a site bound by RPS3 that PRIM-seq did not identify, suggesting a potential false negative (arrow 4; Extended Data Fig. 3a). These examples underscore PRIM-seq's capacity for site-level resolution while acknowledging its limitations.

We tested whether the HuRPA proteins are enriched with any Gene Ontology (GO) annotations. 'RNA processing' (GO:0006396), 'cytoplasmic stress granule' (GO:0010494) and 'translation factor activity, RNA binding' (GO:0008135) emerged as the most enriched BP, CC and MF terms, respectively (false discovery rate (FDR) = 2.99 × 10, 2.09 × 10, and 7.99 × 10, Fisher's exact test) (Fig. 3b and Extended Data Fig. 4). The RNA processing-associated HuRPA proteins included RNA splicing factors, RNA metabolism proteins, RNA processing factors, RNA methyltransferases, ribosomal proteins and RNA helicases (Extended Data Fig. 4c). Cytoplasmic stress granules are membraneless organelles composed of proteins and RNAs (Extended Data Fig. 4d,e). These data highlight RNA-protein association as the most prominent characteristic of HuRPA among all functional annotations.

We compared HuRPA proteins with database-documented RBPs. We compiled RBPs from six databases (RBP2GO, RBPDB, ATtRACT, hRBPome, RBPbase and starBase), resulting in 2,311 database-documented RBPs (Supplementary Fig. 2c). Most of these (2,137, 92.5%) are HuRPA proteins, representing 18.9% of the 11,311 HuRPA proteins and suggesting an overlap (OR = 17.0, P = 3 × 10, chi-squared test; Fig. 3c). Protein abundance does not explain this overlap (Supplemental Note 1 and Supplementary Fig. 2f). Thus, HuRPA recovers most database-documented RBPs.

We explored whether any HuRPA proteins outside the database-documented RBPs (undatabased HuRPA proteins) have been detected as candidate RBPs by recent technologies. Among the 9,174 undatabased HuRPA proteins, 764 were captured by peptide crosslinking and affinity purification (pCLAP) (OR = 8.1, P = 3.1× 10, chi-squared test) and 129 by RBDmap (OR = 18.2, P = 1.1 × 10, chi-squared test), revealing enrichments of the undatabased HuRPA proteins in the candidate RBPs detected by recent technology (Supplementary Fig. 2d and Supplementary Table 2). These data corroborate the idea that PRIM-seq can detect previously uncharacterized RBPs.

We compared database-documented RNA-protein association (RPA) pairs with HuRPA's RNA-protein association pairs (HuRPA RPAs). Using the RPAs documented in the RNAInter database as the reference set, HuRPA RPAs exhibited higher precision and recall than randomly sampled RPAs (Supplementary Fig. 3a). Furthermore, as we increased the read count threshold for calling RPA from the PRIM-seq data, the resulting subnetworks of HuRPA exhibited larger precision, that is a larger fraction of the subnetwork being RNAInter RPAs (Supplementary Fig. 3a). Changing the reference set to RPAs detected by iCLIP (iCLIP RPAs) or HITS-CLIP (HITS-CLIP RPAs) (Supplementary Table 2) revealed similar outcomes, where HuRPA RPAs exhibited higher precision and recall than randomly sampled RPAs (Supplementary Fig. 3b,c). These data suggest a consistency between HuRPA RPAs and previously identified RPAs.

For example, the PTBP1 protein was associated with 63 target RNAs in HuRPA. Among these, 31 overlap with the RNAInter database (OR = 7.8, P = 5.6 × 10, Fisher's exact test; Extended Data Fig. 5a). A de novo motif search yielded two motifs, the higher-ranking of which (CUCUCUCUGG' -- matches the known PTBP1 binding motif (Extended Data Fig. 5b-e). These results confirm that PRIM-seq can detect established PTBP1 binding patterns and also capture additional targets not yet recorded in RNAInter.

To further validate PRIM-seq, we compared our results with 76 eCLIP datasets(Supplementary Fig. 4). Of these, 66 RBPs are also identified by HuRPA, and 63 (95.5%) exhibit an overlap in target RNAs (FDR < 0.1). Moreover, 56 (84.8%) show highly notable overlap (FDR < 0.01). These benchmarks underscore the reliability of PRIM-seq in capturing biologically meaningful RNA-protein interactions.

HuRPA exhibits a scale-free topology, where the number of proteins is negatively correlated with the number of their associated RNAs (Fig. 3d), and conversely, the number of RNAs is inversely correlated with the number of their associated proteins (Fig. 3e). We asked whether a HuRPA protein with a higher degree -- that is, associated with more RNAs -- is more likely to be detected by other methods. Database-documented RBPs exhibited higher degrees in HuRPA than the other proteins (P = 1.4 × 10, t-test, two-sided; Fig. 3f). Protein abundance does not explain this difference (Supplementary Fig. 2g). Furthermore, among the undatabased HuRPA proteins, those detected by either pCLAP or RBDmap exhibited higher degrees in HuRPA than the remaining undatabased HuRPA proteins (P = 4.8 × 10 for pCLAP, P = 1.1 × 10 for RBDmap, t-test, two-sided; Supplementary Fig. 2e). Thus, the more associated RNAs a protein has in HuRPA, the more likely it is detected by another technology.

Moreover, we obtained highly connected subnetworks of HuRPA by removing the HuRPA proteins with small degrees (that is, small numbers of associated RNAs). As the degree threshold for removing proteins increased, the subnetworks exhibited higher precision for any given recall, and higher recall for any given precision, compared to RNAInter RPAs as the reference set (Supplementary Fig. 3a). Changing the reference set from RNAInter RPAs to iCLIP RPAs and HITS-CLIP RPAs led to similar results (Supplementary Fig. 3b,c). Thus, the HuRPA proteins with more associated RNAs are more likely detected by other methods.

When we designed PRIM-seq, we did not anticipate a strong correlation between PRIM-seq reads and the RNA-binding domains (RBDs) within proteins. Nevertheless, we compared PRIM-seq's protein-end reads with the protein sequences. HuRPA includes 1,831 RBD-containing proteins, with a total of 3,702 RBDs. These RBDs represent 2.1% of the total length of the mature mRNAs of these RBD-containing proteins. 13.9% of the protein-end reads were mapped to these RBDs, indicating an enrichment of PRIM-seq protein-end reads originating from RBDs (P = 4.1 × 10, binomial test, one-sided; Fig. 4a). For example, the heterogeneous nuclear ribonucleoprotein R (HNRNPR) has four RBDs, which align with regions of abundant protein-end reads (Fig. 4b).

HuRPA proteins contain seven classes of RBDs: RNA-recognition motifs (RRMs), pseudouridine synthase and archaeosine transglycosylase (PUA) domains, like Sm (LSm) domains, K homology (KH) domains, cold-shock domains (CSDs), DEAD domains and ribosomal protein-S1-like (S1) domains (Supplementary Table 3). Each RBD class showed enrichment in PRIM-seq's protein-end reads (largest P = 8.0 × 10, binomial test, one-sided; Fig. 4c,d and Supplementary Table 3).

RNA sequences bound by an RBD often exhibit conserved sequence patterns, known as RBD-binding motifs. We focused on RRMs for subsequent analysis, as RRM is the largest RBD class and the most enriched with protein-end reads (P = 8.6 × 10, binomial test, one-sided). We retrieved the RNA-end reads that paired with the RRM-mapped protein-end reads. De novo motif finding revealed 13 RNA sequence motifs (motif length = 10, BH-corrected P < 0.05), three of which matched previously reported RNA-recognition sequences for RRM (Fig. 4e). These data suggest that PRIM-seq's protein-end and RNA-end reads are enriched with RBDs and RBD-binding motifs, respectively.

PRIM-seq protein-end reads exhibit a bias toward the 5' end (Supplementary Fig. 5a), as expected due to the random priming strategy used in cDNA synthesis. To determine whether this 5' bias affects the inferred RBD distributions toward protein N-termini, we analyzed known RBDs from the RNA-binding Proteins Database (RBPDB). Density plots comparing HuRPA protein-end reads within these RBDs against the RBPDB background did not reveal a significant N-terminal enrichment (Supplementary Fig. 5b). This absence of bias remained consistent across individual RNA-recognition motifs, including RRMs and S1 domains (Supplementary Fig. 5c-j). Thus, despite the inherent 5' bias in PRIM-seq protein-end reads, our analyses suggest that PRIM-seq does not substantially introduce positional bias into inferred RBDs.

We set out to validate previously uncharacterized RNA-protein associations (RPAs) in HuRPA, beginning with the long intergenic non-coding RNA (lincRNA) LINC00339. LINC00339 is the HuRPA RNA with the largest number of associated proteins, making it the largest RNA hub (Fig. 3e,g). Notably, LINC00339 has no previously reported interacting proteins. It is expressed in most human tissues and is associated with tumorigenesis and progression of various cancers. Given LINC00339's wide distribution across most subcellular compartments, including the cell membrane, we selected 15 LINC00339-associated proteins from HuRPA for validation, representing diverse subcellular compartments (green nodes in Fig. 3g, Supplementary Table 4).

We used the RNA-proximity ligation assay (RNA-PLA) for validation. RNA-PLA detects specific RNA-protein interactions by using an RNA probe targeting the specific RNA and an antibody recognizing the associated protein (Fig. 3h). As a sanity check, we reproduced the contrast of RNA-PLA signals between the U1 snRNA-Sm protein complex and the Sm antibody-only control in HEK293T cells (Supplementary Fig. 6).

We applied RNA-PLA to the 15 selected RNA-protein pairs: LINC0039 with IGF1R, IGF2R, CHRNA3, INSR, TYRO3, CD71, IL6ST, IL10RB, FGFR1, FGFR3, IFNAR1, IFNAR2, IFNGR1, PVR and TNFRSF21 (Fig. 3i, Extended Data Fig. 6a and Supplementary Table 4). We included four sets of controls: antibody-only controls (RNA probe not added); four RNA-protein pairs not included in HuRPA (LINC0039-CD40, LINC0039-CD32, LINC0039-LTBR and LINC0039-green fluorescent protein (GFP)); RNA probe-only control (antibody not added); and no-probe-no-antibody control (Fig. 3j and Extended Data Fig. 6b,c). The controls consistently exhibited few RNA-PLA foci (Fig. 3j, Extended Data Fig. 6b,c), while each tested RNA-protein pair showed more RNA-PLA foci per cell than the controls (smallest P = 1.1 × 10 for IFNAR1, largest P = 0.044 for FGFR3, t-test, two-sided; Fig. 3k and Extended Data Fig. 6). For example, the LINC0039-IGF2R pair exhibited an average of 16.3 RNA-PLA foci per cell, which is 69-fold higher than the antibody-only control (P = 4.5 × 10, t-test, two-sided) (Fig. 3i-k). These data suggest that previously uncharacterized RNA-protein associations in HuRPA can be validated using alternative methods.

We further validated our PRIM-seq results by conducting RIP-seq experiments on four additional HuRPA proteins: SMC1A, SMC3, RAD21 -- none of which had previously been characterized as RBPs -- and HDAC2, a noncanonical RBP (Supplementary Table 5). Using IgG as a control, RIP-seq identified 300, 301, 291 and 112 target RNAs for these proteins, respectively. Each set of targets showed notable overlap with those found by PRIM-seq, as indicated in association tests (largest P = 5.2 × 10, Fisher's exact test) (Extended Data Fig. 7a,b) and receiver operating characteristic (ROC) analysis (Extended Data Fig. 7c-j). Additionally, we assessed whether PRIM-seq can specifically identify the RNA targets of an individual protein from the broader pool of RNAs bound by other RBPs. Chi-square tests examining the distribution of PRIM-seq chimeric reads between each protein's RIP-identified target RNAs and other HuRPA RNAs revealed positive associations (Extended Data Fig. 7k,l). Furthermore, when the analysis was restricted to only these four proteins (SMC1A, SMC3, RAD21 and HDAC2) and their respective RIP-seq-identified RNA targets, enrichment persisted (Extended Data Fig. 7m). Collectively, these results confirm PRIM-seq's capability to delineate specific RNA-protein interactions.

We proceeded to test an undatabased HuRPA protein that has not been documented in any RBP database. We chose phosphoglycerate dehydrogenase (PHGDH) for validation, as it is an undatabased HuRPA protein and a hub protein associated with a large number of RNAs (Figs. 3d and 5a, and Extended Data Fig. 8a). Additionally, PHGDH appears to have conflicting roles compared to its known function as an enzyme in the serine synthesis pathway in some neural systems, suggesting the possibility of an uncharacterized function.

Consistent with PRIM-seq data, a proteome-wide RNA-binding domain (RBD) screening by peptide-mass spectrometry (RBDmap) reported two candidate RBDs within the PHGDH protein (RBD1 and RBD2, Fig. 5b). More than 60% of PRIM-seq's protein-end reads mapped to PHGDH formed a peak (36,210 in-peak reads/59,986 total reads), pinpointing RBD1 (AA 22-33) (P = 2.6 × 10, binomial test, one-sided; Fig. 5b). These data highlight the consistency between PRIM-seq, a DNA-sequencing-based technology, and RBDmap, a peptide-mass spectrometry-based technology.

To further test if PHGDH is an RNA-associating protein, we performed RNA immunoprecipitation followed by sequencing (RIP-seq) on PHGDH with IgG as the control in HEK293T cells in two biological replicates (Supplementary Table 5). The RIP-seq data revealed 107 PHGDH-associated RNAs, 54 of which overlapped with PHGDH-associated RNAs in HuRPA, showing a strong enrichment (OR = 42.3, P = 4.1 × 10⁻, Fisher's exact test; Fig. 5c and Extended Data Fig. 8b). Moreover, as we increased the threshold for calling PHGDH-associated RNAs from RIP-seq, the degree of overlap with PHGDH-associated RNAs in HuRPA, indicated by the OR, also increased (Extended Data Fig. 8c). Conversely, PHGDH-associated RNAs with higher read counts in HuRPA were more often detected by RIP-seq (P = 0.034, t-test, two-sided; Extended Data Fig. 8d). These RIP-seq data corroborate PHGDH's RNA association ability and validate a subset of PHGDH-associated RNAs identified by PRIM-seq.

The mRNAs of Beclin-1 (BECN1) and activating transcription factor 4 (ATF4) are among the PHGDH-associated RNAs in both HuRPA and the RIP-seq identified targets (Fig. 5c and Extended Data Fig. 8b). Considering BECN1 and ATF4's roles in regulating autophagy and apoptosis, we specifically tested the association of these two mRNAs with PHGDH. RIP followed by quantitative PCR (RIP-qPCR) confirmed the co-precipitation of BECN1 mRNA (P = 0.018, t-test, two-sided) and ATF4 mRNA with PHGDH (P = 1.2 × 10 for ATF4, t-test, two-sided; Fig. 5d). Additionally, we used RNA-PLA to test the ATF4 mRNA-PHGDH association. The ATF4 mRNA-PHGDH pair exhibited 18- to 85-fold more PLA foci compared to the three controls lacking the RNA probe (P = 5.3 × 10), the antibody (P = 3.0 × 10), or both the RNA probe and the antibody (P = 8.6 × 10, t-test, two-sided, Fig. 5e, Extended Data Fig. 8e-h). This data further supports the association of PHGDH protein with BECN1 and ATF4 mRNAs.

We tested whether PHGDH modulates either the mRNA or protein levels of BECN1 and ATF4. Two PHGDH-targeting siRNAs (si-1, si-2) reduced PHGDH levels by approximately 50% compared to a scrambled siRNA (Control) in HEK293T cells (P = 0.018 for si-1, P = 0.014 for si-2, t-test, two-sided; Fig. 5f,g), confirming effective knockdown of PHGDH. Neither siRNA altered the mRNA levels of BECN1 or ATF4 (P > 0.05 for both mRNAs, t-test, two-sided; Fig. 5h,i). However, both siRNAs increased the protein levels of both BECN1 and ATF4 (P < 0.0011 for BECN1, P < 0.0039 for ATF4, t-test, two-sided; Fig. 5j-m).

The induction of ATF4 protein is expected to promote apoptosis and neurite outgrowth. Immunostaining analysis revealed increased activated Caspase-3 (aCaspase3), an apoptosis marker, in both knockdowns compared to the scramble control in HEK293T cells (P = 2.1 × 10 for si-1, P = 9.7 × 10 for si-2, t-test, two-sided; Extended Data Fig. 9e,f) and also in mESCs (P = 0.031 for si-1, P = 0.0099 for si-2, t-test, two-sided; Extended Data Fig. 10f,g). Furthermore, Sholl analysis of the morphology of mNSCs revealed longer dendrites and more dendritic crossings in the knockdowns compared to the scramble control (P = 0.018 for si-1, P = 0.0071 for si-2, Kolmogorov-Smirnov test, two-sided, Extended Data Fig. 10h-j). Thus, reducing PHGDH promotes apoptosis in both cell types and neurite outgrowth in mNSCs.

To determine whether these protein level changes of BECN1 and ATF4 can be attributed to PHGDH's enzymatic function, we overexpressed wild-type (WT) PHGDH and an enzymatically dead (ED) variant of PHGDH in HEK293T cells (Fig. 5n,o). The ED variant abolishes PHGDH's enzymatic function with an R236Q mutation on the active site. Overexpression of both WT and ED PHGDH reduced the protein levels of BECN1 and ATF4 (P < 0.011 for BECN1, P < 0.014 for ATF4, t-test, two-sided; Fig. 5r-u) without discernible changes in their mRNA levels (P > 0.05 for both BECN1 and ATF4, t-test, two-sided, Fig. 5p,q). These overexpression data corroborate PHGDH's suppressive roles in BECN1 and ATF4 protein expression and suggest this role is independent of PHGDH's enzymatic function. Taken together, these examples highlight PRIM-seq's ability to unveil uncharacterized RNA-protein associations.

Genome-wide mapping of RNA-protein associations through sequencing - Nature Biotechnology

POPULAR CATEGORY

misc

entertainment

corporate

research

wellness

athletics