Evolutionary origins of FAM222: duplication of genes in gnathostomes
Despite the growing interest in FAM222A and its medical relevance, its phylogenetic history remains unexplored. Furthermore, no homologous domains have been identified within the amino acid sequence, raising fundamental questions about its evolutionary origins and diversification. To address this, we investigated the evolutionary history of FAM222A and FAM222B. We performed a BLASTp search using the human FAM222A (hFAM222A) sequence as a query. This search retrieved 2208 protein sequences annotated as FAM222A and FAM222B, which were exclusively found in 682 vertebrate species. We then selected a representative subset consisting of 20 FAM222A and 18 FAM222B sequences from various vertebrate lineages. To further investigate the evolutionary relationships among these sequences, we performed a multiple sequence alignment (MSA) (Supplementary Fig. 1) followed by phylogenetic reconstruction using the maximum likelihood method (Fig. 1A). The resulting phylogeny revealed that FAM222A and FAM222B cluster into two distinct clades within gnathostomes (jawed vertebrates), supporting the hypothesis that these proteins are paralogs that originated from a gene duplication event in a common ancestor of gnathostomes.
Furthermore, sequence identity analysis showed that hFAM222A and human FAM222B (hFAM222B) share only 34.3% identity, indicating that these proteins have undergone significant evolutionary divergence (Fig. 1B). Additionally, we found that FAM222A maintains 64.6% identity from zebrafish to humans, while FAM222B retains 58% identity over the same evolutionary range.
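As an illustration of how pairwise identity values such as those above can be obtained from an alignment, the following minimal sketch computes percent identity between two rows of an MSA. It assumes that identity is counted over columns in which both sequences carry a residue. The text does not state which denominator was used, and other conventions (alignment length, shorter sequence length) give different values. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: only columns where BOTH sequences have a residue are compared.
    Other tools may divide by the full alignment length instead.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns without a gap in either sequence
    matches = 0   # compared columns with identical residues
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

Applied to every pair of rows in the alignment, this yields an identity matrix of the kind summarized in Fig. 1B.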
Interestingly, our phylogenetic analysis identified two FAM222-like proteins that form an outgroup at the base of the vertebrate tree, corresponding to cyclostomes (jawless vertebrates), specifically lampreys and hagfish (Fig. 1A). These sequences exhibit 28% and 26.6% identity with hFAM222A, respectively, and 25.6% and 26% identity with hFAM222B, while sharing 40% identity between themselves (Fig. 1B). This indicates that the cyclostome FAM222-like proteins are single-copy orthologs of the ancestral gene that later duplicated in gnathostomes.
Taken together, our results indicate that FAM222A and FAM222B originated from an ancestral gene that was already present before the divergence of cyclostomes and gnathostomes. The subsequent gene duplication in gnathostomes may have enabled the functional specialization of FAM222 proteins and thereby contributed to lineage-specific adaptations in jawed vertebrates.
The conserved regions in proteins can reveal essential biological functions and provide insight into specific evolutionary pressures. To identify functionally relevant regions within the FAM222 family, we analyzed the conservation of their sequences in vertebrates. Previous studies have reported that hFAM222A contains a proline-rich region (PRR), while hFAM222B possesses a glycine-rich region (GRR) in the central part of the protein. To further define these regions, we analyzed the amino acid sequences of FAM222 proteins with the scanProsite server. Our results showed that hFAM222A contains a PRR of 152 amino acid residues (147-299), whereas hFAM222B has a glutamine-rich region (QRR), rather than a GRR, consisting of 72 amino acid residues (148-220) (Fig. 2A).
A more detailed examination identified a highly conserved region spanning 72 to 77 amino acid residues, present from hagfish to humans, with an identity of 62.5% (Fig. 2A and Supplementary Fig. 2). The strong evolutionary conservation of this region suggests a critical functional role. Additionally, a BLASTp search against the RefSeq Protein database (excluding vertebrates) yielded no hits, indicating that this amino acid region is an exclusive evolutionary signature of this lineage.
Given that hFAM222A has previously been characterized as an intrinsically disordered protein (IDP) through circular dichroism, we evaluated whether this conserved region also exhibited structural disorder. To address this, we used the DR-BERT tool to analyze disorder in the conserved amino acid sequences of hFAM222A (35-106) and hFAM222B (34-105), comparing them with their orthologs in cyclostomes, namely lamprey (1-77) and hagfish (44-117). The results indicated that this region has an average disorder percentage of 60%, with a more pronounced tendency in its second half, allowing it to be classified as an intrinsically disordered region (IDR) (Fig. 2B, gray-shaded segment). To further explore the conservation of its physicochemical properties, we assessed the hydropathy profile of this region using the Kyte-Doolittle scale. Two markedly hydrophobic segments were identified in the central and C-terminal regions, leading us to hypothesize that these hydrophobic patches could play a role in protein-protein interactions (Fig. 2C, gray-shaded segments). Since the remainder of the sequence is predominantly hydrophilic, we conducted a Grand Average of Hydropathy (GRAVY) analysis (Fig. 2D). All values obtained were negative, indicating a strong hydrophilic tendency. This property is consistent with the presence of solvent-exposed regions or a disordered conformation.
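The hydropathy analysis described above can be sketched as follows. The residue values are the published Kyte-Doolittle scale. The window size of 9 is an assumption, since the text does not state the window used, and the function names are hypothetical. GRAVY is the mean hydropathy over the whole sequence, so a negative value indicates a net hydrophilic sequence.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand Average of Hydropathy: mean Kyte-Doolittle value of a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(sequence: str, window: int = 9) -> list[float]:
    """Sliding-window mean hydropathy.

    Returns one value per full window, so the result has
    len(sequence) - window + 1 entries. The default window of 9 is an
    assumption, not a value taken from the text.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Peaks in the profile correspond to hydrophobic patches of the kind shaded in Fig. 2C, whereas `gravy` gives the single summary value per sequence reported in Fig. 2D.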
The combination of strong evolutionary conservation, structural disorder, and hydrophilic profile suggests that this region may play a key functional role in the FAM222 family. Specifically, its conservation from hagfish to humans supports the hypothesis that it could participate in molecular recognition-mediated interactions.
To gain deeper insight into the structural organization of the conserved region between humans and cyclostomes, we generated structural models using AlphaFold3. Although AlphaFold3 has been extensively validated for high-confidence predictions of globular protein structures, it has also proven to be a valuable tool for identifying intrinsically disordered regions, providing insights into their potential structural organization and conformational tendencies.
As a first step, we predicted the structures of hFAM222A and hFAM222B from their amino acid sequences. The results indicated that both proteins contain five α-helices embedded within extensive disordered regions, characterized by low pLDDT values (50-70), suggesting a flexible and intrinsically disordered behavior (Fig. 3A and Supplementary Videos 1 and 2).
Unexpectedly, we found that α-helix H1, present in both proteins, exhibited a pLDDT>90, indicating high confidence in its structural prediction. This structural motif is located at positions 42-53 in hFAM222A and 41-52 in hFAM222B, within the previously identified conserved region. To assess the conservation of this motif in basal vertebrates, we generated structural models of the FAM222-like proteins XP_032803296.1 from lamprey (laXP) and XP_067992272.1 from hagfish (haXP) (Supplementary Fig. 3 and Supplementary Videos 3 and 4). The results revealed that H1 is also present in both species with a pLDDT comparable to that of the human proteins, suggesting that this structural motif has been maintained throughout vertebrate evolution (Fig. 3B and C). In contrast, the remainder of the conserved region remains largely disordered, with pLDDT values between 50 and 70, characteristic of IDRs. However, we observed the formation of small α-helices at the end of the conserved region, except in hFAM222B, and the presence of a possible β-hairpin-like structure, except in hFAM222A (Fig. 3B and C). Given the low pLDDT values of these elements, it is likely that they represent transient structures that stabilize only under specific conditions.
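To illustrate how a high-confidence element such as H1 can be located within an otherwise low-confidence model, the following sketch scans a list of per-residue pLDDT values and returns contiguous segments above a threshold. The threshold of 90 follows the text. The input format (one value per residue, in sequence order), the `min_len` parameter, and the function name are assumptions made for illustration.

```python
def confident_segments(
    plddt: list[float], threshold: float = 90.0, min_len: int = 1
) -> list[tuple[int, int]]:
    """Return (start, end) residue ranges, 1-based and inclusive, where
    every residue has pLDDT strictly above `threshold`.

    Segments shorter than `min_len` residues are discarded.
    """
    segments = []
    start = None  # 1-based start of the segment currently being extended
    for position, value in enumerate(plddt, start=1):
        if value > threshold:
            if start is None:
                start = position
        elif start is not None:
            if position - start >= min_len:
                segments.append((start, position - 1))
            start = None
    # Close a segment that runs to the end of the sequence.
    if start is not None and len(plddt) - start + 1 >= min_len:
        segments.append((start, len(plddt)))
    return segments
```

Setting `min_len` to a few residues filters out isolated high-confidence positions, leaving only stretches long enough to form a secondary-structure element.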
On the other hand, the conservation of H1 and its high pLDDT in all analyzed species suggests that this structure is stable and plays a key functional role in FAM222 proteins. Its conservation in vertebrates, from cyclostomes to humans, raises the possibility that this α-helix has been retained due to selective pressures, possibly related to protein-protein interactions or molecular regulatory mechanisms.
Based on our previous analysis and findings, we have designated the conserved region within FAM222 proteins as DCD222. Given the possibility that this region mediates interactions with other proteins, we explored its interaction network using the STRING database, restricting our analysis to experimentally validated interactions.
Our results (Fig. 4A) revealed that hFAM222A physically interacts with seven proteins: five transcription factors from the MEIS/PBX families (MEIS1, MEIS2, PBX1, PBX2, and PBX3), a putative nucleotidyl transferase (MAB21L1), and a MAPK (NLK). Similarly, hFAM222B was found to interact with three additional proteins, an RNA helicase (DDX39A), a ubiquilin protein (UBQLN2), and a transcription factor (SOHLH1), as well as with NLK. To assess the relevance of these interactions, we used AlphaFold3 to model protein-protein and protein-DCD222 complexes, as well as to evaluate the effect of in silico deletion of DCD222 on the protein-protein interaction (Fig. 4B). Our findings indicated that the strongest interaction was established with NLK, as interface predicted template modelling (ipTM) values exceeded the 0.6 threshold for both hFAM222A and hFAM222B, while all other interactions fell below this cutoff. However, the ipTM scores for FAM222-NLK were within AlphaFold's gray zone (0.6-0.8), a range where predictions may be prone to false positives or inaccuracies in the predicted interaction sites. To analyze this interaction in more depth, we specifically evaluated the protein-DCD222 interactions. In this case, the ipTM scores for NLK rose above the gray zone, suggesting a more reliable interaction. Conversely, when DCD222 was deleted, the ipTM score for the FAM222-NLK interaction dropped below the 0.6 threshold, indicating that removal of DCD222 substantially weakens or abolishes the predicted binding to NLK in both hFAM222A and hFAM222B (Fig. 4B). These results indicate that DCD222 is essential for the predicted interaction with NLK, suggesting that this domain plays a pivotal role in the functional association of FAM222 proteins with this kinase.
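The ipTM interpretation used above reduces to a three-way classification, sketched below. The cutoffs of 0.6 and 0.8 are those given in the text. How values exactly on a boundary are treated is an assumption (both are placed in the gray zone here), and the function name and labels are hypothetical.

```python
def classify_iptm(iptm: float, low: float = 0.6, high: float = 0.8) -> str:
    """Classify an AlphaFold ipTM score relative to the 0.6-0.8 gray zone.

    Assumption: values equal to `low` or `high` count as gray zone.
    """
    if not 0.0 <= iptm <= 1.0:
        raise ValueError("ipTM must lie between 0 and 1")
    if iptm < low:
        return "below threshold"
    if iptm <= high:
        return "gray zone"
    return "above gray zone"


def summarize_models(scores: dict[str, float]) -> dict[str, str]:
    """Map each model name to its ipTM class."""
    return {name: classify_iptm(value) for name, value in scores.items()}
```

Calling `summarize_models` on the scores of the full-length, DCD222-only, and DCD222-deleted models of one partner gives a compact view of whether the conserved region accounts for the predicted interface.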
We investigated whether this association has been evolutionarily conserved among vertebrates. To assess this hypothesis, we replicated our analysis using the laXP and haXP proteins of cyclostomes, employing AlphaFold3 to model their interactions with NLK. The results were highly consistent with those observed for hFAM222A and hFAM222B, revealing similar interaction patterns and structural stability (Fig. 4C). To further explore the structural conservation of DCD222 in its NLK-bound state, we performed a structural alignment of DCD222 from hFAM222A, hFAM222B, laXP, and haXP (Fig. 4D). Surprisingly, this alignment revealed an average root mean square deviation (RMSD) of 0.570±0.047 Å, indicating that despite being an IDR, DCD222 adopts a stable and highly conserved conformation upon binding to NLK, reinforcing its functional relevance. Moreover, this structural stability is further supported by a pLDDT above 90 across most of the domain, suggesting high confidence in the predicted local structure (Fig. 4F and Supplementary Fig. 4). Additionally, Predicted Aligned Error (PAE) values close to 0 Å indicate high accuracy in the structural alignment between DCD222 from humans and cyclostomes in its NLK-bound state (Fig. 4E and Supplementary Fig. 4). These findings suggest that the FAM222-NLK interaction is not only structurally and functionally significant but also ancestral and evolutionarily preserved across vertebrates, highlighting its potential role in conserved regulatory processes.
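For reference, an average pairwise RMSD of the kind reported above can be computed as in the following sketch. It assumes that the structures have already been superposed and that each is supplied as a list of equivalent atom coordinates (for example, Cα atoms of aligned residues) in the same order. The superposition step itself is not shown. Whether the reported deviation is a sample or population standard deviation is not stated in the text, so the sample form is used here as an assumption.

```python
import math
import statistics
from itertools import combinations

Coordinates = list[tuple[float, float, float]]


def rmsd(coords_a: Coordinates, coords_b: Coordinates) -> float:
    """RMSD between two pre-superposed, equally ordered coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def pairwise_rmsd_summary(structures: dict[str, Coordinates]) -> tuple[float, float]:
    """Mean and sample standard deviation of RMSD over all structure pairs.

    Requires at least three structures so that at least two pairwise
    values exist for the standard deviation.
    """
    values = [
        rmsd(structures[a], structures[b])
        for a, b in combinations(sorted(structures), 2)
    ]
    if len(values) < 2:
        raise ValueError("need at least three structures")
    return statistics.mean(values), statistics.stdev(values)
```

With four structures, as in Fig. 4D, the summary is taken over six pairwise comparisons.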
IDRs often contain short linear motifs (SLiMs), short amino acid sequences (8-23 residues) that mediate interactions with other proteins. Given this, we hypothesized that the DCD222 domain might harbor SLiMs involved in its interaction with NLK. To investigate this, we analyzed the DCD222 sequence of hFAM222A as a query in both orientations: N-terminal to C-terminal and C-terminal to N-terminal, using the eukaryotic linear motif resource (ELM) (Fig. 5A). The results revealed that DCD222 contains a revD motif, an 8 to 12 residue sequence that binds to the docking groove and the common docking (CD) region of MAPKs. These regions correspond to a hydrophobic groove and a negatively charged surface, respectively, and are exclusive to this family of kinases.
Interestingly, the revD motif in DCD222 is highly conserved from hagfish to humans and contains the signature sequence ΦXΦXXXXXXXR/K, where Φ represents a hydrophobic residue, X any amino acid, and R/K an arginine or lysine (Fig. 5A). To assess how this motif interacts with NLK, we analyzed its structural integration within the DCD222 region. We found that L58, S59, I60, K61, and I62 interact directly with the docking groove, forming four hydrogen bonds. Furthermore, L58 and I60 insert deeply into hydrophobic pockets of the docking groove with pLDDT values ranging from 90 to 100, indicating high-confidence structural predictions (Fig. 5B). We found that the revD motif does not interact with the CD region or any negatively charged region in NLK, since our analysis confirmed that this MAPK lacks such a region (Supplementary Fig. 5).
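The signature ΦXΦXXXXXXXR/K can be expressed as a regular expression and searched in both orientations, as sketched below. This is not the pattern used by the ELM resource. It is a literal transcription of the signature given in the text, and the set of residues treated as hydrophobic (A, I, L, M, F, V, W, Y) is an assumption. Function names are hypothetical.

```python
import re

# Assumption: residues counted as hydrophobic (the Φ positions).
HYDROPHOBIC = "AILMFVWY"

# Φ X Φ X{7} [R/K]; the lookahead allows overlapping matches to be reported.
SIGNATURE = re.compile(
    rf"(?=([{HYDROPHOBIC}].[{HYDROPHOBIC}].{{7}}[RK]))"
)


def find_signature(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched segment) for every hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in SIGNATURE.finditer(sequence.upper())
    ]


def scan_both_orientations(sequence: str) -> dict[str, list[tuple[int, str]]]:
    """Search the sequence as given (N to C) and reversed (C to N).

    Positions in the "C_to_N" list are counted from the C-terminus.
    """
    return {
        "N_to_C": find_signature(sequence),
        "C_to_N": find_signature(sequence[::-1]),
    }
```

A hit reported only in the `C_to_N` list means the signature is read from the C-terminus toward the N-terminus, which is the situation that a two-orientation search is meant to detect.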
Interestingly, we observed that this motif binds to NLK in the same manner across its orthologs, suggesting that this interaction mechanism has been evolutionarily conserved across all vertebrates (Fig. 5C).
Since the docking groove is an exclusive feature of MAPKs, we further tested whether the DCD222 domain could mediate interactions with other human MAPKs (MAPK1, MAPK3, MAPK4, MAPK6, MAPK7, MAPK8, MAPK9, MAPK10, MAPK11, MAPK12, MAPK13, and MAPK14) using AlphaFold3. The results showed that all ipTM values were below 0.6 (Fig. 5D and Supplementary Fig. 6), suggesting that FAM222 proteins interact specifically with NLK and not with other MAPKs through the DCD222 domain. To further support the biological relevance of this interaction, we examined RNA expression patterns at the single-cell level using the Human Protein Atlas and found that human NLK, hFAM222A, and hFAM222B are expressed at high levels in overlapping populations of neuronal, glial, and germline cells (Supplementary Fig. 7). This expression pattern indicates that these proteins are present in the same cellular contexts, supporting their potential involvement in shared signaling pathways.
To further investigate the interaction between the intrinsically disordered DCD222 domain and NLK, we conducted a detailed structural analysis. Despite its overall disordered nature, DCD222 consistently wraps around NLK with high pLDDT values across all AlphaFold3 models, from hagfish to humans (Supplementary Fig. 8 and Supplementary Videos 5 and 6). Within this flexible domain, the preformed α-helix H1 directly interacts with the C-lobe of NLK (Fig. 6A). In addition, the most hydrophobic segment, corresponding to the revD motif, engages the docking groove of NLK (Fig. 6B), highlighting how specific disordered regions can establish stable interfaces through SLiMs embedded within a flexible structural context.
The remaining portion of the domain associates with the N-lobe without directly engaging the active site. This interaction induces the formation and stabilization of a β-hairpin-like structure, with a pLDDT of 90, indicating high confidence in this prediction (Fig. 6C). In addition, we identified a previously unreported hydrophobic groove in the upper region of the N-lobe (Fig. 6D). Within this groove, DCD222 forms and inserts a short β-sheet composed of three residues from the conserved GLLAIV motif (Fig. 6E and Supplementary Fig. 9), which directly interacts with the C-terminal extension of NLK. This interaction, not observed with other MAPKs (Supplementary Fig. 6), reinforces the specificity of the DCD222-NLK complex. We also examined whether other regions of NLK contribute to binding, including the CMGC insert, but found no interaction in any of the AlphaFold3 models generated (Supplementary Fig. 9).
Based on these findings, we propose that DCD222 binds to NLK in an extended conformation, a binding mode in which little to no structural change occurs upon binding and which allows the formation of a specific complex. The recurrence of this binding mode in all models analyzed suggests that it has been evolutionarily conserved in vertebrates.
Our analysis identified the formation of 20 to 26 hydrogen bonds, 265 to 297 non-bonded contacts (including hydrophobic, electrostatic, and aromatic interactions) and one salt bridge between the DCD222 region and the kinase domain of NLK (Fig. 7A). The combination of these interactions suggests that the DCD222-NLK interaction is highly stable and structurally regulated.
Through the MSA (Fig. 7B), we determined that only 13 out of the 40 to 44 amino acid residues form hydrogen bonds and are highly conserved from hagfish to humans. These conserved residues lie in five distinct regions within DCD222, which may be essential for its interaction with NLK: the α-helix H1, the revD motif, the β-hairpin-like structure, the β-sheet-forming region, and the PQH/R motif. We observed that 8 of the 13 conserved residues form hydrogen bonds with NLK through the peptide backbone, while the remaining 5 residues interact via their side chains. Recently, it was reported that NLK can phosphorylate S59 of hFAM222A. Interestingly, we found that S59 is part of the revD motif; however, its side chain does not form hydrogen bonds with NLK, whereas its peptide backbone does. This suggests that this post-translational modification may not directly affect the binding of the DCD222 region to NLK (Supplementary Fig. 5).
We also observed that the amino acids within the kinase domain of NLK that interact with DCD222 are highly conserved. However, their conservation is comparable to that of the rest of the domain from hagfish to humans (Supplementary Fig. 10). Therefore, we conclude that there is no evidence of a co-evolutionary process between the DCD222 region and NLK.
Finally, to assess the dynamic stability of the complex, we performed three independent atomistic simulations of the hFAM222A-hNLK system using GROMACS 2025.3 (CUDA-enabled) under standard isothermal-isobaric (NPT) conditions. The simulated systems remained thermodynamically stable throughout, with mean temperature and pressure fluctuations closely centered around target values (T ≈ 300.0 ± 0.33 K; P ≈ 1.0 ± 30 bar). To quantify inter-chain stability and interface persistence, we computed the center-of-mass distance between the chains, the minimum heavy-atom distance, heavy-atom contact counts (< 0.45 nm), and inter-chain hydrogen-bond counts and lifetimes (Supplementary Table S1). Per-residue contact occupancies (hFAM222A) and residue-to-residue minimum-distance heatmaps were also generated. Additional analyses were performed on the intrinsically disordered hFAM222A, focusing on residues 43-53 (α-helix H1) and 35-106 (DCD222). All analyses were performed using PBC-corrected trajectories sampled every 1 ns. Across all simulations, the complex remained continuously associated (bound fraction ca. 1.0; mean minimum distance of 0.162-0.170 nm), exhibiting extensive inter-chain hydrogen bonding (54, 50, and 56 bonds per frame in the three replicates, with lifetimes around 2.5-2.8 ns). The DCD222 segment formed the principal and most persistent interface (2.7-2.9 contacting residues on average) with low variance in contact count and long contact survival, consistent with a robust but moderately dynamic interface characteristic of IDP interactions. By contrast, α-helix H1 remained spatially proximal (minimum distance of 0.186-0.197 nm) but contributed relatively few direct contacts. Overall, these simulations reveal that the hFAM222A-hNLK complex is maintained through a stable yet flexible interface dominated by the DCD222 region, a configuration emblematic of functional interactions mediated by IDPs.
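The inter-chain metrics listed above can be illustrated with a minimal sketch operating on heavy-atom coordinates in nanometers. The 0.45 nm contact cutoff is taken from the text. The definition of a bound frame (at least one heavy-atom contact) is an assumption, since the criterion behind the reported bound fraction is not stated, and all function names are hypothetical. This is not the analysis pipeline used with the GROMACS trajectories.

```python
import math

Atoms = list[tuple[float, float, float]]  # heavy-atom coordinates in nm


def min_distance(chain_a: Atoms, chain_b: Atoms) -> float:
    """Smallest distance between any atom of chain A and any atom of chain B."""
    return min(math.dist(a, b) for a in chain_a for b in chain_b)


def contact_count(chain_a: Atoms, chain_b: Atoms, cutoff: float = 0.45) -> int:
    """Number of inter-chain atom pairs closer than `cutoff` (nm)."""
    return sum(1 for a in chain_a for b in chain_b if math.dist(a, b) < cutoff)


def bound_fraction(frames: list[tuple[Atoms, Atoms]], cutoff: float = 0.45) -> float:
    """Fraction of frames with at least one inter-chain contact.

    Assumption: a frame counts as bound if any atom pair is within `cutoff`.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    bound = sum(1 for a, b in frames if contact_count(a, b, cutoff) > 0)
    return bound / len(frames)
```

Restricting `chain_a` to the atoms of residues 35-106 or 43-53 of hFAM222A gives the region-specific values for DCD222 and H1, respectively. The all-pairs loops are quadratic in the number of atoms, so a production analysis would use a neighbor search instead.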
Taken together, these findings support our proposal that the DCD222 region represents an intrinsically disordered domain (IDD) that specifically mediates the interaction with NLK (Fig. 7C). This interaction appears to be both ancestral and functionally relevant, suggesting that DCD222 has played a conserved role in regulating NLK function throughout vertebrate evolution.