AI-Powered Insights into Molecular Evolution: From Codon Usage to Gene Expression in Natural Environments

The study of evolution by natural selection at the molecular level has advanced significantly with the advent of genomic technologies. Traditionally, researchers have focused on observable traits like flowering time or growth. However, gene expression provides an intermediate phenotype that connects genomic data to these macroscopic traits, offering a deeper understanding of selection pressures. In a recent study involving Ivyleaf Morning Glory (*Ipomoea hederacea*), researchers utilized RNA sequencing to analyze gene expression under natural field conditions. The challenge of dealing with high-dimensional, small-sample-size data typical of transcriptomics was addressed using machine learning methods. These methods, known for their ability to handle complex, multivariate data, revealed that genes related to photosynthesis, stress response, and light response were crucial in predicting fitness. This demonstrates the potential of ML models to uncover important biological processes and genes under selection in natural environments, overcoming the limitations of traditional statistical approaches.

Additionally, the intricate patterns of codon usage, which vary significantly across and within species, are influenced by evolutionary selection. A study explored whether AI could predict codon sequences from given amino acid sequences in different organisms, including yeast and bacteria. The researchers used advanced AI models, specifically the mBART transformer-based architecture, to capture complex dependencies in codon usage that simple frequency-based methods fail to detect. Their findings indicate that AI can effectively learn and predict these codon patterns, particularly in highly expressed genes and longer proteins. This suggests that codon choice is influenced by evolutionary pressures related to protein expression and folding. This approach improves our understanding of codon bias and its impact on protein synthesis and provides a new tool for optimizing codon usage in biotechnology and synthetic biology applications.

Summary of Methods:

The codon study used NCBI coding sequences from S. cerevisiae, S. pombe, E. coli, and B. subtilis, divided into training, validation, and test sets. CD-HIT clustered the amino acid sequences, and each cluster was kept within a single set. BLAST identified similar sequences, and proteins were categorized by expression level. Codon prediction models included frequency-based methods and mBART models with varying configurations. The training protocol featured pretraining and fine-tuning with specific hyper-parameters. Fixed-size windows were applied during inference, and predictions were averaged across windows. Accuracy and perplexity, computed against the true codon sequences, were used to evaluate model performance.
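To make the baseline and the two metrics concrete, here is a minimal sketch of a frequency-based predictor scored with accuracy and perplexity. It is an illustration under assumptions, not the study's code: the function names, the toy sequences, and the probability floor for unseen codons are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_codon_frequencies(training_pairs):
    """Estimate P(codon | amino acid) from (amino_acids, codons) pairs.

    amino_acids is a string; codons is a list of the same length.
    """
    counts = defaultdict(Counter)
    for amino_acids, codons in training_pairs:
        for aa, codon in zip(amino_acids, codons):
            counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


def predict_most_frequent(freqs, amino_acids):
    """Pick the most frequent training codon for every amino acid."""
    return [max(freqs[aa], key=freqs[aa].get) for aa in amino_acids]


def accuracy(predicted, true_codons):
    """Fraction of positions where the predicted codon equals the true one."""
    matches = sum(p == t for p, t in zip(predicted, true_codons))
    return matches / len(true_codons)


def perplexity(freqs, amino_acids, true_codons, floor=1e-9):
    """exp of the mean negative log-probability assigned to the true codons.

    `floor` (an assumption) avoids log(0) for codons unseen in training.
    """
    log_sum = 0.0
    for aa, codon in zip(amino_acids, true_codons):
        log_sum += math.log(freqs.get(aa, {}).get(codon, floor))
    return math.exp(-log_sum / len(true_codons))


# Toy data (hypothetical): M is always ATG, K is AAA twice and AAG once.
train = [("MKK", ["ATG", "AAA", "AAA"]), ("MK", ["ATG", "AAG"])]
freqs = fit_codon_frequencies(train)

test_aa, test_codons = "MKK", ["ATG", "AAA", "AAG"]
predicted = predict_most_frequent(freqs, test_aa)
acc = accuracy(predicted, test_codons)
ppl = perplexity(freqs, test_aa, test_codons)
```

A baseline of this kind only sees each amino acid in isolation, which is why it cannot pick up the longer-range dependencies that the transformer models are meant to capture.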


Training and Evaluation of mBART Models:

mBART models were trained to predict codon sequences from amino acid sequences in two modes: masking and mimicking. In masking mode, the model predicts codons from the amino acid sequence alone; in mimicking mode, it predicts codons using those of an orthologous protein from a different organism as additional input. The mimicking approach is based on the hypothesis that codons can influence the translation elongation rate, which is critical for co-translational protein folding. Training datasets consisted of S. cerevisiae, S. pombe, E. coli, and B. subtilis proteins, divided into training, validation, and test sets with no amino acid sequence overlap between training and test sets. The evaluation showed that mBART models generally outperformed frequency-based baselines, especially when predicting codons for proteins with higher expression levels. This suggests that mBART can learn and use long-range interactions among codons that the baselines miss.
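One way to keep similar sequences out of both training and test data is to assign whole clusters, rather than individual proteins, to a split. The sketch below assumes cluster assignments are already available (for example, parsed from a clustering tool's output); the 80/10/10 fractions, the seed, and all identifiers are hypothetical and not taken from the study.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every cluster, as a unit, to train, validation, or test.

    cluster_of maps protein_id -> cluster_id. Because clusters are never
    divided, two proteins from the same cluster cannot land in different sets.
    """
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    names = ["train", "validation", "test"]
    targets = [f * len(cluster_of) for f in fractions]
    splits = {name: [] for name in names}

    index = 0
    for cluster_id in cluster_ids:
        # Move on to the next split once the current one has reached its target.
        while index < len(names) - 1 and len(splits[names[index]]) >= targets[index]:
            index += 1
        splits[names[index]].extend(members[cluster_id])
    return splits


# Hypothetical cluster assignments for ten proteins.
cluster_of = {
    "p1": "c1", "p2": "c1", "p3": "c2", "p4": "c3", "p5": "c3",
    "p6": "c4", "p7": "c5", "p8": "c5", "p9": "c6", "p10": "c7",
}
splits = split_by_cluster(cluster_of)
```

Split sizes are only approximate with this greedy fill, since a large cluster can overshoot its target; that trade-off is usually acceptable in exchange for leak-free sets.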

Accuracy of Masking and Mimicking Predictions:

The mBART models' masking-mode predictions were more accurate than those of frequency-based methods, demonstrating the ability to capture complex patterns in codon usage. Different window sizes were tested, with the 30-codon window model performing the best. Although mimicking-mode predictions were slightly less accurate than masking-mode predictions, they still showed potential, especially in eukaryotic organisms and for highly conserved orthologous segments. The mBART models' performance did not significantly benefit from sequence similarities between training and test sets, indicating robust learning of codon usage patterns. Additionally, the models' accuracy varied across proteins with different expression levels and molecular functions, with notable improvements for proteins involved in ribosomal functions, nucleic acid binding, and catalytic activities in S. cerevisiae and E. coli.
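The windowed inference described in the methods can be sketched as follows: slide a fixed-size window over the protein, collect a per-position codon distribution from each window, and average the distributions wherever windows overlap. This is an illustration only. `predict_window` stands in for a trained model and is hypothetical, as are the stride and the stub used in the example; only the 30-codon window size comes from the text above.

```python
from collections import defaultdict


def windows(length, size, stride):
    """Yield (start, end) spans of fixed size that cover positions 0..length-1."""
    if length <= size:
        yield 0, length
        return
    start = 0
    while start + size < length:
        yield start, start + size
        start += stride
    yield length - size, length  # last window sits flush with the sequence end


def average_window_predictions(amino_acids, predict_window, size=30, stride=10):
    """Average per-position codon probabilities over all covering windows.

    predict_window(segment) must return one {codon: probability} dict per
    residue in the segment.
    """
    totals = [defaultdict(float) for _ in amino_acids]
    cover = [0] * len(amino_acids)
    for start, end in windows(len(amino_acids), size, stride):
        for offset, probs in enumerate(predict_window(amino_acids[start:end])):
            for codon, p in probs.items():
                totals[start + offset][codon] += p
            cover[start + offset] += 1
    return [
        {codon: total / n for codon, total in position.items()}
        for position, n in zip(totals, cover)
    ]


def stub_model(segment):
    """Placeholder model: two made-up codon labels per residue, fixed odds."""
    return [{aa + "_codon1": 0.75, aa + "_codon2": 0.25} for aa in segment]


spans = list(windows(7, 4, 2))
averaged = average_window_predictions("MKKLVAG", stub_model, size=4, stride=2)
best = [max(p, key=p.get) for p in averaged]
```

Positions near the middle of a sequence are covered by more windows than positions at the ends, so their averaged predictions draw on more context.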


For the gene expression study, tissue was collected from Ipomoea hederacea, an annual vine distributed across the eastern USA. The field experiment involved planting 100 individuals from 56 populations in a glasshouse and then transplanting them to a field. Soil samples were analyzed for heavy metals a year later. Leaf tissue was collected after 71 days, and mRNA was extracted and sequenced. Data processing included aligning reads to the Ipomoea nil genome, transforming gene counts, and filtering out low-expression genes. Analytical methods involved principal component regression and supervised modeling using neural networks and gradient tree boosting. Important genes were identified, and GO term enrichment analysis was conducted using Blast2GO and goseq.
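As a rough illustration of the count transformation and low-expression filtering step, the sketch below converts raw counts to counts per million (CPM), drops genes whose mean CPM falls below a threshold, and log-transforms the rest. The article does not say which transformation or cutoff the researchers used, so the log2(CPM + 1) transform, the threshold of 1 CPM, and the toy counts are all assumptions.

```python
import math


def counts_per_million(counts):
    """Scale raw counts by library size: {sample: {gene: count}} -> CPM."""
    cpm = {}
    for sample, genes in counts.items():
        library_size = sum(genes.values())
        cpm[sample] = {g: c * 1e6 / library_size for g, c in genes.items()}
    return cpm


def filter_and_log(counts, min_mean_cpm=1.0, pseudocount=1.0):
    """Keep genes with mean CPM >= min_mean_cpm, then return log2(CPM + pseudocount)."""
    cpm = counts_per_million(counts)
    samples = list(cpm)
    genes = list(cpm[samples[0]])
    kept = [
        g for g in genes
        if sum(cpm[s][g] for s in samples) / len(samples) >= min_mean_cpm
    ]
    return {
        s: {g: math.log2(cpm[s][g] + pseudocount) for g in kept}
        for s in samples
    }


# Hypothetical raw counts for two plants and three genes.
raw_counts = {
    "plant_a": {"gene1": 500000, "gene2": 499999, "gene3": 1},
    "plant_b": {"gene1": 250000, "gene2": 750000, "gene3": 0},
}
expression = filter_and_log(raw_counts)
```

Filtering before modeling reduces the number of features, which matters when there are far more genes than sampled plants.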

Insights from AI-Driven Codon Prediction and Gene Expression Analysis:

AI models have been applied to two distinct problems here: mBART to predict codon usage across organisms, and neural networks and gradient tree boosting to relate gene expression to fitness. The codon models highlight significant correlations between codon usage and protein expression, evolutionary conservation, and functional attributes, with high-expression genes and conserved proteins exhibiting more predictable codon patterns. The gene expression models identify expression patterns related to fitness, particularly in genes associated with stress response and reproductive development. Together, the two studies underscore the utility of AI in decoding complex biological sequences and in advancing our understanding of evolutionary biology and gene regulation.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
