Use machine learning to design codon-optimized DNA sequences for increased protein expression in CHO cells.
Description of project: https://www.liebertpub.com/doi/10.1089/cmb.2021.0458
Packages required:
- Python (built using Python 3.7.4 via Ubuntu-18.04)
- Keras (tensorflow)
- json
- NumPy
- OS
To predict the 'best' DNA sequence for a given amino acid sequence using the pre-trained model, use rnn_predict.py.
Codon optimization procedure:
- Place the following in the same directory:
- rnn_predict.py
- rnn_model.h5
- dna_tokenizer.json
- aa_tokenizer.json
- Text file containing amino acid sequence (save as "sequence.txt")
- Single-letter amino acid abbreviations
- Signal peptide included, if applicable
- No stop codon (will be added to output DNA sequence)
- In line 19 of rnn_predict.py, change to the working directory containing the above files.
- Run rnn_predict.py.
- The optimized DNA sequence is output as "sequence_opt.txt" in the same directory.