jome0169 / mendietapablo_annotation_paper_scripts Goto Github PK
View Code? Open in Web Editor NEWScripts used for analysis and pipline of Mendieta et al 2020
Scripts used for analysis and pipline of Mendieta et al 2020
One reviewer wants me to further this analysis by looking at the split gene annotations found within gramene. This is a set of annoations which were IDd on gramene as being split, but for some reason, have not been fixed in the main annotation.
ftp://ftp.gramene.org/../pub/gramene/CURRENT_RELEASE/split_genes/ <- Link to the ftp site provided by the reviewer. A lab mate assisted me in retrieveing this data, and from there I have been "off to the races" as it were. Utilized an awk command to generate the bed file: awk -v OFS='\t' '{if ($7=="1") print $4,$5,$6,$3,"+"; else print $4,$5,$6,$3,"-"}' zea_mays_split_genes_gramene.txt | grep -v "chr_start" | uniq | bedtools sort -i - > zea_mays_split_genes_gramene.sorted.bed
. From there realized that this is all the genes in a single list, not the genes paired with their corresponding merged partner.
Used a bedtools closest command in order to pair up genes - merged pairs should be equally distant from each other, and those that aren't will just be printed for later analysis and use. Made quick python script to help with this as well.
Bedtools command: ❯ bedtools closest -a zea_mays_split_genes_gramene.sorted.bed -b zea_mays_split_genes_gramene.sorted.bed -io -t first -d > closest_thing.txt
❯ python quick_fix.py closest_thing.txt > mostly_fixed.txt
Opened the mostly_fixed.txt
file in excel and fixed the remaining triplet pairs there. Copy and ptasted the output from the excel editing into a vim to remove weird formatting, and finally put this file into bedtools sort. Finally saved the final file as Gramene_split.final.sorted.bed.
In total there are 78 gramene merged gene pairs/triplets. Doing a simple bedtools intersect command we find that we capure 31 out of the 78, so 40%. that's fine. Probably a decent proportion of these do not fall within regions where we have data.
One of the reviewers asked if there was any evidence of conservation between some distal initiation only modifications in the Sorghum genome. Basically asking the questions - well are these genes exlusive to the Sorghum genome, and more recently shut off in the maize genome?
It's an interesting question, and one I was curious to do. The analysis steps are fairly simply.
All following parallel commands will be run on this dataset
❯ ls -a1 *.bed
ear_initiation_overlapping_neither_elongation.bed
leaf_initiation_overlapping_neither_elongation.bed
root_initiation_overlapping_neither_elongation.bed
We can't just do this analysis in a vacuum. To make this a more meaningful comparison, I'm going to be taking an equal sample of initiation modifications overlapping genes and compare the level of conservation (by means of counts) as compared to those initiation modification regions which are not overlapping genes. This might tell us that these initiation modification only regions are undergoing some sort of different evolutionary trajectory, thus maybe performing a different funciton.
Input Control Files:
~/Projects/03.ncRNA_project/03.Figures/Figure2/05.Blastn_sequences
❯ wc -l *intersecting_genes.bed
20941 ear_initiation.intersecting_genes.bed
19724 leaf_initiation.intersecting_genes.bed
23387 root_initiation.intersecting_genes.bed
64052 total
Generation of Control Dataset:
Made a very small bash file to assist in this called equal_sub_sample.sh
which is comprised of these simple command:
set -euxo pipefail
gen_number=$(wc -l < ${1})
shuf -n ${gen_number} ${2}
Command ran to generate:
❯ parallel "bash equal_sub_sample.sh {}_initiation_overlapping_neither_elongation.bed {}_initiation.intersecting_genes.bed > {}_initiation.intersecting_genes.control_sub_sample.bed" ::: ear leaf root
++ wc -l
+ gen_number=' 2340'
+ shuf -n 2340 ear_initiation.intersecting_genes.bed
++ wc -l
+ gen_number=' 2605'
+ shuf -n 2605 leaf_initiation.intersecting_genes.bed
++ wc -l
+ gen_number=' 3486'
+ shuf -n 3486 root_initiation.intersecting_genes.bed
parallel "bedtools getfasta -fi Zea_mays.AGPv4.dna.toplevel.unwrapped.nocontigs.reorderd.fa -bed {} > {.}.seq.fa" ::: *.bed
Copied these over from local Location: ~/Projects/03.ncRNA_project/03.Figures/Figure2/05.Blastn_sequences
to sapelo2 where the further analysis is happening in the directory: /scratch/jpm73279/04.lncRNA/08.comprative_maize_sorghum
parallel "blastn -db Sorghum_bicolor_var_BTx623.mainGenome.fasta -query {} -outfmt 6 -num_threads 10 -evalue .0001 -max_target_seqs 1 > blast_output/{/.}.blastn_out.txt" ::: isolated_seq/*.fa
parallel "python fitler_blastn_maize_sog_comp_genBED.py -i {} -o seq_loc_sorghum_bed/{/.}.bed" ::: blast_output/*.txt
parallel "bedtools intersect -a {} -b Sbicolorv5.1.gene.bed -u -wa > {.}.intersecting_sorghum_genes.bed " ::: *.blastn_out.bed
Count the number intersecting Sorghum genes"
2077 ear_initiation.intersecting_genes.control_sub_sample.seq.blastn_out.intersecting_sorghum_genes.bed
201 ear_initiation_overlapping_neither_elongation.seq.blastn_out.intersecting_sorghum_genes.bed
2318 leaf_initiation.intersecting_genes.control_sub_sample.seq.blastn_out.intersecting_sorghum_genes.bed
178 leaf_initiation_overlapping_neither_elongation.seq.blastn_out.intersecting_sorghum_genes.bed
3125 root_initiation.intersecting_genes.control_sub_sample.seq.blastn_out.intersecting_sorghum_genes.bed
287 root_initiation_overlapping_neither_elongation.seq.blastn_out.intersecting_sorghum_genes.bed
8186 total```
#### Call Initiation Only Mod Peaks
The manuscript Using multiple reference genomes to identify and resolve annotation inconsistencies is a recent publication by Monnahan et al from 2020. It details a clever approach which I appreciate which attempts to identify potentially split genes in the maize genome.
It does so by comparing multiple versions of maize genome annotations (B73 - Ph207 - W22) against one another, and compares them in a pairwise manner to identify genes with a "One gene hits many in another" genome approach. This can be somewhat seen in their figure 1.
The manuscript goes on to basically test whether these sets of genes should be merged - or kept as split genes. Asking the question - "which annotation is correct?" It does so by looking at RNA-seq data across multiple different tissue types which asking whether the 'split' genes involved are actually similarly expressed. This is summed in their metric M2f (Mean 2 fold expression difference) - Again figure 2 here:
It's definitely an interesting approach and one I appreciate. Something to note here though is that the cut off they're using - this "top 10 of m2f" metric is very stringent (which makes sense). Basically if the values of M2F for one of their calculated pairs doesn't fall on either side (the split or merged side) of the distribution, they do not "make a call" as it were. So this does mitigate conclusions they can make about some of the potential pairs they have. For instance - initially they identify 481 potential mergers in B73 - and end up identifying 96 potential mergers, 170 genes which should remain split, and 240 which were unable to be called (this adds up to 506??).
Downloaded the other species data in the directory /scratch/jpm73279/04.lncRNA/06.annotate_other_species_2020-05-11
using the script submit_fastq_dump_download.sh
. I have now downloaded all the species Zefu has given me which is excellent. Now, I just need to modify the pipeline in such as way as to more pliable in regards to the inclusion of having replicates or not.
Getting the species into the correct formats I think will be a little irritating, as they are all going to require me to weed out scaffold level assemblies, as well as generate the separate lincRNA files for each genome. Initially I think I should start on maybe getting the most interesting genomes going first - focusing on sorghum, Asparagus, Eutreme, Gycine Max
I've generated what feels like maybe 1000 metaplots, but I like to think we're closing in on the final set of metaplots.
Generated the directory structure for alternative species, which looks like this:
├── 05.A_officialnalis
│ ├── 00.data
│ │ ├── bed_file
│ │ ├── chip_seq
│ │ │ ├── H3
│ │ │ ├── H3K27me3
│ │ │ ├── H3K36me3
│ │ │ ├── H3K4me1
│ │ │ ├── H3K4me3
│ │ │ ├── H3K56ac
│ │ │ ├── H3K9
│ │ │ └── input
│ │ ├── reference
│ │ └── rna
Based off of the structure which I used for Z.mays. Probably overkill, but this is what is going to allow me to quickly implement this in other species
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.