jome0169 / mendietapablo_annotation_paper_scripts Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 0.0 165.2 MB

Scripts used for analysis and pipline of Mendieta et al 2020

Python 23.28% Shell 1.25% R 75.48%

mendietapablo_annotation_paper_scripts's People

Contributors

Stargazers

Watchers

mendietapablo_annotation_paper_scripts's Issues

Comparing Split Annotation Class from Gramene

4/26/2021

One reviewer wants me to further this analysis by looking at the split gene annotations found within gramene. This is a set of annoations which were IDd on gramene as being split, but for some reason, have not been fixed in the main annotation.

ftp://ftp.gramene.org/../pub/gramene/CURRENT_RELEASE/split_genes/ <- Link to the ftp site provided by the reviewer. A lab mate assisted me in retrieveing this data, and from there I have been "off to the races" as it were. Utilized an awk command to generate the bed file: awk -v OFS='\t' '{if ($7=="1") print $4,$5,$6,$3,"+"; else print $4,$5,$6,$3,"-"}' zea_mays_split_genes_gramene.txt | grep -v "chr_start" | uniq | bedtools sort -i - > zea_mays_split_genes_gramene.sorted.bed . From there realized that this is all the genes in a single list, not the genes paired with their corresponding merged partner.

Used a bedtools closest command in order to pair up genes - merged pairs should be equally distant from each other, and those that aren't will just be printed for later analysis and use. Made quick python script to help with this as well.

Bedtools command: ❯ bedtools closest -a zea_mays_split_genes_gramene.sorted.bed -b zea_mays_split_genes_gramene.sorted.bed -io -t first -d > closest_thing.txt

❯ python quick_fix.py closest_thing.txt > mostly_fixed.txt

Opened the mostly_fixed.txt file in excel and fixed the remaining triplet pairs there. Copy and ptasted the output from the excel editing into a vim to remove weird formatting, and finally put this file into bedtools sort. Finally saved the final file as Gramene_split.final.sorted.bed.

In total there are 78 gramene merged gene pairs/triplets. Doing a simple bedtools intersect command we find that we capure 31 out of the 78, so 40%. that's fine. Probably a decent proportion of these do not fall within regions where we have data.

Fixing GTF File display

daler/gffutils#137

Conservation of Initiation Modification Class in Sorghum

5/5/2021

One of the reviewers asked if there was any evidence of conservation between some distal initiation only modifications in the Sorghum genome. Basically asking the questions - well are these genes exlusive to the Sorghum genome, and more recently shut off in the maize genome?

It's an interesting question, and one I was curious to do. The analysis steps are fairly simply.

Take the sequence underlying the initiation only modifications I identified in maize, and using BlastN blast these sequences to the Sorghum genomes
Ask the question - of these sequences - what proportion overlap genes in the Sorghum genome? This would tell us about potential genes which were conserved in Sorghum and lost in maize
2b) This will also require a control sequence as well. An equal number of initiation regions also identified in genes in the maize genome
What proportion of these regions found in the Sorghum genome also have the histone modificatios for initiation? NOTE: This will only really be able to be asked about the leaf modifications, as we don't have matched data sets.

Input BED Files

All following parallel commands will be run on this dataset

❯ ls -a1 *.bed
ear_initiation_overlapping_neither_elongation.bed
leaf_initiation_overlapping_neither_elongation.bed
root_initiation_overlapping_neither_elongation.bed

Generation of Control Datasets

We can't just do this analysis in a vacuum. To make this a more meaningful comparison, I'm going to be taking an equal sample of initiation modifications overlapping genes and compare the level of conservation (by means of counts) as compared to those initiation modification regions which are not overlapping genes. This might tell us that these initiation modification only regions are undergoing some sort of different evolutionary trajectory, thus maybe performing a different funciton.

Input Control Files:

~/Projects/03.ncRNA_project/03.Figures/Figure2/05.Blastn_sequences
❯ wc -l *intersecting_genes.bed
   20941 ear_initiation.intersecting_genes.bed
   19724 leaf_initiation.intersecting_genes.bed
   23387 root_initiation.intersecting_genes.bed
   64052 total

Generation of Control Dataset:
Made a very small bash file to assist in this called equal_sub_sample.sh which is comprised of these simple command:

set -euxo pipefail

gen_number=$(wc -l < ${1})
shuf -n ${gen_number} ${2}

Command ran to generate:

❯ parallel "bash equal_sub_sample.sh {}_initiation_overlapping_neither_elongation.bed {}_initiation.intersecting_genes.bed > {}_initiation.intersecting_genes.control_sub_sample.bed" ::: ear leaf root
++ wc -l
+ gen_number='    2340'
+ shuf -n 2340 ear_initiation.intersecting_genes.bed
++ wc -l
+ gen_number='    2605'
+ shuf -n 2605 leaf_initiation.intersecting_genes.bed
++ wc -l
+ gen_number='    3486'
+ shuf -n 3486 root_initiation.intersecting_genes.bed

Get underlying Nt Fasta

parallel "bedtools getfasta -fi Zea_mays.AGPv4.dna.toplevel.unwrapped.nocontigs.reorderd.fa -bed {} > {.}.seq.fa" ::: *.bed

Copied these over from local Location: ~/Projects/03.ncRNA_project/03.Figures/Figure2/05.Blastn_sequences to sapelo2 where the further analysis is happening in the directory: /scratch/jpm73279/04.lncRNA/08.comprative_maize_sorghum

Blast Against Sorghum Genome

parallel "blastn -db Sorghum_bicolor_var_BTx623.mainGenome.fasta -query {} -outfmt 6 -num_threads 10 -evalue .0001 -max_target_seqs 1 > blast_output/{/.}.blastn_out.txt" ::: isolated_seq/*.fa

Grab the Sequence regions in Sorghum (Blast output format to Bed)

parallel "python fitler_blastn_maize_sog_comp_genBED.py -i {} -o seq_loc_sorghum_bed/{/.}.bed" ::: blast_output/*.txt

Intersect With a list of Sorghum Genes

parallel "bedtools intersect -a {} -b Sbicolorv5.1.gene.bed -u -wa > {.}.intersecting_sorghum_genes.bed " ::: *.blastn_out.bed

Count the number intersecting Sorghum genes"

  2077 ear_initiation.intersecting_genes.control_sub_sample.seq.blastn_out.intersecting_sorghum_genes.bed
   201 ear_initiation_overlapping_neither_elongation.seq.blastn_out.intersecting_sorghum_genes.bed
  2318 leaf_initiation.intersecting_genes.control_sub_sample.seq.blastn_out.intersecting_sorghum_genes.bed
   178 leaf_initiation_overlapping_neither_elongation.seq.blastn_out.intersecting_sorghum_genes.bed
  3125 root_initiation.intersecting_genes.control_sub_sample.seq.blastn_out.intersecting_sorghum_genes.bed
   287 root_initiation_overlapping_neither_elongation.seq.blastn_out.intersecting_sorghum_genes.bed
  8186 total```

#### Call Initiation Only Mod Peaks

Comparing merged class against class found in Monnahan et_al

The manuscript Using multiple reference genomes to identify and resolve annotation inconsistencies is a recent publication by Monnahan et al from 2020. It details a clever approach which I appreciate which attempts to identify potentially split genes in the maize genome.

It does so by comparing multiple versions of maize genome annotations (B73 - Ph207 - W22) against one another, and compares them in a pairwise manner to identify genes with a "One gene hits many in another" genome approach. This can be somewhat seen in their figure 1.

The manuscript goes on to basically test whether these sets of genes should be merged - or kept as split genes. Asking the question - "which annotation is correct?" It does so by looking at RNA-seq data across multiple different tissue types which asking whether the 'split' genes involved are actually similarly expressed. This is summed in their metric M2f (Mean 2 fold expression difference) - Again figure 2 here:

It's definitely an interesting approach and one I appreciate. Something to note here though is that the cut off they're using - this "top 10 of m2f" metric is very stringent (which makes sense). Basically if the values of M2F for one of their calculated pairs doesn't fall on either side (the split or merged side) of the distribution, they do not "make a call" as it were. So this does mitigate conclusions they can make about some of the potential pairs they have. For instance - initially they identify 481 potential mergers in B73 - and end up identifying 96 potential mergers, 170 genes which should remain split, and 240 which were unable to be called (this adds up to 506??).

Notes

5/12/2019

Alternative Species

Downloaded the other species data in the directory /scratch/jpm73279/04.lncRNA/06.annotate_other_species_2020-05-11 using the script submit_fastq_dump_download.sh . I have now downloaded all the species Zefu has given me which is excellent. Now, I just need to modify the pipeline in such as way as to more pliable in regards to the inclusion of having replicates or not.

Getting the species into the correct formats I think will be a little irritating, as they are all going to require me to weed out scaffold level assemblies, as well as generate the separate lincRNA files for each genome. Initially I think I should start on maybe getting the most interesting genomes going first - focusing on sorghum, Asparagus, Eutreme, Gycine Max

Final Metaplots

I've generated what feels like maybe 1000 metaplots, but I like to think we're closing in on the final set of metaplots.

5/15/2019

Alternative Species Continuted

Generated the directory structure for alternative species, which looks like this:
├── 05.A_officialnalis
│   ├── 00.data
│   │   ├── bed_file
│   │   ├── chip_seq
│   │   │   ├── H3
│   │   │   ├── H3K27me3
│   │   │   ├── H3K36me3
│   │   │   ├── H3K4me1
│   │   │   ├── H3K4me3
│   │   │   ├── H3K56ac
│   │   │   ├── H3K9
│   │   │   └── input
│   │   ├── reference
│   │   └── rna

Based off of the structure which I used for Z.mays. Probably overkill, but this is what is going to allow me to quickly implement this in other species