Code Monkey home page Code Monkey logo

roh-selfing's Introduction

Repository for the paper “Using runs of homozygosity and machine learning to disentangle sources of inbreeding and infer self-fertilization rates”

Description

What this is

This is a repository to estimate selfing rate from runs of homozygosity for a population in a vcf file. You can also recreate simulations and plots for the manuscript “Using runs of homozygosity and machine learning to disentangle sources of inbreeding and infer self-fertilization rates”. For more information please refer to https://www.biorxiv.org/content/10.1101/2024.02.20.581206v1. If you use or adapt our code, data, output or plots, please cite this manuscript.

If you have additional questions regarding running our models or recreating our results, feel free to open an issue.

How to estimate selfing rates of you own data

You can use our pretrained models to estimate selfing rates in your own data. We provide multiple models and a script to run the predictions. The computations is very light weight and requires only few computational resources as long as the requirements are satisfied.

Requirements

  • R (>4.2.2)
    • dplyr
    • data.table
    • caret
    • optparse
  • bcftools 1.18. We highly recommend against using earlier versions of bcftools since the training was done using 1.18. There was a change in previous versions of bcftools ROH that affects the calculation of ROHs.
  • plink 1.90
  • vcftools 0.1.15

See also env_short.yml for the a conda installation of R requirements.

Prepare input files

First we prepare the input files for the models. The algorithm will only use the input files specified in the model (see table below), but the wrapper scripts require all input files (Tajima, F, ROHs), even if you just want to use ROHs for the prediction.

FILE=your.vcf.gz
OUTNAME=$(basename $FILE)
plink  --vcf $FILE --het --out $OUTNAME # *.het file
vcftools --gzvcf $FILE --TajimaD 100000 --out $OUTNAME # *.Tajima.D file
bcftools roh -O r -o rohoutput.txt $FILE  # calculate ROHs

In addition to these input files you will also require chromosome lengths for your reference. These can be easily obtained with the following commands.

samtools faidx your_reference_genome.fasta
cut -f1-2 your_reference_genome.fasta.fai > your_reference_genome.fasta.lengths

The output will be a tab separated file with two columns, and no header: chromosome ID, length in bp.

Choose a model

You can try estimating selfing rates just from ROH, just from F and Tajima’s D, or with both. We recommend using the sequential model, which uses ROH+F+Tajima’s D and also estimates a rough census size (Ncurrent).

Model NameFile NameDescriptionModel ID
<25><10><20>
RF-sequentialrohSelfing_Sequential_N.RDatasequential model, same data as RF-full, classification part202310021917048AtJy
RF-sequentialrohSelfing_Sequential.RDatasequential model, same data as RF-full, regression part202310021917048AtJy
RF-fullrohSelfing_Full.RDataregression RF, uses ROH, F, Tajima’s D20231002191703IfZEj
RF-ROHrohSelfing_ROH.RDataregression RF, uses only ROH statistics20231002191703bwG5E
RF-statsrohSelfing_Stats.RDataregression RF, uses only F and Tajima’s D20231002191703RXUjU

Run predictions

Now apply the model to your prepared data. There are two scripts you can call with Rscript. randomForestRunEmpirical_Sequential.R runs the sequential model, randomForestRunEmpirical.R runs other 3 models.

Sequential model

Replace the dummy filenames with your actual data in the following command.

Rscript scripts/randomForestRunEmpirical_Sequential.R \
--modelFileSelf models/rohSelfing_Sequential.RData \
--modelFileN models/rohSelfing_Sequential_N.RData \
--fisFile your.vcf.gz.het \
--tajimaFile your.vcf.gz.Tajima.D \
--rohFile your.rohoutput.txt \
--chromosomeLengths your_reference_genome.fasta.lengths \
--out your.selfingrates.txt

Other models (optional)

Here, you can replace models/rohSelfing_Full.RData with the ROH or stats model (not sequential).

Rscript scripts/randomForestRunEmpirical.R \
--modelFile models/rohSelfing_Full.RData \
--fisFile your.vcf.gz.het \
--tajimaFile your.vcf.gz.Tajima.D \
--rohFile your.rohoutput.txt \
--chromosomeLengths your_reference_genome.fasta.lengths \
--out your.selfingrates.txt

Output files

The output file specified in the previous command contains one selfing rate per chromosome/linkage group. I.e., if your organism has 8 chromosomes, the file will contain 8 lines.

The other file that is generated (*.params) contains parameters from ROH and summary stats for the population, as well as N and selfing rate estimates (sequential model only). This file has a header: chr,lengthVar,countVar,propVar,roh_count_ind,proportionInROH,lengthMedian,gapMedian,gapVar,TajimaD,Fis,selfRate,N

meaning: chromosome id, ROH length variance, ROH count variance, F_ROH variance, ROH count, F_ROH, ROH length, ROH gap, ROH gap variance, Tajima’s D, F, selfing rate, binned census size (sequential model only).

How to reproduce manuscript data and plots

Scripts

There are intermediate files for all the plots, as output from various scripts hosted here. This allows to recreate the plots without rerunning slim, etc. Should there be missing input files or any other problems please feel free to open an issue.

Reproduce Figures

All figures can be reproduced using the provided scripts and supplementary data. For all main figure and most of the supplementary plots, run ms_plots.R For comparing different models (supplementary plots S5, S6), run allmodelplots.R. For subsampling and popstructre results (supplementary plots S7, S8), run n10eval.R or admixpopstructure.R.

Requirements

  • R (tested on 4.1.3)
  • SLiM version 3.7.1
  • bcftools 1.18
  • plink 1.90
  • vcftools 0.1.15

See also env.yml for the a conda installation of R requirements.

roh-selfing's People

Contributors

lzeitler avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.