Repository for the paper “Using runs of homozygosity and machine learning to disentangle sources of inbreeding and infer self-fertilization rates”
This is a repository to estimate selfing rate from runs of homozygosity for a population in a vcf file. You can also recreate simulations and plots for the manuscript “Using runs of homozygosity and machine learning to disentangle sources of inbreeding and infer self-fertilization rates”. For more information please refer to https://www.biorxiv.org/content/10.1101/2024.02.20.581206v1. If you use or adapt our code, data, output or plots, please cite this manuscript.
If you have additional questions regarding running our models or recreating our results, feel free to open an issue.
You can use our pretrained models to estimate selfing rates in your own data. We provide multiple models and a script to run the predictions. The computations is very light weight and requires only few computational resources as long as the requirements are satisfied.
- R (>4.2.2)
- dplyr
- data.table
- caret
- optparse
- bcftools 1.18. We highly recommend against using earlier versions of bcftools since the training was done using 1.18. There was a change in previous versions of bcftools ROH that affects the calculation of ROHs.
- plink 1.90
- vcftools 0.1.15
See also env_short.yml for the a conda installation of R requirements.
First we prepare the input files for the models. The algorithm will only use the input files specified in the model (see table below), but the wrapper scripts require all input files (Tajima, F, ROHs), even if you just want to use ROHs for the prediction.
FILE=your.vcf.gz OUTNAME=$(basename $FILE) plink --vcf $FILE --het --out $OUTNAME # *.het file vcftools --gzvcf $FILE --TajimaD 100000 --out $OUTNAME # *.Tajima.D file bcftools roh -O r -o rohoutput.txt $FILE # calculate ROHs
In addition to these input files you will also require chromosome lengths for your reference. These can be easily obtained with the following commands.
samtools faidx your_reference_genome.fasta
cut -f1-2 your_reference_genome.fasta.fai > your_reference_genome.fasta.lengths
The output will be a tab separated file with two columns, and no header: chromosome ID, length in bp.
You can try estimating selfing rates just from ROH, just from F and Tajima’s D, or with both. We recommend using the sequential model, which uses ROH+F+Tajima’s D and also estimates a rough census size (Ncurrent
).
Model Name | File Name | Description | Model ID |
<25> | <10> | <20> | |
---|---|---|---|
RF-sequential | rohSelfing_Sequential_N.RData | sequential model, same data as RF-full, classification part | 202310021917048AtJy |
RF-sequential | rohSelfing_Sequential.RData | sequential model, same data as RF-full, regression part | 202310021917048AtJy |
RF-full | rohSelfing_Full.RData | regression RF, uses ROH, F, Tajima’s D | 20231002191703IfZEj |
RF-ROH | rohSelfing_ROH.RData | regression RF, uses only ROH statistics | 20231002191703bwG5E |
RF-stats | rohSelfing_Stats.RData | regression RF, uses only F and Tajima’s D | 20231002191703RXUjU |
Now apply the model to your prepared data. There are two scripts you can call with Rscript
. randomForestRunEmpirical_Sequential.R
runs the sequential model, randomForestRunEmpirical.R
runs other 3 models.
Replace the dummy filenames with your actual data in the following command.
Rscript scripts/randomForestRunEmpirical_Sequential.R \
--modelFileSelf models/rohSelfing_Sequential.RData \
--modelFileN models/rohSelfing_Sequential_N.RData \
--fisFile your.vcf.gz.het \
--tajimaFile your.vcf.gz.Tajima.D \
--rohFile your.rohoutput.txt \
--chromosomeLengths your_reference_genome.fasta.lengths \
--out your.selfingrates.txt
Here, you can replace models/rohSelfing_Full.RData
with the ROH or stats model (not sequential).
Rscript scripts/randomForestRunEmpirical.R \
--modelFile models/rohSelfing_Full.RData \
--fisFile your.vcf.gz.het \
--tajimaFile your.vcf.gz.Tajima.D \
--rohFile your.rohoutput.txt \
--chromosomeLengths your_reference_genome.fasta.lengths \
--out your.selfingrates.txt
The output file specified in the previous command contains one selfing rate per chromosome/linkage group. I.e., if your organism has 8 chromosomes, the file will contain 8 lines.
The other file that is generated (*.params
) contains parameters from ROH and summary stats for the population, as well as N and selfing rate estimates (sequential model only). This file has a header:
chr,lengthVar,countVar,propVar,roh_count_ind,proportionInROH,lengthMedian,gapMedian,gapVar,TajimaD,Fis,selfRate,N
meaning: chromosome id, ROH length variance, ROH count variance, F_ROH variance, ROH count, F_ROH, ROH length, ROH gap, ROH gap variance, Tajima’s D, F, selfing rate, binned census size (sequential model only).
There are intermediate files for all the plots, as output from various scripts hosted here. This allows to recreate the plots without rerunning slim, etc. Should there be missing input files or any other problems please feel free to open an issue.
All figures can be reproduced using the provided scripts and supplementary data. For all main figure and most of the supplementary plots, run ms_plots.R For comparing different models (supplementary plots S5, S6), run allmodelplots.R. For subsampling and popstructre results (supplementary plots S7, S8), run n10eval.R or admixpopstructure.R.
- R (tested on 4.1.3)
- SLiM version 3.7.1
- bcftools 1.18
- plink 1.90
- vcftools 0.1.15
See also env.yml for the a conda installation of R requirements.