
relernn's Introduction

ReLERNN

Recombination Landscape Estimation using Recurrent Neural Networks

====================================================================

ReLERNN uses deep learning to infer the genome-wide landscape of recombination from as few as four individually sequenced chromosomes, or from allele frequencies inferred by pooled sequencing. This repository contains the code and instructions required to run ReLERNN, and includes example files to ensure everything is working properly. The manuscript detailing ReLERNN can be found here.

Recommended installation on linux

Install tensorflow 2 on your system. Directions can be found here. You will also need to install the CUDA toolkit and CuDNN. ReLERNN requires a CUDA-enabled NVIDIA GPU. The current version of ReLERNN has been successfully tested with tensorflow/2.2.0, cudatoolkit/10.1.243, and cudnn/7.6.5.
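
One route that users have reported working (a sketch; the environment name is arbitrary, and the CUDA toolkit/CuDNN versions must match your TensorFlow build) is a fresh conda environment with TensorFlow installed via pip:

$ conda create --name relernn python=3.7
$ conda activate relernn
$ pip install tensorflow==2.2.0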

Further dependencies for ReLERNN can be installed with pip. This is done with the following commands:

$ git clone https://github.com/kr-colab/ReLERNN.git
$ cd ReLERNN
$ pip install .

It should be as simple as that.

Testing ReLERNN

An example VCF file (5 contigs; 10 haploid chromosomes) and a shell script for running ReLERNN's four modules are located in $/ReLERNN/examples. To test the functionality of ReLERNN, simply use the following commands:

$ cd examples
$ ./example_pipeline.sh

Provided everything worked as planned, $/ReLERNN/examples/example_output/ should be populated with a few directories along with the files: example.PREDICT.txt and example.PREDICT.BSCORRECT.txt. The latter is the finalized output file with your recombination rate predictions and estimates of uncertainty.

The above example took 57 seconds to complete on a Xeon machine using four CPUs and one NVIDIA 2070 GPU. Note that the parameters used for this example were designed only to test the success of the installation, not to make accurate predictions. Please use the guidelines below for the best results when analyzing real data.

You can now test the functionality of ReLERNN for use with pool-seq data by using the following commands:

$ cd examples
$ ./example_pipeline_pool.sh

Estimating a recombination landscape from individually sequenced chromosomes

The ReLERNN pipeline is executed using four commands: ReLERNN_SIMULATE, ReLERNN_TRAIN, ReLERNN_PREDICT, and the optional ReLERNN_BSCORRECT (see the Method flow diagram).

Before running ReLERNN

ReLERNN takes as input a VCF file of biallelic variants. Users should apply appropriate QC techniques (filtering low-quality variants, etc.) and remove non-biallelic variants before running ReLERNN. Small contigs (<< 250 SNPs) should not be included in the genome file passed to --genome, though they do not need to be removed from the VCF. ReLERNN also requires that the number of sampled chromosomes is identical across all contigs, and VCFs should be filtered accordingly. Hemizygous chromosomes or haploid samples in an otherwise diploid dataset should ideally be analyzed separately, using their own VCF. It is possible to treat hemizygous chromosomes as "diploids with missing data" using the --forceDiploid option, but this is not recommended. It is now possible to run ReLERNN on VCFs with missing genotypes (coded as a .).
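
For reference, the file passed to --genome is a plain tab-separated, three-column BED (chromosome, start, end), with one line per chromosome to evaluate; a minimal sketch with hypothetical names and lengths:

chr1	0	23011544
chr2	0	21146708
chr3	0	24543557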

If you want to make predictions based on equilibrium simulations, you can skip ahead to executing ReLERNN_SIMULATE. While ReLERNN is generally robust to demographic model misspecification, prediction accuracy may be improved by simulating the training set under a demographic history that accurately matches that of your sample. ReLERNN optionally takes the output files from three popular demographic history inference programs (stairwayplot_v1, SMC++, and MSMC) and simulates a training set under these histories. Note: for SMC++, use the .csv output (option -c in SMC++). It is up to the user to perform the proper due diligence to ensure that the population size histories reported by these programs are sound. In our opinion, unless you know exactly how these programs work and you expect your data to represent a history dramatically different from equilibrium, you are better off skipping this step and training ReLERNN on equilibrium simulations. Once you have run one of the demographic history inference programs listed above, simply provide its raw output file to ReLERNN_SIMULATE using the --demographicHistory option.
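
For example, with SMC++ the required CSV is produced at the plotting step; a sketch with hypothetical file names, using the -c option mentioned above:

$ smc++ plot -c history.pdf model.final.json
$ ReLERNN_SIMULATE [other arguments] --demographicHistory history.csv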

Step 1) ReLERNN_SIMULATE

ReLERNN_SIMULATE reads your VCF file and splits it by chromosome. The chromosomes to be evaluated must be specified by providing a BED file of those positions using the --genome argument. A BED-formatted accessibility mask (with non-overlapping ascending windows) may optionally be provided using the --mask option. Use the --phased or --unphased flag to train using phased or unphased genotypes (the default is unphased). The VCF file must use the extension .vcf, and the prefix of that file will serve as the prefix for all output files (e.g. running ReLERNN on the file population7.vcf will generate the result file population7.PREDICT.txt). It is strongly recommended that you use the default setting for --maxWinSize; larger values can cause training to fail, and smaller values can result in lower accuracy.

You are required to provide an estimate of the per-base mutation rate for your sample, along with an estimate of the generation time (in years). If you previously ran one of the demographic history inference programs listed above, use the same values that you used for them. This is also where you point to the output from said program, using --demographicHistory; if you are not simulating under an inferred history, simply omit this option. Importantly, you can also set a value for the maximum recombination rate to be simulated using --upperRhoThetaRatio. If you have an a priori estimate for an upper bound on the ratio of rho to theta, set it here, but keep in mind that higher values will dramatically slow the coalescent simulations.

We recommend using the default number of train/test/validation simulation examples, but if you want to simulate more examples, go right ahead. ReLERNN_SIMULATE then uses msprime to simulate 100k training examples and 1k validation and test examples. All output files will be generated in subdirectories within the path provided to --projectDir. You must use the same projectDir for all four ReLERNN commands; if you want to run ReLERNN on multiple populations/taxa, run them independently using a unique projectDir for each. This step is simulation heavy, and runtimes will depend strongly on the inferred population size.
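
To tie the options together, a minimal sketch of an invocation (file names, rates, and the rho/theta bound are hypothetical placeholders; add --demographicHistory only if you ran one of the inference programs above):

$ ReLERNN_SIMULATE \
      --vcf population7.vcf \
      --genome genome.bed \
      --projectDir ./population7_output/ \
      --assumedMu 1e-8 \
      --assumedGenTime 1 \
      --upperRhoThetaRatio 10 \
      --unphased \
      --nCPU 4 \
      --seed 42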

The complete list of arguments used in ReLERNN_SIMULATE is found below:

ReLERNN_SIMULATE -h

usage: ReLERNN_SIMULATE [-h] [-v VCF] [-g GENOME] [-m MASK] [-d OUTDIR]
                        [-n DEM] [-u MU] [-l GENTIME] [-r UPRTR] [-t NCPU] [-s SEED]
                        [--phased] [--unphased] [--forceDiploid] [--phaseError PHASEERROR]
                        [--maxWinSize WINSIZEMX] [--maskThresh MASKTHRESH]
                        [--nTrain NTRAIN] [--nVali NVALI] [--nTest NTEST]

optional arguments:
  -h, --help            show this help message and exit
  -v VCF, --vcf VCF     Filtered and QC-checked VCF file. Important: Every row
                        must correspond to a biallelic SNP with no missing
                        data!
  -g GENOME, --genome GENOME
                        BED-formatted (i.e. zero-based) file corresponding to
                        chromosomes and positions to consider
  -m MASK, --mask MASK  BED-formatted file corresponding to inaccessible bases
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  -n DEM, --demographicHistory DEM
                        Output file from either stairwayplot, SMC++, or MSMC
  -u MU, --assumedMu MU
                        Assumed per-base mutation rate
  -l GENTIME, --assumedGenTime GENTIME
                        Assumed generation time (in years)
  -r UPRTR, --upperRhoThetaRatio UPRTR
                        Assumed upper bound for the ratio of rho to theta
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to total available cores)
  -s SEED, --seed SEED  Random seed
  --phased              VCF file is phased
  --unphased            VCF file is unphased
  --forceDiploid        Treats all samples as diploids
                        with missing data (bad idea; see README)
  --phaseError PHASEERROR
                        Fraction of bases simulated with incorrect phasing
  --maxWinSize WINSIZEMX
                        Max number of sites per window to train on. Important:
                        too many sites causes problems in training
  --maskThresh MASKTHRESH
                        Discard windows where >= maskThresh percent of sites
                        are inaccessible
  --nTrain NTRAIN       Number of training examples to simulate
  --nVali NVALI         Number of validation examples to simulate
  --nTest NTEST         Number of test examples to simulate

Step 2) ReLERNN_TRAIN

ReLERNN_TRAIN takes the simulations created by ReLERNN_SIMULATE and uses them to train a recurrent neural network. Again, we recommend using the defaults for --nEpochs and --nValSteps, but if you would like to do more training, feel free. To set the GPU to be used on machines with multiple dedicated GPUs, use --gpuID (e.g. if running an analysis on two populations simultaneously, set --gpuID 0 for the first population and --gpuID 1 for the second). ReLERNN_TRAIN outputs some basic metrics of the training results for you, generating the figure $/projectDir/networks/vcfprefix.pdf. The default value of --nCPU is 1 for this step, as this often produces the shortest training times per epoch (depending on missing data and the mask). Feel free to test training times using multiple cores, and set --nCPU to whatever works best for your data/machine.
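
A minimal sketch of an invocation (the projectDir is the hypothetical one from the previous step; defaults are used for --nEpochs and --nValSteps):

$ ReLERNN_TRAIN \
      --projectDir ./population7_output/ \
      --gpuID 0 \
      --seed 42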

The complete list of arguments used in ReLERNN_TRAIN is found below:

ReLERNN_TRAIN -h

usage: ReLERNN_TRAIN [-h] [-d OUTDIR] [--nEpochs NEPOCHS]
                     [-t NCPU] [-s SEED]
                     [--nValSteps NVALSTEPS] [--gpuID GPUID]

optional arguments:
  -h, --help            show this help message and exit
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to 1)
  -s SEED, --seed SEED  Random seed
  --nEpochs NEPOCHS     Number of epochs to train over
  --nValSteps NVALSTEPS
                        Number of validation steps
  --gpuID GPUID         Identifier specifying which GPU to use

Step 3) ReLERNN_PREDICT

ReLERNN_PREDICT now takes the same VCF file you used in ReLERNN_SIMULATE and predicts per-base recombination rates in non-overlapping windows across the genome. The output file of predictions will be created as $/projectDir/vcfprefix.PREDICT.txt. It is important to note that the window size used for predictions might differ between chromosomes. A complete list of the window sizes used for each chromosome can be found in the third column of $/projectDir/networks/windowSizes.txt. Use the optional --minSites argument to exclude windows with fewer than the desired number of SNPs. If you are not interested in estimating confidence intervals around the predictions, your ReLERNN analysis is now finished. If you are getting OOM errors at this step, you can try setting --batchSizeOverride to a value significantly less than the total number of windows along a chromosome (found in the last column of $/projectDir/networks/windowSizes.txt).
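
A minimal sketch of an invocation (file names and the --minSites value are hypothetical placeholders):

$ ReLERNN_PREDICT \
      --vcf population7.vcf \
      --projectDir ./population7_output/ \
      --minSites 50 \
      --gpuID 0 \
      --seed 42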

The complete list of arguments used in ReLERNN_PREDICT is found below:

ReLERNN_PREDICT -h

usage: ReLERNN_PREDICT [-h] [-v VCF] [-d OUTDIR] [--minSites MINS]
                       [--gpuID GPUID] [--batchSizeOverride BSO] [-s SEED]

optional arguments:
  -h, --help            show this help message and exit
  -v VCF, --vcf VCF     Filtered and QC-checked VCF file. Important: Every row
                        must correspond to a biallelic SNP with no missing
                        data!
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  --phased              VCF file is phased
  --unphased            VCF file is unphased
  --minSites MINS       Minimum number of SNPs in a genomic window required to
                        return a prediction
  --gpuID GPUID         Identifier specifying which GPU to use
  --batchSizeOverride BSO
                        Batch size to use for low memory applications
  -s SEED, --seed SEED  Random seed

Optional Step 4) ReLERNN_BSCORRECT

However, you might want to have an idea of the uncertainty around your predictions. This is where ReLERNN_BSCORRECT comes in. ReLERNN_BSCORRECT generates 95% confidence intervals around each prediction and additionally attempts to correct for systematic bias (see Materials and Methods). It does this by simulating a set of --nReps examples at each of nSlice recombination rate bins. It then uses the network that was trained in ReLERNN_TRAIN to estimate the distribution of predictions around each known recombination rate. The result is both an estimate of uncertainty and a prediction that has been slightly corrected to account for biases in how the network predicts in this area of parameter space. The resulting file is created as $/projectDir/vcfprefix.PREDICT.BSCORRECT.txt and is formatted similarly to $/projectDir/vcfprefix.PREDICT.txt, with the addition of columns for the low and high 95% CI bounds. Note that this step is simulation heavy and runtimes can be slow.
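
A minimal sketch of an invocation (the projectDir is the hypothetical one from the previous steps; --nSlice and --nReps are left at their defaults):

$ ReLERNN_BSCORRECT \
      --projectDir ./population7_output/ \
      --nCPU 4 \
      --gpuID 0 \
      --seed 42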

The complete list of arguments used in ReLERNN_BSCORRECT is found below:

ReLERNN_BSCORRECT -h

usage: ReLERNN_BSCORRECT [-h] [-d OUTDIR] [-t NCPU] [-s SEED] [--gpuID GPUID]
                         [--nSlice NSLICE] [--nReps NREPS]

optional arguments:
  -h, --help            show this help message and exit
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to total available cores)
  -s SEED, --seed SEED  Random seed
  --gpuID GPUID         Identifier specifying which GPU to use
  --nSlice NSLICE       Number of recombination rate bins to simulate over
  --nReps NREPS         Number of simulations per step

Estimating a recombination landscape from pool-seq data

Similar to the directions above, the ReLERNN pipeline for pool-seq data is executed using four commands: ReLERNN_SIMULATE_POOL, ReLERNN_TRAIN_POOL, ReLERNN_PREDICT_POOL, and the optional ReLERNN_BSCORRECT.

Before running ReLERNN

ReLERNN for pool-seq analyses takes as input a file of genomic positions and allele frequencies (herein a 'POOLFILE'; see example file).

Similar to ReLERNN for individually sequenced chromosomes, if you want to make predictions based on equilibrium simulations, you can skip ahead to executing ReLERNN_SIMULATE_POOL. While ReLERNN is generally robust to demographic model misspecification, prediction accuracy may be improved by simulating the training set under a demographic history that accurately matches that of your sample. ReLERNN optionally takes the raw output files from three popular demographic history inference programs (stairwayplot_v1, SMC++, and MSMC) and simulates a training set under these histories. It is up to the user to perform the proper due diligence to ensure that the population size histories reported by these programs are sound. In our opinion, unless you know exactly how these programs work and you expect your data to represent a history dramatically different from equilibrium, you are better off skipping this step and training ReLERNN on equilibrium simulations. Once you have run one of the demographic history inference programs listed above, simply provide its raw output file to ReLERNN_SIMULATE_POOL using the --demographicHistory option.

Step 1) ReLERNN_SIMULATE_POOL

ReLERNN_SIMULATE_POOL reads your POOLFILE and splits it by chromosome. The number of chromosomes in the pool must be specified using the --sampleDepth argument. The chromosomes to be evaluated must be specified by providing a BED file of those positions using the --genome argument. A BED-formatted accessibility mask (with non-overlapping ascending windows) may optionally be provided using the --mask option. The POOLFILE must use the extension .pool, and the prefix of that file will serve as the prefix for all output files (e.g. running ReLERNN on the file population7.pool will generate the result file population7.PREDICT.txt). It is strongly recommended that you use the default setting for --maxSites; larger values can cause training to fail, and smaller values can result in lower accuracy.

You are required to provide an estimate of the per-base mutation rate for your sample, along with an estimate of the generation time (in years). If you previously ran one of the demographic history inference programs listed above, use the same values that you used for them. This is also where you point to the output from said program, using --demographicHistory; if you are not simulating under an inferred history, simply omit this option. Importantly, you can also set a value for the maximum recombination rate to be simulated using --upperRhoThetaRatio. If you have an a priori estimate for an upper bound on the ratio of rho to theta, set it here, but keep in mind that higher values will dramatically slow the coalescent simulations.

We recommend using the default number of train/test/validation simulation examples, but if you want to simulate more examples, go right ahead. ReLERNN_SIMULATE_POOL then uses msprime to simulate 100k training examples and 1k validation and test examples. All output files will be generated in subdirectories within the path provided to --projectDir. You must use the same projectDir for all four ReLERNN commands; if you want to run ReLERNN on multiple populations/taxa, run them independently using a unique projectDir for each. This step is simulation heavy, and runtimes will depend strongly on the inferred population size.
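
A minimal sketch of an invocation (file names and rates are hypothetical placeholders; --sampleDepth 20 mirrors the pooled example shipped with the repository):

$ ReLERNN_SIMULATE_POOL \
      --pool population7.pool \
      --sampleDepth 20 \
      --genome genome.bed \
      --projectDir ./population7_pool_output/ \
      --assumedMu 1e-8 \
      --assumedGenTime 1 \
      --nCPU 4 \
      --seed 42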

The complete list of arguments used in ReLERNN_SIMULATE_POOL is found below:

ReLERNN_SIMULATE_POOL -h

usage: ReLERNN_SIMULATE_POOL [-h] [-p POOL] [--sampleDepth SAMD] [-g GENOME] [-m MASK] [-d OUTDIR]
                             [-n DEM] [-u MU] [-l GENTIME] [-r UPRTR] [-t NCPU] [-s SEED]
                             [--maxSites WINSIZEMX] [--maskThresh MASKTHRESH]
                             [--nTrain NTRAIN] [--nVali NVALI] [--nTest NTEST]

optional arguments:
  -h, --help            show this help message and exit
  -p POOL, --pool POOL  Filtered and QC-checked POOL file.
  --sampleDepth SAMD    Number of chromosomes in pool
  -g GENOME, --genome GENOME
                        BED-formatted (i.e. zero-based) file corresponding to
                        chromosomes and positions to consider
  -m MASK, --mask MASK  BED-formatted file corresponding to inaccessible bases
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  -n DEM, --demographicHistory DEM
                        Output file from either stairwayplot, SMC++, or MSMC
  -u MU, --assumedMu MU
                        Assumed per-base mutation rate
  -l GENTIME, --assumedGenTime GENTIME
                        Assumed generation time (in years)
  -r UPRTR, --upperRhoThetaRatio UPRTR
                        Assumed upper bound for the ratio of rho to theta
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to total available cores)
  -s SEED, --seed SEED  Random seed
  --maxSites WINSIZEMX
                        Max number of sites per window to train on. Important:
                        too many sites causes problems in training
  --maskThresh MASKTHRESH
                        Discard windows where >= maskThresh percent of sites
                        are inaccessible
  --nTrain NTRAIN       Number of training examples to simulate
  --nVali NVALI         Number of validation examples to simulate
  --nTest NTEST         Number of test examples to simulate

Step 2) ReLERNN_TRAIN_POOL

ReLERNN_TRAIN_POOL takes the simulations created by ReLERNN_SIMULATE_POOL and uses them to train a recurrent neural network. The only difference here is that the mean read depth of the pool must be specified using the --readDepth argument. You can also specify a minor allele frequency threshold (--maf) if a similar threshold was used to generate your POOLFILE. Again, we recommend using the defaults for --nEpochs and --nValSteps, but if you would like to do more training, feel free. To set the GPU to be used on machines with multiple dedicated GPUs, use --gpuID (e.g. if running an analysis on two populations simultaneously, set --gpuID 0 for the first population and --gpuID 1 for the second). ReLERNN_TRAIN_POOL outputs some basic metrics of the training results for you, generating the figure $/projectDir/networks/poolprefix.pdf. The default value of --nCPU for this step is the maximum number of available cores, as training on pooled data with a single core can be very slow.
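
A minimal sketch of an invocation (the depth and maf values are hypothetical placeholders, mirroring the pooled example shipped with the repository):

$ ReLERNN_TRAIN_POOL \
      --projectDir ./population7_pool_output/ \
      --readDepth 20 \
      --maf 0.05 \
      --gpuID 0 \
      --seed 42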

The complete list of arguments used in ReLERNN_TRAIN_POOL is found below:

ReLERNN_TRAIN_POOL -h

usage: ReLERNN_TRAIN_POOL [-h] [-d OUTDIR] [--readDepth SEQD] [--maf MAF] [--nEpochs NEPOCHS]
                          [--nValSteps NVALSTEPS] [-t NCPU] [-s SEED] [--gpuID GPUID]

optional arguments:
  -h, --help            show this help message and exit
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  --readDepth SEQD      Mean read depth of the pool
  --maf MAF             Discard simulated sites with allele frequencies < maf
  --nEpochs NEPOCHS     Number of epochs to train over
  --nValSteps NVALSTEPS
                        Number of validation steps
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to total available cores)
  -s SEED, --seed SEED  Random seed
  --gpuID GPUID         Identifier specifying which GPU to use

Step 3) ReLERNN_PREDICT_POOL

ReLERNN_PREDICT_POOL now takes the same POOLFILE you used in ReLERNN_SIMULATE_POOL and predicts per-base recombination rates in non-overlapping windows across the genome. The output file of predictions will be created as $/projectDir/poolprefix.PREDICT.txt. It is important to note that the window size used for predictions might differ between chromosomes. A complete list of the window sizes used for each chromosome can be found in the third column of $/projectDir/networks/windowSizes.txt. Use the optional --minSites argument to exclude windows with fewer than the desired number of SNPs. If you are not interested in estimating confidence intervals around the predictions, your ReLERNN analysis is now finished. If you are getting OOM errors at this step, you can try setting --batchSizeOverride to a value significantly less than the total number of windows along a chromosome (found in the last column of $/projectDir/networks/windowSizes.txt).
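
A minimal sketch of an invocation (file names and the --minSites value are hypothetical placeholders):

$ ReLERNN_PREDICT_POOL \
      --pool population7.pool \
      --projectDir ./population7_pool_output/ \
      --minSites 50 \
      --gpuID 0 \
      --seed 42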

The complete list of arguments used in ReLERNN_PREDICT_POOL is found below:

ReLERNN_PREDICT_POOL -h

usage: ReLERNN_PREDICT_POOL [-h] [-p POOL] [-d OUTDIR] [--minSites MINS]
                            [--batchSizeOverride BSO] [--gpuID GPUID] [-s SEED]

optional arguments:
  -h, --help            show this help message and exit
  -p POOL, --pool POOL  Filtered and QC-checked POOL file.
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  --minSites MINS       Minimum number of SNPs in a genomic window required to
                        return a prediction
  --batchSizeOverride BSO
                        Batch size to use for low memory applications
  --gpuID GPUID         Identifier specifying which GPU to use
  -s SEED, --seed SEED  Random seed

Optional Step 4) ReLERNN_BSCORRECT

This step is exactly the same as in ReLERNN for individually sequenced chromosomes (above).

The complete list of arguments used in ReLERNN_BSCORRECT is found below:

ReLERNN_BSCORRECT -h

usage: ReLERNN_BSCORRECT [-h] [-d OUTDIR] [-t NCPU] [-s SEED] [--gpuID GPUID]
                         [--nSlice NSLICE] [--nReps NREPS]

optional arguments:
  -h, --help            show this help message and exit
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to total available cores)
  -s SEED, --seed SEED  Random seed
  --gpuID GPUID         Identifier specifying which GPU to use
  --nSlice NSLICE       Number of recombination rate bins to simulate over
  --nReps NREPS         Number of simulations per step

relernn's People

Contributors

andrewkern, jgallowa07, jradrion, nspope


relernn's Issues

loss: nan - val_loss nan

Hi,
For some of my training runs, not all, the loss values eventually start being reported as 'nan'.
Epoch 308/1000
99/100 [============================>.] - ETA: 2s - loss: 0.2445Epoch 1/1000
100/100 [==============================] - 328s 3s/step - loss: 0.2445 - val_loss: 0.2342
Epoch 309/1000
99/100 [============================>.] - ETA: 3s - loss: nanEpoch 1/1000
100/100 [==============================] - 332s 3s/step - loss: nan - val_loss: nan
Epoch 310/1000
99/100 [============================>.] - ETA: 2s - loss: nanEpoch 1/1000
100/100 [==============================] - 325s 3s/step - loss: nan - val_loss: nan

It usually terminates successfully after a repeat of 'nan'. Just wanted to verify that there is nothing wrong with the training that would influence the predictions in this case.
thanks,
@stsmall

Question about software usage

Hi
I would like to know if it is possible to use this software to estimate recombination rates and then analyze population demographics using linkage-disequilibrium-based Ne estimation that uses recombination rate information.
If you have time, please tell me.

tensorflow needed in requirements.txt

After creating an anaconda environment (conda create --name relernn python=3.7) and installing dependencies following the README, I was not able to run ./example_pipeline.sh without manually installing tensorflow (pip install tensorflow). After pip-installing tensorflow (tensorflow-2.2.0-cp37-cp37m-manylinux2010_x86_64.whl), the example script ran through just fine.

NumPy Version Related Error

Hi! I am trying to get the example for ReLERNN working but I keep getting a NumPy version error where it wants a newer version of NumPy. Do you know which line in which script specifies the version of NumPy?

How many diploid samples should be at least used

Hi, I am wondering how many diploid samples should be used. I found the paper used at least 4 chromosomes, so for diploid samples, at least two individuals should be used. Am I right? And what about the accuracy?
Thank you very much

ReLERNN_PREDICT_HOTSPOT unable to run

When I run ReLERNN_PREDICT_HOTSPOT with default parameters, I get an error of

File "/project-whj/software/ReLERNN/ReLERNN/ReLERNN_PREDICT_HOTSPOT", line 100, in main
pred_sequence = VCFBatchGenerator(**bds_pred_params)
TypeError: init() got an unexpected keyword argument 'WIN'

Even using the examples file of the software, I don't know how to solve it, please give me your guidance

Demographic history error message

Hi Jeff,
Just following up about the --demographicHistory error message with SIMULATE. The --help output of SIMULATE says it needs the output from SMC++, which is currently a model.final.json file, but this returns an error. --demographicHistory is actually looking for a .csv file, and I've confirmed this works if you have SMC++ produce a .csv file. So, if you change the error message of SIMULATE to ask for a .csv, or include this in the help message, it should clear up any confusion. Thanks for your help!
Best,
Kenny

Issue with chromosome length

Hi!

I am facing a very unusual issue with ReLERNN regarding chromosome length. I am studying a species with chromosomes longer than 2,147,483,647 bp, which is the usual limit for integer storage in memory.
I can of course divide my chromosomes to take that into account (which I usually have to do, as most software have the same issue), but if you could consider that for one of your next releases, it would be amazing!

Thanks a lot!

Issue with seed in examples

Hi!

I just installed ReLERNN and tried it on the example dataset, but I got some unexpected issue during the simulation stage.
Here is the error message

Traceback (most recent call last):
  File "/hpc2n/eb/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/hpc2n/eb/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/j/jbruxaux/.local/lib/python3.11/site-packages/ReLERNN/simulator.py", line 301, in worker_simulate
    result_q.put([i,self.runOneMsprimeSim(i,direc)])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/jbruxaux/.local/lib/python3.11/site-packages/ReLERNN/simulator.py", line 87, in runOneMsprimeSim
    random.seed(SEED)
  File "/hpc2n/eb/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/random.py", line 160, in seed
    raise TypeError('The only supported seed types are: None,\n'
TypeError: The only supported seed types are: None,
int, float, str, bytes, and bytearray.

I am using python 3.11, TensorFlow 2.13.0, CUDA 11.4.1 and cuDNN 8.2.2.26 on a v100 gpu node.
Am I missing something?

Thanks for your help!

Chromosome length bounded to 20 Mbp

Hello,

I have tried the ReLERNN pipeline on some poolseq data for a genome including 5 chromosomes of about 45 Mbp each.
The pipeline works fine but I noticed that the maximal position considered in the splitPOOL files is 20Mbp.
I guess this is a result of the max number of sites the pipeline can handle at once; is this correct? Or is there any other issue I should worry about?

In any case, great tool and impressive computational performance.

Best,
Guillaume

Separate predictions on different samples from same vcf?

I have a use case where samples are all in the same vcf and share the same hyperparameters (mutation rate etc.), but I would like to make predictions for separate taxa. Out-of-the-box prediction on new samples from the same vcf didn't work, I guess because the windows are broken up according to the vcf/samples given to ReLERNN_SIMULATE. It seems like a waste to do distinct sims for this. I would be happy to try tackling this if it seems feasible. Naively, it seems like it would require basing the windows on a reference genome instead of a vcf? Are there issues I'm overlooking?

ReLERNN_SIMULATE not splitting vcf file properly

Dear users,

I have used ReLERNN with a previous individual-resequencing dataset with no problem at all.

However, I am running into problems with a new dataset. It seems ReLERNN fails at the step of splitting the VCF into the different per-chromosome VCFs. By looking at this directory for the analysis that previously worked, I see there are files missing.

[s_menb@jupiter SS1]$ cd splitVCFs/
[s_menb@jupiter splitVCFs]$ ll
total 4204
-rw-r--r--. 1 s_menb clusteruser 1285117 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr1:0-61357614.hdf5
-rw-r--r--. 1 s_menb clusteruser 1284380 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr2:0-58906861.hdf5
-rw-r--r--. 1 s_menb clusteruser 1162613 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr3:0-53163979.hdf5
-rw-r--r--. 1 s_menb clusteruser 566572 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr4:0-17018963.hdf5

My vcf file (for a single individual) was created using Platypus, filtering for biallelic positions.

My bed file is tab separated as follows:

Chr1 0 61357614
Chr2 0 58906861
Chr3 0 53163979
Chr4 0 17018963

And below the error message:

Traceback (most recent call last):
  File "/cluster/software/relernn/ReLERNN-1.0.0_ve/bin/ReLERNN_SIMULATE", line 245, in <module>
    main()
  File "/cluster/software/relernn/ReLERNN-1.0.0_ve/bin/ReLERNN_SIMULATE", line 160, in main
    thetaW=maxS/a
ZeroDivisionError: division by zero

Any ideas or suggestions?

Best wishes

ME

Genome bed file

Hello,
I am trying to run ReLERNN for my species on pool sample and it is failing due to this error:
Error: genome file must be formatted as a bed file (i.e.'chromosome start end')
head of my genome.bed is :
CM009931.2 0 27754200
CM009932.2 0 16093500
CM009933.2 0 13619445
CM009934.2 0 13404451
CM009935.2 0 13920984

Any suggestions what I am doing wrong?
Thank You in advance.
Tanu

RELERNN SIMULATE issue with vcf?

Hello, I think I may be having a similar issue to the closed issue #7. I commented on that thread as well, but wasn't sure if GitHub sends alerts for comments on closed issues.

I’ve removed all the hemizygous/haploid chromosomes from my vcf and my windowSizes file only has chromosomes with sample size 6

However, I am getting the following error:
Reading HDF5 mask: /home/ddebaun/mendel-nas1/redo_recombination/splitVCFs/Leioheterodon_madagascarensis_B_biallelic_7204_RagTag:0-11000_md_mask.hdf5...
Traceback (most recent call last):
  File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 245, in <module>
    main()
  File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 152, in main
    md_mask = np.concatenate(md_mask)
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 6 and the array at index 6 has size 3

I’m including the information I used for this run (first 50k lines of vcf). Is this also an issue with the vcf for scikit-allel? Any help would be appreciated!
filestorun.zip

missing comma from line 9 in setup.py

Hi, I believe there is a comma missing from line 9 in the newly committed setup.py; this is causing the dependency-installation command "pip install ." to fail.

memory allocate error on PREDICT

Hi,
I have successfully run SIMULATE and TRAIN, but am now running into a memory error on PREDICT:
"MemoryError: Unable to allocate 32.4 GiB for an array with shape (14212, 1801, 340) and data type float32"
I realize that this likely on my side of things, but I wanted to know if it was possible to pass a VCF to PREDICT that was a subset of the VCF used in the SIMULATE step? PREDICT seems to look for the VCF as an hdf5 ...
thanks,
@stsmall

Recommendations for how to parameterize ReLERNN

Dear @jradrion @andrewkern et al,

I am exploring the possibility of using ReLERNN to infer recombination rates in a non-model arthropod with a large genome and high levels of nucleotide diversity (1-1.5%). Nothing is known currently about the recombination in this species and we have no ground-truth evidence to fall back on to verify results. Our assembly is relatively fragmented and our sample size is just below 70 diploids. The decay of LD seems relatively rapid in our data (phased with BEAGLE 4.0), similar to what is seen in many other arthopods.

The ReLERNN paper speaks much about the of relationship between mutation and recombination rates, and both the mutation rate and parameter "--upperRhoThetaRatio" seems to be key to successful inference.

I tried setting --upperRhoThetaRatio to 35 as in the paper and used a mutation rate typical for arthropods, and while all steps in ReLERNN worked on my machine with a powerful GPU, the inferred recombination rates came out very flat, with dips around contig brakes along scaffolds or genes with reduced levels of variation, suggesting training and parameterization has not worked well.

ReLERNN is new to me and I am not sure how to move forward.

Can you give some hints as for what parameters to tweak?

A higher or lower "--upperRhoThetaRatio"?

Removing variants with low minor allele frequencies?


Error running example of ReLERNN

Hello,
After installing all the dependencies and packages, when I try to run the example I get the error below.
ReLERNN/examples$ python2.7 ./example_pipeline.sh
  File "./example_pipeline.sh", line 14
    ${SIMULATE}
    ^
SyntaxError: invalid syntax
ReLERNN/examples$  python3.5 ./example_pipeline.sh
  File "./example_pipeline.sh", line 14
    ${SIMULATE}
    ^
SyntaxError: invalid syntax

/ReLERNN/examples$ ./example_pipeline.sh
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ReLERNN_SIMULATE", line 4, in
    import('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/ReLERNN-0.1-py2.7.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 7, in
    from ReLERNN.imports import *
  File "/usr/local/lib/python2.7/dist-packages/ReLERNN-0.1-py2.7.egg/ReLERNN/init.py", line 4, in
    from ReLERNN.helpers import *
  File "/usr/local/lib/python2.7/dist-packages/ReLERNN-0.1-py2.7.egg/ReLERNN/helpers.py", line 184
    out_fp.write(f"pop0,{time_years},{size}\n")
                                             ^
SyntaxError: invalid syntax
./example_pipeline.sh: line 15: --vcf: command not found
./example_pipeline.sh: line 16: --genome: command not found
./example_pipeline.sh: line 17: --mask: command not found
./example_pipeline.sh: line 18: --phased: command not found
./example_pipeline.sh: line 19: --projectDir: command not found
./example_pipeline.sh: line 20: --assumedMu: command not found
./example_pipeline.sh: line 21: --upperRhoThetaRatio: command not found
./example_pipeline.sh: line 22: --nTrain: command not found
./example_pipeline.sh: line 23: --nVali: command not found
./example_pipeline.sh: line 24: --nTest: command not found
./example_pipeline.sh: line 25: --nCPU: command not found

Any help or suggestion is appreciated.
Thank you,
Tanushree

The train step needs large memory

Hi,
When I run the RELERNN_TRAIN with default settings, the step was killed because of the large memory, how to deal with this? could you share your help? Thank you very much.

Illegal instruction 4 errors when running the example

Hello,

I am running ReLERNN on a Mac with an M1 chip and suspect that this might be the main cause of the following error when running the example file. Is there an update for macOS installations with the M1-M3 chips?

Here is the error:

(msprime-env) frankburbrink@Mac-Studio examples % ./example_pipeline_pool.sh
./example_pipeline_pool.sh: line 25: 54699 Illegal instruction: 4 ${SIMULATE} --pool ${POOL} --sampleDepth 20 --genome ${GENOME} --mask ${MASK} --projectDir ${DIR} --assumedMu ${MU} --upperRhoThetaRatio ${URTR} --nTrain 13000 --nVali 2000 --nTest 100 --seed ${SEED}
./example_pipeline_pool.sh: line 34: 54779 Illegal instruction: 4 ${TRAIN} --projectDir ${DIR} --readDepth 20 --maf 0.05 --nEpochs 2 --nValSteps 2 --seed ${SEED}
./example_pipeline_pool.sh: line 40: 54783 Illegal instruction: 4 ${PREDICT} --pool ${POOL} --projectDir ${DIR} --seed ${SEED}
./example_pipeline_pool.sh: line 47: 54788 Illegal instruction: 4 ${BSCORRECT} --projectDir ${DIR} --nSlice 2 --nReps 2 --seed ${SEED}

Thanks for any advice!

Frank

The filepath provided must end in `.keras` (Keras model format)

Hello,
I'm trying to test ReLERNN installation running the example_pipeline.sh and I'm having the following error during training step:

Traceback (most recent call last):
  File "/home/quaranta/anaconda3/bin/ReLERNN_TRAIN", line 130, in <module>
    main()
  File "/home/quaranta/anaconda3/bin/ReLERNN_TRAIN", line 109, in main
    runModels(ModelFuncPointer=GRU_TUNED84,
  File "/home/quaranta/anaconda3/lib/python3.10/site-packages/ReLERNN/helpers.py", line 353, in runModels
    ModelCheckpoint(
  File "/home/quaranta/anaconda3/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py", line 191, in __init__
    raise ValueError(
ValueError: The filepath provided must end in `.keras` (Keras model format). Received: filepath=./example_output/networks/weights.h5

How can I solve this? Thanks

Which python version needed to successfully run ReLERNN?

Hi, I'm having difficulty getting all the dependencies to install when running pip install. What python version should I use? Perhaps this could be included in documentation? Thank you!!

Andre Moncrieff
Postdoc at Louisiana State University

ReLERNN_TRAIN_POOL is slow

I had to remove multiprocessing from model.fit to remedy a memory leak with tensorflow 2, which means that the generation of training data when training on pooled sequences is now painfully slow. I will be working on a fix for this issue, but I do not currently have a resolution.

Using --mask option does not change the output

Hello,

I've run ReLERNN both with and without the --mask option, yet I've obtained identical results (same window size and the resulting table with nSites). For --mask option I provide a .bed file containing masked transposable elements obtained from EDTA.
In the log file, I notice the following message:

'Accessibility mask found: calculating the proportion of the genome that is masked...
44.0% of the genome inaccessible'

Despite this, there is no impact on the output. Could you shed some light on why this might be the case?

Error with ReLERNN_SIMULATE

Hi!

Really excited about using ReLERNN to estimate recombination in some natural data with a low-ish sample size (n=22), and also to have a go on some pool-seq data too.

Just tried to run it on my natural data, and I get the following error message when reading the hdf5 files:

Reading HDF5: "ReLERNN/splitVCFs/paria_marianne_1027798.final_chr1:0-34343053.hdf5"...
Process Process-2:
Error: chromosomes have different numbers of samples
Traceback (most recent call last):
  File "/gpfs/ts0/home/jrp228/.local/bin/ReLERNN_SIMULATE", line 4, in <module>
    __import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 750, in run_script
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 1527, in run_script
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 219, in <module>
Traceback (most recent call last):
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 199, in worker_countSites
    if md_mask.any():
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/allel/abc.py", line 43, in __getattr__
    return getattr(self.values, item)
AttributeError: 'Dataset' object has no attribute 'any'
    main()
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 108, in main
    wins, nSamps, maxS, maxLen = vcf_manager.countSites(nProc=nProc)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 170, in countSites
    return sorted_wins, nSamps[0], maxS, maxLen
IndexError: list index out of range


My vcf file is pretty standard, although there is some missing data, and I'm running ReLERNN like this:
ReLERNN_SIMULATE -v paria_marianne_1027798.final.vcf -g STAR.extents.bed -m STAR.chromosomes.release.repeats.bed -d ReLERNN/ -u 4.8e-8 --unphased

I checked the vcf files generated in the first step of the script, and they all have the same number of samples:
for i in *vcf; do bcftools query -l $i | wc -l; done | sort | uniq
22

empty files in SplitVCF/

When I run it on my PC, all the files listed in the SplitVCF/ folder are empty after running the simulation, and I didn't observe any related errors in the console log.
PS: I just used the example.

ReLERNN_TRAIN step has a problem: ValueError: The filepath provided must end in `.keras` (Keras model format)

Hi, I got an error when I ran the TRAIN step; could you help? Thank you very much. Here is my code:

/data2/software-use/ReLERNN/ReLERNN/ReLERNN_SIMULATE -v species.vcf \
    -d myfile \
    --demographicHistory species_plot.csv \
    -u 3.3e-9 \
    -l 1 \
    -t 40 \
    -s 123 \
    -g genome.length.bed \
    --unphased

/data2/software-use/ReLERNN/ReLERNN/ReLERNN_TRAIN -d myfile -t 20 -s 123
The output is:
Total params: 76,002,769 (289.93 MB)
Trainable params: 76,002,769 (289.93 MB)
Non-trainable params: 0 (0.00 B)
Traceback (most recent call last):
  File "/data2/software-use/ReLERNN/ReLERNN/ReLERNN_TRAIN", line 130, in <module>
    main()
  File "/data2/software-use/ReLERNN/ReLERNN/ReLERNN_TRAIN", line 109, in main
    runModels(ModelFuncPointer=GRU_TUNED84,
  File "/data2/software-use/anaconda3/envs/ReLERNN_python/lib/python3.10/site-packages/ReLERNN/helpers.py", line 353, in runModels
    ModelCheckpoint(
  File "/data2/software-use/anaconda3/envs/ReLERNN_python/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py", line 191, in __init__
    raise ValueError(
ValueError: The filepath provided must end in `.keras` (Keras model format). Received: filepath=myfile/networks/weights.h5

Installation instructions

The dependencies in requirements.txt and setup.py are identical, so it seems to me the instructions could be simplified to

$ git clone https://github.com/kern-lab/ReLERNN.git
$ cd ReLERNN
$ pip install .

instead of

$ git clone https://github.com/kern-lab/ReLERNN.git
$ cd ReLERNN
$ pip install -r requirements.txt
$ python setup.py install

Or is there any reason for the two steps and legacy install?

very different corrections when bootstrapping

Hi!

I used ReLERNN to estimate the recombination rate along a very long genome, and ran the analyses by pieces of 500Mb. The results between the different parts are comparable when I use the results of the "predict" function, but differ a lot after correction with the "bscorrect" function.
For example, before correction: [figure: recombination_rate_1Mb_chr8_2]
And after correction: [figure: recombination_rate_1Mb_chr8]
Any idea what could cause such differences? Is there anything I should do?
Thanks in advance!

Error with ReLERNN example script.

Hello,
I was able to install and run the script but in the second step I get this warning and error

ReLERNN_SIMULATE_POOL.py FINISHED!

Using TensorFlow backend.
Warning: training data to be treated as if generated by pool-seq
Model: "model_1"


Layer (type) Output Shape Param #

input_1 (InputLayer) (None, 2930, 2) 0


bidirectional_1 (Bidirection (None, 168) 44352


dense_1 (Dense) (None, 256) 43264


dropout_1 (Dropout) (None, 256) 0


dense_2 (Dense) (None, 1) 257

Total params: 87,873
Trainable params: 87,873
Non-trainable params: 0


2019-12-13 14:29:39.255505: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-13 14:29:39.289945: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2497315000 Hz
2019-12-13 14:29:39.293212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x532a2b0 executing computations on platform Host. Devices:
2019-12-13 14:29:39.293241: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNN' used by {{node bidirectional_1/CudnnRNN}}with these attrs: [dropout=0, seed=87654321, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="gru", is_training=true, seed2=0]
Registered devices: [CPU, XLA_CPU]
Registered kernels:

     [[bidirectional_1/CudnnRNN]]

During handling of the above exception, another exception occurred:

I don't have GPUs; is there any other way I can run it?
Any suggestions?
Thank You,
Tanushree

Unable to allocate memory with ReLERNN_TRAIN_POOL

I'm running into memory issues with ReLERNN_TRAIN_POOL. I'm not sure if this is a ReLERNN problem, or (more likely) something about the way my cluster and GPUs are set up.

I'm running my analysis on a cluster (CPU: Intel Xeon Gold 6240 @ 2.60GHz, GPU: NVIDIA RTX 2080Ti), using 24 threads. I installed the dependencies through conda, here are the versions I'm using:

tensorflow | 2.2.0
cudatoolkit | 10.2.89
cudnn | 7.6.5
msprime | 0.7.4
scikit-learn | 0.23.1
scikit-allel | 1.3.1
matplotlib | 3.2.2

While ReLERNN_TRAIN_POOL runs I get frequent warnings:

WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.

It runs for a while and then eventually (~24 hours, 49 epochs) relernn gives a memory allocation error:

Traceback (most recent call last):
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/utils/data_utils.py", line 843, in _run
    with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/utils/data_utils.py", line 820, in pool_fn
    pool = get_pool_class(True)(
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/context.py", line 276, in _Popen
    return Popen(process_obj)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

This error doesn't cause any out-of-memory errors in the scheduler and doesn't halt the program: I just notice that the logs stop updating and I have to manually stop the job.

Here's the command I ran:

ReLERNN_TRAIN_POOL -d genome_MQ20_minDP90_maf05 --readDepth 255 --maf 0.05 -t 24

If it helps, this is coming from a pool of 81 diploid individuals (so I specified --sampleDepth 162 in ReLERNN_SIMULATE_POOL). The genome size is ~3Gb, and I have just under 55k SNPs at the current level of filtering. The cluster I'm running on has I think up to 512 Gb of memory to work with, though if I check memory usage of the failed job with sacct I get some nonsensical numbers (MaxRSS = 18130.31G, MaxVMSize = 19035.73G), so I'm not sure what's going on there.

Also if it helps, I was successfully able to run the example pooled pipeline, but it took longer than I was expecting given what the readme says for the non-pooled example: ~60 minutes running on 4 cores.

Please let me know if you need any other info. I'm very new with running GPU-based analyses, so even if you can just point me in the right direction in terms of questions to ask my sysadmin, I'd appreciate it.

Wrong auto-estimate of #CPUs (Slurm)

The automatic estimate of the number of CPUs available is wrong on an HPC cluster with Slurm scheduler. The program counts all CPUs on the node, not just the ones allocated by Slurm. As a result, if it has not been allocated the whole node, it tries to start too many processes and it crashes.

To fix this, it should detect whether the environment variable SLURM_NTASKS has been set, and if so, set the number of processes equal to: either SLURM_NTASKS * SLURM_CPUS_PER_TASK (if SLURM_CPUS_PER_TASK has been set), or SLURM_NTASKS (if SLURM_CPUS_PER_TASK has not been set). If SLURM_NTASKS is unset, proceed as before.

The problem is of course resolved by using the -t flag. It is, however, inconvenient that one must then also alter the example_pipeline.sh and example_pipeline_pool.sh scripts, which are meant as elementary ready-to-run tests.
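
Until this is implemented, a minimal shell workaround (assuming the standard Slurm variables SLURM_NTASKS and SLURM_CPUS_PER_TASK described above) is to compute the allocated CPU count yourself and pass it explicitly with -t:

# Use the Slurm allocation if present; otherwise fall back to all cores
if [ -n "${SLURM_NTASKS:-}" ]; then
    NCPU=$(( SLURM_NTASKS * ${SLURM_CPUS_PER_TASK:-1} ))
else
    NCPU=$(nproc)
fi
ReLERNN_SIMULATE -t "${NCPU}" [other arguments]   # likewise for the other modules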

Hi Im wondering how to generate the example.vcf haplotype file format

I simulated the example.vcf file format and split a phased VCF file into a VCF file containing two haplotypes (CHR like 3L, 3R) with my own python script. I made sure the haplotype VCF file can be processed by other VCF tools (VCFtools, bcftools), but ReLERNN_SIMULATE told me 'Error: chromosomes have different sample sizes!'.
Please help me, and tell me how to generate the haplotype VCF file quickly. THANKS!!!

ReLERNN train TF2 model.fit memory leak and errors

I have problems running the TF2 version of relernn.
I'm using:
tensorflow 2.1
cudatk 10.1.243
cudnn 7.6.4
CUDA enabled GPU (1080Ti)

Memory leak
Each training iteration memory usage keeps increasing which eventually leads to >200GB RAM usage. I think it's related to these issues
tensorflow/tensorflow#33030
tensorflow/tensorflow#35100
I also tried nightly which has the same issue.

Error message
I'm also getting error and warning messages in each epoch with TF2.

2020-02-22 01:44:32.078164: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.

I don't know if these problems are related but maybe they are.

On a side note, I can run the example pipeline and it produces output, even though another error comes up when loading modules (Could not load dynamic library 'libnvinfer.so.6').

There was another issue with ReLERNN train (earlier TF1 commits, where model.fit_generator was used). The model fitting would not succeed after all epochs ran, without any error message. Maybe you have an idea what the problem could be here? Then I could use the TF1 version of ReLERNN and run my stuff that way.

I'm running it on a dataset with 5 individuals and about 2M SNPs (unphased, with some missing data).

Any help would be greatly appreciated.
