williamslab / ped-sim Goto Github PK
View Code? Open in Web Editor NEWPedigree simulator
License: GNU General Public License v3.0
Pedigree simulator
License: GNU General Public License v3.0
all -O2 -o main.o -c main.cc
g++ -MMD -std=c++11 -Wall -O2 -o cmdlineopts.o -c cmdlineopts.cc
g++ -MMD -std=c++11 -Wall -O2 -o cointerfere.o -c cointerfere.cc
gcc -MMD -Wall -O2 -o memcpy.o -c memcpy.c
<inline asm>:1:1: error: unknown directive
.symver memcpy, memcpy@GLIBC_2.2.5
^
1 error generated.
make: *** [memcpy.o] Error 1
Line 127 in fbaa79e
Sorry, I will follow-up with an e-mail; thought I ought to query this one: the result from 'no_interfere' will either be truncated or rounded to integer, and given the likely values of 'p' & 'Length' (e.g. those of Campbell et al) the following loop will seldom be entered? - I think you need to test 'p' within the loop (as Karl has done)?
n.b. I'm a Systems & Software Engineer with an interest in Genealogy - we're most interested in your upcoming Paper, which is underpinned by this software?
Regards,
Malcolm Peach (BSc Hons, MSc)
Hi there :)
I'm getting a seg fault when working with the following command + genetic map. No VCF, just simulating a pedigree. Just working with the example def file for now. The command i used:
/workflow/scripts/ped-sim/ped-sim -d workflow/scripts/ped-sim/example/cousins-1st_half_to_3rd.def -m results/ped-sim-ag.gmap -o ped-sim --pois
and the genetic map:
2RL 1 0.0
2RL 58984778 117.97
2RL 61545105 119.25
2RL 61645106 119.25
2RL 64076722 120.466
2RL 64132875 120.578
2RL 66687494 121.855
2RL 66724067 121.929
2RL 67433980 122.284
2RL 111009430 209.434
3RL 1 0.0
3RL 38988757 77.978
3RL 41860198 79.413
3RL 52161877 100.017
3RL 53200684 100.536
3RL 53200685 100.536
3RL 55015803 101.444
3RL 55097514 101.607
3RL 57435893 102.776
3RL 57465397 102.83500000000001
3RL 58232376 103.219
3RL 95164119 177.082
Here is the log:
Thank you.
Hi,
I'm not sure if it's an issue in ped-sim, or in the structure of my input VCF, but when trying to run ped-sim and get out a vcf file, I am getting a segfault as soon as it starts writing out the new vcf.
if it helps, I ran gdb on the run that is failing, which is attached.
Any help would be very much appreciated.
Starting program: /data/projects/skate/data/ped-sim/1000G_high/../../../software/ped-sim/ped-sim -d full_half_1st_2nd_cousins.def -m refined_mf.simmap -o test_cousins_High_confidence1000G --intf nu_p_campbell.tsv -i test2.vcf --keep_phase
Pedigree simulator! v1.1.11 (Released 19 Sep 2020)
Def file: full_half_1st_2nd_cousins.def
Map file: refined_mf.simmap
Input VCF: test2.vcf
Output prefix: test_cousins_High_confidence1000G
Random seed: 641804785
Interference file: nu_p_campbell.tsv
Genotype error rate: 1.0e-03
Opposite homozygous error rate: 0.00
Missingness rate: 1.0e-03
Pseudo-haploid rate: 0
Not retaining extra samples in output VCF (printing only simulated samples)
Output VCF will contain phased data
Simulating haplotype transmissions... done.
Printing IBD segments... done.
Reading input VCF meta data... done.
Input contains 433 samples, using 17 as founders, and retaining 0
Generating VCF file...
Program received signal SIGSEGV, Segmentation fault.
0x000000000041806c in void makeVCF<_IO_FILE*>(std::vector<SimDetails, std::allocator >&, Person*****, int, char const*, char*, GeneticMap&, _IO_FILE**) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-322.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb)
When I try to run ped-sim
on VCF without chrX data and map-file without chrX and don't specify --sexes
, I get the following error:
ERROR: could not open sexes file (null)!
open: Bad address
It seems like there is inconsistency in input arguments handling
if (CmdLineOpts::inVCFfile) {
if (!CmdLineOpts::vcfSexesFile && haveXmap) {
for(int o = 0; o < 2; o++) {
fprintf(outs[o], "WARNING: input VCF supplied and an X chromosome genetic map, but no --sexes file\n");
fprintf(outs[o], " output VCF will *not* include X chromosome data\n\n");
}
}
else
readSexes(sexes, sexesCountData, CmdLineOpts::vcfSexesFile);
}
In my case:
haveXmap
is False
CmdLineOpts::vcfSexesFile
is False
So (!CmdLineOpts::vcfSexesFile && haveXmap)
is False and else case is executing what is wrong, because i don't specify sexes-file.
I had to draw them out to visualize the example relationships. It would be great to include a image per def in the examples to verify.
The First Crossover (generated from the G* array) has a coin-toss associated with it although the zero-crossovers decision has already been made... I think that means there is also a zero-crossovers path through the crossovers section (i.e. through the coin-tosses)?
Hello! Thanks for making this open-source and for putting together such an informative README. I'm still wrapping my brain around branching and the pedigree def file, but for now, I'm using your examples trying to simulate genetic data from a VCF file, and I'm getting an unexpected error about chromosomes in my VCF either out of order or not present.
I first created the map file exactly as specified in the README, named it the same, refined_mf.simmap
. My ped def file (sims.def
) file is pretty simple:
# test definition
def testped 5 3
1 1
2 1
3 1
When not using an input VCF I don't run into any issue:
$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv -o sims/test-gbr
Pedigree simulator! v1.1.16 (Released 8 Feb 2021)
Def file: sims.def
Map file: refined_mf.simmap
Input VCF: [none: no genetic data]
Output prefix: sims/test-gbr
Random seed: 2550799409
Interference file: ../nu_p_campbell.tsv
Simulating haplotype transmissions... done.
Printing IBD segments... done.
To simulate genetic data, must use an input VCF with 20 founders.
However, when I do try to simulate variants using an input VCF, I run into a problem:
$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv -i gbr.22.snps.vcf.gz -o sims/test-gbr
Pedigree simulator! v1.1.16 (Released 8 Feb 2021)
Def file: sims.def
Map file: refined_mf.simmap
Input VCF: gbr.22.snps.vcf.gz
Output prefix: sims/test-gbr
Random seed: 4250646984
Interference file: ../nu_p_campbell.tsv
Genotype error rate: 1.0e-03
Opposite homozygous error rate: 0.00
Missingness rate: 1.0e-03
Pseudo-haploid rate: 0
Not retaining extra samples in output VCF (printing only simulated samples)
Output VCF will contain unphased data
Simulating haplotype transmissions... done.
Printing IBD segments... done.
Reading input VCF meta data... done.
Input contains 40 samples, using 20 as founders, and retaining 0
Generating VCF file... ERROR: chromosome 22 in VCF file either out of order or not present
in genetic map
Prior to running this I checked that there were no SNPs in the VCF outside the range defined in the map. I created the VCF with minimal pre-processing starting from the GRCh37 1000 Genomes chromosome 22 VCF. I first collected sample IDs for 20 unrelated samples from GBR, then subsetted the VCF getting only those samples, only variants on chromosome 22 between the region 17152611-51175626, and the -v snps -m2 -M2 -i 'INFO/AF>0.05'
gets only biallelic SNPs with a global MAF>0.05, then finally sorts the output and writes a new VCF.
bcftools view ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--regions 22:17152611-51175626 \
--samples 40-gbr.txt \
-v snps -m2 -M2 -i 'INFO/AF>0.05' \
| bcftools sort -Oz -o gbr.22.snps.vcf.gz
Here's that VCF in case you want it for a reproducible example:
Looking at the first few and last few chromosome 22 entries on the sex-specific map:
$ grep ^22 refined_mf.simmap | sort -nk2 | head -n2
22 17087705 0 0
22 17152611 0.003573334924 0.0023823617488
$ grep ^22 refined_mf.simmap | sort -nk2 | tail -n2
22 51175626 61.01549487409 76.3046256962048
22 51229805 61.023018820178 76.3068982668672
... shows that subsetting the vcf to the region 22:17152611-51175626 shouldn't result in any variants outside the range defined by the map. Indeed, checking the VCF itself shows that the first record is at position 17152611, and the last is at position 51175626, both included in the sex-specific map (refined_mf.simmap
).
$ bcftools view --no-header gbr.22.snps.vcf.gz | head -n1 | cut -f1-6
22 17152611 rs4819849 A G 100
$ bcftools view --no-header gbr.22.snps.vcf.gz | tail -n1 | cut -f1-6
22 51175626 rs3810648 A G 100
Thanks, and please let me know if I can help with reproducing this issue on your end!
I have found that the format fields for all individuals must only contain the GT
tag. If I specify any other tag, some of the genotypes get very odd formats. See the second sample in this example.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT trio1_g2-b1-i1 trio1_g2-b2-i1
chr1 752721 rs3131972 A G . PASS . GT:PS 0|1:752721 0:752721|0
chr1 838555 rs4970383 C A . PASS . GT:PS 1|0:752721 0:752721|0
chr1 846808 rs4475691 C T . PASS . GT:PS 0|0:752721 1:752721|0
chr1 854250 rs7537756 A G . PASS . GT:PS 0|0:752721 1:752721|0
chr1 861808 rs13302982 A G . PASS . GT:PS 0|1:752721 0:752721|1
chr1 888659 rs3748597 T C . PASS . GT:PS 1|1:752721 1:752721|0
chr1 894573 rs13303010 G A . PASS . GT:PS 0|0:752721 0:752721|1
chr1 897564 rs13303229 T C . PASS . GT:PS 1|1:752721 1:752721|0
chr1 918573 rs2341354 A G . PASS . GT:PS 0|0:752721 0:752721|0
chr1 990417 rs2465136 T C . PASS . GT:PS 0|1:752721 1:752721|1
chr1 998395 rs7526076 A G . PASS . GT:PS 1|1:752721 1:752721|0
chr1 1018704 rs9442372 A G . PASS . GT:PS 1|0:752721 0:752721|0
chr1 1040026 rs6671356 T C . PASS . GT:PS 0|0:752721 1:752721|1
chr1 1062638 rs9442373 C A . PASS . GT:PS 0|0:752721 0:752721|0
chr1 1097335 rs9442385 T G . PASS . GT:PS 1|1:752721 1:752721|0
chr1 1110019 rs11260542 A G . PASS . GT:PS 0|0:752721 0:752721|0
chr1 1113121 rs12092254 G A . PASS . GT:PS 0|0:752721 0:752721|0
chr1 1130727 rs10907175 A C . PASS . GT:PS 0|0:752721 0:752721|0
chr1 1158277 rs3813199 G A . PASS . GT:PS 0|0:752721 0|0
chr1 1178925 rs12093154 G A . PASS . GT:PS 0|0:752721 0|0
chr1 1203938 rs4018608 T C . PASS . GT:PS 0|0:752721 0|1
chr1 1220136 rs2144440 A G . PASS . GT:PS 0|1:752721 0|0
chr1 1227897 rs3737721 A G . PASS . GT:PS 0|0:752721 0|1
chr1 1258246 rs3845293 A C . PASS . GT:PS 0|0:752721 0|1
chr1 1305561 rs17160669 C T . PASS . GT:PS 0|1:752721 0|0
chr1 1335302 rs1240707 C T . PASS . GT:PS 0|0:752721 0|0
Line 59 in 20d921b
The resulting Crossover Interval should have a mean of 100cM; if the distribution about this were triangular and extended 100cM, the resultant PDF will be a horizontal line i.e. a Random distribution. As the Poisson distribution has a bell-shape, and assuming the tails extend to 100cM, the resultant distribution will be essentially a Random one with a ripple superimposed?
I've been working on an implementation of the Broman & Weber (Interference-only) model (in Microsoft Excel) and have used the FHS & AGRE data to calibrate the 'nu' values; I've had the luxury of being able to calculate (1 - G*) to set the zero-crossover rates, but on comparison with the data reported in Fledel-Alon et al I've found it necessary to scale the rates by 1/sqrt(2) to balance the model in agreement with both data-sets. I haven't been able to find any other data-sources for the zero crossover rates; it's also possible that I've made a mistake somewhere, but I thought it was worth a mention.
Regards,
Malcolm Peach
I'd like to specify which reproducing parent from a branch in previous generation to use, but we only get the first one.
If two parents are listed as in [parent_branch1]_[parent_branch2], the two reproducing parents in the indicated branches (from the previous generation) have children.
But the closest def I could create was:
def example 1 3
1 1 3
2 2 3 1:1 2:2 3:3
3 2 2 1:1_2 2:2_3
Hi,
Thank you for being so responsive. I hope this will be my last issue. Running ped-sim with compressed vcf files, the output from pedsim is gzipped vcf (.vcf.gz). However, bcftools requires that the file be compressed with bgzip, not gzip (which is also a .gz file, for reasons that I have never understood).
Is it possible to update to use bgzip not gzip for outputs?
Thank you,
=Matt
When the input VCF is compressed (.gz using bgzip), the output should also be block-compressed. Unfortunately, it is gzip compressed. This is not compatible with downstream tools. See how samtools/htslib outputs vcfs.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.