Code Monkey home page Code Monkey logo

ped-sim's People

Contributors

lirazk avatar stephenturner avatar williamslab avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

ped-sim's Issues

Unable to compile macOS 10.13.2: error: unknown directive

all  -O2 -o main.o -c main.cc
g++ -MMD -std=c++11 -Wall  -O2 -o cmdlineopts.o -c cmdlineopts.cc
g++ -MMD -std=c++11 -Wall  -O2 -o cointerfere.o -c cointerfere.cc
gcc -MMD -Wall  -O2 -o memcpy.o -c memcpy.c
<inline asm>:1:1: error: unknown directive
.symver memcpy, memcpy@GLIBC_2.2.5
^
1 error generated.
make: *** [memcpy.o] Error 1

Escape pathway closed?

int n_nixo = no_interfere(randomGen);

Sorry, I will follow-up with an e-mail; thought I ought to query this one: the result from 'no_interfere' will either be truncated or rounded to integer, and given the likely values of 'p' & 'Length' (e.g. those of Campbell et al) the following loop will seldom be entered? - I think you need to test 'p' within the loop (as Karl has done)?

n.b. I'm a Systems & Software Engineer with an interest in Genealogy - we're most interested in your upcoming Paper, which is underpinned by this software?

Regards,
Malcolm Peach (BSc Hons, MSc)

seg fault

Hi there :)

I'm getting a seg fault when working with the following command + genetic map. No VCF, just simulating a pedigree. Just working with the example def file for now. The command i used:

/workflow/scripts/ped-sim/ped-sim -d workflow/scripts/ped-sim/example/cousins-1st_half_to_3rd.def -m results/ped-sim-ag.gmap -o ped-sim --pois

and the genetic map:

2RL 1 0.0
2RL 58984778 117.97
2RL 61545105 119.25
2RL 61645106 119.25
2RL 64076722 120.466
2RL 64132875 120.578
2RL 66687494 121.855
2RL 66724067 121.929
2RL 67433980 122.284
2RL 111009430 209.434
3RL 1 0.0
3RL 38988757 77.978
3RL 41860198 79.413
3RL 52161877 100.017
3RL 53200684 100.536
3RL 53200685 100.536
3RL 55015803 101.444
3RL 55097514 101.607
3RL 57435893 102.776
3RL 57465397 102.83500000000001
3RL 58232376 103.219
3RL 95164119 177.082

Here is the log:

ped-sim.log

Thank you.

segfault?

Hi,

I'm not sure if it's an issue in ped-sim, or in the structure of my input VCF, but when trying to run ped-sim and get out a vcf file, I am getting a segfault as soon as it starts writing out the new vcf.

if it helps, I ran gdb on the run that is failing, which is attached.

Any help would be very much appreciated.

Starting program: /data/projects/skate/data/ped-sim/1000G_high/../../../software/ped-sim/ped-sim -d full_half_1st_2nd_cousins.def -m refined_mf.simmap -o test_cousins_High_confidence1000G --intf nu_p_campbell.tsv -i test2.vcf --keep_phase
Pedigree simulator! v1.1.11 (Released 19 Sep 2020)

Def file: full_half_1st_2nd_cousins.def
Map file: refined_mf.simmap
Input VCF: test2.vcf
Output prefix: test_cousins_High_confidence1000G

Random seed: 641804785

Interference file: nu_p_campbell.tsv

Genotype error rate: 1.0e-03
Opposite homozygous error rate: 0.00
Missingness rate: 1.0e-03
Pseudo-haploid rate: 0

Not retaining extra samples in output VCF (printing only simulated samples)
Output VCF will contain phased data

Simulating haplotype transmissions... done.
Printing IBD segments... done.
Reading input VCF meta data... done.
Input contains 433 samples, using 17 as founders, and retaining 0
Generating VCF file...
Program received signal SIGSEGV, Segmentation fault.
0x000000000041806c in void makeVCF<_IO_FILE*>(std::vector<SimDetails, std::allocator >&, Person*****, int, char const*, char*, GeneticMap&, _IO_FILE**) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-322.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb)

Invalid sex file presence handling

When I try to run ped-sim on VCF without chrX data and map-file without chrX and don't specify --sexes, I get the following error:

ERROR: could not open sexes file (null)!
open: Bad address

It seems like there is inconsistency in input arguments handling

if (CmdLineOpts::inVCFfile) {
    if (!CmdLineOpts::vcfSexesFile && haveXmap) {
      for(int o = 0; o < 2; o++) {
	fprintf(outs[o], "WARNING: input VCF supplied and an X chromosome genetic map, but no --sexes file\n");
	fprintf(outs[o], "         output VCF will *not* include X chromosome data\n\n");
      }
    }
    else
      readSexes(sexes, sexesCountData, CmdLineOpts::vcfSexesFile);
  }

In my case:
haveXmap is False
CmdLineOpts::vcfSexesFile is False
So (!CmdLineOpts::vcfSexesFile && haveXmap) is False and else case is executing what is wrong, because i don't specify sexes-file.

Coin-toss on First Cross-over

The First Crossover (generated from the G* array) has a coin-toss associated with it although the zero-crossovers decision has already been made... I think that means there is also a zero-crossovers path through the crossovers section (i.e. through the coin-tosses)?

VCF out of order or chomosome not present

Hello! Thanks for making this open-source and for putting together such an informative README. I'm still wrapping my brain around branching and the pedigree def file, but for now, I'm using your examples trying to simulate genetic data from a VCF file, and I'm getting an unexpected error about chromosomes in my VCF either out of order or not present.

I first created the map file exactly as specified in the README, named it the same, refined_mf.simmap. My ped def file (sims.def) file is pretty simple:

# test definition
def testped 5 3
1 1
2 1
3 1

When not using an input VCF I don't run into any issue:

$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv  -o sims/test-gbr
Pedigree simulator!  v1.1.16    (Released  8 Feb 2021)

  Def file:             sims.def
  Map file:             refined_mf.simmap
  Input VCF:            [none: no genetic data]
  Output prefix:        sims/test-gbr

  Random seed:          2550799409

  Interference file:    ../nu_p_campbell.tsv

Simulating haplotype transmissions... done.
Printing IBD segments... done.

To simulate genetic data, must use an input VCF with 20 founders.

However, when I do try to simulate variants using an input VCF, I run into a problem:

$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv -i gbr.22.snps.vcf.gz -o sims/test-gbr
Pedigree simulator!  v1.1.16    (Released  8 Feb 2021)

  Def file:             sims.def
  Map file:             refined_mf.simmap
  Input VCF:            gbr.22.snps.vcf.gz
  Output prefix:        sims/test-gbr

  Random seed:          4250646984

  Interference file:    ../nu_p_campbell.tsv

  Genotype error rate:  1.0e-03
  Opposite homozygous error rate:       0.00
  Missingness rate:     1.0e-03
  Pseudo-haploid rate:  0

  Not retaining extra samples in output VCF (printing only simulated samples)
  Output VCF will contain unphased data

Simulating haplotype transmissions... done.
Printing IBD segments... done.
Reading input VCF meta data... done.
  Input contains 40 samples, using 20 as founders, and retaining 0
Generating VCF file... ERROR: chromosome 22 in VCF file either out of order or not present
       in genetic map

Prior to running this I checked that there were no SNPs in the VCF outside the range defined in the map. I created the VCF with minimal pre-processing starting from the GRCh37 1000 Genomes chromosome 22 VCF. I first collected sample IDs for 20 unrelated samples from GBR, then subsetted the VCF getting only those samples, only variants on chromosome 22 between the region 17152611-51175626, and the -v snps -m2 -M2 -i 'INFO/AF>0.05' gets only biallelic SNPs with a global MAF>0.05, then finally sorts the output and writes a new VCF.

bcftools view ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    --regions 22:17152611-51175626 \
    --samples 40-gbr.txt \
    -v snps -m2 -M2 -i 'INFO/AF>0.05' \
    | bcftools sort -Oz -o gbr.22.snps.vcf.gz

Here's that VCF in case you want it for a reproducible example:

gbr.22.snps.vcf.gz

Looking at the first few and last few chromosome 22 entries on the sex-specific map:

$ grep ^22 refined_mf.simmap | sort -nk2 | head -n2
22      17087705        0       0
22      17152611        0.003573334924  0.0023823617488

$ grep ^22 refined_mf.simmap | sort -nk2 | tail -n2
22      51175626        61.01549487409  76.3046256962048
22      51229805        61.023018820178 76.3068982668672

... shows that subsetting the vcf to the region 22:17152611-51175626 shouldn't result in any variants outside the range defined by the map. Indeed, checking the VCF itself shows that the first record is at position 17152611, and the last is at position 51175626, both included in the sex-specific map (refined_mf.simmap).

$ bcftools view --no-header gbr.22.snps.vcf.gz | head -n1 | cut -f1-6
22      17152611        rs4819849       A       G       100

$ bcftools view --no-header gbr.22.snps.vcf.gz | tail -n1 | cut -f1-6
22      51175626        rs3810648       A       G       100

Thanks, and please let me know if I can help with reproducing this issue on your end!

Update documentation for the input VCF to require only GT in the FORMAT field

I have found that the format fields for all individuals must only contain the GT tag. If I specify any other tag, some of the genotypes get very odd formats. See the second sample in this example.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  trio1_g2-b1-i1  trio1_g2-b2-i1
chr1    752721  rs3131972   A   G   .   PASS    .   GT:PS   0|1:752721  0:752721|0
chr1    838555  rs4970383   C   A   .   PASS    .   GT:PS   1|0:752721  0:752721|0
chr1    846808  rs4475691   C   T   .   PASS    .   GT:PS   0|0:752721  1:752721|0
chr1    854250  rs7537756   A   G   .   PASS    .   GT:PS   0|0:752721  1:752721|0
chr1    861808  rs13302982  A   G   .   PASS    .   GT:PS   0|1:752721  0:752721|1
chr1    888659  rs3748597   T   C   .   PASS    .   GT:PS   1|1:752721  1:752721|0
chr1    894573  rs13303010  G   A   .   PASS    .   GT:PS   0|0:752721  0:752721|1
chr1    897564  rs13303229  T   C   .   PASS    .   GT:PS   1|1:752721  1:752721|0
chr1    918573  rs2341354   A   G   .   PASS    .   GT:PS   0|0:752721  0:752721|0
chr1    990417  rs2465136   T   C   .   PASS    .   GT:PS   0|1:752721  1:752721|1
chr1    998395  rs7526076   A   G   .   PASS    .   GT:PS   1|1:752721  1:752721|0
chr1    1018704 rs9442372   A   G   .   PASS    .   GT:PS   1|0:752721  0:752721|0
chr1    1040026 rs6671356   T   C   .   PASS    .   GT:PS   0|0:752721  1:752721|1
chr1    1062638 rs9442373   C   A   .   PASS    .   GT:PS   0|0:752721  0:752721|0
chr1    1097335 rs9442385   T   G   .   PASS    .   GT:PS   1|1:752721  1:752721|0
chr1    1110019 rs11260542  A   G   .   PASS    .   GT:PS   0|0:752721  0:752721|0
chr1    1113121 rs12092254  G   A   .   PASS    .   GT:PS   0|0:752721  0:752721|0
chr1    1130727 rs10907175  A   C   .   PASS    .   GT:PS   0|0:752721  0:752721|0
chr1    1158277 rs3813199   G   A   .   PASS    .   GT:PS   0|0:752721  0|0
chr1    1178925 rs12093154  G   A   .   PASS    .   GT:PS   0|0:752721  0|0
chr1    1203938 rs4018608   T   C   .   PASS    .   GT:PS   0|0:752721  0|1
chr1    1220136 rs2144440   A   G   .   PASS    .   GT:PS   0|1:752721  0|0
chr1    1227897 rs3737721   A   G   .   PASS    .   GT:PS   0|0:752721  0|1
chr1    1258246 rs3845293   A   C   .   PASS    .   GT:PS   0|0:752721  0|1
chr1    1305561 rs17160669  C   T   .   PASS    .   GT:PS   0|1:752721  0|0
chr1    1335302 rs1240707   C   T   .   PASS    .   GT:PS   0|0:752721  0|0

Random distribution?

locations.push_back( unif_prob(randomGen) * length[sex] );

The resulting Crossover Interval should have a mean of 100cM; if the distribution about this were triangular and extended 100cM, the resultant PDF will be a horizontal line i.e. a Random distribution. As the Poisson distribution has a bell-shape, and assuming the tails extend to 100cM, the resultant distribution will be essentially a Random one with a ripple superimposed?

Zero-crossover percentages

I've been working on an implementation of the Broman & Weber (Interference-only) model (in Microsoft Excel) and have used the FHS & AGRE data to calibrate the 'nu' values; I've had the luxury of being able to calculate (1 - G*) to set the zero-crossover rates, but on comparison with the data reported in Fledel-Alon et al I've found it necessary to scale the rates by 1/sqrt(2) to balance the model in agreement with both data-sets. I haven't been able to find any other data-sources for the zero crossover rates; it's also possible that I've made a mistake somewhere, but I thought it was worth a mention.

Regards,
Malcolm Peach

Unable to create specify a simple three-generation pedigree

I'd like to specify which reproducing parent from a branch in previous generation to use, but we only get the first one.

If two parents are listed as in [parent_branch1]_[parent_branch2], the two reproducing parents in the indicated branches (from the previous generation) have children.

For example, I want:
img_7130

But the closest def I could create was:

def example 1 3
1 1 3
2 2 3 1:1 2:2 3:3
3 2 2 1:1_2 2:2_3

which gives:
pedigree

bgzip for vcf output

Hi,

Thank you for being so responsive. I hope this will be my last issue. Running ped-sim with compressed vcf files, the output from pedsim is gzipped vcf (.vcf.gz). However, bcftools requires that the file be compressed with bgzip, not gzip (which is also a .gz file, for reasons that I have never understood).

Is it possible to update to use bgzip not gzip for outputs?

Thank you,
=Matt

Output VCF.gz is not block compressed

When the input VCF is compressed (.gz using bgzip), the output should also be block-compressed. Unfortunately, it is gzip compressed. This is not compatible with downstream tools. See how samtools/htslib outputs vcfs.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.