brentp / peddy Goto Github PK

View Code? Open in Web Editor NEW

129.0 7.0 39.0 40.29 MB

genotype :: ped correspondence check, ancestry check, sex check. directly, quickly on VCF

License: MIT License

Python 8.67% HTML 2.22% Shell 0.03% Jupyter Notebook 89.08%

genomics vcf bioinformatics pedigree genotype ancestry

peddy's Introduction

Fast Pedigree::VCF QC

peddy compares familial-relationships and sexes as reported in a PED/FAM file with those inferred from a VCF.

It samples the VCF at about 25000 sites (plus chrX) to accurately estimate relatedness, IBS0, heterozygosity, sex and ancestry. It uses 2504 thousand genome samples as backgrounds to calibrate the relatedness calculation and to make ancestry predictions.

It does this very quickly by sampling, by using C for computationally intensive parts, and by parallelization.

If you use peddy, please cite Pedersen and Quinlan, Who’s Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy, The American Journal of Human Genetics (2017), http://dx.doi.org/10.1016/j.ajhg.2017.01.017

Note that somalier is a more scalable, faster, replacement for peddy that uses some of the same methods as peddy along with some new ones.

Quickstart

See installation below.

Most users will only need to run as a command-line tool with a ped and VCF, e.g:

python -m peddy -p 4 --plot --prefix ceph-1463 data/ceph1463.peddy.vcf.gz data/ceph1463.ped

This will use 4 cpus to run various checks and create ceph-1463.html which you can open in any browser to interactively explore your data.

It will also create create 4 csv files and 4 QC plots. These will indicate:

discrepancies between ped-reported and genotype-inferred relations
discrepancies between ped-reported and genotype-inferred sex
samples with higher levels of HET calls, lower depth, or more variance in b-allele-frequency (ref / (ref + alt )) for het calls.
an ancestry prediction based on projection onto the thousand genomes principal components

Finally, it will create a new file ped files ceph1463.peddy.ped that also lists the most useful columns from the het-check and sex-check. Users can first look at this extended ped file for an overview of likely problems.

See the docs for a walk-through and thorough explanation of each plot.

hg38 or custom sites

By default, peddy uses hg19/GRCh37. It can be forced to use sites for hg38 by passing --sites hg38. To create custom sites, have a look at the sites files included with peddy along with the corresponding .bin.gz which is just the raw binary alternate counts (gt_types) from thousand-genomes that have been written as uint8 and gzipped.

Speed

Because of the sampling approach and parallelization, peddy is very fast. With 4 CPUs, on the 17-member CEPH1643 pedigree whole-genome VCF, peddy can run the het-check and PCA in ~ 8 seconds. The pedigree check comparing all vs. all samples run in 3.6 seconds. It finishes the full set of checks in about 20 seconds.

In comparison KING runs in 14 seconds (it is extremely fast); the time including the conversion from VCF to binary ped is 85 seconds.

On larger datasets, with hundreds or thousands of samples, it can be beneficial to add as many cores as possible; for smaller datasets with dozens of samples about 4 processors reduces the computation time as much as 8 or more would.

Validation

The results between peddy and KING are comparable, but peddy does better on cohorts where most samples are related. See the figure below where the peddy relatedness estimate is closer to the actual than KING which over-estimates relatedness.

Peddy uses the KING algorithm for calculating relatedness and so they match quite well. Peddy also runs PCA on the 2504 samples from 1000 genomes, then fitting an SVM and predicting ancestry in addition to calculating relatedness among all pairwise combinations of the 17 samples.

Warnings and Checks

On creating a pedigree object (via Ped('some.ped'). peddy will print warnings to STDERR as appropriate like:

pedigree warning: '101811-101811' is dad but has female sex
pedigree warning: '101897-101897' is dad but has female sex
pedigree warning: '101896-101896' is mom of self
pedigree warning: '102110-102110' is mom but has male sex
pedigree warning: '102110-102110' is mom of self
pedigree warning: '101381-101381' is dad but has female sex
pedigree warning: '101393-101393' is mom but has male sex

unknown sample: 102498-102498 in family: K34175
unknown sample: 11509-11509 in family: K567331
unknown sample: 5180-5180 in family: K8565

Installation

Conda

Nearly all users should install using conda in the anaconda python distribution. This means have your own version of python easily installed via:

INSTALL_PATH=~/anaconda
wget http://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
# or wget http://repo.continuum.io/miniconda/Miniconda2-latest-MacOSX-x86_64.sh
bash Miniconda2-latest* -fbp $INSTALL_PATH
PATH=$INSTALL_PATH/bin:$PATH

conda update -y conda
conda config --add channels bioconda

conda install -y peddy

This should install all dependencies so you can then run peddy with 4 processes as:

python -m peddy --plot -p 4 --prefix mystudy $VCF $PED

Github

git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .

run with

peddy --plot -p 4 --prefix mystudy $VCF $PED

peddy's People

Contributors

Stargazers

Watchers

peddy's Issues

Only plot PCA from cohort without doing any other analysis

Hi brent,

Maybe you add an option to just plot PCA from cohort without doing any other analysis? I have a cohort of few hundreds of exomes and wanted to plot PCA wth 1K Genome data and not willing to do any other analysis, so if you add a simple option of doing that ( preferably if without even making that ped file as well )

Thanks
Najeeb

process was killed, Memory Error.

Thank you for providing this great tool.
I have a problem, my process was killed after running half an hour. From traceback, the last line displayed 'Memory Error', my desktop with 128 GB ram.
I have ~4,000 WES samples with a 62 GB vcf file and a 67 GB ped file.
Would you please give me some suggestions to successfully run the Peddy.
Thank you very much!

Yiming

heterozygosity

we can find probable sample contamination by looking at the percent of heterozygous calls for each sample. we can plot a histogram and report the value in the ped_check.

cc @arq5x

incorrect expected relatedness for consanguineous family

I have a family where the parents are first cousins. I gave Peddy an extended pedigree and it was able to distinguish the parents as cousins, which matches the calculated coefficient of relatedness (the parents are the green dot on the plot below). However, the children were all given an expected relatedness of 0.031, though the pedigree clearly lists both parents for each child. They are the orange dots on the plot below. I'm using Peddy v0.3.1.

Thanks!
Eric

Python/CyVCF2 error: "ZeroDivisionError: float division"

Hej @brentp!

I tried running peddy installed using conda on a standard VCF from our pipeline (GATK+Samtools+Freebayes) and seeing this error:

python -m peddy --plot --prefix cust002-F0009251 /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251.fam

no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:74402404-74402404
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:74671768-74671768
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:74692336-74692336
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:74962645-74962645
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:76754871-76754871
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:77137202-77137202
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:77473086-77473086
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:77703476-77703476
no intervals found for /mnt/hds/proj/bioinfo/SERVER/apps/housekeeper/analyses/cust002-F0009251/2017-03-14/F0009251_sorted_md_brecal_comb_vep_parsed_ranked_SV.vcf.gz at 18:77893683-77893683
Traceback (most recent call last):
  File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/stage/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/stage/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/stage/lib/python2.7/site-packages/peddy/__main__.py", line 222, in <module>
    main(a.vcf, a.ped, prefix, a.plot, a.each, a.procs, a.sites)
  File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/stage/lib/python2.7/site-packages/peddy/__main__.py", line 153, in main
    in ("ped_check", "het_check", "sex_check")]):
  File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/stage/lib/python2.7/site-packages/peddy/__main__.py", line 32, in run
    prefix=prefix, **kwargs)
  File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/stage/lib/python2.7/site-packages/peddy/peddy.py", line 955, in ped_check
    min_depth=min_depth, each=each)
  File "cyvcf2/cyvcf2.pyx", line 50, in cyvcf2.cyvcf2.par_relatedness (cyvcf2/cyvcf2.c:4342)
  File "cyvcf2/cyvcf2.pyx", line 824, in cyvcf2.cyvcf2.VCF._relatedness_finish (cyvcf2/cyvcf2.c:20634)
ZeroDivisionError: float division

Maybe you recognize something or can point us to what we can try to narrow down what the cause is?

add per-sample call-rate to vis

idea from @jxchong

would be nice if we had some indication of call-rate in the output. this could be indicated by the size of points in the middle plot (median depth vs proportion of HET calls).

allow exporting HTML table to csv

Running peddy with provided vcf and ped file

I attempted to run peddy with the provided vcf and ped files.

Command: python -m peddy -p 4 --plot --prefix ceph-1463 ceph1463.peddy.vcf.gz ceph1463.ped

Running Peddy version 0.4.3
Traceback (most recent call last):
File "/projects/academic/gbcstaff/utils/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/projects/academic/gbcstaff/utils/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/projects/academic/gbcstaff/utils/peddy/peddy/main.py", line 14, in
sys.exit(cli())
File "/projects/academic/gbcstaff/utils/anaconda2/lib/python2.7/site-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/projects/academic/gbcstaff/utils/anaconda2/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/projects/academic/gbcstaff/utils/anaconda2/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/projects/academic/gbcstaff/utils/anaconda2/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/projects/academic/gbcstaff/utils/peddy/peddy/cli.py", line 162, in peddy
ped_obj = Ped(ped)
File "/projects/academic/gbcstaff/utils/peddy/peddy/peddy.py", line 441, in init
self._parse(ped)
File "/projects/academic/gbcstaff/utils/peddy/peddy/peddy.py", line 460, in _parse
sample = Sample.from_row(toks, header=self.header, warn=self.warn)
File "/projects/academic/gbcstaff/utils/peddy/peddy/peddy.py", line 262, in from_row
return cls(row[0], row[1], row[2] or "-9", row[3] or "-9", row[4], row[5],
IndexError: list index out of range

can you use `https://cdn.plot.ly/plotly-latest.min.js` for plotly

http://cdn.plot.ly/plotly-latest.min.js produces a mixed content error in modern browsers

Mixed Content: The page at ...was loaded over HTTPS, but requested an insecure script 'http://cdn.plot.ly/plotly-latest.min.js'. This request has been blocked; the content must be served over HTTPS.

qc by family

flag unexpected relationships with different family ideas. Can use relatedness measures to 1kg samples as distribution.

use linkage blocks

This paper has a method to find independent linkage blocks. we could use this to pull a few snps for each block to calculate relatedness on unlinked variants.
http://bioinformatics.oxfordjournals.org/content/early/2015/10/01/bioinformatics.btv546.full

peddy warning sampleID with leading zero

Dear Peddy team,

bcbio_nextgen wes pipeline is runnig peddy as part of QC and here is the command

peddy --plot -p 4 --prefix test 02003-effects-annotated-filterSNP-filterINDEL.vcf.gz 02003-effects-annotated-filterSNP-filterINDEL.ped

Warning message

2018-11-18 09:04:04 ngslogin1 peddy.cli[13805] INFO Running Peddy version 0.4.2
2018-11-18 09:04:04 ngslogin1 peddy.cli[13805] WARNING 1 samples in vcf not in ped:
02003

2018-11-18 09:04:04 ngslogin1 peddy.cli[13805] WARNING 1 samples in ped not in vcf:
2003

I observed that it is ignoring leading zero from .ped file.
Is there any way to make peddy not to ignore leading zero's from sampleID field of .ped file.
Please suggest me?

Thanks In Advance
Fazulur Rehaman

Sex check error

I'm running peddy on a vcf created by plink. I had to "manually" change chromosome 25 to X to make sex check work and I get no errors now. But no sex is determined for any of my 96 samples. In the HTML, sex_check = TRUE for all samples.
How can I troubleshoot or make sex check work?
The output is here:
https://www.dropbox.com/s/1j1s5xny4uswprt/peddy.tgz?dl=0
Thanks.
I'm running 0.3.1.
Edit: plink predicts sex correctly for almost all samples, misses 9 (that's why I'm trying peddy)

get warning when run peddy

Hi,

When I try to run peddy, I always got warnings like the following.
"WARNING:
24 samples in vcf not in ped:"
Does this matter? The results looks fine.

Thanks very much!

IndexError: list index out of range

Hi I was trying to run PEDDY on family cohorts.
Here is PED file for family.

[nsyed@hpclogin1 PMC01]$ head PMC01.ped
PMC01   AB082422_S5_L005    ABM090010_S7_L007   ABC09005_S2_L002    1   -9          
PMC01   ABC09005_S2_L002    0   0   2   -9
PMC01   ABM090010_S7_L007   0   0   1   -9
[nsyed@hpclogin1 PMC01]$ zgrep  -m1 "#CHROM" ../../PMC/PMC01/PMC01.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AB082422_S5_L005    ABC09005_S2_L002    ABM090010_S7_L007

And this is command that I have used.
python -m peddy -p 12 --plot --prefix PMC01 ../../PMC/PMC01/PMC01.vcf.gz PMC01.ped

And below is the error.

[nsyed@hpclogin1 PMC01]$ python -m peddy -p 12 --plot ../../PMC/PMC01/PMC01.vcf.gz PMC01.ped
Traceback (most recent call last):
  File "/gpfs/software/tools/python2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/gpfs/software/tools/python2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/gpfs/software/tools/python2.7/lib/python2.7/site-packages/peddy-0.2.7-py2.7.egg/peddy/__main__.py", line 203, in <module>
    main(a.vcf, a.ped, prefix, a.plot, a.each, a.procs, a.sites)
  File "/gpfs/software/tools/python2.7/lib/python2.7/site-packages/peddy-0.2.7-py2.7.egg/peddy/__main__.py", line 121, in main
    sep="\t")
  File "/gpfs/software/tools/python2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/gpfs/software/tools/python2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/gpfs/software/tools/python2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/gpfs/software/tools/python2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
  File "pandas/parser.pyx", line 985, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11060)
  File "pandas/parser.pyx", line 1246, in pandas.parser.TextReader._get_column_name (pandas/parser.c:15064)
IndexError: list index out of range

support for hg38 and single chromosome

Hi Brent,

Thanks for the tool. I am wondering whether peddy support hg38 for PCA or other functions.

Also, is there a way to use peddy on a single chromosome? This wouldn't make sense for some functions, but would be useful to do quick check on chr22 from a large VCF.

Best,

pca sklearn failure in latest development version

Brent;
We've been trying to use the latest development in bcbio (to get the initial hg38 support) and running into a failure during pca. Is there a minimum number of input samples needed to run pca, or am I doing something else wrong?

2018-05-30 23:07:51 r3467 peddy.pca[22570] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 1)) in 0.8 seconds
Traceback (most recent call last):
  File "/g/data3/gx8/local/development/bcbio/anaconda/bin/peddy", line 11, in <module>
    load_entry_point('peddy==0.4.0', 'console_scripts', 'peddy')()
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 204, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 43, in run
    prefix=prefix, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/peddy.py", line 871, in het_check
    pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/pca.py", line 61, in pca
    clf.fit(genos1kg, background_target)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 348, in fit_transform
    U, S, V = self._fit(X)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 394, in _fit
    return self._fit_truncated(X, n_components, svd_solver)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 468, in _fit_truncated
    % (n_components, n_features, svd_solver))
ValueError: n_components=4 must be between 1 and n_features=1 with svd_solver='randomized'

Ideally we'd love to be able to get ancestry prediction for single sample inputs. Thanks for any pointers/suggestions.

IndexError: list index out of range (ped file is tab seperated and matplotlib is also updated).

2017-11-17 11:46:21 l-1-01 peddy.cli[97880] INFO Running Peddy version 0.3.2
Traceback (most recent call last):
File "/home/ganesans/miniconda/envs/grinenv/bin/peddy", line 11, in
load_entry_point('peddy', 'console_scripts', 'peddy')()
File "/home/ganesans/miniconda/envs/grinenv/lib/python3.5/site-packages/click/core.py", line 716, in call
return self.main(*args, **kwargs)
File "/home/ganesans/miniconda/envs/grinenv/lib/python3.5/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/ganesans/miniconda/envs/grinenv/lib/python3.5/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ganesans/miniconda/envs/grinenv/lib/python3.5/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/mnt/isilon/helbig_lab/software/peddy/peddy/cli.py", line 159, in peddy
ped_obj = Ped(ped)
File "/mnt/isilon/helbig_lab/software/peddy/peddy/peddy.py", line 441, in init
self._parse(ped)
File "/mnt/isilon/helbig_lab/software/peddy/peddy/peddy.py", line 460, in _parse
sample = Sample.from_row(toks, header=self.header, warn=self.warn)
File "/mnt/isilon/helbig_lab/software/peddy/peddy/peddy.py", line 262, in from_row
return cls(row[0], row[1], row[2] or "-9", row[3] or "-9", row[4], row[5],
IndexError: list index out of range

handle 'chr' prefix

currently, if the query vcf has 'chr' prefixes, then none of the variants will be found. I think the best solution is a --strip-chr flag that will remove the 'chr' prefix if it is found.

Could also check the sites and see that they don't have 'chr' and then strip 'chr' if it is found in the query.

IndexError: list index out of range

HYE THERE
I am using peddy to calculate PCA with command
-m peddy -p 4 --plot --prefix 69_out PCA_data/69_without_N_imputated.vcf.gz PCA_data/69_without_N_imputed.plk.ped

after some initial start it is giving the following error.
IndexError: list index out of range
After reading a previous discussion I also modified the last row of the ped file, but still error is prevailing.

built installed from anaconda and git hub both are giving the same error.
Kindly help me sorting it out

No Hets found on numerous samples

Hi,

I get an error peddy: no hets found for sample, across numerous samples, however when looking in the VCF there appears to be het sites for these samples.

Data:
WGS, passing sites only, called individually and merged using vcftools. Pedigree file and VCF look fine. Same hg19 reference used throughout.

Log:

2018-09-17 10:32:39 job-FKPg89Q0KK87P2F23fF3f42K.dnanex.us peddy.cli[23825] INFO Running Peddy version 0.4.2
2018-09-17 10:32:40 job-FKPg89Q0KK87P2F23fF3f42K.dnanex.us peddy.cli[23825] INFO ped_check
#...numerous no hets found errors
2018-09-17 10:32:40 job-FKPg89Q0KK87P2F23fF3f42K.dnanex.us peddy.peddy[23825] INFO plotting
2018-09-17 10:32:41 job-FKPg89Q0KK87P2F23fF3f42K.dnanex.us peddy.cli[23825] INFO ran in 1.0 seconds
2018-09-17 10:32:41 job-FKPg89Q0KK87P2F23fF3f42K.dnanex.us peddy.cli[23825] INFO het_check
Traceback (most recent call last):
  File "/home/dnanexus/anaconda/bin/peddy", line 11, in <module>
    load_entry_point('peddy==0.4.2', 'console_scripts', 'peddy')()
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 207, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 43, in run
    prefix=prefix, **kwargs)
  File "/home/dnanexus/anaconda/lib/python2.7/site-packages/peddy/peddy.py", line 866, in het_check
    sites, min_depth=min_depth)
  File "cyvcf2/cyvcf2.pyx", line 77, in cyvcf2.cyvcf2.par_het (cyvcf2/cyvcf2.c:5531)
  File "cyvcf2/cyvcf2.pyx", line 741, in cyvcf2.cyvcf2.VCF._finish_het (cyvcf2/cyvcf2.c:18532)
IndexError: index 0 is out of bounds for axis 0 with size 0

Understandably, I think the error comes from not being able to plot the het figures, if it can't find any hets. Is there anything I could try? The calls were generated by Dragen, but was passed through VCFtools, so I think the VCF schema is intact, I just wonder why it's not picking up the hets. Could it be because there's missing data in some samples from the merge?

1000 genomes PCA data

Hi there,

@leipzig requested that I add a module to parse Peddy data for a tool I've written called MultiQC (see multiqc.info and ewels/MultiQC).

Quick question - does the PCA data shown in the background of the PCA Projection onto 1000 Genomes vary between datasets? See MultiQC/MultiQC#355 for the discussion - basically as these co-ordinates aren't exported by Peddy I've just taken a snapshot of that data, and I want to check that this is valid.

Many thanks,

Phil

Half-siblings are treated as full siblings

I noticed that in a pedigree I analyzed where two individuals are half-siblings (same mother), they were classified as sib-sib (0.50), but they cluster with the individuals who are uncle/aunt-niece (0.25). I think half-siblings can be classified as 0.25 so they will cluster properly. See AI001905 and his half-siblings AI001903 and AI001904 in the attached files.

Thanks.

X server dependency error

Is it possible to run peddy without the X server dependency?

I get this error when running:

$python -m peddy --version --procs 4 --plot --prefix peddy/118 in.vcf.gz 118.fam
$: cannot connect to X server

support VCFs with only genotypes

interactive plots

see: https://plot.ly/python/line-and-scatter/#Data-Labels-on-the-Plot

Peddy version in output

Does peddy output the version in the output somewhere? Like log file or stderr?

I need to log which version samples has been processed with.

I am currently using 0.2.9.

output kinship file?

Any way to get peddy to output its kinship estimates to a file? (e.g. king.kin/kin0 equivalent files) I'm trying to investigate whether peddy replicates some maybe 2nd-cousin-level relatedness and it's hard to eyeball with the plot alone.

Explicitly enter duplicates in pedigree.

What happens if we have intentional proband duplicates? Is there a way to specify relatedness=1, or duplicates in the pedigree files?

QC tutorial

sex qc and ped qc are now implemented. Should make a nice tutorial on how to use them.

ped file "missing phenotype"

We are successfully using Peddy. However, when using -9 or 0 in the phenotype column in the ped file, we always get "affected" instead of "missing" in the table within the html file after running Peddy. When using 1 and 2, we correctly see "not affected" and "affected" in the table. Are we doing something wrong?

Thanks in advance!

report 1kg sample names in json output.

add jitter to sex-check plot for html

sex UNKNOWN

Hi I am trying to run peddy on a vcf of ~1500 individuals genotype with the mega EX array. The ancestry analyses seem to be okay. However, I am not getting predicted sexes for some reason. The STDOUT for the sex check is:
peddy.cli[15732] INFO sex_check
warning: not all samples in PED found in VCF
peddy.peddy[15732] INFO sex-check: 1876 skipped / 23858 kept

From what I can see, the not all samples warnings just has to do with the header line of the ped.

The head of the sex.check file is:
error,het_count,het_ratio,hom_alt_count,hom_ref_count,ped_sex,predicted_sex,samp
True,0,-0.06,0,0,male,UNKNOWN,202066010233_R01C01
True,0,-0.06,0,0,female,UNKNOWN,202066010233_R02C01
True,0,-0.06,0,0,female,UNKNOWN,202066010233_R03C01
True,0,-0.06,0,0,male,UNKNOWN,202066010233_R05C01
True,0,-0.06,0,0,female,UNKNOWN,202066010233_R06C01
True,0,-0.06,0,0,female,UNKNOWN,202066010233_R07C01

size points by number of good calls

In the het plot and sex-check, the points should be sized by the number of sites used to make the final data-point in the plot.

question about PCA plot

Sometimes the pc2 flipped in the plot (top plot). Since the data are projected to the same reference panel, should the backgroud PC2 vs. PC1 pattern in the plot always be the same?

check Y chromosome

might help resolve samples that are ambiguous from just looking at X.

Perform only ancestry-determination for sinlge sample

Hi,

I tested peddy and it was basically the only tool for ancestry-determination that works out-of-the-box.
Great work!

However I am only interested in ancestry determination and nothing else.
Is is possible to skip the rest of the checks?

We are doing diagnostics and typically process single samples not families - as do most diagnostics labs.
Is it possible to hand over only a VCF.GZ file and skip the PED file somehow?

Best,
Marc

ped_check error -> ZeroDivisionError: float division

Hi,

I am running peddy for the first time. I have managed to run peddy on the toy example of your repository but, when I apply it to my data, I have the following error:

[cruiz@isgws05 300]$  python -m peddy --plot --prefix cancer plink.vcf.gz ped.fam
2017-12-13 17:35:54 isgws05.isglobal.lan peddy.cli[64409] INFO Running Peddy version 0.3.1
2017-12-13 17:35:54 isgws05.isglobal.lan peddy.cli[64409] INFO ped_check
Traceback (most recent call last):
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/peddy/__main__.py", line 14, in <module>
    sys.exit(cli())
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 204, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 43, in run
    prefix=prefix, **kwargs)
  File "/home/isglobal.lan/cruiz/anaconda/lib/python2.7/site-packages/peddy/peddy.py", line 954, in ped_check
    min_depth=min_depth, each=each)
  File "cyvcf2/cyvcf2.pyx", line 52, in cyvcf2.cyvcf2.par_relatedness (cyvcf2/cyvcf2.c:4380)
  File "cyvcf2/cyvcf2.pyx", line 826, in cyvcf2.cyvcf2.VCF._relatedness_finish (cyvcf2/cyvcf2.c:20672)
ZeroDivisionError: float division

My dataset is a case-control study of prostate cancer so all individuals are males. Could this fact cause a problem?

Thank you
Carlos

--Loglevel setting doesn't seem to work

I tried to setup log level to error, however it is still logging out the INFO.

/python -m peddy -p 4 --plot --prefix /Desktop/ceph-1463 --loglevel CRITICAL /Desktop/peddy-master/data/ceph1463.peddy.vcf.gz /Desktop/peddy-master/data/ceph1463.ped
2018-08-01 11:53:59 local peddy.cli[1537] INFO Running Peddy version 0.4.2
2018-08-01 11:53:59 local peddy.cli[1537] INFO ped_check
2018-08-01 11:54:02 local peddy.peddy[1537] INFO plotting
2018-08-01 11:54:03 local peddy.cli[1537] INFO ran in 3.3 seconds
2018-08-01 11:54:03 local peddy.cli[1537] INFO het_check
2018-08-01 11:54:05 local peddy.pca[1537] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 2724)) in 0.8 seconds
2018-08-01 11:54:06 local peddy.pca[1537] INFO ran randomized PCA on thousand-genomes samples at 2724 sites in 0.3 seconds
2018-08-01 11:54:06 local peddy.pca[1537] INFO Projected thousand-genomes genotypes and sample genotypes and predicted ancestry via SVM in 0.0 seconds
2018-08-01 11:54:07 local peddy.cli[1537] INFO ran in 4.2 seconds
2018-08-01 11:54:07 local peddy.cli[1537] INFO sex_check
2018-08-01 11:54:07 local peddy.peddy[1537] INFO sex-check: 0 skipped / 814 kept
2018-08-01 11:54:07 local peddy.cli[1537] INFO ran in 0.4 seconds

Support for specific families.

It would be nice to be able to do the following:

from peddy import Ped, SEX, PHENOTYPE

p = Ped('b.ped')

# iterable
f = p.families()

# get a specific family by family_id
shmoes = p.families('shmoes')

# do neat stuff with a family.
kids = shmoes.kids

# it could get complictaed with multiple generations.  specify a generation
granddad = shmoes.dad(generation=1)

PCA

can easily to PCA on the 5000 markers that we are collecting. would be another good metric to add.

KeyError in PCA.py

Hi Brent,

I cloned the peddy package yesterday and have got this error:

Traceback (most recent call last):
File "/home/torme/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/home/torme/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/torme/Desktop/TOOLS/peddy/peddy/main.py", line 14, in
sys.exit(cli())
File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 716, in call
return self.main(*args, **kwargs)
File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "peddy/cli.py", line 207, in peddy
in ("ped_check", "het_check", "sex_check")]):
File "peddy/cli.py", line 43, in run
prefix=prefix, **kwargs)
File "peddy/peddy.py", line 878, in het_check
pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
File "peddy/pca.py", line 46, in pca
idxs = np.array([kgsites_index[s] for s in sites])
KeyError: u'1:900505:G:C'

I get output files ped_check and ped_check_rel-difference, but no files for sex or PCA.

I get an error at position 1:900505 - I'm not sure how peddy runs in terms of order of variants in the file, but this is not the first variant position in my VCF. (However perhaps for speed the program does not run through chronologically so not sure this is relevant anyway.) This position looks fine in my VCF (is present, has the same ref and alt alleles as in 1000Genomes data). Is there anything obvious that would cause this error?

Sorry if this is an obvious oversight on my part... The output that I did get looks beautiful, so thank you!

Many thanks,
Tatiana

Problem with pandas and python 3

This is not really a peddy problem, has anyone else experienced this?

I get:

/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated since IPython 4.0. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
Traceback (most recent call last):
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/mansmagnusson/Projects/peddy/peddy/__main__.py", line 15, in <module>
    sys.exit(cli())
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/mansmagnusson/Projects/peddy/peddy/cli.py", line 202, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "/Users/mansmagnusson/Projects/peddy/peddy/cli.py", line 42, in run
    prefix=prefix, **kwargs)
  File "/Users/mansmagnusson/Projects/peddy/peddy/peddy.py", line 1037, in ped_check
    sub = df.eval('((rel > 0.1) & (pedigree_relatedness < 0.05)) | ((rel < 0.05) & (pedigree_relatedness > 0.1)) | (rel_difference > 0.1) | (rel_difference < -0.1)')
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/frame.py", line 2284, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/eval.py", line 262, in eval
    truediv=truediv)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 727, in __init__
    self.terms = self.parse()
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 744, in parse
    return self._visitor.visit(self.expr)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 313, in visit
    return visitor(node, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 319, in visit_Module
    return self.visit(expr, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 313, in visit
    return visitor(node, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 322, in visit_Expr
    return self.visit(node.value, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 313, in visit
    return visitor(node, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 671, in visit_BoolOp
    return reduce(visitor, operands)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 664, in visitor
    rhs = self._try_visit_binop(y)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 659, in _try_visit_binop
    return self.visit(bop)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 313, in visit
    return visitor(node, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 644, in visit_Compare
    return self.visit(binop)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 313, in visit
    return visitor(node, **kwargs)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 417, in visit_BinOp
    left, right = self._maybe_downcast_constants(left, right)
  File "/Users/mansmagnusson/miniconda3/envs/peddy/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 369, in _maybe_downcast_constants
    name = self.env.add_tmp(np.float32(right.value))
AttributeError: 'UnaryOp' object has no attribute 'value'

Which I guess is the same as pandas-dev/pandas#11235, but that issue should have been solved in an older version of pandas...
I have pandas (0.20.1) installed.

Why isn't this project called pedagree?

Because obviously.

missing samples

when parents are not specified in the ped (or actually even if it is in the ped and not in the VCF) then siblings will show a ped error because they are more related than expected.
We should check if they share the same parent and, if they do, then just issue a notice, if not then suggest that a parent could be added to the ped.

Input PED file; replace sep with whitespace

Hi Brent,

I was having trouble running peddy for my analysis and dug into the problem. It looks like in cli.py, line 176, you have sep='\t' set for the input DataFrame coming from the PED file. When going through your documentation, looking at the link for the PED file format links me to PLINK. PLINK, as of 1.9, outputs its PED files with spaces in between variables, not tabs.

Locally, I've replaced the above with sep=r"\s+" without issue.

Thanks,
John

pip install fails to find `version` string

-bash-4.1$ pip3 install peddy
You are using pip version 7.0.3, however version 7.1.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting peddy
  Downloading peddy-0.0.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/tmp/pip-build-z_azrrhl/peddy/setup.py", line 20, in <module>
        setup(version=get_version(),
      File "/tmp/pip-build-z_azrrhl/peddy/setup.py", line 9, in get_version
        with open(os.path.join("peddy", "__init__.py"), "r") as init_file:
    FileNotFoundError: [Errno 2] No such file or directory: 'peddy/__init__.py'

It looks like this method of parsing the version string is not working when installing the PyPI version. Installing from a tarball worked fine for me.

ouput contamination estimate

based on width of allele balance histogram (or IQR of allele balances).

allow lasso selection of points to get unrelated (or whichever) samples

the selection will update the table

brentp / peddy Goto Github PK

peddy's Introduction

Fast Pedigree::VCF QC

Quickstart

hg38 or custom sites

Speed

Validation

Warnings and Checks

Installation

Conda

Github

peddy's People

Contributors

Stargazers

Watchers

Forkers

peddy's Issues

Recommend Projects

Recommend Topics

Recommend Org