omerwe / s-pcgc Goto Github PK
View Code? Open in Web Editor NEWHeritability, genetic correlation and functional enrichment estimation for case-control studies
License: GNU General Public License v3.0
Heritability, genetic correlation and functional enrichment estimation for case-control studies
License: GNU General Public License v3.0
Hey There,
Thanks for your reading. I'm using this package and get an error:
[WARNING] 8634629 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
100%|???????????????????????????????????????????????????????????????????????| 22/22 [16:14<00:00, 44.29s/it]
/S-PCGC/pcgc_main.py:397: DeprecationWarning: np.object
is a deprecated alias for the builtin object
. To silence this warning, use object
by itself.
Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
gencov_arr = np.empty((len(pcgc_data_list), len(pcgc_data_list)), dtype=np.object)
Traceback (most recent call last):
File "/S-PCGC/pcgc_main.py", line 857, in
pcgc_obj = SPCGC(args)
File "/S-PCGC/pcgc_main.py", line 402, in init
cov_ij = self.create_cov_obj(args, oi, oj,
File "/pcgc_main.py", line 628, in create_cov_obj
self.compute_taus(args, oi, oj,
File "/S-PCGC/pcgc_main.py", line 753, in compute_taus
z1_anno = df_annotations_sumstats_noneg.values * sumstats1[:, np.newaxis] * np.sqrt(trace_ratios1)
ValueError: operands could not be broadcast together with shapes (8634629,97) (8636723,1)
And with the same data, same codes we get a different error message when performed by another person:
[WARNING] 8636723 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
[INFO] reading annot files...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [21:54<00:00, 59.77s/it]
Traceback (most recent call last):
File "pcgc_main.py", line 857, in
pcgc_obj = SPCGC(args)
File "pcgc_main.py", line 394, in init
self.load_annotations_data(args, df_prodr2, index_intersect)
File "pcgc_main.py", line 488, in load_annotations_data
is_same = (df.index == index_intersect).all()
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 123, in cmp_method
raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare
Do you have any idea how this error comes and how to solve it? Thanks a lot and looking forward to your reply :))
Dear Omer,
I'm trying to implement s-PCGC and running the example with 1000G data. I'm able to run all steps with the toy example. I'm also able to tun pcgc_sync.py.
I run into an error with pcgc_r2.py:
[INFO] reading annot file...
/hpc/hers_en/mbakker2/tools/S-PCGC/pcgc_utils.py:56: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_chr = pd.read_table(fname, delim_whitespace=True, index_col=index_col, header=header, usecols=usecols)
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:29: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_sync = pd.read_table(args.sync+'sync', index_col='Category')
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:42: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_extract = pd.read_table(args.extract, squeeze=True, header=None)
[INFO] Extracting 60 SNPs
[INFO] Computing r^2 products for 60 SNPs
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:63: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_fam = pd.read_table(args.bfile+'.fam', header=None)
[INFO] Loading SNP file...
After filtering, 60 SNPs remain
Traceback (most recent call last):
File "/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py", line 115, in
df_prod_r2 = compute_r2_prod(args)
File "/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py", line 73, in compute_r2_prod
df_annotations = df_annotations.iloc[geno_array.kept_snps]
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 2221, in _getitem_axis
return self._get_list_axis(key, axis=axis)
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 2203, in _get_list_axis
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
I used:
chr=22
python $PCGC/pcgc_r2.py \
--annot $baseline/baselineLD.${chr}. \
--sync $baseline/baselineLD. \
--bfile $refDir/1000G_EUR_Phase3_plink/1000G.EUR.QC.${chr} \
--extract $PCGC/example/good_snps.txt \
--out $baseline/baselineLD.goodSNPs.${chr}
When I leave out --extract $PCGC/example/good_snps.txt
PCGC runs fine and I get the expected output. But for all SNPs and not the 'good SNPs'
I tried to subset good_snps.txt
to only the chr22 SNPs, but the same error occurs.
I am using conda to create an environment to run PCGC. This is created using
conda create -n pcgc python=2.7 numpy pandas scikit-learn scipy tqdm bitarray
And conda list -n pcgc
gives:
# packages in environment at /hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc:
#
# Name Version Build Channel
atomicwrites 1.3.0 py_0 conda-forge
attrs 19.1.0 py_0 conda-forge
backports_abc 0.5 py_1 conda-forge
bitarray 0.8.3 py27h14c3975_0
blas 1.0 mkl
bokeh 1.0.4 py27_1000 conda-forge
ca-certificates 2019.3.9 hecc5488_0 conda-forge
certifi 2019.3.9 py27_0 conda-forge
cffi 1.12.2 py27hf0e25f4_1 conda-forge
click 7.0 py_0 conda-forge
cloudpickle 0.8.0 py_0 conda-forge
cytoolz 0.9.0.1 py27h14c3975_1001 conda-forge
dask 1.1.4 py_0 conda-forge
dask-core 1.1.4 py_0 conda-forge
distributed 1.26.0 py27_1 conda-forge
freetype 2.9.1 h94bbf69_1005 conda-forge
funcsigs 1.0.2 py_3 conda-forge
futures 3.2.0 py27_1000 conda-forge
heapdict 1.0.0 py27_1000 conda-forge
intel-openmp 2019.1 144
jinja2 2.10 py_1 conda-forge
jpeg 9c h14c3975_1001 conda-forge
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.36 h84994c4_1000 conda-forge
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h648cc4a_1001 conda-forge
locket 0.2.0 py_2 conda-forge
markupsafe 1.1.1 py27h14c3975_0 conda-forge
mkl 2019.1 144
mkl_fft 1.0.10 py27ha843d7b_0
mkl_random 1.0.2 py27hd81dba3_0
more-itertools 4.3.0 py27_1000 conda-forge
msgpack-python 0.6.1 py27h6bb024c_0 conda-forge
ncurses 6.1 he6710b0_1
numpy 1.16.2 py27h7e9f1db_0
numpy-base 1.16.2 py27hde5b4d6_0
olefile 0.46 py_0 conda-forge
openssl 1.1.1b h14c3975_1 conda-forge
packaging 19.0 py_0 conda-forge
pandas 0.24.1 py27he6710b0_0
pandas-plink 1.2.30 py27h14c3975_0 conda-forge
partd 0.3.9 py_0 conda-forge
pathlib2 2.3.3 py27_1000 conda-forge
pillow 5.3.0 py27h00a061d_1000 conda-forge
pip 19.0.3 py27_0
pluggy 0.9.0 py_0 conda-forge
psutil 5.6.1 py27h14c3975_0 conda-forge
py 1.8.0 py_0 conda-forge
pycparser 2.19 py27_1 conda-forge
pyparsing 2.3.1 py_0 conda-forge
pytest 4.3.1 py27_0 conda-forge
pytest-runner 4.4 py_0 conda-forge
python 2.7.15 h9bab390_6
python-dateutil 2.8.0 py27_0
pytz 2018.9 py27_0
pyyaml 5.1 py27h14c3975_0 conda-forge
readline 7.0 h7b6447c_5
scandir 1.10.0 py27h14c3975_0 conda-forge
scikit-learn 0.20.3 py27hd81dba3_0
scipy 1.2.1 py27h7c811a0_0
setuptools 40.8.0 py27_0
singledispatch 3.4.0.3 py27_1000 conda-forge
six 1.12.0 py27_0
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.27.2 h7b6447c_0
tblib 1.3.2 py_1 conda-forge
tk 8.6.8 hbc83047_0
toolz 0.9.0 py_1 conda-forge
tornado 5.1.1 py27h14c3975_1000 conda-forge
tqdm 4.31.1 py_0
wheel 0.33.1 py27_0
xz 5.2.4 h14c3975_1001 conda-forge
yaml 0.1.7 h14c3975_1001 conda-forge
zict 0.1.4 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
I am not sure how to handle this error:
conda 4.3.30
conda env create PCGC
conda create --name PCGC
source activate PCGC
conda install numpy
conda install scipy
conda install scikit-learn
conda install pandas
conda install pandas-plink
conda install -c conda-forge pandas-plink
conda install tqdm
git clone https://github.com/omerwe/S-PCGC
python test_sumstats.py
temporary directory: /tmp/2qujmo39
Creating summary statistics for study s3...
Command python pcgc_sumstats_creator.py --bfile example/s3 --covar example/s3.cov --prev 0.01 --pheno example/s3.phe --frqfile /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model.1. --annot /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model.1. --out /tmp/2qujmo39/s3.1 --sync /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model. returned 1 with the following stdout:
b'pcgc_sumstats_creator.py:388: DeprecationWarning: np.int
is a deprecated alias for the builtin int
. To silence this warning, use int
by itself. Doing this will not modify any behavior and is safe. When replacing np.int
, you may wish to use e.g. np.int64
or np.int32
to specify the precision. If you wish to review your current use, check the release note link for additional information.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n df_pheno.iloc[:,-1] = df_pheno.iloc[:,-1].astype(np.int)\npcgc_sumstats_creator.py:50: DeprecationWarning: np.float
is a deprecated alias for the builtin float
. To silence this warning, use float
by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64
here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n df_covar[c] = df_covar[c].astype(np.float)\n[INFO] reading frq file...\n[INFO] Reading plink file from disk (this may take a while...)\n*********************************************************************\n* S-PCGC sumstats creator\n* Version 2.0.0\n* (C) 2018 Omer Weissbrod\n*********************************************************************\n\n\rMapping files: 0%| | 0/3 [00:00<?, ?it/s]\rMapping files: 100%|\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88| 3/3 [00:00<00:00, 81.52it/s]\n[INFO] 1500 SNPs remained after removing SNPs without MAFs\npcgc_sumstats_creator.py:373: DeprecationWarning: np.bool
is a deprecated alias for the builtin bool
. To silence this warning, use bool
by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_
here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n is_relevant_col = np.zeros(df.shape[1], dtype=np.bool)\npcgc_sumstats_creator.py:373: DeprecationWarning: np.bool
is a deprecated alias for the builtin bool
. To silence this warning, use bool
by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_
here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n is_relevant_col = np.zeros(df.shape[1], dtype=np.bool)\npcgc_sumstats_creator.py:373: DeprecationWarning: np.bool
is a deprecated alias for the builtin bool
. To silence this warning, use bool
by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_
here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n is_relevant_col = np.zeros(df.shape[1], dtype=np.bool)\nTraceback (most recent call last):\n File "pcgc_sumstats_creator.py", line 592, in \n sumstats_creator = PCGC_Sumstats(args)\n File "pcgc_sumstats_creator.py", line 75, in init\n self.bfile, df_pheno, df_maf, self.num_snps, self.sample_size = self.read_plink(args, df_pheno, df_maf)\n File "pcgc_sumstats_creator.py", line 486, in read_plink\n is_consistent_snp = (df_maf[allele1_col] == df_bim['a1']) & (df_maf[allele0_col] == df_bim['a0'])\n File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/ops/common.py", line 70, in new_method\n return method(self, other)\n File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/arraylike.py", line 40, in eq\n return self._cmp_method(other, operator.eq)\n File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/series.py", line 5617, in _cmp_method\n raise ValueError("Can only compare identically-labeled Series objects")\nValueError: Can only compare identically-labeled Series objects\n'
Traceback (most recent call last):
File "test_sumstats.py", line 106, in
test_sumstats(temp_dir)
File "test_sumstats.py", line 58, in test_sumstats
run_command(ss_command)
File "test_sumstats.py", line 20, in run_command
raise IOError()
OSError
I apologize for my obtusity, but I Can't figure out from the manual how to actually analyze my data.
I have, let's say, two summary statistics. I do not have access to the subject left genotypes or phenotypes. Just the summary statistics (and they were done with Regenie too).
I can build an LD panel from 1000 G (I am a bit confused regarding whether I need to select good SNPs since the summary stats is already seriously filtered) but how to I turn a summary statistics from Regenie into the one that would work for you?
Would it be correct to say that all I need is to preprocess it with LDSC tools? specifically pass them through munge_sumstats.py and then into S-PCGA?
Sincerely
When a SNP with different alleles in the plink and MAF files are present they are successfully removed via:
is_consistent_snp = (df_maf[allele1_col] == df_bim['a1']) & (df_maf[allele0_col] == df_bim['a0'])
is_consistent_snp = is_consistent_snp | (df_maf[allele1_col] == df_bim['a0']) & (df_maf[allele0_col] == df_bim['a1'])
if not np.all(is_consistent_snp):
logging.info('%d SNPs will be removed because they have different alleles in plink and MAF files'%(np.sum(~is_consistent_snp)))
df_maf = df_maf.loc[is_consistent_snp]
df_bim = df_bim.loc[is_consistent_snp]
However, this produces an error later in the code due to a size mismatch.
0: Traceback (most recent call last):
0: File "path/S-PCGC/pcgc_sumstats_creator.py", line 593, in <module>
0: sumstats_creator.compute_all_sumstats(args.chunk_size)
0: File "path/S-PCGC/pcgc_sumstats_creator.py", line 271, in compute_all_sumstats
0: self.set_locus(snp1, snp2)
0: File "path/S-PCGC/pcgc_sumstats_creator.py", line 318, in set_locus
0: snp_maf = self.mafs[snp1+j]
0: File "path/lib/python3.8/site-packages/pandas/core/series.py", line 879, in __getitem__
0: return self._values[key]
0: IndexError: index 44269 is out of bounds for axis 0 with size 44269
If I force removal of additional SNPs, the index at which the error occurs reduces by a corresponding interval which leads me to believe this is what causes the problem. E.g. I forced removal of 2 additional SNPs the error becomes.
IndexError: index 44267 is out of bounds for axis 0 with size 44267
I am able to get around this by identifying the mismatched SNPs and removing them from my good_snps.txt
for now
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.