hringbauer / ancibd
Detecting IBD within low-coverage ancient DNA data. Development repository for the software package; contains code for the manuscript.
License: GNU General Public License v3.0
Hi, I am loading VCF files from GLIMPSE into ancIBD, but this problem occurs. When I try the tutorials using the test_data you provided, I hit the same problem.
It seems to be a Python decoding problem; however, writing
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
into sitecustomize.py under the Python lib/site-packages/ path does not solve the error. Do you have any suggestions? Thanks a lot!
Hi again.
This issue can be merged with #5.
I've used GLIMPSE to impute two samples, resulting in the following VCF file:
(base) batel:~/glimpse/Levant_BA$ bcftools view merged.vcf.gz | head -20
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=01/08/2023 - 11:43:29
##source=GLIMPSE_phase v2.0.0
##contig=<ID=chr22>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="IMPUTE info quality score for diploid samples">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=15
##FPLOIDY=2
##bcftools_mergeVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_mergeCommand=merge -l GLIMPSE_ligate/samples_merge.txt -o merged.vcf.gz; Date=Tue Aug 1 13:03:00 2023
##bcftools_viewVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_viewCommand=view merged.vcf.gz; Date=Tue Aug 1 13:04:29 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S10769.SN S10770.SN
chr22 10519276 22:10519276:G:C G C . . RAF=0.0377889;AF=0.0572749;INFO=1 GT:DS:GP 0|0:0.106:0.896,0.101,0.003 0|0:0.115:0.888,0.108,0.004
chr22 10519389 22:10519389:T:C T C . . RAF=0.0376327;AF=0.0567751;INFO=1 GT:DS:GP 0|0:0.104:0.898,0.098,0.004 0|0:0.114:0.889,0.107,0.004
When trying to convert to HDF5 using
from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf
ch = 22
#base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
map_path = f"./data/v51.1_1240k.snp",
af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
col_sample_af = "",
buffer_size=20000, chunk_width=8, chunk_length=20000,
ch=ch)
I get the following error:
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_7983/2495295663.py in <module>
2 ch = 22
3 #base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
----> 4 vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
5 in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
6 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
125 print("Merging in LD Map..")
126 if len(map_path)>0:
--> 127 merge_in_ld_map(path_h5=path_h5,
128 path_snp1240k=map_path,
129 chs=[ch])
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/h5_modify.py in merge_in_ld_map(path_h5, path_snp1240k, chs, write_mode)
115 chs: Which Chromosomes to merge in HDF5 [list].
116 write_mode: Which mode to use on hdf5. a: New field. r+: Change Field"""
--> 117 with h5py.File(path_h5, "r") as f:
118 print("Lifting LD Map from eigenstrat to HDF5...")
119 print("Loaded %i variants." % np.shape(f["calldata/GT"])[0])
~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, **kwds)
531 fs_persist=fs_persist, fs_threshold=fs_threshold,
532 fs_page_size=fs_page_size)
--> 533 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
534
535 if isinstance(libver, tuple):
~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
224 if swmr and swmr_support:
225 flags |= h5f.ACC_SWMR_READ
--> 226 fid = h5f.open(name, flags, fapl=fapl)
227 elif mode == 'r+':
228 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './data/hdf5/example_hazelton_chr22.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
The directory does exist, as I get no error when running:
os.listdir("./data/hdf5/")
and I've also tried writing the absolute path.
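A quick hedged sanity check (paths copied from the call above) can narrow down which step actually wrote its output:
import os

# Check each file the conversion pipeline should have produced.
for p in ["./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",        # input VCF
          "./data/vcf.1240k/example_hazelton_chr22.vcf.gz",  # bcftools-filtered VCF
          "./data/hdf5/example_hazelton_chr22.h5"]:          # converted HDF5
    print(p, os.path.getsize(p) if os.path.exists(p) else "MISSING")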
Moreover, when running the example you provided -- just switching the in_vcf -- it runs perfectly fine.
So it must be something in the VCF file itself.
Any ideas?
Batel
I'm trying to install ancIBD on my institute's cluster, but it fails when trying to build a wheel.
I've tried running pip install wheel, and also upgraded it to the latest version.
(ancIBD) batelziv@moriah-gw-02:~/softwares/ancIBD$ python3 -m pip install ancIBD
Collecting ancIBD
Using cached ancIBD-0.5.tar.gz (201 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in ./lib/python3.7/site-packages (from ancIBD) (1.21.6)
Requirement already satisfied: pandas in ./lib/python3.7/site-packages (from ancIBD) (1.3.5)
Collecting scipy (from ancIBD)
Using cached scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Requirement already satisfied: h5py in ./lib/python3.7/site-packages (from ancIBD) (3.8.0)
Requirement already satisfied: psutil in ./lib/python3.7/site-packages (from ancIBD) (5.9.5)
Requirement already satisfied: cython in ./lib/python3.7/site-packages (from ancIBD) (3.0.0)
Requirement already satisfied: matplotlib in ./lib/python3.7/site-packages (from ancIBD) (3.5.3)
Collecting pysam (from ancIBD)
Using cached pysam-0.21.0-cp37-cp37m-manylinux_2_24_x86_64.whl (18.1 MB)
Collecting scikit-allel (from ancIBD)
Using cached scikit-allel-1.3.6.tar.gz (9.7 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [36 lines of output]
[scikit-allel] setup extensions with cython
Compiling allel/opt/model.pyx because it changed.
Compiling allel/opt/stats.pyx because it changed.
Compiling allel/opt/io_vcf_read.pyx because it changed.
[1/3] Cythonizing allel/opt/io_vcf_read.pyx
[2/3] Cythonizing allel/opt/model.pyx
[3/3] Cythonizing allel/opt/stats.pyx
Traceback (most recent call last):
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 100, in <module>
File "<string>", line 96, in setup_package
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 159, in setup
dist.parse_config_files()
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 898, in parse_config_files
pyprojecttoml.apply_configuration(self, filename, ignore_option_errors)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 66, in apply_configuration
config = read_configuration(filepath, True, ignore_option_errors, dist)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 128, in read_configuration
validate(subset, filepath)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 55, in validate
raise ValueError(f"{error}\n{summary}") from None
ValueError: invalid pyproject.toml config: `project`.
configuration error: `project` must contain ['name'] properties
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Any ideas?
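One possible workaround, assuming a prebuilt scikit-allel wheel exists for your Python version (untested here): force pip to skip the failing sdist build by requiring a binary wheel, then install ancIBD:
python3 -m pip install --only-binary=:all: scikit-allel
python3 -m pip install ancIBD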
For ARM users, please use this workaround to simulate x86:
conda create -n py37
conda activate py37
conda config --env --set subdir osx-64
conda install python=3.7
In /docs, do:
pip install -r requirements.txt
pip install Cython
pip install ancIBD
Hi, I've been trying to run ancIBD on data imputed with GLIMPSE and the 1000G panel. I have the full VCF per chromosome, and I could run the transformation and filtering to the 1240k SNPs smoothly, calculating the AF directly with this command:
for ch in range(1,23):
vcf_to_1240K_hdf(in_vcf_path = f"./data.chr{ch}.vcf.gz",
path_vcf = f"./ancIB_data/data.1240.chr{ch}.vcf.gz",
path_h5 = f"./ancIB_data/data.1240.chr{ch}.h5",
marker_path = f"./ancIBA_filters/snps_bcftools_ch{ch}.csv",
map_path = f"./ancIB_data/v51.1_1240k.snp",
#af_path = f"./ancIB_afs/v51.1_1240k_AF_ch{ch}.tsv",
col_sample_af = "AF_ALL",
buffer_size=20000, chunk_width=8, chunk_length=20000,
ch=ch)
But when I then run hapBLOCK_chroms with the next command, it fails to detect any IBD in any pair among my 90 samples...
for ch in range(1,23):
df_ibd = hapBLOCK_chroms(folder_in='./ancIB_data/data.1240k.chr',
iids=iids, run_iids=[],
ch=ch, folder_out='./ancIB_data/ibd/',
output=False, prefix_out='', logfile=True,
l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
ibd_in=1, ibd_out=10, ibd_jump=400,
min_cm=6, cutoff_post=0.99, max_gap=0.0075,
processes=1)
When I run the code with the example data I obtain the exact same result as in the tutorial.
When I use plink or KING on my imputed dataset I do detect IBD (and I know for sure that some individuals are 1st-degree related).
Can you see where the problem is?
Here is an example of the filtered VCF produced by the first command:
21 10205629 rs140777501 A G 100 PASS NS=2504;AA=A|||;VT=SNP;DP=21926;RAF=0.166534;AF=0.166534;INFO=1;EAS_AF=0.2173;AMR_AF=0.1988;AFR_AF=0.1135;EUR_AF=0.2177;SAS_AF=0.1104;AN=54;AC=2 GT:DS:GP 0|0:0:1,0,0 ./.:.:. ./.:.:. 0|0:0:1,0,0 ./.:.:. 0|0:0:1,0,0 0|0:0.001:0.999,0.001,0 ./.:.:. ./.:.:. ./.:.:. ./.:
21 14601415 rs2775537 G A 100 PASS NS=2504;AA=g|||;VT=SNP;DP=17928;RAF=0.550519;AF=0.550519;INFO=1;EAS_AF=0.376;AMR_AF=0.4251;AFR_AF=0.8169;EUR_AF=0.4722;SAS_AF=0.5399;AN=30;AC=3 GT:DS:GP ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 1|1:1.998:0,0.001,0.999 ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:
21 14652908 rs3869758 T C 100 PASS NS=2504;AA=.|||;VT=SNP;DP=23683;RAF=0.0896565;AF=0.0896565;INFO=1;EAS_AF=0;AMR_AF=0.0648;AFR_AF=0.2814;EUR_AF=0.0239;SAS_AF=0.0082;AN=176;AC=0 GT:DS:GP 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0
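One thing the excerpt suggests is a high fraction of missing genotypes (./.:.:.) at the 1240k sites, which leaves little information for IBD calling. A diagnostic sketch to quantify missingness per sample (the HDF5 path follows the naming in the call above; this is not part of the package):
import h5py
import numpy as np

# Fraction of missing GT alleles per sample; scikit-allel encodes missing as -1.
with h5py.File("./ancIB_data/data.1240.chr21.h5", "r") as f:
    gt = f["calldata/GT"][:]                 # shape: (variants, samples, 2)
    samples = f["samples"][:].astype(str)
miss = np.mean(gt < 0, axis=(0, 2))          # per-sample missing fraction
for s, m in zip(samples, miss):
    print(f"{s}: {m:.1%} missing")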
Hi, recently I was running ancIBD-0.5 on the example data (downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0), using Python 3.8.3 with the same parameters as in the tutorial. All commands run smoothly without errors, but the results I got differ from the results shown in the tutorial, and I still have not figured out what caused this difference. I would really appreciate it if you could help!
I checked the output when preparing the hdf5 files, and it is the same as in the tutorial. But the results start to differ when calling the IBD, and it seems that I get much less IBD.
Chromosome 1; Loaded 18 IBD
Chromosome 2; Loaded 13 IBD
Chromosome 3; Loaded 11 IBD
Chromosome 4; Loaded 9 IBD
Chromosome 5; Loaded 9 IBD
Chromosome 6; Loaded 11 IBD
Chromosome 7; Loaded 12 IBD
Chromosome 8; Loaded 8 IBD
Chromosome 9; Loaded 10 IBD
Chromosome 10; Loaded 10 IBD
Chromosome 11; Loaded 10 IBD
Chromosome 12; Loaded 6 IBD
Chromosome 13; Loaded 12 IBD
Chromosome 14; Loaded 6 IBD
Chromosome 15; Loaded 6 IBD
Chromosome 16; Loaded 8 IBD
Chromosome 17; Loaded 9 IBD
Chromosome 18; Loaded 6 IBD
Chromosome 19; Loaded 11 IBD
Chromosome 20; Loaded 10 IBD
Chromosome 21; Loaded 9 IBD
Chromosome 22; Loaded 10 IBD
Saved 214 IBD to /home/han_shi/script/ancibd/output/ibd_hazelton2/ch_all.tsv.
When summarizing the IBD between pairs, some pairs yield IBD but differ in sum and number, while some pairs (I12438 & I12896, I12440 & I12896, ...), which yield IBD in the tutorial, did not yield any IBD here.
iid1 | iid2 | max_IBD | sum_IBD>8 | n_IBD>8 | sum_IBD>12 | n_IBD>12 | sum_IBD>16 | n_IBD>16 | sum_IBD>20 | n_IBD>20 |
---|---|---|---|---|---|---|---|---|---|---|
I12440 | I30300 | 268.7866928288713 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 |
I12438 | I30300 | 283.65220259875065 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 |
I12440 | I12438 | 176.73970285686664 | 1657.2198443172965 | 22.0 | 1647.3969392536674 | 21.0 | 1647.3969392536674 | 21.0 | 1647.3969392536674 | 21.0 |
I12439 | I12440 | 153.30770015716553 | 1626.9729871419258 | 21.0 | 1604.7398943570442 | 19.0 | 1604.7398943570442 | 19.0 | 1604.7398943570442 | 19.0 |
I12439 | I30300 | 98.25259447097778 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 |
I12439 | I12438 | 93.37389469146729 | 475.9611673653126 | 10.0 | 459.80747416615486 | 8.0 | 459.80747416615486 | 8.0 | 459.80747416615486 | 8.0 |
I21390 | I30300 | 13.226699829101562 | 13.226699829101562 | 1.0 | 13.226699829101562 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12896 | I30300 | 12.573707103729248 | 12.573707103729248 | 1.0 | 12.573707103729248 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12439 | I21390 | 11.11750602722168 | 11.11750602722168 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12439 | I12896 | 8.414804935455322 | 8.414804935455322 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12438 | I12896 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12438 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12440 | I12896 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12440 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12896 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Could you please give me some advice about what might have caused this problem? Many thanks!
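Since all commands complete without errors, a first thing worth comparing against the tutorial environment is the exact package versions; a minimal sketch using only the standard library:
from importlib.metadata import version

# Record the versions of the packages most likely to change numeric results.
for pkg in ["ancIBD", "scikit-allel", "h5py", "numpy", "scipy"]:
    print(pkg, version(pkg))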
Hi,
Is it on purpose that not all SNPs from the 1240K marker set are included in the filter data used to restrict imputed SNPs to the 1240K markers? At the hdf5 data import the imputed GTs are filtered by these coordinates, so you always "lose" the non-included markers even though you do have them in the imputed GTs.
I am aware that some SNPs on the 1240K chip have lower concordance with true shotgun WGS data. However, not all data come from the chip, so removing these markers from a purely WGS dataset should not be required. On the other hand, we are talking about imputed GTs anyway, where the other markers in LD were already used to infer the diploid phased haplotypes over a large genomic chunk based on gold-standard WGS reference data (given the random positions of the non-concordant SNPs, imputation should already fix their errors). Accordingly, if these filter files contain fewer markers in order to avoid "bad markers", that removal should happen prior to the imputation step for chip data, not at the IBD identification step. That way imputation should improve for chip data while true shotgun WGS is unaffected. Furthermore, that approach would not thin the markers at the IBD detection step for either WGS or chip data, while still allowing mixed datasets to be co-analyzed.
But again, I reserve the right to be dumb/ignorant, and it may very well be that I am unaware of some other valid reason to remove these markers. Could you please elaborate on why ~50k (~4.4%) of autosomal 1240K markers are excluded at filtering?
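For reference, the excluded fraction can be counted by intersecting the per-chromosome filter file with the 1240K .snp file; a sketch, with both file layouts assumed (EIGENSTRAT .snp columns; tab-separated CHROM/POS in the bcftools filter file):
import pandas as pd

ch = 22
# EIGENSTRAT .snp: ID, chromosome, genetic position (M), physical position, ref, alt
snp = pd.read_csv("./data/v51.1_1240k.snp", sep=r"\s+", header=None,
                  names=["snp", "ch", "morgans", "pos", "ref", "alt"])
snp_ch = snp[snp["ch"] == ch]
# Filter file as used with bcftools -T (assumed layout: CHROM<tab>POS)
filt = pd.read_csv(f"./data/filters/snps_bcftools_ch{ch}.csv",
                   sep="\t", header=None, names=["ch", "pos"])
excluded = ~snp_ch["pos"].isin(filt["pos"])
print(f"chr{ch}: {excluded.sum()} of {len(snp_ch)} 1240K markers not in the filter")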
Regards,
Zoltan
Hello!
In the bioRxiv preprint you write that you use data from the AADR, but there the data are in EIGENSTRAT format. Could you tell me how to convert them to VCF with a GL field? And how is the difference between capture and whole-genome data handled then? Or do you have BAM files for the 1240k panel?
Hi,
First of all, thank you for your detailed and easy to follow instructions in the notebook.
Regarding the vcf to HDF5 conversion function:
To process the data obtained from Dropbox, it is necessary to unzip the vcf.gz files prior to executing the code mentioned in the notebook.
Second, I get this error after unzipping with gunzip *.gz:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
This is the full error print:
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<timed exec> in <module>
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
114
115 print("Converting to HDF5...")
--> 116 allel.vcf_to_hdf5(input=path_vcf, output=path_h5,
117 fields = ['variants/*', 'calldata/*', "samples"],
118 types = {"samples":"S60", "calldata/GT":np.int8,
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
691
692 # setup chunk iterator
--> 693 fields, samples, headers, it = iter_vcf_chunks(
694 input, fields=fields, exclude_fields=exclude_fields, types=types,
695 numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
1136
1137 # setup iterator
-> 1138 fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
1139
1140 # setup transformers
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
1634
1635 # read VCF headers
-> 1636 headers = _read_vcf_headers(stream)
1637
1638 # setup samples
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _read_vcf_headers(stream)
1709 # read first header line
1710 header = stream.readline()
-> 1711 header = str(header, 'utf8')
1712
1713 while header and header[0] == '#':
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
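Byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the file being parsed is apparently still gzip-compressed even though its name no longer ends in .gz (scikit-allel typically decides decompression from the file extension). A small check, with the path as an example only:
# Check whether a supposedly plain VCF is actually still gzip-compressed.
path = "./data/vcf.1240k/example_hazelton_chr22.vcf"  # example path
with open(path, "rb") as f:
    magic = f.read(2)
print("still gzipped" if magic == b"\x1f\x8b" else "plain text", path)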
Hello,
I am attempting to convert the example VCF file from Dropbox to HDF5, but I am facing an error during the interpolation step. Do you have any insights on what might be causing this issue?
Thank you!
Print downsampling to 1240K...
Running bash command:
bcftools view -Ov -o test/example_hazelton.ch20.1240k.vcf -T ./data/filters/snps_bcftools_ch20.csv -M2 -v snps ./data/vcf.raw/example_hazelton_chr20.vcf
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28940 variants.
Loaded 6 individuals.
Loaded 0 Chr.20 1240K SNPs.
Intersection 0 out of 28940 HDF5 SNPs
Interpolating 28940 variants.
Traceback (most recent call last):
File "/data/users/dorozco/.conda/envs/gnomix/bin/ancIBD-run", line 8, in <module>
sys.exit(main())
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/run_ancIBD.py", line 63, in main
vcf_to_1240K_hdf(in_vcf_path = args.vcf,
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/prepare_h5.py", line 135, in vcf_to_1240K_hdf
merge_in_ld_map(path_h5=path_h5,
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/h5_modify.py", line 154, in merge_in_ld_map
rec_ch = np.interp(x1, x, y)
File "<__array_function__ internals>", line 180, in interp
File "/data/users/dorozco/.conda/envs/gnomix/lib/python3.8/site-packages/numpy/lib/function_base.py", line 1594, in interp
return interp_func(x, xp, fp, left, right)
ValueError: array of sample points is empty
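"Loaded 0 Chr.20 1240K SNPs" points at the map file rather than the VCF: no chromosome-20 rows were read from the .snp file, so np.interp receives an empty set of sample points. A quick hedged check of the map file (path and whitespace layout assumed):
import pandas as pd

# Count .snp map entries per chromosome; chromosome 20 should not be empty.
snp = pd.read_csv("./data/v51.1_1240k.snp", sep=r"\s+", header=None,
                  names=["snp", "ch", "morgans", "pos", "ref", "alt"])
print(snp["ch"].value_counts().sort_index())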
Hi,
It would be nice if you could add a maskfile option as a parameter to either
hapBLOCK_chroms
(to not emit IBD from masked areas, since all relevant genome-coordinate info is available there)
or
filter_ibd_df
plus its callers create_ind_ibd_df and ind_all_ibd_df
(to filter IBD instead of, or in addition to, the SNP-density parameter),
since this could be handled naturally in the base package.
(The individual IBD data in the output of hapBLOCK_chroms does not (yet) contain the genomic coordinates, and the map data is not on the same scale (M vs. cM) as the mask data, so simple "shell magic" to do this would be complex.)
While with a few samples, looking at individual pairwise IBD sharing, this is not an issue, when you work with several hundred individuals the number of combinations (N*(N-1)/2) gets large, and at these genome locations almost everyone will share IBD with all other samples. This results in a needlessly large portion of false-positive IBD in the outputs compared to the randomly distributed true IBD. A post-hoc filtering sketch follows below.
Thanks!
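Until such an option exists, one possible post-hoc workaround outside the package: interpolate each IBD segment's Morgan coordinates back to bp via the 1240K .snp map and drop segments overlapping masked intervals. A rough sketch; the IBD-table columns (iid1, iid2, ch, StartM, EndM, in Morgans) and the BED-like mask layout are assumptions, not the package's documented format:
import numpy as np
import pandas as pd

snp = pd.read_csv("v51.1_1240k.snp", sep=r"\s+", header=None,
                  names=["snp", "ch", "morgans", "pos", "ref", "alt"])
ibd = pd.read_csv("ch_all.tsv", sep="\t")                # ancIBD output table
mask = pd.read_csv("mask.bed", sep="\t", header=None,
                   names=["ch", "start_bp", "end_bp"])   # hypothetical mask file

def overlaps_mask(row):
    m = snp[snp["ch"] == row["ch"]]
    # Morgans -> bp by linear interpolation along the marker map
    start_bp = np.interp(row["StartM"], m["morgans"], m["pos"])
    end_bp = np.interp(row["EndM"], m["morgans"], m["pos"])
    mk = mask[mask["ch"] == row["ch"]]
    return ((mk["start_bp"] < end_bp) & (mk["end_bp"] > start_bp)).any()

ibd_clean = ibd[~ibd.apply(overlaps_mask, axis=1)]
ibd_clean.to_csv("ch_all.masked.tsv", sep="\t", index=False)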