hringbauer / ancibd
Detecting IBD within low-coverage ancient DNA data. Development repository for the software package; contains code for the manuscript.
License: GNU General Public License v3.0
Hi, I am loading VCF files from GLIMPSE into ancIBD, but this problem occurs. When I try the tutorials using the test_data you provided, I hit the same problem.
It seems to be a Python decoding problem; however, writing
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
into sitecustomize.py under the Python lib/site-packages/ path does not solve the error. Do you have any suggestions? Thanks a lot!
Hi again.
This issue can be merged with #5.
I've used GLIMPSE to impute two samples, resulting in the following VCF file:
(base) batel:~/glimpse/Levant_BA$ bcftools view merged.vcf.gz | head -20
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=01/08/2023 - 11:43:29
##source=GLIMPSE_phase v2.0.0
##contig=<ID=chr22>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="IMPUTE info quality score for diploid samples">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=15
##FPLOIDY=2
##bcftools_mergeVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_mergeCommand=merge -l GLIMPSE_ligate/samples_merge.txt -o merged.vcf.gz; Date=Tue Aug 1 13:03:00 2023
##bcftools_viewVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_viewCommand=view merged.vcf.gz; Date=Tue Aug 1 13:04:29 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S10769.SN S10770.SN
chr22 10519276 22:10519276:G:C G C . . RAF=0.0377889;AF=0.0572749;INFO=1 GT:DS:GP 0|0:0.106:0.896,0.101,0.003 0|0:0.115:0.888,0.108,0.004
chr22 10519389 22:10519389:T:C T C . . RAF=0.0376327;AF=0.0567751;INFO=1 GT:DS:GP 0|0:0.104:0.898,0.098,0.004 0|0:0.114:0.889,0.107,0.004
When trying to convert to HDF5 using
from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf
ch = 22
#base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
map_path = f"./data/v51.1_1240k.snp",
af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
col_sample_af = "",
buffer_size=20000, chunk_width=8, chunk_length=20000,
ch=ch)
I get the following error:
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_7983/2495295663.py in <module>
2 ch = 22
3 #base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
----> 4 vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
5 in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
6 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
125 print("Merging in LD Map..")
126 if len(map_path)>0:
--> 127 merge_in_ld_map(path_h5=path_h5,
128 path_snp1240k=map_path,
129 chs=[ch])
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/h5_modify.py in merge_in_ld_map(path_h5, path_snp1240k, chs, write_mode)
115 chs: Which Chromosomes to merge in HDF5 [list].
116 write_mode: Which mode to use on hdf5. a: New field. r+: Change Field"""
--> 117 with h5py.File(path_h5, "r") as f:
118 print("Lifting LD Map from eigenstrat to HDF5...")
119 print("Loaded %i variants." % np.shape(f["calldata/GT"])[0])
~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, **kwds)
531 fs_persist=fs_persist, fs_threshold=fs_threshold,
532 fs_page_size=fs_page_size)
--> 533 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
534
535 if isinstance(libver, tuple):
~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
224 if swmr and swmr_support:
225 flags |= h5f.ACC_SWMR_READ
--> 226 fid = h5f.open(name, flags, fapl=fapl)
227 elif mode == 'r+':
228 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './data/hdf5/example_hazelton_chr22.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
The directory does exist, as I get no error when running:
os.listdir("./data/hdf5/")
and I've also tried writing the absolute path.
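A quick hedged sanity check (paths copied from the call above) can narrow down which step actually wrote its output:
import os

# Check each file the conversion pipeline should have produced.
for p in ["./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",        # input VCF
          "./data/vcf.1240k/example_hazelton_chr22.vcf.gz",  # bcftools-filtered VCF
          "./data/hdf5/example_hazelton_chr22.h5"]:          # converted HDF5
    print(p, os.path.getsize(p) if os.path.exists(p) else "MISSING")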
Moreover, when running the example you provided -- just switching the in_vcf -- it runs perfectly fine.
So it must be something in the VCF file itself.
Any ideas?
Batel
I'm trying to install ancIBD on my institute's cluster, but it fails when trying to build a wheel.
I've tried running pip install wheel, and also upgraded it to the latest version.
(ancIBD) batelziv@moriah-gw-02:~/softwares/ancIBD$ python3 -m pip install ancIBD
Collecting ancIBD
Using cached ancIBD-0.5.tar.gz (201 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in ./lib/python3.7/site-packages (from ancIBD) (1.21.6)
Requirement already satisfied: pandas in ./lib/python3.7/site-packages (from ancIBD) (1.3.5)
Collecting scipy (from ancIBD)
Using cached scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Requirement already satisfied: h5py in ./lib/python3.7/site-packages (from ancIBD) (3.8.0)
Requirement already satisfied: psutil in ./lib/python3.7/site-packages (from ancIBD) (5.9.5)
Requirement already satisfied: cython in ./lib/python3.7/site-packages (from ancIBD) (3.0.0)
Requirement already satisfied: matplotlib in ./lib/python3.7/site-packages (from ancIBD) (3.5.3)
Collecting pysam (from ancIBD)
Using cached pysam-0.21.0-cp37-cp37m-manylinux_2_24_x86_64.whl (18.1 MB)
Collecting scikit-allel (from ancIBD)
Using cached scikit-allel-1.3.6.tar.gz (9.7 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [36 lines of output]
[scikit-allel] setup extensions with cython
Compiling allel/opt/model.pyx because it changed.
Compiling allel/opt/stats.pyx because it changed.
Compiling allel/opt/io_vcf_read.pyx because it changed.
[1/3] Cythonizing allel/opt/io_vcf_read.pyx
[2/3] Cythonizing allel/opt/model.pyx
[3/3] Cythonizing allel/opt/stats.pyx
Traceback (most recent call last):
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 100, in <module>
File "<string>", line 96, in setup_package
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 159, in setup
dist.parse_config_files()
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 898, in parse_config_files
pyprojecttoml.apply_configuration(self, filename, ignore_option_errors)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 66, in apply_configuration
config = read_configuration(filepath, True, ignore_option_errors, dist)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 128, in read_configuration
validate(subset, filepath)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 55, in validate
raise ValueError(f"{error}\n{summary}") from None
ValueError: invalid pyproject.toml config: `project`.
configuration error: `project` must contain ['name'] properties
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Any ideas?
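One possible workaround, assuming a prebuilt scikit-allel wheel exists for your Python version (untested here): force pip to skip the failing sdist build by requiring a binary wheel, then install ancIBD:
python3 -m pip install --only-binary=:all: scikit-allel
python3 -m pip install ancIBD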
For ARM users, please use this workaround to simulate x86:
conda create -n py37
conda activate py37
conda config --env --set subdir osx-64
conda install python=3.7
In /docs, do:
pip install -r requirements.txt
pip install Cython
pip install ancIBD
Hi, I've been trying to run ancIBD on data imputed with GLIMPSE and the 1000G panel. I have the full VCF per chromosome, and I could run the transformation and filtering to the 1240k SNPs smoothly, calculating the AF directly with this command:
for ch in range(1,23):
vcf_to_1240K_hdf(in_vcf_path = f"./data.chr{ch}.vcf.gz",
path_vcf = f"./ancIB_data/data.1240.chr{ch}.vcf.gz",
path_h5 = f"./ancIB_data/data.1240.chr{ch}.h5",
marker_path = f"./ancIBA_filters/snps_bcftools_ch{ch}.csv",
map_path = f"./ancIB_data/v51.1_1240k.snp",
#af_path = f"./ancIB_afs/v51.1_1240k_AF_ch{ch}.tsv",
col_sample_af = "AF_ALL",
buffer_size=20000, chunk_width=8, chunk_length=20000,
ch=ch)
But when I then run hapBLOCK_chroms with the next command, it fails to detect any IBD in any pair among my 90 samples...
for ch in range(1,23):
df_ibd = hapBLOCK_chroms(folder_in='./ancIB_data/data.1240k.chr',
iids=iids, run_iids=[],
ch=ch, folder_out='./ancIB_data/ibd/',
output=False, prefix_out='', logfile=True,
l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
ibd_in=1, ibd_out=10, ibd_jump=400,
min_cm=6, cutoff_post=0.99, max_gap=0.0075,
processes=1)
When I run the code with the example data I obtain the exact same result as in the tutorial.
When I use plink or KING on my imputed dataset I do detect IBD (and I know for sure that some individuals are 1st-degree related).
Can you see where the problem is?
Here is an example of the filtered VCF produced by the first command:
21 10205629 rs140777501 A G 100 PASS NS=2504;AA=A|||;VT=SNP;DP=21926;RAF=0.166534;AF=0.166534;INFO=1;EAS_AF=0.2173;AMR_AF=0.1988;AFR_AF=0.1135;EUR_AF=0.2177;SAS_AF=0.1104;AN=54;AC=2 GT:DS:GP 0|0:0:1,0,0 ./.:.:. ./.:.:. 0|0:0:1,0,0 ./.:.:. 0|0:0:1,0,0 0|0:0.001:0.999,0.001,0 ./.:.:. ./.:.:. ./.:.:. ./.:
21 14601415 rs2775537 G A 100 PASS NS=2504;AA=g|||;VT=SNP;DP=17928;RAF=0.550519;AF=0.550519;INFO=1;EAS_AF=0.376;AMR_AF=0.4251;AFR_AF=0.8169;EUR_AF=0.4722;SAS_AF=0.5399;AN=30;AC=3 GT:DS:GP ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 1|1:1.998:0,0.001,0.999 ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:
21 14652908 rs3869758 T C 100 PASS NS=2504;AA=.|||;VT=SNP;DP=23683;RAF=0.0896565;AF=0.0896565;INFO=1;EAS_AF=0;AMR_AF=0.0648;AFR_AF=0.2814;EUR_AF=0.0239;SAS_AF=0.0082;AN=176;AC=0 GT:DS:GP 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0
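One thing the excerpt suggests is a high fraction of missing genotypes (./.:.:.) at the 1240k sites, which leaves little information for IBD calling. A diagnostic sketch to quantify missingness per sample (the HDF5 path follows the naming in the call above; this is not part of the package):
import h5py
import numpy as np

# Fraction of missing GT alleles per sample; scikit-allel encodes missing as -1.
with h5py.File("./ancIB_data/data.1240.chr21.h5", "r") as f:
    gt = f["calldata/GT"][:]                 # shape: (variants, samples, 2)
    samples = f["samples"][:].astype(str)
miss = np.mean(gt < 0, axis=(0, 2))          # per-sample missing fraction
for s, m in zip(samples, miss):
    print(f"{s}: {m:.1%} missing")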
Hi, recently I was running ancIBD-0.5 on the example data (downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0), using Python 3.8.3 with the same parameters as in the tutorial. All commands run smoothly without errors, but the results I got differ from the results shown in the tutorial, and I still have not figured out what caused this difference. I would really appreciate it if you could help!
I checked the output when preparing the hdf5 files, and it is the same as in the tutorial. But the results start to differ when calling the IBD, and it seems that I get much less IBD.
Chromosome 1; Loaded 18 IBD
Chromosome 2; Loaded 13 IBD
Chromosome 3; Loaded 11 IBD
Chromosome 4; Loaded 9 IBD
Chromosome 5; Loaded 9 IBD
Chromosome 6; Loaded 11 IBD
Chromosome 7; Loaded 12 IBD
Chromosome 8; Loaded 8 IBD
Chromosome 9; Loaded 10 IBD
Chromosome 10; Loaded 10 IBD
Chromosome 11; Loaded 10 IBD
Chromosome 12; Loaded 6 IBD
Chromosome 13; Loaded 12 IBD
Chromosome 14; Loaded 6 IBD
Chromosome 15; Loaded 6 IBD
Chromosome 16; Loaded 8 IBD
Chromosome 17; Loaded 9 IBD
Chromosome 18; Loaded 6 IBD
Chromosome 19; Loaded 11 IBD
Chromosome 20; Loaded 10 IBD
Chromosome 21; Loaded 9 IBD
Chromosome 22; Loaded 10 IBD
Saved 214 IBD to /home/han_shi/script/ancibd/output/ibd_hazelton2/ch_all.tsv.
When summarizing the IBD between pairs, some pairs yield IBD but differ in sum and number, while some pairs (I12438 & I12896, I12440 & I12896, ...), which yield IBD in the tutorial, did not yield any IBD here.
iid1 | iid2 | max_IBD | sum_IBD>8 | n_IBD>8 | sum_IBD>12 | n_IBD>12 | sum_IBD>16 | n_IBD>16 | sum_IBD>20 | n_IBD>20 |
---|---|---|---|---|---|---|---|---|---|---|
I12440 | I30300 | 268.7866928288713 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 |
I12438 | I30300 | 283.65220259875065 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 |
I12440 | I12438 | 176.73970285686664 | 1657.2198443172965 | 22.0 | 1647.3969392536674 | 21.0 | 1647.3969392536674 | 21.0 | 1647.3969392536674 | 21.0 |
I12439 | I12440 | 153.30770015716553 | 1626.9729871419258 | 21.0 | 1604.7398943570442 | 19.0 | 1604.7398943570442 | 19.0 | 1604.7398943570442 | 19.0 |
I12439 | I30300 | 98.25259447097778 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 |
I12439 | I12438 | 93.37389469146729 | 475.9611673653126 | 10.0 | 459.80747416615486 | 8.0 | 459.80747416615486 | 8.0 | 459.80747416615486 | 8.0 |
I21390 | I30300 | 13.226699829101562 | 13.226699829101562 | 1.0 | 13.226699829101562 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12896 | I30300 | 12.573707103729248 | 12.573707103729248 | 1.0 | 12.573707103729248 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12439 | I21390 | 11.11750602722168 | 11.11750602722168 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12439 | I12896 | 8.414804935455322 | 8.414804935455322 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12438 | I12896 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12438 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12440 | I12896 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12440 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12896 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Could you please give me some advice about what might have caused this problem? Many thanks!
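Since all commands complete without errors, a first thing worth comparing against the tutorial environment is the exact package versions; a minimal sketch using only the standard library:
from importlib.metadata import version

# Record the versions of the packages most likely to change numeric results.
for pkg in ["ancIBD", "scikit-allel", "h5py", "numpy", "scipy"]:
    print(pkg, version(pkg))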
Hi,
Is it on purpose that not all SNPs from the 1240K marker set are included in the filter data used to restrict imputed SNPs to the 1240K markers? At the hdf5 data import the imputed GTs are filtered by these coordinates, so you always "lose" the non-included markers even though you do have them in the imputed GTs.
I am aware that some SNPs on the 1240K chip have lower concordance with true shotgun WGS data. However, not all data come from the chip, so removing these markers from a purely WGS dataset should not be required. On the other hand, we are talking about imputed GTs anyway, where the other markers in LD were already used to infer the diploid phased haplotypes over a large genomic chunk based on gold-standard WGS reference data (given the random positions of the non-concordant SNPs, imputation should already fix their errors). Accordingly, if these filter files contain fewer markers in order to avoid "bad markers", that removal should happen prior to the imputation step for chip data, not at the IBD identification step. That way imputation should improve for chip data while true shotgun WGS is unaffected. Furthermore, that approach would not thin the markers at the IBD detection step for either WGS or chip data, while still allowing mixed datasets to be co-analyzed.
But again, I reserve the right to be dumb/ignorant, and it may very well be that I am unaware of some other valid reason to remove these markers. Could you please elaborate on why ~50k (~4.4%) of autosomal 1240K markers are excluded at filtering?
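For reference, the excluded fraction can be counted by intersecting the per-chromosome filter file with the 1240K .snp file; a sketch, with both file layouts assumed (EIGENSTRAT .snp columns; tab-separated CHROM/POS in the bcftools filter file):
import pandas as pd

ch = 22
# EIGENSTRAT .snp: ID, chromosome, genetic position (M), physical position, ref, alt
snp = pd.read_csv("./data/v51.1_1240k.snp", sep=r"\s+", header=None,
                  names=["snp", "ch", "morgans", "pos", "ref", "alt"])
snp_ch = snp[snp["ch"] == ch]
# Filter file as used with bcftools -T (assumed layout: CHROM<tab>POS)
filt = pd.read_csv(f"./data/filters/snps_bcftools_ch{ch}.csv",
                   sep="\t", header=None, names=["ch", "pos"])
excluded = ~snp_ch["pos"].isin(filt["pos"])
print(f"chr{ch}: {excluded.sum()} of {len(snp_ch)} 1240K markers not in the filter")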
Regards,
Zoltan
Hello!
In the bioRxiv preprint you write that you use data from the AADR, but there the data are in EIGENSTRAT format. Could you tell me how to convert them to VCF with a GL field? And how is the difference between capture and whole-genome data handled then? Or do you have BAM files for the 1240k panel?
Hi,
First of all, thank you for your detailed and easy to follow instructions in the notebook.
Regarding the vcf to HDF5 conversion function:
To process the data obtained from Dropbox, it is necessary to unzip the vcf.gz files prior to executing the code mentioned in the notebook.
Second, I get this error after unzipping with gunzip *.gz:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
This is the full error print:
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<timed exec> in <module>
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
114
115 print("Converting to HDF5...")
--> 116 allel.vcf_to_hdf5(input=path_vcf, output=path_h5,
117 fields = ['variants/*', 'calldata/*', "samples"],
118 types = {"samples":"S60", "calldata/GT":np.int8,
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
691
692 # setup chunk iterator
--> 693 fields, samples, headers, it = iter_vcf_chunks(
694 input, fields=fields, exclude_fields=exclude_fields, types=types,
695 numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
1136
1137 # setup iterator
-> 1138 fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
1139
1140 # setup transformers
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
1634
1635 # read VCF headers
-> 1636 headers = _read_vcf_headers(stream)
1637
1638 # setup samples
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _read_vcf_headers(stream)
1709 # read first header line
1710 header = stream.readline()
-> 1711 header = str(header, 'utf8')
1712
1713 while header and header[0] == '#':
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
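Byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the file being parsed is apparently still gzip-compressed even though its name no longer ends in .gz (scikit-allel typically decides decompression from the file extension). A small check, with the path as an example only:
# Check whether a supposedly plain VCF is actually still gzip-compressed.
path = "./data/vcf.1240k/example_hazelton_chr22.vcf"  # example path
with open(path, "rb") as f:
    magic = f.read(2)
print("still gzipped" if magic == b"\x1f\x8b" else "plain text", path)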
Hello,
I am attempting to convert the example VCF file from Dropbox to HDF5, but I am facing an error during the interpolation step. Do you have any insights on what might be causing this issue?
Thank you!
Print downsampling to 1240K...
Running bash command:
bcftools view -Ov -o test/example_hazelton.ch20.1240k.vcf -T ./data/filters/snps_bcftools_ch20.csv -M2 -v snps ./data/vcf.raw/example_hazelton_chr20.vcf
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28940 variants.
Loaded 6 individuals.
Loaded 0 Chr.20 1240K SNPs.
Intersection 0 out of 28940 HDF5 SNPs
Interpolating 28940 variants.
Traceback (most recent call last):
File "/data/users/dorozco/.conda/envs/gnomix/bin/ancIBD-run", line 8, in <module>
sys.exit(main())
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/run_ancIBD.py", line 63, in main
vcf_to_1240K_hdf(in_vcf_path = args.vcf,
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/prepare_h5.py", line 135, in vcf_to_1240K_hdf
merge_in_ld_map(path_h5=path_h5,
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/h5_modify.py", line 154, in merge_in_ld_map
rec_ch = np.interp(x1, x, y)
File "<__array_function__ internals>", line 180, in interp
File "/data/users/dorozco/.conda/envs/gnomix/lib/python3.8/site-packages/numpy/lib/function_base.py", line 1594, in interp
return interp_func(x, xp, fp, left, right)
ValueError: array of sample points is empty
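"Loaded 0 Chr.20 1240K SNPs" points at the map file rather than the VCF: no chromosome-20 rows were read from the .snp file, so np.interp receives an empty set of sample points. A quick hedged check of the map file (path and whitespace layout assumed):
import pandas as pd

# Count .snp map entries per chromosome; chromosome 20 should not be empty.
snp = pd.read_csv("./data/v51.1_1240k.snp", sep=r"\s+", header=None,
                  names=["snp", "ch", "morgans", "pos", "ref", "alt"])
print(snp["ch"].value_counts().sort_index())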
Hi,
It would be nice if you could add a maskfile option as a parameter to either
hapBLOCK_chroms
(to not emit IBD from masked areas, since all relevant genome-coordinate info is available there)
or
filter_ibd_df
plus its callers create_ind_ibd_df and ind_all_ibd_df
(to filter IBD instead of, or in addition to, the SNP-density parameter),
since this could be handled naturally in the base package.
(The individual IBD data in the output of hapBLOCK_chroms does not (yet) contain the genomic coordinates, and the map data is not on the same scale (M vs. cM) as the mask data, so simple "shell magic" to do this would be complex.)
While with a few samples, looking at individual pairwise IBD sharing, this is not an issue, when you work with several hundred individuals the number of combinations (N*(N-1)/2) gets large, and at these genome locations almost everyone will share IBD with all other samples. This results in a needlessly large portion of false-positive IBD in the outputs compared to the randomly distributed true IBD. A post-hoc filtering sketch follows below.
Thanks!
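Until such an option exists, one possible post-hoc workaround outside the package: interpolate each IBD segment's Morgan coordinates back to bp via the 1240K .snp map and drop segments overlapping masked intervals. A rough sketch; the IBD-table columns (iid1, iid2, ch, StartM, EndM, in Morgans) and the BED-like mask layout are assumptions, not the package's documented format:
import numpy as np
import pandas as pd

snp = pd.read_csv("v51.1_1240k.snp", sep=r"\s+", header=None,
                  names=["snp", "ch", "morgans", "pos", "ref", "alt"])
ibd = pd.read_csv("ch_all.tsv", sep="\t")                # ancIBD output table
mask = pd.read_csv("mask.bed", sep="\t", header=None,
                   names=["ch", "start_bp", "end_bp"])   # hypothetical mask file

def overlaps_mask(row):
    m = snp[snp["ch"] == row["ch"]]
    # Morgans -> bp by linear interpolation along the marker map
    start_bp = np.interp(row["StartM"], m["morgans"], m["pos"])
    end_bp = np.interp(row["EndM"], m["morgans"], m["pos"])
    mk = mask[mask["ch"] == row["ch"]]
    return ((mk["start_bp"] < end_bp) & (mk["end_bp"] > start_bp)).any()

ibd_clean = ibd[~ibd.apply(overlaps_mask, axis=1)]
ibd_clean.to_csv("ch_all.masked.tsv", sep="\t", index=False)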