
ezancestry's Introduction

ezancestry


Easily visualize your direct-to-consumer genetic data next to 2500+ samples from the 1000 Genomes Project. Evaluate the performance of a custom set of ancestry-informative SNPs (AISNPs) at classifying the genetic ancestry of the 1000 Genomes samples using a machine learning model.

A subset of single nucleotide polymorphisms (SNPs) from the 1000 Genomes Project samples has been parsed from the publicly available .bcf files.
This subset of SNPs, the AISNPs (ancestry-informative SNPs), was chosen from two publications:

* Kidd (55 AISNPs)
* Seldin (128 AISNPs)

ezancestry ships with pretrained k-nearest neighbor models for all combinations of the following:

* AISNP set: Kidd or Seldin
* population level: continental (superpopulation) or regional (population)
* dimensionality reduction: principal component analysis (PCA)



Installation

Install ezancestry with pip:

pip install ezancestry

Or clone the repository and run pip install from the directory:

git clone git@github.com:arvkevi/ezancestry.git
cd ezancestry
pip install .

Config

The first time ezancestry is run, it will generate a conf.ini file and a data/ directory in your home directory under ${HOME}/.ezancestry. You can edit conf.ini to change the default settings, but editing it is not required to use ezancestry. The settings are a convenience so the user doesn't have to be verbose when interacting with the software. Each setting is also a keyword argument to the corresponding command in the ezancestry API, so you can always override the defaults.

These will be created in your home directory:

${HOME}/.ezancestry/conf.ini
${HOME}/.ezancestry/data/
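
As a quick sanity check, you can print the generated defaults with the standard library. This is just a sketch: it assumes only that conf.ini lives at the documented location, and makes no assumption about ezancestry's section or key names.

from configparser import ConfigParser
from pathlib import Path

# Read the config file ezancestry generated on first run.
config = ConfigParser()
config.read(Path.home() / ".ezancestry" / "conf.ini")

# Print every section/key/value pair, whatever they happen to be.
for section in config.sections():
    for key, value in config[section].items():
        print(f"{section}.{key} = {value}")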

Explanations of each setting are given in the Options section of each command's --help output, for example:

ezancestry predict --help

Usage: ezancestry predict [OPTIONS] INPUT_DATA

  Predict ancestry from genetic data.

  * Default arguments are from the ~/.ezancestry/conf.ini file. *

Arguments:
  INPUT_DATA  Can be a file path to raw genetic data (23andMe, ancestry.com,
              .vcf) file, a path to a directory containing several raw genetic
              files, or a (tab or comma) delimited file with sample ids as
              rows and snps as columns.  [required]


Options:
  --output-directory TEXT         The directory where to write the prediction
                                  results file

  --write-predictions / --no-write-predictions
                                  If True, write the predictions to a file. If
                                  False, return the predictions as a
                                  dataframe.  [default: True]

  --models-directory TEXT         The path to the directory where the model
                                  files are located.

  --aisnps-directory TEXT         The path to the directory where the AISNPs
                                  files are located.

  --aisnps-set TEXT               The name of the AISNP set to use. To start,
                                  choose either 'Kidd' or 'Seldin'. The
                                  default value in conf.ini is 'Kidd'. *If
                                  using your own AISNP set, this value will be
                                  used in the naming convention for all the
                                  new model files that are created*

  --help                          Show this message and exit.

Usage

ezancestry can be used as a command-line tool or as a Python library.

ezancestry predict comes with pre-trained models when --aisnps-set=kidd or --aisnps-set=seldin.


command-line interface

There are four commands available:

  1. fetch: generate a csv file with all the 1000 Genomes samples (rows) at the specified AISNP locations (columns).
  2. predict: predict the genetic ancestry of a sample or cohort of samples using the nearest neighbors model.
  3. plot: plot the genetic ancestry of samples using the output of predict.
  4. train: build a k-nearest neighbors model from the 1000 genomes data using a custom set of AISNPs.

Use the commands in the following way:

predict

ezancestry can predict the genetic ancestry of a sample or cohort of samples using the nearest neighbors model. The input_data can be a path to a raw genetic data file (23andMe, ancestry.com, .vcf), a path to a directory containing several raw genetic files, or a (tab or comma) delimited file with sample ids as rows and snps as columns.

This writes a file, predictions.csv, to the output_directory (defaults to the current directory). The file contains the predicted ancestry for each sample.

Direct-to-consumer genetic data file (23andMe, ancestry.com, etc.):

ezancestry predict mygenome.txt

Directory of direct-to-consumer genetic data files or .vcf files:

ezancestry predict /path/to/genetic_datafiles

Comma-separated file with sample ids as rows and snps as columns, filled with genotypes as values:

ezancestry predict ${HOME}/.ezancestry/data/aisnps/thousand_genomes.KIDD.dataframe.csv
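
If you are assembling such a file yourself, the sketch below shows the expected shape. The sample ids, rsids, and genotypes are placeholders for illustration, not a real AISNP panel.

import pandas as pd

# Rows are samples, columns are SNP rsids, values are genotype strings.
df = pd.DataFrame(
    {"rs731257": ["TT", "TC"], "rs2946788": ["AA", "AT"]},
    index=["sample_1", "sample_2"],
)
df.index.name = "sample"
df.to_csv("my_samples.csv")  # then: ezancestry predict my_samples.csv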

plot

Visualize the output of predict using the plot command. This will open a 3d scatter plot in a browser.

ezancestry plot predictions.csv
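
To inspect the predictions programmatically instead, here is a minimal preview; the exact columns depend on the model and population level, so none are hard-coded here:

import pandas as pd

# Load the output of `ezancestry predict` and peek at it.
predictions = pd.read_csv("predictions.csv", index_col=0)
print(predictions.shape)
print(predictions.head())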

fetch

This command will download all of the data required to build a new nearest neighbors model for a custom set of AISNPs. If you want to use existing models, see predict and plot.

Without any arguments this command will download all necessary data to build new models and put it in the ${HOME}/.ezancestry/data/ directory.

ezancestry fetch
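
To see what arrived, list the data directory. This sketch assumes only the documented ${HOME}/.ezancestry/data/ location:

from pathlib import Path

# Walk the ezancestry data directory and print every downloaded file.
data_dir = Path.home() / ".ezancestry" / "data"
for path in sorted(data_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(data_dir))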

Now you are ready to build a new model with train.

train

Test the discriminative power of your custom set of AISNPs.

This command will build all the necessary models to visualize and predict the 1000 Genomes samples as well as user-uploaded samples. A model performance evaluation report will be generated for a five-fold cross-validation on the training set of the 1000 Genomes samples, as well as a report for the holdout set.

Create a custom AISNP file here: ~/.ezancestry/data/aisnps/custom.AISNP.txt. The prefix of the filename, custom, can be whatever you want. Note that this value is used as the aisnps-set keyword argument for other ezancestry commands.

The file should look like this:

id      chromosome      position
rs731257        7       12669251
rs2946788       11      24010530
rs3793451       9       71659280
rs10236187      7       139447377
rs1569175       2       201021954

Then train on it:

ezancestry train --aisnps-set=custom
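
If you prefer to generate the AISNP file programmatically, here is a sketch using the example rows above; substitute your own panel:

import pandas as pd
from pathlib import Path

# The three columns shown in the example file: id, chromosome, position.
aisnps = pd.DataFrame(
    [
        ("rs731257", 7, 12669251),
        ("rs2946788", 11, 24010530),
        ("rs3793451", 9, 71659280),
    ],
    columns=["id", "chromosome", "position"],
)

# Write it tab-delimited to the location train expects.
destination = Path.home() / ".ezancestry" / "data" / "aisnps" / "custom.AISNP.txt"
aisnps.to_csv(destination, sep="\t", index=False)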

Python API

See the notebook
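
Until you open the notebook, here is a hedged sketch of the API. The import path appears in ezancestry's own tracebacks (ezancestry.commands), but the exact keyword arguments are assumed to mirror the CLI options documented above, so verify against the notebook.

from ezancestry.commands import predict

# Keyword names below mirror the CLI flags; treat them as assumptions.
predictions_df = predict(
    "mygenome.txt",           # any input type the CLI accepts
    write_predictions=False,  # return a DataFrame instead of writing a file
    aisnps_set="kidd",        # use a pretrained AISNP set
)
print(predictions_df.head())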

Visualization

Open in Streamlit


Contributing

Contributions are welcome! Please feel free to create an issue for discussion or make a pull request.


ezancestry's Issues

Script erroring on single VCF input

Hi Kevin,
Here's the output from running the script on a .VCF file:
(dnafinger) (base) ➜ VCFs ls
HG01082.hard-filtered.vcf
(dnafinger) (base) ➜ VCFs ezancestry predict ./
no SNPs loaded...
2022-06-15 08:44:45.149 | DEBUG | ezancestry.process:_input_to_dataframe:260 - No snps found in the input_data
2022-06-15 08:44:45.159 | DEBUG | ezancestry.process:process_user_input:194 - 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
2022-06-15 08:44:45.159 | WARNING | ezancestry.process:process_user_input:195 - Skipping .DS_Store because it was not valid
2022-06-15 08:44:45.170 | INFO | ezancestry.process:encode_genotypes:137 - Successfully loaded an encoder from /Users/bramid/.ezancestry/data/models/one_hot_encoder.kidd.bin
Traceback (most recent call last):
  File "/Users/bramid/dnafinger/bin/ezancestry", line 8, in <module>
    sys.exit(app())
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/ezancestry/commands.py", line 288, in predict
    snpsdf = encode_genotypes(
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/ezancestry/process.py", line 140, in encode_genotypes
    X = ohe.transform(df.values)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 471, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown,
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 113, in _transform
    X_list, n_samples, n_features = self._check_X(
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 44, in _check_X
    X_temp = check_array(X, dtype=None,
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/utils/validation.py", line 726, in check_array
    raise ValueError("Found array with %d sample(s) (shape=%s) while a"
ValueError: Found array with 0 sample(s) (shape=(0, 55)) while a minimum of 1 is required.

I'm attaching the first 1k lines of this VCF in case it's a problem with VCF, not the code.
The complete VCF can be downloaded like so:
aws s3 cp --no-sign-request s3://1000genomes-dragen-3.7.6/data/individuals/hg19-graph-based/HG01082/HG01082.hard-filtered.vcf.gz .

1k_HG01082.hard-filtered.vcf.gz

fix streamlit app

File "/home/appuser/venv/lib/python3.7/site-packages/streamlit/script_runner.py", line 379, in _run_script
    exec(code, module.__dict__)
File "/app/ezancestry/streamlit/app.py", line 12, in <module>
    from util import (
File "streamlit/util.py", line 4, in <module>
    import vcf
File "/home/appuser/venv/lib/python3.7/site-packages/vcf/__init__.py", line 166, in <module>
    from parser import Reader, Writer

Unable to load umap model

Hi, I'm unable to load the included umap models. PCA models work.

When running the following code,

# write all the super population dimred models for kidd and Seldin
for aisnps_set, df, df_labels in zip(
    ["kidd", "Seldin"], 
    [df_kidd_encoded, df_seldin_encoded], 
    [df_kidd["superpopulation"], df_seldin["superpopulation"]]
):
    for algorithm, labels in zip(["pca", "umap", "nca"], [None, None, None, df_labels]):
        print(algorithm,aisnps_set,OVERWRITE_MODEL,labels)
        df_reduced = dimensionality_reduction(df, algorithm=algorithm, aisnps_set=aisnps_set, overwrite_model=OVERWRITE_MODEL, labels=labels, population_level="super population")
        knn_model = train(df_reduced, df_labels, algorithm=algorithm, aisnps_set=aisnps_set, k=9, population_level="superpopulation", overwrite_model=OVERWRITE_MODEL)

I get the error below:

2022-08-22 17:16:03.823 | INFO     | ezancestry.dimred:dimensionality_reduction:126 - Successfully loaded a dimensionality reduction model
pca kidd False None
umap kidd False None
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [17], in <cell line: 2>()
      7 for algorithm, labels in zip(["pca", "umap", "nca"], [None, None, None, df_labels]):
      8     print(algorithm,aisnps_set,OVERWRITE_MODEL,labels)
----> 9     df_reduced = dimensionality_reduction(df, algorithm=algorithm, aisnps_set=aisnps_set, overwrite_model=OVERWRITE_MODEL, labels=labels, population_level="super population")
     10     knn_model = train(df_reduced, df_labels, algorithm=algorithm, aisnps_set=aisnps_set, k=9, population_level="superpopulation", overwrite_model=OVERWRITE_MODEL)

File ~/ezancestry/ezancestry/dimred.py:107, in dimensionality_reduction(df, algorithm, aisnps_set, n_components, overwrite_model, labels, population_level, models_directory, random_state)
    105 if algorithm in set(["pca", "umap"]):
    106     try:
--> 107         reducer = joblib.load(
    108             models_directory.joinpath(f"{algorithm}.{aisnps_set}.bin")
    109         )
    110     except FileNotFoundError:
    111         return None

File ~/opt/anaconda3/lib/python3.9/site-packages/joblib/numpy_pickle.py:587, in load(filename, mmap_mode)
    581             if isinstance(fobj, str):
    582                 # if the returned file object is a string, this means we
    583                 # try to load a pickle file generated with an version of
    584                 # Joblib so we load it with joblib compatibility function.
    585                 return load_compatibility(fobj)
--> 587             obj = _unpickle(fobj, filename, mmap_mode)
    588 return obj

File ~/opt/anaconda3/lib/python3.9/site-packages/joblib/numpy_pickle.py:506, in _unpickle(fobj, filename, mmap_mode)
    504 obj = None
    505 try:
--> 506     obj = unpickler.load()
    507     if unpickler.compat_mode:
    508         warnings.warn("The file '%s' has been generated with a "
    509                       "joblib version less than 0.10. "
    510                       "Please regenerate this pickle file."
    511                       % filename,
    512                       DeprecationWarning, stacklevel=3)

File ~/opt/anaconda3/lib/python3.9/pickle.py:1212, in _Unpickler.load(self)
   1210             raise EOFError
   1211         assert isinstance(key, bytes_types)
-> 1212         dispatch[key[0]](self)
   1213 except _Stop as stopinst:
   1214     return stopinst.value

File ~/opt/anaconda3/lib/python3.9/pickle.py:1589, in _Unpickler.load_reduce(self)
   1587 args = stack.pop()
   1588 func = stack[-1]
-> 1589 stack[-1] = func(*args)

File ~/opt/anaconda3/lib/python3.9/site-packages/numba/core/serialize.py:97, in _unpickle__CustomPickled(serialized)
     92 def _unpickle__CustomPickled(serialized):
     93     """standard unpickling for `_CustomPickled`.
     94 
     95     Uses `NumbaPickler` to load.
     96     """
---> 97     ctor, states = loads(serialized)
     98     return _CustomPickled(ctor, states)

AttributeError: Can't get attribute '_rebuild_function' on <module 'numba.core.serialize' from '/Users/jacksonc08/opt/anaconda3/lib/python3.9/site-packages/numba/core/serialize.py'>

I have tested that it is certainly the UMAP model that is causing the issue.

import pandas as pd
import joblib

obj = joblib.load(r"/Users/jacksonc08/ezancestry/data/models/umap.kidd.bin")

This gives the same error.

Looking online, it seems to be an issue with the numba package (a dependency of umap-learn), which no longer includes the _rebuild_function function. See here.

Do you have any recommendations on how to fix this error? Many thanks.

Using latest 1k Genome dataset for building model

Hi,
I'm exploring adding an ethnic background check to a sample QC pipeline I'm working on, and this tool seems to check all the boxes.
I was involved in making a re-analysis of the latest addition of samples for the 1000 Genomes Project, run with Illumina DRAGEN, available on AWS.

DRAGEN reanalysis of the 1000 Genomes Dataset now available on the Registry of Open Data

Although I can't find a similar complete aggregate bed file to the ones pulled by your fetch script, do you think your pipeline could easily be modified to create a model using this updated dataset?
Would love to hear your thoughts on cost/benefit of this approach.

Thanks,
Daniel Brami

Dependency issues installing with pip

There seems to be a dependency issue that occurs during processing after installing ezancestry with pip, related to cyvcf2. See here. After installing ezancestry and downgrading to cyvcf2==0.30.14, the issue is resolved.

Support for compressed VCF files

Hi Kevin,
I started to try playing with the script but I'm running into a few speed bumps that I thought I would share in different tickets:

  • does not support .vcf.gz files
    Not sure if BCF is more useful but for now, my data is in .vcf.gz

Is supporting compressed VCF hard to implement?
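
In the meantime, a stopgap sketch: decompress with the standard library first, assuming plain-text .vcf input works as the README's predict examples suggest.

import gzip
import shutil

# Decompress sample.vcf.gz to sample.vcf, then: ezancestry predict sample.vcf
with gzip.open("sample.vcf.gz", "rb") as src, open("sample.vcf", "wb") as dst:
    shutil.copyfileobj(src, dst)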

ValueError: vcf is not a valid file or directory. Please provide a valid file or directory.

Hi Kevin, I'm trying this script but I'm running into this error during the prediction:
(the vcf file was annotated with VEP)

DEBUG | ezancestry.process:process_user_input:214 - list index out of range
Traceback (most recent call last):
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 217, in process_user_input
    snpsdf = pd.read_csv(
  File "/usr/local/lib/python3.9/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 678, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 1253, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 270, in read
    alldata = self._rows_to_cols(content)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 1013, in _rows_to_cols
    self._alert_malformed(msg, row_num + 1)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 739, in _alert_malformed
    raise ParserError(msg)
pandas.errors.ParserError: Expected 3 fields in line 7, saw 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tigem/r.desantis/.local/bin/ezancestry", line 8, in <module>
    sys.exit(app())
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 532, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/commands.py", line 286, in predict
    snpsdf = process_user_input(input_data, aisnps_directory, aisnps_set)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 232, in process_user_input
    raise ValueError(
ValueError: a1.VEP.ann.vcf is not a valid file or directory. Please provide a valid file or directory.

Number of snps covered

Inform the user how many AISNPs they have in their uploaded sample.
It will either be n/55 (Kidd) or n/128 (Seldin).
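
A minimal sketch of the requested computation, with placeholder inputs; a real implementation would take the rsids from the loaded AISNP set and the parsed sample:

# Placeholder inputs: the model's AISNP rsids and the sample's genotype calls.
aisnp_ids = {"rs731257", "rs2946788", "rs3793451"}
sample_genotypes = {"rs731257": "TT", "rs2946788": "AA"}

# Set intersection of the panel with the sample's rsids gives the coverage.
covered = aisnp_ids & sample_genotypes.keys()
print(f"{len(covered)}/{len(aisnp_ids)} AISNPs covered")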

Test failures with Python 3.10

snps cron test jobs with ezancestry recently started failing on GitHub Actions Python 3.10 setups:

Last working build (06/18/2022): https://github.com/apriha/snps/actions/runs/2522809062
First broken build (06/25/2022): https://github.com/apriha/snps/actions/runs/2563003443

Here is an example of the error:

 =================================== FAILURES ===================================
 ____________________________ TestSnps.test_ancestry ____________________________
 __init__.pxd:942: in numpy.import_array
     ???
 E   RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
 
 During handling of the above exception, another exception occurred:
 tests/test_snps.py:486: in test_ancestry
     self._make_ancestry_assertions(s.predict_ancestry())
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/snps/snps.py:1683: in predict_ancestry
     from ezancestry.commands import predict
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/ezancestry/commands.py:10: in <module>
     from sklearn.model_selection import train_test_split
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sklearn/__init__.py:82: in <module>
     from .base import clone
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sklearn/base.py:17: in <module>
     from .utils import _IS_32BIT
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sklearn/utils/__init__.py:22: in <module>
     from .murmurhash import murmurhash3_32
 sklearn/utils/murmurhash.pyx:27: in init sklearn.utils.murmurhash
     ???
 __init__.pxd:944: in numpy.import_array
     ???
 E   ImportError: numpy.core.multiarray failed to import

I tried fixing this build by installing different combinations of dependencies (new versions of numpy and pandas were released on 06/22/2022 and 06/23/2022, respectively), but no luck yet.
