decoden's Introduction

decoden logo

Multi-assay ChIP-Seq Analysis with DecoDen

DOI:10.1101/2022.10.18.512665 GPLv3 license

DecoDen uses replicates and multi-histone ChIP-Seq experiments for a target cell type to learn and remove shared biases arising from fragmentation, PCR amplification, and sequence mappability.

Installation

While in development, DecoDen is distributed as a Poetry project. Installation requires a local copy of git and a C compiler. We recommend using Conda to create a suitable environment, for example conda create -n decoden "python>=3.10" (quoting the version constraint so the shell does not interpret it as a redirection). After activating the environment (conda activate decoden), follow these steps:

  1. Install Poetry
  2. Clone the repository and install with Poetry
# Clone the repository
git clone git@github.com:ntanmayee/DecoDen.git
cd DecoDen

# Install the external dependencies and DecoDen
conda install pyarrow poetry
conda install samtools zlib

# If there is no C compiler installed, also run the following command
conda install c-compiler

poetry install

Quick Start

Input data

Running decoden requires two inputs:

  1. Aligned reads in .bam format from ChIP-Seq experiments
  2. Sample annotation file in .csv format

Auto-generate a sample annotation file

To generate a skeleton sample annotation file, run -

decoden create_csv 

This will create samples.csv in your current directory. Edit this file and fill in the columns with appropriate information; more details are available in the wiki, and a minimal example is sketched below.
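
For reference, the annotation file has the columns filepath, exp_name and is_control (the same columns that appear in the example file quoted in the issues further down this page). A minimal filled-in file, with purely hypothetical file paths and experiment names, might look like this:

filepath,exp_name,is_control
data/control_rep1.bam,control,1
data/control_rep2.bam,control,1
data/h3k4me3_rep1.bam,h3k4me3,0
data/h3k4me3_rep2.bam,h3k4me3,0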

Run DecoDen

Run the DecoDen pipeline with default parameters -

decoden run -i samples.csv -o output_directory -gs genome-size

Detailed Usage Guidelines

The following commands are available in DecoDen; see the wiki for more details about each of them.

Command      Description
create_csv   Create a skeleton sample annotation file
run          Run the full DecoDen pipeline to preprocess and denoise BAM/BED files
preprocess   Preprocess BAM/BED data into the format required by DecoDen
denoise      Run the denoising step of DecoDen on suitably preprocessed data
detect       Detect peaks in the processed DecoDen signals

There is more helpful information in the wiki.

Bug Reports and Suggestions for Improvement

Please raise an issue if you find bugs or if you have any suggestions for improvement.

Funding

This project has received funding from the European Union's Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Skłodowska-Curie Grant Agreement No. 813533-MSCA-ITN-2018.

decoden's People

Contributors

dependabot[bot], gvisona, ntanmayee


Forkers

jamesabbott

decoden's Issues

Data Correlation Heatmap

It would be nice to have the correlation clustermap as a standard figure when the option to save figures is True.

Multiprocessing for preprocessing is not working

I tried to run 9 jobs concurrently, and it doesn't work.

python run_preprocess.py -i samples.csv -o decoden -bs 200 -n 9

There's an error about one of the intermediate files not being found.

sklearn warnings

When running run_decoden I receive a lot of warnings

/cluster/gjb_lab/mgierlinski/software/miniconda3/envs/decoden/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:874: RuntimeWarning: invalid value encountered in scalar divide
  if (previous_error - error) / error_at_init < tol:
/cluster/gjb_lab/mgierlinski/software/miniconda3/envs/decoden/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:1665: ConvergenceWarning: Maximum number of iterations 500 reached. Increase it to improve convergence.

They come from the sklearn library, so they might be difficult to trace, but it is something in the data passed to sklearn that causes the warning. I asked ChatGPT for help with this warning and here is what it suggested:

This warning indicates that an invalid value (e.g., NaN or infinity) was encountered during a scalar division operation in the Non-negative Matrix Factorization (NMF) module of the scikit-learn library. The specific line mentioned in the warning checks for the convergence of the NMF algorithm by comparing the relative change in the error with a user-specified tolerance level (tol).

There might be several reasons for encountering such invalid values during the computation:

Input data issues: The input data could have missing values (NaNs), very large values, or other problematic features that could cause the algorithm to produce invalid results during computations. Make sure to preprocess your data by removing or imputing missing values, scaling or normalizing the data, and removing outliers if necessary.

Initialization issues: NMF relies on the initialization of matrices for its iterative optimization process. If the initialization is poor or leads to numerical instability, the algorithm might not converge properly. By default, scikit-learn uses 'nndsvd' initialization, which is generally a good choice. However, you can try different initialization strategies by setting the 'init' parameter when creating the NMF instance, for example, 'random' or 'nndsvda'.

Parameter choices: The choice of hyperparameters for the NMF algorithm, such as the number of components, regularization terms, and maximum number of iterations, can also impact convergence. Experiment with different hyperparameter settings to see if the issue persists.

Numerical precision issues: Sometimes, the computations might result in very small or very large intermediate values, which can cause numerical instability and produce invalid values. This is more common when working with high-dimensional data or when the algorithm is close to convergence. You can try increasing the 'tol' parameter to allow for a more relaxed convergence criterion, which might help prevent such issues.

If the warning persists after addressing these potential causes, consider using other dimensionality reduction techniques, such as PCA or TruncatedSVD, which might be more stable for your specific dataset.
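
As a point of reference only, here is a minimal sketch of how the initialisation, tolerance and iteration parameters mentioned above can be adjusted in scikit-learn; the data matrix and parameter values are purely illustrative and are not DecoDen's actual settings:

import numpy as np
from sklearn.decomposition import NMF

# Illustrative non-negative data; in DecoDen this would be the binned signal matrix
X = np.abs(np.random.default_rng(0).normal(size=(100, 5)))

# init, tol and max_iter are the knobs discussed in the suggestion above
model = NMF(n_components=2, init="nndsvda", tol=1e-4, max_iter=500)
W = model.fit_transform(X)  # sample-by-component matrix
H = model.components_       # component-by-feature matrix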

Installation problems

While running poetry install I get

Installing dependencies from lock file
Warning: poetry.lock is not consistent with pyproject.toml. You may be getting improper dependencies. Run `poetry lock [--no-update]` to fix it.

Because decoden depends on pyarrow (^11.0.0) which doesn't match any versions, version solving failed.

When I run poetry lock [--no-update], I get

No arguments expected for "lock" command, got "[--no-update]"

I suspect that some dependencies can be downgraded.

Default argument values contain hardcoded paths

When running run_decoden without --blacklist_file specified, the script crashes with error:

FileNotFoundError: [Errno 2] No such file or directory: '../DecoDen_GV/data/annotations/hg19-blacklist.v2.bed'

This happens after a few minutes of calculations, indicating that the input arguments are not checked at the start of the code.
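
A minimal sketch of the kind of up-front check that would avoid this, assuming an argparse-style interface; the parser below is hypothetical, and only the --blacklist_file argument name is taken from the report above:

import argparse
import os
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--blacklist_file", default=None)  # no hardcoded default path
args = parser.parse_args()

# Fail fast, before any long-running computation starts
if args.blacklist_file is not None and not os.path.isfile(args.blacklist_file):
    sys.exit(f"Blacklist file not found: {args.blacklist_file}")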

Installation fails without some tweaks...

Hello,

I've successfully installed decoden, but following the instructions directly did not work; a couple of modifications were needed.

  1. Our cluster environment is very minimalist, so C compilers are not available by default. The necessary build tools can be installed with conda install c-compiler. I also needed to install git, so maybe simply stating in the installation instructions that git and a modern C compiler are required should be enough to address this - many systems will have these available by default.

  2. pysam requires htslib/samtools and zlib in order to compile successfully, so a conda install samtools zlib is also required

  3. MACS2 fails to install with Python 3.12. MACS2 2.2.9.1 has been released, which says it handles some Cython updates, but it still fails in the same way. I had to downgrade to Python 3.11 to get the installation to complete, which is not ideal if you are stuck needing a particular Python version. I don't know whether the MACS developers will care about keeping MACS2 installable, since MACS3 is now out.

James

Edits to README

  • update decoden commands and descriptions
  • update paper title
  • mention prerequisites of a C compiler and git - #23
  • add requirements of samtools and zlib - #23

Missing HSR_results.ftr file

I ran the following commands:

python run_preprocess.py -i ips_bmp4_samples.csv -o newtest -bs 50 -n 5
python run_decoden.py --data_folder newtest --output_folder newtest --files_reference newtest/experiment_conditions.json --blacklist_file hg38-blacklist.chr.v2.bed --conditions "IPS_BMP4_input" "IPS_BMP4"

The ips_bmp4_samples.csv is as follows:

filepath,exp_name,is_control
bamchr/IPS_BMP4_input_1.bam,IPS_BMP4_input,1
bamchr/IPS_BMP4_input_2.bam,IPS_BMP4_input,1
bamchr/IPS_BMP4_1.bam,IPS_BMP4,0
bamchr/IPS_BMP4_2.bam,IPS_BMP4,0
bamchr/IPS_BMP4_3.bam,IPS_BMP4,0

The scripts run to completion with no errors. However, the main results file, HSR_results.ftr, is missing. Here are all the files created by the scripts:

newtest/
├── config.json
├── data
│   ├── IPS_BMP4_1_filterdup_pileup_tiled.bed
│   ├── IPS_BMP4_2_filterdup_pileup_tiled.bed
│   ├── IPS_BMP4_3_filterdup_pileup_tiled.bed
│   ├── IPS_BMP4_input_1_filterdup_pileup_tiled.bed
│   └── IPS_BMP4_input_2_filterdup_pileup_tiled.bed
├── experiment_conditions.json
└── NMF
    ├── mixing_matrix.csv
    ├── mixing_matrix.pdf
    ├── signal_matrix.ftr
    └── signal_matrix_sample.pdf

There is only a signal_matrix.ftr file, containing values for IPS_BMP4 and IPS_BMP4_input binned in 50-bp bins. But I cannot find the NMF and HSR results.

The JSON file experiment_conditions.json looks fine:

{
 "data/IPS_BMP4_input_1_filterdup_pileup_tiled.bed": "IPS_BMP4_input",
 "data/IPS_BMP4_input_2_filterdup_pileup_tiled.bed": "IPS_BMP4_input",
 "data/IPS_BMP4_1_filterdup_pileup_tiled.bed": "IPS_BMP4",
 "data/IPS_BMP4_2_filterdup_pileup_tiled.bed": "IPS_BMP4",
 "data/IPS_BMP4_3_filterdup_pileup_tiled.bed": "IPS_BMP4"
}

The BED files also look correct, at least at first glance. And yet, no result file is present.

`deeptools` `countReadsPerBin` output is not sorted

The new preprocessing pipeline uses the deeptools countReadsPerBin class. This uses multiprocessing and is much faster than before.

However, the output is not sorted. This means that two runs of crpb.run() can give different results, making the rest of the DecoDen pipeline incorrect.
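
One possible remedy, sketched here under the assumption that the binned counts end up in a pandas DataFrame with seqnames/start/end columns (as in the tiled BED files shown elsewhere on this page), is to sort deterministically before the downstream steps:

import pandas as pd

# One row per bin; the file name and column names are illustrative
df = pd.read_csv("IPS_BMP4_1_filterdup_pileup_tiled.bed", sep="\t",
                 names=["seqnames", "start", "end", "count"])

# A stable sort by chromosome and start position makes repeated runs reproducible
df = df.sort_values(["seqnames", "start"], kind="mergesort").reset_index(drop=True)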

pd.read_csv warnings

When running run_decoden I receive mixed-type warnings:

/cluster/gjb_lab/mgierlinski/projects/decoden_test/decoden/utils.py:55: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(os.path.join(data_folder, fname), sep="\t", names=["seqnames", "start", "end", colname])

It should be easy to fix.
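
A minimal fix along the lines the warning itself suggests is to declare the dtype of the seqnames column explicitly; the snippet below mirrors the line quoted above, with illustrative values for the variables it depends on:

import os
import pandas as pd

# Illustrative values for the variables used in utils.py
data_folder, fname, colname = "newtest/data", "IPS_BMP4_1_filterdup_pileup_tiled.bed", "IPS_BMP4_1"

df = pd.read_csv(os.path.join(data_folder, fname), sep="\t",
                 names=["seqnames", "start", "end", colname],
                 dtype={"seqnames": str})  # chromosome names stay strings, silencing the mixed-type warning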

`config.json` is empty

Right now, config.json is empty. Do we want to include DecoDen details or skip writing it entirely?

Usage for run_decoden with no arguments

When running run_decoden.py with no arguments, one would expect a 'usage' message. Instead, there is an error referencing a mysterious hard-wired directory:

> python run_decoden.py

[DecoDen ASCII-art banner]
-----------------------------------------------------------
Narendra, T., Visonà, G., de Jesus Cardona, C., & Schweikert,
G. (2022). Multi-histone ChIP-Seq Analysis with DecoDen. bioRxiv.
-----------------------------------------------------------

Traceback (most recent call last):
  File "/cluster/gjb_lab/mgierlinski/projects/decoden_test/run_decoden.py", line 129, in <module>
    main(args)
  File "/cluster/gjb_lab/mgierlinski/projects/decoden_test/run_decoden.py", line 24, in main
    with open(args.files_reference, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../DecoDen_GV/data/shallow_e114_200bp_bedGraph_files/sample_files.json'

Misconfiguration in `run_decoden.py`

When the control samples are at the end of the configuration file, the wrong samples are picked up as controls.

Example -

{
    "h3k27me3_1/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k27me3_2/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k27me3_3/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k27me3_4/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k4me3_1/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k4me3_2/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k4me3_3/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k4me3_4/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k27me3_1/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k27me3_2/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k27me3_3/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k27me3_4/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_1/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_2/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_3/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_4/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control"
}

Here, the first 8 files (the H3K27me3 and H3K4me3 samples) are incorrectly picked up as the controls.

Quick workaround - edit the JSON file manually. But this needs to be fixed later.
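
A sketch of selecting controls by their label rather than by their position, assuming a files_reference dictionary like the one above and that control samples are tagged with the condition name "control":

import json

with open("experiment_conditions.json") as f:
    files_reference = json.load(f)

# Pick controls by their condition label, not by their order in the file
control_files = [path for path, cond in files_reference.items() if cond == "control"]
treatment_files = [path for path, cond in files_reference.items() if cond != "control"]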

Does not work with chromosome names 1, 2, 3, ...

When chromosomes are named 1, 2, 3, ... (rather than chr1, chr2, chr3, ...), the run_decoden script crashes with the error:

pyarrow.lib.ArrowInvalid: ("Could not convert '9' with type str: tried to convert to int64", 'Conversion failed for column seqnames with type object')

I confirmed this by changing my chromosome names into chr1, chr2, chr3, ..., upon which the script completed with no errors.
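
A possible workaround on the DecoDen side, sketched here on an illustrative table, would be to force the seqnames column to a string dtype before any feather/Arrow serialisation (the column name is taken from the error message; where exactly the cast belongs in the pipeline is an assumption):

import pandas as pd

# Illustrative per-bin table with mixed numeric/string chromosome names, mimicking the failing case
df = pd.DataFrame({"seqnames": ["1", 2, "X"],
                   "start": [0, 50, 0],
                   "end": [50, 100, 50],
                   "signal": [1.0, 2.5, 0.3]})

# Forcing a string dtype stops Arrow from trying (and failing) to coerce the mixed column to int64
df["seqnames"] = df["seqnames"].astype(str)
df.to_feather("signal_matrix.ftr")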

Write processed read in feather format

Right now, processed data is written in .npy format. Should we transition to the feather .ftr format? The HSR and NMF results are already written in feather. Also, feather has broader cross-platform support than the NumPy format.
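
A minimal sketch of the transition being discussed, assuming the processed reads are held in a 2-D NumPy array; the file names are illustrative:

import numpy as np
import pandas as pd

reads = np.load("processed_reads.npy")  # current .npy output (illustrative name)
df = pd.DataFrame(reads, columns=[f"c{i}" for i in range(reads.shape[1])])  # feather requires string column names
df.to_feather("processed_reads.ftr")    # same data, readable across platforms via Arrow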

Upgrade from MACS2 to MACS3

As mentioned in #23, there are installation issues with MACS2. It is probably best to move to MACS3.

The DecoDen pipeline uses MACS2 to calculate fragment length. At first glance, it looks like the predictd command is supported in MACS3, so hopefully everything works seamlessly.
