decoden's Introduction

decoden logo

Multi-assay ChIP-Seq Analysis with DecoDen

DOI:10.1101/2022.10.18.512665 GPLv3 license

DecoDen uses replicates and multi-histone ChIP-Seq experiments for a target cell type to learn and remove shared biases arising from fragmentation, PCR amplification, and sequence mappability.

Installation

While in development, DecoDen is distributed as a Poetry project. Installation requires a local copy of git and a C compiler. We recommend using Conda to create a suitable environment, for example conda create -n decoden "python>=3.10" (quoting the version constraint so the shell does not interpret it as a redirection). After activating the environment (conda activate decoden), follow these steps:

  1. Install Poetry
  2. Clone the repository and install with Poetry
# Clone the repository
git clone git@github.com:ntanmayee/DecoDen.git
cd DecoDen

# Install the external dependencies and DecoDen
conda install pyarrow poetry
conda install samtools zlib

# If there is no C compiler installed, also run the following command
conda install c-compiler

poetry install

Quick Start

Input data

Running decoden requires two inputs:

  1. Aligned reads in .bam format from ChIP-Seq experiments
  2. Sample annotation file in .csv format

Auto-generate a sample annotation file

To generate a skeleton sample annotation file, run -

decoden create_csv 

This will create samples.csv in your current directory. Edit this file and fill in the columns with appropriate information; more details are available in the wiki, and a minimal example is sketched below.
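
For reference, the annotation file has the columns filepath, exp_name and is_control (the same columns that appear in the example file quoted in the issues further down this page). A minimal filled-in file, with purely hypothetical file paths and experiment names, might look like this:

filepath,exp_name,is_control
data/control_rep1.bam,control,1
data/control_rep2.bam,control,1
data/h3k4me3_rep1.bam,h3k4me3,0
data/h3k4me3_rep2.bam,h3k4me3,0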

Run DecoDen

Run the DecoDen pipeline with default parameters -

decoden run -i samples.csv -o output_directory -gs genome-size

Detailed Usage Guidelines

The following commands are available in DecoDen; see the wiki for more details about each of them.

Command      Description
create_csv   Create a skeleton sample annotation file
run          Run the full DecoDen pipeline to preprocess and denoise BAM/BED files
preprocess   Preprocess BAM/BED data into the format required by DecoDen
denoise      Run the denoising step of DecoDen on suitably preprocessed data
detect       Detect peaks in the processed DecoDen signals

There is more helpful information in the wiki.

Bug Reports and Suggestions for Improvement

Please raise an issue if you find bugs or if you have any suggestions for improvement.

Funding

This project has received funding from the European Union's Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Skłodowska-Curie Grant Agreement No. 813533-MSCA-ITN-2018.

decoden's People

Contributors

dependabot[bot], gvisona, ntanmayee


Forkers

jamesabbott

decoden's Issues

Data Correlation Heatmap

It would be nice to have the correlation clustermap as a standard figure when the option to save figures is True.

Multiprocessing for preprocessing is not working

I tried to run 9 jobs concurrently, and it doesn't work.

python run_preprocess.py -i samples.csv -o decoden -bs 200 -n 9

There's an error about one of the intermediate files not being found.

sklearn warnings

When running run_decoden I receive a lot of warnings

/cluster/gjb_lab/mgierlinski/software/miniconda3/envs/decoden/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:874: RuntimeWarning: invalid value encountered in scalar divide
  if (previous_error - error) / error_at_init < tol:
/cluster/gjb_lab/mgierlinski/software/miniconda3/envs/decoden/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:1665: ConvergenceWarning: Maximum number of iterations 500 reached. Increase it to improve convergence.

They come from the sklearn library, so they might be difficult to trace, but it is something in the data passed to sklearn that causes the warning. I asked ChatGPT for help with this warning and here is what it suggested:

This warning indicates that an invalid value (e.g., NaN or infinity) was encountered during a scalar division operation in the Non-negative Matrix Factorization (NMF) module of the scikit-learn library. The specific line mentioned in the warning checks for the convergence of the NMF algorithm by comparing the relative change in the error with a user-specified tolerance level (tol).

There might be several reasons for encountering such invalid values during the computation:

Input data issues: The input data could have missing values (NaNs), very large values, or other problematic features that could cause the algorithm to produce invalid results during computations. Make sure to preprocess your data by removing or imputing missing values, scaling or normalizing the data, and removing outliers if necessary.

Initialization issues: NMF relies on the initialization of matrices for its iterative optimization process. If the initialization is poor or leads to numerical instability, the algorithm might not converge properly. By default, scikit-learn uses 'nndsvd' initialization, which is generally a good choice. However, you can try different initialization strategies by setting the 'init' parameter when creating the NMF instance, for example, 'random' or 'nndsvda'.

Parameter choices: The choice of hyperparameters for the NMF algorithm, such as the number of components, regularization terms, and maximum number of iterations, can also impact convergence. Experiment with different hyperparameter settings to see if the issue persists.

Numerical precision issues: Sometimes, the computations might result in very small or very large intermediate values, which can cause numerical instability and produce invalid values. This is more common when working with high-dimensional data or when the algorithm is close to convergence. You can try increasing the 'tol' parameter to allow for a more relaxed convergence criterion, which might help prevent such issues.

If the warning persists after addressing these potential causes, consider using other dimensionality reduction techniques, such as PCA or TruncatedSVD, which might be more stable for your specific dataset.
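
As a point of reference only, here is a minimal sketch of how the initialisation, tolerance and iteration parameters mentioned above can be adjusted in scikit-learn; the data matrix and parameter values are purely illustrative and are not DecoDen's actual settings:

import numpy as np
from sklearn.decomposition import NMF

# Illustrative non-negative data; in DecoDen this would be the binned signal matrix
X = np.abs(np.random.default_rng(0).normal(size=(100, 5)))

# init, tol and max_iter are the knobs discussed in the suggestion above
model = NMF(n_components=2, init="nndsvda", tol=1e-4, max_iter=500)
W = model.fit_transform(X)  # sample-by-component matrix
H = model.components_       # component-by-feature matrix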

Installation problems

While running poetry install I get

Installing dependencies from lock file
Warning: poetry.lock is not consistent with pyproject.toml. You may be getting improper dependencies. Run `poetry lock [--no-update]` to fix it.

Because decoden depends on pyarrow (^11.0.0) which doesn't match any versions, version solving failed.

When I run poetry lock [--no-update], I get

No arguments expected for "lock" command, got "[--no-update]"

I suspect that some dependencies can be downgraded.

Default argument values contain hardcoded paths

When running run_decoden without --blacklist_file specified, the script crashes with error:

FileNotFoundError: [Errno 2] No such file or directory: '../DecoDen_GV/data/annotations/hg19-blacklist.v2.bed'

This happens after a few minutes of calculations, indicating that the input arguments are not checked at the start of the code.
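
A minimal sketch of the kind of up-front check that would avoid this, assuming an argparse-style interface; the parser below is hypothetical, and only the --blacklist_file argument name is taken from the report above:

import argparse
import os
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--blacklist_file", default=None)  # no hardcoded default path
args = parser.parse_args()

# Fail fast, before any long-running computation starts
if args.blacklist_file is not None and not os.path.isfile(args.blacklist_file):
    sys.exit(f"Blacklist file not found: {args.blacklist_file}")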

Installation fails without some tweaks...

Hello,

I've successfully installed decoden, but following the instructions directly did not work; a couple of modifications were needed.

  1. Our cluster environment is very minimalist, so C compilers are not available by default. The necessary build tools can be installed with conda install c-compiler. I also needed to install git, so maybe simply stating in the installation instructions that git and a modern C compiler are required should be enough to address this - many systems will have these available by default.

  2. pysam requires htslib/samtools and zlib in order to compile successfully, so a conda install samtools zlib is also required

  3. MACS2 fails to install with Python 3.12. MACS2 2.2.9.1 has been released, which says it handles some Cython updates, but it still fails in the same way. I had to downgrade to Python 3.11 to get the installation to complete, which is not ideal if you are stuck needing a particular Python version. I don't know whether the MACS developers will care about keeping MACS2 installable, since MACS3 is now out.

James

Edits to README

  • update decoden commands and descriptions
  • update paper title
  • mention prerequisites of a C compiler and git - #23
  • add requirements of samtools and zlib - #23

Missing HSR_results.ftr file

I ran the following commands:

python run_preprocess.py -i ips_bmp4_samples.csv -o newtest -bs 50 -n 5
python run_decoden.py --data_folder newtest --output_folder newtest --files_reference newtest/experiment_conditions.json --blacklist_file hg38-blacklist.chr.v2.bed --conditions "IPS_BMP4_input" "IPS_BMP4"

The ips_bmp4_samples.csv is as follows:

filepath,exp_name,is_control
bamchr/IPS_BMP4_input_1.bam,IPS_BMP4_input,1
bamchr/IPS_BMP4_input_2.bam,IPS_BMP4_input,1
bamchr/IPS_BMP4_1.bam,IPS_BMP4,0
bamchr/IPS_BMP4_2.bam,IPS_BMP4,0
bamchr/IPS_BMP4_3.bam,IPS_BMP4,0

The scripts run to completion with no errors. However, the main results file, HSR_results.ftr, is missing. Here are all the files created by the scripts:

newtest/
├── config.json
├── data
│   ├── IPS_BMP4_1_filterdup_pileup_tiled.bed
│   ├── IPS_BMP4_2_filterdup_pileup_tiled.bed
│   ├── IPS_BMP4_3_filterdup_pileup_tiled.bed
│   ├── IPS_BMP4_input_1_filterdup_pileup_tiled.bed
│   └── IPS_BMP4_input_2_filterdup_pileup_tiled.bed
├── experiment_conditions.json
└── NMF
    ├── mixing_matrix.csv
    ├── mixing_matrix.pdf
    ├── signal_matrix.ftr
    └── signal_matrix_sample.pdf

There is only a signal_matrix.ftr file, containing values for IPS_BMP4 and IPS_BMP4_input binned in 50-bp bins. But I cannot find the NMF and HSR results.

The JSON file experiment_conditions.json looks fine:

{
 "data/IPS_BMP4_input_1_filterdup_pileup_tiled.bed": "IPS_BMP4_input",
 "data/IPS_BMP4_input_2_filterdup_pileup_tiled.bed": "IPS_BMP4_input",
 "data/IPS_BMP4_1_filterdup_pileup_tiled.bed": "IPS_BMP4",
 "data/IPS_BMP4_2_filterdup_pileup_tiled.bed": "IPS_BMP4",
 "data/IPS_BMP4_3_filterdup_pileup_tiled.bed": "IPS_BMP4"
}

The BED files also look correct, at least at first glance. And yet, no result file is present.

`deeptools` `countReadsPerBin` output is not sorted

The new preprocessing pipeline uses the deeptools countReadsPerBin class. This uses multiprocessing and is much faster than before.

However, the output is not sorted. This means that two runs of crpb.run() can give different results, making the rest of the DecoDen pipeline incorrect.
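
One possible remedy, sketched here under the assumption that the binned counts end up in a pandas DataFrame with seqnames/start/end columns (as in the tiled BED files shown elsewhere on this page), is to sort deterministically before the downstream steps:

import pandas as pd

# One row per bin; the file name and column names are illustrative
df = pd.read_csv("IPS_BMP4_1_filterdup_pileup_tiled.bed", sep="\t",
                 names=["seqnames", "start", "end", "count"])

# A stable sort by chromosome and start position makes repeated runs reproducible
df = df.sort_values(["seqnames", "start"], kind="mergesort").reset_index(drop=True)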

pd.read_csv warnings

When running run_decoden I receive mixed-type warnings:

/cluster/gjb_lab/mgierlinski/projects/decoden_test/decoden/utils.py:55: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(os.path.join(data_folder, fname), sep="\t", names=["seqnames", "start", "end", colname])

It should be easy to fix.
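
A minimal fix along the lines the warning itself suggests is to declare the dtype of the seqnames column explicitly; the snippet below mirrors the line quoted above, with illustrative values for the variables it depends on:

import os
import pandas as pd

# Illustrative values for the variables used in utils.py
data_folder, fname, colname = "newtest/data", "IPS_BMP4_1_filterdup_pileup_tiled.bed", "IPS_BMP4_1"

df = pd.read_csv(os.path.join(data_folder, fname), sep="\t",
                 names=["seqnames", "start", "end", colname],
                 dtype={"seqnames": str})  # chromosome names stay strings, silencing the mixed-type warning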

`config.json` is empty

Right now, config.json is empty. Do we want to include DecoDen details or skip writing it entirely?

Usage for run_decoden with no arguments

When running run_decoden.py with no arguments, one would expect a 'usage' message. Instead, there is an error referencing a mysterious hard-wired directory:

> python run_decoden.py

[DecoDen ASCII-art banner]
-----------------------------------------------------------
Narendra, T., Visonà, G., de Jesus Cardona, C., & Schweikert,
G. (2022). Multi-histone ChIP-Seq Analysis with DecoDen. bioRxiv.
-----------------------------------------------------------

Traceback (most recent call last):
  File "/cluster/gjb_lab/mgierlinski/projects/decoden_test/run_decoden.py", line 129, in <module>
    main(args)
  File "/cluster/gjb_lab/mgierlinski/projects/decoden_test/run_decoden.py", line 24, in main
    with open(args.files_reference, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../DecoDen_GV/data/shallow_e114_200bp_bedGraph_files/sample_files.json'

Misconfiguration in `run_decoden.py`

When the control samples are at the end of the configuration file, the wrong samples are picked up as controls.

Example -

{
    "h3k27me3_1/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k27me3_2/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k27me3_3/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k27me3_4/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k27me3",
    "h3k4me3_1/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k4me3_2/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k4me3_3/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k4me3_4/mTest.bed_filterdup_pileup.bdg_tiled.bed": "h3k4me3",
    "h3k27me3_1/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k27me3_2/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k27me3_3/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k27me3_4/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_1/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_2/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_3/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control",
    "h3k4me3_4/mTest_input.bed_filterdup_pileup.bdg_tiled.bed": "control"
}

Here, the first 8 files (the H3K27me3 and H3K4me3 samples) are incorrectly picked up as the controls.

Quick workaround - edit the JSON file manually. But this needs to be fixed later.
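
A sketch of selecting controls by their label rather than by their position, assuming a files_reference dictionary like the one above and that control samples are tagged with the condition name "control":

import json

with open("experiment_conditions.json") as f:
    files_reference = json.load(f)

# Pick controls by their condition label, not by their order in the file
control_files = [path for path, cond in files_reference.items() if cond == "control"]
treatment_files = [path for path, cond in files_reference.items() if cond != "control"]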

Does not work with chromosome names 1, 2, 3, ...

When chromosomes are named 1, 2, 3, ... (rather than chr1, chr2, chr3, ...), the run_decoden script crashes with the error:

pyarrow.lib.ArrowInvalid: ("Could not convert '9' with type str: tried to convert to int64", 'Conversion failed for column seqnames with type object')

I confirmed this by changing my chromosome names into chr1, chr2, chr3, ..., upon which the script completed with no errors.
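
A possible workaround on the DecoDen side, sketched here on an illustrative table, would be to force the seqnames column to a string dtype before any feather/Arrow serialisation (the column name is taken from the error message; where exactly the cast belongs in the pipeline is an assumption):

import pandas as pd

# Illustrative per-bin table with mixed numeric/string chromosome names, mimicking the failing case
df = pd.DataFrame({"seqnames": ["1", 2, "X"],
                   "start": [0, 50, 0],
                   "end": [50, 100, 50],
                   "signal": [1.0, 2.5, 0.3]})

# Forcing a string dtype stops Arrow from trying (and failing) to coerce the mixed column to int64
df["seqnames"] = df["seqnames"].astype(str)
df.to_feather("signal_matrix.ftr")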

Write processed read in feather format

Right now, processed data is written in .npy format. Should we transition to the feather .ftr format? The HSR and NMF results are already written in feather. Also, feather has broader cross-platform support than the NumPy format.
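
A minimal sketch of the transition being discussed, assuming the processed reads are held in a 2-D NumPy array; the file names are illustrative:

import numpy as np
import pandas as pd

reads = np.load("processed_reads.npy")  # current .npy output (illustrative name)
df = pd.DataFrame(reads, columns=[f"c{i}" for i in range(reads.shape[1])])  # feather requires string column names
df.to_feather("processed_reads.ftr")    # same data, readable across platforms via Arrow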

Upgrade from MACS2 to MACS3

As mentioned in #23, there are installation issues with MACS2. It is probably best to move to MACS3.

The DecoDen pipeline uses MACS2 to calculate fragment length. At first glance, it looks like the predictd command is supported in MACS3, so hopefully everything works seamlessly.
