ptrebert / sciddo Goto Github PK

View Code? Open in Web Editor NEW

1.0 2.0 1.0 1.03 MB

Home of the SCIDDO tool

License: GNU General Public License v3.0

Python 81.87% C 17.67% Shell 0.46%

bioinformatics tool epigenomics chromatin

sciddo's Introduction

SCIDDO: Score-based identification of differential chromatin domains

Publication

Manuscript: DOI: 10.1093/bioinformatics/btaa960

bioRxiv preprint: DOI: 10.1101/441766

Use cases

SCIDDO is a tool for the differential analysis of histone chromatin data. SCIDDO uses chromatin state segmentation maps, e.g., as generated by ChromHMM or EpiCSeg, for identifying regions of differential chromatin state between individual samples or groups of replicated samples.

The detected differential chromatin domains can be expected to overlap largely with regulatory regions or differentially expressed genes (see our manuscript preprint for detailed results). Moreover, the score-based approach implemented in SCIDDO affords a straightforward customization of scoring chromatin state differences to emphasize different aspects of chromatin dynamics.

Code maturity

SCIDDO is currently in BETA status

master branch:

dev branch:

Setup

SCIDDO supports only Linux environments (that is unlikely to change in the future) and is developed using Python3.6. Other Python3.x versions may or may not work, but are not officially supported.

For easy setup, it is highly recommended to install SCIDDO inside a dedicated Conda environment. A suitable environment is specified in environments/sciddo_env.yml.

Otherwise, install the HDF5 library (tested with version 1.8.18) as appropriate for your local environment, and the necessary Python dependencies from the requirements.txt file:

sudo apt-get install libhdf5
sudo pip install -r requirements.txt

Empirically, the setup of PyTables and HDF5 can create some headaches. In this case, the best advice is to use Conda.

After all dependencies have been installed successfully, run the SCIDDO setup as appropriate for your environment:

[sudo] python setup.py install

Execution

Input and output data formats

SCIDDO supports common text-based input and output data formats. Chromatin state segmentations as tabular (BED-like) files should be compatible as long as they have a fixed bin width of at least 100 bp. Output files from ChromHMM or EpiCSeg are supported out-of-the-box, and SCIDDO is designed to be used immediately downstream of these tools (e.g., SCIDDO knows that ChromHMM segmentation files have the suffix "_segments.bed" and will strip that from file names before determining possible sample labels). Auxiliary files such as chromatin state label or color mappings are supoprted in form of simple tab-separated "key-value" text files.

SCIDDO's internal data managements is realized with the popular pandas Python package, and data are stored in HDF5 files (*.h5) that are created with pandas. The main reason for using HDF5 files for storing data and metadata is efficiency, but all contents of a HDF5 file can be dumped to text. After the first step in a SCIDDO analysis of converting the input data to HDF5, all subsequent operations will be performed on this HDF5 file.

When dumping identified differential chromatin domains (DCDs) or raw candidate regions to text, the output adheres to the BED column layout (with header) chromosome, start, end, name, score, plus additional columns containing statistics and sample/group names. If downstream tools cannot work with non-standard BED-like text files, a simple cut -f 1,2,3,4,5 <SCIDDO_TABLE>.tsv > <SCIDDO_TABLE>.bed can be used to restrict the output to the first five, BED-compliant columns.

Getting help

sciddo.py --help or sciddo.py <SUBCOMMAND> --help is your friend.

For a step-by-step help on how to use SCIDDO, please refer to the tutorial hosted as part of this repositry.

Standard analysis run

A standard SCIDDO analysis run is split into several distinct steps that are realized by different code modules. Besides module specific parameters, there are several global parameters to adjust SCIDDO's runtime behavior. Importantly, these global parameters always have to be specified before the subcommand, i.e.,

sciddo.py [GLOBAL_PARAMETERS] <SUBCOMMAND> [MODULE_PARAMETERS]

The global parameters are:

--workers: number of CPUs to use (no sanity checks!)
--debug: print debug messages to stderr; otherwise, SCIDDO operates silently
--config-dump: folder to dump run configuration (JSON); defaults to current working directory
--no-dump: do not dump run configuration

Step 1: convert

Convert all input data (state segmentations plus metadata) into a binary HDF5 file. Currently, ChromHMM and EpiCSeg output files are supported out-of-the-box. This creates the SCIDDO DATA file.

sciddo.py [GLOBAL_PARAMETERS] convert --help

Step 2: stats

Compute a bunch of statistics (e.g., state composition per sample) that are potentially needed downstream.

sciddo.py [GLOBAL_PARAMETERS] stats --help

Step 3: score

Add scoring schemes (matrices) to the dataset. These can be derived automatically from the state segmentation model emissions (if provided during the convert step), or can be supplied in form of a user-defined file. Note that, in principle, an arbitrary number of scoring schemes can be added to a dataset.

sciddo.py [GLOBAL_PARAMETERS] score --help

Step 4: scan

Scan the dataset for differential chromatin domains. As opposed to the previous commands, this creates a separate output file per run, i.e., the SCIDDO RUN file.

sciddo.py [GLOBAL_PARAMETERS] scan --help

Step 5: dump

All data and metadata in the SCIDDO DATA and RUN file can be dumped to text files (e.g., TSV tables or BED files) for downstream analysis.

sciddo.py [GLOBAL_PARAMETERS] dump --help

sciddo's People

Contributors

Stargazers

Watchers

Forkers

havu73

sciddo's Issues

chromatin state dynamics error

Hi！

Thanks for such a great tool and it's very helpful for my project. I followed the tutorial and it runs well from step1 to step5 of part1. 725 candidate regions are found. But for the part2, I want to found the difference between repressed regions(states 1,2,7,8) and active TSS regions(states 4,5,6) using the following commond.
sciddo.py --debug dump --data-file testdata.h5 --support-file test_run_penem.h5 --output test_run_penem_split-DCD.enh-to-pcr.tsv --data-type dynamics --split-segments --from-states 1 2 7 8 --to-states 4 5 6 --add-inverse --limit-bed-output
It gives the error:

Here is the design matrix with no replicates.

It runs well with the SCIDDO testdata. Is this caused by the data features or other?
Another question is can the SCIDDO identify the regions which are in the same states for the two groups but with significant different signal strength? It seems that the 725 candidate regions are the switching between the states.

Many thanks!
Nan

No chromosome sizes read from file hg38_chrom_sizes2.txt

Dear,
I am trying to run sccido with the following line

sciddo.py --workers 5 convert --state-labels cmm_label.txt --state-colors cmm_color.txt --design-matrix design_matrix.txt --model-emissions /Model/model_10.txt --bin-size 200 --chrom-filter "(chr)?[0-22XY]+(\s|$$)" --seg-format ChromHMM --chrom-sizes hg38_chrom_sizes2.txt --state-seg /Model/X_d1_10_segments.bed /Model/X_d2_10_segments.bed /Model/Y_d1_10_segments.bed /Model/Y_d2_10_segments.bed /Model/Z_d1_10_segments.bed /Model/Z_d2_10_segments.bed --output /sciddo/sciddo-data_hg38_cmm10.h5

However, I am getting the following error:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scidlib-0.8.0-py3.6-linux-x86_64.egg/EGG-INFO/scripts/sciddo.py", line 156, in
exc = cli_args.execute(cli_args, logger)
File "/usr/local/lib/python3.6/dist-packages/scidlib-0.8.0-py3.6-linux-x86_64.egg/scidlib/cmd_convert.py", line 328, in _run_cmd_convert
chromosomes = read_chrom_sizes(args.chromsizes, args.chromfilter)
File "/usr/local/lib/python3.6/dist-packages/scidlib-0.8.0-py3.6-linux-x86_64.egg/scidlib/auxiliary.py", line 103, in read_chrom_sizes
assert chromosomes, 'No chromosome sizes read from file {}'.format(fpath)
AssertionError: No chromosome sizes read from file hg38_chrom_sizes2.txt
2020-05-03 01:09:57,020 - ERROR: Properly handled shutdown upon error:

No chromosome sizes read from file hg38_chrom_sizes2.txt

The hg38_chrom_sizes2.txt looks very normal (please see attached) .

hg38_chrom_sizes2.txt

What could be the problem?

Many thanks in advance,
Best wishes,
Anna

ciddo.py scan

Hello!

I am trying to use sciddo on my chromHMM bed files, when I reach the scan step, an error message appears that I have to identify --scoring/-sc matrix to be used, I really don't understand this parameter since my bioinformatics knowledge is very limited, can you please lead me in this? And thanks in advance..

Yussuf

Index out of bounds error in score function

After creating appropriate input files for sciddo.py convert, and running sciddo.py convert which successfully produces esc_brain.h5, I try to run sciddo.py score with esc_brain.h5 as an input. Using any combination of --treat-background and --add-scoring parameters, I run into an index out of bounds error (as shown in the attached screenshot). This same error occurs when using a custom scoring file instead of --add-scoring. I have tried this on both the master and developer branches. Do you have an idea of what is causing this issue. Any help would be appreciated.

Happy holidays,

Zane Koch

stats (counts) cannot be computed

Hello,

I have analyzed the HepG2 and K562 cell lines from ENCODE with ChromHMM. Now I would like to use SCIDDO to identify differential chromatin states.

My output folder from ChromHMM looks like a classic ChromHMM output folder:
ls HM_TF-SRSF1-PTBP1/
emissions_20.png HepG2_20_overlap.svg HepG2_20_RefSeqTSS_neighborhood.svg K562_20_overlap.svg K562_20_RefSeqTSS_neighborhood.svg transitions_20.txt
emissions_20.svg HepG2_20_overlap.txt HepG2_20_RefSeqTSS_neighborhood.txt K562_20_overlap.txt K562_20_RefSeqTSS_neighborhood.txt webpage_20.html
emissions_20.txt HepG2_20_RefSeqTES_neighborhood.png HepG2_20_segments.bed K562_20_RefSeqTES_neighborhood.png K562_20_segments.bed
HepG2_20_dense.bed.gz HepG2_20_RefSeqTES_neighborhood.svg K562_20_dense.bed.gz K562_20_RefSeqTES_neighborhood.svg model_20.txt
HepG2_20_expanded.bed.gz HepG2_20_RefSeqTES_neighborhood.txt K562_20_expanded.bed.gz K562_20_RefSeqTES_neighborhood.txt transitions_20.png
HepG2_20_overlap.png HepG2_20_RefSeqTSS_neighborhood.png K562_20_overlap.png K562_20_RefSeqTSS_neighborhood.png transitions_20.svg
As first step with sciddo I used the convert command:
sciddo.py --debug --workers 16 convert -col -ssg ../chromhmm/HM_TF-SRSF1-PTBP1/ -csz ../../ChromHMM/CHROMSIZES/hg38.txt -mde ../chromhmm/HM_TF-SRSF1-PTBP1/model_20.txt -o HM_TF-SRSF1-PTBP1
2020-08-13 16:49:52,167 - DEBUG: Configuration written to my_path/sciddo/20200813-164952-pmip_SCIDDO_convert_cfg.json
2020-08-13 16:49:52,169 - DEBUG: Reading state emission probabilities from file ../chromhmm/firstAll_noeCLIP/model_20.txt
2020-08-13 16:49:52,197 - DEBUG: Creating output folder
2020-08-13 16:49:52,198 - DEBUG: Start processing files - 48 jobs in total
2020-08-13 16:49:53,919 - DEBUG: Finished job 1 of 48: state/HepG2_20_segments/chrX
2020-08-13 16:49:53,977 - DEBUG: Finished job 2 of 48: state/HepG2_20_segments/chr13
2020-08-13 16:49:54,050 - DEBUG: Finished job 3 of 48: state/HepG2_20_segments/chr15
2020-08-13 16:49:54,358 - DEBUG: Finished job 4 of 48: state/HepG2_20_segments/chr14
2020-08-13 16:49:54,415 - DEBUG: Finished job 5 of 48: state/HepG2_20_segments/chr9
2020-08-13 16:49:54,459 - DEBUG: Finished job 6 of 48: state/HepG2_20_segments/chr11
2020-08-13 16:49:54,478 - DEBUG: Finished job 7 of 48: state/HepG2_20_segments/chr10
2020-08-13 16:49:54,496 - DEBUG: Finished job 8 of 48: state/HepG2_20_segments/chr12
2020-08-13 16:49:54,513 - DEBUG: Finished job 9 of 48: state/HepG2_20_segments/chr8
2020-08-13 16:49:54,683 - DEBUG: Finished job 10 of 48: state/HepG2_20_segments/chr4
2020-08-13 16:49:54,733 - DEBUG: Finished job 11 of 48: state/HepG2_20_segments/chr7
2020-08-13 16:49:54,747 - DEBUG: Finished job 12 of 48: state/HepG2_20_segments/chrY
2020-08-13 16:49:54,779 - DEBUG: Finished job 13 of 48: state/HepG2_20_segments/chr5
2020-08-13 16:49:54,801 - DEBUG: Finished job 14 of 48: state/HepG2_20_segments/chr18
2020-08-13 16:49:54,991 - DEBUG: Finished job 15 of 48: state/HepG2_20_segments/chr3
2020-08-13 16:49:55,006 - DEBUG: Finished job 16 of 48: state/HepG2_20_segments/chr21
2020-08-13 16:49:55,210 - DEBUG: Finished job 17 of 48: state/HepG2_20_segments/chr22
2020-08-13 16:49:55,319 - DEBUG: Finished job 18 of 48: state/HepG2_20_segments/chr6
2020-08-13 16:49:55,373 - DEBUG: Finished job 19 of 48: state/HepG2_20_segments/chr16
2020-08-13 16:49:55,513 - DEBUG: Finished job 20 of 48: state/HepG2_20_segments/chr19
2020-08-13 16:49:55,632 - DEBUG: Finished job 21 of 48: state/HepG2_20_segments/chr17
2020-08-13 16:49:55,854 - DEBUG: Finished job 22 of 48: state/HepG2_20_segments/chr20
2020-08-13 16:49:56,001 - DEBUG: Finished job 23 of 48: state/K562_20_segments/chrX
2020-08-13 16:49:56,454 - DEBUG: Finished job 24 of 48: state/K562_20_segments/chr4
2020-08-13 16:49:56,472 - DEBUG: Finished job 25 of 48: state/K562_20_segments/chr13
2020-08-13 16:49:56,739 - DEBUG: Finished job 26 of 48: state/K562_20_segments/chr9
2020-08-13 16:49:56,803 - DEBUG: Finished job 27 of 48: state/K562_20_segments/chr5
2020-08-13 16:49:56,819 - DEBUG: Finished job 28 of 48: state/K562_20_segments/chr14
2020-08-13 16:49:56,955 - DEBUG: Finished job 29 of 48: state/K562_20_segments/chr8
2020-08-13 16:49:56,982 - DEBUG: Finished job 30 of 48: state/K562_20_segments/chr3
2020-08-13 16:49:57,125 - DEBUG: Finished job 31 of 48: state/K562_20_segments/chr10
2020-08-13 16:49:57,142 - DEBUG: Finished job 32 of 48: state/K562_20_segments/chr11
2020-08-13 16:49:57,213 - DEBUG: Finished job 33 of 48: state/K562_20_segments/chrY
2020-08-13 16:49:57,409 - DEBUG: Finished job 34 of 48: state/K562_20_segments/chr12
2020-08-13 16:49:57,486 - DEBUG: Finished job 35 of 48: state/K562_20_segments/chr7
2020-08-13 16:49:57,591 - DEBUG: Finished job 36 of 48: state/K562_20_segments/chr6
2020-08-13 16:49:57,616 - DEBUG: Finished job 37 of 48: state/K562_20_segments/chr18
2020-08-13 16:49:57,734 - DEBUG: Finished job 38 of 48: state/K562_20_segments/chr15
2020-08-13 16:49:57,802 - DEBUG: Finished job 39 of 48: state/K562_20_segments/chr21
2020-08-13 16:49:57,829 - DEBUG: Finished job 40 of 48: state/K562_20_segments/chr16
2020-08-13 16:49:57,837 - DEBUG: Finished job 41 of 48: state/K562_20_segments/chr20
2020-08-13 16:49:57,863 - DEBUG: Finished job 42 of 48: state/K562_20_segments/chr22
2020-08-13 16:49:58,294 - DEBUG: Finished job 43 of 48: state/K562_20_segments/chr17
2020-08-13 16:49:58,342 - DEBUG: Finished job 44 of 48: state/K562_20_segments/chr19
2020-08-13 16:50:01,017 - DEBUG: Finished job 45 of 48: state/HepG2_20_segments/chr1
2020-08-13 16:50:02,796 - DEBUG: Finished job 46 of 48: state/K562_20_segments/chr2
2020-08-13 16:50:03,257 - DEBUG: Finished job 47 of 48: state/HepG2_20_segments/chr2
2020-08-13 16:50:06,222 - DEBUG: Finished job 48 of 48: state/K562_20_segments/chr1
2020-08-13 16:50:06,315 - DEBUG: Saving metadata to output file
2020-08-13 16:50:06,335 - DEBUG: Estimated background state: 4 (100.0%)
2020-08-13 16:50:06,400 - DEBUG: All metadata saved
2020-08-13 16:50:06,486 - DEBUG: Output file closed - exiting...
2020-08-13 16:50:06,498 - DEBUG: Configuration written to my_path/sciddo/20200813-164952-pmip_SCIDDO_convert_cfg.json
Now for the stats I run into the following problem:
sciddo.py --debug --workers 16 stats -d HM_TF-SRSF1-PTBP1 -c -a
2020-08-13 16:50:30,460 - DEBUG: Configuration written to my_path/sciddo/20200813-165030-vtzh_SCIDDO_stats_cfg.json
2020-08-13 16:50:30,461 - DEBUG: Computing counting statistics...
2020-08-13 16:50:30,461 - DEBUG: Count statistics: state composition per sample
2020-08-13 16:50:30,614 - DEBUG: Created parameter list of length 48 to process
2020-08-13 16:50:31,775 - DEBUG: All counts collected, building data frames...
2020-08-13 16:50:31,812 - DEBUG: Stored counts /statistic/counts/composition/HepG2_20_segments
2020-08-13 16:50:31,840 - DEBUG: Stored counts /statistic/counts/composition/K562_20_segments
2020-08-13 16:50:31,857 - DEBUG: Count statistics: state composition per sample - done
2020-08-13 16:50:31,858 - DEBUG: Count statistics: state transitions between sample pairs
2020-08-13 16:50:31,865 - WARNING: No design matrix in dataset - treating all samples as singletons, resulting in an all-vs-all comparison.
2020-08-13 16:50:31,968 - DEBUG: Identified 24 chromosomes in dataset
2020-08-13 16:50:31,968 - DEBUG: Identified 1128 sample pairs for comparison: singletons
2020-08-13 16:50:32,009 - DEBUG: Created parameter list of size 27072 to process
2020-08-13 16:55:55,331 - DEBUG: All sample pairs processed
2020-08-13 16:55:55,332 - DEBUG: Total positions processed: 17417828136
2020-08-13 16:55:55,333 - DEBUG: Thereof, identical positions: 12522689256 (71.9%)
2020-08-13 16:55:55,333 - DEBUG: Thereof, variable positions: 4895138880 (28.1%)
2020-08-13 16:55:55,391 - DEBUG: Stored transition counts for comparison / chromosome: /statistic/counts/transitions/singletons/chr13
Traceback (most recent call last):
File "/nfs/home/students/ga89koc/.conda/envs/sciddo/lib/python3.6/site-packages/scidlib-0.8.0-py3.6-linux-x86_64.egg/EGG-INFO/scripts/sciddo.py", line 156, in
exc = cli_args.execute(cli_args, logger)
File "/nfs/home/students/ga89koc/.conda/envs/sciddo/lib/python3.6/site-packages/scidlib-0.8.0-py3.6-linux-x86_64.egg/scidlib/cmd_stats.py", line 389, in _run_cmd_stats
compute_pair_state_transitions(args, logger)
File "/nfs/home/students/ga89koc/.conda/envs/sciddo/lib/python3.6/site-packages/scidlib-0.8.0-py3.6-linux-x86_64.egg/scidlib/cmd_stats.py", line 269, in compute_pair_state_transitions
assert count_total == total_pos, 'Missing counts, expected {} but stored {}'.format(total_pos, count_total)
AssertionError: Missing counts, expected 17417828136 but stored 13122860840
2020-08-13 16:55:55,441 - ERROR: Properly handled shutdown upon error:

Missing counts, expected 17417828136 but stored 13122860840
2020-08-13 16:55:55,441 - DEBUG: Run configuration stored at: my_path/sciddo/20200813-165030-vtzh_SCIDDO_stats_cfg.json

Do you have any idea how I can handle that? I am not sure about the error message.
I am happy to supply more information if needed. :)

Thank you in advance.
Best,
Quirin