Code Monkey home page Code Monkey logo

abc-enhancer-gene-prediction's Introduction

CircleCI CircleCI

📝 Note: This is a revamp of the ABC codebase presented in [1]. If you wish to access that version of the ABC repo, please check out https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/tree/master.

We have documented the codebase and usage in Read The Docs

Activity by Contact Model of Enhancer-Gene Specificity

The Activity-by-Contact (ABC) model predicts which enhancers regulate which genes on a cell type specific basis. This repository contains the code needed to run the ABC model as well as small sample data files, example commands, and some general tips and suggestions.

If you use the ABC model in published research, please cite:

[1] Fulco CP, Nasser J, Jones TR, Munson G, Bergman DT, Subramanian V, Grossman SR, Anyoha R, Doughty BR, Patwardhan TA, Nguyen TH, Kane M, Perez EM, Durand NC, Lareau CA, Stamenova EK, Aiden EL, Lander ES & Engreitz JM. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019). https://www.nature.com/articles/s41588-019-0538-0

Contact

Please submit a github issue with any questions or if you experience any issues/bugs.

abc-enhancer-gene-prediction's People

Contributors

argschwind avatar atancoder avatar charlesfulco avatar engreitz avatar jnasser3 avatar kmualim avatar mayasheth avatar rosaxma avatar thouis avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

abc-enhancer-gene-prediction's Issues

5Mb window size

why is the default window to 5MB. As Hi-C loops are not very reliable beyond 1-2MB (and for CRISPRi-FlowFISH experiment you set only 0.45MB). When testing ABC methodon the different dataset with available ATAC-seq peaks, I realized that we get more associations (ABC>0.02) when boundary is 1MB (not 5MB) and those associations are even more distal on average.

3 questions about code usage (white list and bioreplicates)

Dear all,

Thanks for the nice method and associated code.

This is not really an issue but some general questions about the usage of the code.

  1. can the white list file present here
    example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.chr22.bed
    be used for any cell line, or is it specifically made for K562?

  2. is it necessary or just recommended to use a white list file in the whole process? what are the advantages and drawbacks?

  3. for H3K27ac data, do you recommend to use several bioreplicates whenever possible?

Best,

candidateRegions.bed file is empty after use the example code

I am running the example now. I run the following program:
But the file wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.candidateRegions.bed is empty, not sure what I should do

python src/makeCandidateRegions.py
--narrowPeak example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted
--bam example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam
--outDir example_chr22/ABC_output/Peaks/
--chrom_sizes example_chr22/reference/chr22
--regions_blocklist reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed
--regions_includelist example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.chr22.bed
--peakExtendFromSummit 250
--nStrongestPeaks 3000

predict.py trouble reading bampe

I'm currently trying to run predict.py. I have HiC in BAMPE format in one file. It seems the code is trying to look for files per chromosome rather than a single file?

Command:
python /home/cjr78/packages/ABC-Enhancer-Gene-Prediction-no_promoter_overlaps/src/predict.py \ --enhancers /rds/user/cjr78/hpc-work/HiC/ABC/Neighborhoods/FlFl/EnhancerList.txt \ --genes /rds/user/cjr78/hpc-work/HiC/ABC/Neighborhoods/FlFl/GeneList.txt \ --HiCdir /rds/user/cjr78/hpc-work/HiC/BAMPE/raw/FlFl/ \ --hic_type bedpe \ --hic_resolution 40000 \ --scale_hic_using_powerlaw \ --threshold .02 \ --cellType FlFl \ --outdir /rds/user/cjr78/hpc-work/HiC/ABC/Predict/FlFl/ \ --make_all_putative

Error:
reading genes reading enhancers Making predictions for chromosome: chr7 Making putative predictions table... Begin HiC Loading HiC Traceback (most recent call last): File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-no_promoter_overlaps/src/predict.py", line 148, in <module> main() File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-no_promoter_overlaps/src/predict.py", line 107, in main this_chr = make_predictions(chromosome, this_enh, this_genes, args) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-no_promoter_overlaps/src/predictor.py", line 17, in make_predictions pred = add_hic_to_enh_gene_table(enhancers, genes, pred, hic_file, hic_norm_file, hic_is_vc, chromosome, args) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-no_promoter_overlaps/src/predictor.py", line 66, in add_hic_to_enh_gene_table gamma = args.hic_gamma) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-no_promoter_overlaps/src/hic.py", line 55, in load_hic HiC = pd.read_csv(hic_file, sep="\t", names = ['chr1','x1','x2','chr2','y1','y2','name','hic_contact']) File "/home/cjr78/.conda/envs/seq_analysis/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f return _read(filepath_or_buffer, kwds) File "/home/cjr78/.conda/envs/seq_analysis/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read parser = TextFileReader(fp_or_buf, **kwds) File "/home/cjr78/.conda/envs/seq_analysis/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__ self._make_engine(self.engine) File "/home/cjr78/.conda/envs/seq_analysis/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine self._engine = CParserWrapper(self.f, **self.options) File "/home/cjr78/.conda/envs/seq_analysis/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__ self._reader = parsers.TextReader(src, **kwds) File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__ File "pandas/_libs/parsers.pyx", line 613, in pandas._libs.parsers.TextReader._setup_parser_source File "/home/cjr78/.conda/envs/seq_analysis/lib/python3.6/gzip.py", line 163, in __init__ fileobj = self.myfileobj = builtins.open(filename, mode or 'rb') FileNotFoundError: [Errno 2] No such file or directory: '/rds/user/cjr78/hpc-work/HiC/BAMPE/raw/FlFl/chr7/chr7.bedpe.gz'

bamcompare input corrected chipseq file

Hi there, I am using bamCompare to create input_corrected chip file. Is that what you would suggest to use as input corrected bam file? bamCompare -b1 chip_sample -b2 input_control_bam -of "bigwig" -o chip_input_corrected.bigwig

Question on use of H3K27ac HiChIP data

Hi,
Thanks so much for making this code available and usable!

In your NG manuscript, you look at some different options for using H3K27ac HiChIP, as well as including quantitative DHS signal. It appears that ABC_sqrt(DHS x H3K27ac HiChIP) and ABC_(DHS x H3K27ac HiChIP) performed best.

I have ATAC, H3K27ac ChIP and HiChIP in a lymphoma cell line and primary cells, and used your ABC model, substituting HiChIP in place of HiC. Would you recommend the above mentioned approaches, rather than simply substituting HiC with HiChIP as I did? Do you have plans to implement the above varieties of your ABC model in the published code?

thanks a lot. If you think it's worth using the HiChIP varieties of your model, I may look into this further and happy to share that code.

Best,
Ashley

How to get .KRobserved files from .hic file type?

Hi,

I'm having issues whenever I try to use a .hic file type to compute powerlaw fit. From the github, I thought I could feed in just a .hic file, but compute_powerlaw_fit_from_hic.py seems to expect .KRobserved files too? I've attached some example input and errors below.

First, I tried downloading public Hi-C data and fitting a powerlaw distrubtion, e.g.:

cd HiC_from_HCT116-RAD21/
wget https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/4a1b58d3-9ae6-43e4-91cf-49f1fcbbab33/4DNFIYWONU7A.hic
cd ..

python ABC-Enhancer-Gene-Prediction/src/compute_powerlaw_fit_from_hic.py \
--hicDir HiC_from_HCT116-RAD21/ \
--outDir HiC_from_HCT116-RAD21/powerlaw/ \
--maxWindow 1000000 \
--minWindow 5000 \
--resolution 5000

But it failed with the following error:
[Errno 2] No such file or directory: 'HiC_from_HCT116-RAD21/chr1/chr1.KRobserved.gz'

I get the same error when I try to download the file suggested on the github, i.e.:

python ABC-Enhancer-Gene-Prediction/src/juicebox_dump.py \
--hic_file https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined_30.hic \
--juicebox "java -jar juicer_tools.jar" \
--outdir testHic/ \
--chromosomes 1

gives this error:

Starting chr1 ... 
java -jar juicer_tools.jar dump observed KR https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined_30.hic 1 1 BP 5000 testHic//chr1//chr1.KRobserved
Running command: gzip testHic//chr1//chr1.KRobserved
gzip: testHic//chr1//chr1.KRobserved: No such file or directory
subprocess.CalledProcessError: Command 'gzip testHic//chr1//chr1.KRobserved' returned non-zero exit status 1.

Am I missing something? Do we need to generate the *.KRobserved files ourselves?

Any help would be great, many thanks in advance!

Replicates

Hi the authors,

I found that in the example the candidate regions were called from one replicate but the activity was calculated from multiple replicates. Now we have multiple biological replicates and technical replicates, I was wondering how to fit those data into the ABC model?

Thanks!

Refseq gene list

Dear All,

I want to use the Ensemble gene list instead of refseq with hg38 assembly, is there a way to create a similar file as yours with the TSS being the largest region that contains all possible isoforms of a certain gene? and thanks in advance.

Best Regards,

Yussuf

Value error problem in the predict.py step of the pipeline

Hi,

First of all, thank you for developing the software, the repository and writing the documentation, and, also thank you for the support. We are trying to run the AbC enhancer prediction in the cluster of the Sanger Institute. We have installed the different dependencies and we are able to run the pipeline up to the last step, the predict step. I have checked the issues reported and I have none that matches the error we have:

$ python /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/src/predict.py --enhancers /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Neighborhoods/EnhancerList.txt --genes /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Neighborhoods/GeneList.txt --chrom_sizes /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/example_chr22/reference/chr22 --HiCdir /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/example_chr22/input_data/HiC/raw/ --hic_resolution 5000 --scale_hic_using_powerlaw --threshold .02 --cellType K562 --outdir /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Predictions/ --make_all_putative


reading genes
reading enhancers
Making predictions for chromosome: chr22
Making putative predictions table...
Using: /nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/example_chr22/input_data/HiC/raw/chr22/chr22.KRobserved.gz
Begin HiC
Loading HiC
hic.to.sparse: Elapsed time: 4.214726209640503
HiC Matrix has row sums of 2462.539801718806, making doubly stochastic...
HiC has 1479373 rows after windowing between 0 and 5000000
process.hic: Elapsed time: 3.5757029056549072
HiC added to predictions table. Elapsed time: 0.315493106842041
HiC Complete
Completed chromosome: chr22. Elapsed time: 9.790779829025269

Writing output files...
Traceback (most recent call last):
  File "/nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/src/predict.py", line 154, in <module>
    main()
  File "/nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/src/predict.py", line 129, in main
    make_gene_prediction_stats(all_putative, args)
  File "/nfs/team151/software/AbC_scores/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 197, in make_gene_prediction_stats
    summ1 = summ1.merge(summ2, left_index=True, right_index=True)
  File "/software/hgi/installs/anaconda3/envs/hgi_base/lib/python3.7/site-packages/pandas/core/frame.py", line 7959, in merge
    validate=validate,
  File "/software/hgi/installs/anaconda3/envs/hgi_base/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 89, in merge
    return op.get_result()
  File "/software/hgi/installs/anaconda3/envs/hgi_base/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 668, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/software/hgi/installs/anaconda3/envs/hgi_base/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 872, in _get_join_info
    right_ax, how=self.how, return_indexers=True, sort=self.sort
  File "/software/hgi/installs/anaconda3/envs/hgi_base/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3446, in join
    return self._join_multi(other, how=how, return_indexers=return_indexers)
  File "/software/hgi/installs/anaconda3/envs/hgi_base/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3548, in _join_multi
    raise ValueError("cannot join with no overlapping index names")
ValueError: cannot join with no overlapping index names

Can you help us? Best regards,

without bam file

hello,
Thanks for developing ABC model which is a great algorithm predicting enhancer target genes.
I just have ChIP-seq and ATAC-seq bigwig files. For calling peak, I can convert these bigwig files to bedgraph or bed files. but for the remaining steps, such as Call candidate regions and Quantifying Enhancer Activity, does ABC model take as input these bigwig/bed/bedgraph files?
best,

Error when using my own HiC data

Thank you so much for this useful pipeline. I've gotten it to work with the example data and provided average HiC data. However, I am now trying to use my own cell type-specific HiC data (starting with .hic file) and am getting an error when computing the powerlaw. The error is below when i try just with chr22 as a test (using v0.2.2):

Traceback (most recent call last):
  File "../src/compute_powerlaw_fit_from_hic.py", line 70, in load_hic_for_powerlaw
    interpolate_nan=False)
  File "../src/hic.py", line 52, in load_hic
    apply_diagonal_bin_correction = apply_diagonal_bin_correction)
  File "../src/hic.py", line 80, in process_hic
    assert(np.max(sums[sums > 0])/np.min(sums[sums > 0]) < 1.001)
AssertionError
Traceback (most recent call last):
  File "../src/compute_powerlaw_fit_from_hic.py", line 119, in <module>
    main()
  File "../src/compute_powerlaw_fit_from_hic.py", line 38, in main
    HiC = load_hic_for_powerlaw(args)
  File "../src/compute_powerlaw_fit_from_hic.py", line 100, in load_hic_for_powerlaw
    all_data = pd.concat(all_data_list)
  File "../lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 255, in concat
    sort=sort,
  File "../lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 304, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

hic file in aws is not available

Hi there,
I think the hic file in this link is not accessible:

python src/juicebox_dump.py \
--hic_file https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined_30.hic \
--juicebox "java -jar juicer_tools.jar" \
--outdir example_chr22/input_data/HiC/raw/ \
--chromosomes 22

When I run the jucer_tools on that hic file:

It gave error:

WARN [2021-06-29T16:29:06,216]  [Globals.java:138] [main]  Development mode is enabled
Could not read hic file: hicfiles.s3.amazonaws.com

Also when I try to open https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined_30.hic in browser, it gave 'Access Denied' error:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>S2SRN13HPFNYTSDY</RequestId>
<HostId>NR+zp1rPguPf+C7odVLVBbGV6/B0u2jW4vkdsO+ao/vLAlRjdx1JoOMZ8zWxzcE+ZhSfpm/Etto=</HostId>
</Error>

typo in field 'hic_kr' in src/makeAverageHiC.py

makeAverageHiC.py performs several operations on the 'hic_kr' column of the hic file (i.e. https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/blob/master/src/makeAverageHiC.py#L75). However, the hic.py file does not define such a column, referring instead to 'hic_contact' (i.e. https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/blob/master/src/hic.py#L116). Should reference to 'hic_kr' be changed to 'hic_contact'? I am not seeing a 'hic_kr' column in my downloaded hic files, there is just 'bin1','bin2','hic_contact'.

Fast count method failed to count: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Hi there!

I am working on a pipeline wrapper for this repository at https://github.com/neekonsu/alphabet in order to facilitate the use of this analytical tool. I am currently running into some errors trying to run the prescribed commands from the readme, and I would be hugely appreciative of any help sorting them out. The following is the printout when running the pipeline on the example data herein. I have omitted the MACS2 output since it seems to be functioning properly. If you would like to try this out for yourselves, I have dockerized my testing environment, and it is located at https://hub.docker.com/repository/docker/neekonsu/alphabet/general if you would like to try out the pipeline.

In order to replicate my results, open the container interactively and type ./alphabet/run.sh into the terminal.
Just type "default" or click enter for each of the 12 prompts until MACS2 begins, and good luck!

Thanks,

Neekon Saadat

chr22   51188356        51188653        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10529       14      .       1.54202 1.44043 0.28931 92
chr22   51189078        51189375        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10530       13      .       1.50750 1.36602 0.28931 42
chr22   51194793        51195090        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10531       15      .       1.90576 1.59073 0.37010 267
chr22   51195166        51196145        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10532       86      .       5.71728 8.62799 6.67055 437
chr22   51198355        51198930        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10533       46      .       3.81152 4.69934 2.94552 203
chr22   51201722        51202026        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10534       19      .       2.07541 1.97506 0.61642 151
chr22   51202508        51202805        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10535       11      .       1.41263 1.18152 0.28931 148
chr22   51204400        51204697        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10536       20      .       1.74140 2.00030 0.63284 148
chr22   51209309        51209606        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10537       11      .       1.41263 1.18152 0.28931 5
chr22   51209909        51210206        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10538       11      .       1.41263 1.18152 0.28931 148
chr22   51210440        51211000        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10539       20      .       2.11894 2.05549 0.66806 279
chr22   51211395        51211692        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10540       11      .       1.41263 1.18152 0.28931 148
chr22   51213441        51214600        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10541       51      .       4.15081 5.12941 3.34652 376
chr22   51221767        51222521        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10542       216     .       9.50236 21.64822        19.31868        413
chr22   51225284        51225825        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10543       15      .       1.85673 1.53513 0.34175 205
chr22   51230584        51234650        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10544       24      .       1.83638 2.42698 0.94401 3231
chr22   51239353        51239680        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10545       96      .       6.46412 9.69035 7.69163 164

CondaValueError: prefix already exists: /root/anaconda/envs/environment

# conda environments:
#
base                  *  /root/anaconda
environment              /root/anaconda/envs/environment

Running: awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' ./example_chr22/reference/chr22 <(samtools view -H ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | grep SQ | cut -f 2 | cut -c 4- )  > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Running: bedtools sort -faidx ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -i ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted | bedtools coverage -g ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -counts -sorted -a stdin -b ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}'  | bedtools sort -faidx ./example_chr22/reference/chr22 -i stdin > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed; rm ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Fast count method failed to count: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Error: The requested file (./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted) could not be opened. Error message: (No such file or directory). Exiting!

Trying bamtobed method ...

Running: bedtools bamtobed -i ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | cut -f 1-3 | bedtools intersect -wa -a stdin -b ./example_chr22/reference/chr22.bed | bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 | bedtools coverage -g ./example_chr22/reference/chr22 -counts -sorted -a ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted -b stdin | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}' > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed
No columns to parse from file
b'Error: Unable to open file ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted. Exiting.\n'
BEDTools failed to count file: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Error: Unable to open file ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted. Exiting.

Running: bedtools sort -i ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed -faidx ./example_chr22/reference/chr22 | bedtools merge -i stdin -c 4 -o max | sort -nr -k 4 | head -n 3000 |bedtools intersect -b stdin -a ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted -wa |awk '{print $1 "\t" $2 + $10 "\t" $2 + $10}' |bedtools slop -i stdin -b 250 -g ./example_chr22/reference/chr22 |bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 |bedtools merge -i stdin | bedtools intersect -v -wa -a stdin -b ./src/../reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed | cut -f 1-3 | (bedtools intersect -a ./example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.chr22.bed -b ./example_chr22/reference/chr22.bed -wa | cut -f 1-3 && cat) |bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 | bedtools merge -i stdin > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.candidateRegions.bed
. . . Moving on to next stage, please wait . . . 
Namespace(ATAC='', DHS='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.chr22.bam,dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.chr22.bam', H3K27ac='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/ENCFF384ZZM.chr22.bam', candidate_enhancer_regions='./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.candidateRegions.bed', cellType='K562', chrom_sizes='./example_chr22/reference/chr22', default_accessibility_feature=None, enhancer_class_override=None, expression_table='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt', gene_name_annotations='symbol', genes='./example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.chr22.bed', genes_for_class_assignment=None, outdir='./example_chr22/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='./src/../reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False)
Using gene expression from files: ['dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'] 

[Errno 2] File dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt does not exist: 'dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'
Traceback (most recent call last):
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 39, in load_genes
    expr = pd.read_table(expression_table, names=[primary_id, name + '.Expression'])
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 673, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt does not exist: 'dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'
Failed on dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt
Running command: bedtools sort -faidx ./example_chr22/reference/chr22 -i ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed > ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted; mv ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed
Loading coverage from pre-calculated file for Genes.H3K27ac.ENCFF384ZZM.chr22.bam
Usage: samtools idxstats [options] <in.bam>
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -@, --threads INT
               Number of additional threads to use [0]
Traceback (most recent call last):
  File "./src/run.neighborhoods.py", line 97, in <module>
    main(args)
  File "./src/run.neighborhoods.py", line 93, in main
    processCellType(args)
  File "./src/run.neighborhoods.py", line 74, in processCellType
    outdir = args.outdir)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 89, in annotate_genes_with_features
    genes = count_features_for_bed(genes, bounds_bed, genome_sizes, features, outdir, "Genes", force=force, use_fast_count=use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 415, in count_features_for_bed
    df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 438, in count_single_feature_for_bed
    total_counts = count_total(feature_bam)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 528, in count_total
    total_counts = count_bam_mapped(infile)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 503, in count_bam_mapped
    data = check_output(command, shell=True)
  File "/root/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/root/anaconda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'samtools idxstats dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/ENCFF384ZZM.chr22.bam' returned non-zero exit status 1.
exit status 1
. . . Moving on to next stage, please wait . . . 
reading genes
reading enhancers
Making predictions for chromosome: chr22
Making putative predictions table...
Using: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/HiC/raw/chr22/chr22.VCobserved.gz
Begin HiC
Loading HiC
Traceback (most recent call last):
  File "./src/predict.py", line 148, in <module>
    main()
  File "./src/predict.py", line 107, in main
    this_chr = make_predictions(chromosome, this_enh, this_genes, args)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 17, in make_predictions
    pred = add_hic_to_enh_gene_table(enhancers, genes, pred, hic_file, hic_norm_file, hic_is_vc, chromosome, args)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 66, in add_hic_to_enh_gene_table
    gamma = args.hic_gamma)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/hic.py", line 42, in load_hic
    HiC_sparse_mat = hic_to_sparse(hic_file, hic_norm_file, hic_resolution)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/hic.py", line 150, in hic_to_sparse
    header=None, engine='c', memory_map=True)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 613, in pandas._libs.parsers.TextReader._setup_parser_source
  File "/root/anaconda/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
NotADirectoryError: [Errno 20] Not a directory: './example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/HiC/raw/chr22/chr22.VCobserved.gz'
exit status 1

ChIP-seq input correction and questions

Hi,
Thank you very much for your work. Looks very interesting!
So I was trying to use my own data when I have realized that in our experiment, we have an input ChIP-seq file to correct the H3K27ac one. In the function run.neighborhoods.py, for the parameter -H3K27ac, I guess the file is supposed to be already input-corrected? Or there is any other way of providing the input file and the software will perform correction?

Thanks!

run.neighborhoods.py 'Killed' and no EnhancerList.txt produced

Hello,
I am getting a 'Killed' message after 'Assigning classes to enhancers' when running 'run.neighborhoods.py'. I successfully get the GeneList.txt output but do not get the EnhancerList.txt output file. Before the 'Killed' message, everything appears to be running correctly.

Below is what happens why I execute 'run.neighborhoods.py'.

Thank you!

Screen Shot 2020-07-23 at 11 17 21 AM

Screen Shot 2020-07-23 at 11 17 51 AM

'neighborhoods.py' (line 448) ~> pandas dataframe size mismatch on assertion in function 'count_single_feature_for_bed'

Hello all.

I just finished debugging issue #25 , and with the success of that first leg of the pipeline, a new data processing error has emerged.

Everything before this error is successful, so I believe the reason for this new error is that the old errors I fixed enabled data transformations to occur which were originally stopped by the blocking in stage 1.

Please ask if you would like the whole printout, I only provide the most relevant printout below.

The following is the printout from run.neighborhoods.py, and the specific error comes from the assertion that the dataframe maintains its original shape on line 448; this is all being run on the example data embedded in this repository.

Thank you very much for your help, and I look forward to eliminating these bugs.

run.neighborhoods.py ~>

Running command: bedtools sort -faidx /alphabet/ABC-Enhancer-Gene-Prediction/example_chr22/reference/chr22 -i /alphabet/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed > /alphabet/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted; mv /alphabet/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted /alphabet/ABC-Enhancer-Gene-Prediction/example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed
    Loading coverage from pre-calculated file for Genes.H3K27ac.ENCFF384ZZM.chr22.bam
    Feature H3K27ac completed in 0.025640010833740234
    Loading coverage from pre-calculated file for Genes.DHS.wgEncodeUwDnaseK562AlnRep1.chr22.bam
    Loading coverage from pre-calculated file for Genes.DHS.wgEncodeUwDnaseK562AlnRep2.chr22.bam
    Feature DHS completed in 0.0519099235534668
    Loading coverage from pre-calculated file for Genes.TSS1kb.H3K27ac.ENCFF384ZZM.chr22.bam
    Feature H3K27ac completed in 0.024033546447753906
    Loading coverage from pre-calculated file for Genes.TSS1kb.DHS.wgEncodeUwDnaseK562AlnRep1.chr22.bam
    Loading coverage from pre-calculated file for Genes.TSS1kb.DHS.wgEncodeUwDnaseK562AlnRep2.chr22.bam
    Feature DHS completed in 0.049674034118652344
    Loading coverage from pre-calculated file for Enhancers.H3K27ac.ENCFF384ZZM.chr22.bam
    Traceback (most recent call last):
    File "/alphabet/ABC-Enhancer-Gene-Prediction/src/run.neighborhoods.py", line 97, in <module>
        main(args)
    File "/alphabet/ABC-Enhancer-Gene-Prediction/src/run.neighborhoods.py", line 93, in main
        processCellType(args)
    File "/alphabet/ABC-Enhancer-Gene-Prediction/src/run.neighborhoods.py", line 88, in processCellType
        outdir = args.outdir)
    File "/alphabet/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 197, in load_enhancers
        enhancers = count_features_for_bed(enhancers, candidate_peaks, genome_sizes, features, outdir, "Enhancers", skip_rpkm_quantile, force, use_fast_count)
    File "/alphabet/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 415, in count_features_for_bed
        df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count)
    File "/alphabet/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 448, in count_single_feature_for_bed
        assert df.shape[0] == orig_shape, "Dimension mismatch"
    AssertionError: Dimension mismatch
    exit status 1

neighborhoods.py ~>

def count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count):
    orig_shape = df.shape[0]
    feature_name = feature + "." + os.path.basename(feature_bam)
    feature_outfile = os.path.join(directory, "{}.{}.CountReads.bedgraph".format(filebase, feature_name))

    if force or (not os.path.exists(feature_outfile)) or (os.path.getsize(feature_outfile) == 0):
        print("Regenerating", feature_outfile)
        print("Counting coverage for {}".format(filebase + "." + feature_name))
        run_count_reads(feature_bam, feature_outfile, bed_file, genome_sizes, use_fast_count)
    else:
        print("Loading coverage from pre-calculated file for {}".format(filebase + "." + feature_name))

    domain_counts = read_bed(feature_outfile)
    score_column = domain_counts.columns[-1]

    total_counts = count_total(feature_bam)

    domain_counts = domain_counts[['chr', 'start', 'end', score_column]]
    featurecount = feature_name + ".readCount"
    domain_counts.rename(columns={score_column: featurecount}, inplace=True)
    domain_counts['chr'] = domain_counts['chr'].astype('str')

    df = df.merge(domain_counts.drop_duplicates())
    #df = smart_merge(df, domain_counts.drop_duplicates())

    assert df.shape[0] == orig_shape, "Dimension mismatch" ### Line 448 ###

    df[feature_name + ".RPM"] = 1e6 * df[featurecount] / float(total_counts)

    if not skip_rpkm_quantile:
        df[featurecount + ".quantile"] = df[featurecount].rank() / float(len(df))
        df[feature_name + ".RPM.quantile"] = df[feature_name + ".RPM"].rank() / float(len(df))
        df[feature_name + ".RPKM"] = 1e3 * df[feature_name + ".RPM"] / (df.end - df.start).astype(float)
        df[feature_name + ".RPKM.quantile"] = df[feature_name + ".RPKM"].rank() / float(len(df))

    return df[~ df.duplicated()]

def average_features(df, feature, feature_bam_list, skip_rpkm_quantile):
    feature_RPM_cols = [feature + "." + os.path.basename(feature_bam) + '.RPM' for feature_bam in feature_bam_list]

    df[feature + '.RPM'] = df[feature_RPM_cols].mean(axis = 1)
    
    if not skip_rpkm_quantile:
        feature_RPKM_cols = [feature + "." + os.path.basename(feature_bam) + '.RPKM' for feature_bam in feature_bam_list]
        df[feature + '.RPM.quantile'] = df[feature + '.RPM'].rank() / float(len(df))
        df[feature + '.RPKM'] = df[feature_RPKM_cols].mean(axis = 1)
        df[feature + '.RPKM.quantile'] = df[feature + '.RPKM'].rank() / float(len(df))

    return df

what's the preferred setting when mean associated enhancers is lower than 2 ?

Hi there,

I ran your code on my own Hi-C data for 3 cell lines (Neural progenitors cell lines, Derived Neurons and Astrocyte glials) using the methodology you describe, but I find a average number of associated distal enhancers per genes lower than 2 (about 1.80 for my 3 cell lines).
You mentioned in one of the last section of your github, some adjustments can be done, quantile normalization or modify the ABC-Score threshold.
I used default settings of you code, KR normalization on 10 Kb resolution, however when I VC normalize my data I obtain a average number of distal enhancer in the required range of values.
I would like to know what is the preferred setting to adjust? Do quantile normalization, decrease the threshold, use a particular normalization ?

Thanks,

Loic.

Downloading Hi-C Data for Chromosome 9

Hi, I'm having trouble downloading Hi-C data for chromosome 9 specifically using juicebox. When I use juicebox, it creates 2 files (chr9.KRobserved.gz. and chr9.KRNorm.gz) respectively, but both files are completely empty. Is there an alternate way I can download Hi-C data for specifically chromosome 9/has anyone else encountered this issue? I'm able to download Hi-C data using juicebox for all other chromosomes

Can't compute powerlaw fit on VC normalized HiC files

Hello,

I have been trying to use my own HiC data for the ABC model, and due to the original .hic files containing different resolutions, I am using ".VCnorm" and ."VCobserved" files I generated using juicebox_dump.py. I am now trying to computer the powerlaw using computer_powerlaw_fit_from_hic.py, but get the following error:

Using: /sc/arion/projects/YangLab_NGS/data/hic_for_ABC/AD_VC/chr22/chr22.VCobserved.gz
Working on /sc/arion/projects/YangLab_NGS/data/hic_for_ABC/AD_VC/chr22/chr22.VCobserved.gz
Loading HiC
hic.to.sparse: Elapsed time: 0.23026204109191895

Traceback (most recent call last):
File "compute_powerlaw_fit_from_hic.py", line 70, in load_hic_for_powerlaw
interpolate_nan=False)
File "/sc/arion/projects/YangLab_NGS/ABC-Enhancer-Gene-Prediction/src/hic.py", line 52, in load_hic
apply_diagonal_bin_correction = apply_diagonal_bin_correction)
File "/sc/arion/projects/YangLab_NGS/ABC-Enhancer-Gene-Prediction/src/hic.py", line 80, in process_hic
assert(np.max(sums[sums > 0])/np.min(sums[sums > 0]) < 1.001)
AssertionError

Is there something wrong with my VC normalized files? Really appreciate any help/advice, as I have been stuck on this problem for days. Thank you.

ValueError in compute_powerlaw_fit_from_hic.py with BEDPE file

Hi,
I encountered an error while running 'compute_powerlaw_fit_from_hic.py' for BEDPE Hi-C contact file:

Invalid file path or buffer object type: <class 'tuple'>
Traceback (most recent call last):
  File "/home/fgprgn/tools/ABC-Enhancer-Gene-Prediction/src/compute_powerlaw_fit_from_hic.original.py", line 76, in load_hic_for_powerlaw
    this_data = load_hic(hic_file = hic_file,
  File "/yshare2/ZETTAI_path_WA_slash_home_KARA/home/fgprgn/tools/ABC-Enhancer-Gene-Prediction/src/hic.py", line 55, in load_hic
    HiC = pd.read_csv(hic_file, sep="\t", names = ['chr1','x1','x2','chr2','y1','y2','name','hic_contact'])
  File "/home/fgprgn/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/fgprgn/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 434, in _read
    fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
  File "/home/fgprgn/anaconda3/lib/python3.8/site-packages/pandas/io/common.py", line 243, in get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'tuple'>

I just solved this by modifying line 74 as below.
hic_file, hic_norm_file, hic_is_vc = get_hic_file(chrom, args.hicDir, hic_type='bedpe')

Without H3K27ac from the same samples

Hi there,
If I only have bulk ATAC-seq data for samples but without H3K27ac from the same samples? Can I use ATAC-seq data + H3K27ac (for example, just download one data from GEO), then call the enhancers in each sample and predict the ABC interaction? But I still could obtain the individual specific enhancers? due to the overlapped regions between ATAC-seq peaks and the only one H3K27ac peaks are different? Is there any potential issue?
Thank you.

Best,
Xin

Is it possible to include HiC data for GRCh38?

Hi dear developers of the ABC model,
I'm using ABC model for data of build GRCh38.
Up to now, I was only able to run predict.py without using HiC data and only with power law estimate.
For a quantitative comparison with other results, I would like to also run predict.py with HiC data. However, juicebox_dump.py seems to be unable to process GRCh38 HiC. Is there a way to workaround for using GRCh38?
Do you have an average HiC for GRCh38?
Thank you

where to download mm10 curated TSS

I am using mm10 to analysis my data, but I am not sure where to find mm10 RefSeqCurated.TSS500bp.bed
or how I can get it? Thank you very much.

Syntax Error while running MakeCandidateRegions.py

Hello,

As per the documentation, I have been running the following command. However, I keep getting a "syntax" error. Would appreciate any advice.

python src/makeCandidateRegions.py \

--narrowPeak D15_rep1_atac.macs2_peaks.narrowPeak.sorted
--bam D15_rep1_atac.bam
--outDir /Output/
--chrom_sizes chr.txt
--regions_blacklist wgEncodeHg19ConsensusSignalArtifactRegions.bed
--regions_whitelist RefSeqGeneBounds.txt
--peakExtendFromSummit 250
--nStrongestPeaks 150000
Traceback (most recent call last):
File "src/makeCandidateRegions.py", line 5, in
from peaks import *
File "ABC-Enhancer-Gene-Prediction-0.2.2/src/peaks.py", line 5, in
from tools import *
File "src/tools.py", line 7, in
import pyranges as pr
File "lib/python2.7/site-packages/pyranges/init.py", line 16, in
import pyranges as pr
File "lib/python2.7/site-packages/pyranges/pyranges.py", line 3272
def print(self, n=8, merge_position=False, sort=False, formatting=None, chain=False):
^
SyntaxError: invalid syntax

run.neighborhoods.py Dimension Mismatch

Hi, I am getting an AssertionError: dimension mismatch when running run.neighborhoods.py. The error produced is included below. Any insight as to what is causing the error is appreciated!

thanks,
Annika

Screen Shot 2020-09-04 at 9 21 36 AM

How are precision and recall computed for the example data?

Hi, in step 3 of the "Running the ABC Model" section in the README.md file, it says evaluation shows approximately 70% recall and 60% precision.

Is it possible to compute these values from the files here? The only way I obtained similar values is as shown below, but I calculated the values incorrectly (for example, the fn line would normally also require that x['nDistalEnhancersPredicted'] == 0 and 1 should probably be x['nDistalEnhancersPredicted'])

Could you please point me in the right direction?

Thanks!

df = pd.read_csv('GenePredictionStats.txt', sep='\t')

fp = (df.apply(lambda x: x['nDistalEnhancersPredicted'] if not x['geneIsExpressed'] else 0, axis=1)).sum()
tp = (df.apply(lambda x: x['nDistalEnhancersPredicted'] if x['geneIsExpressed'] else 0, axis=1)).sum()
fn = (df.apply(lambda x: 1 if x['geneIsExpressed'] else 0, axis=1)).sum()
tn = (df.apply(lambda x: 1 if x['nDistalEnhancersPredicted'] == 0 and not x['geneIsExpressed'] else 0, axis=1)).sum()

print([fp, tp, fn, tn])
print('Precision: {:.2}'.format(tp/(tp+fp)))
print('Recall: {:.2}'.format(tp/(tp+fn)))

Output:

[444.0, 737.0, 310, 72]
Precision: 0.62
Recall: 0.7

Add mouse genome support in juicebox_dump.py and compute_powerlaw_fit_from_hic.py

Hi @jnasser3 ,

I noticed juicebox_dump.py does not support mouse genome if set --chromsomes all.
It would be nice to add an option like,

parser.add_argument('--genome', choices=['hg', 'mm'], help="The species genome. Currently support human (hg) or mouse (mm).")

And in the main code,

if args.chromosomes == "all":
        if args.genome == "hg":
            chromosomes = list(range(1,23)) + ['X']
        elif args.genome == "mm":
            chromosomes = list(range(1,20)) + ['X']
        else:
            sys.exit("Error: When render on all chromosomes, please specify the species genome, hg or mm.") # 'import sys' if use this error check. 

Similarly, in "compute_powerlaw_fit_from_hic.py", in addition to the genome option,

if args.chr == 'all':
     if args.genome == "hg":
        chromosomes = ['chr' + str(x) for x in  list(range(1,23))] + ['chrX']
     elif args.genome == "mm":
        chromosomes = ['chr' + str(x) for x in  list(range(1,20))] + ['chrX']
    else:
            sys.exit("Error: When run on all chromosomes, please specify the species genome, hg or mm.") 

Thank,
YZ

Error in run.neighborhoods.py

Hi team,

I came across error when trying to run run.neighborhoods.py. Essentially it can't find the the chom.sizes file because it adds a .bed onto the end.

My code is:
python /home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/run.neighborhoods.py \ --candidate_enhancer_regions /rds/user/cjr78/hpc-work/HiC/ABC/CandidateRegions/FlFl/merged_1_2_3_peaks.narrowPeak.candidateRegions.bed \ --genes /rds/user/cjr78/hpc-work/scripts/mm10_refseq_collapsed_genes.bed \ --cellType FlFl \ --outdir /rds/user/cjr78/hpc-work/HiC/ABC/Neighborhoods/FlFl/ \ --H3K27ac /rds/user/cjr78/hpc-work/ChIP/H3K27ac/bam/merged_bams/1_2_3_merged.bam \ --ATAC rds/user/cjr78/hpc-work/ATAC/bowtie/merged/merged_1_2_3.bam \ --expression_table /home/cjr78/rds/hpc-work/HiC/ABC/Input/flfl_tpm.txt \ --chrom_sizes /rds/user/cjr78/hpc-work/scripts/mm10.chrom.sizes

The error is:
Namespace(ATAC='rds/user/cjr78/hpc-work/ATAC/bowtie/merged/merged_1_2_3.bam', DHS='', H3K27ac='/rds/user/cjr78/hpc-work/ChIP/H3K27ac/bam/merged_bams/1_2_3_merged.bam', candidate_enhancer_regions='/rds/user/cjr78/hpc-work/HiC/ABC/CandidateRegions/FlFl/merged_1_2_3_peaks.narrowPeak.candidateRegions.bed', cellType='FlFl', chrom_sizes='/rds/user/cjr78/hpc-work/scripts/mm10.chrom.sizes', default_accessibility_feature=None, enhancer_class_override=None, expression_table='/home/cjr78/rds/hpc-work/HiC/ABC/Input/flfl_tpm.txt', gene_name_annotations='symbol', genes='/rds/user/cjr78/hpc-work/scripts/mm10_refseq_collapsed_genes.bed', genes_for_class_assignment=None, outdir='/rds/user/cjr78/hpc-work/HiC/ABC/Neighborhoods/FlFl/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes=None, use_secondary_counting_method=False) Using gene expression from files: ['/home/cjr78/rds/hpc-work/HiC/ABC/Input/flfl_tpm.txt']

Traceback (most recent call last): File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/run.neighborhoods.py", line 97, in <module> main(args) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/run.neighborhoods.py", line 93, in main processCellType(args) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/run.neighborhoods.py", line 74, in processCellType outdir = args.outdir) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 85, in annotate_genes_with_features tss1kb = make_tss_region_file(genes, outdir, genome_sizes) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 111, in make_tss_region_file sizes_pr = df_to_pyranges(read_bed(sizes + '.bed')) File "/home/cjr78/packages/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 479, in read_bed skip = 1 if ("track" in open(filename, "r").readline()) else 0 FileNotFoundError: [Errno 2] No such file or directory: '/rds/user/cjr78/hpc-work/scripts/mm10.chrom.sizes.bed'

Bizarrely, if I rename the chrome.sizes file so that it ends with .bed, it can't find it because it's looking for the correct file name!

juicebox_dump.py

Hi,
I want to use my own HiC data, I used juicer to convert the 7 columns of validpairs into a .hic file, but the error is reported as follows, what is the reason?

python ~/software/ABC-Enhancer-Gene-Prediction/src/juicebox_dump.py \
> --hic_file ~/01.data/11.HiC/01.PAM/16075_format_AllValidPairs.hic \
> --juicebox "java -jar ~/software/juicer_tools_1.22.01.jar" \
> --outdir ./raw/ \
>  --chromosomes 1
Starting chr1 ...
java -jar ~/software/juicer_tools_1.22.01.jar dump observed KR ~/01.data/11.HiC/01.PAM/16075_format_AllValidPairs.hic 1 1 BP 5000 ./raw//chr1//chr1.KRobserved
Running command: gzip ./raw//chr1//chr1.KRobserved
gzip: ./raw//chr1//chr1.KRobserved: No such file or directory
Traceback (most recent call last):
  File "~/software/ABC-Enhancer-Gene-Prediction/src/juicebox_dump.py", line 54, in <module>
    main(args)
  File "~/software/ABC-Enhancer-Gene-Prediction/src/juicebox_dump.py", line 35, in main
    run_command("gzip {0}/chr{1}.KRobserved".format(outdir, chromosome))
  File "~/software/ABC-Enhancer-Gene-Prediction/src/tools.py", line 15, in run_command
    return check_call(command, shell=True, **args)
  File "~/software/miniconda3/envs/final-abc-env/lib/python3.7/subprocess.py", line 328, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gzip ./raw//chr1//chr1.KRobserved' returned non-zero exit status 1.

Thanks!

More questions about code usage

Hi there,

Many thanks for creating the tool - excellent work! I have a few questions about using it with my data.

I have ATAC-seq and H3K27ac ChIP-seq from three individuals (biological replicates).

I see that the tool allows replicates to be included, which is great. My question is - in this case, what is the best way to define the initial set of candidates?

I have three ATAC-seq files, and there might be small differences in the precise locations of peaks among individuals.

Is it acceptable to just use one of these ATAC-seq BAMs to define the candidates, and then put all of them into the tool as replicates? Or could I perhaps merge my three ATAC-seq BAM files and call MACS2 on this merged file to get an average location of the summits? Any help on this would be grealy appreciately.

David

Index Error during Step 2

Hello, I'd appreciate advice troubleshooting this index error. Also, is NPC a valid input for cell type? If not, what are the acceptable inputs?
Lastly, is there a better way to contact you regarding issues with ABC rather than Github? If so, please let me know.

python3.6 src/run.neighborhoods.py --candidate_enhancer_regions /apps/a/output/D15_rep1_atac.macs2_peaks.narrowPeak.sorted.candidateRegions.bed --genes /apps/a/RefSeqUniqueGenesFinal5nodupes.bed
--H3K27ac /apps/a/d15_k27_1.2777_4.R1.fq.gz.sorted.bam
--DHS /apps/a/D15_rep1_atac.bam
--chrom_sizes /apps/a/chr
--cellType NPC
--outdir /apps/a/output-neighborhoods28/

Script output

Running: bedtools bamtobed -i /apps/a/d15_k27_1.2777_4.R1.fq.gz.sorted.bam | cut -f 1-3 | bedtools intersect -wa -a stdin -b /apps/a/chr.bed | bedtools sort -i stdin -faidx /apps/a/chr | bedtools coverage -g /apps/a/chr -counts -sorted -a /apps/a/output-neighborhoods28/GeneList.bed -b stdin | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}' > /apps/a/output-neighborhoods28/Genes.H3K27ac.d15_k27_1.2777_4.R1.fq.gz.sorted.bam.CountReads.bedgraph
No columns to parse from file
b'***** ERROR: illegal character '\r' found in integer conversion of string "248956422\r". Exiting...\n'
BEDTools failed to count file: /apps/a/d15_k27_1.2777_4.R1.fq.gz.sorted.bam

". Exiting...teger conversion of string "248956422

Traceback (most recent call last):
File "src/run.neighborhoods.py", line 97, in
main(args)
File "src/run.neighborhoods.py", line 93, in main
processCellType(args)
File "src/run.neighborhoods.py", line 74, in processCellType
outdir = args.outdir)
File "/apps/a/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 89, in annotate_genes_with_features
genes = count_features_for_bed(genes, bounds_bed, genome_sizes, features, outdir, "Genes", force=force, use_fast_count=use_fast_count)
File "/apps/a/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 415, in count_features_for_bed
df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count)
File "/apps/a/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 435, in count_single_feature_for_bed
domain_counts = read_bed(feature_outfile)
File "/apps/a/ABC-Enhancer-Gene-Prediction-0.2.2/src/neighborhoods.py", line 483, in read_bed
assert result.columns[0] == "chr"
File "/home/a/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4097, in getitem
return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

TSS annotation question

Hello @jnasser3 ,

I understand ABC model can be sensitive to TSS annotation.
I am wondering what you would recommend on how to annotate TSS.

Based on your paper, one TSS is selected for each gene if the "TSS is used by the largest number of coding isoforms." My interpretation on this sentence is that when there are a number of coding isoforms (saying 5) for a gene, the TSS shared by the largest number of them (for example 3) would be selected to represent that gene. If I am right, how do you handle a tier (like, two coding isoforms with different TSSs)?

It could also be getting TPM at transcript level and taking the TSS from transcript with the highest TPM for each gene. In this way, the TSS annotation would be experiment dependent (cell type, conditions, etc.)

I personally favor the latter one, but haven't explored yet. Have you any advice?
Thank you,
YZ

Creating acceptable Hi-C input

Hello,

I would like to ask if it is possible to use contact scores generated by CHICAGO as "contact frequency" in Hi-C bedpe format. I am trying to use capture Hi-C data set to create ABC model but after reading the publications and the github page, It is not clear to me how to get the required input files. I am not sure if it is ok to create genome wide binned matrix (e.g. by Juicer) with capture Hi-C data.

Could you advise me on how could I generate acceptable Hi-C input from capture Hi-C data please?

Many thanks in advance,
Matus

error in run.neighborhoods.py

Hi, when I run run.neighborhoods.py, I encounter errors shown as below. Can you help me findout where is the problem? Thank you very much.
"Assigning classes to enhancers
Traceback (most recent call last):
File "/home/yzhang/Software/ABC-Enhancer-Gene-Prediction-master/src/run.neighborhoods.py", line 97, in
main(args)
File "/home/yzhang/Software/ABC-Enhancer-Gene-Prediction-master/src/run.neighborhoods.py", line 93, in main
processCellType(args)
File "/home/yzhang/Software/ABC-Enhancer-Gene-Prediction-master/src/run.neighborhoods.py", line 88, in processCellType
outdir = args.outdir)
File "/home/yzhang/Software/ABC-Enhancer-Gene-Prediction-master/src/neighborhoods.py", line 206, in load_enhancers
enhancers = assign_enhancer_classes(enhancers, genes, tss_slop = tss_slop_for_class_assignment)
File "/home/yzhang/Software/ABC-Enhancer-Gene-Prediction-master/src/neighborhoods.py", line 248, in assign_enhancer_classes
genes, promoters = get_class_pyranges(enh)
File "/home/yzhang/Software/ABC-Enhancer-Gene-Prediction-master/src/neighborhoods.py", line 235, in get_class_pyranges
promoter_enh = promoter_enh.df[['symbol','uid']].groupby('uid',as_index=False).aggregate(lambda x: ','.join(list(set(x))))
File "/home/yzhang/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2908, in getitem
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/yzhang/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/home/yzhang/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['symbol', 'uid'], dtype='object')] are in the [columns]"
"

My command is
"python ~/Software/ABC-Enhancer-Gene-Prediction-master/src/run.neighborhoods.py --ATAC ATAC.rmdup.bam --candidate_enhancer_regions ABC_output/Peaks/enhancer.narrowPeak.candidateRegions.bed --genes mm9.bed --H3K27ac WT-H3K27ac.bam --expression_table WT.TPM.txt --chrom_sizes mm9.chrom.sizes --outdir ABC_output/Neighborhoods/"

Add Gz conversion step in juicebox_dump.py

Hi,

I launched your code on my own data (neurons) and It seems in the juicebox_dump.py step it misses a gz conversion for hic normalized files. Your python scripts for ABC score computation or powerlaw fitting wait a gz extension file and juicebox_dumb.py does not do it automatically. I did myself manually but you should probably add it in the code.

Loïc.

Hi-C data for mouse?

Thank you for your tool.
I would like to use your tool with mouse H3K27ac and ATAC-seq data (mm10); however it seems that your average Hi-C data is only for human. I've tried to run your tool without the Hi-C data, but it doesn't work. Is it possible to skip the Hi-C step, or do you have Hi-C data for mouse (mm10)?
Cheers

KeyError(key) in run.neighborhoods.py

So, all of the data I'm using is public. I'm trying to create customized ABC regions for lung data. Here is the error I encounter, along with the script I used to generate it:


cd ~/ABC-Enhancer-Gene-Prediction/
cd lung
mkdir DNase
cd DNase
wget https://www.encodeproject.org/files/ENCFF023JFG/@@download/ENCFF023JFG.bam
#sort and index
samtools sort -l 1 -m 7G -o IMR90_DNASE_ENCFF023JFG.sorted.bam -O bam -@ 32 ENCFF023JFG.bam
samtools index -@ 32 IMR90_DNASE_ENCFF023JFG.sorted.bam
cd ..
mkdir acetyl
cd acetyl
wget https://www.encodeproject.org/files/ENCFF146UYU/@@download/ENCFF146UYU.bam
wget https://www.encodeproject.org/files/ENCFF071VOI/@@download/ENCFF071VOI.bam
samtools merge IMR90_H3K27ME3_reps_1_2.bam *.bam
cd ..
cd ..

mkdir -p ./lung/ABC_output/Peaks/
cd reference
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip *.*
mv ENCFF356LFX.bed hg38ENCODEblacklist.bed
./liftOver RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed hg19ToHg38.over.chain RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg19tohg38unmapped.bed
macs2 callpeak \
-t lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
-n IMR90_DNASE_ENCFF023JFG.macs2 \
-f BAM \
-g hs \
-p .1 \
--call-summits \
--outdir lung/ABC_output/Peaks/
python3 src/makeCandidateRegions.py \
--narrowPeak lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak \
--bam lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
--outDir lung/ABC_output/Peaks/ \
--chrom_sizes reference/hg38.chrom.sizes \
--regions_blacklist reference/hg38ENCODEblacklist.bed \
--regions_whitelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed \
--peakExtendFromSummit 250 \
--nStrongestPeaks 150000 

./reference/liftOver  ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed ./reference/hg19ToHg38.over.chain ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg19tohg38unmapped.bed
mkdir -p ./lung/ABC_output/Neighborhoods/
mkdir -p ./lung/rnaseq
cd ./lung/rnaseq
wget https://www.encodeproject.org/files/ENCFF833OTW/@@download/ENCFF833OTW.tsv
wget https://www.encodeproject.org/documents/a4a6caad-61ab-4d35-820a-409cadce1121/@@download/attachment/RSEM_quantifications_specifications.txt
awk '$1 = $1 FS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed
cd ../..
python3 src/run.neighborhoods.py \
--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed \
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed \
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam \
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
--chrom_sizes reference/hg38.chrom.sizes \
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
--cellType IMR90 \
--outdir lung/ABC_output/Neighborhoods/

Error:

~/ABC-Enhancer-Gene-Prediction$ python3 src/run.neighborhoods.py \
> --candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed \
> --genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed \
> --H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam \
> --DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
> --chrom_sizes reference/hg38.chrom.sizes \
> --ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
> --cellType IMR90 \
> --outdir lung/ABC_output/Neighborhoods/
Namespace(ATAC='', DHS='lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam', H3K27ac='lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam', candidate_enhancer_regions='lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed', cellType='IMR90', chrom_sizes='reference/hg38.chrom.sizes', default_accessibility_feature=None, enhancer_class_override=None, expression_table='', gene_name_annotations='symbol', genes='reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed', genes_for_class_assignment=None, outdir='lung/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'start'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/run.neighborhoods.py", line 97, in <module>
    main(args)
  File "src/run.neighborhoods.py", line 93, in main
    processCellType(args)
  File "src/run.neighborhoods.py", line 74, in processCellType
    outdir = args.outdir)
  File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 85, in annotate_genes_with_features
    tss1kb = make_tss_region_file(genes, outdir, genome_sizes)
  File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 111, in make_tss_region_file
    sizes_pr = df_to_pyranges(read_bed(sizes + '.bed'))
  File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/tools.py", line 53, in df_to_pyranges
    df['Start'] = df[start_col] - start_slop
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 'start'

Slightly different results using example data

Hello,
I am attempting to replicate the results using the chromosome 22 example to verify that the code is working properly before applying it to my own data. I am running the code on a linux-based cluster with slightly different versions of the necessary packages. I am using anaconda to load the necessary packages, and I am using the following versions:

Anaconda: 4.3.0
Python: 3.6.4
samtools: 1.9
bedtools: 2.29
Macs: 2.1.2
Java: 10.0.2
Juicer Tools: 1.11.09

Python packages:
pyranges: 0.0.63
numpy: 1.17.3
pandas: 0.25.3
scipy: 1.3.1
pyBigWig: 0.3.13

I am not getting any warnings or errors, and I am getting results very similar to the results in the example folder, but my re-calculated results are consistently lower than the example which results in less enhancer-gene pairs, since the cutoff is 0.2. I have included a scatterplot of the ABC scores from the enhancer-gene pairs that both the example and my results have in common, and the IGV visualization of these results below. The mean absolute difference between each ABC score pair was about 0.0026.

compABCscores

Here is a visual comparison, looking at the EnhancerPreductions.bedpe output from the example_chr22/ABC_output/Predictions included here and the EnhancerPreductions.bedpe output from running the code with the exact specifications as in the README file.
IGVcomparison

Since the results are well correlated, I don't think this is a big issue, but I am not sure why these differences are occurring. It could be some rounding differences on the cluster, or some differences in the versions of the packages I am using, or the fact that I am running the packages through anaconda. Any thoughts?

Best regards

Running predict.py: VCobserved doesn't exist, but the program keeps looking for it

Hello,
Great program (if I could get it to work)! I am trying to run predict.py after running hic steps for juicebox data (juicebox_dump.py, compute_powerlaw_fit_from_hic.py). The steps to get neighborhoods and all previous steps ran smoothly.

predict.py keeps looking for VCobserved.gz files even when I define the --hic_type juicebox. Editing hic.py to force allow_vc to false didn't help. I have pasted parameter text file below, as well as a pastebin link of the full run + error messages. How can I force the program to bypass the absence of VCobserved files?

Any help is appreciated. Thanks!

enhancers abc/peaks_macs2/Neighborhoods/EnhancerList.txt
genes abc/peaks_macs2/Neighborhoods/GeneList.txt
outdir abc/Predictions/
window 5000000
score_column ABC.Score
threshold 0.02
cellType U2OS
chrom_sizes genomes/hg38/hg38/hg38.chrom.sizes
HiCdir abc/hic/
hic_resolution 5000
tss_hic_contribution 100
hic_pseudocount_distance 1000000.0
hic_type juicebox
hic_is_doubly_stochastic False
scale_hic_using_powerlaw True
hic_gamma 0.87
hic_gamma_reference 0.87
run_all_genes False
expression_cutoff 1
promoter_activity_quantile_cutoff 0.4
make_all_putative True
use_hdf5 False
tss_slop 500
chromosomes all
include_chrY False

https://pastebin.com/xQV9U8hE

Predictions needs chrY file without --includechrY

I am running the predict.py script as follows:

python src/predict.py
--enhancers Neighborhoods_Cortex/EnhancerList.txt
--genes Neighborhoods_Cortex/GeneList.txt
--HiCdir CHiP_HiC_input/Cortex_HiC
--hic_resolution 10000
--scale_hic_using_powerlaw --threshold .02 --cellType Cortex --outdir Predictions_Cortex/
--make_all_putative

However, I get an error for chromosome Y, stating that the file is not found. Which makes sense, since I do not have the HiC Data for chrY.

Making predictions for chromosome: Y
Making putative predictions table...
Using: CHiP_HiC_input/Cortex_HiC/chrY/chrY.VCobserved.gz
Begin HiC
Loading HiC
Traceback (most recent call last):
File "src/predict.py", line 141, in
main()
File "src/predict.py", line 103, in main
this_chr = make_predictions(chromosome, this_enh, this_genes, args)
File "/psycl/g/mpsziller/liesa/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 14, in make_predictions
pred = add_hic_to_enh_gene_table(enhancers, genes, pred, hic_file, hic_norm_file, hic_is_vc, chromosome, args)
File "/psycl/g/mpsziller/liesa/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 63, in add_hic_to_enh_gene_table
gamma = args.hic_gamma)
File "/psycl/g/mpsziller/liesa/ABC-Enhancer-Gene-Prediction/src/hic.py", line 38, in load_hic
HiC_sparse_mat = hic_to_sparse(hic_file, hic_norm_file, hic_resolution)
File "/psycl/g/mpsziller/liesa/ABC-Enhancer-Gene-Prediction/src/hic.py", line 146, in hic_to_sparse
header=None, engine='c', memory_map=True)
File "/u/liewe/conda-envs/bpnet/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/u/liewe/conda-envs/bpnet/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/u/liewe/conda-envs/bpnet/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in init
self._make_engine(self.engine)
File "/u/liewe/conda-envs/bpnet/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/u/liewe/conda-envs/bpnet/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 613, in pandas._libs.parsers.TextReader._setup_parser_source
File "/u/liewe/conda-envs/bpnet/lib/python3.6/gzip.py", line 163, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'CHiP_HiC_input/Cortex_HiC/chrY/chrY.VCobserved.gz'

I thought that without the explicit flag, chromosome Y is not needed, as it is not computed?

difficulty reproducing the ABC score of K562's in Nature 2021

Hi,
We are trying to reproduce the ABC scores of the K562s (Fulco2019) scores in Nature 2021.
We keep getting values that are somewhat correlated but not quite.
the first step of candidate enhancer coordinates are also not matching up.
We also tried forgoing step 1 and using Fulco2019 enhancer candidate regions instead, but we still were not able to get the same scores.
Could you look at our commands and see what we are doing wrong?
Thank you for you help.


#juicer_tools
wget http://hicfiles.tc4ga.com.s3.amazonaws.com/public/juicer/juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar
ln -s juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar juicer_tools.jar

#gather data
cd example_wg/input_data
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562AlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562AlnRep2.bam
wget https://www.encodeproject.org/files/ENCFF384ZZM/@@download/ENCFF384ZZM.bam

conda activate final-abc-env
samtools index wgEncodeUwDnaseK562AlnRep1.bam
samtools index wgEncodeUwDnaseK562AlnRep2.bam
samtools index ENCFF384ZZM.bam
cd ../..
cp example_chr22/input_data/Expression/K562.ENCFF934YBO.TPM.txt example_wg/input_data/

#Download hic matrix file from juicebox
python src/juicebox_dump.py \
--hic_file https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined_30.hic \
--juicebox "java -jar juicer_tools.jar" \
--outdir example_wg/input_data/HiC/raw/ 

#Fit HiC data to powerlaw model and extract parameters
python src/compute_powerlaw_fit_from_hic.py \
--hicDir example_wg/input_data/HiC/raw/ \
--outDir example_wg/input_data/HiC/raw/powerlaw/ \
--maxWindow 1000000 \
--minWindow 5000 \
--resolution 5000 



#Define candidate enhancer regions
conda activate macs-py2.7
macs2 callpeak -t example_wg/input_data/wgEncodeUwDnaseK562AlnRep1.bam -n wgEncodeUwDnaseK562AlnRep1.macs2 -f BAM -g hs -p .1 --call-summits  --outdir example_wg/ABC_output/Peaks/
conda activate final-abc-env
bedtools sort -faidx reference/chr_sizes -i  example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak > example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak.sorted


python src/makeCandidateRegions.py \
--narrowPeak example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak.sorted \
--bam example_wg/input_data/wgEncodeUwDnaseK562AlnRep1.bam \
--outDir example_wg/ABC_output/Peaks/ \
--chrom_sizes reference/chr_sizes \
--regions_blocklist reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed \
--regions_includelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed \
--peakExtendFromSummit 250 \
--nStrongestPeaks 150000

#Quantifying Enhancer Activity
python src/run.neighborhoods.py \
--candidate_enhancer_regions example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed \
--H3K27ac example_wg/input_data/ENCFF384ZZM.bam \
--DHS example_wg/input_data/wgEncodeUwDnaseK562AlnRep1.bam,example_wg/input_data/wgEncodeUwDnaseK562AlnRep2.bam \
--expression_table example_wg/input_data/K562.ENCFF934YBO.TPM.txt \
--chrom_sizes reference/chr_sizes \
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
--cellType K562 \
--outdir example_wg/ABC_output/Neighborhoods/ 


#Computing the ABC Score
python src/predict.py \
--enhancers example_wg/ABC_output/Neighborhoods/EnhancerList.txt \
--genes example_wg/ABC_output/Neighborhoods/GeneList.txt \
--HiCdir example_wg/input_data/HiC/raw/ \
--chrom_sizes reference/chr_sizes \
--hic_resolution 5000 \
--scale_hic_using_powerlaw \
--threshold .02 \
--cellType K562 \
--outdir example_wg/ABC_output/Predictions/ \
--make_all_putative \
--chromosome chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX


Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.