epigen / open_pipelines

Pipelines for NGS data preprocessing by the Bock lab and friends

Python 96.36% R 3.64%

bioinformatics pipeline ngs-pipeline looper pypiper chip-seq atac-seq rna-seq

open_pipelines's Issues

bugs in chipseq.py & ngstk.py

Peak calling in open_pipelines/chipseq.py was not working. It turns out that a couple of issues need to be corrected. Here are my solutions.

1-) In open_pipelines/chipseq.py (line 535), an underscore is missing, i.e. pipe_manager.wait_for_file... -> pipe_manager._wait_for_file...
2-) macs2CallPeaks in pypiper/ngstk.py (line 1140) was not functioning properly when called from open_pipelines/chipseq.py (line 545). I think this is because "self" is missing from the definition of macs2CallPeaks (this subtle problem is explained here: http://blog.rtwilson.com/got-multiple-values-for-argument-error-with-keyword-arguments-in-python-classes/). I added "self" in my local copy.
3-) To be consistent with open_pipelines/chipseq.py, I also changed treatmentBams -> treatmentBam and controlBams -> controlBam in macs2CallPeaks in pypiper/ngstk.py (line 1140).

With these 3 changes, open_pipelines/chipseq.py now calls peaks properly.
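
The "multiple values for argument" failure from point 2 can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical Toolkit class in place of the real ngstk class; the method names and parameters are illustrative and are not pypiper's actual signatures.

```python
class Toolkit(object):
    # Hypothetical stand-in for the toolkit class.

    # "self" is missing on purpose: the instance gets bound to treatment_bam.
    def call_peaks_broken(treatment_bam, output_dir):
        return "macs2 callpeak -t {} --outdir {}".format(treatment_bam, output_dir)

    # With "self" in place, keyword arguments map to the intended parameters.
    def call_peaks_fixed(self, treatment_bam, output_dir):
        return "macs2 callpeak -t {} --outdir {}".format(treatment_bam, output_dir)


tk = Toolkit()

try:
    # The instance already fills the first positional slot (treatment_bam),
    # so passing treatment_bam again by keyword collides with it.
    tk.call_peaks_broken(treatment_bam="sample.bam", output_dir="peaks")
    broken_error = None
except TypeError as err:
    broken_error = str(err)

fixed_cmd = tk.call_peaks_fixed(treatment_bam="sample.bam", output_dir="peaks")
```

The error only shows up when the method is called with keyword arguments, which is probably why it went unnoticed until chipseq.py called it that way.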

bug here

open_pipelines/pipelines/atacseq.py

Line 113 in 4d42c1d

self.bigwig = pjoin(self.coverage, self.sample_name + ".bigWig")

Should be self.coverage_dir for the bigwig file, not self.coverage (a file).
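
A minimal sketch of the intended path layout, assuming a hypothetical sample class. Only the attribute names coverage_dir, coverage and bigwig come from the report above; the ".cov" extension and the constructor are made up for illustration.

```python
import os


class SampleSketch(object):
    # Hypothetical stand-in for the pipeline's sample class.
    def __init__(self, sample_name, sample_root):
        self.sample_name = sample_name
        # Directory that holds all coverage outputs.
        self.coverage_dir = os.path.join(sample_root, "coverage")
        # A file inside that directory (extension assumed for illustration).
        self.coverage = os.path.join(self.coverage_dir, sample_name + ".cov")
        # The bigWig is joined onto the directory, not onto the coverage file.
        self.bigwig = os.path.join(self.coverage_dir, sample_name + ".bigWig")


sample = SampleSketch("sample1", "results_pipeline")
```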

Fastqc parser bug

Hello,

FastQC parsing fails due to "IndexError: list index out of range". Changing the line:

open_pipelines/pipelines/atacseq.py

Line 510 in d40d700

line = [i for i in range(len(content)) if "Sequence length " in content[i]][0]

to
line = [i for i in range(len(content)) if "Sequence length" in content[i]][0]
fixes the issue. Looks like the last 2 space characters in the search pattern were causing the problem.

Best,
Bekir
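
A slightly more tolerant lookup would avoid depending on exact trailing whitespace and would not raise IndexError when the line is absent. This is only a sketch: parse_sequence_length is a hypothetical helper, not the function in atacseq.py, and it assumes the value is the last whitespace-separated field on the line.

```python
def parse_sequence_length(content):
    """Return the value on the 'Sequence length' line, or None if it is absent."""
    for line in content:
        # Match on the label only, ignoring surrounding whitespace.
        if line.strip().startswith("Sequence length"):
            # Split on any run of whitespace (tabs or spaces), take the last field.
            return line.split()[-1]
    return None


example = ["Filename\tsample1.fastq", "Sequence length\t36"]
length = parse_sequence_length(example)
missing = parse_sequence_length(["Filename\tsample1.fastq"])
```

Returning None leaves it to the caller to decide whether a missing line is fatal.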

Add all required arguments as pipelines inputs

This should relax the requirement to run the pipelines through looper (via the sample YAML).
Arguments passed on the command line should override those given in the sample YAML when both are provided.
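
One way the override rule could work, as a sketch only: the option names (--genome, --cores) and the merge_settings helper are hypothetical, and a plain dict stands in for the parsed sample YAML.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pipeline-sketch")
    # Defaults are None so that "not given on the command line" is detectable.
    parser.add_argument("--genome", default=None)
    parser.add_argument("--cores", type=int, default=None)
    return parser


def merge_settings(sample_yaml, cli_args):
    """Start from the sample YAML values and let explicit CLI values win."""
    merged = dict(sample_yaml)
    for key, value in vars(cli_args).items():
        if value is not None:
            merged[key] = value
    return merged


sample_yaml = {"genome": "hg19", "cores": 4}  # stand-in for the parsed YAML
args = build_parser().parse_args(["--genome", "hg38"])
settings = merge_settings(sample_yaml, args)
```

Here --genome comes from the command line and cores falls back to the sample YAML value.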

sambamba command not called as defined in the tools section of the atacseq.yaml pipeline file when generating oracle_FRiP using sambamba view

After macs2, sambamba is called as defined in atacseq.yaml, but the next sambamba view command simply calls "sambamba":

macs2 callpeak -t ....
sambamba-0.7.1 depth region -t ....
sambamba view -t ....
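
A possible shape for the fix, assuming the tools section is available as a dict: build every sambamba command from the configured executable rather than a hard-coded name. build_sambamba_view is a hypothetical helper, not the pipeline's code.

```python
def build_sambamba_view(tools, cores, extra_args):
    """Build a 'sambamba view' command using the executable from the tools config."""
    # Fall back to plain "sambamba" only when the config does not name one.
    executable = tools.get("sambamba", "sambamba")
    return [executable, "view", "-t", str(cores)] + list(extra_args)


tools = {"sambamba": "sambamba-0.7.1"}  # stand-in for the tools section of atacseq.yaml
cmd = build_sambamba_view(tools, 4, ["sample.bam"])
```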

set_file_path error

Reported by @mtugrul

I was not able to run chipseq.py and atacseq.py in looper. I get the following errors. I looked at the corresponding lines but could not work out the cause. Any help?

chipseq:

Traceback (most recent call last):
  File "/home/mtugrul/software/pipelines/pipelines/chipseq.py", line 755, in <module>
    sys.exit(main())
  File "/home/mtugrul/software/pipelines/pipelines/chipseq.py", line 473, in main
    sample.set_file_paths()
  File "/home/mtugrul/software/pipelines/pipelines/chipseq.py", line 78, in set_file_paths
    super(ChIPseqSample, self).set_file_paths()
TypeError: set_file_paths() takes exactly 2 arguments (1 given)

atacseq:

Traceback (most recent call last):
  File "/home/mtugrul/software/pipelines/pipelines/atacseq.py", line 723, in <module>
    sys.exit(main())
  File "/home/mtugrul/software/pipelines/pipelines/atacseq.py", line 456, in main
    sample.set_file_paths()
  File "/home/mtugrul/software/pipelines/pipelines/atacseq.py", line 54, in set_file_paths
    super(ATACseqSample, self).set_file_paths()
TypeError: set_file_paths() takes exactly 2 arguments (1 given)

OK, it looks like this problem is caused by newer looper versions. With looper v0.5 installed, the pipelines run. I also tried all the dev versions of looper v0.6, without success. I will stick to v0.5 for now, but this should be solved in the long term.
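
The TypeError suggests that the base class method now expects one argument besides self. The sketch below reproduces that mismatch with hypothetical classes. The parameter name "project" and the attributes are guesses for illustration, not looper's actual API.

```python
class BaseSampleSketch(object):
    # Hypothetical base class whose method requires an extra argument.
    def set_file_paths(self, project):
        self.results_dir = project["results_dir"]


class OldStyleSample(BaseSampleSketch):
    def set_file_paths(self):
        # Written against the older signature: passes nothing to the base class.
        super(OldStyleSample, self).set_file_paths()


class UpdatedSample(BaseSampleSketch):
    def set_file_paths(self, project):
        # Forward the argument the base class now expects.
        super(UpdatedSample, self).set_file_paths(project)
        self.peaks_dir = self.results_dir + "/peaks"


project = {"results_dir": "results_pipeline/sample1"}

try:
    OldStyleSample().set_file_paths()
    old_style_error = None
except TypeError as err:
    old_style_error = str(err)

updated = UpdatedSample()
updated.set_file_paths(project)
```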

Update to PEP stack 2.0

Hi,

I'm no longer using this code, but I'm still collaborating with @sreichl on projects that use this.

I've heard there's some trouble upgrading this to work with the PEP stack>=2.0.

@fwzhao I believe you did some work on this on the project side to upgrade project configs, etc.
Do you want to share your progress, and any issues you might have so we can start upgrading the pipelines?

Anyone else interested, please pitch in.

Missing run_spp.R for chipseq.py

I tried substituting the script from pipelines/tools that sounded like it could stand in for the spp tool referenced in the tools section of the pipeline config file. Judging by the sample's log file, though, the command being run looks appropriate for a different script. So are run_spp.R and spp_peak_calling.R meant for different things?

Question: program naming for pipeline scripts

It looks like some pipelines, e.g. chipseq.py, are defined as command-line programs in setup.py. There the program has an underscored name, while in the description passed to the argument parser the name is hyphenated. Is this due to a Python-related hyphens-to-underscores conversion, or should those match?
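
For context on the conversion part of the question: argparse replaces hyphens with underscores when it derives attribute names from option strings, but the program name given as prog is kept exactly as written. A small sketch, where the program name and the --peak-caller option are hypothetical:

```python
import argparse

parser = argparse.ArgumentParser(
    prog="chipseq-pipeline",  # hypothetical name; argparse keeps it as given
    description="ChIP-seq pipeline sketch",
)
# Hypothetical option: the hyphen becomes an underscore in the attribute name.
parser.add_argument("--peak-caller", default="macs2")

args = parser.parse_args(["--peak-caller", "spp"])
display_name = parser.prog
peak_caller = args.peak_caller
```

So a hyphenated name in the parser description would not come from automatic conversion on argparse's side.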

Attribute reference to 'bigwig' on Sample in chipseq.py

In the chipseq.py pipeline, there are three usages of sample.bigwig, but the Sample instance being used does not have a bigwig attribute.

Target exists: `/sfs/lustre/allocations/shefflab/processed//kipnis_chip/micro/results_pipeline/input_12k/mapped/input_12k.trimmed.bowtie2.filtered.bam.bai`
Removed existing flag: /sfs/lustre/allocations/shefflab/processed//kipnis_chip/micro/results_pipeline/input_12k/chipseq_failed.flag
Traceback (most recent call last):
  File "/home/vpr9v/code/open_pipelines/pipelines/chipseq.py", line 757, in <module>
    sys.exit(main())
  File "/home/vpr9v/code/open_pipelines/pipelines/chipseq.py", line 484, in main
    process(sample, pipe_manager, args)
  File "/home/vpr9v/code/open_pipelines/pipelines/chipseq.py", line 642, in process
    track_dir = os.path.dirname(sample.bigwig)
AttributeError: 'ChIPseqSample' object has no attribute 'bigwig'

Pypiper terminating spawned child process 190059

Change status from running to failed

Peak count stat for SPP

I can read through the script if nobody knows offhand, but is anyone aware whether the SPP peak-calling R script is responsible for reporting the peak count? When MACS2 is the caller, this is done post hoc with report_dict, but not when the caller is SPP.

Leveraging pypiper ngstk

atacseq.py and chipseq.py should use functions from pypiper.ngstk. Specifically, this concerns bam_to_bigwig / bamToBigWig (though it may also apply to other pipeline-defined functions). From a quick look at the version in chipseq.py, the only real difference seems to be a hook for a normalization factor. These pipelines should use a central version from ngstk once it parameterizes that.
databio/pypiper#52

Add FASTQ input to all pipelines

Help needed with CROP-seq / open_pipelines issues

Hello epigen,

I am a wet-lab cell biologist with beginner to intermediate coding skills, and I am trying to set up the CROP-seq pipeline in our laboratory.
It uses looper and also the open_pipelines repository.
Sadly, I cannot get past a certain step.
I'm not sure whether this problem has anything to do with the open_pipelines scripts, but I do not know where else to ask: https://github.com/epigen/crop-seq.git is archived and I can't create issues there.
The step is called makeref.py; it builds a GTF, a STAR index and a refFlat for the genome I'm using plus the viral genome/gRNA that I want.
During this script it loads a looper config.yaml once.
However, I get this error:
(CROPenv) lucask@kolossus:~/crop-seq$ make makeref
python src/guides_to_ref.py
Traceback (most recent call last):
  File "src/guides_to_ref.py", line 58, in <module>
    prj = Project(os.path.join("metadata", "config.yaml"))
  File "/home/lucask/crop-seq/src/looper/looper/models.py", line 772, in __init__
    process_pipeline_interfaces(self.metadata.pipelines_dir)
  File "/home/lucask/crop-seq/src/looper/looper/models.py", line 376, in process_pipeline_interfaces
    proto_iface = ProtocolInterface(pipe_iface_location)
  File "/home/lucask/crop-seq/src/looper/looper/models.py", line 2845, in __init__
    self.pipe_iface = PipelineInterface(self.pipe_iface_path)
  File "/home/lucask/crop-seq/src/looper/looper/models.py", line 2471, in __init__
    with open(config, 'r') as f:
IOError: [Errno 2] No such file or directory: '/media/draco/lucask/open_pipelines/config/pipeline_interface.yaml'
make: *** [makeref] Error 1
The config.yaml (https://github.com/epigen/crop-seq/blob/master/metadata/config.yaml) is a copy of the original, except with my directories. I don't know enough (if anything at all) about looper to understand exactly what I am missing.
I'm using an environment that should have all the dependencies installed.
I do see an atacseq.interface.yaml in open_pipelines, but not one for drop-seq (the one I will eventually need).
I am not sure whether this issue stems from the fact that you are updating this repository or from a mistake in how I set up the looper config.yaml.
Could you help me or give some advice?
Thank you in advance.

Kind regards,
Lucas Kuijpers

P.S. Let me know if I need to send more info, or if there is another GitHub repository or webpage where I should look.

ATAC-seq pipeline exits if no mitochondrial reads are duplicated

The ATAC-seq pipeline exits with a zero division error if no mitochondrial reads are duplicated. See the logfile:
File "/Users/christianschmidl/src/open_pipelines/pipelines/atacseq.py", line 359, in parse_duplicate_stats
prefix + "duplicate_percentage": (float(duplicates) / (single_ends + paired_ends * 2)) * 100}
ZeroDivisionError: float division by zero
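
A guard around the division would keep the pipeline running in this case. This is a sketch under assumptions: duplicate_percentage is a hypothetical helper that mirrors the expression in the traceback, and reporting 0.0 for an empty denominator is one possible choice, not necessarily the one the maintainers would make.

```python
def duplicate_percentage(duplicates, single_ends, paired_ends):
    """Percentage of duplicate reads, or 0.0 when there are no reads to count."""
    total = single_ends + paired_ends * 2
    if total == 0:
        # No reads at all: avoid dividing by zero.
        return 0.0
    return (float(duplicates) / total) * 100


empty_case = duplicate_percentage(0, 0, 0)
normal_case = duplicate_percentage(50, 100, 50)
```

Returning float("nan") instead would make the missing measurement explicit in the stats file, at the cost of downstream code having to handle NaN.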