pypiper's People

Contributors

afrendeiro, aschoenegger, donaldcampbelljr, fhalbritter, hussainzaidi, jpsmith5, khoroshevskyi, mfarlik, mtugrul, nsheff, pdatlinger, stolarczyk, vreuter

pypiper's Issues

convert to class

make pypiper a class, to be instantiated in a pipeline script.

Pipeline parameters configuration & AttributeDict strictness

AttributeDict has a behavioral parameter, specified at construction time, that governs how a request for the value of a key/attribute that's not present is handled. The config of a PipelineManager behaves in the loose / non-strict way: it returns the key/attribute name itself if there's no value for it (i.e., the key isn't stored in the AttributeDict).

From discussion with @nsheff : the original motivation for this behavior was the desire to enable specification of a different Python executable than the one on PATH that's activated by simply referencing python. Sometimes a python may not be specified in a configuration file, though, in which case it's nice to be able to fall back on the version pointed to by python, which is accomplished by this non-strict AttributeDict behavior. Analogous reasoning applies to other executables.

More specifically, still with Python as the particular example executable, ngstk relies on its tools.python attribute pointing to a valid Python version. To preserve this nicety while also offering the more intuitive, flexible behavior of a strict AttributeDict, a PipelineManager's config.parameters could be strict, and a separate configuration section that uses the non-strict AttributeDict could be used for ngstk settings.
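
A minimal sketch of the two lookup behaviors, assuming a hypothetical AttributeDict with a strict constructor flag (illustrative only, not the actual pypiper/looper implementation):

class AttributeDict(dict):
    # Illustrative sketch: 'strict' controls what happens for missing keys.
    def __init__(self, data=None, strict=False):
        super(AttributeDict, self).__init__(data or {})
        self._strict = strict

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self:
            return self[name]
        if self._strict:
            raise AttributeError("No setting for '{}'".format(name))
        return name  # non-strict: fall back to the key itself, e.g. "python"

tools = AttributeDict({"samtools": "/opt/bin/samtools"})
print(tools.samtools)  # "/opt/bin/samtools"
print(tools.python)    # "python": whatever executable is on PATH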

Hard coded total chromosome sizes

Total chromosome sizes are hard-coded in the functions "macs2CallPeaksATACSeq" and "macs2CallPeaks" of ngstk.py, so I ran into problems when I did the analysis with mm9. Maybe this could be added to the atacseq.yaml.

Also, I wonder if those genome sizes are correct:
For mm9, I summed up the chromosome size values from the chromosome_sizes files:
/data/prod/ngs_resources/genomes/mm9/mm9_chromlength.txt
The size I get corresponds exactly to the one listed here:
http://genomewiki.ucsc.edu/index.php/Genome_size_statistics

However, if I do the same for the other genomes (e.g. hg19), I get 3.1e9 bases, which is similar to the link above but different from what's defined in ngstk.py.
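
For reference, the check can be reproduced with a small script that sums the second column of a chromosome-sizes file (a sketch; the file is assumed to be two tab-separated columns, chromosome name and length):

# Sum chromosome lengths from a chromosome-sizes file
# (assumed format: two tab-separated columns, chromosome name and length).
def total_genome_size(chrom_sizes_path):
    total = 0
    with open(chrom_sizes_path) as handle:
        for line in handle:
            fields = line.strip().split("\t")
            if len(fields) >= 2:
                total += int(fields[1])
    return total

print(total_genome_size("/data/prod/ngs_resources/genomes/mm9/mm9_chromlength.txt"))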

streamlining sanity checks

With the current system, pipelines use a rather cumbersome way of checking results, for example:

# Fastq conversion
mypiper.call_lock(cmd, unaligned_fastq)
if not args.no_check:
    raw_reads = ngstk.count_reads(local_unmapped_bam_abs, args.paired_end)
    mypiper.report_result("Raw_reads", str(raw_reads))
    fastq_reads = ngstk.count_reads(unaligned_fastq, args.paired_end)
    mypiper.report_result("Fastq_reads", fastq_reads)
    if fastq_reads != int(raw_reads):
        raise Exception("Fastq conversion error? Number of reads doesn't match unaligned bam")

This ends up being run every time the pipeline is re-run, even if the bam to fastq conversion isn't re-run -- a minor annoyance.

This could be streamlined with a postcall parameter passed to call_lock. It would take a function (a lambda, perhaps) that is run only if the call was actually run. It could be useful for generically checking the results of a call to make sure they are OK before proceeding.

A second addition to make it even more useful would be a pypiper.assert() function, which fails the pipeline if its argument evaluates to False. If we also make report_result return the number it is reporting, then the final system could look like this:

# Fastq conversion
mypiper.call_lock(cmd, unaligned_fastq, postcall=lambda:
    mypiper.assert(
        mypiper.report_result("Raw_reads", str(ngstk.count_reads(local_unmapped_bam_abs, args.paired_end))) ==
        mypiper.report_result("Fastq_reads", ngstk.count_reads(unaligned_fastq, args.paired_end))))

This would now only be run if this call was executed, changing the fastq file. It would then avoid the need for args.no_check altogether. Furthermore, it would only report the result if the command was run, avoiding the bloated stats files we currently have because these results are reported on every re-run.

Or better yet, there could also be a postassert argument, which evaluates itself and then fails the pipeline if it's false. Then it's as simple as:

# Fastq conversion
pl.call_lock(cmd, unaligned_fastq, assert=
    pl.report_result("Raw_reads", str(ngstk.count_reads(local_unmapped_bam_abs, args.paired_end))) ==
    ngstk.count_reads(unaligned_fastq, args.paired_end))
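
A generic sketch of what such a hook could look like, independent of the real call_lock internals (the function name, the check against the target file, and the postcall parameter are all illustrative, not the pypiper API):

import os
import subprocess

def call_with_postcheck(cmd, target, postcall=None):
    # Run cmd only if target is missing; run postcall only when cmd actually ran.
    # Illustrative sketch of the proposed 'postcall' hook; not the pypiper API.
    if os.path.exists(target):
        return False                      # target exists: skip command and check
    subprocess.check_call(cmd, shell=True)
    if postcall is not None:
        postcall()                        # e.g. compare read counts, report results
    return True

With something like this, the read-count comparison would only execute on runs that actually regenerate the fastq file, and the results would only be reported once.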

Dynamic recovery not working on preempted jobs

I haven't been able to figure this one out.

Running pypiper-0.5.0rc2 and looper 0.6.0rc1

I ran 37 jobs. 10 completed without problem and ran through the whole pipeline perfectly. The remaining 27 all ran into problems during the pre-alignments stage, failing to complete either the alphasat or alu alignments. They all generated the tmp intermediate bam files but did not compile them into the final bam file.

The one thing that all of the failed jobs have in common is that they were preempted. All of the completed jobs were not preempted. It looks like the jobs have some trouble recovering after being re-submitted. I'm attaching an example log file here. It looks like there is some confusion about the file lock during alignment. It looks like the alignment is not restarted but rather the pipeline just waits for it to finish (and the file lock to be released) and it never does.

Happy to help troubleshoot this. I don't have much intuition for how this is all happening. I'll keep digging.

bugs in chipseq.py & ngstk.py

Peak calling in open_pipelines/chipseq.py was not working. It turns out that there are a couple of issues to be corrected. Here are my solutions.

1) In open_pipelines/chipseq.py (line 535), an underscore is missing, i.e. pipe_manager.wait_for_file... -> pipe_manager._wait_for_file...
2) macs2CallPeaks in pypiper/ngstk.py (line 1140) was not functioning properly when called from open_pipelines/chipseq.py (line 545). I think this is due to "self" not being included in macs2CallPeaks's definition (this subtle problem is explained here: http://blog.rtwilson.com/got-multiple-values-for-argument-error-with-keyword-arguments-in-python-classes/). I added a "self" in my local copy.
3) Also, to be consistent with open_pipelines/chipseq.py, I changed treatmentBams -> treatmentBam and controlBams -> controlBam in macs2CallPeaks in pypiper/ngstk.py (line 1140).

With these 3 changes, open_pipelines/chipseq.py now calls peaks properly.
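
For reference, the "got multiple values for argument" symptom is exactly what a method defined without self produces when called with keyword arguments; a minimal, unrelated reproduction (illustrative names only):

class Toolkit(object):
    # Missing 'self': the instance gets bound to 'treatmentBam',
    # so also passing treatmentBam as a keyword is a conflict.
    def call_peaks_broken(treatmentBam, controlBam=None):
        pass

    # Correct definition.
    def call_peaks(self, treatmentBam, controlBam=None):
        pass

tk = Toolkit()
tk.call_peaks(treatmentBam="t.bam")              # fine
try:
    tk.call_peaks_broken(treatmentBam="t.bam")   # TypeError
except TypeError as err:
    print(err)  # ... got multiple values for argument 'treatmentBam'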

NGSTk could use pigz

ngstk uses gunzip sometimes. It could instead use pigz if it's installed, perhaps seamlessly falling back to gunzip if not.

Maybe this could become a built-in pypiper helper: you'd say pm.zip and it would use either pigz or gzip as appropriate, so pipelines could also take advantage of this.
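
A possible shape for such a helper, assuming only that pigz is a drop-in replacement for gzip on the command line (the function names are illustrative, and shutil.which requires Python 3.3+):

import shutil

def ziptool():
    # Prefer pigz when it is on PATH; otherwise fall back to gzip.
    return "pigz" if shutil.which("pigz") else "gzip"

def compress_cmd(path, decompress=False):
    # pigz and gzip share the -d flag, so the command is built the same way.
    flag = " -d" if decompress else ""
    return "{}{} {}".format(ziptool(), flag, path)

print(compress_cmd("reads.fastq"))           # e.g. "pigz reads.fastq"
print(compress_cmd("reads.fastq.gz", True))  # e.g. "pigz -d reads.fastq.gz"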

File waiting method is protected

PipelineManager._wait_for_file is protected. This seems like a useful public method for the PipelineManager API, and it's used in e.g. open_pipelines/pipelines/chipseq.py. If there's no public wrapper around this, could it be made public?

Dry-run

Add a dry-run option that prints commands but does not run them.
This will need: 1. a new option in args for current pipelines, with the Pypiper class constructor understanding it; 2. recognition of the parameter in callprint; 3. a new state variable in the Pypiper class.
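
Roughly, the execution path could short-circuit like this (a toy sketch, not the real PipelineManager):

import subprocess

class MiniManager(object):
    # Toy illustration of a dry-run mode; not the pypiper PipelineManager.
    def __init__(self, dry_run=False):
        self.dry_run = dry_run            # the proposed new state variable

    def callprint(self, cmd):
        print(cmd)                        # always echo the command
        if self.dry_run:
            return 0                      # pretend success without running
        return subprocess.call(cmd, shell=True)

MiniManager(dry_run=True).callprint("samtools view -c input.bam")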

NGSTk counting functions should each return an integer

Since NGSTk offers--nominally and substantively--a toolkit of NGS functions, it seems like there's intent to have users be able to treat it as a collection of general-purpose functions, maybe in conjunction with a PipelineManager, but perhaps separately. I understand that it works alright in the report_result context, but to a new/external user, it'd likely be surprising to be given a text result from a counting function. Additionally, there are cases in which there's further computation to be done on a returned result, and then it's inconvenient and non-intuitive to do string processing before arithmetic after calling a counting function. I propose doing any string processing and type conversion within these functions rather than requiring it on the user side.

macs2CallPeaks not producing NAME_model.r script

From the NGSTK code and from the MACS 2 docs, it looks like the model building intends to write a script called NAME_model.r, where NAME is the sample name. When this function is called from chipseq.py in open_pipelines, though, that script is missing, causing the subsequent plotting call to fail. I thought maybe the name was off, but there's actually no .r file in any of the project's samples' output folders. I'm just wondering if anyone else has run into this and perhaps knows what could be happening, maybe @afrendeiro ? There's no indication in the sample's logfile:

Calling peaks with MACS2 (08-22 00:19:55) elapsed:0:00:00 _TIME_

Target exists: `/sfs/lustre/allocations/shefflab/processed//kipnis_chip/full/results_pipeline/chip_30k/peaks/chip_30k_peaks.broadPeak`
> `peaks`       1911319 chipseq _RES_
Ploting MACS2 model (08-22 00:19:56) elapsed:0:00:01 _TIME_

Target to produce: `/sfs/lustre/allocations/shefflab/processed//kipnis_chip/full/results_pipeline/chip_30k/peaks/chip_30k/chip_30k_model.pdf`
> `Rscript /sfs/lustre/allocations/shefflab/processed/kipnis_chip/full/results_pipeline/chip_30k_model.r`

Fatal error: cannot open file '/sfs/lustre/allocations/shefflab/processed/kipnis_chip/full/results_pipeline/chip_30k_model.r': No such file or directory

Place checkpoint files in separate sample subfolder

For a pipeline that wants to define many checkpoints, placing individual checkpoint files in the sample's root output directory clutters it a bit. It'd be cleaner to place the checkpoint files into a separate subfolder within each sample's output root.

update call_lock to use kwargs

call_lock has 2 modes: call_lock and call_lock_nofail, which are just aliases. This could be streamlined by using kwargs instead of a string of default function arguments.
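
One possible consolidation, sketched as a standalone function rather than the real pypiper method (the nofail keyword and the subprocess call are illustrative):

import subprocess

def call_lock(cmd, **kwargs):
    # Single entry point replacing the call_lock / call_lock_nofail pair;
    # 'nofail' moves into kwargs instead of a second aliased function.
    nofail = kwargs.pop("nofail", False)
    try:
        return subprocess.check_call(cmd, shell=True, **kwargs)
    except subprocess.CalledProcessError:
        if nofail:
            return None   # tolerate the failure; the pipeline continues
        raise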

Support use of virtual environments?

It seems that right now PipelineManager.run will not find packages installed in a virtual environment. This may depend on the argument to the shell parameter. It may also/instead be that some hook mechanism needs to be added to make this possible. This is desirable for users who wish to keep their system site-packages relatively clean.

error with tee output

when doing a local compute run, I get this error

Pypiper terminating spawned child process 221622...
child process terminated
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr

having trouble figuring out where it's coming from.

Problems with setting up Pypiper + Pipelines

First: thank you all for developing all of this. It's very handy

I just went through the whole process of making it run on my system and was asked to summarize my experience. It was a bit tricky to find the right versions that fit together plus correct some bugs. In particular, NGSTK and Pipelines (or open_pipelines) didn’t work together, neither master – master nor dev – dev. It's running for me at the moment and there ultimately weren't that many things to fix (but they were tricky to find). What I think would help most is to know which versions go together. Here are a few specific points, I will submit them in the dev branches (which I ultimately used):

  • NGSTK (master) was missing one method entirely (it was added in the dev branch)
  • In both the dev and master branches, the "all_args" argument in pypiper.add_pypiper_args(parser, all_args=True) was set in all pipelines. However, pypiper didn't use this argument (I had to use groups=["all"] instead)
  • NGSTK bam_fastq was missing the "+" in cmd += r'{ print "@"$1"\n"$10"\n+\n"$11
  • I needed the pipelines' yaml files that are found in the dev branch
  • Looper wasn't in $PATH upon installation; for some reason it worked in the microtest, but it took me some time to figure out where pip had installed it and to add it to $PATH so it would work outside the virtual env

Some ideas on how we could do things in the future:

  • All pipelines should be tested before a release, because specific pipelines require specific ngstk/pypiper methods, so testing only some pipelines does not guarantee full functionality of ngstk/pypiper
  • For the current state of things, I would say get the dev branch working, then merge it into master for all repos. In general, I would suggest using the dev branch to fix things and other branches to implement additional functionality, keeping those two things separate
  • Some of the errors above are found simply by running "python YOUR_PIPELINE.py"; could we use this for tests?

Thoughts?

Nik

make cleanups relative?

Right now cleanups are absolute, so if the folder gets moved, they no longer work.
Can this be fixed easily?

relax config requirement

pypiper currently requires args to have config, but it shouldn't require this.

Change status from initializing to running

Traceback (most recent call last):
  File "/sfs/nfs/blue/ns5bc/code/sratofastq/sra_convert.py", line 58, in <module>
    main(sys.argv[1:])
  File "/sfs/nfs/blue/ns5bc/code/sratofastq/sra_convert.py", line 45, in main
    args=args)
  File "/sfs/nfs/blue/ns5bc/.local/lib/python2.7/site-packages/pypiper/pypiper.py", line 182, in __init__
    if args and args.config_file is not None:
AttributeError: 'Namespace' object has no attribute 'config_file'

NGSTk function name/argument homogeneization

Since many of the NGSTk functions were written long ago, they aren't all using the same naming scheme (e.g. camel caps vs. lowercase with underscores).
Homogenizing them would mean updating all code that uses these functions (mostly pipelines, I guess).

'waiting' status flag

When waiting for a lock file, pypiper should change the status flag to 'waiting' instead of leaving it as 'running'. Then terminally waiting processes will be easier to identify.

Allow specification of dependencies

It's nice to have pypiper be dependency-free, but that allows installation, and then attempted use, of ngstk, for example, without its required dependencies, since many are imported at the function level. Allow installation based on need with something like pip install --upgrade pypiper[ngstk]
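
In setup.py terms, that would look something like this (the dependency list is illustrative, not the actual ngstk requirements):

# setup.py sketch: an optional dependency group installable as pypiper[ngstk].
from setuptools import setup

setup(
    name="pypiper",
    packages=["pypiper"],
    install_requires=[],                  # core stays dependency-free
    extras_require={
        "ngstk": ["pysam", "pandas"],     # illustrative, not the real list
    },
)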

failures of early stages in shell pipelines don't fail the pipeline

If you run a shell command through Pypiper using a shell pipe (|), then the shell will return the return value of the last command in the pipe string. Thus, even if an early step fails, it could return a 0, and the pypiper pipeline would continue.

You can solve this in bash with an option called pipefail, which makes it so failure at any stage fails the whole pipe. If you ask me, this should have been the default.

Anyway, if you force the Python subshell to use bash, you can do this in Python as explained here. But maybe there's a better way?
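
Sketched concretely: run the command through bash with pipefail turned on, so a failure in any pipe stage (not just the last) produces a nonzero return code.

import subprocess

cmd = "samtools view -h input.bam | grep -v '^@' | wc -l"

# Force bash (pipefail is a bash option) and enable pipefail before the command.
rc = subprocess.call("set -o pipefail; " + cmd,
                     shell=True, executable="/bin/bash")
print("return code:", rc)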

Proposal: eliminate dependence on particular nesting of pipeline manager settings

Right now, the access pattern for, e.g., the path to the sequencing adapters is pm.config.resources.adapters. It would be nice to have this and related details of a pipeline manager's configuration settings directly available, e.g. pm.adapters. Not only would this simplify usage within pipelines, it would also allow the code to flex around changes in pipeline interface files. Changes to how the interface file is parsed would then only impact pipeline interface files: a pipeline author using pypiper would need only to update the interface file(s), not the associated pipeline(s).

Tool flexibility for a given processing task

Processing tasks common to a variety of pipelines are often implemented by several different tools, and some use cases or environments may not have the full array of tools for a given task installed or otherwise available, or a pipeline may wish to use one particular tool for the task. It would therefore be a nice feature (maybe in NGSTK) to provide this sort of tool-substitution flexibility. For instance, filterReads could use samtools or sambamba. While individual pipelines could implement this sort of logic independently, rolling tool parameterization into NGSTK itself could substantially reduce code duplication and simplify pipeline authorship. This could be done in a number of ways: independent functions; a function like filterReads accepting one or more other functions as arguments; or something like a ReadFilter class in which the tool of choice and filtration parameters are supplied at construction time or at call time. A rough sketch of the simplest option follows.
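
One lightweight way to express this is a command builder keyed by tool name (the command templates are simplified and illustrative):

# Sketch: select the command template by tool name so pipelines can request
# the same operation regardless of which filtering tool is installed.
FILTER_TEMPLATES = {
    "samtools": "samtools view -b -q {q} {inp} > {out}",
    "sambamba": "sambamba view -f bam -F 'mapping_quality >= {q}' {inp} -o {out}",
}

def filter_reads_cmd(input_bam, output_bam, quality=30, tool="samtools"):
    return FILTER_TEMPLATES[tool].format(q=quality, inp=input_bam, out=output_bam)

print(filter_reads_cmd("sample.bam", "sample.filtered.bam", tool="sambamba"))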

proposed change to output file naming

Currently, pypiper output files start with the pipeline name and then the file type, like PIPE_log.md and PIPE_profile.md, etc. (see docs)

I did it this way to make it easier to find output from multiple pipelines. But the thing is, I've realized that 95% of the time we're running only a single pipeline, and it's annoying to have 4 files all starting with PIPE_* for tab completion.

PIPE_log.md
PIPE_profile.md
PIPE_status.flag
PIPE_commands.md

instead, these could be:

log_PIPE.md
profile_PIPE.md
status_PIPE.flag

etc. The advantage being tab completion is now much easier for folders running a single pipeline, which is most of the time.

But this is kind of a big change. Any opposed? Ideas?

Command created by NGSTk.calculate_frip doesn't always write a value

An empty FRiP file can be created when the command returned by calculate_frip is executed. I think this is due to an empty peaks file. In such a case, it may make more sense to write a nan rather than an empty file. More broadly, calculate_frip feels like the sort of function that should return a numeric value that the user then writes to a file or otherwise uses, rather than a command-returning function.

changing reporting results file name

Right now, the simple report_result function just puts results into a PIPE_stats.txt file. I think we should add another (optional) parameter specifying location, so we could report different classes of results; for instance, report_result("file_size", file_size, location="input_file") would instead record results to PIPE_input_file.txt. This could be used to compartmentalize results.

AttributeDict centralization

There are (behaviorally) different versions of AttributeDict in looper.models and here in pypiper. Since they're meant to accomplish the same things and be used in similar ways, it would be great to share a single implementation. Maybe this will be simplified with the migration of some of the core looper classes to a more general repository.

link rawdata to raw folder

In line 131 of ngstk, could we use pm.run instead of pm.callprint, in order to prevent two pipelines that use the same input data from trying to make the link at the same time?

Pythonize code and repository

The repository is not appropriately packaged: a setup.py is missing. This is especially important for deployment, but also useful in development.

The code does not follow Python's style and best practices, e.g. abusive use of globals and inconsistent variable naming (caps, camel case, underscores...).

Parameterize normalization factor on bam_to_bigwig

A couple of pipelines in open_pipelines have versions quite similar to what's in ngstk, just with accommodation for a normalization factor. That accommodation should be made in the ngstk version of bam_to_bigwig so that individual pipelines can use it and we avoid code duplication.

Dynamic recovery

Originally prompted by an issue in looper by @rcorces: pepkit/peppy#140

Solution: we could have a new mode called 'dynamic recover' mode. This runs in normal mode by default, but if a 'recover' flag is set, THEN it runs in recover mode. In other words, if a file lock is set, it checks for a recover flag, and then restarts only if a recover flag exists. If a dynamic-recover job is preempted (i.e. it fails via a SIGKILL or SIGTERM, not via a job failure), then it sets a recover flag. When a preempted job is restarted, it will recover; when a job that failed naturally is restarted, it will not. A dual submission still succeeds.

Wow, that's pretty sweet. On Sherlock (or any cluster that can preempt jobs) you would just always run in dynamic recover mode. Thoughts? This would actually be pretty easy to implement, and would fit Sherlock's system perfectly. Everything would be handled automatically.

pypiper function to add sample attributes

Just brainstorming a bit here... Pypiper should enable you to add attributes to a sample, and then it could write the sample object (as a sample yaml) at the end of the pipeline run.

Andre I think your classes accomplish this, but right now there's no explanation or documentation of how people can do that.

Also, right now your pipelines have to import looper, which is not ideal:

from looper.models import AttributeDict, Sample

What if we added the generic sample object to Pypiper?

version in __version__

In [2]: pypiper.__version__
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-a0c3be52517d> in <module>()
----> 1 pypiper.__version__

AttributeError: 'module' object has no attribute '__version__'

pypiper should expose its version in __version__
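
A common, minimal way to do this (a sketch; the version number is a placeholder and this is not necessarily how pypiper ended up structuring it):

# pypiper/_version.py: single source of truth for the version
__version__ = "0.0.0"        # placeholder, not a real release number

# pypiper/__init__.py
from ._version import __version__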

profiling report

Pypiper could produce a profiling report, Pipeline_profile.txt. It would include:

A single line for each subprocess that is run, detailing:

  • elapsed time
  • peak memory used

Also, this could indicate elapsed time between each timestamp.

Implementation: while a subprocess is running, track its local max memory use and elapsed time. When each subprocess completes, write a line to the report with those values.

A killed child process should also write its elapsed time and max memory, even if it didn't complete.
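
The reporting itself could be as simple as appending one line per subprocess (a sketch; the column layout is illustrative):

import time

def report_profile(profile_path, command_name, start_time, peak_memory_gb):
    # Append one tab-separated profiling line per completed (or killed) subprocess.
    # Sketch only; the real column layout may differ.
    elapsed = time.time() - start_time
    with open(profile_path, "a") as handle:
        handle.write("{}\t{:.1f}s\t{:.2f}GB\n".format(
            command_name, elapsed, peak_memory_gb))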

Such a report would make it easy to identify the most time-consuming and memory-consuming steps of a pipeline for optimization. Anyone interested?

recording reference genomes in log outputs

Right now if you want to run the same pipeline (say, RRBS) on the same set of samples, with 2 different reference genomes (say, hg19 and hg38, or hg38 and mm10), you can do this and results will be intact, but your log files, stats files, profiles, etc. will be merged because pypiper writes output files specific to pipelines, but not to reference genomes.

Should the filenames for such log files be amended to include the reference genome?

Enable full pipelinemanager-free usage of NGSTk

(dev branch)
Right now, if a pipeline manager object (pm) is not passed to NGSTk, many commands cannot be run, since they call the tools attributes of the NGSTk self, resulting in an error.

Example:

from pypiper import NGSTk
tk = NGSTk()
tk.fastqc("a", "a")

AttributeError: 'str' object has no attribute 'fastqc'

This is because the fastqc command internally calls "self.tools.fastqc", and since tools does not exist (it comes from the pipeline manager), the lookup returns the string "tools", which has no attribute "fastqc".

We should enable all commands in NGSTk to be run even when a pm object is not passed to it at initialization.
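
One way to make that work, sketched: when no manager (and hence no configured tools section) is supplied, fall back to a mapping in which every tool name resolves to itself, so commands simply rely on executables found on PATH (the class names and fastqc flags below are illustrative):

class _SelfNamingTools(object):
    # Fallback 'tools' object: every attribute resolves to its own name.
    def __getattr__(self, name):
        return name

class NGSTkSketch(object):
    # Illustration only; not the real NGSTk class.
    def __init__(self, tools=None):
        # Use configured tool paths when provided, else the self-naming fallback.
        self.tools = tools if tools is not None else _SelfNamingTools()

    def fastqc(self, input_file, output_dir):
        return "{} --noextract --outdir {} {}".format(
            self.tools.fastqc, output_dir, input_file)

print(NGSTkSketch().fastqc("a.bam", "fastqc_out"))  # "fastqc --noextract --outdir fastqc_out a.bam"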

Handling pipelines/protocols with dependent tasks

#48

I've been thinking about this.
Background: That function is not used anywhere within pypiper. It's only used (apparently) in this pipeline (or maybe some others).
My thoughts: I actually think it sort of breaks the mantra of pypiper to have it called externally. I think pypiper should manage the waiting, entirely via run. If something outside is trying to mess with waiting, that indicates that something complex and confusing is going on in the pipeline, which isn't in line with pypiper's goal. In this case, it indicates that this pipeline is trying to implement external dependencies, which I think is not the right way to solve the problem. There are other, better ways to solve it that are more straightforward and fitting with the goals of the tools.
So I would be in favor of leaving it there, but protected, exactly to indicate that you can do this, but you shouldn't.
I see no reason to wait for a file at this point; that implies cross-pipeline communication. But it may become useful in the future.
It's in there in case it becomes useful, but it is not currently being used, I guess.

Pypiper tutorials

I produced one Pypiper tutorial (in example_pipelines/) but it would be nice to have a couple of other tutorials showing more advanced features. I outlined these in the readme (maybe an advanced tutorial and a bioinformatics example tutorial).

I'm happy to create such tutorials, but I thought I would invite someone else to participate. Is anyone else interested in being a contributor on this project and putting together a few tutorials? It shouldn't take long.

time formats

Elapsed time could always display seconds, but could also display hours and days. Something like this (R code):

secToTime = function(timeInSec) {
    hr = timeInSec %/% 3600 #hours
    min = timeInSec %% 3600 %/% 60 #minutes
    sec = timeInSec %% 60 #seconds
    return(paste0(sprintf("%02d", hr), "h ", sprintf("%02d", min), "m ", sprintf("%02.01f", signif(sec, 3)), "s"))
}
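
A Python equivalent of the same formatting, for reference (a sketch):

def sec_to_time(time_in_sec):
    # Format seconds as 'HHh MMm SS.Ss', mirroring the R snippet above.
    hr = int(time_in_sec) // 3600
    minute = (int(time_in_sec) % 3600) // 60
    sec = time_in_sec % 60
    return "{:02d}h {:02d}m {:04.1f}s".format(hr, minute, sec)

print(sec_to_time(93784.6))  # "26h 03m 04.6s"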

Memory visibility

Peer inside shells that enclose Python-spawned subprocess(es) to more accurately assess peak memory usage of pipeline stages.

NGSTK peak caller function signature homogenization

Homogenization seems to be such a 🔥 theme right now. Anyway...
It'd be nice to bring the signatures of the peak caller functions into closer alignment. There's a slight discrepancy in both names and order: paths to treatment/control come first and are adjacent in sppCallPeaks, while they're not adjacent in macs2CallPeaks. Additionally, the MACS2 wrapper uses plural versions of the parameter names.

The first one is clearly easy and I could just switch it, but I wanted to ask before doing so to get a sense of how widely these functions are used. Is there significant usage beyond open_pipelines?
For the second issue, it looks like the R script that the SPP function wraps doesn't support multiple input files for treatment or for control, hence the singular parameter names. The easy way out would be to singularize the names in the MACS2 wrapper, but the more interesting (and perhaps better) way would be to support multiple treatment and control paths in the SPP script.

  • Parameter order
  • Parameter names
