sunbeam-labs / sunbeam
A robust, extensible metagenomics pipeline
Home Page: http://sunbeam.readthedocs.io
Ensure all the package versions are correct (bioconda idba instead of eclarke)
The following keys need to be added in the default config file in sunbeamlib so that sunbeam_init
works, and we should also see what caused the tests not to pick this up:
Putative feature list for release 1.0:
Compatibility issue with snakemake version 3.13.2; workaround by forcing 3.13.0 (conda install snakemake=3.13.0).
Top-level traceback:
Error in job parse_genes_mga while creating [output files]
RuleException:
AttributeError in line 39 of ~/sunbeam/rules/annotation/orf.rules:
'Namedlist' object has no attribute 'readline'
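This error suggests the rule is calling .readline() directly on Snakemake's input object, which is a Namedlist (a list-like collection of file paths), not a file handle. A minimal sketch of the likely fix (the helper name is hypothetical):

```python
# Snakemake's `input` is a Namedlist of *paths*, not an open file, so
# input.readline() raises AttributeError. Open a path from it instead.

def first_line(input_files):
    """Return the first line of the first input file."""
    with open(input_files[0]) as handle:  # open the path before reading
        return handle.readline()
```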
It looks like specifying a config file in the Snakefile on line 25 prevents the --configfile flag from working correctly: it ignores the flag and continues parsing example_config.yml. Can we revert to the previous behavior (specifying a configfile is required, raising an error if not specified) until we figure out a workaround?
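A minimal sketch of that revert, assuming the check sits near the top of the Snakefile (the message text is hypothetical):

```
# require an explicit --configfile rather than falling back to a default;
# with no configfile directive or flag, Snakemake leaves `config` empty
if not config:
    raise SystemExit("Please specify a config file with --configfile")
```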
I'm seeing some cases where the auto-generated IGV images don't quite show the full genome, as though it's been zoomed in slightly in the toolbar in the IGV GUI. It looks like I can fix this by explicitly saying goto <sequence_id>:1-<sequence_length> every time in the commands right after loading the genome fasta file, rather than just an optional goto <sequence_id> for multiple segments/chromosomes only.
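A sketch of generating those batch commands with an explicit full-range goto (the function name is hypothetical; sequence lengths would come from the fasta or its index):

```python
def igv_goto_lines(seq_lengths):
    """Build IGV batch 'goto' commands that force a full-sequence view.

    seq_lengths: dict mapping sequence id -> length in bp.
    """
    return ["goto {0}:1-{1}".format(seq, length)
            for seq, length in sorted(seq_lengths.items())]
```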
When a tool called from the shell undergoes a core dump (due to running out of memory?), Snakemake doesn't detect it as an error. This may be because the tool does not return for some reason, or it could be due to an issue in the job submission process on the cluster.
Possible workarounds:
TypeError: string indices must be integers
This arises because _build_samples_from_file in sunbeamlib produces a different struct than _build_samples_from_dir.
Temporarily commenting the test for this out of the test.sh script to resolve other issues.
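A toy illustration of the kind of shape mismatch that produces this TypeError (the actual return shapes in sunbeamlib may differ; this is an assumption):

```python
# If one builder returns {sample: {"R1": path, "R2": path}} and the other
# returns {sample: path}, downstream code doing entry["R1"] on the plain
# string raises "TypeError: string indices must be integers".
from_dir = {"sampleA": {"R1": "a_R1.fastq", "R2": "a_R2.fastq"}}
from_file = {"sampleA": "a_R1.fastq"}  # flat string instead of a dict

def r1_path(samples, name):
    entry = samples[name]
    # normalize: accept either shape until the two builders agree
    return entry["R1"] if isinstance(entry, dict) else entry
```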
As I have it written right now, bowtie2 keeps all reads in its output files, even unaligned ones, so the total size can be much bigger than it needs to be. It should default to leaving those out but use a Snakemake parameter and Sunbeam configuration option to control it explicitly.
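A sketch of how that option could flow from the config into the bowtie2 command line (the config key name is an assumption; --no-unal is bowtie2's real flag for suppressing unaligned reads in the output):

```python
def bowtie2_extra_args(config):
    """Return extra bowtie2 flags based on a Sunbeam config dict.

    By default unaligned reads are dropped via --no-unal; setting
    keep_unaligned: true in the config (hypothetical key) keeps them.
    """
    if config.get("keep_unaligned", False):
        return []
    return ["--no-unal"]
```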
The default IGV preferences don't give enough screen space on the left panel to display long input filenames, so multiple similar names can't be distinguished. Setting the panel width explicitly as a preference would fix this.
Currently encountering the same error in calling a few rules related to contig annotation.
Specifically:
"Workflow Error:Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards."
This error appears when individually calling at least the following rules. This occurs even when valid paths are given to nucleotide and protein databases in the config file.
find_genes_mga
run_blastn
run_blastp
run_blastx
However, when calling the rule all_annotate, there is no error even though that rule includes run_blastn.
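A likely explanation, sketched as a Snakefile fragment (the expand pattern and output paths are assumptions): a rule whose output contains wildcards cannot be requested by name, because Snakemake has no way to fill in the wildcard values; all_annotate works because its input expands the wildcards into concrete filenames.

```
rule all_annotate:
    input:
        # concrete paths, so run_blastn's {sample} wildcard gets pinned down
        expand("annotation/blastn/{sample}.blastn", sample=Samples.keys())
```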
When the number of references and samples is high, there are far too many files written directly to the mapping output directory. We should split that up into sections (like the qc directory has, for example).
If cutadapt is skipped, gzipped files lose their .gz during this move:
https://github.com/eclarke/sunbeam/blob/51fbb55549258dfb373ea8f145ad4e24de76ad9d/rules/qc/qc.rules#L40
This causes errors downstream because further rules expect plaintext and receive gzipped.
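A sketch of a move that keeps the .gz suffix intact when cutadapt is skipped (all paths are placeholders; the stand-in input file is created only so the snippet is self-contained):

```shell
set -e
mkdir -p input qc/01_cutadapt
: > input/sample_R1.fastq.gz   # stand-in for a real gzipped input

src="input/sample_R1.fastq.gz"
dest_base="qc/01_cutadapt/sample_R1.fastq"
case "$src" in
    *.gz) dest="${dest_base}.gz" ;;  # keep .gz so downstream rules see gzip
    *)    dest="$dest_base" ;;
esac
cp "$src" "$dest"
```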
As suggested in the title, the default --maxseqlength of vsearch is 50,000, and we don't want to filter those long contigs out. Add --maxseqlength 10000000000.
We should have a section in the documentation pertaining to running things on clusters, including tips and tricks like -w90.
When I ran the testing script from the stable branch as suggested in the tutorial, I got an error when running rule remove_low_complexity. I attached the error log file test_all.err.txt; related to #123.
Following up on 20180328:
I figured out this is an issue with the kcomplexity conda package. So I ran conda remove kcomplexity and installed kcomplexity from the git repo as a temporary workaround.
And here is the related issue for rust.
Soliciting comments from @kylebittinger and @zhaoc1
We have bare-bones functional testing up and working right now. I would like to get things a little bit more formal for future development, since we now have active users and we need to worry about things breaking during our updates.
Highest priority:
- All shell scripts should have set -e enabled, and they should not be committed if it is missing or commented out.
- Nothing should be committed to master unless it passes functional testing on Travis.
- Code quality of the sunbeam/sunbeam package. We should test this using a service like Landscape to ensure code correctness and well-formattedness.

Architectural changes:
The way I have it written, the bowtie2_align rule assumes FASTQ input with paired-end reads in separate files. Ideally it would just do the right thing based on what's in the Samples dict.
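A sketch of dispatching on the Samples dict (the per-sample dict layout assumed here, with "R1" and optional "R2" keys, is an assumption; -1/-2/-U are bowtie2's real read arguments):

```python
def bowtie2_input_args(sample):
    """Build bowtie2 read arguments from one Samples entry."""
    if sample.get("R2"):  # paired-end: both mate files present
        return ["-1", sample["R1"], "-2", sample["R2"]]
    return ["-U", sample["R1"]]  # single-end / unpaired reads
```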
In the QC step of the pipeline, there is removal of custom sequences (defined in the config file as Cutadapt's fwd and rev adaptors) that are introduced in the Bushman lab cDNA synthesis workup of RNA samples. These sequences would be expected to be found at both 5' and 3' ends of cDNA. In some cases, these sequences can form long concatemers (again a product of the Bushman lab cDNA synthesis step) and are therefore of little value to be analyzed in downstream steps.
However, these sequences would not be considered wetside artifacts in sequencing of DNA samples as they are not deliberately introduced at any point in the library preparation.
In a typical DNA sample, these sequences are identified in roughly 10% of reads (probably just by chance). Currently, these reads are discarded. However, in the case of DNA samples there is no reason to discard them, so we are losing 10% of "real" data.
Some suggestions to address separate treatment of DNA versus cDNA samples:
This would also be important in cDNA submitted by other users who don't necessarily use the same protocol as we do.
The Readme file shows conda env -d sunbeam to remove the existing Sunbeam environment, but it looks like the correct syntax should be conda env remove -n sunbeam.
IGV is now on bioconda, so we should use that rather than our custom install script.
NameError while running decontamination. Trace per Erik's request:
194 of 2691 steps (7%) done
rule decontam_human:
input: sunbeam_output/qc/paired/SSND_R1.fastq, sunbeam_output/qc/paired/SSND_R2.fastq
output: sunbeam_output/qc/decontam-human/SSND_R1.fastq, sunbeam_output/qc/decontam-human/SSND_R2.fastq
log: sunbeam_output/qc/log/decontam-human/SSND_summary.json
wildcards: sample=SSND
Error in job decontam_human while creating output files sunbeam_output/qc/decontam-human/SSND_R1.fastq, sunbeam_output/qc/decontam-human/SSND_R2.fastq.
RuleException:
NameError in line 29 of sunbeam/rules/qc/decontaminate.rules:
The name 'human_index_fp' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
File "sunbeam/rules/qc/decontaminate.rules", line 29, in __rule_decontam_human
File "~/miniconda3/envs/sunbeam/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Currently the mapping rules take the raw data files as input. Instead they should use the QC'd data, as other steps already do.
In my environment, currently with conda 4.3.33, install.sh doesn't detect an already-existing environment.
Line 25 greps for the environment name:
conda env list | grep -Fxq $SUNBEAM_ENV_NAME
But conda env list gives paths as well as names, so the grep doesn't match any lines. For example:
$ conda env list
# conda environments:
#
ExampleProject /home/jesse/miniconda3/envs/JesseProject
circonspect /home/jesse/miniconda3/envs/circonspect
gcc5 /home/jesse/miniconda3/envs/gcc5
I don't remember this happening until recently. Did conda's output change, maybe?
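A sketch of a fix: compare against the name column only, feeding the listing on stdin so the path column can't interfere (the function name is hypothetical):

```shell
# succeed if the given env name appears in `conda env list` output on stdin,
# matching only the first whitespace-separated field on each line
env_exists() {
    awk '{print $1}' | grep -Fxq "$1"
}
# usage: conda env list | env_exists "$SUNBEAM_ENV_NAME"
```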
We should have better default values on the config file to avoid the series of errors relating to paths.
The igv_snapshot rule calls a function that assumes a constant X server number for running IGV, so if that rule is used in parallel with --threads, it fails. I should be able to fix that by letting it use the first available X server number each time.
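A sketch of picking the first free display by checking for X lock files (the .X<n>-lock convention is standard for Xorg/Xvfb; the function name is hypothetical):

```python
import os

def first_free_display(start=1, lock_dir="/tmp"):
    """Return the lowest display number with no X lock file in lock_dir."""
    n = start
    while os.path.exists(os.path.join(lock_dir, ".X{0}-lock".format(n))):
        n += 1
    return n
```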
Right now the mapping rules only align to a set of existing fasta files provided as input to Sunbeam. It would be useful to also allow the mapping of reads to the contigs created by the assembly section. We should add this as a new feature in the mapping section.
While writing tests and creating test data, we would like to inspect some of the intermediate files.
Requesting a new feature -- if a directory is passed to the test.sh script, then the test output should be written there. Otherwise, the output should be written to a temp file and removed, as it is now.
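A sketch of the requested behavior for test.sh (assuming bash; only the directory the script itself created gets cleaned up):

```shell
# use a caller-supplied output dir if given; otherwise a temp dir we clean up
if [ "$#" -ge 1 ]; then
    output_dir="$1"
    mkdir -p "$output_dir"
else
    output_dir=$(mktemp -d)
    trap 'rm -rf "$output_dir"' EXIT  # remove only the dir we created
fi
```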
Error in rule mask_low_complexity:
jobid: 0
output: /home/guanxian/sunbeam_output/qc/masked/D5_R2.fastq.gz
RuleException:
KeyError in line 92 of /gpfs/fs02/home/guanxian/sunbeam/rules/qc/qc.rules:
'mask_low_complexity'
File "/gpfs/fs02/home/guanxian/sunbeam/rules/qc/qc.rules", line 92, in __rule_mask_low_complexity
File "/home/guanxian/miniconda3/envs/sunbeam/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/fs02/home/guanxian/sunbeam/.snakemake/log/2018-03-01T135544.313766.snakemake.log
When run under the latest snakemake available from the bioconda Anaconda channel, multiple rules fail because some of the snakemake objects have changed. Version 3.13.2 fails, but 3.13.0 still works.
This isn't something we need anymore (moving to Kraken)
The Samples dictionary built from reading the barcode files had empty values (the paths to the files), so rule custom_removal actually failed. This needs to be fixed.
There seems to be an update or change to kraken-build where this step fails due to a prompt. For whatever reason, this only happens locally, not on Travis.
https://github.com/eclarke/sunbeam/blob/51fbb55549258dfb373ea8f145ad4e24de76ad9d/tests/test.sh#L55
Switch to rust-bio-tools' fastq-filter
Sometimes Snakemake raises a MissingFilesError and then encounters another error while handling the missing files. This prevents the job from returning and prevents the workflow from continuing. It seems to be a bug, because re-running often completes without error. We need to know why this occurs, but the fix may actually need to happen on Snakemake's end.
All rules should take gzipped fastq files and output gzipped fastq files (unless they're intermediate steps, in which case the uncompressed fastq outputs should be marked with Snakemake's temp() flag).
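A sketch of the temp() pattern (the rule, paths, and command here are illustrative, not Sunbeam's actual rules):

```
rule example_decompress:
    input: "qc/{sample}.fastq.gz"
    output: temp("qc/{sample}.fastq")  # removed once all consumers finish
    shell: "gunzip -c {input} > {output}"
```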
In this case, the igv.sh file is never written inside the local/ directory, and the test script fails.
The testing system I had cobbled together for this pipeline isn't working currently. We should figure out what's wrong with it and fix it so that pull requests can be checked and merged automatically.
Can the end user request saving of the intermediate files in the assembly step (IDBA-UD)? These intermediate files can aid in finding reads that are mapped to contigs (or not) and can be valuable for downstream processing.
There's been an extensive amount of work from @ressy getting read mapping and associated visualizations working. Those should be documented in the Readme so people know how to use them.
The mapping rules that call bowtie2 and samtools aren't using those programs' multithreading support, so they could run much faster than they do right now.
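A sketch of wiring the rule's thread count through (-p and -@ are bowtie2's and samtools sort's real thread flags; the rest of the rule body is illustrative):

```
rule bowtie2_align:
    # input/output/params omitted; illustrative only
    threads: 4
    shell:
        "bowtie2 -p {threads} -x {params.index} -1 {input.r1} -2 {input.r2} "
        "| samtools sort -@ {threads} -o {output.bam} -"
```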
Conda debug or update stalls on the PMACS cluster.
The prompt hangs at
Fetching package metadata ...
Note, this is not an issue on microb120 and 191. I am reading more about this, and it seems this issue could be due to some proxy settings on the cluster?
Do you have a quick solution for this?
Error near the end of the install.sh progression, although it appears not to have prevented install (I'll double check to make sure everything works)
install.sh: line 50: Solving: command not found
2018-03-16 19:46:26 UTC [ error] Error in ~~sunbeam/install.sh in function debug_capture on line 50
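"Solving: command not found" usually means the shell executed a command's *output* as a command; conda's progress line starts with "Solving environment:". A toy reproduction of the suspected pattern (this is an assumption about what install.sh actually does):

```shell
# buggy pattern (assumed): a bare $(...) makes bash execute the output,
# whose first word is "Solving", hence "Solving: command not found":
#   $(conda install -y some-package)
# safer: capture the output in a variable (or redirect it) instead
output="$(echo 'Solving environment: done')"  # captured, not executed
```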
If IDBA-UD does not generate a contig, then the follow-up program cap3 (which needs the contigs from IDBA-UD to process them further) gives an error.
Perhaps always creating a contig file from IDBA-UD, even an empty one, would solve this problem.
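A sketch of that guard, ensuring cap3 always has a contig file to read (the paths are placeholders):

```shell
contigs="assembly/sample/contig.fa"
mkdir -p "$(dirname "$contigs")"
# idba_ud would run here and may finish without writing contig.fa
[ -f "$contigs" ] || : > "$contigs"  # create an empty contig file if missing
```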
This is confusing to new users and is only applicable for Travis
When I qsub all_decontam to respublica, I got the following error messages from the filter_reads rule:
"Error occurred during initialization of VM
Cannot create VM thread. Out of system resources."
After googling the error message, this seems to be a java version issue.
My experience with respublica is that it doesn't allow passing LD_LIBRARY_PATH, for security reasons, and this causes errors when the local java version differs from the conda environment's java version. One way to work around this, thanks to @ressy, is to explicitly set LD_LIBRARY_PATH="$CONDA_PREFIX/lib64". However, I am not sure whether this is the reason for our error message.
Since we are filtering reads by id, shall we just add a python script to sunbeamlib to do the work?
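A sketch of such a script's core, assuming standard 4-line FASTQ records (the function name is hypothetical):

```python
def filter_fastq_by_ids(in_handle, out_handle, keep_ids):
    """Copy only the 4-line FASTQ records whose read id is in keep_ids."""
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of input
        record = [header] + [in_handle.readline() for _ in range(3)]
        # the id is the first word of the header, without the leading '@'
        read_id = header.split()[0][1:]
        if read_id in keep_ids:
            out_handle.writelines(record)
```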