Comments (50)

brwnj avatar brwnj commented on August 26, 2024

In c8952cd, the checkm reference database is now downloaded via atlas download and the subsequent rule initialize_checkm writes logs/checkm_init.txt. If a user has issues with this rule, they can delete logs/checkm_init.txt to re-run the rule. If the user already has everything set up, they will still have to make sure the database files are pre-downloaded before running assembly.

SilasK avatar SilasK commented on August 26, 2024

I have to look at this tomorrow.
@camel315 You are running this on a cluster? Which system?
Do you have internet on the execution machine?

@brwnj I don't understand why checkm doesn't put out an error or a log file?

As a manual fix: have a look at the script atlas/atlas/rules/initialize_checkm.py; there you can see the two commands used to download the checkm databases.

 avatar commented on August 26, 2024

@SilasK Yes, I am running this on a cluster. It runs SUSE Linux Enterprise 2011 with the Load Sharing Facility (LSF) batch system and has more than 200 nodes and 3,000 cores. The cluster's components are connected by a very fast network.

 avatar commented on August 26, 2024

@SilasK I have probably found the reason, but I do not know how to fix it. I downloaded the CheckM data manually from https://data.ace.uq.edu.au/public/CheckM_databases/. This took around 20 minutes in my case. However, when the pipeline runs, these files are automatically deleted, leaving only an empty folder. When CheckM needs these database files, it reports an error and stops running.

SilasK avatar SilasK commented on August 26, 2024

So atlas tries to overwrite the files, but doesn't download them?

What happens when you use the command checkm data setRoot to set the folder and then checkm data update to download the data? Which error do you get?

 avatar commented on August 26, 2024

@SilasK How can I do this test, in initialize_checkm.py or assemble.snakefile?
The original commands:
run_popen(["checkm", "data", "setRoot"], [snakemake.params.database_dir, snakemake.params.database_dir])
run_popen(["checkm", "data", "update"], ["y", "y"])
Shall I change them to:
run_popen(["checkm", "data", "setRoot"], [snakemake.params.database_dir, snakemake.params.database_dir, snakemake.params.database_dir])
run_popen(["checkm", "data", "update"], ["y", "y", "y"])
Also, in the latest version of CheckM on GitHub, I did not find the initialize_checkm.py script.
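
Note: run_popen's definition isn't shown in this thread. From the call shape above, it is presumably a small helper that launches the command and answers its interactive prompts via stdin. A minimal sketch under that assumption (not the actual atlas implementation):

import subprocess

def run_popen(cmd, answers):
    # Launch cmd and feed each expected interactive answer on its own line
    # via stdin. Hypothetical reconstruction of the atlas helper.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, universal_newlines=True)
    proc.communicate("\n".join(answers) + "\n")
    return proc.returncode

Under that reading, extra answers only matter if checkm actually asks additional questions, so duplicating list entries is unlikely to fix a failed download.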

SilasK avatar SilasK commented on August 26, 2024

In #64 I added a log file to the checkm_init rule. If you use the corresponding branch you can test again and see why atlas doesn't work.

git clone https://github.com/pnnl/atlas.git
cd atlas
git checkout assembly
python setup.py install develop

I'm sorry, but as I understand it, checkm is very complicated to integrate into a pipeline.

 avatar commented on August 26, 2024

New errors:
Executing: snakemake --snakefile /home/syang/anaconda3/lib/python3.6/site-packages/atlas/Snakefile --directory /panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang --printshellcmds --jobs 40 --rerun-incomplete --configfile '/panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/config.yaml' --nolock --use-conda --config workflow=complete --latency-wait 10
WorkflowError in line 159 of /home/syang/anaconda3/lib/python3.6/site-packages/atlas/Snakefile:
Failed to open /home/syang/anaconda3/lib/python3.6/site-packages/atlas/rules/qc.snakefile.
File "/home/syang/anaconda3/lib/python3.6/site-packages/atlas/Snakefile", line 159, in
[2017-12-07 11:56 CRITICAL] Command 'snakemake --snakefile /home/syang/anaconda3/lib/python3.6/site-packages/atlas/Snakefile --directory /panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang --printshellcmds --jobs 40 --rerun-incomplete --configfile '/panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/config.yaml' --nolock --use-conda --config workflow=complete --latency-wait 10' returned non-zero exit status 1.

SilasK avatar SilasK commented on August 26, 2024

I recently split the assemble snakefile in two: qc.snakefile and assemble.snakefile. Somehow the new snakefile didn't get installed:

/home/syang/anaconda3/lib/python3.6/site-packages/atlas/rules/qc.snakefile

Check whether the git repository you downloaded contains atlas/rules/qc.snakefile.
Maybe uninstall and reinstall atlas.

 avatar commented on August 26, 2024

After I reinstalled atlas, there is still the problem with initialize_checkm:
Conda environment defines Python version < 3.3. Using Python of the master process to execute script.
/home/syang/anaconda3/bin/python /home/syang/anaconda3/lib/python3.6/site-packages/atlas/rules/.snakemake.waps8v21.initialize_checkm.py


[CheckM - data] Check for database updates. [setRoot]


Data location successfully changed to: /panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/databases/checkm


[CheckM - data] Check for database updates. [update]


Waiting at most 10 seconds for missing files.
Error in job initialize_checkm while creating output files /panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/databases/checkm/test_data/637000110.fna, ....
MissingOutputException in line 476 of /home/syang/anaconda3/lib/python3.6/site-packages/atlas/rules/assemble.snakefile:
Missing files after 10 seconds:
...
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job initialize_checkm since they might be corrupted:
/panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/databases/checkm/.dmanifest, logs/checkm_init.txt
Will exit after finishing currently running jobs.
Finished job 184.
1 of 241 steps (0.41%) done
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

SilasK avatar SilasK commented on August 26, 2024

Can you now send me the log file working_dir/logs/initialize_checkm.log?

 avatar commented on August 26, 2024

@SilasK Sorry to disturb you. Regarding the development install of atlas you suggested:
git clone https://github.com/pnnl/atlas.git
cd atlas
git checkout assembly
python setup.py install develop

It generates a new folder 'atlas', which is not the same as the installation done with 'pip install -U pnnl-atlas' or with 'pip install git+https://github.com/pnnl/atlas.git'. In this case, will the path automatically point to the new atlas? I am a bit confused on this point.

 avatar commented on August 26, 2024

@SilasK, it should be pip uninstall pnnl-atlas. I have removed all packages and reinstalled from conda. All the *.py and *.snakefile files from your atlas folder were manually copied into the folder where atlas (under anaconda3) had been installed with pip install -U pnnl-atlas. This error popped up (snakemake version 4.3.1):

 Executing: snakemake --snakefile /home/syang/anaconda3/lib/python3.6/site-packages/atlas/Snakefile --directory /panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang --printshellcmds --jobs 24 --rerun-incomplete --configfile '/panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/config.yaml' --nolock --use-conda  --config workflow=complete  --latency-wait 20
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
Building DAG of jobs...
Creating conda environment /home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/required_packages.yaml...
Traceback (most recent call last):
  File "/home/syang/anaconda3/lib/python3.6/site-packages/snakemake/conda.py", line 161, in create
    stderr=subprocess.STDOUT)
  File "/home/syang/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/syang/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['conda', 'env', 'create', '--file', '/home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/required_packages.yaml', '--prefix', '/panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/.snakemake/conda/1765b780']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/syang/anaconda3/lib/python3.6/site-packages/snakemake/__init__.py", line 520, in snakemake
    cluster_status=cluster_status)
  File "/home/syang/anaconda3/lib/python3.6/site-packages/snakemake/workflow.py", line 518, in execute
    dag.create_conda_envs(dryrun=dryrun)
  File "/home/syang/anaconda3/lib/python3.6/site-packages/snakemake/dag.py", line 172, in create_conda_envs
    env.create(dryrun)
  File "/home/syang/anaconda3/lib/python3.6/site-packages/snakemake/conda.py", line 170, in create
    e.output.decode())
snakemake.exceptions.CreateCondaEnvironmentException: Could not create conda environment from /home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/required_packages.yaml:
Fetching package metadata ...Using Anaconda API: https://api.anaconda.org

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/bioconda/linux-64/repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='conda.anaconda.org', port=443): Max retries exceeded with url: /bioconda/linux-64/repodata.json (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x2b48b4b6aa20>: Failed to establish a new connection: [Errno -2] Name or service not known',))",),)



[2017-12-08 12:17 CRITICAL] Command 'snakemake --snakefile /home/syang/anaconda3/lib/python3.6/site-packages/atlas/Snakefile --directory /panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang --printshellcmds --jobs 24 --rerun-incomplete --configfile '/panfs/panfs14.gfz-hpcc.cluster/home/gmb/syang/config.yaml' --nolock --use-conda  --config workflow=complete  --latency-wait 20' returned non-zero exit status 1.

SilasK avatar SilasK commented on August 26, 2024

@camel315 Might it be that you don't have internet access on the server? You can also try to run atlas on your local server with the additional argument --until initialize_checkm. This should execute only that one step.

Edit: updated the --until command with the correct spelling of the rule.

 avatar commented on August 26, 2024

@SilasK It was running on the cluster. I can download the atlas repository from git, e.g.
syang@glic2: git clone https://github.com/pnnl/atlas.git
Cloning into 'atlas'...
remote: Counting objects: 2679, done.
remote: Compressing objects: 100% (164/164), done.
remote: Total 2679 (delta 207), reused 189 (delta 122), pack-reused 2393
Receiving objects: 100% (2679/2679), 4.99 MiB | 2.80 MiB/s, done.
Resolving deltas: 100% (1686/1686), done.

SilasK avatar SilasK commented on August 26, 2024

It seems you don't have internet access:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url
https://conda.anaconda.org/bioconda/linux-64/repodata.json

What happens when you run:

conda env create --file /home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/required_packages.yaml

 avatar commented on August 26, 2024

@SilasK conda env create --file /home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/required_packages.yaml
Using Anaconda API: https://api.anaconda.org

CondaValueError: prefix already exists: /home/syang/anaconda3

SilasK avatar SilasK commented on August 26, 2024

Can you answer my question: do you have internet on the executing cluster machine?

SilasK avatar SilasK commented on August 26, 2024

Starting again, create a new conda environment:
conda create -n atlas_env -c bioconda python=3.5 snakemake bbmap=37.17 click

source activate atlas_env
git clone https://github.com/pnnl/atlas.git
cd atlas
git checkout assembly
python setup.py install develop

Now you have a fresh atlas version in the conda environment atlas_env.

You can also try to run atlas on your local server with the additional argument --until initialize_checkm

 avatar commented on August 26, 2024

@SilasK I still have not solved my problem and cannot determine its source. Could it be related to a problem with conda? For example, in 'optional_genome_binning.yaml' the package 'maxbin2=2.2.1=r3.3.2_1' could not be found with 'conda search -c bioconda maxbin2'. Could this check cause an error when running conda?

SilasK avatar SilasK commented on August 26, 2024

Do you have internet on the executing machines?

Can you install maxbin using the file, e.g. conda create -n maxbin2_env --file optional_genome_binning.yaml or so?

Try to run atlas without the genome binning, so you at least get the contigs and everything else.

 avatar commented on August 26, 2024

@SilasK For maxbin2 and the related fraggenescan, I get positive feedback from conda search -c bioconda maxbin2, but I cannot get them installed with conda install -c bioconda maxbin2.

syang@glic2:~> conda search -c bioconda fraggenescan
Fetching package metadata .............
fraggenescan                 1.30                 pl5.22.0_0  bioconda        
                             1.30                 pl5.22.0_1  bioconda        
syang@glic2:~> conda install -c bioconda fraggenescan
Fetching package metadata .............
Solving package specifications: 

PackageNotFoundError: Packages missing in current channels:
            
  - fraggenescan -> perl 5.22.0*

We have searched for the packages in the following channels:
            
  - https://conda.anaconda.org/bioconda/linux-64
  - https://conda.anaconda.org/bioconda/noarch
  - https://repo.continuum.io/pkgs/main/linux-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/linux-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/linux-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/linux-64
  - https://repo.continuum.io/pkgs/pro/noarch
            

syang@glic2:~> conda search -c bioconda maxbin2
Fetching package metadata .............
maxbin2                      2.2.1                         0  bioconda        
                             2.2.1                  r3.3.1_1  bioconda        
                             2.2.1                  r3.3.2_1  bioconda        
                             2.2.1                  r3.4.1_1  bioconda        
                             2.2.4                  r3.4.1_0  bioconda        
syang@glic2:~> conda install -c bioconda maxbin2
Fetching package metadata .............
Solving package specifications: 

PackageNotFoundError: Packages missing in current channels:
            
  - maxbin2 -> fraggenescan >=1.30 -> perl 5.22.0*

We have searched for the packages in the following channels:
            
  - https://conda.anaconda.org/bioconda/linux-64
  - https://conda.anaconda.org/bioconda/noarch
  - https://repo.continuum.io/pkgs/main/linux-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/linux-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/linux-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/linux-64
  - https://repo.continuum.io/pkgs/pro/noarch

I am not sure whether this is the reason for the conda HTTP error.

With conda create -n maxbin2_env --file /home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/optional_genome_binning.yaml I get:

CondaValueError: could not parse '- python=2.7' in: /home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/optional_genome_binning.yaml
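
Note on the parse error above: conda create --file expects a plain list of package specs, while environment YAML files such as optional_genome_binning.yaml are parsed by conda env create; that mismatch would explain the CondaValueError. A sketch of invoking the correct subcommand from Python, using the path quoted in this thread:

import subprocess

# `conda env create` understands environment YAML files ("- python=2.7"
# style entries); `conda create --file` does not.
subprocess.run(
    ["conda", "env", "create", "-n", "maxbin2_env", "--file",
     "/home/syang/anaconda3/lib/python3.6/site-packages/atlas/envs/optional_genome_binning.yaml"],
    check=True,
)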

SilasK avatar SilasK commented on August 26, 2024

@camel315 I don't know why you can't parse the file. I assume you also need the conda-forge and maybe the r channel: conda install -c bioconda -c conda-forge -c r maxbin2

 avatar commented on August 26, 2024

@SilasK I have skipped the cluster and gone back to our small workstation. There atlas worked without the CondaHTTP error, but the initialize_checkm error still exists. The CheckM files, which I had manually extracted from checkm_data_16012015_v0.9.7.tar.gz, were removed but not regenerated. Please see the message below:

$ cat logs/initialize_checkm.log 

*******************************************************************************
 [CheckM - data] Check for database updates. [setRoot]
*******************************************************************************

Data location successfully changed to: /home/syang/databases/checkm

*******************************************************************************
 [CheckM - data] Check for database updates. [update]
*******************************************************************************

The following error message was automatically printed to the screen:

stats.sh in=OD3/assembly/OD3_prefilter_contigs.fasta format=3 -Xmx16G > OD3/assembly/contig_stats/prefilter_contig_stats.txt                                             
Finished job 17.
3 of 176 steps (2%) done
Waiting at most 5 seconds for missing files.
Error in job initialize_checkm while creating output files /home/syang/databases/checkm/test_data/637000110.fna, /home/syang/databases/checkm/taxon_marker_sets.tsv, /home/syang/databases/checkm/selected_marker_sets.tsv, /home/syang/databases/checkm/pfam/tigrfam2pfam.tsv, /home/syang/databases/checkm/pfam/Pfam-A.hmm.dat, /home/syang/databases/checkm/img/img_metadata.tsv, /home/syang/databases/checkm/hmms_ssu/SSU_euk.hmm, /home/syang/databases/checkm/hmms_ssu/SSU_bacteria.hmm, /home/syang/databases/checkm/hmms_ssu/SSU_archaea.hmm, /home/syang/databases/checkm/hmms_ssu/createHMMs.py, /home/syang/databases/checkm/hmms/phylo.hmm.ssi, /home/syang/databases/checkm/hmms/phylo.hmm, /home/syang/databases/checkm/hmms/checkm.hmm.ssi, /home/syang/databases/checkm/hmms/checkm.hmm, /home/syang/databases/checkm/genome_tree/missing_duplicate_genes_97.tsv, /home/syang/databases/checkm/genome_tree/missing_duplicate_genes_50.tsv, /home/syang/databases/checkm/genome_tree/genome_tree.taxonomy.tsv, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/phylo_modelJqWx6_.json, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.tre, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.log, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.fasta, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/CONTENTS.json, /home/syang/databases/checkm/genome_tree/genome_tree.metadata.tsv, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/phylo_modelEcOyPk.json, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.tre, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.log, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.fasta, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/CONTENTS.json, /home/syang/databases/checkm/genome_tree/genome_tree.derep.txt, /home/syang/databases/checkm/.dmanifest, /home/syang/databases/checkm/distributions/td_dist.txt, /home/syang/databases/checkm/distributions/gc_dist.txt, /home/syang/databases/checkm/distributions/cd_dist.txt, logs/checkm_init.txt.         
MissingOutputException in line 521 of /home/syang/miniconda3/lib/python3.5/site-packages/atlas/rules/assemble.snakefile:                                                  
Missing files after 5 seconds:                                                       
/home/syang/databases/checkm/test_data/637000110.fna                                 
/home/syang/databases/checkm/taxon_marker_sets.tsv                                   
/home/syang/databases/checkm/selected_marker_sets.tsv                                
/home/syang/databases/checkm/pfam/tigrfam2pfam.tsv                                   
/home/syang/databases/checkm/pfam/Pfam-A.hmm.dat                                     
/home/syang/databases/checkm/img/img_metadata.tsv                                    
/home/syang/databases/checkm/hmms_ssu/SSU_euk.hmm                                    
/home/syang/databases/checkm/hmms_ssu/SSU_bacteria.hmm                               
/home/syang/databases/checkm/hmms_ssu/SSU_archaea.hmm                                
/home/syang/databases/checkm/hmms_ssu/createHMMs.py                                  
/home/syang/databases/checkm/hmms/phylo.hmm.ssi                                      
/home/syang/databases/checkm/hmms/phylo.hmm                                          
/home/syang/databases/checkm/hmms/checkm.hmm.ssi                                     
/home/syang/databases/checkm/hmms/checkm.hmm                                         
/home/syang/databases/checkm/genome_tree/missing_duplicate_genes_97.tsv              
/home/syang/databases/checkm/genome_tree/missing_duplicate_genes_50.tsv              
/home/syang/databases/checkm/genome_tree/genome_tree.taxonomy.tsv                    
/home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/phylo_modelJqWx6_.json                                                                                
/home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.tre  
/home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.log  
/home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.fasta
/home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/CONTENTS.json    
/home/syang/databases/checkm/genome_tree/genome_tree.metadata.tsv                    
/home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/phylo_modelEcOyPk.json                                                                                   
/home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.tre
/home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.log
/home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.fasta
/home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/CONTENTS.json
/home/syang/databases/checkm/genome_tree/genome_tree.derep.txt
/home/syang/databases/checkm/distributions/td_dist.txt
/home/syang/databases/checkm/distributions/gc_dist.txt
/home/syang/databases/checkm/distributions/cd_dist.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job initialize_checkm since they might be corrupted:
/home/syang/databases/checkm/.dmanifest, logs/checkm_init.txt
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

brwnj avatar brwnj commented on August 26, 2024

Implementing CheckM wasn't as easy as it should be, so I apologize for your continued headaches and appreciate your patience.

My recommendation would be to delete the checkm database directory -- in your latest instance that's /home/syang/databases/checkm. Also delete logs/checkm_init.txt. Re-running atlas assemble will then re-run rule initialize_checkm and attempt to re-download the checkm reference databases.
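
A minimal sketch of that cleanup in Python, using the paths from this thread (adjust the database path to your own setup):

import os
import shutil

# Remove the partially populated CheckM database directory and the marker
# file so Snakemake re-runs rule initialize_checkm on the next invocation.
shutil.rmtree("/home/syang/databases/checkm", ignore_errors=True)
if os.path.exists("logs/checkm_init.txt"):
    os.remove("logs/checkm_init.txt")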

If you don't have internet access on the compute nodes, run the following on a node that does (e.g. the head node):

atlas assemble --jobs 24 --out-dir results config.yaml --create-envs-only

 avatar commented on August 26, 2024

@brwnj Following your suggestion, I run on cluster with

bsub -n 24 -q qintel -e err.txt -o out.txt atlas assemble --jobs 24 --out-dir results /home/syang/config.yaml --create-envs-only True --latency-wait 20

It reported:
Executing: snakemake --snakefile /home/syang/anaconda3/lib/python3.5/site-packages/atlas/Snakefile --directory /home/syang/results --printshellcmds --jobs 24 --rerun-incomplete --configfile '/home/syang/config.yaml' --nolock --use-conda --config workflow=complete --create-envs-only --latency-wait 20
usage: snakemake [-h] [--snakefile FILE] [--gui [PORT]] [--cores [N]]
[--local-cores N] [--resources [NAME=INT [NAME=INT ...]]]
[--config [KEY=VALUE [KEY=VALUE ...]]] [--configfile FILE]
[--list] [--list-target-rules] [--directory DIR] [--dryrun]
[--printshellcmds] [--debug-dag] [--dag]
[--force-use-threads] [--rulegraph] [--d3dag] [--summary]
[--detailed-summary] [--archive FILE] [--touch]
[--keep-going] [--force] [--forceall]
[--forcerun [TARGET [TARGET ...]]]
[--prioritize TARGET [TARGET ...]]
[--until TARGET [TARGET ...]]
[--omit-from TARGET [TARGET ...]] [--allow-ambiguity]
[--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
[--drmaa-log-dir DIR] [--cluster-config FILE]
[--immediate-submit] [--jobscript SCRIPT] [--jobname NAME]
[--reason] [--stats FILE] [--nocolor] [--quiet] [--nolock]
[--unlock] [--cleanup-metadata FILE [FILE ...]]
[--rerun-incomplete] [--ignore-incomplete]
[--list-version-changes] [--list-code-changes]
[--list-input-changes] [--list-params-changes]
[--latency-wait SECONDS] [--wait-for-files [FILE [FILE ...]]]
[--benchmark-repeats N] [--notemp] [--keep-remote]
[--keep-target-files] [--keep-shadow]
[--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
[--max-jobs-per-second MAX_JOBS_PER_SECOND]
[--restart-times RESTART_TIMES] [--timestamp]
[--greediness GREEDINESS] [--no-hooks] [--print-compilation]
[--overwrite-shellcmd OVERWRITE_SHELLCMD] [--verbose]
[--debug] [--profile FILE] [--mode {0,1,2}]
[--bash-completion] [--use-conda] [--conda-prefix DIR]
[--wrapper-prefix WRAPPER_PREFIX]
[--default-remote-provider {S3,GS,SFTP,S3Mocked}]
[--default-remote-prefix DEFAULT_REMOTE_PREFIX] [--version]
[target [target ...]]
snakemake: error: unrecognized arguments: --create-envs-only

I looked up 'create-envs-only'; it is a parameter of the Snakemake API. How can I use it correctly here?

brwnj avatar brwnj commented on August 26, 2024

You likely just need to update snakemake. The latest available version on bioconda today is 4.3.1.

I thought you had determined before that the compute nodes did NOT have access to the internet. The command line above should be run directly on the head node (or whichever node has outside internet connectivity).

 avatar commented on August 26, 2024

@brwnj The run on our own workstation failed because checkm still wants to initialize. Thus the same error popped up, 'Error in job initialize_checkm while creating output files':
Cannot find ID3/assembly/opts.txt
Please check whether the output directory is correctly set by "-o"
Now switching to normal mode.
MEGAHIT v1.1.2
--- [Thu Dec 14 23:31:08 2017] Start assembly. Number of CPU threads 8 ---
--- [Thu Dec 14 23:31:08 2017] Available memory: 25065529344, used: 50000000000
--- [Thu Dec 14 23:31:08 2017] Converting reads to binaries ---
b' [read_lib_functions-inl.h : 209] Lib 0 (ID3/assembly/reads/normalized.errorcorr.merged_R1.fastq.gz): se, 3209514 reads, 151 max length'
b' [utils.h : 126] Real: 8.4324\tuser: 2.9971\tsys: 0.3602\tmaxrss: 162404'
....
--- [Thu Dec 14 23:44:43 2017] Building graph for k = 121 ---
--- [Thu Dec 14 23:44:50 2017] Assembling contigs from SdBG for k = 121 ---
--- [Thu Dec 14 23:45:38 2017] Merging to output final contigs ---
--- [STAT] 48932 contigs, total 45512920 bp, min 500 bp, max 16287 bp, avg 930 bp, N50 944 bp
--- [Thu Dec 14 23:45:38 2017] ALL DONE. Time elapsed: 870.849289 seconds ---
Removing temporary output file ID3/assembly/reads/normalized.errorcorr.merged_R1.fastq.gz.
Removing temporary output file ID3/assembly/reads/normalized.errorcorr.merged_R2.fastq.gz.
Removing temporary output file ID3/assembly/reads/normalized.errorcorr.merged_se.fastq.gz.
Finished job 181.
4 of 173 steps (2%) done
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

2. For the cluster, all jobs are submitted to the queue administration and then assigned to compute nodes. I am not sure whether I understood your solution correctly. After updating snakemake to 4.3.1, it still showed 'Creating conda environment /home/syang/anaconda3/lib/python3.5/site-packages/atlas/envs/required_packages.yaml...... CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/bioconda/linux-64/repodata.json...... Command 'snakemake --snakefile ... --nolock --use-conda --config workflow=complete --create-envs-only --latency-wait 20' returned non-zero exit status 1'

brwnj avatar brwnj commented on August 26, 2024

CheckM can't be run (easily) without using conda, but an internet connection is only required once. Your compute nodes appear to have no internet, so running it the way you are, you will keep hitting the same error telling you that you don't have internet.

My suggestion was to run the command on a login node or head node, basically one of them that is capable of an outside connection. So, rather than:

bsub -n 24 -q qintel -e err.txt -o out.txt atlas assemble --jobs 24 --out-dir results /home/syang/config.yaml --create-envs-only True --latency-wait 20

It would be simply:

atlas assemble --jobs 24 --out-dir results /home/syang/config.yaml --create-envs-only --latency-wait 20

 avatar commented on August 26, 2024

@brwnj Thank you for your solution. The pipeline starts working now.

 avatar commented on August 26, 2024

@brwnj Sorry to disturb you again with an initialize_checkm error:

Finished job 85.
39 of 241 steps (16%) done

localrule postprocess_after_decontamination:
input: OD3/sequence_quality_control/OD3_clean_R1.fastq.gz, OD3/sequence_quality_control/OD3_clean_R2.fastq.gz, OD3/sequence_quality_control/OD3_clean_se.fastq.gz
output: OD3/sequence_quality_control/OD3_QC_R1.fastq.gz, OD3/sequence_quality_control/OD3_QC_R2.fastq.gz, OD3/sequence_quality_control/OD3_QC_se.fastq.gz
jobid: 88
wildcards: sample=OD3

localrule initialize_checkm:
output: /home/syang/databases/checkm/test_data/637000110.fna, /home/syang/databases/checkm/taxon_marker_sets.tsv, /home/syang/databases/checkm/selected_marker_sets.tsv, /home/syang/databases/checkm/pfam/tigrfam2pfam.tsv, /home/syang/databases/checkm/pfam/Pfam-A.hmm.dat, /home/syang/databases/checkm/img/img_metadata.tsv, /home/syang/databases/checkm/hmms_ssu/SSU_euk.hmm, /home/syang/databases/checkm/hmms_ssu/SSU_bacteria.hmm, /home/syang/databases/checkm/hmms_ssu/SSU_archaea.hmm, /home/syang/databases/checkm/hmms_ssu/createHMMs.py, /home/syang/databases/checkm/hmms/phylo.hmm.ssi, /home/syang/databases/checkm/hmms/phylo.hmm, /home/syang/databases/checkm/hmms/checkm.hmm.ssi, /home/syang/databases/checkm/hmms/checkm.hmm, /home/syang/databases/checkm/genome_tree/missing_duplicate_genes_97.tsv, /home/syang/databases/checkm/genome_tree/missing_duplicate_genes_50.tsv, /home/syang/databases/checkm/genome_tree/genome_tree.taxonomy.tsv, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/phylo_modelJqWx6_.json, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.tre, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.log, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/genome_tree.fasta, /home/syang/databases/checkm/genome_tree/genome_tree_reduced.refpkg/CONTENTS.json, /home/syang/databases/checkm/genome_tree/genome_tree.metadata.tsv, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/phylo_modelEcOyPk.json, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.tre, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.log, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/genome_tree.fasta, /home/syang/databases/checkm/genome_tree/genome_tree_full.refpkg/CONTENTS.json, /home/syang/databases/checkm/genome_tree/genome_tree.derep.txt, /home/syang/databases/checkm/.dmanifest, /home/syang/databases/checkm/distributions/td_dist.txt, /home/syang/databases/checkm/distributions/gc_dist.txt, /home/syang/databases/checkm/distributions/cd_dist.txt, logs/checkm_init.txt
log: logs/initialize_checkm.log
jobid: 130

Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
/home/syang/anaconda3/bin/python /home/syang/anaconda3/lib/python3.5/site-packages/atlas/rules/.snakemake.1dkqwxrq.initialize_checkm.py
Activating conda environment /home/syang/results/.snakemake/conda/11f0e3ea.
Removing temporary output file OD3/sequence_quality_control/OD3_clean_R1.fastq.gz.
Removing temporary output file OD3/sequence_quality_control/OD3_clean_R2.fastq.gz.
Removing temporary output file OD3/sequence_quality_control/OD3_clean_se.fastq.gz.
Finished job 88.
40 of 241 steps (17%) done
Waiting at most 20 seconds for missing files.
Removing output files of failed job initialize_checkm since they might be corrupted:
/home/syang/databases/checkm/.dmanifest, logs/checkm_init.txt
Will exit after finishing currently running jobs.

results/logs/initialize_checkm.log


[CheckM - data] Check for database updates. [setRoot]


Data location successfully changed to: /home/syang/databases/checkm


[CheckM - data] Check for database updates. [update]


Exiting because a job execution failed. Look above for error message
Complete log: /home/syang/.snakemake/log/2017-12-15T111045.335026.snakemake.log
[2017-12-15 14:06 CRITICAL] Command 'snakemake --snakefile /home/syang/anaconda3/lib/python3.5/site-packages/atlas/Snakefile --directory /home/syang/results --printshellcmds --jobs 24 --rerun-incomplete --configfile '/home/syang/config.yaml' --nolock --use-conda --config workflow=complete --latency-wait 20' returned non-zero exit status 1

brwnj avatar brwnj commented on August 26, 2024

I anticipated this, as this step also needs an internet connection. Now that the data are processed to this point, we need to run another command on the head node, and the rule target cannot have wildcards. It'll be something like what Silas referred to earlier:

atlas assemble --jobs 24 --out-dir results /home/syang/config.yaml --latency-wait 20 logs/checkm_init.txt

The file specification at the end tells Snakemake that we only want to build this file, so the run should have minimal impact on the head node.

 avatar commented on August 26, 2024

@brwnj I followed your suggestion, and the localrule initialize_checkm still failed. This error also occurred on our group workstation, which certainly has a good internet connection, so there must be some other reason or bug.

brwnj avatar brwnj commented on August 26, 2024

I've been going through the code this week and hope to take another look at the bin validation step soon. In the meantime, you could set perform_genome_binning: false in your configuration. All steps besides binning and checkm will still run.

 avatar commented on August 26, 2024

@brwnj Thank you for your fast reply. I am running the rest of the pipeline according to your suggestion. By the way, is it possible to include the analysis of viruses, fungi, and protozoa in this magic pipeline? Would that modification be very complicated and require more packages? Thank you.

brwnj avatar brwnj commented on August 26, 2024

It does take a fair amount of effort to add things like that. I've considered adding a kmer-based annotation protocol as maybe an alternate annotation. If the method is reasonably fast I would implement it to run as a default protocol. Adding something like https://github.com/bioinformatics-centre/kaiju is what I'm thinking. Then the user would choose the alternate annotation database from their offerings:

  • refseq
  • progenomes
  • nr (archaea, bacteria, fungi and microbial eukaryotes)

Virus references can be added to any of the above 3.

We're trying to wrap up the paper soon, so this will likely be added to a development branch when it's started.

 avatar commented on August 26, 2024

@brwnj Sorry to disturb you again; still a question about checkm. For security reasons, the compute nodes of the cluster are on a private network and have no connection to outside servers like conda.anaconda.org. In this case, is rule initialize_checkm mandatory for processing each sample in the remaining jobs? If not, is it possible to modify the script so it runs only once, on the head node? In addition, I saw on the CheckM GitHub page that it requires Python < 3.0. Does this also make it hard to incorporate into the pipeline?

In addition, I ran into an error related to qc.snakefile. I am not sure whether it is also related to checkm and 'perform_genome_binning: false':

Error in rule calculate_insert_size:
jobid: 91
output: OD3/sequence_quality_control/read_stats/QC_insert_size_hist.txt, OD3/sequence_quality_control/read_stats/QC_read_length_hist.txt
log: OD3/logs/OD3_calculate_insert_size.log

RuleException:
CalledProcessError in line 407 of /home/syang/anaconda3/lib/python3.5/site-packages/atlas/rules/qc.snakefile:
Command 'source activate /home/syang/results/.snakemake/conda/dbc7d302; set -euo pipefail; bbmerge.sh -Xmx32G threads=24 in1=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz loose ecct k=62 extend2=50 ihist=OD3/sequence_quality_control/read_stats/QC_insert_size_hist.txt merge=f mininsert0=35 minoverlap0=8 2> >(tee OD3/logs/OD3_calculate_insert_size.log)

            readlength.sh in=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz out=OD3/sequence_quality_control/read_stats/QC_read_length_hist.txt 2> >(tee OD3/logs/OD3_calculate_insert_size.log) ' returned non-zero exit status 1

File "/home/syang/anaconda3/lib/python3.5/site-packages/atlas/rules/qc.snakefile", line 407, in __rule_calculate_insert_size
File "/home/syang/anaconda3/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

SilasK avatar SilasK commented on August 26, 2024

You can run only the step needing internet on your head node with atlas assemble -R initialize_checkm.

Can you send us OD3/logs/OD3_calculate_insert_size.log?

 avatar commented on August 26, 2024

@SilasK To avoid the checkm initialization, I commented out the rule initialize_checkm block in assemble.snakefile. So far, OD3/logs/OD3_calculate_insert_size.log looks like this:

java -Djava.library.path=/home/syang/results/.snakemake/conda/dbc7d302/opt/bbmap-37.17/jni/ -ea -Xmx32G -Xms32G -cp /home/syang/results/.snakemake/conda/dbc7d302/opt/bbmap-37.17/current/ jgi.BBMerge -Xmx32G threads=24 in1=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz loose ecct k=62 extend2=50 ihist=OD3/sequence_quality_control/read_stats/QC_insert_size_hist.txt merge=f mininsert0=35 minoverlap0=8
Executing jgi.BBMerge [-Xmx32G, threads=24, in1=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz, in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz, loose, ecct, k=62, extend2=50, ihist=OD3/sequence_quality_control/read_stats/QC_insert_size_hist.txt, merge=f, mininsert0=35, minoverlap0=8]

BBMerge version 37.17
Revised arguments: [minoverlap=8, minoverlap0=9, qualiters=4, mismatches=3, margin=2, ratiooffset=0.4, minsecondratio=0.08, maxratio=0.11, ratiomargin=4.7, ratiominoverlapreduction=2, pfilter=0.00002, efilter=8, minentropy=30, minapproxoverlap=30, -Xmx32G, threads=24, in1=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz, in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz, ecct, k=62, extend2=50, ihist=OD3/sequence_quality_control/read_stats/QC_insert_size_hist.txt, merge=f, mininsert0=35, minoverlap0=8]

Set threads to 24
Executing assemble.Tadpole2 [in=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz, in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=62, prealloc=false, prefilter=0, ecctail=false, eccpincer=false, eccreassemble=true]

Tadpole version 37.17
Using 24 threads.
Executing ukmer.KmerTableSetU [in=OD3/sequence_quality_control/OD3_QC_R1.fastq.gz, in2=OD3/sequence_quality_control/OD3_QC_R2.fastq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=62, prealloc=false, prefilter=0, ecctail=false, eccpincer=false, eccreassemble=true]

Initial:
Ways=61, initialSize=128000, prefilter=f, prealloc=f
Memory: max=32928m, free=32069m, used=859m

Estimated kmer capacity: 613835151
After table allocation:
Memory: max=32928m, free=31725m, used=1203m

java.lang.OutOfMemoryError: Java heap space
at ukmer.AbstractKmerTableU.allocLong2D(AbstractKmerTableU.java:218)
at ukmer.HashArrayU1D.resize(HashArrayU1D.java:186)
at ukmer.HashArrayU1D.incrementAndReturnNumCreated(HashArrayU1D.java:90)
at ukmer.HashBufferU.dumpBuffer_inner(HashBufferU.java:196)
at ukmer.HashBufferU.dumpBuffer(HashBufferU.java:168)
at ukmer.HashBufferU.incrementAndReturnNumCreated(HashBufferU.java:57)
at ukmer.KmerTableSetU$LoadThread.addKmersToTable(KmerTableSetU.java:553)
at ukmer.KmerTableSetU$LoadThread.run(KmerTableSetU.java:479)

This program ran out of memory. Try increasing the -Xmx flag and setting prealloc.

SilasK avatar SilasK commented on August 26, 2024

The program didn't have enough memory. What was the command you used to run it on the cluster?

 avatar commented on August 26, 2024

@SilasK In the config.yaml file, I set threads to 36 and java_mem to 48, and ran:
bsub -n 36 -q par120 -e err.txt -o out.txt atlas assemble --jobs 36 --out-dir results /home/syang/config.yaml --latency-wait 240
In addition, have you finished the part of the pipeline that merges results (e.g. taxonomic or functional profiles) of multiple samples into one table, like a normal OTU table? Or is that what qc.snakefile does? Thank you.

 avatar commented on August 26, 2024

@brwnj or @SilasK Error in job convert_sam_to_bam while creating output file OD3/sequence_alignment/OD3.bam.
RuleException:
AttributeError in line 621 of /home/syang/anaconda3/lib/python3.5/site-packages/atlas/rules/assemble.snakefile:
'Wildcards' object has no attribute 'sample'
File "/home/syang/anaconda3/lib/python3.5/site-packages/atlas/rules/assemble.snakefile", line 621, in __rule_convert_sam_to_bam
File "/home/syang/anaconda3/lib/python3.5/string.py", line 191, in format
File "/home/syang/anaconda3/lib/python3.5/string.py", line 195, in vformat
File "/home/syang/anaconda3/lib/python3.5/string.py", line 235, in _vformat
File "/home/syang/anaconda3/lib/python3.5/string.py", line 306, in get_field
File "/home/syang/anaconda3/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

brwnj avatar brwnj commented on August 26, 2024

That fix hasn't been pulled into master yet, but was implemented in f49bd12.

SilasK avatar SilasK commented on August 26, 2024

@camel315 For the memory problem on the cluster: request memory from the cluster that is 15% higher than java_mem. Could you then report how this works with the bsub command?

 avatar commented on August 26, 2024

Increasing the memory by 15% appears to be insufficient for the workstation; the cluster is still running. In addition, the qc.snakefile produced another error:
Error
File "/home/syang/anaconda3/lib/python3.5/site-packages/atlas/rules/qc.snakefile", line 407, in __rule_calculate_insert_size
File "/home/syang/anaconda3/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

I checked tadpole.sh, and there are several parameters for avoiding running out of memory. After adding prealloc=t, prefilter=t, and minprob=0.8 to the corresponding lines in qc.snakefile, the pipeline worked again.

jmtsuji avatar jmtsuji commented on August 26, 2024

Automated database updates to be phased out in CheckM
Going back to the original issue in this thread: I've been having the same problem with initialize_checkm.py. The "failed to connect to server" error appears to be a known issue with CheckM's automated database update feature (checkm data update) rather than an issue with ATLAS. I just posted a GitHub issue on this and got a response: Ecogenomics/CheckM#132. It's recommended to do a manual database download instead of an automated update (in fact, the automated update functionality has been removed in newer versions of CheckM).

Temporary workaround
As a temporary workaround, I've commented out the checkm data update command in initialize_checkm.py, as well as the relevant output files specified in assemble.snakefile (rule initialize_checkm), to skip checking the database, and I have downloaded the most recent CheckM database manually (wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_v1.0.9.tar.gz). This seems to work.
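
For anyone scripting that manual download, a minimal sketch in Python; the URL is the one quoted above, and the destination path is an example taken from this thread:

import tarfile
import urllib.request

URL = "https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_v1.0.9.tar.gz"
DEST = "/home/syang/databases/checkm"  # example path; use your own databases/checkm

# Fetch the tarball and unpack it into the CheckM database directory;
# afterwards run `checkm data setRoot` so CheckM knows where the data is.
archive, _ = urllib.request.urlretrieve(URL)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(DEST)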

Solution for ATLAS
Could the CheckM databases be downloaded as part of atlas download instead of relying on checkm data update?

brwnj avatar brwnj commented on August 26, 2024

I'm looking into it. Any chance someone has time to update the checkm bioconda recipe?

I'm heavily in favor of distributing the downloads via Zenodo because connections to their servers are often very slow.

jmtsuji avatar jmtsuji commented on August 26, 2024

I could make a pull request for disabling checkm data update in the ATLAS pipeline, if helpful. Updating the CheckM bioconda recipe would not be strictly needed in this case.

Also, I think that adding the CheckM database to Zenodo makes sense. The last database update was in 2015, so the database appears fairly stable.

Most recent version is: https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_v1.0.9.tar.gz

brwnj avatar brwnj commented on August 26, 2024

I'm testing the changes to the download method now that were implemented in b777353. Afterwards, I'll test the assembly protocol on a clean environment.
