mbhall88 / tbpore Goto Github PK
View Code? Open in Web Editor NEWMycobacterium tuberculosis genomic analysis from Nanopore sequencing data
License: MIT License
Mycobacterium tuberculosis genomic analysis from Nanopore sequencing data
License: MIT License
I have the following directory structure: sample_example/mada_101/mada_101.nanopore.fastq.gz
. If I run tbpore
giving the sample_example
dir as input, as well as sample_example/mada_101/mada_101.nanopore.fastq.gz
, we will build a fastq by duplicating sample_example/mada_101/mada_101.nanopore.fastq.gz
and concatenating the copies:
(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ poetry run tbpore --recursive sample_example/ sample_example/mada_101/mada_101.nanopore.fastq.gz
[2022-04-26 10:58:21] INFO | Welcome to TBpore version 0.1.0
[2022-04-26 10:58:21] INFO | Searching for fastq files...
[2022-04-26 10:58:21] INFO | Found 2 fastq files. Joining them...
I am getting these linting error when running just lint
:
(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ just lint
poetry run flake8 . --extend-exclude=".venv/"
./tests/test_external_tools.py:26:89: E501 line too long (95 > 88 characters)
./tests/test_external_tools.py:30:89: E501 line too long (95 > 88 characters)
./tests/test_tbpore.py:27:89: E501 line too long (94 > 88 characters)
./tests/test_tbpore.py:42:89: E501 line too long (95 > 88 characters)
./tests/test_tbpore.py:48:89: E501 line too long (112 > 88 characters)
./tests/test_tbpore.py:49:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:56:89: E501 line too long (122 > 88 characters)
./tests/test_tbpore.py:62:89: E501 line too long (120 > 88 characters)
./tests/test_tbpore.py:69:89: E501 line too long (118 > 88 characters)
./tests/test_tbpore.py:76:89: E501 line too long (117 > 88 characters)
./tests/test_tbpore.py:83:89: E501 line too long (114 > 88 characters)
./tests/test_tbpore.py:90:89: E501 line too long (114 > 88 characters)
./tests/test_tbpore.py:92:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:98:89: E501 line too long (113 > 88 characters)
./tests/test_tbpore.py:99:89: E501 line too long (102 > 88 characters)
./tests/test_tbpore.py:104:89: E501 line too long (124 > 88 characters)
./tests/test_tbpore.py:130:89: E501 line too long (95 > 88 characters)
./tests/test_tbpore.py:136:89: E501 line too long (124 > 88 characters)
./tests/test_tbpore.py:137:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:138:89: E501 line too long (98 > 88 characters)
./tests/test_tbpore.py:144:89: E501 line too long (104 > 88 characters)
./tests/test_tbpore.py:151:89: E501 line too long (129 > 88 characters)
./tests/test_tbpore.py:158:89: E501 line too long (136 > 88 characters)
./tests/test_tbpore.py:165:89: E501 line too long (126 > 88 characters)
./tests/test_tbpore.py:172:89: E501 line too long (123 > 88 characters)
./tests/test_tbpore.py:179:89: E501 line too long (114 > 88 characters)
./tests/test_tbpore.py:181:89: E501 line too long (118 > 88 characters)
./tests/test_tbpore.py:187:89: E501 line too long (122 > 88 characters)
./tests/test_tbpore.py:188:89: E501 line too long (120 > 88 characters)
./tests/test_tbpore.py:197:89: E501 line too long (123 > 88 characters)
./tests/test_tbpore.py:228:89: E501 line too long (112 > 88 characters)
./tests/test_tbpore.py:229:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:236:89: E501 line too long (122 > 88 characters)
./tests/test_tbpore.py:241:89: E501 line too long (120 > 88 characters)
./external_scripts/consensus.py:1:89: E501 line too long (126 > 88 characters)
./external_scripts/consensus.py:378:89: E501 line too long (110 > 88 characters)
./external_scripts/apply_filters.py:1:89: E501 line too long (120 > 88 characters)
./external_scripts/apply_filters.py:380:89: E501 line too long (112 > 88 characters)
./external_scripts/apply_filters.py:390:89: E501 line too long (128 > 88 characters)
./external_scripts/apply_filters.py:401:89: E501 line too long (132 > 88 characters)
./external_scripts/apply_filters.py:411:89: E501 line too long (112 > 88 characters)
./external_scripts/apply_filters.py:440:89: E501 line too long (104 > 88 characters)
./external_scripts/apply_filters.py:482:89: E501 line too long (90 > 88 characters)
./external_scripts/apply_filters.py:493:89: E501 line too long (90 > 88 characters)
./external_scripts/apply_filters.py:504:89: E501 line too long (94 > 88 characters)
./external_scripts/apply_filters.py:689:89: E501 line too long (89 > 88 characters)
./external_scripts/apply_filters.py:700:89: E501 line too long (89 > 88 characters)
./external_scripts/apply_filters.py:710:89: E501 line too long (94 > 88 characters)
./external_scripts/apply_filters.py:720:89: E501 line too long (94 > 88 characters)
./external_scripts/apply_filters.py:730:89: E501 line too long (98 > 88 characters)
./external_scripts/apply_filters.py:741:89: E501 line too long (89 > 88 characters)
./external_scripts/apply_filters.py:752:89: E501 line too long (89 > 88 characters)
./tbpore/tbpore.py:122:89: E501 line too long (95 > 88 characters)
./tbpore/tbpore.py:189:89: E501 line too long (100 > 88 characters)
./tbpore/tbpore.py:226:89: E501 line too long (106 > 88 characters)
./tbpore/tbpore.py:259:89: E501 line too long (107 > 88 characters)
error: Recipe `lint` failed on line 19 with exit code 1
black
did not manage to format these long lines. I decided to ignore E501
when running flake8
. Are you fine with this? If not, what would be your suggestion to fix this? Manually format these long lines?
As we are trying to keep a low memory footprint and want to be able to distribute as many resources as possible with tbpore, we only really need to use the H37Rv reference and maybe a few other NTMs as well. Then we keep anything that maps to H37Rv.
Some NTM references can be found in https://www.nature.com/articles/srep45258#Sec25 and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0214088
The code I used to do this in the head to head was https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/data/QC/Snakefile#L272-L383 (ignore the rule for extracting illumina reads obviously)
Original issue: #1
Could anyone check H37RV genome and mask https://github.com/leoisl/tbpore/tree/main/data/H37RV_genome ? I copied from some Michael's dirs in codon, but don't remember the exact path right now. Probably good to double check that we are using the correct reference genome and masks
Done in d9a0c02 but IDK if you agree with this strategy. Basically, it creates a .cache
dir in the repo root and stores the mykrobe/ctx binary skeletons there. However, if a new skeleton should be built (due to a first run or a catalogue update) and if >1 runs of tbpore
is done simultaneously, this will create bugs. Should we address this programatically or tell them to take care about this and live with it?
From leoisl@8a644c6 , we have NTM+human decontamination in tbpore
, so the results should be very similar to H2H
, however they are not. #16 is too long/convoluted, so I am starting this new issue to compare H2H
and tbpore
output step by step
This issue is to compare paper results with results obtained with new filters. These new filters are min_depth 5
(paper is 0
), min_qual 25
(paper is 85
), min_mq 30
(paper is 0
). It is worth noting that some filters were left untouched, i.e. min_fed 0.2
, min_strand_bias 1
, min_vdb 0.00001
and min_frs 0.90
. This is related to #22 but there will be several plots comparison here, so I am starting a new issue.
Using mpileup
and call
. MUST be bcftools version >=1.13
. (Nothing of relevance to these subcommands seems to have changed in the recent versions so probably safe to use any version above 1.13)
I have the following directory structure: sample_example/mada_101/mada_101.nanopore.fastq.gz
. If I run tbpore
giving the sample_example
dir as input, I get this issue:
(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ poetry run tbpore sample_example
[2022-04-26 10:36:34] INFO | Welcome to TBpore version 0.1.0
[2022-04-26 10:36:34] INFO | Searching for fastq files...
[2022-04-26 10:36:34] INFO | Found 0 fastq files. Joining them...
And it gets stuck here
Align the subsampled, decontaminated reads (#3) to H37Rv - this will be used for variant calling.
https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/analysis/baseline_variants/Snakefile#L111-L134 is how I did this in the paper.
tbpore
does not use a decontamination step, is there a list of samples that we could run tbpore
to compare against the head-to-head pipeline results? This would allow us to understand the impact of not decontaminating samples, and ensure us that is actually fine to skip this step. And can also be another test to ensure the tool is actually doing what it is supposed to do
This is a request from Floriane and Pedro from our meeting on 18/05/2022.
H37Rv is not always the reference they might want describe the variants with respect to, so the reference has to be configurable (e.g. through the parameters). Right now it is always forced to be H37Rv.
i.e., aplply the SNPs to H37Rv
Snakemake rule: https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/analysis/baseline_variants/Snakefile#L358-L386
Should we add a sample example (a real but downsampled fastq.gz
file), that could go potentially in the integration tests, and as a first test command for users to run? I am in doubt as this would make the package heavier
Just noticed this when making PR #26 . This will impact a little bit runtime, as the DB index is 9GB, but I think is worth it to make sure we are using the correct index...
I think it might be worth adding a download
subcommand that downloads the database and checks it's md5sum (or even better, sha256) and puts it in the expected location. Have had issues with both of these things when supporting collaborators.
This is a request from Floriane and Pedro from our meeting on 18/05/2022.
just test-run
did not work when they tried to run it. However, this sample run worked when they manually run the bash script at https://github.com/mbhall88/tbpore/blob/main/scripts/run_sample_example.sh . Investigate why. I was unable to reproduce this issue.
As we are trying to keep a low memory footprint and want to be able to distribute as many resources as possible with tbpore, we only really need to use the H37Rv reference and maybe a few other NTMs as well. Then we keep anything that maps to H37Rv.
Some NTM references can be found in https://www.nature.com/articles/srep45258#Sec25 and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0214088
The code I used to do this in the head to head was https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/data/QC/Snakefile#L272-L383 (ignore the rule for extracting illumina reads obviously)
Just a reminder to not forget, see https://github.com/leoisl/tbpore/blob/73dac25f75c96216b291b7e2b0d62d3a60e05d1f/.config.yaml#L12-L13
Hi there,
What does the ''subsampled" mean in the vcf filenames?
sorry to comment on this again...is there a way to 'disable' rasusa and not do the subsampling from the fastq file?
Originally posted by @jloubser in #49 (comment)
I'll add an option that allows for controlling the depth to subsampling, with an options to disable too
Have an option in cluster
that produces strongly connected component clusters.
A use case for this was when there is a sample or two with low distance to a sample from a different lineage. This caused two clusters from separate lineages to get joined because the two samples had lots of masked variants due to low FRS
I will leave this to you, as I think you know how to better describe this tool
The current minimap2
index was built with -I 12G
to match the H2H
index. This pushes the tbpore
RAM usage when running tbpore process
to 13.1
GB. We could instead build the index with -I 500M
, which would take the tbpore process
RAM down to ~5GB, which is much more runnable in a personal laptop, but then the results are not identical to the H2H
results. We should evaluate the impact of this different index on the clustering and on the tbpore
results in general, and infer if is indeed OK to switch to this lighter index. This might be related to #22
This should use the settings used in the head to head paper. These will be released in mykrobe v0.11.0 (currently on master as --ont
). If needing to set manually, the settings can found in https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/resistance_prediction/workflow/rules/mykrobe.smk.
It is up for debate whether we want to use the -A
option to report all calls. Maybe we can have this as an option in tbpore and have it turned off by default?
Command: tbpore cluster <list_of_consensus> -o <outdir>
Will run psdm
and generate the clusters as in https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/notebooks/clustering.py.ipynb . No need for cluster visualisation, just the identifiers of the samples that are clustered, like comma-sep lists.
Another option is based on this excerpt from slack:
I was basically thinking that in the docs we explain that after a tbpore run you do the clustering by running psdm between the previous consensus sequences (all in one file) and the new consensus. You would need to use psdm’s long-form output (-l) and you can append this long-form distances to an existing file of all previous distances. Then after this you append the new consensus sequence to the previous ones.
Then the user I think would need to run a couple of cat
commands to update some files, psdm
to update the matrix and a script we provide to build the clusters from the psdm
matrix. As this is some commands, it seems it would be easier for them to have this tbpore cluster
command.
The current tbpore
execution becomes command tbpore process
This is a request from Floriane and Pedro from our meeting on 18/05/2022.
Their samples come in batches, so it would be helpful if tbpore process
could be run on several samples with a single command instead of having to run a tbpore process
command per sample.
This is a request from Floriane and Pedro from our meeting on 18/05/2022.
The cluster they will use to run tbpore
in Canada has a policy to not allow users to run conda
because it messes up with the ~/.bashrc
configuration. Is there an alternative way for us to provide the dependencies? pypi
would be great, but IDK if all dependencies are installable through pypi
, my guess is not e.g. psdm
, rasusa
, etc... @mbhall88 do you have any suggestions? They said that they managed to run tbpore
for now, by installing the dependencies through conda
, but IIRC in a virtual machine (or a container?) in their cluster, but they also said this is not a long-term solution...
Hello tbpore team,
Thank you for uploading the new tbpore 0.2.0 container. The new --db
option seems to work without issue, however we're still getting this error upon running it:
File "/usr/local/bin/tbpore", line 10, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/tbpore/tbpore.py", line 506, in main
main_cli()
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/tbpore/tbpore.py", line 258, in process
tmp, logdir = setup_dirs(outdir, tmp)
File "/usr/local/lib/python3.9/site-packages/tbpore/tbpore.py", line 110, in setup_dirs
cache_dir.mkdir(parents=True, exist_ok=True)
File "/usr/local/lib/python3.9/pathlib.py", line 1323, in mkdir
self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/usr/local/lib/python3.9/site-packages/.cache'
That's not a section of the code we would be able to change ourselves, so could you please help us with it?
The command we're using is as follow:
singularity exec -H path_to/tbpore tbpore_0.2.0.sif tbpore process --db data/decontamination_db/tbpore.remove_contam.fa.gz.map-ont.mmi -o sample_example/tbpore_out --cleanup sample_example/mada_1-7.subsampled.fastq.gz
Thanks again for your time and help!
In tbpore
, we downsample the reads depth to 150x and we have a lower bound for the phred-scaled VCF call quality of 85 (which accepts only 1 error in 100M - this is an oversimplification of the quality score, there are other data that also affect the score, but this is a good approximation). This causes us to flag and thus discard all non-perfect calls as low-quality calls (we define here a perfect call when all reads agree with the consensus, and non-perfect calls all the others (even if a single read disagree with the consensus out of ~150 or even 1000 reads, we will flag the call as low quality because our threshold for the call quality just accepts 1 error in 100M)). In practice, we see perfect calls having phred-scaled quality between 250 and 350, and quasi-perfect calls (the ones that have a single read disagreeing with the consensus) between 20 and 30, the exact values depend on how many reads overlap the locus and other info. The issue is that there are hundreds of thousands of quasi-perfect calls that get filtered out, and this becomes a N
in the consensus sequence. When we cluster based on the consensus sequence, we might overcluster as we will consider that N
equals any base (if we compare two samples, and one calls an alt
and the other calls an N
, but this N
was a due to a quasi-perfect call to ref
, then we will not take this difference in consideration, whereas we should).
I calculated the mean and max number of calls the H2H
pipeline discards due to them being tagged as low quality (lq
flag) while having an "acceptable" quality (phred-scaled VCF call quality >= 20, i.e. we accept at most 1 error in 100).
For nanopore, we have ~767k VCF calls (bases) on average per sample (and a max of 1158k calls) that are filtered out solely due to being lq
while having quality >=20
. Note that these calls have no other flag, i.e. if we reduce the quality threshold from 85 to 20, we will fish back all these calls:
$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/madagascar/ -name *.bcf`; do bcftools view $f | awk '$6>=20 && $7=="lq" ' | wc -l; done | datamash mean 1 max 1
767632.54945055 1158679
If we reduce the quality threshold down to 20, we would instead filter out just ~1.72 calls per sample solely due to being lq
(157 calls in all the 91 samples):
$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/madagascar/ -name *.bcf`; do bcftools view $f | awk '$6<20 && $7=="lq" ' | wc -l; done | datamash mean 1 max 1
1.7252747252747 34
These would be the 3 lq
calls with highest quality that we would ignore:
NC_000962.3 3711723 . A . 19.695 lq DP=59;ADF=26;ADR=21;SCR=18;VDB=0.112489;SGB=-0.590765;RPBZ=-2.43914;MQBZ=4.21187;MQSBZ=1.66428;BQBZ=-1.38604;SCBZ=-0.576347;FS=0;MQ0F=0.610169;AN=1;DP4=26,21,1,4;MQ=4 GT:SP 0:7
NC_000962.3 2635661 . G . 19.6942 lq DP=150;ADF=73;ADR=29;SCR=97;VDB=0.18;SGB=-0.453602;RPBZ=0.698446;MQBZ=2.87256;MQSBZ=-1.63911;BQBZ=1.45854;SCBZ=-1.15867;FS=0;MQ0F=0.233333;AN=1;DP4=73,29,1,1;MQ=2GT:SP 0:3
NC_000962.3 3711721 . T . 19.377 lq DP=75;ADF=33;ADR=23;SCR=20;VDB=0.116687;SGB=-0.511536;RPBZ=-2.82962;MQBZ=3.38282;MQSBZ=1.801;BQBZ=0.570723;SCBZ=1.42944;FS=0;MQ0F=0.48;AN=1;DP4=33,23,2,1;MQ=3 GT:SP 0:0
The second one has 102 reads calling ref
and only 2
calling alt.
We can't fairly compare the stats above with the Illumina calls. IDK much about the H2H
pipeline, but it seems that the illumina VCFs were not generated by us, but by a thirdparty using the COMPASS pipeline? Anyway, I did not find the same flags between the ONT and the Illumina VCFs, but I think the Illumina VCFs are filtering with a phred-scaled VCF call quality >= 25, due to this header:
##FILTER=<ID=S25,Number=0,Type=Float,Description="Base not called because min SNP quality less than 25">
If we assume this, we have much less calls filtered out (~255 per sample on average) solely due to being low quality (note that here we are not even requiring the quality to be >=20, although we know it is <25 already):
$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/illumina/gvcfs/ -name mada*.vcf.gz`; do zcat $f | awk '$7=="S25"' | wc -l; done | datamash mean 1 max 1
255.61016949153 5413
Just by looking at the nanopore stats, it seems we might be filtering way too much (average of ~17.4% of the genome). If when comparing two samples, an alt
call matches an imperfect ref
call of the other sample, or we have an imperfect alt
call in one of the samples, this difference will not be counted, and thus we might overcluster our samples. A possible solution is to just decrease the lower bound for the phred-scaled VCF call quality from 85 to 20 and evaluate it. The issue is that I don't know how to evaluate if results are better or not. 20 might be a too lenient value too, as we would discard just ~2 calls per sample. Maybe we should use the same value that COMPASS use, 25? We need a way to evaluate results and do a parameter sweep to find the best threshold. I am wondering if rerunning the H2H
pipeline help us infer if results are better.
This is a request from Floriane and Pedro from our meeting on 18/05/2022.
After clustering the samples, we should build a tree from it. I have no idea which would be the best tool do build such a tree... @iqbal-lab @mbhall88 @jeff-k @martinghunt any ideas?
We had an example where running tbpore on a single, combined fastq produced difference results to running it on a directory of the individual fastqs that make up the combined one. The reason the results were different between using the combined fastq and the directory of fastqs has to do with subsampling. When we randomly subsample the reads we set a random seed so that the same reads will always be selected. BUT this is only guaranteed if the reads are in the same order. When we combine the reads on the command line we get a differet ordering to when we combine the reads from a directory with python.
To avoid this problem in the future, we will sort the reads with seqkit sort
(which is already a dependency) before subsampling
Do we need this?
This is the pipeline code to run the script https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/analysis/baseline_variants/Snakefile#L183-L228
And this is the script https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/baseline_variants/scripts/apply_filters.py
Performance documentation, but just on the sample data (mada_1-7
subsampled reads: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR903/ERR9030361/mada_1-7.subsampled.fastq.gz , ~163k ONT reads).
On my local laptop with:
Looks OK to me, i.e. runnable on a standard laptop using <1 GB of RAM and in a few minutes. Multithreading does not speed up the execution much though...
I used the subsampled reads from https://www.ebi.ac.uk/ena/browser/view/ERR9030361?show=reads , which I think were the rasusa
subsampled reads? Should I re-evaluate with the full reads dataset, which might be closer to their use case?
1 thread:
[2022-04-26 13:13:16] INFO | Welcome to TBpore version 0.1.0
[2022-04-26 13:13:16] INFO | Started running mykrobe predict --sample mada_1-7 -t 1 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:14:37] INFO | Done running mykrobe predict --sample mada_1-7 -t 1 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:14:37] INFO | Started running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:15:19] INFO | Done running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:15:19] INFO | Started running minimap2 -t 1 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz ...
[2022-04-26 13:17:10] INFO | Done running minimap2 -t 1 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz
[2022-04-26 13:17:10] INFO | Started running samtools sort -@ 1 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam ...
[2022-04-26 13:17:12] INFO | Done running samtools sort -@ 1 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam
[2022-04-26 13:17:12] INFO | Started running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 1 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam ...
[2022-04-26 13:20:04] INFO | Done running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 1 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam
[2022-04-26 13:20:04] INFO | Started running bcftools call --threads 1 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf ...
[2022-04-26 13:20:18] INFO | Done running bcftools call --threads 1 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf
[2022-04-26 13:20:18] INFO | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf ...
[2022-04-26 13:22:37] INFO | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf
[2022-04-26 13:22:37] INFO | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed ...
[2022-04-26 13:23:08] INFO | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed
[2022-04-26 13:23:08] INFO | Cleaning up temporary files...
[2022-04-26 13:23:08] SUCCESS | Done
Command being timed: "tbpore -o sample_example/tbpore_out --cleanup sample_example/mada_1-7.subsampled.fastq.gz"
User time (seconds): 590.41
System time (seconds): 12.49
Percent of CPU this job got: 101%
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:52.40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 674368
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 87
Minor (reclaiming a frame) page faults: 2037873
Voluntary context switches: 145720
Involuntary context switches: 84042
Swaps: 0
File system inputs: 478520
File system outputs: 6127072
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
4 threads:
(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ /usr/bin/time --verbose tbpore -o sample_example/tbpore_out --cleanup -t 4 sample_example/mada_1-7.subsampled.fastq.gz
[2022-04-26 13:26:00] INFO | Welcome to TBpore version 0.1.0
[2022-04-26 13:26:00] INFO | Started running mykrobe predict --sample mada_1-7 -t 4 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:26:55] INFO | Done running mykrobe predict --sample mada_1-7 -t 4 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:26:55] INFO | Started running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:27:39] INFO | Done running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:27:39] INFO | Started running minimap2 -t 4 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz ...
[2022-04-26 13:28:25] INFO | Done running minimap2 -t 4 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz
[2022-04-26 13:28:25] INFO | Started running samtools sort -@ 4 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam ...
[2022-04-26 13:28:27] INFO | Done running samtools sort -@ 4 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam
[2022-04-26 13:28:27] INFO | Started running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 4 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam ...
[2022-04-26 13:31:12] INFO | Done running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 4 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam
[2022-04-26 13:31:12] INFO | Started running bcftools call --threads 4 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf ...
[2022-04-26 13:31:27] INFO | Done running bcftools call --threads 4 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf
[2022-04-26 13:31:27] INFO | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf ...
[2022-04-26 13:33:52] INFO | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf
[2022-04-26 13:33:52] INFO | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed ...
[2022-04-26 13:34:24] INFO | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed
[2022-04-26 13:34:24] INFO | Cleaning up temporary files...
[2022-04-26 13:34:24] SUCCESS | Done
Command being timed: "tbpore -o sample_example/tbpore_out --cleanup -t 4 sample_example/mada_1-7.subsampled.fastq.gz"
User time (seconds): 659.30
System time (seconds): 11.96
Percent of CPU this job got: 133%
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:24.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 835656
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 2086481
Voluntary context switches: 67926
Involuntary context switches: 63398
Swaps: 0
File system inputs: 664
File system outputs: 6127080
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
We have an output directory parameter right now in tbpore
:
@click.option(
"-o",
"--outdir",
help="Directory to place output files",
default=".",
show_default=True,
type=click.Path(file_okay=False, writable=True, path_type=Path),
)
However, the only output we are producing right now is a single consensus fasta file. Could we change this output directory parameter to an output consensus file, or are there other files we should be outputting and I am missing?
Haven't searched much, but don't know how to solve this cryptic Mac OS CI issue:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/usr/local/miniconda/envs/tbpore/lib/python3.8/site-packages/~ackaging-20.9.dist-info'
This is happening when poetry
is installing toml
samtools_1371651c968ac93f04abc9fa581340126a976c9033b2fe999a59053379aa8c92.txt
Hi,
I am not able to run a sample on Tbpore, please see attached doc
Floriane
We can subsample the decontaminated reads (#1) with rasusa to 150x to ensure we know the upper limit on memory usage. We could concieveably go lower, but this just copies what I did in the head to head paper.
I used this snakemake pipeline to run tbpore
on 91 madagascar samples from Michael. I wanted to compare the head_to_head_pipeline
results (described here) and tbpore
results when inputting the raw ONT reads, but had difficulties comparing the mykrobe
output files as well as the VCFs, due to a high number of differences, so I decided to just compare the easiest output: the consensus fastas. I tried to use edit distance, but it was too slow, so I switched to mash dist
. However, the differences between the head_to_head_pipeline
and tbpore
on these 91 samples were extremely high (3rd column is the mash
distance, first pair of samples have a mash
distance if 12.5%, the second 22.9% and so on):
head_to_head_pipeline_output/mada_102.consensus.fa tbpore_output/mada_102.consensus.fa 0.125753 0 3697/100000
head_to_head_pipeline_output/mada_103.consensus.fa tbpore_output/mada_103.consensus.fa 0.229735 0 120/29764
head_to_head_pipeline_output/mada_104.consensus.fa tbpore_output/mada_104.consensus.fa 0.0584744 0 17157/100000
head_to_head_pipeline_output/mada_105.consensus.fa tbpore_output/mada_105.consensus.fa 0.114674 0 4711/100000
head_to_head_pipeline_output/mada_106.consensus.fa tbpore_output/mada_106.consensus.fa 0.0823425 0 9735/100000
head_to_head_pipeline_output/mada_107.consensus.fa tbpore_output/mada_107.consensus.fa 0.106524 0 5640/100000
head_to_head_pipeline_output/mada_109.consensus.fa tbpore_output/mada_109.consensus.fa 0.124576 0 3793/100000
head_to_head_pipeline_output/mada_1-10.consensus.fa tbpore_output/mada_1-10.consensus.fa 0.19382 0 291/33797
head_to_head_pipeline_output/mada_110.consensus.fa tbpore_output/mada_110.consensus.fa 0.0446784 0 24325/100000
head_to_head_pipeline_output/mada_1-11.consensus.fa tbpore_output/mada_1-11.consensus.fa 0.171562 0 1000/72402
head_to_head_pipeline_output/mada_111.consensus.fa tbpore_output/mada_111.consensus.fa 0.102631 0 6150/100000
head_to_head_pipeline_output/mada_1-12.consensus.fa tbpore_output/mada_1-12.consensus.fa 0.151909 0 2070/98494
head_to_head_pipeline_output/mada_112.consensus.fa tbpore_output/mada_112.consensus.fa 0.145902 0 2391/100000
head_to_head_pipeline_output/mada_1-13.consensus.fa tbpore_output/mada_1-13.consensus.fa 0.079314 0 10441/100000
head_to_head_pipeline_output/mada_113.consensus.fa tbpore_output/mada_113.consensus.fa 0.0946241 0 7359/100000
head_to_head_pipeline_output/mada_1-14.consensus.fa tbpore_output/mada_1-14.consensus.fa 0.0903614 0 8104/100000
head_to_head_pipeline_output/mada_1-15.consensus.fa tbpore_output/mada_1-15.consensus.fa 0.111099 0 5097/100000
head_to_head_pipeline_output/mada_115.consensus.fa tbpore_output/mada_115.consensus.fa 0.0974869 0 6900/100000
head_to_head_pipeline_output/mada_1-16.consensus.fa tbpore_output/mada_1-16.consensus.fa 0.086014 0 8948/100000
head_to_head_pipeline_output/mada_116.consensus.fa tbpore_output/mada_116.consensus.fa 0.101133 0 6359/100000
head_to_head_pipeline_output/mada_1-17.consensus.fa tbpore_output/mada_1-17.consensus.fa 0.0743196 0 11731/100000
head_to_head_pipeline_output/mada_117.consensus.fa tbpore_output/mada_117.consensus.fa 0.218162 0 192/37306
head_to_head_pipeline_output/mada_1-18.consensus.fa tbpore_output/mada_1-18.consensus.fa 0.0882895 0 8495/100000
head_to_head_pipeline_output/mada_118.consensus.fa tbpore_output/mada_118.consensus.fa 0.207319 0 277/42805
head_to_head_pipeline_output/mada_1-19.consensus.fa tbpore_output/mada_1-19.consensus.fa 0.0531443 0 19587/100000
head_to_head_pipeline_output/mada_1-1.consensus.fa tbpore_output/mada_1-1.consensus.fa 0.127968 0 3523/100000
head_to_head_pipeline_output/mada_1-20.consensus.fa tbpore_output/mada_1-20.consensus.fa 0.103732 0 6001/100000
head_to_head_pipeline_output/mada_120.consensus.fa tbpore_output/mada_120.consensus.fa 0.253677 6.99497e-181 55/22590
head_to_head_pipeline_output/mada_1-21.consensus.fa tbpore_output/mada_1-21.consensus.fa 0.0574401 0 17600/100000
head_to_head_pipeline_output/mada_121.consensus.fa tbpore_output/mada_121.consensus.fa 0.13785 0 2844/100000
head_to_head_pipeline_output/mada_1-22.consensus.fa tbpore_output/mada_1-22.consensus.fa 0.135488 0 2993/100000
head_to_head_pipeline_output/mada_122.consensus.fa tbpore_output/mada_122.consensus.fa 0.101508 0 6306/100000
head_to_head_pipeline_output/mada_123.consensus.fa tbpore_output/mada_123.consensus.fa 0.0397561 0 27708/100000
head_to_head_pipeline_output/mada_124.consensus.fa tbpore_output/mada_124.consensus.fa 0.199724 0 298/39217
head_to_head_pipeline_output/mada_1-25.consensus.fa tbpore_output/mada_1-25.consensus.fa 0.0776128 0 10862/100000
head_to_head_pipeline_output/mada_125.consensus.fa tbpore_output/mada_125.consensus.fa 0.0898912 0 8191/100000
head_to_head_pipeline_output/mada_126.consensus.fa tbpore_output/mada_126.consensus.fa 0.122175 0 3997/100000
head_to_head_pipeline_output/mada_127.consensus.fa tbpore_output/mada_127.consensus.fa 0.117475 0 4430/100000
head_to_head_pipeline_output/mada_1-28.consensus.fa tbpore_output/mada_1-28.consensus.fa 0.0694959 0 13146/100000
head_to_head_pipeline_output/mada_128.consensus.fa tbpore_output/mada_128.consensus.fa 0.133612 0 3117/100000
head_to_head_pipeline_output/mada_129.consensus.fa tbpore_output/mada_129.consensus.fa 0.10114 0 6358/100000
head_to_head_pipeline_output/mada_1-2.consensus.fa tbpore_output/mada_1-2.consensus.fa 0.0881142 0 8529/100000
head_to_head_pipeline_output/mada_1-30.consensus.fa tbpore_output/mada_1-30.consensus.fa 0.164465 0 1167/72632
head_to_head_pipeline_output/mada_130.consensus.fa tbpore_output/mada_130.consensus.fa 0.06561 0 14425/100000
head_to_head_pipeline_output/mada_131.consensus.fa tbpore_output/mada_131.consensus.fa 0.0527482 0 19783/100000
head_to_head_pipeline_output/mada_1-32.consensus.fa tbpore_output/mada_1-32.consensus.fa 0.146706 0 2350/100000
head_to_head_pipeline_output/mada_132.consensus.fa tbpore_output/mada_132.consensus.fa 0.0323916 0 33914/100000
head_to_head_pipeline_output/mada_1-33.consensus.fa tbpore_output/mada_1-33.consensus.fa 0.0736212 0 11925/100000
head_to_head_pipeline_output/mada_133.consensus.fa tbpore_output/mada_133.consensus.fa 0.11453 0 4726/100000
head_to_head_pipeline_output/mada_134.consensus.fa tbpore_output/mada_134.consensus.fa 0.121946 0 4017/100000
head_to_head_pipeline_output/mada_135.consensus.fa tbpore_output/mada_135.consensus.fa 0.0628593 0 15415/100000
head_to_head_pipeline_output/mada_1-36.consensus.fa tbpore_output/mada_1-36.consensus.fa 0.139695 0 2733/100000
head_to_head_pipeline_output/mada_136.consensus.fa tbpore_output/mada_136.consensus.fa 0.106821 0 5603/100000
head_to_head_pipeline_output/mada_137.consensus.fa tbpore_output/mada_137.consensus.fa 0.118191 0 4361/100000
head_to_head_pipeline_output/mada_1-38.consensus.fa tbpore_output/mada_1-38.consensus.fa 0.0678884 0 13659/100000
head_to_head_pipeline_output/mada_1-39.consensus.fa tbpore_output/mada_1-39.consensus.fa 0.188621 0 289/30063
head_to_head_pipeline_output/mada_139.consensus.fa tbpore_output/mada_139.consensus.fa 0.139256 0 2759/100000
head_to_head_pipeline_output/mada_1-3.consensus.fa tbpore_output/mada_1-3.consensus.fa 0.136916 0 2902/100000
head_to_head_pipeline_output/mada_1-40.consensus.fa tbpore_output/mada_1-40.consensus.fa 0.096656 0 7030/100000
head_to_head_pipeline_output/mada_140.consensus.fa tbpore_output/mada_140.consensus.fa 0.0559536 0 18260/100000
head_to_head_pipeline_output/mada_1-41.consensus.fa tbpore_output/mada_1-41.consensus.fa 0.192205 0 277/31089
head_to_head_pipeline_output/mada_141.consensus.fa tbpore_output/mada_141.consensus.fa 0.122324 0 3984/100000
head_to_head_pipeline_output/mada_142.consensus.fa tbpore_output/mada_142.consensus.fa 0.131246 0 3281/100000
head_to_head_pipeline_output/mada_1-43.consensus.fa tbpore_output/mada_1-43.consensus.fa 0.048752 0 21894/100000
head_to_head_pipeline_output/mada_143.consensus.fa tbpore_output/mada_143.consensus.fa 0.139407 0 2750/100000
head_to_head_pipeline_output/mada_1-44.consensus.fa tbpore_output/mada_1-44.consensus.fa 0.133346 0 3135/100000
head_to_head_pipeline_output/mada_144.consensus.fa tbpore_output/mada_144.consensus.fa 0.131063 0 3294/100000
head_to_head_pipeline_output/mada_1-46.consensus.fa tbpore_output/mada_1-46.consensus.fa 0.0370269 0 29830/100000
head_to_head_pipeline_output/mada_1-47.consensus.fa tbpore_output/mada_1-47.consensus.fa 0.0886736 0 8421/100000
head_to_head_pipeline_output/mada_1-48.consensus.fa tbpore_output/mada_1-48.consensus.fa 0.136063 0 2956/100000
head_to_head_pipeline_output/mada_148.consensus.fa tbpore_output/mada_148.consensus.fa 0.124649 0 3787/100000
head_to_head_pipeline_output/mada_1-50.consensus.fa tbpore_output/mada_1-50.consensus.fa 0.125964 0 3680/100000
head_to_head_pipeline_output/mada_150.consensus.fa tbpore_output/mada_150.consensus.fa 0.0814042 0 9948/100000
head_to_head_pipeline_output/mada_1-51.consensus.fa tbpore_output/mada_1-51.consensus.fa 0.217261 0 185/35269
head_to_head_pipeline_output/mada_151.consensus.fa tbpore_output/mada_151.consensus.fa 0.128034 0 3518/100000
head_to_head_pipeline_output/mada_152.consensus.fa tbpore_output/mada_152.consensus.fa 0.140103 0 2709/100000
head_to_head_pipeline_output/mada_1-53.consensus.fa tbpore_output/mada_1-53.consensus.fa 0.111055 0 5102/100000
head_to_head_pipeline_output/mada_1-54.consensus.fa tbpore_output/mada_1-54.consensus.fa 0.135272 0 3007/100000
head_to_head_pipeline_output/mada_154.consensus.fa tbpore_output/mada_154.consensus.fa 0.116689 0 4507/100000
head_to_head_pipeline_output/mada_1-5.consensus.fa tbpore_output/mada_1-5.consensus.fa 0.126227 0 3659/100000
head_to_head_pipeline_output/mada_1-6.consensus.fa tbpore_output/mada_1-6.consensus.fa 0.144978 0 2439/100000
head_to_head_pipeline_output/mada_1-7.consensus.fa tbpore_output/mada_1-7.consensus.fa 0.0620913 0 15705/100000
head_to_head_pipeline_output/mada_1-8.consensus.fa tbpore_output/mada_1-8.consensus.fa 0.120456 0 4150/100000
head_to_head_pipeline_output/mada_2-1.consensus.fa tbpore_output/mada_2-1.consensus.fa 0.204092 0 194/28002
head_to_head_pipeline_output/mada_2-25.consensus.fa tbpore_output/mada_2-25.consensus.fa 0.142783 0 2557/100000
head_to_head_pipeline_output/mada_2-31.consensus.fa tbpore_output/mada_2-31.consensus.fa 0.0593619 0 16787/100000
head_to_head_pipeline_output/mada_2-34.consensus.fa tbpore_output/mada_2-34.consensus.fa 0.136964 0 2899/100000
head_to_head_pipeline_output/mada_2-42.consensus.fa tbpore_output/mada_2-42.consensus.fa 0.254198 1.24683e-206 63/26161
head_to_head_pipeline_output/mada_2-46.consensus.fa tbpore_output/mada_2-46.consensus.fa 0.104229 0 5935/100000
head_to_head_pipeline_output/mada_2-50.consensus.fa tbpore_output/mada_2-50.consensus.fa 0.10934 0 5299/100000
head_to_head_pipeline_output/mada_2-53.consensus.fa tbpore_output/mada_2-53.consensus.fa 0.0893892 0 8285/100000
We are looking here at head_to_head_pipeline
and tbpore
having consensus distances between 5% and 25% which I think is way too high, incomparable, especially for TB. So I made a small test: I got the mada_2-42
sample, which has a mash distance
of 25.4%, and
tbpore
with the raw nanopore reads (/hps/nobackup/iqbal/mbhall/tech_wars/data/madagascar/nanopore/mada_2-42/mada_2-42.nanopore.fastq.gz
- @mbhall88 could you please confirm this is the path to the raw nanopore reads?) and then mash dist
to confirm that we are indeed getting a mash distance
of 25.4% and there is nothing wrong with the pipeline:$ tbpore -o mada_2-42_without_decom --cleanup /hps/nobackup/iqbal/mbhall/tech_wars/data/madagascar/nanopore/mada_2-42/mada_2-42.nanopore.fastq.gz
...
$ mash dist -s 100000 head_to_head_pipeline_output/mada_2-42.consensus.fa mada_2-42_without_decom/mada_2-42.consensus.fa
Sketching head_to_head_pipeline_output/mada_2-42.consensus.fa (provide sketch file made with "mash sketch" to skip)...done.
head_to_head_pipeline_output/mada_2-42.consensus.fa mada_2-42_without_decom/mada_2-42.consensus.fa 0.254198 1.24683e-206 63/26161
tbpore
with the decontaminated nanopore reads (/hps/nobackup/iqbal/mbhall/tech_wars/data/QC/filtered/madagascar/nanopore/mada_2-42/mada_2-42.filtered.fastq.gz
- @mbhall88 could you please confirm this is the path to decontaminated nanopore reads?) and then mash dist
to now get a mash distance
of only 0.05%:$ tbpore -o mada_2-42_decom --cleanup /hps/nobackup/iqbal/mbhall/tech_wars/data/QC/filtered/madagascar/nanopore/mada_2-42/mada_2-42.filtered.fastq.gz
...
(tbpore) hl-codon-41-01:/hps/nobackup/iqbal/leandro/tbpore/pipelines/snakemake/comparison
$ mash dist -s 100000 head_to_head_pipeline_output/mada_2-42.consensus.fa mada_2-42_decom/mada_2-42.consensus.fa
Sketching head_to_head_pipeline_output/mada_2-42.consensus.fa (provide sketch file made with "mash sketch" to skip)...done.
head_to_head_pipeline_output/mada_2-42.consensus.fa mada_2-42_decom/mada_2-42.consensus.fa 0.000535225 0 22914/23432
So it seems to me decontamination is essential to tbpore
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.