Code Monkey home page Code Monkey logo

tbpore's People

Contributors

github-actions[bot] avatar leoisl avatar mbhall88 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

tbpore's Issues

Duplicated files input bug

I have the following directory structure: sample_example/mada_101/mada_101.nanopore.fastq.gz. If I run tbpore giving the sample_example dir as input, as well as sample_example/mada_101/mada_101.nanopore.fastq.gz, we will build a fastq by duplicating sample_example/mada_101/mada_101.nanopore.fastq.gz and concatenating the copies:

(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ poetry run tbpore --recursive sample_example/ sample_example/mada_101/mada_101.nanopore.fastq.gz
[2022-04-26 10:58:21] INFO     | Welcome to TBpore version 0.1.0
[2022-04-26 10:58:21] INFO     | Searching for fastq files...
[2022-04-26 10:58:21] INFO     | Found 2 fastq files. Joining them...

Linting

I am getting these linting error when running just lint:

(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ just lint
poetry run flake8 . --extend-exclude=".venv/"
./tests/test_external_tools.py:26:89: E501 line too long (95 > 88 characters)
./tests/test_external_tools.py:30:89: E501 line too long (95 > 88 characters)
./tests/test_tbpore.py:27:89: E501 line too long (94 > 88 characters)
./tests/test_tbpore.py:42:89: E501 line too long (95 > 88 characters)
./tests/test_tbpore.py:48:89: E501 line too long (112 > 88 characters)
./tests/test_tbpore.py:49:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:56:89: E501 line too long (122 > 88 characters)
./tests/test_tbpore.py:62:89: E501 line too long (120 > 88 characters)
./tests/test_tbpore.py:69:89: E501 line too long (118 > 88 characters)
./tests/test_tbpore.py:76:89: E501 line too long (117 > 88 characters)
./tests/test_tbpore.py:83:89: E501 line too long (114 > 88 characters)
./tests/test_tbpore.py:90:89: E501 line too long (114 > 88 characters)
./tests/test_tbpore.py:92:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:98:89: E501 line too long (113 > 88 characters)
./tests/test_tbpore.py:99:89: E501 line too long (102 > 88 characters)
./tests/test_tbpore.py:104:89: E501 line too long (124 > 88 characters)
./tests/test_tbpore.py:130:89: E501 line too long (95 > 88 characters)
./tests/test_tbpore.py:136:89: E501 line too long (124 > 88 characters)
./tests/test_tbpore.py:137:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:138:89: E501 line too long (98 > 88 characters)
./tests/test_tbpore.py:144:89: E501 line too long (104 > 88 characters)
./tests/test_tbpore.py:151:89: E501 line too long (129 > 88 characters)
./tests/test_tbpore.py:158:89: E501 line too long (136 > 88 characters)
./tests/test_tbpore.py:165:89: E501 line too long (126 > 88 characters)
./tests/test_tbpore.py:172:89: E501 line too long (123 > 88 characters)
./tests/test_tbpore.py:179:89: E501 line too long (114 > 88 characters)
./tests/test_tbpore.py:181:89: E501 line too long (118 > 88 characters)
./tests/test_tbpore.py:187:89: E501 line too long (122 > 88 characters)
./tests/test_tbpore.py:188:89: E501 line too long (120 > 88 characters)
./tests/test_tbpore.py:197:89: E501 line too long (123 > 88 characters)
./tests/test_tbpore.py:228:89: E501 line too long (112 > 88 characters)
./tests/test_tbpore.py:229:89: E501 line too long (100 > 88 characters)
./tests/test_tbpore.py:236:89: E501 line too long (122 > 88 characters)
./tests/test_tbpore.py:241:89: E501 line too long (120 > 88 characters)
./external_scripts/consensus.py:1:89: E501 line too long (126 > 88 characters)
./external_scripts/consensus.py:378:89: E501 line too long (110 > 88 characters)
./external_scripts/apply_filters.py:1:89: E501 line too long (120 > 88 characters)
./external_scripts/apply_filters.py:380:89: E501 line too long (112 > 88 characters)
./external_scripts/apply_filters.py:390:89: E501 line too long (128 > 88 characters)
./external_scripts/apply_filters.py:401:89: E501 line too long (132 > 88 characters)
./external_scripts/apply_filters.py:411:89: E501 line too long (112 > 88 characters)
./external_scripts/apply_filters.py:440:89: E501 line too long (104 > 88 characters)
./external_scripts/apply_filters.py:482:89: E501 line too long (90 > 88 characters)
./external_scripts/apply_filters.py:493:89: E501 line too long (90 > 88 characters)
./external_scripts/apply_filters.py:504:89: E501 line too long (94 > 88 characters)
./external_scripts/apply_filters.py:689:89: E501 line too long (89 > 88 characters)
./external_scripts/apply_filters.py:700:89: E501 line too long (89 > 88 characters)
./external_scripts/apply_filters.py:710:89: E501 line too long (94 > 88 characters)
./external_scripts/apply_filters.py:720:89: E501 line too long (94 > 88 characters)
./external_scripts/apply_filters.py:730:89: E501 line too long (98 > 88 characters)
./external_scripts/apply_filters.py:741:89: E501 line too long (89 > 88 characters)
./external_scripts/apply_filters.py:752:89: E501 line too long (89 > 88 characters)
./tbpore/tbpore.py:122:89: E501 line too long (95 > 88 characters)
./tbpore/tbpore.py:189:89: E501 line too long (100 > 88 characters)
./tbpore/tbpore.py:226:89: E501 line too long (106 > 88 characters)
./tbpore/tbpore.py:259:89: E501 line too long (107 > 88 characters)
error: Recipe `lint` failed on line 19 with exit code 1

black did not manage to format these long lines. I decided to ignore E501 when running flake8. Are you fine with this? If not, what would be your suggestion to fix this? Manually format these long lines?

Remove contamination

As we are trying to keep a low memory footprint and want to be able to distribute as many resources as possible with tbpore, we only really need to use the H37Rv reference and maybe a few other NTMs as well. Then we keep anything that maps to H37Rv.

Some NTM references can be found in https://www.nature.com/articles/srep45258#Sec25 and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0214088

The code I used to do this in the head to head was https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/data/QC/Snakefile#L272-L383 (ignore the rule for extracting illumina reads obviously)

Original issue: #1

Cache mykrobe/ctx binary skeletons

Done in d9a0c02 but IDK if you agree with this strategy. Basically, it creates a .cache dir in the repo root and stores the mykrobe/ctx binary skeletons there. However, if a new skeleton should be built (due to a first run or a catalogue update) and if >1 runs of tbpore is done simultaneously, this will create bugs. Should we address this programatically or tell them to take care about this and live with it?

Comparing paper results with results with new filters (min_depth 5, min_qual 25, min_mq 30)

This issue is to compare paper results with results obtained with new filters. These new filters are min_depth 5 (paper is 0), min_qual 25 (paper is 85), min_mq 30 (paper is 0). It is worth noting that some filters were left untouched, i.e. min_fed 0.2, min_strand_bias 1, min_vdb 0.00001 and min_frs 0.90. This is related to #22 but there will be several plots comparison here, so I am starting a new issue.

Directory input bug

I have the following directory structure: sample_example/mada_101/mada_101.nanopore.fastq.gz. If I run tbpore giving the sample_example dir as input, I get this issue:

(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ poetry run tbpore sample_example
[2022-04-26 10:36:34] INFO     | Welcome to TBpore version 0.1.0
[2022-04-26 10:36:34] INFO     | Searching for fastq files...
[2022-04-26 10:36:34] INFO     | Found 0 fastq files. Joining them...

And it gets stuck here

Real data test, tbpore vs head-to-head pipeline

tbpore does not use a decontamination step, is there a list of samples that we could run tbpore to compare against the head-to-head pipeline results? This would allow us to understand the impact of not decontaminating samples, and ensure us that is actually fine to skip this step. And can also be another test to ensure the tool is actually doing what it is supposed to do

Make the reference configurable

This is a request from Floriane and Pedro from our meeting on 18/05/2022.

H37Rv is not always the reference they might want describe the variants with respect to, so the reference has to be configurable (e.g. through the parameters). Right now it is always forced to be H37Rv.

Add sample example

Should we add a sample example (a real but downsampled fastq.gz file), that could go potentially in the integration tests, and as a first test command for users to run? I am in doubt as this would make the package heavier

Better handling of decontamination database

I think it might be worth adding a download subcommand that downloads the database and checks it's md5sum (or even better, sha256) and puts it in the expected location. Have had issues with both of these things when supporting collaborators.

Remove contamination

As we are trying to keep a low memory footprint and want to be able to distribute as many resources as possible with tbpore, we only really need to use the H37Rv reference and maybe a few other NTMs as well. Then we keep anything that maps to H37Rv.

Some NTM references can be found in https://www.nature.com/articles/srep45258#Sec25 and https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0214088

The code I used to do this in the head to head was https://github.com/mbhall88/head_to_head_pipeline/blob/3ed818c9f3754612a33060428ee651cd0f91234b/data/QC/Snakefile#L272-L383 (ignore the rule for extracting illumina reads obviously)

Expose subsampling options

          sorry to comment on this again...is there a way to 'disable' rasusa and not do the subsampling from the fastq file?

Originally posted by @jloubser in #49 (comment)

I'll add an option that allows for controlling the depth to subsampling, with an options to disable too

Add option for strongly connected components

Have an option in cluster that produces strongly connected component clusters.

A use case for this was when there is a sample or two with low distance to a sample from a different lineage. This caused two clusters from separate lineages to get joined because the two samples had lots of masked variants due to low FRS

Update README.md

I will leave this to you, as I think you know how to better describe this tool

Test the impact on clustering of changing `minimap2` index to use less RAM

The current minimap2 index was built with -I 12G to match the H2H index. This pushes the tbpore RAM usage when running tbpore process to 13.1GB. We could instead build the index with -I 500M, which would take the tbpore process RAM down to ~5GB, which is much more runnable in a personal laptop, but then the results are not identical to the H2H results. We should evaluate the impact of this different index on the clustering and on the tbpore results in general, and infer if is indeed OK to switch to this lighter index. This might be related to #22

Create subcommand `tbpore cluster`

Command: tbpore cluster <list_of_consensus> -o <outdir>

Will run psdm and generate the clusters as in https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/notebooks/clustering.py.ipynb . No need for cluster visualisation, just the identifiers of the samples that are clustered, like comma-sep lists.

Another option is based on this excerpt from slack:

I was basically thinking that in the docs we explain that after a tbpore run you do the clustering by running psdm between the previous consensus sequences (all in one file) and the new consensus. You would need to use psdm’s long-form output (-l) and you can append this long-form distances to an existing file of all previous distances. Then after this you append the new consensus sequence to the previous ones.

Then the user I think would need to run a couple of cat commands to update some files, psdm to update the matrix and a script we provide to build the clusters from the psdm matrix. As this is some commands, it seems it would be easier for them to have this tbpore cluster command.

The current tbpore execution becomes command tbpore process

Run several samples at once

This is a request from Floriane and Pedro from our meeting on 18/05/2022.

Their samples come in batches, so it would be helpful if tbpore process could be run on several samples with a single command instead of having to run a tbpore process command per sample.

Switch away from conda

This is a request from Floriane and Pedro from our meeting on 18/05/2022.

The cluster they will use to run tbpore in Canada has a policy to not allow users to run conda because it messes up with the ~/.bashrc configuration. Is there an alternative way for us to provide the dependencies? pypi would be great, but IDK if all dependencies are installable through pypi, my guess is not e.g. psdm, rasusa, etc... @mbhall88 do you have any suggestions? They said that they managed to run tbpore for now, by installing the dependencies through conda, but IIRC in a virtual machine (or a container?) in their cluster, but they also said this is not a long-term solution...

Singularity Container read-only file system error

Hello tbpore team,

Thank you for uploading the new tbpore 0.2.0 container. The new --db option seems to work without issue, however we're still getting this error upon running it:

  File "/usr/local/bin/tbpore", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/tbpore/tbpore.py", line 506, in main
    main_cli()
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/tbpore/tbpore.py", line 258, in process
    tmp, logdir = setup_dirs(outdir, tmp)
  File "/usr/local/lib/python3.9/site-packages/tbpore/tbpore.py", line 110, in setup_dirs
    cache_dir.mkdir(parents=True, exist_ok=True)
  File "/usr/local/lib/python3.9/pathlib.py", line 1323, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/usr/local/lib/python3.9/site-packages/.cache'

That's not a section of the code we would be able to change ourselves, so could you please help us with it?

The command we're using is as follow:
singularity exec -H path_to/tbpore tbpore_0.2.0.sif tbpore process --db data/decontamination_db/tbpore.remove_contam.fa.gz.map-ont.mmi -o sample_example/tbpore_out --cleanup sample_example/mada_1-7.subsampled.fastq.gz

Thanks again for your time and help!

Threshold for the phred-scaled VCF call quality

Problem overview

In tbpore, we downsample the reads depth to 150x and we have a lower bound for the phred-scaled VCF call quality of 85 (which accepts only 1 error in 100M - this is an oversimplification of the quality score, there are other data that also affect the score, but this is a good approximation). This causes us to flag and thus discard all non-perfect calls as low-quality calls (we define here a perfect call when all reads agree with the consensus, and non-perfect calls all the others (even if a single read disagree with the consensus out of ~150 or even 1000 reads, we will flag the call as low quality because our threshold for the call quality just accepts 1 error in 100M)). In practice, we see perfect calls having phred-scaled quality between 250 and 350, and quasi-perfect calls (the ones that have a single read disagreeing with the consensus) between 20 and 30, the exact values depend on how many reads overlap the locus and other info. The issue is that there are hundreds of thousands of quasi-perfect calls that get filtered out, and this becomes a N in the consensus sequence. When we cluster based on the consensus sequence, we might overcluster as we will consider that N equals any base (if we compare two samples, and one calls an alt and the other calls an N, but this N was a due to a quasi-perfect call to ref, then we will not take this difference in consideration, whereas we should).

Data

I calculated the mean and max number of calls the H2H pipeline discards due to them being tagged as low quality (lq flag) while having an "acceptable" quality (phred-scaled VCF call quality >= 20, i.e. we accept at most 1 error in 100).

For nanopore, we have ~767k VCF calls (bases) on average per sample (and a max of 1158k calls) that are filtered out solely due to being lq while having quality >=20. Note that these calls have no other flag, i.e. if we reduce the quality threshold from 85 to 20, we will fish back all these calls:

$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/madagascar/ -name *.bcf`; do bcftools view $f | awk '$6>=20 && $7=="lq" ' | wc -l; done | datamash mean 1 max 1
767632.54945055	1158679

If we reduce the quality threshold down to 20, we would instead filter out just ~1.72 calls per sample solely due to being lq (157 calls in all the 91 samples):

$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/madagascar/ -name *.bcf`; do bcftools view $f | awk '$6<20 && $7=="lq" ' | wc -l; done | datamash mean 1 max 1
1.7252747252747	34

These would be the 3 lq calls with highest quality that we would ignore:

NC_000962.3	3711723	.	A	.	19.695	lq	DP=59;ADF=26;ADR=21;SCR=18;VDB=0.112489;SGB=-0.590765;RPBZ=-2.43914;MQBZ=4.21187;MQSBZ=1.66428;BQBZ=-1.38604;SCBZ=-0.576347;FS=0;MQ0F=0.610169;AN=1;DP4=26,21,1,4;MQ=4	GT:SP	0:7
NC_000962.3	2635661	.	G	.	19.6942	lq	DP=150;ADF=73;ADR=29;SCR=97;VDB=0.18;SGB=-0.453602;RPBZ=0.698446;MQBZ=2.87256;MQSBZ=-1.63911;BQBZ=1.45854;SCBZ=-1.15867;FS=0;MQ0F=0.233333;AN=1;DP4=73,29,1,1;MQ=2GT:SP	0:3
NC_000962.3	3711721	.	T	.	19.377	lq	DP=75;ADF=33;ADR=23;SCR=20;VDB=0.116687;SGB=-0.511536;RPBZ=-2.82962;MQBZ=3.38282;MQSBZ=1.801;BQBZ=0.570723;SCBZ=1.42944;FS=0;MQ0F=0.48;AN=1;DP4=33,23,2,1;MQ=3	GT:SP	0:0

The second one has 102 reads calling ref and only 2 calling alt.

We can't fairly compare the stats above with the Illumina calls. IDK much about the H2H pipeline, but it seems that the illumina VCFs were not generated by us, but by a thirdparty using the COMPASS pipeline? Anyway, I did not find the same flags between the ONT and the Illumina VCFs, but I think the Illumina VCFs are filtering with a phred-scaled VCF call quality >= 25, due to this header:

##FILTER=<ID=S25,Number=0,Type=Float,Description="Base not called because min SNP quality less than 25">

If we assume this, we have much less calls filtered out (~255 per sample on average) solely due to being low quality (note that here we are not even requiring the quality to be >=20, although we know it is <25 already):

$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/illumina/gvcfs/ -name mada*.vcf.gz`; do zcat $f | awk '$7=="S25"' | wc -l; done | datamash mean 1 max 1
255.61016949153	5413

Conclusion

Just by looking at the nanopore stats, it seems we might be filtering way too much (average of ~17.4% of the genome). If when comparing two samples, an alt call matches an imperfect ref call of the other sample, or we have an imperfect alt call in one of the samples, this difference will not be counted, and thus we might overcluster our samples. A possible solution is to just decrease the lower bound for the phred-scaled VCF call quality from 85 to 20 and evaluate it. The issue is that I don't know how to evaluate if results are better or not. 20 might be a too lenient value too, as we would discard just ~2 calls per sample. Maybe we should use the same value that COMPASS use, 25? We need a way to evaluate results and do a parameter sweep to find the best threshold. I am wondering if rerunning the H2H pipeline help us infer if results are better.

Sort fastq to ensure reproducibility

We had an example where running tbpore on a single, combined fastq produced difference results to running it on a directory of the individual fastqs that make up the combined one. The reason the results were different between using the combined fastq and the directory of fastqs has to do with subsampling. When we randomly subsample the reads we set a random seed so that the same reads will always be selected. BUT this is only guaranteed if the reads are in the same order. When we combine the reads on the command line we get a differet ordering to when we combine the reads from a directory with python.

To avoid this problem in the future, we will sort the reads with seqkit sort (which is already a dependency) before subsampling

Performance

Performance documentation, but just on the sample data (mada_1-7 subsampled reads: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR903/ERR9030361/mada_1-7.subsampled.fastq.gz , ~163k ONT reads).

Summary

On my local laptop with:

  • 1 thread: time: 9m52s, RAM: 658MB
  • 4 threads: time: 8m24s, RAM: 816MB

Looks OK to me, i.e. runnable on a standard laptop using <1 GB of RAM and in a few minutes. Multithreading does not speed up the execution much though...
I used the subsampled reads from https://www.ebi.ac.uk/ena/browser/view/ERR9030361?show=reads , which I think were the rasusa subsampled reads? Should I re-evaluate with the full reads dataset, which might be closer to their use case?

Breakdown

1 thread:

[2022-04-26 13:13:16] INFO     | Welcome to TBpore version 0.1.0
[2022-04-26 13:13:16] INFO     | Started running mykrobe predict --sample mada_1-7 -t 1 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:14:37] INFO     | Done running mykrobe predict --sample mada_1-7 -t 1 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:14:37] INFO     | Started running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:15:19] INFO     | Done running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:15:19] INFO     | Started running minimap2 -t 1 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz ...
[2022-04-26 13:17:10] INFO     | Done running minimap2 -t 1 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz
[2022-04-26 13:17:10] INFO     | Started running samtools sort -@ 1 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam ...
[2022-04-26 13:17:12] INFO     | Done running samtools sort -@ 1 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam
[2022-04-26 13:17:12] INFO     | Started running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 1 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam ...
[2022-04-26 13:20:04] INFO     | Done running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 1 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam
[2022-04-26 13:20:04] INFO     | Started running bcftools call --threads 1 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf ...
[2022-04-26 13:20:18] INFO     | Done running bcftools call --threads 1 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf
[2022-04-26 13:20:18] INFO     | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf ...
[2022-04-26 13:22:37] INFO     | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf
[2022-04-26 13:22:37] INFO     | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed ...
[2022-04-26 13:23:08] INFO     | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed
[2022-04-26 13:23:08] INFO     | Cleaning up temporary files...
[2022-04-26 13:23:08] SUCCESS  | Done
        Command being timed: "tbpore -o sample_example/tbpore_out --cleanup sample_example/mada_1-7.subsampled.fastq.gz"
        User time (seconds): 590.41
        System time (seconds): 12.49
        Percent of CPU this job got: 101%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 9:52.40
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 674368
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 87
        Minor (reclaiming a frame) page faults: 2037873
        Voluntary context switches: 145720
        Involuntary context switches: 84042
        Swaps: 0
        File system inputs: 478520
        File system outputs: 6127072
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

4 threads:

(tbpore) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/tbpore$ /usr/bin/time --verbose tbpore -o sample_example/tbpore_out --cleanup -t 4 sample_example/mada_1-7.subsampled.fastq.gz 
[2022-04-26 13:26:00] INFO     | Welcome to TBpore version 0.1.0
[2022-04-26 13:26:00] INFO     | Started running mykrobe predict --sample mada_1-7 -t 4 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:26:55] INFO     | Done running mykrobe predict --sample mada_1-7 -t 4 --tmp sample_example/tbpore_out/.tbpore --skeleton_dir /home/leandro/git/tbpore/.cache -e 0.08 --ploidy haploid --format json --min_proportion_expected_depth 0.20 --species tb -m 2048MB -o sample_example/tbpore_out/mada_1-7.mykrobe.json -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:26:55] INFO     | Started running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz ...
[2022-04-26 13:27:39] INFO     | Done running rasusa -c 150 -g 4411532 -s 88 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz -i sample_example/tbpore_out/.tbpore/mada_1-7.fq.gz
[2022-04-26 13:27:39] INFO     | Started running minimap2 -t 4 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz ...
[2022-04-26 13:28:25] INFO     | Done running minimap2 -t 4 -a -L --sam-hit-only --secondary=no -x map-ont -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.fastq.gz
[2022-04-26 13:28:25] INFO     | Started running samtools sort -@ 4 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam ...
[2022-04-26 13:28:27] INFO     | Done running samtools sort -@ 4 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sam
[2022-04-26 13:28:27] INFO     | Started running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 4 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam ...
[2022-04-26 13:31:12] INFO     | Done running bcftools mpileup -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz --threads 4 -x -I -Q 13 -a INFO/SCR,FORMAT/SP,INFO/ADR,INFO/ADF -h100 -M10000 -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.sorted.sam
[2022-04-26 13:31:12] INFO     | Started running bcftools call --threads 4 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf ...
[2022-04-26 13:31:27] INFO     | Done running bcftools call --threads 4 --ploidy 1 -V indels -m -o sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.pileup.vcf
[2022-04-26 13:31:27] INFO     | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf ...
[2022-04-26 13:33:52] INFO     | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/apply_filters.py -P --verbose --overwrite -d 0 -D 0 -q 85 -s 1 -b 0 -m 0 -r 0 -V 1e-05 -G 0 -K 0.9 -M 0 -x 0.2 -o sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -i sample_example/tbpore_out/.tbpore/mada_1-7.subsampled.snps.vcf
[2022-04-26 13:33:52] INFO     | Started running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed ...
[2022-04-26 13:34:24] INFO     | Done running /home/leandro/miniconda3/envs/tbpore/bin/python /home/leandro/git/tbpore/external_scripts/consensus.py --sample-id mada_1-7 --verbose --ignore all --het-default none -o sample_example/tbpore_out/mada_1-7.consensus.fa -i sample_example/tbpore_out/mada_1-7.subsampled.snps.filtered.vcf -f /home/leandro/git/tbpore/data/H37RV_genome/h37rv.fa.gz -m /home/leandro/git/tbpore/data/H37RV_genome/compass-mask.bed
[2022-04-26 13:34:24] INFO     | Cleaning up temporary files...
[2022-04-26 13:34:24] SUCCESS  | Done
        Command being timed: "tbpore -o sample_example/tbpore_out --cleanup -t 4 sample_example/mada_1-7.subsampled.fastq.gz"
        User time (seconds): 659.30
        System time (seconds): 11.96
        Percent of CPU this job got: 133%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 8:24.28
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 835656
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3
        Minor (reclaiming a frame) page faults: 2086481
        Voluntary context switches: 67926
        Involuntary context switches: 63398
        Swaps: 0
        File system inputs: 664
        File system outputs: 6127080
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Output directory param or output consensus param?

We have an output directory parameter right now in tbpore:

@click.option(
    "-o",
    "--outdir",
    help="Directory to place output files",
    default=".",
    show_default=True,
    type=click.Path(file_okay=False, writable=True, path_type=Path),
)

However, the only output we are producing right now is a single consensus fasta file. Could we change this output directory parameter to an output consensus file, or are there other files we should be outputting and I am missing?

Cryptic Mac OS CI issue

Haven't searched much, but don't know how to solve this cryptic Mac OS CI issue:

  ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/usr/local/miniconda/envs/tbpore/lib/python3.8/site-packages/~ackaging-20.9.dist-info'

This is happening when poetry is installing toml

Subsample reads

We can subsample the decontaminated reads (#1) with rasusa to 150x to ensure we know the upper limit on memory usage. We could concieveably go lower, but this just copies what I did in the head to head paper.

Decontamination has a huge effect on real data tests

I used this snakemake pipeline to run tbpore on 91 madagascar samples from Michael. I wanted to compare the head_to_head_pipeline results (described here) and tbpore results when inputting the raw ONT reads, but had difficulties comparing the mykrobe output files as well as the VCFs, due to a high number of differences, so I decided to just compare the easiest output: the consensus fastas. I tried to use edit distance, but it was too slow, so I switched to mash dist. However, the differences between the head_to_head_pipeline and tbpore on these 91 samples were extremely high (3rd column is the mash distance, first pair of samples have a mash distance if 12.5%, the second 22.9% and so on):

head_to_head_pipeline_output/mada_102.consensus.fa	tbpore_output/mada_102.consensus.fa	0.125753	0	3697/100000
head_to_head_pipeline_output/mada_103.consensus.fa	tbpore_output/mada_103.consensus.fa	0.229735	0	120/29764
head_to_head_pipeline_output/mada_104.consensus.fa	tbpore_output/mada_104.consensus.fa	0.0584744	0	17157/100000
head_to_head_pipeline_output/mada_105.consensus.fa	tbpore_output/mada_105.consensus.fa	0.114674	0	4711/100000
head_to_head_pipeline_output/mada_106.consensus.fa	tbpore_output/mada_106.consensus.fa	0.0823425	0	9735/100000
head_to_head_pipeline_output/mada_107.consensus.fa	tbpore_output/mada_107.consensus.fa	0.106524	0	5640/100000
head_to_head_pipeline_output/mada_109.consensus.fa	tbpore_output/mada_109.consensus.fa	0.124576	0	3793/100000
head_to_head_pipeline_output/mada_1-10.consensus.fa	tbpore_output/mada_1-10.consensus.fa	0.19382	0	291/33797
head_to_head_pipeline_output/mada_110.consensus.fa	tbpore_output/mada_110.consensus.fa	0.0446784	0	24325/100000
head_to_head_pipeline_output/mada_1-11.consensus.fa	tbpore_output/mada_1-11.consensus.fa	0.171562	0	1000/72402
head_to_head_pipeline_output/mada_111.consensus.fa	tbpore_output/mada_111.consensus.fa	0.102631	0	6150/100000
head_to_head_pipeline_output/mada_1-12.consensus.fa	tbpore_output/mada_1-12.consensus.fa	0.151909	0	2070/98494
head_to_head_pipeline_output/mada_112.consensus.fa	tbpore_output/mada_112.consensus.fa	0.145902	0	2391/100000
head_to_head_pipeline_output/mada_1-13.consensus.fa	tbpore_output/mada_1-13.consensus.fa	0.079314	0	10441/100000
head_to_head_pipeline_output/mada_113.consensus.fa	tbpore_output/mada_113.consensus.fa	0.0946241	0	7359/100000
head_to_head_pipeline_output/mada_1-14.consensus.fa	tbpore_output/mada_1-14.consensus.fa	0.0903614	0	8104/100000
head_to_head_pipeline_output/mada_1-15.consensus.fa	tbpore_output/mada_1-15.consensus.fa	0.111099	0	5097/100000
head_to_head_pipeline_output/mada_115.consensus.fa	tbpore_output/mada_115.consensus.fa	0.0974869	0	6900/100000
head_to_head_pipeline_output/mada_1-16.consensus.fa	tbpore_output/mada_1-16.consensus.fa	0.086014	0	8948/100000
head_to_head_pipeline_output/mada_116.consensus.fa	tbpore_output/mada_116.consensus.fa	0.101133	0	6359/100000
head_to_head_pipeline_output/mada_1-17.consensus.fa	tbpore_output/mada_1-17.consensus.fa	0.0743196	0	11731/100000
head_to_head_pipeline_output/mada_117.consensus.fa	tbpore_output/mada_117.consensus.fa	0.218162	0	192/37306
head_to_head_pipeline_output/mada_1-18.consensus.fa	tbpore_output/mada_1-18.consensus.fa	0.0882895	0	8495/100000
head_to_head_pipeline_output/mada_118.consensus.fa	tbpore_output/mada_118.consensus.fa	0.207319	0	277/42805
head_to_head_pipeline_output/mada_1-19.consensus.fa	tbpore_output/mada_1-19.consensus.fa	0.0531443	0	19587/100000
head_to_head_pipeline_output/mada_1-1.consensus.fa	tbpore_output/mada_1-1.consensus.fa	0.127968	0	3523/100000
head_to_head_pipeline_output/mada_1-20.consensus.fa	tbpore_output/mada_1-20.consensus.fa	0.103732	0	6001/100000
head_to_head_pipeline_output/mada_120.consensus.fa	tbpore_output/mada_120.consensus.fa	0.253677	6.99497e-181	55/22590
head_to_head_pipeline_output/mada_1-21.consensus.fa	tbpore_output/mada_1-21.consensus.fa	0.0574401	0	17600/100000
head_to_head_pipeline_output/mada_121.consensus.fa	tbpore_output/mada_121.consensus.fa	0.13785	0	2844/100000
head_to_head_pipeline_output/mada_1-22.consensus.fa	tbpore_output/mada_1-22.consensus.fa	0.135488	0	2993/100000
head_to_head_pipeline_output/mada_122.consensus.fa	tbpore_output/mada_122.consensus.fa	0.101508	0	6306/100000
head_to_head_pipeline_output/mada_123.consensus.fa	tbpore_output/mada_123.consensus.fa	0.0397561	0	27708/100000
head_to_head_pipeline_output/mada_124.consensus.fa	tbpore_output/mada_124.consensus.fa	0.199724	0	298/39217
head_to_head_pipeline_output/mada_1-25.consensus.fa	tbpore_output/mada_1-25.consensus.fa	0.0776128	0	10862/100000
head_to_head_pipeline_output/mada_125.consensus.fa	tbpore_output/mada_125.consensus.fa	0.0898912	0	8191/100000
head_to_head_pipeline_output/mada_126.consensus.fa	tbpore_output/mada_126.consensus.fa	0.122175	0	3997/100000
head_to_head_pipeline_output/mada_127.consensus.fa	tbpore_output/mada_127.consensus.fa	0.117475	0	4430/100000
head_to_head_pipeline_output/mada_1-28.consensus.fa	tbpore_output/mada_1-28.consensus.fa	0.0694959	0	13146/100000
head_to_head_pipeline_output/mada_128.consensus.fa	tbpore_output/mada_128.consensus.fa	0.133612	0	3117/100000
head_to_head_pipeline_output/mada_129.consensus.fa	tbpore_output/mada_129.consensus.fa	0.10114	0	6358/100000
head_to_head_pipeline_output/mada_1-2.consensus.fa	tbpore_output/mada_1-2.consensus.fa	0.0881142	0	8529/100000
head_to_head_pipeline_output/mada_1-30.consensus.fa	tbpore_output/mada_1-30.consensus.fa	0.164465	0	1167/72632
head_to_head_pipeline_output/mada_130.consensus.fa	tbpore_output/mada_130.consensus.fa	0.06561	0	14425/100000
head_to_head_pipeline_output/mada_131.consensus.fa	tbpore_output/mada_131.consensus.fa	0.0527482	0	19783/100000
head_to_head_pipeline_output/mada_1-32.consensus.fa	tbpore_output/mada_1-32.consensus.fa	0.146706	0	2350/100000
head_to_head_pipeline_output/mada_132.consensus.fa	tbpore_output/mada_132.consensus.fa	0.0323916	0	33914/100000
head_to_head_pipeline_output/mada_1-33.consensus.fa	tbpore_output/mada_1-33.consensus.fa	0.0736212	0	11925/100000
head_to_head_pipeline_output/mada_133.consensus.fa	tbpore_output/mada_133.consensus.fa	0.11453	0	4726/100000
head_to_head_pipeline_output/mada_134.consensus.fa	tbpore_output/mada_134.consensus.fa	0.121946	0	4017/100000
head_to_head_pipeline_output/mada_135.consensus.fa	tbpore_output/mada_135.consensus.fa	0.0628593	0	15415/100000
head_to_head_pipeline_output/mada_1-36.consensus.fa	tbpore_output/mada_1-36.consensus.fa	0.139695	0	2733/100000
head_to_head_pipeline_output/mada_136.consensus.fa	tbpore_output/mada_136.consensus.fa	0.106821	0	5603/100000
head_to_head_pipeline_output/mada_137.consensus.fa	tbpore_output/mada_137.consensus.fa	0.118191	0	4361/100000
head_to_head_pipeline_output/mada_1-38.consensus.fa	tbpore_output/mada_1-38.consensus.fa	0.0678884	0	13659/100000
head_to_head_pipeline_output/mada_1-39.consensus.fa	tbpore_output/mada_1-39.consensus.fa	0.188621	0	289/30063
head_to_head_pipeline_output/mada_139.consensus.fa	tbpore_output/mada_139.consensus.fa	0.139256	0	2759/100000
head_to_head_pipeline_output/mada_1-3.consensus.fa	tbpore_output/mada_1-3.consensus.fa	0.136916	0	2902/100000
head_to_head_pipeline_output/mada_1-40.consensus.fa	tbpore_output/mada_1-40.consensus.fa	0.096656	0	7030/100000
head_to_head_pipeline_output/mada_140.consensus.fa	tbpore_output/mada_140.consensus.fa	0.0559536	0	18260/100000
head_to_head_pipeline_output/mada_1-41.consensus.fa	tbpore_output/mada_1-41.consensus.fa	0.192205	0	277/31089
head_to_head_pipeline_output/mada_141.consensus.fa	tbpore_output/mada_141.consensus.fa	0.122324	0	3984/100000
head_to_head_pipeline_output/mada_142.consensus.fa	tbpore_output/mada_142.consensus.fa	0.131246	0	3281/100000
head_to_head_pipeline_output/mada_1-43.consensus.fa	tbpore_output/mada_1-43.consensus.fa	0.048752	0	21894/100000
head_to_head_pipeline_output/mada_143.consensus.fa	tbpore_output/mada_143.consensus.fa	0.139407	0	2750/100000
head_to_head_pipeline_output/mada_1-44.consensus.fa	tbpore_output/mada_1-44.consensus.fa	0.133346	0	3135/100000
head_to_head_pipeline_output/mada_144.consensus.fa	tbpore_output/mada_144.consensus.fa	0.131063	0	3294/100000
head_to_head_pipeline_output/mada_1-46.consensus.fa	tbpore_output/mada_1-46.consensus.fa	0.0370269	0	29830/100000
head_to_head_pipeline_output/mada_1-47.consensus.fa	tbpore_output/mada_1-47.consensus.fa	0.0886736	0	8421/100000
head_to_head_pipeline_output/mada_1-48.consensus.fa	tbpore_output/mada_1-48.consensus.fa	0.136063	0	2956/100000
head_to_head_pipeline_output/mada_148.consensus.fa	tbpore_output/mada_148.consensus.fa	0.124649	0	3787/100000
head_to_head_pipeline_output/mada_1-50.consensus.fa	tbpore_output/mada_1-50.consensus.fa	0.125964	0	3680/100000
head_to_head_pipeline_output/mada_150.consensus.fa	tbpore_output/mada_150.consensus.fa	0.0814042	0	9948/100000
head_to_head_pipeline_output/mada_1-51.consensus.fa	tbpore_output/mada_1-51.consensus.fa	0.217261	0	185/35269
head_to_head_pipeline_output/mada_151.consensus.fa	tbpore_output/mada_151.consensus.fa	0.128034	0	3518/100000
head_to_head_pipeline_output/mada_152.consensus.fa	tbpore_output/mada_152.consensus.fa	0.140103	0	2709/100000
head_to_head_pipeline_output/mada_1-53.consensus.fa	tbpore_output/mada_1-53.consensus.fa	0.111055	0	5102/100000
head_to_head_pipeline_output/mada_1-54.consensus.fa	tbpore_output/mada_1-54.consensus.fa	0.135272	0	3007/100000
head_to_head_pipeline_output/mada_154.consensus.fa	tbpore_output/mada_154.consensus.fa	0.116689	0	4507/100000
head_to_head_pipeline_output/mada_1-5.consensus.fa	tbpore_output/mada_1-5.consensus.fa	0.126227	0	3659/100000
head_to_head_pipeline_output/mada_1-6.consensus.fa	tbpore_output/mada_1-6.consensus.fa	0.144978	0	2439/100000
head_to_head_pipeline_output/mada_1-7.consensus.fa	tbpore_output/mada_1-7.consensus.fa	0.0620913	0	15705/100000
head_to_head_pipeline_output/mada_1-8.consensus.fa	tbpore_output/mada_1-8.consensus.fa	0.120456	0	4150/100000
head_to_head_pipeline_output/mada_2-1.consensus.fa	tbpore_output/mada_2-1.consensus.fa	0.204092	0	194/28002
head_to_head_pipeline_output/mada_2-25.consensus.fa	tbpore_output/mada_2-25.consensus.fa	0.142783	0	2557/100000
head_to_head_pipeline_output/mada_2-31.consensus.fa	tbpore_output/mada_2-31.consensus.fa	0.0593619	0	16787/100000
head_to_head_pipeline_output/mada_2-34.consensus.fa	tbpore_output/mada_2-34.consensus.fa	0.136964	0	2899/100000
head_to_head_pipeline_output/mada_2-42.consensus.fa	tbpore_output/mada_2-42.consensus.fa	0.254198	1.24683e-206	63/26161
head_to_head_pipeline_output/mada_2-46.consensus.fa	tbpore_output/mada_2-46.consensus.fa	0.104229	0	5935/100000
head_to_head_pipeline_output/mada_2-50.consensus.fa	tbpore_output/mada_2-50.consensus.fa	0.10934	0	5299/100000
head_to_head_pipeline_output/mada_2-53.consensus.fa	tbpore_output/mada_2-53.consensus.fa	0.0893892	0	8285/100000

We are looking here at head_to_head_pipeline and tbpore having consensus distances between 5% and 25% which I think is way too high, incomparable, especially for TB. So I made a small test: I got the mada_2-42 sample, which has a mash distance of 25.4%, and

  1. Rerun tbpore with the raw nanopore reads (/hps/nobackup/iqbal/mbhall/tech_wars/data/madagascar/nanopore/mada_2-42/mada_2-42.nanopore.fastq.gz - @mbhall88 could you please confirm this is the path to the raw nanopore reads?) and then mash dist to confirm that we are indeed getting a mash distance of 25.4% and there is nothing wrong with the pipeline:
$ tbpore -o mada_2-42_without_decom --cleanup /hps/nobackup/iqbal/mbhall/tech_wars/data/madagascar/nanopore/mada_2-42/mada_2-42.nanopore.fastq.gz
...
$ mash dist -s 100000 head_to_head_pipeline_output/mada_2-42.consensus.fa mada_2-42_without_decom/mada_2-42.consensus.fa 
Sketching head_to_head_pipeline_output/mada_2-42.consensus.fa (provide sketch file made with "mash sketch" to skip)...done.
head_to_head_pipeline_output/mada_2-42.consensus.fa	mada_2-42_without_decom/mada_2-42.consensus.fa	0.254198	1.24683e-206	63/26161
  1. Run tbpore with the decontaminated nanopore reads (/hps/nobackup/iqbal/mbhall/tech_wars/data/QC/filtered/madagascar/nanopore/mada_2-42/mada_2-42.filtered.fastq.gz - @mbhall88 could you please confirm this is the path to decontaminated nanopore reads?) and then mash dist to now get a mash distance of only 0.05%:
$ tbpore -o mada_2-42_decom --cleanup /hps/nobackup/iqbal/mbhall/tech_wars/data/QC/filtered/madagascar/nanopore/mada_2-42/mada_2-42.filtered.fastq.gz
...
(tbpore) hl-codon-41-01:/hps/nobackup/iqbal/leandro/tbpore/pipelines/snakemake/comparison
$ mash dist -s 100000 head_to_head_pipeline_output/mada_2-42.consensus.fa mada_2-42_decom/mada_2-42.consensus.fa 
Sketching head_to_head_pipeline_output/mada_2-42.consensus.fa (provide sketch file made with "mash sketch" to skip)...done.
head_to_head_pipeline_output/mada_2-42.consensus.fa	mada_2-42_decom/mada_2-42.consensus.fa	0.000535225	0	22914/23432

So it seems to me decontamination is essential to tbpore

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.