makovalab-psu / discovery

K-mer based classifier for Y-contig identification from Whole Genome Assemblies

License: MIT License

Python 1.39% Jupyter Notebook 96.23% Shell 2.38%

discovery's People

rsharris yun-xia biocko shivanshss

discovery's Issues

Is python3 required?

I was trying to run this using python2 (because I didn't know any better) and got an exception from str.maketrans() in kmers.py. In python2, str objects don't have a maketrans method (the function lives in the string module instead), whereas in python3 they do.

If the package isn't intended to support python2, it would be nice to say that in the readme.
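
A minimal sketch of how a script could fail early with a readable message when run under Python 2, rather than with an exception from inside kmers.py. The reverse_complement helper is a hypothetical illustration of a str.maketrans() call and is not taken from the repository.

```python
import sys

# Fail early with a clear message instead of an AttributeError deep in the code.
if sys.version_info < (3, 0):
    sys.exit("This sketch assumes Python 3; str.maketrans() does not exist in Python 2.")

# Translation table built with the Python 3 static method str.maketrans().
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACGTN"))
```

A guard like this costs one line at the top of the entry script and makes the Python 3 requirement visible even if the readme does not mention it.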

memory

Hi,
This is probably related to issues #13 and #12 but since there are no helpful answers to those I have opened a new one. I am running DiscoverY for what is probably a large dataset (though the genome size is ~1.2Gb, so smaller than human), and I keep getting a python MemoryError:

Started DiscoverY
Mode female+male
Using default of k=25 and input folder='data'
Shortlisting Y-contigs
Need to make Bloom Filter of k-mers from female
Done creating bloom filter
Generating a dictionary from kmers in kmers_from_male_reads
Traceback (most recent call last):
  File "discoverY.py", line 70, in <module>
    main()
  File "discoverY.py", line 65, in main
    classify_ctgs.classify_ctgs(k_size, bloom_filt, bf_capacity, female_kmers, mode)
  File "/scratch/24769731/DiscoverY/scripts/classify_ctgs.py", line 143, in classify_ctgs
    classify_fm_male_mode(kmer_size, female_kmers_bf)
  File "/scratch/24769731/DiscoverY/scripts/classify_ctgs.py", line 52, in classify_fm_male_mode
    kmer_abundance_dict_from_male = kmers.make_dict_from_kmer_abundance(reads_kmers, kmer_size)
  File "/scratch/24769731/DiscoverY/scripts/kmers.py", line 44, in make_dict_from_kmer_abundance
    kmer_dicts[line[:kmer_size]] = current_abundance
MemoryError

The input files end up quite large:

1.7G female.bloom
23G female_kmers
191G kmers_from_male_reads
1.2G male_contigs.fasta

But I have requested compute resources with 1TB of RAM, and the usage stats say that the job uses a maximum of 600GB.

Any help would be appreciated.
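
A back-of-envelope sketch of why a k-mer file of this size can exhaust memory once loaded into a Python dictionary. It assumes each line holds a k-mer, a separator and a short count, and that each dictionary entry costs roughly 150 bytes (key string, integer value and hash-table slot). Both numbers are assumptions for illustration, not measurements of DiscoverY.

```python
def estimate_dict_memory_gb(file_bytes, kmer_size, bytes_per_entry=150, count_chars=3):
    """Rough RAM estimate for loading a 'KMER count' text file into a dict.

    bytes_per_entry and count_chars are assumptions, not measured values.
    """
    # Each line: the k-mer, one separator, the count digits, one newline.
    bytes_per_line = kmer_size + 1 + count_chars + 1
    n_entries = file_bytes // bytes_per_line
    return n_entries, n_entries * bytes_per_entry / 1e9


# 191G is the size reported above for kmers_from_male_reads.
entries, gigabytes = estimate_dict_memory_gb(191 * 10**9, kmer_size=25)
print("~{:.2e} k-mers, ~{:.0f} GB as a dict".format(entries, gigabytes))
```

Under these assumptions the dictionary alone would hold several billion entries and need on the order of a terabyte, so a MemoryError near the node's limit is plausible. The estimate is only as good as the assumed per-entry cost.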

specifying bloom filter size/overflowerror

Hi there,

I am relatively new to python and trying to run discoverY.py in female+male mode using male_contigs.fasta, kmers_from_male_reads, and female reference assembly (female.fasta) files. I am running python 3.7.4, and all the dependencies are installed properly. I created the kmers_from_male_reads file using DSK as per the readme file, and the command I used to run discoverY.py is:

python discoverY.py --mode female+male --kmer_size 25

When I run this, I get this output:

Started DiscoverY
Mode female+male
Using default of k=25 and input folder='data'
Please set bloom filter size before running this program
Shortlisting Y-contigs
Need to make Bloom Filter of k-mers from female
Traceback (most recent call last):
  File "./discoverY.py", line 59, in <module>
    main()
  File "./discoverY.py", line 54, in main
    classify_ctgs.classify_ctgs(k_size, bloom_filt, female_kmers, mode)
  File "/lustre04/scratch/eocampbe/DiscoverY/scripts/classify_ctgs.py", line 142, in classify_ctgs
    female_kmers_bf = getbloomFilter(bf, fem_kmers, kmer_size)
  File "/lustre04/scratch/eocampbe/DiscoverY/scripts/classify_ctgs.py", line 20, in getbloomFilter
    female_kmers_bf = BloomFilter(bf_size, .001, bf_filename)
  File "src/pybloomfilter.pyx", line 87, in pybloomfilter.BloomFilter.__cinit__
OverflowError: value too large to convert to int

I'm finding it difficult to determine how I might fix this issue. For instance, is the line "Please set bloom filter size before running this program" the source of this error? I can't figure out how I would specify bloom filter size, as there appears to be no option to do so and I can't find any documentation about this in the readme file. Or, is this primarily a memory issue, indicated by the OverflowError? Any help you could give me would be much appreciated!
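
A sketch of how a Bloom filter capacity could be chosen from the number of female k-mers, using the textbook sizing formulas and the 0.001 error rate visible in the traceback. Whether the OverflowError comes from a capacity that does not fit in a C int is an assumption. The names bloom_parameters, count_lines and C_INT_MAX are hypothetical and not part of DiscoverY or pybloomfilter.

```python
import math


def bloom_parameters(n_items, error_rate=0.001):
    """Textbook Bloom filter sizing: bits needed and number of hash functions."""
    bits = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    hashes = round(-math.log(error_rate) / math.log(2))
    return bits, hashes


def count_lines(path):
    """Count lines in a k-mer text file, assuming one k-mer per line."""
    with open(path) as handle:
        return sum(1 for _ in handle)


C_INT_MAX = 2**31 - 1  # assumption: the capacity must fit in a signed 32-bit int

# Hypothetical k-mer count; for real data use count_lines("data/female_kmers").
n_kmers = 3000000000
bits, hashes = bloom_parameters(n_kmers)
print("{:.1f} GB filter, {} hashes, fits in C int: {}".format(
    bits / 8 / 1e9, hashes, n_kmers <= C_INT_MAX))
```

If the assumption holds, a capacity in the billions would trigger the error regardless of available RAM, which would make it a sizing problem and not a memory problem.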

Suffers from bizarre bug in BioPython

In rare cases, the fasta headers in the annotated output can lack one of the fields due to a seriously bizarre bug in BioPython's SeqIO.write() function.

This occurs if the sequence's length happens to be the same as the sequence's name. In this case the description DiscoverY generates, which starts with the length, is mis-interpreted inside SeqIO.write() as including the sequence name. And SeqIO.write() does you the 'favor' of removing that duplication.

This obviously can only happen if the contig names are numbers. Unfortunately for me, the output of whatever assembler created my contigs file does use numbers for names, and one of them happened to match the sequence length.

This is a problem because I was attempting to automatically convert the annotations into a table that I could process with other tools (e.g. R). But the table can't be parsed correctly due to the favor BioPython has done.

The only useful workaround I can see is that users should be warned (in the README) that their sequence names shouldn't be numbers.
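
Another possible workaround, sketched here in plain Python, is to write the FASTA headers directly so that the name and the description are always both emitted, and to flag contigs whose name equals their length. Both helpers are hypothetical, and the description text is a placeholder, not DiscoverY's actual annotation format.

```python
import io


def write_fasta_record(handle, name, description, sequence, width=60):
    """Write one FASTA record, always emitting both the name and the description."""
    handle.write(">{} {}\n".format(name, description))
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


def names_equal_to_length(records):
    """Return names of (name, sequence) pairs whose name equals the sequence length."""
    return [name for name, seq in records if name == str(len(seq))]


# A contig named "8" that is 8 bases long: the case described above.
records = [("8", "ACGTACGT"), ("contig_2", "ACGT")]
out = io.StringIO()
for name, seq in records:
    write_fasta_record(out, name, "{} example_annotation".format(len(seq)), seq)
print(out.getvalue(), end="")
print(names_equal_to_length(records))
```

Writing the header by hand avoids any de-duplication logic, at the cost of bypassing the library's other formatting checks.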

Dependency list is incomplete?

The list of dependencies currently shown in the readme seems to be missing a few.

Trying to run the example shown in the readme, I find I need pybloomfilter. (and I'm not certain, but I think there are two packages with that name -- the one that pip installs appears to be badly broken).

Then when I get to the classifier stage, I discover I need jupyter. And snooping in the notebook file, it looks like I will also need sklearn.
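
A small sketch for checking up front which modules are importable, so that a missing dependency surfaces before a long run instead of partway through. The module list is inferred from the issues on this page and is an assumption, not the project's authoritative dependency list. In particular, "notebook" is a guess at the import name behind jupyter.

```python
import importlib.util


def missing_modules(module_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names inferred from the issue text; the real list may differ.
REQUIRED = ["pybloomfilter", "Bio", "sklearn", "notebook"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

This only confirms that a module with that name can be found. It cannot tell a working pybloomfilter build from a broken package that happens to share the name.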

DSK abundance limit disagrees with manuscript

The methods section of the DiscoverY manuscript says that kmers occurring fewer than three times were filtered out. But the DSK scripts included in this repository set the abundance limit to zero -- no filtering.

The README ought to mention kmer filtering and advise the user how to accomplish that.
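
Until the README covers this, the filtering could be applied after the fact to the k-mer text file. The sketch below assumes each line has the form "KMER count" separated by whitespace. The threshold of 3 follows the manuscript description quoted above, and filter_by_abundance is a hypothetical helper.

```python
def filter_by_abundance(lines, min_abundance=3):
    """Yield 'KMER count' lines whose count is at least min_abundance."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        if int(fields[1]) >= min_abundance:
            yield line.rstrip("\n")


example = ["ACGTA 5\n", "TTTTT 2\n", "GGGGG 3\n"]
print(list(filter_by_abundance(example)))
```

For a real file, the same generator can be fed an open file handle and its output written line by line to a new file, so the whole k-mer set never has to sit in memory.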

question about memory requirements and determining appropriate kmer size

Hi there,

I have completed an initial analysis using DiscoverY in the female+male mode, and I am wondering how I might determine whether the kmer size I used is optimal for the data I have. For the analysis I've done so far, I used the default size of 25, but I understand that this may need to be adjusted based on the specific characteristics of the genome I'm working with. I have plotted the results of my analysis (attached), and there seem to be a large number of kmers with very low similarity to the female genome, which is of quite good quality, but high depth. The organism we're working on has a neo sex chromosome system, so I suspect the Y regions are clustering in with the X regions on the bottom right of the graph (confirming this was actually my reason for using DiscoverY), however I'm less sure about why there might be so many male contigs that have very low similarity to the female, but rather high coverage. I don't know if this is a result of my kmer parameter or something else, but I'm hoping you might be able to offer some advice.

In addition, this analysis required about 720 GB of RAM, which is about double what is estimated in the paper and is almost the maximum amount of RAM I'm allowed to ask for per node of the cluster I'm using. Can DiscoverY run in parallel so that I can spread this memory out across multiple nodes? I don't see anything in the documentation or the paper that mentions this, but it would be very helpful for subsequent analyses.

Thanks,
Erin
graph.pdf

DSK script creates different filename than classify_ctgs wants

The result of run_dsk_Linux.sh is a file named kmers_from_reads.

In classify_ctgs.py, classify_fm_male_mode() needs this file, but expects it to be named kmers_from_male_reads.

I guess there's nothing wrong with that per se, but it does seem odd. Moreover, the user who fails to realize that can lose 12+ hours while the female bloom filter is being built, before discoverY discovers that kmers_from_male_reads doesn't exist.
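
A sketch of a pre-flight check that would catch the mismatch before the Bloom filter is built. The file names come from the issues on this page. The missing_inputs helper and the exact set of required files are assumptions.

```python
import os
import tempfile


def missing_inputs(folder, filenames):
    """Return the expected input files that are absent from folder."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(folder, name))]


# File names taken from the issues on this page; the helper itself is hypothetical.
EXPECTED = ["male_contigs.fasta", "kmers_from_male_reads", "female_kmers"]

with tempfile.TemporaryDirectory() as folder:
    # Simulate the mismatch: DSK wrote kmers_from_reads, not kmers_from_male_reads.
    for name in ["male_contigs.fasta", "kmers_from_reads", "female_kmers"]:
        open(os.path.join(folder, name), "w").close()
    demo_missing = missing_inputs(folder, EXPECTED)
    print(demo_missing)
```

Running such a check at startup turns a 12-hour wait into an immediate error message.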

Better description of how to run the tool would be helpful

The current readme doesn't clearly describe how the user can use the tool to solve the problem it is intended to solve. If I have an assembly, and I want to identify Y-specific contigs, how do I do that?

My best guess, from trying to run the example in the repo, is that the info about which contigs are Y-specific is encoded in the headers of proportion_annotated_contigs.fastq. But this information is not described in the readme. Nor is any step mentioned that will separate Y contigs from the input contigs.

Note that that conclusion is based on the fact that, for me, the output of discovery.py (proportion_annotated_contigs.fastq) is identical to the input (data/male_contigs.fasta), except that annotation has been added.

The command I ran was the one shown as "a typical run":
python discoverY.py --female_bloom --mode female+male
But it is not clear whether this is the appropriate command to run for the example. Based on the files provided, and after digging through the code to see which options would cause all the provided files to be used, that was the command I came up with. This would be made clearer by having a "tutorial" section in the readme that showed the command to be run.

It would also be helpful to provide, as part of the example, the expected output. As it stands, I don't know whether my run of discovery.py worked. It's possible that it is not working and that this has fed into my misunderstanding of how it is supposed to be used.

It's also possible that I don't understand what the example is intended to demonstrate.

The discussion of 'best mode' and the jupyter notebook stuff should clarify whether this step is intended as part of the typical usage pipeline or not. After having a lot of difficulty with the notebook, looking at it in more detail, and realizing that it doesn't read the output from discovery.py, my best guess is that this is a pre-computing step, to be run before discovery.py, to guide the choice of threshold. However (assuming that is true), there's nothing that indicates how the resulting threshold would be used.

To recap, as the tool is currently described, quite a bit of insight, digging, and guesswork is required on the part of the user.

DiscoverY killed

Hi, I am trying to run DiscoverY.

I followed the manual and ran it this way:

python discoverY.py --female_bloom --mode female+male

with the data folder looking like this:

ls data/
female.fasta female_kmers kmers_from_male_reads male_contigs.fasta

I get the following output from the program:

Started DiscoverY
Mode female+male
Using default of k=25 and input folder='data'
Shortlisting Y-contigs
Opening Bloom Filter of k-mers from female
Done
Generating a dictionary from kmers in kmers_from_male_reads
Killed

The female.bloom file gets created, along with an empty file called proportion_annotated_contigs.fasta in the directory above the data directory.

I suspected a memory issue, but that does not seem to be the case. I am running on a Unix cluster using a virtual environment.

Any help would be great.

Thanks,
Astrid
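
A bare "Killed" with no Python traceback usually means the process received SIGKILL. On Linux clusters that is often the kernel's out-of-memory killer or the scheduler enforcing a memory limit, so memory may still be worth ruling out. The sketch below shows one way to log peak memory from inside a Python process on Unix. The peak_rss_mb helper is hypothetical, and the dictionary is only a stand-in for the k-mer dictionary being built.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident memory of this process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


kmers = {str(i): i for i in range(100000)}  # stand-in for a growing k-mer dictionary
print("peak memory so far: {:.1f} MB".format(peak_rss_mb()))
```

Printing this value periodically while the dictionary is filled would show whether memory climbs toward the node or job limit just before the process is killed.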

Proportion reported by classify_ctgs.py described incorrectly in README?

The README says "the proportion shared between each contig with a female reference is computed."

Maybe I am wrong about the rest of this, but it seems like that contradicts what the code does.

In both classify_fm_male_mode() and classify_fm_mode() it looks like what is reported is (C-F) / C, where C is the number of kmers in the contig (with duplicates counted as often as they appear and all-N kmers not counted) and F is the number of kmers in the contig and also in the female reference.

So a proportion reported as 1.0 would mean none of the contig's kmers were found in the female reference. So that would be evidence that the contig is from something not found in female — presumably male specific.

A proportion reported as 0.0 would mean all of the contig's kmers were found in the female reference. Evidence that the contig is not male specific.
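
The reading above can be sketched as follows. A plain Python set stands in for the female Bloom filter. The function name is hypothetical, and reverse complements are not handled here because the description above does not mention them.

```python
def unshared_proportion(contig, female_kmers, k):
    """Return (C - F) / C for one contig, following the reading described above."""
    total = 0   # C: k-mers in the contig, duplicates counted each time
    shared = 0  # F: those k-mers that are also in the female set
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        if set(kmer) == {"N"}:
            continue  # all-N k-mers are not counted
        total += 1
        if kmer in female_kmers:
            shared += 1
    # Returning 0.0 for a contig with no countable k-mers is an arbitrary choice.
    return (total - shared) / total if total else 0.0


female = {"ACG", "CGT"}  # a plain set stands in for the Bloom filter
print(unshared_proportion("ACGTAC", female, 3))
```

In this toy case the contig has four 3-mers, two of which are in the female set, so the reported value is the fraction not shared with the female reference.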

depth of coverage

Hi,

I am wondering where in the output the depth of coverage is written?

Thanks