Code Monkey home page Code Monkey logo

smudgeplot's Introduction

Smudgeplot

This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.

Smudgeplots are computed from raw or even better from trimmed reads and show the haplotype structure using heterozygous kmer pairs. For example:

smudgeexample

Every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy.

This tool is planned to be a part of GenomeScope in the near future.

Installation

You will need a program for counting kmers such as KMC installed, and you should definitely also run GenomeScope (a classical kmer spectra analysis). It's not rare that both GenomeScope and Smudgeplot are needed to make sense out of the sequencing data. We recommend using tbenavi1/KMC. This version of KMC has an additional smudge_pairs program which finds heterozygous kmer pairs more quickly and using less memory than the python version of hetkmers.

The easiest way to install smudgeplot is certainly using conda, where smudgeplot package is available. If you have conda installed, you can get smudgeplot simply by

conda install -c bioconda smudgeplot

For more detailed installation instructions, see our installation wiki page. If the installation procedure does not work, if you encounter any other problem, or if you would like to get help with the interpretation of your smudgeplot, please open an issue.

Usage

The input is a set of whole genome sequencing reads, the more coverage the better. The method is designed to process big datasets, don't hesitate to pull all single-end/pair-end libraries together.

The workflow is automatic, but it's not fool-proof. It requires some decisions. The usage is shown here using tbenavi1/KMC and GenomeScope. We also provide a real data tutorial on our wiki: strawberry genome analysis. If you are interested in running Jellyfish instead of KMC, look at the manual of smudgeplot with Jellyfish.

Give KMC all the files with trimmed reads to calculate kmer frequencies and then generate a histogram of kmers:

mkdir tmp
ls *.fastq.gz > FILES
# kmer 21, 16 threads, 64G of memory, counting kmer coverages between 1 and 10000x
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmcdb tmp
kmc_tools transform kmcdb histogram kmcdb_k21.hist -cx10000

where -k is the kmer length, -m is the approximate amount of RAM to use in GB (1 to 1024), -ci<value> excludes kmers occurring less than <value> times, -cs is the maximum value of a counter, FILES is a file name with a list of input files, kmcdb is the output file name prefix for the KMC database, tmp is a temporary directory, and -cx<value> is the maximum value of counter to be stored in the histogram file.

The next step is to extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using command smudgeplot.py cutoff <kmcdb_k21.hist> <L/U>.

L=$(smudgeplot.py cutoff kmcdb_k21.hist L)
U=$(smudgeplot.py cutoff kmcdb_k21.hist U)
echo $L $U # these need to be sane values
# L should be like 20 - 200
# U should be like 500 - 3000

Then, extract kmers in the coverage range from L to U using kmc_tools. Then run smudge_pairs on the reduced file to compute the set of kmer pairs.

smudge_pairs is available at tbenavi1/KMC. Specifically, after compiling this version of KMC, smudge_pairs will be in the bin directory.

kmc_tools transform kmcdb -ci"$L" -cx"$U" reduce kmcdb_L"$L"_U"$U"
smudge_pairs kmcdb_L"$L"_U"$U" kmcdb_L"$L"_U"$U"_coverages.tsv kmcdb_L"$L"_U"$U"_pairs.tsv > kmcdb_L"$L"_U"$U"_familysizes.tsv

Alternatively, if you don't have tbenavi1/KMC installed, you can extract kmers in the coverage range from L to U using kmc_dump. Then run smudgeplot.py hetkmers on the dump of kmers the compute the set of kmer pairs.

kmc_tools transform kmcdb -ci"$L" -cx"$U" dump -s kmcdb_L"$L"_U"$U".dump
smudgeplot.py hetkmers -o kmcdb_L"$L"_U"$U" < kmcdb_L"$L"_U"$U".dump

Now you can finally generate the smudgeplot using the coverages of the identified kmer pairs (*_coverages.tsv file). You can either supply the haploid kmer coverage (reported by GenomeScope) or let it be estimated directly from the data. Something like this

smudgeplot.py plot kmcdb_L"$L"_U"$U"_coverages.tsv

will generate a basic smudgeplot. To see all the parameters of smudgeplot.py plot you can run smudgeplot.py plot --help.

Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates approximate kmer pairs per tile densities. Note that a single polymorphism generates multiple heterozygous kmers. As such, the reported numbers do not directly correspond to the number of variants. Instead, the actual number is approximately 1/k times the reported numbers, where k is the kmer size (in summary already recalculated). It's important to note that this process does not exhaustively attempt to find all of the heterozygous kmers from the genome. Instead, only a sufficient sample is obtained in order to identify relative genome structure. You can also report the minimal number of loci that are heterozygous if the inference is correct.

GenomeScope for estimation of L/U

You can feed the kmer coverage histogram to GenomeScope. (Either run the genomescope script or use the web server)

Rscript genomescope.R kmcdb_k21.hist <k-mer_length> <read_length> <output_dir> [kmer_max] [verbose]

This script estimates the size, heterozygosity, and repetitive fraction of the genome. By inspecting the fitted model you can determine the location of the smallest peak after the error tail. Then, you can decide the low end cutoff below which all kmers will be discarded as errors (cca 0.5 times the haploid kmer coverage), and the high end cutoff above which all kmers will be discarded (cca 8.5 times the haploid kmer coverage).

Frequently Asked Questions

Are collected on our wiki. Smudgeplot does not demand much on computational resources, but make sure you check memory requirements before you extract kmer pairs (hetkmers task). If you don't find an answer for your question in FAQ, open an issue or drop us an email.

Check projects to see how the development goes.

Contributions

This is definitely an open project, contributions are welcome. You can check some of the ideas for the future in projects and in the development dev branch.

The file playground/DEVELOPMENT.md contains some development notes. The directory playground contains some snippets, attempts, and other items of interest.

Reference

Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11, 1432 (2020). https://doi.org/10.1038/s41467-020-14998-3

Acknowledgements

This blogpost by Myles Harrison has largely inspired the visual output of smudgeplots. The colourblind friendly colour theme was suggested by @ggrimes. Grateful for helpful comments of beta testers and pre-release chatters!

smudgeplot's People

Contributors

adamtaranto avatar dependabot[bot] avatar hyphaltip avatar kamilsjaron avatar tbenavi1 avatar zhenzhenyang-psu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

smudgeplot's Issues

Bad examples of smudgeplot

A collection of smudgeplots that went wrong... (a material for a wiki page showing potential pitfalls)

Types of problems:

  • a problem of input data
  • a problem of input parameters
  • a problem of smudge annotation

memory issues with `hetkmers.py`

I run hetkmers.py on 1.3G genome, cca 30x.

The script allocated 750Gbp of memory at the point of generating the graph G. I killed the process in this moment.

The graph is very efficient implementation in terms on number of lines, but it does not scale well for bigger genomes. The computation was long, but not super long (2 days for the calculation of pairs distant exactly 1). Speeding up would be useful, but not necessary, the memory is issue here.

Any ideas @tbenavi1 ?

Incorrect ploidy inference

About your genome

The genome is Meloidogyne Javanica. It should be tetraploid, but smudgeplot is estimating diploid.

smudgeplot

Please see the attached smudgeplot:
MJAV_SmudgePlot_smudgeplot_log10

Make sure that the R version and python have the same functionality

Stuff missing:

  • consideration of triploid smudges in the draft 1n estimate (the whole smudge fitting strategy should be reevaluated, maybe I should add a parameter, something like: method = "brightest smuge", "diploid", "triploid")
  • filtering kmer pairs around f(cov) = L / cov (the error kmer line)
  • summary reports (logging?)

Smudgeplot interpretation Plukenetia volubilis

Hi Kamil, I have troubles understanding my smudgeplot. I have used the following command to generate it

ls $folder/reads/*.fastq > $folder/FILES
srun kmc -k21 -t$threads -m64 -ci1 -cs10000 @FILES kmer_counts tmp
srun kmc_tools -t$threads transform kmer_counts -cx10000 histogram kmer_k21.hist

L=$(kmer_cov_cutoff.R kmer_k21.hist L)
U=$(kmer_cov_cutoff.R kmer_k21.hist U)
echo $L $U # these need to be sane values like 30 800 or so
srun kmc_tools -t$threads transform kmer_counts -ci$L -cx$U dump -s kmer_k21.dump
srun hetkmers.py -k 21 -t $threads -o kmer_pairs < kmer_k21.dump
srun smudgeplot.R -i kmer_pairs_coverages_2.tsv -o sacha_ploidy -t "Plukenetia volubilis"

and it look like this:

sacha_ploidy_smudgeplot

sacha_ploidy_smudgeplot_log10

Checking L and U, they were set to 10 and 4000 respectively.

Now, I have an indication from the literature that perennial plants in this family are hexaploid. So I am a little bit surprised that most of the k-mers are in the triploid smudge. how strong this suggests it is a triploid?

Also, my understanding is that triploid organisms cannot reproduce sexually due to problems in chromosomes segregation during meiosis (here my basics could go really wrong), which contradicts the fact that some lab-mates have successfully done sexual crosses between several individuals.

Do you think it is possible that, if it is an homopolyploid, the chromosomes are so similar that it is not possible to differentiate the k-mers and, as a result, they got mixed in the same smudge?
In other words, could that triploid be actually a hexaploid with really similar chromosomes? I would like to know if I could be sure this result discards the literature hypothesis, or if it is still possible

How should I understand my smudgeplot?

Thanks in advance

Plukenetia volubilis attacks again: Allopolyploid or Autopolyploid?

Hi again Kamil,

I noted that the preprint of smudgeplot is finally released! And for that, I congratulate to you and the other authors. I am sure you are really proud of this amazing job.

From the preprint, one section caught my attention: Allotetraploid vs Autotetraploid. There, you discussed (if I understood correctly) that based on the proportion of kmers AAAB vs AABB (generated by genomescope2) you can infer if you are dealing with an autopolyploid or an allopolyploid.

You can imagine I could not avoid wondering which would be the case for my plant, Plukenetia volubilis, which we previously analyzed together in #36 and determined it was a tetraploid.

I relaunched genomescope2 with the same histogram we used in #36, and here you can see the result. We can see that the proportion of AAAB is almost inexistent and the proportion of AABB is really high. Hence, following the preprint logic, we would say it is an allotetraploid, right?

However, I noticed that the proportion between these two is different from the ones we see in the smudgeplot. In fact, in my particular case, there are no AAAB kmers in the summary provided by smudgeplot. I know that smudgeplot deals with heterozygous kmers, but I wonder if there is a relation between the proportions of kmers of both programs and if you can infer the Allo vs Auto feature from smudgeplots as well.

Beforehand, thanks for your comments, and I am really happy that we have something to cite and give the corresponding credits.

PS: In case the genomescope result doesn't load (I am trying but it does not for some reason, hopefully it will for you), I will append the model results here:

GenomeScope version 2.0
input file = user_uploads/XU4ieWwELZjptFnWUpUe
output directory = user_data/XU4ieWwELZjptFnWUpUe
p = 4
k = 21

property                      min               max               
Homozygous (aaaa)             67.8924%          95.0473%          
Heterozygous (not aaaa)       4.95271%          32.1076%          
aaab                          0%                6.61037%          
aabb                          4.90424%          20.3037%          
aabc                          0.0484783%        2.8474%           
abcd                          0%                2.34608%          
Genome Haploid Length         250,542,571 bp    251,341,685 bp    
Genome Repeat Length          87,932,927 bp     88,213,392 bp     
Genome Unique Length          162,609,643 bp    163,128,293 bp    
Model Fit                     69.9239%          90.834%           
Read Error Rate               0.0690489%        0.0690489%    

My smugeplot looks strange

I have deep sequencing of a single individual of my really important species. I thought it was a diploid but this is was I got

mflo2_smudgeplot__smudgeplot

It seems that there is greater heat in AAB peak. How should I understand it? Does it mean that my species is triploid? Can smudgeplot be wrong?

The assembly contains only two copies of expected single copy genes. Who should I trust, the assembly or smudgeplot?

summary report

  • estimate 1n coverage from the data
  • quantify relative proportions of the smudges (heterozygosity structure of a genome)
  • generate summary report (something nicely readable)

problem with running `smudgeplot hetkmers` with `--all` flag

When I was packaging smudgeplot into a single environment, I generated a problem with multiprocessing library in python. I have no time to look at it in a great detail, but I don't suppose it will be a huge problem.

The error log is attached below. I think it has something to do with packaging. Something like missing import in the main script or something like that. I am writing this issue so I can not to take care about it right away.

Related SE post; link to multiprocessing manual.

INFO:root:Running smudgeplot v0.1.3 beta3-development
INFO:root:Task: hetkmers
INFO:root:Kmers and coverages loaded. Initial kmer halves processed. Starting processes.
Traceback (most recent call last):
  File "/home/kjaron/bin/smudgeplot", line 4, in <module>
    __import__('pkg_resources').run_script('smudgeplot-KamilSJaron===0.1.13devN', 'smudgeplot')
  File "/software/lib/python3.5/site-packages/pkg_resources/__init__.py", line 746, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/software/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1508, in run_script
    exec(script_code, namespace, namespace)
  File "/home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 99, in <module>
  File "/home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 90, in main
  File "/home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 21, in hetkmers
  File "/home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/smudgeplot/hetkmers.py", line 170, in all_one_away

  File "/software/lib64/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/software/lib64/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/software/lib64/python3.5/multiprocessing/context.py", line 274, in _Popen
    return Popen(process_obj)
  File "/software/lib64/python3.5/multiprocessing/popen_spawn_posix.py", line 33, in __init__
    super().__init__(process_obj)
  File "/software/lib64/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/software/lib64/python3.5/multiprocessing/popen_spawn_posix.py", line 43, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/software/lib64/python3.5/multiprocessing/spawn.py", line 173, in get_preparation_data
    main_mod_name = getattr(main_module.__spec__, "name", None)
AttributeError: module '__main__' has no attribute '__spec__'

explore other options how to identify heterozygous kmers

The smudgeplot is now limited for heterozygosity only to certain extend (we extract kemrs that are 1SNP from each other). If the divergence of haplotypes is too great, we are unable to /identify any (the case of Rotifer).

We might want to explore a way how to identify heterozygous kmers with indels, or of greater divergence than 1 SNP (parameter?).

I wonder if blast is a possibility or not. One could define a similarity treshold and uniqueness threshold (if simularity would be from 0 - 100, you could stay that you accept all pairs with score >90 where the third closest kmers is < 50 or something like that). Or some sort of blast-like approach that would be bit more sloppy on the alignment part (Minimap?).

How do you like the interface of `dev` or `R_finalization` branches?

Dear Users, would you mind to share your experience with the user interface of smudgeplot?

In the versions of smudgeplot up to 0.1.3 we used three functions that were exposed to user:

smudgeplot.R
hetkmers.py
kmer_cov_cutoff.R

This was bit clumsy as a user was asked to remember more than one command and every change in the computational part was projected into the interface.

Branches dev and R_finalization* contain (the same) new interface written in python. The scripts before are replaced with a single script called smudgeplot that takes <task> as the first argument.

smudgeplot <task> [parameters] <input_file>

Tasks are called cutoff, hetkmers and plot and are properly documented in the help pages of the command.

*R_finalization is the branch where the code is finalized for the first publication of the visualization technique (that will be v0.2.0). dev is a branch with rewritten backend in python, it does not work that well as the R version that was polished thanks for many users and their bug reports. The big merge of the dev branch is expected by the end of this year (yes, it will take a while, it will probably be v0.3.0).

Why this?

For a while now we are working on the further development of smudgeplots. The changes are not ready for a wide audience yet and they won't be in the near near future. However, if we switch to the new interface, many of these changes can be integrated without any change of the interface.

For example, right now the plotting is done in R. Once we completely switch to python, this interface won't change, only the backend, therefore users won't have to learn any new commands or whatsoever. They will only need to update their version of smudgeplot. Another example could be if we decide to turn hetkmers algorithm to C or C++ to gain some speed. The backend of smudgeplot will change, but the user will again be exposed to exactly the same interface. Check Projects if you are curious what are we up to...

Finally, this interface allows the very easy extension. We could easily add more <task> (for example map for mapping kmers to the genome, that is already available in the development dev branch).

Your opinion matters

I see that we get quite a lot of traffic here, I am very happy to see that people are interested in smudgeplots and it motivates me to continue to develop smudgeplots further. I even see many people looking at the dev branch which is great!

However, I do not have that much feedback, especially from dev diggers. It would be super valuable for me to know what do you think. Especially about this interface change I am planning!

Let me know what do you think!!!

a big potential speed-up

a comment by Richard Durbin; an idea by Haynes Heaton: A search for kmer pairs with a SNP on a fixed position could be sufficient for genomic representation. For genomes with low heterozygosity the kmer pairs will be exactly the same (because probability of two SNPs being close to each other is small) and for the highly heterozygous genomes we should keep the option to do the full search.

The naive speed up is 20x. However I think there is more to it. I wonder if we could do some sort of kmer sorting trick to avoid loading all the kmers to memory.

misslabeled smudges

About your genome

It's a diploid strawberry, the one that is in the tutorial.

smudgeplot

I used the dev version of smudgeplot with following command

smudgeplot plot -o f_iinumae_2 -t "Fragaria\ iinumae" -nbins 30 -n 144 kmer_pairs_coverages.tsv

and the smudgeplots looks like... (the second one is without specification of expected coverage and number of bins)

f_iinumae_2_smudgeplot_pythonic

f_iinumae_smudgeplot_pythonic

-> it's these AB smudges that make the program think that two smudges are detected at the same place and then reducing the number of bins to ridiculous numbers. It's needed to figure out what is wrong with the python implementation of the fitting algorithm. The R implementation certainly does this fit well.

error in execution of the actual smudge rscript

Hi Kamil,

I am getting this:

running Smudgeplot v0.1.0

######################
## INPUT PROCESSING ##
######################
Removing        76026   kmer pairs with coverage higher than    339     (0.99 quantile)

#############
## SUMMARY ##
#############
Error in paste(sum(to_filter), "peaks of kmer pairs detected with coverage < (1n_coverage * 2) =",  :
  object 'n' not found
Calls: smudge_warn -> write -> cat -> paste -> paste
Execution halted

What does it mean?

make complete examples

We need some pilot examples that will go from raw reads to smudgeplot. We should do both those that are big and complicated as well as those that are small and simple, I htink three examples will be perfect.

Good candidates for demonstration of power:

Warning output: report cases when L < 1/2 * 1n; etc

I figured out a source of missleading information.

If the genome 1n coverage is smaller than L, than all the heterozygosity would be filtered out and we would again look at paralogs. If L is more than twice smaller than estimated 1n we can be quite sure that if there is any heterozygosity, we are picking it up.

I should create a warning file that would report potential problems like the one I have described since it's very easy to pick it up from the data.

it could be also nice if we could read GenomeScope model and report disagreements (like in the case of completely homozygous genomes etc.)

smudgeplot installation error

Rscript install.R
Error in library(devtools) : there is no package called ‘devtools’
Execution halted

Is devtools required?

erradicate print from the code

As it was pointed in #38 I messed up logging and printing of results.

I should go over the execution script and adjust all the logging to be printed on stderr instead of stdout.

The question is if we should do a hotfix release or not, because our tutorials do not work right now (L=$(smudgeplot cutoff hist) is not working in the last release; gosh we desperately need travis!)

smudgeplot.py hetkmers keeps on failing

Hi there:

I trying to run "smudgeplot" on kmc-generated kmers (22) from pacbio data of a 3.6g genome.

Here are my commands:

smudgeplot.py hetkmers -o kmer_pairs kmer_k22_ci30cx1500.dump
smudgeplot.py plot -o S_purpurea -t "S. purpurea" -q 0.99 kmer_pairs_coverages.tsv

after an hour of running, the job gets killed and I get this message:

/var/spool/torque/mom_priv/jobs/1529182.sapelo2.SC: line 20: 47649 Killed                  smudgeplot.py hetkmers -o kmer_pairs kmer_k22_ci30cx1500.dump

I am running it on a single node with 850gb of RAM allocation.

The dump file is ~ 33Gb and has over 1.3b lines. by multiply the number of kmers in the dump file by 350, roughly about 600Gb are needed for the job. But it still fails with 850Gb.

Any suggestions or guidance are highly appreciated.

Many thanks,
Magdy

Preferred citation/DOI request

Hi,

Thanks for the wonderful tool, it's been very useful for my data and I am preparing this data to be submitted as a manuscript soon. I was wondering if you had a preferred citation since smudgeplot is not yet integrated with genomescope or published.

Perhaps you could assign it a DOI with zenodo?

Thanks again!

Possible silent stalling

I am trying to run smudgeplot hetkmers on a server with the following command:

smudgeplot hetkmers -o MELC-2E11_R1_trim30_KMC21_10-950_pairs < MELC-2E11_R1_trim30_KMC21_10-950.txt

The output I have so far is:

Running smudgeplot v0.2.0
Task: hetkmers
Kmers and coverages loaded.

This showed up relatively quickly upon executing the command, but that was four days ago. There have been no new files generated in the working directory, and there has been no output to the terminal suggesting any continued operation. Is it likely there has been a silent error? Would you expect any output showing continued progress or is it expected to be silent until completion? If it is likely to be an error, do you have any thought on what that could be?

Thanks,
Michael

Ploidy on mouse genome

Hi!

First of all, thanks for your work with smudgeplot. Great idea!

I wanted to try smudgeplot on some WGS data of murine cancer cells, and thought I'd start with a normal control sample taken from the mouse tail. So, the sample should definitely be diploid. It's WGS, and the number of k-mers was pretty large, so I decided to run every chromosome by itself to stop smudgeplot hetkmers from running into memory troubles.

Now, the problem is, though we expect polyploidy in the cancer celllines, the control should be diploid. The smudgeplots of the various chromosomes show varying ploidies from di- to tetra- or even higher ploidies.

The L and U cutoffs that were suggested using smudgeplot cutoff are also pretty low compared to what you recommend in the Readme (L≈12, U≈420).

I attached smugeplot and genomescope plot for chr1.
Do you have any suggestions? Is this kind of input data feasible for smudgeplot? Should I tweak the cutoffs around?

Best,
Jan

plot
511950_control_1_smudgeplot
511950_control_1_smudgeplot_log10

explore potential of extracting kmers from smuges

Smudges contain useful info - kmers that are in known count in the genome.

It should be possible to extract kmer pairs of the smudges and map them to genome assembly to infer rates of separately assembled haplotypes and collapsed duplications, triplications, etc.

To do this one needs to know:

  • ploidy of the organism (possible to infer from smugeplot)
  • type of the assembly (haploid assembly - collapsed haplotypes, or n-ploid assembly with haplotypes assembled separately)

The steps to be done:

  1. identify conditions when smudges are separable (if they are too close to each other it might be sneaky)
  2. Collapse neibouring kmers? Up to 21 kmer pairs represent one heterozygous loci. It should be possible to identify them a merge them in heterozygous variants (so we will end up with 21x less of 2*k - 1 long strings of nucleotides instead of string of length k). This will both make the search faster and more reliable (because some of the kmers might have not been picked up and therefore less kmers would represent the variant. We don't want to count kmers, but variants). This could be potetinally done BEFORE smudgeplots
  3. Identify variants in genome, annotate genomic lcoations that are collapses of duplications or separatelly assembled variants.
  4. Generate a summary of it

add info to README

add info about

  • nbins scaling if the plot is messy
  • coverage needed and desired (throw everything you got)

Also, think of more comprehensive description of the method. Maybe in wiki? Webpage? Really long readme?

Error while installing smudgeplot

Dear,

I'm getting a strange ERROR message while installing smudgeplot that I cannot directly place.
Do I perhaps have an incorrect version of devtools?

You are a coding rockstar!
Warning message:
`encoding` is deprecated; all files now assumed to be UTF-8 
Installing package into ‘/home/nicky/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
Warning: invalid package ‘../smudgeplot’
Error: ERROR: no packages specified
Warning message:
In install.packages("../smudgeplot", repos = NULL, type = "source") :
  installation of package ‘../smudgeplot’ had non-zero exit status

error in saving kmers when decreasing number of bins

When I run

smudgeplot plot kmer_pairs_coverages.tsv -o Tps_middle_pe_raw_reads -t "Timema poppensis" -kmer_file kmer_pairs_sequences.tsv

I get the following error. Needs to be fixed

INFO:root:Running smudgeplot v0.1.3 beta3-development
INFO:root:Task: plot
INFO:root:loading kmers
INFO:root:calculating initial 1n coverage estimate
INFO:root:initial 1n coverage estimate 23.62
INFO:root:coverage range for smudgeplot: 31 - 236
INFO:root:calculating hist with nbins: 40
WARNING:root:detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 35
INFO:root:calculating hist with nbins: 35
WARNING:root:detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 30
INFO:root:calculating hist with nbins: 30
WARNING:root:detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 25
INFO:root:calculating hist with nbins: 25
INFO:root:final 1n coverage estimate 24.00
INFO:root:saving kmers in smudges Tps_middle_pe_raw_reads_kmers_in_smudge_*.txt
Traceback (most recent call last):
  File "/home/kjaron/bin/smudgeplot", line 4, in <module>
    __import__('pkg_resources').run_script('smudgeplot-KamilSJaron===0.1.13devN', 'smudgeplot')
  File "/software/lib/python3.5/site-packages/pkg_resources/__init__.py", line 746, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/software/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1501, in run_script
    exec(code, namespace, namespace)
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 104, in <module>
    main()
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 98, in main
    plot(parser.arguments)
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 69, in plot
    smudge.saveKmersFromSmudges()
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/smudgeplot/smudgedata.py", line 211, in saveKmersFromSmudges
    assigned_smudge = coor_smudge_dict[(y_coord, x_coord)]
KeyError: (25, 20)

Error in density.default

Hi there,

Just trying this out using the KMC instructions - looks like a great program. I'm getting the following error when running smudgeplot.R:

running Smudgeplot v0.1.3

######################

INPUT PROCESSING

######################
Error in density.default(.total_pair_cov[.subset], adjust = .adjust) :
need at least 2 points to select a bandwidth automatically
Calls: estimate_1n_coverage_1d_subsets -> get_1d_peaks -> density -> density.default
Execution halted

Any Idea why?

Ricinus communis ploidy

Hi, I have troubles understanding my smudgeplots.

I run an initial smudgeplot with ~14 Gbp data from SRA and as the results weren't the expected I downloaded more data (from the same Project) and rerun with ~139 Gbp. I passed them for trimmomatic before doing the kmer counting with KMC with almost the same code you put in the main README.

I have used following command to generate it

ls $path/Reads/Trimmed/*.fastq > FILES
srun kmc -k21 -t24 -m64 -ci1 -cs10000 @FILES kmer_counts tmp
echo "kmc db successful"
srun kmc_tools -t24 transform kmer_counts -cx10000 histogram kmer_k21.hist
echo "kmc histogram successful"

L=$(kmer_cov_cutoff.R kmer_k21.hist L)
U=$(kmer_cov_cutoff.R kmer_k21.hist U)
echo $L $U # these need to be sane values like 30 800 or so
srun kmc_tools -t24 transform kmer_counts -ci$L -cx$U dump -s kmer_k21.dump
srun hetkmers.py -k 21 -t 24 -o kmer_pairs < kmer_k21.dump
srun smudgeplot.R -i kmer_pairs_coverages_2.tsv -o ricinus_ploidy -t "Ricinus communis"

Note: the srun are there for execution in cluster, which uses Slurm.

And the results look like this:

low_coverage_output
high_coverage_output

Now, in the genome assembly paper, the authors got a genome size of ~350 Mbp and by other genome analyses, they strongly suspect it is a hexaploid. This does not make sense together with the smudge because it predicts triploid and diploid for low and high coverage respectively. However, I suspect the results are not clear because smudges are not separating well in the first place.

What do you think is happening here?
Should I rerun this analysis with some parameters modifications?
How should I understand these results?

Thank you for your help

-q parameter does not work in the new python API

Traceback (most recent call last):
  File "/home/kjaron/bin/smudgeplot", line 4, in <module>
    __import__('pkg_resources').run_script('smudgeplot-KamilSJaron===0.1.13devN', 'smudgeplot')
  File "/software/lib/python3.5/site-packages/pkg_resources/__init__.py", line 746, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/software/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1501, in run_script
    exec(code, namespace, namespace)
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 104, in <module>
    main()
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 98, in main
    plot(parser.arguments)
  File "/Home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/EGG-INFO/scripts/smudgeplot", line 32, in plot
    smudge.quantile_filt()
  File "/home/kjaron/.local/lib/python3.5/site-packages/smudgeplot_KamilSJaron-0.1.13devN-py3.5.egg/smudgeplot/smudgedata.py", line 63, in quantile_filt
    if self.args.q < 1:
TypeError: unorderable types: str() < int()

automatic scaling nbins

make automatic scaling of nbins by the number of detected kmer pairs.

Something like

< 50,000 -> nbins = 20
else < 200,000 -> nbins = 30
else -> nbins = 40

smudgeplot cutoff

Hi!

First at all, thanks for this great idea.

I have trying to run the smudgeplot protocol, but when I launch "smudgeplot.py cutoff" with different boundaries from 1 to 10000, I always get the same cutoff. This is my code:

kmc -k21 -t20 -m64 -ci1 -cs10000 -fq @list.txt kmer_counts tmp
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
smudgeplot.py cutoff kmer_k21.hist 20
smudgeplot.py cutoff kmer_k21.hist 3000

Am I doing something wrong?

Cheers.

quantile filtering leads to wrong inference

Rhyker's observation: On autotetraploid blueberry (Vaccinium corymbosum) it incorrectly predicts diploid with the default quantile filtering, and correctly predicts tetraploid with the --no_quantile_filt.

-> this feature was originally made for nicer scaling of pictures, but perhaps it should be disabled by default

packaging

Right now we have a python script searching for kmer pairs and R script that is parsing the coverages and creating plots.

I would like to turn

  • everything R into package, so the development is easier and I can separate all the functions into separated files
  • once I know that analysis is meaningful move it from R to python to increase speed
  • R should be in the end used only for plotting.

Consider if we could do upload it to conda or some other repos to make installation easy. It's just bunch of scripts in the end.

R requirement

Hello,

I just got the error:
"unused argument (left.open = T)" of the R command findInterval
this option was added in R version 3.3.2
I suggest checking the R version in the install process.

After updating R it works great. Thanks,
Stephan Majda

Wrong result of Salix

Hi,

I ran into some problems when I do smudgeplot with Salix data.
Here is my reslut:
ljs_smudgeplot
ljs_smudgeplot_log10

My L and U is determined after I ran genomescope:
plot

I'm confused because this picutre says that my reads' coverage is only ~7, which actually should be ~18

Afterall, accroding to this picture, my L is 7. And I have a terrible result.
The thing I want to know is: Is there anything wrong with my reads data? Or something wrong with other part?

Thanks for your program.

Exploring multimember families

Here I report my exploration of the multiallelic loci of Fragaria Illuminae, a diploid previous version of smudgeplot classified as tetraploid (see the strawberry tutorial for details). The exploration is recorded in this script.

First, I am not entirely sure why, but the kmer pairs identified by the modified version of KMC and the original python script are not identical (although they are very close). here is a sceenshot of the smudgeplot made of families of 2 using the KMC (left) and the original smudgeplot from two strawberry tutorial (right)

smudgeplot_new_just_pairs_vs_old

Is it a problem? Not sure, but I would prefer to see two sets of identical numbers.

Here is the overview of family sizes

    2       3        4        5         6        7         8 
60.4%   18.3%     8.9%     5.2%      3.4%     2.3%      1.5%

The kmer pairs we used till now constitute ~60% of the dataset. Now I exclude these families of 2, because I would like to explore what additional information is hidden in the families of more than 2.

Here is a coverage sum of all multi kmer families with marked expected ploidy levels (we know that the haploid kmer coverage is ~144)

multi_family_covsum_hist

We can see that on the left side of the histogram we are able to distinguish between ploidy levels, but borders get blurrier and blurrier with higher ploidy levels of the kmer family.

Now I will make smudgeplot like plot, taking the coverage of the minimal kmer within the family (so techinically it's exactly the same as smudgeplot).

multi_family_smudgeplot

The limit for the minor coverage is 1/3, as the minimal family size is 3 and the if it's a minimal coverage the ABC and AABBCC will represent the edge cases and |C| / |ABC| can not be > 1/3. We see that AABBCC is actually the by far the brightest smudge.

We can also plot a smudgeplot-like plot where instead of min/sum on the x axis we plot max/sum. In that case however the axis must span from 0 to 1 as in theory the limit is close to 1 for higher ploidy levels. That's the reason why I used a classical 2d honeycomb histogram instead of our smudgeplot plotting engine

multi_family_max_rel_cov

somehow I thought that it will be more indicative and will spread the smudges on a bigger surface, but there is very little information other than what was already apparent in the min/sum plot.

Finally, instead of trying to visualize the kmers themselves, we can just calculate a summary, becase in the case of strawberry we know the haploid coverage. Because I wanted to reduce the problem, I used for the following statistics only families of 3.

triallelic_smudge_sizes

Again, we see that the AABBCC is the most frequent structure, but here we also see that other common structures are usually (A)BBCC., where A is repeats 2/4/6/... times.

Somehow I think these explorations do give us some insights to the genome structure, I am not entirely sure how to utilize this information for better ploidy estimation.

Note that using the families of any number while keeping everything the same in the smudgeplot would lead to plot that are a simple overlap of the smudeplot shown at the top (the diploid-only) and the smudgeplot made from families > 3. I.e. it is simply going to cause higher ploidy levels estimates.

Plukenetia volubilis genome size mistery

Hi @KamilSJaron, @tbenavi1!

I used both Smudgeplot and Genomescope2 for the early stages of my assembly project last year and you have been really helpful(issues #36 and #45 for some context) for me.

Currently, I already did genome assembly and I am doing some post-assembly of the genome, I would say it is almost finished (at the possible extent that our data allows).

However, a mystery emerged after assembly, that I thought would be solved after post-assembly stages, that is seriously drilling my head.

As I had a troublesome genome in hand I explored several approaches. In the end, I error corrected using Brownie Corrector and Karect and then assembled with several assemblers. As I had Nanopore data as well, I also tried Flye as a long-read assembler and hybrid assemblies.

Here is a brief of the final workflow:

assembly_workflow

And the contiguity of the resulting genomes were these (generated with abyss-fac -s 2000):

n n:2000 L50 min N75 N50 N25 E-size max sum name
669643 25539 3683 2000 19958 40670 74530 56920 464213 537.3e6 abyss_110.fasta
18738 14216 3184 2000 19616 31064 45848 35529 156402 297.7e6 flye_polished.fasta
13713 10722 1428 2001 55625 112275 213463 168686 1627529 594.3e6 masurca.fasta
775739 29681 4340 2000 17458 35558 62757 47630 351667 538.6e6 megahit.fasta
290756 22588 3169 2000 24016 48992 87791 66992 494742 551.6e6 minia.fasta
688159 20551 1898 2000 37869 88042 169274 123032 750764 616.9e6 platanus.fasta
223514 15277 1719 2000 45373 94114 177120 129199 750764 584.3e6 Platanus_curated.fasta

As you can see, all the short read and hybrid assemblies consistently report genome sizes of +500 Mb. The only exception Is the Flye assembly which originally was around 180 Mb and after polishing with the short reads the assembly size went up to ~ 300 Mb. This is not a surprise as our nanopore data is really bad (low coverage and an N50 of just 2kb), so this was not a viable assembly either way.

Regarding gene completeness, estimated with Busco, all the short read and hybrid assemblies reported +95% completeness, with most of them single-copy (+90%). The Flye assembly was the only one incomplete (~50%).

With these results, I wondered why they were so different from the original estimation of 200 Mb by Genomescope2 (In issue #36 we concluded 250 Mb but it was actually inflated by organellar DNA) so consistently.

At first glance, I thought I had several ploidies in the assemblies, but this is not consistent with the Busco reports, as there is a low proportion of orthologs duplicated (~4-5%). So if the genome is redundant, at least it is not evenly redundant. There should be some regions that are more overrepresented than others.

The next thing that I thought was that maybe some complex repeats gave troubles to the assemblers and, not being able to solve them, the assemblers returned several fragments for those regions generating the redundancy. So I proceeded to quantify repeats.

I used both Repeatmasker and REPET-TEdenovo, with Repbase as a reference database, to search repeats in the masurca assembly (as it was the best).

RepeatMasker reported less than 9.7% of repeats (~58 Mb), while REPET was more conservative with just 40Mb estimated as repeats which is like 6.7%. This was another surprise, as I expected a higher repeat content according to the aforementioned hypothesis. Furthermore, this proportion contrasts a lot with the GenomeScope2 estimate of 17.4%.

At that point, I ran out of ideas but then I found FinichseSC, which is explained here, and essentially help to resolve repeats using long reads, but it also filters out embedded contigs. I used as first post-assembly and got really good results in contiguity, and it did reduce the genome size from 594 Mb to 568 Mb but did not solve the mystery.

The last thing that I tried was to do some phasing with HaploMerger2 , thinking that the genome might have several haplotypes there causing the redundancy. However, again, it did reduce the genome size to 533 Mb but is not a significant reduction and the mystery holds.

So finally I apologize for the extension of the issue and if it may go beyond your scope as developers of these amazing tools, but I would really like to know what you think about this genomic mystery and the differences between estimation and assembly in both genome size and proportion of repeats.

Thanks in advance,
Sivico

smudgeplot.py hetkmers -k option missing

I'm trying to run smudgeplot on my data but get stuck with the following line of code I got from the manual https://github.com/KamilSJaron/smudgeplot/wiki/manual-of-smudgeplot-with-jellyfish

jellyfish dump -c -L $L -U $U kmer_counts.jf | smudgeplot.py hetkmers -k 21 -o kmer_pairs

It seems there is no option -k. I receive the following output:
usage: smudgeplot hetkmers [-h] [-o O] [--middle] [infile] smudgeplot hetkmers: error: argument infile: can't open '21': [Errno 2] No such file or directory: '21'

When I go into the help of smudgeplot I see this:

smudgeplot.py hetkmers -h
usage: smudgeplot hetkmers [-h] [-o O] [--middle] [infile]

Calculate unique kmer pairs from a Jellyfish or KMC dump file.

positional arguments:
  infile      Alphabetically sorted Jellyfish or KMC dump file (stdin).

optional arguments:
  -h, --help  show this help message and exit
  -o O        The pattern used to name the output (kmerpairs).
  --middle    Get all kmer pairs that are exactly the same but in the middle
              nt. When this flag is used, the input dump must be
              alphabetically sorted/ (default: different by a SNP at any
              position).

I'm running smudgeplot v0.2.1

Install error

When running Rscript install.R, I get some errors, here is the whole output as it is printed in stderr

Updating smudgeplot documentation
Loading smudgeplot
Loading required package: viridis
Loading required package: viridisLite
Loading required package: argparse
Writing NAMESPACE
Writing annotate_peaks.Rd
Writing annotate_summits.Rd
Writing estimate_1n_coverage_1d_subsets.Rd
Writing estimate_1n_coverage_highest_peak.Rd
Writing filter_peak_sizes.Rd
Writing filter_peaks.Rd
Writing generate_summary.Rd
Writing get_1d_peaks.Rd
Writing get_default_col_ramp.Rd
Writing get_peak_sizes.Rd
Writing get_peak_summary.Rd
Writing get_smudge_container.Rd
Writing get_trinoploid_1n_est.Rd
Writing guess_genome_structure.Rd
Writing peak_agregation.Rd
Writing plot_expected_haplotype_structure.Rd
Writing plot_histograms.Rd
Writing plot_legend.Rd
Writing plot_seq_error_line.Rd
Writing plot_smudgeplot.Rd
Writing round_minor_variant_cov.Rd
Writing rounding.Rd
Writing smudge_summary.Rd
Writing smudge_warn.Rd
Writing transform_minor_variant_cov.Rd
Writing transform_pair_cov.Rd
Warning messages:
1: roxygen2 requires Encoding: UTF-8
2: In .doLoadActions(where, attach) :
trying to execute load actions without 'methods' package
Loading smudgeplot

Attaching package: ‘testthat’

The following objects are masked from ‘package:devtools’:

setup, test_file

Testing smudgeplot
✔ | OK F W S | Context
✔ | 1 | 1d subsets fitting [0.1 s]
✖ | 0 1 | integration
────────────────────────────────────────────────────────────────────────────────
test-fitting.R:17: error: (unknown)
unused argument (left.open = T)
1: get_smudge_container(minor_variant_rel_cov, total_pair_cov, .nbins = nbins) at /users/jblommaert/smudgeplot/tests/testthat/test-fitting.R:17
2: as.data.frame(table(findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T),
findInterval(.total_pair_cov, smudge_container$y, left.open = T)), stringsAsFactors = F) at /users/jblommaert/smudgeplot/R/get_smudge_container.R:17
3: table(findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T), findInterval(.total_pair_cov,
smudge_container$y, left.open = T))
4: findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T)
────────────────────────────────────────────────────────────────────────────────
✖ | 0 1 | smudge container generation
────────────────────────────────────────────────────────────────────────────────
test-get_smudge_container.R:6: error: (unknown)
unused argument (left.open = T)
1: get_smudge_container(x, y, .nbins = 3, .xlim = c(0, 3), .ylim = c(0, 3)) at /users/jblommaert/smudgeplot/tests/testthat/test-get_smudge_container.R:6
2: as.data.frame(table(findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T),
findInterval(.total_pair_cov, smudge_container$y, left.open = T)), stringsAsFactors = F) at /users/jblommaert/smudgeplot/R/get_smudge_container.R:17
3: table(findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T), findInterval(.total_pair_cov,
smudge_container$y, left.open = T))
4: findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T)
────────────────────────────────────────────────────────────────────────────────
✖ | 0 1 | generation of peak summary
────────────────────────────────────────────────────────────────────────────────
test-get_summary.R:17: error: (unknown)
unused argument (left.open = T)
1: get_smudge_container(minor_variant_rel_cov, total_pair_cov, .nbins = nbins) at /users/jblommaert/smudgeplot/tests/testthat/test-get_summary.R:17
2: as.data.frame(table(findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T),
findInterval(.total_pair_cov, smudge_container$y, left.open = T)), stringsAsFactors = F) at /users/jblommaert/smudgeplot/R/get_smudge_container.R:17
3: table(findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T), findInterval(.total_pair_cov,
smudge_container$y, left.open = T))
4: findInterval(.minor_variant_rel_cov, smudge_container$x, left.open = T)
────────────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 0.3 s

OK: 1
Failed: 3
Warnings: 0
Skipped: 0
Installing package into ‘/users/jblommaert/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)

  • installing source package ‘smudgeplot’ ...
    ** R
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
  • DONE (smudgeplot)

install smudgeplot on cluster

Dear Smudgeplot creator,
I am able to install without any problem smudgeplot on a local pc and I also success to create a conda env able to run smudgeplot, but i got an error because my "smudgeplot and smudgeplot_plot.R" are not in /usr/local/bin...

Some plotting bug to fix

./smudgeplot.R data/Aric1/Aric1_pairs_coverages_2.tsv data/Aric1/Aric1_smudgeplot_n35.png "Adineta ricciae" 35.3
Loading required package: RColorBrewer
Loading required package: MASS
Warning message:
package 'MASS' was built under R version 3.3.2 
Error in 2:8 * .n : non-numeric argument to binary operator
Calls: plot_smudgeplot -> axis
Execution halted

question about smudgeplot_plot.R

image
Hi, when I run smudgeplot, 'smudgeplot.py plot ' occurs error, showing: smudgeplot_plot.R command not found. How can I deal with it? Thank you.

Smudgeplot interpretation

Hi Kamil,
I am using smudgeplot to look at the state of ploidy in a plant pathogenic fungs. I counted kmer coverages with KMC and then plotted with smudgeplot as instrected in the tutorial. The fungs is a diploid/dikaryon (two haploid genome in two nuclei) which what I see with most of the isolates I tested with smudgeplot but interestingly one isolate was estimated to be tetraploid!
I still not sure how smudgeplot estimated ploidy by counting the pair of heterozygous k-mers, so could you please have a look at the plot and help me to understand what is going on.
PST_S00832A_smudgeplot.

Also, most of the fungs isolate estimated to be diploid with one smudge, but I see one with two smudges, so wondering how do I interpret the number of the smudges in the plot.
Look at the plots here:
PST_DK09_smudgeplot
ERR1607728_smudgeplot

The size of the genome is estimated to be ~80-90 mb but no information about the number of chromosomes is known.

Thanks
Ram

strange sacharomyces smudgpelot

About your genome

Saccharomyces, 12M monoploid genome size.

smudgeplot

Described here.

Long story short, what should have been a tetraploid smudgeplot looks like this:

wrong_smudgeplot.

hetkmers.py is computing after it saved coverages_2.tsv file

I made a small trick to avoid the dump file in my computer (I asked kmc to save the output to std out stream and piped it in the hetkmers.py). The log suggest that the job should be finished

screen shot 2018-10-02 at 09 41 55

but it's still actively computing on a single core. This is a htop screenshot:

screen shot 2018-10-02 at 09 41 23

The output seems correct I managed to create a smudgeplot so I killed, but it's strange no?

make hetkmers --all single threaded again

We had a bunch of troubles with multiprocessing library (like #18 or #26), they probably would be fixable, but since we have introduced middle nt approach for kmer search we probably don't need this algorithm to run in parallel because for small genomes it's alright on a single core too.

-> turn hetkmers --all to a single core again

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.