I have had some troubles understanding my smudgeplot of a highly repetitive genome tha

Repetitive mollusc genome interpretation about smudgeplot HOT 3 CLOSED

mason-linscott commented on August 26, 2024

Repetitive mollusc genome interpretation

from smudgeplot.

Comments (3)

KamilSJaron commented on August 26, 2024

Hi M.,

10x is not the best for making smudgeplots, the coverage variation is a bit higher compared to other types of libraries (surprisingly even PacBio HiFi is better for kmer spectra analyses). So I think your analysis will be a bit more sensitive to the L parameter you chose. What did you use? And can you show us also a genomescope plot? That helps a lot to decide about a meaningful L.

I am a bit puzzled of the 1n coverage estimate, I am not 100% how smudgeplot ended up on that number... (the labels are not well overlaping with the smudges). But I also supect your L was too low and the bottom part of the smudgeplot is simply error kmers paired with the unique genomic kmers and then the upper smudge is the real diploid one (which would also estimate the 1n coverage a bit closer to the expected 30x, assuming that the 60x is what people call genome coverage, not per-haplotype coverage).

from smudgeplot.

mason-linscott commented on August 26, 2024

Hi Kamil,

Thank you for looking into this! Here is the genomscope plot made using 31-mers (high K chosen due to genome repetitiveness). I was not sure whether to also apply a higher K for smudgeplot and just followed the guide. I hope it helps with your interpretation.

L was set to 15 and U to 2700 based on the output from the smudgeplot.py cutoff. I can raise L on the next smudgepairs run (takes a day and a half).

Thanks again,
Mason

from smudgeplot.

KamilSJaron commented on August 26, 2024

Hi Mason,

that is a giant genome. That must be a huge dataset you are dealing with.

So, the first thing first. Your sequencing is not the cleanest (which is understandable for giant mollusc genomes) - the error peak and genome peak are overlapping. The problem is that you don't have a clean cut between errors and real genomic kmers, which is what we want L to be.

Good news is that the genomescope looks alright (both het and homo zygous peaks are kind of visible and the model relativelly well explains the data). Also, you have very little evidence of polyploidy (assuming your 1n coverage fit is right), because those usually have some of the higher coverage peaks relatively high as well. So the only thing I would suggest to you is to redo the smudgeplot PLOT with a coverage prior from the genomescope model. You can run it as:

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -n 18 -o "Land_snail" -t "O. idahoensis"

I am suspecting you will get the peaks annotated right this time (as AB and AABB). I suspect the proportion of AB vs AABB kmers will be a bit different and quite possibly favouring AABB. That practically means that the 2n peak kmers have very often a very similar kmers within the same 2n peak. It is a puzzling signature for genomes with high heterozygosity, but perhaps we should not be bothered by this that much given how much troubles we had separating 1n and error peaks. Perhaps just keep in mind for downstream analyses that you might have quite a lot of paralogy in your genome (and keep even deeper in your mind that there polyploidy was not conclusively rejected in this analysis).

from smudgeplot.

Repetitive mollusc genome interpretation about smudgeplot HOT 3 CLOSED

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent