Dear madam/sir,
A few weeks ago I received my first single-cell methylation data. While exploring all possibilities for analyzing these samples, I ran into your Episcanpy paper. I was quite impressed by your user friendly and extensively described method, which is nicely substantiated by the associated Github.
While performing some quality control on cell/feature coverage, I realized something was wrong with the numbers (methylation values) in the count matrix. To be sure, I went through your code that builds the count matrix based on genomic windows of 100.000 basepairs: ‘_bld_met_mtx.py’. I think I found two mistakes:
1. The methylation values are assigned to right cells but to the wrong genomic windows, because the order of ‘chromosome’ and ‘sorted(features.keys())’ is different. Meaning that the methylation values end up in the wrong chromosome.
2. The code within the funciton ‘ def methylation_level(reduced_cyt, feature, chromosome, threshold=1):’, should loop through every cytosine (‘CG’ in my case) within a certain genomic window. Instead, it measures the methylation value of the first cytosine and even though it loops correctly, due to an iterator mistake it keeps adding the coverage values of this same cytosine, constantly noting the same methylation values. This results in an over-representation of zeros in the count matrix when the first cytosine in a window has meth.reads=0.
I will use the mouse brain data from your tutorial to show how this happens:
1.
![image](https://user-images.githubusercontent.com/86351835/146746409-adef063e-51cd-41b0-ba93-28fe4c0b2241.png)
Above is seen how ‘chromosomes’ is specified (1-19, X and Y), used in the ‘def methylation_level(reduced_cyt, feature, chromosome, threshold=1):’ function, which is responsible for creating a list containing all methylation values (meth_reads/tot_reads), see point 2.
On the other hand ‘sorted(feature.keys())’ is specified in a sorted way (1, 10-19, 2-9, X and Y), used in the ‘def extract_feature_names(feature):’ function, which creates the genomic bins.
Later in the function ‘def build_count_mtx(cells, annotation, path="", output_file=None, writing_option="w", meth_context="CG", chromosome=None, feature_names=None, threshold=1, ct_mtx=None, sparse=False, copy=False):’, the methylation values from the first function are added to the genomic bins from the second functions, which means they are appointed in the wrong order.
Removing the ‘sorted’ from features solves this problem and gives the same order as ‘chromosome’.
Solution:
![image](https://user-images.githubusercontent.com/86351835/146746687-67876066-755a-47ec-b9de-1b3fdacf2f52.png)
2.
![image](https://user-images.githubusercontent.com/86351835/146746482-7751d556-0184-4cf4-9790-b1e788fb32ae.png)
Here we are looking at cell 1 ('../methylation_play_data/cell1.tsv'), more specific, the first three loops of the ‘def methylation_level(reduced_cyt, feature, chromosome, threshold=1):’ function, looping through the first three cytosines of the first genomic window (chr1_3000001_3100000) containing a ‘CG’. In the first loop it nicely takes the meth_reads (=5) and tot_reads (=5) and adds it to the starting point; meth_reads (=0) tot_reads (=1), resulting in meth_reads (=5) and tot_reads (=6). However, in the second loop it does not take the meth_reads (=1) and tot_reads (=1), but adds the meth_reads (=5) and tot_reads (=5) from the first cytosine again. This results in meth_reads (=10) and tot_reads (=11), instead of meth_reads (=6) and tot_reads (=7). The same story for every other cytosine in a certain genomic window.
Solution:
![image](https://user-images.githubusercontent.com/86351835/146746849-42c555f6-07df-4532-8168-45af769d4c37.png)
Let me know if you need some more information regarding this potential edit.
I am curious about your opinion and hope to hear from you soon!
Kind regards,
Tim Sakkers