Describe the issue Trying to align paired Smart-seq data with the

Smart-seq vars in 0.25 about kb_python HOT 8 CLOSED

pachterlab commented on June 1, 2024

Smart-seq vars in 0.25

from kb_python.

Comments (8)

JBreunig commented on June 1, 2024

Ok, seems to work using that index. Now I get a "gene_name" column that is easy to work with...thanks!!

adata
Out[6]:
AnnData object with n_obs × n_vars = 384 × 53292
var: 'gene_name'

adata.var.head()#
Out[8]:
gene_name
ENSG00000000003.14 TSPAN6
ENSG00000000005.6 TNMD
ENSG00000000419.12 DPM1
ENSG00000000457.14 SCYL3
ENSG00000000460.17 C1orf112

Lioscro commented on June 1, 2024

Hi, @JBreunig,
It looks like kallisto is outputting transcript IDs instead of gene IDs, probably because processing Smart-seq FASTQs requires a slightly different procedure than standard UMI-based single-cell FASTQs.

I agree that gene IDs are easier to work with. I'll take a look at how we can convert those to gene IDs instead.

Regarding the last adata.var you show, I'm not sure what is happening with the genes there. Perhaps you are using the index and t2g from this repository (https://github.com/pachterlab/kallisto-transcriptome-indices)? If so, I recommend manually generating the transcriptome index following the procedure described in FAQ 1 in our tutorials, because kb uses some additional columns in the t2g that the t2g in the repository does not include.
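
A quick way to compare two t2g files on this point is to count their tab-separated columns. The following is a minimal sketch using only the standard library. The function name `t2g_column_counts` and the example path are hypothetical, and the sketch does not assume which columns kb expects; it only reports how many columns each row has.

```python
from collections import Counter


def t2g_column_counts(path):
    """Return a Counter mapping column count -> number of rows with that count."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # ignore blank lines
                counts[len(line.split("\t"))] += 1
    return counts


# Example usage (hypothetical path):
# print(t2g_column_counts("t2g.txt"))
```

Running this on both the downloaded t2g and a manually built one shows whether they differ in column count.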

JBreunig commented on June 1, 2024

Ok, that all makes sense. Thanks in advance for looking into converting to gene IDs!

Lioscro commented on June 1, 2024

Hi, @JBreunig,

I've pushed an update to the devel branch that should now output gene IDs instead of transcript IDs as the columns.
Could you try running and verify? Once I get the confirmation, we'll release a new version of kb with these changes.
You can install the devel branch with the following command:

pip install git+https://github.com/pachterlab/kb_python@devel --upgrade

JBreunig commented on June 1, 2024

Hi Joseph,
I think it's close, but do you have a suggestion for removing the .XXXXXX suffixes and condensing duplicates, going from the format "ENSG00000277400.1.A14056" to "ENSG00000277400.1", as shown in the tutorials and as is typical with droplet data?

Here are the outputs using the same reference validated with human 10X datasets:

BT1030f = sc.read('/mnt/1TBretrySamsung/HuEpendymoma/BT1030/BT1030rd/adata.loom', cache=True, validate=False)
writing an h5ad cache file to speedup reading next time

adata = BT1030f

t2g = pd.read_csv("/mnt/WD2TBunderside/WorkingHumanRefKBtools081520/t2g.txt", header=None, sep="\t", names=["tid", "gid", "gene"]) #Human

t2g = t2g.drop_duplicates(["gid", "gene"])
t2g = t2g.set_index("gid")
t2g.head()
Out[8]:
tid gene
gid
ENSG00000223972.5 ENST00000456328.2 DDX11L1
ENSG00000227232.5 ENST00000488147.1 WASH7P
ENSG00000278267.1 ENST00000619216.1 MIR6859-1
ENSG00000243485.5 ENST00000473358.1 MIR1302-2HG
ENSG00000284332.1 ENST00000607096.1 MIR1302-2

adata.var.head()#
Out[9]:
Empty DataFrame
Columns: []
Index: [ENSG00000277400.1.A14056, ENSG00000277400.1.A32841, ENSG00000277400.1.A35311, ENSG00000277400.1.A36796, ENSG00000277400.1.A44325]

adata
Out[10]: AnnData object with n_obs × n_vars = 384 × 845338

adata.var["Gene"] = adata.var.index.map(t2g["gene"])

adata.var.head()#
Out[12]:
Gene
ENSG00000277400.1.A14056 NaN
ENSG00000277400.1.A32841 NaN
ENSG00000277400.1.A35311 NaN
ENSG00000277400.1.A36796 NaN
ENSG00000277400.1.A44325 NaN
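
The map above returns NaN because the suffixed IDs do not match any "gid" in the t2g. As a stopgap, the suffix can be stripped before mapping. The sketch below uses only the standard library and assumes the unexpected suffix is always a final ".A" followed by digits, as in the index shown above. The helper names `strip_suffix` and `collapse_counts` are hypothetical, and summing the duplicate columns is an assumption about how they should be combined. It does not explain where the suffixes come from.

```python
import re
from collections import defaultdict

# Assumption: the unexpected suffix is a trailing ".A<digits>" component.
SUFFIX = re.compile(r"\.A\d+$")


def strip_suffix(var_id):
    """Remove a trailing '.A<digits>' component, leaving the versioned gene ID."""
    return SUFFIX.sub("", var_id)


def collapse_counts(var_ids, counts):
    """Sum one cell's counts over all columns that share the same stripped ID."""
    totals = defaultdict(float)
    for var_id, value in zip(var_ids, counts):
        totals[strip_suffix(var_id)] += value
    return dict(totals)


ids = ["ENSG00000277400.1.A14056", "ENSG00000277400.1.A32841"]
print([strip_suffix(i) for i in ids])
```

With pandas, the same regular expression can be applied to the whole index via `adata.var.index.str.replace(r"\.A\d+$", "", regex=True)` before mapping against `t2g["gene"]`.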

Lioscro commented on June 1, 2024

I actually have no idea where the AXXXX suffixes are coming from. I have never seen that happen on my test data. Are you using a cached version of the output matrix?

JBreunig commented on June 1, 2024

Just to be clear, cached from the loom?

When I look at the genes.txt and transcripts.txt in the output directory from today, they are there as well:
genes.txt
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796

transcripts.txt
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796

Retrying now with a new, rebuilt index...will post when done.

Lioscro commented on June 1, 2024

Yes, cached from the loom. I mentioned that because I noticed the cache=True option in your sc.read command, which may be reading a cached (older) version of the loom file.
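
To rule out a stale cache, one option is to compare file modification times. Below is a minimal sketch using only the standard library. The function name `cache_is_stale` is hypothetical, and where scanpy writes its cache file depends on its settings and is not shown in this thread, so both paths are placeholders. Passing `cache=False` to `sc.read` avoids the cache altogether.

```python
import os


def cache_is_stale(source_path, cache_path):
    """Return True if the cache file is missing or older than the source file."""
    if not os.path.exists(cache_path):
        return True
    return os.path.getmtime(cache_path) < os.path.getmtime(source_path)


# Example usage (hypothetical paths):
# if cache_is_stale("adata.loom", "cache/adata.h5ad"):
#     print("Cache is missing or predates the loom file; re-read without the cache.")
```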

I re-ran using the transcriptome provided here (https://github.com/pachterlab/kallisto-transcriptome-indices/releases) and am still unable to reproduce the outputs you are getting. None of the AXXXX suffixes appear in the GTF, the cDNA FASTA, or the t2g, so I'm not sure where they are coming from.

Please let me know how it looks when running with a manually built index.
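
To narrow down where the suffixes first appear, each uncompressed text file (GTF, cDNA FASTA, t2g, genes.txt, transcripts.txt) can be scanned for the pattern. This is a rough sketch using only the standard library. The function name `count_suffixed_lines` is hypothetical, and the pattern assumes the suffix looks like ".A" followed by digits, so it could also match unrelated identifiers that happen to share that shape.

```python
import re

# Assumption: the suffix of interest is ".A<digits>" ending at a word boundary.
SUFFIX = re.compile(r"\.A\d+\b")


def count_suffixed_lines(path):
    """Count the lines in a plain-text file that contain a '.A<digits>' component."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if SUFFIX.search(line):
                count += 1
    return count


# Example usage (hypothetical file names):
# for name in ["t2g.txt", "genes.txt", "transcripts.txt"]:
#     print(name, count_suffixed_lines(name))
```

A file with a nonzero count is a candidate for where the suffixes enter the pipeline. Gzipped references would need to be decompressed first.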
