Comments (8)

JBreunig commented on June 1, 2024

Ok, seems to work using that index. Now I get a "gene_name" column that is easy to work with...thanks!!

adata
Out[6]:
AnnData object with n_obs × n_vars = 384 × 53292
var: 'gene_name'

adata.var.head()#
Out[8]:
gene_name
ENSG00000000003.14 TSPAN6
ENSG00000000005.6 TNMD
ENSG00000000419.12 DPM1
ENSG00000000457.14 SCYL3
ENSG00000000460.17 C1orf112

from kb_python.

Lioscro commented on June 1, 2024

Hi, @JBreunig,
It looks like kallisto is outputting transcript IDs instead of gene IDs, probably because processing smartseq fastqs requires a slightly different procedure than processing standard UMI-based single-cell fastqs.

I agree that gene IDs are easier to work with. I'll take a look at how we can convert those to gene IDs instead.

Regarding the last adata.var you show, I'm not sure what is happening with the genes there. Perhaps you are using the index and t2g from this (https://github.com/pachterlab/kallisto-transcriptome-indices) repository? If so, my recommendation would be to manually generate the transcriptome index following the procedure described in FAQ 1 in our tutorials, because there are some additional columns in the t2g that kb uses, while the t2g in the repository does not include them.
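One quick way to see whether a given t2g file carries more than the three columns read elsewhere in this thread (transcript ID, gene ID, gene name) is to count the tab-separated fields per line. The following is only a sketch: the helper name `t2g_column_counts` is hypothetical, and which extra columns kb expects is not stated here, so the result is best compared against a t2g produced by a manual index build.

```python
import csv
from collections import Counter


def t2g_column_counts(path):
    """Count how many tab-separated columns each non-empty line of a t2g file has.

    Hypothetical helper for illustration; returns a Counter mapping
    column count -> number of lines with that many columns.
    """
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:  # skip blank lines
                counts[len(row)] += 1
    return counts


# Usage (the path is a placeholder):
# print(t2g_column_counts("t2g.txt"))
```

If every line reports three columns, the file has the minimal layout; a t2g with additional columns would report a larger count.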

JBreunig commented on June 1, 2024

Ok, that all makes sense. Thanks in advance for looking into converting to gene IDs!

Lioscro commented on June 1, 2024

Hi, @JBreunig,

I've pushed an update to the devel branch that should now output gene IDs instead of transcript IDs as the columns.
Could you try running and verify? Once I get the confirmation, we'll release a new version of kb with these changes.
You can install the devel branch with the following command:

pip install git+https://github.com/pachterlab/kb_python@devel --upgrade

JBreunig commented on June 1, 2024

Hi Joseph,
I think it's close, but do you have a suggestion for removing the .AXXXXX suffixes and condensing duplicates, i.e. going from the format "ENSG00000277400.1.A14056" to "ENSG00000277400.1", as shown in the tutorials and as is typical with droplet data?

Here are the outputs using the same reference validated with human 10X datasets:

BT1030f = sc.read('/mnt/1TBretrySamsung/HuEpendymoma/BT1030/BT1030rd/adata.loom', cache=True, validate=False)
writing an h5ad cache file to speedup reading next time

adata = BT1030f

t2g = pd.read_csv("/mnt/WD2TBunderside/WorkingHumanRefKBtools081520/t2g.txt", header=None, sep="\t", names=["tid", "gid", "gene"]) #Human

t2g = t2g.drop_duplicates(["gid", "gene"])
t2g = t2g.set_index("gid")
t2g.head()
Out[8]:
tid gene
gid
ENSG00000223972.5 ENST00000456328.2 DDX11L1
ENSG00000227232.5 ENST00000488147.1 WASH7P
ENSG00000278267.1 ENST00000619216.1 MIR6859-1
ENSG00000243485.5 ENST00000473358.1 MIR1302-2HG
ENSG00000284332.1 ENST00000607096.1 MIR1302-2

adata.var.head()#
Out[9]:
Empty DataFrame
Columns: []
Index: [ENSG00000277400.1.A14056, ENSG00000277400.1.A32841, ENSG00000277400.1.A35311, ENSG00000277400.1.A36796, ENSG00000277400.1.A44325]

adata
Out[10]: AnnData object with n_obs × n_vars = 384 × 845338

adata.var["Gene"] = adata.var.index.map(t2g["gene"])

adata.var.head()#
Out[12]:
Gene
ENSG00000277400.1.A14056 NaN
ENSG00000277400.1.A32841 NaN
ENSG00000277400.1.A35311 NaN
ENSG00000277400.1.A36796 NaN
ENSG00000277400.1.A44325 NaN
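As a possible workaround while the source of the suffixes is unknown, the trailing part can be stripped with a regular expression and the duplicate columns summed. This is a minimal sketch on a toy pandas DataFrame, not a fix in kb itself. It assumes the suffix always looks like `.A` followed by digits, and that summing counts across the duplicates is the desired way to condense them. The counts and the third column ID are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the count matrix: rows are cells, columns are suffixed IDs.
# The values and the third ID are invented for this example.
counts = pd.DataFrame(
    [[1, 2, 0], [0, 3, 4]],
    index=["cell_1", "cell_2"],
    columns=[
        "ENSG00000277400.1.A14056",
        "ENSG00000277400.1.A32841",
        "ENSG00000223972.5.A00001",
    ],
)

# Strip a trailing ".A<digits>" suffix (assumed pattern) to recover the gene ID.
gene_ids = counts.columns.str.replace(r"\.A\d+$", "", regex=True)

# Sum all columns that now share the same gene ID.
collapsed = counts.T.groupby(gene_ids).sum().T

# With clean gene IDs, the gene-name lookup from the t2g finds matches again.
t2g_gene = pd.Series({"ENSG00000223972.5": "DDX11L1"})
var = pd.DataFrame(index=collapsed.columns)
var["Gene"] = var.index.map(t2g_gene)

print(collapsed)
print(var)
```

With 845338 columns, a dense DataFrame like this may not fit in memory, so on real data the same grouping would need to be applied to the sparse matrix before rebuilding the AnnData object.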

Lioscro commented on June 1, 2024

I actually have no idea where the AXXXX suffixes are coming from. I have never seen that happen on my test data. Are you using a cached version of the output matrix?

JBreunig commented on June 1, 2024

Just to be clear, cached from the loom?

When I look at the genes.txt and transcripts.txt in the output directory from today, they are there as well:
genes.txt
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796

transcripts.txt
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796

Retrying now with a new, rebuilt index...will post when done.

Lioscro commented on June 1, 2024

Yes, cached from the loom. I mentioned that because I noticed the cache=True option in your sc.read command, which may be reading a cached (older) version of the loom file.

I re-ran using the transcriptome provided here (https://github.com/pachterlab/kallisto-transcriptome-indices/releases) and am still unable to reproduce the outputs you are getting. None of the AXXXX suffixes appear in the GTF, the cDNA FASTA, or the t2g, so I'm not sure where these are coming from.

Please let me know how it looks when running with a manually built index.
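To narrow down where the suffixes enter, each plain-text input and output (GTF, cDNA FASTA, t2g, genes.txt) could be scanned for IDs of the suffixed shape. The sketch below assumes the pattern `ENSG<digits>.<version>.A<digits>` seen in this thread. The helper name `find_suffixed_ids` is hypothetical, and gzip-compressed files would need to be decompressed first.

```python
import re

# Pattern inferred from IDs such as "ENSG00000277400.1.A14056" (an assumption).
SUFFIXED_ID = re.compile(r"ENSG\d+\.\d+\.A\d+")


def find_suffixed_ids(path, limit=5):
    """Return up to `limit` distinct suffixed IDs found in a plain-text file."""
    found = []
    with open(path) as handle:
        for line in handle:
            for match in SUFFIXED_ID.findall(line):
                if match not in found:
                    found.append(match)
                    if len(found) >= limit:
                        return found
    return found


# Usage (the paths are placeholders):
# for path in ["genes.txt", "t2g.txt", "cdna.fa"]:
#     print(path, find_suffixed_ids(path))
```

An empty list for the reference files combined with hits in genes.txt would suggest the suffixes are introduced during index building or quantification rather than by the reference itself.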
