Comments (8)
Ok, seems to work using that index. Now I get a "gene_name" column that is easy to work with...thanks!!
adata
Out[6]:
AnnData object with n_obs × n_vars = 384 × 53292
var: 'gene_name'
adata.var.head()
Out[8]:
gene_name
ENSG00000000003.14 TSPAN6
ENSG00000000005.6 TNMD
ENSG00000000419.12 DPM1
ENSG00000000457.14 SCYL3
ENSG00000000460.17 C1orf112
from kb_python.
Hi, @JBreunig,
It looks like kallisto is outputting transcript IDs instead of gene IDs, probably because processing smartseq fastqs requires a slightly different procedure than standard UMI-based single-cell fastqs.
I agree that gene IDs are easier to work with. I'll take a look at how we can convert those to gene IDs instead.
Regarding the last adata.var you show, I'm not sure what is happening with the genes there. Perhaps you are using the index and t2g from this (https://github.com/pachterlab/kallisto-transcriptome-indices) repository? If so, my recommendation would be to manually generate the transcriptome index following the procedure described in FAQ 1 in our tutorials, because there are some additional columns in the t2g that kb uses, while the t2g in the repository does not include those.
Ok, that all makes sense. Thanks in advance for looking into converting to gene IDs!
Hi, @JBreunig,
I've pushed an update to the devel branch that should now output gene IDs instead of transcript IDs as the columns.
Could you try running and verify? Once I get the confirmation, we'll release a new version of kb with these changes.
You can install the devel branch with the following command:
pip install git+https://github.com/pachterlab/kb_python@devel --upgrade
Hi Joseph,
I think it's close, but do you have a suggestion for removing the .XXXXXX suffixes and condensing duplicates, going from the format "ENSG00000277400.1.A14056" to "ENSG00000277400.1", as shown in the tutorials and as is typical with droplet data?
Here are the outputs using the same reference validated with human 10X datasets:
BT1030f = sc.read('/mnt/1TBretrySamsung/HuEpendymoma/BT1030/BT1030rd/adata.loom', cache=True, validate=False)
writing an h5ad cache file to speedup reading next time
adata = BT1030f
t2g = pd.read_csv("/mnt/WD2TBunderside/WorkingHumanRefKBtools081520/t2g.txt", header=None, sep="\t", names=["tid", "gid", "gene"]) #Human
t2g = t2g.drop_duplicates(["gid", "gene"])
t2g = t2g.set_index("gid")
t2g.head()
Out[8]:
tid gene
gid
ENSG00000223972.5 ENST00000456328.2 DDX11L1
ENSG00000227232.5 ENST00000488147.1 WASH7P
ENSG00000278267.1 ENST00000619216.1 MIR6859-1
ENSG00000243485.5 ENST00000473358.1 MIR1302-2HG
ENSG00000284332.1 ENST00000607096.1 MIR1302-2
adata.var.head()
Out[9]:
Empty DataFrame
Columns: []
Index: [ENSG00000277400.1.A14056, ENSG00000277400.1.A32841, ENSG00000277400.1.A35311, ENSG00000277400.1.A36796, ENSG00000277400.1.A44325]
adata
Out[10]: AnnData object with n_obs × n_vars = 384 × 845338
adata.var["Gene"] = adata.var.index.map(t2g["gene"])
adata.var.head()
Out[12]:
Gene
ENSG00000277400.1.A14056 NaN
ENSG00000277400.1.A32841 NaN
ENSG00000277400.1.A35311 NaN
ENSG00000277400.1.A36796 NaN
ENSG00000277400.1.A44325 NaN
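For reference, the kind of clean-up asked about above can be sketched in plain Python. This is only an illustration, not part of kb: it assumes the unwanted suffix is always a final field of one letter followed by digits (as in "ENSG00000277400.1.A14056"), and that summing counts across the collapsed IDs is meaningful, which has not been established here since the origin of the suffixes is unknown. The names `strip_suffix` and `collapse_counts` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Assumption: the unwanted suffix is a trailing ".<letter><digits>" field,
# e.g. ".A14056". The Ensembl version field (".1") has no letter and is kept.
SUFFIX = re.compile(r"\.[A-Za-z]\d+$")


def strip_suffix(feature_id):
    """Remove a trailing '.<letter><digits>' field from a feature ID."""
    return SUFFIX.sub("", feature_id)


def collapse_counts(feature_ids, counts):
    """Sum counts for IDs that become identical after suffix stripping."""
    totals = defaultdict(int)
    for feature_id, count in zip(feature_ids, counts):
        totals[strip_suffix(feature_id)] += count
    return dict(totals)


# Toy input: two suffixed entries for one gene, plus one ordinary gene ID.
ids = [
    "ENSG00000277400.1.A14056",
    "ENSG00000277400.1.A32841",
    "ENSG00000000003.14",
]
collapsed = collapse_counts(ids, [2, 3, 5])
```

Once the IDs are stripped, a lookup such as adata.var.index.map(t2g["gene"]) has a chance of matching the gid index of the t2g table, because the keys then share the same format.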
I actually have no idea where the AXXXX suffixes are coming from. I have never seen that happen on my test data. Are you using a cached version of the output matrix?
Just to be clear, cached from the loom?
When I look at the genes.txt and transcripts.txt in the output directory from today, they are there as well:
genes.txt
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796
transcripts.txt
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796
Retrying now with a new, rebuilt index...will post when done.
Yes, cached from the loom. I mentioned that because I noticed the cache=True option in your sc.read command, which may be reading a cached (older) version of the loom file.
I re-ran using the transcriptome provided here (https://github.com/pachterlab/kallisto-transcriptome-indices/releases) and am still unable to reproduce the outputs you are getting, and none of the AXXXX suffixes appear in the GTF, the cDNA FASTA, or the t2g (so I'm not sure where these are coming from).
Please let me know how it looks when running with a manually built index.
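One quick way to narrow this down is to compare the IDs in genes.txt against the gene ID column of the t2g and list whatever does not match. The snippet below is a minimal sketch of that check, assuming a tab-separated t2g with the gene ID in the second column (the "tid", "gid", "gene" layout used earlier in this thread); `load_ids` and `unmatched_ids` are hypothetical helper names, and the sample lines stand in for the real files.

```python
def load_ids(lines):
    """Return non-empty, whitespace-stripped IDs from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def unmatched_ids(feature_lines, t2g_lines, gene_column=1):
    """Return feature IDs that are absent from the chosen t2g column.

    Assumption: the t2g is tab-separated and gene_column (0-based) holds
    the gene ID.
    """
    known = set()
    for line in t2g_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > gene_column:
            known.add(fields[gene_column])
    return [fid for fid in load_ids(feature_lines) if fid not in known]


# Stand-ins for the contents of genes.txt and t2g.txt.
genes_txt = ["ENSG00000277400.1.A14056\n", "ENSG00000223972.5\n"]
t2g_txt = ["ENST00000456328.2\tENSG00000223972.5\tDDX11L1\n"]

missing = unmatched_ids(genes_txt, t2g_txt)
```

With real files, the same functions accept open file handles, for example unmatched_ids(open("genes.txt"), open("t2g.txt")). If every suffixed ID shows up as missing, the suffixes were introduced after the t2g was read, or the index was built from a different reference than the t2g.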