Comments (9)
You can use NCBI Datasets for this.
An EntrezDirect method would be to use elink first to get links to genes from the RefSeq accession.version, download the gene DocSum, and then extract the gene names. An example is shown below:
$ cat accs.txt
NR_133910.2
NM_001318896.2
NM_001160354.2
NM_001369393.2
NM_006552.2
$ epost -db nuccore -input accs.txt \
| elink -target gene \
| esummary \
| xtract -pattern DocumentSummary -element Name
MECP2
FHL2
CXCL17
LY6K
SCGB1D1
Since the gene DocSum does not have transcript accessions, a bash while loop can be used to map each acc.ver to its gene symbol, one request per accession:
$ cat accs.txt \
| while read -r acc ; do
    # < /dev/null keeps the EDirect commands from consuming the loop's stdin
    g=$(epost -db nuccore -id "$acc" < /dev/null | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name)
    echo -e "$acc\t$g"
done
NR_133910.2 CXCL17
NM_001318896.2 FHL2
NM_001160354.2 LY6K
NM_001369393.2 MECP2
NM_006552.2 SCGB1D1
from edirectcookbook.
Thank you. This is extremely helpful.
I'm working with a list of 177,816 accession.versions (attached), which was the number of reference transcripts in the GRCh38 release on the NCBI Human Genome Resources page when I downloaded it. NCBI Datasets worked fine for a small test file of ten accession.versions, but when I let the complete file run overnight, it never finished. Was I just not patient enough? I would really like this to work.
Thanks for the scripts. I'm currently running the second. When I tested it, it worked fine but slowly: it seemed like I was only getting about one result per second, meaning my full list will take over two days to complete. Is there any way to speed this up?
Using esummary with text output may be faster:
epost -db nuccore -id NM_001318896.2 | elink -target gene | esummary -format text
The DocSum is HUGE, so this is going to be slow. Unfortunately, if you need to preserve 1:1 linkages between input and output, I believe you have to send each request one at a time.
"Is there any way to speed this up?"
Not using EntrezDirect. As I had mentioned earlier, NCBI Datasets is a good choice for this. For example, I was able to download the data for the entire list in 6 min using the following command:
$ datasets summary gene accession --inputfile accNosRandom.txt --as-json-lines > gene_summary.jsonl
Thanks for the information.
I downloaded the datasets and dataformat command-line tools (Linux AMD64). As a test, I ran
datasets summary gene accession --inputfile accNosRandom.short.txt --as-json-lines > accNosRandom.short.jsonl
where accNosRandom.short.txt (a list of ten accession.version numbers) is attached, and got accNosRandom.short.jsonl as a result (also attached, although I added a .txt extension so I could upload it). I then ran
dataformat tsv gene --inputfile ./accNosRandom.short.jsonl --fields transcript-accession,symbol
but got an empty result (see image). What am I doing wrong?
accNosRandom.short.txt
accNosRandom.short.jsonl.txt
You are not doing anything wrong. Currently dataformat does not support the jsonl files generated by datasets summary gene ... commands. This feature is in the pipeline and will be added in the near future.
As a workaround, you can "download" the package without any sequence data and use dataformat as shown below:
$ datasets download gene accession --inputfile accNosRandom.short.txt --exclude-gene --exclude-protein --exclude-rna
Downloading: ncbi_dataset.zip 57.6kB done
$ dataformat tsv gene --package ncbi_dataset.zip --fields gene-id,symbol,transcript-accession | head -n3
NCBI GeneID Symbol Transcript Accession
10717 AP4B1 XR_007066904.1
10717 AP4B1 XM_017000090.2
Again, thanks.
Unfortunately, this doesn't appear to work because there is extraneous information in the result. I need the official symbol for each accession number in the file, in order. It looks like the results contain all the records for the genes that have accession.version numbers in my list. For example, XM_017000093.3 is the first accession number, which is a transcript for gene AP4B1. However, in the result provided by dataformat, I get a bunch of accession numbers for that gene, including the one I supplied.
Ah, the details! Yes, datasets by default returns all transcripts for a given gene, not just the ones you have asked for. An additional Unix join may be needed to filter the output of dataformat.
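A minimal sketch of that join-based filter, assuming the dataformat output was saved as a tab-separated file with a header row and the columns gene-id, symbol, transcript-accession (the order used in the command above). The function name filter_by_join and its arguments are hypothetical. join needs both inputs sorted on the join field, so the output comes back in sorted order rather than input order.

```shell
# Hypothetical helper: keep only the rows of a dataformat TSV whose transcript
# accession (column 3) appears in a list of requested accessions.
# Usage: filter_by_join ACCESSION_LIST DATAFORMAT_TSV
# Prints: accession<TAB>symbol, sorted by accession.
filter_by_join() {
    tab=$(printf '\t')
    tmp_accs=$(mktemp) || return 1
    tmp_table=$(mktemp) || return 1
    # join requires both inputs sorted on the join field; LC_ALL=C keeps the
    # collation used by sort and join consistent.
    LC_ALL=C sort -u "$1" > "$tmp_accs"
    # Drop the header row, then sort the table on column 3 (the accession).
    tail -n +2 "$2" | LC_ALL=C sort -t "$tab" -k3,3 > "$tmp_table"
    # Join list field 1 with table field 3; output accession and symbol only.
    LC_ALL=C join -t "$tab" -1 1 -2 3 -o 1.1,2.2 "$tmp_accs" "$tmp_table"
    rm -f "$tmp_accs" "$tmp_table"
}
```

It could be called as filter_by_join accNosRandom.short.txt info.tab, where info.tab is a hypothetical name for the saved dataformat output.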
Thanks again for all your help.
join requires that the two files to be joined are sorted on the join field. I wanted to preserve the order, so I used grep:
while read -r acc ; do
    # -F: match the accession literally; -w: do not match a longer accession
    grep -m 1 -F -w -- "$acc" accNosRandom.info.tab
done < accNosRandom.txt
where accNosRandom.txt is my list of accession numbers and accNosRandom.info.tab is the package after running it through dataformat.
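For a list of this size, running grep once per accession rescans the table every time. The same order-preserving lookup could be done in one pass with awk. This is only a sketch: it assumes the table is tab-separated with a header row and the columns gene-id, symbol, transcript-accession, and the function name map_in_order and the "NA" placeholder for unmatched accessions are my own choices, not part of datasets or dataformat.

```shell
# Hypothetical helper: print accession<TAB>symbol for every accession in the
# list, in list order, looking symbols up in a dataformat TSV.
# Usage: map_in_order ACCESSION_LIST DATAFORMAT_TSV
# Assumes the table is non-empty (awk's NR == FNR test relies on that).
map_in_order() {
    awk -F '\t' -v OFS='\t' '
        # First file (the table): skip the header and remember the first
        # symbol seen for each transcript accession in column 3.
        NR == FNR { if (FNR > 1 && !($3 in sym)) sym[$3] = $2; next }
        # Second file (the list): one output line per input line, with "NA"
        # (an assumed placeholder) when the accession is not in the table.
        { print $1, (($1 in sym) ? sym[$1] : "NA") }
    ' "$2" "$1"
}
```

It could be called as map_in_order accNosRandom.txt accNosRandom.info.tab. Unlike the grep loop, it emits a line for unmatched accessions too, so the output stays aligned line by line with the input list.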
BTW, when I ran datasets, I got the messages below. What do they mean? Note that the last message is different. There are 177,816 accession.version numbers in the list I gave to datasets.
The current accession.version will be returned. [message repeated 88 times]
The accession (NR_120632.1) you provided is not currently in NCBI Gene or does not have an associated NCBI GeneID.