Given a list of accession.version numbers, is there a way to download the official gen

Download Symbol via EFetch about edirectcookbook HOT 9 OPEN

ncbi-hackathons commented on July 21, 2024

Download Symbol via EFetch

from edirectcookbook.

Comments (9)

vkkodali commented on July 21, 2024

You can use NCBI Datasets for this.
An EntrezDirect approach is to use elink to link each RefSeq accession.version to its gene record, download the gene DocSums with esummary, and then extract the gene names with xtract. An example is shown below:

$ cat accs.txt
NR_133910.2
NM_001318896.2
NM_001160354.2
NM_001369393.2
NM_006552.2
$ epost -db nuccore -input accs.txt \
    | elink -target gene \
    | esummary \
    | xtract -pattern DocumentSummary -element Name 
MECP2
FHL2
CXCL17
LY6K
SCGB1D1

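Since the gene DocSum does not have transcript accessions, the batch output above cannot be matched back to the input accessions. A bash while loop that submits one accession at a time can be used to map acc.ver to gene symbols: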

$ while read -r acc ; do
        g=$(epost -db nuccore -id "$acc" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name)
        printf '%s\t%s\n' "$acc" "$g"
done < accs.txt
NR_133910.2     CXCL17
NM_001318896.2  FHL2
NM_001160354.2  LY6K
NM_001369393.2  MECP2
NM_006552.2     SCGB1D1

mapauley commented on July 21, 2024

Thank you. This is extremely helpful.

I'm working with a list of 177,816 accession.versions (attached), which was the number of reference transcripts in the GRCh38 release on the NCBI Human Genome Resources page when I downloaded it. NCBI Datasets worked fine for a small test file of ten accession.versions, but when I let the complete file run overnight, it never finished. Was I just not patient enough? I would really like this to work.

Thanks for the scripts. I'm currently running the second. When I tested it, it worked fine but slowly: it seemed like I was only getting about one result per second, meaning my full list will take over two days to complete. Is there any way to speed this up?

accNosRandom.txt

kharo commented on July 21, 2024

Using esummary with text output may be faster:

epost -db nuccore -id NM_001318896.2 | elink -target gene | esummary -format text

The DocSum is huge, so this is going to be slow. If you need to preserve 1:1 linkage between input and output, though, I believe you unfortunately have to send each request one at a time.

vkkodali commented on July 21, 2024

"Is there any way to speed this up?"

Not with EntrezDirect. As I mentioned earlier, NCBI Datasets is a good choice for this. For example, I was able to download the data for the entire list in 6 minutes using the following command:

$ datasets summary gene accession --inputfile accNosRandom.txt --as-json-lines > gene_summary.jsonl

mapauley commented on July 21, 2024

Thanks for the information.

I downloaded the datasets and dataformat command-line tools (Linux AMD64). As a test, I ran
datasets summary gene accession --inputfile accNosRandom.short.txt --as-json-lines > accNosRandom.short.jsonl
where accNosRandom.short.txt (a list of ten accession.version numbers) is attached, and got accNosRandom.short.jsonl as a result (also attached, although I changed the extension to .txt so I could upload it). I then ran
dataformat tsv gene --inputfile ./accNosRandom.short.jsonl --fields transcript-accession,symbol
but got an empty result (see image). What am I doing wrong?

accNosRandom.short.txt
accNosRandom.short.jsonl.txt

vkkodali commented on July 21, 2024

You are not doing anything wrong. Currently dataformat does not support the jsonl files generated by datasets summary gene ... commands. This feature is in the pipeline and will be added in the near future.

As a workaround, you can "download" the package without any sequence data and use dataformat as shown below:

$ datasets download gene accession --inputfile accNosRandom.short.txt --exclude-gene --exclude-protein --exclude-rna 
Downloading: ncbi_dataset.zip    57.6kB done
$ dataformat tsv gene --package ncbi_dataset.zip --fields gene-id,symbol,transcript-accession | head -n3
NCBI GeneID     Symbol  Transcript Accession
10717   AP4B1   XR_007066904.1
10717   AP4B1   XM_017000090.2

mapauley commented on July 21, 2024

Again, thanks.

Unfortunately, this doesn't quite work because the result contains extraneous information. I need the official symbol for each accession number in my file, in the same order as the file. The result instead appears to contain every transcript record for each gene that has an accession.version in my list. For example, XM_017000093.3 is the first accession number, which is a transcript for gene AP4B1. However, in the result provided by dataformat, I get many accession numbers for that gene, including the one I supplied.

vkkodali commented on July 21, 2024

Ah, the details! Yes, datasets by default returns all transcripts for a given gene, not just the ones you asked for. An additional Unix join may be needed to filter the output of dataformat.
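A minimal sketch of that filtering step, assuming the dataformat output was saved as a three-column TSV (gene ID, symbol, transcript accession). The file names gene_info.tsv and accs.txt and the sample rows are illustrative stand-ins, not files from this thread. Note that join needs both inputs sorted on the join field, so the result comes out in sorted order rather than in the order of the input list.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # make sort and join agree on collation

workdir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical stand-in for the dataformat TSV (gene-id, symbol, transcript-accession).
printf 'NCBI GeneID\tSymbol\tTranscript Accession\n'  > "$workdir/gene_info.tsv"
printf '10717\tAP4B1\tXR_007066904.1\n'              >> "$workdir/gene_info.tsv"
printf '10717\tAP4B1\tXM_017000090.2\n'              >> "$workdir/gene_info.tsv"
printf '10717\tAP4B1\tXM_017000093.3\n'              >> "$workdir/gene_info.tsv"

# Hypothetical list of requested accessions.
printf 'XM_017000093.3\n' > "$workdir/accs.txt"

# Reduce the TSV to "accession<TAB>symbol", drop the header, sort on the accession.
tail -n +2 "$workdir/gene_info.tsv" \
    | awk -F'\t' -v OFS='\t' '{ print $3, $2 }' \
    | sort -t "$tab" -k1,1 > "$workdir/acc2symbol.sorted.tsv"

sort "$workdir/accs.txt" > "$workdir/accs.sorted.txt"

# Keep only the rows whose accession appears in the requested list.
filtered=$(join -t "$tab" "$workdir/accs.sorted.txt" "$workdir/acc2symbol.sorted.tsv")
printf '%s\n' "$filtered"

rm -rf "$workdir"
```

With the sample data, only the XM_017000093.3 row survives; the other AP4B1 transcripts are dropped.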

mapauley commented on July 21, 2024

Thanks again for all your help.

join requires that the two files to be joined are sorted on the join field. I wanted to preserve the order, so I used grep:

while read -r acc ; do
        grep -m 1 -F -w -- "$acc" accNosRandom.info.tab
done < accNosRandom.txt

where accNosRandom.txt is my list of accession numbers and accNosRandom.info.tab is the dataformat output for the package.
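Running grep once per accession rescans the whole table for every input line. A possible single-pass alternative that still preserves the input order is an awk lookup table. This is only a sketch: it assumes the table has the symbol in column 2 and the transcript accession in column 3, and the file names info.tab and accs.txt and the sample rows are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical table: gene-id, symbol, transcript-accession (tab-separated).
printf '10717\tAP4B1\tXR_007066904.1\n'  > "$workdir/info.tab"
printf '10717\tAP4B1\tXM_017000090.2\n' >> "$workdir/info.tab"
printf '10717\tAP4B1\tXM_017000093.3\n' >> "$workdir/info.tab"

# Requested accessions in the desired output order; the last one is absent from the table.
printf 'XM_017000093.3\nXR_007066904.1\nNR_120632.1\n' > "$workdir/accs.txt"

# First file (NR == FNR): remember the symbol for every accession in the table.
# Second file: print each requested accession with its symbol, or NA if not found.
ordered=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { symbol[$3] = $2; next }
    { print $1, ($1 in symbol ? symbol[$1] : "NA") }
' "$workdir/info.tab" "$workdir/accs.txt")

printf '%s\n' "$ordered"

rm -rf "$workdir"
```

Because the lookup is an exact match on the whole accession.version string, there is no regex or substring matching to worry about, and accessions missing from the table show up explicitly as NA instead of being silently skipped.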

BTW, when I ran datasets, I got the messages below. What do they mean? Note that the last message is different. There are 177,816 accession.version numbers in the list I gave to datasets.
The current accession.version will be returned.
[the same message repeated 87 more times]
The accession (NR_120632.1) you provided is not currently in NCBI Gene or does not have an associated NCBI GeneID.
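One way to see which inputs triggered each kind of message is to save the messages to a file and summarize them. The sketch below assumes the messages were captured in a log file (here a temporary file with a few sample lines copied from the output above); how datasets emits them, for example on stderr, is an assumption and not confirmed in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

log=$(mktemp)

# Sample log lines copied from the messages above (hypothetical capture file).
printf 'The current accession.version will be returned.\n'  > "$log"
printf 'The current accession.version will be returned.\n' >> "$log"
printf 'The accession (NR_120632.1) you provided is not currently in NCBI Gene or does not have an associated NCBI GeneID.\n' >> "$log"

# Count how often each distinct message occurs.
tally=$(sort "$log" | uniq -c)

# Pull out the accessions reported as having no associated GeneID.
missing=$(sed -n 's/^The accession (\([^)]*\)) you provided is not currently in NCBI Gene.*/\1/p' "$log")

printf '%s\n' "$tally"
printf 'Accessions without a GeneID:\n%s\n' "$missing"

rm -f "$log"
```

The extracted list can then be compared against the input file to check whether those accessions are the ones missing from the final table.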
