Comments (9)

vkkodali commented on July 21, 2024

You can use NCBI Datasets for this.
With EntrezDirect, you can first use elink to link each RefSeq accession.version to its gene, then download the gene DocSums and extract the gene names. An example is shown below:

$ cat accs.txt
NR_133910.2
NM_001318896.2
NM_001160354.2
NM_001369393.2
NM_006552.2
$ epost -db nuccore -input accs.txt \
    | elink -target gene \
    | esummary \
    | xtract -pattern DocumentSummary -element Name 
MECP2
FHL2
CXCL17
LY6K
SCGB1D1

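Since the gene DocSum does not contain transcript accessions, the symbols above cannot be matched back to the input (note that they are not returned in input order). A bash while loop can be used to map each acc.ver to its gene symbol: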

$ cat accs.txt \
    | while read -r acc ; do 
        g=$(epost -db nuccore -id "$acc" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name); 
        echo -e "$acc\t$g" ; 
done
NR_133910.2     CXCL17
NM_001318896.2  FHL2
NM_001160354.2  LY6K
NM_001369393.2  MECP2
NM_006552.2     SCGB1D1

from edirectcookbook.

mapauley commented on July 21, 2024

Thank you. This is extremely helpful.

I'm working with a list of 177,816 accession.versions (attached), the number of reference transcripts in the GRCh38 release on the NCBI Human Genome Resources page when I downloaded it. NCBI Datasets worked fine for a small test file of ten accession.versions, but when I let the complete file run overnight, it never finished. Was I just not patient enough? I would really like this to work.

Thanks for the scripts. I'm currently running the second. When I tested it, it worked fine but slowly: it seemed like I was only getting about one result per second, meaning my full list will take over two days to complete. Is there any way to speed this up?
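
As a rough check on that estimate, the arithmetic can be sketched as below. It assumes a steady rate of about one accession per second, which is only the rate observed in the small test, so the real run time could differ.

```shell
# Assumption: roughly one accession is processed per second (rate seen in the test run).
n_accessions=177816
seconds_per_accession=1

# Convert the total number of seconds to hours and days.
estimate=$(LC_ALL=C awk -v n="$n_accessions" -v s="$seconds_per_accession" \
    'BEGIN { h = n * s / 3600; printf "%.1f hours (%.1f days)", h, h / 24 }')
echo "$estimate"
```

At that rate the full list needs about 49 hours, which matches the "over two days" estimate.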

accNosRandom.txt

kharo commented on July 21, 2024

Using esummary may be faster:

epost -db nuccore -id NM_001318896.2 | elink -target gene | esummary -format text

The DocSum is HUGE and is going to be slow. But if you need to preserve 1:1 linkages between input and output, I believe you have to send each request one at a time, unfortunately.

vkkodali commented on July 21, 2024

> Is there any way to speed this up?

Not with EntrezDirect. As I mentioned earlier, NCBI Datasets is a good choice for this. For example, I was able to download the data for the entire list in 6 min using the following command:

$ datasets summary gene accession --inputfile accNosRandom.txt --as-json-lines > gene_summary.jsonl

mapauley commented on July 21, 2024

Thanks for the information.

I downloaded the datasets and dataformat command-line tools (Linux AMD64). As a test, I ran
datasets summary gene accession --inputfile accNosRandom.short.txt --as-json-lines > accNosRandom.short.jsonl
where accNosRandom.short.txt (a list of ten accession.version numbers) is attached, and got accNosRandom.short.jsonl as a result (also attached, although I changed the extension to .txt so I could upload it). I then ran
dataformat tsv gene --inputfile ./accNosRandom.short.jsonl --fields transcript-accession,symbol
but got an empty result (see image). What am I doing wrong?

accNosRandom.short.txt
accNosRandom.short.jsonl.txt
sshot-1

vkkodali commented on July 21, 2024

You are not doing anything wrong. Currently dataformat does not support the jsonl files generated by datasets summary gene ... commands. This feature is in the pipeline and will be added in the near future.

As a workaround, you can "download" the package without any sequence data and use dataformat as shown below:

$ datasets download gene accession --inputfile accNosRandom.short.txt --exclude-gene --exclude-protein --exclude-rna 
Downloading: ncbi_dataset.zip    57.6kB done
$ dataformat tsv gene --package ncbi_dataset.zip --fields gene-id,symbol,transcript-accession | head -n3
NCBI GeneID     Symbol  Transcript Accession
10717   AP4B1   XR_007066904.1
10717   AP4B1   XM_017000090.2

mapauley commented on July 21, 2024

Again, thanks.

Unfortunately, this doesn't appear to work, as there is extraneous information in the result. I need the official symbol for each accession number in the file, in the original order. It looks like the results contain all the records for the genes that have accession.version numbers in my list. For example, XM_017000093.3 is the first accession number, which is a transcript for gene AP4B1. However, in the result provided by dataformat, I get a bunch of accession numbers for that gene, including the one I supplied.

vkkodali commented on July 21, 2024

Ah, the details! Yes, datasets by default returns all transcripts for a given gene, not just the ones you have asked for. An additional Unix join may be needed to filter the output of dataformat.
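
A minimal sketch of such a join is given here. It is not taken from the Datasets documentation; the file names and the three toy rows are made up for illustration, and the column layout (GeneID, Symbol, Transcript Accession) is assumed to match the dataformat output shown earlier. Both inputs are sorted on the accession first, because join expects sorted input.

```shell
export LC_ALL=C                      # make sort and join agree on collation
tab=$(printf '\t')
workdir=$(mktemp -d)

# Toy stand-ins for the accession list and the dataformat TSV (hypothetical file names).
printf 'XM_017000093.3\nXM_017000090.2\n' > "$workdir/accs.txt"
printf 'NCBI GeneID\tSymbol\tTranscript Accession\n10717\tAP4B1\tXR_007066904.1\n10717\tAP4B1\tXM_017000090.2\n10717\tAP4B1\tXM_017000093.3\n' > "$workdir/info.tab"

# Sort the requested accessions.
sort "$workdir/accs.txt" > "$workdir/accs.sorted"

# Drop the header, move the accession to column 1, and sort on it.
tail -n +2 "$workdir/info.tab" \
    | awk -F'\t' -v OFS='\t' '{ print $3, $2 }' \
    | sort -t "$tab" -k1,1 > "$workdir/info.sorted"

# Keep only the rows whose accession is in the requested list.
join -t "$tab" "$workdir/accs.sorted" "$workdir/info.sorted" > "$workdir/joined.tsv"
cat "$workdir/joined.tsv"
```

The result contains only the requested transcripts, but in sorted order rather than in the order of the input list.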

mapauley commented on July 21, 2024

Thanks again for all your help.

join requires that the two files to be joined are sorted on the join field. I wanted to preserve the order, so I used grep:

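cat accNosRandom.txt \
    | while read -r acc ; do
        # -F: treat the accession as a literal string (its "." is not a regex wildcard)
        # -w: match whole words only, so one accession cannot match inside a longer one
        grep -m 1 -F -w -- "$acc" accNosRandom.info.tab ;
done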

where accNosRandom.txt is my list of accession numbers and accNosRandom.info.tab is the dataformat output for the package.
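
For a list this long, one grep per accession rescans the table every time. A possible alternative, sketched here with awk, loads the table once and then looks up each accession in input order. The file names and toy rows are hypothetical, and the column layout (symbol in column 2, transcript accession in column 3) is assumed from the dataformat output shown earlier. Accessions missing from the table are reported as NA, which is a choice made for this sketch.

```shell
workdir=$(mktemp -d)

# Toy stand-ins for the accession list and the dataformat TSV (hypothetical file names).
printf 'XM_017000093.3\nNR_120632.1\nXM_017000090.2\n' > "$workdir/accs.txt"
printf 'NCBI GeneID\tSymbol\tTranscript Accession\n10717\tAP4B1\tXR_007066904.1\n10717\tAP4B1\tXM_017000090.2\n10717\tAP4B1\tXM_017000093.3\n' > "$workdir/info.tab"

# First file (NR == FNR): remember the symbol for every transcript accession, skipping the header.
# Second file: print each requested accession with its symbol, keeping the input order.
awk -F'\t' -v OFS='\t' '
    NR == FNR { if (FNR > 1) symbol[$3] = $2; next }
    { print $1, ($1 in symbol ? symbol[$1] : "NA") }
' "$workdir/info.tab" "$workdir/accs.txt" > "$workdir/acc2symbol.tsv"
cat "$workdir/acc2symbol.tsv"
```

Because the lookup is an exact match on the whole accession string, it also avoids partial matches between similar accessions.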

BTW, when I ran datasets, I got the messages below. What do they mean? Note that the last message is different. There are 177,816 accession.version numbers in the list I gave to datasets.
The current accession.version will be returned.
[message repeated; 88 consecutive copies in the original post]
The accession (NR_120632.1) you provided is not currently in NCBI Gene or does not have an associated NCBI GeneID.
