edirectcookbook's Introduction

EDirect_EUtils_API_Cookbook

Just copy and paste commands off the page. Modify the search strings to work for you!

If there are things you want to be able to do with EDirect, but can't figure out how, you can ask the community for help by creating an Issue. See below, under "How to contribute," for more information.

To install EDirect, follow the instructions in "Entrez Direct: E-utilities on the Unix Command Line"

PLEASE UPDATE TO THE LATEST VERSION of E-Direct when possible to avoid a bug in older versions associated with the new NCBI API rate limit policy and API keys

How to contribute

You can contribute to this page through GitHub. (If you are not already viewing the GitHub version of this page, please click the "View on GitHub" button at the top of the page.) Using GitHub, you can create Issues or Pull Requests to contribute to the cookbook.

Create an Issue to:

  • Request an EDirect script to accomplish a task, citing specific use cases
  • Present a non-working EDirect script and ask for a fix
  • Identify non-working scripts listed below

Create a Pull Request to:

  • Add a working EDirect script to the list below
  • Modify or optimize an EDirect script listed below
  • Update the "Confirmed by:" date/version of a listed EDirect script with confirmation that it is still valid

Best Practices for EDirect:

  • Please keep searches to fewer than 50,000 expected hits (larger result sets simply won’t work)
  • Please do not run from multiple processors on a compute farm
  • Update to latest version

For more information and documentation on EDirect, please see:

All items below come with no explicit or implicit warranty.

All code is as-is and produced for the bioinformatics community, from the bioinformatics community.

EDirect Scripts

Get all proteins from a nucleotide interval in a genome

Description (optional):
Written by: Peter Cooper
Confirmed by: Ben Busby
Databases: nuccore

efetch -db nuccore -id NZ_AZKP01000022.1 -seq_start 149413 -seq_stop 154038 -format gbc | xtract -insd CDS INSDInterval_from INSDInterval_to protein_id product

Get child taxids for a node in NCBI taxonomy

Description (optional): Note: Options for parsing nodes.dmp from NCBI Taxonomy are cited in issue #25, intentionally left open.
Written by: Scott McGinnis (11/17/2017)
Confirmed by:
Databases: Taxonomy

esearch -db taxonomy -query "vertebrata[orgn]" | efetch -db taxonomy -format docsum | xtract -pattern DocumentSummary -if Rank -equals family -element Id,Division,ScientificName,CommonName | more

Get all SRA runs for a BioProject based on an SRA Run ID

Description: Given an SRA Run ID (e.g. SRR532256) that is a member of a BioProject that has additional runs, retrieve all the other run IDs. This is a variant of the BioProject call below.
Written by: Rob Edwards (1/11/2018)
Confirmed by:
Databases: SRA, BioProject

esearch -db sra -query "SRR532256" |  efetch -format docsum | xtract -pattern Runs -ACC @acc  -element "&ACC"

Get all SRA runs for a given BioProject

Description (optional):
Written by: Bob Sanders (3/22/2017)
Confirmed by:
Databases: SRA, BioProject

esearch -db bioproject -query "PRJNA356464" | elink -target sra | efetch -format docsum | \
xtract -pattern DocumentSummary -ACC @acc -block DocumentSummary -element "&ACC"

Get latitude and longitude for SRA Datasets (e.g. outbreaks and metagenomes)

Description (optional):
Written by: BB, Mike D, Rob Edwards (4/12/2017)
Confirmed by:
Databases: SRA, BioSample

for i in $(cat sra_ids.txt); do ll=$(esearch -db sra -query $i | \
elink -target biosample | efetch -format docsum | \
xtract -pattern DocumentSummary -block Attribute -if Attribute@attribute_name -equals lat_lon -element Attribute); \
echo -e "$i\t$ll"; done

Get run sizes (in bp) for SRA Datasets

Description (optional): This retrieves the SRR ID and the size in bp of each run from a file (ids.txt) of SRR IDs. You can also change bases to size_MB to get the size of the dataset in MB. Question: Does the size in MB include the sequence identifiers (i.e. the size of the file) or just the sequences?
Written by: Rob Edwards (7/6/2017)
Confirmed by:
Databases: SRA

epost -db sra -input ids.txt -format acc | esummary -format runinfo -mode xml | xtract -pattern Row -element Run,bases

Gene Aliases

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: gene

esearch -db gene -query "Liver cancer AND Homo sapiens" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Name OtherAliases OtherDesignations

Genomic sequence fastas from RefSeq assembly for specified taxonomic designation

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by: Peter Cooper (NCBI) and Wayne Matten (NCBI) (12/29/2016, v6.00)
Databases: assembly

wget `esearch -db assembly -query "Leptospira alstonii[ORGN] AND latest[SB]" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element FtpPath_RefSeq | \
awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}'`
(For larger sets of data the above may fail as wget may not accept a very large number of arguments.
The command below should work for all.)

esearch -db assembly -query "Leptospira alstonii[ORGN] AND latest[SB]" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element FtpPath_RefSeq | \
awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}' | \
xargs wget
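
The awk step in both commands turns each FtpPath_RefSeq directory into the URL of its genomic FASTA by repeating the last path component as the file-name prefix. Here is a minimal offline sketch of just that step. The directory path is made up for illustration (GCF_000000000.1_EXAMPLE is hypothetical, not a real assembly).

```shell
# Hypothetical FtpPath_RefSeq value; real ones come from the assembly docsum.
ftp_path="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/000/000/GCF_000000000.1_EXAMPLE"

# Split on "/", then append "/<last component>_genomic.fna.gz" to the full path.
fasta_url=$(printf '%s\n' "$ftp_path" | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')

echo "$fasta_url"
```

Running this on a single path before launching a large download is a cheap way to confirm the URLs look right.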

Get organellar contigs from genbank

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: nuccore

esearch -db nuccore -query "LKAM01" | efetch -format fasta

Get protein sequences from nucleotide accessions

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: nuccore, protein

cat accs_file | epost -db nuccore -format acc | \
elink -target protein | efetch -format fasta

Complete taxonomy (KPCOFG) for taxids

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: taxonomy

efetch -db taxonomy -id 9606,1234,81726 -format xml | \
xtract -pattern Taxon -tab "," -first TaxId ScientificName \
-group Taxon -KING "(-)" -PHYL "(-)" -CLSS "(-)" -ORDR "(-)" -FMLY "(-)" -GNUS "(-)" \
-block "*/Taxon" -match "Rank:kingdom" -KING ScientificName \
-block "*/Taxon" -match "Rank:phylum" -PHYL ScientificName \
-block "*/Taxon" -match "Rank:class" -CLSS ScientificName \
-block "*/Taxon" -match "Rank:order" -ORDR ScientificName \
-block "*/Taxon" -match "Rank:family" -FMLY ScientificName \
-block "*/Taxon" -match "Rank:genus" -GNUS ScientificName \
-group Taxon -tab "," -element "&KING" "&PHYL" "&CLSS" "&ORDR" "&FMLY" "&GNUS"

Obtain UniProt IDs from gene symbols

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: gene, protein

esearch -db gene -query "tp53[preferred symbol] AND human[organism]" | \
elink -target protein | \
esummary | \
xtract -pattern DocumentSummary -element Caption SourceDb | \
grep -E '^[OPQ][0-9][A-Z0-9]{3}[0-9]|^[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'
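
The final grep keeps only rows whose first column looks like a UniProt accession. Below is a quick offline check of that filter. The second column of each sample row is a placeholder, not real SourceDb output. Note that with grep -E, alternation is written as a bare |.

```shell
# ERE for the two UniProt accession shapes, anchored at the start of the line.
uniprot_re='^[OPQ][0-9][A-Z0-9]{3}[0-9]|^[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'

# Sample rows: two UniProt-style accessions and one RefSeq-style accession.
# The second column is a placeholder, not actual SourceDb output.
matches=$(printf 'P04637\tsrc1\nNP_000537\tsrc2\nA0A024R161\tsrc3\n' | grep -E "$uniprot_re")

echo "$matches"
```

The RefSeq-style row is dropped because its second character is a letter, while both accession shapes require a digit there.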

Retrieve Taxon IDs from list of genome accession numbers

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: nuccore

cat genome_accession.txt | \
epost -db nuccore -format acc | \
esummary | \
xtract -pattern DocumentSummary -element AccessionVersion TaxId

Convert article DOI to PMID

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by: Mike Davidson (NLM) (12/16/2016, v5.80)
Databases: pubmed

esearch -db pubmed -query "10.1111/j.1468-3083.2012.04708.x" | \
esummary | \
xtract -pattern DocumentSummary -block ArticleId -sep "\t" -tab "\n" -element IdType,Value | \
grep -E '^pubmed|doi'

Access organism specific meta-data from NCBI genome database

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: genome, bioproject

esearch -db genome -query "22954[uid]" | \
elink -target bioproject | \
efetch -format xml | \
xtract -pattern DocumentSummary -element Salinity OxygenReq OptimumTemperature TemperatureRange Habitat

Get the status of records from PubMed search

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by: Mike Davidson (NLM) (12/16/2016, v5.80)
Databases: pubmed

esearch -db pubmed -query "pde3a AND 2016[dp]" | \
esummary | \
xtract -pattern DocumentSummary -element Id RecordStatus

Conduct a PubMed search and retrieve the results as a list of PMIDs

Description (optional):
Written by: Mike Davidson (2/22/2017)
Confirmed by: Mike Davidson (NLM) (2/22/2017, v6.30)
Databases: pubmed

esearch -db pubmed -query "seasonal affective disorder" | efetch -format uid

Sort the hits by sequence length in nucleotide database

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: nuccore

esearch -db nuccore -query "bacillus[orgn] AND biomol_rRNA[prop] AND 1500:1560[slen]" | \
esummary | \
xtract -pattern DocumentSummary -element Slen Extra | \
sort -rnk 1

Getting meta data from assembly

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: assembly

esearch -db assembly -query "mammals[orgn] AND latest[filter]" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Organism,SpeciesName,BioSampleAccn,LastMajorReleaseAccession \
-block Stat -if "@category" -equals chromosome_count -element Stat | \
grep -Pv "\t0$"

Fetch HSPs from a BLAST hit in FASTA

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: nuccore

blastn -db nr -query in.fna -remote -outfmt "6 sacc sstart send" | \
xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'
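
The xargs stage works because sh -c assigns the arguments that follow the script string to $0, $1 and $2, so each group of three BLAST fields becomes one efetch call. Here is a small offline sketch that prints the command instead of running it. The hit line is invented (ACC0001.1 is a hypothetical accession).

```shell
# One invented "sacc sstart send" row, shaped like blastn tabular output.
hit_line=$(printf 'ACC0001.1\t120\t480')

# xargs hands three whitespace-separated fields at a time to sh -c;
# the first lands in $0, the next two in $1 and $2.
cmd=$(printf '%s\n' "$hit_line" |
  xargs -n 3 sh -c 'echo "efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta"')

echo "$cmd"
```

Swapping echo back out for the real efetch call gives the recipe above.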

Get all Gene Ontology IDs for a given protein accession

Description (optional):
Written by: NCBI Folks (12/14/2016)
Confirmed by:
Databases: protein, biosystems

epost -db protein -id BAD92651.1 -format acc | \
elink -target biosystems | \
efetch -format docsum | \
xtract -pattern externalid -element externalid | \
awk '{if ($0 ~ /GO/) print $0}'

Get the ten most frequently-occurring authors for a set of articles

Description (optional): Searches PubMed for the string "traumatic brain injury athletes", restricts results to those published in 2015 and 2016, retrieves the full XML records for each of the search results, extracts the last name and initials of every author on every record, sorts the authors by frequency of occurrence in the results set, and presents the top ten most frequently-occurring authors, along with the number of times that author appeared.
Written by: Mike Davidson (NLM) (12/15/2016)
Confirmed by: Mike Davidson (NLM) (12/16/2016)
Databases: pubmed

esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2015 -maxdate 2016 | \
efetch -format xml | \
xtract -pattern Author -sep " " -element LastName,Initials | \
sort-uniq-count-rank | \
head -n 10
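
sort-uniq-count-rank is a helper that ships with EDirect. For readers who want to see what the counting step amounts to, here is a rough stand-in built from standard Unix tools and run on invented author names. It is an approximation only: the real helper may differ in case handling and tie-breaking, and count_rank is a name chosen for this sketch.

```shell
count_rank() {
  # Count identical lines, order by descending count, then emit "count<TAB>value".
  sort | uniq -c | sort -k1,1nr |
    awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }'
}

# Invented author list: Smith J appears three times, Lee K twice, Garcia M once.
top=$(printf 'Smith J\nLee K\nSmith J\nGarcia M\nLee K\nSmith J\n' | count_rank | head -n 2)

echo "$top"
```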

Get the ten funding agencies who are most active in funding articles on a particular topic

Description (optional): Searches PubMed for the string "diabetes AND pregnancy", restricts results to those published in 2014 through 2016, retrieves the full XML records for each of the search results, extracts the funding agencies for every grant on every record, sorts the agencies by frequency of occurrence in the results set, and presents the top ten most frequently-occurring agencies, along with the number of times that agency appeared.
Written by: Mike Davidson (2/17/2017)
Confirmed by: Mike Davidson (NLM) (v6.30, 2/17/2017)
Databases: pubmed

esearch -db pubmed -query "diabetes AND pregnancy" -datetype PDAT -mindate 2014 -maxdate 2016 | \
efetch -format xml | \
xtract -pattern Grant -element Agency | \
sort-uniq-count-rank | \
head -n 10

Look up the publication date for thousands of PMIDs (option one)

Description (optional): Takes a file which contains a list of PMIDs (table_of_pubmed_ids) and uses cat to access the contents of the file, epost to post the PMIDs to the history server, efetch to retrieve the records and xtract to extract PMID and Publication Date.
Written by: NCBI Folks (12/15/2016)
Confirmed by: Mike Davidson (NLM) (v6.30, 2/17/2017)
Databases: pubmed

cat table_of_pubmed_ids | \
epost -db pubmed | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block PubDate -sep " " -element Year,Month MedlineDate

Look up the publication date for thousands of PMIDs (option two)

Description (optional): Takes a file which contains a list of PMIDs (table_of_pubmed_ids) and uses epost -input to read the contents of the file and post the PMIDs to the history server, efetch to retrieve the records and xtract to extract PMID and Publication Date.
Written by: Mike Davidson (2/17/2017)
Confirmed by: Mike Davidson (NLM) (v6.30, 2/17/2017)
Databases: pubmed

epost -input table_of_pubmed_ids -db pubmed | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block PubDate -sep " " -element Year,Month MedlineDate

Find the first author for a set of PubMed records

Description (optional): Outputs the PMID and first author's last name and initials for one or more PubMed records.
Written by: Mike Davidson (2/17/2017)
Confirmed by: Mike Davidson (NLM) (v6.30, 2/17/2017)
Databases: pubmed

efetch -db pubmed -id 16940437 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -position first -sep " " -element LastName,Initials

Find the first author and any other authors who contributed equally for a set of PubMed records

Description (optional): Outputs the PMID and first author's last name and initials for one or more PubMed records. If the record indicates equal contributors to the first author, the last name and initials for all equal contributors will also be output, separated by commas.
Written by: Mike Davidson (10/27/2017)
Confirmed by: Mike Davidson (NLM) (v7.40, 10/27/2017)
Databases: pubmed

efetch -db pubmed -id 22358458,26877147 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -position first -sep " " -tab ", " -element LastName,Initials -EQUAL Author@EqualContrib \
-block Author -if "+" -is-not 1 \
-and Author@EqualContrib -equals Y \
-and "&EQUAL" -equals Y \
-sep " " -tab ", " -element LastName,Initials

Download GEO Data from a BioProject Accession

Description (optional):
Written by: NCBI Folks (12/16/2016)
Confirmed by:
Databases: gds

esearch -db gds -query "PRJNA313294[ACCN]" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element FTPLink

Extract all MeSH Headings from a given PMID

Description (optional): Retrieves the PMID of a PubMed record, followed by a pipe-delimited list of MeSH Descriptors for a PMID.
Written by: Mike Davidson (10/02/2017)
Confirmed by: Mike Davidson (NLM) (v7.30, 10/02/2017)
Databases: pubmed

efetch -db pubmed -id 24102982 -format xml | \
xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID \
-block MeshHeading -tab "|" -element DescriptorName

Extract all MeSH Headings and Subheadings from a given PMID

Description (optional): Retrieves the PMID of a PubMed record, followed by a pipe-delimited list of MeSH Descriptors and Qualifiers for a PMID. Each Descriptor is followed by any attached qualifiers, separated by "/".
Written by: Mike Davidson (10/02/2017)
Confirmed by: Mike Davidson (NLM) (v7.30, 10/02/2017)
Databases: pubmed

efetch -db pubmed -id 24102982 -format xml | \
xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID \
-block MeshHeading -tab "|" -sep "/" -element DescriptorName,QualifierName

Search for articles by authors affiliated with a specific institution by matching two partial affiliation strings.

Description (optional): Searching PubMed for two affiliation strings ANDed together (e.g. "translational medicine[AD] AND thomas jefferson[AD]") will retrieve all records that have both strings listed somewhere in the record's Affiliation data, but does not require both strings be listed on the same author's affiliation. To generate a list of PMIDs where both strings are present in the same affiliation element, use the following script.
Written by: Mike Davidson (4/2/2018)
Confirmed by: Mike Davidson (NLM) (v8.10, 4/2/2018)
Databases: pubmed

esearch -db pubmed -query "translational medicine[ad] AND thomas jefferson[ad]" | \
efetch -format xml | \
xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
-block Affiliation -if Affiliation -contains "translational medicine" -and Affiliation -contains "thomas jefferson" \
-tab "\n" -element "&PMID" | \
sort -n | uniq

Search for PMC articles citing a given PubMed article; retrieve title, source, ID

Description: Retrieve information about all PMC articles (which have free full text available) that cite a given PubMed article.
Written by: Lukas Wagner (08/16/2018)
Databases: pubmed, pmc

esearch -db pubmed -query 23618408 | elink -name pubmed_pmc_refs -target pmc | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Title -element Source -block ArticleId -if "IdType" -equals pmcid -element Value

edirectcookbook's People

Contributors

dcgenomics, kharo, linsalrob, lwagnerdc, mikeadavidson, petercooper2

edirectcookbook's Issues

geoprofiles data format

Hello,
We are trying to get data from the geoprofiles db, but we cannot find a resource listing the formats that are available. -format docsum does seem to work, but ideally we want access to the expression data available from the https website. Is there a way to get the info via XML?

Download individual bacterial genomes

Hi,
I started to use the following to download a few genomes, but it doesn't work as expected.
Although I provide the accession for a specific genome, it downloads all the genomes from that species instead, sometimes ending up with absurdly big files with tens or hundreds of genomes mixed together.

for genome in AL009126.3
do
esearch -db genome -query ${genome} | elink -target assembly | elink -target nuccore | efetch -format fasta > ${genome}.fasta
done

Thanks
Xabi

PS: great initiative

The importance of adding < /dev/null with while loops

I've been tricked by this several times, and thought it would be a good addition to the Cookbook. When a while loop runs esearch without adding < /dev/null after it, only the first line will be processed. Here is the (slightly modified) example from the docs:

  while read org
  do
    esearch -db taxonomy -query "$org [LNGE] AND family [RANK]" < /dev/null |
    efetch -format docsum |
    xtract -pattern DocumentSummary -lbl "$org" \
      -element ScientificName Division
  done <  organisms


(The "< /dev/null" input redirection construct prevents esearch from "draining" the remaining lines from stdin.)

I can make a pull request if this repo is still being supported.
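
The effect can be reproduced offline. In the sketch below, drain_stdin is a hypothetical stand-in for any command that reads everything left on standard input, which is what the note above describes esearch doing. The loop is run once without and once with the redirect.

```shell
# Stand-in for a command that consumes all remaining stdin (hypothetical helper).
drain_stdin() { cat > /dev/null; }

# Without the redirect: the stand-in swallows the lines the loop has not read yet.
without=$(printf 'a\nb\nc\n' |
  while read -r org; do drain_stdin; echo "$org"; done | wc -l | tr -d ' ')

# With "< /dev/null": the loop's input is left alone and every line is processed.
with=$(printf 'a\nb\nc\n' |
  while read -r org; do drain_stdin < /dev/null; echo "$org"; done | wc -l | tr -d ' ')

echo "without redirect: $without line(s); with redirect: $with line(s)"
```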

Citation List

Is there a script that can list or rank the articles for a given keyword from most cited to least cited?

Any help would be much appreciated!

Retrieve all metadata from edirect that is accessible from SRArunTable

When retrieving runinfo from edirect as follows:

PRJN=PRJNA784998
~/edirect/esearch -db sra -query $PRJN | ~/edirect/efetch -format runinfo

the metadata fields differ from those obtained by downloading the metadata from the SRA Run Selector.

This is the bioproject here:
https://www.ncbi.nlm.nih.gov/Traces/study/?page=2&query_key=3&WebEnv=MCID_633332599c0211437bc6a566&o=acc_s%3Aa

Of note: BIOMATERIAL_PROVIDER is not part of the efetch runinfo format. Is there a way to get all metadata through efetch? I have tried accessing that field specifically, but can't find it.

Are those two methods of accessing metadata considered equivalent? Or is efetch -format runinfo accessing different metadata?

Thanks,
Karina

EDirect installation requires LWP::Protocol::https

Installation of edirect requires the module LWP::Protocol::https and this needs to be added to the installer script (I'm not sure if this is a cookbook issue per se, but you can't use the cookbook without it!)

curl: (60) SSL certificate problem: unable to get local issuer certificate

I installed EDirect correctly and added the edirect directory path to $PATH.
When I test it (e.g. esearch -db pubmed -query "Babalobi OO[au] AND 2008[pdat]"), I get an error like the one below:

curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.

Get publication counts per gene per year

Hi

I would like to write an EDirect query to extract the number of publications per gene per year. The group I am interested in is Viridiplantae. So for all species under this group, given a date range, I would like to get the publication count for each gene in that species. The final output that I am looking for is something like

YEAR Genus_Species Gene_Symbol Publication_Count
1970 Arabidopsis thaliana PHYA 3
1971 Arabidopsis thaliana PHYA 2

I can get [PDAT] to work for -db pubmed but not [GENE] or [ORGN]. Need Help. Thanks

Retrieve bioproject IDs and SRA IDs from PMIDs?

Hey all,

I have a list of PMIDs for which I want to extract the BioProject ID and SRA IDs. I was unable to find anything about this in the documentation. Any suggestions?

Looking forward to your replies!

Grab Lat_Lon for SRA accessions.

Almost there, just need to phrase element:

esearch -db sra -query SRR5381359 | elink -target biosample | efetch -format xml | xtract -pattern DocumentSummary -element

using xtract to get SRA ids and BioProject ids

I have a sample xml file (shown below) from BioSample and my objective is to extract the SRA and BioProject ids.

</BioSample><BioSample submission_date="2009-11-25T14:28:04.407" access="public" last_update="2015-01-29T02:30:16.280" publication_date="2009-11-25T14:28:09.680" id="5124" accession="SAMN00005124">
  <Ids>
    <Id is_primary="1" db="BioSample">SAMN00005124</Id>
    <Id db="SRA">SRS007221</Id>
    <Id db="GEO">GSM451804</Id>
  </Ids>
  <Description>
    <Title>AdultMale_combined_RNAseq_1, 2</Title>
    <Organism taxonomy_id="7227" taxonomy_name="Drosophila melanogaster">
      <OrganismName>Drosophila melanogaster</OrganismName>
    </Organism>
  </Description>
  <Owner>
    <Name>Institute for Genomics and Systems Biology, University of Chicago</Name>
    <Contacts>
      <Contact email="[email protected]">
        <Name>
          <First>Kevin</First>
          <Last>White</Last>
        </Name>
      </Contact>
    </Contacts>
  </Owner>
  <Models>
    <Model>Generic</Model>
  </Models>
  <Package display_name="Generic">Generic.1.0</Package>
  <Attributes>
    <Attribute attribute_name="source_name" display_name="source name" harmonized_name="source_name">AdultMale</Attribute>
    <Attribute attribute_name="development stage" display_name="development stage" harmonized_name="dev_stage">AdultMale</Attribute>
  </Attributes>
  <Links>
    <Link label="GEO Sample GSM451804" type="url">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM451804</Link>
    <Link label="PRJNA116485" type="entrez" target="bioproject">116485</Link>
    <Link label="PRJNA168994" type="entrez" target="bioproject">168994</Link>
    <Link label="PRJNA63467" type="entrez" target="bioproject">63467</Link>
  </Links>

Initially the command I used generated this output:

SAMN00014503    SRS074435   128909,129179
SAMN00014655    SRS074533   127185
SAMN00014812    129305,127109

The issue is that the columns shift to the left when the SRA id isn't present in the XML file.

So I got in touch with the NCBI staff, and together we came up with this answer:

xtract -input out001.xml -pattern BioSample -SRA "(-)" \
-block Id -if Id@db -equals "SRA" -SRA Id \
-block Ids -first Id -element "&SRA" \
-block Link -if Link@target -equals "bioproject" -tab "," -element Link

Then it generated the desired output:

SAMN00014503    SRS074435   128909,129179
SAMN00014655    SRS074533   127185
SAMN00014812    -   129305,127109
SAMN00031920    -
SAMN00032070    -
SAMN00032222    -
SAMN00032375    -

Extracting Pubmed ID (PMID) for a list of refseq accession number

Hi
I have a list of ~500 RefSeq accession numbers (all bacterial genomes); ~100 of them are completed genomes (CP009681 etc.) while the others are assemblies (LALG00000000.1 etc.). Most of them are not published and therefore do not have a PMID associated with them. My goal is to identify the accession numbers which are published and extract the PMID associated with each of them. In other words, I want to extract the PMID for each accession number if available.
The accession numbers are in a file, accessions.txt, one accession per line. Here is what I have done so far:

cat accessions.txt | epost -db nuccore -format acc | elink -target pubmed | efetch -format xml | xtract -pattern PubmedArticleSet -element PMID

27152133 26048971 25767217 25250641 24970829 24962815 24723721 24051324 23770143

The above output is not in the correct format. I need the output in this format:

CP009681 27152133
CP010295 25767217
CP010296 25767217
CP007176 25250641
LALG01000000 26048971
LALH01000000 26048971
LALI01000000 26048971

Any help will be highly appreciated

Tauqeer

HTML code in Pubmed "title" field

Using the following pubmed id: 28634180

You can see here that the title contains an italic word (MUTYH).

Now let's extract the XML of this paper via the API:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=pubmed&id=28634180
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&query_key=1&WebEnv=<INSERT_WEBENV_HERE>&retype=xml

Here is the "title" entry:
<Item Name="Title" Type="String">A &lt;i&gt;MUTYH&lt;/i&gt; germline mutation is associated with small intestinal neuroendocrine tumors.</Item>

We can see that the italic tag is present in the title in its escaped form.

I'm a little concerned about a possible security hole if it's possible to submit a title with HTML code embedded in it. I know nothing about the process of uploading a paper to PubMed (like whether the user input is sanitized or not, who sends papers to PubMed and how). I just noticed the tags when displaying a summary of the paper above on my website. Since the NCBI website displays the title with the italic code actually interpreted, I'm wondering what else can be passed to the title (or other) field.

Software Version

There is a suggestion to "PLEASE UPDATE TO THE LATEST VERSION of E-Direct when possible to avoid a bug in older versions associated with the new NCBI API rate limit policy and API keys".

How does one find the current version of the edirect tools that is installed, and the newest version of the tools available online?
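
For the installed version, one thing to try is the -version flag on the tools themselves. That flag is an assumption based on EDirect releases I have seen, so check the help output of your own release if it is rejected. A guarded sketch (edirect_version is a name made up here):

```shell
# Print the installed EDirect version, or a note if the tools are not on PATH.
edirect_version() {
  if command -v esearch >/dev/null 2>&1; then
    esearch -version   # assumed flag; consult your release's help if it is not accepted
  else
    echo "esearch not found on PATH"
  fi
}

edirect_version
```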

Retrieve Bioproject IDs and related SRAs within given timerange

Hi,

I would like to retrieve all NCBI bioproject IDs and the related SRAs for bioprojects that were created within a certain timeframe.
I saw the code that retrieves the SRAs for a given bioproject which works, however, I don't manage to limit my search to a specific timeframe.

What I tried to only retrieve bioproject IDs within a given timeframe:

  1. esearch -db bioprojects -query "Mycobacterium tuberculosis" retmax = 10000 | efilter -mindate 2020/02/21 -maxdate 2020/05/01 -datetype crdt
    --> no error message, but the efilter is ignored
  2. esearch -db bioproject -query "Mycobacterium tuberculosis" -mindate "2020/02/21" -maxdate "2020/05/01" -datetype "CRDT"

Does someone know what I am doing wrong/how I am able to retrieve only results from a specific time range?

How to use xtract to get <name value> pairs of all the attributes?

I am trying to extract all the attributes in each BioSample record. Since each BioSample record has a different set of attributes (for example, some have "source_name" and "development stage"; some have “cell line”, “cell type”, “tissue”, etc.), I want to get the <name value> pairs of all the attributes each BioSample record contains.

When I use the following command:
efetch -db biosample -id SAMN04383980 -format xml | xtract -pattern BioSampleSet -division BioSample -group Attributes -element Attribute
It only outputs the values of all the attributes in BioSample SAMN04383980 as the following:

H7 hESCs       H7 derived

In this output, I do not know which attribute name each of these values belongs to.

I also tried the following command:
efetch -db biosample -id SAMN04383980 -format xml | xtract -pattern BioSampleSet -division BioSample -group Attributes -element Attribute@attribute_name -element Attribute
I got the following output (without <name value> correspondence):

source_name     cell line       H7 hESCs        H7 derived

So I was wondering if there is a way by which I can get <name value> pairs of all the attributes in a BioSample record, like the following:

source_name    cell line
H7 hESCs       H7 derived

I would appreciate your advice.

Thank you very much!

Formatting and missing value issue in xtract metadata from bioproject

I am trying to extract metadata from a list of BioProject IDs.
The script I have works fine, but it does not deal well with missing data. I read in the xtract help that I could use the -def flag to set a default placeholder for missing fields.
So I tried to add the -def "NA" flag, but it does not change anything in the output.

This is my code:

for i in $(cat $A); 
	do ll=$(esearch -db bioproject -query $i | 
	efetch -format xml |
	xtract -pattern DocumentSummary -element \
	$G,$S,$M,$O,$OT,$TR,$H,$Sa,$BR,$Tro,$Sco,$Org,$E,$Otr,\
	$Phen,$Dis -def "Na" )
	echo -e "$i\t$ll" >> $B;
	done

This is my output:

PRJNA310173
PRJNA253675 eNegative eBacilli eNo eAerobic eMesophilic eHostAssociated Tularemia

This is what I would like to have:

PRJNA310173 Na Na Na Na Na Na Na
PRJNA253675 eNegative eBacilli eNo eAerobic eMesophilic eHostAssociated Tularemia
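
One shell-level workaround, independent of how xtract handles placeholders, is to substitute a default when the whole pipeline returns nothing. Here is a minimal sketch with a stub in place of the esearch | efetch | xtract pipeline (fetch_fields and its canned output are hypothetical). It yields a single placeholder per empty record, not one per column. It may also be worth moving -def ahead of -element, since the recipes in this cookbook place formatting flags such as -sep and -tab before the elements they affect.

```shell
# Stub standing in for the real pipeline; prints nothing for records without metadata.
fetch_fields() {
  case "$1" in
    PRJNA253675) printf 'eNegative\teBacilli\teNo' ;;
    *) : ;;
  esac
}

report=$(for i in PRJNA310173 PRJNA253675; do
  ll=$(fetch_fields "$i")
  # ${ll:-Na} expands to "Na" when ll is empty or unset.
  printf '%s\t%s\n' "$i" "${ll:-Na}"
done)

echo "$report"
```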

The “-if” and “-unless” statements in xtract did not seem to work for me when extracting Attributes in BioSample

I was trying to use the “-if” statement in xtract to print out the values of only the Attributes that have harmonized_name in a BioSample record, but all the Attribute values are printed out even though some of them have no harmonized_name.

I have the following xml of a BioSample record by using efetch:

<?xml version="1.0" ?>
<BioSampleSet><BioSample access="public" publication_date="2019-05-30T00:00:00.000" last_update="2019-05-30T15:01:23.120" submission_date="2019-05-30T14:12:04.357" id="11893672" accession="SAMN11893672">   
<Ids>     
    <Id db="BioSample" is_primary="1">SAMN11893672</Id>     
    <Id db_label="Sample name">RS0002</Id>     
    <Id db="SRA">SRS4847476</Id>   
</Ids>   
<Description>     
    <Title>SK-MEL-28_1</Title>     
    <Organism taxonomy_id="9606" taxonomy_name="Homo sapiens">       
        <OrganismName>Homo sapiens</OrganismName>     
    </Organism>   
</Description>   
<Owner>     
    <Name>Research Center for Molecular Medicine of the Austrian Academy of Sciences</Name>     
    <Contacts>       
        <Contact email="[email protected]">         
            <Name>           
                <First>Vitaly</First>           
                <Last>Sedlyarov</Last>         
            </Name>       
        </Contact>     
    </Contacts>   
</Owner>   
<Models>     
    <Model>Human</Model>   
</Models>   
<Package display_name="Human; version 1.0">Human.1.0</Package>   
<Attributes>     
    <Attribute attribute_name="isolate" harmonized_name="isolate" display_name="isolate">cell line</Attribute>     
    <Attribute attribute_name="age" harmonized_name="age" display_name="age">51</Attribute>     
    <Attribute attribute_name="biomaterial_provider" harmonized_name="biomaterial_provider" display_name="biomaterial provider">ATCC</Attribute>     
    <Attribute attribute_name="sex" harmonized_name="sex" display_name="sex">male</Attribute>     
    <Attribute attribute_name="tissue" harmonized_name="tissue" display_name="tissue">skin</Attribute>     
    <Attribute attribute_name="cell_line" harmonized_name="cell_line" display_name="cell line">SK-MEL-28</Attribute>     
    <Attribute attribute_name="replicate">1</Attribute>   
</Attributes>   
<Links>     
    <Link type="entrez" target="bioproject" label="PRJNA545487">545487</Link>   
</Links>   
<Status status="live" when="2019-05-30T14:12:04.358"/> </BioSample> </BioSampleSet>

I use the following command in order to extract only the Attributes that have harmonized_name:

efetch -db biosample -id SAMN11893672 -format xml | xtract -pattern BioSampleSet -division BioSample -group Attributes -if Attribute@harmonized_name -sep ",\t" -element Attribute

However, I got the following output which includes the values of all the Attributes:

cell line,      51,     ATCC,   male,   skin,   SK-MEL-28,      1

The last Attribute (attribute_name=“replicate”, value=1) has no harmonized_name, but its value “1” was also printed out for some reason.
So it looks like my statement -if Attribute@harmonized_name did not work as expected.
Could you please point out any issue in my usage which may have caused this problem?

I was also trying to use the “-unless” statement in xtract to filter out a particular Attribute (e.g. the Attribute with harmonized_name="isolate"), but all the Attributes are filtered out for some reason.

For the same BioSample record as described above, when I use the following command:

efetch -db biosample -id SAMN11893672 -format xml | xtract -pattern BioSampleSet -division BioSample -group Attributes -sep ",\t" -element Attribute@harmonized_name

I am able to print out the harmonized_name of all the Attributes as below:

isolate,        age,    biomaterial_provider,   sex,    tissue, cell_line

However, when I try to filter out the Attribute with harmonized_name="isolate" by using the following command:

efetch -db biosample -id SAMN11893672 -format xml | xtract -pattern BioSampleSet -division BioSample -group Attributes -unless Attribute@harmonized_name -equals "isolate" -sep ",\t" -element Attribute@harmonized_name

Nothing is printed out; all the Attributes are filtered out for some reason.
So it looks like my statement -unless Attribute@harmonized_name -equals "isolate" did not work as expected.
Could you please point out any issue in my usage which may have caused this problem?

I would really appreciate your help!

Thank you very much!
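
Both results above would be consistent with the -if / -unless test being evaluated once for the whole Attributes group rather than once per Attribute: the group does contain an Attribute with a harmonized_name, so everything is printed, and it does contain one equal to "isolate", so everything is dropped. That reading is an assumption, not something confirmed in this thread. A filter that sidesteps the question can be sketched with grep, sed and awk on the raw XML, assuming efetch prints one Attribute element per line as in the listing above. The function names are invented for this example, and XML entities in values are not decoded.

```shell
# Hypothetical helpers; they assume one <Attribute ...>value</Attribute> per line.
TAB=$(printf '\t')

# Print harmonized_name<TAB>value for every Attribute that has a harmonized_name.
harmonized_attributes() {
  grep 'harmonized_name="' |
    sed -E "s/.*harmonized_name=\"([^\"]*)\"[^>]*>([^<]*)<\/Attribute>.*/\1${TAB}\2/"
}

# Drop the rows whose harmonized_name (first column) equals $1.
without_name() {
  awk -F'\t' -v skip="$1" '$1 != skip'
}

# Possible usage:
#   efetch -db biosample -id SAMN11893672 -format xml | harmonized_attributes
#   efetch -db biosample -id SAMN11893672 -format xml | harmonized_attributes | without_name isolate
```

For the record shown above, the first pipeline would skip the "replicate" Attribute because it has no harmonized_name, and the second would additionally drop the "isolate" row.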

Get filesizes for a list of SRA files

I have a pretty long list of SRA entries that I want to download but before I download everything I would like to know how big I can expect each of the files to be and how big all files will be in sum.
My list contains IDs that start with ERR or PRJNA if that is important.
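
One way this might be approached, sketched under assumptions: efetch can return SRA run metadata as CSV with -format runinfo, and that table includes a size_MB column. The helper below (its name is made up) locates that column by its header name, prints the size per run, and adds a total. It assumes no quoted commas occur before the size_MB column. Note that size_MB describes the run as archived by SRA, so FASTQ files produced from it will typically be larger.

```shell
# Hypothetical helper: read runinfo CSV on stdin, print "Run<TAB>size_MB"
# for each run and a final TOTAL line (in MB).
sum_size_mb() {
  awk -F',' '
    $1 == "Run" {                     # header row; may repeat between batches
      for (i = 1; i <= NF; i++) if ($i == "size_MB") col = i
      next
    }
    col && NF >= col && $col != "" {
      printf "%s\t%s\n", $1, $col
      total += $col
    }
    END { printf "TOTAL\t%d\n", total }
  '
}

# Possible usage, one query per entry of the list (ERR... or PRJNA...):
#   esearch -db sra -query "$id" | efetch -format runinfo | sum_size_mb
```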

Getting the FASTQ reads from accession IDs

Hey All,

I am interested in extracting the FASTQ reads for a set of accession IDs, but I can't find the element FtpPath in the document summary information.
I am trying something like this:
esearch -db sra -query "ERX4643323" | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -element FtpPath_RefSeq

Is it possible?

Regards
Jigyasa

Downloading FASTA Records with GI Number via efetch?

Is there a way to tell efetch to download FASTA records such that record headers include the GI number? For example, a given header would look like this
>gi|2248537881|ref|NM_001407571.1| Homo sapiens BRCA1 DNA repair associated (BRCA1), transcript variant 6, mRNA
instead of like this
>NM_001407571.1 Homo sapiens BRCA1 DNA repair associated (BRCA1), transcript variant 6, mRNA

The command I'm using is efetch -db nuccore -input accNosRandom.txt -format fasta > seq.fna where accNosRandom.txt contains a list of accession numbers. It results in the non-GI number format.

The efetch command reported BioSamples’ Attributes completely different from those displayed at the NCBI website

I was trying to use efetch to obtain the Attributes of a BioSample record, but I found that for some BioSample records, the Attributes reported in the xml are completely different from those displayed at the NCBI website. And the BioSample Id reported in the xml is different from the BioSample Id specified in the efetch command.
I use the following command to get the xml of a BioSample record:
efetch -db biosample -id SAMEA5244969 -format xml

Example 1: for BioSample SAMEA5244969, the NCBI website displays the Attributes as shown at https://www.ncbi.nlm.nih.gov/biosample/10858554
However, the efetch command reported the following xml:

<?xml version="1.0" ?>
<BioSampleSet>
   <BioSample access="public" publication_date="2016-06-04T00:00:00.000" last_update="2017-01-23T16:11:22.000" submission_date="2016-06-14T11:27:28.390" id="5244969" accession="SAMEA4457316">   
      <Ids>     
         <Id db="BioSample" is_primary="1">SAMEA4457316</Id>   
      </Ids>   
      <Description>     
         <Title>Sample from Homo sapiens</Title>     
         <Organism taxonomy_id="9606" taxonomy_name="Homo sapiens">       
            <OrganismName>Homo sapiens</OrganismName>     
         </Organism>   
      </Description>   
      <Owner>     
         <Name>EBI</Name>   
      </Owner>   
      <Models>     
         <Model>Generic</Model>   
      </Models>   
      <Package display_name="Generic">Generic.1.0</Package>   
      <Attributes>     
         <Attribute attribute_name="Sample Name" harmonized_name="sample_name" display_name="sample name">source 4</Attribute>     
         <Attribute attribute_name="Sex" harmonized_name="sex" display_name="sex">male</Attribute>     
         <Attribute attribute_name="disease state" harmonized_name="disease" display_name="disease">normal</Attribute>     
         <Attribute attribute_name="organism part" harmonized_name="tissue" display_name="tissue">colon</Attribute>     
         <Attribute attribute_name="specimen with known storage state">frozen specimen</Attribute>  
      </Attributes>   
      <Status status="live" when="2016-06-14T11:27:28.393"/> 
   </BioSample> 
</BioSampleSet>

The Attributes in this xml are completely different from those displayed at the NCBI website. And the reported BioSample Id (SAMEA4457316) in this xml is different from the BioSample Id (SAMEA5244969) specified in the efetch command.

Example 2: for BioSample SAMEA104565009, the NCBI website displays the Attributes as shown at https://www.ncbi.nlm.nih.gov/biosample/11349430
However, the efetch command reported the following xml:

<?xml version="1.0" ?>

This XML does not contain any elements, even though a list of Attributes is displayed at the NCBI website.

Example 3: for BioSample SAMEA5099860, the NCBI website displays the Attributes as shown at https://www.ncbi.nlm.nih.gov/biosample/10655621
However, the efetch command reported the following xml:

<?xml version="1.0" ?>
<BioSampleSet>
   <BioSample access="public" publication_date="2014-10-22T00:00:00.000" last_update="2016-10-25T08:32:28.000" submission_date="2016-05-19T19:48:00.303" id="5099860" accession="SAMEA3067264">   
      <Ids>     
         <Id db="BioSample" is_primary="1">SAMEA3067264</Id>   
      </Ids>   
      <Description>     
         <Title>Sample from Homo sapiens</Title>     
         <Organism taxonomy_id="9606" taxonomy_name="Homo sapiens">  
            <OrganismName>Homo sapiens</OrganismName>     
         </Organism>     
         <Comment>       
            <Paragraph>ExAC_v0.1_Sample_52281</Paragraph>     
         </Comment>   
      </Description>   
      <Owner>     
         <Name>EBI</Name>   
      </Owner>   
      <Models>     
         <Model>Generic</Model>   
      </Models>   
      <Package display_name="Generic">Generic.1.0</Package>   
      <Attributes>     
         <Attribute attribute_name="Sample Name" harmonized_name="sample_name" display_name="sample name">52281</Attribute>   
      </Attributes>   
      <Status status="live" when="2016-05-19T19:48:00.305"/> 
   </BioSample> 
</BioSampleSet>

The Attributes in this xml are completely different from those displayed at the NCBI website. And the reported BioSample Id (SAMEA3067264) in this xml is different from the BioSample Id (SAMEA5099860) specified in the efetch command.

I was wondering if you have some ideas about why the efetch command did not work correctly for the above BioSamples?

I’d greatly appreciate your help!

Thank you very much!
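
One detail in the listings above may be a clue: in Examples 1 and 3 the id attribute of the returned record (5244969 and 5099860) equals the numeric part of the requested accession (SAMEA5244969 and SAMEA5099860). That would be consistent with efetch treating the digits as a BioSample UID instead of resolving the accession, although nothing in this thread confirms it. Under that assumption, a small check can flag such mismatches, and looking the accession up with esearch first might avoid them. The function names below are made up, and the [ACCN] field tag is an assumption to verify with einfo -db biosample -fields.

```shell
# Hypothetical helper: print the accession attribute of the first <BioSample> on stdin.
returned_accession() {
  sed -n 's/.*<BioSample [^>]*accession="\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only if the XML on stdin describes the accession that was requested.
check_accession() {
  got=$(returned_accession)
  if [ "$got" != "$1" ]; then
    echo "requested $1 but received ${got:-nothing}" >&2
    return 1
  fi
}

# Possible usage (not run against NCBI here):
#   efetch -db biosample -id SAMEA5244969 -format xml | check_accession SAMEA5244969
#   esearch -db biosample -query "SAMEA5244969[ACCN]" | efetch -format xml
```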

Get coding sequences for a gene id

Hi everyone,

I would like to use a list of gene IDs to get, in FASTA format, the proteins encoded by those genes and the mRNA sequences without introns.

So far with this command I can get the protein sequence:
os.system('esearch -db gene -query "'+ "102888688" + ' [ID]" | elink -target protein -name gene_protein_refseq -cmd neighbor | xtract -pattern LinkSet -block IdList -element Id -block LinkSetDb -element Id | efetch -db protein -format fasta')

With this command I can get the mRNA with introns, which I don't want:
os.system('elink -db gene -id ' + "102888688" + ' -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta')

Get entries that have specific keyword in the tag "Isolation source"

Dear all,

I am trying to retrieve all bacterial genomic sequences in the RefSeq database that were collected from oceanic environments. I am trying to do this by limiting the search results to those that have the word "marine" or "oceanic" in the field "Isolation source" under Features in the description, but I cannot figure out how, and I am struggling with it.
Could anybody give me suggestions on how?

Thanks in advance.

How to print an empty field if no data is found?

I think I've seen this somewhere, but can't find it...

So I'm running something like

esearch -db sra -query "XXXXXX YYYYY[orgn] AND \"biomol rna\"[Properties]" | efetch -format docsum | xtract.Linux -pattern DocumentSummary -ACC @acc -LIB Library_descriptor/LIBRARY_NAME -element "&ORGN" Biosample "&ACC" "&LIB"

but the LIBRARY_NAME doesn't always exist (or doesn't have data). So how can I print an empty field (instead of nothing) if no data is found?

Magicblast: Masking rRNA and tRNA sequences from bacterial plasmids in SRA reads

To determine which bacterial species/substrains (E. coli, to test) have specific plasmids, a plasmid Blast database was created and Magicblasted with E. coli SRA reads. Most of the hits are to 16S rRNA and tRNA sequences in the plasmids. This issue is how to use the rRNA/tRNA Feature sequences in the Nucleotide db for masking these reads.

For example, how to extract the sequence by using the Feature tag:

rRNA complement(276650..278200)
/locus_tag="BTV67_11800"
/product="16S ribosomal RNA"
(from: https://www.ncbi.nlm.nih.gov/nuccore/CP020346.1)

And then using the sequence it links to for masking the reads that hit it.

How to merge data from two different databases into one output file?

What I want to do is

  1. search SRA, find all deposited RNAseq for an organism of interest and extract SRA experiment, study, run etc
  2. link to BioSample and extract the sample attributes (tissue, developmental stage etc) for the same SRA records.

So far, I've been able to complete the first part like so

esearch -db sra -query "Adoxophyes honmai[orgn] AND \"biomol rna\"[Properties]" | efetch -format docsum | xtract.Linux -pattern DocumentSummary -ACC @acc -element Biosample "&ACC"

And I also know how to "jump" from one database to another with elink

esearch -db sra -query "Adoxophyes honmai[orgn] AND \"biomol rna\"[Properties]" | elink -target biosample | efetch -format docsum | xtract.Linux -pattern SampleData -element OrganismName Id Attribute

However, as soon as I "jump" from SRA to BioSample I can't access information that is only found in SRA (e.g. the SRA Run, Experiment, Study IDs). I guess I'm looking for the equivalent of join in SQL...
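
A rough equivalent of an SQL join is available in the shell itself: save each pipeline's output as a tab-separated file keyed on the BioSample accession, then combine the two with sort and join. The sketch below assumes both files carry the BioSample accession in their first column (the first command above already prints Biosample first; the BioSample-side extraction would need to print its accession first as well). The function name and the file names in the usage comment are invented.

```shell
# Hypothetical helper: inner-join two tab-separated files on their first column.
# Rows are matched many-to-one, so several SRA runs may share one BioSample.
join_on_first_column() {
  tab=$(printf '\t')
  left=$(mktemp)
  right=$(mktemp)
  # join needs both inputs sorted on the key with the same collation.
  LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$left"
  LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$right"
  LC_ALL=C join -t "$tab" "$left" "$right"
  rm -f "$left" "$right"
}

# Possible usage, with sra.tsv and biosample.tsv holding the saved outputs
# of the two pipelines:
#   join_on_first_column sra.tsv biosample.tsv > merged.tsv
```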

Download 1,000 Random Human RefSeq Transcripts

Hi all,

I want to download a specific number (e.g., 1,000) of random RNA reference sequence transcripts, preferably from a specific reference build (e.g., GRCh38), although this isn't super important. It's essential that they be random, although "representative" is perhaps a better word. Any thoughts on how to do this?

By random I mean that there would be nothing to distinguish one batch of 1,000 sequences from another, e.g., in the number of curated (NM_, NR_) versus model (XM_, XR_) sequences. As alluded to above, my goal is to have a completely representative subset of transcript sequences.

Mark
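
One way to get an unbiased subset, sketched under assumptions: list every UID matching the transcript query once, draw the desired number of them uniformly at random, and fetch only those. shuf from GNU coreutils draws without replacement, so curated and model records should appear roughly in proportion to their share of the full list. The function name, the file names, and the QUERY variable are placeholders, not part of EDirect.

```shell
# Hypothetical helper: draw N distinct lines uniformly at random from stdin.
# Requires shuf (GNU coreutils).
sample_ids() {
  sort -u | shuf -n "$1"
}

# Possible usage (QUERY is whatever search defines the transcript population):
#   esearch -db nuccore -query "$QUERY" | efetch -format uid | sample_ids 1000 > ids.txt
#   efetch -db nuccore -input ids.txt -format fasta > random_transcripts.fna
```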

API Keys

How do you insert the api_key into edirect?

If you are piping between commands (e.g. an epost and then an esummary) does the key need to be supplied to both commands?
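
As far as I know, EDirect reads the key from the NCBI_API_KEY environment variable, so it does not have to be passed to each command separately. A minimal sketch, with a placeholder where the real key goes:

```shell
# Placeholder value; substitute the key from your NCBI account settings.
export NCBI_API_KEY="REPLACE_WITH_YOUR_KEY"

# Exported variables are inherited by every child process, so each command
# in a pipeline (for example epost followed by esummary) sees the same key.
sh -c 'echo "key visible to child process: ${NCBI_API_KEY:+yes}"'
```

Putting the export line in a shell startup file such as ~/.bashrc would make the key available in every new session.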

How to limit esearch results retmax or similar?

I was hoping to do something like this to get recently assembled genomes

esearch -db assembly -query "eucaryotes" -retmax 100 -sort "recently added"

But -retmax does not work.

Using -days to look into the past (e.g. -days 10) always returns nothing.

EDirect pulling [crdt] against [entrez] to get all PMIDs for 2017 where they don't match

I use [crdt] in all my PubMed searches, and there seems to be no way to pull this date from the XML. So I either need a way to pull that date or, even better, I want to know how often the crdt date differs from the entrez date for 2017. If the gap is very small, I can use the entrez date in EDirect as a direct replacement for [crdt]; if 10-20 percent or more of the records don't match, I can't, and I need another method to pull [crdt]. So I need a list of the 2017 records where the two dates differ, with PMID, crdt date and entrez date. I can then see how many records differ and the average gap in days between the two dates.

Any help or suggestions are greatly appreciated!
Tom

assembly field SB

In the example for esearch:

esearch -db assembly -query "Leptospira alstonii[ORGN] AND latest[SB]"

What is [SB]?

If I look for the fields in assembly, e.g. einfo -db assembly -fields

There is no SB listed.

Download Symbol via EFetch

Given a list of accession.version numbers, is there a way to download the official gene symbol (only) of the corresponding gene using one of the EDirect utilities? If not, any thoughts on how this might be best accomplished?
