
Comments (2)

sevragorgia commented on August 19, 2024

Hi Ido,

You can replace the old URL with something like this:

"https://legacy.uniprot.org/uniprot/?query=taxonomy:33208&format=fasta&compress=yes&include=no"

Change the taxonomy ID accordingly; that should do the trick.
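For illustration, the legacy URL can be assembled from a taxonomy-ID variable. The variable name `tax_id` and the output filename are my own choices, and the actual `curl` call is left commented out since the legacy server may no longer respond (see the follow-up comment):

```shell
# sketch: assemble the legacy-endpoint URL; 33208 is the Metazoa taxonomy ID
tax_id=33208
url="https://legacy.uniprot.org/uniprot/?query=taxonomy:${tax_id}&format=fasta&compress=yes&include=no"
echo "$url"
# curl -o metazoa.fasta.gz "$url"   # uncomment to actually download
```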

cheers

Sergio

from transpi.

sevragorgia commented on August 19, 2024

The issue with the Metazoa UniProt database, and any other database downloaded from UniProt, is that the API changed, so the legacy option must be fixed too.

The new REST API requires pagination if more than 10,000,000 records are to be downloaded. This makes things a bit more complicated but possible.

Please note that I drafted this bash script only to make the download work; it could probably be much more efficient and elegant, but I don't have time to improve it.

#test using cyanobacterial proteins (2,010 in total); change the URL to download the metazoan dataset after testing
url="https://rest.uniprot.org/uniprotkb/search?compressed=true&format=fasta&query=%28%28taxonomy_id%3A1608213%29%29&size=500"
page=1

while [ "$url" != "" ]
do
    echo "Downloading $url"

    # append this page of records to test.gz; dump the response headers to the file head
    curl -D head "$url" >>test.gz

    # pull the next-page URL out of the Link header, strip curl's angle brackets and the
    # trailing semicolon, and re-encode the characters that come back decoded
    url=$(grep -i link head | cut -f 2 -d " " | sed 's/[;]//g' | sed 's/(/%28/g' | sed 's/)/%29/g' | sed 's/id:/id%3A/g' | sed 's/[<>]//g')

    # log the page counter into head so the final page count survives the loop
    echo "$page" >>head

    page=$((page+1))
done

This will download 2,010 cyanobacterial proteins from UniProt and write the headers needed to proceed into the file head. In that file, you should find a number (5 in this case) indicating how many pages of 500 records the script downloaded. You should check that this is the correct number (!) by dividing the number of sequences to download by 500 (the number of records on each page) and rounding up.
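A quick sanity check (my own suggestion, not part of the script above) is to count the FASTA headers in the downloaded archive and compare against UniProt's reported total:

```shell
# fabricate a tiny test.gz so the example is self-contained;
# with the real download, skip this line and zcat your own test.gz
printf '>seq1\nMKV\n>seq2\nMTT\n' | gzip > test.gz

# each FASTA record starts with '>', so this counts the downloaded sequences
count=$(zcat test.gz | grep -c '^>')
echo "$count sequences downloaded"
```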

For metazoa, there are 34,470,675 proteins in the database (checked 12.06.2023). You can do the math.
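The math is a ceiling division; a sketch in shell arithmetic, using the numbers above:

```shell
# expected number of 500-record pages for the 34,470,675 metazoan proteins
total=34470675
size=500
pages=$(( (total + size - 1) / size ))   # integer ceiling division
echo "$pages"   # 68942 pages
```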

The URL for Metazoa looks like this:

#metazoa url
url="https://rest.uniprot.org/uniprotkb/search?compressed=true&format=fasta&query=%28%28taxonomy_id%3A33208%29%29&size=500"

Generally, you can build the URL for any taxon by replacing the taxonomy_id with the appropriate number: substitute your taxonomy ID into the %3AYOUR_TAX_ID_HERE%29%29 part of the URL. The %3A before and the %29%29 after the TAX_ID are mandatory (they are the URL-encoded : and )) characters).
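The substitution can be scripted; the variable name tax_id is my own addition:

```shell
# build the REST search URL for an arbitrary taxon; %3A and %29%29 are the
# URL-encoded ':' and '))' that must surround the taxonomy ID in the query
tax_id=33208   # Metazoa; replace with your taxonomy ID
url="https://rest.uniprot.org/uniprotkb/search?compressed=true&format=fasta&query=%28%28taxonomy_id%3A${tax_id}%29%29&size=500"
echo "$url"
```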

I did not test for the (larger) metazoan file, so I would appreciate it if you could report the results.

Oh, and I am assuming you will do this from a Linux OS.

