uniprot.rs's Introduction

uniprot.rs

Rust data structures and parser for the UniprotKB database(s).

🔌 Usage

The uniprot::uniprot::parse function can be used to obtain an iterator over the entries of a UniprotKB database in XML format (either SwissProt or TrEMBL). XML files for UniRef and UniParc can also be parsed, with uniprot::uniref::parse and uniprot::uniparc::parse, respectively.

extern crate uniprot;

let f = std::fs::File::open("tests/uniprot.xml")
    .map(std::io::BufReader::new)
    .unwrap();

for r in uniprot::uniprot::parse(f) {
    let entry = r.unwrap();
    // ... process the Uniprot entry ...
}

Any BufRead implementor can be used as an input, so the database files can be streamed directly from their online location over HTTP with a library such as reqwest, or over FTP with the ftp library.
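
Since only a BufRead is required, the same calling code can be fed from a file, a network stream, or an in-memory buffer. The sketch below illustrates this with the standard library alone. Note that count_entry_tags is a hypothetical stand-in for the parser (it merely counts lines containing "<entry") and is not part of the uniprot crate.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Hypothetical stand-in for a parser: counts the lines that open an
/// `<entry` element. It accepts any `BufRead` implementor.
fn count_entry_tags<R: BufRead>(reader: R) -> std::io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        if line?.contains("<entry") {
            count += 1;
        }
    }
    Ok(count)
}

fn main() -> std::io::Result<()> {
    // Made-up miniature document, only for illustration.
    let xml = "<uniprot>\n<entry dataset=\"Swiss-Prot\">\n</entry>\n<entry dataset=\"TrEMBL\">\n</entry>\n</uniprot>\n";

    // An in-memory buffer: `Cursor<&[u8]>` implements `BufRead` directly.
    let from_memory = count_entry_tags(Cursor::new(xml.as_bytes()))?;

    // Any plain `Read` can be wrapped in a `BufReader` to obtain a `BufRead`.
    let from_wrapped = count_entry_tags(BufReader::new(xml.as_bytes()))?;

    println!("{} {}", from_memory, from_wrapped);
    Ok(())
}
```

With the real crate, the call to count_entry_tags would be replaced by one of the parse functions mentioned above, and the reader by a file, an HTTP response body, or a decompressor wrapped in a BufReader.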

The XML format is the same for the EBI REST API and for the UniProt API, so this library can also be used to read single entries or larger queries. For instance, you can search UniProt for a keyword and retrieve all the matching entries:

extern crate ureq;
extern crate libflate;
extern crate uniprot;

let query = "bacteriorhodopsin";
let query_url = format!("https://www.uniprot.org/uniprot/?query={}&format=xml&compress=yes", query);

let req = ureq::get(&query_url).set("Accept", "application/xml");
let reader = libflate::gzip::Decoder::new(req.call().unwrap().into_reader()).unwrap();

for r in uniprot::uniprot::parse(std::io::BufReader::new(reader)) {
    let entry = r.unwrap();
    // ... process the Uniprot entry ...
}

See the online documentation at docs.rs for more examples, and some details about the different features available.

๐Ÿ“ Features

  • threading (enabled by default): compiles the multithreaded parser that offers a 90% speed increase when processing XML files.
  • url-links (disabled by default): exposes the links in OnlineInformation as an url::Url.

๐Ÿ” See Also

If you're a bioinformatician and a Rustacean, you may be interested in these other libraries:

  • pubchem.rs: Rust data structures and API client for the PubChem API.
  • obofoundry.rs: Rust data structures for the OBO Foundry.
  • fastobo: Rust parser and abstract syntax tree for Open Biomedical Ontologies.
  • proteinogenic: Chemical structure generation for protein sequences as SMILES strings.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

📜 License

This library is provided under the open-source MIT license.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the UniProt Consortium. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

uniprot.rs's Issues

Example with gzip?

Would you mind writing a working example that reads a gzip-compressed file with this crate (especially using libflate)?

I have tried using both libflate and flate2 to parse uniref90.xml.gz without ungzipping it. Although my code compiles and runs, it appears to never parse anything. I am wrapping the flate2::read::GzDecoder or libflate::gzip::Decoder in a BufReader and I've tried being explicit about creating a SequentialParser.

    let input_file = File::open(input_filename).expect(&format!("Can't open file: {}", input_filename));
    let input_reader = BufReader::new(input_file);

    let gzdecoder = GzDecoder::new(input_reader);
    let brg = BufReader::new(gzdecoder);

    for r in uniprot::parser::SequentialParser::new(brg) {
        println!("works");  // this never prints
        let entry = r.unwrap();
    }
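
When a decompressed stream appears to yield nothing, one cheap sanity check is to confirm that the input really starts with the gzip magic bytes before it is handed to a decoder. The following is a minimal diagnostic sketch using only the standard library. looks_like_gzip is a hypothetical helper, not part of uniprot, libflate or flate2, and it does not by itself explain the behaviour reported above.

```rust
use std::io::{BufRead, Cursor, Read};

/// Hypothetical helper: returns `true` if the stream starts with the gzip
/// magic bytes (0x1f, 0x8b). `fill_buf` only peeks, so no bytes are consumed
/// and the reader can still be passed to a decompressor afterwards.
/// Caveat: a reader may hand back fewer than two bytes in one `fill_buf`
/// call, in which case this conservatively returns `false`.
fn looks_like_gzip<R: BufRead>(reader: &mut R) -> std::io::Result<bool> {
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&[0x1f, 0x8b]))
}

fn main() -> std::io::Result<()> {
    // Made-up inputs: the first four bytes of a gzip header, and plain XML.
    let mut gz = Cursor::new(vec![0x1f, 0x8b, 0x08, 0x00]);
    let mut xml = Cursor::new(b"<?xml version=\"1.0\"?>".to_vec());

    println!("gzip header: {}", looks_like_gzip(&mut gz)?);
    println!("plain XML:   {}", looks_like_gzip(&mut xml)?);

    // Peeking did not consume anything: all four bytes are still readable.
    let mut rest = Vec::new();
    gz.read_to_end(&mut rest)?;
    assert_eq!(rest.len(), 4);
    Ok(())
}
```

In the snippet above, such a check would be applied to input_reader before it is wrapped in the decoder.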

Performance issues

I work with Unipept, and as part of our pipeline we parse the SwissProt and TrEMBL files from Uniprot.

We are looking into optimising our Java parser, and were considering re-writing it in Rust using your library. However, it seems to have worse performance than we expected.

The example from the README that simply loops through the XML file takes between 50 and 60 seconds for the SwissProt file using 8 threads (all of which are at 100% usage). This is suspiciously slow. Would you be willing to look into resolving some issues?

Below is a flamegraph to help pin down the cause (download it and open it in your browser to use it interactively): https://gist.github.com/stijndcl/a650e4e20ef3a3e59886a4d5ce8c1b1a
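
To make timings like the one above easier to reproduce, the loop can be wrapped in a small timing helper. This is a minimal sketch using only the standard library. time_iteration is a hypothetical helper, and the numeric iterator in main is a stand-in workload, not the parser. Timings are only comparable when the program is built with optimisations (cargo build --release).

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: drains an iterator and reports how many items it
/// produced and how long that took. Items are dropped without being
/// inspected, so errors yielded by a parser would go unnoticed here.
fn time_iteration<I: Iterator>(iter: I) -> (usize, Duration) {
    let start = Instant::now();
    let count = iter.count();
    (count, start.elapsed())
}

fn main() {
    // Stand-in workload; with the crate, this would be the iterator
    // returned by one of the parse functions.
    let workload = (0..1_000_000u64).map(|n| n.wrapping_mul(31));

    let (count, elapsed) = time_iteration(workload);
    println!("{} items in {:.3} s", count, elapsed.as_secs_f64());
}
```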
