ucla-bd2k / aztecretrieval

Bioinformatics repository crawler and updater

JavaScript 57.20% CSS 0.10% HTML 0.76% Shell 0.15% Python 41.78%

aztecretrieval's Issues

Document Code

Make sure your scripts are well documented (comments and an explanation of each function). Documentation will be stored in the wiki of the GitHub repo.

Create database of file/data formats and use to extract from Publications

Fix Tag bug

Some tags are stored as a single string such as "[tag1, tag2, tag3]". Parse each such string into an array of separate tags: ['tag1', 'tag2', 'tag3'].
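
One possible shape for the fix, as a minimal sketch. The function name `parse_tags` is hypothetical, and it assumes tags never contain commas themselves:

```python
def parse_tags(raw):
    """Turn a string like "[tag1, tag2, tag3]" into ['tag1', 'tag2', 'tag3']."""
    if isinstance(raw, list):
        # Already a list: flatten any bracketed strings nested inside it.
        return [tag for item in raw for tag in parse_tags(item)]
    text = raw.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Split on commas, then drop surrounding whitespace, quotes, and empty pieces.
    parts = (part.strip().strip("'\"") for part in text.split(","))
    return [part for part in parts if part]
```

Passing a list works too, so a record that mixes clean tags with bracketed strings can be normalized in one call.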

Use GitHub API

Use the GitHub API to extract programmatic information about the tool.
Metadata includes: maintainers (name, GitHub username, email), programming language, version (version number and date), license, and number of forks/pulls.
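
A rough sketch of how this could look with the GitHub REST API and only the standard library. The function names and the output keys are assumptions; the sketch covers the owner login, language, license, fork count, and latest release, and leaves maintainer names/emails and pull counts (which need further endpoints) out:

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"

def fetch_json(url):
    """GET a GitHub API URL and decode the JSON body (unauthenticated, so rate limited)."""
    request = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def summarize_repo(repo, release=None):
    """Reduce a repository payload (and optional latest-release payload) to tool metadata."""
    license_info = repo.get("license") or {}
    owner = repo.get("owner") or {}
    release = release or {}
    return {
        "maintainer_username": owner.get("login"),
        "language": repo.get("language"),
        "license": license_info.get("spdx_id"),
        "forks": repo.get("forks_count"),
        "version": release.get("tag_name"),
        "version_date": release.get("published_at"),
    }

def fetch_tool_metadata(owner, name):
    """Fetch and summarize one repository; a repo without releases gets no version."""
    repo = fetch_json(f"{API_ROOT}/repos/{owner}/{name}")
    try:
        release = fetch_json(f"{API_ROOT}/repos/{owner}/{name}/releases/latest")
    except urllib.error.HTTPError:
        release = None
    return summarize_repo(repo, release)
```

Keeping `summarize_repo` separate from the network calls means it can be tested against saved API responses.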

Extract Text from PDF

Input extracted data into Solr

Write a new script that takes a JSON file containing the extracted publication data and pushes that data into the Solr database.
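
A minimal sketch of such a script, posting to Solr's JSON update handler. The base URL, the core name `publications`, and the function names are assumptions, as is the input file holding either one document or a list of documents:

```python
import json
import urllib.request

def build_solr_update(documents, solr_url="http://localhost:8983/solr", core="publications"):
    """Build a POST request that adds documents to a Solr core and commits them."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def push_file(path, **kwargs):
    """Load a JSON file of extracted records and send it to Solr; returns the HTTP status."""
    with open(path, encoding="utf-8") as handle:
        documents = json.load(handle)
    if isinstance(documents, dict):
        # A single record is wrapped so Solr always receives a list.
        documents = [documents]
    with urllib.request.urlopen(build_solr_update(documents, **kwargs)) as response:
        return response.status
```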

Create All-in-one script for pipeline

Create an all-in-one script that downloads the PDFs, extracts metadata from each PDF using GROBID, enriches it using APIs, and inserts it into Solr. Each component should be modular:
1. Get papers from the journal (download PDFs if needed)
2. Classify the publication
3. Extract metadata from the PDF and enrich it
4. Insert metadata into Solr
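
To keep the four components swappable, the driver could take each stage as a plain callable. This is only a sketch of that idea; `run_pipeline` and its parameter names are hypothetical:

```python
def run_pipeline(fetch, classify, extract, insert):
    """Chain the four stages; return how many records were inserted.

    fetch()         -> iterable of papers (for example, PDF paths)
    classify(paper) -> True if the paper describes a tool
    extract(paper)  -> metadata record for that paper
    insert(record)  -> stores the record (for example, in Solr)
    """
    inserted = 0
    for paper in fetch():
        if not classify(paper):
            continue  # non-tool publications are skipped
        insert(extract(paper))
        inserted += 1
    return inserted
```

With this layout each stage can be run and tested on its own, and stub functions can stand in for GROBID or Solr during local testing.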

Parse Funding

GROBID extracts funding information, but only as whole sentences. Parse each sentence to get the agency and grant number.
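
A heuristic sketch of that parsing. The agency table and the grant-number pattern are both assumptions (real acknowledgements vary widely), so treat this as a starting point rather than a complete parser:

```python
import re

# Hypothetical lookup table; a real one would be much longer.
AGENCIES = {
    "National Institutes of Health": "NIH",
    "NIH": "NIH",
    "National Science Foundation": "NSF",
    "NSF": "NSF",
}

# Assumed shape of a grant number: an optional activity code such as "R01 ",
# up to four capital letters, an optional hyphen, then at least five digits.
GRANT_PATTERN = re.compile(r"\b(?:[A-Z]\d{2}\s)?[A-Z]{0,4}-?\d{5,}\b")

def parse_funding(sentence):
    """Return the agencies and grant-like tokens found in one funding sentence."""
    agencies = []
    for name, short in AGENCIES.items():
        if re.search(rf"\b{re.escape(name)}\b", sentence) and short not in agencies:
            agencies.append(short)
    return {"agencies": agencies, "grants": GRANT_PATTERN.findall(sentence)}
```

The sketch does not try to pair each grant number with its agency; that would need the position of each match within the sentence.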

Insert Metadata to Solr

Make sure the Solr schema is also compatible with the Aztec web application.

Download Bioinformatics PDF

Download 5000 publications from the journal Bioinformatics.

Classify publications

Given a PDF, classify the publication as tool or nontool using the classifier.

Setup Environment

As a developer, I would like to set up this project so that I can run and test it locally.

Extract Platforms from Publications

Platform information states which operating systems the tool is compatible with.
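
One simple way to approach this is keyword matching over the publication text. The pattern table below is an assumption and is deliberately small; note that a bare "windows" match will also fire on phrases like "sliding windows", so results need review:

```python
import re

# Hypothetical keyword patterns, one per platform label.
PLATFORM_PATTERNS = {
    "Linux": r"\b(?:linux|ubuntu|debian|centos)\b",
    "macOS": r"\b(?:mac ?os(?: ?x)?|os ?x)\b",
    "Windows": r"\bwindows\b",
    "Unix": r"\bunix\b",
}

def extract_platforms(text):
    """Return the platform labels whose keywords appear in the text."""
    return [
        name
        for name, pattern in PLATFORM_PATTERNS.items()
        if re.search(pattern, text, re.IGNORECASE)
    ]
```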

Add Citation Metrics to Tools

Using the CrossRef API (and/or Altmetrics), retrieve the number of citations for each tool that has a DOI, and boost its ranking in Solr accordingly.
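
A sketch of the CrossRef half. The citation count is read from the `is-referenced-by-count` field of a `/works/{doi}` response; the logarithmic boost formula and the function names are assumptions, not something this issue specifies:

```python
import json
import math
import urllib.parse
import urllib.request

def fetch_work(doi):
    """Fetch the CrossRef record for one DOI."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def citation_count(work):
    """Read the citation count from a CrossRef /works/{doi} response body."""
    return int(work.get("message", {}).get("is-referenced-by-count", 0))

def citation_boost(count):
    """Map a citation count to a multiplicative boost (log damping is an assumption)."""
    return 1.0 + math.log10(1 + max(count, 0))
```

Damping with a logarithm keeps a handful of very highly cited tools from dominating every query; the resulting value could be stored in a numeric field and applied as a boost at query time.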

Extract Metadata from Publication

Using GROBID, extract metadata from the PDF.
Metadata includes the tool's name, description (abstract), links, source code links, technologies, grant/funding information, authors, and affiliations.
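
GROBID returns its results as TEI XML, so part of this task is reading fields out of that document. The sketch below pulls the title, abstract, and author names; the element paths reflect the usual TEI header layout and should be checked against real GROBID output, and `parse_tei_header` is a hypothetical name:

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def parse_tei_header(tei_xml):
    """Extract title, abstract, and author names from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI)
    abstract_node = root.find(".//tei:profileDesc/tei:abstract", TEI)
    abstract = ""
    if abstract_node is not None:
        # Join nested paragraph text and collapse runs of whitespace.
        abstract = " ".join("".join(abstract_node.itertext()).split())
    authors = []
    for pers in root.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI):
        parts = [
            node.text
            for node in pers
            if node.tag.endswith("}forename") or node.tag.endswith("}surname")
        ]
        name = " ".join(part for part in parts if part)
        if name:
            authors.append(name)
    return {"title": title.strip(), "abstract": abstract, "authors": authors}
```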

Extract Metadata from Publication

Using the modified GROBID script, extract metadata from the publication.

Using PubMed API

Using the PubMed API, retrieve the following information:
PMID (PubMed ID), journal name, DOI, affiliation, author, publication title.

Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=27402908&rettype=xml
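
The XML returned by that endpoint can be read with the standard library. A minimal sketch, assuming the response root is a `PubmedArticleSet` holding one `PubmedArticle`; the function name and output keys are hypothetical:

```python
import xml.etree.ElementTree as ET

def parse_pubmed_article(xml_text):
    """Extract the requested fields from an efetch XML response for a single PMID."""
    root = ET.fromstring(xml_text)
    article = root.find("PubmedArticle")
    if article is None:
        return None
    title_node = article.find("MedlineCitation/Article/ArticleTitle")
    # Titles may contain inline markup such as <i>, so join all nested text.
    title = "".join(title_node.itertext()).strip() if title_node is not None else None
    authors, affiliations = [], []
    for author in article.findall("MedlineCitation/Article/AuthorList/Author"):
        parts = [author.findtext("ForeName"), author.findtext("LastName")]
        name = " ".join(part for part in parts if part)
        if name:
            authors.append(name)
        for aff in author.findall("AffiliationInfo/Affiliation"):
            if aff.text and aff.text not in affiliations:
                affiliations.append(aff.text)
    return {
        "pmid": article.findtext("MedlineCitation/PMID"),
        "journal": article.findtext("MedlineCitation/Article/Journal/Title"),
        "title": title,
        "doi": article.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
        "authors": authors,
        "affiliations": affiliations,
    }
```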

Streamline Extraction Process

Given a list of PMIDs, extract information for each one using the PubMed API, GROBID, and the GitHub API.

The PubMed API should give you the DOI, which can be used to download the PDF for GROBID.
Be sure to look for a GitHub link in the PDF; if there is one, use the GitHub API to extract info.

Put it into a JSON object that looks like this:
{
  "pubmed": {...},
  "grobid": {...},
  "github": {...}
}
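
A small sketch of the two glue pieces this needs: spotting a GitHub link in the extracted text and assembling the combined record. The function names are hypothetical, and the link pattern only handles plain `github.com/owner/repo` URLs:

```python
import json
import re

GITHUB_LINK = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)")

def find_github_repo(text):
    """Return (owner, repo) for the first GitHub link in the text, or None."""
    match = GITHUB_LINK.search(text)
    if not match:
        return None
    owner, name = match.groups()
    # Drop a sentence-ending period and a trailing ".git" if present.
    name = re.sub(r"\.git$", "", name.rstrip("."))
    return owner, name

def build_record(pubmed, grobid, github=None):
    """Combine the three sources under the keys used in the JSON layout above."""
    return {"pubmed": pubmed, "grobid": grobid, "github": github or {}}

def to_json(record):
    """Serialize one combined record."""
    return json.dumps(record, indent=2)
```

Papers without a GitHub link still get a `github` key, holding an empty object, so every record has the same shape.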

Missing Descriptions

Some tools are missing descriptions; where one is missing, fill it in with the publication's abstract.
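
As a minimal sketch, assuming each tool is a dict with `description` and `abstract` keys (those field names are a guess at the record layout):

```python
def fill_description(tool):
    """Use the abstract as the description when the description is empty.

    Returns the record unchanged if it already has a description or has
    no abstract to fall back on; otherwise returns an updated copy.
    """
    description = (tool.get("description") or "").strip()
    abstract = (tool.get("abstract") or "").strip()
    if description or not abstract:
        return tool
    return {**tool, "description": abstract}
```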
