literature-search

Literature search pipeline for BRCA Exchange

Overview

Attempts to download all PubMed papers with BRCA in the title or abstract, then searches the text and supplemental material for variant mentions. pubMunch handles the download and variant search, the biocommons hgvs package normalizes the variants to HGVS, and the results are exported to a literature.json file for ingest into BRCA Exchange.

Running

Make a local copy of pubConfExample and fill in your email and keys.

Create a local references directory where the static reference files will be stored.

Create a local crawl directory where the downloaded papers and output will be stored.

Download the references (this only needs to run once):

docker run --rm -it \
  --user=`id -u`:`id -g` \
  -v path/to/your/pubConf:/app/.pubConf:ro \
  -v path/to/references/storage:/references \
  -v path/to/crawl/storage:/crawl \
  brcachallenge/literature-search:latest references

Download a single paper as a test:

docker run --rm -it \
  --user=`id -u`:`id -g` \
  -v path/to/your/pubConf:/app/.pubConf:ro \
  -v path/to/references/storage:/references \
  -v path/to/crawl/storage:/crawl \
  brcachallenge/literature-search:latest --pmid 9042909 crawl

Run a full crawl, which incrementally downloads any new papers since the last crawl and outputs stats:

docker run --rm -it \
  --user=`id -u`:`id -g` \
  -v path/to/your/pubConf:/app/.pubConf:ro \
  -v path/to/references/storage:/references \
  -v path/to/crawl/storage:/crawl \
  brcachallenge/literature-search:latest crawl
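
The docker invocations above differ only in their trailing arguments, so they are easy to script. The following is a minimal sketch, assuming a Unix host with Docker on the PATH. The helper name build_docker_command and the example paths are hypothetical and not part of this project. It resolves the host paths to absolute form, since bind mounts passed with -v expect an absolute host path, and it omits -it because a scripted run does not need a terminal.

```python
import os
import shlex
from pathlib import Path

IMAGE = "brcachallenge/literature-search:latest"


def build_docker_command(pubconf, references, crawl, args):
    """Assemble the docker run argument list shown in this README.

    Hypothetical helper for illustration; not part of the project.
    """
    def absolute(path):
        # Bind mounts need an absolute host path.
        return str(Path(path).expanduser().resolve())

    return [
        "docker", "run", "--rm",
        # Run as the calling user so output files are not owned by root.
        f"--user={os.getuid()}:{os.getgid()}",
        "-v", f"{absolute(pubconf)}:/app/.pubConf:ro",
        "-v", f"{absolute(references)}:/references",
        "-v", f"{absolute(crawl)}:/crawl",
        IMAGE,
        *args,
    ]


if __name__ == "__main__":
    # Example paths are placeholders; substitute your own.
    command = build_docker_command(
        "pubConf", "references", "crawl", ["--pmid", "9042909", "crawl"]
    )
    print(shlex.join(command))
```

The returned list can be passed directly to subprocess.run(command, check=True) to launch the crawl.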

You should find a literature.json file under the crawl directory listing the papers crawled, their abstracts, and any variants found, along with snippets of text around each variant mention:

{
  "date": "2019-04-23T16:27:27", 
  "papers": {
    "9042909": {
      "abstract": "The mutations 185delAG....", 
      "articleId": 5009042909
    }
  }, 
  "variants": {
    "chr13:g.32340300:GT>G": [
      {
        "mentions": [
          "1997). In the Ashkenazi Jewish population, three founder mutations, 185delAG and 5382insC in the BRCA1 gene  921 and<<< 6174delT>>> in the ..",
          ...
        ],
        "pmid": "9042909", 
        "points": 3
      }
      ...
    ],
    "chr13:g.32340526:AT>A":
    ...
  }
}
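
To consume this output, a short script along the following lines may help. It is a sketch that assumes the file has exactly the shape of the excerpt above (a "variants" object mapping each variant to a list of entries with "pmid", "points", and "mentions"), and that the <<< and >>> markers delimit the matched variant text inside each mention. The function names and the SAMPLE constant are hypothetical.

```python
import json
import re

# Matches the <<< ... >>> markers around the variant text in a mention.
HIGHLIGHT = re.compile(r"<<<\s*(.*?)\s*>>>")

# Tiny stand-in for literature.json, copied from the excerpt above.
SAMPLE = {
    "variants": {
        "chr13:g.32340300:GT>G": [
            {
                "mentions": [
                    "1997). In the Ashkenazi Jewish population, three founder "
                    "mutations, 185delAG and 5382insC in the BRCA1 gene  921 "
                    "and<<< 6174delT>>> in the .."
                ],
                "pmid": "9042909",
                "points": 3,
            }
        ]
    }
}


def load_literature(path):
    """Read a literature.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(literature):
    """Return {variant: [(pmid, points, highlighted_texts), ...]}."""
    summary = {}
    for variant, hits in literature.get("variants", {}).items():
        rows = []
        for hit in hits:
            # Collect the text between the markers in every mention.
            highlighted = [
                text
                for mention in hit.get("mentions", [])
                for text in HIGHLIGHT.findall(mention)
            ]
            rows.append((hit["pmid"], hit.get("points"), highlighted))
        summary[variant] = rows
    return summary


if __name__ == "__main__":
    for variant, rows in summarize(SAMPLE).items():
        for pmid, points, highlighted in rows:
            print(variant, pmid, points, highlighted)
```

For a real crawl, replace SAMPLE with load_literature("path/to/crawl/storage/literature.json"), adjusting the path to your own crawl directory.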

You can run each individual step of the crawler as well:

docker run brcachallenge/literature-search:latest
Usage: run.py [OPTIONS] COMMAND [ARGS]...

Options:
  --debug / --no-debug  Generate debug output
  --pmid TEXT           PMID to crawl
  --help                Show this message and exit.

Commands:
  convert     Convert papers to text
  crawl       Crawl latest papers...
  download    Download papers
  export      Export literature.json
  find        Find variants in all papers text
  lovd        Run LOVD test
  match       Match variants to papers
  references  Download references
  update      Update list of variants and pubmed ids

Developing and Debugging

Build a local Docker image that includes this crawler and pubMunch:

make build

Start the container, map the local code into /app, and launch bash:

make debug

Download the references:

python3 run.py references

Crawl a single paper from within the container:

python3 run.py --pmid 9042909 crawl
