contentmine / canary

Canary is a UI to the contentmine tools getpapers, quickscrape, norma, and ami.

License: MIT License

CSS 0.45% JavaScript 34.41% HTML 64.26% Python 0.88%

canary's Introduction

Canary

version 0.2.0

Canary is an API and daily runner for ContentMine.

Install

First install the other ContentMine tools: quickscrape, getpapers, norma and AMI. See each tool's repository for installation instructions.

Installing quickscrape and getpapers requires node, so you will already have node installed at this point.

Install meteor (https://www.meteor.com/install):

curl https://install.meteor.com/ | sh

Get the codebase:

git clone http://github.com/contentmine/canary

Run it:

cd canary

meteor

To use a settings file (an example is provided) and/or to set the port to run on, use a command like this:

meteor --port 3123 --settings settings.json

If you want to run your own index, install elasticsearch too (https://www.elastic.co/).

Configure

Various options can be set at the top of the canary.js file. Check there directly to be sure you are seeing the most up-to-date options.

Code Structure

Canary is a server-side-only app, even though it is written in meteor, which supports both server and client side. Externally, it exposes an API that remote services can connect to.

The main code is in canary.js, which defines settings and the API endpoints.

cron.js defines the daily jobs that retrieve and process articles, extract facts, and save them to the index. The cron functions make use of the other functions, although the API can also call those directly in some cases if necessary.

index.js contains the code that can query and submit data to the elasticsearch indexes.

normalise.js contains code that performs, or calls out to tools that perform, normalisation to scholarly HTML.

process.js can execute other processes on the article content to extract facts, for example by calling AMI.

retrieve.js retrieves content from remote APIs and sites, via quickscrape / thresher / getpapers, or simply direct http requests to URLs.

Dictionaries

The extract section of the daily functions expects to read a folder of dictionary files. These should be JSON files with names like species.json, each containing a JSON list of objects. Each object must have at least a "query" key, whose value is either a simple string to match exactly, a regex starting with /, or an object that is the full "query" part of an elasticsearch query.
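As an illustration of this structure, here is a minimal JavaScript sketch of a hypothetical species.json and a helper that tells the three query forms apart. The entries, the extra "label" key, the "text" field in the elasticsearch query and the classifyQuery name are all assumptions made for this example and are not taken from the canary codebase. How canary itself parses the regex strings is not covered here.

```javascript
// Hypothetical contents of a dictionary file such as species.json:
// a JSON list of objects, each with at least a "query" key.
const speciesDictionary = [
  { query: "Panthera leo", label: "lion" },                 // exact string match
  { query: "/Panthera\\s+\\w+/", label: "big cat" },        // regex, starts with /
  { query: { match_phrase: { text: "Panthera tigris" } } }  // "query" part of an elasticsearch query
];

// Decide which of the three documented query forms an entry uses.
function classifyQuery(entry) {
  if (entry === null || typeof entry !== "object" || !("query" in entry)) {
    throw new Error('dictionary entry must be an object with a "query" key');
  }
  const q = entry.query;
  if (typeof q === "string") {
    return q.startsWith("/") ? "regex" : "exact";
  }
  if (q !== null && typeof q === "object" && !Array.isArray(q)) {
    return "elasticsearch";
  }
  throw new Error("unsupported query type");
}

console.log(speciesDictionary.map(classifyQuery));
```

Checking entries this way before a daily run would surface malformed dictionary files early, rather than part-way through extraction.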

It would also be possible to allow dictionaries in .xml and convert them to the same structure, and to allow .txt files that just contain a list of strings or regexes. However, neither of these capabilities has been added to the cron/extract function yet, although they easily could be.
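To give an idea of how small the .txt conversion could be, the following sketch turns a plain-text list into the JSON dictionary structure. It is not part of canary; the txtToDictionary name and the one-entry-per-line convention are assumptions for this example.

```javascript
// Turn the contents of a plain-text list (one string or /regex/ per line)
// into the JSON dictionary structure: a list of objects with a "query" key.
// Blank lines are skipped and surrounding whitespace is trimmed.
function txtToDictionary(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => ({ query: line }));
}

const sample = "Panthera leo\n\n/Panthera\\s+\\w+/\n";
console.log(JSON.stringify(txtToDictionary(sample)));
```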

When a fact is discovered, any keys present in the match object will be added to the fact, and the name of the dictionary file (without the filetype suffix) will be used to identify which dictionary matched the fact.
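The sketch below shows one way this could look in JavaScript. The buildFact and dictionaryName helpers, and the "dictionary" and "match" field names on the fact, are hypothetical; check cron.js for the names canary actually uses.

```javascript
const path = require("path");

// Derive the dictionary name from its file name, dropping the filetype suffix.
function dictionaryName(filePath) {
  return path.basename(filePath, path.extname(filePath));
}

// Copy every key of the matched dictionary entry onto the fact and record
// which dictionary produced it. "dictionary" and "match" are assumed field names.
function buildFact(filePath, entry, matchedText) {
  return Object.assign({}, entry, {
    dictionary: dictionaryName(filePath),
    match: matchedText
  });
}

const fact = buildFact(
  "dictionaries/species.json",
  { query: "Panthera leo", label: "lion" },
  "Panthera leo"
);
console.log(fact);
```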


canary's Issues

Setup cambridge servers

Move All instances of Meteor to Cron.js

Have quickscrape / getpapers output a standalone abstract file

So that Norma / AMI can take just abstracts where those are all that is available

Convert all not in cron.js to library

EIC into hypothesis

Should be able to take EIC data from results and use them to highlight facts in the scholarly.html version of articles that we have.

Have FOTD tweet daily

Explore producing an updated IUCN list

Given the current IUCN list, and updates about the species in it from our demo, can we regularly produce an updated list in which we move species based on what we learn about them? Perhaps we can identify certain statements that we can take as meaning a species is no longer endangered, or has become endangered. Perhaps @blahah or @rossmounce can provide input here.

Integrate QS and GP

Blog about FOTD

Workspace names with hyphens may be causing a problem

See jenny-mozzies

Explore how CUL spreadsheets can be automated

This is a stub issue that will grow - look at spreadsheets, talk to RJ and ET about how this may tie up with IIOA and APC

Make an IUCN demo page

Make a specialised page that lists IUCN species updates as we find them, and relates them to the current list.

Blog post about RRID

MM was on holiday after the last meeting and did not get the blog post for the last sprint done beforehand. TODO asap.

Update Norma to read abstracts and normalise them for AMI

Run back through old dates with RRID

Once RRID functionality is deployed to live, run it back over the dailies to August to see if we can find any.

Deploy new OA journal ability to daily canary and run back to August

There are more OA journals we can now scrape, so deploy the updates to the live server, then have Canary run back over old dates to get the new sources and process them.
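A possible shape for such a run-back, sketched in JavaScript: list each day in the range and hand it to the daily job. The dailyDates and backfill names are hypothetical, processDay stands in for whatever canary's daily job does for one date, and the dates in the example are arbitrary placeholders.

```javascript
// List every day from `start` to `end` inclusive as YYYY-MM-DD strings (UTC).
function dailyDates(start, end) {
  const dates = [];
  const day = new Date(start + "T00:00:00Z");
  const last = new Date(end + "T00:00:00Z");
  while (day <= last) {
    dates.push(day.toISOString().slice(0, 10));
    day.setUTCDate(day.getUTCDate() + 1);
  }
  return dates;
}

// Run a per-day job over the whole range and collect the results.
// processDay is a placeholder for the real daily retrieve/process step.
function backfill(start, end, processDay) {
  return dailyDates(start, end).map((date) => processDay(date));
}

// The year here is arbitrary and only illustrates crossing a month boundary.
console.log(dailyDates("2015-08-30", "2015-09-02"));
```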

Get spreadsheet examples from CUL

Blog about Canary

bulk download

Having a bulk download button would be great, to locally recreate what would've been the result if I ran everything on the VM - essentially only copying the workspace. A tar or zip wouldn't be too big?

Gain login credentials for tool ES

Find a way to get a list of IUCN species

Where is this list? @blahah probably knows.

If there is an API we can query for species that we pull each day, that would be good. If not, a way to get a dump of it and keep it up to date. Or a way to scrape it off a web page somewhere. Whichever approach, a python script that can be called as an exec by canary would be good.
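On the canary side, calling such a script as an exec could look roughly like the sketch below, which uses Node's child_process module. It assumes the script prints a JSON list of species names on stdout; that output format, the runSpeciesScript name and the script file name in the comment are all hypothetical.

```javascript
const { execFileSync } = require("child_process");

// Run an external command that prints a JSON list of species names on stdout
// and return the parsed list. Command and arguments are supplied by the caller.
function runSpeciesScript(command, args) {
  const stdout = execFileSync(command, args, { encoding: "utf8" });
  const species = JSON.parse(stdout);
  if (!Array.isArray(species)) {
    throw new Error("expected the script to print a JSON list");
  }
  return species;
}

// Hypothetical usage; "get_iucn_species.py" is a placeholder script name:
// const species = runSpeciesScript("python", ["get_iucn_species.py"]);
```

Using execFileSync with an argument list, rather than a single shell string, avoids shell quoting problems if arguments ever contain spaces.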