Canary

version 0.2.0

Canary is an API and daily runner for ContentMine.

Install

First install the other ContentMine tools: quickscrape, getpapers, norma, and AMI. See each tool's repository for installation instructions.

quickscrape and getpapers both require node, so once they are installed you will already have node available.

Install meteor (https://www.meteor.com/install):

curl https://install.meteor.com/ | sh

Get the codebase:

git clone http://github.com/contentmine/canary

Run it:

cd canary

meteor

If you want to use a settings file, like the example one provided, and/or set the port to run on, run with a command like this:

meteor --port 3123 --settings settings.json

If you want to run your own index, install elasticsearch as well (https://www.elastic.co/).

Configure

Various options can be set at the top of the canary.js file. Check that file directly to see the most up-to-date list of options.

Code Structure

Canary is a server-side-only app, even though meteor, the framework it is written in, supports both server and client code. Externally, it exposes an API that remote services can connect to.

The main code is in canary.js, which defines settings and the API endpoints.

cron.js defines the daily jobs that retrieve and process articles, extract facts, and save them to the index. The cron functions make use of the functions in the other files, although the API can also call those functions directly in some cases.

index.js contains the code that can query and submit data to the elasticsearch indexes.

normalise.js contains code that performs, or calls out to tools that perform, normalisation to scholarly HTML.

process.js can execute other processes on the article content to extract facts, for example by calling AMI.

retrieve.js retrieves content from remote APIs and sites, via quickscrape / thresher / getpapers, or simply direct http requests to URLs.

Dictionaries

The extract section of the daily functions expects to be able to read a folder full of dictionary files. These SHOULD be JSON files named something like species.json and they should contain a JSON list of objects. Each object must have at least a "query" key, and that key should point to either a simple string to match exactly on, or a regex starting with /, or an object that is a full "query" part of an elasticsearch query.
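As a rough illustration of the three accepted "query" forms, the sketch below classifies the entries of a made-up dictionary. It is not taken from the Canary code: the function name classifyQuery, the extra commonName key, and the elasticsearch field name text are all assumptions chosen for the example.

```javascript
// Hypothetical helper (not part of Canary): report which of the three
// documented forms the "query" value of a dictionary entry takes.
function classifyQuery(entry) {
  if (entry === null || typeof entry !== 'object' || !('query' in entry)) {
    throw new Error('dictionary entry needs a "query" key');
  }
  const q = entry.query;
  if (typeof q === 'string') {
    // A string starting with "/" is treated as a regex, anything else as an exact match.
    return q.startsWith('/') ? 'regex' : 'exact';
  }
  if (q !== null && typeof q === 'object' && !Array.isArray(q)) {
    // A plain object is taken to be the "query" part of an elasticsearch query.
    return 'elasticsearch';
  }
  throw new Error('unsupported "query" value');
}

// Example contents of a species.json file (all values are illustrative).
const species = [
  { query: 'Panthera leo', commonName: 'lion' },
  { query: '/Panthera [a-z]+/' },
  { query: { match_phrase: { text: 'Ursus maritimus' } } }
];

const kinds = species.map(classifyQuery);
console.log(kinds);
```

A check like this could be run over a dictionary folder before the daily jobs start, so that malformed entries are reported early instead of failing during extraction.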

It would also be possible to accept dictionaries in .xml and convert them to the same structure, and to accept .txt files that just contain a list of strings or regexes. Neither capability has been added to the cron/extract function yet, although both could be added easily.
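A minimal sketch of what the .txt conversion might look like, assuming one string or regex per line. This is not existing Canary code; the function name textToDictionary is hypothetical.

```javascript
// Hypothetical converter (not part of Canary): turn the contents of a .txt
// dictionary, one string or regex per line, into the JSON dictionary
// structure, i.e. a list of objects that each carry a "query" key.
function textToDictionary(text) {
  return text
    .split(/\r?\n/)                      // handle both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0)   // skip blank lines
    .map((line) => ({ query: line }));
}

const converted = textToDictionary('Panthera leo\n\n/Ursus [a-z]+/\n');
console.log(JSON.stringify(converted));
```

Because regex lines keep their leading /, the converted entries can be treated the same way as entries loaded from a .json dictionary.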

When a fact is discovered, any keys present in the match object will be added to the fact, and the name of the dictionary file (without the filetype suffix) will be used to identify which dictionary matched the fact.
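The behaviour described above could be sketched as follows. The function names and the dictionary key used to record the file name are assumptions made for this example; check cron.js for the names Canary actually uses.

```javascript
// Hypothetical helper: strip the directory and the filetype suffix from a
// dictionary file path, e.g. "dictionaries/species.json" becomes "species".
function dictionaryName(file) {
  return file.split('/').pop().replace(/\.[^.]+$/, '');
}

// Hypothetical helper: copy every key of the matching dictionary entry onto
// the fact, then record which dictionary produced the match.
// The key name "dictionary" is an assumption, not taken from the Canary code.
function buildFact(match, dictionaryFile) {
  return Object.assign({}, match, { dictionary: dictionaryName(dictionaryFile) });
}

const fact = buildFact(
  { query: 'Panthera leo', commonName: 'lion' },
  'dictionaries/species.json'
);
console.log(fact);
```

Copying into a fresh object keeps the dictionary entry itself unchanged, so the same entry can be reused for later matches.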

Contributors

tarrow

canary's Issues

EIC into hypothesis

Should be able to take EIC data from results and use them to highlight facts in the scholarly.html version of articles that we have.

Explore producing an updated IUCN list

Can we, given the current IUCN list, and an update of the species in it on our demo, regularly produce an updated list where we move species based on what we learn about them? Perhaps we can identify certain statements that we can take as meaning a species is no longer endangered? Or has become endangered? Perhaps @blahah or @rossmounce can provide input here.

Make an IUCN demo page

Make a specialised page that lists IUCN species updates as we find them, and relates them to the current list.

Blog post about RRID

MM was on holiday after last meeting, did not get blog post for last sprint done before. TODO asap

bulk download

Having a bulk download button would be great, to locally recreate what would've been the result if I ran everything on the VM - essentially only copying the workspace. A tar or zip wouldn't be too big?

Find a way to get a list of IUCN species

Where is this list? @blahah probably knows.

If there is an API we can query for species that we pull each day, that would be good. If not, a way to get a dump of it and keep it up to date. Or a way to scrape it off a web page somewhere. Whichever approach, a python script that can be called as an exec by canary would be good.

Deploy latest code

Code from last sprint still to be deployed onto live, as MM was on holiday. TODO asap.

RRID into canary

RRID should now be ready in AMI. Anusha should check it out, then pass to Mark once confirmed working in latest version of AMI and Mark will update Canary to put it live.
