
Paperoni

Paperoni is Mila's tool to collect publications from our researchers and generate HTML or reports from them.

Install

First clone the repo, then:

pip install -e .

Configuration

Create a YAML configuration file named config.yaml in the directory where you want to put the data with the following content:

paperoni:
  paths:
    database: papers.db
    history: history
    cache: cache
    requests_cache: requests-cache
    permanent_requests_cache: permanent-requests-cache
  institution_patterns:
    - pattern: ".*\\buniversit(y|é)\\b.*"
      category: "academia"

All paths are relative to the configuration file. Institution patterns are regular expressions used to recognize affiliations when parsing PDFs (along with other heuristics).
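To get a feel for how such a pattern behaves, the sketch below applies the pattern from the example configuration to a few affiliation strings using Python's re module. The categorize helper, the use of re.fullmatch, and the case-insensitive flag are assumptions made for this illustration; Paperoni's own matching logic may differ. Note that the doubled backslash in the YAML file is an escape, so the regular expression itself contains a single \b.

```python
import re

# (pattern, category) pairs mirroring institution_patterns in config.yaml.
# The YAML string ".*\\buniversit(y|é)\\b.*" decodes to the regex below.
INSTITUTION_PATTERNS = [
    (r".*\buniversit(y|é)\b.*", "academia"),
]


def categorize(affiliation, patterns=INSTITUTION_PATTERNS):
    """Return the category of the first matching pattern, or None.

    Hypothetical helper: case-insensitive, full-string matching is an
    assumption of this sketch, not necessarily what Paperoni does.
    """
    for pattern, category in patterns:
        if re.fullmatch(pattern, affiliation, flags=re.IGNORECASE):
            return category
    return None


for name in ["Université de Montréal", "McGill University", "Google Research"]:
    print(f"{name!r} -> {categorize(name)}")
```

Trying a new pattern this way before adding it to the configuration file can help catch patterns that are too broad or too narrow.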

Make sure to set the $GIFNOC_FILE environment variable to the path to that file.
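The following sketch shows one way to check the setup from Python: it reads $GIFNOC_FILE and resolves the configured paths against the directory of the configuration file. Both function names are hypothetical, and the path mapping is passed in by hand rather than parsed from YAML; this only illustrates the "relative to the configuration file" rule and is not how Paperoni itself loads its configuration.

```python
import os
from pathlib import Path


def config_file_from_env(environ=os.environ):
    """Return the path named by $GIFNOC_FILE, or raise a clear error."""
    value = environ.get("GIFNOC_FILE")
    if not value:
        raise RuntimeError("GIFNOC_FILE is not set; point it at config.yaml")
    return Path(value)


def resolve_paths(config_file, paths):
    """Resolve each relative path against the config file's directory."""
    base = Path(config_file).absolute().parent
    return {name: base / relative for name, relative in paths.items()}
```

For example, resolve_paths("/data/paperoni/config.yaml", {"database": "papers.db"}) places the database at /data/paperoni/papers.db.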

Start the web app

To start the web app on port 8888, execute the following command:

grizzlaxy -m paperoni.webapp --port 8888

You can also add this section to the configuration file (same file as the paperoni config):

grizzlaxy:
  module: paperoni.webapp
  port: 8888

With that section in place, running grizzlaxy or grizzlaxy --config config-file.yaml is enough.
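If you prefer to launch the web app from a script, a minimal sketch could look like the following. It only wraps the command-line form with -m and --port given earlier and sets $GIFNOC_FILE for the child process; webapp_command and start_webapp are hypothetical helper names, not part of Paperoni or grizzlaxy.

```python
import os
import subprocess


def webapp_command(port=8888):
    """Build the grizzlaxy command line for the Paperoni web app."""
    return ["grizzlaxy", "-m", "paperoni.webapp", "--port", str(port)]


def start_webapp(config_file, port=8888):
    """Start the web app as a child process with $GIFNOC_FILE set."""
    env = dict(os.environ, GIFNOC_FILE=str(config_file))
    return subprocess.Popen(webapp_command(port), env=env)
```

Calling start_webapp("config.yaml") returns a subprocess.Popen handle that can later be stopped with its terminate() method.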

Once papers are in the system, the app can be used to validate them or perform searches. To populate the database, follow these steps:

Add researchers

Go to http://127.0.0.1:8888/author-institution
Enter the researcher's name, their role at the institution, and a start date (the end date can be left blank), then click Add/Edit
To edit a row, click on it, change a field such as the end date, and click Add/Edit
Then, add IDs on Semantic Scholar: click on the number in the Semantic Scholar IDs column, which will open a new window.
This will query Semantic Scholar with the researcher's name. Each box represents a different Semantic Scholar ID. Select:
- Yes if the listed papers are indeed from the researcher. This ID will be scraped for this researcher.
- No if the listed papers are not from the researcher. This ID will not be scraped.

Ignore OpenReview IDs for now; they might not work properly.

Scrape

The scraping currently needs to be done on the command line.

# Scrape from semantic_scholar
paperoni acquire semantic_scholar

# Get more information for the scraped papers
# E.g. download from arxiv and analyze author list to find affiliations
# It can be wise to use --limit to avoid hitting rate limits
paperoni acquire refine --limit 500

# Merge entries for the same paper; paperoni acquire does not do it automatically
paperoni merge paper_link

# Merge entries based on paper name
paperoni merge paper_name

Other merging functions are author_link and author_name for authors (not papers) and venue_link for venues.
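The acquire and merge commands above can be chained from a small script so that a failing step stops the run. The sketch below is one way to do that with subprocess; acquisition_commands and run_pipeline are hypothetical names, and the script uses only the commands and the --limit flag listed above.

```python
import subprocess


def acquisition_commands(refine_limit=500):
    """Return the scrape and merge commands in the order they should run."""
    return [
        ["paperoni", "acquire", "semantic_scholar"],
        ["paperoni", "acquire", "refine", "--limit", str(refine_limit)],
        ["paperoni", "merge", "paper_link"],
        ["paperoni", "merge", "paper_name"],
    ]


def run_pipeline(refine_limit=500, runner=subprocess.run):
    """Run each command in turn; check=True aborts on the first failure."""
    for command in acquisition_commands(refine_limit):
        print("Running:", " ".join(command))
        runner(command, check=True)
```

The runner parameter exists only so the sequence can be exercised without invoking paperoni, for instance by passing a function that records the commands it receives.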

Validate

Go to http://127.0.0.1:8888/validation to validate papers. Click "Yes" if the paper should be in the collection and "No" if it should not. The criteria depend on your use case: for example, a paper may be rejected because it comes from a homonym of the researcher, is in the wrong field, or is not a paper at all.

    with load_config(os.environ["PAPERONI_CONFIG"]) as cfg:
        with cfg.database as db:
            researchers = get_authors(author_name)
            author_id = researchers[0].author_id

    def regen(event=None):
        name = None
        if event is not None:
            name = event["name"]
        if event is not None and event["$submit"] == True:
            name = None
            addAuthor(event)
        return generate(name)

    with load_config(os.environ["PAPERONI_CONFIG"]) as cfg:
        with cfg.database as db:
            regen = regenerator(
                queue=q,
                regen=regen,
                reset=page["#mid-div"].clear,
                db=db,
            )
            async for result in regen:
                htmlAuthor(result)

    async def regenerator(queue, regen, reset, db):
        gen = regen(db=db)
