
paperoni's Introduction

Paperoni

Paperoni is Mila's tool to collect publications from our researchers and generate HTML or reports from them.

Install

First clone the repo, then:

pip install -e .

Configuration

Create a YAML configuration file named config.yaml in the directory where you want to put the data with the following content:

paperoni:
  paths:
    database: papers.db
    history: history
    cache: cache
    requests_cache: requests-cache
    permanent_requests_cache: permanent-requests-cache
  institution_patterns:
    - pattern: ".*\\buniversit(y|é)\\b.*"
      category: "academia"

All paths are relative to the configuration file. Institution patterns are regular expressions used to recognize affiliations when parsing PDFs (along with other heuristics).
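As a rough illustration of what an institution pattern matches, here is a minimal sketch using Python's `re` module. The `categorize` helper is hypothetical, and the assumptions that matching is case-insensitive and covers the whole affiliation string are ours; Paperoni's actual matching logic may differ.

```python
import re

# The regex as it reads once YAML has unescaped the double-quoted string
# (the "\\b" written in config.yaml becomes "\b").
PATTERN = r".*\buniversit(y|é)\b.*"


def categorize(affiliation, patterns):
    """Return the category of the first matching pattern, or None.

    Assumption: case-insensitive match against the whole string.
    """
    for pattern, category in patterns:
        if re.fullmatch(pattern, affiliation, flags=re.IGNORECASE):
            return category
    return None


patterns = [(PATTERN, "academia")]
for affiliation in [
    "Université de Montréal",
    "McGill University, Montreal",
    "Acme Research Lab",
]:
    print(affiliation, "->", categorize(affiliation, patterns))
```

Trying a pattern this way against a few real affiliation strings before adding it to config.yaml can help catch patterns that are too broad or too narrow.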

Make sure to set the $GIFNOC_FILE environment variable to the path to that file.
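To make the "relative to the configuration file" rule concrete, the sketch below resolves a few of the paths above against the directory of the file named by $GIFNOC_FILE. The `resolve_paths` helper and the fallback location `/data/paperoni/config.yaml` are made up for illustration; this is not how Paperoni itself loads its configuration.

```python
import os
from pathlib import Path


def resolve_paths(config_file, paths):
    """Resolve each relative entry against the config file's directory."""
    base = Path(config_file).parent
    return {name: base / value for name, value in paths.items()}


# Placeholder location, used only when $GIFNOC_FILE is not set.
config_file = os.environ.get("GIFNOC_FILE", "/data/paperoni/config.yaml")

resolved = resolve_paths(config_file, {"database": "papers.db", "cache": "cache"})
for name, path in resolved.items():
    print(f"{name}: {path}")
```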

Start the web app

To start the web app on port 8888, execute the following command:

grizzlaxy -m paperoni.webapp --port 8888

You can also add this section to the configuration file (same file as the paperoni config):

grizzlaxy:
  module: paperoni.webapp
  port: 8888

And then you would just need to run grizzlaxy or grizzlaxy --config config-file.yaml.

Once papers are in the system, the app can be used to validate them or perform searches. There are some steps to follow in order to populate the database:

Add researchers

  • Go to http://127.0.0.1:8888/author-institution
  • Enter the researcher's name, their role at the institution, and a start date (the end date can be left blank), then click Add/Edit
  • You can edit a row by clicking on it, changing e.g. the end date and clicking Add/Edit
  • Then, add IDs on Semantic Scholar: click on the number in the Semantic Scholar IDs column, which will open a new window.
  • This will query Semantic Scholar with the researcher's name. Each box represents a different Semantic Scholar ID. Select:
    • Yes if the listed papers are indeed from the researcher. This ID will be scraped for this researcher.
    • No if the listed papers are not from the researcher. This ID will not be scraped.

Ignore OpenReview IDs for the time being; they might not work properly at the moment.

Scrape

The scraping currently needs to be done on the command line.

# Scrape from semantic_scholar
paperoni acquire semantic_scholar

# Get more information for the scraped papers
# E.g. download from arxiv and analyze author list to find affiliations
# It can be wise to use --limit to avoid hitting rate limits
paperoni acquire refine --limit 500

# Merge entries for the same paper; paperoni acquire does not do it automatically
paperoni merge paper_link

# Merge entries based on paper name
paperoni merge paper_name

Other merging functions are author_link and author_name for authors (not papers) and venue_link for venues.
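If you run these steps regularly, they can be chained from a small script. The sketch below only wraps the commands listed above with `subprocess`; the function names `pipeline_commands` and `run_pipeline` are hypothetical and not part of Paperoni.

```python
import subprocess


def pipeline_commands(limit=500):
    """Return the acquire and merge commands from this section, in order."""
    return [
        ["paperoni", "acquire", "semantic_scholar"],
        ["paperoni", "acquire", "refine", "--limit", str(limit)],
        ["paperoni", "merge", "paper_link"],
        ["paperoni", "merge", "paper_name"],
    ]


def run_pipeline(limit=500):
    """Run each step in turn, stopping at the first command that fails."""
    for command in pipeline_commands(limit):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` assumes that `paperoni` is on your PATH and that $GIFNOC_FILE points to your configuration file. Because of `check=True`, a failing step raises `subprocess.CalledProcessError` and the merge steps are not run on a partial scrape.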

Validate

Go to http://127.0.0.1:8888/validation to validate papers. Click "Yes" if the paper should be in the collection and "No" if it should not, according to your criteria (because it comes from a homonym of the researcher, is in the wrong field, is not actually a paper, etc.; it depends on your use case).

paperoni's People

Contributors

breuleux, notoraptor, satyaog, simnol22


paperoni's Issues

Session restart?

Hi,
Thanks for the tool, it looks neat. Is there an easy way to keep the CLI alive and search for other papers? Or is that something that might break the API?

IndexError: list index out of range in /find-authors-ids/

with load_config(os.environ["PAPERONI_CONFIG"]) as cfg:
    with cfg.database as db:
        reaserchers = get_authors(author_name)
        author_id = reaserchers[0].author_id

> Traceback (most recent call last):
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni-config/venv/cp311/lib/python3.11/site-packages/starbear/serve.py", line 278, in run
>     await self.fn(self.page)
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni-config/venv/cp311/lib/python3.11/site-packages/starbear/wrap.py", line 33, in wrapped_app
>     await app(page)
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni/paperoni/webapp/common.py", line 276, in app
>     return await fn(page, page[target])
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni/paperoni/webapp/find-authors-ids.py", line 325, in app
>     author_id = reaserchers[0].author_id
>                 ~~~~~~~~~~~^^^
> IndexError: list index out of range
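The traceback shows the first element of an empty list being indexed, which is consistent with `get_authors` finding no researcher for the given name. One possible guard is sketched below; `first_author_id` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for whatever `get_authors` really returns.

```python
from types import SimpleNamespace


def first_author_id(researchers):
    """Return the id of the first match, or None when nobody was found."""
    if not researchers:
        return None
    return researchers[0].author_id


# Stand-ins for the result of get_authors(author_name).
found = [SimpleNamespace(author_id="a1"), SimpleNamespace(author_id="a2")]
print(first_author_id(found))
print(first_author_id([]))
```

The page would then need to handle the `None` case, for example by showing a "no such researcher" message instead of failing.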

No license?

I was in the process of listing this fine piece of software, but I couldn't find a license. Assuming MIT, right? ;-)

Bibtex Author connector should be "and"

Author names in bibtex need to be connected with "and" (www.texfaq.org/FAQ-manyauthor)

Currently the BibTeX author names are joined with "," and only the last name is preceded by "and" (see paperoni/papers.py line 43). This causes an error in LaTeX compilation (e.g. on Overleaf).

e.g.

@inproceedings{zhang2019-convolutional15,
    author = {Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao and Kaisheng Ma},
    title = {Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation},
    year = {2019},
    booktitle = {2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {3712-3721},
    publisher = {IEEE}
}

Whereas the author list should be

author = {Linfeng Zhang and Jiebo Song and Anni Gao and Jingwei Chen and Chenglong Bao and Kaisheng Ma}
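A minimal sketch of the requested join is shown here. `bibtex_authors` is a hypothetical helper, not the function in paperoni/papers.py.

```python
def bibtex_authors(names):
    """Join author names the way BibTeX expects: separated by ' and '."""
    return " and ".join(names)


authors = [
    "Linfeng Zhang",
    "Jiebo Song",
    "Anni Gao",
    "Jingwei Chen",
    "Chenglong Bao",
    "Kaisheng Ma",
]
print(f"author = {{{bibtex_authors(authors)}}}")
```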

TypeError: app.<locals>.regen() got an unexpected keyword argument 'db' in /author-institution/

This:

async def regenerator(queue, regen, reset, db):
    gen = regen(db=db)

throws the following exception:

> Traceback (most recent call last):
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni-config/venv/cp311/lib/python3.11/site-packages/starbear/serve.py", line 278, in run
>     await self.fn(self.page)
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni-config/venv/cp311/lib/python3.11/site-packages/starbear/wrap.py", line 33, in wrapped_app
>     await app(page)
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni/paperoni/webapp/common.py", line 276, in app
>     return await fn(page, page[target])
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni/paperoni/webapp/author-institution.py", line 240, in app
>     async for result in regen:
>   File "/Users/satyaortiz-gagne/travail/mila/CODE/paperoni/paperoni/webapp/common.py", line 186, in regenerator
>     gen = regen(db=db)
>           ^^^^^^^^^^^^
> TypeError: app.<locals>.regen() got an unexpected keyword argument 'db'

The definition of regen is:

def regen(event=None):
    name = None
    if event is not None:
        name = event["name"]
    if event is not None and event["$submit"] == True:
        name = None
        addAuthor(event)
    return generate(name)

and the previous call in the stack is:

with load_config(os.environ["PAPERONI_CONFIG"]) as cfg:
    with cfg.database as db:
        regen = regenerator(
            queue=q,
            regen=regen,
            reset=page["#mid-div"].clear,
            db=db,
        )
        async for result in regen:
            htmlAuthor(result)

I'll check the history of the file and try to understand what could have led to this.
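The mismatch can be reproduced in isolation: `regenerator` passes `db` as a keyword, but `regen` only declares `event`. The sketch below uses stub functions with the same signatures. Letting `regen` accept `db` is only one possible fix (the other is to stop passing `db`); which one is right depends on the file history. The names `regen_original`, `regen_with_db` and `accepts_db` are hypothetical.

```python
import inspect


def regen_original(event=None):
    """Stub with the same signature as the regen quoted above."""
    return event


def regen_with_db(event=None, db=None):
    """Hypothetical variant that also accepts the db keyword."""
    return (event, db)


def accepts_db(fn):
    """Report whether fn declares a parameter named db."""
    return "db" in inspect.signature(fn).parameters


try:
    regen_original(db="fake-db")
except TypeError as exc:
    print("original signature fails:", exc)

print("extended signature works:", regen_with_db(db="fake-db"))
```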
