Code Monkey home page Code Monkey logo

chocula's Introduction

Chocula: Scholary Journal Metadata Munging

Chocula is a python tool for parsing and merging journal-level metadata from various sources into a sqlite3 database file for analysis. It is currently the main source of journal-level metadata for the fatcat catalog of published papers.

Quickstart

You need python3.8, pipenv, and sqlite3 installed. Commands are run via make. If you don't have python3.8 installed system-wide, try installing with pyenv.

Set up dependencies and fetch source metadata:

make dep fetch-sources

Then re-generate entire sqlite3 database from scratch:

make database

Now you can explore the database; see chocula_schema.sql for the output schema.

sqlite3 chocula.sqlite

Developing

There is partial test coverage, and we verify python type annotations. Run the tests with:

make test

History / Name

This is the 3rd or 4th iteration of open access journal metadata munging as part of the fatcat project; earlier attempts were crude ISSN spreadsheet munging, then the oa-journals-analysis repo (Jupyter notebook and a web interface), then the fatcat:extra/journal_metadata/ script for bootstrapping fatcat container metadata. This repo started as the fatcat journal_metadata directory and retains the git history of that folder.

The name "chocula" comes from a half-baked pun on Count Chocula... something something counting, serials, cereal. Read more about Count Chocula.

ISSN-L Munging

Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in the ISSN-L table. On the portal.issn.org public site, these are listed as:

"This provisional record has been produced before publication of the
resource.  The published resource has not yet been checked by the ISSN
Network.It is only available to subscribing users."

For example:

  • 2199-3246/2199-3254: Digital Experiences in Mathematics Education

Previously these were allowed through into fatcat, so some 2000+ entries exist. This allowed through at least 110 totally bogus ISSNs. Currently, chocula filters out "unknown" ISSN-Ls unless they are coming from existing fatcat entities.

Source Metadata

The sources.toml configuration file contains a canoncial list of metadata files, the last time they were updated, and original URLs for mirrored files. The general workflow is that all metadata files are bunled into "source snapshots" and uploaded/downloaded from the Internet Archive (archive.org) together.

There is some tooling (make update-sources) to automatically download fresh copies of some files. Others need to be fetched manually. In all cases, new files are not automatically integrated: they are added to a sub-folder of ./data/ and must be manually copied and sources.toml updated with the appropriate date before they will be used.

Some sources of metadata were helpfully pre-parsed by the maintainer of https://moreo.info. Unfortunately this site is now defunct and the metadata is out of date.

Adding new directories or KBART preservation providers is relatively easy, by creating new helpers in chocula/directories/ and/or chocula/kbart.py.

Updating Homepage Status and Countainer Counts

Run these commands from a fast connection; they will run with parallel processes. These hit only public URLs and API endpoints, but you would probably have the best luck running these from inside the Internet Archive cluster IP space:

make data/2020-06-03/homepage_status.json
make data/2020-06-03/container_stats.json

Then copy these files to data/ (no sub-directory) and update the dates in sources.toml. Update the sqlite database with:

pipenv run python -m chocula load_fatcat_stats
pipenv run python -m chocula load_homepage_status
pipenv run python -m chocula summarize

chocula's People

Contributors

bnewbold avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.