lycophron's People

Contributors

alejandromumo, jrcastro2, mguidoti, punkish, slint, yashlamba


lycophron's Issues

Game plan, thoughts and discussions

Ok, so, this is a summary of what I discussed with @slint on Skype today, a catch-up on what we previously discussed at the last Arcadia Sprint meeting at CERN (Feb/2020). This is a joint Plazi-Zenodo effort that aims at a complete redo of Lycophron, in order to deliver a tool that can handle any use case, not only specific ones, with better performance and reliability.

  • Lycophron should have a separate module to handle the Zenodo communication;
  • It should load/export data using Pandas Dataframes;
  • It must be a CLI tool (Click), accepting commands such as upload, update, publish, delete, and uninstall, with parameters to toggle sandbox mode, define the export file, edit sensitive information (e.g. the API token), and so on;
  • It should use .env (python-dotenv) to keep sensitive information and other eventual parameters of the tool;
  • Schema-based data validation (possible libs: Pydantic, Marshmallow);
  • It should be able to auto-match provided columns with Zenodo fields, asking the user to confirm, if a match is not exact, before making any API call;
  • We should use Celery for concurrent processing;
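A minimal sketch of the Click CLI surface described above (the command names come from the plan; the option names, flag behavior, and messages are assumptions, not the actual implementation):

```python
import click


@click.group()
@click.option("--sandbox/--no-sandbox", default=True,
              help="Toggle Zenodo sandbox mode (assumed flag name).")
@click.pass_context
def cli(ctx, sandbox):
    """Lycophron: batch upload records to Zenodo."""
    ctx.obj = {"sandbox": sandbox}


@cli.command()
@click.argument("export_file", type=click.Path())
@click.pass_context
def upload(ctx, export_file):
    """Create new draft records from EXPORT_FILE."""
    target = "sandbox" if ctx.obj["sandbox"] else "production"
    click.echo(f"uploading {export_file} to {target}")


@cli.command()
def publish():
    """Publish previously uploaded drafts."""
    click.echo("publishing drafts")
```

The remaining commands (update, delete, uninstall) would follow the same pattern, reading the API token and other sensitive settings from a .env file via python-dotenv.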

The first step is building the Zenodo communication module; the next step would be implementing the first commands for the CLI.

Tomorrow I'll work on setting labels, milestones and creating templates for issues, and the README (at least the skeleton).

What do you think, @slint ?

Cheers!

Denormalize/parse MfN input sheet into Lycophron template

  • Define unique IDs for all the objects (specimens and photos)
  • Fill in bi-directional links between specimen <-> photo
  • Map input data to DarwinCore/AudubonCore metadata fields
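The first two steps could be sketched with plain dicts (the column names, the matching key, and the ID scheme below are assumptions, not the actual MfN sheet layout):

```python
specimens = [{"catalog_no": "MfN-001"}]
photos = [
    {"file": "img_01.jpg", "catalog_no": "MfN-001"},
    {"file": "img_02.jpg", "catalog_no": "MfN-001"},
]

# 1. Define unique IDs for all objects.
for i, s in enumerate(specimens, start=1):
    s["id"] = f"specimen{i:03d}"
for i, p in enumerate(photos, start=1):
    p["id"] = f"photo{i:03d}"

# 2. Fill in bi-directional specimen <-> photo links,
#    matching photos to specimens by catalog number.
by_catalog = {s["catalog_no"]: s for s in specimens}
for p in photos:
    s = by_catalog[p["catalog_no"]]
    p["specimen_id"] = s["id"]
    s.setdefault("photo_ids", []).append(p["id"])
```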

Depends on new input from the MfN folks: plazi/arcadia-project#234

Support external DOIs

Currently there are two issues related to DOIs:

  • External DOIs are not updated in the record's metadata
  • Records without DOIs are not accepted.

Upload bats records from sandbox to production

Hi Donat and @flsimoes ,

Last week, Manuel finished the Sandbox upload of the ~230-record bats collection from the Google Sheet that Felipe and Juliana shared in the last Arcadia sprint.

Some pending action items before we go ahead with uploading to production:

  • You’ll notice that the community includes 225 records, while the Google sheet has 239 rows (240 - 1 header).
    • 5 records with existing DOIs were already uploaded from previous tries
    • 5 records had an invalid affiliation value, but we can clean this up in the Google sheet
  • 45 entries in the Google sheet were missing DOIs, but we uploaded them anyway to verify the metadata: https://sandbox.zenodo.org/communities/bats_project/search?q=exists:conceptdoi
  • The metadata on the records (or at least a sample of them) has to be verified, just to make sure we didn’t mess up anything in the process. These are metadata-only updates which is fine to also perform later on (even after the production upload).

Let me know if you have any questions, I’ll be off next week on holidays, but we can re-route to the rest of the team so they can help.

Cheers,
Alex

PS: I couldn’t find Juliana’s email address, so feel free to forward this or add her in the loop.

Bi-directional linking

  • Define an input convention for the import template on how to bi-directionally link identifiers
    • Could be something like {<item_id>:<identifier>:<is_bidirectional>}, e.g. {specimen001:doi:true}
  • Implement the functionality in the Zenodo metadata serializer
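A tiny parser for the proposed convention might look like this (a sketch only; the convention itself is still up for discussion):

```python
def parse_link(token):
    """Parse the proposed "{<item_id>:<identifier>:<is_bidirectional>}"
    convention, e.g. "{specimen001:doi:true}"."""
    item_id, identifier, bidirectional = token.strip("{}").split(":")
    return item_id, identifier, bidirectional.lower() == "true"
```

For example, `parse_link("{specimen001:doi:true}")` yields `("specimen001", "doi", True)`.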

XLS template for upload

@alejandromumo @slint Where can I find a template XLS to be used to upload articles to BLR?

In the BiCIKL TNA projects, we need these XLS templates to hand out to the awardees so that they can add their publications in a format that saves us time when uploading.

thanks for a link

Donat

conversion of taxodros bibliography

@slint @jhpoelen @lnielsen
Here is a draft of a CSV for the lycophron upload to Zenodo.
https://docs.google.com/spreadsheets/d/1f-_6MFzObIBlxeCaEtHD5ZRF0Kwj0zKq_BiPvbEyYSg/edit#gid=0

Alex, can you please have a look at it and let me know? Also, maybe you could indicate in a color which fields are required. I marked some, based on the * in the upload form.

I am not sure how to add multiple contributors or keywords with line breaks in a single field. When I save and reopen the CSV file, it no longer looks the same as the XLS.

What do you recommend when we have the bibliographic reference as a single string, such as "Nature 541: 136-138", and the authors as a string as well?
Do we need to parse these out in a first round, or just add them as-is?

Thanks

Donat
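Regarding the line-break question above: Python's csv module stores embedded newlines by quoting the field, which is also why a spreadsheet may display the cell differently after a CSV round-trip. A minimal round-trip sketch (the column names are illustrative):

```python
import csv
import io

# Write a record whose "keywords" cell holds newline-separated values;
# csv.writer quotes the field automatically.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "keywords"])
writer.writerow(["Drosophila notes", "taxonomy\nDiptera\nbibliography"])

# Reading it back preserves the embedded newlines inside the cell.
buf.seek(0)
rows = list(csv.reader(buf))
```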

Package manager of choice

Hi Alex,

Are you ok with pipenv as the project's package manager, keeping a Pipfile here, or do you prefer to stick with requirements.txt?

I'll be using pipenv anyway, and I can export a requirements.txt if you prefer it that way.

Please, let me know.

Cheers,

Add processing status for each record

The current data model gives each record a "status" field, allowing the user to track the progress of their uploads.
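A minimal illustration of such a per-record status using only the standard library's sqlite3 (the schema and status values here are assumptions, not Lycophron's actual model):

```python
import sqlite3

STATUSES = ("queued", "uploading", "uploaded", "failed")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO record VALUES ('specimen001', 'queued')")

def set_status(conn, record_id, status):
    # Guard against unknown states before touching the database.
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    conn.execute("UPDATE record SET status = ? WHERE id = ?",
                 (status, record_id))

set_status(conn, "specimen001", "uploading")
```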

The application supports this, but there is currently an error when accessing the database inside a Celery task:

  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1964, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 747, in do_execute
    cursor.execute(statement, parameters)
MemoryError
[2023-03-27 15:36:41,853: ERROR/MainProcess] Pool callback raised exception: MemoryError('Process got: ')
Traceback (most recent call last):
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/billiard/pool.py", line 1796, in safe_apply_callback
    fun(*args, **kwargs)
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/celery/worker/request.py", line 730, in on_success
    return self.on_failure(retval, return_ok=True)
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/celery/worker/request.py", line 545, in on_failure
    raise MemoryError(f'Process got: {exc}')
MemoryError: Process got:

It seems that the engine (SQLite) fails while executing the cursor to fetch the data; Celery then wraps the failure in a MemoryError.

Coping with UPDATE of custom metadata fields with multiple values/entries

Hi Alex,

We recently discussed by email how to update custom metadata fields, and that raised a couple of questions for me, especially because we're designing this tool to be used universally, not exclusively by our domain.

Take our universe of custom metadata fields as an example. Some fields will always have a single value (most of the DwC-based ones), and some others will have multiple values, like locations in treatments, or the OBO ones. For the fields with unique values, the idea of using a relational input (like a spreadsheet, to an extent) would work perfectly fine: we can take the value in that specific column/row and replace it on the server. But for the custom metadata fields with multiple values, we need to know the value to be changed, not only the new value. That leads me to the following questions:

  1. how should the user input the data in the incoming spreadsheet if we need the current state and the new desired state?
  2. how can we tell Lycophron which fields require the two values, as this will be used universally?
  3. are spreadsheets still the best input now that we consider this case?

I've some ideas in mind, but I'll let you start the brainstorm here.
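To seed the brainstorm, one hypothetical convention: a cell for a multi-valued field could hold one operation per line, where "old => new" means replace and a bare value means append. The syntax and the parser below are invented here for illustration, not part of Lycophron:

```python
def parse_update_cell(cell):
    """Parse a hypothetical multi-value update cell into operations:
    "old => new" lines become replacements, bare lines become appends."""
    ops = []
    for line in cell.splitlines():
        if "=>" in line:
            old, new = (part.strip() for part in line.split("=>", 1))
            ops.append(("replace", old, new))
        else:
            ops.append(("append", None, line.strip()))
    return ops
```

For example, a cell containing "Zurich => Bern" on one line and "Geneva" on the next would replace the first location and append the second.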

Thanks!

Improve docs on CSV fields

We can integrate the following bullets into the main docs of the CSV fields:


  • Each line represents a record that will be created on Zenodo
  • Required fields are marked as bold in the header. Fields that don’t have a value are skipped.
  • For the doi field:
    • It should be filled in if there is a DOI already registered for an entry
    • If not filled, we’ll register a Zenodo DOI for the record
  • You’ll notice that the fields are a somewhat “de-normalized” version of the JSON representation we’re using on Zenodo. Since we’re often dealing with “complex” fields such as multi-level nesting of arrays of objects, we have taken some liberty with the data formatting to allow representing these values. Some examples of such fields:
    • Keywords (subjects.subject): the cell value contains “new-line” separated keywords
    • Creators/authors (creators.*): following the “new-line” separated convention, these have been “tabularized”. In the example there are two authors: Nils Schlüter (affiliation: Museum für Naturkunde, ORCID: 0000-0002-5699-3684) and John Smith (affiliation: CERN, ORCID: none)
  • Some of the fields rely on controlled vocabularies (e.g. the resource types, contributor types, licenses, related identifier relation types, etc.). The values for these types can be found under the following endpoints (to which you can add a ?q=<search term> query string parameter to narrow down results)
  • For custom fields we have a reference sheet at https://docs.google.com/spreadsheets/d/1TUyDT6yOypX2DBuM_PNUZucFTC93uFlEa7PoAMYvnDI/edit#gid=314238332, but the basic premise is that they correspond to known vocabularies such as DarwinCore, AudubonCore, etc. They all accept multiple terms.
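As an illustration of the "tabularized" creators convention described above, a parser could zip the newline-separated sub-field cells line by line (the function and the one-cell-per-sub-field split are a sketch, not the actual template layout):

```python
def parse_creators(names_cell, affiliations_cell, orcids_cell):
    """Line i of each cell describes the same creator; an empty line
    means that sub-field is absent for that creator."""
    creators = []
    for name, affiliation, orcid in zip(names_cell.split("\n"),
                                        affiliations_cell.split("\n"),
                                        orcids_cell.split("\n")):
        creator = {"name": name}
        if affiliation:
            creator["affiliation"] = affiliation
        if orcid:
            creator["orcid"] = orcid
        creators.append(creator)
    return creators
```

For the two-author example above, `parse_creators("Nils Schlüter\nJohn Smith", "Museum für Naturkunde\nCERN", "0000-0002-5699-3684\n")` yields two creator dicts, the second without an ORCID.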
