lycophron's People

Contributors

alejandromumo, jrcastro2, mguidoti, punkish, slint, yashlamba


lycophron's Issues

Game plan, thoughts and discussions

Ok, so, this is a summary of what I discussed with @slint on Skype today, a catch-up on what we previously discussed at the last Arcadia Sprint meeting at CERN (Feb/2020). This is a joint Plazi-Zenodo effort that aims at a complete redo of Lycophron, in order to deliver a tool that can handle any use case, not only specific ones, with better performance and reliability.

  • Lycophron should have a separate module to handle the Zenodo communication;
  • It should load/export data using Pandas Dataframes;
  • It must be a CLI tool (Click), accepting commands such as upload, update, publish, delete, and uninstall, with parameters to toggle sandbox mode, define the export file, edit sensitive information (e.g. the API token), and so on;
  • It should use .env (python-dotenv) to keep sensitive information and other eventual parameters of the tool;
  • Schema-based data validation (possible libs: Pydantic, Marshmallow);
  • It should be able to auto-match provided columns with Zenodo fields, asking the user to confirm, if a match is not exact, before making any API call;
  • We should use Celery for concurrent processing;
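A minimal sketch of the Click CLI surface described above (the command names come from the plan; the option names, flag behavior, and messages are assumptions, not the actual implementation):

```python
import click


@click.group()
@click.option("--sandbox/--no-sandbox", default=True,
              help="Toggle Zenodo sandbox mode (assumed flag name).")
@click.pass_context
def cli(ctx, sandbox):
    """Lycophron: batch upload records to Zenodo."""
    ctx.obj = {"sandbox": sandbox}


@cli.command()
@click.argument("export_file", type=click.Path())
@click.pass_context
def upload(ctx, export_file):
    """Create new draft records from EXPORT_FILE."""
    target = "sandbox" if ctx.obj["sandbox"] else "production"
    click.echo(f"uploading {export_file} to {target}")


@cli.command()
def publish():
    """Publish previously uploaded drafts."""
    click.echo("publishing drafts")
```

The remaining commands (update, delete, uninstall) would follow the same pattern, reading the API token and other sensitive settings from a .env file via python-dotenv.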

The first step is building the Zenodo communication module; the next step would be implementing the first commands for the CLI.

Tomorrow I'll work on setting labels, milestones and creating templates for issues, and the README (at least the skeleton).

What do you think, @slint ?

Cheers!

Denormalize/parse MfN input sheet into Lycophron template

  • Define unique IDs for all the objects (specimens and photos)
  • Fill in bi-directional links between specimen <-> photo
  • Map input data to DarwinCore/AudubonCore metadata fields
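The first two steps could be sketched with plain dicts (the column names, the matching key, and the ID scheme below are assumptions, not the actual MfN sheet layout):

```python
specimens = [{"catalog_no": "MfN-001"}]
photos = [
    {"file": "img_01.jpg", "catalog_no": "MfN-001"},
    {"file": "img_02.jpg", "catalog_no": "MfN-001"},
]

# 1. Define unique IDs for all objects.
for i, s in enumerate(specimens, start=1):
    s["id"] = f"specimen{i:03d}"
for i, p in enumerate(photos, start=1):
    p["id"] = f"photo{i:03d}"

# 2. Fill in bi-directional specimen <-> photo links,
#    matching photos to specimens by catalog number.
by_catalog = {s["catalog_no"]: s for s in specimens}
for p in photos:
    s = by_catalog[p["catalog_no"]]
    p["specimen_id"] = s["id"]
    s.setdefault("photo_ids", []).append(p["id"])
```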

Depends on new input from the MfN folks: plazi/arcadia-project#234

Support external DOIs

Currently there are two issues related to DOIs:

  • External DOIs are not updated in the record's metadata
  • Records without DOIs are not accepted.

Upload bats records from sandbox to production

Hi Donat and @flsimoes ,

Last week, Manuel finished the Sandbox upload of the ~230-record bats collection from the Google Sheet that Felipe and Juliana shared in the last Arcadia sprint.

Some pending action items before we go ahead with uploading to production:

  • You’ll notice that the community includes 225 records, while the Google sheet has 239 rows (240 - 1 header).
    • 5 records with existing DOIs were already uploaded from previous tries
    • 5 records had an invalid affiliation value, but we can clean this up in the Google sheet
  • 45 entries in the Google sheet were missing DOIs, but we uploaded them anyway to verify the metadata: https://sandbox.zenodo.org/communities/bats_project/search?q=exists:conceptdoi
  • The metadata on the records (or at least a sample of them) has to be verified, just to make sure we didn’t mess up anything in the process. These are metadata-only updates which is fine to also perform later on (even after the production upload).

Let me know if you have any questions, I’ll be off next week on holidays, but we can re-route to the rest of the team so they can help.

Cheers,
Alex

PS: I couldn’t find Juliana’s email address, so feel free to forward this or add her in the loop.

Bi-directional linking

  • Define an input convention for the import template on how to bi-directionally link identifiers
    • Could be something like {<item_id>:<identifier>:<is_bidirectional>}, e.g. {specimen001:doi:true}
  • Implement the functionality in the Zenodo metadata serializer
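A tiny parser for the proposed convention might look like this (a sketch only; the convention itself is still up for discussion):

```python
def parse_link(token):
    """Parse the proposed "{<item_id>:<identifier>:<is_bidirectional>}"
    convention, e.g. "{specimen001:doi:true}"."""
    item_id, identifier, bidirectional = token.strip("{}").split(":")
    return item_id, identifier, bidirectional.lower() == "true"
```

For example, `parse_link("{specimen001:doi:true}")` yields `("specimen001", "doi", True)`.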

XLS template for upload

@alejandromumo @slint Where can I find a template XLS to be used to upload articles to BLR?

In the BiCIKL TNA projects, we need these XLS templates to hand out to the awardees so that they can add their publications in a format that saves us time when uploading.

thanks for a link

Donat

conversion of taxodros bibliography

@slint @jhpoelen @lnielsen
Here is a draft of a CSV for the lycophron upload to Zenodo.
https://docs.google.com/spreadsheets/d/1f-_6MFzObIBlxeCaEtHD5ZRF0Kwj0zKq_BiPvbEyYSg/edit#gid=0

Alex, can you please have a look at it and let me know? Also, maybe you could indicate in a color which fields are required. I marked some, based on the * in the upload form.

I am not sure how to add multiple contributors or keywords with line breaks in a single field. When I save and reopen the CSV file, it no longer looks the same as the XLS.

What do you recommend when we have the bibliographic reference as a single string, such as "Nature 541: 136-138", and the authors as a string as well?
Do we need to parse these out in a first round, or just add them as-is?

Thanks

Donat
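Regarding the line-break question above: Python's csv module stores embedded newlines by quoting the field, which is also why a spreadsheet may display the cell differently after a CSV round-trip. A minimal round-trip sketch (the column names are illustrative):

```python
import csv
import io

# Write a record whose "keywords" cell holds newline-separated values;
# csv.writer quotes the field automatically.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "keywords"])
writer.writerow(["Drosophila notes", "taxonomy\nDiptera\nbibliography"])

# Reading it back preserves the embedded newlines inside the cell.
buf.seek(0)
rows = list(csv.reader(buf))
```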

Package manager of choice

Hi Alex,

Are you ok with pipenv as the project's package manager, keeping a Pipfile here, or do you prefer to stick with requirements.txt?

I'll be using pipenv anyway, and I can export a requirements.txt if you prefer it that way.

Please, let me know.

Cheers,

Add processing status for each record

The current data model gives each record a "status" field, allowing the user to track the progress of their uploads.
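A minimal illustration of such a per-record status using only the standard library's sqlite3 (the schema and status values here are assumptions, not Lycophron's actual model):

```python
import sqlite3

STATUSES = ("queued", "uploading", "uploaded", "failed")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO record VALUES ('specimen001', 'queued')")

def set_status(conn, record_id, status):
    # Guard against unknown states before touching the database.
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    conn.execute("UPDATE record SET status = ? WHERE id = ?",
                 (status, record_id))

set_status(conn, "specimen001", "uploading")
```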

The application supports this, but there is currently an error when accessing the database inside a Celery task:

  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1964, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 747, in do_execute
    cursor.execute(statement, parameters)
MemoryError
[2023-03-27 15:36:41,853: ERROR/MainProcess] Pool callback raised exception: MemoryError('Process got: ')
Traceback (most recent call last):
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/billiard/pool.py", line 1796, in safe_apply_callback
    fun(*args, **kwargs)
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/celery/worker/request.py", line 730, in on_success
    return self.on_failure(retval, return_ok=True)
  File "/Users/alejandromumo/.virtualenvs/lycophron/lib/python3.9/site-packages/celery/worker/request.py", line 545, in on_failure
    raise MemoryError(f'Process got: {exc}')
MemoryError: Process got:

It seems that the engine (SQLite) fails while executing the cursor to fetch the data; Celery then wraps the failure in a MemoryError.

Coping with UPDATE of custom metadata fields with multiple values/entries

Hi Alex,

We recently discussed by email how to update custom metadata fields, and that raised a couple of questions for me, especially because we're designing this tool to be used universally, not exclusively by our domain.

Take our universe of custom metadata fields as an example. Some fields will always have a single value (most of the DwC-based ones), and some others will have multiple values, like locations in treatments, or the OBO ones. For the fields with unique values, the idea of using a relational input (like a spreadsheet, to an extent) would work perfectly fine: we can take the value in that specific column/row and replace it on the server. But for the custom metadata fields with multiple values, we need to know the value to be changed, not only the new value. That leads me to the following questions:

  1. how should the user input the data in the incoming spreadsheet if we need the current state and the new desired state?
  2. how can we tell Lycophron which fields require the two values, as this will be used universally?
  3. are spreadsheets still the best input now that we consider this case?

I've some ideas in mind, but I'll let you start the brainstorm here.
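To seed the brainstorm, one hypothetical convention: a cell for a multi-valued field could hold one operation per line, where "old => new" means replace and a bare value means append. The syntax and the parser below are invented here for illustration, not part of Lycophron:

```python
def parse_update_cell(cell):
    """Parse a hypothetical multi-value update cell into operations:
    "old => new" lines become replacements, bare lines become appends."""
    ops = []
    for line in cell.splitlines():
        if "=>" in line:
            old, new = (part.strip() for part in line.split("=>", 1))
            ops.append(("replace", old, new))
        else:
            ops.append(("append", None, line.strip()))
    return ops
```

For example, a cell containing "Zurich => Bern" on one line and "Geneva" on the next would replace the first location and append the second.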

Thanks!

Improve docs on CSV fields

We can integrate the following bullets into the main docs of the CSV fields:


  • Each line represents a record that will be created on Zenodo
  • Required fields are marked as bold in the header. Fields that don’t have a value are skipped.
  • For the doi field:
    • It should be filled in if there is a DOI already registered for an entry
    • If not filled, we’ll register a Zenodo DOI for the record
  • You’ll notice that the fields are a somewhat “de-normalized” version of the JSON representation we’re using on Zenodo. Since we’re often dealing with “complex” fields such as multi-level nesting of arrays of objects, we have taken some liberty with the data formatting to allow representing these values. Some examples of such fields:
    • Keywords (subjects.subject): the cell value contains “new-line” separated keywords
    • Creators/authors (creators.*): following the “new-line” separated convention, these have been “tabularized”. In the example there are two authors: Nils Schlüter (affiliation: Museum für Naturkunde, ORCID: 0000-0002-5699-3684) and John Smith (affiliation: CERN, ORCID: none)
  • Some of the fields rely on controlled vocabularies (e.g. the resource types, contributor types, licenses, related identifier relation types, etc.). The values for these types can be found under the following endpoints (to which you can add a ?q=<search term> query string parameter to narrow down results)
  • For custom fields we have a reference sheet at https://docs.google.com/spreadsheets/d/1TUyDT6yOypX2DBuM_PNUZucFTC93uFlEa7PoAMYvnDI/edit#gid=314238332, but the basic premise is that they correspond to known vocabularies such as DarwinCore, AudubonCore, etc. They all accept multiple terms.
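As an illustration of the "tabularized" creators convention described above, a parser could zip the newline-separated sub-field cells line by line (the function and the one-cell-per-sub-field split are a sketch, not the actual template layout):

```python
def parse_creators(names_cell, affiliations_cell, orcids_cell):
    """Line i of each cell describes the same creator; an empty line
    means that sub-field is absent for that creator."""
    creators = []
    for name, affiliation, orcid in zip(names_cell.split("\n"),
                                        affiliations_cell.split("\n"),
                                        orcids_cell.split("\n")):
        creator = {"name": name}
        if affiliation:
            creator["affiliation"] = affiliation
        if orcid:
            creator["orcid"] = orcid
        creators.append(creator)
    return creators
```

For the two-author example above, `parse_creators("Nils Schlüter\nJohn Smith", "Museum für Naturkunde\nCERN", "0000-0002-5699-3684\n")` yields two creator dicts, the second without an ORCID.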
