
invenio-vocabularies's Introduction

Invenio-Vocabularies

Invenio module for managing vocabularies, based on Invenio-Records and Invenio-Records-Resources. This module provides:

  • Factories for easily generating models, record API classes, services, and resources
  • Helpers for importing vocabularies

Further documentation is available on https://invenio-vocabularies.readthedocs.io/


invenio-vocabularies's Issues

cli: vocabulary list

Implement a command to list all vocabularies, including subtypes:

invenio vocabularies list

Do we foresee the list of vocabularies containing more than 50-100 entries? If so, we might want to implement a search command too: invenio vocabularies search <vocabulary_name>
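
A minimal sketch of what the command could look like (a hedged assumption, not an agreed design; it assumes the VocabularyType model exposes an id column):

import click
from flask.cli import with_appcontext


@click.command("list")
@with_appcontext
def list_vocabularies():
    """List all vocabulary types."""
    # Hypothetical sketch; assumes VocabularyType is queryable via SQLAlchemy.
    from invenio_vocabularies.records.models import VocabularyType

    for vocab_type in VocabularyType.query.all():
        click.echo(vocab_type.id)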

Improve search performance

The vocabulary type name resolution has a significant impact on performance, particularly when listing records.

What happens is that the dump/load functions (in dumper_extensions.py) are called for each search result item. The vocabulary type resolution therefore happens for each of them, and that seems to be the bottleneck.
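
One possible mitigation, sketched here as an assumption rather than the agreed fix, is to memoize the type resolution so it runs once per vocabulary type instead of once per search hit:

from functools import lru_cache


@lru_cache(maxsize=None)
def resolve_type(type_id):
    """Resolve and cache a vocabulary type by its id (hypothetical helper)."""
    # Cache invalidation on vocabulary type changes is left out of this sketch.
    from invenio_vocabularies.records.models import VocabularyType

    return VocabularyType.query.filter_by(id=type_id).one()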

cli: vocabulary convert

Implement a command to convert/process the content of a vocabulary. This would be useful to process big files that are in other formats, contain many unused fields, etc.

invenio vocabularies convert <vocabulary> <orig_file_path> <dst_file_path>

fixtures/datastreams: implement async loading

Currently vocabularies are loaded synchronously, which blocks the system for some time (a long time, in the case of big vocabularies such as affiliations). We need to do it asynchronously (via Celery tasks).

See previous implementation example.
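
A minimal sketch of dispatching the loading to a Celery task; the load_vocabulary helper is a hypothetical stand-in for the actual fixture-loading entry point:

from celery import shared_task


@shared_task(ignore_result=True)
def load_vocabulary_async(vocabulary, origin):
    """Load a vocabulary fixture in the background."""
    # Hypothetical helper standing in for the real fixture-loading code.
    from invenio_vocabularies.fixtures import load_vocabulary

    load_vocabulary(vocabulary, origin)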

global: consolidate vocabulary loading/import workflow

Context

Currently, the mechanics to import vocabularies (fixtures) live in invenio-rdm-records. In order to make them more maintainable and easier to use, we want to make loading composable.

Tasks

- [ ] Analyze and define a list of loading use cases; potentially each one of them will be a mixin:
- load from a local file (which formats are supported? related #16)
- load from a remote file
- prioritized loading
- filtered loading (e.g. load only some entries based on a condition)
- loading of simple vocabularies (no subtypes)
- loading of nested vocabularies (a type with several subtypes, we only want to support one nested level)
- load using bulk creation/indexing, note that some vocabularies might contain millions of entries

  • Define a base class for the loaders, taking into account the existing one in invenio-rdm-records and already working cases like the Zenodo OpenAIRE loader

  • Implement the previously defined mixins/use cases. If they cannot be generalized (i.e. they need knowledge of the schema) they should be implemented in invenio-rdm-records.

- [ ] Implement the FixturesEngine with the new loaders

cli: vocabulary update

Implement a command to update the content of a vocabulary, i.e.

invenio vocabularies update resourcetype

Must take into account the vocabulary nature and support:

  • updating from a local file
  • updating from a remote file
  • prioritized update
  • filtered update (e.g. update based on a condition, like having the CERN ROR)
  • update only a subtype (e.g. subjects-mesh)

Most of these functionalities should be available through loader mixins.

Potentially add ignore and force arguments to the BaseFixture._load_vocabulary function to support updating entries.

subjects not indexed in the proper index

The default cookiecutter subjects are not indexed in the appropriate index (subjects-subject-v1.0.0-...)

~ curl -k -XGET localhost:9200/_cat/indices
yellow open affiliations-affiliation-v1.0.0-1626423514  vhALcuX2QUOY6R-kny0vcw 1 1   15 0 60.5kb 60.5kb
yellow open rdmrecords-records-record-v4.0.0-1626423514 _8V5SPMdTASYTzfH4p0wFg 1 1    0 0   208b   208b
yellow open vocabularies-vocabulary-v1.0.0-1626423514   imc_7kMJR7egbg6tKNyUzA 1 1 8394 0  3.9mb  3.9mb
yellow open communities-communities-v1.0.0-1626423514   y-kDNW6pS4KJ6e0aZh7yrQ 1 1    0 0   208b   208b
yellow open subjects-subject-v1.0.0-1626423514          uwsPZfUKTx2Yi3g4Wux9Pg 1 1    0 0   208b   208b
yellow open rdmrecords-records-record-v2.0.0-1626423514 wIaLUvhTTPe4RxHxrhCMsw 1 1    0 0   208b   208b
yellow open rdmrecords-records-record-v3.0.0-1626423514 RiNSvG1hRqmSl5QJQsEFgA 1 1    0 0   208b   208b
green  open .kibana_1                                   gkfpjmHFQ8OUstOiKWl3cQ 1 0    2 0  9.1kb  9.1kb
yellow open rdmrecords-drafts-draft-v4.0.0-1626423514   V4tCUs6eST2Opra2T2tboQ 1 1    0 0   208b   208b
yellow open rdmrecords-drafts-draft-v3.0.0-1626423514   9guxeetcRSa0pQ-QTP9VdA 1 1    0 0   208b   208b
yellow open rdmrecords-drafts-draft-v2.0.0-1626423514   bZc3ab5dSMOilUKG52Q-Zg 1 1    0 0   208b   208b

cli: vocabulary export/dump

Implement a command to export/dump the content of a vocabulary, i.e.

invenio vocabularies dump resourcetype <dst_file_path>

records: refactor jsonschema to use definitions

  • Refactor the vocabularies schema to use JSONRef definitions. Copy the internal-pid schema from RDM-Records' records/definitions.json to invenio-records-resources.

    • Add defs for $schema and id in invenio-records-resources.
  • Create an invenio_vocabularies/records/jsonschemas/vocabularies/definitions-v1.0.0.json

    • Add title, description, icon.

endpoints: overlapping with specialized vocabulary

Since the specialized vocabularies define their own ES mappings, they don't share the same index as the generic vocabulary. Therefore, we have to manually register a new endpoint on /vocabulary/[our_specialized_vocabulary] for each specialized vocabulary type. However, this doesn't work because the endpoint /vocabulary is already registered by the generic vocabulary, and requests will fall back to that one, which returns an empty result because the vocabulary is not in this index.

For the subjects vocabulary, the current endpoint was set to /subjects instead of /vocabularies/subjects as a workaround to this limitation.

A solution might be to avoid matching the URL if the vocabulary type is not recognized; that way, it will always fall back to the correct serializer.

See inveniosoftware/invenio-rdm-records#312

global: generic vocabulary

Based on the relevant topic in inveniosoftware/rfcs#20, we need a base "general"-purpose vocabulary API for facilitating some data-light use-cases. It should implement:

  • Model/Schema - based on some common properties (identifier, title, description, icon)
  • Service layer (CRUD operations in the DB, and indexing)
  • Resource/presentation layer (REST API, serialization, etc.)

datastreams: test loading use cases

As a result of #85

Make sure the following use cases are supported:

  • load from a local file (which formats are supported? related #16)
  • load from a remote file
  • prioritized loading
  • filtered loading (e.g. load only some entries based on a condition)
  • loading of simple vocabularies (no subtypes)
  • loading of nested vocabularies (a type with several subtypes, we only want to support one nested level)
  • load using bulk creation/indexing, note that some vocabularies might contain millions of entries

Note: updating vocabularies is a different issue #86

cli: common parameters

There are several commands with common arguments; these could be generalized in the group command. However, the interface would change and I don't really fancy it, e.g.

Now

invenio vocabularies import names ....

With generic arguments

invenio vocabularies names import

Migration script?

We should discuss what to do with migration generation. In #55 a migration script was suggested (see below), but perhaps it's not what we want.

#!/usr/bin/env bash
# -*- coding: utf-8 -*-
#
# Copyright (C) 2021 Northwestern University.
#
# Invenio-Vocabularies is free software; you can redistribute it and/or
# modify it under the terms of the MIT License; see LICENSE file for more
# details.

# Quit on errors
set -o errexit

# Quit on unbound symbols
set -o nounset

# Always bring down docker services
function cleanup {
    eval "$(docker-services-cli down --env)"
}
trap cleanup EXIT

if [[ "$#" -ne 2 ]]; then
    echo "Usage: ./gen-migration.sh <parent_id> <revision msg>"
    exit 1
fi

parent_id=$1
message=$2

eval "$(docker-services-cli up --db ${DB:-postgresql} --search ${ES:-elasticsearch} --mq ${CACHE:-redis} --env)"
export INVENIO_SQLALCHEMY_DATABASE_URI=${SQLALCHEMY_DATABASE_URI}
invenio db drop --yes-i-know
invenio alembic upgrade
invenio alembic revision -p ${parent_id} "${message}"
# TODO: Automate this last part
echo "Now just extract path from output and move it to invenio_vocabularies/alembic/"
# sed Generating <path>" and; mv <path in output> invenio_vocabularies/alembic/

readers: implement directory reader

import os
import re


class DirectoryReader(BaseReader):
    """Directory reader."""

    def __init__(self, *args, regex=None, **kwargs):
        """Constructor."""
        self._regex = re.compile(regex) if regex else None
        super().__init__(*args, **kwargs)

    def read(self):
        """Open a directory and iterate through the files in the subdirs."""
        for subdir, dirs, files in os.walk(self._origin):
            for filename in files:
                if not self._regex or self._regex.match(filename):
                    with open(os.path.join(subdir, filename), "rb") as fp:
                        yield fp.read()
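
Example usage, assuming BaseReader's constructor stores the origin path as self._origin:

# Yield the raw bytes of every .xml file under the directory tree.
reader = DirectoryReader("/data/vocabularies", regex=r".*\.xml$")
for content in reader.read():
    print(len(content))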

cli: vocabulary delete

Implement a command to delete a vocabulary:

invenio vocabularies delete resourcetype

Must take into account the vocabulary nature and support:

  • deletion of only a subtype (e.g. subjects-mesh)

contrib: subjects vocabulary migration to datastreams

The BaseFixture only creates the parent vocabulary. It does not take care of vocabularies with schemes (e.g. subjects).

This is the only vocabulary that would support schemes:

  • Remove generic schemes table
  • Add it for schemes
  • ...

api: record API classes should implement custom `get_record` for their type

Generated (or default) record API classes for vocabularies that use the common VocabularyMetadata model for storage should be able to have a more "precise" get_record() method, which makes sure that the correct type of vocabulary is also fetched.

In practice, that means that it shouldn't be possible to call Language.get_record() with a License ID and get a result back.

A rough implementation of this was done here: https://github.com/inveniosoftware/invenio-rdm-records/blob/f5b7cbc483f4754ab1e592f492e830bf86fd772d/invenio_rdm_records/records/api.py#L30-L41
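
A hedged sketch of such an override, loosely following the linked rough implementation (the vocabulary_type attribute and the type relation access are assumptions):

from invenio_pidstore.errors import PIDDoesNotExistError


class Language(Vocabulary):
    """Language vocabulary record API class (hypothetical)."""

    vocabulary_type = "languages"  # assumed type identifier

    @classmethod
    def get_record(cls, id_, with_deleted=False):
        """Fetch a record, ensuring it belongs to this vocabulary type."""
        record = super().get_record(id_, with_deleted=with_deleted)
        if record.type.id != cls.vocabulary_type:
            # Behave as if the PID did not exist for this vocabulary type.
            raise PIDDoesNotExistError("vocid", str(id_))
        return record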

fixtures/datastreams: implement bulk importing

The BaseFixture writes one item at a time (via the datastream). Implement the required changes to support bulk import (e.g. create all items in the DB, one commit per item, but index all at once in ES).
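
A rough sketch under the assumptions that record_cls follows the invenio-records API and that invenio-indexer's bulk queue is used for the single ES indexing pass:

from invenio_db import db
from invenio_indexer.api import RecordIndexer


def bulk_import(records_data, record_cls):
    """Create records one commit at a time, then bulk-index them."""
    records = []
    for data in records_data:
        record = record_cls.create(data)
        db.session.commit()  # one commit per item, as proposed above
        records.append(record)

    # Queue all record ids and index them in one bulk pass.
    indexer = RecordIndexer()
    indexer.bulk_index([str(record.id) for record in records])
    indexer.process_bulk_queue()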

Vocabulary reference class implementation should be moved from rdm-records

Vocabularies have full Marshmallow schemas. However, when used nested, e.g. from rdm-records, a simple schema is used, normally containing only id and another attribute for custom cases.

Those schemas are defined in RDM-Records.

With the names vocabulary implementation, the AffiliationSchema had to be copy-pasted into this module. Should those schemas live in invenio-vocabularies?

Split Subjects into its own Vocabulary

Review the list of steps to do, like affiliations but for subjects:

Data layer

  • Define relation in CommonFieldsMixin.
    • Determine which fields to dump in ES index.
  • JSONSchema: Do not modify an already released schema; create a new version instead.
  • JSONSchema: Define or review the existing property
    • List or single?
    • Mixed linked with non-linked?
  • Mappings: adapt to the JSONSchema and dumped fields.

Service layer

  • Schema: Fix metadata schema.
  • Facets: Do we need facets? Guillaume: let's skip for now

Presentation layer

  • Fix REST API serializations
  • Fix other serialization formats.

Tests

  • Add tests that record linking is working as expected (e.g. from service layer).
    • Bad and good data.
  • Add REST API tests for serialization, facets
  • Ensure that I18N is tested properly if required.

Follow structure from:

https://codimd.web.cern.ch/vc6wJipAS66l1XjwKjYdmA?both#

datastreams: OrcidTransformer should use a Marshmallow schema

The OrcidTransformer applies custom logic to extract the name record. This logic could be moved to a Marshmallow schema. For this, we would need to:

- [ ] Create a MarshmallowTransformer, which receives a dictionary and loads it into the schema. The schema should be configurable.
- [ ] Create a Marshmallow schema that can load an ORCiD record into a name record.
- [ ] Change the datastream configuration for the transformer. Something like:

transformers: [
    {
        type: xml
    }, {
        type: marshmallow,
        args: {
            schema: ORCiDNameSchema
        }
    }
]
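
A minimal sketch of the proposed MarshmallowTransformer; the apply()/stream entry interface mirrors the existing datastreams transformers and is an assumption:

class MarshmallowTransformer(BaseTransformer):
    """Transform an entry by loading it through a Marshmallow schema."""

    def __init__(self, schema, **kwargs):
        """Constructor, with `schema` a Marshmallow Schema class."""
        self._schema = schema()
        super().__init__(**kwargs)

    def apply(self, stream_entry, **kwargs):
        """Load the entry dict into the configured schema."""
        stream_entry.entry = self._schema.load(stream_entry.entry)
        return stream_entry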

ui faceting via vocabulary label

The facet class should be created in /services/facets.py and called VocabularyLabels. It will be used from the service configuration, e.g. here.

This class will be set as value_labels attribute of e.g. the TermsFacet or any other inheriting from LabelledFacetMixin. Therefore, it will be called when get_label_mapping is invoked.

The label class should implement the __call__ method and return a dict of {id: label}. See for example the RecordRelationLabel.
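
A hedged sketch of the labelling class; the service proxy, the type keyword (taken from the search() signature discussed elsewhere in this tracker), and the shape of the search results are assumptions:

from flask_principal import AnonymousIdentity

from invenio_vocabularies.proxies import current_service


class VocabularyLabels:
    """Fetch labels for facets based on a vocabulary."""

    def __init__(self, vocabulary):
        """Constructor, with `vocabulary` the vocabulary type id."""
        self.vocabulary = vocabulary

    def __call__(self, ids):
        """Return a {id: label} mapping for the given facet values."""
        # Assumed: search() accepts a `type` kwarg and yields hit dicts.
        results = current_service.search(AnonymousIdentity(), type=self.vocabulary)
        return {h["id"]: h["title"] for h in results.hits if h["id"] in ids}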

Questions:

  • How do we deal with i18n? i.e. how do we know which key inside the props (en, de, etc.) to use?
  • Is it worth implementing an "interface" to make clear which methods should be implemented by XYZLabels classes (i.e. __call__), so the way of calling them in the XYZFacets classes is always the same? On the other hand, if __call__ is not implemented, an exception will be thrown, so we have an "implicit" interface.

EDIT, discussion IRL:

  • How to access the service: for generic vocabularies we will use the available proxy; for those that require specific details and services, there will be another labelling class that will need to be aware of the service (e.g. receive it in the constructor)
  • How to access the identity: it is required by the service, but it should suffice with the AnonymousIdentity since there are no permissions enforced there (they are, but the policy is AnyUser).
  • How to deal with i18n: make it into a lazy function, in a similar way to lazy_gettext (use speaklater), and let the labelling system get it later on. Use Marshmallow-Utils:gettext_from_dict as the function.

autocompletion query logic not working

Package version (if known): v0.1.5

Describe the bug

The autocompletion query is not returning logical values. For example, typing e starts returning Azerbaijani. The query sent is e*, so it is weird that Azerbaijani is returned since it does not start with e.

(Screenshot: the language dropdown suggesting Azerbaijani for the query e.)

Other queries, for example s, for which one would expect Spanish or similar, return:

        "hits": [
            {
                "id": "jxvc9-18d97",
                "type": 1,
                "title": "Afar"
            },
            {
                "id": "gymv3-dj020",
                "type": 1,
                "title": "Abkhazian"
            },
            {
                "id": "w1p7s-9ny73",
                "type": 1,
                "title": "Afrikaans"
            },
            {
                "id": "trfrg-8f634",
                "type": 1,
                "title": "Akan"
            },
            {
                "id": "wfw9v-kyp21",
                "type": 1,
                "title": "Amharic"
            }
        ],

Queries with normal ES query string syntax, like using quotes for exact match, do not work. Also tested Eng, eng and english.

Steps to Reproduce

  1. Bootstrap RDM
  2. Go to the deposit form
  3. Type e (or s) in the language field.

Expected behavior

Obtain English or other languages that start with e.

Additional context

Possible issues:

  • The field is of type keyword in ES.
  • There is no analyzer set on it. One possibility is to use the ES built-in autocomplete, which has limitations, or something more custom and complete like:
"tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
}
"analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        },
}

Then, in the field:

"<field name>": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
},

naming: `Base` objects with functionality

In the case of datastreams, there is a base class with an implementation that in many cases suffices. In addition, it is not overridable.
See examples of cases in factories.py. I'm wondering if there should be something similar to, e.g., the readers:

  • BaseDataStream with the skeleton and constructor
  • DataStream with the current implementation

Note that this also happens with BaseFixture.

Implement read_many

When serializing to DataCite, having a read_many to retrieve multiple subjects (vocabularies) at once would be great.

Override read_all to filter by type like search

invenio-vocabularies overrides its search method to filter by type:

def search(self, identity, params=None, es_preference=None, type=None,  # <-- this
           **kwargs):

A similar read_all is needed in VocabulariesService to read only the documents of a specific vocabulary type.

In particular, this is needed if we want to use read_all to get all possible resource types in the deposit page dropdown.
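
A hedged sketch of the method; the internal search factory and the "type.id" field name mirror what search() presumably does under the hood and are assumptions:

def read_all(self, identity, fields, type, **kwargs):
    """Read all documents of the given vocabulary type."""
    self.require_permission(identity, "search")
    # Assumed internal helper mirroring the existing search() plumbing.
    search = self._create_search(identity, params={}, es_preference=None, **kwargs)
    search = search.filter("term", **{"type.id": type}).source(fields)
    return search.scan()  # iterate over every matching document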

Custom vocabulary import

At Northwestern, we want to have control over the list of licenses shown: a large list was deemed too intimidating and rife with potential "inaccurate choices". As such, as an instance installer, I would like to be able to import my customized list of a specific vocabulary.

Example for licenses (if licenses need a custom import format):

invenio vocabularies import licenses my_licenses.csv

(if any vocabulary has the same interface - you could have the first line of the csv provide the metadata about the vocabulary itself)

invenio vocabularies import licenses.csv 

The same is true for any vocabulary.

permissions: implement "read-only" permission policy

Currently, the vocabularies REST API is fully unprotected. This task is about implementing a "read-only" REST API for vocabularies.

  1. Implement read-only permission policy
  2. Ensure that vocabularies can still be created programmatically via the service
  3. Extensively test protection of all methods on the REST API.

The current permission policy should only allow can_search and can_read for any user. The other actions should require a permission that allows us to create vocabulary items programmatically, but which prevents the REST API from being used.

Not sure exactly how to do this, but one idea is to:

  1. Create a new system role named system_process.
  2. Create a hard-coded identity system_identity that has the system role need.
from flask_principal import Identity
from invenio_access.permissions import SystemRoleNeed

system_process = SystemRoleNeed("system_process")
system_identity = Identity(None)
system_identity.provides.add(system_process)

Then the remaining actions like can_create should simply require the system_process system role, and the code which needs to create the vocabulary records can use the system_identity: service.create(system_identity, data)
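
A hedged sketch of the resulting policy; the generator names follow invenio-records-permissions, with SystemProcess as the assumed generator for the new system role:

from invenio_records_permissions import RecordPermissionPolicy
from invenio_records_permissions.generators import AnyUser, SystemProcess


class PermissionPolicy(RecordPermissionPolicy):
    """Read-only REST API: write actions only via the system identity."""

    can_search = [AnyUser()]
    can_read = [AnyUser()]
    can_create = [SystemProcess()]
    can_update = [SystemProcess()]
    can_delete = [SystemProcess()]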

cli: how to test click commands

This is a much bigger task, but we should manage to standardize CLI testing... The problem with these two approaches:
a) The current function requires an external one and does not call the corresponding click command.
b) The update cmd test can only test the exit status, no reads or anything (because there are 2 app contexts? 🤔)

See test_cli.py::test_process vs test_cli.py::test_update_cmd

datastreams: implement LoggingWriter

Implement a writer that can be used to print to console, file, logs, sentry... pseudocode:

import logging

from flask import current_app


class LoggingWriter(BaseWriter):
    """Writer that logs entries via the application logger."""

    def write(self, entry, level=logging.INFO):
        """Log the entry at the given level."""
        current_app.logger.log(level, entry)

contrib: define data model for names (orcid) vocabulary

Based on the ORCiD dump, a data model (fields/types) needs to be defined. See, for example, the affiliations jsonschema.

Note that this new vocabulary will extend the base (generic) vocabulary, which means that the vocabulary items will be records. Therefore, the following attributes are already available/present:

  • id (str)
  • created (date)
  • updated (date)
  • links (links list)
  • revision_id (int)
  • title (i18n str)
  • description (i18n str)
  • icon (str)

See BaseRecordSchema and BaseVocabularySchema

contrib: create names (orcid) vocabulary

Context

Creators and Contributors become a vocabulary, the whole object not just the ORCiD identifier.

Pre-requisites

  • Define a name for the vocabulary: names (approved by Lars)
  • Define a data model (which fields will the vocabulary contain) #83

Note: the following steps of the implementation should follow the same semantics/mechanism used for subjects and affiliations.

Data Layer

  • Create a package in invenio_vocabularies/contrib/<name_plural>
  • JSONSchema: jsonschemas/<name plural>/<name>-v1.0.0.json
    • Use defs title, description, icon from invenio_vocabularies/records/jsonschemas/vocabularies/definitions-v1.0.0.json (preferably)
    • Use defs id, schema, pid from invenio-records-resources.
  • Mappings: Add mappings for v6 and v7 following (mappings/{v6,v7}/<name plural>/<name>-v1.0.0.json). Names and other attributes might contain non-Latin characters; for searchability we might want to use [asciifolding](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html).
  • Record type factory: <name-plural>.py:
    • Permissions policy: use invenio_vocabularies/services/permissions.py:PermissionPolicy
    • Endpoint /<name-plural>
  • API/Models: api.py and models.py (use from factory)
  • Alembic recipe for creating tables (invenio_vocabularies/alembic).

Service layer

  • Service schema: schema.py
    • Inherit from BaseVocabularySchema.
    • #95
  • Service: service.py
    • Add SearchOptions:
      • Need to check if we inherit from VocabularySearchOptions.
      • suggest_parser_cls may need tweaking
      • Need facets?
      • Need facets labelling? (i.e. will RDM-Records have facets over this vocab)
    • Customize components? Probably need DataComponent, PIDComponent.
    • Do we use PIDs or UUIDs?

Resource layer

  • Config:
    • May need to define serializer "application/vnd.inveniordm.v1+json" with associated schema.

Tests

  • Schema must be validated with good and bad data.
    • Data layer
    • Service layer
  • REST API
    • Serializations json and inveniordm v1.
    • Actions: Search, create, read, update, delete.

Complete all above first. Then:

Extra info: based on inveniosoftware/invenio-rdm-records#328; see the closing PRs to get an idea of the required changes/implementation.

contrib: custom pid provider for names

Problem
Name records are using the RecordIdV2 provider. The format of the generated id is fine; however, the PID type should be different (nameid).

Possible solution
Edit the PID provider factory to accept a base class, then modify the attribute pid_type. This is preferred over creating a custom provider, since only said attribute needs to be modified.

Other questions

  • How do we deal with duplicates if the PID is "random"? This is not a problem only for duplicates coming from different sets (e.g. ORCiD and GND) but also from the same ORCiD dump (testing for the existence of each element before inserting might be expensive).
  • How do we implement resolution per id (e.g. resolve an ORCiD)? We might want to put the identifiers in a pids field, in the same manner as we do for DOIs in the records.

For the developer

  • Must contain tests at service level
