
invenio-vocabularies's Introduction

Invenio-Vocabularies

Invenio module for managing vocabularies, based on Invenio-Records and Invenio-Records-Resources. This module provides:

  • Factories for easily generating models, record API classes, services, and resources
  • Helpers for importing vocabularies

Further documentation is available on https://invenio-vocabularies.readthedocs.io/


invenio-vocabularies's Issues

cli: vocabulary list

Implement a command to list all vocabularies, including subtypes:

invenio vocabularies list

Do we foresee the list of vocabularies containing more than 50-100 entries? If so, we might want to implement a search command too: invenio vocabularies search <vocabulary_name>
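
A minimal sketch of what the command could look like (a hedged assumption, not an agreed design; it assumes the VocabularyType model exposes an id column):

import click
from flask.cli import with_appcontext


@click.command("list")
@with_appcontext
def list_vocabularies():
    """List all vocabulary types."""
    # Hypothetical sketch; assumes VocabularyType is queryable via SQLAlchemy.
    from invenio_vocabularies.records.models import VocabularyType

    for vocab_type in VocabularyType.query.all():
        click.echo(vocab_type.id)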

Improve search performance

The vocabulary type name resolution has a significant impact on performance, particularly when listing records.

What happens is that the dump/load functions (in dumper_extensions.py) are called for each search result item. The vocabulary type resolution therefore happens for each of them, and that seems to be the bottleneck.
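
One possible mitigation, sketched here as an assumption rather than the agreed fix, is to memoize the type resolution so it runs once per vocabulary type instead of once per search hit:

from functools import lru_cache


@lru_cache(maxsize=None)
def resolve_type(type_id):
    """Resolve and cache a vocabulary type by its id (hypothetical helper)."""
    # Cache invalidation on vocabulary type changes is left out of this sketch.
    from invenio_vocabularies.records.models import VocabularyType

    return VocabularyType.query.filter_by(id=type_id).one()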

cli: vocabulary convert

Implement a command to convert/process the content of a vocabulary. This would be useful to process big files that are in other formats, contain many unused fields, etc.

invenio vocabularies convert <vocabulary> <orig_file_path> <dst_file_path>

fixtures/datastreams: implement async loading

Currently vocabularies are loaded synchronously, which blocks the system for some time (a long time, in the case of big vocabularies such as affiliations). We need to do it asynchronously (via Celery tasks).

See previous implementation example.
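
A minimal sketch of dispatching the loading to a Celery task; the load_vocabulary helper is a hypothetical stand-in for the actual fixture-loading entry point:

from celery import shared_task


@shared_task(ignore_result=True)
def load_vocabulary_async(vocabulary, origin):
    """Load a vocabulary fixture in the background."""
    # Hypothetical helper standing in for the real fixture-loading code.
    from invenio_vocabularies.fixtures import load_vocabulary

    load_vocabulary(vocabulary, origin)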

global: consolidate vocabulary loading/import workflow

Context

Currently, the mechanics to import vocabularies (fixtures) live in invenio-rdm-records. In order to make them more maintainable and easier to use, we want to make loading composable.

Tasks

- [ ] Analyze and define a list of loading use cases; potentially each one of them will be a mixin:
- load from a local file (which formats are supported? related #16)
- load from a remote file
- prioritized loading
- filtered loading (e.g. load only some entries based on a condition)
- loading of simple vocabularies (no subtypes)
- loading of nested vocabularies (a type with several subtypes, we only want to support one nested level)
- load using bulk creation/indexing, note that some vocabularies might contain millions of entries

  • Define a base class for the loaders, taking into account the existing one in invenio-rdm-records and already working cases like the Zenodo OpenAIRE loader

  • Implement the previously defined mixins/use cases. If they cannot be generalized (i.e. they need knowledge of the schema) they should be implemented in invenio-rdm-records.

- [ ] Implement the FixturesEngine with the new loaders

cli: vocabulary update

Implement a command to update the content of a vocabulary, i.e.

invenio vocabularies update resourcetype

Must take into account the vocabulary nature and support:

  • updating from a local file
  • updating from a remote file
  • prioritized update
  • filtered update (e.g. update based on a condition, like having the CERN ROR)
  • update only a subtype (e.g. subjects-mesh)

Most of these functionalities should be available through loader mixins.

Potentially add ignore and force arguments to the BaseFixture._load_vocabulary function to support updating entries.

subjects not indexed in the proper index

The default cookiecutter subjects are not indexed in the appropriate index (subjects-subject-v1.0.0-...)

~ curl -k -XGET localhost:9200/_cat/indices
yellow open affiliations-affiliation-v1.0.0-1626423514  vhALcuX2QUOY6R-kny0vcw 1 1   15 0 60.5kb 60.5kb
yellow open rdmrecords-records-record-v4.0.0-1626423514 _8V5SPMdTASYTzfH4p0wFg 1 1    0 0   208b   208b
yellow open vocabularies-vocabulary-v1.0.0-1626423514   imc_7kMJR7egbg6tKNyUzA 1 1 8394 0  3.9mb  3.9mb
yellow open communities-communities-v1.0.0-1626423514   y-kDNW6pS4KJ6e0aZh7yrQ 1 1    0 0   208b   208b
yellow open subjects-subject-v1.0.0-1626423514          uwsPZfUKTx2Yi3g4Wux9Pg 1 1    0 0   208b   208b
yellow open rdmrecords-records-record-v2.0.0-1626423514 wIaLUvhTTPe4RxHxrhCMsw 1 1    0 0   208b   208b
yellow open rdmrecords-records-record-v3.0.0-1626423514 RiNSvG1hRqmSl5QJQsEFgA 1 1    0 0   208b   208b
green  open .kibana_1                                   gkfpjmHFQ8OUstOiKWl3cQ 1 0    2 0  9.1kb  9.1kb
yellow open rdmrecords-drafts-draft-v4.0.0-1626423514   V4tCUs6eST2Opra2T2tboQ 1 1    0 0   208b   208b
yellow open rdmrecords-drafts-draft-v3.0.0-1626423514   9guxeetcRSa0pQ-QTP9VdA 1 1    0 0   208b   208b
yellow open rdmrecords-drafts-draft-v2.0.0-1626423514   bZc3ab5dSMOilUKG52Q-Zg 1 1    0 0   208b   208b

cli: vocabulary export/dump

Implement a command to export/dump the content of a vocabulary, i.e.

invenio vocabularies dump resourcetype <dst_file_path>

records: refactor jsonschema to use definitions

  • Refactor the vocabularies schema to use JSONRef definitions. Copy the internal-pid schema from RDM-Records' records/definitions.json to invenio-records-resources.

    • Add defs for $schema and id in invenio-records-resources.
  • Create an invenio_vocabularies/records/jsonschemas/vocabularies/definitions-v1.0.0.json

    • Add title, description, icon.

endpoints: overlapping with specialized vocabulary

Since the specialized vocabularies define their own ES mappings, they don't share the same index as the generic vocabulary. Therefore, we have to manually register a new endpoint on /vocabulary/[our_specialized_vocabulary] for each specialized vocabulary type. However, this doesn't work because the endpoint /vocabulary is already registered by the generic vocabulary, and requests will fall back to that one, which returns an empty result because the vocabulary is not in this index.

For the subjects vocabulary, the current endpoint was set to /subjects instead of /vocabularies/subjects as a workaround to this limitation.

A solution might be to avoid matching the URL if the vocabulary type is not recognized; that way, it will always fall back to the correct serializer.

See inveniosoftware/invenio-rdm-records#312

global: generic vocabulary

Based on the relevant topic in inveniosoftware/rfcs#20, we need a base "general"-purpose vocabulary API for facilitating some data-light use-cases. It should implement:

  • Model/Schema - based on some common properties (identifier, title, description, icon)
  • Service layer (CRUD operations in the DB, and indexing)
  • Resource/presentation layer (REST API, serialization, etc.)

datastreams: test loading use cases

As a result of #85

Make sure the following use cases are supported:

  • load from a local file (which formats are supported? related #16)
  • load from a remote file
  • prioritized loading
  • filtered loading (e.g. load only some entries based on a condition)
  • loading of simple vocabularies (no subtypes)
  • loading of nested vocabularies (a type with several subtypes, we only want to support one nested level)
  • load using bulk creation/indexing, note that some vocabularies might contain millions of entries

Note: updating vocabularies is a different issue #86

cli: common parameters

There are several commands with common arguments; these could be generalized in the group command. However, the interface would change and I don't really fancy it, e.g.

Now

invenio vocabularies import names ....

With generic arguments

invenio vocabularies names import

Migration script?

We should discuss what to do with migration generation. In #55 a migration script was suggested (see below), but perhaps it's not what we want.

#!/usr/bin/env bash
# -*- coding: utf-8 -*-
#
# Copyright (C) 2021 Northwestern University.
#
# Invenio-Vocabularies is free software; you can redistribute it and/or
# modify it under the terms of the MIT License; see LICENSE file for more
# details.

# Quit on errors
set -o errexit

# Quit on unbound symbols
set -o nounset

# Always bring down docker services
function cleanup {
    eval "$(docker-services-cli down --env)"
}
trap cleanup EXIT

if [[ "$#" -ne 2 ]]; then
    echo "Usage: ./gen-migration.sh <parent_id> <revision msg>"
    exit 1
fi

parent_id=$1
message=$2

eval "$(docker-services-cli up --db ${DB:-postgresql} --search ${ES:-elasticsearch} --mq ${CACHE:-redis} --env)"
export INVENIO_SQLALCHEMY_DATABASE_URI=${SQLALCHEMY_DATABASE_URI}
invenio db drop --yes-i-know
invenio alembic upgrade
invenio alembic revision -p ${parent_id} "${message}"
# TODO: Automate this last part
echo "Now just extract path from output and move it to invenio_vocabularies/alembic/"
# sed Generating <path>" and; mv <path in output> invenio_vocabularies/alembic/

readers: implement directory reader

import os
import re


class DirectoryReader(BaseReader):
    """Directory reader."""

    def __init__(self, *args, regex=None, **kwargs):
        """Constructor."""
        self._regex = re.compile(regex) if regex else None
        super().__init__(*args, **kwargs)

    def read(self):
        """Open a directory and iterate through the files in the subdirs."""
        for subdir, dirs, files in os.walk(self._origin):
            for filename in files:
                if not self._regex or self._regex.match(filename):
                    with open(os.path.join(subdir, filename), "rb") as fp:
                        yield fp.read()
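
Example usage, assuming BaseReader's constructor stores the origin path as self._origin:

# Yield the raw bytes of every .xml file under the directory tree.
reader = DirectoryReader("/data/vocabularies", regex=r".*\.xml$")
for content in reader.read():
    print(len(content))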

cli: vocabulary delete

Implement a command to delete a vocabulary:

invenio vocabularies delete resourcetype

Must take into account the vocabulary nature and support:

  • deletion of only a subtype (e.g. subjects-mesh)

contrib: subjects vocabulary migration to datastreams

The BaseFixture only creates the parent vocabulary. It does not take care of vocabularies with schemes (e.g. subjects).

This is the only vocabulary that would support schemes:

  • Remove generic schemes table
  • Add it for schemes
  • ...

api: record API classes should implement custom `get_record` for their type

Generated (or default) record API classes for vocabularies that use the common VocabularyMetadata model for storage should be able to have a more "precise" get_record() method, which makes sure that the correct type of vocabulary is also fetched.

In practice, that means that it shouldn't be possible to call Language.get_record() with a License ID and get a result back.

A rough implementation of this was done here: https://github.com/inveniosoftware/invenio-rdm-records/blob/f5b7cbc483f4754ab1e592f492e830bf86fd772d/invenio_rdm_records/records/api.py#L30-L41
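
A hedged sketch of such an override, loosely following the linked rough implementation (the vocabulary_type attribute and the type relation access are assumptions):

from invenio_pidstore.errors import PIDDoesNotExistError


class Language(Vocabulary):
    """Language vocabulary record API class (hypothetical)."""

    vocabulary_type = "languages"  # assumed type identifier

    @classmethod
    def get_record(cls, id_, with_deleted=False):
        """Fetch a record, ensuring it belongs to this vocabulary type."""
        record = super().get_record(id_, with_deleted=with_deleted)
        if record.type.id != cls.vocabulary_type:
            # Behave as if the PID did not exist for this vocabulary type.
            raise PIDDoesNotExistError("vocid", str(id_))
        return record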

fixtures/datastreams: implement bulk importing

The BaseFixture writes one item at a time (via the datastream). Implement the required changes to support bulk import (e.g. create all items in the DB, one commit per item, but index all at once in ES).
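
A rough sketch under the assumptions that record_cls follows the invenio-records API and that invenio-indexer's bulk queue is used for the single ES indexing pass:

from invenio_db import db
from invenio_indexer.api import RecordIndexer


def bulk_import(records_data, record_cls):
    """Create records one commit at a time, then bulk-index them."""
    records = []
    for data in records_data:
        record = record_cls.create(data)
        db.session.commit()  # one commit per item, as proposed above
        records.append(record)

    # Queue all record ids and index them in one bulk pass.
    indexer = RecordIndexer()
    indexer.bulk_index([str(record.id) for record in records])
    indexer.process_bulk_queue()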

Vocabulary reference class implementation should be moved from rdm-records

Vocabularies have full Marshmallow schemas. However, when used nested, e.g. from rdm-records, a simple schema is used, normally containing only id and another attribute for custom cases.

Those schemas are defined in RDM-Records.

With the names vocabulary implementation, the AffiliationSchema had to be copy-pasted into this module. Should those schemas live in invenio-vocabularies?

Split Subjects into its own Vocabulary

Review the list of steps to do, like affiliations but for subjects:

Data layer

  • Define relation in CommonFieldsMixin.
    • Determine which fields to dump in ES index.
  • JSONSchema: Do not modify an already released schema; create a new version instead.
  • JSONSchema: Define or review the existing property
    • List or single?
    • Mixed linked with non-linked?
  • Mappings: adapt to the JSONSchema and dumped fields.

Service layer

  • Schema: Fix metadata schema.
  • Facets: Do we need facets? Guillaume: let's skip for now

Presentation layer

  • Fix REST API serializations
  • Fix other serialization formats.

Tests

  • Add tests that record linking is working as expected (e.g. from service layer).
    • Bad and good data.
  • Add REST API tests for serialization, facets
  • Ensure that I18N is tested properly if required.

Follow structure from:

https://codimd.web.cern.ch/vc6wJipAS66l1XjwKjYdmA?both#

datastreams: OrcidTransformer should use a Marshmallow schema

The OrcidTransformer applies custom logic to extract the name record. This logic could be moved to a Marshmallow schema. For this, we would need to:

- [ ] Create a MarshmallowTransformer, which receives a dictionary and loads it into the schema. The schema should be configurable.
- [ ] Create a Marshmallow schema that can load an ORCiD record into a name record.
- [ ] Change the datastream configuration for the transformer. Something like:

transformers: [
    {
        type: xml
    }, {
        type: marshmallow,
        args: {
            schema: ORCiDNameSchema
        }
    }
]
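
A minimal sketch of the proposed MarshmallowTransformer; the apply()/stream entry interface mirrors the existing datastreams transformers and is an assumption:

class MarshmallowTransformer(BaseTransformer):
    """Transform an entry by loading it through a Marshmallow schema."""

    def __init__(self, schema, **kwargs):
        """Constructor, with `schema` a Marshmallow Schema class."""
        self._schema = schema()
        super().__init__(**kwargs)

    def apply(self, stream_entry, **kwargs):
        """Load the entry dict into the configured schema."""
        stream_entry.entry = self._schema.load(stream_entry.entry)
        return stream_entry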

ui faceting via vocabulary label

The facet class should be created in /services/facets.py and called VocabularyLabels. It will be used from the service configuration, e.g. here.

This class will be set as value_labels attribute of e.g. the TermsFacet or any other inheriting from LabelledFacetMixin. Therefore, it will be called when get_label_mapping is invoked.

The label class should implement the __call__ method and return a dict of {id: label}. See for example the RecordRelationLabel.
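
A hedged sketch of the labelling class; the service proxy, the type keyword (taken from the search() signature discussed elsewhere in this tracker), and the shape of the search results are assumptions:

from flask_principal import AnonymousIdentity

from invenio_vocabularies.proxies import current_service


class VocabularyLabels:
    """Fetch labels for facets based on a vocabulary."""

    def __init__(self, vocabulary):
        """Constructor, with `vocabulary` the vocabulary type id."""
        self.vocabulary = vocabulary

    def __call__(self, ids):
        """Return a {id: label} mapping for the given facet values."""
        # Assumed: search() accepts a `type` kwarg and yields hit dicts.
        results = current_service.search(AnonymousIdentity(), type=self.vocabulary)
        return {h["id"]: h["title"] for h in results.hits if h["id"] in ids}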

Questions:

  • How do we deal with i18n? i.e. how do we know which key inside the props (en, de, etc.) to use?
  • Is it worth implementing an "interface" to make clear which methods should be implemented by XYZLabels classes (i.e. __call__), so the way of calling them in the XYZFacets classes is always the same? On the other hand, if __call__ is not implemented, an exception will be thrown, so we have an "implicit" interface.

EDIT, discussion IRL:

  • How to access the service: for generic vocabularies we will use the available proxy; for those that require specific details and services, there will be another labelling class that will need to be aware of the service (e.g. receive it in the constructor)
  • How to access the identity: it is required by the service, but it should suffice with the AnonymousIdentity since there are no permissions enforced there (they are, but the policy is AnyUser).
  • How to deal with i18n: make it into a lazy function, in a similar way to lazy_gettext (use speaklater), and let the labelling system get it later on. Use Marshmallow-Utils:gettext_from_dict as the function.

autocompletion query logic not working

Package version (if known): v0.1.5

Describe the bug

The autocompletion query is not returning logical values. For example, typing e starts returning Azerbaijani. The query sent is e*, so it is weird that Azerbaijani is returned since it does not start with e.

(Screenshot: the language dropdown suggesting Azerbaijani for the query e.)

Other queries, for example s, for which one would expect Spanish or similar, return:

        "hits": [
            {
                "id": "jxvc9-18d97",
                "type": 1,
                "title": "Afar"
            },
            {
                "id": "gymv3-dj020",
                "type": 1,
                "title": "Abkhazian"
            },
            {
                "id": "w1p7s-9ny73",
                "type": 1,
                "title": "Afrikaans"
            },
            {
                "id": "trfrg-8f634",
                "type": 1,
                "title": "Akan"
            },
            {
                "id": "wfw9v-kyp21",
                "type": 1,
                "title": "Amharic"
            }
        ],

Queries with normal ES query string syntax, like using quotes for exact match, do not work. Also tested Eng, eng and english.

Steps to Reproduce

  1. Bootstrap RDM
  2. Go to the deposit form
  3. Type e (or s) in the language field.

Expected behavior

Obtain English or other languages that start with e.

Additional context

Possible issues:

  • The field is of type keyword in ES.
  • There is no analyzer set on it. One possibility is to use the ES built-in autocomplete, which has limitations, or something more custom and complete like:
"tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
}
"analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        },
}

Then, in the field:

"<field name>": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
},

naming: `Base` objects with functionality

In the case of datastreams, there is a base class with an implementation that in many cases suffices. In addition, it is not overridable.
See examples of cases in factories.py. I'm wondering if there should be something similar to, e.g., the readers:

  • BaseDataStream with the skeleton and constructor
  • DataStream with the current implementation

Note that this also happens with BaseFixture.

Implement read_many

When serializing to DataCite, having a read_many to retrieve multiple subjects (vocabularies) at once would be great.

Override read_all to filter by type like search

invenio-vocabularies overrides its search method to filter by type:

def search(self, identity, params=None, es_preference=None, type=None,  # <-- this
           **kwargs):

A similar read_all is needed in VocabulariesService to read only the documents of a specific vocabulary type.

In particular, this is needed if we want to use read_all to get all possible resource types in the deposit page dropdown.
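
A hedged sketch of the method; the internal search factory and the "type.id" field name mirror what search() presumably does under the hood and are assumptions:

def read_all(self, identity, fields, type, **kwargs):
    """Read all documents of the given vocabulary type."""
    self.require_permission(identity, "search")
    # Assumed internal helper mirroring the existing search() plumbing.
    search = self._create_search(identity, params={}, es_preference=None, **kwargs)
    search = search.filter("term", **{"type.id": type}).source(fields)
    return search.scan()  # iterate over every matching document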

Custom vocabulary import

At Northwestern, we want to have control over the list of licenses shown: a large list was deemed too intimidating and rife with potential "inaccurate choices". As such, as an instance installer, I would like to be able to import my customized list of a specific vocabulary.

Example for licenses (if licenses need a custom import format):

invenio vocabularies import licenses my_licenses.csv

(if any vocabulary has the same interface - you could have the first line of the csv provide the metadata about the vocabulary itself)

invenio vocabularies import licenses.csv 

The same is true for any vocabulary.

permissions: implement "read-only" permission policy

Currently, the vocabularies REST API is fully unprotected. This task is about implementing a "read-only" REST API for vocabularies.

  1. Implement read-only permission policy
  2. Ensure that vocabularies can still be created programmatically via the service
  3. Extensively test protection of all methods on the REST API.

The current permission policy should only allow can_search and can_read for any user. The other actions should require a permission that allows us to create vocabulary items programmatically, but which prevents the REST API from being used.

Not sure exactly how to do this, but one idea is to:

  1. Create a new system role named system_process.
  2. Create a hard-coded identity system_identity that has the system role need.
from flask_principal import Identity
from invenio_access.permissions import SystemRoleNeed

system_process = SystemRoleNeed("system_process")
system_identity = Identity(None)
system_identity.provides.add(system_process)

Then the remaining actions like can_create should simply require the system_process system role, and the code which needs to create the vocabulary records can use the system_identity: service.create(system_identity, data)
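
A hedged sketch of the resulting policy; the generator names follow invenio-records-permissions, with SystemProcess as the assumed generator for the new system role:

from invenio_records_permissions import RecordPermissionPolicy
from invenio_records_permissions.generators import AnyUser, SystemProcess


class PermissionPolicy(RecordPermissionPolicy):
    """Read-only REST API: write actions only via the system identity."""

    can_search = [AnyUser()]
    can_read = [AnyUser()]
    can_create = [SystemProcess()]
    can_update = [SystemProcess()]
    can_delete = [SystemProcess()]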

cli: how to test click commands

This is a much bigger task, but we should manage to standardize CLI testing... The problem with these two approaches:
a) The current function requires an external one and does not call the corresponding click command.
b) The update cmd test can only test the exit status, no reads or anything (because there are 2 app contexts? 🤔)

See test_cli.py::test_process vs test_cli.py::test_update_cmd

datastreams: implement LoggingWriter

Implement a writer that can be used to print to console, file, logs, sentry... pseudocode:

import logging

from flask import current_app


class LoggingWriter(BaseWriter):
    """Writer that logs entries via the application logger."""

    def write(self, entry, level=logging.INFO):
        """Log the entry at the given level."""
        current_app.logger.log(level, entry)

contrib: define data model for names (orcid) vocabulary

Based on the ORCiD dump, a data model (fields/types) needs to be defined. See, for example, the affiliations jsonschema.

Note that this new vocabulary will extend the base (generic) vocabulary, which means that the vocabulary items will be records. Therefore, the following attributes are already available/present:

  • id (str)
  • created (date)
  • updated (date)
  • links (links list)
  • revision_id (int)
  • title (i18n str)
  • description (i18n str)
  • icon (str)

See BaseRecordSchema and BaseVocabularySchema

contrib: create names (orcid) vocabulary

Context

Creators and Contributors become a vocabulary, the whole object not just the ORCiD identifier.

Pre-requisites

  • Define a name for the vocabulary: names (approved by Lars)
  • Define a data model (which fields will the vocabulary contain) #83

Note: the following steps of the implementation should follow the same semantics/mechanism used for subjects and affiliations.

Data Layer

  • Create a package in invenio_vocabularies/contrib/<name_plural>
  • JSONSchema: jsonschemas/<name plural>/<name>-v1.0.0.json
    • Use defs title, description, icon from invenio_vocabularies/records/jsonschemas/vocabularies/definitions-v1.0.0.json (preferably)
    • Use defs id, schema, pid from invenio-records-resources.
  • Mappings: Add mappings for v6 and v7 following (mappings/{v6,v7}/<name plural>/<name>-v1.0.0.json). Names and other attributes might contain non-Latin characters; for searchability we might want to use [asciifolding](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html).
  • Record type factory: <name-plural>.py:
    • Permissions policy: use invenio_vocabularies/services/permissions.py:PermissionPolicy
    • Endpoint /<name-plural>
  • API/Models: api.py and models.py (use from factory)
  • Alembic recipe for creating tables (invenio_vocabularies/alembic).

Service layer

  • Service schema: schema.py
    • Inherit from BaseVocabularySchema.
    • #95
  • Service: service.py
    • Add SearchOptions:
      • Need to check if we inherit from VocabularySearchOptions.
      • suggest_parser_cls may need tweaking
      • Need facets?
      • Need facets labelling? (i.e. will RDM-Records have facets over this vocab)
    • Customize components? Probably need DataComponent, PIDComponent.
    • Do we use PIDs or UUIDs?

Resource layer

  • Config:
    • May need to define serializer "application/vnd.inveniordm.v1+json" with associated schema.

Tests

  • Schema must be validated with good and bad data.
    • Data layer
    • Service layer
  • REST API
    • Serializations json and inveniordm v1.
    • Actions: Search, create, read, update, delete.

Complete all above first. Then:

Extra info: based on inveniosoftware/invenio-rdm-records#328; see the closing PRs to get an idea of the required changes/implementation.

contrib: custom pid provider for names

Problem
Name records are using the RecordIdV2 provider. The format of the generated id is fine; however, the PID type should be different (nameid).

Possible solution
Edit the PID provider factory to accept a base class, then modify the attribute pid_type. This is preferred over creating a custom provider, since only said attribute needs to be modified.

Other questions

  • How do we deal with duplicates if the PID is "random"? This is not a problem only for duplicates coming from different sets (e.g. ORCiD and GND) but also from the same ORCiD dump (testing for the existence of each element before inserting might be expensive).
  • How do we implement resolution per id (e.g. resolve an ORCiD)? We might want to put the identifiers in a pids field, in the same manner as we do for DOIs in the records.

For the developer

  • Must contain tests at service level
