biovalidator's Issues

Fix issue with the keyword graph_restriction

The provided examples relating to graph_restriction are not working as expected. This could be related to a bug in the graph_restriction keyword/logic.

How to reproduce:
The following should fail validation:

node ./validator-cli.js --schema=examples/schemas/graphRestriction-schema.json --json=examples/objects/graphRestriction_fail.json

[Requested feature] Server endpoint to interact with references saved in cache

Summary

A feature (probably an endpoint) to interact with the references saved in the cache of a Biovalidator server:

  • To be able to refresh the stored references.
  • To know when these were last refreshed.

Motivation

When using URL-resolvable references for schemas, once a reference is used it is stored in a cache, avoiding repeated downloads. This is very useful for dealing with the delay of downloading referenced schemas. Nevertheless, at some point one may change the URL-resolved reference and want those modifications to propagate to the validation server without changing the referenced schema name ("$ref"). If the references are still stored in the cache, the server will not fetch the modified schemas and will instead use the obsolete ones.
Therefore, we request an endpoint that allows server deployers to:

  • Refresh the stored references. As soon as this endpoint is contacted, the saved schemas would be removed from the cache, causing the references to be downloaded again when next used and thus pick up the new modifications. In our use case this endpoint could be reachable by anyone, not only the server owners, but requiring credentials would also be fine.
  • Check when the references were last refreshed. Again, a public endpoint would be fine. With this we could compare the refresh time against new changes, or use it for debugging (see the sketch after this list).
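
A minimal sketch of what these two endpoints could look like, assuming an Express-style server and a hypothetical in-memory cache; the route names, cache object and response fields below are illustrative and not part of Biovalidator's actual API:

const express = require("express");
const app = express();

// Hypothetical in-memory cache of resolved remote references.
const schemaCache = new Map();
let lastRefreshed = new Date();

// POST /cache/refresh: drop all cached references so they are re-downloaded on next use.
app.post("/cache/refresh", (req, res) => {
  schemaCache.clear();
  lastRefreshed = new Date();
  res.json({ cleared: true, lastRefreshed: lastRefreshed.toISOString() });
});

// GET /cache/status: report when the cache was last refreshed and how many entries it holds.
app.get("/cache/status", (req, res) => {
  res.json({ lastRefreshed: lastRefreshed.toISOString(), cachedReferences: schemaCache.size });
});

app.listen(3020);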

Examples

Example 1

My schema is used as the "latest version" reference. I notice a mistake and amend it. But I already used the validation tool, which stored the previous version of the schema. I would then want to manually refresh the cache of the server for it to grab the newer version.

Example 2

I would want to add to my GitHub Actions one that automatically pings the refresh endpoint whenever a new change is merged into the main branch. That way, changes in the GitHub repository I use to store my schemas would always propagate to the validation API.

Example 3

I would like to create a script that checks whether the schemas have gone unrefreshed for longer than a week (using the second requested feature) and, if so, automatically refreshes them (using the first requested feature), as sketched below.
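
A sketch of such a script (Node 18+ for the global fetch), reusing the hypothetical /cache/status and /cache/refresh endpoints from the sketch above; none of these endpoints exist in Biovalidator today:

// check-refresh.js: refresh the server's reference cache if it is older than one week.
const BASE_URL = "http://localhost:3020";
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

async function refreshIfStale() {
  const status = await (await fetch(`${BASE_URL}/cache/status`)).json();
  const ageMs = Date.now() - new Date(status.lastRefreshed).getTime();
  if (ageMs > ONE_WEEK_MS) {
    await fetch(`${BASE_URL}/cache/refresh`, { method: "POST" });
    console.log("Cache refreshed; it was", Math.round(ageMs / 86400000), "days old.");
  } else {
    console.log("Cache is fresh enough; nothing to do.");
  }
}

refreshIfStale().catch(console.error);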

[BUG]: Failed to compile schema: ``async keyword in sync schema``

Bug summary

When the server is deployed with referenced schemas (-ref argument) and custom keyword graphRestriction is used, the validation crashes when compiling the schemas.

Technical details

  • Used GitHub branch: main
  • Operating System: WSL2
  • Node version: v16.13.0
  • npm version: 8.6.0

To reproduce

  1. Clone and install Biovalidator's project
git clone https://github.com/elixir-europe/biovalidator.git
cd biovalidator
npm install
  2. Clone EGA's metadata GH project:
cd ..
git clone git@github.com:EbiEga/ega-metadata-schema.git
  3. Deploy a Biovalidator local server with referenced schemas:
sdir="ega-metadata-schema/schemas"
node src/biovalidator -r "$sdir/*.json" -r "$sdir/controlled_vocabulary_schemas/*.json"
  4. Launch a for loop over all the JSON documents in the directory, requesting validation for each of them, and observe how the ones using custom keywords (analysis, experiment, individual, object-set and sample) fail with the same result:
$ cd ega-metadata-schema
$ for file in $( ls ./examples/json_validation_tests/*json); do echo $file; curl --data @$file -H "Content-Type: application/json" -X POST http://localhost:3020/validate; echo ""; done
./examples/json_validation_tests/DAC_valid-1.json
[]
./examples/json_validation_tests/analysis_valid-1.json
{"error":"Failed to compile schema: Error: async keyword in sync schema"}
./examples/json_validation_tests/assay_valid-1_array.json
[]
./examples/json_validation_tests/assay_valid-2_sequencing.json
[]
./examples/json_validation_tests/dataset_valid-1.json
[]
./examples/json_validation_tests/experiment_valid-1.json
{"error":"Failed to compile schema: Error: async keyword in sync schema"}
./examples/json_validation_tests/individual_valid-1.json
{"error":"Failed to compile schema: Error: async keyword in sync schema"}
./examples/json_validation_tests/object-set_valid-1.json
{"error":"Failed to compile schema: Error: async keyword in sync schema"}
./examples/json_validation_tests/policy_valid-1.json
[]
./examples/json_validation_tests/protocol_valid-1.json
[]
./examples/json_validation_tests/protocol_valid-2.json
[]
./examples/json_validation_tests/protocol_valid-3.json
[]
./examples/json_validation_tests/sample_valid-1.json
{"error":"Failed to compile schema: Error: async keyword in sync schema"}
./examples/json_validation_tests/study_valid-1.json
[]
./examples/json_validation_tests/submission_valid-1.json
[]

Observed behaviour

Validation stops for those JSON documents/schemas that are using a custom keyword (graphRestriction in this case).

Expected behaviour

Schemas should be compiled correctly and validation executed.

Additional context

All of the schemas within the schemas/ directory have "$async": true at the root level, which makes the error message confusing.
More importantly, if the schemas are not given at deployment but are fetched instead (i.e. the reference resolves against the raw text and is retrieved automatically by the tool), the validation works:

  1. Clone and install Biovalidator's project
git clone https://github.com/elixir-europe/biovalidator.git
cd biovalidator
npm install
  2. Clone EGA's metadata GH project:
cd ..
git clone git@github.com:EbiEga/ega-metadata-schema.git
  3. Deploy a Biovalidator local server without any referenced schemas:
node src/biovalidator
  4. Launch a for loop over all the JSON documents in the directory, requesting validation for each of them. In this case validation may not be satisfied for other reasons, but at least it was executed:
$ for file in $( ls ./examples/json_validation_tests/*json); do echo $file; curl --data @$file -H "Content-Type: application/json" -X POST http://localhost:3020/validate; echo ""; done
./examples/json_validation_tests/DAC_valid-1.json
[]
./examples/json_validation_tests/analysis_valid-1.json
[{"dataPath":"/targetedLoci/0/organismDescriptor/taxonIdCurie","errors":["Provided term is not child of [http://purl.obolibrary.org/obo/NCBITaxon_1]"]}]
./examples/json_validation_tests/assay_valid-1_array.json
{"error":"Failed to compile schema: Error: AnySchema https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.common-definitions.json is loaded but https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.common-definitions.json#/definitions/sampleLabel-association cannot be resolved"}
./examples/json_validation_tests/assay_valid-2_sequencing.json
{"error":"Failed to compile schema: Error: AnySchema https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.common-definitions.json is loaded but https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.common-definitions.json#/definitions/sampleLabel-association cannot be resolved"}
./examples/json_validation_tests/dataset_valid-1.json
[]
./examples/json_validation_tests/experiment_valid-1.json
{"error":"Failed to compile schema: RangeError: Maximum call stack size exceeded"}
./examples/json_validation_tests/individual_valid-1.json
[{"dataPath":"/organismDescriptor/taxonIdCurie","errors":["Provided term is not child of [http://purl.obolibrary.org/obo/NCBITaxon_1]"]}]
./examples/json_validation_tests/object-set_valid-1.json
{"error":"Failed to compile schema: RangeError: Maximum call stack size exceeded"}
./examples/json_validation_tests/policy_valid-1.json
[]
./examples/json_validation_tests/protocol_valid-1.json
[]
./examples/json_validation_tests/protocol_valid-2.json
[]
./examples/json_validation_tests/protocol_valid-3.json
[]
./examples/json_validation_tests/sample_valid-1.json
[{"dataPath":"/organismDescriptor/taxonIdCurie","errors":["Provided term is not child of [http://purl.obolibrary.org/obo/NCBITaxon_1]"]},{"dataPath":"/sampleCollection/samplingSite/sampledOrganismPartCurie","errors":["Provided term is not child of [http://www.ebi.ac.uk/efo/EFO_0000635]"]},{"dataPath":"/sampleStatus/0/conditionUnderStudy/cusCurie","errors":["provided term does not exist in OLS: [XCO:0000398]"]}]
./examples/json_validation_tests/study_valid-1.json
[]
./examples/json_validation_tests/submission_valid-1.json
[]

File-resolvable references (context for $id and $ref)

Summary

To be able to provide Biovalidator with filepaths to resolve references between JSON files.

Description

When building complex schemas it is common to reference inherited subschemas between JSON files through $id and $ref. Currently I don't think there is a way for Biovalidator to accept filepaths of JSON files whose $ids are referenced elsewhere. This is truly useful when running a JSON validator locally and when handling static references to files with different versions (e.g. "$id": "my-file.txt" existing in different versions of my-file.txt). Besides, although $ids are suggested to be URL-resolvable, this is just a recommendation, so it should not be expected that $ids always point to the correct JSON files.

ajv-cli already allows this through the -r argument (referenced schemas), which lets the whole set of schemas be passed to the validator. As an example:

schema_name="object-set"
json_doc="object-set-valid-1.json"
ajv --spec=draft2019 -s schemas/EGA.$schema_name.json -d schemas/validation_tests/$json_doc -r "schemas/EGA.!($schema_name).json"
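
For reference, this is roughly what the requested behaviour looks like when driving AJV programmatically: every file-resolvable schema is pre-registered so that its $id can be matched by $ref at compile time. The directory name and the entry-point $id below are illustrative only:

const fs = require("fs");
const path = require("path");
const Ajv2019 = require("ajv/dist/2019"); // AJV with draft 2019-09 support

const ajv = new Ajv2019({ strict: false });

// Pre-register every schema file so that cross-file $id/$ref pairs resolve locally,
// without requiring the $ids to be URL-resolvable.
const schemaDir = "schemas"; // illustrative path
for (const file of fs.readdirSync(schemaDir).filter((f) => f.endsWith(".json"))) {
  ajv.addSchema(JSON.parse(fs.readFileSync(path.join(schemaDir, file), "utf8")));
}

// Look up the entry-point schema by its $id (illustrative) and validate a document.
const validate = ajv.getSchema("EGA.object-set.json");
const data = JSON.parse(fs.readFileSync("object-set-valid-1.json", "utf8"));
console.log(validate(data) ? "valid" : validate.errors);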

[Bug][Documentation] Improved bug report for existing PID files

Summary

When a server is stopped abruptly and the PID file is not removed, re-deployment is blocked by the existing PID file.

Technical details

  • Testing deployment of server.
  • Testing on current dev branch.
  • Using the UI and curl to create the Post requests.
  • Using Windows Subsystem for Linux 2 (WSL2).
  • Using Node v16.13.0 and npm 8.6.0.

Expected behaviour

When deploying a server with an already existing PID file in place, one of the following should happen:

  • It removes the existing PID file (a bit risky) and overwrites it with the new one.
  • It informs you that there is already a PID file with that filename and recommends removing it or changing the new PID filename (pidPath argument). See the sketch after this list.
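
A minimal sketch of the second option, assuming the default server.pid filename and the pidPath argument mentioned above (neither the function name nor its wiring into Biovalidator is real):

const fs = require("fs");

// Illustrative start-up check: refuse to start if a PID file is already present.
function checkPidFile(pidPath = "./server.pid") {
  if (fs.existsSync(pidPath)) {
    console.error(
      `A PID file already exists at ${pidPath}. Remove it if no other instance is running, ` +
      `or start the server with a different pidPath.`
    );
    process.exit(1);
  }
  fs.writeFileSync(pidPath, String(process.pid));
  // Best-effort cleanup so a normal shutdown does not leave a stale PID file behind.
  process.on("exit", () => { try { fs.unlinkSync(pidPath); } catch (_) {} });
}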

Observed behaviour

It reports the following statement, without handling the issue correctly:

[winston] Unknown logger level: Failed to create PID file.

If the PID file is removed, it works.

To reproduce

  1. Have a PID file in place with the default filename (server.pid), either by stopping the tool abruptly or by creating it on your own.
  2. Deploy the server (node src/biovalidator.js).
  3. Observe printed message.

[Feature request]: Integration with JSON-LD and BioSchemas

Summary

Integration of JSON-LD syntax into the validation of a JSON document

Motivation

The idea would be to expand the interpretation of a JSON-LD document with its context so that it can be further validated following Schema.org and BioSchemas types and profiles.

This request is open for discussion, given that I'm not fully sure about the implications of interpreting JSON-LD as well as JSON Schema, and whether the benefits outweigh the changes.

Details

The way I envision this feature is by interpreting a JSON-LD document within its context of types and profiles. It could be done in two ways:

  • Interpreting the schema. If a schema has the condition of a BioSchemas/Schema.org property (e.g. Person), then Biovalidator would interpret it, fetch the BioSchemas/Schema.org definition of that property and apply it during validation. For example, if the JSON looked like the following, Biovalidator would interpret that the schema to apply is the one defined by Schema.org for the Person type.
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/person.json",
  "type": "object",
  "required": ["person"],
  "additionalProperties": false,
  "properties": {
    "person": {
      "@context": "https://schema.org/",
      "@type": "Person"
    }
  }
}

# If there is a place where "https://schema.org/Person" has its definition in raw format, the following could also be done (?)
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/person.json",
  "type": "object",
  "required": ["person"],
  "additionalProperties": false,
  "properties": {
    "person": {
      "$ref": "https://schema.org/Person"
    }
  }
}
  • Interpreting the data. If the data has the condition of a BioSchemas/Schema.org property (e.g. Person), then Biovalidator would interpret it as such and apply it during validation. For example, if the JSON data looked like the following, Biovalidator would interpret that an extra layer to apply is the one defined by Schema.org for the Person type. To be honest, I do not trust this approach, given that it would imply conditioning the data through the tool rather than relying entirely on the schema for it.
{
  "@context": "https://schema.org",
  "@type": "Person",
  ...
}

Use-cases

  • Re-using Schema.org and BioSchemas directly from the source by referencing their definitions

[Documentation] Further details on custom keywords

I would like to request more granular documentation regarding the custom keywords for the ontology validation.
There is a description of each custom keyword, but not of the parameters or properties used within each of them. Some seem self-explanatory (e.g. include_self) but others could be used in different ways. For example:

  • ontologies: is it always in CURIE format? Is there a constraint to which ontologies it accepts (any from OLS, only from OBO...)?
  • classes: does the class's prefix have to match one of the ontologies given in the previous attribute? If so, why the need for the first property?
  • relations: wouldn't it always be rdfs:subClassOf? Or does it accept other rdfs types that could be used for ontology validation (e.g. domain or range)?
  • direct: if false, does it imply that the child node cannot be a direct descendant of the parent term? Or does it just state that it doesn't need to be a direct one? On the other hand, if true, does it mean it has to be a direct descendant? Or that it can be a direct descendant?
  • format: is this a custom keyword related to the ontology validation? Or just a keyword used in the examples for any other reason?

Are there extra keywords that were not used in the examples but can be used within the said custom keywords?

[Documentation] Improve documentation of the interaction with OLS and ENA APIs

I would appreciate more thorough documentation on how Biovalidator behaves with the OLS and ENA APIs; specifically, what the error messages look like if those APIs are unavailable.
The other day I noticed OLS was down, but I couldn't test how Biovalidator behaves when trying to reach its service and receiving server errors.

In essence, I would like for the documentation of Biovalidator (perhaps here) to mention whether Biovalidator throws an error (and of what type) if their services are unreachable.

[Requested feature] Create a health check endpoint for the server

Summary

Creation of a "Health Check" Endpoint for Biovalidator Server, to be used to determine the operational status and health of a system or service.

Motivation

The addition of a "Health Check" endpoint for the Biovalidator server would provide a convenient way to determine the server's operational status. This feature aims to simplify the process of checking whether the server is up and running or if it has encountered any issues.

Details

The "Health Check" endpoint would be a new API endpoint that can be accessed to verify the status of the Biovalidator server. The endpoint should return a response indicating whether the server is functioning correctly or if it is experiencing any errors or downtime.

When the endpoint is accessed, it should perform a quick internal check to ensure that all necessary services and components of the Biovalidator server are operational. This check should cover critical aspects, for example: that other endpoints are reachable, that external service dependencies are accessible, and overall server availability.

The response from the "Health Check" endpoint should include relevant information about the server's status, such as a timestamp of the last check, the server's uptime duration, amount of validation requests since deployment and any specific error messages or warnings if applicable.

Use cases

The "Health Check" endpoint would be beneficial for the following use cases:

  • Monitoring the Biovalidator server's availability and ensuring it is functioning properly.
  • Alerting administrators or system operators if the server is down or experiencing issues.
  • Integrating the "Health Check" functionality into existing monitoring systems or dashboards.
  • Automating regular checks of the server's health and receiving notifications based on the results.
  • Obtaining information about the current deployment of the server.

Example

To perform a "Health Check" on the Biovalidator server, a GET request would be made to the following mocking endpoint:

GET http://localhost:3020/healthcheck

The response from the server would provide information about the server's current status. A successful response would have a status code of 200 and a JSON payload similar to the following:

{
  "status": "ok",
  "timestamp": "2023-05-24T12:00:00Z",
  "uptime": "2 days 3 hours 30 minutes",
  "requestsNumber": 1930,
  "message": "Server is running smoothly.",
  "externalDependencies": [
        "identifiers.org": { ... },
        "ontologyLookupService": { ... },
        "europeanNucleotideArchive": { ... }
   ]
}

In case of an error or server unavailability, the response would have a status code indicating the issue and an accompanying error message.
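
A rough sketch of such an endpoint, assuming an Express-style server; the route name, the response fields and the dependency URLs checked below are the ones proposed above and purely illustrative:

const express = require("express");
const app = express();

const startedAt = Date.now();
let requestsNumber = 0; // would be incremented elsewhere, e.g. in the /validate handler

// Hypothetical GET /healthcheck endpoint returning the fields proposed above.
app.get("/healthcheck", async (req, res) => {
  const externalDependencies = {};
  const dependencies = {
    ontologyLookupService: "https://www.ebi.ac.uk/ols4", // illustrative URLs
    identifiersOrg: "https://identifiers.org",
  };
  for (const [name, url] of Object.entries(dependencies)) {
    try {
      const r = await fetch(url, { method: "HEAD" });
      externalDependencies[name] = { reachable: r.ok };
    } catch (err) {
      externalDependencies[name] = { reachable: false, error: String(err) };
    }
  }
  res.status(200).json({
    status: "ok",
    timestamp: new Date().toISOString(),
    uptimeSeconds: Math.round((Date.now() - startedAt) / 1000),
    requestsNumber,
    externalDependencies,
  });
});

app.listen(3020);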

[Bug]: typo at "errors" status code in README

In the documentation about errors it says:

...
HTTP status code 200
[]

An example of a validation response with errors:
HTTP status code 200
...

I suppose the HTTP status code was copy-pasted, but I'm fairly sure a response with errors shouldn't have status code 200.

[Server] Validation request hangs if the reference cannot be resolved

Summary

While testing the Biovalidator server tool, I noticed that if I use a reference ($ref) that is not given as a local reference when deploying the server (--ref="..."), the POST request just hangs there, unresponsive, even though it was invalid from the server's perspective.

By unresolved reference I mean anything that is not plain text; they all seem to behave the same, from GitHub URIs (not raw files) to pictures.

Technical details:

  • Testing deployment of the local server (localhost).
  • Testing on current dev branch.
  • Using curl to create the Post request.
  • Using Windows Subsystem for Linux 2 (WSL2).
  • Using Node v16.13.0 and npm 8.6.0.

Example

For example, starting the server as:

node ../../src/biovalidator.js
2022-06-13T17:22:39.933Z [info] ---------------------------------------------
2022-06-13T17:22:39.934Z [info] ------------ ELIXIR biovalidator ------------
2022-06-13T17:22:39.934Z [info] ---------------------------------------------
2022-06-13T17:22:39.934Z [info] Started server on port 3020 with base URL /
2022-06-13T17:22:39.935Z [info] PID file is available at ./server.pid

And the following data:

{
    "schema": {
        "$ref": "https://icatcare.org/app/uploads/2018/07/Thinking-of-getting-a-cat.png#/properties/object_title"
    },
    "data": {
        "test": "test"
    }
}

In the terminal where I deployed the server, I received the following message:

2022-06-13T17:16:12.062Z [error] Failed to retrieve remote schema: https://icatcare.org/app/uploads/2018/07/Thinking-of-getting-a-cat.png, TypeError: undefined

But I received nothing in the terminal executing the curl command (the process doesn't finish), nor did anything appear in the logs created by the server.
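
For context, a sketch of how the remote-schema fetch could fail fast instead of leaving the request hanging, using an AbortController-based timeout; the function name and the way Biovalidator actually retrieves remote schemas are assumptions:

// Illustrative: fetch a remote $ref target with a hard timeout and a content-type check,
// so the validation request can be answered with an error instead of hanging.
async function fetchRemoteSchema(url, timeoutMs = 10000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} while retrieving ${url}`);
    const type = res.headers.get("content-type") || "";
    if (!type.includes("json")) throw new Error(`Unexpected content type "${type}" for ${url}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}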

Add to dockerhub

Under which organisation can we add this to Docker Hub? (The previous elixir-validator was under EBISPOT.)

[BUG]: Big CV lists: ``RangeError: Maximum call stack size exceeded``

Bug summary

When loading a huge Controlled Vocabulary (CV) list, the number of items in the enum keyword exceeds the maximum call stack size.

Technical details

  • Used GitHub branch: main
  • Operating System: WSL2
  • Node version: v16.13.0
  • npm version: 8.6.0

To reproduce

  1. Clone and install the project
git clone https://github.com/elixir-europe/biovalidator.git
cd biovalidator
npm install
  2. Deploy a Biovalidator local server:
node src/biovalidator
  3. Request validation where one of the schemas used is far too big. In my case, EGA.cv.instrument_platforms_array.json, with 12,000+ items in its enum:
time curl --data @$file -H "Content-Type: application/json" -X POST http://localhost:3020/validate
  4. Observe how long the initial fetch takes, and how the validation then stops with the following message:
{"error":"Failed to compile schema: RangeError: Maximum call stack size exceeded"}
real    0m15.998s
user    0m0.005s
sys     0m0.001s

Observed behaviour

Validation crashes and the document is not validated, since the schema is not compiled correctly.

Expected behaviour

The JSON document $file would be validated accordingly.

Additional context

At the terminal where the server is deployed the following error logs appear:

2022-12-06T13:37:31.464Z [info] Compiling new schema, $schemaId: undefined
2022-12-06T13:37:46.912Z [error] Failed to compile schema: RangeError: Maximum call stack size exceeded
2022-12-06T13:37:46.913Z [error] An error occurred while running the validation: {"error":"Failed to compile schema: RangeError: Maximum call stack size exceeded"}
2022-12-06T13:37:46.914Z [error] New validation request: Server failed to process data: {"error":"Failed to compile schema: RangeError: Maximum call stack size exceeded"}

[Bug]: typo at "object" property in README

In README.md, on the dev branch (AFAIK the latest one), I saw that the following block has object instead of data at the root level of the JSON file.

{
  "schema": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
      "alias": {
        "description": "A sample unique identifier in a submission.",
        "type": "string"
      },
      "taxonId": {
        "description": "The taxonomy id for the sample species.",
        "type": "integer"
      },
      "taxon": {
        "description": "The taxonomy name for the sample species.",
        "type": "string"
      },
      "releaseDate": {
        "description": "Date from which this sample is released publicly.",
        "type": "string",
        "format": "date"
      }
    },  
    "required": ["alias", "taxonId" ]
  },
  "object": {
    "alias": "MA456",
    "taxonId": 9606
  }
}

[Requested feature] Usage of different ontology versions during validation

Summary

A feature to be able to use ontology versions on demand for term validation.

Motivation and details

For the sake of traceability it is a must to store the version of each ontology that was used during validation. Nevertheless, knowing which version of the ontology was used is only partly useful if that version cannot be used when validating the metadata again. Therefore, a feature to select ontology versions is required.

Inspired by Phenopacket's approach (see resources in the MetaData object of their schemas), EGA's new schemas specify the version of each ontology used in a submission in a similar fashion (see lines of code): a single object (submission) holds an array of used ontologies, each with its respective version. This is restrictive in the sense that only one version of each ontology can be used per submission, but that is the expected use case. Saving the ontology version at each individual ontology term seems overwhelming and unnecessary.

Following this approach, the requested feature would include a parser that automatically detects (through a file, a reference, a bespoke structure or part of the JSONs...) which ontology version was used for each submission and, if none is found, uses the latest version available (the current behaviour). This puts a heavy constraint in place: objects may be dependent on other objects being validated at the same time. We can discuss how this could best be done, or whether it would be better to record the version at each ontology use, etc.

Use cases

Example 1

I submitted metadata to EGA 3 months ago, and it was valid at that time. Now this metadata is going to be shared across different institutions, with a validation step in the middle. The ontology I used changed and now my metadata is no longer valid against the standards. Being able to specify which version of the ontology I used would allow me to pass validation according to the time my submission was done.

Error with versions of Node.js below v10

Summary

Executing the tests (npm test) of Biovalidator using Node.js v8.11.1 throws syntax errors at biovalidator/node_modules/jest/node_modules/jest-cli/build/cli/index.js:227

Description

Hi, I am trying to use Biovalidator from the CLI (validator-cli.js) to validate a JSON document against a custom schema (Draft 2019-09).
After installing Biovalidator using Node.js v8.11.1 (npm v5.6.0) and running the tests, it throws the following error:

$ npm test

> [email protected] test /mnt/c/Users/mcasado/Documents/GitHub/biovalidator
> jest

/mnt/c/Users/mcasado/Documents/GitHub/biovalidator/node_modules/jest/node_modules/jest-cli/build/cli/index.js:227
    } catch {
            ^

SyntaxError: Unexpected token {
    at createScript (vm.js:80:10)
    at Object.runInThisContext (vm.js:139:10)
    at Module._compile (module.js:616:28)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)
    at Module.require (module.js:596:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/mnt/c/Users/mcasado/Documents/GitHub/biovalidator/node_modules/jest/node_modules/jest-cli/bin/jest.js:16:3)
npm ERR! Test failed.  See above for more details.

Using validator-cli.js with this erroneous build reveals the incompatibility between Node.js v8.11.1 and yargs.

$ time node ./validator-cli.js -s examples/schemas/test-schema.json -j examples/objects/test-schema-valid.json
/mnt/c/Users/mcasado/Documents/GitHub/biovalidator/node_modules/yargs-parser/build/index.cjs:997
        throw Error(`yargs parser supports a minimum Node.js version of ${minNodeVersion}. Read our version support policy: https://github.com/yargs/yargs-parser#supported-nodejs-versions`);
        ^

Error: yargs parser supports a minimum Node.js version of 10. Read our version support policy: https://github.com/yargs/yargs-parser#supported-nodejs-versions
    at Object.<anonymous> (/mnt/c/Users/mcasado/Documents/GitHub/biovalidator/node_modules/yargs-parser/build/index.cjs:997:15)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)
    at Module.require (module.js:596:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/mnt/c/Users/mcasado/Documents/GitHub/biovalidator/node_modules/yargs/build/index.cjs:2855:16)
    at Module._compile (module.js:652:30)

real    0m0.094s
user    0m0.053s
sys     0m0.000s

AJV execution

When validating the JSON document against its schema using AJV directly (via its CLI tool, ajv-cli), it does work and validation passes:

$ ajv --spec=draft2019 -s schema.json -d json_doc.json
strict mode: missing type "object" for keyword "additionalProperties" at "test_schema_X#/properties/sample_labels" (strictTypes)
json_doc.json valid

Using nodejs v10.19.0

Tests are passed correctly:

Test Suites: 10 passed, 10 total
Tests:       27 passed, 27 total
Snapshots:   0 total
Time:        48.545 s
Ran all test suites.

And using validator-cli.js with test schemas and documents seems to work:

$ time node ./validator-cli.js -s examples/schemas/test-schema.json -j examples/objects/test-schema-valid.json
 No validation errors reported.
Validation finished.

real    0m6.116s
user    0m0.867s
sys     0m0.571s

$ time node ./validator-cli.js -s examples/schemas/test-schema.json -j examples/objects/test-schema-invalid.json
 The validation process has found the following error(s):

 /characteristics.organism
        should have required property 'organism'
/characteristics.Organism
        should have required property 'Organism'
/characteristics.species
        should have required property 'species'
/characteristics.Species
        should have required property 'Species'
/characteristics
        should match some schema in anyOf

Validation finished.

real    0m6.076s
user    0m0.813s
sys     0m0.595s

OS and versions

The details of the build that passes the tests for me are:

  • Windows Subsystem for Linux 2 (WSL2)
  • ajv-cli v5.0.0
  • npm v6.13.4
  • And nodejs details are:
$ node -p process.versions
{ http_parser: '2.9.3',
  node: '10.19.0',
  v8: '6.8.275.32-node.55',
  uv: '1.28.0',
  zlib: '1.2.11',
  brotli: '1.0.7',
  ares: '1.15.0',
  modules: '64',
  nghttp2: '1.39.2',
  napi: '5',
  openssl: '1.1.1d',
  icu: '64.2',
  unicode: '12.1',
  cldr: '35.1',
  tz: '2019c' }

I also tried the latest versions of npm (v8.1.3) and Node.js (v16.13.0), which behaved similarly to the versions I managed to make Biovalidator work with.

The only issue when working with Node.js versions above v10 (which otherwise seem to work) is an additional problem (related to memory use) that I found when validating my custom document and schema. This can be tracked in another GitHub issue if necessary.

JavaScript heap out of memory using validator-cli.js with custom schema (arrays).

Summary

Executing Biovalidator using its CLI script (validator-cli.js) gets stalled and eventually throws a fatal error (JavaScript heap out of memory).

Description

I am trying to use Biovalidator from the CLI (validator-cli.js) to validate a JSON document against a custom schema (Draft 2019-09). It stalls for around 30-60 seconds and finally exits with a fatal error. The fact that I am not using a Node.js version below v10 could be at fault (from what I've seen in some posts).
The command I use and its output are the following:

$ time node ./validator-cli.js -s schema.json -j json_doc.json

<--- Last few GCs --->

[1335:0x3dfcf90]    34692 ms: Scavenge 1377.0 (1421.2) -> 1375.5 (1422.2) MB, 2.4 / 0.0 ms  (average mu = 0.085, current mu = 0.001) allocation failure
[1335:0x3dfcf90]    34699 ms: Scavenge 1377.2 (1422.2) -> 1375.7 (1423.2) MB, 2.8 / 0.0 ms  (average mu = 0.085, current mu = 0.001) allocation failure
[1335:0x3dfcf90]    34705 ms: Scavenge 1377.5 (1423.2) -> 1375.9 (1429.2) MB, 2.5 / 0.0 ms  (average mu = 0.085, current mu = 0.001) allocation failure


<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0xee28515be1d]
    1: StubFrame [pc: 0xee2851134b0]
Security context: 0x390d8aa9e6e9 <JSObject>
    2: serialize [0xb6ee4d18999] [/mnt/c/Users/mcasado/Documents/GitHub/biovalidator/node_modules/uri-js/dist/es5/uri.all.js:~1001] [pc=0xee2855d04cd](this=0x0b6ee4d1bf31 <Object map = 0x1e7141271389>,components=0x15dc21bec709 <Object map = 0x22fff8fd149>)
    3: resolveSchema [0xb6ee4d1c0a9] [/mnt/c/Users/mcasado/Document...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0x8fa090 node::Abort() [node]
 2: 0x8fa0dc  [node]
 3: 0xb0052e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb00764 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xef4c72  [node]
 6: 0xef4d78 v8::internal::Heap::CheckIneffectiveMarkCompact(unsigned long, double) [node]
 7: 0xf00e52 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
 8: 0xf01784 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xf043f1 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [node]
10: 0xecd874 v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [node]
11: 0x116d9fe v8::internal::Runtime_AllocateInNewSpace(int, v8::internal::Object**, v8::internal::Isolate*) [node]
12: 0xee28515be1d
Aborted

real    0m35.526s
user    0m36.739s
sys     0m1.720s

The JSON schema (schema.json) is:

{
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$id": "test_schema_X",
    "type": "object",
    "required": ["sample_labels"],
    "additionalProperties": false,
    "properties": {
      "sample_labels": {
        "type": "array",
        "minItems": 1,
        "additionalProperties": false,
        "items": {
          "type": "object",
          "required": ["label"],
          "additionalProperties": false,
          "properties": {
            "label": {
              "type": "string"
            } 
          }
        }
      }
    }  
}

And the JSON document (json_doc.json):

{
  "sample_labels": [
    {
      "label": "test1"
    },
    {
      "label": "test2"
    }
  ]
}

Even though it would not be a solution, I thought of increasing the memory dedicated to Node.js:

export NODE_OPTIONS="--max-old-space-size=8192"

The outcome is that, instead of stalling for 30-60 seconds, the execution stalls for 4 minutes but throws the same type of error.

AJV execution

When validating the JSON document against its schema using AJV directly (via its CLI tool, ajv-cli), it does work and validation passes:

$ ajv --spec=draft2019 -s schema.json -d json_doc.json
strict mode: missing type "object" for keyword "additionalProperties" at "test_schema_X#/properties/sample_labels" (strictTypes)
json_doc.json valid

Using validator-cli.js with other JSON schemas/documents and tests

Tests are passed correctly:

Test Suites: 10 passed, 10 total
Tests:       27 passed, 27 total
Snapshots:   0 total
Time:        48.545 s
Ran all test suites.

And using validator-cli.js with other schemas and documents seems to work:

$ time node ./validator-cli.js -s examples/schemas/test-schema.json -j examples/objects/test-schema-valid.json
 No validation errors reported.
Validation finished.

real    0m6.116s
user    0m0.867s
sys     0m0.571s

$ time node ./validator-cli.js -s examples/schemas/test-schema.json -j examples/objects/test-schema-invalid.json
 The validation process has found the following error(s):

 /characteristics.organism
        should have required property 'organism'
/characteristics.Organism
        should have required property 'Organism'
/characteristics.species
        should have required property 'species'
/characteristics.Species
        should have required property 'Species'
/characteristics
        should match some schema in anyOf

Validation finished.

real    0m6.076s
user    0m0.813s
sys     0m0.595s

Therefore, my guess is that some node within my custom schema leads to excessive memory use, some kind of memory leak or infinite loop. Given the short length of my custom schema, my suggestion is that it could be related to the way array items are handled by Biovalidator.

OS and versions

I am using Windows Subsystem for Linux 2 (WSL2), ajv-cli v5.0.0 and npm v6.13.4, and the Node.js details are:

$ node -p process.versions
{ http_parser: '2.9.3',
  node: '10.19.0',
  v8: '6.8.275.32-node.55',
  uv: '1.28.0',
  zlib: '1.2.11',
  brotli: '1.0.7',
  ares: '1.15.0',
  modules: '64',
  nghttp2: '1.39.2',
  napi: '5',
  openssl: '1.1.1d',
  icu: '64.2',
  unicode: '12.1',
  cldr: '35.1',
  tz: '2019c' }

I also tried the latest versions of npm (v8.1.3) and Node.js (v16.13.0) to no avail, since they threw errors similar to those of v10.19.0, except for the "Security context" part, which is resolved in the latest versions.

[Requested feature] Interaction with ``identifiers.org`` API Web Services

Summary

A feature to check whether a CURIE resolves against the identifiers.org API web services, so as to know whether an element exists in another resource.

Motivation

A feature of this type would greatly improve the utility of the schemas, adding an extra step of semantic validation with the resourceful identifiers.org. See the use cases below for examples of how I envision this feature enriching the metadata standards of my resource (EGA).

Details

Similar to how the current custom keywords interact with the OLS API, I would like to request a feature (e.g. a new keyword) that makes a quick API call to identifiers.org and validates whether an element exists in another resource based on a given CURIE.

In order to resolve a CURIE, identifiers.org exclusively requires a Compact Identifier consisting of a unique prefix and a provider-designated accession number (prefix:accession). Given this structure, an example of the minimal custom keyword I envision (named identifiersExists here, but it could take any other name) is:

{
    "type": "object",
    "properties": {
        "arrayOrEnaIdentifier": {
            "type": "string",
            "identifiersExists":  {
                "prefixes" : ["arrayexpress", "ena.embl"]
            }
        }
    }
}

In the above example, we would be indicating that the given arrayOrEnaIdentifier (a CURIE) has to exist in either the ArrayExpress or ENA's EMBL namespaces (arrayexpress and ena.embl respectively). Therefore, the following JSON documents (i.e. data) would be valid:

# JSON document 1
{
    "arrayOrEnaIdentifier": "arrayexpress:E-MEXP-1712"
}

# JSON document 2
{
    "arrayOrEnaIdentifier": "ena.embl:BN000065"
}

These last two identifiers would resolve automatically against identifiers.org using the following URI structure:

  • identifiers.org + compact identifier
    • JSON document 1: https://identifiers.org/arrayexpress:E-MEXP-1712
    • JSON document 2: https://identifiers.org/ena.embl:BN000065

Nevertheless, it is also important to account for the designated namespace prefix: a compact identifier not only needs to resolve to an existing record in a resource, it also needs to have the designated prefix. One of the namespaces of identifiers.org is identifiers.org itself, which could also be used to assert that a namespace exists (when compiling the schemas), if needed. Therefore, the following JSON document would not be valid, even though it is correctly resolved by identifiers.org:

{
    "arrayOrEnaIdentifier": "ncbigene:100010"
}

Likewise, it would be invalid if the compact identifier, even with the correct prefix, did not resolve to a record in the resource, for example using the following made-up accession arrayexpress:E-MEXP-17121 (an extra 1 added at the end):

{
    "arrayOrEnaIdentifier": "arrayexpress:E-MEXP-17121"
}

It is also important to differentiate between an identifier that is invalid because identifiers.org rejected the API call (e.g. a format error such as arrayexpress:hello-world) and one that is invalid because the record does not exist in the designated resource (e.g. arrayexpress:E-MEXP-17121). Although the latter depends on how each resource redirects non-existing records, it should be straightforward to address once the identifier is resolved to the registry URI. A sketch of such a keyword follows below.
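
A hedged sketch of how such a keyword could sit on top of AJV's custom-keyword API. The keyword name, the prefix check and the use of a plain GET to https://identifiers.org/<curie> as an existence test are all assumptions, and the schema using the keyword would need "$async": true:

const Ajv2019 = require("ajv/dist/2019");
const ajv = new Ajv2019();

ajv.addKeyword({
  keyword: "identifiersExists", // proposed name, not an existing Biovalidator keyword
  type: "string",
  schemaType: "object",
  async: true,                  // async keywords require "$async": true in the containing schema
  validate: async (schema, data) => {
    const [prefix] = data.split(":");
    // The prefix must be one of the namespaces allowed by the schema author.
    if (!Array.isArray(schema.prefixes) || !schema.prefixes.includes(prefix)) return false;
    try {
      // Assumption: a resolvable CURIE returns a non-error response from identifiers.org.
      const res = await fetch(`https://identifiers.org/${data}`);
      return res.ok;
    } catch (_) {
      return false; // network failures and rejected calls count as invalid here
    }
  },
});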

Use cases

  • Asserting a referenced gene in an experimental design does exist: https://identifiers.org/ncbigene:100010
  • Asserting an Array Design Format (ADF) exists in ArrayExpress instead of having to submit it to the EGA: https://identifiers.org/arrayexpress.platform:A-AFFY-98
  • Relying on platforms submitted to ArrayExpress instead of having to submit them to the EGA: https://identifiers.org/arrayexpress.platform:A-GEOD-50
  • Asserting previously submitted objects exist in EGA without the need of internal access: https://identifiers.org/ega.dataset:EGAD00000000001
  • Asserting a reference to an existing pipeline exists in GitHub: https://identifiers.org/github:EbiEga/ega-metadata-schema

[Bug] logDir argument does not work as intended

Summary

The logDir argument does not work, and in the documentation (README) it sometimes appears as LogPath.

Technical details

  • Testing deployment of server.
  • Testing on current dev branch.
  • Using the UI and curl to create the Post requests.
  • Using Windows Subsystem for Linux 2 (WSL2).
  • Using Node v16.13.0 and npm 8.6.0.

Expected behaviour

At deployment, given the directory in which I want my logs to be saved (--logDir=path/to/logs/), the tool would save the logs of the running instance in that directory.

Observed behaviour

At deployment, given the directory in which I want my logs to be saved, the tool does not save the logs in the given location, but instead saves them in the current directory from which the tool was deployed.

To reproduce

  1. Deploy the tool using the logDir argument: node src/biovalidator.js --logDir="path/to/logs/". I tried different path formats: relative, absolute, with and without the trailing slash (/)...
  2. Check for the logs in the given filepath.
  3. Observe the log created in the current location instead of in the given filepath.

[Feature request]: Entity referencing (AJV's ``$data`` keyword)

Summary

Inclusion of AJV's solution for entity referencing: $data or something similar.

Motivation

Entity referencing within the schemas would allow constructing more complex restrictions during validation.

Details

Similar to what AJV implemented as part of their combining-schemas documentation, the idea would be to allow Biovalidator to not only interpret $ref, which so far has been incredibly useful, but also $data keywords in the schemas.

The $data keyword would be used to dynamically reference data within the constraints of a JSON Schema definition. In other words, not knowing in advance the value that may be provided for a property would not prevent that value from being used in a constraint.

I tested a locally deployed Biovalidator server with the examples below, and the validation did not work as I expected, so I assume this feature is not part of it. A minimal AJV sketch of the option is shown below.
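
For reference, $data support is an opt-in flag in AJV itself, so presumably Biovalidator would only need to enable it when creating its AJV instance. A minimal standalone sketch:

const Ajv2019 = require("ajv/dist/2019");

// $data references are opt-in in AJV (assumption: Biovalidator could pass this option through).
const ajv = new Ajv2019({ $data: true });

const schema = {
  type: "object",
  required: ["MD5_1", "MD5_2"],
  properties: {
    MD5_1: { type: "string" },
    MD5_2: { type: "string", const: { $data: "1/MD5_1" } }, // MD5_2 must equal MD5_1
  },
};

const validate = ajv.compile(schema);
console.log(validate({ MD5_1: "06266488e1b14195523df877eac39b31",
                       MD5_2: "06266488e1b14195523df877eac39b31" })); // true
console.log(validate({ MD5_1: "06266488e1b14195523df877eac39b31",
                       MD5_2: "not-the-same" }));                     // false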

Examples

Some time ago I tested this feature with AJV and made three mock examples with schemas here. Below I format some of them in the schema & data format of Biovalidator's message:

# The following should pass validation, given that the first and second MD5 are equal, and that is the constraint established
#    in the schema (i.e. the data from MD5_1 should be the constant of MD5_2).
{
    "schema": {
        "type": "object",
        "required": ["MD5_1", "MD5_2"],
        "properties": {
            "MD5_1": {
            "type": "string"
            },
            "MD5_2": {
            "type": "string",
            "const": { "$data": "1/MD5_1" }
            }
        }
    },
    "data": {
        "MD5_1": "06266488e1b14195523df877eac39b31",
        "MD5_2": "06266488e1b14195523df877eac39b31"    
    }
}

# The following should not pass validation, but it does, since the interpretation of the schema does not include the 
#    negative reference to the $data in MD5_1.
{
    "schema": {
        "type": "object",
        "required": ["MD5_1", "MD5_2"],
        "properties": {
            "MD5_1": {
                "type": "string"
            },
            "MD5_2": {
                "type": "string",
                "not": { 
                    "const": { "$data": "1/MD5_1" } 
                }
            }
        }
    },
    "data": {
        "MD5_1": "06266488e1b14195523df877eac39b31",
        "MD5_2": "06266488e1b14195523df877eac39b31"
    }
}

Use-cases

The flexibility that $data provides is enormous, but a few use cases, at least for the EGA, could be:

  • Checking whether the encrypted MD5 and the unencrypted MD5 checksums are different (e.g. submitters providing the same value incorrectly)
  • Checking whether the number of samples is above the number of referenced samples
  • Checking whether object identifiers that have the same details are the same

[Bug] Error during programmatic validation: ``graphRestriction`` and ``allOf``

Summary

I found a bug that causes validation to fail when, within a single property, an allOf contains both a graphRestriction keyword and another item, and the validation is executed programmatically. The validation is not executed; it fails, presumably, while collecting all the referenced schemas.

Technical details and context

  • Date of testing: 22.11.2022
  • Testing against EGA JSON schemas.
  • Testing on current main branch.
  • Tested both through the user interface (localhost:3020/), the local server endpoint (localhost:3020/validate) and EGA's server endpoint (biovalidator.ega.ebi.ac.uk/).
  • Using Windows Subsystem for Linux 2 (WSL2).
  • Using Node v16.13.0 and npm 8.6.0.
  • All tests passed when installing Biovalidator.

Expected behaviour

For JSON keywords to be combined without issues and validation to be executed accordingly.

Observed behaviour

  • [Server mode] Using programmatic validation: (1) validation is not even executed; (2) an empty dictionary ({}) is returned instead of the list with the validation output; (3) the terminal and logs show little to no information as to what the issue was. This happened both when deploying the server locally and when using EGA's API.
# Error displayed in the terminal where the server was deployed and logs
2022-11-22T15:25:18.148Z [error] An error occurred while running the validation: {}
2022-11-22T15:25:18.151Z [error] New validation request, server failed to process data: {}
  • [Server mode] Using the UI. Surprisingly, it did work, regardless of how allOf or anyOf were arranged with graphRestriction (see the screenshot attached to the original issue).

To reproduce

  1. Deploy the server (or have it deployed beforehand). In my case the schemas and their variations were local during my debugging, so I provided them at deployment as follows:
node src/biovalidator -r "$sdir/*.json" -r "$sdir/controlled_vocabulary_schemas/*.json"
  2. Have a referenced JSON schema in which a single property contains an allOf with both a graphRestriction and another element. In the following example, file A is the one we send to the validator; file A references file B, which contains the JSON structure described above:
# Summarised contents of file A (the "data" is a full sample object JSON document):
{
  "schema": {
        "$ref": "https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.sample.json"
  },
  "data": {
    ...
  }
}

# Part of the contents of file B
{
    "type": "string",
    "allOf": [
      {
        "title": "General CURIE pattern",
        "$ref": "./EGA.common-definitions.json#/definitions/curie_general_pattern"
      },
      {
        "graphRestriction":  {
          "ontologies" : ["obo:efo"],
          "classes": ["EFO:0000635"],
          "relations": ["rdfs:subClassOf"],
          "direct": false,
          "include_self": false
        }
      }
    ]        
  }
  3. Perform a request to the validate endpoint of the server (e.g. localhost:3020/validate). In our case we used curl in the CLI:
curl --data @sample_valid-1.json -H "Content-Type: application/json" -X POST http://localhost:3020/validate
  4. Observe the terminal output:
{}

Tested cases

The above-cited chunks of JSON schemas are just one of the cases in which I found the issue. Below I compile some of the scenarios in which validation did or did not pass when using programmatic calls to the localhost server, summarising what changes between tries.

It is important to note that whether the JSON document ("data") was valid against the schemas is not important here, since the bug appears before validation starts. Therefore, whatever the outcome of the validation, I took it as a successful run without this bug.

  • Using a full sample object, which included an organism-part-entity that was crashing the validation. It did not work.
# Summarised contents of file A (the "data" is a full sample object JSON document):
{
  "schema": {
        "$ref": "https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.sample.json"
  },
  "data": {
    ...
  }
}

# Part of the contents of file B
"organism-part-entity": {
        "type": "string",
        "allOf": [
          {
            "title": "General CURIE pattern",
            "$ref": "./EGA.common-definitions.json#/definitions/curie_general_pattern"
          }
        ],
        "graphRestriction":  {
          "ontologies" : ["obo:efo"],
          "classes": ["EFO:0000635"],
          "relations": ["rdfs:subClassOf"],
          "direct": false,
          "include_self": false
        }
      }
  • Getting rid of the allOf clause. It did work.
# Same file A

# Modified parts of file B
"organism-part-entity": {
    "type": "string",
    "graphRestriction":  {
      "ontologies" : ["obo:efo"],
      "classes": ["EFO:0000635"],
      "relations": ["rdfs:subClassOf"],
      "direct": false,
      "include_self": false
    }
  }
  • Having both an element within the allOf and the graphRestriction keyword. It did not work.
# Same file A

# Modified parts of file B
"organism-part-entity": {
    "type": "string",
    "allOf": [
      {
        "title": "General CURIE pattern",
        "$ref": "./EGA.common-definitions.json#/definitions/curie_general_pattern"
      },
      {
        "graphRestriction":  {
          "ontologies" : ["obo:efo"],
          "classes": ["EFO:0000635"],
          "relations": ["rdfs:subClassOf"],
          "direct": false,
          "include_self": false
        }
      }
    ]        
  }
  • Getting rid of the $ref within file B. It did not work, so the issue is not that the other item in the allOf is a reference.
# Same file A

# Modified parts of file B
"organism-part-entity": {
    "type": "string",
    "allOf": [
      {
        "pattern": "^\\w[^:]*:.+$"
      }
    ],
    "graphRestriction":  {
      "ontologies" : ["obo:efo"],
      "classes": ["EFO:0000635"],
      "relations": ["rdfs:subClassOf"],
      "direct": false,
      "include_self": false
    }
  }
  • Having only graphRestriction within the allOf keyword. It did work.
# Same file A

# Modified parts of file B
"organism-part-entity": {
    "type": "string",
    "allOf": [
      {
        "graphRestriction":  {
          "ontologies" : ["obo:efo"],
          "classes": ["EFO:0000635"],
          "relations": ["rdfs:subClassOf"],
          "direct": false,
          "include_self": false
        }
      }
    ]
  }
  • Changing the allOf to anyOf and having multiple elements (one of which would not be met, so the graphRestriction would have to be) within it. It did work.
# Same file A

# Modified parts of file B
"organism-part-entity": {
    "type": "string",
    "anyOf": [
      {
        "type": "number"
      },
      {
        "graphRestriction":  {
          "ontologies" : ["obo:efo"],
          "classes": ["EFO:0000635"],
          "relations": ["rdfs:subClassOf"],
          "direct": false,
          "include_self": false
        }
      }
    ]
  }
  • Directly referencing the property alone instead of the whole object containing the property. It did not work.
# Content of file A
{
    "schema": {
      "$ref": "https://raw.githubusercontent.com/EbiEga/ega-metadata-schema/main/schemas/EGA.common-definitions.json#/definitions/organism-part-entity"   
    },
    "data": "UBERON:0000956"
}

# Referenced content at file B:
"organism-part-entity": {
        "type": "string",
        "allOf": [
          {
            "title": "General CURIE pattern",
            "$ref": "./EGA.common-definitions.json#/definitions/curie_general_pattern"
          }
        ],
        "graphRestriction":  {
          "ontologies" : ["obo:efo"],
          "classes": ["EFO:0000635"],
          "relations": ["rdfs:subClassOf"],
          "direct": false,
          "include_self": false
        },
      }

Cryptic error message when trying to validate against beacon v2 schema

Hello!

I was attempting to try out Biovalidator but I am running into errors and am not sure where I am going wrong. Thanks in advance for any help or advice you can offer!

The inputs I am using are the beacon v2 genomicVariations defaultSchema and a genomic variant from a reference beacon below.

Here is the error output I get when I try to run the CLI version of biovalidator:

marionfs@5000L-205837-M biovalidator % node ./validator-cli.js -s ../beacon-v2-Models/BEACON-V2-draft4-Model/genomicVariations/defaultSchema.json -j ../../querying_beacon/ega_g_variants.json 
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/start/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/start/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/end/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/end/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/start/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/start/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/end/items"
unknown format "int64" ignored in schema at path "#/definitions/Position/properties/end/items"
async schema compiled encountered and error
Error: async schema referenced by sync schema
    at callAsyncRef (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/vocabularies/core/ref.js:68:19)
    at callRef (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/vocabularies/core/ref.js:63:9)
    at callValidate (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/vocabularies/core/ref.js:34:13)
    at Object.code (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/vocabularies/core/ref.js:24:20)
    at Object.keywordCode (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/keyword.js:12:13)
    at /Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/iterate.js:16:35
    at CodeGen.code (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/codegen/index.js:438:13)
    at CodeGen.block (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/codegen/index.js:567:18)
    at Object.schemaKeywords (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/iterate.js:16:13)
    at typeAndKeywords (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/index.js:126:15)
    at subSchemaObjCode (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/index.js:113:5)
    at Object.subschemaCode (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/index.js:87:13)
    at Object.applySubschema (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/subschema.js:17:16)
    at KeywordCxt.subschema (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/context.js:145:28)
    at applyPropertySchema (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/vocabularies/applicator/properties.js:45:17)
    at Object.code (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/vocabularies/applicator/properties.js:33:17)
    at Object.keywordCode (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/keyword.js:12:13)
    at /Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/iterate.js:54:27
    at CodeGen.code (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/codegen/index.js:438:13)
    at CodeGen.block (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/codegen/index.js:567:18)
    at iterateKeywords (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/iterate.js:51:9)
    at groupKeywords (/Users/marionfs/Documents/GitHub/biovalidator/node_modules/ajv/dist/compile/validate/iterate.js:31:13)
2022-07-25T02:22:06.514Z [error] An error ocurred while running the validation. Error : {}
console error: [object Object]
2022-07-25T02:22:06.514Z [error] undefined
{
  "_id": ["62afade9676e4d25e5240835"],
  "variantType": ["SNP"],
  "alternateBases": ["T"],
  "frequencyInPopulations": [
    {
      "source": ["The Genome Aggregation Database (gnomAD)"],
      "sourceReference": ["https://gnomad.broadinstitute.org"],
      "frequencies": [
        {
          "population": ["AFR_AF"],
          "alleleFrequency": [-1]
        }
      ]
    },
    {
      "source": ["The Genome Aggregation Database (gnomAD)"],
      "sourceReference": ["https://gnomad.broadinstitute.org"],
      "frequencies": [
        {
          "alleleFrequency": [-1],
          "population": ["AMR_AF"]
        }
      ]
    },
    {
      "sourceReference": ["https://gnomad.broadinstitute.org"],
      "source": ["The Genome Aggregation Database (gnomAD)"],
      "frequencies": [
        {
          "population": ["EAS_AF"],
          "alleleFrequency": [-1]
        }
      ]
    },
    {
      "frequencies": [
        {
          "population": ["EUR_AF"],
          "alleleFrequency": [-1]
        }
      ],
      "source": ["The Genome Aggregation Database (gnomAD)"],
      "sourceReference": ["https://gnomad.broadinstitute.org"]
    },
    {
      "source": ["The Genome Aggregation Database (gnomAD)"],
      "sourceReference": ["https://gnomad.broadinstitute.org"],
      "frequencies": [
        {
          "alleleFrequency": [-1],
          "population": ["SAS_AF"]
        }
      ]
    }
  ],
  "variantLevelData": {
    "clinicalInterpretations": [
      {}
    ],
    "phenotypicEffects": [
      ["MODIFIER"]
    ]
  },
  "referenceBases": ["C"],
  "identifiers": {
    "clinVarIds": [
      {}
    ],
    "proteinHGVSIds": [
      {}
    ],
    "genomicHGVSId": [
      {}
    ]
  },
  "variantInternalId": ["chr21_9411318_C_T"],
  "position": {
    "start": [
      [9411317]
    ],
    "refseqId": ["21"],
    "end": [
      [9411318]
    ],
    "assemblyId": ["hs37"]
  },
  "molecularAttributes": {
    "aminoacidChanges": [
      {}
    ],
    "molecularEffects": [
      ["MODIFIER"]
    ],
    "geneIds": [
      ["CHR_START-MIR3648-1"]
    ]
  },
  "caseLevelData": [
    {
      "biosampleId": ["NA24695"],
      "individualId": ["NA24695"]
    },
    {
      "individualId": ["NA24694"],
      "biosampleId": ["NA24694"]
    },
    {
      "biosampleId": ["NA24631"],
      "individualId": ["NA24631"]
    }
  ]
}
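
For context, the "async schema referenced by sync schema" message comes from Ajv rather than from the Beacon schemas themselves: when a schema uses (or $refs a schema that uses) an asynchronous keyword, every schema in the reference chain has to be compiled as asynchronous, which Ajv signals with a top-level "$async": true. A minimal sketch of the Ajv behaviour behind the message (not biovalidator internals; the checkRemote keyword is made up and only stands in for an async keyword such as graphRestriction):

const Ajv = require("ajv");
const ajv = new Ajv();

// Hypothetical async keyword, standing in for e.g. graphRestriction
ajv.addKeyword({
  keyword: "checkRemote",
  async: true,
  validate: async () => true
});

// A schema that uses an async keyword must declare "$async": true,
// and so must every schema that $refs it; otherwise Ajv fails at
// compile time with "async keyword in sync schema" or
// "async schema referenced by sync schema" respectively.
const schema = {
  $async: true,
  type: "string",
  checkRemote: {}
};

const validate = ajv.compile(schema);
validate("some value").then(() => console.log("valid"));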

Return an object with proper messages after successful validation

Currently, successful validation returns an empty array. This is confusing to the calling program; instead, return an object with defined fields for both valid and invalid responses.

{
  "validationStatus": "VALID",
  "warnings": ["missing recommended field xxx"],
  "errors": []
}
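
For illustration, a client consuming such a response could branch cleanly on the status field instead of testing for an empty array (a minimal sketch; the field names follow the proposal above and are not part of the current API):

// Hypothetical client-side handling of the proposed response shape
function handleValidationResult(result) {
  if (result.validationStatus === "VALID") {
    if (result.warnings.length > 0) {
      console.warn("Valid, with warnings:", result.warnings);
    }
    return true;
  }
  console.error("Validation failed:", result.errors);
  return false;
}

handleValidationResult({
  validationStatus: "VALID",
  warnings: ["missing recommended field xxx"],
  errors: []
});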

[BUG]: NCBITaxon wrong ontology assertion: ``...not child of...``

Bug summary

When using the custom keyword graphRestriction for the NCBITaxon ontology, it seems to fail regardless of the set parent level.

Technical details

  • Used GitHub branch: main
  • Operating System: WSL2
  • Node version: v16.13.0
  • npm version: 8.6.0

To reproduce

  1. Clone, install and deploy a local server
git clone https://github.com/elixir-europe/biovalidator.git
cd biovalidator
npm install
node src/biovalidator
  2. Regardless of the endpoint (UI or CLI), set the schema to use graphRestriction with the NCBITaxon ontology, and set the data to contain the CURIE of a term that is hierarchically below the one in the custom keyword.
# In this case the schema constraint is any NCBITaxon CURIE below its root level (NCBITaxon:1).
{
  "type": "object",
  "required" : ["taxonIdCurie"],
  "properties": {
    "taxonIdCurie": {
      "type": "string",
      "graphRestriction":  {
        "ontologies" : ["obo:NCBITAXON"],
        "classes": ["NCBITaxon:1"],
        "relations": ["rdfs:subClassOf"],
        "direct": false,
        "include_self": false
      }
    }
  }
}

# And the data contains the NCBITaxon CURIE for humans
{
  "taxonIdCurie": "NCBITaxon:9606"
}
  3. Observe the erroneous message:
- taxonIdCurie
  Provided term is not child of [http://purl.obolibrary.org/obo/NCBITaxon_1]

Observed behaviour

The validation does not pass even though the conditions are met for it to pass (i.e. the term is correct for the hierarchy)

Expected behaviour

The validation result should be that it was passed.

Additional context

I'm unsure whether this error has to do with the tool itself, the OLS API, or NCBITaxon per se, but it is worth investigating.

I also checked the following:

  • NCBITaxon:9606 being right below NCBITaxon:9605
  • NCBITaxon:9605 being below the root level NCBITaxon:1
  • graphRestriction used in a similar fashion with other ontologies, where the validation behaves as expected.
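
One way to narrow this down would be to query OLS directly and check whether NCBITaxon:1 is reported among the ancestors of NCBITaxon:9606. A sketch, assuming the classic OLS REST layout in which the term IRI is double URL-encoded (exact endpoint paths may differ between OLS versions):

curl "https://www.ebi.ac.uk/ols/api/ontologies/ncbitaxon/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCBITaxon_9606/hierarchicalAncestors?size=500"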

[Requested feature] Providing Schema and Data in different files for a server instance

Summary

A way for a user to send a request to a server with the data and schema properties of the data.json file supplied as independent files or arguments.

Motivation and details

In most validation attempts in our use case, the schema is fixed from the very beginning, being a set of schemas one references. It is therefore more convenient for users if the data to validate is a file of its own and the schema is another; this makes it easier to organize the data files to be validated.

Furthermore, it seems to me that validation attempts would be easier to trace if I knew, at the command level, which schema and data files were used, rather than just which combined data file.

This is the same way it works for the command-line interface, or even the user interface: two different arguments.
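
For contrast, the current single-payload approach bundles both parts into one request body, roughly as below (a minimal sketch; the schema and data keys are the ones described above for the combined data.json):

{
  "schema": {
    "type": "object",
    "required": ["name"],
    "properties": { "name": { "type": "string" } }
  },
  "data": {
    "name": "example"
  }
}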

Use cases

Example 1

The schema is always the same, so I can re-use the same argument over and over, changing just the data file it refers to. If so, I would be able to use a similar command (completely made up in the example) each time, only changing the data_X.json file:

curl -F "schema=@schema.json" -F "data=@data_1.json" -H "Content-Type: application/json" -X POST "biovalidator.ega.ebi.ac.uk/validate"
curl -F "schema=@schema.json" -F "data=@data_2.json" -H "Content-Type: application/json" -X POST "biovalidator.ega.ebi.ac.uk/validate"
curl -F "schema=@schema.json" -F "data=@data_3.json" -H "Content-Type: application/json" -X POST "biovalidator.ega.ebi.ac.uk/validate"

Example 2

I have a referenced schema that is incredibly long, and it makes it difficult for me to edit the data easily, since the schema always gets in the way (visually, regular expressions, substitutions...). Having the schema in a file that I will not modify myself, and being able to mass-produce changes in the data files, would be handy.

[Future work] - Improving ontology validation error reports

(See discussion)

Proposed feature for the future: since the ontology validation rules are specified via the custom keywords (e.g. graph_restriction), Biovalidator could provide an improved error report when an ontology term fails validation. For example:

  • Guiding users with OLS URLs on where to find the correct term (e.g. this term does not belong to this hierarchy, search here...).
  • Suggesting the correct term, perhaps with a mapping suggester such as Zooma (e.g. the term was wrong, perhaps you meant...).

[Enhancement] Increasing the information of log files

As far as I know, the current way logs are populated for a deployed server is by adding a line for each error encountered during validation. There are two enhancements I would like to propose:

  1. Logs should also contain errorless validation attempts. This would improve the traceability of issues and increase the overall value of the logs by making it possible to weigh the erroneous attempts (e.g. "We see that X% of the total was erroneous, due to this reason..."). A possible log line is sketched after this list.
  2. Logs should also contain information on validation attempts performed through the User Interface (e.g. http://localhost:3020/ in localhost). From what I've seen, the logs only contain attempts made through POST requests, but not through the UI.
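
For illustration, a successful attempt could be logged with the same shape already used for errors (a made-up line, not the current log format):

2022-07-25T02:22:06.514Z [info] validation finished: source=UI, schema=defaultSchema.json, errors=0, warnings=1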

[Requested feature]: separate file with ontology and taxonomy rules

Summary

The Extended keywords for ontology and taxonomy validation are a quite unique feature of this validator, and require the graphRestriction, isChildTermOf and isValidTaxonomy keywords in the test_schema.json file. If the JSON-LD schema definition is not under my control, I would like these semantic validations to be passed into biovalidator from a second file.

Motivation

I would like to allow better validation for schema.org and bioschemas metadata. Currently, there are types defined in JSON schema for e.g. https://schema.org/Dataset or https://bioschemas.org/profiles/MolecularEntity/0.5-RELEASE, which are developed in e.g. https://github.com/BioSchemas/specifications/tree/master/Dataset/ or https://github.com/BioSchemas/specifications/tree/master/MolecularEntity/.

These types allow various properties to have values of type https://schema.org/DefinedTerm, and I'd expect the majority of these to come from OBO ontologies you'd find on terminology services like OLS or NCBO.

However, I'd expect that schema.org wants to keep their types lean and won't allow people to add further validation into their schema definition. Also, for one schema type, there might be multiple profiles in different communities suggesting / requesting different restrictions on allowed ontology terms.

Example

An example would probably be great, but I don't have a real one yet; a rough sketch of what I imagine is given below. I only found biovalidator at last week's AllHands in Dublin :-)
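
Purely as a hypothetical sketch, such a second file could map paths in the instance document to the keywords biovalidator already supports; every file name, key and path below is invented, and the graphRestriction block just reuses the EFO example from above:

{
  "$comment": "hypothetical semantic-rules.json, applied on top of an external schema",
  "rules": [
    {
      "$comment": "JSON Pointer into the instance document",
      "path": "/measurementTechnique",
      "keyword": "graphRestriction",
      "value": {
        "ontologies": ["obo:efo"],
        "classes": ["EFO:0000635"],
        "relations": ["rdfs:subClassOf"],
        "direct": false,
        "include_self": false
      }
    }
  ]
}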

Yours,
Steffen
