Inspirehep schemas and related tools bundle.
- Free software: GPLv2 license
- Documentation: https://inspire-schemas.readthedocs.io
Inspire JSON schemas and utilities to use them.
License: GNU General Public License v2.0
The documentation should have an explicit chapter that is automatically generated from the JSON Schema.
Please feel free to suggest best practices and how this should look.
The aim of this project is to allow anybody to discover which fields exist and how to use them, and their structure, without having to open the JSON.
The builder should support the field copyright.year.
This should be harmonized, one way or another, for #58
@michamos and I have identified that the flags CORE, Citeable and Refereed are particular because they can be set by algorithms (that could evolve in time), but could be overridden by a curator.
Since it's not currently possible to identify who set the flag, we have the issue of:
We would propose that these flags be augmented with source information. E.g.:
"citeable": {
    "flag": true,
    "source": "CURATOR"
}
or
"core": {
    "flag": false,
    "source": "core-guesser"
}
Alternatively we could have a list of objects:
"citeable": [
    {
        "flag": true,
        "source": "CURATOR"
    },
    {
        "flag": false,
        "source": "citeable-guesser"
    }
]
Possibly sorted chronologically (e.g. latest first), where the final value is computed at runtime (e.g. CURATOR has precedence over an algorithm).
The downside of this approach is that it adds quite some complexity.
Better ideas? Are we solving the wrong problem?
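To make the list-of-objects variant concrete, here is a minimal sketch of how the final value could be computed at runtime. The function name resolve_flag and the fallback rule (use the most recent algorithmic value when no curator has spoken) are assumptions for illustration, not something decided in this issue.

```python
CURATOR = "CURATOR"  # source label taken from the examples above


def resolve_flag(entries):
    """Return the effective flag value, or None when nothing is set.

    `entries` is a list of {"flag": bool, "source": str} objects,
    assumed to be sorted latest first.
    """
    # A curator decision always wins over any algorithm.
    for entry in entries:
        if entry["source"] == CURATOR:
            return entry["flag"]
    # Assumption: otherwise the most recent algorithmic value is used.
    return entries[0]["flag"] if entries else None


citeable = [
    {"flag": True, "source": "CURATOR"},
    {"flag": False, "source": "citeable-guesser"},
]
print(resolve_flag(citeable))
```

Keeping the precedence rule in one function would at least confine the added complexity to a single place.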
The schema says that 999C5k contains TeXkeys, but a search on legacy disproves this: https://inspirehep.net/search?p=999C5k%3A**.
Is this actually a new thing that is going to happen with the improvements to refextract that you made, @michamos?
@annetteholtkamp commented on Thu Mar 02 2017
The 210 field mostly contains synonyms, expansions of acronyms etc - which can probably be ignored in the future. But it also contains the acronym RPP which is used for the citesummary option to exclude the RPP. These records we need to tag somehow.
Currently INSPIRE categories are stored at the same level of arXiv categories and other externally provided categories.
However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged place.
It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).
Continues inspirehep/inspire-next#1504.
classification_number is actually very similar to keywords, to the point that we can simply merge them together.
Note that at display time:
PACS should be displayed in their human-friendly form. PDG should link to the PDG website.
We shall rename this so that it is more meaningful.
Also TeXKeys are not external and should thus not be stored in this field.
Right now there's a small bit of logic in the builder that decides if the record is citable or not (it can be overridden if need be), but maybe that's not the place or the way to set that flag. Maybe use a periodic bibcheck task, and/or move the check to a standalone function that's called dynamically.
After versioning is introduced in #9 we will need to support upgrading records. There should be an API similar to:
def record_needs_upgrade(record):
    ...
    return True

def upgrade_record(record):
    ...
    return upgraded_record
That automatically upgrades the provided record.
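One way such an API could work is a chain of per-version upgrade steps. The sketch below is only an illustration: the schema_version key, the integer version numbers and the UPGRADES registry are all hypothetical (the real versioning scheme is to be defined in #9), and the single step shown reuses the external_system_numbers rename as an example.

```python
import copy

# Hypothetical: the record carries its schema version under this key.
VERSION_KEY = "schema_version"
CURRENT_VERSION = 1


def _upgrade_0_to_1(record):
    # Illustrative step: rename a field.
    if "external_system_numbers" in record:
        record["external_system_identifiers"] = record.pop(
            "external_system_numbers"
        )
    return record


# Maps a version to the function upgrading a record from it to the next one.
UPGRADES = {0: _upgrade_0_to_1}


def record_needs_upgrade(record):
    # Records without a version are assumed to be at version 0.
    return record.get(VERSION_KEY, 0) < CURRENT_VERSION


def upgrade_record(record):
    # Work on a copy so the caller's record is left untouched.
    upgraded_record = copy.deepcopy(record)
    version = upgraded_record.get(VERSION_KEY, 0)
    while version < CURRENT_VERSION:
        upgraded_record = UPGRADES[version](upgraded_record)
        version += 1
        upgraded_record[VERSION_KEY] = version
    return upgraded_record
```

Applying the steps one version at a time means each schema change only needs a single small function, whatever version the stored record starts from.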
The following fields are being used in author forms but are not in the schema. One of the side effects is that those fields are not visible in the record editor, so not editable from the Holding Pen.
external_system_numbers renamed to external_system_identifiers, institutions renamed scheme
Add the sphinx plugin for it (see https://github.com/inveniosoftware-contrib/json-merger/tree/master/json_merger for an example)
Check if it can validate that the params defined in the docstring match the params in the function; that would be great.
Originally we designed references to mimic mini-records. It looks like catalogers will still want to curate them, so we shall simplify the structure where possible to make it nice when visualized in the record editor.
PR coming soon.
Some ~20K records:
https://inspirehep.net/search?ln=en&p=902%3A**&of=hb&action_search=Search&sf=earliestdate&so=d&wl=0
have orphan affiliations, i.e. affiliations not attached to a specific author but just available for searching.
These are stored in MARC 902__a for Literature. We should preserve this field because it is currently not possible to recompute the affiliation for many of the affected records due to missing PDFs.
There's no longer any need to support categories from other sources: we can keep only the INSPIRE and arXiv ones, which simplifies things a lot and gets rid of the challenging 'anyOf'.
Builder isn't populating the publication_info.material field.
In [3]: x = LiteratureBuilder(source='arxiv')
In [4]: x.add_language('')
In [5]: x
Out[5]: LiteratureBuilder(source="arxiv", record={'languages': ['']})
@kaplun commented on Wed Jun 15 2016
Currently the collection field is just a porting of MARC 980. E.g.:
{"collections": [
{"primary": "CORE"},
{"primary": "Book"},
{"primary": "HEP"},
{"primary": "Citeable"}
]}
On the other hand the concept of document_type is managed by the enhancer facet_inspire_doc_type. E.g.:
{"facet_inspire_doc_type": ["book"]}
This is suboptimal.
- Citeable should become a flag and be added at indexing time based on other values
- CORE should be declared as a flag and be available in all schemas
- HEP is actually redundant since it represents the fact that this is a record from Literature
- facet_inspire_doc_type should become document_type and be populated by dojson, rather than enhanced before indexing.
@kaplun commented on Thu Aug 25 2016
I think we should bump priority of this one, since category is really scattered around the code base in a wrong way.
@jacquerie commented on Fri Aug 26 2016
This needs a spec. The thing I refactored in https://github.com/inspirehep/inspire-next/blob/25cba484c652d21c112628c4967e684c02d6fcfd/inspirehep/modules/records/receivers.py#L120-L210 is a 1 to 1 correspondence with the code that was there before, but makes no sense to me.
You need to define precisely:
- What should we do with collections
- What are the allowable document_types
- How are the 980__a values mapped to those allowable values
- What is the algorithm that sets Citeable
@kaplun commented on Mon Sep 19 2016
What should we do with collections
Should disappear.
What are the allowable document_types
Exactly the keys that you have defined in the two tables in the docstring of populate_inspire_document_type().
How are the 980__a values mapped to those allowable values
Those that are document types are mapped to document types (possibly with the same value as in 980). Those that are flags, such as citeable and core, should be mapped to a corresponding flag. (I think we have it for core already). deleted is also mapped to a deleted field.
What is the algorithm that sets Citeable:
Mmh. I guess it's more the question of what is not citeable. I see by default anything that comes from arXiv is citeable. @annetteholtkamp can you help here?
@jmartinm commented on Wed Oct 05 2016
Now that inspirehep/inspire-next#1589 is merged, and once we get rid of the collections field, note that we will still have a _collections field managed by invenio-collections.
This field gets populated based on a query matching the record (see config) so that config will have to be amended for the queries to match the new document_type field.
@jmartinm commented on Thu Oct 06 2016
Collection fields are:
1100059 HEP
832449 Citeable
698265 CORE
584001 Published
401801 arXiv
312449 ConferencePaper
57480 Arxiv
51674
26942 Thesis
27168 Review
10148 Lectures
5637 NOTE
7727 Proceedings
7507 noncore
4643 THESIS
3488 Introductory
4003 Withdrawn
3982 Hep
3344 Book
172 D0-PRELIMINARY-NOTE
2069 BOOK
1239 NONCORE
1240 PROCEEDINGS
1115 citeable
891 BookChapter
452 Conference
33 Core
11 REPORT
5 Preprint
6 published
3 core
3 Note
2 Noncore
2 Report
1 PUBLISHED
1 thesis
1 book
1 proceedings
1 NonCore
1 Conferencepaper
1 Accelerators
1 Proceddings
Our schema says possible document types are:
[
"Published",
"arXiv",
"ActivityReport",
"ConferencePaper",
"Thesis",
"Review",
"Lectures",
"Note",
"Proceedings",
"Introductory",
"Book",
"BookChapter",
"Report"
],
And our current document type facet has the following mapping (from 980__a value to facet value):
'published': 'peer reviewed',
'thesis': 'thesis',
'book': 'book',
'bookchapter': 'book chapter',
'proceedings': 'proceedings',
'conferencepaper': 'conference paper',
'note': 'note',
'report': 'report',
'activityreport': 'activity report',
'lectures': 'lectures',
'review': 'review',
'preprint' if no journal info
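So for this issue to proceed we would need:
- For each value in 980__a what document type from our schema to assign.
- Should the document type in the schema be human readable ActivityReport vs Activity Report for example.
- Complete the enum in our jsonschema to accommodate document types such as the hidden collections: Hal Hidden, notes from different experiments and so on.
- Do we still need the receiver to convert the document types into a more 'user facing' facet, with values such as Preprint or Peer Reviewed which are not mentioned as document types in the schema.
@kaplun commented on Wed Oct 05 2016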
@jmartinm Thanks. I'd suggest @annetteholtkamp et al. can help us remove all the outliers from 980__a
@kaplun commented on Thu Oct 06 2016
- For each value in 980__a what document type from our schema to assign.
1100059 HEP -> Literature Schema
832449 Citeable -> citeable flag
698265 CORE -> 'core' flag: True
584001 Published -> published flag
401801 arXiv -> ignore (redundant)
312449 ConferencePaper -> 'conference paper'
57480 Arxiv -> ignore (redundant)
51674 -> ignore (W00t?)
26942 Thesis -> 'thesis'
27168 Review -> 'review'
10148 Lectures -> 'lectures'
5637 NOTE -> 'note'
7727 Proceedings -> 'proceedings'
7507 noncore -> 'core' flag: False
4643 THESIS -> 'thesis'
3488 Introductory -> 'introductory'
4003 Withdrawn -> 'withdrawn' flag
3982 Hep -> Literature Schema
3344 Book -> 'book'
172 D0-PRELIMINARY-NOTE
891 BookChapter -> 'book chapter'
452 Conference -> Wot? In HEP?
11 REPORT -> 'report'
5 Preprint -> ignore redundant
6 published -> ignore redundant
- Should the document type in the schema be human readable ActivityReport vs Activity Report for example.
I believe so: anyway cataloguers will edit records either through scripts or through the editor, which will enforce the accepted values. Therefore there is no need to introduce a simplified spelling to avoid typos.
- Complete the enum in our jsonschema to accommodate document types such as the hidden collections: Hal Hidden, notes from different experiments and so on.
I will create a dedicated issue for that.
- Do we still need the receiver to convert the document types into a more 'user facing' facet, with values such as Preprint or Peer Reviewed which are not mentioned as document types in the schema.
I believe given the above point on how to spell document types, the answer is nope.
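The mapping proposed above can be sketched as a small lookup. This is only an illustration: the function name map_collections is hypothetical, matching is assumed to be case-insensitive (so that THESIS and Thesis collapse), and values without a decision in the table, such as D0-PRELIMINARY-NOTE or Conference, are returned separately for manual review rather than guessed.

```python
# 980__a values (lowercased) that become document types.
DOCUMENT_TYPES = {
    "conferencepaper": "conference paper",
    "thesis": "thesis",
    "review": "review",
    "lectures": "lectures",
    "note": "note",
    "proceedings": "proceedings",
    "introductory": "introductory",
    "book": "book",
    "bookchapter": "book chapter",
    "report": "report",
}

# 980__a values (lowercased) that become flags: value -> (flag name, flag value).
FLAGS = {
    "citeable": ("citeable", True),
    "core": ("core", True),
    "noncore": ("core", False),
    "published": ("published", True),
    "withdrawn": ("withdrawn", True),
}

# Redundant or empty values: 'hep' only selects the Literature schema.
IGNORED = {"hep", "arxiv", "preprint", ""}


def map_collections(values):
    """Split 980__a values into (document_types, flags, unknown)."""
    document_types, flags, unknown = [], {}, []
    for value in values:
        key = value.strip().lower()
        if key in DOCUMENT_TYPES:
            doc_type = DOCUMENT_TYPES[key]
            if doc_type not in document_types:
                document_types.append(doc_type)
        elif key in FLAGS:
            name, flag = FLAGS[key]
            flags[name] = flag
        elif key not in IGNORED:
            unknown.append(value)
    return document_types, flags, unknown


print(map_collections(["CORE", "Book", "HEP", "Citeable"]))
```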
@annetteholtkamp commented on Thu Oct 13 2016
On 05 Oct 2016, at 12:39, Samuele Kaplun wrote:
1100059 HEP -> Literature Schema
832449 Citeable -> citeable flag
698265 CORE -> 'core' flag: True
584001 Published -> published flag
401801 arXiv -> ignore (redundant)
why is that redundant?
312449 ConferencePaper -> 'conference paper'
57480 Arxiv -> ignore (redundant)
51674 -> ignore (W00t?)
what is this?
26942 Thesis -> 'thesis'
27168 Review -> 'review'
10148 Lectures -> 'lectures'
5637 NOTE -> 'note'
7727 Proceedings -> 'proceedings'
7507 noncore -> 'core' flag: False
Is there only true and false, or also undefined ?
4643 THESIS -> 'thesis'
3488 Introductory -> 'introductory'
4003 Withdrawn -> 'withdrawn' flag
3982 Hep -> Literature Schema
3344 Book -> 'book'
172 D0-PRELIMINARY-NOTE
We should ask Heath whether this tag is still necessary if a record is in HEP
891 BookChapter -> 'book chapter'
452 Conference -> Wot? In HEP?
Yes, we never managed to clean them all up. Most of them are probably conf papers - but needs to be checked.
11 REPORT -> 'report'
5 Preprint -> ignore redundant
6 published -> ignore redundant
- Annette
@kaplun commented on Thu Oct 13 2016
401801 arXiv -> ignore (redundant)
why is that redundant?
We don't need to say arXiv. We already know from the arXiv ID.
51674 -> ignore (W00t?)
what is this?
A collection with an empty value.
7507 noncore -> 'core' flag: False
Is there only true and false, or also undefined ?
All flags can also be undefined.
Specifically how to generate the examples for the backwards-compatibility checks.
We do not have unit tests for the builder.
This is the list of the unit tests that we should write:
Depends on #107. Rewrite/amend/write the docstrings of the builder methods in Google docstring style and verify the content (with a curator if needed).
Also make sure to generate a nice page for it so that builder users can easily access and consult it.
Currently we depend on node in order to generate fake data from jsonschema, as noted in #13 (comment).
We should port this to a pythonic solution, possibly based e.g. on fake-factory with some jsonschema extension.
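As a rough idea of what a node-free generator could look like, here is a minimal sketch using only the Python standard library. It is not based on fake-factory, the name fake_from_schema is hypothetical, and it handles only a small subset of JSON Schema (type, enum, properties, items, minimum, maximum); $ref, anyOf, pattern and format are ignored.

```python
import random
import string


def fake_from_schema(schema, rng=None):
    """Generate a random value matching a small subset of JSON Schema."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the output is reproducible
    if "enum" in schema:
        return rng.choice(schema["enum"])
    kind = schema.get("type")
    if kind == "object":
        return {
            key: fake_from_schema(subschema, rng)
            for key, subschema in schema.get("properties", {}).items()
        }
    if kind == "array":
        size = max(schema.get("minItems", 1), 1)
        items = schema.get("items", {})
        return [fake_from_schema(items, rng) for _ in range(size)]
    if kind == "integer":
        low, high = schema.get("minimum"), schema.get("maximum")
        if low is None:
            low = 0 if high is None else high - 100
        if high is None:
            high = low + 100
        return rng.randint(low, high)
    if kind == "number":
        return rng.uniform(0, 100)
    if kind == "boolean":
        return rng.choice([True, False])
    if kind == "string":
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return None  # unsupported or missing type


example_schema = {
    "type": "object",
    "properties": {
        "flag": {"type": "boolean"},
        "source": {"enum": ["CURATOR", "core-guesser"]},
        "year": {"type": "integer", "minimum": 1900, "maximum": 2100},
        "titles": {"type": "array", "items": {"type": "string"}},
    },
}
print(fake_from_schema(example_schema))
```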
Continues inspirehep/inspire-next#1468.
INIS has a vocabulary of keywords that some of our records use, for example: https://inspirehep.net/record/132217/export/xme.
@annetteholtkamp says that we need to add it to
inspire-schemas/inspire_schemas/records/hep.yml
Lines 836 to 877 in efa2996
In order to support multiline strings we are using yaml format for the schemas (see #97 ), that means that we have to generate valid json at package time in order to distribute them (also will have to see how to handle in the tests and such).
Schemas will evolve with time. For this reason we should introduce versioning. Each schema should be versioned following semver. Each modification is made by first copying the latest schema into a new file and then performing the modification there.
I think there are no keywords at all for arXiv harvest.
All keywords I found are for user-submissions, e.g.
https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=metadata.keywords.value:model
FYI:
If we have the fulltext (e.g. arXiv) we run
python bibclassify_cli.py -s -n 35 -k HEPontCore.rdf fulltext.pdf
If we have only metadata (e.g. for journals) we run
python bibclassify_cli.py -s -n 10 -k HEPontCore.rdf title_abstract_keywords.txt
To be renamed page_number or number_of_pages?
To be enforced to be a simple int.
When populating it, it's actually overriding the 'url' field:
In [2]: from inspire_schemas.builders import LiteratureBuilder
In [3]: lb = LiteratureBuilder(source='mama', )
In [4]: lb
Out[4]: LiteratureBuilder(source="mama", record={})
In [5]: lb.add_copyright(material='i\'m not a url')
In [6]: lb
Out[6]: LiteratureBuilder(source="mama", record={'copyright': [{'url': "i'm not a url"}]})
TeXKeys are automatically generated, shouldn't be deleted, but can be declared obsolete by a cataloger.
Before deployment of Inspire 3 to labs, we need to finalize the part of the schema that is used in the harvesters that are currently on labs, namely user literature suggestions and non-CORE arXiv harvesting.
The concerned keys are:
{'abstracts', 'preprint_date', 'collections', 'external_system_numbers', 'license', 'report_numbers', 'collaborations', 'titles', 'arxiv_eprints', 'public_notes', 'acquisition_source', 'publication_info', 'copyright', 'authors', 'dois', 'page_nr', 'imprints'}
{'external_system_numbers', 'accelerator_experiments', 'arxiv_eprints', 'collaboration', 'publication_info', 'acquisition_source', 'license', 'report_numbers', 'public_notes', 'imprints', 'abstracts', 'thesis', 'titles', 'languages', 'thesis_supervisors', 'field_categories', 'dois', 'urls', 'collections', 'title_translations', 'hidden_notes', 'authors'}
{'core', 'citeable', 'published'}
Commit b28c499 should be backported to become inspire-schemas==31.1.0.
As pointed out by @jacquerie, the test that ensures that our schemas are jsonschema compliant is sitting on:
https://github.com/inspirehep/inspire-next/blob/a6c641e860a9e7c357e30af5733edef9546afb76/tests/unit/records/test_records_jsonschemas.py
This should be moved to this repo, so that we don't commit invalid schemas.
Builder isn't populating the dois.material field.
Besides the current value, record and curated_relation, a raw_value field is needed to add the value automatically extracted by tools. This field will not be editable but is helpful to then populate the value field (which contains the canonical name that links to a record in the Institutions database).
https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.json#L79
In the utils.py module, when normalizing names, we just need to remove the space wherever we have a '. ' pair in the first name (that is, the second element after splitting a string like 'Caro, D. J.' by ',').
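A minimal sketch of that rule follows. The function name normalize_initials is hypothetical and this is not the existing utils.py code; it only shows the transformation described above on a 'Last, First' string.

```python
def normalize_initials(name):
    """Remove the space after each '.' in the first-name part of 'Last, First'."""
    last, separator, first = name.partition(",")
    if not separator:
        # No comma: there is no first-name part to normalize.
        return name
    first = first.strip().replace(". ", ".")
    return "{}, {}".format(last.strip(), first)


print(normalize_initials("Caro, D. J."))  # Caro, D.J.
```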
- package.json with the dependencies
- .gitignore to exclude node_modules
- scripts/generate_example_records.json executable
The contents of the _files field for Literature records are supposed to contain the metadata to retrieve the file by invenio-records-files.
The schema we have for it was copied from Zenodo and so contains the basic info in the invenio-records-files schema, but also some additional Zenodo-specific stuff (previewer, type) that we probably don't need.
The workflow is using this field in yet another way, writing description and doctype there (for arXiv PDF and extracted plots), which are not currently in the schema. This doesn't cause any error now as the results of _files are discarded anyway and never sent to legacy, but we should decide on what information we really want to have there.
@kaplun and @tsgit know how files are handled on legacy and could share their experience.
Discussing with @jacquerie, we identified the following keys that might be useful:
- doctype (or document_type?): to signal what kind of document is attached. This would be an enum with values fulltext, plot, what else?
- mime_type: how this document is encoded, which might warrant a different handling (e.g. PDF vs XML for a fulltext).
- hidden: a flag to indicate whether this file is publicly visible (would be true for fulltexts used for indexing that we may not serve directly to our users).
We have lots of them (https://inspirehep.net/search?ln=en&p=773__t%3A**&wl=0), but no space in the schema to put them. Shall we just discard them?
Right now we only show in the changelog/releasenotes the bugs specified with closes #XXX but not the ones that have an external repo reference like addresses anotherorg/anotherrepo#XXX; we should show those too.
This part of the schema, acquisition_source/datetime, doesn't have a type, which should be string:
"acquisition_source": {
"$schema": "http://json-schema.org/schema#",
"additionalProperties": false,
"description": "Only the first source is stored: if the record later gets enriched with\nmetadata coming from a second source, the `acquisition_source` is not\nupdated.\n\n:MARC: ``541``",
"properties": {
"datetime": {
"description": "This does not necessarily coincide with the creation date of the\nrecord, as there might be some delay between the moment the\noriginal information is obtained and a record is finally created in\nthe system.\n\n:MARC: ``541__d``",
"format": "date-time",
"title": "Date on which the metadata was obtained"
},...
The keywords field is not handled by the builder.
Builder isn't populating the license.material and license.imposing fields.
The source.yml is currently a free text field. This will be a problem when we start using this field value as the way to retrieve records from different sources (arXiv, APS, ...) by the merger, since this value will be in the database and should always be the same.
An enum should be created instead.
NOTE: One of the values in the enum should be used for the user submission forms. The forms currently don't populate the source field, but they should start doing that once we have the merger working.