
dojson's People

Contributors

blixhavn, david-caro, egabancho, glignos, greut, jacquerie, jirikuncar, jma, kaplun, lchrzaszcz, lnielsen, max-moser, michamos, samihiltunen, switowski, tiborsimko, topless, utnapischtim, zazasa


dojson's Issues

Extreme CPU usage after updating to versions later than 1.0.1

We have experienced extreme CPU usage during record insertion after updating dojson from version 1.0.1 to 1.2.1.

We pinpointed the problem: it occurs on all versions after 1.0.1.

On 1.0.1, inserting 100 records takes 2.5 seconds:
Prun output: https://gist.github.com/kaplun/42d593e74f8b0821c6a3a754cdcfa6ae

On 1.2.1, inserting 100 records takes 879 seconds:
Prun output: https://gist.github.com/kaplun/3f4dd3065e041af862b3bc16b69e371a

After a git bisect, we believe the problem was introduced by one of the following commits:

marc21: 100 $a R vs NR

We seem to have a problem with repeatable (R) vs non-repeatable (NR) subfields.

Consider the following MARC21 record:

  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Donges, Jonathan F</subfield>
  </datafield>

It is converted into:

 'main_entry_personal_name': {'personal_name': ['Donges, Jonathan F']},

Note that personal_name is a list (so R), while the MARC21 standard says 100 $a - Personal name is NR, hence it should not be a list.
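
For comparison, a minimal sketch of a rule that emits the NR subfield as a plain string (a simplification with a hypothetical pattern; the real rule lives in dojson/contrib/marc21/fields/bd1xx.py and handles many more subfields):

from dojson import utils
from dojson.contrib.marc21 import marc21

@marc21.over('main_entry_personal_name', '^100..')
@utils.filter_values
def main_entry_personal_name(self, key, value):
    """Main Entry - Personal Name."""
    return {
        # 100 $a is NR in the MARC21 standard, so return a scalar,
        # not a list built via force_list/for_each_value.
        'personal_name': value.get('a'),
    }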

utils: filter_values() is not GroupableOrderedDict aware

Take this snippet:

>>> from dojson.utils import GroupableOrderedDict, filter_values
>>> a = GroupableOrderedDict([('a', None)])
>>> @filter_values
... def foo(b):
...     return b
...
>>> foo(a)
{'__order__': ('a',)}

Instead, the correct result should be:

>>> foo(a)
{'__order__': ()}

GroupableOrderedDict raises KeyError on empty values

>>> from dojson.contrib.marc21.utils import create_record
>>> xml = '<record><datafield tag="037" ind1=" " ind2=" "></datafield></record>'
>>> d = create_record(xml)
>>> d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/collections.py", line 173, in __repr__
    return '%s(%r)' % (self.__class__.__name__, self.items())
  File "/opt/cds/lib/python2.7/site-packages/dojson/utils.py", line 358, in items
    return tuple(self.iteritems(with_order, repeated))
  File "/opt/cds/lib/python2.7/site-packages/dojson/utils.py", line 369, in iteritems
    yield key, value[0]
  File "/opt/cds/lib/python2.7/site-packages/dojson/utils.py", line 286, in __getitem__
    item = OrderedDict.__getitem__(self, key)
KeyError: 0
>>> d['037__']
GroupableOrderedDict((('__order__', ()),))

Allow local fields, subfields and indicators

Marc21 makes room for a local installation to define its own local fields, with the convention that they contain a 9, either in the tag (typically 9XX tags, but also e.g. 59X), in the indicators, or in the subfield codes.

Invenio should make it easy to follow this convention without resorting to tortuous modifications of internal syntax rules.
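
For instance, an installation could then register something like the following rule for a local 980 tag (a sketch; the field name and subfield meanings are hypothetical):

from dojson import utils
from dojson.contrib.marc21 import marc21

@marc21.over('local_collection', '^980..')
@utils.for_each_value
@utils.filter_values
def local_collection(self, key, value):
    """Local collection information, following the 9XX convention."""
    return {
        'primary': value.get('a'),
        'secondary': value.get('b'),
    }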

add mappings for bd01x09x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>
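
Such a script could look roughly like this (a sketch: it assumes '$' only appears as a subfield separator, strips the trailing ISBD comma, and does not escape XML entities):

def marc_line_to_xml(line):
    """Convert a textual MARC line like '383 ##$cBWV 211' into a <datafield>."""
    tag, indicators, rest = line[:3], line[4:6], line[6:]
    ind1, ind2 = [' ' if c == '#' else c for c in indicators]
    subfields = ''.join(
        '    <subfield code="{0}">{1}</subfield>\n'.format(sf[0], sf[1:].rstrip(','))
        for sf in rest.split('$') if sf)
    return ('  <datafield tag="{0}" ind1="{1}" ind2="{2}">\n'
            '{3}  </datafield>').format(tag, ind1, ind2, subfields)

print('<record>')
for line in ['383 ##$ano. 14,$bop. 27, no. 2',
             '383 ##$cBWV 211',
             '383 ##$bop. 8, no. 1-4']:
    print(marc_line_to_xml(line))
print('</record>')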

add mappings for bd80x83x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd25x28x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd1xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

RFC: Adding marc tags to field descriptions

Currently, the schema for Marc21 contains the "human readable MARC tag". E.g. "title_statement" gives

"description": "Title Statement"

It would be helpful if we could also hold the numeric tag, i.e. 245.

  • This serves as sort of a unique identifier
  • It helps inline documentation
  • It is easy to construct a lookup URL from that (e.g. http://www.loc.gov/marc/bibliographic/bd245.html), which can serve well in a help functionality.
  • It might serve in some upcoming editor for a "non-human" format (usually preferred by full-time cataloguers due to its compactness).

Without changing the model, one could just add that to "description". Similarly, it might be very helpful to have the subfield codes in some sort of description as well (due to their uniqueness). E.g.

"title_statement": {
        "description": "Title Statement",
        "tag": "245",
        "title": {
             "type": "string",
             "subfield" : "a"
        },

add mappings for bd20x24x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd76x78x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

When reporting missing rules, pass original blob to exception handler

In https://github.com/inveniosoftware/dojson/blob/master/dojson/overdo.py#L163:

            except Exception as exc:
                if exc.__class__ in handlers:
                    handler = handlers[exc.__class__]
                    if handler is not None:
                        handler(exc, output, key, value)
                else:
                    raise

What about also passing the original blob variable to the handler? That way the handler would have the original input and would be able to output more detailed information (e.g. it could retrieve a record ID).
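
A minimal sketch of what a handler could then look like (the handler name and the use of field 001 are purely illustrative; the proposed change is the extra blob argument, with the call site becoming handler(exc, output, key, value, blob)):

def report_missing_rule(exc, output, key, value, blob):
    """Example handler that can now point back to the failing input record."""
    # '001' holds the MARC control number; any identifier would do.
    record_id = blob.get('001', '<unknown>')
    print('No rule for {0} in record {1}'.format(key, record_id))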

tests: ESMRE is not Python 3+ compatible

Using /Users/jirka/programing/python/dojson/.eggs/lxml-3.4.4-py3.4-macosx-10.10-x86_64.egg
Searching for esmre
Reading https://pypi.python.org/simple/esmre/
Reading http://code.google.com/p/esmre/
Best match: esmre 0.3.1
Downloading https://pypi.python.org/packages/source/e/esmre/esmre-0.3.1.tar.gz#md5=95ace12bac0c79cf95712336489bc4a4
Processing esmre-0.3.1.tar.gz
Writing /var/folders/57/jt0zh0ys541671wgmgp8dnkh0000gp/T/easy_install-v5dff8r8/esmre-0.3.1/setup.cfg
Running esmre-0.3.1/setup.py -q bdist_egg --dist-dir /var/folders/57/jt0zh0ys541671wgmgp8dnkh0000gp/T/easy_install-v5dff8r8/esmre-0.3.1/egg-dist-tmp-3z1ktnug
src/esm.c:38:11: error: no member named 'ob_type' in 'esm_IndexObject'
    self->ob_type->tp_free((PyObject*) self);
    ~~~~  ^
src/esm.c:185:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
    PyObject_HEAD_INIT(NULL)
    ^~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/include/python3.4m/object.h:86:5: note: expanded from macro 'PyObject_HEAD_INIT'
    1, type },
    ^~~~~~~
src/esm.c:187:5: warning: incompatible pointer to integer conversion initializing 'Py_ssize_t' (aka 'long') with an expression of type 'char [10]' [-Wint-conversion]
    "esm.Index",                        /*tp_name*/
    ^~~~~~~~~~~
src/esm.c:190:5: warning: incompatible pointer types initializing 'printfunc' (aka 'int (*)(PyObject *, FILE *, int)') with an expression of type 'destructor' (aka 'void (*)(PyObject *)') [-Wincompatible-pointer-types]
    (destructor) esm_Index_dealloc,     /*tp_dealloc*/
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:205:5: warning: incompatible integer to pointer conversion initializing 'const char *' with an expression of type 'unsigned long' [-Wint-conversion]
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, /*tp_flags*/
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/include/python3.4m/object.h:643:29: note: expanded from macro 'Py_TPFLAGS_DEFAULT'
#define Py_TPFLAGS_DEFAULT  ( \
                            ^
src/esm.c:206:5: warning: incompatible pointer types initializing 'traverseproc' (aka 'int (*)(PyObject *, visitproc, void *)') with an expression of type 'char [47]' [-Wincompatible-pointer-types]
    "Index() -> new efficient string matching index",  /* tp_doc */
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:213:5: warning: incompatible pointer types initializing 'struct PyMemberDef *' with an expression of type 'PyMethodDef [4]' [-Wincompatible-pointer-types]
    esm_Index_methods,             /* tp_methods */
    ^~~~~~~~~~~~~~~~~
src/esm.c:214:5: warning: incompatible pointer types initializing 'struct PyGetSetDef *' with an expression of type 'PyMemberDef [1]' [-Wincompatible-pointer-types]
    esm_Index_members,             /* tp_members */
    ^~~~~~~~~~~~~~~~~
src/esm.c:221:5: warning: incompatible pointer types initializing 'allocfunc' (aka 'PyObject *(*)(struct _typeobject *, Py_ssize_t)') with an expression of type 'initproc' (aka 'int (*)(PyObject *, PyObject *, PyObject *)') [-Wincompatible-pointer-types]
    (initproc) esm_Index_init,      /* tp_init */
    ^~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:223:5: warning: incompatible pointer types initializing 'freefunc' (aka 'void (*)(void *)') with an expression of type 'PyObject *(PyTypeObject *, PyObject *, PyObject *)' [-Wincompatible-pointer-types]
    esm_Index_new,                 /* tp_new */
    ^~~~~~~~~~~~~
src/esm.c:239:9: error: non-void function 'initesm' should return a value [-Wreturn-type]
        return;
        ^
src/esm.c:241:9: warning: implicit declaration of function 'Py_InitModule3' is invalid in C99 [-Wimplicit-function-declaration]
    m = Py_InitModule3("esm", esm_methods,
        ^
src/esm.c:241:7: warning: incompatible integer to pointer conversion assigning to 'PyObject *' (aka 'struct _object *') from 'int' [-Wint-conversion]
    m = Py_InitModule3("esm", esm_methods,
      ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:245:9: error: non-void function 'initesm' should return a value [-Wreturn-type]
        return;
        ^
11 warnings and 3 errors generated.
error: Setup script exited with error: command 'clang' failed with exit status 1

add mappings for bd00x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

Write tests to JSON Schema

What we want: Make sure that the JSON schema generated by DoJSON correctly matches the fields from the MARC21 test files. We don't want to miss any field.

Problem: JSON Schema's additionalProperties is currently of limited use when combined with allOf or anyOf. Thus we cannot use the JSON Schema as-is to validate that no additional fields have been added.

Note: Currently all the fields are typed as string. Thus we don't want to test complex typing or formats.

Possible solution 1: Write a script which checks that all fields in a JSON MARC21 record (JSON generated by dojson with a MARC21 file as input) are present in the JSON Schema (a sketch follows below).

Possible solution 2: Write a script which transforms the JSON Schema in a standardized way by merging all the schemas from anyOf lists, then adds additionalProperties = false and uses a JSON Schema library to validate.

Possible solution 3: Use the domapping tool to generate mappings from JSON Schema. Add strict everywhere in the mapping to refuse additional fields. Send the mappings to Elasticsearch. Try to load the MARC21 files and see if it fails.
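
A rough sketch of possible solution 1 (the file names and the flat "properties" layout of the schema are assumptions):

import json

with open('marc21_schema.json') as f:   # hypothetical schema file
    schema_fields = set(json.load(f)['properties'])

with open('record.json') as f:          # JSON generated by dojson from a MARC21 file
    record = json.load(f)

missing = set(record) - schema_fields
if missing:
    raise SystemExit('Fields missing from the JSON Schema: %s' % sorted(missing))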

add mappings for bd3xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd6xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

contrib: dojson drop unknown fields

In #50, a test fails because a subfield is not part of the spec.

The faulty record:

<record>
    <datafield tag="852" ind1="8" ind2=" ">
        <subfield code="i">M 314</subfield>
        <subfield code="h">339</subfield>
        <subfield code="b">Library Reading Room</subfield>
        <subfield code="9">10</subfield>
    </datafield>
</record>

The test:

expected = create_record(xml)
result = to_marc21.do(marc21.do(expected))
assert expected == result

The spec says nothing about a $9 subfield. But should dojson remove it nonetheless?

RFC: One big metadata MARC21 file for testing ?

Hi,

@tiborsimko, @lnielsen, @egabancho: during this sprint we have gathered some example metadata from Zenodo, TIND and CDS. Do you think it makes sense to combine them into one big record with all possible MARC21 fields (also including tags from those 6 records provided by @Kennethhole in https://github.com/inveniosoftware/dojson/pull/76/files)?
Or should we create separate files for each source (cds_demo_data.xml, zenodo_demo_data.xml, tind_demo_data.xml)?
Or should we create one big file with multiple records inside?
The main reason for asking is the maintainability of the data in the future (we can easily add new fields if we keep only one file with test data).

marc21: 0247 overmatching

There seems to be overmatching happening with certain field/indicator combinations, such as those starting with a leading zero.

Consider the below-quoted Zenodo test input record, which contains a DOI field (0247) that is translated in the output both as other_standard_identifier (which is good) and as former_title (which is bad). The latter field is defined as:

dojson/contrib/marc21/fields/bd20x24x.py:@marc21.over('former_title', '^247[10][10]')

It seems the leading zero isn't properly matched via the caret.

In [10]: print x
<record>
  <controlfield tag="001">17575</controlfield>
  <controlfield tag="005">20150513165819.0</controlfield>
  <datafield tag="024" ind1="7" ind2=" ">
    <subfield code="2">DOI</subfield>
    <subfield code="a">10.5281/zenodo.17575</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p>Model definitions and data for BrainPigletHI&lt;/p></subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u"></subfield>
    <subfield code="a">Other (Open)</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">other-open</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">bphi: Initial release</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">provisional-user-zenodo</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">272681</subfield>
    <subfield code="u">https://zenodo.org/record/17575/files/bphi-v1.0.zip</subfield>
    <subfield code="z">0</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">software</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2015-05-13</subfield>
  </datafield>
  <datafield tag="347" ind1=" " ind2=" ">
    <subfield code="p">20</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Matthew Caldwell</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="a">https://github.com/bcmd/bphi/tree/v1.0</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="n">url</subfield>
  </datafield>
</record>

In [11]: marc21.do(create_record(x))
Out[11]: 
{'control_number': '17575',
 'copyright_or_legal_deposit_number': [{'copyright_or_legal_deposit_number': 'other-open',
   'source': 'opendefinition.org'}],
 'date_and_time_of_latest_transaction': '20150513165819.0',
 'digital_file_characteristics': [{}],
 'electronic_location_and_access': [{'file_size': '272681',
   'method': 'HTTP',
   'public_note': '0',
   'uniform_resource_identifier': 'https://zenodo.org/record/17575/files/bphi-v1.0.zip'}],
 'former_title': [{'title': '10.5281/zenodo.17575'}],
 'host_item_entry': [{'main_entry_heading': 'https://github.com/bcmd/bphi/tree/v1.0',
   'note': 'url',
   'relationship_information': 'isSupplementTo'}],
 'information_relating_to_copyright_status': [{'copyright_status': 'open'}],
 'main_entry_personal_name': {'personal_name': 'Matthew Caldwell'},
 'other_standard_identifier': [{'source_of_number_or_code': 'DOI',
   'standard_number_or_code': '10.5281/zenodo.17575',
   'type_of_standard_number_or_code': u'Source specified in subfield $2'}],
 'publication_distribution__imprint': [{'date_of_publication_distribution': '2015-05-13'}],
 'subject_added_entry_topical_term': [{'level_of_subject': u'Primary',
   'source_of_heading_or_term': 'opendefinition.org',
   'thesaurus': u'Source specified in subfield $2',
   'topical_term_or_geographic_name_entry_element': 'other-open'}],
 'summary': [{'summary': '<p>Model definitions and data for BrainPigletHI</p>'}],
 'terms_governing_use_and_reproduction_note': [{'terms_governing_use_and_reproduction': 'Other (Open)',
   'uniform_resource_identifier': u''}],
 'title_statement': {'title': 'bphi: Initial release'}}

docs: section and talk cleanup

As mentioned in #82 (comment) we may want to clean the docs a bit, for example:

1. Better separate user-oriented talk and developer-oriented talk

E.g. in the Usage section, we say:

[screenshot: Usage section of the docs]

The user does not need to know how the CLI is installed; if we talk about --help, then the user is probably more interested in what the usage options are and how to use them (how to use rules and schemas, how to find missing fields or ignore them, etc.). The entry-points talk should rather go into a developer-oriented section later (how to add rules, etc.).

2. Cleaner TOC

The current table of contents looks messy:

[screenshot: current table of contents]

The sections and subsections are not nicely separated.

add mappings for bd5xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

RFC: handling of indicators in Marc21

If I understand it correctly, the Marc21 format allows for arbitrary indicators, addressing the use cases mentioned by @fjorba. This seems a definite improvement compared to Invenio 1.x.

For illustrative purposes I'll just use MARC 100 below; however, MARC has several other fields where these issues apply as well.

If I understood the schema correctly, however, each MARC field gets mapped to a JSON internal name. So, e.g., a field like 100__ gets mapped to main_entry_personal_name. Similarly, 1000_, 1001_ and 1003_ get mapped to main_entry_personal_name as well, so all author personal names end up in the same JSON field. Again this nicely addresses the use case of @fjorba, as finally all authors get indexed and displayed regardless of the indicators, since after ingestion only main_entry_personal_name is used.

In discussions with @martinkoehler we now wondered about dissemination, and probably also indexing, issues arising from there.

Say I ingest MARC21 records that use 1001_. In MARC language this means $a stores "Surname, Forename". So the 1_ adds semantics, introducing the concepts of surname and forename and defining how they should be extracted.

100 1_ $aAdams, Henry

Now I ingest from another MARC source and I get 1000_. Here the 0 signifies that the name in $a is a forename, the canonical example at LoC being

100 0_ $aJohn $cthe Baptist, Saint

Sidenote to @martinkoehler: from the examples for 0_ it is clear that this does not refer to a storage like "Henry Adams" as compared to 1_ "Adams, Henry", but that it is indeed meant for name entities that consist of a forename only, like e.g. popes, saints or artists' names.

In this discussion we also came to the point that it would be possible in principle to treat 1_ programmatically as "split the name at the comma into the concepts of forename and surname and store them in two JSON fields". We were not clear whether this is intended. It could address the dissemination issue mentioned below.
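
For illustration, such a treatment of 1_ could be as simple as the following (a pure sketch, not something dojson does today; the function name is hypothetical):

def split_personal_name(name):
    """Split 'Surname, Forename' (MARC 100 ind1=1) into two JSON concepts."""
    surname, _, forename = name.partition(', ')
    return {'surname': surname, 'forename': forename or None}

split_personal_name('Adams, Henry')
# -> {'surname': 'Adams', 'forename': 'Henry'}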

Another case is 1003_:

100 3_ $aFarquhar family

Here you do not have a concept of forename/surname but the concept of a family name. (Note: I'd have to check whether RDA would drop the "family" in the above, since it is clearly expressed in 3_ already; it can be a leftover from ISBD in the AACR. At least I'd prefer to drop it.)

For indexing, one can argue that in the word index for names it might be no issue to treat them all alike. Regardless of whether you search for "Adams, Henry" or "Henry Adams", the word index will take care of it, probably treating the comma as an insignificant character. It might come up in the phrase index, however, at least if one has mixed storage (say one 100__$aHenry Adams).

Some thoughts on this?

The second point, and actually the main concern, arises from re-exporting to MARC. If 1001_ is ingested, one would expect to get 1001_ back, right? If I understand it correctly, right now one would get 100__ back, right? Given the semantics introduced by the indicators, ignoring them would effectively lose information.

In the current system this would not happen, at least not if one stores the ingestion format as-is. And as all processes work on the ingested format, updates to the records would be processed properly and thus keep the format.

Any thoughts on this yet?

RFC: Respecting the order of MARC subfields

As a follow-up to the RFC about the handling of indicators (#19), I would like to raise my concerns about the order of subfields. The order of subfields is important for libraries and should be taken into consideration in the MARC-JSON mapping. A very simple example is an item that has been published by two publishers in two different locations. Following MARC21, the metadata would look like this:

<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">Paris :</subfield>
<subfield code="b">Gauthier-Villars ;</subfield>
<subfield code="a">Chicago :</subfield>
<subfield code="b">University of Chicago Press,</subfield>
<subfield code="c">1955.</subfield>
</datafield>

and should be displayed to the user in the order the subfields are stored:

Paris : Gauthier-Villars ; Chicago : University of Chicago Press, 1955.

If I have understood it correctly, contrib.marc21 does not have the concept of order, and it would map subfields a together, subfields b together, etc. My concerns are:

a) We will not be able to display the subfields in the correct order, so it might end up being displayed like this:

Chicago : University of Chicago Press, Paris : Gauthier-Villars ; 1955.

You can see that the punctuation changes between equal subfields, as it depends on the next subfield (comma before subfield c); this illustrates how important the order can be. In the worst case it can end up looking like this:

Paris : Chicago : Gauthier-Villars ; University of Chicago Press, 1955.

b) Exporting it back to MARC would leave us with XML which looks like:

<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">Paris :</subfield>
<subfield code="a">Chicago :</subfield>
<subfield code="b">Gauthier-Villars ;</subfield>
<subfield code="b">University of Chicago Press,</subfield>
<subfield code="c">1955.</subfield>
</datafield>

c) I am also curious how this will be handled by Elasticsearch. Will I be able to copy-paste the displayed text "Paris : Gauthier-Villars ; Chicago : University of Chicago Press, 1955." and do an exact/partial phrase query?

Has any of these concerns been taken into consideration?
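
For what it's worth, the low-level parse does keep the original order: create_record returns a GroupableOrderedDict whose __order__ key records the subfields as they appeared. A quick sketch (behaviour inferred from the GroupableOrderedDict examples elsewhere in this tracker):

>>> from dojson.contrib.marc21.utils import create_record
>>> xml = '''<record><datafield tag="260" ind1=" " ind2=" ">
... <subfield code="a">Paris :</subfield>
... <subfield code="b">Gauthier-Villars ;</subfield>
... <subfield code="a">Chicago :</subfield>
... </datafield></record>'''
>>> create_record(xml)['260__']['__order__']
('a', 'b', 'a')

The open question is whether the contrib.marc21 rules preserve that information once subfields are grouped by code.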

Marc21 indicators: Check allowed indicators to become compliant with LOC

Hi,
I noticed that the allowed indicators for 245 are not correct.
https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/fields/bd20x24x.py#L186
Here, the second indicator should allow the characters 0,1,2,3,4,5,6,7,8,9.
http://www.loc.gov/marc/bibliographic/bd245.html

I believe the automatic extraction of which indicators are allowed has failed due to the use of "1-9".
If you have a look at "Varying form of the title", it is correct, since all the numbers are listed on the LOC website.
https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/fields/bd20x24x.py#L226
http://www.loc.gov/marc/bibliographic/bd246.html

I quickly had a look, and it seems we have the same problem with 222, 240, 242 and 243 as well. This is only for the titles; we should have a look at the other MARC fields as well.
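
Presumably the fix is to widen the indicator character class in the rule pattern. A small illustration (the exact pattern string currently in bd20x24x.py may differ):

import re

# LOC allows any digit 0-9 as the second indicator of 245, so the
# pattern needs a full digit class there.
pattern = re.compile('^245[10][0-9]')
assert pattern.match('24510')      # ind1=1, ind2=0
assert pattern.match('24509')      # ind1=0, ind2=9
assert not pattern.match('2451a')  # invalid second indicator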

add mappings for bd70x75x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

dojson: test all fields

Make sure all the fields are tested. This can be split among many people.
Checked fields:

  • bd00x.py
  • bd01x09x.py
  • bd1xx.py
  • bd20x24x.py
  • bd25x28x.py
  • bd3xx.py
  • bd4xx.py
  • bd5xx.py
  • bd6xx.py
  • bd70x75x.py
  • bd76x78x.py - @SamiHiltunen
  • bd80x83x.py
  • bd84188x.py

... (TODO: Check the file names and finish the list)

utils: force_list returns a tuple

utils.force_list(), despite its name and docstring, actually returns a tuple or None.
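
For example (behaviour as of the versions discussed here):

>>> from dojson.utils import force_list
>>> force_list('title')
('title',)
>>> force_list(None) is None
True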

A better implementation could be:

def force_list(data):
    """Wrap data in list."""
    if data is None:
        return []
    elif not isinstance(data, (list, tuple, set)):
        return [data]
    elif isinstance(data, (tuple, set)):
        return list(data)
    return data

However, it would then be necessary to amend filter_values to also filter away empty lists.

add mappings for bd4xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

dojson: add mappings for bd84188x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

marc21liberal: create very liberal MARC21 schema

Following the discussion in #23, it would be interesting to create a new, very liberal MARC21 schema, where all fields and subfields would be repeatable, all indicator values allowed, etc. We may call it marc21liberal or some such.

utils: eval(repr(GroupableOrderedDict(...))) != GroupableOrderedDict(...)

Currently:

>>> from dojson.utils import GroupableOrderedDict
>>> a = GroupableOrderedDict((('a', 12), ('b', 123), ('a', 23)))
>>> repr(a)
"GroupableOrderedDict((('__order__', ('a', 'b', 'a')), ('a', (12, 23)), ('b', 123)))"
>>> eval(repr(a))
GroupableOrderedDict([('__order__',
                       ('__order__', '__order__', '__order__', 'a', 'a', 'b')),
                      ('a', (12, 23)),
                      ('b', 123)])
>>> eval(repr(a)) == a
False

Instead, I would expect:

>>> repr(a)
"GroupableOrderedDict((('a', 12), ('b', 123), ('a', 23)))"
>>> eval(repr(a))
GroupableOrderedDict((('a', 12), ('b', 123), ('a', 23)))
>>> eval(repr(a)) == a
True

See: https://docs.python.org/2/library/functions.html#repr
