
dojson's People

Contributors

blixhavn, david-caro, egabancho, glignos, greut, jacquerie, jirikuncar, jma, kaplun, lchrzaszcz, lnielsen, max-moser, michamos, samihiltunen, switowski, tiborsimko, topless, utnapischtim, zazasa


dojson's Issues

Extreme CPU usage after updating to versions later than 1.0.1

We have experienced extreme CPU usage during record insertion after updating dojson from version 1.0.1 to 1.2.1.

We pinpointed the problem: it occurs on all versions after 1.0.1.

On 1.0.1, inserting 100 records takes 2.5 seconds:
Prun output: https://gist.github.com/kaplun/42d593e74f8b0821c6a3a754cdcfa6ae

On 1.2.1, inserting 100 records takes 879 seconds:
Prun output: https://gist.github.com/kaplun/3f4dd3065e041af862b3bc16b69e371a

After a git bisect, we believe the problem was introduced by one of the following commits:

marc21: 100 $a R vs NR

We seem to have a problem with repeatable (R) vs non-repeatable (NR) subfields.

Consider the following MARC21 record:

  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Donges, Jonathan F</subfield>
  </datafield>

It is converted into:

 'main_entry_personal_name': {'personal_name': ['Donges, Jonathan F']},

Note that personal_name is a list (so R), while the MARC21 standard says 100 $a - Personal name is NR, hence it should not be a list.
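
For comparison, a minimal sketch of a rule that emits the NR subfield as a plain string (a simplification with a hypothetical pattern; the real rule lives in dojson/contrib/marc21/fields/bd1xx.py and handles many more subfields):

from dojson import utils
from dojson.contrib.marc21 import marc21

@marc21.over('main_entry_personal_name', '^100..')
@utils.filter_values
def main_entry_personal_name(self, key, value):
    """Main Entry - Personal Name."""
    return {
        # 100 $a is NR in the MARC21 standard, so return a scalar,
        # not a list built via force_list/for_each_value.
        'personal_name': value.get('a'),
    }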

utils: filter_values() is not GroupableOrderedDict aware

Take this snippet:

>>> from dojson.utils import GroupableOrderedDict, filter_values
>>> a = GroupableOrderedDict([('a', None)])
>>> @filter_values
... def foo(b):
...     return b
...
>>> foo(a)
{'__order__': ('a',)}

Instead, the correct result should be:

>>> foo(a)
{'__order__': ()}

GroupableOrderedDict raises KeyError on empty values

>>> from dojson.contrib.marc21.utils import create_record
>>> xml = '<record><datafield tag="037" ind1=" " ind2=" "></datafield></record>'
>>> d = create_record(xml)
>>> d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/collections.py", line 173, in __repr__
    return '%s(%r)' % (self.__class__.__name__, self.items())
  File "/opt/cds/lib/python2.7/site-packages/dojson/utils.py", line 358, in items
    return tuple(self.iteritems(with_order, repeated))
  File "/opt/cds/lib/python2.7/site-packages/dojson/utils.py", line 369, in iteritems
    yield key, value[0]
  File "/opt/cds/lib/python2.7/site-packages/dojson/utils.py", line 286, in __getitem__
    item = OrderedDict.__getitem__(self, key)
KeyError: 0
>>> d['037__']
GroupableOrderedDict((('__order__', ()),))

Allow local fields, subfields and indicators

Marc21 makes room for a local installation to define its own local fields, with the convention that they contain a 9, either in the tag (typically 9XX tags, but also e.g. 59X), in the indicators, or in the subfield codes.

Invenio should make it easy to follow this convention without resorting to tortuous modifications of internal syntax rules.
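
For instance, an installation could then register something like the following rule for a local 980 tag (a sketch; the field name and subfield meanings are hypothetical):

from dojson import utils
from dojson.contrib.marc21 import marc21

@marc21.over('local_collection', '^980..')
@utils.for_each_value
@utils.filter_values
def local_collection(self, key, value):
    """Local collection information, following the 9XX convention."""
    return {
        'primary': value.get('a'),
        'secondary': value.get('b'),
    }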

add mappings for bd01x09x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>
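
Such a script could look roughly like this (a sketch: it assumes '$' only appears as a subfield separator, strips the trailing ISBD comma, and does not escape XML entities):

def marc_line_to_xml(line):
    """Convert a textual MARC line like '383 ##$cBWV 211' into a <datafield>."""
    tag, indicators, rest = line[:3], line[4:6], line[6:]
    ind1, ind2 = [' ' if c == '#' else c for c in indicators]
    subfields = ''.join(
        '    <subfield code="{0}">{1}</subfield>\n'.format(sf[0], sf[1:].rstrip(','))
        for sf in rest.split('$') if sf)
    return ('  <datafield tag="{0}" ind1="{1}" ind2="{2}">\n'
            '{3}  </datafield>').format(tag, ind1, ind2, subfields)

print('<record>')
for line in ['383 ##$ano. 14,$bop. 27, no. 2',
             '383 ##$cBWV 211',
             '383 ##$bop. 8, no. 1-4']:
    print(marc_line_to_xml(line))
print('</record>')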

add mappings for bd80x83x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd25x28x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd1xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

RFC: Adding marc tags to field descriptions

Currently, the schema for Marc21 contains the "human readable MARC tag". E.g. "title_statement" gives

"description": "Title Statement"

It would be helpful if we could also hold the numeric tag, i.e. 245.

  • This serves as sort of a unique identifier
  • It helps inline documentation
  • It is easy to construct a lookup URL from that (e.g. http://www.loc.gov/marc/bibliographic/bd245.html), which can serve well in a help functionality.
  • It might serve in some upcoming editor for a "non-human" format (usually preferred by full-time cataloguers due to its compactness).

Without changing the model, one could just add that to "description". Similarly, it might be very helpful to have the subfield codes in some sort of description as well (due to their uniqueness). E.g.

"title_statement": {
        "description": "Title Statement",
        "tag": "245",
        "title": {
             "type": "string",
             "subfield" : "a"
        },

add mappings for bd20x24x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd76x78x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

When reporting missing rules, pass original blob to exception handler

In https://github.com/inveniosoftware/dojson/blob/master/dojson/overdo.py#L163:

            except Exception as exc:
                if exc.__class__ in handlers:
                    handler = handlers[exc.__class__]
                    if handler is not None:
                        handler(exc, output, key, value)
                else:
                    raise

What about also passing the original blob variable to the handler? That way the handler would have the original input and would be able to output more detailed information (e.g. it could retrieve a record ID).
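
A minimal sketch of what a handler could then look like (the handler name and the use of field 001 are purely illustrative; the proposed change is the extra blob argument, with the call site becoming handler(exc, output, key, value, blob)):

def report_missing_rule(exc, output, key, value, blob):
    """Example handler that can now point back to the failing input record."""
    # '001' holds the MARC control number; any identifier would do.
    record_id = blob.get('001', '<unknown>')
    print('No rule for {0} in record {1}'.format(key, record_id))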

tests: ESMRE is not Python 3+ compatible

Using /Users/jirka/programing/python/dojson/.eggs/lxml-3.4.4-py3.4-macosx-10.10-x86_64.egg
Searching for esmre
Reading https://pypi.python.org/simple/esmre/
Reading http://code.google.com/p/esmre/
Best match: esmre 0.3.1
Downloading https://pypi.python.org/packages/source/e/esmre/esmre-0.3.1.tar.gz#md5=95ace12bac0c79cf95712336489bc4a4
Processing esmre-0.3.1.tar.gz
Writing /var/folders/57/jt0zh0ys541671wgmgp8dnkh0000gp/T/easy_install-v5dff8r8/esmre-0.3.1/setup.cfg
Running esmre-0.3.1/setup.py -q bdist_egg --dist-dir /var/folders/57/jt0zh0ys541671wgmgp8dnkh0000gp/T/easy_install-v5dff8r8/esmre-0.3.1/egg-dist-tmp-3z1ktnug
src/esm.c:38:11: error: no member named 'ob_type' in 'esm_IndexObject'
    self->ob_type->tp_free((PyObject*) self);
    ~~~~  ^
src/esm.c:185:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
    PyObject_HEAD_INIT(NULL)
    ^~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/include/python3.4m/object.h:86:5: note: expanded from macro 'PyObject_HEAD_INIT'
    1, type },
    ^~~~~~~
src/esm.c:187:5: warning: incompatible pointer to integer conversion initializing 'Py_ssize_t' (aka 'long') with an expression of type 'char [10]' [-Wint-conversion]
    "esm.Index",                        /*tp_name*/
    ^~~~~~~~~~~
src/esm.c:190:5: warning: incompatible pointer types initializing 'printfunc' (aka 'int (*)(PyObject *, FILE *, int)') with an expression of type 'destructor' (aka 'void (*)(PyObject *)') [-Wincompatible-pointer-types]
    (destructor) esm_Index_dealloc,     /*tp_dealloc*/
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:205:5: warning: incompatible integer to pointer conversion initializing 'const char *' with an expression of type 'unsigned long' [-Wint-conversion]
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, /*tp_flags*/
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/include/python3.4m/object.h:643:29: note: expanded from macro 'Py_TPFLAGS_DEFAULT'
#define Py_TPFLAGS_DEFAULT  ( \
                            ^
src/esm.c:206:5: warning: incompatible pointer types initializing 'traverseproc' (aka 'int (*)(PyObject *, visitproc, void *)') with an expression of type 'char [47]' [-Wincompatible-pointer-types]
    "Index() -> new efficient string matching index",  /* tp_doc */
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:213:5: warning: incompatible pointer types initializing 'struct PyMemberDef *' with an expression of type 'PyMethodDef [4]' [-Wincompatible-pointer-types]
    esm_Index_methods,             /* tp_methods */
    ^~~~~~~~~~~~~~~~~
src/esm.c:214:5: warning: incompatible pointer types initializing 'struct PyGetSetDef *' with an expression of type 'PyMemberDef [1]' [-Wincompatible-pointer-types]
    esm_Index_members,             /* tp_members */
    ^~~~~~~~~~~~~~~~~
src/esm.c:221:5: warning: incompatible pointer types initializing 'allocfunc' (aka 'PyObject *(*)(struct _typeobject *, Py_ssize_t)') with an expression of type 'initproc' (aka 'int (*)(PyObject *, PyObject *, PyObject *)') [-Wincompatible-pointer-types]
    (initproc) esm_Index_init,      /* tp_init */
    ^~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:223:5: warning: incompatible pointer types initializing 'freefunc' (aka 'void (*)(void *)') with an expression of type 'PyObject *(PyTypeObject *, PyObject *, PyObject *)' [-Wincompatible-pointer-types]
    esm_Index_new,                 /* tp_new */
    ^~~~~~~~~~~~~
src/esm.c:239:9: error: non-void function 'initesm' should return a value [-Wreturn-type]
        return;
        ^
src/esm.c:241:9: warning: implicit declaration of function 'Py_InitModule3' is invalid in C99 [-Wimplicit-function-declaration]
    m = Py_InitModule3("esm", esm_methods,
        ^
src/esm.c:241:7: warning: incompatible integer to pointer conversion assigning to 'PyObject *' (aka 'struct _object *') from 'int' [-Wint-conversion]
    m = Py_InitModule3("esm", esm_methods,
      ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/esm.c:245:9: error: non-void function 'initesm' should return a value [-Wreturn-type]
        return;
        ^
11 warnings and 3 errors generated.
error: Setup script exited with error: command 'clang' failed with exit status 1

add mappings for bd00x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

Write tests to JSON Schema

What we want: Make sure that the JSON schema generated by DoJSON correctly matches the fields from the MARC21 test files. We don't want to miss any field.

Problem: JSON Schema's additionalProperties is currently of limited use when combined with allOf or anyOf. Thus we cannot use the JSON Schema as-is to validate that no additional fields have been added.

Note: Currently all the fields are typed as string. Thus we don't want to test complex typing or formats.

Possible solution 1: Write a script which checks that all fields in a JSON MARC21 record (JSON generated by dojson with a MARC21 file as input) are present in the JSON Schema (a sketch follows below).

Possible solution 2: Write a script which transforms the JSON Schema in a standardized way by merging all the schemas from anyOf lists, then adds additionalProperties = false and uses a JSON Schema library to validate.

Possible solution 3: Use the domapping tool to generate mappings from JSON Schema. Add strict everywhere in the mapping to refuse additional fields. Send the mappings to Elasticsearch. Try to load the MARC21 files and see if it fails.
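
A rough sketch of possible solution 1 (the file names and the flat "properties" layout of the schema are assumptions):

import json

with open('marc21_schema.json') as f:   # hypothetical schema file
    schema_fields = set(json.load(f)['properties'])

with open('record.json') as f:          # JSON generated by dojson from a MARC21 file
    record = json.load(f)

missing = set(record) - schema_fields
if missing:
    raise SystemExit('Fields missing from the JSON Schema: %s' % sorted(missing))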

add mappings for bd3xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

add mappings for bd6xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

contrib: dojson drop unknown fields

In #50, a test fails because a subfield is not part of the spec.

The faulty record:

<record>
    <datafield tag="852" ind1="8" ind2=" ">
        <subfield code="i">M 314</subfield>
        <subfield code="h">339</subfield>
        <subfield code="b">Library Reading Room</subfield>
        <subfield code="9">10</subfield>
    </datafield>
</record>

The test:

expected = create_record(xml)
result = to_marc21.do(marc21.do(expected))
assert expected == result

The spec says nothing about a $9 subfield. But should dojson remove it nonetheless?

RFC: One big metadata MARC21 file for testing ?

Hi,

@tiborsimko, @lnielsen, @egabancho: during this sprint we have gathered some example metadata from Zenodo, TIND and CDS. Do you think it makes sense to combine them into one big record with all possible MARC21 fields (also including tags from those 6 records provided by @Kennethhole in https://github.com/inveniosoftware/dojson/pull/76/files)?
Or should we create separate files for each source (cds_demo_data.xml, zenodo_demo_data.xml, tind_demo_data.xml)?
Or should we create one big file with multiple records inside?
The main reason for asking is the maintainability of the data in the future (we can easily add new fields if we keep only one file with test data).

marc21: 0247 overmatching

There seems to be overmatching happening with certain field/indicator combinations, such as those starting with a leading zero.

Consider the below-quoted Zenodo test input record, which contains a DOI field (0247) that is translated in the output both as other_standard_identifier (which is good) and as former_title (which is bad). The latter field is defined as:

dojson/contrib/marc21/fields/bd20x24x.py:@marc21.over('former_title', '^247[10][10]')

It seems the leading zero isn't properly matched via the caret.

In [10]: print x
<record>
  <controlfield tag="001">17575</controlfield>
  <controlfield tag="005">20150513165819.0</controlfield>
  <datafield tag="024" ind1="7" ind2=" ">
    <subfield code="2">DOI</subfield>
    <subfield code="a">10.5281/zenodo.17575</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p>Model definitions and data for BrainPigletHI&lt;/p></subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u"></subfield>
    <subfield code="a">Other (Open)</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">other-open</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">bphi: Initial release</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">provisional-user-zenodo</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">272681</subfield>
    <subfield code="u">https://zenodo.org/record/17575/files/bphi-v1.0.zip</subfield>
    <subfield code="z">0</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">software</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2015-05-13</subfield>
  </datafield>
  <datafield tag="347" ind1=" " ind2=" ">
    <subfield code="p">20</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Matthew Caldwell</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="a">https://github.com/bcmd/bphi/tree/v1.0</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="n">url</subfield>
  </datafield>
</record>

In [11]: marc21.do(create_record(x))
Out[11]: 
{'control_number': '17575',
 'copyright_or_legal_deposit_number': [{'copyright_or_legal_deposit_number': 'other-open',
   'source': 'opendefinition.org'}],
 'date_and_time_of_latest_transaction': '20150513165819.0',
 'digital_file_characteristics': [{}],
 'electronic_location_and_access': [{'file_size': '272681',
   'method': 'HTTP',
   'public_note': '0',
   'uniform_resource_identifier': 'https://zenodo.org/record/17575/files/bphi-v1.0.zip'}],
 'former_title': [{'title': '10.5281/zenodo.17575'}],
 'host_item_entry': [{'main_entry_heading': 'https://github.com/bcmd/bphi/tree/v1.0',
   'note': 'url',
   'relationship_information': 'isSupplementTo'}],
 'information_relating_to_copyright_status': [{'copyright_status': 'open'}],
 'main_entry_personal_name': {'personal_name': 'Matthew Caldwell'},
 'other_standard_identifier': [{'source_of_number_or_code': 'DOI',
   'standard_number_or_code': '10.5281/zenodo.17575',
   'type_of_standard_number_or_code': u'Source specified in subfield $2'}],
 'publication_distribution__imprint': [{'date_of_publication_distribution': '2015-05-13'}],
 'subject_added_entry_topical_term': [{'level_of_subject': u'Primary',
   'source_of_heading_or_term': 'opendefinition.org',
   'thesaurus': u'Source specified in subfield $2',
   'topical_term_or_geographic_name_entry_element': 'other-open'}],
 'summary': [{'summary': '<p>Model definitions and data for BrainPigletHI</p>'}],
 'terms_governing_use_and_reproduction_note': [{'terms_governing_use_and_reproduction': 'Other (Open)',
   'uniform_resource_identifier': u''}],
 'title_statement': {'title': 'bphi: Initial release'}}

docs: section and talk cleanup

As mentioned in #82 (comment) we may want to clean the docs a bit, for example:

1. Better separate user-oriented talk and developer-oriented talk

E.g. in the Usage section, we say:

[screenshot: Usage section of the docs]

The user does not need to know how the CLI is installed; if we talk about --help, then the user is probably more interested in what the usage options are and how to use them (how to use rules and schemas, how to find missing fields or ignore them, etc.). The entry-points talk should rather go into a developer-oriented section later (how to add rules, etc.).

2. Cleaner TOC

The current table of contents looks messy:

[screenshot: current table of contents]

The sections and subsections are not nicely separated.

add mappings for bd5xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

RFC: handling of indicators in Marc21

If I understand it correctly, the Marc21 format allows for arbitrary indicators, addressing the use cases mentioned by @fjorba. This seems a definite improvement compared to Invenio 1.x.

For illustrative purposes I'll just use MARC 100 below; however, MARC has several other fields where these issues apply as well.

If I understood the schema correctly, however, each MARC field gets mapped to a JSON internal name. So, e.g., a field like 100__ gets mapped to main_entry_personal_name. Similarly, 1000_, 1001_ and 1003_ get mapped to main_entry_personal_name as well, so all author personal names end up in the same JSON field. Again this nicely addresses the use case of @fjorba, as finally all authors get indexed and displayed regardless of the indicators, since after ingestion only main_entry_personal_name is used.

In discussions with @martinkoehler we now wondered about dissemination, and probably also indexing, issues arising from there.

Say I ingest MARC21 records that use 1001_. In MARC language this means $a stores "Surname, Forename". So the 1_ adds semantics, introducing the concepts of surname and forename and defining how they should be extracted.

100 1_ $aAdams, Henry

Now I ingest from another MARC source and I get 1000_. Here the 0 signifies that the name in $a is a forename, the canonical example at LoC being

100 0_ $aJohn $cthe Baptist, Saint

Sidenote to @martinkoehler: from the examples for 0_ it is clear that this does not refer to a storage like "Henry Adams" as compared to 1_ "Adams, Henry", but that it is indeed meant for name entities that consist of a forename only, like e.g. popes, saints or artists' names.

In this discussion we also came to the point that it would be possible in principle to treat 1_ programmatically as "split the name at the comma into the concepts of forename and surname and store them in two JSON fields". We were not clear whether this is intended. It could address the dissemination issue mentioned below.
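
For illustration, such a treatment of 1_ could be as simple as the following (a pure sketch, not something dojson does today; the function name is hypothetical):

def split_personal_name(name):
    """Split 'Surname, Forename' (MARC 100 ind1=1) into two JSON concepts."""
    surname, _, forename = name.partition(', ')
    return {'surname': surname, 'forename': forename or None}

split_personal_name('Adams, Henry')
# -> {'surname': 'Adams', 'forename': 'Henry'}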

Another case is 1003_:

100 3_ $aFarquhar family

Here you do not have a concept of forename/surname but the concept of a family name. (Note: I'd have to check whether RDA would drop the "family" in the above, since it is clearly expressed in 3_ already; it can be a leftover from ISBD in the AACR. At least I'd prefer to drop it.)

For indexing, one can argue that in the word index for names it might be no issue to treat them all alike. Regardless of whether you search for "Adams, Henry" or "Henry Adams", the word index will take care of it, probably treating the comma as an insignificant character. It might come up in the phrase index, however, at least if one has mixed storage (say one 100__$aHenry Adams).

Some thoughts on this?

The second point, and actually the main concern, arises from re-exporting to MARC. If 1001_ is ingested, one would expect to get 1001_ back, right? If I understand it correctly, right now one would get 100__ back, right? Given the semantics introduced by the indicators, ignoring them would effectively lose information.

In the current system this would not happen, at least not if one stores the ingestion format as-is. And as all processes work on the ingested format, updates to the records would be processed properly and thus keep the format.

Any thoughts on this yet?

RFC: Respecting the order of MARC subfields

As a follow-up to the RFC about the handling of indicators (#19), I would like to raise my concerns about the order of subfields. The order of subfields is important for libraries and should be taken into consideration in the MARC-JSON mapping. A very simple example is an item that has been published by two publishers in two different locations. Following MARC21, the metadata would look like this:

<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">Paris :</subfield>
<subfield code="b">Gauthier-Villars ;</subfield>
<subfield code="a">Chicago :</subfield>
<subfield code="b">University of Chicago Press,</subfield>
<subfield code="c">1955.</subfield>
</datafield>

and should be displayed to the user in the order the subfields are stored:

Paris : Gauthier-Villars ; Chicago : University of Chicago Press, 1955.

If I have understood it correctly, contrib.marc21 does not have the concept of order, and it would map subfields a together, subfields b together, etc. My concerns are:

a) We will not be able to display the subfields in the correct order, so it might end up being displayed like this:

Chicago : University of Chicago Press, Paris : Gauthier-Villars ; 1955.

You can see that the punctuation changes between equal subfields, as it depends on the next subfield (comma before subfield c); this illustrates how important the order can be. In the worst case it can end up looking like this:

Paris : Chicago : Gauthier-Villars ; University of Chicago Press, 1955.

b) Exporting it back to MARC would leave us with XML which looks like:

<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">Paris :</subfield>
<subfield code="a">Chicago :</subfield>
<subfield code="b">Gauthier-Villars ;</subfield>
<subfield code="b">University of Chicago Press,</subfield>
<subfield code="c">1955.</subfield>
</datafield>

c) I am also curious how this will be handled by Elasticsearch. Will I be able to copy-paste the displayed text "Paris : Gauthier-Villars ; Chicago : University of Chicago Press, 1955." and do an exact/partial phrase query?

Has any of these concerns been taken into consideration?
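
For what it's worth, the low-level parse does keep the original order: create_record returns a GroupableOrderedDict whose __order__ key records the subfields as they appeared. A quick sketch (behaviour inferred from the GroupableOrderedDict examples elsewhere in this tracker):

>>> from dojson.contrib.marc21.utils import create_record
>>> xml = '''<record><datafield tag="260" ind1=" " ind2=" ">
... <subfield code="a">Paris :</subfield>
... <subfield code="b">Gauthier-Villars ;</subfield>
... <subfield code="a">Chicago :</subfield>
... </datafield></record>'''
>>> create_record(xml)['260__']['__order__']
('a', 'b', 'a')

The open question is whether the contrib.marc21 rules preserve that information once subfields are grouped by code.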

Marc21 indicators: Check allowed indicators to become compliant with LOC

Hi,
I noticed that the allowed indicators for 245 are not correct.
https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/fields/bd20x24x.py#L186
Here, the second indicator should allow the characters 0,1,2,3,4,5,6,7,8,9.
http://www.loc.gov/marc/bibliographic/bd245.html

I believe the automatic extraction of which indicators are allowed has failed due to the use of "1-9".
If you have a look at "Varying form of the title", it is correct, since all the numbers are listed on the LOC website.
https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/fields/bd20x24x.py#L226
http://www.loc.gov/marc/bibliographic/bd246.html

I quickly had a look, and it seems we have the same problem with 222, 240, 242 and 243 as well. This is only for the titles; we should have a look at the other MARC fields as well.
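
Presumably the fix is to widen the indicator character class in the rule pattern. A small illustration (the exact pattern string currently in bd20x24x.py may differ):

import re

# LOC allows any digit 0-9 as the second indicator of 245, so the
# pattern needs a full digit class there.
pattern = re.compile('^245[10][0-9]')
assert pattern.match('24510')      # ind1=1, ind2=0
assert pattern.match('24509')      # ind1=0, ind2=9
assert not pattern.match('2451a')  # invalid second indicator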

add mappings for bd70x75x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

dojson: test all fields

Make sure all the fields are tested. This can be split among many people.
Checked fields:

  • bd00x.py
  • bd01x09x.py
  • bd1xx.py
  • bd20x24x.py
  • bd25x28x.py
  • bd3xx.py
  • bd4xx.py
  • bd5xx.py
  • bd6xx.py
  • bd70x75x.py
  • bd76x78x.py - @SamiHiltunen
  • bd80x83x.py
  • bd84188x.py

... (TODO: Check the file names and finish the list)

utils: force_list returns a tuple

utils.force_list(), despite its name and docstring, actually returns a tuple or None.
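
For example (behaviour as of the versions discussed here):

>>> from dojson.utils import force_list
>>> force_list('title')
('title',)
>>> force_list(None) is None
True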

A better implementation could be:

def force_list(data):
    """Wrap data in list."""
    if data is None:
        return []
    elif not isinstance(data, (list, tuple, set)):
        return [data]
    elif isinstance(data, (tuple, set)):
        return list(data)
    return data

However, it would then be necessary to amend filter_values to also filter away empty lists.

add mappings for bd4xx.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

dojson: add mappings for bd84188x.py

383 ##$ano. 14,$bop. 27, no. 2
383 ##$cBWV 211
383 ##$bop. 8, no. 1-4
, etc.

The idea is to convert those lines into MARCXML example data (don't do this by hand, write a script for that!), so we are sure all mappings are tested:

<record>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="a">no. 14</subfield>
    <subfield code="b">op. 27, no. 2</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="c">BWV 211</subfield>
  </datafield>
  <datafield tag="383" ind1=" " ind2=" ">
    <subfield code="b">op. 8, no. 1-4</subfield>
  </datafield>
  <!-- etc. -->
</record>

marc21liberal: create very liberal MARC21 schema

Following the discussion in #23, it would be interesting to create a new, very liberal MARC21 schema, where all fields and subfields would be repeatable, all indicator values allowed, etc. We may call it marc21liberal or some such.

utils: eval(repr(GroupableOrderedDict(...))) != GroupableOrderedDict(...)

Currently:

>>> from dojson.utils import GroupableOrderedDict
>>> a = GroupableOrderedDict((('a', 12), ('b', 123), ('a', 23)))
>>> repr(a)
"GroupableOrderedDict((('__order__', ('a', 'b', 'a')), ('a', (12, 23)), ('b', 123)))"
>>> eval(repr(a))
GroupableOrderedDict([('__order__',
                       ('__order__', '__order__', '__order__', 'a', 'a', 'b')),
                      ('a', (12, 23)),
                      ('b', 123)])
>>> eval(repr(a)) == a
False

Instead, I would expect:

>>> repr(a)
"GroupableOrderedDict((('a', 12), ('b', 123), ('a', 23)))"
>>> eval(repr(a))
GroupableOrderedDict((('a', 12), ('b', 123), ('a', 23)))
>>> eval(repr(a)) == a
True

See: https://docs.python.org/2/library/functions.html#repr
