eas-schematrons

Overview

This is the official repository for the Schematron files managed by TS-EAS. The Schematron files in this repository specify additional tests that should be performed on documents associated with TS-EAS schemas, such as EAC-CPF 2.0, to ensure consistent encoding practices.

The tests included in the output files of this repository (e.g. eac.sch) should be considered a required extension of the base schema (e.g. eac.rng). For guidance on how to associate the Schematron and base schema files, see [https://github.com/SAA-SDT/TS-EAS-subteam-notes/wiki/Schematron].

Implementers might also choose to add additional tests, depending on their own local or consortium requirements. For guidance on how to do this, see [https://github.com/SAA-SDT/TS-EAS-subteam-notes/wiki/Schematron].

Summary of Tests Provided

The TS-EAS Schematron files currently specify the following tests:

  • If and only if a language encoding of iso639-1, iso639-2b, iso639-3, or ietf-bcp-47 is set in the control section, then every languageOfElement and languageCode attribute in the document will be tested against a regular expression pattern that adheres to that specific standard.

    • For example, if either iso639-1 or ietf-bcp-47 is set in the control section, then a value of "fr" will validate because those two characters are found in the associated regular expression pattern.
  • The countryCode attribute will be validated against ISO 3166-1, unless eac:control/@countryEncoding is set to "otherCountryEncoding".

  • All scriptCode and scriptOfElement attributes will be validated against ISO 15924, unless eac:control/@scriptEncoding is set to "otherScriptEncoding".

  • The text value of the agencyCode element will be validated against a regular expression that adheres to the ISO 15511 standard, unless eac:control/@repositoryEncoding is set to "otherRepositoryEncoding". The regular expression requires a prefix, a hyphen ("-"), and an identifier.

    • The prefix must either match a country code from ISO 3166-1, or it must include three to four characters in the A to Z range, regardless of case (e.g. "oclc", "EUR", and "SzB" are all valid prefixes).
    • The identifier that follows the hyphen can be 1 to 11 characters long and include a mix of A-Z characters (regardless of case), numeric characters from 0-9, as well as any of the additional three characters: ":", "/", and "-".
  • Every '@id' attribute will be tested in the document to ensure that each id only occurs once. This test is already carried out by the XSD version of the schema, but it is not enforced by the RNG version due to how RNG treats the xsd:ID datatype.

  • Every reference-related (e.g. '@sourceReference') and target attribute present will be tested to ensure that the attribute is linked elsewhere in the current file.

    • @conventionDeclarationReference: must be linked to an @id found in a conventionDeclaration element.
    • @localTypeDeclarationReference: must be linked to an @id found in a localTypeDeclaration element.
    • @maintenanceEventReference: must be linked to an @id found in a maintenanceEvent element.
    • @sourceReference: must be linked either to an @id found in a source element or a citedRange element.
    • @target: must be linked to an @id found somewhere within the current document.
  • The maintenanceAgency element within the control section must include either a non-empty agencyCode element or a non-empty agencyName element. It can also have both, but it needs one of the two at a minimum.

  • The eventDateTime element must include either a @standardDateTime attribute or text.

  • Any use of the @era attribute should be restricted to either 'ce' or 'bce'.

  • Unless the dateEncoding within the control section is set to "otherDateEncoding", a sub-profile of dates allowable by ISO 8601:2019, parts 1 and 2 (which includes EDTF dates), will be enforced. The Schematron file includes the following restrictions on the @notAfter, @notBefore, and @standardDate attributes:

    1. Valid dates within all three attributes may be composed of the following:
      • a Year
        • may optionally start with a "+" or "-".
        • contains no fewer than 4 characters (e.g. year 100 must be encoded as 0100) and no more than 10 characters.
        • contains numeric characters, or an X to indicate an unknown value, according to the new EDTF features provided in ISO 8601:2019 (e.g. "192X" is valid).
      • a Year, a hyphen separator (intentionally not optional in our sub-profile of ISO 8601), and a Season, which must have a value from 21 to 41, according to ISO 8601:2019.
      • a Year, a hyphen separator, and a Month, which must include a value from 01 (for January) to 12 (for December). Unknown values can optionally be indicated with an "X".
      • a Year, a hyphen separator, a Month, a hyphen separator, and a Day, which must include two characters in the range of 01 to 31, with an optional X to indicate an unknown value. If the month is February, then 30 and 31 will be invalid values, and 29 will only be valid for leap years.
      • a Year, a hyphen separator, a Month, a hyphen separator, a Day, and a Time, which can either start with a "T" or a " ".
      • a qualifier that precedes or follows any of those parts. The valid qualifiers are "?" (i.e. uncertain), "~" (i.e. approximate), and "%" (i.e. uncertain and approximate).
    2. @notAfter and @notBefore must not contain any date ranges, which may be specified with ".." and "/". If date ranges are required, then those should only be encoded within a @standardDate attribute that is present on a date element (not a fromDate or toDate element).
    3. Further, though date ranges may start and end with "..", they should not start or end with a "/".
    4. Last, regarding date ranges, only one range can be specified in our profile. In other words: 1800/1820 is valid, but 1800/1820..1830 would not be. Similarly, a single attribute that encodes a date set and range (e.g. 1800,1802,1807,1810..1820) would also not be valid. For that case, the dateSet element should be utilized instead.
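
The single-date rules above can be illustrated with a short Python sketch. This is not the regular expression used in the TS-EAS Schematron files; the pattern and the function names are hypothetical. It makes several simplifying assumptions: the sign is not counted toward the 4 to 10 year characters, an unknown Month or Day is only accepted as "XX", and Times and qualifiers ("?", "~", "%") are not handled at all.

```python
import re

# Simplified, hypothetical pattern (not the one in the TS-EAS Schematron):
# optional sign, 4-10 year characters (digits or X), then optionally a
# Month (01-12), a Season (21-41) or XX, then optionally a Day (01-31 or XX).
_SINGLE_DATE = re.compile(
    r"(?P<year>[+-]?[0-9X]{4,10})"
    r"(?:-(?P<month>0[1-9]|1[0-2]|2[1-9]|3[0-9]|4[01]|XX)"
    r"(?:-(?P<day>0[1-9]|[12][0-9]|3[01]|XX))?)?"
)


def _is_leap(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_valid_single_date(value: str) -> bool:
    """Check a Year, Year-Month, Year-Season, or Year-Month-Day value."""
    match = _SINGLE_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.group("year", "month", "day")
    if day is None:
        return True  # Year, Year-Month, or Year-Season
    if month != "XX" and int(month) > 12:
        return False  # a Season cannot be followed by a Day
    if month == "02" and day != "XX":
        if int(day) > 29:
            return False  # 30 and 31 are never valid in February
        if int(day) == 29 and "X" not in year:
            return _is_leap(int(year))  # 29 only in leap years
    return True


def is_valid_bound(value: str) -> bool:
    """@notBefore / @notAfter: a single date, never a range."""
    if ".." in value or "/" in value:
        return False
    return is_valid_single_date(value)
```

With these assumptions, is_valid_single_date("192X") and is_valid_single_date("2024-02-29") are accepted, while is_valid_single_date("2023-02-29") and is_valid_bound("1800/1820") are rejected.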

GitHub Repository Structure

Currently, this repository includes a few files in the root, such as a data license and this README. Additionally, there are four different directories:

  • build
    • This directory currently contains both a shell script and a Windows command file that can be used to regenerate the Schematron file, which is written to the "schematron" directory. To do that, either script calls the XSLT transformation in the "build/transformations" directory, which in turn takes the files in the "src" directory and uses them to generate the Schematron file. One reason for this extra step is that the automated process makes it easier, for example, to update the regular expression pattern used in our validation of IETF BCP 47 language codes.
  • schematron
    • This directory contains the final deliverable of the Schematron file, which is currently delivered in a single file (i.e. a single file for each base TS-EAS schema, since those currently have separate namespaces).
  • src
    • This directory contains the source file, which the build directory uses to create the output in the schematron directory. Therefore, if any additional tests are added, they will be added in this directory.
  • vendor
    • For the time being, we are including the XSLT processor needed to generate the resulting schema file directly in this GitHub repository. Until we have an automated build process baked in, this makes it easier to share the current build process. As it stands now, the only requirement is a working installation of Java on your computer.

Contributors

fordmadox, kerstarno

eas-schematrons's Issues

We currently don't have any special tests for unitDate in the same way we do for the single date elements (date, fromDate, toDate)

Ideally, these two elements will merge (since structured dates can benefit from textual description, and textual description dates already do not require machine-readable date attributes in EAD), but until then, we should likely not allow such encodings like the following:

<unitDate notAfter="2024whatayear"/>

This is simple to address in the Schematron, but it raises the question of why we still have the two elements: why give users the choice to encode an equivalent (but in this case Schematron-valid) statement such as:

<unitDateStructured><date notAfter="2024"/></unitDateStructured>

Allowing both goes against our first schema-design principle, anyway.

Schematron currently checks if the content in <agencyCode> is compliant with ISO 15511 even if @repositoryEncoding is not set

Creator of issue

  1. Kerstin Arnold
  2. EAD team lead, TS-EAS
  3. @kerstarno

The issue relates to

  • EAC-CPF schema issue
  • EAC-CPF Tag Library issue
  • EAD schema issue
  • EAD Tag Library issue
  • Schema issue
  • Tag Library issue
  • Suggestions for all schemas
  • Suggestions for all Tag Libraries
  • Other

Wanted change/feature

  • Text: This might be a similar remnant from EAD3 Schematron rules as requiring ISO 8601 compliant dates even if @dateEncoding isn't set (see #7). At the moment, in both the XSD and the RNG (when used with Schematron), there seems to be a check for the content of <agencyCode> to be compliant with ISO 15511, so that the following error message is shown even if @repositoryEncoding isn't set (yet) in <control>: "If the repositoryEncoding is set to ISO 15511, then the format of the value of the agencyCode element is constrained to that of the International Standard Identifier for Libraries and Related Organizations (ISIL: ISO 15511): a prefix, a hyphen, and an identifier."
  • This message should only be shown if @repositoryEncoding is present and set to the value "iso15511".
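
A minimal Python sketch of the requested behavior, assuming the check should run only when @repositoryEncoding is "iso15511". The function name and the pattern are hypothetical and are not taken from the Schematron source. As a simplification, any two letters stand in for an ISO 3166-1 country code, so the prefix is approximated as two to four letters.

```python
import re
from typing import Optional

# Approximation of the ISIL shape described in the README: a prefix of two
# letters (standing in for an ISO 3166-1 country code) or three to four
# letters, a hyphen, and an identifier of 1 to 11 characters.
_ISIL_LIKE = re.compile(r"[A-Za-z]{2,4}-[A-Za-z0-9:/-]{1,11}")


def agency_code_problem(
    repository_encoding: Optional[str], agency_code: str
) -> Optional[str]:
    """Return an error message, or None when no problem should be reported."""
    if repository_encoding != "iso15511":
        # Attribute absent or set to another value: the ISIL test is skipped.
        return None
    if _ISIL_LIKE.fullmatch(agency_code.strip()):
        return None
    return "agencyCode must consist of a prefix, a hyphen, and an identifier"
```

Under this sketch, agency_code_problem(None, "anything") reports nothing, whereas agency_code_problem("iso15511", "anything") returns the message.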

Add validation for cases that use the "EASList" value in the @...Encoding attributes of <control>

Creator of issue

  1. Kerstin Arnold
  2. EAD team lead, TS-EAS
  3. @kerstarno

Wanted change/feature

  • Text: This feature request follows SAA-SDT/eas-schemas#1. With the decision to define the use of value lists via @...Encoding attributes within <control>, users can decide to use the "EASList" for values in the respective attributes within the descriptive part of EAD, or any other list, i.e. their own (either completely different from the EAS lists or an extension of them).
  • For now, the "EASList" value essentially picks up on the values that have been predefined for these attributes in EAD3 (or EAC-CPF 2.0, respectively), and this is what Schematron should check against. In the context of finalising EAD 4.0, these values should be checked and it should be reviewed whether any other values should be added.
  • For now, this check only applies to the draft of EAD 4.0. EAC-CPF 2.0 would first need to be revised with regard to removing the predefined value lists.
  • If @addressLineTypeEncoding is used with the value "EASList", the @addressLineType attribute should only contain one of the following values: county, country, district, municipality, postBox, postalCode, region, street
  • If @audienceEncoding is used with the value "EASList", the @audience attribute should only contain one of the following values: external, internal
  • If @contactLineTypeEncoding is used with the value "EASList", the @contactLineType attribute should only contain one of the following values: directions, email, fax, homepage, mobileNumber, phoneNumber
  • If @coverageEncoding is used with the value "EASList", the @coverage attribute should only contain one of the following values: part, whole
  • If @detailLevelEncoding is used with the value "EASList", the @detailLevel attribute should only contain one of the following values: basic, extended, minimal
  • If @descriptionOfComponentsTypeEncoding is used with the value "EASList", the @descriptionOfComponentsType attribute should only contain one of the following values: analyticOverview, combined, inDepth
  • If @levelEncoding is used with the value "EASList", the @level attribute should only contain one of the following values: class, collection, file, fonds, item, recordGroup, series, subfonds, subgroup, subseries
  • If @maintenanceEventTypeEncoding is used with the value "EASList", the @maintenanceEventType attribute should only contain one of the following values: cancelled, created, deleted, derived, revised, unknown, updated
  • If @maintenanceStatusEncoding is used with the value "EASList", the @maintenanceStatus attribute should only contain one of the following values: cancelled, deleted, deletedMerged, deletedReplaced, deletedSplit, derived, new, revised
  • If @physDescStructuredTypeEncoding is used with the value "EASList", the @physDescStructuredType attribute should only contain one of the following values: carrier, materialType, spaceOccupied
  • If @publicationStatusEncoding is used with the value "EASList", the @publicationStatus attribute should only contain one of the following values: approved, inProcess, published
  • If @statusEncoding is used with the value "EASList", the @status attribute should only contain one of the following values: alternative, authorized, ongoing, unknown
  • If @unitDateTypeEncoding is used with the value "EASList", the @unitDateType attribute should only contain one of the following values: bulk, inclusive

Note: While most of the values listed above have been kept as they are in EAD3 (or EAC-CPF 2.0, respectively), some have been adapted to camelCase or spelled out in full. These values are:

  • "analyticOverview" and "inDepth" for @descriptionOfComponentsType
  • "recordGroup" and "subgroup" for @level
  • "materialType" and "spaceOccupied" for @physDescStructuredType

The spelling of these attribute values will hence need to be adapted as part of the general conversion to EAD 4.0, which will be developed in one of the next stages of the revision process.
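
The logic of the requested check can be sketched in Python as a lookup table keyed by attribute name. The table below copies only four of the lists from this issue to keep the example short; the names EAS_LISTS and is_allowed are hypothetical, and the attributes of <control> are assumed to be passed in as a plain mapping.

```python
from typing import Mapping

# Illustrative subset of the value lists given in this issue.
EAS_LISTS = {
    "audience": {"external", "internal"},
    "coverage": {"part", "whole"},
    "level": {
        "class", "collection", "file", "fonds", "item",
        "recordGroup", "series", "subfonds", "subgroup", "subseries",
    },
    "unitDateType": {"bulk", "inclusive"},
}


def is_allowed(control: Mapping[str, str], attribute: str, value: str) -> bool:
    """Return False only when the EASList applies and the value is not in it."""
    if control.get(attribute + "Encoding") != "EASList":
        return True  # no EASList declared for this attribute: nothing to test
    allowed = EAS_LISTS.get(attribute)
    return allowed is None or value in allowed
```

For example, is_allowed({"levelEncoding": "EASList"}, "level", "fonds") passes, while the same call with a value outside the list fails. If @levelEncoding is absent or set to anything else, the value is not tested.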

Consider a rule to ensure that @...Encoding attributes are present and linked to <conventionDeclaration>

Creator of issue

  1. Kerstin Arnold
  2. EAD team lead, TS-EAS
  3. @kerstarno

Wanted change/feature

  • Text: This feature request follows SAA-SDT/eas-schemas#1. With the decision to define the use of value lists via @...Encoding attributes within <control>, it might be worth investigating and considering if this could be linked to a Schematron rule, especially if - as currently suggested - the vast majority of these @...Encoding attributes would be optional. Ideally there would be two checks:
  • First, if any of the attributes that would require the definition of a value list via @...Encoding is used, there should be a check if this @...Encoding attribute indeed is present. E.g. if I use the @audience attribute, I need to have the @audienceEncoding attribute in <control>.
  • Second, though I will admit that I'm not sure if Schematron can be used in this case, it would be great if there could be a check that, in case the @audienceEncoding attribute (to stick with the example) is set to "otherAudienceEncoding", there indeed is a <conventionDeclaration> element that defines or references a definition of the values used for @audience. As there would be no prescribed way to encode this, though, it might not be possible to do this?
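
The first of these two checks can be sketched in Python with the standard library's ElementTree. This only illustrates the logic and is not a Schematron rule. The function name, the WATCHED subset of attributes, and the sample document in the usage note are hypothetical. The second check is not attempted here.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative subset of attributes whose value list is declared in <control>.
WATCHED = ("audience", "coverage", "level")


def _local_name(tag: str) -> str:
    # ElementTree writes namespaced tags as "{uri}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def missing_encodings(xml_text: str) -> List[str]:
    """List the @...Encoding attributes that are used but not declared."""
    root = ET.fromstring(xml_text)
    control = next(
        (el for el in root.iter() if _local_name(el.tag) == "control"), None
    )
    declared = set(control.attrib) if control is not None else set()
    missing = set()
    for element in root.iter():
        for name in WATCHED:
            if name in element.attrib and name + "Encoding" not in declared:
                missing.add(name + "Encoding")
    return sorted(missing)
```

Given a made-up document such as <ead><control levelEncoding="EASList"/><archDesc level="fonds" audience="internal"/></ead>, the function returns ["audienceEncoding"], because @audience is used without a matching declaration in <control>.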
