
schematron-enhancement-proposals's Introduction

This repository was archived and made read-only on 1 October 2020.

At that time, new applications of Schematron were advised to use the SchXslt Schematron implementation at https://github.com/schxslt/schxslt. The list of currently known Schematron implementations is maintained in the 'Awesome Schematron' repository at https://github.com/Schematron/awesome-schematron#software.

schematron Release

This is the most recent version of the "skeleton" XSLT implementation of ISO Schematron by Rick Jelliffe and many others. Notable early contributions were made by Oliver Becker and his students.

It is a library of XSLT scripts suitable for embedding in applications or servers, or running from command shells. There is a version for XSLT1 and one for XSLT2. There is an XSLT API to allow easy integration, but the most popular approach is to use the generated output XML documents, which use the flat SVRL (Schematron Validation Reporting Language) defined as part of ISO Schematron.

This Open Source software was first released in 2000, and has had various homes since then: xml.ascc.net (Academia Sinica, Taiwan), Schematron.com (Rick Jelliffe's information site, courtesy of Allette Systems), GoogleCode and now GitHub. There are several other minor forks of Schematron on the web: as at January 2017, this site is Rick's "official" distribution site for the code.

Status: The code has tracked the various versions of Schematron from version 1.1 to ISO Schematron 2006 and draft ISO Schematron 2nd edition (now ISO Schematron 2016). The scripts are currently being checked against the released ISO Schematron 2016 International Standard to confirm conformance, and to merge various bug fixes and enhancements that have been requested over the last decade.

Bugs and Limitations

As of October 2020 this implementation is not conformant to the ISO specification with regards to the following requirements:

  • The language tag of a diagnostic is not copied to the SVRL output.
  • Property references are not copied to the SVRL output.
  • The xsl:copy-of instruction is not executed inside a sch:property element.
  • The sch:name element with a @path attribute does not expand into the value of evaluating the expression in @path.
  • An xsl:key element cannot contain a sequence constructor.
  • A variable defined for a phase is not scoped to this phase, but has global scope.
  • A variable defined for a pattern is not scoped to this pattern, but has global scope.
  • The rule context cannot be a comment node.
  • The rule context cannot be a processing instruction node.
  • A subordinate document expression cannot contain a variable.
  • A rule can extend an abstract rule that is defined in a different pattern.


schematron-enhancement-proposals's Issues

Enhance sch:let to support functions, accumulators and keys

This is a proposal to enhance sch:let so that it can be used for

  • function definition
  • xsl:key functionality
  • xsl:accumulator functionality

For now, this proposal only looks at two issues:

  • First, the difference between denotative languages like Schematron and functional languages like XSLT.
  • Second, a syntax for functions that would work in any XSLT implementation.

A desirable side-effect might be that, with viable "Schematron-y" ways to do functions etc., XSLT namespace elements could be disallowed in Schematron schemas (unless pulled in as an external library).

Background

During the initial standardization, we discussed whether Schematron should have functions. Because it was not clear then that XSLT would be as common as XPath implementations or what was happening with data typing, we decided to keep things minimal, and to allow piggybacking constrained by the QLB.

Providing them was prudent, practical and generous for a new standard for a juvenile language, especially one aiming to be as minimal as possible. However, perhaps we should now be looking at how to evolve Schematron to have denotative support for them: it may be low-hanging fruit!

The context of this is the discussions on libraries and which XSLT elements should be allowed in the standard QLBs.

Denotative versus Functional Languages

  • XSLT is a functional language. So it is natural that XSLT developers will want to implement things in a functional kind of way: with functions. ISO Schematron allows some xslt:* elements to help, but I think it is reasonable that it keeps them to a minimum, otherwise it stops being a thing in its own right.

    • An example I saw of this was when I started working at a large organization that had hired a Java/XSLT developer to make its Schematron schemas: the developer was very smart (and I appreciated working with her) but had decided that all assertions should be either external Java function calls, or XSLT functions that usually called Java: so rather than take advantage of XPaths, XPaths were in effect banned in assertions (and were simple in the template matches). Thinking that this would be easier to write and maintain, she ended up not with a bad Schematron schema, but with a bad Java program.
    • Now actually in this case the underlying reason was that the code needed a lot of database calls; and the wrong call had been made architecturally, so that instead of a single coarse-grain call to the database at the start, storing into a variable, each function had to make its own fine-grain enquiry.
    • So this is an example of how what people think of as language-capability issues can very often be architecture issues: how you have gone about solving the problem. When you find you have to jump through hoops to do something, you need to check that there is not a better way.
  • However, Schematron is not a functional language. It is a denotative language.

    • Peter Landin's The Next 700 Programming Languages (1966), https://dl.acm.org/doi/10.1145/365230.365257, introduced the term "denotative language" and explained why it is different from other LISPy approaches of "declarative" and "functional" programming.
    • A denotative language is concerned with "characterizing some ... system as a set of entities and functional relations between them." To do this, he makes heavy use of "where" clauses (and their named versions, "let" clauses).
      • For example, let's say we want to write some complex transform functionally: we may have a chain of functions, each one working on the output of the next, like a pipeline.

      • Or we may have a list of functions then apply them with some higher-order function (like map or fold).

      • We may adopt a discipline that we alternate between "mapping" or "reduction" functions, and so end up with a "map-reduce" system.

      • But if we want to write it denotatively then we think in terms of a sequence of scoped things (with names), what holds for them (type), what is the relationship from its sources (functional).

        • You could consider it a succession of views.
    • The advantage of having named objects (for documentation, incremental programming, program proving and efficiency) should be obvious.
      • Landin also said the advantage of denotative programming is that the represented entities can be optimized better: in XSLT systems this might be just lazy evaluation; in symbolic programming systems it can enable (in the practical or encouragement sense) re-arrangement of terms.
  • So in terms of Schematron, a functional programmer would be asking "how can I make functions that do useful things?" while a denotative programmer would be asking "how can I make variables that name useful things?"

    • Functional programming has a reputation of risking unmaintainability without extreme discipline: the more that generic functions are used, the less that the code gives a birds-eye view of intent. Denotative programming requires that the developer think in terms of neat, well-described stages.
    • Indeed, isn't this exactly the same discipline that Schematron itself attempts for XML documents? To prevent some brilliant and elegant abstraction (e.g. grammars) from producing artifacts that are divorced from intent. So to figure out CALS tables, you would create a series of variables, each with whatever information is needed to make explicit some hidden structure:
      • For example, you might have one variable with the max number of columns, and another containing a list of each spanned and/or merged cell, which your code then looks up.
      • You might say that the idiom is extracting out-of-band markup and derived features, rather than successive functional transformations of a document.
  • So the more that a functional programmer starts attempting to do functional programming in a denotative language like Schematron, the more frustrated they will be that Schematron only provides such half-assed support.

    • Including libraries and so on becomes an issue: making Schematron more and more like XSLT becomes the supposed solution.
    • The issue of how to include foreign functional code (e.g. XSLT generated by REx) becomes the focus, rather than how to generate denotative Schematron variables (and e.g. abstract rules or patterns to get parameters if desired).
    • I note that more and more developers are using nested XPath for-each iterators. It is nice to have iterators, however, when the expression gets complicated it runs the risk of obscuring the intent. Refactoring to use either XPath variables or Schematron variables where possible can allow cake + eating.

Suggestion: parameterized variables = functions

Should Schematron variables be improved to take parameters of some type (these being implemented by constructing XPath or XSLT functions)? Perhaps all we need is that a Schematron variable can be bound to an XPath (or XSLT) function. This would allow any function that can be defined entirely with XPaths. (XPath 3 allows variables, comments, etc., so it is not so bad.)

Formal parameters are provided. This allows the let to act as a function declaration. (It might be that all the functions have a built-in parameter to get the context as their first parameter, defaulted. That might allow some simplification.)

<sch:rule context="moneyList" id="r1">
    <sch:let name="sum-children" value="$myContext/sum(*)" as="function(*)">
        <sch:param name="myContext"/>
    </sch:let>

    <sch:assert test="r1:sum-children(.) > 0">
        A money list's subelements should sum to greater than 0</sch:assert>
</sch:rule>

N.B. The idea of assigning a function to a variable may be strange, but it is supported in e.g. LISP, C, Java, JavaScript etc. However, in XSLT it would be implemented as an xsl:function with a prefix added if needed. (I.e. it would probably not be implemented as an xsl:variable bound to an XPath 3 inline function declaration.)

That it is a function would be known from the sch:let/@as statement. The QLBs for XSLT1 and XSLT2 would be revised to allow sch:let/@as, but only for "function(*)", because they don't need to (or cannot) support XSD typing.
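As a rough sketch only (not part of the proposal text): the sch:let above might compile into the generated validating stylesheet along these lines, assuming XSLT 2.0 and an r1 prefix bound to some schema-defined namespace.

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:r1="urn:example:r1">

   <!-- The sch:let with as="function(*)" becomes a named function. -->
   <xsl:function name="r1:sum-children" as="xs:double">
      <xsl:param name="myContext" as="element()"/>
      <xsl:sequence select="sum($myContext/*)"/>
   </xsl:function>

   <!-- The assert test then simply calls the function. -->
   <xsl:template match="moneyList" mode="M1">
      <xsl:if test="not(r1:sum-children(.) > 0)">
         <!-- emit an svrl:failed-assert here -->
      </xsl:if>
   </xsl:template>

</xsl:stylesheet>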

Misc

Should Schematron variables be improved to subsume XSLT 3 accumulators and xsl:key? (Can such things be implemented in XSLT 1 and 2 with extra modes?)

You can see the direction here: do away with XSLT elements in Schematron.

P.S. It is interesting to see, in that paper from 56 years ago, various people discussing whether using indentation rather than delimiters is good for programming languages...
P.P.S. Schematron's element is named "let" not "variable" as a reference to that paper, and so that it could be developed independently of xsl:variable. The other thing that I took from the paper (which we were taught in the Computer Languages course at Uni) is the idea that we want to think in terms of families of languages: so in Landin's terminology Schematron is the family and its XML syntax and Query Language Binding specifies the member.

Remove or reformulate statements about natural-language text

The standard contains several statements about natural-language texts:

In section 5.4.2, about asserts:

The natural-language assertion shall be a positive statement of a constraint.

In section 5.4.12, about reports:

The natural-language assertion shall be a positive statement of a found pattern or a negative statement of a constraint.

Two things:

  • IMHO these kinds of rules don't belong in a standard. It's up to the Schematron user how to formulate their messages, and the standard has nothing to say about this. Schematron still works fine if you add <assert test="...">Grr, wrong!</assert> (although I wouldn't recommend it, but that's another matter). To me it feels like saying that if you use this particular tool (in this case Schematron) you are only allowed to use it in such and such a way; any other usage is prohibited, but we can't enforce this.
  • An assert having to be stated positively is very restricting. What's wrong with "This UUID is not unique"? Why is "A UUID should be unique" any better?

Nonetheless, some advice, recommendation or whatever about how to formulate the messages should probably be in the standard. But not as it is now.

Allow <rule> elements without contents

According to the official Schematron schema, an empty <rule> (not containing asserts/reports) is not allowed. However, there are use cases where this is useful. For instance, when you purposely want to stop further validation in a pattern when a certain condition is true.

Therefore I would propose to allow empty <rule> elements as well.
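A minimal sketch of the use case, assuming empty rules were permitted (element names invented for illustration): an empty rule consumes the matching nodes, so later rules in the same pattern never fire for them.

<sch:pattern id="check-sections">
  <!-- Draft material is deliberately matched and ignored here, so the
       rules below never fire for it. -->
  <sch:rule context="*[@status = 'draft']"/>
  <sch:rule context="section">
    <sch:assert test="title">A section should have a title.</sch:assert>
  </sch:rule>
</sch:pattern>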

Proposal for proper language versioning system

Proposal:
The correct method for producing a new version of Schematron with new syntax is to make a new namespace. This is how the move from Academia Sinica's original Schematron to ISO Schematron was handled. Implementations need a slight tweak to handle it, and then versioning stops being a problem for users.

In particular, I propose the following.

How:

Schematron implementations, including existing ones, should move to either:

a) preprocess schema files to convert the namespace to the one they understand, after which their normal error-reporting mechanism will report whether they know the particular element or attribute.

For example, an existing implementation would test for
:[starts-with(namespace(), "http://purl.oclc.org/dsdl/schematron-")]
and rename such an element to
{http://purl.oclc.org/dsdl/schematron}*

For example, a Schematron 2 implementation would test for
:[starts-with(namespace(), "http://purl.oclc.org/dsdl/schematron")]
and rename such an element to
{http://purl.oclc.org/dsdl/schematron-2}*

For example, a Schematron 3 implementation would test for
:[starts-with(namespace(), "http://purl.oclc.org/dsdl/schematron")]
and rename such an element to
{http://purl.oclc.org/dsdl/schematron-3}*

b) OR have more complex @match | @select expressions.

For example, an existing implementation that currently looks for sch:pattern etc. would look for
*:pattern[starts-with(namespace(), "http://purl.oclc.org/dsdl/schematron")]

N.b. the Schematron 3 implementation would also have a test that rejects earlier version schemas marked experimental:
/*:schema[namespace()= "http://purl.oclc.org/dsdl/schematron-2-experimental"]
so that such schemas have a limited life in the wild, like the dinosaurs in Jurassic Park.
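As a rough sketch of option (a), not from the proposal itself: a tiny XSLT 1.0 preprocessing pass that an existing implementation could run before compiling a schema, using namespace-uri() for the informal namespace() test above. The target namespace here is the current ISO one; a Schematron 2 or 3 processor would substitute its own.

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Identity copy for everything else. -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- Rename any element whose namespace starts with the Schematron base URI
        into the namespace this implementation understands. -->
   <xsl:template match="*[starts-with(namespace-uri(),
                           'http://purl.oclc.org/dsdl/schematron')]">
      <xsl:element name="{local-name()}"
                   namespace="http://purl.oclc.org/dsdl/schematron">
         <xsl:apply-templates select="@*|node()"/>
      </xsl:element>
   </xsl:template>

</xsl:stylesheet>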

What is the effect of this?

  • The schema has the appropriate version clearly marked up, for humans and implementations.
  • An implementation of the newer version will run existing schemas transparently. The older schema namespace will be converted to the newer one.
  • An implementation of an older version will run new schemas transparently if they only use features supported in the old version. So we can take an existing ISO version 1 schema, convert it up to the new namespace and it will run everywhere, then convert it back to the ISO version 1 namespace and it will still run everywhere.
  • If an implementation is presented with a more recent element from a newer namespace than it supports, it will 1) know that it is a Schematron element, not some foreign element to be passed over, and 2) complain that it does not understand that element and therefore cannot process it.
  • Cut and paste is possible from old namespace schemas to new, without worrying about adjusting the /sch:schema namespace.
  • The library mechanism I propose would be made robust, because versions of the library made with an old namespace would not need updating.
  • Evolving the standard, including adding minor versions, experimental versions and even custom dialects (!), becomes well managed and not a big deal for standards committees.
  • It provides a mechanism by which experimental, draft and pre-standardization features can be tried and added, but labelled appropriately.
  • Schemas that were tried with experimental features would not be converted

Why isn't sch:schema/@Version or something good enough? Once again we can see that XSLT and XSD have led the way in how to get the least bang-per-buck: in this case with namespaces.

So in XSD and XSLT (and most languages), even though elements have their standard namespace, that is not enough to match the appropriate processor. So you cannot merely take an element's name and namespace URI and know what it may contain. In other words, information that the namespace is supposed to represent is missing. So you have to have one "standard" system for identifying the general semantics of the element (the namespace) and then another ad hoc mechanism for actually identifying which schema/features/operations/version is being used.

(N.b. sch:schema/@schemaVersion documents the schema version, not the version of Schematron being used.)

What is the technical reason people have not used namespaces for versions? The reason was that the large corporate (proprietary and open source) infrastructures were not written to cope with versions in namespaces. Originally many people tried, on the expectation that there would be a one-to-one mapping from namespace to schema, but then found that software had been hardcoded for a particular namespace. And the desire to do databinding, to automatically convert between objects and XML, added another barrier of complexity. The problem was that the XML Namespaces specification was underspecified in the area of versions.

Why doesn't that reason apply to Schematron?

Schematron is in a really nice position of having only a few implementations, so it is possible to build this in ahead of when it is needed. E.g. if SchXslt and Oxygen etc. put it in soon, it is harmless, but will allow a graceful transition to future versions.

Helpful for determining how to implement other solutions:
If this is adopted, then it helps us understand how other changes should be implemented: any breaking change in semantics must be explicitly and locally marked up.

For example, a new version can add sch:let/@as because that does not change the interpretation of any other element or attribute.

But if we want to alter sch:pattern to act as a ruleset rather than a switch, we must either add some attribute to the sch:pattern or (better) add a new element sch:rule. But we could not use some top-level attribute such as /sch:schema/@patternsAreActuallyRulesets, nor a command-line parameter, nor a different namespace, because none of these survive simple cutting and pasting, or allow simple inspection.

Setting up the query language environment

The ISO specification allows the XSLT elements function and key to be used in a Schematron schema. This makes sense because both are required to set up the query language environment. The xsl:key element prepares data structures for the fn:key() function and the xsl:function element allows for the use of user-defined functions.

In the same vein Schematron should allow other XSLT elements to be used. Namely:

xsl:include, xsl:import, xsl:use-package

These three instructions are used to load user defined function libraries.

xsl:import-schema

This instruction is used to load type information.

xsl:accumulator

Defines the data structures for the fn:accumulator-before() and fn:accumulator-after() functions.
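For illustration only (not part of the proposal text), a schema using these additions might look like the sketch below under an xslt3 query binding. The library URI and the accumulator are invented for this example, and the generated stylesheet would additionally have to make the accumulator applicable (e.g. via use-accumulators), which is an implementation detail.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            queryBinding="xslt3">

   <!-- Hypothetical user-defined function library, pulled in before the first pattern. -->
   <xsl:include href="corporate-functions.xsl"/>

   <!-- Counts section elements seen so far in document order. -->
   <xsl:accumulator name="sections-so-far" initial-value="0">
      <xsl:accumulator-rule match="section" select="$value + 1"/>
   </xsl:accumulator>

   <sch:pattern>
      <sch:rule context="section">
         <sch:report test="accumulator-before('sections-so-far') > 100"
            >This document has more than 100 sections so far.</sch:report>
      </sch:rule>
   </sch:pattern>

</sch:schema>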

Changes to the text of the 2020 specification:

Add the following sentence to the default query language specification in Annex A:

  • The XSLT1 elements import and include may be used, in the XSLT1 namespace, before the pattern element.

Add the following sentences to the XSLT 2.0 query language specification in Annex H:

  • The XSLT2 elements import and include may be used, in the XSLT2 namespace, before the pattern element.

  • The XSLT2 element import-schema may be used, in the XSLT2 namespace, before the pattern element.

Add the following sentences to the XSLT 3.0 query language specification in Annex J:

  • The XSLT3 elements import, include, and use-package may be used, in the XSLT3 namespace, before the pattern element.

  • The XSLT3 element import-schema may be used, in the XSLT3 namespace, before the pattern element.

  • The XSLT3 element accumulator may be used, in the XSLT3 namespace, before the pattern element.

Support partial ordering of execution of patterns in a phase

Currently, the order of execution of active patterns in a phase is undefined. This potentially allows some optimizations, such as running each pattern in a separate process.

However, there are other situations where it might be useful to have some kind of way to specify an order of execution. The three use cases I have are

  1. so that SVRLs can be generated with prioritized order (order-svrl) - not a biggie for me
  2. so that broad or critical assertions can be tested as soon as possible (fast-fail)
  3. to enable a subsequent mechanism where the properties from previous patterns can be used in assertions of later patterns (pattern-properties). I will mention this later.

A potential mechanism for this is to have a kind of wait-for mechanism (i.e. like fork/join or OMP barriers). Let's say we allow a new attribute sch:phase/sch:active/@do whose values are "next" or "default":

<sch:phase id="example">
<sch:active pattern="p1" />
<sch:active pattern="p2" />
<sch:active pattern="p3" do="next"/>
<sch:active pattern="p4" do="next" />
<sch:active pattern="p5" />
<sch:active pattern="p6" />
</sch:phase>

If no @do="next" occurs, then the operation is exactly that of current Schematron: no required execution order (can be parallel).

If there is an @do="next", then we look at the sch:active elements in source-code order.

In the case above, patterns p1 and p2 are tested, without any order dependency.

When these patterns have been tested, we test p3 (and any immediately following sch:active patterns that do not have @do="next"; there are none in this case, so just p3 is tested).

When p3 has been tested, we test p4, p5, p6, without any required order dependence. (p5 and p6, because they immediately follow p4 without any @do="next".)

How does this meet the requirements?

  1. order-svrl. The simplest implementation of Schematron would naturally get the SVRL partially ordered by this barrier mechanism.

  2. fast fail. If an implementation allowed termination on the first failed assertion, then this mechanism checks the patterns of interest first.

  3. pattern-properties. A subsequent mechanism could allow a pattern to look at the current SVRL output in progress for patterns that are before the previous barrier. This way you can have a Schematron pattern in the same schema that checks the SVRL, instead of having to have two passes. Also, it allows one pattern to be detected in the document, then the presence of that pattern to be used in subsequent patterns: i.e. if there was a complex pattern to determine "This document follows the rules of the 1998 schema, not the 2015 revision", then that information could be used in other assertions. The details of this are really a separate issue, but probably dependent on this change.

In a sense, this enhancement moves Schematron closer to having an internal pipeline, and so reduces the need for e.g. XProc or external logic. However, I think it would be a mistake to have a generic mechanism.

The attribute name @do is generic enough that other extensions could be used: e.g. when-no-previous-failed-asserts or do-when-no-previous-role-is-error could be defined to tie in to failed-assert/@ROLE="ERROR" etc. It would be worthwhile considering this; however, I suggest that the vanilla @do="next" is a first step worth considering.

Pass parameters to the validating stylesheet at validation runtime

The <let> elements end up as <param>s in the compiled stylesheet. It would be convenient if you could pass values for these parameters to the transformer at validation time, and in this way override the default value given in the schema.

A scenario could be that you have to check something against an external source whose URL may change depending on your environment (e.g. staging or production).
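A minimal sketch of the idea (the variable name and URL are invented for illustration): the let in the schema and the top-level parameter it would become in the compiled stylesheet, which the processor could then let the caller override when the validation is run.

<!-- In the schema: a default that should be overridable at validation time. -->
<sch:let name="service-url" value="'https://staging.example.org/check'"/>

<!-- In the compiled validating stylesheet: the let becomes a stylesheet
     parameter, so a runtime value can replace the default. -->
<xsl:param name="service-url" select="'https://staging.example.org/check'"/>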

Allow content inside sch:value-of and sch:name

An important scenario for Schematron is that it should support the SDLC for schemas. For example:

  • Subject matter experts create a simple rich-text document for the schema, with assertion texts in lists, loosely grouped however they see fit, and with explanatory headings and paragraphs.
  • The developer marks this up as Schematron, arranging the texts into patterns and rules, and creating the XPaths etc. as necessary.
  • The developer pretty-prints the schema, and stakeholders review it.
  • Translators make localized versions of the assertion and diagnostic texts, and the developer marks up and integrates these.
  • Testers use a pretty-printed schema to make and perform tests that each assertion text has been implemented correctly (the rug must match the carpet).
  • The schema is used to validate, with messages being sent to the appropriate humans and systems with the sch:name and sch:value-of elements evaluated.
  • The pretty-printed schema can be used for documentation, reference, training, DevOps triage, and sent to third parties for their developers.

Unfortunately, where sch:value-of and sch:name are used, the assertions have a gap, creating the likelihood of nonsense sentences: exactly the opposite of what Schematron promises.

Furthermore, the XPaths in sch:value-of/@select and sch:name/@select have no documentation, presenting an added burden for maintainers who have to figure out what the intent of the XPath was. And if there is a mistake in the @select XPath, such as an unhandled case, it may fail to provide a text value, and so produce the gapped garbage message: this is not prudent for a validator, which needs to assume that the document is a mess and provide fallback behaviour, so that the developer does not need to complicate their @select XPaths to cover a default fallback.

Proposal

sch:value-of and sch:name should allow rich text (text(), span, emph, etc). This text would be the pretty-print/fallback phrase.

<sch:rule context="a | b | c">
  <sch:assert test="x"><sch:name>a, b and c</sch:name> elements should have one or more x elements in them.</sch:assert>
  ...
  • For an existing pretty-printer, developed expecting sch:name and sch:value-of to be empty, it continues to function with no change: i.e. the mangled and incorrect text " elements should have one or more x elements in them" is produced.
  • For an existing pretty-printer, developed to just take the string value of sch:assert, it gets the new text.
  • For a new pretty-printer, it gets the new rich text.
  • For an existing validator, developed expecting sch:name and sch:value-of to be empty, it continues to function with no change.
  • For a new validator, it uses the text as fallback if ever its @select produces nothing.
  • For developers and maintainers, the extra information provides a hint (documentation) for what the Xpath is supposed to return.

I think this is the kind of thing that an existing implementation could provide immediately as a value-add to the standard, because it can be stripped out with a simple XSLT if ISO conformance is needed.

It removes a long-term wart, is limited and reason-about-able, does not change any other element, and would be trivial to implement.

Provide option to deactivate if-then-else processing of `rule`s

Re clause 6.5 of the 2020 standard:

A rule element acts as an if-then-else statement within each pattern. An implementation may make
order non-significant by converting rules context expressions to elaborated rule-context expressions.

My proposal is to move this out of the implementation realm and into the Schematron schema instead, because:

  • it is a source of bugs and confusion for schema authors, being at odds with the behaviour of successive assert/report constraints within a rule
  • clause 5.4.10 says "The child rule elements of a pattern give constraints that are in some way related", which implies a grouping that is meaningful or useful, but the if-then-else behaviour can force the schema author to refactor patterns where they want every rule to have the chance to fire, regardless of whether a preceding sibling's @context has matched. So the utility of grouping related rules under a pattern is degraded by an aspect of the implementation.
  • it seems influenced by specific (XSLT) implementations - i.e. typically implemented as an xsl:choose
  • it would require implementations to support either validation strategy, rather than being left to the implementer to choose.

For backward compatibility, I propose that the current behaviour is retained by default, but markup to turn off the if-then-else processing of rules - e.g. ruleOrderSignificant='false' - is present at the schema and/or phase level. It may be useful at the pattern level too, but personally I think it would be cleaner and clearer left at those higher levels.
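A minimal sketch of what this could look like at the schema level, using the example attribute name above (the attribute is hypothetical; the element names are invented for illustration). With if-then-else processing switched off, a node matching both contexts would be tested by both rules.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2"
            ruleOrderSignificant="false">
   <sch:pattern>
      <sch:rule context="figure">
         <sch:assert test="title">A figure should have a title.</sch:assert>
      </sch:rule>
      <sch:rule context="*[@xml:id]">
         <sch:assert test="not(@xml:id = following::*/@xml:id)">IDs should be unique.</sch:assert>
      </sch:rule>
   </sch:pattern>
</sch:schema>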

Allow name and value-of to be nested in emph, dir, and span

Schematron defines a simple templating language for messages. The normative grammar disallows value-of and name to be nested in a span, emph, or dir. This makes it impossible to emphasize or mark a part of the message with a dynamically calculated value.

The grammar should be changed to allow value-of and name inside span, emph, and dir.
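A minimal sketch of the kind of message this would allow (element names invented for illustration): the dynamically calculated values are themselves emphasized or marked with a span.

<sch:rule context="table">
  <sch:assert test="@rows = count(row)">Table <sch:emph><sch:name/></sch:emph>
    declares <sch:span class="declared"><sch:value-of select="@rows"/></sch:span>
    rows but actually contains <sch:emph><sch:value-of select="count(row)"/></sch:emph>.</sch:assert>
</sch:rule>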

Typed variables

The current specification of Schematron does not provide means to declare the required type of a variable. Users of Schematron are working around this shortcoming by defining variables using the xsl:variable instruction and relying on the underlying processor to copy these variable declarations to the validation stylesheet.

This document proposes the use of a single @as attribute on a Schematron variable declaration to indicate the required type of a variable for query languages that support typed variables.

For the XSLT 2.0 and the XSLT 3.0 query binding this attribute has the same semantics as described in section 9.3 of [XSLT2] and [XSLT3] respectively.

Caveat: Declaring the type of a variable can change the result of an XPath expression. For example, in XSLT 2.0 and XSLT 3.0 a variable declared to be a sequence of elements behaves differently than the same variable without a type declaration. The latter case creates an implicit document node, while the former does not.
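A minimal sketch of the proposed markup (names invented for illustration), under an xslt2 or xslt3 query binding:

<sch:let name="chapters" value="//chapter" as="element(chapter)*"/>

<sch:rule context="toc/entry">
  <sch:assert test="@ref = $chapters/@xml:id">Every TOC entry should point at an existing chapter.</sch:assert>
</sch:rule>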

[XSLT2] : XSL Transformations (XSLT) Version 2.0 (Second Edition), http://www.w3.org/TR/xslt20/
[XSLT3] : XSL Transformations (XSLT) Version 3.0, https://www.w3.org/TR/xslt-30/

Support progressive visibility of SVRL inside a schematron schema

This proposal is to support progressive visibility of SVRL inside a Schematron schema. It relies on a barrier mechanism like the one proposed in the previous suggestion here, #14. (Basically, it groups the active patterns in a phase and guarantees an execution order of these groups, though not of the patterns within a group.)

In Schematron, assertions, rules and patterns do not have any information available from other assertions, rules and patterns. (This allows a wide range of potential implementation strategies, or more to the point does not tie it to how the most popular implementation works.)

However, there is a case where this is painful:

  1. We cannot use the pattern mechanism to determine that some pattern exists, then use that information in the assertions of other patterns. Instead, the mechanism of global variables must be used, which do not have assertion texts or the rule/assert/report discipline; or, for developers in a hurry, they may just make much more complicated XPaths with duplicate tests.

The mechanism I propose does not affect any other operation of Schematron, and only applies in phases with grouped patterns (as in suggestion #14 ).

The mechanism is that every pattern group (e.g. a group started by sch:phase/sch:active[@do="next"]) has an automatic variable supplied with a standard-reserved name, which contains the current SVRL output (i.e. the collated SVRL output of all the patterns in previous pattern groups.)

In effect, they get

  <sch:pattern id="XYZ">
    <sch:let whatever:as="element()" name="PROGRESSIVE-SVRL">
        ...  svrl top-level element goes here with results
    </sch:let>

    <sch:rule context="household">
        <sch:assert
          test="if ($PROGRESSIVE-SVRL//svrl:raised-flag[@id = 'subjectIsACriminal']) then count(child::*) = 6 else true()">
        The household of a criminal has extra requirements: they should have 6 child elements.</sch:assert>
    </sch:rule>
  </sch:pattern>

In other words, this mechanism allows the validation of a document to also see the results of previous validation in the same phase. It does not allow sequencing of phases, or results of one phase to be available to another phase. (Which would be a separate enhancement) For example, if in one pattern you create a property for some subject node, then you can in the next pattern group locate that property value.

EXAMPLE: A use case might be grading a term paper (in a similar fashion to multi-level approaches like map-reduce, mark-sweep, neural-nets, etc). The first pattern group might mark all the questions; the next pattern group might then grade the results A, B, C,D, F etc. A subsequent pattern group (with sch:rule/@context="/") might then figure out if it should be + or -. A subsequent pattern group (with sch:rule/@context="/") might detect patterns in the failed questions: e.g. the student has a gap in their knowledge of "bunchy top disease" and should be directed towards studying that.

There is an interaction with sch:pattern/@document too. When you are in one group, the previous groups' SVRL is available, and those groups could include patterns looked-for in external documents. For example, consider validating a compound document which is a ZIP of multiple XML files, with some main or TOC file that is the head document for validation (e.g. OOXML, etc.): the ZIP is unzipped. You use the sch:pattern/@document to construct the names and paths of the subdocuments and validate them. So when you validate one pattern group, you can see the results of the validation of other pattern groups: when you are validating one subfile of the compound document, you can see the interesting pattern subjects (and their locations) of subdocuments validated/marked in the previous pattern groups.

It may be appropriate to define the automatic variable with a namespace name. (It is perhaps more like a parameter than a variable, but sch:param is already used for abstract patterns so maybe we don't need to overload that name. I don't know.)

(I am not working on an implementation of this or #14 , it is only an idea.)

Behavior of pattern/@documents

The ISO skeleton implementation resolves a pattern/@documents value as relative to the source XML document's location. SchXslt resolves it as relative to the schema location.

The standard leaves this undefined. However, I think the ISO interpretation makes more sense: When a source document contains a relative reference to some other document, this path is usually relative to the source document. Resolving it relative to the schema document (which could be anywhere) does IMHO not make much sense.

It should be specified unambiguously.
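A small illustration of the difference, with an invented layout: the schema lives at /schemas/check.sch, the document being validated at /data/order.xml, and the pattern refers to a companion file.

<sch:pattern documents="'meta.xml'">
  <sch:rule context="/meta">
    <sch:assert test="@approved = 'true'">The companion metadata must be approved.</sch:assert>
  </sch:rule>
</sch:pattern>

<!-- ISO skeleton reading: 'meta.xml' resolves against /data/order.xml, giving /data/meta.xml.
     SchXslt reading: it resolves against /schemas/check.sch, giving /schemas/meta.xml. -->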

Schematron schema issues

  • The current version of the standard does not have the Schematron schemas available for download. You have to copy/paste them from the PDF, which is hardly ideal. Why not offer them as a separate download?
  • The Schematron schemas in Schematron have a bug (the query binding should be xslt2 or higher). This should be resolved.
  • Add an XML Schema version of the RELAX NG schemas. XML Schema is used a lot and in some environments is the only schema language that works.

XPath3 normative reference is undefined

This makes the definition of the (XSLT 3, XPath 3) query language bindings in Annexes J and K problematic.
Contrast with XPath 2, for which both XPath and XPath Functions & Operators are defined normatively.
The XPath 3 reference should be added alongside "XPath Functions" in clause 2.

sch:pattern/@document needs better specification of error handling

Background

The intent of sch:pattern/@document is that it allows a pattern to validate an XML resource retrieved by constructing a URL with information from the primary document.

So this allows a hub or TOC document with links to other documents, such as in a ZIP archive. XML is a web technology, therefore it is good to provide at least minimal support for documents in a web.

For Schematron 2016, there was only one level of indirection allowed*; consequently there can be no problem with transitive closure issues such as loops and self-referencing documents.

Problem

If no XML resource can be retrieved from a URL in sch:pattern/@document what is the result of validation?

Proposed Solution

Schematron description enhanced: If a resource cannot be retrieved for a URL in sch:pattern/@document then it is implementation-defined whether validation of other patterns continues.

SVRL elements augmented to allow elements that flag that there was an error retrieving (or parsing/converting) an external resource.

<svrl:active-pattern ...>
    <svrl:resource-error document="...">

    </svrl:resource-error>
    <svrl:conversion-error>
    </svrl:conversion-error>
</svrl:active-pattern>

Where the contents of svrl:resource-error are an implementation-defined message useful for the human in the context. It could, for example, be the MIME header for the error response. (Even some simple message like "Unable to retrieve external document XXXX.XXXX" would be better than nothing.)

The svrl:conversion-error element is the subject of a separate enhancement proposal. See #48 for details.

  • In theory, we could validate a fixed chain length of documents, reading one document into a top-level variable, then using that to form the URL for another document in another variable, and so on, then using information in the variables to construct the URLs for sch:pattern/@documents. But with XPath 3 we get XPath functions (I think), allowing an unbounded traversal of links.

sch:visit element

I would like to propose a new element for Schematron, intended to allow a schema to declare its required behaviour better, to improve the power of phases and roles, to increase clarity in the schema, and potentially to substantially improve efficiency by reducing unnecessary processing.

The element is optional: /sch:schema/sch:visit, sch:phase/sch:visit, sch:phase/sch:active/sch:visit and sch:pattern/sch:visit. The sch:visit element would be standard for any Schematron, but the attributes used depend on the QLB. The default is suited for any QLB where the document is XML (or viewed as XML).

The sch:visit element declares

  1. What kind of infoset is assumed/required
  2. What type of nodes should be visited (to pattern granularity)
  3. Whether validation should be restricted to some branch (e.g in the current phase)
  4. A priority and declaration for @ROLE values

An example is this:

<sch:schema ... queryBinding="xslt2" >
    <sch:visit 
           elements="yes"
           attributes="no"     
           text="yes"
           comment="no"
           processing-instruction="no"
          infoset="xml entities dtd valid"
           branch="/"
           role-priority="fatal error warn info tip"
   />

This declares that the engine needs to visit and validate elements and text nodes, but not other kinds of nodes. Roughly, if an engine does not support visiting attributes, it should generate a warning at its start when seeing attributes="yes".

This also allows an engine to select a visiting strategy that is optimal for the document. An implementation may override these on the commandline (or a user could edit the file) to switch off or prioritize validating certain items, or limit the start-point of the validation to a certain branch.

Lexical

The effective pseudo-DTD would be

<!ATTLIST sch:visit
            -- yes = true: it is required to visit that node type; no = false: it is required to not visit it;
               "auto" allows detection of the needed node types from the @context XPaths.
               By default, auto inherits from the next higher sch:visit. --
            elements               ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "yes" if auto not implemented --
            attributes             ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto not implemented --
            text                   ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto not implemented --
            comment                ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto not implemented --
            processing-instruction ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto not implemented --

            -- What kind of infoset is required? See later. --
            infoset  NMTOKENS  "xml"
            -- The branch to start validation from. --
            branch  CDATA  "/"
            -- Significant values used in @role (multiple tokens need to be allowed in @role), and their priority. --
            role-priority  CDATA  "fatal BASIC error warning DETAIL info tip"
>

Node Visiting

  • sch:pattern/sch:visit declares what nodes the pattern needs to visit in its document
  • sch:phase/sch:visit declares what nodes need to be looked at in that phase in any document
  • sch:phase/sch:active/sch:visit declares what nodes need to be looked at for that active pattern in that phase
  • sch:schema/sch:visit declares what nodes need to be looked at in the default document.

For the node visiting, a lower-level element restricts a higher-level one, in the priority defaults/implementation-override/schema/phase/active/pattern. In effect, whether a pattern visits a certain kind of node is the AND of all the in-scope sch:visit attributes.

  • If a node type is not specified, visiting defaults to the usual ISO Schematron rules
  • If an implementation decides to restrict application of validation to certain node types, it can do so.
  • If the sch:schema/sch:visit has attributes="no" and comment="yes", then a sch:pattern inherits them by default.

An implementation can override these (limit further).

Alternatives (visiting by node type)

The problem that this solves is that XPath is very complicated to parse, so it is a non-trivial thing to look through each @context to see whether it looks at text and attributes. Now you can get a pretty good hint in some cases (does the XPath contain "processing-instruction(" or "comment("?).

For example, an implementer might decide "if none of the @contexts in rules in active patterns contain 'attribute::' or '@' or 'attribute' then I don't need to visit attributes". But that would result in lots of unnecessary visits; so it would need to be coupled with some simpler parse of the @context XPath. For example, to produce an XPath with all predicates removed, so that the only place for @ or attribute:: was in the location steps.

However, who has produced this simpler parser/stripper?
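As a rough illustration only of what such a stripper might look like (XSLT 2.0, my sketch, not part of the proposal; it would mis-handle bracket characters inside string literals):

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:ex="urn:example:context-analysis">

   <!-- Remove innermost [...] predicates repeatedly until none are left,
        so nested predicates disappear from the inside out. -->
   <xsl:function name="ex:strip-predicates" as="xs:string">
      <xsl:param name="path" as="xs:string"/>
      <xsl:variable name="once" select="replace($path, '\[[^\[\]]*\]', '')"/>
      <xsl:sequence select="if ($once eq $path) then $path
                            else ex:strip-predicates($once)"/>
   </xsl:function>

</xsl:stylesheet>

For example, ex:strip-predicates('item[@price > 10]/@code') gives 'item/@code', so any remaining @ or attribute:: really is in a location step.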

So having an ability to declare explicitly what kind of visiting is required does open up the door for phases (and particular patterns within a phase) to have an optimal visiting strategy.

@ branch

The branch attribute takes subsets of XPath that specify one or more elements: validation starts at (and under) those elements.

  1. a simplified XPath to an element: absolute, wildcarded namespace and wildcardable name, and position predicate (last is optional), which locates a single element node that is the branch to be validated, with no "//" anywhere. E.g. /*:book[1]/*:appendix[3] or /*[1]/purchase-order[5]/item
  2. an absolute descendent search (starts with "//") and a single optional position predicate e.g. //footnote or //chapter[1]
  • The first case limits the scope of the schema to a certain branch of the document only. This can help reduce unnecessary content matching, where the pattern contents are only in some branch.
  • The second form is for a complete traversal of the document, selecting either all the elements with that name, or the nth one.

The XPath subset is simple enough to be trivially parsed in XPath using a tokenizer. To provide more of XPath would be an implementation problem (e.g. for people in the MS ecosystem who still have to use XSLT 1.0). But there are all sorts of possibilities.

In particular, @branch accepts the XPaths found in the svrl:failed-assert/@location attributes, which means that in a loosely-interactive application, the editor can go from an error report to a re-validation fast, or just validate the current element.

A parallelizer could run validation of each branch-start-element in a separate thread.

(At worst, an implementation could implement this by filtering out SVRL elements: that would not have performance benefits for the validation, but could still reduce processing cost or complexity on the user side.)

@ infoset - making feather-dusting practical

The @infoset attribute has a little domain-specific language, just keywords. It specifies what kind of infoset the schema requires. As far as I know, there is no standard method for doing this (even in DSDL, and the schema PI is not powerful enough), which means even as simple a question as "can my Schematron schema assume that DTD or XSD defaults have been put in?" has no way to be specified.

                  " xml",  ( ("xinclude" | "dtd" | "xsd" | "rng" | * ) ("expand" | "type" | "validate" )+)*   |   "psvi"  | (*)*  
  • e.g. "xml" is plain old standalone XML: no XInclude, no entity inclusion, no DTD processing for IDs and defaults, no validation
    
  • e.g. "psvi" is an alias for Post Schema Validation Infoset, and equivalent to "xml xsd type validate".
    

@ infoset can be used to check or advise or fail or even perform, depending on the implementation. The markup says what the assumption of the schema is, either for humans (to know how they need to configure their system) or for implementations (to configure automatically or to check as much as they can). The (*) is for extensibility, but an implementation would fail if there was something it didn't understand here.

An implementation that does not handle something should raise an ERROR.

  • xinclude only takes "expand"

  • dtd takes "expand" (the schema advises that entity dereferencing is needed), "type" (DTD provides default values and IDs), "validate" (fails on invalidity)

  • xsd takes "type" (psvi) and "validate" (fails on invalidity): maybe something else is needed for streaming?

  • rng takes "validate" (validation). I expect we want the other parts of DSDL here too.
    

Example:

 <sch:phase id="validate-chapters">
    <sch:visit  elements="yes" attributes="no"  branch="//chapter" />
    <sch:active pattern="validate-tables"/>
    <sch:active pattern="validate-titles"/>
    <sch:active pattern="validate-text"/>
    <sch:active pattern="validate-figures"/>
 </sch:phase>

 <sch:phase id="validate-appendixes">
    <sch:visit  elements="yes" attributes="no"  branch="//appendix" />
    <sch:active pattern="validate-tables"/>
    <sch:active pattern="validate-titles"/>
    <sch:active pattern="validate-text"/>
    <sch:active pattern="validate-figures"/>
 </sch:phase>

 <sch:phase id="validate-technical-appendix">
    <sch:visit  elements="yes" attributes="yes"  branch="//appendix[1]" />
    <sch:active pattern="validate-technical-attributes"/>
 </sch:phase>

In this example, we have phases to validate chapters only (all of them, only visiting elements), appendixes only (all of them, only visiting elements) and particular attribute constraints of the first appendix.

People seem to grok Schematron as a feather-duster for the places other schema languages cannot reach: i.e. more power, but used in an ancillary fashion to another Schema language. So it seems reasonable that a Schematron schema have a way to specify its infoset, because validation (with DTDs and XSD, not RNG) can change the information in the instance being validated: it is not a parallel process but a serial one.

@ role-priority

This specifies which role values are expected by the schema: this is information for document and schema writers and implementers, and it allows checking that the schema is limited to these tokens.

For example

   role-priority="fatal BASIC error warning DETAIL info tip"

has two sets of roles, one is fatal/error/warning/info/tip and the other is BASIC/DETAIL. The highest priority token in an @ROLE is the one (probably) used for that @ROLE by some application.

Moreover, the priority allows an implementation to adopt different strategies:

  • Divide/select/sort the patterns/rules/asserts in order to fit in with fail-fast behaviour (e.g. fail at the first fatal error): e.g. where the user wants to fail as fast as possible on some fatal error, and not continue validating. This might be roles like FATAL, ERROR etc.
  • Divide/select/sort the patterns/rules/asserts so that the SVRL has some pre-sort. E.g. an implementation might have an operational mode where, when a rule fires, the assertions are tested in priority order, with the assertion testing for each rule stopping at the first failure, but validation continuing. For example we might have
<sch:schema>
   <sch:visit  ...
         role-priority="PREREQUISITE DETAIL" />
  <sch:rule context="table">
    <sch:assert role="warning" test="@cols">A table should have a cols attribute</sch:assert>
    <sch:assert role="PREREQUISITE warning" test="row">A table should have at least one row</sch:assert>
    <sch:assert role="DETAIL error" test="count(row) = count(*)">A table should only have rows</sch:assert>
   </sch:rule>
   ...

In this example, when we find a table element, the rule first checks that there is at least one row (because its @ROLE has "PREREQUISITE").
If there is not, it reports the issue (the role also says it is a warning) and does not test any more assertions on that node.
If there is, then it checks the next-lower priority, which here has one assert that tables should only have rows. If that fails, it reports the failure and does not test any more assertions on that node.

Conclusion

A main criticism of Schematron is that it is too slow: this is an issue that comes up on high-volume servers such as firewalling: sometimes Schematron is used to extract the requirements, prototype and debug a filter successfully, and is then replaced by e.g. a SAX-based program. That is fair enough.

However, there are many implementation strategies that could be provided to users that allow fast-fail, parallelization, reduced latency, or more targeted or sorted output. But many of these require a little extra information about the schema: information which belongs in the schema, as it adheres to particular phases, patterns, rules and assertions.

So I think the trick is how to leverage (as they used to say, to our snooty disdain) existing structures (such as phases and roles) to provide markup that implementers can readily support (i.e. which, by and large are optional to implement or have some trivial fallback.)

For example, @branch can be implemented by various methods: 1) failing with "not implemented" if it is found; 2) using it to skip visiting certain nodes; 3) visiting everywhere but not testing those nodes; 4) filtering the incoming SVRL so that nodes outside the branch are not added to the report; or 5) filtering the report after it is generated.

Similarly, @role-priority could be completely ignored by an implementation. Or it could just be used to validate the values of @ROLE in the same schema. Or it could be used to automatically prioritize testing assertions or patterns. Or it could be used in combination with a "skip tests in rule after first assertion fail", "skip testing pattern on other nodes after first pattern with failed assert" or fail-fast ("skip testing any more nodes after first failure") strategy. Or to test only the highest-priority tests.

These choices are for implementers to implement, but the priority of roles (and the infoset needed, and what kind of visiting is needed) is a schema concern (and the developer can decide whether and how to make use of it.) So all the Schematron schema needs to do is to provide suitable declarations that make explicit that hard-to-derive metadata about the schema.

Enhance SVRL to support flags

There is a gap in SVRL in that while we do want our svrl:failed-assert to signal that a flag was raised by it, there is no top-level place to show flags. Putting this in a separate page.

I think the best approach is to add to SVRL something like the <svrl:flag-raised> element below, with wording that the usual XPath rule of absence=false() applies.

WHAT IS A FLAG?

The original intent of @flag was that a perfect version of SVRL would have something like:

<svrl:schematron-report>
  <svrl:flag-raised flag="invalid"/>
  <svrl:flag-raised flag="has-severe-error"/>
  <svrl:flag-raised flag="uses-version-27-features"/>
  <svrl:flag-raised flag="not-dispatchable"/>
  ...

A flag is a property of the whole document, and of the whole validation session. It is a label.

It is not a property of a single node or a single failed assertion. It is a summary that if some assertion has failed, or some report has succeeded, we can label the document (the validation result). The flag function is something like
distinct-values( //(svrl:failed-assert | svrl:successful-report)/@flag )
to give a distinct set of raised flags for the document.

The reason that individual svrl:failed-asserts have their own flag is

  1. to allow convenient location of them,
  2. to allow the flag function to be run on the SVRL output, and
  3. I didn't want to put in an extra external pass to summarize the SVRL flags (though IIRC I put in notes mentioning that to extract the flag status you need to look for any SVRL item having that flag on it).

Now this is why a flag was originally defined as not being a boolean but as something that existed or not.

But I guess that since the standard now treats it as boolean, you would be having something like:

<svrl:schematron-report>
  <svrl:flags-summary>
    <svrl:flag name="invalid" value="true"/>
    <svrl:flag name="has-severe-error" value="true"/>
    <svrl:flag name="uses-version-27-features" value="true"/>
    <svrl:flag name="not-dispatchable" value="true"/>
    <svrl:flag name="urgent-issue" value="false"/>
    ...
  </svrl:flags-summary>
</svrl:schematron-report>

Aligning the definitions for title and p elements (or clarifying the difference?)

The (not often used) <title> and <p> Schematron elements have different models for no apparent reason.

  • The <p> element follows the usual rules for mixed text in Schematron. It allows foreign attributes/elements and <dir>, <span> and <emph>.
  • The <title> element only allows <dir> and does not allow foreign attributes/elements.

I suggest we align both definitions and allow the same kinds of mixed contents in both of them. Best probably to define <title> the same as <p>.

Flag attribute enhancements

A flag attribute is now typed as a simple xs:string. But it is supposed to be the name of a variable...

So wouldn't it be better to type it as xs:NCName? Or some other restrictive type?

Even better: I see no reason why a triggered assert, report or rule couldn't raise multiple flags... So I would propose to type the flag attribute as a list of its base type.
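A minimal sketch of what a list-valued flag attribute would allow (flag names taken from the SVRL example in the flags proposal above; the test is invented for illustration):

<sch:report test="@security = 'secret' and not(@handling)"
            flag="invalid not-dispatchable">Secret material without handling instructions cannot be dispatched.</sch:report>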

Define whether a rule <extends href="..."> must be abstract

The specification leaves open whether a rule included using a <extends href="..."/> must be abstract or not. In both processors (ISO and Schxslt) it does not matter what you do on the included rule.

The specification hints that this included rule should not be abstract, but it says so with some detours. Let's make this clear.

BTW: I don't care whether or not it must be abstract. Maybe we can even keep the status quo (both situations are allowed). This is an enhancement proposal just to make this explicit.

Change default query language binding

The default query language binding is xslt, meaning that you're limited to XPath 1.0. That is severely limiting nowadays.

Why not keep the default query language binding in step with the current state of technology? My proposal would be to make xslt3 the default, but xpath31 would also be a good candidate.

Allow query language expressions in element content

This was mentioned during the users meetup at XML Prague 2022, so am capturing here for further discussion.

IIRC the rationale was that it might prove awkward for some query languages to be expressed in attribute values. So the proposal here is to allow a single (presumably first) child element to optionally contain this instead.
To be clear, the proposal is not to replace the current markup, but to give the option to use elements instead of attributes.
See also #22 (comment).

Example:

<rule context='foo'>
<assert test='@bar'>Element <name/> must have a bar attribute.</assert>
</rule>
<rule>
  <context>foo</context>
  <assert><test>@bar</test>Element <name/> must have a bar attribute.</assert>
</rule>

This would mean that if a schema is pre-processed (e.g. via a transform), attribute value normalization would not of course be applied to expressions appearing in element content, which in turn may be of benefit to users to whom the formatting of expressions matters.

It might be argued that such users could author using the idiom in the second example above (or their own variant) and generate a valid Schematron schema from that -- but I think it would be useful to standardize this and be able to do so natively, and it may also smooth the path for other query language bindings for which the current syntax poses difficulties.

Specify the query language binding in use when embedding Schematron (@sch:queryBinding)

Rationale: We need a convenient way to specify the query language binding in use when embedding Schematron rules or patterns in a host language such as XML Schema or TEI ODD. Schematron has the queryBinding attribute on the schema element. We can reuse it but put it in the Schematron namespace to avoid polluting the host language.

Add

5.4.16 sch:queryBinding attribute

When embedding Schematron, the sch:queryBinding attribute may be used on the outermost element of the host document to declare the query binding in use.
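
For instance, embedding in a W3C XML Schema might then look like this (a sketch; the invoice content model is invented for illustration):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sch="http://purl.oclc.org/dsdl/schematron"
           sch:queryBinding="xslt2">
   <xs:element name="invoice">
      <xs:annotation>
         <xs:appinfo>
            <sch:pattern>
               <sch:rule context="invoice">
                  <sch:assert test="@total = sum(line/@amount)"
                   >The total must equal the sum of the line amounts.</sch:assert>
               </sch:rule>
            </sch:pattern>
         </xs:appinfo>
      </xs:annotation>
   </xs:element>
</xs:schema>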

Define parameters for abstract patterns

There is no way to declare which parameters an abstract pattern expects. Therefore the processor cannot check that an abstract pattern invocation is correct.

I know that these parameters are macro/text-substituted and are not "real" parameters in the XSLT sense, but nonetheless it would be nice if we could define them, both for checking and code documentation purposes.
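
A hypothetical declaration syntax (the sch:abstract-param element is invented purely for illustration and is not proposed markup) might let a processor report missing or unknown parameters at invocation time:

<sch:pattern abstract="true" id="max-length">
   <!-- hypothetical declarations: parameter name and whether a value is required -->
   <sch:abstract-param name="element" required="true"/>
   <sch:abstract-param name="limit" required="true"/>
   <sch:rule context="$element">
      <sch:assert test="string-length(.) &lt;= $limit"
       >The element must not be longer than $limit characters.</sch:assert>
   </sch:rule>
</sch:pattern>

A processor could then flag an instantiating pattern that omits "limit" or supplies an undeclared parameter name, without changing the macro-substitution semantics.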

Proposal for variation selector - suitable for i18n etc.

Basic Variants, using i18n as example

Two new elements, /sch:schema/sch:variant and sch:variant/sch:enable, are introduced. A variant is a test (like an assertion test) that can be used to enable (or strip out) any element. Like sch:param, you provide a value for it as a parameter to the schema:

<sch:variant name="LANGUAGE"  of="sch:diagnostics"  default=" 'en' " >
   <sch:p>The language of diagnostics may be selected by specifying a parameter "LANGUAGE" with
   the language code, such as "en" or "jp". All other diagnostics are removed. If there are no diagnostics
   with the appropriate language, then a default with no language is used. If no "LANGUAGE" is specified
   then no diagnostics are enabled.</sch:p>

   <sch:variable name="diagnostic-elements" select="/sch:schema/sch:diagnostics" />

   <sch:variable name="diagnostic-element-candidates" select="$diagnostic-elements[@xml:lang = $LANGUAGE]" />

   <sch:enable test="if ($diagnostic-element-candidates)
                                   then ($diagnostic-element-candidates = . )
                                   else ($diagnostic-elements[not(@xml:lang)] = .)" />
</sch:variant>

Our document also imports various files with sch:diagnostics for each language, such as

<sch:diagnostics id="d1" xml:lang="en-AU">
   <sch:diagnostic>What's that, Skip? Helicopter crash near Cabbage Tree Creek?</sch:diagnostic>
    ...
</sch:diagnostics>

So our sch:variant element

  1. is called "LANGUAGE" and defines a kind of schema parameter that is supplied externally and scoped to the sch:variant element.
  2. applies to every sch:diagnostics node that is found
  3. has a variable that collects all diagnostics in the schema (because, in this case, we want to default if there are none)
  4. has another variable that collects the diagnostics for our particular language
  5. has an enable element with a test, evaluated at each sch:diagnostics element, to enable or disable (strip) that element.

This variant selector allows very simple selection. The details can be discussed. Note that in this case, @id is not used at all.

Cleaner files

A two-stage declaration is less distracting: the main schema only has a top-level

   <sch:import href="internationalized-diagnostics-list.sch"/>

This imported file in turn has

 <sch:schema ... > 
       <sch:variant name="LANGUAGE"  of="sch:diagnostics"  >
             ...  <!-- unchanged from above -->
       </sch:variant>

      <sch:import href="diagnostics-en.sch" />
      <sch:import href="diagnostics-jp.sch" />
      <sch:import href="diagnostics-de.sch" />
      <sch:import href="diagnostics-cs.sch" />

</sch:schema>

So the operation here is just inclusion, with scoped variant selection. An implementation method would be for the sch:import processor to perform the enabling/disabling, so that it only returns the enabled diagnostics up. Or it could mark the sch:diagnostics with some implementation-specific attribute to disable them. There is no need for this to be a costly feature (e.g. calculated each time every assertion is checked.)

Smart imports

We would really prefer not to import files we don't need at all, without even parsing them. But we can use the variant mechanism to do so: in our internationalized-diagnostics-list.sch file we change our variant to:

<sch:schema ...>
    <sch:variant name="LANGUAGE"  of="sch:import" >
        <sch:enable test="ends-with(@href, concat('-', $LANGUAGE, '.sch'))" />
   </sch:variant>
    
      <sch:import href="diagnostics-en.sch" />
      <sch:import href="diagnostics-jp.sch" />
      <sch:import href="diagnostics-de.sch" />
      <sch:import href="diagnostics-cs.sch" />
</sch:schema>

Other scenarios:

Severity-level selection

This mechanism can be used to, with a parameter, enable or disable any part of the schema. For example, we could have

   <sch:param name="SEVERITY" select="'#ALL'" />

     <sch:variant name="SEVERITY" default=" '#ALL' "  of="sch:assert | sch:report[@role]" >
        <sch:enable test="$SEVERITY = ''
               or $SEVERITY = '#ALL'
               or ($SEVERITY = 'error'   and @role = 'error')
               or ($SEVERITY = 'warning' and (@role = 'error' or @role = 'warning'))
               or ($SEVERITY = 'info'    and (@role = 'error' or @role = 'warning' or @role = 'info'))" />
     </sch:variant>

This allows the caller to control which level of severity is tested and reported, using a schema parameter (e.g. a command line parameter or invocation parameter). In this case, an sch:assert with no @role will be disabled if SEVERITY="error", but an sch:report with no @role will not be affected by this variant.

If we wanted to have rule-level control we could have

 <sch:param name="SEVERITY" select="'#ALL'" />

     <sch:variant name="severity-level-of-assertions"  of="sch:assert | sch:report" >
        <sch:enable test="$SEVERITY = ''
               or $SEVERITY = '#ALL'
               or ($SEVERITY = 'error'   and (@role = 'error' or parent::sch:rule/@role = 'error'))
               or ($SEVERITY = 'warning' and (@role = 'error' or @role = 'warning'
                                              or parent::sch:rule[@role = 'error' or @role = 'warning']))
               or ($SEVERITY = 'info'    and (@role = 'error' or @role = 'warning' or @role = 'info'
                                              or parent::sch:rule[@role = 'error' or @role = 'warning' or @role = 'info']))" />
     </sch:variant>

Fallback to database implementation

I have seen XPaths in assertions or variables which drop into Java to do database connections. But we could make the same schema allow both Java and ODBC, or whatever.

<sch:schema ...> 

     <sch:variant name="USE-JAVA"  of="sch:let[contains(@select, 'java:')]" >
        <sch:enable test="$USE-JAVA = 'yes' " />
     </sch:variant>

     <sch:variant name="USE-ODBC"  of="sch:let[contains(@select, 'odbc:')]" >
        <sch:enable test="$USE-ODBC= 'yes' " />
     </sch:variant>

     <sch:let id="database_access" select="java:blahblah" />
     <sch:let id="database_access" select="odbc:blahblah" />

Design rationale

We don't want to mark targets, as it would add a lot of markup if we had to annotate each assertion, for example. But we can use any markup already in the schema: ids, roles, flags, text in XPaths, and so on. (Because foreign attributes are allowed in Schematron, if we do want our variant to enable/disable elements that do not have any convenient markup, someone could add a foreign attribute for that purpose.)

This, like macros, is a general feature that can perform lots of tasks. It can be implemented in a pipeline, at inclusion time, at compile time, or at runtime. Indeed, it may be that some mix of these is optimal.

Include time versus run-time parameters

As with sch:param, it may be that we need some other control mechanism to determine at which stage of a pipeline the parameter is needed. For example, for database access, the variant is known at deployment time, and so can be stripped out of the code. But the severity level might be varied at runtime without recompilation, so it may be better served by wrapping the assertion in a conditional statement. An implementation might provide an invocation parameter that identifies any variants that are to be resolved at run-time.

This is not information that, I think, belongs in a schema.

Connection to phase

A phase can select a variant.

   <sch:phase id="start">
      <sch:active pattern="p1">...</sch:active>
      <sch:active variant="SEVERITY" value="warning" />
   </sch:phase>

The phase would override the command-line.

Conceptual: is a variant a conceptual object or a practical one?

A variant is at least a practical object, such as selecting the database provision, or selecting the diagnostics by language, or the severity level.

However, it could also be conceptual. First, because it ties into phases, which certainly can be conceptual. But also because it allows a different construction of the schema based on its characteristics.

Let's take Akoma Ntoso, the schema for national laws, as an example, or (probably) HL7. A Schematron schema can be made for it, but it is designed to be subsetted. So you may keep the kitchen-sink XSD and use the Schematron just for the particular regional dialect, or even for each particular sub-use (legislation, regulation, treaty, etc.). You may combine some of these into phases so that the phase indicates which one.

What variants allow is a different way of ruling things in or out of larger Schematron schema. For example, you might declare a schema-level variant for "legislation" that enables assertions for legislation metadata (or, rather, which disables rules for regulation metadata) and one for "regulation" that does the reverse.

     <sch:variant name="LEGISLATION"  of="sch:rule[contains(normalize-space(@context), 'metadata/legislation-number')]" >
        <sch:enable test="$LEGISLATION = 'yes' " />
     </sch:variant>

     <sch:variant name="REGULATION"  of="sch:rule[contains(normalize-space(@context), 'metadata/regulation-number')]" >
        <sch:enable test="$REGULATION = 'yes' " />
     </sch:variant>

Now it might be that this is perfectly well handled without variants by, say, having

<sch:rule context="metadata/legislation-number[ $I-AM-LEGISLATION-PARAM = 'yes' ]">

where the rule is explicitly enabled. But to do this at the individual assertion level means adding a lot of boilerplate, and likelihood of error. So I don't think it is feasible.

Possibility: Variants that are detected in the document

Optionally, we could even make variants be based on the incoming document.

<sch:variant name="XHTML" of="sch:phase">
     <sch:enable document-test="/*[contains(namespace-uri(), 'xhtml')] and @id = 'xhtml-constraints'" />
</sch:variant>

This introduces an alternative attribute @document-test. If the incoming document uses the xhtml namespace, then the phase "xhtml-constraints" is enabled.

I can see three implementation methods, among several.

  1. Dynamic. Before looking at the incoming parameter for selecting the phase, the implementation tries each
sch:variant[contains(@of, 'sch:phase')][sch:enable/@document-test]

and, if none match, then uses the supplied one. The generated code for every pattern has a conditional that allows run-time selection of the phase etc.

  2. Static. The document is first read in, and the test applied. Then that becomes information to be used to include and compile the Schematron schema into executable code.

  3. Semi-static. The document inclusions are performed statically. At runtime, the phase variants are tested, then that is used to trim down the schema into executable code which is then run.

Define a textual format for Schematron

Is it time for a textual format for Schematron?

XML people have no problem with pointy brackets, but much of the rest of the world doesn't feel the same way about them. One of the advantages of Schematron is that people can write their own error messages, but why should they have to use XML to do so?

A textual format for Schematron could make Schematron more straightforwardly usable by more people.


jsontron (https://amer-ali.github.io/jsontron/) is a Schematron-like textual format in JSON. A seemingly simple but probably not very good solution would be to adopt that but use XPath assertions.

I have hopes that XML-in-KDL (XiK) (https://github.com/kdl-org/kdl/blob/main/XML-IN-KDL.md) could be a useful textual representation for some XML applications, but I've run out of time right now to try it out.

If an EBNF for a custom syntax can be developed, then it's likely that a parser for it could be generated. There are also precedents of syntaxes that are defined by textual descriptions of what to do at every possible token.

Clarify handling of localization

ISO Schematron provides localization with the @xml:lang attribute. From the specification it is not clear whether a conformant processor is required to perform language fixup (https://www.w3.org/TR/xinclude/#language) when

  • incorporating external definitions via sch:include or sch:extends
  • instantiating abstract rules
  • instantiating abstract patterns
  • instantiating diagnostics
  • instantiating properties

This is also related to #7 -- i.e. an svrl:failed-assert is supposed to inherit the language property from the sch:assert element.

Define the production rules for SVRL

There is a schema for SVRL but there are no "production rules". What I mean is that it is not defined that when a pattern becomes active an <svrl:active-pattern .../> is produced, that when a rule fires an <svrl:fired-rule .../> is produced, and so on.

Of course it's obvious, but a specification should IMHO not leave things to being obvious. It should define the exact rules for producing an SVRL document based on a Schematron validation run.
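
For example, the production rules could state exactly which SVRL elements a given schema fragment produces for a given instance, and in which order. A sketch of the intended correspondence (attribute details vary between implementations):

<!-- Schematron -->
<sch:pattern id="p1">
   <sch:rule context="chapter">
      <sch:assert test="title">A chapter must have a title.</sch:assert>
   </sch:rule>
</sch:pattern>

<!-- SVRL for an instance with two chapters, the second lacking a title -->
<svrl:active-pattern id="p1"/>
<svrl:fired-rule context="chapter"/>
<svrl:fired-rule context="chapter"/>
<svrl:failed-assert test="title" location="/book[1]/chapter[2]">
   <svrl:text>A chapter must have a title.</svrl:text>
</svrl:failed-assert>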

Define overrides for several attributes

Some attributes can be defined on both a parent element and its children. Examples found are role, subject, fpi, icon and see. There may be more.

However logical it may seem, the standard does not define that a value on a child overrides a value on the parent. I think this should be made explicit.
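
For example, it should be stated whether the assertion below is reported with role "warning" (from the child) or "error" (inherited from the parent rule); a minimal sketch of the situation:

<sch:rule context="price" role="error">
   <sch:assert test="@currency" role="warning">A price should state its currency.</sch:assert>
</sch:rule>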

Base URI fixup

Schematron allows for the inclusion of external definitions via sch:include (5.4.4) and sch:extends (5.4.3). The Schematron specification does not discuss how this inclusion affects relative URI references in the inserted content.

A processor implementing this proposal performs base URI fixup as defined in section 4.5.5 of [XINCLUDE].

[XINCLUDE] XML Inclusions (XInclude) Version 1.0 (Second Edition), https://www.w3.org/TR/xinclude/
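
A sketch of the intended behaviour, assuming xml:base fixup as in XInclude (the file names and URLs are illustrative):

<!-- main.sch, retrieved from http://example.org/schemas/main.sch -->
<sch:include href="lib/rules.sch"/>

<!-- after inclusion, the inserted element carries its original base URI,
     so relative references inside it still resolve against lib/ -->
<sch:pattern id="included-pattern" xml:base="http://example.org/schemas/lib/rules.sch">
   ...
</sch:pattern>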

Fix incompatibilities between Schematron and SVRL

There are several incompatibilities between Schematron and SVRL. For instance:

  • What should be the value of schematron-output/@title? There is nothing in the Schematron spec that fits (not even intuitively).
  • The same for active-pattern/@name and active-pattern/@role.
  • In SVRL there is a dir/@role attribute. But not in Schematron
  • The same for emph/@class
  • Some other problems are the data types of some attributes in the SVRL. For instance assert/@id in Schematron is of (correct) type xs:ID. But in SVRL the corresponding failed-assert/@id is suddenly an xs:NCName

If we're going to keep SVRL we at least must make sure it is compatible with the Schematron language.

A library system for abstract patterns and abstract rule-sets

This is related to the #9 issue about abstract rules.

The current system of extends and import does not really allow libraries of abstract rules to be created. This is because an abstract rule may require a pattern variable, or even a schema variable. (I did not consider it a real problem, for the initial use case of providing simple macros to simplify repeated code.)

Also, a rule in a library may have

Suggestion: only extend abstract rules that are either in the same pattern or are located in a rule library.

A library is an external document (or included under /sch:schema) containing abstract patterns and new abstract rule-sets which are not tested. They must be brought into the schema e.g. using sch:extends.

<sch:library id="xxx">
   <sch:title>A library!</sch:title>

   <sch:rule-set abstract="true" >
        <sch:let .../>
        <sch:rule abstract="true" id="ar1" ... >
        ...
        <sch:properties ...>
        <sch:diagnostics ...>
   </sch:rule-set>

   <sch:pattern abstract="true" ....>
     ...
   </sch:pattern>

</sch:library>

The differences are:

  1. abstract patterns and abstract rule-sets can have their own diagnostics and properties
  2. abstract rule-sets can have variables (sch:let) which are injected into their parent rule (error if name clash) when used
  3. libraries are part of the schema construction phase, not visible to the schema execution phase.

In the spirit of minimalism, we resolve a reference to an abstract rule by looking for the simple id only, first in the current pattern and then in any sch:library/sch:rule-set/sch:rule. This is a library system, not a module system where the modules are part of an addressable path: no "namespaces"! (If one rule is taken from an abstract rule-set, all its variable declarations are in scope (i.e. injected) in the host pattern also. Similarly, its properties and diagnostics are in scope.)

Change abstract pattern parameter mechanism

This proposal overrides issues #8 and #11. These issues are no longer relevant if this proposal gets implemented.

This proposal tries to deal with several problems:

  1. Abstract pattern parameters behave as (are) macro-expanded text strings. That deviates from the way XSLT (and other programming languages) work with parameters.
  2. There are many Schematron users who also write XSLT. However, the syntax for using parameters deviates in a confusing way from how XSLT does it. Schematron uses <sch:param .../> to set a parameter. XSLT uses <xsl:param .../> to define a parameter and <xsl:with-param .../> to set it.

Proposal:

  1. Make abstract pattern parameters XPath variables
  2. Use the XSLT element names: <sch:param .../> to define a parameter and <sch:with-param .../> to set it.
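
Under this proposal, an abstract pattern and its instantiation might read as follows (a sketch; sch:with-param is the proposed element, and today's Schematron would instead use sch:param with @value on the instantiating pattern):

<sch:pattern abstract="true" id="max-title-length">
   <sch:param name="limit"/>
   <sch:rule context="title">
      <sch:assert test="string-length(.) &lt;= $limit">A title must not be too long.</sch:assert>
   </sch:rule>
</sch:pattern>

<sch:pattern is-a="max-title-length" id="short-titles">
   <sch:with-param name="limit" select="40"/>
</sch:pattern>

Because $limit is now a real XPath variable rather than substituted text, the processor can check that it is supplied, and the test remains valid XPath even before instantiation.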

Define the variables that are in-scope when a property is expanded

In the specification it is not defined what variables are in-scope when a property is expanded. So what variables is a <value-of> in a property allowed to reference?

I would expect only the variables that are lexically in scope, meaning, for properties, only global variables defined before the property. Or should the variables defined in the pattern/rule that fired and caused the property to be expanded also be visible?
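
For example, the standard should say whether the second value-of below is allowed (a sketch; $threshold is global, $local is declared in the firing rule):

<sch:let name="threshold" value="100"/>

<sch:pattern>
   <sch:rule context="order">
      <sch:let name="local" value="sum(item/@price)"/>
      <sch:assert test="$local &lt;= $threshold" properties="p-total">Order total is too high.</sch:assert>
   </sch:rule>
</sch:pattern>

<sch:properties>
   <sch:property id="p-total">
      <sch:value-of select="$threshold"/>  <!-- clearly lexically in scope -->
      <sch:value-of select="$local"/>      <!-- in scope only if rule variables carry over -->
   </sch:property>
</sch:properties>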

Support chained runs of phases, plus with progressive visibility of prior SVRL

Currently, to move from one phase to another requires external logic. A simple mechanism inside Schematron could help this.

This proposal accompanies #14 and #15 but is independent of them.

1) Chained running of phases

The proposal is to add to sch:phase some attributes which nominate a phase to be run after this one has finished. The SVRL results of the current and prior phases are available in a global variable for that next phase.

An attribute sch:phase/@and is added, which is an XPath expression returning a string (the phase id), or an empty sequence, false(), the empty string, etc.

 <sch:phase id="p1"  and=" 'p2' "> ....</sch:phase>

 <sch:phase id="p2">...</sch:phase>

Once all the patterns in p1 are attempted, then the phase with id 'p2' is attempted.

This allows an orderly tree of phases to be run. Because the attribute is an XPath, you can use if...then...else to have multiple branches, based on information in the main document, global variables, input params etc.

Note that no dependency is added as to the order in which phases and patterns etc. should be run: merely that when one phase is performed, other phases will also be active. This could be implemented by, e.g., first running all the patterns in one phase, then running all the patterns in the next phase that were not in the first phase; or by finding the closure of all phases to be activated, then running all those in an undefined order (including simultaneous evaluation).

Consequently, if two phases have the same active pattern then that pattern is only activated once. It is an error if two phases that activate the same pattern have variables with the same name, because only one pattern is activated (this is stricter than strictly necessary, but is readily doable).

@and would only name a single phase, not multiple, for simplicity. (Now it could return a list of phase ids, I suppose: but then it gets away from the idea of a phase being a notionally discrete state. I am not convinced that there is a need to go beyond the Hidden Markov Model or finite-state-machine-type constraint that each state "transition" only requires information on that state to make a single transition. I think it would complicate some implementations: if not the code, at least the thought needed to implement.)

2) Chained running of phases with order

This is an extension that requires 1). It uses the same markup as #14. It lets you run a sequence of phases, directed by the markup.

A phase can have a sch:phase/@do. This is a barrier. It means that all patterns nominated in prior phases must have completed (logically, not necessarily temporally, if there is some "lazy" or JIT implementation involved).

If a previous phase has already activated a pattern, there is no need to reevaluate it. The same error as in 1) about phase variables with the same name applies, I expect.

<sch:phase id="p1" and=" 'p2' "> ....</sch:phase>

 <sch:phase id="p2"  do="next">...</sch:phase>

The simplest way to implement this is that if there is any sch:phase with @do='next', the implementation only evaluates any @and attributes after the phase it belongs to has completed, then runs that phase regardless of any @do on that next sch:phase.

A better way to implement it is that the @and is evaluated at the start of the phase; then, if the phase has no @do, its patterns can be merged, but if it does then they are trimmed and queued.

For example, if your input may be of several different dialects of your XML which require different treatment, your entry phase may be empty and select a specific phase for each different dialect:

<sch:phase id="start"  and="if (/xbrl:xbrl) then 'input-as-xbrl' else 'input-as-html'" />
<sch:phase id="input-as-html">....</sch:phase>
<sch:phase id="input-as-xbrl">....</sch:phase>

This draws out the potential of phases to support multi-revision schema lifecycles without external logic.

3 ) Chained phases with progressive visibility of SVRL

This is an extension of 2) and uses the same markup or concept as #15, but applied to phases.

So this is the case where @do is used, so there is a barrier. In this case, we make an automatic variable available, e.g. SVRL_PROGRESSIVE or whatever (similar to #15), which provides the cumulative SVRL of the previous phases.

This has two uses:

A) The asserts, reports, variables, rules etc. of patterns in the chained phases after the barrier can see the cumulative SVRL results of the previous phases. So you can use one phase to mark the next.

B) The cumulative SVRL result (i.e. that automatic variable) is also available in the @and XPath, so you can decide which next phase to run based on the validation results of the current and previous phases.
So, for example,
<sch:phase id="p1"
    and="if ($SVRL_PROGRESSIVE//svrl:flag-raised[@flag='DISASTER']) then 'disaster-phase' else 'cool-phase' ">
  <sch:active ... />
</sch:phase>

<sch:phase id="disaster-phase">....</sch:phase>

<sch:phase id="cool-phase">...</sch:phase>

If there is no sch:phase/@do, then there is no logical constraint on the evaluation of phases, and therefore it would be an error for any XPath to contain the string $SVRL_PROGRESSIVE if there is no sch:phase/@do='next' (or sch:pattern/@do='next' from #14, for that matter).

Option. Speed: extend the internal SVRL locations with a generate-id() value for faster lookup.

The above suggestions allow external annotations of nodes using the SVRL. One use case might be where an assertion wants to test a node based on properties found for that node in the progressive SVRL made by logically previous phases and patterns. But this seems difficult, because it needs to do a complete search and match of the SVRL, perhaps by generating a canonical XPath and matching it against all the location XPaths in the SVRL.

Now a workaround would be to use the XPath locations in the SVRL as keys, but it is still work.

So what would be better is if every SVRL entry also had an attribute with the generate-id() value. Because we are not changing the document itself, this would allow faster lookup. (It would also be useful for performance if we could make a key into the SVRL using these values, for fast lookup, as long as this was generated lazily: so when we are at a node and use generate-id() as a key, the hash table is only populated with the SVRL keys up to that id, etc.)

This id should be stripped from the SVRL that goes out, as generate-id() is not stable between runs.

It may be that the hash table should also be provided as an automatic variable (I guess, an automatic key?) so that it does not get re-calculated too often. Anyway, I think there is scope there for something in this case...
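
A sketch of how generated XSLT might exploit this, assuming a hypothetical @gid attribute on each internal SVRL entry holding the generate-id() of its context node, and the svrl prefix bound as usual:

<!-- declared once in the generated stylesheet -->
<xsl:key name="svrl-by-gid"
         match="svrl:failed-assert | svrl:successful-report"
         use="@gid"/>

<!-- inside an assertion of a later phase: the prior results for the current node -->
<xsl:variable name="prior-results"
              select="key('svrl-by-gid', generate-id(.), $SVRL_PROGRESSIVE)"/>

The three-argument form of key() (XSLT 2.0 and later) lets the lookup run against the progressive SVRL document rather than the instance being validated.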

(I am not working on an implementation of this idea.)


N.B. On QuickFix: there might also be some nice interaction with XML QuickFix, for example using one phase to check that the fixes of a previous phase went far enough, and fixing the fixes if not. (Though maybe some smart looping to repeatedly run the same validation/fix combo over a document until no more fix actions are performed might be something too: I don't know that Schematron needs to be extended for this, though; it would be done by the runner.)

Use different syntax for abstract pattern parameters?

This will break compatibility so it's probably a no-go, but maybe it should be discussed?

Parameters in abstract patterns use the same syntax as variable references: $name. IMHO that's an unfortunate choice, rather confusing. Is there something we can do about it?

Support invisible XML

Background

Invisible XML is a simple system for a deterministic context-free transducer (specified with a non-deterministic context-free attribute grammar) that is worth supporting.

IXML can be considered useful both in itself and as a good example of a class of processing.

Scenarios

Obviously a non-XML document converted into XML using an iXML grammar can be validated with Schematron. And a Schematron engine could have its own method to detect a non-XML document and run the conversion, presenting the result to the Schematron validation.

However, there are three other scenarios.

  1. We want to be able to validate a non-XML document directly, and we want the grammar to be used to be part of the Schematron schema, either inline or by name.

  2. We want an sch:pattern/@document reference that retrieves a non-XML resource to convert the document to XML.

  3. We want to be able to take some node value (such as an attribute's value), convert it to XML and have that XML available in a variable,

3a. We want to take that variable and validate patterns in it.

Also, SVRL needs to be adjusted to cope.

Proposal

SVRL

As an initial minimal approach that leaves as much flexibility for implementers as possible, I propose to augment SVRL with svrl:active-pattern/svrl:conversion-failure, which is a container element that can contain any message from the parser. (As with URL retrieval failures, we are rather at the mercy of the library and implementation for the quality and user-targeting of the error message.)

See #47 for info.

Schematron

1) Main document from ixml

Schematron is augmented by a top-level element sch:schema/sch:conversion which registers a converter name for a MIME type or extension. This can be inline or by reference.

  <sch:conversion id="somename"  mime-type="text/*"  convert-as="ixml" >
      ... iXML grammar here
  </sch:conversion>

or

   <sch:conversion id="somename" mime-type="text/*" convert-as="ixml"  href="URL or file relative to schema" />

As well as, or in addition to, @mime-type we allow @filename to match the filename by regex, e.g. *.ixml. Perhaps we can, for UNIXy reasons, allow @magic to look at the initial bytes of the file.

The sch:schema element is augmented by an attribute @use-conversion which provides the conversion to use, e.g.

   <sch:schema ... use-conversion="USB-address" />

2) Pattern on external document from ixml

The sch:pattern element is augmented by an attribute @use-conversion which provides the conversion to use.

   <sch:pattern ...  document=" 'http://eg.co/po1.txt' "  use-conversion="PurchaseOrder" />

@use-conversion can only be used if @document is present. If the retrieved resource is of MIME type */*xml then no conversion is performed (and an implementation-determined warning is generated).

If there are two patterns with the same URL and conversion, the document should be re-used not re-retrieved.

3) Parse node value into variable

To read some text from a node into a variable and convert it, the sch:let element is augmented by an attribute @use-conversion which provides the conversion to use.

<sch:let name="xxxx" value=" @thing "  use-conversion="thing-parser"   />   

3a) Validate variables using patterns

However, there is no obvious mechanism to make patterns validate a variable's value. That is a more general facility that would be a separate proposal, probably only needed if this proposal is accepted.

Examples for 3) Parse node value into variable

There are numerous examples of complex data formats used for attributes and data content: URLs, and even CSV. There are many cases where it is not practical or desirable to represent the atomic components of some complex data using elements: because of verbosity, for example, or because there is an industry-standard idiom or notation that is what is being marked up.

Currently, Schematron fails in its core task of finding patterns in documents, whenever the document contains these complex field values.

ISO 8601

Our document is a large book catalog, where each book has a date using ISO 8601. This is not the subset used by XSD, but the full ISO 8601 date format. So, we have an element like

<book 
    author="Erasmus" 
    author-life-span="%1466-%10/1536-07-23"  
    author-active-date="%1499-X/1536-07-?23"  
    creation-date="1523-X/?1524-X"  
    publication-date="2022-01-01"   > ...</book>

(For ISO8601, the % means approximate, the ? means uncertain, the X is a wildcard, the / is a date range; it allows omitting the day. Things like timezones etc not shown.)

We want to validate that the author-active-date range fits in the author-life-span range, that the creation date fits in the author-active range, and that the publication-date is later than the creation-date. We have a converter for complete ISO8601 date to XML (whether this is iXML or some regex converter is not material) so we can have the complex expressions sitting as nice sets of XDM nodes.

<sch:schema   ... >
      <sch:conversion id="ISO8601"  convert-as="ixml"  href="notations/ISO8601-date.txt"  />
   ...
  <sch:pattern>
      <sch:rule context="book">
         <sch:let name="author-life-span-as-XML"    select="@author-life-span"    use-conversion="ISO8601" />
         <sch:let name="author-active-date-as-XML"  select="@author-active-date"  use-conversion="ISO8601" />
         <sch:let name="creation-date-as-XML"       select="@creation-date"       use-conversion="ISO8601" />
         <sch:let name="publication-date-as-XML"    select="@publication-date"    use-conversion="ISO8601" />

         <sch:assert test="(number($author-active-date-as-XML/date/range/from/year)
                                    >=   number($author-life-span-as-XML/date/range/from/year))
                             and  (number($author-active-date-as-XML/date/range/to/year)
                                    &lt;= number($author-life-span-as-XML/date/range/to/year)) "
          >The author-active-date range should fit in the author-life-span range</sch:assert>

         <sch:assert test="(number($creation-date-as-XML/date/range/from/year)
                                    >=   number($author-active-date-as-XML/date/range/from/year))
                             and  (number($creation-date-as-XML/date/range/to/year)
                                    &lt;= number($author-active-date-as-XML/date/range/to/year))"
          >The creation date should fit in the author-active range</sch:assert>

         <sch:assert test="number($publication-date-as-XML/date/range/from/year)
                                   > number($creation-date-as-XML/date/range/from/year)"
          >The publication-date should be later than the creation-date.</sch:assert>
      </sch:rule>
  </sch:pattern>

And we can go on making the tests better, without having to worry about how to parse the data.

Example: XPaths

For Schematron itself, we have many XPaths. Schematron validation has been held back because validators do not check the XPaths.

The Schematron schema for Schematron could invoke the converter for the XPaths and do various kinds of validation. For example, here we check that we are not using XSLT 3 XPath novelties when the query language binding advertises the schema as only requiring XSLT 1 or XSLT 2.

<sch:schema   ... >
      <sch:conversion id="XPath"  convert-as="ixml"  href="notations/Xpath3-1.txt"  />
      ...
       <sch:pattern id="XSLT1-exclusions">
           <sch:rule context="sch:assert[/sch:schema[@queryBinding='xslt' or @queryBinding='xslt2']]">
                   <sch:let name="test-as-XML"    select="@test" use-conversion="XPath" />
                   <sch:report test="$test-as-XML//token[@value='function']"
                    >XSLT1 and XSLT2 do not allow function definitions in XPaths</sch:report>
           </sch:rule>
       </sch:pattern>

Example: Land Points

A mapping system specifies areas of land by surfaces bounded by some number of points, where the points have a northerly, easterly, and elevation value.

These are specified in a whitespace separated list: N0 E0 H0 N1 E1 H1 ... Nn En Hn

<LandXML ...>
  ...
  <Surfaces ...>
    <Surface>
      <Definition ...>
        <Pnts>
          <P id="XYZ">30 10 20 40 80 110  40 85 6 32 12 24</P>
          <P .../>
        </Pnts>
        <Faces>
          <F .../>
        </Faces>
      </Definition>
      ...
    </Surface>
    ...
  </Surfaces>
  ...
</LandXML>

We want to make sure that none of the points in the polygon overlap. We want to do this by exposing the data as tuples, rather than hiding it behind some complex function.

Method: again, we define an iXML grammar that converts the P element into a variable as

<points> 
 <point N="30" E="10" H="20" />
 <point N="40" E="80" H="110" />
 <point N="40" E="85" H="6" />
 <point N="32" E="12" H="24" />
</points>

which is very explicit for validation.

(I note that in fact using Schematron to validate geometry is a real application: the intersection of flight routes over Europe being the example I was informed of.)

Validate styles from CSS stylesheet

We are validating an XHTML document. It has a linked CSS stylesheet. We want to confirm that the CSS has selectors for all the stylenames used in the XHTML.

So we have a CSS parser in iXML (or whatever). So we read the document in (as a string: if XPath does not support this, a standard function should be made, presumably).

<sch:schema   ... >
      <sch:conversion id="CSS"  convert-as="ixml"  href="notations/CSS.txt"  />
      <sch:let  id="Stylesheet-uri"
                     value="/html/head/link[@type='text/css'][1]/@href" />

      <sch:let  id="Stylesheet-as-XML"
                     value="extension:download-as-text( $Stylesheet-uri )" use-conversion="CSS" />

      <sch:pattern>
         <sch:rule context="*[@class]">
                   ... do the validation here
So we have our CSS file as a top-level variable, as XML. The Schematron rules then handle looking up in that data.

(Of course, wild CSS has other issues: included stylesheets and so on. Being able to parse a stylesheet means that such things can start to be addressed, rather than us being stymied at the start.)

Example 2) Pattern on external document from ixml

Most of the Schematron projects I have been involved in over the years have involved AB testing: either testing that the information that was in the input document is also in the transformed document mutatis mutandis, or that when a document is converted then round-tripped back, it has the equivalent information as far as can be.

Database migration validation

Recently, I had a variation on this AB testing. A large complex organization web-publishes large complex XML dumps of its databases, produced by a large complex pipeline. They had lost confidence with passage of years and rust and moth, and decided that prudence dictated they make smaller chunks of data available using JSON and CSV (as well as an XML).

However, for a particular reason, they did not have access to the code that produced the big XML. So they wanted to cross-check their new JSON/CSV API against the XML data dumps. For a particular reason, they were not interested in backward compatibility (for all the data in the XML, does it match the JSON/CSV API?) but in forward compatibility (for all the data in the new JSON/CSV API, does it match the XML?).

With the current proposal, this could be handled in Schematron like this:

<sch:schema ... >

   <!-- Specify the kind of conversion and the script -->
   <sch:conversion id="CSV" mime-type="text/*" convert-as="ixml"  href="notations/CSV-converter.txt" />

   <!-- Give the primary XML document a name, so it can be accessed in patterns over external documents -->
   <sch:let name="xmlDocument" value="/*" as="element()"/>

   <!-- This pattern reads in the external CSV document, converting it to XML, then validates it -->
   <sch:pattern ...
          document=" 'https://eg.farm.gov.xx/datamart/yokel-list?characteristic=slack-jawed' "
          use-conversion="CSV" >

           <sch:rule context="/CSV/row">

                   <sch:assert test=" $xmlDocument//yokel/@hog-count = cell[1]"
                    >The value of the first cell of each row should be the same as the
                    yokel's hog-count in the XML</sch:assert>
           </sch:rule>
   </sch:pattern>
</sch:schema>

Remove deprecated query language bindings

There are several query language bindings defined that can be considered deprecated:

  • stx: The STX initiative never came beyond the “working draft” stage. Its last update was in April 2007. Its ideas found their way into XSLT 3.0.
  • exslt: EXSLT can be considered outdated (the last change is from 2003). Most if not all of its proposed extensions have found their way into the newer XPath versions.

My proposal would be to remove these bindings from the standard.

Define all query bindings

Several query bindings in the Schematron spec are reserved but undefined. That makes a very sloppy impression.

Scope limitation: increase Schematron efficiency, reduce SVRL noise, integrate better, enhance modeling of phases to cross-cut by region and role

Problem

Using Schematron to make limited assertions on large documents can involve unnecessary traversal of the document.
E.g. to assert that the top-level element is the correct kind may involve iterating over the whole document.

The current workaround in this case tends to be to use termination, which is a bit brutish.

Furthermore, SVRL reports can contain excessive and otiose reports that can swamp the user, and be regarded as noise and a source of inefficiency.

Furthermore, the only cross-cutting mechanism is the phase, which cross-cuts by pattern. There is no way to cross-cut by document region or by assertion role (e.g. severity.)

Outcome Scenarios

We have a document with a zillion elements. We have a large and complex Schematron schema that takes a lot of processing, and we have efficiency, latency, congestion and timeout constraints. We really want to just check the metadata elements but we cannot use the phase mechanism for it because of bureaucratic reasons. The ideal solution would be to only validate the metadata and not traverse the rest of the document looking for contexts.

We have the same scenario. But we are only interested in testing assertions about severe errors in the metadata. The ideal solution would let us bypass testing any assertion that has no @role="ERROR".

We have the same scenario. But we want to reduce the noise in the SVRL to humans. So we want to prioritize testing so that we reject documents with serious errors first, and only test other assertions after testing those, and if there are no severe errors. The ideal solution would allow some kind of re-arrangement of validation into two passes: one which tests the assertions with @role="ERROR" and another which tests other assertions.

Proposed Solution

Schema-level scoping

Add an attribute sch:schema/@scope which limits the nodes that the Schematron schema needs to look at in the main document, apart from the document node.

  • It is intended as a practical parameter, constraining the area of interest in a document, not a modeling feature. In other words, even if some node gets excluded from validation, there is no implication that the schema rules do not also apply to that node: merely that we are not interested in looking or knowing at the moment.
  • It is like a parameter, in that it can be overridden on invocation, if the implementation supports it.

The value has the following syntax:
( "prioritize" ws+ )? ( "from" | "to" | "only" ) ws+ ( role-clause ws+ )? location
where ws is whitespace and location is an absolute XPath pattern in the QLB.

For example:

<sch:schema ... scope="to /*/*/*" ..>

will limit validation to only the document node and the first three levels of nodes. E.g. /law/part/section/clause will not be validated.

  • "from " uses an initial absolute XPath, which is where validation starts from.
    -- E.g. from /book/appendix means do not validate all /book nor /book/node()[not(self::appendix)] nor /book/node()[not(self::appendix)]//node()
  • "to " uses an initial absolute XPath, which is used to select the nodes which will be validated
    -- e.g. to /book/appendix means do validate all / and /book and /book/node()[not(self::appendix)] and /book/node()[not(self::appendix)]//node()
  • "only " supplies an XPath, and only elements that match those are validated.
    -- e.g. only /criminal/metadata/* meaning validate the document root and the children of metadata.
    -- e.g., only //html:* means only elements in the HTML namespace.
  • In all cases, the document node is validated.
  • The default is "from /" meaning all nodes in the document including the document node.
  • The scope does not apply to sch:patterns[@document]. No provision is made for dynamic override of them.

Prioritize

The priority is a hint to perform the in-scope validations before other validations, not instead of. It uses the optional keyword "prioritize". So prioritize from is a hint to validate the nodes at and under some path first, prioritize to is to validate the nodes from the top until the path first (e.g. more like a top-down breadth-first traversal), and prioritize only means to validate (the document node and) the specified nodes of that XPath first, then the others.

Where and Until

The role-clause is ("until" | "where") ws+ ("role" | "flag") ws* "=" ws* "{" ws* (role ws+)* "}" and is a hinting mechanism that ties into the @role and @flag attributes. (Below, when @role is used, @flag is implied as well.)

"where" is a hint to only test assertions where there is an in-scope @role attribute with a matching token. "In-scope" means @role on the sch:assert, the linked-to sch:diagnostics, the linked-to sch:property, the parent sch:rule, the parent sch:pattern (and should be the current sch:phase and sch:active too).

For example

<sch:schema ... scope="from where role={fatal error}  /regulations/regulation[@jurisdiction='AU']" ...>

means that at this time we are only interested in severe errors in Australian regulations. So we only look at and under Australian regulations, and, as a hint, the implementation needn't test assertions whose in-scope roles do not match.

"Until" says that once an assertion has failed (or a report has succeeded) which has an in-scope @role attribute with a token matching a token on the list, the implementation can opt out of processing more. This needs to be implementation-dependent, so as not to create a burden for implementers. But the action could be that if an assertion with the role fails,

  • we don't test any more assertions on that node,
  • or we don't traverse to its children,
  • or we don't test any more of that pattern,
  • or, we just terminate.

For example

<sch:schema ... scope="from until role={fatal error}  /regulations/regulation[@jurisdiction='AU']" ...>

says to validate the document node plus Australian regulations (the element and its descendants), but provides the hint that as soon as an assertion with a matching in-scope role fails, the implementation may stop processing early.

So the point of "when" and "using" is to reduce noise in the SVRL, and do so in a way that enhances the @role markup.
kind of

Phase-level scoping

Also, add the same attribute to sch:phase and sch:active. It limits the scope of the patterns in the phase, in addition to any scope specified on the schema. If the pattern activated uses @document, this scoping applies to that pattern. It allows phases to "cross-cut" based on region and role.

Pattern level scoping

Also, add the same attribute to sch:pattern. It limits the scope of contexts in the pattern, in addition to any sch:schema/@scope, sch:phase/@scope or sch:active/@scope. If sch:pattern/@document is specified, it limits the scope of the pattern in that document.

Alternatives Considered

I developed experimental parsers (using PEG and REx) for parsing XPaths, to allow an implementation to determine what kind of nodes it needed to look at, and potentially to know whether there were other limits that could be known from static analysis. It is possible, but a lot of code.

A schema implementation could certainly provide this as a parameter when running a schema validation.

Other Benefits

Furthermore, the feature of allowing it to be stated in the sch:schema element makes things more explicit and easier to implement. Furthermore, it would be a useful general feature for users to be able to select the scope of elements.

Furthermore, it could provide a way to enhance phases.

Schematron engines that provide this would be better targeted for integration into IDEs: the IDE could limit interactive validation to the current node by passing the relevant "only" XPath, for example.

Implementation Considerations

The "from " case is trivially implemented. E.g. in the skeleton code, for each mode, it would involve first validating the document node only, then finding all the nodes that match the scope, then priming the validation with those.

The "to " case can be implemented, e.g. in the skeleton implementation, by creating a variable with all the nodes that match the scope XPath, then, for each mode, first validating the document node only, then validating the nodes that match the scope.

We are not particularly concerned about efficiency in cases like only //html:*, because the aim is to provide efficiency in cases where we don't want to have to process the entire document and there is an obvious fast way to get to the information needed.

The where role = { x y z } clause can be ignored if the implementer desires. It is a hint. It could be faked up by post-processing the generated SVRL to remove all failed-asserts etc. whose in-scope roles do not match: this would reduce noise if not efficiency.

The until role = { x y z } clause can be ignored if the implementer desires. It is a hint. It could be faked up by post-processing the SVRL to remove all following elements after the first successful-report or failed-assert with a matching token in its @role. The best would be to terminate gracefully without testing further or looking at more contexts.
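
A minimal sketch of such a post-processing fake-up for the "where" hint, assuming the generated SVRL carries @role on its failed-assert and successful-report elements and that each carries a single role token ($wanted-roles would be supplied by the caller):

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svrl="http://purl.oclc.org/dsdl/svrl">

   <xsl:param name="wanted-roles" select="('fatal', 'error')"/>

   <!-- identity: copy everything by default -->
   <xsl:template match="@* | node()">
      <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
   </xsl:template>

   <!-- keep only the reports whose @role matches a wanted token;
        entries with no matching @role (including no @role at all) are dropped -->
   <xsl:template match="svrl:failed-assert | svrl:successful-report">
      <xsl:if test="@role = $wanted-roles">
         <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
      </xsl:if>
   </xsl:template>

</xsl:stylesheet>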

An implementation may decide how much of the scope mechanism can be overridden. For example, the implementer may decide that the command-line/invocation parameter only supports "only" with no role testing, and only when sch:schema/@scope is the default or uses "only" (i.e. it is just a matter of swapping the XPath string, not generating different code). Or the implementation may decide that it only supports certain overrides of sch:schema/@scope as a compile-time option, not a run-time option.

In other words, an implementation

  • must parse all sch:*/@scope and implement from, to and only as language features
  • is free to support the role hint as much as is convenient and useful
  • is free to implement overriding of sch:schema/@scope at runtime or compile time as much as is convenient
  • is free to implement prioritize or not

prioritize would be handled by two passes. This is not inefficient, as the desired outcome is to get show-stopper assertions tested ahead of other assertions. There could be some extra inefficiency if pattern and rule variables need to be re-calculated in both passes. (However, this can be coded around.)

Note that if an sch:rule has a role attribute that does not match, or it contains no asserts or reports (or their diagnostics or properties) that match, it does not mean that the rule context is not applicable: the pattern does not change depending on the scope attributes; all that happens is that some nodes will not be tested to see if they match any context in a pattern, and some assertions may not be tested. This is a matter of what is interesting to the invoker, not a matter of modeling. The scope is not a way to switch rules on or off.

There is an exception to this: consecutive rules at the end of a pattern that have an @role that does not have a matching token, or which have no assertions with matching roles, have no effect, and therefore can be switched off. E.g.

<sch:schema ... scope="from where role={PET}">
...
<sch:pattern>
   <sch:rule context="dog" role="PET"  id="r1">
   ...
   </sch:rule>
   <sch:rule context="lion" role="WILD" id="r2">
       <sch:assert test="@exit='go'">Linus and his friends must go</sch:assert>
    </sch:rule>
    <sch:rule context="*" id="r3">
        <sch:report test="true()">Unknown animals are not regarded as wild or pets</sch:report>
    </sch:rule>
</sch:pattern>

In this case, rule r3 can be switched off as it has no children with a role of PET. And the previous one, now the last, r2, can also be switched off as its role is not PET and it has no reachable children with a role of PET.

Fix errors in Schematron for SVRL

Schematron/schema#6 proposes several fixes to the Schematron for SVRL:

  • 'second-level' rule checked for sibling, not parent
  • 'svrl:schematron-output' rule checked that the current element did not exist
  • 'svrl:schematron-output' rule failed to count a child element type
  • 'svrl:fired-rule' rule omitted namespace when checking for a preceding sibling element

If the fixes are correct, they should be included in the next version of Schematron (or, as an exercise for the editor, implemented as changes in the current Schematron written for the xslt binding).
