fasterxml / woodstox

The gold standard Stax XML API implementation. Now at Github.

License: Apache License 2.0


woodstox's Introduction

Overview

The gold standard Stax XML "pull" API (javax.xml.stream) implementation.

Since version 4.0, Woodstox also implements SAX API for event-based XML processing.

Most if not all popular Java XML web service frameworks use either Stax or SAX API for XML processing: this means that Woodstox can be used with the most popular Java frameworks.

In general, usage follows the standard Stax or SAX API.
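As a rough illustration of what "standard Stax usage" looks like, here is a minimal sketch that uses only the javax.xml.stream API from the JDK. The class and method names (StaxPullSketch, elementNames) are made up for this example. Nothing in it is Woodstox-specific: XMLInputFactory.newFactory() simply returns whichever Stax implementation is discoverable at runtime.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxPullSketch {

    /** Returns the local names of all elements, in document order. */
    static List<String> elementNames(String xml) throws XMLStreamException {
        // Standard factory lookup; no implementation class is referenced directly.
        XMLInputFactory factory = XMLInputFactory.newFactory();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            // "Pull" style: the caller asks for the next event when it is ready.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(elementNames("<order><item>pen</item><item>ink</item></order>"));
    }
}
```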


Get it!

Maven

The most common way to get Woodstox is via Maven (or Ivy) from the Maven Central repository. The coordinates are:

Group id: com.fasterxml.woodstox
Artifact id: woodstox-core
Latest published version: 6.6.1 (2024-02-26)

Note that the Maven id has changed since Woodstox 4.x, but the API is still compatible. The major version upgrades were nominal, made only because of the package coordinate changes.

Requirements

Woodstox 5 and above require Java 6 (JDK 1.6), as well as the Stax API that is included in the JDK. The only other mandatory dependency is the Stax2 API, an extended API implemented by Woodstox and some other Stax implementations (like Aalto).

An optional dependency is Multi-Schema Validator (MSV), which is needed if using XML Schema or RelaxNG validation functionality.

License

Woodstox 4.x and above are licensed under the Apache License 2.0.

Documentation etc

Configuration

Most configuration is handled using the standard Stax mechanism, property access via

XMLInputFactory.setProperty(propertyName, value) for configuring XML reading aspects
XMLOutputFactory.setProperty(propertyName, value) for configuring XML writing aspects
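The two setProperty calls above can be sketched as follows. This example sets only standard Stax 1.x properties (IS_COALESCING and IS_REPAIRING_NAMESPACES), so it runs against any Stax implementation. The class name FactoryConfigSketch and its helper methods are assumptions made for illustration. Woodstox-specific property names are not shown here.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;

final class FactoryConfigSketch {

    /** Input factory that merges adjacent text segments into one event. */
    static XMLInputFactory coalescingInputFactory() {
        XMLInputFactory in = XMLInputFactory.newFactory();
        in.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return in;
    }

    /** Output factory that adds missing namespace declarations on write. */
    static XMLOutputFactory repairingOutputFactory() {
        XMLOutputFactory out = XMLOutputFactory.newFactory();
        out.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        return out;
    }

    public static void main(String[] args) {
        // Reading the values back confirms that the factory accepted them.
        System.out.println(coalescingInputFactory().getProperty(XMLInputFactory.IS_COALESCING));
        System.out.println(repairingOutputFactory().getProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES));
    }
}
```

Configuring the factory once and reusing it is the usual pattern, since readers and writers created afterwards inherit the factory's settings.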

Names of properties available, including standard Stax 1.x ones, are documented in a series of blog posts:

Stax 1.x standard configuration properties
Stax2 extension configuration properties
Woodstox-specific configuration properties

Support

Community support

Woodstox is supported by the community via the mailing list: woodstox-user

Enterprise support

Available as part of the Tidelift Subscription.

The maintainers of woodstox and thousands of other packages are working with Tidelift to deliver commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use. Learn more.

Contributing

For simple bug reports, fixes, and feature requests, please use the project's Issue Tracker. The exception is security-related issues, for which we recommend filing a Tidelift security contact (NOTE: you do NOT have to be a subscriber to do this).

Other

Check out the project Wiki for javadocs.


woodstox's Issues

Incorrect validation error(s) only using Stax2 writer

Hello,

I've been running into some seriously bizarre behavior with validation in Woodstox. I might be missing something dead-simple, but otherwise I'm inclined to think it's a bug.

The issue is that, taking a given XML document that conforms to a given XSD schema, we get validation exceptions when using XMLStreamWriter2 (after calling validateAgainst on the writer), while no exceptions are thrown (as expected) from XMLStreamReader2 with the same document and the same schema.

Here is a small example to reproduce, which reads an XML file into an XMLStreamReader2 and then copies it without changes to an XMLStreamWriter2:

import com.ctc.wstx.stax.WstxInputFactory;
import com.ctc.wstx.stax.WstxOutputFactory;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.XMLStreamWriter2;
import org.codehaus.stax2.validation.XMLValidationSchema;
import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

import javax.xml.stream.XMLStreamException;
import java.io.InputStream;
import java.io.StringWriter;

public class Converter {

    public static void main(String... args) throws XMLStreamException {

        InputStream reader = Converter.class.getClassLoader().getResourceAsStream("test.xml");
        StringWriter writer = new StringWriter();

        XMLValidationSchema schema = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA)
                .createSchema(Converter.class.getClassLoader().getResourceAsStream("schema.xsd"));


        XMLStreamReader2 xmlReader = (XMLStreamReader2) new WstxInputFactory().createXMLStreamReader(reader);
        xmlReader.validateAgainst(schema);

        XMLStreamWriter2 xmlWriter = (XMLStreamWriter2) new WstxOutputFactory().createXMLStreamWriter(writer);
        xmlWriter.validateAgainst(schema);

        xmlWriter.copyEventFromReader(xmlReader, false);

        while (xmlReader.hasNext()) {
            xmlReader.next();

            xmlWriter.copyEventFromReader(xmlReader, true);
        }

        System.out.println(writer.toString());
    }
}

Here is the XML:

<?xml version="1.0" encoding="UTF-8"?>
<JobStatus xsdVersion="NA">
    <Document>
        <DocumentId>1234567890</DocumentId>
    </Document>
    <Document>
        <DocumentId>1234567891</DocumentId>
    </Document>
</JobStatus>

and the schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="JobStatus">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Document" maxOccurs="unbounded">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="DocumentId" type="xs:string"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
            <xs:attribute name="xsdVersion" type="xs:string" use="required"/>
        </xs:complexType>
    </xs:element>
</xs:schema>

Running the example code with those two files results in the following exception being thrown:

Exception in thread "main" com.ctc.wstx.exc.WstxValidationException: element "JobStatus" is missing "xsdVersion" attribute
        at [row,col {unknown-source}]: [1,66]
        at com.ctc.wstx.exc.WstxValidationException.create(WstxValidationException.java:50)
        at com.ctc.wstx.sw.BaseStreamWriter.reportProblem(BaseStreamWriter.java:1223)
        at com.ctc.wstx.msv.GenericMsvValidator.reportError(GenericMsvValidator.java:549)
        at com.ctc.wstx.msv.GenericMsvValidator.reportError(GenericMsvValidator.java:541)
        at com.ctc.wstx.msv.GenericMsvValidator.reportError(GenericMsvValidator.java:535)
        at com.ctc.wstx.msv.GenericMsvValidator.validateElementAndAttributes(GenericMsvValidator.java:343)
        at com.ctc.wstx.sw.BaseNsStreamWriter.closeStartElement(BaseNsStreamWriter.java:420)
        at com.ctc.wstx.sw.BaseStreamWriter.copyEventFromReader(BaseStreamWriter.java:807)
        at Converter.main(Converter.java:34)

However, the code executes fine if one removes xmlWriter.validateAgainst(schema); and the result XML is identical (modulo white space changes) to the input file, and still conforms to the XSD.

Just for information, I've reproduced this behavior (validation successful on reader, but fails on writer) with different pairs of XML and XSD, in which case you can get different types of error messages at different positions in the document.

So I would say the questions here are:

Why does validation fail when the document obviously conforms to the schema (and when the output of the writer conforms to the schema, if one lets it run by removing the validation)?
How can validation of the same document succeed when reading but fail when writing?

Versions:
tested with Java 7 & 8 using the following dependencies:

org.codehaus.woodstox stax2-api 4.0.0
com.fasterxml.woodstox woodstox-core 5.0.2
net.java.dev.msv msv-core 2013.6.1
net.java.dev.msv xsdlib 2013.6.1

webservices-*2.1.1.jar when used with woodstox-core-asl-4.4.1.jar conflicts due to class com.ctc.wstx.sr.ValidatingStreamReader

I have found that when webservices-*2.1.1.jar is used with woodstox-core-asl-4.4.1.jar there is a conflict due to the com.ctc.wstx.sr.ValidatingStreamReader class. This happens because XMLStreamReaderFactory.create in StreamSOAPCodec returns ValidatingStreamReader, which is present in both jars.

Exception type does not correspond

woodstox/src/main/java/com/ctc/wstx/stax/WstxInputFactory.java

Line 643 in bae0330

throw new IllegalArgumentException("Null InputStream is not a valid argument");

The function declares that it throws XMLStreamException, but here it throws IllegalArgumentException. Callers that only catch XMLStreamException will not catch it.
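Until the exception type is settled, a caller can guard against this on its own side. The sketch below is only a caller-side workaround, not a proposed change to WstxInputFactory. The helper openReader and the class NullInputGuardSketch are hypothetical names. The helper checks for null first, so the only failure a caller sees is the declared XMLStreamException.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class NullInputGuardSketch {

    // Hypothetical helper, not part of Woodstox or the Stax API.
    static XMLStreamReader openReader(XMLInputFactory factory, InputStream in)
            throws XMLStreamException {
        if (in == null) {
            // Report the problem with the checked exception callers already handle.
            throw new XMLStreamException("Null InputStream is not a valid argument");
        }
        return factory.createXMLStreamReader(in);
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        try {
            openReader(factory, null);
        } catch (XMLStreamException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```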

Inconsistent reading of formatted xml when validating schema

When a reader is set to validate against a schema, the whitespace between tags is reported as a SPACE event, while the same read without validation reports it as a CHARACTERS event.

Adding this test to TestW3CSchema shows the issue:

  public void testFormattedXml() throws XMLStreamException {
      // Another sample schema, from
      String SCHEMA = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>\n"
          + "<xs:element name='item'>\n"
          + " <xs:complexType>\n"
          + "  <xs:sequence>\n"
          + "   <xs:element name='quantity' type='xs:positiveInteger'/>"
          + "   <xs:element name='price' type='xs:decimal'/>"
          + "  </xs:sequence>"
          + " </xs:complexType>"
          + "</xs:element>"
          + "</xs:schema>";
      
      XMLValidationSchema schema = parseW3CSchema(SCHEMA);
      
      // First, valid doc:
      String XML = "<item>\n    <quantity>3</quantity>\n    <price>4.05</price>\n</item>";
      XMLStreamReader2 sr = getReader(XML);
      sr.validateAgainst(schema);
      
      assertEquals(XMLStreamConstants.START_ELEMENT, sr.next());
      assertEquals(XMLStreamConstants.CHARACTERS, sr.next());
      assertEquals("\n    ", sr.getText());
      assertEquals(XMLStreamConstants.START_ELEMENT, sr.next());
    }

This test fails as is, but if the line sr.validateAgainst(schema); is removed, the test runs fine.

I think the correct behaviour for this case would be to return SPACE.
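Whichever event type the reader settles on, calling code can be written so that it does not care. The sketch below uses only the standard Stax API and treats both SPACE events and all-whitespace CHARACTERS events as ignorable. The names WhitespaceEventSketch and significantText are made up for this example. It does not reproduce the validation setup from the test above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class WhitespaceEventSketch {

    /** Collects text content, skipping whitespace however it is reported. */
    static List<String> significantText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Coalescing keeps each text run in a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int type = reader.next();
                if (type == XMLStreamConstants.SPACE) {
                    continue; // whitespace reported as SPACE
                }
                if (type == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                    texts.add(reader.getText()); // whitespace-only CHARACTERS is skipped
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<item>\n    <quantity>3</quantity>\n    <price>4.05</price>\n</item>";
        System.out.println(significantText(xml));
    }
}
```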

Allow configuration of DTD entity resolver for DTDSchemaFactory

When parsing XML documents with embedded DTD subsets and/or references, it is possible to configure an XMLResolver for resolving general parsed entities, or for entities relevant only to DTD subsets. There is currently no way to do the same when reading DTDs externally.

It would make sense to allow similar configurability via DTDSchemaFactory as well.

NoSuchMethodError after update to Woodstox 5.1.0

Woodstox 5.1.0 introduces JAR conflicts due to linkage with newer StAX library. It seems that 5.1.0 cannot be used together with libraries that are dependent on stax2-api 3.1.4 (e.g. Apache Olingo).

This behaviour seems to violate the semantic versioning rules.

java.lang.NoSuchMethodError: org.codehaus.stax2.ri.EmptyIterator.getInstance()Ljava/util/Iterator;
	at com.ctc.wstx.util.DataUtil.emptyIterator(DataUtil.java:46)
	at com.ctc.wstx.evt.SimpleStartElement.getAttributes(SimpleStartElement.java:111)
	at org.odftoolkit.odfdom.pkg.rdfa.RDFaParser.getAttributeByName(RDFaParser.java:414)
	at org.odftoolkit.odfdom.pkg.rdfa.RDFaParser.parse(RDFaParser.java:190)
	at org.odftoolkit.odfdom.pkg.rdfa.RDFaParser.beginRDFaElement(RDFaParser.java:114)
	at org.odftoolkit.odfdom.pkg.rdfa.SAXRDFaParser.startElement(SAXRDFaParser.java:115)
	at org.odftoolkit.odfdom.pkg.rdfa.MultiContentHandler.startElement(MultiContentHandler.java:83)
	at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
	at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.odftoolkit.odfdom.pkg.OdfFileDom.initialize(OdfFileDom.java:223)
	at org.odftoolkit.odfdom.dom.OdfContentDom.initialize(OdfContentDom.java:60)
	at org.odftoolkit.odfdom.pkg.OdfFileDom.<init>(OdfFileDom.java:105)
	at org.odftoolkit.odfdom.dom.OdfContentDom.<init>(OdfContentDom.java:50)
	at org.odftoolkit.odfdom.pkg.OdfFileDom.newFileDom(OdfFileDom.java:157)
	at org.odftoolkit.odfdom.pkg.OdfPackageDocument.getFileDom(OdfPackageDocument.java:323)
	at org.odftoolkit.odfdom.dom.OdfSchemaDocument.getFileDom(OdfSchemaDocument.java:405)
	at org.odftoolkit.odfdom.dom.OdfSchemaDocument.getContentDom(OdfSchemaDocument.java:206)
	at org.odftoolkit.simple.chart.AbstractChartContainer.<init>(AbstractChartContainer.java:71)
	at org.odftoolkit.simple.TextDocument$ChartContainerImpl.<init>(TextDocument.java:957)
	at org.odftoolkit.simple.TextDocument.getChartContainerImpl(TextDocument.java:948)
	at org.odftoolkit.simple.TextDocument.getChartByTitle(TextDocument.java:853)

The reason is the different return type of EmptyIterator.getInstance() between the stax2-api versions.

Invalid attribute rejection when element is nillable

XSD:

<?xml version="1.0" encoding="utf-8"?>
<xsd:schema xmlns="http://www.openuri.org/mySchema" xmlns:xsd="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.openuri.org/mySchema" version="2.0">
	<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
	<xsd:element name="comment" type="comment_type"/>
	<xsd:complexType name="comment_type">
		<xsd:choice>
			<xsd:element name="annotation" type="xsd:string"/>
			<xsd:element name="note" type="xsd:string"/>
		</xsd:choice>
		<xsd:attribute name="country" type="xsd:string"/>
	</xsd:complexType>
	<xsd:complexType name="PurchaseOrderType">
		<xsd:sequence>
		    <xsd:element name="comment" type="comment_type" nillable="true"/>
		</xsd:sequence>
		<xsd:attribute name="orderDate" type="xsd:date"/>
	</xsd:complexType>
</xsd:schema>

XML: I have to attach my xml file as txt (it is small file, 247 bytes long). This view does not show element when it has attribute xsi:nil="true".

instance.txt

<?xml version="1.0" encoding="UTF-8"?>
<data:purchaseOrder orderDate="2006-10-30" xmlns:data="http://www.openuri.org/mySchema"     
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <comment country="US" xsi:nil="true"/>
</data:purchaseOrder>

Problem:

com.ctc.wstx.exc.WstxValidationException: unexpected attribute "country"

Java's default XSD validator does not report any problems with attribute.

Misleading message in the deprecated javadoc tag of com.ctc.wstx.api.WstxInputProperties.P_CUSTOM_INTERNAL_ENTITIES

The deprecation JavaDoc message, warning about the eventual removal of the com.ctc.wstx.api.WstxInputProperties.P_CUSTOM_INTERNAL_ENTITIES property states that "the same functionality can be achieved by using custom entity resolvers".

Unfortunately this might not be true, because the streaming API currently only uses entity resolvers for resolving external entity references. The removal of the P_CUSTOM_INTERNAL_ENTITIES property will therefore also remove the ability to override internal entities.

A new dedicated XMLResolver factory/reader property might be needed to support resolving custom internal entities in future.

xml:lang attribute not handled correctly by XMLStreamReader2 using DTD validation

I'm using DTD schema validation with XMLStreamReader2:

        XMLInputFactory2 inFactory = (XMLInputFactory2)XMLInputFactory2.newInstance();
        inFactory.setProperty(XMLInputFactory2.IS_VALIDATING, Boolean.TRUE);
        inFactory.setProperty(XMLInputFactory2.SUPPORT_DTD, Boolean.TRUE);
        inFactory.setProperty(XMLInputFactory2.IS_NAMESPACE_AWARE, Boolean.FALSE);
        String dtdPath = path + ident;
        InputStream is = Utility.findResource(dtdPath);
        if (is == null) {
            throw new XMLStreamException("Unable to access DTD at path " + dtdPath);
        }
        final byte[] dtddata = Utility.readFully(is);
        XMLResolver resolver = new XMLResolver() {
            @Override
            public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) {
                return new ByteArrayInputStream(dtddata);
            }
        };
        inFactory.setXMLResolver(resolver);
        XMLStreamReader2 reader = (XMLStreamReader2)inFactory.createXMLStreamReader(in);
        // TODO: reuse schemas, since they're guaranteed threadsafe and immutable - use context to store?
        XMLValidationSchemaFactory schemaFactory =
            XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD);
        XMLValidationSchema schema = schemaFactory.createSchema(new ByteArrayInputStream(dtddata));
        reader.validateAgainst(schema);

This seems to work well in general, but when I try a document which includes an xml:lang attribute I get:

com.ctc.wstx.exc.WstxValidationException: Element <FreeFormText> has no attribute "xml:lang"
  at [row,col {unknown-source}]: [8,11]
    at com.ctc.wstx.exc.WstxValidationException.create(WstxValidationException.java:50)
    at com.ctc.wstx.sr.StreamScanner.reportValidationProblem(StreamScanner.java:580)
    at com.ctc.wstx.sr.ValidatingStreamReader.reportValidationProblem(ValidatingStreamReader.java:383)
    at com.ctc.wstx.sr.InputElementStack.reportProblem(InputElementStack.java:849)
    at com.ctc.wstx.dtd.DTDValidatorBase.doReportValidationProblem(DTDValidatorBase.java:497)
    at com.ctc.wstx.dtd.DTDValidatorBase.reportValidationProblem(DTDValidatorBase.java:479)
    at com.ctc.wstx.dtd.DTDValidator.validateAttribute(DTDValidator.java:251)
    at org.codehaus.stax2.validation.ValidatorPair.validateAttribute(ValidatorPair.java:78)
    ...

This occurs even though the DTD has the attribute specifically defined for that element:

<!ELEMENT FreeFormText
           ( #PCDATA ) >
<!ATTLIST FreeFormText
           xml:lang CDATA #IMPLIED >

The strangest part is that it appears to work correctly when using XMLStreamReader and XMLResolver (with a DOCTYPE in the document that references the DTD) - it only fails when using the reader.validateAgainst(schema) approach to set the validation DTD directly.

com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '/' (code 47) (expected a name start character)

Hello,

I'm having an issue unmarshalling an XML file when a special character such as "/" appears inside an attribute value.

In the .xsd file, the attribute is defined as xs:normalizedString, and as far as I know the character "/" is allowed:

<xs:element name="field" maxOccurs="unbounded">
	<xs:complexType>
		<xs:attribute name="name" type="xs:token" use="required" />
		<xs:attribute name="value" type="xs:normalizedString" use="required" />
	</xs:complexType>
</xs:element>

This is one sample value in the field:
<field name = "test" value = "test/"/>

And here the call I make:

XMLStreamReader xsr = null;
try {
	// Create the XML stream reader
	XMLInputFactory xif = XMLInputFactory.newFactory();
	xsr = xif.createXMLStreamReader(inputStream, "UTF-8");

	// Unmarshall the XML with JAXB, with XML schema validation enabled
	JAXBContext jc = JAXBContext.newInstance(Root.class);
	Unmarshaller unmarshaller = jc.createUnmarshaller();
	unmarshaller.setSchema(this.xmlSchema);
	Root rootIndex = (Root) unmarshaller.unmarshal(xsr);
	[...]
}

From what I've found, the issue happens in StreamScanner.fullyResolveEntity. Is this working as expected? Is the character "/" not allowed?

Maybe too strict validation for CDATA value

Hello,

My task is to nest "stringified XML with CDATA" inside CDATA (don't blame me, it's legacy code).
It does not work as expected because of https://github.com/FasterXML/woodstox/blob/master/src/main/java/com/ctc/wstx/sw/BufferingXmlWriter.java#L1484

My CDATA-escaped text contains this:

<Name><![CDATA[John Doe]]]]><![CDATA[></Name>

This Stack Overflow thread taught me that it can work: https://stackoverflow.com/questions/223652/is-there-a-way-to-escape-a-cdata-end-token-in-xml

Please advise: do you think this should be fixed, or am I wrong?
Thanks for your help.
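One way to sidestep the question is to do the split on the caller's side and write each piece as its own CDATA section, so that no single section ever contains the "]]>" end token. The sketch below shows the splitting technique from the linked thread. The names CdataSplitSketch, splitForCdata and writeAsCdata are made up for illustration. It uses only the standard XMLStreamWriter.writeCData call and says nothing about how BufferingXmlWriter itself should behave.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class CdataSplitSketch {

    /** Splits text so that no part contains the CDATA end token "]]>". */
    static List<String> splitForCdata(String text) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf("]]>", start)) >= 0) {
            parts.add(text.substring(start, idx + 2)); // this part ends with "]]"
            start = idx + 2;                           // the next part starts with ">"
        }
        parts.add(text.substring(start));
        return parts;
    }

    /** Writes the text inside the given element as one or more CDATA sections. */
    static String writeAsCdata(String element, String text) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        writer.writeStartElement(element);
        for (String part : splitForCdata(text)) {
            writer.writeCData(part);
        }
        writer.writeEndElement();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(splitForCdata("<Name><![CDATA[John Doe]]></Name>"));
        System.out.println(writeAsCdata("wrapper", "<Name><![CDATA[John Doe]]></Name>"));
    }
}
```

A parser reading the result sees the adjacent CDATA sections as consecutive character data, so the original text is recovered once they are concatenated.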

`CompactStartElement` appears to incorrectly classify attributes as default

When reading and writing XML via WstxInputFactory and DefaultEventAllocator, attributes appear to be classed as default/unspecified incorrectly.

This appears to be caused by CompactStartElement.constructAttr; this method passes isDef (presumably "is default"?) to AttributeEventImpl's wasSpecified parameter directly. If an attribute is default, shouldn't that mean it wasn't specified?

Please let me know if I'm misinterpreting this behaviour.

Test case:

import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

import com.ctc.wstx.evt.DefaultEventAllocator;
import com.ctc.wstx.stax.WstxEventFactory;
import com.ctc.wstx.stax.WstxInputFactory;

public class Main {

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\" ?><a b=\"c\"></a>";
        WstxInputFactory inputFactory = new WstxInputFactory();
        XMLStreamReader stream = inputFactory.createXMLStreamReader(new ByteArrayInputStream(xml.getBytes()));
        DefaultEventAllocator allocator = DefaultEventAllocator.getDefaultInstance();
        XMLEventFactory eventFactory = new WstxEventFactory();
        StringWriter writer = new StringWriter();
        while(stream.hasNext()) {
            XMLEvent event = allocator.allocate(stream);
            if(event.isStartElement()) {
                // Force parsing of attributes
                StartElement start = event.asStartElement();
                event = eventFactory.createStartElement(start.getName(), start.getAttributes(), start.getNamespaces());
            }
            event.writeAsEncodedUnicode(writer);
            stream.next();
        }
        System.out.println("Expected: " + xml);
        System.out.println("Actual:   " + writer.toString());
    }
}

how to handle xml shorthand like <elem/>
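With the standard Stax API, the shorthand needs no special handling on the reading side: an empty-element tag such as <elem/> is reported as a START_ELEMENT event followed by an END_ELEMENT event, exactly as <elem></elem> would be. The sketch below illustrates this with the javax.xml.stream API only. The names EmptyElementSketch and events are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class EmptyElementSketch {

    /** Lists the start and end element events of a document, in order. */
    static List<String> events(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        out.add("START " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.add("END " + reader.getLocalName());
                        break;
                    default:
                        break; // other event types are not relevant here
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(events("<root><elem/></root>"));
        System.out.println(events("<root><elem></elem></root>"));
    }
}
```

Both calls print the same event list, so code that handles START_ELEMENT and END_ELEMENT already covers the shorthand form.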

Invalid 5.0.0 on mvnrepository

It seems there is an issue with version 5.0.0 on mvnrepository, it's only 6Kb. See http://mvnrepository.com/artifact/com.fasterxml.woodstox/woodstox-core/5.0.0

Version 5.0.1 is fine though, so you could just update the README to indicate 5.0.1 is the version to use.

5.0.0 Maven jar missing classes

It appears the 5.0.0 Maven published jar is missing all the class files:

jar -tvf woodstox-core-5.0.0.jar
907 Mon Feb 23 20:28:58 CST 2015 META-INF/MANIFEST.MF
0 Mon Feb 23 20:28:58 CST 2015 META-INF/
0 Mon Feb 23 20:28:58 CST 2015 META-INF/maven/
0 Mon Feb 23 20:28:58 CST 2015 META-INF/maven/com.fasterxml.woodstox/
0 Mon Feb 23 20:28:58 CST 2015 META-INF/maven/com.fasterxml.woodstox/woodstox-core/
144 Mon Feb 23 20:28:58 CST 2015 META-INF/maven/com.fasterxml.woodstox/woodstox-core/pom.properties
6609 Mon Feb 23 20:28:38 CST 2015 META-INF/maven/com.fasterxml.woodstox/woodstox-core/pom.xml
0 Mon Feb 23 20:28:58 CST 2015 META-INF/services/
35 Mon Feb 23 20:28:38 CST 2015 META-INF/services/javax.xml.stream.XMLEventFactory
34 Mon Feb 23 20:28:38 CST 2015 META-INF/services/javax.xml.stream.XMLInputFactory
35 Mon Feb 23 20:28:38 CST 2015 META-INF/services/javax.xml.stream.XMLOutputFactory
34 Mon Feb 23 20:28:38 CST 2015 META-INF/services/org.codehaus.stax2.validation.XMLValidationSchemaFactory.dtd
38 Mon Feb 23 20:28:38 CST 2015 META-INF/services/org.codehaus.stax2.validation.XMLValidationSchemaFactory.relaxng
34 Mon Feb 23 20:28:38 CST 2015 META-INF/services/org.codehaus.stax2.validation.XMLValidationSchemaFactory.w3c

Improve hashing implementation used for symbol tables for element and attribute names

Existing symbol-table implementation uses hashing implementation similar to default JDK String.hashCode(). While this has benefits like simplicity, and comparable performance to standard String hashing, there are some well-documented issues regarding hash-collisions.
It would be good to improve these aspects, to avoid most obvious problems.

There are many possible ways to improve handling; one of the first things to check could be whether and how changes to Jackson's handling could be adopted. Jackson's handling differs a bit (since it uses a straight-from-bytes to char[] approach, like Aalto) but there may be things to take. Other sources should be considered too, including hash algorithms that do not use simple multiply-then-add, and are not as easy to generate collisions against.

One idea not used by Jackson, but potentially useful here, could be JDK8-style overflow areas, where secondary sorting is used for "too big" buckets. Such buckets could use simple binary search. That would be a bigger change, but would eliminate the problem.
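To make the collision problem concrete, the sketch below uses a hash of the same shape as JDK String.hashCode() (multiply by 31, then add the next character). It is not Woodstox's actual symbol-table code, and the names HashCollisionSketch, simpleHash and collidingNames are made up for illustration. Because "Aa" and "BB" hash to the same value, every name built by concatenating those two blocks collides as well, so an attacker can produce 2^n colliding names of length 2n.

```java
import java.util.ArrayList;
import java.util.List;

final class HashCollisionSketch {

    /** Multiply-then-add hash, the same shape as String.hashCode(). */
    static int simpleHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    /** Builds all 2^blocks names made of "Aa" / "BB" blocks; they all share one hash. */
    static List<String> collidingNames(int blocks) {
        List<String> names = new ArrayList<>();
        names.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : names) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            names = next;
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : collidingNames(3)) {
            System.out.println(name + " -> " + simpleHash(name));
        }
    }
}
```

A document whose element or attribute names are generated this way puts every name into the same bucket, which is the degenerate case the overflow-area idea is meant to bound.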

Property WstxInputProperties.P_VALIDATE_TEXT_CHARS unrecognized

I am trying to use the library (v5.0.3) because it was supposed to have a feature for ignoring invalid characters in XML text, but the property is not recognized.
The code:
XMLInputFactory2 f = (XMLInputFactory2) XMLInputFactory2.newInstance();
f.setProperty(WstxInputProperties.P_VALIDATE_TEXT_CHARS, Boolean.FALSE);
produces the following error:
java.lang.IllegalArgumentException: Unrecognized property 'com.ctc.wstx.validateTextChars'
at com.ctc.wstx.api.CommonConfig.reportUnknownProperty(CommonConfig.java:168)
at com.ctc.wstx.api.CommonConfig.setProperty(CommonConfig.java:159)
at com.ctc.wstx.api.ReaderConfig.setProperty(ReaderConfig.java:35)
at com.ctc.wstx.sr.BasicStreamReader.setProperty(BasicStreamReader.java:1306)

Text content is not validated correctly when copying events to `XmlStreamWriter`

Hello,

I found a bug in the way validation is performed for text content when writing to an XMLStreamWriter, which leads to spurious validation errors.

I've tested it with the 5.0.3 release.
Here is the code to reproduce:

import com.ctc.wstx.stax.WstxInputFactory;
import com.ctc.wstx.stax.WstxOutputFactory;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.XMLStreamWriter2;
import org.codehaus.stax2.validation.XMLValidationSchema;
import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

import javax.xml.stream.XMLStreamException;
import java.io.InputStream;
import java.io.StringWriter;

public class Converter {

    public static void main(String... args) throws XMLStreamException {

        InputStream reader = Converter.class.getClassLoader().getResourceAsStream("test.xml");
        StringWriter writer = new StringWriter();

        XMLValidationSchema schema = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA)
                .createSchema(Converter.class.getClassLoader().getResourceAsStream("schema.xsd"));


        XMLStreamReader2 xmlReader = (XMLStreamReader2) new WstxInputFactory().createXMLStreamReader(reader);
        xmlReader.validateAgainst(schema);

        XMLStreamWriter2 xmlWriter = (XMLStreamWriter2) new WstxOutputFactory().createXMLStreamWriter(writer);
        xmlWriter.validateAgainst(schema);

        xmlWriter.copyEventFromReader(xmlReader, false);

        while (xmlReader.hasNext()) {
            xmlReader.next();

            xmlWriter.copyEventFromReader(xmlReader, true);
        }

        System.out.println(writer.toString());
    }
}

Here is the XML:

<?xml version='1.0' encoding='UTF-8'?>
<Document>1</Document>

and the schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="unqualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="Document" type="xs:int"/>
</xs:schema>

Running the example code with those two files results in the following exception being thrown:

Exception in thread "main" com.ctc.wstx.exc.WstxValidationException: Unknown reason
 at [row,col {unknown-source}]: [1,61]
    at com.ctc.wstx.exc.WstxValidationException.create(WstxValidationException.java:50)
    at com.ctc.wstx.sw.BaseStreamWriter.reportProblem(BaseStreamWriter.java:1243)
    at com.ctc.wstx.msv.GenericMsvValidator.reportError(GenericMsvValidator.java:549)
    at com.ctc.wstx.msv.GenericMsvValidator.reportError(GenericMsvValidator.java:541)
    at com.ctc.wstx.msv.GenericMsvValidator.reportError(GenericMsvValidator.java:535)
    at com.ctc.wstx.msv.GenericMsvValidator.validateElementEnd(GenericMsvValidator.java:390)
    at com.ctc.wstx.sw.BaseNsStreamWriter.doWriteEndTag(BaseNsStreamWriter.java:742)
    at com.ctc.wstx.sw.BaseNsStreamWriter.writeEndElement(BaseNsStreamWriter.java:291)
    at com.ctc.wstx.sw.BaseStreamWriter.copyEventFromReader(BaseStreamWriter.java:795)
    at Converter.main(Converter.java:34)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

When reading that XML, no error occurs; this only happens when writing.
Additionally, if I change the type of the Document element to xs:string in the XSD, validation is successful.

I've dug into this a bit. My understanding is that the stream writer doesn't pass text content to the validator: GenericMsvValidator#validateText is never called, and as a result GenericMsvValidator#mTextAccumulator stays empty until validateElementEnd is called.

Multi-document mode produces events only for first document

When trying to parse a stream of documents, Woodstox produces events for the first document only. After the end-document event it goes into an end-of-input state and reports that it has no more events. This does not seem to be the intended behaviour.

For example the following code will print "Document start" only once.

import com.ctc.wstx.api.WstxInputProperties;
import com.google.common.collect.Lists;
import org.codehaus.stax2.XMLEventReader2;
import org.codehaus.stax2.XMLInputFactory2;

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.util.List;

import static org.apache.commons.lang3.StringUtils.*;

public class Test {

    private static final String MULTIDOC =
            "<?xml version='1.0'?><root>text</root><!--comment-->\n"
            +"<?xml version='1.0'?><root>text</root><?proc instr>\n"
            +"<?xml version='1.0'?><root>text</root><!--comment-->"
            +"<?xml version='1.0' encoding='UTF-8'?><root>text</root><!--comment-->"
            +"<?xml version='1.0' standalone='yes'?><root>text</root><!--comment-->"
            +"<?xml version='1.0'?><root>text</root><!--comment-->";


    public static void main(String[] args) throws XMLStreamException {
        new Test().woodstox();
    }

    public void woodstox() throws XMLStreamException {
        XMLInputFactory2 factory = (XMLInputFactory2) XMLInputFactory2.newInstance();
        factory.setProperty(WstxInputProperties.P_INPUT_PARSING_MODE, WstxInputProperties.PARSING_MODE_DOCUMENTS);

        XMLEventReader2 xmlEventReader = (XMLEventReader2) factory.createXMLEventReader(new StringReader(MULTIDOC));
        while (xmlEventReader.hasNextEvent()) {
            XMLEvent xmlEvent = xmlEventReader.nextEvent();
            if (xmlEvent.isStartDocument()) {
                System.out.println("Document start");
            }
        }
    }

}

woodstox 5.0.3, stax2 4.0.0
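
Until multi-document mode behaves as expected, one possible workaround is to split the input at each XML declaration and parse the pieces one at a time. The sketch below uses only the standard `javax.xml.stream` API; the class name `MultiDocSplitter` and the sample input are made up for illustration, and the naive split would misfire if the text `<?xml ` appeared inside a comment or CDATA section.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

final class MultiDocSplitter {

    // Counts start-document events by parsing each declaration-delimited chunk separately.
    static int countDocuments(String multiDoc) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        int count = 0;
        // The zero-width lookahead keeps each "<?xml ...?>" declaration at the start of its chunk.
        for (String doc : multiDoc.split("(?=<\\?xml\\s)")) {
            if (doc.trim().isEmpty()) {
                continue;
            }
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(doc));
            try {
                while (reader.hasNext()) {
                    if (reader.nextEvent().isStartDocument()) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String input =
                "<?xml version='1.0'?><root>text</root><!--comment-->\n"
                + "<?xml version='1.0'?><root>text</root>\n"
                + "<?xml version='1.0' encoding='UTF-8'?><root>text</root>";
        System.out.println("Documents seen: " + countDocuments(input));
    }
}
```

This trades streaming for simplicity, since the whole input has to be held as a string before it is split.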

GenericMsvValidator.getAttributeType(int) always returns null, causing a NullPointerException in com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM

When reading a DOM document from a StAXSource backed by a validating XMLStreamReader2, com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM will throw a NullPointerException when trying to process attributes. This seems to be caused by GenericMsvValidator.getAttributeType(int) always returning a null reference for attribute type, which SAX2DOM is unprepared to handle.

The exception stack trace:

java.lang.NullPointerException
	at com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM.startElement(SAX2DOM.java:204)
	at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.closeStartTag(ToXMLSAXHandler.java:208)
	at com.sun.org.apache.xml.internal.serializer.ToSAXHandler.flushPending(ToSAXHandler.java:281)
	at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.startElement(ToXMLSAXHandler.java:650)
	at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleStartElement(StAXStream2SAX.java:319)
	at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:145)
	at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:101)
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:688)
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:737)
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:351)

Tested with:

JRE 1.8.0_141-b15 (x64)
com.fasterxml.woodstox:woodstox-core:5.0.3
net.java.dev.msv:msv-core:2013.6.1

Code to reproduce the error:

File xmlFile = new File("Test.xml");
File schemaFile = new File("Test.xsd");

validatAgainst(new File(xmlFile.toURI()), new File(schemaFile.toURI()));

XMLInputFactory2 xmlInputFactory = (XMLInputFactory2) XMLInputFactory2.newFactory();
xmlInputFactory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
xmlInputFactory.setProperty(XMLInputFactory.IS_VALIDATING, true);

XMLValidationSchema xmlValidationSchema = XMLValidationSchemaFactory
		.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA).createSchema(schemaFile);

XMLStreamReader2 xmlStreamReader = (XMLStreamReader2) xmlInputFactory.createXMLStreamReader(xmlFile);
xmlStreamReader.validateAgainst(xmlValidationSchema);

Transformer transformer = TransformerFactory.newInstance().newTransformer();

while (xmlStreamReader.hasNext()) {
	xmlStreamReader.next();
	if (xmlStreamReader.getEventType() == XMLStreamConstants.START_ELEMENT) {
		transformer.reset();
		DOMResult result = new DOMResult();
		transformer.transform(new StAXSource(xmlStreamReader), result);
	}
}

Test.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.org/Test"
	xmlns:tns="http://www.example.org/Test" elementFormDefault="qualified">
	<element name="test">
		<complexType>
			<attribute name="attr" type="string" />
		</complexType>		
	</element>
</schema>

Test.xml

<?xml version="1.0" encoding="UTF-8"?>
<t:test xmlns:t="http://www.example.org/Test" attr="value" />

edit: Fixed broken link.

Validation error due to white-space being handled as CData by `BaseStreamWriter`

Hello,

I've found a bug in the way white-space is handled by the BaseStreamWriter when using validation with release 5.1.0

The exception we get is the following:
Exception in thread "main" com.ctc.wstx.exc.WstxValidationException: Element <Document> has non-mixed content specification; can not contain non-white space text, or any CDATA sections
 at [row,col {unknown-source}]: [1,49]
    at com.ctc.wstx.exc.WstxValidationException.create(WstxValidationException.java:50)
    at com.ctc.wstx.sw.BaseStreamWriter.reportProblem(BaseStreamWriter.java:1248)
    at com.ctc.wstx.sw.BaseStreamWriter.reportValidationProblem(BaseStreamWriter.java:1739)
    at com.ctc.wstx.sw.BaseStreamWriter.reportInvalidContent(BaseStreamWriter.java:1691)
    at com.ctc.wstx.sw.BaseStreamWriter.verifyWriteCData(BaseStreamWriter.java:1516)
    at com.ctc.wstx.sw.BaseStreamWriter.writeCData(BaseStreamWriter.java:331)
    at com.ctc.wstx.sw.BaseStreamWriter.copyEventFromReader(BaseStreamWriter.java:806)
    at Converter.main(Converter.java:34)

This happens because white-space events copied to the BaseStreamWriter, at BaseStreamWriter:806, are handled by writeCData() and therefore treated as a CDATA block for validation purposes.

Java 9-ea+164 jdeps warning

jdeps -jdkinternals of Java 9-ea+164 produces the following warning for projects using woodstox-core:5.0.3:

com.ctc.wstx.msv.GenericMsvValidator
-> org.relaxng.datatype.Datatype
JDK internal API (JDK removed internal API)

This issue is just meant as an "in case you didn't know".

Force XML version 1.1 override for parser

Hello,
I have an XML file whose prolog declares the document as version=1.0, even though it contains some entities (namely ) that do not conform to 1.0, so, as expected, I get an exception when parsing the file.

Once I manually change the version in the prolog to 1.1, the file is parsed without errors.

My question is whether there's a way to force the parser to discard the prolog information to somehow always use XML 1.1 semantics, without the need to alter the original file. I'm using the SAX API.

Thanks.

Missing license information

Hi 😃

I was just looking through this repository for some license information and noticed that quite a few source files have the following header:

/* Woodstox XML processor
 *
 * Copyright (c) 2004- Tatu Saloranta, [email protected]
 *
 * Licensed under the License specified in the file LICENSE which is
 * included with the source code.
 * You may not use this file except in compliance with the License.
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

This refers to a LICENSE file which I couldn't find anywhere in the repository. What is the license of this project and could you perhaps create the missing LICENSE file? Thanks!

Facility for Grammar Reuse (e.g. Xerces GrammarPool)

@cowtowncoder I am trying to switch eXist-db from Xerces to Woodstox (with the SAX API).

Xerces has the facility to setup a shared Grammar Pool, by setting the property http://apache.org/xml/properties/internal/grammar-pool and providing an org.apache.xerces.xni.grammars.XMLGrammarPool so that parsed Grammars (DTD, XML Schema etc) can be reused without the overhead of re-parsing them. I was wondering if Woodstox (or its use of MSV) had anything similar, or if it was even desirable to have that in Woodstox?

If it is missing but desirable, and you let me know how you envisage it being added, I am happy to do the work and send a PR...

Stax API should be provided instead of required

The Stax API dependency in pom.xml should be declared with provided scope instead of as a required dependency, the same as you did for StaxMate, I think. Most people are using Java 6+ at this point.

Woodstox Stax2 validator won't work with Java 7?

Hi,

I'm using the latest Woodstox with the Stax2 API, and when I try your recommended way to validate an XML document against a schema I get the following exception:

java.lang.NoClassDefFoundError: com/sun/msv/reader/GrammarReaderController
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532)
    at java.lang.Class.getConstructor0(Class.java:2842)
    at java.lang.Class.newInstance(Class.java:345)
    at org.codehaus.stax2.validation.XMLValidationSchemaFactory.createNewInstance(XMLValidationSchemaFactory.java:306)
    at org.codehaus.stax2.validation.XMLValidationSchemaFactory.newInstance(XMLValidationSchemaFactory.java:209)
    at org.codehaus.stax2.validation.XMLValidationSchemaFactory.newInstance(XMLValidationSchemaFactory.java:116)

The JavaDoc says it uses the Sun Multi-Schema Validator. I'm not sure whether that API was hidden or deprecated; is there any way to either fix this or import an alternative to this Sun API via Maven?

Add support for JEP-185 (JAXP-1.5) properties named `ACCESS_EXTERNAL_`

Apparently there's a JEP to add Yet Another Set of configuration properties (to overlap with existing ones):

http://openjdk.java.net/jeps/185

and since users will be trying to use them (as per #50) we'll probably need to add support.
(no one from JEP-185 has tried to reach the project prior to this user request)

It is further unfortunate these are added as System properties since that has all the problems of global variables; as well as the question of how these should interact with existing configuration settings.
But it is what it is and this is becoming Oracle's walled garden so. shrug.

XMLStreamException: Unbound namespace URI ''

Somehow this project got onto my classpath, replacing the default OpenJDK implementation, and code that was previously working fine started failing with this exception:

Caused by: javax.xml.stream.XMLStreamException: Unbound namespace URI ''
    at com.ctc.wstx.sw.BaseStreamWriter.throwOutputError(BaseStreamWriter.java:1473)
    at com.ctc.wstx.sw.SimpleNsStreamWriter.writeAttribute(SimpleNsStreamWriter.java:84)

The namespaceURI being passed is empty string (i.e., default namespace).

Is the exception behavior correct?

If the behavior in this case is unspecified, it would be nice to make it compatible with the default implementation.

Also, the behavior is inconsistent with the behavior of the version of writeAttribute() that does not take a namespaceURI parameter.

In other words:

output.writeAttribute("", "name", "value");

throws an exception while

output.writeAttribute("name", "value");

does not... but shouldn't these be equivalent?
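
Whichever behaviour turns out to be correct, calling code can avoid the difference by routing an empty namespace URI to the two-argument overload. The helper below is a sketch using only the standard `XMLStreamWriter` API; the class and method names are hypothetical, and treating `null` and `""` alike as "no namespace" is an assumption of this example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class AttributeWriting {

    // Uses the no-namespace overload whenever the namespace URI is null or empty.
    static void writeAttribute(XMLStreamWriter out, String namespaceURI,
                               String localName, String value) throws XMLStreamException {
        if (namespaceURI == null || namespaceURI.isEmpty()) {
            out.writeAttribute(localName, value);
        } else {
            out.writeAttribute(namespaceURI, localName, value);
        }
    }

    // Writes <root> with one attribute passed with an empty namespace URI.
    static String demo() throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
        out.writeStartElement("root");
        writeAttribute(out, "", "name", "value");
        out.writeEndElement();
        out.flush();
        out.close();
        return sink.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(demo());
    }
}
```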

WstxEOFException: Unexpected end of input block in start tag

Team,

We're encountering an odd intermittent exception in this stack and would really appreciate any insight in how to diagnose the root cause please.

Caused by: com.ctc.wstx.exc.WstxEOFException: Unexpected end of input block in start tag
at [row,col {unknown-source}]: [1,2753]
at com.ctc.wstx.sr.StreamScanner.throwUnexpectedEOB(StreamScanner.java:691)
at com.ctc.wstx.sr.StreamScanner.loadMoreFromCurrent(StreamScanner.java:1063)
at com.ctc.wstx.sr.StreamScanner.getNextCharFromCurrent(StreamScanner.java:802)
at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2946)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2837)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1072)
at oracle.j2ee.ws.saaj.util.ResettableXMLStreamReader.next(ResettableXMLStreamReader.java:216)
at oracle.j2ee.ws.saaj.soap.StaxHandler.moveNext(StaxHandler.java:210)
at oracle.j2ee.ws.saaj.soap.StaxHandler.staxParse(StaxHandler.java:114)
at oracle.j2ee.ws.saaj.soap.StaxHandler.staxParse(StaxHandler.java:102)
at oracle.j2ee.ws.saaj.soap.StaxHandler.staxParseNextElement(StaxHandler.java:217)
at oracle.j2ee.ws.saaj.soap.ElementImpl.realizeNextChild(ElementImpl.java:1573)
at oracle.j2ee.ws.saaj.soap.ElementImpl.getFirstChild(ElementImpl.java:1628)

The issue is related to a SOAP web service request to an ESB which works sometimes and fails with this exception in other cases. The request / response is consistent in all cases and not overly large e.g. around 3000 bytes

Many thanks

Woodstox conflict with BeanIO parser, when reading XML file with CDATA

Hello,

Parsing an XML file with the beanio.org framework and Woodstox as the default implementation works fine as long as the file contains no CDATA text.

The problem is that the CDATA section is ignored as if it didn't exist, so we can't read it as long as Woodstox is on the classpath.

BeanIO reads it fine with "com.sun.xml.internal.stream.XMLInputFactoryImpl", but as soon as we add the Woodstox jar to the classpath, Woodstox is picked up as the default implementation.

We use "woodstox-core-asl-4.2.0.jar".

Could you please look into this issue and tell me if you have any clue what the problem could be?

Best regards.

Thanks

Shall BasicStreamReader use WstxUnexpectedCharException instead of WstxParsingException?

This happens when BasicStreamReader encounters '<' in an attribute value:

com.ctc.wstx.exc.WstxParsingException: Unexpected '<' in attribute value

com.ctc.wstx.sr.CompactNsContext#doGetNamespaceURI() returns null on missing prefix (instead of "")

According to the documentation of NamespaceContext, an unbound prefix should result in the XMLConstants.NULL_NS_URI being returned on invoking getNamespaceURI. The implementation returns null instead. This is visible on line 90.
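
Callers that cannot tolerate the null can wrap the context and normalise the result themselves. This is a sketch against the standard `javax.xml.namespace.NamespaceContext` interface only; the wrapper class `NullSafeNamespaceContext` is hypothetical and is not part of Woodstox.

```java
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

final class NullSafeNamespaceContext implements NamespaceContext {

    private final NamespaceContext delegate;

    NullSafeNamespaceContext(NamespaceContext delegate) {
        this.delegate = delegate;
    }

    // Maps a null result for an unbound prefix to XMLConstants.NULL_NS_URI ("").
    @Override
    public String getNamespaceURI(String prefix) {
        String uri = delegate.getNamespaceURI(prefix);
        return (uri == null) ? XMLConstants.NULL_NS_URI : uri;
    }

    @Override
    public String getPrefix(String namespaceURI) {
        return delegate.getPrefix(namespaceURI);
    }

    // The interface returns a raw Iterator on older JDKs, hence the suppression.
    @Override
    @SuppressWarnings("unchecked")
    public Iterator<String> getPrefixes(String namespaceURI) {
        return delegate.getPrefixes(namespaceURI);
    }
}
```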

Error-Codes for all Parse/Write-Exceptions and Errors.

Like asked at Stackoverflow:
https://stackoverflow.com/questions/50911195/can-woodstox-error-messages-be-customized

Quoting StaxMan:

addition of formal error codes in Exceptions seems like a good improvement (Stax spec/API does not have those, but Woodstox implementation types could, either directly, or as tag interface). One potential challenge might be that of how to templatize things (how to add pertinent pieces of information within message). But adding error codes seems like a good start.

I will also take a look at the codebase and see if I can provide a pull request for this.

500 characters limit when calling XMLStreamReader#getText() after CDATA event

While the scenario tested in TestXMLStreamReader2#testLongerCData() works correctly, the same 500 chars limit bug (WSTX-211) is still happening with a slight modification. For instance, this test case:

    public void testLongerCData2() throws Exception
    {
        String SRC_TEXT =
                "\r\n123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\r\n"
                          + "<embededElement>Woodstox 4.0.5 does not like this embedded element.  However, if you take\r\n"
                          + "out one or more characters from the really long line (so that less than 500 characters come between\r\n"
                          + "'CDATA[' and the opening of the embeddedElement tag (including LF), then Woodstox will instead\r\n"
                          + "complain that the CDATA section wasn't ended.";
        String DST_TEXT = SRC_TEXT.replace("\r\n", "\n");
        String XML = "<?xml version='1.0' encoding='utf-8'?>\r\n"
                     + "<test><![CDATA[" + SRC_TEXT + "]]></test>";
        // Hmmh. Seems like we need the BOM...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(0xEF);
        bos.write(0xBB);
        bos.write(0xBF);
        bos.write(XML.getBytes("UTF-8"));
        byte[] bytes = bos.toByteArray();
        XMLInputFactory2 f = (XMLInputFactory2) XMLInputFactory.newInstance();
        // important: don't force coalescing, that'll convert CDATA to CHARACTERS
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.valueOf(false));

        XMLStreamReader sr = f.createXMLStreamReader(new ByteArrayInputStream(bytes));
        assertTokenType(START_DOCUMENT, sr.getEventType());
        assertTokenType(START_ELEMENT, sr.next());
        assertEquals("test", sr.getLocalName());
        assertTokenType(CDATA, sr.next());
        // This should still work, although with linefeed replacements
        final String text = sr.getText();
        assertEquals("" + text.length(), DST_TEXT, text);
        // assertTokenType(END_ELEMENT, sr.getEventType());
        sr.close();
    }

calls sr.next() followed by sr.getText() instead of just sr.getElementText().

When running this test case, you will see that only the first 500 characters are read.

This usage pattern appears, for instance, in CXF, in StaxUtils#copy.
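
For callers that use the next()/getText() pattern, a defensive option is to keep appending text for as long as the reader reports consecutive text events, since a non-coalescing reader is allowed to deliver content in several pieces. This is a sketch with the standard StAX API, not a fix for the bug described above; the class name `TextAccumulator` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class TextAccumulator {

    // Returns the text of the root element, concatenating every consecutive text event.
    static String readAllText(String xml) throws XMLStreamException {
        XMLInputFactory f = XMLInputFactory.newInstance();
        // Leave coalescing off so CDATA is not forced into a single CHARACTERS event.
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        XMLStreamReader sr = f.createXMLStreamReader(new StringReader(xml));
        try {
            sr.nextTag(); // advance to the root START_ELEMENT
            StringBuilder sb = new StringBuilder();
            int type = sr.next();
            while (type == XMLStreamConstants.CHARACTERS
                    || type == XMLStreamConstants.CDATA
                    || type == XMLStreamConstants.SPACE) {
                sb.append(sr.getText());
                type = sr.next();
            }
            return sb.toString();
        } finally {
            sr.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version='1.0'?><test><![CDATA[some <embedded> text]]></test>";
        System.out.println(readAllText(xml));
    }
}
```";
String got = TextAccumulator.readAllText(xml);
if (!text.equals(got)) {
    throw new AssertionError("expected length " + text.length() + ", got " + got.length());
}
</test>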

BaseStreamWriter writeCharacters should deal with null text?

In the implementation of the XMLStreamWriter : BaseStreamWriter
(https://github.com/FasterXML/woodstox/blob/master/src/main/java/com/ctc/wstx/sw/BaseStreamWriter.java)
the method writeCharacters(String text) does not check for the text to be null (line 464 : int len = text.length();) and we can get a NullPointerException.
In the Sun implementation (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/com/sun/xml/internal/stream/writers/XMLStreamWriterImpl.java#XMLStreamWriterImpl.writeXMLContent%28java.lang.String%29) a check is done.
What should be the correct behavior?
Thanks
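
Until the expected behaviour is settled, callers can guard the call themselves. The sketch below uses only the standard `XMLStreamWriter` API; treating a null text as "write nothing" is an assumption of this example, not documented behaviour of either implementation, and the class name `SafeText` is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class SafeText {

    // Skips the write entirely when text is null instead of risking a NullPointerException.
    static void writeCharacters(XMLStreamWriter out, String text) throws XMLStreamException {
        if (text != null) {
            out.writeCharacters(text);
        }
    }

    // Serializes <a>text</a>, tolerating a null text.
    static String demo(String text) throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
        out.writeStartElement("a");
        writeCharacters(out, text);
        out.writeEndElement();
        out.flush();
        out.close();
        return sink.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(demo(null));
        System.out.println(demo("hi"));
    }
}
```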

Method XMLStreamWriter.writeCharacters(String) escapes '>' char only at the start of a string

Example OK :
// will print <DATA>&gt; test &amp;</DATA>

ByteArrayOutputStream output=new ByteArrayOutputStream();
Charset charset = StandardCharsets.ISO_8859_1;
XMLOutputFactory xmlFactory = javax.xml.stream.XMLOutputFactory.newInstance();
XMLStreamWriter writer = xmlFactory.createXMLStreamWriter(output, charset.name());
writer.writeStartElement("DATA");
writer.writeCharacters("> test &");
writer.writeEndElement();
writer.close();
System.out.println(output.toString(charset.name()));

Example NOT OK :
// will print "<DATA>&amp; test ></DATA>"

ByteArrayOutputStream output=new ByteArrayOutputStream();
Charset charset = StandardCharsets.ISO_8859_1;
XMLOutputFactory xmlFactory = javax.xml.stream.XMLOutputFactory.newInstance();
XMLStreamWriter writer = xmlFactory.createXMLStreamWriter(output, charset.name());
writer.writeStartElement("DATA");
writer.writeCharacters("& test >");
writer.writeEndElement();
writer.close();
System.out.println(output.toString(charset.name()));
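
In XML character data, '>' only has to be escaped when it would complete the sequence "]]>", so an unescaped '>' elsewhere is still well-formed. One way to confirm that nothing is lost is to parse the output back and compare it with the input. A minimal sketch with the standard StAX API (the class name `EscapeRoundTrip` is hypothetical, and it writes to a `StringWriter` rather than the ISO-8859-1 stream used above):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class EscapeRoundTrip {

    // Serializes <DATA>text</DATA> with whatever escaping the writer applies.
    static String write(String text) throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
        writer.writeStartElement("DATA");
        writer.writeCharacters(text);
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return sink.toString();
    }

    // Parses the document back and returns the unescaped text of the root element.
    static String read(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag();
            return reader.getElementText();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String text : new String[] {"> test &", "& test >"}) {
            String xml = write(text);
            System.out.println(xml + " round-trips: " + text.equals(read(xml)));
        }
    }
}
```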

StreamReader reads incomplete CDATA

In some cases, CDATA content returned by ValidatingStreamReader is not complete.

I created a simple reproducer. When woodstox dependency is commented out, it passes. With woodstox it fails.

https://github.com/TomasHofman/woodstox-reproducer

Validator loads entire string content in memory.

For an implementation of a streaming API, it is surprising to observe that the validator loads the entire string content in memory. While the reader gives multiple text events, the validator loads all text content in memory. Is there a way to customize the behavior of the default validator? To be specific, the XML Schema validator shows this behavior.

W3C Schema Validation does not cater for xs:unique constraints

Consider the following XSD (called idc2.xsd):

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="idc2.xsd"
            xmlns:idc="idc2.xsd"
            targetNamespace="idc2.xsd"
            elementFormDefault="qualified"
            version="1.0"
            >
  <xsd:element name="itemList">
	<xsd:complexType>
	  <xsd:sequence>
	    <xsd:element name="item" maxOccurs="unbounded" type="xsd:decimal" />
	  </xsd:sequence>
	</xsd:complexType>
	<xsd:unique name="itemAttr">
	  <xsd:selector xpath="idc:item"/>
	  <xsd:field    xpath="."/>
	</xsd:unique>
  </xsd:element>
</xsd:schema>

And the corresponding (invalid) XML (called idc2.xml):

<?xml version="1.0"?>
<itemList xmlns="idc2.xsd"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="idc2.xsd idc2.xsd">
   <item>1</item>
   <item>1</item>
   <item>2</item>
</itemList>

The following pom.xml dependencies I used:

<dependency>
    <groupId>com.fasterxml.woodstox</groupId>
    <artifactId>woodstox-core</artifactId>
    <version>5.0.3</version>
</dependency>
<dependency>
    <groupId>msv</groupId>
    <artifactId>msv</artifactId>
    <version>20050913</version>
</dependency>
<dependency>
    <groupId>relaxngDatatype</groupId>
    <artifactId>relaxngDatatype</artifactId>
    <version>20020414</version>
</dependency>
<dependency>
    <groupId>com.sun.msv.datatype.xsd</groupId>
    <artifactId>xsdlib</artifactId>
    <version>2013.2</version>
</dependency>

And the following example program (TestXSD.java):

package jdi.test.xsdvalidation;

import java.io.InputStream;

import javax.xml.stream.XMLInputFactory;

import org.codehaus.stax2.XMLInputFactory2;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.validation.XMLValidationException;
import org.codehaus.stax2.validation.XMLValidationSchema;
import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

public class TestXSD {
	
	public static void main(final String[] args) throws Exception {
		final String xmlFileName = "idc2.xml";
		final String xsdFileName = "idc2.xsd";
		
		//load Schema
		final XMLValidationSchemaFactory xmlValidationSchemaFactory = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA);
		
		final InputStream schemaInputStream = TestXSD.class.getResourceAsStream(xsdFileName);
		final XMLValidationSchema xmlValidationSchema = xmlValidationSchemaFactory.createSchema(schemaInputStream);
		
		//load (invalid) XML file
		final InputStream xmlInputStream = TestXSD.class.getResourceAsStream(xmlFileName);
		final XMLInputFactory2 xmlInputFactory2 = (XMLInputFactory2)XMLInputFactory.newInstance();
		final XMLStreamReader2 xmlStreamReader =(XMLStreamReader2) xmlInputFactory2.createXMLStreamReader(xmlInputStream);
		try{
			//validate the XML file
			xmlStreamReader.validateAgainst(xmlValidationSchema);
			//traverse the streaming document
			while(xmlStreamReader.hasNext()){
				xmlStreamReader.next();
			}
		} catch(final XMLValidationException e){
			//catch validation exception
			System.err.println("XML file: " + xmlFileName + " failed to validate against: " + xsdFileName);
			return ;
		}
		System.out.println("XML file: " + xmlFileName + " successfully validated against: " + xsdFileName);
	}

}

produces the unexpected output:

XML file: idc2.xml successfully validated against: idc2.xsd

Note:

Xerces-J 2.11 and Eclipse validate this file correctly.

I suspect that the class GenericMsvValidator does not perform any Unique, Key, or KeyRef validation. Only ID and IDREF seem to be supported (I am not sure whether these correspond to Key and KeyRef respectively). I have not found out whether this is intended or a bug. I did discover that the field mVGM.grammer.topLevel.element.identityConstraints contains the unique constraint.
The obvious suggestion to use Xerces-J instead does not solve my problem either because, per https://issues.apache.org/jira/browse/XERCESJ-1276, it performs unique constraint validation in O(n^2) instead of the expected O(n log(n)).
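
For cross-checking a schema and instance outside Woodstox, the standard `javax.xml.validation` API can be used. The sketch below inlines a trimmed copy of the schema above and leaves out the xsi:schemaLocation hint; the class name `UniqueCheck` is hypothetical, and whether identity constraints are enforced depends on the JAXP implementation in use (the JDK default is assumed here).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class UniqueCheck {

    static final String XSD =
            "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
            + " xmlns:idc='idc2.xsd' targetNamespace='idc2.xsd' elementFormDefault='qualified'>"
            + "<xsd:element name='itemList'>"
            + "<xsd:complexType><xsd:sequence>"
            + "<xsd:element name='item' maxOccurs='unbounded' type='xsd:decimal'/>"
            + "</xsd:sequence></xsd:complexType>"
            + "<xsd:unique name='itemAttr'>"
            + "<xsd:selector xpath='idc:item'/><xsd:field xpath='.'/>"
            + "</xsd:unique>"
            + "</xsd:element></xsd:schema>";

    // Returns false when the validator reports an error, such as a duplicate unique value.
    static boolean isValid(String xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String duplicated = "<itemList xmlns='idc2.xsd'>"
                + "<item>1</item><item>1</item><item>2</item></itemList>";
        System.out.println("valid: " + isValid(duplicated));
    }
}
```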

Security problem when using Woodstox as a drop-in replacement for JDK parsers

JEP 185: Restrict Fetching of External XML Resources introduced system properties for securing applications against security threats such as XML External Entities.

When e.g. the system property -Djavax.xml.accessExternalDTD= is set to the empty list, the JDK parsers throw an exception if the parsed document contains a reference to an external DTD.

When Woodstox is added to the application's class path it replaces the default parsers. But then the system property seems no longer to have any effect, weakening the security of the application.

Since security is generally a major concern Woodstox should honour the properties introduced by JEP 185.
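
Independently of the system properties, an application can set the standard StAX factory properties explicitly, so that its configuration does not depend on which implementation is found on the class path. A minimal sketch, assuming the application can do without DTD processing altogether; the class name `HardenedFactory` is hypothetical, and how strictly each implementation applies these settings should be verified against its own documentation.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class HardenedFactory {

    // Creates a factory with DTD support and external entity resolution switched off.
    static XMLInputFactory create() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory;
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLStreamReader reader =
                create().createXMLStreamReader(new StringReader("<root>ok</root>"));
        reader.nextTag();
        System.out.println(reader.getElementText());
        reader.close();
    }
}
```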

`XMLStreamReader.getAttributeValue(null, localName)` does not ignore namespace URI

I use woodstox-5.1.0 and jackson-2.7.7.

Following quick xml-example shows that the function getAttributeValue(..) does not work as expected for resolving an attribute of an xml-element by its localName.

The xml used here is an example from: https://www.w3schools.com/xml/schema_schema.asp

        final String test =
                        "<note xmlns=\"https://www.w3schools.com\"\n" +
                        "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n" +
                        "xsi:schemaLocation=\"https://www.w3schools.com note.xsd\">\n" +
                        "\n" +
                        "<to>Tove</to>\n" +
                        "<from>Jani</from>\n" +
                        "<heading>Reminder</heading>\n" +
                        "<body>Don't forget me this weekend!</body>\n" +
                        "</note> ";

        final InputStream is = new ByteArrayInputStream(test.getBytes(StandardCharsets.UTF_8));

        final XMLInputFactory staxInputFactory = XMLInputFactory.newInstance();
        final XMLStreamReader staxReader = staxInputFactory.createXMLStreamReader(is);

        staxReader.nextTag();

        // returns schemaLocation (correct!)
        final String attributeLocalName = staxReader.getAttributeLocalName(0);
        System.out.println(attributeLocalName);

        // returns value for schemaLocation (correct!)
        final String attributeValue = staxReader.getAttributeValue(0);
        System.out.println(attributeValue);

        // returns NULL (unexpected!)
        final String schemaLocation = staxReader.getAttributeValue(null, "schemaLocation");
        System.out.println(schemaLocation);
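
A lookup that does not depend on how the null namespace argument is interpreted can be written with the index-based accessors, which the example above shows working. This is a sketch using only the standard `XMLStreamReader` API; the class and method names are hypothetical, and it returns the first attribute whose local name matches, whatever its namespace.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class AttributeLookup {

    // Returns the value of the first attribute with the given local name, or null if none matches.
    static String valueByLocalName(XMLStreamReader reader, String localName) {
        for (int i = 0, n = reader.getAttributeCount(); i < n; i++) {
            if (localName.equals(reader.getAttributeLocalName(i))) {
                return reader.getAttributeValue(i);
            }
        }
        return null;
    }

    // Parses a shortened version of the note document and looks up schemaLocation.
    static String demo() throws XMLStreamException {
        String xml = "<note xmlns=\"https://www.w3schools.com\"\n"
                + "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n"
                + "xsi:schemaLocation=\"https://www.w3schools.com note.xsd\">\n"
                + "<to>Tove</to>\n"
                + "</note>";
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // positions the reader on <note>
            return valueByLocalName(reader, "schemaLocation");
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(demo());
    }
}
```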

Allow recovery of buffered but not yet processed characters/bytes

When processing a stream that contains mixed content, like MTOM/XOP messages, it is easy enough to detect the end of an XML document and to return control to code that deals with the non-XML content, but it is impossible to recover the bytes/characters buffered by the XML reader so they are available to the non-XML parsing code.

Tatu suggested mimicking the API of Jackson. Personally, I would find variants accepting a char[] or byte[] more convenient.

// Needed to create arrays of correct size
public int bufferedCharsAvailable();
public int bufferedBytesAvailable(Charset cs);
// get unprocessed chars/bytes
public void releaseBuffered(byte[] buffer, Charset cs);
public void releaseBuffered(char[] buffer);

The byte-Array methods could default to the character set supplied when creating the XML reader.

However, the API as it exists for Jackson would also be fine with me. Anything, really, that allows me to recover the buffer.

XMLStreamReader reports "no namespace" (aka default namespace) for attributes as "", not `null`

I believe the XMLStreamReader incorrectly handles default namespaced attributes. The relevant documentation can be found at:

https://www.w3.org/TR/2006/REC-xml-names11-20060816/#defaulting

Default namespace declarations do not apply directly to attribute names; the interpretation of unprefixed attributes is determined by the element on which they appear.

and

https://www.w3.org/TR/2006/REC-xml-names11-20060816/#uniqAttrs

... the second because the default namespace does not apply to attribute names:

In my example code I show that Woodstox produces the default namespace (an empty string) when staxReader.getAttributeNamespace(0) is called, while the com.sun.xml.internal.stream.XMLInputFactoryImpl implementation returns null.

I believe the sun implementation is correct.

example code attached:
ExampleCode.zip

com.ctc.wstx.evt.CompactStartElement#getAttributes() inverts "specified" flag on attributes

CompactStartElement.getAttributes() returns a list of attributes where the isSpecified() flag is false for attributes which were specified in the source XML. This is backwards.

The attached sample WS.java.txt produces this output:

com.ctc.wstx.stax.WstxInputFactory
com.ctc.wstx.evt.WstxEventReader
com.ctc.wstx.evt.CompactStartElement org.codehaus.stax2.ri.evt.AttributeEventImpl c d false

The "false" in the last line is the isSpecified() return value for the sample XML, and it ought to be true.

There are actually a couple of problems with handling of the "specified" flag for attributes. The relevant AttributeEventImpl ctor looks like this, implying the last argument should be true for specified attributes:

    public AttributeEventImpl(Location loc, String localName, String uri, String prefix,
                              String value, boolean wasSpecified)

However, CompactStartElement#constructAttr() has this:

    public Attribute constructAttr(String[] raw, int rawIndex, boolean isDef)
    {
        return new AttributeEventImpl(mLocation, raw[rawIndex], raw[rawIndex+1],
                                      raw[rawIndex+2], raw[rawIndex+3], isDef);
    }

The argument name "isDef" suggests the flag should be true if the attribute is defaulted, but it is passed unchanged to the ctor, which expects a value that is true if the attribute is not defaulted. Then, in #getAttributes() we have this:
l.add(constructAttr(rawAttrs, i, (i >= defOffset)));
so it's passing a flag which is true if the attribute is default. However, in #getAttributeByName() we have this:
return constructAttr(mRawAttrs, ix, !mAttrs.isDefault(ix));
So it's passing a flag which is false if the attribute is default.

We originally found this with Woodstox 4.2.1. I can also reproduce it with 5.0.3.

Maximum XML name limit not applied to namespace URIs (JAXP, 8148872)

CVE-2016-3500 OpenJDK: maximum XML name limit not applied to namespace URIs (JAXP, 8148872)

I have found the issue in woodstox-core-asl-4.0.8.jar. I reproduced it with a huge namespace URL:
java.lang.OutOfMemoryError: Java heap space
at com.ctc.wstx.util.TextBuilder.resize(TextBuilder.java:179)
at com.ctc.wstx.util.TextBuilder.bufferFull(TextBuilder.java:151)
at com.ctc.wstx.sr.BasicStreamReader.parseNormalizedAttrValue(BasicStreamReader.java:1912)

Please let me know whether this issue is fixed in 5.0.3 release.

Content truncated when calling XMLStreamReader#getText() after CDATA event

Although the fix for #21 works for the provided case, some slight modification to that test case causes it to fail. For instance, the following test fails:

  public void testLongerCData3() throws Exception {
    String SRC_TEXT =
        "123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\r\n"
            + "<embededElement>Woodstox 4.0.5 does not like this embedded element.  However, if you take\r\n"
            + "out one or more characters from the really long line (so that less than 500 characters come between\r\n"
            + "'CDATA[' and the opening of the embeddedElement tag (including LF), then Woodstox will instead\r\n"
            + "complain that the CDATA section wasn't ended.";
    String DST_TEXT = SRC_TEXT.replace("\r\n", "\n");
    String XML = "<?xml version='1.0' encoding='utf-8'?>\r\n"
        + "<test><![CDATA[" + SRC_TEXT + "]]></test>";
    XMLInputFactory f = getInputFactory();
    // important: don't force coalescing, that'll convert CDATA to CHARACTERS
    f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);

    XMLStreamReader sr = f.createXMLStreamReader(new StringReader(XML));
    assertTokenType(START_DOCUMENT, sr.getEventType());
    assertTokenType(START_ELEMENT, sr.next());
    assertEquals("test", sr.getLocalName());
    assertTokenType(CDATA, sr.next());
    // This should still work, although with linefeed replacements
    final String text = sr.getText();
    if (text.length() != DST_TEXT.length()) {
      fail("Length expected as " + DST_TEXT.length() + ", was " + text.length());
    }
    if (!text.equals(DST_TEXT)) {
      fail("Length as expected (" + DST_TEXT.length() + "), contents differ:\n" + text);
    }
    assertTokenType(END_ELEMENT, sr.next());
    sr.close();
  }

The only difference with testLongerCData2 is that this one does not contain a line break at the beginning of the numbers sequence. Running this test results in a failure with message junit.framework.AssertionFailedError: Length expected as 829, was 73.

Just in case, another similar test case with no line breaks at all should be included.

invisible CHARACTER data

I may be misunderstanding the XML spec, but I am unable to see the wobble in the following snippet

<?xml version="1.0" encoding="UTF-8"?>
       <root>
       <foo>wibble</foo>
       <foo>
         <bar>
          fish
         </bar>
           wobble
       </foo>
       <foo>fish</foo>
       </root>

with Woodstox or Aalto.

If I put the wobble before any other tags, it is seen. Text is also ignored if it's between two tags, e.g.

<?xml version="1.0" encoding="UTF-8"?>
       <root>
       <foo>wibble</foo>
       <foo>
         wobble1
         <bar>
          fish
         </bar>
           wobble
         <bar2>
          fish
         </bar2>
       </foo>
       <foo>fish</foo>
       </root>

I thought XML was supposed to support "mixed content".
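For comparison, here is a minimal sketch of how mixed content can be walked with the plain Stax cursor API, using whatever implementation XMLInputFactory.newInstance() picks up. The class and method names (MixedContentSketch, collectText) are made up for illustration and are not part of Woodstox or Aalto. The idea is that text following a child element arrives as its own CHARACTERS event, so a loop that visits every event should see it; code that reads only the first text event of an element would not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class MixedContentSketch {

    /** Returns every non-blank text chunk the reader reports, trimmed, in document order. */
    static List<String> collectText(String xml) {
        List<String> chunks = new ArrayList<>();
        try {
            XMLInputFactory f = XMLInputFactory.newInstance();
            // Coalescing merges adjacent text segments, so each run of text is one event.
            f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader sr = f.createXMLStreamReader(new StringReader(xml));
            try {
                while (sr.hasNext()) {
                    int type = sr.next();
                    if (type == XMLStreamConstants.CHARACTERS || type == XMLStreamConstants.CDATA) {
                        String text = sr.getText().trim();
                        if (!text.isEmpty()) {
                            chunks.add(text); // skip indentation-only segments
                        }
                    }
                }
            } finally {
                sr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<root>\n"
                + "  <foo>wibble</foo>\n"
                + "  <foo>\n"
                + "    <bar>\n      fish\n    </bar>\n"
                + "      wobble\n"
                + "  </foo>\n"
                + "  <foo>fish</foo>\n"
                + "</root>";
        // Prints the list of text chunks, including the one after </bar>.
        System.out.println(collectText(xml));
    }
}
```

If a loop like this reports the text but the application does not, the text is probably being dropped in the layer above the parser rather than by the parser itself.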

Stax 4.0.0 is not compatible with Woodstox 5.0.1

While upgrading to the latest versions of Stax and Woodstox we encountered a NoSuchMethodError. The stack trace is as follows.

Caused by: java.lang.NoSuchMethodError: org.codehaus.stax2.ri.EmptyIterator.getInstance()Lorg/codehaus/stax2/ri/EmptyIterator;
at com.ctc.wstx.util.DataUtil.emptyIterator(DataUtil.java:74)

Moving back to Stax version 3.1.4 fixed this issue for us, but we would like to see the error fixed. Is it already a known issue?

BasicStreamReader.getElementText() behavior doesn't match Java documentation

The documentation for XMLStreamReader suggests that getElementText() returns all valid text between a START_ELEMENT and its END_ELEMENT.

With Woodstox 5.0.1 and 5.0.2, this doesn't happen for an element that contains text followed by CDATA.
Example:
"<tag>foo<![CDATA[bar]]></tag>", calling getElementText() while at tag will return "foo", while the example implementation from the doc comments suggests it should be "foobar".

I linked the documentation from Java 8, but it's the same blurb given for previous versions as well.
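To make the expectation concrete, here is a rough sketch of a text-gathering loop in the spirit of the sample implementation from the getElementText() doc comments, written against the standard XMLStreamReader interface. It is not Woodstox's code; the names ElementTextSketch, collectElementText and firstElementText are invented for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class ElementTextSketch {

    /** Gathers all text between the current START_ELEMENT and its matching END_ELEMENT. */
    static String collectElementText(XMLStreamReader sr) throws XMLStreamException {
        if (sr.getEventType() != XMLStreamConstants.START_ELEMENT) {
            throw new XMLStreamException("reader must be positioned on START_ELEMENT", sr.getLocation());
        }
        StringBuilder content = new StringBuilder();
        for (int type = sr.next(); type != XMLStreamConstants.END_ELEMENT; type = sr.next()) {
            switch (type) {
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                case XMLStreamConstants.ENTITY_REFERENCE:
                    // Every kind of textual event contributes to the result.
                    content.append(sr.getText());
                    break;
                case XMLStreamConstants.PROCESSING_INSTRUCTION:
                case XMLStreamConstants.COMMENT:
                    break; // skipped silently
                default:
                    // Child elements or end of document are errors for a text-only element.
                    throw new XMLStreamException(
                            "unexpected event type " + type + " inside text-only element", sr.getLocation());
            }
        }
        return content.toString();
    }

    /** Convenience wrapper: text content of the first element found in the given XML string. */
    static String firstElementText(String xml) {
        try {
            XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (sr.next() != XMLStreamConstants.START_ELEMENT) {
                    // advance past anything that precedes the root element
                }
                return collectElementText(sr);
            } finally {
                sr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstElementText("<tag>foo<![CDATA[bar]]></tag>"));
    }
}
```

With a loop like this the CDATA segment is appended after the leading text, which is the "foobar" result the issue argues getElementText() should produce.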
