razrfalcon / roxmltree


Represent an XML document as a read-only tree.

License: Apache License 2.0

Rust 97.81% Python 2.19%

xml

roxmltree's People

cormacrelf kardeiz sahwar loomaclin timdiekmann tomjw64 cfsamson mxmerz jfrimmel isgasho jdrouet ignatenkobrain rastertail jredrado icodein curiousleo jwindhaber taiki-e baskerville acidburn0zzz philippeitis cole-miller tvorog code-terror drpoppyseed s3bk ker0olos tngo04 mayhemheroes kornelski benthillerkus kavika13 sarvex filestar iq-scm whynothugo nneesshh wiezzel ethan1225 wenyuzhao r0b0tdev

roxmltree's Issues

Dealing with malformed documents

I have some existing data that I want to parse that is malformed. It is part of an existing shipped game, and mods have been based off this data (including copy/pasting and editing the files), so doing the mass-updates to this data would be kinda painful, from a community-support standpoint.

Specifically, it has:

<?xml version="2.0" ?> (non-existent XML version number)
Multiple root nodes

There are xml parsers (at least in other languages) that are capable of being bent to parse these. The game (written in C++) is using one such parser.
I'm wondering if it would be good to add non-default options to support these types of errors?

Is there already an existing way to continue trying to parse even when an error like this is encountered?

If not, do you know if there's other Rust XML parsers that might be better at this sort of thing?
Would it make sense to use RazrFalcon/xmlparser directly, and would it support malformed documents like this?
I picked this particular parser due to the focus on performance (and secondarily, correctness). But I could consider switching if there's a better option for my use case.

If adding some optional flexibility things like this would be good, I might consider submitting some PRs for these workarounds in specific. Let me know if that would be welcome or would expedite this request.
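
Until such options exist, one possible workaround is to normalize the text before handing it to a strict parser. The sketch below is only an illustration under stated assumptions: `wrap_fragments` and the `wrapper` element name are hypothetical, it assumes the declaration (if present) is the first construct and is followed by a space, and it does nothing about DOCTYPEs or other malformations.

```rust
/// Strip a leading XML declaration (if any) and wrap the remaining
/// content in a single synthetic root element.
/// Hypothetical helper, not part of roxmltree.
fn wrap_fragments(input: &str, root: &str) -> String {
    let trimmed = input.trim_start();
    // Drop `<?xml ... ?>` only when it is the very first construct.
    // The trailing space avoids matching targets like `<?xml-stylesheet`.
    let body = if trimmed.starts_with("<?xml ") {
        match trimmed.find("?>") {
            Some(end) => &trimmed[end + 2..],
            None => trimmed,
        }
    } else {
        trimmed
    };
    format!("<{root}>{body}</{root}>", root = root, body = body.trim())
}

fn main() {
    let raw = "<?xml version=\"2.0\" ?>\n<a/>\n<b>text</b>\n";
    let fixed = wrap_fragments(raw, "wrapper");
    // `fixed` now has one root and no declaration, so it could be
    // passed to a strict parser.
    println!("{}", fixed);
}
```

The original top-level nodes then show up as children of the synthetic root, and text positions are shifted by the length of the removed declaration and the added opening tag.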

Stack overflow with very small file

Context

Calling parse on a very small file (314 lines) that contains a DTD is enough to overflow the stack, for example:

let contents = fs::read_to_string("very_small_file.xml").unwrap();
let opt = roxmltree::ParsingOptions { allow_dtd: true };
Document::parse_with_options(&contents, opt);

This only happens in tests because of the reduced stack size on new threads.

I'm not an expert on XML and I haven't found a proper way to debug stack overflows in Rust.

The problem goes away, even with bigger files, if they don't have DTD (not really an option for my use case).

Environment

MacBook Pro M1

Temporary solutions (as far as I know...)

Removing DTD from xml (not ideal)
Setting the stack size of child threads (most flexible one):

let child = thread::Builder::new().stack_size(3 * 1024 * 1024).spawn(move || {
   let opt = roxmltree::ParsingOptions { allow_dtd: true };
   let doc = Document::parse_with_options(&contents, opt);
}).unwrap().join();

Note: with 1 MiB less, that is a 2 MiB stack, the stack overflows.

Setting environment variable RUST_MIN_STACK=8388608 before calling tests. Cannot be done inside tests because they start new threads.
Restricting the tests to run in single threaded environment with cargo test -- --test-threads=1
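
The stack-size workaround can be wrapped in a small reusable helper. This is a minimal sketch using only the standard library: `run_with_stack` and `depth_sum` are hypothetical names, and `depth_sum` is just a stand-in for a deeply recursive parse, not roxmltree code.

```rust
use std::thread;

/// Run `work` on a dedicated thread with `stack_bytes` of stack and
/// return its result. Ordinary panics in `work` are re-raised in the
/// caller; a real stack overflow still aborts the process.
fn run_with_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn thread");
    match handle.join() {
        Ok(value) => value,
        Err(payload) => std::panic::resume_unwind(payload),
    }
}

/// Stand-in for a deeply recursive computation.
fn depth_sum(n: u64) -> u64 {
    if n == 0 { 0 } else { n + depth_sum(n - 1) }
}

fn main() {
    // 8 MiB, the same value as the RUST_MIN_STACK suggestion above.
    let total = run_with_stack(8 * 1024 * 1024, || depth_sum(10_000));
    println!("{}", total);
}
```

In a test, the closure would read the file and call the parser instead of `depth_sum`. Because the returned value must be `'static`, anything borrowed from the document has to be converted to owned data inside the closure.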

Ideal solution

parse_with_options should use less stack.

Even if this is more noticeable in tests, depending on the OS and implementation code this can also be an issue in production.

Test file

listxml_0244.xml.zip

PS: let me know if I can provide any further information 🙏

Most APIs rather return a `&'a str` than a `Cow<'input, str>`

I get that this is a bit more ergonomic, but it also means that users cannot do zero-copy stuff—or need to keep the Document around.
I think a solution could be to just make the fields that have Cows public and annotate that this is not a stable API.

HTML parsing

I am parsing invalid HTML, so I get UnexpectedCloseError very often. Currently there is no way to deal with it: the library just stops and that's it. I would, however, like to ignore these errors and continue parsing. Is that possible?

[Feature] Add a way to validate XML with custom "rules"

I am making data-driven scripting for my game with XML, and it would be cool if I could specify what certain documents should look like. Like, for instance, restrict document to only having specified tags, or that a tag can only have specified attributes (some of which can be optional).
Edit: Maybe add an option to ParsingOptions for that, possibly?

Possible wrong lifetime used for returned text slices.

Many functions return text slices with a 'a lifetime instead of the 'd lifetime.

E.g. in Node there is fn tag_name(&self) -> ExpandedName<'a> but it probably should be
fn tag_name(&self) -> ExpandedName<'d>.

(The ExpandedName documentation also uses 'd to imply it should be using the document
lifetime if I'm not wrong).

There are a number of other functions which are also affected, but I haven't looked into the crate enough to know whether some of them actually need to use 'a.

Possible candidates are:

default_namespace
resolve_tag_name_prefix
lookup_prefix
lookup_namespace_uri
...

There are also some functions which accept a &'a str (or similar) as input,
but should accept any kind of lifetime as they are only used to compare
with the tree (like e.g. lookup_namespace_uri).

Note that none of the fixes would be breaking changes as they only can
prolong the "duration" of the returned lifetime (as far as you can speak
of duration).

Lastly I think:

fn has_attribute<N>(&self, name: N) -> bool where
    N: Into<ExpandedName<'a>>,

should be more like

fn has_attribute<N>(&self, name: N) -> bool where
    N: PartialEq<ExpandedName<'a>>,

which would be a breaking change but far more flexible:
for one, it would work for any lifetime of the passed-in ExpandedName,
even if shorter than 'a; additionally, it works better with custom
types, again independent of the lifetime, as long as eq is
implemented with a wildcard 'a.

Namespaced Tags

This is a stripped down version of a file which seems like it should work (does with quick-xml in any case which is what I'm currently using):

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>

<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet>
	<ss:Table>
		<Row>
			<Cell>
				<Data>Hello</Data>
			</Cell>
		</Row>
	</ss:Table>
</Worksheet>
</Workbook>

To see the failure in action:

> cargo run --example ast -- sample.xls
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/examples/ast sample.xls`
Error: expected 'Table' tag, not 'ss:Table' at 12:2.

[Feature Request] Parallelized descendants iterator

Would it be possible, or rather, would it be beneficial, to add a par_descendants() method that returns a parallel iterator?

Add Node::children_elements

Incorrect entity resolving

An XML like this:

<!DOCTYPE test [
    <!ENTITY e "<p>&#x3c;</p>">
]>
<e>&e;</e>

should lead to an error, but it does not.

resolve_href takes too long when parsing complex SVG

Parsing an SVG glyph from the NotoColorEmoji font blocks for more than 10 minutes and still does not complete.
It is an O(n^2) algorithm, which will probably become a problem when dealing with many nodes and many href nodes.
Judging from the comment, this was a deliberate choice to keep crate dependencies low; maybe a more efficient implementation could be offered as a feature?

[Feature request] Adding some selector (CSS or XPath) behind a feature gate

First of all, thanks for the crate!!

After using roxmltree for a while I was missing a handy way to traverse the tree. Then I saw that there is a "companion" crate, simplecss, that parses CSS selectors and exposes a corresponding traversal trait, Element, that structs representing a node of an XML tree should implement.

There is already the code in the examples of simplecss.

Request

Would you consider adding the implementation of Element for roxmltree::Node (behind a feature gate called simplecss)?

Notes

As stated in its README, "[simplecss] is not a browser-grade CSS parser. If you need one, use cssparser + selectors."
If you agree with my request now and if, at some point in the future, I would implement selectors::Element on roxmltree::Node, would you consider adding it in a similar fashion?

parsing XML characters

Hi!

I'm trying to parse the following:

<span>
&copy; Croft's Accountants Inc., All Rights Reserved.
</span>

And I end up with the following error: UnknownEntityReference("copy", TextPos {...})

Are XML and HTML entities not handled? Can't I have them as a String?
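
For context: XML itself predefines only `&lt;`, `&gt;`, `&amp;`, `&apos;` and `&quot;`; names such as `&copy;` come from HTML and are unknown to an XML parser unless a DTD declares them. One possible workaround, sketched here with only the standard library, is to rewrite the HTML names as numeric character references before parsing. The `HTML_ENTITIES` table and `replace_html_entities` are hypothetical and deliberately incomplete.

```rust
/// Small, deliberately incomplete table of HTML named entities and the
/// numeric character references that plain XML understands.
const HTML_ENTITIES: &[(&str, &str)] = &[
    ("&copy;", "&#169;"),
    ("&nbsp;", "&#160;"),
    ("&prime;", "&#8242;"),
];

/// Replace every known HTML entity name with its numeric reference.
/// Note: this is a plain text substitution, so it also rewrites
/// matches inside CDATA sections and comments.
fn replace_html_entities(input: &str) -> String {
    let mut out = input.to_owned();
    for (name, numeric) in HTML_ENTITIES {
        out = out.replace(name, numeric);
    }
    out
}

fn main() {
    let raw = "<span>\n&copy; Croft's Accountants Inc., All Rights Reserved.\n</span>";
    let fixed = replace_html_entities(raw);
    // `fixed` contains `&#169;`, which an XML parser can resolve.
    println!("{}", fixed);
}
```

The returned `String` has to stay alive for as long as the parsed document is used, since the document borrows from its input.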

roxmltree accepts opening tag without closing tag

If I try this:

Document::parse("<open>Text").unwrap();

roxmltree accepts this without complaint. Is this an explicit feature or should it check that there are unclosed opening tags at the end of parsing and reject the input as invalid XML if so?

End position of elements

I noticed that there is a starting position of an element, but no end position available.

I am looking for a way to find position in a case like this:

<root>
  <foo>
    <bar/>
    <bar/>
  </foo>
</root>

Taking the element /root/foo, I would want to have the < of the opening <foo> and then the > of the closing </foo>.

I noticed #10 and the commit mentioning the addition of range, but can't seem to find it in the current API.

.descendants() iterator not returning full descendants for nested nodes

I'm having an issue iterating over descendants of deeply nested nodes. Calling .descendants() on a node at the top of the document is working fine, but when I query a node somewhere deeper in the doc and then try to call .descendants() on it, I am only getting partial results.

The data I'm using is not open source unfortunately, so I can't share here. If you have test data I would be glad to provide a working example.

Not getting children of descendants

Hey,

I have the following xml

<manifest xmlns:android="http://schemas.android.com/apk/res/android" android:versionCode="57" android:versionName="1.6.11" android:allowBackup="false" android:compileSdkVersion="29" android:compileSdkVersionCodename="10" package="com.myapp" platformBuildVersionCode="29" platformBuildVersionName="10">
    <uses-sdk android:minSdkVersion="19" android:targetSdkVersion="29"/>
    <application android:theme="@style/AppTheme" android:label="@string/app_name" android:icon="@mipmap/ic_launcher" android:name="com.myapp.activity" android:debuggable="false" android:screenOrientation="portrait" android:allowBackup="true" android:supportsRtl="true" android:usesCleartextTraffic="true" android:roundIcon="@mipmap/ic_launcher_round" android:appComponentFactory="android.support.v4.app.CoreComponentFactory">
        <activity android:theme="@style/SplashTheme" android:label="@string/app_name" android:name="com.myapp.SplashActivity">
            <intent-filter>
                <action android:name="android.intent.action.MAIN"/>
                <category android:name="android.intent.category.LAUNCHER"/>
            </intent-filter>
        </activity>

Now when I do:

        for node in doc.descendants() {
            match node.tag_name().name() {
                "activity" => another_fn(node),
                _ => (), // In case the node is something different
            }
        }

Now, in another_fn I have tried has_children, children, first_child, last_child, etc., but nothing gives me intent-filter as a child of the activity.

So what I want is that when it sends activity node to another_fn it should also give me intent-filter as the children of that node.

Is that something I should be able to do? or am I doing anything wrong?

Also, I'm sorry for asking so many questions 😅

Rust Lifetimes: Is it possible to have roxmltree take ownership of the text string?

A minimal example

use std::path::PathBuf;
use roxmltree;

fn main() {
  let filepaths: Vec<&PathBuf> = get_filepaths();

  filepaths.iter()
    .map(std::fs::read_to_string)
    .map(|text| roxmltree::Document::parse(&text.unwrap()))
    .for_each(|tree| {
      // Do something with each XML tree
    })
}

When I try to compile:

error[E0515]: cannot return value referencing temporary value
  --> src\main.rs:32:21
   |
32 |         .map(|text| roxmltree::Document::parse(&text.unwrap()))
   |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^-------------^
   |                     |                           |
   |                     |                           temporary value created here
   |                     returns a value referencing data owned by the current function
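
The error arises because the parsed document borrows from the text, while the `String` produced by `unwrap()` is dropped at the end of the closure. A common way around it is to parse and consume the tree inside the same closure that owns the text. The sketch below shows the pattern with a stand-in type; `Parsed`, `parse` and `first_line_lengths` are hypothetical and only mimic how a document type with an input lifetime behaves.

```rust
/// Stand-in for a parsed document that borrows its input, the way a
/// document type with an `'input` lifetime does. For illustration only.
struct Parsed<'input> {
    first_line: &'input str,
}

fn parse(text: &str) -> Parsed<'_> {
    Parsed { first_line: text.lines().next().unwrap_or("") }
}

/// Parse each text and extract owned data while the text is still alive.
fn first_line_lengths(texts: Vec<String>) -> Vec<usize> {
    texts
        .into_iter()
        .map(|text| {
            // `text` lives until the end of this closure, so the borrow
            // taken by `parse` never escapes its owner.
            let doc = parse(&text);
            doc.first_line.len()
        })
        .collect()
}

fn main() {
    let texts = vec!["<a/>\n<b/>".to_string(), "<root>".to_string()];
    println!("{:?}", first_line_lengths(texts));
}
```

Applied to the example above, the two `.map` calls and the `.for_each` would collapse into one closure that reads the file, parses it, and does the work before the text goes out of scope.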

Make it possible to get the end position of a node

Right now there is the Node::pos() method which gets you the start position of the node, but there seems to be no way to get the end position. This makes it impossible to get the original XML which corresponded to the node.

Looking at xmlparser, it seems like StrSpan does include the end position. However it is probably not as easy as just using that, since most StrSpans probably made available during parsing only cover the tag itself, not the entire element. I suspect it might be necessary to do something around the roxmltree ElementEnd processing too.

How to Read 2 GB file and Parse it

Hi Team,

lxml in Python can read a single element at a time, and the element can be removed from memory once it has been processed. This makes it easy to parse big files without exhausting memory. The method is called iterparse.

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/

Is it possible to have something similar, so that we don't need to create a huge string and parse it?

Thanks

DOCTYPE causes parsing to fail

I'm a Rust beginner so maybe I am missing something obvious but when I parse the following string I get an error:

use roxmltree::Document;
    
const XML: &str = r#"
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE szene SYSTEM "szene.dtd">
<szene>
  <triangulation src="abgabetriangulation_high.xml"/>
  <fenster breite="320" hoehe="240"/>
  <raumteilung unterteilung="2"/>
  <kamera>
    <position x="-2.46" y="7.86" z="9.51"/>
    <ziel x="0.03" y="-1.27" z="-2.78"/>
    <fovy winkel="45.0"/>
  </kamera>
  <beleuchtung> 
    <!-- Lichteigenschaften !-->
    <hintergrundfarbe b="0.5" g="0.2" r="0.2"/>
    <ambientehelligkeit b="1" g="1" r="1"/>
    <abschwaechung konstant="1" linear="0" quadratisch="1"/>
    <!-- Lichtquellen !-->
    <lichtquelle>
      <position x="8.07" y="6.48" z="-0.92"/>
      <farbe b="1.0" g="1.0" r="1.0"/>
    </lichtquelle>
    <lichtquelle>
      <position x="-3.00" y="8.35" z="10.65"/>
      <farbe b="1.0" g="1.0" r="1.0"/>
    </lichtquelle>
  </beleuchtung>
</szene>
"#;

fn main()
{
    Document::parse(XML).unwrap();
}

[...]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParserError(UnknownToken(TextPos { row: 2, col: 1 }))', src/bin/test.rs:36:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/1.59.0/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/1.59.0/library/core/src/panicking.rs:116:14
   2: core::result::unwrap_failed
             at /rustc/1.59.0/library/core/src/result.rs:1690:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/1.59.0/library/core/src/result.rs:1018:23
   4: test::main
             at ./src/bin/test.rs:36:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/1.59.0/library/core/src/ops/function.rs:227:5

However when I remove the xml declaration and the doctype it works:

use roxmltree::Document;
    
    
const XML: &str = r#"
<szene>
  <triangulation src="abgabetriangulation_high.xml"/>
  <fenster breite="320" hoehe="240"/>
  <raumteilung unterteilung="2"/>
  <kamera>
    <position x="-2.46" y="7.86" z="9.51"/>
    <ziel x="0.03" y="-1.27" z="-2.78"/>
    <fovy winkel="45.0"/>
  </kamera>
  <beleuchtung> 
    <!-- Lichteigenschaften !-->
    <hintergrundfarbe b="0.5" g="0.2" r="0.2"/>
    <ambientehelligkeit b="1" g="1" r="1"/>
    <abschwaechung konstant="1" linear="0" quadratisch="1"/>
    <!-- Lichtquellen !-->
    <lichtquelle>
      <position x="8.07" y="6.48" z="-0.92"/>
      <farbe b="1.0" g="1.0" r="1.0"/>
    </lichtquelle>
    <lichtquelle>
      <position x="-3.00" y="8.35" z="10.65"/>
      <farbe b="1.0" g="1.0" r="1.0"/>
    </lichtquelle>
  </beleuchtung>
</szene>
"#;

fn main()
{
    Document::parse(XML).unwrap();
}
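
One likely cause, offered as an assumption rather than a confirmed diagnosis: the raw string literal starts with a newline, so `<?xml` is not at the very start of the input, and XML only allows the declaration there. The reported position (row 2, col 1) is consistent with that. A minimal sketch of the check and the fix, using a shortened copy of the document and a hypothetical `normalized` helper:

```rust
const XML: &str = r#"
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE szene SYSTEM "szene.dtd">
<szene/>
"#;

/// Remove leading whitespace so the declaration sits at offset 0.
fn normalized(input: &str) -> &str {
    input.trim_start()
}

fn main() {
    // The literal begins with a line break, not with `<?xml`.
    assert!(XML.starts_with('\n'));
    // After trimming, the declaration is the first thing in the input.
    assert!(normalized(XML).starts_with("<?xml"));
    // `normalized(XML)` is what would be handed to the parser.
    println!("{}", normalized(XML).lines().next().unwrap());
}
```

Alternatively, start the literal directly with the declaration, that is `r#"<?xml ...`, so that no trimming is needed.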

.text() fails to read an element's text if it contains just a Compatibility Ideograph, or fails to read it correctly if it starts with one

<character>
<literal>欄</literal>
<literal>欄</literal>
</character>

For some reason .text() on a Node of the second tag fails to read its character, which is a compatibility ideograph. It happens to be a compatibility codepoint for 欄, so I dropped the regular 欄 in there as well; the regular 欄 doesn't cause the error. The location doesn't matter. This is a cut-down failure case I ran into while trying to get some data out of a 15 megabyte (in XML) dictionary: no problems at all until this character, which is very close to the end of it.

code:

use std::fs::File;
use std::io::Read;
use std::collections::HashMap;

extern crate roxmltree;

fn load_to_string(fname : &str) -> std::io::Result<String>
{
    let mut file = File::open(fname)?;
    let mut string = String::new();
    file.read_to_string(&mut string)?;
    return Ok(string);
}

fn main() -> Result<(), std::io::Error>
{
    let kanjidic = load_to_string("kanjidic2.xml")?;
    println!("{}", kanjidic);
    let mut mapping = HashMap::<String, i64>::new();
    match roxmltree::Document::parse(&kanjidic) {
        Ok(doc) =>
        {
            for character in doc.root().descendants().filter(|element| element.has_tag_name("character"))
            {
                for property in character.descendants().filter(|element| element.is_element())
                {
                    if property.has_tag_name("literal")
                    {
                        if let Some(text) = property.text()
                        {
                            //
                        }
                        else
                        {
                            panic!("literal at line {} position {} does not have recognizable text", property.node_pos().row, property.node_pos().col);
                        }
                    }
                }
            }
        }
        Err(e) =>
        {
            panic!("failed to parse: {:?}", e);
        }
    }
    
    Ok(())
}

[Request] Find methods

Would be nice if the library had find method to find the node with that name 😃

Consider public fields of the main structs.

I am working on a macro that parses XML from token trees, and as of now it is not possible to directly output a roxmltree Document. So, while I have my own version of the XML AST in the macro, I encode it into a string and then decode it using roxmltree. It would be really great if I could emit something like this right from the macro.

    roxmltree::Document {
        text: "",
        nodes: vec![...],
        attrs: vec![...],
        namespaces: vec![...]
    };

getting an XML representation of a node

Thanks for this library! I wrote lxml a long time ago, though the API used there is taken from ElementTree.

I'd like to be able to display the XML representation of a node in the tree. I realize I can't change the XML but this is still useful for debugging and in tests.

I can get quite far with node.position(), combined with doc.input_text(). That gets me the start position. But I don't know how to get the end position of the node. If there is a next sibling I can use the start position of that, but not all nodes have a next sibling. The position of the next descendant won't work, as the span potentially includes the end tag of an outer node: for instance, if I have <container><a><b/></a><x/></container> and I want to see <b/>, then the next descendant is <x/>, so the output would be <b/></a>.

Would it be possible to maintain the end position of a node as well?

Multiple entities in the same text node are misinterpreted as entity loop/nesting

Because the entity loops are detected inaccurately, the example below panics, even though there's no loop or no nesting of the entity definitions. Modification of the depth limit would be a nice workaround but that's not possible right now.

fn main() {
    let sample = r#"<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent [<!ENTITY plus "+">]>
<PATDOC>
&plus;&plus;&plus;&plus;&plus;&plus;&plus;&plus;&plus;&plus;&plus;&plus;
</PATDOC>
"#;
    roxmltree::Document::parse(sample).unwrap();
}

the trait `std::error::Error` is not implemented for `roxmltree::Error`

I'm not sure if this was intentional, but trying to upgrade from 0.13 -> 0.14 results in this error:

error[E0277]: the trait bound `roxmltree::Error: std::error::Error` is not satisfied
   --> src/main.rs:318:48
    |
318 |     let doc = roxmltree::Document::parse(&text)?;
    |                                                ^ the trait `std::error::Error` is not implemented for `roxmltree::Error`
    |
    = note: required because of the requirements on the impl of `From<roxmltree::Error>` for `anyhow::Error`
    = note: required by `from`

Reduce peak memory usage

Measure and try to reduce the peak memory overhead over the input file. Basically find out how much memory is used by roxmltree metadata itself.

Make token ranges optional
Try removing ExpandedNameOwned::prefix
Remove 4GB input limit
~~Try using custom ShortRange for attributes and namespaces indexes. Like 20.12 packed in u32~~
Test peak memory usage

Serde support

This is more of a request for comments than an issue: I consume an XML-based API where neither quick-xml's Serde support nor the strong-xml or hard-xml wrappers around xmlparser can currently handle the format (OGC CSW records). serde-xml-rs does correctly deserialize the format, though.

While I understand that one would normally base Serde support directly on xmlparser instead of roxmltree, in my case, my data model is already Serde-oriented and going via the intermediate roxmltree is still ten times (90%) faster than using serde-xml-rs and it seems significantly simpler to implement than an integration with xmlparser. (So basically, I just use Serde to avoid manually unpacking the document structure.)

Hence, I created a small crate available at https://github.com/adamreichold/serde-roxmltree and wanted to ask you a few questions before publishing anything, namely:

Whether you are alright with the name directly referencing roxmltree or whether I should try to be more creative?
Whether you consider this useful and small enough to include here as an optional feature? (It works equally well as a separate library, so I think this would mainly help discoverability.)
Whether you have any hints on things that should be implemented differently?

UnknownEntityReference for HTML entities

Great work! Thank you for building this.

I am trying to parse XML that contains HTML entities like &prime;, but I am getting:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: UnknownEntityReference("prime", TextPos { row: 450, col: 3724 })', src/libcore/result.rs:997:5

Is there any way in particular you'd recommend I handle this?

Less than useful Debug output for Element

From roxmltree 0.15 to 0.16, the Debug output for Element changed from including the element name, attributes and namespaces to something similar but with an added virtual doc attribute including debug info for the entire document. This changes diagnostic output in my program from a line of easy-to-digest text to thousands of not very comprehensible lines (my input XML in one example is 750 lines, which translates to 4559 lines (235 kB) of error message).

Is it possible to change this behaviour from code using roxmltree? Would you accept a patch for skipping the doc virtual attribute in Debug formatting?

Possibly incorrect "children()" behaviour

Hi!

I just ran a unit-test while parsing some XML file and I noticed something weird which I am seeing as possibly incorrect behaviour,
but that is where I wish for you to tell me whether that is the case or not please.

The XML snippet in question:

<scene output_file="myImage.png">
	<background_color r="255" g="0" b="0"/>
	<camera>
		<position x="1.0" y="-2.0E-10" z="-3"/>
		<lookat x="1" y="2" z="3"/>
		<up x="1" y="2" z="3"/>
		<horizontal_fov angle="90"/>
		<resolution horizontal="1920" vertical="1080"/>
		<max_bounces n="100"/>
	</camera>
... (other elements omitted)
</scene>

code snippet in question:

assert!(
            camera_elem.children().count() == 6,
            // camera_elem.children().filter(|node| node.is_element()).count() == 6,
            "There WASN'T 6 child elements of the `camera` element! Instead, found the following children:
            {:?}", camera_elem.children().map(|child_elem| child_elem.tag_name().name()).collect::<Vec<&str>>()
        );

Failed test-assertion response of mine:

There WASN'T 6 child elements of the `camera` element! Instead, found the following children:
            ["", "position", "", "lookat", "", "up", "", "horizontal_fov", "", "resolution", "", "max_bounces", ""]

Since there is no NodeType::Attribute (which of course could be that this isn't even meant to exist according to the XML specification, but I don't know that at this point),
I was a bit confused, what would those empty string-slices refer to exactly?
And since the camera's children are all EmptyElemTag (as specified here),
there should not be Text-nodes, right?

Cheers!

Get Entity from text

I would like to get the Entity code from a text node. I am parsing an xml document that describes a dictionary. It uses Entities for all of the parts of speech. I would like to get both the full text ("adjective") and the code ("adj"). Is this possible to add as an api to node or is there another way to get this information?

Perform attributes normalization on access

Attributes normalization requires an allocation, which is both expensive to CPU and RAM. And it doesn't guarantee that such attribute will be accessed at all. Or accessed multiple times.
Therefore we could try performing normalization only on access. This would improve parsing performance and reduce memory usage by sacrificing attribute value access performance.

Also, it means we would have to return Cow<'input, str> instead of &'input str. Which is meh...

And, a caller might not want to perform normalization to begin with. So this way we would be able to provide a raw access to an attribute value.

Huge memory hog

Hello,

I'm trying to use the library for parsing medium size xml, but some of them are gigantic, like this one

https://vdp.cuzk.cz/vymenny_format/soucasna/20200331_OB_554782_UZSZ.xml.zip

The unzipped XML file (20200331_OB_554782_UZSZ.xml) is about 982 MB.

Now, when you try to run e.g. the stats example, it will eat your whole memory, then swap, and it only barely finishes on my MacBook Pro 2018 with 16 GB of RAM.

Attribute value position/range no longer available since v0.16

Minimal example for versions 0.15 and earlier:

struct Foo<'input> {
    t2: &'input str,
}

// works with roxmltree v0.15
fn get_foo<'input>(xml_str: &'input str) -> Foo<'input> {
    let xml = roxmltree::Document::parse(xml_str).unwrap();
    let root_node = xml.root_element();
    for attr in root_node.attributes() {
        if attr.name() == "t2" {
            return Foo { t2: &xml_str[attr.value_range()]}
        }
    }
    unreachable!()
}

Attribute::value_range removal is the issue here.

This relates to #88. But the proposed solution requires using StringStorage, if I'm not mistaken, which is not ideal.

My use-case for example is a deriving/deserialisation framework. All users will have to use StringStorage fields if they want zero-copy (unless a layer of indirection like rkyv::Archive is added, which for me would be an overkill).

Tree from stream

Is it possible to use roxmltree to pull the next element off a buffer, using something like a continuation? Maybe returning an Option for the Document, the remaining text and an opaque state of accumulated tokens for the next pass.

Use custom allocator for strings

Parsing special characters

Hi,
I'm pulling a folder structure from a WebDAV server and try to parse the results with roxmltree. This is the snippet I use for it:

    let doc = roxmltree::Document::parse(&body).unwrap();
    let paths: Vec<&str> = doc
        .descendants()
        .filter(|n| n.has_tag_name("href"))
        .map(|n| n.text().unwrap())
        .collect();

    for i in paths {
        println!("{:#?}", i)
    }

As you can see, I'm interested in the text of the href tags. Since these contain several special characters like spaces and such the resulting strings look something like this: World%c2%b4s%20best%20Chocolate%20Chip%20Cookies/
Does roxmltree provide anything for parsing this into the actual plain text names or do I have to take care of this myself?
Best
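
The `%c2%b4` and `%20` sequences are URL percent-encoding of the path, not something XML-specific, so decoding them is a separate step after extracting the text. A minimal standard-library sketch follows; `percent_decode` and `hex_value` are hypothetical helpers written for this illustration, and a dedicated URL crate may be preferable in real code.

```rust
/// Convert one ASCII hex digit to its numeric value.
fn hex_value(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decode `%XX` escapes and interpret the resulting bytes as UTF-8.
/// Malformed escapes are copied through unchanged.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_value(bytes[i + 1]), hex_value(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Invalid UTF-8 is replaced with U+FFFD rather than causing a panic.
    String::from_utf8_lossy(&out).into_owned()
}

fn main() {
    let href = "World%c2%b4s%20best%20Chocolate%20Chip%20Cookies/";
    println!("{}", percent_decode(href));
}
```

In the snippet above, this would be applied inside the `.map(...)`, collecting into a `Vec<String>` instead of a `Vec<&str>`, since the decoded text is newly allocated.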

Performance comparison with RapidXML

Would you care to add RapidXML to the benchmark?
It's a C++ XML parser, arguably a very fast one. Rust has a safety advantage over C++, but both languages target a similar niche - it would be awesome if apart from safety the Rust library also offered similar (or better!) performance.

http://rapidxml.sourceforge.net/
https://github.com/dwd/rapidxml - one of multiple GitHub mirrors

has_attributes is not working properly

Hey,

I have the following line in my XML file

<activity android:theme="@style/AppTheme" android:label="@string/title_activity_checkout" android:name="com.myapp.CheckoutActivity" android:configChanges="keyboard|keyboardHidden|orientation|screenLayout|uiMode|screenSize" android:windowSoftInputMode="adjustResize"/>

Now the android: part is defined as

<manifest xmlns:android="http://schemas.android.com/apk/res/android" android:versionCode="57" android:versionName="1.6.11" android:allowBackup="false" android:compileSdkVersion="29" android:compileSdkVersionCodename="10" package="com.myapp" platformBuildVersionCode="29" platformBuildVersionName="10">

Now when I take out all the tags with activity in them I am using the following code:

        for node in doc.descendants() {
            match node.tag_name().name() {
                "activity" => another_fn_here(node),
                _ => ()
            }
        }

If we print the node variable we get the following output:

Element { tag_name: activity, attributes: [Attribute { name: {http://schemas.android.com/apk/res/android}theme, value: "@style/Intercom_PanelTheme" }, Attribute { name: {http://schemas.android.com/apk/res/android}name, value: "com.myapp.activityname" }, Attribute { name: {http://schemas.android.com/apk/res/android}exported, value: "false" }, Attribute { name: {http://schemas.android.com/apk/res/android}launchMode, value: "singleTop" }, Attribute { name: {http://schemas.android.com/apk/res/android}configChanges, value: "orientation|screenSize" }, Attribute { name: {http://schemas.android.com/apk/res/android}windowSoftInputMode, value: "stateHidden" }], namespaces: [Namespace { name: Some("android"), uri: "http://schemas.android.com/apk/res/android" }] }

Now the first issue is that you'll notice the xmlns gets expanded in all the attributes so you cannot do

if node.has_attribute("android:exported"){
	//something
}

Another issue which I'm encountering is that even if I use full string even then it's not working:

static XMLNS: &str = "{http://schemas.android.com/apk/res/android}";

let exported = format!("{}{}", XMLNS, "exported");
if node.has_attribute(exported.as_str()) {
	// do something
} else {
	println!("Not Working")
}

In this case, too, "Not Working" is always printed.

I have tried to print the node and in some cases, the attribute might not be there but in few activities it was there so the condition should be true.

Please let me know if I'm using any of the functions in the wrong manner. And also I'm not sure whether it's intentional or not but I think xmlns shouldn't be expanded in all the attributes.

Thanks
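
For what it's worth, the `{uri}name` text in the printed node looks like Debug formatting rather than the stored attribute name. As far as I can tell from the crate documentation, the attribute lookups take an expanded name, and a (namespace URI, local name) tuple converts into one, so `node.has_attribute(("http://schemas.android.com/apk/res/android", "exported"))` is probably the call that is wanted. The sketch below is a std-only model of that matching rule, not roxmltree's code: `ExpandedName`, `Attr`, `attribute`, `has_attribute` and `sample_attrs` are stand-ins written for illustration.

```rust
/// Stand-in for an expanded name: optional namespace URI plus local name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ExpandedName<'a> {
    uri: Option<&'a str>,
    name: &'a str,
}

// A bare string is treated as a local name with no namespace.
impl<'a> From<&'a str> for ExpandedName<'a> {
    fn from(name: &'a str) -> Self {
        ExpandedName { uri: None, name }
    }
}

// A (URI, local name) pair is a namespaced name.
impl<'a> From<(&'a str, &'a str)> for ExpandedName<'a> {
    fn from((uri, name): (&'a str, &'a str)) -> Self {
        ExpandedName { uri: Some(uri), name }
    }
}

/// Hypothetical attribute record, as it might be stored after parsing.
struct Attr<'a> {
    name: ExpandedName<'a>,
    value: &'a str,
}

/// Model of the lookup: the URI and the local name must both match.
fn attribute<'a, N: Into<ExpandedName<'a>>>(attrs: &[Attr<'a>], name: N) -> Option<&'a str> {
    let wanted = name.into();
    attrs.iter().find(|a| a.name == wanted).map(|a| a.value)
}

fn has_attribute<'a, N: Into<ExpandedName<'a>>>(attrs: &[Attr<'a>], name: N) -> bool {
    attribute(attrs, name).is_some()
}

const ANDROID_NS: &str = "http://schemas.android.com/apk/res/android";

/// Two of the attributes from the `activity` element in the report.
fn sample_attrs() -> Vec<Attr<'static>> {
    vec![
        Attr { name: (ANDROID_NS, "exported").into(), value: "false" },
        Attr { name: (ANDROID_NS, "launchMode").into(), value: "singleTop" },
    ]
}
```

In this model, `has_attribute(&sample_attrs(), (ANDROID_NS, "exported"))` is true, while both `"android:exported"` and the `"{...}exported"` string are false, because each is compared as a local name with no namespace.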

document.root().range() returns 0..0

The following assertion fails:

assert_eq!(roxmltree::Document::parse("<e/>").unwrap().root().range(), 0..4);

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0..0`,
 right: `0..4`', src\main.rs:21:5

no_std Support

Dear all,

as roxmltree only depends on xmlparser, would it be possible to make roxmltree support no_std environments?

xmlparser is no_std ready, so that would already take some effort away.

Best regards,
Andreas

question: is there a pure iterator pattern for very large files?

Hitting

Error: the input string should be smaller than 4GiB.

Reading through the code, I'm not sure there is an easy pattern for working with large files, since I think it relies on finding both the start and end tags for each node, but I have only had a cursory look so far.

TextPos on parsed tree

Can I get the text position of a node after parsing has succeeded?
For example, when I need to post-process the result and return a custom error with a position.
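
If I recall correctly, roxmltree offers `Document::text_pos_at`, which maps a byte offset such as `node.range().start` to a `TextPos`; please verify that against the version in use. The conversion itself is simple enough to sketch with std only. `Pos` and `pos_at` below are hypothetical names, and the sketch counts columns in characters, which is an assumption about the desired behaviour.

```rust
/// 1-based row and column, counted in characters
/// (hypothetical stand-in for a text position type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Pos {
    row: u32,
    col: u32,
}

/// Converts a byte offset into `text` to a 1-based row/column pair.
/// Offsets past the end are clamped to the end of the text.
fn pos_at(text: &str, offset: usize) -> Pos {
    let mut row = 1;
    let mut col = 1;
    for (i, ch) in text.char_indices() {
        if i >= offset {
            break;
        }
        if ch == '\n' {
            // A newline starts the next row at column 1.
            row += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    Pos { row, col }
}
```

A post-processing step could keep the byte offset of each node it inspects and call `pos_at(text, offset)` only when it has to report an error, so the scan is paid for only on the failure path.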

Debug is not implemented for iterator structs

I have the code

use roxmltree::NodeType;

let doc = roxmltree::Document::parse(input)?;

let channel = doc.root().children().filter(|x| x.node_type() == NodeType::Element).exactly_one().unwrap();

Where exactly_one is itertools::Itertools::exactly_one. The problem I'm getting is that I can't use .unwrap() on that since the error needs to implement Debug, which it only does if the inner iterator does, and roxmltree::Children doesn't.

It looks like just adding a #[derive(Debug)] to the iterator structs would be enough. Iterators generally don't have very useful debug output for end users, but it's important that they implement Debug and output something; otherwise the types are hard to work with.

I'd also enable the missing_debug_implementations lint to ensure that every public type implements Debug, or at least that omitting it is intentional.
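
The requirement can be reproduced without roxmltree. The sketch below uses a hypothetical `Evens` iterator and a simplified `exactly_one` helper, which is a stand-in for the itertools method and not its real implementation (the real error type also keeps the items it already consumed). The point it illustrates is that `Result::unwrap` needs the error type, here the iterator itself, to implement Debug.

```rust
// At the crate root, the lint mentioned above would be enabled with:
// #![warn(missing_debug_implementations)]

/// Hypothetical iterator over the even numbers in a slice.
/// Without this derive, `Result<_, Evens>::unwrap` would not compile.
#[derive(Debug)]
struct Evens<'a> {
    items: std::slice::Iter<'a, i32>,
}

impl<'a> Iterator for Evens<'a> {
    type Item = i32;

    fn next(&mut self) -> Option<i32> {
        // Advance the inner iterator until the next even number.
        self.items.by_ref().copied().find(|n| n % 2 == 0)
    }
}

/// Simplified stand-in for `Itertools::exactly_one`: returns the single
/// item, or hands the (partly consumed) iterator back as the error value.
fn exactly_one<I: Iterator>(mut iter: I) -> Result<I::Item, I> {
    match (iter.next(), iter.next()) {
        (Some(item), None) => Ok(item),
        _ => Err(iter),
    }
}

fn evens(items: &[i32]) -> Evens<'_> {
    Evens { items: items.iter() }
}

fn single_even(items: &[i32]) -> i32 {
    // `unwrap` requires `Evens: Debug` so it can format the error on panic.
    exactly_one(evens(items)).unwrap()
}
```

Removing the derive from `Evens` makes `single_even` fail to compile, which mirrors the situation with `roxmltree::Children` described above.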

node.next_sibling_element() returns the node itself

Example code:

fn main() {
    let xml = "<root> <a/> <b/> <c/> </root>";
    let doc = roxmltree::Document::parse(xml).unwrap();

    let root = doc.root_element();
    let a = root.first_element_child().unwrap();
    let b = a.next_sibling_element().unwrap();
    let c = b.next_sibling_element().unwrap();

    println!("{}", root.tag_name().name());
    println!("{}", a.tag_name().name());
    println!("{}", b.tag_name().name());
    println!("{}", c.tag_name().name());
}

Expected output:

root
a
b
c

Actual output:

root
a
a
a

Make it easier to get a TextPos from an error

Hi, love your work on this, I'm using it to parse CSL for a new citation processor and it's fantastic.

CSL is mostly written by hand and involves big enums of allowed attribute values, so it's important to point out when and where these mishaps happen. I use the TextPos embedded in each variant of roxmltree::Error to produce a codespan for each error, and this works well.

However, it's probably better that this big match statement lives alongside its definition so other users can benefit and get the TextPos easily, and without having to update if roxmltree adds new variants. It would probably live on impl Error. Here is my source:

fn get_pos(e: &Error) -> TextPos {
    use xmlparser::Error as XP;
    match *e {
        Error::InvalidXmlPrefixUri(pos) => pos,
        Error::UnexpectedXmlUri(pos) => pos,
        Error::UnexpectedXmlnsUri(pos) => pos,
        Error::InvalidElementNamePrefix(pos) => pos,
        Error::DuplicatedNamespace(ref _name, pos) => pos,
        Error::UnexpectedCloseTag { pos, .. } => pos,
        Error::UnexpectedEntityCloseTag(pos) => pos,
        Error::UnknownEntityReference(ref _name, pos) => pos,
        Error::EntityReferenceLoop(pos) => pos,
        Error::DuplicatedAttribute(ref _name, pos) => pos,
        Error::ParserError(ref err) => match *err {
            XP::InvalidToken(_, pos, _) => pos,
            XP::UnexpectedToken(_, pos) => pos,
            XP::UnknownToken(pos) => pos,
        },
        _ => TextPos::new(1, 1)
    }
}
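
A minimal sketch of what this could look like as a method, using a shortened stand-in enum rather than the real `roxmltree::Error`: the `TextPos` struct, the reduced variant list and the 1:1 fallback are all assumptions made for illustration. I believe later roxmltree releases expose something similar as `Error::pos`, but that is worth checking before relying on it.

```rust
/// Stand-in for a row/column text position.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TextPos {
    row: u32,
    col: u32,
}

/// Hypothetical, shortened error enum; the real one has more variants.
#[derive(Debug)]
enum Error {
    UnknownEntityReference(String, TextPos),
    DuplicatedAttribute(String, TextPos),
    UnexpectedCloseTag { expected: String, actual: String, pos: TextPos },
    NoRootNode,
}

impl Error {
    /// Returns where the error occurred. Variants that carry no position
    /// fall back to 1:1, as in the free function above.
    fn pos(&self) -> TextPos {
        match *self {
            Error::UnknownEntityReference(_, pos) => pos,
            Error::DuplicatedAttribute(_, pos) => pos,
            Error::UnexpectedCloseTag { pos, .. } => pos,
            Error::NoRootNode => TextPos { row: 1, col: 1 },
        }
    }
}
```

Because the match lists every variant without a wildcard arm, adding a variant to the enum turns into a compile error inside the library instead of a silent 1:1 position in downstream code.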

roxmltree v0.11.1 breaks code that worked with v0.11.0

Hi,

you broke my build. ;-)

This is just a FYI that something happened. I guess that nothing can be done about it even though v0.11.1 is not semver compatible with v0.11.0 (or I misunderstood something).

The error is:

error[E0621]: explicit lifetime required in the type of `name`
    --> xcbgen-rs/src/parser.rs:1129:5
     |
1128 | fn get_attr<'a>(node: roxmltree::Node<'a, '_>, name: &str) -> Result<&'a str, ParseError> {
     |                                                      ---- help: add explicit lifetime `'a` to the type of `name`: `&'a str`
1129 |     node.attribute(name).ok_or(ParseError::InvalidXml)
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ lifetime `'a` required
error: aborting due to previous error

The code in question is this:
https://github.com/psychon/x11rb/blob/7f2407f735881acebb46c93eb3bc8fed3fa99814/xcbgen-rs/src/parser.rs#L1128-L1130

fn get_attr<'a>(node: roxmltree::Node<'a, '_>, name: &str) -> Result<&'a str, ParseError> {
    node.attribute(name).ok_or(ParseError::InvalidXml)
}

I haven't looked closely yet at what happened, but the compiler's suggestion seems wrong to me. Perhaps I need Node<'a, 'a>? Dunno, future-me will experiment with this.

Anyway, feel free to just close this issue. As I said, I just wanted to let you know about this.

[Q/A] What should be the return type of the function in print_pos.rs example?

Hey, I am new to Rust, so I'm having a hard time figuring out what the return type should be if I want to send the doc back to the calling function.

fn read_xml(file_path: String) {
    let text = std::fs::read_to_string(&file_path).unwrap();
    let doc = match roxmltree::Document::parse(&text) {
        Ok(doc) => doc,
        Err(e) => {
            println!("Error: {}.", e);
            return;
        },
    };
    return doc;  // This gives an error about not setting a proper return type
}

fn main() {
    read_xml("myxml.xml".to_string());
}

I tried doing something like:

fn read_xml(file_path: String) -> roxmltree::Document<'static>{

but then I get an error on the return; in the Err(e).

Now I want to know what should be the return type of read_xml function and how should I accept it in main function.

Thanks
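
A likely explanation, offered as a sketch rather than a definitive answer: the document borrows from the text it was parsed from, and `text` is a local variable of `read_xml`, so the document cannot be returned from that function, and `'static` does not describe the borrow. The bare `return;` also stops compiling once the function has a non-unit return type. One common way around both problems is to return the owned String (or an error) and parse in the caller. The code below models this with std only: `Doc`, `parse` and `root_name_of_file` are hypothetical stand-ins, and the toy `parse` only extracts the first tag name.

```rust
use std::fs;
use std::io;

/// Stand-in for a parsed document that borrows from its input text.
#[derive(Debug)]
struct Doc<'input> {
    root_name: &'input str,
}

/// Toy "parser": takes the name of the first tag. Illustrative only.
fn parse(text: &str) -> Result<Doc<'_>, String> {
    let start = text.find('<').ok_or("no root element")?;
    let rest = &text[start + 1..];
    let end = rest
        .find(|c: char| c == '>' || c == '/' || c.is_whitespace())
        .ok_or("unterminated tag")?;
    Ok(Doc { root_name: &rest[..end] })
}

/// Returns the owned text; the caller keeps it alive and parses it.
fn read_xml(file_path: &str) -> io::Result<String> {
    fs::read_to_string(file_path)
}

/// The caller owns `text`, so the borrowed document stays valid here.
fn root_name_of_file(path: &str) -> Result<String, String> {
    let text = read_xml(path).map_err(|e| e.to_string())?; // owner lives here
    let doc = parse(&text)?; // borrows `text`
    Ok(doc.root_name.to_string())
}
```

With the real crate, the same shape would mean having `read_xml` return the file contents and calling `roxmltree::Document::parse(&text)` in `main`, where `text` lives for as long as the document is used.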
