xml-stream-parser's Introduction

xml stream parser

xml-stream-parser is an XML parser for Go. It parses large XML data efficiently by streaming it.

Usage

<?xml version="1.0" encoding="UTF-8"?>
<bookstore number="2" loc="273456">
   <book>
      <title>The Iliad and The Odyssey</title>
      <price>12.95</price>
      <comments>
         <userComment rating="4">Best translation I've read.</userComment>
         <userComment rating="2">I like other versions better.</userComment>
      </comments>
   </book>
   <book>
      <title>Anthology of World Literature</title>
      <price>24.95</price>
      <comments>
         <userComment rating="3">Needs more modern literature.</userComment>
         <userComment rating="4">Excellent overview of world literature.</userComment>
      </comments>
   </book>
   <journal>
      <title>Journal of XML parsing</title>
      <issue>1</issue>
   </journal>
</bookstore>

Stream over books and journals

f, err := os.Open("input.xml")
if err != nil {
   log.Fatal(err)
}
defer f.Close()
br := bufio.NewReaderSize(f, 65536)
parser := xmlparser.NewXMLParser(br, "book", "journal")

for xml := range parser.Stream() {
   fmt.Println(xml.Childs["title"][0].InnerText)
   if xml.Name == "book" {
      fmt.Println(xml.Childs["comments"][0].Childs["userComment"][0].Attrs["rating"])
      fmt.Println(xml.Childs["comments"][0].Childs["userComment"][0].InnerText)
   }
}

Skip tags for speed

parser := xmlparser.NewXMLParser(br, "book").SkipElements([]string{"price", "comments"})

Attributes only

parser := xmlparser.NewXMLParser(br, "bookstore", "book").ParseAttributesOnly("bookstore")

Error handling

for xml := range parser.Stream() {
   if xml.Err != nil {
      // handle error
   }
}

Progress of parsing

// total bytes read so far, useful for calculating parsing progress
parser.TotalReadSize
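
As a rough illustration of how such a counter can be turned into a progress figure, here is a minimal sketch. It assumes the counter is an unsigned byte count and that the total input size is known separately (for a file, typically from os.Stat); the helper name progressPercent is hypothetical and not part of the package.

```go
package main

import "fmt"

// progressPercent converts a running byte count into a percentage of the
// total input size. It returns 0 when the total is unknown (zero) and caps
// the result at 100 in case the counter overshoots the expected total.
func progressPercent(readBytes, totalBytes uint64) float64 {
	if totalBytes == 0 {
		return 0
	}
	p := float64(readBytes) / float64(totalBytes) * 100
	if p > 100 {
		return 100
	}
	return p
}

func main() {
	// readBytes stands in for the parser's byte counter; totalBytes would
	// usually come from the size of the input file.
	fmt.Printf("%.1f%%\n", progressPercent(65536, 262144)) // prints 25.0%
}
```

Because the input is read through a buffered reader, a figure computed this way is best treated as approximate.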

XPath queries provide an alternative to the default fast access for different use cases

parser := xmlparser.NewXMLParser(br, "bookstore").EnableXpath()

for xml := range parser.Stream() {
   // select books
   xml.SelectElements("//book")
   xml.SelectElements("./book")
   xml.SelectElements("book")
   // select titles
   xml.SelectElements("./book/title")
   // select books with a price condition
   xml.SelectElements("//book[price>=20.95]")
   // select comments with rating 4
   xml.SelectElements("//book/comments/userComment[@rating='4']")
}
// to evaluate a function or reuse an existing XPath expression
// (xml below is an element received from parser.Stream())
// sum of all the book prices
expr, err := parser.CompileXpath("sum(//book/price)")
price := expr.Evaluate(parser.CreateXPathNavigator(xml)).(float64)

The XPath functionality is implemented via the xpath library; see its documentation for more examples.

If you are interested, also check out the JSON parser, which works similarly.

xml-stream-parser's People

Contributors

imirkin, setnicka, tamerh, tsak

xml-stream-parser's Issues

Exception Handling

goroutine Does not end

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book ISBN="10-000000-001">
        <title>The Iliad and The Odyssey</title>
        <price>12.95</price>
        <comments>
            <userComment rating="4">Best translation I've read.</userComment>
            <userComment rating="2">I like other versions better.</userComment>
        </comments>
        <description>Homer's two epics of the ancient world, The Iliad & The Odyssey, tell stories as riveting today as when they were written between the eighth and ninth century B.C.</description>
    </book>
    <book ISBN="10-000000-999">
        <title>Anthology of World Literature</title>
        <price>24.95</price>
        <comments>
            <userComment rating="3">Needs more modern literature.</userComment>

XPath is not working with namespaces?

Hi, here is an example with a non-working XPath query.

Code:

package main

import (
	"bufio"
	"bytes"
	"fmt"

	"github.com/davecgh/go-spew/spew"
	xmlparser "github.com/tamerh/xml-stream-parser"
)

func main() {
	br := bufio.NewReaderSize(bytes.NewReader([]byte(`
		<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body></soap:Body></soap:Envelope>
	`)), 65536)

	str := xmlparser.NewXMLParser(br, "soap:Envelope").EnableXpath()
	for xml := range str.Stream() {
		fmt.Println("1: ")
		spew.Dump(xml.Childs)
		fmt.Println("2: ")
		spew.Dump(xml.SelectElements("soap:Body"))
		fmt.Println("3: ")
		spew.Dump(xml.SelectElement("soap:Body"))
	}
}

output:

go run *.go
1:
(map[string][]xmlparser.XMLElement) (len=1) {
 (string) (len=9) "soap:Body": ([]xmlparser.XMLElement) (len=1 cap=1) {
  (xmlparser.XMLElement) {
   Name: (string) (len=9) "soap:Body",
   Attrs: (map[string]string) <nil>,
   InnerText: (string) "",
   Childs: (map[string][]xmlparser.XMLElement) <nil>,
   Err: (error) <nil>,
   childs: ([]*xmlparser.XMLElement) <nil>,
   parent: (*xmlparser.XMLElement)(0xc000122000)({
    Name: (string) (len=13) "soap:Envelope",
    Attrs: (map[string]string) (len=1) {
     (string) (len=10) "xmlns:soap": (string) (len=41) "http://schemas.xmlsoap.org/soap/envelope/"
    },
    InnerText: (string) "",
    Childs: (map[string][]xmlparser.XMLElement) (len=1) {
     (string) (len=9) "soap:Body": ([]xmlparser.XMLElement) (len=1 cap=1) {
      (xmlparser.XMLElement) {
       Name: (string) (len=9) "soap:Body",
       Attrs: (map[string]string) <nil>,
       InnerText: (string) "",
       Childs: (map[string][]xmlparser.XMLElement) <nil>,
       Err: (error) <nil>,
       childs: ([]*xmlparser.XMLElement) <nil>,
       parent: (*xmlparser.XMLElement)(0xc000122000)(<already shown>),
       attrs: ([]*xmlparser.xmlAttr) <nil>
      }
     }
    },
    Err: (error) <nil>,
    childs: ([]*xmlparser.XMLElement) (len=1 cap=1) {
     (*xmlparser.XMLElement)(0xc000122080)({
      Name: (string) (len=9) "soap:Body",
      Attrs: (map[string]string) <nil>,
      InnerText: (string) "",
      Childs: (map[string][]xmlparser.XMLElement) <nil>,
      Err: (error) <nil>,
      childs: ([]*xmlparser.XMLElement) <nil>,
      parent: (*xmlparser.XMLElement)(0xc000122000)(<already shown>),
      attrs: ([]*xmlparser.xmlAttr) <nil>
     })
    },
    parent: (*xmlparser.XMLElement)(<nil>),
    attrs: ([]*xmlparser.xmlAttr) (len=1 cap=1) {
     (*xmlparser.xmlAttr)(0xc0000aa420)({
      name: (string) (len=10) "xmlns:soap",
      value: (string) (len=41) "http://schemas.xmlsoap.org/soap/envelope/"
     })
    }
   }),
   attrs: ([]*xmlparser.xmlAttr) <nil>
  }
 }
}
2:
([]*xmlparser.XMLElement) <nil>
(interface {}) <nil>
3:
(*xmlparser.XMLElement)(<nil>)
(interface {}) <nil>
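
For comparison, the sketch below shows prefix-independent matching using Go's standard encoding/xml package, not this package's XPath support. It relies on the standard decoder resolving each prefix to its namespace URI, so elements can be matched on URI plus local name. The helper hasElement and the constant soapNS are hypothetical names introduced for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// soapNS is the namespace URI bound to the "soap" prefix in the report above.
const soapNS = "http://schemas.xmlsoap.org/soap/envelope/"

// hasElement reports whether doc contains an element with the given
// namespace URI and local name, regardless of the prefix the document uses.
func hasElement(doc, space, local string) (bool, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return false, nil
		}
		if err != nil {
			return false, err
		}
		// Token resolves prefixes, so Name.Space holds the namespace URI.
		if se, ok := tok.(xml.StartElement); ok && se.Name.Space == space && se.Name.Local == local {
			return true, nil
		}
	}
}

func main() {
	doc := `<soap:Envelope xmlns:soap="` + soapNS + `"><soap:Body></soap:Body></soap:Envelope>`
	found, err := hasElement(doc, soapNS, "Body")
	fmt.Println(found, err)
}
```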

[Feature] Get Element Attributes without the whole Tree

Hi, thank you for the great package. It's very fast and the only one that works for me.

I would appreciate an option to access an element's attributes without loading the whole tree.
In my case, I have a "root" tag with important attributes, and all other tags are its children, so it loads the whole file into memory.
I will be happy to make a PR if you give me some directions.
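
The README's ParseAttributesOnly option appears aimed at this case. Independently of this package, the idea can be sketched with Go's standard encoding/xml: read tokens only until the first start element, then stop. The helper names rootAttrs and rootAttrsFromString are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// rootAttrs reads tokens from r only until the first start element and
// returns that element's name and attributes. Nothing after the root's
// start tag is decoded, so the element's children are never built in memory.
func rootAttrs(r io.Reader) (string, map[string]string, error) {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", nil, err // io.EOF here means no element was found
		}
		if se, ok := tok.(xml.StartElement); ok {
			attrs := make(map[string]string, len(se.Attr))
			for _, a := range se.Attr {
				attrs[a.Name.Local] = a.Value
			}
			return se.Name.Local, attrs, nil
		}
	}
}

// rootAttrsFromString is a convenience wrapper for in-memory documents.
func rootAttrsFromString(doc string) (string, map[string]string, error) {
	return rootAttrs(strings.NewReader(doc))
}

func main() {
	name, attrs, err := rootAttrsFromString(`<bookstore number="2" loc="273456"><book/></bookstore>`)
	fmt.Println(name, attrs, err)
}
```

Since decoding stops at the root's start tag, the same call also works on a document whose remainder is very large or still arriving.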

Can't get it to work?

I have been trying to use this parser but even when I try the example code mentioned here, I can't get it to run. Any valid xml results in an invalid xml error because of EOF? What am I doing wrong?

--

mistake

bufio: invalid use of UnreadByte

go version go1.12.5 windows/amd64
xml-stream-parser: latest ("Skip XML declatarions at beginning" commit)
There is an error when trying to use Stream():

for xml := range parserXML.Stream() {

xml.Err.Error() contains "Invalid xml"

I looked into it: the sendError() method produces this message, and the underlying error is actually "bufio: invalid use of UnreadByte".

It started here: https://github.com/tamerh/xml-stream-parser/blob/master/xmlparser.go#L453

and before that, here:

err := x.reader.UnreadByte()

Next is bufio itself.

A potential missing edge case

Hey, I started using your library recently. I think there may be an edge case missed in the code.
If I test the XML in the README, it passes, but not if I add an extra space between the = and the opening quote of an attribute value.
For example:
<bookstore number="2" loc="273456"> is acceptable,
<bookstore number ="2" loc="273456"> is also acceptable,
<bookstore number= "2" loc="273456"> throws a runtime error

Based on the XML spec, I think spaces on either side of = should all be valid.
It's probably caused by this line here

What do you think?
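
To make the rule concrete: the XML grammar defines the attribute separator as Eq ::= S? '=' S?, so whitespace is optional on both sides. The following is a minimal, standalone sketch of an attribute scanner that follows that rule. It is not the package's code; parseAttrs and isSpace are hypothetical helpers, and entity references in values are deliberately left undecoded.

```go
package main

import (
	"errors"
	"fmt"
)

// isSpace reports whether b is XML whitespace (space, tab, CR, LF).
func isSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\r' || b == '\n'
}

// parseAttrs parses the attribute part of a start tag, for example
// `number= "2" loc ="273456"`, skipping optional whitespace on both
// sides of '='.
func parseAttrs(s string) (map[string]string, error) {
	attrs := make(map[string]string)
	i := 0
	skipSpace := func() {
		for i < len(s) && isSpace(s[i]) {
			i++
		}
	}
	for {
		skipSpace()
		if i == len(s) {
			return attrs, nil
		}
		// Attribute name: everything up to whitespace or '='.
		start := i
		for i < len(s) && !isSpace(s[i]) && s[i] != '=' {
			i++
		}
		name := s[start:i]
		if name == "" {
			return nil, errors.New("missing attribute name")
		}
		skipSpace() // whitespace before '='
		if i == len(s) || s[i] != '=' {
			return nil, fmt.Errorf("attribute %q: expected '='", name)
		}
		i++         // consume '='
		skipSpace() // whitespace after '='
		if i == len(s) || (s[i] != '"' && s[i] != '\'') {
			return nil, fmt.Errorf("attribute %q: expected quoted value", name)
		}
		quote := s[i]
		i++
		start = i
		for i < len(s) && s[i] != quote {
			i++
		}
		if i == len(s) {
			return nil, fmt.Errorf("attribute %q: unterminated value", name)
		}
		attrs[name] = s[start:i]
		i++ // consume closing quote
	}
}

func main() {
	for _, in := range []string{`number="2" loc="273456"`, `number ="2" loc="273456"`, `number= "2" loc="273456"`} {
		attrs, err := parseAttrs(in)
		fmt.Println(attrs, err)
	}
}
```

All three inputs from the report yield the same two attributes, because whitespace is skipped both before and after the equals sign.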

How to access directly some deeply nested elements?

This is more of a question than an issue.
Take your example, slightly extended:

<?xml version="1.0" encoding="UTF-8"?>
<lectures>
   <libraries>
       <library name="xx">
          <book>
          ...
          </book>
          ...
       </library>
       <library name="yy">
         <book>
         ...
         </book>
         ...
       </library>
   </libraries>
   <bookstores>
      <bookstore name="xx">
         <book>
         ...
         </book>
         ...
      </bookstore>
      <bookstore name="yy">
         <book>
         ...
         </book>
         ..
      </bookstore>
   </bookstores>
</lectures>

How can I directly access all books in bookstores (but not those in libraries) in an efficient manner?
This doesn't work:
parser := NewXMLParser(br, "lectures/bookstores/bookstore/book")

And what is your recommendation for parsing large (>1 GB) XML files with loads of elements without reparsing the file again and again?
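
One way to approach this kind of path-restricted access in a single pass, sketched here with Go's standard encoding/xml rather than this package, is to keep a stack of open element names, decode an element only when the stack equals the wanted path, and skip subtrees that cannot lead to it. The Book type, booksAtPath, booksAtPathString and sampleLectures are hypothetical names for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Book holds the fields of interest; the shape is assumed for this sketch.
type Book struct {
	Title string `xml:"title"`
}

const sampleLectures = `<lectures>
  <libraries><library name="xx"><book><title>L1</title></book></library></libraries>
  <bookstores>
    <bookstore name="xx"><book><title>B1</title></book></bookstore>
    <bookstore name="yy"><book><title>B2</title></book></bookstore>
  </bookstores>
</lectures>`

// booksAtPath makes one pass over r and decodes only the elements whose
// chain of element names (root first, slash separated) equals wantPath.
func booksAtPath(r io.Reader, wantPath string) ([]Book, error) {
	dec := xml.NewDecoder(r)
	var stack []string // names of the currently open elements
	var books []Book
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return books, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			stack = append(stack, t.Name.Local)
			path := strings.Join(stack, "/")
			switch {
			case path == wantPath:
				var b Book
				// DecodeElement consumes through the matching end tag.
				if err := dec.DecodeElement(&b, &t); err != nil {
					return nil, err
				}
				books = append(books, b)
				stack = stack[:len(stack)-1]
			case !strings.HasPrefix(wantPath, path+"/"):
				// This subtree cannot contain the wanted path: skip it.
				if err := dec.Skip(); err != nil {
					return nil, err
				}
				stack = stack[:len(stack)-1]
			}
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		}
	}
}

// booksAtPathString is a convenience wrapper for in-memory documents.
func booksAtPathString(doc, wantPath string) ([]Book, error) {
	return booksAtPath(strings.NewReader(doc), wantPath)
}

func main() {
	books, err := booksAtPathString(sampleLectures, "lectures/bookstores/bookstore/book")
	fmt.Println(books, err)
}
```

The file is read once from start to end, and only one matched element is held in memory at a time, which is the property that matters for multi-gigabyte inputs.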

CDATA issue

There is a problem when parsing an XML document with CDATA blocks: parsing stops after the first element.

Example file:

<?xml version="1.0" encoding="UTF-8" ?>
<offers>
	<offer>
		<title>Title1</title>
		<description><![CDATA[Big description 1]]></description>
	</offer>
	<offer>
		<title>Title2</title>
		<description><![CDATA[Big description 2]]></description>
	</offer>
	<offer>
		<title>Title3</title>
		<description><![CDATA[Big description 3]]></description>
	</offer>
</offers>

Example code:

package main

import (
	"bufio"
	"fmt"
	"os"

	streamParser "github.com/tamerh/xml-stream-parser"
)

func main() {
	f, _ := os.Open("example2.xml")
	br := bufio.NewReaderSize(f, 65536)
	parser := streamParser.NewXMLParser(br, "offer")
	for xml := range parser.Stream() {
		for tagName := range xml.Childs {
			fmt.Println(tagName)
		}
	}
}
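
As a reference point for what a correct result looks like on this input, the sketch below reads the same document shape with Go's standard encoding/xml, which delivers CDATA content as ordinary character data. It does not use this package; the Offer type, readOffers, readOffersString and sampleOffers are hypothetical names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Offer mirrors the <offer> element from the example file.
type Offer struct {
	Title       string `xml:"title"`
	Description string `xml:"description"`
}

const sampleOffers = `<?xml version="1.0" encoding="UTF-8" ?>
<offers>
	<offer><title>Title1</title><description><![CDATA[Big description 1]]></description></offer>
	<offer><title>Title2</title><description><![CDATA[Big description 2]]></description></offer>
	<offer><title>Title3</title><description><![CDATA[Big description 3]]></description></offer>
</offers>`

// readOffers streams over r and decodes each <offer> element in turn.
// The text inside a CDATA section ends up in the Description field.
func readOffers(r io.Reader) ([]Offer, error) {
	dec := xml.NewDecoder(r)
	var offers []Offer
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return offers, nil
		}
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "offer" {
			var o Offer
			if err := dec.DecodeElement(&o, &se); err != nil {
				return nil, err
			}
			offers = append(offers, o)
		}
	}
}

// readOffersString is a convenience wrapper for in-memory documents.
func readOffersString(doc string) ([]Offer, error) {
	return readOffers(strings.NewReader(doc))
}

func main() {
	offers, err := readOffersString(sampleOffers)
	fmt.Println(len(offers), err)
	for _, o := range offers {
		fmt.Println(o.Title, "/", o.Description)
	}
}
```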
