README

The game ain't in me no more. None of it.

xmlcutty is a simple tool for carving out elements from large XML files, fast. Because it processes its input as a stream, it uses almost no memory and can process around 1 GB of XML per minute.

Why? Background.

Development

Package updates:

go get -u
go mod tidy

Compile

# Linux amd64 architecture
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o xmlcutty cmd/xmlcutty/main.go
strip xmlcutty

Install

Use a deb or rpm release. It's in AUR, too.

Or install with the go tool:

$ go install github.com/miku/xmlcutty/cmd/xmlcutty@latest

Usage

$ cat fixtures/sample.xml
<a>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</a>

Options:

$ xmlcutty -h
Usage of xmlcutty:
  -path string
        select path (default "/")
  -rename string
        rename wrapper element to this name
  -root string
        synthetic root element
  -v    show version

The path syntax looks a bit like XPath, but it really is only a simple matcher.

$ xmlcutty -path /a fixtures/sample.xml
<a>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</a>

You specify a path, e.g. /a/b and all elements matching this path are printed:

$ xmlcutty -path /a/b fixtures/sample.xml
<b>
    <c></c>
</b>
<b>
    <c></c>
</b>

You can end up with an XML document without a root. To make tools like xmllint happy, you can add a synthetic root element on the fly:

$ xmlcutty -root hello -path /a/b fixtures/sample.xml | xmllint --format -
<?xml version="1.0"?>
<hello>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</hello>

Rename the wrapper element, that is, the last element of the matching path:

$ xmlcutty -rename beee -path /a/b fixtures/sample.xml
<beee>
    <c></c>
</beee>
<beee>
    <c></c>
</beee>

All options combined, a synthetic root element and a renamed path element:

$ xmlcutty -root hi -rename ceee -path /a/b/c fixtures/sample.xml | xmllint --format -
<?xml version="1.0"?>
<hi>
    <ceee/>
    <ceee/>
</hi>

It will parse XML files without a root element just fine.

$ head fixtures/oai.xml
<record>
    <header>
        <identifier>oai:arXiv.org:0704.0004</identifier>
        <datestamp>2007-05-23</datestamp>
        <setSpec>math</setSpec>
    </header>
    <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"... >
            <dc:title>A determinant of Stirling cycle numbers counts ...
            <dc:type>text</dc:type>
            <dc:identifier>http://arxiv.org/abs/0704.0004</dc:identifier>
...

This is an example XML response from a web service. We can slice out the identifier elements. Note that any namespace - here oai_dc - is completely ignored for the sake of simplicity:

$ cat fixtures/oai.xml | xmlcutty -root x -path /record/metadata/dc/identifier \
                       | xmllint --format -
<?xml version="1.0"?>
<x>
    <identifier>http://arxiv.org/abs/0704.0004</identifier>
    <identifier>http://arxiv.org/abs/0704.0010</identifier>
    <identifier>http://arxiv.org/abs/0704.0012</identifier>
</x>

We can go a bit further and extract just the text of each element, which is like a poor man's text() in XPath terms. By passing a newline as the argument to -rename, we effectively get rid of the enclosing XML tag:

$ cat fixtures/oai.xml | xmlcutty -rename '\n' -path /record/metadata/dc/identifier \
                       | grep -v "^$"
http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012

This last feature is handy for quickly extracting text from large XML files.
