martinblech / xmltodict

Python module that makes working with XML feel like you are working with JSON

License: MIT License

xmltodict's Introduction

xmltodict

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> import json, xmltodict
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute", 
        "and": {
            "many": [
                "elements", 
                "more elements"
            ]
        }, 
        "plus": {
            "@a": "complex", 
            "#text": "element as well"
        }
    }
}

Namespace support

By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:

>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True

It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:

>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True

Streaming mode

xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...

It can also be used from the command line to pipe objects to a script like this:

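import sys, marshal
stdin = getattr(sys.stdin, 'buffer', sys.stdin)  # binary stream on Python 3
while True:
    try:
        _, article = marshal.load(stdin)
    except EOFError:  # end of the piped stream
        break
    print(article['title'])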

$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ gunzip -c enwiki.dicts.gz | script1.py
$ gunzip -c enwiki.dicts.gz | script2.py
...

Roundtripping

You can also convert in the other direction, using the unparse() function:

>>> import xmltodict
>>> mydict = {
...     'response': {
...             'status': 'good',
...             'last_updated': '2014-02-16T23:10:12Z',
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
	<status>good</status>
	<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>

Text values for nodes can be specified with the cdata_key key in the Python dict, while node attributes can be specified by prefixing the key name with attr_prefix. The default value for attr_prefix is @ and the default value for cdata_key is #text.

>>> import xmltodict
>>> 
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>

Lists that are specified under a key in a dictionary use the key as a tag for each item. But if a list does not have a parent key, for example if a list exists inside another list, it has no tag to use and its items are converted to a string, as shown in the example below. To give tags to nested lists, use the expand_iter keyword argument to provide a tag, as demonstrated below. Note that using expand_iter will break roundtripping.

>>> mydict = {
...     "line": {
...         "points": [
...             [1, 5],
...             [2, 6],
...         ]
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<line>
        <points>[1, 5]</points>
        <points>[2, 6]</points>
</line>
>>> print(xmltodict.unparse(mydict, pretty=True, expand_iter="coord"))
<?xml version="1.0" encoding="utf-8"?>
<line>
        <points>
                <coord>1</coord>
                <coord>5</coord>
        </points>
        <points>
                <coord>2</coord>
                <coord>6</coord>
        </points>
</line>

Ok, how do I get it?

Using pypi

You just need to

$ pip install xmltodict

Using conda

For installing xmltodict using Anaconda/Miniconda (conda) from the conda-forge channel all you need to do is:

$ conda install -c conda-forge xmltodict

RPM-based distro (Fedora, RHEL, …)

There is an official Fedora package for xmltodict.

$ sudo yum install python-xmltodict

Arch Linux

There is an official Arch Linux package for xmltodict.

$ sudo pacman -S python-xmltodict

Debian-based distro (Debian, Ubuntu, …)

There is an official Debian package for xmltodict.

$ sudo apt install python-xmltodict

FreeBSD

There is an official FreeBSD port for xmltodict.

$ pkg install py36-xmltodict

openSUSE/SLE (SLE 15, Leap 15, Tumbleweed)

There is an official openSUSE package for xmltodict.

# Python2
$ zypper in python2-xmltodict

# Python3
$ zypper in python3-xmltodict

xmltodict's Issues

Iteration type standardization

I have a structure similar to this.

<root>
   <abc>
      <tr>one</tr>
      <tr>two</tr>
      <tr>three</tr>
   </abc>
   <abc>
      <tr>four</tr>
   </abc>
   <abc>
      five
   </abc>
</root>

When I iterate over this like so

doc = xmltodict.parse(data)
for e in doc['root']['abc']:
   print e

I get

OrderedDict([(u'tr', [u'one', u'two', u'three'])])
OrderedDict([(u'tr', u'four')])
five

Why isn't u'four' in a list? Like so: OrderedDict([(u'tr', [u'four'])]).
I understand that five is inconsistent with the rest of the structure. However, I don't agree with changing a type based on the number of elements. For example, ...<abc><tr>four</tr></abc>... is represented as a string, while <abc><tr>one</tr><tr>two</tr><tr>three</tr></abc> is represented as a list.
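
One way to cope with this on the caller's side is a small wrapper that turns any non-list value into a list, so the iteration code is the same for one or many <tr> children. This is only a sketch: the helper name as_list is hypothetical and not part of xmltodict, and the sample data simply mirrors the parsed shapes shown above.

```python
def as_list(value):
    """Return value unchanged if it is a list, [] if it is None, else [value]."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

# Shapes mirroring what parse() returns for the first two <abc> elements above.
abc_items = [
    {'tr': ['one', 'two', 'three']},
    {'tr': 'four'},
]
for item in abc_items:
    for tr in as_list(item['tr']):
        print(tr)
```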

Read encoding in file

Would it be possible to add, by default or as an option, a way for the "parse" function to read the file's encoding from the XML encoding declaration?

I do this that way:

xml = f.read()
start = xml.find('encoding="') + len('encoding="')
end = xml.find('"', start)
xmlEncoding = xml[start : end ]

Maybe you could find a safer way
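
A possibly safer route is to let the parser do this: Expat, which xmltodict is built on, reads the encoding declaration itself when it is given raw bytes. The sketch below uses only the standard library's xml.parsers.expat, and the sample document is made up for illustration.

```python
from xml.parsers import expat

# A made-up document declared as ISO-8859-1 and encoded accordingly.
raw = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'.encode('iso-8859-1')

chunks = []
parser = expat.ParserCreate()          # no encoding override: trust the declaration
parser.CharacterDataHandler = chunks.append
parser.Parse(raw, True)

text = ''.join(chunks)
print(text)
```

Passing the same bytes, or a file opened in 'rb' mode, to xmltodict.parse without an explicit encoding should behave the same way in versions where the encoding argument defaults to None, since the call is delegated to Expat.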

Recent xml2dict installation failed with the following

Running setup.py egg_info for package xml2dict
Traceback (most recent call last):
File "<string>", line 16, in <module>
File "/myproj/virtualenv/build/xml2dict/setup.py", line 14, in <module>
long_description=open('README.md').read())
IOError: [Errno 2] No such file or directory: 'README.md'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "<string>", line 16, in <module>

File "/myproj/virtualenv/build/xml2dict/setup.py", line 14, in <module>

long_description=open('README.md').read())

python 2.7.5

Exclude tags

Hi,

Could you help me? Is there any way to exclude some tags from parsing? For example:

"< article>
< title>Here is a title< /title>
< description>here is a big description< element >Title< /element >< /description>
< /article>"

I need to get the description without parsing its contents:
article: {"title": "Here is a title", "description": "here is a big description< element>Title< /element>"}

Thanks.

post-processing callback

@slestak has a good idea here

Add option to suppress xml declaration (snippet-mode?)

This module does a marvelous job of parsing and unparsing XML. However, when I unparse an ordereddict, I do not always need the version and encoding declaration.

For example, I am creating OrderedDicts from scratch and passing them to unparse. Since these are intended to be snippets instead of fully declared xml, I find myself manually postprocessing this to remove this declaration.
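
Later releases of xmltodict accept full_document=False in unparse() for this purpose. For versions without that argument, the manual post-processing described above could look like the following sketch; the helper name strip_xml_declaration is hypothetical.

```python
import re

# Matches an optional leading <?xml ...?> declaration plus surrounding whitespace.
_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>\s*')

def strip_xml_declaration(xml_text):
    """Remove a leading <?xml ...?> declaration, if present."""
    return _DECLARATION.sub('', xml_text, count=1)

full = '<?xml version="1.0" encoding="utf-8"?>\n<response><status>good</status></response>'
snippet = strip_xml_declaration(full)
print(snippet)
```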

xml containing 1 child

Consider the following code

xml = """<?xml version="1.0" encoding="utf-8" ?>
<root>
    <children>
        <child>
            <name>A</name>
        </child>
    </children>
</root>"""

xmltodict.parse(xml)['root']['children']['child']

Wouldn't you expect to have an iterable object even when there is only 1 child?

welcome

app

Encoded character should not be 'stripped' if it's the only character in the data item

In the function endElement we have the following code snippet

if self.strip_whitespace and data is not None:
    data = data.strip() or None

This case fails for the following XML file as there is just one whitespace character that we have for value's data and that gets stripped

<?xml version="1.0" encoding="UTF-8"?>                                                                                                         
<service>
    <key>
        <name>SEPARATOR</name>
        <value>&#009;</value>
    </key>
</service>

The output obtained after parsing this XML file would look like this

>>> import xmltodict
>>> fd = open(".../sample_file.xml")
>>> return_dict = xmltodict.parse(fd)
>>> return_dict
OrderedDict([(u'service', OrderedDict([(u'key', OrderedDict([(u'name', u'SEPARATOR'), (u'value', None)]))]))])

But if we change the code snippet to

if self.strip_whitespace and data is not None:
    if data.strip() is None:
        pass
    else:
        data = data.strip()

We can get the following output

>>> import xmltodict
>>> fd = open(".../sample_file.xml")
>>> return_dict = xmltodict.parse(fd)
>>> return_dict
OrderedDict([(u'service', OrderedDict([(u'key', OrderedDict([(u'name', u'SEPARATOR'), (u'value', u'')]))]))])

We are able to get the TAB character
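
Note that str.strip() always returns a string, never None, so the "is None" branch in the proposed change can never be taken and the data is always stripped; that would explain the empty string in the output above. A sketch of the intended behaviour, written as a standalone hypothetical helper rather than as xmltodict's actual code, keeps whitespace-only data instead of discarding it:

```python
def strip_keeping_blank(data, strip_whitespace=True):
    """Strip surrounding whitespace, but keep data that is whitespace only."""
    if strip_whitespace and data is not None:
        stripped = data.strip()
        # An empty result means the data was pure whitespace (e.g. a TAB
        # from &#009;), so return it untouched rather than dropping it.
        return stripped if stripped else data
    return data

print(repr(strip_keeping_blank('\t')))
print(repr(strip_keeping_blank('  SEPARATOR \n')))
```

A real fix would also have to tell this case apart from the indentation between elements, which is whitespace-only as well.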

unparse handles lists incorrectly?

My Python object looks like so:
{'Response': {'@errorcode': '00', 'Versions': [{'Version': {'@Updated': u'2013-10-23T18:29:11', 'Basic': {'@md5': u'a7674c694607b169e57593a4952ea26f'}}}, {'Version': {'@Updated': u'2013-10-23T18:55:53', 'Basic': {'@md5': u'b50001ee638f7df058d2c5f9157c6e8a'}}}]}}

The resulting XML from 'unparse' puts an endtag for Versions after the first Version, then starts it again before the second list item.

Seems that "Versions" shouldn't be ended after the first "Version" object?
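
As the README describes, a list under a key reuses that key as the tag for every item, which is why Versions is emitted once per list element. To get a single Versions element wrapping several Version elements, the list can be moved one level down before calling unparse. A sketch, with a hypothetical helper name and a shortened copy of the data above:

```python
def nest_list_under(parent, wrapper_key, item_key):
    """Turn {wrapper: [{item: a}, {item: b}]} into {wrapper: {item: [a, b]}}."""
    items = [entry[item_key] for entry in parent[wrapper_key]]
    result = dict(parent)              # shallow copy, the input is not modified
    result[wrapper_key] = {item_key: items}
    return result

response = {
    '@errorcode': '00',
    'Versions': [
        {'Version': {'@Updated': '2013-10-23T18:29:11'}},
        {'Version': {'@Updated': '2013-10-23T18:55:53'}},
    ],
}
fixed = nest_list_under(response, 'Versions', 'Version')
print(fixed['Versions'])
```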

Configurable Prefixes for Attributes and Text

I try to use the resulting dictionary from xmltodict within Django templates.

Within Django templates, dictionary members are addressed via dot notation, i.e.
the contents of var['node1']['node2']['@ATTR'] are addressed via node1.node2.@ATTR in the Django template language.

A problem occurs with the @ and # prefixes for attributes and text content, as Django cannot evaluate the variable node1.node2.@ATTR because of those prefixes within its variable names.

Feature Request:
Make the prefixes used for attributes and text content configurable, so the user can determine which prefixes are used (for example XMLATTR_ instead of @)
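
Current versions of parse() accept attr_prefix and cdata_key arguments, the same names the README uses for unparse(). Where those are not available, the keys can be renamed after parsing. The following is a sketch only: the helper name and the XMLTEXT replacement key are hypothetical.

```python
def rename_prefixes(node, attr_prefix='XMLATTR_', cdata_key='XMLTEXT'):
    """Recursively replace the default '@' and '#text' markers in parsed output."""
    if isinstance(node, dict):
        renamed = {}
        for key, value in node.items():
            if key == '#text':
                key = cdata_key
            elif key.startswith('@'):
                key = attr_prefix + key[1:]
            renamed[key] = rename_prefixes(value, attr_prefix, cdata_key)
        return renamed
    if isinstance(node, list):
        return [rename_prefixes(item, attr_prefix, cdata_key) for item in node]
    return node

doc = {'node1': {'node2': {'@ATTR': 'x', '#text': 'hello'}}}
print(rename_prefixes(doc))
```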

xml tabs and spaces not ignored when creating json file

Here is my python code

import io, json, xmltodict
infile = io.open(xmlfilenameIn, 'r')
outfile = io.open(jsonfilenameOut, 'wb')
o = xmltodict.parse(infile.read())
json.dump(o, outfile, indent=2)

leads this xml

    <Service>
      <kind>1</kind>
    </Service>

to the following json

        "Service": {
          "kind": "1", 
          "#text": "\n          \n        "
        },

And it repeats throughout the json file: all newlines and tabulations become "#text" elements.

Is there an option to ignore the tabulation and newlines in the input xml when creating the corresponding json?

Thanks

Support for Semi-Structured XML

Semi-structured element should not be parsed and the content should be kept as is.
For example:

<level1>
    <level2>
        <level3>First level3.</level3>
        text outside 1st level3 at the end
    </level2>
    <level2>
        text outside 2nd level3 at the beginning
        <level3>Second level3.</level3>
        text outside 2nd level3 at the end
    </level2>
    <level2>
        text outside 3rd level3 at the beginning
        <level3>Third leve3.</level3>
    </level2>
</level1>

will produce (at the corresponding levels):

level2: [
{ #text: "text outside 1st level3 at the end", level3: "First level3." },
{ #text: "text outside 2nd level3 at the beginningtext outside 2nd level3 at the end", level3: "Second level3." },
{ #text: "text outside 3rd level3 at the beginning", level3: "Third level3." },
]

which is not only irreversible (not keeping order) but for 2nd level3 also meaningless (joining the texts). According to "spec" you claim you are adopting, at least the case of 2nd level3 should not be parsed.

It would be also great if a tag name(s) could be specified (as a parameter to parse function) whose content wouldn't be parsed at all. It could also solve the described issue sometimes (as the user would specify that tag level2 shouldn't be parsed and its content should be kept in #text property).

Thank you in advance for your comments.

Latest version is not x.x.x :-)

Hi, sorry for this issue,
but I'm looking to package xmltodict for Debian, and I have a problem with the latest tag.
Could you change v0.9 to v0.9.0?
Thanks!

Add documentation for unparse

I'm not sure if the unparse support is officially done, but there are tests for it. It would be nice to call out this functionality in the README.

Create Changelog.txt

There is no easy way to see changes between versions other than dig through all commits.

A simple Changelog.txt file in the root listing all relevant changes would help.

Only OrderedDicts are returned

This may just be a documentation issue, but when I run: (Python 2.7 OS X)

foo = xmltodict.parse("""<?xml version="1.0" ?>
        <person>
          <name>john</name>
          <age>20</age>
        </person>""")

print foo

I get:

Output: OrderedDict([(u'person', OrderedDict([(u'name', u'john'), (u'age', u'20')]))])

In a nested XML document, this makes it hard for me to turn the result into JSON.
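
For what it is worth, OrderedDict is a dict subclass, so the standard json module serializes it directly, nested or not. A minimal sketch using a hand-built structure of the same shape as the output above:

```python
import json
from collections import OrderedDict

foo = OrderedDict([
    ('person', OrderedDict([('name', 'john'), ('age', '20')])),
])
as_json = json.dumps(foo)
print(as_json)
```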

Support for object member names that are not valid XML element names, & etc.

Suppose I want an end-result like so:

'Top': {
    'One_Thing': 'The first thing',
    'Two Thing': 'The second thing',
    27: 'And another thing'
}

This is only possible if the XML looks like this:

<Top>
    <One_Thing>The first thing</One_Thing>
    <Two Thing>The second thing</Two Thing>
    <27>And another thing</27>
</Top>

...which is of course, invalid XML. Element names cannot contain spaces, and there are other characters which are completely valid in a Python dict but are not valid in an XML element name. Element names cannot be numbers either. This limits the use and applicability of this library.

Another approach:

<Top>
    <thing name="One Thing">The first thing</thing>
    <thing name="Two Thing">The second thing</thing>
    <thing name="27">And another thing</thing>
</Top>

also does not work, because the @name values come out as '@name' and the actual strings come out as #text... nobody wants that. Further, these come out as an array under the name 'thing' - which was not really the intent.

I propose a solution which allows the caller to 'annotate' the XML to define the manner in which the dict is generated. This would enable the generator of the XML to direct the library in ways that are heretofore simply assumptions.

Consider:

<Top>
    <One_Thing _x2d_name="One Thing">The first thing</One_Thing>
    <Two_Thing _x2d_name="Two Thing">The second thing</Two_Thing>
    <Ano_Thing _x2d_name="27">And another thing</Ano_Thing>
</Top>

this annotation attribute could be 'sniffed' by the xmltodict library and understood as a directive to rename the generated element to a value which can easily be stored in an attribute, but which could not otherwise have been in the element name.

This approach, I would argue, could also be used to help xmltodict generate dictionaries more like users want without having to implement complex post-processing methods.

For instance:

<Top>
    <Jan _x2d_type="float">37</Jan>
    <Feb _x2d_type="float">-42.8</Feb>
</Top>

could be used to cause xmltodict to convert the text to a particular python type that is not a string. Currently, this must be done with post-processing.

There are other common operations which might be valuable additions to this - like converting a comma-separated list to an actual array...

Without turning this into a full product map, I'd like to propose that at least these two xmltodict directive attributes be supported:

_x2d_name
_x2d_type

These would be the first directive attributes that are really 'meta' processing commands, digested during conversion. Callers could annotate the XML prior to passing it into the processor, therefore avoiding having to write post-processing for common conversion needs.
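
Until something like this exists in the library, the proposal can be approximated with post-processing. The sketch below is an assumption-laden illustration, not an xmltodict feature: it walks a dict of the shape parse() returns with the default @ and #text markers, renames elements carrying @_x2d_name, and converts text for @_x2d_type using a small hypothetical CONVERTERS table. Renamed keys stay strings, so "27" would not become the integer 27.

```python
CONVERTERS = {'float': float, 'int': int, 'str': str}

def apply_directives(node):
    """Apply the proposed _x2d_name / _x2d_type attributes to parsed output."""
    if isinstance(node, list):
        return [apply_directives(item) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        if isinstance(value, dict):
            value = dict(value)                       # do not mutate the input
            new_name = value.pop('@_x2d_name', None)
            type_name = value.pop('@_x2d_type', None)
            if new_name is not None:
                key = new_name
            if set(value) <= {'#text'}:               # only text is left
                value = value.get('#text')
            if type_name is not None and isinstance(value, str):
                value = CONVERTERS[type_name](value)
        result[key] = apply_directives(value)
    return result

parsed = {'Top': {
    'One_Thing': {'@_x2d_name': 'One Thing', '#text': 'The first thing'},
    'Jan': {'@_x2d_type': 'float', '#text': '37'},
    'Feb': {'@_x2d_type': 'float', '#text': '-42.8'},
}}
converted = apply_directives(parsed)
print(converted)
```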

odd parse for same xml structure

Example

In [1]: import xmltodict
In [2]: rdf = '''
   ...:   <RDF:Seq RDF:about="urn:scrapbook:item20070113201921">
   ...:     <RDF:li RDF:resource="urn:scrapbook:item20070113201940"/>
   ...:   </RDF:Seq>
   ...: '''

In [3]: doc = xmltodict.parse(rdf)

In [6]: rdf2='''  <RDF:Seq RDF:about="urn:scrapbook:item20070113201921">
   ...:     <RDF:li RDF:resource="urn:scrapbook:item20070113201940"/>
   ...:     <RDF:li RDF:resource="urn:scrapbook:item20070113201941"/>
   ...:   </RDF:Seq>
   ...: '''

In [7]: doc2=xmltodict.parse(rdf2)

In [8]: doc['RDF:Seq']['RDF:li']
Out[8]: OrderedDict([(u'@RDF:resource', u'urn:scrapbook:item20070113201940')])

In [9]: doc2['RDF:Seq']['RDF:li']
Out[9]:
[OrderedDict([(u'@RDF:resource', u'urn:scrapbook:item20070113201940')]),
 OrderedDict([(u'@RDF:resource', u'urn:scrapbook:item20070113201941')])]

Problem

While using this great library, I discovered the following:

if a node has only one child of a given name,
parse() returns a single object, NOT a list.

This means that when I walk a big XML document, I have to check the number of child nodes
and use different code for each case!

For example:

for li in SOME_XML[..]['RDF:li']:
    print li
    ...

does not work as intended; I must rewrite it as:

if not isinstance(SOME_XML[..]['RDF:li'], list):
    print SOME_XML[..]['RDF:li']
else:
    for li in SOME_XML[..]['RDF:li']:
        print li
        ...

That is not Pythonic.

Hope

It would be better if a list were returned in all cases.

Thanks for everything.
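
Later releases of xmltodict added a force_list argument to parse() for this situation. For versions without it, the whole parsed tree can be normalized once, so that chosen keys always hold lists. This is a sketch with a hypothetical helper name, applied to the single-child shape from the example above:

```python
def force_lists(node, keys):
    """Wrap the values of the given keys in lists wherever they are single items."""
    if isinstance(node, list):
        return [force_lists(item, keys) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        value = force_lists(value, keys)
        if key in keys and not isinstance(value, list):
            value = [value]
        result[key] = value
    return result

doc = {'RDF:Seq': {'@RDF:about': 'urn:scrapbook:item20070113201921',
                   'RDF:li': {'@RDF:resource': 'urn:scrapbook:item20070113201940'}}}
normalized = force_lists(doc, {'RDF:li'})
for li in normalized['RDF:Seq']['RDF:li']:
    print(li)
```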

PIP install on Windows 7 64-bit fails

I just installed Python 2.7.6 64-bit on Windows 7 Home Premium and added pip.

When I try to pip install xmltodict, I get this stack trace. Looks like the code is hitting a UNICODE BOM mark or something.

C:\Python27\Scripts>pip install xmltodict
Downloading/unpacking xmltodict
Downloading xmltodict-0.9.0.tar.gz
Cleaning up...
Exception:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\pip\basecommand.py", line 122, in main
status = self.run(options, args)
File "C:\Python27\lib\site-packages\pip\commands\install.py", line 278, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "C:\Python27\lib\site-packages\pip\req.py", line 1229, in prepare_files
req_to_install.run_egg_info()
File "C:\Python27\lib\site-packages\pip\req.py", line 292, in run_egg_info
logger.notify('Running setup.py (path:%s) egg_info for package %s' % (self.setup_py, self.name))
File "C:\Python27\lib\site-packages\pip\req.py", line 265, in setup_py
import setuptools
File "C:\Python27\lib\site-packages\setuptools\__init__.py", line 12, in <module>
from setuptools.extension import Extension
File "C:\Python27\lib\site-packages\setuptools\extension.py", line 7, in <module>
from setuptools.dist import _get_unpatched
File "C:\Python27\lib\site-packages\setuptools\dist.py", line 15, in <module>
from setuptools.compat import numeric_types, basestring
File "C:\Python27\lib\site-packages\setuptools\compat.py", line 19, in <module>
from SimpleHTTPServer import SimpleHTTPRequestHandler
File "C:\Python27\lib\SimpleHTTPServer.py", line 27, in <module>
class SimpleHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
File "C:\Python27\lib\SimpleHTTPServer.py", line 208, in SimpleHTTPRequestHandler
mimetypes.init() # try to read system mime.types
File "C:\Python27\lib\mimetypes.py", line 358, in init
db.read_windows_registry()
File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry
for subkeyname in enum_types(hkcr):
File "C:\Python27\lib\mimetypes.py", line 249, in enum_types
ctype = ctype.encode(default_encoding) # omit in 3.x!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 7: ordinal not in range(128)

Storing debug log for failure in F:\Users\Virtus Draco\pip\pip.log

C:\Python27\Scripts>

Install fails with pip because of a download error for coverage module

Installing xmltodict with pip fails complaining about not being able to find coverage:

$ pip install xmltodict
Downloading/unpacking xmltodict
  Downloading xmltodict-0.8.3.tar.gz
  Running setup.py egg_info for package xmltodict
    Download error on https://pypi.python.org/simple/coverage/: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed -- Some packages may not be found!
    Couldn't find index page for 'coverage' (maybe misspelled?)
    Download error on https://pypi.python.org/simple/: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed -- Some packages may not be found!
    No local packages or download links found for coverage
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/tmp/pip_build_sspadmin/xmltodict/setup.py", line 36, in <module>
        setup_requires=['nose>=1.0', 'coverage'],
      File "/home/sspadmin/.pyenv/versions/2.7.5/lib/python2.7/distutils/core.py", line 112, in setup
        _setup_distribution = dist = klass(attrs)
      File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 239, in __init__
      File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 263, in fetch_build_eggs
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 568, in resolve
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 806, in best_match
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 818, in obtain
      File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 313, in fetch_build_egg
      File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 603, in easy_install
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('coverage')
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.python.org/simple/coverage/: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed -- Some packages may not be found!

Couldn't find index page for 'coverage' (maybe misspelled?)

Download error on https://pypi.python.org/simple/: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed -- Some packages may not be found!

No local packages or download links found for coverage

Traceback (most recent call last):

  File "<string>", line 16, in <module>

  File "/tmp/pip_build_sspadmin/xmltodict/setup.py", line 36, in <module>

    setup_requires=['nose>=1.0', 'coverage'],

  File "/home/sspadmin/.pyenv/versions/2.7.5/lib/python2.7/distutils/core.py", line 112, in setup

    _setup_distribution = dist = klass(attrs)

  File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 239, in __init__

  File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 263, in fetch_build_eggs

  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 568, in resolve

  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 806, in best_match

  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 818, in obtain

  File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 313, in fetch_build_egg

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 603, in easy_install

distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('coverage')

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_sspadmin/xmltodict

However installation of coverage can be done with pip install coverage and then pip install xmltodict works fine. Any ideas?

postprocessor never called for cdata items

see dpla-attic@d2b08f6#commitcomment-2407288

No attributes when using streaming, item_depth and item_callback

Suppose we have some xml:

<items>
    <item attr='hello'>
        <inner>value</inner>
        <inner2>value2</inner2>
    </item>
</items>

Now we parse it:

xmltodict.parse(open(xmlfile))
>>> OrderedDict([(u'items', OrderedDict([(u'item', OrderedDict([(u'@attr', u'hello'), (u'inner', u'value'), (u'inner2', u'value2')]))]))])

Everything seems fine, we have everything we expect.

But if we use item_depth and item_callback, we'll have some problems:

def cb(_, item):
    print item

xmltodict.parse(open(xmlfile), item_depth=2, item_callback=cb)
>>> OrderedDict([(u'inner', u'value'), (u'inner2', u'value2')])

As you can see, in the second example we've lost attr="hello".

xml_to_dict return incorrect dict

Allow pretty printing XML in unparse()

xmltodict should have the option to pretty print XML with correct indentation. The default behavior should stay the same (print all inline), and this new feature should only be enabled with an optional argument to unparse().

Cannot Handle Multiple Non-Nested Tags

<'test>0</'test>
<'another>1</'another>

Attempting to parse that raises the error "junk after document element".
I was expecting it to return a dictionary {"test" : 0, "another" : 1}

Ignore the quotes in the middle of tags...
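
XML allows exactly one document element, which is why Expat reports "junk after document element" for two top-level siblings. A common workaround is to wrap the fragment in a synthetic root before parsing. The sketch below demonstrates the idea with the standard library's ElementTree; the root name is arbitrary, and the same wrapped string could be passed to xmltodict.parse, with the result found under the 'root' key. Values come back as strings, not integers.

```python
import xml.etree.ElementTree as ET

fragment = "<test>0</test>\n<another>1</another>"

# Wrap the sibling elements in a synthetic root (the name "root" is arbitrary)
# so that the input is a well-formed document.
wrapped = "<root>" + fragment + "</root>"
root = ET.fromstring(wrapped)
result = {child.tag: child.text for child in root}
print(result)
```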

Provide easy way to lose items ordering

In order to compare XML files whose elements appear in a different order, it would be great if the library could place items in unordered dicts.
The same behaviour can be achieved by applying the following function to the parse() output:

def disorder(ordered):
    if hasattr(ordered, 'items'):
        return dict((k, disorder(v))
                    for k, v in ordered.items())
    if isinstance(ordered, list):
        return [disorder(v) for v in ordered]
    return ordered

Filter during parsing?

I would like to be able to parse a subset of an XML document, for example, if I don't need the entire document, but need a particular deeply-nested tree starting at some path within the XML document.

Is there a way to do this using the new postprocessor feature mentioned in #6? I don't see a way to exclude elements that way, but it's possible I am misreading the docs/code.

No problem here..

Just a big thank you for writing this library, it works perfectly and does exactly what I needed.. your work has saved me many hours of heart ache messing around with lxml.. Thank you!!!!

Cal

Too many newlines when unparse() with "pretty" option

When using this as test2.xml :

<?xml version="1.0" encoding="utf-8"?>
<mydocument>
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus>element as well</plus>
</mydocument>

this Python code :

import xmltodict
with open('test2.xml') as f: xml = f.read()
doc = xmltodict.parse(xml)
out = xmltodict.unparse(doc, pretty = True)
with open('out2.xml', 'w') as g: g.write(out)

will give an output with too many blank lines, like :

<?xml version="1.0" encoding="utf-8"?>
<mydocument>
    <and>
        <many>elements</many>

        <many>more elements</many>
    </and>

    <plus>element as well</plus>
</mydocument>

xmltodict.parse() should handle unicode objects as input

Pretty self-explanatory:

>>> import xmltodict
>>> xmltodict.parse(u"<A>香</A>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "xmltodict.py", line 193, in parse
    parser.Parse(xml_input, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in position 3: ordinal not in range(128)
>>> xmltodict.parse(u"<A>香</A>".encode('utf-8'))
OrderedDict([(u'A', u'\u9999')])

preprocessing callback

It would be nice to have the same functionality the post processor provides in a preprocessor callback. This would allow round-tripping. What do you think?

xmltodict.unparse method sorts subelements

What steps will reproduce the problem?

given xml file with such element:

<masterserver id="main" cmd="--xmpp_resource=main_dal"/>

parse xml file into dict
unparse dict into xml file

What is the expected output? What do you see instead?
expected xml file with such element:

<masterserver id="main" cmd="--xmpp_resource=main_dal"/>

What version of the product are you using? On what operating system?
returned xml file with such element:

<masterserver cmd="--xmpp_resource=main_dal" id="main"/>

Please provide any additional information below.
Actually the XML is OK, but I don't need it to be modified in this way.
I mean the unparse method apparently sorts subelements in alphabetical order, which is undesirable.
Thanks

Problem with installation related to use_setuptools

Hello!

When I try to install the package I get: http://paste.in.ua/8875/

If I remove from distribute_setup import use_setuptools; use_setuptools() from setup.py, everything works fine.

I wonder why upgrading distribute is required for installing xmltodict.

Thanks

pip install fails due to missing README.md

long_description=open('README.md').read())

FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

However, downloading 0.8.3 and running setup.py install works fine.

Cannot parse &

Cannot parse xml with & character:

import xmltodict
xmltodict.parse("<xtra> Cal\&Atilde;&sect;a Jeans Alex com Lavagem Clara - Colcci</xtra>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/barroca/Developer/heroku/meliuz/meliuz-env/lib/python2.7/site-packages/xmltodict.py", line 189, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: undefined entity: line 1, column 11

It might be a problem with expat.
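
XML predefines only five named entities (amp, lt, gt, quot, apos); names such as Atilde and sect come from HTML, so Expat rejects them unless a DTD declares them. One possible workaround is to rewrite HTML entity names as numeric character references before parsing. The helper name is hypothetical and the input is a shortened version of the string above; only the standard library is used, and the rewritten string could be handed to xmltodict.parse in the same way.

```python
import re
from html.entities import name2codepoint
from xml.parsers import expat

XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

def html_entities_to_numeric(text):
    """Rewrite HTML named entities (e.g. &sect;) as numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)      # leave XML's own and unknown names alone
        return '&#%d;' % name2codepoint[name]
    return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', replace, text)

source = "<xtra>Cal&Atilde;&sect;a Jeans</xtra>"
fixed = html_entities_to_numeric(source)

# Parse the rewritten text with Expat to confirm it is now accepted.
chunks = []
parser = expat.ParserCreate()
parser.CharacterDataHandler = chunks.append
parser.Parse(fixed, True)
print(''.join(chunks))
```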

Add XML Security

Expat doesn't have any XML security built in. There is a newer library that addresses this:

https://pypi.python.org/pypi/defusedexpat/#modifications-in-pyexpat

It would be nice if xmltodict migrated to this (at least consider it).

failed to install in virtualenv for version 0.8.0

Downloading/unpacking xmltodict==0.8.0
Downloading xmltodict-0.8.0.tar.gz
Running setup.py egg_info for package xmltodict
Traceback (most recent call last):
File "", line 16, in
File "/home/jessica/.virtualenvs/sch_env/build/xmltodict/setup.py", line 2, in
from distribute_setup import use_setuptools
ImportError: No module named distribute_setup
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 16, in
File "/home/jessica/.virtualenvs/sch_env/build/xmltodict/setup.py", line 2, in
from distribute_setup import use_setuptools
ImportError: No module named distribute_setup

Cleaning up...
Command python setup.py egg_info failed with error code 1 in /home/jessica/.virtualenvs/sch_env/build/xmltodict

consider adding a built-in function to convert the OrderedDict to simple dict

Sometimes users only need a simple dict because they only care about the content.

Although users can always use dict(ordered_dict_obj) to convert the top level into a dict, the object generated by xmltodict is usually a nested OrderedDict of OrderedDicts. Users then have to write a recursive function to convert every nested OrderedDict to a dict. I think a built-in .to_simple_nested_dict() would be handy and useful.
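The recursive conversion described above can be sketched as follows. This is only an illustration of the idea; to_plain_dict is a hypothetical helper name, not an xmltodict function, and the sample data is built by hand to mirror the README example.

```python
from collections import OrderedDict


def to_plain_dict(value):
    """Recursively convert OrderedDicts (and dicts inside lists) to plain dicts."""
    if isinstance(value, dict):  # OrderedDict is a dict subclass
        return {key: to_plain_dict(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_plain_dict(item) for item in value]
    return value  # strings, None, and other leaf values pass through unchanged


nested = OrderedDict([
    ("mydocument", OrderedDict([
        ("@has", "an attribute"),
        ("and", OrderedDict([("many", ["elements", "more elements"])])),
    ])),
])
plain = to_plain_dict(nested)
```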

Whitespace inside of elements is stripped

Whitespace appears to be stripped from the text inside elements; specifically, \n is removed. While it may be bad practice, this is still valid XML, and the newlines shouldn't be removed unless requested.

XML in question (yes, I deliberately set OrderedDict = dict for my project):

<?xml version="1.0" encoding="UTF-8"?>
<playlist xmlns="http://xspf.org/ns/0/" version="1">
  <title/>
  <creator/>
  <trackList>
    <track>
      <location>http://stream.r-a-d.io:8000/main.mp3</location>
      <title>Makino Yui - Yokogao -acoustic version-</title>
      <annotation>Stream Title: R/a/dio
Stream Description: Unspecified description
Content Type:audio/mpeg
Current Listeners: 43
Peak Listeners: 52
Stream Genre: Japanese Music</annotation>
      <info>https://r-a-d.io</info>
    </track>
  </trackList>
</playlist>

Suddenly becomes invalid:

>>> x = xmltodict.parse(xml, xml_attribs=False)
>>> x["playlist"]["trackList"]["track"]["annotation"]
u'Stream Title: R/a/dioStream Description: Unspecified descriptionContent Type:audio/mpegCurrent Listeners: 43Peak Listeners: 52Stream Genre: Japanese Music'

I suspect this is due to data.strip() on every line.
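The suspicion above can be illustrated with the standard library expat parser alone. This is a sketch, not xmltodict's actual code: collect_text is a hypothetical helper, and it relies on expat typically delivering each newline inside text as its own character data chunk, so stripping every chunk discards the line breaks.

```python
from xml.parsers import expat


def collect_text(xml, strip_chunks):
    """Join the character data chunks expat reports, optionally stripping each one."""
    chunks = []

    def on_text(data):
        chunks.append(data.strip() if strip_chunks else data)

    parser = expat.ParserCreate()
    parser.CharacterDataHandler = on_text
    parser.Parse(xml, True)
    return "".join(chunks)


xml = "<annotation>Stream Title: R/a/dio\nCurrent Listeners: 43</annotation>"
stripped = collect_text(xml, strip_chunks=True)    # line break is lost
preserved = collect_text(xml, strip_chunks=False)  # line break survives
```

Stripping only once, on the fully joined text, would keep the inner newlines while still trimming leading and trailing whitespace.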

Error parsing gzip file - trying to run example from Readme

Python 2.7.5 |Anaconda 1.6.1 (x86_64)| (default, Jun 28 2013, 22:20:13) 
Type "copyright", "credits" or "license" for more information.

IPython 1.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: def handle_artist(_, artist):
   ...:         print artist['name']
   ...:     

In [2]: import xmltodict

In [3]: from gzip import GzipFile


In [4]: xmltodict.parse(GzipFile('discogs_20130201_artists.xml.gz'),item_depth=2, item_callback=handle_artist)
Persuader, The
---------------------------------------------------------------------------
ParsingInterrupted                        Traceback (most recent call last)
<ipython-input-4-70a78ef7cd1d> in <module>()
----> 1 xmltodict.parse(GzipFile('discogs_20130201_artists.xml.gz'),item_depth=2, item_callback=handle_artist)

/opt/anaconda/lib/python2.7/site-packages/xmltodict.pyc in parse(xml_input, encoding, expat, process_namespaces, namespace_separator, **kwargs)
    222     parser.CharacterDataHandler = handler.characters
    223     try:
--> 224         parser.ParseFile(xml_input)
    225     except (TypeError, AttributeError):
    226         if isinstance(xml_input, _unicode):

/opt/anaconda/lib/python2.7/site-packages/xmltodict.pyc in endElement(self, full_name)
    102             should_continue = self.item_callback(self.path, item)
    103             if not should_continue:
--> 104                 raise ParsingInterrupted()
    105         if len(self.stack):
    106             item, data = self.item, self.data

ParsingInterrupted:
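Judging from the traceback, ParsingInterrupted is raised whenever item_callback returns a falsy value, and handle_artist above has no return statement, so it returns None after the first artist. The sketch below mimics that check with a hypothetical stand-in driver (feed_items and the local ParsingInterrupted class are illustrations, not xmltodict's code, and the second artist name is made up) to show that returning True keeps the loop going.

```python
class ParsingInterrupted(Exception):
    """Stand-in for the exception shown in the traceback."""


def feed_items(items, item_callback):
    """Hypothetical driver: call the callback per item, stop on a falsy return."""
    for path, item in items:
        should_continue = item_callback(path, item)
        if not should_continue:
            raise ParsingInterrupted()


names = []


def handle_artist(_, artist):
    names.append(artist["name"])
    return True  # without this line the callback returns None and parsing stops


artists = [
    ("artist", {"name": "Persuader, The"}),
    ("artist", {"name": "Another Artist"}),
]
feed_items(artists, handle_artist)
```

With the real library, the equivalent change would be adding "return True" at the end of the callback passed as item_callback.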

Namespaces Bug

The second half of the Namespace Support example in the Readme file does not properly replace namespaces as displayed.

consider adding a built-in function to obtain the raw text?

For example,

>>> doc = xmltodict.parse("""
... <mydocument has="an attribute">
...   <and>
...     <many>elements</many>
...     <many>more elements</many>
...   </and>
...   <plus a="complex">
...     element as well
...   </plus>
... </mydocument>
... """)
>>> 
>>> doc['mydocument']['and']
OrderedDict([(u'many', [u'elements', u'more elements'])])

I want to obtain the raw text:

>>> magic_raw_text(doc['mydocument']['and'])
u'<many>elements</many>\n<many>more elements</many>'

I think a built-in magic_raw_text() is useful.

Can't parse URL attributes with & symbol in it

<teststatus>
<thing id="0 1" xlink:href="http://example.com/form/Api?op=do_something&port=6">
<another>0</another>
</thing>
</teststatus>

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3, column 76
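A bare & is not allowed inside an XML attribute value; it has to be written as &amp;, which is why expat rejects the document above as not well-formed. The following sketch, using only the standard library, shows one way to escape the URL when generating the XML and confirms that the parser hands back the original URL. The variable names are illustrative.

```python
from xml.parsers import expat
from xml.sax.saxutils import quoteattr

url = "http://example.com/form/Api?op=do_something&port=6"

# quoteattr escapes &, < and > and wraps the value in quotes.
xml = (
    '<teststatus><thing id="0 1" xlink:href=%s>'
    "<another>0</another></thing></teststatus>" % quoteattr(url)
)

seen = {}


def on_start(name, attrs):
    # Record the attributes of the <thing> element as the parser reports them.
    if name == "thing":
        seen.update(attrs)


parser = expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(xml, True)
```

If the XML comes from a third party, the fix belongs on the producing side, since the input is not well-formed XML as it stands.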

Feature request: Namespace support

This project is great! Does it support namespaces? I don't see anything in the examples or tests.

Unfortunately I have to deal with both building and parsing a lot of web-servicey XML where you have a half dozen or more namespaces in a given document. LXML is the only Python framework I've found that sufficiently deals with this issue, by way of the namespaces parameter so you don't have to qualify each element with the full namespace URI.

I'd like to request an additional optional parameter to xmltodict.parse():

namespaces = {
    'one': 'http://one.com',
    'two': 'http://two.com',
}

doc = xmltodict.parse("""
 <mydocument 
        xmlns='http://one.com'
        xmlns:two='http://two.com' 
        xmlns:three='http://three.com'
        has="an attribute">
   <two:and>
     <many>elements</many>
     <many>more elements</many>
   </two:and>
   <plus three:a="complex">
     element as well
   </plus>
 </mydocument>""", namespaces=namespaces)

doc['one:mydocument']['two:and']

doc['one:mydocument']['one:plus']['@three:a']


How to always get a list of OrderedDicts

Hello,

I have one question:
how can I modify the parse function so that I always get a list of OrderedDicts, not only when the same node name occurs more than once?
In this sample the output is a list of 3 OrderedDicts:

<modules>
   <modul id="1"/>
   <modul id="2"/>
   <modul id="3"/>
</modules>

OrderedDict([(u'modules', OrderedDict([(u'modul', [OrderedDict([(u'@id', u'1')]), OrderedDict([(u'@id', u'2')]), OrderedDict([(u'@id', u'3')])])]))])

and here there is only a single OrderedDict, not a list:

<modules>
   <modul id="1"/>
</modules>

OrderedDict([(u'modules', OrderedDict([(u'modul', OrderedDict([(u'@id', u'1')]))]))])

It is much more complicated to walk over the dictionary when the values have different types.

any help will be appreciated :)

Piotr
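Until the parser itself offers such an option, one way to cope on the consuming side is to normalize the value before iterating. This is a minimal sketch; as_list is a hypothetical helper, and the two dicts are written by hand to mirror the outputs shown above.

```python
def as_list(value):
    """Return value unchanged if it is a list, else wrap it in a one-item list."""
    if value is None:
        return []  # missing or empty element: nothing to iterate over
    if isinstance(value, list):
        return value
    return [value]


one = {"modules": {"modul": {"@id": "1"}}}
many = {"modules": {"modul": [{"@id": "1"}, {"@id": "2"}, {"@id": "3"}]}}

# The same loop now works for both shapes.
ids_one = [modul["@id"] for modul in as_list(one["modules"]["modul"])]
ids_many = [modul["@id"] for modul in as_list(many["modules"]["modul"])]
```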

Ability to keep <![CDATA[ and ]]> tags from being converted when unparsing

When programmatically creating, for example, Android XML string files with your library's unparse functionality, the library converts "<![CDATA[" into "&lt;![CDATA[" and "]]>" into "]]&gt;".

I don't know about other xml readers, but the Android SDK's way of allowing developers to add string resources that contain HTML symbols etc is to use CDATA tags like that.
I.e. you wrap your html string in CDATA tags and the android sdk doesn't go crazy when trying to parse the xml.

This is the xml I want to construct:

<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="some_html"><![CDATA[<a href="http://www.space.com">Hi my name is Bob</a>]]></string>
</resources>

This is what I pass to unparse:

val = '<![CDATA[<a href="http://www.space.com">Hi my name is Bob</a>]]>'
strings = [{"@name": "some_html", "#text": val}]
xmld = {"resources": {"string": strings}}
xml = xmltodict.unparse(xmld, pretty=True)

And this is the result:

<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="some_html">&lt;![CDATA[&lt;a href="http://www.space.com"&gt;Hi my name is Bob&lt;/a&gt;]]&gt;</string>
</resources>

I tried to find a way to tell unparse not to do anything fancy for this particular value, but had no luck, and I can't see anything obvious in your samples. Is there a way to achieve what I'm trying to do here?
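For what it's worth, to a conforming XML parser a CDATA section and the equivalent escaped text carry the same character data, so passing the raw HTML (without the CDATA wrapper) as "#text" and letting unparse escape it may already be enough. The sketch below checks that equivalence with the standard library expat parser; text_of is a hypothetical helper, and whether the Android tooling treats both forms identically is an assumption not verified here.

```python
from xml.parsers import expat


def text_of(xml):
    """Return the concatenated character data of a small XML document."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml, True)
    return "".join(chunks)


with_cdata = (
    '<string name="some_html">'
    '<![CDATA[<a href="http://www.space.com">Hi my name is Bob</a>]]>'
    "</string>"
)
escaped = (
    '<string name="some_html">'
    '&lt;a href="http://www.space.com"&gt;Hi my name is Bob&lt;/a&gt;'
    "</string>"
)

same = text_of(with_cdata) == text_of(escaped)
```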

Error on pip installation

This module looks pretty cool, but I'm getting the following error when I run "pip install xml2dict":

Downloading/unpacking xml2dict
  Downloading XML2Dict-0.2.1.tar.gz
  Running setup.py egg_info for package xml2dict
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/john/tmp/build/xml2dict/setup.py", line 14, in <module>
        long_description=open('README.md').read())
    IOError: [Errno 2] No such file or directory: 'README.md'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/john/tmp/build/xml2dict/setup.py", line 14, in <module>
        long_description=open('README.md').read())
    IOError: [Errno 2] No such file or directory: 'README.md'

----------------------------------------
Command python setup.py egg_info failed with error code 1
Storing complete log in /home/john/.pip/pip.log

Example in docs not working for wiki dumps?

Hi Martin,

This looks like a fantastic tool and thanks for making it. I've been trying to parse wikipedia dumps into JSON, and tried out your program. I copied your wikipedia example and had the following error:

$ nano myscript.py
import sys, marshal
while True:
    _, article = marshal.load(sys.stdin)
    print article['title']

$ cat enwiki-latest-pages-articles.xml.bz2 | bunzip2 | ./xmltodict/xmltodict.py 2 | ./myscript.py
./myscript.py: line 1: import: command not found
./myscript.py: line 3: syntax error near unexpected token `('
./myscript.py: line 3: `    _, article = marshal.load(sys.stdin)'

I have an outstanding Stack Overflow question about parsing bigass wikipedia XML dumps, with a couple of my errors:
http://stackoverflow.com/questions/17286183/loading-all-of-wikipedia-data-into-mongodb

Also noticed that maybe it was interpreting my myscript.py as bash instead of python, so reran with:

$ cat enwiki-latest-pages-articles.xml.bz2 | bunzip2 | ./xmltodict/xmltodict.py 2 | python myscript.py
Traceback (most recent call last):
  File "myscript.py", line 4, in <module>
    print article['title']
KeyError: 'title'
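One likely cause of the KeyError: Wikipedia dumps typically start with a <siteinfo> element before the first <page>, and at depth 2 that element is emitted as an item too, but it has no 'title' key. The sketch below shows a more defensive reading loop, assuming (as in the script above) that each record is a marshal-dumped (path, item) pair. iter_titles is a hypothetical helper, the records are made-up stand-ins written to an in-memory buffer, and the code is Python 3.

```python
import io
import marshal


def iter_titles(stream):
    """Yield the title of every record that has one, until the stream ends."""
    while True:
        try:
            _, item = marshal.load(stream)
        except EOFError:
            return  # end of input: stop cleanly instead of crashing
        if isinstance(item, dict) and "title" in item:
            yield item["title"]


# Simulated input: a siteinfo-like record followed by a page record (made-up data).
buffer = io.BytesIO()
marshal.dump((["mediawiki", "siteinfo"], {"sitename": "Wikipedia"}), buffer)
marshal.dump((["mediawiki", "page"], {"title": "Example"}), buffer)
buffer.seek(0)

titles = list(iter_titles(buffer))
```

In a real pipeline the stream would be sys.stdin.buffer, and since the marshal format is not guaranteed to be compatible across Python versions, both ends of the pipe should run under the same interpreter.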