messytables's Issues

Document dependency to file

I installed messytables but it won't run because of a missing system package: file. That's the lib that provides the filetype detection, I guess.

I had to install it via brew install file-formula or sudo port install file.

It would be nice to document this dependency somewhere, so other users are saved the time of looking up the error. Alternatively, messytables should only offer the filetype detection feature if the package is installed, and print a friendly message if the user tries to use it anyway. My guess is that big chunks of messytables would run without auto-detection.
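The graceful-degradation option could look something like this sketch (the function name and message wording are assumptions, not messytables' actual code):

```python
# Sketch: make python-magic (and thus the system `file` library) optional.
try:
    import magic  # needs libmagic, provided by the `file` package
    HAS_MAGIC = True
except ImportError:
    HAS_MAGIC = False

def guess_mimetype(sample):
    """Return a MIME type for a byte sample, or fail with a hint
    when filetype detection is unavailable."""
    if not HAS_MAGIC:
        raise RuntimeError(
            "filetype detection requires the 'file' library; "
            "install it via `brew install file-formula` or "
            "`sudo port install file`")
    return magic.from_buffer(sample, mime=True)
```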

Remove all instances of seek() so that we can handle streaming data again

After some format detection fixes, we now have a few calls to seek() in the CSV module. Those cannot work on urllib-style http request data. One of the main use cases for messytables is to do streaming web data. We should remove these calls, even if this results in a loss of functionality wrt. type detection.
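One seek-free alternative is to buffer only an initial sample and replay it before the rest of the stream; a sketch of the idea (this class is illustrative, not messytables' BufferedFile):

```python
import io

class ReplayStream(io.RawIOBase):
    """Sketch: wrap a non-seekable stream so an initial sample can be
    re-read without calling seek()."""

    def __init__(self, stream, sample_size=1024):
        self._sample = stream.read(sample_size)  # detection sample
        self._pos = 0                            # replay position
        self._stream = stream

    @property
    def sample(self):
        """The bytes available for format/type detection."""
        return self._sample

    def read(self, n=-1):
        out = b""
        if self._pos < len(self._sample):
            if n == -1:
                out = self._sample[self._pos:]
                self._pos = len(self._sample)
            else:
                out = self._sample[self._pos:self._pos + n]
                self._pos += len(out)
                if len(out) == n:
                    return out
                n -= len(out)
        return out + self._stream.read(n)
```

Detection code would look only at `.sample`, and later readers would still see the whole stream from the start.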

Encrypted workbooks give misleading error

This error gets raised for password protected workbooks:

        except XLRDError:
            raise ReadError("Unsupported Excel format, or corrupt file")

It should (preferably!) just open them anyway, or at least tell you it is an encrypted workbook, rather than suppressing the detailed error message inside the thrown-away XLRDError structure.

Proposal: Remove openpyxl and use xlrd for XLSX files.

We keep running into XLSX files which are problematic for openpyxl, such as this one: https://www.treasurydirect.gov/govt/reports/pd/pd_sbredemptionsissuesbyseries.xlsx

openpyxl takes a friggin' long time and a tonne of memory to iterate through this spreadsheet.

xlrd appears to open and process it just fine.

openpyxl seems to be a little overwhelmed with issues. I also find it difficult to contribute due to them using Mercurial(!)

@domoritz @rossjones Do you guys have any view on dropping openpyxl?

A precondition, I think, would be to improve the XLSX tests first (I'm volunteering)

Tags when shipping.

It would be really cool if a tag were created when messytables is pushed to PyPI, with the same name as the version that was pushed.

BufferedFile.read() is incorrect

BufferedFile is used when the stream is non-seekable, such as the stream obtained from the raw property of a requests.get() response (which is, I believe, a urllib3.HTTPResponse). The read method on BufferedFile has a required parameter, and therefore doesn't work with the argument-less .read() used in XLSTableSet.__init__.

Specifying a default to read in BufferedFile (such as n=-1, read all) then causes a failure in XLRD as the buffer only reads 1024 bytes at a time.

Add JSONType to possible types

Add a JSONType for JSON values, specifically complex JSON types (lists, dicts, mixed forms); simple strings / numbers would just remain those types.
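A sketch of what such a type's test could look like (the class shape is assumed from the CellType pattern, not actual messytables code):

```python
import json

class JSONType(object):
    """Hypothetical cell type matching complex JSON values (lists,
    dicts); plain strings and numbers are left to the existing types."""

    def test(self, value):
        try:
            parsed = json.loads(value)
        except (ValueError, TypeError):
            return False
        return isinstance(parsed, (list, dict))
```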

Does it make sense to have MessyTables be able to create a SQL string?

I just found myself using MessyTables to write a SQL "CREATE TABLE" string. This involved converting MessyTable types to SQL query equivalents. It would also involve trying to figure out, for example, whether something is a varchar or a text blob.

I don't see a feature that does this yet. If I branch and create it would that be a useful contribution and in the direction MessyTables is hoping to go?
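As a rough sketch of the mapping involved (the type names and SQL choices here are assumptions, and deciding VARCHAR(n) vs TEXT would need max-length statistics from the sample):

```python
# Hypothetical mapping from messytables-style type names to SQL types.
SQL_TYPES = {
    "Integer": "INTEGER",
    "Decimal": "NUMERIC",
    "DateUtil": "TIMESTAMP",
    "String": "TEXT",
}

def create_table_sql(table, headers, types):
    """Build a CREATE TABLE statement from guessed column types."""
    cols = ", ".join('"%s" %s' % (name, SQL_TYPES.get(t, "TEXT"))
                     for name, t in zip(headers, types))
    return 'CREATE TABLE "%s" (%s);' % (table, cols)
```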

Remove dependency on magic library (or at least make it so not a fundamental dependency)

cf @gka's comment in #26

magic depends on the file utility from the operating system. This won't necessarily exist on a bunch of systems (including e.g. Google App Engine; this is a direct blocker for use in dataproxy, for example). Furthermore, it only provides very limited additional functionality in the any.py module.

Personally I feel this kind of guessing functionality (which also causes problems as it requires seek) could just be left out of messytables and provided in the examples or the README ...

Options:

  • Remove it entirely, or switch to http://docs.python.org/2/library/mimetypes.html (won't be as good, but not bad).

  • Limit the dependency: move the import magic into the if block where it is actually used. This would at least ensure magic is not a top-level dependency ...
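The stdlib alternative mentioned above is purely extension-based, so it needs no system dependency at all, though it cannot inspect file contents; for example:

```python
import mimetypes

# Extension-based guessing from the standard library: no `file`
# utility required, but the content itself is never examined.
mime, _encoding = mimetypes.guess_type("wastedata-200809.zip")
# a .zip extension maps to "application/zip"
```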

Zip files cannot be loaded over a socket

Loading a remote zipped file breaks messytables (primarily because it can't seek on the file-like object)

    import urllib 
    fh = urllib.urlopen('http://data.gov.uk/data/resource_cache/6d/6d8a0f2d-db23-40ea-8b40-eb20eb75b07f/wastedata-200809.zip')
    table_set = ZIPTableSet(fh)

It should be possible to wrap the fobj for ZIPTableSet in a seekable stream, but BufferedFile's seek method doesn't accept enough arguments (a file's seek takes two: pos and whence=0), which means the check for whether to load more data will need to take whence into account.

Mime type detection fails with "Can't read SAT"

This spreadsheet:
http://www.fao.org/fileadmin/templates/worldfood/Reports_and_docs/Food_price_indices_data_deflated.xls

If I read it with any:

    import messytables

    filename = "Food_price_indices_data_deflated.xls"
    tableset = messytables.any.any_tableset(open(filename), extension=".xls")
    print tableset

I get this mime error:

$ ./messy_to_json.py 
Traceback (most recent call last):
  File "./messy_to_json.py", line 7, in <module>
    tableset = messytables.any.any_tableset(open(filename), extension=".xls")
  File "/usr/local/lib/python2.7/dist-packages/messytables/any.py", line 48, in any_tableset
    raise ValueError("Unrecognized MIME type: " + mimetype)
ValueError: Unrecognized MIME type: Composite Document File V2 Document, corrupt: Can't read SAT

It works fine if I do:

    tableset = messytables.any.any_tableset(open(filename), mimetype="application/vnd.ms-excel", extension=".xls")

messytables did not import table correctly

Hey,

i've just uploaded a CSV from scraperwiki into thedatahub:
https://scraperwiki.com/scrapers/discursoscamara-new/

http://thedatahub.org/dataset/brazilian-congress-speeches/resource/67a566bd-9561-4f02-b727-99f9e1f2b4f3

It's not getting recognized properly: both the column names and some column content are wrong. See http://imgur.com/a/xg33B (we have updated the datahub store since then, so these issues will no longer be apparent!)

I think it has to do with double quotes inside fields not being escaped properly by scraperwiki. LibreOffice can handle it just fine, though.

Proposal: Discourage (& deprecate) constructing TableSet objects directly

Our library users are coupling their code against knowledge of our classes. This makes it hard for us to merge or separate functionality from different formats.

I propose we encourage using any_tableset (possibly with a new flag to override type detection when we are certain) and emit DeprecationWarnings when the TableSet classes are constructed directly.
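Emitting the warning itself is straightforward; a sketch (the class here is illustrative, standing in for the real TableSet classes):

```python
import warnings

class TableSet(object):
    """Illustrative base class: warn when constructed directly
    rather than obtained via any_tableset()."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "construct table sets via any_tableset() instead",
            DeprecationWarning, stacklevel=2)
```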

Type guessing weights might be wrong

@pudo in #17 noted that the DateType weight is now the same as the string weight. I think I made a mistake, and all of the weights should be incremented by 1 so that string has the lowest weight. I'll submit another pull request when I get a chance.

Excel XML format support (?)

Docs say:

"The library supports workbooks in the Microsoft Excel 2003 format. The newer, XML-based Excel format is not yet supported."

However, we have a dependency on openpyxl, which is specifically for the XML format. Do we support the newer format or not?

any_tableset and TableSets should raise the same exception

Currently any_tableset passes through any exception from ZIPTableSet, CSVTableSet etc. which could be: ValueError, xlrd.biffh.XLRDError, csv.Error etc. Would be good to harmonize these, to make it easier for callers to catch sensible errors and differentiate them from coding syntax exceptions.
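One way to harmonize is a single library-level exception wrapping the backend errors; ReadError already exists in the Excel path, so something like this sketch could reuse it (the wrapper function is illustrative):

```python
class ReadError(Exception):
    """One exception type for callers to catch, regardless of which
    format backend failed underneath."""

def open_tableset(opener, *args, **kwargs):
    """Illustrative wrapper: translate backend errors into ReadError.
    Backend-specific exceptions (xlrd.biffh.XLRDError, csv.Error, ...)
    would be added to the except clause in a real implementation."""
    try:
        return opener(*args, **kwargs)
    except (ValueError, IOError) as err:
        raise ReadError("could not parse table: %s" % err)
```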

Datastorer throws error

[2013-01-11 18:37:26,055: ERROR/MainProcess] Task datastorer.upload[0c511be4-01ca-464c-9289-bffc06ca55ec] raised exception: AttributeError("'NoneType' object has no attribute 'value'",)
Traceback (most recent call last):
  File "/Users/sw/.virtualenvs/ckan/lib/python2.7/site-packages/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/Users/sw/.virtualenvs/ckan/lib/python2.7/site-packages/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/Users/sw/.virtualenvs/ckan/lib/python2.7/site-packages/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/Users/sw/Sites/ckan/ckanext-datastorer/ckanext/datastorer/tasks.py", line 92, in datastorer_upload
    return _datastorer_upload(context, data, logger)
  File "/Users/sw/Sites/ckan/ckanext-datastorer/ckanext/datastorer/tasks.py", line 182, in _datastorer_upload
    for data in chunky(row_set.dicts(), 100):
  File "/Users/sw/Sites/ckan/ckanext-datastorer/ckanext/datastorer/tasks.py", line 176, in chunky
    dict, itertools.islice(it, n)))
  File "/Users/sw/.virtualenvs/ckan/src/messytables/messytables/core.py", line 189, in dicts
    generator = self.sample if sample else self
  File "/Users/sw/.virtualenvs/ckan/src/messytables/messytables/core.py", line 170, in __iter__
    import pdb; pdb.set_trace()
  File "/Users/sw/.virtualenvs/ckan/src/messytables/messytables/types.py", line 189, in apply_types
    if strict and type and cell.value:
AttributeError: 'NoneType' object has no attribute 'value'

DateUtil guessing can interpret data in different formats

Since the DateUtil type only tries to convert a string into a date, it might interpret different rows with different formats.

Example:

    02/03/04 -> MM/DD/YY
    31/03/04 -> DD/MM/YY

This is why we have the Date type but it is very slow because it has to go through different formats. A way to speed it up is to reduce the number of possible formats when using the strict type guessing. Only date formats that are still possible should be used for following rows.
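The narrowing idea in sketch form (the format list is illustrative, not the Date type's actual set):

```python
from datetime import datetime

def _parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def narrow_formats(rows, formats=("%d/%m/%y", "%m/%d/%y", "%Y-%m-%d")):
    """Keep only the date formats that parse every row seen so far,
    so later rows are checked against fewer candidates."""
    candidates = list(formats)
    for value in rows:
        candidates = [f for f in candidates if _parses(value, f)]
    return candidates
```

With the ambiguous row alone, both day-first and month-first survive; adding 31/03/04 rules out month-first, because 31 cannot be a month.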

Support explicit encoding for excel.py

Add an encoding_override argument.

This is especially useful when you have bad Excel data; e.g. I had a recent Excel file which yielded:

ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010

Stack trace ended with:

  File "/home/rgrp/.virtualenvs/dp/local/lib/python2.7/site-packages/xlrd/__init__.py", line 1485, in parse_globals
    self.handle_codepage(data)
  File "/home/rgrp/.virtualenvs/dp/local/lib/python2.7/site-packages/xlrd/__init__.py", line 1123, in handle_codepage
    self.derive_encoding()
  File "/home/rgrp/.virtualenvs/dp/local/lib/python2.7/site-packages/xlrd/__init__.py", line 1103, in derive_encoding
    _unused = unicode('trial', self.encoding)
LookupError: unknown encoding: unknown_codepage_21010

Explicitly setting the encoding to utf8 made it work fine ...

[discussion] messytables should *only* work with local files

Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls) or the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).

Downloading the content to temporary storage isn't an onerous task, and if the interface was modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).

Update pip

Any chance of pushing the most recent changes to pip?

Type guessing does not exclude unparsable types

My problem is that I have to parse all fields properly. The type guessing, however, does suggest a type that I cannot use.

    >>> type_guess(CSVTableSet(StringIO.StringIO('1 \n 2 \n foo')).tables[0])
    [Decimal]

I would expect string instead of decimal. My idea would be to add an option to type_guess that will remove types from the guessable if the casting would fail.

str(NUMBER) always works, whereas int(STRING) does not always succeed; the same goes for the other numeric types.
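The proposed option in sketch form: a type survives only if every sampled cell casts successfully (names and casters here are illustrative, not messytables' type_guess):

```python
def strict_guess(column, casters):
    """Drop any candidate type whose cast fails for some cell;
    fall back to String when nothing survives."""
    surviving = dict(casters)
    for value in column:
        for name, cast in list(surviving.items()):
            try:
                cast(value)
            except (ValueError, TypeError):
                del surviving[name]
    return sorted(surviving) or ["String"]
```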

type_guess guesses datetime field on xls files as string

See https://gist.github.com/4502515 for the test cases, https://github.com/okfn/messytables/blob/master/horror/simple.xls is the xls file, and https://github.com/okfn/messytables/blob/master/horror/simple.csv is the csv file used.

Whether running messytest2.py with the latest messytables or messytest1.py with 0.3.0 (when DateType worked, albeit slowly), it returns

[String, Integer, String]

However, when I run messytest3.py (which runs type_guess on simple.csv) it correctly returns the following

[DateUtil, Integer, String]

`CellType.test` doesn't return a Boolean value

I'm not quite sure how the API is intended to work, however I think CellType.test should return a Boolean. Currently, the default implementation returns the value or None.

The reason for this is I created a NullType(CellType) and NullType.test basically returned None or None. That behaviour makes it very difficult to use as a test.
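One way to fix the default is to derive test from cast, so it always yields a real Boolean; a sketch (not the current messytables implementation):

```python
class CellType(object):
    """Sketch of a base type whose test() is a true predicate."""

    def cast(self, value):
        raise NotImplementedError

    def test(self, value):
        # True iff the cast succeeds, never the casted value itself.
        try:
            self.cast(value)
            return True
        except (ValueError, TypeError):
            return False

class IntegerType(CellType):
    def cast(self, value):
        return int(value)
```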

Provide format-specific metadata

Often I find myself wanting more details about the individual cells than just their values.

e.g.

  1. Some HTML cells contain more than just a single value, and this can require additional parsing to understand what the true value is. For example, a cell whose HTML source contains markup around the value 421 is naively converted to just 421 at the moment; in order to do this additional processing I require the HTML source of the cell.

  2. Some formats (Excel, HTML, etc.) support additional formatting, e.g. bold, font colour, background colour. It would be good to allow future support for these.

  3. We don't want to write enormous amounts of code to cover all use cases, especially where features are limited to one or two formats. But by making available the internals of the library parsing the file (e.g. lxml's internal rendering of the cell), we can allow people to interrogate this data without hacking on messytables directly.

So: I propose adding a "properties" attribute to messytables Cells, which is a dictionary; what keys exist is entirely dependent on the helper library.

Currently, I:

  • expose internals via "_lxml", "_xlrd", "_pyxl"
  • expose raw HTML for the cell via "html"
  • expose whether a cell was spanned via "span" (HTML only so far)

Does this sound like a good idea / terrible idea?

Can't open zero-password encrypted files.

LibreOffice will open password protected files with no password with no user intervention.

Neither xlrd nor openpyxl will read them, as far as I can see.

Incidentally, we discovered that xlrd error messages are harshly suppressed (see #90), which is unhelpful for debugging without hacking on library internals.

Invoking libreoffice --headless --convert-to xlsx encrypted.xls gave a file which we could read with messytables just fine. I'm not convinced requiring libreoffice is appropriate, however!

Decimal places are always truncated, because Integer has higher default weight

Fields with decimal places can still be parsed as integers, so both Decimal and Integer achieve perfect scores in type_guess. However, Integer has higher default weight, so the decimal places will be dropped.

This is a problem in data such as

https://staging.data.qld.gov.au/storage/f/2013-09-11T04%3A22%3A59.234Z/qscd-datafile.xls

where Latitude and Longitude will be rounded off (and thus become almost useless, because fractions of degrees are extremely important).

Should Decimal have higher default weight? Or, to keep Integer meaningful, should there be some way of distinguishing whether a field actually had decimal places or not?
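One possible distinction: check whether any sampled value actually carries a fractional part before letting Integer win (a heuristic sketch, not messytables' scoring logic):

```python
def needs_decimal(values):
    """True if any value has a non-zero fractional part, in which
    case Decimal should out-rank Integer for this column."""
    for v in values:
        if float(v) != int(float(v)):
            return True
    return False
```

Latitude/longitude columns like the ones above would trip this check, while genuinely integral columns would keep the Integer guess.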

Provide simpler way to get started

Current instructions (see below) are a bit much!

Can we not have:

    import messytables
    # file is loaded, types are guessed and the row set object is returned!
    myrowset = messytables.load_csv(open('mycsv.csv'))

Current set up:

    from messytables import CSVTableSet, type_guess, \
      types_processor, headers_guess, headers_processor, \
      offset_processor

    fh = open('messy.csv', 'rb')

    # Load a file object:
    table_set = CSVTableSet.from_fileobj(fh)

    # A table set is a collection of tables:
    row_set = table_set.tables().next()

    # A row set is an iterator over the table, but it can only
    # be run once. To peek, a sample is provided:
    print row_set.sample[0]

    # guess column types:
    types = type_guess(row_set.sample)

    # and tell the row set to apply these types to
    # each row when traversing the iterator:
    row_set.register_processor(types_processor(types))

    # guess header names and the offset of the header:
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))

    # add one to begin with content, not the header:
    row_set.register_processor(offset_processor(offset + 1))

    # now run some operation on the data:
    for row in row_set:
      do_something(row)

Rework of detection

Incoming pull request shortly; we were having problems with the any module, so have refactored it.

Remove from_fileobj method

Why?

  • DRY ... from_fileobj just duplicates __init__ in most cases and makes adding new parameters a pain in the neck (you have to add them in both places).

type_guess throws an error

See https://gist.github.com/4502515 for the test cases and https://github.com/okfn/messytables/blob/master/horror/simple.xls is the xls file used.

When I run messytest1.py from the gist with messytables 0.3.0, it works fine. When I run it with messytables 0.4.0 or against master, it throws the following error

  File "./messytest.py", line 15, in <module>
    main()
  File "./messytest.py", line 10, in main
    types = ms.types.type_guess(row_set.sample)
  File "/home/nigel/.virtualenvs/serviceconverters/local/lib/python2.7/site-packages/messytables/types.py", line 159, in type_guess
    guess = type.test(cell.value)
  File "/home/nigel/.virtualenvs/serviceconverters/local/lib/python2.7/site-packages/messytables/types.py", line 100, in test
    if not is_date(value):
  File "/home/nigel/.virtualenvs/serviceconverters/local/lib/python2.7/site-packages/messytables/dateparser.py", line 7, in is_date
    return date_regex.match(value)
TypeError: expected string or buffer
