messytables's Issues

Document dependency to file

I installed messytables but it won't run because of a missing system package: file. That's the lib that provides the filetype detection, I guess.

I had to install it via brew install file-formula or sudo port install file.

It would be nice to document this dependency somewhere, so other users are saved the time of looking up the error. Alternatively, messytables should only offer the filetype detection feature if the package is installed, and print a friendly message if the user tries to use it anyway. My guess is that big chunks of messytables would run without auto-detection.
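The graceful-degradation option could look something like this sketch (the function name and message wording are assumptions, not messytables' actual code):

```python
# Sketch: make python-magic (and thus the system `file` library) optional.
try:
    import magic  # needs libmagic, provided by the `file` package
    HAS_MAGIC = True
except ImportError:
    HAS_MAGIC = False

def guess_mimetype(sample):
    """Return a MIME type for a byte sample, or fail with a hint
    when filetype detection is unavailable."""
    if not HAS_MAGIC:
        raise RuntimeError(
            "filetype detection requires the 'file' library; "
            "install it via `brew install file-formula` or "
            "`sudo port install file`")
    return magic.from_buffer(sample, mime=True)
```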

Remove all instances of seek() so that we can handle streaming data again

After some format detection fixes, we now have a few calls to seek() in the CSV module. Those cannot work on urllib-style http request data. One of the main use cases for messytables is to do streaming web data. We should remove these calls, even if this results in a loss of functionality wrt. type detection.
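One seek-free alternative is to buffer only an initial sample and replay it before the rest of the stream; a sketch of the idea (this class is illustrative, not messytables' BufferedFile):

```python
import io

class ReplayStream(io.RawIOBase):
    """Sketch: wrap a non-seekable stream so an initial sample can be
    re-read without calling seek()."""

    def __init__(self, stream, sample_size=1024):
        self._sample = stream.read(sample_size)  # detection sample
        self._pos = 0                            # replay position
        self._stream = stream

    @property
    def sample(self):
        """The bytes available for format/type detection."""
        return self._sample

    def read(self, n=-1):
        out = b""
        if self._pos < len(self._sample):
            if n == -1:
                out = self._sample[self._pos:]
                self._pos = len(self._sample)
            else:
                out = self._sample[self._pos:self._pos + n]
                self._pos += len(out)
                if len(out) == n:
                    return out
                n -= len(out)
        return out + self._stream.read(n)
```

Detection code would look only at `.sample`, and later readers would still see the whole stream from the start.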

Encrypted workbooks give misleading error

This error gets raised for password protected workbooks:

        except XLRDError:
            raise ReadError("Unsupported Excel format, or corrupt file")

It should (preferably!) just open them anyway, or at least tell you it is an encrypted workbook, rather than suppressing the detailed error message inside the thrown-away XLRDError structure.

Proposal: Remove openpyxl and use xlrd for XLSX files.

We keep running into XLSX files which are problematic for openpyxl, such as this one: https://www.treasurydirect.gov/govt/reports/pd/pd_sbredemptionsissuesbyseries.xlsx

openpyxl takes a friggin' long time and a tonne of memory to iterate through this spreadsheet.

xlrd appears to open and process it just fine.

openpyxl seems to be a little overwhelmed with issues. I also find it difficult to contribute due to them using Mercurial(!)

@domoritz @rossjones Do you guys have any view on dropping openpyxl?

A precondition, I think, would be to improve the XLSX tests first (I'm volunteering)

Tags when shipping.

It would be really cool if a tag were created when messytables is pushed to PyPI, with the same name as the version that was pushed.

BufferedFile.read() is incorrect

BufferedFile is used when the stream is non-seekable, such as the stream obtained from the raw property of a requests.get() response (which is, I believe, a urllib3.HTTPResponse). The read method on BufferedFile has a required parameter, and therefore doesn't work with the argument-less .read() used in XLSTableSet.__init__.

Specifying a default to read in BufferedFile (such as n=-1, read all) then causes a failure in XLRD as the buffer only reads 1024 bytes at a time.

Add JSONType to possible types

Add a JSONType for JSON values, specifically complex JSON types (lists, dicts, mixed forms); simple strings / numbers would just remain those types.
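A sketch of what such a type's test could look like (the class shape is assumed from the CellType pattern, not actual messytables code):

```python
import json

class JSONType(object):
    """Hypothetical cell type matching complex JSON values (lists,
    dicts); plain strings and numbers are left to the existing types."""

    def test(self, value):
        try:
            parsed = json.loads(value)
        except (ValueError, TypeError):
            return False
        return isinstance(parsed, (list, dict))
```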

Does it make sense to have MessyTables be able to create a SQL string?

I just found myself using MessyTables to write a SQL "CREATE TABLE" string. This involved converting MessyTable types to SQL query equivalents. It would also involve trying to figure out, for example, whether something is a varchar or a text blob.

I don't see a feature that does this yet. If I branch and create it would that be a useful contribution and in the direction MessyTables is hoping to go?
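As a rough sketch of the mapping involved (the type names and SQL choices here are assumptions, and deciding VARCHAR(n) vs TEXT would need max-length statistics from the sample):

```python
# Hypothetical mapping from messytables-style type names to SQL types.
SQL_TYPES = {
    "Integer": "INTEGER",
    "Decimal": "NUMERIC",
    "DateUtil": "TIMESTAMP",
    "String": "TEXT",
}

def create_table_sql(table, headers, types):
    """Build a CREATE TABLE statement from guessed column types."""
    cols = ", ".join('"%s" %s' % (name, SQL_TYPES.get(t, "TEXT"))
                     for name, t in zip(headers, types))
    return 'CREATE TABLE "%s" (%s);' % (table, cols)
```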

Remove dependency on magic library (or at least make it so not a fundamental dependency)

cf @gka's comment in #26

magic depends on the file utility from the operating system. This won't necessarily exist on a bunch of systems (including e.g. Google App Engine; this is a direct blocker for use in dataproxy, for example). Furthermore, it only provides very limited additional functionality in the any.py module.

Personally I feel this kind of guessing functionality (which also causes problems as it requires seek) could just be left out of messytables and provided in the examples or the README ...

Options:

  • Remove it entirely, or switch to http://docs.python.org/2/library/mimetypes.html (won't be as good, but not bad).

  • Limit the dependency: move the import magic into the if block where it is actually used. This would at least ensure magic is not a top-level dependency ...
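The stdlib alternative mentioned above is purely extension-based, so it needs no system dependency at all, though it cannot inspect file contents; for example:

```python
import mimetypes

# Extension-based guessing from the standard library: no `file`
# utility required, but the content itself is never examined.
mime, _encoding = mimetypes.guess_type("wastedata-200809.zip")
# a .zip extension maps to "application/zip"
```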

Zip files cannot be loaded over a socket

Loading a remote zipped file breaks messytables (primarily because it can't seek on the file-like object)

    import urllib 
    fh = urllib.urlopen('http://data.gov.uk/data/resource_cache/6d/6d8a0f2d-db23-40ea-8b40-eb20eb75b07f/wastedata-200809.zip')
    table_set = ZIPTableSet(fh)

It should be possible to wrap the fobj for ZIPTableSet in a seekable stream, but BufferedFile's seek method doesn't accept enough arguments (a file's seek takes two: pos and whence=0), which means the check for whether to load more data will need to take whence into account.

Mime type detection fails with "Can't read SAT"

This spreadsheet:
http://www.fao.org/fileadmin/templates/worldfood/Reports_and_docs/Food_price_indices_data_deflated.xls

If I read it with any:

    import messytables

    filename = "Food_price_indices_data_deflated.xls"
    tableset = messytables.any.any_tableset(open(filename), extension=".xls")
    print tableset

I get this mime error:

$ ./messy_to_json.py 
Traceback (most recent call last):
  File "./messy_to_json.py", line 7, in <module>
    tableset = messytables.any.any_tableset(open(filename), extension=".xls")
  File "/usr/local/lib/python2.7/dist-packages/messytables/any.py", line 48, in any_tableset
    raise ValueError("Unrecognized MIME type: " + mimetype)
ValueError: Unrecognized MIME type: Composite Document File V2 Document, corrupt: Can't read SAT

It works fine if I do:

    tableset = messytables.any.any_tableset(open(filename), mimetype="application/vnd.ms-excel", extension=".xls")

messytables did not import table correctly

Hey,

i've just uploaded a CSV from scraperwiki into thedatahub:
https://scraperwiki.com/scrapers/discursoscamara-new/

http://thedatahub.org/dataset/brazilian-congress-speeches/resource/67a566bd-9561-4f02-b727-99f9e1f2b4f3

It's not getting recognized properly: both the column names and some column content are wrong. See http://imgur.com/a/xg33B (we have updated the datahub store since then, so these issues will no longer be apparent!)

I think it has to do with double quotes inside fields not being escaped properly by scraperwiki. LibreOffice can handle it just fine, though.

Proposal: Discourage (& deprecate) constructing TableSet objects directly

Our library users are coupling their code against knowledge of our classes. This makes it hard for us to merge or separate functionality from different formats.

I propose we encourage using any_tableset (possibly with a new flag to override type detection when we are certain) and emit DeprecationWarnings when the TableSet classes are constructed directly.
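Emitting the warning itself is straightforward; a sketch (the class here is illustrative, standing in for the real TableSet classes):

```python
import warnings

class TableSet(object):
    """Illustrative base class: warn when constructed directly
    rather than obtained via any_tableset()."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "construct table sets via any_tableset() instead",
            DeprecationWarning, stacklevel=2)
```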

Type guessing weights might be wrong

@pudo in #17 noted that the DateType weight is now the same as the string weight. I think I made a mistake, and all of the weights should be incremented by 1 so that string has the lowest weight. I'll submit another pull request when I get a chance.

Excel XML format support (?)

Docs say:

"The library supports workbooks in the Microsoft Excel 2003 format. The newer, XML-based Excel format is not yet supported."

However, we have a dependency on openpyxl, which is specifically for the XML format. Do we support the newer format or not?

any_tableset and TableSets should raise the same exception

Currently any_tableset passes through any exception from ZIPTableSet, CSVTableSet etc. which could be: ValueError, xlrd.biffh.XLRDError, csv.Error etc. Would be good to harmonize these, to make it easier for callers to catch sensible errors and differentiate them from coding syntax exceptions.
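One way to harmonize is a single library-level exception wrapping the backend errors; ReadError already exists in the Excel path, so something like this sketch could reuse it (the wrapper function is illustrative):

```python
class ReadError(Exception):
    """One exception type for callers to catch, regardless of which
    format backend failed underneath."""

def open_tableset(opener, *args, **kwargs):
    """Illustrative wrapper: translate backend errors into ReadError.
    Backend-specific exceptions (xlrd.biffh.XLRDError, csv.Error, ...)
    would be added to the except clause in a real implementation."""
    try:
        return opener(*args, **kwargs)
    except (ValueError, IOError) as err:
        raise ReadError("could not parse table: %s" % err)
```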

Datastorer throws error

[2013-01-11 18:37:26,055: ERROR/MainProcess] Task datastorer.upload[0c511be4-01ca-464c-9289-bffc06ca55ec] raised exception: AttributeError("'NoneType' object has no attribute 'value'",)
Traceback (most recent call last):
  File "/Users/sw/.virtualenvs/ckan/lib/python2.7/site-packages/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/Users/sw/.virtualenvs/ckan/lib/python2.7/site-packages/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/Users/sw/.virtualenvs/ckan/lib/python2.7/site-packages/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/Users/sw/Sites/ckan/ckanext-datastorer/ckanext/datastorer/tasks.py", line 92, in datastorer_upload
    return _datastorer_upload(context, data, logger)
  File "/Users/sw/Sites/ckan/ckanext-datastorer/ckanext/datastorer/tasks.py", line 182, in _datastorer_upload
    for data in chunky(row_set.dicts(), 100):
  File "/Users/sw/Sites/ckan/ckanext-datastorer/ckanext/datastorer/tasks.py", line 176, in chunky
    dict, itertools.islice(it, n)))
  File "/Users/sw/.virtualenvs/ckan/src/messytables/messytables/core.py", line 189, in dicts
    generator = self.sample if sample else self
  File "/Users/sw/.virtualenvs/ckan/src/messytables/messytables/core.py", line 170, in __iter__
    import pdb; pdb.set_trace()
  File "/Users/sw/.virtualenvs/ckan/src/messytables/messytables/types.py", line 189, in apply_types
    if strict and type and cell.value:
AttributeError: 'NoneType' object has no attribute 'value'

DateUtil guessing can interpret data in different formats

Since the DateUtil type only tries to convert a string into a date, it might interpret different rows with different formats.

Example:

    02/03/04 -> MM/DD/YY
    31/03/04 -> DD/MM/YY

This is why we have the Date type but it is very slow because it has to go through different formats. A way to speed it up is to reduce the number of possible formats when using the strict type guessing. Only date formats that are still possible should be used for following rows.
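The narrowing idea in sketch form (the format list is illustrative, not the Date type's actual set):

```python
from datetime import datetime

def _parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def narrow_formats(rows, formats=("%d/%m/%y", "%m/%d/%y", "%Y-%m-%d")):
    """Keep only the date formats that parse every row seen so far,
    so later rows are checked against fewer candidates."""
    candidates = list(formats)
    for value in rows:
        candidates = [f for f in candidates if _parses(value, f)]
    return candidates
```

With the ambiguous row alone, both day-first and month-first survive; adding 31/03/04 rules out month-first, because 31 cannot be a month.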

Support explicit encoding for excel.py

Add an encoding_override argument.

This is especially useful when you have bad Excel data; e.g. I had a recent Excel file which yielded:

ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010

Stack trace ended with:

  File "/home/rgrp/.virtualenvs/dp/local/lib/python2.7/site-packages/xlrd/__init__.py", line 1485, in parse_globals
    self.handle_codepage(data)
  File "/home/rgrp/.virtualenvs/dp/local/lib/python2.7/site-packages/xlrd/__init__.py", line 1123, in handle_codepage
    self.derive_encoding()
  File "/home/rgrp/.virtualenvs/dp/local/lib/python2.7/site-packages/xlrd/__init__.py", line 1103, in derive_encoding
    _unused = unicode('trial', self.encoding)
LookupError: unknown encoding: unknown_codepage_21010

Explicitly setting the encoding to utf8 made it work fine ...

[discussion] messytables should *only* work with local files

Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls) or the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).

Downloading the content to temporary storage isn't an onerous task, and if the interface was modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).

Update pip

Any chance of pushing the most recent changes to pip?

Type guessing does not exclude unparsable types

My problem is that I have to parse all fields properly. The type guessing, however, does suggest a type that I cannot use.

    >>> type_guess(CSVTableSet(StringIO.StringIO('1 \n 2 \n foo')).tables[0])
    [Decimal]

I would expect string instead of decimal. My idea would be to add an option to type_guess that will remove types from the guessable if the casting would fail.

str(NUMBER) always works, whereas int(STRING) does not always succeed; the same goes for the other numeric types.
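The proposed option in sketch form: a type survives only if every sampled cell casts successfully (names and casters here are illustrative, not messytables' type_guess):

```python
def strict_guess(column, casters):
    """Drop any candidate type whose cast fails for some cell;
    fall back to String when nothing survives."""
    surviving = dict(casters)
    for value in column:
        for name, cast in list(surviving.items()):
            try:
                cast(value)
            except (ValueError, TypeError):
                del surviving[name]
    return sorted(surviving) or ["String"]
```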

type_guess guesses datetime field on xls files as string

See https://gist.github.com/4502515 for the test cases, https://github.com/okfn/messytables/blob/master/horror/simple.xls is the xls file, and https://github.com/okfn/messytables/blob/master/horror/simple.csv is the csv file used.

Whether running messytest2.py with the latest messytables or messytest1.py with 0.3.0 (when DateType worked, albeit slowly), it returns

[String, Integer, String]

However, when I run messytest3.py (which runs type_guess on simple.csv) it correctly returns the following

[DateUtil, Integer, String]

`CellType.test` doesn't return a Boolean value

I'm not quite sure how the API is intended to work, however I think CellType.test should return a Boolean. Currently, the default implementation returns the value or None.

The reason for this is I created a NullType(CellType) and NullType.test basically returned None or None. That behaviour makes it very difficult to use as a test.
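One way to fix the default is to derive test from cast, so it always yields a real Boolean; a sketch (not the current messytables implementation):

```python
class CellType(object):
    """Sketch of a base type whose test() is a true predicate."""

    def cast(self, value):
        raise NotImplementedError

    def test(self, value):
        # True iff the cast succeeds, never the casted value itself.
        try:
            self.cast(value)
            return True
        except (ValueError, TypeError):
            return False

class IntegerType(CellType):
    def cast(self, value):
        return int(value)
```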

Provide format-specific metadata

Often I find myself wanting more details about the individual cells than just their values.

e.g.

  1. Some HTML cells contain more than just a single value, and this can require additional parsing to understand what the true value is. For example, a cell whose HTML source contains markup around the value 421 is naively converted to just 421 at the moment; in order to do this additional processing I require the HTML source of the cell.

  2. Some formats (Excel, HTML, etc.) support additional formatting, e.g. bold, font colour, background colour. It would be good to allow future support for these.

  3. We don't want to write enormous amounts of code to cover all use cases, especially where features are limited to one or two formats. But by making available the internals of the library parsing the file (e.g. lxml's internal rendering of the cell), we can allow people to interrogate this data without hacking on messytables directly.

So: I propose adding a "properties" attribute to messytables Cells, which is a dictionary; what keys exist is entirely dependent on the helper library.

Currently, I:

  • expose internals via "_lxml", "_xlrd", "_pyxl"
  • expose raw HTML for the cell via "html"
  • expose whether a cell was spanned via "span" (HTML only so far)

Does this sound like a good idea / terrible idea?

Can't open zero-password encrypted files.

LibreOffice will open password protected files with no password with no user intervention.

Neither xlrd nor openpyxl will read them, as far as I can see.

Incidentally, we discovered that xlrd error messages are harshly suppressed (see #90), which is unhelpful for debugging without hacking on library internals.

Invoking libreoffice --headless --convert-to xlsx encrypted.xls gave a file which we could read with messytables just fine. I'm not convinced requiring libreoffice is appropriate, however!

Decimal places are always truncated, because Integer has higher default weight

Fields with decimal places can still be parsed as integers, so both Decimal and Integer achieve perfect scores in type_guess. However, Integer has higher default weight, so the decimal places will be dropped.

This is a problem in data such as

https://staging.data.qld.gov.au/storage/f/2013-09-11T04%3A22%3A59.234Z/qscd-datafile.xls

where Latitude and Longitude will be rounded off (and thus become almost useless, because fractions of degrees are extremely important).

Should Decimal have higher default weight? Or, to keep Integer meaningful, should there be some way of distinguishing whether a field actually had decimal places or not?
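One possible distinction: check whether any sampled value actually carries a fractional part before letting Integer win (a heuristic sketch, not messytables' scoring logic):

```python
def needs_decimal(values):
    """True if any value has a non-zero fractional part, in which
    case Decimal should out-rank Integer for this column."""
    for v in values:
        if float(v) != int(float(v)):
            return True
    return False
```

Latitude/longitude columns like the ones above would trip this check, while genuinely integral columns would keep the Integer guess.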

Provide simpler way to get started

Current instructions (see below) are a bit much!

Can we not have:

    import messytables
    # file is loaded, types are guessed and the row set object is returned!
    myrowset = messytables.load_csv(open('mycsv.csv'))

Current set up:

    from messytables import CSVTableSet, type_guess, \
      types_processor, headers_guess, headers_processor, \
      offset_processor

    fh = open('messy.csv', 'rb')

    # Load a file object:
    table_set = CSVTableSet.from_fileobj(fh)

    # A table set is a collection of tables:
    row_set = table_set.tables().next()

    # A row set is an iterator over the table, but it can only
    # be run once. To peek, a sample is provided:
    print row_set.sample[0]

    # guess column types:
    types = type_guess(row_set.sample)

    # and tell the row set to apply these types to
    # each row when traversing the iterator:
    row_set.register_processor(types_processor(types))

    # guess header names and the offset of the header:
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))

    # add one to begin with content, not the header:
    row_set.register_processor(offset_processor(offset + 1))

    # now run some operation on the data:
    for row in row_set:
      do_something(row)

Rework of detection

Incoming pull request shortly; we were having problems with the any module, so have refactored it.

Remove from_fileobj method

Why?

  • DRY ... from_fileobj just duplicates __init__ in most cases and makes adding new parameters a pain in the neck (you have to add them in both places).

type_guess throws an error

See https://gist.github.com/4502515 for the test cases and https://github.com/okfn/messytables/blob/master/horror/simple.xls is the xls file used.

When I run messytest1.py from the gist with messytables 0.3.0, it works fine. When I run it with messytables 0.4.0 or against master, it throws the following error

  File "./messytest.py", line 15, in <module>
    main()
  File "./messytest.py", line 10, in main
    types = ms.types.type_guess(row_set.sample)
  File "/home/nigel/.virtualenvs/serviceconverters/local/lib/python2.7/site-packages/messytables/types.py", line 159, in type_guess
    guess = type.test(cell.value)
  File "/home/nigel/.virtualenvs/serviceconverters/local/lib/python2.7/site-packages/messytables/types.py", line 100, in test
    if not is_date(value):
  File "/home/nigel/.virtualenvs/serviceconverters/local/lib/python2.7/site-packages/messytables/dateparser.py", line 7, in is_date
    return date_regex.match(value)
TypeError: expected string or buffer
