Comments (16)
gesaber_data.csv
10935992 lines
Old tablite took 522 seconds; the experimental tablite changes take 63 seconds. That's roughly 8.3 times faster.
from tablite.
I was thinking that perhaps we could use this:
https://github.com/juancarlospaco/faster-than-csv
but I didn't finish combining it with writing the numpy file format described in:
https://nbviewer.org/github/root-11/root-11.github.io/blob/master/content/reading_numpys_fileformat.ipynb
what are your thoughts?
from tablite.
I think if it's memory safe (no risk of out-of-memory exceptions), then perhaps it could be considered as an option.
At first glance, the lib you linked returns the whole table as a list, which means the user's device may run out of memory if the list does not fit in RAM.
I think it is more practical if the tool, just like tablite, fails only when a single column does not fit in RAM, not the whole table.
Please correct me if my understanding of the tablite core is wrong.
from tablite.
I would read tablite.config.Config.page_size (int) rows at a time, slicing the read operations to prevent out-of-memory errors.
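A minimal sketch of reading a fixed number of rows at a time, assuming the stdlib csv.reader as the parser. The function name read_pages is hypothetical, and its page_size argument is only a stand-in for tablite.config.Config.page_size; this is not tablite's actual import code.

```python
import csv
import io
from itertools import islice


def read_pages(handle, page_size):
    """Yield lists of at most page_size parsed rows from an open text file."""
    reader = csv.reader(handle)
    while True:
        # islice pulls at most page_size rows, so only one page is in memory.
        page = list(islice(reader, page_size))
        if not page:
            return
        yield page


# Small in-memory sample standing in for a large file.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
pages = list(read_pages(sample, page_size=2))
```

Each page can be written to disk before the next one is read, so memory use is bounded by the page size rather than the file size.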
from tablite.
The current implementation itself also has a lot of room to be improved.
In a few places the file is re-read from the beginning every time we want to go to a certain line, when we could use file.seek() instead, since we can save the line offsets during the initial loop through the file.
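The offset idea could look roughly like the sketch below. The helper names index_line_offsets and read_line are hypothetical, and the file is opened in binary mode so that tell() and seek() work on plain byte positions. Note the assumption that every record sits on one physical line; quoted fields containing newlines would need extra handling.

```python
import os
import tempfile


def index_line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_line(path, offsets, line_no, encoding="utf-8"):
    """Jump straight to a line using a previously recorded offset."""
    with open(path, "rb") as f:
        f.seek(offsets[line_no])
        return f.readline().decode(encoding).rstrip("\r\n")


# Throwaway file used only to demonstrate the two helpers.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,alpha\n2,beta\n")
    path = tmp.name

offsets = index_line_offsets(path)
third_line = read_line(path, offsets, 2)
os.remove(path)
```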
The text escape uses Python loops, which are slow on their own, and on top of that there are constant memory re-allocations when working with strings. The effect is 35 ms per line (on my machine). By comparison, csv.reader with a custom dialect can do a line in 0.17 ms.
It could probably be even faster: I assume there's setup overhead per call, which would shrink further if the entire file (or a chunk of it) were processed at once instead of line by line.
We can play around with dialect settings and see what passes all the existing tests.
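To get a feel for the setup overhead, one could compare creating a reader per line with handing csv.reader the whole iterable of lines. This is only a sketch: the function names and sample data are made up for illustration, and the timings will differ per machine, so none are quoted here.

```python
import csv
import timeit


def parse_per_line(lines, **fmt):
    # One reader per line, similar in spirit to wrapping each line separately.
    return [next(csv.reader([line], **fmt)) for line in lines]


def parse_in_bulk(lines, **fmt):
    # A single reader consumes the whole iterable of lines.
    return list(csv.reader(lines, **fmt))


# Synthetic lines; the quoted last field contains the delimiter.
lines = ['"%d";"item %d";"x;y"' % (i, i) for i in range(1000)]
fmt = {"delimiter": ";", "quotechar": '"'}

# Both approaches must produce identical rows.
assert parse_per_line(lines, **fmt) == parse_in_bulk(lines, **fmt)

per_line_s = timeit.timeit(lambda: parse_per_line(lines, **fmt), number=20)
bulk_s = timeit.timeit(lambda: parse_in_bulk(lines, **fmt), number=20)
print(f"per line: {per_line_s:.4f}s, bulk: {bulk_s:.4f}s")
```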
from tablite.
To improve the performance of the current text escaping module we could, as @realratchet mentioned, use the standard library's csv reader:
file_reader_utils.py#L103
# needs `import csv` and `from io import StringIO` at module level
def _call_3(self, s):  # looks for quotes.
    words = []
    # previous pure-Python implementation, kept for reference:
    # qoute = False
    # ix = 0
    # while ix < len(s):
    #     c = s[ix]
    #     if c == self.qoute:
    #         qoute = not qoute
    #     if qoute:
    #         ix += 1
    #         continue
    #     if c == self.delimiter:
    #         word, s = s[:ix], s[ix + self._delimiter_length :]
    #         word = word.lstrip(self.qoute).rstrip(self.qoute)
    #         words.append(word)
    #         ix = -1
    #     ix += 1
    # if s:
    #     s = s.lstrip(self.qoute).rstrip(self.qoute)
    #     words.append(s)

    # dialect built from this instance's delimiter and quote character
    class MyDialect(csv.Dialect):
        delimiter = self.delimiter
        quotechar = self.qoute
        escapechar = '\\'
        doublequote = True
        quoting = csv.QUOTE_MINIMAL
        skipinitialspace = False
        lineterminator = "\n"

    dia = MyDialect
    parsed_words = list(csv.reader(StringIO(s), dialect=dia))[0]
    words.extend(parsed_words)
    return words
With this change text escaping becomes fast, since csv.reader is implemented in C.
However, one test (test_filereader_formats.py/test_text_escape) fails, because its data is not valid according to the RFC 4180 standard (https://www.rfc-editor.org/rfc/rfc4180#page-2, point 7):
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote
The test:
te = text_escape('"1000294";"S2417DG 24"" LED monitor (210-AJWM)";"47.120,000";"CM3";3')
assert te == ["1000294", 'S2417DG 24"" LED monitor (210-AJWM)', "47.120,000", "CM3", "3"]
For the test to be correct, it should contain data that follows the CSV standard. That means every double quote inside a field has to be escaped by doubling it:
te = text_escape('"1000294";"S2417DG 24"""" LED monitor (210-AJWM)";"47.120,000";"CM3";3')
assert te == ["1000294", 'S2417DG 24"" LED monitor (210-AJWM)', "47.120,000", "CM3", "3"]
Then the test passes.
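The difference between the two inputs can be reproduced with the stdlib csv module alone. The sketch below passes delimiter, quotechar and doublequote directly instead of using the MyDialect class (and leaves out its escapechar), so it is an illustration of the RFC 4180 rule rather than the exact code path in tablite.

```python
import csv


def parse(line):
    """Parse one semicolon-delimited line with RFC 4180 style quote doubling."""
    return next(csv.reader([line], delimiter=";", quotechar='"', doublequote=True))


original = '"1000294";"S2417DG 24"" LED monitor (210-AJWM)";"47.120,000";"CM3";3'
escaped = '"1000294";"S2417DG 24"""" LED monitor (210-AJWM)";"47.120,000";"CM3";3'

original_fields = parse(original)  # "" inside the field collapses to one "
escaped_fields = parse(escaped)    # """" inside the field collapses to ""
```

With the original input the second field comes back with a single double quote, which is why the existing assertion fails; with the doubled input it comes back with two, matching the expected value.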
from tablite.
If you agree that the test is incorrect, I can make a PR.
from tablite.
Ovidijus would have to give concrete figures after the csv.reader implementation, but going from 35 ms to 0.17 ms per line is a huge jump, since that's where we're spending most of our time.
As for breaking the file format, I don't think we should care: we either break properly exported files or we support only one bad exporter. The only side effect is not a broken import; the imported result is just slightly different. As Ovidijus and I discussed, producing bad imports for exporters that correctly follow the standard in order to support wrong exporters adds no value.
from tablite.
It would never raise an error in either case. We lose compliance with every exporter that follows the CSV standard in favor of supporting the quoting of one that does not, as those double quotes would just be consumed and produce an empty sequence. As I'm not sure which one is more important, you'd have to make the choice.
However, if following that specific exporter's format is important, tablite loses a lot of its value as a public repository.
If we drop the csv.reader idea and keep the existing functionality, we can still get some speedup by not reallocating memory as frequently, since heap allocation and garbage collection are expensive. Even a non-interpreted language would choke on heap allocations and deallocations in a loop.
As for the sniffer, I don't think we experimented with it, only with line-by-line parsing.
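If the existing quoting behavior is kept, one way to cut down on reallocations is to scan the line by index and slice each field out once, instead of rebuilding the remaining string after every delimiter. The function split_quoted below is a hypothetical sketch of that idea, not the code in file_reader_utils.py, and it has not been run against the project's test suite.

```python
def split_quoted(line, delimiter=",", quote='"'):
    """Split a line on delimiters outside quotes, slicing each field only once."""
    words = []
    in_quote = False
    start = 0  # index where the current field begins
    ix = 0
    step = len(delimiter)
    n = len(line)
    while ix < n:
        if line[ix] == quote:
            in_quote = not in_quote
            ix += 1
        elif not in_quote and line.startswith(delimiter, ix):
            # Field ends here; strip the enclosing quotes like the old code did.
            words.append(line[start:ix].strip(quote))
            ix += step
            start = ix
        else:
            ix += 1
    if start < n:
        words.append(line[start:].strip(quote))
    return words


fields = split_quoted(
    '"1000294";"S2417DG 24"" LED monitor (210-AJWM)";"47.120,000";"CM3";3',
    delimiter=";",
)
```

This keeps the non-standard handling of embedded quotes that the current test expects, so it trades RFC 4180 compliance for backward compatibility.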
from tablite.
If we look at the speed impact, the same file (renamed, so tablite does not use cache) with the same options took 3 minutes 13 seconds (previously 1 hour 18 minutes):
import time
from tablite import Table

start = time.time()
tbl = Table.from_file("./test2.csv", guess_datatypes=False, text_qualifier='"', delimiter=',')
end = time.time()
print(end - start)  # elapsed wall-clock seconds
# OUTPUT
importing: reading 'test2.csv' bytes: 0.00%| | [00:00<?]
importing: consolidating 'test2.csv': 100.00%|██████████| [03:13<00:00]
193.13518619537354
from tablite.
We got it down to something like 7s, still WIP.
from tablite.
test.csv
1000 columns wide, 3000 rows.
Running 10 tests averaged 5.76 seconds on my machine. Only a single core is used. No datatype inference.
Running 10 tests averaged 2 minutes 39 seconds on my machine. Only a single core is used. With datatype inference.
When it comes to datatype inference, I went the naive route: I infer datatypes during pagination, since it already loads the numpy data anyway.
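As an illustration of what a naive per-page inference can look like, the sketch below tries int, then float, and otherwise keeps the strings. The function infer_page is hypothetical and works on plain Python lists; the actual implementation operates on numpy pages and also supports types such as datetime, date and time, which are left out here.

```python
def infer_page(values):
    """Cast a page of strings to int or float if every value parses, else keep strings."""
    for cast in (int, float):
        try:
            return [cast(v) for v in values]
        except ValueError:
            # At least one value did not parse; try the next, looser type.
            continue
    return list(values)


ints = infer_page(["1", "2", "3"])
floats = infer_page(["1", "2.5"])
texts = infer_page(["a", "1"])
```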
There can still be some improvements. However, because we support non-primitive datatypes (datetime, date, time), using numpy as the data format makes this extremely challenging. It requires fully understanding how numpy saves objects, since as far as I can tell it uses contiguous blocks when reading anything. And because it uses pickling, we would also need to fully understand the pickle format, so that once the file has been read to the end we could seek back to the necessary offset and overwrite the pickle headers, etc.
Even then, I don't think it would speed up datatype inference much, because I found that Python's string parsing is just slow. So I don't think it's worth bothering with, since using native code would probably improve it.
We'll need to run a lot more experiments to make sure things are in order, but all tests pass, so it's promising.
from tablite.