root-11 / tablite
multiprocessing enabled out-of-memory data analysis library for tabular data.
License: MIT License
There is a need for a conditional filter based on the columns of another table. It could be called as a .match(...) method.
Reasoning:
- .filter(...) is a method that can filter out wanted rows in the table based on criteria. The problem: it can only match the table by its own columns or given primitive values.
- .lookup(...) is a method that can do a conditional join of two tables based on certain criteria. The problem: when no matches are found, None's are filled in their place.
In the current implementation, if we want to find the rows of table1 that match the conditions in table2, we have to query the table using the lookup operator and then filter out the None results using filter. This requires iterating over the base table twice. The match method would perform this operation in a single pass, as sketched below.
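A minimal plain-Python sketch of the difference (hypothetical, not the tablite API; a single shared key column is assumed):

table1 = {"color": ["black", "blue", "white"], "n": [1, 2, 3]}
table2 = {"color": ["blue", "white"]}

# today: two passes - a lookup pads unmatched rows with None, then a filter drops them
looked_up = [c if c in table2["color"] else None for c in table1["color"]]    # pass 1
matched_ix = [i for i, c in enumerate(looked_up) if c is not None]            # pass 2

# proposed .match(...): keep only the rows satisfying the condition in a single pass
wanted = set(table2["color"])
matched_ix_single = [i for i, c in enumerate(table1["color"]) if c in wanted]
assert matched_ix == matched_ix_single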
The header says it all.
file_reader(.... keep=['column A', 'column C'])
# dropping column B
The inverse operation:
file_reader(.... drop=['column A', 'column C'])
# keeping column B.
Permit user to declare datatype in addition:
file_reader(.... keep=['column A', 'column C'], dtype=['int', 'date'])
whereby column A is int and column C is date.
module not found while running datatype_tests
I think the tablite library is awesome in terms of handling large data. Thanks for the development efforts!
However, I noticed that rapid changes to the core elements have made the documentation out of sync and, at this moment, completely misleading.
Why?
- tablite uses h5py; however, I can't seem to find anything related to it (only in the requirements.txt).
- tablite now uses numpy as data storage.
There might be other things, but the front page is definitely the first thing that pops up with outdated information.
The problem is captured here: https://stackoverflow.com/questions/2150739/iso-time-iso-8601-in-python
sum() takes at most 2 arguments (4 given)
https://github.com/root-11/tablite/blob/8971306a5249414372b6f7a7d3bfd6b56cfbe639/tablite/utils.py#L317C9-L317C9
Just stumbled across tablite. Really cool, definitely can agree with the use case (pandas and numpy are amazing, but also super massive when you just want to deal with a 5 column 20 row table.)
What are your thoughts on:
- Accepting a bs4.element.Tag (the result of page.select_one('table#tabid') from BeautifulSoup)? (Although I try to avoid suggesting adding dependencies, BeautifulSoup seems to be popular enough that making it an optional dependency for parsing HTML tables seems like a reasonable compromise to avoid having to parse manually - or with html5lib/lxml.)
- Renaming table to tablite to match the project name? The usual convention is import «packagename» - which in this case, obviously won't work. The chances of someone having a table.py file somewhere in a project (or the current directory) are much higher than someone having a tablite.py file -- this would create an import conflict and would break tablite from loading. (Again, not sure if there's PEP guidance on this, but as a general rule, I think avoiding naming packages with common words is a good idea. For submodules/packages this isn't an issue if the main/parent name is completely unique.)
A nice API would be something like
Table.imputation(columns=[target columns], using=[source columns], missing=[None, 'n/a', np.nan] )
Using the source columns an index can be created.
The index can then be used to calculate the distance between each target row and all other rows.
Missing denotes the values that should be considered missing.
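A rough plain-Python sketch of the idea (hypothetical, not the proposed tablite implementation): index the rows by the source columns and fill a missing target value from the nearest neighbouring row.

MISSING = {None, 'n/a'}  # values considered missing (np.nan left out to keep the sketch dependency-free)

rows = [
    {"a": 1, "b": "x", "t": 10},
    {"a": 1, "b": "y", "t": 12},
    {"a": 2, "b": "x", "t": None},   # target 't' is missing here
]

def distance(r1, r2, using=("a", "b")):
    # Hamming-style distance over the source columns
    return sum(r1[c] != r2[c] for c in using)

for row in rows:
    if row["t"] in MISSING:
        donors = [r for r in rows if r["t"] not in MISSING]
        nearest = min(donors, key=lambda r: distance(row, r))
        row["t"] = nearest["t"]      # substitute the value from the nearest neighbour

print(rows)  # the None in the third row is now filled from its closest match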
In the overview below (t)ime, stdev and count (of pages):
loading 'kyle_cs_r.tpz' file: 100%|███████████████████████| 41/41 [00:00<00:00, 124.14it/s]
dtype('int64') t: 0.00438, stdev: 0.00000, count: 1
dtype('<U46') t: 0.02037, stdev: 0.00000, count: 1
dtype('O') t: 0.00877, stdev: 0.00095, count: 27
dtype('<U36') t: 0.00767, stdev: 0.00000, count: 1
dtype('<U58') t: 0.01238, stdev: 0.00164, count: 2
dtype('<U20') t: 0.00334, stdev: 0.00064, count: 2
dtype('<U11') t: 0.00199, stdev: 0.00000, count: 1
dtype('<U23') t: 0.00536, stdev: 0.00057, count: 2
dtype('<U21') t: 0.00385, stdev: 0.00067, count: 3
dtype('<U26') t: 0.00438, stdev: 0.00000, count: 1
loading 'kyle_cs_l.tpz' file: 100%|█████████████████████████| 747/747 [10:14<00:00, 1.22it/s]
dtype('<U2') t: 0.01451, stdev: 0.01045, count: 166
dtype('int64') t: 0.02732, stdev: 0.00889, count: 332
dtype('O') t: 3.61340, stdev: 0.35634, count: 166
dtype('float64') t: 0.03205, stdev: 0.01417, count: 83
Note the second-to-last record: dtype('O') 3.61 sec / page!
My conclusion is that we should only fall back to dtype 'O' as a last resort.
Question: @realratchet - can you imagine a way I could reindex the structs without loading them?
Question: Can tablite be extended to allow pickling of non-native python objects?
Answer: H5py does support storage of binary data so yes this can be done.
Hi Team,
Can we support different datasets, store them at different locations, and process them concurrently?
Regards,
Akahs
From the readme
# We start with creating two tables:
left = Table()
left.add_column('number', int, allow_empty=True, data=[1, 2, 3, 4, None])
left.add_column('color', str, data=['black', 'blue', 'white', 'white', 'blue']) # US english color.
right = Table()
right.add_column('letter', str, allow_empty=True, data=['a', 'b,', 'c', 'd', None])
right.add_column('colour', str, data=['blue', 'white', 'orange', 'white', 'blue']) # Queens english.
# Now this won't work:
left_join = left.left_join(right, keys=['colour'], columns=['number', 'letter'])
# Because keys need to match `colour` with `color`
A possible fix could be a list containing a tuple of keys where index 0 indicates the left column and index 1 indicates the right.
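A hedged sketch of what that could look like (the tuple-based keys argument is hypothetical, not the current API):

left_join = left.left_join(right, keys=[('color', 'colour')], columns=['number', 'letter'])
# index 0 of the tuple names the left column, index 1 the right column.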
Here is a quick sketch of the format I have in mind, with some inspiration from glTF.
The first 12 bytes are the preamble, which contains the magic string TBLT (which is 0x54424C54).
The header is JSON, to make the format fully extensible in the future. Currently I have these keys in mind:
{
"page": {
"type": <type>,
"compression"?: "lz4"|null
}
}
Obviously this format cannot deal with mixed types, although we talked about getting rid of mixed types and replacing them with the datatype that encompasses all of the elements in the column. Obviously this also means that we cannot represent None and have to fill with a default value, e.g., an empty string for string types, zero for integer types. If we ever want to go back to mixed types we can instead have an array of fragments.
The types that we support:
We keep the numpy modifiers, meaning we can still set < and > for endianness and size when valid, i.e., we can still do <U64 for a page whose longest string has length 64 and is little endian.
One of the current issues we have with the numpy format is the enormous size of string pages due to padding. We still want to keep the padding because we want good interoperability with python, but we also want reduced page size with as little impact on performance as possible. Therefore we introduce a "compression" key in the header. It's an optional key that may or may not exist in the JSON; if not provided it is treated as null. It should only really be used for strings, but could technically be paired with other types too.
I tried multiple decompression algorithms; the least additional processing time to read the pages was with lz4, but there still was processing time, so maybe the table-producing function should check whether it's even necessary to use compression. But that can be parked for now.
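A small sketch of the compression idea (assumes the third-party lz4 package; this is an illustration, not the final format):

import numpy as np
import lz4.frame

page = np.array(["short", "a much longer padded string value"], dtype="<U64")  # heavily padded
raw = page.tobytes()
packed = lz4.frame.compress(raw)          # what "compression": "lz4" in the header would imply
print(len(raw), "->", len(packed))        # the zero padding compresses very well

restored = np.frombuffer(lz4.frame.decompress(packed), dtype="<U64")
assert (page == restored).all()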
The current numpy format has no concept of a date type. While date is just a subset of datetime and can be fully expressed by it, it has a different __repr__ function, and we're likely going to get a lot of complaints if all of the date values are turned into datetime, because the table will be filled with YYYY-MM-DD 00:00:00 strings.
Even though compressed strings on disk don't need to be in contiguous memory, I think it's probably still best to store them as such, just because np.frombuffer will work much nicer with it. Therefore, we replicate the way numpy stores strings and zero-pad at the end if a string is shorter than the longest string.
This is the page data, stored as either a compressed memory block or a contiguous data block. Just a raw binary dump, nothing special.
Because we said we're getting rid of the mixed type, I accounted for that when conjuring up the data format. This means we cannot have None, which is easy to handle for string pages but seems iffy when it comes to other datatypes. The concepting engine has special flags to check for None types in some widgets; would this make those flags obsolete?
Also, treating None as a default value, e.g., 0.0 for a float, could potentially have undesired side-effects. But I may be overthinking it and this isn't really an issue I should be worried about.
We could also pull in extra information about the page into the JSON, e.g., statistics if we think it's beneficial, I tried to keep this extensible.
Would it be out of the question to include an accompanying tablite.hdf5 file in the repo, possibly alongside one of the tests, or from a selected point in the demo?
charset-normalizer==2.0.0 is much better than chardet.
If it should, then this:
def setup(self, table):
""" helper to setup the group functions """
self.output = Table()
self.required_headers = self.keys + [h for h, fn in self.groupby_functions]
for h in self.keys:
col = table[h]
self.output.add_column(header=h, datatype=col.datatype, allow_empty=False) # add column for keys
Should be this:
def setup(self, table):
""" helper to setup the group functions """
self.output = Table()
self.required_headers = self.keys + [h for h, fn in self.groupby_functions]
for h in self.keys:
col = table[h]
self.output.add_column(header=h, datatype=col.datatype, allow_empty=True) # add column for keys
def test_sort_datetime():
    from datetime import datetime
    from tablite import Table

    t = Table(columns={"A": [datetime(2019, 2, 2, 12, 12, 12), datetime.now()], "B": [2, 2]})
    t.sort({"A": False})
    # try:
    #     t.sort({"A": False})
    # except:
    #     t.show()
    #     print('datetime sort failed')
    # from numpy import datetime64
    # t = Table(columns={"A": [datetime64('2005'), datetime64(datetime.now())], "B": [2, 2]})
    # try:
    #     t.sort({"A": False})
    # except:
    #     t.show()
    #     print('np.datetime sort failed')
Reindexing fails when the column spans multiple pages; here's a way to reproduce it easily. This can happen when pages are highly fragmented, like in this case, and probably even with larger tables spanning multiple pages.
from tablite import Table
from tablite.config import Config
Config.PAGE_SIZE = 1
cat_3 = Table({'order_LP_pallets': [15060, 15060],
'avg_layers_on_order_LP_pallets': [3.2756972111553786, 3.2756972111553786],
'scenario': ['LP_SCPT', 'LP_MCPT']})
cat_5 = Table({'order_boxes': [23552, 24441, 23552, 24431],
'avg_HUs_per_box': [4.162746263586956,
4.146884333701567,
4.162746263586956,
4.147558429863698],
'order_pallets': [41398, 41438, 37950, 37991],
'avg_case_per_order_pallet': [43.975167882506405,
43.95417249867272,
42.61736495388669,
42.59450922586929],
'scenario': ['SCPT', 'MCPT', 'LP_SCPT', 'LP_MCPT']})
Config.PAGE_SIZE = 1000000
join_0 = cat_5.join(
cat_3,
left_keys=["scenario"],
right_keys=["scenario"],
kind="left",
merge_keys=True,
left_columns=None,
right_columns=None
)
Hello,
It may be my personal opinion, but it seems illogical to have table columns with empty names.
Currently, tablite allows a column to have an empty name (or just whitespace).
I would expect to receive an exception telling me that "column name cannot be empty".
>>> from tablite import Table
>>> t = Table()
>>> t[""] = [1, 2, 3]
>>> t.show()
+==+=+
|# | |
+--+-+
| 0|1|
| 1|2|
| 2|3|
+==+=+
When I've created a nice table and want to paste the data into some visualisation package, it would be nice to be able to copy paste.
I am trying tablite with a CSV file with many fields, some of which are long, more specifically full texts. When reading with Table.import_file, I get the following exception:
Exception: Traceback (most recent call last):
File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/mplite/__init__.py", line 26, in execute
return self.f(*self.args,**self.kwargs)
File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/tablite/core.py", line 2646, in text_reader_task
data[header].append(fields[index])
IndexError: list index out of range
When the same dataset is converted to xlsx and then opened, no errors occur. I am attaching an example file that causes the error.
v2022_11_0. See comments for commit references.
- Config "memory": exclusively use RAM.
- new tools.py: date_range(start, stop, step), like 2022/1/1, 2023/1/1, timedelta(days=1) # returns list of datetimes.
- new Table methods:
  - table.remove_duplicate_rows()
  - table.drop_na(*arg) removes rows with None and np.nan
  - table.replace(target, replacement), which searches across all columns, e.g. t.replace(None, -1)
  - table.replace_missing_values(source=[...], target=column_name), which looks up the nearest neighbour in the sources and substitutes it into target. #18
  - table.to_pandas() and Table.from_pandas(pd.DataFrame)
  - table.to_h5() and Table.from_h5()
  - table.to_dict(columns, slice) # to_dict returns a python dict with column names as keys and lists of values as values. The optional slice permits an effective retrieval of a subset of rows.
  - Table.from_dict()
  - table.to_list() returns the list of column names + a list of each column
  - Table.from_list(column names + data) (the inverse of .to_list())
  - table.transpose(columns=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], as='day') turns the columns into a single column under the heading of 'day'
- new Column methods:
  - Column[:] returns a list of native python types. No more numpy arrays or numpy types.
  - Column.to_numpy(slice) returns numpy's ndarray.
- Documentation
- Cleaner code
@realratchet : In commit: aee872b in line 452:
"""
Ratchet:
I thought real good about it and it is not possible to mapping tasks be
multi-processed/constant memory, because every slice must know the entire right table dictionary
and anything that is not the first slice must also have all previous other slice indices.
Best we can do is reduce the RAM usage via using hash of a string or reduce memory usage via native implementation.
"""
Please correct me if I misunderstand, but as far as I see it, it is not necessary to hold the entire right table dictionary in memory.
In the process illustrated below, there are 5 tasks:
Task 1: create sparse index for a slice of LEFT and slice of RIGHT. The example shows 8 tasks that can be executed concurrently (if memory permits).
Task 2: concatenate the output of the 8 x task 1. This is a singleton and cannot run in parallel.
Task 3: [optional] sort the result from Task 2.
Task 4: construct the reindex
table based on join type:
Task 5: for each column in the new table, a task is created for a page-sized slice of either the left or right column in reindex.
PS> If the key is a table-crossing key or any key column contains strings, it makes sense to represent the key as a cryptographic hash (see the sketch below).
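A minimal sketch of hashing a composite / string-valued key for the index (plain Python; the helper below is hypothetical, not tablite internals):

import hashlib

def key_hash(*values):
    # represent a composite (possibly string-valued) join key as a fixed-size digest
    raw = "\x1f".join("" if v is None else str(v) for v in values)  # separator avoids ("ab","c") == ("a","bc")
    return hashlib.blake2b(raw.encode("utf-8"), digest_size=8).digest()

# usage: map hashed keys to row indices of a slice of the RIGHT table
right_slice = [("blue", 1), ("white", 2)]
index = {}
for row_ix, key in enumerate(right_slice):
    index.setdefault(key_hash(*key), []).append(row_ix)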
For multiprocessing to be efficient, subprocesses must be able to create tables
in the same storage as main, as this avoids the need for moving the data.
As Tables refer to Columns, which refer to Pages, which in turn contain the data, each process must keep track of how each column points to each page. This is handled by a WeakValueDictionary called the registry, where each Page has a reference count.
When a table is created and columns are added, pages may be empty. However if a
page is "copied" only the metadata (the pointer) is copied.
When the "copy" is deleted, only the metadata is deleted, as the original table
still has a reference to the data. For the reference system to remain valid,
the reference count in the registry
must obviously be decremented.
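A minimal, hypothetical sketch of the registry idea in plain Python (not the tablite internals): pages are shared by pointer, and a per-page reference count decides when the data may be dropped.

import weakref

class Page:
    registry = weakref.WeakValueDictionary()  # key -> live Page object
    refcounts = {}                            # key -> number of owners pointing at the page

    def __init__(self, key, data):
        self.key, self.data = key, data
        Page.registry[key] = self
        Page.refcounts[key] = 0

def take_reference(key):
    Page.refcounts[key] += 1        # a "copy" only copies the pointer and bumps the count
    return Page.registry[key]

def drop_reference(key):
    Page.refcounts[key] -= 1        # deleting a copy only decrements the count
    if Page.refcounts[key] == 0:
        Page.refcounts.pop(key)     # last owner gone: the data may now be erased from storage

p = Page("p1b1", data=[1, 2, 3])
a = take_reference("p1b1")
b = take_reference("p1b1")          # two columns share one page
drop_reference("p1b1")              # dropping one copy keeps the data
drop_reference("p1b1")              # dropping the last reference releases it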
When a subprocess exits (SIGTERM), python performs garbage collection (GC).
This means that any variable pointing to a table created by the subprocess
will also be deleted. This triggers a call to table.__del__(self)
.
As the main process might be busy receiving data from other subprocesses, the
__del__
call can happen before main has taken ownership of the data.
To assure that the output from the subprocess isn't lost, the subprocess marks
the table as table.saved=True
far in advance of garbage collection. The logic
is simple:
def __del__(self):
decrement reference count # ignore if unknown.
if saved != True:
delete all data.
This guarantees that the call to __del__
doesn't erase the data from storage,
and thereby permits __main__
to collect it after the subprocess has exited.
When main
loads the data using the subprocess' table key, the reference count
in main
will be correct as Table.load(path, key)
increments the reference
count.
If main sets saved = False on the table that the subprocess created before it exited, it will now be deleted.
Should there be an error in the implementation of the interaction with the subprocess, so that main does not set saved = False after loading, then the table will never be deleted.
To help with the clean up there are two options:
Table.reset_storage(), which truncates the storage to 0 bytes, i.e. 100% data loss. This is a pragmatic solution during testing where pre-existing state must be avoided.
Table.load_stored_tables()
, which returns a list of all tables in the
storage (discussed in 5, below).
A user can now loop through all tables, set saved=False and exit (SIGTERM), which will garbage collect all loaded tables, as Table.load_stored_tables() increments the reference counter from zero to one for all tables, changes saved and then drops them.
This is the preferred method whenever something unrelated to tablite
caused
the python process exit.
Whilst these options are convenient, there are a couple of edge cases that should be considered first:
3. If another terminal is opened which uses the same H5 file.
4. If a terminal is killed (SIGKILL)
5. Two terminals use the same import.
As there is no conflict with another terminal using the same store, calling
Table.reset_storage()
will have an inconvenient side effect, as it erases all
data - including any created by the other process.
The default recommendation is therefore to set saved=False
on tables at exit,
rather than resetting the storage.
SIGKILL'ed processes will also leave datasets where saved == False orphaned.
To clean this up, the new python process can start with:
>>> Table.load_stored_tables()
as this will create reference counts and drop them immediately. Saved tables
are unaffected. Should all saved tables be removed - except imports - use:
>>> for t in Table.load_stored_tables(): t.saved = False
A special and delicate case exists if two terminals use the same import, as
duplicate import is avoided. In the example below P1 and P2 are separate
processes:
P1 >>> t1 = Table.import_file(**config)
P1 performed the work to import and set t1.saved=True.
P2 >>> t2 = Table.import_file(**config)
As **config
already has been imported by P1
, t2
is loaded in P2
, not imported.
P1
has finished its task and sets saved=False
to exit correctly:
P1 >>> t1.saved = False
P1 >>> exit()
Garbage collection of t1
will now invoke t1.__del__()
and delete the data.
P2 >>> t2.show()
will return an empty table.
To handle this case import
and load
are two separate operations, where
import_file
first imports the dataset and then load
s it.
This means that to get rid of imports as well, the following call must be used:
>>> for t in Table.load_stored_tables(imports=True): t.saved = False
EOM.
To maintain confidentiality when presenting data, a feature to scramble columns would be nice.
Scrambling changes the column in place and hides the keys in the table.metadata.
Users have reported that add_row is clunky to work with. Hence the following API is suggested:
Example:
t = Table()
t.add_column('A', int)
t.add_column('B', int)
t.add_column('C', int)
The following examples are all valid and append the row (1,2,3) to the table.
t.add_row(1,2,3)
t.add_row([1,2,3])
t.add_row((1,2,3))
t.add_row(*(1,2,3))
t.add_row(A=1, B=2, C=3)
t.add_row(**{'A':1, 'B':2, 'C':3})
The following examples add two rows to the table
t.add_row((1,2,3), (4,5,6)) # adds two rows.
t.add_row([1,2,3], [4,5,6])
t.add_row({'A':1, 'B':2, 'C':3}, {'A':4, 'B':5, 'C':6}) # two (or more) dicts as args.
t.add_row([{'A':1, 'B':2, 'C':3}, {'A':1, 'B':2, 'C':3}]) # list of dicts.
Please vote & comment.
Hello,
I was wondering, could there be an issue with file_reader when using text_qualifier? It shockingly reduces the performance of the import...
I've made a test script for this, where I show difference of import with pandas
, numpy
and tablite
.
from tablite import Table
from tablite.config import Config
import numpy as np
import time
from pandas import DataFrame
import pandas
# Create a wide and short table
wide_df = {k: np.random.random_sample(size = 3000) for k in range(0, 1000)}
df_to_save = DataFrame.from_dict(wide_df)
df_to_save.to_csv("./test.csv")
# Generated table size == 56473KB
start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, text_qualifier='"', delimiter=',')
end = time.time()
print(end - start)
# OUTPUT
importing: consolidating 'test.csv': 100.00%|██████████| [1:18:05<00:00]
4685.573677778244
start = time.time()
new_df = pandas.read_csv("./test.csv", escapechar='"', delimiter=',')
end = time.time()
print(end - start)
# OUTPUT
0.45365166664123535
Disclaimer: the comparison to numpy is not 100% analytically correct, since there is no text escaping.
start = time.time()
arr = np.genfromtxt('./test.csv', delimiter=',', dtype=None)
end = time.time()
print(end - start)
# OUTPUT
5.538379430770874
At first I thought that there could be overhead from multiprocessing being enabled, since there are a lot of columns and they are short, but that is not the case, because using Config.MULTIPROCESSING_MODE = Config.FALSE made the performance even worse. I was impatient, since in 31 minutes tablite was only able to import 73.68%, where the progress bar had started at 70%.
importing: saving 'test1.csv' to disk: 73.68%|███████▎ | [31:30<3:37:32]
When used without the text_qualifier option, tablite imports the same file in 2 minutes 21 seconds (which is still quite a bit slower than pandas):
start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, delimiter=',')
end = time.time()
print(end - start)
# OUTPUT
importing: consolidating 'test.csv': 100.00%|██████████| [02:21<00:00]
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3004.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3005.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3006.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3007.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3008.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3009.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3010.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3011.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3012.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3013.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3014.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3015.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3016.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3017.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3018.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3019.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3020.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3021.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3022.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3023.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3024.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3025.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3026.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3027.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3028.npy
...
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4002.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4003.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4004.npy
141.9796497821808
A method needs to be implemented to handle out-of-memory joins.
Notes: https://discuss.python.org/t/predict-and-exploit-available-memory-deterministically/40009
Problems:
Proposed changes:
Branch:
The solution will be developed in the branch hdf5_ectomy
.
Numpy's NEP-1 presented a simple file format which roughly looks like this:
\x93NUMPY\x03\x00\x00
{"descr":"<i8", "fortran_order":False,"shape":(3,3)}\n
\x05\x00\x00\x00\x00\x00\x00\x00
\x04\x00\x00\x00\x00\x00\x00\x00
\x07\x00\x00\x00\x00\x00\x00\x00
\x09\x00\x00\x00\x00\x00\x00\x00
\x01\x00\x00\x00\x00\x00\x00\x00
\x08\x00\x00\x00\x00\x00\x00\x00
\x03\x00\x00\x00\x00\x00\x00\x00
\x06\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00
\x93NUMPY is the magic prefix.
\x03 is the major version of the numpy file format.
\x00 is the minor version of the numpy file format.
v\x00 (4 bytes forming a little-endian unsigned int) is the length of the header, which is padded so that the preamble plus header is a multiple of 64 bytes.
{"descr":"<i8","fortran_order":False,"shape":(3,3)}\n is the file header, with the expected keys sorted in alphabetic order and ending with a \n character.
The remainder is the raw array data in the dtype given by descr (here little-endian 8-byte integers).
Read more in numpy file format
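A quick way to inspect this layout with numpy itself (a hedged sketch; note that np.save picks the lowest format version that fits, typically 1.0, where the header length is a 2-byte little-endian unsigned short rather than 4 bytes):

import numpy as np
from io import BytesIO

buf = BytesIO()
np.save(buf, np.array([[5, 4, 7], [9, 1, 8], [3, 6, 2]], dtype="<i8"))
raw = buf.getvalue()

assert raw[:6] == b"\x93NUMPY"                    # magic prefix
major, minor = raw[6], raw[7]                     # format version
header_len = int.from_bytes(raw[8:10], "little")  # header length (format version 1.0)
header = raw[10:10 + header_len].decode("latin1")
print(major, minor, header)                       # the dict-like header, space padded, ends with '\n'
print((10 + header_len) % 64 == 0)                # recent numpy aligns preamble + header to 64 bytes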
In the config.py
the working directory is set, which defaults to:
from tablite import Table
import tempfile
from pathlib import Path
import os
temp = Path(tempfile.gettempdir()) / f"tablite/{os.getpid()}"
if not temp.exists():
    temp.mkdir(parents=True)
Table.workdir = temp
Within the workdir
the data is organised as follows:
/tmp/tablite/PID-14253/
pages/
p1b1.npy # size set by settings.page_size
p1b2.npy
p2b1.npy
p2b2.npy
...
pNbN.npy
page_index/ # linked list referring pages back to tables/
1.yml
2.yml
...
tables/
1.yml
2.yml
...
N.yml
The pages/
are all of the npy
format, identifiable as p<page number>b<block number>
. New pages are named using file numbers from 1...int64max so that empty slots are tolerated.
The page_index/
contains the equivalent of a linked list of pages pointing to tables.
%YAML 1.2
filename: 1.yml
tables: [1,2,...]
The tables/
are all of the yaml
format, identifiable by a numeric. The contents of a YAML file are illustrated below:
%YAML 1.2                     # yaml version
filename: 1.yml               # reverse file identifier.
temp: false                   # temp identifier.
columns:                      # start of columns section.
  1:                          # start of column 1.
    name: "列 1"              # column name.
    pages: [p1b1, p1b2]       # list of pages in column 1.
    types: []                 # empty list of types, so column 1 is a C-level data format.
  2:                          # start of column 2.
    name: "列 2"              # column name.
    pages: [p2b1, p2b2]       # list of pages in column 2.
    types: [p3b1, p3b2]       # list of type codes, so column 2 is not a C-level data format.
tables and columns are mutable pointers to pages.
For Tablite to know what is what, the user must register the function for the type code. In python this is done automatically using dtype or pickle.
Tablite data is considered volatile. This means that EVERY TIME a process shuts down, the workdir
for the PID is erased (sigint
- a part of __exit__
).
If the process is sigkill'ed, the data will remain idle until the PID is used again, whereby the process will erase the contents of the PID-specific folder. There is no other way.
Due to Tablite's focus on larger-than-memory datasets, Tablite must write to disk. To assure that we won't run out of disk space, the function that writes will use a disk_limit=0.90 and complain if the disk usage exceeds 90%, as sketched below.
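A minimal sketch of such a check (shutil-based; the function name and exception are hypothetical, only disk_limit comes from the text above):

import shutil

def check_disk(path="/tmp", disk_limit=0.90):
    # complain before writing if the disk holding `path` is more than 90% full
    usage = shutil.disk_usage(path)
    if usage.used / usage.total > disk_limit:
        raise OSError(f"disk usage above {disk_limit:.0%}, refusing to write to {path}")

check_disk()  # called before each write to the working directory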
During multiprocessing, Tablite will share the working directory with the sub-processes in a parent/child-type hierarchy, so that subprocesses won't scatter data in various random folders.
With tmp/tablite/<PID>/... as the working directory, a simple starting point for PERSISTENT usage of data would look like this:
- t = Table.import_file("/my/docs/originals/somefile.csv")
- table.save() to your preferred folder, such as /my/docs/wip/data.npz, which is a portable zip'ed folder of the table from the working directory. THIS FILE WILL NEVER BE ERASED BY TABLITE.
- t = Table.load("/my/docs/wip/data.npz"), which drops the imported table in the working directory and uses the .npz instead - which now is tablite & numpy native.
Thereby, if you have a large test suite, you may avoid the reimport of datasets completely. Any subsequent work that uses t will still happen in the working directory under the PID.
The reason is that t.save(...)
creates an isolated portable snapshot of the table's data, whilst setting T.workdir = ...
uses paginated linking, which is far more space efficient and hence faster.
When tasks are created by main, the destination folder must be declared in the task, so that when a new subprocess starts, it knows which process' folder to write to. This guarantees that there will be no conflict when two Python processes use tablite concurrently.
Could you give an example of how to reverse a pivot table in the README?
When slicing a table the result is inconsistent based on number of columns.
If more than 1 column exists tbl[slice(...)]
will produce a new Tablite table from the original table for the given slice.
If table has exactly 1 column tbl[slice(...)]
will produce a np.array from the only column for the given slice.
If table has exactly 0 columns tbl[slice(...)]
will produce an empty table.
I think the API should not behave differently based on the table layout and should instead be consistent.
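A small sketch reproducing the reported behaviour (hedged; the return types per case are as reported above):

from tablite import Table

t2 = Table({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(type(t2[0:2]))   # 2+ columns: reported to return a new Table for the slice

t1 = Table({'A': [1, 2, 3]})
print(type(t1[0:2]))   # exactly 1 column: reported to return a numpy array instead

t0 = Table()
print(type(t0[0:2]))   # 0 columns: reported to return an empty table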
As the title says: type hints are very helpful for code completion, and since Python 3.8 predates the type-hinting features we want to use, we will stop supporting it.
(The transition from 3.8 to 3.9 should be painless for most users anyway.)
Task: Assure type hints throughout the code base.
Tablite - more (?) or better disk support?
The benefit of tablite is that it doesn't care about whether the data is in RAM or on DISK. The user's usage is the same. Right now SQLite is used for the disk operation.
That's ok, but limited to usage with tempdir.
Sometimes users want to use another directory, store the tables in a particular state and then continue their work at a later point in time. Whilst tempdir can be used for that, tablite has all the data from memory sitting in the same SQLite blob. So to persist individual tables means moving data, which equals time.
It is probably better to allow the user to:
Table.set_working_dir(pathlib.Path)
which allows the user to have all data in here.
Set Table.metadata['localfile'] = "tbl_0.h5"
which allows tablite to conclude that "working_dir/tbl_0.h5" is where the user wants the file. Even if the user doesn't want to manage the file name but just wants to be sure that the data doesn't cause MemoryOverFlow, the user can do:
Table.new_tables_use_disk = True
Table.from_file(filename,...)
and all files will be named in the metadata as:
Table.metadata['filename'] = filename1.csv
Table.metadata['localfile'] = working_dir/filename1.h5
If the filename already is h5 format, the data is merely "imported" by reading the h5 format. This requires only that the metadata goes into .attrs (which is dict-like anyway):
import h5py
f = h5py.File("filename.h5","w")
column_a = f.create_dataset("column A", data=[100, 99, 98, 97], dtype='i')  # 4 records.
column_b = f.create_dataset("column B", data=["a", "b", "c", "d"], dtype=h5py.string_dtype())  # 4 records.
? metadata?
I keep writing h5, because I like hdf5's SWMR function for multiprocessing, in contrast to SQLite's blocking mode.
This brings me to some missing features:
- Table.from_file(filename..., limit=100) should return the head (100 rows) of the file only.
- Table.from_file(filename..., columns=['a','c','d']) could return the columns a,c,d (not b,e,f,...) only.
- Table.from_file(filename..., columns=['a','c','d'], datatypes=[int,str,date], error=None) could return the columns a,c,d with the corresponding datatypes and use None for exceptions.
- Table.to_hdf5(filename=None, ...) is an alias for to_hdf5 in tempfile mode.
- Table.from_file(....) should print the file location if use_disk==True.
Consequences?
This will require that the StoredColumn inherits the Table's localfile for data access, but otherwise there's no problem.
The file_reader
function find_format
can be done in parallel, as each worker can read the h5 file with focus on a single column of values and return the appropriate dataformat in isolation.
When exiting, an atexit hook is required for flush() and close() so that the h5 file is closed properly.
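A small sketch of such a hook (assuming an open h5py.File handle named f; the hook name is hypothetical):

import atexit
import h5py

f = h5py.File("filename.h5", "a")

@atexit.register
def _close_h5():
    # flush and close so the h5 file is left in a consistent state on normal exit
    if f:              # an h5py.File handle is falsy once it has been closed
        f.flush()
        f.close()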
When opening a file that already exists, the loading time should be limited to the time it takes to read the metadata of the h5 file as everything else already is in place.
Currently tablite uses the system timezone info when dealing with dates. This should be changed to use UTC unless specified otherwise. While it doesn't matter as long as everything runs on the local machine, there can be cases where this is not true. Additionally, there are two tests that will always fail because they only work in the UTC timezone.
Use case:
Drives:
/tmp/tablite.hdf5 # local working memory on SSD.
/C/datasources.hdf5 # source data from the client (read only access)
/T/clean_data.hdf5 # teams cleaned data.
Workflow:
prerequisite: User selects site 'Corrusant'
site_name= 'Corrusant'
all_tables = Table.reload_stored_tables('/C/datasources.hdf5') # rapid remote header level read-only access.
# sift through all tables and create tables for site on localhost using the default 'H5_STORAGE':
wip = [ t.any(**{'site': lambda x: x == site_name}) for t in all_tables ]
# undisclosed process used to clean up the data.
# create/append to a super table with all data from all sites. Note that this table may already exist!
super_table = Table(H5_location="/T/clean_data.hdf5")
for table in wip:
super_table.stack(table) # append to the team wide super table.
At this point clean data exists in a super table for the whole team to work on.
In numerical calculations users may want to use 0 or 0.0 instead of None. In strings users may want a different value, f.x. "" as None.
What are the pro/con for implementing a custom None value?
users can/could just update the table instead:
sometable['column_x'] = ["" if x is None else x for x in sometable['column_x']]   # *strings*
sometable['column_x'] = [0.0 if x is None else x for x in sometable['column_x']]  # *floats*.
sometable['column_x'] = [0 if x is None else x for x in sometable['column_x']]    # *integers*.
...?
There are some discrepancies in statistical functions, namely mode and median.
Median: expected value datetime(2020, 1, 1, 0, 0)
To create the table in question:
from datetime import datetime
from tablite import Table
columns = ['SKU', 'length', 'qty', 'fragile', 'date']
data = [
("1", 0.3, 1, False, datetime(2020, 1, 1)),
("5", 1.0, 2, None, datetime(2020, 1, 1)),
("2", 0.2, 1, True, datetime(2020, 1, 3)),
("2", None, 3, True, datetime(2020, 1, 2)),
("4", 0.3, 1, False, None),
]
table = Table.from_dict({
key: val
for key, val in zip(columns, [list(x) for x in zip(*data)])
})
Example 1:
t = Table()
t['A'] = [1,2,3]
t['B'] = [] # empty
t.show()
output:
C:\Data\venv\test310\lib\site-packages\tablite\memory_manager.py:262: in get_data
return np.concatenate(arrays, dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = ([],), kwargs = {'dtype': dtype('O')}, relevant_args = []
> ???
E ValueError: need at least one array to concatenate
Example 2:
t = Table()
t['A'] = [1,2,3]
t['B'] = [10] # short
t.show()
s.append("+" + "+".join(["-" * widths[n] for n in names]) + "+")
> for ix, row in enumerate(self.rows):
E RuntimeError: generator raised StopIteration
Proposal:
t = Table()
t['A'] = [1,2,3]
t['B'] = [10] --> [10,None,None]
t.show() + a warning("columns of unequal length!")
t = Table()
t['A'] = [1,2,3]
t['B'] = [] --> [None,None,None]
t.show() + a warning("columns of unequal length!")
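A hedged sketch of the proposed behaviour in plain Python (the helper name is hypothetical, not a tablite API):

import warnings

def pad_columns(columns):
    # pad shorter columns with None so all columns share the longest length
    longest = max((len(v) for v in columns.values()), default=0)
    for name, values in columns.items():
        if len(values) < longest:
            warnings.warn("columns of unequal length!")
            values.extend([None] * (longest - len(values)))
    return columns

print(pad_columns({'A': [1, 2, 3], 'B': [10]}))  # B -> [10, None, None]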