tablite's People

Contributors

cerv15, github-actions[bot], jetman80, omensi, realratchet, root-11, taylor88lee


tablite's Issues

Addition of match operator

There is a need for a conditional filter based on the columns of another table. It could be exposed as a .match(...) method.

Reasoning:

  1. We have a .filter(...) method that can filter rows in a table based on criteria. The problem: it can only match the table against its own columns or given primitive values.
  2. We have a .lookup(...) method that can do a conditional join of two tables based on certain criteria. The problem: when no matches are found, Nones are filled in their place.

In the current implementation, finding the rows of table1 that match the conditions in table2 means querying the table with the lookup operator and then filtering out the None results with filter. This iterates over the base table twice; the match method would perform the operation in a single pass.
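
A rough sketch of the difference. The lookup/filter call signatures below are illustrative only and may not match tablite's exact API, and .match(...) is the proposed method, not an existing one:

from tablite import Table

table1 = Table({'sku': [1, 2, 3, 4], 'qty': [10, 0, 5, 7]})
table2 = Table({'sku': [2, 3], 'status': ['ok', 'ok']})

# today: two passes over table1 (signatures illustrative).
joined = table1.lookup(table2, ("sku", "==", "sku"))                 # unmatched rows get None
matched = joined.filter([{"column1": "status", "criteria": "!=", "value2": None}])

# proposed: a single pass.
matched = table1.match(table2, ("sku", "==", "sku"))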

Drop columns during txt, csv and tsv import

The header says it all.

file_reader(.... keep=['column A', 'column C']) # dropping column B

The inverse operation:

file_reader(.... drop=['column A', 'column C']) # keeping column B.

Permit the user to declare datatypes in addition:

file_reader(.... keep=['column A', 'column C'], dtype=['int', 'date'])

whereby column A is declared as int and column C as date.

Documentation is out of sync

I think the tablite library is awesome in terms of handling large data. Thanks for the development efforts!

However, I noticed that rapid changes to the core elements have left the documentation out of sync and, at this point, completely misleading.

Why?

  • README.md states that tablite uses h5py, however I can't seem to find anything related to it in the code (it only appears in requirements.txt).
  • Some of the claims about RAM or disk usage may be outdated. I know tablite now uses numpy as its data storage.

There might be other things, but the front page is definitely the first place where outdated information shows up.

Load HTML and/or Consider Refactoring

Just stumbled across tablite. Really cool, and I definitely agree with the use case (pandas and numpy are amazing, but also super massive when you just want to deal with a 5-column, 20-row table).

What are your thoughts on:

  1. Loading HTML tables - writing a parser from scratch seems duplicative - I was thinking of something along the lines of being able to pass in a bs4.element.Tag (the result of page.select_one('table#tabid') from BeautifulSoup); see the sketch after this list. (Although I try to avoid suggesting new dependencies, BeautifulSoup seems popular enough that making it an optional dependency for parsing HTML tables is a reasonable compromise to avoid parsing manually - or with html5lib/lxml.)
  2. Refactoring some of the code into multiple files? While it's convenient to have everything in one file, the source is up to 2700+ lines. As new features are added, it's only going to be more work to rearrange. (I get that the goal is to keep things lightweight - that's the whole point of tablite - but that doesn't mean that multiple source files are bad. It helps with organization and also offers an opportunity to review for more fine-grained refactoring, to see whether there are duplicated sections that could be optimized.)
  3. Renaming the installed package name from table to tablite to match the project name?
  • This one isn't great because it's a major breaking change... which sucks
  • This stands out for a couple of reasons:
    • When installing a package, I usually assume that after installing, I'm going to do an import «packagename» - which in this case, obviously won't work
    • I'm not sure if there's a PEP with guidance on this, but I think it's a common enough practice that when the installed module name is different from the project/package name, it is a surprise
    • "table" is super common. The chances of someone having a table.py file somewhere in a project (or the current directory) are much higher than someone having a tablite.py file -- this would create an import conflict and would break tablite from loading. (Again, not sure if there's PEP guidance on this, but as a general rule, I think avoiding naming packages with common words is a good idea. For submodules/packages this isn't an issue if the main/parent name is completely unique.)

implement replace missing values with MP support

A nice API would be something like

Table.imputation(columns=[target columns], using=[source columns], missing=[None, 'n/a', np.nan] )

Using the source columns, an index can be created.
The index can then be used to calculate the distance between each target row and all other rows.
The missing argument denotes the values that should be considered missing.
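
A toy sketch of the idea over plain Python rows (dicts), standing in for the proposed Table.imputation; the distance measure and all names are illustrative:

MISSING = {None, 'n/a'}  # plus np.nan in a real implementation

def impute(rows, source_columns, target_columns):
    """Nearest-neighbour imputation: fill missing target values from the closest donor row."""
    def distance(a, b):
        # index on the source columns: here a simple mismatch count.
        return sum(1 for c in source_columns if a[c] != b[c])

    for row in rows:
        for col in target_columns:
            if row[col] in MISSING:
                donors = [r for r in rows if r[col] not in MISSING]
                if donors:
                    row[col] = min(donors, key=lambda r: distance(row, r))[col]
    return rows

rows = [{'a': 1, 'b': 'x', 'c': 10}, {'a': 1, 'b': 'x', 'c': None}, {'a': 2, 'b': 'y', 'c': 30}]
impute(rows, source_columns=['a', 'b'], target_columns=['c'])  # second row gets c=10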

Table.load very slow with dtype('O')

The overview below shows (t)ime per page, stdev and count (of pages):

loading 'kyle_cs_r.tpz' file: 100%|███████████████████████| 41/41 [00:00<00:00, 124.14it/s]
dtype('int64') t: 0.00438, stdev: 0.00000, count: 1
dtype('<U46')  t: 0.02037, stdev: 0.00000, count: 1
dtype('O')     t: 0.00877, stdev: 0.00095, count: 27
dtype('<U36')  t: 0.00767, stdev: 0.00000, count: 1
dtype('<U58')  t: 0.01238, stdev: 0.00164, count: 2
dtype('<U20')  t: 0.00334, stdev: 0.00064, count: 2
dtype('<U11')  t: 0.00199, stdev: 0.00000, count: 1
dtype('<U23')  t: 0.00536, stdev: 0.00057, count: 2
dtype('<U21')  t: 0.00385, stdev: 0.00067, count: 3
dtype('<U26')  t: 0.00438, stdev: 0.00000, count: 1

loading 'kyle_cs_l.tpz' file: 100%|█████████████████████████| 747/747 [10:14<00:00,  1.22it/s]
dtype('<U2')     t: 0.01451, stdev: 0.01045, count: 166
dtype('int64')   t: 0.02732, stdev: 0.00889, count: 332
dtype('O')       t: 3.61340, stdev: 0.35634, count: 166
dtype('float64') t: 0.03205, stdev: 0.01417, count: 83

Note the second-to-last record: dtype('O') at 3.61 sec / page!

My conclusion is that dtype 'O' should only be used as a last resort.

Question: @realratchet - can you imagine a way I could reindex the structs without loading them?

SQL syntax does not support composite keys

From the readme

# We start with creating two tables:
left = Table()
left.add_column('number', int, allow_empty=True, data=[1, 2, 3, 4, None])
left.add_column('color', str, data=['black', 'blue', 'white', 'white', 'blue'])  # US english color.

right = Table()
right.add_column('letter', str, allow_empty=True, data=['a', 'b,', 'c', 'd', None])
right.add_column('colour', str, data=['blue', 'white', 'orange', 'white', 'blue'])  # Queens english.

# Now this won't work:
left_join = left.left_join(right, keys=['colour'], columns=['number', 'letter'])

# Because keys need to match `colour` with `color`

A possible fix could be to accept a list of key tuples, where index 0 is the left column name and index 1 is the right column name.
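
A sketch of what that could look like; the tuple-based keys argument is hypothetical:

# hypothetical API: each key is a (left_column, right_column) tuple.
left_join = left.left_join(
    right,
    keys=[('color', 'colour')],      # index 0 = left column, index 1 = right column
    columns=['number', 'letter'],
)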

Proposed format specification

Intro

Here is a quick sketch of the format I have in mind, with some inspiration from glTF.


Preamble

The first 12 bytes are the preamble, which is as follows:

  • 4B MAGIC (maybe TBLT which is 0x54424C54)
  • 4B VERSION
  • 4B HEADER LENGTH - length of the header JSON contents in bytes

Header

The header is JSON, to make the format fully extensible in the future. Currently I have the following in mind:

{
    "page": {
        "type": <type>,
        "compression"?: "lz4"|null
    }
}

Obviously this format cannot deal with mixed types, although we talked about getting rid of mixed types and replacing them with the datatype that encompasses all of the elements in the column. Obviously this also means that we cannot represent None and have to fill with a default value, e.g. an empty string for string types and zero for integer types. If we ever want to go back to mixed types, we can instead have an array of fragments.
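
A minimal sketch of a writer for the preamble and header described above; the magic value, version number and helper name are illustrative, and compression is left out:

import json
import struct

import numpy as np

def write_page(path, array, compression=None):
    # header JSON as proposed above; more keys can be added later.
    header = json.dumps({"page": {"type": array.dtype.str, "compression": compression}}).encode("utf-8")
    with open(path, "wb") as fh:
        # 4B magic, 4B version, 4B header length = the 12-byte preamble.
        fh.write(struct.pack("<4sII", b"TBLT", 1, len(header)))
        fh.write(header)
        fh.write(array.tobytes())  # the raw page data

write_page("/tmp/page.tblt", np.arange(10, dtype="<i8"))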

Types

The types that we support:

  • Standard numpy interface:
    • ? (bool)
    • i (int)
    • f (float)
    • U (string)
    • m (timedelta/time) - 64 bit integer
    • M (datetime) - 64 bit integer (unix timestamp)
  • Extended interface types:
    • N (date) - 64 bit integer (unix timestamp)

We keep the numpy modifiers, meaning we can still set < and > for endianness and size when valid, i.e., we can still use <U64 for a little-endian page whose longest string is 64 characters long.

Compression

One of the current issues we have with the numpy format is the enormous size of string pages due to padding. We still want to keep the padding because we want good interoperability with Python, but we also want reduced page size with as little impact on performance as possible. Therefore we introduce the "compression" key in the header. It is optional and may or may not exist in the JSON; if not provided it is treated as null. It should really only be used for strings, but could technically be paired with other types too.

I tried multiple decompression algorithms; lz4 added the least extra processing time when reading pages, but there was still some, so maybe the table-producing function should check whether compression is even necessary. That can be parked for now.
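
A quick way to see the effect, assuming the third-party lz4 package; the numbers will vary, but padded string pages compress very well:

import lz4.frame  # third-party: pip install lz4
import numpy as np

page = np.array(["a", "bb", "one long outlier string that forces wide padding"] * 1000, dtype="<U64")
raw = page.tobytes()                       # heavily padded UCS-4 buffer
packed = lz4.frame.compress(raw)
print(f"{len(raw)} -> {len(packed)} bytes")
assert lz4.frame.decompress(packed) == raw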

Extended interface

The current numpy format has no concept of a date type. While date is just a subset of datetime and can be fully expressed by it, it has a different __repr__, and we're likely to get a lot of complaints if all date columns are turned into datetime and the table fills up with YYYY-MM-DD 00:00:00 strings.

Handling strings

Even though compressed strings on disk don't need to be in contiguous memory, I think it's still best to store them that way, because np.frombuffer works much more nicely with it. Therefore we replicate the way numpy stores strings and zero-pad at the end any string shorter than the longest string.
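
For illustration, the zero-padded, contiguous layout round-trips cleanly through np.frombuffer:

import numpy as np

strings = ["a", "bb", "ccc"]
width = max(len(s) for s in strings)
page = np.array(strings, dtype=f"<U{width}")      # shorter strings are zero-padded

buf = page.tobytes()                              # contiguous fixed-width buffer
restored = np.frombuffer(buf, dtype=f"<U{width}")
assert list(restored) == strings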

Page

This is the page data, stored either as a compressed memory block or as a contiguous data block. Just a raw binary dump, nothing special.

The downsides

Because we said we're getting rid of mixed types, I accounted for that when designing the data format. This means we cannot have None, which is easy to live with for string pages but personally seems iffy when it comes to other datatypes. The concepting engine has special flags to check for None types in some widgets; would this make those flags obsolete?

Also, treating None as a default value, e.g. 0.0 for floats, could potentially have undesired side effects. But I may be overthinking it, and this isn't really an issue I should be worried about.

Other possible additions

We could also pull extra information about the page into the JSON, e.g. statistics, if we think it's beneficial. I tried to keep this extensible.

Inclusion of a sample tablite.hdf5 file?

Would it be out of the question to include an accompanying tablite.hdf5 file in the repo, possibly alongside one of the tests, or from a selected point in the demo?

Should groupby permit nones?

If it should, then this:

    def setup(self, table):
        """ helper to setup the group functions """
        self.output = Table()
        self.required_headers = self.keys + [h for h, fn in self.groupby_functions]

        for h in self.keys:
            col = table[h]
            self.output.add_column(header=h, datatype=col.datatype, allow_empty=False)  # add column for keys

Should be this:

    def setup(self, table):
        """ helper to setup the group functions """
        self.output = Table()
        self.required_headers = self.keys + [h for h, fn in self.groupby_functions]

        for h in self.keys:
            col = table[h]
            self.output.add_column(header=h, datatype=col.datatype, allow_empty=True)  # add column for keys

sorting problem with datetime dt columns

def test_sort_datetime():
    from datetime import datetime

    t = Table(columns={"A": [datetime(2019, 2, 2, 12, 12, 12), datetime.now()], "B": [2, 2]})
    t.sort({"A": False})

    # try:
    #     t.sort({"A": False})
    # except:
    #     t.show()
    #     print('datetime sort failed')

    # from numpy import datetime64
    # t = Table(columns={"A": [datetime64('2005'), datetime64(datetime.now())], "B": [2, 2]})
    # try:
    #     t.sort({"A": False})
    # except:
    #     t.show()
    #     print('np.datetime sort failed')

Join (reindexing) fails when table spans multiple pages

The reindexing task fails when a column spans multiple pages; here's a way to reproduce it easily. This can happen when pages are highly fragmented, like in this case, and probably also with larger tables spanning multiple pages.

from tablite import Table
from tablite.config import Config

Config.PAGE_SIZE = 1

cat_3 = Table({'order_LP_pallets': [15060, 15060],
               'avg_layers_on_order_LP_pallets': [3.2756972111553786, 3.2756972111553786],
               'scenario': ['LP_SCPT', 'LP_MCPT']})

cat_5 = Table({'order_boxes': [23552, 24441, 23552, 24431],
               'avg_HUs_per_box': [4.162746263586956,
                                   4.146884333701567,
                                   4.162746263586956,
                                   4.147558429863698],
               'order_pallets': [41398, 41438, 37950, 37991],
               'avg_case_per_order_pallet': [43.975167882506405,
                                             43.95417249867272,
                                             42.61736495388669,
                                             42.59450922586929],
               'scenario': ['SCPT', 'MCPT', 'LP_SCPT', 'LP_MCPT']})

Config.PAGE_SIZE = 1000000

join_0 = cat_5.join(
    cat_3,
    left_keys=["scenario"],
    right_keys=["scenario"],
    kind="left",
    merge_keys=True,
    left_columns=None,
    right_columns=None
)

Columns with empty names

Hello,

It may be my personal opinion, but it seems illogical to have table columns with empty names.

Currently, tablite allows a column with an empty name (or just whitespace).

I would expect to receive an exception telling me that the column name cannot be empty.

>>> from tablite import Table
>>> t = Table()
>>> t[""] = [1, 2, 3]
>>> t.show()
+==+=+
|# | |
+--+-+
| 0|1|
| 1|2|
| 2|3|
+==+=+
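
A guard along these lines would produce the expected error; this is a sketch of the requested check, not tablite's actual code:

def validate_column_name(name):
    # reject empty or whitespace-only column names, as requested above.
    if not isinstance(name, str) or not name.strip():
        raise ValueError("column name cannot be empty")
    return name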

multi proc groupby


TODO: figure out how to measure the memory footprint so that parallel CPU workers don't collide on memory.

copy to clipboard would be nice.

When I've created a nice table and want to paste the data into some visualisation package, it would be nice to be able to copy and paste it.
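
A workaround sketch until such a feature exists, assuming the third-party pyperclip package and that the table exposes its column names via .columns; tab-separated text pastes cleanly into most spreadsheet and visualisation tools:

import pyperclip  # third-party: pip install pyperclip

def copy_to_clipboard(table):
    names = list(table.columns)                     # assumes a name -> column mapping
    lines = ["\t".join(names)]
    for row in zip(*(list(table[name]) for name in names)):
        lines.append("\t".join(str(v) for v in row))
    pyperclip.copy("\n".join(lines))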

Tablite throws IndexError when reading a complex CSV file

I am trying tablite with a CSV file that has many fields, some of which are long - more specifically, full texts. When reading with Table.import_file, I get the following exception:

Exception: Traceback (most recent call last):
  File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/mplite/__init__.py", line 26, in execute
    return self.f(*self.args,**self.kwargs)
  File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/tablite/core.py", line 2646, in text_reader_task
    data[header].append(fields[index])
IndexError: list index out of range

When the same dataset is converted to xlsx and then opened, no errors occur. I am attaching an example file that causes the error.

test.csv

List of updates for version 2022.11

  • A tickmark means the item is done in branch v2022_11_0. See the comments for commit references.

Config

  • A setting so that tablite can run single-core only, if educators want this.
  • Allow multiple h5 files. tempdir is thereby the default storage, but the user can point individual tables to other storage. #22
  • Allow the user to set the h5 file to "memory" and exclusively use RAM.

new tools.py

  • date_range(start, stop, step), e.g. 2022/1/1, 2023/1/1, timedelta(days=1)  # returns a list of datetimes.
  • add datatypes.guess
  • add xround

new Table methods

  • table.remove_duplicate_rows()
  • table.drop_na(*args) removes rows containing None or np.nan
  • table.replace(target, replacement), which searches across all columns, e.g. t.replace(None, -1)
  • table.replace_missing_values(source=[...], target=column_name), which looks up the nearest neighbour in the sources and substitutes it into the target. #18
  • table.to_pandas()
  • Table.from_pandas(pd.DataFrame)
  • table.to_h5()
  • Table.from_h5()
  • table.to_dict(columns, slice) # to_dict returns a python dict with column names as keys and lists of values as values. The optional slice permits effective retrieval of a subset of rows.
  • Table.from_dict()
  • table.to_list() returns the list of column names plus a list for each column
  • Table.from_list(column names + data) (the inverse of .to_list())
  • table.transpose(columns=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], as='day') turns the columns into a single column under the heading 'day'
  • complete the list of importable formats.
  • remove the requirement for column name declaration
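
None of these methods exist yet as specified; purely to illustrate the intent of the proposals above, usage could look like this:

import numpy as np
from tablite import Table

t = Table.from_dict({'A': [1, 1, None], 'B': [2, 2, 3]})

t = t.remove_duplicate_rows()          # drops the duplicate (1, 2) row
t = t.drop_na(None, np.nan)            # drops rows containing None / np.nan
t = t.replace(None, -1)                # search & replace across all columns

d = t.to_dict(['A'], slice(0, 2))      # subset retrieval as a plain dict
df = t.to_pandas()                     # round-trip via pandas
t2 = Table.from_pandas(df)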

new Column methods:

  • Column[:] returns list of native python types. No more numpy arrays or numpy types.
  • Column.to_numpy(slice) returns numpy's ndarray.

Documentation

Cleaner code:

  • test for page size limit. Add config to break early (90% of limit) or break late (110% of limit).
  • test for various np broadcast functions.

Join multiprocessing take 2.

@realratchet: in commit aee872b, line 452:
"""
Ratchet:

    I thought real good about it and it is not possible to mapping tasks be
    multi-processed/constant memory, because every slice must know the entire right table dictionary
    and anything that is not the first slice must also have all previous other slice indices.

    Best we can do is reduce the RAM usage via using hash of a string or reduce memory usage via native implementation.
"""

Please correct me if I misunderstand, but as far as I see it, it is not necessary to hold the entire right table dictionary in memory.

In the process illustrated below, there are 5 tasks:

Task 1: create sparse index for a slice of LEFT and slice of RIGHT. The example shows 8 tasks that can be executed concurrently (if memory permits).

Task 2: concatenate the output of the 8 x task 1. This is a singleton and cannot run in parallel.

Task 3: [optional] sort the result from Task 2.

Task 4: construct the reindex table based on join type:

  • Inner = sparse table.
  • Left = left table range + duplicates in the sparse table.
  • Outer = left table range repeated by right table range.
    This process is single core.

Task 5: for each column in the new table, a task is created for a page-sized slice of either the left or right column in reindex.

  • To construct the new page, each process first has to gather the required rows by reading multiple slices. In the illustration, REINDEX RIGHT shows this with blue and orange backgrounds marking respectively the 1st and 2nd page of the table RIGHT.

(illustration: REINDEX RIGHT, with the pages of RIGHT colour-coded)

PS: If the key is a table-crossing key or any column contains strings, it makes sense to represent the key as a cryptographic hash.
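
A toy sketch of Task 1 over plain Python rows, including the hashing idea from the PS; this is illustrative, not tablite's implementation:

import hashlib

def hash_key(row, key_columns):
    # composite / string keys represented as a fixed-size cryptographic hash.
    raw = "\x1f".join(str(row[c]) for c in key_columns).encode("utf-8")
    return hashlib.sha1(raw).digest()

def sparse_index(left_slice, right_slice, keys, left_offset=0, right_offset=0):
    """Task 1: emit (left_row, right_row) index pairs for one LEFT slice x one RIGHT slice."""
    right_map = {}
    for i, row in enumerate(right_slice):
        right_map.setdefault(hash_key(row, keys), []).append(right_offset + i)
    pairs = []
    for j, row in enumerate(left_slice):
        for r in right_map.get(hash_key(row, keys), []):
            pairs.append((left_offset + j, r))
    return pairs  # Task 2 concatenates the outputs of all slice pairs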

Bloat in H5 storage following repeated SIGKILL

Important notes concerning multiprocessing, Startup and Exit.

For multiprocessing to be efficient, subprocesses must be able to create tables
in the same storage as main, as this avoids the need for moving the data.

Zero entropy = Zero copy.

As Tables refer to Columns, which refer to Pages, which in turn contain the
data, each process must keep track of how each column points to each page.

This is built on a WeakValueDictionary called the registry, where each
Page has a reference count.

When a table is created and columns are added, pages may be empty. However, if a
page is "copied", only the metadata (the pointer) is copied.

When the "copy" is deleted, only the metadata is deleted, as the original table
still has a reference to the data. For the reference system to remain valid,
the reference count in the registry must obviously be decremented.
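
A toy sketch of the registry idea (not tablite's code): columns hold strong references to shared Page objects and bump a reference count, while the registry itself only holds weak references:

import weakref

class Page:
    registry = weakref.WeakValueDictionary()   # page key -> live Page object

    def __init__(self, key):
        self.key = key
        self.refcount = 0
        Page.registry[key] = self

class Column:
    def __init__(self, pages):
        # "copying" a column copies only these pointers; each pointer bumps the count.
        self.pages = list(pages)
        for p in self.pages:
            p.refcount += 1

    def __del__(self):
        for p in self.pages:
            p.refcount -= 1                    # the data itself is untouched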

Multi processing

When a subprocess exits (SIGTERM), python performs garbage collection (GC).
This means that any variable pointing to a table created by the subprocess
will also be deleted. This triggers a call to table.__del__(self).

As the main process might be busy receiving data from other subprocesses, the
__del__ call can happen before main has taken ownership of the data.

To ensure that the output from the subprocess isn't lost, the subprocess marks
the table as table.saved=True well in advance of garbage collection. The logic
is simple:

def __del__(self):
    self._decrement_reference_count()  # ignored if the key is unknown.
    if not self.saved:
        self._delete_all_data()

This guarantees that the call to __del__ doesn't erase the data from storage,
and thereby permits __main__ to collect it after the subprocess has exited.

When main loads the data using the subprocess' table key, the reference count
in main will be correct as Table.load(path, key) increments the reference
count.

If main sets saved = False on the table that the subprocess created before
it exited, it will now be deleted.

Should there be an error in the implementation of the interaction with the sub-
process, so that main does not set saved = False after loading, then the
table will never be deleted.

To help with the clean up there are two options:

  1. Table.reset_storage(), which truncates the storage to 0 bytes, i.e.
    100% data loss. This is a pragmatic solution during testing, where pre-existing
    state must be avoided.

  2. Table.load_stored_tables(), which returns a list of all tables in the
    storage (discussed in edge case 3, below).
    A user can then loop through all tables, set
    saved=False and exit (SIGTERM), which will garbage collect all loaded tables,
    as Table.load_stored_tables() increments the reference counter from zero to
    one for all tables, changes saved and then drops them.

This is the preferred method whenever something unrelated to tablite caused
the python process to exit.

Whilst these options are convenient, there are a few edge cases that
should be considered first:

  1. Another terminal is opened which uses the same H5 file.
  2. A terminal is killed (SIGKILL).
  3. Two terminals use the same import.
Taking these cases in turn:

  1. Although there is no direct conflict with another terminal using the same store, calling
    Table.reset_storage() will have an inconvenient side effect, as it erases all
    data - including any created by the other process.
    The default recommendation is therefore to set saved=False on tables at exit,
    rather than resetting the storage.

  2. SIGKILL'ed processes will also leave orphaned datasets where saved == False.
    To clean this up, the new python process can start with:

>>> Table.load_stored_tables()

as this will create reference counts and drop them immediately. Saved tables
are unaffected. Should all saved tables be removed - except imports - use:

>>> for t in Table.load_stored_tables():
...     t.saved = False

  3. A special and delicate case exists if two terminals use the same import, as
    duplicate imports are avoided. In the example below, P1 and P2 are separate
    processes:

    P1 >>> t1 = Table.import_file(**config)

    P1 performed the work to import and set t1.saved=True.

    P2 >>> t2 = Table.import_file(**config)

    As **config has already been imported by P1, t2 is loaded in P2, not imported.

    P1 has finished its task and sets saved=False to exit correctly:

    P1 >>> t1.saved = False
    P1 >>> exit()

    Garbage collection of t1 will now invoke t1.__del__() and delete the data.

    P2 >>> t2.show() will return an empty table.

To handle this case, import and load are two separate operations, where
import_file first imports the dataset and then loads it.

This means that to get rid of imports as well, the following call must be used:

>>> for t in Table.load_stored_tables(imports=True):
...     t.saved = False

EOM.

feature request: scramble column

To maintain confidentiality when presenting data, a feature to scramble a column would be nice.
Scrambling changes the column in place and hides the keys in table.metadata.
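
A sketch of what scrambling could look like; the substitution scheme and the metadata key are illustrative:

import random

def scramble_column(table, name, seed=None):
    # replace each distinct value with an anonymous label, in place,
    # and keep the reverse mapping in table.metadata for the data owner.
    values = list(table[name])
    distinct = sorted(set(values), key=str)
    random.Random(seed).shuffle(distinct)
    mapping = {v: f"{name}_{i}" for i, v in enumerate(distinct)}
    table[name] = [mapping[v] for v in values]
    table.metadata[f"scramble:{name}"] = mapping   # the hidden key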

New add row api

Users have reported that add_row is clunky to work with. Hence the following API is suggested:

Example:

    t = Table()
    t.add_column('A', int)
    t.add_column('B', int)
    t.add_column('C', int)

The following examples are all valid and append the row (1,2,3) to the table.

    t.add_row(1,2,3)
    t.add_row([1,2,3])
    t.add_row((1,2,3))
    t.add_row(*(1,2,3))
    t.add_row(A=1, B=2, C=3)
    t.add_row(**{'A':1, 'B':2, 'C':3})

The following examples add two rows to the table

    t.add_row((1,2,3), (4,5,6))  # adds two rows.
    t.add_row([1,2,3], [4,5,6])
    t.add_row({'A':1, 'B':2, 'C':3}, {'A':4, 'B':5, 'C':6}) # two (or more) dicts as args.
    t.add_row([{'A':1, 'B':2, 'C':3}, {'A':1, 'B':2, 'C':3}])  # list of dicts.

Please vote & comment.

Slow import of files with text escape

Hello,

I was wondering, could there be an issue with file_reader when using text_qualifier? It shockingly reduces the performance of the import...

I've made a test script for this, where I show the difference when importing with pandas, numpy and tablite.

tablite:

from tablite import Table
from tablite.config import Config
import numpy as np
import time
from pandas import DataFrame
import pandas

# Create a wide and short table

wide_df = {k: np.random.random_sample(size = 3000) for k in range(0, 1000)}
df_to_save = DataFrame.from_dict(wide_df)
df_to_save.to_csv("./test.csv")
# Generated table size == 56473KB


start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, text_qualifier='"', delimiter=',')
end = time.time()
print(end - start)


# OUTPUT
importing: consolidating 'test.csv': 100.00%|██████████| [1:18:05<00:00]  
4685.573677778244

pandas

start = time.time()
new_df = pandas.read_csv("./test.csv", escapechar='"', delimiter=',')
end = time.time()
print(end - start)

# OUTPUT
0.45365166664123535

numpy:

Disclaimer: the comparison to numpy is not 100% analytically correct, since there is no text escape.

start = time.time()
arr = np.genfromtxt('./test.csv', delimiter=',', dtype=None)
end = time.time()
print(end - start)

# OUTPUT
5.538379430770874

At first I thought there could be overhead from multiprocessing being enabled, since there are a lot of columns and they are short, but that is not the case: using Config.MULTIPROCESSING_MODE = Config.FALSE made the performance even worse. I was impatient - after 31 minutes tablite had only imported 73.68%, and the progress bar had started at 70%.

importing: saving 'test1.csv' to disk: 73.68%|███████▎  | [31:30<3:37:32]

When used without the text_qualifier option, tablite imports the same file in 2 minutes 21 seconds (which is still quite a bit slower than pandas):

start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, delimiter=',')
end = time.time()
print(end - start)

# OUTPUT
importing: consolidating 'test.csv': 100.00%|██████████| [02:21<00:00]
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3004.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3005.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3006.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3007.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3008.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3009.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3010.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3011.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3012.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3013.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3014.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3015.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3016.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3017.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3018.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3019.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3020.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3021.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3022.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3023.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3024.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3025.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3026.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3027.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3028.npy
...
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4002.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4003.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4004.npy
141.9796497821808

HDF5 file size never decreases + concurrent interpreters can overwrite each others files.

Problems:

  1. A report made clear that the HDF5 file size does not decrease. HDF5 confirms this: the blocks on the file system remain reserved. Compressing the file using HDF5 tools results in a full copy of the remaining dataset. Due to the amount of wip-data that some tablite users handle, compressing isn't acceptable.
  2. Another report made clear that two independent python processes creating tables in /tmp/tablite can contradict each other. This isn't acceptable at all.

Proposed changes:

  • The pages will be stored using numpy's fileformat v.1 instead of HDF5.
  • The HDF5 file is replaced with a working directory and processes will have the PID included in their sub-working directory.
  • Tablite tables are considered volatile UNTIL saved explicitly by the user.

Branch:
The solution will be developed in the branch hdf5_ectomy.


Contents

  1. Summary (above)
  2. Numpy's fileformat
  3. Workdir = path
  4. Tables are volatile unless saved explicitly.
  5. Multiprocessing / multiple mains in python

Numpy's fileformat.

Numpy's NEP-1 presented a simple file format which roughly looks like this:

\x93NUMPY\x03\x00\x00
{"descr":"<i8", "fortran_order":False,"shape":(3,3)}\n
\x05\x00\x00\x00\x00\x00\x00\x00
\x04\x00\x00\x00\x00\x00\x00\x00
\x07\x00\x00\x00\x00\x00\x00\x00
\x09\x00\x00\x00\x00\x00\x00\x00
\x01\x00\x00\x00\x00\x00\x00\x00
\x08\x00\x00\x00\x00\x00\x00\x00
\x03\x00\x00\x00\x00\x00\x00\x00
\x06\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00

\x93NUMPY is the magic prefix.
\x03 is the major version of the numpy file format.
\x00 is the minor version of the numpy file format.
The next 4 bytes form a little-endian unsigned int: the length of the header, which is padded so that the data starts at a multiple of 64 bytes.
{"descr":"<i8","fortran_order":False,"shape":(3,3)}\n is the file header, with the expected keys sorted in alphabetical order and ending with a \n character.
The remainder is the raw array data.

Read more in the numpy file format documentation.
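
For reference, numpy writes and reads this format via np.save / np.load, and the preamble can be inspected directly:

import numpy as np

arr = np.arange(9, dtype="<i8").reshape(3, 3)
np.save("/tmp/page.npy", arr)                 # magic, version, padded header, then raw data

with open("/tmp/page.npy", "rb") as fh:
    print(fh.read(6))                         # b'\x93NUMPY' - the magic prefix
    print(fh.read(2))                         # major / minor version bytes

restored = np.load("/tmp/page.npy")
assert (restored == arr).all()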

workdir = path

In config.py the working directory is set; it defaults to:

from pathlib import Path
import os
import tempfile

from tablite import Table

temp = Path(tempfile.gettempdir()) / f"tablite/{os.getpid()}"
if not temp.exists():
    temp.mkdir(parents=True)

Table.workdir = temp

Within the workdir the data is organised as follows:

/tmp/tablite/PID-14253/
    pages/
        p1b1.npy  # size set by settings.page_size
        p1b2.npy
        p2b1.npy
        p2b2.npy
        ...
        pNbN.npy
    page_index/  # linked list referring pages back to tables/
        1.yml 
        2.yml
        ...
    tables/
        1.yml
        2.yml
        ...
        N.yml

The pages/ are all in the npy format, identifiable as p<page number>b<block number>. New pages are numbered from 1...int64max so that empty slots are tolerated.

The page_index/ contains the equivalent of a linked list of pages pointing to tables.

%YAML 1.2
filename: 1.yml
tables: [1,2,...]    

The tables/ are all in the yaml format, identifiable by a numeric name. The contents of a YAML file are illustrated below:

%YAML 1.2                      yaml version
filename: 1.yml                reverse file identifier.
temp: false                    temp identifier.
columns:                       start of columns section.
    1:                         start of column 1.
        name: "列 1"           column name.
        pages: [p1b1, p1b2]    list of pages in column 1.
        types: []              empty list of types, so column 1 is a C-level data format.
    2:                         start of column 2.
        name: "列 2"           column name.
        pages: [p2b1, p2b2]    list of pages in column 2.
        types: [p3b1, p3b2]    list of type codes, so column 2 is not a C-level data format.
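
Resolving a table under this layout is then a matter of reading its YAML and loading the referenced pages; a sketch assuming PyYAML and a cleaned-up version of the file above (no inline annotations):

from pathlib import Path

import numpy as np
import yaml  # third-party: pip install pyyaml

workdir = Path("/tmp/tablite/PID-14253")
meta = yaml.safe_load((workdir / "tables" / "1.yml").read_text())

for col_id, col in meta["columns"].items():
    pages = [np.load(workdir / "pages" / f"{p}.npy") for p in col["pages"]]
    print(col["name"], "->", [p.shape for p in pages])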

  • tables and columns are mutable pointers to pages.
  • pages are immutable holders of data.
  • data processing happens in functions using immutable npy pages [notes: 1, 2, 3].

For Tablite to know what is what, the user must register the function for the type code. In python this is done automatically using dtype or pickle.

Tablite tables are volatile UNTIL saved explicitly

Tablite data is considered volatile. This means that EVERY TIME a process shuts down, the workdir for the PID is erased (on SIGINT, as part of __exit__).
If the process is SIGKILL'ed, the data will remain idle until the PID is reused, at which point the new process will erase the contents of the PID-specific folder. There is no other way.

Due to Tablite's focus on larger-than-memory datasets, Tablite must write to disk. To ensure that we won't run out of disk space, the writing function will use a disk_limit=0.90 and complain if disk usage exceeds 90%.

During multiprocessing, Tablite will share the working directory with the sub-processes in a parent/child hierarchy, so that subprocesses won't scatter data across random folders.

save

With tmp/tablite/<PID>/... as the working directory, a simple starting point for PERSISTENT usage of data would look like this:

  1. Import data (this uses the working dir) with the usual t = Table.import_file("/my/docs/originals/somefile.csv")
  2. Save the data using table.save() to your preferred folder, such as /my/docs/wip/data.npz, which is a portable zipped folder of the table from the working directory. THIS FILE WILL NEVER BE ERASED BY TABLITE.
  3. Load your table using t = Table.load("/my/docs/wip/data.npz"), which drops the imported table in the working directory and uses the .npz instead - which is now tablite & numpy native.

Thereby, if you have a large test suite, you may avoid re-importing datasets completely. Any subsequent work that uses t will still happen in the working directory under the PID.

The reason is that t.save(...) creates an isolated, portable snapshot of the table's data, whilst setting Table.workdir = ... uses paginated linking, which is far more space efficient and hence faster.

Multiprocessing / multiple python processes.

When tasks are created by main, it must declare the destination folder in the task, so that when a new subprocess starts, it knows which process' folder to write to. This guarantees that there will be no conflict when two Python processes use tablite concurrently.

Inconsistent row slice

When slicing a table, the result is inconsistent depending on the number of columns.

If more than 1 column exists, tbl[slice(...)] will produce a new tablite Table from the original table for the given slice.
If the table has exactly 1 column, tbl[slice(...)] will produce a np.array from the only column for the given slice.
If the table has exactly 0 columns, tbl[slice(...)] will produce an empty table.

I think the API should not behave differently based on the table layout and should instead be consistent.

More disk support?

Tablite - more (?) or better disk support?

The benefit of tablite is that it doesn't care whether the data is in RAM or on disk; the user's workflow is the same. Right now SQLite is used for the disk operations.
That's ok, but limited to usage with tempdir.

Sometimes users want to use another directory, store the tables in a particular state and then continue their work at a later point in time. Whilst tempdir can be used for that, tablite has all the data from memory sitting in the same SQLite blob, so persisting individual tables means moving data, which equals time.

It is probably better to allow the user to:

  • Table.set_working_dir(pathlib.Path) which allows the user to have all data in here.

  • Set Table.metadata['localfile'] = "tbl_0.h5", which allows tablite to conclude that "working_dir/tbl_0.h5" is where the user wants the file. Even if the user doesn't want to manage the file name but just wants to be sure that the data doesn't cause a memory overflow, the user can do:

Table.new_tables_use_disk = True

Table.from_file(filename,...) and all files will be named in the metadata as:

Table.metadata['filename'] = filename1.csv
Table.metadata['localfile'] = working_dir/filename1.h5

If the filename already is in h5 format, the data is merely "imported" by reading the h5 file. This requires only that:

  1. The table's metadata is stored in the h5 dataset's .attrs (which is dict-like anyway)
  2. Each column is stored independently according to its datatype.
import h5py
f = h5py.File("filename.h5", "w")

column_a = f.create_dataset("column A", data=[100, 99, 98, 97], dtype='i')  # 4 records.
column_b = f.create_dataset("column B", data=["a", "b", "c", "d"], dtype=h5py.string_dtype())  # 4 records.

(and the metadata?)

I keep writing h5 because I like hdf5's SWMR feature for multiprocessing, in contrast to SQLite's blocking mode.

This brings me to some missing features:

  • Table.from_file(filename..., limit=100) should return only the head (first 100 rows) of the file.
  • Table.from_file(filename..., columns=['a','c','d']) could return only the columns a, c, d (not b, e, f, ...).
  • Table.from_file(filename..., columns=['a','c','d'], datatypes=[int,str,date], error=None) could return the columns a, c, d with the corresponding datatypes and use None for exceptions.
  • Table.to_hdf5(filename=None, ...) is an alias for to_hdf5 in tempfile mode.
  • Table.from_file(....) should print the file location if use_disk==True.

Consequences?

  1. This will require that the StoredColumn inherits the Table's localfile for data access, but otherwise there's no problem.

  2. The file_reader function find_format can be run in parallel, as each worker can read the h5 file with focus on a single column of values and return the appropriate data format in isolation.

  3. On exit, an atexit hook is required to flush() and close() so that the h5 file is closed properly.

  4. When opening a file that already exists, the loading time should be limited to the time it takes to read the metadata of the h5 file, as everything else is already in place.

Change default tz info to utc

Currently tablite uses the system timezone when dealing with dates. This should be changed to UTC unless specified otherwise. It doesn't matter while everything runs on the local machine, but there can be a time when that is not true. Additionally, there are two tests that will always fail because they only work in the UTC timezone.
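
For reference, the difference in plain Python:

from datetime import datetime, timezone

naive_local = datetime.now()                  # depends on the machine's timezone
aware_utc = datetime.now(tz=timezone.utc)     # reproducible regardless of locale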

Make it possible to work with multiple HDF repos

Use case:

  1. User U is a member of a team doing a data audit of client C, who stores data in multiple tables.
  2. The only common denominator is a column "site", which exists in all tables.
  3. The team has a list of all sites and a team drive T, which holds the team's cleaned data.
  4. User U needs to read through C's data to find tables that nobody else in the team has worked on. U shares this in an online spreadsheet.
  5. Once U has cleaned the data, s/he uploads it to T.

Drives:

/tmp/tablite.hdf5   # local working memory on SSD.
/C/datasources.hdf5  # source data from the client (read only access)
/T/clean_data.hdf5  # teams cleaned data.

Workflow:

prerequisite: User selects site 'Corrusant'

site_name= 'Corrusant'

all_tables = Table.reload_stored_tables('/C/datasources.hdf5')  # rapid remote header level read-only access.

# sift through all tables and create tables for site on localhost using the default 'H5_STORAGE':
wip = [ t.any(**{'site': lambda x: x == site_name}) for t in all_tables ]

# undisclosed process used to clean up the data.

# create/append to a super table with all data from all sites. Note that this table may already exist!
super_table = Table(H5_location="/T/clean_data.hdf5")  
for table in wip:
    super_table.stack(table)  # append to the team wide super table.

At this point clean data exists in a super table for the whole team to work on.

Should it be possible for the user to define their own None/Null/Empty?

In numerical calculations users may want to use 0 or 0.0 instead of None. In strings users may want a different value, e.g. "" as None.
What are the pros/cons of implementing a custom None value?

Users can/could just update the table instead:

sometable['column_x'] = ["" if x is None else x for x in sometable['column_x']]   # *strings*
sometable['column_x'] = [0.0 if x is None else x for x in sometable['column_x']]  # *floats*.
sometable['column_x'] = [0 if x is None else x for x in sometable['column_x']]    # *integers*.

...?

Statistics discrepancies in median/mode

There are some discrepancies in statistical functions, namely mode and median.

Mode: expected value '2' (screenshot of the observed value omitted).

Median: expected value datetime(2020, 1, 1, 0, 0) (screenshot of the observed value omitted).

To create the table in question:

from datetime import datetime
from tablite import Table

columns = ['SKU', 'length', 'qty', 'fragile', 'date']
data = [
    ("1", 0.3, 1, False, datetime(2020, 1, 1)),
    ("5", 1.0, 2, None, datetime(2020, 1, 1)),
    ("2", 0.2, 1, True, datetime(2020, 1, 3)),
    ("2", None, 3, True, datetime(2020, 1, 2)),
    ("4", 0.3, 1, False, None),
]

table = Table.from_dict({
    key: val
    for key, val in zip(columns, [list(x) for x in zip(*data)])
})

if columns are of unequal length, `.show()` will give meaningless errors.

Example 1:

t = Table()
t['A'] = [1,2,3]
t['B'] = []  # empty
t.show()

output:

C:\Data\venv\test310\lib\site-packages\tablite\memory_manager.py:262: in get_data
    return np.concatenate(arrays, dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = ([],), kwargs = {'dtype': dtype('O')}, relevant_args = []

>   ???
E   ValueError: need at least one array to concatenate

Example 2:

t = Table()
t['A'] = [1,2,3]
t['B'] = [10]  # short
t.show()
        s.append("+" + "+".join(["-" * widths[n] for n in names]) + "+")
>       for ix, row in enumerate(self.rows):
E       RuntimeError: generator raised StopIteration

Proposal:

t = Table()
t['A'] = [1,2,3]
t['B'] = [10]  # --> becomes [10, None, None]
t.show()  # + a warning("columns of unequal length!")

t = Table()
t['A'] = [1,2,3]
t['B'] = []  # --> becomes [None, None, None]
t.show()  # + a warning("columns of unequal length!")
