Comments (5)
I'm not exactly sure what you would use re-indexing for if the data isn't loaded, unless you mean that you already have it re-indexed based on some other criteria and just want to blindly select the values from the array.
If that's the case, then no, it cannot be improved as it stands, because the pickle format doesn't allow random access. However, the Nim implementation should definitely be faster. I implemented the unpickler fully in Nim, whereas Python's unpickler is written in native Python and is not a C binding, which makes it slow. I haven't benchmarked it against the Python implementation, but I'm pretty sure it is faster, and there are places where it could be made faster still.
from tablite.
...re-indexing for if it's not loaded...
When I have an index and need to re-arrange fields in the right table during a join, the actual value that is being re-ordered doesn't matter. It only matters that the values are put in the right order.
For example, in a join where the right-side index is [4,2,3,1], all fields on all pages have to be re-ordered to match the order dictated by the index. So it doesn't matter whether a value is a struct, an int, or anything else, as long as the output page contains the bytes in the correct order.
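To make the idea concrete, here is a minimal sketch in plain Python. It is not tablite's join code: `reorder_page` is a hypothetical helper, the index is assumed to hold 0-based positions, and a page is modelled as a simple list. It only shows that the same index re-orders any page, whatever the datatype of its values.

```python
def reorder_page(values, index):
    """Return the values of one page rearranged into the order given by index.

    Hypothetical helper for illustration; index holds 0-based positions.
    """
    return [values[i] for i in index]


# Right-side index from the example above, read as 0-based positions.
index = [4, 2, 3, 1]

# Two "pages" of the same table: one with ints, one with mixed datatypes.
ints = [10, 11, 12, 13, 14]
mixed = ["a", None, (1, 2), 3.5, {"k": "v"}]

# The same index is applied to every page; the values are never inspected.
reordered_ints = reorder_page(ints, index)
reordered_mixed = reorder_page(mixed, index)

print(reordered_ints)   # [14, 12, 13, 11]
print(reordered_mixed)  # [{'k': 'v'}, (1, 2), 3.5, None]
```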
from tablite.
Then no, that is not possible with the pickle format. Re-ordering without loading would require random access, and because the bytes are not aligned you are forced to read the entire page. Since the entire file has to be read anyway, the only saving grace would be speeding up the reading process.
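A rough sketch of the difference, using only the Python standard library. This is not tablite's page format: the fixed-width layout (8-byte little-endian integers) and the helper `read_fixed` are assumptions made for illustration. With fixed-width records the byte offset of row `i` can be computed, so rows can be picked one at a time; with a pickled list there is no such offset, so the whole object is deserialized before anything can be selected.

```python
import io
import pickle
import struct

values = [10, 11, 12, 13, 14]
index = [4, 2, 3, 1]  # 0-based positions to pick, in output order

# Fixed-width layout (assumed for illustration): each value is an
# 8-byte little-endian signed integer, so row i starts at byte i * 8.
RECORD_SIZE = 8
fixed = io.BytesIO(b"".join(struct.pack("<q", v) for v in values))


def read_fixed(buf, i):
    """Seek straight to row i and decode only that record."""
    buf.seek(i * RECORD_SIZE)
    return struct.unpack("<q", buf.read(RECORD_SIZE))[0]


picked_fixed = [read_fixed(fixed, i) for i in index]

# Pickled layout: the offset of row i cannot be computed up front,
# so the whole page is loaded before the index is applied.
blob = pickle.dumps(values)
loaded = pickle.loads(blob)
picked_pickle = [loaded[i] for i in index]

print(picked_fixed)   # [14, 12, 13, 11]
print(picked_pickle)  # [14, 12, 13, 11]
```

Both paths give the same result; the difference is that the pickled page has to be read in full even when only a few rows are needed.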
from tablite.
Ok. So the advice is "keep datatypes simple if you want speed."
from tablite.