Comments (4)
- Q1: Does Avg depth denote the average depth of [all nodes (model nodes and data nodes)] or simply [data nodes]?
It denotes the average depth over all keys (so equivalently, the average depth of all data nodes, weighted by how many keys fall in that data node).
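As an illustration, here is a minimal sketch of that weighted average. The depths and per-node key counts below are made up for the example, not taken from a real ALEX tree:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average depth over all keys: each data node contributes its depth
// weighted by the number of keys it stores. Inputs are parallel
// vectors: depth of data node i, and number of keys in data node i.
double avg_depth_over_keys(const std::vector<int>& node_depths,
                           const std::vector<std::size_t>& node_key_counts) {
  double weighted_sum = 0.0;
  std::size_t total_keys = 0;
  for (std::size_t i = 0; i < node_depths.size(); ++i) {
    weighted_sum += static_cast<double>(node_depths[i]) *
                    static_cast<double>(node_key_counts[i]);
    total_keys += node_key_counts[i];
  }
  return weighted_sum / static_cast<double>(total_keys);
}
```

For example, one data node at depth 1 holding 30 keys and one at depth 2 holding 10 keys give (1*30 + 2*10) / 40 = 1.25, even though the unweighted average node depth is 1.5.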
- Q2: I assume Table 2 means bulk loading only 100 million keys for 4 datasets, right?
Yes.
- Q3: I tried to replicate the experiments in Table 2 but got different statistics. Could you help me check if there is something I missed? (Or maybe later commits have slightly changed some features?)
There are three reasons you're seeing different numbers. First, the cost model weights we used in the paper (see the last paragraph of page 18 in our arXiv report) are actually slightly different from the default weights in this open-source implementation. Second, expected_insert_frac was likely set to 0 to produce those numbers. Third, we've made quite a few changes since we initially submitted our paper, so the bulk loading behavior is different. It is probably not possible to exactly reproduce the paper's results using this open-source implementation.
from alex.
Thank you so much for your detailed reply.
I have tried changing expected_insert_frac; setting it to 0 instead of 1 helps.
For the effects of the cost model weights, I changed them as follows:
// Intra-node cost weights
double kExpSearchIterationsWeight = 10;  // changed from 20
double kShiftsWeight = 1;                // changed from 0.5

// TraverseToLeaf cost weights
double kNodeLookupsWeight = 10;   // changed from 20
double kModelSizeWeight = 1e-6;   // changed from 5e-7
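For reference, a rough sketch of how these weights enter ALEX's two cost terms as described in the paper (the exact formulas in the open-source code may differ; the functions here are illustrative only):

```cpp
#include <cassert>

// Illustrative weight values mirroring the modified ones above.
const double kExpSearchIterationsWeight = 10;
const double kShiftsWeight = 1;
const double kNodeLookupsWeight = 10;
const double kModelSizeWeight = 1e-6;

// Intra-node cost: expected exponential-search iterations traded off
// against expected shifts on insert.
double intra_node_cost(double exp_search_iters, double exp_shifts) {
  return kExpSearchIterationsWeight * exp_search_iters +
         kShiftsWeight * exp_shifts;
}

// TraverseToLeaf cost: number of node lookups (tree depth) traded off
// against total model size in bytes.
double traverse_to_leaf_cost(double node_lookups, double model_size_bytes) {
  return kNodeLookupsWeight * node_lookups +
         kModelSizeWeight * model_size_bytes;
}
```

Halving kExpSearchIterationsWeight and kNodeLookupsWeight while doubling kShiftsWeight and kModelSizeWeight, as above, shifts the optimizer's preference toward cheaper searches at the expense of more shifts and larger models.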
The results are as follows:
The statistics for longitudes, longlat, and lognormal are quite close to those in the paper; YCSB shows the largest difference. So, as with the third reason you mentioned, it is probably not possible to exactly reproduce the paper's results using this open-source implementation.
Could I ask another small question about the retraining strategy?
I found that there is a hyper-parameter (a threshold) in the resize method of data nodes. This threshold is set to 50, and exceeding it triggers retraining of the model. Do I need to change this threshold for different datasets, or is it the best choice found in empirical experiments?
Thank you so much for your patience. I am quite interested in ALEX~
Yes, that's a threshold value that we found works well on all datasets in our empirical experiments, and changing it shouldn't have a big impact on performance. But of course you're free to try modifying it if you want to improve performance a bit further on a particular dataset.
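For anyone following along, here is a hypothetical sketch of the kind of threshold check being discussed. The names and the exact condition in the real alex_nodes.h differ; this only illustrates the pattern of an empirically tuned retrain trigger:

```cpp
#include <cassert>

// Empirically chosen threshold, matching the value 50 discussed above.
// Hypothetical names: the real implementation embeds this logic inside
// the data node's resize method.
constexpr int kRetrainThreshold = 50;

// Below the threshold, keep (and rescale) the existing linear model;
// above it, retrain the model from the node's keys.
bool should_retrain_model(int counter) {
  return counter > kRetrainThreshold;
}
```

The design rationale stated above is that one fixed value worked well across all tested datasets, so per-dataset tuning is optional rather than required.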
Thanks so much. Your reply helps a lot.