Comments (5)
UPDATE: a recent revision of the code suggests that the Weight (corresponding to the qbeast_hash of the indexed columns) should be computed in another way to preserve the randomness of the algorithm:
- Either the qbeast_hash is computed over all the columns, or
- we use the Random.nextInt() function.

In both cases, the reading process faces the issue of reproducing the value correctly. For that, there's a theoretical solution:
- In the first case, we can redo the hash as before. But this adds an extra performance penalty when the query does not need all the columns of a table.
- In the second case, there's a risk when distributing the indexing, because we have to make sure that Random is spread correctly across partitions and processes.
- In both cases, we can overcome the read problem with the following proposal: since we know the max and min values of a file from the metadata written in the DeltaLog, we should be able to re-generate a random value in [min, max] and then filter by it. This removes the overhead problem directly, but we need to figure out how to structure the code, which processes are involved, and whether it works as expected. (Also, it could be possible that two equal samples output different results...)
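The read-side proposal above could look roughly like this; a minimal sketch, assuming a hypothetical per-file metadata record with min/max weights read from the DeltaLog (the names FileWeights, keepForSample, and the threshold logic are illustrative, not the actual qbeast-spark API):

```scala
import scala.util.Random

// Hypothetical per-file metadata as read from the DeltaLog
// (names are illustrative, not the actual qbeast-spark schema).
case class FileWeights(path: String, minWeight: Int, maxWeight: Int)

// Re-generate a random weight in [minWeight, maxWeight] for a file and keep
// the file only if that weight falls inside the sampled fraction of its range.
def keepForSample(f: FileWeights, fraction: Double, rnd: Random): Boolean = {
  val range = f.maxWeight - f.minWeight + 1
  val weight = f.minWeight + rnd.nextInt(range) // uniform in [min, max]
  val threshold = f.minWeight + (range * fraction).toLong
  weight <= threshold
}
```

Note that, exactly as the comment warns, two executions of the same sample can select different files, because the weight is regenerated per read rather than stored.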
from qbeast-spark.
We can just use Random.nextInt() to implement the qbeast_hash function and ignore the indexed columns (only for writing). The complex part comes when we read the data. Two options could be to either write a random weight per cube (user-invasive) or create the random weight in memory and use it to filter the data per cube.
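For illustration, the write side could be sketched like this, assuming we attach a uniformly random weight column to the DataFrame at write time (the column name "qbeast_weight" and the helper are hypothetical, not the actual implementation):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.rand

// Sketch: replace the deterministic qbeast_hash with a random weight at
// write time. rand() yields a Double in [0, 1); scale it to the Int range
// so it stays comparable with the existing Weight representation.
def withRandomWeight(df: DataFrame): DataFrame =
  df.withColumn(
    "qbeast_weight", // illustrative column name
    (rand() * (Int.MaxValue.toDouble - Int.MinValue.toDouble)
      + Int.MinValue.toDouble).cast("int")
  )
```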
Update with a brief description of how I am trying to implement this:
- Write: We can change the qbeast_hash function and write a Random.nextInt() value as the weight column.
- Read: Either change the Spark code (not sure we want to do this for this project) or try to add a new Spark extension that allows us to make the necessary changes at the Physical Plan stage.
From what I understand, these changes will modify the way we push down the filters when a sample/filter operator is performed. Right now, we add the SampleRule, which allows us to transform the sample operator into a filter that can be pushed down to the data source. What we want to do now is apply these changes a bit deeper in the tree: instead, we can try to do it during the execution of the Physical Plan. The idea is to read the data per file (cube) and filter only the data that we need; because we know the min/max Weight per cube, we should be able to implement this.
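For context, the existing sample-to-filter rewrite has roughly this shape; this is a simplified sketch, not the actual SampleRule source, and the weight-predicate construction is deliberately left abstract:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Sample}
import org.apache.spark.sql.catalyst.rules.Rule

// Simplified sketch of a rule that turns df.sample(fraction) into a
// pushable predicate on the weight column (the real SampleRule is more
// involved and revision-aware).
class SampleRuleSketch(spark: SparkSession) extends Rule[LogicalPlan] {
  // Hypothetical helper: builds something like "qbeast_weight <= t(fraction)".
  def weightPredicate(fraction: Double): Expression = ???

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    // Sample(lowerBound, upperBound, withReplacement, seed, child)
    case Sample(_, upperBound, false, _, child) =>
      Filter(weightPredicate(upperBound), child)
  }
}
```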
- injectPlannerStrategy could be one entry point.
- DataSourceScanExec: This file contains important classes and functions that we want to modify. It contains the class FileSourceScanExec ("Physical plan node for scanning data from HadoopFsRelations.") and the functions createBucketedReadRDD and createNonBucketedReadRDD, which call the method getPartitionedFile for each file in the HadoopFsRelation given as input.
- PartitionedFile: Takes offset and length parameters to read a chunk from a file. The idea is to use it with logic like this:
def getPartitionedFileFiltered(
    file: FileStatus,
    filePath: Path,
    partitionValues: InternalRow,
    percentage: Double,
    min: Long,
    max: Long): PartitionedFile = {
  // Re-generate a random value in [min, max] only when we actually sample;
  // otherwise keep 0 so the whole file is read.
  val offset = if (percentage < 1.0) {
    min + Random.nextInt((max - min).toInt + 1)
  } else {
    0L
  }
  // Read file.getLen - offset bytes and find the hosts that hold that range.
  val hosts = getBlockHosts(getBlockLocations(file), 0, file.getLen - offset)
  PartitionedFile(partitionValues, filePath.toUri.toString, 0, file.getLen - offset, hosts)
}
Where we directly read only the data that we need, without filtering. If this does not work, we can filter the data after reading it in createBucketedReadRDD.
- Challenges: The main challenge that I see now is how to bind the LogicalPlan that we create with the SparkStrategy and connect it to the file with the implemented changes.
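For the binding challenge, one possible shape is to register a custom Strategy through the injectPlannerStrategy entry point mentioned above and match the logical node there. This is a sketch under stated assumptions, not the actual qbeast-spark implementation; QbeastSampleStrategy, the commented-out QbeastSample node, and QbeastSparkExtension are hypothetical names:

```scala
import org.apache.spark.sql.{SparkSessionExtensions, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: when the logical plan contains our sampling node,
// emit a physical scan that applies the per-file offset logic above.
object QbeastSampleStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case QbeastSample(fraction, child) =>
    //   QbeastSampleScanExec(fraction, planLater(child)) :: Nil
    case _ => Nil // fall through to Spark's built-in strategies
  }
}

// Register it through the extensions entry point.
class QbeastSparkExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectPlannerStrategy(_ => QbeastSampleStrategy)
}
```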
Issue on hold due to current developments on #175
Due to the complexity of adding these changes to the code, we will postpone the solution of this issue.
Keep in mind that the only case where performance is affected is when reading a SAMPLE of files.