
Comments (5)

osopardo1 commented on August 16, 2024

UPDATE: a recent revision of the code suggests that the Weight (corresponding to the qbeast_hash of the indexed columns) should be computed in another way to preserve the randomness of the algorithm:

  1. Either the qbeast_hash is computed over all the columns,
  2. or we use the Random.nextInt() function.

In both cases, the reading process faces the issue of reproducing the value correctly. For that, there's a theoretical solution:

  1. In the first case, we can recompute the hash as before, but this adds an extra performance penalty when the query does not need all the columns of the table.
  2. In the second case, there's a risk when distributing the indexing, because we have to make sure that Random is spread correctly across partitions and processes.
  3. In both the first and the second case, we can overcome the read with the following proposal: since we know the max and min values of a file from the metadata written in the DeltaLog, we should be able to re-generate a Random value between [min, max] and then filter by that (see the sketch after this list). This gets rid of the overhead problem, but we need to figure out how to structure the code, which processes are involved, and whether it works as expected. (Also, it is possible that two equal samples output different results...)
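
As a rough illustration of point 3, here is a minimal read-side sketch, assuming the per-file min and max weights are available from the DeltaLog metadata (keepRow and its parameters are illustrative names, not qbeast-spark API):

    import scala.util.Random

    // For a sample(fraction), map the fraction onto the file's weight range and
    // draw a fresh random weight to decide whether a row is kept.
    def keepRow(minWeight: Long, maxWeight: Long, fraction: Double): Boolean = {
      val range = maxWeight - minWeight
      // Re-generate a uniformly random weight in [minWeight, maxWeight].
      val regenerated = minWeight + (Random.nextDouble() * range).toLong
      // Keep the row if it falls below the cut-off implied by the fraction.
      regenerated <= minWeight + (range * fraction).toLong
    }

Because the weight is re-drawn on every read, two identical samples can indeed return different rows, which is the caveat mentioned above.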


Adricu8 commented on August 16, 2024

We can just use Random.nextInt() to implement the qbeast_hash function and ignore the indexed columns (only for writing). The complex part comes when we read the data. Two options could be to either write a random weight per cube (user-invasive) or create the random weight in memory and use it to filter the data per cube.
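
A minimal write-side sketch of that idea, assuming a DataFrame to be indexed (the column name qbeast_weight is illustrative, not the actual qbeast-spark schema):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf
    import scala.util.Random

    // Attach a purely random weight column instead of hashing the indexed columns.
    def withRandomWeight(df: DataFrame): DataFrame = {
      // Mark the UDF as non-deterministic so Spark evaluates it once per row
      // and does not try to fold or re-order it.
      val randomWeight = udf(() => Random.nextInt()).asNondeterministic()
      df.withColumn("qbeast_weight", randomWeight())
    }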


Adricu8 commented on August 16, 2024

Update with a brief description of how I am trying to implement this:

  1. Write: we can change the qbeast_hash function and write a Random.nextInt() value as the weight column.

  2. Read: either change Spark code (not sure we want to do this for this project) or add a new Spark extension that allows us to make the necessary changes at the Physical Plan stage.
    From what I understand, these changes will modify the way we push down the filters when a sample/filter operator is performed. Right now, we add the SampleRule, which transforms the sample operator into a filter that can be pushed down to the data source. What we want to do now is apply these changes a bit deeper in the tree, during the execution of the Physical Plan. The idea is to read the data per file (cube) and filter only the data that we need; because we know the min/max Weight per cube, we should be able to implement this.

  • injectPlannerStrategy could be one entry point (see the sketch at the end of this comment).

  • DataSourceScanExec: this file contains important classes and functions that we want to modify. It contains the class FileSourceScanExec ("Physical plan node for scanning data from HadoopFsRelations.") and the functions createBucketedReadRDD and createNonBucketedReadRDD, which call the method getPartitionedFile for each file in the HadoopFsRelation given as input.

  • PartitionedFile: takes the offset and length parameters to read a chunk from a file. The idea is to use it with logic like this:

    import org.apache.hadoop.fs.{FileStatus, Path}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.datasources.PartitionedFile
    import scala.util.Random

    def getPartitionedFileFiltered(
        file: FileStatus,
        filePath: Path,
        partitionValues: InternalRow,
        percentage: Double,
        min: Long,
        max: Long): PartitionedFile = {

      // When sampling, draw a random offset from the file's [min, max] weight
      // range; otherwise read the whole file. Note that (max - min) must fit in
      // an Int for Random.nextInt to accept it.
      val offset = if (percentage < 1.0) {
        min + Random.nextInt((max - min).toInt + 1)
      } else {
        0L
      }
      // Read only the first (file length - offset) bytes, starting at byte 0.
      // getBlockHosts and getBlockLocations are the existing helpers used by
      // Spark's file-scan code.
      val hosts = getBlockHosts(getBlockLocations(file), 0, file.getLen - offset)
      PartitionedFile(partitionValues, filePath.toUri.toString, 0, file.getLen - offset, hosts)
    }

Here we directly read only the data that we need, without filtering. If this does not work, we can filter the data after reading it in createBucketedReadRDD.

  • Challenges: the main challenge I see now is how to bind the LogicalPlan that we create with the SparkStrategy and connect it to the file with the implemented changes. A sketch of how such a strategy could be wired in is shown below.
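
For reference, a hypothetical skeleton of the injectPlannerStrategy entry point mentioned above; SampleScanStrategy and QbeastSampleExtension are illustrative names, and the match on Sample is only a placeholder, not how qbeast-spark necessarily implements it:

    import org.apache.spark.sql.{SparkSession, SparkSessionExtensions, Strategy}
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sample}
    import org.apache.spark.sql.execution.SparkPlan

    case class SampleScanStrategy(spark: SparkSession) extends Strategy {
      override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case _: Sample =>
          // Here we would plan a file scan that reads only the byte range
          // (or filters the rows) implied by the per-cube min/max weights.
          Nil // fall back to Spark's default planning in this sketch
        case _ => Nil
      }
    }

    class QbeastSampleExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectPlannerStrategy(SampleScanStrategy)
      }
    }

Registering the extension through spark.sql.extensions would then give the strategy a chance to plan the sample before Spark's default file-source strategies run.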


osopardo1 commented on August 16, 2024

Issue on hold due to current developments on #175


osopardo1 commented on August 16, 2024

Due to the complexity of adding these changes to the code, we will postpone the solution of this issue.

Keep in mind that the only case where performance is affected is when reading a SAMPLE of the files.

