Comments (17)
Do we need to sort the data of a partition by MapId before flushing, for all jobs? I think not. This would add unnecessary cost for stages that are not optimized by AQE. Maybe we could sort the partition data by MapId the first time AQE's specified ShufflePartitionSpec is applied.
from incubator-uniffle.
You are right.
Have you implemented this in your internal version? If not, I'm interested in working on it. @jerqi
No. You can go ahead.
I have written up a proposed design for this issue: https://docs.google.com/document/d/1G0cOFVJbYLf2oX1fiadh7zi2M6DlEcjTQTh4kSkb0LA/edit?usp=sharing
PTAL @jerqi
It's better to sort by MapId before the data is flushed. It won't add much cost for non-AQE-optimized stages.
Does the data need to be sorted by mapId?
Yes, but we only need local order. With local order, we can filter out much of the data effectively.
Hmm... I remember you preferred sorting only the index file rather than the data file, as mentioned in the offline meeting. Did I misunderstand you?
To give an example:
We have three buffers to flush: taskId 1 block, taskId 2 block, taskId 3 block. We sort them into taskId 1 block, taskId 2 block, taskId 3 block, and then flush them to disk. Then we receive taskId 2 block, taskId 6 block, taskId 1 block; we sort them and flush them as well, so the data on disk is now
taskId 1 block, taskId 2 block, taskId 3 block, taskId 1 block, taskId 2 block, taskId 6 block.
The data has only local order.
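A minimal sketch of the flush behavior in this example, written in Python for brevity and not taken from the Uniffle code base. The names `flush_batch` and `disk` are hypothetical, and each block is reduced to its taskId.

```python
def flush_batch(disk, task_ids):
    """Sort one batch of buffered blocks by taskId and append it to disk.

    Only the batch being flushed is sorted. Blocks already on disk are
    never reordered, so the file has local (per-flush) order, not global order.
    """
    disk.extend(sorted(task_ids))
    return disk


disk = []                      # hypothetical stand-in for the partition's data file
flush_batch(disk, [1, 2, 3])   # first flush from the example
flush_batch(disk, [2, 6, 1])   # second flush arrives unsorted
print(disk)                    # [1, 2, 3, 1, 2, 6]
```

Each flush costs one sort over the buffered blocks only, which is why the extra cost for non-AQE stages is expected to stay small.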
taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block, taskId-2 block, taskId-6 block.
If a reader wants the data from taskId=1, it still has to read the data segment covering taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block. The data of taskId-2 block and taskId-3 block is unnecessary for this reader. Right?
Yes.
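The read pattern confirmed here can be sketched as follows. This is an illustration only, assuming the reader fetches one contiguous segment from the first to the last block of the wanted taskId; `segment_to_read` is a hypothetical helper, not a Uniffle API.

```python
def segment_to_read(layout, wanted_task_id):
    """Return (segment, wasted) for a single contiguous read.

    segment: every block between the first and last block of the wanted taskId.
    wasted:  blocks inside that segment that belong to other taskIds.
    """
    positions = [i for i, t in enumerate(layout) if t == wanted_task_id]
    if not positions:
        return [], []
    segment = layout[positions[0]:positions[-1] + 1]
    wasted = [t for t in segment if t != wanted_task_id]
    return segment, wasted


layout = [1, 2, 3, 1, 2, 6]    # on-disk order after the two flushes
segment, wasted = segment_to_read(layout, 1)
print(segment)                 # [1, 2, 3, 1]
print(wasted)                  # [2, 3]
```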
This looks ineffective, and it's the same as the original block filter.
Actually, considering random IO, reading 3 records costs the same time as reading 2 records.
Yes. According to the problems described in the motivation section of the design proposal, the key point is that a lot of data is read multiple times, and how many times depends on the number of splits chosen by AQE. From this point of view, we should sort the data file.
We don't need global order; local order should be enough.