Comments (6)
I do think that example would be nice, it's basically what I was trying to build 😄
My approach was going to be something like:
async fn scan(
    &self,
    state: &SessionState,
    projection: Option<&Vec<usize>>,
    filters: &[Expr],
    limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>> {
    let object_store_url = ObjectStoreUrl::parse("file://")?;
    let mut file_scan_config = FileScanConfig::new(object_store_url, self.schema())
        .with_projection(projection.cloned())
        .with_limit(limit);

    // Use the index to get row groups to be scanned
    // Index does best effort to parse filters and push them down into the metadata store
    let partitioned_files_with_row_group_selection = self.index.get_files(filters).await?;
    for file in partitioned_files_with_row_group_selection {
        file_scan_config = file_scan_config.with_file(PartitionedFile::new(
            file.canonical_path.display().to_string(),
            file.file_size,
        ).with_extensions(Arc::new(file.access_plan())));
    }

    let df_schema = DFSchema::try_from(self.schema())?;
    // convert filters like [`a = 1`, `b = 2`] to a single filter like `a = 1 AND b = 2`
    let predicate = conjunction(filters.to_vec());
    let predicate = predicate
        .map(|predicate| state.create_physical_expr(predicate, &df_schema))
        .transpose()?
        .unwrap_or_else(|| datafusion_physical_expr::expressions::lit(true));

    let exec = ParquetExec::builder(file_scan_config)
        .with_predicate(predicate)
        .build_arc();
    Ok(exec)
}
(several functions and types made up)
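For anyone following along, the filter-combining step in the snippet above can be sketched with the standard library alone. This is only an illustration: the `Expr` enum and `predicate_or_true` helper below are hypothetical stand-ins, not DataFusion's types, and the local `conjunction` only mirrors the behaviour described in the comment (fold filters with AND, fall back to a literal `true` when there are none).

```rust
/// Hypothetical, simplified stand-in for a logical expression (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// `column = value`
    Eq(String, i64),
    /// `left AND right`
    And(Box<Expr>, Box<Expr>),
    /// A boolean literal
    Literal(bool),
}

/// Fold a list of filters into one left-nested AND chain.
/// Returns `None` when the list is empty.
fn conjunction(filters: Vec<Expr>) -> Option<Expr> {
    filters
        .into_iter()
        .reduce(|acc, next| Expr::And(Box::new(acc), Box::new(next)))
}

/// An empty filter list means "keep every row", i.e. the literal `true`.
fn predicate_or_true(filters: Vec<Expr>) -> Expr {
    conjunction(filters).unwrap_or(Expr::Literal(true))
}

fn main() {
    let filters = vec![Expr::Eq("a".to_string(), 1), Expr::Eq("b".to_string(), 2)];
    println!("{:?}", predicate_or_true(filters));
    println!("{:?}", predicate_or_true(vec![]));
}
```

The fallback matters because the scan still needs some predicate to hand to the reader even when no filters were pushed down.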
Does this sound about in line with what you would think of as an example? I think implementing the async store as a familiar RDBMS (SQLite via SQLx?) would be a good example.
from arrow-datafusion.
Update: I have a basic example (#10549) ready for review / merge.
from arrow-datafusion.
Sorry for jumping in here, maybe this isn't the best issue but it's hard to keep up with all of the amazing work you're doing @alamb!
I wanted to pitch a use case I've been thinking about of storing a secondary index on a searchable async location. Think a relational database with ACID guarantees. In particular the key would be that hooks to do selections / pruning be async and that they pass in filters: I'd push down the filters into filters in the metadata store and run an actual query there that returns the files / row groups to scan. This is in contrast to #10549 for example where the index is in memory and fully materialized. I realize that `TableProvider.scan` already serves this purpose, but it'd be nice to integrate into these new APIs instead of having to implement more things oneself because you're hooking in at a higher (lower?) level.
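To make the shape of that hook concrete, here is a minimal std-only sketch of an async lookup that takes filters and returns files with row group selections. Everything in it is an assumption made for illustration: `EqFilter`, `FileSelection`, `RowGroupStats`, `MetadataIndex` and the tiny `block_on` helper are hypothetical, and the "store" is an in-memory `Vec` standing in for a real database query.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical filter the index understands: `column = value`.
#[derive(Debug, Clone, PartialEq)]
struct EqFilter { column: String, value: i64 }

/// One file to scan plus the row groups selected within it.
#[derive(Debug, Clone, PartialEq)]
struct FileSelection { path: String, row_groups: Vec<usize> }

/// One metadata row: min/max of one column in one row group of one file.
struct RowGroupStats { path: String, row_group: usize, column: String, min: i64, max: i64 }

/// Stand-in for a remote metadata store.
struct MetadataIndex { stats: Vec<RowGroupStats> }

impl MetadataIndex {
    /// Async so a real implementation could await a database round trip here.
    async fn get_files(&self, filters: &[EqFilter]) -> Vec<FileSelection> {
        // A stats row rules out its row group if a filter value falls outside [min, max].
        let excluded = |s: &RowGroupStats| {
            filters.iter().any(|f| f.column == s.column && (f.value < s.min || f.value > s.max))
        };
        let mut out: Vec<FileSelection> = Vec::new();
        for s in &self.stats {
            let pruned = self.stats.iter()
                .any(|o| o.path == s.path && o.row_group == s.row_group && excluded(o));
            if pruned {
                continue;
            }
            match out.iter().position(|f| f.path == s.path) {
                Some(i) => {
                    if !out[i].row_groups.contains(&s.row_group) {
                        out[i].row_groups.push(s.row_group);
                    }
                }
                None => out.push(FileSelection { path: s.path.clone(), row_groups: vec![s.row_group] }),
            }
        }
        out
    }
}

/// Waker that does nothing; enough to poll a future that never actually suspends.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, only for running this sketch without a runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

/// Three row groups across two files, with stats for a single column `a`.
fn demo_index() -> MetadataIndex {
    let row = |path: &str, row_group: usize, min: i64, max: i64| RowGroupStats {
        path: path.to_string(), row_group, column: "a".to_string(), min, max,
    };
    MetadataIndex {
        stats: vec![row("f1.parquet", 0, 0, 9), row("f1.parquet", 1, 10, 19), row("f2.parquet", 0, 20, 29)],
    }
}

fn main() {
    let index = demo_index();
    let filters = [EqFilter { column: "a".to_string(), value: 12 }];
    // Only row group 1 of f1.parquet can contain a = 12.
    println!("{:?}", block_on(index.get_files(&filters)));
}
```

In a real setup the body of `get_files` would presumably be a query against the metadata store, and the result would feed whatever per-file row group selection the scan accepts.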
from arrow-datafusion.
> Sorry for jumping in here, maybe this isn't the best issue but it's hard to keep up with all of the amazing work you're doing @alamb!
Thanks @adriangb ❤️
> I wanted to pitch a use case I've been thinking about of storing a secondary index on a searchable async location. Think a relational database with ACID guarantees. In particular the key would be that hooks to do selections / pruning be async and that they pass in filters: I'd push down the filters into filters in the metadata store and run an actual query there that returns the files / row groups to scan. This is in contrast to #10549 for example where the index is in memory and fully materialized.
Yes, I agree this is a very common use case in modern database / data systems and one I hope will be easier to implement with some of these APIs (btw see #10813 for an even lower level API which I think brings this idea to its lowest level).
> I realize that `TableProvider.scan` already serves this purpose, but it'd be nice to integrate into these new APIs instead of having to implement more things oneself because you're hooking in at a higher (lower?) level.
I agree that you could do an async call as part of `TableProvider::scan` to fetch the relevant information from the remote store. Specifically, here:
datafusion/datafusion-examples/examples/parquet_index.rs, lines 223 to 263 in 586241f
One thing that is still unclear in my mind is what other APIs we could offer to make it easier to implement an external index. Most of the code in parquet_index.rs is there to create the in-memory index. Maybe we could create an example that shows how to implement a remote index 🤔
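As a rough idea of what the remote side of such an example might involve, here is a sketch of best-effort translation of filters into a parameterized SQL `WHERE` clause over row group statistics. The `Filter` enum, the `to_where_clause` function and the assumed table layout (one row per row group, with `<column>_min` / `<column>_max` columns) are all hypothetical and only meant to illustrate the idea.

```rust
/// Hypothetical, simplified filter (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Filter {
    /// `column = value`
    Eq(String, i64),
    /// `column > value`
    Gt(String, i64),
    /// `column < value`
    Lt(String, i64),
    /// Anything the index cannot translate (for example a UDF call).
    Unsupported,
}

/// Best-effort translation of filters into a `WHERE` fragment plus bind parameters.
///
/// Assumes a metadata table with one row per row group and `<column>_min` /
/// `<column>_max` columns for every name in `known_columns`. Filters that cannot
/// be translated are skipped, which is safe: the query then returns a superset of
/// the row groups and the Parquet reader still applies the full predicate.
fn to_where_clause(filters: &[Filter], known_columns: &[&str]) -> (String, Vec<i64>) {
    let mut clauses: Vec<String> = Vec::new();
    let mut params: Vec<i64> = Vec::new();
    for filter in filters {
        let (column, clause, values) = match filter {
            // A row group may contain `v` only if min <= v <= max.
            Filter::Eq(c, v) => (c, format!("({c}_min <= ? AND {c}_max >= ?)"), vec![*v, *v]),
            // Some value greater than `v` exists only if max > v.
            Filter::Gt(c, v) => (c, format!("{c}_max > ?"), vec![*v]),
            // Some value less than `v` exists only if min < v.
            Filter::Lt(c, v) => (c, format!("{c}_min < ?"), vec![*v]),
            Filter::Unsupported => continue,
        };
        // Only interpolate column names known to the index; values are always bound.
        if known_columns.contains(&column.as_str()) {
            clauses.push(clause);
            params.extend(values);
        }
    }
    let sql = if clauses.is_empty() {
        "1 = 1".to_string()
    } else {
        clauses.join(" AND ")
    };
    (sql, params)
}

fn main() {
    let filters = vec![
        Filter::Eq("a".to_string(), 1),
        Filter::Unsupported,
        Filter::Gt("b".to_string(), 5),
    ];
    let (sql, params) = to_where_clause(&filters, &["a", "b"]);
    println!("WHERE {sql} with params {params:?}");
}
```

Skipping untranslatable filters instead of failing keeps the index purely an optimization: it can only narrow the set of row groups, never change query results.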
from arrow-datafusion.
> Does this sound about in line with what you would think of as an example? I think implementing the async store as a familiar RDBMS (SQLite via SQLx?) would be a good example.
Yes that is very much in line.
Using SQLite via SQLx would be cool, though I don't think we would want to add new dependencies into the core datafusion crates themselves.
I made a new repo in datafusion-contrib here: https://github.com/datafusion-contrib/datafusion-async-parquet-index and invited you to be an admin, in case you want to do things there.
from arrow-datafusion.
datafusion-contrib/datafusion-async-parquet-index#1 😃
from arrow-datafusion.