Comments (5)
I fixed this by adding a regular checkpoint creation function. This reduces the number of file operations.
PR around auto checkpoints is #913
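    import json
    from deltalake import DeltaTable
    from firebase_functions import options, scheduler_fn

    # google_service_account_key is not shown here; it is assumed to be a dict loaded elsewhere.
    @scheduler_fn.on_schedule(schedule="*/5 * * * *", memory=options.MemoryOption.GB_1, timeout_sec=1000)
    def checkpointdb(request):
        dt = DeltaTable("gs://bucket/deltalake/feed", storage_options={"google_service_account_key": json.dumps(google_service_account_key)})
        dt.create_checkpoint()  # compacts the log so readers can skip older commit files
        print("DB Checkpoint", flush=True)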
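An alternative to a cron-style job would be to checkpoint right after a write, once every N versions. The following is only a sketch of that idea: `maybe_checkpoint` and its `interval` parameter are hypothetical names, not part of deltalake. It relies only on the table object exposing `version()` and `create_checkpoint()`, which `deltalake.DeltaTable` does.

```python
def maybe_checkpoint(table, interval: int = 10) -> bool:
    """Create a checkpoint when the table version is a positive multiple of `interval`.

    `table` is expected to behave like deltalake.DeltaTable, i.e. to provide
    version() and create_checkpoint(). Returns True if a checkpoint was created.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    version = table.version()
    if version > 0 and version % interval == 0:
        table.create_checkpoint()
        return True
    return False
```

This would be called on a freshly loaded `DeltaTable` after each write. With concurrent writers, a given process may never observe a version that is an exact multiple of the interval, so a scheduled job like the one above can still be the more robust choice.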
from delta-rs.
In order to write to a Delta table, we always need to know the latest state of the table. As such, every write also requires us to read all relevant log files at least once. Usually there may be one or more list operations as well.
Are you creating checkpoints? If not, we have to read at least one commit file for every transaction that was ever committed to the table, which can become very sizeable.
We have a PR in flight that will allow us to be more economical in terms of reads, especially in append-only scenarios, where we can disregard a lot of the log. Again, this assumes there are checkpoints.
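To make the effect of checkpoints concrete, here is a simplified model of which `_delta_log` objects have to be fetched to reconstruct the table state. It is a sketch based on the Delta log naming scheme (zero-padded 20-digit versions), not delta-rs's actual read path: the function name is made up for illustration, it assumes single-part checkpoints, and it ignores list operations and the `_last_checkpoint` lookup.

```python
from typing import List, Optional


def log_files_to_read(latest_version: int, checkpoint_version: Optional[int] = None) -> List[str]:
    """Return the _delta_log objects needed to rebuild the state at `latest_version`."""
    if checkpoint_version is None:
        # No checkpoint: every commit since version 0 has to be replayed.
        files: List[str] = []
        first_commit = 0
    else:
        if checkpoint_version > latest_version:
            raise ValueError("checkpoint cannot be newer than the latest version")
        # With a checkpoint: read it once, then only the commits after it.
        files = [f"_delta_log/{checkpoint_version:020d}.checkpoint.parquet"]
        first_commit = checkpoint_version + 1
    files += [f"_delta_log/{v:020d}.json" for v in range(first_commit, latest_version + 1)]
    return files
```

Under this model, a table at version 1000 with no checkpoint needs 1001 object reads per state load, while the same table with a checkpoint at version 990 needs 11.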
I'm going to close this; I don't believe there is anything actionable for the delta-rs project here.
I wouldn't have the slightest clue what class B operations even are; I don't use GCP myself.
If you can break it down into terms that non-GCP users understand, that would help.
@ion-elgreco Class B operations are mainly for reading objects from Google Cloud Storage.
@roeap I'm not sure about checkpoints. I haven't defined any myself, so if write_deltalake is not using them by default, I would assume I was not using them. Based on the numbers I provided, does it make sense to get so many read/list operations?
Note: I later changed the implementation to simply add new Parquet files, as I figured out I don't really need the functionality of Delta Lake. I just wanted to point this out in case anyone else has a similar problem, especially since it can incur high unexpected costs on cloud providers.