Comments (8)
cc @fjetter
from aws-sdk-cpp.
Ok, so it seems that the JSON parsing step, which takes most of the time in the profile graphs, is spent parsing this JSON string hardcoded (!!) in the SDK's C++ source code:
It decodes into this heavily-nested object (here in Python representation):
https://gist.github.com/pitrou/c4978f29be9d2d3cb9574cd9d262490a
from aws-sdk-cpp.
Thanks for pointing this out. Currently it's behaving within expected boundaries, but I'm interested to hear more of your thoughts on this. How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate the client that often, as it can be reused.
I noticed in the issue linked above that you have improved the performance of your tests to less than 53 µs with the "caching PoC". Were there any changes you were wanting to be made on the SDK side?
from aws-sdk-cpp.
How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate the client that often, as it can be reused.
I'll let @fjetter elaborate on their situation, but when distributing individual tasks over a cluster of workers, there's a need to deserialize everything that's needed to run such tasks. If a task entails loading data over S3 (with potentially different configurations, since tasks from multiple users or workloads might be in flight), it implies recreating an S3Client each time. Depending on task granularity, I suspect that taking 1 ms to instantiate an S3Client might appear as a significant contributor in performance profiles.
from aws-sdk-cpp.
For the record, here's the current prototype that seems to work on our CI. I ended up caching endpoint providers based on the S3 client configuration's relevant options (the ones that influence the provider initialization). There's an additional complication (InitOnceEndpointProvider) due to the fact that S3Client::S3Client always reconfigures the endpoint provider, even when it is explicitly passed by the caller, and that is not thread-safe.
https://github.com/apache/arrow/blob/2be51947448aac17da5eb4e7b284483da72f7f41/cpp/src/arrow/filesystem/s3fs.cc#L916-L1022
It would probably have been simpler if I could have explicitly created a shared RuleEngine with the S3 default rules and instantiated each S3EndpointProvider from that same RuleEngine.
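To illustrate the caching idea outside of the SDK, here is a minimal sketch. It does not use the real aws-sdk-cpp types: EndpointConfig, EndpointProvider and EndpointProviderCache are hypothetical stand-ins, and the three key fields are only assumed examples of "options that influence the provider initialization". The actual prototype is in the s3fs.cc link above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <tuple>

// Hypothetical stand-in for the client options that influence provider setup.
struct EndpointConfig {
  std::string region;
  std::string endpoint_override;
  bool use_virtual_addressing;

  // Ordering so the struct can be used as a std::map key.
  bool operator<(const EndpointConfig& other) const {
    return std::tie(region, endpoint_override, use_virtual_addressing) <
           std::tie(other.region, other.endpoint_override,
                    other.use_virtual_addressing);
  }
};

// Hypothetical stand-in for a provider that is expensive to construct.
struct EndpointProvider {
  explicit EndpointProvider(const EndpointConfig& cfg) : config(cfg) {}
  EndpointConfig config;
};

// Thread-safe cache: one shared provider per distinct configuration.
class EndpointProviderCache {
 public:
  std::shared_ptr<EndpointProvider> Lookup(const EndpointConfig& config) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(config);
    if (it == cache_.end()) {
      // Pay the construction cost only on the first request for this config.
      it = cache_.emplace(config, std::make_shared<EndpointProvider>(config))
               .first;
    }
    assert(it->second != nullptr);
    return it->second;
  }

  // Drop all cached providers, e.g. before shutting the S3 subsystem down.
  void Reset() {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_.clear();
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<EndpointConfig, std::shared_ptr<EndpointProvider>> cache_;
};
```

With something of this shape, two clients requested with the same options get the same provider instance from Lookup, while a client with a different region or endpoint override gets its own.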
from aws-sdk-cpp.
How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate the client that often, as it can be reused.
I'll let @fjetter elaborate on their situation, but when distributing individual tasks over a cluster of workers, there's a need to deserialize everything that's needed to run such tasks. If a task entails loading data over S3 (with potentially different configurations, since tasks from multiple users or workloads might be in flight), it implies recreating an S3Client each time. Depending on task granularity, I suspect that taking 1 ms to instantiate an S3Client might appear as a significant contributor in performance profiles.
That sums it up nicely. Due to how Arrow and Dask are built, we end up instantiating possibly thousands of clients, which adds up to a couple of seconds of latency whenever we try to read a dataset. We essentially end up creating one S3Client per file; reusing it is a little difficult at this point.
from aws-sdk-cpp.
This is ultimately a feature request so I will be changing this issue to a feature-request. It's something that we would like to improve the speed of, but I don't have a timeline for when that might happen.
In the short term, I would recommend using a single endpoint resolver for all of your S3 clients. You can do this by overloading the client when you initialize it.
from aws-sdk-cpp.
In the short term, I would recommend using a single endpoint resolver for all of your S3 clients. You can do this by overloading the client when you initialize it.
We have a proposed workaround now, but it's slightly more complicated than that:
- we are a library, so we cannot assume a single endpoint resolver, since different S3 clients may be requested for different endpoint configurations; we need to maintain a cache (which must also be appropriately flushed before S3 shutdown)
- S3Client unfortunately does non-thread-safe reconfiguration of the given endpoint resolver in its constructor. That's harmless if the endpoint resolver is specific to a given client, not if the endpoint resolver is shared between multiple clients. We therefore had to implement an immutable endpoint resolver wrapper.
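
The second point can be sketched roughly as follows. This is a simplified illustration, not the SDK's actual interface: EndpointResolver, SimpleResolver, ImmutableResolver and the Configure/Resolve methods are hypothetical names, and the resolved host format is made up. The idea is that the wrapped resolver is fully configured once, and the wrapper ignores any later reconfiguration attempt so that it can be shared between clients.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified resolver interface (not the SDK's real one).
class EndpointResolver {
 public:
  virtual ~EndpointResolver() = default;
  // Mutating call, analogous to what a client constructor would invoke.
  virtual void Configure(const std::string& region) = 0;
  // Read-only call, safe to use concurrently once configuration is done.
  virtual std::string Resolve(const std::string& bucket) const = 0;
};

// Hypothetical concrete resolver holding mutable configuration.
class SimpleResolver : public EndpointResolver {
 public:
  void Configure(const std::string& region) override { region_ = region; }
  std::string Resolve(const std::string& bucket) const override {
    return bucket + ".s3." + region_ + ".example.com";
  }

 private:
  std::string region_;
};

// Wrapper around an already-configured resolver. Reconfiguration is a
// no-op, so sharing one instance between many clients cannot race.
class ImmutableResolver : public EndpointResolver {
 public:
  explicit ImmutableResolver(std::shared_ptr<const EndpointResolver> wrapped)
      : wrapped_(std::move(wrapped)) {
    assert(wrapped_ != nullptr);
  }

  void Configure(const std::string&) override {
    // Intentionally ignored: the wrapped resolver was configured up front.
  }

  std::string Resolve(const std::string& bucket) const override {
    return wrapped_->Resolve(bucket);
  }

 private:
  std::shared_ptr<const EndpointResolver> wrapped_;
};
```

Holding the wrapped resolver through a pointer-to-const means the wrapper itself cannot call the mutating method by accident; only the read-only path is forwarded.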
from aws-sdk-cpp.