Comments (23)
@tmds false alarm - it was my fault. quay.io returns a `Range` header of `0--1` on the initial upload create, which crashed us. With some additional validation we're all good to go and greened up again.
I'll be adding tests to the linked PR to mimic behaviors of different registries (in terms of what their responses look like to different request patterns) to help ensure we don't regress.
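The kind of validation this implies could be sketched like so (a hypothetical Python sketch with an invented function name, not the SDK's actual code): registries are supposed to report upload progress via a `Range: 0-<offset>` header, but real servers vary, and a fresh session may answer `0--1`.

```python
import re

def parse_upload_range(header):
    """Return the next upload offset implied by a registry's Range header.

    Registries are supposed to answer 'Range: 0-<last byte received>', but
    real servers vary: quay.io has answered '0--1' on a fresh session, and
    some registries answer '0-0' whether or not a byte was stored.
    """
    if not header:
        return 0
    m = re.fullmatch(r"(\d+)-(-?\d+)", header.strip())
    if not m:
        return 0  # unparseable: assume nothing has been stored yet
    start, end = int(m.group(1)), int(m.group(2))
    if start != 0 or end < 0:
        return 0  # '0--1'-style answers mean zero bytes received
    return end + 1  # 'Range: 0-N' means bytes 0..N inclusive are stored
```

A test double for each registry quirk can then feed headers like `0--1`, `0-0`, or a missing header through this parser to assert the client never crashes on them.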
from sdk-container-builds.
I tried to investigate the issues on our side. We have two issues at the moment; for the first one I found a workaround.
Both are also reproducible on the version that worked before (I tried the version from 15th February).
The first error occurs on `PATCH /v2/<name>/blobs/uploads/<uuid>` and happens because an existing connection was forcibly closed by the remote host. The workaround for this issue is to reduce the upload chunk size (I checked with 1248080 bytes instead of the current 5248080 bytes).
The second error occurs when finalizing the upload using `PUT /v2/<name>/blobs/uploads/<uuid>?digest=<digest>`. The returned error is `None of the provided layer digests match the calculated layer digest` (code: `BLOB_UPLOAD_INVALID`). I haven't found a workaround for it.
The last successful push happened on 23rd March: https://github.com/baronfel/sdk-container-demo/actions/runs/4496866095/jobs/7911922863, using version 0.4.0 of the package.
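For context, these two requests are steps in the OCI distribution spec's chunked upload flow: each `PATCH` sends one chunk with an inclusive `Content-Range`, and the final `PUT ...?digest=<digest>` supplies the digest of the whole blob. A hypothetical Python sketch of the chunk arithmetic (helper names invented, not the SDK's code):

```python
import hashlib

def chunk_plan(blob_len, chunk_size):
    """Yield (offset, content_range) pairs for each PATCH in a chunked
    OCI blob upload. 'Content-Range: S-E' uses inclusive byte indexes."""
    offset = 0
    while offset < blob_len:
        end = min(offset + chunk_size, blob_len) - 1
        yield offset, f"{offset}-{end}"
        offset = end + 1

def blob_digest(blob):
    """Digest submitted on the final PUT ...?digest=<digest> request:
    the sha256 of the complete blob as uploaded."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()
```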
> @vlada-shubina - thank you for the above comments. Is it possible to add a test that will detect such issues in the future? (assuming we'll fix this soon)
We have a separate issue for that: #371.
For now we don't have the infrastructure to test remote registries, unfortunately. The prerequisites are obtaining test accounts and ensuring CI requirements are met.
The prototype by @baronfel is available here: https://github.com/baronfel/sdk-container-demo; however, it runs on private accounts and might not be fully compatible with our engineering systems.
If I'm understanding correctly, this library only does retries for `SocketException`; a mismatch on the digest would not be retried. The linked library retries regardless of the type of error.
So my speculation is that there is eventual consistency on the ECR side between putting the chunk and getting the digest, and that is not being retried. I don't know exactly how ECR is implemented, but I'm sure S3 is involved at some point, which is an eventually consistent store.
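The broader retry policy being suggested could be sketched like this (a generic Python helper with invented names, not the SDK's or regclient's actual code):

```python
import time

def with_retries(op, attempts=5, base_delay=0.5, transient=(OSError, TimeoutError)):
    """Retry `op` on any transient error with exponential backoff, rather
    than retrying only a single exception type such as SocketException."""
    for attempt in range(attempts):
        try:
            return op()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

Which exception types count as "transient" is the design question: too narrow and digest/consistency hiccups are never retried, too broad and genuine failures burn through all attempts.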
@baronfel I was doing some experimenting with your branch. The authentication logic in `AuthHandshakeMessageHandler` caches the authentication header per request URI. Since new locations are vended all the time with new GUIDs, there is always a cache miss. So while doing a 10MB PATCH chunk upload of a layer, the first attempt is done unauthenticated. I suspect ECR is thinking "why are you sending me 10MB of unauthenticated data?" and so rejects the request and shuts down the socket, hence all of the socket forcibly closed exceptions.
As a quick hack I changed the authentication header cache in `AuthHandshakeMessageHandler` to use just the host name. I'm not saying that is the right solution, but it works for ECR since the host name is what identifies the registry that was logged in to. When I did that, my uploads seemed to work quite well without any socket forcibly closed exceptions. @baronfel Could you try recreating my hack and see if you see similar behavior? Basically, in the calls to `AuthHeaderCache` I changed the key to `new Uri("https://" + request.RequestUri.Host)` as a hack.
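The hack can be modeled in miniature like this (a hypothetical Python toy with invented names; the real code is the C# `AuthHeaderCache`): keying the cache by host means upload URLs whose paths contain fresh GUIDs still hit the cache.

```python
from urllib.parse import urlsplit

class HostKeyedAuthCache:
    """Toy model of the hack: cache auth headers by registry host rather
    than by full request URI, so every freshly-vended upload location
    still finds the cached credentials."""

    def __init__(self):
        self._by_host = {}

    def _key(self, url):
        return urlsplit(url).hostname

    def store(self, url, header):
        self._by_host[self._key(url)] = header

    def lookup(self, url):
        return self._by_host.get(self._key(url))
```

Usage: after the first authenticated request to a registry, every later request to the same host - including PATCH uploads to GUID-suffixed locations - gets the cached header on the first attempt instead of going out unauthenticated.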
All green! (including harbor!)
https://github.com/baronfel/sdk-container-demo/actions/runs/5164957782
@normj I couldn't replicate the socket errors - in this run you can see there was only one socket failure and the overall 'full push' did work. This is the kind of behavior that I'd hope for.
Now will work on refactoring, in-repo tests, more sdk-container-demo test cases, docs, etc.
@normj is there any chance you could get the ECR folks to give us some diagnostics from their end to help triage the problem? It's almost certain to be something we regressed, but I don't know where to start looking based on the errors we get directly from the API responses.
@baronfel I'll see what I can do
@vlada-shubina - thank you for the above comments. Is it possible to add a test that will detect such issues in the future? (assuming we'll fix this soon)
@baronfel @vlada-shubina Sorry I haven't been able to dedicate any time to helping with this issue. I should have some time open up soon. I was able to use the `containerize` console app in the `dotnet/sdk` repository and successfully push an image to ECR, so I haven't recreated the issue locally for myself yet.
> @baronfel @vlada-shubina Sorry I haven't been able to dedicate any time to helping with this issue. I should have some time open up soon. I was able to use the `containerize` console app in the `dotnet/sdk` repository and successfully push an image to ECR, so I haven't recreated the issue locally for myself yet.
I can easily repro it on a barebones web project on .NET SDK 7.0.302. Steps to repro:
- login to AWS ECR with `get-login-password` as described here: https://docs.aws.amazon.com/AmazonECR/latest/public/public-registries.html
- `dotnet new webapp` to create a new web project
- add the `<EnableSdkContainerSupport>true</EnableSdkContainerSupport>` property to the csproj
- add `Properties\PublishProfiles\aws.pubxml` with the registry details (I'm using a public registry here):

```xml
<Project>
  <PropertyGroup>
    <ContainerRegistry>public.ecr.aws</ContainerRegistry>
    <ContainerImageName>r4z9b1c4/sdk-containers-test</ContainerImageName>
    <WebPublishMethod>Container</WebPublishMethod>
  </PropertyGroup>
</Project>
```

- publish the container using `dotnet publish --os linux --arch x64 -c Release /p:PublishProfile=aws`

Update: the same issue appears on the latest build of 7.0.400 from the dotnet/sdk repo.
@normj
I investigated it more:
- The error on `PATCH /v2/<name>/blobs/uploads/<uuid>` happens because an existing connection was forcibly closed by the remote host. The workaround of reducing the upload chunk size (I checked with 1248080 bytes instead of the current 5248080 bytes) works reasonably well; I haven't encountered any issues so far after changing it, but it slows down the upload. Do you have any objections to changing the limit?
- The second error occurs when finalizing the upload using `PUT /v2/<name>/blobs/uploads/<uuid>?digest=<digest>`. The returned error is `None of the provided layer digests match the calculated layer digest` (code: `BLOB_UPLOAD_INVALID`). I noticed that a non-chunked upload works as expected when submitting the digest of the uncompressed content of the layer; for a chunked upload it fails with the error above. Should we submit anything else in this case?

Thanks
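On the digest question: whatever way the bytes are sent, the digest on the final `PUT` must be the digest of the reassembled blob, and hashing incrementally per chunk must agree with hashing the whole thing. A quick Python sanity-check sketch (function names invented, not the SDK's code):

```python
import hashlib

def whole_digest(blob):
    """Digest computed over the complete blob in one pass."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def chunked_digest(blob, chunk_size):
    """Feed the hasher one chunk at a time, as a chunked uploader would;
    the result must equal whole_digest for any chunk size."""
    h = hashlib.sha256()
    for offset in range(0, len(blob), chunk_size):
        h.update(blob[offset:offset + chunk_size])
    return "sha256:" + h.hexdigest()
```

If the two agree locally but the registry still reports `BLOB_UPLOAD_INVALID`, the mismatch is on the server side (e.g. not all chunks were actually committed), not in the digest computation.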
When I was talking to the ECR folks before about this feature, they mentioned they help this Docker client work with ECR: https://github.com/regclient/regclient/blob/90e8aca4e5fe0db2f57dfa52f0c9ec9b2e7d9118/scheme/reg/blob.go
One thing I see they have that I don't think the .NET implementation has is a retry at the individual PATCH level. Has the team thought about putting that into the registry push feature? Totally speculating here, but I'm wondering if we're hitting an eventual consistency issue.
That's very interesting - we have retries at a lower level (an `HttpMessageHandler`, I believe), but the really interesting thing I see here is that they have an 'adaptive' chunk size. That's something I think we need, since our chunk sizes have been very fragile.
That's a great idea. After digging into that code a bit more, I see that they also use a `GET` on the blob upload endpoint to get the status of the upload (via the `Range` header there) in case some data needs to be re-sent. We should try to do the same.
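The resume step could be sketched as follows (hypothetical Python with invented names, not regclient's or the SDK's code): after a `GET` on the upload URL reports `Range: 0-N`, only the bytes from `N+1` onward need another `PATCH`.

```python
def chunks_to_resend(committed, blob_len, chunk_size):
    """After probing upload status (GET -> 'Range: 0-<committed-1>'),
    yield the (start, end_inclusive) chunks that still need a PATCH."""
    start = committed  # first byte the server has NOT acknowledged
    while start < blob_len:
        end = min(start + chunk_size, blob_len) - 1
        yield start, end
        start = end + 1
```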
@vlada-shubina I've made some progress on this based on the data @normj sent: https://github.com/dotnet/sdk/compare/release/7.0.4xx...baronfel:sdk:containers-chunk-retry?expand=1
So far the main changes are:
- retries on chunk uploads specifically, and
- using the `Range` information from server responses to determine where the next chunk should start

Both are inspired by the linked regclient code. This worked well for GitHub, GCP, and Azure chunked uploads. For AWS uploads, each PATCH for a chunk, say the 0-63999 chunk, would return successfully, but the `Range` information reported by ECR would remain `0-63999` across every iteration of chunks. We expect this number to climb as chunks are committed, and so we loop (forever?). I didn't see how the regclient codebase deals with this.
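A guard for exactly this failure mode might look like this (a hypothetical Python sketch with invented names, not the branch's code): abort instead of re-sending the same chunk forever when the server-reported `Range` stops advancing.

```python
def next_offset_with_progress_check(reported_end, prev_offset):
    """Advance the upload cursor from the server-reported Range end,
    refusing to loop if the server stops making progress (as observed
    with ECR, where Range stayed at 0-63999 for every chunk)."""
    next_offset = reported_end + 1
    if next_offset <= prev_offset:
        raise RuntimeError(
            f"registry did not commit past byte {prev_offset}; aborting "
            "instead of re-sending the same chunk forever")
    return next_offset
```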
Found a bit more information today:
- When we start an upload, registries can potentially tell us what their chunk sizes are, so we listen to them now. AWS is one of these, but sometimes registries lie and report `0-0` ranges, so we handle that.
- I added the behavior where we try a full upload and, if that fails, try a chunked upload: dotnet/sdk@release/7.0.4xx...baronfel:sdk:containers-chunk-retry?expand=1#diff-b05a3261b9. That's kinda cool.
- I computed the digest inline with the chunks, in case it was somehow different from the other digests we had - they weren't :(
- I added more logging that I'll back out at some point - this was a little helpful.

I still don't have a reason for the PUT failing on AWS uploads, but I was able to rule out digest mismatches at least.
@normj I tried that as well and didn't have much success - the case you're describing (a single PATCH with a 10MB upload) is the non-chunked upload, which is what we do when the default chunk size (or the chunk size reported by the registry) is greater than the size of the layer. This is the case that we knew worked before with AWS, the one-and-done method. The chunked method still fails for me using this proposed change - I forced the chunking by using `SDK_CONTAINER_REGISTRY_CHUNKED_UPLOAD: true` and `SDK_CONTAINER_REGISTRY_CHUNKED_UPLOAD_SIZE_BYTES: 1000` (or something equally small), and you can see the results in this job.
I do think you have a point about the auth cache, though - the intent of that cache is to track authentication by host + repository, because each repository can have different credentials (the config JSON might have separate entries for `ghcr.io` and `ghcr.io/baronfel/sdk-container-demo`, for example). In addition, we should be caching by authentication method as well (Bearer, Basic, etc.), and we're not doing that. So some work should be done here overall (and TBH I'm not convinced that the current catch-all authentication handler is the correct place to put this, as it has no knowledge of registry and repository).
I finally figured out how to get logs out of my AWS ECR from CloudTrail, and it was embarrassingly easy to do:

```
"errorCode": "LayerPartTooSmallException",
"errorMessage": "All layer parts except the last part should be at least 5MB",
```

So I made a change to:
a) add a multi-MB lorem ipsum file to the app to bloat its size so that the layer exceeds ECR's 5MB minimum part size, and
b) re-introduce the ECR-specific 5MB check that I'd blown away.

This made the chunked uploads work! Here's some debug logging showing the best of all worlds: an initial full-layer push, reacting to errors by moving to chunking, and finally chunking working!
```
Started upload session for sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2 to https://***.dkr.ecr.us-east-1.amazonaws.com:443/v2/sdk-container-demo/blobs/uploads/a3d7e2d0-7344-4962-9117-3498eccef9c6 with chunk size '5242880'
Chunk size was greater than content length of 9340827, uploading whole blob.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Errored while uploading whole blob: CONTAINER1006: Too many retries, stopping..
Retrying with chunked upload.
Processing chunk because (0 < 9340827)
Processing chunk because (5242880 < 9340827)
Uploaded content for sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2
Computed digest for 'sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2' was 'sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2'
Finalized upload session for sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2
Finished uploading layer sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2 to ***.dkr.ecr.us-east-1.amazonaws.com
```
I'm a little irked that it was so easy to find logs in CloudTrail 😓 I need to figure out how/why I regressed Google Cloud and Quay.io support (@tmds - any idea who we can talk to for more detailed logs at Quay?), then I'll work on cleaning this all up and documenting.
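ECR's rule from that error message is easy to encode as a check (a hypothetical Python sketch with invented names; the real fix is the re-introduced C# 5MB check):

```python
FIVE_MB = 5 * 1024 * 1024  # 5242880, the chunk size in the log above

def valid_part_sizes(part_sizes, minimum=FIVE_MB):
    """ECR's rule: every layer part except the last must be at least the
    minimum (5MB); the final part may be any size."""
    return all(size >= minimum for size in part_sizes[:-1])
```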
@baronfel Awesome! You are now the AWS expert.
I didn't realize the refactoring took out the 5MB part size. If you add the hack I did for the authentication, do the SocketExceptions go away?
@normj I'll check that next - I focused first on making quay and GCP work.
I have greened those up by changing the upload algorithm a bit - before, we defaulted to chunked mode if the chunk size detected from the registry was smaller than the blob's length. Quay and GCP don't actually give us a signal here one way or the other, so instead I changed it to try the 'full' upload if no signal from the registry is detected, or if a signal is detected and that length is greater than the blob's. You can see this in action (heh) here: https://github.com/baronfel/sdk-container-demo/actions/runs/5164957782/jobs/9304177385
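That decision could be paraphrased in code like this (a hypothetical Python sketch of the described algorithm, names invented, not the actual SDK logic):

```python
def should_try_full_upload(blob_len, registry_chunk_hint):
    """Try a single whole-blob upload when the registry gives no chunk-size
    signal, or when its advertised chunk size covers the whole blob; fall
    back to chunked upload (with retries) only if the full upload fails."""
    if registry_chunk_hint is None or registry_chunk_hint <= 0:
        return True  # quay and GCP give no usable signal
    return registry_chunk_hint >= blob_len
```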
> I need to figure out how/why I regressed Google Cloud and Quay.io support (@tmds - any idea who we can talk to for more detailed logs at Quay?)
I've asked the question and I'll get back to you. I've also asked if there's a container image that can be used to run it locally.
If you log some specific issues for quay.io, you can cc me on them. I may take a look as well (if time permits).
This was fixed by dotnet/sdk#33500, mainly by uploading layers atomically. There is still an issue with chunked upload for AWS; I will create a separate issue for that.