
Comments (23)

baronfel commented on June 13, 2024

@tmds false alarm - it was my fault. quay.io returns a Range header of 0--1 on the initial upload create, which crashed us. With some additional validation we're all good to go and greened up again.

I'll be adding tests to the linked PR to mimic behaviors of different registries (in terms of what their responses look like to different request patterns) to help ensure we don't regress.
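The quay.io `0--1` crash comes down to trusting the registry's `Range` header blindly. A minimal sketch of defensive Range parsing (illustrative Python, not the SDK's actual validation code; `parse_upload_range` is a hypothetical helper):

```python
import re

def parse_upload_range(range_header: str) -> int:
    """Return the next byte offset to upload, given a registry's Range header.

    Registries are expected to return "0-<last committed byte>", but some
    (e.g. quay.io on the initial upload create) have been observed returning
    malformed values like "0--1". Treat anything unparseable or inconsistent
    as "nothing committed yet" (offset 0) rather than crashing.
    """
    match = re.fullmatch(r"(\d+)-(\d+)", range_header.strip())
    if match is None:
        return 0  # malformed (e.g. "0--1"): assume no bytes committed
    start, end = int(match.group(1)), int(match.group(2))
    if start != 0 or end < start:
        return 0  # inconsistent range: fall back to starting over
    return end + 1  # Range is inclusive, so the next chunk starts after `end`
```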

from sdk-container-builds.

vlada-shubina commented on June 13, 2024

@normj @baronfel

I tried to investigate the issues on our side. We have two issues at the moment; I found a workaround for the first.
Both are also reproducible on the version that worked before (I tried the version from 15th Feb).

The first error occurs on PATCH /v2/<name>/blobs/uploads/<uuid> and happens because "an existing connection was forcibly closed by the remote host". The workaround for this issue is to reduce the upload chunk size (I checked with 1248080 bytes instead of the current 5248080 bytes).

The second error occurs when finalizing the upload using PUT /v2/<name>/blobs/uploads/<uuid>?digest=<digest>.
The returned error is "None of the provided layer digests match the calculated layer digest" (code: BLOB_UPLOAD_INVALID).
I haven't found a workaround for it.
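For reference, the chunked-upload flow in question can be sketched as follows (illustrative Python, not the SDK's C# implementation; `plan_chunked_upload` is a hypothetical helper):

```python
import hashlib

def plan_chunked_upload(blob: bytes, chunk_size: int):
    """Split a blob into (start, end, data) chunks for PATCH requests and
    compute the sha256 digest for the final PUT (?digest=<digest>).

    Sketch of the OCI distribution chunked-upload flow. The digest on the
    final PUT must be computed over the exact bytes sent, which is one place
    a BLOB_UPLOAD_INVALID mismatch can originate.
    """
    chunks = []
    for start in range(0, len(blob), chunk_size):
        data = blob[start:start + chunk_size]
        # Content-Range positions are inclusive on both ends.
        chunks.append((start, start + len(data) - 1, data))
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return chunks, digest
```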

The last successful push happened on 23rd March: https://github.com/baronfel/sdk-container-demo/actions/runs/4496866095/jobs/7911922863 using version 0.4.0 of the package.


vlada-shubina commented on June 13, 2024

@vlada-shubina Vlada Shubina FTE - thank you for the above comments. Is it possible to add a test that will detect such issues in the future? (assuming we'll fix this soon)

We have a separate issue for that: #371.
For now we unfortunately don't have the infrastructure to test remote registries.
The prerequisites are obtaining test accounts and ensuring CI prerequisites are met.
The prototype by @baronfel is available here: https://github.com/baronfel/sdk-container-demo; however, it runs on private accounts and might not be fully compatible with our engineering systems.


normj commented on June 13, 2024

If I'm understanding correctly, this library only retries on SocketException. A mismatch on the digest would not be retried. The linked library retries regardless of the type of error.

https://github.com/dotnet/sdk/blob/d03535e2d65c270b05ef6f4893330eb1cda4d9a9/src/Containers/Microsoft.NET.Build.Containers/AuthHandshakeMessageHandler.cs#L194

So my speculation is that there is eventual consistency on the ECR side between putting the chunk and getting the digest, and that failure mode is not being retried. I don't know exactly how ECR is implemented, but I'm sure S3 is involved at some point, and S3 is an eventually consistent store.
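A broad retry policy of the kind being suggested might look like this (hypothetical sketch, not code from either the SDK or regclient):

```python
import time

def with_retries(operation, attempts=5, base_delay=0.1, retry_on=(Exception,)):
    """Retry `operation` on any exception type listed in `retry_on`.

    The SDK was observed to retry only on SocketException; a registry backed
    by an eventually consistent store can also return transient digest
    mismatches, which argues for retrying on broader failures. Exponential
    backoff between attempts.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```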


normj commented on June 13, 2024

@baronfel I was doing some experimenting with your branch. The authentication logic in AuthHandshakeMessageHandler caches the authentication header per request URI. Since new upload locations with new GUIDs are vended all the time, there is always a cache miss. So while doing a 10MB PATCH chunk upload of a layer, the first attempt is made unauthenticated. I suspect ECR is thinking "why are you sending me 10MB of unauthenticated data?" and so rejects the request and shuts down the socket. Hence all of the connection-forcibly-closed exceptions.

As a quick hack, I changed the authentication header cache in AuthHandshakeMessageHandler to use just the host name. I'm not saying that is the right solution, but it works for ECR since the host name is what identifies the registry that was logged in. When I did that, my uploads seemed to work quite well without any connection-forcibly-closed exceptions. @baronfel could you try recreating my hack and seeing if you observe similar behavior? Basically, in the calls to AuthHeaderCache I changed the key to new Uri("https://" + request.RequestUri.Host) as a hack.
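The shape of that hack — keying the cache by host rather than full URI — can be sketched like this (illustrative Python; `AuthHeaderCache` here is a hypothetical stand-in, not the SDK's class):

```python
from urllib.parse import urlparse

class AuthHeaderCache:
    """Cache auth headers by registry host rather than by full request URI.

    Upload URIs contain fresh GUIDs, so a per-URI cache always misses and the
    first PATCH goes out unauthenticated. Keying by host means every request
    to the same registry reuses the cached header.
    """
    def __init__(self):
        self._by_host = {}

    def store(self, request_uri: str, header: str) -> None:
        self._by_host[urlparse(request_uri).netloc] = header

    def lookup(self, request_uri: str):
        return self._by_host.get(urlparse(request_uri).netloc)
```

As noted later in the thread, keying by host alone loses per-repository credentials, so this is a diagnostic hack rather than the right cache key.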


baronfel commented on June 13, 2024

All green! (including harbor!)

https://github.com/baronfel/sdk-container-demo/actions/runs/5164957782

@normj I couldn't replicate the socket errors - in this run you can see there was only one socket failure, and the overall 'full push' did work. This is the kind of behavior I'd hope for.

Now will work on refactoring, in-repo tests, more sdk-container-demo test cases, docs, etc.


baronfel commented on June 13, 2024

@normj is there any chance you could get the ECR folks to give us some diagnostics from their end to help triage the problem? It's almost certain to be something we regressed, but I don't know where to start looking based on the errors we get directly from the API responses.


normj commented on June 13, 2024

@baronfel I'll see what I can do


donJoseLuis commented on June 13, 2024

@vlada-shubina - thank you for the above comments. Is it possible to add a test that will detect such issues in the future? (assuming we'll fix this soon)


normj commented on June 13, 2024

@baronfel @vlada-shubina Sorry I haven't been able to dedicate any time to helping with this issue. I should have some time open up soon. I was able to use the containerize console app in the dotnet/sdk repository and successfully push an image to ECR, so I haven't recreated the issue locally yet myself.


vlada-shubina commented on June 13, 2024

@baronfel @vlada-shubina Sorry I haven't been able to dedicate any time to helping with this issue. I should have some time open up soon. I was able to use the containerize console app in the dotnet/sdk repository and successfully push an image to ECR, so I haven't recreated the issue locally yet myself.

I can easily repro it on a barebones web project on .NET SDK 7.0.302. Steps to repro:

  • log in to AWS ECR with get-login-password as described here: https://docs.aws.amazon.com/AmazonECR/latest/public/public-registries.html
  • run dotnet new webapp to create a new web project
  • add the <EnableSdkContainerSupport>true</EnableSdkContainerSupport> property to the csproj
  • add Properties\PublishProfiles\aws.pubxml with the registry details (I'm using a public registry here):
<Project>
    <PropertyGroup>
        <ContainerRegistry>public.ecr.aws</ContainerRegistry>
        <ContainerImageName>r4z9b1c4/sdk-containers-test</ContainerImageName>
        <WebPublishMethod>Container</WebPublishMethod>
    </PropertyGroup>
</Project>
  • publish the container using dotnet publish --os linux --arch x64 -c Release /p:PublishProfile=aws

Update: the same issue appears on the latest build of 7.0.400 from the dotnet/sdk repo.


vlada-shubina commented on June 13, 2024

@normj
I investigated it more:

  • The error on PATCH /v2/<name>/blobs/uploads/<uuid> is due to "an existing connection was forcibly closed by the remote host". The workaround of reducing the upload chunk size (I checked with 1248080 bytes instead of the current 5248080 bytes) works reasonably well; I haven't encountered any issues so far after changing it, but it slows down the upload. Do you have any objections to changing the limit?
  • The second error occurs when finalizing the upload using PUT /v2/<name>/blobs/uploads/<uuid>?digest=<digest>.
    The returned error is "None of the provided layer digests match the calculated layer digest" (code: BLOB_UPLOAD_INVALID).

I noticed that the non-chunked upload works as expected when submitting the digest of the layer's uncompressed content. The chunked upload fails with the error above. Should we submit anything else in this case?
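On the digest question: per the OCI distribution flow, the digest submitted on the final PUT should be the hash of the concatenated chunk bytes, identical to hashing the whole blob at once. A quick sketch of computing it incrementally (illustrative Python, hypothetical helper):

```python
import hashlib

def digest_across_chunks(chunks):
    """Feed each uploaded chunk into one running sha256 and return the
    digest string for the final PUT's ?digest=<digest> query parameter.

    Hashing chunk-by-chunk must yield the same value as hashing the whole
    blob in one pass; if it doesn't, the chunk bytes sent differ from the
    blob the digest was computed over.
    """
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()
```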

Thanks


normj commented on June 13, 2024

When I was talking to the ECR folks before about this feature, they mentioned this Docker client that they help work with ECR: https://github.com/regclient/regclient/blob/90e8aca4e5fe0db2f57dfa52f0c9ec9b2e7d9118/scheme/reg/blob.go

One thing I see they have that I don't think the .NET implementation has is a retry at the individual PATCH level. Has the team thought about putting that into the registry push feature? Totally speculating here, but I'm wondering if we are hitting an eventual-consistency issue.


baronfel commented on June 13, 2024

That's very interesting - we have retries at a lower level (in an HttpMessageHandler, I believe), but the really interesting thing I see here is that they have an 'adaptive' chunk size. That's something I think we need, since our chunk sizes have been very fragile.
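An adaptive chunk-size policy in that spirit could be as simple as the following (hypothetical sketch, not code from regclient or the SDK; the growth/shrink factors and bounds are illustrative assumptions):

```python
def next_chunk_size(current: int, succeeded: bool,
                    minimum: int = 64 * 1024,
                    maximum: int = 16 * 1024 * 1024) -> int:
    """Halve the chunk size after a failed PATCH, double it after a success.

    Shrink quickly on failure (some registries reject large chunks), grow on
    success to recover throughput, and stay within [minimum, maximum].
    """
    if succeeded:
        return min(current * 2, maximum)
    return max(current // 2, minimum)
```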


baronfel commented on June 13, 2024

That's a great idea. After digging into that code a bit more I see that they also use a GET on the blob upload endpoint to get the status of the upload (via the Range header there) in case some data needs to be re-sent. We should try to do the same.


baronfel commented on June 13, 2024

@vlada-shubina I've made some progress on this based on the data @normj sent: https://github.com/dotnet/sdk/compare/release/7.0.4xx...baronfel:sdk:containers-chunk-retry?expand=1

So far the main changes are

  • retries on chunk uploads specifically, and
  • using the Range information from server requests to determine where the next chunk should start from

Both are inspired by the linked regclient code. This worked well for GitHub, GCP, and Azure chunked uploads. For AWS uploads, each PATCH for a chunk - say the 0-63999 chunk - would return successfully, but the Range information reported by ECR would remain 0-63999 across every iteration of chunks. We expect this number to climb as chunks are committed, and so we loop (forever?). I didn't see how the regclient codebase deals with this.
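The loop being described - advance from the server-reported Range, but bail out if the registry stops committing chunks - can be sketched as (illustrative Python, hypothetical helper, not the SDK's code):

```python
def advance_upload(range_end: int, last_range_end: int, blob_length: int):
    """Decide the next chunk's start offset from the server-reported Range end.

    Normally the committed end climbs after each PATCH, and the next chunk
    starts at end + 1. If the server keeps reporting the same end (as ECR
    did, repeating 0-63999), raise instead of looping forever. Pass
    last_range_end = -1 before the first chunk. Returns None once the whole
    blob is committed and the final PUT can be sent.
    """
    if range_end + 1 >= blob_length:
        return None  # everything committed; ready for the final PUT
    if range_end <= last_range_end:
        raise RuntimeError("registry did not commit the last chunk; aborting to avoid an infinite loop")
    return range_end + 1
```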


baronfel commented on June 13, 2024

Found a bit more information today:

  • When we start an upload, registries can potentially tell us what their chunk sizes are, so we listen to them now. AWS is one of these, but registries sometimes lie and report 0-0 ranges, so we handle that.
  • I added the behavior where we try a full upload and, if that fails, fall back to a chunked upload: dotnet/sdk@release/7.0.4xx...baronfel:sdk:containers-chunk-retry?expand=1#diff-b05a3261b9. That's kinda cool.
  • I computed the digest inline with the chunks, in case it was somehow different from the other digests we had - it wasn't :(
  • I added more logging that I'll back out at some point - this was a little helpful.

I still don't have a reason for the PUT failing on AWS uploads, but I was able to rule out digest mismatches at least.


baronfel commented on June 13, 2024

@normj I tried that as well and didn't have much success - the case you're describing (a single PATCH with a 10MB upload) is the non-chunked upload, which is what we do when the default chunk size (or the chunk size reported by the registry) is greater than the size of the layer. This is the case we knew worked before with AWS, the one-and-done method. The chunked method still fails for me with this proposed change - I forced chunking by setting SDK_CONTAINER_REGISTRY_CHUNKED_UPLOAD: true and SDK_CONTAINER_REGISTRY_CHUNKED_UPLOAD_SIZE_BYTES: 1000 (or something equally small), and you can see the results in this job.

I do think you have a point about the auth cache, though - the intent of that cache is to track authentication by host + repository, because each repository can have different credentials (the config JSON might have separate entries for ghcr.io and ghcr.io/baronfel/sdk-container-demo, for example). In addition, we should be caching by authentication method as well (Bearer, Basic, etc.), and we're not doing that. So some work should be done here overall (and TBH I'm not convinced that the current catch-all authentication handler is the correct place for this, as it has no knowledge of registry and repository).


baronfel commented on June 13, 2024

I finally figured out how to get logs out of my AWS ECR from CloudTrail, and it was embarrassingly easy to do:

"errorCode": "LayerPartTooSmallException",
"errorMessage": "All layer parts except the last part should be at least 5MB",

So I made a change to

a) add a multi-MB lorem ipsum file to the app to bloat its size so that we're over the ECR-required minimum part size, and
b) re-introduce the ECR-specific 5MB check that I'd blown away.

This made the chunked uploads work! Here's some debug logging showing the best of all worlds: an initial full-layer push, reacting to errors by moving to chunking, and finally chunking working:

Started upload session for sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2 to https://***.dkr.ecr.us-east-1.amazonaws.com:443/v2/sdk-container-demo/blobs/uploads/a3d7e2d0-7344-4962-9117-3498eccef9c6 with chunk size '5242880'
  Chunk size was greater than content length of 9340827, uploading whole blob.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
Encountered a SocketException with message "Broken pipe". Pausing before retry.
  Errored while uploading whole blob: CONTAINER1006: Too many retries, stopping..
  Retrying with chunked upload.
  Processing chunk because (0 < 9340827)
  Processing chunk because (5242880 < 9340827)
  Uploaded content for sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2
  Computed digest for 'sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2' was 'sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2'
  Finalized upload session for sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2
  Finished uploading layer sha256:f618f4d784e8b2c2851cafc0857d673746a8b1745807b7f42e8e54e329fbd9c2 to ***.dkr.ecr.us-east-1.amazonaws.com
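The LayerPartTooSmallException constraint ("All layer parts except the last part should be at least 5MB") amounts to clamping the chunk size to a 5MB floor when splitting the blob. A sketch of that check (illustrative Python; `split_for_ecr` is a hypothetical helper, not the SDK's re-introduced code):

```python
MIN_ECR_PART_SIZE = 5 * 1024 * 1024  # ECR: all parts except the last must be >= 5MB

def split_for_ecr(blob_length: int, chunk_size: int):
    """Yield inclusive (start, end) byte ranges that satisfy ECR's minimum
    part size by clamping the effective chunk size up to the 5MB floor.

    Only the final part is allowed to be smaller than 5MB.
    """
    size = max(chunk_size, MIN_ECR_PART_SIZE)
    ranges = []
    start = 0
    while start < blob_length:
        end = min(start + size, blob_length) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```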

I'm a little irked that it was so easy to find logs in CloudTrail 😓 I need to figure out how/why I regressed Google Cloud and Quay.io support (@tmds - any idea who we can talk to for more detailed logs at Quay?), then I'll work on cleaning this all up and documenting it.


normj commented on June 13, 2024

@baronfel Awesome! You are now the expert of AWS.

I didn't realize the refactoring took out the 5MB part size. If you add the hack I did for the authentication, do the SocketExceptions go away?


baronfel commented on June 13, 2024

@normj I'll check that next - I focused first on making quay and GCP work.

I have greened those up by changing the upload algorithm a bit - before, we defaulted to chunked mode if the chunk size detected from the registry was smaller than the blob's length. Quay and GCP don't actually give us a signal here one way or the other, so instead I changed it to try the 'full' upload if no signal from the registry is detected, or if a signal is detected and that length is greater than the blob. You can see this in action (heh) here: https://github.com/baronfel/sdk-container-demo/actions/runs/5164957782/jobs/9304177385
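That decision rule can be sketched as a small predicate (illustrative Python; `should_try_full_upload` is a hypothetical helper named for this sketch, not the SDK's code):

```python
def should_try_full_upload(blob_length: int, registry_chunk_hint) -> bool:
    """Decide whether to attempt the one-shot ("full") blob upload first.

    Try the full upload when the registry gives no chunk-size signal (as
    observed with Quay and GCP) or when its advertised chunk size is at
    least the blob's length; otherwise go straight to chunking. A failed
    full upload would still fall back to chunked mode.
    """
    if registry_chunk_hint is None or registry_chunk_hint <= 0:
        return True  # no signal (or a bogus 0-0 range): prefer one-and-done
    return registry_chunk_hint >= blob_length
```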


tmds commented on June 13, 2024

I need to figure out how/why I regressed Google Cloud and Quay.io support (@tmds - any idea who we can talk to for more detailed logs at Quay?)

I've asked the question and I'll get back to you. I've also asked if there's a container image that can be used to run it locally.

If you log some specific issues for quay.io, you can cc me on them. I may take a look as well (if time permits).


vlada-shubina commented on June 13, 2024

This was fixed by dotnet/sdk#33500, mainly by uploading layers atomically. There is still an issue with chunked upload for AWS; I will create a separate issue for that.

