
Comments (25)

jandppw avatar jandppw commented on June 2, 2024 1

@mcollina You are absolutely right that this might very well be an AWS issue, or an interop issue (as noted in the report). We just thought that sharing our observations publicly might help other people who encounter the same issue (as in “no, I’m not crazy”).

AWS support will be made aware of the issue.

Your hint about a Lambda resuming seems plausible.

from undici.

mweberxyz avatar mweberxyz commented on June 2, 2024 1

Out of curiosity, I attempted to replicate this in us-east-1 and couldn't.

I set up a CloudFront distribution with a Lambda named PrivateLambda (invoked over HTTPS via a function URL) as the origin. Then I set up a second Lambda named PublicLambda (also with a function URL) that calls the CloudFront distribution 25 times in a loop.

I called PublicLambda repeatedly using k6, ramping concurrency up and down to suss out cold/warm start issues. Every request succeeded:

% k6 run test.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: test.js
        output: -

     scenarios: (100.00%) 1 scenario, 50 max VUs, 35m30s max duration (incl. graceful stop):
              * default: Up to 50 looping VUs for 35m0s over 7 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ is status 200
     ✓ is body 25

     checks.........................: 100.00% ✓ 99290     ✗ 0
     data_received..................: 17 MB   8.1 kB/s
     data_sent......................: 7.4 MB  3.5 kB/s
     http_req_blocked...............: avg=149.91µs min=1µs      med=7µs      max=170.28ms p(90)=9µs      p(95)=9µs
     http_req_connecting............: avg=59.64µs  min=0s       med=0s       max=60.26ms  p(90)=0s       p(95)=0s
     http_req_duration..............: avg=668.06ms min=465.52ms med=625.45ms max=5.7s     p(90)=825.18ms p(95)=900.9ms
       { expected_response:true }...: avg=668.06ms min=465.52ms med=625.45ms max=5.7s     p(90)=825.18ms p(95)=900.9ms
     http_req_failed................: 0.00%   ✓ 0         ✗ 49645
     http_req_receiving.............: avg=136.27µs min=16µs     med=117µs    max=204.25ms p(90)=167µs    p(95)=199µs
     http_req_sending...............: avg=36.73µs  min=6µs      med=34µs     max=10.91ms  p(90)=48µs     p(95)=53µs
     http_req_tls_handshaking.......: avg=81.24µs  min=0s       med=0s       max=130.21ms p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=667.89ms min=465.38ms med=625.3ms  max=5.7s     p(90)=824.94ms p(95)=900.73ms
     http_reqs......................: 49645   23.638233/s
     iteration_duration.............: avg=668.46ms min=465.76ms med=625.79ms max=5.76s    p(90)=825.95ms p(95)=901.25ms
     iterations.....................: 49645   23.638233/s
     vus............................: 1       min=0       max=50
     vus_max........................: 50      min=50      max=50

Over 35 minutes this made 49,645 requests to PublicLambda, which means we successfully made 1,241,125 fetch calls.

Hopefully I did my math right and this little experiment only cost me a few dollars 😅

Attached cloudformation template for set-up and k6 test file:
replicate-undici-3133.zip


ronag avatar ronag commented on June 2, 2024

Can you try to use the undici package instead of the bundled version? To see if the problem still occurs in newer versions of undici.


jandppw avatar jandppw commented on June 2, 2024

Because the issues are intermittent and do not occur locally, we cannot really test for them. Realistically, we can only get log information from failed invocations in production, over a number of hours or days. We just deployed a new version that should log the stack trace of the cause of the errors, to get a bit more information.

I understand that rolling out a version that uses the latest version of undici directly would help in isolating the issues. Yet, we believe it would be unwise to expose our users to more experiments than necessary for several days, without addressing the issue they have.

Because this is an annoying production issue (~3,5% of interactions failing), we must focus on stability first. Our plan is to revert to using node-fetch (on Node 20) asap. fetch is still experimental in Node 20, after all, and maybe we jumped the gun.

In the meantime, we will report further information as the errors appear with the additional logging.

We understand the report is not very concrete, and regret we cannot be of any more help. Yet we believe the observations we made are important enough to share here, if only for reference if other people encounter similar issues (e.g., #2964).
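For reference, a minimal sketch of the kind of cause logging described above (the wrapper name is hypothetical; native fetch wraps the underlying network error in a TypeError("fetch failed") with the real error attached as err.cause):

```javascript
// Hypothetical wrapper: log both the TypeError thrown by native
// fetch and the stack of the underlying cause before rethrowing.
async function loggedFetch(url, init) {
  try {
    return await fetch(url, init);
  } catch (err) {
    console.error(err.stack);
    if (err.cause !== undefined) {
      // the low-level error (ETIMEDOUT, UND_ERR_SOCKET, …) lives here
      console.error("cause stack:", err.cause.stack ?? err.cause);
    }
    throw err;
  }
}
```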


ronag avatar ronag commented on June 2, 2024

@jandppw Have you considered simply using undici.request, which is our recommended API? fetch should only be used if you require compatibility with runtimes other than Node.js.


jandppw avatar jandppw commented on June 2, 2024

@ronag No, we haven’t. We weren’t aware that undici.request is the recommended API, nor that fetch should only be used for compatibility with runtimes other than Node.js. Is there any documentation on that?

We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?


ronag avatar ronag commented on June 2, 2024

Is there any documentation on that?

Not sure. Where would you expect that to be documented?

We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?

fetch was added due to popular demand, not due to technical merits. It's never going to be as fast as undici.request. I would stay away from fetch unless you are writing isomorphic code, or if your team is mostly comfortable with fetch and performance is not critical.


jandppw avatar jandppw commented on June 2, 2024

But that would require adding undici as a direct dependency in the project, no? In that situation, undici is a contender next to node-fetch, axios, got, superagent, …, no?

If that is the case, in the current project, at this time, we’d rather revert to proven safe ground (i.e., node-fetch).

Before the above occurred, we were not familiar with undici. Until your comment, to us undici was just the thing used inside Node for fetch. Doing some web searches, blogs from 2024 still refer to the usual suspects and do not mention undici.

For future work, no dependency trumps any dependency, unless there are very good reasons (such as working with promises instead of consuming events ourselves). In that space, now that we have learned about undici, we will certainly take it into account.

But, to answer the question “Where would you expect that to be documented?”: well, first of all in blogs that I hit with a Google search about doing API calls inside a Node application. This would point me to undici @ npm. None of this is mentioned there. Next, we would have to choose between the contenders. For us, this is probably a bit late. We've been using axios and node-fetch for years now, and would need good reasons to change away from things that are proven in usage.

Now, your mention that we should stay away from fetch in a Node-only application, such as a Lambda, is of a different order. That should be clear first, because no dependency trumps any dependency, unless there are very good reasons. I have not encountered that point before. So documenting that, and getting the message out, should be the first step. If this is the official Node point of view, maybe the Node fetch documentation should mention it. But this is probably not the forum for that :-).


mcollina avatar mcollina commented on June 2, 2024

I would really love that the AWS Lambda team helped out on this one. We lack visibility on the Node.js distribution they are running. Essentially they are likely doing certain things to the Node.js runtime that are causing this.

This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.

Given you are using Node.js for free but paying for AWS Lambda, I recommend that you contact AWS support.


In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance.

I suspect it's due to keep-alive connections. Lambda suspends the process and all its state after a response is sent. Undici keeps connections alive, but when Lambda resumes the process we have no way of knowing whether the connections need to be reset. There is no event for us to hook into on suspend or resume.

Try adding a connection: 'close' header to your requests. Alternatively, create a new undici.Client for each request, pass it as the dispatcher property to undici.fetch(), and then .close() the client.
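A sketch of the two workarounds suggested here, assuming Node 20's native fetch; the function names are illustrative, not from the issue:

```javascript
// Workaround 1: ask the server to tear the connection down after
// each response, so no keep-alive socket can be left in a stale
// state across a Lambda suspend/resume.
async function fetchNoKeepAlive(url, init = {}) {
  return fetch(url, {
    ...init,
    headers: { ...init.headers, connection: "close" },
  });
}

// Workaround 2 (requires `npm install undici`): one short-lived
// Client per request, passed as the dispatcher, closed afterwards.
// const { Client, fetch: undiciFetch } = require("undici");
// async function fetchFreshClient(url) {
//   const client = new Client(new URL(url).origin);
//   try {
//     const res = await undiciFetch(url, { dispatcher: client });
//     return { status: res.status, body: await res.text() };
//   } finally {
//     await client.close();
//   }
// }
```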


jandppw avatar jandppw commented on June 2, 2024

@mcollina Actually, it turns out: no. There seems to be no easy, zero-cost way to make AWS aware of the issue for this client.

Our main focus is getting stability back to our users.

I share your frustration about AWS not being involved, but there seems to be no easy way for us to get AWS on the case without paying extra. The only way to get AWS involved seems to be public side channels, such as social media. This public report might be a first effort.

Any response from AWS now (if I were in their place anyway) would be that native fetch on Node 20 is still experimental, so go away.

Apart from all that, there are no ill feelings here regarding undici or Node. In any case, by the time AWS offers Node 22, there will have to be some resolution, since native fetch is no longer experimental as of Node 21.

In the meantime, this report might help other people realize that it is probably not a good idea to use native fetch in AWS Lambdas for the time being.


metcoder95 avatar metcoder95 commented on June 2, 2024

The issues shared seem to follow a pattern of constant connections and abrupt disconnections; I believe this is a similar issue to what @mcollina describes. I'd suggest giving his suggestion a try and starting from there.

The Lambda runtime might be doing constant cleanups while resuming, causing intermittent connection issues.


jandppw avatar jandppw commented on June 2, 2024

Yesterday we deployed a version that logs the stack trace of the cause, while working on a revert to node-fetch. So far, two errors of the ETIMEDOUT variant have been logged, both with the same stack trace:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at async getImplementation (/var/task/lib/dal/get.js:75:20)
{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } }
Error: write ETIMEDOUT
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
    at handleWriteReq (node:internal/stream_base_commons:58:21)
    at writeGeneric (node:internal/stream_base_commons:149:15)
    at Socket._writeGeneric (node:net:952:11)
    at Socket._write (node:net:964:8)
    at writeOrBuffer (node:internal/streams/writable:564:12)
    at _write (node:internal/streams/writable:493:10)
    at Writable.write (node:internal/streams/writable:502:10)
    at write (node:internal/deps/undici/undici:9269:18)
    at _resume (node:internal/deps/undici/undici:9172:33)
    at resume (node:internal/deps/undici/undici:9082:7)
    at [dispatch] (node:internal/deps/undici/undici:8327:11)
    at Client.Intercept (node:internal/deps/undici/undici:8000:20)
    at Client.dispatch (node:internal/deps/undici/undici:6606:44)
    at [dispatch] (node:internal/deps/undici/undici:6837:32)
    at Pool.dispatch (node:internal/deps/undici/undici:6606:44)
    at [dispatch] (node:internal/deps/undici/undici:9957:27)
    at Agent.Intercept (node:internal/deps/undici/undici:8000:20)
    at Agent.dispatch (node:internal/deps/undici/undici:6606:44)
    at node:internal/deps/undici/undici:10929:55
    at new Promise (<anonymous>)
    at dispatch (node:internal/deps/undici/undici:10929:16)
    at httpNetworkFetch (node:internal/deps/undici/undici:10824:73)
    at httpNetworkOrCacheFetch (node:internal/deps/undici/undici:10712:39)
    at httpFetch (node:internal/deps/undici/undici:10545:43)
    at schemeFetch (node:internal/deps/undici/undici:10469:18)
    at node:internal/deps/undici/undici:10342:26
    at mainFetch (node:internal/deps/undici/undici:10361:11)
    at fetching (node:internal/deps/undici/undici:10309:7)
    at fetch (node:internal/deps/undici/undici:10173:20)
    at fetch (node:internal/deps/undici/undici:12344:10)
    at fetch (node:internal/bootstrap/web/exposed-window-or-worker:79:16)
    at getImplementation (/var/task/lib/dal/get.js:75:26)
    …

These again were the first fetch calls during the invocation. The time between the fetch call and the logging of the error was <4ms.


ajinkyashukla29dj avatar ajinkyashukla29dj commented on June 2, 2024

We are using the amazon-cognito-identity-js package inside an AWS Lambda function to talk to AWS Cognito. The Lambda function makes multiple Cognito calls, and we see this error randomly for some of them.

{
    "cause": {
        "name": "ConnectTimeoutError",
        "code": "UND_ERR_CONNECT_TIMEOUT",
        "message": "Connect Timeout Error"
    }
}
Error: Network error
    at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:117:15
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'NetworkError'
}

The Lambda function is configured to use the Node.js 18.x runtime. The earliest occurrence of this error seems to be around six months ago, but we need to check the AWS CloudWatch logs to see whether it occurred earlier too.


mweberxyz avatar mweberxyz commented on June 2, 2024

@ajinkyashukla29dj amazon-cognito-identity-js uses isomorphic-unfetch, which uses the globally defined fetch if available. Can you disable experimental fetch by setting the NODE_OPTIONS [1] environment variable in your Lambdas to --no-experimental-fetch [2] and see if your issue persists?

[1] https://docs.aws.amazon.com/lambda/latest/dg/runtimes-modify.html
[2] https://nodejs.org/api/cli.html#--no-experimental-fetch
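For example, the environment variable could be set via the AWS CLI along these lines (the function name is a placeholder; this is a sketch, not tested against a real account):

```shell
# Disable Node's experimental global fetch for one Lambda function.
aws lambda update-function-configuration \
  --function-name my-function \
  --environment "Variables={NODE_OPTIONS=--no-experimental-fetch}"
```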


jandppw avatar jandppw commented on June 2, 2024

One more error came in, this time of the UND_ERR_SOCKET variant:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at async getImplementation (/var/task/lib/dal/get.js:75:20)
{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: '169.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 1185,
      bytesRead: 69979
    }
  }
}

cause stack:

SocketError: other side closed
    at TLSSocket.onSocketEnd (node:internal/deps/undici/undici:8903:26)
    at TLSSocket.emit (node:events:530:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

This again was the first fetch call during the invocation. The time between the fetch call and the logging of the error was <3ms.


mweberxyz avatar mweberxyz commented on June 2, 2024

@jandppw I think your best course of action is to swap out native fetch for node-fetch, keeping everything else the same. If the errors go away, it might be an undici issue. If the issues persist, the problem is either with the remote server or at a lower level than undici.

btw: not an undici code contributor, just curious at this point.


uNetworkingAB avatar uNetworkingAB commented on June 2, 2024

I use Node.js 21.7.2 for a simple scraper project that uses await fetch inside try/catch, but I get really bad stability: internal errors of undici somehow cause ENETUNREACH, my try/catch never triggers, and Node.js just dies. It's super unstable, and I need another implementation to get this working.

Edit: I simply ported it to Deno and now it is stable.


jandppw avatar jandppw commented on June 2, 2024

3 more errors of the ETIMEDOUT variant came in this morning, with the same stack trace.

A new version is now rolled out on production that no longer uses native fetch, so this is the last observation we can share.


ajinkyashukla29dj avatar ajinkyashukla29dj commented on June 2, 2024

@mweberxyz Thank you for your input. I don't see the ConnectTimeoutError after setting the NODE_OPTIONS env variable to --no-experimental-fetch. However, I am now seeing the following response even though the username and password are correct:

{
    "size": 0,
    "timeout": 0
}
ERROR	NotAuthorizedException: Incorrect username or password.
    at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:136:19
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'NotAuthorizedException'
}


mweberxyz avatar mweberxyz commented on June 2, 2024

@ajinkyashukla29dj Sorry, I can't help with that error; it's coming from the amazon-cognito-identity-js library. It looks like the HTTP request succeeded, though!

Can you provide some more details about your Lambda so I can try to replicate the UND_ERR_CONNECT_TIMEOUT you encountered when using native fetch? For example:

  • How often did the UND_ERR_CONNECT_TIMEOUT occur? On every request or just intermittently?
  • What AWS region?
  • Is the Lambda running in a VPC? If so, in a public or private subnet? And is there any PrivateLink to AWS services defined in the VPC?
  • How is the Lambda invoked (API Gateway, manual test from the console, etc.)?


uNetworkingAB avatar uNetworkingAB commented on June 2, 2024

For me this is not AWS related; it happens on macOS "all the time" (very often) in a very simple scraper project that makes about 500 fetch calls. To confirm, I ported it to another implementation of fetch, and it worked there without stability issues.

The issue I have is that ETIMEDOUT and similar errors are not propagated to my try/catch; instead, Node.js just crashes somewhere inside undici.

It can be reproduced with await fetch inside try/catch leading to a crashed Node.js process.


mweberxyz avatar mweberxyz commented on June 2, 2024

@uNetworkingAB Can you give a code sample? Knowing which remote servers you are having trouble with would help with replication.


mcollina avatar mcollina commented on June 2, 2024

Please verify if undici v6.14.1 fixes this.


lucamezzalira avatar lucamezzalira commented on June 2, 2024

@jandppw My name is Luca, and I work for AWS as a Serverless Specialist. I'd love to connect with you to review your challenge with the service team. Would you mind pinging me in private so we can arrange a chat?


jandppw avatar jandppw commented on June 2, 2024

@lucamezzalira Sure, Luca. Sending you an email now.

