Code Monkey home page Code Monkey logo

Comments (10)

akuttig-hs avatar akuttig-hs commented on August 10, 2024

I believe we are experiencing a similar problems every since 1.25.0 was released last week. We are accessing blobs directly and getting random 504 errors.

The code can be as simple as:
storage_client = storage.Client()
bucket = storage_client.get_bucket("some-bucket")

Logs show errors like this:

google.api_core.exceptions.GatewayTimeout: 504 GET https://storage.googleapis.com/storage/v1/b/some-bucket/?projection=noAcl:

from python-storage.

busunkim96 avatar busunkim96 commented on August 10, 2024

@mcsimps2 Did you begin to see this behavior recently (with a particular release) or has it been ongoing?

Edit: Assigning priority and type to keep the bots happy, please change to something more appropriate once you are able to look at this in more detail.

from python-storage.

mcsimps2 avatar mcsimps2 commented on August 10, 2024

I have had hanging threads on both google-cloud-storage == 1.24.1 and 1.25.0. However, this problem has only manifested recently (the first time I ran into it was this earlier this week). It first occurred on 1.24.1, so I upgraded to 1.25.0. The Google servers have usually been responsive enough that I've never run into a timeout issue at all. Thus, I believe these types of loopholes have existed in the code over several versions, yet perhaps something recent occurred that decreased the timeliness/responsiveness of the servers, thereby causing these timeout=None problems to manifest themselves; however, this is only speculation.

In the meantime, if there are other users experiencing a similar problem (@akuttig-hs), especially with threads hanging indefinitely from the timeout=None issue, I've explored three solutions that each seem to work as a temporary workaround:

  • calling the function in a new thread and killing that thread after a certain amount of time passes
  • monkeypatching AuthorizedSession to always have a timeout
  • monkeypatching the requests library that google-cloud-python uses to always have a timeout:
original_session_request = requests.Session.request
def monkeypatched_session_request(self, *args, **kwargs):
      if not kwargs.get("timeout"):
            kwargs["timeout"] = (DEFAULT_CONNECT_TIMEOUT, DEFAULT_READ_TIMEOUT)
      return original_session_request(self, *args, **kwargs)
requests.Session.request = monkeypatched_session_request

The monkeypatch was the first thing I tried, and it definitely helped alleviate the problem since the proper Timeout exception is being thrown by the requests library, whereas before, it would never throw an exception and just wait forever for a response. That said, in the end, it's not a real solution to the problem.

from python-storage.

plamut avatar plamut commented on August 10, 2024

Does the issue have to do with the resumable-media dependency by any chance? The following code snippet could be relevant:

https://github.com/googleapis/google-resumable-media-python/blob/1edb974175d16c9f542fe84dd6bbfa2d70115d48/google/resumable_media/requests/_helpers.py#L130-L131

The snippet applies the default timeout argument, if the later is missing, but not if it is present and set to None. A recent change to storage added optional timeout arguments (to match the method signature changes in core), which might have broken the resumable-media's reliance on the non-existence of the argument (I'm not 100% sure how these timeouts are propagated around, though).

If the above is indeed the cause, it would affect versions 1.24.0+, i.e. those released after Dec 26th 2019, which would be consistent with the reports on indefinite hangings that started occurring recently.

Since we more or less always want to have some sort of a timeout, the above logic in resumable media could be modified to apply the default timeout on None values as well.


Edit:

The code path from the issue description:

...
This then calls JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout to None), which makes the ultimate call to the requests.Session object with a None timeout.

does not sound like going through resumable-media, meaning that patching the latter would not work in that particular case, and an explicit timeout would have to be brought in somehow.

FWIW, BigQuery had similar problems (example), because there used to be no way of specifying a timeout to use in the transport layer.


Edit 2:
As a quick workaround, we could add a non-None default timeout to AuthorizedSession.request(), as this is much faster than adding a configurable timeout to every code path.


Edit 3:
Created a pull request in the google-auth dependency that adds a default timeout of 120 second (the same default timeout that is already used when refreshing credentials).

from python-storage.

busunkim96 avatar busunkim96 commented on August 10, 2024

@mcsimps2 @akuttig-hs Would you mind trying an upgrade to google-auth==1.11.0 to see if that resolves the issue?

from python-storage.

mcsimps2 avatar mcsimps2 commented on August 10, 2024

Thanks for the quick response!

@plamut I don't believe all requests still have a non-null Timeout with the recent PR you submitted that patched AuthorizedSession.request. The flow of the code is JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout now to _DEFAULT_TIMEOUT). Correct me if I'm wrong, but I believe the default kwarg is only used if the specified kwarg is missing in the call to AuthorizedSession.request. However, JSONConnection._do_request is passing timeout=None, which overrides the default kwarg value of _DEFAULT_TIMEOUT in AuthorizedSession.request. The _do_request method looks like

    def _do_request(
        self, method, url, headers, data, target_object, timeout=None
    ):
        return self.http.request(
            url=url, method=method, headers=headers, data=data, timeout=timeout
        )

So in the end, timeout=None is being passed to AuthorizedSession.request, which will override the default kwarg value of _DEFAULT_TIMEOUT, the end result being a null timeout again.

A quick code just for my sanity to make sure I understand kwargs correctly

def a(timeout=None):  #JSONConnection._do_request
      b(timeout=timeout)

def b(timeout=_DEFAULT_TIMEOUT):  # AuthorizedSession.request
     print(f"The timeout is {timeout}")

a() # prints "The timeout is None"

So perhaps, the code should be like this:

def a(timeout=None):  #JSONConnection._do_request
      b(timeout=timeout)

def b(timeout=None):  # AuthorizedSession.request
     timeout = timeout or _DEFAULT_TIMEOUT
     print(f"The timeout is {timeout}")

a() # prints "The timeout is 120"

from python-storage.

plamut avatar plamut commented on August 10, 2024

I don't believe all requests still have a non-null Timeout with the recent PR you submitted that patched AuthorizedSession.request.

That is correct. There was actually a PR commit in google-auth that would override any None value, but it was reverted during the review due to the concern that this could break some other use case in another dependent library.

The ruling was that if a library passes a None timeout, it probably has a reason for it. I'd say the next step is thus to change the default timeout values in storage lib to non-None - at least until all code paths can be given an explicit timeout by the library users.

from python-storage.

mcsimps2 avatar mcsimps2 commented on August 10, 2024

Ah, you’re perfectly right. I should’ve looked at the PR more closely, my apologies!

from python-storage.

maykmayx avatar maykmayx commented on August 10, 2024

Hey,
We are experiencing the same problems as @akuttig-hs described - simple requests for bucket.get_blob(<blob_name>) return 504 Gateway quite frequently but randomly.
I came across this document - Cloud Storage Retry Params
If we determine timeouts (and retries, for possible failures) through this class, will that be helpful for our case?
I'm about to try and test this and will update with news, but wanted your opinion
@plamut

from python-storage.

plamut avatar plamut commented on August 10, 2024

@maykmayx I can speak only generally, because I am not too familiar with this particular library (storage).

The following code drives the retries from inside the Retry objects:
https://github.com/googleapis/google-cloud-python/blob/master/api_core/google/api_core/retry.py#L182-L190

If the target callable blocks indefinitely, a retry will not have a chance to act, which is the issue that is being dealt with here (#10199 will add customizable timeouts to storage).

If the target callable does return, but with an error, predicate() is run to determine whether to retry the function or not. Different libraries define different predicated, but those can be customized, if needed:

from google.api_core.retry import Retry

retry = Retry()
retry = retry.with_predicate(lambda exc: "foobar" in str(exc))

From what I can see from the linked sample, RetryParams configure how to retry, but not what to retry. For the latter, a predicate needs to be configured, if you could obtain a reference to the Retry instance? And make it repeat any 504 errors, if this does not already?

from python-storage.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.