
Comments (18)

rmbeaty avatar rmbeaty commented on June 1, 2024 1

I've updated host.json to log more information with "StorageLogLevelLimit": "Trace". Please have a look. I would like to retain this hub, rather than start a new one, because it has prod data in it.

from durabletask-netherite.

sebastianburckhardt avatar sebastianburckhardt commented on June 1, 2024

Logs that say "Replayed internal UpdateEvent" are emitted when a partition is recovering. It is normal for those events to be things that happened in the past (the log being replayed is an "event source" that records all partition state updates and is replayed to recover the latest partition state). However, partitions are also supposed to checkpoint relatively often (about once a minute), at which point the log is trimmed.

So, if you keep seeing those old events over and over, it suggests that something is amiss with the partition; for example, it may be stuck in an infinite recovery cycle. If this is running on a consumption or premium plan, I could take a look at our internal telemetry if you provide me with the application name.


rmbeaty avatar rmbeaty commented on June 1, 2024


sebastianburckhardt avatar sebastianburckhardt commented on June 1, 2024

After investigating the telemetry, I can confirm that the problem (as of right now) is that partition 5 is caught in an infinite recovery cycle. Every time, immediately after recovering, it throws an exception:

Azure.RequestFailedException: The page range specified is invalid.
  at Azure.Storage.Blobs.PageBlobRestClient.UploadPagesAsync(Int64 contentLength, Stream body, Byte[] transactionalContentMD5, Byte[] transactionalContentCrc64, Nullable`1 timeout, String range, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 ifSequenceNumberLessThanOrEqualTo, Nullable`1 ifSequenceNumberLessThan, Nullable`1 ifSequenceNumberEqualTo, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesInternal(Stream content, Int64 offset, UploadTransferValidationOptions transferValidationOverride, PageBlobRequestConditions conditions, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesAsync(Stream content, Int64 offset, Byte[] transactionalContentHash, PageBlobRequestConditions conditions, IProgress`1 progressHandler, CancellationToken cancellationToken)
   at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass39_1.<<WritePortionToBlobAsync>b__0>d.MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs:line 454
--- End of stack trace from previous location ---
   at DurableTask.Netherite.Faster.BlobManager.PerformWithRetriesAsync(SemaphoreSlim semaphore, Boolean requireLease, String name, String intent, String data, String target, Int32 expectedLatencyBound, Boolean isCritical, Func`2 operationAsync, Func`1 readETagAsync) in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/StorageOperations.cs:line 68

The cause is unclear, and I cannot tell what the "invalid page range" is. This is likely some malfunction inside FASTER; my guess is that it somehow recovered to an invalid state.

To help us troubleshoot this, it would be helpful if you edited host.json as follows to increase the verbosity of the storage access tracing:

{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "StorageLogLevelLimit": "Trace"
      }
    }
  }
}

Regardless, I will also try to see what happened to partition 5 right before this symptom starts appearing.


rmbeaty avatar rmbeaty commented on June 1, 2024


sebastianburckhardt avatar sebastianburckhardt commented on June 1, 2024

Thanks.

Since this seems to be some kind of data corruption problem, starting with a fresh task hub would probably eliminate the issue, at least temporarily; the problem could come back if the original issue recurs. Based on the telemetry, the original occurrence of this problem appears to be rare: the current corruption of partition 5 happened back in September.


rmbeaty avatar rmbeaty commented on June 1, 2024


sebastianburckhardt avatar sebastianburckhardt commented on June 1, 2024

Thanks for the traces, they were helpful. I have found the cause.

What happens is that the total amount of data written when taking a checkpoint exceeds the size of the page blob (512GB).

Page blobs can be resized dynamically without losing their content, so we should be able to recover the content. I will work on a fix.

Note that even after we resolve this, it may be worth taking some measures to keep the total amount of data stored in a task hub in check, since Netherite is (unlike Azure Storage) not equipped to handle "unlimited" amounts of data. At some point, performance starts to suffer.

Here are some things to consider:

  1. It is usually a good idea to purge completed orchestrations after a while (e.g. see the discussion in #229).
  2. If you have a significant amount of data you want to store persistently for a long time, I would consider writing such data to external storage suitable for long-term storage.
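
A minimal sketch of the retention rule behind item 1, for illustration only. The `InstanceRecord` type, the status strings, and the 30-day window are hypothetical and are not Netherite or Durable Functions types. The sketch only selects purge candidates; the deletion itself would go through whatever purge API your Durable Functions client exposes, which is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical record type for illustration; not a Netherite or Durable Functions type.
@dataclass
class InstanceRecord:
    instance_id: str
    status: str            # e.g. "Completed", "Failed", "Terminated", "Running"
    last_updated: datetime

# Assumption: only instances in a terminal state are safe to purge.
TERMINAL_STATUSES = {"Completed", "Failed", "Terminated"}

def select_purgeable(records, now, retention=timedelta(days=30)):
    """Return ids of finished instances last updated before the retention cutoff."""
    cutoff = now - retention
    return [r.instance_id for r in records
            if r.status in TERMINAL_STATUSES and r.last_updated < cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    InstanceRecord("a", "Completed", now - timedelta(days=45)),  # old and finished
    InstanceRecord("b", "Running", now - timedelta(days=45)),    # old but still running
    InstanceRecord("c", "Completed", now - timedelta(days=2)),   # finished but recent
]
print(select_purgeable(records, now))
```

Running a selection like this on a timer keeps the amount of retained history bounded, which is the point of the suggestion above.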


rmbeaty avatar rmbeaty commented on June 1, 2024


sebastianburckhardt avatar sebastianburckhardt commented on June 1, 2024

I have implemented a fix (#329) that should unblock your partition once we can get it to you.

> can you tell if there is a certain type of data taking most of the space

There is no quick way to see this type of information right now (maybe something we should consider monitoring).
Are you using entities and/or orchestrations?

> The objects we are storing are probably not more than 255k or so

Yep, that does not explain the 512GB. Based on the load information (you can also see this information in the partition table) there are only about 4.6k instances stored on this partition, so if each is 255k, then that is still less than 1.2GB.
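
The estimate above can be checked with a few lines of arithmetic. This is a sketch that assumes "255k" means 255 KB per instance and that GB means 10^9 bytes; the constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in this thread.
# Assumptions: "255k" is 255 KB per instance, and 1 GB is 10**9 bytes.
INSTANCES = 4_600               # "about 4.6k instances" on the partition
BYTES_PER_INSTANCE = 255_000    # "not more than 255k or so" per object
PAGE_BLOB_LIMIT = 512 * 10**9   # the 512GB page blob size mentioned earlier

def estimated_payload_bytes(instances: int, bytes_per_instance: int) -> int:
    """Upper bound on instance data if every instance is at the maximum size."""
    return instances * bytes_per_instance

total = estimated_payload_bytes(INSTANCES, BYTES_PER_INSTANCE)
print(f"estimated payload: {total / 10**9:.2f} GB")
print(f"share of the 512GB limit: {total / PAGE_BLOB_LIMIT:.2%}")
```

Even with every instance at the maximum size, the instance data comes to a fraction of a percent of the blob limit, so the bulk of the space must come from something else.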

> can you tell if there is a certain type of data taking most of the space

There is no logging being stored inside the task hub. Sometimes entities accumulate a lot of space in their metadata. Also, frequent updates tend to generate more data since multiple versions are being stored.


rmbeaty avatar rmbeaty commented on June 1, 2024


davidmrdavid avatar davidmrdavid commented on June 1, 2024

@rmbeaty:

We just published a Netherite private package with @sebastianburckhardt's fix here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.1-blob-growth-fix-1

The version is 1.4.1-blob-growth-fix-1. To be able to install it, you need to add the following key to your nuget.config:

<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />

Please try it out, let us know once you do so, and whether your application appears to recover after some time. Thanks!


rmbeaty avatar rmbeaty commented on June 1, 2024


davidmrdavid avatar davidmrdavid commented on June 1, 2024

@rmbeaty: the source branch can be seen in this PR: #329

I'll investigate those 1.4.x upgrade issues. I seem to recall similar reports.


davidmrdavid avatar davidmrdavid commented on June 1, 2024

@rmbeaty: the known upgrade issues to 1.4.0 were thought to be addressed in 1.4.1. Just to confirm, when you say "I saw the same issue with trying to upgrade to 1.4", do you mean that you saw these failures in "1.4.0" but not in "1.4.1", or do you mean that you see these "silent failures" in both versions? Note that I'm asking about the official releases, not the private package I sent you a few hours ago.


sebastianburckhardt avatar sebastianburckhardt commented on June 1, 2024

Just to make sure, I did a quick test and I did not see any failures to open the task hub. So, I would need to know more details about the failure to debug it.

(for the test, I created a task hub with 1.3.5 and then opened it with the version in PR #329)


rmbeaty avatar rmbeaty commented on June 1, 2024


davidmrdavid avatar davidmrdavid commented on June 1, 2024

@rmbeaty: Any updates here? We should be able to help if you can provide us with more details about the kind of failure you're seeing.

Since you have an internal support case with us, you can work with your support point of contact to schedule a meeting with us to help you debug, for example. I'll ask the support person on our end to see if we can set that up.
