Hi, I am using Netherite in Azure Functions. At one point I had a

Repeating Timers on no longer existing durable function and instance (microsoft/durabletask-netherite, Issue #312)

Comments (18)

rmbeaty commented on June 1, 2024

I've updated host.json to log more information with "StorageLogLevelLimit": "Trace". Please have a look. I would like to retain this hub rather than start a new one, because it has prod data in it.

from durabletask-netherite.

sebastianburckhardt commented on June 1, 2024

Logs that say "Replayed internal UpdateEvent" are emitted when a partition is recovering. It is normal for those events to be things that happened in the past (the log being replayed is an "event source" that records all partition state updates and is replayed to recover the latest partition state). However, partitions are also supposed to checkpoint relatively often (about once a minute), at which point the log is trimmed.

So, if you keep seeing those old events over and over, it suggests that something is amiss with the partition, for example it may be stuck in an infinite recovery cycle. If this is running on a consumption or premium plan I could take a look at our internal telemetry if you provide me with the application name.
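
As a rough illustration of the recovery behavior described above, the following Python sketch models a partition whose state is rebuilt from a checkpoint plus a replayed event log. It is a toy model, not Netherite's implementation; the `Partition` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy event-sourced partition (hypothetical, for illustration only)."""
    checkpoint: dict = field(default_factory=dict)  # last persisted snapshot
    log: list = field(default_factory=list)         # events recorded since the snapshot
    state: dict = field(default_factory=dict)       # in-memory state

    def _apply(self, event):
        key, value = event
        self.state[key] = value

    def update(self, key, value):
        # Record the event in the log first, then apply it to the in-memory state.
        event = (key, value)
        self.log.append(event)
        self._apply(event)

    def take_checkpoint(self):
        # Persist a snapshot of the state; the log can then be trimmed.
        self.checkpoint = dict(self.state)
        self.log.clear()

    def recover(self):
        # Start from the snapshot and replay every event still in the log.
        self.state = dict(self.checkpoint)
        for event in self.log:
            self._apply(event)
        return len(self.log)  # number of replayed events
```

In this model, if `take_checkpoint` never succeeds, every call to `recover` replays the same old events again, which matches the symptom of seeing the same "Replayed internal UpdateEvent" entries over and over.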

rmbeaty commented on June 1, 2024

That is probably what it is. It is on a consumption plan. The app is called CritterCoinAPI. Please have a look.

Mike Beaty, RedCritter

sebastianburckhardt commented on June 1, 2024

After investigating the telemetry, I can confirm that the problem (as of right now) is that partition 5 is caught in an infinite recovery cycle. Every time, immediately after recovering, it throws an exception:

Azure.RequestFailedException: The page range specified is invalid.
  at Azure.Storage.Blobs.PageBlobRestClient.UploadPagesAsync(Int64 contentLength, Stream body, Byte[] transactionalContentMD5, Byte[] transactionalContentCrc64, Nullable`1 timeout, String range, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 ifSequenceNumberLessThanOrEqualTo, Nullable`1 ifSequenceNumberLessThan, Nullable`1 ifSequenceNumberEqualTo, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesInternal(Stream content, Int64 offset, UploadTransferValidationOptions transferValidationOverride, PageBlobRequestConditions conditions, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesAsync(Stream content, Int64 offset, Byte[] transactionalContentHash, PageBlobRequestConditions conditions, IProgress`1 progressHandler, CancellationToken cancellationToken)
   at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass39_1.<<WritePortionToBlobAsync>b__0>d.MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs:line 454
--- End of stack trace from previous location ---
   at DurableTask.Netherite.Faster.BlobManager.PerformWithRetriesAsync(SemaphoreSlim semaphore, Boolean requireLease, String name, String intent, String data, String target, Int32 expectedLatencyBound, Boolean isCritical, Func`2 operationAsync, Func`1 readETagAsync) in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/StorageOperations.cs:line 68

The cause is unclear, and I cannot tell what the "invalid page range" is. This is likely some malfunction inside FASTER; my guess is that it somehow recovered to an invalid state.

To help us troubleshoot this, it would be helpful if you edit the following lines in host.json, to increase the verbosity of the storage access tracing:

{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "StorageLogLevelLimit": "Trace"
      }
    }
  }
}

Regardless, I will also try to see what happened to partition 5 right before this symptom starts appearing.

rmbeaty commented on June 1, 2024

That sounds good. I am traveling today and tomorrow, but if I can get to this I will let you know. Anything you can do, and any suggestions on how to fix it, would be great.

Mike Beaty, RedCritter

sebastianburckhardt commented on June 1, 2024

Thanks.

Since this seems to be some kind of data corruption problem, starting with a fresh task hub would probably eliminate the issue, at least temporarily; it is possible the problem comes back if the original issue recurs. Based on the telemetry, the original occurrence of this problem appears to be rare: the current corruption of partition 5 happened back in September.

rmbeaty commented on June 1, 2024

I would like to avoid that since a new task hub will not have our data in it. Any other option?

Mike Beaty, RedCritter

sebastianburckhardt commented on June 1, 2024

Thanks for the traces, they were helpful. I have found the cause.

What happens is that the total amount of data written when taking a checkpoint exceeds the size of the page blob (512GB).

Page blobs can be resized dynamically without losing the content, so we should be able to recover the content. I will work on a fix.
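
The size constraint can be sketched as follows. This is a simplified illustration, not the logic of the actual fix: the helper names are hypothetical, "512GB" is assumed to mean 512 GiB, and the only storage rule relied on is that page blob writes are aligned to 512-byte pages.

```python
PAGE_SIZE = 512            # page blob writes are aligned to 512-byte pages
GIB = 1024 ** 3
BLOB_SIZE = 512 * GIB      # assumption: the "512GB" above means 512 GiB


def write_fits(blob_size: int, offset: int, length: int) -> bool:
    """Return True if a page write lies entirely inside the current blob size."""
    aligned = offset % PAGE_SIZE == 0 and length % PAGE_SIZE == 0
    return aligned and offset >= 0 and length > 0 and offset + length <= blob_size


def size_needed(blob_size: int, offset: int, length: int) -> int:
    """Smallest page-aligned blob size that can hold the write (never shrinks)."""
    end = max(blob_size, offset + length)
    return -(-end // PAGE_SIZE) * PAGE_SIZE  # round up to a page boundary
```

A write that starts at or beyond `BLOB_SIZE` does not fit until the blob is resized to at least `size_needed(...)`; resizing in this direction keeps the existing content.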

Note that even after we resolve this, it may be worth taking some measures to keep the total amount of data stored in a task hub in check, since Netherite is (unlike Azure Storage) not equipped to handle "unlimited" amounts of data. At some point performance starts to suffer.

Here are some things to consider:

1. It is usually a good idea to purge completed orchestrations after a while (e.g. see the discussion in #229).
2. If you have a significant amount of data you want to store persistently for a long time, I would consider writing such data to external storage suitable for long-term storage.
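
A periodic purge along the lines of the first suggestion could look like the sketch below. It assumes a Python function app and a Durable Functions client object exposing an async `purge_instance_history_by` method with these keyword arguments, as in the `azure-functions-durable` package; the helper name, the retention period, and the status list are hypothetical, and the call should be checked against the client version actually in use.

```python
import asyncio
from datetime import datetime, timedelta, timezone


async def purge_older_than(client, retention_days, statuses, now=None):
    """Purge instances created more than `retention_days` ago (hypothetical helper).

    `client` is assumed to be a Durable Functions client with an async
    purge_instance_history_by(created_time_from, created_time_to, runtime_status).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return await client.purge_instance_history_by(
        created_time_from=datetime(1970, 1, 1, tzinfo=timezone.utc),
        created_time_to=cutoff,
        runtime_status=statuses,  # e.g. only terminal states such as Completed
    )
```

Such a helper would typically be called from a timer-triggered function, so that old instances are removed on a schedule rather than by hand.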

rmbeaty commented on June 1, 2024

Interesting, can you tell if there is a certain type of data taking most of the space? The objects we are storing are probably not more than 255k or so at most. Could it be some type of logging creating this large amount of data?

Mike Beaty, RedCritter

sebastianburckhardt commented on June 1, 2024

I have implemented a fix (#329) that should unblock your partition once we can get it to you.

> can you tell if there is a certain type of data taking most of the space

There is no quick way to see this type of information right now (maybe something we should consider monitoring).
Are you using entities and/or orchestrations?

> The objects we are storing are probably not more than 255k or so

Yep, that does not explain the 512GB. Based on the load information (you can also see this information in the partition table: https://microsoft.github.io/durabletask-netherite/#/ptable), there are only about 4.6k instances stored on this partition, so if each is 255k, then that is still less than 1.2GB.
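
The estimate above can be reproduced with a few lines of Python. The figures come from this thread; treating "k" as KiB and "GB" as GiB is an assumption made only to keep the units consistent.

```python
KIB = 1024
GIB = 1024 ** 3

INSTANCES = 4_600                  # "about 4.6k instances" on the partition
MAX_OBJECT_BYTES = 255 * KIB       # "not more than 255k or so" per object
BLOB_LIMIT_BYTES = 512 * GIB       # page blob size reached by the checkpoint

# Upper bound on the payload if every instance were at its maximum size.
upper_bound_bytes = INSTANCES * MAX_OBJECT_BYTES

print(f"upper bound: {upper_bound_bytes / GIB:.2f} GiB")
print(f"blob limit is {BLOB_LIMIT_BYTES / upper_bound_bytes:.0f}x larger")
```

Even with every instance at its maximum size, the payload stays a small fraction of the blob size, so the bulk of the data has to come from somewhere other than the current object contents.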

> can you tell if there is a certain type of data taking most of the space

There is no logging being stored inside the task hub. Sometimes entities accumulate a lot of space in their metadata. Also, frequent updates tend to generate more data since multiple versions are being stored.

rmbeaty commented on June 1, 2024

We are using only durable entities. The updates are actually very infrequent, because this feature is new in our system and very lightly used. I don't believe our data should be of any significant size. Can you think of any way that we could see the breakdown per object in the partition? I am wondering whether the metadata is perhaps growing with every partition recovery attempt. Thanks

davidmrdavid commented on June 1, 2024

@rmbeaty:

We just published a Netherite private package with @sebastianburckhardt's fix here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.1-blob-growth-fix-1

The version is 1.4.1-blob-growth-fix-1. To be able to install it, you need to add the following key to your nuget.config:

<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />

Please try it out, let us know once you do so, and whether your application appears to recover after some time. Thanks!

rmbeaty commented on June 1, 2024

Unfortunately, Netherite is failing silently during initialization with this version. I saw the same issue with trying to upgrade to 1.4. All calls to it hang indefinitely. I have been running microsoft.azure.durabletask.netherite.azurefunctions.1.3.4. Can I get access to the project source for 1.4.1-blob-growth-fix-1 so that I can add the project to my solution and narrow down the failure in our dev environment? Thanks

davidmrdavid commented on June 1, 2024

@rmbeaty: the source branch can be seen in this PR: #329

I'll investigate those 1.4.x upgrade issues. I seem to recall similar reports.

davidmrdavid commented on June 1, 2024

@rmbeaty: the known upgrade issues to 1.4.0 were thought to be addressed in 1.4.1. Just to confirm, when you say "I saw the same issue with trying to upgrade to 1.4", do you mean that you saw these failures in "1.4.0" but not in "1.4.1", or do you mean that you see these "silent failures" in both versions? Note that I'm asking about the official releases, not the private package I sent you a few hours ago.

sebastianburckhardt commented on June 1, 2024

Just to make sure, I did a quick test and I did not see any failures to open the task hub. So, I would need to know more details about the failure to debug it.

(for the test, I created a task hub with 1.3.5 and then opened it with the version in PR #329)

rmbeaty commented on June 1, 2024

Thanks, I am looking into it. It is unclear where it is failing. Hopefully I can narrow it down.

davidmrdavid commented on June 1, 2024

@rmbeaty: Any updates here? We should be able to help if you can provide us with any more details about the kind of failure you're seeing.

Since you have an internal support case with us, you can work with your support point of contact to schedule a meeting with us to help you debug, for example. I'll ask the support person on our end to see if we can set that up.
