Comments (18)
I've updated host.json to log more information with "StorageLogLevelLimit": "Trace". Please have a look. I would like to retain this hub, rather than start a new one, because it has prod data in it.
from durabletask-netherite.
Logs that say "Replayed internal UpdateEvent" are emitted when a partition is recovering. It is normal for those events to be things that happened in the past (the log being replayed is an "event source" that records all partition state updates and is replayed to recover the latest partition state). However, partitions are also supposed to checkpoint relatively often (about once a minute), at which point the log is trimmed.
So, if you keep seeing those old events over and over, it suggests that something is amiss with the partition, for example it may be stuck in an infinite recovery cycle. If this is running on a consumption or premium plan I could take a look at our internal telemetry if you provide me with the application name.
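To make the replay-and-checkpoint behavior described above concrete, here is a minimal toy sketch in Python. It is not Netherite's implementation; the `Partition` class and its methods are hypothetical names used only for illustration. It shows why a partition that checkpoints regularly replays few events, while one that never completes a checkpoint replays the same old events on every recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy event-sourced partition (hypothetical, for illustration only)."""
    checkpoint: dict = field(default_factory=dict)  # last saved snapshot
    log: list = field(default_factory=list)         # events since the snapshot
    state: dict = field(default_factory=dict)       # in-memory state

    def apply(self, key, value):
        # Record the update in the log, then apply it to the in-memory state.
        self.log.append((key, value))
        self.state[key] = value

    def take_checkpoint(self):
        # Save a snapshot and trim the log; later recoveries start from here.
        self.checkpoint = dict(self.state)
        self.log.clear()

    def recover(self):
        # Rebuild state from the snapshot plus every event still in the log.
        self.state = dict(self.checkpoint)
        for key, value in self.log:
            self.state[key] = value
        return len(self.log)  # number of events replayed


p = Partition()
p.apply("a", 1)
p.apply("b", 2)
p.take_checkpoint()     # log is trimmed here
p.apply("a", 3)
replayed = p.recover()  # only the one event after the checkpoint is replayed
```

If `take_checkpoint` never succeeds, every call to `recover` replays the full log again, which matches the symptom of seeing the same old events over and over.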
from durabletask-netherite.
After investigating the telemetry, I can confirm that the problem (as of right now) is that partition 5 is caught in an infinite recovery cycle. Every time, immediately after recovering, it throws an exception:
Azure.RequestFailedException: The page range specified is invalid.
at Azure.Storage.Blobs.PageBlobRestClient.UploadPagesAsync(Int64 contentLength, Stream body, Byte[] transactionalContentMD5, Byte[] transactionalContentCrc64, Nullable`1 timeout, String range, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 ifSequenceNumberLessThanOrEqualTo, Nullable`1 ifSequenceNumberLessThan, Nullable`1 ifSequenceNumberEqualTo, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesInternal(Stream content, Int64 offset, UploadTransferValidationOptions transferValidationOverride, PageBlobRequestConditions conditions, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesAsync(Stream content, Int64 offset, Byte[] transactionalContentHash, PageBlobRequestConditions conditions, IProgress`1 progressHandler, CancellationToken cancellationToken)
at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass39_1.<<WritePortionToBlobAsync>b__0>d.MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs:line 454
--- End of stack trace from previous location ---
at DurableTask.Netherite.Faster.BlobManager.PerformWithRetriesAsync(SemaphoreSlim semaphore, Boolean requireLease, String name, String intent, String data, String target, Int32 expectedLatencyBound, Boolean isCritical, Func`2 operationAsync, Func`1 readETagAsync) in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/StorageOperations.cs:line 68
The cause is unclear, and I cannot tell what the "invalid page range" is. This is likely some malfunction inside FASTER; my guess is that it somehow recovered to an invalid state.
To help us troubleshoot this, please edit the following lines in host.json to increase the verbosity of the storage access tracing:
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "StorageLogLevelLimit": "Trace"
      }
    }
  }
}
Regardless, I will also try to see what happened to partition 5 right before this symptom starts appearing.
from durabletask-netherite.
Thanks.
Since this seems to be some kind of data corruption problem, starting with a fresh task hub would probably eliminate the issue, at least temporarily. It is possible the problem comes back if the original issue recurs. Based on the telemetry, the original occurrence of this problem appears to be rare: the current corruption of partition 5 happened back in September.
from durabletask-netherite.
Thanks for the traces, they were helpful. I have found the cause.
What happens is that the total amount of data written when taking a checkpoint exceeds the size of the page blob (512GB).
Page blobs can be resized dynamically without losing the content, so we should be able to recover the content. I will work on a fix.
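As a rough illustration of the failure mode, the following Python sketch checks whether a write range fits inside a page blob and computes the size the blob would have to grow to. It is only a model under stated assumptions: page blobs are addressed in 512-byte pages, "512GB" is taken to mean 512 GiB, and the function names are hypothetical. It is not the code of the actual fix, and the exact conditions under which the service reports an invalid page range are assumed here, not taken from the thread.

```python
PAGE_SIZE = 512  # page blobs are addressed in 512-byte pages
GIB = 1024 ** 3  # assumption: "512GB" in the thread means 512 GiB


def align_up(n: int, alignment: int = PAGE_SIZE) -> int:
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment


def range_fits(offset: int, length: int, blob_size: int) -> bool:
    """True if the write is page-aligned and ends inside the blob."""
    aligned = offset % PAGE_SIZE == 0 and length % PAGE_SIZE == 0
    return aligned and length > 0 and offset + length <= blob_size


def size_needed(offset: int, length: int, blob_size: int) -> int:
    """Smallest page-aligned blob size that can hold the write (never shrinks)."""
    return max(blob_size, align_up(offset + length))


blob_size = 512 * GIB
offset = blob_size - PAGE_SIZE  # last page of the blob
length = 4 * PAGE_SIZE          # write runs three pages past the end

fits_before = range_fits(offset, length, blob_size)
new_size = size_needed(offset, length, blob_size)
fits_after = range_fits(offset, length, new_size)
```

In this model the write is rejected at the original size and accepted once the blob has been grown to `new_size`, which is the idea behind resizing the blob instead of discarding its content.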
Note that even after we resolve this, it may be worth taking some measures to keep the total amount of data stored in a task hub in check, since Netherite is (unlike Azure Storage) not equipped to handle "unlimited" amounts of data. At some point performance starts to suffer.
Here are some things to consider:
- It is usually a good idea to purge completed orchestrations after a while (e.g. see the discussion in #229).
- If you have a significant amount of data that you want to store persistently for a long time, I would consider writing it to external storage suitable for long-term storage.
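The purge suggestion above can be sketched as a simple retention filter. This Python example only models the selection step on in-memory records; the record fields, the `select_for_purge` helper, and the 30-day retention period are all hypothetical, and the actual purge would go through the Durable Functions client API, which is not shown here.

```python
from datetime import datetime, timedelta, timezone

# Runtime statuses treated as finished, and therefore safe to purge.
TERMINAL = {"Completed", "Failed", "Terminated"}


def select_for_purge(instances, now, retention):
    """Return ids of finished instances last updated before now - retention."""
    cutoff = now - retention
    return [
        inst["id"]
        for inst in instances
        if inst["status"] in TERMINAL and inst["last_updated"] < cutoff
    ]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
instances = [
    {"id": "a", "status": "Completed", "last_updated": now - timedelta(days=45)},
    {"id": "b", "status": "Running", "last_updated": now - timedelta(days=45)},
    {"id": "c", "status": "Completed", "last_updated": now - timedelta(days=2)},
    {"id": "d", "status": "Failed", "last_updated": now - timedelta(days=90)},
]
to_purge = select_for_purge(instances, now, retention=timedelta(days=30))
```

Only old, finished instances are selected; running instances and recently completed ones are kept.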
from durabletask-netherite.
I have implemented a fix (#329) that should unblock your partition once we can get it to you.
can you tell if there is a certain type of data taking most of the space
There is no quick way to see this type of information right now (maybe something we should consider monitoring).
Are you using entities and/or orchestrations?
The objects we are storing are probably not more than 255k or so
Yep, that does not explain the 512GB. Based on the load information (you can also see this information in the partition table) there are only about 4.6k instances stored on this partition, so if each is 255k, then that is still less than 1.2GB.
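The estimate above can be checked with a few lines of arithmetic. This sketch assumes "255k" means 255 kB per instance and uses decimal units throughout; both are assumptions, since the thread does not specify them.

```python
instances = 4_600                 # "about 4.6k instances" on this partition
bytes_per_instance = 255 * 1000   # assumption: "255k" means 255 kB (decimal)

total_bytes = instances * bytes_per_instance
total_gb = total_bytes / 1e9      # decimal gigabytes

blob_limit_gb = 512               # page blob size mentioned in the thread
share_of_limit = total_gb / blob_limit_gb

print(f"estimated payload: {total_gb:.3f} GB")
print(f"share of the 512GB blob: {share_of_limit:.2%}")
```

Under these assumptions the instance payloads account for well under one percent of the blob size, so something other than the stored objects themselves must be taking up the space.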
can you tell if there is a certain type of data taking most of the space
There is no logging being stored inside the task hub. Sometimes entities accumulate a lot of space in their metadata. Also, frequent updates tend to generate more data, since multiple versions are stored.
from durabletask-netherite.
We just published a Netherite private package with @sebastianburckhardt's fix here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.1-blob-growth-fix-1
The version is 1.4.1-blob-growth-fix-1. To be able to install it, you need to add the following key to your nuget.config:
<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />
Please try it out, let us know once you do so, and whether your application appears to recover after some time. Thanks!
from durabletask-netherite.
@rmbeaty: the source branch can be seen in this PR: #329
I'll investigate those 1.4.x upgrade issues. I seem to recall similar reports.
from durabletask-netherite.
@rmbeaty: the known upgrade issues in 1.4.0 were thought to be addressed in 1.4.1. Just to confirm, when you say "I saw the same issue with trying to upgrade to 1.4", do you mean that you saw these failures in 1.4.0 but not in 1.4.1, or that you see these "silent failures" in both versions? Note that I'm asking about the official releases, not the private package I sent you a few hours ago.
from durabletask-netherite.
Just to make sure, I did a quick test and I did not see any failures to open the task hub. So, I would need to know more details about the failure to debug it.
(for the test, I created a task hub with 1.3.5 and then opened it with the version in PR #329)
from durabletask-netherite.
@rmbeaty: Any updates here? We should be able to help if you can provide us any more details about the kind of failure you're seeing.
Since you have an internal support case with us, you can work with your support point of contact to schedule a meeting with us to help you debug, for example. I'll ask the support person on our end to see if we can set that up.
from durabletask-netherite.