Comments (15)
I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool.
.5987 RPS, 99% latency 17,61 ms, 95% latency 9,39 ms, max latency 167,61 ms
...60692 RPS, 99% latency 15,4 ms, 95% latency 6,51 ms, max latency 610,77 ms
...44698 RPS, 99% latency 20,9 ms, 95% latency 9,33 ms, max latency 745,69 ms
..35911 RPS, 99% latency 28,62 ms, 95% latency 11,54 ms, max latency 725,73 ms
.27488 RPS, 99% latency 33,26 ms, 95% latency 15,39 ms, max latency 999,47 ms
..31520 RPS, 99% latency 22,41 ms, 95% latency 11,55 ms, max latency 975,2 ms
.19651 RPS, 99% latency 39,24 ms, 95% latency 20,35 ms, max latency 1050,25 ms
.19856 RPS, 99% latency 39,76 ms, 95% latency 17,88 ms, max latency 1366,85 ms
The increasing latency might be because the thread pool is busy with other tasks.
e.g.
omsGrain.ProccedExecutionReport(omsRequest, CancellationToken.None).AndForget(TaskOption.Safe);
Eventually, the entire thread pool queue might fill up with tasks of this kind.
I'll dig deeper later today, but the increasing latency is very suspicious.
from protoactor-dotnet.
I debugged it. The workaround is to set actorSystemConfig.SharedFutures to false. The request is sent again and again in DefaultClusterContext.RequestAsync because its future.Task is already cancelled even before the next attempt to call context.Request. This causes a TaskCanceledException, which is handled by retrying the send operation.
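To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the actual Proto.Actor code: RetryDemo, SendWithRetry and CountDeliveriesWithCancelledFuture are hypothetical names, and the loop only assumes what is described above, namely that cancellation is handled by sending again. If the task being awaited is already cancelled, every attempt delivers the message once more and fails immediately.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetryDemo
{
    // Hypothetical stand-in for a retry loop that treats cancellation as
    // "send again". This is NOT the real DefaultClusterContext.RequestAsync.
    public static async Task<int> SendWithRetry(Func<Task> awaitReply, Action send, int maxAttempts)
    {
        var attempts = 0;
        while (attempts < maxAttempts)
        {
            attempts++;
            send(); // the target receives the message on every attempt
            try
            {
                await awaitReply();
                break; // a reply arrived, stop retrying
            }
            catch (OperationCanceledException)
            {
                // TaskCanceledException derives from OperationCanceledException.
                // With an already-cancelled task we end up here immediately.
            }
        }
        return attempts;
    }

    // Counts how often the message is delivered when the awaited task
    // is cancelled before the first attempt.
    public static async Task<int> CountDeliveriesWithCancelledFuture(int maxAttempts)
    {
        var alreadyCancelled = Task.FromCanceled(new CancellationToken(true));
        var deliveries = 0;
        await SendWithRetry(() => alreadyCancelled, () => deliveries++, maxAttempts);
        return deliveries;
    }
}
```

In this sketch the number of deliveries equals the attempt limit, which would match the duplicate processing seen on the receiving side if the real loop behaves similarly.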
from protoactor-dotnet.
This sounds like a serious bug.
Have you been able to reproduce it in any other conditions than just high load?
from protoactor-dotnet.
My colleague reported something similar during debugging but it could be caused by different reasons like timeouts due to debugger pauses. I don't know any other ways to reproduce it without high load.
The setup is pretty simple: 3 virtual actor types, but only one is actually used. This actor has 3000 instances, and requests are spread among them. The actor handles two types of requests: place order and cancel order. The benchmark sends up to 256 requests simultaneously without waiting for results (limited by SemaphoreSlim). The bug occurs only when I run without a debugger attached, and only after some time (a minute is usually enough). Each request has a unique id. I added a Debugger.Launch call at the place where a duplicate order id gets detected. The counter in DefaultClusterContext.RequestAsync for this specific request is above 600000 by the time I attach the debugger and step in.
My two colleagues were able to reproduce it using the same code.
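For readers who want to build a similar load pattern, the SemaphoreSlim throttling described above can be sketched roughly as follows. This is not the benchmark code itself: ThrottledSender, Run and the sendRequest delegate are hypothetical names, and sendRequest stands in for whatever cluster call the real benchmark makes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledSender
{
    // Starts `total` requests, never allowing more than `maxInFlight` at once.
    // Returns the highest number of requests observed in flight.
    public static async Task<int> Run(int total, int maxInFlight, Func<int, Task> sendRequest)
    {
        using var gate = new SemaphoreSlim(maxInFlight);
        var sync = new object();
        var inFlight = 0;
        var peak = 0;
        var pending = new Task[total];

        for (var i = 0; i < total; i++)
        {
            // Waits asynchronously once the limit is reached.
            await gate.WaitAsync();
            var requestId = i;
            pending[i] = Task.Run(async () =>
            {
                lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
                try
                {
                    await sendRequest(requestId);
                }
                finally
                {
                    lock (sync) { inFlight--; }
                    gate.Release(); // frees a slot for the next request
                }
            });
        }

        // The producer never awaits an individual reply, only the whole batch.
        await Task.WhenAll(pending);
        return peak;
    }
}
```

A call such as ThrottledSender.Run(100000, 256, id => SendOrderAsync(id)), where SendOrderAsync is a placeholder, would keep up to 256 requests outstanding at any time.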
from protoactor-dotnet.
Many thanks for the report.
Using actorSystemConfig.SharedFutures = false is safe (and was the original behavior), so you can keep using that for now.
It's just a bit slower due to the allocation/deallocation of PIDs per Future.
I'm still trying to figure out what causes this, and whether it is inside the shared future implementation itself.
I'll report my findings here.
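For anyone applying the workaround, it might look roughly like the fragment below. This is a sketch under assumptions: it presumes that ActorSystemConfig is a record created via ActorSystemConfig.Setup() and that SharedFutures is a property that a `with` expression can override. Check this against the Proto.Actor version you are running.

```csharp
using Proto;

// Assumption: ActorSystemConfig is a record exposing a SharedFutures property.
// Disabling it falls back to one future process (and PID) per request.
var actorSystemConfig = ActorSystemConfig.Setup() with { SharedFutures = false };
var system = new ActorSystem(actorSystemConfig);
```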
from protoactor-dotnet.
I am not sure how closely this is related, but we get these exceptions every now and then on production servers:
System.IndexOutOfRangeException: Index was outside the bounds of the array.
at Proto.Future.SharedFutureProcess.Cancel(UInt32 requestId)
at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, TimeSpan timeout)
....
System.IndexOutOfRangeException: Index was outside the bounds of the array.
at Proto.Future.SharedFutureProcess.SendUserMessage(PID pid, Object message)
....
@rogeralsing I hope the callstacks help in some way.
We have been ignoring it as it didn't result in messages being processed again and again for us.
I guess we will also disable shared futures and deploy that way.
One thing to note is that we get these exceptions even though our per-actor throughput doesn't exceed 200 msgs/sec.
from protoactor-dotnet.
cc @mhelleborg
from protoactor-dotnet.
@AqlaSolutions Could you share a reproducing example? Trying to reproduce with just high load on shared futures has come up empty here, no issues.
from protoactor-dotnet.
@mhelleborg I'm not sure, because of the NDA, you know... It will also take some time to prepare and minimize it. I will let you know if I get permission.
from protoactor-dotnet.
I am having a similar problem. The OnReceive method keeps getting the same message over and over again if I send the message as follows:
ActorSystemHelper.fSystem.Cluster().RequestAsync<object>(request.DeviceID, "FieldbusWorker_" + device.Gateway.Oid, "STOP", CancellationToken.None);
However, if I send it with the MethodIndex, the problem does not occur:
ActorSystemHelper.fSystem.Cluster().RequestAsync<object>(request.DeviceID, "FieldbusWorker_" + device.Gateway.Oid, new GrainRequestMessage(2,null), CancellationToken.None);
from protoactor-dotnet.
The source code to reproduce the problem described by AqlaSolutions is in the attachment:
shared_futures_repro_2.zip
from protoactor-dotnet.
Just run the benchmarks/PrototypeBenchmark project from the BEP.sln solution without a debugger and in the Release configuration.
After a few minutes (may require several runs) you will see the console message:
Order 12345 for market abc_def was already processed
from protoactor-dotnet.
@rogeralsing In this issue's repro we don't have a ProccedExecutionReport method. Maybe you meant to post in #1977?
from protoactor-dotnet.
yes 👍🏻
from protoactor-dotnet.