Comments (12)
@ondrejpialek we might need to make a change in the SQL plugins to make the journal responsible for autoincrementing the offset. I'll try to prepare a PR for that this weekend.
from akka.persistence.sqlserver.
@gfgw as you noticed, the reader stream is a windowed function based on the Ordering column: when a frame of records is received, the stream handler internally records the Ordering of the latest received record and uses it as the WHERE-clause constraint when the next frame is requested.
Example:
Request 100 rows WHERE Ordering > 0 and Tag = banana
Ordering: 204800, Tag: banana
Ordering: 204801, Tag: banana
Ordering: 204802, Tag: banana
Ordering: 204803, Tag: banana
Ordering: 204804, Tag: apple
...
Ordering: 204809, Tag: apple
Ordering: 204810, Tag: banana --> this is the 61st element; we skipped 6 elements since they didn't match the tag
Request 5 rows WHERE Ordering > 204810 and Tag = banana
....
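The windowed, tag-filtered read described above can be illustrated with a toy model. This is a minimal Python/SQLite sketch (the real plugin is C# against SQL Server, and the table and column names here are only illustrative of the idea, not the plugin's actual schema):

```python
import sqlite3

# Toy journal: Ordering is an autoincrementing identity, Tag is the filter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EventJournal ("
    "Ordering INTEGER PRIMARY KEY AUTOINCREMENT, Tag TEXT)")
rows = [("banana",)] * 3 + [("apple",)] * 2 + [("banana",)]
conn.executemany("INSERT INTO EventJournal (Tag) VALUES (?)", rows)

def read_frame(last_ordering, tag, page_size=100):
    """One frame of the windowed read: everything past the last seen
    Ordering that matches the tag, up to page_size rows."""
    cur = conn.execute(
        "SELECT Ordering, Tag FROM EventJournal "
        "WHERE Ordering > ? AND Tag = ? ORDER BY Ordering LIMIT ?",
        (last_ordering, tag, page_size))
    return cur.fetchall()

frame = read_frame(0, "banana")
# The apple rows (Orderings 4 and 5) are skipped entirely, so the next
# frame is requested with WHERE Ordering > 6, not > 4.
```

This is why, when reading by tag, consecutive offsets can legitimately jump by more than one.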
@Horusiath Thank you for the explanation. The question however is: how is it possible that there is a gap between two requests? Starting at offset 204743, the request asks for 100 events. The query returns 61 rows (because there are no more events at that time). The next request asks for events starting at 204810, which is 67 higher than the previous offset. So I am missing 6 events. How can that happen?
@gfgw if you're reading by tag, offsets are not necessarily contiguous - are you sure that these 6 events match the tag you were looking for? I've updated the example to present exactly the case you may be talking about.
@Horusiath No, all events in that range have the same tag.
Hello, we have a similar setup and are also experiencing the very same issue. Occasionally a random event is missed by replay and does not make its way to the projection. This is obviously a serious problem as replaying events is a fundamental feature of Akka.
From the logs it appears that the replay actor loads two events (logs current offset = requested offset + 2) but only sends a single event to the projection. This does not always happen; other times when many events are read all make it to the projecting actor. It also has nothing to do with the messages themselves, they can be deserialized fine when fetched via other means.
There are no errors or any other messages around that time that would explain this (e.g. no dead letters); we also have retry logic around the reading & projection that restarts the stream, but that is not triggered. It seems that the replay actor knows there is some kind of event there but does not send it for some reason.
I wonder if this could somehow be caused by concurrency on the write side? From the logs it seems that there were two events, 55022 and 55023. The replay asked for new events since 55021 and moved the current offset to 55023. Could this have been caused by the transaction inserting event 55023 committing earlier than the transaction saving 55022, so that the reader saw only the later event and skipped over row 55022, which had not been committed yet? If so, are there required DB settings that would prevent this concurrency? Or is there some setting that can be applied to the SQL query to guarantee it does not skip over uncommitted rows?
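The suspected race can be modelled without a database at all. This is a purely hypothetical Python simulation (not the plugin's code) of identities being allocated in order but committed out of order, with the event numbers taken from the logs above:

```python
# Identities 55022 and 55023 are handed out in write order, but the
# transactions may commit in the opposite order. The reader only sees
# committed rows above its current offset.
committed = set()

def poll(last_seen):
    """One replay poll: return visible events and the advanced offset."""
    visible = sorted(o for o in committed if o > last_seen)
    return visible, (max(visible) if visible else last_seen)

committed.add(55023)            # the *later* identity commits first
seen, offset = poll(55021)      # reader sees only 55023...
committed.add(55022)            # ...then 55022 commits, too late:
missed, offset = poll(offset)   # the offset has already moved past it
# `missed` is empty: event 55022 is permanently skipped by the reader.
```

Under this model the reader's offset jumps from 55021 to 55023 exactly as seen in the logs, and 55022 is never delivered.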
For completeness this is running in Azure, Akka on a few services on Service Fabric with high concurrency on the write side. The SQL Server is also on Azure, an ordinary SQL DB inside a SQL Server Elastic Pool.
Many thanks,
Ondrej.
Ok, I've spent some more time reading and googling around. It seems that relying on the order of an Identity column is flawed and rows might silently be skipped over, see https://dba.stackexchange.com/questions/133556/can-i-rely-on-reading-sql-server-identity-values-in-order
@Horusiath is there something I am missing, or do you agree that the current implementation might lead to missed events due to the problem described above?
I think it's the same problem as akkadotnet/akka.net#3590
Would that solve the problem though? If you have two competing writers (two processes running) then you cannot guarantee which one writes their message first, right? So you could still end up with a wrong order & Persistence Query accidentally skipping over events with lower IDs persisted later.
I think that the only way to prevent IDs being persisted out of order is to remove the concurrency on the write side by serializing the writes - something like having the Journal as a cluster singleton (for distributed scenarios).
Alternatively, a "staging table" could be created as a form of FIFO queue to which concurrent processes would write (at this point the offset would not be generated yet), and a single "processor" would move rows from this queue to the journal table, ensuring that the rows in the journal table are persisted in the order of the Identity column. If this "processor" is some form of SQL Server job, then no coordination is needed between different Akka processes. If the Journal, however, needs to know the offset of the written event, then some changes will need to be made so that the offset is read after the row has been moved from the staging table to the journal table.
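The staging-table idea can be sketched in miniature. This is a toy Python model (all names hypothetical; in reality the queue would be a table and the drainer a SQL job or singleton actor), showing why a single drainer yields gap-free, in-order identities:

```python
from queue import Queue

staging = Queue()     # stand-in for the staging table (no offset yet)
journal = []          # stand-in for the journal table
next_ordering = 1     # identity assigned only at drain time

def write(event):
    """Called concurrently by any number of writers; no Ordering here."""
    staging.put(event)

def drain():
    """The single 'processor': assigns Ordering as it moves rows, so
    identities are committed strictly in order, with no gaps."""
    global next_ordering
    while not staging.empty():
        journal.append((next_ordering, staging.get()))
        next_ordering += 1

write("a"); write("b"); write("c")
drain()
```

Because only the drainer touches the journal table, a reader polling `WHERE Ordering > last_seen` can never observe a higher identity before a lower one.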
Overall I think that any of these approaches would lead to lower performance (which is OK imho as first and foremost the system needs to be correct), and the changes would be pretty severe.
I also want to note that only the Persistence Query component seems to be affected by this. The question is whether ensuring that rows are in order just for the sake of Persistence Query is worth it. Perhaps there are other "smarter" ways to fix Persistence Query. For example, if PQ only considered events older than, say, 1 minute, then the likelihood of "missing" events due to an allocated Identity whose row was not written yet is very low. If this is configurable, then people have a chance to tweak it to fit their needs.
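The visibility-horizon idea sketched above could look something like this. All names and the 60-second value are hypothetical (this is not an existing plugin setting), but it shows the filter the query would apply:

```python
# The read-side query ignores rows committed within the last `horizon`
# seconds, giving late-committing identities time to appear before the
# reader's offset moves past them.
horizon = 60.0  # seconds; would have to become a plugin config knob

def visible(rows, now):
    """rows: (ordering, write_timestamp) pairs already committed.
    Only rows at least `horizon` seconds old are delivered."""
    return [o for o, ts in rows if ts <= now - horizon]

now = 1_000_000.0
rows = [(1, now - 300), (2, now - 120), (3, now - 5)]
# Row 3 is too fresh to be delivered on this poll; it becomes visible
# (together with any lower identity that commits late) on a later poll.
```

The trade-off is a fixed latency of `horizon` on live queries, in exchange for a much smaller window in which an out-of-order commit can be skipped.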
Some other alternatives would include "gap" detection on the Query side. A gap in IDs (e.g. reading 1,2,3,5 with 4 being committed only after the read is done) can be caused by the following cases:
- the race condition mentioned here
- a deleted event
- a rolled-back transaction
The query could be modified to read deleted events as well, to ensure a false positive is not triggered for this particular case (the query could be written so that deleted events do not send their payload). Then a gap in the IDs either means that we encountered this problem, or it's a genuine gap due to a rolled-back transaction. At this point the reader would have to retry at a later time to see if the gap can be filled, or it could yield the events it has so far but remember it missed number 4 and retry a select for just that ID at a later time.
The problem is that a rolled-back transaction would never fill that gap, so the reader would have to give up at some point. For systems that generate a lot of these genuine gaps, retrying would slow the queries significantly; that being said, the system could ignore gaps older than a few minutes, as it is not likely there is a transaction hanging for that long. So rebuilding a projection would be as fast as it is now; only a live query would perhaps retry once or twice when a gap is encountered.
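The gap-detection step itself is simple to state. A minimal Python sketch (hypothetical helper, not plugin code) of splitting one query result into deliverable events and missing Orderings to retry later:

```python
def detect_gaps(seen_orderings, last_delivered):
    """Given the Orderings returned by one query and the highest
    Ordering already delivered, return (deliverable, gaps), where
    `gaps` are the missing ids to retry (and eventually give up on)."""
    gaps, deliverable = [], []
    expected = last_delivered + 1
    for o in sorted(seen_orderings):
        gaps.extend(range(expected, o))  # ids skipped before this row
        deliverable.append(o)
        expected = o + 1
    return deliverable, gaps

# The example from above: reading 1,2,3,5 with 4 not yet committed.
deliverable, gaps = detect_gaps([1, 2, 3, 5], last_delivered=0)
# deliverable = [1, 2, 3, 5]; gaps = [4] -> retry a SELECT for id 4
# later, and drop it from the retry set once it is older than the cutoff.
```

A real implementation would additionally track a deadline per gap, since a rolled-back transaction never fills its gap.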
This is a lot of text and I apologize for that, I hope it makes at least some sense :)
NB: Other "Event Stores" based on traditional RDBMS have the same problem, see NEventStore with MSSQL and PostgreSQL: NEventStore/NEventStore.Persistence.SQL#21
Also - they seem to encourage setting READ_COMMITTED_SNAPSHOT to OFF, as it should reduce the risk of ignoring uncommitted rows. This should greatly reduce the risk of reading events out of order, but sadly it does not make the problem go away completely. On our system, where we've seen this bug occur over the past two days (as we are increasing load), we have READ_COMMITTED_SNAPSHOT set to ON, which might explain why it can be seen this often. I am waiting for an ack from the client to reconfigure the DB to turn this off, so 🤞 that the number of issues will go down. It should be noted somewhere in the documentation that this is the correct configuration, though. The default used to be OFF, but perhaps MS changed this recently on their latest Azure environment 🤷♂️
A very informative debate around the very same problem: NEventStore/NEventStore#425
@ondrejpialek I don't want to overcomplicate things here. Any change made here will quite probably have a negative performance impact, some of which can be amortized by the batching journal's capabilities.
The easiest option seems to be using a single DB connection per journal at a time - since all SQL providers use either TCP or IPC for communication, we would get a global order for the events sent by the journal.
With that, all we need to do is make sure that the journal doesn't reorder write requests before sending them to the database. The open question is the limitations of using a single DB connection. Of course, this comes with traditional issues (i.e. head-of-line blocking), but we need actual tests to be sure how serious an impact it has.
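The single-connection approach boils down to funnelling all writes through one ordered channel. A toy Python sketch of the idea (a worker thread standing in for the single DB connection; not the journal's actual implementation):

```python
import queue
import threading

requests = queue.Queue()  # write requests in the order the journal got them
committed = []            # stand-in for rows visible to readers

def db_worker():
    """The single 'connection': executes one INSERT at a time, in
    arrival order, so identities commit monotonically with no races."""
    while True:
        item = requests.get()
        if item is None:          # shutdown sentinel
            break
        committed.append(item)

t = threading.Thread(target=db_worker)
t.start()
for i in range(5):
    requests.put(i)  # the journal must not reorder before enqueueing
requests.put(None)
t.join()
```

The head-of-line blocking mentioned above is visible in the model too: a slow insert at the front of the queue delays every request behind it.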