
Possible Azure SQL #14

Open
adamfur opened this issue Aug 26, 2015 · 6 comments

adamfur commented Aug 26, 2015

Hi,

We have built a system that uses the PollingClient to feed around 25 commit-observing projection builders.
Occasionally in production, several of our projection builders crash simultaneously, at the very same commit checkpoint.
When we reset the checkpoint and feed the projection builders the entire event stream again, it passes perfectly.
So it appears the polling sometimes skips a few events!
It only seems to happen during highly intense read+write periods, such as data imports.

SQL Azure's default transaction isolation level is different from vanilla MSSQL's.

If I write event+0 and event+1 to the database, is it possible that event+1 is available before event+0 in SQL Azure?

In MSSQL we haven't been able to reproduce this result.

Any clues?
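
A minimal, self-contained sketch of the suspected race (illustrative names only, not NEventStore code): if the transaction holding checkpoint 5 commits after checkpoint 6 has already been observed, a naive poller advances its checkpoint past 5 and never sees it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class OutOfOrderDemo
{
    static readonly List<long> Store = new List<long>();

    static IEnumerable<long> LoadCommitsAfter(long checkpoint) =>
        Store.Where(c => c > checkpoint).OrderBy(c => c);

    static void Main()
    {
        long lastCheckpoint = 4;

        Store.Add(6);                       // writer B's checkpoint 6 commits first
        foreach (var c in LoadCommitsAfter(lastCheckpoint))
            lastCheckpoint = c;             // poller advances to 6

        Store.Add(5);                       // writer A's checkpoint 5 becomes visible late
        bool skipped = !LoadCommitsAfter(lastCheckpoint).Contains(5);
        Console.WriteLine($"Checkpoint 5 skipped forever: {skipped}"); // True
    }
}
```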

andreabalducci (Member) commented

I haven't worked with SQL Azure yet, but this kind of issue is possible in a distributed system. You should handle this "glitch" in your client:
A) polling client -> sequencer -> projection
B) modify the polling client to read with a few milliseconds of delay

I would go for A.
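
A minimal sketch of what the sequencer stage in option A might look like (illustrative names, not a NEventStore API). It assumes checkpoint numbers are contiguous, which, as noted further down the thread, they are not on Azure SQL:

```csharp
using System;
using System.Collections.Generic;

// Option A sketch: buffer commits that arrive out of order and dispatch
// them to the projections strictly by checkpoint number.
class Sequencer
{
    private readonly SortedDictionary<long, object> _pending = new SortedDictionary<long, object>();
    private readonly Action<object> _dispatch;
    private long _nextCheckpoint;

    public Sequencer(long firstCheckpoint, Action<object> dispatch)
    {
        _nextCheckpoint = firstCheckpoint;
        _dispatch = dispatch;
    }

    // Called by the polling client for every observed commit.
    public void OnCommit(long checkpoint, object commit)
    {
        _pending[checkpoint] = commit;

        // Flush the longest in-order prefix we now have.
        while (_pending.TryGetValue(_nextCheckpoint, out var next))
        {
            _pending.Remove(_nextCheckpoint);
            _dispatch(next);
            _nextCheckpoint++;
        }
    }
}
```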

adamfur commented Aug 27, 2015

I've messed around a bit and received transaction exceptions in the polling client while using the EnlistInAmbientTransaction() call during the Wireup(). Not sure if it actually solves anything, but I'm going to try it out for a few days.
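
For context, roughly the wire-up being described, assuming the NEventStore 5.x fluent API of that era (member names and order may differ between versions; "EventStore" is a placeholder connection-string name):

```csharp
using NEventStore;
using NEventStore.Persistence.Sql.SqlDialects;

class EventStoreBootstrap
{
    public static IStoreEvents Build()
    {
        return Wireup.Init()
            .UsingSqlPersistence("EventStore")   // placeholder connection-string name
            .WithDialect(new MsSqlDialect())
            .EnlistInAmbientTransaction()        // join an ambient TransactionScope
            .InitializeStorageEngine()
            .UsingJsonSerialization()
            .Build();
    }
}
```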

Regarding options A and B:
I think creating a sequencer is difficult; we expect a lot of holes in the CheckpointNumber identity column since we use several buckets, and SQL Azure also sometimes "randomly" bumps the identity by +10,000.

We implemented a version of B where we changed ObserveFrom*() to pass UtcNow - 300ms and ignore all commits newer than that, giving the infrastructure some time to catch up. A sketch of the idea:
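
A hand-rolled sketch (Commit and loadCommitsAfter are illustrative stand-ins, not NEventStore APIs). The age filter has to live inside the polling loop itself, as done here by modifying ObserveFrom*, because a downstream filter would let the checkpoint advance past the skipped commits:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class DelayedPoller
{
    public sealed class Commit
    {
        public long CheckpointToken { get; set; }
        public DateTime CommitStamp { get; set; }
    }

    public static void Poll(Func<long, IEnumerable<Commit>> loadCommitsAfter,
                            Action<Commit> project,
                            long lastCheckpoint)
    {
        while (true)
        {
            // Only consume commits old enough for in-flight transactions to land;
            // stop at the first too-new commit so nothing behind it is skipped.
            var cutoff = DateTime.UtcNow.AddMilliseconds(-300);
            foreach (var commit in loadCommitsAfter(lastCheckpoint)
                                       .TakeWhile(c => c.CommitStamp <= cutoff))
            {
                project(commit);
                lastCheckpoint = commit.CheckpointToken;
            }
            Thread.Sleep(500); // polling interval
        }
    }
}
```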

If we settle for option B, I will eventually send a pull request with a SqlAzureDialect.

adamfur commented Sep 11, 2015

Our issues:

  • Azure SQL sometimes bumps the identity (CheckpointNumber) value up by 10,000.
  • Fetching a range of recently written commits sometimes misses a portion of the data.
  • There are unexplained gaps in CheckpointNumber, usually of 1-2.

The workaround (sketched in code after the notes below):

  • Added a predicate so we can filter on the bucketId, to decide whether we should invoke the OnNext() method for the current commit.
  • Always fetch from all buckets (the only way to know whether we have a gap).
  • Always consume commits older than 5 seconds.
  • Throw if the next commit is not the last CheckpointNumber + 1.

Notes:

  1. In case we stumble upon a gap, the clients have to tolerate a lag of 5 seconds before their projections are updated (this happens about twice a day).
  2. An invalid sequence retrieval is treated as a transient error; we retry until the next commit has aged to 5 seconds or more, or until we receive the correct sequence.
  3. In our logs we can see that it has sometimes taken almost 0.5 s before we ultimately receive the expected sequence number.
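
A sketch of the workaround with illustrative types (none of these names come from NEventStore): a non-contiguous checkpoint is treated as transient and thrown until the commit has aged 5 seconds, after which the gap is accepted as permanent and the commit is consumed.

```csharp
using System;

class GapTolerantConsumer
{
    private long _lastCheckpoint;
    private readonly Func<string, bool> _bucketPredicate;
    private readonly Action<object> _onNext;

    public GapTolerantConsumer(long lastCheckpoint, Func<string, bool> bucketPredicate, Action<object> onNext)
    {
        _lastCheckpoint = lastCheckpoint;
        _bucketPredicate = bucketPredicate;
        _onNext = onNext;
    }

    public void Handle(long checkpoint, string bucketId, DateTime commitStampUtc, object commit)
    {
        var isContiguous = checkpoint == _lastCheckpoint + 1;
        var hasAged = DateTime.UtcNow - commitStampUtc >= TimeSpan.FromSeconds(5);

        // Transient error: the missing checkpoint may still become visible.
        if (!isContiguous && !hasAged)
            throw new TransientSequenceException($"Expected {_lastCheckpoint + 1}, got {checkpoint}; retrying.");

        _lastCheckpoint = checkpoint;     // tracked across all buckets
        if (_bucketPredicate(bucketId))   // only interesting buckets are projected
            _onNext(commit);
    }
}

class TransientSequenceException : Exception
{
    public TransientSequenceException(string message) : base(message) { }
}
```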

fschmied commented

I believe this is caused by READ COMMITTED SNAPSHOT, which seems to be enabled by default in Azure SQL (https://blogs.msdn.microsoft.com/sqlcat/2013/12/26/be-aware-of-the-difference-in-isolation-levels-if-porting-an-application-from-windows-azure-sql-db-to-sql-server-in-windows-azure-virtual-machine/) and is incompatible with NEventStore.

I wonder if creating an AzureSqlDialect using READCOMMITTEDLOCK would have resolved the issue (if that hint works on Azure SQL).
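
For anyone wanting to check their own database, this standard query (valid on both SQL Server and Azure SQL) reports whether READ_COMMITTED_SNAPSHOT is enabled; the connection string is a placeholder:

```csharp
using System;
using System.Data.SqlClient;

class IsolationCheck
{
    static void Main()
    {
        using (var conn = new SqlConnection("<your connection string>"))
        {
            conn.Open();
            using (var cmd = new SqlCommand(
                "SELECT is_read_committed_snapshot_on FROM sys.databases WHERE name = DB_NAME()", conn))
            {
                Console.WriteLine($"READ_COMMITTED_SNAPSHOT on: {(bool)cmd.ExecuteScalar()}");
            }
        }
    }
}
```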

fschmied commented

I just did a bit of experimentation showing that adding the WITH (READCOMMITTEDLOCK) table hint to NEventStore's queries would probably solve the problem observed by @adamfur: it reintroduces the blocking behavior of the polling client normally seen under SQL Server but lost under Azure SQL.

fschmied commented Aug 9, 2019

We've created a subclass of MsSqlDialect that adds the READCOMMITTEDLOCK table hint for Azure SQL, and it seems to fix the main problem.
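
A sketch of the approach, with the caveat that the member names are assumptions: it assumes the dialect exposes the paging query as a virtual string property (called GetCommitsFromCheckpoint here) and that the query's FROM clause matches the Replace target, neither of which may hold for a given NEventStore version.

```csharp
using NEventStore.Persistence.Sql.SqlDialects;

public class AzureSqlDialect : MsSqlDialect
{
    // Assumed member name; check your NEventStore version for the actual
    // virtual property that exposes the checkpoint paging query.
    public override string GetCommitsFromCheckpoint
    {
        get
        {
            // Reintroduce SQL Server's blocking READ COMMITTED behavior,
            // which Azure SQL replaces with READ COMMITTED SNAPSHOT.
            // "FROM Commits" is illustrative; match the actual FROM clause.
            return base.GetCommitsFromCheckpoint
                       .Replace("FROM Commits", "FROM Commits WITH (READCOMMITTEDLOCK)");
        }
    }
}
```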

What remains is the very low likelihood of #21 occurring, but I think no one has ever actually seen this in production.
