DB-8936 Fix potential gaps in sequence generation #5677
base: master
Conversation
jenkins please test branch @cdh6.3.0,skipTestsLongerThan2Minutes
Hi, I noticed a similar behavior in https://splicemachine.atlassian.net/browse/DB-11889.
Here's the summary: SEQUENCEs are not guaranteed to be incremented by exactly `incrementValue`, but by at least `incrementValue`.
In general, this is not really a "bug", as it makes sense not to write every change of the current sequence value to disk. This roughly corresponds to the CACHE option in SQL Server; see CREATE SEQUENCE (Transact-SQL) - SQL Server / Cache management, which also explains why caching is necessary (there is a NO_CACHE option, but it has a big performance impact).
In our code, we have an implicit CACHE 10000 behavior that is configurable only globally for all sequences (with the option `splice.sequence.allocationBlockSize`).
That means every time the cache gets invalidated (e.g. you wait one minute, restart the cluster, or perhaps hit what you found here), the sequence jumps to the start of the next 10k block.
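As a toy illustration of that jump (the numbers are assumed, using the default 10,000 block size mentioned above):

```java
public class SequenceCacheGapDemo {
    public static void main(String[] args) {
        long blockSize = 10_000;   // assumed splice.sequence.allocationBlockSize
        long blockStart = 1;
        long lastUsed = 950;       // values 1..950 consumed from block [1, 10000]

        // A cache invalidation (timeout, restart, query on another node)
        // discards the remainder of the current block; the next value comes
        // from the start of the next block.
        long nextValue = blockStart + blockSize;   // 10001
        long gap = nextValue - lastUsed - 1;       // 9050 values skipped
        System.out.println("next value: " + nextValue + ", gap size: " + gap);
    }
}
```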
```java
updateLock.lock();
if (remaining.getAndDecrement() > 0)
    return;
rwLock.readLock().unlock();
```
Wanting to understand why we can't simply upgrade the read lock to a write lock, i.e. why the `readLock().unlock()` is necessary, I came across this SO thread, which made me worried this might block forever. Did you double-check that?
It is not possible to directly upgrade a read lock to a write lock without unlocking the read lock first. I think this is common behavior not only in Java but in virtually all read-write lock implementations. The problem is that multiple threads could try to upgrade simultaneously, which makes the scenario very hard to get right. Downgrading is simple because the current thread knows it's the only one doing so.
As for the behavior described in the SO thread, I wouldn't worry too much until we actually see it. It looks like a Java bug that happens rarely. If it happened very often, the `ReentrantReadWriteLock` implementation would be basically buggy and unusable, which doesn't seem to be the case in practice.
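A minimal sketch of why a direct upgrade cannot succeed (class and variable names are illustrative; a plain `writeLock().lock()` is avoided here because it would hang):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockUpgradeDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        rwLock.readLock().lock();
        // A thread holding the read lock cannot acquire the write lock of the
        // same ReentrantReadWriteLock; a blocking lock() here would deadlock.
        boolean upgraded = rwLock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println("direct upgrade succeeded: " + upgraded);  // false

        // The safe pattern: release the read lock first, then take the write
        // lock (re-checking any guarded state, since another writer may have run).
        rwLock.readLock().unlock();
        boolean acquired = rwLock.writeLock().tryLock();
        System.out.println("acquire after unlock succeeded: " + acquired);  // true
        rwLock.writeLock().unlock();
    }
}
```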
Good PR. The logic in `AbstractSequence` is very subtle, and I think it should be simplified and unit tested. I am not sure whether "upgrading" a `ReentrantReadWriteLock` could cause a deadlock, so I am leaving that question to more experienced peers.
Yes, that's the thing. So gaps are bound to happen, similar to other vendors.
+1
Short Description
This change fixes a sporadic failure in `GeneratedColumnIT.testMultiRowInsert`.
3.1 PR: #5678, #5689
3.0 PR: #5679
Problem description
DB-8936 describes an issue with a generated auto-increment column:
Although `c1` is defined as starting at 1 and incrementing by 1, the following query might see gaps in the sequence generated for `c1`:
This issue happens rarely. There are four reports so far, three of which happened within a span of 9 months. Running the IT without modification against a standalone cluster on my laptop produces a failure roughly 1-2 times per day.
Background
Identity column
When inserting data into table `T`, the generated auto-increment column `c1` is first handled in the binding phase, in method `enhanceAndCheckForAutoincrement`. The `ResultColumn` for this column is marked as auto-increment. Later, in code generation, this flag is used to generate the auto-increment logic accordingly. In the end, the generated code calls the `increment()` method in `InsertOperation` to get the next value for a row. This `increment()` method eventually calls `getNext()` in `AbstractSequence`.
The IT asserts that `c1` values sorted ascendingly satisfy `next == lastValue + 1`. Since this fails sporadically, there are two possibilities:
Sequence generator
For a generated sequence of values, we always allocate a block of values at once and read from it. Once the block is consumed, we allocate another block. As an example:
With this in mind, let's take a look at how `AbstractSequence.getNext()` is implemented:
The count of values still available in the current block is stored in `remaining`. Each time a thread comes for a value, it decrements `remaining` and checks whether any values were left. If not, it needs to allocate a new block. Otherwise, it proceeds by doing `currPosition.getAndAdd(incrementSteps)`.
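A simplified, single-node model of this block-allocation scheme (a sketch, not the actual `AbstractSequence` implementation: `nextBlockStart` stands in for the persisted allocator, and `synchronized` replaces the real `updateLock`/`rwLock` pair):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of block-based sequence generation: values are handed out
 *  from a pre-allocated block, and a new block is reserved only when the
 *  current one is exhausted. */
class BlockedSequence {
    private final int blockSize;
    private final long incrementSteps = 1;
    private final AtomicInteger remaining = new AtomicInteger(0);
    private final AtomicLong currPosition = new AtomicLong();
    private long nextBlockStart = 1;   // stand-in for the durably stored allocator

    BlockedSequence(int blockSize) { this.blockSize = blockSize; }

    long getNext() {
        if (remaining.getAndDecrement() <= 0) {
            allocateBlockIfNeeded();   // current block exhausted: reserve the next
        }
        return currPosition.getAndAdd(incrementSteps);
    }

    private synchronized void allocateBlockIfNeeded() {
        if (remaining.get() > 0) return;    // another thread already allocated
        currPosition.set(nextBlockStart);   // the real code persists this write
        nextBlockStart += blockSize;
        remaining.set(blockSize - 1);       // one value goes to the current caller
    }
}
```

With `blockSize = 3`, successive `getNext()` calls return 1, 2, 3, 4, 5, … as long as no invalidation occurs between blocks.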
RCA
Although `getNext()` has a bad smell of not being atomic as a whole, the IT doesn't fail because of this. Adding traces in `getNext()` showed that a single insert statement in this IT runs on a single region server, with a single thread retrieving next values from `c1`'s sequence. At least in this IT, there is no concurrency in getting the next value.
The IT runs two insert statements for 13 iterations. When it fails, the gap always occurs when the following insert statement suddenly goes to Spark in iteration 11 or 12 (the first iteration is 0):
When it happens in iteration 10, the gap starts at 40960 (the next value is not 40961 but 50001). When it happens in iteration 11, the gap starts at 81910 (the next value is not 81911 but 90001).
What's happening is that all previous iterations ran on OLTP, so the sequence blocks were allocated on region server 0. The default block size is 10000, so the last block allocated on RS 0 is [40001, 50000] or [80001, 90000]. But at the end of those iterations, only 950 or 1900 values have been retrieved from that last block.
If the next insert statement goes to Spark, the Spark executor has no way to retrieve sequence values allocated on RS 0. By design, it allocates a new block for its own use, namely block [50001, 60000] or [90001, 100000] depending on the iteration. Thus comes the gap.
But why does the gap start at 40960 or 81910, not at 40950 or 81900, the row counts when the previous iterations finished? Because even in iteration 11 or 12, the following insert always stays on OLTP:
Only the `insert into T select ...` goes to OLAP.
Now comes the question: table `T` doesn't have statistics, so why does the insert statement go to Spark? Shouldn't the default row count of 20 be used in generating a plan?
It turns out that the default row count of 20 is only used when the estimated heap size used by the table is small enough. In `RegionLoadStatistics`, we do the following: if the heap size is smaller than 1 MB, we use the default row count of 20; otherwise, we estimate the row count. When the IT fails, the insert statement that goes to Spark has the following plan:
It estimates the table as having 2529019 rows! In reality, this table has at most 163820 rows. That estimate does exceed the 20K Spark row count limit, though.
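The heuristic can be sketched as follows (a hypothetical reconstruction, not the actual `RegionLoadStatistics` code; the average-row-size parameter is an assumption for illustration):

```java
public class RowCountEstimate {
    static final long ONE_MB = 1024 * 1024;
    static final long DEFAULT_ROW_COUNT = 20;

    /** Sketch of the described heuristic: fall back to the default row count
     *  only when the table's estimated heap size is small. */
    static long estimateRowCount(long heapSizeBytes, long avgRowSizeBytes) {
        if (heapSizeBytes < ONE_MB) {
            return DEFAULT_ROW_COUNT;          // tiny table: use the default
        }
        return heapSizeBytes / avgRowSizeBytes; // otherwise derive rows from heap size
    }

    public static void main(String[] args) {
        System.out.println(estimateRowCount(512 * 1024, 100));     // small: 20
        System.out.println(estimateRowCount(256L * ONE_MB, 100));  // large estimate
    }
}
```

An inflated heap-size estimate thus translates directly into an inflated row count, which can push the plan over the Spark threshold.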
Final question: why does this happen sporadically?
I don't have a good answer for this yet. What I can show is that, based on the failing-case traces I have collected so far, the insert statement that goes to Spark always does so right after a memtable flush:
I guess exactly when this flush happens is not in our control, and when it happens at a specific time, something can go wrong.
In the end, it looks like each component works perfectly fine by its own design and nothing is really wrong. But put together, there is an edge case with non-deterministic behaviour.
Potential concurrency issue within a RS
We have seen that the implementation of `getNext()` has a bad smell because, semantically, decreasing `remaining` and modifying `currPosition` must be one atomic operation. With the current implementation, other threads can modify `currPosition` in between. That would mean our thread reserved a value from the current block (by decreasing `remaining`) but ends up reading a value from another block. If this happens, the symptom would not be duplicate values, but a gap in the sequence.
Note that this gap is not exactly the same as the gap we saw above. There, as long as we continue inserting values, old gaps are closed and new gaps open. Here, if this concurrency issue happens, the gap is permanent. The values in between are lost forever.
Can `currPosition` be modified in between? Here is how `allocateBlock()` is implemented:
Consider the following scenario (previous value == 99):
The previous value is 99, but t1 retrieves the next value as 101. Further threads continue retrieving values from the new block. Consequently, value 100 is missing.
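The interleaving can be replayed deterministically in a single thread (a simulation of the described race, not the actual code; the state values are assumed to match the scenario above):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SequenceRaceDemo {
    public static void main(String[] args) {
        // State just before the race: one value (100) is left in block [1, 100].
        AtomicInteger remaining = new AtomicInteger(1);
        AtomicLong currPosition = new AtomicLong(100);

        // t1 reserves the last value of the current block...
        long t1Reserved = remaining.getAndDecrement();   // returns 1: value reserved
        // ...but before t1 reads currPosition, t2 finds the block exhausted
        // and allocates a new one:
        long t2Check = remaining.getAndDecrement();      // returns 0: t2 must allocate
        currPosition.set(101);                           // t2 installs block [101, 200]
        remaining.set(100);                              // t2 resets the counter

        // t1 resumes and reads from the NEW block instead of the old one:
        long t1Value = currPosition.getAndAdd(1);        // 101, not the reserved 100
        System.out.println("t1 got " + t1Value + "; value 100 is lost forever");
    }
}
```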
This can happen because the operations in `getNext()` are not protected under the `updateLock`.
One more note
An insert statement with multiple row values inserts the rows concurrently. For the following insert:
There is no guarantee that the insertion order is 1, 2, 3. Since the three rows can be inserted concurrently, the result order could be anything. This means the auto-increment values for `c1` can also be retrieved concurrently within a region server. That's why, to assert the result order, we need an `ORDER BY c1` clause in the select statement. There used to be an issue with retrieving auto-increment values concurrently (duplicates in the sequence); it was fixed in DB-7040. This part works fine now.
How to test
`GeneratedColumnIT.testMultiRowInsert` should not fail sporadically anymore.