
DB-8936 Fix potential gaps in sequence generation #5677

Open
wants to merge 3 commits into base: master

Conversation

Contributor

@ascend1 ascend1 commented Jul 15, 2021

Short Description

This change fixes a sporadic failure in GeneratedColumnIT.testMultiRowInsert.

3.1 PR: #5678 #5689.
3.0 PR: #5679.

Problem description

DB-8936 describes an issue with a generated auto-increment column:

create table T (
  c1 int generated always as identity(start with 1,increment by 1),
  c2 int,
  primary key(c1)
);

Although c1 is defined as starting at 1 and incrementing by 1, the following query might see gaps in the sequence generated for c1:

SELECT * FROM T order by c1;

This issue happens rarely. There are four reports so far, three of which occurred within a period of 9 months. Running the unmodified IT against a standalone cluster on my laptop produces a failure roughly 1-2 times per day.

Background

Identity column

When inserting data into table T, the generated auto-increment column c1 is first handled in the binding phase, in method enhanceAndCheckForAutoincrement. The ResultColumn for this column is marked as auto-increment. Later, in code generation, this flag is used to generate the auto-increment logic accordingly. In the end, the generated code calls the increment() method in InsertOperation to get the next value for a row, which eventually calls getNext() in AbstractSequence.
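As a rough mental model, the call chain looks like the toy sketch below. Only increment() and getNext() correspond to the real method names mentioned above; the classes and fields here are illustrative, not the actual SpliceMachine code.

// Toy model of the call chain; only increment()/getNext() mirror real method names.
final class ToySequence {
    private long current = 0;
    // Stands in for AbstractSequence.getNext(); the real method serves values
    // from pre-allocated blocks, as described below.
    synchronized long getNext() { return ++current; }
}

final class ToyInsertOperation {
    private final ToySequence identitySequence;   // backs column c1
    ToyInsertOperation(ToySequence seq) { this.identitySequence = seq; }

    // The generated insert code calls this once per inserted row.
    long increment() { return identitySequence.getNext(); }
}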

The IT asserts that the c1 values, sorted in ascending order, satisfy next == lastValue + 1. Since this fails sporadically, there are two possibilities:

  • next == lastValue
  • next > lastValue + 1

Sequence generator

For a generated sequence, we always allocate a block of values at once and read from it. Once the block is consumed, we allocate another block. As an example:

Seq: start with 1,increment by 1
1st RS: allocateBlock() -> [1, 10000]
2nd RS: allocateBlock() -> [10001, 20000]

Later when the first RS has consumed all of the values in its block:
1st RS: allocateBlock() -> [20001, 30000]
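A minimal sketch of this block-allocation scheme (illustrative only: the block size stands in for the configurable allocation block size, and the real implementation persists the high-water mark instead of keeping it in memory):

import java.util.concurrent.atomic.AtomicLong;

// Toy allocator: each caller (e.g. a region server) grabs a whole block of
// sequence values and serves them locally until the block runs out.
final class ToyBlockAllocator {
    private final AtomicLong highWaterMark;   // next unallocated value; persisted in reality
    private final long blockSize;

    ToyBlockAllocator(long startWith, long blockSize) {
        this.highWaterMark = new AtomicLong(startWith);
        this.blockSize = blockSize;
    }

    // Returns [firstValue, lastValue] of a freshly allocated block.
    long[] allocateBlock() {
        long first = highWaterMark.getAndAdd(blockSize);
        return new long[]{first, first + blockSize - 1};
    }
}

// ToyBlockAllocator a = new ToyBlockAllocator(1, 10_000);
// a.allocateBlock();  // [1, 10000]     -> 1st RS
// a.allocateBlock();  // [10001, 20000] -> 2nd RS
// a.allocateBlock();  // [20001, 30000] -> 1st RS again, once its block is used up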

With this in mind, let's take a look at how AbstractSequence.getNext() is implemented:

// class member fields
protected final AtomicLong remaining=new AtomicLong(0l);
protected final AtomicLong currPosition=new AtomicLong(0l);
...

public long getNext() throws StandardException {
    if(remaining.getAndDecrement()<=0)
        allocateBlock(false);
    return currPosition.getAndAdd(incrementSteps);
}

The number of values still available in the current block is stored in remaining. When a thread asks for a value, it decrements remaining and checks whether the pre-decrement value was non-positive. If so, it needs to allocate a new block. Otherwise, it proceeds with currPosition.getAndAdd(incrementSteps).
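For reference, in the single-threaded case this pair of atomics behaves as expected. A toy walk-through, assuming incrementSteps == 1 and a freshly allocated block [1, 3]:

import java.util.concurrent.atomic.AtomicLong;

// Toy single-threaded walk-through of the remaining/currPosition pair.
public class SequenceWalkthrough {
    public static void main(String[] args) {
        AtomicLong remaining = new AtomicLong(0);
        AtomicLong currPosition = new AtomicLong(0);

        // allocateBlock(): pretend we fetched the block [1, 3]
        remaining.set(3);
        currPosition.set(1);

        for (int i = 0; i < 3; i++) {
            if (remaining.getAndDecrement() <= 0) {
                // a real implementation would allocate a new block here
            }
            System.out.println(currPosition.getAndAdd(1));   // prints 1, 2, 3
        }
        // The next call would see a pre-decrement value of 0, so it must allocate.
    }
}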

RCA

Although getNext() has a bad smell of not being atomic as a whole, the IT doesn't fail because of this. Adding traces to getNext() showed that a single insert statement in this IT runs on a single region server, with a single thread retrieving next values from c1's sequence. At least in this IT, there is no concurrency in getting the next value.

The IT runs two insert statements for 13 iterations. When it fails, the gap always occurs when the following insert statement suddenly goes to Spark in iteration 11 or 12 (the first iteration is 0):

insert into T(c2) select c1 from T;

When it happens in iteration 10, the gap starts at 40960 (next value is not 40961 but 50001). When it happens in iteration 11, the gap starts at 81910 (next value is not 81911 but 90001).

What's happening is that all previous iterations ran on OLTP, so the sequence blocks were all allocated on region server 0. The default block size is 10000, so the last block allocated on RS 0 is [40001, 50000] or [80001, 90000]. But at the end of those iterations, only 950 or 1900 values have been retrieved from the last block.

If the next insert statement goes to Spark, the Spark executor has no way to retrieve the sequence values already allocated on RS 0. By design, it allocates a new block of its own, namely block [50001, 60000] or [90001, 100000] depending on the iteration. Thus the gap.

But why does the gap start at 40960 or 81910, not at 40950 or 81900, the row counts when the previous iterations finished? Because even in iteration 11 or 12, the following insert always runs on OLTP:

insert into T(c2) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

Only the insert into T select ... goes to OLAP.

Now comes the question: table T doesn't have statistics, so why does the insert statement go to Spark? Shouldn't the default row count of 20 be used when generating a plan?

It turns out that the default row count of 20 is only used when the estimated heap size of the table is small enough. In RegionLoadStatistics, we do the following:

    if(regionLoad==null){
        heapSize =partitionMaxFileSize;
    }else {
        heapSize = regionLoad.getStorefileSize()+regionLoad.getMemStoreSize();
        // If size is less than 1 MB truncate it so that the logic doesn't change for small tables
        // (the size used to be reported in increments of 1 MB)
        if (heapSize < 1024*1024)
            heapSize = 0;
        rowSizeRatio = ((double)heapSize)/partitionMaxFileSize;
    }
    long fbRegionRowCount = config.getFallbackRegionRowCount();
    long fbMinRowCount = config.getFallbackMinimumRowCount();
    long numRows = (long)(fbRegionRowCount*rowSizeRatio);
    if(numRows<fbMinRowCount)
        numRows = fbMinRowCount;
    if(heapSize==0){
        heapSize = numRows*config.getFallbackRowWidth();
    }

If the heap size is smaller than 1 MB, we use the default row count of 20. Otherwise, we estimate the row count from the heap size. When the IT fails, the insert statement that goes to Spark has the following plan:

 191685 2021-07-15 19:22:17,357 (DRDAConnThread_240) INFO  [c.s.d.i.a.PlanPrinter] - Plan nodes for query <<
 191686         insert into GENERATEDCOLUMNIT.T3(c2) select c1 from GENERATEDCOLUMNIT.T3
 191687 >>Insert(n=4,totalCost=30380.925,insertedRows=2529019,targetTable=GENERATEDCOLUMNIT.T3,engine=OLAP (cost))
 191688   ->  NormalizeResult(n=4,totalCost=30380.925,outputRows=2529019,outputHeapSize=0 B,partitions=1,parallelTasks=1)
 191689     ->  ProjectRestrict(n=3,totalCost=5062.038,outputRows=2529019,outputHeapSize=2.412 MB,partitions=1,parallelTasks=3)
 191690       ->  TableScan[T3(28352)](n=0,totalCost=5062.038,scannedRows=2529019,outputRows=2529019,outputHeapSize=2.412 MB,partitions=1,parallelTasks=3)

It estimates the table as having 2529019 rows! In reality, this table has at most 163820 rows, though even that exceeds the 20K row count threshold for going to Spark.
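To see how a modest store file can inflate the estimate, here is the same arithmetic as the RegionLoadStatistics snippet above, re-run with illustrative numbers. The constants below are assumptions for the example, not the actual configuration defaults.

// Illustrative re-run of the estimation logic above; all constants are assumed values.
public class RowEstimateExample {
    public static void main(String[] args) {
        final long MB = 1024L * 1024;
        long partitionMaxFileSize = 10L * 1024 * MB;   // assume a 10 GB max region size
        long fbRegionRowCount     = 5_000_000L;        // assumed fallback row count for a full region
        long fbMinRowCount        = 20L;               // the "default row count 20"

        // Case 1: store files + memstore report less than 1 MB -> estimate stays at 20 rows.
        System.out.println(estimate(500 * 1024, partitionMaxFileSize, fbRegionRowCount, fbMinRowCount));

        // Case 2: right after a flush, a few hundred MB of store files exist
        // -> the estimate jumps far beyond the 20K OLAP threshold.
        System.out.println(estimate(512 * MB, partitionMaxFileSize, fbRegionRowCount, fbMinRowCount));
    }

    static long estimate(long heapSize, long partitionMaxFileSize, long fbRegionRowCount, long fbMinRowCount) {
        if (heapSize < 1024 * 1024) heapSize = 0;                          // truncate small tables
        double rowSizeRatio = ((double) heapSize) / partitionMaxFileSize;  // fraction of a "full" region
        long numRows = (long) (fbRegionRowCount * rowSizeRatio);
        return Math.max(numRows, fbMinRowCount);
    }
}

With these assumed values, case 1 yields 20 rows while case 2 yields 250000 rows, comfortably past the threshold for choosing OLAP.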

Final question: why does this happen sporadically?

I don't have a good answer for this yet. What I can show is that, based on the failing-case traces I have collected so far, the insert statement that goes to Spark always does so right after a memstore flush:

 191589 2021-07-15 19:22:06,585 (MemStoreFlusher.0) INFO  [o.a.h.h.r.MemStoreFlusher] - Flush of region splice:28128,,1626369668021.e02d79dc7ed50364ea3ca894579d5984. due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap size=729.9 MTotal Memstore Off-Heap size=0, Region memstore size=113.4 M
 191590 2021-07-15 19:22:06,585 (MemStoreFlusher.0) INFO  [o.a.h.h.r.HRegion] - Flushing 1/1 column families, dataSize=28.02 MB heapSize=113.36 MB
 191591 2021-07-15 19:22:06,600 (MemStoreFlusher.0) INFO  [c.s.s.i.s.AbstractSICompactionScanner] - Starting Compaction-resolution-throttle-a8113e51-3527-4a54-a646-1f5015e795c4
 ...
 191597 2021-07-15 19:22:09,271 (JvmPauseMonitor) INFO  [o.a.h.h.u.JvmPauseMonitor] - Detected pause in JVM or host machine (eg GC): pause of approximately 1754ms
 191598 GC pool 'G1 Young Generation' had collection(s): count=2 time=72ms
 191599 GC pool 'G1 Old Generation' had collection(s): count=1 time=2150ms
 191600 2021-07-15 19:22:09,271 
 ...
 191605 2021-07-15 19:22:09,272 (MemStoreFlusher.1) INFO  [o.a.h.h.r.MemStoreFlusher] - Flush of region splice:28016,,1626369658484.dc69f753d45b7c51cacc6d2e74be958c. due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap size=735.2 MTotal Memstore Off-Heap size=0, Region memstore size=113.4 M
 191606 2021-07-15 19:22:09,272 (MemStoreFlusher.1) INFO  [o.a.h.h.r.HRegion] - Flushing 1/1 column families, dataSize=28.02 MB heapSize=113.36 MB
 ...
 191684 2021-07-15 19:22:17,355 (RpcServer.default.FPBQ.Fifo.handler=42,queue=2,port=60020) DEBUG [c.s.h.RegionSizeEndpoint] - computeRegionSize
 191685 2021-07-15 19:22:17,357 (DRDAConnThread_240) INFO  [c.s.d.i.a.PlanPrinter] - Plan nodes for query <<
 191686         insert into GENERATEDCOLUMNIT.T3(c2) select c1 from GENERATEDCOLUMNIT.T3
 191687 >>Insert(n=4,totalCost=30380.925,insertedRows=2529019,targetTable=GENERATEDCOLUMNIT.T3,engine=OLAP (cost))
 191688   ->  NormalizeResult(n=4,totalCost=30380.925,outputRows=2529019,outputHeapSize=0 B,partitions=1,parallelTasks=1)
 191689     ->  ProjectRestrict(n=3,totalCost=5062.038,outputRows=2529019,outputHeapSize=2.412 MB,partitions=1,parallelTasks=3)
 191690       ->  TableScan[T3(28352)](n=0,totalCost=5062.038,scannedRows=2529019,outputRows=2529019,outputHeapSize=2.412 MB,partitions=1,parallelTasks=3)

I guess exactly when this flush happens is not under our control, and when it happens at a particular moment, something can go wrong.

In the end, it looks like each component works fine by its own design and nothing is really wrong. Put together, however, there is an edge case with non-deterministic behaviour.

Potential concurrency issue within a RS

We have seen that the implementation of getNext() has a bad smell because, semantically, decrementing remaining and advancing currPosition must be one atomic operation. With the current implementation, other threads can modify currPosition in between. That would mean our thread reserved a value from the current block (by decrementing remaining) but ends up reading a value from another block. If this happens, the symptom would not be duplicate values, but a gap in the sequence.

Note that this gap is not exactly the same as the gap we saw above. There, as long as we continue inserting values, old gaps are closed and new gaps open. Here, if this concurrency issue happens, the gap is permanent: the values in between are lost forever.

Can currPosition be modified in between? Here is how allocateBlock() is implemented:

while(!success){
    updateLock.lock();
    try{
        // re-check under the lock: another thread may have allocated a new block
        // while we were waiting, in which case a value is still available for us
        if(remaining.getAndDecrement()>0)
            return;
        currPosition.set(getCurrentValue());

        // allocate another block
        ...
    }catch(IOException e){
        throw Exceptions.parseException(e);
    }finally{
        updateLock.unlock();
    }
}

Consider the following scenario (previous value == 99):

            getNext()                                       getNext()
t1    decrease remaining > 0                            currPosition += 1
    (last value in block == 100)                            return 101
               |                                                |
time --------------------------------------------------------------------->
                  |                 |        |             |
              getNext()            lock  remaining   set currPosition
t2       decrease remaining == 0          test OK         == 101
        (no more value, allocate)

The previous value is 99, but t1 retrieves 101 as the next value. Further threads continue retrieving values from the new block. Consequently, value 100 is missing.

This can happen because the operations in getNext() are not protected under the updateLock.
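One way to close this race (a sketch of the general idea only, not the exact change in this PR) is to pair the decrement-and-advance with block allocation through a read-write lock: the fast path runs under the read lock and allocation runs under the write lock, so currPosition can no longer be reset between a thread's reservation and its read.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: the decrement of `remaining` and the advance of `currPosition`
// happen under the read lock, while block allocation needs the write lock, so a
// thread can no longer read a position belonging to a block it never reserved.
abstract class SketchSequence {
    protected final AtomicLong remaining = new AtomicLong(0L);
    protected final AtomicLong currPosition = new AtomicLong(0L);
    protected final long incrementSteps = 1L;
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public long getNext() throws IOException {
        while (true) {
            rwLock.readLock().lock();
            try {
                // Fast path: reserve a value and read the position while no
                // allocation can run (allocation requires the write lock).
                if (remaining.getAndDecrement() > 0)
                    return currPosition.getAndAdd(incrementSteps);
            } finally {
                rwLock.readLock().unlock();
            }
            // Block exhausted: allocate under the write lock, then retry.
            rwLock.writeLock().lock();
            try {
                if (remaining.get() <= 0)   // re-check: another thread may have allocated already
                    allocateBlock();        // refills `remaining` and resets `currPosition`
            } finally {
                rwLock.writeLock().unlock();
            }
        }
    }

    // Fetches the next block from persistent storage (not shown here).
    protected abstract void allocateBlock() throws IOException;
}

The actual change in this PR works with the existing updateLock/rwLock fields (see the diff excerpt quoted in the review comments below); the sketch is only meant to show why mutual exclusion between the fast path and allocation closes the gap.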

One more note

An insert statement with multiple row values inserts the rows concurrently. For the following insert:

insert into T(c2) values (1), (2), (3);

There is no guarantee that the insertion order is 1, 2, 3. Since the three rows can be inserted concurrently, the result order could be anything. This means the auto-increment values for c1 can also be retrieved concurrently within a region server. That's why, to assert the result order, we need an ORDER BY c1 clause in the select statement. There used to be an issue with retrieving the auto-increment values concurrently (duplicates in the sequence), which was fixed in DB-7040. This part works fine now.
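A small standalone illustration (not the IT's code) of why asserting on order requires sorting: concurrent callers obtain unique values, but their arrival order is nondeterministic.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Three "rows" fetch identity values concurrently; the values are unique but the
// arrival order is not deterministic, so assertions must sort (ORDER BY c1) first.
public class ConcurrentIdentityOrder {
    public static void main(String[] args) throws Exception {
        AtomicLong sequence = new AtomicLong(0);
        List<Long> observed = Collections.synchronizedList(new ArrayList<>());

        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> observed.add(sequence.incrementAndGet()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("arrival order: " + observed);   // e.g. [2, 1, 3]
        Collections.sort(observed);                          // the ORDER BY c1 analogue
        System.out.println("sorted:        " + observed);    // always [1, 2, 3]
    }
}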

How to test

GeneratedColumnIT.testMultiRowInsert should not fail sporadically anymore.

@ascend1
Contributor Author

ascend1 commented Jul 15, 2021

jenkins please test branch @cdh6.3.0,skipTestsLongerThan2Minutes

@ascend1
Contributor Author

ascend1 commented Jul 15, 2021

jenkins please test branch @cdh6.3.0,skipTestsLongerThan2Minutes

Member

@martinrupp martinrupp left a comment


Hi, I noticed a similar behavior in https://splicemachine.atlassian.net/browse/DB-11889.

Here's the summary: SEQUENCES are not guaranteed to be incremented by exactly incrementValue, but by at LEAST incrementValue.

In general, this is not really a “bug”, as it makes sense not to write every change of the current sequence value to disk. This roughly corresponds to the CACHE option in SQL Server, see CREATE SEQUENCE (Transact-SQL) - SQL Server / Cache management. See also there for an explanation of why this is necessary (they have a NO_CACHE option, but it has a big performance impact).

In our code, we have an implicit CACHE 10000 behavior that is configurable only globally, for all sequences (with the option splice.sequence.allocationBlockSize).
That means every time the cache gets invalidated (e.g. you wait one minute, restart the cluster, or maybe something you found here), the cache gets reset to the next 10k block.

updateLock.lock();
if(remaining.getAndDecrement()>0)
    return;
rwLock.readLock().unlock();
Contributor


I wanted to get a better understanding of why we can't simply upgrade a read lock to a write lock, i.e. why the readLock().unlock() is necessary. I came across this SO thread, which made me worried that this might block forever. Did you double-check that?

Contributor Author


It is not possible to directly upgrade a read lock to a write lock without unlocking the read lock first. I think this is common behavior not only in Java but in pretty much any read-write lock implementation. The problem is that multiple threads could try to upgrade simultaneously, which makes the scenario very hard to get correct. Downgrading is simple because the current thread knows it's the only one doing so.

As for the behavior described in the SO thread, I wouldn't worry too much until we actually see it. It looks like a rare Java bug. If it happened very often, the ReentrantReadWriteLock implementation would basically be buggy and unusable, which doesn't seem to be the case in practice.
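For illustration (a standalone sketch, not code from this PR): with ReentrantReadWriteLock, a thread that already holds the read lock and then requests the write lock blocks forever, because the write lock waits for all readers to release, including the requesting thread itself. Hence the unlock-then-relock-and-recheck pattern.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Standalone illustration of why a read lock cannot be upgraded in place.
public class LockUpgradeDemo {
    public static void main(String[] args) {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        rwLock.readLock().lock();
        // rwLock.writeLock().lock();   // would block forever: the write lock
                                        // waits for all readers, including this thread

        // The workable pattern: release the read lock, take the write lock,
        // and re-check the condition, since another thread may have acted in between.
        rwLock.readLock().unlock();
        rwLock.writeLock().lock();
        try {
            // re-validate state here before modifying it
        } finally {
            rwLock.writeLock().unlock();
        }
        System.out.println("done");
    }
}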

Contributor

@hatyo hatyo left a comment


Good PR. The logic in AbstractSequence is very subtle and I think it should be simplified and unit tested. I am not sure whether "upgrading" a ReentrantReadWriteLock could cause a deadlock, so I am leaving that for more experienced peers to look into.

@ascend1
Contributor Author

ascend1 commented Jul 20, 2021

Hi, I noticed a similar behavior in https://splicemachine.atlassian.net/browse/DB-11889.

Here's the summary: SEQUENCES are not guaranteed to be incremented by exactly incrementValue, but by at LEAST incrementValue.

In general, this is not really a “bug”, as it makes sense not to write every change of the current sequence value to disk. This roughly corresponds to the CACHE option in SQL Server, see CREATE SEQUENCE (Transact-SQL) - SQL Server / Cache management. See also there for an explanation of why this is necessary (they have a NO_CACHE option, but it has a big performance impact).

In our code, we have an implicit CACHE 10000 behavior that is configurable only globally, for all sequences (with the option splice.sequence.allocationBlockSize).
That means every time the cache gets invalidated (e.g. you wait one minute, restart the cluster, or maybe something you found here), the cache gets reset to the next 10k block.

Yes, that's the thing. So gaps are bound to happen, as with other vendors.


@carolp-503 carolp-503 left a comment


+1
