
TGM Advent Coding Contest 2019 Expert Level 3

A blazingly fast reimplementation of pdamianik/TGM-ACC-2019 level3 (in Rust of course 😉), bringing calculation times down to ~500ms for M(2*10^9).

The problem

time space
O(n^4) O(1)

The given problem is defined as:

S[0] = 290797
S[n+1] = S[n]^2 mod 50515093
Let A(i,j) = min(S[i], S[i+1], …, S[j]) for i ≤ j.
Let M(N) = ∑ A(i,j) for 1 ≤ i ≤ j ≤ N.

or in python:

def S(n: int) -> int:
    result = 290797
    for _ in range(n):
        result = result**2 % 50515093
    return result

def A(i: int, j: int) -> int:
    minS = S(i)
    for k in range(i+1, j+1):
        val = S(k)
        if val < minS:
            minS = val
    return minS

def M(n: int) -> int:
    sumA = 0
    for j in range(1, n+1):
        for i in range(1, j+1):
            sumA += A(i, j)
    return sumA

or in Rust: see src/initial.rs.
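
What follows is a minimal sketch of what a direct Rust translation of the Python above might look like; the actual code in src/initial.rs may differ in its details.

fn s(n: u64) -> u64 {
    // S[0] = 290797, S[n+1] = S[n]^2 mod 50515093
    let mut result: u64 = 290_797;
    for _ in 0..n {
        result = result * result % 50_515_093;
    }
    result
}

fn a(i: u64, j: u64) -> u64 {
    // A(i, j) = min(S[i], ..., S[j])
    (i..=j).map(s).min().unwrap()
}

fn m(n: u64) -> u64 {
    // M(n) = sum of A(i, j) for 1 <= i <= j <= n
    (1..=n).map(|j| (1..=j).map(|i| a(i, j)).sum::<u64>()).sum()
}

fn main() {
    println!("M(10) = {}", m(10)); // expected: 432256955 (see below)
}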

In particular, we are interested in the value of M(2_000_000_000) and know that M(10) = 432256955 and M(10_000) = 3264567774119.

M(10) can be calculated fast enough with the naive implementation above (e.g. the Rust implementation takes between 50ms and 100ms to complete), but since this algorithm has a time complexity of O(n^4), anything much beyond n = 10^1 requires a different approach.

Optimizations

Caching

time space
O(n^3) O(n)

The first simple optimization can be seen in the original repository, where a cache for the S(n) function is added. This reduces the time complexity to O(n^3), which is still far from usable for our target of n = 2_000_000_000.

Because a cache for S(n) is used in later optimizations (e.g. src/rayon.rs:7), there is no dedicated caching version in the Rust implementation.
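
The following is a minimal sketch of such a cache, assuming S is simply precomputed into a Vec so that later lookups are O(1); the actual implementations keep their own variants of this (again, see src/rayon.rs:7).

fn s_cache(n: usize) -> Vec<u64> {
    let mut cache = Vec::with_capacity(n + 1);
    let mut value: u64 = 290_797; // S[0]
    cache.push(value);
    for _ in 0..n {
        value = value * value % 50_515_093;
        cache.push(value); // cache[k] == S[k]
    }
    cache
}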

Reuse previous minima and parallelize

               time    space
without cache  O(n^3)  O(1)
with cache     O(n^2)  O(n)

Note: the Rust implementation exclusively uses the cached variant (see src/rayon.rs:7 and src/parallel.rs:7).

A different way to describe the problem is as the sum of the minima of all contiguous ranges of S[]. A visualization of the naive algorithm evaluating M(4) is:

S ->  |1|2|3|4|
i = 1  _        = 1
j = 1 --------- = 1
i = 1  _ _      = min(1, 2)
i = 2    _      = 2
j = 2 --------- = min(1, 2) + 2
i = 1  _ _ _    = min(1, 2, 3)
i = 2    _ _    = min(2, 3)
i = 3      _    = 3
j = 3 --------- = min(1, 2, 3) + min(2, 3) + 3
i = 1  _ _ _ _  = min(1, 2, 3, 4)
i = 2    _ _ _  = min(2, 3, 4)
i = 3      _ _  = min(3, 4)
i = 4        _  = 4
j = 4 --------- = min(1, 2, 3, 4) + min(2, 3, 4) + min(3, 4) + 4

where each i-row is the minimum of all underlined elements and each j-row is the sum of all i-rows above it, back to (but not including) the previous j-row.

The result would be the sum of all j-rows: min(1) + min(1, 2) + min(2) + min(1, 2, 3) + min(2, 3) + min(3) + min(1, 2, 3, 4) + min(2, 3, 4) + min(3, 4) + min(4).

By reversing the order of the i-rows within each j-row we can reuse the minimum from the previous iteration of i, utilizing the fact that min(a, b, c) = min(a, min(b, c)):

S ->  |1|2|3|4|
i = 1  _        = 1
j = 1 --------- = 1
i = 2    _      = 2
i = 1  _ _      = min(1, 2)
j = 2 --------- = 2 + min(1, 2)
i = 3      _    = 3
i = 2    _ _    = min(2, 3)
i = 1  _ _ _    = min(1, min(2, 3)) // reuse min(2, 3) from the previous iteration
j = 3 --------- = min(3) + min(2, 3) + min(1, min(2, 3))
i = 4        _  = 4
i = 3      _ _  = min(3, 4)
i = 2    _ _ _  = min(2, min(3, 4)) // reuse min(3, 4) from the previous iteration
i = 1  _ _ _ _  = min(1, min(2, min(3, 4))) // reuse min(2, min(3, 4)) from the previous iteration
j = 4 --------- = min(4) + min(3, 4) + min(2, min(3, 4)) + min(1, min(2, min(3, 4)))

Each i-row can now reuse the result of the previous i-row, which results in a time complexity of O(n^2) (assuming that a lookup into S[] has a time complexity of O(1), which can be accomplished with a cache).

Additionally, every j-row can be calculated independently of the other j-rows, and therefore we can use multiple threads to handle each j-row.

These optimizations are implemented in src/rayon.rs (using rayon) and src/parallel.rs (using Rust's native threads).
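
As a rough illustration of the idea (not the exact code from either file), an O(n^2) pass over a precomputed S-cache could look like the sketch below; the rayon dependency and the function name are assumptions of this sketch.

use rayon::prelude::*;

// s is a cache with s[k] == S[k] for 1 <= k <= n (see the caching section above).
fn m_quadratic(n: usize, s: &[u64]) -> u128 {
    (1..n + 1)
        .into_par_iter() // every j-row is independent, so rayon can spread them over threads
        .map(|j| {
            let mut minimum = u64::MAX;
            let mut row_sum: u128 = 0;
            for i in (1..=j).rev() {
                minimum = minimum.min(s[i]); // min(S[i], ..., S[j]), reusing the previous i's result
                row_sum += minimum as u128;
            }
            row_sum
        })
        .sum()
}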

Make it linear

time space
O(~n) O(~1)

This algorithm iterates through S[] mostly linearly, resulting in the O(~n) time complexity. Because S[] is accessed linearly, there is no longer any need for a cache of all S[] values. Instead there is a local cache of some of the most recent S[] values, reaching back to the last global minimum. This cache makes up the O(~1) space complexity: its size does not grow with n but depends on specific properties of S[], namely that it has a global minimum and that it is periodic. Another difference to the previous algorithm is that every j-row now depends on some of the previous j-rows (at least the last one, in the worst case all of them back to the last global minimum; this is what the local cache holds). This makes parallelization useless.

In this algorithm we always explicitly add the current element to the sum (in src/linear.rs:32) as every element will be the minimum at least once, when it is the only element being checked.

The simplest case for each iteration is that the last element is smaller than the current element. In that case we can simply reuse the last element's sum. This is done in src/linear.rs:49-51.

...|i-3|i-2|i-1|i|...
     _             = i-3
 _   _             = min(..., i-3)
   --------------- = i-3 + min(..., i-3)
         _         = i-2
     _   _         = min(i-3, i-2)
 _   _   _         = min(..., i-3, i-2)
   --------------- = i-2 + min(i-3, i-2) + min(..., i-3, i-2)
             _     = i-1
         _   _     = min(i-2, i-1)
     _   _   _     = min(i-3, i-2, i-1)
 _   _   _   _     = min(..., i-3, i-2, i-1)
   --------------- = i-1 + min(i-2, i-1) + min(i-3, i-2, i-1) + min(..., i-3, i-2, i-1) = last_sum
// current iteration
                _  = i // added anyway
             _  _  = i-1 // from here on we can just use the previous rows since they will always contain an element smaller than the current one (starting with the last element)
         _   _  _  = min(i-2, i-1)
     _   _   _  _  = min(i-3, i-2, i-1)
 _   _   _   _  _  = min(..., i-3, i-2, i-1)
   --------------- = i /* added anyway */ + last_sum = new last_sum

If the current element is smaller than the last element instead, we have to backtrack through a local cache of the last elements to find a smaller one. While iterating through the local cache we keep track of a running cache minimum (starting from the last element), and with each step subtract it from the last sum and add the current element instead. This effectively replaces every minimum in the last sum that is bigger than our current element of S[] with the current element. The cache has to reach back to the last global minimum, since that element is guaranteed to be smaller than our current element (if it weren't, our current element would be a new global minimum, which is handled separately). Finally, last_sum can be reused again, just like in the previous case. This is done in src/linear.rs:35-51.

...|g|i-3|i-2|i-1|i|...
    _                = g // global minimum
 _  _                = ... * g // always g
   ----------------- = g + ... * g
       _             = i-3 // smaller than i, in local cache
    _  _             = g
 _  _  _             = ... * g
   ----------------- = i-3 + g + ... * g
           _         = i-2 // bigger than i, in local cache
       _   _         = i-3
    _  _   _         = g
 _  _  _   _         = ... * g
   ----------------- = i-2 + i-3 + g + ... * g
               _     = i-1 // bigger than i, in local cache
           _   _     = min(i-2, i-1)
       _   _   _     = i-3
    _  _   _   _     = g
 _  _  _   _   _     = ... * g
   ----------------- = i-1 + min(i-2, i-1) + i-3 + g + ... * g = last_sum
// current iteration:
                  _  = i // added anyway
               _  _  = i
           _   _  _  = i
       _   _   _  _  = i-3
    _  _   _   _  _  = g
 _  _  _   _   _  _  = ... * g
   ----------------- = i /* added anyway */ + last_sum - (i-1) + i - min(i-2, i-1) + i /* backtracking */ = new last_sum

Finally, we have to keep track of the global minimum. If our current element is smaller than or equal to the last global minimum, S[i] * (i - 1) can simply be added to the total sum and the local cache can be reset. It is important that this case also applies when the current element equals the last global minimum, as this keeps the local cache small. This is done in src/linear.rs:53-59.

// current iteration
...|i-2|i-1|g|...
            _  = g // added anyway
         _  _  = g
     _   _  _  = g
 _   _   _  _  = ... * g
   ----------- = g /* added anyway */ + g * (i - 1) /* i is the 1-based index into S, which is why we subtract one here; in practice S is indexed starting at 0, so this is just i in the code */ = new last_sum
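
The following is a generic sketch of this linear idea, using a stack of (value, count) pairs as the local cache; it omits the global-minimum reset described above, which is what src/linear.rs uses to keep the cache (and therefore the memory usage) roughly constant.

fn m_linear(n: u64) -> u128 {
    // (value, number of ranges ending at the current j whose minimum is this value)
    let mut cache: Vec<(u64, u64)> = Vec::new();
    let mut last_sum: u128 = 0; // sum of min(S[i..=j]) over all i for the previous j
    let mut total: u128 = 0;
    let mut s: u64 = 290_797; // S[0]
    for _ in 1..=n {
        s = s * s % 50_515_093; // advance to S[j]
        let mut count: u64 = 1; // the range [j, j] itself
        // Backtrack: every cached minimum that is not smaller than S[j]
        // gets replaced by S[j] in last_sum.
        while let Some(&(value, c)) = cache.last() {
            if value < s {
                break;
            }
            last_sum -= value as u128 * c as u128;
            count += c;
            cache.pop();
        }
        last_sum += s as u128 * count as u128;
        cache.push((s, count));
        total += last_sum; // every j-row contributes to M(n)
    }
    total
}

With the constants above this should reproduce the reference values M(10) = 432256955 and M(10_000) = 3264567774119.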

Abuse the cycle

time space
O(~1) O(~1)

The last two algorithms use the fact that the elements of S[] repeat and have a repeating global minimum of 3. The difference between src/cycle.rs and src/hardcoded.rs is that the former calculates the span of a cycle on the fly and uses that to extrapolate any M(n), whereas the latter has the cycle's attributes hardcoded. Both also use src/linear.rs in the background to calculate values inside and before the first iteration of the cycle.

The following is a simplified view of the cycles of S:

         6               5
|-1-|---2v--|---3---|---4v--|
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|

g...Global minima
1...Prefix (all values before the first global minimum)
2...The first cycle
3...All cycles in between the target cycle and the first cycle (padding cycles)
4...The target cycle
5...The example target (n)
6...The example target position in the first cycle
a,b,d,e,f...arbitrary numbers in S[]

Any target before or in the first cycle will be returned early, since we have to calculate those areas anyway.

The first step in src/cycle.rs is to find the first and second occurrence of a global minimum in S[]. The results of this step are hardcoded in src/hardcoded.rs, which is the only difference between these two algorithms. This is done in src/cycle.rs:11-37 and hardcoded in src/hardcoded.rs:6-9.

When the first cycle is found we can calculate the target's equivalent position in the first cycle with first_min_index + ((n - first_min_index) % (second_min_index - first_min_index) /* the cycle width */). Additionally, we can get the target cycle index (counted from the first cycle onwards; in the example it would be 2) with the formula (n - first_min_index) / (second_min_index - first_min_index) /* the cycle width */. This is done in src/cycle.rs:58-61 and src/hardcoded.rs:19-21; a small sketch of this index arithmetic follows the diagram below.

         4               3
|-p-|---0v--|---1---|---2v--|
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|

g...Global minima
0...cycle with index 0 (the first cycle)
1...cycle with index 1
2...Cycle with index 2 (the example target cycle)
3...The example target (n)
4...The example target position in the first cycle
p...The prefix
a,b,d,e,f...arbitrary numbers in S[]
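
A minimal sketch of just this index arithmetic; the names mirror the prose above rather than the actual identifiers in src/cycle.rs.

fn cycle_coordinates(n: u64, first_min_index: u64, second_min_index: u64) -> (u64, u64) {
    let cycle_width = second_min_index - first_min_index;
    // where n lands when folded back into the first cycle
    let equivalent_position = first_min_index + (n - first_min_index) % cycle_width;
    // index of the cycle containing n, counting the first cycle as 0
    let cycle_index = (n - first_min_index) / cycle_width;
    (equivalent_position, cycle_index)
}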

The cycle_local_sum is simply M(target's equivalent position in the first cycle). This is taken from cached values in src/cycle.rs:62 and calculated based on the hardcoded state of M(min_index) in src/hardcoded.rs:23-46.

With the value of M(first_cycle_index) (calculated with src/linear.rs) we can fill in the cycles in between using first_cycle_sum = M(second_min_index - 1) - M(first_min_index - 1), the sum of all the j-rows in the first cycle, multiplied by the cycle_index. This results in the sum of all the cycles between the first and the target cycle, minus all the global minima in between (those grow with each cycle past the first one). This is done in src/cycle.rs:64-73 and src/hardcoded.rs:48-53.

|----6----|
    |---0---|>>>>>>>|
    |-------5-------|    
|-6-|-------5-------|--6--|
|0|0|3|3|3|3|3|3|3|3|3|3|3| // The global minima included in 5 and 6
|0|0|0|0|0|0|4|4|4|4|8|8|8| // The global minima missing in 5 and 6
         4               3
|-p-|---0v--|---1---|---2v--|
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|

g...Global minima
0...cycle with index 0 (the first cycle) (the sum of this cycle is the first_cycle_sum -> gets scaled with the cycle index 2)
1...cycle with index 1
2...Cycle with index 2 (the example target cycle)
3...The example target (n)
4...The example target position in the first cycle
5...The scaled first_cycle_sum
6...The cycle_local_sum
p...The prefix
a,b,d,e,f...arbitrary numbers in S[]

The last step is to fill in these missing global minima. Since the total sum of cycle global minima grows as an arithmetic progression with each additional cycle, we can use the arithmetic series formula to calculate the fill, with the cycle index as n (a small sketch of this follows the diagram below). This is done in src/cycle.rs:77-88 and src/hardcoded.rs:56-64. The fill also adds global minima for a full target cycle, which have to be subtracted at the end to arrive at the correct result. See src/cycle.rs:90-94 and src/hardcoded.rs:66-70.

|----6----|
    |-------5-------|    
|-6-|-------5-------|--6--|
|0|0|3|3|3|3|3|3|3|3|3|3|3| // The global minima included in 5 and 6
|0|0|0|0|0|0|4|4|4|4|8|8|8| // The global minima missing in 5 and 6
|0|0|0|0|0|0|4|4|4|4|8|8|8|8| // The global minima filled in
|0|0|0|0|0|0|0|0|0|0|0|0|0|8| // The global minima subtracted to get to the right result
         4               3
|-p-|---0v--|---1---|---2v--|
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|

g...Global minima
0...cycle with index 0 (the first cycle) (the sum of this cycle is the first_cycle_sum -> gets scaled with the cycle index 2)
1...cycle with index 1
2...Cycle with index 2 (the example target cycle)
3...The example target (n)
4...The example target position in the first cycle
5...The scaled first_cycle_sum
6...The cycle_local_sum
p...The prefix
a,b,d,e,f...arbitrary numbers in S[]
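
As a hedged sketch of only this arithmetic-series step: assuming every padding cycle k (k = 1, 2, ..., cycle_index) is missing k times some fixed per-cycle amount of global-minimum value, the whole fill collapses into a closed form. The exact per-cycle term (it depends on the cycle width and on the value of the global minimum) is derived in src/cycle.rs:77-88; per_cycle here is only a placeholder for it.

fn global_minimum_fill(per_cycle: u128, cycle_index: u128) -> u128 {
    // per_cycle * (1 + 2 + ... + cycle_index)
    per_cycle * cycle_index * (cycle_index + 1) / 2
}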

All in all the sums of the cycles can be visualized as follows:

|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
 _                           -> a
----------------------------- = a
   _                         -> b
 _ _                         -> min(a, b)
----------------------------- = b + min(a, b) // the sum until here is the prefix_sum
     _                       -> g
   _ _                       -> g
 _ _ _                       -> g
----------------------------- = 3 * g /* prefix global minima */ // the first_cycle_sum starts here
       _                     -> d
     _ _                     -> g
   _ _ _                     -> g
 _ _ _ _                     -> g
----------------------------- = d + 3 * g /* prefix global minima */
         _                   -> e
       _ _                   -> min(d, e)
     _ _ _                   -> g
   _ _ _ _                   -> g
 _ _ _ _ _                   -> g
----------------------------- = e + min(d, e) + 3 * g /* prefix global minima */ // The sum from the beginning until here is the cycle_local_sum
           _                 -> f
         _ _                 -> min(e, f)
       _ _ _                 -> min(d, e, f)
     _ _ _ _                 -> g
   _ _ _ _ _                 -> g
 _ _ _ _ _ _                 -> g
----------------------------- = f + min(e, f) + min(d, e, f) + 3 * g /* prefix global minima (are included in the first_cycle_sum) */ // the first_cycle_sum ends here
             _               -> g
           _ _               -> g
         _ _ _               -> g
       _ _ _ _               -> g
     _ _ _ _ _               -> g
   _ _ _ _ _ _               -> g
 _ _ _ _ _ _ _               -> g
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
----------------------------- = 4 * g /* cycle global minima (have to be filled in) */ + 3 * g /* prefix global minima */
               _             -> d
             _ _             -> g
           _ _ _             -> g
         _ _ _ _             -> g
       _ _ _ _ _             -> g
     _ _ _ _ _ _             -> g
   _ _ _ _ _ _ _             -> g
 _ _ _ _ _ _ _ _             -> g
----------------------------- = d + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */
                 _           -> e
               _ _           -> min(d, e)
             _ _ _           -> g
           _ _ _ _           -> g
         _ _ _ _ _           -> g
       _ _ _ _ _ _           -> g
     _ _ _ _ _ _ _           -> g
   _ _ _ _ _ _ _ _           -> g
 _ _ _ _ _ _ _ _ _           -> g
----------------------------- = e + min(d, e) + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */
                   _         -> f
                 _ _         -> min(e, f)
               _ _ _         -> min(d, e, f)
             _ _ _ _         -> g
           _ _ _ _ _         -> g
         _ _ _ _ _ _         -> g
       _ _ _ _ _ _ _         -> g
     _ _ _ _ _ _ _ _         -> g
   _ _ _ _ _ _ _ _ _         -> g
 _ _ _ _ _ _ _ _ _ _         -> g
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
----------------------------- = f + min(e, f) + min(d, e, f) + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */
                     _       -> g
                   _ _       -> g
                 _ _ _       -> g
               _ _ _ _       -> g
             _ _ _ _ _       -> g
           _ _ _ _ _ _       -> g
         _ _ _ _ _ _ _       -> g
       _ _ _ _ _ _ _ _       -> g
     _ _ _ _ _ _ _ _ _       -> g
   _ _ _ _ _ _ _ _ _ _       -> g
 _ _ _ _ _ _ _ _ _ _ _       -> g
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
----------------------------- = 4 * g /* cycle global minima */ + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */
                       _     -> d
                     _ _     -> g
                   _ _ _     -> g
                 _ _ _ _     -> g
               _ _ _ _ _     -> g
             _ _ _ _ _ _     -> g
           _ _ _ _ _ _ _     -> g
         _ _ _ _ _ _ _ _     -> g
       _ _ _ _ _ _ _ _ _     -> g
     _ _ _ _ _ _ _ _ _ _     -> g
   _ _ _ _ _ _ _ _ _ _ _     -> g
 _ _ _ _ _ _ _ _ _ _ _ _     -> g
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
----------------------------- = d + 4 * g /* cycle global minima */ + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */
                         _   -> e
                       _ _   -> min(d, e)
                     _ _ _   -> g
                   _ _ _ _   -> g
                 _ _ _ _ _   -> g
               _ _ _ _ _ _   -> g
             _ _ _ _ _ _ _   -> g
           _ _ _ _ _ _ _ _   -> g
         _ _ _ _ _ _ _ _ _   -> g
       _ _ _ _ _ _ _ _ _ _   -> g
     _ _ _ _ _ _ _ _ _ _ _   -> g
   _ _ _ _ _ _ _ _ _ _ _ _   -> g
 _ _ _ _ _ _ _ _ _ _ _ _ _   -> g
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
----------------------------- = e + min(d, e) + 4 * g /* cycle global minima */ + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */
                           _ -> f
                         _ _ -> min(e, f)
                       _ _ _ -> min(d, e, f)
                     _ _ _ _ -> g
                   _ _ _ _ _ -> g
                 _ _ _ _ _ _ -> g
               _ _ _ _ _ _ _ -> g
             _ _ _ _ _ _ _ _ -> g
           _ _ _ _ _ _ _ _ _ -> g
         _ _ _ _ _ _ _ _ _ _ -> g
       _ _ _ _ _ _ _ _ _ _ _ -> g
     _ _ _ _ _ _ _ _ _ _ _ _ -> g
   _ _ _ _ _ _ _ _ _ _ _ _ _ -> g
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ -> g
|a|b|g|d|e|f|g|d|e|f|g|d|e|f|
----------------------------- = f + min(e, f) + min(d, e, f) + 4 * g /* cycle global minima */ + 4 * g /* cycle global minima */ + 3 * g /* prefix global minima */ // the cycle global minima in this row would be overcounted if e is the target and therefore have to be removed retroactively

Some rough timings

Measured with cargo bench, which uses criterion.rs in the background. The following are the center timing results (probably medians), except for src/linear.rs at M(2*10^9), which is a single manual run.

call              src/initial.rs  src/rayon.rs  src/parallel.rs  src/linear.rs          src/cycle.rs  src/hardcoded.rs
time complexity   O(n^4)          O(n^2)        O(n^2)           O(n)                   O(~1)         O(~1)
space complexity  O(1)            O(n)          O(n)             O(~1)                  O(~1)         O(~1)
M(10^1)           2.6766 µs       18.657 µs     170.78 µs        5.1612 µs              691.81 ns     5.5037 µs
M(10^4)           -               491.42 ms     54.406 ms        895.48 µs              1.2153 ms     841.83 µs
M(2*10^9)         -               -             -                ~4.5 min (single run)  1.5172 s      484.06 ms
M(u128::MAX)      -               -             -                -                      1.5164 s      642.63 ms
