Removes mutex locks from most memory accesses #120

Open
wants to merge 6 commits into master

Conversation

doe300
Owner

@doe300 doe300 commented Nov 10, 2018

  • Removes all implicit mutex locks
  • Completely rewrites memory mapping for a more uniform procedure
  • VPM scratch area (for "direct" memory access) is now a separate block of space per QPU (sketched below)
  • VPM/DMA accesses are no longer combined (no longer needed/possible); instead, their setups are combined in an extra optimization step
  • Prepares for caching RAM in VPM with initial load and write-back
  • A lot of other things...

See #113
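To illustrate what a per-QPU scratch area means in practice, here is a minimal sketch (the names and row count below are assumptions, not the actual VC4C values): since every QPU addresses only its own rows, "direct" accesses from different QPUs cannot collide and need no mutex.

```cpp
#include <cstdint>

// The VideoCore IV has 12 QPUs; each gets its own scratch rows (size assumed).
constexpr uint32_t NUM_QPUS = 12;
constexpr uint32_t SCRATCH_ROWS_PER_QPU = 1;

// First VPM row a given QPU may use for its "direct" memory accesses.
constexpr uint32_t scratchRowForQPU(uint32_t qpuIndex)
{
    return qpuIndex * SCRATCH_ROWS_PER_QPU;
}

// Different QPUs always end up in different rows, so no mutex is required.
static_assert(scratchRowForQPU(1) != scratchRowForQPU(NUM_QPUS - 1),
    "per-QPU scratch rows are disjoint");
```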

Effects (test_emulator, whole PR):

Instructions:       45160 to 49793 (+10%)
Cycles:             659247 to 641681 (-2%)
Mutex waits:        282551 to 0 (-100%)
Mutex locks:        18272 to 11 (-99%)
Total time (in ms): 56722 to 58101 (+2%)

Tests:

  • Passes all emulations that current master passes
  • Passes all regression tests (more than master)
  • Passes all boost compute tests that master passes (actual test on real hardware)

To be fixed:

  • 3-element vector
  • A few TODOs
  • A few CTS regressions
  • Move unrelated changes directly to master

clpeak results:

Platform: OpenCL for the Raspberry Pi VideoCore IV GPU
  Device: VideoCore IV GPU
    Driver version  : 0.4 (Linux ARM)
    Compute units   : 1
    Clock frequency : 300 MHz

    Global memory bandwidth (GBPS)
      float   : 0.07
      float2  : 0.12
      float4  : 0.22
      float8  : 0.35
      float16 : 0.51

    Single-precision compute (GFLOPS)
      float   : 0.26
      float2  : 0.74
      float4  : 1.68
      float8  : 2.83
      float16 : 4.53

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 1.13
      enqueueReadBuffer          : 0.18
      enqueueMapBuffer(for read) : 5036.31
        memcpy from mapped ptr   : 0.18
      enqueueUnmap(after write)  : 4511.52
        memcpy to mapped ptr     : 0.82

Compared to the last results (from Nov. 2017), this is:

  • ~ 25% higher throughput for single-precision compute test
  • ~ 20% higher throughput for memory bandwidth test
  • ~ 80% higher throughput for transfer bandwidth test (although this may be unrelated)

@doe300 doe300 added the enhancement, normalization (related to a normalization step) and optimization (related to an optimization step) labels Nov 10, 2018
@doe300 doe300 self-assigned this Nov 10, 2018
@doe300 doe300 mentioned this pull request Dec 8, 2018
* dumps layout of used VPM per kernel
* rewrites Emulator to handle VPM configuration per QPU
* fixes bug in elimination of bit operations
* fixes bug mapping IR operations to machine code
* fixes bug mapping volatile parameters to read-only parameters
* Emulator now tracks TMU reads per TMU

See #113
Memory access is now mapped in the following steps (a rough sketch follows below):

* Determine preferred and fall-back lowering type per memory area
* Check whether lowering type can be applied, reserve resources
* Map all memory access to specified lowering level

Also disables combining of VPM/DMA writes/reads for now.

See #113
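A minimal sketch of these three steps, with made-up type and function names (the actual VC4C classes, heuristics and resource tracking differ):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative types only; not the real VC4C data structures.
enum class LoweringType { Register, VPM, TMU, DMA };

struct MemoryArea
{
    bool fitsInRegister; // small enough to be lowered into a register
    bool isReadOnly;     // read-only data can be loaded via the TMU
    uint32_t numVPMRows; // rows needed if lowered into the VPM
};

struct LoweringChoice
{
    LoweringType preferred;
    LoweringType fallback;
};

// Step 1: determine preferred and fall-back lowering type per memory area.
LoweringChoice determineLoweringTypes(const MemoryArea& area)
{
    if(area.fitsInRegister)
        return {LoweringType::Register, LoweringType::VPM};
    if(area.isReadOnly)
        return {LoweringType::TMU, LoweringType::DMA};
    return {LoweringType::VPM, LoweringType::DMA};
}

// Step 2: check whether the preferred type can be applied and reserve resources.
bool tryReserveResources(const MemoryArea& area, LoweringType type, uint32_t& freeVPMRows)
{
    if(type != LoweringType::VPM)
        return true; // in this simplified model only the VPM needs reservation
    if(area.numVPMRows > freeVPMRows)
        return false;
    freeVPMRows -= area.numVPMRows;
    return true;
}

// Step 3: rewrite all accesses of the area to the selected lowering level.
void lowerAccesses(MemoryArea& /*area*/, LoweringType /*type*/)
{
    // ... the real pass rewrites the loads/stores in the IR here ...
}

void mapMemoryAccess(std::vector<MemoryArea>& areas, uint32_t freeVPMRows)
{
    std::vector<LoweringType> applied(areas.size());
    for(std::size_t i = 0; i < areas.size(); ++i)
    {
        const LoweringChoice choice = determineLoweringTypes(areas[i]);           // step 1
        applied[i] = tryReserveResources(areas[i], choice.preferred, freeVPMRows)
            ? choice.preferred
            : choice.fallback;                                                     // step 2
    }
    for(std::size_t i = 0; i < areas.size(); ++i)
        lowerAccesses(areas[i], applied[i]);                                       // step 3
}
```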

Effects (test-emulator, last 2 commits):
Instructions: 45160 to 45779 (+1%)
Cycles:       659247 to 661193 (+0.2%)
Mutex waits:  282551 to 281459 (-0.3%)
This change allows us to remove mutex locks from "direct" memory access.

See #113
* fixes memory association for some special cases
* fixes elimination of moves for VPM usage without mutex

Effects (test-emulator, last 2 commits):
Instructions:       45779 to 52510 (+14%)
Cycles:             661193 to 644891 (-2%)
Mutex waits:        281459 to 0
Total time (in ms): ~57192 to 57232 (+-0%)
This version will only combine writes of the same setup values, where
possible (sketched below). The full combining step is also removed, since
it will become obsolete anyway with VPM-cached memory (see #113).

Effects (test-emulator):
Instructions:       52511 to 49793 (-5%)
Cycles:             644891 to 641680 (-0.5%)
Total time (in ms): 62869 to 58456 (-7%)
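A rough sketch of the reduced combining step described in the commit above (simplified instruction model with hypothetical names; the real pass works on the VC4C IR):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Simplified instruction model, for illustration only.
struct Instruction
{
    bool isVPMSetupWrite;
    uint32_t setupValue;
};

// Drops a VPM setup write if the setup value it writes is already active,
// i.e. only writes of identical setup values are combined.
void combineIdenticalSetups(std::vector<Instruction>& block)
{
    std::vector<Instruction> result;
    bool haveActiveSetup = false;
    uint32_t activeSetup = 0;
    for(const Instruction& instr : block)
    {
        if(instr.isVPMSetupWrite)
        {
            if(haveActiveSetup && activeSetup == instr.setupValue)
                continue; // identical setup already active, drop the redundant write
            haveActiveSetup = true;
            activeSetup = instr.setupValue;
        }
        result.push_back(instr);
    }
    block = std::move(result);
}
```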
@nomaddo
Collaborator

nomaddo commented Dec 25, 2018

I am not sure why we can remove all of the mutex locks.
In my view, acquiring mutex locks was a MUST to avoid resource conflicts in a multi-QPU VC4 program.

You mean the responsibility for avoiding resource conflicts is moved to the users?

@doe300
Owner Author

doe300 commented Dec 25, 2018

I am not sure why we can remove all of the mutex locks.

I am actually also not 100% sure, but so far I have not come across any test which failed with the mutexes removed, even when I explicitly forced race conditions between the QPUs.

In my view, acquiring mutex locks was a MUST to avoid resource conflicts in a multi-QPU VC4 program.

I thought the same, but I might have been wrong.

My old theory/understanding of the VPM:
There is 1 VPM unit which contains the cache and manages DMA access. More specifically, there is also only 1 VPM configuration (across all QPUs) active at any time, which forces the QPUs to synchronize their accesses so they do not overwrite the configuration currently used by another QPU.

My new theory:
Still 1 VPM unit, but a separate configuration per QPU (similar to the TMU), allowing all cores to access the cache at the same time (similar to a "normal" CPU cache). DMA accesses are either also parallel or go through a single queue shared between the QPUs, but this does not change anything from the usage point of view. Having parallel access allows us to access separate VPM areas (and RAM regions) unsynchronized, only forcing us to lock when we access VPM areas/RAM regions which other QPUs might access at the same time.
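A minimal data model of the two theories (purely illustrative; the type names are made up and this is not real hardware or VC4C code):

```cpp
#include <array>
#include <cstdint>

// A single VPM read/write setup (row offset and stride), simplified.
struct VPMSetup
{
    uint32_t rowOffset;
    uint32_t stride;
};

// Old theory: a single configuration shared by all QPUs. A QPU must hold the
// hardware mutex for the whole access, or another QPU may overwrite its setup.
struct SharedVPMConfig
{
    VPMSetup config;
};

// New theory: one configuration per QPU (like the TMU). Setups of different
// QPUs cannot clobber each other, so locking is only needed when two QPUs
// touch the same VPM rows or the same RAM region.
struct PerQPUVPMConfig
{
    std::array<VPMSetup, 12> configs; // one entry per QPU on the VideoCore IV
};
```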

You mean the responsibility for avoiding resource conflicts is moved to the users?

No, VC4C would still synchronize where required (i.e. when accessing shared memory regions), but most accesses to the VPM are restructured so that the QPUs do not access the same VPM areas and therefore do not need to lock.

@nomaddo
Collaborator

nomaddo commented Jan 7, 2019

Sorry for the late reply. I have finished my vacation, so from now on I can respond quickly.

I tried some programs to check whether your assumption is correct.
At least example/sgemm.py in py-videocore has a problem if the mutex operations are removed.
I commented out mutex_acquire() and mutex_release() and executed it.
Without any modification, this program exits successfully.
With my modification (described above), the program runs into an execution timeout.

pi@nomaddo-pi3:~/py-videocore/examples$ sudo python sgemm.py 
==== sgemm example (96x363 times 363x3072) ====
threads: 12
numpy: 0.1021 sec, 2.1067 Gflops
GPU: 0.0332 sec, 6.4721 Gflops
minimum absolute error: 0.0000e+00
maximum absolute error: 7.3559e+01
minimum relative error: 0.0000e+00
maximum relative error: 1.4603e+02
pi@nomaddo-pi3:~/py-videocore/examples$ sudo python sgemm.py 
Traceback (most recent call last):
  File "sgemm.py", line 565, in <module>
    main()
  File "sgemm.py", line 541, in main
    uniforms=uniforms
  File "/usr/local/lib/python2.7/dist-packages/videocore/driver.py", line 186, in execute
    raise DriverError('QPU execution timeout')
videocore.driver.DriverError: QPU execution timeout

From this experiment, I think this modification is dangerous.

doe300 added a commit that referenced this pull request Apr 26, 2019
* Adds BasicBlock#to_string printing the block label name
* Fixes error with empty CFG, fixes crash in #98
* Fixes memory access error copying buffers for emulation tests
* Fixes a few intrinsic precalculations
* Adds tests for intrinsics precalculation
* Adds support for reading/writing multiple rows from/to RAM, taken from #120
* Adds function to dump used VPM layout to debug log
doe300 added a commit that referenced this pull request May 26, 2019
Memory access is now mapped in the following steps:

* Determine preferred and fall-back lowering type per memory area (e.g. register, VPM, TMU, DMA)
* Check whether lowering type can be applied, reserve resources
* Map all memory access to specified lowering level

Copies useful changes from #120 without changing the semantics.