Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java Alternative Algorithm does not work for arbitrary NDRanges #142

Open
Helios85 opened this issue Nov 27, 2018 · 13 comments
Open

Java Alternative Algorithm does not work for arbitrary NDRanges #142

Helios85 opened this issue Nov 27, 2018 · 13 comments

Comments

@Helios85
Copy link

No description provided.

@Helios85
Copy link
Author

Aparapi "falls back" to a "Java Alternative Algorithm" (JAA) in cases where the Java byte code cannot be translated to OpenCL code. The JAA, however, only works if the NDRange is an exact power of 2. The following example illustrates the problem:


public class JAATest {
	public static void main(String[] args) {
		int[] a = {1, 2, 3, 4, 5, 6, 7};
		
		new Kernel() {
			@Override public void run() {
				System.out.println(); // induces fallback
				a[getGlobalId()] = 0;
			}
		}.execute(a.length);
		
		System.out.println(java.util.Arrays.toString(a));
	}
}

Execution yields the following incorrect output:

[0, 0, 0, 0, 5, 6, 7]

@Helios85 Helios85 reopened this Nov 27, 2018
@NorbiPeti
Copy link

It seems to be multiples of 4, not powers of 2. Anything below 4 is 0, below 16 is 12.

@freemo
Copy link
Member

freemo commented Mar 6, 2019 via email

@rkraneis
Copy link

rkraneis commented Jul 4, 2020

Same problem here. If NDRange is a multiple of available threads (12 in my case), all is fine. If not, then the results are (partially) incorrect. Tested by fiddling with ArrayTest (changing SIZE constant and forcing Java fall-back by adding a println of current id).
Edit: validated a bit less hacky by explicitly setting THREAD_POOL as the preferred device using KernelManager.

@rkraneis
Copy link

rkraneis commented Jul 4, 2020

SIZE = 6 (less than number of threads)

Running com.aparapi.runtime.ArrayTest
expected[0]: [0, 1, 2, 3, 4, 5]
target[0]: [99, 99, 99, 99, 99, 99]
expected[1]: [1, 2, 3, 4, 5, 6]
target[1]: [99, 99, 99, 99, 99, 99]
expected[2]: [2, 3, 4, 5, 6, 7]
target[2]: [99, 99, 99, 99, 99, 99]
expected[3]: [3, 4, 5, 6, 7, 8]
target[3]: [99, 99, 99, 99, 99, 99]
expected[4]: [4, 5, 6, 7, 8, 9]
target[4]: [99, 99, 99, 99, 99, 99]
expected[5]: [5, 6, 7, 8, 9, 10]
target[5]: [99, 99, 99, 99, 99, 99]

SIZE = 12 (number of threads)

Running com.aparapi.runtime.ArrayTest
expected[0]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
target[0]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
expected[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
target[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
expected[2]: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
target[2]: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
expected[3]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
target[3]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
expected[4]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
target[4]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
expected[5]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
target[5]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
expected[6]: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
target[6]: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
expected[7]: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
target[7]: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
expected[8]: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
target[8]: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
expected[9]: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
target[9]: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
expected[10]: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
target[10]: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
expected[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

SIZE = 13 (number of threads + 1)

...
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
expected[12]: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
target[12]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]

SIZE = 23 ( 2 * number of threads - 1)

...
expected[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
expected[12]: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
target[12]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[13]: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
target[13]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[14]: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
target[14]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[15]: [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]
target[15]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[16]: [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]
target[16]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[17]: [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
target[17]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[18]: [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
target[18]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[19]: [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
target[19]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[20]: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
target[20]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[21]: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
target[21]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[22]: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]
target[22]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]

From this it seems as if the final non-full batch is not calculated at all (arrays are initialized with 99 in this test).

@rkraneis
Copy link

rkraneis commented Jul 4, 2020

Setting SEQUENTIAL as the preferred device would work as a fallback ...

@rkraneis
Copy link

rkraneis commented Jul 4, 2020

This seems to be boiling down to execute(int) using createRange which uses device information to then create a range global:23 local:(derived)12 ... When manually creating the range with a null device and yielding global:23 local:(derived)23 also the thread pool execution leads to correct results.
So localSize0 will then be only 12 during thread pool creation in KernelRunner.

@rkraneis
Copy link

rkraneis commented Jul 4, 2020

This optimization(?)

if (_device == JavaDevice.THREAD_POOL) {
  withoutLocal.setLocalSize_0(Runtime.getRuntime().availableProcessors());
  withoutLocal.setLocalIsDerived(true);
  return withoutLocal;
} else 

is only happening for create, but not for create2D nor for create3D. Removing it makes everything work. But there probably was originally a reason for adding it ...

Edit: it also contradicts the javadoc, which reads

The groupsize will be chosen such that _localWidth > 0 && _localWidth < MAX_GROUP_SIZE && _globalWidth % _localWidth==0 is true

which is what the thread scheduling relies upon.

@CoreRasurae
Copy link
Collaborator

@rkraneis Thanks for your work on finding the root cause. The reason for that code being there is that Java performance is likely to be best if the number of Java Thread Pool threads doesn't exceed the CPU available cores (eventually Hyper-Threading siblings are also included in the count which is undesirable for High Performance Computing workloads, anyway). So either this is removed, since it can violate the rule _globalWith % _localWidth == 0, or the logic must be modified so that it respects such logic.
@freemo What is your opinion here?

@CoreRasurae
Copy link
Collaborator

@Helios85 Can you confirm @rkraneis theory for the root cause of this bug?

@Helios85
Copy link
Author

Helios85 commented Jul 9, 2020

@CoreRasurae I can indeed confirm that the Java Alternative Algorithm only works if NDRange is a multiple of the available logical processors. I tested this with two machines, one with 4, the other with 8 logical processors. In the first case, NDRange had to be a multiple of 4 for the alternative algorithm to work correctly, in the second case NDRange had to be a multiple of 8.
@rkraneis Congratulations for finding the cause of this problem!

@CoreRasurae
Copy link
Collaborator

@Helios85 Thanks for confirming this.

@KaiGerdMueller
Copy link

KaiGerdMueller commented Aug 29, 2023

Tested whether the bug is still there on my device using the provided class JAATest.
It printed [1, 2, 3, 4, 5, 6, 7], so works just fine.
Tested on Ryzen with surely not 7 kernels.
Seems like the bug may be gone, verification would be good tough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants