University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking
- Licheng CAO
- Tested on: Windows 10, i7-10870H @ 2.20GHz 32GB, GTX 3060 6009MB
-
the number of boids (with/without visualization)
- Figures 1 and 2 depict a clear trend: as the number of boids increases, the frames per second (FPS) decreases. This decline is primarily attributed to the growing number of boids that each thread must process. Among the various step methods, the naive method is the most affected by the number of boids, as it necessitates considering all boids within each thread. On the other hand, the scattered method significantly enhances performance by limiting the scope to boids in nearby grids, thereby reducing the number of boids each thread must handle. Additionally, in this method, I've implemented a specific order for searching the nearby grids to ensure contiguous grid access.
- The coherent method provides a slight performance boost over the scattered method. It accomplishes this by rearranging the position and velocity arrays of boids in such a way that information for boids within the same grid is stored contiguously in memory. This optimization allows for faster retrieval of information when iterating over all boids within a single grid.
- Figure1 average FPS/number of boids with visualization
- Figure2 average FPS/number of boids without visualization
- We observe that the average FPS is higher when visualization is disabled. As the number of boids increases, the disparity between FPS with and without visualization diminishes. This phenomenon occurs because the primary bottleneck affecting FPS is the GPU calculations when dealing with a large number of boids, whereas it becomes the drawing speed when dealing with a smaller number of boids.
- Figure3 average FPS with/without visualization
-
the blocksize and cell width
- Table 1 and Figure 4 provide insights into the behavior of the average FPS. In the case of the naive method, we observe that the average FPS remains relatively consistent. However, for both the scattered and coherent methods, altering the block size from 32 to 64 leads to an increase in the average FPS, while changing it from 224 to 512 results in a decrease.
- The increase in average FPS with the larger block size can be attributed to the block having more warps, allowing for smoother switching to hide the delay associated with accessing global data. Conversely, the decrease in average FPS with the smaller block size may stem from each Streaming Multiprocessor (SM) lacking sufficient Streaming Processors (SPs) to efficiently process threads in a single loop.
- Modifying the cell width and increasing the number of blocks to check from 8 to 27 can have a notable impact on the performance of my implementation. In my approach, I introduce a jitter vector (vec3(-0.5)) to ensure that each boid within a grid only needs to search within nearby 8 blocks to locate all neighboring boids efficiently. Consequently, examining 27 blocks would introduce unnecessary computational overhead.
- However, when we alter the cell width and block size, the outcomes may differ. A smaller cell width could result in a reduced bounding box for each boid, potentially decreasing the number of boids that each thread needs to inspect. This reduction in workload could, in turn, lead to improved performance.
- Table1 average FPS/blocksize without visualization (number of boids: 524,288)
-
blocksize 32 64 96 128 160 192 224 512 naiveFPS 0.323 0.454 0.452 0.458 0.449 0.455 0.457 0.1326 ScatteredFPS 129.62 174.728 181.97 181.566 179.985 181.666 177.534 166.941 CoherentFPS 231.937 270.333 273.499 270.334 269.264 269.608 267.44 251.679 - Figure4 average FPS/blocksize