These problems complicate the main tasks of MM: reading the same values many times and sharing data. We've specified an OpenCL implementation that includes techniques to address each of the problems.

## OpenCL Optimization Techniques for Matrix Multiplication

The first well-known problem is to minimize repetitive reading of the same matrix elements from slow memories, such as higher-level caches and DDR. We have to try to group memory accesses (reads and writes) so that they are close to one another in the address space.

Our technique for improving data re-use is to split the input and output matrices into sub-matrices called tiles. We then enforce the order of memory operations so that the dot products resulting from matrix multiplication are partially completed across the entire tile before we move the reading pointers outside the tile boundaries. Our algorithm recognizes two levels of tiling: micro-tiles and macro-tiles. The following figure represents how we map the matrices to multiply components in matrix A by components in matrix B and arrive at the single dot product in matrix C:

*[Figure: mapping of micro-tiles across matrices A, B, and C]*

The macro-tile is generally a bigger, rectangular area consisting of one or more micro-tiles and corresponding to a work-group. Within the work-group, we operate entirely within the macro-tile.

To calculate the 4x8 micro-tile in matrix C, we focus on areas in matrices A and B that have sizes 4x8 and 4x4, respectively. We start at pos = 0, calculate a partial result, or dot product, and store it in a temporary buffer for that micro-tile. Meanwhile, the other work-items in the same macro-tile calculate their partial results in parallel, using the same data loaded from matrix A or matrix B. All data in the row coming from matrix A gets shared between the work-items in the same row. Likewise, all data in the column coming from matrix B gets shared between the work-items in the same column.

We calculate partial results for all the micro-tiles in the macro-tile, then we increment pos horizontally in A and vertically in B at the same time. Each micro-tile continues to accumulate partial results that add up to the dot product. So we limit ourselves to the positions inside the macro-tile and finish all the partial calculations before we move position. By doing tile-focused calculations and incrementing pos gradually, we can maximize reuse of data already in the caches.
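To make the two-level tiling concrete, here is a plain-C sketch of the accumulation pattern described above. This is not the article's actual OpenCL kernel: the matrix dimensions, the 4x4 tile sizes, the `STEP` by which `pos` advances, and the function name `matmul_tiled` are all illustrative assumptions. Each `(ty, tx)` loop iteration plays the role of one work-item computing one micro-tile of C.

```c
#include <assert.h>

/* Illustrative sketch (assumed sizes, not from the article):
 * A is M x K, B is K x N, C is M x N, all row-major. */
#define M 8
#define N 8
#define K 8
#define DY 4   /* micro-tile height in C */
#define DX 4   /* micro-tile width in C  */
#define STEP 4 /* how far `pos` advances along K per iteration */

static void matmul_tiled(const float *A, const float *B, float *C)
{
    /* Each (ty, tx) pair stands in for one work-item
     * computing one DY x DX micro-tile of C. */
    for (int ty = 0; ty < M; ty += DY)
    for (int tx = 0; tx < N; tx += DX) {
        float acc[DY][DX] = {{0}}; /* temporary buffer of partial results */

        /* Advance `pos` horizontally in A and vertically in B,
         * accumulating partial dot products into the micro-tile
         * before moving outside the tile boundaries. */
        for (int pos = 0; pos < K; pos += STEP)
            for (int p = pos; p < pos + STEP; ++p)
                for (int y = 0; y < DY; ++y)
                for (int x = 0; x < DX; ++x)
                    acc[y][x] += A[(ty + y) * K + p] * B[p * N + (tx + x)];

        /* Write the finished micro-tile back to C. */
        for (int y = 0; y < DY; ++y)
            for (int x = 0; x < DX; ++x)
                C[(ty + y) * N + (tx + x)] = acc[y][x];
    }
}
```

In a real OpenCL kernel, the two outer loops disappear: `ty` and `tx` come from the work-item IDs, so all micro-tiles in a macro-tile progress in lockstep through the same `pos` values and share the rows of A and columns of B sitting in the caches.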