Mapping Vector/Matrix Multiplication Onto N Processors

1. The Vector/Matrix Multiplication Problem

y = Ax,

where A is a m x m matrix and x and y are m vectors. Here A is a dense matrix; that is, most elements are non-zero. Thus, balancing of the computation means that we can evenly distributed across all processors.

Assume there are N processors available P(i) (i=0,1,...,N-1) and that m/N = r is an integer.

2. Row Storage Method

Here each processor P(i) stores m/N = r rows of A, numbered i*r+1 through (i+1)*r, as well as the vector x. It can then calculate elements i*r+1 to (i+1)*r of y. If iteration is required (for example for time-stepped simulations), then all processors must broadcast their new vector y' to all other processors at each step.

3. Column Storage Method

Here each processor P(i) stores m/N =r columns of A, numbered i*r+1 through (i+1)*r together with the corresponding r elements of x and then calculates the r terms that partially define each element of y. Then processor P(i) must accumulate the r partial sums i*r+1 to (i+1)*r from all the other processors.

4. Block Storage Method

In this method we arrange the processors into a 2D grid numbered 0 to sqrt(N) - 1 in the horizontal and vertical directions. If b = m/sqrt(N), then each processor P(i,j) holds a square subblock of A composed of the rows i*b+1 through (i+1)*b and columns j*b+1 through (j+1)*b. Then each mesh stores elements of j*b+ 1 through (j+1)*b of the vector x and calculates its contribution to the elements ib+1 to (i+1)b of vector y. When all processors have computed their partial contributions to the vector y, each processor on row i transfers its partial y values to the diagonal processor P(i,i), where the final values of the elements ib+1 to (i+1)b of y are calculated. This is the single-node row accumulation. If it is required that we iterate on the solution, then the diagonal processors P(i,i) need to comunication the ib+1 to (i+1)b values of the solution y to each processor in column i.