Mapping
Vector/Matrix Multiplication Onto N Processors
1. The Vector/Matrix Multiplication Problem
y = Ax,
where A is a m x m matrix and x and y are m vectors. Here A is a dense matrix; that is, most elements are non-zero. Thus, balancing of the computation means that we can evenly distributed across all processors.
Assume there are N processors available P(i) (i=0,1,...,N-1) and that m/N = r is an integer.
2. Row Storage Method
Here each processor P(i) stores m/N = r rows of A, numbered i*r+1 through (i+1)*r, as well as the vector x. It can then calculate elements i*r+1 to (i+1)*r of y. If iteration is required (for example for time-stepped simulations), then all processors must broadcast their new vector y' to all other processors at each step.
3. Column Storage Method
Here each processor P(i) stores m/N =r columns of A, numbered i*r+1 through (i+1)*r together with the corresponding r elements of x and then calculates the r terms that partially define each element of y. Then processor P(i) must accumulate the r partial sums i*r+1 to (i+1)*r from all the other processors.
4. Block Storage Method
In this method we arrange the processors into
a 2D grid numbered 0 to sqrt(N) - 1 in the horizontal and vertical directions. If b = m/sqrt(N),
then each processor P(i,j) holds a square subblock of A composed of the rows i*b+1
through (i+1)*b and columns j*b+1 through (j+1)*b. Then each mesh stores
elements of j*b+ 1 through (j+1)*b of the vector x and calculates its
contribution to the elements ib+1 to (i+1)b of vector
y. When all processors have computed their partial contributions to the vector
y, each processor on row i transfers its partial y
values to the diagonal processor P(i,i),
where the final values of the elements ib+1 to (i+1)b of y are calculated. This
is the single-node row accumulation. If it is required that we iterate on the
solution, then the diagonal processors P(i,i) need to comunication the
ib+1 to (i+1)b values of the solution y to each processor in column i.