Designing Parallel Algorithms
(from Foster - Chapter 2)

Partitioning

Domain Decomposition

-- focus on the largest data structure or most frequently accessed data structure
-- for example, our original decomposition of the Parallel Jacobi Method assigned a process to each element of the grid; similar element-wise decompositions appeared in parallel Simpson's rule for numerical integration and in parallel matrix multiply, where we assigned one thread per element of the resultant matrix
-- weather modeling might assign each particle to a thread (probably far too fine-grained in practice)
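
To make this concrete, here is a minimal sketch (illustrative names, not our original code) that agglomerates slightly to one thread per interior grid row for a single Jacobi-style averaging step; a one-thread-per-element version would follow the same pattern:

    import java.util.ArrayList;
    import java.util.List;

    // Domain decomposition sketch: one task (thread) per grid row of a
    // Jacobi-style relaxation step. A real version would iterate and
    // synchronize between steps.
    public class JacobiRowTasks {
        static final int N = 8;
        static double[][] oldGrid = new double[N][N];
        static double[][] newGrid = new double[N][N];

        public static void main(String[] args) throws InterruptedException {
            List<Thread> workers = new ArrayList<>();
            for (int i = 1; i < N - 1; i++) {          // interior rows only
                final int row = i;
                Thread t = new Thread(() -> {
                    for (int j = 1; j < N - 1; j++) {  // 4-point stencil average
                        newGrid[row][j] = 0.25 * (oldGrid[row - 1][j] + oldGrid[row + 1][j]
                                                + oldGrid[row][j - 1] + oldGrid[row][j + 1]);
                    }
                });
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) t.join();         // wait for one full step
        }
    }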

Functional Decomposition

-- for example, when we developed the full-duplex sliding window protocol, we isolated the control of packets and sequence numbering (the shared data) in the transmitter/receiver threads
-- example: layering of operating systems (a file system on top of a multithreading kernel)
-- functional decomposition of a climate simulation model
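
A rough sketch of that transmitter/receiver separation (hypothetical class and type names; requires Java 16+ for records): the two functions run as independent threads, the sequence-numbering state is private to the transmitter, and a bounded queue stands in for the channel:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Functional decomposition sketch: transmitter and receiver are separate
    // tasks; sequence numbering (shared protocol state) lives only inside
    // the transmitter. The queue models the physical channel.
    public class TxRxDecomposition {
        record Packet(int seq, String data) {}

        public static void main(String[] args) {
            BlockingQueue<Packet> channel = new ArrayBlockingQueue<>(4);

            Thread transmitter = new Thread(() -> {
                int seq = 0;                            // sequence state is private
                for (String d : new String[] {"a", "b", "c"}) {
                    try { channel.put(new Packet(seq++, d)); }
                    catch (InterruptedException e) { return; }
                }
            });

            Thread receiver = new Thread(() -> {
                for (int i = 0; i < 3; i++) {
                    try { System.out.println(channel.take()); }
                    catch (InterruptedException e) { return; }
                }
            });

            transmitter.start();
            receiver.start();
        }
    }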

Partitioning Design Checklist

-- an order of magnitude more tasks than processors (otherwise there is little flexibility in later design phases)
-- avoid redundant computation and storage requirements (otherwise the design may not scale to larger problems)
-- tasks should be about the same size (easier to allocate to processors)
-- the number of tasks should scale with problem size (a larger problem should increase the number of tasks, not the size of each task)
-- are there alternative partitions? (for example, the grid problem lends itself to both row partitions and subpartitions)

Communication


-- for example, in row-by-row Jacobi, we sent "signals" between row threads; in the partitioned version, we sent border elements between subpartitions
-- in the data communications protocols, we sent packets between the transmitter and receiver threads
-- in the shortest path algorithm, we sent trial distances between threads
-- in the adaptive quadrature problem, we sent interval objects (height, width, area)
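
A sketch of the border-element exchange (illustrative; the queues are buffered with capacity 1 so both partitions can send before receiving, which keeps the exchange deadlock-free):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Communication sketch: two partition tasks swap border rows each step.
    // Each task sends its own border, then receives its neighbor's border
    // into a ghost row used by the next computation step.
    public class BorderExchange {
        public static void main(String[] args) {
            BlockingQueue<double[]> upward = new ArrayBlockingQueue<>(1);
            BlockingQueue<double[]> downward = new ArrayBlockingQueue<>(1);

            Thread top = new Thread(() -> {
                try {
                    downward.put(new double[] {1, 1, 1});  // send my border
                    double[] ghost = upward.take();        // receive neighbor's
                    System.out.println("top received " + ghost.length + " border elements");
                } catch (InterruptedException e) { }
            });

            Thread bottom = new Thread(() -> {
                try {
                    upward.put(new double[] {2, 2, 2});
                    double[] ghost = downward.take();
                    System.out.println("bottom received " + ghost.length + " border elements");
                } catch (InterruptedException e) { }
            });

            top.start();
            bottom.start();
        }
    }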


-- local/global - in the grid problem, we talked only to our neighbors; in adaptive quadrature, we placed intervals in a global pool; we also used a multiway rendezvous to implement a global barrier (n-body problem, divide and conquer, replicated workers)
-- structured communication (as in the grid problems)
-- unstructured communication (as in the shortest path algorithm)
-- static communication architecture (as in the topology algorithm)
-- dynamic communication architecture (as in the Java webpage server using TCP/IP)
-- synchronous communication (as in the alternating bit protocol and the Jacobi Relaxation algorithm)
-- asynchronous communication (as in the resource allocation thread and the file client/server interaction)
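
For the global barrier mentioned above, Java's CyclicBarrier provides the multiway rendezvous directly; a small sketch (our course versions were built from monitors, so this is only an analogue):

    import java.util.concurrent.BrokenBarrierException;
    import java.util.concurrent.CyclicBarrier;

    // Global communication sketch: N workers rendezvous at a barrier after
    // each step, as in the n-body and replicated-worker examples.
    public class BarrierStep {
        public static void main(String[] args) {
            final int N = 4;
            CyclicBarrier barrier = new CyclicBarrier(N,
                    () -> System.out.println("--- step complete ---"));

            for (int id = 0; id < N; id++) {
                final int me = id;
                new Thread(() -> {
                    for (int step = 0; step < 3; step++) {
                        System.out.println("worker " + me + " computed step " + step);
                        try { barrier.await(); }      // multiway rendezvous
                        catch (InterruptedException | BrokenBarrierException e) { return; }
                    }
                }).start();
            }
        }
    }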

Communication Design Checklist

-- balance of computation across tasks (scalability)

** grid point, row, or partition

-- keep neighborhood of communication small

** only communicate with neighbors in grid

-- concurrency of communications

** grid neighbors send and then receive (all parallel)

-- concurrency of computation

** all grid computations are parallel within step

Agglomeration

Increasing Granularity

-- grid points to rows to partitions
-- matrix elements to rows to partitions


-- in the grid problem, we replicated the borders of partitions (each partition keeps a copy of its neighbors' border elements)
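
One way to picture that replication (a hypothetical layout, not our actual code): each partition owns a block of rows plus two replicated "ghost" border rows, refreshed once per step, so the relaxation step itself touches only local data:

    // Agglomeration sketch: a partition owns a block of grid rows and keeps
    // replicated copies of the two neighboring border rows ("ghost" rows),
    // refreshed once per step instead of communicating per element access.
    public class Partition {
        final double[][] rows;   // rows[0] and rows[rows.length - 1] are ghosts
        final int width;

        Partition(int ownedRows, int width) {
            this.rows = new double[ownedRows + 2][width];  // +2 replicated borders
            this.width = width;
        }

        // Refresh the replicated borders from the neighbors (once per step).
        void setGhosts(double[] above, double[] below) {
            rows[0] = above.clone();
            rows[rows.length - 1] = below.clone();
        }

        // One relaxation step over the owned rows, touching only local data.
        // out must have the same shape as rows.
        void relaxStep(double[][] out) {
            for (int i = 1; i <= rows.length - 2; i++)
                for (int j = 1; j < width - 1; j++)
                    out[i][j] = 0.25 * (rows[i - 1][j] + rows[i + 1][j]
                                      + rows[i][j - 1] + rows[i][j + 1]);
        }

        public static void main(String[] args) {
            Partition p = new Partition(4, 6);
            p.setGhosts(new double[6], new double[6]);
            double[][] out = new double[6][6];
            p.relaxStep(out);                  // no communication inside the step
            System.out.println("local step done; out[1][1] = " + out[1][1]);
        }
    }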
 
-- communication costs include both setup and transmission
-- task creation costs
-- surface-to-volume effects (grid partitions vs rows for various stencils)
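
** a back-of-the-envelope illustration (assuming a 5-point stencil on an n x n grid split among P tasks): a row decomposition gives each task n^2/P points of computation but 2n border values of communication, a ratio of 2P/n; square subpartitions communicate only 4n/sqrt(P) values, a ratio of 4*sqrt(P)/n, which is smaller whenever P > 4 and improves as partitions grow "thicker"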

Agglomeration Design Checklist

-- have you increased locality and reduced communication costs?
-- replicating data can sometimes interfere with scalability
-- are tasks about the same granularity?
-- does number of tasks scale with problem size?
-- did agglomeration reduce concurrency?
-- do you need to parallelize sequential code to get more concurrency?

Mapping


-- example: vector-matrix multiplication
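
A sketch of a static mapping for this example (sizes and names are made up): each of P worker threads is mapped a contiguous block of rows of A and computes its slice of y = A*x with no further communication:

    import java.util.Arrays;

    // Mapping sketch for y = A*x: rows are statically mapped to P workers in
    // contiguous blocks; each worker writes a disjoint slice of y, so no
    // communication is needed after the mapping is chosen.
    public class MatVec {
        public static void main(String[] args) throws InterruptedException {
            final int n = 8, P = 4;
            double[][] A = new double[n][n];
            double[] x = new double[n], y = new double[n];
            for (int i = 0; i < n; i++) { x[i] = 1; Arrays.fill(A[i], i); }

            Thread[] workers = new Thread[P];
            for (int p = 0; p < P; p++) {
                final int lo = p * n / P, hi = (p + 1) * n / P;  // this worker's rows
                workers[p] = new Thread(() -> {
                    for (int i = lo; i < hi; i++) {
                        double sum = 0;
                        for (int j = 0; j < n; j++) sum += A[i][j] * x[j];
                        y[i] = sum;
                    }
                });
                workers[p].start();
            }
            for (Thread t : workers) t.join();
            System.out.println(Arrays.toString(y));
        }
    }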

Load-Balancing Algorithms

-- these techniques agglomerate tasks into one partition per processor
-- Local Algorithms - balance load locally, e.g., by process migration between neighboring processors
-- Probabilistic Methods - allocate tasks to processors at random
-- Cyclic Mappings - each processor is allocated every Pth task
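
A cyclic mapping is one line of code; a tiny sketch (the task and processor counts are made up):

    // Cyclic mapping sketch: tasks 0..T-1 are dealt out so each of the
    // P processors receives every P-th task (task t -> processor t mod P).
    public class CyclicMapping {
        public static void main(String[] args) {
            final int T = 10, P = 3;
            for (int t = 0; t < T; t++)
                System.out.println("task " + t + " -> processor " + (t % P));
        }
    }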

Task-Scheduling Algorithms

-- centralized or distributed task pool

** Manager/Worker - replicated workers under the control of a manager or monitor
** Decentralized Schemes - the task pool is a distributed data structure (the shortest path algorithm, or adaptive quadrature with a distributed pool)

-- Termination Detection - needed to determine when the work in the pool is complete
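
A compact manager/worker sketch with a centralized pool (hypothetical integer tasks; a monitor-based version would look similar). Termination detection here is the simplest kind: one poison-pill task per worker, enqueued after all the real work:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Manager/worker sketch with a centralized task pool. The manager
    // enqueues one poison pill per worker after the real tasks; a worker
    // exits when it sees the pill, so the pool drains before shutdown.
    public class ManagerWorker {
        static final int POISON = -1;

        public static void main(String[] args) throws InterruptedException {
            final int WORKERS = 3, TASKS = 9;
            BlockingQueue<Integer> pool = new LinkedBlockingQueue<>();

            Thread[] workers = new Thread[WORKERS];
            for (int w = 0; w < WORKERS; w++) {
                final int id = w;
                workers[w] = new Thread(() -> {
                    while (true) {
                        try {
                            int task = pool.take();
                            if (task == POISON) return;   // termination detected
                            System.out.println("worker " + id + " did task " + task);
                        } catch (InterruptedException e) { return; }
                    }
                });
                workers[w].start();
            }

            for (int t = 0; t < TASKS; t++) pool.put(t);  // manager produces work
            for (int w = 0; w < WORKERS; w++) pool.put(POISON);
            for (Thread t : workers) t.join();
        }
    }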

Mapping Design Checklist

-- make sure the manager is not a bottleneck
-- have you considered the implementation costs of dynamic load balancing?
-- have you considered the effectiveness of dynamic load balancing?