Designing Parallel Algorithms
(from Foster - Chapter 2)

Partitioning

Domain Decomposition

-- focus on the largest data structure or most frequently accessed data structure
-- for example, our original decomposition of the Parallel Jacobi Method assigned a process to each element of the grid; similar element-wise decompositions appeared in parallel Simpson's rule for numerical integration and in parallel matrix multiply, where we assigned one thread per element of the resultant matrix
-- weather modeling might assign each particle to a thread (probably far too fine-grained in practice)
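
To make this concrete, here is a minimal sketch (illustrative names, not our original code) that agglomerates slightly to one thread per interior grid row for a single Jacobi-style averaging step; a one-thread-per-element version would follow the same pattern:

    import java.util.ArrayList;
    import java.util.List;

    // Domain decomposition sketch: one task (thread) per grid row of a
    // Jacobi-style relaxation step. A real version would iterate and
    // synchronize between steps.
    public class JacobiRowTasks {
        static final int N = 8;
        static double[][] oldGrid = new double[N][N];
        static double[][] newGrid = new double[N][N];

        public static void main(String[] args) throws InterruptedException {
            List<Thread> workers = new ArrayList<>();
            for (int i = 1; i < N - 1; i++) {          // interior rows only
                final int row = i;
                Thread t = new Thread(() -> {
                    for (int j = 1; j < N - 1; j++) {  // 4-point stencil average
                        newGrid[row][j] = 0.25 * (oldGrid[row - 1][j] + oldGrid[row + 1][j]
                                                + oldGrid[row][j - 1] + oldGrid[row][j + 1]);
                    }
                });
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) t.join();         // wait for one full step
        }
    }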

Functional Decomposition

-- for example, when we developed the full-duplex sliding window protocol, we isolated the control of packets and sequence numbering (the shared data) in the transmitter/receiver threads
-- example: layering of operating systems (a file system on top of a multithreading kernel)
-- functional decomposition of a climate simulation model
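
A rough sketch of that transmitter/receiver separation (hypothetical class and type names; requires Java 16+ for records): the two functions run as independent threads, the sequence-numbering state is private to the transmitter, and a bounded queue stands in for the channel:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Functional decomposition sketch: transmitter and receiver are separate
    // tasks; sequence numbering (shared protocol state) lives only inside
    // the transmitter. The queue models the physical channel.
    public class TxRxDecomposition {
        record Packet(int seq, String data) {}

        public static void main(String[] args) {
            BlockingQueue<Packet> channel = new ArrayBlockingQueue<>(4);

            Thread transmitter = new Thread(() -> {
                int seq = 0;                            // sequence state is private
                for (String d : new String[] {"a", "b", "c"}) {
                    try { channel.put(new Packet(seq++, d)); }
                    catch (InterruptedException e) { return; }
                }
            });

            Thread receiver = new Thread(() -> {
                for (int i = 0; i < 3; i++) {
                    try { System.out.println(channel.take()); }
                    catch (InterruptedException e) { return; }
                }
            });

            transmitter.start();
            receiver.start();
        }
    }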

Partitioning Design Checklist

-- an order of magnitude more tasks than processors (otherwise there is little flexibility in later design phases)
-- avoid redundant computation and storage requirements (otherwise the design may not scale to larger problems)
-- tasks should be about the same size (easier to allocate to processors)
-- the number of tasks should scale with problem size (a larger problem should increase the number of tasks, not the size of each task)
-- are there alternative partitions? (for example, the grid problem lends itself to both row partitions and subpartitions)

Communication


-- for example, in row-by-row Jacobi, we sent "signals" between row threads; in the partitioned version, we sent border elements between subpartitions
-- in the data communications protocols, we sent packets between the transmitter and receiver threads
-- in the shortest path algorithm, we sent trial distances between threads
-- in the adaptive quadrature problem, we sent interval objects (height, width, area)
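
A sketch of the border-element exchange (illustrative; the queues are buffered with capacity 1 so both partitions can send before receiving, which keeps the exchange deadlock-free):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Communication sketch: two partition tasks swap border rows each step.
    // Each task sends its own border, then receives its neighbor's border
    // into a ghost row used by the next computation step.
    public class BorderExchange {
        public static void main(String[] args) {
            BlockingQueue<double[]> upward = new ArrayBlockingQueue<>(1);
            BlockingQueue<double[]> downward = new ArrayBlockingQueue<>(1);

            Thread top = new Thread(() -> {
                try {
                    downward.put(new double[] {1, 1, 1});  // send my border
                    double[] ghost = upward.take();        // receive neighbor's
                    System.out.println("top received " + ghost.length + " border elements");
                } catch (InterruptedException e) { }
            });

            Thread bottom = new Thread(() -> {
                try {
                    upward.put(new double[] {2, 2, 2});
                    double[] ghost = downward.take();
                    System.out.println("bottom received " + ghost.length + " border elements");
                } catch (InterruptedException e) { }
            });

            top.start();
            bottom.start();
        }
    }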


-- local/global - in the grid problem, we talked only to our neighbors; in adaptive quadrature, we placed intervals in a global pool; we also used a multiway rendezvous to implement a global barrier (n-body problem, divide and conquer, replicated workers)
-- structured communication (as in the grid problems)
-- unstructured communication (as in the shortest path algorithm)
-- static communication architecture (as in the topology algorithm)
-- dynamic communication architecture (as in the Java webpage server using TCP/IP)
-- synchronous communication (as in the alternating bit protocol and the Jacobi Relaxation algorithm)
-- asynchronous communication (as in the resource allocation thread and the file client/server interaction)
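
For the global barrier mentioned above, Java's CyclicBarrier provides the multiway rendezvous directly; a small sketch (our course versions were built from monitors, so this is only an analogue):

    import java.util.concurrent.BrokenBarrierException;
    import java.util.concurrent.CyclicBarrier;

    // Global communication sketch: N workers rendezvous at a barrier after
    // each step, as in the n-body and replicated-worker examples.
    public class BarrierStep {
        public static void main(String[] args) {
            final int N = 4;
            CyclicBarrier barrier = new CyclicBarrier(N,
                    () -> System.out.println("--- step complete ---"));

            for (int id = 0; id < N; id++) {
                final int me = id;
                new Thread(() -> {
                    for (int step = 0; step < 3; step++) {
                        System.out.println("worker " + me + " computed step " + step);
                        try { barrier.await(); }      // multiway rendezvous
                        catch (InterruptedException | BrokenBarrierException e) { return; }
                    }
                }).start();
            }
        }
    }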

Communication Design Checklist

-- balance of computation across tasks (scalability)

** grid point, row, or partition

-- keep neighborhood of communication small

** only communicate with neighbors in grid

-- concurrency of communications

** grid neighbors send and then receive (all parallel)

-- concurrency of computation

** all grid computations are parallel within step

Agglomeration

Increasing Granularity

-- grid points to rows to partitions
-- matrix elements to rows to partitions


-- in the grid problem, we replicated the borders of partitions (each partition keeps a copy of its neighbors' border elements)
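
One way to picture that replication (a hypothetical layout, not our actual code): each partition owns a block of rows plus two replicated "ghost" border rows, refreshed once per step, so the relaxation step itself touches only local data:

    // Agglomeration sketch: a partition owns a block of grid rows and keeps
    // replicated copies of the two neighboring border rows ("ghost" rows),
    // refreshed once per step instead of communicating per element access.
    public class Partition {
        final double[][] rows;   // rows[0] and rows[rows.length - 1] are ghosts
        final int width;

        Partition(int ownedRows, int width) {
            this.rows = new double[ownedRows + 2][width];  // +2 replicated borders
            this.width = width;
        }

        // Refresh the replicated borders from the neighbors (once per step).
        void setGhosts(double[] above, double[] below) {
            rows[0] = above.clone();
            rows[rows.length - 1] = below.clone();
        }

        // One relaxation step over the owned rows, touching only local data.
        // out must have the same shape as rows.
        void relaxStep(double[][] out) {
            for (int i = 1; i <= rows.length - 2; i++)
                for (int j = 1; j < width - 1; j++)
                    out[i][j] = 0.25 * (rows[i - 1][j] + rows[i + 1][j]
                                      + rows[i][j - 1] + rows[i][j + 1]);
        }

        public static void main(String[] args) {
            Partition p = new Partition(4, 6);
            p.setGhosts(new double[6], new double[6]);
            double[][] out = new double[6][6];
            p.relaxStep(out);                  // no communication inside the step
            System.out.println("local step done; out[1][1] = " + out[1][1]);
        }
    }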
 
-- communication costs include both setup and transmission
-- task creation costs
-- surface-to-volume effects (grid partitions vs rows for various stencils)
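
** a back-of-the-envelope illustration (assuming a 5-point stencil on an n x n grid split among P tasks): a row decomposition gives each task n^2/P points of computation but 2n border values of communication, a ratio of 2P/n; square subpartitions communicate only 4n/sqrt(P) values, a ratio of 4*sqrt(P)/n, which is smaller whenever P > 4 and improves as partitions grow "thicker"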

Agglomeration Design Checklist

-- have you increased locality and reduced communication costs?
-- replicating data can sometimes interfere with scalability
-- are tasks about the same granularity?
-- does number of tasks scale with problem size?
-- did agglomeration reduce concurrency?
-- do you need to parallelize sequential code to get more concurrency?

Mapping


-- example: vector-matrix multiplication
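
A sketch of a static mapping for this example (sizes and names are made up): each of P worker threads is mapped a contiguous block of rows of A and computes its slice of y = A*x with no further communication:

    import java.util.Arrays;

    // Mapping sketch for y = A*x: rows are statically mapped to P workers in
    // contiguous blocks; each worker writes a disjoint slice of y, so no
    // communication is needed after the mapping is chosen.
    public class MatVec {
        public static void main(String[] args) throws InterruptedException {
            final int n = 8, P = 4;
            double[][] A = new double[n][n];
            double[] x = new double[n], y = new double[n];
            for (int i = 0; i < n; i++) { x[i] = 1; Arrays.fill(A[i], i); }

            Thread[] workers = new Thread[P];
            for (int p = 0; p < P; p++) {
                final int lo = p * n / P, hi = (p + 1) * n / P;  // this worker's rows
                workers[p] = new Thread(() -> {
                    for (int i = lo; i < hi; i++) {
                        double sum = 0;
                        for (int j = 0; j < n; j++) sum += A[i][j] * x[j];
                        y[i] = sum;
                    }
                });
                workers[p].start();
            }
            for (Thread t : workers) t.join();
            System.out.println(Arrays.toString(y));
        }
    }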

Load-Balancing Algorithms

-- these techniques agglomerate tasks into one partition per processor
-- Local Algorithms - balance load locally, e.g., by process migration between neighboring processors
-- Probabilistic Methods - allocate tasks to processors at random
-- Cyclic Mappings - each processor is allocated every Pth task
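
A cyclic mapping is one line of code; a tiny sketch (the task and processor counts are made up):

    // Cyclic mapping sketch: tasks 0..T-1 are dealt out so each of the
    // P processors receives every P-th task (task t -> processor t mod P).
    public class CyclicMapping {
        public static void main(String[] args) {
            final int T = 10, P = 3;
            for (int t = 0; t < T; t++)
                System.out.println("task " + t + " -> processor " + (t % P));
        }
    }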

Task-Scheduling Algorithms

-- centralized or distributed task pool

** Manager/Worker - replicated workers under the control of a manager or monitor
** Decentralized Schemes - the task pool is a distributed data structure (the shortest path algorithm, or adaptive quadrature with a distributed pool)

-- Termination Detection - needed to determine when the work in the pool is complete
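
A compact manager/worker sketch with a centralized pool (hypothetical integer tasks; a monitor-based version would look similar). Termination detection here is the simplest kind: one poison-pill task per worker, enqueued after all the real work:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Manager/worker sketch with a centralized task pool. The manager
    // enqueues one poison pill per worker after the real tasks; a worker
    // exits when it sees the pill, so the pool drains before shutdown.
    public class ManagerWorker {
        static final int POISON = -1;

        public static void main(String[] args) throws InterruptedException {
            final int WORKERS = 3, TASKS = 9;
            BlockingQueue<Integer> pool = new LinkedBlockingQueue<>();

            Thread[] workers = new Thread[WORKERS];
            for (int w = 0; w < WORKERS; w++) {
                final int id = w;
                workers[w] = new Thread(() -> {
                    while (true) {
                        try {
                            int task = pool.take();
                            if (task == POISON) return;   // termination detected
                            System.out.println("worker " + id + " did task " + task);
                        } catch (InterruptedException e) { return; }
                    }
                });
                workers[w].start();
            }

            for (int t = 0; t < TASKS; t++) pool.put(t);  // manager produces work
            for (int w = 0; w < WORKERS; w++) pool.put(POISON);
            for (Thread t : workers) t.join();
        }
    }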

Mapping Design Checklist

-- make sure the manager is not a bottleneck
-- have you considered the implementation costs of dynamic load balancing?
-- have you considered the effectiveness of dynamic load balancing?