Dependencies
Understanding data dependencies is fundamental to implementing parallel algorithms. No program can run faster than the longest chain of dependent calculations (known as the critical path), since dependent calculations must be executed in order. Fortunately, most algorithms do not consist of just a long chain of dependent calculations; there are usually many opportunities to execute independent calculations in parallel.
Instruction level parallelism
Let Pᵢ and Pⱼ be two program fragments. Bernstein's conditions[1] describe when the two are independent and can be executed in parallel. Let Iᵢ be all of the input variables to Pᵢ and Oᵢ the output variables, and likewise for Pⱼ. Pᵢ and Pⱼ are independent if they satisfy

    Iⱼ ∩ Oᵢ = ∅
    Iᵢ ∩ Oⱼ = ∅
    Oᵢ ∩ Oⱼ = ∅
Violation of the first condition introduces a flow dependency: the first statement produces a result used by the second statement. The second condition represents an anti-dependency: the second statement would overwrite a variable needed by the first statement. The third and final condition represents an output dependency: when two statements write to the same location, the final result must come from the second statement.
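In code, checking Bernstein's conditions amounts to testing that three set intersections are empty. The following C sketch is illustrative and not part of the original text: it encodes each fragment's input and output variable sets as bitmasks, one bit per program variable.

    #include <stdbool.h>

    /* Illustrative sketch: a fragment's input and output variable sets
       are represented as bitmasks (one bit per program variable). */
    typedef struct {
        unsigned in;   /* set of variables the fragment reads  */
        unsigned out;  /* set of variables the fragment writes */
    } Fragment;

    /* Bernstein's conditions: all three intersections must be empty. */
    bool independent(Fragment pi, Fragment pj) {
        return (pj.in  & pi.out) == 0    /* no flow dependency   */
            && (pi.in  & pj.out) == 0    /* no anti-dependency   */
            && (pi.out & pj.out) == 0;   /* no output dependency */
    }

For the Dep example below, with bits a = 1, b = 2, c = 4, d = 8, the fragment c := a·b has in = a|b and out = c, while d := 2·c has in = c and out = d; independent returns false because the second fragment's input intersects the first fragment's output.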
Consider the following function:
1: function Dep(a, b)
2:     c := a·b
3:     d := 2·c
4: end function
Operation 3 in Dep(a, b) cannot be executed before (or even in parallel with) operation 2, because operation 3 uses a result from operation 2. It violates condition 1, and thus introduces a flow dependency.
1: function NoDep(a, b)
2:     c := a·b
3:     d := 2·b
4:     e := a+b
5: end function
In this example there are no dependencies between the instructions, so they can all be run in parallel.
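As a concrete, merely illustrative sketch, the three assignments of NoDep can be handed to separate threads. The C code below uses OpenMP sections (compile with -fopenmp); since the output sets {c}, {d} and {e} are disjoint and the inputs a and b are only read, Bernstein's conditions guarantee that every interleaving gives the same result.

    #include <stdio.h>

    int main(void) {
        double a = 3.0, b = 4.0;
        double c, d, e;

        /* The three statements have disjoint output sets and read-only
           inputs, so they may run concurrently in any order. */
        #pragma omp parallel sections
        {
            #pragma omp section
            c = a * b;
            #pragma omp section
            d = 2 * b;
            #pragma omp section
            e = a + b;
        }

        printf("c=%g d=%g e=%g\n", c, d, e);
        return 0;
    }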
Pipelining
Pipelining is a method for simultaneously executing several multi-stage computations on one set of execution units, similar to a factory assembly line. Consider an algorithm with a seven-step dependency chain that must be applied to a set of data. It could be executed on seven execution units, each performing one of the steps. Since there are data dependencies between consecutive stages, however, most of the hardware would sit idle unless pipelining is used. In a pipelined architecture, multiple pieces of data flow through the calculation steps simultaneously: as soon as one piece of data finishes a stage, it moves on to the next, making room for the following piece of data to enter the stage it has just vacated. After an initial latency while the pipeline fills, the program can produce a new result on every step, assuming the pipeline can be kept full.
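To see the gain, suppose the seven-step chain is applied to 100 data items (the item count here is only an example). Executed one item at a time it takes 7 × 100 = 700 steps, while a full pipeline needs seven steps to produce the first result and one step for each result after that: 7 + 99 = 106 steps in total.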
Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage. In other words, a processor with N pipeline stages can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor with five stages: instruction fetch, decode, execute, memory access, write back. The Pentium 4 processor had a 35-stage pipeline.[2]
Other forms of pipelining are software pipelining, in which a sequence of instructions is reordered so that independent work from adjacent loop iterations overlaps, and hardware pipelines, where the stages are custom-built to execute a specific algorithm.
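The following C sketch shows the idea of software pipelining on a hypothetical three-stage computation (the stage functions are invented for illustration). In the naive loop each iteration runs the full load → compute → store chain; the pipelined variant issues the load for iteration i+1 before finishing iteration i, mirroring how a hardware pipeline keeps its stages busy after the initial fill.

    #include <stddef.h>

    /* Hypothetical three-stage computation applied to each element. */
    static double load_stage(const double *a, size_t i)  { return a[i]; }
    static double compute_stage(double x)                { return 2.0 * x + 1.0; }
    static void   store_stage(double *out, size_t i, double x) { out[i] = x; }

    /* Naive loop: each iteration runs the whole dependency chain. */
    void run_sequential(const double *a, double *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            store_stage(out, i, compute_stage(load_stage(a, i)));
    }

    /* Software-pipelined loop: the load for iteration i+1 is issued
       before the compute/store of iteration i completes. */
    void run_pipelined(const double *a, double *out, size_t n) {
        if (n == 0) return;
        double x = load_stage(a, 0);                /* fill the pipeline */
        for (size_t i = 0; i + 1 < n; i++) {
            double next = load_stage(a, i + 1);     /* stage 1 of iteration i+1 */
            store_stage(out, i, compute_stage(x));  /* stages 2-3 of iteration i */
            x = next;
        }
        store_stage(out, n - 1, compute_stage(x));  /* drain the pipeline */
    }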
Vectorization
If the same sequence of operations is run over and over again on different data values, another way of lowering the total run time is to duplicate the execution units and have each execution unit process a subset of the data. Modern processors do this with SIMD (single instruction, multiple data) units that apply one instruction to several data elements at once.
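As a minimal sketch, assuming an x86 processor with SSE support, the loop below processes four floats per iteration instead of one. The function names are illustrative; the intrinsics come from the standard immintrin.h header.

    #include <immintrin.h>  /* x86 SIMD intrinsics */

    /* Scalar version: one addition per loop iteration. */
    void add_scalar(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* Vectorized version: four additions per iteration using 128-bit
       SSE registers. For brevity, assumes n is a multiple of 4. */
    void add_vector(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);             /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&out[i], _mm_add_ps(va, vb));  /* 4 adds at once */
        }
    }

Many compilers perform this transformation automatically (auto-vectorization) when a loop has no dependencies between iterations, which is exactly what Bernstein's conditions check.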
- ^ K. Hwang and F. A. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill, 1984.
- ^ Yale Patt. "The Microprocessor Ten Years From Now: What Are The Challenges, How Do We Meet Them?" (wmv). Distinguished Lecturer talk at Carnegie Mellon University, April 2004. Retrieved November 7, 2007.