Work-efficient parallel prefix (down-sweep phase)

Now let's continue with the down-sweep, which will operate on the output of the up-sweep:

input: x_0, ..., x_{n-1}
initialize:
    for i = 0 to n - 2:
        y_i := x_i
    y_{n-1} := 0
begin:
for k = log2(n) - 1 down to 0:
    parfor j = 0 to n - 1:
        if j is divisible by 2^(k+1):
            temp := y_{j + 2^k - 1}
            y_{j + 2^k - 1} := y_{j + 2^(k+1) - 1}
            y_{j + 2^(k+1) - 1} := y_{j + 2^(k+1) - 1} + temp
        else:
            continue
end
output: y_0, y_1, ..., y_{n-1}
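To make the index arithmetic concrete, here is a minimal sequential sketch of the down-sweep in plain Python. It makes several assumptions: the operator is ordinary addition, n is a power of two, and the parfor is simulated by an ordinary loop (the iterations within one level touch disjoint elements, so their order does not matter). The function names `up_sweep` and `down_sweep` are hypothetical, and `up_sweep` is only a stand-in for the standard reduce phase so that the example has a valid input.

```python
def up_sweep(x):
    """Stand-in for the reduce phase (assumed form); len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    levels = n.bit_length() - 1  # log2(n) for a power of two
    for k in range(levels):
        step = 2 ** (k + 1)
        # Each j is a multiple of 2^(k+1); on a GPU these would run in parallel.
        for j in range(0, n, step):
            y[j + step - 1] += y[j + 2 ** k - 1]
    return y


def down_sweep(x):
    """Sequential simulation of the down-sweep; x is the output of the up-sweep."""
    y = list(x)
    n = len(y)
    levels = n.bit_length() - 1
    y[n - 1] = 0  # initialize: the last element is replaced by 0
    for k in range(levels - 1, -1, -1):
        step = 2 ** (k + 1)
        for j in range(0, n, step):
            left = j + 2 ** k - 1   # index j + 2^k - 1
            right = j + step - 1    # index j + 2^(k+1) - 1
            temp = y[left]
            y[left] = y[right]
            y[right] = y[right] + temp
    return y


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(down_sweep(up_sweep(data)))  # [0, 1, 3, 6, 10, 15, 21, 28]
```

With addition as the operator, the combined result is an exclusive prefix sum: each output element holds the sum of all input elements strictly before it, so the first element is always 0.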
