# Parallel Line Breaking

In class, I was recently discussion parallel layout in web browsers, and mentioned that drawing text in parallel is quite difficult. That made me think: what about other text-related algorithms? So in this post I'll describe my five-hours' amusement: a parallel line-breaking algorithm.

## Defining the problem

By line breaking I mean the task of figuring out which words each line
of text contains. In a web browser, this is always done greedily. That
is, let's suppose we've already measured each word in the line of text
and have a list of their lengths.^{1} [^{1} This is assuming a kind of naive
English. In reality you'd use the Unicode line-breaking algorithm to
determine where the valid line-breaks are, and I don't really know how
you'd deal with shaping or kerning that depends on the line breaks.]
The greedy algorithm is something like:^{2} [^{2} The below is accidentally
quadratic on purpose; it is written like a spec, not a program.]

def line_break(lens, width): lines = [[]] for l in lens: if sum(lines[-1]) + l <= width: lines[-1].append(l) else: lines.append([l])

My goal is to speed this up using parallelism. The code here can't be
run in parallelism due to the dependency on the `lines`

variable; a
parallel algorithm has to work somehow totally differently.

The most useful parallel programming model here would be something
like the GPU model, but I've never done any GPU programming, so I'll
instead use a generic shared-memory multi-threading model with a
sequentially-consistent main memory for simplicity.^{3} [^{3} Maybe I'll
find a student with interest in GPUs.] I'm also going to assume that
there are as many threads as needed, which isn't really realistic but
ensures I use parallelism to the hilt.

## The solution

In short, my solution runs in \(O(\log N)\), where \(N\) is the number of words. I believe but won't prove that with precisely \(T\) threads this algorithm runs in \(O(N \log N / T)\) time. The algorithm proceeds in four phases:

- Phase 1
- Use parallel prefix to compute the partial sums of the lengths array in \(O(\log N)\) time. We use that in later steps to compute the length of subsequences in \(O(1)\) time.
- Phase 2
- In parallel for each starting word, compute the longest subsequence starting at that element of sum less than the maximum line width; use exponential array search to achieve \(O(\log N)\) time.
- Phase 3
- Think of the previous step as computing a pointer from each word to the start of the next line; this gives us a tangle of linked lists. Now, for each word in parallel, follow the pointer twice, then follow these double-pointers twice, and so on. Each step takes \(O(1)\) time, there are \(O(\log N)\) steps, and we end up using \(O(N \log N)\) space to record the results for each step.
- Phase 4
- Over-count the number of lines by at most 2 via the number of steps Phase 3 took, and allocate that much memory for storing pointers to the start of each line. Start a parallel thread for each line and traverse the skip list from the first word using the binary expansion of the index to determine whether or not to follow each pointer. Since you'll follow at most \(O(\log N)\) pointers, this take \(O(\log N)\) time.

You can still clean up the over-allocation but you're basically done.

## Notes and details

This algorithm computes line breaks simultaneously for each suffix of the input. That gives it a dynamic-programming feel, and provides the data-independence that allows for parallelism.

The parallel prefix in Phase 1 is a standard trick in parallel programming, and really also serial programming. If you have the partial sums \(s_k = \sum_{i=0}^k a_i\), you can compute \(\sum_{i=j}^k a_i = s_k - s_j\) quickly.

In Phase 2 we use that ability to quickly search for a subsequence of sum less than the line width. The algorithm here is inherently serial but logarithmic-time, so it's fine.

In Phase 3, the skip-list is an under-rated data structure in data-parallel environments. It is memory-intensive but works well on tangles of linked lists.

Finally in Phase 4 it is crucial that the skip-list has a perfect binary structure, so we can efficiently find the $k$-th line without counting to \(k\). We can also use the skip-list structure to approximate the number of lines, which was a stumbling block for me for a while.

The communication structure of this algorithm is not bad but not great. On the one hand, every memory cell is write-once, and each phase strictly segregates read-only and write-only memory. On the other hand, the memory is written locally but read globally, so there is a shuffle between each phase. Memory may be a bottleneck.

Finally, I have no idea how practical this algorithm is. Text layout is not really a bottleneck in browsers (though text rendering is!) and I would guess that for small paragraphs the overhead in time is prohibitive while for big paragraphs the overhead in memory is concerning.

But if you're interested in implementing, shoot me an email!

## Footnotes:

^{1}

This is assuming a kind of naive English. In reality you'd use the Unicode line-breaking algorithm to determine where the valid line-breaks are, and I don't really know how you'd deal with shaping or kerning that depends on the line breaks.

^{2}

The below is accidentally quadratic on purpose; it is written like a spec, not a program.

^{3}

Maybe I'll find a student with interest in GPUs.