Age-aware Array Search

Series

This is post 1 of the Age-Aware Data Structures series.

Age-aware data structures respect some notion of “recency” in your data. They can be useful for tracking logs or statistics: the runtime of many operations is independent of the total amount of data. This first of four parts describes why you might want age-aware data structures and implements an age-aware sorted array.

A use case

Let’s image we are building a Twitter clone. Every user has a list of posts that are ordered chronologically. Visitors to a user’s page sometimes want to grab all the posts in a certain time range (say, all the posts someone made about the Olympics). How can we store and look up posts that make this operation fast?

For now, let’s imagine new posts are always newer than all previous posts (realistic, though challenging to achieve in a distributed setting) and that users cannot delete posts. Then an array data structure naturally suggests itself, since pushing elements onto the end of an array is fast (\(\Theta(1)\) amortized). To find posts in a certain time range we can use a binary search. This returns an answer in \(\Theta(\log n)\) time, which seems like the best we can do.

But here’s an insight into how people use this system: people are far more likely to query recent time ranges than old ones. We’d prefer if accessing recent data were faster than accessing stale data.

Age-aware data structures solve that problem. An age-aware collection assigns a “timestamp” to every item, and measures its performance in terms of how old that item is instead of the total number of items. Let’s try to write such a data structure.

Staleness

Let’s begin by trying to make an array, as above. What do we want the runtime of lookups to be? Well, we want it to be independent of \(n\), the total number of items in the array. Instead, we only want it to depend on how old the item is.

Now, it seems like the actual value of the timestamp shouldn’t matter. For example, what if we scale all the timestamps? Instead, we define the staleness of an item by ordering all items by timestamp, and counting the number of items between it and the latest item. The most-recently-added item has staleness 1, the second-most-recent has staleness 2, and so on.

Our goal is to find the index of the first item with a timestamp greater than some value. Furthermore, if the staleness of that item is \(s\), we want to do this in \(\Theta(\log s)\) time. This means we cannot search over the entire array, since any way of doing that would take time dependent on \(n\), the total number of elements in the array.

Bounded binary search

One way to find the index in an age-aware way is to do a linear scan from the end of the array. This takes exactly \(s\) steps, so we get a runtime that is age-aware (independent of \(n\)), but which is too slow (we’d like \(\Theta(\log s)\), not \(\Theta(s)\)).

Binary search is a very fast way to find an element in an array. We’d like to use it in our age-aware search. Binary searching an array of length \(m\) takes time in \(\Theta(\log m)\), so if we use it we must pass it an array of size \(α s\) for \(α\). We need to find an bound on the index of the item we’re looking for, and then binary-search just part of the array in those bounds.

We can find this bound by linearly scanning from the end of the array, but that again takes \(\Theta(s)\) time, destroying the gains from using a binary search. But we don’t need an exact bound; as long as we find a bound that is within a multiple of the real index, we are happy. So instead of scanning linearly, we can take exponentially-bigger steps from the end of the array. That is, we try the last item, the second-to-last, the fourth-to-last, the eighth-to-last, and so on.

When we find an item that is older than the one we’re looking for, its index is at most twice the index of the item we’re looking for, and we can finish up the search with a binary search. The search takes \(\Theta(\log s)\) steps, and the binary search also takes \(\Theta(\log s)\) steps, so overall we have the runtime we want.

Implementation

Let’s implement this algorithm. I’ll be using Scala on the advice of Eric Mullen.

Our age-aware arrays will have a mutable ArrayBuffer to store elements in, and will represent timestamps with Double values. So, the buffer will store pairs of a timestamp and a value.

import scala.collection.mutable

class AAArray[T] {
  val buffer = mutable.ArrayBuffer[(Double, T)]()

  // ...
}

It’s easy enough to append to the array and get values at indices.

def +=(time: Double, elt: T) {
  this.buffer += ((time, elt))
}

def apply(i: Int) = {
  this.buffer(i)
}

def length = this.buffer.length

To implement look up a value, we first find a lower bound on its index by taking exponentially-increasing steps.

def lookup(t: Double): Int = {
  var i = 1
  val l = this.length - 1

  if (this(l)._1 < t) return -1

  while (i < this.length && this(l-i)._1 > t) i *= 2

  if (i >= this.length) i = this.length - 1

  // ...
}

Finally, once we find the lower bound i, we do a binary search to find the right element.

var lo = 0
var hi = i

while (hi - lo > 1) {
  val mid = (hi + lo) / 2
  if (this(l-mid)._1 < t) {
    hi = mid
  } else {
    lo = mid
  }
}

if (this(l-hi)._1 >= t) {
  return l-hi
} else {
  return l-lo
}

We can try our implementation out at the Scala evaluator:

scala> val t = new AAArray[String]()
scala> t += (0.0, "a")
scala> t += (1.0, "b")
scala> t += (2.0, "c")
scala> t += (3.0, "d")
scala> t += (4.0, "e")
scala> t.buffer
ArrayBuffer[(Double, String)] = ArrayBuffer((0.0,a), (1.0,b), (2.0,c), (3.0,d), (4.0,e))

Looks like inserting elements in fact works very well. What about looking up values?

scala> t.lookup(1.2)
Int = 2
scala> t.lookup(3.7)
Int = 4
scala> t.lookup(3.0)
Int = 3

Seems like our exponential skips and binary search work together wonderfully. We always return the index of the smallest item at or after the given timestamp.

Next steps

So far our data structure is very limited, because you can’t remove elements, or add elements anywhere except the very end. To get around those problems, we’ll need to generalize from an array to a tree, which is coming in the next part of this blog series.

This AAArray class is also barren. It should probably implement some traits from the scala.collection package, and perhaps inherit from Array. Proper object-oriented design isn’t really the focus of this series, but you’re free to take my code an improve it.

By Pavel Panchekha

20 February 2014