# Pedestrian Statistics


I recently helped another PLSE student with analyzing data for a paper, and the analysis involved a lot of the basics of statistical analysis. So I’m going to walk through the analysis as a tutorial in doing statistics for PL and systems projects.

## Statistics are not scary

People don’t like statistics, from Disraeli’s “lies, damn lies, and statistics” to the book *How to Lie with Statistics*. But statistics is no false science, and in this tutorial I’m trying to dispel the myth that statistics is complex and unintuitive, especially as applied in computer science.

Advanced statistics, which can see fine distinctions in circumstantial data, requires training, experience, and expertise to apply correctly. But when experiments are infinitely reproducible and completely observable, you do not need complex statistics. You can go far with some very basic tests. Hence this tutorial: how to use basic statistics when they apply.

The downside of the simplified statistics I describe is lessened *power*: many effects that do exist will be invisible without better statistical tests. Don’t use this approach to conduct the next minimum wage study. But when it’s easier to run more tests or gather more data, why not do that instead of learning statistics?

## Tools for statistics

Real statisticians, who need sophisticated statistical analyses to find truth in the barest and strangest of data sets, need powerful statistics packages. I’m doing a much simpler task, so I don’t need these tools. In fact, I only need to:

- Manipulate tables of data with ease
- Evaluate basic mathematical functions (`log`, `sqrt`)
- Draw histograms

In this tutorial, I’ll be using a 150-line Python script to parse, clean, and visualize the data with ASCII-art histograms. The script doesn’t need any libraries except the standard `os` and `math` modules. Feel free to follow along in any programming language.

## Collecting and cleaning data

I’ll be using data from experiments on a new dynamic analysis tool. The dynamic analysis adds instrumentation to a Java program, so it probably slows programs down. I want to know how much slower instrumented programs are. Formally, I’m interested in the effect of the instrumented / uninstrumented variable on request time.

The data comes from tests run on three different open-source server programs, each run five times both with and without instrumentation. I’ll look at only one server here; the other two have similar analyses. The test for this server sends random requests from 50 threads, and measures how long each request takes.

It’s important to treat data from all ten runs as part of the same data set. If all of the data points are described with the same variables, they are part of the same data set. That means, don’t split the instrumented and uninstrumented data, or the different runs, into separate data sets. Always prefer to combine.^{1}^{1} The cost of splitting data sets is weaker statistics: it’ll be harder to see small effects or make fine distinctions. If you have several totally different experiments (that is, experiments observing different variables) and want to make an inference using data from both, you will need more sophisticated statistics.

Each data point is described by some number of variables that we observe. For example, our data is approximately 622 000 requests, each annotated with five variables:

- How long the request took
- A request type
- The ID of the thread that made the request
- Whether the server was instrumented
- The test ID

Loading data is the first step to analyzing it, and I’ve stored my data in a CSV file to make this easy. I keep to simple text formats until I have a truly massive amount of data. This way, I can switch tools and not worry about library support. My scripts read the data into a Python list of lists; Python handles lists of mixed type, so this is fine. I also make sure to annotate each data point with its file and directory. This is data, even if it’s recorded in the file names; in my case, this data tells you which run the data point comes from, and whether the server was instrumented.
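The file-name annotation step might look like the sketch below; the `mode-run.csv` naming scheme here is hypothetical, purely to illustrate the idea of pulling metadata out of file names:

```python
def annotate(rows, filename):
    # Pull metadata out of a (hypothetical) file name like "inst-3.csv":
    # whether the server was instrumented, and which run this file records.
    mode, run = filename.removesuffix(".csv").split("-")
    return [row + [mode, run] for row in rows]
```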

Next I throw away bad data, in this case requests that didn’t complete.

```
>>> len(GOODDATA), len(BADDATA)
(622122, 224)
```

Whenever you split the data like this, it’s important to check that the split is independent of the variables you’re interested in. For example, if the dynamic analysis caused requests to fail, we’d want to know. So let’s split the good and bad data by instrumentation and check that the ratios look similar:

```
>>> [len(x) for x in partition(BADDATA, lambda v: v[12] == "base")]
[112, 112]
>>> [len(x) for x in partition(GOODDATA, lambda v: v[12] == "base")]
[311385, 310737]
```

These ratios look similar—it certainly doesn’t look like instrumentation changes the probability of a failing test.^{2}^{2} You could test this statistically with a Χ² test, but I’m trying to avoid using computations when the results are obvious.
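The `partition` helper used in these transcripts is simple; since the script isn’t shown, here is one sketch of what it could be:

```python
def partition(data, pred):
    # Split a table in two: rows where the predicate holds, and the rest.
    yes, no = [], []
    for row in data:
        (yes if pred(row) else no).append(row)
    return yes, no
```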

Next it’s time to clean the data: dropping useless columns and converting numeric fields to numbers. I do all cleanup on a separate copy of the data, `CLEANDATA`, so if anything starts looking weird I can always go back to the dirty copy. By the way, as a convention, I’ll be using all-caps names to represent subsets of the full data set. Whenever you see a name like `CLEANDATA`, you know it’s a table of data.

## Models and parameters

To analyze a collection of data, I proceed in two steps: first, I pick a *model* of the relationship that I want to study; then I use the data to estimate the *parameters* of the model. A model is a mathematical expression for the dependent variables of the data, in terms of the observed variables, hidden variables that aren’t in our data set, random variables, and parameters that we don’t know and are trying to learn.^{3}^{3} What’s the difference between a random variable and a hidden variable? A hidden variable is some aspect of the world that you could have measured. A random variable is inherently random and unpredictable. For example, the bias of a coin is a hidden variable, which you might measure by flipping the coin many times. But the result of each individual flip is random. This is a fluid, human distinction; a random variable is just a hidden variable that is independent of everything else and so impossible to infer.

In my case, the dependent variable is how long a request took, so I’m looking for a formula that gives the request time in terms of observed variables (like instrumentation or request type), hidden variables (like whether a request hit cache), random variables (like noise for this particular request), and parameters (like the overhead from instrumentation).

For example, the model that this tutorial will end with describes each request as taking time decided by: its request type; then by a hidden variable that represents caching, garbage collection, database contention, or something similar; then by instrumentation; and then by random, log-normally distributed noise. The model also describes how important each of these variables is, and how they affect each other (request type affects the hidden variable, for example).

Choosing a model is the most important part of analyzing data. “Lying with statistics” usually means picking an inappropriate model—modeling something as normally-distributed when it’s not, for example. Picking a model requires looking at the data, not armchair speculation. You can’t just take some averages and declare victory—you first have to make sure you can use a model where taking averages makes sense.

Once a model is chosen, you can do the calculations described by various statistical tests. First, you *infer* the hidden variables and *learn* the parameters in the model. That gives you a complete function from all the independent variables to the dependent one, from which you can extract an answer to any question you are interested in.^{4}^{4} In practice, I mix these stages. Instead of finding a complete model, which would include infinitely-many almost-irrelevant variables, I find an approximate model specialized to answering the questions I care about; and I mix the inference and learning steps with the model-finding. Of course, I only do some inference or learning when I know that whatever model it is that I will come up with, it will have that hidden variable to infer or that parameter to learn.

Variable inference and parameter learning are hard when there are complex hidden variables or weirdly-distributed random variables. But in the common case, all of the distributions are approximately normal, all of the hidden variables are easy to infer, and there aren’t that many parameters to learn. It’s this happy case that this tutorial is concerned with.

So let’s set off to find a model for my data.

## Handling tails and outliers

The first step to finding a model is to plot a histogram. Always start with a histogram.

Histograms are one of the most important tools in a statistician’s toolbox, because a histogram makes very few assumptions about the data before presenting it.^{5}^{5} It does assume that different requests are independent, which is actually false for my data, where the requests are sent in sequence. I’m just assuming that the order doesn’t matter too much; a better analysis would check this assumption. For almost any dataset, a histogram tells you something meaningful—contrast a linear regression (which only makes sense if the data is linear with normally-distributed noise) or an average (which only makes sense when the data is symmetric and has one hump and no tails).

Statisticians sometimes talk about *parametric* versus *non-parametric* methods; roughly, non-parametric methods make no assumptions about the model, while parametric methods make assumptions.^{6}^{6} Parametric methods are based on models with a fixed number of parameters, while non-parametric methods use as many variables as necessary to represent the data. So, parametric methods are meaningless when the model doesn’t fit the data, while non-parametric ones expand the model until it’s meaningful. Parametric methods are much more powerful—a parametric method will find differences that exist far more often—but require validating their assumptions. The reason I want a model is so I know what assumptions I can make. Until I have a model, I can’t make assumptions about the data, so I have to use non-parametric tools like histograms.

In the Python script, `hist` draws histograms:

```
>>> hist(GOODDATA)
Data, N=622122
#
#
#
#
#
#
#
#
```

This is a histogram with the time per request (the variable we are trying to model) on the horizontal axis and frequency in each of 50 buckets on the vertical axis.

It’s not much of a histogram. All of the data lands in the first bucket! Almost all, anyway—the `hist` function automatically sizes the horizontal axis to go from the smallest to the largest data point, so there’s definitely a data point or two in that far-right bucket. I’ve got *outliers*: data points that are very far from the bulk of the data.

For example, the longest request in my data set takes almost 12 seconds:

```
>>> max(fst(GOODDATA))
11877
```

even though about 75% of the data points take under a tenth of a second:

```
>>> len([v for v in fst(GOODDATA) if v < 100])
455802
```

So this data set has a long tail: there are values very far from the core, enough that it can’t be just luck. Tails are important because common statistical tools (like averages, standard deviations, and linear regressions) assume the data does not have tails. In order to use those tools, the tails need to be handled somehow.

There are three common types of tails that come up in computer science. Sometimes, the data is log-normally distributed—that is, the log of your data is normally distributed—which happens with random noise that has a multiplicative effect. Another type of tail is when you have a power-law distributed data set—that is, the probability of seeing a value \(x\) is proportional to \(x^{-\alpha}\)—which often happens in networks. Or, a tail might be due to an edge condition that makes a small number of data points behave very differently from the rest, which often happens in “exceptional” conditions like garbage collection, spotty networks, or machine failures.

Handling log-normal data is easy: take the log of the variable and move on with the analysis. Handling power-law distributed data is harder: switch to more sophisticated statistics.^{7}^{7} Like robust or non-parametric statistics, which I’ll use a bit of later in this tutorial. Handling edge conditions usually involves treating the edge condition as a hidden variable—more on hidden variables below. Let’s start by trying the first option:

```
>>> LOGDATA = [v[:] for v in GOODDATA]
>>> for v in LOGDATA:
...     v[0] = log(v[0] + 0.5)
```

The reason I used \(\log(x + \frac12)\) instead of \(\log(x)\) is because I have request times of zero in the data set. I imagine these are requests that took, say, half a millisecond, and the request time was rounded down. Adding half a millisecond “unrounds” those data points.

Now the histogram looks a lot better:

```
>>> hist(LOGDATA)
Data, N=622122
[ASCII histogram with several separate humps and no long tail]
```

This has some interesting structure: there are clearly many separate humps, and the long tail is gone. Let’s dive in.

## Splitting on observed variables

Why does this data have multiple humps? One possibility is that the request time is affected by some random variable drawn from a distribution that has multiple humps. That would be unusual, because there aren’t many natural multi-modal distributions. A more likely explanation is that the different humps are actually normally-distributed noise, shifted over depending on some other variable. So let’s try to find which variable that is.

Now, *of course* the time for a request is affected by whether or not the server is instrumented. But you *cannot* jump immediately into analyzing the difference that that causes. You must always look for the largest effect first. I’ll only analyze the difference due to instrumentation after I’ve analyzed the differences due to any other variable that has more of an effect on request time. This is necessary to avoid variants of Simpson’s paradox. As soon as you discover a large effect, you must control for it in every smaller effect you look at.^{8}^{8} More precisely, if you are ignoring variables with a larger effect than the parameter you are interested in, you are making the assumption that the parameter is independent of this variable. This may or may not be true, and is an assumption you should be uncomfortable making. So any investigation of the effects of instrumentation must control for the effects of any more-important variables. Are there any?

I have 4 observed variables, each of which is a discrete category. I’ll analyze the effect of each variable by splitting the data set across all the possible values of one variable, and visually comparing the histograms for each value. Visual comparison might seem unscientific, but it’s as much a part of data analysis as anything else. Visual comparison is *weak* (sometimes a true difference won’t be visually obvious), but when you see an obvious visual difference, you can skip a statistical test.
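Splitting a data set across all values of one variable amounts to a group-by on that column. A minimal sketch (the real `splithist` presumably does something like this and then draws a histogram per group):

```python
from collections import defaultdict

def split_on(data, col):
    # Group rows by the value in the given column.
    groups = defaultdict(list)
    for row in data:
        groups[row[col]].append(row)
    return dict(groups)
```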

My Python script defines a `splithist` function for splitting a data set on one variable, and drawing histograms of each possibility. For example, I can split on whether or not the server is instrumented:

```
>>> splithist(LOGDATA, 3)
[3]=inst, N=310737
[ASCII histogram]
[3]=base, N=311385
[ASCII histogram]
```

Or, I can split on the experiment ID or thread ID (results hidden, since there are lots of categories):

```
>>> splithist(LOGDATA, 4)
...
>>> splithist(LOGDATA, 2)
...
```

None of those sets of histograms have large differences—they change a bit, but never dramatically. On the other hand, when I split on the request type, the histograms *do* differ dramatically:

```
>>> splithist(LOGDATA, 1)
[1]=Inbox request, N=69800
[1]=Post reply, N=69590
[1]=Login Request, N=500
[1]=Get reply page, N=69684
[1]=Send reply, N=722
[1]=post reply page, N=722
[1]=Send message, N=67830
[1]=Get topic page, N=69737
[1]=add new topic, N=67615
[1]=Get message page, N=722
[1]=Get top posts page, N=69740
[1]=New message page, N=67830
[1]=Request new topic, N=67630
[ASCII histograms omitted, one per request type]
```

These look *very* different: some have two humps and some have one, and all the humps are in different places. Looks like request type is the variable that has the biggest effect on how long a request takes.

This is, of course, obvious. Different request types mean different database operations, different lines of code executed, and so on. This *should* have a large effect on how long a request takes.

Now that I’ve discovered a large effect, I’ve got to control for the request type in the rest of the analysis. That means 1) doing the rest of the analysis separately for each request type and only aggregating the results at the end, and 2) checking every other variable for whether it affects the request type. You can do this second step with a Χ² test or just by eyeballing the ratios, but my experiment chose request types at random, so I know that no variable can affect the request type and so I can skip the second of those steps.

Since the analysis must be done separately for each request type, I’ll pick one and walk through it. Let’s say I pick the “Request new topic” request first:

```
>>> RNTDATA = select(LOGDATA, 1, "Request new topic")
```
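`select`, as used here, just filters a table on one column—another sketch of a helper the post assumes but doesn’t show:

```python
def select(data, col, value):
    # Keep only the rows whose given column equals the value.
    return [row for row in data if row[col] == value]
```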

## Inferring hidden variables

The request new topic data still shows some interesting non-normal structure:

```
>>> hist(RNTDATA)
Data, N=67630
[ASCII histogram with two humps]
```

So let’s do what I did above and look for a variable that explains these two humps.

```
>>> splithist(RNTDATA, 2)
...
>>> splithist(RNTDATA, 3)
...
>>> splithist(RNTDATA, 4)
...
```

Unfortunately, none of the observed variables explain the two-hump phenomenon. This suggests that either our random noise is chosen from some strange distribution with two humps (still unlikely) or there’s a *hidden variable* that causes two humps. It’s easy to imagine a hidden variable that would qualify: whether the request hit the cache, how much database contention there was, whether the garbage collector ran. Hidden variables seem like a better explanation than some weird noise distribution.

Handling a hidden variable is harder than handling an observed variable. What I must do is *infer* the value of the hidden variable for each data point, effectively adding another column to the data set. Sometimes this inference involves its own modeling and has to be done probabilistically.^{9}^{9} For example, you could use a Gaussian mixture model, where you learn the parameters with Expectation-Maximization. Here the inference is easy because the two humps are far apart; and I don’t have to do the inference probabilistically because it’s obvious which point is in which hump. I’d say the midpoint between the humps is at a log request time of 2.5, so I split the data into a “fast” and a “slow” cluster based on whether the log request time is more or less than 2.5.

```
>>> FAST, SLOW = partition(RNTDATA, lambda v: v[0] <= 2.5)
>>> len(FAST), len(SLOW)
(30681, 36949)
```

Since I’ve again split the data set, I again must separately analyze the fast and the slow cluster. I also need to check whether instrumentation, or any other variables I later incorporate into my model, affect this hidden variable. To do this, let’s look at the number of instrumented and uninstrumented data points in each cluster:

```
>>> [len(x) for x in partition(FAST, lambda v: v[3] == "base")]
[15772, 14909]
>>> [len(x) for x in partition(SLOW, lambda v: v[3] == "base")]
[18043, 18906]
```

It’s easy to see that the effect of instrumentation on cluster is very small.^{10}^{10} You can use a Χ² test to check if the difference is statistically distinguishable, and it is. This is interesting—it sounds like instrumentation has a small effect on contention or caching or garbage collection or whatever it is that this hidden variable represents. I could also check other request types with two humps to better quantify the effect; or, since the effect is small, I could ignore it. I don’t need to handle it until I’ve looked at all larger effects.^{11}^{11} Why not control for smaller effects than the one we care about, as well as larger effects? There are infinitely many tiny effects. In principle, each contributes bias (if it’s correlated with the effect of interest) and error. But in practice, the effect sizes decay quickly, so once you have found the effect of interest, all remaining effects will add negligible bias and error.
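The Χ² test mentioned in the footnote is easy to run by hand on a 2×2 table like this one. A sketch (the names are mine; 3.84 is the 95% critical value for one degree of freedom):

```python
def chi2_2x2(a, b, c, d):
    # Chi-squared statistic for a 2x2 contingency table laid out as
    #   [a b]   e.g. a = fast & base, b = fast & inst,
    #   [c d]        c = slow & base, d = slow & inst.
    n = a + b + c + d
    chi2 = 0.0
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        expected = row * col / n  # count expected under independence
        chi2 += (obs - expected) ** 2 / expected
    return chi2
```

Plugging in the cluster counts above gives a statistic around 44, far past 3.84, which matches the footnote’s claim that the small difference is statistically distinguishable.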

Now if I check the histograms of each cluster separately, I find a distribution that looks approximately normal^{12}^{12} The gaps in the first histogram are due to the discrete nature of the request time variable. If the request times were measured to the nanosecond, those gaps would disappear.:

```
>>> hist(FAST)
Data, N=32883
[ASCII histogram]
>>> hist(SLOW)
Data, N=34747
[ASCII histogram]
```

These distributions both look approximately normal, which is a reasonable distribution for some sort of noise. Let’s see if our other observed variables have any effect:

```
>>> splithist(FAST, 2)
...
>>> splithist(FAST, 3)
...
>>> splithist(FAST, 4)
...
```

Of these, only instrumentation seems to matter (and test ID, but only because it affects instrumentation). So let’s split the data again on that variable.

Just to summarize what I’ve done, I’ve split on three variables (two observed variables and one hidden variable) in decreasing order of effect size, until I reached the variable I cared about—instrumentation. I can now stop,^{13}^{13} This is not quite accurate. If some other variable looks to have a slightly-smaller, but similar effect size, it should also be analyzed—it is an important source of error and possibly bias. In most cases this will not happen: the effect sizes will decay rapidly, and the next-largest effect after the one of interest will be orders of magnitude smaller. since all of the remaining variables look to have a much smaller effect and so won’t change our results.

## Learning parameters

So for the “request new topic”, fast, instrumented data points, I’ve finally boiled my model down to a simple parametric model: a normal distribution of log-request-time. This gives us two parameters (the average and the standard deviation), which I now want to compute. I do this by just computing the average and standard deviation of the sample, since these are good estimators of the average and standard deviation of the population.^{14}^{14} I use the sample variance formula, which is close to correct, to compute the standard deviation. Using the wrong formula to compute standard deviation only really matters when you have fewer than, say, 100 data points or if you’re looking for a very small effect. Here are the numbers:

```
>>> fbavg = avg(fst(select(FAST, 3, "base")))
1.6850396327096107
>>> fbstdev = stdev(fst(select(FAST, 3, "base")))
0.2965868136176025
>>> fiavg = avg(fst(select(FAST, 3, "inst")))
1.8372130591729217
>>> fistdev = stdev(fst(select(FAST, 3, "inst")))
0.2751409150499033
```
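The `avg` and `stdev` helpers these transcripts assume are one-liners; a sketch, using the sample-variance formula mentioned in the footnote:

```python
def avg(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    # Sample standard deviation: divide by n - 1, not n.
    m = avg(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
```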

It looks like the standard deviation didn’t change between the base and instrumented servers, but the average did. This is consistent with instrumentation having additive overhead (shifting the distribution to the right) on log-request-time, or a multiplicative overhead to normal request time.

Let’s check that the same holds for the slow cluster:

```
>>> avg(fst(select(SLOW, 3, "base")))
3.1768947398167424
>>> stdev(fst(select(SLOW, 3, "base")))
0.3267632224745982
>>> avg(fst(select(SLOW, 3, "inst")))
3.205633511701644
>>> stdev(fst(select(SLOW, 3, "inst")))
0.36058810742469105
```

The average changes by a smaller amount than for the fast cluster (`0.15` for the fast cluster, `0.029` for the slow cluster).

This is the average for my data, but that might be a fluke of the data. I want to know how different an average I would get if I computed that average from a different set of data points generated by the same model. It turns out that the average of a sample from a normal distribution is itself normally distributed, so I want to compute that distribution’s standard deviation, often called the standard error. It is the standard deviation of the distribution you’re drawing samples from, divided by the square root of the number of samples:

```
>>> fbse = fbstdev / sqrt(len(select(FAST, 3, "base")))
0.0023616114946958225
>>> fise = fistdev / sqrt(len(select(FAST, 3, "inst")))
0.00225336175973499
```

I want to know how much the average changes due to instrumentation. This means I want the difference between two normally-distributed variables. That difference itself is normally distributed, so it has average value

```
>>> fiavg - fbavg
0.152173426463311
```

and its standard deviation can be computed from this useful formula:

```
>>> sqrt(fise**2 + fbse**2)
0.00326417647686448
```

So I can say that for the fast cluster, instrumentation adds an overhead of `0.15 ± 0.0033` in log-request-time.^{15}^{15} It’s better statistics to give confidence intervals, not deviations, since confidence intervals make sense for more types of models. It’s probably a habit I should get into, but not something I do right now. Since the average is so much (fifty times!) larger than the standard error, I can confidently say the difference is due to an actual effect from instrumentation, not from noise. (If the difference were only, say, two times its standard error, then I might wonder if there was a difference at all.)

For the slow cluster, the difference is `0.029 ± 0.0036`. Despite the difference being much smaller, it’s still perfectly distinguishable from random noise.

## Aggregating data

There are 13 unique request types in the data set. Of these, seven have two clusters and six have one cluster. I’ll go ahead and analyze each cluster separately; each cluster looks normally distributed, and the standard errors are small:

| Request Type | Cluster | Difference | SE |
|---|---|---|---|
| Post reply | | 0.061 | 0.0089 |
| Login Request | | 0.25 | 0.057 |
| Get reply page | | 0.085 | 0.0032 |
| Send message | | 0.069 | 0.0072 |
| add new topic | | 0.086 | 0.0088 |
| Get message page | | 0.16 | 0.037 |
| Inbox request | < 1.5 | 0.085 | 0.0065 |
| | > 1.5 | -0.034 | 0.0045 |
| Send reply | < 5.5 | -0.58 | 0.034 |
| | > 5.5 | 0.032 | 0.035 |
| post reply page | < 2.6 | 0.13 | 0.030 |
| | > 2.6 | 0.0015 | 0.035 |
| Get topic page | < 4.5 | 0.22 | 0.0075 |
| | > 4.5 | -0.010 | 0.012 |
| Get top posts page | < 2.5 | 0.19 | 0.0043 |
| | > 2.5 | 0.037 | 0.0035 |
| New message page | < 1.5 | 0.084 | 0.0067 |
| | > 1.5 | -0.018 | 0.0045 |
| Request new topic | < 2.5 | 0.15 | 0.0033 |
| | > 2.5 | 0.029 | 0.0036 |

The fast cluster for `Send reply` is an obvious outlier; all other clusters have similar differences and standard errors. If you compare the difference to the standard error, you see that most of these effects are very robust. I’ll record the outlier to investigate later (what’s so special about sending replies?) but for the rest of the analysis I’m throwing out the outlier.^{16}^{16} You can think of discarding outliers as splitting on yet another hidden variable.

I want a single number for the overhead of the dynamic analysis, not a number per request type, so I need to aggregate this data. Of course, different request types might be better or worse for the dynamic analysis, but I imagine there is some commonality as well, and I want to measure that commonality.

So like before, I start by drawing a histogram; note that I convert the log-request-time difference to percentage overhead with `exp`:

```
>>> ovpct = [exp(v) - 1 for v in diffs]
>>> histogram(ovpct)
[ASCII histogram]
```

There’s not enough data here to tell whether the data is normal or not. In the absence of evidence of normality, I use the median instead of the average. The median makes sense for pretty much any distribution; I can’t promise that with more data this distribution wouldn’t reveal tails^{17}^{17} In fact, there’s already this suspicious-looking outlier. Why throw out the outlier before doing the analysis, then? Because I suspect that the outlier’s large speedup isn’t due to some natural variation between request types, but is due to some weird interaction with the garbage collector or even a bug. Basically, I think there are two different hidden variables that change the overhead for a request type—one which causes the overhead to be very large and negative, and another that just gently nudges it. I want to split on the first variable and aggregate on the second. Now, I’m cautiously using the median, a non-parametric statistic that handles outliers very well, so tossing in the outlier wouldn’t change things much. But the outlier is suspicious and isolating it feels more likely to represent any future data I collect. so I don’t know if the average is even meaningful.

```
>>> med(ovpct)
0.0876288938088261
```
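`med` is the usual textbook median; a sketch of the helper the script presumably defines:

```python
def med(xs):
    # Median: middle element of the sorted data,
    # or the mean of the two middle elements for even n.
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2
```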

I also want to know how much the overhead percentage varies, which is mostly variation in request type.^{18}^{18} In principle, the fact that the overheads have standard errors attached means I need to take that variation into account, too. But here, almost all of the standard errors are tiny, so the correction would be minuscule. A good way to do that is with a confidence interval. A 95% confidence interval, for example, should include the real answer in 95% of replications of the same experiment.^{19}^{19} There’s a reason the Wikipedia entry for “confidence interval” has a section on “Misunderstandings”. In “frequentist” statistics you compute the error of a statistical process, not the error of a statistical conclusion. Once you have collected data and computed a confidence interval, it either contains the true value or it does not; but the process of computing a 95% confidence interval has a 95% chance of successfully producing an interval that contains the true value. “Bayesian” statistics instead produce credibility intervals, which must contain the true value with 95% certainty. I tend to use frequentist statistics, but only because that’s what I learned growing up; I hear good things about Bayesian statistics. Calculating confidence intervals for a median, though, is a bit of a chore.^{20}^{20} See this page on how to do it; for my data, I get the same interval as the first and third quartile that I use instead, but past around 24 data points the confidence interval is smaller. Instead I usually just talk about the first and third quartile:

```
>>> ovpct = sorted(ovpct)
>>> ovpct[round(19*.25)]
0.03251750530511832
>>> ovpct[round(19*.75)]
0.16183424272828306
```

Note that the only reason I feel comfortable aggregating these numbers is that with 19 clusters, there’s enough data that summarizing it makes sense. I mentioned at the beginning that there were actually three different projects evaluated; I would not aggregate the results across those three projects, because three data points don’t make a distribution. Just give me the three data points.^{21}^{21} This is a general point: if not for my small mind and limited time, I’d always rather have the data instead of a summary. Hence a scatter plot or histogram is preferred over a summary statistic, unless visualization can’t see a fine distinction that statistics can, or the data is too complex to visualize or tabulate.

## Presenting results

Now that I’ve measured the effect of overhead on request time, I want to describe it in a sentence or two for the paper. It would be great to lay out the entire analysis (like I just did) or even just the model used, but in practice the statistical analysis is often not an important aspect of a paper (like here, it is just a sanity-check), so I need to summarize just the key ideas. What to summarize depends on the paper, the author’s style, and so on, but I’d recommend listing the resulting parameter, then some measure of deviation (like standard error or a confidence interval), then outliers and cleanup steps, then the other variables controlled for, and then the parametric model used, in that order of priority. The statistics section is often one of the first to be cut, especially in papers where it is not the focus. Sad but true.

So in a paper’s abstract, I’d write something like, “Our analysis added a median overhead of 8.8%.” Then in the evaluation section, I’d write, “Aggregating 19 normally-distributed clusters across 13 request types, we find that our dynamic analysis added a median overhead of 8.8% (discarding a faster outlier) for a log-normal distribution of request times (first quartile: 3.2%; third quartile: 16.2%).” With another sentence of space, I’d look into that outlier and see if I could understand why that request was so much faster with the dynamic analysis enabled. Of course, my longer explanation is precisely the sort of sentence academics are made fun of for writing: it’s full of jargon, it’s hard to parse, and it’s really only useful if you want to replicate the analysis. But that’s the result of squeezing a lot of information into one sentence. Given two paragraphs, I could lay out the analysis conversationally, but who has two paragraphs to spare?

Sometimes people add gratuitous *p* values to papers—they really want to say \(p < 0.05\). Make sure the *p* value makes sense! In arbitrarily replicable experiments, just collect enough data that everything is obviously significant. And *p* values only make sense when you’re comparing a discrete variable for effect/no effect. No one’s wondering whether instrumentation has *an* effect—it has to, it’s running more code. I want to know the *effect size*: how much slower the program got. A *p* value doesn’t tell me that.

## Conclusion

With the right approach, statistics isn’t scary. In this tutorial I analyzed a data set to determine how much overhead a dynamic analysis added. Some elbow grease in looking at distributions and splitting on appropriate variables produced a robust, easy-to-interpret statistical result. Do the same for your papers, and you will find statistics a powerful way to see the world.

*Acknowledgments*: Thank you to Daryl Zuniga for his extensive comments on drafts of this post, and to Doug Woos, James Wilcox, and the PLSE lab for reading early versions and finding many typos.

^{1}

The cost of splitting data sets is weaker statistics: it’ll be harder to see small effects or make fine distinctions.

^{2}

You could test this statistically with a Χ² test, but I’m trying to avoid using computations when the results are obvious.

^{3}

What’s the difference between a random variable and a hidden variable? A hidden variable is some aspect of the world that you could have measured. A random variable is inherently random and unpredictable. For example, the bias of a coin is a hidden variable, which you might measure by flipping the coin many times. But the result of each individual flip is random. This is a fluid, human distinction; a random variable is just a hidden variable that is independent of everything else and so impossible to infer.

^{4}

In practice, I mix these stages. Instead of finding a complete model, which would include infinitely many almost-irrelevant variables, I find an approximate model specialized to answering the questions I care about; and I mix the inference and learning steps with the model-finding. Of course, I only do inference or learning early when I know that whatever model I eventually come up with will have that hidden variable to infer or that parameter to learn.

^{5}

It does assume that different requests are independent, which is actually false for my data, where the requests are sent in sequence. I’m just assuming that the order doesn’t matter too much; a better analysis would check this assumption.

^{6}

Parametric methods are based on models with a fixed number of parameters, while non-parametric methods use as many variables as necessary to represent the data. So, parametric methods are meaningless when the model doesn’t fit the data, while non-parametric ones expand the model until it’s meaningful.

^{7}

Like robust or non-parametric statistics, which I’ll use a bit of later in this tutorial.

^{8}

More precisely, if you are ignoring variables with a larger effect than the parameter you are interested in, you are making the assumption that the parameter is independent of those variables. This may or may not be true, and is an assumption you should be uncomfortable making.

^{9}

For example, you could use a Gaussian mixture model, where you learn the parameters with Expectation-Maximization.
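For the curious, EM for a two-component 1-D Gaussian mixture fits in a few dozen lines (a sketch under my own naming, not code from the post; a real analysis would also check convergence rather than run a fixed number of iterations):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture by Expectation-Maximization."""
    xs = sorted(xs)
    half = len(xs) // 2
    # crude initialization: split the sorted data in half
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [weight[k] * gaussian_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                                     for r, x in zip(resp, xs)) / nk) or 1e-6
            weight[k] = nk / len(xs)
    return weight, mu, sigma
```

On two-humped data like the request-time histograms discussed above, the fitted means land near the two humps.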

^{10}

You can use a Χ² test to check if the difference is statistically distinguishable, and it is. This is interesting—it sounds like instrumentation has a small effect on contention or caching or garbage collection or whatever it is that this hidden variable represents. I could also check other request types with two humps to better quantify the effect; or, since the effect is small, I could ignore it.

^{11}

Why not control for smaller effects than the one we care about, as well as larger effects? There are infinitely many tiny effects. In principle, each contributes bias (if it’s correlated with the effect of interest) and error. But in practice, the effect sizes decay quickly, so once you have found the effect of interest, all remaining effects will add negligible bias and error.

^{12}

The gaps in the first histogram are due to the discrete nature of the request time variable. If the request times were measured to the nanosecond, those gaps would disappear.

^{13}

This is not quite accurate. If some other variable looks to have a slightly-smaller, but similar effect size, it should also be analyzed—it is an important source of error and possibly bias. In most cases this will not happen: the effect sizes will decay rapidly, and the next-largest effect after the one of interest will be orders of magnitude smaller.

^{14}

I use the sample variance formula, which is close to correct, to compute the standard deviation. Using the wrong formula to compute standard deviation only really matters when you have fewer than, say, 100 data points or if you’re looking for a very small effect.
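To spell out the difference, here are both formulas side by side (a sketch; the only change is dividing by n versus n − 1, the latter being Bessel’s correction):

```python
import math

def population_sd(xs):
    """Standard deviation dividing by n (biased for samples)."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    """Standard deviation with Bessel's correction (divide by n - 1)."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
```

At 100 data points the two differ by about half a percent, which is why the choice rarely matters for data sets of that size.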

^{15}

It’s better statistics to give confidence intervals, not deviations, since confidence intervals make sense for more types of models. It’s probably a habit I should get into, but not something I do right now.

^{16}

You can think of discarding outliers as splitting on yet another hidden variable.

^{17}

In fact, there’s already this suspicious-looking outlier. Why throw out the outlier before doing the analysis, then? Because I suspect that the outlier’s large speedup isn’t due to some natural variation between request types, but is due to some weird interaction with the garbage collector or even a bug. Basically, I think there are two different hidden variables that change the overhead for a request type—one which causes the overhead to be very large and negative, and another that just gently nudges it. I want to split on the first variable and aggregate on the second. Now, I’m cautiously using the median, a non-parametric statistic that handles outliers very well, so tossing in the outlier wouldn’t change things much. But the outlier is suspicious and isolating it feels more likely to represent any future data I collect.

^{18}

In principle, the fact that the overheads have standard errors attached means I need to take that variation into account, too. But here, almost all of the standard errors are tiny, so the correction would be minuscule.

^{19}

There’s a reason the Wikipedia entry for “confidence interval” has a section on “Misunderstandings”. In “frequentist” statistics you compute the error of a statistical process, not the error of a statistical conclusion. Once you have collected data and computed a confidence interval, it either contains the true value or it does not; but the process of computing a 95% confidence interval has a 95% chance of successfully producing an interval that contains the true value. “Bayesian” statistics instead produce credibility intervals, which must contain the true value with 95% certainty. I tend to use frequentist statistics, but only because that’s what I learned growing up; I hear good things about Bayesian statistics.
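The frequentist reading, that the *procedure* succeeds about 95% of the time, can be checked by simulation. A sketch (function names are mine, and I use a normal-approximation interval for a mean, the simplest case):

```python
import random

def mean_ci95(xs):
    """Normal-approximation 95% confidence interval for the mean."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    half = 1.96 * sd / n ** 0.5  # 1.96 = 97.5th percentile of N(0, 1)
    return mean - half, mean + half

def coverage(true_mean=0.0, n=50, reps=2000):
    """Fraction of replications whose interval contains the true mean."""
    hits = 0
    for _ in range(reps):
        xs = [random.gauss(true_mean, 1) for _ in range(n)]
        lo, hi = mean_ci95(xs)
        hits += lo <= true_mean <= hi
    return hits / reps
```

Running `coverage()` gives a number close to 0.95 (slightly under, since the normal approximation is a bit narrow at n = 50), illustrating that the 95% is a property of the repeated procedure, not of any one interval.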

^{20}

See this page on how to do it; for my data, I get the same interval as the first and third quartile that I use instead, but past around 24 data points the confidence interval is smaller.

^{21}

This is a general point: if not for my small mind and limited time, I’d always rather have the data than their summary. Hence a scatter plot or histogram is preferable to a summary statistic, unless visualization can’t see a fine distinction that statistics can, or the data is too complex to visualize or tabulate.
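Since the tutorial leans on ASCII-art histograms throughout, here is roughly what such a helper might look like (a sketch; the actual 150-line script mentioned at the start isn’t reproduced here):

```python
def ascii_histogram(xs, bins=10, width=50):
    """Print an ASCII-art histogram, one row per bin."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / bins or 1.0  # avoid zero-width bins for constant data
    counts = [0] * bins
    for x in xs:
        # clamp the top edge into the last bin
        i = min(int((x - lo) / step), bins - 1)
        counts[i] += 1
    peak = max(counts)
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        print(f"{lo + i * step:10.3f} | {bar} ({c})")
```

A quick look at such a histogram is often all the “statistics” a well-separated effect needs.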