Paper Statistics with Git

POPL’15 had its deadline a week and a half ago, and I submitted a paper there about Hoot, a new framework for building eventually consistent distributed systems. This is the first grad school paper I’ve submitted as a first author, so I’m excited and exhausted: excited, because I’m hoping the paper is accepted, and exhausted, because writing it was quite a lot of work. It’s the second one that I want to focus on now.

Our paper (by me, Zach Tatlock, and Mike Ernst) reads like an ordinary academic paper: there’s an abstract, an introduction, an overview, three technical sections (framework, consistency model, and synchronization algorithm), two applications sections (models and applications), a related work section, a future work section, and a conclusion.

But listing the sections like that hides the fact that some were significantly more work than others. The technical and application sections, though long, were relatively easy to write, because we understood their content (the algorithm we developed) even before we started writing. The introduction and overview, on the other hand, have to describe not just what we do, but why, and why doing it that way (and at all) is the right thing to do. So the abstract, introduction, and overview saw numerous revisions. In fact, the introduction was rewritten on the day of the deadline.

I was curious to know: how stark is the difference between the technical and introductory sections of the paper? Luckily, every edit to the paper was stored in Git, so I could answer this with a little bit of tinkering at the shell.

Browsing through commits

To understand which sections were most heavily edited, I just want to know how much text was changed in each revision of each section. Now, LaTeX and Git can interact poorly with small edits: if each paragraph is hard-wrapped, the paragraph wraps differently any time even a single word is changed, leading to huge and unreadable diffs. I have my own strange indentation style for TeX: I start a new line for every sentence, and break and indent that line at every phrase. When Git computes a diff, only a single phrase is changed for small edits, so commits can be reviewed and understood easily. This quirk lets me use the number of lines changed as a proxy for the amount of editing, and the peculiar indentation style keeps that approximation accurate.

We used the subfiles LaTeX package to manage this paper, so every section lived in its own file. Git can also compute the diff of a file between any two commits. Using this ability, I need to:

Compute all commits that changed any of the paper’s files.
Compute which files each commit changed.
Total the lines added / removed for each of these file / commit pairs.
Add up the lines changed per file over all commits.

Luckily, Git has tools to get all of the information I need, and shell utilities can do any aggregation.
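The four steps above can be strung together roughly as follows. This is a minimal sketch, not the script used for this post: the function name paper_stats and its output format are invented for illustration, it assumes file names contain no whitespace, and it assumes the starting commit has a parent. It counts only lines inside diff hunks.

```shell
#!/bin/sh
# Sketch only: paper_stats and its "<file> +<added> -<removed>" output are
# made up for illustration. Assumes file names contain no whitespace and
# that the start commit has a parent (as "$start^" requires).
paper_stats() {
    start=$1
    # Step 1: commits from $start onward, oldest first.
    git rev-list --reverse "$start^.." |
    while read -r commit; do
        # Step 2: files this commit touched (merge commits list nothing).
        git diff-tree --name-only --no-commit-id -r "$commit" |
        while read -r file; do
            # Step 3: count +/- lines inside hunks, skipping the diff header.
            git diff "$commit^" "$commit" -- "$file" |
            awk -v f="$file" '
                /^@@/            { in_hunk = 1; next }
                in_hunk && /^\+/ { added++ }
                in_hunk && /^-/  { removed++ }
                END              { print f, added + 0, removed + 0 }'
        done
    done |
    # Step 4: sum the per-commit counts for each file.
    awk '{ a[$1] += $2; r[$1] += $3 }
         END { for (f in a) print f, "+" a[f], "-" r[f] }'
}
```

Calling paper_stats with the hash of the first relevant commit, from inside the repository, would print one line per file in no particular order.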

Computing the commits that changed our paper’s files

Git’s log subcommand is a fantastic, feature-full tool for browsing through commits in a Git repository. It can take a file name and list all the commits that modified that file:

git log --follow -- paper.tex

Unfortunately, this doesn’t quite work in my case because paper.tex moved around quite a bit as we wrote. It started off in paper.tex (which used to hold a different, unrelated document), then moved to popl14/paper.tex, and then to popl15/paper.tex when I realized that I had the year wrong. Git’s not great at tracking moved files, even with --follow, so I used the above command and manually located the first commit that related to this paper. I’ll use all commits past this point, and filter out commits that don’t modify the paper:

git log "$START_COMMIT^.." --reverse --pretty=oneline | cut -d' ' -f1

The cut command extracts just the commit hash; the --reverse puts the commits in chronological order.
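To see what the cut stage does without needing a repository, here is a small sketch on two made-up lines in the shape that --pretty=oneline prints. The hashes, messages, and function names are invented for illustration.

```shell
# Two invented lines in the shape `git log --pretty=oneline` prints:
# a full commit hash, one space, then the subject line.
sample_log() {
    printf '%s\n' \
        '1111111111111111111111111111111111111111 Start POPL paper' \
        '2222222222222222222222222222222222222222 Rewrite the introduction'
}

# Split each line on spaces and keep only the first field, the commit hash.
first_field() {
    cut -d' ' -f1
}

sample_log | first_field
```

Each output line is then a bare hash that can be fed to the later steps.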

Computing the files changed by each commit

Now that I have the commits I need, I need to get the files changed by each. This is exactly what the diff-tree command does.

git diff-tree --name-only --no-commit-id -r "$COMMIT"

diff-tree normally prints out a lot of information, including the identifiers for each blob; the --name-only and --no-commit-id flags restrict the output to just the file names.

Totaling the lines added and removed for each file and commit

I spent a while looking for a built-in command to compute the lines added and removed for each file for a commit, but didn’t find such a tool. It’s relatively simple to re-create using the Unix shell tools.

Given a commit and a file name,

git diff "$COMMIT^" "$COMMIT" -- "$FILE"

returns the complete diff for a file at a commit. Diffs have a peculiar file format, where the first character of each line describes whether the line was added, removed, or is just there for context.¹ I can strip off everything except the first character to get just the crucial “in or out” information, then sort and count lines to get add/remove counts:

git diff "$COMMIT^" "$COMMIT" -- "$FILE" |
cut -c1 | grep '^[+-]' | sort | uniq -c | tac | tr -d ' +\-'
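One caveat with counting by first character: in a unified diff, the "--- a/…" and "+++ b/…" header lines also begin with - and +, so they are counted too. The sketch below uses a hand-written diff (the file name and contents are invented, as are the function names) to compare first-character counting with counting only inside hunks. The data later in this post appears consistent with one extra line on each side, for example a file with 389 current lines showing +390 and -1.

```shell
# A small hand-written unified diff; the file name and contents are invented.
sample_diff() {
    cat <<'EOF'
diff --git a/intro.tex b/intro.tex
index 1111111..2222222 100644
--- a/intro.tex
+++ b/intro.tex
@@ -1,3 +1,4 @@
 Distributed systems are hard.
-We present a framework.
+We present Hoot,
+  a new framework.
 It is eventually consistent.
EOF
}

# First-character counting: every line starting with + or -, headers included.
count_first_char() {
    awk '/^\+/ { a++ } /^-/ { r++ } END { print a + 0, r + 0 }'
}

# Hunk-only counting: ignore everything before the first @@ line.
count_in_hunks() {
    awk '/^@@/ { h = 1; next }
         h && /^\+/ { a++ }
         h && /^-/  { r++ }
         END { print a + 0, r + 0 }'
}

sample_diff | count_first_char   # prints "3 2" (added, removed)
sample_diff | count_in_hunks     # prints "2 1"
```

For relative comparisons between sections the extra header lines matter little, since they add at most one line per side for each file and commit. Git's own --numstat option to git diff also reports added and deleted line counts per file, which could serve as a cross-check.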

Totaling a file’s changes over all commits

To get a single file’s changes over all commits, we can just filter to the lines that describe that file, and use an awk script to add up the additions and removals:

grep "[^/]$FILE\>" | awk '{p+=$2; m+=$3} END {print "+"p, "-"m}'

There’s a bit of a strange quirk in that line: I’m searching not for $FILE, but for $FILE preceded by anything except a slash. This is to count paper.tex separately from popl15/paper.tex and popl14/paper.tex.
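Here is a sketch of that filter on invented records. The post does not show the intermediate format, so the layout "<commit>:<path> <added> <removed>" is an assumption made for illustration, as are the counts and the function names.

```shell
# Invented per-commit records; the intermediate format is an assumption:
# "<commit>:<path> <added> <removed>".
sample_records() {
    cat <<'EOF'
c1:paper.tex 120 0
c2:paper.tex 15 9
c3:popl14/paper.tex 229 0
c4:popl15/paper.tex 40 12
c5:popl15/paper.tex 7 3
EOF
}

# Sum added ($2) and removed ($3) over records that mention the path $1
# where it is not preceded by a slash and ends at a word boundary.
total_for() {
    grep "[^/]$1\>" | awk '{p+=$2; m+=$3} END {print "+"p, "-"m}'
}

sample_records | total_for paper.tex          # prints "+135 -9"
sample_records | total_for popl15/paper.tex   # prints "+47 -15"
```

Without the [^/] guard, the search for paper.tex would also match the popl14 and popl15 records and fold all three files into one total.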

Cleaning up the Data

The data comes out like so:

File	Current	Added	Removed
popl15/macros.tex	38	+224	-186
popl15/paper.tex	98	+419	-321
popl15/abstract.tex	45	+195	-150
popl15/intro.tex	139	+1004	-865
popl15/overview.tex	382	+1912	-1530
popl15/framework.tex	389	+390	-1
popl15/consistencymodel.tex	173	+423	-250
popl15/algorithm.tex	175	+458	-283
popl15/models.tex	146	+192	-46
popl15/applications.tex	246	+469	-223
popl15/relatedwork.tex	231	+550	-319
popl15/futurework.tex	26	+27	-1
abstract.tex	?	+100	-100
macros.tex	?	+8	-8
paper.tex	?	+900	-900
popl14/abstract.tex	?	+73	-73
popl14/intro.tex	?	+158	-158
popl14/macros.tex	?	+40	-40
popl14/overview.tex	?	+299	-299
popl14/paper.tex	?	+229	-229
popl15/discussion.tex	?	+53	-53
popl15/evaluation.tex	?	+413	-413
popl15/example.tex	?	+122	-122
popl15/formalism.tex	?	+1072	-1072

The question marks in the “current size” column are for now-deleted files. I include the current size because I expect longer sections to have more rewrites.

The top-level paper.tex file, which you’d expect to contain only links to subfiles, looks like it has a lot of editing going on. This is because I didn’t move things out to subfiles until somewhat late into the paper. To fix that problem, I went through each of the commits editing paper.tex manually, and credited their changes to the section each one touched.

Some sections were once two parts and became one (like the Example and Overview). Also, some files moved around in ways that Git didn’t track. So the totals above need to be recombined into per-section, not per-file, counts.

Conclusion

This produces:

Section	Current	Added	Multiplier
[macros]	38	+297	7.8
[paper]	98	+484	4.9
Abstract	45	+392	8.7
Introduction	139	+1393	10.0
Overview	382	+2562	6.7
Framework	389	+1462	3.8
Consistency Model	173	+423	2.4
Algorithm	175	+458	2.6
Models	146	+247	1.8
Applications	246	+827	3.4
Related Work	231	+550	2.4
Future Work	26	+80	3.1

As expected, the biggest churn was seen in the introduction, abstract, and overview, and also (strangely) the macro file. The least churn was in the consistency model through related work sections, with “Models”, which describes ways to embed some earlier work into our approach, having the least churn.

Overall, the numbers support my impression that we wrote many, many introductions and overviews, while the remainder of the paper sailed through easily. What I didn’t realize was just how stark the difference was. For example, the abstract, introduction, and overview together contain 566 lines, which required 4347 lines of editing to produce (a multiplier of about 7.7×). The rest of the paper is 1386 lines, and required only 4047 lines of editing to produce (a multiplier of 2.9×). Maybe seasoned paper-writers know this, but I’m surprised: we spent less time writing and editing the technical 71% of the paper than we did editing the non-technical 29%.
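The group totals can be recomputed from the per-section table. The sketch below copies the Current and Added columns (leaving out the [macros] and [paper] rows, and writing section names without spaces so awk can split fields); the function names are invented for illustration.

```shell
# Rows copied from the per-section table: name, current lines, lines added.
# [macros] and [paper] are left out, as in the totals discussed in the text.
section_rows() {
    cat <<'EOF'
Abstract 45 392
Introduction 139 1393
Overview 382 2562
Framework 389 1462
ConsistencyModel 173 423
Algorithm 175 458
Models 146 247
Applications 246 827
RelatedWork 231 550
FutureWork 26 80
EOF
}

# Sum current size and lines added for the introductory sections and for
# the rest, then print each group's multiplier (added / current).
group_totals() {
    awk '
        $1 == "Abstract" || $1 == "Introduction" || $1 == "Overview" {
            lines["intro"] += $2; edits["intro"] += $3; next
        }
        { lines["rest"] += $2; edits["rest"] += $3 }
        END {
            printf "intro %d %d %.1f\n", lines["intro"], edits["intro"], edits["intro"] / lines["intro"]
            printf "rest %d %d %.1f\n", lines["rest"], edits["rest"], edits["rest"] / lines["rest"]
        }'
}

section_rows | group_totals
```

Each output line gives the group name, its current size in lines, the lines added, and the resulting multiplier.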

Here’s hoping that with practice, I’ll have a better idea of how to write introductions and overviews, and that this contrast will lighten. Until then, I know to budget way more time toward the introductory sections.

Consider running a similar analysis for your papers. Which sections took the most work? The least?

Edit: I went back and calculated the allocation of lines changed between “Models” and “Applications”, which were a single section at first but later split. I also dug up and added in an abortive total rewrite of the Introduction, which was on a branch I had not counted when I first wrote this post.

Footnotes:

¹ Merge commits are a special case: the first two characters matter, since there are two files. I checked manually, and there aren’t any substantial merge commits to worry about.

By Pavel Panchekha

19 July 2014