A Survey of PL Blogging (Part 1)

As a long-time blogger,¹ [¹ I guess…] I find the idea of blogging, especially blogging about technical topics, fascinating. So I've recently been attempting a survey of PL blogging. This part talks about overall statistics.

My aim was to do something like a comprehensive survey. Of course, it is impossible to be truly comprehensive, so I'll try to indicate when I am selecting against something or other, but I also hope that I have captured, say, a decent percentage of the PL research community's blogging.

Who is the PL community?

It's hard to define the PL research community without leaving someone out. New grad students, researchers at small universities, researchers emeritus or those in industry, and so on are all easy to miss. To get some measure of the PL community, I gathered the list of all authors of any sort of material at the last three PLDI, POPL, ICFP, and SPLASH conferences, or their attendant workshops. Note that this includes authors of, say, tutorials and workshop talks, and includes coauthors.

Looking through the names at random, I noticed several researchers from outside CS entirely (such as Marco Brambilla² [² The researcher, not the artist.]) and also many who are usually thought of as security, systems, or machine learning researchers. The list is also missing important names, since it is not unusual for a researcher not to publish for a few years, or to publish in venues other than the big four.³ [³ Though note that some conferences were co-located with one of the big four, such as CAV with POPL, and so are included.]

For the researchers this method found, I wrote a small Python script to extract a link to their home page from the Researchr website. The Researchr researcher profiles are shared across conferences, so they are often reasonably up to date. Still, I found only roughly 1400 websites using this process. I doubt that only 70% of faculty have websites,⁴ [⁴ Though I may be wrong. Many faculty may only have the university-generated faculty page, which is uninteresting for my purposes, and many graduate students and industry researchers may not have web pages at all.] so I expect some leakage happened here, especially for people tangentially related to the community who have not had much opportunity to interact with Researchr and fill in a profile page.

Who blogs?

I attempted to detect which of the 1400 web pages contained blogs. To do this, I searched each home page's source for the text “Blog”, “Post”, or “Archive” (case-insensitive). By searching the source code, instead of the page text itself, I hoped to leverage the fact that blogging platforms will usually use the word "post" or "blog" in a URL, even if the page calls them “thoughts” or whatever. Of the 1400 web pages, 400 contained one of my keywords.
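The keyword check described here can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the survey: the function name `might_have_blog` and the three sample page sources are made up for the example.

```python
import re

# The three keywords from the post, matched case-insensitively.
KEYWORDS = re.compile(r"blog|post|archive", re.IGNORECASE)


def might_have_blog(page_source):
    """Return True if the raw HTML source mentions any keyword.

    This searches the source, not the rendered text, so a keyword
    hidden in a URL still counts.
    """
    return KEYWORDS.search(page_source) is not None


# Hypothetical page sources, for illustration only.
with_blog = '<a href="/posts/2018/types.html">Thoughts</a>'
false_positive = '<a href="paper.ps">PostScript version</a>'
no_match = '<h1>Publications</h1><a href="paper.pdf">PDF</a>'

print(might_have_blog(with_blog))       # keyword appears only in the URL
print(might_have_blog(false_positive))  # matches, but there is no blog
print(might_have_blog(no_match))
```

A plain substring match like this is deliberately permissive, so every hit still needs a manual check.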

I then opened each page manually and confirmed whether or not a true blog existed. False positives were common. Some people ran blog software without blogging, and a footer would mention that it is a blog even though no posts existed. Others had PostScript files for their papers. Some even described how to reach them by post. After filtering out false positives, I had 199 true blogs. Note that this number includes a few blogs with zero posts, and a reasonably large number with exactly one post, announcing that a blog would soon exist, usually a few years back. Blogging is not for everyone! Academics do enough writing as is.

Note that the keywords I chose automatically select against most non-English blogs. I still got a few, but since I can't read most non-English languages, I wasn't going to be able to do much with them anyway.

How much do people blog?

Opening 400 web pages was enough fun for me for a few weeks, so to ensure some measure of automation, I decided to restrict myself to blogs with RSS feeds available. This would provide a structured description of the blog contents and allow me to write Python scripts instead of exercising my browser skills.

I wrote a simple script to download the main page of each blog and look for an RSS advertisement in the source code. Of the 199 blogs, 139 had RSS feeds (or Atom, or some other format). I know it's not the most popular technology these days, but I was a bit surprised at the loss at this stage. I expected most blogs to use a standard blogging platform, and most of those support RSS by default. I'm not sure what sort of setup the other 60 blogs use, but it could be worth looking into. I'll also note that my feed-finding script encountered a few server errors and "Forbidden" exceptions. I visited those blogs by hand to extract RSS URLs.
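For readers curious what "looking for an RSS advertisement" might involve: feeds are commonly advertised with a `<link rel="alternate">` tag in the page head. The sketch below assumes that convention and uses only the standard library. The class and function names are invented for this example, and it is not the script used for the survey.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. <link rel>), hence the `or ""`.
        rel = (attr.get("rel") or "").lower().split()
        kind = (attr.get("type") or "").lower()
        if "alternate" in rel and kind in FEED_TYPES and attr.get("href"):
            self.feeds.append(attr["href"])


def find_feeds(page_source):
    """Return the feed URLs advertised in an HTML page, in document order."""
    finder = FeedLinkFinder()
    finder.feed(page_source)
    return finder.feeds


# Hypothetical page source, for illustration only.
sample = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body></body></html>"""

print(find_feeds(sample))
```

Blogs that link to their feed only from the page body, or not at all, would be missed by this approach, which may account for some of the loss at this stage.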

With RSS feeds in hand, I wrote a script using Python's feedparser library to download each feed and output all posts written in 2018. The result was 237 blog posts across 62 blogs. More precisely, here's a histogram of the number of blog posts in 2018:
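As a rough illustration of the filtering step, here is a sketch that keeps only the posts from a given year. It swaps feedparser for the standard library so it runs without extra dependencies, and it handles only RSS 2.0 `<item>` entries with a `<pubDate>`, not Atom. The function name and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def posts_in_year(rss_xml, year):
    """Return titles of RSS 2.0 <item> entries whose <pubDate> falls in `year`."""
    root = ET.fromstring(rss_xml)
    titles = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date is None:
            continue  # undated entries cannot be assigned to a year
        try:
            published = parsedate_to_datetime(pub_date)
        except (TypeError, ValueError):
            continue  # skip dates that are not in RFC 822 style
        if published.year == year:
            titles.append(item.findtext("title", default="(untitled)"))
    return titles


# Hypothetical feed contents, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example blog</title>
<item><title>Old post</title><pubDate>Fri, 15 Dec 2017 09:00:00 +0000</pubDate></item>
<item><title>New post</title><pubDate>Wed, 25 Jul 2018 12:00:00 +0000</pubDate></item>
<item><title>Undated post</title></item>
</channel></rss>"""

print(posts_in_year(SAMPLE_FEED, 2018))
```

A library like feedparser is the more practical choice for real feeds, since it normalizes the many date formats and feed dialects found in the wild.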

Number of Posts	Blogs with that Many Posts
1	23
2	8
3	6
4	5
5	3
6	3
8	3
9	2
10	7
11	1
20	1

There is an odd spike in this histogram at 10 posts. I believe this is because most blogging platforms cap RSS feeds at 10 entries. If your feed reader lets you see more posts, it's because it saves posts no longer present on the feed. So, it's safe to say that at least 9 blogs had at least 10 posts, but it's not really safe to say that 20 was the greatest number. If I wanted really good statistics here, I would visit those 9 blogs in a browser, and count the posts.
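The lower-bound reasoning can be checked directly against the table above. The sketch below copies the histogram from the table; the helper names are made up, and the per-blog counts passed to `build_histogram` are hypothetical.

```python
from collections import Counter

# Histogram from the table above: posts in 2018 -> number of blogs.
HISTOGRAM = {1: 23, 2: 8, 3: 6, 4: 5, 5: 3, 6: 3, 8: 3, 9: 2, 10: 7, 11: 1, 20: 1}


def blogs_with_at_least(histogram, threshold):
    """Count the blogs whose observed post count is at least `threshold`."""
    return sum(blogs for posts, blogs in histogram.items() if posts >= threshold)


def build_histogram(posts_per_blog):
    """Build a histogram of the same shape from raw per-blog post counts."""
    return dict(sorted(Counter(posts_per_blog).items()))


# Blogs at or above the suspected 10-entry feed cap; their true counts may be higher.
print(blogs_with_at_least(HISTOGRAM, 10))

# Total number of blogs in the histogram.
print(sum(HISTOGRAM.values()))

# Hypothetical raw counts, to show how such a table is produced.
print(build_histogram([1, 1, 10, 3]))
```

Because a capped feed hides older entries, the counts at 10 and above are best read as lower bounds.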

What do people blog about?

I opened each of the 237 blog posts, skimmed them very lightly, and attempted to classify them. The classification scheme is pretty arbitrary, but roughly reflects the common groupings:

Topic	Number of posts
Technical, PL-related	103
Technical, non-PL	35
Announcements and CFPs	25
Advice	14
Paper abstracts	13
Non-English posts	12
Book and music reviews	10
Photo posts	9
Personal posts	8
Peer review and academia	6
Politics and the tech industry	5
Diversity and accessibility	4

Some of the posts were hard calls, but the general scheme seems sound. I tried to bias toward counting things as technical and PL-related, giving both adjectives broad interpretation. Announcements and CFPs were common, including conference announcements, job postings, and congratulations on awards. I only placed posts in this category if they contained no substantial content besides the announcement. Paper abstracts were usually pages put up by someone to announce a published paper, and to provide a website for it. Advice posts varied, with 9 aimed at faculty, 3 at grad students, and 2 at undergrads; some of those aimed at faculty were more like general life advice. The non-English posts included Chinese,⁵ [⁵ Don't know which one.] German, and Portuguese. Personal posts covered topics like job changes, sports, and medical issues.

I hope the statistics are interesting. In Part 2, I hope to assign some categorization scheme to the technical, PL-related posts.

By Pavel Panchekha

25 July 2018