Designing Libraries Like Languages

New academic languages are near-universally variations on a theme: first-class closures, mostly functional, with inductive data types and an algebraic type system. So has programming language design as a field ended? I think yes, but there's a chance to apply the same approaches and techniques to the design of standard libraries. In this post, I demonstrate that comparing even the most basic libraries across languages demonstrates many degrees of freedom for a designer to weigh.

All serious programming languages come with standard libraries for string operations, list combinators, and data structures. Libraries for manipulating dates, making HTTP requests, and accessing the file system are common too. But these libraries differ drastically in their design, suggesting that some attention from a designer, plus some comparative study, should yield interesting insights, improve the design of libraries, and uncover beautiful theory.

List libraries

Take, for example, list and string operations. Every language from C++ to Python to JavaScript to Haskell provide some form of these, but the specifics differ widely. All of them provide map (which in C++ is called transform) and forEach (except Haskell for which it wouldn't make sense). But already C++ does not provide filter (it provides remove_if), and fold is even less common, with Python removing it in the 3 series of the language.¹ [¹ I'm not even talking about how many different fold variants functional languages have, let alone scan and so on.] That mainstay of intro-to-functional classes, reverse, is uncommon, and Python even subsumes it into an unrelated feature (via the [::-1] idiom). On the other hand, Haskell base provides a wealth of list functions with confusing names like intercalate and nub, and also provides a bunch of functions that are hard to imagine using outside a combinatorics problem: subsequences, permutations, and so on. The categorization of these functions differ as well. Haskell's zipWith and Python's zip are, in Lisp, a variant of map. Plus, different languages generalize from arrays to sequences differently, even though it's an abstraction most languages have.

Yet despite this diversity and the obviously existing degrees of freedom, I don't know of any academic work studying how list libraries should be defined. What combinations of list functions provide how much power? How can we prove equivalences between list functions? For example, any sequence of map and filter calls can be reduced to a normal form, one filter followed by one map, and adding reverse still allows for a normal form. But what happens when we add functions like groupBy or zip? (Note that there has been a lot of work on rewrites for list fusion, for example in Haskell, but this only covers a small part of a full list library.)

Then there is the question of the gap between string and list operations. How many string operations are list operations in disguise? String splitting, for example, makes sense both for lists and strings, and in Haskell (where strings are lists of characters) list functions like stripPrefix are frequently used on strings. But string operations usually operate on subsequences instead of elements (like, often, splitting), and sometimes use more complex machinery like regular expressions. Should list and string libraries be fundamentally the same, or fundamentally different?

Date libraries

Most languages have a date library, and most widely-used languages have several, because date libraries usually have inadequacies. And date libraries have plenty of degrees of freedom. Libraries differ in their fundamental unit—Java-derived libraries tend to use milliseconds since the epoch, C-derived ones tend to use seconds—and also in their support for time zones, different calendars, localization, and so on. Some (rare) libraries support both intervals and moments of time; others expect you to build intervals yourself. Sometimes dates and times are separate, other times they are the same. There are some small brilliant hacks, like the Go library's formatting function (instead of memorizing percentage escapes, you write how you'd like a reference date, Jan 2 15:04:05 -0700 MST 2006, formatted). There are also new concepts, like the Python timedelta class.

Here, again, there is the potential for comparative study, for functional study, and for first-principles theorizing. Which features are common; which uncommon features are used when available? Which common idioms can be replaced with core functionality? How can date operations be optimized, or proven equivalent?

File system interfaces

Many languages have a file system interface based around that provided by POSIX: file handles with read and seek and peek, directory listings and relative paths and stat calls, and so on. But while this interface adequately replicates the underlying kernel calls (and so should probably be available for low-level code) it usually requires learning various idioms to use correctly. For example, most programmers have been bit by the concurrency problem between open and write (and now know to create a new file and rename over the old one), or have written an analog of os.walk. And most languages that don't provide a comprehensive relative/absolute path API grow a library or several to provide it. Python provides two separate complete path APIs, os.path and pathlib, suggesting that writing a path library is not a trivial endeavor.

Different languages also use a variety of different concepts to wrap their file system interface. Python, for example, has a notion of a "file-like object", while Haskell has the clever idea of file reads returning a lazy sequence. Some languages identify paths with strings, while others don't; Python's two path libraries make opposite choices. There is some academic work on paths, like G. Ntzik and P. Gardner's work on separation logic for paths; but it seems to be a less explored area, despite the centrality of file system operations. And consider the fact that the basic Unix tools often don't have analogs in libraries: rm -r usually requires writing a recursive function, for example, and most implementations of something like os.walk don't make it possible to define recursive deletion in terms of it, because they are pre-order, not post-order, traversals.

I know of no language that presents the file system in terms of that language's built-in tree structure, even for languages, like Lisp, that have a widely-used built-in tree structure.

Conclusion

Mainstream languages use a bewildering variety of methods to accomplish nearly-identical goals in their standard libraries. Unlike the language core, which at this point is well-studied and well-understood, there is little work on studying the common components of standard libraries, so there is likely plenty of low-hanging fruit left that would make libraries simpler, easier to use, and perhaps provide better error messages or optimizations. This could be the place for a new generation of language design work.

Footnotes:

I'm not even talking about how many different fold variants functional languages have, let alone scan and so on.

By Pavel Panchekha

20 January 2019