Algebraic Data Types in Scheme

Algebraic data types feature prominently in a various functional languages, like ML, Haskell, Scala, and others. Unfortunately, Scheme doesn’t have any native support for them. I’ve been doing a lot of coding in MIT Scheme lately (as part of TAing Gerald Jay Sussman’s 6.945), and that lack of support was really starting to hurt. Luckily, it’s not hard to write a replacement. ADT.scm provides that replacement, allowing you to:

Define new algebraic data types
Pattern match on them, with destructuring.

This page neither explains how the code works, nor how to use it. The documentation can be found at the top of the source code, and the remainder of the source contains tests and explanations. Instead, this page is written with an eye of explaining why I chose this design. Explaining how I solve problems, and how I can better solve them.

Enabling Destructuring

Storing algebraic data types is easy, so the library is designed around the second feature: destructuring. Scheme actually permits only one binding construct¹ [¹ Note that this is wholly differing from typed languages. In ML or Haskell, let and lambda are very different due to how they treat polymorphism. In Scheme, values are untyped, and this distinction vanishes.] — lambda — so we’ll necessarily use lambda to bind variables. The way we use lambda to bind variables is by

Generating a function whose parameter names are the names we want to bind.
Applying that function the parameter values.

Generating such a function needs to be done either with eval or by a syntax rule (macro); eval is generally best avoided, so let’s assume for now that we solve part 1 with a macro-based mechanism. For second problem, a function in Scheme can be applied in two ways:

Applying a function with the (f v1 v2 v3) syntax: applying a function to a fixed number of arguments, stored separately.
Applying a function with the (apply f (list v1 v2 v3)) mechanism: applying a function to a variable number of arguments, stored in a list.

Of the two, the second seems more attractive since it does not require knowing the number of arguments; counting is hairy in Scheme syntax-rule macros. I also don’t expect to do anything with the fields in an algebraic data type, so there’s no need to treat them separately, which the (f v1 v2 v3) syntax would be better suited for.

Alright, so this gives us two design decisions: destructuring will generate a function, with the parameter names being the variables destructed into, via a syntax rule; and, destructuring will then apply this function, with the apply primitive, to a list containing the ADT fields.

So, it makes sense to store the fields of an ADT in a list.

Enabling Pattern Matching

To pattern-match, you need to also know which branch of the ADT a particular value represents. The number of branches is finite, so tags and a jump table of sorts would be fastest. Since the number of branches is also usually small, we can ignore the jump table and just do repeated tests. In that case, I’d need a way to determine if an ADT value

Is of a given type
Takes a given branch

And I need to extract the field values of an ADT value. Requirement 1 can be accomplished by fiat: in Scheme, we usually just assume that a value is of the correct type. Requirement 2 is most easily handled by a tagging scheme; since branches are usually named, we can use the name as a tag. The tag can be represented by a cons cell whose cdr is the field values (as a list) and whose car is a tag; this also makes ADT values print nicely. In fact, for the sake of debugging, we’ll also throw in a tag that names the type that value is of.

Implementation Sketch

Thus, ADT values look like (type-name tag-name field-1-value field-2-value ...). For every ADT, we need to test whether a value is of a given branch, to bind the fields by applying them to a function, and to create a new value from a given branch. In the code, this is accomplished by the function %adt-branch, the % being a convention I picked up from “Let over Lambda”² [² “Let over Lambda”, by Doug Hoyte, is a book on macro programming in Common Lisp. It’s available online, the first few chapters being free.]

(define (%adt-branch adt-name variant-name num-elts)
  (lambda (op . args)
    (apply ; This apply lets us use Scheme’s parameter list length checking
     (case op
       ((!)
        (lambda elts
          (cons adt-name (cons variant-name elts))))
       ((?)
        (lambda (val)
          (and
           (list? val)
           (= (length val) (+ 2 num-elts)) ; 2 for the ADT and variant name
           (eq? (car val) adt-name)
           (eq? (cadr val) variant-name))))
       ((@)
        (lambda (val f)
          (apply f (cddr val)))))
     args)))

Here, I’m representing an ADT branch by a multiplexed function: a function which takes an operation name and arguments, and executes that operation. Also note the cute trick with apply; this lets multiplexed functions benefit from reasonable parameter number checking and error handling.

With the above structure, we want a match expression, written like

(match val
  ((branch1 field1 field2)
   ...)
  ((branch2 field1)
   ...))

to expand into

(cond
 ((branch1 '? val)
  (branch1 '@ val (lambda (field1 field2) ...)))
 ((branch2 '? val)
  (branch2 '@ val (lambda (field1) ...))))

This can be accomplished by a simple macro:

(define-syntax match
  (syntax-rules (else)
    ((match var ((name vars ...) body ...) rest ...)
     (begin
       (if (name '? var)
           (name '@ var (lambda (vars ...) body ...))
           (match var rest ...))))
    ((match var (else body ...))
     (begin body ...))
    ((match var)
     (error "Incomplete pattern; no match for case" var))))

Finally, we need a way to define an ADT. Really, we need to know only the name of each variant and the number of fields that variant supports. Existing languages with ADTs also need the type of each field; in Scheme we don’t need this, but talking only about the number of fields would lose the self-documentation properties of types. Instead, we’ll require that names for each field be specified, and the number of fields is then just the number of names. An ADT definition then looks like:

(define-adt my-list?
  (my-cons head tail)
  (my-nil))

and a macro that makes this work would be

(define-syntax define-adt
  (syntax-rules ()
    ((define-adt adt-name (name1 fields1 ...) rest ...)
     (begin
       ; This (length '(fields1 ...)) trick is very useful, though
       ; doing it at compile-time would be better.
       (define name1 (%adt-branch 'adt-name 'name1 (length '(fields1 ...))))
       (define-adt adt-name rest ...)))
    ((adt adt-name)
     (define adt-name (%adt-predicate 'adt-name)))))

Note that since Scheme is untyped, it’s pretty important that each logical type be accompanied by a predicate. This predicate is generated by the %adt-predicate function, which hinges on tagging ADT values not just by the branch they represent but also the type they are an instance of.

(define ((%adt-predicate adt-name) val)
  (and (list? val)
       (> (length val) 1)
       (eq? (car val) adt-name)))

Critique of Implementation

Using a macro instead of eval for creating the function through which binding occurs seems to have been a good decision. The macro seems to do the job, and the code is rather nice due to the pattern-matching language used for Scheme syntax-rules.

Using apply for binding destructed values is a more dubious decision. The apply form is great when all values are treated uniformly; this is the case when destructing a single ADT value, but is not the case when you want to pattern-match on nested ADTs. At the moment, pattern-matching on nested ADTs means nested match expressions, and that might not be best. On the other hand, it does make it a lot easier to detect if your patterns are not exhaustive.

Also, destructuring into a function has a downside: the names of pattern variables have to be distinct. Scheme has no special rule for the _ variable, unlike Haskell or ML. It might be possible to improve the above code by making each instance of the _ variable use generate-uninterned-symbol to create the corresponding variable name; but this would make the match macro quite a bit hairier.

The match function itself is pretty nice. In particular, the else clause feels very Scheme-like, and the error checking is nice since it falls back to Scheme’s normal parameter list checking. Guards and recursive matches would be nice. Guards are easy to add by the addition of a few rules to the match macro, but recursive pattern-matching would need large-scale changes, as discussed above.

Finally, the define-adt macro is also pretty nice. In particular, since field names are completely opaque (we just count them), they don’t have to be symbols. Instead, our API is extensible by just making the field names property lists; for example,

(define-adt complex
  (rectangular (x :type real) (y :type real))
  (polar (r :type real :guard positive?)
         (theta :type real :guard (lambda (x) (< 0 x (* 2 pi))))))

Footnotes:

Note that this is wholly differing from typed languages. In ML or Haskell, let and lambda are very different due to how they treat polymorphism. In Scheme, values are untyped, and this distinction vanishes.

“Let over Lambda”, by Doug Hoyte, is a book on macro programming in Common Lisp. It’s available online, the first few chapters being free.

By Pavel Panchekha

11 June 2013