Intrinsic and Extrinsic Data

Haoyi Li recently wrote an excellent blog post on the Visitor pattern, specifically on how Visitors make fusion and streaming really easy. I loved the example of JSON parsing and processing. It is an excellent example of a fundamental duality in PL theory: intrinsic and extrinsic data.

Constructing extrinsic data

Intrinsic data is what we normally think of as data, a number, say, or an array.

Extrinsic data is data described by its functionality. For example, instead of five being then number 5, you could represent it as the ability to do an arbitrary thing five times. To be more precise, we could say that a natural number is represented by a function n : ∀ X, X → (X → X) → X which, given a starting point and a step to repeat n times, does just that and returns the result.

This is a fully-general representation, since if you have a preferred representation of numbers, say as binary numbers, you could recreate five in that representation by doing a binary increment five times.

It would be appropriate to say that intrinsic data is a platonic ideal existing in the world, while extrinsic data is that platonic ideal existing solely in the mind. It would also be appropriate to say that intrinsic data captures the past (how it was made) while extrinsic data captures the future (how it will be use).

In Haoyi's post, he starts with a JSON data structure:

data JSON :=
| Str : string → JSON
| Num : int → JSON
| Dict : list(string × JSON) → JSON

and defined a Visitor pattern:¹ [¹ I've represented the state of the dictionary visitor with S, since Haoyi uses that, and elected to represent the Visitor itself without state.]

type Visitor T :=
  (string → T) ×
  (int → T) ×
  ∃ S, S × (S → string → T → S) × (S → T)

Particularly striking is the parser that Haoyi defines:

parser : string → Visitor T → T

The type Visitor T → T is an extrinsic representation of JSON, and in fact it is not hard to define:² [² Here, out is what Haoyi calls dispatch.]

in : (Visitor T → T) → JSON
out : JSON → (Visitor T → T)

As Haoyi correctly points out, representing data extrinsicly has some nice benefits. Intrinsic data usually has a fixed representation in memory, so mutating the data involves a lot of allocation. Extrinsic data can be ephemeral, involving no allocations and making fusion easy. It can also be nicely adapted to streaming data.³ [³ Though there are some tricky questions if you need productivity for your streaming data.] There are also downsides to extrinsic data, however: allocations of memory for intrinsic data become allocations of closures for extrinsic data, and you can spend a lot of time traveling through trivial wrappers instead of operating on some kind of concrete data. The greatest challenge for extrinsic data, however, is with reasoning.

Reasoning about extrinsic data

The challenge is determining when two pieces of extrinsic data are equal.

Equality is a fundamental but complicated concept in type theory, especially for functions, and I've grappled plenty with it. When are two behaviors identical? And unless you can answer that, you can't say when two pieces of data, which are presented in terms of their behaviors, are equal.

While equality of functions is an obscure idea, the equality of numbers is not. For example, if a natural number is presented extensionally as ∀X, X → (X → X) → X, two numbers can be compared for equality like so:

eq := λ x, x (λ y, y True (λ_, False)) (λ eqx', λ y, y False (λ _, eqx' x (pred y)))

where:

pred := λ x, x None (λ p, p (Some 0) (λ x', Some (S x')))

where of course:

True := λ a b, a
False := λ a b, b
None := λ a b, a
Some := λ x, λ a b, b x

But it is not clear with this presentation whether two numbers that are eq are intersubstitutible, and substitution is the core principle of type theory. For example, if our type theory does not guarantee termination, it is possible for two different eq numbers to guarantee different behavior for non-terminating inputs.⁴ [⁴ In other words, to have numbers of different strictness.] It is also difficult to get a sense of all possible numbers. In an intrinsic world it is possible to prove that any number is equal to 0 or to the successor of a number. That is difficult with extrinsic presentations, since it requires proving the equality of functions with different bodies.⁵ [⁵ And it need not be true! For example, the "infinite number", which is the successor of itself, is often a perfectly valid extrinsic number.]

There are additional challenges with extrinsic data and state. If behaviors can change state, then extrinsicly-presented data can change state, and that is definitely not the sort of thing data should do! For example, imagine a number that remembers how many times you've used it, and pretends to be that number!

Parametricity

In type theory the solution in these cases is naturality or parametricity, where terms of a for-all type are required to behave “uniformly” on all types and therefore have very restricted behavior. For example, a number of type ∀X, X → (X → X) → X, if it must behave identically no matter what X is, really only has the choice of calling its second argument some number of times, starting from its first argument. That said, parametricity usually depends finely on the details of the language. For example, if the language is non-terminating, you can have all sorts of different "four"s, depending on strictness.

In some sense, a strong parametricity property is a promise from the language that intrinsic and extrinsic data behave identically.⁶ [⁶ As a result, it often makes the language core very small, since extrinsic data is just functions and the language no longer needs to define intrinsic data!]

Footnotes:

I've represented the state of the dictionary visitor with S, since Haoyi uses that, and elected to represent the Visitor itself without state.

Here, out is what Haoyi calls dispatch.

Though there are some tricky questions if you need productivity for your streaming data.

⁴

In other words, to have numbers of different strictness.

⁵

And it need not be true! For example, the "infinite number", which is the successor of itself, is often a perfectly valid extrinsic number.

⁶

As a result, it often makes the language core very small, since extrinsic data is just functions and the language no longer needs to define intrinsic data!

By Pavel Panchekha

28 April 2019

Intrinsic and Extrinsic Data

Constructing extrinsic data

Reasoning about extrinsic data

Parametricity

Footnotes: