Intrinsic and Extrinsic Data
Haoyi Li recently wrote an excellent blog post on the Visitor pattern, specifically on how Visitors make fusion and streaming really easy. I loved the example of JSON parsing and processing. It is an excellent example of a fundamental duality in PL theory: intrinsic and extrinsic data.
Constructing extrinsic data
Intrinsic data is what we normally think of as data, a number, say, or an array.
Extrinsic data is data described by its functionality. For example,
instead of five being then number 5
, you could represent it as the
ability to do an arbitrary thing five times. To be more precise, we
could say that a natural number is represented by a function n : ∀ X,
X → (X → X) → X
which, given a starting point and a step to repeat n
times, does just that and returns the result.
This is a fully-general representation, since if you have a preferred representation of numbers, say as binary numbers, you could recreate five in that representation by doing a binary increment five times.
It would be appropriate to say that intrinsic data is a platonic ideal existing in the world, while extrinsic data is that platonic ideal existing solely in the mind. It would also be appropriate to say that intrinsic data captures the past (how it was made) while extrinsic data captures the future (how it will be use).
In Haoyi's post, he starts with a JSON data structure:
data JSON := | Str : string → JSON | Num : int → JSON | Dict : list(string × JSON) → JSON
and defined a Visitor pattern:1 [1 I've represented the state of the dictionary visitor with S
, since Haoyi uses that, and elected to represent the Visitor itself without state.]
type Visitor T := (string → T) × (int → T) × ∃ S, S × (S → string → T → S) × (S → T)
Particularly striking is the parser that Haoyi defines:
parser : string → Visitor T → T
The type Visitor T → T
is an extrinsic representation of JSON
, and in
fact it is not hard to define:2 [2 Here, out
is what Haoyi calls dispatch
.]
in : (Visitor T → T) → JSON out : JSON → (Visitor T → T)
As Haoyi correctly points out, representing data extrinsicly has some nice benefits. Intrinsic data usually has a fixed representation in memory, so mutating the data involves a lot of allocation. Extrinsic data can be ephemeral, involving no allocations and making fusion easy. It can also be nicely adapted to streaming data.3 [3 Though there are some tricky questions if you need productivity for your streaming data.] There are also downsides to extrinsic data, however: allocations of memory for intrinsic data become allocations of closures for extrinsic data, and you can spend a lot of time traveling through trivial wrappers instead of operating on some kind of concrete data. The greatest challenge for extrinsic data, however, is with reasoning.
Reasoning about extrinsic data
The challenge is determining when two pieces of extrinsic data are equal.
Equality is a fundamental but complicated concept in type theory, especially for functions, and I've grappled plenty with it. When are two behaviors identical? And unless you can answer that, you can't say when two pieces of data, which are presented in terms of their behaviors, are equal.
While equality of functions is an obscure idea, the equality of
numbers is not. For example, if a natural number is presented
extensionally as ∀X, X → (X → X) → X
, two numbers can be compared for
equality like so:
eq := λ x, x (λ y, y True (λ_, False)) (λ eqx', λ y, y False (λ _, eqx' x (pred y)))
where:
pred := λ x, x None (λ p, p (Some 0) (λ x', Some (S x')))
where of course:
True := λ a b, a False := λ a b, b None := λ a b, a Some := λ x, λ a b, b x
But it is not clear with this presentation whether two numbers that
are eq
are intersubstitutible, and substitution is the core principle
of type theory. For example, if our type theory does not guarantee
termination, it is possible for two different eq
numbers to guarantee
different behavior for non-terminating inputs.4 [4 In other words, to
have numbers of different strictness.] It is also difficult to get a
sense of all possible numbers. In an intrinsic world it is possible to
prove that any number is equal to 0 or to the successor of a number.
That is difficult with extrinsic presentations, since it requires
proving the equality of functions with different bodies.5 [5 And it
need not be true! For example, the "infinite number", which is the
successor of itself, is often a perfectly valid extrinsic number.]
There are additional challenges with extrinsic data and state. If behaviors can change state, then extrinsicly-presented data can change state, and that is definitely not the sort of thing data should do! For example, imagine a number that remembers how many times you've used it, and pretends to be that number!
Parametricity
In type theory the solution in these cases is naturality or
parametricity, where terms of a for-all type are required to behave
“uniformly” on all types and therefore have very restricted behavior.
For example, a number of type ∀X, X → (X → X) → X
, if it must behave
identically no matter what X
is, really only has the choice of calling
its second argument some number of times, starting from its first
argument. That said, parametricity usually depends finely on the
details of the language. For example, if the language is
non-terminating, you can have all sorts of different "four"s,
depending on strictness.
In some sense, a strong parametricity property is a promise from the language that intrinsic and extrinsic data behave identically.6 [6 As a result, it often makes the language core very small, since extrinsic data is just functions and the language no longer needs to define intrinsic data!]
Footnotes:
I've represented the state of the dictionary visitor with S
, since Haoyi uses that, and elected to represent the Visitor itself without state.
Here, out
is what Haoyi calls dispatch
.
Though there are some tricky questions if you need productivity for your streaming data.
In other words, to have numbers of different strictness.
And it need not be true! For example, the "infinite number", which is the successor of itself, is often a perfectly valid extrinsic number.
As a result, it often makes the language core very small, since extrinsic data is just functions and the language no longer needs to define intrinsic data!