Pavel Panchekha


Share under CC-BY-SA.

WIP: Programming by Voice

Update: M. Arntzenius gave a great talk a few years ago about voice programming using the Talon voice typing system. I don't know how I failed to think of voice typing systems, but they're the well-developed answer here. Interestingly, Arntzenius confirms a lot of points that I guessed at here (nominalism, integration with language) but also raises a lot of new ones (like the challenges of resolving ambiguity and handling errors).

Two years ago at PNW PLSE, I had a wonderful conversation with some folks about unorthodox language designs. What questions do functions, lambdas, and types not answer? I've been thinking about one topic in particular: how would you design a language for programming by voice?

There are lots of variations on this problem: does the user have a screen, or are they just speaking and listening? What kind of programs are they writing? I like to imagine programming Alexa by talking to it. “Alexa, start the drier one hour after I turn the bedroom lights off.” These programs are short, so you could imagine editing or reviewing them purely in conversation.

The language syntax

You can already program by voice with dictation software, but it requires you to explicitly speak each punctuation character; it's possible, but inconvenient. A language designed for speech should have a syntax made up of common words. I'd be inclined to avoid heavy nesting and design a language that is mostly flat.

Simple triggers, like the example above, seem well suited to the “if this then that” trigger-action style. It’s already been used in some popular automation tools. What is the overall power of such systems? How easy is it to write loops in them (say, by triggering custom events and then reacting to them), and how important is that for the programs we want to write?

Programs will have to be decomposable into short, sentence-length fragments, in the style of Forth. Most programming languages have long-distance dependencies (like type declarations, module imports, and others) that I think will be much harder to track when you are speaking. Natural languages minimize such long-distance dependencies; even declined and conjugated languages tend toward word orders where the related words are close by.

The editor

Programs are written with a lot of editing—I'm referring to micro-edits, reacting to typos or nesting errors. But you can't edit speech. So a lot of effort has to go into the “editor” commands. I think the language could adopt a sketch-elaboration style:

Alexa, call my husband when he leaves work.
Should I call Robert on his Mobile or Work number?
His mobile number.
Should I determine when he leaves work using a geofence?
No, you should check whether his car is on.
Is that the car named “Robert's car” or the car named “John's car”?
The one named “Robert's car”.
Ok, I will call Robert's Mobile number when Robert's car turns on.

This conversation has a fairly dumb Alexa—it cannot track things like who owns which car, and certainly has no idea that calling someone at work once they've left work makes no sense. It does, however, recognize an alias like “my husband”.

It's not clear if Alexa here understands “my husband” as a relationship between two people, as an empty symbol related to the speaker “me”, or as an empty phrase. Some households have only one “my husband”, but some have multiple generations or multiple families cohabiting, or same-gender couples that describe each other as husbands. For dumb systems, more elaboration might be needed. “Is ‘my husband’ Robert or John?” “It is Robert. Stupid Alexa.”

Sketch-elaboration systems naturally accommodate both dumb and smart Alexas, and furthermore are amenable to Alexas that grow smarter over time, whether through patches or machine learning.
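
One way to picture sketch-elaboration is as a loop over the unresolved slots of a request, asking a question only when a slot still has more than one candidate. The sketch below is purely illustrative: the slot names, the candidates, and the elaborate function are all made up for this example. A smarter Alexa would simply arrive with fewer ambiguous slots.

```python
def elaborate(slots, answer):
    """Resolve each slot of a sketched request, asking only when ambiguous.

    slots:  dict mapping a slot name to its list of candidate values
    answer: callback that picks one candidate given a question (the user)
    """
    resolved = {}
    for name, candidates in slots.items():
        if len(candidates) == 1:
            # A smart enough system narrows a slot to one candidate itself.
            resolved[name] = candidates[0]
        else:
            question = f"For {name}, do you mean " + " or ".join(candidates) + "?"
            resolved[name] = answer(question, candidates)
    return resolved


# Hypothetical slots for "call my husband when he leaves work".
request = {
    "my husband": ["Robert"],
    "number": ["Mobile", "Work"],
    "car": ["Robert's car", "John's car"],
}
# Stand-in for the spoken replies: always pick the first candidate.
plan = elaborate(request, lambda question, candidates: candidates[0])
print(plan)
```

Here “my husband” is resolved silently, while the number and the car each cost one question, which matches the dumb Alexa in the conversation above.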

Structuring programs

If programs involve many short definitions and shallow trigger-action statements, finding and referencing these statements will be a challenge. Without a screen, you can't find the right line by visually browsing a screen-full of them—you will need to search. Perhaps trigger-action statements could be named—but most actions will be hard to name, and those names will be quickly forgotten. It is better to search. “Go to definition” is useful, but you will also need to search causally, to ask, “what will happen after the lights turn off”, or, “Alexa—what causes the laundry to start?”

Since searching by the consequences of programs is in general undecidable, perhaps the language should be restricted so that it is not Turing-complete. But that would also hobble the language and limit the programmer. Perhaps it should be mostly not Turing-complete, but with escape hatches.

Trigger-action programs have structure. The drier request above might have to be rephrased, or might compile, into two primitive trigger-actions:

When the bedroom lights turn off, start a one-hour alarm named ANON1
When ANON1 rings, start the drier
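
This rephrasing could plausibly be done mechanically. Here is a minimal sketch, assuming rules are plain (trigger, action) pairs and that the compiler invents alarm names like ANON1; the Rule type and compile_delayed function are hypothetical and not part of any real Alexa API.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Rule:
    trigger: str  # event that fires the rule
    action: str   # what to do when it fires


_alarm_ids = count(1)  # source of fresh names for anonymous alarms


def compile_delayed(trigger, delay, action):
    """Split "do ACTION DELAY after TRIGGER" into two primitive rules."""
    alarm = f"ANON{next(_alarm_ids)}"
    return [
        Rule(trigger, f"start a {delay} alarm named {alarm}"),
        Rule(f"{alarm} rings", action),
    ]


rules = compile_delayed("the bedroom lights turn off", "one-hour", "start the drier")
for rule in rules:
    print(f"When {rule.trigger}, {rule.action}")
```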

If the drier request needs to be specified this way, there need to be easy ways to jump from the first trigger to the second.

Alexa, what causes the drier to start?
I will start the drier when the one-hour alarm named ANON1 rings
How much time is there left on the alarm?
The alarm hasn't started yet
Ok, what causes that alarm to start?
I will start a one-hour alarm named ANON1 when the bedroom lights turn off.
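
Answering these “what causes” questions amounts to walking the rules backward from an effect. A rough sketch follows, assuming each rule records both the event that triggers it and the events its action can produce. That representation is an assumption made for the example; a real system would have to derive those links from the actions themselves. For brevity, the sketch follows only one cause per event.

```python
# Each rule: (trigger event, action text, events the action can produce).
RULES = [
    ("bedroom lights turn off", "start a one-hour alarm named ANON1", ["ANON1 rings"]),
    ("ANON1 rings", "start the drier", ["drier starts"]),
]


def causes(event, rules=RULES):
    """Return the chain of rules leading to `event`, nearest cause first."""
    chain = []
    seen = set()
    while event not in seen:
        seen.add(event)  # guard against rules that trigger each other in a loop
        rule = next((r for r in rules if event in r[2]), None)
        if rule is None:
            break  # nothing in the program causes this event
        chain.append(rule)
        event = rule[0]  # keep asking what causes this rule's trigger
    return chain


for trigger, action, _ in causes("drier starts"):
    print(f"I will {action} when {trigger}")
```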

Perhaps it is collections of trigger-actions that should be named, and structured hierarchically or with tags.

Most of these programs will be temporary (but some will be permanent). The drier and husband-call programs will only be executed once. Other programs, like “Every time I get home, if no one else is home, play music,” will be executed many times, and may be edited and changed over time. Temporary programs need to disappear automatically, though perhaps they should be archived somewhere, or at least used to train the elaborator.

Data and state

The most common data types in most programming languages are strings, numbers, booleans, arrays, and hash tables. A smart-home-like system will need to have people, devices, and time as frequently-used data types. However, leaving out more abstract data types would limit the language.

Since trigger-action statements do not have complicated internal structure, there would need to be lots of state to handle complex actions. State could be global, or scoped to collections of trigger-actions. In any case, there would have to be easy ways to survey the current state. Consider the program:

Alexa, create storage for a date named “last bathroom cleaning”
Ok, I have created storage named “last bathroom cleaning”
Alexa, store the current date into “last bathroom cleaning”
Ok, I have stored 13 July 2023 into “last bathroom cleaning”
Alexa, every time someone tells you “I cleaned the bathroom”, store the current date into “last bathroom cleaning”
Ok, I have created a new action “I cleaned the bathroom”
Alexa, every time the value of “last bathroom cleaning” is more than seven days ago, remind me to clean the bathroom
How often should I remind you?
Once a day
Ok, I have created a new watch based on “last bathroom cleaning”

This program bundles two trigger-actions and a state—note that the state is typed here. Checking the value of the state will be regularly useful:

Alexa, what is the value of “last bathroom cleaning”?
The value of “last bathroom cleaning” is six days ago, 23 July 2023.
— a little while later —
Alexa, I cleaned the bathroom.
Ok, I have updated “last bathroom cleaning” to today, 29 July 2023.
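
The typed state and the watch in this program can be sketched in a few lines. The Store class and needs_reminder function below are hypothetical names chosen for illustration; the only real library used is Python's standard datetime module.

```python
from datetime import date, timedelta


class Store:
    """Named, typed storage that can be surveyed at any time."""

    def __init__(self):
        self._types = {}
        self._values = {}

    def create(self, name, type_):
        self._types[name] = type_

    def put(self, name, value):
        # The declared type is checked on every write.
        if not isinstance(value, self._types[name]):
            raise TypeError(f"{name!r} holds {self._types[name].__name__} values")
        self._values[name] = value

    def get(self, name):
        return self._values[name]


def needs_reminder(store, today):
    """The watch: true once the last cleaning is more than seven days ago."""
    return today - store.get("last bathroom cleaning") > timedelta(days=7)


store = Store()
store.create("last bathroom cleaning", date)
store.put("last bathroom cleaning", date(2023, 7, 13))
print(needs_reminder(store, date(2023, 7, 19)))  # six days after the cleaning
print(needs_reminder(store, date(2023, 7, 21)))  # eight days after the cleaning
```

Because the storage is typed, an attempt to store something that is not a date is rejected at the moment it is spoken rather than when the watch next runs.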

Heavily-stateful programming, where every part of the program can access the global environment, is the opposite of recent trends in language design. Is language design going wrong, or am I thinking about programming by voice incorrectly? Home automation is all about state and imperative actions.

Invariants and concurrency

Analyses of all enabled trigger-actions will be valuable. Stating conditions that must be globally true will also be important:

Alexa, make sure no alarm rings when no one is home.
Ok, I will make sure no alarm rings when no one is home.

In this case, Alexa could just silence the alarm, overriding a lower-priority action with a higher-priority one. In my experience, priority systems lead to confusing results. Perhaps Alexa could instead surface all possible conflicts, both when the condition is given and when future programs are dictated.
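
Surfacing conflicts could look something like the sketch below, which assumes the world is described by a handful of boolean facts and that every rule comes with a guard over those facts. All of the names here (FACTS, conflicts, the two example rules) are invented for the example. The check simply enumerates every combination of facts, which is only feasible when the set of facts is small.

```python
from itertools import product

FACTS = ["someone home", "lights off"]  # hypothetical world facts


def conflicts(rules, forbidden):
    """List (rule name, state) pairs where a rule may fire but is forbidden.

    rules:     dict of name -> (guard(state) -> bool, action kind)
    forbidden: function (state, action kind) -> bool, the global condition
    """
    found = []
    for values in product([False, True], repeat=len(FACTS)):
        state = dict(zip(FACTS, values))
        for name, (guard, kind) in rules.items():
            if guard(state) and forbidden(state, kind):
                found.append((name, state))
    return found


rules = {
    "laundry alarm": (lambda s: s["lights off"], "alarm"),
    "welcome music": (lambda s: s["someone home"], "music"),
}


def no_alarm_when_empty(state, kind):
    # "Make sure no alarm rings when no one is home."
    return kind == "alarm" and not state["someone home"]


for name, state in conflicts(rules, no_alarm_when_empty):
    print(f"{name} could break the condition when {state}")
```

The same check could run again each time a new program is dictated, so that the conflict is reported at dictation time instead of being silently resolved by priority.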

Trigger-action programs are concurrent, even if they are not parallel, that is, even if only one action executes at a time. Races are possible, as are contradictory commands. Could an intervening event cancel the one-hour alarm in the laundry program? Could the “last bathroom cleaning” be set to the far future?

Data invariants would help, though scoping state to collections could make concurrency less of a problem. Perhaps transactions on state should be possible.

The Alexa would be a live system, similar to a Smalltalk image. There would have to be ways of overriding poorly-considered invariants from before. There may need to be a separation into things the user can do by speaking and things programs can do—but such separations limit the power of programs. Perhaps instead of a sudo action, Alexa would interactively ask the user to confirm actions that violate invariants.

Like other live systems, Alexa would need ways to add documentation within the system, especially to conditions and invariants. What is this state for? Why might you change it? If you go on vacation, set “last bathroom cleaning” to a future date.

Breaking an invariant might be a trigger—perhaps your child is going to set the last bathroom cleaning date far in the future to get out of his chores. You would like to be notified.


Programs will have bugs. Programs with access to your house, your cell phone, and your appliances can have terrible bugs, and possibly very costly ones. If the “last bathroom cleaning” program were phrased as a seven-day alarm (not as a stored date) and the alarm weren't canceled when the bathroom is cleaned, it could ring unnecessarily. How could you fix it, and undo the damage?

How is new state allocated? Can it be easily surveyed and cleaned up? If a petulant child convinces Alexa to play many annoying alarms at once, could the parents disable them all easily? (The child may not have permissions to create alarms, but may have permissions to clean the bathroom, for example.) Named, fixed, and organized storage that is difficult to allocate could be easily surveyed. If alarms are allocated and named, instead of simply kept track of by the runtime, they can be easily found and disabled.
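
A small sketch of what named, surveyable alarm storage might look like. The AlarmRegistry class and the idea of recording an owner for each alarm are assumptions made for this example, not a description of any existing system.

```python
class AlarmRegistry:
    """Alarms live in named slots so they can be listed and switched off."""

    def __init__(self):
        self._alarms = {}  # name -> {"owner": str, "enabled": bool}

    def allocate(self, name, owner):
        if name in self._alarms:
            raise ValueError(f"an alarm named {name!r} already exists")
        self._alarms[name] = {"owner": owner, "enabled": True}

    def survey(self, owner=None):
        """Names of enabled alarms, optionally only those of one owner."""
        return [name for name, alarm in self._alarms.items()
                if alarm["enabled"] and owner in (None, alarm["owner"])]

    def disable_all(self, owner):
        for alarm in self._alarms.values():
            if alarm["owner"] == owner:
                alarm["enabled"] = False


registry = AlarmRegistry()
registry.allocate("wake up", owner="parent")
registry.allocate("annoy 1", owner="child")
registry.allocate("annoy 2", owner="child")
registry.disable_all("child")
print(registry.survey())
```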


