ChatGPT o3 is the Real Deal
ChatGPT o3 (the full version, not the mini version I tested earlier) came out just over a week ago. I used up my rate limit that day. It is the real deal, and I am blown away.
I have been an LLM skeptic for a while. But this semester of paternity leave I decided that learning to use LLMs would be a goal, and I've been trying to use them more and more. And I'll be honest: I get a lot of value out of them for life stuff (I take photos of my plants and ask for advice on plant care, I have them draw pictures of rockets for my son, etc.), but I don't use them that much at work. I don't like the code they produce, they're not great at talk or paper feedback, and so on. I use them for learning stuff (how does SimplifyCFG work?) but I'm not, like, using them to write code.
But o3 is different, and it can do highly nontrivial software engineering projects all on its own, to the extent that I am now very concerned about things like how much Chrome lags when rendering extremely long chat logs.
Task 1: Refactoring Code
In the Herbie project we have some old bits of code that work, but are pretty ugly and could use some refactoring. I have found that asking o3 to refactor the code doesn't work: it takes me from code that I don't understand to different code that I don't understand. But asking it to make refactoring suggestions works great, and its suggestions are detailed and executable.
For example, there's a file called report-page.js, written by Zane Enders. It's great code that I use daily, but since I didn't write it I don't know my way around it, and now I want to clean it up a bit, add features, and so on. Since it's not my code, anywhere I look I see things that might be worth changing—but it's hard to know where I can start and make incremental progress.
Solution: feed the whole file into o3 and ask it for refactoring advice. It immediately identified a couple of key places where I could deduplicate stuff: debugging harnesses that weren't needed any more, duplicate helper methods in the buildRow function, and so on. I went through with about five suggestions from o3's list of ten, and three of them were easy wins that I immediately committed. That's not a perfect hit rate (30% or 60%, depending on how much work you think my selection of five did) but it's pretty good, and a lot of my own refactoring ideas, even for code that I know well, don't work out. And, critically, I'd never seen this code before! It took only a minute to get a couple of useful refactoring ideas, which I could then execute to learn the code base and also, as a byproduct, improve it.
Ok—but perhaps you don't trust Zane. (You should, he's good!) So I applied the same idea to my own code. I uploaded two files in Herbie's code base, compiler.rkt and egg-herbie.rkt, and asked for advice on both. Both are performance-sensitive; the first is quite short while the second is extremely long, over a thousand lines of code. I asked o3 for advice on performance improvements.
It gave me something like five or six ideas, most of which didn't seem that promising to me. But two of those ideas struck me as great ideas, a kind of "aha" moment. The first was using thunks in the compiler to reduce decoding overhead; I actually liked this idea so much I immediately sat down to implement it, and I still think I should have thought of it earlier. Unfortunately it didn't lead to a speedup, but still! The other idea was a way to fuse two phases of egraph type splitting into one. This one is pretty tricky to execute so I haven't tried it yet, but the point—that these two phases are potentially fusable—hadn't occurred to me and seems novel and useful.
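Roughly, the thunk idea looks like this (a minimal sketch in Python rather than Racket, with made-up names, not Herbie's actual compiler code): instead of decoding every intermediate result eagerly, wrap the decode step in a zero-argument closure, so the cost is only paid for values that are actually inspected.

```python
def decode(raw):
    # Stand-in for an expensive decoding step.
    return float(raw)

def eager_results(raw_values):
    # Pays the decoding cost for every value up front.
    return [decode(r) for r in raw_values]

def lazy_results(raw_values):
    # Returns thunks; decoding happens only when a thunk is called.
    return [(lambda r=r: decode(r)) for r in raw_values]

thunks = lazy_results(["1.5", "2.5", "3.5"])
print(thunks[0]())  # only this one value is ever decoded
```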
The upshot is that while I'm still not using o3 to write code here, in its advisor role it is quite capable.
Task 2: Web Browser Engineering Exercises
Impressed, I immediately wanted to test o3 on my web browser textbook, which, by the way, is now available on Amazon and similar sites. I use this textbook as the basis of an undergraduate course, CS 4560. In the course, we go through the first ten chapters of the book, and each week the students must:
- Merge all the code in that chapter of the book into their personal web browser
- Solve four specific exercises in that chapter, which I think of as two easy exercises, one medium exercise, and one hard exercise.
There are unit tests for everything, and if you pass the tests you get full marks.
To test o3 on this task, I copied the full text of Chapter 1 into the prompt and asked o3 to collect all the code into a single file. That passed the Chapter 1 unit tests on the first shot. I then fed it the four exercises I assigned that week, giving o3 exactly the description I gave students (which is basically the same as what's in the book). These weren't one-shot, but when a test failed I copied the output of the test runner into o3 and asked it to fix it. In Chapter 1, the four exercises took two, two, three, and one try to get right. I also read the code; the solutions were fine, certainly superior to what the average student does.
I repeated this with Chapter 2. Here merging was a little harder (it's merging into an existing code base, and there's more refactoring going on), but it again passed the tests on its first try. The first three exercises were solved in one, two, and three shots. On the last exercise, on emoji, I tried to make it do something I don't make students do—resize the emoji images in code instead of by hand—and it struggled with that. Its second attempt did, however, pass once I resized the image by hand.
By the way, there's always a worry that the model is doing this from memory, having seen either the textbook or solutions somewhere online. There's no way to know for sure, but I queried it about things like "what is a chapter listing", "how is the code organized in Chapter 3", "what does CSSParser.body do", and so on, and I didn't see any evidence that it knew the book; all its answers seemed like intelligent guesses based on function names and general knowledge of how web browsers work.
Task 3: Software Verification
I was talking about this with Zach and our discussion sort of intersected with the formal methods class that Zach is set to teach next year. That reminded me of my formal methods class, CS 6110, which I last taught pre-COVID. For that class I eventually wrote a final project outline where students build a mini-Dafny for a subset of Python called Verified Python. The mini-Dafny checks a bunch of syntactic rules, performs type checking, builds weakest preconditions, and ultimately verifies functional correctness and termination for programs of about the complexity of quicksort.
The assignment was, I thought, pretty difficult even for a month-long final project in a grad class. I'm pretty sure I could do it, but I think it would take me several days, and to be clear I've never actually done it. (I would if I ever assigned it, of course!)
So I started feeding each step into o3, or eventually into o4-mini-high when my rate limit ran out. Over the course of about two hours, o3/o4 wrote ~350 lines of Python code which completed something like the first half of the project:
- Check a long list of syntactic conditions of purported Verified Python code
- Perform type checking
- Flatten nested subexpressions
- Generate weakest preconditions for assignments, conditionals, and loops
I ran all the code, including testing a couple of example programs (absolute value, a counter, etc.), and it works: the flattening is correct, the type checking works, and the weakest preconditions are correct.
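For concreteness, here's what the core weakest-precondition rules look like, as a minimal sketch over a toy tuple representation of propositions. This is not the o3 code (which used Python ASTs for propositions); all names here are illustrative.

```python
def substitute(prop, var, expr):
    # Q[expr/var], built functionally: returns a new proposition and never
    # mutates its input (which matters below).
    if prop == var:
        return expr
    if isinstance(prop, tuple):
        return tuple(substitute(p, var, expr) for p in prop)
    return prop

def wp_assign(var, expr, post):
    # wp(var := expr, Q) = Q[expr/var]
    return substitute(post, var, expr)

def wp_if(cond, wp_then, wp_else):
    # wp(if cond: S1 else: S2, Q) = (cond => wp(S1, Q)) and (not cond => wp(S2, Q))
    return ("and", ("implies", cond, wp_then),
                   ("implies", ("not", cond), wp_else))

# Example: wp(x := x + 1, x > 0) is x + 1 > 0.
post = (">", "x", "0")
print(wp_assign("x", ("+", "x", "1"), post))  # ('>', ('+', 'x', '1'), '0')
```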
Here o3 wasn't working totally autonomously. For one, I wrote a bunch of test programs and was running its code and shuttling outputs back to it. I also had to help it out in one case with a bug it couldn't solve: for WP generation for assignments, it had written a substitution function that mutated the proposition being substituted, and when it came time to do WP generation for conditionals, that led to incorrect weakest preconditions. There were also some minor problems in type checking that I had to help it find. But that all also gave me the chance to actually read its code, and while the code isn't, like, great—I particularly hated that it chose to use Python ASTs to represent propositions instead of Z3 terms, and another really annoying habit was adding lots of boolean fields like in_so_and_so instead of refactoring functions—it did basically work and honestly is better than most student code.
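To make that substitution bug concrete, here's a hypothetical minimal version of it (using a dict instead of the Python ASTs o3 actually used): because both branches of a conditional start from the same postcondition object, mutating it while handling one branch corrupts the input to the other.

```python
post = {"op": ">", "args": ["x", "0"]}        # Q: x > 0

def subst_mutating(prop, var, expr):
    # Buggy: rewrites prop's children in place and returns the same object.
    prop["args"] = [expr if a == var else a for a in prop["args"]]
    return prop

# wp for "if c: x = x + 1 else: x = x - 1" against Q:
wp_then = subst_mutating(post, "x", "x + 1")  # also rewrites Q itself
wp_else = subst_mutating(post, "x", "x - 1")  # substitutes into the already-rewritten Q
print(wp_else)  # {'op': '>', 'args': ['x + 1', '0']} -- but it should be x - 1 > 0
```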
Let me be clear about what I found so impressive about this. Yes, this is a course project, so an unusually "clean" project with unusually good signposting and incrementalization. (I hope; I spent a lot of time on that!) But still—here was o3 owning basically a complete 600-line codebase, making steady progress on something that would take junior PhD students a week or two, and finally finishing, I think, a pretty damn impressive project! A mini-Dafny for a little "imp" language! This does not make o3 a super-coder, nor does it mean I am letting it loose autonomously on my code bases, but when Tyler says it is AGI I start asking myself whether junior PhD students qualify as GIs, or whether it's only us faculty that can really count as intelligent.
Unfortunately, there are always bottlenecks, and in this case I hit a pretty bad one that made it hard to complete the project. Specifically, I first ran out of o3 credits (you get, I think, 50 queries a week) and then, when I switched to o4-mini-high, I ran into the issue that my chat had gotten long enough that Chrome was visibly lagging with incremental layouts. I can probably keep going by starting a new chat and copying in all the file contents, but at this point I have basically answered the question to my satisfaction: o3 can do even difficult, long-term assignments for graduate-level classes, and produce working code for them. It can debug decently, though it still needs help at times, and the code it produces is bad but comparable to student code.
What next?
At this point quite complex code is nearly free to produce. Yes, I've used ChatGPT before to help me write some ELisp or some log analysis / visualization scripts, but that's throw-away code. And I've used Cursor, and mostly found it not worth it because it got lost in large code bases. But here o3 is doing stuff that really is quite complex, something that would be impressive for a human to do, and it's doing it with really minimal effort.
o3 is way better than o3-mini, and probably better than o1 (I never had access, so I don't know). o4-mini is clearly better than o3-mini, though inferior to full o3. Look, the models will probably keep getting better. It really won't be long—it might even be now, I just need to figure out the workflow—before I relate to o3 the way I relate to students working on code I own, like Herbie.
I also have to think about teaching in the age of AI. My colleague David Johnson, who teaches Web Dev I with me at Utah, put it really well. He said there are two questions here:
- How can I assess students and have confidence I'm assessing their knowledge, not an AI's knowledge? This is presumably solvable using, like, more in-person exams and so on.
- Is the knowledge we're teaching actually worth knowing? Or is it obsoleted by AI?
Now, for web browsers, compilers, formal methods, all of that, I am confident the knowledge is still worth teaching. I think these classes teach conceptually and practically important topics, and the coding in those classes makes students into better programmers.
But for Web Dev I, I am not so sure. If there's anything the AIs are good at, it's web development, and they're especially comfortable in the domain-knowledge-heavy world of crazy web APIs and languages. Do students really need to know how to write CSS? Yes, yes, but do they? I am no longer sure. That's not to say we shouldn't teach Web Dev I, but maybe the class should focus more on debugging tools? Maybe it should start students with a large code base to extend? Maybe we should move faster and get to Web Dev II topics in the first semester? I am not sure, but it's something I'll be thinking hard about over the summer.