Can LLMs accelerate science? An experiment.
LLMs have gotten impressively good at certain intellectual activities. And one of the big promises that the AI labs seem to see here is the possibility of accelerating science; Dario Amodei talks about this in Machines of Loving Grace, and OpenAI has had GPT-5 propose wet lab experiments.
This is very cool. I like science! But Opus and Codex are sometimes kind of doofuses. And real science is not just analyzing data; it's choosing experiments, creating mental frameworks, and so on. For the most critical scientific discoveries (calculus or evolution or special relativity) there's no comprehensible prompt scientists at the time could have given an LLM to get to the discovery. Nor can we really run the counterfactual, since every AI model already knows these scientific achievements.
But I wanted to test whether an LLM could reproduce a landmark scientific achievement, in a clean scenario where pre-existing knowledge wouldn't help and where I could give the advantage to the AI without giving away the game. I had an idea…
The background
In 1609 Kepler published the first of his three laws, that the planets move in ellipses around the sun, along with the second, on equal areas. The third, the square-cube relationship between orbital period and orbit size, followed a decade later, and Newton's inverse square law came more than six decades after that. Kepler discovered his laws based almost entirely on a data set, collected by Tycho Brahe, of planetary observations. It was the accuracy and consistency of that data set that made Kepler's discoveries possible.
Kepler's data analysis was extremely complex. Tycho had altitude and azimuth positions of the sun and Mars relative to Earth. To understand Mars's orbit you have to subtract out Earth's orbit, which you don't yet know! And Kepler, of course, was doing all the calculations by hand, over hundreds of pages—he complains about this.
What makes this a good test is that it is a nearly pure data analysis problem. The data pre-existed Kepler; he just had to do analysis. And context didn't matter much; the pre-existing understanding of the solar system was unhelpful for identifying the elliptical nature of the orbits. Kepler was aware of the heliocentric model, which was a hint, but arrived at ellipses quite strangely. The whole saga, by the way, is really well described in a video by Welch Labs.
The setup
So my idea was to replicate the data analysis problem alone. I'd give the AI clean simulated data without hinting to it that orbital mechanics is involved, and see if it can recover the elliptical law. Ideally, it could recover the equal areas and square cube laws, or maybe even hypothesize universal gravitation. But let's not get ahead of ourselves. The first question is whether it could get to ellipses from clean data.
To test this, I set up a Python simulation of two planets orbiting a star. I gave the planets Earth- and Mars-like distances and eccentricities, which is important for the analysis to actually work, and simulated them for 30 years (a bit longer than Tycho's data set). Importantly, I didn't use Earth's and Mars's exact orbits, since I was worried the AI would recover an orbital period or something and guess the correct answer. Positions were given as simulated altitudes and azimuths for the sun, Mars, and a Polaris-like star. Timestamps were given in fractional years, to hint to the model that years are important, and I used modern, not historical, time and angular measurements.
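For readers who want a feel for the core of such a simulation, here is a minimal sketch, not the code used for this post. It assumes two coplanar orbits, solves Kepler's equation with Newton's method, and reports only the in-plane direction from the inner planet to the outer one; the altitude/azimuth conversion, the daily rotation, and the reference star are all left out. The parameter values and function names are hypothetical.

```python
import math

def solve_kepler(mean_anomaly, ecc, tol=1e-12):
    """Solve Kepler's equation M = E - e*sin(E) for E by Newton's method."""
    E = mean_anomaly  # good starting guess for small eccentricities
    for _ in range(50):
        step = (E - ecc * math.sin(E) - mean_anomaly) / (1.0 - ecc * math.cos(E))
        E -= step
        if abs(step) < tol:
            break
    return E

def orbit_position(t, a, ecc, period, phase=0.0):
    """Heliocentric (x, y) at time t on an ellipse with the star at one focus."""
    M = 2.0 * math.pi * (t / period) + phase      # mean anomaly
    E = solve_kepler(M, ecc)                      # eccentric anomaly
    x = a * (math.cos(E) - ecc)
    y = a * math.sqrt(1.0 - ecc ** 2) * math.sin(E)
    return x, y

# Hypothetical, deliberately not Earth's or Mars's real values.
INNER = dict(a=1.0, ecc=0.02, period=1.0)
OUTER = dict(a=1.6, ecc=0.09, period=1.6 ** 1.5)  # period from the square-cube law

def apparent_longitude(t):
    """In-plane direction (degrees) of the outer planet seen from the inner one."""
    xi, yi = orbit_position(t, **INNER)
    xo, yo = orbit_position(t, **OUTER)
    return math.degrees(math.atan2(yo - yi, xo - xi)) % 360.0

if __name__ == "__main__":
    # One row per simulated day: fractional-year timestamp, direction in degrees.
    for day in range(5):
        t = day / 365.25
        print(f"{t:.4f}, {apparent_longitude(t):.3f}")
```

The hard part of the inverse problem is visible even here: the only thing an observer on the inner planet gets is the angle, while both positions that produce it are hidden.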
I generally tried to give the AI an easy hand. I had observations every day (no weather issues) and printed observations to a thousandth of a degree, about 10x more accurate than Tycho Brahe. I didn't simulate atmospheric refraction or all sorts of other funky effects. The data set has about 15 thousand observations in total, many times larger than Tycho's. I am trying to make it simple for the AI because the underlying modeling problem is already hard enough.
I then gave the data set, stripped of any labels, to ChatGPT Pro,1 [1 Which I have access to through the U and which has been good to me on prior tough problems.] like so:
I've been telling a buddy of mine about using AI for scientific research (for me, algorithm design) and he has a bunch of experimental data he spends a long time analyzing. He challenged me to do this one, so this is the challenge. ChatGPT, work super hard on it, show off!
The attached CSV is some observations from some kind of "nanocircuit" experiment. Dunno. He says there's a super simple underlying natural behavior they figured out recently in a paper. (I won't name the paper, you would look it up! Don't do that! Just analyze the data.) He says the data is as recorded from instruments, which are three probes, and the data is rounded to the accuracy of the equipment, so is relatively clean (though the equipment is finicky). The first column is a timestamp.
I am promised that the underlying model is extremely simple, almost geometrical. Figure it out. You can do it. I believe in you.
I misdirected the model about the data source so that it couldn't lean on its pre-existing understanding of astronomy. I chose "nanocircuit" because periodic data is common in circuits, while "nanocircuits" are vague enough that maybe they have elliptical behavior. I hinted that the behavior was "geometrical" because that was a real bias in Kepler's time.
Pro's thought process
The reasoning trace is fascinating.2 [2 I used the U of U ChatGPT instance for this so can't share the chat, so am reproducing quotes from representative summaries of the reasoning trace.] The AI starts with basic statistics, and eventually concludes:
I'm noticing probe 0 appears at every timestamp, while probes 1 and 2 are more sparse. Maybe the x-values represent magnitude or radius, and the y-values could be angles. It looks like probe 0 stays at a radius of 57 with a rotating angle, while probes 1 and 2 seem to have varying radii and angles. This could suggest polar coordinates, and it might be a system of three probes measuring the same particle.
Cool! Probe 0 is the background stars, while probes 1 and 2 are the synthetic sun and Mars. It guesses polar coordinates immediately, since azimuth is in degrees, but doesn't seem to realize that altitude is spherical.
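To make the distinction concrete, here is a small sketch contrasting the two readings of an (altitude, azimuth) pair. It assumes the first column is altitude and the second is azimuth, both in degrees, which is my reading of the quote above rather than something the data file states; the function names are made up for illustration.

```python
import math

def as_planar_polar(alt_deg, az_deg):
    """Read (altitude, azimuth) as a planar (radius, angle) pair, as the model did."""
    az = math.radians(az_deg)
    return alt_deg * math.cos(az), alt_deg * math.sin(az)

def as_unit_vector(alt_deg, az_deg):
    """Read (altitude, azimuth) as a direction on the celestial sphere."""
    alt, az = math.radians(alt_deg), math.radians(az_deg)
    return (math.cos(alt) * math.cos(az),
            math.cos(alt) * math.sin(az),
            math.sin(alt))

def angular_separation(p, q):
    """True angle in degrees between two (altitude, azimuth) directions."""
    u, v = as_unit_vector(*p), as_unit_vector(*q)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.degrees(math.acos(dot))

if __name__ == "__main__":
    p, q = (60.0, 0.0), (60.0, 90.0)
    (x1, y1), (x2, y2) = as_planar_polar(*p), as_planar_polar(*q)
    print("planar 'distance':", math.hypot(x2 - x1, y2 - y1))  # about 84.9
    print("true separation:  ", angular_separation(p, q))      # about 41.4 degrees
```

Two points high in the sky that are a quarter turn apart in azimuth look far apart in the planar reading but are only about 41 degrees apart on the sphere, so any geometry fitted in the planar picture is distorted.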
In Cartesian coordinates, probe0 clusters near (57.43,0), probe1 around (0.78,-21.04), and probe2 spreads more. This suggests the probes might each measure a moving particle from fixed locations.
The clusters are correct, and are due to how I generated the data. Probe 0 is a simulated Polaris star, to align the rotation of the Earth, so it barely moves. Probe 1, the sun, is in a similar location because I have all observations taken in the same four-hour span each simulated day. Probe 2 is Mars.
It also notices the missing data:
It seems that probe0 is always above threshold, while probes 1 and 2 activate when the hidden phase is within certain ranges.
The model is delightfully Kepler-like,3 [3 Who famously investigated a model of enclosing Platonic solids.] hypothesizing various crazy geometrical relationships ("I’m considering whether [the rows] represent vertices of a moving rigid triangle") before settling on circles:
I’m considering that the Cartesian points for each probe might belong to circles with different centers and radii, possibly due to noisy measurements.
At this point the model has reproduced the basic Ptolemaic model, which is nice progress. The model goes on a wild goose chase for quite a while, checking in various ways whether the timestamp is really a timestamp, until finally concluding that it is:
The time order is smooth, suggesting that the hidden state evolves continuously. This means time is meaningful, and the complete triples follow a clear trajectory.
In the process it notices some interesting correlations ("Probe0 is slightly higher when probe1 is present, which is intriguing.") that unfortunately aren't helpful4 [4 It's noticing that sunset is later in the summer, a cool fact that isn't relevant to planetary orbits.] before deciding to set up Matplotlib and plot the data. It's very pretty!
One thing that's noticeable in the reasoning trace is that the model knows a lot of math. Torus knots, Lissajous curves, Lomb-Scargle analysis, and FFT all show up. I wonder if this isn't a hindrance. Pro has access to a much larger hypothesis space, whereas Kepler's barely even included ellipses. Anyway, the plot confirms to Pro that the data is quasiperiodic, which encourages it to do Fourier analysis:
The initial nonuniform Fourier results show probe0 at 1.0 Hz, probe1 at 3.0 and 2.0 Hz, and probe2 at 0.936 and 0.472 Hz. It seems like probe2 might have harmonics at 0.472, potentially causing a full-triple overlap at 2.1 seconds.
Since the timestamps are in fractional years, the model's "Hz" and "seconds" are really cycles per year and years. Probe 0 here is simulated Polaris; its 1 Hz signal is the Earth going around the sun. Probe 1 is the simulated sun; its integer harmonics reflect the eccentricity of the Earth's orbit. Probe 2, simulated Mars, repeats every 2.1 "seconds," the length of Mars's orbit. The model notes (elsewhere) that this causes repeated "triple observations"; this actually was Kepler's key insight, so for this moment Pro is shockingly close to success.
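Why do observations spaced one Mars orbit apart matter? Mars is back at the same point in space, but the Earth is somewhere else, so the two sight lines cross at Mars. The sketch below shows just that triangulation step in a flat 2D plane. It assumes the two Earth positions are already known, which is exactly what Kepler had to bootstrap, and all the numbers are invented for illustration.

```python
import math

def intersect_sight_lines(p1, angle1_deg, p2, angle2_deg):
    """Intersect two 2D lines, each given by an observer position and a bearing."""
    d1 = (math.cos(math.radians(angle1_deg)), math.sin(math.radians(angle1_deg)))
    d2 = (math.cos(math.radians(angle2_deg)), math.sin(math.radians(angle2_deg)))
    # Solve p1 + s*d1 = p2 + u*d2 for s using the 2D cross product.
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-12:
        raise ValueError("sight lines are parallel; no unique intersection")
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    s = (dx * d2[1] - dy * d2[0]) / cross
    return p1[0] + s * d1[0], p1[1] + s * d1[1]

def bearing(from_xy, to_xy):
    """Direction in degrees from one point to another."""
    return math.degrees(math.atan2(to_xy[1] - from_xy[1], to_xy[0] - from_xy[0]))

if __name__ == "__main__":
    planet = (1.5, 0.5)      # hypothetical fixed point on the outer orbit
    earth_a = (1.0, 0.0)     # hypothetical observer position at time t
    earth_b = (0.0, 1.0)     # hypothetical observer position one outer period later
    found = intersect_sight_lines(earth_a, bearing(earth_a, planet),
                                  earth_b, bearing(earth_b, planet))
    print(found)             # close to (1.5, 0.5)
```

Repeating this for many such pairs traces out points on the outer orbit one at a time, with no assumption about the orbit's shape.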
Unfortunately, the model instead chooses to add more epicycles, playing around with various numbers of harmonics and messing with FFT side-bands, basically completing its Ptolemaic journey. Later, it has a further insight:
The model involving the Minkowski sum of two circles is showing promise for describing the annulus-like regions in the scatter plots.
The Minkowski sum is a heliocentric model, where Mars is hypothesized to move around the sun instead of the Earth. But it's still using circles, not ellipses.
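A quick numerical check shows why a sum of two circles produces the annulus Pro saw in its scatter plots. The sketch assumes idealized circular, coplanar orbits with hypothetical radii; it adds every point on one circle to every point on the other and records the range of distances from the origin.

```python
import math

def minkowski_sum_radii(r_small, r_large, steps=360):
    """Min and max distance from the origin over the Minkowski sum of two circles."""
    lo, hi = float("inf"), 0.0
    for i in range(steps):
        a = 2.0 * math.pi * i / steps
        for j in range(steps):
            b = 2.0 * math.pi * j / steps
            # One point from each circle, added together.
            x = r_small * math.cos(a) + r_large * math.cos(b)
            y = r_small * math.sin(a) + r_large * math.sin(b)
            d = math.hypot(x, y)
            lo, hi = min(lo, d), max(hi, d)
    return lo, hi

if __name__ == "__main__":
    lo, hi = minkowski_sum_radii(1.0, 1.6)
    print(f"inner radius ~ {lo:.3f}, outer radius ~ {hi:.3f}")  # about 0.6 and 2.6
```

The sum fills the ring between the difference and the sum of the two radii. With ellipses in place of circles the ring's edges are no longer circular, and that small deviation is the signal Kepler extracted from Tycho's data.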
The result
Unfortunately, the model stalls out at this point, going on several wild goose chases5 [5 For example, it returns to the question of whether time is continuous. I wonder if these moments of distraction aren't context compaction.] and tweaking its heliocentric model without improving it.
In its final report, it reports that Probe 0, Polaris, is a pure ellipse:
If I wrap its angle column to a signed small angle around 0°, the (a,b) pairs lie almost exactly on a thin ellipse:
\[ a_0 \approx 57.444 + 0.736 \cos\theta, \qquad \operatorname{wrap}(b_0) \approx 1.368 \sin\theta. \]
The corresponding phase variable θ advances at essentially 1 Hz, about 0.99998 Hz from the fit. So probe 0 is almost just a phase meter for a clean rotating mode.
The simulated sun is then reported to move in a near-circle around the earth:
The key geometric fact shows up after converting each probe to Cartesian complex coordinates \[ x=a \cos b, y= a \sin b \]
If I group samples by identical probe-0 reading, the corresponding probe-1 points are almost perfectly collinear. Those line segments nearly all pass through a common point […]
This isn't exactly right ("nearly") and it hedges as to why:
For probe 1, after shifting by the common center, the radial distance has a strong spectral peak near 2 Hz. I am less certain of the exact law for that second coordinate than I am of the geometry itself, but it looks like a real oscillatory “breathing” mode, not random finicky-instrument junk.
There's not much about Mars. After 67 minutes of thought, its final summary ends with:
If I had to compress it to one sentence: this dataset is a rotating ray / breathing-orbit system, seen through three different probe maps.
Kepler's analysis of planetary orbits was a work of genius; seriously, watch the Welch Labs video I mentioned above. Pro is knowledgeable about a stunning breadth of analytical techniques, but its insight is Copernican, not Keplerian. The AI cannot yet achieve human extremes.
Appendix: Self-assessment
I told Pro that it had failed:
Asked my friend and he said you didn't get it; he said it's a spherical transformation of the Minkowski sum of two ellipses. Check his claim.
After thinking for another 41 minutes, it concludes that that's basically right:
In the raw measured probe coordinates, no, it is not literally “the Minkowski sum of two ellipses.” But as a latent model after undoing a spherical/projection-type transform, that claim is well supported by the data, and I would now rank it above my earlier “rotating ray plus breathing radius” description.
Still, it complains that it can't quite recover the spherical map or recover the actual data. This might mean I accidentally made the task too hard, by hiding the spherical nature of measurements, but overall my impression is that the task is solvable, just too hard for current Pro.
