S1E13 · April 24, 2026 · 44:22

GPT-5.5 vs Reality: Do Benchmarks Lie?


Tim Williams (host) · Paul Mason (host)


Show Notes

Tim and Paul dissect the GPT-5.5 launch, weighing its state-of-the-art benchmarks against real-world user vibes and token efficiency to decide whether the upgrade is worth the higher cost for developers running production workloads at scale. They also unpack the HTML-in-Canvas proposal, which promises to bridge the gap between the DOM and canvas rendering and unlock new possibilities for accessibility, interactive web graphics, and shader-driven transitions without fragile hacks. Finally, Tim shares results from a unique creative AI benchmark that tests model taste and planning, revealing surprising winners beyond the standard leaderboards, showing how far real-world performance can diverge from the spec sheet, and highlighting which models have the creative judgment to handle complex multi-step tasks without hand-holding.

Transcript

Tim Williams: Hey everyone, welcome back to Rubber Duck Radio. I'm Tim Williams...
Paul Mason: And I'm Paul Mason. Episode 13! Lucky number, right? Or, uh, unlucky, depending on how your week went.
Tim Williams: Ha! Honestly, I think it depends on which AI company you asked this week. But we'll get to that. Paul, how you doing? How was the week?
Paul Mason: You know, not bad, actually. Had one of those rare weeks where I shipped something, it worked the first time, and I didn't immediately find a bug five minutes later. So I'm, like, cautiously optimistic.
Tim Williams: Careful now. That's how the universe lures you into a false sense of security before dropping a production incident on a Friday at 4:55.
Paul Mason: Yeah, I'm already regretting saying it out loud. I probably just jinxed myself.
Tim Williams: Alright, so today we've got another packed show for you. There's a lot moving in the AI world right now, and honestly, some of it is actually moving in directions I didn't expect — which is refreshing, because usually it's just the same hype cycle on a different day. But first, let's get into it...
Paul Mason: Go ahead and start us off.
Tim Williams: So OpenAI dropped GPT-5.5 last week, and — here's the thing — the internet's been a mixed bag on this one. You've got the usual Reddit threads where people are like, "This is just GPT-5.4 with a fresh coat of paint," and the Twitter crowd's doing their thing with hot takes. But I actually spent some time with it this week, and... I think there's more here than people are giving it credit for.
Paul Mason: Yeah, I've been using it too, and honestly, same. Like, I saw the r/singularity thread where everyone's like, "benchmark improvements were lower than expected" — and sure, I get that. But benchmarks don't tell the whole story. What I noticed right away is just... it feels more coherent? Like, the responses are tighter. Less fluff.
Tim Williams: Right. So let's talk about what's actually under the hood here, because the branding is doing this thing where it sounds like a half-step — 5.5 — but internally, this is the first fully retrained base model since GPT-4.5. This isn't a fine-tune. This isn't a patch. They went back and retrained the whole thing.
Paul Mason: Which is a big deal. And the codename internally is "Spud," which — I don't know if that's self-deprecating or just OpenAI being OpenAI, but I kind of love it.
Tim Williams: A potato! They named their flagship model after a potato. You gotta respect it. So here's what's actually good — and I want to lead with this because the discourse has been so negative. Terminal-bench 2.0: 82.7%. That is state of the art. It narrowly beats Claude Mythos Preview on that benchmark.
Paul Mason: And BrowseComp hit 90.1% on the Pro tier. Like, that's not nothing. For web research tasks, that's genuinely impressive.
Tim Williams: Now, I do want to be fair here. SWE-bench Pro — which is the one most of us developers actually care about — Claude Opus 4.7 still wins that at 64.3% versus Spud's 58.6%. That's a 5.7-point gap. So if you're doing real-world GitHub issue resolution, Claude still has the edge. And that matters. That's not a nitpick — that's the benchmark that maps to actual dev work.
Paul Mason: I think that's the nuance people are missing. It's not that GPT-5.5 is bad — it's that it's better at different things. Where I noticed the biggest difference is just... token efficiency. OpenAI is claiming 72% fewer output tokens on equivalent tasks compared to Claude Opus 4.7. And from my testing, that's not just marketing fluff. The responses are genuinely more concise.
Tim Williams: Which brings us to the pricing conversation, because this is where it gets interesting. So the API pricing is $5 per million input tokens, $30 per million output tokens. That is twice the output price of GPT-5.4. On paper, that looks like a step backward. But if it's using 72% fewer tokens to accomplish the same task... the effective cost might actually be lower.
Paul Mason: Right. It's like buying a more expensive gallon of paint, but you need way less of it to cover the wall. The per-token price is higher, but the per-task price could be lower. And for anyone running these things at scale — like, actual production workloads, not just noodling around in ChatGPT — that matters a lot.
Tim Williams: Now, Claude Opus 4.7 is still cheaper on a per-token basis — $5 input, $25 output. So if your workload generates a ton of tokens and efficiency isn't saving you much, Claude's still the budget play. But for coding tasks where Spud can get it done in fewer tokens? The math might actually favor OpenAI.
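For a rough sense of the per-task math Tim is describing, here is a back-of-the-envelope sketch. Only the per-million prices and the 72% figure come from the discussion above; the token counts are made-up placeholders, so the output is illustrative only.

```javascript
// Illustrative per-task cost comparison using the prices quoted in the episode.
// The token counts below are hypothetical placeholders, not measured values.
const MILLION = 1_000_000;

const gpt55  = { input: 5, output: 30 }; // $ per million tokens (GPT-5.5)
const opus47 = { input: 5, output: 25 }; // $ per million tokens (Claude Opus 4.7)

const inputTokens      = 3_000;   // same prompt to both models (assumed)
const opusOutputTokens = 10_000;  // assumed baseline output length
const gptOutputTokens  = Math.round(opusOutputTokens * (1 - 0.72)); // 72% fewer -> 2,800

const costPerTask = (price, inTok, outTok) =>
  (inTok / MILLION) * price.input + (outTok / MILLION) * price.output;

console.log('Opus 4.7:', costPerTask(opus47, inputTokens, opusOutputTokens).toFixed(4)); // ~$0.2650
console.log('GPT-5.5 :', costPerTask(gpt55, inputTokens, gptOutputTokens).toFixed(4));   // ~$0.0990
// Higher per-token output price, but potentially lower per-task cost if the
// efficiency claim holds for your workload.
```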
Paul Mason: And they got it onto GitHub Copilot the very next day. That's — I mean, that's aggressive. One day turnaround. They want this model in developer hands immediately.
Tim Williams: Yeah, and here's what I think is the real story here. The timing. This dropped exactly one week after Anthropic launched Claude Mythos Preview. One week. That is not a coincidence. That is OpenAI drawing a line in the sand and saying, "We see what you did, and here's our answer."
Paul Mason: The AI arms race is real, man. And honestly? As a developer, I'm kind of here for it. Like, these companies pushing each other means better models for us, faster. The week-over-week improvements are genuinely remarkable if you zoom out.
Tim Williams: Agreed. Now, I do want to address the Reddit backlash honestly, because it's not all baseless. There's a thread on r/OpenAI where people are complaining about what they call "preformed personalities" — the model feeling like it has a fixed, almost scripted way of responding. And there's a lingering frustration from the GPT-5 launch where the autoswitcher broke and made the model feel dumber than it actually was. That's real. OpenAI has a trust deficit right now with power users.
Paul Mason: Yeah, and I think that's the split. The benchmark people see progress. The daily power users — the ones living in ChatGPT all day — they're feeling something different. They're like, "The numbers say it's better, but it doesn't feel better to me." And that disconnect is a real problem for OpenAI.
Tim Williams: But here's what I'd say — and I think this is the fair takeaway. GPT-5.5 is genuinely good. The retraining shows. The efficiency gains are real. The benchmarks back it up. It is not a potato. It's just... the vibes are off for some people, and OpenAI needs to figure out why. Because you can win all the benchmarks in the world, but if your most engaged users are canceling their subscriptions, you've got a problem.
Paul Mason: The moral of the story is... benchmarks tell you what a model can do. Vibes tell you what it's like to work with. And right now, Spud's got the skills, but it might need a personality transplant.
Tim Williams: Ha! Well said.
Paul Mason: Thank you.
Tim Williams: So this next one — I'm genuinely excited about this. Have you been following the HTML-in-Canvas proposal?
Paul Mason: Oh yeah. I've had the flag enabled in Chrome Canary for like two weeks now. I've been nerding out on it.
Tim Williams: Okay, so for anyone who hasn't seen this yet — this is a WICG proposal. Web Incubator Community Group. It adds three new primitives that let you render real, fully-styled HTML directly into a canvas element. Native. No html2canvas, no screenshot hacks, no libraries.
Paul Mason: Right. And the key thing is — it's not just rendering. The HTML elements stay interactive. They're still in the DOM, they're still in the accessibility tree, screen readers can still read them. You just get to draw them as pixels on a canvas.
Tim Williams: So let me break down the three pieces real quick. First, there's the layoutsubtree attribute — you add it to your canvas tag, and it tells the browser, hey, treat my children as real layout participants. They go through normal CSS layout, they can receive focus, but they're invisible until you explicitly draw them. Second, there's drawElementImage — that's the method that actually paints the element into the canvas context. And third, there's a paint event that fires whenever a child element's rendering changes, so you know when to redraw.
Paul Mason: And here's the part that blew my mind — it works across 2D, WebGL, AND WebGPU contexts. So you can take real HTML, with real forms and real text, and pipe it into a WebGL shader as a texture.
Tim Williams: This is the thing that developers have been hacking around for years. Every chart library, every game UI, every data dashboard that uses canvas — they all have this same problem. The moment you step inside a canvas, the browser's layout engine waves goodbye. You lose screen reader support, native text selection, proper bidirectional text, hit testing. Chart libraries either maintain a hidden DOM that mirrors the canvas — which is fragile — or they just give up on accessibility entirely.
Paul Mason: Yeah, and we've all been there, right? You're building some interactive dashboard or a game UI, and you need a tooltip or a form input inside your canvas, and it's just... pain. You're either faking it with canvas drawing primitives or layering DOM elements on top and praying the z-index stays in sync.
Tim Williams: So here's what I actually built with it. I threw together a quick analytics dashboard — you know, the kind of thing we've all built a hundred times. Bar charts, some KPI cards, a little filter form. Except this time the labels and the form controls were real HTML elements, drawn into the canvas. And I have to say — the iteration loop is addictive. You write normal HTML and CSS, call drawElementImage, and there it is. Inside your canvas. Looking correct. Because the browser is doing the rendering, not you.
Paul Mason: Same here. My experiment was simpler — I just wanted to see if I could put a styled form inside a WebGL scene. Like, actual input fields on a 3D surface. And it worked. Badly, at first — the transform synchronization is tricky. But when I got the transforms lined up... it's one of those moments where you're like, oh, this changes things.
Tim Williams: The transform thing is worth explaining. So when you draw an element into the canvas, the canvas's current transformation matrix gets applied. But the DOM element stays where it was in the page. So you have to synchronize — the drawElementImage method returns a CSS transform, and you apply that same transform to the DOM element so that hit testing and accessibility line up with what the user actually sees. It's clever, but it's an extra step you have to get right.
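A minimal sketch of the flow Tim just walked through, written against the names used in the episode (layoutsubtree, drawElementImage, and the paint event). The proposal is an early dev trial behind a Chrome Canary flag, so the exact attribute names, method signatures, and event targets may differ or change.

```html
<!-- Children of the canvas participate in normal CSS layout and the
     accessibility tree, but stay invisible until explicitly drawn. -->
<canvas id="chart" layoutsubtree width="800" height="400">
  <label id="filter-label" for="region">Region</label>
  <select id="region"><option>EMEA</option><option>APAC</option></select>
</canvas>

<script>
  const canvas = document.getElementById('chart');
  const ctx = canvas.getContext('2d');
  const label = document.getElementById('filter-label');

  function draw() {
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    // ...normal canvas drawing for bars, gridlines, etc...

    // Paint the live, styled element into the canvas (signature approximate).
    // The element stays in the DOM and remains focusable and readable by
    // assistive tech; only its pixels are composited here.
    ctx.drawElementImage(label, 20, 16);
    // If you draw under a canvas transform, the method is described as
    // returning a CSS transform to apply back to the element so hit testing
    // lines up with what the user sees.
  }

  // Redraw when a child element's rendering changes, per the paint event
  // described above.
  canvas.addEventListener('paint', draw);
  draw();
</script>
```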
Paul Mason: And if you mess it up, the click targets are in the wrong place. Ask me how I know.
Tim Williams: Ha! So Matt Rothenberg — he's been building some incredible demos with this. He did a focus ring effect where a warm glow follows your tab focus between form fields, all driven by a WebGL shader. And it composites correctly — the glow appears behind the form content but in front of the background. Because the shader controls the full composite. You literally cannot do that with CSS.
Paul Mason: His burn transition demo went viral too, right? The dark mode toggle where fire literally consumes the page?
Tim Williams: Yes! That one is wild. He renders both the light and dark themes as live HTML textures, and a shader blends them through this five-zone fire simulation — heat distortion, ember line, char zone, clean reveal, smoke. Two complete HTML pages existing simultaneously as canvas children, composited per-pixel through a shader. View Transitions gives you two snapshots and CSS animations. This gives you two live renders and a shader. That's a fundamentally new primitive.
Paul Mason: And Amit Sheen over at Frontend Masters wrote this massive walkthrough — pixel manipulation, displacement effects, shader pipelines — the whole thing. His takeaway was basically, this isn't just a new API, it's a new workflow mindset. You keep real HTML, real semantics, real forms, but the output becomes a visual playground.
Tim Williams: Right. So let's talk about where I think this actually lands in practice. Because the demos are stunning, but what are real developers going to do with this?
Paul Mason: Chart libraries. This is the obvious one. Every major charting library — Chart.js, D3, whatever — they've been reimplementing font measurement and text wrapping from scratch for years. Axes, legends, tooltips — all text-heavy, all layout-sensitive. With this API, they could just delegate all of that to the browser. And get accessibility for free.
Tim Williams: Totally. And I'd add game UIs. HUDs that are actual HTML — keyboard-navigable, screen-reader accessible, styled with CSS — but rendered inside your game canvas. No more layering DOM elements on top and praying.
Paul Mason: Creative tools, too. Think about Figma-like editors, design tools, anything where you need rich text editing inside a canvas. Right now you're basically building a browser inside a browser. This could let you just use the actual browser.
Tim Williams: Here's a more speculative one — theme transitions. Matt's burn demo points at something bigger. What if every theme switch, every page transition, could be driven by a shader? Not just CSS crossfades — real per-pixel transitions between two live DOM states. That could reshape how we think about navigation and state changes in web apps.
Paul Mason: I'd also say — video. The spec mentions readbacks, like sending to VideoEncoder. If you can render HTML into a canvas and then encode that canvas as video frames, you've got a whole HTML-to-video pipeline built into the browser. That's a big deal for anyone building presentation tools, video editors, or content creation platforms.
Tim Williams: Now, let's be honest about where this thing is today. It's behind a flag in Chrome Canary. Dev trial only. No other browser has signaled implementation yet. Mozilla's looking at it but hasn't taken a position. The API will probably change — Jake Archibald already has a handful of open issues about changedElements and hit testing edge cases.
Paul Mason: The sizing story is rough right now too. Canvas was never meant to have children — it doesn't auto-size based on content the way a div does. Amit Sheen called this the most undercooked part of the API. You need a ResizeObserver to sync the drawing surface with the element size, or your pixels get stretched on Retina displays.
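The sizing issue Paul mentions is handled with standard canvas bookkeeping today: watch the canvas with a ResizeObserver and scale the backing store by devicePixelRatio so drawn content stays crisp on high-DPI screens. A minimal sketch follows; the draw() call refers to the hypothetical redraw routine from the earlier sketch.

```javascript
// Keep the canvas's pixel buffer in sync with its CSS size so content drawn
// into it (including HTML drawn via drawElementImage) isn't stretched or
// blurry on high-DPI displays.
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

const observer = new ResizeObserver((entries) => {
  for (const entry of entries) {
    const { width, height } = entry.contentRect;
    const dpr = window.devicePixelRatio || 1;

    // Size the drawing surface in device pixels...
    canvas.width = Math.round(width * dpr);
    canvas.height = Math.round(height * dpr);

    // ...then scale the context so drawing code keeps using CSS pixels.
    ctx.setTransform(dpr, 0, 0, dpr, 0, 0);

    draw(); // hypothetical redraw routine (see the sketch above)
  }
});

observer.observe(canvas);
```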
Tim Williams: Right. And there are real privacy constraints — no cross-origin content, no system colors. The spec team is being very careful about fingerprinting vectors. Which is good! But it means there are things you simply can't draw.
Paul Mason: So here's my honest take. I think this is the most exciting web platform feature I've seen in years. Not because the demos are flashy — they are — but because it solves a real structural problem. Canvas and the DOM have been two separate worlds for the entire history of the web. This is the first proposal that actually bridges them in a way that preserves what makes each one good.
Tim Williams: I agree. And I love that Matt Rothenberg's takeaway from building all those demos was basically — the API is most compelling not when it enables crazy new effects, but when it makes something everyone's seen a thousand times feel dramatically better. Focus states. Theme transitions. The mundane stuff. The web has always been flat. This lets it have depth, but only when that depth serves a purpose.
Paul Mason: That's exactly right. It's not about making everything a shader demo. It's about finally being able to use the right tool for the right job inside a canvas, instead of rebuilding the browser from scratch every time you need a text input.
Tim Williams: Alright, if you want to try it — Chrome Canary, chrome://flags/#canvas-draw-element. The demos at html-in-canvas dot dev are genuinely worth your time. And the WICG repo on GitHub is actively looking for feedback. Go break things and file issues. That's how this becomes real.
Paul Mason: That's really cool stuff. I think people are sleeping on how big this could be.
Tim Williams: I agree. Alright — so, shifting gears here. This next segment is one I'm particularly excited about, and it actually ties right back into what we were just saying about benchmarks and vibes. So, you know I've been building that open source AI video editor, right?
Paul Mason: Yeah, the one on GitHub. I've been following that project.
Tim Williams: Right. So I got to a point where the agents had all the tools and context they needed to actually make compelling videos with very little user input. And I realized — this is actually a really interesting benchmark. Not the kind of benchmark where you run a test suite and get a number. A real-world benchmark where you give a model a creative, multi-step task and see if it can actually execute.
Paul Mason: So what was the task?
Tim Williams: Deceptively simple. I gave each model the same prompt: create a 30-second pitch video for the AI video editor project using exciting motion graphics. That's it. No hand-holding, no step-by-step instructions. The model had to fetch the GitHub page, understand what the project does, plan out a voiceover, create motion graphics using Remotion, produce a music track, and keep working until the whole thing was done.
Paul Mason: That's... actually a really hard task. Like, that's not just coding. That's design. That's planning. That's knowing when your own work looks bad.
Tim Williams: Exactly! And that's the point. Standard benchmarks measure can you write code. This measures do you have taste? Can you plan? Can you check your own work? And the results were — I mean, some of them genuinely shocked me. Let me start with the bad.
Paul Mason: Lay it on me.
Tim Williams: Grok 4.2. I was genuinely surprised how poorly it did. It made a decent plan — that part was fine. But then it just could not get the Remotion graphics working. Tried several times, kept failing, and then just... gave up. Fell back to using pre-baked graphics instead. And here's the kicker — it never used the visual tools to actually look at what it made. Just laid stuff out and moved on.
Paul Mason: Wait, it didn't even check its own work? That's like writing an essay and not reading it back before turning it in.
Tim Williams: Right! And it was outperformed by several desktop-grade models. Which — I mean, that's a rough look for a model that's supposed to be top-tier. MiniMax M2.7 was another disappointment. It could actually make the Remotion graphics, which is better than Grok, but it had zero design sense. Components overlapping on screen, terrible layout, and again — didn't use the visual tools to check.
Paul Mason: So it's like someone who can technically write the code but has no eye for what looks good. The craft is there but the taste isn't.
Tim Williams: That's exactly it. Now, on the surprisingly good side — Qwen 3.6 35b a3b. This is a desktop-sized model going up against the heavyweights, and I expected it to just fall apart. But nope. It built the Remotion graphics without issue, which makes sense given how much the Qwen team has focused on coding ability. The polish wasn't there — things intersected, layout wasn't great — but I was genuinely surprised it held up at this complexity level.
Paul Mason: A small model that can code but lacks design polish? That's still impressive for the size class. What about the bigger Qwen model?
Tim Williams: Here's where it gets weird. Qwen 3.6 27b — the dense model everyone's been raving about — it actually failed. Hit syntax errors on the Remotion code, fell back to baked-in graphics. Same failure mode as Grok. Though, subjectively, the layout and pacing were better than Grok's. So there's that.
Paul Mason: That's wild. The smaller model outperformed the bigger one on a creative coding task. You don't see that in standard benchmarks.
Tim Williams: Nope. Because standard benchmarks don't test this. Alright, now the genuinely good ones. GLM 5.1 — solid motion graphics, clearly above Grok and the smaller models. Needs polish, but the quality gap was noticeable. The one catch? The thinking on this model is... rampant. It took way longer than the other SOTA models to complete the task.
Paul Mason: So it's thorough but slow. The overthinker of the group.
Tim Williams: Ha! Yeah, that's fair. And then Kimi K2.6 — another genuinely good result, especially on a per-token cost basis. Solid motion graphics, and with some polish this model would clearly produce something superior. If you're watching your budget, Kimi is a serious contender.
Paul Mason: So the Chinese models are punching way above their weight class on cost. That tracks with what we've been seeing across the board.
Tim Williams: Yeah. And then there's Opus 4.7. Which — no surprise — stole the show initially. Incredible motion graphics, solid script, good all around. The only issue?
Paul Mason: The token cost.
Tim Williams: The token cost. A single task ran 10x more than any SOTA competitor. That's the Opus tax. You get the best, but you pay for it.
Paul Mason: Ten times. That's not a rounding error. That's a completely different budget tier.
Tim Williams: Right. But — and here's where it ties back to our earlier segment — I updated the article because GPT-5.5 came in and stole the crown from Opus.
Paul Mason: Wait, really? It beat Opus on your benchmark?
Tim Williams: It did. And it did something I didn't expect. Instead of sectioning out the Remotion graphics into separate pieces like every other model, GPT-5.5 created one long, detailed Remotion graphic. And the results are quite good. It took a different approach to the problem and it paid off.
Paul Mason: That's interesting because that's kind of what we were talking about earlier — the token efficiency thing. If it can do one well-crafted piece instead of managing a bunch of separate sections, that's actually smarter execution, not just more capable execution.
Tim Williams: Exactly. And I think that's what's missing from the benchmark conversations. SWE-bench tells you if a model can fix a bug. Terminal-bench tells you if it can navigate a terminal. But neither of those tells you whether a model has the kind of creative judgment to approach a problem differently — or the self-awareness to check if what it made actually looks good.
Paul Mason: Right. Your benchmark is basically testing for taste. And that's something we don't have a good way to measure yet.
Tim Williams: We really don't. And look, I'm the first to admit this is subjective. There's no objective score here. But I think there's value in that — because real-world use is subjective. When you ship a product, nobody's grading you on benchmarks. They're looking at the output and deciding if it's good.
Paul Mason: The moral of the story is... the models that win on benchmarks aren't always the models that win on taste.
Tim Williams: Well said. And honestly? The biggest takeaway for me was just how wide the gap is between models on these creative, multi-step tasks. It's not a minor difference. Grok and the Qwen dense model couldn't even complete the creative parts. The Chinese models delivered 80-90% of the quality at a tenth of the cost. And the top-tier models are clearly on another level — but you're paying for that level.
Paul Mason: I'd add one more thing. The fact that Qwen 3.6 35b outperformed the 27b dense model? That tells me we're still really bad at predicting how models will perform on novel tasks. The benchmarks say one thing, reality says another.
Tim Williams: Totally. And I think that's the real value of this kind of testing. It's not scientific, but it surfaces things that standard benchmarks just can't. I'm going to keep running these kinds of tests as I build out the video editor — because if nothing else, it's giving me a much clearer picture of which models I actually want to trust with real creative work.
Paul Mason: Makes sense.
Tim Williams: Alright, I think that's a good place to wrap it up for today. Let's quickly recap what we covered.
Paul Mason: Yeah, we packed a lot in. GPT-5.5 — the potato-named model that's actually really good but expensive, and the backlash is real but maybe overblown.
Tim Williams: HTML-in-Canvas — the most exciting web platform feature nobody's talking about yet. Still rough, but the potential is huge. Go enable that flag and play with it.
Paul Mason: And Tim's real-world model benchmarks — which proved that benchmarks don't measure taste, and the Chinese models are delivering serious value.
Tim Williams: And if there's one thread that connects all three of these topics? It's that the gap between what the spec sheet says and what actually works in practice is still really wide. Whether that's a model's benchmark score, a web feature's spec document, or a new API's pricing — you gotta actually build stuff with it to know what's real.
Paul Mason: That's exactly right. And honestly, that's kind of what this show is about, right? Cutting through the hype and getting to what actually works.
Tim Williams: Yeah. We're not here to tell you what to think — we're here to tell you what we've actually tried and what happened. Your mileage may vary. Always does.
Paul Mason: If you want to check out Tim's benchmark article, we'll link it in the show notes. It's on Medium — and the AI video editor project is open source on GitHub.
Tim Williams: Yeah, go star it. Or don't. No pressure. But if you want to run your own model tests, all the tools are there.
Paul Mason: And if you want to reach us, hit us up on the usual channels. We love hearing from you — especially when you've got a hot take that proves us wrong.
Tim Williams: Those are the best ones. Alright, thanks for listening, everyone. Until next time — keep building, keep questioning, and keep your rubber ducks close.
Paul Mason: Quack.
Tim Williams: Ha! See you next episode.

Related Projects

AI Charts

AI-powered flowchart, ERD, and swimlane diagram builder with a built-in AI assistant and an MCP server exposing 18+ tools for external AI integration. Works with any OpenAI-compatible LLM — no vendor lock-in.

Solo Developer · View project ->

AI Sound

AI-native audio editor built as a modern replacement for Audacity, with LLM integration at its core. Features multi-track editing, AI transcription, speaker diarization, semantic search, and a full MCP server for external AI assistant integration.

Solo Developer · View project ->

GTZenda

Enterprise document intelligence pipeline that ingests procurement data from AI agents, classifies and normalizes documents using LLM processing, and pushes structured data into a government sales intelligence platform. Built on AWS with SQS-driven async processing and OpenAI integration.

Lead Developer · View project ->

Therapy Buddy

Therapy Buddy is a cutting-edge, AI-assisted therapy application that lets patients and therapists collaborate on specialized therapy sessions.

Solo Developer · View project ->

Episode Details

Published: April 24, 2026
Duration: 44:22
Episode: S1E13

Technologies Discussed

OpenAI · JavaScript

Skills Demonstrated

Technology Evaluation · Architecture Planning