Gemini 3 Flash ruined my eval!

When I first launched my Connections eval (Aug 1st), Gemini 2.5 Pro was able to solve the puzzles with no sweat.

By August 7th, I had run quite a few models on it so I could compare the results – including Gemini 2.5 Flash – which wasn’t so good, with a 9% solve rate (1 out of 11 puzzles).

A little after this, my buddy Pedram optimized the prompts and I landed version 2.0 (October 1st). The Connections eval, as it exists today, has used the same prompts since that version. His optimization improved Gemini 2.5 Flash’s performance significantly, from 9% to 72% (more on this later).

When Gemini 3 Flash launched yesterday (December 17th, 2025), I figured I would test it out – you can see the results in the table above. 100% win percentage, with only 4 missed guesses. It was also roughly twice as fast – 27s per puzzle vs 54s – and half the cost. That is an insane improvement.

However, as I was evaluating the results, I discovered a flaw in my eval: there was too much variation in which puzzles were solved by which models. “No problem,” I thought – I knew some puzzles were harder, so I could rank them by difficulty and give only the hardest ones to my models. No problem, right?

Perhaps not. I gave Gemini 3 Flash every puzzle 10 times and it solved every single one, while sipping tokens, and ran 410 puzzles in 22m52s. Total cost: $3.03.
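The core of that ranking harness – run each puzzle many times, compute a per-puzzle solve rate, sort hardest first – can be sketched roughly like this. This is a minimal illustration, not my actual code; `rank_puzzles` and the stand-in `always_solves` solver are hypothetical names, and a real harness would call the model API where the solver function is plugged in.

```python
def rank_puzzles(puzzle_ids, solve_fn, runs=10):
    """Run each puzzle `runs` times with `solve_fn` (returns True on a solve)
    and return (puzzle_id, solve_rate) pairs, hardest (lowest rate) first."""
    rates = {}
    for pid in puzzle_ids:
        solved = sum(bool(solve_fn(pid)) for _ in range(runs))
        rates[pid] = solved / runs
    return sorted(rates.items(), key=lambda kv: kv[1])

# Stand-in solver for demonstration; swap in a real model call here.
def always_solves(pid):
    return True

# 41 puzzles x 10 runs = 410 attempts, mirroring the run described above.
ranking = rank_puzzles(range(41), always_solves, runs=10)
print(ranking[0])  # → (0, 1.0): every puzzle solved, so nothing ranks as hard
```

With a solver that never misses – which is exactly what Gemini 3 Flash turned out to be – every solve rate comes back 1.0 and the ranking is useless, which is the sense in which the eval is dead.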

My eval is dead, and Gemini 3 Flash killed it.

So how far have we come since Gemini 2.5 Flash? Now that I have a nice test harness for ranking puzzles, let’s run it for the older Gemini model.

Run time: 1h7m37s. Cost: $8.30. Accuracy? 60%. Incredible progress by the Gemini team in just 6 months.

So what’s next? I’m working on my BASED eval, which adds head-to-head games of Codenames to my set of evals. I suspect it will be much harder for the models to saturate this one!