Discussion about this post

Trelis Research:

Eric, really nice work and very clearly written. And very cost efficient!

A few Qs, if you don’t mind.

Did you test any text-only models (like OSS 20B)? I notice all three models you tested were multi-modal and am wondering how important that is.

In the “Score-weighted program selection” column, does “No” mean you sample uniformly at random, while “Yes” means you greedily take the highest train accuracy, using pixel match as a tie-breaker?
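To make sure I'm reading that column right, here is a minimal sketch of the two strategies I have in mind. All names here (Program, train_accuracy, pixel_match) are my guesses, not your actual code:

```python
# Sketch of the two selection strategies I'm asking about; the field
# names are assumptions on my part, not the post's implementation.
import random
from dataclasses import dataclass

@dataclass
class Program:
    train_accuracy: float  # fraction of training pairs solved exactly
    pixel_match: float     # fraction of output pixels matched

def select_no(programs: list[Program]) -> Program:
    # "No": sample uniformly at random
    return random.choice(programs)

def select_yes(programs: list[Program]) -> Program:
    # "Yes": greedily take the highest train accuracy,
    # with pixel match as the tie-breaker
    return max(programs, key=lambda p: (p.train_accuracy, p.pixel_match))
```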

In the library generation phase, you did one round. Was that also 5 programs per task?

I suppose you have to execute the whole library on every task to compute the scores, correct? It should be quick, but does that become a bottleneck?
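For concreteness, the re-scoring loop I'm picturing looks roughly like the following, i.e. |library| × |tasks| × training-pairs program executions per round. The `run` helper, the task layout, and treating programs as hashable source strings are all assumptions on my part:

```python
# Hypothetical re-scoring loop: every program in the library runs on
# every training pair of every task.
def score_library(library, tasks, run):
    scores = {}
    for prog in library:                  # |library| candidate programs
        for task in tasks:                # |tasks| tasks
            pairs = task["train"]         # (input, output) grid pairs
            exact = sum(run(prog, inp) == out for inp, out in pairs)
            scores[(prog, task["id"])] = exact / len(pairs)
    return scores
```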

The score is determined first by train accuracy and then by pixel accuracy. I assume that train accuracy is nearly always zero for the first round on a given task, so that means pixel accuracy must be doing all of the heavy lifting?
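If so, the lexicographic key degenerates in round one, e.g.:

```python
# If every candidate scores 0 on train accuracy in round one, the
# lexicographic key (train_accuracy, pixel_match) reduces to pixel_match:
candidates = [(0.0, 0.41), (0.0, 0.82), (0.0, 0.10)]
best = max(candidates)  # (0.0, 0.82) -- pixel match alone decides
```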
