Great list! I’ll definitely run your benchmark against Doctly.ai (our PDF-to-Mar...

themanmaran · 2025-03-14T22:26:23 1741991183

Hey, I wrote the Omni benchmark. I think you might be misreading the methodology on our side. Order on the page does not matter in our accuracy scoring. In fact, we only score JSON extraction as the measure of accuracy, which is order-independent.

We chose this method for all the same reasons you highlight. Text-similarity-based measurements are very subject to bias and don't correlate well with accuracy. I covered the same concepts in the "The case against text-similarity"[1] section of our writeup.

[1] https://getomni.ai/ocr-benchmark

kapitalx · 2025-03-14T23:38:53 1741995533

I'll dig deeper into your code, but from scanning your post it does look like you are addressing this. That's great.

If I do find anything, I'll share it with you for comments before I publish the post.

prats226 · 2025-03-14T22:09:15 1741990155

Bias wrt ordering is a great point. What we consider structured information in this benchmark is independent of its presentation (order, format, etc.), so it should be directly comparable. The benchmark does take that into account.

For example, if you are only converting, let's say, an invoice into markdown, you can introduce bias wrt ordering etc. But if the task is to find the invoice number, the total amount, and the number of line items with headers like price, amount, and description, then you can compare two outputs without a lot of bias. E.g. even if columns are interchanged, you will still get the same metric.
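To make the point above concrete, here is a rough sketch of field-level scoring on an extracted invoice. This is not the benchmark's actual scoring code; the helper names (`flatten`, `field_accuracy`) and the invoice fields are made up for illustration. Each leaf value is looked up by its path, so swapping the order of keys (columns) leaves the score unchanged.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    if isinstance(value, dict):
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return leaves
    if isinstance(value, list):
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}[{index}]"))
        return leaves
    return {prefix: value}


def field_accuracy(expected, predicted):
    """Fraction of expected leaf fields that the prediction reproduces exactly."""
    expected_leaves = flatten(expected)
    predicted_leaves = flatten(predicted)
    if not expected_leaves:
        return 1.0
    matched = sum(
        1
        for path, value in expected_leaves.items()
        if path in predicted_leaves and predicted_leaves[path] == value
    )
    return matched / len(expected_leaves)


# Hypothetical ground truth and two model outputs.
truth = {
    "invoice_number": "INV-1",
    "total": 100.0,
    "line_items": [{"description": "Widget", "price": 10.0, "amount": 2}],
}
reordered = {
    "total": 100.0,
    "line_items": [{"amount": 2, "price": 10.0, "description": "Widget"}],
    "invoice_number": "INV-1",
}
wrong_total = dict(truth, total=90.0)

print(field_accuracy(truth, reordered))
print(field_accuracy(truth, wrong_total))
```

The reordered output scores the same as an exact copy, while the wrong total costs exactly one field out of five.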

kapitalx · 2025-03-14T22:18:38 1741990718

Exactly. You still have to be explicit to remove bias, either by sorting the keys or by looking up specific keys. For arrays, I would say order still matters. For example, when you capture a list of invoice items, you should maintain their order.
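A minimal sketch of the "sort the keys" approach, assuming the extractions are plain JSON-compatible dicts (the `canonical` and `same_extraction` names are just for illustration). `json.dumps(..., sort_keys=True)` sorts object keys at every nesting level but leaves arrays in their original order, so key order is ignored while line-item order still counts.

```python
import json


def canonical(doc):
    """Serialize with sorted object keys; array order is left untouched."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))


def same_extraction(a, b):
    """True when two extractions match, ignoring key order only."""
    return canonical(a) == canonical(b)


# Hypothetical invoice extractions.
base = {"invoice_number": "INV-1", "items": [{"sku": "A", "qty": 1}, {"sku": "B", "qty": 2}]}
keys_shuffled = {"items": [{"qty": 1, "sku": "A"}, {"qty": 2, "sku": "B"}], "invoice_number": "INV-1"}
items_swapped = {"invoice_number": "INV-1", "items": [{"sku": "B", "qty": 2}, {"sku": "A", "qty": 1}]}

print(same_extraction(base, keys_shuffled))
print(same_extraction(base, items_swapped))
```

The first comparison treats the shuffled keys as equal; the second flags the swapped line items as a mismatch.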