I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
For this specific use case you can also try edgartools[1], a relatively recently released library that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.
XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.
I've wanted to do something like this for years. I might have to actually stop fiddling with the idea in my head and give it a real shot in 2025.
I'm curious - how does the design process go? Do you propose a design, do they usually have a pretty complete vision or do you have templates that they can take inspiration from?
I think it's appropriate to link directly to Kriesel's blog¹ or his talk, as that's about the scanner creating fake data and not about RCE. Though technically it too is not an OCR bug, as there's no OCR in JBIG2.
I wonder if OCR could be improved by adding a "language model" of sorts...
Like, sure, maybe it's hard to tell apart a "1", "i", or "l" purely visually, but if you knew it was supposed to be code, I'd suspect one could significantly improve the recognition accuracy if the system just worked in the probability of each confusable option given the preceding (and following) text.
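A minimal sketch of that idea, assuming a toy character-bigram count table stands in for the "language model" and that the confusable groups below are the ones the OCR engine mixes up (both are illustrative assumptions, not how any real OCR system works). It tries every combination of confusable substitutions and keeps the one whose neighbouring characters look most like the training text:

```python
from collections import Counter
from itertools import product
import math

# Hypothetical confusable groups: glyphs an OCR engine might mix up.
CONFUSABLE = {c: "1il" for c in "1il"}
CONFUSABLE.update({c: "0O" for c in "0O"})

def train_bigrams(corpus):
    """Count character bigrams, with ^ and $ marking the text boundaries."""
    counts = Counter()
    padded = "^" + corpus + "$"
    for a, b in zip(padded, padded[1:]):
        counts[a, b] += 1
    return counts

def score(text, counts):
    """Sum of log(count + 1) over the text's bigrams; higher means more plausible."""
    padded = "^" + text + "$"
    return sum(math.log(counts[a, b] + 1) for a, b in zip(padded, padded[1:]))

def correct(ocr_text, counts):
    """Try every combination of confusable substitutions, keep the best-scoring one."""
    options = [CONFUSABLE.get(ch, ch) for ch in ocr_text]
    candidates = ("".join(chars) for chars in product(*options))
    return max(candidates, key=lambda t: score(t, counts))

# Toy "code" corpus; a real system would train on far more text.
counts = train_bigrams("for i in range(10):")
print(correct("for 1 in range(l0):", counts))  # prints: for i in range(10):
```

The brute-force search is exponential in the number of confusable characters, so it only makes sense for short tokens; a real system would more likely combine the recognizer's per-character confidences with a proper language model and a beam search.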
To me the videos are absolutely mind blowing, congratulations on the launch.
It feels like a leap forward in an area of tech that has totally stagnated (e-ink). I think this device is tackling a few hard problems all at once. Well done and I wish you luck.
Please note (you will find info in these very pages) that they are tentative, made on prototypes. And they were apparently meant to be demonstrative of capabilities, not of real-life experience.
Edit: I misunderstood what I was reading in the link below; my original comment is here for posterity. :)
> From down in the same mail thread: it looks like the individual who committed the backdoor has made some recent contributions to the kernel as well... Ouch.
No that patch series is from Lasse. He said himself that it's not urgent in any way and it won't be merged this merge window, but nobody (sane) is accusing Lasse of being the bad actor.
1) No one has been proven to "be" anyone in this case. Reputation in OSS is built upon behaviour only, not identity. "Jia Tan" managed to tip the scales by also being helpful. That identity is 99% likely to be a confection.
2) People can do terrible things when strongly encouraged or, worse, coerced. Including dissolving identity boundaries.
The first problem can be 'solved' by using real identities and web of trust but that will NEVER fly in OSS for a multitude of technical and social reasons. The second problem will simply never be solved in any context, OSS or otherwise. Bad actors be bad, yo.