Hacker News | faxmeyourcode's comments

Snowflake and DuckDB are two flavors of SQL that allow things like trailing commas. My personal favorite feature is `GROUP BY ALL`.

    select 
      c1, 
      c2, 
      c3, 
      ...,
      c50,
      sum(c51),
    from 
      table
    group by all
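For engines without `GROUP BY ALL`, the same query has to repeat every non-aggregated column in the `GROUP BY` clause. Here is a minimal sketch of that long form, using Python's built-in `sqlite3` (which, as far as I know, accepts neither `GROUP BY ALL` nor trailing commas). The table `t`, its three columns, and the helper `explicit_group_by` are made up for illustration.

```python
import sqlite3

def explicit_group_by(table, group_cols, agg_exprs):
    """Build the long-form query that GROUP BY ALL abbreviates.

    Hypothetical helper: every non-aggregated column in the SELECT
    list is repeated in the GROUP BY clause.
    """
    select_list = ", ".join(list(group_cols) + list(agg_exprs))
    grouping = ", ".join(group_cols)
    return (
        f"select {select_list} from {table} "
        f"group by {grouping} order by {grouping}"
    )

# Tiny in-memory table standing in for the 51-column one above.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (c1 text, c2 text, c51 integer)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [("a", "x", 1), ("a", "x", 2), ("b", "y", 5)],
)

query = explicit_group_by("t", ["c1", "c2"], ["sum(c51)"])
rows = conn.execute(query).fetchall()
print(query)  # select c1, c2, sum(c51) from t group by c1, c2 order by c1, c2
print(rows)   # [('a', 'x', 3), ('b', 'y', 5)]
```

With 50 grouping columns the `group by` list is the part that gets tedious and error-prone, which is what `GROUP BY ALL` removes.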

I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

This is giving me hope that it's possible.


(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.


>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.

For this specific use case you can also try edgartools[1], a relatively recent library that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.

[1] https://github.com/dgunning/edgartools


I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.


If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.


isn't everyone on iXBRL now? Or are you struggling with historical filings?


XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.


I've wanted to do something like this for years. I might have to actually stop fiddling with the idea in my head and give it a real shot in 2025.

I'm curious - how does the design process go? Do you propose a design, do they usually have a pretty complete vision or do you have templates that they can take inspiration from?


force = mass * acceleration

work = force * displacement

work = mass * acceleration * displacement

lower mass, less work.


So my username is a little less ridiculous than I originally thought? :)

The fact that this can introduce OCR bugs into your C code is hilarious, and this is diabolical:

    #define one ( 4 - 3 )
    #define eleven ( 3 + 4 + 4 )

Source code is here https://github.com/lexbailey/compilerfax


> OCR bugs

Especially if your fax machine uses JBIG2 compression. See: https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-i...


I think it's more appropriate to link directly to Kriesel's blog¹ or his talk, as that's about the scanner creating fake data and not about RCE. Though technically it too is not an OCR bug, as there's no OCR in JBIG2.

¹: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...


I wonder if OCR could be improved by adding a "language model" of sorts...

Like, sure, maybe it's hard to tell apart a "1", "i", or "l" purely visually, but if you knew it was supposed to be code, I'd suspect one could significantly improve the recognition accuracy if the system just worked in the probability of each confusable option given the preceding (and following) text.
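A rough sketch of that idea, heavily simplified: instead of conditioning on the surrounding text, it uses a unigram table of token frequencies as a stand-in "language model". The confusable groups and the weights in `TOKEN_PRIOR` are made up for illustration, and `variants` and `best_guess` are hypothetical helpers, not part of any OCR library.

```python
from itertools import product

# Glyphs that are easy to mix up visually (illustrative, incomplete).
CONFUSABLE_GROUPS = ["1il", "0O"]

# Toy unigram "language model" for C tokens. Weights are invented.
TOKEN_PRIOR = {"int": 50, "if": 40, "i": 30, "1": 25,
               "while": 20, "else": 15, "main": 10}

def variants(token):
    """Yield every spelling reachable by swapping confusable glyphs."""
    options = []
    for ch in token:
        # Use the whole group if the glyph is confusable, else just itself.
        group = next((g for g in CONFUSABLE_GROUPS if ch in g), ch)
        options.append(group)
    for combo in product(*options):
        yield "".join(combo)

def best_guess(token):
    """Pick the variant with the highest prior; keep the OCR output
    when no variant is a known token."""
    best = max(variants(token),
               key=lambda v: (TOKEN_PRIOR.get(v, 0), v == token))
    return best if TOKEN_PRIOR.get(best, 0) > 0 else token

print(best_guess("whi1e"))  # while
print(best_guess("1nt"))    # int
print(best_guess("foo"))    # foo (unknown token, left alone)
print(best_guess("1"))      # i   (a "correction" that may be wrong)
```

The last line shows the catch: with these weights a lone `1` is silently rewritten to `i`, so a real system would need actual context (neighbouring tokens, grammar) rather than frequencies alone.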


This would also have a higher risk of introducing some nasty, hard to spot errors.

It's actually better for the compilation to fail than for the Clippy to make up something that is syntactically correct and compiles, but is wrong.


You might be right in a practical sense, but for an art project like this, maybe not?


Need a proper preprocessor to take a code file and make it OCR-safe by substituting for dangerous glyphs.
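One way such a preprocessor could look, as a sketch only: a reversible escape pass that replaces each risky glyph before printing and restores it after OCR. The risky set (`1 l I 0 O`) and the `~X` escape codes are assumptions chosen for illustration. Unlike the `#define` trick, the escaped text is a transport encoding and will not compile until it is restored.

```python
# Hypothetical escape table: each risky glyph becomes "~" plus a letter
# that is not itself in the risky set. "~" escapes itself as "~T".
ESCAPES = {"1": "~A", "l": "~B", "I": "~C", "0": "~D", "O": "~E", "~": "~T"}
UNESCAPES = {code[1]: glyph for glyph, code in ESCAPES.items()}

def make_ocr_safe(source):
    """Replace every risky glyph with its escape code."""
    return "".join(ESCAPES.get(ch, ch) for ch in source)

def restore(encoded):
    """Undo make_ocr_safe: a "~" always introduces a one-letter code."""
    out = []
    chars = iter(encoded)
    for ch in chars:
        if ch == "~":
            out.append(UNESCAPES[next(chars)])
        else:
            out.append(ch)
    return "".join(out)

original = "long total = 0; for (int l = 1; l < 10; l++) total += ~l;"
safe = make_ocr_safe(original)

print(safe)
print(restore(safe) == original)  # True
```

Because `~` is itself escaped, decoding is unambiguous even when the source already contains tildes or the letters used in the codes.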


This might be a good reason to support trigraphs again! https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(progra...

edit: fixed link, copy paste fail dropped the ++


Amateur! Use a barcode font!


monospace font OCR-B


I would not touch another computer for a very long time.


Yep. I'd cut computer technology out of my life as much as I could.


To me the videos are absolutely mind blowing, congratulations on the launch.

It feels like a leap forward in an area of tech that has totally stagnated (e-ink). I think this device is tackling a few hard problems all at once. Well done and I wish you luck.


> the videos

Please note (you will find info in these very pages) that they are tentative, made on prototypes. And they were apparently meant to be demonstrative of capabilities, not of real-life experience.

The Company says they will be updated.


Median _household_ income in California was right around $91k a few years ago.

https://www.census.gov/quickfacts/fact/table/CA/INC110222#IN...

After taxes and payroll deductions from your average paycheck you barely get $5,500 a month to take home...
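A quick check of that arithmetic, assuming the $5,500 is a monthly figure. Only the two numbers from the comment are used; the implied deduction rate is derived from them, not taken from any tax table.

```python
annual_income = 91_000     # median household income cited above
monthly_take_home = 5_500  # take-home figure cited above, assumed monthly

monthly_gross = annual_income / 12
annual_take_home = monthly_take_home * 12
implied_deduction_rate = 1 - annual_take_home / annual_income

print(f"monthly gross: ${monthly_gross:,.0f}")               # monthly gross: $7,583
print(f"implied deductions: {implied_deduction_rate:.1%}")   # implied deductions: 27.5%
```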


Edit: I misunderstood what I was reading in the link below, my original comment is here for posterity. :)

> From down in the same mail thread: it looks like the individual who committed the backdoor has made some recent contributions to the kernel as well... Ouch.

https://www.openwall.com/lists/oss-security/2024/03/29/10

The OP is such great analysis, I love reading this kind of stuff!


No, that patch series is from Lasse. He said himself that it's not urgent in any way and it won't be merged this merge window, but nobody (sane) is accusing Lasse of being the bad actor.


Lasse Collin is not Jia Tan until proven otherwise.


Speaking only hypothetically, but two points:

1) No one has been proven to "be" anyone in this case. Reputation in OSS is built upon behaviour only, not identity. "Jia Tan" managed to tip the scales by also being helpful. That identity is 99% likely to be a confection.

2) People can do terrible things when strongly encouraged or, worse, coerced. Including dissolving identity boundaries.

The first problem can be 'solved' by using real identities and web of trust but that will NEVER fly in OSS for a multitude of technical and social reasons. The second problem will simply never be solved in any context, OSS or otherwise. Bad actors be bad, yo.


No, he likely is not. But the patch series includes commits co-developed by Jia Tan, and lists Jia Tan as a maintainer of the kernel module.


Passive aggressive accusation.

This style of fake doubt is really not appropriate anywhere.


The referenced patch series had not made it into the kernel yet.


Also it may be a coincidence but JiaT75 looks a lot like Transponder 7500 which in aviation means hijacked...


Toss the turtle doesn't fit your aesthetic but the mechanics were unique and so fun to play as a kid

