No public code. This has been a long-running project for me. The last time I touched it (pre-LLM world), it had turned into a real Rube Goldberg machine. Hard to imagine anyone else putting up with it.
PDF to text (using either a Python or a Java lib), which is then turned into a "header" structure with dates and balances via configuration-driven regexes, and a "body" structure containing the transactions. The transactions themselves go through an EBNF parser to extract the date(s), narration, amount, and balance if reported. The narration text gets run against a custom merchant database for payee and categorization. It is a painful problem! The code is Clojure, so there is not much of it, and there are high-abstraction libraries like Instaparse that make it easy to use grammars as primitives. And the Rube Goldberg machine has yielded balance-validated data for me from half a dozen financial providers for the last several years.
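To give a sense of the pipeline shape only: the real thing is Clojure + Instaparse with per-provider configuration, but a loose Python sketch of the header-regex / transaction-parse / balance-validation steps might look like the following. Every regex, header label, and line format here is a stand-in for one hypothetical bank's layout, not the actual grammar.

    import re
    from decimal import Decimal

    # Stand-in header and transaction patterns for one made-up statement layout.
    HEADER_OPENING = re.compile(r"Opening balance\s+(?P<amt>[\d,]+\.\d{2})")
    HEADER_CLOSING = re.compile(r"Closing balance\s+(?P<amt>[\d,]+\.\d{2})")
    TXN = re.compile(
        r"^(?P<date>\d{2}/\d{2}/\d{4})\s{2,}(?P<narration>.+?)\s{2,}"
        r"(?P<amount>-?[\d,]+\.\d{2})(?:\s{2,}(?P<balance>[\d,]+\.\d{2}))?$"
    )

    def money(s: str) -> Decimal:
        return Decimal(s.replace(",", ""))

    def parse_statement(text: str) -> dict:
        # "Header" structure: opening/closing balances pulled via regexes.
        opening = money(HEADER_OPENING.search(text)["amt"])
        closing = money(HEADER_CLOSING.search(text)["amt"])
        # "Body" structure: one dict per transaction line that matches.
        txns = [m.groupdict() for m in map(TXN.match, text.splitlines()) if m]
        # Balance validation: the transactions must reconcile the two header figures.
        assert opening + sum(money(t["amount"]) for t in txns) == closing
        return {"opening": opening, "closing": closing, "transactions": txns}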
I have been incorporating local LLMs, running on an RTX 3090, into some other workflows of mine, and I hope over the summer to see if they can help simplify some of this one.
> like convert PDF bank statements into CSV transaction files
I've tried this recently and it's surprisingly difficult. Any pro-tips?
Extracting PDF tables while respecting cell positions seems almost impossible to do in a way that works in all cases (think borderless tables, whitespace-only cells, etc.).
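For illustration, a minimal pdfplumber sketch of why this gets fiddly (library choice, file name, and tolerance values are assumptions, not a recommendation): the default line-based detection only works when the table actually has ruled lines, and borderless tables push you onto the text-based strategies, which end up being tuned per layout.

    import pdfplumber

    with pdfplumber.open("statement.pdf") as pdf:
        page = pdf.pages[0]

        # Case 1: the table has visible rules, so the defaults can find it.
        table = page.extract_table()

        # Case 2: borderless table. Fall back to word positions, with
        # layout-specific guesses for the tolerances.
        if table is None:
            table = page.extract_table({
                "vertical_strategy": "text",
                "horizontal_strategy": "text",
                "snap_tolerance": 3,
            })

        for row in table or []:
            print(row)  # each row is a list of cell strings (or None for gaps)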
It is remarkably difficult and continues to provide a good example of the limitations of LLM-based systems.
In my case, I used Perl and exploited the fact that, for a given bank, the statements are consistently formatted. Further, PDF OCR conversion responds consistently to documents with the same formatting. With this combination, it is possible to extract the characters and numbers associated with transactions from the document, and then transform those extracted bundles of text into lines for a CSV file.
The caveat is that it works only for that bank, that "kind" of account (usually checking, credit card, or savings), and when using that specific document OCR tool. Within those constraints it is eminently reliable, but utterly non-transferable to the general case.
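In Python terms, just to illustrate the shape of it (the original is Perl, and "pdftotext -layout" below stands in for whatever OCR/extraction step is actually used), the date format, the regex, and the column layout all encode one specific bank's statement and will not transfer:

    import csv
    import re
    import subprocess

    # Extraction step: layout-preserving text dump of the statement.
    text = subprocess.run(
        ["pdftotext", "-layout", "statement.pdf", "-"],
        capture_output=True, text=True, check=True,
    ).stdout

    # One bank, one account type, one extraction tool: hence one rigid regex.
    TXN = re.compile(
        r"^\s*(?P<date>\d{2}/\d{2})\s{2,}(?P<desc>.+?)\s{2,}(?P<amount>-?[\d,]+\.\d{2})\s*$"
    )

    with open("statement.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "description", "amount"])
        for line in text.splitlines():
            if m := TXN.match(line):
                writer.writerow([m["date"], m["desc"], m["amount"].replace(",", "")])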
If you use the AWS Config setup for the organization (aggregator), you'll get an Athena-SQL-queryable inventory of all your resources from all organization accounts.
So finding out which account owns a resource can be as simple as, roughly: select accountId where arn = "x"
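A hedged sketch of running that through boto3: the table, database, column, and results-bucket names below are placeholders that depend on how your Config/Athena setup was created; only the start_query_execution call itself is standard.

    import boto3

    athena = boto3.client("athena")

    resp = athena.start_query_execution(
        QueryString="""
            SELECT accountid
            FROM aws_config_inventory          -- placeholder table name
            WHERE arn = 'arn:aws:s3:::example-bucket'
        """,
        QueryExecutionContext={"Database": "config_db"},                    # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
    )
    print(resp["QueryExecutionId"])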
As one of the people who commented on that thread, I found it really eye-opening to see the group dynamics between people who have experience in the domain vs. those who don't.
It definitely made me look at online debates a lot differently. Previously I thought good points could come from anywhere (which can still happen), but it turns out experience in the domain is usually way more relevant.
I guess it's similar to that effect (Gell-Mann amnesia) where, if you see news about a topic you don't know, you tend to take everything at face value, but if you happen to know the domain, you'll usually spot quite a few factual errors, which tends to discredit most of the reporting.
From this[1] PR, it seems to be related to "sensitive information in synthetic URLs", since this[2] article was just introduced alongside the advisory article.
So I assume the security incident is about those who:
> You have a synthetic monitor that contains sensitive information (like passwords or usernames) in a URL or script.
This is exactly the reason I'm sticking with the CAT S6x series phones and am willing to put up with mediocre performance/features, as far as smartphones go.
They've been the only ones that don't just turn off in really cold temperatures, even without babying them in warm pockets.
The general ruggedness is also pretty good to amazing, depending on how fragile your previous phones were. For example, mine survived a ~10 meter (~33 ft) drop onto rocks when everyone was convinced it was done for.
I've just been using regular hard drives for my RAIDs, with really long lifespans (think 7-10 years, and even then replaced just out of caution; they still work). I'm not sure whether this is also achievable with SSDs, or what to even look out for.
I've been buying Samsung drives for a few years and have had no problems. My general strategy is to stagger the buys, in order to /hopefully/ get drives from different manufacturing batches. However, the reason I'm using a RAID setup is so I don't have to stress over durability. If one or two drives do tank, oh well: buy new ones, clone the contents from one of the parallel drives, and go on with life.