
And this while pretty much all software that extracts structured data from PDFs throws away the embedded text and just OCRs the page; there are too many tricks with layouts and fonts to trust it.

I'm always surprised at how "generate PDF from Word" turns one word into 10 different print operations, each with just a single letter.

Or even the straight lines in a table: the lines along a table boundary get hacked into pieces. You'd think one line would be the ideal representation of a line, but who are you to judge PDF?


Designing Data-Intensive Applications by Martin Kleppmann. It's my "this is the book I wish I'd read when I started programming" book.


When you started? I mean it’s a good book but it would be wasted on beginners.


Maybe not when I started, but after years of hard lessons working with HDFS, Cassandra, and Spark (not to mention S3, Dynamo, and SQS), seeing all those lessons pinned down like butterflies in a display case made me jealous of anyone who found this book early.


I read this book in the first few years of programming professionally, and in my naïveté I was so eager to apply the patterns therein I missed out on many opportunities to write simple, straightforward code. In some ways it really hindered my career.

I don’t blame this on the book, of course; ultimately the intuition it helped me build has been very helpful in my work. With that said, as a particular type of feisty and eager young programmer at the time, I can now say I would have benefitted at least as much from a book titled “Designing Data-Unintensive Applications” :)


We’re about due for a new edition. I read it again just recently and there are certain sections I’d love to see updated; many things have changed in 7 years.


An absolutely excellent book! I learned so much going through it slowly with a reading group I started at my job.


If the customer wants a show, give them a show. Doesn't make it not theater.


Similar to 'If an article title poses a question, the answer is no', if an article promises a significant speedup of a database query, an index was added.


Deno charges for "inbound HTTP requests", so a DDoS can just query UUIDs until your checks start to bounce.


Ruby has a `NilClass` and the best/worst part of it is the to_s is "", to_i is 0, to_f is 0.0, to_a is [], to_h is {}

It's incredibly clean and convenient until you wake up one morning and have no idea what is happening in your code
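Roughly what that looks like in practice (the params/price bit is a made-up example, not from any real code):

  nil.to_s  # => ""
  nil.to_i  # => 0
  nil.to_f  # => 0.0
  nil.to_a  # => []
  nil.to_h  # => {}

  # which is exactly how a missing value slides through unnoticed:
  params = {}                          # imagine a request with no :quantity key
  price  = 9.99
  total  = price * params[:quantity].to_i
  puts total                           # => 0.0, no error, silently wrong

No NoMethodError, no nil check, just a quietly wrong number.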


I don't think there's anything wrong with that once you think about what the null element (or identity) is for a group, i.e. a set of elements together with an operation:

Integer, + => 0

Float, + => 0.0

Array, add => []

Hash, merge => {}

and so on.

I think maybe we can debate the operations/functions, but they make sense. For Integer, in some ways you can define almost all of the other operations you commonly use in terms of addition.

So even though nil is its own object, when you try to find its representation in another group, I find the result logical and expected.

Also, Ruby will not automatically try to coerce nil when not asked to do so; for example, 0 + nil will throw an error.
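A minimal sketch of that boundary (the error message text is from recent MRI and may differ slightly between versions):

  # nil has a default "empty" representation per type,
  # but Ruby only reaches for it when you explicitly ask:
  begin
    0 + nil
  rescue TypeError => e
    puts e.message              # => nil can't be coerced into Integer
  end

  p 0 + nil.to_i                # => 0, explicit conversion, so it's allowed
  p(nil.to_a + [1, 2])          # => [1, 2]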


Integers support addition and multiplication, taking maximums and minimums, and a few other semigroup operations. Do you want to define a different null element for each of them?
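To make that concrete in Ruby (inject here is just standard Enumerable#inject standing in for "combine a list with the operation"): each operation needs its own identity as the seed.

  [1, 2, 3].inject(0) { |a, b| a + b }                      # => 6, 0 is the identity for +
  [1, 2, 3].inject(1) { |a, b| a * b }                      # => 6, 1 is the identity for *
  [1, 2, 3].inject(-Float::INFINITY) { |a, b| [a, b].max }  # => 3
  [1, 2, 3].inject(Float::INFINITY) { |a, b| [a, b].min }   # => 1

nil.to_i only picks the additive one.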


No, I don't want to define a null representation for every possible combination of a set and an operation. Each developer can do that themselves if they see fit, for the operations where they want it.

But, for me, it makes sense to have a default representation for null (one that is not automatically coerced, only used when the developer explicitly asks for it) for one of the most common operations in that specific group.


Of course since it's Ruby you can just monkey patch those to_s methods to do whatever the hell you want, confounding anyone else working on your codebase.
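For instance, this perfectly legal bit of Ruby changes what every interpolated nil in the process looks like (a deliberately bad example, please don't ship it):

  # reopen NilClass and redefine its string conversion
  class NilClass
    def to_s
      "(missing)"
    end
  end

  puts "hello #{nil}"    # => hello (missing), interpolation calls to_s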

I love using Ruby when I'm the only one who will ever have to look at or touch it.


> the to_s is "", to_i is 0, to_f is 0.0, to_a is [], to_h is {}

I somehow can't help reading that as some sort of high school sports-cheer: "Gimme an S to the quote to the I to the oh to the F to the zero! Goooo Rubies!"


Those conversions make sense though. They all mean empty or null. It's what I would expect from a language like Ruby.


Integer 0 means empty with respect to a particular operation (addition), but it is not empty with respect to all operations (e.g. multiplication).


Indeed. One's greatest strength is also their greatest weakness.


I believe the Arduino project used only a single hidden layer, whereas the authors' quantization scheme allowed them to use multiple.


You will experience decline across many axes of your life as you age. An exercise that helps me accept this is to frame my identity in terms of things I will not lose.

- I like to hike, but what I really love is to be in nature and surrounded by trees.

- I like being funny and quick-witted, but what I really love is to laugh and see others laugh.

- I like to dance, but what I really love is to feel the rhythm of music in my body.

Obviously some of this is self-delusion (I'd also like to be young, strong and smart) but I find it helps.


Interesting to compare this to https://knock.app/blog/zero-downtime-postgres-upgrades discussed here https://news.ycombinator.com/item?id=38616181

A lot of the discussion boiled down to 'this is a lot of complexity to avoid a few minutes of downtime'. I guess this is the proof: just use AWS Data Migration Service, swap the DNS entries to go live, and live with 11 seconds of downtime.


There is no "just" about it. The absolute key takeaway is in "what we learned":

> We chose to use DMS because it was well supported by the GOV.UK PaaS and we could also get support from AWS. If we were doing a PostgreSQL to PostgreSQL database migration in the future, we would invest more time in trying alternative tools such as pglogical. DMS potentially added more complexity, and an unfamiliar replication process than what we may have found with other tools. This backs up what AWS say themselves on PostgreSQL to PostgreSQL migrations.

The message here is not "just use DMS".


Even AWS's own docs say to use the native tools when migrating from Postgres to Postgres [1]. They don't go into too much detail and point to pg_dump rather than pglogical, but it's interesting to see that they don't recommend using DMS for it.

[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source...


They do, but those recommendations are buried quite deep in the documentation, well behind all the marketing guff that suggests that DMS is all things to all people, and wonderful magic that is ideal for all situations.


Has anyone used https://cloud.google.com/database-migration/docs/postgres/qu... to do something like this? Does it work similarly to AWS DMS?


There are a lot of gotchas with using DMS (which seems to use pglogical under the hood). Since it’s not hardware-level replication, you can run into issues with large rows/columns/tables and it doesn’t really handle foreign keys. It may not handle some special data types at all. You also need to update the sequences after the migration or you’ll get errors about duplicate primary keys. You can also have issues if you don’t have proper primary keys, because it doesn’t always copy the entire row at once.
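The sequence fix-up is easy to forget; if you happen to be scripting it from Ruby it's roughly this (pg gem, with made-up connection details and table names; the real list would come from the catalog):

  require "pg"

  conn = PG.connect(dbname: "target_db")    # hypothetical target database

  # after the logical copy, bump each serial sequence past the highest
  # copied id so new inserts don't collide with migrated rows
  %w[users orders].each do |table|          # hypothetical table names
    conn.exec(<<~SQL)
      SELECT setval(
        pg_get_serial_sequence('#{table}', 'id'),
        COALESCE((SELECT MAX(id) FROM #{table}), 1)
      )
    SQL
  end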

If the databases are within the same AWS account, it’s likely easier to use hardware-level replication with global database or snapshots to do migrations if you’re ok with 4-5 mins of downtime.


There are many options available with PostgreSQL; you could also do a full physical backup plus WAL-level replication to keep the replica in sync AND get low downtime.

What might have driven their choice is that they wanted to upgrade from major version 11 to 15 during the migration process. That is only possible using logical replication. Otherwise you'd have to chain the upgrade process through each major version (and possibly the OS, because 11 is EOL on some architectures), and that is neither trivial nor quick.


Seems like the biggest edge they have over AWS/GCP/Azure is the ability to delay function calls to smooth demand. There's not much the other providers can do to match that.

