
The author offers no evidence for the claim that API management and security solutions are needlessly complex in order to create more business for themselves. I think it's much more likely that API management and security software has grown to address the more complex needs of the APIs they serve. It isn't 2010 anymore - handing out plaintext API keys that never expire isn't good enough for many products, and features like RBAC and IAM have become more necessary as more people use APIs to do more stuff.

Now let me go remind myself how OAuth works again...


Yes, scaling vertically is much easier than scaling horizontally and dealing with replicas, caching, etc. But that approach certainly has limits, shouldn't be taken as gospel, and gets way more expensive once you're dealing with terabytes of RAM.

I also find it very difficult to trust your advice when you’re telling folks to stick Postgres on a VPS - for almost any real organization using a managed database will pay for itself many times over, especially at the start.


Looking at Hetzner benchmarks, I'd say a VPS is quite enough to handle Postgres for an Alexa Top 1000 site. Once you approach the top 100, you'll need more RAM than what's on offer.

But my point is you won't ever hit this type of traffic. You don't even need Kafka to handle streams of logs from a fleet of generators in the wild. Postgres just works.

In general, the problem with modern backend architectural thinking is that it treats the database as some unreliable bottleneck, but that's an old-fashioned belief.

The vast majority of HN users and startups are never going to service more than a million transactions per second. Even a medium-sized VPS from DigitalOcean running Postgres can handle the load they'll actually see just fine.

Postgres is very fast and efficient; you don't need to build your architecture around problems you won't ever hit, prepaying a premium for a <0.1% peak that happens so infrequently (unless you're a bank and get fined when you miss it).


I work at a startup that is less than 1 year old and we have indices that are in the hundreds of gigabytes. It is not as uncommon as you think. Scaling vertically is extremely expensive, especially if one doesn’t take your (misguided) suggestion to run Postgres on a VPS rather than using a managed solution like most do.


It shouldn't be hard to handle that volume of indices on a dedicated server without breaking the bank.


I don’t get it. Presumably the pricing model didn’t change, so all you’ve done is push the burden of doing the math onto the user (or more realistically, hope they just don’t even bother?) If users are frequently estimating costs that are off by orders of magnitude, surely the correct response is to change the pricing model so it’s easier for customers to understand?


Once they’re using the product they can see their actual usage and cost metering. So they can either extrapolate that to larger scale or test it at scale for a short time to see hourly/daily cost and then extrapolate for the month or year.

In other words it’s not much of a burden and they get much more reliable information.


But they can still do that even if there's a cost calculator upfront. Removing the calculator simply obfuscates the ability to estimate cost with the happy justification that fault lies with the customers who all seem to be bad at estimation.


It's not a burden to actually spend time setting up the system? This is usually a non-trivial amount of work.


While I'm all for standards-based options, I think the fetishization of standards does a disservice to anyone dipping their toes into graph databases for the first time. For someone with no prior experience, Cypher is everywhere, and the Neo4j ecosystem ships implementations of a ton of common graph algorithms that would otherwise be huge pain points. AuraDB provides an enterprise-level fully-managed offering, which is table stakes for, say, relational databases. Obviously the author has a bias, but one of the overarching philosophical differences between Neo4j and a triple-store solution is that the former is more flexible; that plays out in their downplaying of ontologies (which are important for keeping data manageable but are also hard to decide on and iterate on).
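
To illustrate how low the barrier to entry is, here's a minimal sketch using the official Python driver (assuming a local instance, throwaway credentials, and a toy Person/KNOWS graph - all names are hypothetical):

    from neo4j import GraphDatabase

    # hypothetical local instance and credentials
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH p = shortestPath((a:Person {name: $src})-[:KNOWS*]-(b:Person {name: $dst}))
    RETURN [n IN nodes(p) | n.name] AS path
    """

    with driver.session() as session:
        record = session.run(query, src="Alice", dst="Bob").single()
        print(record["path"] if record else "no path found")

    driver.close()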


I can attest to that, or at least to the inverse situation. We have a giant data pile that would fit well onto a knowledge graph, and we have a lot of potential use cases for graph queries. But whenever I try to get started, I end up with a bunch of different technologies that seem so foreign to everything else we’re using, it’s really tough to get into. I can’t seem to wrap my head around SPARQL, Gremlin/TinkerPop has lots of documentation that never quite answers my questions, and the whole Neo4J ecosystem seems mostly a sales funnel for their paid offerings.

Do you by chance have any recommendations?


I think neo4j is a perfectly good starting point. Yeah, I feel like they definitely push their enterprise offering pretty hard, but having a fully managed offering is totally worth it IMO.


We have been using different things for text, images, and tables. I think it's worth pointing out that PDFs are extremely messy under-the-hood so expecting perfect output is a fool's errand; transformers are extremely powerful and can often do surprisingly well even when you've accidentally mashed a set of footnotes into the middle of a paragraph or something.

For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regards to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.
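
To give a rough idea, here's a minimal sketch of the kind of call we're talking about (assuming unstructured's PDF extras are installed; the file name and parameters are purely illustrative):

    from unstructured.partition.pdf import partition_pdf
    from unstructured.chunking.title import chunk_by_title

    # "auto" uses fast text extraction where possible and falls back to
    # OCR/layout detection only when a page has no extractable text
    elements = partition_pdf(filename="report.pdf", strategy="auto")

    # group elements into chunks that respect section boundaries before embedding
    chunks = chunk_by_title(elements, max_characters=1500)

    for chunk in chunks:
        print(chunk.category, chunk.text[:80])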

For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images - it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG that'd obviously be amazing.
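
For what it's worth, raster image extraction with PyMuPDF is only a few lines - a sketch with hypothetical file names (vector drawings are the part that remains unsatisfying):

    import fitz  # PyMuPDF

    doc = fitz.open("report.pdf")
    for page_index, page in enumerate(doc):
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha > 3:  # CMYK etc.: convert to RGB before saving
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(f"page{page_index}_img{img_index}.png")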

For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.
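
A minimal Camelot sketch for the simple ruled-table case (file name, page range, and flavor are assumptions, not recommendations):

    import camelot

    # "lattice" works when tables have ruling lines; "stream" is the fallback
    tables = camelot.read_pdf("report.pdf", pages="1-end", flavor="lattice")
    for table in tables:
        print(table.parsing_report)  # rough accuracy/whitespace diagnostics
        df = table.df                # pandas DataFrame of the extracted cells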

For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision)? Are you trying to present them alongside your usual RAG output? This may inform your methodology.


> I think it's worth pointing out that PDFs are extremely messy under-the-hood so expecting perfect output is a fool's errand

This.

A while ago someone asked me why their banking solution doesn't allow pasting payment amounts (among other things) out of PDFs, and insisted there must surely be a way to do it correctly.

Not with PDF. What a person reads as a single number may be any grouping of entities which may or may not paste correctly.

Some banks simply don't want to deal with this sort of headache.


How do you combine the outputs? Wouldn't there be data duplication between unstructured text and tables?


We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN as we want to optimize for novelty as well as relevance. That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?
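
To be clear about what the "simple in theory" version looks like: the obvious baseline - which we deliberately don't do, since we skip tables - is to flatten the table into text and embed that, e.g. (the toy table and model choice are arbitrary):

    import pandas as pd
    from sentence_transformers import SentenceTransformer

    # stand-in for a table you extracted (e.g. a Camelot DataFrame)
    df = pd.DataFrame({"Year": [2021, 2022], "Revenue": ["1.2M", "1.8M"]})

    # naive baseline: flatten header + rows to pipe-delimited text, then embed it
    rows = [df.columns.tolist()] + df.values.tolist()
    table_text = "\n".join(" | ".join(str(c) for c in row) for row in rows)

    model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small model
    embedding = model.encode(table_text)  # vector you could index alongside text chunks

Whether that flattened text is actually a useful search target is exactly the open question.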


Thanks for answering! In my case, I don't use RAG directly; rather, I post-process documents via LLMs to extract a set of specific answers. That's also why I asked about deduplication - asking an LLM to provide an answer from two different data sources (invalid unstructured table text vs. valid structured table contents) quickly ramps up errors.


You could make the exact same argument about locks (which are trivially defeated) - add up all the time that people spend unlocking doors and I'm sure you'd end up with a huge amount of time; add up all the money that people spend on locks and keys and I'm sure you'd end up with a huge amount of money. Should we logically conclude that locks are just a giant grift by Big Key? I'm sympathetic to the ickiness of captchas, but the paper never addresses the counterfactual of what might happen if there were no captchas at all.


Locks protect the user; captchas protect the company.


Escape Collective did an article on this (https://escapecollective.com/what-would-happen-if-everyone-i...), and while at first glance there's a lot to like about a "spec series" for cycling, ultimately it would definitely advantage/disadvantage some riders. Plus, people just want to race a bike that works for them, not some spec bike.


> And similar proposals like the Negative Income Tax would cost far less money and have none of the presented downsides.

It all depends on how you tweak the numbers; in theory a negative income tax and a guaranteed income cost exactly the same amount. A guaranteed income of $1200 taxed at a marginal rate of 50% is just the same as a marginal tax rate of -50% on an income of $400. That being said, there are some pretty big negative externalities to a negative income tax, in the sense that it further overburdens the tax system with knowing people's exact monthly income (assuming monthly payments), which is not at all straightforward for the poorest taxpayers, whom such a system would presumably be designed to help most.
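
To make the "tweak the numbers" point concrete, here's a toy calculation (every figure is made up purely for illustration: a $1200 guarantee with a flat 50% tax on earnings versus an NIT with a 50% rate around an assumed $2400 break-even point):

    # the two schemes pay out identical net incomes at every earnings level
    def guaranteed_income(earned, grant=1200, tax_rate=0.5):
        return grant + earned * (1 - tax_rate)

    def negative_income_tax(earned, break_even=2400, rate=0.5):
        # below break-even you receive `rate` times the shortfall; above it you pay tax
        return earned - rate * (earned - break_even)

    for earned in [0, 400, 2400, 5000]:
        print(earned, guaranteed_income(earned), negative_income_tax(earned))
    # e.g. someone earning $400 nets $1400 under both schemes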


They are mathematically the same, depending on the tax curves.

A negative income tax doesn't mean you get -50% of $400, it means your income starts negative. So someone making $0 gets like $1000 back (say by paying 20% over -$5000).


Contrary to some of the sibling responses, my experience with pgvector specifically (with hundreds of millions or billions of vectors) is that the workload is quite different from your typical web-app workload, enough so that you really want them on separate databases. For example, you have to be really careful about how vacuum/autovacuum interacts with pgvector’s HNSW indices if you’re frequently updating data; you have to be aware that the tables and indices are huge and take up a ton of memory, which can have knock-on performance implications for other systems; etc.
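
To give a flavor of the knob-turning involved, here's a minimal sketch (connection string, table, and column names are hypothetical, and the thresholds are illustrative rather than recommendations):

    import psycopg  # psycopg 3

    with psycopg.connect("postgresql://localhost/vectors") as conn:
        # HNSW index on a pgvector column; build parameters trade recall for memory/time
        conn.execute(
            "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
            "ON docs USING hnsw (embedding vector_cosine_ops) "
            "WITH (m = 16, ef_construction = 64)"
        )
        # vacuum far more aggressively than the default so dead tuples don't pile up
        # in a heavily-updated vector table
        conn.execute("ALTER TABLE docs SET (autovacuum_vacuum_scale_factor = 0.02)")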


I’d start with something very simple such as Reciprocal Rank Fusion. I’d also want to make sure I really trusted the outputs of each search pipeline before worrying too much about the appropriate algorithm for combining the rankings.
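
For reference, RRF itself is only a few lines - a sketch (k=60 is the constant from the original RRF paper, but treat it as tunable):

    from collections import defaultdict

    def reciprocal_rank_fusion(rankings, k=60):
        """rankings: list of ranked lists of doc ids, best first."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. fusing a keyword (BM25) ranking with a vector-search ranking
    print(reciprocal_rank_fusion([["a", "b", "c"], ["c", "a", "d"]]))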

