You don't need ML/AI, you need SQL (cyberomin.github.io)
286 points by cyberomin 7 months ago | 83 comments

> say a person bought a pair of shoes, sunglasses and a book. For their newsletter, we will include shoes, sunglasses and books. This was a lot more relevant than sending random stuff.

I agree with the general sentiment of the article, but this seems like a poor example, since a more sophisticated approach can add a lot of value to a recommendation system. How do you know whether a customer is likely to want more than one item in any of those categories? If they already purchased sunglasses, wouldn't they be more likely to purchase, say, a sunglasses case and/or sunscreen? If they purchased a book, do you recommend the same book again? And if not, how do you choose which book(s) to include?

Of course, you could technically still handle this in SQL with a bunch of CASE statements, but obviously that doesn't scale well across a wide range of products. The whole point of ML/AI in that use case is to scale that type of nontrivial decision making.

> but this seems like a poor example

In fact this is a perfect example of how NOT to do purchase-history-based suggestions, which unfortunately also seems to be how most companies do it. They see a big purchase (or search terms relating to one) and spam you with options for that purchase. But if I just bought a car, or a drone, or a laptop, then the last thing I want to see is ads for other cars or drones or laptops.

Even applying just a little intelligence and showing ads for accessories (floor mats? spare batteries? bluetooth mice?) would make things substantially more useful.

I used to see that a lot on Amazon. I just bought an electric toothbrush, why would I buy another? Haven't been shopping much lately so not sure if it's gotten any better. My example is from last year, and I remember mentioning this problem in an interview with Amazon 6 years ago, and I'm sure machine learning has been involved for longer than that. It's still easier to screw up a machine learning model than SQL.

How about if you can identify a smart phone in proximity to a store display, and later target ads at that user? That would get you a lot of people that didn't buy a product, but demonstrated interest in a category. I suspect someone has already figured out how to make this happen, but I haven't found explicit confirmation.

Obviously ML can add a lot of value here, but it's questionable to me whether it's trivial to build such a model with the available data, keep said model up to date, or train variations on it easily, cheaply and quickly enough to A/B test the result and ensure you're actually making any tangible difference.

So you know... I don’t think it’s unfair to say that for smaller vendors, the cost/effort of setting up a ML model may dwarf the fractional improvement it offers over just having one person doing human generated SQL queries.

The point is this isn't like machine vision or voice, where it's almost exponentially better than traditional approaches.

It’s just... a bit better. Which is worth it only if the fractional improvement pays for the setup cost.

It's not unheard of to see +10-30% in revenue when adding a recommender system [0]. The system described by the author is arguably more complex than a recommender system, since he has to develop, maintain and evaluate a set of rules that are not based on real data, but only on his intuition of what users want. GP gave good examples of how this could easily fail (do you always want to recommend items from the same category? If not, how do you know which other category to recommend?)

[0] http://citeseerx.ist.psu.edu/viewdoc/download?doi=

For newer or smaller firms the SQL approach makes sense. Both because there is less data, and it's less risky to implement. Once the context is fully fleshed out, then it's easier to move on to a ML recommender system. It's also easier to track improvement vs a benchmark.

All the big cloud providers are offering pre-trained models for currently popular AI/ML use cases, such as image labelling, face recognition etc. I think this will be the easiest way to apply AI/ML, combined with transfer learning so that the provider can pre-train the basic model and then provide a way for the customer to customize it further for a specific use case.

Of course this can also fail if the pre-trained generic models don't offer enough value and you end up having to develop your own models, but we'll see how it goes.

Btw, I've published a short Kindle book that aims to provide an overview of these pre-trained services currently available on various clouds, it can be found on Amazon by searching for AI ML Managed Services 2018. It attempts to save you the trouble of scanning through all the online documentation to find out what they do.

But the SQL approach at best only answers which item type to present the customer.

How does the query make a good decision on the specific item?

Speaking of their examples, and given they send promotional newsletters fairly infrequently, I'm sure there are many people around me who will buy multiple pairs of sunglasses, shoes, or clothes for different styles. I just don't want to be shown ads for toasters and bathroom stuff together. IMO you have to do research on which items can be matched together for each group of users.

Recommend items that are bought together. Other customers who buy these sunglasses could be shown the shoe and the book.

It's a lot less stupid than recommending that you buy the same item again and again.
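The "bought together" suggestion really is just aggregation over a self-join, no ML required. A minimal sketch using SQLite from Python (the table layout and product names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id INTEGER, product TEXT);
INSERT INTO order_items VALUES
  (1, 'sunglasses'), (1, 'shoes'), (1, 'book'),
  (2, 'sunglasses'), (2, 'shoes'),
  (3, 'sunglasses'), (3, 'sunscreen');
""")

# For a given product, count how often every other product
# appears in the same order.
rows = conn.execute("""
SELECT b.product, COUNT(*) AS times_together
FROM order_items a
JOIN order_items b
  ON a.order_id = b.order_id AND a.product <> b.product
WHERE a.product = 'sunglasses'
GROUP BY b.product
ORDER BY times_together DESC;
""").fetchall()
print(rows)  # shoes co-occurs twice; book and sunscreen once each
```

Notably, the self-join excludes the purchased item itself, so it never recommends the same thing back to the buyer.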

Nice post.

Here's a different way to think about the situation with current AI/deep learning: if the current upsurge of methodologies were getting close to general AI, it would be getting closer and closer to a hammer that really did let you treat everything as a nail. I.e., it would be general-purpose.

But I think I can say we're not seeing that, even though deep learning seems to be continually expanding the domains it can operate on. How is that? This OpenAI analysis is very eye-opening: "We're releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time (by comparison, Moore's Law had an 18-month doubling period)." Essentially, as a rather brute-force-y method, we have shown we can expand deep learning's impact to a larger and larger domain, but not at all in the fashion of human learning, where picking up a new trick isn't that much harder than the old one.

Maybe, in this process, a better algorithm that adjusts to new situations without increased costs will surface. But until then it seems new and old methods will need to coexist.


This doesn't sound related to the post, no? The post doesn't argue for general AI that learns like humans do, it only discusses the merits of AI as a whole vs hard-coded SQL queries and heuristics.

Good post, but I couldn't disagree more. Regardless of your business size, it will always be valuable to know information such as:

* How does every additional coupon-dollar affect the total amount a customer buys?

* What is the relationship between customer age and retention for my store?

* Does giving a customer more purchase options help or hurt their chances of making a purchase?

My experience is that each of these questions can be solved, in part, using 3 lines of Python code:

    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(X, y)  # X: feature matrix, y: the quantity of interest
Then look at the fitted coefficients (lr.coef_) of the model, and you have a rough idea of how different features are correlated with the outcome. Doing something like this in SQL sounds difficult. If you have data to interpret, it makes sense to use similar methods. I can't think of an example where you have data but refuse to look at it until your company is "bigger".

I overall agree that ML is needed over 'just SQL' in a lot of cases (though SQL + good visualizations / exploratory analysis can answer a lot of those questions qualitatively). I would also be careful with the linear model approach. Multicollinearity can hide how important a feature is (or reverse sign of a feature) when trying to use coefficients to interpret importance, so using a linear model like that isn't as straightforward as it seems.

As a workaround, you could look for high VIF to detect multicollinearity, use some sort of stepwise selection / penalized regression, or use something like relaimpo (https://cran.r-project.org/web/packages/relaimpo/index.html) - not sure of a Python equivalent - to judge overall feature importance in the model.
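For what it's worth, a rough VIF check doesn't need a dedicated package: the VIF of a feature is 1/(1-R^2) from regressing it on the remaining features. A numpy-only sketch on synthetic data (purely illustrative):

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing that column
    on all the other columns plus an intercept."""
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)  # nearly collinear with a
c = rng.normal(size=200)             # independent
vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and b blow up; c stays near 1
```

The usual rule of thumb is to worry once VIF climbs past 5-10.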

Ha - maybe a SQL layer is lurking behind the scenes, crafting the input variables that make your little Python script so powerful.

The premise for this article is wrong!

The author describes using SQL to pull facts from history; who was the number one customer the last week, who abandoned online orders and so on.

The premise should instead be how to fit a model onto your business data so that you better can guess who will be the number one customer next week, what (s)he will order and so on.

The problem that ML addresses is how to arrive at that model, under the assumption that you can use historic data to pick either model or parameterise a model.

SQL has its merits, as does the relational database model, but this has nothing to do with creating models (even though we are modelling the data itself). The author gives some examples that are, frankly, trivial.

But he has a good argument about namedropping "hot" technology when your business need does not incorporate distributed trust (blockchain), modelling behaviour (or some such) using ML, and so on.

Maybe I'm naive, but are there really people who want to hop on the AI bandwagon just to do mundane lookups like this?

When I worked with machine learning many years ago, we learned that it was no better than the heuristics already in place. The thing is, it's much easier to diagnose a well written and understood heuristic than a machine learning model.

Machine learning is usually seen as a magic black box by many people. So yes: there's a recent trend that seems to favor a machine-learning-first approach to solving very simple and mundane problems, because people feel they are missing some magic insight if they don't do it (FOMO).

For example, the author refers to a shopping newsletter where you personalize suggestions for certain products after a customer buys a particular product. This is very often a machine learning 101 example, but really there's nothing preventing you from writing those heuristics yourself - no ML involved (e.g. if a customer buys a pillow, suggest pillow cases).
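A sketch of what such hand-written heuristics might look like (the product pairings here are invented, not taken from the article):

```python
# Hand-written "complements" rules; in SQL terms this would just be
# a lookup table joined against recent purchases.
COMPLEMENTS = {
    "pillow": ["pillow case"],
    "sunglasses": ["sunglasses case", "sunscreen"],
    "electric toothbrush": ["replacement brush heads"],
}

def suggest(purchased_item):
    # Suggest accessories; never re-suggest the item itself.
    return COMPLEMENTS.get(purchased_item, [])

print(suggest("pillow"))       # ['pillow case']
print(suggest("trampoline"))   # []
```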

Machine learning does make sense for something like that if your website is Amazon, but it's definitely overkill if your website is an e-commerce site for house garments.

The funny thing is that usually you will end up writing those heuristics implicitly, since you need to label your data anyway.

Another fun thing is that you will learn a lot more about your customer base if you do the research and write those heuristics yourself, vs. having ML do them. A business that understands the behavior of its customers is much more competitive than one which delegates that understanding to a black box. I don't contend that ML has no valuable (even transformative) applications but it's not a substitute for personally understanding every detail of your market.

I don’t understand this at all. I’ve worked with a close to a hundred data scientists now and every one of them is an expert in the business problem they are trying to solve.

You can’t just throw an algorithm (even one like AutoML) at a problem and expect to be able to do magic with no knowledge of the domain. The technology simply doesn’t work like that.

Machine learning can be implemented in just a few lines of code using something like Amazon Sagemaker to manage the process. In some cases it can be less work than hand crafting dozens and dozens of business rules.

And just because you have a small website doesn’t mean you have to behave like you have a small website.


A few years ago I was called in to save a dying project. They had built some big Hadoop cluster, had consultants on site, etc.

End of the day, they were doing something similar to assessing fines on library books. I wrote a prototype in about 3 hours.

People mentioned resume building and FOMO, I'll add to this: funding and sales. Investors and corporate managers are into this hype as much as engineers - if not more, so AI/ML is a label people want to use to get more money.

Yeah, I think in most cases if an organization says they are doing AI then their user base is probably rather unsophisticated in regards to tech.

Yes, the hype is strong and people want to stick ML or Data Science on their resume. They'll try to do anything that seems like it might touch statistics with ML first and not even consider whether it's necessary.

Kafka, Kubernetes, and things like Spark and machine learning are basically the next stage of the "data is King" hype cycle that Hadoop was a few years ago.


The comment lists at least two "questions" that can be answered easily with SQL and a graph and even in ways that give more nuance than linear regression can capture.

They are, and they are getting VC funds to do it. They may just be on the bandwagon to get the funding, but they still need to 'do AI' in order to satisfy their investors.

I like to think I'm not too nitpicky about fonts, but that st ligature is incredibly distracting.

It's the second article I've seen here that uses it over the last few days, but I'm not sure if it's the same site or not.

Thanks for the feedback. Do you have any font preference?

FWIW, this is what I'm seeing: https://imgur.com/a/pbR0SCH

Please stick to a font already in my browser. My network is slow and webpages sometimes get stuck loading; the resources in your page took 40 seconds to load and your font didn't even load yet.

Turning this clause off in your style sheet turns off that awful st ligature:

font-feature-settings: "liga", "dlig";

So just remove that clause from your stylesheet and you'll be rid of that ligature.

I like ligatures when they're used well, but that one is just incredibly distracting.

Shame the advice is to turn them off completely...

Testing a bit more: just dropping "dlig" from the feature-settings declaration turns off the awful st ligature. So they don't all have to be turned off to get rid of that one. What I don't know is which other ones get turned off by dropping "dlig" from the declaration.
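So presumably the fix, if the site wants to keep the common ligatures, is just this (an untested guess at their stylesheet):

```css
font-feature-settings: "liga";
```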

I have no problems with the current one apart from the s and t connecting like that.

Was hard to put my finger on it, but now that you mention it, it's definitely something with the letter/line spacing or the font itself.

In modern-age Latin script, the only acceptable ligature is "fi" in a proportional type, because the top curve of the f usually comes very close to the dot of the i already. The rest are totally useless because we don't use movable type anymore.

Well, I also tend to enjoy programming ligatures, such as != becoming ≠ or => becoming ⇒. (but in a size proportional to the pair of fixed width characters, see https://github.com/tonsky/FiraCode )

But then I don't inflict them on anyone but myself and people who look over my shoulder...

That is why I disable custom fonts. Everything on the web is one of two fonts for me, sizes 9-14. I've been doing it for a decade and it makes websites so much better.

This post is downright bonkers. "We don't need ML/AI! Proof: a list of things you wouldn't use ML for."

There are so many problems you can solve with a neural network. Should Waymo ETL sensor data and do a WHERE NOT IN for bicyclists?

This blog post is pretty dismissive. Statistics software has been in use since the beginning; see SAS. Financial institutions, actuaries, etc., have been using these methods with SQL data as the input, and it's the only reason they're still in business.

If this blog post simply suggested hiring a BI Analyst in your startup, I wouldn’t disagree.

There's no logical equivalency between SQL and ML/AI.

SQL is a language that helps you retrieve the data you're looking for. ML/AI helps you predict the future (using past data).

Maybe this is directed towards product people? But it has SQL in the title so it can't be. I'm confused as to who the audience is here.

You're missing the point. Much of the "intelligence" that AI/ML is touted as solving could be accomplished through standard SQL queries on well normalized data. But no one will invest in a company because we have proper data structures and accurate SQL reports that we learn from.

That's been done for ages. The point of AI is to adapt to the world instead of having humans spend time understanding the problem set.

SQL can apply a human understood model to data points. AI lets us develop new models and adapt them.

AI lets us solve problems that have abstraction, or problems that change over time. You can't have SQL detect cats in an image or drive a car.

Cleaning the data, choosing the right ML algorithm, selecting the features, tuning the parameters, etc. - ML also involves a lot of human time spent thinking about the problem.

The title is often true, but that doesn't mean much by itself. And the same argument has been brought up again and again in the past as well.

What OP suggests, the so-called SQL approach, is basically a heuristic-based system. When done properly and carefully, it can of course work very well, and is indeed often used as a baseline model to bootstrap an ML system. However, eventually the rule-based system will hit a wall, and ML will be the savior of the day, pushing the metric further by a margin of 20-30%.

So yes, when you are small and have little data, ML is irrelevant. But the same could be said of too many things in the software industry; you probably don't need Docker/Big Data/Fancy JS either, if you are building a small-scale online store.

Choose your tech stack wisely based on your problem, but the title is needlessly sensationalized.

It's an odd one for me, because he sounds like he's trying to argue against the "when all you have is a hammer" approach to ML - but then goes on to describe SQL as his hammer.

For things like figuring out who your biggest customers are, SQL probably is the right tool for the job. Whale-spotting probably gives a decent bang per buck, and isn't particularly complex.

But when he gets onto recommendations, it starts to look like it's the author who's attached to the wrong tool for the job. His example of recommending sunglasses to people who buy sunglasses is terribly blunt. If someone in my locale, who doesn't regularly buy sunglasses, buys sunglasses, they're probably going on vacation - there's not much sun at home for them. Surely there's a whole raft of things someone excited for their summer holidays would impulse-buy, but the sunglasses they just bought are no longer on the list.

If ML can match them up with a "going on summer holidays" demographic, while BI wants to sell them the only thing we know they no longer need, it's no longer making a strong case for blunt instruments.

That is quite a leap, to call SQL a rule-based system. SQL is a standard query language that you can use to discover how attributes and values relate to one another within data.

> standard query language that you can use to discover how attributes and values relate to one another within data.

That doesn't make it not a rule-based system. There is no learning component in SQL.

I think this highlights a long-standing problem separate from the machine-learning/blockchain vs. tried-and-proven-technology debate: attempting to solve a problem by understanding it vs. seeking the simple solution to avoid thinking about it.

It's ironic that machine learning is the 'simple' option, but that seems to be the case at times, especially with the 'throw blockchain or machine learning at it' approach when a proper algorithm could do it far more efficiently. The funny thing is that both approaches have their place. If turning it off and on again fixes a rare issue faster than tracing every instruction down to machine code, you are better off restarting it occasionally - unless it is a critical application where doing so will cost millions of dollars or lives.

I like this article's focus on technology as a way of helping skilled people do their job more effectively. Why shouldn't a business owner be able to use Bash and SQL to run their business? Maybe the solution isn't new technology, but training people to use the old stuff.

This was exactly the point I was making. Thank you.

But you don't gain ML/AI know-how by doing SQL, nor do you discover previously unknown potential about your product by sticking to your usual toolset.

Not that I necessarily disagree with the OP, but I find it deeply uninspirational.

What's the difference between using ML/AI for problems traditionally solved by some other tool and using any other tool to solve the same problem unconventionally? Both can be "hacking". I guess my issue with this is the word "need", don't do what you need to do but what you want to do if you are looking for inspiration. After all, mankind never needed to leave the garden of Eden but left it anyway.

I think the OP just meant that you can get a lot done with databases queries and a bit of automation. There's no need to call that ML/AI.

Something I hate about the business side of the tech world: the fad-chasing.

From the article:

> I hear these days for you to close that funding round quickly and early enough, you must throw in “Blockchain” even if it has no relevance in the grand scheme of things. A while ago, it was Machine learning and Artificial Intelligence.

Right on. No, blockchain won't help you with your corrupt voting system. If you don't understand the technology, you can't reason about its applicability, and there are more buzzword-chasers than serious technologists.

A correct machine learning solution for a non-trivial problem is very likely to have higher complexity than a traditional approach known to work. For a toy or hobby problem, all's fair, but for a business application the added complexity can have significant impact on cost, time to market, etc.

Well, I know I don't need to install GPUs on servers I install Postgres on, so there's that cost.

You don’t need GPUs for machine learning.

In fact the majority of tools in this space are exclusively CPU based.

> But you don't gain ML/AI know-how by doing SQL, nor do you discover previously unknown potential about your product by sticking to your usual toolset.

If current ML/AI is the future and reveals more than anything else could, then it's logical for everyone to be piling onto it whether it's applicable at the moment or not.

If current ML/AI is just another tool, then it's reasonable to use it if and only if it's applicable. Sure, not doing ML means you don't get ML insights, but doing SQL means you get SQL insights. Back in the day, I recall clever queries could reveal interesting things, find outlier data and so forth. Certainly, you don't get the powerful ad-hoc statistical power that ML gives. But I suspect that power requires extremely large datasets.

I think it's going to be like websites/the internet back in the early 2000s. Everyone knew it was the future, but didn't know what to make of it. Many did it wrong and the profits (if there were any) didn't live up to expectations. Then the bubble burst. Lots gave up, but the survivors, and the ones that learned how to do it right, ended up with near-unassailable moats. Now a large portion of businesses have an internet/app presence, and it's seen as indispensable to their business.

There aren't very many DBAs practicing in modern shops, and devs don't seem to be too into SQL and delivering excellent SQL queries and schemas. It's its own skillset.

I would also call out the NoSQL hype train here.

NoSQL has its place, and largely its place is when SQL cannot tolerate the intensity of traffic or the size of the dataset. You can look at the Dynamo paper for an example of the engineering rationale.

Postgres can take enormous amounts of data at quite decent rates - without spending too much time on tuning even.

Usually I am joining many different data sets, many of which include some type of log data (sometimes petabytes in size, but usually a few TB). The logs are persisted to HDFS or S3, which is why Spark and Hive make such a nice way of doing work compared to something like Postgres.

Also, it's nice to plop JSON, Avro, CSVs, Parquet, or whatever data in storage and just query/join/analyze it. No need to put the story on hold because you are waiting for the Oracle DBA to increase space again.

Turns out that people are actually kinda smart - toss in some raw cycles to handle the mind-numbing bits and you can have a solid system that does smart things en masse.

I'm not sure about this article, but there is certainly scope for an article named "You don't need Blockchain, you need SQL/a Database"

The real problem is people (including the author of the article, apparently) think ML is necessarily some kind of ultra-complicated technique that needs a PhD and a GPU. But come on: in 80% of the cases where you could use ML, dead-easy techniques are more than enough.

I mean, the author talks about how SQL is good old 40-year-old tech. Meanwhile, one of the simplest ML algorithms, linear regression, is about 200 years old, even older (AFAIK) than Ada's program for Babbage's machine. It's very easy to understand and implement, and even Excel has it as a standard function.

Sure, linear/logistic regression or naive Bayes won't help you tag pictures with text à la Facebook ("this is a picture of a young man dancing in a red shirt"), but the vast majority of ML use cases are way easier anyway. So yes, most of the time you can easily find "talent" that will solve your ML problems. And if you really want to, you can implement it in SQL.
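To make that last line concrete: simple linear regression has a closed form built from plain averages, so it genuinely fits in one SQL query. A sketch using SQLite from Python (toy data where y = 2x + 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE points (x REAL, y REAL);
INSERT INTO points VALUES (0, 1), (1, 3), (2, 5), (3, 7);
""")

# slope = cov(x, y) / var(x); intercept = avg(y) - slope * avg(x)
slope, intercept = conn.execute("""
SELECT
  (AVG(x*y) - AVG(x)*AVG(y)) / (AVG(x*x) - AVG(x)*AVG(x)),
  AVG(y) - (AVG(x*y) - AVG(x)*AVG(y)) / (AVG(x*x) - AVG(x)*AVG(x)) * AVG(x)
FROM points;
""").fetchone()
print(slope, intercept)  # 2.0 1.0
```

Postgres even ships regr_slope()/regr_intercept() aggregates that do this directly.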

SQL is great, but I am still waiting for the successor to SQL. SQL was made for relational data, but a relational database with nested data structures - kind of like Postgres with jsonb, but designed in from the ground up - is what I'd really like to see.

Why? What problem do they solve that RE can't, other than putting the "consuming service" data structure into the database, instead of putting the data into the database and selecting appropriately? I've not seen a good justification for these things yet, other than "convenience" so the client "needs less code to structure the data", which is almost always a false saving.

Most data is relational. Even if you think it isn't relational, it probably is still relational.

Also if you think you have unstructured data but you need to interpret every bit of that data, then it's not unstructured.

kdb+ has a variant of SQL which is an evolution (not a revolution). The advances basically fall into three categories:

1. Shortcuts, such as "foreign key chasing" - i.e., if "a" in table x is a reference to field b in table y, then "a.c" is "select c from x inner join y on (x.a=y.b)". If you have a star schema, it cuts down queries and errors by 90% (and makes life simple for the optimizer). Of course, you can chase through as many tables as you wish in an expression, making item tables look a lot more like records.

2. Embracing order. The relational model has no ordering among tuples; SQL mostly pretends that's the case, but order does emerge through "ORDER BY / TOP" and "ROWID", though not very usefully. kdb+ embraces order and makes e.g. "first record that ..." very simple and intuitive (and also easier for the optimizer).

3. Embracing time series (not independent from embracing order) - when you have e.g. records with a "from .. to" validity range, it becomes exceedingly simple, as does "all records that are different from the previous one on this field".
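For comparison, standard SQL has since grown window functions that cover some of point 3; "all records that are different from the previous one on this field" becomes a LAG comparison. A sketch in SQLite (which supports window functions from 3.25 on), run from Python with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (ts INTEGER, status TEXT);
INSERT INTO readings VALUES
  (1, 'ok'), (2, 'ok'), (3, 'down'), (4, 'down'), (5, 'ok');
""")

# Keep only rows whose status differs from the previous row's.
rows = conn.execute("""
SELECT ts, status FROM (
  SELECT ts, status,
         LAG(status) OVER (ORDER BY ts) AS prev_status
  FROM readings
)
WHERE prev_status IS NULL OR status <> prev_status
ORDER BY ts;
""").fetchall()
print(rows)  # [(1, 'ok'), (3, 'down'), (5, 'ok')]
```

It's admittedly more verbose than the kdb+ equivalent, which is rather the point being made above.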

Look at IMS or CODASYL from the 70s, or xml/object databases from the 90s. They did what you’re asking for.

Both times, the eventual consensus was that sql was simpler to implement and use, but maybe marginally slower. Then machines got faster, so SQL dominated the market for decades.

Or maybe you can do ML with SQL... Postgres can do basic linear regression; I did this a couple of times for an analysis and found it pretty handy.

What? not having to choose either or, actually using good tools for the job, solving the problem instead of mindlessly following the hype? /s

Not to mention that recent versions of MS SQL can run (AFAIK) arbitrary R and Python code server-side.

What was the Twitter thread/HN thread referenced about using "boring" approaches to solving problems?

Also, more of the same: "Choose Boring Technology (2015)": https://news.ycombinator.com/item?id=9291215

Counting items by value is a maximum-likelihood estimation method too. It's still ML if you do a count, group-by, max or threshold - just a less sophisticated way of doing things. The Naive Bayes algorithm is, at its base, implemented by counting.
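To illustrate: a counting-based Naive Bayes classifier really is just COUNT(*) ... GROUP BY label, word plus a ratio. A toy sketch (the data and smoothing choice are invented for the example):

```python
from collections import Counter

# Toy labeled data: (word, label) pairs.
data = [("cheap", "spam"), ("viagra", "spam"), ("cheap", "spam"),
        ("meeting", "ham"), ("report", "ham"), ("cheap", "ham")]

label_counts = Counter(label for _, label in data)   # GROUP BY label
word_label_counts = Counter(data)                    # GROUP BY word, label
vocab = {w for w, _ in data}

def score(word, label):
    # P(label) * P(word | label), with add-one smoothing on word counts.
    p_label = label_counts[label] / len(data)
    p_word = (word_label_counts[(word, label)] + 1) / (label_counts[label] + len(vocab))
    return p_label * p_word

def classify(word):
    return max(label_counts, key=lambda label: score(word, label))

print(classify("viagra"))   # spam
print(classify("meeting"))  # ham
```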

So any reduction is ML?

Only if it fits to an existing statistical model.

How about counting the elements of a linked list?

When you already know the query logic or the logic is easy to derive - use SQL if you can. For more complex stuff ML may work as your rule derivation mechanism.

I thought graph databases were the canonical implementation for recommendation systems - one of the few use cases where I'd not go straight to SQL.
