Hacker News: mocamoca's comments

Question for people writing highly complex SQL queries.

Why not write simple SQL queries and use another language to do the transformations?

Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries? Doesn't working without reusable blocks / tests / logs make development harder?

Syntax is one thing, but actual performance (and safety/maintenance) is another matter?


Performance. A million times performance.

If I need to join a million rows to another table and then calculate an aggregate value, and do it all in a single query, it's fast. It might only take 0.01 seconds.

Whereas if I make separate queries to retrieve a million rows from one table and a million rows from another, it's incredibly slow just due to the data transfer, no matter how conceptually easier another language may be. It might take 30 seconds, due to repeated sequential serialization, network transfer, and deserialization.

And even if you're looking up individual rows (not a million), with lots of joins that would be lots of round-trips to the database which multiplies latency and load -- so even if you can still get your final result quickly (e.g. 0.01 seconds rather than 0.001 seconds), the load you can handle drops by the same multiple (e.g. you can now only serve 10% as many users).

The general rule of thumb with databases is that they can be super-fast when everything is done on the database machine in a single query and your final result is a small amount of data (assuming everything is indexed properly and the query is written properly). But they become super-slow when you're doing a bunch of queries in a row, and where those intermediate queries can return massive amounts of data, or (even worse) need to send back massive amounts of data.
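A minimal sketch of the single-query pattern, using Python's built-in sqlite3 (the table and column names here are made up for illustration). The join and aggregation all happen inside the engine, and only two summary rows ever cross the connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, "EU" if i % 2 else "US") for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, float(i % 7)) for i in range(10_000)])

# Join 10,000 rows to 100 rows and aggregate, all inside the engine.
# Only the small final result is transferred back to the client.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
""").fetchall()
```

The alternative (fetching all 10,000 order rows plus the customer table and joining client-side) transfers thousands of rows instead of two, which is exactly the serialization/network cost described above.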


I've worked on a few SQL systems used for analytics and ETL.

My users fell into (for the purposes of this discussion) three categories:

1. Analysts who prefer sheets

2. Data scientists that prefer pandas

3. Engineers who prefer C++/Java/JavaScript/Python

I'm fairly sure SQL isn't the first choice for any of them, but in all three cases a modern vectorized SQL engine will be the fastest option for expressing and executing many analysis and ETL tasks, especially when the datasets don't fit on a single machine. It's also easier to provide a shared pool of compute to run SQL than arbitrary code, especially with low latency.

Even as a query engine developer, I would prefer using a SQL engine. Performing even the basic optimizations a modern engine would perform -- columnar execution, predicate pushdown, pre-aggregation for shuffles, etc -- would be at least a week of work for me. A bit less if I built up a large library to assist.


Re #2: I prefer https://pola.rs over Pandas


I've heard great things about Pola.rs performance. To get there, they use lazy evaluation, so they can see more of the computation at once, allowing them to implement optimizations similar to those in a SQL engine.
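A toy sketch of the idea (this is not the real Polars API, just an illustration): because a lazy frame only records operations until you collect, the "optimizer" can reorder them, e.g. pushing a cheap filter ahead of an expensive per-row transformation so fewer rows reach it. The reordering here is only safe because the filter reads a column the map doesn't change:

```python
class LazyTable:
    """Toy lazy table: records operations, optimizes at collect() time."""

    def __init__(self, rows):
        self.rows = rows
        self.ops = []  # recorded, not executed

    def map_rows(self, fn):       # stand-in for an expensive transformation
        self.ops.append(("map", fn))
        return self

    def filter_rows(self, pred):  # stand-in for a cheap predicate
        self.ops.append(("filter", pred))
        return self

    def collect(self):
        # Naive "predicate pushdown": run all filters before any maps,
        # so the expensive maps see fewer rows. (Only valid when the
        # predicates don't depend on columns the maps produce.)
        ordered = ([op for op in self.ops if op[0] == "filter"] +
                   [op for op in self.ops if op[0] == "map"])
        rows = self.rows
        for kind, fn in ordered:
            rows = [fn(r) for r in rows] if kind == "map" else [r for r in rows if fn(r)]
        return rows

rows = [{"x": i} for i in range(10)]
out = (LazyTable(rows)
       .map_rows(lambda r: {**r, "y": r["x"] ** 2})   # pretend this is expensive
       .filter_rows(lambda r: r["x"] % 2 == 0)        # touches only "x": pushable
       .collect())
```

An eager frame would have to square all ten rows before filtering; the lazy version squares only the five that survive the filter.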


In the early days, even as I appreciated what Pandas could do, I never found its API sane. Pandas has too many special cases and foot-guns. It is a notorious case of poor design.

My opinion is hardly uncommon. If you read over https://www.reddit.com/r/datascience/comments/c3lr9n/am_i_th... you will find many in agreement. Of those who "like" Pandas, it is often only a relative comparison to something worse.

The problems of the Pandas API were not intrinsic nor unavoidable. They were poor design choices probably caused by short-term thinking or a lack of experience.

Polars is a tremendous improvement.


Hey, I agree with you.

On eager vs lazy evaluation -- PyTorch defaulting to eager seemed to be part of the reason it was popular. Adding optional lazy evaluation to improve performance later seems to have worked for them.


I'm not interested in loading all the data into memory, or swapping it out if it doesn't fit. It may also be a lot of data, and just moving it over the network is hard.

I have tests. I have reusable blocks (SQL functions, WITH blocks and views). I don't have logging though.

I can put the result in a non-materialized view and have it update in real time as the data changes. Or I can switch it to a materialized view, and now it's snapshotted data.

Finally, views that depend on views that depend on views get automatically optimized by the query planner. You need a lot of very tricky custom code to start approaching that.
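A small sketch of view stacking, using Python's built-in sqlite3 (table and view names made up). The query against the top view reads like three separate steps, but the planner flattens the whole stack into one plan over the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, value INTEGER);
    -- layered views: each one builds on the previous
    CREATE VIEW clicks      AS SELECT * FROM events WHERE kind = 'click';
    CREATE VIEW big_clicks  AS SELECT * FROM clicks WHERE value > 10;
    CREATE VIEW click_stats AS SELECT COUNT(*) AS n, SUM(value) AS total
                               FROM big_clicks;
""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(i, "click" if i % 2 else "view", i) for i in range(100)])

# Querying the top view; the planner merges all three views into one scan.
n, total = conn.execute("SELECT n, total FROM click_stats").fetchone()
```

Doing the same thing with three hand-written pipeline stages in application code means materializing each intermediate result yourself; the planner does the equivalent fusion for free.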


> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?

As others have said, yes. In most cases, the more of the work you push into a single query, the better the result.

> Doesn't working without reusable blocks / tests / logs make development harder?

SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want, and then the RDBMS goes off and finds the most performant way to retrieve the data that matches that shape. It doesn't compare well with procedural programming.

The closest "languages" comparable to SQL are HTML and CSS. However you manage HTML and CSS is also how you can manage SQL.


>> Doesn't working without reusable blocks / tests / logs make development harder?

> SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want...

Exactly this. Generally speaking, your SQL queries won't have "bugs" the way that you can create bugs when writing a function.

The challenging parts of building a complex query are usually 1) getting it to function at all (just being a valid SQL statement for your tables that gives you your desired output fields), and 2) making sure it runs performantly on realistically large table sizes (usually in milliseconds as opposed to seconds), which may involve rewriting things like joins vs. subqueries and/or adding indexes.

A lot of bugs in functions come from edge cases or different combinations of paths through code or unexpected combinations of parameter values or math formula errors or whatever... but a SQL query won't usually really have any of those things. It's just a single transformation that basically either works or doesn't.


One reason SQL has become more popular lately is as an API for map/reduce.

Before you would write a Java/c++ class that would do the map/reduce job for you distributed over 100s of CPUS. And you would feel like you were on the bleeding edge doing innovative stuff.

Turns out that SQL is a perfect API for map/reduce.

Everything you write in SQL can be transformed into a massively parallel job. And you don't even know about it.

This is the secret behind BigQuery and Trino/Presto/Athena.
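A rough sketch of why a GROUP BY decomposes so naturally: it is a map (emit key/value pairs), a shuffle (group by key), and a reduce (combine each group), and each phase can be fanned out over many workers. Pure-stdlib Python, with made-up data:

```python
from collections import defaultdict
from functools import reduce

# SELECT region, SUM(amount) FROM rows GROUP BY region
rows = [("EU", 3), ("US", 5), ("EU", 4), ("US", 1)]

# map phase: each worker emits (key, value) pairs from its shard of rows
mapped = [(region, amount) for region, amount in rows]

# shuffle: route pairs to workers by key (the network-heavy step)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce phase: each group combines independently, so it parallelizes freely
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
```

The SQL author never writes any of this; the engine derives the same three phases from the query, which is the point the comment above is making.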


If your data is relational, that is. SQL can get pretty ugly with semi-structured data, graph traversal, etc.


> Why not write simple SQL queries and use another language to do the transformations?

> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?

Yes. With knowledge of the data (like indices) and statistics about the data, it is usually far more efficient than piping the data to another process to handle the same job.

Of course that requires you to write a good query and know how to tell what makes a good query, but if your data is in a relational database it is usually more efficient to do your filtering/matching/aggregating there.


I run statistically sound A/B tests on Meta* and can tell you that people click, click more than you think, and buy more than you think from clicking on ads!

The same goes for email and Whatsapp marketing. People love these messages.

* We do not use their built-in A/B test tools, of course...


Incredible. I can't imagine what it must be like to live inside their skulls. In all my life I have clicked maybe one or two ads on purpose, misclicked (either by honest mistake or through malicious design) probably five times, and never bought anything because of an online ad. The one ad that made me think I might use the service in question was a sponsor read for a PCB fabrication company.


I clicked and bought something via an online ad once. I needed a new jacket and the one in the ad was exactly what I was looking for. I overpaid because the jacket fell apart after a couple wears. The experience has definitely soured me on ads, even more than I was before. If marketers weren’t selling me overpriced junk I might have a more favorable view.


I have to say Instagram ads are incredibly well targeted to me. It's been like that for maybe a year, I think (I remember noticing it). I've clicked on a bunch of them, much more than elsewhere.


But have you bought anything?


Yea, not immediately, but yes. E.g. specialty coffee and chocolate. Other ones were interesting job offers, which I usually check, making a note if the place is interesting.


Is it a nerd thing, then? I cannot for the life of me imagine clicking on ads, but maybe it’s just the filter bubble?


There are two kinds of things here, which I will call "push ads" and "pull ads".

Push ads we hate. I don't want your random ads shoved in my face. The harder and more intrusively you shove, the more I hate it. No, I'm not going to look at it to see if I'm interested; I just want it gone.

Pull ads are different. Let's say I'm on a business trip, and I get out of some meeting at 5 PM. Well, what is there to eat around here? What do I get in response to that? Ads! But in how they interact with me, it's completely different, because I actually want to eat.

Or let's say I'm going on a trip, and I search for "lodging in location X". Well, some of what comes back is ads. Say it's from hotel.com. I probably wouldn't have gone to hotel.com as one of my choices for investigating lodging in X, but they're going to give me kind of a one-stop overview of what's there, so maybe I'll click on their ad link anyway, because it's actually interesting information for what I'm trying to do right then.

But some random Temu junk that they insist on shoving in my face every time they can? No way. Just get it out of my life.


I think pull ads are hated too. They try to short circuit your brain into making a quick decision/have a preference without doing any research. It's all manipulation.


I've got the same feeling as you!

Do you enjoy going through your mail account and unsubscribing from all the useless newsletters? Use email aliases? Are you wary of all the ads and brand claims you see on social media, and prefer doing your own research? Hate clicking on SEA results, since the landing page is usually a weird, summed-up version of the full website (we want the full thing)?

But i guess that's a particularity of hackers and tinkerers :)

Many different reasons make people click and buy: discounted prices, FOMO, clever retargeting, funny ads, hidden ads (sponsored content/influencers/fake news), ...


Yeah, I give a lot of newsletters a chance, but only with Firefox Relay single-use email addresses, of which I currently have over 300. I probably unsubscribed from 95% of them, but I did give them at least a chance. I probably should have tracked the ones I unsubscribed from so I could delete the aliases.


Facebook's ads in particular are very well targeted, IMO. They know I'm into gaming, so I get gaming related ads (board/card games, conventions, arcades). Other sites like NYTimes, Amazon Prime Video, etc, all do a terrible job -- they think I'm a parent with a watch fetish who can't get enough prescription drugs.

If I were an ad buyer I'm not sure I'd bother with anything but Meta at this point.


I can definitely see clicking on things in email - after all I have already prescreened your company/product and found it to be interesting/worthwhile.

As for ads, maybe it is just that marketers generally have to advertise crap (because good things tend to sell themselves?) and so I rarely if ever see ads for things worth purchasing?


What's the breakdown of M vs F?


Users are my main motivation drivers. Can you get any? Even if that's you!

If not: why?


Just to balance things a little

Sometimes the cost of an offer is not linear with some metric or easily predictable.

And sometimes the offering company is not Salesforce and just does not have resources or existing similar customers to model price adequately


If someone from Shopify's backend team is around, I would love to know how difficult it is to maintain/improve the GraphQL API. It looks like Shopify is deprecating REST in favor of GraphQL, so is the developer UX that good for Shopify developers? And by the way, some features are missing from the GraphQL API relative to REST. I wonder if that's related to hard-to-implement features or good-occasion-to-delete features (e.g. Checkout).


The core advice is to produce instead of consume. And the second one is to study.

However, packaging it under a "being a mere employee is lame" mantra... is lame!

But I'm curious. Why do you think that doing something which is maybe not study related is a bad idea?

("Maybe not" because e.g. John Mayer is both a rock star and a Berklee grad)


> Why do you think that doing something which is maybe not study related is a bad idea?

Because 99.999% of those kids won't go on to start the next Google and make more money for PG's investments.

When I was at school there were a few kids who were genuinely moderately good at football/soccer. With only minor encouragement from PE staff and family they essentially had the attitude of "I don't need to study, I am going to be a professional footballer." (who by the way make huge huge huge sums of money if you are at the very top)

And how many of those went on to be top-flight professional footballers, do you think? Yep - zero. What are they doing now, considering they crashed out of school with no qualifications, no professional sports career, and barely able to even do basic literacy and arithmetic? Low-skilled and relatively dead-end jobs like gardeners, working on building sites, security guards, etc.

They threw away their chance to study and gain skills because irresponsible and self-interested adults told them they should abandon school and concentrate on this dream of fame and fortune with essentially zero chance of it coming true. You're fucking with people's lives by doing this - don't tell people you don't need to try hard and study and learn, you just need to become a billionaire business owner!


I don't mean to show a lack of respect for parents and their good will.

However, could we say that "you at least have to play to not lose" is a particularly selfish reason for having children?

Here, "Playing" is making sure the parents will have caregivers, not that the children will live happy.

If the current situation is not worrying enough to rule out having happy kids, what would be the threshold? (Not a rhetorical question)


> If the current situation is not worrying enough to rule out having happy kids, what would be the threshold? (Not a rhetorical question)

There is no threshold. Yes, I know Yudkowsky's position, not even there.

> Here, 'Playing' is making sure the parents will have caregivers, not that the children will live happy.

Parents having caregivers hardly comes into it at all.


> Parents having caregivers hardly comes into it at all.

For me, it does, an awful lot. Being a caregiver to my parents has implied a lot of existential suffering and emotional agony. Think about it: the people who raised you are going to die slowly and perhaps painfully and you are going to see it without being able to do anything. I feel I would be very cruel by giving that future to a child I love.


That's either going to happen or it's not, but I don't think everyone being dead is an acceptable way to force it to be wrong.


I got confused too at first.

But it makes sense because the counters get reset each month.

So if you consume "101TB" over a year, distributed as 90TB in January and then 1TB per month... You could get hit pretty hard in January and have free bandwidth for the rest of the year.


Bandwidth on OVHcloud is free: https://us.ovhcloud.com/public-cloud/faq/

As a data company we really benefit from Scaleway, Hetzner and OVH.

However, Scaleway has no plans to add US infrastructure.

And I don't know if Hetzner has a US presence?


Hetzner has a limited US presence for VM products but not baremetal/dedicated solutions.


European providers benefit from lower cross-connect fees in datacenters and more internet exchanges for easy peering. It's not surprising they offer more bandwidth at the same cost.


The US has plenty of internet exchanges. And plenty of smaller hosters with cheap bandwidth. Usually located near an internet exchange.

It's not like AWS or GCP have locations that are terribly far from major exchanges either.


You still have to pay for cross-connects to connect to exchanges in datacenters.

> In North America, the median cross-connect price is $300, more than five times higher than $58, which is the median price in Europe.[1]

Of course, that difference means nothing to a hyperscaler, but it's led to a difference in pricing culture between the smaller providers.

1: https://www.fiercetelecom.com/telecom/study-u-s-fiber-cross-...


I agree.

Proving that using these shortcuts and autocomplete tricks does not break sample independence, and keeps the sampling as random as possible, would be needed here for stats beginners such as me!

Drunk dialing, but with a human operator who each time tries to help you connect with someone, even if you mistyped the number... Doesn't look random to me.

However, I did not read the 85-page paper... Maybe it's addressed there.


Page 9 & 10 of the paper [1] go into some detail:

> By constructing a search query that joins together 32 randomly generated identifiers using the OR operator, the efficiency of each search increases by a factor of 32. To further increase search efficiency, randomly generated identifiers can take advantage of case insensitivity in YouTube’s search engine. A search for either "DQW4W9WGXCQ” or “dqw4w9wgxcq” will return an extant video with the ID “dQw4w9WgXcQ”. In effect, YouTube will search for every upper- and lowercase permutation of the search query, returning all matches. Each alphabetical character in positions 1 to 10 increases search efficiency by a factor of 2. Video identifiers with only alphabetical characters in positions 1 to 10 (valid characters for position 11 do not benefit from case-insensitivity) will maximize search efficiency, increasing search efficiency by a factor of 1024. By constructing search queries with 32 randomly generated alphabetical identifiers, each search can effectively search 32,768 valid video identifiers.

They also mention some caveats to this method, namely, that it only includes publicly listed videos:

> As our method uses YouTube search, our random set only includes public videos. While an alternative brute force method, involving entering video IDs directly without the case sensitivity shortcut that requires the search engine, would include unlisted videos, too, it still would not include private videos. If our method did include unlisted videos, we would have omitted them for ethical reasons anyway to respect users’ privacy through obscurity (Selinger & Hartzog, 2018). In addition to this limitation, there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample to compare to our sample - would not be computationally realistic.

[1]: https://journalqd.org/article/view/4066
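The arithmetic quoted above is easy to sanity-check. A small sketch (the ID generator is a simplification: real video IDs draw from a base64-style alphabet, and the valid final characters are a restricted set, so the trailing "A" here is just a placeholder):

```python
import random
import string

# 32 randomly generated IDs joined with OR, each with 10 case-insensitive
# alphabetical positions, so one search covers 32 * 2**10 candidate IDs.
ids_per_query = 32
case_insensitive_positions = 10          # position 11 doesn't benefit
coverage = ids_per_query * 2 ** case_insensitive_positions

random.seed(0)

def random_alpha_id():
    # 10 alphabetical characters plus a placeholder final character
    return "".join(random.choice(string.ascii_lowercase) for _ in range(10)) + "A"

# one search query of the shape the paper describes
query = " OR ".join(f'"{random_alpha_id()}"' for _ in range(32))
```

That 32,768x multiplier over naive one-ID-at-a-time probing is what makes the sampling computationally feasible at all.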


> case insensitivity in YouTube’s search engine.

That's very clever. Presumably the video ID in the URL is case-sensitive, but then YouTube went out of their way to index a video's ID for text search, which made this possible.


Good observation, but they also acknowledge:

> there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct "brute force" method - even for the purpose of generating a purely random sample

In short, I do believe the sample is valuable, but it is not a true random sample in the spirit in which the post is written; there is a heuristic to get "more hits".

