Hacker News | willvarfar's comments

This video is very liberal but does a good job of explaining which companies and industries pay for breaks and which don't. It uses soybean farmers as a prominent example of a group that hasn't been giving Trump bribes: https://youtu.be/RPzcGeiNYvk?si=bfy_5KEo_ZUxOBHu

Can join cardinality be tackled with cogroup and by not expanding the rows until the final write?

I don't know what cogroup is, sorry.

More generally, there are algorithms for multi-way joins (with some theoretical guarantees), but they tend to perform worse in practice than just a set of binary joins with a good implementation.


Yeah it's pretty obscure, sorry.

It's called cogroup in Spark and similar architectures.

It does a group-by to convert data into the format (key_col_1, ... key_col_n) -> [(other_col_1, ... other_col_n), ...]

This is useful and ergonomic in itself for lots of use-cases. A lot of Spark and similar pipelines do this just to make things easier to manipulate.

It's also especially useful if you cogroup each side before the join, which gives you the key column and two arrays of matching rows, one for each side of the join.
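A minimal PySpark sketch of what that looks like, with made-up keys and values (this uses the RDD cogroup API; the DataFrame API has equivalents):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    left = sc.parallelize([("k1", "a"), ("k1", "b"), ("k2", "c")])
    right = sc.parallelize([("k1", "x"), ("k1", "y")])

    # cogroup: key -> (all matching left values, all matching right values),
    # one output row per key instead of the expanded cartesian product
    for key, (lvals, rvals) in left.cogroup(right).collect():
        print(key, list(lvals), list(rvals))
    # (output order may vary)
    # k1 ['a', 'b'] ['x', 'y']
    # k2 ['c'] []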

A quick search says it's called "group join" in academia. I'm sure I've bumped into it under another name in other DB engines but can't remember it right now.

One advantage of this is that it uses bounded memory. It doesn't actually iterate over the cartesian product of non-unique keys. In fact, the whole join can be done on pointers into the sides of the join, rather than shuffling and writing the values themselves.

My understanding is that a lot of big data distributed query engines do this, at least in mixer nodes. Then the discussion becomes how late they actually expand the product - are they able to communicate the cogrouped format to the next step in the plan or must they flatten it? Etc.

(In SQL big data engines sometimes you do this optimisation explicitly e.g. doing SELECT key, ARRAY_AGG(value) FROM ... on each side before join. But things are nicer when it happens transparently under the hood and users get the speedup without the boilerplate and brittleness and fear that it is a deoptimisation when circumstances change in the future.)
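A rough PySpark equivalent of that explicit pre-aggregation (hypothetical DataFrames with key/value columns; collect_list plays the role of ARRAY_AGG):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    left_df = spark.createDataFrame([("k1", "a"), ("k1", "b")], ["key", "value"])
    right_df = spark.createDataFrame([("k1", "x"), ("k1", "y")], ["key", "value"])

    # collapse each side to one row per key before joining,
    # so the join itself never expands the product of duplicate keys
    left_g = left_df.groupBy("key").agg(F.collect_list("value").alias("left_values"))
    right_g = right_df.groupBy("key").agg(F.collect_list("value").alias("right_values"))
    left_g.join(right_g, on="key", how="full").show()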


Group join in academia generally points to having GROUP BY and one join in the same operation (since it's common to have aggregation and at least one join on the same attribute(s)). But just making a hash table on each side doesn't really do anything in itself (although making it on _one_ side is the typical start of a classic hash join); in particular, once you want to join on different keys, you have to regroup.

And what about all those huge pending orders for F35 in ... Denmark and Canada? Etc.

Denmark ordered more in October. Canada talks a lot, but so far has done nothing concrete about reducing their order. You would think they would urgently cancel and get Gripens and/or Rafales.

Wonder what's going on behind the scenes.


Having a pending order that can be cancelled is negotiation leverage?

There are no other options. The F35 is the only gen5 fighter you can buy (Russia has one, but they can't make it, and in any case Russia is invading Europe now and so isn't an option), and as such it is going to be better than anything else you can get. Plus the cost of the F35 is similar to or less than your other options.

The real question is what do those countries do when they have other options.


Is everyone expecting everyone to actually go Gripen with Rolls-Royce or MECA engines?

The very last clip in the video says that it is kids in affluent families taking that direction.

(I work a lot with BigQuery's BigLake adaptor and it's basically caching the metadata of the Iceberg manifests and Parquet footers in Bigtable (this is Google), so query planning is super fast etc. Really helps)


Greenland and Denmark have always been encouraging minerals deals etc, they just haven't materialized.


Crazy to think that soon not being able to successfully complete the captcha will be a signal that the user is human.


I had a great euphoric epiphany feeling today. Doesn't come along too often, will celebrate with a nice glass of wine :)

Am doing data engineering for some big data (yeah, big enough) and thinking about the efficiency of data enrichment. There's a classic trilemma with data enrichment where you can have good write efficiency, good read efficiency, or good storage cost: pick two.

E.g. you have a 1TB table and you want to add a column that, say, will take 1GB to store.

You can create a new table that is 1.1TB and then delete the old table, but this is both write-inefficient and often breaks how normal data lake orchestration works.

You can create a new wide table that is 1.1TB and keep it alongside the old table, but this is both write-inefficient and expensive to store.

You can create a narrow companion table that has just a join key and 1GB of data. This is efficient to write and store, but inefficient to query when you force all users to do joins on read.
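To make that third option concrete, a rough PySpark sketch (the paths and column names are invented): the enrichment is written as just (join_key, new column), and every reader then pays for the join.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # ~1 TB base table and ~1 GB companion table (hypothetical paths)
    wide = spark.read.parquet("s3://bucket/base_table")
    narrow = spark.read.parquet("s3://bucket/enrichment_table")  # join_key + new column only

    # cheap to write and store, but this join is now on every read path
    enriched = wide.join(narrow, on="join_key", how="left")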

And I've come up with a cunning fourth way where you write a narrow table and read a wide table, so it's literally the best of all worlds! Kinda staggering :) Still on a high.

Might actually be a conference paper, which is new territory for me. Let's see :)

/off dancing


Sounds off to me tbh.

Where your table is stored shouldn't matter that much if you have proper indexes, which you need, and if you change anything your DB is rebuilding the indexes anyway.


You mean you discovered parallel arrays?


Specifically, I've discovered how to 'trick' mainstream cloud storage and mainstream query engines using mainstream table formats into reading parallel arrays that are stored outside the table, without using a classic join, and treating them as new columns or schema evolution. It'll work on Spark, BigQuery etc.


What's a good place to see parallel arrays defined? I have no data lake experience. I know how relational DBs work.


I mean,

    Table1 = {"col1": [1,2,3]}
    Table2 = {"epiphany": [1,1,1]}
    for i, r in enumerate(Table1["col1"]):
      print(r, Table2["epiphany"][i])

He's really happy he found this (Edit: actually it seems like Chang She talked about this while discussing the Lance data format [1] @12:00 at a conference in 2024, calling it "the fourth way") and will present this at a conference.

[1] https://youtu.be/9O2pfXkCDmU?si=IheQl6rAiB852elv


Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming for different use-cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic and works on the tech stack my company uses but would also work on all the mainstream alternative stacks. Slightly slower on AWS. But anyway. I guess HN just wants to see an industrial track paper.


Why a paper? A repo should do the trick.


That code is for in-memory data, right? I see no storage access.

What is really happening? Are these streaming off two servers and zipped into one? Is this just columnar storage or something else?


Look into vector databases. For most representations, a column is just another file on disk.
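For a toy picture of that, with made-up file names: two Parquet files on disk, read back and zipped positionally into one logical table, no join key involved.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # each "column" lives in its own file
    pq.write_table(pa.table({"col1": [1, 2, 3]}), "table1.parquet")
    pq.write_table(pa.table({"epiphany": [1, 1, 1]}), "table2.parquet")

    t1 = pq.read_table("table1.parquet")
    t2 = pq.read_table("table2.parquet")

    # rows line up by position, not by a join key
    combined = t1.append_column("epiphany", t2.column("epiphany"))
    print(combined.to_pydict())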


I agree that social media is a net negative, but want to also point out that before social media it was the mainstream press and TV that had been shaping society for decades. Things like buying a used car from Nixon or fighting in Vietnam etc are all mainstream press impact.


I like to think that contrary to this modern idea of media bias that the “mainstream” media as you label it has been a net benefit to society. Journalists used to challenge authority in democracies and bring out truth. It’s a lot more difficult now due to social media polluting the information space.


All together, everyone!

https://www.youtube.com/watch?v=ZggCipbiHwE

"This sharing of biased and false news has become all too common on social media"

... say the local TV presenters parroting an identical script from the Sinclair Broadcast Group, which owns or operates 193 TV stations in the USA, covering 40% of US households

You'd be mad to think that consolidated control of information, the endgame of "mainstream" media, is of benefit to society.

"Mainstream" media is financed either directly by very rich individuals, who then use their control of the thing they own (even just by controlling its hiring policies, to give like-minded people a voice) to spam their own agenda on the populace, or a generic money-making enterprise that then deals with less-affluent people who want to spam the populace (advertisers).


And who owns every social media platform, if not a few very rich individuals?


Touche. But you miss that not all social media (e.g. blogs and forums, instant messaging) are "social media platforms".

Also, the trick doesn't work with social media platforms in the same way. Rupert Murdoch bought Myspace; where is it now? He didn't get the same control and power he got when he bought The Times and The Sun and could tell the staff who wrote the content what to say to their passive readers.


The world is not America.


Do you think this doesn't happen in other countries?

Just to give an example from the UK of "state" media, the nominally independent BBC has to answer to a board, and to the regulator Ofcom. But in 2021, Boris Johnson installed Richard Sharp (Tory party donor, Rishi Sunak's old boss) as the head of the board, and Robbie Gibb (Theresa May's head of communication) as a member, and attempted to rig the selection of the head of Ofcom, even though he's not legally allowed to do that. He still tried it. He "let it be known" he wanted Paul Dacre (former Daily Mail editor) to be head of Ofcom. https://www.prospectmagazine.co.uk/politics/63982/boris-john...

They are all at it, to try and control public opinion and gatekeep what is seen and not seen.


Sure, but on the whole I'd argue outside of tabloids there are still real journalists doing real journalism and trying their best to hold people to account.


The thing is, the internet was supposed to democratise, but it's ended up centralising (and therefore distorting) discourse

A good example is publishing: until relatively recently, books were how most knowledge was distributed, and publishers were able to gatekeep it

Back in the 1990s, one of the promises of the internet was that it could break this stranglehold. The argument was that instead of 10-ish major publishers, we could have ten billion

What we've ended up with is 5 or so major platforms. Their algorithms now distort, not only the distribution of information, but the production of knowledge itself (click chasing)

An argument I'm sympathetic to, is that the internet hasn't just been a neutral medium, but has actually accelerated this centralisation

The other aspect is the shrinking role of non commercial institutions, like public sector broadcasters, universities, scientific orgs. These entities had their own biases and groupthink. But they added diversity to the media landscape and helped set useful norms


I've always noticed and wondered, so I guess it's easy to overlook but it's there.

