
It’s too bad there’s no way to use hybrid approaches or use some kind of learning, but for machines! No heuristics to rank and prioritize content.

We better tell every big tech company that what they’ve been doing forever is impossible to do if they don’t also deplatform those deplorables.


Gee, I wonder how Wikimedia could make this practice ethical?

Maybe actually tell people it's going to those causes instead of Wikipedia?

Naaaaaaaaah


I'm 100% with you. We have to keep the money flowing away from taxpayers and towards corporate interests.


That's inevitable. What is preventable, however, is having taxpayer money flow out of the country to foreign corporate interests. As an American, I'd rather get screwed by an American company than one from, well, a whole host of other countries.


I agree. We need more American (TM) companies with no competition. In every industry, there should be a Comcast sitting on top collecting rents.


Really? That's a grossly uncharitable interpretation of what I said. How abominable of you, Socrates.


What if the Americans screw you for 8x the cost (the number the article uses), and you get a smaller, inferior ship in return? Why not go with the foreign product at that point?


Money that flows out of the country is essentially free.

The Fed targets inflation. They'll just print more money, if the foreigners want to sit on it.


> this has been done properly numerous times in the past

I'm genuinely curious if you have any examples of this. I'm not aware of any modern product at scale that doesn't suffer from these same issues.


When people say that I imagine them thinking about a web forum with 200 monthly visitors.


Every newspaper and pretty much any other medium over the past century or so. That’s what editors used to do.

Scaling is not the problem - profits scale with the number of users, and in the worst case the cost of moderation would too. The problem is the companies’ unwillingness to pay anything at all. Their profits didn’t come from offering a good product, but from discovering a new way to offload the costs of their operation while keeping the income. In this case - by pretending to be part telco, part newspaper (which brings income), but without taking on their main responsibilities (which cost money).


A little unclear what analogy you are using here.

Newspaper editors don't have to deal with large-scale attempts to publish material as journalists on the newspaper's staff - or insofar as they do have to deal with it, they just ignore unsolicited submissions. That's the closest analogy I can see in the newspaper business.


In most newspapers there has been a section for letters to the editor. Those sections are moderated by editors. That is the reference, not some situation of random readers somehow publishing material by posing as journalists on the newspaper's staff.

An updated version of that is the comments section for online stories.


Facebook etc already pays large amounts for their moderation army: https://www.wired.com/2014/10/content-moderation/ (unclear if the OP knew this or was ignoring it for some reason).

XCheck isn't really about content moderation, it is about special security for specific people and how that interacts with posting.


> Facebook etc already pays large amounts for their moderation army: https://www.wired.com/2014/10/content-moderation/ (unclear if the OP knew this or was ignoring it for some reason).

The article you linked doesn't say how much fb pays for their moderation. It doesn't say how many people are in their "moderation army". Nor does it say anything current. That article was written 8 years ago.

> XCheck isn't really about content moderation, it is about special security for specific people and how that interacts with posting.

According to the description, xcheck is all about content moderation - it chooses when not to moderate content.


Facebook pays a _very_ tiny amount for their moderation team, compared to the number of users.

XCheck is all about content moderation, in particular about saving work (ie money) on content moderation.


Good. I'm so tired of all these folks saying that we haven't always been at war with Eurasia.


> it is possible to operate a Google account completely without a phone number

This is only true for a limited time. I've tried to use a couple Google accounts this way and inevitably I log in from a new IP and Google's 2FA system kicks in - forcing me to either furnish a phone number or lose access to the account.

It's similar to how Twitter forces phone numbers out of people - just not as immediate.


Do they really ask for a phone number, or would a Yubikey work as well?


A Yubikey would be just as useless in this article's specific case, since the problem is losing valuable things (e.g. phones). A Yubikey is no different.

It too would be lost.


That's definitely a problem, and a tricky one to solve in the context of 2FA: One of these factors is usually knowledge (your password); the other then has to be possession or inherence, and the latter has problems as well.

Essentially, if you rule out possession, your choice is between server-side validated biometrics (if offered at all), or "double knowledge" (e.g. a password and email 2FA, with the email account also only protected by a password), which is pretty phishable.


Buy a handful of alternative domains that redirect to your primary (you could stand up a minimal URL shortener on each domain).
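
A minimal sketch of that redirect idea (the primary domain and port are placeholders, and a real deployment would of course need TLS and proper hosting):

    # Everything requested on the backup domain 301-redirects to the primary site.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PRIMARY = "https://tickets.example.com"  # your real site (placeholder)

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            # Preserve the path and query string so deep links keep working.
            self.send_response(301)
            self.send_header("Location", PRIMARY + self.path)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), Redirector).serve_forever()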

Even if you get unblocked this time, it could easily happen again. Until there’s systematic reform to this nonsense, you just have to work around it with redundancy.

If they’re going to treat you like a scammer, work around it like the scammers do.


I believe the facebook crawler will crawl redirects, such that a URL that results in a redirect to a blocked domain is still going to get blocked.

(Even if it were a satisfactory solution to say "message all your customers and tell them they have to start using the new domain for ticket sales, including for events that are already promoted with ongoing ticket sales" which of course it isn't, although I follow you that it would be perhaps better than nothing).


I was expecting the number of search results here to be much higher - like who cares if Google only serves the first million results out of a billion?

Very interesting to see that Google will only serve a few hundred links when they claim to have hundreds of thousands of relevant results indexed.

I'm very curious where Google is getting that count and why the reality is so different. Systematic overcounting? Suppressing hundreds of thousands of results?


The problem is generally called "deep pagination". It's extremely inefficient to compute.

Specifically, counting requires very little memory. When data is spread across 10,000 computers, having all of them count returns just 10,000 numbers, i.e. 4 bytes * 10,000 = 40 KB, and it's easy for one computer to sum those 10,000 numbers. Even at 100,000 computers that's only 400 KB.

Merging sorted search results, on the other hand, is extremely memory intensive. Even with just an Id+Score pair, let's say 8 bytes per result: to get the 10,000th search result, each computer needs to return a list of its top 10,000 results, that's 10,000 * 10,000 * 8 bytes = 800 MB. For the 100,000th search result, 10,000 * 100,000 * 8 bytes = 8 GB. Or if your data grows to 100,000 computers, that's 100,000 * 100,000 * 8 bytes = 80 GB of intermediate results to process at the end.
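
A toy sketch of that naive scatter-gather merge, just to make the arithmetic concrete (the Shard class, numbers, and page sizes here are all made up for illustration):

    import heapq
    import random

    class Shard:
        # Stand-in for one machine holding its own sorted (doc_id, score) pairs.
        def __init__(self, docs):
            self.docs = sorted(docs, key=lambda p: -p[1])  # descending score

        def top_n(self, n):
            return self.docs[:n]

    def page_of_results(shards, page, page_size=10):
        # To serve results [offset, offset + page_size), every shard must
        # return its own top (offset + page_size) pairs, since any of them
        # could fall in the requested window. The coordinator therefore
        # holds num_shards * (offset + page_size) pairs in memory at once.
        offset = page * page_size
        needed = offset + page_size
        partials = [shard.top_n(needed) for shard in shards]
        merged = heapq.merge(*partials, key=lambda pair: -pair[1])
        return list(merged)[offset:offset + page_size]

    # Toy numbers; with the real ones, 10,000 shards each returning 10,000
    # 8-byte pairs is 10,000 * 10,000 * 8 bytes = 800 MB of intermediate state.
    shards = [Shard([(i, random.random()) for i in range(1000)]) for _ in range(20)]
    print(page_of_results(shards, page=5))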

As you can see, this doesn't scale well. Instead, you're required to retain context (i.e. sessions) for the search in memory and get the search engine to coordinate across all 100,000 computers. This also has scaling limitations based on the memory per session, the number of computers, the number of sessions, and their TTL (someone can leave the search page open for a day and hit "next page" - should the session still be open? That's something each search engine has to decide).

The reality is, if a customer wants deep pagination, they are better served by a full data dump (i.e. a full table scan) or an async search API, rather than a sync search API.


Well, at that point who really cares whether the content of the 1,001st page is deterministic, or in perfect order? Get the first 100 or so pages right, and thereafter just request the nth results from each of those m computers. No merge and no memory explosion; you'll just get them slightly out of order.


You still need to filter based on the other indexes. If you search for [bitcoin mining] you don't want to find pages related to coal mining. So this data still needs to be joined.


The search term for this is intersection. The posting lists for the two terms are intersected, then the results are ranked. But there are a lot more steps in a production search engine.
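
A toy version of that intersection step, assuming sorted posting lists of doc IDs (the lists below are made up; a production engine adds skip pointers, compression, and many more ranking stages):

    def intersect(postings_a, postings_b):
        # Classic two-pointer walk over two sorted posting lists (ascending doc IDs).
        i = j = 0
        out = []
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                out.append(postings_a[i])
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1
            else:
                j += 1
        return out

    # Hypothetical posting lists: docs containing "bitcoin" and docs containing "mining".
    bitcoin = [2, 5, 9, 14, 21, 33]
    mining = [3, 5, 8, 14, 30, 33, 40]
    print(intersect(bitcoin, mining))  # [5, 14, 33] -> candidates that go on to ranking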

The long and short of it is that if you really want the full results, just join Google, join the search team, and then get enough experience that you can do full queries over the docjoins directly. This was part of Norvig's pitch to attract researchers a while ago. For a research project, I built a regular expression that matched DNA sequences and spat out the list of all pages containing what looked like DNA, then annotated the pages, so in principle you could have done dna:<whatever sequence> - but obviously that was not a goal for the search team.


I used to work at Google, but not in search; these are just my own guesses.

> where Google is getting that count

This is very likely a fairly accurate estimate of the number of pages in Google's index that "match" the search query. Basically exactly what you would expect when you see the number.

> why the reality is so different

Cost reasons. Most search engines are more or less scanning down a sorted list of pages, and the further you need to scan, the more expensive it is - just like running "OFFSET 1000" is usually slow in SQL. At some point the quality of results is generally very low and the cost keeps growing, so it makes sense overall for Google to just cut it off to prevent it from becoming an abuse vector (imagine asking Google for the 10 millionth page of results for "cat").
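
To make the SQL analogy concrete, here is a rough sketch of deep OFFSET versus keyset ("seek") pagination; the table and column names are hypothetical:

    import sqlite3

    # Toy table standing in for a sorted result list.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, score REAL)")
    db.execute("CREATE INDEX idx_results_score ON results(score)")
    db.executemany("INSERT INTO results (id, score) VALUES (?, ?)",
                   [(i, 1.0 / (i + 1)) for i in range(100_000)])

    # OFFSET pagination: the engine still walks past the first 50,000
    # rows before returning the 10 you asked for.
    deep_page = db.execute(
        "SELECT id, score FROM results ORDER BY score DESC LIMIT 10 OFFSET 50000"
    ).fetchall()

    # Keyset ("seek") pagination: remember the last score you saw and
    # continue from there, so the index can jump straight to the spot.
    # (Real code also needs a tie-breaker column for equal scores.)
    last_score = deep_page[-1][1]
    next_page = db.execute(
        "SELECT id, score FROM results WHERE score < ? ORDER BY score DESC LIMIT 10",
        (last_score,),
    ).fetchall()
    print(next_page[0])  # continues right where the previous page ended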

The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.


> The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

I used to (years and years ago) go past the first page pretty often, but results are so bad now that it rarely helps, so I almost never even click "2", let alone later pages. It's all gonna be obviously-irrelevant crap Google "helpfully" found for me, or the auto-generated spam that Google used to try to fight (circa 2008 and earlier) but no longer seems to, just letting it gunk up and dominate any results you get that aren't from a handful of top sites.

So this is, in part, one of those "we broke a thing and now no one uses it, guess they didn't want it!" situations.


What's really weird is that sometimes you get results that are outright repeated on those first N pages. Sometimes more than once.

It's almost as if it tries to pad the output to be long enough that you'd lose patience before you reach the end of "effective pagination".


The thing has always been "broken". Google has had a page limit for at least a decade.


No, by "broken" I mean "let lazy auto-generated spam take over the results almost completely". So now those of us who used to browse past page one (which, to be fair, may not have been many people) don't bother anymore.

[EDIT] For those who weren't around for it, Google used to play cat-n-mouse with spam-site operators. It'd go through cycles where results would slowly get worse, then suddenly a ton better, though never as bad as they are today. Around '08 or '09 they (evidently, I'm just judging from the search engine's behavior starting around then and continuing to this day) seemed to give up and just boosted a relatively small set of sites way up the results, abandoning the rest to the spammers.


Part of the difficulty is, if very few people are browsing to page 2, deciding what to put on page 2 becomes harder and harder.

Google has a lot of user behavior signals to decide what should be in results 1-10. Deciding if a page should be ranked 20, 200, or 2000 without any user clicks to check if you're right is really difficult.

I would bet that since 2008/9, the relative numbers of spam-site operators, Google engineers, and second-page searches have changed significantly.


Kagi has been working very well for me as an alternative


I find search results are frequently even worse than this, in that the first page will have nothing useful, with about three good links split between the second and third page. If I'm lucky.


If you've ever read Larry Niven's Fleet of Worlds series, there's a Bussard Ramjet with an AI programmed to hide any information that could help a hostile enemy/force find their way back to Earth.

A small cadre of humans, raised by an alien race that came across a human seed ship, cross paths with this Ramjet, and one of the protagonists realizes something is off when they query the size of the presentable search results in the astrographic/navigational dataset: the number of starmaps the AI will produce is far smaller than the amount of space the system actually dedicates to storing said maps.

Point being, you can't trust any system that restricts results to a subset not to have been designed to leave out results. It furthermore makes a great, plausibly deniable way to drop search results... just force their ranking to 10,001+.

You'll forgive me, I hope, if I suspect a company well known for cooperating with an anti-humanitarian regime (Project Dragonfly), and one that regularly black-holes other undesirable data points, of engaging in less-than-up-front search result presentation?


This isn't the revelation you act like it is. Because of course Google hides results. They don't pretend not to, and they even inform webmasters when it happens. The Search Console calls it a "Manual action" when they do so.

More importantly, the people asking for a "censorship-free search engine" are expressing an incoherent desire. The whole point of a search engine is to take the zillions of web pages that have matching keywords, push the crap to the bottom, and leave the gold on top. A system that does this is inherently censorious. We're just quibbling over what the criteria should be.

What our world lacks is a reasonably quick way to hold Google accountable when they fail to represent the interests of the public who searches with them. The real-world consequences of their filtering decisions need to filter back to the people making those decisions. Because "just don't make any filtering decisions" isn't going to result in a usable information retrieval system.


> "censorship-free search engine" are expressing an incoherent desire

That's not really true. `grep` is a censorship-free search engine. It just reports every matching result.

Of course that wouldn't generally be useful over the web, however even with sorting it is possible to be censorship free. You just need to include every matching result eventually.

Of course, you would find that generating later pages likely also becomes expensive, so you may add a page limit and ask the user to refine the query instead. But then you are back to the problem that it can be very difficult to find every result, because you need to guess what words are on the page.

But all of this is basically moot, because Google doesn't claim to be censorship-free, so they have a much simpler way of hiding results.


> even with sorting it is possible to be censorship free. You just need to include every matching result eventually

Do you honestly think that the people who complain about their favorite website being censored by Google would be satisfied with showing up on page 200*? I wouldn't.

It's only "not censorship" in the same sense that having your emails sent to the Spam folder isn't censorship. The spam folder, and low-scoring SERP results, are so full of items that every reasonable person acknowledges to be crap that getting banished to that area is pretty much equivalent to having someone blast your roadside protest with strobe lights and a sonic cannon. Surrounding you with so much garbage data that nobody can see or hear you any more is only "not censorship" on the dumbest technicality.

* Ignore, for sake of argument, the fact that page 200 won't even load in our universe. I'm imagining a parallel world where Google pretends to be censorship-free because they only push things far down in the results instead of removing them entirely.


"Do you honestly think that the people who complain about their favorite website being censored by Google would be satisfied with showing up on page 200*?"

My complaint has nothing to do with my favorite website. My complaint has to do with not being able to discover information and websites because Google won't allow me to dig very far into their search results. They're spidering the vast majority of the internet, and all I get are crumbs.


They're doing more than "push the crap to the bottom". They're pushing the crap to the bottom and then limiting how far you can dig into the pile. I am sometimes interested in that crap.


I agree. If you really want to see every result for a topic this system hurts you. However I think that use case is vanishingly rare. Most users would be better served by refining their query for what they are interested in than paging through hundreds of pages of results.

Google isn't designed to be an archive of every webpage matching a search query; that isn't what their infrastructure is optimized for.


"Google isn't designed to be a archive of every webpage matching a search result, it isn't what their infrastructure is optimized for."

I believe that's exactly what Google is. Limiting search results probably has to do with being able to serve more queries and respond quicker.


>The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

The fact is, I just want the long-tail or weird results in order to escape content farms, but I guess if it were possible for Google to serve those, content farms would spring up to game that long-tail/weird-results market.


Google tries to ignore them already, so the long tail is probably littered with old and mitigated content farms, because they "match" but have a low page rank.


Absolutely. But not your right to ensure that nobody can hear.


> But not your right to ensure that nobody can hear.

Exactly. On social networks, the block button is there for anyone to use if you don't want to hear it.


"I disapprove of what you say, and I will fight to the death to stop others from hearing you say it." - pretty sure this is a historic quote made into todays language.


I saw a solid video a couple days ago called "Why I hate Wikipedia (and you should too!)" https://youtu.be/-vmSFO1Zfo8

I'd never given Wikipedia much thought, but it raised some very interesting points to think about.

TLDR: Despite mediocre information, Wikipedia's ubiquity has crowded out the market for better sites dedicated to specific topics.


The "Why I hate Wikipedia" video was made by YouTuber and journalist J. J. McCullough. You wanna see something funny? This discussion is from 2008:

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

The J. J. McCullough biography was created and deleted a total of five times on Wikipedia (with two different spellings of J.J.):

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

Each time it was deleted for "lack of notability". I believe he may not be telling the whole story in that video.

Apart from that, his criticisms are quite valid, if not exactly new. The video was discussed here the other day, see https://news.ycombinator.com/item?id=32793755


I’ll have to watch the video later to see how good their argument is… but I don’t buy it.

“The market” has seemingly deemed low value SEO spam to be what’s valuable. If Wikipedia didn’t exist, I imagine the alternative would more likely be garbage rather than quality.


In this case, "the market" is essentially just Google - who incidentally created the market for SEO spam.

I don't necessarily blame Google for the tight integration with Wikipedia. It's much easier to deal with one site than many.

I think it's disingenuous to blame "the market" for what's essentially the decision of a monopoly.


Did the government grant Google a monopoly?

This is a market outcome. Perfect competition is not the natural state of the market, even though free marketers prefer to hand wave it into existence.


The government is absolutely granting Google a monopoly today. They own the browser, your email, and search. If antitrust had any teeth in this era the company would have been nuked a decade ago if not even sooner.


Antitrust is government intervention into the market. Not enforcing antitrust is not granting a monopoly.

Don’t get me wrong, antitrust is good! Government intervention is necessary. But laissez-faire does not mean what you appear to think it means.


When a corporation has an obvious monopoly and the government does not intervene, it is signalling that the status quo is allowed to continue.


Yes, except that is not a market outcome, that is government intervention into the market.

That is substantively different from a government granted monopoly, wherein it is illegal to create a competing business.

The point is that monopolies are a natural market outcome, and are what created Google. Does Google continue to exist as it is because the government has chosen not to do anything? Yes. Welcome to the free market.


Google was not really created by a natural market outcome when you consider what it takes to be a large multinational corporation operating out of the US. Any time a policy decision is made that can benefit either the larger, more politically connected stakeholder or the smaller stakeholder, guess who gets the short end of the stick almost every time in practice. It's not the big guy. Google is open about this: they have a political action committee, and they have a website page that opens with "At Google, we believe it is important to have a voice in the political process..." (1). To suggest that they don't use these mechanisms to better their own position in this market you allege to be free is naive.

https://www.google.com/publicpolicy/transparency/


