unionpivo's comments

Even if you released such a solution today, it would take months or years to build up the knowledge, toolchains, and best practices, and then to train developers to be able to use it.

> youre risking getting trapped in a local minimum.

Or you are risking years of searching for perfect when you already have good enough.


Because nowadays, more than ever, the content you need is in silos.

Your Facebooks/Twitters/Instagrams/Stack Overflows/Reddits ... They all have limited, expensive APIs and bulk-scraping detection. Sure, you can cobble together something that will work for a while, but you can't run a business on that.

Additionally, most paywalled sites (like news) explicitly whitelist Google and Bing, and if someone creates a new site, they do the same. As an upstart you would have to reach out to them to get them to whitelist you, and you would need to do it not only in the USA but globally.

Another problem is Cloudflare and other CDNs/web firewalls, so even trying to index a mom-and-pop blog could be problematic. And most of the mom-and-pop blogs are nowadays on some blogging platform that is just another silo.

Now that I think about it, Cloudflare might be in a good position to do it.

The AI hype and the scraping for content to feed the models have increased the difficulty for anyone new trying to start a new index.


This is the best (and saddest) answer. LLMs break the social contract of the internet; we're in a feudalisation process.

The decentralized nature of the internet was amazing for businesses, and monopolization could ruin the space and slow innovation down significantly.


> LLMs break the social contract of the internet

The legal concept of fair use has been and is being challenged, and will be tested in court. Is the Golden Age of Fair Use Over? Maybe [0].

[0] https://blog.mojeek.com/2024/05/is-the-golden-age-of-fair-us...


While LLMs have accelerated it, silos were already blocking non-Google and non-Bing crawlers before LLMs. LLMs have only made the web's existing problems worse; they were problems before LLMs too, and banning LLMs won't fix the core issues of silos and misinformation.


You're thinking too much by the rules. You can absolutely scrape them anyway. Probably the biggest relevant factor is CGNAT and other technologies that make you blend in with a crowd. If I run a scraper on my cellphone hotspot, the site can't block me without blocking a quarter of all cellphones in the country.

If the site is less aggressively blocking but only has a per-IP rate limit, buy a subscription to one of those VPNs (it doesn't matter if they're "actually secure" or not - you can borrow their IP addresses either way). If the site is extremely aggressive, you can outsource to the slightly grey market for residential proxy services - for fifty cents to several dollars per gigabyte, so make sure that fits in your business plan.

There's an upper bound to a website's aggressiveness in blocking, before they lose all their users, which tops out below how aggressive you can be in buying a whole bunch of SIM cards, pointing a directional antenna at McDonald's, or staying a night at every hotel in the area to learn their wi-fi passwords.
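For what it's worth, the per-IP rate-limit workaround is simple to sketch. This is a rough illustration only, not a turnkey scraper: the proxy URLs and credentials below are made-up placeholders, and a real setup would need retries, blocklist handling, and a vendor's actual endpoints on top.

```python
import itertools
import random
import time
import urllib.request

# Hypothetical pool of proxy endpoints (e.g. rented from a residential
# proxy vendor). These URLs and credentials are placeholders only.
PROXY_POOL = [
    "http://user:pass@proxy1.example.net:8080",
    "http://user:pass@proxy2.example.net:8080",
    "http://user:pass@proxy3.example.net:8080",
]

_rotation = itertools.cycle(PROXY_POOL)

def next_proxy():
    """Round-robin through the pool so no single IP carries all the load."""
    return next(_rotation)

def fetch(url, min_delay=2.0, max_delay=6.0):
    """Fetch a URL through the next proxy, with a jittered delay so each
    individual IP stays under a typical per-IP rate limit."""
    proxy = next_proxy()
    time.sleep(random.uniform(min_delay, max_delay))
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # blend in with browsers
    with opener.open(url, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The rotation is the whole trick: with N proxies and a jittered delay, each IP sees only 1/N of your traffic, so what looks aggressive in aggregate looks ordinary per address.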


> You're thinking too much by the rules. You can absolutely scrape them anyway. Probably the biggest relevant factor is CGNAT and other technologies that make you blend in with a crowd. If I run a scraper on my cellphone hotspot, the site can't block me without blocking a quarter of all cellphones in the country.

I am familiar with most of that, and there is a BIG difference between finding a workaround for one site that you scrape occasionally and finding workarounds for all of the sites.

Big sites will definitely put entire ISPs behind annoying CAPTCHAs that are designed to stop exactly this. (If you ever wonder why you sometimes get CAPTCHAs that seem slow to load, have long animations, or are otherwise annoyingly slow, that is why.)

And once you start making enough money to employ all the people you need to do that consistently, they will find a jurisdiction or three where they can sue you.

Also, good luck finding residential/mobile ISPs that will stand by and not try to throttle you after a while.

You definitely can get away with doing all of that for a while, but you absolutely can't build a sustainable business on it.


There are many rationalizations to not try.


And JavaScript/dynamic content. Entrenched search engines have had a long time to optimize scraping for complex sites.


> management's thirst for elimiating pesky problems that come with dealing with human bodies

But that's what 95% of management is for. If you don't have humans, you don't need the majority of managers.

And I know of plenty of asshole managers, who enjoy their job because they get to boss people around.

And another thing people are forgetting: end users, AKA consumers, will be able to use similar tech as well. So for something they used to hire a company for, they will just use AI, so you don't even need CEOs and financial managers in the end :)

Because if a software CEO can push a button to create an app that he wants to sell, so can his end users.


> Wow this is such an awful excuse.

Yes, for whoever organized such a course and didn't give such guidance.

And besides, the course asked for the project to do something. It did: it printed lines. We can call the email a gimmick, a marketing strategy, making a turd look good.

Don't blame students for the failure of whoever designed the course.


So did they disclose that all the Pi did was print lines?

The problem with the email isn't that it's a gimmick, etc.; it's that it appears quite clear the students created the impression that it was the Pi doing it.

Your excuse that it's difficult for first-year college students with no coding experience to do something useful with the Raspberry Pi is disproven by the fact that there exist many extremely useful projects that kids with no coding experience can do, so college students almost certainly should be able to manage without resorting to gimmicks.

So I don't understand your complaints about the course. It's clearly not too hard, which is what you're implying. And if you're suggesting that the wording for the project wasn't clear enough, then that's a huge claim to make considering you don't know what the wording was.

Also, college (at least in the U.S.) was never about playing funny word games with the professor. There’s a level of maturity, reasonableness, and respect that is expected of the students. None of which is indicated in the response here.


> There’s a level of maturity, reasonableness, and respect that is expected of the students.

Given that the general teaching style of colleges isn't unique to the US, and based on my experience throughout my degree at a similar institution, I somehow doubt that statement.

> It’s clearly not too hard which is what you’re implying.

From the way the course is described, it sounded like the students received literally no guidance. These types of assignments usually result in those with previous programming experience showing off their skills, while the actual rookie students are left in the mud. I.e., an assignment that targets the top 20% of the class.

Regardless, to my knowledge I never cheated during my college degree, but I can't hold it against people who do. Criticism such as yours disregards the reality that students face: pressure to graduate with good marks and whatnot. Not cheating will put you at a disadvantage, because your competition is actively doing so and they are already skewing the marks that way. If the intention of the assignment was to identify honest work, it was certainly structured wrong (as a required submission would have eliminated the cheaters).


That's another issue going on: you're using your cheating to belittle Google's scan, which again undercuts any ethical ground you might still have had.


Rest In Peace.

Vim is still among my top used editors. And Bram was the one who made sure it kept improving and being useful all those years, since I first used it on my Slackware install.


The problem in big companies is that they are filled with people who, especially the higher you go, have their own plans and agendas. Sometimes these align with each other, and oftentimes one part of the company is actively trying to sabotage another part.

Individuals also have their own personal goals, such as high bonuses or maybe a big IPO where you sell out and become rich.

And as CEO you not only have to have a plan, but you also have to set up incentives for people to follow it, which is often harder than it seems. That is why it sometimes seems a company is sabotaging itself. Sometimes there is something deeper going on, but sometimes they really are sabotaging themselves.


An apposite war story from Nokia, just about the time the iPhone entered the market: our Lords and Masters had decided that internal competition would be a good thing, so smartphones were divided into business, entertainment, and mass-market divisions, with each bidding for exclusive access to the components being delivered by the R&D and productisation teams. So it was that barcode readers ended up with "business" (use case: scan a business card) and autofocus cameras with "entertainment" (use case: selfie). With, of course, the result that the business phone needed an A4-sized business card held at arm's length. This absurd "strategy", probably sold in by a gaggle of consultants, was first celebrated, then subject to extensive soul-searching and some tedious "lessons learned" enquiries that resulted in long PowerPoints filed carefully in NUL, rather than perhaps simply resolving next time to ask some small child working in the front line just what they thought of the Emperor's splendid new attire...


Maybe visible, but as someone who works in the EU, behind the screen there is a lot more emphasis on either not collecting data or being stricter about what you collect and how you collect it.

Identifying and documenting PII data, and periodically reviewing it, is becoming standard procedure.

I remember when GDPR was announced: for most projects, there was no one who could tell you what data was being collected and where it was stored.

So GDPR did have a positive* effect, at least in the part of the market I am familiar with.

* As in positive for people's privacy.


> Will you volunteer to tell the retiring teacher or firefighter that they will have to starve because you’d rather punish the ultimate perpetrators rather than hold the actually guilty (the then CEO and the board) accountable to the law?

I would. And firefighters and teachers are used to getting fucked by everybody anyway. (My mom is a teacher, and I have several friends who are firefighters.)

But in reality, big union funds invest only a little in any one company, so no, some big company losing part of its market cap would make little difference to each individual teacher and firefighter.


>It's a broken system where effort and performance isn't acknowledged and rewarded and the whole purpose of being there is undermined. At some point the whole enterprise is destructive and should be shut down.

The point of a presentation is usually not to put up a great presentation, but to get certain points across.

If you successfully do that with a quickly thrown-together presentation, then it's a waste of time to spend more on it.


> ... critically acclaimed ...

> ... bait so it wins awards.

So if you are a producer who wants a "critically acclaimed" author who "wins awards", hire her? Maybe that answers why she is getting paid that much.

> Writing 1000 episodes of those shows isn't comparable to a single good bond film. This is why quantity and quality are not the same.

That's debatable. Personally, I wasn't impressed with either her work or the last few James Bond films (with the last one being the best of the lot).

For me, the best and most consistent author of the past decade has been Randall Munroe of https://xkcd.com/ fame. Does he count?

