Are Product Hunt's featured products still online today? (scrapingbee.com)
199 points by daolf on Feb 9, 2022 | 108 comments



Anecdotal data. I released a couple of things on Product Hunt. Popularity-wise one did well, the other went nowhere. Financially, it was completely the opposite. Boring stuff is very successful. I haven't seen a "popular" Product Hunt thing that I am willing to pay for in ages!!

People pay for pain meds, product hunt featured products are colorful vitamins.


Would you mind sharing the links? This is at the same time fascinating and total common sense, so I'd like to learn more.


It's a little like competing over the store check-out aisle shelves. It's unlikely to have what you actually need, but there might be something shiny or tasty. And if you do want more next time, you'll probably go shop around instead.


Totally anecdotal / single datapoint: launched my side project in 2018 on Producthunt. Total crickets. 5 upvotes.

We are now a 20+ person team, with 400+ B2B customers and $12M raised.


And we are one of those 400+ B2B customers. Didn't know you were just 20+ people!

Really, really solid product. Checkly is a very core part of our infrastructure.


Checkly right? Looks pretty interesting.


Thanks, that means a lot!


> “$12M raised”

Why is that even a metric to speak about?

Shouldn't the metric that matters most be # of customers, how engaged they are with using the product and ultimately sales?

Bringing up how much you have raised seems like the priorities are misaligned.


Hey, I kinda agree. Customers are #1 and #2 etc. I believe we have quite a few happy customers.

But this is a forum managed by a VC so some people are certainly interested in this. I also thought it was interesting in the context of a side project launched on PH.

Feel free to check my Twitter (link in my bio) on how I think about and interact with customers.


Oh don’t act like it’s not an important number. We don’t live in a utopia where money doesn’t matter. It’s an important number.


It’s an important number for the CEO and CFO, but it means nothing about the business or its success. “Funding raised” is completely uncorrelated with how successful the company is.


This is just patently untrue. The failure rate of venture-backed startups is 75%. The failure rate of all startups is 90%. Funding is correlated with a lower failure rate.

https://www.failory.com/blog/startup-failure-rate


You can use statistics to make inferences about groups as a whole, but you can’t use stats to deduce anything about any specific case.

A coin has a 50/50 chance of being heads, but just because the coin landed on tails last time doesn’t mean it’ll be heads the next time.

Here is the link I should have included originally https://en.m.wikipedia.org/wiki/Ecological_fallacy


You said ' “Funding raised” is completely uncorrelated with how successful the company is. ' I showed you this statement is wrong, by providing you a statistic that proves they are correlated. We were not talking about any specific case, I don't know why you've opened with that. The coin flip statement is true but also completely unrelated to our discussion.


Citation required


I agree with @tnolet that it's not the number one stat I want to know about any given company. But, at the very least, it does mean that they were able to convince some people who look at thousands of companies a year and then interview hundreds of them to give them $12M. That's not nothing. Granted, $12M is not a whole lot of money in VC land, but there is some signal in that data point.


All that means is that they have $12 million in funding. Assuming anything else is a logical fallacy; there is no more information available in that number.


Yes, there is, just as I said: they were able to convince someone who looks at thousands of companies a year, then interviews a handful of them to give them $12M. You don't think they prayed really hard to the money fairy and got their wish granted, do you?

No, you shouldn't read much more into that, but there factually is information behind the number.


Well no. It means they convinced someone to give them 12M. It says nothing about the credentials of the lender, or that those credentials are even relevant or accurate.

Not to mention that convincing a vc, regardless of clout, is not about objective profitability or anything correlated with success.

It’s about convincing a person that you can make money, and people are incapable of being objective.


Even ignoring all the other factors, there is a disconnect between only managing to get five (free) upvotes in one forum, and finding a group of people willing to bet 12 million on it in another.


Because if I have $100m more than you, it's quite easy for me to get more, and more highly engaged, customers than you.


> We consider a 2XX (Success) and 3XX (Redirection) status codes successful

I feel like this is flawed, especially considering 1/2 of the successful responses were 3XX. It's possible that they had just linked a short URL that was a redirect, but it's also possible that the product was shuttered and a redirect put in place to a replacement product, the company homepage, or even an acquiring company. I don't think there is an easy way to tell based just on the response code, and I'm not sure you could even programmatically determine it unless you had samples of what the pages looked like on launch day (maybe compare today vs the Internet Archive?).
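
For what it's worth, if someone wanted to try the Archive comparison, the Wayback Machine has a public availability API you could hit for a snapshot near launch day and then diff against the live page. A rough sketch, assuming `requests` is installed; the helper names, timestamp, and similarity threshold below are just placeholders, not anything from the article:

    import requests
    from difflib import SequenceMatcher

    def snapshot_near(url, timestamp="20160101"):
        # Wayback Machine availability API returns the closest archived snapshot, if any
        data = requests.get("https://archive.org/wayback/available",
                            params={"url": url, "timestamp": timestamp}).json()
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest else None

    def looks_alive(url, threshold=0.5):
        # Treat 2XX/3XX as "up", then crudely compare today's HTML to the launch-era snapshot
        live = requests.get(url, timeout=10, allow_redirects=True)
        if live.status_code >= 400:
            return False
        archived_url = snapshot_near(url)
        if not archived_url:
            return True  # nothing to compare against, fall back to the status code alone
        archived = requests.get(archived_url, timeout=10)
        return SequenceMatcher(None, live.text, archived.text).ratio() >= threshold

A very different page doesn't necessarily mean the product is dead (could be a redesign or a pivot), so this is only a filter for manual review, not a verdict.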


Won't parked domains be all 2XX? But those are hardly "alive"


Indeed, another flaw in the system. But I think the article at least calls out that there may be landing pages for dead apps that are still live.


Or that it even redirects to 4xx or 5xx. I had considered this, but decided to draw the line here.


Comments here are really surprising. I really struggle to understand Product Hunt. I've spent multiple sleepless nights scrolling through it and I couldn't find a single meaningful or useful thing. I guess if you start splashing water around the streets, you will find a few perfectly shaped puddles. But I have never stumbled upon anything that made me think "wow, this is awesome", not even "this might be useful".


When they first started a few years ago I used to visit them quite a bit, and often found interesting things, but most (all?) that I actually used ended up being discontinued a year or so later.

I don't really visit the site anymore, but I did take a peek the other week, and like you, nothing really stood out to me.


how about those awesome feature updates from giant tech cos...those weren't helpful?


Hi all, post author here. Just to say this was a really interesting piece to work on - I had a lot of fun poring through the data.


I launched on PH last year, ended up 5th product of the day, then companies started reaching out asking for demos.

I left my job at the end of that week and have been doing it full time for a year now.

However, I agree with comments about launches in general. If you have a good network, you can launch rubbish and end up in the top 5.


Fair warning: This is a blog post advertisement for ScrapingBee. The data is still interesting.

The most interesting chart is one of the last: Proportion of Failures over time. As expected, more recent product links are less likely to 404 or 5xx.

Going back to 2014, almost 1/3 of the featured links give a 4xx or a 5xx response. That’s a lot!

More surprising, links as recent as 2020 show a 1/4 failure rate. Those projects basically launched on PH, then shut down shortly afterward.

Moreover, this analysis can’t actually account for products that have been shuttered but still have landing pages online. It’s ultra cheap to keep a placeholder “Sorry we’re closed” page online, so I imagine a lot of these projects are shut down but counted as “success”.

Subjectively, this matches what I’ve gathered from watching PH. Getting a PH featured product listing seems to be a badge of honor, but PH users aren’t really interested in using 99% of the products and the submitters aren’t actually interested in building them past proof of concept. Recently, the bulk of postings seem to be advertisements for paid information products or pay-to-join communities.


I find your warning a bit unfair as there is literally no CTA inside the blog content promoting our product, and only 2 internal links toward other educational posts.

But anyway,

I thought about taking a random sample of pages that return a "200", let's say 150, and manually tagging them to find out whether they're "dead" or not.

And then reusing the "dead or alive but still a 200" ratio for all the pages, but I was afraid I'd need to tag many more than 150 pages to get a statistically significant result.


> I find your warning a bit unfair as there are literally no CTA inside the blog content promoting our product

It’s obviously blog content designed to promote your product, hosted on the company’s product website. I don’t see how the FYI is unfair.

I added it because the content was valuable but HN can be finicky about blog posts from companies advertising their own products. Trying to get ahead of indignant dismissals.


> It’s obviously blog content designed to promote your product, hosted on the company’s product website. I don’t see how the FYI is unfair.

There are so many blog posts posted here that could fall under the "content marketing" umbrella if you want to be strict. I feel like there's no problem with that if the content is valuable and people like/upvote it. After all, this is a platform that is doing marketing for YC, where YC companies are supposed to post their content too.

That "warning" also stuck out to me as a bit unfair as I was even looking for how it hooks into ScrapingBee (as I was curious how these scraping-aaS platforms interface with custom code) and couldn't find anything.


Yours ended up coming across as the indignant dismissal. As a community member I didn’t appreciate the warning. From the second paragraph on, your comment was an interesting contribution, though. I’m surprised that many of those PoC businesses have stayed online at all, but I guess domains are easy to renew.


Where do they promote the product? It looks like they’re using a random http library.


For what it's worth, you could watch how quickly the confidence intervals converge as you sample the data, to see if it's worth continuing or if the variance is too high and whether you'd have to check thousands of pages by hand:

    from scipy.stats import binomtest
    # Hypothetical example counts from manually tagging a sample of "200" pages
    landing_page_counter = {"dead": 30, "total": 150}
    chance_of_dead_page = binomtest(landing_page_counter["dead"], landing_page_counter["total"]).proportion_ci(confidence_level=0.90)
    print(f'Chance of a dead but existing landing page (90% Confidence Interval): {chance_of_dead_page.low * 100:.2f}% to {chance_of_dead_page.high * 100:.2f}%')


> I find your warning a bit unfair as there are literally no CTA inside the blog content promoting our product and only 2 internal links toward other educational posts.

I've worked in or adjacent to the content marketing world long enough to know that a CTA is not necessary for the post to be marketing/advertising. One of the major goals of content marketing is to establish the authority of the brand. You are well aware that the raison d'etre of that post is to spread awareness of and establish the authority of ScrapingBee.

It doesn't mean the post is not interesting, useful or valuable. But that post exists fundamentally for marketing/brand purposes.

Parent's warning is completely fair, especially since they immediately point out the value of the post.


What's the point of the warning though?

If you're paying attention you know it's content marketing. If you're oblivious, the marketing probably isn't working.

Either way, you probably don't need a warning.


I love warnings, because I don't have to pay as much attention.


> Going back to 2014, almost 1/3 of the featured links give a 4xx or a 5xx response. That’s a lot!

Is that a lot? I would have been less surprised if it were 1/3rd of links still live.


I thought the same thing until I realized that dead domains are often snapped up by squatters/spammers (or just by other people who want that domain for actual reasons) so may not error when requested.
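
If anyone wanted to filter those out, a crude heuristic is to grep the returned HTML for typical parked-page phrases. Just a sketch; the marker list below is made up and nowhere near exhaustive:

    import requests

    # Hypothetical markers; real parked pages use many different templates
    PARKED_MARKERS = ["this domain is for sale", "buy this domain", "domain parking"]

    def looks_parked(url):
        # Only 200-ish responses can be "parked"; error pages are dead anyway
        resp = requests.get(url, timeout=10)
        if not resp.ok:
            return False
        html = resp.text.lower()
        return any(marker in html for marker in PARKED_MARKERS)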


> Fair warning: This is a blog post advertisement for ScrapingBee. The data is still interesting.

Sure, but it's no different than any other blog post from a company. And framing it that way is quite disingenuous since the post pretty much only sticks to the topic and doesn't overtly promote their product.


> Fair warning: This is a blog post advertisement for ScrapingBee.

Yeah, we have all seen it is on ScrapingBee, no need for a warning.


Don't forget cases where a product's domain expired and has since been reused for something else entirely (or a product with a similar goal but a new vendor).


I would love a PH where you have to show off an actual product (instead of blogspam guides, design resources, etc, even though these can be useful)


> More surprising, links as recent as 2020 show a 1/4 failure rate. Those projects basically launched on PH, then shut down shortly afterward.

This would very well fit a "fail fast" attitude with testing MVPs, wouldn't it? At least that's what I would guess. Got a great start with PH but didn't move on from there, so the domain was not renewed...


Does a great start on PH mean very much? The chances your target audience is there for most products seems very low. I would love to see this data compared to all products/startups in general, but of course that's probably difficult to do.


Bingo. IME Producthunt is lazy marketing for indie founders. Their main user base is other founders, wannabe founders and super tech literate power users who are itching to use “the next new beta thing”, who’ll be a tiny percentage of any userbase.

Granted there are cases where that market IS aligned with your product, eg if you built a low cost site-builder or low cost social media publishing platform


The most surprising thing to me is not the failures but that the devs won't even pay a few bucks a year to keep the domains online. If I spent time and effort into building a product that went viral and got a bunch of users, I'd at least leave the front page of it up indefinitely as some sort of tombstone.


That's a lot of incredible journeys that have reached their end.


Fair warning: this response is an advertisement for ycombinator


Funny story: on the day that Product Hunt posted its Show HN, someone (unbeknownst to me) posted my startup on Product Hunt. It was fun to ride a little wave on top of a big wave!

My startup is still around, [1] and we posted on PH one or two other times when we launched new products. Even though we had some powerful hunters (thanks to our early presence on the site), I found it took too much time to be worthwhile for follow-on product releases. I'd be interested to know if others have had the same experience, or if they have tips for how to get a meaningful bump out of subsequent posts.

1: https://www.beelinereader.com


It would be interesting to see:

1. How the online ones are doing financially.

2. Which sectors are doing well - what are the trending tools.

Skimmed through the PH APIs, don't think this is possible. Courtland's Indie Hackers (they have Stripe-verified revenue) may be of help - a quick google resulted in this¹ result

There's also microconf report on SaaS's²

[1] https://www.indiehackers.com/post/indie-hackers-are-making-6...

[2] https://microconf.com/sois-report-2021


Revenue would be near impossible to do; however, we could have analyzed traffic using the SimilarWeb or Ahrefs APIs.

We could also have analyzed the sitemap to check the last update date.
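
The sitemap idea is pretty easy to prototype, for what it's worth: most sitemaps carry a <lastmod> per URL, so the most recent one is a rough "last updated" signal. A minimal sketch, assuming the sitemap lives at /sitemap.xml and ignoring sitemap index files:

    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def last_sitemap_update(base_url):
        # Fetch /sitemap.xml and return the most recent <lastmod> value, if present
        resp = requests.get(base_url.rstrip("/") + "/sitemap.xml", timeout=10)
        if resp.status_code != 200:
            return None
        root = ET.fromstring(resp.content)
        dates = [el.text for el in root.findall(".//sm:lastmod", SITEMAP_NS) if el.text]
        return max(dates) if dates else None  # ISO 8601 dates sort chronologically as strings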

Those articles are really fun to write (I haven't written this one, I'm just the editor), but at some point you have to stop, otherwise you end up with a 20k word essay.


Agreed. A subset of products are "stripe verified" on Indie Hackers - should be a good enough population.

I think the parent article is interesting, thanks for your contributions. I am not saying that the post should have contained revenue or performance data, just that it would be interesting to see :)


This is interesting data. A “failure over time and cohort” could be an interesting visualization. Similar to the cohort retention tables here: https://amplitude.com/blog/cohorts-to-improve-your-retention

It makes it easy to see based on when a product was featured whether it’s becoming more or less likely to fail after a given time period.


My favorite un-ironic ProductHunt product was an app that would let you map where you cried. Absolutely nutty. I miss peak product hunt :(


ProductHunt was simply a way for bootcamp graduates to express themselves during the frontend JS hype cycle of 2014-2018


Interesting, I'd like to see a more comprehensive version of this using https://builtwith.com/ detection data for products where it's relevant.


Many people use a blog platform for their main info site and a different setup for the actual product, isn't that right?


Slightly related: If you wanna analyze all Product Hunt posts until July 2021 yourself, you can do so here:

https://veezoo.com/phdemo

Disclaimer: I created that demo


Unrelated to the article - is it just me or is this scrapingbee product borderline nefarious? From the homepage:

> Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots!

> Scrapingbee helps us to retrieve information from sites that use very sophisticated mechanism to block unwanted traffic, we were struggling with those sites for some time now and I'm very glad that we found ScrapingBee.


It really depends. There are plenty of legitimate uses for scraping (for example, I've been involved with academic research that involved scraping Twitter search results), and it's only really feasible to collect the amount of data you need using scraping plus paid proxies. That being said, there are also a number of nefarious paid proxy services which offer residential IPs (read: are usually botnets).


What is legitimate to a user is not the same as what is legitimate to a site owner. The legitimate way would probably be to use the Twitter API.


The Twitter API has very low rate limits (from a data collection perspective). While there may be good reasons for that, these limits also preclude doing public interest research of the type we were doing (how Twitter's various search filters influence the political leanings of search results). When companies have Twitter's level of societal influence, I think it's also possible to define "legitimate use" in terms of public interest, rather than simply "users" or "site owners."


No more nefarious than the measures websites put up to avoid scrapers? This just rehashes the Linkedin vs Hiq case: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

(not a user, but I do some amount of scraping through other means)


It is definitely super annoying that companies are allowed to spy on us and do all kinds of crazy things with our data, all using computers and automation and "bots" and such, but individuals are increasingly not allowed to use automation to help us out online. Seems rather one-sided. On the other hand, I get that abuse is a huge problem. I do wish at least bots operating at roughly human request rates & daily total requests were considered OK and universally allowed without risk of blocks or other difficulties leading to increased maintenance costs (so, making them less valuable).


Sometimes the scraping situation gets kinda ironic. I worked at a large eRetailer/marketplace and obviously we scraped our major competitors just as they scraped us (there are four major marketplaces here). So each company had a team to implement anti-scraping measures and defeat competitors' defences. Instead of providing an API, everyone decided to spend time and money on this useless arms race.


Absent someone breaking really far away from the pack, that's a classic example of one type of "bullshit job" called out in Graeber's book... Bullshit Jobs. Zero-sum, ever-escalating competition. Militaries are another obvious example (we'd all be better off if every country's military spending were far closer to zero—but no one country can risk lowering it unilaterally, and may even be inclined to increase theirs in response to neighbors, which sometimes gets so insanely wasteful that you see something like the London Naval Treaty or SALT come about in response) but so is a great deal of advertising and marketing activity (you have to spend more only because your competitor started spending more—end result, status quo maintained, but more money spent all around)


I wonder how anyone in IT could take Graeber seriously. One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time which is demonstrably false.


The presentation of that in the book, based off a message from someone in the industry, doesn't seem out of line with the overall tone and reliability-level that Graeber explicitly sets out in the beginning, which is both that the book is not rigorous science and that it's mainly concerned with considering why people's perceptions of their own jobs would be that they're bullshit.

[EDIT]

> One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time which is demonstrably false.

Further, I'm not even sure that's incorrect. It can both be true that most open source (that's actually used by anyone) is done by people who are paid to do it, and that most programmers have very little interesting or challenging to do at work unless they work on hobby projects—maybe open source—in their free time.

The overall letter as quoted in the book, and Graeber's commentary on it, actually makes some good points aside from all this. Things don't have to be perfect to be useful.


The job being un-interesting and un-rewarding doesn't make it bullshit. The job of a truck or a taxi driver is boring as fuck, but it's not bullshit.


A company my previous employer partnered with once asked us to integrate with them.. via scraping and using bots to fill out forms.

Which would have been fine except they also imposed terribly low rate limits with no ability to check them.

We eventually pulled the partnership since it was more work than value.


A lot of data I provide to services is exposed to other individuals so that the service can function. That doesn't mean that data belongs to those people or that they can freely use that data elsewhere.

Allowing unfettered scraping and repurposing of data would have a chilling effect on all types of services. For example I wouldn't necessarily want a bot to scrape my comment history on HN, doxx me, and share my identity and comments with others.


I believe that whenever the “no automation/scraping/bots” clause in Ts&Cs has been tested in court, it has never held up. However, that’s not to say a service can’t just cancel your account if you are found to be using one.

Running a site that's had a bot get stuck in a loop and suddenly send 10,000x the request rate: when they go wrong it’s super annoying for the website owner. We ultimately just banned the whole AWS IP range.


"Nefarious" is a strong word. Courts have repeatedly ruled that scraping data that is otherwise available publicly is legal. You may not personally agree with the ethics, but there are a lot of people who do.


I agree it's a strong word, which is why I said borderline nefarious. However, it's not that far off from a DDOS tool.

At least in the United States, sounds like the jury is still out on the legality: https://www.reuters.com/technology/us-supreme-court-revives-..., but my perspective was more from an ethics standpoint anyway.


It is very far from a DDOS tool. Scraping can be done from a single source, one request at a time, with self imposed rate limits. Sure it can overwhelm a server, but then so can a single user opening 10 tabs.
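
Right. For context, "single source with self-imposed rate limits" is about as mundane as this sketch; the delay and User-Agent string are just examples, not anything ScrapingBee does:

    import time
    import requests

    def polite_fetch(urls, delay_seconds=5):
        # One request at a time from one IP, with a fixed pause between requests
        session = requests.Session()
        session.headers["User-Agent"] = "research-bot (contact: you@example.com)"
        pages = {}
        for url in urls:
            resp = session.get(url, timeout=10)
            pages[url] = resp.text if resp.ok else None
            time.sleep(delay_seconds)  # self-imposed rate limit
        return pages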


> Scraping can be done from a single source

That's not what this tool does though. It allows you to distribute your scraping to a layer of proxies. So, the only difference is whether there is an intent to do harm to the target or merely collect data... which could be a form of doing harm as well?


There are plenty of tools like this where going up to the line is much different than crossing it. There's a vast difference between driving your car to an event and driving the few extra meters into the crowd at an event. You can cut down a tree with a chainsaw or cut down a tree onto your neighbour's house.

There's definitely an argument that dangerous tools should be regulated to varying degrees. If we're arguing regulations in this specific area, you'd probably also be balancing it with regulations that sites can't close an account for reasonable-rate automated access, and that public research can have higher rates so long as it isn't crippling.


The tree example is true and why I agree these things are very similar. The only significant difference is when you put it on your neighbor’s house on purpose.

I wouldn’t regulate this but If you’re introducing regulations, why not just require the source to deliver the data in a neatly packaged format? The necessity for scraping and the potential for DDOS and potentially nefarious behavior basically goes away.


Based on another comment, and the wikipedia article they linked to, it looks like the Supreme Court vacated the decision and remanded the case for further review in June 2021 (probably after this article).[1] Unfortunately there is no citation for that sentence so I'm not entirely sure.

I think that means the jury is still out, as you mentioned, but it's leaning towards scraping being legal as long as the data is publicly available. IANAL

[1] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Nefarious? Then they should arrest Google first, it is the king of web scrapers.


Robots.txt


If the google crawler actually respected robots.txt your point might be salient.


It does.

Please verify your experience with the Google ip range.

https://developers.google.com/search/docs/advanced/crawling/...

A lot of crawlers spoof the Googlebot user agent so you wouldn't block them ;)


Surely you must be joking. Alphabet is the largest web scraper in the world. They would soon go out of business if the URLs listed in robots.txt were the only data they scraped.

It’s not a web crawler. They are all web scrapers. And Alphabet/Google sells this data and makes profits from it.

It is not like it is trying to hide the fact that it is king web scraper.

Google has gotten in trouble from various publishers for this before. It is no secret there is a double standard in big tech.

Again if you are going to arrest a web scraper, then arrest the king of all web scrapers first to make it fair.

Data wants to be free. If it is publicly accessible then it is fair game.


I'm probably not going to get a reply, but let's try:

Source ?


You are stating that Google has never acted in bad faith and that robots.txt is the only thing that Google looks at when crawling/scraping the web.

You’re a smart guy. Surely you must know how ridiculous that sounds on the face of it.

It is common sense.

The sky is blue.

Source: Look up at the sky.


So, no source? Your response is unrelated to the statement at hand.

Think about it: Google has every advantage by respecting robots.txt and nothing to win by ignoring it.

Eg.

1) If a media company doesn't want to get crawled: add it in robots.txt

Then they realize their traffic drops and they'll remove it again.

Ergo: publishers sue. Because they want the advantages, but without the scraping. Which doesn't seem logical to me, since they currently give Google explicit permission to scrape content.

2) If they sometimes leaked personal documents protected by robots.txt, they could have a lot of lawsuits on their hands.

Robots.txt is a simple method to not get blamed.

Ignoring robots.txt could literally be a core business liability from my POV.

---

So please, a source other than gut feeling, as requested before, would be greatly appreciated.


My point is that they scrape the web for data because that is their core business.

I'm not sure why robots.txt was even brought up.

So Google respects this file? I say, so what.

I'm arguing that while Google has free rein to scrape whatever data it wants, we indie devs are subject to the cider house rules.

Sources can be found for just about any argument. So they are more or less useless.

There is nothing wrong with self evident truths or reasonable hypotheses. That is how the modern world was created.

A search engine that scrapes the web for data to make a good search engine. Who would've dreamed of it?

We are not privy to what happens behind closed doors at Google. They only work for their shareholders, not us or the public good.

Source that Google does what it wants based on what it thinks the web should be. Google can change its mind on a whim: https://www.searchenginejournal.com/google-robots-txt-noinde...


It does.


Think how ridiculous it sounds that Google only crawls URLs listed in robots.txt. They would've gone out of business long ago.


Do you know how robots.txt works?

It's an exclusion standard, not an inclusion one.

https://en.m.wikipedia.org/wiki/Robots_exclusion_standard

For individual URL discovery, you can use sitemap.xml.
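
To make the exclusion point concrete, Python's standard library even ships a parser for it; a crawler uses it to ask whether it may fetch a URL it has already discovered elsewhere, not to get a list of URLs to crawl:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # robots.txt only excludes; URL discovery comes from links, sitemaps, etc.
    print(rp.can_fetch("Googlebot", "https://example.com/some-page"))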

In case you know how it works (and I suppose so considering your account age), your comment is just weird tbh.


Google scrapes web data is my point. It is king web scraper.

Robots.txt does not fit into this argument. I'm not sure why it was brought up. Google doesn't scrape URLs listed there? OK. And so? Am I to believe that just because Google says so?

Google scrapes what it wants. It does so for its shareholders. It couldn't care less about web standards.

Source: Amp


Nice! I've often wondered what proportion survive. Tbh, I've launched about a dozen things on PH and it's not realistic for every product to be a success. You learn by your bruises so I'd be surprised if most founders didn't have a string of failed launches behind them.

Interesting to see the categories that had the best responses include no-code!


How many of those dozen things are still up and running?


> there's actually proportionally less failures in Product Hunts busiest period

This is a really interesting post! I think there's a little survivorship bias. As Product Hunt grew 2015-2017, users posted old projects of theirs which were already popular and successful.


Glad you enjoyed the post - I hadn't considered this.


My guess would be that URLs for the categories eliminated after that period (eg. Books and Podcasts) are more likely to remain stable and available, even if the product was a flop.


ProductHunt has become a cesspool of spam and people gaming the voting system to appear as a featured product. Ryan Hoover is too busy with his web3 projects and investments to care.


The no-code trend taking off during the latter half of 2020 is really interesting. Something I haven't paid too much attention to yet.


I just realized Product Hunt is a "top of the funnel" function for AngelList. Huh!


The founder uses it as a funnel for his personal investments as well.


This is a great example of Content PR.


Huh, I never thought about that. I mean, I'm literally launching today (Qvault); I hope I'll be around in a year!


don't worry about it



