- how hard it is
- how much time it takes
- how long it takes to become successful
So instead of trying to explain it, I may just send them to this blog post, which shows all 3. Thank you, Gabriel!
(Now if only you would remove that Mojo Badge business from blocking your great content.)
Good to see several people writing recently (from experience) about how this couldn't be farther from the truth. While it may happen from time to time, it's clear that most founders of successful (technology) startups have built a long-term lifestyle around working really hard, for a really long time, on a bunch of their own ideas and projects. Especially true given what we know about how much of a role luck and timing play in the success of startups.
Thanks for the post, Gabriel.
For example, one of the big issues in blackhat spam this past year was illegally hacked sites. Our algorithms weren't doing the best job on hacked sites, so the manual team kept an eye out for hacked sites to remove them (and often to alert the website owners that they'd been hacked). The data generated by the manual team helped us build and deploy multiple new algorithms to detect hacked sites, leading to a 90% reduction in the number of hacked sites showing up in Google's search results in the past few months. That decrease in hacked spam in turn frees up the manual team to tackle the next bleeding-edge technique the spammers use.
I suspect every major search engine uses similar approaches: try to stop the majority of spam with algorithms, but be willing to take action in the meantime while engineers work to improve the algorithms.
I kinda thought one example would make the point. Does it help that much more to give another example? I can look more up. For http://www.bigbadblogdirectory.com/ it looks like you were autogenerating typos not just for websites, but for popular blogs. So http://www.bigbadblogdirectory.com/jeffmatthewsisnotmakingth... looks like it had
(I had to cut out the vast majority of the typos because the comment was too long for HN.)
jeffmatthewsisnotmakingthisup.blogspoot.com, jeffmatthewsisnotmakingthisup.bloyspot.com, jegfmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnomakingthisup.blogspot.com, jeffmatthwesisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup.nlogspot.com, jeffmatthewsisnotmakingthisup.blogspot.ccom, jeffmatthewsisnotmakingthisup.bligspot.com, jeffmatthewsisnotakingthisup.blogspot.com, jeffmatthewsisnotmakinghtisup.blogspot.com, jeffmatthewsisnotmacingthisup.blogspot.com, jdffmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnot akingthisup.blogspot.com, ieffmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup/blogspot.com, jeffmatthewsisnotmajingthisup.blogspot.com, jeffmatthewsisnotmakingthishp.blogspot.com, jeff atthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup.blogspot/com, jeffmatthewwisnotmakingthisup.blogspot.com."
I could post more examples from the other domains, but my point is that this is the sort of thing that users dislike and complain about. If you were a blogger and saw pages like this ranking for your name or your site's name, you probably wouldn't be happy either. From looking at a few domains, I don't think that we overgeneralized from a few pages in this case.
I know that you've moved on and the domains are shut down now. And I'm not trying to be cantankerous. I'm just trying to say that from our point of view there are good reasons to take action on sites like this so that users don't complain to us.
Each site took a long time to make actually. They either involved generating a data set from scratch or piecing together and parsing other large data sets. This one in particular, I was crawling the Web for feed discovery and was planning on adding stuff like grouping the best posts by category, etc.
Yeah, would love to know about some others, e.g. japanese2englishdictionary.com, idnscan.com, serverslist.com. Also, did you actually get any complaints about this or was it triggered by some other threshold/thing? On a side note, I still get requests about exposing some of this data, i.e. sites behind ip addresses or lists of domains matching some criteria. In any case, thx for the info!
I can understand the need to take action. I just think it could have been handled better. If typos were the problem, I would have removed them immediately if someone told me, and that could have been automated. In retrospect, it seems pretty obvious, but it wasn't at the time.
There's also a class of folks we call navigation spammers who try to show up for tons of domain name queries. I can give you some history to provide context. In the old days, when you searched for [myspace.com] we'd show a single result as if someone had done the query [info:myspace.com]. The problem is that people would misspell it and do the query [mypsace.com], and then we'd end up showing either no result or (usually) a low-quality typo-squatting url. So we made url queries be a string search, so [myspace.com] would return 10 results. That way if someone misspelled the query, they might get the exact-match bad url at #1, but they'd probably get the right answer somewhere else in the top 10. Overall, the change was a big win, because 10% of our queries are misspelled. But if you're showing 10 results for url queries, now there's an opportunity for spammers to SEO for url queries and get dregs of traffic from the #2 to #10 positions. Now we're getting closer to present-day, so I'll just say we've made algorithmic changes to reduce the impact of that.
But you were hitting a bunch of different factors: tons of typos, specifically for misspelled url queries, autogenerated content, lots of different domain names that looked to have a fair amount of overlap (expireddomainscan.com, registereddomainscan.com, refundeddomainscan.com, etc.). If you were doing this again, I'd recommend fewer domain names and putting more UI/value-add work on the individual domains.
I mean, if you're auto-generating a page that has this text: "Elcorillord.com
Common misspellings and typos: Elcroillord.com, Elcorilolrd.com, Elcori.lord.com, Elcofillord.com, www.elcorillord.com, lEcorillord.com, Elcotillord.com, Elcorillor.com, Elcoeillord.com, Elcori,lord.com, Wlcorillord.com, Elcorjllord.com, Elcorillord.coom, Elocrillord.com, Elcor8llord.com, Elvorillord.com, Elcorillprd.com, Elcorillord.cim, Elcorillorf.com, Elcorilloed.com, Elorillord.com, Elckrillord.com, Elcoriplord.com, Elcorillord.ckm, Elcorillord.cm, Elcorillord.ccom, Epcorillord.com, Elcoril;ord.com, Elcoirllord.com, Elcoriillord.com, Elforillord.com, 3lcorillord.com, Elcorollord.com, Elcorillordd.com, Elcorill0rd.com, Elcorillord/com, Elcoriolord.com, Ekcorillord.com, Elcorillord.xom, Elcorillord.co, Elcorilord.com, Elcoillord.com, 4lcorillord.com, Elcoriloord.com, Elcorillorr.com, Eldorillord.com, Elcorillord..com, Elcorrillord.com, http://www.elcorillord.com, El orillord.com, E.corillord.com, Elcorillord. om, Elcorilllrd.com, Elcorillrod.com, Elcoriklord.com, Elcorillorrd.com, Elcorillordcom, Elcorillkrd.com, Elcorillord.om, Elcorlilord.com, Elco4illord.com, Elcorillrd.com, Elcprillord.com, Elcodillord.com, Elcorillordc.om, Ecorillord.com, Elcoorillord.com, Slcorillord.com, Elcorillorx.com, Elcorill9rd.com, Elcorilpord.com, Elcorillord.cpm, Elcorillord.fom, Elco5illord.com, Elc9rillord.com, Elcorillird.com, Elcirillord.com, Elcorillord.clm, Elcorillors.com, Elcorillord.vom, Elcorullord.com, Elcorillord.comm, Elcorillord.c9m, Eocorillord.com, Elcorilloord.com, Elcourillourd.com, E,corillord.com, Elcorkllord.com, Elcorillodr.com, Elcorillodd.com, Elcorillord,com, Elcorillotd.com, Elcorillod.com, Elcorillor.dcom, Elcor9llord.com, Elc0rillord.com, Elcoril,ord.com, Elcorilllord.com, Elcorillo5d.com, EElcorillord.com, Elxorillord.com, E;corillord.com, Elcori;lord.com, Elcorllord.com, Elccorillord.com, Elcrillord.com, Elcoril.ord.com, Elcorilkord.com, Elcorillord.cmo, Ellcorillord.com, Eclorillord.com, Elcorillo4d.com, Rlcorillord.com, wwwelcorillord.com, 
Elclrillord.com, ElcorillordLcom, Dlcorillord.com, Elcorillofd.com, Elcorillore.com, Elcorillord;com, lcorillord.com, Elcorillorc.com, Elcorillord.c0m, Elcorillord.dom, Elcorillord.ocm."
Surely you have to see where many people would consider that either keyword stuffing, gibberish or typo spam.
However, I realize some were closer to the line and I should have focused on being less cookie-cutter and more useful in the domains that were really better (farther along). I had always intended on coming back and working more on each, but wanted to get placeholders up quickly because it takes a while to get backlinks and indexed.
I guess I'm saying I had hoped I would have at least been contacted with a warning and what was found objectionable before just being totally blacklisted with no reason given. I would have also hoped that each site would have been addressed individually. If I had been contacted and you had said, hey, you need to remove these misspellings off of these sites, I would have done it immediately.
Here are some comments on the above though. Again, from my perspective these weren't violating the guidelines because the pages were useful from the user's perspective and there were no hidden tricks going on.
First off, there were actually many categories of sites, domains was just one of them. Others were sports stats, definitions, language, medical, and addresses. For each site I made, I was modelling it off of other sites that had gotten great Google rankings for years. I had hoped to eventually improve the UX on those sites and get similar rankings. For domains, I'm talking mainly about who.is and domaintools.com.
Each domain had a static site index, and that's what you linked to above in the screen shot. The extensive ones weren't really meant to be browsed, but just so search engines could find the pages (pre my knowledge of sitemaps). It's no different than any of the other static sitemaps, e.g. http://who.is/whois_index/index.php, and most of them looked better than the screenshot.
That one in particular came from the code for the streetsandzips site that was a big tag cloud. I was trying to find ways to make the static site better, and that was one of them. It looks better when the fonts are of different sizes :). I had intended for that site to make them different sizes based on the traffic numbers, so Google, Facebook, would be really big, etc. On the streetsandzips site the bigger cities are bigger.
In fact, I believe I evolved the sites so that those (site index) pages had noindex,follow on them such that they wouldn't come up on search results. I also added a search engine (Google custom search) on each page as well. I don't remember if I got to the tag cloud sizes for this particular domain at the time of blacklisting.
As for the misspellings, I did mess around with those, but not on all sites, and I believe at the time they were blacklisted they had been removed from most of the domains, if not all.
A list of common misspellings and typos is, as you know, a tool that people provide to those who buy domains. I built it for that purpose, and wanted to see how many people were actually searching for this stuff, so added it to some of the domains. Turns out, a lot of people do. I didn't just tack it on to the footer or cloak it or whatever; I put it in with a purpose that people ask for, e.g. common misspellings and typos.
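For anyone wondering what such a tool does mechanically, here is a minimal sketch in Python. It is not the code that ran on any of these sites; the function name `typo_variants` and the same-row keyboard map are assumptions made for illustration. It produces the kinds of single-edit variants visible in the lists quoted in this thread: a dropped character, a doubled character, two swapped neighbors, and a hit on the key to the left or right.

```python
# Hypothetical sketch: single-edit typos for the name part of a domain.
# Assumes the input contains a dot and only the part before the last
# dot is varied. The keyboard map is a simplification (QWERTY, same row only).

ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
NEIGHBORS = {}
for row in ROWS:
    for i, ch in enumerate(row):
        # Key to the left (if any) plus key to the right (if any).
        NEIGHBORS[ch] = row[max(i - 1, 0):i] + row[i + 1:i + 2]


def typo_variants(domain):
    """Return a set of one-edit misspellings of `domain`."""
    name, dot, tld = domain.lower().rpartition(".")
    variants = set()
    for i in range(len(name)):
        # Omission: drop one character.
        variants.add(name[:i] + name[i + 1:])
        # Duplication: type one character twice.
        variants.add(name[:i] + name[i] + name[i:])
        # Transposition: swap two adjacent characters.
        if i + 1 < len(name):
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
        # Substitution: hit a neighboring key instead.
        for near in NEIGHBORS.get(name[i], ""):
            variants.add(name[:i] + near + name[i + 1:])
    # Swapping identical letters reproduces the original; drop it.
    variants.discard(name)
    variants.discard("")
    return {v + dot + tld for v in variants}


if __name__ == "__main__":
    for typo in sorted(typo_variants("myspace.com")):
        print(typo)
```

Even this small rule set yields dozens of variants per domain, which gives a sense of how quickly a page fills up when every variant is printed.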
Additionally, from the user's perspective, if they got to this page by typing in one of those misspellings, they were getting a big link to the official site at top and then more information about that site, e.g. siteadvisor rating, traffic, etc. So it was essentially functioning as one-click Did you mean x.
I'm happy to answer more questions about it. But it is pretty clear that it was still shoot first and ask questions later. No one ever contacted me about anything. I wasn't trying to hide anything from Google. It was all in my personal adsense account.
I can understand from a search engine perspective, banning sites. But given I already had a relationship with Google, I expected to be contacted. In fact, at one point I had a call with an Adsense guy from Google trying to help me better optimize my sites for Google! He looked at them and had no issues with them, so I thought I was fine.
Also, IIRC I submitted at least one re-inclusion request after being banned, and never heard a response back from that either. Before submitting that request I did a top to bottom review and tried to remove anything even close to the line, including misspellings I believe.
They claim they respond to all "Site reconsideration" requests. I had to file one once, they did respond, but with a very non-informative and unhelpful response.
The net effect is that we haven't found a way to talk 1:1 with every webmaster, and I'm not sure whether that's possible. The story of webmaster communication for the last few years at Google has been trying to improve scalability of the info. The earliest Google webmaster communicator ("GoogleGuy") answered questions on a webmaster forum. In 2005 I started a blog, which has the advantage of permalinks for posts like http://www.mattcutts.com/blog/seo-mistakes-autogenerated-doo... . We tried doing live webmaster chats, but that would only reach 400-500 webmasters at a time.
The most scalable thing I've found so far is making videos. Here's a video that came out last month about the dangers of autogenerating pages for example: http://www.youtube.com/watch?v=A8bgpWtVHo4 . We're at almost 300 videos now, and we're getting closer to 3M total views on our webmaster video channel. The hope is that this additional guidance helps people self-identify what can cause issues to avoid or to correct them without needing to talk to Google.
The other big tool that has been helpful is http://google.com/webmasters/ . That provides tools to identify the common errors/mistakes that webmasters make (crawl errors, 404 pages, canonicalization, robots.txt issues, identifying hacked sites using the "Fetch as Googlebot" feature, etc.). That helps with many of the straightforward issues, but of course it doesn't solve the issue with "sheer number of webmasters who have ranking questions vs. number of Googlers." If anyone has suggestions on how to tackle communication with webmasters in a more scalable way, I'd appreciate feedback on how to do better on that.
I understand the argument behind keeping it a black box, but it doesn't need to be as much of a blackhole. For example, in this case the following could have happened:
1) Site triggers some alarm for violating something.
2) Just those site(s) get strongly penalized.
3) Automatic emails go out in the message centers of Google Webmaster tools, analytics, adsense, and Gmail -- wherever the sites show up registered. In my case, it would have been all of the above.
4) The messages indicate the nature of the violation, that there is a penalty in effect.
5) There is a link to click on if you think you've corrected the errors.
6) If you click it, it auto-checks your site in y days and sends you another message that it passed or not.
7) If not corrected, it stays penalized or there are a series of penalties until full blacklisting.
That's all automated, i.e. scalable. I understand there are some tricky bits about how much to reveal about why things were penalized and what not, but I think those could be worked around usefully.
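The steps above can be sketched as a small state machine. This is only an illustration of the proposal, not anything Google runs: the `Status` levels, the `Site` record, and the `still_violating` callback (a stand-in for whatever detector raised the alarm) are all hypothetical, and the "y days" delay is left out so the example stays self-contained.

```python
# Hypothetical sketch of the proposed flag -> notify -> recheck loop.
from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    PENALIZED = 1
    HEAVILY_PENALIZED = 2
    BLACKLISTED = 3


@dataclass
class Site:
    domain: str
    status: Status = Status.OK
    violation: str = ""
    outbox: list = field(default_factory=list)  # stands in for message centers


def notify(site, text):
    """Record a message for the owner (steps 3 and 4)."""
    site.outbox.append(f"[{site.domain}] {text}")


def flag(site, violation):
    """Steps 1-4: an alarm fires, the site is penalized, the owner is told why."""
    site.violation = violation
    site.status = Status.PENALIZED
    notify(site, f"Penalty in effect: {violation}. Request a recheck once fixed.")


def recheck(site, still_violating):
    """Steps 5-7: the owner asks for a recheck; clear the penalty or escalate."""
    if site.status is Status.OK:
        return site.status
    if still_violating(site):
        # Escalate one level at a time, capped at full blacklisting.
        site.status = Status(min(site.status + 1, Status.BLACKLISTED))
        notify(site, f"Recheck failed: {site.violation}. Now {site.status.name}.")
    else:
        site.status = Status.OK
        site.violation = ""
        notify(site, "Recheck passed. Penalty lifted.")
    return site.status


if __name__ == "__main__":
    site = Site("example.com")
    flag(site, "autogenerated typo lists")
    recheck(site, lambda s: True)   # owner clicked too early
    recheck(site, lambda s: False)  # owner actually fixed it
    print("\n".join(site.outbox))
```

Whether a blacklisted site should still be able to pass a recheck is a policy choice the list above doesn't settle; this sketch allows it.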
Google attempts to determine whether you deserve a warning; the goal is to notify honest folks, without notifying real "bad guy" spammers that they've been caught. Naturally, the algorithm gets it wrong sometimes... detecting wrongdoing is easier than detecting intent.
Were you blacklisted from Google Search or Google AdSense or both? Google AdSense's blacklist policy is totally separate from Google Search; Google AdSense's policy is to blacklist people on suspicion of wrong-doing (guilty unless proven innocent).
Second, we're doing our part to spread what we're learning about running a top 500 website (Stack Overflow) with the community, in the form of http://webmasters.stackexchange.com
Do we make mistakes? You bet we do. Just the other day I accidentally disallowed all questions on Stack Overflow from being spidered in robots.txt. That.. was .. not a good day.
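A cheap guard against that kind of slip is to test a robots.txt draft against a handful of URLs that must stay crawlable before deploying it. Here is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `blocked_urls`, the example.com URLs, and the `/questions/` path are made up for illustration and are not the real Stack Overflow configuration.

```python
# Sketch: list which must-be-crawlable URLs a robots.txt draft would block.
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the subset of `urls` that `robots_txt` disallows for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    # Hypothetical draft containing the kind of mistake described above.
    draft = "User-agent: *\nDisallow: /questions/\n"
    must_be_crawlable = [
        "https://example.com/questions/1",
        "https://example.com/about",
    ]
    for url in blocked_urls(draft, must_be_crawlable):
        print("would be blocked:", url)
```

Run as part of a deploy script, a non-empty result would be a reason to stop and look before the file goes live.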
However, I interact with a lot of customers who seem put off by webmaster central. It seems to be a very outdated interface. I understand it's important to be clear and concise when explaining these issues. But if you look around the web 2.0 world at people providing similar information there's a harsh contrast.
Put simply, webmaster central is small text with a dark appearance and little or no graphics. In my experience and testing this fosters a mentality of "This is too complex". Users seem to encounter long wordy pages with no graphics and convince themselves it's beyond them, before they begin to read.
Making a page lighter and throwing in a few visual aids goes a long way in curbing this issue, as well as making the information easier to understand and more fun to read.
It seems like a small thing, but it scales to become overwhelming when you consider that most people who encounter a page like this and dismiss it at a glance start looking for an email us link or a contact phone number.
This is my experience anyway. Perhaps your results may vary.
The idea is still percolating, but I think it's got a lot of potential.
The funny part is that the domain in question had already expired when the email arrived, because I'd decided to stop this venture.
Is this one of the reasons DDG got started?
I have a few websites that automatically make new posts. As of 10/14, they all show 0 pages indexed in Google. Previously they would get a few thousand visitors per day.
I guess Google feels as though they violate their terms and removed them. It seems to me it was a manual removal.
I received no emails in webmaster tools about the removal.
Making a bunch of autogenerated sites has its risks. For example, if you were just taking a bunch of MP3 names or Hot Trends queries and then scraping twitter for mentions of those phrases and slapping that all up on a website with scripts, that tends to cruft up our index with autogenerated content that users complain about and that violates our quality guidelines. Likewise, if all you were doing was scraping Twitter for phrases like sad or heartbroken or heartless and throwing that scraped Twitter content up on a webpage with a script, users would also complain about that autogenerated content and it would violate our guidelines. Would that be helpful insight?
I made this over a weekend. And the people whose poetry is being captured love it. But it is auto-generated in the sense you're talking about.
It actually went down for a bit and I got a bunch of complaints, enough that I got it back up relatively quickly.
How to Plan a Happy Blended Family
How to Harmony in Your New Blended Family
How to have harmony in your new Blended Family
How to Achieve Harmony in a Blended Family
How to Nurture A Blended Family
How to Successfully Manage a Blended Family
WTF is this junk? Why does ehow.com get 3 million Google visitors a day? The mind boggles!
As for mine:
1. Mefeedia.com Built it out for 2 years, then sold it because it wasn't going where I wanted it to go.
2. Poorbuthappy.com Lots of traffic for travel forums, but the community got out of hand so I had to close it.
Those were the 2 main projects where there was an expectation of it possibly becoming something big-ish.
Back to hacking at my project...
Why did this fail, really? This should be a runaway success. There are millions of people out there that can barely figure out their cameras, let alone understand the concept behind facebook or picasa or flickr or whatever could be considered "competition".
Even with a founder departure, this is a valid idea... why didn't you continue to pursue it?
You'll have dozens, or hundreds, of ideas during your entrepreneurial development...but, any one of them will probably require absolute focus and dedication to make it really work.
Thanks for this.
One quick question: I'm not an uber geek, i.e. I like the business side of the equation too. Is this a good thing or a bad thing?
So, the better question is: what kind of startup are you building? If it's a technology startup, you will either need to commit to learning your technology space or finding a technical co-founder. Committing to learning the technology will teach you a lot about your interests, you will either love it or not - if you don't love it, you won't succeed technically (you might business wise, there are plenty of companies with shitty technology that make money).
Just some thoughts from a non-business oriented intellectual and programmer.
Nth clubs sounds like it could work with a bit of incentive for club pros to recommend it...
It actually got traffic too. I don't know what to say other than that it didn't feel like a startup at the time.
nth Club isn't a bad idea -- it just requires sales work that I don't want to do. I thought my partner (who is into golf) would be doing it, but that just has turned out not to be the case.
In Googling "namesdatabase" I've come across some old claims of allegedly dubious practices of the site that occurred while you were running it. I know your reputation is stellar here and that you contribute much to the community so I was more than a bit surprised.
May I ask, have you addressed these allegations somewhere? I'd like to give you the benefit of the doubt, so I'm wondering where I can read your side of the story. Is there an HN thread or blog post you can point me towards? Many thanks.
I think a lot of it stems from either misunderstanding, edge cases (http://www.gabrielweinberg.com/blog/2010/02/one-in-a-million...), or just a fundamental problem with the idea of referring friends.
--you could opt-out from emails or remove yourself from the database at any time.
--you could see a detailed explanation of how every aspect of the site worked before signing up, on a page I spent countless hours writing and tweaking.
--similarly, there was a vast support system that answered almost any faq.
--you could see the whole database on our static site before signing up.
When you sold NDB did you have any concerns about how the new owners would treat members and their data? Knowing what has happened since the sale (which--for some people at least--seems to be controversial), would you do anything differently if you could?
A few years ago I had a company approach me to sell a small site I was running but I was never quite convinced they weren't just spammers/scammers wanting a customer list and felt like I owed my users more than that.
It would be interesting to know what you learned from each of these projects.
Which project do you consider the most important from a learning perspective?
Call it something like 'Start Up Down'?
What do people think? Obviously not really something that's going to become a lucrative venture but it wouldn't be tough to create either.
I haven't had nearly as many at-bats as you have, but enough to know that I personally can't get very far on my own without an experienced voice guiding me past a lot of dumb ideas.
For Zoofoo, Email client, Yahoo store thing, Namesdatabase & Kangadoo I worked with the same partner. The "Wall" (never launched) was with a different partner. And nth Club was with another partner. The rest is/was solo.