
How we got 100,000 visitors without noticing - StavrosK
http://blog.historio.us/how-we-got-100000-visitors-without-noticing-4
======
mmaunder
Eek, not to rain on the parade, but this is not good, historious team.
Google's duplicate content filter is generally accurate, but in this case it
penalized the site that originated the content and ranked yours higher
because it assumed you were the owner of the content, i.e. the page about
sending free SMSes.

Google will quickly correct this error and you'll find the traffic on that
page drops to nothing. You may also find other duplicate pages on your site
penalized in the same way. You may also find your site penalized by Google for
essentially screen-scraping sites and copying their content verbatim and in
its entirety.

I'd recommend you set up a robots.txt to block Google from indexing identical
pages on your site. Copying content verbatim and republishing it in its
entirety is not a good SEO strategy unless you have thousands of throw-away
domains and wear a black hat.
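
For illustration, a minimal robots.txt along these lines (assuming the cached
copies live under a path like /cached/, as historious's example URLs suggest)
might look like:

    # hypothetical robots.txt for the cache subdomain;
    # blocks all crawlers from the cached copies only
    User-agent: *
    Disallow: /cached/

That keeps the duplicates out of the index while leaving the rest of the site
crawlable.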

Sorry about the negative message, I'm sure this was very exciting for you guys
but I don't think it's sustainable.

Perhaps an alternative is to have your users highlight the paragraph or
snippet on each page they found interesting and archive that instead. Then
publish those snippets on your site. Generally I've found that republishing
paragraph-sized chunks of text is OK with Google and will net you decent SEO
traffic. It worked for my job search engine when we republished job
descriptions and limited them to the first 400 chars. You can also mash the
paragraphs up into pages with multiple chunks of text that, e.g., show all
the chunks a particular user (or a set of users) found interesting, or group
them by date or location. That will give you the best of both worlds - lots
of content to SEO and no dup content penalty.
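
As a rough sketch of the 400-char idea (the word-boundary handling here is
just illustrative, not a prescription):

    # sketch: truncate republished text to a ~400-char snippet,
    # breaking at a word boundary so it doesn't end mid-word
    def make_snippet(text, limit=400):
        if len(text) <= limit:
            return text
        cut = text.rfind(" ", 0, limit)
        if cut == -1:  # no space found, hard-truncate
            cut = limit
        return text[:cut].rstrip() + "..."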

Best of luck!!

~~~
JeremyChase
Or set the canonical address of your cached pages to the real page.

~~~
StavrosK
Do you mean the "base" tag? We already do set that (admittedly, after this
happened and because of it).

~~~
fhars
No, he meant the rel="canonical" links.
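
i.e. something like this in the <head> of each cached page, pointing back at
the original (the URL here is just a placeholder):

    <!-- in the <head> of the cached copy -->
    <link rel="canonical" href="http://www.example.com/original-page/" />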

~~~
patio11
This does not work across domains, guys. (Among other reasons, it would turn
injection attacks into $X million bugs for some companies.)

~~~
StavrosK
The Google blog says it does...

------
RBr
The short form, if you don't want to read the entire post, is summarized by
this snippet:

"By some algorithmic oddity, the historious cache actually ranked higher than
the site (which turned out to be the most popular free sms service in Brazil),
so anyone who searched for "free sms" in Brazil ended up on the historious
page!"

In short, Google ranked an archive page (historious) higher than the original
content. Google corrected this problem, presumably automatically, 3 days
after it happened.

------
bananaandapple
They could sue you for copyright infringement if they wanted.

And allowing Google to index your "stolen" content pages is just outrageous.
You don't own that content.

~~~
StavrosK
Then we could sue Google for copyright infringement for caching _our_ pages, I
guess... Why would we not allow Google to cache it? Each cached page has a
great big box on top saying that it's the historious cached version and
linking to the original site...

Example: <http://cache.historious.net/cached/515865/>

~~~
sounddust
There's a huge difference between what Google is doing and what you're doing.

1) Google is caching pages for a specific purpose and ensuring that they
_aren't_ cached/scraped by others:

<http://webcache.googleusercontent.com/robots.txt>

By not excluding robots, you're opening yourself to all kinds of situations
where you are responsible for draining revenue from the owner of the content,
which leaves you liable to lawsuits. By contrast, the way that Google caches
content and their rules surrounding it do not generally harm the copyright
owner.

2) Google honors robots.txt, noarchive meta tags, and other indications that
the author doesn't want the page to be cached. Is historious doing the same?
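
For reference, the page-level opt-out meant here is the standard noarchive
directive (this is the generic robots convention, not anything
historious-specific):

    <!-- page-level opt-out: ask crawlers not to keep a cached copy -->
    <meta name="robots" content="noarchive" />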

~~~
StavrosK
1) We do exclude robots now, yes. 2) historious doesn't spider websites; it
only saves the pages the users give us. It's the same as a user deciding to
make a backup of a webpage on their computer...

~~~
carbocation
"It's the same as a user deciding to make a backup of a webpage on their
computer..."

... and then publishing it on the Internet.

(This is not meant to be snarky or to imply opposition to your product at all.
I think there is a meaningful difference between saving to a computer and
saving to a web-accessible, apparently globally readable website.)

~~~
StavrosK
Isn't it a user's responsibility to obey copyright restrictions in this case,
given that we never publish content unless the user does it? It's basically
the same situation as hosting a website: if you upload and publish a
copyrighted page, is the host responsible?

~~~
sounddust
In my opinion, those two cases are not similar. I doubt that this type of
automatic caching/publishing would have any protection under the DMCA
safe-harbor provisions unless you're making it clear to users what they're
doing (I'm not a user of the service, so maybe you already are).

If I understand correctly, the users of your site are simply bookmarking
pages. You are then caching those pages, storing them, and publishing them at
world-readable URLs. There are many ways that you could provide the same
experience to the user without making the cached pages publicly accessible.

If you were to give users the option to make specific bookmarks world-readable
- and you provided a disclaimer explaining that they should not make
copyrighted material world-readable - then it might be different. But that's
probably something you should discuss with an attorney.

~~~
StavrosK
Ah, no, our users cache pages, but if they want the cache world-readable, they
need to explicitly click the "publish" link.

Thank you for the information, I'll talk to our lawyer about it just to be
safe.

~~~
carbocation
I wanted to let you know that I didn't mean to disappear without responding,
but that sounddust expressed what I was thinking already so I don't have much
to add. I did not know that the world-readable bit was opt-in; I think that's
a good start, and I'm glad you're getting legal advice on this topic.

~~~
StavrosK
Ah, that's the nature of online commenting! Thank you for your concern, we'll
talk to a lawyer to clarify this (perhaps in the ToS).

Thanks again!

~~~
bananaandapple
I thought you were European? The DMCA safe-harbor laws don't apply to you.

~~~
StavrosK
I am, I'm in the UK. I know the DMCA safe-harbor laws don't apply, but neither
does the DMCA. Copyright law is similar everywhere, however, so I just wanted
to get an idea. I have a lawyer researching this right now, though. Thanks
again!

~~~
bananaandapple
That's not true.

Your copyright law is similar to the US- or Israel-based one, not as strict
as the main European one, which is based on the Napoleonic code and is very,
very strict.

~~~
StavrosK
I see, thank you. Our lawyer advised us to add a clause to the ToS and we, of
course, take action against copyright infringement.

Thanks for the feedback!

------
chrismiller
This might be slightly off topic, but if you don't mind, could you share the
specs of your Varnish server? Everything I have read about Varnish says that
it will run extremely well on pretty much any decent hardware, but I am still
unsure how much power I should be providing it.

~~~
jacquesm
How many hits per day are you expecting?

We serve up about 2,000 requests per instance per second, with 8 instances on
an 8-core machine with 32 GB of RAM.

~~~
pvg
Why run separate instances on the same box?

~~~
jacquesm
To get over a bunch of limits.

~~~
moe
What limits?

~~~
jacquesm
Specifically, the 65K file descriptor limit; I can't seem to get around that
one in spite of all the limits having been raised appropriately.

So I ended up running multiple instances. I've discussed this with a few other
HN'ers and we agree that it shouldn't be necessary, but to date I have no
solution for this.
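
For anyone hitting the same wall, the usual knobs look like this (a sketch of
standard Linux settings; as noted above, these were already raised here and
the problem persisted, so this evidently isn't the whole story):

    # /etc/security/limits.conf - per-process fd ceiling
    *    soft    nofile    200000
    *    hard    nofile    200000

    # system-wide fd ceiling
    sysctl -w fs.file-max=500000

    # and in the shell that launches varnishd
    ulimit -n 200000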

------
jasonkester
This is exactly what should happen when your low-traffic site suddenly gets
50k visits in a day: Absolutely Nothing. Kudos to the historio.us guys for
building something that can actually handle a little spike in traffic without
falling over.

We see so many sites come through here that are showing 500 errors by the time
they get halfway up the front page that you'd think nobody knew how to do this
stuff anymore.

Great job of ticking off the basics, guys. Building on top of that foundation,
you shouldn't have any trouble scaling out when this sort of traffic starts
becoming a daily occurrence.

~~~
StavrosK
Thank you! To give credit where it's due, buro9 from HN urged us to implement
the Varnish caching feature for cached pages, so we had it written but
disabled, because we didn't think we'd get that much traffic. We hadn't
turned it on for the first 30k visitors and the server was still doing great,
but when we saw the spike we just switched it on and the load dropped to
nothing.

 _Serious_ amounts of love for Varnish here.
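
For the curious, a minimal config for a setup like this might look something
like the VCL below (a sketch with an assumed backend address and URL layout,
not our actual config; Varnish 2.1-style syntax):

    backend default {
        .host = "127.0.0.1";   # app server, assumed local
        .port = "8080";
    }

    sub vcl_recv {
        # cached pages are static snapshots: drop cookies so
        # Varnish can serve them straight from memory
        if (req.url ~ "^/cached/") {
            unset req.http.Cookie;
            return (lookup);
        }
    }

    sub vcl_fetch {
        # keep cached-page responses around for a day (arbitrary TTL)
        if (req.url ~ "^/cached/") {
            set beresp.ttl = 24h;
        }
        return (deliver);
    }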

------
jacquesm
I predict a lot of sites featuring the words 'free sms' will be hammered out
within hours of this article's posting.

~~~
StavrosK
We didn't know one of our users had publicised it, though, and we didn't know
Google would even see it...

------
csomar
Nice discovery. I read somewhere on blackhat forums about Google search
engine traffic exploits. I didn't understand what they were talking about
until I read your post today.

------
emehrkay
I just hit the "historify!" bookmarklet on the varnish page. :)

~~~
StavrosK
Your actions have placed all of us in great danger.

------
templaedhel
This makes me want to go and try out Varnish. I have heard great things, but
have never gotten around to setting it up.

