Search without Google Tracking You (My 'pseudo' Startup)
5 points by drm237 on Aug 21, 2007 | hide | past | favorite | 21 comments
There's been a lot in the news lately about search engine privacy, so I thought for a quick project it might be interesting to make a site that doesn't allow the search engines to track you. It's still alpha quality, but please take a look and give me some feedback. Thanks. http://www.flyvault.com/safesearch


Instead of giving our private information to Google to use for 18 months, we are giving it to you to use indefinitely?

Where is your privacy policy?

And, what is your hosting provider's policy? I found out recently that my hosting provider refused to make any guarantees; that means that I cannot make any guarantees either, since my hosting provider has full access to my server.


A privacy policy is very high on my list of things to do, along with an explanation of how it works so that people can feel confident in their privacy. Right now, the only thing I record is a SHA1 hash of everyone's IP address and a timestamp so that I can track how many unique people have used it. With the SHA1 hash, it's very difficult (though not impossible) to trace that back to the user. I'm also talking to my host about the web server logs and how quickly those can be purged.


I hope that's SHA1 plus a secret salt.

Otherwise, I can just build a reverse map of SHA1'd ips.

for ip = 0 to 2^32 - 1: unhash[SHA1(ip)] = ip
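In Python, the reverse-map attack sketched above looks roughly like this. It only enumerates a single /24 block so the example runs instantly; the real attack would walk all 2^32 IPv4 addresses, which is entirely feasible on commodity hardware.

```python
import hashlib

def build_reverse_map(prefix="192.168.1.", count=256):
    """Precompute SHA1(ip) -> ip for a (tiny, illustrative) range of IPs."""
    unhash = {}
    for host in range(count):
        ip = prefix + str(host)
        unhash[hashlib.sha1(ip.encode()).hexdigest()] = ip
    return unhash

unhash = build_reverse_map()

# Given a leaked unsalted hash, recovering the IP is a single dict lookup.
leaked = hashlib.sha1(b"192.168.1.42").hexdigest()
print(unhash[leaked])  # 192.168.1.42
```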


Yes it is. But first you'd have to get all of the results from my database, which is cleared every 24 hours (this is not a challenge!).


Can you use a SHA-2 hash instead? Those variants are considerably more difficult to break.

A problem I see with your service compared to an anonymizing proxy like Tor is that you are still a single point of failure (please correct me if I'm wrong though). If you were legally forced to turn over search records (as the govt was attempting with google a while back), then the requests could be traced directly back to the user.

You mention clearing the database daily, which is a good idea. But again, if it was compromised and a snapshot could be taken, then a brute force crack of your SHA-1 hashes would be possible. Basically, everyone is trusting the security of your database. A misdirection service which telescopes the request through interconnected proxies will not have this single point of failure issue.

Not criticizing your implementation, just making some observations. I think this is a great idea. Mainly your site is so easy for people to use, not needing to install a client application.


Interesting points. So if someone was able to get my php code, they could find the salt, dump the database, generate the lookup tables by hashing every IP address possible plus the salt, and then they would be able to figure out every IP that has used the site since midnight. But, they would still only have your IP as I do not record any of the search results. This chain of events also shows that the hash algorithm I use really doesn't matter. It's protecting the salt and the database that matter most. Anyone have thoughts on that?
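A minimal sketch of the salted scheme being discussed, assuming a server-side secret (the salt value here is a hypothetical placeholder). Using HMAC rather than plain concatenation is one idiomatic way to key the hash, and SHA-256 picks up the SHA-2 suggestion from above:

```python
import hashlib
import hmac

# Hypothetical server-side secret: must stay out of the database and code dumps,
# since anyone holding it can rebuild the lookup table as described above.
SECRET_SALT = b"example-secret-keep-off-the-server-if-possible"

def hash_ip(ip: str) -> str:
    """Keyed hash of an IP address; useless for reverse-mapping without the salt."""
    return hmac.new(SECRET_SALT, ip.encode(), hashlib.sha256).hexdigest()
```

Same IP always yields the same digest (so unique-visitor counting still works), but an attacker with only the database cannot brute-force 2^32 IPs without also stealing the salt.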

Also, another feature I was thinking of adding was an ssl option so you could securely access the site. However, as I don't make any money from the site, it becomes more difficult to justify additional expenses.


You mentioned that you keep timestamp info for each inbound connection; is that a requirement? I only ask b/c this could be used to match up a request on the search engine's servers (with your server's IP as the source) with the connection on your server to pinpoint the user in the event your database was compromised.

One thing you could do which should be easy is send chaff. Randomly send out connection requests to some of the search engines from your server even though a user is not requesting the data. It makes tying connections back to users more difficult because you don't know which request is real and which is fake.
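A rough sketch of the chaff idea, assuming a hypothetical send_query function that performs the actual outbound search (injected here so the sketch stays network-free). The decoy terms and timing are illustrative; the point is just that decoys fire at random intervals, so an observer of outbound traffic can't separate real requests from fake ones by timing:

```python
import random
import threading
import time

# Illustrative decoy search terms; a real deployment would want a large,
# plausible-looking pool.
DECOY_TERMS = ["weather", "news", "recipes", "movie times"]

def chaff_loop(send_query, rounds=5, max_delay=0.01):
    """Fire decoy queries at random intervals via the injected send_query."""
    for _ in range(rounds):
        time.sleep(random.uniform(0, max_delay))
        send_query(random.choice(DECOY_TERMS))

# Demo: collect the decoys instead of hitting a real search engine.
sent = []
t = threading.Thread(target=chaff_loop, args=(sent.append,))
t.start()
t.join()
print(len(sent))  # 5
```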

SSL would eventually be important because it would protect against man-in-the-middle attacks. Someone could hijack connections to your server claiming to be you and then capture all of the requests. Users could potentially be putting in very sensitive information, so this could be a big deal. It would also protect against someone sniffing requests inbound to your server, since the channel is encrypted.

I understand the expenses thing, so I wouldn't worry too much about that. I'd prefer your service be free and not use SSL than to charge for usage. Although I wouldn't mind some ads; you could monetize a bit on that if you wanted.


It's true that timestamps are probably no longer important. I initially had them in there to monitor usage while the test group was fairly limited. I will remove them in the next day or two as the site gets going (or dies...).

The chaff idea is interesting, and it wouldn't be too difficult to implement. Thanks for the idea.


The thing about low-cost web hosting is that it is impossible to stop your web host--or their contractors--from getting the data. Even if you use TLS (SSL), the web host and their associates have full access to your private key. For most hosting solutions, the best that you can do is colocating your own physically-locked server, and use TLS to encrypt everything.

Without using TLS, you cannot prevent the user's ISP from recording--and even reselling--your user's search histories. Similarly, your hosting providers could be doing the same thing. Keep in mind that hosting and bandwidth is a multi-level value chain--your host is probably renting space and bandwidth from somebody else, who is renting from somebody else, who is renting from somebody else. Any one of those companies and/or their rogue employees can collect, re-transmit, prevent, and/or redirect (e.g. man-in-the-middle) your user's queries without your knowledge.


It's a nice initiative. You have probably heard of Scroogle (http://www.scroogle.org/cgi-bin/scraper.htl). How different is your thing from their thing?


They're very similar. The difference is that I like our interface better and we provide unlimited scrolling of the results. If people like the site and want to use it, I'm also planning on adding other search engines.

For the most part, it's been an experiment on my part to have a public site that people use. I've built quite a few sites, but this is the first with AJAX and some other technologies, so it's been a learning project for me as well.


It may be worth noting that the iframe you see when you search is not google, but google's results served from our site. Of course, if we just showed google in an iframe, it wouldn't do anything for privacy.


Isn't that against their terms of service or something?

How's this different/better from some kind of anonymizing proxy like tor?


With things like tor, it can be a little slower, especially if you just want to do a single search. With this, we've added in the feature so you can add it to your Firefox/IE7 search box and then easily use it right from the browser. So it's the same idea, we're just focusing on this one feature for the good of the community.

The other feature we have is endless scrolling, so if you search and can't find it in the first few results, you can just keep scrolling and we'll continue to populate it. That's something not many others have.


Why is this legal? I thought Google no longer has a search api.


The thing is, we're not making money from it in any way (which is why I call it a 'pseudo' startup), and because of that fair use can apply. Also, think about google. They scrape the entire web and profit by serving that as content...


Fair use certainly doesn't apply. You're using Google search technology (which btw, involves a bit more than 'scraping' the web) and stripping out the ads (their source of revenue). Expect a cease and desist letter soon.


Fair use doesn't apply, but if you can get your hands on an old API key it will still work. I built something similar (here: http://www.paulbutler.org/archives/endless-google-search/) and I just used an old Google API key, so I don't violate their TOS.

It makes me wonder though, since you can't make any money with it, how do you plan to pay for hosting fees? Surely you eventually intend to profit, what will you do then?


Correct me if I'm wrong, but isn't the API limited to 1000 searches per day?

No, I don't plan on profiting from this. This is a very low impact system and I don't mind footing the bill for the hosting costs as long as it's of value to the community.


I see, good work then.

You're right, I forgot about the API limit.


Awesome! It's better, faster, nicer looking than Scroogle.org



