
Reverse Engineering OKCupid - dsshimel
http://davidshimel.com/reverse-engineering-okcupid/
======
jacques_chester
In terms of the backend, OKC have their own webserver, OKWS[0]. It's been
discussed on HN before[1].

It's a clever and IMNSHO insufficiently copied architecture with interesting
performance and security characteristics.

[0] <https://github.com/okws/okws> [1]
<http://news.ycombinator.com/item?id=2077484>

------
jonathanjaeger
About two years ago, some clever techies used the OkCupid subreddit to "hack"
the OkCupid frontend. If I remember correctly, they used JavaScript to display
the number of messages someone received per day and how many they replied to
(among other things). Eventually OkCupid came across the info and started
moving everything server-side.

~~~
dsshimel
What information were they getting from the subreddit? Just usernames or . . .
?

~~~
oijaf888
The stoplight indicator and a few other things were done via client-side JS
with non-obvious variable names. Once that Greasemonkey script came out, it
was quickly moved to the server side, since there was no reason not to do it
there. I believe it was only on the client side for ease of development while
it was being built.

------
chacham15
This method is well known, and there are ways of making it more difficult. For
example, a server can embed a random token in the page the user is coming from
and require that token to be echoed back with the next request. This forces
you to actually parse the page. They can then generate the token in
JavaScript, making it even more difficult.
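A minimal sketch of the token round-trip described above, using only Python's
stdlib; the field name "csrf_token" and the regex are placeholders for
illustration, since real sites vary:

```python
import re
import urllib.parse
import urllib.request

def submit_with_token(page_html, form_url, fields):
    """Pull a server-embedded hidden token out of the page and
    echo it back with the follow-up request, as a scraper must."""
    # "csrf_token" is a made-up field name; adjust per site.
    match = re.search(r'name="csrf_token"\s+value="([^"]+)"', page_html)
    if match is None:
        raise ValueError("token not found; did the page layout change?")
    fields = dict(fields, csrf_token=match.group(1))
    data = urllib.parse.urlencode(fields).encode()
    return urllib.request.Request(form_url, data=data)
```

Moving the token generation into JavaScript defeats this kind of regex-level
scraping, since the value no longer appears in the served markup at all.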

~~~
kysol
I would just like to point out that not all sites do the following (and I'm
unsure if OKC does), but watch out for tripwires as chacham15 said. Some sites
that I've had the misfortune of "getting to know" use insignificant or blank
inputs as a form of detecting unauthorized access.

One rule I follow is to: Retrieve, Analyse and Regurgitate Everything.
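A sketch of that rule with the stdlib parser, against a hypothetical form
whose "hp_check" field is an invented honeypot: collect every input, blank or
hidden ones included, and send them all back verbatim rather than only the
fields you care about.

```python
from html.parser import HTMLParser

class FormFields(HTMLParser):
    """Collect every <input> name/value pair, including the
    insignificant or blank ones a tripwire might check for."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if "name" in a:
                self.fields[a["name"]] = a.get("value", "")

form = '''<form>
  <input name="username" value="">
  <input type="hidden" name="hp_check" value="">
  <input type="hidden" name="session" value="xyz">
</form>'''

parser = FormFields()
parser.feed(form)
# Regurgitate: submit parser.fields verbatim, blanks and all.
```

Dropping "hp_check" (or filling in a value a browser never would) is exactly
the kind of tell such a tripwire looks for.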

~~~
dsshimel
By inputs do you mean CAPTCHAs or something else?

------
simon_weber
I do a lot of this kind of work for gmusicapi, and I still keep a Windows VM
around just to use Fiddler.

Does anyone have recommendations for other tools? I came away from Burp and
Charles disappointed in the past, but that was some time ago.

~~~
goodside
The most universally effective solution I've seen is Sikuli running in a VM
(which is the only sane way to run it, since it hijacks your input devices).
Everything else fails in some edge case. What other tool can scrape an
interface that uses both HTML and Flash and is only served over HTTPS?

It is brittle, in that it can be broken by cosmetic UI changes, but the
maintenance is generally trivial. Also, it's slow as all hell. But sometimes
you really need that sledgehammer.

------
berlinbrown
Linus responds...

~~~
hnriot
He does, but sadly his comment is way off. Anyone who's done any amount of
HTML scraping will use BeautifulSoup over lxml. The former is easier and more
tolerant of HTML's quirks; the latter is brittle with anything less well
formed than XHTML.

~~~
mickeyp
Sorry, but lxml with etree will handle any amount of broken HTML you throw at
it. Add in XPath and I find lxml to be a far superior, and more memory
efficient, option.

Source: former professional web scraper.
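For what it's worth, a minimal illustration of that claim, assuming lxml is
installed (the tag soup here is a made-up example):

```python
from lxml import html

# Badly formed markup: unclosed <p> and <b>, no <html>/<body>.
soup = "<div><p>first <b>bold<p>second</div>"
tree = html.fromstring(soup)

# lxml repairs the tree on parse, so XPath works on the result.
paragraphs = tree.xpath("//p")
```

The parser auto-closes the dangling tags, so both `<p>` elements come back as
a proper tree rather than raising an error.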

