
Who is horse_js? - juniusfree
https://whoishorsejs.com/
======
patrickaljord
In case you didn't notice, this is clever advertising for Microsoft Azure.
Both authors work for Microsoft. I could count 3 mentions of Azure, two direct
links to Azure products, 1 quote from a Microsoft researcher, 1 quote from a
Microsoft dev advocate and 1 embedded Bing map. You've just been played by
Microsoft marketing. Also, Tom Dale works for Microsoft himself so it's just
one big family story.

~~~
the_duke
Tom Dale working for MS makes it likely that this was a reverse-engineering
effort.

As in, they knew who they were looking for from the start, and just worked
with the data to find the known conclusion.

Also, there's no actual machine learning in this really, except calling out to
a hosted language processing service...

~~~
jlborxes
A nice case of parallel construction.

------
sdrothrock
This wasn't very rigorous at all, but it was a moderately fun read because it
really made it clear how "large amounts of data" with some simple
visualization can help people make some mostly-educated guesses.

I was most surprised at them glossing over the activity patterns with guesses
based on their assumption of the target's sleeping patterns -- their guess of
a time zone would have been stymied by someone who liked to sleep early/late
or had an unusual work schedule, but there was no mention of that or their
reasoning.

That all made sense given patrickaljord's comment about it all being one big
Microsoft ad, though.

~~~
tetha
This can also be a pretty powerful troubleshooting technique, which is why I
consider having something like grafana or prometheus around extremely
valuable.

Looking for anomalies at the same time, or in sequence, easily turns "What is
going on?" into "Alright, why is there so much more stuff coming into the
system, and why is that increased ingress causing increased memory usage per
event?"

------
gumoro
Bug report: time series analysis chart fails to display properly on Firefox,
all points stay on x=0, I see "Unexpected value NaN parsing cx attribute" x250
in the console. Works fine on Chrome.

~~~
ricardobeat
Breaks in Safari too.

~~~
thibautg
The layout is completely broken in Edge too. They must have forgotten that it
does not use the Chromium engine yet.

Too bad for a (fun and clever) piece of Microsoft advertising.

------
unao
The article was quite enjoyable to read, though full of MS marketing.

When I saw the part with Azure Cognitive Services Text Analytics, I burst out
laughing. Earlier, they quoted: _Half of the time when companies say they need
"AI" what they really need is a SELECT clause with GROUP BY._

Their motivation of using AI is even below that threshold.

Now awaiting some horse_js comment about this absurdity.

------
lclarkmichalek
I'm clearly a terrible person, but I read the reveal and immediately thought
"But X isn't funny enough to be horse_js!"

~~~
tomdale
Wowwwwww

~~~
pickpuck
> We got permission from our suspect before we released this site and they
> have allowed to use their name and release the data that we had about them.

Looking forward to your response!

------
Confiks
I was scrolling through the whole article raging for an ethics paragraph, but
I guess they handled that pretty well.

They added a nifty and, I think, necessary touch by staying in the dark
themselves; I very much doubt that the data they gathered can really reveal the
author's identity, and the result they arrived at (Tom Dale) seems to largely
originate from the "quotes one person far more than others" metric.

You could almost consider it an anti-metric: which intently pseudonymous
author would dare to retweet their own nym? However, to counter this analysis
you'd have to blend in with the average Twitter user in your niche, so it then
comes down to a psychological game of "what would horse_js do?".

~~~
cwmma
The Ember part is a giveaway too: since it's a framework with a small but
passionate base, you'd expect the person behind it to care about Ember. That
being said, the reasoning doesn't rule out wycats.

------
urvader
Just use elasticsearch/kibana and you will save yourselves from lots of Azure
costs. Find keywords and group by device and location. Simple as that.
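
A rough sketch of the group-by idea in plain Python, with no Elasticsearch involved. It assumes each tweet has been reduced to a dict with hypothetical `source` (client/device) and `place` fields; those names and the sample rows are made up for illustration, not taken from the article's data.

```python
from collections import Counter

def group_counts(tweets, *keys):
    """Count tweets per combination of fields, like GROUP BY ... COUNT(*)."""
    return Counter(tuple(t.get(k, "unknown") for k in keys) for t in tweets)

# Made-up sample rows; the field names are assumptions.
tweets = [
    {"source": "Twitter for iPhone", "place": "New York"},
    {"source": "Twitter for iPhone", "place": "New York"},
    {"source": "Twitter Web Client", "place": "Chicago"},
]

# Print the groups from most to least frequent.
for group, n in group_counts(tweets, "source", "place").most_common():
    print(group, n)
```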

~~~
sbarre
> Just use elasticsearch/kibana

Are you volunteering all the time to set that up for me? ;-)

------
saagarjha
> @horse_js lives in either the Central or Eastern time zone. Their activity
> dwindles sharply in the evening and disappears between ~11 PM - 12 AM CST and
> reappears at ~8 AM - 9 AM CST because they are likely asleep.

As someone commenting at 4 AM, this might not be a great assumption to make ;)

~~~
audiolion
do you comment every day with consistency around 4am? the point is with enough
data you establish a pattern and ignore outliers. the time series graph is
indicative of an EST/CST sleep schedule.
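
Something like the following is probably what the pattern-finding boils down to: bucket posts by hour of day and look for the longest quiet stretch. This is a minimal sketch, not the article's code; the function names, the 8-hour window, and the sample timestamps are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def hourly_histogram(timestamps, utc_offset_hours=0):
    """Posts per hour of day (0-23) in the given fixed UTC offset.

    Expects timezone-aware datetimes.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter(ts.astimezone(tz).hour for ts in timestamps)
    return [counts.get(h, 0) for h in range(24)]

def quietest_window(hist, length=8):
    """Start hour of the `length`-hour window (wrapping past midnight)
    with the fewest posts; the earliest start wins on ties."""
    totals = [sum(hist[(start + i) % 24] for i in range(length))
              for start in range(24)]
    return min(range(24), key=totals.__getitem__)

# Made-up activity: one post per hour from 14:00 UTC through 04:00 UTC.
sample = [datetime(2019, 1, 1, 14, tzinfo=timezone.utc) + timedelta(hours=h)
          for h in range(15)]

hist = hourly_histogram(sample, utc_offset_hours=-6)  # CST is UTC-6
print("quiet window starts at hour", quietest_window(hist))
```

The quiet window only says when the account is idle. Mapping that to a time zone still assumes a conventional sleep schedule, which is exactly the objection raised above.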

~~~
saagarjha
More consistently than I'd certainly like. I've been told by concerned friends
that I have issues with my sleep schedule; the most recent was something along
the lines of "why is that when I check Hacker News in the morning I keep
finding your comments made three hours ago".

------
Epskampie
Very fun read, it contains just the right amount of detail to stay
entertaining without getting bogged down in minutiae.

------
xg15
> _" But I already know who @horse_js is, and it's not [...]!"

Perhaps. The data here is not 100% conclusive. There are some critical
assumptions holding up our conclusion and [...] has never confirmed (or
denied) our findings.

Perhaps the horse lives to tweet another day..._

Ironically this highlights one of the main problems with how machine learning
is used.

On a very high level, I think you can sum up machine learning algorithms as
finding patterns in enormous heaps of noisy data ("training") then trying to
apply the discovered patterns to novel data and using the result to guess the
answer to a question you posed ("predicting").

The keyword being _guess_ here. Unlike algorithms not based on learning, there
is no guarantee that the answer is correct, because you usually don't know if
the training data you supplied was sufficient or if the learned patterns were
the ones you need. If you knew, you could just hard code the patterns directly
and get rid of the whole learning overhead altogether.

Researchers know and communicate this. However, in the press, "AI" seems to be
seen as almost the exact opposite: Not only can those fantasy AI systems
answer questions about fuzzy human concepts with the precision of a computer,
their answers are even _better_ than the human ones - which is why the things
we need to worry about are ethics discussions and humanity becoming
obsolete...

This could be funny if it were just restricted to science fiction and public
discussion, but it becomes problematic when "AI" systems are used to make
life-changing decisions like setting insurance premiums or declaring persons
suspicious to law enforcement.

~~~
_underfl0w_
> it becomes problematic when "AI" systems are used to make life-changing
> decisions like setting insurance premiums or declaring persons suspicious
> to law enforcement.

Hasn't the latter already happened? I'm without link/source, but I seem to
recall reading about tests of a humanoid-looking AI-driven "attendant" at a
border somewhere that would judge people based on looks/temperament and try
to guess if they were lying about what's in their luggage.

~~~
hobofan
Yes, discriminating machine learning is commonplace, but often hard to
uncover, as it's rarely obvious to the users of automated systems how those
values are constructed.

Luckily, some critical parts like insurance calculation are regulated (in some
parts of the world) to require explainable algorithms to prevent this kind of
discrimination, so it's not as bleak as it's often made out to be. Of course
it's also important that it stays that way.

------
mcintyre1994
> The API is rate limited, so we created a set of Node.js Azure Functions that
> ran on timers. These functions would request as many tweets as they could
> before they were rate limited, wait for the timeout interval specified by
> the API docs, then resume processing where they left off.

How does this work? You pay for your function's run time in serverless, so you
surely wouldn't want to just have the function sleep for x minutes or however
long it gets rate limited. I can see a way to do it using a service bus queue
(push the message with a delay of x minutes, have the function set up to run
on messages on that queue) but they specifically said timers. Does Azure let
you programmatically set the timer for a function from inside that function
(e.g. "Run me again in 3 minutes")?

~~~
raudabaugh
Azure Functions can be configured to run on fixed schedules via timer triggers
([https://docs.microsoft.com/en-us/azure/azure-functions/funct...](https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-timer)),
so I’m guessing they set theirs to run
every API timeout interval + max amount of time they could request tweets
before getting rate limited. Their Cosmos DB instance could then be set up to
track how many tweets they had gotten through on each function run.
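
To make that guess concrete, here is a minimal sketch of the "stop when rate limited, resume on the next scheduled run" pattern. It is not the authors' code and uses no Azure APIs: `run_once` stands in for one timer-triggered invocation, the `state` dict stands in for whatever they persisted in Cosmos DB, and `make_fake_api` and `RateLimited` are hypothetical helpers for the demo.

```python
class RateLimited(Exception):
    """Raised by the (hypothetical) fetch function when the API says back off."""

def run_once(fetch_page, state):
    """One scheduled run: fetch until rate limited, saving the cursor as we go."""
    if state.get("done"):
        return state
    cursor = state.get("cursor")
    while True:
        try:
            tweets, next_cursor = fetch_page(cursor)
        except RateLimited:
            break  # exit instead of sleeping; the next timer run resumes
        state["fetched"] = state.get("fetched", 0) + len(tweets)
        cursor = state["cursor"] = next_cursor
        if cursor is None:  # no more pages
            state["done"] = True
            break
    return state

def make_fake_api(total_pages, pages_per_window):
    """Fake paged API that allows `pages_per_window` calls per run."""
    calls = {"n": 0}
    def fetch_page(cursor):
        if calls["n"] >= pages_per_window:
            calls["n"] = 0  # pretend the window resets before the next run
            raise RateLimited()
        calls["n"] += 1
        page = cursor or 0
        nxt = page + 1 if page + 1 < total_pages else None
        return [f"tweet-{page}"], nxt
    return fetch_page

api, state, runs = make_fake_api(total_pages=5, pages_per_window=2), {}, 0
while not state.get("done"):
    run_once(api, state)  # in practice, each call is a separate timer firing
    runs += 1
print(runs, state["fetched"])
```

Since no invocation ever sleeps, you would only pay for the time spent actually fetching.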

------
mothsonasloth
It's interesting but also worrying. This is pretty much doxxing under the guise
of intellectual curiosity.

If the article explained why they wanted to identify this account, then fair
enough. However, you are going to end up on an ethical slippery slope where it
will be used to doxx people who are controversial, trolls, political dissidents
and whistleblowers.

~~~
nightfly
None of the techniques they used to come to their conclusion were
groundbreaking or even surprising.

~~~
mothsonasloth
To you or me, but for others it's an education on how they can be compromised
or how they can compromise others.

------
fartcannon
This is marketing. Someone should flag it.

------
manaatemandate
I am so tired of Microsoft meddling in the developer space. I wish they would
just crawl away and make themselves irrelevant. If not for the money and how
they shove their products down people's throats, nobody would even consider
using them. Internet? They had to taint it with IE6. Operating systems? The
dreadful Windows 10 malware OS. Then they lure developers with their software,
which, if you have money, you can produce a ton of, and just crush everything
that is good in IT.

------
pepijndevos
Tweet from Tom Dale himself
[https://twitter.com/tomdale/status/1086675110801625089](https://twitter.com/tomdale/status/1086675110801625089)

------
aw3c2
> Finally. Some Machine Learning.

> We ran all of @horse_js tweets from the last 2 years through Azure Cognitive
> Services Text Analytics service. This service identifies keywords in
> phrases.

How was that necessary in comparison to a simple "split by whitespace, count
occurrences"? :P
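
For reference, the naive version I have in mind is only a few lines. The sample tweets are made up.

```python
from collections import Counter

def word_counts(tweets):
    """Lowercase each tweet, split on whitespace, and count every token."""
    return Counter(word for tweet in tweets for word in tweet.lower().split())

# Made-up sample text, for illustration only.
sample = ["the callback is the problem", "The problem is async"]
print(word_counts(sample).most_common(3))
```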

~~~
a_bonobo
That service does more NLP-level stuff: remove stop-words (a, the, an, etc.),
stem or lemmatize words (eating becomes eat), keep only words that represent
the core message of the text. I think that's about it.
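
A toy version of those steps, to show what they add over plain whitespace splitting. This is not how the Azure service works internally (I don't know that); the stop-word list is a tiny illustrative one and `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative list; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}

def crude_stem(word):
    """Strip a few common suffixes; a real stemmer handles far more cases."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def keywords(text):
    """Tokenize, drop stop words, reduce to a rough root form, and count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

print(keywords("The horse is eating and eating callbacks"))
```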

~~~
hhjinks
Neat, that's actually something I'll find useful in one of my personal
projects. Guess I'll have to check it out.

~~~
kamaln7
We used Lucene (open source) in our information retrieval course and
tokenizing (w/ removing stop words etc.) is one of the things it does. If you
just want to experiment, that's also another option to look at if you like!

------
eggie5
cluster horse_js and Tom Dale's tweets in an embedding space and you can
confirm your hypothesis.
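
A much simpler stand-in for the idea, assuming you already have both accounts' tweets as lists of strings: compare bag-of-words vectors with cosine similarity. Real embeddings and clustering would need a trained model; this sketch only measures vocabulary overlap, and the sample texts are made up.

```python
import math
from collections import Counter

def bow_vector(texts):
    """Bag-of-words counts over all texts from one account."""
    return Counter(w for t in texts for w in t.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Made-up sample texts, for illustration only.
account_a = bow_vector(["promises are callbacks in disguise"])
account_b = bow_vector(["callbacks are promises in disguise"])
account_c = bow_vector(["my sourdough starter finally rose"])

print(cosine(account_a, account_b), cosine(account_a, account_c))
```

A high score would only be weak evidence; two people who tweet about the same framework will also score high.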

------
LeanderK
So why was this a statistics problem?

------
jypepin
That was a nice, fun read, and a simple way to show how sometimes simple data
analysis and common sense trump everything else :)

Also, great website design. Simple and clean!

------
skilled
Neat concept, wonderful execution, and beautiful presentation. Now if only the
entire web could follow suit.

~~~
ivanche
I have to politely disagree. This could've been made as a static page.
Instead, with Javascript off you see just a totally blank page. Thank God that
most of the web doesn't follow suit.

~~~
tastroder
I would add to that the unnecessary scrolling of this presentation style. With
a classical layout you can get far more information above the fold and allow
your reader to skip stuff ("ah, yeah, I see what they did there") without
having to constantly interact with the keyboard / mouse.

------
nailer
Nice try, Angelina Fabbro

------
zemo
not knowing is half the fun.

------
ireallyknow
lon ingram

------
michaelmcmillan
This is journalism!

~~~
pickpuck
This is native advertising.

But it does have the tone of a type of “data journalism” that we should see
more often.

I would appreciate a site that treats all news the way 538 treats politics.

------
vijaybritto
This is very bad. They spoiled it for everyone. That privacy disclaimer at the
end is of no use really. Also Bing maps? That's the first time I saw that.

~~~
XCSme
It's made by Microsoft employees, hence all the advertisements for Azure
and other Microsoft products.

