
Show HN: Fast Real-Time Anomaly Detection in Dynamic Graphs - siddharthbhatia
https://github.com/bhatiasiddharth/MIDAS/
======
ugh123
The comments in this thread exhibit some serious astroturfing. Also the 6
unique commenters accounts all created in the last 12 days. Just sayin'

~~~
adevx
Did you use an anomaly detection library for these findings?

~~~
ugh123
Hah no. But the amount of exclamation points and softball questions is an easy
red flag to find

------
siddharthbhatia
Anomaly detection in graphs is a critical problem for finding suspicious
behavior in innumerable systems, such as intrusion detection, fake ratings,
and financial fraud. But most of the systems in place focus either on static
graphs or on entire graph snapshots if they consider dynamic (time-evolving)
graphs.

However, to minimize the effect of malicious activities and start recovery as
soon as possible, we need to detect anomalies in real-time or near real-time
i.e. to identify whether an incoming edge is anomalous or not, as soon as we
receive it. In addition, since the number of vertices can increase as we
process the stream of edges, we need an algorithm which uses constant memory
in graph size. Moreover, fraudulent or anomalous events in many applications
occur in microclusters or suddenly arriving groups of suspiciously similar
edges e.g. denial of service attacks in network traffic data and lockstep
behavior.

In this work, we propose MIDAS, which detects microcluster anomalies, or
suddenly arriving groups of suspiciously similar edges, in edge streams, using
constant time and memory. In addition, by using a principled hypothesis
testing framework, MIDAS provides theoretical bounds on the false positive
probability, which earlier methods do not provide. Also, we are up to 644
times faster and 48% more accurate than previous state of the art approaches.

For more details, read the paper at:
[https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf](https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf)

If you are in New York, you are welcome to attend the talk at AAAI in New York
Hilton Midtown on 11th February at 2pm.

------
ZhuoCai
Before I read your paper, I didn't realize the power of hypothesis tests. I
was surprised to see theoretical bounds on anomaly detection with such weak
constraint through simple Chi-Squared goodness-of-fit. Delicate idea!

------
shivin9
Impressive results! Few questions though:-

1) Do you use sliding window approach or exponential decay?

2) When you say groups of similar edges, do they have to be spatially close to
each other? or can they be equally distributed in the graph?

3) Can an attacker figure out the optimum time difference between his attacks
such that MIDAS doesn't detect it as an anomaly? The time gap is just
sufficient enough for the algorithm to weed out the potentially malicious
micro-cluster as obsolete.

Seems like an interesting extension to your work. Best of luck!

~~~
siddharthbhatia
Thank you :) 1\. We use a temporal decay (alpha). 2\. Good question! We
consider similar edges as those having at least one of source and destination
node as the same. 3\. Very interesting direction for future work! We can try
using a variable decay instead of a fixed one to tackle the adversary.

~~~
shivin9
Thanks. It would be nice to have similarity methods that are not dependent on
spatial locality to detect DDoS like attacks.

~~~
siddharthbhatia
We handle spatial locality in terms of not just the source but also the
destination, therefore we should be able to handle DDoS like attacks when
simultaneous edges come from several sources trying to deny one particular
destination.

~~~
shivin9
Ah alright, that makes sense. But it works only when the destination is the
same. In a setting where there are multiple web-hosting servers, you would
need to treat a group of source and destination points as micro-clusters
themselves.

Can you extend MIDAS to adapt to that scenario?

~~~
siddharthbhatia
In Figure 7 of the paper, we show an example of detection when neither
edge/source/destination is individually anomalous but as a whole, it is a
microcluster anomaly. It can similarly be detected when there are multiple
web-hosting servers.

------
arjitjain
Great work! However, from what I understand, microcluster detection seems to
be very similar to subgraph detection. Is there a difference between the two
problems? And if yes, then how do their complexities compare?

~~~
siddharthbhatia
Thanks! There is a subtle difference. We define microclusters within the
category of anomalous edge detection as 'suddenly arriving groups of
suspiciously similar edges' e.g. denial of service attacks in network traffic
data and lockstep behavior.

Subgraph detection on the other hand is detecting anomalous subgraphs and may
also not be able to detect microclusters (suddenly arriving groups of
suspiciously similar edges.)

------
shingeki
HN should use this algo to detect the blatant upvote-hacking going on in this
post :)

------
raghavjajodia
Great to see that! The results look really impressive!

------
sinhayash
Hmm... Interesting work!

Can we extend this to find anomalies in other kinds of data like video
streaming?

~~~
siddharthbhatia
We have extended the work for multi-aspect data i.e. records with multiple
features. Currently our approach is capable of handling structured data only.
It should be interesting to see how to detect anomalies using this approach in
data which does not have a regular structure.

~~~
shivin9
Anything that can be modelled as a graph should work I suppose.

~~~
siddharthbhatia
True

------
geetika92
Interesting work. Definitely has application potential in various businesses.

------
ramarohit
Interesting work and great paper as anomaly detection in real time is very
important now a days. Easy to understand as well.

------
riteshkumar99
I found the paper to be quite exciting, and it is also solving one of the main
tasks in anomaly detection, that is real-time anomaly detection. With other
machine/deep learning based algorithms, one can model the data distribution
quite efficiently, but there is a tradeoff of flagging time. With the
statistical method suggested in the paper, you not only get state-of-the-art
results but are also able to report the results in real-time, a big plus.

