Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Fast Anomaly Detection in Graphs [pdf] (nus.edu.sg)
215 points by siddhartb_ 52 days ago | hide | past | web | favorite | 29 comments

Git repo https://github.com/bhatiasiddharth/MIDAS

> Up to 48% more accurate and 644 times faster than the state of the art approaches

Thanks. We give theoretical guarantees on the False Positive Probability which can be useful to decide the parameters. Some use cases of the project include detecting: 1. Intrusions 2. Fake Ratings 3. Financial Fraud

If anyone would be interested in trying to apply these techniques to our COVID behavior change & anti-misinformation effort ProjectDomino.org, we'd be happy to share data - this may be quite helpful! Just jump into the Slack (open invite) and we can start getting you situated.

Sounds interesting. Can you elaborate on what the data is like?

Im currently curating an article about the best COVID datasets and resources so i'd be interested in this

How do you think this can help change the landscape of security, judging by the fast speeds, as low as .13s on DARPA, I imagine it will help block larger numbers of suspicious activities.

As a researcher in anomaly detection myself, I found the premise of the paper very intriguing.

Gephi has a realtime stream importer for Twitter. Would it be possible for this tool to be a Gephi plugin that could be used in realtime on the same graph?

Definitely. It will need only very small changes to the code. I would love to add it as a plugin. Can you point to some resources that can help in incorporating MIDAS into Gephi.

Awesome! It seems like this is the best place to start: https://github.com/gephi/gephi-plugins

This will propel important research into anomaly detection using dynamic graphs. Existing static graph methods have huge flaws; this would fix some of them

Can you share the implementation?

Code and Datasets we used are available at https://github.com/bhatiasiddharth/MIDAS

Nice! I like there's a ready-to-use command line utility.

I think it'd benefit from a refactor to actually allow real-time streaming from stdin.

Do you have any hints how to choose timestamp units and how that affects parameters I should choose?

Nice suggestion. Will definitely try to refactor. Thanks!

In most of the cases, timestamps should be with the data itself (assuming its a dynamic graph). If timestamps are to be chosen, one can select in a way seeing how many edges usually come in one time tick (second/minute etc.)

Timestamps don't affect any parameters other than alpha (temporal decay factor). You may want to check out how to decay the contribution of the past edges in the anomalousness of the current edge. If there is lot of granularity in the timestamps, a smaller alpha should be chosen. Hope it helps.

Thank you for the explanation.

I'm looking forward to M-Stream for multi-dimensional data - but I have one question for that. Is there some preferred approach for selecting features in multi-dimensional anomaly detection?

Because I wonder if given enough dimensions, everything would be anomalous. Kind of like p-hacking works (at p=0.05 one of twenty hypotheses is falsely accepted just by sheer luck).

Interesting question. With an increase in dimensions, we consider the correlation between the features in addition to considering them individually. The work is currently under review. Feel free to get in touch and I can update you once we release the MStream work.

How are anomalous edges (triplet pairings) different from anomalous communities?

We detect suddenly appearing bursts of activity which share many repeated nodes or edges, which we refer to as microclusters. E.g. denial of service (DoS) attacks in network traffic data and lockstep behavior.

Also, we detect scenarios where an individual edge may not be anomalous but along with other edges it acts as an anomalous community. For example, in the animation at https://github.com/bhatiasiddharth/MIDAS/ it may be possible that an individual edge is not anomalous but together the three malicious entities do a coordinated DoS attack.

This is very cool! I wonder where else this can be applied...

Thanks, MIDAS can be used to detect intrusions, fake ratings, frauds. Basically finding anomalous and suspicious behavior in a dynamic (time-evolving) graph.

We have also extended MIDAS to detect group anomalies in higher-dimensional records e.g. event-log data or multi-attributed graphs. We will release it soon.

Will it work if anomalies are more in number than normal?

We assume (like any anomaly detection algorithm) that the majority is normal sample. In your context, the normal samples will be considered as outliers and therefore caught by the algorithm. One way to mitigate this is to either swap the labels. Another way is to sample a subset of the anomalies and then try.

Thank you. Is there a Java implementation available?

Currently MIDAS is available in Rust, Python, Ruby and R at https://github.com/bhatiasiddharth/MIDAS. If someone is interested to convert MIDAS to other languages, please feel free to do so and let me know so that I can add a link in the repository.

Great job. Thanks for sharing.

amazing work



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact