
Ask HN: How to Find Anomalies in JSONs? - thiago_fm
Hello everyone, I have some JSON files (>1000, possibly a stream of JSONs)
which look quite similar.

If I want to find anomalies among them, what would be the way to go? I saw
that k-means isn't the best method.

I don't particularly want to find examples which are just a little bit
different from the others, but examples which are VERY different. If you have
ever done web development, you have probably received a strange error inside a
JSON response instead of what you expected. I want to be able to catch that
with an algorithm.

Why do I want to do that? I have a few APIs I use, but sometimes they end up
changing their responses or returning unknown response bodies. I want my
algorithm/model to be able to detect them and show me a list of the biggest
anomalies.

If I manage to do it successfully, I'll make sure it's open source. Also, if
you know an easy way or an OSS solution, please share. Hell, even tell me what
I should study! I was studying deep learning but didn't find any methods known
to me that I could use to make sense of that data.
======
rkx1
If I understand correctly, the type of data that's contained in the files is
much more important than the format. As a starting point, is there anything that's
stopping you from reading the files in Python (pandas) for example and doing
some simple outlier detection (interquartile ranges, standard deviations)?
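
As a rough illustration of the interquartile-range idea, here is a minimal
sketch. It uses only the Python standard library rather than pandas, and the
three features (key count, nesting depth, serialized size) as well as the
function names are assumptions chosen for the example, not anything the
original poster specified.

```python
import json
import statistics


def features(doc):
    """Return simple numeric features of one parsed JSON value."""
    def walk(value, depth):
        # Returns (number of object keys, maximum depth) at or below `value`.
        if isinstance(value, dict):
            children, keys = value.values(), len(value)
        elif isinstance(value, list):
            children, keys = value, 0
        else:
            return 0, depth
        deepest = depth
        for child in children:
            child_keys, child_depth = walk(child, depth + 1)
            keys += child_keys
            deepest = max(deepest, child_depth)
        return keys, deepest

    key_count, max_depth = walk(doc, 0)
    return {"keys": key_count, "depth": max_depth, "size": len(json.dumps(doc))}


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def find_outliers(docs):
    """Return sorted indices of documents that are outliers on any feature."""
    rows = [features(d) for d in docs]
    flagged = set()
    for name in ("keys", "depth", "size"):
        flagged.update(iqr_outliers([r[name] for r in rows]))
    return sorted(flagged)
```

Note that when nearly all documents share the same feature value the IQR is
zero, so any deviation is flagged; a larger `k` or a minimum band width may be
needed for noisy data.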

------
bdr
This seems like a statistics question that doesn't have much to do with JSON.
A starting point:
[https://en.wikipedia.org/wiki/Anomaly_detection](https://en.wikipedia.org/wiki/Anomaly_detection)

You should probably start off trying the simplest thing that could possibly
work for your use-case.

------
jrandm
> If I want to find anomalies among them, what would be the way to go?

What is an anomalous JSON file other than a JSON file that does not meet the
specification[0]?

I have never gotten a "strange error" from a JSON parser. Most JSON parsers
are very specific about whatever character they dislike. I would suggest that
the algorithm you're seeking is in fact whatever is giving you the error.

If you're speaking to an API returning JSON, then you should be able to
determine what the API is supposed to return to you. Many times different
responses contain meaning about why the response is different than expected,
like HTTP status codes.

Deep learning is a tool for solving a problem. Until you have a well-defined
problem, it will be difficult to apply any machine learning technique to it.

[0]: [https://www.json.org/json-en.html](https://www.json.org/json-en.html)

------
verdverm
Cuelang may be able to help here. You can define a schema and then process all
of the JSON. You can look for new/missing fields and constrain a field to a
regex.

The docs are missing some of this. If you jump into Slack, we'd be happy to
help.
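
To make the schema idea concrete, here is a small sketch in plain Python. It
is not CUE and does not use any CUE API; the `SCHEMA` format, the field names,
and the `check` function are all hypothetical and only illustrate the three
checks mentioned above (missing fields, new fields, regex constraints).

```python
import re

# Hypothetical schema: field name -> (expected type, optional regex for strings).
SCHEMA = {
    "id": (int, None),
    "email": (str, r"[^@\s]+@[^@\s]+"),
    "status": (str, r"active|disabled"),
}


def check(doc, schema=SCHEMA):
    """Return a list of human-readable problems found in one parsed JSON value."""
    if not isinstance(doc, dict):
        return ["not a JSON object"]
    problems = []
    # Fields the schema requires but the document lacks.
    for field in sorted(schema.keys() - doc.keys()):
        problems.append(f"missing field: {field}")
    # Fields the document has but the schema does not know about.
    for field in sorted(doc.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    # Type and regex checks for fields present in both.
    for field in sorted(schema.keys() & doc.keys()):
        expected_type, pattern = schema[field]
        value = doc[field]
        if not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
        elif pattern is not None and not re.fullmatch(pattern, value):
            problems.append(f"bad value for {field}: {value!r}")
    return problems
```

Sorting documents by `len(check(doc))` would give a crude "biggest anomalies
first" list.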

------
heavenlyblue
Use the dictdiff module between subsequent responses and then check if the
difference in values is more than just atomics.
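
One way to read "more than just atomics" is: ignore changes in scalar values
and report only changes in structure. The sketch below is a hand-rolled
version of that idea using only built-in Python; it does not use the dictdiff
module, and the `shape` and `structural_changes` names are made up for this
example.

```python
def shape(value, path="$"):
    """Return the set of (path, type name) pairs describing a parsed JSON value."""
    if isinstance(value, dict):
        pairs = {(path, "object")}
        for key, child in value.items():
            pairs |= shape(child, f"{path}.{key}")
        return pairs
    if isinstance(value, list):
        pairs = {(path, "array")}
        for child in value:
            # All elements share one path, so array length alone is not a change.
            pairs |= shape(child, f"{path}[]")
        return pairs
    # Scalars contribute only their type, never their value.
    return {(path, type(value).__name__)}


def structural_changes(previous, current):
    """Return (removed, added) path/type pairs between two responses."""
    before, after = shape(previous), shape(current)
    return sorted(before - after), sorted(after - before)
```

Two responses that differ only in values yield two empty lists; an error body
replacing the usual payload shows up as many removed paths and a few added
ones.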

