
Presidio: Customizable data protection and PII data anonymization service - yarapavan
https://github.com/microsoft/presidio
======
gcbw3
Is the core of the product a bunch of regexes, or something smarter?

Odd that a product that wants you to base your legal safety on it is not more
transparent about the actual implementation on the initial page; instead it
reads like an advertisement for closed-source SaaS.

~~~
gcbw3
Answering myself: yes, it is a pile of regexes.

[https://github.com/microsoft/presidio/blob/master/presidio-a...](https://github.com/microsoft/presidio/blob/master/presidio-analyzer/analyzer/predefined_recognizers/credit_card_recognizer.py)

And I can't easily find the body of test data... that's not good.
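
For readers wondering what a regex-based recognizer for this kind of entity can look like, here is a minimal sketch. It is not Presidio's code: the pattern, the function names, and the choice to pair the regex with a Luhn checksum are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by single
# spaces or dashes. Real recognizers are usually stricter about prefixes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches that also pass the checksum, as bare digits."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The checksum step is what keeps an arbitrary 16-digit order number from being flagged, which is also why a body of test data matters for this kind of tool.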

~~~
m00x
They also have an NLP engine; the regexes are just predefined recognizers for
known formats.

[https://github.com/microsoft/presidio/tree/master/presidio-a...](https://github.com/microsoft/presidio/tree/master/presidio-analyzer/analyzer/nlp_engine)

------
omri374
Hey, I'm one of the maintainers of Presidio OSS. We built this tool mainly for
Microsoft's strategic customers and decided to open-source it for others to
use. No hidden agenda here.

The engine has two main advantages: It's easily expandable and customizable,
and it works well at scale. We do know of organizations who are very close to
production with it.

Every organization has their own requirements for PII entities, many of them
specific to the org itself, so the engine allows a developer to easily add
support for new PII entities using code, regex or black-lists.

See here:
[https://github.com/microsoft/presidio/blob/master/docs/custo...](https://github.com/microsoft/presidio/blob/master/docs/custom_fields.md)
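
To make the idea of org-specific entities concrete, here is a minimal standalone sketch of regex-based and deny-list-based recognizers. It deliberately does not use Presidio's API (see the linked docs for that); `Finding`, `regex_recognizer`, `deny_list_recognizer`, the entity names, and the scores are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    entity: str   # label of the detected entity type
    start: int    # start offset in the input text
    end: int      # end offset (exclusive)
    score: float  # confidence assigned by the recognizer


def regex_recognizer(entity, pattern, score):
    """Build a recognizer that reports every match of `pattern`."""
    compiled = re.compile(pattern)

    def recognize(text):
        return [Finding(entity, m.start(), m.end(), score)
                for m in compiled.finditer(text)]

    return recognize


def deny_list_recognizer(entity, terms, score):
    """Build a recognizer that matches any listed term, ignoring case."""
    # Longest terms first so longer phrases win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return regex_recognizer(entity, rf"(?i)\b(?:{alternation})\b", score)


# Hypothetical organization-specific entities.
recognizers = [
    regex_recognizer("EMPLOYEE_ID", r"\bEMP-\d{6}\b", 0.9),
    deny_list_recognizer("PROJECT_CODENAME", ["Bluebird", "Red Kite"], 0.8),
]


def analyze(text):
    """Run every recognizer and return findings ordered by position."""
    findings = [f for recognize in recognizers for f in recognize(text)]
    return sorted(findings, key=lambda f: f.start)
```

Calling `analyze("EMP-123456 joined Red Kite")` in this sketch yields one `EMPLOYEE_ID` finding and one `PROJECT_CODENAME` finding, each carrying its own score.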

As for the productization aspect, we did some performance tests and are
confident about taking it to production. A single-instance cluster with a
medium-size machine has a ~65ms response time for a 100-word sentence. Using
better machines lowers the response time to ~24ms for 100 words and 150ms for
a 1,000-word input.

The current service uses regex for known patterns and spaCy for named entity
recognition (person names, places, etc.). Users often build custom ML models
to detect new types of entities.

Presidio is free, completely transparent, and fully customizable. Feel free to
use it and let us know what you think.

------
halflings
Still some work to do; on their demo page [1]:

"Simply follow the instructions" -> "Simply follow the <US_DRIVER_LICENSE>" ;
same for "contribution". So I'm guessing some overly eager regex is to blame,
which doesn't make you super confident about using this for something
sensitive.

[1] [https://presidio-demo.westeurope.cloudapp.azure.com](https://presidio-demo.westeurope.cloudapp.azure.com)

~~~
ht85
To be fair, there are different levels of detection [1]; the demo is probably
using the weakest one.

[1]
[https://github.com/microsoft/presidio/blob/74ea983cc50ff76d7...](https://github.com/microsoft/presidio/blob/74ea983cc50ff76d7fc734683ca897ff01ceb9ba/presidio-analyzer/analyzer/predefined_recognizers/us_driver_license_recognizer.py#L17)

~~~
omri374
Thanks, this is indeed the case. The US_DRIVER_LICENSE confidence is 0.01 and
the demo doesn't put any threshold on the response. We're working on fixing
the demo.
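
A minimal sketch of why the threshold matters, using the example from this thread. The result format (dicts with `entity`, `start`, `end`, `score`) and the `redact` helper are assumptions for illustration, not Presidio's API; only the 0.01 score comes from the discussion above.

```python
def redact(text, results, threshold=0.0):
    """Replace each finding scoring at least `threshold` with <ENTITY>."""
    kept = [r for r in results if r["score"] >= threshold]
    # Replace from the end of the text so earlier offsets stay valid.
    for r in sorted(kept, key=lambda r: r["start"], reverse=True):
        text = text[:r["start"]] + f"<{r['entity']}>" + text[r["end"]:]
    return text


text = "Simply follow the instructions"
# Hypothetical analyzer output: a very low-confidence match on one word.
results = [
    {"entity": "US_DRIVER_LICENSE", "start": 18, "end": 30, "score": 0.01},
]

no_threshold = redact(text, results)                    # low-score hit applied
with_threshold = redact(text, results, threshold=0.5)   # low-score hit dropped
```

With no threshold the low-confidence match is redacted, reproducing the demo's behavior; with any reasonable threshold the sentence is left alone.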

------
yarapavan
License: MIT

Demo: [https://presidio-demo.westeurope.cloudapp.azure.com/](https://presidio-demo.westeurope.cloudapp.azure.com/)

Documentation:
[https://github.com/microsoft/presidio/blob/master/docs/index...](https://github.com/microsoft/presidio/blob/master/docs/index.md)

------
jpalomaki
I can see this being used to detect accidentally leaked sensitive content (for
example scanning repos, outgoing emails, shared folders).

However, using this to redact material sounds a bit risky. Are there use
cases where you could accept the potential mistakes (missing something that
should have been redacted)?

------
zamadatix
My first thought was that it was related to the cloud security company
Presidio, not MS.

~~~
jngreenlee
Although Presidio would like you to think they are cloud security (higher
margins) they are really just a VAR/reseller. Weak services but they will sell
you stuff. So will I!

More importantly, this is proximate enough that they may sue for the name, OP.

------
AmphibianTree
This is good. It just needs a hashbytes formula for Excel to make
anonymization accessible to the majority of MS customers who are fumbling
around with PII haphazardly.
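
As a rough illustration of the hashing idea (outside Excel), here is a minimal Python sketch of keyed hashing for pseudonymizing a column of values. The function name, the normalization step, and the use of HMAC-SHA256 rather than a plain hash are assumptions; a keyed hash is chosen here because unkeyed hashes of low-entropy values such as emails or phone numbers can be reversed by brute force.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a stable hex token for `value` under the secret `key`."""
    # Normalize so trivially different spellings map to the same token.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would come from a secret store.
key = b"replace-with-a-real-secret"
emails = ["Alice@Example.com ", "bob@example.com"]
tokens = [pseudonymize(e, key) for e in emails]
```

The same input always yields the same token under one key, so joins across sheets still work, while anyone without the key cannot recompute the mapping.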

------
polskibus
Is anyone using Presidio in production?

------
dejaime
What a horrible name

~~~
drKarl
I kind of agree with you.

From the README.md: Presidio (Origin from Latin praesidium ‘protection,
garrison’)

In Spanish it sounds like prison, or something related to prison. An inmate
can be called a presidiario. Same root as the Latin praesidium.

------
Bootwizard
Please everyone:

Before you brand your software, google the name! It's not difficult!

[https://www.presidio.com/](https://www.presidio.com/)

