
Amazon Macie: A machine learning service to discover and protect sensitive data - polygot
https://aws.amazon.com/macie/
======
jcims
Important to note that the classifier only looks at the first 20MB of each
object, so it might miss sensitive content in large files.

[https://aws.amazon.com/macie/pricing/](https://aws.amazon.com/macie/pricing/)
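
If you want to see how much of a bucket falls outside that window, here's a rough
sketch with boto3 (bucket name is a placeholder) that lists objects larger than
20MB, i.e. the ones the classifier would only partially scan:

    import boto3

    LIMIT = 20 * 1024 * 1024  # the first-20MB classification window mentioned above

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # "my-bucket" is a placeholder; substitute your own bucket
    for page in paginator.paginate(Bucket="my-bucket"):
        for obj in page.get("Contents", []):
            if obj["Size"] > LIMIT:
                print(f'{obj["Key"]}: {obj["Size"]} bytes exceeds the 20MB scan window')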

------
jasonrhaas
I wonder if this is the same service that they use to scan public Github
repositories for Secret AWS keys. I'll admit that I've accidentally committed
a private key to a public repo before, and I received an email from AWS
letting me know about it shortly after.

I suppose that it's in Amazon's best interest to not have people hacking
accounts and spinning up the maximum number of EC2 instances to mine Bitcoin.

~~~
5706906c06c
Not for GitHub, but the TruffleHog project on GitHub might be of interest to
you. There is also SourceClear, which does the same for secrets in GitHub.

Note: AWS also monitors access key use and API thresholds to keep you
informed.
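
For a rough idea of how those scanners work, here is a minimal sketch of the
entropy check that TruffleHog popularized (the regex and threshold are just
illustrative):

    import math
    import re

    def shannon_entropy(s):
        # bits of entropy per character, over the string's own alphabet
        probs = [s.count(c) / len(s) for c in set(s)]
        return -sum(p * math.log2(p) for p in probs)

    # long runs of base64/hex-ish characters are candidate secrets
    CANDIDATE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")

    def find_suspect_strings(text, threshold=4.5):
        # flag candidates whose entropy looks more like a key than a word
        return [m for m in CANDIDATE.findall(text) if shannon_entropy(m) > threshold]

    # the secret below is AWS's public documentation example key, not a real one
    print(find_suspect_strings(
        "aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"))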

------
pmorici
Looks like Macie was originally developed by harvest.ai before they were
acquired by Amazon last year.

[https://techcrunch.com/2017/01/09/amazon-aws-harvest-ai/](https://techcrunch.com/2017/01/09/amazon-aws-harvest-ai/)

~~~
trhway
>The San Diego-based startup, co-founded by a team that includes two former
NSA employees

>Harvest.ai’s flagship, patent-pending AI product is called MACIE Analytics.
It uses AI to monitor how a customer’s intellectual property is being accessed
in real-time, assessing who is looking at, copying or moving particular
documents, and where they are when they’re doing this, in order to identify
suspicious patterns of behavior and flag potential data breaches before
they’ve taken place. It bills the service as a way to combat the risk of
insider attacks.

did they get the idea after seeing what happens at NSA with
contractors/whoever downloading data to wherever?

~~~
mbrookes
Or after seeing Veritas Data Insight? [0]

Data Insight is targeted at more user-oriented unstructured content
repositories (CIFS, NFS, SharePoint, OneDrive, SharePoint Online, Box), but
the fundamentals are very similar: content classification, data profiling,
risk scoring, access-pattern anomaly detection, and access control remediation.

[0] [https://www.veritas.com/product/information-governance/data-insight](https://www.veritas.com/product/information-governance/data-insight)

------
eat_veggies
Classic, selling the poison and the cure. Access controls shouldn't be so
convoluted and opaque that it requires a separate service to analyze your
configurations. Crazy that we've made such a mess of the security landscape
that we need _AI systems_ to tell us if we're leaking info.

~~~
5706906c06c
Not the case. I've seen seasoned developers (not to single them out) make
simple, careless mistakes with S3 bucket ACLs, permissions, and policies. The
issue has more to do with the sheer laziness of the "let's create unstructured
data buckets, write once, and forget it all" mentality. At some point, this
sort of service can be useful in identifying the "crown jewels" within the
buckets. Beyond that, buckets grant no public access by default, so I can't
agree with the assertion that AWS is deliberately making things difficult in
order to sell more services and drive vendor lock-in.
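
To make that concrete, here's a quick sketch of how you can audit a bucket's
ACL yourself with boto3 (bucket name is a placeholder), looking for grants to
the AllUsers / AuthenticatedUsers groups:

    import boto3

    PUBLIC_GROUPS = {
        "http://acs.amazonaws.com/groups/global/AllUsers",
        "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
    }

    s3 = boto3.client("s3")

    def public_grants(bucket):
        # return the ACL grants that open the bucket to everyone / any AWS account
        acl = s3.get_bucket_acl(Bucket=bucket)
        return [g for g in acl["Grants"]
                if g["Grantee"].get("URI") in PUBLIC_GROUPS]

    # "my-bucket" is a placeholder
    print(public_grants("my-bucket") or "no public grants")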

~~~
latchkey
Just a day or two ago on HN... [https://github.com/eth0izzle/bucket-stream](https://github.com/eth0izzle/bucket-stream)

~~~
5706906c06c
Yes, thank you for linking it, but I fail to see the connection. That tool
scans public HTTPS endpoints, based on keywords in its dictionary, to discover
misconfigured buckets. AWS doesn't manage the bucket permissions/ACL; the
customer does. AWS's shared-responsibility model clearly defines all of this.
The customer is responsible for the bucket ACL; the same would apply if I ran
my stack in a data center and configured Apache/nginx with open directory
indexes that allowed anyone to traverse them.
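
For the curious, the general idea behind that kind of scan is just probing
candidate bucket names over HTTPS. A toy sketch (bucket-stream itself builds
its candidates from certificate transparency logs rather than a fixed list):

    import requests

    # toy wordlist; real tools generate many permutations
    candidates = ["example-backup", "example-logs", "example-dev"]

    for name in candidates:
        url = f"https://{name}.s3.amazonaws.com"
        r = requests.get(url, timeout=5)
        if r.status_code == 200:
            print(f"{name}: listing is publicly readable")
        elif r.status_code == 403:
            print(f"{name}: bucket exists but listing is denied")
        # 404 (NoSuchBucket) means the name is unused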

------
RKearney
Previous Discussion:
[https://news.ycombinator.com/item?id=15012225](https://news.ycombinator.com/item?id=15012225)

------
tptacek
The S3 data classification seems too expensive, but if the Cloudtrail stuff
works, that seems pretty cheap for what you get.

~~~
mcqueenjordan
CloudTrail is indeed very cheap for customers: we record nearly all API calls
and access to AWS resources and deliver those events to subscribing customers.
Events are delivered for free, apart from S3 and Lambda "data events" (gets,
puts, and function invocations), which are billed at a very low rate.

(We recently released our AWS Lambda integration — you can now record all
Lambda function invocations with us!)
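
For anyone who wants to turn those data events on, this is roughly what the
call looks like with boto3 (trail and bucket names are placeholders):

    import boto3

    ct = boto3.client("cloudtrail")

    # "my-trail" and "my-bucket" are placeholders for your own trail and bucket
    ct.put_event_selectors(
        TrailName="my-trail",
        EventSelectors=[{
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                # S3 object-level gets/puts under the bucket
                {"Type": "AWS::S3::Object", "Values": ["arn:aws:s3:::my-bucket/"]},
                # all Lambda function invocations
                {"Type": "AWS::Lambda::Function", "Values": ["arn:aws:lambda"]},
            ],
        }],
    )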

Disclaimer: I’m a Software Engineer with the AWS CloudTrail team.

~~~
tptacek
If I'm reading this right, you now have two paid services for detecting CT
anomalies: Guard Duty, which is nosebleed expensive, and Macie, which is
practically free. What's the difference between the two?

~~~
p0rkbelly
Macie Analayzes a subset of CloudTrail, not all actions and is about
historical behavior (though for high sev actions, it is more point in time)

GuardDuty is looking for specific threats/attacks and can combine multiple
sources of telemetry for more advanced correlation. E.g. A combination of VPC
Flows + CloudTrail + DNS that trigger an alert when formed together while a
single CloudTrail event may not have.

~~~
tptacek
Within CT, what are examples of things Macie will catch, vs. things you'd need
GD to catch?

If GD weren't so expensive, I wouldn't really care that much. But GD is so
expensive that it can be hard to recommend, which is especially weird since
the pricing for Macie CT is so low --- even weirder when you note that the
pricing for Macie S3 is so high!

~~~
p0rkbelly
I found the pricing for Guard Duty reasonable compared to most IDS systems.

They let you turn on GuardDuty for free for 30 days and give you an estimate
of your bill, so that helps.

------
trengrj
How does this compare to Google’s Data Loss Prevention?

~~~
stedev
Google's Data Loss Prevention is provided on G Suite and Google Cloud Platform
(GCP). Both products use the same unified classifier codebase. G Suite DLP
allows admins to enforce policy on Gmail and Drive files. On GCP, the Data
Loss Prevention API allows developers to classify and redact sensitive data in
virtually any data source in real-time or at-rest (e.g. Google Cloud Storage,
BigQuery, AWS Redshift, AWS S3, Salesforce, Slack, on-prem, custom apps,
etc.).

DLP API scans are not limited to 20MB and can scale up to virtually any size.
API results can be used for programmatic automation of alerts, IAM/ACL
settings, or other remediation and can be sent automatically into BigQuery for
detailed analysis or reporting. In addition to classification, Google’s DLP
API provides data masking tools for structured and unstructured data including
format-preserving encryption, bucketing, and tokenization. This helps
developers reduce unnecessary PII when collecting, storing, or sharing data.
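
Roughly what a classification call looks like with the Python client (project
ID and sample text are placeholders, and the exact request shape can vary by
client library version):

    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()

    # "my-project" is a placeholder project ID
    response = client.inspect_content(
        request={
            "parent": "projects/my-project",
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            },
            "item": {"value": "Contact jane@example.com or 555-867-5309."},
        }
    )

    # each finding names the detected infoType and a likelihood score
    for finding in response.result.findings:
        print(finding.info_type.name, finding.likelihood)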

(Note: I am the Product Manager for DLP API at Google Cloud)

------
captn3m0
Waiting for it (and many more services) to become available in ap-south-1.

------
jijji
jeff bezos's own personal dashboard to exfiltrate warez

