
Transparent SLIs: See Google Cloud the way your application experiences it - markcartertm
https://cloudplatform.googleblog.com/2018/07/transparent-slis-see-google-cloud-the-way-your-application-experiences-it.html
======
robertp
I think you have to hand it to Google; that is really next-level deep
analytics. I feel it would be hard for most companies to do this without a
huge investment (or maybe not).

It is like the reverse of a "customer survey": instead of asking and getting
an arbitrary number, you get a really detailed view of service performance.

~~~
wora
People often consider using this information for reliability and performance,
but you can do much more with the data. For example, if a method has low
latency, you can use a short deadline with fast retries to improve
reliability. If you see a sudden jump in certain usage, you can consider
batching and caching to reduce your cost. If you see unexpected usage of a
service, you know someone introduced a new dependency in your system. Google
teams often use the same data to understand how large services work and how
they are correlated.

Disclosure: I worked on this feature at Google.
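
To make the "short deadline with fast retry" idea concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions, not how
Google or any GCP client library does it: `DeadlineExceeded`,
`call_with_fast_retry`, and `make_flaky_rpc` are hypothetical names, and the
default deadline and backoff values are placeholders you would pick from the
latency you actually observe for the method.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical error raised when a call overruns its deadline."""


def call_with_fast_retry(rpc, deadline_s=0.05, max_attempts=3, backoff_s=0.01):
    """Call rpc(timeout=deadline_s), retrying quickly on deadline overruns.

    `rpc` is any callable that accepts a `timeout` keyword (an assumed
    interface for this sketch). The last error is re-raised once all
    attempts are used up.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(timeout=deadline_s)
        except DeadlineExceeded as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(backoff_s)  # short pause before the next attempt
    raise last_error


def make_flaky_rpc(failures_before_success):
    """Fake RPC for illustration: times out N times, then returns "ok"."""
    state = {"calls": 0}

    def rpc(timeout):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise DeadlineExceeded(f"no reply within {timeout}s")
        return "ok"

    rpc.state = state  # expose the call counter for inspection
    return rpc


if __name__ == "__main__":
    flaky = make_flaky_rpc(failures_before_success=2)
    print(call_with_fast_retry(flaky), "after", flaky.state["calls"], "calls")
```

The trade-off is that a short deadline only helps when the method's normal
latency is well below it; otherwise the retries just add load.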

------
jonny_eh
This is amazing, kudos to Google. It's astounding that most cloud services
only show status info for their entire cloud (or big chunks of it). I don't
care if 99.999% of your customers are fine, if the server running my servers
is down for no reason.

------
haimez
Any idea when Stackdriver is actually going to be usable for production? The
latency of events is way too high to drive alerts from, and the UI has
basically always had issues. Clicking through any link seems to have about a
50% chance of resulting in "this page was not found", which only means you
have to somehow find a different navigation path to get to that page.

~~~
markcartertm
We (the GCP Ops Management and Stackdriver teams) are working hard on multiple
fronts to deliver innovation (such as the service graph highlighted in the day
1 keynote at Next, and GKE monitoring) and at the same time deliver
first-class scale and reliability. It's a journey, but we have made a lot of
improvements over the last 12 months, and will continue to raise the bar in
the next 6-12 months. We have many very large customers as well as startups
using Stackdriver as the core of their Ops and SRE command and control center.
I can personally guarantee that our users getting great UX and reliability is
top of mind for the entire team. I would appreciate it if you could flag to
our team any time you see a page not found or any other experience in
Stackdriver that you feel does not meet the bar. We listen, and we will
resolve bugs one by one to meet your expectations. We have an email list, a
bug tracker and a feature request forum, all listed here:
[https://cloud.google.com/stackdriver/docs/contact-us](https://cloud.google.com/stackdriver/docs/contact-us).
You can also submit in-context feedback from the Stackdriver and GCP consoles,
which will be reviewed by the team. Finally, please feel free to DM me on
Twitter @markcartertm. We care deeply and would love to hear and respond.

------
kenhwang
On one hand, this is really cool and I'd love this type of info out of AWS.

On the other hand, the whole point of using a cloud provider is to lower the
amount of ops work, and this feels like I'm helping do ops work for Google,
and paying them for the privilege.

~~~
dantiberian
This is purely additive. If you don’t want to look at these metrics, you don’t
need to, but if they would be helpful to you, then you can. Speaking as a GCP
customer, I’m very glad these are available. It means that I will more easily
be able to tell the difference between issues in my service, and issues in the
APIs I depend on.

------
OP9000
=

