
Why Loggly Chose AWS Route 53 Over Elastic Load Balancing - jtblin
https://www.loggly.com/blog/why-aws-route-53-over-elastic-load-balancing/
======
former_loggly
Former Loggly employee here. Loggly is on CTO #3 or #4 in about 3 years. The
CEO, a marketing guy with a black turtleneck, "runs" engineering. It is NOT an
engineering company, and they are on their way to outsourcing all development
to India.

Formerly they had all of their EC2 instances configured to run without swap
and didn't use EBS, so instances would crash 1-3 times a day and lose all
data, which would require 1-2 day customer data restores.

Additionally, this Java shop oversubscribed threads on every Solr box, which
made them restart each Solr instance every hour. To think any revolutionary
engineering ideas come from a former Apple marketing wannabe who puts
outsourced Indian engineering in place as yes men is a huge stretch.

Let's be honest: Loggly is in huge trouble and can't hire quality engineering
talent, and as a result is trying to remarket itself as an engineering-driven
company while it outsources to India.

The key question isn't "do you use DNS or Elastic Load Balancing?" It's "what
is your VOLUNTARY RATE OF ATTRITION?" Hint: really bad!

~~~
hijinks
I interviewed there for a devops/sysadmin role and ran far away after that
process, once I learned what is going on and what problems the ops group has
to solve.

Then reading this fluff piece made me glad I never even thought about working
there after that phone interview. Whoever claims DNS round robin is a good
way to handle failover doesn't really know what they are talking about. I
haven't dug into how something like rsyslog handles a DNS request; my guess
is it just passes it off to the OS.

But what I got from this is that Loggly is OK with losing customer data.

~~~
riffraff
could you expand on what made you run away? i.e., what sort of problems
tasked to the ops group would make an ops person run away?

This is unrelated to the Loggly bit; it would just be interesting to know for
a non-ops guy.

------
fubu
Serious question: are people upvoting this to poke fun, like some kind of
Daily WTF?

A logging platform that lists 1 of their 2 major requirements as "To not drop
any data, ever" is using round robin DNS for fault tolerance? I can't see too
many people on HN upvoting this for being insightful or impressive.

Edit: I just can't help myself. How are you going to send syslog when any
server fails and not "drop any data, ever"? Even over TCP the in transit
messages are lost when the connection is broken. So like, their business is
basically syslog and they don't know that?

~~~
latch
I upvoted it in the hopes that someone would provide the missing piece. Like
"oh, we forgot to mention that the DNS is pointing to our own haproxy servers
that all have redundant power/network/whatever" or something.

~~~
skuhn
Loggly seems to be all about running In The Cloud, so that seems unlikely.

EC2 instances running haproxy would mitigate a number of the problems they
discussed with ELBs, but the inability to use VIPs (with vrrp or ucarp) in
AWS means that a failure will always boil down to the same pattern: a key
front-end instance dies, client traffic keeps being directed to it for 5
minutes (at best), and that's life.

~~~
toomuchtodo
The only way for them to fix this would be to have their own client that
attempts retries while caching unacked logs locally, but they market not
needing a client (only syslog) as the benefit.

TL;DR: Loggly can't promise no data loss in its current incarnation.
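A client along those lines is easy to sketch. This is a minimal illustration, not anything Loggly ships: the class name, the `transport` callback, and the ack-on-`True` convention are all made up for the example.

```python
from collections import deque

class BufferedLogSender:
    """Hypothetical client-side sender: messages stay in a local buffer
    until the transport acknowledges them, so a collector failure only
    delays delivery instead of dropping data."""

    def __init__(self, transport, max_buffer=10_000):
        # transport is a callable taking one message and returning
        # True once the message has been acknowledged by the server.
        self.transport = transport
        self.buffer = deque(maxlen=max_buffer)

    def log(self, message):
        self.buffer.append(message)
        self.flush()

    def flush(self):
        # Send from the head of the buffer; stop on the first failure
        # and leave unacked messages queued for the next attempt.
        while self.buffer:
            if not self.transport(self.buffer[0]):
                return False
            self.buffer.popleft()
        return True
```

Even this toy version only bounds the loss to `max_buffer` messages; with a fire-and-forget syslog pipe, there is no bound at all.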

------
zimbatm
> If there is an issue with a collector, Route 53 automatically takes it out
> of the service; our customers won’t see any impact.

Except when, for example, rsyslog caches DNS resolution forever. Or when the
log forwarder doesn't have a buffer and logs get lost.

~~~
korzun
Yeah, I don't get their approach. There is no way this guarantees 100%
delivery if one server fails within that rotation.

The chances of failure go up dramatically if 2+ hosts behind the round robin
fail, etc.

Not to mention that once hosts resolve this to an IP, they will re-use the
route. This approach is not balanced.

I don't want to be /that/ guy, but if they can't scale with ELB, they should
invest in dedicated load balancer infrastructure that can offload requests to
their cloud instances.

This is a really bizarre post.

~~~
hatred
I am more worried about the DNS caching issue.

The 2+ hosts failing should not be much of a problem if you have a separate
health-checker host that does nothing except gather heartbeats from all the
hosts in your fleet and update the DNS periodically.
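The update half of such a health checker could look roughly like this. The record name, TTL, and zone ID are placeholders; the `boto3` call itself is shown but commented out, since running it requires AWS credentials and a real hosted zone.

```python
def build_change_batch(record_name, healthy_ips, ttl=60):
    """Build a Route 53 ChangeBatch that replaces the round-robin A
    record set with one value per currently healthy collector."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip} for ip in healthy_ips],
            },
        }]
    }

# In the periodic health-check loop (placeholder zone ID and name):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZXXXXXXXXXXXXX",
#     ChangeBatch=build_change_batch("logs.example.com.", healthy_ips),
# )
```

Note this fixes the server side only; clients holding a cached answer still hit the dead IP until their TTL expires.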

~~~
korzun
That's exactly why it's a problem.

If you have 3 hosts and 2 of them go down, in this setup there is a more than
50% chance that clients with cached records will be trying to connect to a
non-existent server.

Also, expecting the client to perform a DNS lookup every time there is an
outgoing log packet is pretty shitty for performance. You can't guarantee
near-instant DNS server availability for every client.
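The arithmetic behind that claim checks out, assuming cached answers are spread uniformly across the hosts:

```python
from fractions import Fraction

# With `hosts` collectors in the rotation and `down` of them dead, a
# client holding a uniformly distributed cached A record points at a
# dead server with probability down/hosts.
hosts, down = 3, 2
p_stale = Fraction(down, hosts)
print(p_stale)  # 2/3, i.e. about 67% -- indeed more than 50%
```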

------
skuhn
Lots of other comments have torn this article apart (and justifiably so), but
I still feel the need to pile on.

In their docs, Loggly only gives out one API endpoint: logs-01.loggly.com.

It is referenced as the endpoint for HTTP, HTTPS, syslog and syslog TLS. These
seem to be the only methods available to send log data to them.

There is the obvious problem that a DNS record with a 60s TTL cannot possibly
receive every single packet sent to it in the event of a server failure. Even
if the returned IP address is an elastic IP, it takes a substantial amount of
time to move to another instance in AWS.

I don't know why you would use the same service hostname for all of these
endpoints. Separate names for each endpoint, even if they all pointed to the
same pool of hosts, would at least give some flexibility in the future when
they have enough traffic to get desperate about capacity. I would also think
they might want to segregate native syslog from HTTP traffic, since I presume
it uses different processes on the backend.

It's also curious that they chose to return only one A record. DNS RR is a
poor substitute for real load balancing, but it's better than nothing. With
multiple A records, there is at least a chance that some of their traffic
will go to other servers, rather than all of it potentially going to one as
it does now.
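For what it's worth, the client half of multi-record DNS RR is trivial to sketch; the hostname and port below are purely illustrative:

```python
import random
import socket

def pick_endpoint(hostname, port):
    """Resolve all A records for a name and pick one at random --
    the client-side half of DNS round robin. With a single A record
    (as in Loggly's setup) there is nothing to choose between."""
    addrs = socket.getaddrinfo(
        hostname, port, socket.AF_INET, socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); the
    # sockaddr for AF_INET is an (ip, port) tuple.
    ips = sorted({sockaddr[0] for *_, sockaddr in addrs})
    return random.choice(ips)
```

A stub resolver that never re-resolves still pins one IP for the life of the process, so this only spreads load across fresh lookups; it doesn't fix the failover problem discussed above.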

While they made no claims about using Route 53 for its geo DNS capabilities,
I still found it amusing that I was sent to a US East IP from California. Not
that it's super critical that my log lines get delivered quickly, but it is
ideal to shorten the path of an insecure and unreliable transport in order to
improve durability. I would never ship syslog out to some host on the
Internet in the first place, but a host 16 hops away is even more ludicrous.

I think their article says more about how poorly ELBs function once you
exceed the low traffic threshold they are seemingly designed for than about
how well Route 53 works (and it is a decent static DNS service). The
inability to robustly direct incoming traffic is the Achilles heel of AWS.

------
mbell
There is a rather large technical divide between 'no logs left behind' and
relying on a DNS lookup to provide that guarantee.

~~~
lfuller
I was thinking the exact same thing while reading this - it reads like a
company that doesn't understand the unique challenges involved with
distributed computing.

I'm actually in the middle of deciding between Loggly, Papertrail, and
Logentries for centralized log management. I guess that cuts it down to two.

------
philip1209
This is primitive. It seems like they are on the verge of discovering BGP,
which could be used to provide scalability, load balancing, and clean failover
without DNS caching issues.

------
mey
Why would you allow your clients to transmit potentially sensitive data to you
as clear text over the internet?

~~~
Goopplesoft
I'm sure it's because rsyslog supports it (many mentions in the article to
staying compliant with rsyslog).

~~~
mey
That's what I don't get: how is it responsible to stay compliant when that
means staying insecure?

~~~
nulltype
You can use rsyslog with TLS:
[http://www.rsyslog.com/doc/rsyslog_tls.html](http://www.rsyslog.com/doc/rsyslog_tls.html)
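For reference, a minimal client-side forwarding fragment in the legacy directive syntax those docs describe; the CA path, hostname, and port are placeholders:

```
# Hedged example of rsyslog TLS forwarding (legacy directive syntax).
$DefaultNetstreamDriverCAFile /etc/ssl/ca.pem   # CA that signed the server cert
$DefaultNetstreamDriver gtls                    # use the GnuTLS stream driver
$ActionSendStreamDriverMode 1                   # require TLS for this action
$ActionSendStreamDriverAuthMode x509/name       # authenticate server by cert name
*.* @@logs.example.com:6514                     # @@ = forward over TCP
```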

------
ejain
What are some alternatives to Loggly? I really like being able to aggregate my
logs with minimal setup (and cost). I'm logging with Logback (Java), and there
is a convenient extension that forwards log statements to Loggly.

~~~
troydavis
We just finished writing a syslog4j-derived Logback appender with support for
UDP, TCP with TLS encryption, and cleartext TCP:

Background and setup: [http://help.papertrailapp.com/kb/configuration/java-
logback-...](http://help.papertrailapp.com/kb/configuration/java-logback-
logging/)

GitHub repo: [https://github.com/papertrail/logback-
syslog4j](https://github.com/papertrail/logback-syslog4j)

Papertrail also works with the standard Logback SyslogAppender.

~~~
ejain
Does the syslog appender handle large, multi-line log messages (i.e. messages
containing stack traces)?

I recall having some trouble with that when using syslog with Loggly, before
switching over to the json appender.

~~~
troydavis
Short answer: it depends.

Long answer: logback and both appenders can accept pattern formats to adjust
how they're formatted. How useful the end result is depends a lot on the
receiver, though, and more than that, there's no one implementation that's
great for everyone -- that is, there's no right way to "handle large, multi-
line log messages," only attempts at making them more useful.

An easy example is searching. Some people want to see the entire message,
others want only the matching portion of a stack trace, others want some
combination, and others - probably most people - just want something that's
useful, however the actual UX works.

In Papertrail's case, our sender-specific context links (think grep -A/-B/-C)
were designed for navigating multiline output from a single sender:
[https://papertrailapp.com/tour/viewer/context](https://papertrailapp.com/tour/viewer/context).
It's basically pivoting from a single entry in a stack trace to the entire
stack trace.

~~~
ejain
My main concern isn't formatting or searching, but truncation, see
[http://stackoverflow.com/questions/2011986/does-syslog-
reall...](http://stackoverflow.com/questions/2011986/does-syslog-really-
have-a-1kb-message-limit).

------
mrucci
Interesting points. Here are a few things you'll miss by choosing Route 53
over ELB:

* HTTPS termination.

* Autoscaling group management. By connecting an ELB to an autoscaling group, the logic of registration and deregistration is fully managed behind the scenes. With route53, you have to implement it yourself.

* Minimum autoscaling group size. If you enable ELB health checks, you can rely on the ELB to maintain a group of instances of constant size.

~~~
spamizbad
ELB's HTTPS termination is mediocre and, last I checked, doesn't offer the
best ciphers. A year ago it was impossible to get an A+ on SSL Labs' test
([https://www.ssllabs.com/ssltest/](https://www.ssllabs.com/ssltest/)) using
ELB to terminate SSL.

Not to mention it still needlessly includes a ton of dangerously insecure
ciphers just begging to be misclicked.

~~~
dbarlett
The current default, ELBSecurityPolicy-2014-01 [1], enables ECDHE/PFS and is
close to the Mozilla TLS recommendations [2]. Getting an A+ on the Qualys
test requires the HSTS header [3], which isn't an ELB issue.

[1]
[http://docs.aws.amazon.com/ElasticLoadBalancing/latest/Devel...](http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-
security-policy-table.html)

[2]
[https://wiki.mozilla.org/Security/Server_Side_TLS#Amazon_Web...](https://wiki.mozilla.org/Security/Server_Side_TLS#Amazon_Web_Services_Elastic_Load_Balancer_.28AWS_ELB.29)

[3] [http://mir.aculo.us/2014/04/04/how-to-get-an-a-on-the-
qualsy...](http://mir.aculo.us/2014/04/04/how-to-get-an-a-on-the-qualsys-ssl-
labs-test/)

------
kfnic
What kind of TTL value would they use for these records? Should something
happen to one of the collectors, couldn't that value still be cached by an
endpoint or an intermediary?

Even with a short TTL, are there still servers out there that don't respect
all TTLs, or has that been eliminated by now?

------
hnhipster
Everyone should use hosted services for everything. Soon we'll have hosted
services for hosted services. (I actually worked at a company that was a
hosted service running mainly off of another hosted service + AWS.)

------
hobs
>Amazon Route 53 DNS Round Robin Was a Win

>If you’ve ever used the Internet, you’ve used the Domain Name System, or DNS,
weather you realize it or not.

Interesting article, wrong weather used in this sentence.

~~~
KarenS
Thanks for noticing this! It's fixed now.

------
jcampbell1
It seems odd to leave out any discussion of DNS TTLs, and the risk that
something like 8.8.8.8 could end up sending them a thundering herd.

------
lpgauth
What kind of time granularity can you get for health checks on ELB vs Route
53?

~~~
mrucci
ELB Health Check: min, max = (1s, 300s)

Route 53 Health Check: either 10s or 30s

