
Sysadmin mistakes start-ups make - polvi
https://www.cloudkick.com/blog/2009/oct/20/three-start-up-sysadmin-mistakes/
======
DrJokepu
Here's one we made recently: purchasing an array of hard drives (for storage
servers) without making sure they weren't all from the same batch. Since they
were made in the same batch, they had the same defects, and when they failed,
they failed one after another within a very short interval. Since all of them
failed, RAID didn't help; we had to restore the day-old offline backup.

~~~
michaelbuckbee
That isn't something I'd ever thought about until now.

Would it make sense to use drives from more than one company, as they would
have very different failure characteristics?

~~~
blasdel
It's a terrible idea to mix and match drive models, because they'll have
drastically different _performance_ characteristics, and if you're lucky
you'll only get a little worse than the lowest common denominator on all axes.

What quality OEMs do is make sure they never ship you drives from the same
manufacturing batch in one enclosure.

~~~
DTrejo
How do you make sure to buy from a quality OEM?

------
jrockway
Fork is actually a very fast system call. It never blocks, and (on Linux),
only involves copying a very small amount of bookkeeping information. If you
exec right after the fork, there is basically no overhead.

However, forking a new shell to parse "mv foo bar" is more expensive than just
using the rename system call. And it's easier to check for errors, and so on.
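A minimal sketch of the difference (file names are just placeholders): the
shell route forks the interpreter, execs /bin/sh, which parses "mv foo bar"
and forks again to run mv, while the rename route is a single system call
with real error reporting.

```python
import os
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "foo")
dst = os.path.join(d, "bar")
open(src, "w").close()

# The shell route: fork + exec /bin/sh + parse + fork + exec mv,
# and the only error signal is a nonzero exit status.
# os.system("mv %s %s" % (src, dst))

# The rename(2) route: one system call, and failures surface as an
# OSError you can catch and inspect.
os.rename(src, dst)
renamed = os.path.exists(dst) and not os.path.exists(src)
print(renamed)
```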

SQLite is also not as slow as people think it is; you can easily handle 10s of
millions of requests per day with it. If your application's semantics require
table locks, MySQL and Postgres are not going to magically eliminate
contention for those locks. It's just that they both pick very weak locking levels
by default. (They run fast, but make it easy to corrupt your data.
Incidentally, I think they do this not for speed, but so that transactions
never abort. Apparently that scares people, even though it's the whole point
of transactions. </rant>.)

Most of my production apps are SQLite or BerkeleyDB, and they perform great. I
am not Google, however.

~~~
teej
> SQLite is also not as slow as people think it is; you can easily handle 10s
> of millions of requests per day with it

Scaling relational databases to the point of 10s of millions of requests is
extremely non-trivial. Unless you can show me personally or can show me
evidence otherwise, don't make this claim. You're doing a disservice to the
people that have worked countless hours to eke every last millisecond of
performance out of MySQL and Postgres.

~~~
jrockway
10 million per day is about 100 per second. SQLite performs about this
quickly. I wrote a test script and it did 125 (unindexed) lookups per second.
Then I ran two of these tests at the same time, and the rate stayed about the
same. I have 8 cores, so I made 8 processes, and it was the same. 125
requests/second * 8 * 86400 seconds/day = 86_400_000 requests per day.

I added another thread writing as quickly as possible to the mix (7 readers, 1
writer), and this brought the read rate down to about 45 reads per second per
thread. Still more than 10 million per day, so technically I am right.

Also, I don't doubt that MySQL and Postgres (and BDB) are both significantly
faster than this. It's just that SQLite is not going to guarantee "instant
failure" of your project, as the article implies.

(One thing to note -- every time you type a character in Firefox's address
bar, you are doing an SQLite query. It is Fast Enough for many, many
applications.)
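A rough single-process version of that experiment can be sketched in a few
lines (the table size and lookup count here are arbitrary, and absolute
numbers will vary wildly with hardware):

```python
import sqlite3
import time

# Time unindexed lookups against a small table and extrapolate to
# requests per day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "row-%d" % i) for i in range(10000)])
conn.commit()

n = 200
start = time.time()
for i in range(n):
    # No index on k, so each lookup is a full table scan.
    conn.execute("SELECT v FROM t WHERE k = ?", (i % 10000,)).fetchone()
elapsed = time.time() - start

per_second = n / elapsed
print("%.0f lookups/sec ~= %.0f/day" % (per_second, per_second * 86400))
```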

~~~
gstar
That -is- fast, but I still have trouble reconciling that deep down in my
computer, a human readable SQL query gets built, and then another process
parses that SQL. Seems so wasteful building and then parsing a human readable
string for something that's happening on the same machine.

I know nothing of SQLite's internals, but wouldn't it make more sense to parse
the query once and then store a compiled version of the query for subsequent
lookups? Like you might do with a regexp?

~~~
azim
Yes, this is known as a prepared statement. You compile a parametrized
statement once, then execute it as many times as you like with different
arguments.

Also, SQLite, unlike most other databases, is an embedded database which does
everything in-process rather than invoking multiple processes.
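For example, through Python's sqlite3 module (which keeps a cache of compiled
statements), a parametrized statement with ? placeholders is compiled once and
then executed with different arguments:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# One SQL string, compiled once, executed with many argument tuples.
stmt = "INSERT INTO users VALUES (?, ?)"
conn.executemany(stmt, [(1, "ada"), (2, "grace"), (3, "edsger")])

rows = conn.execute("SELECT name FROM users ORDER BY id").fetchall()
print(rows)  # [('ada',), ('grace',), ('edsger',)]
```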

------
peterwwillis
The memory use is not accurate unless you take shared pages into account.
Copy-on-write will make it look like each apache child is using 40MB, when
really it's only 10MB private RSS. Use an RSS-calculating script
(<http://psydev.syw4e.info/new/misc/meminfo.pl>) to determine the close-to-
real memory use. If you don't calculate your maximum memory use correctly, you
will run into swap during traffic peaks. Also keep in mind that swap is a
_good_ thing. Is your app constantly cycling children? That won't let it move
unused/shared memory into swap. Don't ignore memory leaks by reducing your max
requests per child.

The forking thing is more of the same. Copy-on-write means it's not going to
balloon your memory unless some function turns that shared rss into private.
It isn't something that you want to do a lot of, though.

~~~
Periodic
This stood out to me as well. I like the script to actually calculate real
usage. Modern operating systems are smarter than I am when it comes to memory
management.

What you don't want is to have anything you use more than once a minute in
swap, and preferably only the stuff you don't plan on using for an hour (i.e.
not any time soon). That probably means you want your main application and web
server in memory all the time. If there are pieces of it that are unused and
you're hitting a resource cap then you have something mis-configured.

RAM is also dirt cheap right now, making it often easier to add RAM than to
optimize slightly sloppy code.

------
SwellJoe
One of the most common problems we see is DNS misconfiguration. It seems most
folks just haven't read the grasshopper book. If you're doing anything on the
Internet, you _need_ a basic understanding of DNS.

Once you grasp the fundamentals, most DNS problems become completely
transparent, but I've seen people spend _weeks_ trying to solve DNS problems
due to lack of understanding.

~~~
omouse
Could you name that book? Do you know of any other books people should
read for sys-adminning?

~~~
sparky
DNS and Bind (now in its 3rd edition) by Liu, Albitz, and Loukides. It's an
O'Reilly book.

<http://www.amazon.com/DNS-BIND-Cricket-Liu/dp/1565925122>

~~~
nixme
Amazon shows it's actually in its 5th edition now:

<http://www.amazon.com/DNS-BIND-5th-Cricket-Liu/dp/0596100574>

~~~
sparky
Good call :) Too late to edit :(

------
abalashov
In my experience, one of the most common mistakes is the failure to realise
that on pretty much all Linux distros, services like Apache and MySQL come
conservatively tuned. This is deliberate; it means a DoS or out-of-control
process within one of those domains is unlikely to take out the entire server,
because there's a hard limit on consumption of memory, CPU, child processes,
threads, etc.

However, this default configuration needs to be tuned to allow you to take
advantage of the hardware - if you have generous hardware. Otherwise, you will
wonder why your web sites are extremely unresponsive, yet the server load
stands at something relatively unimpressive.

I found this out the first time a blog post on one of my servers got digg'd.
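The knobs in question for Apache's prefork MPM look something like this (the
numbers here are hypothetical; derive MaxClients from free RAM divided by
real per-child RSS rather than guessing):

```apacheconf
# Hypothetical prefork tuning for a box with RAM to spare.
<IfModule mpm_prefork_module>
    StartServers          10
    MinSpareServers       10
    MaxSpareServers       20
    ServerLimit          256
    MaxClients           256
    MaxRequestsPerChild 5000
</IfModule>
```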

------
michaelbuckbee
I'd guess the real number one mistake is insufficient paranoia about backups.

I know lots of companies doing TDD but that have never done a full test
restore from their backups.

~~~
gaius
The easiest way to do this is to make your backups the mechanism by which you
refresh your Dev/QA environment from Production. It means your Ops team are
very nearly doing a DR exercise every week.

~~~
simonw
I'd never heard that advice before - sounds like a great idea.

------
tptacek
This was a great article, but I ended it wondering whether they either (a)
knew what a system call was (until the end, I thought maybe they meant a
system() shell-out) or (b) realize how many system calls a vanilla
request/response cycle incurs.

------
aristus
I disagree with 1.3. "Serving static content is the easiest possible task for
any web server." Yes, but keeping connections open for slow clients (esp with
KeepAlive on) is not a good use of your 500MB Mongrel process' time. On the
other hand, KeepAlive is a handy thing to have.

Using a proxy like nginx or varnish to serve static files (and even dynamic
data) if you have the proper KeepAlive and Nagle bits flipped can save you a
_lot_ of server resources at the application layer.
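A sketch of that setup in nginx (paths and the backend port are made up):
nginx holds the slow-client connections and serves static files itself, and
only dynamic requests ever reach the heavyweight app process.

```nginx
server {
    listen 80;

    location /static/ {
        root /var/www;          # served directly by nginx
        expires 30d;
    }

    location / {
        proxy_pass http://127.0.0.1:8000;  # Mongrel / app server
        proxy_buffering on;     # buffer responses for slow clients
    }
}
```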

~~~
staunch
It's almost always a bad idea to use anything other than a non-blocking/async
server to handle static content.

I think it's simpler/easier (maybe faster) to serve content from a separate
sub-domain (static.site.com or whatever). Using a reverse proxy works too, but
unless you're caching dynamic content it's probably no benefit and it's less
efficient.

~~~
andrewtj
A good reverse proxy will buffer client and server side so that your heavy app
can be available to serve the next request whilst the light proxy feeds the
page back to a slow client.

Under certain circumstances serving static files from separate hostnames can
be beneficial as HTTP clients are supposed to limit the number of simultaneous
connections per hostname.

------
julio_the_squid
Yep, #1 happened to me the other day. We hit our Apache server limit of 256
and the site slowed to a crawl. I'm not really sure what was causing the load
to be like 50-90, but requests were quite delayed waiting for an open process
(keepalive was at 5 secs).

Indeed, my first idea was to install nginx for images really quick.
However, I have no experience with nginx. Thankfully, we had a spare server
and I offloaded the images to there for now... Throwing more hardware at the
problem usually works.

------
ajross
FTA:

 _However, sqlite should never be used in production. It is important to
remember that sqlite is single flat file, which means any operation requires a
global lock_

I don't know jack about sqlite's locking architecture or scalability, but this
statement is just silly. There are a conceptually infinite number of ways to
make fine-grained locking work on a single file, both within a single process,
a single host, or across a network. Maybe the author is thinking fcntl()
locking is somehow the only option.

I guess the corollary to this article has to be "Don't let your startup's
sysadmins diagnose development-side issues."

~~~
pquerna
SQLite locking: <http://www.sqlite.org/lockingv3.html>

""" An EXCLUSIVE lock is needed in order to write to the database file. Only
one EXCLUSIVE lock is allowed on the file and no other locks of any kind are
allowed to coexist with an EXCLUSIVE lock. In order to maximize concurrency,
SQLite works to minimize the amount of time that EXCLUSIVE locks are held. """

But compared to something like MySQL w/ InnoDB (or postgres, or Cassandra, or
BerkeleyDB), which all have something closer to Row Level or Page Level
locking, SQLite's concurrency for server side applications is a serious
deficiency.

Yes, there are lots of ways to have fine grained locking, SQLite just doesn't
do them.
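The effect is easy to demonstrate from Python with two connections to the
same database file (timeout=0 so the second connection fails immediately
instead of busy-waiting):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the connection in autocommit mode so we
# can issue the BEGIN EXCLUSIVE ourselves.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN EXCLUSIVE")

# While the EXCLUSIVE lock is held, any other connection's write
# fails with "database is locked".
other = sqlite3.connect(path, timeout=0)
try:
    other.execute("INSERT INTO t VALUES (1)")
    blocked = False
except sqlite3.OperationalError:
    blocked = True
print(blocked)

writer.rollback()  # release the lock
```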

~~~
silentbicycle
Like many of SQLite's other quirks, this is because SQLite is designed to
accommodate embedded usage.

------
absconditus
How are the last two system administration problems?

------
rythie
This seems like an odd section of sysadmin mistakes - I would have thought
there are some other ones being made more often.

~~~
thwarted
Especially since the third one is a developer mistake that, as a sys admin and
developer, I've had to point out to developers not to do -- but for security
reasons, not because fork is oh-so-super expensive (even though it can be).

Also, there is no "system" system call. "system" is a library call that forks
and execs a shell to evaluate and execute a string. Having a sys admin that
doesn't know the difference may be the biggest sys admin mistake you could
make. There are a lot of library wrappers for system calls, but these are
documented in section 2 of the man pages as system calls.
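The security point is exactly why you pass an argv list instead of a string:
no shell ever runs, so metacharacters in user-supplied data are not
interpreted. A sketch in Python (the filename is a contrived example):

```python
import subprocess

# A "filename" containing shell metacharacters. Through
# system()/os.system(), the shell would see the semicolon and run rm;
# as an argv list, no shell is involved at all.
evil = "photo.jpg; rm -rf /tmp/victim"

result = subprocess.run(["printf", "%s", evil],
                        capture_output=True, text=True)
print(result.stdout)  # the literal string, semicolon and all
```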

------
bcl
I'd say their biggest mistake is usually not hiring a sysadmin who also has
development experience (or developers without sysadmin experience). I've found
that my knowledge in both realms has been invaluable in determining how to
design the infrastructure and how to write the code.

------
patio11
_If you fork inside an app server, such as mod_python, you will fork the
entire parent process (apache!). This could happen by calling something like
os.system("mv foo bar") from a python application._

I nominate this post as the most distressingly important bit of information
I've ever received at 2:43 in the morning.

Now the question: what can I do in Ruby to avoid the four calls a second or so
I'm currently making to system(big_command_to_invoke_imagemagick) ?

~~~
polvi
The solution is to use an image processing library such as RMagick,
<http://rmagick.rubyforge.org/>

~~~
tptacek
Calling into RMagick/ImageMagick from inside the request/response cycle is
probably even worse than shelling out, because ImageMagick does grievous
damage to your runtime.

~~~
polvi
I guess it all depends on how you design it and what you are doing. I would
have to agree with the others: the out-of-request-cycle image processing
solutions are definitely the right way to go overall.
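The shape of "out of request cycle" is just a job queue: the request handler
enqueues and returns immediately, and a background worker does the slow work.
A minimal in-process sketch (the thumbnail step is a stand-in for the real
ImageMagick call; production setups use a persistent queue):

```python
import queue
import threading

jobs = queue.Queue()
done = []

def worker():
    # Background worker: drains the queue outside any request.
    while True:
        path = jobs.get()
        if path is None:
            break
        done.append("thumbnail:" + path)  # stand-in for image work
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# "Request handlers" only enqueue and return without blocking.
for upload in ["a.png", "b.png", "c.png"]:
    jobs.put(upload)

jobs.join()      # wait for the worker (for this demo only)
jobs.put(None)   # shut the worker down
t.join()
print(done)
```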

------
toisanji
Hmm, I have a hard time understanding why anyone would try to use sqlite in
production unless they explicitly wanted to.

~~~
anApple
Simpler to use (no external program to start and monitor) and to back up
(just copy the sqlite file).

------
joshOiknine
Personally, I don't think we would ever run into those issues: A. we don't
have other servers to switch over to, B. we are using MySQL for testing and
development, and C. we don't like what happens when we make system calls from
within a web app, let alone forking.

------
durana
Take #1 and generalize it to the mistake of trying to fix a problem without
really understanding what the problem is. This has to be the most common
mistake I've seen in the sysadmin world.

------
c00p3r
Is this an example of the knowledge level of the modern sysadmin? If so,
we're in trouble. =)

A sysadmin should be able to think in terms of data flows, which means memory
management, data partitioning, and network stack usage; be able to put
different types of data into different kinds of storage; and understand the
role of caches and how data should be accessed.

Packages are just tools.

