
Ask HN: I want to start learning about databases, where should I start? - ARR
I hear about so many different types of databases and so many similar kinds of databases, which is the one I should start to learn about so that I get a good understanding of how databases work and how it can be used in programming?
======
keyist
If you use Firefox or a similar program that makes use of SQLite, that would
be the best place to start -- you'll get experience working with a database
populated with real data that is relevant to you, and one that has been
designed for practical use instead of just a book example.

Copy ~/.mozilla/firefox/<profile_name>/*.sqlite into another directory and
start playing with those files. It should be easy to come up with things you
want to find out. Start with some basic single-table ones like 'what are my 10
most visited pages'. Then expand to use multiple tables ('what are my 10 most
visited pages that I have not bookmarked').

Once you have an idea of how queries work, think about the database design and
the decisions that led to it -- why did the developers choose to relate tables
to each other that way? What kinds of queries are the indexes there to
optimize for?

Once you've done this, rinse and repeat with a variety of programs so you can
get multiple perspectives. <http://www.sqlite.org/famous.html> or google
around for others.

~~~
gaius
SQLite is such a beautiful program. It doesn't do much, but what it does do,
it does very well. If I need to analyze some data, say in a logfile, I'll
often write a trivial bit of Python to load it into SQLite and do my work in
SQL. For a few 10s of Gs of data, this works brilliantly. I'd go so far as to
say it's the least-buggy program I use day to day, and that includes the OSs I
run it on and "real" commercial databases.

~~~
mkramlich
Yep I like to think of sqlite as being to relational databases as memcache is
to caches. Both are super simple and straightforward, with a minimum of moving
parts, and yet does something really well. You can start with each, and then
later evolve into something more featureful. But often you'll not need to!

------
gxti
As pointed out by others here already, SQL may not be the optimal fit for
every scenario at large scales but it will nearly always work at smaller
scales, and I can't imagine that you're already trying to write the next
amazon.com. So in other words, start with SQL.

There are three main free software SQL databases: MySQL, PostgreSQL, and
SQLite. Don't use MySQL, it'll get you into a great deal of bad habits, is
missing an ass-ton of useful features (the default engine doesn't even support
transactions), and likes to corrupt data. It also has questionable licensing
and is owned by Oracle (a commercial database vendor) which as you can imagine
doesn't have a lot of incentive to improve it.

SQLite is a very robust database that is great for read-mostly sites (like
blogs) and for embedding into applications (Firefox for example). Its primary
limitation is that there can only be one writer at a time. It is very easy to
pick up and get going with since there's no persistent daemon, you just point
it at a file and start writing. I'd recommend starting here.

PostgreSQL is the more featureful database and supports hundreds of concurrent
connections, locking, access control, etc. This is the real "workhorse"
database you're likely to find in production at large sites that do make use
of free-software RDBMS. Eventually you will need to use something more
powerful than SQLite, and this is probably it.

------
vital101
I recommend checking out Apache Friends site download XAMPP. That will set up
a fully functional web environment (Apache, PHP, MySQL) which will let you
play a bit with database. It also comes packaged with PHPMyAdmin, which is
great for learning how to create tables and the like.

Also, you'll need to learn SQL. Check out
<http://www.w3schools.com/sql/default.asp> for a good primer on the subject.

------
SkyMarshal
First learn Relational fundamentals correctly from Chris Date and Fabian
Pascal, then branch out into other more trendy stuff like NOSQL.

<http://en.wikipedia.org/wiki/Christopher_J._Date>

<http://en.wikipedia.org/wiki/Fabian_Pascal>

~~~
silentbicycle
A more specific recommendation: _An Introduction to Database Systems_ by C. J.
Date.

Be warned that he's got an axe to grind - He's often critical of SQL because
it doesn't live up to the mathematical elegance of the relational model. (I
happen to agree with him.) He knows his stuff, though.

~~~
msg
If you want the Cliffs notes of the textbook, read Date's Database in Depth:
Relational Theory for Practitioners. You might want to read it a couple of
times though.

------
siculars
The only thing you really need to understand about databases is that they
store data for you, ie. they persist data. So once you get that then you start
asking questions like "where" and "how".

The "where" is on hard drives. Don't let anyone tell you that a purely memory
resident data store is a database, it's not. It's a cache. If you aren't
writing to disk you are not a persistent database. Think memcached and redis.
Although I would say the later is more of a database because you can configure
it to write on every update.

The "how" is where all the devilish details are. How do you put data into and
get data out of a database. How does the database store your data on disk.
Well, there are two main schools of thought in the market place at the moment.
Relational[0] and non-relational. Relational is the older more mature, more
well supported, more well understood, more "standards" compliant of the two.
Nowadays, non-relational is championed by the NoSQL[1] movement which eschews
the relational model for a looser concept, less defined, less "compliant",
less understood model. You will find many more and varied answers to "how" in
the non-relational space precisely because the renaissance in non-relational
is so fresh (the concept itself has been around a long time). Nevertheless,
NoSQL has been attracting considerable attention over the last year or so in
the open source community.

Every developer worth his salt will know or at least be familiar with the
relational model and at least one or two of it's implementations. MySQL,
PostgreSQL and SQLite are just three of the most popular open source RDBMS's
(relational database management system)[2] around today. They all have varying
degrees of support for the primary language used to speak to RDBMS's, SQL[3].

There are many open source, non-relational implementations in production
today. HBase, Cassandra, Riak, Voldemort and MongoDB are all playing in the
NoSQL field. Popular systems within NoSQL draw their lineage from either the
Google approach (gfs[4], big table[5]) or the Amazon approach (dynamo[6]). Yet
another is Neo4j, a graph database[7] which, at the moment, is lumped under
the "NoSQL" moniker but is a separate beast entirely. Just like their
relational cousins, the way NoSQL solutions do what they do and what they are
best suited for vary.

As with most interesting discussions, which approach is best is left as an
exercise for the reader. I'll just throw my two cents in for good measure.
First I would say that you need to know SQL. Knowing SQL and not a specific
system (like mysql or postgresql) will allow you to move in between relational
systems and let you pick the right one for the job. After taking a look and
familiarizing yourself with the SQL world, I would take a very serious look at
what is going on with NoSQL. Best I can tell, there is a lot of mindshare and
"cool" factor foisted upon this space right now because of the scale non-
relational concepts can bring to enterprise users. Be sure, though, that rdbms
is the predominant player in the database world. Every company large and small
use some form of rdbms. Newer more forward looking companies that have large
and very large scale needs to store and analyze vast quantities of
unstructured or loosely structured data are looking more and more to non-
relational NoSQL solutions.

I specifically avoid linking to particular implementations but rather link to
broader concepts here. It is the concept that is more important than the
implementation when learning about the big and confusing world of databases.

[0] <http://en.wikipedia.org/wiki/Relational_model> [1]
<http://en.wikipedia.org/wiki/NoSQL> [2]
[http://en.wikipedia.org/wiki/Relational_database_management_...](http://en.wikipedia.org/wiki/Relational_database_management_system)
[3] <http://en.wikipedia.org/wiki/SQL> [4]
<http://en.wikipedia.org/wiki/Google_File_System> [5]
<http://en.wikipedia.org/wiki/Big_table> [6]
<http://en.wikipedia.org/wiki/Dynamo_(storage_system)> [7]
<http://en.wikipedia.org/wiki/Graph_database>

~~~
pbh
I want to be careful in how I phrase this, because I feel like this was a good
effort, but I disagree with many parts of this description.

First, databases are not "just" about persistence, and I don't mean that in a
pedantic way. The point of a database management system is to manage your data
over time and allow you to make use of it. This includes things like stating
constraints to ensure that the data in the database is always correct and
providing methods for querying your data at varying levels of complexity.

Second, there are a number of in-memory databases. Oracle sells one. SQLite is
often used as one. The fact that they are in-memory does not make them not
databases.

Third, relational versus non-relational is not really an argument about "how
the data is stored on disk" or about being "standards compliant." A database
traditionally consists primarily of three things: a data model, a query
language, and an implementation. A data model, like "everything is a table
with column headers and rows" can be written to disk in many different ways
and queried in many different ways. (It's an abstraction.) Indexing structures
and data formats may be mostly the same even when the data model is different.
The (current) "relational vs non-relational" debate, such as it is, is usually
about whether it makes sense to change the data model and query language in
order to ensure that the implementation is scalable on "cloud-like" platforms
with specific data access needs. (That said, there are many different types of
non-relational models and query languages designed for different purposes.)

Fourth, there isn't one SQL. There are in fact a number of standards, SQL-86,
SQL-89, SQL-92, and so on. It makes sense to learn the standard first (at
least up to 92 or so), rather than learning particular implementation
specifics. Most databases implement SQL-92 or above, though there is somewhat
wide variance in additional features like indexing structures and functions
and in data modeling languages.

~~~
sesqu
The problem with the standards is that getting copies costs actual money. If
one really wants to start learning about databases, it can be very difficult
to decide what to pay for. Implementation documentation, at least, is
generally free.

For books, I've heard good things about [http://www.amazon.com/Fundamentals-
Database-Systems-Ramez-El...](http://www.amazon.com/Fundamentals-Database-
Systems-Ramez-Elmasri/dp/0136086209/), though I haven't read it myself.

~~~
pbh
I completely agree. My intent was to suggest that one should learn to the
standard first rather than any particular implementation, not that one should
read any of the actual standards documents directly. (Yikes!) By analogy, if
you want to learn C, read K&R to learn something approximating C89, rather
than picking up a book on how to code to the specific dialect of C understood
by GCC 4.5.

For what it is worth, I learned from Database Systems: The Complete Book.

[http://www.amazon.com/Database-Systems-Complete-
Book-2nd/dp/...](http://www.amazon.com/Database-Systems-Complete-
Book-2nd/dp/0131873253)

DS:TCB is pretty explicit about which of the SQL it teaches is part of which
standard. That said, I suspect that any general database book should do a
reasonable job.

------
timtadh
If you want to learn more than just how to use a database I recommend Ullman's
and Garcia-Molina's "Database Systems: The Complete Book."[1] It has overview
of classic topics such as relational algebra and calculus, practical topics
such as using a SQL database, and the second half is devoted to an overview of
how databases are actually implemented. Unfortunately I don't know of a book
that will introduce you to things like column-stores and such, my knowledge
comes from reading papers on them.

[1] <http://www.amazon.com/exec/obidos/ASIN/0130319953>

------
arebop
Relational databases are no longer fashionable, but they are backed by some
great theory and they remain very important in commercial practice. Read
<http://philip.greenspun.com/sql/> for a quick introduction, but use
<http://www.postgresql.org/> rather than Oracle for the exercises/examples
because it's much easier to install.

I trust you'll have no shortage of recommendations about CAP, column-based
stores, and support for schema flexibility in other comments.

~~~
jedbrown
I'm browsing this and I like it, combined with keyist's suggestion to use
SQLite sample data from Firefox/Chrome.

------
yummyfajitas
Start with SQL. 90% of the time, it is the right choice.

Pick any SQL flavor to start, sqlite is probably the easiest.

------
b0b0b0b
I think it depends also on what strengths you already have and how you are
personally motivated to learn. For example, I have found java's h2 database to
be easy to work with (and you never have to leave java). I don't think any db
could be easier to get started with and have less administrative overhead than
sqlite. If, to learn how to program computer games, one is advised to first
write tetris, to learn programming with databases, you might write a message
board.

------
edw519
I always thought that "Database Design for Mere Mortals: A Hands-On Guide to
Relational Database Design" by Michael J. Hernandez was a great starting
point. Sometimes it's nice to have a real book and you can't go wrong with a
used one from Amazon for under 5 bucks.

[http://www.amazon.com/Database-Design-Mere-Mortals-
Hands/dp/...](http://www.amazon.com/Database-Design-Mere-Mortals-
Hands/dp/0201694719)

------
etravers
I'm probably going to loose the few points I have left, but hey...

If you have access to a Windows machine my two cents says get a copy of SQL
Server Express. In addition to that you will want a copy of the AdventureWorks
database and the SQL Server Books Online.

SQL Server Express is a solid and free (as in beer) database engine. There are
processor and file size limitations, but other than that it is pretty much the
same as every other version they sell.

The Books Online (available online or as an installable download) has tutorial
style information on every level of the system; database engine, reporting
services etc.

The AdventureWorks database is a sample database that is used in the BOL and
just about every other SQL Server tutorial I have seen. It is quite extensive
and will save you from having to create your own data.

I'm sure somebody here will try and pick this apart, but this is how I
learned. If it means anything, I have been employed as a Databse Analyst for
the past four years.

They also offer "SQL Server 2008 R2 Update for Developers Training Kit", but I
have not tried it.
[http://www.microsoft.com/downloads/details.aspx?familyid=FFF...](http://www.microsoft.com/downloads/details.aspx?familyid=FFFAAD6A-0153-4D41-B289-A3ED1D637C0D&displaylang=en)

SQL Server Express:
[http://www.microsoft.com/sqlserver/2008/en/us/R2Downloads.as...](http://www.microsoft.com/sqlserver/2008/en/us/R2Downloads.aspx)

BOL:
[http://www.microsoft.com/downloads/details.aspx?displaylang=...](http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=c18bad82-0e5f-4e82-812b-5b23e5d52b9c)

Adventure Works site: <http://sqlserversamples.codeplex.com/>

------
thibaut_barrere
If I had to start learning this today (2010), I'd really do a mix of sql
(probably mysql) and non-sql (probably mongodb
<http://www.learnivore.com/search/mongodb>).

------
F_J_H
Although specific to Oracle, any of Tom Kyte's books are invaluable when it
comes to learning about databases and how they work.

If you do work with Oracle, Tom's "Ask Tom" site is excellent for learning
tips and tricks, as well as for finding elegant solutions to tricky SQL
problems. Again, it is oracle focused, but still very valuable. Link to Ask
Tom:

<http://asktom.oracle.com/pls/apex/f?p=100:1:0>

------
JangoSteve
There are a lot of great suggestions here for learning on your own. I'll just
add that if your company/venture will support this learning progress, check
out this workshop by Xavier Shay called Your Database Is Your Friend. He
travels around the world putting on this workshop, and though I haven't been
yet, I've heard good things.

<http://dbisyourfriend.com/>

------
grandalf
1) load some data into a table (probably using sqlite) and play around with it
using sql

2) read up on some database theory

3) check out a few popular options: sqlite, mongodb, couchdb and play around a
bit more.

4) find a big dataset that interests you, load it into one or more of the
databases you want to experiment with, and experiment away by trying to
extract interesting data, etc. Notice how each one performs, etc.

------
mtr
I don't know what your level is, but I recently used this book to get started
on the basics: Beginning Database Design: From Novice to Professional by Clare
Churcher

[http://www.amazon.com/Beginning-Database-Design-Novice-
Profe...](http://www.amazon.com/Beginning-Database-Design-Novice-
Professional/dp/1590597699)

------
fragmede
<http://en.wikipedia.org/wiki/Database_normalization>

------
michaelfairley
<http://sol.gfxile.net/galaxql.html>

------
jules
How do database do I/O? Do they use the usual file api? Memory mapped files?
Something else?

~~~
AlisdairO
_usually_ , they'll use standard unbuffered read and write calls, afaik. They
don't use buffering because they usually manage their own caching.

------
mkramlich
Google, Wikipedia, get and study the source to some open source databases,
etc. How does one start to learn about anything/everything now that we have
Internet access? Twenty years ago you would go about something like this
starting with a library or a university. Now we still have those things plus
the WWW. Knock yourself out.

------
known
Spreadsheet is a type of database.

------
bond
Which is better for a dating site, MyISAM or InnoDB? Thanks.

~~~
pjscott
It doesn't matter yet. Make something people want, _then_ you can worry about
scaling up.

