

Ask HN: How would one go about writing his own database? - bernatfp

I was thinking that writing a DB would be a good project to start getting comfortable with a language I&#x27;ve been learning. However I have no idea about design, strategies or best practices for creating a DB.
======
kkowalczyk
What kind of database? An embedded key/value database like TokyoCabinet? An
embedded SQL database like SQLite? An in memory data-structure database like
Redis? A server SQL database like MySQL? A distributed key/value database like
Cassandra? A json document database like CouchDB? I could go on...

In general, if you want to learn a new language, pick a project of limited
scope (weeks) that you have done in the past in some other language.

A database is a bad choice. Even the simplest database worth being called a
database requires months of effort.

~~~
bernatfp
To be precise, I want to code a key/value store exclusively oriented at
geolocated queries by key. It doesn't look like a very complex project or am I
underestimating it?

------
zamalek
I noticed you want to do geolocation. Let's put that aside for now, just
looking at key/value.

1\. You need to store all your data in "pages". These are fixed-length blocks
(that can be chained together).

2\. Use a classical "dictionary" data structure to store your index (most use
B-Tree), but you place that structure on-disk instead of in-memory. Your index
basically points a key to the index of the first page that contains the value.

3\. You may need to store other types of data in the file, such as the free
page list.

As for geo-location, B-Tree fits nicely here.

1\. Intelligently lay out your B-Tree so that nearby nodes are typically
within the same node-family-system. Include information about the distance
spanned by descendants in the tree (you probably want to use Manhatten
distance).

2\. When trying to do "near" queries, first locate the node you are interested
in and then start traversing up the parents until you find the ancestor that
encompasses the distance you are looking for.

3\. Grab all the descendants of that ancestor and re-check them against the
distance.

------
jacquesm
This is one of those questions where if you ask it the answer is 'you
shouldn't'.

If you want to get comfortable with a language that you are learning my advice
would be to pick a simple real-world problem and to try to solve that rather
than to tackle a major project like this. That will only cause frustration in
the long run.

~~~
ulisesrmzroche
in the short run too. even setting up the project sounds like a bitch.

------
shailesh
If you meant database engine, then you should contribute to open source
database projects like PostgreSQL or MariaDB. This assumes that you're
comfortable with populating and maintaining databases.

Start with downloading the source code for say MariaDB, reading internals
documentation on project Wiki, compiling it on your machine.

[https://wiki.postgresql.org/wiki/Developer_FAQ#How_do_I_get_...](https://wiki.postgresql.org/wiki/Developer_FAQ#How_do_I_get_involved_in_PostgreSQL_development.3F)

[https://mariadb.com/kb/en/contributing-
code/](https://mariadb.com/kb/en/contributing-code/)

Good luck.

------
hardwaresofton
I think if you really wanted to get comfortable with something that could have
you make a performant, database quickly, you might want to start with
something like C...

Now, C is a shotgun that you will invariably use to shoot yourself in the foot
with, but I think it's absolutely straightforward and powerful manual memory
management, as well as raw speed, would make it worth learning.

Like, you could for example just focus on building a database of a given size,
in memory -- you could even do it all in RAM, if you wanted.

You could have the database specialize in holding one kind of data structure
(for example, how about tries? they're pretty rare, despite how useful they
are)

~~~
bernatfp
Thanks for the constructive feedback. In fact, I wanted to focus just in
latitude-longitude pairs to perform geolocated queries, this is my single
goal. I'm trying to get rid of MongoDB for my backend and this is one of the
last barriers...

~~~
hardwaresofton
So you were trying to get rid of mongodb as a backend, and you ended up at
writing your own? Well I'm not sure how set you are on that, but maybe you
should look into some other databases, what did you dislike about mongo?

And yeah, with such a specific goal, you could very well make a very nice
niche DB that optimizes for the kind of data you're trying to search!
Especially with map data becoming more and more important, I think it's
actually a pretty awesome idea.

However, I would say that you should go in with the intent to make a kickass
latitude-longitude database, rather than just having it be a piece of
something you're trying to push to production (you have better things to worry
about with something that's going to production -- don't make the db one of
them)

You never know, you might find that the database is the more important
project!

------
arh68
I, for one, think you should do it. But you've got to realize, up front, this
will be thrown away, will only work for 1 use case, will be slow, etc. Don't
build anything general, or anything reusable. Write in Go. Put everything in
one big textfile. Start with O(slow) code. Use a simple 'schema', like
tsv/csv/json. It will work.

Production-ready, simple, written by you: pick 2. The 'correct' thing to do is
choose the first two. Have at it.

------
redox_
Wow, to start getting comfortable with a language, a DB won't be my choice.
You should really start with something easier.

------
lmm
One feature at a time, the same as any other program.

Start with something very simple - maybe a single table with a hardcoded
schema, or a simple string key/value store (a la memcached). Find a use case
for which this is an improvement over what you had before. Add features as and
when you need them.

------
pbrumm
if you have a different angle on querying your data that the current solutions
don't solve then you should definitely give it a shot.

You should try to limit what you build though. Can you utilize leveldb, Riak,
Cassandra, Redis or even Postgres Foreign Data Wrapper.

Your data will probably need to be queried with other user data so consider
that as well.

I would start with figuring out how to query the data you need that different
from the current measures and prove to yourself that it beats out the existing
strategies. If it does then expand by storing and loading the data to disk.

then work on connecting to it through multiple clients, look into using
msgpack, thrift, message pack.

Have fun.

------
drakaal
The language you have been learning had best be C or Assembly. Almost anything
else and your database will always be the weakest link in performance.

~~~
jacquesm
I really don't see how you could suggest to learn assembly in this context.
I'm not aware of any production database for commodity hardware written in
assembler. (in the embedded world there are a few but that's specialty stuff).

~~~
drakaal
I don't think commodity hardware was specified as a project goal.

If you run any of the big EMC, or NetAPP, or Exadata systems those are built
with many of the core components written in Assembly.

