
MySQL is circa 1 million lines of code.

I like SQL engines for moderate data sets that fit nicely on one machine and well within the normal performance envelope. But even there I will often have to try a few different incantations and cross my fingers that one of them will perform reasonably because that's easier than trying to figure out what that 1 MLOC engine is up to. And I don't know anybody who does very large MySQL setups without a lot more hassle than that.
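
To make the "incantations" part concrete, here's a hypothetical sketch (the schema is invented, and SQLite stands in for MySQL only because it's self-contained): two logically equivalent phrasings of the same query. Whether they get the same plan is the planner's business, and finding out means digging through exactly the code I'd rather avoid.

    import sqlite3

    # Invented schema, purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    """)

    # Incantation 1: an IN-subquery.
    q1 = """SELECT name FROM users
            WHERE id IN (SELECT user_id FROM orders WHERE total > 100)"""

    # Incantation 2: the same result, phrased as a JOIN.
    q2 = """SELECT DISTINCT u.name FROM users u
            JOIN orders o ON o.user_id = u.id WHERE o.total > 100"""

    # Same answer either way; compare what the engine plans to do.
    for q in (q1, q2):
        print(conn.execute("EXPLAIN QUERY PLAN " + q).fetchall())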

For some things I'd much rather deal with 1KLOC that I had to write myself than the 1 MLOC that I'm scared to even start digging through.




The question isn't 1MLOC vs 1KLOC.

It's a stable, well-understood DB vs. an immature, not-well-understood DB, AND 1 KLOC to deal with the data not being consistent.

To be clear, I'm not saying any given DB is the OneTrueWay, just that people seem to be a bit cavalier in regards to some of this crap and chasing the newest shiny thing while rediscovering why some of the braindamage in those 1MLOC was put there in the first place.

-----


How do you run your stable, well-understood DB, which probably uses thread locks and shared memory, across a cluster of 10 machines?

> just that people seem to be a bit cavalier in regards to some of this crap and chasing the newest shiny thing while rediscovering why some of the braindamage in those 1MLOC was put there in the first place.

But when people need to scale, they usually know it. Since only successful companies need to scale, they probably know a thing or two about their domain and their specific dataset.

It used to be that when you wanted to scale a database, you had to be one of the large companies out there, and your manager went and played golf with an Oracle salesman, and all of a sudden you ended up with Oracle. That's what I call "golfware".

Small companies that all of a sudden had to handle hundreds of thousands of client connections were not very common.

So I think they don't have a choice but to be cavalier about this crap. They either end up with an overpriced blade server that still has one single lock underneath all those expensive blades, or they have to think of a distributed solution (or they just give up, move out of the way, and let others eat their lunch).

-----


Oh, I agree that some people are being a bit cavalier about the new shiny. These folks are migrating away from MongoDB for a reason.

On the other hand, even if half of that 1 MLOC is still relevant to new ways of building systems (and given that SQL databases are a 70s tech, I doubt it's that much), that still leaves half that isn't.

The only way we'll find out which part matters is for people to try different approaches. So I fully support experimentation like this. If we don't rediscover for ourselves the good parts in the tools we use, then we're just stuck honoring the traditions of our heavily bearded ancestors.

-----


There is an awful lot of 70s tech that is still in constant operation, and is highly relevant to the types of work you're probably doing today. TCP, C, Pipes, etc.

-----


Sure. But I think the problem space for databases has changed quite a bit in ways that aren't true for TCP, and are only partly true for C.

My dad was writing code at the time, and he saw the big benefit as allowing developers to manage larger amounts of data on disk (10s of megabytes!) without a lot of the manual shenanigans and heavy wizardry in laying out the data on disk and finding it again. Plus, the industry thought the obvious future was languages like COBOL, what with their friendly English-like syntax and ability for non-programmers to get things done directly.

So little of that is true anymore. For a lot of things that use databases, you're expected to have enough RAM for your data. We don't distribute and shard because we can't fit enough spinning rust in a box; we do it because we're limited on RAM or CPU. A lot more people have CS degrees, the field is much better understood, and developers get a lot more practice because hardware is now approximately free. And nobody really thinks the world needs another English-like language so that managers can build their own queries.

TCP, on the other hand, is solving pretty similar problems: the pipes have gotten faster and fatter, but surprisingly little has changed in the essentials.

C is somewhere in between. A small fraction of developers working these days spend much time coding in plain C, and many of its notions are irrelevant to most development work.

But unlike SQL databases, you could ignore C if you wanted to; there were other mainstream choices. That wasn't true for SQL until recently; the only question was Oracle or something else. I'm very glad that has changed.

-----


My apologies for the rant here - it's not directed specifically at you, but toward a general attitude I see on HN and in the tech community.

I was more commenting on your phrase "if half of that 1 MLOC is still relevant to new ways of building systems (and given that SQL databases are a 70s tech, I doubt it's that much)".

There has been a TON of academic and industrial research on SQL databases since they were invented in the 70s. Calling them 70s tech is akin to calling LISP 50s tech. The basic ingredients haven't changed much (sets in SQL and lists in LISP), but the techniques on top have evolved by leaps and bounds.

To your point here - there are plenty of companies that have way more data in their databases than RAM available. The early users of Hadoop, etc. were primarily constrained by disk I/O on one machine, rather than constrained by RAM or CPU on one machine. It is certainly convenient that a distributed architecture can solve both sets of problems if the problem space lends itself to distributed computation.

I'm not a huge defender of SQL - I think it has some serious problems. One fundamental problem is the lack of strong support for ordered data, and it can be a huge pain to distribute. I agree that having some options with distributed K/V stores is really nice, but you have to admit that much of it hasn't yet been proven in the field.
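
To give a hypothetical taste of the ordered-data problem (SQLite as a stand-in, made-up table): before window functions, even "compare each row with the previous one" took a self-join, where a list-oriented language does it with one zip().

    import sqlite3

    # The ordered-data pain point, sketched: each reading's delta from
    # the previous reading, via the classic self-join incantation.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (t INTEGER PRIMARY KEY, val REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)",
                     [(1, 10.0), (2, 12.5), (3, 11.0)])

    # Pair each row with the row whose t is the largest t before it.
    deltas = conn.execute("""
        SELECT cur.t, cur.val - prev.val
        FROM readings cur
        JOIN readings prev
          ON prev.t = (SELECT MAX(t) FROM readings WHERE t < cur.t)
    """).fetchall()
    print(deltas)  # [(2, 2.5), (3, -1.5)]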

I, for one, DO think that the world needs something like an English-like language so that "managers" can write their own queries. Honestly, the roles of programmer and data analyst are often wholly different. I think it's a huge kludge that the only people capable of manipulating massive datasets are people with CS degrees or the equivalent. Programmers suck at statistics, and while they're generally smart folks, they often don't have the business/domain expertise to ask the right questions of their data. Software is about enabling people to solve problems faster - why should we only allow people with the right kind of academic background to solve a whole class of problems?

Finally, saying that you doubt SQL is relevant to building modern systems is borderline irresponsible. Experimentation with new tools is good - but you have to also keep in mind that people were smart way back in the 70s, too, and that their work may be perfectly relevant to the problems you're trying to solve today.

-----


No need to apologize! You make excellent points.

There has indeed been a ton of research on SQL databases. But still, Stonebraker, one of their pioneers, said that they should all be thrown out:

"We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs." -- http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf

I'm sure a lot of their intellectual work is indeed something I could learn from. But SQL databases are an artifact of a particular moment in technological and cultural history. The thing I really want to learn about isn't the residue of their thoughts as interpreted through 30 years of old code, it's their original insights and how to transfer those to today's world.

Hadoop is a fine example of Stonebraker's point. The original sweet spot of relational databases was easy queries across volumes of data too large to fit in RAM. But Google realized early on that they could do orders of magnitude better with a different approach.

I agree that these new approaches haven't been fully proven in the field, but I'm certainly glad that people are trying.

As a side note, I think the right way to solve the pseudo-English query language problem is by making the query language actual English. If you have a programmer and an analyst or other business person sit next to one another, you get much better results than either one working alone.

-----


Stonebraker's research and commercial ventures over the last several years have been focused on building specialized variants of existing database systems. Vertica (C-Store), VoltDB (H-Store), StreamBase (Aurora), and SciDB are all specialized DBMSes designed to overcome the one-size-fits-all nature of things.

Further, he's been critical of NoSQL/MapReduce recently: http://dl.acm.org/citation.cfm?doid=1721654.1721659

Regardless, there's always going to be a balance between specialized systems and platforms, but my point is that we should be willing to trust the platforms that have proven themselves, avoid reinventing the wheel (poorly), and not be too quick to throw them out in favor of the new shiny.

I agree that programmer/analyst working together is a terrific pair, but the beauty of software is that we live in a world where we can have 1 programmer write a platform on which 10 programmers build 10 systems that 1000 users use to get their jobs done and make society that much more efficient.

-----


Oh, I trust the current stable platforms to be the current stable platforms. My beef isn't with people who use them. It's with people who don't know how to do anything else, which was a plague on our industry for at least a decade. At least the people who get burnt on the latest fad will make new and interesting mistakes.

I agree that when we can find ways to let users serve themselves, that's best for everybody. I just don't think universal pseudo-English query languages are the way to do that, because the hard part isn't learning a little syntax, it's learning what the syntax represents in terms of machine operations.

Once the programmer and the analyst have found something stable enough to automate, by all means automate it. Reports, report builders, DSLs, data dumps, specialized analytic tools: all great in the right conditions. But people have been trying to build pseudo-English PHB-friendly tools for decades with near zero uptake from that audience. I think there's a reason for that.

-----


>On the other hand, even if half of that 1 MLOC is still relevant to new ways of building systems (and given that SQL databases are a 70s tech, I doubt it's that much)

SQL is not a technology in the sense of a specific artifact (say, a PDP-11); it's a design based on relational algebra, a formal system for manipulating relational data.

A specific RDBMS implementation might "age", but math does not age. For as long as we have relational data (data with relations to each other), relational algebra will be the best, and formally proven correct, way to model it. Period.
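
And because it is math, the core of it fits in a few lines of any language. A minimal sketch (the representation and names are mine, not any engine's): select, project, and natural join over plain sets of attribute/value tuples.

    # Relational algebra over plain Python sets; no engine, no storage.
    # Rows are tuples of (attribute, value) pairs so they can live in
    # sets. All names here are illustrative.

    def select(rel, pred):    # sigma: keep rows matching a predicate
        return {row for row in rel if pred(dict(row))}

    def project(rel, attrs):  # pi: keep only some attributes
        return {tuple((a, dict(row)[a]) for a in attrs) for row in rel}

    def natural_join(r, s):   # join rows agreeing on shared attributes
        out = set()
        for a in r:
            for b in s:
                da, db = dict(a), dict(b)
                if all(da[k] == db[k] for k in set(da) & set(db)):
                    out.add(tuple(sorted({**da, **db}.items())))
        return out

    emp  = {(("dept", 1), ("name", "ada")), (("dept", 2), ("name", "cliff"))}
    dept = {(("dept", 1), ("floor", 3)), (("dept", 2), ("floor", 5))}

    # Who sits above floor 3? Pure set manipulation, no engine needed.
    print(project(select(natural_join(emp, dept),
                         lambda r: r["floor"] > 3), ("name",)))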

The same holds true for every other "tech" that is based on computer science. All of those technologies are older than the '70s and will be used FOREVER: garbage collection, hash maps, linked lists, regular expressions, B-trees, etc.

Even specific artifacts remain relevant: TCP/IP, C, UNIX, windowing UIs, etc etc...

-----


The chant that "X is best, period" is a religious notion, not a practical one. You're welcome to worship whatever you please, but for those of us who are here to have a real-world impact, "best" is defined in terms of utility for a particular situation.

Even one of the great RDBMS pioneers, Stonebraker, agrees that RDBMSes are an artifact of a particular era in technology and commerce and should be thrown out and done over:

http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf

-----


>The chant that "X is best, period" is a religious notion, not a practical one.

When it comes to Math, there is no argument. Period.

-----


In the name of the Tangent, and the Sum, and the Multiplier, go thee and spread thy gospel. Amen.

-----


Exactamundo, my friend, but non-ironically of course!

-----


I'm curious: What metric are you using to measure these dimensions of stability and well understoodness?

-----


The Linux kernel has over 15 million lines of code, and people normally don't hold that against it. Judging a piece of software by its LOC count is a fallacy.

A project with rigorous error handling and testing will have more LOCs than a corresponding project without.

Some problems are just hard, and you'll want as much code as is necessary to make it secure and performant. Some parts of the code you will never run, but inactive code seldom hurts you.

MySQL has its issues, but none of them would be fixed just by having less code.

-----


It is material as a measure of complexity when it becomes necessary for a developer to understand that complexity. Database tuning is a black art to many because databases are very complicated.

It turns out that having less code does in fact fix some issues. Consider Prevayler, for example, an object persistence layer that provides full ACID guarantees in something like 3 KLOC. It's also radically faster. It has a number of limitations (e.g., data must fit in RAM, no query language) but if you're ok with that, it's great: a Prevayler system is radically easier to reason about and optimize than something 300x as complicated.
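
To sketch why so few lines can suffice (this is the general system-prevalence idea in Python, my own toy version rather than Prevayler's actual Java API): keep all state in RAM, journal each command before applying it, and replay the journal to recover.

    import pickle

    class Prevalent:
        """Toy prevalence layer: in-RAM state plus a command journal."""
        def __init__(self, system, journal_path):
            self.system = system
            self.journal_path = journal_path

        def execute(self, command):
            # Journal first, then apply: a crash can always be repaired
            # by replay. (A real implementation would also fsync here.)
            with open(self.journal_path, "ab") as journal:
                pickle.dump(command, journal)
            return command.apply(self.system)

        def recover(self):
            # Rebuild in-memory state by replaying the journal.
            try:
                with open(self.journal_path, "rb") as journal:
                    while True:
                        pickle.load(journal).apply(self.system)
            except (EOFError, FileNotFoundError):
                pass

    class Deposit:
        def __init__(self, account, amount):
            self.account, self.amount = account, amount
        def apply(self, accounts):
            accounts[self.account] = accounts.get(self.account, 0) + self.amount
            return accounts[self.account]

    bank = Prevalent({}, "/tmp/journal.log")
    bank.recover()
    print(bank.execute(Deposit("alice", 100)))

No query planner, no buffer pool, no lock manager: everything is ordinary objects, which is where the "radically easier to reason about" part comes from.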

Also, your 15 MLOC for Linux is a bit of a red herring. Look at this (somewhat old but presumably representative) breakdown:

http://www.h-online.com/newsticker/news/item/Kernel-Log-More...

70% of the code is in drivers and arch. The core kernel itself is circa 1% of the total, which at that point was 75 KLOC. I think the fact that they've been so disciplined in keeping the core small is part of what has made Linux such a success.

-----


Hi,

The way I look at it is that a database, much like Linux, is a platform. I very seldom look at the source code for either for day to day programming.

As popular platforms they have in common that they are very well tested through everyday use, and are likely to operate as documented for ordinary configurations.

When you can rely on the correct operation of the system, the code of the underlying implementation is irrelevant. What you care about is how well the system supports your requirements, and what performance you can get by tweaking the available knobs.

Contrast this with your average 1000 line script. It has simplicity on its side, but when something breaks, that script is a suspect, and the source code of your DB probably isn't.

> Consider Prevayler, for example [...]

I'm not really sure what you're saying here. That an in-process in-memory object persistence framework without indexing can be faster than a heavy-duty relational database? That's not just "less code", that's "less features". Or "different features", at any rate; they're not the same species. I'm just going to assume what you mean to say is, "Not everyone needs a relational database".

> Also, your 15 MLOC for Linux is a bit of a red herring. [...]

All the more reason that the LOC count by itself is a meaningless metric.

-----


Sure, databases are a platform all their own. Like any platform, as long as you are operating well within the expected envelope, they work as advertised. When you get near the edge, though, you really need to understand how they work. As we are seeing with the rise of all sorts of RDBMS alternatives, a lot of people are getting near a lot of different edges.

The ability to understand how something works is a function of complexity. LOC is correlated with complexity, so it's a good rough metric. If you have a better one, please offer it. But otherwise I'll stand by my original point, which is that the guy bitching about 1000 lines of consistency code is ignoring the much larger amount of code used in alternative approaches.

What I'm saying with Prevayler's example is that if you don't need all the features of a database, then the extra complexity is a drag on what you're trying to get done. Fewer features means less code means less work to master.

> All the more reason that the LOC count by itself is a meaningless metric.

Yes, you throwing in a bullshit number is definitely proof that all numbers are bullshit. Bravo.

-----


> The Linux kernel has over 15 million lines of code, people normally don't hold it against it.

They would if all the Linux kernel did was play Tetris, for example. The point was that here you have 1000 lines of what someone thinks is awkward code to deal with eventual consistency, vs. 1M lines of code to deal with consistency in the other case. If consistency for a particular application can be dealt with in 1000 lines, you should usually go for that instead of the millions-of-lines solution.

Think of it the other way. They get availability and partition tolerance from Riak, and they can handle eventual consistency with 1000 more lines. Now imagine you have MySQL and you have to make it run in a multi-master distributed mode: how many lines of code would you need to handle two of the CAP properties and then have an application-specific way to handle an incomplete (or untimely) third? I bet it would be more than 1000 lines...
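
For flavor, here's a hypothetical sketch of what that application-level code tends to look like (the shopping-cart domain and the union merge rule are my assumptions, not anything from the article): when a store like Riak hands back divergent sibling values after a partition, the application supplies the merge.

    # App-level eventual consistency, sketched: merging divergent
    # replica values ("siblings") once a partition heals.

    def merge_carts(siblings):
        # Union never loses an item added on either side of the
        # partition. Handling removals correctly needs tombstones or
        # an OR-set CRDT, which is where the rest of those 1000 lines
        # tends to go.
        merged = set()
        for cart in siblings:
            merged |= cart
        return merged

    # Two replicas diverged while partitioned: one device added socks,
    # another added a hat.
    replica_a = {"book", "socks"}
    replica_b = {"book", "hat"}
    print(merge_carts([replica_a, replica_b]))  # book, socks, hat (some order)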

-----


> Some problems are just hard, and you'll want as much code as is necessary to make it secure and performant.

Wtf? Since when did bloat make code "secure and performant"? And it hurts you if you ever want to touch or look at that code again.

-----


It is true that the most important thing is a good design, which will hopefully get you good performance with minimal and maintainable code.

However, in my experience, there are almost always additional optimizations that can be done after you have implemented your basic design. Things like "this part could make smarter choices with a more complicated heuristic", "we could use a faster data structure here, though it requires a lot of bookkeeping", or "we could cut a lot of computation here with an ugly hack that cuts through the abstraction".

Of course, more code makes it harder to change the structure of the program, so it's the classic trade-off of maintainability versus optimization.

A good example of this, besides databases, is CPUs. Modern CPUs use loads of silicon on complex optimization tricks: out-of-order execution, register renaming, prefetchers, cache snooping. And all that "bloat" is actually making them faster. You can't make a super-fast CPU by removing all the cruft to get a minimal design. (Or rather, you can make it faster for certain cases, but it would be slower at doing almost anything useful.)

-----


>Wtf? Since when did bloat make code "secure and performant"?

WTF? Since when does one read the phrase "Some problems are just hard, and you'll want as much code as is necessary to make it secure and performant." and deduce (who knows by what logic) that the guy means _bloat_ and not _necessary_ code (error checking, code for handling corner cases, etc.)?

Not to mention that bloat is a silly term used by non-programmers to mean "this program is large" or "I don't use that feature, so it must weigh down the program needlessly".

That is, people who don't understand that features they don't make use of (e.g. the full-text search capability of MySQL) are not even loaded from disk by the OS in the first place, or that most of the size of a large program like Office is not compiled code but assets (graphics, etc.).

-----


"MySQL is to database what PHP is to programming languages". Use PostgreSQL.

-----


By that, you mean used effectively on some of the largest, most profitable websites in the entire world? ;)

-----


Why do mysql and PHP apologists think "people have managed to succeed despite deliberately making things more difficult for themselves" is a compelling argument? Mysql and PHP didn't make them succeed, or even help them succeed.

-----


It proves that the technologies are capable products at the most massive scales. Do you have evidence to support that Facebook, Tumblr, or Etsy would have been better off had they chosen different technologies (of course not)? Or that they made things more difficult for themselves? How would Facebook be improved by PostgreSQL? At scale, data is so massively partitioned that the fact that you don't have windowing functions is utterly irrelevant. I hate PHP as much as the next guy, but the fact is that it gets the job done. And that, in business, is what matters. The angst on here about MySQL is unfounded and is largely a symptom of groupthink.

-----


Does it? If so, I suppose the original Macintosh proves that assembly language is all you ever need, and the success of Windows 3.1 proves that truly nobody will ever need more than 640k of RAM.

In my view, Facebook's success isn't proof that PHP is awesome. It's proof that they hired awesome people.

-----


I said capable, not awesome, or cutting edge, or the only thing you'll ever need, or the best thing since sliced bread. Capable is very different from your obviously flawed analogy. Windows 3.1 and the original Macintosh were successful because they were capable products at the time, and they have evolved into the good, but not flawless, products that exist today.

The groupthink MySQL/PHP hate here implies these are terrible products which only a moron with bad taste would use, which is demonstrably false when you look at the choices made by those using them.

-----


I don't see any conflict between the notion that PHP is a turd of a language and that it's perfectly adequate for building something major. If you're willing to spend years holding your nose. Or if you have little aesthetic sense. Or aren't experienced with other languages and don't know any better.

As far as I'm concerned, PHP is the Bud Light of programming languages. Popular and perfectly adequate for a large audience, but definitely not a sign of taste and discernment.

-----


I don't think Facebook or Wikipedia engineers would choose PHP right now if they could choose. Maybe the same for MySQL, but I'm less sure.

-----


What do you base that on?

-----


Thanks, papsosouid. Facts are much more useful to discussions like this. It still doesn't change the fact that PHP is capable of running the most trafficked sites in the world (side note: I don't like PHP either; I just hate haters who can't see the world in shades of gray). Having heard Facebook engineers speak about MySQL many times, I have not seen any indication that they are dissatisfied with MySQL as a platform choice. (And please, no one reference that utterly BS article published on gigaom.com.)

-----


The masochism rate is still low, even in these companies.

-----


So, you have no facts, just bias. Thanks.

-----


Them saying that?

http://www.quora.com/Quora-Infrastructure/Why-did-Quora-choo...

-----


You aren't supporting your case at all. I said "just because you can accomplish something with bad software, doesn't mean you accomplished it because of that software". And you responded with "it is possible to build big things with bad software". Yeah, I know. That doesn't make the bad software good though, which was my point.

-----


MySQL is 1 million lines of code and isn't even ACID

See ALTER.

-----



