
Ask HN: Developing Web Apps around SQL tables that have millions of rows - mihok
Myself and a few buddies have been working on a project that seems to have started taking off. Currently I'm in charge of building our dashboard analytics platform for our client base, and suffice to say the amount of data is beyond what I've been used to dealing with. We're upwards of 31+ million rows as of yesterday and it seems to be rising at an insane pace (almost exponentially) as our client base increases. We're using an Apache/PHP/MySQL stack, no frameworks on the server side but Backbone on the front end. It looks like we'll hit 50 million rows within a week, and that pace will only increase as we add more clients.

My question for everyone here is: what (if any) experiences, insights, or resources did you find really helpful when you started having to deal with querying 1+, 10+, 100+ million rows? Any advice? I'm currently scouring the net for past experiences or alternative solutions to the issues we're facing now and the problems we may encounter in the future.
======
txutxu
You have already gotten some very good tips. I could add these for completeness:

Review your backups and contingency plans (and measure your restore times).

Usually you will want one mysql.conf for production and a different one for a
full restore (with different parameters). A full restore of a database this
big with the MySQL defaults can take looooooooong.
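As a rough sketch of what a restore-only config might look like (the variable names are real MySQL settings, but the values here are illustrative assumptions, not tuned recommendations):

```ini
# restore.cnf -- loaded only while replaying a full dump, never in production
[mysqld]
# Give InnoDB most of the machine's RAM while restoring (illustrative value)
innodb_buffer_pool_size = 8G
# Relax durability during the restore; crash-safety matters less when you
# can simply restart the restore from the dump file
innodb_flush_log_at_trx_commit = 2
# Dumps often contain very large multi-row INSERT statements
max_allowed_packet = 256M
```

The point is simply that the settings that keep production safe and responsive are not the settings that replay a dump fastest, so it pays to keep two configs and measure both.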

As an anecdote: in the biggest MySQL system I've ever touched, there was a
"performance problem". When I got the credentials and started to review, I
found that three different sysadmins had each scheduled a daily full backup,
using dumps. Some of them ran during office hours.

Yes, that is just plain stupid, but it's an example from real life.

On a big database system, backups (and restores) are something you want to
design with care, monitor, and measure/adapt periodically.

If you go to master/slave(s), you can usually take backups from a slave
without disrupting the master's operations.

Also, you may be interested in playing with things like mysql-proxy to split
reads/writes, so that writes go to the master while reads are spread across
multiple slaves.
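A minimal sketch of that read/write split done at the application level (the class, host names, and verb list are invented for illustration; mysql-proxy or a pooling layer would do the same job outside the app):

```python
import itertools

class ReadWriteRouter:
    """Send writes to the master, spread reads across replicas round-robin."""

    WRITE_VERBS = ("insert", "update", "delete", "replace", "alter", "create")

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def pick(self, sql):
        # Route on the first SQL keyword; anything not a write goes to a replica
        verb = sql.lstrip().split(None, 1)[0].lower()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._replicas)

router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
print(router.pick("INSERT INTO events VALUES (1)"))  # master-db
print(router.pick("SELECT * FROM events"))           # replica-1
```

One caveat with any split like this: replication lag means a read right after a write may not see the new row, so read-your-own-writes paths sometimes need to be pinned to the master.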

I assume you're already using 64bits, and that you know already about mysql
tuning variables.

~~~
mihok
Thank you for sharing your experiences. Small things like that can cause big
problems and they're easy to forget about! I'm thinking we're going to need to
write out a lot of processes for doing backups and adjustments.
Organization/structure will help squash the kind of incidents you mention, I
think...

------
tixocloud
TL;DR: Split databases, split tables, prefetch queries with cron jobs or
database functions, use separate servers, and don't collect redundant data.

As a BI developer, I'd also like to second the idea of splitting the table up
and changing your architecture. Ideally, you would leave your production
database untouched and work against a replicated database. The replica will of
course still be massive, so you will want to create tables specifically for
querying (an attempt to simulate a data mart) that contain only the data you
actually need. Another idea is to have cron jobs pre-query at, say, midnight,
so that when your users hit the database they are fetching pre-computed data.
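The pre-computed-table idea can be sketched with a toy example (SQLite stands in for MySQL here, and the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (client_id INTEGER, value INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 10, "2014-05-01"), (1, 20, "2014-05-01"), (2, 5, "2014-05-01")],
)

# The nightly cron job would rebuild this small summary table; the dashboard
# then reads the summary instead of scanning millions of raw rows per request.
conn.execute("DROP TABLE IF EXISTS daily_summary")
conn.execute(
    """CREATE TABLE daily_summary AS
       SELECT client_id, day, COUNT(*) AS events, SUM(value) AS total
       FROM events GROUP BY client_id, day"""
)

print(conn.execute("SELECT * FROM daily_summary ORDER BY client_id").fetchall())
# [(1, '2014-05-01', 2, 30), (2, '2014-05-01', 1, 5)]
```

The summary table grows with (clients x days) rather than with raw event volume, which is what keeps dashboard queries fast as the raw table heads past 50 million rows.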

Also, if it's rising exponentially, I would have a look at whether all that
data is really what you need. It's great to collect every single piece of data
if you can, but if you can't cover the costs of doing so and the extra data
adds no value, it doesn't make sense to keep everything.

~~~
mihok
Thanks a lot for your insights. Splitting the database up and having a
replica/production database setup is an interesting idea; I'm going to look
into it a bit more.

------
ScottWhigham
If you're growing at that pace, you can't do just one thing; you've got to
attack it on multiple fronts. I approach problems like this along these lines:

1) Optimize data type usage

2) Optimize indexing - removing redundant indexes, adding new indexes,
changing existing indexes

3) Optimize storage - get your SAN/disks in order

4) Partition the data in such a way that you serve queries and users quickly

If you do the above, you should easily be able to scale any decent hardware to
10x where you are now. That will buy you some time to then build a data
warehouse and backend support/reporting systems. At some point you're going to
be in a spot where you either have to invest $5m+ in hardware to keep serving
your current data, or you're going to have to start archiving some of it to a
data warehouse.
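For point 4, MySQL's native range partitioning is one option. A sketch against an invented `events` table (the partition boundaries are illustrative; note MySQL requires the partitioning column to be part of every unique key on the table):

```sql
-- Partition by month so dashboard queries on recent data touch only recent
-- partitions, and old months can be archived or dropped cheaply.
ALTER TABLE events
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p201404 VALUES LESS THAN (TO_DAYS('2014-05-01')),
    PARTITION p201405 VALUES LESS THAN (TO_DAYS('2014-06-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Archiving a month later becomes a metadata operation, not a huge DELETE:
-- ALTER TABLE events DROP PARTITION p201404;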

~~~
japhyr
_$5m+ in hardware to keep serving your current data_

Can you explain this a little further? Are you speaking specifically about
buying hardware, or do you mean buying virtual servers/ database
infrastructure?

------
hashtree
What type of data makes up the rows? Relational databases fit many data
persistence problems; however, there are other types of database that might
be more applicable to your problem. Essentially, use the right tool for the
job.

Relational, key/value, document, columnar, graph. Within each of those types
there are specific implementations (MySQL, Postgres, Oracle, Redis, Riak,
Neo4j, Titan, Cassandra, MongoDB, CouchDB, etc.), each with their own
pros/cons.

Within the relational database realm, make sure you understand: when and when
not to normalize/denormalize, clustered/non-clustered indexes,
btree/rtree/hash indexes, schema design, storage engines, partitioning,
master/slave, replication, caching strategies, execution plan optimization,
etc..

~~~
mihok
Good point. Our biggest table consists mostly of numeric data, with the
exception of a MAC-address column and date/time columns. I'm going to do more
research on your last point as well. Thank you!
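One concrete data-type win for that kind of table: store the MAC address as 6 raw bytes (a `BINARY(6)` column, or an unsigned integer) rather than a 17-character string, which shrinks both the rows and any index on the column. A sketch of the packing, with invented helper names:

```python
def mac_to_bytes(mac: str) -> bytes:
    """Pack 'aa:bb:cc:dd:ee:ff' into 6 raw bytes for a BINARY(6) column."""
    return bytes.fromhex(mac.replace(":", ""))

def bytes_to_mac(raw: bytes) -> str:
    """Unpack 6 raw bytes back into the colon-separated display form."""
    return ":".join(f"{b:02x}" for b in raw)

packed = mac_to_bytes("aa:bb:cc:dd:ee:ff")
print(len(packed))           # 6 bytes instead of 17 characters
print(bytes_to_mac(packed))  # aa:bb:cc:dd:ee:ff
```

Across tens of millions of rows, cutting 11 bytes per row (plus the same saving in every index that includes the column) adds up, and comparisons on fixed-width binary values are cheaper than on strings.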

------
AznHisoka
Check to see if there are any indices you don't need anymore. Check whether
you have indices on long text or string columns, and see if you can convert
those to indices on BIGINT or INT columns (by hashing the string).

In my opinion, it's not the amount of data that is the biggest issue, but the
amount of reads/writes coming into the system. 100+ million rows with just one
thread reading and writing to it really isn't a big deal, technically. If you
have load issues, you might need to shard the data, do bulk inserts by
buffering the data, increase the memory, cache commonly read rows in memory,
etc.
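A sketch of the string-to-bigint trick (MD5 truncated to 64 bits here; any stable hash works, at the cost of a small collision risk you'd normally cover by also filtering on the original column):

```python
import hashlib

def hash64(s: str) -> int:
    """Map an arbitrary string to a 64-bit int fit for a BIGINT UNSIGNED column."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

url = "https://example.com/some/very/long/path?with=query&strings=1"
h = hash64(url)
print(h < 2**64)  # True: fits in BIGINT UNSIGNED
# Typical query pattern against an indexed url_hash column:
#   WHERE url_hash = :hash AND url = :url
```

The index on the 8-byte hash column is far smaller and faster to traverse than an index on the full string, and the trailing equality check on the original column catches the rare collision.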

------
dpcx
Make sure you're storing your data optimally, and look in to sharding your
data if possible.

------
devonbarrett
Drop Apache, use Nginx, and make sure you're caching what you can.

~~~
mihok
I'm pushing really hard for this; I've used Nginx for a couple of previous
projects and run it locally on my machine. I've also heard of people using
Nginx as a reverse proxy in front of Apache, though I'm not really sure what
benefits that gives.
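The usual benefit of that pattern is that Nginx serves static files and absorbs slow clients cheaply, so Apache's heavier PHP workers are only tied up for actual dynamic requests. A hedged sketch of such a config (the paths and the backend port are assumptions):

```nginx
server {
    listen 80;

    # Serve static assets directly from Nginx; Apache never sees them.
    location ~* \.(css|js|png|jpg|gif|ico)$ {
        root /var/www/app/public;
        expires 30d;
    }

    # Everything else is proxied to Apache (assumed on 127.0.0.1:8080).
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```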

