Just recently one of our team members was looking at it again (following a strong benchmark posted at mysqlperformanceblog). So I went and checked out the forums and saw it was in pretty much the same state as when I last used it four years ago.
Sad to see they couldn't make it work. The team was always really friendly and quick to help with issues - good luck in the future.
After many problems and sad surprises, we came to the conclusion that we'd have been better off using PostgreSQL (or any other SQL database) for relational data and some NoSQL store for "fat data". Or just PostgreSQL alone would have done better, because Mongo's feature set is a subset of PostgreSQL's feature set. Mongo just renamed things and pretends they're something new.
"fat data" (just to avoid calling these mere gigabytes "big data"
I have a rule that should be more widespread:
If you can put the database in RAM on an x86 server, it's not "big data" by any stretch. Beyond that it becomes more complex, but for starters let's consider whether the indexes fit in RAM.
If the indexes/hashes for your data cannot fit in RAM on a commodity x86 server, then you can probably consider it "big data".
So, currently it's possible to buy Supermicro systems that take 6TB of RAM (with just a normal QPI link) without getting into any of the exotic SSI systems (like the SGI UV 2000).
We should avoid talking about the physical plant requirements for "big data" as well, since it's possible to put over 350TB of storage in 4U with products from Nexsan, JetStor, etc. That is over 3PB per rack...
So, you can call your data set "big data" if the indexes are > 6TB or the actual data set is > 3PB. These numbers will change next year when new machines/storage arrive.
I used to see 24hr OLAP cube runs and no one ever called that 'big data'. It's entirely a question of scale in my mind: these days you can buy truly gigantic, phenomenally powerful servers, but if your data needs multiple servers dividing the load in order to answer queries in a timely manner, then you can start talking about big data.
After working at Google, I (with another teammate) kinda "feel" what big data is: it's more about the approach, mindset, and toolset used to work with it. I agree with you and @techdragon that if you can fit the data onto one machine, it's probably not really big. But one can also work with 1GB of data using a big-data approach, which is what we call "fat data". When we grow out of a single machine, we won't need to rewrite our project.
Nevertheless, none of this prevents our sales team from saying "big data" and "cloud" in every phrase of theirs :)
Just a slight warning on this, because I was mildly burnt by hearing this and assuming it was true.
For background, I'm a long time happy Postgres user.
Recently I decided to use it for a new project instead of Mongo and to utilise the new JSON support.
Turns out it is great for CRD apps, not so much for CRUD.
The current release version of PostgreSQL (9.3) has no capability to update parts of JSON stored as the JSON datatype (i.e., you have to read the entire JSON blob, change it, and write it back).
Updates within JSON fields are coming in 9.4.
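For illustration, the 9.3-era workaround looks roughly like this (a sketch with node-postgres; the "docs" table and its json column "body" are made-up names):

    // Sketch of the 9.3 read-modify-write cycle (node-postgres; the
    // "docs" table and its json column "body" are hypothetical).
    import { Client } from "pg";

    async function setStatus(client: Client, id: number) {
      // 1. Pull the entire JSON blob out of the row.
      const res = await client.query("SELECT body FROM docs WHERE id = $1", [id]);
      const body = res.rows[0].body;
      // 2. Change one field in application code.
      body.status = "done";
      // 3. Write the whole blob back.
      await client.query("UPDATE docs SET body = $1 WHERE id = $2",
                         [JSON.stringify(body), id]);
    }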
I have found "source diving" a scary experience with Mongodb in the past.
For that matter, imho doing a locked partial update via something like _.assign would be fine in Postgres. It depends on how you really need to use your data... and how it fits into that.
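Something like this, say (a sketch reusing the same hypothetical docs/body schema; _.assign is lodash's):

    // Locked partial update: hold the row lock while merging the patch,
    // so concurrent writers serialize instead of clobbering each other.
    import _ from "lodash";
    import { Client } from "pg";

    async function lockedMerge(client: Client, id: number, patch: object) {
      await client.query("BEGIN");
      try {
        const res = await client.query(
          "SELECT body FROM docs WHERE id = $1 FOR UPDATE", [id]);
        const merged = _.assign(res.rows[0].body, patch);
        await client.query("UPDATE docs SET body = $1 WHERE id = $2",
                           [JSON.stringify(merged), id]);
        await client.query("COMMIT");
      } catch (e) {
        await client.query("ROLLBACK");
        throw e;
      }
    }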
If you have a lot of recursion in your data, it may be better suited to SQL... if you have a lot of data gathered around a group of objects/documents, a document DB like Mongo/Rethink/ElasticSearch may be best... if you really need key/value lookups, then Cassandra is hard to beat.
For that matter, having data duplicated/replicated to multiple types of DB servers is entirely reasonable. Your management UI can interact with an SQL datastore, and on save, you also save to Mongo.
That was the interim step I chose in migrating our data structures... the queries that run against Mongo work great, there are three servers in the replica set for a relatively small data set, and it is really nice.
Even if MongoDB does, at least you don't have to write the functionality yourself in client code.
Of course, if you have a case for storing unstructured data where you don't know the structure in advance, then it won't work for you. But for your own data, we'd better let the DBMS maintain the schema instead of maintaining it in application code (reinventing the wheel).
Side note, regarding updates of particular fields in objects: a NoSQL datastore must provide some "compare-and-set" functionality to avoid race conditions during updates. The PostgreSQL way is to use row-level locking in transactions, but MongoDB locks whole collections (well, several months ago it locked the whole database, so it's an improvement :). They kinda offer findAndModify() for "compare-and-set", but see my other comment below on why it doesn't work.
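For reference, compare-and-set can also be done optimistically in SQL without holding a lock at all - a sketch, assuming a hypothetical version counter column on the same made-up docs table:

    // Optimistic compare-and-set, no lock held: the UPDATE only takes
    // effect if the row still has the version we read; 0 rows = retry.
    import { Client } from "pg";

    async function casUpdate(client: Client, id: number,
                             expectedVersion: number, newBody: object) {
      const res = await client.query(
        "UPDATE docs SET body = $1, version = version + 1 " +
        "WHERE id = $2 AND version = $3",
        [JSON.stringify(newBody), id, expectedVersion]);
      return res.rowCount === 1;  // false: lost the race, re-read and retry
    }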
That said, we are looking around to see if there is an even better document-oriented DB available, and PostgreSQL looks interesting (with its JSON). Haven't had the chance to try it yet though. Another interesting option is OrientDB (having a graph database would be beneficial - but only for a small part of our system). Does anyone have experience with other document-oriented stores? (primarily single-node usage)
MongoDB is actually pretty good as a document store, where you can accept soft commits, and are dealing with non-relational data that doesn't need to be aggregated.
When you start needing more than just a loose pile of documents, or live in a world where you really need ACID, Mongo has this nasty habit of falling down on the floor and twitching.
Those problems are solvable, but it doesn't just happen "out-of-the-box".
For example, it doesn't support the "compare-and-set" functionality that is crucial for NoSQL in order to avoid transactions/locks. Their best suggestion is to use findAndModify() with the full object as the "find" part. And it works (though slowly) when you have a static schema. But over time you'll want to change your objects, and findAndModify() won't find them anymore. Grief. Also, the order of fields in a nested JSON object matters for findAndModify(), so have a happy time debugging why it doesn't find some objects anymore.
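To make that failure mode concrete, the pattern is roughly this (a sketch with the Node.js driver, where the command is exposed as findOneAndUpdate; the collection and field names are invented):

    import { MongoClient, ObjectId } from "mongodb";

    // Compare-and-set by using the full old subdocument as the filter:
    // the update applies only if nothing changed since we read it.
    async function casSetPlan(client: MongoClient, id: ObjectId) {
      const coll = client.db("app").collection("users");  // made-up names
      const current = await coll.findOne({ _id: id });
      const result = await coll.findOneAndUpdate(
        { _id: id, profile: current!.profile },  // exact match on old value
        { $set: { "profile.plan": "pro" } });
      // Caveats from this thread: embedded documents compare with exact
      // field order, and schema drift makes the filter silently stop
      // matching, so the CAS "fails" even though the object exists.
      return result;
    }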
The main Mongo solution for things is to run lots of copies of it across many machines, each with a small chunk of the data. In theory, what would be a pain point on one big system becomes lots of lesser pains on small systems.
(Note: every criticism is answered with how the next release will improve things. Sometimes that happens.)
Our biggest problems with it have been CPU saturation due to compression (solvable with sharding) and oplog size (due to supporting ACID; supposedly much better in the upcoming release), but both of those are surmountable. In exchange we get massively better disk usage characteristics, no global locks, ACID compliance, transactions, and generally better performance. It's not perfect, but it solved a lot of our problems.
RethinkDB really needs to get auto hot-failover and geo searches worked out; geo is on the table for the next release, iirc, and failover the one after.
Cassandra is great for key/value searches, but falls down for range queries.
ElasticSearch is pretty awesome in its own right, but not perfect either.
PostgreSQL has a lot to offer as well. 9.4 should be pretty sweet, and if they get automagic failover in the community versions worked out, I'm totally in.
It really just depends on what your workload is... MongoDB offers a lot of DB-like scenarios against indexes in a single collection, a clean set of interfaces, and a fairly responsive system overall. There have been growing pains, and problems... the same can be said of any database.
To each their own, it really just depends on your needs, and for that matter how far out your project's release is, vs. how long you need to support it.
Right now, I'm replacing an over-normalized SQL database structure that is pretty rancid. Most of the data fits so much better with a document db it isn't funny. When I had done the first parts, I had issues with geo searches in similar alternatives, and that has been a deal breaker for a lot of the options.
You don't use a document store if you need joins... you're better off either duplicating the data or using separate on-demand queries... odds are your data isn't shaped right and you should have used a relational database, or you aren't thinking about the problem right.
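In other words, instead of a join you do something like this (a sketch; all names invented):

    import { Db, ObjectId } from "mongodb";

    // No join: fetch the parent document, then the related one with a
    // second on-demand query (or just embed/duplicate the data instead).
    async function orderWithCustomer(db: Db, orderId: ObjectId) {
      const order = await db.collection("orders").findOne({ _id: orderId });
      const customer = await db.collection("customers")
        .findOne({ _id: order!.customerId });
      return { order, customer };
    }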
MongoDB replica sets are for availability not for consistency. Even with a write concern of majority, you can still have inconsistency. Without heavy load you might never see this race condition.
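For example (a sketch with the Node.js driver):

    import { Collection } from "mongodb";

    // Even a majority-acknowledged write...
    async function demo(coll: Collection, doc: { _id: number }) {
      await coll.insertOne(doc, { writeConcern: { w: "majority" } });
      // ...can be invisible to a read served by a lagging secondary.
      return coll.findOne({ _id: doc._id },
                          { readPreference: "secondary" });
    }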
Someone else likely will, in my experience. Or I will, when a new requirement comes in.
If you need certain reporting: does it have to be real time, is near-real-time okay, and what are the other needs? I find that sometimes duplicating data (with one point of authority) is better than using one or the other.
In my opinion, what people usually really want is an RDBMS + a full-text search engine like Elasticsearch. But again, one needs to set these things up.
MongoDB didn't have aggregation features in the past, and its map/reduce feature is not that good. But again, the product is still young; maybe it will get better.
Mongo, Cassandra, etc. are not good fits for this. Vertica was very expensive. In the end we went with a sharded and partitioned MySQL setup (partitioning really is great if you use it well). It's worked very well.
Glad it worked out for you guys.
btw, InfiniDB was originally going to be the backend to Redshift. Let's just say the previous executive team 'screwed that up'.
You can even join together data from Hive, Cassandra, MySQL, PostgreSQL, Kafka, etc., all in one query. We don't have a connector for Mongo yet, but contributions are welcome!
The Cassandra connector was actually an external contribution. We don't use it at Facebook.
SQL is not part of the Cassandra or Mongo feature sets. They have certain analytic capabilities, but not if you want to use SQL (window functions, for example) or most of the BI client tools associated with data warehousing.
There's also Shard-Query and (not free) Amazon Redshift.
Or you could just shard your regular community PostgreSQL or MySQL.
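At its simplest that is just application-side routing - a sketch, assuming N identical instances:

    import { Client } from "pg";

    // Minimal application-side sharding: hash the key to pick one of
    // N identical PostgreSQL (or MySQL) instances.
    function shardFor(userId: number, shards: Client[]): Client {
      return shards[userId % shards.length];
    }

    // e.g.: await shardFor(42, shards).query(
    //   "SELECT * FROM users WHERE id = $1", [42]);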
Curious if they are returning money to VCs or just really burned through that much in 7 months...
Of course, after a while this stops, since all this mismanagement corrupts the product too, and pretty fast you're left with a huge black hole for money that doesn't produce anything meaningful.
I don't know what the differences are that produce that, but when it comes to storing as much crap as I was looking at, I was willing to design around the limitations that Infobright CE has (i.e., no insert/update queries) rather than deal with the massive extra disk cost. I currently have 223m rows sitting in Infobright and it's taking about 38MB.
I really hope that the OSS project takes off and that InfiniDB gets better compression implemented, similar to Infobright's. The extra features that InfiniDB has over Infobright CE (not only insert/update but also a multi-threaded infile loader, for example) would convert me if only the compression were better.
I'm all new to this though so if there's some good reason why they differ so greatly I'm all ears and would love to know. Maybe I screwed something up in the configuration? I'm not sure.
Either way, it's sad to see them go. Columnstore databases fill a really useful niche, one that I can only see growing over time as more and more operational data is collected by industry.
    SELECT table_schema,
           SUM(data_length + index_length) / 1024 / 1024 AS 'Database Size in MB',
           TABLE_COMMENT
    FROM information_schema.tables
    GROUP BY table_schema;
You're right though - the cost per row sounds really low, I might try to find it via other means tomorrow and report back.
I can only assume that IB does some sort of differential compression on the data to get such a small file size on that data, and that it's an artifact of the machine data I'm using. But that's what I think the next big wave of data will be - stuff generated by data loggers on machines and equipment, analysed to scrape out incremental improvements in efficiency, reliability, etc. from previously relatively low-tech industrial corporations.
For almost everything else though, putting compression in the filesystem layer is better.
Integrating with MySQL at this level would probably only make sense if the team was already familiar with the internals; otherwise, PostgreSQL's code base would be a much cleaner choice.
I imagine the PostgreSQL players in the MPP space, such as CitusDB, have done very similar things to what we did with MySQL. And it is not that InfiniDB could not move away from MySQL, but that is a lateral move, and it would have to be funded for no advancement in benefit.
Labeling InfiniDB as MySQL+ is a gross underestimation of what it does. MySQL is used as the front end query parser, and that is about it. Everything else behind it was custom written, and that is where the power is.
As with all DB technologies, your use case is the primary thing that determines your mileage. Comparing InfiniDB to MongoDB is one of the first signs to me that you don't fully comprehend the differences between database architectures. For the use cases InfiniDB was made for, we were routinely faster on a smaller footprint. Using InfiniDB as a document store can be done, but that is not what it was made for.
What people call 'big datasets' is relative. Some think 500GB is a lot; some think 5TB is a lot. Coming from a telecommunications monitoring background, I will appreciate your dataset when we are talking TBs a day of churn per monitoring point, with hundreds or thousands of monitoring points. The size of the dataset you are working with and your use case for analysis are the two most important things in determining the technology stack. InfiniDB operated at these higher-end scales very efficiently. There is a reason Impala was a primary comparison, and we would usually operate on a fraction of the hardware they needed.
Best technology does not always win. See InfiniDB.
Decisions made by previous executive teams years in the past can set a course that sometimes cannot be corrected (not efficiently, or not without a lot of money).
Patents are worth their weight in gold.
Being open source is great for the community, but it is a challenge for a business to build consistent revenue. There were many big projects running InfiniDB on the open source version but not contributing to revenue. Even if they did sign up for support, you need custom feature development and other big-ticket items to make an impact. Or you have to build a large customer base paying for support, and that takes time. With the multiple iterations of adapting the technology to different architectures over the years, it was hard to retain those customers consistently. Also, many customers will pay for support for their rollout or initial deployment, but when the project is done, they feel they are adequate enough to live with open source only.
Just because a company raised $X in a month does not mean all that money is still available going forward from that month. On top of that, payroll is not cheap, and you would be surprised how quickly you can burn through money keeping the lights on. For those of you who think people at startups are working for pennies on the dollar, I would advise you it is not the case. And if you are one of those people, I wish you the best, and odds are there are other reasons why you are doing so. Why would good engineers work at a discount? Equity? There is not enough of the pie to go around to make that sustainable. Most startups pay competitive market salaries.
InfiniDB was at a junction where it was time to go for it or go home, and that is exactly what happened in 2014. The marketplace for data solutions - with Hadoop rising, other MPP vendors consolidating, and bigger players entering the field - was very competitive, and the time to swing for the fences was now, versus treading water and hoping.
Even with the stars aligned and everything else, all you have done in a company is weight the odds toward it succeeding; nothing is guaranteed.
I really enjoyed my time with InfiniDB and the team there. I really do feel it's a missed opportunity, with some decisions that could have been changed several years ago. Not securing patents, and probably choosing MySQL as a frontend, are some of those.
Side note: a core group of us from InfiniDB have landed at Tune, a company that has appreciated the technology of InfiniDB and what it can offer for their solutions. We look forward to this new opportunity and what we can provide to the ad and mobile analytics space.
I guess I agree, in that $1220/oz x 0 is 0.
I have my name on a couple patents. I've never seen a situation in which patents actually helped a viable business.
Also try defending yourself against billion-dollar companies in the space.
but glad you took the analogy literally
Most importantly: the "door-to-door" time from initiating a patent application to bringing it to bear in a legal dispute is something on the order of a decade.
Incidentally: I didn't take your analogy seriously; I just used it as a hook to disagree with you. I'm not an anti-patent zealot. I've just worked in startups for ~20 years and have come to the conclusion that they are a total waste of time for software companies.
btw, don't get me wrong - I'm not saying that if they had their patents they would have been successful. It's just one cog in the whole machine. And the timeline I am referring to: there are things from 5-6 years back that could have been filed before others were doing them, and that would have provided a nice differentiator in the market. It would have helped, but it was not the sole reason.