I have rewritten incredibly overarchitected stuff (Cassandra, Hadoop, Kafka, Node, Mongo, etc., with a plethora of 'the latest cool programming languages', running on big clusters at Amazon and Google) into simple, but not sexy, C# with MySQL or PostgreSQL. Despite people commenting on the inefficiency of ORMs and the unscalable nature of the solution I picked, it easily outperforms in every way for the real-world cases these systems were used for. Meaning: a far simpler architecture, code that is easier to read, and far better performance in both latency and throughput, even for workloads that will probably never happen. Also: one language, fewer engineers needed, less maintenance, and easily swappable databases. I understand that all that other tech is in fact 'learning new stuff' for RDD, but it was costing these companies a lot of money with very little benefit. If I needed something for very high traffic and huge data, I still do not know whether I would opt for Cassandra or Hadoop; even with a proper setup, sure they scale, but at what cost? I had far better results with kdb+, which requires very little setup and adds very minimal overhead if you do it correctly. Then again, we will never have to mine petabytes, so maybe the use case works there: I would love to hear from people who have tried different solutions objectively.
The tree swing cartoon has never been more true [1]
Large companies would benefit from devs not over-engineering simple apps, spending less time on their own tools and ticket systems, and spending more time instead on solving/automating more problems for the company.
I have had a very similar experience. In my current project, one of the users (a sysadmin, actually) asked if I was using Elasticsearch or similar, because he noticed that searching for a record and filtering was very fast.
My response: nope, MySQL! (plus an ORM, a very hated programming language and very few config optimisations on the server side).
This project's DB has a couple of tables with thousands of records, not billions, and, for now, a few users (40-50). A good schema and a few well-written queries can do the trick.
I guess some people are so used to seeing sluggish applications that as soon as they see something that goes faster than average, they assume it must use some cool latest big-data tech.
More than once I've run across situations of everyone sitting around going "we need to scale up the server or switch the database or rewrite some of this frontend stuff or something it's so slow and there's nothing we can do" and solved their intractable performance problems by just adding an index in MySQL that actually covers the columns they're querying on.
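A minimal sketch of that kind of fix via the mysql client (the table and column names here are made up, not from any of those systems):

# check whether the slow query is doing a full table scan
mysql mydb -e "EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'open'"
# add a composite index covering the columns in the WHERE clause
mysql mydb -e "CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)"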
Lots of people seem to want some silver bullet magical software/technology to solve their problems instead of learning how to use the tools they have. That's not software development.
kdb+ is very cool. Just a single binary, less than 1 MB and yet an extremely fast and powerful tool for analytics. There's a very nice talk on CMU's series of database videos if you want to know more. https://www.youtube.com/watch?v=AiGdfmxEP68
And it scores extremely well on the billion NYC taxi rides benchmark (no cluster, but a seriously big machine): http://tech.marksblogg.com/billion-nyc-taxi-kdb.html
Well, I'm not sure "seniority" is the right word - the more tech stuff you know, in general, the _less_ seniority you're going to achieve in terms of org charts, decent seating, respect and actual pull within an organization. You can achieve job security and higher pay that way, though.
I once converted a simulation into cython from plain old python.
Because it fit in the CPU cache the speedup was around 10000x on a single machine (numerical simulations, amirite?).
Because it was so much faster all the code required to split it up between a bunch of servers in a map reduce job could be deleted, since it only needed a couple cores on a single machine for a ms or three.
Because it wasn't a map-reduce job, I could take it out of the worker queue and just handle it on the fly during the web request.
Sometimes it's worth it to just step back and experiment a bit.
Yeah, back when I was in gamedev land and multi-core started coming on the scene it was "Multithread ALL THE THINGS". Shortly thereafter people realized how nasty cache invalidation is when two cores are contending over one line. So you can have the same issue show up even in a single-machine scenario.
Good understanding of data access patterns and the right algorithm go a long way in both spaces as well.
Even earlier, when SMP was hitting the server room but still far from the desktop, there was a similar phenomenon of breaking everything down to use ever finer-grain locks ... until the locking overhead (and errors) outweighed any benefit from parallelism. Over time, people learned to think about expected levels of parallelism, contention, etc. and "right size" their locks accordingly.
Computing history's not a circle, but it's damn sure a spiral.
> Computing history's not a circle, but it's damn sure a spiral.
I think that is my new favorite phrase about computing history. Everything old is new again. There's way too much stuff we can extract from past for current problems. It's kind of amazing.
I wish I could take credit for it. I know I got it from someone else, could have been lots of people, but Fun Fact: one famous proponent of the idea (though obviously not in a computing context) was Lenin.
Lenin's views on this come directly from Marx - it's the concept of "dialectical materialism" (though Marx did not use that term) - and Hegel. Specifically, Marx noted in Capital that (paraphrased) capitalist property is the first negation (the antithesis) of feudalism (the thesis), but that capitalist production necessarily leads to its own negation in the form of communism - the negation of the negation (the synthesis) - basically applying Hegelian dialectics to "the real world".
In that example the idea is that feudalism represents a form of "community property", which though owned by a feudal lord in the final instance is in practice shared. Capitalism then "negates" that by making the property purely private, before communism negates that again and reverts to shared property but with different specific characteristics.
The three key principles of Hegel's dialectics come from Heraclitus (the idea of inherent conflict within a system pulling it apart), Aristotle (the paradox of the heap; the idea that quantitative changes eventually lead to qualitative change), and Hegel himself (the "negation of the negation" that was popularised by Marx in Capital; the idea that, driven by inherent conflict, qualitative changes will first change a system substantially, before reversing much of the nature of the initial change, but with specific qualitative differences).
The idea is known to have been arrived at simultaneously by others in the 19th century too (e.g. at least one of Marx's correspondents came up with it independently), and it's quite possible variations of it significantly predate both Marx and Hegel as well.
I usually think of it more as a three-dimensional spiral like a spring or a spiral staircase. Technically that's a helix, but "history is a helix" just doesn't sound as good for some reason.
What kind of games? I always thought that e.g. network-synced simulation code in RTSes or MMOs would be extremely amenable to multithreading, since you could just treat it like a cellular automata: slicing up the board into tiles, assigning each tile to a NUMA node, and having the simulation-tick algorithm output what units or particles have traversed into neighbouring tiles during the tick, such that they'll be pushed as messages to that neighbouring tile's NUMA node before the next tick.
(Obviously this wouldn't work for FPSes, though, since hitscan weapons mess the independence-of-tiles logic all up.)
This was back in the X360/PS3 era and mostly FPS/Third-person(think unreal engine 3 and the like).
What's really ironic is the PS3 actually had the right architecture for this, the SPUs only had 256kb of directly addressable memory so you had to DMA everything in/out which forced you to think about memory locality, vectorization, cache contention, etc. However X360 hit first so everyone just wrote for its giant unified memory architecture and then got brutalized when they tried to port to PS3.
What's even funnier in hindsight is all that work you'd do to get the SPUs to hum along smoothly translated to better cache usage on the X360. Engines that went <prev gen> -> PS3 -> X360 ran somewhere between 2-5x faster than the same engine that went <prev gen> -> X360 -> PS3. We won't even talk about how bad the PC -> PS3 ports went :).
Hadoop has its time and place. I love using hive and watching the consumed CPU counter tick up. When I get our cluster to myself and it steps 1hr of CPU every second it's quite a sight to see.
Yeah, I got a bit of a shock when I first used our one and my fifteen minute query took two months of CPU time. Thought maybe I'd done something wrong until I was assured that was quite normal.
Recently I was sorting a 10-million-line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.
One awk command later my file was flipped (see the sketch after the morals). I ran the same exact sort command on the flipped file, but without specifying the field. It completed in 12 seconds.
Morals:
1. Small changes can have a 3+ orders of magnitude effect on performance
2. Use the Google, easier than understanding every tool on a deep enough level to figure this out yourself ;)
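A rough sketch of that flip, assuming a comma-separated file whose second field is the numeric sort key (file names made up):

# swap the numeric key into the first position
awk -F',' 'BEGIN{OFS=","} {tmp=$1; $1=$2; $2=tmp; print}' data.csv > flipped.csv
# a plain numeric sort over whole lines is now enough
sort -n flipped.csv > sorted.csv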
csv files are extremely easy to import in postgres, and 10 M rows (assuming not very large) isn't much to compute even in a 6 or 7 year old laptop. Keep it in mind if you've got something slightly more complicated to analyse.
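For reference, a minimal sketch with the psql client (table, columns, and file name are made up):

psql -c "CREATE TABLE rides (ts timestamptz, fare numeric, passengers int)"
psql -c "\copy rides FROM 'rides.csv' WITH (FORMAT csv, HEADER true)"
psql -c "SELECT passengers, count(*) FROM rides GROUP BY 1 ORDER BY 2 DESC"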
Oh nice, I hadn't seen this before, so a similar query would be a bit shorter!
wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv"
csvsql --tables newyorkdata --query 'SELECT * FROM newyorkdata ORDER BY "COUNT PARTICIPANTS"' rows.csv > new.csv
This is really interesting to me - I can't seem to reproduce this on a couple different platforms. In the past (and just now) I've definitely seen 100x differences with LC_ALL=C, but you used that on both, I'm sure.
How large was the 10 million line file, and do you know whether sort was paging files to disk?
> The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system's scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
Keep in mind it's for graph processing - Hadoop/HDFS still shines for data-intensive streaming workloads like indexing a few hundred terabytes of data, where you can exploit the parallel disk I/O of all the disks in the cluster: if you have 20 machines with 8 disks each, that's 20 * 8 * 100 MB/s = 16 GB/s of throughput; for 200 machines it's 160 GB/s.
However for iterative calculations like pagerank the overhead for distributing the problems is often not worth it.
For my performance book, I looked at some sample code for converting public transport data in CSV format to an embedded SQLite DB for use on mobile. A little bit of data optimization took the time from 22 minutes to under a second, or roughly 1000x, for well over 100 MB of source data.
The target data went from almost 200 MB of SQLite to 7 MB of binary that could just be mapped into memory. Oh, and lookup on the device also became 1000x faster.
There is a LOT of that sort of stuff out there, our “standard” approaches are often highly inappropriate for a wide variety of problems.
Normal developer behavior has gone from "optimize everything for machine usage" (cpu time, memory, etc.) to "optimize everything for developer convenience". The former is frequently inappropriate, but the latter is, as well.
(And some would say that it then went to "optimize everything for resume keywords," which is almost always inappropriate, but I don't want to be too cynical.)
Not so sure it's really obsolete, because connections are often spotty and local data can provide some initial information while waiting for the servers.
3 times over 4 years isn't unreasonable for interesting content. With the exception of the unlucky post this has generated good discussion each time and I think is safely in the interesting category.
The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.
I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.
It's frustrated me for the better part of a decade that the misconception persists that "big data" begins after 2U. It's as if we're all still living during the dot-com boom and the only way to scale is buying more "pizza boxes".
Single-server setups larger than 2U but (usually) smaller than 1 rack can give tremendous bang for the buck, no matter if your "bang" is peak throughput or total storage. (And, no, I don't mean spending inordinate amounts on brand-name "SAN" gear).
There's even another category of servers, arguably non-commodity, since one can pay a 2x price premium (but only for the server itself, not the storage), that can quadruple the CPU and RAM capacity, if not I/O throughput of the cheaper version.
I think the ignorance of what hardware capabilities are actually out there ended up driving well-intentioned (usually software) engineers to choose distributed systems solutions, with all their ensuing complexity.
Today, part of the driver is how few underlying hardware choices one has from "cloud" providers and how anemic the I/O performance is.
It's sad, really, since SSDs have so greatly reduced the penalty for data not fitting in RAM (while still being local). The penalty for being at the end of an ethernet, however, can be far greater than that of a spinning disk.
That's a good point, I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.
As a software engineer who builds their own desktops (and has for the last 10 years) but mostly works with AWS instances at $dayjob, are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment? Short of going full homelab, tripling my power bill, and heating my apartment up to 30C, I mean...
> I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.
That's probably better, since it'll scale a bit better with technological improvements. The problem is, it doesn't have quite the clever sound to it, especially with the numbers and dollars.
Now, the other main problem is that, though the cost of a workstation is fairly well-bounded, the cost of that medium-data server can actually vary quite widely, depending on what you need to do with that data (or, I suppose, how long you might want to retain data you don't happen to be doing anything to right at that moment).
I suppose that's part of my point: there's a misperception that, because a single server (including its attached storage) can be so expensive, to the tune of many tens of thousands of (US) dollars, that somehow makes it "big" and undesirable, despite its potentially close-to-linear price-to-performance curve compared to those small 1U/2U servers. Never mind doing any reasoned analysis of whether going farther up the single-server capacity/performance axis, where the price curve gets steeper, is worth it compared to the cost and complexity of a distributed solution.
> are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment?
Sadly, no great tutorials or blogs that I know of. However, I'd recommend taking a look at SuperMicro's complete-server products, primarily because, for most of them, you can find credible barebones pricing with a web search. I expect you already know how to account for other components (primarily of concern for the mobos that take only exotic CPUs).
As I alluded in another comment, you might also look into SAS expanders (conveniently also well integrated into some, but far from all, SuperMicro chassis backplanes) and RAID/HBA cards for the direct-attached (but still external) storage.
Well do notice I did say the penalty "can be" not "always is" far greater.
That's primarily because I'm aware of the variability that random access injects into spinning disk performance and that 10GE is now common enough that it takes more than just a single (sequentially accessed) spinning disk to saturate a server's NIC.
Plus, if you're talking about a (single) local spinning disk, I'd argue that's a trivial/degenerate case, especially if compared to a more expensive SSD. Does my assertion stand up better if it had "of comparable cost" tacked on? Otherwise, the choice doesn't make much sense, since a local SSD is the obvious choice.
My overall point is that, though one particular workload may make a certain technology/configuration appear superior to another [1], in the general case, or, perhaps most importantly, in the high-performance case, you have to keep an eye on the bottlenecks, especially the ones that carry a high incremental cost of increasing their capacity.
It may be that people think the network, even 10GE now, is too cheap to be one of those bottlenecks, arguably a form of fallacy [2] number 7, but that ignores the question of aggregate (e.g. inter-switch) traffic. 40G and 100G ports can get pricey, and, at 4x and 10x of a single server port, they're far from solving fallacy number 3 at the network layer.
The other tendency I see is for people not to realize just how expensive a "server" is, by which I mean the minimum cost, before any CPUs or memory or storage. It's about $1k. The fancy, modern, distributed system designed on 40 "inexpensive" servers is already spending $40k just on chassis, motherboards, and PSUs. If the system didn't really need all 80 CPU sockets and all those DIMM sockets, it was money down the drain. What's worse, since the servers had to be "cheap", they were cargo-cult sized at 2U with low-end backplanes, severely limiting existing I/O performance. Then, to expand I/O performance, more of the same servers [3] are added, not because CPU or memory is needed, but because disk slots are needed and another $4k is spent to add capacity for 2-4 disks.
[1] This has been done on purpose for "competitive" benchmarks since forever
[3] Consistency in hardware is generally something I like, for supportability, except it's essentially impossible anyway, given the speed of computer product changes/refreshes, which means I think it's also foolish not to re-evaluate when it's capacity-adding time after 6-9 months.
Actually my example is far simpler and less interesting.
Having a console devkit read un-ordered file data from a local disk ends up being slower than reading the same data from a developer's machine from an SSD over a plain gigabit network connection.
Simply has to do with the random access patterns and seek latency of the spinning disk versus the great random access capabilities of an SSD.
Note this is quite unoptimised reading of the data.
At a theoretical level, as a sysadmin, I learned the theoretical capabilities by reading datasheets for CPUs, motherboards (historically, also chipsets, bridge boards, and the like, but those are much less relevant now), and storage products (HBAs/RAID cards, SAS expander chips, HDDs, SSDs). Make sure you're always aware of the actual payload bandwidth (net of overhead), actual units (base 2 or base 10), and duplex considerations (e.g. SATA).
From a more practical level, I look at various vendors' actual products, since it doesn't matter (for example) if a CPU series can support 8 sockets if the only mobos out there are 2- and 4-socket.
I also look at whatever benchmarks were out there to determine if claimed performance numbers are credible. This is where sometimes even enthusiast-targeted benchmark sites can be helpful, since there's often a close-enough (if not identical) desktop version of a server CPU out there to extrapolate from. Even SAS/SATA RAID cards get some attention, not in a configuration worthy of even "medium" data, but enough for validating marketing specs.
A C/C++ program can chew through that amount of data. You just have to use a simple divide-and-conquer strategy. It also depends on how the data is stored and on the system architecture, but if you have any method that lets you break the data up into chunks, or even just ranges, then spin your program up for each chunk on each node or processing module that has quick access to the data it's processing, and have one or more threads merge the results, depending on the resulting data. It also depends on whether this is a one-off job or something you run continually.
Assuming locality of data is not a big issue, it can be extremely fast. However, depending on the system architecture, reading from the drives can be a bottleneck. If your system has enough drives for enough parallel reads, you will churn through the data pretty quickly. In my experience, most systems or clusters with a few petabytes have enough drives that one can read quite a lot of data in parallel.
The worst is when the data references non-local pieces of data, so your processing thread has to fetch data from a different node, or data not in main memory, to finish processing. This can be a pain, since it means either the task is just not really parallelizable, or the person who originally generated the data did not take into account that certain groups of data may be referenced together. Sadly, what happens then is that communication or read costs for each thread start to dominate the cost of the computation. If it's a common task and your data is fairly static, it makes sense to start duplicating data to speed things up. Restructuring the data can also be quite helpful and pay off in the long run.
Research programmers using MPI have been dividing up problems for several decades now, but what they often don't get about Hadoop until they've spent serious time with it is that the real accomplishment in it is HDFS. Map-reduce is straightforward entirely because of HDFS.
A lot of people are saying how they've worked on single-machine systems that performed far better than distributed alternatives. Yawn. So have I. So have thousands of others. It should almost be a prerequisite for working on those distributed systems, so that they can understand the real point of those systems. Sometimes it's about performance, and even then there's no "one size fits all" answer. Just as often it's about capacity. Seen any exabyte single machines on the market lately? Even more often than that, it's about redundancy and reliability. What happens when your single-machine wonder has a single hardware failure?
Sure, a lot of tyros are working on distributed systems because it's cool or because it enhances their resumes, but there are also a lot of professionals working on distributed systems because they're the only way to meet requirements. Cherry-picking examples to favor your own limited skill set doesn't seem like engineering to me.
But the article is kind of wrong. It depends on your data size and problem: you can even use command-line tools with Hadoop Map/Reduce via the Streaming API, and Hadoop is still useful if you have a few terabytes of data that you can tackle with map and reduce algorithms; in that case multiple machines do help quite a lot.
Anything that fits on your local SSD/HDD probably does not need Hadoop... however, you can run the same Unix commands from the article just fine on a 20 TB dataset with Hadoop.
Hadoop MapReduce/HDFS is a tool for a specific purpose, not magic fairy dust. Google built it for indexing and storing the web, and probably not to calculate some big Excel sheets...
a) resume enrichment by way of buzzword addition
b) huge budget grants and allocation, purportedly for lofty goals while management is really unaware of real technology needs/options
Much has been said about these already; just sharing them again!
The biggest DBs I've worked on have been a few tens of billions of rows, and several hundreds of gigabytes. That's like... nothing. A laughable start. You can make trivially inefficient mistakes and for the most part e.g. MySQL will still work fine. Absolutely nowhere near "big data". And odds are pretty good you could just toss existing more-than-4TB-data through grep and then into your DB and still ignore "big data" problems.
15 years ago I was working at a utility and was re-architecting our meter-data solution as the state government was about to mandate a change from manually read meters to remotely read interval meters. Had to rescale the database from processing less than 2 million reads a year, to over 50 million reads _a day_ (for our roughly 1M meters). Needed to keep 16 months online, and several years worth in near-line storage. We went from a bog-standard Oracle database to Oracle Enterprise with partitions by month. This was being fed by an event-driven microservices[0] architecture of individually deployable Pro*C/C++ unix executables. The messaging was Oracle A/Q.
At the time, I thought "wow - this is big!", and it was for me.
[0] we didn't call it microservices back then though that's clearly what it was. The term I used was "event-driven, loosely coupled, transactional programs processing units of work comprised of a day's worth of meter readings per meter". Doesn't roll off the tongue quite so easily.
This, a million times this!!! I see people doing consulting promoting the most over engineered solutions I have ever seen. Doing it for data that is a few hundred GB or maybe at best a TB.
It makes me want to cry, knowing we handled that with a single server and a relational database 10 years ago.
Let's also not forget that everyone today forgets that the majority of data actually has some sort of structure. There is no point in pretending that every piece of data is a BLOB or a JSON document.
I have given up on our industry ever becoming sane, I now fully expect each hype cycle to be pushed to the absolute maximum. Only to be replaced by the next buzzword cycle when the current starts failing to deliver on promises.
Yep. I left consulting after working on a project that was a Ferrari when the customer only needed a Honda. Our architect kept being like "it needs to be faster", and I'm like "our current rate would process the entire possible dataset (everyone on earth) in a night, do we really need a faster Apache Storm cluster?" :S
It's important to remember what drives this - employers often like to think their problems are 'big data' and by god, they need the over-engineered solution. Your peers who interview you will toss your resume in the trash if you are not buzzword compliant. Hate the game not the player.
> I have given up on our industry ever becoming sane, I now fully expect each hype cycle to be pushed to the absolute maximum. Only to be replaced by the next buzzword cycle when the current starts failing to deliver on promises
I sometimes wish I was unscrupulous enough to cash in on these trends, I’d be a millionaire now. Instead I’m just a sucker who tries to make solid engineering decisions on behalf of my employers and clients. It’s depressing to think what professionalism has cost me in cash terms. But you’ve got to be able to look at yourself in the mirror.
Oh yeah, the default dataframe is practically only for toy problems / very-already-filtered data (which accounts for a LOT of useful scenarios!). It's more that I've run into "big data" people who (╯°□°)╯︵ ┻━┻ when the default settings + dataframe choke, which is around 200MB of heap.
There's a big difference between a 50 TB data warehouse and a data warehouse which has a number of 10 TB tables. Our data warehouse used to have 20k tables and 18 TB of data. Our big data instance has 4k tables and 800 TB of data.
> Now I see people who think 50Gb is “big data” because Pandas on a laptop chokes on it.
IIRC the term "big data" was coined to refer to data volumes that are too large for existing applications to process without having to rethink how the data and/or the application is deployed and organized. Thus, although the laptop reference is on point, a data volume large enough to make Pandas choke in a desktop environment does fit the definition of big data.
Sorta, depends on your workload and if you are a solo hero dev or if you are an enterprise requiring shared data resources.
For 16TB you'll definitely get benefits if you are doing something embarrassingly parallel and processor bound over a cluster of a handful of machines. It's just parallelism, and it's a good thing.
I totally agree about the hundreds of Gigs - that is unless you are in a setting where many teams need to access that database and join it with others, in which case a proper data lake implemented on something beefy is a good idea. Hadoop has the benefit of distribution and replication, but another data warehouse might work better - say Oracle or Teradata if you are a small shop.
I'd say the boundary is not on a one-dimensional axis but somewhere in a size x time x access-pattern space:
100 GB of data may not be that much to you, but it's another story if you have to process all of it every minute. Also, everything is fine if you have nicely structured (and thus indexed...) data, but I do have a few hundred GB of images as well...
I'm pretty sure you can buy IBM's newest mainframe, the z14, with 32TB of memory. I don't think that mainframes are that popular in trendy "big data" though.
(the actual problem is our SAN is getting us a grand total of 10 iops. No, that's not wrong, 10. The "what the christ, just buy us a system that actually works" momentum has been building for 2 years now but hasn't managed to take over. Hardware comes out of a different budget than wasted time, yo.)
I don’t know why these ‘you don’t have big data’ type articles bother me so much, but they really do. I know it isn’t saying NO ONE has big data, but I feel defensive anyway. Some of us DO work for companies that need to process millions of log lines a second. I know the article is not for me, but it still feels like a dismissal of our real, actual, big data problem.
These articles are not targeted at you indeed but to the thousands of companies trying to setup a massive architecture to process 3GB of data per year.
All these big data solutions are still necessary of course, it's just not for everyone.
Agreed, they are tools designed to fill a niche. This doesn't make them bad or unnecessary, just specialized. A spanner-drive screw isn't worse than a hex drive; it's just designed for a different, more specific use case.
Really this just goes to show how impressive RDBMSs like Postgres are. There's nothing out there that's drastically better in the general case, so alternative database systems tend to just nibble around the edges.
My rule of thumb is always try to implement it using a relational model first, and only when that proves itself untenable, look into other more specialized tools.
Yeah but the whole cloud paradigm is predicated on big data, so large actors are pushing it where it makes no sense, and it makes everyone less effective. Not more.
Ever notice how with cloud it's actually in the interest of cloud providers to have the worst programmers, worst possible solutions for their clients ? Those will maximize spend on cloud, which is what these companies are going for.
Of course, they hold up the massive carrot of "you can get a job here if you ...".
For every programmer like you there's ten more at places where I've been employed trying to claim they need big data tools for a few million (or in some cases a few hundred thousand) rows in mysql. I get why you could feel attacked when this message is repeated so often, but apparently it still isn't repeated anywhere near enough.
I hope the NoSQL hype is over by now and people are back to choosing relational as the default choice. (The same people will probably be chasing blockchain solutions to everything by now...)
Why should relational be the default choice? There are many cases where people are storing non-relational data and a nosql database can be the right solution regardless of whether scaling is the constraint.
Most nosql SaaS platforms are significantly easier to consume than the average service oriented RDBMS platform. If all the DBMS is doing is handling simple CRUD transactions, there's a good chance that relational databases are overkill for the workload and could even be harmful to the delivery process.
The key is to take the time to truly understand each workload before just assuming that one's preferred data storage solution is the right way to go.
You can have minimally relational data such as URL : website, but that's still improved by going URL : ID, ID : website, because you can insert those IDs into the website data.
Now, plenty of DBs have terrible designs, but I have yet to hear of actually non-relational data.
That's fair and I'll concede that my terminology is incorrect. I suppose I'm really considering data for which the benefits of normalization are outweighed by the benefits that are offered by database management systems that do not fall under the standard relational model (some of those benefits being lack of schema definition/enforcement* and the availability of fully managed SaaS database management systems).
I'm also approaching this from the perspective of someone who spends more time in the ops world than development. I won't argue that NoSQL would ever "outperform" in the realms of data science and theory, but I question whether a business is going to see more positive impact from finding the perfect normal form from their data or having more flexibility in the ability to deliver new features in their application.
* I'm fully aware that this can be as much of a curse as a blessing depending on the data and the architecture of the application, which reinforces understanding the data and the workload as a significant requirement.
Because there are URLs in the website data, and/or you want to do something with them. Looking up integers is also much faster than looking up strings. And as I said, you can replace the URLs in the data with IDs, saving space.
But, there are plenty of other ways to slice and dice that data, for example a URL is really protocol, domain name, port, path, and parameters, etc. So, it's a question of how you want to use it.
PS: Using a flat table structure (ID, URL, Data) with indexes on URL and ID is really going to be 2 or 3 tables behind the scenes depending on type of indexes used.
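A quick sketch of that split, using the sqlite3 CLI (names purely for illustration):

sqlite3 pages.db <<'SQL'
CREATE TABLE urls  (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE pages (url_id INTEGER REFERENCES urls(id), body TEXT);
CREATE INDEX idx_pages_url_id ON pages(url_id);
SQL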
> The key is to take the time to truly understand each workload before just assuming that one's preferred data storage solution is the right way to go.
Although that is true in principle, in reality it results in the messes I see around me, where a small startup (but this often goes for larger corps too) has a plethora of tech running that it fundamentally does not need. If your team's expertise is Laravel with MySQL, then even if some project might be a slightly better fit for Node/Mongo (does that happen?), I would still go for what you know over the better fit, as the latter will likely bite you later on. Unfortunately people go for the more modern and (maybe) slightly better fit, and it does bite them later on.
For most CRUD stuff you can just take an ORM and it will handle everything as easily as NoSQL anyway. If your delivery and deployment process already includes an RDBMS, it will feel natural and will likely be easier than anything NoSQL, unless it's something that is only a library and not a server.
Also, when in doubt, you should pick an RDBMS imho, not, like a lot of people do, a NoSQL store. A modern RDBMS is far more likely to fit whatever you will be doing, even if it appears to fit NoSQL better at first. All modern DBs have document/JSON storage built in or added on (plugin or ORM); you probably do not have the workload that requires the kind of scale-out NoSQL promises. If you do, then maybe it is a good fit; however, if you are conflicted, it probably is not.
> There are many cases where people are storing non-relational data
No, there are not. In 99% of applications, the data is able to be modeled relationally and a standard RDBMS is the best option. Non-relational data is the rare exception, not the rule.
I was working in a company where they brought in some consultants to do the analytics and they were going to use Hadoop, and they said straight up "We don't need it for the amount of data, but we prefer working in Hadoop"
It's far more common for your assumptions/thinking/design to be the root cause of any real problem in things you build than your choice of tech. It doesn't matter if you use the 'wrong' tech to solve a problem if the solution works effectively, and very often using something you know well means you'll solve the problem faster and better than you would with the 'right' tool. Those consultants were probably doing the right thing.
This pipeline has a useless use of cat. Over time I've found cat to be kind of slow as compared to actually passing a filename to a command when I can. If you rewrite it to be:
grep -h "Result" *.pgn | ...
It would be much faster. I found this when I was fiddling with my current log processor to analyze stats on my blog.
Each grep invocation will consume a maximum number of arguments, and xargs will invoke the minimum number of greps to process everything, with no "args too long" errors.
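For example (a sketch; pipe the output into the rest of the pipeline as before):

# xargs packs as many .pgn paths per grep invocation as the OS allows
find . -name '*.pgn' -print0 | xargs -0 grep -h "Result"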
Okay, sounds reasonable, that's larger than most machines' memory...
> Tried spark on my laptop
Yikes, how did we get here? Not to shame you or anything, but that's like two orders of magnitude smaller than the minimum you might consider reaching for cluster solutions...and on a laptop...I'm legitimately curious to hear the sequence of events that led you to pick up Spark for this.
In my opinion this isn't something you should be leaving the command line for. I'm partial to awk; if you needed to get, say, columns 3, 7, 6, 4 and 1 for every line (in that order):
awk -F',' '{print $3"\t"$7"\t"$6"\t"$4"\t"$1}' file.csv
...where you're using tab as the delimiter for the output.
While I'm at it, this is my favorite awk command, because you can use it for fast deduplication without wasting cycles/memory on sorting:
awk '!x[$1]++' file.csv
I actually don't know of anything as concise or as fast as that last command, and it's short enough to memorize. It's definitely not intuitive, however...
cut is line oriented, like most Unix style filters. It needs to keep only one line in memory at most.
If you say:
pd.read_csv(f)[[x,y,z]]
It has to read and parse the entire 28GB into memory (because it is not lazily evaluated; cf Julia).
If you actually need to operate on three columns in memory and discard the rest, you should:
pd.read_csv(f, usecols=[x,y,z])
Then you get exactly what you need, and avoid swapping.
The lack of lazy evaluation does inhibit composition--just look at the myriad options in read_csv(), some of which are only there to enable eager evaluation to remain efficient.
While Pandas' CSV parser is quite slow anyway, the reason Pandas is particularly slow in this case is that it insists on applying type detection to every field of every row read. I have no clue how to disable it, but it's the default behaviour.
Parsing isn't actually a tough problem – https://github.com/dw/csvmonkey is a project of mine; it manages almost 2 GB/sec of throughput _per thread_ on a decade-old Xeon.
28GB is really too small for Hadoop to get out of bed for, in general. Though I would wonder why it was _that_ slow with Spark (or Hadoop, for that matter).
Ehh... sounds like you might've over-eng'd something with python, tbh. I've dumped 50gb+ worth of CSV into a sqlite DB with something like 20 lines of code, and it only took about 30 seconds. (until I added some indexes)
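For anyone curious, the sqlite3 CLI alone gets you most of the way; a rough sketch (file, table, and column names are made up):

sqlite3 data.db <<'SQL'
.mode csv
.import big.csv rows
-- index whichever column you filter on (name here is hypothetical)
CREATE INDEX idx_rows_key ON rows(key_column);
SQL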
I feel that we have spent 30 years replicating Bash across multiple computer systems.
The further my clients move to the cloud, the more shell scripts they write at the exclusion of other languages. and just like this, I have clients who have ripped out expensive enterprise data streaming tools and replaced them with bash.
The future of enterprise software is going to be a bloodbath.
Anecdata: we used fast CSV tools and custom Node.js scripts to wrangle the import of ~500 GB of geo polygon data into a large single-volume PostgreSQL+PostGIS host.
We generate SVG maps in pseudo-realtime from this data set: 2 MB maps render sub-second over the web, which feels 'responsive'.
I only mention this as many marketing people will call 50 million rows or 1 TB "Big Data" and therefore suggest big / expensive / complex solutions. Recent SSD hosts have pushed up the "Big Data" watermark, and offer superb performance for many data applications.
[ yes, I know you can't beat magnetic disks for storing large videos, but that's a less common use-case ]
I gave your service a quick whirl. Love the design language as a developer. What I don't like is that I have to provide my credit card information to get any detailed info (assuming there is such a thing behind the Stripe checkout).
The info available to non-signed-up people is simply not enough. It would be great if you had a demo account so one can get a feel for how this works.
E.g. I couldn't find info on why I have to add a job description. Wouldn't I post to a job board, with the posting linking to the ApplyByAPI test?
I could also not find any info on the tests candidates have to do. Yes, APIs, I get it. But an example wouldn't hurt.
We will certainly consider those points, and try to make things both more clear from a communication perspective and also consider adding some demo account (great idea!).
To answer the question about the job description, you're right that most customers so far have a post on a jobs board which they use to get traffic over to their ApplyByAPI posting. Once the candidate is there, they get a page with the job description and some API information they can use to generate their application.
We'll work on improving the language and information presentation to potential customers, and thank you again for taking the time to give your perspective!
You can get very good improvements over Spark too. I've been using GNU Parallel + redis + Cython workers to calculate distance pairs for a disambiguation problem. But then again, if it fits into a few X1 instances, it's not big data!
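The shape of that, minus the redis plumbing, is roughly the following (the worker script name and chunk layout are hypothetical):

# fan chunked input out to 16 worker processes; each worker handles one chunk
parallel -j 16 python worker.py --chunk {} ::: chunks/*.csv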
If you need to do something more complicated where SQL would be really handy, and, like here, you're not going to update your data, give MonetDB a try. Quite often when I'm about to break down and implore the Hadoop gods, I remember MonetDB, and most of the time it's sufficient to solve my problem.
Well, yes, in the edge case where you don't really have big data, of course it will.
Where MapReduce, Hadoop, et cetera shine is when a single dataset (one of many you need to process) is bigger than the biggest single disk available - and that changes with time.
Back when I did M/R for BT the data set sizes were smaller - still, having all of the UK's largest PR1ME superminis running your code was damn cool - even though I used a 110 baud dial-up print terminal to control it.
I think much of this issue can be attributed to the two most underrated things:
1. Cache line misses.
2. The so-called definition of Big Data (if the data can easily fit into memory, then it's not Big, period!).
Many times I have seen simple awk/grep commands outperform Hadoop jobs. I personally feel it's a lot better to spin up larger instances, compute your jobs, and shut them down than to bear the operational overhead of managing a Hadoop cluster.
It would benefit people to actually understand algorithmic complexity of what they are doing before they go on these voyages to parallelize everything. It also helps to know what helps to parallelize and what doesn't.
Partially because of the bloated architecture of Hadoop, Kafka, etc. And of course Java. Implementing modern and lighter alternatives to those in C++, Go or Rust would be a step forward.
Weird. I'd always thought of Go as a slower Java with a shallower pool of good libraries and weak dev tools. Is the compiler and GC that good? Maybe I'll have to give it another try.
Depends on what you're targeting. For raw computation, Go's similar and sometimes noticeably faster, even after hotspot has its turn. For garbage collection pause time, Go's rather amazing, typically measuring less than a millisecond even for gigabytes of heap mem[1]. For bin startup time (e.g. for a CLI tool) try Go instead just because it'll be done way before the JVM even hands control to your code.
For dev tools, oh hell yes. Stick to Java. Go comes with some nice ones out-of-the-box which is always appreciated, but the ecosystem of stuff for Java is among the very best, and even the standard "basic" stuff vastly out-does Go's builtins.
I’m not saying don’t look at other languages, but don’t do so given what was said. Java still runs better than Go. Hadoop, et al won’t necessarily run better as C++ or Rust. Distributed system performance is often dominated by setup and other negotiating between the components in the system. Java still has better, well tested in prod libs. Shiny is not better.
Anecdotally from my own use cases, I’ve found Go to be good for problems easily made parallel. This is especially true for things that commonly fall under “microservices.”
On the other hand, for more intensive work, I’ve found it doesn’t perform as well as JVM-based applications. A lot of this is either JRuby or data models in Java doing interesting things with data structures.
Finally, for sheer processing speed, we’ve started converting some of our components to Rust. This has a 3-4x speed advantage to Java and Go for similar implementations.
Again - all anecdotes and I’m not really in a position to deeply explain how it all works (proprietary, etc). Just my two cents.
No, it's because those things are for working with large amounts of data. So some overhead is acceptable, as it ends up being a tiny percentage of the overall time spent when you're dealing with actually large data.
People are suckers for complex, over-engineered things. They associate mastering complexity with intelligence. Simple things that just work are boring / not sexy.
I'll be in my corner asking "do we really need this?" / "have you tested this?"
I've heard this so often, especially from my boss, whoever that happens to be. I've come to the conclusion that this is just a lame excuse, because people feel out of control and overwhelmed when using this kind of software. It is heavily Java-based, and to those who have never touched javac, the stack traces must look both intimidating and ridiculous.
On the other hand, processing data over several steps with a homegrown solution needs a lot of programming discipline and reasoning, otherwise your software turns into an unmaintainable and unreliable mess. In fact this is the case where I work right now...
Just reading 500TB from 60 Hdds will take at least 12 hours, assuming a very optimistic constant 200MB/s throughput per HDD. If you need things to go faster you won't be able to do that without resorting to distributed computing.
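(For the arithmetic: 500 TB / (60 drives x 200 MB/s) = 500,000,000 MB / 12,000 MB/s, which is roughly 41,700 seconds, or about 11.6 hours of pure sequential reading before a single seek.)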
If I do one thing this year, I need to learn more about these tried and tested command-line unix shell utilities. It's becoming increasingly obvious that so many complicated and overwrought things are being unleashed on the world because (statistically) nobody knows this stuff anymore.
This applies to much of the software world. Many of the places I've worked have a bad habit of ditching perfectly fine tech in order to chase the latest thing. Part of it is driven by RDD (resume driven development) and positioning the company so they can hire the latest crop of fad chasers. You end up with a bunch of specialists that are siloed into their particular tech. You got the Docker guy. You got the Redis guy. You may even have one of those gray-haired DBAs that still talk about this "Oracle" thing all day long, and no one knows what the hell he's going on about. You got the Ruby guys, but now you got a bunch of React or Angular guys. What to do with the Ruby guys? Can't just fire them yet. You have a dozen of 'em, and it took you years to hire them. Fine, just banish them to the backend somewhere. Give them some fidget spinners and a few pointless projects to keep them busy.
Let's not forget when you're interviewing: the new places expect the latest technologies. Don't have time to learn Spark, Flink, and Kafka Streams and recite the Scala spec, whilst having intimate knowledge of all of the REST libraries? BAD DEVELOPER.
Hey now, Ruby ain't dead. I am one of these "Ruby" guys and, in the 48-or-so hours since my profile was activated at a new recruiting shop this week, I've gotten several potential candidate employers (some in seed stage) who want to get in touch.
The real issue is why do we silo ourselves so? It's fucking stupid. I do so much more than Ruby but conveying that to the new breed of tech people is just impossible. All they see is your biggest resume badge, and all they care about is how wide you will spread your legs for them.
One fun thing is that xargs supports parallelization.
-P max-procs
Run up to max-procs processes at a time; the default is 1.
If max-procs is 0, xargs will run as many processes as
possible at a time. Use the -n option with -P; otherwise
chances are that only one exec will be done.
-n max-args
Use at most max-args arguments per command line.
Fewer than max-args arguments will be used if the size
(see the -s option) is exceeded, unless the -x option is
given, in which case xargs will exit.
So here is a "map/reduce" job which takes the log files in a directory and processes them in parallel on eight CPUs, then combines the results.
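A minimal sketch of that shape, assuming Apache-style access logs where field 9 is the HTTP status code:

# map: pull the status code out of each log file, eight files at a time
# reduce: count how often each status appears across all files
ls logs/*.log | xargs -P 8 -n 1 awk '{print $9}' | sort | uniq -c | sort -rn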
xargs is one of my favorites, it's so easy to split something apart to speed it up by a few multiples.
gotta SSH into 100 machines and ask them all a simple question? xargs will trivially speed that up by at least 10x, if not better, just by parallelizing the SSH handshakes.
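A minimal sketch, assuming a hosts.txt with one hostname per line and key-based auth already set up:

# ask every host for its uptime, 20 SSH sessions at a time
xargs -P 20 -I{} ssh -o BatchMode=yes {} uptime < hosts.txt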
Not really. awk is pretty simple; basically I just assume it's plain C with a huge regexp pattern 'switch' case at the top level, and then I google again for the arguments to substr() :-)
Some of these have come up for me recently, particularly sed, awk, and xargs. xargs in particular is really gnarly and opens up a lot of possibilities from the command line when used in conjunction with other tools. Love it!
I've got some similar ones for work that take data csv like data and output sql. A little bit of vim-foo on the csv (yy10000p) and I've got all the test data I need.
^ the moment you do anything even slightly non-trivial with sed, abandon it and use perl instead. for even simple regexes with a couple non-"match this char" segments I regularly see almost 100x speedup, it's rather crazy. it's also a LOT more portable in my experience because the regex engine is so much more capable and consistent, even across fairly major perl versions.
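For example, the same substitution in both on a hypothetical access.log (the date-swapping pattern is made up); the perl version is the one that behaves identically everywhere:

sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/g' access.log
perl -pe 's/(\d{4})-(\d{2})-(\d{2})/$3\/$2\/$1/g' access.log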
Personally, I'm not too interested in any commands that take string arguments in their own homegrown programming language[0]. I'd rather use something like this[1] than jq, to get some extra ROI out of one of the languages I already know.
adding socat into the mix if you do network related things! it’s amazing what you can do when you replace bits and pieces of a pipe chain with network endpoints
Companies aren't making these complicated and overwrought things because nobody knows this stuff anymore. The ones actually doing the making generally know all the tried and tested ways to do what they want, because they used them to scale. Then they eventually outgrew the existing solutions and created these complicated systems to account for whatever specific technical or business issues necessitated hopping from existing solutions to building their own.
The problem tends to come along with the visibility that comes when one of these get released publicly. People use these things without doing the same level of due diligence to see if something less complicated fits their needs. Either deliberately (resume driven development) or simply because it doesn't occur to them to do so and "if it's good enough for this unicorn it's good enough for me". Or in some cases they do due diligence, but incorrectly evaluate their needs and rule out simpler solutions.
But the most common reason I've run into is simply that it's more fun to play with these projects than it is to use crufty old stuff, no matter how tried and tested and potentially appropriate it is.
At a previous company, I had a python script running on a cron job every 5 minutes to do some data processing needs. Once or twice a month, the batch would be so large that it took 6-7 minutes to complete and the cron job would trigger the script again before it finished, causing the second instance of the script to see the lock file, log an error, and exit. It didn't cause any problem for these periodically skipped runs, because the business need only required data to be processed within 24 hours of coming in. The 5 minute cron job was just to even out resource usage throughout the day instead of doing a larger nightly batch job. A piece of data not getting processed for 8 minutes instead of 5 did not make any material impact.
Another team had noticed the errors popping up in the log and were in the process of testing out a whole bunch of real time data pipelines like Kafka, leveraging my error messages to justify the need for a "real time" system without ever even asking me about the errors. After I found out other people were noticing those superfluous errors, I moved the cron job to a 30 minute window to stop them from happening anymore. Turns out they didn't have any other justification for their data pipeline greenfield project and weren't happy to go back to their normal day to day work. I offered to let them maintain and expand my python scripts if they were interested in data processing work, but for some reason they never took me up on that offer. :(
I think this is a dangerous, irritating mythology that programmers permit to their detriment. Skipping work entirely to go play pool or watch a movie is "fun". Evaluating a new technology stack to see if it fits business needs (present or future) might be _intellectually stimulating_, but deriding it as "fun" - and allowing management to write it off as time-wasting - hurts everybody. This isn't "fun", it's research, just the same as particle physics experiments are, and it's a big part of what we went to college to learn to do effectively.
> Companies aren't making these complicated and overwrought things because nobody knows this stuff anymore. The ones actually doing the making generally know all the tried and tested ways to do what they want, because they used them to scale
Right, Yahoo made Hadoop because (probably) they needed it. 99% of companies... just don’t.
You can write many of these yourself. Just reading records and sending them across pipes in ways you can compose is a good start. I find PyPy is excellent for this kind of work, fast to write; if the pipeline runs for more than 10 seconds, it has plenty of time for the JIT to warm up and amortize the cost. If you don't need fancy libs, LuaJIT is amazing and great for doing transforms in a pipeline.
I rewrote a Hadoop job to use Streaming (pipes, basic unix model) from a Java job to native. Just that switch was 10x faster. Mostly because of Hadoop overheads.
They try to make the cloud relevant for a lot of things that could easily be accomplished locally... Most of the time, they do that just to get your data.
Vendor lock-in is what they're after in the cloud case. Every time we talk to our cloud provider reps they ask us which of their SaaS offerings we're using. Their heads spin when we tell them we're just using VMs and ask us why.
So much internal tooling at a company can be avoided if you can just get people comfortable with basic *nix programs. No, we don’t need a web interface to show that data. No, we don’t need to code a dropdown to sort it differently.
CLIs have a discoverability problem. Knowing how to use any given one is a skill that takes years to hone. Learning a single command can take up to an hour. Most webapps are more constrained, but a common operation can usually be accomplished by someone who hasn't seen it before in less than 3 minutes.
The 'apropos' command is very handy on machine or OS you aren't familiar with.
If you want to find out what tools are available on the CLI that operate on jpegs, you'd try 'apropos jpeg' and get a list of things that mention jpeg in their man page.