I think my favorite example was a team that spent six months trying to build a system to take in files, parse them, and store them. Files came through at a little less than one per second, at about 100 KB each, which translated to about 2.5 GB a day of data. The data only needed to be stored for a year, and it could easily be compressed.
They felt the need to set up a cluster with 1 TB of RAM to handle processing the documents, 25 Kafka instances, etc. It was just insane.
I just pulled out a Python script, combined it with Postgres, and within an afternoon I had completed the project (albeit not production-ready). This is so typical within companies it makes me gag. They were easily spending $100k a month just on infrastructure; my solution cost ~$400 ($1200 with replication).
The sad part is that convincing management to use my solution was the hardest part. Basically, I had to explain how my system was more robust, faster, cheaper, etc. Even side-by-side comparisons didn't seem to convince them; they just felt the other solution was better somehow... Eventually I convinced them, after about a month of debates and an endless stream of proof.
You were saving the company money but hurting their resumes.
I recently wrote about this. The TLDR is that 20% time is a great investment and can ultimately save the company a ton of time and money. It gives the engineers some playtime in order to build their CVs and "get their wiggles out". Ultimately, if done right, it can protect your production systems from a lot of madness. https://hvops.com/articles/how-to-keep-passion-from-hugging-...
The message could be that the goal is about the destination not the journey.
Basically, they are looking for web developers but it seems like they have to filter out all the frontend ninja rockstars.
This sounds like typical anti-change pushback, which I have learned can actually be a good thing. However, this anecdote is severely lacking in insight; much like most people's support of, or opposition to, change. Further, like the widespread belief that sentences shouldn't start with conjunctions; much less conjunctive adverbs.
Not every web project needs react, or even JS.
I'm sure there are declarative, object-oriented (rather than text-oriented) templating engines out there that use an approach like React's. But I would consider using an imperative, text-oriented templating language a yellow, if not red, flag in 2017.
It is functional (well, as functional as React), and templates compile to plain ol' functions, so compatibility and static typing are the same as for the rest of your program.
Obviously, if I needed an SPA or something, it's not what I would use; but again, not everything should be an SPA.
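The template-as-function idea in the parent can be shown in miniature. A toy sketch, with Python standing in for the typed language being discussed, and `user_card` and its fields entirely made up:

```python
from html import escape

def user_card(name: str, followers: int) -> str:
    """A 'template' that is just a typed function returning markup.

    Because it's an ordinary function, the type checker validates its
    inputs the same way it does for the rest of the program, and escaping
    is explicit rather than left to a separate template runtime.
    """
    return (
        f"<div class='card'><h2>{escape(name)}</h2>"
        f"<p>{followers} followers</p></div>"
    )
```

Composing pages is then just function composition, with no separate template compiler in the toolchain.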
Twirl is fine, insofar as it's attached to Play (that's not to impugn you for picking it; my history with Play is colorful and frustrating). I wouldn't raise a flag for that. But not using something in this vein definitely is one, and React is probably the most accessible way to do it for the 90-95% case of developers.
The job, which should have been plain, simple HTML, was done in Angular,
and you could notice it: flickering and slowdowns for no apparent benefit.
Just so stupid.
And your semicolon is misplaced.
On the other hand, we also had difficulties hiring "good" engineers. People chose either a company with brand-name recognition or one that's working on "exciting" technology.
As engineers, we fail to appreciate that we are there to serve business first and foremost.
As leaders, we fail to put our companies first.
This is an industry wide problem. If past trends of other major industries are any indicator, the whole "meritocracy" of tech industry will disappear within the next decade.
The company has zero loyalty to you and will screw you over if it makes business sense to do so. There is absolutely no reason to put the company above your career interests.
As said in this topic - it's true that in some places, resume-driven development pays off for the developers. It's not the same everywhere and to me it looks more like a symptom of a dysfunctional structure than par for the course in business.
This means it is in your career interests to increase company revenue and/or reduce costs, because this will make you more attractive to many companies (the ones we would likely want to work for) when you move on.
That's just my personal impression, but I believe you have better chances coming from an unsuccessful company with all the right buzzwords than from a successful company with old tech. You will quickly be labelled a "dinosaur".
The real winner is to use cool stuff AND come from a successful/known company.
I have no experience in being labelled a 'dinosaur', but I'm sure there are jobs where being practical and generating actual results will matter. In ideal conditions, these are the jobs which are desirable to work at, so I don't like the idea of optimizing for hotness in itself (at least for my own career decisions).
The issue is that, compared to other industries, it's really hard to find people with that 20% in tech, so business people are forced to let political and image ignoramuses (sometimes ignorant to the point of failing to perform basic hygiene) into the club, and forced to try to stuff them in the attic or some other dark corner of the office where they won't ugly things up too much.
Many developers naively interpret this as a ruling meritocracy. The reality is that the business types resent having to do this, and a horrible technology worker with some image consciousness will easily destroy his better-qualified peers.
I'm familiar with a case of a person who can't even code getting himself elevated to a technical directorship with architectural and high-level direction responsibilities through such tactics. He appears to code by merging branches in the GitHub interface, by making vacuous technical-sounding comments in meetings with executives, by occasionally having someone come "pair" with him and committing the changes they make together at his workstation, etc., but if you locked him alone in a room with a blank text editor and said "Write something that does this", he wouldn't be able to do it. And the executives believe in him wholeheartedly and are currently working to install him as the company's sole technical authority. All significant decisions would have to get his approval, despite his being literally the worst technician in the company. All of his decisions up to this point have been politically motivated, aimed at coalescing power within his group and blocking outsiders who may want to contribute.
He was able to get there because he dresses well, he adopted the executive's interests and chitchats with them about these, he walks around the office telling jokes and smiling at people, and generally being what would be considered personable and diplomatic, whereas the other technical people go to their desks, hunker down, and spend their day trying to get some real work done.
Which strategy wins in the long run?
I read so many anecdotes on HN, and hear many more in person, of people with just the shittiest managers, of people who rarely see "competent" engineering organizations, of people who have "never" seen a competent project manager, that it really is a wonder we have any profitable companies at all.
In reality, if you don't understand the value someone is providing, you should make an effort to understand what they might be doing before making claims like the ones you're making.
On the other hand, I am pretty convinced that there is a sizeable number of people in companies who create a lot of busywork "managing" things. The project I am on has 3 developers (as far as I can tell) and probably more than 10 business analysts, project managers, architects and other managers putting their name on it. I have tried to understand what they are all doing but from what I can tell there are two managers who actually help the project and the other ones write reports to each other, call a lot of meetings but don't really contribute. They just regurgitate what the few active people are doing.
Once I was on a team with 2 QA analysts, 1 eng manager, myself as PM, 3 BAs (that I did not want), 3 developers, and one platform architect. All this plus 1 director overseeing our tiny team. Not to mention the 1-2 BAs I worked with whenever I worked on something that impacted another team.
During my 1:1 with said director, I once lashed out - I hadn't slept well in 4 days and I simply sounded off. I literally said everything that's been said in this thread: everything from "why the fuck do we have so many people; give me 5 engineers and fire everyone else" to "all you care about is the headcount that reports to you."
Luckily, I was a top performer, and while this tarnished my reputation with this director, I was able to smooth things over over the course of a few months.
This director explained to me that I was no longer at a startup. That this team should be resilient - that anyone should be able to take 2-3 weeks off at a time without interrupting the work. That they didn't want us working pedal to the metal 100% of the time. That it was ok that it was slow, and that I shouldn't be so self-conscious or hard on myself if I wasn't always working my fingers to the bone.
Now, I still thought we had way too much fat. Some of those BAs had no business being on a technical team, even as BAs, and we should have traded in the architect and dev manager for an extra QA and a developer.
But what that conversation did was bring me back down to earth. So much of what we view as right and wrong is personal preference. While I still disagreed with the amount of waste, it removed the chip on my shoulder and now I simply make sure to join teams that I like.
That's more of a ramble, but it gives you some context as to where I was coming from.
This, however, doesn't excuse hiring incompetent people based on appearance and likability with blatant disregard for their competence (I recognize that for many non-technical managers, it is difficult or impossible to discern the quality of one's skillset), nor does it excuse stuffing teams with dead weight just because the hiring manager personally likes the people involved. And those practices are indeed rampant.
People established in their career don't need buzzword bingo resumes. Stability is important because you can leave the job at the door. Other things are more important, such as paying the mortgage, taking kids to the park on the weekends and not working all hours with a fragile stack.
And they'd be stupid if they didn't.
Not everything should build their resume, but some of it has to. It's one of my arguments for buy over build.
To take it a step further, management has to own up to their original failure and try to explain to their bosses how they could spend so much time and money unnecessarily.
Another psychological problem here is the perception by people who do not understand the technology that the higher priced solution is better because it costs more.
- Monty Python, Meaning of Life
Plus people jump on fads, big data is trendy right now and when you read and hear about something a lot your mind tends to go to it first.
Suppose management in a firm uncritically contracted the big data revolution meme. Then they believe they are in the position of a city mayor trying to build a bridge and some wiseacre comes along and says they can do it in an afternoon with two pieces of string and a stick of gum. The problem is that the analogy doesn't hold, but they don't know that.
Dick move, but so is rejecting an objectively better solution to save one's own middle-management ass.
The grandparent is correct. This is 100% a political problem. We have a bad habit of discounting such problems in tech. We shouldn't do that anymore. Life is much better when you cooperate, instead of fighting an uphill battle.
Compact, reasonable solutions are the domain of startups. Bloating the engineering layer beyond any reasonable limit is an inherent cost of growing the company, and we shouldn't try to counteract that. We must operate within the framework we're given.
The political concerns extend beyond the company's own internals. They must appear enterprisey if they expect to be treated enterprisey. Today, enterprises have "data science" departments and blow $100k/mo on useless crap. If they're not doing that, it's a liability for the whole company, not just from the petty territorial perspective of one individual. It doesn't matter that it's possible to accomplish the same thing with one one-hundredth of the monthly expense. The enterprise signaling is worth the cost.
This is a great opportunity to sell $400/mo. enterprise solutions for $100k/mo.
Google, Facebook, Amazon, Apple are definitely not enterprisey, and I don't think it has hurt their profits. I think they tend to be pretty frugal, overall, when spending their own money on hardware and infrastructure.
And I bet if the "enterprisey" company ever tries to compete with a company like Google, Facebook, Amazon, or Apple, they will be destroyed in that market.
I think you're conflating enterprise with old and stuffy, and non-enterprise with bright colors and cutting edge technology. When I think of enterprise I think of software that needs to operate at scale with strict requirements on performance and uptime.
Apple loses a lot of talent to other companies, and has never really been known for having strong technology, so I understand that.
There are some "enterprisey" companies that do the same, but there are also a whole lot of companies that reach for big-data tools because they want to be like Google, ignoring that their problems are actually quite different from the problems Google faces.
Google publishes an academic paper on this and the general public misinterprets it as a recommendation. Soon you see people writing open-source implementations "based on the GoogleThing Paper", and a new tech fad is born. It will consume billions of dollars before it dies in favor of another fad "based on the FacebookThing/TheNextGoogleThing Paper".
Walk up to most business guys and they will jump at the chance to "become more like Google". Try to talk them down from this, and your challenge is to convince them that no, we don't want to be more like one of the most important and influential technology companies in the world, the company that's on the news every day, whose logo he sees every time he looks at his phone, and the company that keeps taking all of the best hires from the universities. Worse, you'll be making that argument because "we're just not as big [read: important] as them". Not a promising position for the reasonable engineer.
This has been a terrible blight on our profession these last several years, but we just have to learn to roll with it. It's only by understanding and accepting the psychology around this that we can formulate effective counterstrategies, or make the best of the situation that's before us.
That statement is just ridiculous.
Apple innovates a lot in the mobile and desktop spaces, and on the software side they have pushed a lot of projects forward, e.g. WebKit and LLVM. They also run some very large web services, e.g. iCloud and Messages, which are on par with some of the challenges Google and Facebook face.
But as a web service that underpins so much of iOS, it is still on a scale and complexity that rivals anything Google or Facebook has. Apple doesn't get enough credit for actually making this work on a daily basis.
They definitely deserve credit for making it work because even at their scale it's an amazing feat. But there's no comparison to Google or Facebook's scale.
Many times these issues are not in the core product teams. Companies tend to hire the best there and be frugal on spending, as it shows up as COGS in their finance reports.
Issues crop up when it comes to non-core product teams. For example, a business intelligence (BI) team is more prone to overspend time and money on huge clusters with "big data" because they perceive their users as needing data in real time.
Like most things that matter, the value of enterprise signaling is abstract, but it is undervalued at the company's peril. There are real consequences to getting it wrong, even if they're not directly measurable.
>And I bet if the "enterprisey" company ever tries to compete with a company like Google, Facebook, Amazon, or Apple, they will be destroyed in that market.
There are entire bodies of work on the question of when an enterprisey dynamic is better suited than a "disruptive" dynamic, to use Clayton Christensen's term, and vice-versa.
Such questions are not straightforward because in business, the winner is not the best technician. Good technology can give you an advantage if it's used right, but there is much more to business than just having the tech down. Most people cannot value the technology on its merits, and it therefore does not enter into their purchasing decision.
At the scale of most major, non-startup tech companies, however, $99k worth of work is minuscule: it is less than the cost of a single fully loaded engineer's salary and benefits package.
We can look at the manager on this basis and see his choice from two angles, depending on whether we assume he is acting in good or bad faith toward the company:
"The large team is effectively guaranteed to succeed.
The likelihood that the 400 dollar solution works is an unknown quantity, and since that single engineer made it in the first place, I'd be putting a lot of negotiation power in his hands to ask for some large portion of the savings back as pay, meaning it's less likely we succeed and extremely possible he goes rogue. I'll go with the team."
"The company doesn't care about the difference between those numbers, they're the same at our scale. If I can waste ten people's time and net a sexy resume boost out of it for that little cost to the company, I'm probably the best manager they have.
No, you're not going to get to sabotage my next job if you're not going to do any work helping me spin this as somehow being better for my resume than me running a department with 10 people under me.
Actually, I've got an idea about that! I'm sure I can find something either wrong with your solution (or you) that allows me to say I tried for the savings, and after that failed, I went for the department I wanted anyways. I love a good compromise, don't you?"
Spot on. I'm getting flashbacks just reading this!
It's so easy to create FUD around "the $400 solution" that it's laughable.
Upper management will be filled with so many questions:
* "If this is so cheap, why isn't everyone doing it this way? Surely all those important people wouldn't be wasting money, so there's gotta be something we're missing here."
* "What have we been paying 4 guys to do this whole time? Surely they would've figured this out earlier if it could've been done this way. I hope my boss doesn't hear that I've had a completely redundant department this whole time..."
* "If it sounds too good to be true, it probably is. This guy is probably just trying to supplant my trusted middle manager by making him look like a money-waster. I need to tell my secretary to filter my emails better..."
And middle management can easily say:
* "While Bob is able to get the same output right now, he is doing it in a non-scalable way that will have to be rewritten over and over again as we grow. Our way costs more upfront but it will allow us to expand to fulfill EXECS_WILDEST_DREAMS. You don't want to go down the day that you're featured in Fortune Magazine because Bob's data analysis script hammered the database, do you? We should use the solution you wisely previously approved, the solution on which you heard that compelling talk at CIOConf last year. It is much better than being penny wise and pound foolish!"
* "If I pull up Monster.com right now, there are 500 Hadoop candidates in our area. How many 'Bob's Data Processing Toolkit' candidates are there? We would be painting ourselves into a corner, and if Bob ever left us, we would be stranded."
* "I too was amazed by Bob's Data Processing Toolkit, and I enthusiastically tried it. Unfortunately, my best employee Sally pointed out that Bob's Toolkit causes disruptive fits of packet spasm in the switch hardware, threatening our whole network. I asked him to fix this but he says that he doesn't even think that problem is a real thing. Yes, he had the gall to impugn my best employee, Sally! He is clearly in denial about this and too close to see the impact objectively, so I put him on another task. [Under breath: he is also clearly a sexist pig, and we're lucky Sally didn't call HR.]
"It was a valiant effort and I do indeed applaud Bob for his attempts and concern for the company's well-being, and I assure you, Mr. Upper Manager, that we are continuing to analyze his Toolkit's mechanisms in depth and we will apply all savings and optimizations that we can. However, as you know, if it seems too good to be true, it probably is, and it is just not realistic that a Very Important Company like ours could handle all of our Very Important Data for less than half the cost of your car payment each month."
And I can assure you they absolutely fall under the definition of an enterprise. Sure, they develop a lot of technology in-house, but they still have significant amounts of classic enterprise technology, especially on the "business" side.
A better question is what ROI does generic non-tech enterprise company X get from standing up a huge data science team for simple data management problems.
Because that's what you are doing with your statement above.
The reason they are the Google, FB, etc. of the world is _because_ of that unique capability. Do you honestly think there is, say, a hospital anywhere on this planet that can hold a candle to what they do every day?
The poster above was simply stating what the normal world looks and acts like.
At some point, enough developers/managers will begin to take advantage of the system until executives wise up.
It's also possible that this is the CTO/CIO/management's first real position where they can throw around $100k projects out of a multi-million dollar budget and they are simply learning the ropes.
It's also possible that the company has so much cash devoted to this department that no one cares, because they are collecting large paychecks. In which case, you're likely acting against your best interest (long- and short-term) by not taking advantage of it.
They only wise up if they have data that tells them they need to.
Who is the source of that data? Developers and managers.
This is why it is very, very hard for big companies to stay agile: the management hierarchy insulates self-interested managers from consequences.
The key is that that value is not strictly technical. You can present a technical solution that is cheaper, but doesn't offer the non-technical value they derive from being a player on the "Big Data" scene. They can say "Yeah, we use our main guy's Perl script" or they can say "Yeah, we use Hadoop".
Is the value in that worth $99k per month? That's a subjective judgment for each company to make based on their specific circumstances.
And if you overcomplicate things, you can easily get to a state where there were two guys who each half-understood the Hadoop setup, but both left for different startups. Complexity alone does not make things simpler.
Of course, you can provide the non-technical value of providing ~training~ makework and resume lines for 10 code monkeys and their managers, but that is not really value to the company.
a) get its engineers to give a talk at a Hadoop conference, resulting in marketing [logo shown prominently around the conference], PR, and recruitment gainz;
b) get articles published about how the company uses cutting edge technology to do new things and all the other CIOs and big shots better listen up, resulting in prestige, PR, and recruitment gainz; (this happened to a client in real life)
c) reasonably field interrogatory questions from other fad followers, whether they are investors, journalists, peers, or whomever. When asked "How is YourCorp using data science and Big Data?", being able to say "We have a team working with that" is much better than having to say "Our guy Bob says that's just a fad, so we don't really 'do that'". This is basically a PR gain, but it means that investors and clients will feel the company is cutting edge, instead of backward philistines who listen to Bob all the time.
I could go on but it's pretty boring.
The point is that business is all about the customer's perception of the company as something to which they want to give money. If the business does not appear to be following the trends, they will be substantially harmed, because people do not want to get involved with an outmoded business. Being perceived as the last to adopt a new technology looks bad.
Welcome to the land of engineering, where we will fine-tune an algorithm to save milliseconds, yet waste man-years of engineering time to look like the big kids.
So many engineering choices are based on fashion rather than need.
Gotta get that hot new bag from Prada to show off to your engineering friends! :D
If companies were serious about providing realistic advancement opportunities (in salary, in tech, in responsibility) there wouldn't be as much of an impulse to work on "cool" (i.e., marketable) projects.
Many dev jobs can be accomplished by pretty tried and true technology. For a startup and even most mid sized companies, MySQL and a web framework will get you pretty much everything you need. But any kid a year out of school can do that. So those positions are not really well paid, certainly not enough for a family in the Bay Area.
To become a mid-level engineer and make more money, you've got to implement some kind of complicated distributed system. Incorrectly. And then you can "fix" it (or someone else's) to get the next promotion!
This is stylized and not 100% true to reality, but I think it gets at a core truth.
There are many cases where there's a real need to create a solution either without any established framework or where previous attempts have failed. That means you're either a founding engineer or some skunk-works employee where you are given great autonomy but also bear most of the responsibility for the failed project. And, AFAIK, these positions are usually not advertised, you need good connections and luck to arrive at such opportunities.
> Many dev jobs can be accomplished by pretty tried and true technology.
Then they should be.
> To become a mid-level engineer and make more money, you've got to implement some kind of complicated distributed system. Incorrectly. And then you can "fix" it (or someone else's) to get the next promotion!
Yea... that's dishonest, and frankly theft of the company's money.
They were 6 months in and had made no progress other than building 30 servers. I wrote an awk script to process the CSVs and did the rest in Excel in about 30 minutes. Then I had an intern automate the process with a Perl script, which took 3-4 days! :)
The program management was very upset, mostly because they looked like a pack of clowns.
How did they respond to that? Any kind of retaliation?
Pulling a move like that requires extensive, pre-positioned aircover.
I work with a startup which currently doesn't have "big data", but perhaps "medium data", and I can perhaps manage without a big data stack in production. But if I look at the company/sales targets, then in the next 6-12 months we will be working with clients that guarantee "big data".
Now, here are my choices -
1. Stick to Python scripts and/or large AWS instances because they work for now. If the sales team closes deals in the next few months after working tirelessly, then the sales team will have sold the client a great solution, but in reality we can't scale, and we fail.
2. Plan for your startup to succeed; plan according to the company targets. Try to strike a balance for your production stack that isn't huge overkill now but isn't underplanned either.
It's easy to say we shouldn't use a big data stack till we have big data, but it's too late (especially for a startup) to start building a big data stack after you already have big data.
Why? Because your problem is not technical; it is business-related. You have no idea why your startup will fail or why you will need to pivot, because if you did, it wouldn't be a startup. Or you would have had that client already.
You might need to throw away your solution because it is not solving the right problem. Actually, it is almost certain that it is solving a problem nobody is prepared to pay for. So stick to Python until people start throwing money at you - because you don't have a product-market fit yet. And your fancy Big Data solution will be worth nothing, because it will be so damn impossible to adapt it to new requirements.
I wish I could send this comment back in time to myself... :-/ But since I can't, how about at least you learn from my mistakes and not yours?
EDIT: good luck!
With tools like message queues and Docker making it so easy to scale horizontally you don't even have to go vertically.
We just won an industry award at work for a multi billion data point spatial analysis project that was all done with Python scripts + Docker on EC2 and PostgreSQL/PostGIS on RDS. A consultant was working in parallel with Hadoop etc and we kept up just fine. Use what works not what is "best".
Edit: A dumbed down version of the Python/Docker piece for anyone interested (https://medium.com/@mbaker/horizontally-scaling-gis-python-a...). It's really easy to scale horizontally with Docker...
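The queue-plus-workers pattern the parent describes can be sketched with the standard library alone. This is only an illustration of the shape of it: threads and a `queue.Queue` stand in for the Docker containers and the real message broker, and `process` is a placeholder for the actual per-task work:

```python
import queue
import threading

def process(task):
    """Placeholder for the per-task work a containerized worker would do."""
    return task * task

def run_workers(tasks, n_workers=4):
    """Fan tasks out over a shared queue to n_workers workers."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        # Each worker pulls from the shared queue until it's drained,
        # just as each container would pull from the broker.
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            out = process(task)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Scaling horizontally is then just pointing more workers (more containers, more hosts) at the same queue; nothing about the task code changes.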
That depends entirely on the workload. It's not always a good idea to move from one SQL instance to a cluster of them. Sometimes just buying the bigger machine gives you time to build a truly scalable solution.
My profile has a link to my personal website though and it's the current employer listed on the associated LinkedIn profile.
I'm not ashamed of anything I've said on HN but would rather not have people just searching for my employer ending up here (especially since I work in an office that routinely deals with sensitive political and community issues). It's a minor amount of (perceived) anonymity vs stating my name/job title/employer here!
But yes, scaling up is far easier than scaling out. A box with 72 cores and 1.5TB of DRAM can be had for around $50k these days. I think it would take a startup a while to outgrow that.
This holds for all languages, of course, not only Python. Forget raw speed, it is just the other end of the stick from Hadoop. Believe me, you don't need it. And even when you think you do, you don't. And when you have measured it and you still need it, ok, you can optimize that bottleneck. Everywhere else, choose proper architecture and write maintainable code and your app will leave others in the dust. Because it is never just about the speed anyway.
But you can write code in Common Lisp or Clojure that's just as readable and maintainable (once you learn the language, obviously) as anything you can write in Python, and the development experience is just as good if not better.
IMO the answer is almost always "good enough". This has been expressed in countless tropes/principles from many wise people, like KISS (Keep It Simple, Stupid), YAGNI (You Ain't Gonna Need It), "premature optimization is the root of all evil", etc.
If you go the YAGNI route, then when your lack of scale comes back to bite you (a happy problem to have), you'll have hard data about what exactly needs to be scaled, and you'll build a much better system. Otherwise, you'll dig deeper into the premature-optimization rabbit hole of hypotheticals, and in that case it's turtles all the way down (to use another trope).
Be mindful of people looking to introduce big data without justification. They are playing a game of some sort (maybe just personal resume value, or maybe a larger vie for power), and you are positioning yourself as their opponent when you try to stop the proposal they're pushing. Do not go into this naively.
Successful people have BMWs. If I buy a BMW that means I'm successful.
No, son, it doesn't.
For that particular task I used Spark in standalone mode on a single node with 40 cores, so I don't consider it Big Data. But I think it does illustrate that you don't have to have a massive dataset to benefit from some of these tools -- and you don't even need to have a cluster.
I think Spark is a bit unique in the "big data" toolset, though, in that it's far more flexible than most big data tools, far more performant, solves a fairly wide variety of problems (including streaming and ML), and the overhead of setting it up on a single node is very low and yet it can still be useful due to the amount of parallelism it offers. It's also a beast at working with Apache Parquet format.
Hadoop and Cassandra lend themselves to distributed nodes, but you can also use them without that. Or you can use solutions that work well with "big data" that aren't as opinionated about it, such as HDF5.
I guess the point is this: if I have 20TB of timeseries data on a single machine, and I have 20GB incoming each day, do I get to say I'm working with "big data" yet?
EDIT: My other complaint with this definition (perspective, really) is that it predisposes you to choose distributed solutions when you really do have "big data", which is not ideal for all workflows.
For reference - this is entirely timeseries financial data and PyTables. For basically everything else I use postgres.
But big data has to be distributed simply because there is no single computer big enough to hold all of it.
If you can buy a bigger machine, you can make "big data" bigger, and maybe evade this problem; if you must access it a lot of times, fitting on disk is useless and "big data" just got smaller; etc.
One area which might be a more interesting difference to talk about might be flexibility/stability. A lot of the classic big iron work involved doing the same thing on a large scale for long periods of time, whereas it seems like the modern big data crowd might be doing more ad hoc analysis, but I'm not sure that's really different enough to warrant a new term.
I also define "small data" as anything that can be analyzed using Excel.
It is usually dramatic enough to get buy-in :).
Once you start dealing with multiple computers, complexity goes way up, because you've added a very large point of failure: the network.
Small data: fits in memory
Medium data: fits on disk
Big data: fits on multiple disks
I've yet to come up with a rule of thumb for throughput though, and this can never replace the expertise of an experienced, domain-knowledgeable engineering team. As always, there are lots of things to balance, including cost, time to implement, and the now vs. the near future. Rules of thumb oversimplify, but they also give you a way to discuss different solutions without coming across as one-size-fits-all.
1. What's your tech stack?
This blatantly disqualifies ~90% of startups which are doing crazy things like using Hadoop for 10gb of data. OTOH, I get really impressed when someone describes effectively using "old" technologies for large amounts of data, or can pinpoint precisely why they use something with reasons other than data size. One good example: "We use Kafka because having a centralized difference log is the best way we've found for several different data stores to read from one source of truth, and we started doing this years ago. If we were starting today, we might use Kinesis on AWS, but the benefits are small compared to the amount of specific infrastructure we've built at this point."
Consider your programmer who goes to a 'big data' class and is taught how to use the stack. They are generally taught this on a 'toy' application because it would take too long to set up a real one. They are there to learn the stack, so the 'when would this be appropriate' slide is either ignored or given only lip service. Now they come back to work and they know the recipe for using this big data stack.
The boss gives them their task, they apply their recipe, and voila, they have a solution. All win right?
If it is any consolation the type of engineering situation you are experiencing does (in my experience at least) eventually correct with the manager being moved out.
>> That doesn't necessarily mean people don't use Kafka for the wrong reason.
Exactly what I meant.
Nevertheless, the case above is an extreme example of over-engineering and premature optimization, to the point where I would call it reckless.
See here for details: https://www.confluent.io/blog/turning-the-database-inside-ou...
The big-data/"cognitive computing" unit of the company would have suggested using Hadoop, Kafka, IBM Watson, AI, neural networks, and lots of billable man-hours to match my solution.
They even looked into acquiring a company with an event processing engine that did basically the same thing: pipe stuff through processes.
I recommend reading Event Processing in Action and then just building the same with your favourite pub/sub toolchain.
But then you come across the 1% case and Kafka is the only thing that you can throw at the problem that works.
Not sure of exact numbers, but I think we're doing 4-5m writes per second across one Kafka cluster, and around 1m writes per second against another.
Most people don't have big data, but some people do.
I'm not saying that's the case with your system, but my immediate thought when I see those numbers is: how much of the value from those 4m/s writes could you get with a system that did something like 100 w/s? Either through sampling/statistical methods, reduce "write amplification" or simply looking hard at what is written and how often :-)
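To illustrate the sampling idea (my own minimal sketch, not tied to any particular system): reservoir sampling keeps a uniform random sample of a write stream without ever storing the whole thing, which is one way to trade 4m/s of raw writes for a much smaller, statistically representative subset.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with decreasing probability k/(i+1)
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

Writing only the sample, plus exact counters for the handful of aggregates you actually care about, is how "millions of writes per second" sometimes collapses into hundreds.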
Many databases that I regularly touch would shrink an order of magnitude or more if someone went through, redid the layout, and scripted a process to perform the migration.
Some very interesting architectural ideas in there. In particular, the observation that if you turn off history pruning in Kafka, you have a consistent distributed log, which can then be piped off into whatever services need to consume it. That's appealing for cases where you want an audit trail, for example.
Do most systems require that sort of thing? Absolutely not. RabbitMQ is boring tech, in the good sense; it is well understood and does its job, so IMO it's a better default option where it fits.
It is kind of infectious though: once you have a system implemented with Kafka and Samza, the easiest way for it to communicate and be communicated with is more Kafka and Samza.
I am fully on board the Kafka hype train. Choo choo.
It raises interesting questions and I've had fun producing arguments both for and against the approach.
My (biased, as I work for them) opinion is that something like Pachyderm (http://pachyderm.io/) will ease some of these struggles. The philosophy of those who work on this open source project is that data people should be able to use the tooling and frameworks they like/need and be able to push their analyses to production pipelines without rewriting, lots of friction, or worrying about things like data sharding and parallelism.
For example, in Pachyderm you can create a nice, simple Python/R script that is single-threaded and runs nicely on your laptop. You can then put the exact same script into Pachyderm and run it in a distributed way across many workers on a cluster, keeping your code simple and approachable while still allowing people to push things into infrastructure and create value.
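As a toy illustration of the kind of script meant here (my own sketch, not a Pachyderm API example), a single-threaded aggregation runs identically on a laptop or inside a pipeline worker:

```python
from collections import Counter

def count_words(lines):
    """Plain single-threaded aggregation; no framework required."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts
```

The pitch is that the same code can then be fanned out over shards of the input without modification, since each worker just sees its own slice of lines.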
A full-on Hadoop stack is rarely warranted, but I can understand the reasoning behind wanting flexible enough processing capacity to accommodate any anticipated load regardless of frequency.
So there is still a good argument to be made for developing the Minimum Viable Product in whatever technologies are most productive for your developers, and figure out how to scale as you grow.
It's a difficult balancing act.
There are an awful lot of problems that can be solved with a simple Postgres instance running on RDS.
When the BizTalk implementation broke down and I first needed to look into how it worked and what it did, I found that it just moved a small XML file to an SFTP server and wrote a log entry. So I replaced the entire setup with 50 lines of C#. Luckily my boss was onboard, arguing that we didn't really have the qualifications to do BizTalk.
The idea had originally been that the customer, a harbour, would in the long run need to do a ton of data exchanges with government organisations and shipping companies. The thing is that they were planning for a future that never really happened.
I was almost too stunned to answer. Their solution was not to just use something more efficient, or even to rent a single, more powerful EC2 instance, but to go to the effort of setting up and running a cluster. The dataset wasn't even that big: a few GB on disk.
And Spark is usually much more efficient than R or Python (and gives you a nicer language to work with IMO, though that's very subjective). It's entirely possible a 1-node cluster would have outperformed your non-cluster approach, and while a 1-node cluster seems stupid it's useful if you know you're eventually going to have to scale, because it ensures that you test your scalability assumptions and don't embed any non-distributable steps into your logic.
I can't use a static HTML plugin called head, because port 9200 was already taken and site plugins were disabled.
Want to use the head plugin? Install a 300-megabyte Docker image and set up CORS.
There are two problems here - one is that people prototype their architecture using massively over-engineered systems and the second is that a rough prototype makes it way into production.
So, as a Hadoop perf engineer, I deal with both issues - "We have a Kafka stream hooked up to a Storm pipeline and it is always breaking and we can't debug ... what is this shit?" or "Postgres is stuck on BYTEA vacuum loops and you can fix it with Hadoop, right?".
There are significant advantages in prototyping with an easy to debug architecture, until all the business requirements stabilize.
Sometimes the right answer is to use Postgres better (specifically, table inheritance for deletion of old data + indexes instead of delete from with <=), sometimes the right answer is a big data system designed for cold data storage & deep scans.
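For the Postgres case, the table-inheritance trick looks roughly like this (a sketch with hypothetical table and column names):

```sql
-- Partition hot log data by month using inheritance, so expiring old data
-- becomes a DROP TABLE instead of a DELETE that leaves vacuum debt behind.
CREATE TABLE events (ts timestamptz NOT NULL, payload bytea);

CREATE TABLE events_2017_06 (
    CHECK (ts >= '2017-06-01' AND ts < '2017-07-01')
) INHERITS (events);

CREATE INDEX ON events_2017_06 (ts);

-- A month's worth of data disappears instantly, with no vacuum loop:
DROP TABLE events_2017_06;
```

Queries against `events` transparently scan the child tables, and the CHECK constraints let the planner skip partitions that can't match.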
This all comes back to the "We made a plan, now it is a death march" problem - the iterative failure process of building a prototype and having it actually fail is very important, but most people feel like they'll get fired if their work fails in the real world.
Over-engineering is usually career insulation and somewhat defense against an architecture council.
Never got an email back.
I just flat out told him I didn't want anything to do with that kind of project and asked to be shown the exit.
I have run a SQL database on a USB disk with the same data without problems but some people are just attached to the idea of "big data" so Hadoop it.
Also, compression doesn't go well with ad hoc analysis.
Sounds like maybe a Hadoop setup is not the worst idea to be ready for the future.
The industry standard is ~100-200 debugged lines of code a day. (If your team tracks hours, look this up on a big project; taking into account all the hours one spends, not just coding.)
So your claim, being generous, is this team spent 6 months to not produce the equivalent of 100 lines of code? Even if completely true, this comes off as a humblebrag.
At the beginning of my current project, I had a job that involved 35T of input, but the vast majority of records would be ignored, and for each successful one, only a few hundred bytes of output would be generated. Rather than Hadoop, I set up a simple system where a number of worker processes would query Postgres for the next available shard, mark it as in-progress, and then stream it from S3 and process it. When they finished, they'd write a CSV file back to S3. The reduce phase was just 'cat'.
The resulting system took a few hours to build (few days, including the actual algorithms run), and it was much more debuggable than Hadoop would be. You could inspect exactly where the job was, what shards had errored out, and which were currently running on machines, and download & view intermediate results before the whole computation finished. You could run the workers locally on a MBP if you needed to debug a shard, with no setup needed.
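The claim-a-shard pattern is simple enough to sketch; here with sqlite3 standing in for Postgres (in real Postgres you'd want `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent workers can't grab the same row). The `shards` table and its columns are hypothetical:

```python
import sqlite3

def claim_next_shard(conn):
    """Mark the next pending shard in-progress and return its id (None if done)."""
    with conn:  # wraps the select-then-update in one transaction
        row = conn.execute(
            "SELECT id FROM shards WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE shards SET status = 'in_progress' WHERE id = ?", (row[0],)
        )
        return row[0]
```

Each worker then loops: claim a shard, stream it from S3, process it, write the CSV back, mark the shard done.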
When I was at Google, we had a saying that "The only interesting part of MapReduce is the phase that's not in the name: the Shuffle". [That's the phase where the outputs of the Map are sorted, written to the filesystem and eventually network, and delivered to the appropriate Reduce shard.] If you don't need a shuffle phase - either because you have no reducer, your reduce input is small enough to fit on one machine, or your reduce input comes infrequently enough that a single microservice can keep up with all the map tasks - then you don't need a MapReduce-like framework.
To the extent that's true it's an indictment of Hadoop's implementation. Doing all those things in Hadoop ought to be trivial; maybe there are a few tools you'd have to make a one-time effort to learn, but reusing them ought to save you effort over making a custom system every time.
There's a particular bias I come across where people who genuinely have big data want to set it up in ways that are not necessarily performant because Hadoop is basically the most recognizable tool. If the data processing I'm doing generated a lot of output data I might consider a different flow, but there just isn't much of a reason: most of the data is inert for long periods of time, the output insights are fairly small, and the actual processing has to occur very quickly and with low latency.
Column stores are crazy fast, but there isn't much simple tooling built around things like Parquet or ORC files. It's all gigantic Java projects. Having some tools like grep, cut, sort, uniq, jq, etc. that worked against Parquet files would go a long way to bridge the gap.
Something like pyspark may be the answer; I think it may be possible to wrap it and build the tools that I want, like:
find logs/ | xargs -P 16 json2parquet --out parquet_logs/
parquet-sql-query parquet_logs/ 'select src,count(*) from conn group by src...'
Edit: another example... I have a few months of ssh honeypot logs in a compressed json log file. Reporting on top user/password combos by unique source address took tens of minutes with a jq pipeline. The same thing imported into clickhouse took a few seconds to run something like
select user,password,uniq(src) as sources from ssh group by user,password order by sources desc limit 100
The community around dask is quite active and there's solid documentation to help learn the library. I cannot recommend dask enough for medium-data projects for people who want to use Python.
They have a great rundown of dask vs. pyspark to help you understand why you'd use it.
Your example above would actually work perfectly. You can literally use grep in a distributed fashion in our system. One of our example pipelines uses grep and awk to do log filtering/aggregation or word count.
So you start looking at stuff like sharding or vertical scaling, and you keep on doing things more or less the way you have been, but with steadily degrading performance on every new insert.
clickhouse turns my annoying data back into something that I can query in 30 seconds.
I just ran a random query to find what day had the most connections:
select day,count() as c from conn group by day order by c desc limit 1
1 rows in set. Elapsed: 16.412 sec. Processed 1.43 billion rows, 2.87 GB (87.33 million rows/s., 174.65 MB/s.)
 Technically, 6 2-core guests on a single hypervisor; using guests to test deployment and scaling more easily.
1.4 billion of our connection logs (24 fields) takes up 89G on my clickhouse VM. 5 billion records would take ~320G.
Based on http://tech.marksblogg.com/benchmarks.html a 6-node ds2.8xlarge redshift cluster is about as fast as clickhouse on a single i5.
Edit: Oh, or you could use something like BigQuery which should be significantly easier than Hadoop (Never had the need myself though)
SQLite is a SQL database.
Of course, it depends on the precise nature of your workload.
> 1 rows in set. Elapsed: 16.412 sec. Processed 1.43 billion rows, 2.87 GB (87.33 million rows/s., 174.65 MB/s.)
Yes. The required infrastructure will look different though, and obviously Elasticsearch doesn't speak SQL (not without some crazy third-party plugin, anyway).
And can it do this on a vm with 4G of ram?
> I've been testing https://clickhouse.yandex/. I threw it on a single VM with 4G of ram and imported billions of flow records into it. queries rip through data at tens of millions of records a second.
1. When the devs' agenda is about learning new tech rather than solving business problems. The ways to solve this are to incentivize devs at the business problem level (hard) or find devs who care more about solving business problems instead of learning hot new tech (easier).
2. When the product management function is weak within an org. Product defines the requirements, and makes trade-offs around the solution. A strong PM will recognize when a bazooka is being used to kill a fly, and will push dev to make smarter trade-offs that result in a cheaper, faster, more maintainable solution. This is especially challenging when the dev team cares more about shiny tech than solving business problems.
Developers have also been telling other developers that switching jobs is the best way to maximize their paycheck.
It's debatable whether developers should study this new tech on their own time or not, but there are clearly developers who would prefer to do it on the company's time.
Naturally, some developers will learn new tech while solving a company's problem so they can get paid to learn new tech.
Combining all of these things, it's rather obvious why developers are building solutions to simple problems with new tech.
If you want developers to act differently, you have to change the narrative (or incentives) around new tech, jobs, and salaries.
You can't blame us for doing the things we've been told to do by our peers for at least the last 5 years.
Half- or nontechnical managers often follow tech hypes as well and may push for the project to use "Big Data technology" simply because it makes them feel more important to lead a project that is part of the hyped topic.
FWIW I think most devs could learn how to push back on poor decisions like this made by non-technical managers as it really shouldn't be the manager's responsibility to dictate technology choices for a solution. It should be their responsibility to push the dev team to make tradeoffs in order to achieve a business result.
Devs who are good at understanding the business objectives and pushing the non-technical team to make better decisions are both wonderful to work with and command higher salaries / fees.
Scout the technology ahead of business needs. If you want to look at new tech, you should be doing a proof-of-concept that the business is not depending on. If you are doing business work, then you can either use things you know, or use a completed proof-of-concept to move in a new direction. But you should not mix business needs with that initial proof-of-concept.
What does this mean? What is "the business problem level?"
Spark is orders of magnitude faster than Hadoop, too.
There's also Facebook's presto, which is so much faster than hive it will make your head spin!
Presto is nice, but you can't use it for an ETL job. It is great for analysis.
Plus you can mix and match R, Scala and SQL all together.
SQL can be a beautiful language that feels very natural once you have had a few years to build up fluency in it. It might make for an excellent shell language. But, having spent time prototyping systems in CouchDB (which were admired for their elegance, but rejected due to the relative obscurity of Couch, grrr!), I have to say that my previous bias for querying over transforming was ultimately holding me back, bogging me down in leaky abstractions. We should have started with MR, and then learned SQL only when presented with something that doesn't fit the MR paradigm, or even the graph processing paradigm, which IMO is also simpler than SQL.
As for the original subject, yes, Hadoop is a pig, ideally suited to enterprisey make-work projects. All the way through the book, I kept thinking, "there has got to be a simpler way to set this up."
Hive and other SQL-on-Hadoop systems tend to do better in that department.
Not sure the performance difference would be as significant now.
And as someone else mentioned Spark != MR and most people using Spark are writing code.
I do agree in principle that you're better off using simpler tools like Postgres and Python if you can. But if you're in the middle band of "inconveniently sized" data, the small overhead of running Spark in standalone mode on a workstation might be less than the extra work you do to get the needed parallelism with simpler tools.
And it's very simple to manage operationally if you know anything about JVM apps.
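For that middle band, the standard library often covers the parallelism you need before Spark is worth its overhead; a sketch with a hypothetical `parse_record` step standing in for the real per-record work:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_record(line):
    # Stand-in for real per-record work (parsing, filtering, enrichment).
    return len(line.split())

def parallel_map(records, workers=8):
    # Threads are fine for I/O-bound work; swap in ProcessPoolExecutor
    # (with the same .map interface) for CPU-bound parsing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_record, records))
```

When even this starts to strain a single machine, that's the point where Spark in standalone mode begins to pay for itself.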
A comparison between Spark & Hadoop doesn't make much sense though.
Spark is a data-processing engine.
Hadoop these days is a data storage & resource management solution (plus MapReduce v2). Spark often runs on top of Hadoop: Hosted by YARN accessing data from HDFS.
As you said: Marketing :)
Disclaimer: I built CSV Explorer.