If you add real-time sensor data sources to the mix, the rest of the architecture model kind of falls apart. Requirements upstream have cascading effects on architecture downstream. The deficiencies are both technical and economic.
First, you need a single ordinary server (like EC2) to be able to ingest, transform, and store about 10M events per second continuously, while keeping that data fully online for basic queries. You can't afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second; even at 10M events per second per server, you'll need a fantastic cluster architecture. Most open source platforms tap out at around 100k events per second per server for these kinds of mixed workloads, and no one can afford to run 20k+ servers because the software architecture is throughput-limited (never mind the cluster management aspects at that scale).
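To make the arithmetic concrete, here's the back-of-envelope on the numbers above (pure arithmetic; the 2x at the end is my stand-in for replication/failover overhead, not the GP's figure):

```python
# Back-of-envelope: servers needed to ingest a 1B events/sec raw source.
# The 2x multiplier is an assumed replication/failover overhead.
raw_rate = 1_000_000_000  # events/sec at the raw source

for label, per_server in [("typical open source platform", 100_000),
                          ("required efficiency", 10_000_000)]:
    servers = raw_rate // per_server
    print(f"{label}: {servers:,} servers ({servers * 2:,} with replication)")

# typical open source platform: 10,000 servers (20,000 with replication)
# required efficiency: 100 servers (200 with replication)
```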
Second, storage cost and data motion are the primary culprits that make these data models uneconomical. Open source tends to be profligate in these dimensions, and when you routinely operate on endless petabytes of data, it makes the entire enterprise problematic. To be fair, this is not to blame open source platforms per se; they were never designed for workloads where storage and latency costs were critical for viability. It can be done, but it was never a priority, and you would design the software very differently if it was.
I will make a prediction. When software that can address sensor data models becomes a platform instead of bespoke, it will eat the lunch of a lot of adjacent data platforms that aren't targeted at sensor data, for a simple reason: the extreme operational efficiency of data infrastructure required to handle sensor data models applies just as much to any other data model; there simply hasn't been an existential economic incentive to build it for those other data models. I've seen this happen several times; someone pays for bespoke sensor data infrastructure and realizes they can adapt it to run their large-scale web analytics (or whatever) many times faster and at a fraction of the infrastructure cost, even though it wasn't designed for it. And it works.
Disclaimer: I work on the open source VictoriaMetrics project.
VictoriaMetrics ingest rates are around 300k samples per second PER CORE. So in theory you should be fine with just a single n1-standard-32 or *.8xlarge node.
Though I would recommend the cluster version for reliability, of course, and to scale storage/ingestion/querying independently.
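If anyone wants to kick the tires, here's a minimal sketch of pushing samples into a single-node instance via the Influx line protocol endpoint it exposes (the host, metric, and tag names are mine for illustration, nothing official):

```python
import time
import requests

# Single-node VictoriaMetrics accepts Influx line protocol on its HTTP port.
# Endpoint host and metric/tag names here are illustrative.
VM_WRITE_URL = "http://localhost:8428/write"

def push_samples(samples):
    # samples: list of (measurement, tags_dict, value) tuples
    lines = []
    ts_ns = time.time_ns()  # Influx line protocol timestamps are ns by default
    for measurement, tags, value in samples:
        tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
        lines.append(f"{measurement},{tag_str} value={value} {ts_ns}")
    resp = requests.post(VM_WRITE_URL, data="\n".join(lines))
    resp.raise_for_status()

push_samples([("sensor_temp", {"machine": "press_7"}, 231.4),
              ("sensor_pressure", {"machine": "press_7"}, 101.3)])
```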
Here are the benchmarks with charts:
As to what these companies want to do with sensor data, it is often considerably more interesting than what people imagine. Many of the applications have an operational real-time or low-latency tempo. (I can't be too specific here.)
For my purposes, I put "medium-sized" on the order of $1B annual revenue. As to why a company would literally spend 10+% of its revenue on sensor data infrastructure, it is difficult to overstate the extent to which getting this right is viewed as near- to medium-term existential for these companies. The CFO has run the models and this is their best chance at survival.
Here is the interesting thing: to the extent they've been able to put this sensor data infrastructure in place, it has been successful at generating margin. If they could bend the infrastructure cost curve down a bit, most would spend even more on it. I've seen the financial models at several companies; there is a tremendous amount of money to be made in this transformation.
Is this about optimizing things to a precision that is impossible when humans have to decide whether there is too much or too little of something?
In the short-term, big tech companies can't replicate what is possible for these hardware companies. Longer term, I would expect these data sources to be commoditized as well.
Let's take plastic injection molding since it's such a good example of a really broken industry (there are a small number of excessively competent injection molders and a vast legion of incompetent ones).
You're shooting a part every couple of seconds (or faster), and that injection molding machine has lots of knobs to dial in. Temperature of incoming plastic pellets, water content of incoming plastic pellets, dye feed rate, plastic feed rate, mixing chamber temperature, feed screw motor load, initial injection pressure, plateau injection pressure, release injection pressure, actual pressure inside the mold, time spent cooling--I can go on and on and on.
Injection molding problems generally get solved one way: increase injection time. It's fairly straightforward to adjust, isn't likely to make things go wrong, and the people on the line don't get paid to experiment. They've got 100K parts to shoot in 72 hours, and an hour lost is a thousand or so parts they're going to get yelled at for. Better to dial the time up 10% and take 79 hours than to experiment for 7 hours while not shooting anything and wasting a bunch of plastic.
Of course, if this is your only hammer, you can see where this is going. Every single time something goes wrong, that mold gets another 10% added to its cycle time. And it never goes the other way without "A Pronouncement From God, Himself(tm)". Eventually, your entire business is running at 50% productivity because all the molds are shooting so slow, and you think you need to build another factory when what you need to do is fix your molding times.
Now, back to sensors--the problem is that nobody with incentive has a way to identify AT THE TIME IT CAN BE DEBUGGED that "something is going wrong". Someone on the line dialing up the injection time should cause an immediate dump of ALL the data on that machine (probably a week or more) up to an engineer who can go through it looking for anomalies. Even better would be for the machine to flag any "likely anomalies" to an engineer (an increase in incoming plastic pellet water content should get flagged, for example) so that they can be corrected before they affect the injection process and cause failures/wastage.
This is, of course, all predicated on logging an enormous amount of data and being able to run an analysis against it in almost real-time.
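To make the "flag likely anomalies" idea concrete, here's a toy rolling z-score check (the sensor, window, and threshold are all invented; a real system would do much better):

```python
from collections import deque
from statistics import mean, stdev

def make_anomaly_flagger(window=500, threshold=4.0):
    """Flag readings more than `threshold` standard deviations away from
    the recent rolling window. Toy sketch, not production code."""
    history = deque(maxlen=window)

    def check(reading):
        is_anomaly = False
        if len(history) >= 30:  # need enough samples for a stable baseline
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(reading - mu) > threshold * sigma:
                is_anomaly = True
        history.append(reading)
        return is_anomaly

    return check

check_water_content = make_anomaly_flagger()
for reading in [0.11, 0.12, 0.11, 0.13]:  # % water in incoming pellets
    if check_water_content(reading):
        print("flag to engineer: pellet water content anomaly", reading)
```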
Handling this data is non-trivial.
One of the things that struck me then is how much the line workers were treated like furniture, when many of them were quite sharp. They took such pride in getting things done well and at speed, in continually improving. I really wish I could put that kind of data in the hands of a couple of the people who trained me. Just an app on their phones. Spending 40+ hours/week on a machine means you really get to know it. I'd love to see how many of them would get great first-pass analysis and remediation.
Honestly, from working in plastics and sensor design for 15 years, it usually boils down to engineers not being willing to let go of information because they envisage potential future issues. It's easier to imagine problems in a meeting than to imagine and deliver solutions upfront before the problem ever happens. There's also a lack of care for the economics of doing such sampling.
That's not to say there is an easy fix. These same people are the ultimate end customer who have the final word on such engineering environments.
This has long been the main marketing message used to promote Complex Event Processing (CEP) systems. There is no shortage of enterprise and open source solutions in this space; what is missing is strong demand/adoption, which in itself undermines the next-big-thing claim.
One can argue that sensor data is included in the ETL category.
This isn't necessarily an issue: complex constraint matching is typically a fundamental part of the ingest path anyway, given the algorithms used; making it support more generalized CEP is a fairly straightforward extension of the same computer science mechanics that make polygon search scale efficiently.
In virtually all real systems, data is aged off after some number of months, either truncated or moved to cold storage. Most applications are about analyzing recent history. Everyone says they want to store the data online forever but then they calculate how much it will cost to keep exabytes of data online and financial reality sets in. Several tens of petabytes is a more typical data model given current platform capabilities. Expensive but manageable.
The more fundamental point that the GP is making is that the realm of industrial sensor data scales in ways that people haven't really grasped yet. It's much less about brute storage than it is about the interplay between bandwidth, storage, and concurrent processing power.
So, the problem is that you threw away 90% of your data, and that's where the problem was. Oops. Now you have to switch on "Save all the data" and hope it repeats. So, given that you have to have a "Save all the data" switch anyhow, you might as well turn it on from the start.
In addition, changepoint analysis is an entire field of research in and of itself.
Look at how many articles there are about analyzing "Did something break in my web service or am I really doing 10% more real traffic?"
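For a taste of that field, here's about the simplest classical tool in it: a one-sided CUSUM detector answering "did my level shift up?" (all parameters invented for illustration):

```python
def cusum_upshift(stream, target_mean, drift=0.5, threshold=30.0):
    """One-sided CUSUM: accumulate evidence that the mean has shifted above
    target_mean; report the first index where the statistic crosses the
    threshold. Toy sketch -- real deployments tune drift/threshold per metric."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean - drift))
        if s > threshold:
            return i
    return None

traffic = [100, 102, 99, 101, 100, 115, 118, 117, 120, 119]
print(cusum_upshift(traffic, target_mean=100))  # -> 6 (shift began at index 5)
```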
It’s typically a mix of everything, but predictive maintenance, anomaly detection and failure analysis are the most common. For example, there is one process that does trend analysis and tries to “predict” acceptable boundaries of a certain sensor’s measurements, and this is then compared in real-time with the actual sensor readings. If things fail for some reason, a technical engineer will dive into the data with dashboards (think: Grafana), zoom in, compare the readings with other sensors, etc.
The sheer volume of the data makes it fairly painful. Downsampling does happen, but only after a few weeks. This means that you still need enough storage capacity to deal with the full stream of data in real-time.
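A stripped-down version of that "predict acceptable boundaries, compare in real-time" process might look like the following (the linear-trend model and all numbers are stand-ins for whatever the real process does):

```python
import numpy as np

def predict_bounds(history, horizon=60, k=3.0):
    """Fit a linear trend to recent readings and project an acceptable band
    k residual-sigmas wide over the next `horizon` samples.
    Illustrative stand-in for the real trend model."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)
    residual_sigma = np.std(history - (slope * t + intercept))
    future_t = np.arange(len(history), len(history) + horizon)
    center = slope * future_t + intercept
    return center - k * residual_sigma, center + k * residual_sigma

history = np.array([20.1, 20.3, 20.2, 20.5, 20.4, 20.6])
lo, hi = predict_bounds(history, horizon=3)
reading = 24.9  # live sample arriving at the first future step
if not (lo[0] <= reading <= hi[0]):
    print("out of predicted bounds -> alert the engineer")
```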
Most of the analysis that is done usually falls under one of two categories. First, inferring (you can rarely measure it directly) when something has changed in the real world that is relevant to your business so that you can adapt to it immediately -- this applies to everything from autonomous driving to agricultural supply chains. Second, detecting anomalies -- the unknown unknowns -- so that risks can be managed when the real world appears to not conform to the models upon which you base decisions. A third category is support of industrial automation, which benefits immensely from high-resolution multimodal sensor data models, though this is largely a cost reduction measure. These categories are hand-wavy, but in practice boring industrial companies have concrete metrics they are trying to achieve or risks they are trying to manage in the most efficient way possible.
Having "cost" on the board as a factor we were actively trying to optimize for during design pulled us in a direction that is quite foreign compared to off-the shelf solutions.
That last paragraph rings true -- one of our big challenges specifically was ingesting and indexing data that needs to be queried across multiple dimensions, things like aircraft or drone position telemetry. But once we found a workable solution for that, it specialized down to simpler workloads very well.
I think this is due to companies like the ones mentioned not being too common (yet?). There are tools and systems that you can use, especially from high-frequency trading, which has somewhat similar challenges. KDB+ and co. would be my first stop to check if there is something I could use. The question is the financial structure and scaling of the problem, to determine whether these tools are in the game. There are other interesting projects in the space:
Of course, these are not exactly what you need; long-term storage and querying (as with KDB) remains largely unsolved.
The other tools that you might be referring to by "most of the open source platforms" are indeed not capable of doing this. I spent the last 10 years optimizing such platforms, but it is not even remotely close to what you need; you (or anybody who thinks these could be optimized) are wasting your time.
We do this. We have a load balancer with a fleet of nginx machines inserting into BigQuery. Inserts scale well, and the large queries work since it is columnar. The issue is price. It's terribly expensive.
As a systems engineer with a good track record and an interest in starting an endeavor, this is a very attractive statement to me.
Where can I read more about how the sensor networks are configured, the use-cases, etc? I'd like to read into this a bit more.
Every use case has unique data model requirements (minimal standardization, different sectors), but there are easily identifiable platform components that almost everyone needs which aren't available. Surprisingly "simple" architectural holes would be a scalable business if competently plugged, the perfect MVP. These enterprises have an aversion to developing software, it isn't their strength, and they know precisely how many millions per year a real platform would save them -- the value is concrete. However, they are also technically sophisticated as to why all existing platforms fail for them; you can't fake understanding the problem. I have the benefit of having worked on this market problem for several dozen organizations over the last 15 years, ranging from Big Tech to small EU industrials, so I see it more from their side.
Little is written about it. Everyone is essentially trying to use diverse multimodal sensor data sources to paint an accurate model of some part of the physical world in as close to real-time as possible. Easy to say, very challenging to do. Sometimes these data models are not about their business per se; their hardware puts them in an excellent position to build them so that they can sell them as a service to businesses that can actually use them. Often overlooked is that there are extremely difficult computer science problems with little public literature buried in the design of such systems, and expertise in this computer science is critical to being successful at it. Virtually all startups that try to enter this market completely botch the technical execution, assuming these platforms don't exist because of a failure of business execution, when it is actually a hardcore tech startup problem. The technical execution expertise is the real moat for this business; everyone underestimates how deep that rabbit hole goes.
FWIW, I’ve been laying the groundwork to build a startup in this space for a while now, I even purchased a very good .com domain. :) Bespoke implementations at several highly recognizable organizations are based on licensed code components I designed. There is a massive demand overhang and the market was ready yesterday. The broader ecosystem has room for several startups to coexist, there are many niches currently unfilled.
What piece of architecture do you think is left out?
I've used the following pattern in the past:
- generally only send batched data in as large an interval as possible
- if somebody looks at a device, immediately (well, it might take some seconds) query the batched data and switch the device to a "live" mode that provides live data instead of "wait and batch".
This will be a bad idea for scenarios where there's a reasonable expectation of surges of people needing "live" access, but for our use cases of industrial data, it works very well. We only watch our own devices, which are in the lower tens of thousands, but I don't see why this should not scale to more, under the restrictions mentioned above.
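A minimal sketch of that batch/live switch, with all names and intervals invented:

```python
import time

class DeviceReporter:
    """Buffer readings and flush in large batches by default; when someone
    is watching the device, stream each reading immediately instead."""

    def __init__(self, upload, batch_interval=300, live_timeout=60):
        self.upload = upload                  # callable: upload(list_of_readings)
        self.batch_interval = batch_interval  # seconds between batch flushes
        self.live_timeout = live_timeout      # drop back to batching after this
        self.buffer = []
        self.last_flush = time.monotonic()
        self.live_until = 0.0

    def watch(self):
        # Called when somebody opens the device view: flush what we have,
        # then switch to live mode for a while.
        self.flush()
        self.live_until = time.monotonic() + self.live_timeout

    def record(self, reading):
        now = time.monotonic()
        if now < self.live_until:
            self.upload([reading])            # live mode: send immediately
        else:
            self.buffer.append(reading)
            if now - self.last_flush >= self.batch_interval:
                self.flush()

    def flush(self):
        if self.buffer:
            self.upload(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()

reporter = DeviceReporter(upload=print)
reporter.record({"temp": 71.2})   # buffered
reporter.watch()                  # flushes, then goes live
reporter.record({"temp": 71.3})   # sent immediately
```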
Let’s step back for a second and just acknowledge that you’re in a very narrow slice of the market. The number of companies that are paying $100M/year to store sensor data is probably countable with 8 bits.
So it might seem like a large gap for you, but it's honestly not relevant for 99.99% of developers.
It isn't a solved problem but people are working on it.
I guess you could have a time-series database that used compression, but I don't know of databases that do.
What is most important here, in my opinion, is that the underlying data is useful. If your underlying data wasn't collected, wasn't collected properly, or (even worse) the wrong data was collected, then setting up data infrastructure will be a boondoggle that will cause your organization to become data-hostile.
Just as much effort, if not more, needs to go into collecting the right data in the right way to fill your data infrastructure. Most of the projects I've seen or heard of are just people taking the same old data that Ted in accounting, Jill in BI, etc. are already pretty proficient at using. So the gains you get by moving that into a modern infrastructure are marginal. How many more questions can you really ask of the same data that people have decades of experience with and an intuitive sense for?
So these trends aren't in a vacuum; they directly support the issues you discuss.
> Most of the projects I've seen or heard of are just people taking the same old data ...
I don’t disagree with you here. But in my experience it’s about getting Frank in marketing to use the same numbers as everyone else.
When you have 5 different ads platforms that all take revenue credit for a single conversion and have conflicting attribution models, and none of them add up to what accounting says is in the bank account. That’s a hairy problem.
There are different flavors of that class of problem at lots of companies.
I don't think this is any shift. The "store everything" has always existed in my experience, that's how the aggregates were built in the first place. The aggregates were for speed and convenience, and you drill-down as necessary, including to the individual record level.
Maybe the shift is people thinking that it's cheaper to just analyze the entire corpus on-demand because we can throw a Spark cluster at it?
And that's precisely why modern data processes are inferior to 20 years ago. People reinvent the wheel over and over and spend massive budgets on unnecessary tech stacks that would be alleviated if the time was simply taken to model the data.
A clean data model is about a whole lot more than simply storage space.
What’s changed in the last 10 years are the access patterns. There’s increased demand to have arbitrary query access over the raw data. The most impactful technology changes have been about pushing the access layer (queries, stream & batch processing, dashboards, BI tools, etc) down as close to the raw data as possible and making that performant. What’s fallen out of that are better MPP OLAP databases (snowflake), new columnar formats (parquet), SQL as the transform layer (dbt).
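As one concrete taste of "push the access layer down to the raw data" (my example, not the parent's stack): DuckDB will happily run SQL straight over parquet files with no load step. The file path and column names below are invented:

```python
import duckdb

# DuckDB scans parquet in place -- no ingest step, columnar all the way down.
# File glob and column names are illustrative.
con = duckdb.connect()
result = con.execute("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('events/2021-*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(result)
```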
That’s an engineering issue that needs new engineering management who don’t enable wasting company resources making incompatible APIs in the first place.
We already did the monolithic DB design, I used to name those hosts “ocean”. And we already know the math. “Data lake” is just more jargon by a salesman to obfuscate peddling the same old abstraction, and wow fresh grads with new words for hyping the same old habits.
Though he's not the author of this piece, Bezos is quoted as pointing out how circular social behavior is.
What do you think the odds of this author being on a similar page?
Have humans evolved much in 100 years? Or does the con simply get rewritten for the next generation to hide a simple truth?
What’s keeping people going in this circle isn’t logistical necessity. It’s us.
There are typically 2 types of data to collect: Transactional data and behavioural data.
Most transactional data, due to their important nature, are already generated and captured by the production applications. Since the logic is coded by application engineers, it's usually hard to get this data wrong. These data are then ETL-ed (or EL-ed) over to a DW, as described by the article.
For behavioural data, this is where your statement most applies. This is where tools like Snowplow, Posthog, Segment, etc. come in to set up the proper event data collection engine. This is also where it's important to "collect data properly", as these kinds of event data change structure fast and are hard to keep track of over time. I'd admit this space (data collection management) is still nascent, with only tools like iterative.ly on the market.
The fidelity of the data is of course important, but I would claim it's not a blocker. Yes, you need to trust the data you collect. That's table stakes - if you can't collect data correctly at all, even without worrying about the past, you're in for a world of hurt. It's P0. That said, a lot of people assume you also need to do this historically - and that's not the case - at least for ML.
Reinforcement learning has been making great strides in recent years. If you're in this situation - you have a flow where you want to use a model without having any past data to train with - use something like VW's contextual bandits. You don't need historical data to build your model, just real-time decision point & reward signals. Once deployed, the model converges over time to the optimal model using real-time feedback.
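For the curious, roughly what that loop looks like with the vowpalwabbit Python bindings (a sketch from memory against the 9.x API; exact signatures vary by version, and the context features and reward function are invented):

```python
import random
from vowpalwabbit import Workspace  # pip install vowpalwabbit (9.x-style API)

# Two actions, epsilon-greedy exploration. No historical data needed: the
# model learns online from (context, action, cost, probability) tuples.
vw = Workspace("--cb_explore 2 --epsilon 0.1 --quiet")

def decide_and_learn(context_features, reward_fn):
    pmf = vw.predict(f"| {context_features}")      # P(action) over the arms
    action = random.choices(range(len(pmf)), weights=pmf)[0]
    cost = -reward_fn(action)                      # VW minimizes cost
    # Label format: action:cost:probability (actions are 1-indexed)
    vw.learn(f"{action + 1}:{cost}:{pmf[action]} | {context_features}")
    return action

# Hypothetical reward: arm 1 pays off for "mobile" contexts
for _ in range(100):
    decide_and_learn("device=mobile hour=9",
                     reward_fn=lambda a: 1.0 if a == 1 else 0.0)
```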
All that said - baby steps are important. If you're in this situation, start by getting fidelity and then expand scope slowly without sacrificing fidelity. It's a lot easier to backfill than to "fix" data - get that right and it gets easier from there. You'll need fixups regardless - mistakes happen and requirements change - but you have to start with something you trust, at least in the moment it's deployed.
Let’s say I deleted every time series whose Y axis isn’t measuring US dollars in every tech company’s database everywhere. Maybe for all those time series you just store the most recent value. Describe to me what would be lost.
You’re onto something but you’re not going far enough! Most, if not all, historic metadata, analytics and behavioral data collection - when it is not measuring literal dollar amounts - is completely worthless.
We literally run long term A/B tests with thousands of variations of what you're describing. The purpose of these tests is to measure the effect of losing some data. The tests show (to nobody's surprise) that each piece of data is useful. These tests tell us exactly how useful each piece of data is.
Honestly when I read comments like this I have to wonder, do you really believe that thousands of companies spend trillions of dollars a year for something that doesn't work? Maybe talk to somebody who works on this stuff a bit?
Joking/not-joking. Have you ever been to the Bay Area?
Yes. Emphatically yes it is the case companies spend trillions of dollars unnecessarily.
We've seen this with people who didn't know how to build microservices and farcical "LMNOP"-type services that might as well be a joke. We've seen it with gigantically-valued unicorns that over-engineered tons of crap and hired too many people and still can't make a profit. We've seen it with CMOs and massively overpriced marketing technology because budgets and statuses are related. We'll see it with tons more iterations of this exact same affluenza.
The history of our industry is that the margins on software are so good that people can afford to do crazy nonsense.
I have vast amounts of firsthand evidence from randomized controlled trials that non-financial data can be used to create value. This is enough evidence for anyone in the industry.
Presumably the commenter doesn't have access to this evidence. Instead he has to rely on other heuristics, like the weaker argument that companies spend trillions on data and analytics.
Companies sometimes waste money, and maybe microservices are an example of this. But companies collectively spend 3-4 orders of magnitude more money on "all data that does not have USD units" than on microservices, so the commenter should take that as strong evidence that data can be used to create value.
> Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?
And for the huge majority of companies, it is not. Even many 1B+ dollar companies. Most value produced by businesses still could exist with boring ETL or BI concepts that have been around forever, because the hardware powering it is so fast. Many if not most of those businesses probably would be better off.
So yes, companies blow a lot of money on stuff with questionable ROI. I don't discount your experiences and that there are cases where the complexity might have a payoff. But honestly, we've seen periods of excess complexity and waste in software before: it's the norm.
They're not unrelated. They're all related to fat 60-80% profit margins on SaaS. And no coordination is necessary to spend money on silly, make-work activities if you have margins like that.
Software has extremely low, borderline zero, variable costs. A lot of these companies that spend money on examples like I've given probably could hire nobody at all and still have crazy growth because of the unit economics. (Not coincidentally, the companies most in want of "Big Data" solutions tend to be past this point.)
I can get 1M QPS on a silly Aurora setup with replicas. Best tool for all jobs: no. But don't tell me that dollar-for-dollar a data architecture with like 25 different components is dramatically superior to an OLTP db, OLAP offline store + batch jobs, and a streaming system.
I think it’s an intriguing thought exercise. For example, does one need the entire history of interactions with e.g., an Instagram post, or just aggregated measurements? I’m not like, against measuring. Just against warehousing of non financial timeseries.
* Timestamps of events related to content loading and rendering. This is crucial for debugging and improving load times.
* Backfilling aggregated data so that ML models can be trained without waiting weeks for new streaming aggregation.
* Answering product questions of almost any kind that weren't asked when logging was built.
Concrete example from my recent experience, you may want to know how often people like a post then later look at comments, vs look at comments then later like a post. This gives you information about cause and effect.
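That kind of question is exactly what raw event logs make cheap to answer; it's essentially a groupby. A sketch with an invented schema:

```python
import pandas as pd

# Raw interaction log: one row per event. Schema invented for illustration.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3],
    "post_id":    [9, 9, 9, 9, 9, 9],
    "event_type": ["like", "view_comments", "view_comments", "like",
                   "like", "view_comments"],
    "ts": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 10:05",
                          "2021-01-01 11:00", "2021-01-01 11:02",
                          "2021-01-01 12:00", "2021-01-01 12:01"]),
})

# For each (user, post), which came first: the like or the comment view?
first = (events.sort_values("ts")
               .groupby(["user_id", "post_id", "event_type"])["ts"]
               .first().unstack("event_type"))
print((first["like"] < first["view_comments"]).value_counts())
# True  -> liked first, then looked at comments; False -> the reverse
```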
The second can be done simply on something like Dynamo, CosmosDB, or your cloud-hosted NoSQL of choice. Heck, it can even be done on Aurora or vanilla Postgres + partitioning if it's <64TB.
The third can be done with any off the shelf cloud data warehouse software, at many petabyte scale. And even then, I'm sorry, but I just don't believe you that the product clicks over some large timeframe are historically relevant if your software and UI changes often.
All of these things mentioned have had extremely simple, boring solutions at petabyte scale for >10 years, and in some cases more than that. If you add a batch workflow manager and a streaming solution like Spark, that's like 3-4 technologies total to cover all these cases (and many more!)
Microservices <--- You are here
Yes. Every advertising platform ever uses this information. In Europe, there is regulation that makes account costing (what the US might call forensic accounting) possible. The presentations on A/B tests by FANG companies might also interest you. They are on YouTube.
Based on my experience working at an analytics company and running one myself, what this post misses is that an increasing number of people working with data today are not engineers. These people range from product managers trying to figure out what features the company should focus on building, to marketers figuring out how to drive more traffic to their website, to the CEO trying to understand how their business as a whole is doing.
For that reason, you'll still see many companies pay for full stack analytics tools (Mixpanel, Amplitude, Heap) in addition to building out their own data stack internally. It's becoming more and more important that the data is accessible to everyone at your company including the non-technical users. If you try to get everyone to use your own in-house built system, that's not going to happen.
The whole point of data infrastructure is that sometimes you're collecting data from the most random places. Much of that data is not necessarily user behavior.
Sometimes it's things like temperatures, latencies, CPU usage or instrument tallies. Sometimes it's a stream of minute-to-minute weather data or timings or anything, really. Besides, many companies have been collecting data for decades, but it all lives in silos where it can't be used for anything.
Mixpanel can’t capture all that data, or query it, or analyze it.
Mixpanel is just capturing a super small subset of web event data, and it happens to provide an analysis suite on top of the data they collect.
That's why Segment shows up in this list instead. They help move a lot of siloed data into a common system.
Mixpanel is just another source of data. You need something like Snowflake to put everything together and be able to do queries across multiple datasets.
All these tools need to be fed data about user behavior - from apps, server backends, other tools etc. It's a messy data connection problem, not just one way from SaaS to warehouse. Mobile App->SaaS; SaaS->SaaS; Warehouse->SaaS; SaaS->Warehouse and so on.
Creating the ID graph is a next-level problem altogether! How do you know a record in Salesforce is the same as the anonymous visitor on your website? It requires joining across at least 3 (possibly more) IDs - anonymousID, userID, email (if the user signs up), and the Salesforce record email.
Should the data pipe do this automatically? If not, what API abstraction should be exposed to the user?
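Not an answer to the API question, but the usual starting point for the graph itself is treating observed ID co-occurrences as edges and taking connected components. A union-find sketch with invented IDs:

```python
# Union-find over observed identifier pairs: any two IDs seen together
# (same session, same signup event, same CRM record) get merged.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Hypothetical observations from the three systems mentioned above:
union("anon:55af", "user:1234")          # anonymous visitor signs up
union("user:1234", "email:jo@corp.com")  # signup email
union("email:jo@corp.com", "sf:0061X")   # Salesforce record with same email

# All four identifiers now resolve to one person:
print(find("anon:55af") == find("sf:0061X"))  # True
```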
It goes into detail about how much relevance the practices of the past (OLAP, Kimball's modeling) have amid the changes brought in by the cloud era (MPP, cheap storage/compute, etc.). Chapter 4 will be most interesting for the HN audience: it walks through the different waves of data adoption ever since BI was invented in the 60s-70s.
Otherwise the spoils of war go to the unethical evil because they are now unchallenged.
Building powerful tools and then using them ethically doesn't reduce the amount of "unethical evil" done by others. Quite the contrary. And it doesn't deny them "spoils", as though there's a zero-sum prize, because there isn't one.
If you're really good at building tools, it will result in the creation of new, powerful tools which may be wielded for good or evil. If most other actors out there will wield those tools you're building for more evil than good, the mere act of building those tools will lead to more evil than good.
So I'm with cageface on this.
Deciding which tools to build does have consequences, and it's other people who primarily decide how those tools will be used, not the toolmaker. Sometimes you can already see what choices others look likely to make.
Some would argue this doesn't place an ethical burden on the toolmaker, because you can't and shouldn't control other people. That's a different argument though. Ethical or not, there are undeniably consequences from building tools when you can see how they are likely to be used.
To the downvoters - grow up and learn to present an argument.
I think most people would agree that there are lots of positive applications that you could use your data science skills for. But if you can't, then good for you for staying out of the field.
Unless you're at a massive, massive scale, though, Just Use Postgres, and write your ETL (ELT now?) queues normally. Keep It Simple, Stupid.
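In that spirit, even the job queue can stay in Postgres. A sketch of the classic FOR UPDATE SKIP LOCKED worker (table and column names invented):

```python
import psycopg2

# A plain table works as a work queue: concurrent workers grab different
# rows because SKIP LOCKED ignores rows another transaction holds.
conn = psycopg2.connect("dbname=etl")

def claim_next_job():
    with conn, conn.cursor() as cur:
        cur.execute("""
            DELETE FROM etl_queue
            WHERE id = (
                SELECT id FROM etl_queue
                ORDER BY enqueued_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            )
            RETURNING id, payload
        """)
        return cur.fetchone()  # None when the queue is empty

job = claim_next_job()
if job:
    print("processing", job)
```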
I think executing this in orgs with legacy data technologies is hard but it is much more a people problem than a tech problem. In orgs that have achieved this foundation it's really cool to see the business and analytic impact to the company.
^ This holds true for nearly every aspect of nearly every company.
Fivetran, dbt, Preset (Superset/Airflow), Sisu, Imply and Databricks.
Though, as someone who's been in this space a while, I think they did a decently fair job of articulating the 'modern' data infrastructure landscape.
And there are dozens of charting/visualization/business-intelligence vendors to do whatever you want beyond or on top of that SQL structure.
Check out Google Connected Sheets: https://cloudblog.withgoogle.com/products/g-suite/connected-...
Will MS modify Excel 365 fast enough to meet this need? Will another spreadsheet program disrupt Excel’s dominance? Will another player come in with the ability to “ingest” arbitrary Excel files? Another major issue is Excel’s massive failure when it comes to handling uncertainty in data. I’ll be curious to see how it all plays out.
We connect all our sensors to an edge AI Server that handles sensor data, and only uploads to the cloud what’s actually relevant.
It works quite well, and there are many OEMs that offer such systems, with accelerators for inference, sensor data compression, 5G, etc.
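A toy version of that "only upload what's relevant" logic, as a deadband-plus-threshold filter (all numbers invented; a real edge AI box would run a model instead):

```python
def make_edge_filter(deadband=0.5, hard_limit=100.0):
    """Suppress readings that barely changed since the last upload; always
    pass anything beyond a hard limit. Toy stand-in for model-based
    filtering on an edge server."""
    last_sent = None

    def should_upload(reading):
        nonlocal last_sent
        if abs(reading) > hard_limit or last_sent is None \
                or abs(reading - last_sent) > deadband:
            last_sent = reading
            return True
        return False

    return should_upload

f = make_edge_filter()
readings = [20.0, 20.1, 20.2, 25.0, 25.1, 120.0]
print([r for r in readings if f(r)])  # [20.0, 25.0, 120.0]
```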