Emerging Architectures for Modern Data Infrastructure (a16z.com)
431 points by soumyadeb on Oct 18, 2020 | 105 comments

This article has a large gap in the story: it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude. They have become ubiquitous in diverse, medium-sized industrial enterprises and it has turned them into some of the largest customers of cloud providers due to the data intensity. Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially. Almost no one provides tooling and platforms that address it. (This is not idle speculation, I’ve run just about every platform you can name through lab tests in anger. They are uniformly inadequate for these data models, everyone relies on bespoke platforms designed by specialists if they can afford the tariff.)

If you add real-time sensor data sources to the mix, the rest of the architecture model kind of falls apart. Requirements upstream have cascading effects on architecture downstream. The deficiencies are both technical and economic.

First, you need a single ordinary server (like EC2) to be able to ingest, transform, and store about 10M events per second continuously, while making that data fully online for basic queries. You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second; even at that rate, you’ll need a fantastic cluster architecture. Most of the open source platforms tap out at 100k events per second per server for these kinds of mixed workloads and no one can afford to run 20k+ servers because the software architecture is throughput limited (never mind the cluster management aspects at that scale).
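For what it's worth, the server math in the paragraph above can be sketched in a few lines. The rates are the ones quoted in the comment; `servers_needed` is just an illustrative helper, and the result is a floor that ignores replication, headroom, and query load.

```python
def servers_needed(source_events_per_sec: int, per_server_events_per_sec: int) -> int:
    """Minimum servers to absorb a source rate at a given per-server throughput."""
    # Integer ceiling division, so no floating point rounding is involved.
    return -(-source_events_per_sec // per_server_events_per_sec)

SOURCE_RATE = 1_000_000_000  # 1B events/s raw source, as quoted in the comment

fast = servers_needed(SOURCE_RATE, 10_000_000)  # 10M events/s per server
slow = servers_needed(SOURCE_RATE, 100_000)     # 100k events/s per server

print(fast, slow)  # 100 10000
```

Under these assumptions the gap is two orders of magnitude in machine count before any replication or spare capacity is added, which is roughly the point the comment is making.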

Second, storage cost and data motion are the primary culprits that make these data models uneconomical. Open source tends to be profligate in these dimensions, and when you routinely operate on endless petabytes of data, it makes the entire enterprise problematic. To be fair, this is not to blame open source platforms per se, they were never designed for workloads where storage and latency costs were critical for viability. It can be done, but it was never a priority and you would design the software very differently if it was.

I will make a prediction. When software that can address sensor data models becomes a platform instead of bespoke, it will eat the lunch of a lot of adjacent data platforms that aren’t targeted at sensor data for a simple reason: the extreme operational efficiency of data infrastructure required to handle sensor data models applies just as much to any other data model, there simply hasn’t been an existential economic incentive to build it for those other data models. I've seen this happen several times; someone pays for bespoke sensor data infrastructure and realizes they can adapt it to run their large-scale web analytics (or whatever) many times faster and at a fraction of the infrastructure cost, even though it wasn't designed for it. And it works.

> 10M events per second

Disclaimer: I work at VictoriaMetrics open source.

VictoriaMetrics ingest rates are around 300k per second PER CORE. So theoretically you should be fine with just a single n1-standard-32 or *.8xlarge node. Though I would recommend the cluster version for reliability, of course, and to scale storage/ingestion/querying independently.

Here's the benchmarks with charts: https://medium.com/@valyala/measuring-vertical-scalability-f...

I don't doubt you, but it's surprising that there's really that much value / inefficiency lying around that many "medium sized" industrial enterprises can justify spending tens or hundreds of millions of dollars a year just to collect an insane amount of sensor data (and presumably take some action based on that). How big is medium sized and what kind of industries?

There is a long and fascinating discussion to be had about how the economics of many industrial sectors are evolving. The short version is hardware differentiation is no longer economically viable, margin is going to zero, and pivoting to sensor analytics and exploitation is broadly seen as the primary means of generating margin going forward. I've had this same conversation across several industrial sectors. Anything tangentially related to transportation (automotive, logistics, telematics, aviation, and all related supply chains) is a good example.

As to what these companies want to do with sensor data, it is often considerably more interesting than what people imagine. Many of the applications have an operational real-time or low-latency tempo. (I can't be too specific here.)

For my purposes, I put "medium-sized" on the order of $1B annual revenue. As to why a company would literally spend 10+% of its revenue on sensor data infrastructure, it is difficult to overstate the extent to which getting this right is viewed as near- to medium-term existential for these companies. The CFO has run the models and this is their best chance at survival.

Here is the interesting thing: to the extent they've been able to put this sensor data infrastructure in place, it has been successful at generating margin. If they could bend the infrastructure cost curve down a bit, most would spend even more on it. I've seen the financial models at several companies, there is a tremendous amount of money to be made in this transformation.

I just fail to see how sensor data gives you an edge or (ultimately) a higher margin.

Is this about optimizing things to a precision that is impossible when humans try to decide whether something is too much or too little of something?

They are leveraging their privileged hardware positions to enter adjacent high-value data markets that have little to do with their core business, from which they can generate considerable margin. Essentially, the hardware business becomes a loss leader for a sensor platform and data business. Many hardware companies are in a position to capture data models that would be difficult to acquire any other way, if they can stop thinking of themselves as hardware companies.

In the short-term, big tech companies can't replicate what is possible for these hardware companies. Longer term, I would expect these data sources to be commoditized as well.

Your comment has no examples of how these margins will be generated.

Medium-sized can be in the $10 million per year range.

Let's take plastic injection molding since it's such a good example of a really broken industry (there are a small number of excessively competent injection molders and a vast legion of incompetent ones).

You're shooting a part every couple of seconds (or faster), and that injection molding machine has lots of knobs to dial in. Temperature of incoming plastic pellets, water content of incoming plastic pellets, dye feed rate, plastic feed rate, mixing chamber temperature, feed screw motor load, initial injection pressure, plateau injection pressure, release injection pressure, actual pressure inside the mold, time spent cooling--I can go on and on and on.

Most injection molding problems generally get solved one way: increase injection time. It's fairly straightforward to adjust, isn't likely to make things go wrong, and the people on the line don't get paid to experiment. They've got 100K parts to shoot in 72 hours, and an hour lost is a thousand or so parts they're going to get yelled at for. Better to dial the time up 10% and take 79 hours rather than experiment for 7 hours and not shoot, or waste a bunch of plastic.

Of course, if this is your only hammer, you can see where this is going. Every single time something goes wrong, that mold gets another 10% added to its cycle time. And it never goes the other way without "A Pronouncement From God, Himself(tm)". Eventually, your entire business is running at 50% productivity because all the molds are shooting so slow and you think you need to build another factory when what you need to do is fix your molding times.
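A quick sketch of the arithmetic above, for anyone who wants to see how fast the 10% bumps add up. The job size and schedule are the ones from the comment; treating each bump as compounding on the previous cycle time is an assumption, and `bumps_until_halved` is a made-up helper name.

```python
import math

PARTS = 100_000       # parts in the job, from the comment
PLANNED_HOURS = 72    # planned run time, from the comment

parts_per_hour = PARTS / PLANNED_HOURS   # roughly 1,389 parts per hour
slowed_hours = PLANNED_HOURS * 1.10      # one 10% bump: roughly 79.2 hours

def bumps_until_halved(bump: float = 0.10) -> int:
    """Compounding cycle-time bumps needed before throughput is 50% or less."""
    # Throughput after n bumps is 1 / (1 + bump) ** n, so solve (1 + bump) ** n >= 2.
    return math.ceil(math.log(2) / math.log(1 + bump))

print(round(parts_per_hour), round(slowed_hours, 1), bumps_until_halved())  # 1389 79.2 8
```

So on the compounding assumption it only takes eight "just add 10%" fixes on a mold before it is running at half speed.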

Now, back to sensors--the problem is that nobody with incentive has a way to identify AT THE TIME IT CAN BE DEBUGGED that "something is going wrong". Someone on the line dialing up the injection time should cause an immediate dump of ALL the data on that machine (probably a week or more) up to an engineer who can go through it looking for anomalies. Even better would be for the machine to flag to an engineer any "likely anomalies" (an increase in incoming plastic pellet water content should get flagged, for example) so that they can be corrected before they affect the injection process and cause failures/wastage.

This is, of course, all predicated on logging an enormous amount of data and being able to run an analysis against it in almost real-time.

Handling this data is non-trivial.

Thanks for this. A long-ago summer job was running those machines to make things like mirror rims and taillights. I still have a couple of the fantastic blobs produced when changing over from one job to another.

One of the things that struck me then is how much the line workers were treated like furniture, when many of them were quite sharp. They took such pride in getting things done well and at speed, in continually improving. I really wish I could put that kind of data in the hands of a couple of the people who trained me. Just an app on their phones. Spending 40+ hours/week on a machine means you really get to know it. I'd love to see how many of them would get great first-pass analysis and remediation.

I don't understand why such a system needs to look at every data point. Are the failures so rare that you can't get away with sampling?

Usually the cause that requires high sampling is vibration. But that can be condensed with FFT stuff easily.
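A minimal sketch of what "condensing with FFT" can look like: a window of raw vibration samples is reduced to a couple of (frequency, amplitude) pairs. To stay dependency-free this uses a naive O(n^2) DFT rather than a real FFT library; the 500 Hz sample rate and the synthetic 50 Hz / 120 Hz signal are assumptions made up for illustration.

```python
import cmath
import math

def dominant_frequencies(samples, sample_rate_hz, top=2):
    """Naive DFT; return the `top` strongest (frequency_hz, amplitude) pairs."""
    n = len(samples)
    spectrum = []
    for k in range(1, n // 2):  # skip the DC bin, keep positive frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Scale so a pure sine of amplitude A shows up as roughly A.
        spectrum.append((k * sample_rate_hz / n, abs(coeff) * 2 / n))
    return sorted(spectrum, key=lambda fm: fm[1], reverse=True)[:top]

RATE = 500  # Hz, assumed sample rate
signal = [math.sin(2 * math.pi * 50 * i / RATE)
          + 0.5 * math.sin(2 * math.pi * 120 * i / RATE)
          for i in range(500)]  # one second of synthetic vibration

peaks = dominant_frequencies(signal, RATE)
print([(round(f), round(a, 2)) for f, a in peaks])  # [(50, 1.0), (120, 0.5)]
```

Here 500 raw samples collapse to two pairs. Whether that is acceptable depends on whether anyone will later need the raw waveform, which is the disagreement in this thread.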

Honestly, from working in plastics and sensor design for 15 years, it usually boils down to engineers not being willing to let go of information because they envisage potential future issues. It's easier to imagine problems in a meeting than to imagine and deliver solutions upfront before the problem ever happens. Also, a lack of care for the economics of doing such sampling.

That's not to say there is an easy fix. These same people are the ultimate end customer who have the final word on such engineering environments.

I'm in a small energy company and log 50k/s events for several million points (temperature, pressure, voltage, current, power, ...). And most of what was said is true for us, but we don't pay millions per year to our vendor. It's not cheap though, and we will easily pay millions over the years. Horizontal scaling is not what these databases do, as mentioned (thinking of Wonderware, IP.21, Honeywell, PI, ...). I have some hope for AWS Timestream for cloud but still think the price will be high, and they only ingest near-live data, so nothing older than what fits in memory. Most of the open source options, like influx, timeseriesdb, prometheus, lack features I expect, but they are getting closer.

Check out VictoriaMetrics, with 50k/s ingest rate you can just use one single core machine (300k/s/core).

Can't tell from the website, but most metric solutions do not do bad quality/status markers. This is fundamental to IoT data. Next, most don't do time-weighted averages correctly; even timeseriesdb does this badly.
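To make the two complaints concrete, here is a rough sketch of a time-weighted average that honours a quality flag. It assumes step-hold semantics (a value holds until the next sample) and that bad-quality spans are simply excluded; real historians offer several variants, and the `Sample` type and function name are invented for this example.

```python
from typing import NamedTuple, Optional, Sequence

class Sample(NamedTuple):
    t: float      # timestamp in seconds
    value: float
    good: bool    # quality/status marker

def time_weighted_average(samples: Sequence[Sample], end: float) -> Optional[float]:
    """Step-hold time-weighted mean over [samples[0].t, end], skipping bad spans."""
    weighted_sum = 0.0
    good_duration = 0.0
    next_times = [s.t for s in samples[1:]] + [end]
    for current, next_t in zip(samples, next_times):
        duration = next_t - current.t
        if current.good and duration > 0:
            weighted_sum += current.value * duration
            good_duration += duration
    return weighted_sum / good_duration if good_duration > 0 else None

samples = [
    Sample(0.0, 10.0, True),    # held for 9 s
    Sample(9.0, 100.0, True),   # held for 1 s
    Sample(10.0, -1.0, False),  # sensor fault: excluded from the average
]
twa = time_weighted_average(samples, end=12.0)
naive = sum(s.value for s in samples) / len(samples)
print(twa, round(naive, 2))  # 19.0 36.33
```

The naive mean treats every row equally, including the faulted reading, while the time-weighted version reflects that the signal sat at 10 for most of the window.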

> it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude.

This has long been the main marketing message used to promote Complex Event Processing (CEP) [1] systems. There is no shortage of enterprise and Open Source solutions for this space; what is missing is strong demand/adoption which in itself undermines the next-big-thing claim.

One can argue that sensor data is included in the ETL category.

[1] https://en.m.wikipedia.org/wiki/Complex_event_processing

For many sensor data models, CEP is a core element of the data flow but the constraint/query data model is much larger and more dynamic than is typically supported in classic off-the-shelf CEP systems.

This isn't necessarily an issue, complex constraint matching is typically a fundamental part of the ingest path anyway given the algorithms used; making it support more generalized CEP is a fairly straightforward extension of the same computer science mechanics that make polygon search scale efficiently.

Interesting. This may be a naive question — this is very far from my area of expertise — but is there a reason sensor data can't be sampled? It seems gratuitous to store that many events.

You don't know what you need until you need it. The signal you need to dig out of the data often isn't known until some other event provides the context. Also, for some industries and some applications, there are regulatory reasons you retain the data. In some cases these are sampled data feeds, even at the extreme data rates seen, because the available raw feed would break everything (starting with the upstream network).

In virtually all real systems, data is aged off after some number of months, either truncated or moved to cold storage. Most applications are about analyzing recent history. Everyone says they want to store the data online forever but then they calculate how much it will cost to keep exabytes of data online and financial reality sets in. Several tens of petabytes is a more typical data model given current platform capabilities. Expensive but manageable.

Interestingly, I worked on a project as a data scientist with a client in high-precision manufacturing. Their signals (sensors) and actuators were stored in a historian which couldn't handle 100ms samples even though the data was collected at a 10ms rate. One of the problems required us to look at a process that took just 85ms. The problem was that the historian was showing signals up to 20ms; it took a while to realise that it was extrapolating when you tried getting finer resolution. The company had been using this historian for more than 20 years, and they had to commission another project to change the historian. So you're right, you don't know what you need until you need it.

Sometimes tens of scalars per second is the sampled data. It depends upon your requirements for accuracy and responsiveness for alarms, threshold checks, etc. I work with paper making machines that only give us a profile every 30 seconds--but that profile is a thousand floats, and we need to be constantly resampling it both spatially and temporally, and we're doing that for tens or hundreds of profiles for a single system--and we're supposed to handle hundreds of systems.

The more fundamental point that the GP is making is that the realm of industrial sensor data scales in ways that people haven't really grasped yet. It's much less about brute storage than it is about the interplay between bandwidth, storage, and concurrent processing power.

The problem is that you are generally looking for "Something's different" rather than "Smooth ALL The Points".

So, the problem is that you threw away 90% of your data, and that's where the problem was. Oops. Now you have to switch on "Save all the data" and hope it repeats. So, given that you have to have a "Save all the data" switch anyhow, you might as well turn it on from the start.

In addition, changepoint analysis is an entire field of research in and unto itself.
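As a toy illustration of both points (changepoint detection, and what throwing away 90% of the data costs you), here is a one-sided CUSUM sketch. CUSUM is only one of many changepoint techniques; the target, slack, and threshold values are arbitrary assumptions, and the readings are synthetic.

```python
from typing import Optional, Sequence

def cusum_first_alarm(values: Sequence[float], target: float,
                      slack: float, threshold: float) -> Optional[int]:
    """Index where the upward CUSUM statistic first exceeds threshold, else None."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only deviations above target + slack; never go below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None

readings = [10.0] * 50 + [10.5] * 50  # the mean shifts up by 0.5 at index 50

full_alarm = cusum_first_alarm(readings, target=10.0, slack=0.25, threshold=1.0)
sampled_alarm = cusum_first_alarm(readings[::10], target=10.0, slack=0.25, threshold=1.0)

print(full_alarm)           # 54: five readings after the shift
print(sampled_alarm * 10)   # 90: the same shift seen through 1-in-10 sampling
```

With every point kept, the shift is flagged five readings after it happens. With 1-in-10 sampling the same detector needs until original index 90, which is the "hope it repeats" problem in miniature.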

Look at how many articles there are about analyzing "Did something break in my web service or am I really doing 10% more real traffic?"

Depends on the application. Often down-sampled data is useful for drawing trends but not so useful for better understanding failure events.
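A small sketch of why that is, using synthetic data: averaging into buckets keeps the trend but flattens a one-sample excursion, whereas also keeping min/max per bucket preserves the fact that something happened. The bucket size and signal values are assumptions for illustration.

```python
from statistics import mean

def downsample(values, bucket):
    """Split values into fixed-size buckets and return (mean, min, max) per bucket."""
    out = []
    for i in range(0, len(values), bucket):
        chunk = values[i:i + bucket]
        out.append((mean(chunk), min(chunk), max(chunk)))
    return out

signal = [50.0] * 100
signal[42] = 95.0  # a single-sample excursion (hypothetical failure event)

buckets = downsample(signal, bucket=20)
bucket_mean, bucket_min, bucket_max = buckets[2]  # covers indices 40..59
print(bucket_mean, bucket_max)  # 52.25 95.0
```

The bucket mean barely moves (52.25 against a baseline of 50), so a mean-only trend line would hide the event; the max still shows 95, but not when within the bucket it occurred or what the surrounding samples looked like.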

Server monitoring data (mostly counters) is usually saved at 10 to 15 second intervals. It is rarely queried at full resolution; it’s almost always sampled, yes.

Thanks for this great comment. What kind of workloads are people trying to run on sensor data that arrives at such a high velocity? Time series analysis? Anomaly detection? I wish I had a better idea of what kind of specific problems users you've run into are trying to solve, which fail on the existing software stack.

Not OP, but I work for QuasarDB and we deal with a lot of customers in this sector.

It’s typically a mix of everything, but predictive maintenance, anomaly detection and failure analysis are the most common. For example, there is one process that does trend analysis and tries to “predict” acceptable boundaries of a certain sensor’s measurements, and this is then compared in real-time with the actual sensor readings. If things fail for some reason, a technical engineer will dive into the data with dashboards (think: Grafana), zoom in, compare the readings with other sensors, etc.
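Very roughly, the "predict acceptable boundaries, then compare with live readings" loop can be sketched like this. It is not how QuasarDB or any particular customer does it; a rolling mean plus or minus k standard deviations is simply the easiest stand-in for a trend model, and the class name, window size, and k are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev

class BandMonitor:
    """Flags readings outside mean +/- k * stdev of a rolling window."""

    def __init__(self, window: int = 30, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def check(self, reading: float) -> bool:
        """Return True if the reading is outside the band predicted from history."""
        out_of_band = False
        if len(self.history) == self.history.maxlen:
            centre = mean(self.history)
            spread = pstdev(self.history)
            out_of_band = abs(reading - centre) > self.k * spread
        if not out_of_band:
            # Only in-band readings update the model, so one spike cannot widen the band.
            self.history.append(reading)
        return out_of_band

monitor = BandMonitor(window=30, k=3.0)
warmup = [20.0 + 0.2 * (i % 2) for i in range(30)]  # synthetic, well-behaved sensor
warmup_flags = [monitor.check(r) for r in warmup]

print(any(warmup_flags), monitor.check(20.15), monitor.check(25.0))  # False False True
```

In practice the expensive part is not this check but keeping enough full-resolution history online that the engineer can zoom in afterwards, as described above.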

The sheer volume of the data makes it fairly painful. Downsampling does happen, but only after a few weeks. This means that you still need enough storage capacity to deal with the full stream of data in real-time.

The data models for any non-trivial sensor analysis are intrinsically spatiotemporal -- every measurement or event happens at a place and time. Spatial relationships are central to the proposition of analytically reconstructing the dynamics of the physical world from disparate entities and sensors. The objective is to sample enough pixels and their relationships to sketch an accurate picture of reality as it happens. For example, a car is trying to understand its relationship to every relevant static and dynamic entity in its environment that affects its ability to operate safely. There is no business that does not benefit from having a model of reality that converges on ground truth in real-time, if you can take advantage of it.

Most of the analysis that is done usually falls under one of two categories. First, inferring (you can rarely measure it directly) when something has changed in the real world that is relevant to your business so that you can adapt to it immediately -- this applies to everything from autonomous driving to agricultural supply chains. Second, detecting anomalies -- the unknown unknowns -- so that risks can be managed when the real world appears to not conform to the models upon which you base decisions. A third category is support of industrial automation, which benefits immensely from high-resolution multimodal sensor data models, though this is largely a cost reduction measure. These categories are hand-wavy but in practice, boring industrial companies have concrete metrics they are trying to achieve or risks they are trying to manage in the most efficient way possible.

That's one of the big challenges we've been running into at UrbanLogiq. We've built bespoke storage and processing pipelines for this data because existing options in this space both didn't fit our needs and also would bankrupt our company while we tried to sort it out.

Having "cost" on the board as a factor we were actively trying to optimize for during design pulled us in a direction that is quite foreign compared to off-the-shelf solutions.

That last paragraph rings true -- one of our big challenges specifically was in ingesting and indexing data that needs to be queried across multiple dimensions, things like aircraft or drone position telemetry. But once we found a workable solution for that, it specializes quite well to simpler workloads.

>> Almost no one provides tooling and platforms that address it

I think this is because the mentioned companies are not too common (yet?). There are tools and systems that you can use, especially from high frequency trading, which has somewhat similar challenges. KDB+ and co. would be my first stop to check if there is something that I could use. The question is the financial structure and scaling of the problem, to determine if these tools are in the game. There are other interesting projects in the space:

- https://github.com/real-logic/aeron

- https://lmax-exchange.github.io/disruptor/

Of course these are not exactly what you need, long term storage and querying (like KDB) is largely unsolved.

The other tools that you might be referring to by "most of the opensource platforms" indeed are not capable of doing this. I spent the last 10 years optimizing such platforms, but they are not even remotely close to what you need; you (or anybody who thinks these could be optimized) are wasting your time.

"You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second;"

We do this. Have a load balancer with a fleet of nginx machines insert into bigquery. Inserts scale well and the large queries work since it is columnar. The issue is price. It's terribly expensive.

One thing people seem to be doing is putting incredible effort into timeliness of data nobody ever looks at. (Plus, creating hundreds of bytes of TCP/IP + JSON overhead for single-bit events.)

I've used the following pattern in the past:

- generally only send batched data in as large an interval as possible

- if somebody looks at a device, immediately (well, might take some seconds) query the batched data and switch device to a "live" mode that provides live data instead of "wait and batch".

This will be a bad idea for scenarios where there's a reasonable expectation of surges of people needing "live" access, but for our use cases of industrial data, it works very well. We only watch our own devices, which are in the lower tens of thousands, but I don't see why this should not scale to more, under the restrictions mentioned above.
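A minimal sketch of the batch-by-default, live-on-demand pattern described in this comment. Everything here is hypothetical: the `Device` class, the 300-second interval, and the `send` callback stand in for whatever transport and scheduling a real deployment uses.

```python
class Device:
    """Buffers events and sends them in batches unless somebody is watching."""

    def __init__(self, device_id: str, batch_interval_s: float = 300.0):
        self.device_id = device_id
        self.batch_interval_s = batch_interval_s
        self.live = False
        self.buffer = []
        self.last_flush = 0.0

    def record(self, now: float, event, send) -> None:
        """Queue an event; send at once in live mode, else when the interval elapses."""
        self.buffer.append(event)
        if self.live or now - self.last_flush >= self.batch_interval_s:
            self.flush(now, send)

    def flush(self, now: float, send) -> None:
        if self.buffer:
            send(self.device_id, list(self.buffer))
            self.buffer.clear()
        self.last_flush = now

    def start_viewing(self, now: float, send) -> None:
        """Somebody opened this device: pull what is batched so far, then go live."""
        self.flush(now, send)
        self.live = True

    def stop_viewing(self) -> None:
        self.live = False

sent = []
send = lambda device_id, events: sent.append((device_id, events))

pump = Device("pump-7")
pump.record(10.0, "a", send)     # buffered, nobody is watching
pump.record(20.0, "b", send)     # buffered
pump.start_viewing(25.0, send)   # flushes ["a", "b"] and switches to live
pump.record(26.0, "c", send)     # sent immediately

print(sent)  # [('pump-7', ['a', 'b']), ('pump-7', ['c'])]
```

Three events cost two transmissions here instead of three, and the saving grows with the batch interval for devices nobody is looking at.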

> Almost no one provides tooling and platforms that address it.

As a systems engineer with a good track record and an interest in starting an endeavor, this is a very attractive statement to me.

Where can I read more about how the sensor networks are configured, the use-cases, etc? I'd like to read into this a bit more.

Structurally this is a nearly ideal ultra-scale startup opportunity given the right team.

Every use case has unique data model requirements (minimal standardization, different sectors) but there are easily identifiable platform components that almost everyone needs which aren't available. Surprisingly "simple" architectural holes would be a scalable business if competently plugged, the perfect MVP. These enterprises have an aversion to developing software, it isn’t their strength, and they know precisely how many millions per year a real platform would save them -- value is concrete. However, they are also technically sophisticated as to why all existing platforms fail for them, you can’t fake understanding the problem. I have the benefit of having worked on this market problem for several dozen organizations over the last 15 years, ranging from Big Tech to small EU industrials, so I see it more from their side.

Little is written about it. Everyone is essentially trying to use diverse multimodal sensor data sources to paint an accurate model of some part of the physical world in as close to real-time as possible. Easy to say, very challenging to do. Sometimes these data models are not about their business per se, their hardware puts them in an excellent position to build them so that they can sell it as a service to businesses that can actually use it. Often overlooked is that there are extremely difficult computer science problems with little public literature buried in the design of such systems, and expertise in this computer science is critical to being successful at it. Virtually all startups that try to enter this market completely botch the technical execution, assuming that these platforms don’t exist as a function of business execution when it is actually a hardcore tech startup. The technical execution expertise is the real moat for this business, everyone underestimates how deep that rabbit hole goes.

FWIW, I’ve been laying the groundwork to build a startup in this space for a while now, I even purchased a very good .com domain. :) Bespoke implementations at several highly recognizable organizations are based on licensed code components I designed. There is a massive demand overhang and the market was ready yesterday. The broader ecosystem has room for several startups to coexist, there are many niches currently unfilled.

What happened to SpaceCurve?

I've got 15 million connected cars, the data they can generate is large and you care about each specific car. Sampling the data doesn't work.

You can sample the data from each car? You don’t have to sample the cars themselves.

Insurance co?

I work with sensor data and although not explicitly mentioned I thought you could locate it in the "event streaming" and "stream processing" boxes.

What piece of architecture you think is left out?

> Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially.

Let’s step back for a second and just acknowledge that you’re in a very narrow slice of the market. The number of companies that are paying $100M/year to store sensor data is probably countable with 8 bits.

So it might seem like a large gap for you, but it’s honestly not relevant for 99.99% of developers.

Sounds like you need to move more processing and storage to the edge.

That's a key part of it too over the long term. There will never be enough bandwidth to backhaul all the sensor data to a data center. However, there are huge technical gaps that need to be addressed to make edge computing viable, particularly around managing federation considering the compute profile of many considered applications. Ad hoc transient meshes of powerful compute elements attached to diverse multimodal sensor sources without a trivial root of trust is... interesting.

It isn't a solved problem but people are working on it.

If monitoring is the use case then go for Netdata

Sounds like you need 1,000 nodes to do 1Bpps without edge computing. With some compression at the edge, it'd be closer to 150-250. The limits of a conventionally architected network make it more annoying than it needs to be.

Can you provide some examples of the kinds of sensors you're talking about?

In our factory we use temperature/humidity sensors, electricity meters, air pressure, statuses from various machines... We don't even have that many sensors and we normally poll every 5 seconds. But when the data processing stops for some reason, the backlog queue starts growing FAST.

The vast majority of sensor data compresses well. Delta encoding with Huffman is pretty standard.
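A small sketch of the idea, under some stated assumptions: the readings are a synthetic slowly rising integer signal, and `zlib` is used as a stand-in for the entropy-coding stage (its DEFLATE format combines LZ77 with Huffman coding, so this is not a pure Huffman coder). The inputs are assumed non-empty and small enough to fit in 32-bit integers.

```python
import struct
import zlib

def delta_encode(values):
    """Keep the first value as-is, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack(ints):
    """Pack as little-endian signed 32-bit integers."""
    return struct.pack(f"<{len(ints)}i", *ints)

# Synthetic sensor: slow upward ramp with a small repeating wobble.
readings = [20_000 + 3 * i + (i % 7) for i in range(5_000)]

raw_size = len(zlib.compress(pack(readings)))
delta_size = len(zlib.compress(pack(delta_encode(readings))))

print(delta_decode(delta_encode(readings)) == readings)  # True
print(delta_size < raw_size)                             # True
```

The deltas are a handful of small repeating values, which is exactly what an entropy coder likes; the raw readings keep changing in their low bytes and compress much less well.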

I guess you could have a time series database that used compression but I don't know of databases that do

Thoughts on Honeywell Forge?

While this is an article about data infrastructure I feel like we're missing the forest for the trees.

What is most important here in my opinion is that the underlying data is useful. If your underlying data wasn't collected, collected properly, or even worse the wrong data was collected.. then setting up data infrastructure will be a boondoggle that will cause your organization to be data hostile.

Just as much, if not more effort, needs to go into collecting the right data in the right way to fill your data infrastructure with. Most of the projects I've seen or heard of are just people taking the same old data that Ted in accounting, Jill in BI, etc. are already pretty proficient at using. So the gains you get by moving that into a modern infrastructure are marginal. How many more questions can you really ask of the same data that people have decades of experience with and an intuitive sense for?

The biggest shift has been towards data lake (store everything) away from data cubes (store aggregates). This makes it orders of magnitude easier to diagnose, debug, and assert the correctness of data.
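A toy illustration of the difference, with made-up rows and field names: a cube built on (day, region) cannot answer a per-product question after the fact, while the raw rows can be re-aggregated along any dimension.

```python
from collections import defaultdict

# Hypothetical raw events, i.e. what a "store everything" lake would keep.
raw_events = [
    {"day": "2020-10-01", "region": "EU", "product": "A", "revenue": 120},
    {"day": "2020-10-01", "region": "EU", "product": "B", "revenue": 80},
    {"day": "2020-10-01", "region": "US", "product": "A", "revenue": 200},
    {"day": "2020-10-02", "region": "EU", "product": "B", "revenue": 50},
]

def aggregate(events, *dims):
    """Sum revenue grouped by the given dimensions."""
    totals = defaultdict(int)
    for event in events:
        totals[tuple(event[d] for d in dims)] += event["revenue"]
    return dict(totals)

# What a pre-built cube would hold: the product dimension is already gone.
cube = aggregate(raw_events, "day", "region")

# Only answerable because the raw rows were kept.
by_product = aggregate(raw_events, "product")

print(cube[("2020-10-01", "EU")], by_product[("A",)])  # 200 320
```

Debugging works the same way: when the (day, region) total looks wrong, the raw rows let you see which records produced it.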

So these trends aren’t in a vacuum, they directly support the issues you discuss.

> Most of the projects I've seen or heard of are just people taking the same old data ...

I don’t disagree with you here. But in my experience it’s about getting Frank in marketing to use the same numbers as everyone else.

When you have 5 different ads platforms that all take revenue credit for a single conversion and have conflicting attribution models, and none of them add up to what accounting says is in the bank account. That’s a hairy problem.

There are different flavors of that class of problem at lots of companies.

> The biggest shift has been towards data lake (store everything) away from data cubes (store aggregates).

I don't think this is any shift. The "store everything" has always existed in my experience, that's how the aggregates were built in the first place. The aggregates were for speed and convenience, and you drill-down as necessary, including to the individual record level.

Maybe the shift is people thinking that it's cheaper to just analyze the entire corpus on-demand because we can throw a spark cluster at it?

I agree, the data warehouse was what the data lake is today. A data cube is the aggregation of data in the warehouse, and then you can drill down and roll up. The difference between warehouse and lake is the emphasis on correctness (one canonical data model) and deduplication of data (when warehouses were invented, storage was expensive, so one tried to normalise it into a star schema with as little duplication as possible; with the emergence of cheap storage, this is less important and we can spend less time developing fancy ETL processes to make everything fit into one, conformant data model).

> this is less important and we can spend less time developing fancy ETL processes to make everything fit into one, conformant data model

And that's precisely why modern data processes are inferior to 20 years ago. People reinvent the wheel over and over and spend massive budgets on unnecessary tech stacks that would be alleviated if the time was simply taken to model the data.

A clean data model is about a whole lot more than simply storage space.

Agreed, if people would model everything to a “Grand Unified Data Model” of everything, it would be a lot more efficient... unfortunately that is very hard and puts a massive bureaucracy around data governance and management. It slows things down. I guess a more modern approach is to relax those constraints a bit and realise that some data can be expressed in different ways, and that duplication isn’t too much of a concern because storage is cheap. That said, it’s not an excuse for reinventing the stack over and over. I think the thirst to reinvent has largely been driven by the shift from expensive proprietary solutions like Teradata and Oracle to open source ones. That’s a positive shift.

And this is why big enterprises are moving away from the data warehouse model for processing all the data and prefer a data lake concept. It is better to centralize some aspects of the data, but definitely not all (like a common relational model for all sources, as in a data warehouse). The data warehouse model quickly becomes super expensive to maintain and evolve.

You miss the critical difference -- nowadays people don't store aggregates, they just scan sharded data very fast. That simplifies a lot of things, because you don't need to keep two databases in sync (raw and aggregated).

I was working in data analytics + data science a decade ago and we stored everything, not aggregates, and pushed them through hadoop. I have been "out of the game" since then. What has changed that is making people say "store everything" is a new phenomenon? (genuine question bc I am clearly missing something.)

It’s not a new phenomenon so much as it has emerged as an important shift from the status quo 20 years ago.

What’s changed in the last 10 years are the access patterns. There’s increased demand to have arbitrary query access over the raw data. The most impactful technology changes have been about pushing the access layer (queries, stream & batch processing, dashboards, BI tools, etc) down as close to the raw data as possible and making that performant. What’s fallen out of that are better MPP OLAP databases (snowflake), new columnar formats (parquet), SQL as the transform layer (dbt).

Ah that makes sense. Thanks.

why is it actually that SQL "re-emerged" as the transformation layer? I thought that it first shifted from SQL to Query Builders inside talend, matillion etc. Why now SQL again?

Probably just the emergence of dbt? I’ve only been doing ETL for a couple years personally but couldn’t imagine using so much SQL in our pipelines without a framework like dbt

The problem of data confusion you describe is resolved by replacing management. That’s not an engineering issue that requires new technology (consider the source of social power for the author: selling technology).

That’s an engineering issue that needs new engineering management who don’t enable wasting company resources making incompatible APIs in the first place.

We already did the monolithic DB design, I used to name those hosts “ocean”. And we already know the math. “Data lake” is just more jargon by a salesman to obfuscate peddling the same old abstraction, and wow fresh grads with new words for hyping the same old habits.

While not the author of this piece, Bezos is quoted as pointing out how circular social behavior is.

What do you think the odds of this author being on a similar page?

Have humans evolved much in 100 years? Or does the con simply get rewritten for the next generation to hide a simple truth?

What’s keeping people going in this circle isn’t logistical necessity. It’s us.

I think you have a point, but there are more nuances than that.

There are typically 2 types of data to collect: Transactional data and behavioural data.

Most transactional data, due to their important nature, are already generated and captured by the production applications. Since the logic is coded by application engineers, it's usually hard to get this data wrong. These data are then ETL-ed (or EL-ed) over to a DW, as described by the article.

Behavioural data is where your statement applies most. This is where tools like Snowplow, Posthog, Segment, etc come in to set up the proper event data collection engine. This is also where it's important to "collect data properly", as these kinds of event data change structure fast and are hard to keep track of over time. I'd admit this space (data collection management) is still nascent, with only tools like iterative.ly on the market.

I completely agree - there's only so many ways to slice the data. The caveat is - the type of data matters quite a bit for the data architecture. There's another thread that mentions sensor data as a source of complexity since the data has a theoretical delay between events (i.e. period) of 0 - something few systems are built to handle, even if you sample approximations at some fixed frequency. Algorithmic trading is a similar domain that still has a huge bar for entry - a sign that _this isn't easy_.

The fidelity of the data is of course important, but I would claim it's not a blocker. Yes, you need to trust the data you collect. That's table stakes - if you can't collect data correctly at all, even without worrying about the past, you're in for a world of hurt. It's P0. That said, a lot of people assume you also need to do this historically - and that's not the case - at least for ML.

Reinforcement learning has been making great strides in recent years. If you're in this situation - you have a flow where you want to use a model without having any past data to train with - use something like VW's contextual bandits [1]. You don't need historical data to build your model, just real-time decision point & reward signals. Once deployed, the model converges over time to the optimal model using real-time feedback.

All that said - baby steps are important. If you're in this situation, start by getting fidelity and then expand scope slowly without sacrificing fidelity. It's a lot easier to backfill than to "fix" data - get that right and it gets easier from there. You'll need fixups regardless - mistakes happen and requirements change - but you have to start with something you trust, at least in the moment it's deployed.

[1] https://vowpalwabbit.org/tutorials/contextual_bandits.html
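
For a feel of how a contextual bandit learns from live decision and reward signals alone, here is a toy sketch. It is not Vowpal Wabbit's API or algorithm; it is a plain epsilon-greedy learner, and the contexts, actions, reward rule, and the names `EpsilonGreedyBandit` and `simulate` are all invented for illustration.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Keeps a running mean reward per (context, action) pair."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def best(self, context):
        """Action with the highest observed mean reward for this context."""
        return max(self.actions, key=lambda a: self.means[(context, a)])

    def choose(self, context):
        # Explore with probability epsilon, otherwise exploit the best known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.best(context)

    def learn(self, context, action, reward):
        # Incremental mean update: only the live (context, action, reward) triple is needed.
        key = (context, action)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]

# Hypothetical environment: which action actually pays off in each context.
TRUE_BEST = {"morning": "news", "evening": "sports"}

def simulate(rounds=2000):
    bandit = EpsilonGreedyBandit(["news", "sports"], epsilon=0.1, seed=42)
    env = random.Random(7)
    for _ in range(rounds):
        context = env.choice(["morning", "evening"])
        action = bandit.choose(context)
        reward = 1.0 if action == TRUE_BEST[context] else 0.0
        bandit.learn(context, action, reward)
    return bandit

bandit = simulate()
```

The learner starts with no data at all and ends up preferring the right action per context, which is the property the comment relies on. Rewards here are deterministic to keep the toy simple; real reward signals are noisy and delayed.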

Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?

Let’s say I deleted every time series whose Y axis isn’t measuring US dollars in every tech company’s database everywhere. Maybe for all those time series you just store the most recent value. Describe to me what would be lost.

You’re onto something but you’re not going far enough! Most, if not all, historic metadata, analytics and behavioral data collection - when it is not measuring literal dollar amounts - is completely worthless.

This is completely wrong, and nobody who works with data at any scale could possibly believe anything like this.

We literally run long term A/B tests with thousands of variations of what you're describing. The purpose of these tests is to measure the effect of losing some data. The tests show (to nobody's surprise) that each piece of data is useful. These tests tell us exactly how useful each piece of data is.

Honestly when I read comments like this I have to wonder, do you really believe that thousands of companies spend trillions of dollars a year for something that doesn't work? Maybe talk to somebody who works on this stuff a bit?

> Honestly when I read comments like this I have to wonder, do you really believe that thousands of companies spend trillions of dollars a year for something that doesn't work?

Joking/not-joking. Have you ever been to the Bay Area?

Yes. Emphatically yes it is the case companies spend trillions of dollars unnecessarily.

We've seen this with people who didn't know how to build microservices and farcical "LMNOP" [1] type services that might as well be a joke. We've seen it with gigantically-valued unicorns that over-engineered tons of crap and hired too many people and still can't make a profit. We've seen it with CMOs and massively overpriced marketing technology because budgets and statuses are related. We'll see it with tons more iterations of this exact same affluenza.

The history of our industry is that the margins on software are so good that people can afford to do crazy nonsense.

[1] https://www.youtube.com/watch?reload=9&v=y8OnoxKotPQ

Yeah I mean this misses the main argument I'm making.

I have vast amounts of firsthand evidence from randomized controlled trials that non-financial data can be used to create value. This is enough evidence for anyone in the industry.

Presumably the commenter doesn't have access to this evidence. Instead he has to rely on other heuristics, like the weaker argument that companies spend trillions on data and analytics.

Companies sometimes waste money, and maybe microservices are an example of this. But companies collectively spend 3-4 orders of magnitude more money on "all data that does not have USD units" than on microservices, so the commenter should take that as strong evidence that data can be used to create value.

His quote was:

> Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?

And for the huge majority of companies, it is not. Even many 1B+ dollar companies. Most value produced by businesses still could exist with boring ETL or BI concepts that have been around forever, because the hardware powering it is so fast. Many if not most of those businesses probably would be better off.

So yes, companies blow a lot of money on stuff with questionable ROI. I don't discount your experiences and that there are cases where the complexity might have a payoff. But honestly, we've seen periods of excess complexity and waste in software before: it's the norm.

Just so I don't misunderstand your point: "Yes, companies without coordinating are collectively spending vast sums of precious resources on a product/service, BUT they are all misguided and wasting their money because look at these unrelated events where individuals misplaced their efforts/ money". Is that correct?

> wasting their money because look at these unrelated events where individuals misplaced their efforts/ money

They're not unrelated. They're all related to fat 60-80% profit margins on SaaS. And no coordination is necessary to spend money on silly, make-work activities if you have margins like that.

Software has extremely low, borderline zero, variable costs. A lot of these companies that spend money on examples like I've given probably could hire nobody at all and still have crazy growth because of the unit economics. (Not coincidentally, the companies most in want of "Big Data" solutions tend to be past this point.)

I can get 1M QPS on a silly Aurora setup with replicas. Best tool for all jobs: no. But don't tell me that dollar-for-dollar a data architecture with like 25 different components is dramatically superior to an OLTP db, OLAP offline store + batch jobs, and a streaming system.

That video is great, I've actually worked on a project where we tried to federate customer databases across 3+ mergers. Started before I joined the company and after 3-4 iterations they shipped something (that worked) around 3 years later right before I left.

I’m not trying to get into a flame war with you, there’s no reason to be hostile. It sounds like you “work on this stuff a bit,” you’re welcome to share concrete examples, it would be really interesting!

I think it’s an intriguing thought exercise. For example, does one need the entire history of interactions with e.g., an Instagram post, or just aggregated measurements? I’m not like, against measuring. Just against warehousing of non financial timeseries.

Yeah don't mean to argue, here are some examples. It's difficult for any company to be competitive at scale without sufficient logging to support all of these things.

* Timestamps of events related to content loading and rendering. This is crucial for debugging and improving load times.

* Backfilling aggregated data so that ML models can be trained without waiting weeks for new streaming aggregation.

* Answering product questions of almost any kind that weren't asked when logging was built.

Concrete example from my recent experience, you may want to know how often people like a post then later look at comments, vs look at comments then later like a post. This gives you information about cause and effect.
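
A rough sketch of that like-vs-comments ordering question, assuming the raw events were kept. The event tuple layout, the event names, and the function `first_event_order` are hypothetical, not any particular company's schema.

```python
from collections import Counter

# Hypothetical raw event log: (user_id, post_id, event_type, unix_timestamp).
EVENTS = [
    ("u1", "p1", "like", 100),
    ("u1", "p1", "view_comments", 160),
    ("u2", "p1", "view_comments", 90),
    ("u2", "p1", "like", 95),
    ("u3", "p1", "like", 50),
    ("u1", "p2", "view_comments", 300),
    ("u1", "p2", "like", 310),
]

def first_event_order(events):
    """For each (user, post) with both events, record which one happened first."""
    first_seen = {}
    for user, post, kind, ts in events:
        times = first_seen.setdefault((user, post), {})
        # Keep only the earliest timestamp per event type.
        if kind not in times or ts < times[kind]:
            times[kind] = ts
    order = Counter()
    for times in first_seen.values():
        if "like" in times and "view_comments" in times:
            # Ties are counted as comments-first; an arbitrary choice for this sketch.
            if times["like"] < times["view_comments"]:
                order["like_then_comments"] += 1
            else:
                order["comments_then_like"] += 1
    return order

order_counts = first_event_order(EVENTS)
```

This only works because per-event timestamps were stored; a pre-aggregated "likes per post" counter could not answer it after the fact.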

The first one doesn't even need much historical data. Unless you have some very unoptimized periodic jobs, the last few days or something is plenty.

The second can be done simply on something like Dynamo, CosmosDB, or your cloud-hosted NoSQL of choice. Heck, it can even be done on Aurora or vanilla Postgres + partitioning if it's <64TB.

The third can be done with any off the shelf cloud data warehouse software, at many petabyte scale. And even then, I'm sorry, but I just don't believe you that the product clicks over some large timeframe are historically relevant if your software and UI changes often.

All of these things mentioned have had extremely simple, boring solutions at petabyte scale for >10 years, and in some cases more than that. If you add a batch workflow manager and a streaming solution like Spark, that's like 3-4 technologies total to cover all these cases (and many more!)

yes consider RPC:




Microservices <--- You are here

> Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?

Yes. Every advertising platform ever uses this information. In Europe, you have to have regulation that makes account costing (what the US might call forensic accounting) possible. The presentations on A/B tests by FANG companies might also interest you. They are on Youtube.

For a post detailing the modern data infrastructure I'm surprised they intentionally leave out SaaS analytics tools. I find this especially surprising given a16z has invested >$65M into Mixpanel.

Based on my experience working at an analytics company and running one myself, what this post misses out is that an increasing number of people working with data today are not engineers. These people can range from product managers who are trying to figure out what features the company should focus on building, marketers to figure out how to drive more traffic to their website, or even the CEO trying to understand how their business as a whole is doing.

For that reason, you'll still see many companies pay for full stack analytics tools (Mixpanel, Amplitude, Heap) in addition to building out their own data stack internally. It's becoming more and more important that the data is accessible to everyone at your company including the non-technical users. If you try to get everyone to use your own in-house built system, that's not going to happen.

I don’t think Mixpanel fits here. Mixpanel is just one end-to-end suite, mostly behavioral data that is captured from user sessions or user derived events/sub-events. Basically web analytics.

The whole point of data infrastructure is that sometimes you’re collecting data from the most random places. Much of that data is not necessarily user behavior. Sometimes it’s things like temperatures, latencies, CPU usage or instrument tallies. Sometimes it’s a stream of minute to minute weather data or timings or anything, really. Besides, many companies have been collecting data for decades but it all lives in silos where it can’t be used for anything.

Mixpanel can’t capture all that data, or query it, or analyze it. Mixpanel is just capturing a super small subset of web event data and it happens to provide an analysis suite on top of the data they collect.

That’s why Segment shows up in this list instead. They help to move a lot of siloed data into a common system. Mixpanel is just another source of data. You need something like Snowflake to put everything together and be able to do queries across multiple datasets.

That's a great point. In a similar vein, marketing teams too are increasingly data driven and would use tools like Braze, CustomerIO etc to run personalized data driven campaigns. Support teams are using tools like GainSight.

All these tools need to be fed data about user behavior - from apps, server backends, other tools etc. It's a messy data connection problem, not just one way from SaaS to warehouse. Mobile App->SaaS; SaaS->SaaS; Warehouse->SaaS; SaaS->Warehouse and so on.

Don't forget the challenges around identity resolution and privacy compliance to properly join it all up effectively and accurately.

Indeed. Even when you have the full identity graph in the warehouse and just want to assign a canonical ID (by doing a transitive closure), it is not easy in SQL. We wrote a blog on it (sorry for the shameless plug) https://rudderstack.com/blog/identity-graph-and-identity-res...

Creating the ID graph is a next-level problem altogether! How do you know a record in Salesforce is the same as the anonymous visitor on your website? It requires joining across at least 3 IDs (possibly more): anonymousID, userID, email (if the user signs up) and Salesforce record email.

Should the data pipe do this automatically? If not, what API abstraction should be exposed to the user?
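
Outside SQL, the transitive closure over an identity graph is a classic union-find problem. The sketch below is only an illustration: the ID strings, the edge list, and the rule of picking the lexicographically smallest ID as canonical are assumptions, not how any particular product does it.

```python
class UnionFind:
    """Disjoint-set structure over arbitrary hashable ids."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Assumption: the lexicographically smallest id becomes canonical,
            # which keeps the result stable regardless of edge order.
            lo, hi = sorted((ra, rb))
            self.parent[hi] = lo

# Hypothetical edges, each one an observation that two ids belong together.
EDGES = [
    ("anon:123", "user:42"),                 # anonymous visitor logs in
    ("user:42", "email:ann@example.com"),    # user signs up with an email
    ("email:ann@example.com", "sfdc:0031"),  # Salesforce record has the same email
    ("anon:999", "user:77"),                 # an unrelated visitor
]

def canonical_ids(edges):
    """Map every id to the canonical id of its connected component."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    return {node: uf.find(node) for node in list(uf.parent)}

canonical = canonical_ids(EDGES)
```

Here the Salesforce record and the anonymous visitor end up with the same canonical ID even though no single edge links them directly, which is exactly the part that is awkward to express as a fixed number of SQL joins.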

For those who're interested in learning more about the history and evolution of data infrastructure/BI - basically why and how it has come to this stage - check out this short guidebook [1] that my colleagues and I put together a few months back.

It goes into detail on how relevant the practices of the past (OLAP, Kimball's modeling) are to the changes brought by the cloud era (MPP, cheap storage/compute, etc). Chapter 4 will be the most interesting for the HN audience: it walks through the different waves of data adoption ever since BI was invented in the 60-70s.


This sounds like an in-depth discussion of what the a16z document calls Blueprint 1: Modern Business Intelligence. I don’t know if the other two blueprints for Multimodal and AI are explored.

The ELT (rather than ETL) insight was really cool, hadn't heard of that before.

Unless you're at a massive, massive scale, though, Just Use Postgres and write your ETL (ELT now?) queues normally. Keep It Simple, Stupid.
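
For anyone else new to the ELT idea: load the raw rows first, then transform with SQL inside the database. A tiny sketch, using Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw rows untouched, including ones we will filter out later.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", 1200, "paid"),
        (2, "acme", 800, "refunded"),
        (3, "globex", 5000, "paid"),
        (4, "acme", 300, "paid"),
    ],
)

# Transform: build the modeled table with plain SQL, inside the database.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
    """
)

revenue = dict(
    conn.execute("SELECT customer, revenue FROM customer_revenue ORDER BY customer")
)
```

Because the raw table is kept, changing the business rule (say, counting refunds differently) means rerunning one SQL statement, not re-extracting from the source.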

While I think data science is a very interesting field with a lot of beneficial applications it also seems to be the one that's right at the heart of a lot of the negative impact some tech is having on society right now. I seriously considered specializing in it for a while but ultimately decided it was too likely I'd be asked to work on things that make me uncomfortable.

Power(ful tools) can be wielded for good or evil, the courageous thing to do is to learn it AND act ethically, not shy away from it.

Otherwise the spoils of war go to the unethical evil because they are now unchallenged.

I disagree; I think that approach does not work.

Building powerful tools and then using them ethically doesn't reduce the amount of "unethical evil" done by others. Quite the contrary. And it doesn't deny them "spoils", as though there's a zero-sum prize, because there isn't one.

If you're really good at building tools, it will result in the creation of new, powerful tools which may be wielded for good or evil. If most other actors out there will wield those tools you're building for more evil than good, the mere act of building those tools will lead to more evil than good.

So I'm with cageface on this.

Deciding which tools to build does have consequences, and it's other people who primarily decide how those tools will be used, not the toolmaker. Sometimes you can already see what choices others look likely to make.

Some would argue this doesn't place an ethical burden on the toolmaker, because you can't and shouldn't control other people. That's a different argument though. Ethical or not, there are undeniably consequences from building tools when you can see how they are likely to be used.

Maybe so but I'm not really in a position to martyr myself professionally right now so I'm just avoiding it instead.

False dichotomy

How so? If I level up on data science and then go work somewhere as a data scientist and refuse to work on the tasks assigned to me I don't see that working out so well in my performance reviews.

To the downvoters - grow up and learn to present an argument.

You could say the same thing about lots of knowledge though: AI, economics, behavioral psychology, crypto currencies etc.

I think most people would agree that there’s lots of positive applications that you could use your data science skills for. But if you can’t then good for you for staying out of the field.

I'm really excited about the state of data infrastructure and the emergence of the data lake. I feel like the technical aspects of data engineering are reduced to getting data into some cloud storage (s3) as parquet. Transforms are "solved" using ELT from the data lake, or streaming using kafka/spark.

I think executing this in orgs with legacy data technologies is hard but it is much more a people problem than a tech problem. In orgs that have achieved this foundation it's really cool to see the business and analytic impact to the company.

"it is much more a people problem than a tech problem"

^ This holds true for nearly every aspect of nearly every company.

Snowflake (and others) will let you either pull that in and query it or as an external query that queries it in place. You can, if it makes sense for your use case, now just T from the data lake.

Good start for this vast and complex topic. One thing that pops out here as missing is Data Mesh [1]. It is an emerging pattern for complex data management and data exchange between multiple products and product components/services.

[1] https://martinfowler.com/articles/data-monolith-to-mesh.html

I wonder how many of those companies in the proposed architecture have A16z as investors?

I counted 6.

Fivetran, dbt, Preset (Superset/Airflow), Sisu, Imply and Databricks.

Though, as someone who's in this space a while, I think they did a decently fair job at articulating the 'modern' data infrastructure landscape.

The recent HN threads about excel made me think there's definitely room for a new kind of excel that works well for big data.

That's just SQL.

And there are dozens of charting/visualization/business-intelligence vendors to do whatever you want beyond or on top of that SQL structure.

> The recent HN threads about excel made me think there's definitely room for a new kind of excel that works well for big data.

Check out Google Connected Sheets: https://cloudblog.withgoogle.com/products/g-suite/connected-...

This, to me, is now the rate-limiting step in this architecture; there are probably 1000x as many people who can operate in Excel as people who can operate on a “data stack”. Yes, the fundamental goal of these data stacks is to enable insight and decisions “at scale”... but beyond that you have probably hundreds or thousands of employees who just need to do quick analyses for one-off decisions that can be handled by Excel. But there’s usually a benefit to those analyses being “operationalized” and integrated into the broader architecture... having a live connection to the central database, and having results piped back... so many Excel spreadsheets get emailed back and forth, completely out of the stack’s purview.

Will MS modify Excel 365 fast enough to meet this need? Will another spreadsheet program disrupt Excel’s dominance? Will another player come in with the ability to “ingest” arbitrary Excel files? Another major issue is Excel’s massive failure when it comes to handling uncertainty in data. I’ll be curious to see how it all plays out.

I remember reading about Looker, before Google bought them out. I never used it myself, but it may have fit the description.

Citation needed?

We connect all our sensors to an edge AI Server that handles sensor data, and only uploads to the cloud what’s actually relevant.

It works quite well, and there are many OEMs that offer such systems, with accelerators for inference, sensor data compression, 5G, etc.
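
What counts as "actually relevant" depends on the application. As one simple illustration (an assumption, not the commenter's actual setup), a deadband filter on the edge forwards a reading only when it moves far enough from the last value that was uploaded. The readings and threshold below are invented.

```python
def deadband_filter(readings, threshold):
    """Yield only readings that differ from the last uploaded value by more than threshold."""
    last_sent = None
    for ts, value in readings:
        if last_sent is None or abs(value - last_sent) > threshold:
            last_sent = value
            yield ts, value

# Hypothetical (timestamp, temperature) samples from one sensor.
READINGS = [(0, 20.0), (1, 20.1), (2, 20.2), (3, 21.5), (4, 21.6), (5, 19.9)]

uploaded = list(deadband_filter(READINGS, threshold=1.0))
```

Half of the samples never leave the device in this toy run; the trade-off is that anything discarded at the edge cannot be queried later.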

I considered this piece as sort of a loose validation that the Automunge library is filling an unmet need for data scientists. It is intended for tabular data preprocessing in the steps immediately preceding the application of machine learning.

Great article, but surprising that it does not mention or use the concept of DataOps. Even Gartner has recently written at length about the role of DataOps [1], and of course, we at Composable [2] are biased as they just named us a Cool Vendor in DataOps [3].

[1] https://www.gartner.com/en/documents/3970916/introducing-dat...

[2] https://composable.ai

[3] https://www.gartner.com/en/documents/3991447/cool-vendors-in...

What's the point of data hoarding? Intelligent systems in nature ingest the data, learn, and discard them
