Arent a lot of businesses being sold on "real time analytics" these days? That m...

lmkg · on Aug 16, 2023

I work as a web analyst (think Google Analytics).

One time I ran an A/B test on the color of a button. After the conclusion of the test, with a clear winner in hand, it took eleven months for all involved stakeholders to approve the change. The website in question got a few thousand visits a month and was not critical to any form of business.

This organization does not benefit from real-time analytics.

Now that's an extreme outlier, but my experience is that most organizations are in that position. The feedback loop from collecting data to making a decision is long, and real-time analytics shortens a part that's already not the bottleneck. The technical part of real-time analytics provides no value unless the org also has the operational capacity to use that data quickly.

I have seen this! I have, for example, seen a news site that looked at web analytics data from the morning and was able to publish new opinion pieces that afternoon if something was trending. They had a dedicated process built around that data pipeline. Critically, they had a specific idea of what they could do with that data when the received it.

So if you want a framework, I would start from a single, simple question: What can you actually do with real-time data? Name one (1) action your organization could take based on that data.

I think it's also useful to separate what data benefits from realtime and which users can make use of it. Even if you have real-time data, some consumers don't benefit from immediacy.

coredog64 · on Aug 16, 2023

Generally speaking “What questions do you hope to answer with this data?” is a good filter for all kinds of operational data.

iamacyborg · on Aug 16, 2023

Hate to say it but if your site was only getting a few thousand visitors a month your test was likely vastly underpowered and therefore irrelevant anyway

mrbungie · on Aug 16, 2023

Power is not just about sample size, but also (expected/previously informed by some other evidence) effect size. You can't make that conclusion without that.

iamacyborg · on Aug 16, 2023

For sure, but you’d need one hell of a good cta to be getting a sufficient effect size to warrant small samples.

slotrans · on Aug 16, 2023

Just gonna keep linking this til the heat death of the universe: https://mcfunley.com/whom-the-gods-would-destroy-they-first-...

Real-time analytics are worse than useless. At best they are a distracting resource sink, at worst they directly harm the quality of decision-making.

RyanHamilton · on Aug 18, 2023

I find saying "X is worse than useless" a bad approach to technology. I recommend you try and think, what is the pivot point to decide between these options? e.g. php and node js, when would I pick one over the other. It's rare for one technology to completely dominate another.

Eumenes · on Aug 16, 2023

From my experience (mostly startups), real time analytics is generally overkill, esp. from a BI perspective. Unless your business is very focused on real time data and transactional processing, you can generally get away with ETL/batch jobs. Show executives, product, and downstream teams some metrics that update a few times per day saves a ton of money over things like Snowflake/Databricks/Redshift stuff. While cloud services can be pricey, tools like dbt are really useful and can be administered by savvy business people or analyst types. Those candidates are way easier to hire compared to data engineers, sql experts, etc.

ozim · on Aug 16, 2023

For me it mostly is that business people don't understand OLAP vs OLTP and that if they add 5 items to database and they are visible in the system their "dashboard" will not update instantly but only after when data pipes run.

Which is hard to explain because if it is not instant everywhere they think it is a bug and system is crappy. Later on they will use dashboard view once a week or once a month so 5 items update is not relevant at all.

higeorge13 · on Aug 16, 2023

I work in a real time subscription analytics company (chartmogul.com). We fetch, normalize and aggregate various billing systems data and eventually visualize them into graphs and tables.

I had this discussion with key people and i would say it depends on multiple factors. Small companies really like and require real-time analytics: they want to see how a couple invoices translate into updated saas metrics or why they didn’t get a slack/email notification as soon asit happened. Larger ones will check their data less frequently per day or week, but again it depends on the people and their role. Most of them are happy with getting their data once per day into their mailboxes or warehouses.

But we try to make everyone happy so we aim for real time analytics.

mrbungie · on Aug 16, 2023

I think GP's point is that is not about the perceived value of real time data/analytics, but rather, its actual value. Decision makers may ask for RT or NRT, but most of the time won't make a decision or action in a timeframe that actually justifies RT/NRT data/analytics.

For most operations RT/NRT data stuff normally is about novelty/vanity rather than a real existing business need.

andrenotgiant · on Aug 16, 2023

The article is separating "operational" and "analytical" use-cases.

IIUC analytical = "what question are you trying to answer" and in analytics, RT/NRT is absolutely novelty/vanity. Operational = "what action are you trying to take" and it makes sense to want to have up-to-date data when, for example, running ML models, triggering notifications, etc...

mrbungie · on Aug 16, 2023

Yeah, totally. I should've specified "analytical operations", as in, updating dashboards and other non-time-critical data processing that eventually feed into decision making. That's were devs or decision makers asking for RT/NRT makes no sense.

jandrewrogers · on Aug 16, 2023

The term "real-time" is much abused in marketing copy. It is often treated like a technical metric but it is actually a business metric: am I making operational decisions with the most recent data available? For many businesses, "most recent data available" can be several days old and little operational efficiency would be gained by investing in reducing that latency.

For some businesses, "real-time" can be properly defined as "within the last week". While there are many businesses where reducing that operational tempo to seconds would have an impact, it is by no means universal.

chimerasaurus · on Aug 17, 2023

Disclaimer - Snowflake here.

I will just point out that when my team and I talk about streaming, we are focused on not real-time because in many cases, the value to a customer is not there. Not every "streaming" use case is fraud detection. In fact, we have been saying for awhile that for many streaming use cases, the value is 60 seconds < [value here] < 60 minutes.

Example: (and yes, this is a Snowflake video but has a visual) https://youtu.be/Ou04UZWwxgg?t=64

RyanHamilton · on Aug 18, 2023

The way I think about this is a typical up and to the right graph, of cost vs Speed. As you increase the speed, you increase the cost. So for real-time to be benefical to your business, you need to have be able to make more profit with data that is 1 second fresh vs 1 minute delayed. Looking at it like this, you can roughly group them: 50 milliseconds, 1 second, 1 minute, 1 hour, 1 day. In which area is your business for making profit? Uber showing a taxi location = 1 minute (perhaps fake it moving in between). Large electricity substation monitoring = 1 minute - assuming power down takes 5 minutes to commence. Trading with user interaction = 50ms. Then at each of those points are technology systems for deliverying that speed. I guess what some of these vendors are trying to do is change the shape of the graph. If they can bring the cost down massively, then more it may be worth Uber showing a 1 second update :) I know some users that watch the litle car obsessively.

hintymad · on Aug 17, 2023

Real-time analytics for human is not that useful. Human can't make decisions and take actions in minutes anyway, let alone seconds. A notable exception can be log analytics for operations, but I'd argue in that case throughput is more more important than a few seconds of latency. Case in point, CloudWatch Insights can consistently drive about 1GB/s of log scan. It's good enough for log search in practice.

On the other hand, real-time analytics for machines can be critical to a business, which is why Yandex built Clickhouse and ByteDance deployed more than 20K nodes of Clickhouse.

Just like using any technology, we need to figure out what problems we solve for real-time analytics first.

atwebb · on Aug 16, 2023

Real-time generally means near-real-time and even then I liken it to availability.

If asked people would say "I need to always be up" until they see the costs associated with it, then being out for a few hours a year tends to be ok.

datadrivenangel · on Aug 16, 2023

This is a great way looking at it. The cost starts going up rapidly from daily and approaches infinity as you get to ultra-low latency realtime analytics.

There is a minimum cost though (systems, engineers, etc), so for medium data there's often very little marginal cost up until you start getting to hourly refreshes. This is not true for larger datasets though.

riordan · on Aug 16, 2023

> In this context-- the section in article where it says present data is of virtually zero importance to analytics is no longer true. We need a real solution even if we apply those (presumably complex and costly) solutions to only the most deserving use cases (and not abuse them).

Totally agreed, though where real-time data is being put through an analytics lens is where CDW's start to creak and get costly. In my experience, these real-time uses shift the burden from being about human-decision-makers to automated decision-making and it becomes more a part of the product. And that's cool, but it gets costly, fast.

It also makes perfect sense to fake-it-til-you-make-it for real-time use cases on an existing Cloud Data Warehouse/dbt style _modern data stack_ if your data team's already using it for the rest of their data platform; after all they already know it and it's allowed that team to scale.

But a huge part of the challenge is that once you've made it, the alternative for a data-intensive use case is a bespoke microservice or a streaming pipeline, often in a language or on a platform that's foreign to the existing data team who's built the thing. If most of your code is dbt sql and airflow jobs, working with Kafka and streaming spark is pretty foreign (not to mention entirely outside of the observability infrastructure your team already has in place). Now we've got rewrites across languages/platforms, and leave teams with the cognitive overhead of multiple architectures & toolchains (and split focus). The alternative would be having a separate team to hand off real-time systems to and only that's if the company can afford to have that many engineers. Might as well just allocate that spend to your cloud budget and let the existing data team run up a crazy bill on Snowflake or BigQuery as long as it's less than the cost of a new engineering team.

------

There's something incredible about the ruthless efficiency of sql data platforms that allows data teams to scale the number of components/engineer. Once you have a Modern-Data-Stack system in place, the marginal cost of new pipelines or transformations is negligible (and they build atop one another). That platform-enabled compounding effect doesn't really occur with data-intensive microservices/streaming pipelines and means only the biggest business-critical applications (or skunk works shadow projects) will get the data-intensive applications[1] treatment, and business stakeholders will be hesitant to greenlight it.

I think Materialize is trying to build that Modern-Data-Stack type platform for real-time use cases: one that doesn't come with the cognitive cost of a completely separate architecture or the divide of completely separate teams and tools. If I already had a go-to system in place for streaming data that could be prototyped with the data warehouse, then shifted over to a streaming platform, the same teams could manage it and we'd actually get that cumulative compounding effect. Not to mention it becomes a lot easier to then justify using a real-time application the next time.

[1]: https://martin.kleppmann.com/2014/10/16/real-time-data-produ...