
Data is not an asset, it’s a liability - markonen
https://www.richie.fi/blog/data-is-a-liability.html
======
cwp
He's on to something here, but I think the asset/liability duality isn't a
matter of your point of view. Code really _is_ a liability, even from the
high-level view in the boardroom. The asset is functionality. If you can get
more functionality with less code, you improve your balance sheet.

By analogy, data is also a liability. The asset is insight.

~~~
cgio
From an accounting perspective I am inclined to disagree. Firstly, the way you
present functionality as asset and code as liability, you do not make it easy
for me to read how they are components of a balance sheet. More specifically,
how do you see your balance sheet improving while still being balanced?

Furthermore, both assets and liabilities (including capital) are stock
measures. I would consider insight more of a flow than a stock. More
specifically, I think it is the process and tools of achieving insights that
is the asset. Whether data is part of this asset base is up for grabs, but
accountants have not identified any reasonable way to measure it.

"...the light that falls on to your eye, sensory information, is meaningless,
because it could mean literally anything. And what's true for sensory
information is true for information generally. There is no inherent meaning in
information. It's what we do with that information that matters." Beau Lotto

From a very interesting TED presentation.

[http://www.ted.com/talks/beau_lotto_optical_illusions_show_h...](http://www.ted.com/talks/beau_lotto_optical_illusions_show_how_we_see.html)

And I know I am confusing information with data, but no one has a convincing
definition to separate between the two.

~~~
demian
How about "data is the n-dimensionally measured point and information is data
plus some reference and dimensional axes" as a working definition?

~~~
cgio
I don't know if I understand properly, but are the dimensions not at least
implicitly defined if we have an n-dimensionally measured point? Any additional
dimension would either be irrelevant or would require additional data to
populate it.

------
zeveb
The reason to collect everything (or rather, more things than you think you'll
need to answer the questions you think you'll need to answer) is that…you
could be wrong about what data are required to answer the questions you've
identified, and that you could be wrong about which questions you will care
about.

And historical data _can_ be extremely useful, e.g. when looking at seasonal
or cyclical trends (woe betide the grocer who doesn't stock up on turkeys in
mid November, which means woe betide the grocer who doesn't order turkeys in
the summer).

Yes, he's _absolutely_ right that data management and privacy impose a cost:
the cost of storing a GB of data is more than S3's 10¢. It takes a human being
to make a judgement call about whether this data is likely to be worth its
overall cost and risk. That's why managers and other decision-makers get paid
what they get paid. 'Data is a liability' is a nice soundbite, but it doesn't
capture the full reality; one can't manage to soundbites.

~~~
oconnore
This is a strange critique. His recommendation isn't a soundbite. It is: ask
questions, figure out what you need to know to answer those questions, and
then collect that data.

That (asking questions and making informed decisions) sounds like exactly what
those "managers and decision makers" ought to be doing in the first place. To
collect everything because you can't decide on a strategy is exactly the
opposite.

 _(Re: turkeys and fairly complex retail sales cycles. A year or three of
aggregated purchase history is likely to yield actionable insight. It's
unlikely that a slight upward trend in turkey purchases since 2006 is very
actionable.)_

~~~
bhntr3
Old data is possibly uninteresting. That's debatable. But all uncollected data
is freshly uncollected data.

I strongly believe in collecting as much granular data as possible. It's
totally possible to throw it away after it has outlived its likely usefulness.

But if I want to calculate a new machine learning signal today, I can't wait
six months to accumulate enough data. I want that data to exist. And the only
solution is to overcollect. And to collect at a granular level so I can
aggregate and transform later.

The liability and security concerns are real. The fact that most companies are
stupid about investing in unnecessary big data infrastructure is real too. But
a recommendation to spool to cheap offline or nearline storage immediately is
interesting. A recommendation to throw data away seems like folly to me.

~~~
oconnore
If your operation is the sort that capitalizes on machine learning on subtle
signals, you capture the sorts of data that you know might be beneficial 6
months down the road. Although I suspect that you overstate the sort of
"surprise questions" that might come up (and also the granularity necessary to
answer them), that is a correct response to a business model that demands
those insights at that granularity.

That doesn't mean that "collect everything at max granularity" is good advice,
because as you said:

> The fact that most companies are stupid about investing in unnecessary big
> data infrastructure is real

~~~
yummyfajitas
The fact of the matter is that I've never been unhappy about overcollecting
data. Worst case, step 1 of my pipeline is 10x or 50x slower than it needs to
be due to filtering out a bunch of junk. The added latency to my workflow
might be a few minutes.

Every time I've undercollected I've been unhappy, and this was hardly a rare
occurrence. I need to build the collector, deploy it, and wait for data to
flow in. Added latency = 1 week, minimum.

You can always throw useless stale data away. You can never retroactively
collect data you needed.

------
colin_mccabe
This article is highly misleading. Sure, certain kinds of data (like credit
card data or medical data) can be a liability if not properly managed. If you
are working in one of those industries you already know that (although
arguably many financial and medical companies underestimate the risks).

In contrast, data about what users have clicked, comments on an online forum,
or who has friended who on your social media site is not a liability. In many
cases this data is already public. Even in cases where it's not, it is usually
incredibly boring for any hacker to steal, like what web pages John Doe has
clicked on in the last 10 minutes. Hackers are going to go after stuff like
social security numbers or credit card numbers, not data about the average
length of time people spent looking at Pepsi Inc. web page layout A versus web
page layout B.

The argument that you should only store "the data that you need" is a circular
one. How do you know what data you need? Well, you run an analysis. How do you
run an analysis? Using the data you have. Short-sighted policies like throwing
away historical data, as the article recommends, effectively blind you to
long-term trends.

~~~
x5n1
The bigger problem is the nature of this sort of data. All of this sort of
data needs to be invalidated all the time and provided as some sort of hash
rather than actual data that is usable.

~~~
roymurdock
Check out Enigma [1]. It allows a 3rd party to perform computations on
encrypted data using distributed, secure multi-party computations.

Thing is it's at least 20x slower than computing over plaintext, and nodes
will have to be compensated with fees, much like Bitcoin miners.

Still, for safety critical industries, this could be a very useful tool.
Especially if the government steps in and mandates such a protocol.

Here's a more in-depth write-up I did a few weeks back if you're interested:
[https://www.linkedin.com/pulse/iot-use-cases-enigma-homomorp...](https://www.linkedin.com/pulse/iot-use-cases-enigma-homomorphic-encryption-roy-murdock?trk=prof-post)

[1]
[http://enigma.media.mit.edu/enigma_full.pdf](http://enigma.media.mit.edu/enigma_full.pdf)

------
roymurdock
It depends on what industry you're in.

So JPMorgan took a bit of a reputation hit when hackers compromised the data
of 83m of the company's clients. [1] The financial repercussions were minimal,
if there even were any to begin with. You can bet they still collect and store
all the data they can. They don't really care, and customers don't really care
all that much either.

On the other hand, we have medical data, which is essential for pharmaceutical
and academic research but would be very, very harmful in the wrong hands. A
non-compliant company will get smacked with heavy fines under HIPAA for not
safeguarding data in a strict, standardized manner.

Until government regulation makes data breaches substantially costly for
Company X (Target, Adobe, LastPass, Office of Personnel Management, etc.),
Company X will continue to gather as much data as possible.

It's an asset with unbounded upside (who knows what great economic engine data
might fuel in 5-10 years) and no financial downside because it carries no
legal risk, and very minimal storage costs.

[1] [http://dealbook.nytimes.com/2014/12/22/entry-point-of-jpmorg...](http://dealbook.nytimes.com/2014/12/22/entry-point-of-jpmorgan-data-breach-is-identified/?_r=0)

------
tomlock
>>Think this way for a while, and you notice a key factor: old data usually
isn’t very interesting. You’ll be much more interested in what your users are
doing right now than what they were doing a year ago. Sure, spotting trends in
historical data might be cool, but in all likelihood it isn’t actionable.
Today’s data is.

Uhhhh, really? There have been a lot of times in my past where, as an analyst,
I wished I had a bunch of historical data to measure the seasonality of
trends. Am I the only one who baulks at this comment? Is this a startup-
oriented perspective?

------
pisipisipisi
Cultural difference: In US, your customer DB is an asset like anything else.
In EU, your customer database is a liability.

~~~
kuschku
One could, cynically, break it down to

> In the US, your customers are an asset. In the EU, your customers are people
> with rights

because that’s what this is about. And I, personally, have to fully support
the EU view here, which is that the privacy of a person is more important than
the profits of a company.

------
mizzao
Another way of looking at the article is that it's more important to have the
_right_ data than just having lots of data. So it's very important to think
about the questions that one would want to answer and design data collection
to capture the answers, rather than just going ahead and storing everything in
a brute force way.

~~~
rue
But having lots of data improves your chances of having the right data _when
you figure out what it is_.

~~~
rodgerd
Like how all the data in the NSA, TSA, and FBI dragnets stopped 9/11, the Boston
bombings, and Christian terrorists shooting up doctors and churches?

Oh, no, it doesn't. Not least because there's so much data even they have
trouble working out what to do with it all.

~~~
kuschku
One of the arguments during the Boston bombings was that "if Boston had, like
NYC, a fully integrated CCTV system automatically filming everything and
allowing to track people across the city, we could have found the attackers
more easily".

One has to seriously think about this. We, as a society, are signing away some
of our most important rights for a statistically insignificant boost in
security.

I live in Germany, but even here we still feel these changes, which were
started with 9/11.

It's now the 14th anniversary of 9/11, and I have to say, long-term, Bin Laden
has won. The western world has given up so many rights for the war on
terrorism.

Giving up privacy just to collect more data, so we feel safer (while the data
is just thrown away) is just another step where we forget that people are
actually, well, people. With rights.

------
ohitsdom
A $125 billion market for big data solutions is NOT a sign that data is a
liability. What is the value of that data?

I'm also skeptical of the compliance/privacy argument. If you're collecting
any data at all, it's a potential liability. The volume of the data doesn't
change the risk level much.

~~~
dredmorbius
False premise.

This assumes that:

1\. The externalities cost of data exposure is fully realised. What is Ashley
Madison's liability for suicides of members attributed to its data leak? Or of
the _multi-generational_ impacts of fractured families (immediate plus
affected children, possibly parents of spouses who split on account of the
breach). A frequently leveled complaint about "remediation" for data disclosure
is that it does little to address the full costs to those victimised by it.

AM is all the more interesting in that, as Annalee Newitz's investigative
reporting reveals: not only were personal data collected, but (at least for
straight males), it _wasn't_ for the stated purpose: "Ashley Madison created
more than 70,000 female bots to send male users millions of fake messages,
hoping to create the illusion of a vast playland of available women." In fact,
it was a feeder system to a set of bots, "affiliates", and, it seems, escort
services.

[http://gizmodo.com/ashley-madison-code-shows-more-women-and-...](http://gizmodo.com/ashley-madison-code-shows-more-women-and-more-bots-1727613924)

2\. It assumes the market is rational. Enron's market capitalisation was $60
billion on December 31, 2000, prior to its collapse. The share price fell from
$90.75 to less than one dollar by November 2001; its bankruptcy filing was
dated December 2, 2001.

The larger problem: what is seen cannot be unseen.

An exceptionally peculiar aspect of digital data is that, while it may remain
in the boxes and cages provided for it, it's got a notable tendency to find
itself liberated. Often without warning, and not detected for days, weeks,
months, or longer, afterward (as in this case). In the real world we've got
friction, especially associated with data processing and transfer. In digital
form, far less so. Sometimes friction is good.

Finally: that's the market for _tools to analyse the data_, not for the data
itself. Big Data is a current business fad. Companies are told they must
capitalise on Big Data, so they buy miracle solutions to do that. Some can,
and in cases spectacularly. Many cannot -- the added value-per-customer is
small, or, in the case of breach, negative.

~~~
icebraining
_1\. The externalities cost of data exposure is fully realised._

No, it only says that _because_ it isn't fully realised, the data is not a
liability as it would be otherwise.

------
dredmorbius
I'm quite happy to see others starting to recognize this. It's a problem, as
someone who's dealt with "big data" since the early 1990s, that I've been
quite well aware of for several decades.

An exceptionally peculiar aspect of digital data is that, while it may remain
in the boxes and cages provided for it, it's got a notable tendency to find
itself liberated. Often without warning, and not detected for days, weeks,
months, or longer, afterward (as in this case). In the real world we've got
friction, especially associated with data processing and transfer. In digital
form, far less so. Sometimes friction is good.

What you almost always want to do is to _roll data up to non-individualised
aggregates_ as soon as practically possible. The rest is just dry powder
waiting for a spark.

[https://www.reddit.com/r/dredmorbius/comments/3hn4r5/on_the_...](https://www.reddit.com/r/dredmorbius/comments/3hn4r5/on_the_media_asks_what_can_we_learn_from_ashley/)
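
A minimal sketch of the roll-up idea above, assuming raw events arrive as dictionaries with `user_id`, `timestamp`, and `action` fields. Those field names, the sample events, and the `min_count` cutoff are all illustrative assumptions, not anything taken from a real system. The point is only that the aggregate keeps the counts and drops the identifiers.

```python
from collections import Counter
from datetime import datetime


def roll_up(events, min_count=1):
    """Collapse per-user events into (day, action) counts and drop user ids.

    Cells with fewer than min_count events are dropped, because a count of 1
    can still point at a single person.
    """
    counts = Counter()
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        counts[(day, event["action"])] += 1
    return {key: n for key, n in counts.items() if n >= min_count}


# Made-up raw events; the field names are illustrative only.
raw_events = [
    {"user_id": "u1", "timestamp": "2015-09-10T08:15:00", "action": "login"},
    {"user_id": "u2", "timestamp": "2015-09-10T09:30:00", "action": "login"},
    {"user_id": "u1", "timestamp": "2015-09-10T09:45:00", "action": "purchase"},
    {"user_id": "u3", "timestamp": "2015-09-11T11:00:00", "action": "login"},
]

daily = roll_up(raw_events)
print(daily)
```

Once `daily` exists, the raw events can be deleted. What is lost is the ability to ask per-user questions later, which is exactly the trade-off being argued about in this thread.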

------
mmaunder
Totally agree. Was going to add that an expanding schema is a liability but
that seems self evident with the comments about code expansion.

If you've ever built a very busy application you know the truth of this. At
first the massive access to data you have seems like a gift and you'll log it
all "just in case". You might even brag about the volumes of data you have and
speculate about their value to investors and internally. But eventually you
realize the risk and cost and cost of managing the risk.

------
tarr11
_Private customer data_ often includes a liability, but not for the reasons
that OP states. The liability is that companies have an ongoing obligation to
their customers to protect their private data. However, a lot of data does not
have this liability. If it gets published to the world, it wouldn't matter.

This article portrays a common misunderstanding of the accounting terms
liability and asset. Just because something has a cost to maintain, it does
not make it a liability.

Code is an asset. Data is an asset. Businesses do not value assets at their
cost, as the article represents, but at their future economic value.

Asset:

"Things that are resources owned by a company and which have future economic
value that can be measured and can be expressed in dollars. Examples include
cash, investments, accounts receivable, inventory, supplies, land, buildings,
equipment, and vehicles."

Liability:

"Obligations of a company or organization. Amounts owed to lenders and
suppliers. Liabilities often have the word "payable" in the account title.
Liabilities also include amounts received in advance for a future sale or for
a future service to be performed."

~~~
Mz
_A liability can mean something that is a hindrance or puts an individual or
group at a disadvantage, or something someone is responsible for, or something
that increases the chance of something occurring (i.e. it is a cause)._

[https://en.wikipedia.org/wiki/Liability](https://en.wikipedia.org/wiki/Liability)

------
kainosnoema
This is one reason we decided at Cotap to purge messages after 14 days by
default, and only keep them longer if requested
([https://cotap.com/blog/customizable-data-retention-for-busin...](https://cotap.com/blog/customizable-data-retention-for-business-messaging)).
Contrary to what one might expect, most users embraced the
change.
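
For anyone wondering what a default-purge policy like this looks like in code, here is a rough sketch. It is not Cotap's implementation. The message shape, the `team` key, and the per-team override table are assumptions made up for the example; only the 14-day default comes from the comment above.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=14)  # the default mentioned above


def purge_expired(messages, now, retention_by_team=None):
    """Return only the messages still inside their team's retention window.

    retention_by_team is a hypothetical override table: teams that asked for
    longer retention map to their own timedelta.
    """
    retention_by_team = retention_by_team or {}
    kept = []
    for message in messages:
        window = retention_by_team.get(message["team"], DEFAULT_RETENTION)
        if now - message["sent_at"] <= window:
            kept.append(message)
    return kept


now = datetime(2015, 9, 15)
messages = [
    {"id": 1, "team": "sales", "sent_at": datetime(2015, 9, 10)},  # 5 days old
    {"id": 2, "team": "sales", "sent_at": datetime(2015, 8, 20)},  # 26 days old
    {"id": 3, "team": "legal", "sent_at": datetime(2015, 8, 20)},  # 26 days old
]
kept = purge_expired(messages, now, {"legal": timedelta(days=90)})
print([m["id"] for m in kept])
```

Message 2 is dropped under the default window, while message 3 survives because its team opted into longer retention.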

------
Tloewald
Somewhat in this vein:

Years ago I worked at a large advertising network that was concerned about
fraudulent impressions. E.g. Ads placed "under the fold" or hidden or behind
stuff or otherwise generating impressions that weren't real.

I suggested we could build a small piece of supplemental ad code that would
load alongside one of our ads in a row page and "look around" — see where ads
were placed and so on.

The idea was rejected because it would create too much data. I argued that we
could trigger the fraud detection code once per n impressions with n being 100
or 1000 and still be able to identify fraudulent sites with statistical
certainty (our problem would be false negatives) but they couldn't wrap their
heads around merely sampling enough data to answer a question rather than ALL
the data, so the idea was rejected.

Of course it's also highly likely that they didn't actually want to detect
fraud.
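
A rough sketch of why sampling would have been enough, assuming the check runs on one impression in every n and records whether the ad was hidden. The site names, counts, and 5% threshold are invented for illustration, and the Wilson score interval is just one reasonable choice of confidence interval, not what anyone at the network proposed.

```python
import math


def should_inspect(impression_index, n):
    """Run the (hypothetical) placement check on one impression out of every n."""
    return impression_index % n == 0


def wilson_interval(hits, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = hits / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def flag_sites(sampled, threshold):
    """sampled maps site -> (hidden_count, inspected_count).

    A site is flagged only when even the lower bound of its interval is above
    the threshold, so mistakes lean towards false negatives.
    """
    flagged = []
    for site, (hidden, inspected) in sampled.items():
        low, _high = wilson_interval(hidden, inspected)
        if low > threshold:
            flagged.append(site)
    return sorted(flagged)


# Made-up numbers: each site served 1,000,000 impressions, 1 in 1000 inspected.
inspected = sum(should_inspect(i, 1000) for i in range(1_000_000))
sampled = {"site-a": (300, inspected), "site-b": (12, inspected)}
print(inspected, flag_sites(sampled, threshold=0.05))
```

With these invented counts only `site-a` is flagged, from 1000 inspected impressions instead of a million.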

------
lighthawk
"Then you collect the data you need (and just the data you need) to answer
those questions."

If you are providing something to anonymize activity, then sure, if it is
legal. And you want to store as little data as possible that would be
hazardous if it was made public. But for everything else, it's probably not a
good idea to have this attitude.

There are many questions you don't know you need to answer until you need the
answers, and by then it would be good, or even necessary, to have the data
historically. For
example, auditing changes made to the system or data, logging some requests
and responses, tracking user behavior for analysis by marketing, etc. Over
time, depending on the site/service, you may want all those things and more.

------
rdlecler1
Maybe you don't have the resources to analyze the data today, but you still
need to collect it for when you do. A history of data can give you a null
hypothesis to work with. When we see a drop in traffic in August is it because
we're losing relevance, or because August tends to be a slow month?
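
A small sketch of the null-hypothesis idea above, assuming you kept July and August visit totals for earlier years. The traffic numbers and the two-standard-deviation cutoff are made up for the example, and with only a handful of years the spread estimate is rough at best.

```python
from statistics import mean, stdev


def august_drop_is_unusual(history, this_year, z_threshold=2.0):
    """history: list of (july_visits, august_visits) from earlier years.

    Returns True when this year's August/July ratio is more than z_threshold
    standard deviations below the historical mean ratio.
    """
    ratios = [august / july for july, august in history]
    baseline, spread = mean(ratios), stdev(ratios)
    july, august = this_year
    current = august / july
    return (baseline - current) > z_threshold * spread


# Hypothetical history: August has usually run at about 80% of July.
history = [(1000, 800), (1200, 990), (1500, 1170), (1800, 1480)]
print(august_drop_is_unusual(history, (2000, 1600)))  # in line with past Augusts
print(august_drop_is_unusual(history, (2000, 1200)))  # far below past Augusts
```

Without the stored history there is no baseline ratio to compare against, so the two cases cannot be told apart.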

------
eanzenberg
Data is absolutely an asset. All decisions are driven by data (could be
personal, biased, anecdotal, etc.), so why not make decisions based on more
data points? It's cute to think you can "ask questions first, then collect"
but this wastes time in 2015. Imagine being asked at Netflix "calculate the %
watch through of 18-21 year olds for kids movies", then a week later "for
action movies in the 1980s", as opposed to having the viewing info for all your
users before these questions arise.

BTW, users DO want you to use their data to improve your service. Otherwise
google, facebook, twitter, netflix, etc. would not be as successful as they
are. Liability only comes into play when OTHER parties access (legally or
illegally) your data.

~~~
kuschku
Well, this collecting data is exactly why the US tech industry is considered
to be reckless with privacy and safety of data.

And court cases have shown that the "we record everything" in the ToS is legally
null and void – only stuff the user directly expects to be recorded is legal.

If you are a flashlight app, and in your ToS is "we’ll send all your contacts
to a third-party company", then this is void, and if you do so, the user can
sue you. As has happened with Facebook in the EU several times.

This reckless "we’ll record everything" is why the EU is planning their new
"you can't record anything unless directly allowed to, and you can not sell it
or give it away to any third-party entities, not even governments" law. (Well,
the last part was thanks to the NSA).

This attitude hurts the US tech industry.

~~~
eanzenberg
I am perfectly fine with that, in fact I expect it when using services like
google and facebook. I expect that facebook is crunching my data and feeding
me ads, while the advertisers never see who I am or what I'm about.

~~~
kuschku
Well, it's simple: If people are outraged when you tell them what facebook
actually stores, then it's not okay.

And the outrage exists.

I do not think anyone should store anything about me unless I explicitly want
to give it to them.

One example is Google's horrible traffic tracking system. In my city every
lane of every intersection has a separate inductive loop to measure traffic
congestion, speed and amount. Also every few hundred meters between
intersections. Perfect data, and you only collect as much as necessary.

In comparison, Google's system: Track the location of everyone, and then check
if people are moving slowly on a road in a car.

One of these stores an Orwellian amount of information, the other stores just
as much as necessary.

Anyway, as a German, I hate this type of data collection. Yes, I openly admit,
I hate the business model of the US startup industry. And I hope this fad
stops very soon, as currently we have a truly dystopian nightmare of data
collection.

Just think about what a less-democratic government could do if they'd get
elected in the US in a few years. They could seize Google's data, force them
to comply (like the NSA already did) and then use the data against the people.

You have location data of almost everyone, and not just for now, but for every
day for the past 5 years. With 5 minutes accuracy. You have banking data and
access, emails, browser and search history.

This is too much power that rests in a single place. Potential for abuse is
insane.

------
readams
It's definitely an asset, but it's potentially a dangerous one. It might be
like owning enriched uranium fuel pellets. Extremely valuable asset that can
power a society, but dangerous if it falls into the wrong hands or is allowed
to contaminate the environment.

------
lcnmrn
Public data is an asset. Private data is a liability.

~~~
dragonwriter
> Public data is an asset. Private data is a liability.

I think, more precisely:

 _Useful_ data is an asset. _Legally sensitive_ (e.g., because of privacy
laws) data is a liability. Those categories overlap, and data in the overlap
may be a net asset or a net liability depending on the specific utility it
provides balanced against the costs (both certain compliance costs and risks)
associated with the specific legal protections.

This isn't really unusual -- the same thing is true of pretty much everything
a business (or individual) might own. Real estate, for instance, is an asset
to the extent that it is useful, but owning (or even possessing as a tenant)
real estate comes with certain maintenance costs _and_ liability-related
risks.

------
m52go
First, hat-tip to Marko for running a business with integrity. I really like
the parallel between data and finance as it relates to privacy.

As it turns out, banks have a very similar history of making promises &
violating them. There are many parallels between banks & debt and data &
technology.

I wrote a post titled "Silicon Valley Data is the New Wall Street Debt" that
you folks may like:

[http://livefreeortry.com/2014/04/30/silicon-valley-data-
is-t...](http://livefreeortry.com/2014/04/30/silicon-valley-data-is-the-new-
wall-street-debt/)

------
BinaryIdiot
I think this is a pretty good, general rule. But as with everything in technology,
it's not true for everything. For instance, auditing and medical records.
Basically the only exceptions are going to be things not actionable in the
market.

Good read!

------
Lerc
If data is a liability aren't we all worse off for knowing that?

~~~
Mz
I read an interesting book once that distinguished clearly between data,
information, wisdom and perhaps a couple of other levels of this hierarchy.
Advice is usually not _data_. Data is raw, unprocessed information. Being
awash in it is not necessarily valuable, but wisdom is a different matter
altogether.

~~~
hewhowhineth
DIKW Pyramid
[https://en.wikipedia.org/wiki/DIKW_Pyramid](https://en.wikipedia.org/wiki/DIKW_Pyramid)
Now your turn. What's the name of the book? :)

~~~
Mz
Thanks!

Unfortunately, I don't recall the name.

------
tuananh
You make it work for you -> it's an asset. You leave it lying there doing
nothing -> it's a liability.

Perspective people.

------
qihqi
Well, Asset = liability + owner contribution. /s

------
MikeNomad
Data are an asset. Data custodians are the liability.

------
xcyu
TL;DR

Don't collect or store data you don't need yet.

------
yuhong
This reminds me of the RadioShack bankruptcy. I don't think the bankruptcy
process was designed for "selling" personal data, right?

------
crimsonalucard
Folks, Big data can cure cancer. Big data can also start world peace. Google
it. I'm srs.

------
luckydata
This person doesn't know much about data.

