
Blockchains for Artificial Intelligence - trentmc
https://blog.bigchaindb.com/blockchains-for-artificial-intelligence-ec63b0284984
======
lappa
This really doesn't make any sense.

>blockchains introduced three new characteristics: centralized / shared
control, immutable / audit trails, and native assets / exchanges.

Blockchains aren't immutable, they are just expensive to mutate.

Blockchains aren't centralized.

Blockchains didn't introduce audit trails, these have always been possible
simply through having a transaction table that is only appended to. This of
course does require trust in the central authority.
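The append-only table the comment describes can be made tamper-evident with a hash chain, so that trust in the central authority is at least auditable after the fact. A minimal sketch (function and field names are mine, not from any particular system):

```python
import hashlib
import json

def append_entry(log, record):
    """Append a record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    log.append({"record": record, "prev": prev_hash,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify_log(log):
    """Recompute every hash; any in-place edit breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = json.dumps({"record": entry["record"], "prev": prev_hash},
                          sort_keys=True)
        if entry["prev"] != prev_hash or \
           entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"tx": "alice pays bob 5"})
append_entry(log, {"tx": "bob pays carol 2"})
assert verify_log(log)           # untouched log verifies
log[0]["record"]["tx"] = "alice pays bob 500"
assert not verify_log(log)       # any mutation is detectable
```

This still requires trusting the authority not to rewrite the whole chain; publishing the latest hash externally is the usual mitigation.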

>(4) Leads to provenance on training/testing data & models, to improve the
trustworthiness of the data & models. Data wants reputation too.

Is training set fraud really an issue in training AI?

>(1) Leads to more data, and therefore better models. >(2) Leads to
qualitatively new data, and therefore qualitatively new models. >(3) Allows
for shared control of AI training data & models.

The author has a poor understanding of both AI and blockchain technology[1].
Blockchains are for decentralized consensus, but it seems the author is
vaguely proposing using a blockchain as a mass datastore (with ownership
labels) for both training data and AI algorithms.

Of course, AI is an exciting field, which means you can generate hype by
implying that, until now, the field of AI had yet to solve the problem of
sharing data with fellow researchers.

[1]
[https://download.wpsoftware.net/bitcoin/alts.pdf](https://download.wpsoftware.net/bitcoin/alts.pdf)

~~~
trentmc
Hi, it's the author here.

> This really doesn't make any sense.

Disagree. Below, I respond to each of your points.

> Blockchains aren't immutable, they are just expensive to mutate.

Agreed; very little is truly immutable. It's all shades of grey. I
actually prefer the word "tamper-resistant" and I usually say that next to the
"immutable" definition, such as in the first paragraph of
[https://bigchaindb.com/whitepaper](https://bigchaindb.com/whitepaper). But
"immutable" makes for a good shorthand, especially because that's the label
that the community uses.

> Blockchains aren't centralized.

Oops, that was a typo. I meant to say "decentralized". Fixed it. (That was a
pretty big oops!)

> Blockchains didn't introduce audit trails ...

Correct.

> ... [status quo] require[s] trust in the central authority

Exactly. And it's crucial to note that once you don't have to trust a central
authority to store your audit trail, you have a way more trustworthy audit
trail that improves many applications and unlocks new ones.

> The author has a poor understanding of both AI and blockchain technology ...
> "treatise on altcoins"

Disagree. First, you don't need an "altcoin" to have a blockchain. To
understand what's special about blockchains, you first have to understand
what already exists for distributed databases, then what the delta is.

Second, consensus in a distributed database has been around for decades;
Satoshi did not invent it. Lamport laid down much of the theory for FT and BFT
consensus in 1982. What Bitcoin brought forward, in addition to BFT-ish
consensus, was Sybil tolerance (addressing attack-of-the-clones).

Third, as for my understanding of AI: I've been doing it professionally since
the late 90s; here are my publications:
[http://trent.st/publications](http://trent.st/publications). Doing AI in the
90s was one of the least popular things one could possibly do, so I certainly
didn't do it for the hype.

> vaguely proposing using a blockchain as a mass datastore (with ownership
> labels)

Obviously I gave much more specific proposals than that. I have other writings
that dive into more detail on some of the use cases, such as an IP registry
[1] and for AI DAOs [2].

[1] [https://medium.com/ipdb-blog/a-decentralized-content-registry-for-the-decentralized-web-99cf1335291f](https://medium.com/ipdb-blog/a-decentralized-content-registry-for-the-decentralized-web-99cf1335291f)

[2] [https://medium.com/@trentmc0/ai-daos-and-three-paths-to-get-there-cfa0a4cc37b8](https://medium.com/@trentmc0/ai-daos-and-three-paths-to-get-there-cfa0a4cc37b8)

~~~
pdpi
> First, you don't need an "altcoin" to have a blockchain.

Hmm, you're treading on thin ice here. What exactly do you mean by a
"blockchain", then? I tend to work on the assumption that people are talking
about something closer to the whole Satoshi consensus stack on top of hashed
linked lists (the _actual_ blockchain), rather than just the hashed linked
list data structure. Which segues into -- Satoshi consensus was designed
against a stringent set of requirements, and the security model for the
algorithm he devised absolutely demands a miner reward of some sort that's
denominated in the same currency as that being used in transactions (in those
circumstances, double spend attacks can be proved irrational once a
transaction is deep enough inside the chain).

I'm not saying you're flat-out wrong, but you do need to specify which
properties of Satoshi consensus you are willing to discard to make that
statement true.

~~~
trentmc
Throw a rock in the blockchain space and you'll find a different definition of
blockchains, or what it means for one thing to be a blockchain or not. It
doesn't really help anyone to argue for hours on end over this, when it's
largely a matter of opinion. To me, what's more interesting is to identify
what are the new characteristics (in terms of benefits) that blockchains have,
above and beyond traditional distributed databases; which in turn unlock new
applications or improve existing applications. To me, those characteristics
are: decentralized, immutable, assets; I describe them in the article, and
also in more detail in
[https://bigchaindb.com/whitepaper](https://bigchaindb.com/whitepaper).

~~~
pdpi
> To me, those characteristics are: decentralized, immutable, assets;

How do you achieve fully trustless decentralisation without the currency
aspect, while still being resilient to Sybil attacks? Or are you willing to
sacrifice trustlessness? In which case, how are you defining decentralisation?

~~~
trentmc
I define "decentralized" as "no single entity owns or controls". It can be
further distinguished with "server-based decentralization" and "server-free
decentralization" [1].

You only need to be Sybil tolerant if you want your validating nodes to be
anonymous. There's certainly some applications where that's useful. But it's
not a requirement for being decentralized. (Some will argue otherwise, and
that's ok; once again it depends how you define "decentralized"; my approach
is about what benefits come to the application.)

[1] [https://blog.bigchaindb.com/the-dcs-triangle-5ce0e9e0f1dc](https://blog.bigchaindb.com/the-dcs-triangle-5ce0e9e0f1dc)

~~~
pdpi
Care to link to sources for server-based and server-free decentralisation that
aren't your own blog posts?

(Also, that blog post you linked is flat out wrong -- Bitcoin, Ethereum, et al
are all instances of eventually consistent systems that can and do prevent
double spends)

Still -- if you're willing to sacrifice anonymity for the sake of avoiding
Sybil attacks, then how do your nodes federate without a central authority?

This is why having precise definitions matters: With each question I ask,
we're eroding away the guarantees that such a system provides, and the
implementation requirements with them. At what point do we just give up on
"blockchains", and just adapt a run of the mill distributed log-based db
instead, which gives you BFT and immutability, and where the asset layer is
easy enough to add to the top?

~~~
trentmc
> Care to link to sources for server-based and server-free decentralisation
> that aren't your own blog posts?

The best precedent is simply the long-standing difference between servers and
clients in computing systems.

In blockchain discourse, this difference hasn't been acknowledged as much,
though of course a similar pattern exists between full nodes and light/SPV
clients.

To my knowledge, no one else had drawn the distinction between "server-based"
and "server-free" as different types of decentralization. It is a useful
distinction, as the article discusses.

> Also, that blog post you linked is flat out wrong -- Bitcoin, Ethereum, et
> al are all instances of eventually consistent systems

As the article states, they can and do prevent double spends, so we agree
there. But that's not what "consistent" means in a CAP setting. As the article
states, "they never have a deterministic guarantee of a consistent order;
they’re only eventually consistent (in a probabilistic sense). But let’s be
generous and call them consistent, because in practice they are used that way,
the workaround being higher latency as one waits for a sufficiently high
probability of avoiding inconsistency."
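The "wait for a sufficiently high probability" workaround can be made quantitative. Section 11 of the Bitcoin whitepaper gives the probability that an attacker with a given share of hash power ever catches up from z blocks behind; here is a direct Python transcription (the function name is mine):

```python
import math

def attacker_success_probability(q, z):
    """Probability an attacker with hash-power share q ever overtakes a
    chain that is z blocks ahead (Bitcoin whitepaper, section 11)."""
    p = 1.0 - q
    if q >= p:
        return 1.0                      # majority attacker always wins
    lam = z * (q / p)                   # expected attacker progress
    s = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        s -= poisson * (1 - (q / p) ** (z - k))
    return s

# With a 10% attacker, waiting 6 confirmations leaves ~0.02% risk:
assert attacker_success_probability(0.1, 6) < 0.001
```

This is exactly the latency-versus-consistency trade the comment describes: more confirmations, more waiting, lower probability of reversal.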

> how do your nodes federate without a central authority?

Each node votes on any transaction coming through. The transaction only clears
if it gets enough positive votes.
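The voting rule can be sketched in a few lines. The two-thirds threshold below is my illustrative choice (typical for BFT-style settings), not a statement of any particular system's rule:

```python
def transaction_clears(votes, quorum=2/3):
    """A transaction clears once strictly more than `quorum` of the
    federation's nodes vote it valid. `votes` maps node id -> bool."""
    yes = sum(1 for v in votes.values() if v)
    return yes > quorum * len(votes)

federation_votes = {"node1": True, "node2": True, "node3": True, "node4": False}
assert transaction_clears(federation_votes)        # 3 of 4 > 2/3
assert not transaction_clears({"node1": True, "node2": False, "node3": False})
```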

> just adapt a run of the mill distributed log-based db instead ...

Well, in some cases that's all people actually need; I sometimes find myself
referring people to Kafka and the like.

But Kafka and the like are still controlled by a single admin; you can do more
to decentralize. As for immutability, it's all shades of grey, and certainly
being a log-based db (append-only) helps a lot. You can do more with Merkle
DAGs, continuous backup to write-once media, etc.
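The Merkle idea in a nutshell: hash the entries pairwise up to a single root, so one small hash commits to the whole log. A minimal sketch (odd levels duplicate the last node, one common convention among several):

```python
import hashlib

def sha256(b):
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Compute a Merkle root over a list of byte strings. Changing any
    leaf changes the root, making the log tamper-evident."""
    if not leaves:
        return sha256(b"")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

entries = [b"tx1", b"tx2", b"tx3"]
root = merkle_root(entries)
assert merkle_root(entries) == root                     # deterministic
assert merkle_root([b"tx1", b"tx2", b"txX"]) != root    # tamper-evident
```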

To me it's not about "eroding" guarantees. It's about saying "ok, I have this
database, what properties do I want?" The potential results might be
blockchain-like or not. If decentralization, immutability, or assets are
potentially interesting, then a blockchain technology could be interesting.
Otherwise it comes down to other questions to choose among traditional DBs.

~~~
pdpi
> As the article states, they can and do prevent double spends, so we agree
> there. But that's not what "consistent" means in a CAP setting.

You're the one who asserted that CAP-style consistency is required to prevent
double-spends. Eventual consistency is a weakened form of consistency in the
CAP sense. Both Bitcoin and Ethereum have eventual consistency as much more
than a "theoretical" concern. In your own words: "But let’s be generous and
call them consistent, because in practice they are used that way, the
workaround being higher latency as one waits for a sufficiently high
probability of avoiding inconsistency." The only way this is true is if you
accept latencies measured in hours. For real-world applications, you
absolutely need to deal with the eventual consistency (and, in fact, I've
written several applications that deal with precisely that).

> Each node votes on any transaction coming through. The transaction only
> clears if it gets enough positive votes.

I didn't ask how you establish consensus. I asked how the nodes federate -- if
a node tries to peer with you, how do you decide whether or not to accept the
node? You suggested anonymity is out the window, so who controls node
identity?

~~~
trentmc
> You're the one who asserted that CAP-style consistency is required to
> prevent double-spends

Could you point me to where? I like my thinking and expression to be
consistent (pun intended:)

> I asked how the nodes federate .. who controls identity?

Each node has a list of the public keys of other nodes. There are various ways
to handle key distribution, of course.
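The federation check itself is then trivial: accept an inbound peer only if the key it authenticates with is on the configured list. This hypothetical sketch elides the actual authentication handshake (TLS, signature challenge, etc.); the key strings are placeholders:

```python
# Hypothetical federation allowlist: each node is configured with the
# public keys of the other federation members.
KNOWN_NODE_KEYS = {
    "node1": "ed25519:3f9a...",   # placeholder key strings
    "node2": "ed25519:81bc...",
}

def accept_peer(presented_key, known_keys=KNOWN_NODE_KEYS):
    """Accept a peer iff its (already-authenticated) key is on the list."""
    return presented_key in known_keys.values()

assert accept_peer("ed25519:3f9a...")
assert not accept_peer("ed25519:deadbeef")
```

As pdpi notes in the reply below this exchange, the interesting question is who maintains that list, not the check itself.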

~~~
pdpi
From the blog post you linked three comments up or so:

> Big “C” means all nodes see the same data at the same time. Being big-C
> consistent is a prerequisite to preventing double-spends, and therefore
> storing tokens of value. There are many models of consistency; we mean
> “consistent” in the same way that the CAP theorem does, except a little bit
> more loosely (for pragmatism). Little “c” means strong eventual consistency,
> such that when data gets merged there are no conflicts, but not consistent
> enough to prevent double spends.

> there are various ways to handle key distribution, of course

Right. Having the public keys advances the discussion precisely nothing,
because public key auth is pretty much a given if we're talking about the
servers identifying themselves.

What tells us if we have a decentralised system with no central authority is:
who's the gatekeeper? Who controls key distribution, and which keys are
accepted into the pool?

------
kordless
> They’re Software-as-a-Service on steroids.

I've been writing about this for a while and I'm thankful someone else "gets
it" on some level. The key to understanding why blockchain data structures are
a big deal for software (including AI) is realizing how they enable change in
business models and in the more customer-focused "business layers" in which
those models operate.

It's worth pointing out that SaaS is an evolutionary step up from the older
MSP models. Assuming software models do not change over time is dogmatic.

Blockchains, like Bitcoin, enable a means by which suffering/cost of work can
be encapsulated in a transaction that cannot be altered later by the consumer
of that suffering cost or the producer of that suffering cost. This means the
suffering (work units) incurred by the AI (or the human involved) can be
preserved in a way that is immutable and used later for efficiency
improvements. That's not to say the work done by the AI is necessarily
valuable, but by making it immutable the work done by the AI can be
judged/measured to have been worth the amount of resources it took from
something else (i.e. an other's suffering) to do the work it did.

It is my belief that adversarial machine learning may benefit from blockchain
based data stores.

All of this is very important given that our current infrastructure, and all
the software models that go with it, resemble Swiss cheese from a security
perspective. Super-viral growth models may make VCs and a few scrappy
millennials wealthy, but they don't scale long-term for customer satisfaction.
We certainly don't want an AI going around creating super-viral growth models
that drain the world's economy either!

Blockchains enable "do better". "Do better" is the first step on the path to
enlightenment.

I should note that my use of the term "suffering" relates to the cost of
causality, either by one's own choice or by something outside one's own
choice. It is not meant to refer to the human emotion of suffering, although
the two can certainly be related.

~~~
mccoyspace
But the definition of work changes, and the value of that work also changes.
Bitcoin's internal self-referentiality and internally self-transparent genesis
sidestep this issue.

"This means the suffering (work units) incurred by the AI (or the human
involved) can be preserved in a way that is immutable and used later for
efficiency improvements. That's not to say the work done by the AI is
necessarily valuable, but by making it immutable the work done by the AI can
be judged/measured to have been worth the amount of resources it took from
something else (i.e. an other's suffering) to do the work it did."

I think you are alluding to that fact in the above quote, but I think it
still raises the question of preserving a standard, abstract 'value of work'
across a constantly changing AI dataset.

------
quinndupont
As I see it, the main challenge with much of this proposal is that it isn't
clear how to motivate the kind of broad sharing much of it needs (for
training). There's some incentive, in the sense that if you participate you
reap AI improvements, but traditional (and regulated) players will be
hesitant, on pure risk-management terms (and conservatism).

~~~
trentmc
It's the author here. (Hi!)

I agree: for the use cases that involve sharing, how do you incentivize this
when many see data as the moat? This is why I wrote the section about "Hoard
vs. share?". In some applications there will be more incentive to hoard; in
others, to share.

Note that one thing that softens this impasse is the idea of data exchanges.
So rather than "hoard vs. share" you can think of all data as potentially "for
sale, at a price", where the price could vary dramatically. And that's where a
data exchange could come in, to reduce friction in price discovery.

~~~
m3ta
I don't understand the assertion made regarding blockchains/digital signatures
supposedly increasing accuracy or trust in IoT sensor-reading reliability.

Also, how do you respond to the inevitable criticism that advertising your
solution as a "shared global registry" invokes the "How Standards Proliferate"
XKCD[1]?

Finally, I like the idea of creating a marketplace or exchange for AI training
information, but it's not strictly necessary to use a blockchain for this
purpose unless a very large portion of your users are hyper-concerned about
their anonymity. It's also misleading to try to build a marketplace of signed
IoT data, with the assumption that it is more trustworthy than unsigned IoT
data, when that is simply false.

[1]: [https://xkcd.com/927/](https://xkcd.com/927/)

~~~
trentmc
Thanks for the thoughts. I'll respond to each point in turn.

> I don't understand the assertion made regarding blockchains/digital
> signatures supposedly increasing accuracy or trust in IoT sensor-reading
> reliability.

If each IoT sensor has a known public key then we can know that the data came
from that sensor.

It doesn't solve all reliability problems of course. E.g. if you have a
hardware failure, a digital signature ain't gonna help:)
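The provenance check looks something like the following. A real deployment would use public-key signatures (e.g. Ed25519) so anyone can verify without the sensor's secret; here a keyed HMAC stands in to keep the sketch dependency-free, and the sensor ids and keys are made up for illustration:

```python
import hashlib
import hmac
import json

# Placeholder per-sensor keys; in practice these would be public keys
# registered when the sensor is provisioned.
SENSOR_KEYS = {"sensor-42": b"per-sensor-secret"}

def sign_reading(sensor_id, reading):
    """Tag a reading so its origin and integrity can be checked later."""
    payload = json.dumps({"sensor": sensor_id, "reading": reading},
                         sort_keys=True).encode()
    tag = hmac.new(SENSOR_KEYS[sensor_id], payload, hashlib.sha256).hexdigest()
    return {"sensor": sensor_id, "reading": reading, "tag": tag}

def verify_reading(msg):
    """Recompute the tag; tampered or mis-attributed readings fail."""
    payload = json.dumps({"sensor": msg["sensor"], "reading": msg["reading"]},
                         sort_keys=True).encode()
    expected = hmac.new(SENSOR_KEYS[msg["sensor"]], payload,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(msg["tag"], expected)

msg = sign_reading("sensor-42", {"temp_c": 21.5})
assert verify_reading(msg)              # genuine reading verifies
msg["reading"]["temp_c"] = 99.9
assert not verify_reading(msg)          # tampered reading fails
```

As noted above, this establishes who produced the data, not that the sensor itself was functioning correctly.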

> Also, how do you respond to the inevitable criticism that advertising your
> solution as a "shared global registry" invokes the "How Standards
> Proliferate" XKCD[1]?

I get this. What's cool is that IPDB doesn't need to be (and shouldn't be) the
"one network to rule them all". The path is to leverage existing standards and
new connectivity standards, so that IPDB (or whatever registry) plays well
with other emerging & existing registries.

The main lesson is from the internet itself, which was born by combining
together disparate networks (ARPANET, NSFNet, etc) that didn't communicate
previously, via the invention of TCP/IP. All the networks had to do was alter
their top-level protocol and suddenly they could talk to other nets. There's a
similar protocol for blockchain tech, called Interledger
([https://www.interledger.org](https://www.interledger.org)). Whereas TCP/IP
connects networks of data, where re-sending a packet is harmless and
double-spending isn't a concern, Interledger accounts for double-spending. You
can view it as a way to connect networks of value.

Interledger has already been used to connect Bitcoin, Ethereum, and many
centralized payment networks.

There are other protocols that are closer to the data level. E.g. simply using
JSON-LD helps. And IPLD on top of that (roughly, a Merkle-ized JSON-LD). And
domain-specific value transfer protocols above that, like COALA IP for
intellectual property (which plays well with existing protocols like DDEX for
music, PLUS for photography).

> Finally, I like the idea of creating a marketplace or exchange for AI
> training information, but it's strictly not necessary to use a blockchain
> for this purpose

Agreed; and I wrote about it: "An exchange could be centralized, like
DatastreamX already does for data. But so far, they are really only able to
use publicly-available data sources, since many businesses see more risk than
reward from sharing. What about a decentralized data & model exchange? ..."
and I went on to list some benefits.

> unless a very large portion of your users are hyper-concerned about their
> anonymity.

To me, there are much greater benefits. I'm hoping that the biggest benefit
will be to further catalyze a truly open data market.

> It's also misleading to try to build a marketplace of signed IoT data, with
> the assumption that it is more trustworthy than unsigned IoT data, when that
> is simply false.

I'm not sure why you say this; knowing the provenance of the data has clear
trust-related benefits in some cases, as I described. Obviously a signature
has no benefit standing on its own; it's all about usage farther along in the
pipeline.

------
mtdewcmu
Just an observation on ML--

If the key to making ML useful has been huge sets of training examples, then
that shows that current ML algorithms are much worse than humans at
generalizing from a small number of examples. That suggests that current ML
algorithms are leaving a lot on the table in terms of how much they are able
to learn from each example, and there is a lot of room for improvement.

~~~
trentmc
You're absolutely right - there is a lot of room for improvement. One of the
big research threads in ML is better unsupervised learning - to learn from
data without labels. Another hot topic is adversarial networks, which generate
their own training data by playing against each other. (15 years ago this was
"competitive co-evolution";)

> If the key to making ML useful has been huge sets of training examples

BTW this is only for some ML approaches and problems. There are some problems
where getting more data is just way too expensive; e.g. building a model from
silicon manufacturing where each mfg. run costs $50M; i.e. each datapoint
costs $50M. Guess what - there are still useful ML-y things you can do here.
Less extreme, there are many problems with only 100-1000 training points
available, and you can still do a lot.
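To make the small-data point concrete: even with ten "expensive" datapoints you can fit a model and get an honest error estimate via leave-one-out cross-validation. A self-contained sketch with a hand-rolled least-squares line (the data is synthetic, for illustration only):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def loo_error(xs, ys):
    """Leave-one-out CV: mean squared error on held-out points."""
    errs = []
    for i in range(len(xs)):
        tx, ty = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        a, b = fit_line(tx, ty)
        errs.append((a * xs[i] + b - ys[i]) ** 2)
    return sum(errs) / len(errs)

# Ten datapoints from a nearly linear process (true slope 2.0):
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2, 17.9, 20.1]
a, b = fit_line(xs, ys)
assert abs(a - 2.0) < 0.1        # slope recovered from 10 points
assert loo_error(xs, ys) < 0.2   # honest generalization estimate
```

Leave-one-out is the standard trick when every datapoint is too precious to set aside a separate test set.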

------
mehh
Big Data + Blockchain + AI, looks quite hand wavy, have you found a practical
use case for BigchainDB yet?

~~~
trentmc
Articles like the "AI + Blockchain" one are meant to be forward-looking, to
help inspire people to build. Call it hand-wavy if you wish; I call it laying
out a vision:)

But we're all about real apps getting built. Many companies are building on
BigchainDB. For example, I describe some of them here:
[https://blog.bigchaindb.com/where-does-blockchain-scalability-matter-specific-use-cases-from-digital-art-to-hr-9cf5ad8f7042](https://blog.bigchaindb.com/where-does-blockchain-scalability-matter-specific-use-cases-from-digital-art-to-hr-9cf5ad8f7042).

There are many, many more companies that have not publicly announced what
they're building yet. Stay tuned:)

------
jcfrei
I have also thought about the applications blockchains could have for AI.
However I didn't consider blockchains as a means to store data but rather as a
way to reliably track the performance/quality of various AI services. You can
read about it here: [http://jcfrei.com/the-ai-economy-bitcoin/](http://jcfrei.com/the-ai-economy-bitcoin/)

~~~
trentmc
Oh cool! Thanks for the link. I just added it to the "further reading" section
on the bottom.

------
davidgerard
Just off the top of my head:

* Why use a blockchain rather than a database?

* Why do you require blockchain rather than some other approach?

* Are you positing this as open to the hostile world? If so, why? If not, why use a blockchain?

* If you're open to the hostile world: what's your threat model, and how are you addressing it?

* How do your defences stop this from being the next Tay?

~~~
cesnja
Regarding "hostile world": I recently learned that on private networks where
all participating nodes are authenticated, blockchains can replace proof-
of-(work|stake|...) with faster and more efficient consensus algorithms [1] (I
only remember Raft, or something like it). What's the use case then? At least
for older data (in blocks far away from the current one), a blockchain is an
immutable database. On the other hand, it's fun operating with buzzwords.

[1]
[http://ieeexplore.ieee.org/document/7467408/?reload=true](http://ieeexplore.ieee.org/document/7467408/?reload=true)

~~~
davidgerard
Pretty much none. The very funniest one is Intel's "Proof of Elapsed Time",
which might as well be called Proof of Buying An Intel CPU. Rather than have
miners compete to produce the next block, a timer running in an environment
secured by a DRM mechanism built into your Intel CPU decides whether you get
to produce the next block. The white paper is an extended advertisement for
Intel® Software Guard Extensions™ (SGX™). Also, they only have a simulated
Proof of Buying An Intel CPU mechanism as yet.

[https://intelledger.github.io/introduction.html](https://intelledger.github.io/introduction.html)

This doesn’t provide any security against malicious participants; the excuse
is that private blockchains need speed over security. You might think that at
that point you don’t need a blockchain at all, but you’re hardly going to sell
any consultant hours with _that_ sort of thinking.

------
Uptrenda
This article is confusing and contradictory in so many ways. To start with, I
was expecting some kind of trustless self-contained system whereby everything
can be independently verified and concretely recorded in an autonomous system
similar to a blockchain. But instead what OP seems to be describing here is
more like a distributed database for storing AI training set data (similar to
Storj) where he uses the blockchain more as a metaphor than anything specific.

To give you an example, he uses the distributed nature of the blockchain (as
it pertains to storing full copies of the chain on multiple nodes) as a model
for storing training-set data within a distributed system, and the benefit he
states for this is that, since the infrastructure isn't controlled by one
party, organizations will be more inclined to share.

The author then goes on to describe how the integrity of training set data can
be attested to in the form of ... "I believe this data / model to be good at
this point" which is about as far away from verifiable as you can get. The
problem is, the article is very high-level and it's filled with meaningless
buzzwords and double-speak like "DAO" ... which is apparently something that
can be applied to a human-based reputation system and a blockchain (hence
useless.)

There are ways to use artificial intelligence as [part of] the reward
mechanism for a blockchain by creating an initial AI to recognize specific
types of content and then rewarding miners for finding said content
("datachains" in data mining.) And there are also trustless ways for rewarding
people for creating the best general-purpose intelligence for understanding a
given type of data. But from the author's account it's very hard to see if
we're talking about anything concrete here, as the article is so technically
vague on details.

He does go into detail for several of these ideas. But of the ideas stated
none of them are discrete, trustless, self-contained systems like a blockchain
(or even really need a blockchain) which is a shame because AI, data, and
blockchains can actually be unified to create bullet-proof entities (which in
my opinion will be much more purpose-specific than any vague Ethereum-type "AI
DAO" nonsense.)

Edit: The author's treatment of AI DAOs is worth the read.

------
devty
I noticed that BigchainDB is written entirely in Python. I love Python, but
I'm surprised to see it used in the database space!

(if the author is reading this) - what was your interest in using Python? Do
you have any concerns about its performance/scalability as the service grows?

~~~
trentmc
Hi, this is the author:)

I've been using Python in enterprise-grade, large-scale production systems
since the early 2000s. And it works well. Most Python libraries where
performance matters have C under the hood, e.g. numpy matrix manipulation.
Parallel computing got way better with the async support.

As for BigchainDB itself: it wraps RethinkDB (and soon MongoDB) which is where
much of the compute intensive stuff happens. Also, we leverage these fast
Python+C libraries, and parallelization. If/when needed, we can always convert
some of our core stuff to C or another fast language.

~~~
devty
cool. thanks for your explanation

------
antocv
BigchainDB's main product is actually a cluster of something like MongoDB.
Read the whitepaper; it was disappointing.

~~~
trentmc
Hi, it's Trent here, I'm CTO of BigchainDB and also the author of the
blockchains-AI article.

Yep, it's a [database] cluster, and proud to be that:)

But, it's special in that there isn't a single organization that owns or
controls it. That is, it's decentralized.

And in fact it draws on traditional distributed DBs by building on them, to
get first-class scalability and querying abilities. The first version was
built on RethinkDB and we're working on wrapping MongoDB too.

There's also a public version of BigchainDB that's getting rolled out, called
IPDB ([http://ipdb.foundation](http://ipdb.foundation)). With IPDB, people
building apps don't need to roll their own cluster, they can just talk to it
via http.

> was disappointing

How? We're always looking for feedback.

~~~
antocv
Yeah, I get that, but a single `drop tables *;`-style query against the
underlying cluster database (MongoDB, or whatever BigchainDB is using) would
actually delete your so-called blockchain. That's a huge security hole, which,
last time I checked BigchainDB, would be solved "administratively", i.e. at
the operational-security level instead of in the protocol.

Not to sound like an ass though; I do keep an eye on BigchainDB for what it is
to become, as the people behind it do seem to have great talent.

Has that security issue been resolved? How is IPDB secured?

~~~
trentmc
We acknowledge this attack vector and others, and list them in GitHub. It's on
our roadmap to address them. The current schedule is the following: we're
currently wrapping up a base MongoDB integration (next couple weeks); on the
heels of that we'll improve the MongoDB replication (a couple months); and on
the heels of that we'll address the "drop" issue (and other unwanted commands)
via a wire-protocol firewall. More security stuff to follow that.
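The firewall idea can be illustrated as a proxy that only forwards an allowlisted set of commands to the underlying database. This is a hypothetical sketch of the concept; the command names and message shape are mine, not BigchainDB's actual wire protocol:

```python
# Hypothetical wire-protocol firewall: forward only allowlisted commands,
# rejecting destructive ones like "drop". Names are illustrative.
ALLOWED_COMMANDS = {"insert", "find", "getMore"}

def filter_command(command):
    """Return True iff the command may be forwarded to the database."""
    return command.get("op") in ALLOWED_COMMANDS

assert filter_command({"op": "insert", "doc": {"tx": "..."}})
assert not filter_command({"op": "drop", "collection": "bigchain"})
assert not filter_command({"op": "dropDatabase"})
```

An allowlist (rather than a denylist of known-bad commands) is the safer default, since new destructive commands are rejected automatically.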

BTW thanks for the kind words about our team. I think they're amazing too:)

------
daxfohl
What about AI for blockchains?

~~~
trentmc
Great question. As the article says: there are many ways that AI can help
blockchains, such as mining blockchain data (e.g. Silk Road investigation).
That’s for another discussion:)

I'd love to see more suggestions here or elsewhere. It would be my pleasure to
write an article on this.

