
Making Postgres Bloom - usman-m
http://www.pipelinedb.com/blog/making-postgres-bloom
======
gleb
The example doesn't quite make sense:

    
    
      SELECT user_id IN (SELECT DISTINCT user_id FROM user_actions);
    

is not valid SQL. You may mean something like:

    
    
      SELECT 123 IN (SELECT DISTINCT user_id FROM user_actions);
    

which is a strange query, as it's equivalent to:

    
    
      SELECT 123 IN (SELECT user_id FROM user_actions);

~~~
usman-m
Oops, it was meant to be:

    
    
      SELECT <user_id> FROM (SELECT DISTINCT user_id FROM user_actions);
    

You're absolutely right that both those queries will give the same result. I
guess I was trying to motivate the basic problem of finding whether some user
exists in a _set_ of users, and `SELECT DISTINCT` is the SQL way of
representing a set.

Fixed the post, thanks!
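
Stripped of SQL, the motivating question is plain set membership; a quick Python sketch with toy data (the table contents are made up for illustration):

```python
# Toy stand-in for the user_actions table (hypothetical data).
user_actions = [("alice", "click"), ("bob", "view"), ("alice", "view")]

# SELECT DISTINCT user_id FROM user_actions ~ materializing the set of users.
distinct_users = {user_id for user_id, _action in user_actions}

# The membership question being motivated: is this user in the set?
print("alice" in distinct_users)  # exact answer; a Bloom filter approximates this
print("carol" in distinct_users)
```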

~~~
gleb
I'd put more effort into setting up a believable problem in these kinds of
posts before presenting a solution. Much like in a company pitch, it's hard
to understand the value of a product if you don't understand what problem it
is trying to solve.

It doesn't help that using unnecessary DISTINCTs in subqueries is a common
performance problem in novice SQL. Why people do that I don't really
understand, but they do.

That's the thing about probabilistic data structures - I've never seen a real-
world performance problem in SQL where they would have been helpful. I really
would like to have an "aha" moment where somebody shows me one.

Probabilistic data structures do seem like a natural match for streaming
databases, but that's different.

------
mamikonyana
When you say adding more online algorithms, is that mostly going to be limited
to adding more probabilistic data structures?

~~~
usman-m
Mostly that. I've also been thinking about how we could incorporate some
machine learning algorithms, like online perceptrons.

------
pbnjay
Don't get me wrong, I love Postgres and use it in pretty much all of my
projects... but it's not very well suited for this kind of application. Leave
the relational data to the database and use something more efficient for this!

Redis comes with both bitfields (see
[http://redis.io/commands/bitcount](http://redis.io/commands/bitcount)) and
hyperloglog counters (see
[http://redis.io/commands/pfcount](http://redis.io/commands/pfcount)),
requires almost no setup and has very minimal overhead.
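
As a rough sketch of what the bitfield half of that buys you — one bit per user id per day, so duplicate events cost nothing — here's the SETBIT/BITCOUNT pattern mimicked in plain Python (the key name and user ids are made up; this is a stand-in, not Redis itself):

```python
from collections import defaultdict

# Plain-Python mimic of Redis's SETBIT/BITCOUNT pattern for daily active
# users: one bit per user id per day. Key name and ids are hypothetical.
bitmaps = defaultdict(lambda: bytearray(1024))  # room for 8192 user ids/day

def setbit(key, offset):
    bitmaps[key][offset // 8] |= 1 << (7 - offset % 8)  # MSB-first, like Redis

def bitcount(key):
    return sum(bin(b).count("1") for b in bitmaps[key])

for user_id in (17, 42, 42, 900):        # user 42 shows up twice
    setbit("active:2015-09-17", user_id)

print(bitcount("active:2015-09-17"))     # duplicates are free -> 3
```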

~~~
danneu
Classic premature optimization.

"Just add another database!"

~~~
mycelium
From the article:

> Some of you, who have ventured deep into the bowels of databases, will
> probably point out that doing something like this in a real setup is
> committing concurrency suicide. All updates to the same row will essentially
> be executed serially which is no bueno if you're trying to build a
> performant data pipeline.

------
matsur
Semi-related in the land of Postgres and probabilistic data structures --
Redshift supports APPROXIMATE COUNT. Much, much faster than a raw COUNT, and
their stated error is ±2%.

[http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html](http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html)

~~~
striking
I'd eat my hat if it didn't use a Bloom filter in some way.

~~~
usman-m
It probably uses a HyperLogLog--the 2% error rate kind of gives it away. Bloom
filters approximate set membership queries, HyperLogLogs approximate set
cardinality queries. COUNT DISTINCT is a set cardinality query.
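
The tell is quantitative: the standard HyperLogLog relative-error estimate is about 1.04/sqrt(m) for m registers, so a stated ±2% lines up with m somewhere around 2^11–2^12. A back-of-the-envelope check (the register counts are illustrative guesses, not documented Redshift internals):

```python
import math

# Standard HyperLogLog relative-error estimate: ~1.04 / sqrt(m), where m is
# the number of registers. Register counts below are illustrative guesses.
for m in (2**10, 2**11, 2**12):
    error = 1.04 / math.sqrt(m)
    print(f"m = {m:5d}  relative error ~ {error:.2%}")
```

At m = 2^11 the estimate comes out near 2.3%, and at 2^12 near 1.6%, bracketing the advertised figure.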

We actually support a HyperLogLog-backed COUNT DISTINCT aggregate too:
[http://docs.pipelinedb.com/aggregates.html#general-aggregates](http://docs.pipelinedb.com/aggregates.html#general-aggregates)

~~~
striking
Consider my metaphorical hat eaten. Thanks for the cool tools! I'm currently
working with Postgres and this looks like a great thing to add to the mix.

~~~
anarazel
There's a Postgres extension that implements HyperLogLog. Rather useful:
[https://github.com/aggregateknowledge/postgresql-hll](https://github.com/aggregateknowledge/postgresql-hll)

------
jordibunster
It internally uses hashtext(), which is not a good idea for a Bloom filter
for a few reasons, one of which is
[http://www.postgresql.org/message-id/CABUevExTx2whgSpKaoMVowDxBe=pm7w4LJkb=-k8NTohQT12Kg@mail.gmail.com](http://www.postgresql.org/message-id/CABUevExTx2whgSpKaoMVowDxBe=pm7w4LJkb=-k8NTohQT12Kg@mail.gmail.com)

~~~
usman-m
Our native implementations of all probabilistic data structures use
MurmurHash3, so this isn't a problem. The dumbloom implementation is in no way
a _good_ Bloom filter, as the name suggests :)

------
zallarak
The idea of using probabilistic data structures instead of counting every
point of data (for things like customer analytics) is pretty significant --
getting caught in the weeds of managing every data point is error-prone and
inefficient.

------
ahachete
usman-m, the approach of PipelineDB seems really interesting. However, I'd
like to understand how, in your opinion, it compares with processing the
stream of data changes accessed over PostgreSQL's logical decoding
([http://www.postgresql.org/docs/9.4/static/logicaldecoding.ht...](http://www.postgresql.org/docs/9.4/static/logicaldecoding.html))
interface. Thank you!

~~~
usman-m
ahachete, I'm not sure if I totally understand your question.

Continuous views are consumers for streams. You can think of them as high
throughput real-time materialized views. The source of data for the stream can
be practically anything. Logical decoding on the other hand is a producer of
streaming data--it's basically a human readable replication log. So you could
potentially stream the logically decoded log into PipelineDB and build some
continuous views in front of it.

~~~
ahachete
I was thinking of a system where change data is extracted from the source
database and processed in real time by software that consumes this stream. So
other than the obvious differences (the need to write that software, SQL
support), what would be the real advantage of using PipelineDB over a system
with PostgreSQL + logical decoding + stream processing of that data?

Thanks!

