
Rudder, an open source Segment alternative - feross
https://github.com/rudderlabs/rudder-server
======
martingxx
This looks great, and I look forward to finding time to try it soon.

However, I believe it's misleading to call it "open source". The SSPL
is not generally considered to be an open source licence by any meaningful
definition; in particular, it does not meet the OSI definition and is
incompatible with most licences that do.

I understand the need these days to protect against aggressive cloud
providers, but there are other ways to achieve that without becoming
completely non-open-source, such as the BSL.

(See
[https://opensourceforu.com/2019/06/cockroach-labs-changes-its-licensing-strategy-adopts-permissive-version-of-business-source-license/](https://opensourceforu.com/2019/06/cockroach-labs-changes-its-licensing-strategy-adopts-permissive-version-of-business-source-license/))

~~~
soumyadeb
Thanks for the pointer - will check it out. We are total novices at this - we
don't even have a license attorney. We just picked the SSPL because that's what
everyone seemed to suggest to prevent the likes of AWS from cloning it. Now
that we've got some visibility, we will take a careful look at this issue.

But at heart we want to build an open-source community while still being a
viable business, along the lines of Mattermost, Elastic, etc.

~~~
abdullahkhalids
This is, I think, Mattermost's answer to the current debate around the
ethics of FAANG (and others) using open source software to make lots of money
without substantially contributing back to the OS projects, financially or in
code:
[https://github.com/mattermost/mattermost-server/blob/master/LICENSE.txt](https://github.com/mattermost/mattermost-server/blob/master/LICENSE.txt)

My understanding is that Mattermost is okay with others making money from
their software if they don't modify it - which will practically work for some,
but not all, small companies, and will be very difficult for the big companies
to use. If the big companies want to modify and use Mattermost-server for
free, they are forced to contribute the changes back to the OS project, and
then they can make as much money as they want. Or they can use option 2: pay
Mattermost a bunch of money for the privilege of not contributing code back to
the OS project. In other words, FAANG and co. can either contribute to
Mattermost financially or in code - their pick.

~~~
soumyadeb
That would work perfectly for us too. Is that considered an open-source
license? We have heard the OSI has strict standards around that.

~~~
ensignavenger
The AGPL license Mattermost uses for their source is OSI approved Open Source.

------
tomnipotent
I want to see this married with the Meltano project from Gitlab - it would
create an unprecedented end-to-end data environment.

[https://meltano.com/](https://meltano.com/)

~~~
soumyadeb
Love the thought. We weren't aware of this project - thanks for the pointer.
Will follow up.

~~~
dmor
I am leading the Meltano project at GitLab and would love to collaborate.
Could you drop me a note at dmorrill@gitlab.com?

~~~
soumyadeb
Sent you a note.

------
tankster
Segment has done an awesome job of building a great product but it is
impossible to be provably secure and private given that they host all the
data. Rudder can avoid situations such as the Segment Security incident that
happened recently.
[https://segment.com/security/bulletins/incident090519/](https://segment.com/security/bulletins/incident090519/)

Very excited about this project!

~~~
soumyadeb
Thanks :)

------
vollmarj
As a longtime segment customer, I am excited to see this exist. Segment's
product is fairly good but their pricing model is really bad. We recently ran
into an issue where they were billing us for almost 10x the number of users we
were actually tracking. It was a nightmare that required a few months of back
and forth with their support trying to get it fixed. In the end, they gave us
a partial refund for the overages but we had to do a lot of technical work to
resolve the issue ourselves.

It would be nice to have an open source alternative that doesn't get you
locked into an unpredictable pricing model that you have very little control
over.

~~~
soumyadeb
Some of our initial pilots had similar problems. Segment's pricing also
doesn't work when you have a lot of free users. That (and privacy) are the two
pain points we hope to address with this.

Happy to help you try this out if you want (please email
soumyadeb@rudderlabs.com). We are behind on the number of integrations (vs.
Segment) and features, but we will catch up. And we're hoping to get community
support on that.

------
torpedolaser
We are in the mobile game industry. Due to the incredibly high volume of event
data each user generates per day, we need to join certain events together to
reduce the number of events sent to our analytics platform. No other
Segment-like tool can do this for us. We have been working with Rudder Labs to
solve this problem. They have been really helpful and have acted super fast on
our requests and suggestions. With the Rudder Labs SDK, we are able to join
event data on the platform (which we choose to host internally), plus we get
all the other Segment-style features. Besides that, since we run a freemium
game, most of our users are free users/non-payers, so cost is another thing we
have to consider; current Segment market pricing is apparently way too high
for us. Rudder Labs solves this problem for us as well. Great deal!

~~~
dehrmann
> Due to the incredibly high volume of event data each user generates per day,
> we need to join certain events together to reduce the number of events sent
> to our analytics platform.

Don't gloss over the fact that telemetry over cell networks can be costly for
users (more and more plans are unlimited, so this doesn't worry me as much)
and draining on batteries. However you do it, data that's not latency-critical
should be buffered and batched.

~~~
soumyadeb
Yes, the SDK buffers and batches events before sending them to the Rudder BE.
In the BE, the events are combined (values summed up, not just batched) before
they are sent to the analytics platform. We wrote a blog post about this
interesting use case:

[https://rudderlabs.com/customer-case-study-casino-game/](https://rudderlabs.com/customer-case-study-casino-game/)
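The "values summed up, not just batched" idea can be sketched in a few lines of Go. This is a minimal illustration, not Rudder's actual code: the `Event` shape, field names, and `Aggregate` function are assumptions for the example.

```go
package main

import "fmt"

// Event is a hypothetical, simplified event shape; real payloads are
// richer JSON documents.
type Event struct {
	UserID string
	Name   string
	Value  int64
}

// Aggregate combines events with the same (UserID, Name) by summing
// their values, so N raw events collapse into one per key before being
// forwarded to the analytics destination.
func Aggregate(events []Event) []Event {
	type key struct{ user, name string }
	sums := map[key]int64{}
	order := []key{} // preserve first-seen order for deterministic output
	for _, e := range events {
		k := key{e.UserID, e.Name}
		if _, seen := sums[k]; !seen {
			order = append(order, k)
		}
		sums[k] += e.Value
	}
	out := make([]Event, 0, len(order))
	for _, k := range order {
		out = append(out, Event{UserID: k.user, Name: k.name, Value: sums[k]})
	}
	return out
}

func main() {
	in := []Event{
		{"u1", "coins_spent", 10},
		{"u1", "coins_spent", 5},
		{"u2", "coins_spent", 7},
	}
	out := Aggregate(in)
	fmt.Println(len(out), out[0].Value) // prints "2 15"
}
```

Three raw events collapse into two, with the repeated `coins_spent` values summed, which is the shape of the cost reduction described in the case study.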

------
soumyadeb
One of the authors here. This post was a pleasant surprise!! Happy to answer
any questions. Or please feel free to reach out to me at soumyadeb@rudderlabs.com.

~~~
tyri_kai_psomi
You put a spotlight on this by saying "privacy and security focused
alternative to Segment".

Are there things you believe make Segment not privacy and security focused? As
a long-time user of Segment, I find their Protocols feature and new data and
privacy features world-class for this.

Having also just left their Synapse conference, privacy and security was the
#1 topic of discussion throughout the conference. I would say they are very
much privacy and security focused.

Not trying to shill, but it comes off as maybe you are misrepresenting segment
a little bit. "Open source segment alternative" would have probably been just
fine.

~~~
colordrops
The author responded to you, but I think this deserves emphasis. _True_
privacy is not about additional features or promises made by a cloud provider.
It's not private unless you handle the data yourself, period point blank. This
should be obvious.

~~~
LittlePeter
The entire point of using Segment is not their data transformation SDKs. AFAIK
these are open source anyway.

It is the managed infrastructure. Outage? You can keep sleeping. Google
Analytics deprecating their API for a new one? Segment's on it.

They have all your data. They can replay all events from the beginning if
needed (so they say; I've never had to ask for this).

~~~
kelnos
> _It is the managed infrastructure. Outage? You can keep sleeping._

I don't think an outage in a third party service I've depended on has ever
allowed me to keep sleeping. In reality, I've been awake, dealing with outage
fallout, keeping up to date on status updates, and overall just feeling in the
dark, unable to improve the situation for my customers.

That's not saying that depending on a third party service isn't sometimes the
right move, but "I get to sleep when there's a problem" isn't a consideration
there. The pager still goes off.

------
cyberferret
I am a happy Segment user, however their handling of a recent data breach did
leave a bit of a sour taste. They took several days to get back to me with a
definitive answer as to whether our customer data that we collected was
compromised by the internal breach.

I am keen to look at a competing product where we may have more control over
the data collected and can manage the risk ourselves.

~~~
soumyadeb
Would love it if you tried this :) If you need help, please email
(soumyadeb@rudderlabs.com) or join our slack community -
[https://rudderlabs.herokuapp.com/](https://rudderlabs.herokuapp.com/)

------
beager
So what's the pricing model? Your site lists "pricing" and shows no info.

\- Are you charging for support?

\- Do you/will you have a paid enterprise tier that will increasingly be the
only tier with a viable feature set?

\- What's keeping you from dumping on Segment's market until you hit traction
then ratcheting up to Segment's pricing?

\- Who are you? Who are your investors?

~~~
soumyadeb
Pricing: Honestly, we haven't figured out the business model yet. Like other
open-source products, it will likely be a combination of support + enterprise
features (like HA, auto-scaling, etc.), but again, we don't know what those
enterprise features would be.

Ratcheting up pricing: Good question, and I'm not sure how to answer. Our
vision is to be like other open-source companies such as Mattermost and ES.
Our base version (which would work for 90% of users) would be free and under
an open-source license. But I do understand your concern - maybe there is a
way to put that in the license (that the base version will be perpetually
open-source).

\- Here is our company page
([https://www.linkedin.com/company/rudderlabs](https://www.linkedin.com/company/rudderlabs)).
Our lead investor is S28 capital (Partner: Shvet) - they have also invested in
Mattermost (an open-source slack competitor)

~~~
beager
Thank you for being candid and straightforward in your responses. If you can
get this to work really well and make it easy to deploy, I think it can be a
great service to the FOSS community and also a profitable support/enterprise
business for you. Good luck.

~~~
soumyadeb
Thanks :)

------
sundbry
Is it really so difficult for engineers to create a task to process a Kafka
topic? It takes one day to write a program that consumes from a topic of
events and pushes to an API like Amplitude, and you have total flexibility in
how you push to those integrations.

Why would you use Postgres for an event processing system? This seems like an
inefficient architecture.

~~~
soumyadeb
Great question. The complication arises because of failures. One or more
destinations may be down for any length of time, individual payloads may be
bad, etc. To handle all of this you need to retry with timeouts while not
blocking other events to other destinations. Also, not all events go to all
destinations.

We built our own streaming abstraction on top of Postgres. Think a layman's
leveled compaction. We will write a blog post on that soon. The code
(jobsdb/jobsdb.go) has some comments too in case you want to check it out.
Segment had a similar architecture and a blog post on it, but I can't seem to
find it. Also, eventually we will replace Postgres with something lower-level
like RocksDB or even native files.
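The semantics described here (append jobs durably, hand them to workers, return failed jobs for retry without losing order) can be sketched as a toy in-memory analogue. To be clear, this is an illustration under assumptions, not the actual Postgres-backed jobsdb: the `JobsDB` type and its `Store`/`ClaimNext`/`Fail` methods are invented for the example.

```go
package main

import "fmt"

type JobState int

const (
	Waiting JobState = iota
	Executing
	Succeeded
	Failed
)

type Job struct {
	ID       int64
	Payload  string
	State    JobState
	Attempts int
}

// JobsDB is a toy in-memory analogue of an append-only job queue with
// per-job status tracking. The real system persists jobs in Postgres so
// they survive restarts and can be retried after a destination outage.
type JobsDB struct {
	jobs   []*Job
	nextID int64
}

// Store appends a new job to the end of the log in waiting state.
func (db *JobsDB) Store(payload string) *Job {
	db.nextID++
	j := &Job{ID: db.nextID, Payload: payload, State: Waiting}
	db.jobs = append(db.jobs, j)
	return j
}

// ClaimNext returns the oldest waiting job and marks it executing,
// preserving append order (the "log" property).
func (db *JobsDB) ClaimNext() *Job {
	for _, j := range db.jobs {
		if j.State == Waiting {
			j.State = Executing
			j.Attempts++
			return j
		}
	}
	return nil
}

// Fail returns a job to the waiting state so it is retried, up to
// maxAttempts, after which it is marked failed permanently.
func (db *JobsDB) Fail(j *Job, maxAttempts int) {
	if j.Attempts >= maxAttempts {
		j.State = Failed
		return
	}
	j.State = Waiting
}

func main() {
	db := &JobsDB{}
	db.Store(`{"event":"page_view"}`)
	j := db.ClaimNext()
	db.Fail(j, 3)      // destination was down; job goes back for retry
	j = db.ClaimNext() // the same job comes back, in order
	j.State = Succeeded
	fmt.Println(j.ID, j.Attempts) // prints "1 2"
}
```

The point of the sketch is the status machine: a bad destination bounces a job back to waiting rather than blocking or dropping the stream, which is the failure-handling argument made above.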

Yes, in theory you could use Kafka's streaming abstraction and create a
topic per destination. Two reasons we didn't go that route:

1) We were told Kafka is not easy to support in an on-prem environment. We are
not Kafka experts, but we paid heed to people who have designed and shipped
such on-prem software.

2) More importantly, for a given destination, we have dozens of writers all
reading from the same stream. The only ordering requirement is that events
from a given device (end consumer) are in order. So we assign the same user to
the same writer. However, the writers themselves are independent. If a payload
fails, we just block events from that user while other users continue.
Blocking the whole stream for that one bad payload (retried 4-5 times) would
slow things down quite a bit. If we had to achieve the same abstraction on
Kafka, we would have had to create dozens of topics per destination.
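The "same user goes to the same writer" assignment can be sketched with a simple deterministic hash. This is a minimal illustration of the routing idea, not the production code; the `writerFor` function is an assumption for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// writerFor deterministically maps a user ID to one of n writers, so all
// events from one device stay ordered within a single writer while other
// users proceed independently. A bad payload then stalls only the one
// user it belongs to, not the whole stream.
func writerFor(userID string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32()) % n
}

func main() {
	const writers = 8
	// The mapping is stable: the same user always lands on the same writer.
	fmt.Println(writerFor("user-42", writers) == writerFor("user-42", writers)) // prints "true"
	fmt.Println(writerFor("user-42", writers), writerFor("user-7", writers))
}
```

With Kafka, getting this granularity would have meant one topic (or at least one partition scheme) per destination per writer pool, which is the "dozens of topics per destination" objection above.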

~~~
atombender
I'm curious how you guys are using Postgres as a log.

I took a brief look at the code, and while the append-only dataset strategy is
sound, it looks like your scenario only has a single reader and a single
writer?

In my experience, it's not entirely trivial when you have:

1\. Multiple readers who each need to be able to follow the log from
different positions in real time.

2\. Multiple writers receiving events that need to be written to the end of
the log.

3\. Real-time requirements.

From what I can tell — I could be wrong here — your system doesn't need to
poll the table constantly, because you also save the log to RAM, so whenever
you receive an event, you can optimistically handle it in memory and merely
issue status updates. If anything goes wrong, a reader can replay from the
database.

But that doesn't work with multiple writer nodes where each node receives just
a part of the whole stream. The only way for this to work would be to dedicate
a writer node to each stream so that it goes through the same RAM queue. So
then you need a whole system that uses either Postgres or some consensus
system like Etcd to route messages to a single writer, and you need to be able
to recover when a writer has been unavailable.

Edit: I see you wrote that "we assign same user to same writer", so you're
doing something like that.

~~~
soumyadeb
Agreed. Our current implementation does not work when there are multiple
readers for the same event stream and we need to track per-reader watermarks.
We have a very simple model where one reader reads from DB and distributes the
work to multiple workers (e.g. network writers) which in turn update the job
status.

Multiple writers should work though. StoreJob() should handle that.

I missed the logging-to-RAM part. Yes, we always wanted to do that but haven't
gotten to it yet. Right now, all events are moved through the DB - between the
gateway and the processor, and then the router. Hence, we poll the table
constantly.

Would love it if you joined our Discord channel:
[https://discordapp.com/channels/625629179697692673/625629179697692677](https://discordapp.com/channels/625629179697692673/625629179697692677).
Slightly easier to have a technical discussion there :)

------
Roritharr
Interesting - we've built an in-house solution that is also Go-based and also
writes to a Postgres DB besides forwarding events, but it's much simpler, has
no UI, and comes with backend SDKs already.

What I found interesting is that you quote 3k events per second on a rather
beefy 2xlarge machine. Our version is MUCH less demanding; I wonder if there
isn't a lot of performance left on the table here.

I'll keep this in mind once we've grown out of our solution, though.

~~~
soumyadeb
The bottleneck for us (on that instance) is not Postgres but transformations.
Transformations are tiny snippets of JavaScript which convert the event from
the Rudder JSON to whatever structure (JSON, key-value, etc.) the destination
expects. We also support user-defined transformations - functions defined by
the user to transform/enhance the event.

Currently, transformations are run in Node.js. So, for every batch of events
there is a call into Node.js from Go, and that is slow. We do
batching/parallel calls, but still.

I think Postgres gets us > 15K/sec throughput.

~~~
LittlePeter
What happens if I pass a 64-bit integer, and the Rudder pipeline, being in
JavaScript, silently down-casts it to a 53-bit integer?

Segment's pipeline involves JS at some point. We had an issue where our 64-bit
integers were down-cast silently. We found out the hard way. We use strings
now (perhaps we should have used strings right away; I am not necessarily the
sharpest tool in the shed).

~~~
soumyadeb
Hmm, never thought about that. Need to think about how to handle it - great point!!

------
yodi
This is awesome! I never knew anyone was building things like Segment. I will
try this alternative solution at our company and will post updates here about
the results.

~~~
soumyadeb
Author here. Happy to help you out with this, in case you need it. Please
email me at soumyadeb@rudderlabs.com. Or please join our slack -
[https://rudderlabs.herokuapp.com/](https://rudderlabs.herokuapp.com/)

------
sails
How does this compare to Snowplow?

[https://github.com/snowplow/snowplow](https://github.com/snowplow/snowplow)

~~~
sumanthpur
Thanks for asking - I am one of the authors of Rudder. Snowplow is a great
analytics tool, especially used for internal analytics. It is open-source and
on-prem and preserves your data privacy. It is centered around enriching
events and storing them in a data warehouse.

We are aiming at routing events reliably to destinations, transforming events
in real time, storing them in your data warehouse with a dynamic schema, and
eventually building a data platform with help from the community.

~~~
sails
Ok, that is interesting. I've been thinking about a Snowplow implementation
for a while, but it seems like a big task. Could you expand on the
differences?

> Snowplow is a great analytics tool, especially used for internal analytics

> centered around event enriching and storing to a data warehouse

Is rudder not aimed at these use cases?

I seem to gather that you are building a more full-stack platform, with a core
feature set similar to (or better than) Snowplow's?

Edit: Can you expand on the types of destinations you plan on supporting? I
see Sources [Android, iOS, JS] and Destinations [Amplitude, GA, Hubspot, etc.]
but no data warehouses. Can I send raw events to a Snowflake DWH, for
instance?

------
qurt
I thought this was [https://www.rudder.io/en/](https://www.rudder.io/en/)

~~~
soumyadeb
Haha, it is [https://www.rudderlabs.com](https://www.rudderlabs.com).
Unfortunately, we haven't done anything re: SEO/SEM. I think we have no-index
on our website (the default with WP Engine) - we'll fix it.

------
indianCoder
One issue I had with Segment: I couldn't run a real-time transformation on the
event to join in data from our data tables. We eventually got around it with
AWS Lambda and sending the result back to Segment. Segment recently announced
Functions to help with this, but I still could not get my hands on it.

Any plans on this?

~~~
soumyadeb
Yes, that is exactly the use case for our "user-defined" transformation
functions. You can define any JavaScript function (right now by modifying the
code, but it will be available from the UI in the release coming next week).
Inside that function you can filter/transform/enhance the event in any way you
like. You can look up your DB, call external APIs, etc. You can also combine
multiple events into one.

Since this whole thing runs inside your VPC, you don't have to open up your
production database to a 3rd party as you do with Segment.

Happy to work with you on your use case. Please email soumyadeb@rudderlabs.com

We wrote a couple of blog posts on that.

Case Study: [https://rudderlabs.com/customer-case-study-casino-game/](https://rudderlabs.com/customer-case-study-casino-game/)

Transformation Details:

[https://rudderlabs.com/transformations-in-rudder-part-1/](https://rudderlabs.com/transformations-in-rudder-part-1/)

[https://rudderlabs.com/transformations-in-rudder-part-2/](https://rudderlabs.com/transformations-in-rudder-part-2/)

~~~
indianCoder
This would be so awesome. Looking forward to it.

------
fuzzyfroghunter
This looks great, nice work.

Is there an easy way for someone to set it up on their cloud infrastructure?

~~~
soumyadeb
Yes, this is exactly the use case we want to target. What's your cloud infra?
We can easily run inside your AWS VPC. If you have your own private cloud, we
can run there too - we just need to disable the S3 dump. The only dependency
we have is Postgres.

Happy to help you set up - please email soumyadeb@rudderlabs.com OR join our
slack [https://rudderlabs.herokuapp.com](https://rudderlabs.herokuapp.com)

~~~
fuzzyfroghunter
Actually asking because we are building data science automation software which
we deploy onto our customers' cloud infrastructure. Very similar value
proposition to Rudder re: privacy and security, but a different use case.

Always interested in how other people achieve similar effects.

Right now, for AWS, we provide customers with a zipped up Packer+Terraform
configuration with source and a README which lets them:

1\. (terraform+sh) Create an IAM user responsible for the deployment with the
right policy to deploy and export environment variables for the user.

2\. (packer) Create an AMI for an instance that can host our APIs

3\. (terraform) Spin up an EC2 instance on the desired VPC to host our APIs
(with optional key pair for debugging, etc.)

------
mushufasa
also, segment was the O.G. segment alternative
[https://github.com/segmentio/analytics.js](https://github.com/segmentio/analytics.js)

~~~
soumyadeb
Analytics.js was only a client/browser-side utility. They developed an entire
backend stack later - which is needed for data-warehouse dumps, event replay,
etc. We are also developing a complete backend stack with all those features.
This is still a work in progress.

------
drixta
We were lucky to find and deploy this project early at our startup. Being a
cybersecurity company, we cannot have our customers' data leave our AWS
account.

~~~
rixed
> we cannot have our customers' data leave our AWS account.

Unintentionally funny?

------
amelius
If every successful business is eventually copied by open source, can we
really blame big companies for protective practices such as customer lock-in?

~~~
HillRat
It’s licensed under SSPL, so the definition of “open source” may vary.
(Specifically, they have an enterprise business model, so this is more
properly a proprietary, source-available on-prem Segment competitor.)

~~~
soumyadeb
Founder here. Yes, that is an accurate statement. We want to be a business
along the lines of Mattermost, Kafka, Elasticsearch, and so on. We are
committed to building an open-source community.

Honestly, we were not prepared for this post. We don't even have a license
attorney. We just picked the SSPL because that's what everyone seemed to
suggest to prevent the likes of AWS from cloning it. By no means are we
experts.

~~~
techdragon
Since you have chosen the SSPL in order to stop companies free riding off your
work, you may want to also look at
[https://licensezero.com](https://licensezero.com) as another license option
for this or other work you release.

“Open source” software and “free” software as concepts have become entangled
to an unhealthy level. Burnout and software project sustainability issues are
important to be aware of and manage if we want to have successful long term
open source projects that aren’t just sponsored by large companies like
Microsoft, Google, Facebook and so on.

So I’m glad you thought ahead and chose the SSPL, good luck and don’t let the
haters get to you, at the end of the day it is “open source” and you’ll still
be able to feed yourself!

~~~
soumyadeb
Thanks for the reference and for your words of encouragement. Will check out
licensezero and DM you if we have any questions.

~~~
techdragon
Sure! Email is in my profile, feel free to DM. :-)

~~~
soumyadeb
Thanks :)

------
platform
Very interesting and approachable solution.

With regard to:

> Rudder runs as a single go binary with Postgres. It also needs the
> destination (e.g. GA, Amplitude) specific transformation code which are node
> scripts.

what would be the recommended approach if I would like to keep the data
internally and not use an external analytics engine?

~~~
soumyadeb
We also have an S3 destination, so you can just add S3.

Support for other data warehouses (Redshift, BigQuery, etc.) is coming soon.

~~~
platform
Thank you; my understanding was that S3 is also 'external' (it's an AWS service).

What I was hoping/looking for is another lightweight tool that can be
installed alongside Rudder and use the PG instance (maybe just a different DB
within Rudder's instance, or just a different schema).

For a lightweight, end-to-end solution (including analytics) that can 'grow'
with needs.

For many small self-funded startups that care about privacy a combination of
following challenges exist:

a) need self-hosted solutions that require a minimal footprint (because data,
even anonymized, is hosted on self-paid VPS instances)

b) need to use solutions that do not require an initial purchase/investment,
but can yet scale out (both technically and in terms of support) when/if the
startup's commercial model becomes more successful

c) need to use solutions that are not heavily VC-backed, because those
typically do not have a stable engagement model (VCs' required rates of return
typically force changes in the initial service/open-source model, while
self-funded startups are typically much slower-growing, and slower in
committing to purchases)

d) need self-hosted solutions that cover a lot of ground in one (so that the
learning curve and the time needed to integrate are minimized).

~~~
sumanthpur
In the next 2 weeks, we are releasing Redshift as a destination. After that,
we have a PostgreSQL destination in our pipeline. You can configure it and
capture everything there.

If you would prefer that to be in files, you could set up a MinIO server on
your VPS instance. That's coming in the next few weeks too.
[https://github.com/minio/minio](https://github.com/minio/minio)

We would like to understand your preference on this so that we could align our
next set of destinations.

Please drop an email at sumanth@rudderlabs.com or join our Slack/Discord
channels.

------
mychael
Segment.io is currently being blocked by browsers with adblock. Does this mean
this will work even with adblock?

~~~
davej
By the way, you can get around this by using their custom-domain support and
proxying the Segment script.

------
orasis
Good luck, but this is a huge amount of work to keep something like this
running. I hope it’s successful!

~~~
soumyadeb
Indeed. That's why we have raised a seed round and are doing this full-time as
a serious business. We are passionate about privacy but understand that we
cannot have a good shot at this while doing it part-time.

------
dacompton
The majority of the risk lies outside of Segment, and they are damned serious
about the risk they own.

~~~
soumyadeb
I am sure they do. We have great respect for Segment. However, at some
point/scale a company needs to take ownership of its user data.

Yes, some events may go to 3rd-party vendors (analytics, advertising), but not
everything goes, and those often have PII removed. We want Rudder to be the
point where you decide and enforce what goes where.

------
gauravagarwal
This looks exciting

------
dinrat
This is so awesome! I have been waiting for it.

~~~
soumyadeb
Thank you :)

