This looks great, and I look forward to finding time to try it soon.
However, I believe it's misleading to call it "open source". The SSPL license is not generally considered to be an open source licence by any meaningful definition; in particular, it does not meet the OSI definition and is incompatible with most licences that do.
I understand the need these days to protect against aggressive cloud providers, but there are other ways to achieve that without becoming completely non-open-source, such as the BSL.
Thanks for the pointer - will check it out. We are total novices on this - we don't even have a license attorney. We just picked the SSPL because that's what everyone seemed to suggest to prevent the likes of AWS from cloning it. Now that we have some visibility, we will take a careful look at this issue.
But at heart we want to build an open-source community while still being a viable business, like Mattermost, Elastic, etc.
This is, I think, Mattermost's answer to the current debate around the ethics of FAANG (and others) using open-source software to make lots of money without substantially contributing back to the OS projects, financially or in code. https://github.com/mattermost/mattermost-server/blob/master/...
My understanding is that Mattermost is okay with others making money from their software if they don't modify it - which will work in practice for some, but not all, small companies, and will be very difficult for the big companies. If the big companies want to modify and use mattermost-server for free, they are forced to contribute the changes back to the OS project, and then they can make as much money as they want. Or they can use option 2: pay Mattermost a bunch of money for the privilege of not contributing code back to the OS project. In other words, FAANG and co. can contribute to Mattermost either financially or in code - their pick.
Segment has done an awesome job of building a great product, but it is impossible for them to be provably secure and private given that they host all the data. Rudder can avoid situations such as the Segment security incident that happened recently. https://segment.com/security/bulletins/incident090519/
As a longtime segment customer, I am excited to see this exist. Segment's product is fairly good but their pricing model is really bad. We recently ran into an issue where they were billing us for almost 10x the number of users we were actually tracking. It was a nightmare that required a few months of back and forth with their support trying to get it fixed. In the end, they gave us a partial refund for the overages but we had to do a lot of technical work to resolve the issue ourselves.
It would be nice to have an open source alternative that doesn't get you locked into an unpredictable pricing model that you have very little control over.
Some of our initial pilots had similar problems. Segment's pricing also doesn't work when you have a lot of free users. That (and privacy) are the two pain points we hope to address with this.
Happy to help you try this out if you want (please email soumyadeb@rudderlabs.com). We are behind on the number of integrations (vs. Segment) and features, but we will catch up. And we're hoping to get community support on that.
We are in the mobile game industry. Due to the incredibly high volume of event data each user generates per day, we need to join certain events together to reduce the number of events sent to our analytics platform. No other Segment-like tool can do this for us. We have been working with Rudder Labs to solve this problem. They have been really helpful and acted super fast on our requests and suggestions. With the Rudder Labs SDK, we are able to join event data on the platform (which we choose to host internally), plus we get all the other Segment-style features. Besides that, since we run a freemium game, most of our users are free users/non-payers, so cost is another thing we have to consider, and Segment's current market pricing is way too high for us. Rudder Labs solves this problem for us as well. Great deal!
> Due to the incredibly high volume of event data each user generates per day, we need to join certain events together to reduce the number of events sent to our analytics platform.
Don't gloss over the fact that telemetry over cell networks can be costly for users (more and more plans are unlimited, so this doesn't worry me as much) and draining on batteries. However you do it, data that's not latency-critical should be buffered and batched.
Yes, the SDK buffers and batches events before sending them to the Rudder backend. In the backend, the events are combined (values summed up, not just batched) before they are sent to the analytics platform. We wrote a blog post about this interesting use case.
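A rough sketch of the two stages described above. The class and function names here are illustrative, not Rudder's actual API:

```javascript
// Client side: buffer events and flush them as one batched request
// instead of one network call per event (saves battery and bandwidth).
class EventBuffer {
  constructor(flushFn, maxSize = 20) {
    this.flushFn = flushFn;   // callback that sends a batch upstream
    this.maxSize = maxSize;
    this.queue = [];
  }
  track(event) {
    this.queue.push(event);
    if (this.queue.length >= this.maxSize) this.flush();
  }
  flush() {
    if (this.queue.length === 0) return;
    this.flushFn(this.queue.splice(0)); // drain the queue into one batch
  }
}

// Server side: combine events of the same type for the same user by
// summing their values, rather than forwarding every individual event.
function combineEvents(events) {
  const combined = new Map();
  for (const e of events) {
    const key = `${e.userId}:${e.type}`;
    const prev = combined.get(key);
    if (prev) prev.value += e.value;
    else combined.set(key, { ...e });
  }
  return [...combined.values()];
}
```

The key point is that combining is lossy on purpose (values are summed), which is different from plain batching, where every event still reaches the destination.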
One of the authors here. This post was a pleasant surprise!! Happy to answer any questions. Or please feel free to reach out to me at soumyadeb@rudderlabs.com.
You put a spotlight on this by saying "privacy and security focused alternative to segment".
Are there things you believe make Segment not privacy and security focused? As a long-time user of Segment, I find their Protocols feature and new data and privacy features world class for this.
Having also just left their Synapse conference, privacy and security was the #1 topic of discussion throughout the conference. I would say they are very much privacy and security focused.
Not trying to shill, but it comes off as maybe you are misrepresenting segment a little bit. "Open source segment alternative" would have probably been just fine.
We started this before Segment launched the privacy tool, so yes, that was a good validation of the pain point.
However, unlike segment we don't require you to send all your events to us. You are welcome to download and run the software yourself without ever talking to us.
We believe that having to send everything to a third party (Segment) goes against the fundamental goal of guaranteeing user privacy.
The author responded to you, but I think this deserves emphasis. True privacy is not about additional features or promises made by a cloud provider. It's not private unless you handle the data yourself, period point blank. This should be obvious.
> It is the managed infrastructure. Outage? You can keep sleeping.
I don't think an outage in a third party service I've depended on has ever allowed me to keep sleeping. In reality, I've been awake, dealing with outage fallout, keeping up to date on status updates, and overall just feeling in the dark, unable to improve the situation for my customers.
That's not saying that depending on a third party service isn't sometimes the right move, but "I get to sleep when there's a problem" isn't a consideration there. The pager still goes off.
BTW, event replaying is coming in 2 weeks with Rudder :) Also, we are developing extensive test suites to check for API deprecations, etc. That was one of the reasons we picked JS for our transformations - shipping an update would be easy. Infra outage - sure, but you probably have prod infra to manage anyway, so why not one more piece?
No doubt Segment is a great product, and there will always be folks who want its simplicity. However, at some scale, companies should take the privacy of their user data seriously.
I am a happy Segment user, however their handling of a recent data breach did leave a bit of a sour taste. They took several days to get back to me with a definitive answer as to whether our customer data that we collected was compromised by the internal breach.
I am keen to look at competitive product where we may have more control over the data collected and can manage the risk ourselves.
Pricing: Honestly, we haven't figured out the business model yet. Like other open-source products, it will likely be a combination of support + enterprise features (like HA, auto-scale etc) but again we don't know what those enterprise features would be.
Ratchet-up pricing: Good question, and I'm not sure how to answer. Our vision is to be like other open-source companies such as Mattermost and ES. Our base version (which would work for 90% of users) would be free and under an open-source license. But I do understand your concern - maybe there is a way to put that in the license (that the base version will be perpetually open source).
- Here is our company page (https://www.linkedin.com/company/rudderlabs). Our lead investor is S28 Capital (partner: Shvet); they have also invested in Mattermost (an open-source Slack competitor).
Thank you for being candid and straightforward in your responses. If you can get this to work really well and make it easy to deploy, I think it can be a great service to the FOSS community and also a profitable support/enterprise business for you. Good luck.
Is it really so difficult for engineers to create a task to process a Kafka topic? It takes one day to write a program that consumes from a topic of events and pushes to an API like Amplitude, and you have total flexibility in how you push to those integrations.
Why would you use Postgres for an event processing system? This seems like an inefficient architecture.
Great question. The complication arises because of failures: one or more destinations may be down for any length of time, individual payloads may be bad, etc. To handle all of these you need to retry with timeouts while not blocking other events to other destinations. Also, not all events may be going to all destinations.
We built our own streaming abstraction on top of Postgres. Think of it as a layman's leveled compaction. We will write a blog post on that soon. The code (jobsdb/jobsdb.go) should have some comments too in case you want to check it out. Segment had a similar architecture and a blog post on it, but I can't seem to find it. Also, eventually we will replace Postgres with something lower-level like RocksDB or even native files.
Yes, in theory you could use Kafka's streaming abstraction and create a topic per destination. Two reasons we didn't go that route:
1) We were told Kafka is not easy to support in an on-prem environment. We are not Kafka experts, but we paid heed to people who have designed and shipped such on-prem software.
2) More importantly, for a given destination, we have dozens of writers all reading from the same stream. The only ordering requirement is that events from a given device (end consumer) stay in order, so we assign the same user to the same writer. The writers themselves are independent: if a payload fails, we just block events from that user while other users continue. Blocking the whole stream for that one bad payload (retried 4-5 times) would slow things down quite a bit. To achieve the same abstraction on Kafka, we would have had to create dozens of topics per destination.
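The per-user routing rule in (2) can be sketched roughly as follows. This is an illustrative JavaScript sketch, not Rudder's actual Go implementation; the hash and the `send` interface are made up for the example:

```javascript
// Deterministically map a user to a writer so that all of one user's
// events go through the same writer (preserving per-user ordering).
function assignWriter(userId, numWriters) {
  // Simple djb2-style string hash; any stable hash works here.
  let h = 5381;
  for (const ch of userId) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h % numWriters;
}

// If a user's payload fails, block only that user - the other users
// keep flowing, instead of the whole stream stalling on one bad event.
const blockedUsers = new Set();
function dispatch(event, writers) {
  if (blockedUsers.has(event.userId)) return; // retried later
  const writer = writers[assignWriter(event.userId, writers.length)];
  if (!writer.send(event)) blockedUsers.add(event.userId);
}
```

Getting this "block one user, not the partition" behavior on Kafka is what would have required many topics (or partitions) per destination.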
I'm curious how you guys are using Postgres as a log.
I took a brief look at the code, and while the append-only dataset strategy is sound, it looks like your scenario only has a single reader and a single writer?
In my experience, it's not entirely trivial when you have:
1. Multiple readers who each needs to be able to follow the log from different positions in real time.
2. Multiple writers receiving events that need to be written to the end of the log.
3. Real-time requirements.
From what I can tell — I could be wrong here — your system doesn't need to poll the table constantly, because you also save the log to RAM, so whenever you receive an event, you can optimistically handle it in memory and merely issue status updates. If anything goes wrong, a reader can replay from the database.
But that doesn't work with multiple writer nodes where each node receives just a part of the whole stream. The only way for this to work would be to dedicate a writer node to each stream so that it goes through the same RAM queue. So then you need a whole system that uses either Postgres or some consensus system like etcd to route messages to a single writer, and you need to be able to recover when a writer has been unavailable.
Edit: I see you wrote that "we assign same user to same writer", so you're doing something like that.
Agreed. Our current implementation does not work when there are multiple readers for the same event stream and we need to track per-reader watermarks. We have a very simple model where one reader reads from DB and distributes the work to multiple workers (e.g. network writers) which in turn update the job status.
Multiple writers should work though. StoreJob() should handle that.
I missed the logging to RAM part. Yes, we always wanted to do that but haven't gotten to that yet. Right now, all events are moved through the DB - between gateway and processor and then router. Hence, we poll the table constantly.
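For readers curious about the shape of this, here is a hypothetical, heavily simplified sketch of the jobsdb idea described above: an append-only job store with statuses, a single reader polling for pending jobs, and workers reporting status back. An in-memory array stands in for the Postgres table; the real implementation lives in jobsdb/jobsdb.go and will differ:

```javascript
// Minimal stand-in for an append-only jobs table with per-job status.
class JobsDB {
  constructor() {
    this.jobs = []; // append-only log of jobs
  }
  // Multiple gateway writers can append concurrently in the real system.
  store(job) {
    this.jobs.push({ ...job, status: 'pending' });
  }
  // The single reader's poll: fetch up to `limit` unprocessed jobs.
  getPending(limit) {
    return this.jobs.filter(j => j.status === 'pending').slice(0, limit);
  }
  // Workers report back: 'succeeded', 'failed', 'waiting', etc.
  updateStatus(id, status) {
    const job = this.jobs.find(j => j.id === id);
    if (job) job.status = status;
  }
}
```

In this model the reader re-polls the table in a loop and distributes the pending jobs to workers, which matches the "one reader, many workers" description above.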
How would you get ordering for millions of users? How many Kafka topics would you create? How would you manage failed events - would you reorder the whole queue?
I don't think it is inefficient. The Segment blog linked below talks about the specifics of the problem.
Interesting - we've built an in-house solution that is also Go-based and also writes to a Postgres DB besides forwarding events, but it's much simpler, has no UI, and comes with backend SDKs already.
What I found interesting is that you quote 3k events per second on a rather beefy 2xlarge machine. Our version is MUCH less demanding; I wonder if there isn't a lot of performance left on the table here.
I'll keep this in mind once we've grown out of our solution, though.
The bottleneck for us (on that instance) is not Postgres but transformations. Transformations are tiny JavaScript snippets which convert the event from Rudder JSON to whatever structure (JSON, key-value, etc.) the destination expects. We also support user-defined transformations - functions defined by the user to transform/enhance the event.
Currently, transformations run in Node.js. So, for every batch of events there is a call from Go into Node.js, and that is slow. We do batching/parallel calls, but it's still slow.
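As a hedged illustration of what such a destination transformation might look like - the field names here are invented for the example, and real destination schemas differ:

```javascript
// Map a canonical Rudder-style event JSON into the flat shape a
// hypothetical analytics destination expects.
function transformForDestination(event) {
  return {
    user_id: event.userId,
    event_type: event.event,
    event_properties: event.properties || {},
    time: Date.parse(event.timestamp), // epoch milliseconds
  };
}
```

Because these are plain JS functions, shipping a fix for a destination's API change is just shipping a new snippet, which is the update-story advantage mentioned above.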
What happens if I pass a 64-bit integer and the Rudder pipeline, being in JavaScript, silently down-casts it to a 53-bit integer?
Segment's pipeline involves JS at some point. We had an issue where our 64-bit integers were silently down-cast. We found out the hard way. We use strings now (perhaps we should have used strings right away; I am not necessarily the sharpest tool in the shed).
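The precision loss being discussed is easy to demonstrate. JavaScript numbers are IEEE-754 doubles, so integers above 2^53 - 1 (`Number.MAX_SAFE_INTEGER`) cannot be represented exactly:

```javascript
// A 64-bit ID above 2^53 - 1 is silently rounded when it passes
// through a JavaScript pipeline as a number.
const big = 9007199254740993;          // 2^53 + 1 as a numeric literal
console.log(big === 9007199254740992); // true - rounded to 2^53
console.log(Number.MAX_SAFE_INTEGER);  // 9007199254740991

// Carrying 64-bit IDs as strings (as described above) avoids the loss:
const asString = "9007199254740993";
console.log(asString === "9007199254740992"); // false - strings are exact
```

Nothing throws and nothing warns, which is why this kind of bug is found "the hard way".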
This is awesome! I never knew anyone had built something like Segment. I will try this alternative at our company and will post an update here about the result.
Author here. Happy to help you out with this, in case you need it. Please email me at soumyadeb@rudderlabs.com. Or please join our slack - https://rudderlabs.herokuapp.com/
Thanks for asking, I am one of the authors of Rudder.
Snowplow is a great analytics tool, especially used for internal analytics. It is open-source and on-prem, and it keeps your data private. It is centered around event enriching and storing to a data warehouse.
We are aiming at routing events reliably to destinations, transforming events in real time, storing them in your data warehouse with a dynamic schema, and eventually building a data platform with help from the community.
Ok, that is interesting. I've been thinking about a Snowplow implementation for a while, but it seems like a big task. Could you expand on the differences?
> Snowplow is a great analytics tool, especially used for internal analytics
> centered around event enriching and storing to a data warehouse
Is rudder not aimed at these use cases?
I seem to gather that you are building a more full-stack platform, with a core feature set similar to or better than Snowplow's?
Edit: Can you expand on the types of destinations you plan on supporting? I see Sources [Android, iOS, JS] and Destinations [Amplitude, GA, Hubspot, etc.] but no data warehouses. Can I send raw events to a Snowflake DWH, for instance?
Would love to work with you while you are evaluating this. Please email soumyadeb@rudderlabs.com or join our Slack (https://rudderlabs.herokuapp.com) if you need any help.
Honestly, we haven't figured out our pricing model yet - it will likely be a combination of support + enterprise features (we don't know yet what those features would be).
At this point we want to make our pilots/customers successful. We'd love to understand your use case first, and then we can discuss pricing (e.g. if you need us to manage it, etc.). Can I follow up with you on your HN email?
Haha, it is https://www.rudderlabs.com. Unfortunately, we haven't done anything re: SEO/SEM. I think we have no-index on our website (the default with WP Engine) - we'll fix it.
One issue I had with Segment: I couldn't run real-time transformations of events to join data from our data tables. We eventually got around it with AWS Lambda and sending the result back to Segment. Segment recently announced Functions to help with this, but I still could not get my hands on it.
Yes, that is exactly the use case of our "user-defined" transformation functions. You can define any JavaScript function (right now by modifying the code, but it will be available from the UI in the release coming next week). Inside that function you can filter/transform/enhance the event in any way you like. You can look up your DB, call external APIs, etc. You can also combine multiple events into one.
Since this whole thing runs inside your VPC, you don't have to open up your production database to a 3rd party, as you do with Segment.
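A hypothetical user-defined transformation in the spirit described above - the hook name, event fields, and lookup helper are all invented for illustration, and Rudder's actual signature may differ:

```javascript
// Filter out internal users, enrich each event with data looked up
// from your own systems, and strip fields before forwarding.
function transform(events) {
  return events
    .filter(e => !(e.context && e.context.internalUser)) // filter
    .map(e => ({
      ...e,
      properties: { ...e.properties, plan: lookupPlan(e.userId) }, // enrich
      context: undefined, // drop before sending downstream
    }));
}

// Stand-in for a lookup against your own database or internal API -
// possible here precisely because the pipeline runs inside your VPC.
function lookupPlan(userId) {
  return userId.startsWith('ent-') ? 'enterprise' : 'free';
}
```

The DB/API lookup is the part that would be awkward with a hosted pipeline, since it would require exposing internal services to a third party.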
Happy to work with you on your use case. Please email soumyadeb@rudderlabs.com
Hey there! I work at Segment, and I'm one of the engineers working on Segment Functions. If you let me know your Segment workspace's name (get in touch through jason.tu@segment.com), I can grant you beta access.
Yes, this is exactly the use case we want to target. What's your cloud infra? We can easily run inside your AWS VPC. If you have your own private cloud, we can run there too - just need to disable the S3 dump. The only dependency we have is Postgres.
Actually asking because we are building data science automation software which we deploy onto our customers' cloud infrastructure. Very similar value proposition to Rudder re: privacy and security, but different use case.
Always interested in how other people achieve similar effects.
Right now, for AWS, we provide customers with a zipped up Packer+Terraform configuration with source and a README which lets them:
1. (terraform+sh) Create an IAM user responsible for the deployment with the right policy to deploy and export environment variables for the user.
2. (packer) Create an AMI for an instance that can host our APIs
3. (terraform) Spin up an EC2 instance on the desired VPC to host our APIs (with optional key pair for debugging, etc.)
Analytics.js was only a client/browser-side utility. They developed an entire backend stack later - which is needed for data-warehouse dumps, event replay, etc. We are also developing a complete backend stack with all those features. This is still a work in progress.
We're lucky to have found and deployed this project early at our startup. Being a cybersecurity company, we cannot have our customers' data leave our AWS account.
It’s licensed under SSPL, so the definition of “open source” may vary. (Specifically, they have an enterprise business model, so this is more properly a proprietary, source-available on-prem Segment competitor.)
Founder here. Yes, that is an appropriate statement. We want to be a business like Mattermost, Kafka, Elasticsearch, and so on. We are committed to building an open-source community.
Honestly, we were not prepared for this post. We don't even have a license attorney. We just picked the SSPL because that's what everyone seemed to suggest to prevent the likes of AWS from cloning it. By no means are we experts.
Since you have chosen the SSPL in order to stop companies free-riding off your work, you may want to also look at https://licensezero.com as another license option for this or other work you release.
“Open source” software and “free” software as concepts have become entangled to an unhealthy level. Burnout and software project sustainability issues are important to be aware of and manage if we want to have successful long term open source projects that aren’t just sponsored by large companies like Microsoft, Google, Facebook and so on.
So I’m glad you thought ahead and chose the SSPL, good luck and don’t let the haters get to you, at the end of the day it is “open source” and you’ll still be able to feed yourself!
I think you’ve got a strong value proposition, especially in a GDPR/CCPA world. Have you figured out a channel and integration partner strategy yet? That’s probably the hill that you’re going to need to fight on.
No, I don't think we even understand the problem :) You mean it would be hard to work with all 200 or so 3rd-party providers Segment integrates with, without having 1:1 relationships with them?
Solve it the same way Singer solved it for the opposite problem - provide a strong library/convention for how things are done, and trust the community to contribute while also providing integrations for the most used services.
When I first found Singer, it had fewer than a dozen taps & targets. Now it has multiple dozens.
That’s definitely one advantage they have in the market, but it’s also related to your sales strategy. You’re looking for companies that have the need for large-scale streaming customer and analytics data routing and transformation, no existing investment in Segment, and the ability to manage the on-prem IT and development load. While you can probably go after smaller and more agile fish with, e.g., recipes for common needs, this reads a lot like big-enterprise transformation sales and support, so you might want to look at connecting with established players who are already selling in those markets - consulting orgs, analytics tool vendors, or software companies selling things like API-first e-comm, where you’re upending legacy stacks.
Hmm, honestly we don't have any connections into those. Would love to connect with you if you can help us.
Interestingly, some of our initial pilots chose us not so much for privacy but for pricing. Segment's pricing doesn't work if you have a lot of free users (free games, online bloggers, etc.). There was a Hacker News thread on this: https://news.ycombinator.com/item?id=19220518
We want to compete on the privacy angle, but it looks like the pricing might get us the initial downloads.
Thanks for Rudder. Segment pricing was a deterrent for us because we have a free tier. I searched a few days back, but I am so glad I found something so awesome again on Hacker News :)
Yes. Customer services, as in both support and feature development, should be enough to keep paying customers paying. If a bunch of "use at your own risk" software can replace it, what's the actual value of a business?
With regards to:
>" Rudder runs as a single go binary with Postgres. It also needs the destination (e.g. GA, Amplitude) specific transformation code which are node scripts. "
what would be a recommended approach if I would like to keep the data internal and not use an external analytics engine?
Thank you. My understanding was that S3 is also 'external' (it's an AWS service).
What I was hoping/looking for is a lightweight tool that can be installed alongside Rudder and can use the PG instance (maybe just a different database within Rudder's instance, or just a different schema).
In other words, a lightweight, end-to-end solution (including analytics) that can 'grow' with our needs.
For many small self-funded startups that care about privacy, a combination of the following challenges exists:
a) they need self-hosted solutions with a minimal footprint (because the data, even anonymized, is hosted on self-paid VPS instances)
b) they need solutions that do not require an initial purchase/investment, yet can scale out (both technically and in terms of support) when/if the startup's commercial model becomes more successful
c) they need solutions that are not heavily VC-backed, because those typically do not have a stable engagement model (the rates of return VCs demand typically force changes to the initial service/open-source model), while self-funded startups are typically much slower-growing (and slower to commit to purchases)
d) they need self-hosted solutions that cover a lot of ground in one (so that the learning curve and integration time are minimized).
In the next 2 weeks, we are releasing Redshift as a destination. After that, we have a PostgreSQL destination in our pipeline. You can configure it and capture everything there.
If you prefer that to be in files, you could set up a MinIO server on your VPS instance. That's coming in the next few weeks too.
https://github.com/minio/minio
We would like to understand your preference on this so that we could align our next set of destinations.
Please drop an email at sumanth@rudderlabs.com or join our Slack/Discord channels.
Yes, if you host the JS SDK yourself. However, certain destinations require loading their JavaScript anyway (Segment calls them client-side integrations), and those will still be blocked. The vast majority of other products (e.g. analytics) would work, though.
Indeed. That's why we have raised a seed round and are doing this full time as a serious business. We are passionate about privacy but understand that we cannot have a good shot at this while doing it part-time.
I am sure they do. We have great respect for Segment. However, at some point/scale a company needs to take ownership of their user data.
Yes, some events may go to 3rd-party vendors (analytics, advertising), but not everything goes, and those events often have PII removed. We want Rudder to be the point where you decide and enforce what goes where.