However, I believe it's misleading to call it "open source". The SSPL is not generally considered an open source license by any meaningful definition; in particular, it does not meet the OSI definition and is incompatible with most licenses that do.
I understand the need these days to protect against aggressive cloud providers, but there are other ways to achieve that without abandoning open source entirely, such as the BSL.
(See https://opensourceforu.com/2019/06/cockroach-labs-changes-it... )
But at heart we want to build an open-source community while still being a viable business, along the lines of Mattermost, Elastic, etc.
My understanding is that Mattermost is okay with others making money from their software as long as they don't modify it - which works in practice for some, but not all, small companies, and is very difficult for big companies. If the big companies want to modify and use Mattermost-server for free, they are forced to contribute their changes back to the open-source project, and then they can make as much money as they want. Or they can use option 2: pay Mattermost a bunch of money for the privilege of not contributing code back to the open-source project. In other words, FAANG and co can contribute to Mattermost either financially or in code - their pick.
Very excited about this project!
It would be nice to have an open source alternative that doesn't get you locked into an unpredictable pricing model that you have very little control over.
Happy to help you try this out if you want (please email email@example.com). We are behind Segment on the number of integrations and features, but we will catch up, and we're hoping for community support on that.
Don't gloss over the fact that telemetry over cell networks can be costly for users (though more and more plans are unlimited, so this doesn't worry me as much) and can drain batteries. However you do it, data that's not latency-critical should be buffered and batched.
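To make the buffer-and-batch point concrete, here is a minimal sketch (not any particular SDK's implementation; the class name, thresholds, and `send` callable are all illustrative) of a client-side buffer that flushes either when a batch fills up or when its oldest event gets too stale:

```python
import time

class TelemetryBuffer:
    """Buffer non-latency-critical events and flush them in batches,
    so the radio wakes up once per batch instead of once per event."""

    def __init__(self, max_batch=50, max_age_seconds=30.0, send=print):
        self.max_batch = max_batch
        self.max_age = max_age_seconds
        self.send = send          # callable that ships a list of events upstream
        self.events = []
        self.oldest = None        # timestamp of the first buffered event

    def add(self, event, now=None):
        now = time.monotonic() if now is None else now
        if not self.events:
            self.oldest = now
        self.events.append(event)
        # Flush when the batch is full or the oldest event is too stale.
        if len(self.events) >= self.max_batch or now - self.oldest >= self.max_age:
            self.flush()

    def flush(self):
        if self.events:
            self.send(self.events)
            self.events = []
            self.oldest = None
```

A real mobile SDK would additionally flush on app background/shutdown and persist the buffer to disk so events survive a crash.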
Are there things you believe make Segment not privacy- and security-focused? As a long-time user of Segment, I find their Protocols feature and new data and privacy features world-class for this.
Having also just left their Synapse conference, privacy and security was the #1 topic of discussion throughout. I would say they are very much privacy- and security-focused.
Not trying to shill, but it comes off as though you are misrepresenting Segment a little bit. "Open source Segment alternative" would probably have been just fine.
However, unlike Segment, we don't require you to send all your events to us. You are welcome to download and run the software yourself without ever talking to us.
We believe having to send everything to a 3rd party (Segment) goes against the fundamental goal of guaranteeing user privacy.
It is the managed infrastructure. Outage? You can keep sleeping. Google Analytics deprecating their API for a new one? Segment's on it.
They have all your data. They can replay all events from the beginning if needed (so they say; I've never had to ask for this).
I don't think an outage in a third party service I've depended on has ever allowed me to keep sleeping. In reality, I've been awake, dealing with outage fallout, keeping up to date on status updates, and overall just feeling in the dark, unable to improve the situation for my customers.
That's not to say that depending on a third-party service isn't sometimes the right move, but "I get to sleep when there's a problem" isn't a consideration there. The pager still goes off.
No doubt Segment is a great product, and there will always be folks who want its simplicity. However, at some scale, companies should take the privacy of their user data seriously.
I am keen to look at a competitive product where we may have more control over the collected data and can manage the risk ourselves.
- Are you charging for support?
- Do you/will you have a paid enterprise tier that will increasingly be the only tier with a viable feature set?
- What's keeping you from dumping on Segment's market until you hit traction then ratcheting up to Segment's pricing?
- Who are you? Who are your investors?
Ratcheting up pricing: Good question, and I'm not sure how to answer. Our vision is to be like other open-source companies such as Mattermost and Elastic. Our base version (which would work for 90% of users) would be free and under an open-source license. But I do understand your concern - maybe there is a way to put that in the license (i.e., that the base version will be perpetually open source).
- Here is our company page (https://www.linkedin.com/company/rudderlabs). Our lead investor is S28 capital (Partner: Shvet) - they have also invested in Mattermost (an open-source slack competitor)
Why would you use Postgres for an event processing system? This seems like an inefficient architecture.
We built our own streaming abstraction on top of Postgres - think a layman's leveled compaction. We will write a blog post on that soon. The code (jobsdb/jobsdb.go) has some comments too, in case you want to check it out. Segment had a similar architecture and a blog post on it, but I can't seem to find it. Also, eventually we will replace Postgres with something lower-level like RocksDB or even native files.
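The core idea behind a job queue on a relational database can be sketched roughly like this (this is my own illustration, not RudderStack's actual jobsdb code; it uses SQLite in place of Postgres, and the table and function names are made up). Jobs are append-only, and per-job state lives in a separate status table, so a retry or failure is just another status row rather than an update to the job itself:

```python
import sqlite3

# Append-only jobs table plus a separate status log.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE jobs (
        job_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL
    );
    CREATE TABLE job_status (
        job_id  INTEGER NOT NULL REFERENCES jobs(job_id),
        state   TEXT NOT NULL,                 -- 'executing', 'succeeded', 'failed'
        ts      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
""")

def store_job(payload):
    cur = db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid

def mark(job_id, state):
    db.execute("INSERT INTO job_status (job_id, state) VALUES (?, ?)", (job_id, state))

def pending_jobs(limit=100):
    # A job is pending if its most recent status (if any) is not terminal.
    return db.execute("""
        SELECT j.job_id, j.payload FROM jobs j
        LEFT JOIN (
            SELECT job_id, MAX(rowid) AS last FROM job_status GROUP BY job_id
        ) s ON s.job_id = j.job_id
        LEFT JOIN job_status st ON st.rowid = s.last
        WHERE st.state IS NULL OR st.state NOT IN ('succeeded', 'failed')
        ORDER BY j.job_id LIMIT ?
    """, (limit,)).fetchall()
```

The "leveled compaction" part would then amount to periodically dropping table segments whose jobs have all reached a terminal state, which the append-only layout makes cheap.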
Yes, in theory you could use Kafka's streaming abstraction and create a topic per destination. Two reasons we didn't go that route:
1) We were told Kafka is not easy to support in an on-prem environment. We are not Kafka experts, but we paid heed to people who have designed and shipped such on-prem software.
2) More importantly, for a given destination, we have dozens of writers all reading from the same stream. The only ordering requirement is that events from a given device (end consumer) stay in order, so we assign the same user to the same writer. However, the writers themselves are independent: if a payload fails, we just block events from that user while other users continue. Blocking the whole stream for that one bad payload (retried 4-5 times) would slow things down quite a bit. To achieve the same abstraction on Kafka, we would have had to create dozens of topics per destination.
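The user-to-writer assignment and per-user blocking described above can be sketched like this (my own illustration of the technique, not the actual Rudder code; the hash choice and class names are arbitrary):

```python
import hashlib

NUM_WRITERS = 8

def writer_for(user_id: str) -> int:
    """All events from one user hash to the same writer, which preserves
    per-user ordering even though the writers are otherwise independent."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_WRITERS

class Writer:
    def __init__(self, deliver):
        self.deliver = deliver      # callable: attempt delivery, may raise
        self.blocked = {}           # user_id -> queued events awaiting retry

    def handle(self, user_id, event):
        if user_id in self.blocked:
            # This user already has a failed event in flight; queue behind it
            # so this user's ordering is preserved. Other users are unaffected.
            self.blocked[user_id].append(event)
            return
        try:
            self.deliver(event)
        except Exception:
            # Park the failed event for retry instead of stalling the stream.
            self.blocked[user_id] = [event]
```

Doing the equivalent on Kafka would indeed require one partition (or topic) per independently-blockable unit, which is where the "dozens of topics per destination" cost comes from.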
I took a brief look at the code, and while the append-only dataset strategy is sound, it looks like your scenario only has a single reader and a single writer?
In my experience, it's not entirely trivial when you have:
1. Multiple readers who each need to be able to follow the log from different positions in real time.
2. Multiple writers receiving events that need to be written to the end of the log.
3. Real-time requirements.
From what I can tell — I could be wrong here — your system doesn't need to poll the table constantly, because you also save the log to RAM, so whenever you receive an event, you can optimistically handle it in memory and merely issue status updates. If anything goes wrong, a reader can replay from the database.
But that doesn't work with multiple writer nodes where each node receives just a part of the whole stream. The only way for this to work would be to dedicate a writer node to each stream so that it goes through the same RAM queue. So then you need a whole system that uses either Postgres or some consensus system like etcd to route messages to a single writer, and you need to be able to recover when a writer has been unavailable.
Edit: I see you wrote that "we assign same user to same writer", so you're doing something like that.
Multiple writers should work, though. StoreJob() should handle that.
I missed the logging-to-RAM part. Yes, we always wanted to do that but haven't gotten to it yet. Right now, all events are moved through the DB - between the gateway, the processor, and then the router. Hence, we poll the table constantly.
Would love it if you joined our Discord channel https://discordapp.com/channels/625629179697692673/625629179.... It's slightly easier to have technical discussions there :)
I don't think it is inefficient. The Segment blog linked below talks about the specifics of the problem.
What I found interesting is that you wrote 3k events per second on a rather beefy 2xlarge machine. Our version is MUCH less demanding; I wonder if there isn't a lot of performance left on the table here.
I'll keep this in mind once we've grown out of our solution, though.
Currently, transformations are run in Node.js, so for every batch of events there is a call from Go into Node.js, and that is slow. We do batching/parallel calls, but still.
I think Postgres gets us >15K events/sec throughput.
Segment's pipeline involves JS at some point. We had an issue where our 64-bit integers were silently down-casted. We found out the hard way. We use strings now (perhaps we should have used strings right away; I am not necessarily the sharpest tool in the shed).
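For anyone wondering how the down-casting happens: JavaScript stores all numbers as IEEE-754 doubles, which carry only 53 bits of integer precision, so any 64-bit ID above 2**53 silently loses its low bits on the way through a JS pipeline. Python floats are the same 64-bit doubles, so the effect is easy to demonstrate:

```python
# 2**53 + 1 fits comfortably in a 64-bit integer, but not in a double:
big_id = 2**53 + 1                     # 9007199254740993
assert float(big_id) == float(2**53)   # the +1 is silently rounded away

# Encoding the ID as a string sidesteps the problem entirely:
assert str(big_id) == "9007199254740993"
```

This is exactly why JSON APIs that deal with 64-bit IDs (Twitter's `id_str` being the classic example) send them as strings.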
We are aiming to route events reliably to destinations, transform events in real time, store them in your data warehouse with a dynamic schema, and eventually build a data platform with help from the community.
> Snowplow is a great analytics tool, especially used for internal analytics
> centered around event enriching and storing to a data warehouse
Is rudder not aimed at these use cases?
I seem to gather that you are building a more full-stack platform, with a core feature set similar to, or better than, Snowplow's?
Edit: Can you expand on the types of destinations you plan to support? I see Sources [Android, iOS, JS] and Destinations [Amplitude, GA, Hubspot, etc.] but no data warehouses. Can I send raw events to a Snowflake DWH, for instance?
I haven't looked into Rudder, but I'll switch if it offers easier setup and schema.
I know self-hosting Snowplow is already dirt cheap, but is there any price comparison by any chance?
At this point we want to make our pilots/customers successful. Would love to understand your use case first, and then we can discuss pricing (e.g. if you need us to manage it, etc.). Shall I follow up with you on your HN email?
Any plans on this?
Since this whole thing runs inside your VPC, you don't have to open up your production database to a 3rd party, as you have to do with Segment.
Happy to work with you on your use case. Please email firstname.lastname@example.org
We wrote a couple of blog posts on that.
Case Study: https://rudderlabs.com/customer-case-study-casino-game/
Is there an easy way for someone to set it up on their cloud infrastructure?
Happy to help you set up - please email email@example.com OR join our Slack https://rudderlabs.herokuapp.com
Always interested in how other people achieve similar effects.
Right now, for AWS, we provide customers with a zipped up Packer+Terraform configuration with source and a README which lets them:
1. (terraform+sh) Create an IAM user responsible for the deployment with the right policy to deploy and export environment variables for the user.
2. (packer) Create an AMI for an instance that can host our APIs
3. (terraform) Spin up an EC2 instance on the desired VPC to host our APIs (with optional key pair for debugging, etc.)
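For anyone unfamiliar with this pattern, step 3 boils down to a Terraform resource along these lines (a hypothetical sketch only - the variable names, instance type, and tags are illustrative, not the actual shipped configuration):

```hcl
# Step 3 (sketch): a minimal instance in the target VPC.
variable "subnet_id" {}
variable "ami_id" {}                     # the AMI built by the Packer step
variable "key_name" { default = "" }     # optional key pair for debugging

resource "aws_instance" "rudder_api" {
  ami           = var.ami_id
  instance_type = "t3.large"
  subnet_id     = var.subnet_id
  key_name      = var.key_name != "" ? var.key_name : null

  tags = {
    Name = "rudder-api"
  }
}
```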
Honestly, we were not prepared for this post. We don't even have a licensing attorney. We just picked the SSPL because that's what everyone seemed to suggest to prevent the likes of AWS from cloning it. By no means are we experts.
“Open source” software and “free” software as concepts have become entangled to an unhealthy level. Burnout and software project sustainability issues are important to be aware of and manage if we want to have successful long term open source projects that aren’t just sponsored by large companies like Microsoft, Google, Facebook and so on.
So I’m glad you thought ahead and chose the SSPL, good luck and don’t let the haters get to you, at the end of the day it is “open source” and you’ll still be able to feed yourself!
When I first found Singer, it had fewer than a dozen taps & targets. Now it has multiple dozens.
Interestingly, some of our initial pilots chose us not so much for privacy as for pricing. Segment's pricing doesn't work if you have a lot of free users (free games, online bloggers, etc.). There was a Hacker News thread on this: https://news.ycombinator.com/item?id=19220518
We want to compete on the privacy angle, but it looks like pricing might get us the initial downloads.
With regards to:
>" Rudder runs as a single go binary with Postgres. It also needs the destination (e.g. GA, Amplitude) specific transformation code which are node scripts. "
what would be a recommended approach if I would like to keep the data internally and not use an external analytics engine?
Support for other data-warehouses (Redshift, Bigquery etc) is coming soon.
What I was hoping/looking for is another lightweight tool that can be installed alongside Rudder and can use the PG instance (maybe just a different database within Rudder's instance, or just a different schema).
That would give a lightweight, end-to-end solution (including analytics) that can 'grow' with one's needs.
For many small self-funded startups that care about privacy, a combination of the following challenges exists:
a) they need self-hosted solutions with a minimal footprint (because data, even anonymized, is hosted on self-paid VPS instances)
b) they need solutions that don't require an initial purchase/investment, yet can scale out (both technically and in terms of support) when/if the startup's commercial model becomes more successful
c) they need solutions that are not heavily VC-backed, because those typically don't have a stable engagement model (VCs' rate-of-return demands typically force changes to the initial service/open-source model, while self-funded startups typically grow much more slowly, and are slower to commit to spending)
d) they need self-hosted solutions that cover a lot of ground in one package (so that the learning curve and integration time are minimized).
If you prefer that to be in files, you could set up a MinIO server on your VPS instance. That's coming in the next few weeks too.
We would like to understand your preference on this so that we could align our next set of destinations.
Please drop an email at firstname.lastname@example.org or join our Slack/Discord channels.
Yes, some events may go to 3rd-party vendors (analytics, advertising), but not everything goes, and what does often has PII removed. We want Rudder to be the point where you decide and enforce what goes where.