Show HN: RudderStack, open-source CDI (a.k.a. open-source Segment)
123 points by soumyadeb 7 months ago | 39 comments
GitHub: https://github.com/rudderlabs/rudder-server/

===

Firstly, a big thanks to the HN community for showing us love and support in our previous HN post (https://news.ycombinator.com/item?id=21081756). At that point, we had just open-sourced the repo and were not fully prepared for a Show HN. We wanted to share updates since then and also do our official Show HN.

Updates since Sept 2019

1. Changed the name from Rudder to RudderStack :)

2. API compatibility with Segment

3. Open-source control plane so no dependency on the hosted control plane for open-source users. (https://github.com/rudderlabs/rudder-server/blob/config-gen/...)

4. Multiple hosting options: Docker, Kubernetes, Terraform, Native.

5. ~30 integrations (https://rudderstack.com/) including cloud mode and device mode

6. Support for all the popular data warehouses & lakes - Redshift, Snowflake, BigQuery, S3, Google Cloud Storage, Azure Blob Storage

7. Detailed documentation - https://docs.rudderstack.com/

8. Multiple production deployments, including a few really large ones (our largest deployment is sending a peak of ~40K events/sec, ~300M events/day)

9. Switched license from SSPL to AGPLv3 (after long discussions internally as well as on HN)

10. Built some interesting Analytics & ML use cases

11. Launched our “paid plans” (primarily around managed hosting)

Wishing everyone the best in staying safe from COVID-19.




I recently joined Mattermost, and am currently in the midst of getting us switched over from using Segment to RudderStack.

The RudderStack team has been super responsive and it's great to be able to support another open-source based business.

The biggest selling point for us is being able to maintain total control of our data while having to do barely more than change an endpoint. It also gives us one place to send all our analytics data across our various platforms, and it just works.


Thanks team Mattermost. Super excited to be working with you guys!!


Here at @Grofers we have been using rudder for a while. A great solution for people who want to create their customer data platform. Plus an awesome team who is always happy to help out :)


Thanks team @Grofers for the kind words and for helping us build this out as our early patrons.


Always a bit frustrating to hit a pricing page and not see pricing. Please take time to provide at least some info https://rudderstack.com/pricing/


Fair point.

We don't post pricing because we're still figuring it out (always a tricky subject for an open-source project). We charge a platform fee and a small per-node charge (beyond 1 node), but there are many dimensions there (node size, number of nodes, etc.) and we haven't figured out all the combinations. The platform fee generally starts at $2k/month, but we offer discounts for startups, open-source projects, nonprofits, etc.

Right now we are more focused on growing our open-source adoption and customer base, and we want to talk to our prospects to better understand this.

As a benchmark 1 node can handle ~1K events/sec.
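
As a rough back-of-the-envelope sketch of what that benchmark implies, using only figures quoted in this thread (~1K events/sec per node, a ~40K events/sec peak, ~300M events/day): it assumes throughput scales linearly with node count, which is an assumption and not something measured here, and the function names are made up for illustration.

```python
import math

EVENTS_PER_SEC_PER_NODE = 1_000  # benchmark quoted above (~1K events/sec per node)
SECONDS_PER_DAY = 86_400


def nodes_needed(peak_events_per_sec: float,
                 per_node: float = EVENTS_PER_SEC_PER_NODE) -> int:
    """Nodes required to absorb a peak rate, assuming linear scaling (an assumption)."""
    if peak_events_per_sec <= 0:
        return 0
    return math.ceil(peak_events_per_sec / per_node)


def average_events_per_sec(events_per_day: float) -> float:
    """Average rate implied by a daily event volume."""
    return events_per_day / SECONDS_PER_DAY


print(nodes_needed(40_000))                        # peak rate from the post
print(round(average_events_per_sec(300_000_000)))  # daily volume from the post
```

Sizing for the peak rather than the daily average matters here, since the two differ by roughly an order of magnitude in the numbers above.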


Bit of a novice question, but what is the point of a tool like this or Segment? I have a vague idea that they help with gathering your analytics, but I am not entirely sure how. Is it like Google Tag Manager, which you can use to control multiple analytics scripts? (Edit: this is a wrong comparison. I guess it is like a map function: you send it all your analytics data, and it splits the data and sends it to all the relevant services where you want it displayed.) Sorry for an off-topic and noob question.


Not at all, this is a great question and we get it all the time :)

Peter (Segment's CEO) had a great answer to this in the thread https://www.quora.com/What-is-the-advantage-of-using-Segment...

To summarize and highlight the main reasons:

1) With GTM you still have to write code. Specifically, you need to figure out how to send events to each destination following its API and JS library. With Segment and Rudder, you just call a couple of generic functions: identify, track, page, etc.

2) You often want a copy of the events in your own S3, Redshift, or Snowflake. GTM doesn't help you there, but Rudder/Segment can.

3) We have some features like Replay, which lets you send historic events to a new destination. Say you signed up for a new analytics tool and want to send it all your historic events - Replay helps there.

4) Finally, Segment has a product called Personas which lets you create audiences (e.g., give me all people who have done X but not Y, and then send that to MailChimp for emailing). We too are working on it.
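
To make point 1 a bit more concrete, here is a minimal sketch of the "generic functions" idea. This is not the RudderStack or Segment SDK: `EventClient`, `add_destination`, and the payload field names are all hypothetical, and a real deployment would send events over HTTP to a backend instead of calling in-process handlers.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class EventClient:
    """Hypothetical client: one generic API, many destinations behind it."""

    def __init__(self) -> None:
        self._destinations: Dict[str, Handler] = {}

    def add_destination(self, name: str, handler: Handler) -> None:
        self._destinations[name] = handler

    def _send(self, event: Event) -> None:
        event["sent_at"] = datetime.now(timezone.utc).isoformat()
        for handler in self._destinations.values():
            handler(dict(event))  # each destination gets its own copy

    def identify(self, user_id: str, traits: Optional[Event] = None) -> None:
        self._send({"type": "identify", "user_id": user_id,
                    "traits": traits or {}})

    def track(self, user_id: str, event: str,
              properties: Optional[Event] = None) -> None:
        self._send({"type": "track", "user_id": user_id, "event": event,
                    "properties": properties or {}})

    def page(self, user_id: str, name: str,
             properties: Optional[Event] = None) -> None:
        self._send({"type": "page", "user_id": user_id, "name": name,
                    "properties": properties or {}})


# Two stand-in destinations that just collect what they receive.
warehouse: List[Event] = []
analytics: List[Event] = []

client = EventClient()
client.add_destination("warehouse", warehouse.append)
client.add_destination("analytics", analytics.append)

client.identify("u1", {"plan": "free"})
client.track("u1", "Signup Clicked", {"button": "hero"})
print(len(warehouse), len(analytics))
```

The application code only ever calls `identify`, `track`, or `page`; adding another destination does not change the call sites.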


Oh interesting. So, how do you figure out where to send which data? Say I want to track an event in Google Analytics to see if a new user is being redirected to my site from another site, based on the Referer header (no idea if this is a valid use case for GA, it's just something I have implemented for one of my projects). How would Rudder figure out where to send that data? Does the user have to give a list of events that they want sent to GA? Or do I have to specify that I want an event sent to GA every time I call the Rudder API?


That's right. You would have to specify at a per-event level (by setting a flag in the Rudder JSON) where to forward that event.

We have something called Transformations (user-defined functions) with which you can modify the event structure from the Rudder backend. You write the transformation function (currently in JavaScript) in our UI, and it gets executed in the backend on your event stream. Using a transformation, you can also control where the event goes. This is helpful when, say, you want to change the destination without pushing an update to the mobile app.
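
A rough sketch of how a transformation might steer routing. Real transformations are written in JavaScript in the UI; this Python version only illustrates the logic. It assumes a per-event `integrations` map in the style of the Segment spec (an `All` default plus per-destination overrides); the exact field names Rudder expects, the event name, and both function names are assumptions.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def route_event(event: Event) -> Event:
    """Return a copy of the event with routing flags set (hypothetical rule)."""
    routed = dict(event)
    integrations = dict(event.get("integrations", {"All": True}))
    # Example rule: referral landings go only to Google Analytics.
    if event.get("type") == "track" and event.get("event") == "Referral Landing":
        integrations = {"All": False, "Google Analytics": True}
    routed["integrations"] = integrations
    return routed


def destinations_for(event: Event, available: List[str]) -> List[str]:
    """Resolve which of the configured destinations should receive the event."""
    flags = event.get("integrations", {"All": True})
    default = flags.get("All", True)
    return [name for name in available if flags.get(name, default)]


configured = ["Google Analytics", "Amplitude", "Redshift"]
referral = route_event({"type": "track", "event": "Referral Landing"})
other = route_event({"type": "track", "event": "Search"})
print(destinations_for(referral, configured))
print(destinations_for(other, configured))
```

Because the rule runs in the backend, changing it does not require shipping a new client build.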


Got it, thanks. Is there any way of transforming old data? Let's say I want to add a new platform just to store data about clicks, but some of the events stored through Rudder don't have the element id. Is there any way I can give these events a default element id before sending them off to the new platform?


Yes, that is the goal of the user transformation. You can add an element_id field to the event before it is forwarded to destinations.
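
Something along these lines, sketched in Python for illustration only (actual transformations are JavaScript, and the `properties` layout, the `"unknown"` default, and the function name are assumptions):

```python
from typing import Any, Dict

Event = Dict[str, Any]

DEFAULT_ELEMENT_ID = "unknown"  # hypothetical placeholder value


def add_default_element_id(event: Event,
                           default: str = DEFAULT_ELEMENT_ID) -> Event:
    """Return a copy of the event whose properties always carry an element_id."""
    transformed = dict(event)
    properties = dict(event.get("properties") or {})
    properties.setdefault("element_id", default)  # keep an existing id as-is
    transformed["properties"] = properties
    return transformed


old_click = {"type": "track", "event": "click", "properties": {"page": "/home"}}
new_click = {"type": "track", "event": "click",
             "properties": {"element_id": "buy-button"}}
print(add_default_element_id(old_click)["properties"])
print(add_default_element_id(new_click)["properties"])
```

Events that already have an element_id pass through unchanged; only the ones missing it pick up the default.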

Would love to understand your use case a bit more.


Founder here - happy to answer any questions


It's not really clear to me what it does. It seems like a bus for data, but how is it related to Kafka, for example? Sorry if this is too silly; I'm a bit lost, and I think it may be useful to me. I've checked Segment's website, and some of their stuff, like a unified view of a customer, may be useful to me. Regards


That's right. What we have today is more of a bus but for customer event data.

Let's say you have a website or mobile app and you want to collect all the user interaction events (clicks, searches, impressions) in a data-warehouse like Snowflake.

RudderStack can do it for you. We have SDKs (Android, iOS, JS, Python, etc.) which you can use to send events. We have a corresponding backend (you can run it or we can host it) which will collect these events and dump them into your warehouse.

You often want to send (a subset of) these events to 3rd-party services too, like Amplitude for analytics, Braze for marketing automation, and so on. RudderStack can forward events to those as well, so you don't have to embed multiple 3rd-party SDKs and learn each of their libraries.

The second part of this is to create a unified view of the customer as you mentioned and take action on that (e.g. find all churning users and send them email). That's where Segment has a product called Personas and we are working on something similar but it's not launched yet.

Even before that, you can build that customer view on top of the event data in your data warehouse by using SQL or a tool like Looker. For example, you can find all churning users by writing a simple SQL query, and then send that result out via something like Looker Action Hub.
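
As an illustration of the kind of query meant here, a small sketch using an in-memory SQLite database as a stand-in for the warehouse. The `events` table, its columns, the sample rows, and the definition of "churning" (no events since a cutoff date) are all assumptions made for the example, not the schema RudderStack writes.

```python
import sqlite3
from typing import List

# Users whose most recent event is older than the cutoff date.
CHURN_QUERY = """
SELECT user_id, MAX(sent_at) AS last_seen
FROM events
GROUP BY user_id
HAVING MAX(sent_at) < :cutoff
ORDER BY user_id
"""


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny events table with ISO dates (which sort correctly as text)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, sent_at TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("alice", "login", "2020-01-05"),
            ("alice", "search", "2020-03-28"),
            ("bob", "login", "2020-01-10"),
            ("carol", "click", "2020-02-01"),
        ],
    )
    return conn


def find_churning_users(conn: sqlite3.Connection, cutoff: str) -> List[str]:
    """Return ids of users with no events on or after the cutoff date."""
    rows = conn.execute(CHURN_QUERY, {"cutoff": cutoff}).fetchall()
    return [row[0] for row in rows]


print(find_churning_users(make_demo_db(), "2020-03-01"))
```

The resulting list of user ids is what you would then hand to an emailing or marketing tool.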

Happy to discuss more here or offline (email: soumyadeb@rudderstack.com)


In your pricing > faq - there is a mention of why you chose SSPL and NOT AGPLv3!

And yet you are on AGPLv3?

Can you please share your thoughts on why you moved to AGPLv3?


Oops, that was old.

We had long discussions on and off HN about this topic. The biggest problem with SSPL is it is not an open-source license and we got a bunch of pushback (rightly so) for calling ourselves open-source.

We initially went with SSPL to help prevent cloud providers from offering this as a service. However, there was a viewpoint that AGPLv3 already provides this and there is no reason for SSPL. Also, neither has been held up in court.

In the end, we prioritized something which our community felt strongly about. The business risk was questionable anyway.



Yes, we switched from SSPL to AGPLv3 very recently (as noted in the updates above).


But the repo says SSPL in the README and in the LICENSE file. Isn't that the source of truth, or at least what matters to an organization evaluating adoption?


Thanks for pointing it out. The LICENSE file was old, removed it now.

The README says AGPLv3. Or am I missing something?


@soumyadeb - can't reply another level deep here. I'll create a PR in the repo.

https://github.com/rudderlabs/rudder-server/pull/183

Cool project!


Thanks for the pull request :) Merged...


Is this suitable if I want to stream events (e.g. click event data), around 1k events/sec, to a Redshift database, and see them arrive in near real time? Redshift should receive the data within 5-20 minutes.

Will RudderStack work for this use case, or will it require something in between?


Yes, this is a very standard use case. A single-node RudderStack (m4.xlarge) should be able to process more than 1K events/sec, and Redshift should receive the data within 30 minutes (a configurable parameter).

Please give it a shot. Email me at soumyadeb@rudderlabs.com if you run into any issues.


How does the AGPL license interact with the enterprise plan? If a company pays for the cloud version that rudder hosts, is that company obligated by the provisions of the AGPL?


No, the cloud-hosted version is under a TOS which has an enterprise license.


How much can be run independently vs what the company provides? Are there parts that are private? Or is it more of a service and hosting model?


I recently looked at Segment and was quoted 40k for Personas. I think what you are doing is great, and if you can do personas, my org and I would be eternally grateful.


If you are looking into Segment Personas, I would love to pick your brain about what you are trying to accomplish with it.

At getGensus.com we have a tool that allows you to build personas on top of your existing data set and then sync these segments, golden records, etc. to the tools of your choice. It is a bit of a different approach since the data stays with our DB/Warehouse. I'd like to hear your thoughts.

If you have the time, I'd love to connect. Email is in my profile.


If you are open to it, we would love to work with you (e.g., to understand your requirements) as we develop the personas product. Can you please email soumyadeb@rudderstack.com?


What is a "CDI"? What is "Segment"?

Yes, I could look these things up, but if you are making an announcement, it might be nice if you were clear.


Sorry, Customer Data Infrastructure and segment.com. A platform which makes it simple to collect event data from your apps and send it to 3rd party destinations (including dumping into your warehouse).

We wanted to give an update since our last HN post, so we didn't go into depth on what the product is. We kind of assumed everyone knows Segment, but yes, a fair point.


Customer Data Infrastructure


I will be interested to take a look as soon as the ElasticSearch integration is launched. We did look into Segment a while back, but $$$.


One of our users has set up Logstash to load data from S3 to ElasticSearch. Would that setup work for you?


what is your k8s strategy?


Our Kubernetes Helm charts are open sourced here. It should be straightforward to install them on your current k8s cluster.

https://github.com/rudderlabs/rudderstack-helm


We use helm charts to deploy Rudder. Repo here https://github.com/rudderlabs/rudderstack-helm

We have 3 separate pods: one for the Rudder core backend, one for the transformations (JavaScript that maps events from Rudder format to destination format, plus user-defined transformations), and one running Postgres.



