This looks like an amalgamation of 8+ open source projects or industries, with products put forth by companies that have dozens of employees and have worked on their products for years.
It also doesn't even categorize the products they compete with correctly[0].
Why not contribute some of your resources to one of the many active open source libraries already trying to solve some of these problems, and focus your engineering efforts on your core product?
What we're doing differently is making one product that covers the whole lifecycle, instead of having to string tools together. It took us many months to string our own toolset together, and we felt there had to be a better way. Just like GitLab, we try to leverage existing open source projects wherever possible.
As someone who works very, very closely in this industry, I would just be very careful how much of this you think you want to bite off.
Consider how you trust using dbt more than rolling your own transformation tool. Why wouldn't this apply to the rest of your stack? The 10+ companies that offer data extraction and loading are likely a better choice. The same goes for analytics: the dozens of companies that offer BI tools are probably going to be the better choice.
Maybe you can build all these tools better than the hundreds of companies with thousands of employees and millions of dollars. It just seems the odds that you build the best of each are so low.
I would have been more impressed if your team had designed some API that other tools/platforms could plug in to coordinate a lot of the above jobs with your CI system. There is a SERIOUS need for that and I've had a lot of conversations with companies about what that would look like.
To answer your question: no, Fivetran does not currently belong in the orchestration area, IMO. I've heard they are soon to release some sort of orchestration tooling to compete with dbt, but it isn't the type of orchestration you get with Airflow.
@slap_shot, one of our major goals is to provide a solution that startups and small companies can use to start putting their data to work.
It shouldn't take weeks of effort, a data engineer, multiple proprietary solutions, and tens of thousands of dollars to answer key questions like CAC (customer acquisition cost) or the efficiency of a given marketing campaign.
We're hoping to lower the barrier to entry in both cost and effort, by providing an open source pre-packaged solution.
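To make "key questions" concrete, here's the kind of arithmetic we mean (the numbers below are made up; the hard part is reliably collecting the two inputs, which is exactly what the pipeline is for):

```python
# CAC = sales & marketing spend in a period / new customers acquired in it.
# Illustrative numbers only; gathering these two inputs reliably is the
# actual data-engineering problem.
spend_usd = 50_000.00      # total sales + marketing spend for the quarter
new_customers = 400        # customers acquired in the same quarter

cac = spend_usd / new_customers
print(f"CAC: ${cac:,.2f} per customer")  # -> CAC: $125.00 per customer
```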
[Obligatory: Someone is downvoting you, but it isn't me. I upvoted you]
Yeah, I get that. The analytics space is very complex, and companies, even ones with good engineering teams, typically don't have the internal knowledge or resources to put all this together.
In addition to working in this space, my company helps companies set up their analytics stacks.
We typically set them up with one cloud-based data integration tool (whichever covers the most of the integrations they need at the best price), dbt, and one BI tool (usually Looker or Periscope, in that order). All in, it takes us a few weeks to get them set up and going.
I applaud your effort. I just struggle to understand why you accept punting on transformations (and using dbt — an amazing library, by the way; great choice), but then try to tackle something like integrations or BI tools. The complexity of both of those is massive and there are great open source efforts already out there.
"but then try to tackle something like integrations or BI tools. The complexity of both of those is massive and there are great open source efforts already out there."
I would love to hear your suggestion for a great open source BI tool. We tried Superset and Metabase, but neither came close to what we could do with Looker. That is why we're giving Meltano Analyze a shot.
BTW, do you want to do a livestreamed video call to discuss further in the next 30 minutes? You have a lot of knowledge. If so, please email me and comment here.
What a great interview. @slap_shot, you had great questions and you are so well spoken. Really appreciate the feedback. We're all taking notes here. Hope you will keep an eye on our issue tracker for Meltano and give us your feedback as things come up.
I have no horse in this race, but this is so cool that one minute you're exchanging comments on HN and the next you're livestreaming a conversation on the topic! What a world :)
Thanks to both of you for your time doing that discussion!
@slap_shot and anyone else — I'm curious if you have thoughts on, or have even heard of, the Ballerina language? It's a programming language for doing data integration work, built by the ESB/integration consultancy WSO2. It seems to have a lot of engineering resources sunk into it, but surprisingly little fanfare.
Non-English speaker here: you mentioned an OSS solution called "Inbulk" or something like that during the conversation. Could you spell it? I'm pretty interested in finding out more about that project, but Google returns a lot of unrelated results, I guess because of the name...
For batch-type workloads, Embulk has been a really excellent tool for my company (for all extract and load steps; we do most of the transformations in the db/warehouse).
> designed some API that other tools/platforms could plug in to coordinate a lot of the above jobs with your CI system
That's GitHub's strategy: don't choose solutions for your customers; be a platform other tools can plug into.
GitLab's strategy is to cobble together a bunch of open source software (including their own) to provide an out-of-the-box solution. It's not necessarily the best one for you, but it's certainly less effort for you.
On the analytics side, we're using GitLab CI as our orchestration tool. We're pushing it to its limits and trying to find ways to make it better for us (i.e. data teams) and for GitLab more generally.
I'd love to learn more about what you'd like to see CI be able to do from a dataops perspective.
I'm not 100% sold on all the tools you're using, but stringing together random SaaS tools and having to survey any number of open source tools in order to assemble a sensible platform makes way less sense.
At the very least, what we end up with is a group of folks working together in the open to surface some of the limitations and challenges, and to attempt to work out alternative solutions to the problems that arise in this space.
So, I applaud your effort. Ignore the salesmen and the haters.
Thanks for the positive comment! We're generally taking the same approach that was taken with GitLab the product: do it out in the open, iterate constantly, and work with the community. Especially doing it out in the open enables these sorts of _awesome_ conversations! And we definitely want feedback - this needs to work for more than just us!
Our goal is to meet our data team's needs by answering our company's data questions.
A lot of the solutions out there are fantastic, but they aren't up to the tasks we have in mind. Why shouldn't the whole life cycle be in one tool, be open source, and be version controllable? That's what we're looking for in a tool.
There's no inherent reason that the whole life cycle can't be handled in a single tool. However, there have been tens of thousands of person-years spent on these tools, so people here are pointing out that it is a tall ask for any company to create one tool that integrates everything. This goes doubly so if it is only going to be a side project to GitLab itself.
We'll only get there if Meltano gets a lot of community contributions. We think there is space for an open source end-to-end product that works out of the box. The contributions will tell us if we're right.
_Especially_ GitLab. Basically their entire product seems to be about building a whole bunch of separate tools and integrating them seamlessly into each other. GitLab has a built-in CI system, a deployment pipeline with Kubernetes integration, a built-in Docker container registry, performance monitoring tools for deployed applications, automated static analysis tools, etc. Describing it as "an amalgamation of 8+ open source projects or industries" seems pretty accurate.
That's by no means a bad thing though. While yes, there are downsides to tightly coupled tools, there are also advantages. If GitLab is trying to do the same thing for data analytics that they've already done for source control, they may very well succeed.
I can't understand why GitLab thinks it has to embark on a new project every so often instead of focusing on its current products and features. There is already a lot to work on, and many of the current features/products are half-assed. At my workplace we moved to GitLab 2.5 years ago, and updates were smoother back then. But in the past few months we had to hire a new sysadmin for our build machines and GitLab server just to follow the new issues created on GitLab.com and decide whether an update is safe to roll out, and even then he still reports 4-5 issues to GitLab support after every update. We were expecting it to be an easy `yum update` like a normal package, but it's just getting worse update after update. It's so bad that my manager asked me to look into GitHub + another CI/CD solution.
Agreed - our company moved to GitLab about the same time (2.5 years ago), and it's very clear from their updates that their focus has splintered in different directions. Our company has recently moved to more Microsoft products, so I am pushing our CTO to move to GitHub.
If the CEO is following this, please improve basic user stories like:
* As a user, I want to easily know who has approved my merge request. Note the word "easily". The UI lists the people who did not approve next to the label "Approved" and the people who did approve next to the label "Approved by". Makes absolutely no sense
* As a user, I want to see all the merge requests that I need to review because I am listed as an approver (it boggles my mind that this doesn't exist)
* As a user, I want to be notified only by todos that have pending actions on them
* As a user, I want to disapprove a merge request
There are so many basic areas of the core product that are almost unusable. All of our engineers who regularly switch between GitHub and GitLab prefer the GitHub UI.
Loading a merge request with 168 changes basically breaks a 4-CPU, 4 GB GitLab instance, so yes, "so many basic areas of the core product are almost unusable".
And while some integration is good... A lot of recent stuff is just "we try to grab the easy money"
We are on 11.1 and it takes way more than 3 seconds, or even 15 seconds (the request takes around 7 seconds, and parsing it takes another 10 seconds).
(P.S.: I also have the same source in Gitea on a way less powerful instance, which basically renders the whole thing in way less than 1 second.)
---
As a user, I want to be notified only by todos that have pending actions on them
---
Can you explain further what you mean by "pending actions on them"? We are working to simplify and streamline our notifications and todos in GitLab. In particular, the current thinking is that they are very similar: a "notification" is an email, and a "todo" is something that GitLab calls your attention to in the Web UI to take action on. So mechanically, they are very similar, and we would like to harmonize them.
If I have a merge request where I am added as an approver and that code is merged in, then I don't want it to show up in my todos. A merge request has a specified number of approvals required, and when that threshold is hit and the code is merged, there is no work left to be done. It is no longer a todo but a done. Unfortunately, todos don't work this way, and they become useless for senior members who get listed as approvers on many merge requests.
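To make it concrete, a toy sketch of the rule I'm asking for (the class below is a hypothetical stand-in, not GitLab's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class MergeRequest:                 # hypothetical stand-in for GitLab's MR
    state: str                      # "opened" | "merged" | "closed"
    approvals_required: int
    approved_by: set = field(default_factory=set)

def todo_is_pending(mr: MergeRequest, user: str) -> bool:
    """An approver's todo should stay pending only while they can still act."""
    if mr.state != "opened":
        return False                # merged/closed: it's a "done", not a todo
    if user in mr.approved_by:
        return False                # this approver has already acted
    return len(mr.approved_by) < mr.approvals_required
```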
In terms of combining them with notifications, I agree. I just need one place on the web to see all the "pending" action items I need to work on. A web notifications feature should be the place to see all the pending notifications left for me (similar to how it would work with email, except you can't expire emails).
Thanks for taking the time to share your feedback. I'm the Product Manager responsible for merge requests. Approvals are such an important part of merge requests, and we are working to make them better.
Thanks for the disapprove merge request idea. We're considering this idea in https://gitlab.com/gitlab-org/gitlab-ee/issues/761 where further feedback would be much appreciated, or on any other issue.
I'd point out that if you look at issue 5439, the team itself was originally unaware of the high number of edge-case states a merge request can be in, and closed the issue prematurely. Having many code paths is a code smell, so I'd suggest simplifying your UX and edge cases here.
Since you own the merge request flow, I would suggest looking at the page and all its edge cases and seeing where you can simplify for the user. There is a dizzyingly large amount of info and CTAs presented to the user; it's pure information overload. Don't just measure yourself by how many features you ship, but rather by how you communicate those features to your users. Simplicity is a powerful feature in itself.
Looking forward to batch comments.
Disapproving a merge request is a feature available in Phabricator and other competitors, so I would look at how they've implemented it.
I'm sorry to hear your experience with GitLab hasn't been smooth. We have more people than ever working on the core of GitLab, and the number of reported issues per customer is going down. But every problem is one too many. Please email me at sytse@gitlab.com if you're open to a call about your situation.
> And the number of reported issues per customer is going down.
This doesn't mean anything; maybe the customers are simply tired of reporting issues. For example, last year we didn't do any updates for 6 months because we were afraid it'd break something, and we were too busy to be willing to spend the time reporting problems.
We also don't report issues that are already open on gitlab.com. Reporting an issue means your customer is willing to spend time reporting, following up on, and testing your bug; this is your job, not the customer's. At the moment we only report issues that are either blocking our work or slowing down our development. The majority of the issues we are facing are performance problems.
I just wrote a script to plot the number of issues on gitlab-ce over time, the percentage of open/closed issues, and the overall period they have been open for. You are accumulating issues with `backend`, `UX`, `technical debt`, `performance`, `CI/CD`, ... labels; a lot of them don't have a milestone and have been open for a long time.
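For anyone curious, roughly the shape of that script (a minimal sketch against the public v4 issues API; the label and plotting details are illustrative rather than my exact code):

```python
# Page through the public GitLab v4 issues API for gitlab-ce and plot how
# many still-open issues carrying a given label were created each month.
from collections import Counter

import matplotlib.pyplot as plt
import requests

API = "https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab-ce/issues"

def fetch_issues(labels, state="opened"):
    issues, page = [], 1
    while True:
        resp = requests.get(API, params={
            "labels": labels, "state": state, "per_page": 100, "page": page,
        })
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return issues
        issues.extend(batch)
        page += 1

open_perf = fetch_issues("performance")
by_month = Counter(issue["created_at"][:7] for issue in open_perf)  # "YYYY-MM"
months = sorted(by_month)
plt.bar(months, [by_month[m] for m in months])
plt.xticks(rotation=90)
plt.title("Open `performance` issues on gitlab-ce, by month created")
plt.tight_layout()
plt.show()
```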
I am not sure how emailing you would help us; it's not like the problems aren't reported or you don't already know about them. It just appears that shipping a quality product is no longer a priority of GitLab as a company.
EDIT: I work in the aerospace industry, and one of the stages of our pipelines runs stress tests on our product. I would suggest you run a stress test on a GitLab instance; this would be an amazing place to start looking for performance problems.
> maybe the customers are simply tired of reporting issues. For example, last year we didn't do any updates for 6 months because we were afraid it'd break something, and we were too busy to be willing to spend the time reporting problems.
We just stopped upgrading GitLab over 2 years ago; we're on 8.9.
We collect metrics on how frequently customers upgrade, and as far as I can tell it is way above the industry average.
I'm sorry to hear you experienced so much breakage. Can you point to a regression or two that stayed open too long or that caused you a lot of trouble, so we can learn from it?
I appreciate how dedicated GitLab is to continuously improving the product. Thinking about moving my projects from Bitbucket to GitLab for that reason.
It is truly a bummer that you feel you are receiving half-assed updates and features with constant problems.
GitLab is constantly growing, and Meltano is adding to GitLab's capabilities, not subtracting from them. We've hired 2 very awesome Python developers specifically for Meltano; they each have tons of experience in the ELT space.
All this to say that no one at GitLab has turned their eyes away from GitLab; it's the opposite. This business is here to help GitLab as our first customer. Rather than having GitLab struggle to get its data tools together and make business decisions based on that data, we've devoted a whole team to providing a solution while helping the community at the same time.
Data pipelines are not a great subject for an open-source project. We've been building these for the last 3+ years at Fivetran, and I can tell you that the challenge is:
- Studying each source to figure out the right data model
- Chasing down a million weird corner cases
- Working around dumb bugs in the data sources
This is the kind of problem where paying for software really works better. When people build data pipelines in-house, they tend to hack at it until it works for their use case and then stop. When we build data pipelines, we map out every feature of the data source, implement the whole thing at once, and then put it through a beta period with multiple real users. This is easy to do when you have a tight-knit dev team; much harder for a group of part-time open-source contributors.
I think the point is to provide a set of tools for people who build data pipelines. Period. The software being open source doesn't say anything about WHO will use the tool. Depending on the success of this project, it might be that you could switch your team to this new tool at some point.
Personally, I work as a "lone wolf" (much to my own chagrin) because I'm in a small company that can't afford a huge team. Most of my (ETL) transforms are done in SQL, which happens to be pretty standardized, as opposed to the many ETL products I've seen so far.
This solution is probably far from ready, but I find the approach quite interesting, because it looks like a code-based ETL that uses SQL for transforms (so I might be biased). Overall, this might result in a more maintainable/versionable data pipeline model than GUI-first ETL tools, which usually generate spaghetti code. Because you usually have to adapt data pipelines to unstable external inputs regularly, being able to easily diff an ETL process would be a blessing.
The scope of Meltano isn't limited to just data pipelines, though that is the first major part of it.
One thing that gets me really excited about it is the way we want to build version control in from the start. To give you an example of where that's really powerful: we have a bunch of dashboards in Looker. Right now, figuring out which Looks/Dashboards rely on a given field is very challenging. If I change a column in my extraction, I can fairly easily propagate it to my final transformed table (thanks to dbt!) and even to the LookML. But knowing what in Looker is going to change or break if I change the LookML is way harder.
But if everything were defined in code, from extraction, loading, transformation, and modeling all the way to visualization, that'd be really powerful from my perspective.
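As a crude illustration of why (a toy sketch; a real tool would parse LookML/SQL properly instead of grepping, and the field name is made up):

```python
# If extraction, models, and dashboards all live as files in one repo,
# "what breaks if I rename this field?" reduces to a search over that repo.
from pathlib import Path

def references(field, repo=".", suffixes=(".lkml", ".sql", ".yml")):
    """Return (path, line_number, line) for every line mentioning `field`."""
    hits = []
    for path in Path(repo).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            for lineno, line in enumerate(path.read_text().splitlines(), 1):
                if field in line:
                    hits.append((str(path), lineno, line.strip()))
    return hits

for path, lineno, line in references("customer_acquisition_cost"):
    print(f"{path}:{lineno}: {line}")
```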
The Meltano team has several user personas they're focusing on. Data engineers are definitely one of them, but data analysts/BI users are as well, and we want the product to work well for the whole data team.
IMHO, if you want to make a dent in the space, figure out better debugging tools!
In particular: tools that explain how a certain (specific) value was calculated in the system, tools that let you bisect the source data in some way and focus on the records that are likely to have a problem, tools that help you figure out that a certain intermediate value in a calculation is an outlier, and tools that let you test assumptions about data over the whole pipeline.
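To make the last two items concrete, here's a toy sketch (all names are hypothetical) of the kind of per-stage assertion I mean:

```python
# Run cheap assertions after every pipeline stage, so a bad value is caught
# at the stage that introduced it rather than in the final report.
import statistics

def assert_in_range(rows, column, lo, hi):
    """Hard bound: every value of `column` must lie in [lo, hi]."""
    bad = [r[column] for r in rows if not lo <= r[column] <= hi]
    if bad:
        raise AssertionError(
            f"{column}: {len(bad)} value(s) outside [{lo}, {hi}], e.g. {bad[:3]}")

def assert_no_outliers(rows, column, z_max=3.0):
    """Soft check: flag values far from the mean. (A z-score needs enough
    rows to be meaningful; with only a handful it can barely fire.)"""
    values = [r[column] for r in rows]
    mean, stdev = statistics.fmean(values), statistics.pstdev(values)
    if stdev == 0:
        return
    outliers = [v for v in values if abs(v - mean) / stdev > z_max]
    if outliers:
        raise AssertionError(f"{column}: outlier(s) beyond {z_max} sigma: {outliers[:3]}")

stage_output = [{"revenue_usd": v} for v in (100.0, 105.0, 98.0, 110.0, 102.0)]
assert_in_range(stage_output, "revenue_usd", 0, 1_000_000)   # passes
assert_no_outliers(stage_output, "revenue_usd")              # passes
```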
I'd love a more robust way to test data pipelines, and the data within them, generally. I was at DataEngConf earlier this year and many people were talking about exactly this problem. One way we're trying to address it is by using the Review Apps feature on merge requests within GitLab: right now, when you open an MR on our repo, it creates a clone of the data warehouse that's completely isolated from production. This obviously can't scale once the DW is beyond a certain size, but I think there are ways to keep this sort of practice going.
I kind of agree with this. To take an example outside of ETL/DW/BI: when I first saw Zapier, I was skeptical of how many APIs they could support, because I'd seen a decent number of open source ESBs like Mulesoft run out of steam after a certain number of connectors. Zapier, being proprietary from day one (albeit less featureful than a full-blown enterprise ESB), has done better than I expected. Still, they only support 100 or so data sources, and the types of data/objects/triggers/whatever they support are limited at times. IMO, at some point both the open source and proprietary models fall apart in the face of the long tail. Amazon has tackled the long tail of e-commerce, but that's an enormous market that allows them to employ hundreds of thousands of people to do it. Tackling the long tail of connectors (whether for ESBs/SaaS integration or ETL/DW/BI) is just too expensive compared to the size of the markets willing to take a shot at it.
Thanks for the great advice! Input drawn from your years of experience and trial and error is greatly appreciated.
The idea is to give users a set of default extractors (the ones we use internally, so they are battle-tested), along with loaders, transformers, etc., with documentation on how to build their own. For our MVP, and possibly into the future, it will work like WordPress plugins: there is an extractor directory into which you place your extractor, written following our protocol, and the UI will recognize it and give you a choice of extractors to run; same for loaders, and so on.
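Very roughly, the discovery pattern we have in mind looks like this (a sketch only; the directory name and the `extract()` protocol are placeholders, not our final interface):

```python
# Scan an extractors/ directory, import each module, and surface every one
# that implements the expected protocol so the UI can offer it as a choice.
import importlib.util
from pathlib import Path

def discover_extractors(directory="extractors"):
    extractors = {}
    for path in Path(directory).glob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Placeholder protocol: a module counts as an extractor if it
        # exposes a callable named `extract`.
        if callable(getattr(module, "extract", None)):
            extractors[path.stem] = module
    return extractors

# The UI would then present discover_extractors().keys() as the available
# extractors and run the chosen one's extract().
```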
We do not want to be chasing down every last corner case for extractors (except for our own), because that's just not a good long-term solution; it needs constant maintenance (as we've seen already). With user contributions, I believe it can work.
This is GitLab taking stuff they were already doing internally and making it available to a broader audience.
Once you take VC funding, you've got to go where the money is. Everyone wants/expects "fast, stable, like GitHub" for free unless they have special needs. So you do analytics on what people are doing with your free site, you offer enterprisey features, you get into the "platform" business, etc.
I think GitLab distracts itself, spreads itself thin, and isn't great at partnering. Its ambition to do it all knows no bounds, which is both commendable and a smh moment. It's likely not sustainable or scalable. They're definitely trying to "go big or go home" as a company, which is not how most people originally felt about GitLab (a fast, stable OSS alternative to GitHub).
At the same time, I can't blame them. I think it comes down to: Don't hate the player, hate the game.
I meant no offense. But you're also dabbling in building your own k8s distro/platform; you have your own CI/CD, Jira-style boards, data science stuff, etc.
My point is that you’re aiming a lot broader than Github ever did - you are competing more as a suite than as a focused product.
And I’ve seen personally this impact the support side with customers, partnership side, etc. I help maintain a medium-large Gitlab for one of your bigger customers. Anyway this isn’t the place for me to get specific, I am just saying that you are taking a risky path in terms of sustainability IMO as a rando on the internet.
Not trying to be negative. I genuinely would like for GitLab to succeed. My experience (in a totally different industry and scenario, but with product building all the same) was that our decision to pare down to our core competency and focus was the best decision we ever made. We were attempting a full productivity suite, similar in concept but again a different industry. I’m interested in finding an example of a similarly modeled company to compare.
Really, no. Look at all the comments here (and this is only from techies): you have lost us; we don't know anymore what you are doing, or even what you are trying to do.
The idea seems great, but it's not working: there is no single application that can fit all uses, and you are losing most of the users along the way.
I'm using GitLab, btw, but only for the self-hosted git and its user interface (i.e. your core). All the other parts (bug tracking, CI, chat, ...) live in different, more appropriate tools for each of our use cases... because most of yours are not complete enough, or sometimes it's not even clear how they could actually work for us (Mattermost, for example).
This could potentially become part of the Meltano stack. At GitLab, we're not at the phase yet where we're in need of data versioning. But I could imagine a data registry that's integrated with the workflow of data analysts/scientists to easily link versions of code and data.
Thanks for the link - we'll definitely keep an eye on it.
Reading this, I was concerned that it would be written in Ruby. While Ruby is a reasonable language for server development, it has almost no data science community compared with some other ecosystems.
I was very glad to see this is Python! Python has some of the best data tools out there, and a mature ecosystem for solving all the engineering problems that go along with a great data stack.
Personally, I quite like the approach FloydHub has for deep learning projects. At GitLab, we currently don't have any deep learning projects happening - we're still further down the AI hierarchy of needs - i.e. focusing on solid data infrastructure and descriptive analytics.
I fully expect we'll have a use case for the "cool" machine learning stuff, but there's a lot of groundwork to cover with the basics first. Meltano is focusing on those basics for right now.
I am interested in knowing more about how you think FloydHub can better serve the market. FloydHub does have support for reviewing metrics after the fact: https://docs.floydhub.com/guides/jobs/metrics. Are you only interested in using TensorBoard for graph viewing?
The page mentions MVC, and the issue page[0] keeps mentioning MVC as well. Was this supposed to be MVP, or something else? Model-view-controller doesn't make sense in this context.
slap_shot, I agree, and as a disclaimer, I also work at GitLab. There is no shortage of data tools in the space today. A majority of my career has been spent in the data & analytics space, and I've talked to / worked with at least 60% of the companies you mentioned. At the end of the day, these are the questions I've asked over and over again:
1. Do we have enough money / budget for a tool like this?
2. Can we derive enough insights from this product fast enough to make a good ROI?
3. Does this tool use a proprietary language that no one wants to learn or can I code in a language that is relevant?
4. In all honesty, can I get insights faster in a spreadsheet than these tools?
5. What is the learning curve?
6. Can I answer the business question that was originally asked?
Open to more discussion around the topic, as it is a lot harder to answer than a few philosophical questions, but it certainly resonates with many data & analytics professionals. A nice goal would be a project where you can stand up a business, turn your data pipelines on, ingest the data, and view the insights needed to make a business decision, all within a short timeframe of the business going live.
One major difference will be the complete data life cycle vs providing just one part of it. Just like we do in GitLab except for data teams instead of software development teams.
All of our extractors are available in our source code, which is open source: http://gitlab.com/meltano/meltano/. Right now we are working towards an MVP, so things might be in flux, but we value any feedback you have.
[0] Fivetran is listed only under "Orchestrate", but it actually competes directly with Alooma in Extract and Load. Also, there are DOZENS of companies in that space. https://gitlab.com/meltano/meltano/blob/master/README.md#dat...