
GitLab is working on a tool just for data teams - TheMissingPiece
https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/
======
slap_shot
This looks like an amalgamation of 8+ open source projects or industries with
products put forth by companies that have dozens of employees and worked on
their products for years.

It also doesn't even categorize the products they compete with correctly[0].

Why not contribute some of your resources to one of the many active open
source libraries already trying to solve some of these problems, and focus
your engineering efforts on your core product?

[0] Fivetran is only considered "Orchestrate" but actually competes
directly with Alooma in Extract and Load. Also, there are DOZENS of
companies in that space.
[https://gitlab.com/meltano/meltano/blob/master/README.md#dat...](https://gitlab.com/meltano/meltano/blob/master/README.md#data-science-lifecycle)

~~~
sytse
What we're doing differently is making one product that does the whole lifecycle
instead of having to string tools together. It took us many months to string
our toolset together and we felt there had to be a better way. Just like
GitLab we try to leverage existing open source projects wherever possible.

I agree Fivetran also belongs in extract and load and updated it
[https://gitlab.com/meltano/meltano/commit/1df9813f5ab42c4479...](https://gitlab.com/meltano/meltano/commit/1df9813f5ab42c4479120f4d7e9f6f8e8a06a1ae)
Do you think it should be removed from Orchestrate? Any other suggestions for
proprietary products in that category?

~~~
slap_shot
As someone who works very, very closely in this industry, I would just be very
careful how much of this you think you want to bite off.

Consider how you trust using dbt more than rolling your own transformation
tool. Why wouldn't this apply to the rest of your stack? The 10+ companies
that offer data extraction and loading are likely a better choice. Again with
Analytics - the dozens of companies that offer BI tools are probably going to
be the better choice.

Maybe you can build all these tools better than the hundreds of companies with
thousands of employees and millions of dollars. It just seems so unlikely
that you build the best of each.

I would have been more impressed if your team had designed some API that other
tools/platforms could plug in to coordinate a lot of the above jobs with your
CI system. There is a SERIOUS need for that and I've had a lot of
conversations with companies about what that would look like.

To answer your question, no, Fivetran does not currently belong in the
orchestration area, IMO. I've heard they are soon to release some sort of
orchestration tooling to compete with dbt, but it isn't the type of
orchestration you get with Airflow.

~~~
joshlambert
slap_shot one of our major goals is to provide a solution that startups and
small companies can utilize to start putting their data to work.

It shouldn't take weeks of effort, a data engineer, multiple proprietary
solutions, and tens of thousands of dollars to answer key questions like CAC
or the efficiency of a given marketing campaign.

We're hoping to lower the barrier to entry in both cost and effort, by
providing an open source pre-packaged solution.
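
For reference, CAC (customer acquisition cost) is commonly computed as acquisition spend divided by the number of new customers won in the same period. A minimal Python sketch of that arithmetic follows; the function name, campaign names, and figures are all made up for illustration and do not come from Meltano or GitLab.

```python
def customer_acquisition_cost(spend: float, new_customers: int) -> float:
    """Return acquisition spend per new customer for one period."""
    if new_customers <= 0:
        raise ValueError("new_customers must be positive")
    return spend / new_customers


# Made-up figures for illustration only.
campaign_spend = {"search_ads": 12000.0, "newsletter": 3000.0}
campaign_signups = {"search_ads": 80, "newsletter": 60}

# CAC per campaign, a rough proxy for campaign efficiency.
cac_by_campaign = {
    name: customer_acquisition_cost(campaign_spend[name], campaign_signups[name])
    for name in campaign_spend
}

# Blended CAC across all campaigns.
overall_cac = customer_acquisition_cost(
    sum(campaign_spend.values()), sum(campaign_signups.values())
)
```

The formula is trivial; the hard part the thread is discussing is getting spend and signup data into one place so it can be applied.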

~~~
slap_shot
[Obligatory: Someone is downvoting you, but it isn't me. I upvoted you]

Yeah, I get that. The analytics space is very complex and companies, even ones
with good engineering teams, don't have the internal knowledge or resources to
typically put all this together.

In addition to working in this space, my company helps companies set up their
analytics stack.

We typically set them up with one cloud-based data integration tool (the one
with the most # of integrations they need at the best price), dbt, and one BI
tool (usually Looker or Periscope, in that order). All in, that takes us a few
weeks to get them set up and going.

I applaud your effort. I just struggle to understand why you accept punting on
transformations (and using dbt (amazing library, by the way - great choice)),
but then try to tackle something like integrations or BI tools. The complexity
of both of those is massive and there are great open source efforts already
out there.

I'm eager to see where this goes.

~~~
sytse
"but then try to tackle something like integrations or BI tools. The
complexity of both of those is massive and there are great open source efforts
already out there."

I would love to hear your suggestion for a great open source BI tool. We tried
Superset and Metabase but neither came close to what we could do with
Looker. That is why we're giving Meltano Analyze a shot.

BTW Do you want to do a livestreamed video call to discuss further in the
next 30 minutes? You have a lot of knowledge. If so please email me and comment
here.

Update: He did email and livestream will happen on
[https://www.youtube.com/watch?v=F8tEDq3K_pE](https://www.youtube.com/watch?v=F8tEDq3K_pE)

~~~
slap_shot
Sure - just shot you an email at website@yourhandle.com

~~~
jakecodes
What a great interview. @slap_shot, you had great questions and you are so
well spoken. Really appreciate the feedback. We're all taking notes here. Hope
you will keep an eye on our issue tracker for Meltano and give us your
feedback as things come up.

~~~
mendelk
I have no horse in this race, but this is so cool that one minute you're
exchanging comments on HN and the next you're livestreaming a conversation on
the topic! What a world :)

~~~
sytse
Glad you like it!

------
cheghook
I can't understand why GitLab thinks they have to embark on a new project
every so often instead of focusing on their current product and features.
There is just a lot to work on; so many of the current features/products are
half-assed. At my place we moved to GitLab 2.5 years ago and updates were
smoother back then, but in the past few months we had to hire a new sysadmin for
our build machines and GitLab server to follow new issues created on
GitLab.com and decide if it's safe to release, and even then he still reports 4-5
issues to GitLab support after every update. We were expecting it to be an
easy `yum update` like a normal package, but it's just getting worse update
after update. It's so bad that my manager asked me to look into GitHub +
another CI/CD solution.

~~~
hunter23
Agreed - our company moved to Gitlab about the same time (2.5 years ago) and
it's very clear from their updates that their focus has splintered in
different directions. Our company has recently moved to more Microsoft
products so I am pushing our CTO to move to Github.

If the CEO is following this, please improve basic user stories like:

* As a user, I want to easily know who has approved my merge request. Note the word "easily". The UI lists the people who did not approve next to the label "Approved" and the people who did approve next to the label "Approved by". Makes absolutely no sense

* As a user, I want to see all the merge requests that I need to review because I am listed as an approver (it boggles my mind that this doesn't exist)

* As a user, I want to be notified only by todos that have pending actions on them

* As a user, I want to disapprove a merge request

There are so many basic areas of the core product that are almost unusable.
All of our engineers who have to regularly switch between github and gitlab
prefer the github ui.

~~~
merb
Loading a Merge Request with 168 changes basically breaks a 4-CPU, 4 GB
instance on gitlab, so yes, "the most basic areas of the core product are
almost unusable".

And while some integration is good... A lot of recent stuff is just "we try to
grab the easy money"

~~~
sytse
Yep, load times of large merge requests were a big problem. In 11.1 we launched
a refactor of merge requests to solve this
[https://about.gitlab.com/2018/07/22/gitlab-11-1-released/#me...](https://about.gitlab.com/2018/07/22/gitlab-11-1-released/#merge-request-comments-vuejs-refactor)

That got the time down for the worst case we measure from 15 seconds to 3
seconds, see
[https://news.ycombinator.com/item?id=17671300](https://news.ycombinator.com/item?id=17671300)

~~~
merb
we are on 11.1 and it takes way more than 3 seconds (or even 15 seconds)
(the request takes around 7 seconds, and parsing it takes 10
seconds).

(P.S.: I also have the same source in gitea on a way less powerful instance
which basically renders the whole thing in way less than 1 second.)

------
georgewfraser
Data pipelines are not a great subject for an open-source project. We've been
building these for the last 3+ years at Fivetran, and I can tell you that the
challenge is:

      - Studying each source to figure out the right data model
      - Chasing down a million weird corner cases
      - Working around dumb bugs in the data sources

This is the kind of problem where paying for software really works better.
When people build data pipelines in-house, they tend to hack at it until it
works for their use case and then stop. When we build data pipelines, we map
out every feature of the data source, implement the whole thing at once, and
then put it through a beta period with _multiple_ real users. This is easy to
do when you have a tight-knit dev team; much harder for a group of part-time
open-source contributors.

~~~
MarHoff
I think the point is to provide a set of tools for people that build data
pipelines. Period. The software being open source doesn't reflect in any way WHO
will use this tool. Depending on the success of this project, it might be that
you could switch your team to this new tool at some point.

Personally I work as a "lone wolf" (to my own dismay) because I'm in a
small company that can't afford a huge team. Most of my (ETL) Transforms are
done in SQL, which happens to be pretty standardized, as opposed to many ETL
products I've seen so far.

This solution is probably far from being ready, but I find this approach quite
interesting, because it looks like a code-based ETL that uses SQL for transforms
(so I might be biased). Overall this might result in a more
maintainable/versionable data pipeline model than GUI-first ETL, which usually
generates spaghetti code. Because you are usually forced to regularly adapt
the data pipeline to unstable external inputs, being able to easily diff the ETL
process would be a blessing.
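
To illustrate the diffing point: when a transform lives in a plain SQL file, a change to it can be reviewed with an ordinary text diff. The sketch below uses Python's standard `difflib`; the SQL, table name, and file labels are hypothetical and not taken from Meltano.

```python
import difflib

# Hypothetical "before" version of a SQL transform.
old_sql = """\
SELECT id, email, created_at
FROM raw.users
WHERE deleted = false
""".splitlines(keepends=True)

# Hypothetical "after" version: the upstream source gained a column.
new_sql = """\
SELECT id, email, signup_source, created_at
FROM raw.users
WHERE deleted = false
""".splitlines(keepends=True)

# unified_diff yields header lines, hunk markers, and +/- prefixed lines.
diff = list(
    difflib.unified_diff(
        old_sql,
        new_sql,
        fromfile="users.sql (before)",
        tofile="users.sql (after)",
    )
)
print("".join(diff))
```

In practice the same review would happen through `git diff` or a merge request; the point is only that a text-based transform gives a line-level change history, which a GUI-generated pipeline usually does not.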

~~~
veritas3241
The scope of Meltano isn't limited to just data pipelines, though that is the
first major part of it.

One thing that gets me really excited about it is the way we want to build
version control in from the start. To give you an example of where that's
really powerful - we have a bunch of dashboards in Looker. Right now, figuring
out what Looks/Dashboards rely on a given field is very challenging. If I
change a column in my extraction, right now I can fairly easily propagate it
to my final transformed table (thanks to dbt!) and even to the LookML. But
knowing what in Looker is going to change / break if I change the LookML is
way harder.

But if everything was defined in code from extraction, loading,
transformation, modeling, _and_ visualization, that'd be really powerful from
my perspective.

The Meltano team has several user personas that they're looking at focusing
on, data engineers are definitely one of them, but data analyst/BI users are
as well, and we want the product to work well for the whole data team.
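
As a rough illustration of the "everything defined in code" idea: if each stage declared which upstream fields it reads, finding what breaks after a change becomes a graph walk. This is a minimal Python sketch under that assumption; the lineage map and every name in it are hypothetical, and it does not reflect how Meltano, dbt, or Looker actually store dependencies.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each key feeds the artifacts listed as its value.
DOWNSTREAM: Dict[str, List[str]] = {
    "extract.accounts.plan": ["transform.dim_accounts.plan"],
    "transform.dim_accounts.plan": ["model.accounts.plan_tier"],
    "model.accounts.plan_tier": ["dashboard.revenue_by_plan", "dashboard.churn"],
    "model.accounts.region": ["dashboard.sales_map"],
}


def affected_by(node: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return everything downstream of `node`, in breadth-first order."""
    seen = {node}
    order: List[str] = []
    queue = deque([node])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Which transforms, model fields, and dashboards depend on the extracted column?
impacted = affected_by("extract.accounts.plan", DOWNSTREAM)
print(impacted)
```

With the made-up lineage above, changing the extracted `plan` column flags both the revenue and churn dashboards, while the sales map is untouched.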

------
tbrock
I wish they would focus on making a fast, stable, GitHub alternative.

~~~
parasubvert
This is Gitlab taking stuff they were doing already internally and making it
available to a broader audience.

Once you take VC funding, you gotta go where the money is. Everyone
wants/expects "fast, stable, like Github" for free unless you have special
needs. So, you do analytics on what people are doing with your free site, you
offer enterprisey features, you get into the "platform" business etc.

I think Gitlab distracts itself, spreads itself thin, and isn't great at
partnering. Its ambition to do-it-all knows no bounds, which is both
commendable and a smh moment. It's not likely sustainable or scalable. They're
definitely trying to "go big or go home" as a company, which is not how most
originally felt about Gitlab (a fast, stable OSS alternative to Github).

At the same time, I can't blame them. I think it comes down to: Don't hate the
player, hate the game.

~~~
sytse
We are building a fast, stable, GitHub alternative.

We have hired 3 times as many people in our security team for GitLab.com (not
our product team for security) as are working on Meltano.

We have hired 3 times as many people in our SRE teams as are working on
Meltano.

And we still have a lot of vacancies for both
[https://about.gitlab.com/jobs/](https://about.gitlab.com/jobs/)

~~~
parasubvert
I meant no offense. But you’re also dabbling in building your own k8s distro /
platform, you have your own CI/CD, Jira storyboard, and data science stuff,
etc.

My point is that you’re aiming a lot broader than Github ever did - you are
competing more as a suite than as a focused product.

And I’ve personally seen this impact the support side with customers,
partnership side, etc. I help maintain a medium-large Gitlab for one of your
bigger customers. Anyway this isn’t the place for me to get specific, I am
just saying that you are taking a risky path in terms of sustainability IMO as
a rando on the internet.

------
n42
Is there any example of an open source software company that has taken on so
many products at once, so early in its life, and succeeded?

~~~
sytse
We did [https://about.gitlab.com/2017/10/11/from-dev-to-devops/](https://about.gitlab.com/2017/10/11/from-dev-to-devops/) when we
were at 50% of our current number of engineers. So far so good.

~~~
gregoriol
Really no, look at all the comments here (and this is only from techies): you
have lost us, we don't know anymore what you are doing, or even trying to do.

~~~
sytse
We are trying to make a single application that covers the whole DevOps
lifecycle, from planning your change up to monitoring its effect.

We're doing it because we believe there are emergent benefits to having the
lifecycle in a single application
[https://about.gitlab.com/handbook/product/single-application...](https://about.gitlab.com/handbook/product/single-application/#emergent-benefits-of-a-single-application)

~~~
gregoriol
The idea seems great, but it's not working: there is no single application
that can fit all uses, and you are losing most of the users on the way.

I'm using Gitlab, btw, but only for the self-hosted git and its user
interface (ie. your core). All the other parts (bug tracking, CI, chat, ...)
are in different and more appropriate tools for each of our use-cases...
because most of yours are not complete enough, or sometimes it's not even
clear how they actually could work for us (mattermost for example).

------
veritas3241
Taylor from GitLab here! Happy to answer any questions about what we're doing.

~~~
thebiglebrewski
Kudos to you for trying something new!

~~~
veritas3241
Thanks so much!

------
_pmf_
GitLab's usage of team members in marketing material is creeping me out (as
does the whole team page[0]).

[0] [https://about.gitlab.com/team/](https://about.gitlab.com/team/)

~~~
sytse
We say team members instead of employees because some are contractors. Why
does it freak you out?

BTW We don't call it a family
[https://about.gitlab.com/handbook/leadership/#management-tea...](https://about.gitlab.com/handbook/leadership/#management-team)

~~~
_pmf_
I wouldn't want to have that level of public affiliation with my employer (no
matter who that employer might be).

------
ageofwant
[https://quiltdata.com/](https://quiltdata.com/) ticks a lot of boxes in this
space for me.

~~~
veritas3241
This could potentially become part of the Meltano stack. At GitLab, we're not
at the phase yet where we're in need of data versioning. But I could imagine a
data registry that's integrated with the workflow of data analysts/scientists
to easily link versions of code and data.

Thanks for the link - we'll definitely keep an eye on it.

------
danpalmer
Reading this I was concerned that it would be written in Ruby. While Ruby is a
reasonable language for server development, it has almost no data science
community when compared with some other ecosystems.

I was very glad to see this is Python! Python has some of the best data tools
out there, and a mature ecosystem for solving all the engineering problems
that go along with a great data stack.

~~~
ksec
I am on the opposite side. Given Gitlab is a Ruby house, I was secretly hoping
for some innovation coming from Ruby Data Science.

------
tamersalama
Is there some resemblance with Floydhub
[http://floydhub.com/](http://floydhub.com/) ?

~~~
NegatioN
Does anyone have a comprehensive list of similar offerings to floydhub? or OSS
alternatives?

I think this market is not being served properly, most of them seem to still
require most of the heavy lifting to be done by the ML practitioner.

I suppose I would even be okay with a service that just saves all my graphs
from tensorboard for later reviewing.

~~~
houqp
I am interested in knowing more about how you think FloydHub can better serve
the market. FloydHub does have metrics support for later reviewing:
[https://docs.floydhub.com/guides/jobs/metrics](https://docs.floydhub.com/guides/jobs/metrics).
Are you only interested in using tensorboard for graph viewing?

------
Luuseens
The page mentions MVC, and the issue page[0] keeps mentioning MVC as
well. Was this supposed to be MVP, or something else? Model-view-controller
doesn't make sense in the context.

[0]
[https://gitlab.com/meltano/meltano/issues/10](https://gitlab.com/meltano/meltano/issues/10)

~~~
jakecodes
We use the term mvc here, as "minimal valuable change", in recognition that
it may not be a product yet.

------
ajbosco
Do you see this as a (future) competitor of Airflow/Luigi type workflow tools?

~~~
sytse
Yes, the orchestrate part (working on GitLab CI) is an alternative for
Airflow. Also see
[https://gitlab.com/meltano/meltano/blob/master/README.md#dat...](https://gitlab.com/meltano/meltano/blob/master/README.md#data-science-lifecycle)

------
hn_throwaway_99
Be interested to know all the competitors in this space.
[https://data.world/](https://data.world/) is one I am most familiar with.

~~~
slap_shot
This project competes with too many industries to really give a succinct
answer, but here's just Extraction/Loading and Analyze:

Extraction/Loading: Dell Boomi, SAP, SAS, Pentaho, Domo, Oracle, IBM,
Microsoft, Informatica, Talend, JitterBit, SnapLogic, Mulesoft, SyncSort,
Information Builders, Actian, Attunity, Datameer, Alteryx, Striim,
Treasure Data, Cask, StreamSets, Snowplow, DataTorrent, Astronomer, Panoply,
Apache Nifi, Stitch Data, FlyData, Bedrock Data, Alooma, ETLeap, Fivetran,
Xplenty, MethodMill, Celigo, TerraSky, DBSync, Youredi, Scribe,
Civis Analytics, DataScience, Dataloader.io, datorama, Astera

Analyze: MicroStrategy, GoodData, Sisense, Looker, Power BI, Wagon, Birst,
Tableau, Qlik, Domo, Hue, Mode, Chartio, Periscope, Pentaho

The amount of hype and BS in the Notebook space would require me to spend some
time combing through that again.

~~~
chasewright
slap_shot, I agree and as a disclaimer I also work at GitLab. There is no
shortage of data tools in the space today. A majority of my career has been
spent in the data & analytics space and I've talked / worked with at least 60%
of the companies you mentioned. At the end of the day, these are the questions
I've asked over and over again.

1. Do we have enough money / budget for a tool like this?
2. Can we derive enough insights from this product fast enough to make a good ROI?
3. Does this tool use a proprietary language that no one wants to learn or can I code in a language that is relevant?
4. In all honesty, can I get insights faster in a spreadsheet than these tools?
5. What is the learning curve?
6. Can I answer the business question that was originally asked?

Open to more discussions around the topic as it is a lot harder to answer than
a few philosophical questions, but it certainly resonates with many data &
analytics professionals. A nice goal would be to have a project where you can
stand up a business, turn your data pipelines on, ingest the data, and view
the insights needed to make a business decision all within a short timeframe
of when a business goes live.

------
gandutraveler
Looks like gitlab just wants to be in the news since Microsoft's acquisition of
GitHub.

------
sbr464
Are you releasing/sharing any of the extractors you built for various
services?

~~~
jakecodes
All of our extractors are available in our source code, which is open source.
[http://gitlab.com/meltano/meltano/](http://gitlab.com/meltano/meltano/).
Right now we are working towards an MVP, so things might be in flux, but we
value any feedback you have.

~~~
sbr464
Thanks. I had looked but only saw one for fastly, am I missing others
somewhere?

~~~
sbr464
Apologies, I found them in another repo - GitLab Analytics. Thanks

[https://gitlab.com/meltano/analytics/tree/master/elt](https://gitlab.com/meltano/analytics/tree/master/elt)

