Honestly, if the result sets are small enough, I just dump them to JSON and diff the files. But the output has to be fully deterministically sorted for that (in a sane world, "order by *" would be valid ANSI SQL).
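Something like this, as a sketch (sqlite3 is standing in for whatever driver you use; the table and columns are made up):

```python
# Dump a query's rows as canonical JSON (one object per line, keys sorted),
# so a plain `diff` of the two files is meaningful.
import json
import sqlite3

def dump_sorted(db_path: str, out_path: str) -> None:
    conn = sqlite3.connect(db_path)
    # No "order by *", so every column is listed explicitly to make the
    # row ordering fully deterministic.
    cur = conn.execute("SELECT id, name, amount FROM orders ORDER BY id, name, amount")
    cols = [d[0] for d in cur.description]
    with open(out_path, "w") as f:
        for row in cur:
            f.write(json.dumps(dict(zip(cols, row)), sort_keys=True, default=str) + "\n")
    conn.close()

dump_sorted("prod.db", "prod.jsonl")
dump_sorted("dev.db", "dev.jsonl")
# Then: diff prod.jsonl dev.jsonl
```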
Thank you for mentioning Data Diff! Founder of Datafold here.
We built Data Diff to solve a variety of problems that we encountered as data engineers: (A) Testing SQL code changes by diffing the output of the production and dev versions of a SQL query. (B) Validating that data is consistent when replicating it between databases.
Data Diff implements two algorithms: one for diffing within the same database and one for diffing across databases.
The former is based on a JOIN; the latter uses checksumming with binary search, which keeps network IO and database workload overhead minimal.
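For anyone curious, the cross-database idea boils down to: checksum a key range on both sides, and only when the checksums disagree split the range in half and recurse, so rows are only ever fetched for ranges that actually differ. A simplified toy sketch (not our production code; the items table, leaf size, and md5 choice are all illustrative):

```python
import hashlib
import sqlite3

def range_checksum(conn, lo, hi):
    # One checksum per key range; a real tool pushes this down as a single
    # aggregate query so only the hash crosses the network.
    cur = conn.execute(
        "SELECT id, payload FROM items WHERE id BETWEEN ? AND ? ORDER BY id",
        (lo, hi))
    h = hashlib.md5()
    for row in cur:
        h.update(repr(row).encode())
    return h.hexdigest()

def diff_range(a, b, lo, hi, out=None):
    # Binary search: recurse only into halves whose checksums disagree.
    out = [] if out is None else out
    if range_checksum(a, lo, hi) == range_checksum(b, lo, hi):
        return out  # identical range: no rows transferred at all
    if hi - lo < 64:  # small enough: fetch the rows and diff directly
        q = "SELECT id, payload FROM items WHERE id BETWEEN ? AND ?"
        out.extend(sorted(set(a.execute(q, (lo, hi))) ^ set(b.execute(q, (lo, hi)))))
        return out
    mid = (lo + hi) // 2
    diff_range(a, b, lo, mid, out)
    diff_range(a, b, mid + 1, hi, out)
    return out

# Demo: two copies of a table, one changed row; only its range gets fetched.
a, b = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (a, b):
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(i, f"row{i}") for i in range(1000)])
b.execute("UPDATE items SET payload = 'changed' WHERE id = 417")
print(diff_range(a, b, 0, 999))  # -> [(417, 'changed'), (417, 'row417')]
```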
It should, yeah. Our builders are based on BuildKit rather than Kaniko, and BuildKit optimizes for building container images in parallel and caching as much as possible. It also supports some more advanced types of caches, such as cache mounts: https://github.com/moby/buildkit/blob/master/frontend/docker...
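For example, a cache mount lets a RUN step keep a package manager's cache across builds without baking it into the image layer (illustrative Dockerfile; the Python/pip setup is just an example):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.12-slim
COPY requirements.txt .
# The pip cache lives in the mount and survives between builds,
# but never ends up in the image itself.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```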
Both Kaniko and BuildKit can be run in rootless mode. We are not doing this; instead, we give every builder access to an isolated VM, so builds are also a bit quicker by avoiding some of the security tricks that rootless needs in order to work.
In AWS, we launch either Intel or Arm EC2 instances depending on the requested build platform (or both for multi-platform builds). When a project's builds are running, they have sole control of that instance, which is terminated when the builds are done.
To make this performant, we keep a certain number of spare "warm" machines ready for build requests, so that you don't have to pay the instance launch-time penalty yourself.
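In rough shape, the pool logic looks something like this (a toy sketch, not our actual scheduler; the boto3 parameters and pool size are placeholders):

```python
import boto3

POOL_TARGET = 5  # spare warm instances to keep ready (illustrative number)
ec2 = boto3.client("ec2")
warm: list[str] = []  # instance IDs standing by for a build

def launch_instance() -> str:
    resp = ec2.run_instances(ImageId="ami-00000000", InstanceType="c6i.4xlarge",
                             MinCount=1, MaxCount=1)  # placeholder AMI/type
    return resp["Instances"][0]["InstanceId"]

def acquire_builder() -> str:
    # Hand a warm instance to the incoming build, then top the pool back up,
    # so builds never pay the launch-time penalty themselves.
    instance_id = warm.pop() if warm else launch_instance()
    while len(warm) < POOL_TARGET:
        warm.append(launch_instance())
    return instance_id

def release_builder(instance_id: str) -> None:
    # Builds are single-tenant: the instance is terminated when they finish.
    ec2.terminate_instances(InstanceIds=[instance_id])
```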
Just to clarify: when you run depot build, does the build run locally, or does it run remotely on an EC2 instance? Also, it sounds like the instances are on your side, not in the customer's infrastructure. Compounding build time is a problem, but I think we solved it with the BuildKit cache. The setup you are describing, if I understand correctly, might be a no-go for enterprise customers. Maybe you are going after mid-market companies; in that case it might work. Just an opinion from my side.
I think Kyle answered this below: enterprises have the option to run the data plane of Depot in their own cloud account. In that model, the Depot CLI connects directly to that data plane without passing through any infrastructure on our side.
> I think we solved it with buildkit cache
One big thing we're doing here, if you're familiar with the BuildKit cache, is providing builds with a stable cache SSD that's reused between builds. This means we support all of BuildKit's caching features, including things like cache mounts that aren't directly supported in ephemeral CI environments. Plus, Depot doesn't need to save or load the cache to a remote store like S3 or the GitHub Actions cache; instead, the previous cache is immediately available on build start.
This may not be any better or different from what you're doing; I just wanted to mention the detail for anyone familiar with trying to make BuildKit more performant.
Hi there! Kyle here, the other half of Depot. This is correct: we have a self-hosted data plane model that larger enterprises can use if they want full control over the builders + build cache.
In that deployment model, the Depot control plane passes changes to make in the customer's environment via a small agent they run in their account. Here are some docs we put together for anyone that wants to go into a bit more detail: https://depot.dev/docs/self-hosted/architecture
This way, we can ensure that the catalog stays fresh, since the work of maintaining it is on the vendors' side, and they have an incentive to push updates to their catalog.
It also ensures complete transparency about how the catalog is built.
The tech behind it is a Next.js app that statically generates the website from the catalog data to ensure good SEO.
The end goal is to offer a central place where the community (vendors + users) can exchange information about the current capabilities of the connector market.
Next steps are:
- Add the capability for customers to provide comments / feedback on connectors that they have tried / used
- Add more info on the vendor pages (which destinations they support for writing data, pricing info)
A friend of mine would use fiberglass resin, Bondo, and plywood, because they wanted their sculptures (which they then boxed into custom-sized crates) to last for at least one hundred years.
Algolia's ex-CEO is now a partner at YC.
So the connection between the two is pretty strong; this will be resolved before the end of the day.