At least 95% of the challenge in querying data doesn't come from the difficulty of writing SQL statements IMO, but from the complexity/brokenness of the data itself. Especially with the move to unstructured data streams, most data warehouses seem to have become pretty bad when it comes to extracting truth out of the information within. Many business users know some SQL but blanch at creating coherent reports from what's available to them. What BI people are paid for seems to be having knowledge about the problems particular to a domain.
If your data is relatively clean and follows a good model, this would be a great way to help someone join, group, and aggregate their data without knowing SQL. I think a lot of people would use Tableau for this if it didn't also have a steep learning curve / high expense. However, as soon as you get into data where you're writing CASE statements, coalescing null fields, matching on different data types, decoding, partitioning over streaming data, etc., it doesn't help someone without knowledge of the caveats within the data sources themselves. Show me someone who doesn't know SQL using this to produce insight out of compromised data and I will be impressed.
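To make that concrete, here's a minimal sketch of the kind of defensive SQL a messy source forces on you (all table and column names here are made up):

    -- Hypothetical messy source: blank strings standing in for NULLs,
    -- status codes stored inconsistently, and amounts stored as text.
    SELECT
        COALESCE(NULLIF(TRIM(region), ''), 'UNKNOWN') AS region,
        CASE
            WHEN status IN ('1', 'active', 'ACTIVE') THEN 'active'
            WHEN status IN ('0', 'inactive')         THEN 'inactive'
            ELSE 'other'
        END AS status_decoded,
        SUM(CAST(amount_text AS DECIMAL(12, 2)))      AS total_amount
    FROM raw_orders
    GROUP BY 1, 2;

No visual builder can tell you that region is sometimes an empty string or that status has three different encodings; that knowledge lives with whoever knows the domain.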
Our next phase is to help people get to that cleaner source of truth much more quickly than traditional dimensional modeling approaches. Tools like Visual SQL and DBT (https://www.getdbt.com) are really reducing the complexity here.
My coworkers seem much more interested in making a bigger EMR or adding nodes to Redshift than in designing a reasonable data mart, because "star schemas don't scale". I'm interested to see what you come up with; it's a huge gap in the current literature.
Shameless plug: http://github.com/socialpoint-labs/sqlbucket
What would be really good is to assemble a library of visual queries for the customer. This is a good idea because many users have the same fundamental types of queries on their data. Once you have enough of the basic queries that the user can do useful work without programming, then you can find a way to customize from there.
Do you have data on how many similar queries customers use? That would tell you how to create the basic set of important operations.
So what you describe is somewhat built into what we have now. Users still have to choose which columns they want to look at (there's no real way for us to guess that), and then we apply some knowledge of what type of data they're looking at to help them get to what they're likely looking for.
We've also tried at times to make default dashboards for data sources when people connect. We can do this to some extent with known schemas like GA, Salesforce, HubSpot, etc., but for databases that's proven to be a largely impossible task so far. Everyone's data is so different, and has such odd conditions to consider filtering by, that the auto dashboards end up being quite useless.
Finding out the distinct set of values in any column helps a lot, and referential integrity helps a lot, but without those queries it's pretty dang hard.
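For example, a couple of profiling queries along these lines (table and column names hypothetical) usually surface the surprises fast:

    -- How many rows, how many distinct values, how many NULLs?
    SELECT COUNT(*)                 AS total_rows,
           COUNT(DISTINCT status)   AS distinct_statuses,
           COUNT(*) - COUNT(status) AS null_statuses
    FROM raw_orders;

    -- The full value distribution, which is where the odd encodings show up.
    SELECT status, COUNT(*) AS n
    FROM raw_orders
    GROUP BY status
    ORDER BY n DESC;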
SQL is easy. Data is hard.
If you aren't investing the work up front to make your data useful, then no amount of tooling or magic beans will make writing meaningful SQL easy. If your reports (SQL) are hard to write, it's because you have bad data.
Having said that, making data exploration easier is always a worthwhile exercise. The more business people can self-serve their questions, the better. It falls to data people to make that possible, not just with tools but with the data itself.
Not everyone needs to be a programmer even if they do need the value typically buried in programmer interfaces.
Questions to be asked of the data are always an afterthought, coming after the systems that gather and produce the data have already been designed.
The idea that it’s ok to just give a BI tool to business people later, and not involve them at the beginning to inform the system and data model design, is maybe why we’re in the tar pit.
A really excellent process/tool would help business figure out what kind of questions they need answered, and work backwards to the data model and implementation.
The advantage of this approach is that you do the dirty laundry upfront during the modeling & population phases. The end result is a data structure that is fool-proof: there is only one way to join facts/dimensions, and it's self-documenting. In fact, pre-joined views could be created automatically (and persisted if need be), giving the business user a clean structure to interact with.
The dumb version, for people who don't know what this looks like: make a separate database. Create tables based on what you often want to query (but don't try to keep the same shape as the base data!). Then figure out how to fill those tables from the base data.
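A minimal sketch of that idea, with made-up names and Postgres-flavored SQL; the reporting table is shaped for the question, not for the source system:

    -- Reporting table shaped around what people actually ask.
    CREATE TABLE report_monthly_sales (
        month         DATE,
        product_line  TEXT,
        total_revenue NUMERIC(14, 2)
    );

    -- Refill it periodically from the messy base tables.
    INSERT INTO report_monthly_sales (month, product_line, total_revenue)
    SELECT DATE_TRUNC('month', o.order_date)::date,
           COALESCE(p.product_line, 'UNKNOWN'),
           SUM(o.amount)
    FROM base.orders o
    LEFT JOIN base.products p ON p.product_id = o.product_id
    GROUP BY 1, 2;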
Joins between tables are one of the big pain points for me. I'm not sure how the demoed product does joins, but in SAS EG you join tables by clicking and dragging a line between the two columns you want to join, and then it gives you a pop-up to select Inner, Outer, Left, Right, etc. (although it uses plain English: rather than saying "left join" it says something like "All rows from Table 1 and matching rows from Table 2"), along with a visual Venn-diagram-style preview. This is frustrating for a few reasons. If a table is large, i.e. several hundred columns, you have to scroll for an eternity to find the columns you want to join on (a drop-down box where you can use keyboard shortcuts to jump to a column name would be much better UI, IMO). It is also difficult to do complex joins, "WHERE A.Col1 BETWEEN B.Col1 AND B.Col2" for example. Maybe it's just my org, but this is a common feature of our data: we have a lot of event-driven stuff with start and end dates, and then raw time series data you need to aggregate between the event frames. SAS's time series stuff is very good, but the SQL side not so much...
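For reference, the range join itself is one line of SQL; it's the GUIs that make it hard (hypothetical event/reading tables):

    -- Aggregate raw time series readings within each event's start/end frame.
    SELECT e.event_id,
           AVG(r.value) AS avg_during_event
    FROM events e
    JOIN readings r
      ON r.reading_time BETWEEN e.start_time AND e.end_time
    GROUP BY e.event_id;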
It is also very easy for unsuspecting users to join on unindexed columns, which leads to very poor database performance. In general there are a lot of performance footguns with generated SQL: a user can have a working query, change something in the GUI, and suddenly a query that ran in under a second takes several minutes. I've developed the habit of using the GUI to generate SQL for complex queries, then hand-editing it before running it to make sure it will be performant. Or I just hand-write it to begin with.
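My hand-editing ritual is basically: look at the plan first. Something like this (Postgres syntax with hypothetical tables; most databases have an equivalent):

    -- Check the plan before running the GUI-generated query.
    EXPLAIN
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_email = c.email;

    -- A sequential scan on the big table is the usual footgun; an index on
    -- the join column typically fixes it.
    CREATE INDEX IF NOT EXISTS idx_customers_email ON customers (email);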
Pattern matching and filtering is another pain point. The query editor lets you use "LIKE" but does a poor job of explaining to the user how it works. People who only use the GUI are surprised when I explain that you use "%" and "_" for character expansion; a lot of people blindly assume "*" is used and then complain their query isn't working.
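To spell out the wildcards: in standard SQL, "%" matches any run of characters and "_" matches exactly one character; "*" has no special meaning in LIKE. A hypothetical example:

    -- Matches Smith, Smyth, Smithson, ...
    SELECT * FROM customers WHERE name LIKE 'Sm_th%';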
Microsoft Power BI also has a query editor. I've used it a little but don't have as much experience with it. From what I saw, it's a bit less flexible; it seemed easier to do the data extraction and transformation first, then load the result into Power BI.
So I agree that while this tool may be handy for some, the real challenge with this sort of work is knowing where the (often unstructured) data sits, how it's entered and by whom, and how to extract meaning from it.
I'm not here to poo-poo your product at all, but I want to explain why I think this can't work in all cases, and that is totally fine!
The application that is my bread and butter is about 25 years old and started as a mid-90s PowerBuilder application. It's since migrated to a 3-tier .NET enterprise app, and the data model has been dragged along with it largely unchanged. The data model is full of cruft and unnatural keys that are the result of a ton of technical debt. The newer tables in the database are more sane and follow most SQL best practices, but there is a large portion that was written (and I know the original data architect, who SWEARS this is true) in no-vowel notation (unless the table starts with a vowel, or maybe throw a random e in there too). A few examples:
Sane column name -> actual column name
AccountDetailAccountCode -> AcctDtlAcctCdeID
AccountingPeriodID -> AccntngPrdID
PlannedMovementBatchID -> PlnndMvtBtchID
Provision -> prvsn
There are about 1000 tables in this database, so part of the problem with SQL is that you have to know about the table structure to know about the database. Another is that a lot of this data is constructed in stored procedures, functions or views.
Most people need at least 12-18 months to grasp the application and the data model in a real way. I still have to create database diagrams to find a path to the data I want, and I've been at it for 10+ years. What I would love to see is a way to show table relationships that is filterable in some way, i.e. not just because there is a key relationship. To me it's much more useful to know that table A is related to table B and that both tables meet some threshold of row counts. I don't care that there is a key joining table A and table B if table B has 0 rows. Building that kind of intelligent metadata about the database would be hugely helpful.
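A rough sketch of the kind of query I mean, assuming SQL Server since that fits a .NET shop (the system views are real, but treat this as an untested starting point):

    -- Foreign-key relationships where BOTH sides actually contain rows,
    -- so empty tables stop cluttering the picture.
    SELECT
        fk.name                 AS fk_name,
        sp.name + '.' + tp.name AS parent_table,
        sr.name + '.' + tr.name AS referenced_table,
        pcnt.row_count          AS parent_rows,
        rcnt.row_count          AS referenced_rows
    FROM sys.foreign_keys fk
    JOIN sys.tables  tp ON fk.parent_object_id     = tp.object_id
    JOIN sys.schemas sp ON tp.schema_id            = sp.schema_id
    JOIN sys.tables  tr ON fk.referenced_object_id = tr.object_id
    JOIN sys.schemas sr ON tr.schema_id            = sr.schema_id
    CROSS APPLY (SELECT SUM(p.rows) AS row_count
                 FROM sys.partitions p
                 WHERE p.object_id = tp.object_id
                   AND p.index_id IN (0, 1)) pcnt
    CROSS APPLY (SELECT SUM(p.rows) AS row_count
                 FROM sys.partitions p
                 WHERE p.object_id = tr.object_id
                   AND p.index_id IN (0, 1)) rcnt
    WHERE pcnt.row_count > 0
      AND rcnt.row_count > 0
    ORDER BY parent_table, referenced_table;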
More and more of our consulting gigs are this - collating and cleaning crappy data, and then putting BI reporting on top of it.
I sit in a monthly "metrics review" meeting where we mostly speculate about why the dashboards are broken, and what the data team will have to do to fix it.
If our TPM was able to self-service this through an intuitive interface it would be a massive productivity win.
Somewhat related: does anyone know of any good components/libraries that can be embedded in a web page to let users interactively build SQL queries of reasonable complexity? By that I mean something that not only lets you pick columns and filter and maybe group things, but also do joins on other tables as well.
Edit: just to clarify, it doesn't have to be visual only. In fact, something that lets you edit a query interactively either by visual components or actual SQL code is perfect, especially if it's bidirectional so that edits in the code also show in the UI and vice versa.
The idea is to have an open-source tool (with a paid license for commercial use) that people can self-host fully offline, or have hosted on the cloud with a single click.
I quit my job last month to work full-time on an MVP. If you're interested, please add your email for an early beta.
I would love your feedback.
Just want to clarify – it doesn't have to be no code. In fact, a bidirectional workflow where you can edit the query either by writing SQL or clicking buttons is ideal.
Can you make a demo session with dummy data for people to play with?
This video is maybe the closest you can get without trying yourself - https://www.youtube.com/watch?v=YBXMTipHGfQ
If I'm going to pay money for a hosted application for serious work, I'm going to tend to prefer an established service that I'm confident will be around for a while. I'm sure you've got something good here but I'm not sure what it is.
As for an established service - Chartio's been around for almost 10 years, we're profitable, and are the main data interface behind some really great companies and brands https://chartio.com/customers/
Over the years, we’ve found that even power users who know they’ll eventually go into SQL mode prefer to start in the visual mode, as the grouping, date formatting, and joins are all done automatically.
Right now the SQL we write is very proper, with quotes around all of the column names and table names listed before each column name. It's not what a human would write. I'd love to one day make it a little more human so that switching into SQL mode feels even better. It'll be a fun project.
One issue I've had with pure SQL for analytic purposes is that a user can't build a library of higher-order functions that can be parameterized (imagine a window function computing cumulative returns across various time periods, applied to different underlying source columns). So it's difficult to build up a common business domain language without taking the typical approach of constructing SQL by gluing string fragments together, as most programming-language SQL libraries do.
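Concretely, every period/column combination has to be written out in full; there's no way to abstract the window into a reusable function (hypothetical table and columns, with a simple cumulative sum standing in for a real returns calculation):

    -- The same window, copy-pasted per period and per column.
    SELECT
        trade_date,
        SUM(daily_return)  OVER (ORDER BY trade_date
                                 ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS ret_30d,
        SUM(daily_return)  OVER (ORDER BY trade_date
                                 ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) AS ret_90d,
        SUM(fx_adj_return) OVER (ORDER BY trade_date
                                 ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS fx_ret_30d
    FROM daily_returns;

What you'd want is something like cumulative(col, days) that you could define once, and pure SQL gives you no way to write it.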
It would be amazing to have something like this embedded in Azure Data Studio.
If security is your main reason for wanting self-hosting you may be interested to know we're also SOC2, HIPAA, and GDPR compliant.
EDIT: The implication being that you may not be HIPAA compliant if you don't enter into a BAA. IANAL, so I defer to the experts, but I figured I'd ask the original question anyway. If you are above board on all of this, I will definitely take this tool to my boss.
Redash can be self-hosted
(and our SeekTable also has on-premise version)
But my business email address, right now, is @ yahoo.com
Or, I don't want to give you my real business address yet because I don't want spam at that account.
Then you would have a mail account @yahoo-inc.com
But as a sort of tangent / related thing:
As a visual learner I don't feel like I usually have problems visualizing SQL (that doesn't mean I wouldn't have use for this product, quite the opposite).
On the other hand, I personally have been playing with Firebase, and OH MAN do I have problems visualizing what is going on with document-type databases. I start writing code and thinking about the DB, and my brain starts to sputter like there's sand in the gears ;)
The interface right now only works for structured data sources and my brain also starts to sputter when I think about making a visual language for anything not structured :).
Worth noting, the idea of "visual SQL" appeared long, long ago, it was called "Query by Example" https://en.wikipedia.org/wiki/Query_by_Example.
Congratulations on the launch, and thank you for having PostgreSQL support. (Side question: any reason why you have a mirrored Postgres logo?)
I also love how everything is a table. Looker does this, and it's great (though Chartio's implementation looks better). Tableau does not, which makes certain things a headache.
I do wonder how well this will scale to large datasets and columnar databases, though. Showing a preview of the tables is really cool and useful, but it is not what data warehouses like Redshift are designed for. The preview probably wouldn't load in time for my users to see it unless they just sit there waiting for it.
As for large scale datasets - we do cache those sample tables and we only grab the first 10 rows. Also, BigQuery actually has a great API for fetching a set of sample data that we utilize heavily. They made it because of exactly the potential concerns you outlined around columnar stores.
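For what it's worth, on BigQuery you can also sample cheaply in plain SQL via TABLESAMPLE, which reads only a subset of storage blocks instead of scanning the table (dataset/table names made up):

    -- Cheap preview on a columnar warehouse: sample ~1% of storage blocks.
    SELECT *
    FROM `my_project.my_dataset.events` TABLESAMPLE SYSTEM (1 PERCENT)
    LIMIT 10;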
We also have the ability, with our visual language, to join data from multiple different queries/data sources. You can see how that happens here: https://chartio.com/docs/visual-sql/merge-queries/
We also often forget the indirect cost of supporting a client. At my current job our software is great for small companies, but we found out that for customers under 50 employees it was just not worth it to support them.
You can also always lower a price; nobody will complain about a lower price ;). But raising it without justification is much harder.
So, all that to say, it can make sense to start with a price way too high.
Why even use SQL if you are simply looking for patterns in static data? Why not use an app dedicated to analytics, of which many exist? SQL comes into play when data needs to be shared, and those types of apps need coders who would not struggle with SQL. I simply do not buy the argument that SQL is difficult to learn for competent coders.
I believe most data exploration just shouldn't be SQL/text based. It's just faster to do it visually if there aren't extra steps added and the interface is good/flexible enough.
Even as developers we spend most of our time in a GUI environment vs the command line. It's not just that it's easier/better for business users - it's also just better for everyone.
Totally, the future is hybrid. Quickly press some buttons to get to a starting SQL query, and then go complex from there if needed.
My principle is that "simple things should be simple, complex things should be possible".
Nowadays people use an amalgamation of random tools: Grafana, Redash, Superset, Tableau, etc. But they all have rudimentary alerting or none at all.
Self hosting is another feature that is important. Enterprise customers simply won't buy if it's SaaS only.
Like what mode.com is doing with its "bridge": https://mode.com/help/articles/how-mode-connects/#bridge
“On your database server, run the following to install autossh”
I think the elephant in the room though is that this is a technical solution to a non-technical problem. You can't teach all users to be highly proficient in your app through a perfect interface - because the perfect interface does not exist. You teach them through traditional training methods and continuous training investments and infrastructure.
Going through their landing pages it seems like Chartio is offering support and training packages, so there's definitely nothing wrong with them supplementing that with better visualization tools.
In general however I just wish more companies would invest in the training layer earlier and not try to bail themselves out with the perfect redesign - cough reddit cough.
Every product needs some education (Excel is still one of the most popular online courses), and we knew ours would be no exception, which is why last year we also launched DataSchool, our free, community-driven online courses on data.
We do this though as a very flexible visual language, where you can create a pipeline of actions including merging queries from multiple different datasets and then doing some post query computation.
As an aside, I was actually looking for a comment about the opposite issue: I have a browser which blocks autoplaying videos, it took me a while to realise that what I thought were static images, were in fact videos without playback controls (and a little bit longer to work out how to re-enable playback controls!).
Either way, once I got them playing, it looked like a slick experience, congrats on the launch!
I consider a product to be a success when it combines the best features of both and makes them available to users so that they can organise their workflow according to their own preferred style.
Besides, SQL is Datalog, which is Prolog, which is 'programming in logic', which gets as close as it can to 'wiring' logic. Many attempts to visualize this process have failed so far, because they end up more complex and more specific.
SQL takes like 30 characters and a dozen keywords to master. Any visual paradigm takes more.