Show HN: Open-source, browser-local data exploration using DuckDB-WASM and PRQL (github.com/pretzelai)
227 points by prasoonds on March 15, 2024 | 74 comments
Hey HN! We’ve built Pretzel, an open-source data exploration and visualization tool that runs fully in the browser and can handle large files (a 200 MB CSV on my 8 GB MacBook Air is snappy). It’s also reactive - so if, for example, you change a filter, all the data transform blocks after it re-evaluate automatically. You can try it here: https://pretzelai.github.io/ (static hosted webpage) or see a demo video here: https://www.youtube.com/watch?v=73wNEun_L7w

You can play with the demo CSV that’s pre-loaded (GitHub data of text-editor adjacent projects) or upload your own CSV/XLSX file. The tool runs fully in-browser—you can disconnect from the internet once the website loads—so feel free to use sensitive data if you like.

Here’s how it works: You upload a CSV file and then explore your data as a series of successive data transforms and plots. For example, you might: (1) Remove some columns; (2) Apply some filters (remove nulls, remove outliers, restrict time range, etc.); (3) Do a pivot (i.e., a group-by but fancier); (4) Plot a chart; (5) Download the chart and the transformed data. See screenshot: https://imgur.com/a/qO4yURI

In the UI, each transform step appears as a “Block”. You can always see the result of the full transform in a table on the right. The transform blocks are editable - for instance in the example above, you can go to step 2, change some filters and the reactivity will take care of re-computing all the cells that follow, including the charts.

We wanted Pretzel to run locally in the browser and be extremely performant on large files. So, we parse CSVs with the fastest CSV parser (uDSV: https://github.com/leeoniya/uDSV) and use DuckDB-Wasm (https://github.com/duckdb/duckdb-wasm) to do all the heavy lifting of processing the data. We also wanted to allow for chained data transformations where each new block operates on the result of the previous block. For this, we’re using PRQL (https://prql-lang.org/) since it maps 1-1 with chained data transform blocks - each block maps to a chunk of PRQL which when combined, describes the full data transform chain. (PRQL doesn’t support DuckDB’s Pivot statement though so we had to make some CTE based hacks).
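
Roughly, the plumbing looks like this (a simplified sketch rather than our actual code - the table name and PRQL fragments are illustrative, and it assumes an already-open DuckDB-WASM connection):

  import { compile } from "prql-js";
  import type { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

  // Each UI block contributes one PRQL fragment; joined together they describe
  // the whole pipeline, which compiles to a single SQL query for DuckDB.
  const blocks = [
    "from uploaded_csv",   // the table created from the uploaded file
    "filter stars > 100",  // a filter block
    "take 1000",           // a limit block
  ];

  async function runPipeline(conn: AsyncDuckDBConnection, blocks: string[]) {
    const sql = compile(blocks.join("\n"));  // PRQL -> SQL
    return await conn.query(sql);            // DuckDB-WASM returns an Arrow table
  }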

There’s also an AI block: This is the only (optional) feature that requires an internet connection, but we’re working on adding local model support via Ollama. For now, you can use your own OpenAI API key or use an AI server we provide (GPT4 proxy; it’s loaded with a few credits), specify a transform in plain English, and get back the SQL for the transform, which you can edit.
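
Under the hood it boils down to a prompt roughly like the sketch below (illustrative only - not our exact prompt or proxy code; your own key goes in the Authorization header):

  // Hypothetical sketch of the text-to-SQL call; model and prompt are illustrative.
  async function transformToSql(apiKey: string, schema: string, request: string) {
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
      body: JSON.stringify({
        model: "gpt-4",
        messages: [
          { role: "system", content: `You write DuckDB SQL. The table schema is: ${schema}` },
          { role: "user", content: request },
        ],
      }),
    });
    const data = await res.json();
    return data.choices[0].message.content as string;  // SQL the user can then edit
  }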

Our roadmap includes allowing API calls to create new columns, support for an SQL block with nice autocomplete features, and a Python block (using Pyodide to run Python in the browser) that runs on the results of the data transforms, much like a Jupyter notebook.

There are two of us and we’ve only spent about a week coding this and fixing major bugs, so there are still some bugs to iron out. We’d love for you to try this and to get your feedback!




Looks great and thanks for sharing!

You mentioned that you went to some length to implement pivots. How far do you think you will take the pivot feature and UI?

For financial-type use cases it's pretty much a requirement to be able to map many fields on X/Y, be able to collapse them in the browser, and also show/control aggregations.

Asking because many apps stop at the simplest level of pivot features and go to great lengths in other areas like more advanced visualization. For mapping many fields though, I think a good pivot table UI is like a secret weapon.

I get that this isn't what you would focus on so much in week 1.

Nice work + love the local-first and direct UI.


Right - I used to work at GS and they had this really great internal table display tool where you could simply drag a column from the top to the left side and it would Pivot + Collapse all fields on it (plus allow multi-level pivots). Then, you could look at the Pivot table OR do a drill-down to see the root cause of why a value was so high.

I really liked that interface and haven't really found anything near as usable yet. Perspective JS (a free library by JP Morgan; you can try it here: https://perspective.finos.org/block/) has some really cool functionality in this direction, but it has its own data processing engine, query language, rendering engine, etc., so we couldn't have used that for this project.

But, this is an interesting thought - I definitely would want to see this functionality in Pretzel - the only question is how to prioritize this. If you know of any performant table libraries that support collapsible pivots out-of-the-box, I'd love to integrate that. Alternatively, we'll write our own!


One that I'm watching is the Table mark in Graphic Walker [1].

There is a license thing about logos to note if you do use it in Pretzel [2].

On GW [1], go to Create Dataset -> Public Datasets -> Student Performance, then change the Mark Type to Table and you can play around. It hits the things that I mentioned pretty well! You might have similar issues as you have with Perspective though.

I make calculang [3] and I'm getting ready to plug into a lot of things. A good pivot table generalizes well for my needs (mainly quick feedback during DX).

Perspective is on the list but so is GW and Pretzel.

Perspective might suit my needs perfectly.

But I like DuckDB-WASM approach anyway so I hope you continue and I hope you nail it one way or another! :)

[1] https://graphic-walker.kanaries.net

[2] https://github.com/Kanaries/graphic-walker/issues/330

[3] https://calculang.dev


Wow, Graphic Walker looks incredible! AND they use vega-lite - I'm a BIG Altair fan - this looks like fun - I'll take a look!


Perspective.js is what I use for the grid and charting functionality of https://sql-workbench.com.

You can just do the data manipulations in DuckDB WASM and pipe the Arrow data to Perspective...
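
Something like this rough sketch (it assumes an existing DuckDB-WASM connection and a <perspective-viewer> element, and serializes the result to Arrow IPC before handing it over):

  import perspective from "@finos/perspective";
  import { tableToIPC } from "apache-arrow";
  import type { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

  async function showInPerspective(conn: AsyncDuckDBConnection, viewer: any, sql: string) {
    const result = await conn.query(sql);           // Arrow table from DuckDB-WASM
    const ipc = tableToIPC(result);                 // serialize to Arrow IPC bytes
    const worker = await perspective.worker();
    const table = await worker.table(ipc.buffer);   // Perspective ingests Arrow buffers
    await viewer.load(table);                       // <perspective-viewer> renders it
  }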


Very impressive project and vision! Love the demo!

I am also ex-GS and worked on what I am fairly sure is the table display tool you're describing. I tried to carry the essential aspects of that work (multi-level pivots, with drill-down to the leaf level, and all interactive events and analytics supported by db queries) to Tad (https://www.tadviewer.com/, https://github.com/antonycourtney/tad), another open source project powered by DuckDb.

An embeddable version of Tad, powered by DuckDb WASM, is used as the results viewer in the MotherDuck Web UI (https://app.motherduck.com/).

If you're interested in embedding Tad in Pretzel, or leveraging pieces of it in your work, or collaborating on other aspects of DuckDb WASM powered UIs, please get in touch!


Yes! I think it was TDS viewer or something like it - I loved using it so thank you for building it :)

Tad viewer looks perfect for embedding in Pretzel! Is it easy to embed it in web apps? I’m on mobile right now and did a quick search and didn’t find anything. I’ll definitely be in touch!


Yes, Tad should be pretty easy to embed -- in the github repo there's a React component (TadViewerPane), and a fairly modular API for adding a new data source. I'm happy to work with you on this, this should be a fun exercise!


Just tried it out - thanks for sharing.

We're a sales tech startup and I've had to look through large CSVs with prospecting information in the past - anything over 10 MB crashes my browser. I use a Mac so I don't have Excel. This looks great for simple data manipulations. I tried out a large CSV I had, and it loaded without a problem.

Quick bug report: The filter interface seems to be slow for me though with the large file. Also, one feature that would be really helpful would be connecting this to a database and also some way to share my workflow/analysis.


Hey Marco, I'll take a look - filters theoretically should be fast; when you create a new filter, it simply does a `select * from table limit 1` to get column names.
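
For reference, the column-name lookup is basically this (a simplified sketch, assuming an open DuckDB-WASM connection):

  import type { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

  // Read one row just to get the Arrow schema, then pull the column names from it.
  async function getColumnNames(conn: AsyncDuckDBConnection, table: string) {
    const result = await conn.query(`SELECT * FROM "${table}" LIMIT 1`);
    return result.schema.fields.map((f) => f.name);
  }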

I wasn't sure whether you could query DBs directly from the browser but looks like you can! (https://github.com/alexanderguy/pgress) - will add it to roadmap!


Replying as I couldn't edit:

Alright, did a bit of digging and the flamegraph points to a problem with the Table component. I'd hoped that BlueprintJS tables would be performant enough for our usecase but apparently not!

We'll try to move to canvas based rendering ASAP - that should fix any lags in the filter UI


I was recently searching for something like this - the ability to write PRQL queries on mobile is my ideal usecase. As you note, PRQL is perfect for putting together queries with drop-downs, so I'm a huge fan of the UI that you've put together. Feature request: Exposing the PRQL would be great, and using tabs to switch between the query and results would make it possible to work on mobile. Lastly, bookmarking those queries for later use would be great too!


Yes! I'm heartened to see people realize the benefits of PRQL and chained transforms over vanilla SQL (with some caveats!).

We'll definitely add a PRQL/SQL block to the UI soon (you will be able to toggle to select which one you'd like).

On that point - PRQL doesn't natively support PIVOT statements. As a hack, we made our own flavor of PRQL with a PIVOT statement, and we parse that into SQL CTEs to make it work for now. I will be submitting a pull request to PRQL to add PIVOT support so we can get around this.

We're working on saving queries to local storage as well as sharing via downloadable config files (we're thinking a PRQL file with some sugar that when uploaded, recreates the entire transform)


[PRQL dev here]

Love what you've done and thanks for building on PRQL!

We're very happy to add something like a PIVOT statement. In the long term, we'd have to think about how to make it work deeply in PRQL, since the column names are suddenly runtime dependent. In the short term, we should definitely make PRQL work for your case — it's important that there's an escape hatch for things that aren't natively supported by PRQL yet. (We have s-strings but I'm guessing they don't cover this specific case?)


Hey Max, love what you all are doing with PRQL! It's a wonderful tool.

That makes sense. I ran into some old issues about Pivot but it seems it's not been implemented yet. And yes, s-strings wouldn't work - the DuckDB pivot statement looks like this:

  PIVOT ⟨dataset⟩ 
  ON ⟨columns⟩
  USING ⟨values⟩ 
  GROUP BY ⟨rows⟩
So, we'd have to pass the `dataset` computed so far in the chain into the PIVOT statement. This is where CTEs come in handy, but happy to hear if there's a better solution :)
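
To make it concrete, the current workaround looks roughly like the sketch below (illustrative, not our actual code): compile the non-PIVOT part of the chain with PRQL, wrap it in a CTE, and run DuckDB's PIVOT over that CTE.

  import { compile } from "prql-js";

  function buildPivotQuery(prqlChain: string, on: string, using: string, groupBy: string) {
    const innerSql = compile(prqlChain);  // everything before the pivot block
    return `
      WITH chain AS (${innerSql})
      PIVOT chain
      ON ${on}
      USING ${using}
      GROUP BY ${groupBy}`;
  }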


For folks looking for a workaround in the meantime, you can add an AI block and it will give you the returned SQL to edit.

Not ideal because it’s just wasting credits, but it worked for me.


Hey! We added support for SQL blocks btw!


Awesome, thanks for the update!


It would be good to have EdgeQL too.


Oh this looks really interesting! I hadn't heard of EdgeDB/EdgeQL - I'll have to do a deeper read but at first brush, it looks like EdgeQL can be compiled down to SQL based on this HN comment: https://news.ycombinator.com/item?id=30296669

I'll take a look and if it's simply a matter of switching from the PRQL compiler to the EdgeQL compiler, then we should be able to have it land in main in short order. It's not the highest-priority thing right now though.


That looks great, I love all these new use cases for WASM that are starting to pop up.

Are you planning on supporting parquet files any time soon? I'd love an easy way to just drop a parquet file on there and easily visualize time-series data with it (e.g. sensor readings).

Small issue I noticed: When you chart time-series data, the timestamp axis is just represented as an integer rather than a meaningfully formatted timestamp.


Thanks for the feedback!

> Are you planning on supporting parquet files any time soon?

Yes. This is an oversight on our part - DuckDB natively supports Parquet and Arrow files but we just haven't gotten around to adding upload support for those yet. It's a small change - landing in main tomorrow!
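
For anyone curious, the change is roughly this (a sketch, names are illustrative): register the uploaded file with DuckDB-WASM's virtual filesystem and read it with read_parquet.

  import type { AsyncDuckDB, AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

  async function loadParquet(db: AsyncDuckDB, conn: AsyncDuckDBConnection, file: File) {
    const bytes = new Uint8Array(await file.arrayBuffer());
    await db.registerFileBuffer(file.name, bytes);  // expose the upload to DuckDB's FS
    await conn.query(
      `CREATE TABLE uploaded AS SELECT * FROM read_parquet('${file.name}')`
    );
  }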

On the bug, yes, we've been trying to parse timestamps but it's just been very finicky. I'll try to fix this! Thanks for the report.


I think this could probably replace the functionality provided by datasette?


Datasette was definitely a huge influence for this and I'm a big fan of @simonw and his work.

However, there's a couple of major differences:

- Datasette - to me - is a tool for devs - it takes a fair bit of tech know-how to understand how to load new data in Datasette, install plugins and deploy an instance. I wanted Pretzel to have a "low floor, high ceiling" - meaning the tool should work well for non-technical folks but also shouldn't hinder technical folks and power users

- Datasette has a unique way to explore data based on "facets" where you can dig through data. It's a powerful way to explore but I personally don't think Datasette is the best tool for complex data transforms - its strengths lie in exploring datasets instead.

- For accessibility and speed, we wanted something browser-local and fast - hence the focus on DuckDB and canvas-rendered tables. Datasette requires setting up an instance, which is additional friction for non-tech folks, and it isn't particularly designed for large datasets (that's my impression - I might be wrong)

- Lastly, we have a ton of future plans - including in-browser Python support via WASM (Pyodide), local LLM support, the ability to connect databases, etc.

I think there's space for both tools!


I enjoyed your demo video and played around with a CSV of my own. This is a cool project, but I have a (perhaps stupid) question:

Why should I use this tool over Excel or Google Sheets?

I don't mean this question to be a challenge. Just trying to understand for which sorts of tasks I'd be better off using Pretzel as opposed to a traditional spreadsheet tool.


Hey, absolutely - that's a perfectly fair question. I can give you my personal reasons:

- I don't have Excel on a Mac so I need to use Google Sheets. Google Sheets crashes on any file over 30-40 MB (at least on my MacBook Air)

- I think this type of chained data transform is a much better (more code-like?) way of doing transforms. You can see exactly how you got to your end result. Versus, for example, in Google Sheets, you're modifying data in place and your transform history is only accessible via Undo and Redo. This is far more reproducible, short of using a Python script.

(I've worked as a data scientist for a while and I suppose that influences how I think about chained data transforms vs the in-place Excel way too!)

But this is just for now - the long term vision is: Being able to easily switch between visual no-code blocks, SQL and Python in the same browser-local "notebook"/"workbook" that's easily shareable. I think it doesn't take much convincing that such a tool would be more powerful than Excel for certain workflows!


This is really promising - please look into embedded analytics. We are struggling with building our embedded analytics for our SaaS. Right now we are doing it with Cube and Superset. All the alternatives are expensive or hard to implement. Being able to offload the analytics to the browser is a plus.


Thanks! This is interesting - just to make sure I'm understanding you correctly - you're trying to build analytics into your SaaS tool to offer it to your users, correct? And you're using Superset to build the analytics inside your SaaS app?

I'll definitely look into it. For the moment, I know a team trying to solve this exact problem (they're not open-source/free but I think they're pretty inexpensive) - happy to connect you with them if you like :)


What database are you using?


Now we use Postgres; we're testing DuckDB/MotherDuck for analytics.


Oh man, this is cool. I don't really have much feedback tbh, I just think this is awesome. I can see myself using this as a handy little utility when data sets get a bit too big for Google Sheets.

In terms of it being a product, things I can think of that would be useful to me.

- Importing/exporting data from Google Sheets/Google Drive.

- Scheduling queries to run and export etc.

Not sure if that really aligns with this specific project given that it runs locally though.

Edit: Having a few standard dimension tables available to join would be great too. I often want to join time from a date dimension table when I'm making time series data for example.


That's great feedback, thanks!

This tool definitely comes from a place of personal need - beyond just handling large files, I've also never really gelled well with the Excel/Google Sheet model of changing data in place as if you were editing text. I'm a Data Scientist and always preferred the chained data transforms you see in things like dplyr (https://dplyr.tidyverse.org/) or Polars (https://pola.rs/) and I feel this tool maps very closely to the chained model.

Also, thank you for the feature requests! Those would all be very useful - we'll put them on the roadmap.




Neat. I did a similar thing for analysing the output of commands that can produce JSON (SLURM in this case). Worked pretty well but I immediately ran into PRQL's lack of support for DuckDB's struct types.


Yes, there are definitely some downsides (which, honestly, we didn't realize when we started to build this). For us, it was the lack of a PIVOT statement, so I had to make a modified dialect of PRQL that supports PIVOT, then split the final PRQL into PIVOT and non-PIVOT chunks, and then convert it to chained SQL CTEs. Annoying work for sure.

As a result, I'll be adding support for PIVOTs very soon, either in the main repo or just in a temporary fork.


Thanks for building this as open-source!

How is it different from CSVFiddle (which is also based on DuckDB-WASM, I think)? The ability to have multiple tables loaded and to run join queries against them would be crucial.


Thanks!

We're different in a couple of ways:

- You don't need SQL to transform and manipulate data in Pretzel - we have visual blocks where you can construct a transform chain (though we do support manual SQL blocks now!)

- We have better CSV parsing (I spent a fair bit of time on this!). See one of our test-case CSVs that breaks other parsers here (https://imgur.com/a/O8XMfET)

- We allow plotting!

- We support PRQL - which IMO is a better way to build data pipelines (but we also support SQL)

- We have AI Blocks! You can use your API keys (or our server) to get SQL from text which you can edit

- We're currently adding support for Python via Pyodide (pyodide.org) so you can do complex data transforms, even train simple ML models right inside the browser!

- Lastly, we'll be adding a ton more features to Pretzel over the coming months (many of them have been asked for in this thread actually!)

Support for multiple files is crucial - we'll be adding that soon. Right now, we're rethinking our whole task execution architecture so any major features will have to wait until then.


This is quite nice. Any plans to add simpler ML tools? An sklearn plugin could be as valuable as LLM access.


Thank you! Yes, one of the items on the roadmap is support for Pyodide (https://github.com/pyodide/pyodide) for running in-browser Python on the results of each of the code blocks! This should allow most ML libs to be usable in-browser! This is pretty high up on our priority list.
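
The rough shape of that would be something like this sketch (it assumes Pyodide is loaded from its CDN script, and the column names are illustrative):

  declare const loadPyodide: any;  // provided by the Pyodide CDN script

  // Fit a scikit-learn model on a block's result rows, entirely in the browser.
  async function fitLinearModel(rows: Array<{ x: number; y: number }>) {
    const pyodide = await loadPyodide();
    await pyodide.loadPackage(["numpy", "scikit-learn"]);
    pyodide.globals.set("data", pyodide.toPy(rows));
    const py = [
      "import numpy as np",
      "from sklearn.linear_model import LinearRegression",
      'X = np.array([[r["x"]] for r in data])',
      'y = np.array([r["y"] for r in data])',
      "float(LinearRegression().fit(X, y).coef_[0])",  // slope, returned to JS
    ].join("\n");
    return await pyodide.runPythonAsync(py);
  }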


Can we bookmark or localstorage our etl?


We're going to build this over the next week - it's a high-priority item for us. Once done, you should be able to do two things:

- Save existing workflows to localStorage (a little drawer on the left side of the screen)

- Export workflows as a JSON file so you can share workflows with others by simply sending them the JSON file however you like
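
Roughly what we have in mind (a sketch - the key names and block shape are illustrative, not final):

  type Block = { type: string; settings: Record<string, unknown> };

  function saveWorkflow(name: string, blocks: Block[]) {
    const saved = JSON.parse(localStorage.getItem("pretzel-workflows") ?? "{}");
    saved[name] = blocks;
    localStorage.setItem("pretzel-workflows", JSON.stringify(saved));
  }

  function exportWorkflow(name: string, blocks: Block[]) {
    const blob = new Blob([JSON.stringify({ name, blocks }, null, 2)],
                          { type: "application/json" });
    return URL.createObjectURL(blob);  // point a download link at this URL
  }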


I think it would be great to be able to embed the workflow in the URL, so that I can share pre-made workflows with people who can then apply it to their data.


Have a look at https://sql-workbench.com, it supports query sharing via URL, as well as sharing visualizations. Let me know if you have questions!
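
The general idea is simple enough - something like this sketch (illustrative; long workflows may need compression to stay within URL length limits):

  function shareUrl(blocks: unknown[]) {
    const encoded = btoa(encodeURIComponent(JSON.stringify(blocks)));
    return `${location.origin}${location.pathname}#workflow=${encoded}`;
  }

  function readSharedWorkflow(): unknown[] | null {
    const match = location.hash.match(/workflow=([^&]+)/);
    return match ? JSON.parse(decodeURIComponent(atob(match[1]))) : null;
  }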


I think no one really enjoys working with Sheets or Excel. Apart from the inability to handle large data, they have such a high/unintuitive learning curve for most data analysis operations. I can see this tool becoming a ‘metabase’ for everyday data analysis.


Thanks! Yes, we're going to build the modern data tool with a low barrier to entry (so non-devs, non-data folks can use it) but that also works great for power users (a way better Jupyter notebook, with modern IDE-like capabilities - pair programming, copilot, extensions, out-of-the-box linting, testing, etc.).

It's almost a meta thing, but I think dev tools have been getting better and better over the last 10-20 years while data tools seem to be stuck in the past. Devs, when they see friction, go build something to solve it, while this tendency is much less common among data scientists, for example.


What is this offering over something like datasette?

https://datasette.io/

Or even OpenRefine?

https://openrefine.org/


Hey, I left a comment in another thread about the differences:

https://news.ycombinator.com/item?id=39732135

Didn't know about OpenRefine - but one major difference is Pretzel is web-based, meaning no installation necessary (important for work laptops, for example). We're also going to be adding a ton of features to this - think Jupyter notebooks crossed with Metabase crossed with Datasette.


Thanks for the reply. I thought I'd made some obvious error when I asked and was getting downvoted without comment. I appreciate your time in replying to what may have been a stupid question.


Nice to see DuckDB-Wasm get some love! Curious, from a pure performance point of view, whether you have performed any benchmarking against any of the alternatives for this sort of in-browser data work?


That's an interesting point! No, we haven't really - for now, I asked a bunch of friends to send me large CSVs. I loaded them in Google Sheets to see how it performs (usually terribly - it crashes my browser tab or is unbearably slow). Then I try the CSV in Pretzel with some filters and pivots, and if I don't feel any annoyance with speed, it's a success in my book for now.

We'll definitely need to figure out (a) what's the right thing to measure, and (b) automated testing/CI to check for regressions and do perf testing.


Wow! Is there any way this could support SQLite databases?


DuckDB supports SQLite through extensions (https://duckdb.org/docs/extensions/sqlite.html)

So, it should be pretty straightforward to let folks drop a SQLite file instead of a CSV file! I haven't been able to get extensions to work on WASM so far, but I will definitely take a look to see if we can make this work!
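
If extensions do end up working under WASM, the shape would be something like this sketch (unverified - as mentioned, I haven't gotten extensions to load in the browser yet, and the table name is illustrative):

  import type { AsyncDuckDB, AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

  async function attachSqlite(db: AsyncDuckDB, conn: AsyncDuckDBConnection, file: File) {
    const bytes = new Uint8Array(await file.arrayBuffer());
    await db.registerFileBuffer(file.name, bytes);     // expose the upload to DuckDB
    await conn.query(`INSTALL sqlite; LOAD sqlite;`);  // may not work in the browser yet
    await conn.query(`ATTACH '${file.name}' AS uploaded (TYPE sqlite)`);
    return await conn.query(`SELECT * FROM uploaded.some_table LIMIT 10`);
  }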


Fantastic! Would you like me to make an issue for this?


I haven't taken a close enough look at it yet, but DuckDB supports SQLite databases, so in theory yes.

However, using SQLite requires a DuckDB extension and I'm not sure if it works for WASM, and if so, whether it's enabled in this project in particular.

https://duckdb.org/docs/guides/import/query_sqlite

But I believe that most of these DuckDB web IDEs will revolve around a couple of generalized use cases:

1. OLAP/analytical-type workloads/queries. SQLite is really more for OLTP/transactional workloads.

2. Querying datasets that are available on, or targeted for, blob/object storage like S3: Parquet, CSV, line-delimited JSON, etc.


Isn't doing data science locally in a browser like tying one arm behind your back and trying to tie your shoelaces? Why would you do it?


Very cool. Would love to be able to save/share my work (could probably even be in the url string if the CSV were online).


Ramon from Pretzel here, thank you for your feedback! We definitely have this on the roadmap:

1. Allow downloading/uploading workflows as files (maybe JSON, or PRQL)

2. Allow sharing private URLs by attaching the workflow as a URL parameter

We are building a roadmap upvoting site; we'll add it to the README as soon as it's ready.


this is great, thanks for sharing

I'm currently using a self hosted instance of lightdash connected to a dbt project and I can see this being really efficient for data exploration for business users

quite interesting!


Thank you. We're definitely looking to get to a point where you don't need to jump to a Jupyter notebook for simple analyses, especially things which are hard to do in SQL (a basic linear regression model, for example).

Happy to chat at prasoon [at] withpretzel [dot] com if you need any integration help!


The DuckDB WASM space is really heating up! I released https://sql-workbench.com a few weeks ago, which can be used to query and visualize Parquet, CSV, JSON and Arrow data.

There’s also an accompanying tutorial blog post at https://tobilg.com/using-duckdb-wasm-for-in-browser-data-eng...


I'd just like to thank you (again) for how much easier you've made a lot of what is required to get DuckDB up and running (esp. on AWS).

Interesting to watch the similarities (duckdb wasm) and differences (python vs js) between your SQL IDEs in the browser. Exciting stuff.


Thank you! It’s really great that the in-browser data analysis ecosystem is firing up… Excited to see what others build


Oh hey! I remember seeing your HN post several weeks ago, I think - it's a great tool, thanks for building this! It was definitely an inspiration for us :)


Thanks and congrats on the launch!


Hey, congrats on the Show HN. Local, browser-based data exploration works for a lot of use cases and is so much faster than cloud-based tools. We've implemented something similar at https://addmaple.com/ - but with a graphical interface designed for rapid exploratory data analysis of large datasets.

Memory per tab can be an issue for really big files (1 GB+), but we're exploring a transform to CBOR, which allows us to free up JS memory - i.e., when parsing CBOR we can leave row-level data as Uint8Array and it doesn't increase the JS memory overhead.


Thanks! Maple looks really cool - really interesting demo video, too!

This is quite interesting - we've not explored really large files so far and, being honest, we haven't thought that far either. Didn't know about CBOR! I will have to look deeper into how this can save on memory. I was wondering though: since WASM memory is limited to 4 GB, if I have sufficiently large memory on my device, at least one tab should be able to handle 1 GB+ files too, correct?


Thanks :-)

I've not done much profiling on DuckDB and what the overhead is - i.e. after the data is parsed how much memory is used. Would be really interesting to push it to the limit - or to explore not loading the entire file in, but only reading the relevant parts, but again that probably requires a conversion first, e.g. to parquet or some other column based storage format.


that sounds like a fun project :) maybe I'll take a stab at it!


200 MB of data is not a large file, and Chromium tabs have a memory limit of something ridiculously low, so actual large 20-100 GB datasets render this useless.


Sometimes you need a scooter, sometimes you need a truck.

I think a snappy interface for <1gb datasets is really neat and super useful for certain kinds of data


This echoes my thoughts exactly. Right now, we're actually more limited by the JS UI, so a couple hundred MBs is the most you can do in a browser before the UI becomes really slow. There's a lot of room for improvement - we're using React and that's causing a bunch of unneeded re-renders right now. We probably need to create our own DAG-based task management system and use Canvas to render everything - with all that, workflows on much larger files will hopefully become usable.
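
For what it's worth, the recompute itself is conceptually simple since blocks form a chain - something like this sketch (the block shape and runner are illustrative):

  type Block = { id: string; result?: unknown };

  // Re-run only the changed block and everything downstream of it, in order.
  async function recomputeFrom(
    changedId: string,
    blocks: Block[],
    runBlock: (block: Block, upstream: unknown) => Promise<unknown>,
  ) {
    const start = blocks.findIndex((b) => b.id === changedId);
    if (start < 0) return;
    let upstream = start > 0 ? blocks[start - 1].result : null;
    for (const block of blocks.slice(start)) {
      block.result = await runBlock(block, upstream);
      upstream = block.result;
    }
  }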


You can now use the file APIs to work with terabyte-sized files if your disk can handle it.

https://developer.chrome.com/docs/capabilities/web-apis/file...

Browser is pretty powerful nowadays.
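
For example, something like this sketch streams a file of arbitrary size chunk by chunk instead of pulling it all into memory (File System Access API, Chromium-based browsers):

  async function countBytes(): Promise<number> {
    const [handle] = await (window as any).showOpenFilePicker();
    const file: File = await handle.getFile();
    const reader = file.stream().getReader();
    let total = 0;
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      total += value.byteLength;  // process each chunk here instead of buffering it all
    }
    return total;
  }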


This is certainly true - I'm not saying "large file" in the colloquial sense of "big data" but rather as in a file you might want to open in Excel/Google Sheets. I've worked with actually large datasets before - upwards of 500 GB - pretty often, and I really wouldn't think about using my laptop for such a thing!

We are thinking of making data connectors to major DBs though so you should be able to do a similar style visual analysis while keeping the compute on your DB.


I looked up the limit and, as of 2021, tabs seem to have been limited to 16 GB, which is moderate in size for an in-memory dataset. However, I know WASM has a hard limit of 4 GB without Memory64. Data size is all relative.



