
Well, Rust is great and everything... but calling DuckDB "the previous generation of data engineering" seems a bit funny. And if anything, DuckDB will probably have more impact than all the other mentioned tools (or maybe it even already has).

I built a similar board some time ago; the focus was also mostly on the amateur level. It includes some of the features you mentioned, e.g. a timeline to save different formations, and graph connections (though they don't move together).

It's probably a bit outdated by now (haven't used it in a while), but maybe some of the ideas are useful for your app.

The board can be found here: https://crst.github.io/taktiktafel/tkt.html


But curating such a list of experts to follow takes quite some effort. It would be great to have a tool that helps with that.

And sure, ideally you wouldn't need such a list of trusted experts but could just focus on content. There was even a time when this worked - you could just type "what is the best database to use" into a search engine and get a helpful result. Not anymore. On HN it may still be better than elsewhere, but ultimately it's a similar issue.


  create table x as (select * from person);
  select name from x where ...;
there you go, just configure your editor to display "create table x" as "declare x = " ;)

or even a version with lazy evaluation:

  create view x as (select * from person);
  select name from x where ...;


You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.

SQL just does not have this; instead it has 15 different second-class ways to handle tables and queries that try to make up for the fact that they are not first-class values. These include CTEs, table-valued functions, views, etc.
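
For instance, a minimal sketch (hypothetical table name) of why a CTE is only second-class: it names a query, but only within the single statement that defines it, so it can't be returned, passed around, or reused later:

  WITH adults AS (SELECT * FROM person WHERE age >= 18)
  SELECT name FROM adults;
  -- 'adults' is gone after this statement; the next query
  -- has to repeat the whole definition.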


Usually I balk at the idea of someone describing a language feature as “first class” because it seems too wishy-washy. But in this thread you’ve shown me that maybe the best definition is through “proof by negation”: by patiently responding to arguments and demonstrating why a certain usage and the ensuing restrictions around it mean it is not first class. Bravo!


I agree the term is often abused, but I think the Wikipedia page actually does a decent job of making the notion somewhat precise, along the lines I've been arguing here:

https://en.wikipedia.org/wiki/First-class_citizen

If you want to see what queries as first-class values look like, LINQ in .NET is pretty close. I can actually write a series of queries that build on and compose with each other, like this:

    // Queries are ordinary values here: they can be stored in variables,
    // chosen between at run time, passed as parameters, and returned.
    IQueryable<Person> RunQuery(int userSelection)
    {
        var first = from x in People
                    select x;
        var second = userSelection == 1
            ? from x in first where x.Birthday > new DateTime(2000, 1, 1) select x
            : from x in first where x.Name.Contains("Jane") select x;
        return DumbJoin(first, second);
    }

    IQueryable<Person> DumbJoin(IQueryable<Person> first, IQueryable<Person> second)
    {
        return from x in second
               join y in first on x.Role equals y.Role
               select x;
    }
This query is nonsense, but it just shows you what composition really looks like when queries are first-class values. I wish raw SQL were like this!


> You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.

I doubt you could implement a query planner that would cope with that degree of flexibility. Which means you’d be forced to deal with the mechanics of the query, pushing you away from declarative SQL and into procedural and functional programming. At which point you might as well ditch SQL anyway.


Without these features, people have to resort to dynamically generated SQL queries in procedural or functional languages, which is much worse! SQL has also become significantly more complicated by adding all sorts of second-class features to get around this composability limitation (CTEs, table-valued functions, views, etc.).

Besides, I don't think it would be as bad as you say. You can approach it as a simple template expansion into flat SQL queries except where a data dependency occurs, at which point template expansion proceeds in stages, one for each dependency.
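
To sketch the idea (hypothetical names, composed form shown as pseudo-syntax in the comments): a query bound to a name and referenced elsewhere simply expands inline as a subquery:

  -- composed form:
  --   base     = SELECT * FROM person WHERE active
  --   filtered = SELECT name FROM base WHERE role = 'admin'
  -- flat expansion the planner would actually see:
  SELECT name
  FROM (SELECT * FROM person WHERE active) AS base
  WHERE role = 'admin';

Only when one stage's result determines a later stage's structure (a real data dependency) would expansion need multiple rounds.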

LINQ on .NET provides most of the composability I'm talking about, although it has a few limitations as well. Still worlds better than raw SQL.


In PostgreSQL at least, a table can appear as a return type of a function and as a parameter to a function. That's not nothing.
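
For example, a quick sketch (hypothetical names):

  CREATE TABLE person (name text, role text);

  -- a table as a return type: the function yields rows of "person"
  CREATE FUNCTION by_role(r text) RETURNS SETOF person AS $$
    SELECT * FROM person WHERE role = r;
  $$ LANGUAGE sql;

  -- a table row as a parameter: a whole "person" passed as a value
  CREATE FUNCTION describe(p person) RETURNS text AS $$
    SELECT p.name || ' (' || p.role || ')';
  $$ LANGUAGE sql;

  SELECT * FROM by_role('admin');
  SELECT describe(person) FROM person;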


What if I wrote a very long, complicated query that I'd like to test against different tables (like test tables), and let's say I can't use stored functions or procedures. How could I pass different tables to my query?


  CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
  CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
  CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b);

  CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);

  SELECT * FROM complicated_query WHERE test_case = 'a';
  SELECT * FROM complicated_query WHERE test_case = 'b';


Nice, that is what I was looking for. Of course, it'd need to point to production data as well, so maybe test_case is null for production rows. In that case:

  CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
  CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
  CREATE TABLE data_prod AS (SELECT NULL AS test_case, prod_table.value FROM prod_table);

  CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b UNION ALL SELECT * FROM data_prod);

  CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);

  -- when testing
  SELECT * FROM complicated_query WHERE test_case = 'a';
  SELECT * FROM complicated_query WHERE test_case = 'b';

  -- when in 'production'
  SELECT * FROM complicated_query WHERE test_case IS NULL;


You just reinvented defunctionalization, which is a transformation from a domain that has first-class values to a domain where support is only second-class. Defunctionalization is typically used in programming languages to simulate first-class functions in languages where functions are only second-class citizens, like C and Pascal.

This perfectly illustrates my point. You had to manually defunctionalize your data model and queries to support what I'm saying should be inherently part of SQL.


> languages where functions are only second-class citizens, like C and Pascal.

1) Only if you define Pascal as only Wirth's very first version. That changed almost immediately.

2) Only if you refuse to equate “pointer to function” with “function”. Which in C, where “everything is a pointer” (a bit like in Unix/Linux “everything is a file”), seems rather silly.


Check out dbt - it's a great tool for organizing queries and handling patterns like this.
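
In dbt each query lives in its own file and references other queries via ref(), which takes care of dependency ordering. A tiny sketch (hypothetical model names, reusing the example above):

  -- models/complicated_query.sql
  SELECT test_case, value + 1 AS value
  FROM {{ ref('data') }}

dbt then builds 'data' before 'complicated_query' and can materialize each one as a view or a table, depending on configuration.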


If you can't use stored procedures which are good for this very case, many databases offer dynamic SQL. That might work in some cases.
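
In PostgreSQL, for example, an anonymous DO block can run dynamic SQL without creating a stored procedure (a sketch; 'my_table' is a placeholder):

  DO $$
  DECLARE
    n bigint;
  BEGIN
    -- the table name is interpolated at run time, which static SQL can't do
    EXECUTE format('SELECT count(*) FROM %I', 'my_table') INTO n;
    RAISE NOTICE 'rows: %', n;
  END $$;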


That's a good point. This would rule out SQLite for me.


Dynamic SQL isn’t SQL, and it’s not relational. It’s no different from using a language like Python to generate SQL queries.


It's a little different. Anyway, this is under the constraint "no stored procedures."


A TVF (table-valued function) with CROSS APPLY.
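
For SQL Server, roughly along these lines (a sketch with hypothetical tables):

  -- inline TVF: a parameterized query that returns a table
  CREATE FUNCTION dbo.TopOrders(@customer_id int)
  RETURNS TABLE AS RETURN (
    SELECT TOP 3 o.total
    FROM orders o
    WHERE o.customer_id = @customer_id
    ORDER BY o.total DESC
  );

  -- CROSS APPLY invokes the function once per outer row
  SELECT c.name, t.total
  FROM customers c
  CROSS APPLY dbo.TopOrders(c.id) AS t;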


> Queries using SELECT DISTINCT can now be executed in parallel.

This sounds quite interesting, but I would assume it does not always work? I didn't see this mentioned in the linked documentation - does anyone know when/how the parallel distinct works?


Couldn't tell you the when, but I can tell you the how is likely how you'd expect.

Generally speaking, to do distinct you need a dictionary to look up previously seen values. To do it in parallel you need to make that dictionary thread-safe.

In Java, such a thread-safe dictionary is built by segmenting the table and synchronizing on the segments. So you'd hash your value, figure out which segment that targets, lock that segment, and then read/update it to contain the new value.

I'd assume that Postgres is doing a fairly similar trick. The only additional synchronization would be on a linked list of found values: you could either lock the list and update it as new values come in, sort the values after the fact, or employ a lock-free algorithm to add nodes to the list (see lock-free queue implementations).


Most parallel operations in PG are implemented by simply splitting the dataset, working on the parts independently, and merging the results. I expect the new distinct to behave the same and not work on a shared data structure.



Thanks! It may be helpful to include this in the documentation, since I guess whether the parallel plan is used will then often depend on the numDistinctRows estimate [1].

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...


"select distinct" is likely now syntax sugar around "select ... group by 1", which worked in parallel for a while.


> However, the term "Data engineer" was specifically created by/for ML folks to get rid of unpleasant repetitive work that has to be done but nobody looks forward to it.

This may indeed be how the term "data engineer" is sometimes used, but I have my doubts that it was originally coined with this meaning. I'm not sure where or when the term actually originated, but ICDE started in 1984 [1] and the Data Engineering Bulletin was renamed in 1987 [2] (from "Database Engineering"). It seems likely that the term "data engineer" has been in use since at least then.

Of course, ML already existed then too, but that was certainly a while before the current "big data" / "deep learning" era. And regarding the topics considered "data engineering" at the time, this is from the foreword of the December 1987 issue of the Data Engineering Bulletin:

> The reasons for the recent surge of interest in the area of Databases and Logic go beyond the theoretical foundations that were explored by early work [...] and include the following three motivations:

> a) The projected future demand for Knowledge Management Systems. These will have to combine inference mechanisms from Logic with the efficient and secure management of large sets of information from Database Systems.

Which sounds just as relevant today as it did back then. It also does sound like a rather challenging task, and not exactly like "unpleasant repetitive work". Or at least not any more repetitive than: change some model parameters / retrain model / evaluate results / repeat ;)

[1]: https://ieeexplore.ieee.org/xpl/conhome/1000178/all-proceedi...

[2]: http://sites.computer.org/debull/bull_issues.html


Data engineering jobs named as such started to pop up only in the past few years, coinciding with the availability of MapReduce/Spark. I wouldn't be surprised if the term was re-introduced by one of the companies developing those systems (like Databricks, Cloudera, etc.) as a sort of marketing, to distinguish themselves. In the past we had DBAs; now DBA + DevOps + unspecified everything has morphed into data engineering.

I used to be a member of SIGMOD and the "data engineering" you mentioned was just an academic term.


Indeed, it is generally good for analytics to have a conceptual data model that is reasonably efficient both on the technical side (to quickly get answers for queries), and also on the logical side (to quickly translate questions to queries).

This model doesn't need to be OLAP cubes, but it's also not that easy to find something better.
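
For example, a classic star schema (all names hypothetical) keeps both sides cheap - a question like "revenue per year and category" translates almost word for word:

  SELECT d.year, p.category, SUM(f.revenue) AS revenue
  FROM sales_fact f
  JOIN date_dim d ON f.date_id = d.id
  JOIN product_dim p ON f.product_id = p.id
  GROUP BY d.year, p.category;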


