
Introducing U-SQL, a language for big data processing - mrry
http://blogs.msdn.com/b/visualstudio/archive/2015/09/28/introducing-u-sql.aspx
======
hacker_9
Product looks good.. apart from the most important point from a dev point of
view - the syntax!

They created a new query language but didn't change the syntax at all from
SQL?? Why does every database product seem to think SELECT FROM is the be all
end all? Ok they bolted on C# Linq, and you can reference other assemblies,
but these are OLD ideas, and the combination/love-child of these ideas looks
horrible.

For example writing SELECT before FROM is a massive productivity problem -
mainly because the computer doesn't know what you are doing _until_ you write
the _FROM table_ part, before it can finally auto-suggest column names for
you.

Don't write _SELECT col1, col2 FROM {table}_ , but how about _FROM {table}
Select col1, col2, etc_.

Second problem with the syntax is it isn't modular: combine 4 tables together
and get data from each and you have something like this:

    
    
      SELECT a.col1, a.col2, b.col5, b.col1, a.col3, c.col5
      FROM a 
      JOIN b on etc
      JOIN c on etc
    

What if I want to remove the join to table B, I have to modify the syntax
quite extensively. (First comment out parts of the select, then get rid of the
from - but this breaks the 'join c', so I need to repeat this process
again..).

What if each from-select was it's _own piece of code_ \- revolutionary I know.

    
    
      FROM a
      SELECT col1, col2, col3
      JOIN
          (FROM b 
          SELECT [3]col5, col1)
          ON etc
          JOIN
              (FROM c
              SELECT col5)
              ON etc
    

Indenting is used to quickly show which table joined to which. If I want to
just run table A, then I highlight the top 2 lines of code and run. Or for
table B the 4th and 5th etc, you get the idea. The _[3]_ syntax is used to
tell SQL not to append the columns to table A, but insert starting from column
3.

This is what I thought of in 10 minutes, I'm sure there's some holes but
already I like it better than the crappy SQL syntax I have to read every day
at work :(

~~~
cwyers
I think your chained, tab-indented from-select structure is the worst idea
ever conceived of by mankind, and the only purpose I can think of for it is
you want to encourage my carpel-tunnel to flare up again. So much extra
verbosity added to the standard SQL of writing that query for no additional
clarity. (And you still need the from! Just comment out one line and two
columns, it's no harder than commenting out the weird syntax you've come up
with.)

Which is not to say that SQL can't be improved upon, but the amount of
improvement you'd need to make to make it worth learning a different query
language is I think higher than the benefits you can get from improving SQL,
especially if you're not going to radically remake it, because being 80% SQL
is worse than being 0% SQL because you keep running up against the principle
of least surprise not working.

~~~
hacker_9
Hilarious that your argument is your carpel-tunnel - Look at the code again
and you'll see it's laid out so the computer can _fill in the information for
you_ = less typing. By typing the FROM first, the whole next SELECT statement
can be auto added with a single key (perhaps `). Or you can write the SELECT
yourself, but only have to type the first char of each column name before the
IDE auto completes it.

The tab structure shows you at a glance what tables join to what (not uncommon
to see this parent-child in XML for example, or you know, any other
programming language ever).

Your last point is void - just look at C#, Java, C++, Go and so on. All reuse
keywords for different purposes. It's not hard to switch context in your head
depending on the language.

------
guard-of-terra
So, it's a glorified awk over .csv files?

I think that we as Open Source community can do better than that. We just have
to make a product out of sort, join, awk, grep - not meaning making a wrapper
over them but rather taking inspiration.

I did a great amount of medium-to-big data crunching in command line, and it's
usually okay. There are still some pain points:

\- Selecting by list of ids specified in one of columns.

\- Joining two files together by field (you want them to be sorted on it
obviously)

------
joshuaellinger
Interesting extensions.

The nice part of the syntax is that: 1\. it is clear when you are going
external. 2\. procedural code can be written in C# -- big win there.

But it misses the two big issues in ETL that I always run into:

    
    
        1. You have a bunch files that are 95% the same but have 2-5 columns that are different.
        2. Someone sticks the wrong type of data a field.
    

For those issues, you'd be better off mapping a file (or a group of files) to
a virtual table so you can specify the schema up front.

~~~
M_Rys_Msft
Actually U-SQL gives you the ability to map files to a virtual table (aka view
:)).

CREATE VIEW V AS EXTRACT <insert schema here> FROM "/a/b/{*}.csv" USING
Extractors.Csv();

If the data is more complex, you can write your own Extractors that for
example handle the divergence in columns. Although we are planning on some
features here in the future.

------
Olap84
Interesting to see ETL extensions to (T-)SQL. I wonder if we will see anybody
else offer some? (Come on Postgres!)

~~~
pella
for PostgreSQL check the "Foreign data wrappers" \+ "IMPORT FOREIGN SCHEMA"
...

* [http://blog.taadeem.net/english/2015/03/15/70_Shades_of_Post...](http://blog.taadeem.net/english/2015/03/15/70_Shades_of_Postgres/)

* [https://wiki.postgresql.org/wiki/FDW?nocache=1](https://wiki.postgresql.org/wiki/FDW?nocache=1)

------
cbsmith
Man, does this look a lot like Pig. Why not just use Pig?

~~~
M_Rys_Msft
Because Pig's overall adoption is low. 80% of Hadoop jobs last time I checked
were written in Hive.

Besides Pig's extensibility story is not that good.

But I am glad you like Pig, because Pig is supported in the system as well :).

~~~
cbsmith
> Because Pig's overall adoption is low. 80% of Hadoop jobs last time I
> checked were written in Hive.

Yeah, but putting another name on it doesn't seem likely to change that, does
it?

> Besides Pig's extensibility story is not that good.

Really? I consider it far more extensible than say Hive....

