
Show HN: Better SQL group by date - akane
http://ankane.github.io/groupdate.sql/
======
richardv
If you know that you need to be grouping by dates... then it's usually best to
store the date fragments in a separate column so you can take advantage of
indexes.

The project is pretty cool, but I don't think it's worth adding the dependency
lock-in for the functions.

Simply put, if you know ahead of time that you want to GROUP BY date, then you
should create a new column for each interval you need: `week`, `day`, `hour`.

That way you will have fast queries...
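A minimal sketch of that approach, assuming a hypothetical `events` table (STORED generated columns need MySQL 5.7+; on older versions, populate the extra columns from application code or triggers):

```sql
-- Hypothetical events table with precomputed, indexable date buckets.
CREATE TABLE events (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  -- derived buckets, kept in sync by the server:
  created_day DATE AS (DATE(created_at)) STORED,
  created_week DATE AS (DATE(created_at - INTERVAL WEEKDAY(created_at) DAY)) STORED,
  KEY idx_created_day (created_day),
  KEY idx_created_week (created_week)
);

-- The GROUP BY can now scan an index instead of computing a function per row:
SELECT created_day, COUNT(*) FROM events GROUP BY created_day;
```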

~~~
bjourne
You don't need that if you use postgresql:

    
    
        create index foo1 on t1 (date(timestamp));
        create index foo2 on t2 (extract(week from timestamp));
    

Postgres can build an index on almost any expression.
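One caveat worth noting: the planner only uses such an index when the query repeats the indexed expression. A sketch, reusing `t1` from the snippet above:

```sql
-- Matches the date(timestamp) expression index, so it can be used:
SELECT date(timestamp) AS day, count(*)
FROM t1
WHERE date(timestamp) >= date '2013-01-01'
GROUP BY day;
```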

~~~
kalleboo
These are the cool practical reasons to switch from MySQL to Postgres that I
so rarely see (usually the reasons are more ad hominem and handwavy).

~~~
orf
Really? There are loads of practical arguments: The incredibly rich type
system (hello array/schemaless columns), indexes (partial indexes, several
different types to choose from), language integration (write index functions
in javascript/python/ruby/c), foreign data wrappers (connect pg to
mysql/mongodb/etc) to name but a few.

~~~
kalleboo
I don't doubt they exist; it's just that in HN comments, usually the only
thing touted is that "it's not MySQL", i.e. it doesn't have the quirks MySQL
does (in regard to silently munging data types, etc.). Sometimes
performance/scalability is also discussed.

------
pvnick
Pretty cool! I can see this being really useful.

Beware, however: while there are certain cases where MySQL can use indexes to
satisfy a GROUP BY clause [1], MySQL definitely won't be able to use indexes
with any of these functions (I'm not very familiar with Postgres, so somebody
else would have to comment on that). If your load is relatively light or your
table is small enough, you'd probably be fine; otherwise I would either 1) set
another column to the value you want when inserting the row, or 2) have a cron
job run in the background to fill in values on another column, something like
_update tbl set creation_week = gd_week(creation_timestamp) where
creation_week is null_.

[1] [http://dev.mysql.com/doc/refman/5.0/en/group-by-
optimization...](http://dev.mysql.com/doc/refman/5.0/en/group-by-
optimization.html)

~~~
joevandyk
Postgres can use indexes with these functions.

If the functions are marked stable or immutable, then the function bodies will
be inlined directly into the query.
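A sketch of the mechanism, with a simplified, hypothetical gd_week that takes a plain timestamp (the project's real functions also take a time zone argument; table and column names are made up):

```sql
-- A SQL function marked IMMUTABLE can be inlined by the planner
-- and used in an expression index.
CREATE OR REPLACE FUNCTION gd_week(timestamp)
  RETURNS timestamp AS $$
    SELECT date_trunc('week', $1);
  $$ LANGUAGE SQL IMMUTABLE;

-- Allowed only because gd_week is immutable:
CREATE INDEX idx_events_week ON events (gd_week(created_at));

-- A GROUP BY that repeats the expression can then use the index:
SELECT gd_week(created_at), count(*) FROM events GROUP BY 1;
```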

------
Terretta
A common grouping needed by sysadmins is to group by 5 minute intervals. That
query may be more likely to trip up a junior DBA than most of these. Could be
useful to add.

Example use case: [http://www.techtalkz.com/microsoft-sql-
server/170861-groupin...](http://www.techtalkz.com/microsoft-sql-
server/170861-grouping-sql-output-5-minute-intervals.html)

~~~
umsm
Do you happen to know the most elegant solution to this problem?

Convert the date to seconds since some date X (or a unix timestamp divided by
60), divide by the number of minutes you need, and convert to int.
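In Postgres, that recipe looks roughly like this (table and column names are made up):

```sql
-- Bucket rows into 5-minute (300-second) intervals by integer-dividing
-- the epoch, then convert the bucket start back to a timestamp.
SELECT
  to_timestamp(floor(extract(epoch FROM created_at) / 300) * 300) AS bucket,
  count(*)
FROM events
GROUP BY bucket
ORDER BY bucket;
```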

~~~
jacques_chester
Never use raw numbers to represent slices of time; there are just too many
corner cases that can throw your calculations off.

Use the INTERVAL data type, it handles the logic properly.
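Newer Postgres versions (14+) expose exactly this as date_bin, which buckets by a real INTERVAL instead of raw seconds (table and column names are made up):

```sql
-- date_bin(stride, source, origin): aligns created_at to 5-minute
-- buckets counted from the given origin timestamp.
SELECT date_bin('5 minutes', created_at, timestamp '2001-01-01') AS bucket,
       count(*)
FROM events
GROUP BY bucket
ORDER BY bucket;
```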

------
joevandyk
The postgres functions should be marked IMMUTABLE.

[http://www.postgresql.org/docs/9.2/static/xfunc-
volatility.h...](http://www.postgresql.org/docs/9.2/static/xfunc-
volatility.html)

This allows the planner to optimize queries that use the functions correctly,
instead of treating the functions like a black box.

Edit: STABLE would indeed be better, forgot about timezones.

~~~
wulczer
Not really, the ones calling down to date_trunc should be marked as STABLE,
since the underlying function is STABLE.

    
    
      psql=$ \df+ date_trunc
         Schema   |    Name    |      Result data type       |        Argument data types        |  Type  | Volatility |  Owner   | Language |    Source code    |                     Description                      
      ------------+------------+-----------------------------+-----------------------------------+--------+------------+----------+----------+-------------------+------------------------------------------------------
       pg_catalog | date_trunc | interval                    | text, interval                    | normal | immutable  | postgres | internal | interval_trunc    | truncate interval to specified units
       pg_catalog | date_trunc | timestamp without time zone | text, timestamp without time zone | normal | immutable  | postgres | internal | timestamp_trunc   | truncate timestamp to specified units
       pg_catalog | date_trunc | timestamp with time zone    | text, timestamp with time zone    | normal | stable     | postgres | internal | timestamptz_trunc | truncate timestamp with time zone to specified units

~~~
jeltz
No, the code converts to "timestamp without time zone" (via AT TIME ZONE)
before running date_trunc, so it is immutable.

For example:

    
    
      CREATE OR REPLACE FUNCTION gd_day(timestamptz, text)
        RETURNS timestamptz AS
      $$
        SELECT DATE_TRUNC('day', $1 AT TIME ZONE $2) AT TIME ZONE $2;
      $$
        LANGUAGE SQL;
    

EDIT: The reason that some time functions in PostgreSQL are not immutable is
that they are affected by the current time zone setting of the session.

------
tudborg
If you have an indexed DATE or DATETIME field in MySQL, you really want to use
the date and time functions to do your grouping.

This seems to add unnecessary complexity and overhead.

There is already plenty of built-in abstraction for date manipulation in
MySQL, and you really don't want to be dependent on any stored procedure or
function if you can avoid it.

[http://dev.mysql.com/doc/refman/5.5/en/date-and-time-
functio...](http://dev.mysql.com/doc/refman/5.5/en/date-and-time-
functions.html)

I can't speak for postgres though. It might be different.
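For example, day and week grouping with only built-in MySQL functions (table and column names are made up):

```sql
-- Built-in MySQL date functions cover the common groupings:
SELECT DATE(created_at) AS day, COUNT(*) FROM events GROUP BY day;
SELECT YEARWEEK(created_at, 3) AS week, COUNT(*) FROM events GROUP BY week;
```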

------
kogir
You can actually do really fun (and fast!) things with dates by mapping them
to geospatial types.

For an example see: [http://explainextended.com/2009/08/11/efficient-date-
range-q...](http://explainextended.com/2009/08/11/efficient-date-range-query-
sql-server/)

You can easily extend this map into more dimensions depending on your common
use cases and the capabilities of your DB's spatial library.

------
Rygu
Looks simple and useful. My only worry is that the installation instruction is
a curl request piped to mysql as the root user...

Thanks for sharing!

~~~
akane
curl | sh is a very transparent installer. Here's the source.
[https://raw.github.com/ankane/groupdate.sql/master/mysql.sql](https://raw.github.com/ankane/groupdate.sql/master/mysql.sql)

Also, you don't have to use the root user - I'll make that more clear. Thanks
for the feedback!

~~~
hnha
did you access that url with curl to make sure it does no user agent sniffing?

how do you know the script is the same on subsequent requests?

~~~
bcoates
It's an https request to GitHub, which has about a million more subtle ways to
own your machine if they really wanted to.

Figuring out how to solve the TOCTOU problem for a small script in a source
control repo should not be difficult for anyone actually qualified to tell
whether a script is evil by looking at it.

------
pbnjay
I'm not sure why this is necessary. It's a very light wrapper around existing
SQL functions that adds additional dependencies and makes it more difficult
for developers who already know SQL.

Maybe the use case is for installable packages that would be used with
different backend databases... but in general, it seems like a bad idea.

~~~
akane
Thanks for the feedback. Working correctly with time zones in the database can
be tricky.

    
    
        DATE_TRUNC('day', $1 AT TIME ZONE $2) AT TIME ZONE $2
    

or

    
    
        CONVERT_TZ(DATE_FORMAT(CONVERT_TZ(ts, '+00:00', time_zone), '%Y-%m-%d 00:00:00'), time_zone, '+00:00')
    

are not very intuitive (or aren't to me at least).

~~~
tudborg
Simply use and store everything in UTC. If you ever need to do grouping by
time in any other timezone, it would be trivial to do the conversion.

DAYOFYEAR( CONVERT_TZ(my_datetime_field,'+00:00', '+02:00') )

Seems very intuitive to me.

------
lowglow
I'm sure I'm a neophyte at this, but can someone explain to me why we would
want to store functions in a database? It fragments the logic, adds another
layer of complexity when troubleshooting, and adds another skill requirement
when onboarding new talent.

I've recently looked at a project where there were a number of postgres
functions floating around and it seemed terribly difficult to set breakpoints,
profile, and troubleshoot.

Maybe I'm missing something?

~~~
bobwaycott
I'd suggest it comes down to business logic vs data logic. Business logic
rightly belongs in [insert your programming language] code.

Data logic, however, is worth thinking through. Functions, stored procedures,
and database views are there for a reason: when you have a lot of calculations
that span multiple tables, need fine-grained access control, require multiple
reusable result sets that would benefit from being indexed and delivered to a
user (versus attempting to map all this through ORM objects, or pulling out
raw data and then transforming it in app code), or need to query and package
the same datasets (or combinations thereof) to share with other systems that
access them via ODBC.

That some ORMs make it difficult to leverage this functionality does not speak
to a lack of need for separate data logic and for taking advantage of database
capabilities. If all you're building is CRUD apps with a pretty face, then you
likely don't need it. If, however, you're building pretty complicated systems,
a result that could be calculated and kept up to date in the database is far
easier to produce and reuse there than by pulling out more raw data than you
need and transforming it in app code. Moreover, ORMs often sour in the face of
very complicated data requirements that are rather trivial to set up in a
view, function, or stored procedure.

As far as onboarding is concerned, in my humble opinion, if you are bringing
in new talent to deal with data who do not have a strong understanding of SQL,
I'd be pretty wary of far more significant problems developing as a result of
not being familiar with what the ORM is doing behind the scenes (and of what
performance tradeoffs are occurring).

As far as troubleshooting is concerned, if you can confirm that the data is
leaving the database as expected by data logic/code--and this is as simple as
executing a query, function, sproc, or view--then you can at least limit
yourself to discovering where the offending business logic is located.

------
jotto
Here's a Ruby library I made (similar to his groupdate gem) with SUM as an
additional feature (but no timezones). It supports MySQL and PostgreSQL:
[https://github.com/jotto/ar_aggregate_counter/blob/master/li...](https://github.com/jotto/ar_aggregate_counter/blob/master/lib/ar_aggregate_counter.rb)

------
akane
For those concerned about a SQL dependency, it might be useful to create
projects in different languages that package these functions. These methods
were extracted from
[https://github.com/ankane/groupdate](https://github.com/ankane/groupdate),
which you can use for Ruby + ActiveRecord.

(edit: spelling)

------
memset
Very cool. Thank you for sharing!

(Now we just need one for Oracle, DB2, and Sybase...)

~~~
bratsche
Do you really use all those databases?

~~~
z92
The number of commonly used database engines isn't small either:

Oracle, MySQL, PostgreSQL, SQLite, MS-SQL, and Access.

Six in total.

------
gtrubetskoy
PostgreSQL and MySQL? Where is the Hive version? :)

~~~
akane
Just a pull request away :)

------
jerogarcia
very very nice.

