
Alright, here's a relevant question I've been having in terms of this. Let's say I have the code to gather / scrape / load some stats into postgres, but then I want to run projections on them.

For example, if I'm trying to predict the next day's stats by simply taking the average of past stats, how many days back should I look and average over? Is the last 3-day average the best? 5 days? 8? 10?

There are clearly a couple of ways to do it. One is to get a data frame of all the stats for all the objects, write the python logic to loop through the days, pull the stats from the past X days not including the current day, take the average, and then store that back in postgres in a projections table: a function like set_projections(5), where 5 is the number of past days I'm averaging over.

The second way is to write that function as a plpgsql function that uses subqueries to find the past X days of stats for the players and then creates or updates the projections table, all in sql, so we can run `select set_projections(5)` and it'll do it itself.
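
Roughly what I mean, as a sketch (the table / column names are placeholders, and the upsert assumes a unique constraint on (player_id, proj_date)):

  CREATE OR REPLACE FUNCTION set_projections(lookback int)
  RETURNS void LANGUAGE sql AS $$
    INSERT INTO projections (player_id, proj_date, proj_val)
    SELECT player_id,
           CURRENT_DATE,
           AVG(val)
    FROM stats
    WHERE date >= CURRENT_DATE - lookback
      AND date <  CURRENT_DATE            -- the past X days, not including today
    GROUP BY player_id
    ON CONFLICT (player_id, proj_date)
      DO UPDATE SET proj_val = EXCLUDED.proj_val;
  $$;

  -- then updating the projections is just:
  SELECT set_projections(5);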

So the question becomes: which one is "best"? I have to imagine it's mostly a case-by-case basis. Since it's only me here, I've been doing it in postgres alone, since it can be done in one query (with multiple subqueries, yes). With python, it'd involve many more steps. On the other hand, the sql looks huge, and I've been running into the question of whether to split some of the subqueries into sub-functions, since that's what I'd be doing in python. If there were more people involved, would it be bad to have large logic like that in postgres, since we wouldn't know the others' skill with sql? Mostly they'd be coders and could write the projections in whatever language they wanted.

Another example of this tradeoff: how should I interact with the database? I have a rails background, and ActiveRecord is an incredibly good ORM, whereas SqlAlchemy hasn't done it for me. In either case, there's a ton of overhead in getting the connections running with the correct class variables. So instead, I kind of created my own ORM / interface where I can quickly write queries on my own and use those in the code. This is especially easy since most of the queries can be written as postgres functions, so the query strings in the code are incredibly short!

What I've learned from this project is that sql is very, very powerful in what it does. I've shifted more to using it for actions than I would have in the past, and pretty much all the thinking I do now is about writing as little code as possible.

Anyone make it through reading this and have comments about what I'm doing and what they like to do?






It sounds like you're calculating a running average. Have you looked into window functions? I think they do exactly what you need. Something roughly like:

  SELECT AVG(stats.val)
            OVER(ORDER BY stats.date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS avg_val
  FROM stats
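
If you also want it per player, with the current day excluded from the window, something along these lines might be closer (column names are guesses):

  SELECT stats.player_id,
         stats.date,
         AVG(stats.val) OVER (
           PARTITION BY stats.player_id                -- separate running average per player
           ORDER BY stats.date
           ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING    -- previous 5 rows, current day excluded
         ) AS avg_val
  FROM stats;
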
Regarding this point:

> What I've learned from this project is that sql is very, very powerful in what it does. I've shifted more to using it for actions than I would have in the past, and pretty much all the thinking I do now is about writing as little code as possible.

It's an interesting tradeoff. I've been on the other side, in an environment that was super SQL heavy (also Oracle PL/SQL) where we had to deal with these huge complex queries, and it was incredibly painful, for various reasons. One, SQL by itself isn't that composable in the way code is[1], so there tends to be a lot of copy/paste/duplication. Two, giant queries can be hard to understand/debug. Three, SQL can be hard to test. I mean, you can set up integration tests against a database (which can be time consuming), but how do you unit test a query?

At the other extreme, I've seen cases where something separately queries two tables and does the join operation in code, which would be much easier/safer just by doing a sql join. So there's a tradeoff. There's a hard-to-define sweet spot somewhere in the middle, IMHO, where you take advantage of relational semantics/joins and utilize various sql features like group bys and whatnot, but where you haven't shoved all of your logic into the database layer.

[1] You can create CTEs (with foo as....) in postgres, but they may have downsides wrt performance (at least, last time I'd read up on it they posed an optimization boundary). You can also create views, but similar issues may arise.


Re the CTE optimization boundary comment: this changed in Postgres 12! By default, they are no longer a boundary. You can change the behaviour on a per-query level, too. As always, depesz has an article going into the details: https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...
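
The per-query switch is the MATERIALIZED keyword, e.g. (made-up table):

  -- Keep the old behaviour (optimization fence) for one CTE:
  WITH recent AS MATERIALIZED (
    SELECT * FROM stats WHERE date >= CURRENT_DATE - 30
  )
  SELECT player_id, AVG(val) FROM recent GROUP BY player_id;

  -- Or let the planner inline it (the new default when the CTE is non-recursive,
  -- referenced only once, and has no side effects):
  WITH recent AS NOT MATERIALIZED (
    SELECT * FROM stats WHERE date >= CURRENT_DATE - 30
  )
  SELECT player_id, AVG(val) FROM recent GROUP BY player_id;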

In general, it's been my experience that if you try to do stuff SQL could do (reporting or heavy data work) in Ruby or Python or some other scripting language in a web application context, you're in for one or both of "why is this thing so damn slow?" and "why is this report crashing the application?" (answer: it's eating all the memory).

I've also seen highly user-visible performance issues because some developer either didn't realize how slow performing a series of queries in a loop would be (didn't understand the network overhead of each request) or didn't know how to condense what they needed to one or two requests. High correlation of this with the ORM-dependent folks.
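
The usual shape of it, with made-up table/column names: a query per player in a loop versus one round trip:

  -- N+1 style: one round trip per player (most of the time is network overhead)
  --   SELECT * FROM stats WHERE player_id = 1;
  --   SELECT * FROM stats WHERE player_id = 2;
  --   ...and so on for every player...

  -- Condensed into a single request:
  SELECT * FROM stats WHERE player_id = ANY(ARRAY[1, 2, 3]);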

So far as averaging a few days of stats goes, my gut feeling is "this can just be a view with one or more medium-complexity selects behind it", but maybe there's some reason it can't.
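
Roughly something like this (names assumed), which also sidesteps maintaining a separate projections table:

  CREATE VIEW projections AS
  SELECT player_id,
         AVG(val) AS projected_val
  FROM stats
  WHERE date >= CURRENT_DATE - 5
    AND date <  CURRENT_DATE        -- the previous 5 days, excluding today
  GROUP BY player_id;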

> So instead, I kind of created my own ORM / interface where I can quickly write queries on my own and use those in the code. This is especially easy since most of the queries can be written as postgres functions, so the query strings in the code are incredibly short!

I haven't been in non-Rails Ruby land in quite a while but I remember liking Sequel quite a bit. Very much The Right Amount of abstraction (i.e. barely any) over the DB.


What is "best" probably depends on your current metric for "best". Developer time spent? (Is developer and DBA the same person? How familiar are you with SQL?) Processing time? Amount of data transferred? Readable code? Maintainability? Versioning?

I've also shifted more logic to Postgres recently and keep the queries in the code trivial. It's because I like that SQL is declarative and there's no intermediate state to mess up (the whole query can be processed within a transaction).

As for readability (and refactoring), you have several options in Postgres. Going the PL/pgSQL route probably means you're gravitating back towards procedural code. It does have its uses, but I try to avoid it whenever I reasonably can; try using language SQL [1] instead. Another option is functions: they can be more readable but are likely less efficient (they can be a barrier for the query planner), just like PL/pgSQL. I've been mostly happy using nested views (and materialized views) lately. But whenever you can, use CTEs to structure your queries.
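
For illustration, the same (made-up) lookup in language sql vs PL/pgSQL; simple sql functions can often be inlined by the planner, PL/pgSQL ones never are:

  -- language sql: essentially a named query
  CREATE FUNCTION recent_avg(p_player int, p_days int)
  RETURNS numeric LANGUAGE sql STABLE AS $$
    SELECT AVG(val)
    FROM stats
    WHERE player_id = p_player
      AND date >= CURRENT_DATE - p_days
      AND date <  CURRENT_DATE;
  $$;

  -- PL/pgSQL equivalent: procedural, always a black box to the planner
  CREATE FUNCTION recent_avg_plpgsql(p_player int, p_days int)
  RETURNS numeric LANGUAGE plpgsql STABLE AS $$
  BEGIN
    RETURN (SELECT AVG(val)
            FROM stats
            WHERE player_id = p_player
              AND date >= CURRENT_DATE - p_days
              AND date <  CURRENT_DATE);
  END;
  $$;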

YMMV, just providing ideas to think about.

[1] https://www.postgresql.org/docs/12/xfunc-sql.html


It sounds like you should research window functions: https://www.postgresql.org/docs/12/tutorial-window.html

Sensitivity analysis: let your lookback period become a variable itself, and then analyze/optimize your predictions with respect to that variable.

We use SAs all the time in my work. One example: we included only 2018 data in our SA, even though our full analysis included 2019 data...this let us see if trends changed year to year. Another example: we made 5 lookback variables (90 day, 180 day, 365 day, etc.), then analyzed which of these was most relevant to our analysis.
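
In SQL terms you can compute several candidate lookbacks side by side and compare each against the actual values, e.g. (table/column names assumed):

  SELECT player_id,
         date,
         AVG(val) OVER w3  AS avg_3d,
         AVG(val) OVER w5  AS avg_5d,
         AVG(val) OVER w10 AS avg_10d,
         val               AS actual   -- compare each lookback against the real value
  FROM stats
  WINDOW w3  AS (PARTITION BY player_id ORDER BY date ROWS BETWEEN 3  PRECEDING AND 1 PRECEDING),
         w5  AS (PARTITION BY player_id ORDER BY date ROWS BETWEEN 5  PRECEDING AND 1 PRECEDING),
         w10 AS (PARTITION BY player_id ORDER BY date ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING);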



