I think the content and title also matter. I didn't know what Snowflake was until I read the blog, but the title's play on "... for fun and profit" got me to at least check it out.
Nice post! For your "How popular is investing in the community?" section, I would adjust that so it's not an absolute number. Right now you can't tell whether "investing" seems more popular because there are more users or stories overall, or because a higher proportion of the items submitted are about investing. For example, instead of the raw number of stories that month containing "investing", divide that number by the total number of stories for the month, which gives you the fraction of stories that mention the term.
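Roughly this kind of query, in Snowflake SQL — the table and column names here are my guesses, not the post's actual schema:

-- Hypothetical names (stories, title, submitted_at).
select date_trunc('month', submitted_at) as month,
       count_if(title ilike '%investing%') / count(*) as investing_fraction
from stories
group by month
order by month;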
As someone who never thought of using Snowflake for personal use, what was the cost of setting up and using Snowflake? Did you use Snowflake in AWS? FWIW: I am super new to the Snowflake ecosystem
This is a great question and something that I probably should have at least mentioned in the blog post. I'm currently using the free trial since I signed up for it around Jan 15th. I am using the standard plan on AWS (US East Ohio).
Snowflake charges $2 per credit in compute costs on their standard plan and $40/TB/month in storage. See https://www.snowflake.com/pricing/ for more details. Since the beginning of my trial, I have used 9 credits and 40GB of storage. So I have used about (9 * 2) + ((40 / 1000) * 40) = $19.60 worth of resources so far. This isn't terrible but I must caution you that Snowflake costs could balloon pretty fast. That's why I also set up a resource monitor on my account to stop all warehouses when the number of credits reaches 20. https://docs.snowflake.com/en/user-guide/resource-monitors.h...
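In case it helps, a minimal sketch of that resource monitor in SQL; the monitor name is my own placeholder, and the quota matches the 20-credit cap I mentioned:

-- Hypothetical monitor name. SUSPEND lets running queries finish;
-- SUSPEND_IMMEDIATE would cancel them instead.
create resource monitor credit_cap with credit_quota = 20
  triggers on 100 percent do suspend;

-- Attach it account-wide so it covers all warehouses.
alter account set resource_monitor = credit_cap;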
If you're going to go my route, I'd recommend setting up resource monitors and only using XSMALL warehouses, which consume 1 credit/hour. Snowflake sizes warehouses in t-shirt sizes, and the credits consumed double with each size up: XSMALL = 1 credit per hour, SMALL = 2, MEDIUM = 4, LARGE = 8, XLARGE = 16, 2XLARGE = 32, 3XLARGE = 64 and 4XLARGE = 128.
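A sketch of the kind of warehouse I mean (the name is a placeholder), with auto-suspend so it only burns credits while queries are actually running:

-- Hypothetical warehouse name; suspends after 60s idle and resumes
-- automatically when the next query arrives.
create warehouse analysis_wh with
  warehouse_size = 'XSMALL'
  auto_suspend = 60
  auto_resume = true;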
Ah, thanks for mentioning this. I looked at the blog post and I seem to have prefixed an extra 1 to the number of requests made by my Python script. I updated the post with the actual calculation of 402 requests per second, which is a lot less impressive but still much faster than my original approach on half the hardware.
Do you find the advice to reboot the VM after making a security change useful?
It will certainly work, but from my understanding it should be possible to change that setting and alter the user's login such that you only need to log out and back in. Maybe I am mistaken?
I try very hard to avoid reboots - it may be distracting for a focused article like this - so curious what others think.
My rule of thumb is to reboot instances after making changes, mainly so you can be confident that the machine will continue to work correctly after the reboot and that you'll still be able to log in.
This is based on past experiences where I've configured a VM, then a year later needed to make changes but been terrified of a reboot because I can't remember if it's likely to keep working or not!
Been there done that! If you do end up with a machine that's non-bootable or seriously impaired, it's a lot easier to figure out when it's "just" today's update in question and not a year's worth.
According to https://unix.stackexchange.com/a/108605 it looks like only logging out and back in was required for the changes to take effect. I didn't know this at the time, which is why I initiated a reboot.
I think it's a holdover from older times... and underscores a resiliency thing. Rebooting forces garbage collection, and being able to withstand a reboot with no loss of service builds geek cred. :D
FYI, that stock regexp is buggy and will, e.g., match $42.36, which obviously isn't a symbol.
Indeed, symbols are tough enough to nail correctly: people not including the $, symbols with periods (e.g. BRK.B), etc.
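For what it's worth, a minimally tightened pattern in Snowflake SQL (table/column names are my guesses) that at least stops dollar amounts from matching, since it requires letters after the $:

-- Hypothetical names. [A-Z] after the $ means amounts like $42.36
-- can't match; it still misses bare symbols and ones with periods.
select regexp_substr(title, '\\$[A-Z]{1,5}') as ticker
from stories
where regexp_substr(title, '\\$[A-Z]{1,5}') is not null;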
Offhand, I'd build a little neural net classifier (e.g. https://fasttext.cc/ ) and train it on a slew of example posts that are/aren't about stonks. To get training data, use regexps and then run through the matches by hand (20+ per minute = 1200+/hour), or outsource to Amazon Mturk at $0.25 per 10, incl. verification = $30/1200. Also, there are probably easy ones you can classify 100% correctly with regexps, to increase the training set size.
Regarding combining into a single file for loading: not necessary. You can have as many files in your bucket as you like. Just make them JSON Lines (one object per line, not an array). The COPY command will even skip files it has already loaded. (I use this for a couple automations every day.)
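A sketch of what I mean, with made-up stage/table names:

-- Hypothetical names (hn_stage, hn_items). The table has a single
-- VARIANT column; COPY tracks load history and skips files it has
-- already loaded, so re-running it is safe.
create table if not exists hn_items (v variant);

copy into hn_items
  from @hn_stage
  file_format = (type = 'json');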
Interesting read, thanks!
Regarding generating the missing ids (quoting the blog post: "If you know how one would generate the missing ids between the gaps (which could be of variable size)"):
I don't have access to a Snowflake instance, but the following works in Postgres (if I understood the problem correctly):
with lead as (
  -- pair each id with the next id in ascending order
  select id, lead(id, 1) over (order by id) as lead_by_one
  from gaps_table
),
gaps as (
  -- keep only the pairs with a hole between them
  select id + 1 as start_gap,
         lead_by_one - 1 as end_gap
  from lead
  where lead_by_one - id > 1
)
select generate_series(start_gap, end_gap) as missing_ids
from gaps
So if Snowflake has a generating function similar to generate_series, it should do the trick.
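I believe Snowflake can get the same result with its GENERATOR table function, though the idiomatic route there is an anti-join against a generated number line rather than a per-gap series. An untested sketch, since I don't have an instance either:

-- Untested: build a gap-free number line with GENERATOR + ROW_NUMBER
-- (the docs warn SEQ4 alone isn't guaranteed gap-free), then keep
-- the numbers that never appear in the table.
with numbers as (
  select row_number() over (order by seq4()) as n
  from table(generator(rowcount => 10000000))  -- must cover max(id)
)
select n as missing_id
from numbers
where n between (select min(id) from gaps_table)
            and (select max(id) from gaps_table)
  and n not in (select id from gaps_table);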
Thanks for sharing!
I've recently started using Snowflake - it definitely has some nice features (cloning, storage integrations, etc). I'm interested in what tooling you used to generate your visualizations. Were the charts generated directly from query output, or did you need to load your query results into something like seaborn?
Minor point, but when calculating the avg_diff_price for $GME you should be calculating a return, (close - open) / open; otherwise days where the stock went from 100 to 105 (a 5% increase) look the same as days when it went from 5 to 10 (a 100% increase).
Likewise, the correlations should be computed on returns, not prices.
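Concretely, something like this (table/column names are mine, not the post's):

-- Hypothetical names. Returns normalize by the starting price, so a
-- 5 -> 10 day no longer looks identical to a 100 -> 105 day.
select trading_day,
       (close_price - open_price) / open_price as daily_return
from gme_prices
order by trading_day;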
I'm not sure that's a given. They are different measures, but both completely valid. A return might be more meaningful to a long term investor, but ultimately every other metric and participant is interested in the absolute change: traders to calculate pnl/greeks/etc, exchanges to match orders, clearers to calculate margin.
If I am wearing 100 shares, whether it went from $5 to $10 or $100 to $105, it's the same pnl.
Unfortunately it doesn't have native support for code highlighting besides html/css/js. I ended up using prism.js https://prismjs.com/ to highlight my code. The graphs are just screenshots that I've uploaded as pictures.
Given you posted at your calculated time of 11:00 UTC on a Sunday, will be interesting to see how it does. It’s at 13 right now.