It comes off a bit... dubious. The prohibited uses boil down to "don't use Instella to be a jerk or make porn" which I expect many people will do anyways and simply not disclose their use of Instella (which, of course, is prohibited).
This is the first RAIL license I've read, but if this is the standard, this reads less like a license and more like an unenforceable list of requests.
Edit: if I were to make an analogy, this license is a bit like if curl came with a clause that said "no web scraping or porn".
> In natural language processing (NLP) terms, this is known as report generation.
I'm happy to see some acknowledgement of the world before LLMs. This is an old problem, and one I (or my team, really) was working on at the time of DALL-E & ChatGPT's explosion. As the article indicated, we deemed 3.5 unacceptable for Q&A almost immediately, as the failure rate was too high for operational reporting in such a demanding industry (legal). We instead employed SQuAD and polished up the output with an LLM.
These new reasoning models that effectively retrofit Q&A capabilities (an extractive task) onto a generative model are impressive, but I can't help but think that it's putting the cart before the horse and will inevitably give diminishing returns in performance. Time will tell, I suppose.
I had to refresh before posting, because I wanted to see if someone else beat me to being that HN commenter but...
From the Terms of Service (emphasis mine):
6. Restrictions on Use
You agree not to:
Use the Service for any unlawful purpose.
Attempt to reverse-engineer, modify, or *create derivative works of the Service.*
Share, resell, or distribute downloadable data provided by the Service without explicit written permission.
Do you intend to delineate the data provided by the service from "the Service" itself? It seems most fair that data received via Fair Use remains in that arena, pun fully intended.
That aside, it's an intriguing dataset nonetheless, but I'd prefer to see a sample of the data before signing up.
IANAL but I am someone who deals heavily in 1) scraping and 2) data and the analysis, enrichment & brokerage thereof. As such, I like to consult this for anything regarding US Copyright law: https://www.copyright.gov/circs
Steamdb.info is a derivative work, yes. And scraping is usually accepted as Fair Use, so both services are presumably within their rights, but they have no claim to the underlying data, only their process of enrichment. If someone were to build a new service based on the data presented on either site, there's not much they could do to stop them... short of getting them to agree not to do so via their ToS.
OpenAI is a great example of a company who built a derivative work on scraped data available under Fair Use, and then subsequently gated their data via their ToS. With such a popular precedent at play, I'd rather not use any services doing anything similar, especially when steamdb.info doesn't even have a ToS.
Thank you. Does this still hold good if steamdb was making money (ads, for example)?
Also, I am wary of using big companies like OpenAI as precedent. Big companies can do whatever they want and get away with a lot of stuff that individuals and smaller companies can only dream of
Yes, within some limits, but if one were to set up a business like that, it's a very good idea to seek out a consultation from a local copyright lawyer to know exactly what one can and can't get away with. Datasets are addressed as a "collective work", which lumps them in with everything ranging from art books, to Hacker News, to scientific journals.
Personally, I wouldn't sell anything I gathered from a publicly available source anyways, mostly out of principle, but doubly so if that source is as well-paid as Valve.
> Personally, I wouldn't sell anything I gathered from a publicly available source anyways, mostly out of principle, but doubly so if that source is as well-paid as Valve.
Market reports are an entire industry, and people pay for them solely to avoid ingesting a tangential domain. It's ok to sell your transformations.
My advice is free, my custom tooling is dirt cheap with public examples, and my finished product costs money every month. It's basically price tiers based on your interest level.
I might be inclined to seek the raw data, should it be more cost effective than scraping Steam myself.
Being a user, free, paid, or anonymous, can still be under the thumb of their ToS, especially so if they force a dialog in front of you to agree to the ToS while signing up. I'm merely pointing out hurdles to the OP that may obstruct some of the people they are trying to reach.
What I don't understand is the difference between 'Download all CSV data' in the free tier and 'Download CSV data' / 'Download raw data' in the paid member tier. It seems that the free CSV data is likely an extract or digest of the raw data offered as a sample.
That's all true but matters significantly less "at scale". In the days of lean models, you needed to verify that your input parameters were functionally independent variables, meaning they couldn't correlate with other input parameters. When every document is transformed into a billions-long vector -- even if you took the not-insignificant amount of time it would take to compute a correlation matrix -- the heavy associations between a few features don't mean much, especially when you can just add more data. Plus, people misusing or repurposing words can introduce some interesting twists to features you'd assume are 1:1 on paper.
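The independence check I'm describing is cheap at toy scale but balloons with dimensionality. A minimal numpy sketch (the near-duplicate feature setup is invented purely for illustration):

```python
import numpy as np

# Toy data: 1000 "documents", 6 features, where feature 1 is a noisy
# near-duplicate of feature 0 -- the kind of dependence a lean-model
# workflow would flag before fitting.
rng = np.random.default_rng(0)
n_docs, n_features = 1000, 6
X = rng.normal(size=(n_docs, n_features))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=n_docs)

# Feature-by-feature Pearson correlation matrix.
corr = np.corrcoef(X, rowvar=False)
print(corr.shape)    # (6, 6)
print(corr[0, 1])    # close to 1.0: features 0 and 1 are strongly associated
```

With 6 features this is instant; with a billions-long embedding the same matrix is quadratic in the feature count, which is the cost I'm gesturing at above.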
Honestly, when looking back at the accomplishments of any administration, you'll likely find more things you agree with than disagree with. Presidents get a lot of ideas passing across their desks, generally posed by their subordinates. It's usually one big smear that overshadows it all (e.g. starting a never-ending war in the Middle East, implementing an extra-judicial drone assassination program, ravaging the scientific institutions that help to make this country a hub of innovation, etc.)
> There is no respectable statistician who would ever draw a conclusion from data.
This is outright pedantry, but I'm willing to give the benefit of the doubt. Scientific research is literally the process of drawing conclusions from data, after repeated measurement and control of independent variables.
No benefit of the doubt is needed, I made a true statement, albeit a subjective one about who I consider "respectable".
To me, statisticians analyze data and make objective statements about what they see. The process of scientific research is a different beast altogether. The objective statements about the data can only be used to inspire or support an argument. The correctness of the argument can only be recognized if the listener is convinced.
I'm not sure I follow your idea that correctness emerges from belief, but that could get epistemological quickly.
What I can address is that the role of a statistician is to interpret the statements made by data towards the objective (an unobtainable notion), not vice versa. I don't know any statistician, myself included, who isn't aware of the limitations of sampling, and thus the impossibility of dealing in objectivity -- hence why conclusions are often posited in speculative terms, such as probabilities or even hypotheses. Having the ability to reach those conclusions from the recorded observations of real phenomena is why statisticians tend to operate within subject matters.
Towards that end, I do agree that any statistician worth their salt would not take too much at face value on a two dimensional chart, and that this whole matter bears more investigation. But the idea that no respectable statisticians draw conclusions from data... I'm left to assume that you respect no statisticians.
Applied correctly: either every credible outlet with the capacity to monitor and detect election fraud, excepting this one, was busy during one of the most contentious races in modern US history, or nothing was missed and the convenient outlier can be dismissed as exactly that.
The better explanation is probably that people who were annoyed by having to listen to dubious claims of voter and election fraud over the past four years are using this thread to give others a taste of their own medicine.
I'm concerned by the turn of this thread. The claims of voter fraud in 2020 were tenuous claims based on mail-in ballots and other speculation. This is a claim based on statistical analysis -- something I'd think would click with the HN crowd, regardless of whether there's room for debate on the meaning of the stats.
The main consideration is at the beginning: the stats largely resemble the patterns of verified instances of voter fraud, as in Russia and Georgia.
It seems that you're suggesting some fairly obvious factors working against Harris weren't considered by an organization whose entire purpose is to sniff out voter fraud. Are you suggesting that they overlooked such an obvious detail, or that they're willfully ignoring it?
Or even conversations presented entirely in hex. Not only could that have occurred naturally in the wild (pre-2012 Internet shenanigans could get pretty goofy), it would be an elementary task to represent a portion of the training corpus in various encodings.
So the things I have seen in generative AI art lead me to believe there is more complexity than that.
Ask it to do a sci-fi scene inspired by Giger but in the style of Van Gogh. Pick 3 concepts and mash them together and see what it does. You get novel results. That is easy to understand because it is visual.
Language is harder to parse in that way. But I have asked for Haiku about cybersecurity, work place health and safety documents in Shakespearean sonnet style etc. Some of the results are amazing.
I think actual real creativity in art, as opposed to incremental change or combinations of existing ideas, is rare. Very rare. Look at style development in the history of art over time. A lot of standing on the shoulders of others. And I think science and reasoning are the same. And that's what we see in the LLMs, for language use.
There is plenty more complexity, but that emerges more from embedding, where the less superficial elements of information (such as syntactic dependencies) allow the model to home in on the higher-order logic of language.
e.g. when preparing the corpus, embedding documents and subsequently duplicating some as copies where the tokens are swapped with their hex representations could allow an LLM to learn to "speak hex", as well as intersperse hex with the other languages it "knows". We would see a bunch of encoded text, but the LLM would be generating based on the syntactic structure of the current context.
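The duplication step I have in mind could look something like this (a minimal sketch; `to_hex` and the naive whitespace tokenization are my own invented stand-ins, not any actual training pipeline):

```python
# Hypothetical corpus-prep step: for each document, emit a second copy
# in which every whitespace-delimited token is replaced by the hex
# encoding of its UTF-8 bytes. Pairing the two forms side by side is
# what would let a model associate hex spans with their plain text.
def to_hex(text: str) -> str:
    return " ".join(tok.encode("utf-8").hex() for tok in text.split())

corpus = ["hello world"]
augmented = corpus + [to_hex(doc) for doc in corpus]

print(augmented[1])  # "68656c6c6f 776f726c64"
```

The syntactic structure (token boundaries, word order) is preserved across both copies, which is the property doing the work in the argument above.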