> Statistics is "the practice or science of collecting and analyzing numerical data in large quantities"
That's not true.
Statistics can analyze data in small quantities; you can read up on it under nonparametric statistics. At least half of nonparametric statistics deals with small data (mostly through the use of ranking). With Bayesian stats you can just assign a prior distribution.
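To make the small-sample point concrete, here's a minimal sketch with SciPy (the tiny samples are made up purely for illustration): a rank-based test like Mann-Whitney U is perfectly happy with a handful of observations because it only uses the ranks.

    # Rank-based (nonparametric) test on two tiny, made-up samples.
    from scipy.stats import mannwhitneyu

    treatment = [12.1, 14.3, 11.8, 15.0, 13.2]   # n = 5
    control   = [10.2, 11.1,  9.8, 10.9, 11.5]   # n = 5

    # Mann-Whitney U compares the groups using only the ranks of the
    # observations, so no normality assumption is needed even at n = 5.
    stat, p_value = mannwhitneyu(treatment, control, alternative="two-sided")
    print(f"U = {stat}, p = {p_value:.3f}")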
I think the best way to describe statistics is that it uses data to infer things about the population: take a subset of data via sampling and infer a statistic (mean, median, whatever) about the population at large.
More often than not I'm just surprised as to why Data Science is so big when it seems like statistics already does every freaking thing with data: how to sample correctly, designing experiments to collect the right data to answer a hypothesis, making sure the data aren't biased, etc. It deals with the data from inception to end, whether you collect the data yourself or it's handed to you, on top of dealing with problematic data such as missing values, imputation, etc.
Seriously, Intel is already struggling after buying Nervana.

I went to their shindig and they were working their butts off to wow the developers who were invited. When I asked for hard numbers they went mum and got very evasive.

The timeline for the Nervana chip has always sat on some mystical horizon, never solidified into a real date, just over yonder.
Google is going to pull this off? They've got better software expertise than Intel, so they may be able to do it. But after that fiasco with Angular 1 to 2, I wouldn't trust Google with any early version number.
Nervana had a lot of other issues. It was trying to produce an ASIC with 50 employees. When they got acquired by Intel, the first thing they had to tackle was hiring the engineers necessary to actually produce an ASIC, which inevitably slowed things down, and then on top of that they got caught in the Intel 10nm bear trap.
Their power cords keep dying on my 2013 MacBook.
I had a temp iPhone and it took over a lot of my texts by redirecting them through Apple's servers first. When I got my Android back I received no text messages from my friends who have iPhones.
It also seems like they're making cheaper stuff so they can increase their margins via tech support and peripherals.

I agree they should invest more in software so they don't have to resort to these crappy tactics.

Also, if they want to increase their margins, they should make a cheaper iPhone and region-lock it to China and India only. Their current strategy for those markets is stupid.
You can harvest data and anonymize it. A friend of mine at Google is supposedly doing this kind of research there. I'm not going to say Google cares about our privacy as much as Apple, but they are researching this, probably for medical use and for when the opportunity arises. To be perfectly honest, I think Apple's stance on privacy is mostly marketing: there are cases where they refused to unlock phones for the FBI, but there are other cases where they shipped recordings of your conversations with Siri to a third-party company.
How you collect your data can introduce bias, and statistics has a whole range of topics on how to collect data without introducing bias, plus systematic techniques for sampling: stratifying when a certain group is underrepresented, random sampling, etc. Survey analysis goes into hardcore detail on how to sample a population so you can do accurate inference; it's an interesting statistics subfield if anybody is curious.
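As a minimal sketch of the stratification idea (pandas; the DataFrame and its "group" column are made up for illustration):

    # Proportional stratified sample with pandas on made-up data.
    import pandas as pd

    df = pd.DataFrame({
        "group": ["A"] * 900 + ["B"] * 100,   # group B is underrepresented
        "value": range(1000),
    })

    # Sample 10% from *each* stratum, so group B shows up in exactly the
    # same proportion as in the population instead of by luck of the draw.
    sample = (
        df.groupby("group", group_keys=False)
          .apply(lambda g: g.sample(frac=0.10, random_state=0))
    )
    print(sample["group"].value_counts())   # 90 A's, 10 B's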
ML tends to be more "here's the data, now do something with it."

Statistics encompasses everything about the data, including how to sample, how to collect the data, and how to design the experiment so you collect the right data to answer your hypothesis. Whereas ML is usually "here's the data, go figure out what you can get out of it."
I left the industry for data science. Went back to school and fell in love with statistics. Going to try to get a biostat job. I'm doing side-project web apps and not giving a damn about the Flavor Flav of the month.
I don't know if that's a good strategy... it's basically peacing out. I also moved most of my tech stack to plain old stuff: a simple MVC framework and no frontend rendering.
You can reduce it via PCA, one of the many techniques in multivariate statistics.
You can do ANOVA to select your predictors.
In general, you can work with a subset of the data using the tools that statistics provides (sketch below).
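A minimal sketch of both ideas with scikit-learn (synthetic data, just to show the shape of the workflow; the parameter choices are arbitrary):

    # PCA for dimensionality reduction and ANOVA F-tests for predictor
    # selection, on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=500, n_features=30,
                               n_informative=5, random_state=0)

    # PCA: project the 30 correlated features onto the components that
    # explain 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print("components kept:", pca.n_components_)

    # ANOVA F-test: score each original predictor against the response
    # and keep only the strongest ones.
    selector = SelectKBest(score_func=f_classif, k=5)
    X_selected = selector.fit_transform(X, y)
    print("selected feature indices:", selector.get_support(indices=True))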
Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly done statistical models and forest-based algorithms, and they're all reproducible.
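For what it's worth, here's roughly what I mean by reproducible (scikit-learn on synthetic data, a sketch only): pin the seed and the forest comes out identical on every run.

    # A forest-based model made reproducible by fixing the random seed
    # (in a real project you'd pin package versions too).
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=10, random_state=42)

    def train():
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X, y)
        return model.predict(X[:5])

    # Two independent training runs give identical predictions because the
    # bootstrap and feature subsampling are both seeded.
    print((train() == train()).all())   # True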
All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns
CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.
>> All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
The solution is to direct research effort towards learning algorithms that generalise well from few examples.
Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
>> You can reduce it via PCA, one of the many techniques in multivariate statistics.
PCA is a dimensionality reduction technique. It reduces the number of features used for learning. It doesn't do anything about the number of examples that are needed to guarantee good performance. The article is addressing the need for more examples, not more features.
>>>Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.
Btw, if you have relational data and a few good people with strong computer science backgrounds rather than statisticians or mathematicians, have a look at Inductive Logic Programming. ILP is a set of machine learning techniques that learn logic programs from examples and background knowledge, themselves expressed as logic programs. The sample efficiency is in a class of its own and it generalises robustly from very little data [1].
I study ILP algorithms for my PhD. My research group has recently developed a new technique, Meta-Interpretive Learning. Its canonical implementation is Metagol.
Please feel free to email me if you need more details. My address is in my profile.
___________________
[1] As a source of this claim I always quote this DeepMind paper where Metagol is compared to the authors' own system (which is itself an ILP system, but using a deep neural net):
> ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood, and verified. Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge.
You're absolutely right and I appreciate that very much. On the other hand, there's an incredible amount of hype around Big Data and deep learning, exactly because the large corporations are doing it. So now everyone wants to do it, whether they have the data for it or not, whether it really adds anything to their products or not.

As to the Big N (good one), what I meant to say is that I don't see them trying very hard to undo their own advantage by spending much effort developing machine learning techniques that rely on, well, little data. That would truly democratise machine learning, much more so than the release of their tools for free, etc. But then, if everyone could do machine learning as well as Google and Facebook et al, where would that leave them?
>CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.
Yes, it does. It's even implied in the name "limit": in the limit of infinitely many samples the sampling distribution of the mean approaches a normal distribution, and its standard error shrinks like sigma/sqrt(n), so each additional sample buys you less. That is diminishing returns.
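A quick simulation of that point (NumPy, purely illustrative; the exponential population is arbitrary): to halve the error you have to quadruple the data.

    # Standard error of the sample mean vs. sample size.
    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0   # sd of the Exponential(1) population

    for n in [25, 100, 400, 1600]:
        # Draw many samples of size n and look at the spread of their means.
        means = rng.exponential(scale=1.0, size=(5_000, n)).mean(axis=1)
        print(f"n={n:5d}  empirical SE={means.std():.4f}  "
              f"theory sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
    # Each 4x increase in n only halves the error: diminishing returns.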
>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
It's fine to point out problems without giving solutions. You seem very aggravated.
PCA has specific use cases. It's not a catch-all dimensionality reduction technique. You can't use it effectively, for example, if things are not linearly correlated. There are of course many tools for addressing many problems, but as the title states, this is often a grind. For any practical problem, excluding huge black-box neural nets where you don't need to understand the model, you are probably better off starting with a smaller set of reasonable-sounding features and then slowly growing your model out to incorporate others.
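A toy illustration of that limitation (scikit-learn, made-up data): points on a circle have one intrinsic degree of freedom, but since x and y aren't linearly correlated PCA finds nothing it can drop.

    # PCA on non-linearly related features: both components carry equal
    # variance, so no dimensionality reduction is possible here.
    import numpy as np
    from sklearn.decomposition import PCA

    theta = np.random.default_rng(0).uniform(0, 2 * np.pi, size=1000)
    X = np.column_stack([np.cos(theta), np.sin(theta)])

    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)   # roughly [0.5, 0.5]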
Also, if by forests you meant random forests... those aren't especially reproducible. Understanding what's going on is not always easy, and most people seem to misinterpret the idea of "variable importance" when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.
> Complaining about messy data... welcome to the real world.
I mean, that's the crux: if you have bad data you will have bad results. Data cleanup/transformation is key for anything (reporting, etc.), not just ML because it's sexy these days.
I did play around with Erlang a few times. I really love the comma and period; they really nest and group the function patterns together.
In Elixir you just put them near each other. The functions are enclosed in do/end blocks, and you put the functions with similar patterns next to each other. I believe you can move those pattern-matched functions around.
Other than that I can't recall much else. Prolog syntax is foreign, but I'll admit I can see myself learning Erlang pretty fast since the language isn't "big" compared to Java or Scala. Likewise Elixir is a small language too, with an added macro feature (I don't recall Erlang having this). Elixir also has better tooling baked in (mix/hex, docs, lint, etc.).
Also the Erlang community sucks. This is my personal experience, but I vividly recall attending a meetup. The group had a discussion about Erlang's adoption and how to improve it. I told them Ruby got popular because of RoR; otherwise people would have chosen Python and called it a day. Erlang should really have a killer framework, like a web framework. Everybody thought it was ridiculous.
And you know what? Elixir came along and so did Phoenix. Phoenix may not be the only thing that made Elixir popular; Elixir had tooling from the get-go. José and those people came from industry, they knew what was needed. They came from Ruby.
> Elixir have better syntax because it's familiar.
Which syntax is "better" depends on how you measure. I'm a big fan of Erlang's "minimalistic" syntax. It means I might have to write a bit more code, but it also means that the code, once written, is very explicit.
> Likewise Elixir is a small language too, with an added macro feature (I don't recall Erlang having this).
Well, Erlang does have simple macros (-define(MACRO(Var1, Var2), {some_expr, #{ var1 => Var1, var2 => Var2 }}).) but they are nowhere near as powerful as Elixir's system. There is merl (http://erlang.org/doc/man/merl.html) which allows for something quite similar, in particular if it's combined with parse transforms. I haven't used it and haven't seen usage in the wild, though.
> Elixir also has better tooling baked in (mix/hex, docs, lint, etc.).
That I'll give you: while rebar3 is already a huge step forward compared to what we had before, mix's out-of-the-box feature set is really nice.
> Also the Erlang community sucks. This is my personal experience, but I vividly recall attending a meetup. The group had a discussion about Erlang's adoption and how to improve it. I told them Ruby got popular because of RoR; otherwise people would have chosen Python and called it a day. Erlang should really have a killer framework, like a web framework. Everybody thought it was ridiculous.
This is a bit harsh. You can't derive "community sucks" from "the ones at this meetup didn't share my opinion". I don't share it either. I think the main thing holding Erlang back is the current out-of-the-box tooling (rebar3 is not included in the OTP distribution) and its general performance.
Intel is very greedy with their CPU prices. Their consumer CPUs never gave the option of ECC memory and have certain features gimped. AMD was more flexible. Having two companies competing will be great for consumers.
Strictly speaking, x86-64 patents must be about to expire anyway - the first Intel EM64T implementation (which can be assumed to be fully compatible even with modern-day software--IIRC the original AMD64 ISA had some unfortunate quirks which make this less likely) is from 2004. So, it's only a matter of time until we see compatible x86-64 CPUs from other vendors.
There are tons of patents regarding implementation details and instruction set extensions. I don't think you could sell Athlon 64 or Pentium D clones to anyone.
Not sure about the original Athlon 64, but Pentium D's can apparently run, e.g. Win10 for x86-64 and its associated software (barring ISA extensions, of course) - surely that's good enough for some users. Especially if you could reimplement the "clone" on a more recent physical node than an actual Pentium D.