If a problem is too difficult to solve, then we solve simpler related problems first. In doing so, we typically gain insight into the more challenging problem. For example, before calculus was discovered, areas and volumes were all computed in ad hoc ways, via "parlor tricks". Only by generalizing the insights gleaned from some of these "tricks" did we stumble upon calculus.
Problems such as chess certainly did not seem like "parlor tricks" at the time they were proposed. In fact, many thought of chess play as an ideal example of human intelligence. Just because we do not understand how to solve Go today doesn't mean that it won't be a trivial "parlor trick" in ten years.
As an experienced developer, I was surprised by how much I enjoyed progressing through Wolfram's book for beginners. (I used Mathematica, not their online UI.) In particular:
+ Making instant web apps and APIs just by specifying two functions -- "front-end" and "back-end".
+ Easily creating impressive-looking programmatic 2D and 3D graphics and sounds.
+ The innovative idea of using visible graphics and colors as both arguments to and outputs of functions.
+ "Knowledgebase" of facts accessible using natural language. (Though, this was often slow due to server calls.)
My main critique of Mathematica is that it is the epitome of a kitchen sink library. There is an overwhelming number of functions for every possible thing. So, it is often not feasible to know what is offered without a lot of reading. (In fact, Wolfram even states this in his book.)
This is the real problem with their "knowledgebase". I would love to enter any natural-language query and get an answer, but in reality I need to check their list of supported areas and know the seemingly arbitrary list of facts about each area. To really make this useful, Wolfram has to figure out how to make it work without requiring me to be so knowledgeable about what it contains a priori.
Also, having such a large standard library makes the language very "flashy". It is indeed impressive and inspiring to create a sphere or musical piece in one line of code. However, real programs end up being much larger. In practical programs, Mathematica's functional syntax results in long unwieldy statements with forced formatting. This is occasionally impressive, but it is just as often hard to decipher.
I don't really get the complaints about a kitchen sink of predefined functions, or the functional syntax. You can always fall back on defining a function yourself. (And with the amazing documentation and Google, it's quite easy to find most functions anyway.) Likewise, you can always use C++-style syntax for defining functions, for loops, etc.
Mathematica has plenty of problems, like its extreme slowness for things that should be compiled, and awkward reference passing, but the ones you've pointed out don't really make sense to me.
This is true for lots of programs. Try installing TeXLive or a similar LaTeX distribution from a package manager. Usually it pulls in a *-docs package that takes up well more than half of the download.
To be fair, though, it is very easy to customise TeXLive so that it doesn't do this; whereas, as far as I know, Mathematica offers no such option. To be sure, you can delete it afterwards, but I'm not sure how gracefully Mathematica deals with that; whereas TeXLive, for example, gracefully deals with incremental upgrades from a very basic installation. (I know, because that's exactly how I run it on my tiny SSD. My current installation is well under 1 GB.)
Surely the graphics that they return should be returned, live, rather than passively occupying space even while unread? (I don't just mean 'should' in the sense of "that's the way it ought to be"; I thought, though I don't know for sure, that part of the appeal of the 'live' documentation is that that's the way it is.)
Some stuff requires internet connectivity, and some stuff takes a certain time to compute. If you had to wait a minute for everything to evaluate when you open a new documentation page, that would probably be a bad user experience. The appeal of live documentation to me is that I can modify the examples in place until they do what I need.
> I don't really get the complaints about a kitchen sink of predefined functions, or the functional syntax. You can always fall back on defining a function yourself. (And with the amazing documentation and Google, it's quite easy to find most functions anyway.) Likewise, you can always use C++-style syntax for defining functions, for loops, etc.
In the short term a kitchen sink of predefined functions is great, but in the long term it becomes baggage that holds a platform back (and there's no way to improve those functions without breaking backwards compatibility). It's usually a sign that the platform has its priorities wrong: you need a good dependency management solution, and once you have that there's not much value in putting things in the standard library. (Look at Scala, which is going to a lot of effort to carve things out of the standard library so that they can be maintained separately with their own release lifecycle.)
You've described a hypothetical way things could have gone wrong. But have they? I don't think so. The Mathematica developers have in fact been impressively foresightful. When it has been necessary to make changes that break legacy code they have done so, but it's been very rare.
I really think this is like an Apple vs. non-Apple situation. Sure, the Apple ecosystem is closed and not as flexible and you can think of all these problems that it could get stuck on. But in practice for the large majority of Apple's target audience, these concerns don't pan out. And the benefits of that system are large.
You could argue that standard libraries should be enormous and try to be everything to everyone. In fact, this is one of Mathematica's selling points -- they have many complex algorithms implemented.
However, this also means there is an enormous, aging codebase that Wolfram must continually maintain and extend. It means that coders from different communities (who use different sections of the documentation) likely have very different standard practices. And I am concerned that this philosophy of "always having to look everything up" really hinders their knowledgebase product due to lack of consistency.
However, I applaud Wolfram for his ambition and for enthusiastically sticking with it over so many years.
- You can think of error, loss, and cost functions as the same. In fact, two textbooks in front of me say that the loss function is a measure of error. If "loss" is a confusing word, think of it as the "information loss" of the model -- if your model is not perfect, you lose some of the information inherent in the data.
- There is no particular function used for error and loss. Different functions can be chosen based on the model, problem type, ease of theoretical analysis, etc. In practice, the final loss function is often experimentally determined by whatever yields the best accuracy.
- The perceptron uses a different loss function because it is a binary classifier, not a regressor. In this case, because there are only two classes (1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different. Then, the error function just sums these losses together. (Note this is quite similar to MSE.)
- RMSE is also valid -- adding the square root will not affect minimization. MSE is likely more common for minor reasons, such as slightly better efficiency and cleaner theoretical proofs.
In the case of these slides, the loss function is max(0, -xy) and the error function is the sum of these. So, the error function is the number of incorrectly classified examples (if x and y are different, it adds 1 to the error), which is exactly what we hope to minimize.
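A minimal sketch of the loss/error pair described above (assuming, as on the slides, that predictions and targets are encoded as +1 or -1; the function names are mine, not the slides'):

```python
def loss(pred, target):
    # Per-example loss max(0, -pred*target):
    # 0 if pred and target have the same sign, 1 if they differ.
    return max(0, -pred * target)

def error(preds, targets):
    # Error function: the sum of per-example losses, i.e. the
    # number of misclassified examples.
    return sum(loss(p, t) for p, t in zip(preds, targets))

# With all targets +1, only the single -1 prediction counts as an error.
print(error([1, -1, 1, 1], [1, 1, 1, 1]))  # → 1
```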
The transfer function is applied only at evaluation.
In the formulas of the slides (and in the code), for training I compute the loss of an example X and its expected target as: L(XW, target)
What you define is minimizing L(transfer(XW), target) which is not easily optimizable.
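To illustrate the distinction (a hypothetical sketch, not the slides' actual code): during training the loss is computed on the raw score XW, and the transfer function (here, sign) is applied only at evaluation time.

```python
def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def train_step(w, x, target, lr=0.1):
    score = dot(x, w)            # raw score XW -- no transfer applied
    if score * target <= 0:      # loss max(0, -score*target) is positive (or score is 0)
        w = [wi + lr * target * xi for wi, xi in zip(w, x)]
    return w

def evaluate(w, x):
    # Transfer function (sign) applied only at evaluation.
    return 1 if dot(x, w) > 0 else -1

# Two linearly separable points; the first component acts as a bias term.
data = [([1, 2], 1), ([1, -2], -1)]
w = [0.0, 0.0]
for _ in range(10):
    for x, t in data:
        w = train_step(w, x, t)
```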
In the case of perceptrons, point taken -- I agree. However, my original statement still holds. The loss and error functions presented on the slides are still valid. Whether or not they are easily optimizable, they are still examples of loss and error functions.
>> Is there REALLY a difference between a "method," a "procedure," and a "function" proper?
Yes. These words closely follow the evolution of programming paradigms.
Early in computing, code was organized loosely as "blocks of named code", or procedures. These often used ad hoc methods to receive inputs and produce outputs, e.g. by directly writing to various globals, registers, and memory regions. You can imagine that this would result in complex programs, since each procedure may have many intertwined dependencies.
To better organize code, a stricter procedure was developed -- the function. These blocks of named code declare their inputs, using them to produce an output. Although this was the ideal, in practice there still tended to be many undeclared dependencies via globals. Again, this led to hard-to-read, obfuscated code.
So, object-oriented programming introduced methods. These are procedures or functions that belong to a class. Meaning, in addition to their inputs, they can read and modify only that class's attributes (and call its other methods). Hence, the dependencies of the function beyond its inputs are much more precisely understood and organized. In theory, at least.
Data is typically not very conclusive by itself. So, people often make a number of behind-the-scenes decisions that make the conclusions seem stronger than they actually are (whether on purpose or not).
Not necessarily. For individuals who want to live close to work and rent condos/apartments it could work out well (e.g. young professionals just out of college). For example, West Hollywood is ranked as the most walkable city in California (by Walkscore), has a population density of ~19000 people/sqmi, is ~60% individual households, and ~80% of the housing is rented.
You should be following the 1 white path before the n black paths, not afterwards. In particular, following "3 blacks then 1 white" in the last step, when the final digit is three, is an error; it will produce the wrong modulus for any number that is not divisible by 7.
(1) 1 white then 1 black, ending up in state 1
(6) 1 white then 6 blacks, ending up in state 2
(0) 1 white then 0 blacks, ending up in state 6
(3) 1 white then 3 blacks, ending up in state 0
The graph keeps track of ((the number so far) mod 7); following a white arrow multiplies your total by 10, and following a black arrow increases your total by 1.
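The walk can be sketched in a few lines (following, per the correction above, the white arrow before the black arrows for each digit):

```python
def mod7(n):
    state = 0
    for digit in str(n):
        state = (state * 10) % 7          # white arrow: multiply the total by 10
        state = (state + int(digit)) % 7  # black arrows: one step per unit of the digit
    return state

print(mod7(1603))  # → 0, so 1603 is divisible by 7
```

Note that the initial white arrow is harmless here, since multiplying the starting total of 0 by 10 leaves it at 0.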
I can see why you'd want to keep the order of operations correct for readability purposes (especially if the states were labeled with digits), but I don't see where psyklic's method could produce an error for the divisibility test.
Your initial white doesn't change anything, since it brings you back to the starting node. Any whites after the final digit either keep you on the starting node or keep you elsewhere in the graph.
The graph isn't just a divisibility test, it reports the remainder after division by seven. I specifically acknowledged that this error won't mutate a "divisible by 7" result into "not divisible by 7" (because any number that is divisible by seven will still be divisible by seven after multiplying it by ten), but in every other case it will give you the wrong answer. (e.g. "remainder 2" will mutate into "remainder 6".)