Hacker News
A Visual Intro to NumPy and Data Representation (jalammar.github.io)
366 points by jalammar 7 months ago | 21 comments



Nice overview! One thing I think you should add, which I find immensely useful, is reordering arrays using indexing.

Take for example:

    In [2]: numpy.array([1, 2, 3])[[0, 2, 1]]
    Out[2]: array([1, 3, 2])

You index using a list and it gives you a view of the array with the new order (the underlying array is not changed and there is no copy being done).


Using "fancy" indices like this does result in a copy, because the result can't be represented as a simple slice of the original array. A good explanation is here (it's from 2008 but still true):

https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.ht...

You can verify there's a copy by changing the new array after putting the result in a new variable (see above link for why this makes a difference) and verifying the old one is unchanged:

    >>> import numpy as np
    >>> x = np.array([1, 2, 3])
    >>> y = x[[0, 2, 1]]
    >>> y[0] = 3
    >>> y
    array([3, 3, 2])
    >>> x
    array([1, 2, 3])

Edit:

But a view can be based on a slice that includes a skip (step) parameter, and in fact you can even slice in multiple dimensions and it will still be a view. That is worth discussing in the article:

    >>> x = np.array([np.arange(7), np.arange(7)+1]*3)
    >>> y = x[4:1:-2, 1:5:2]
    >>> y
    array([[1, 3],
           [1, 3]])
    >>> y[0,0] = 99
    >>> x
    array([[ 0,  1,  2,  3,  4,  5,  6],
           [ 1,  2,  3,  4,  5,  6,  7],
           [ 0,  1,  2,  3,  4,  5,  6],
           [ 1,  2,  3,  4,  5,  6,  7],
           [ 0, 99,  2,  3,  4,  5,  6],
           [ 1,  2,  3,  4,  5,  6,  7]])
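
For anyone wondering why a stepped, reversed, multi-dimensional slice can still be a view: a rough sketch, reusing the `x` and `y` from the example above, that inspects the `base` and `strides` attributes. The view just walks the same buffer with different (here negative and doubled) strides. Exact byte counts depend on the platform's default integer size, so the check below is written relative to `x.strides`.

```python
import numpy as np

x = np.array([np.arange(7), np.arange(7) + 1] * 3)
y = x[4:1:-2, 1:5:2]

# A view keeps a reference to the array that owns the memory.
print(y.base is x)  # True

# Strides are the byte steps taken per axis when walking the buffer.
row_stride, col_stride = x.strides
print(x.strides, y.strides)

# Row step of -2 and column step of 2 simply scale the original strides.
strides_match = y.strides == (-2 * row_stride, 2 * col_stride)
print(strides_match)  # True
```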


A related fun fact, when slicing several dimensions:

    >>> a = np.arange(9).reshape(3,3) # a matrix
    >>> a[0:3,0:3]          # ranges are treated independently
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> a[[0,1,2],[0,1,2]]  # but arrays are treated at once
    array([0, 4, 8])


A copy-on-write mechanism triggered by `y[0] = 3` would look the same and pass the test you devised, so you can't eliminate the possibility that it exists.

A better way would be to track memory use. A copy being created by either `y = x[[0, 2, 1]]` or `y[0] = 3` would show as a memory increase.
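
An alternative to tracking memory use (a different technique than the one suggested above): `np.shares_memory` asks directly whether two arrays overlap in memory. It can be called right after indexing, before anything is written, so a hypothetical copy-on-write step would have no chance to hide the answer. A minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3])

fancy = x[[0, 2, 1]]  # indexed with a list ("fancy" indexing)
sliced = x[::-1]      # indexed with a slice

# No write has happened yet, so this reflects what indexing itself did.
print(np.shares_memory(x, fancy))   # False: separate buffer, i.e. a copy
print(np.shares_memory(x, sliced))  # True: same buffer, i.e. a view
```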


As an aside, one of my major challenges grokking numpy and pandas is the semantically dense syntax like the above. I know that the layers of bracing have an impact but it's difficult for me to tell where it is applied and/or described.
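
One thing that may help with the bracket layers is pulling them apart into named steps. A sketch of how the expressions in this thread decompose (the variable names `order` and `a` are just for illustration):

```python
import numpy as np

x = np.array([1, 2, 3])

# Inner brackets: an ordinary Python list of positions.
order = [0, 2, 1]
# Outer brackets: the indexing operator applied to x.
reordered = x[order]  # same as x[[0, 2, 1]]
print(reordered)  # [1 3 2]

a = np.arange(9).reshape(3, 3)
# A comma inside the outer brackets separates axes: row 0, column 1.
print(a[0, 1])    # 1
# A list inside the outer brackets selects several rows: rows 0 and 1.
print(a[[0, 1]])
# [[0 1 2]
#  [3 4 5]]
```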


Pretty, but not particularly in-depth.

Also, a nitpick I can't hold in: why isn't the MSE np.mean(np.square(predictions - labels))? That's even breez-ier!


I think it's generally done this way because of the way the formula is represented mathematically.
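
For reference, the two spellings give the same number. A quick sketch, assuming the sum-based form follows the formula (1/n) * sum((prediction - label)^2) and using made-up values for `predictions` and `labels`:

```python
import numpy as np

# Illustrative values only.
predictions = np.array([1.0, 1.0, 1.0])
labels = np.array([1.0, 2.0, 3.0])
n = len(predictions)

# Mirrors the mathematical formula term by term.
sum_form = (1 / n) * np.sum(np.square(predictions - labels))
# Shorter spelling: the mean does the sum and the division in one call.
mean_form = np.mean(np.square(predictions - labels))

print(sum_form, mean_form)
print(np.isclose(sum_form, mean_form))  # True
```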


I like this. One change I would make is on the aggregation and indexing section, change the representation of single values (as opposed to single-element arrays) to not be in a coloured box. It's important that the result of these operations is a different type.
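
To make the type difference concrete, a small sketch comparing an aggregation result, a single indexed element, and a one-element slice (variable names are illustrative):

```python
import numpy as np

data = np.array([1, 2, 3])

biggest = data.max()   # aggregation: a NumPy scalar
single = data[1]       # integer index: also a NumPy scalar
one_elem = data[1:2]   # slice: still an array, with one element

print(isinstance(biggest, np.ndarray))   # False
print(isinstance(single, np.ndarray))    # False
print(isinstance(one_elem, np.ndarray))  # True
print(np.ndim(biggest), np.ndim(one_elem))  # 0 1
print(one_elem.shape)  # (1,)
```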


NumPy was a huge boon in college. I had mostly gotten my homework process down to editing a LaTeX file alongside the CSV files for my datasets; when I compiled, it would first crunch the numbers with NumPy, export the results as TeX, and then build a PDF.


Care to share an example?


I might still have something. I didn't version control it, but it might be on Dropbox still.
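
Not the commenter's script, just a guess at what the "crunch the numbers, export as TeX" step of such a workflow could look like. Everything here is hypothetical: the inline CSV stands in for a dataset file, and the table layout is one arbitrary choice. The resulting string could be written to a `.tex` file and pulled into the main document with `\input`.

```python
import io
import numpy as np

# Hypothetical stand-in for a homework dataset; a real run would pass a file path.
csv_text = "time,voltage\n0.0,1.0\n1.0,3.0\n2.0,5.0\n"
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

# One table row per CSV column: name, mean, sample standard deviation.
rows = []
for name in data.dtype.names:
    col = data[name]
    rows.append(f"{name} & {col.mean():.2f} & {col.std(ddof=1):.2f} \\\\")

table = "\n".join(
    [r"\begin{tabular}{lrr}", r"column & mean & std \\", r"\hline", *rows, r"\end{tabular}"]
)
print(table)
```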


This is excellent. I'd love to see even more on Pandas.


And now I see that you've already started one!

https://jalammar.github.io/gentle-visual-intro-to-data-analy...


It would be good to mention the @ operator in the matrix multiplication section.

https://alysivji.github.io/python-matrix-multiplication-oper...
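
For readers who haven't met it: `@` (Python 3.5+) is matrix multiplication, while `*` on NumPy arrays stays element-wise. A minimal sketch with two arbitrary 2x2 arrays:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

product = A @ B      # matrix product
elementwise = A * B  # element-by-element product, a different operation

print(product)
# [[19 22]
#  [43 50]]
print(elementwise)
# [[ 5 12]
#  [21 32]]

# For 2-D arrays, @ agrees with np.matmul and with the older dot method.
print(np.array_equal(product, np.matmul(A, B)))  # True
print(np.array_equal(product, A.dot(B)))         # True
```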


A warning sign that your faith in 0-based indexing may be faltering -- catching yourself writing comments like this :)

    # element at the top right. i.e. (1, 2) aka (0, 1) in python
    A[0, 0] * B[0, 1] + A[0, 1] * B[1, 1]


That's called the "Matlab Hangover"


Wow, this is so timely! I love the visual references. I'm still a little confused about the section on Matrix Indexing. Overall, great work!


Good stuff! I'll definitely look for more from you!


Nice page, but unless you have never used software for math before, I am not sure it's very useful.


Would be nice to have something like this, but for Julia.


There is this, though shorter: https://julia.guide/broadcasting



