Namedtuple in a Post-Dataclasses World (andgravity.com)
113 points by genericlemon24 8 days ago | 88 comments





There is also a NamedTuple (notice the different casing) from the typing library, which doesn't seem to be mentioned in the article:

https://docs.python.org/3/library/typing.html#typing.NamedTu...

    class Employee(NamedTuple):
        name: str
        id: int
This is equivalent to:

    Employee = collections.namedtuple('Employee', ['name', 'id'])

I like that method of defining a named tuple much more. I understand why it's required, but having to duplicate the name of the named tuple always bothered me. The type hints are excellent to have too.
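
As a minimal sketch of that equivalence: both spellings produce tuple subclasses with the same fields, and only the typing version records annotations. `EmployeeNT` is a name made up here so the two definitions can coexist.

```python
from collections import namedtuple
from typing import NamedTuple


class Employee(NamedTuple):
    name: str
    id: int


# Hypothetical second name so both definitions can live side by side.
EmployeeNT = namedtuple("EmployeeNT", ["name", "id"])

typed = Employee("Guido", 1)
untyped = EmployeeNT("Guido", 1)

# Both are plain tuples underneath, so they compare equal element-wise.
print(typed == untyped)
print(Employee._fields, EmployeeNT._fields)
# Only the typing version carries the declared field types.
print(Employee.__annotations__)
```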

I don't know what you're talking about, I always confuse my coworkers by doing

    Pointe = namedtuple('Point'...

This breaks pickle, among other things.

It really shouldn't. This is actually an example of pickle being broken.

The class exists, there is a valid reference to it, Python itself knows about it, the class knows its own name, and namedtuple generates the right __getnewargs__. Everything is there for this to "just work", but pickle expects every class object to be reachable through a reference with the same name as the class, which is kinda weird when you think about it.

You can see it with this stupid little program.

    from collections import namedtuple
    import pickle

    def get_all_subclasses(cls):
        all_subclasses = []

        for subclass in cls.__subclasses__():
            if subclass != type:
                all_subclasses.append(subclass)
                all_subclasses.extend(get_all_subclasses(subclass))

        return all_subclasses

    E = namedtuple('EEEEEEEEEEEEEEEEEEE', 'x')
    e = E(x='hello')

    for cls in get_all_subclasses(object):
        print(cls)

    pickle.dumps(e)
You'll see that the class is there and has the right name! But pickle tries to look it up as an attribute of __main__.

> but pickle expects that every class object will have a reference to it of the same name

Yeah: its qualname. The qualname, as per PEP 3155, is defined as:

> a dotted path leading to the object from the module top-level

so of course pickle can't cope with this information being incorrect. (How would you make it work?)


There’s a difference between the class's __qualname__ and the requirement that there be a reference to the class at, say, __main__.$qualname as the way for pickle to actually find the class object.

This is the part that, to me, is really odd because pickle knows (in theory) the class and module name of the thing it needs to instantiate and the class objects themselves know their name and module.

Setting aside how inefficient it would be, you could just enumerate all class objects to find the one with the right name and module name.

You get nested classes for free this way and you break the requirement that there be references to class objects of the same name in the module.
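
A rough sketch of that enumeration idea, not something pickle actually does: scan every class reachable from `object` and match on `__module__` and `__name__`. The helper names `iter_subclasses` and `find_class` are made up for illustration, and nothing here handles two classes sharing the same name.

```python
from collections import namedtuple


def iter_subclasses(cls, seen=None):
    """Yield every class reachable from cls via __subclasses__()."""
    seen = set() if seen is None else seen
    for sub in cls.__subclasses__():
        # Metaclasses (type and its subclasses) need an argument for
        # __subclasses__(), so skip them, as the program above skips type.
        if issubclass(sub, type) or id(sub) in seen:
            continue
        seen.add(id(sub))
        yield sub
        yield from iter_subclasses(sub, seen)


def find_class(module, name):
    """Linear scan by (__module__, __name__): slow, but needs no module attribute."""
    for cls in iter_subclasses(object):
        if getattr(cls, "__module__", None) == module and cls.__name__ == name:
            return cls
    raise LookupError(f"no class {module}.{name}")


# The variable name E deliberately differs from the class name.
E = namedtuple("EEEEEEEEEEEEEEEEEEE", "x")
found = find_class(E.__module__, "EEEEEEEEEEEEEEEEEEE")
print(found is E)
```

The scan returns the first match, so it would be ambiguous wherever two live classes share a module and name, which is one reason a dotted path is the simpler contract.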


That's awesome and I was not previously aware of it. Literally a decade ago I was asking on SO about exactly this kind of use-case [1] and at the time the answers were pretty unsatisfying; it's great to hear that the story is better now.

[1]: https://stackoverflow.com/questions/4071765/in-python-how-do...


It is mentioned if you expand the "In case you've never used them, here's a comparison." element.

I don’t use Python much, but what’s the difference between a NamedTuple and a regular class?

a named tuple works exactly like a tuple, except you can also use names to get the items in it.

so it is immutable and you can get it via slicing.

  >>> from collections import namedtuple
  >>> Hat = namedtuple('Hat', ['style', 'size', 'color'])
  >>> my_hat = Hat('safari', 'XL', 'Orange')
  >>> my_hat
  Hat(style='safari', size='XL', color='Orange')
  >>> my_hat[0]
  'safari'
  >>> my_hat.color
  'Orange'
  >>> my_hat[1:]
  ('XL', 'Orange')
  >>> style, size, color = my_hat
  >>> size
  'XL'
  >>>

If that's all you're using your classes for, then a named tuple is probably a better solution, or a dataclass. Though I normally just use dicts in that situation. If I see someone create a class without any methods, or at least planned methods, I don't let it through code review.

EDIT: Also, Raymond Hettinger created named tuples. I'm not normally one for appeals to authority, or hero worship, but I am a huge fan of his. I recommend that anyone interested in Python watch as many of his talks as they can.

EDIT2: As masklinn pointed out, another really good use of named tuples is when you're already returning a tuple, and you realize it would be better if it had names. You could change it to a named tuple without breaking any of the existing code. Unless they're doing something dumb like halfassing type checking at runtime. (this use case is in the article, which i didn't read at first)


Well in and of itself none, in the sense that anything a namedtuple can do you could do by hand (it really just defines a class). However namedtuple:

* extends tuples, so a namedtuple is literally a tuple (which is useful)

* sets up a bunch of properties for the "named fields", which are basically just names on the tuple elements

* sets up a few other utility methods e.g. nice formatting, `_make`, `_asdict`, `_replace`

Now the latter two are nice, and mostly replicated by dataclasses (or attrs). The first one is the raison d'être of namedtuples though: originally their purpose was to "upgrade" tuple return values into richer / clearer types, e.g. urlparse originally returned a 6-tuple, which is not necessarily super wieldy / clear: you can probably infer that the 3rd element is the path but… after upgrading to namedtuple it's just `result.path`, which is usually much clearer.

And because namedtuples are still classes in and of themselves, you can inherit from them to create a class with a `__dict__` with relative ease.
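
To make the urlparse case concrete, here is a small sketch (the URL is made up): the result still behaves as a 6-tuple for old callers, while newer code can use the field names.

```python
from urllib.parse import urlparse

result = urlparse("https://example.com/some/path?q=1")

# Old-style callers can keep indexing or unpacking the 6-tuple...
scheme, netloc, path, params, query, fragment = result
print(result[2])       # /some/path

# ...while newer code reads the same element by name.
print(result.path)     # /some/path
print(isinstance(result, tuple))  # True
```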


NamedTuple has the features of a tuple, for example it is immutable. A regular class is mutable.

I feel like none of the sibling answers actually answer your question; the answer is "absolutely nothing." The function namedtuple is a code generator that constructs a class definition and then exec()'s it.

The reason you reach for it is because it's tedious to write the same methods over and over to get things like a nice repr, methods to convert to and from dicts, or pickling support.

The source from Python 3.6 is much more readable than 3.9 so I recommend reading that if you want to see how it works.

https://github.com/python/cpython/blob/3.6/Lib/collections/_...


NamedTuple or namedtuple instances are tuple instances that have the same properties that regular tuples have. They are immutable (you cannot reassign their fields), you can index into them (a[0], a[1] instead of a.x and a.y), you can unpack them with *a. They can have methods like regular classes can, including additional @property methods. A NamedTuple class cannot inherit from another class, not even other named tuples.
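
A short sketch of those properties, using a made-up `Vec` class: indexing, unpacking, an extra `@property`, and the error raised on reassignment.

```python
from typing import NamedTuple


class Vec(NamedTuple):
    x: float
    y: float

    @property
    def norm(self) -> float:
        # Extra methods and properties are allowed in the class body.
        return (self.x ** 2 + self.y ** 2) ** 0.5


v = Vec(3.0, 4.0)
print(v[0], v.y)   # 3.0 4.0
print(v.norm)      # 5.0
x, y = v           # unpacks like any tuple

try:
    v.x = 10.0     # fields cannot be reassigned
except AttributeError as exc:
    print("immutable:", exc)
```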

NamedTuple is purely a data container. It does not have class functions or a constructor you can use for anything other than setting the data members.

I think it was a pre-organized immutable data class in one line.

Along this line of reasoning I learned to love Pydantic which can make it a breeze to parse and coerce environment variables to the correct types: https://pydantic-docs.helpmanual.io/usage/settings/

You can make an env.py and override module __getattr__ and then import environment variables just like they're regular Python objects (even booleans, floats, collections, etc, despite .env files being string only).

Huge force multiplier for ML cuz then you can do hyperparameter optimization just by passing different environment variables in an outer loop (even inside your infrastructure as code)

Edit: you can make these classes immutable too: https://pydantic-docs.helpmanual.io/usage/models/#faux-immut...
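
A stdlib-only sketch of the env.py idea, with no pydantic involved: a module-level `__getattr__` (PEP 562) that reads `os.environ` and coerces values by trying to parse them as JSON. The coercion rule and the variable names are assumptions for illustration; pydantic's settings classes do proper type-driven validation instead.

```python
import json
import os
import types


def _env_getattr(name):
    """Look up `name` in the environment and coerce it via JSON parsing."""
    try:
        raw = os.environ[name]
    except KeyError:
        raise AttributeError(f"environment variable {name!r} is not set") from None
    try:
        return json.loads(raw)   # "0.01" -> 0.01, "true" -> True, "[1, 2]" -> [1, 2]
    except json.JSONDecodeError:
        return raw               # anything else stays a plain string


# In a real env.py you would simply name the function __getattr__ at module
# level (PEP 562); a throwaway module object keeps this sketch in one file.
env = types.ModuleType("env")
env.__getattr__ = _env_getattr

os.environ["LEARNING_RATE"] = "0.01"   # hypothetical variable names
os.environ["USE_GPU"] = "true"
print(env.LEARNING_RATE, env.USE_GPU)
```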


You can also create CLI tools that can load partial or full "presets" defined in JSON.

https://github.com/mpkocher/pydantic-cli


Pydantic is an incredible lib. I use it for so many different things on top of misc parsing.

A little unrelated, but this brings up a question I've had for a while.

Seems like one day, everyone around me was using dataclasses. I had not even heard of them. It felt like I had missed some memo or newsletter. It felt weird.

Here's my question: what should I have been reading / where should I have been "hanging out" online, so that I would have known that dataclasses were a thing? What are your go-tos for news about new language features, libraries that everyone is using, etc?

Hacker news is great, but it doesn't quite fill that need for me, it seems.


For Python, you pretty much just need to be aware of when the new major version is released because the "what's new" pages are pretty good. Here's the one in which data classes were released: https://docs.python.org/3/whatsnew/3.7.html

Concretely, I found out about dataclasses by using [pydantic](https://pydantic-docs.helpmanual.io/) and seeing their drop-in `@dataclass` annotation - it got me curious about the adjacent stdlib class. I was using pydantic because I started using FastAPI to build a REST interface, which has pydantic deeply integrated.

Generally, I find out about new features through PEP posts, and I reach those by seeing a keyword that I don't know in random code I read online


I follow the language specific sub-reddits, and I read release notes for major releases of languages (so for python that would be 3.X) even if I wasn't going to jump to the version set.

> I read release notes for major releases of languages

_This_, so much.

If you are a heavy user of a language / library, it's immensely helpful to look at the release notes every once in a while. Even if you don't plan to upgrade now, it gives you an idea of where things are going (and may eventually tip the scales to a "fuck it, it's now worth upgrading" moment).

For Python specifically, PEPs are also a good way to keep track of what's happening (even if some of them don't get accepted): https://www.python.org/dev/peps/ ; there's also an RSS feed: https://www.python.org/dev/peps/peps.rss/


I found out about data classes on hn, before they were in the standard library. I also regularly search for python to see what stories I missed.

I also like to keep up to date with the PyCon videos, as well as some of the other python conferences. But, as others have said, the release notes are the main source for what's new, if a bit dry.

That said, I never actually use data classes. I normally just use dicts, and occasionally named tuples.


FWIW, here is a PyCon video to get you up to speed on dataclasses:

https://www.youtube.com/watch?v=T-TwcmT6Rcw


A couple good newsletters are Python Weekly and PyCoder's Weekly. They each put out a mix of news, articles/tutorials, and interesting projects.

https://www.pythonweekly.com/ https://pycoders.com/


I read the release notes. I often see posts for releases here on HN or on reddit, but often I will check in on the official repos or websites to see what's new.

I like to spend a few hours a week reading up on what's happening, or try something new to keep up. Checking out new language features is part of that process for me.


Python-centric forums, like r/Python:

https://www.reddit.com/r/Python/

I think the RealPython site is excellent for learning, even for mid to advanced users:

https://realpython.com/

They also have a great podcast:

https://realpython.com/podcasts/rpp/

Also just browsing the Python docs and release notes.


r/Python is 90% newbies showing off toy projects. It's not great for news.

Already duplicated in my reply to icegreentea2 below, but release notes for projects you use are a great place to get updates.

For Python specifically, PEPs are very helpful too: https://www.python.org/dev/peps/


In the past, Freenode #python, now Libera #python. Also the Python Discord server.

Dataclasses and Enums have, since their introduction, taken over as my foundation of Python data structures. They've obsoleted NamedTuple, namedtuple, and traditional classes in my code.

They're a clean way of defining how different functions, methods etc should interact with each other.


What do you mean by 'traditional classes'? Aren't dataclasses more for data-only structures with no methods?

As in, classes that aren't tagged as `dataclass` or `Enum`. You can have methods with dataclasses.
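
For instance, a minimal sketch with made-up names (`Status`, `Employee`, `retire`) showing an Enum field and an ordinary method on a dataclass:

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    RETIRED = "retired"


@dataclass
class Employee:
    name: str
    status: Status = Status.ACTIVE

    def retire(self) -> None:
        # An ordinary method; @dataclass just generates boilerplate such as
        # __init__, __repr__ and __eq__ around it.
        self.status = Status.RETIRED


e = Employee("Ada")
e.retire()
print(e)
```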

Y'all may be interested in a fast dataclass-like library I maintain called msgspec (https://jcristharif.com/msgspec/) that provides many of the benefits of dataclasses (mutable, type declarations), but with speedy performance. The objects are mainly meant to be used for (de)serialization (currently only msgpack is supported, but JSON support is in the works), with native type validation (think a faster pydantic).

Mirroring the author's initialization benchmark:

    In [1]: import msgspec

    In [2]: from typing import NamedTuple

    In [3]: class Point(msgspec.Struct):
    ...:     x: int
    ...:     y: int
    ...: 

    In [4]: class PointNT(NamedTuple):
    ...:     x: int
    ...:     y: int
    ...: 

    In [5]: %timeit Point(1, 2)
    48.4 ns ± 0.195 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    In [6]: %timeit PointNT(1, 2)
    185 ns ± 0.851 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Is this faster than Attrs as well?

Yes, but it's also less flexible. Tradeoffs.

> Point = namedtuple('Point', 'x y')

I must come from a different world. What is going on here?

Are you dynamically creating a named tuple (a record?) by passing a space separated list of field names? Why?


After your example, `Point` is a class. You are not creating a record, you’re creating the class that defines the schema of fields in any record. A class can be defined without using the `class foo:` syntax, and in this case the class is returned by the function `namedtuple` to produce an equivalent effect.

As for the unusual space-delimited syntax, the missing context here is that namedtuple is a very, very old part of Python that predates the conventions now considered good style. Using space delimiters for lists of strings is a common idiom in Perl scripting due to the `qw()` quote syntax. Note the archetypical context where namedtuple was imagined to apply (record-oriented processing of logs and SQL result sets) was commonly handled using Perl before Python became dominant.

Namedtuple is definitely the most prominent example of this syntax convention in Python, but other libraries use it too. `enum.Enum` supports a function-like interface directly modeled off namedtuple. It’s a mildly bad idea to keep using it IMHO because it complicates static analysis or refactoring. If someone does a wide search-and-replace of a field name literal it’s easy to miss this edge case.
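
The `enum.Enum` functional interface mentioned above looks like this in a minimal sketch (`Color` and its members are example names):

```python
from enum import Enum

# Functional API: member names passed as one whitespace-separated string.
Color = Enum("Color", "RED GREEN BLUE")

print(list(Color))
print(Color.RED.value)   # values are auto-assigned starting at 1
```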


> a very, very old part of Python that predates the conventions now considered good style. Using space delimiters for lists of strings is a common idiom in Perl scripting due to the `qw()` quote syntax.

Note that the Smalltalk-80 class creation method also expects a space-separated string of instance variable names, and that's an environment considerably older than Perl.


I always wondered why is that a space separated string, when a list can work as well. The docs are not well written on that one. This works:

    Point = namedtuple('Point', ['x', 'y'])

The function was defined to either take a space separated list of names or a sequence of names. The docs seem pretty clear to me:

> The field_names are a sequence of strings such as ['x', 'y']. Alternatively, field_names can be a single string with each fieldname separated by whitespace and/or commas, for example 'x y' or 'x, y'.
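
A quick sketch confirming that the three spellings from the docs produce the same fields (the variable names are made up):

```python
from collections import namedtuple

from_list = namedtuple("Point", ["x", "y"])
from_spaces = namedtuple("Point", "x y")
from_commas = namedtuple("Point", "x, y")

# All three classes end up with identical field names.
print(from_list._fields, from_spaces._fields, from_commas._fields)
```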


It's a class factory function, so right off the bat it's a bit weird. The original intent of using spaces was probably to minimize typing. Since they're attribute names they can't have spaces in them, so it's a safe delimiter.

You could imagine the function dynamically creates the class by manipulating the underlying dictionary (or whatever the "slot" alternative uses). At that level of python, attributes are strings anyway. Handling spaces is just a matter of calling .split().

In modern python, there's a whole metaclass system that would possibly let you do the equivalent without getting your hands dirty with internal data structures.


I envy coders who can actually save time by using a space as a delimiter instead of ['x', 'y']. I really have no use for such syntactic sugar.

Yeah, I think more time is wasted in confusion and arguing about style than is saved in keyboard strokes.

There's definitely a class of persnickety coders out there though. As a technical leader within a growing organization, sometimes the bulk of my time spent in a code review turns into style guide enforcement. It can get old arguing about the subtle merits of someone's preferred but style violating syntax over and over, especially when all I care about is maintaining a standard of consistency.


syntactic sugar actually drives me nuts because it makes code harder to read for non experts

This kind of thing grates on me: one thing I love about Python is that there is usually only one way of doing everything.

Using a list or tuple for the fields is generally best:

    Point = namedtuple('Point', ('x', 'y'))
Support for space- and/or comma-separated strings was requested by users. It made life easier for them when syncing with other space/comma separated strings. For example, an SQL query, "SELECT name, rank, serial_number FROM Soldiers;" would have a corresponding named tuple where the field names could be cut-and-pasted from the SQL query.

    Soldier = namedtuple('Soldier', 'name, rank, serial_number')
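
A sketch of that workflow with an in-memory SQLite database, assuming a made-up `Soldiers` table and row; the same column string feeds both the query and the namedtuple.

```python
import sqlite3
from collections import namedtuple

columns = "name, rank, serial_number"   # shared by the query and the tuple
Soldier = namedtuple("Soldier", columns)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Soldiers (name TEXT, rank TEXT, serial_number INTEGER)")
conn.execute("INSERT INTO Soldiers VALUES ('Ryan', 'Private', 12345)")  # made-up row

# Each result row is a plain tuple; _make turns it into a Soldier.
rows = conn.execute(f"SELECT {columns} FROM Soldiers")
soldiers = [Soldier._make(row) for row in rows]
print(soldiers[0].rank)
conn.close()
```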

I think it comes down to the idea that going out of your way to make the library work either way makes it easier for people to use, even if it makes the library itself a bit more complicated.

I wish more library devs would go out of their way to add such niceties.

A big one that I always do is if I'm expecting an iterator of objects I make it just work with one.

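  from collections.abc import Iterable

  def my_function(arg):
      # Wrap a lone object so the loop below always sees a collection.
      # Slightly different if you're looking for a collection of strings
      # or bytes, since those are themselves Iterable.
      if not isinstance(arg, Iterable):
          arg = [arg]

      for item in arg:
          ...  # do the thing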

Or if you have a specific type of object you want it goes like this

  def my_function(arg):
      # MyObjectIWant stands in for whatever concrete type you accept.
      if isinstance(arg, MyObjectIWant):
          arg = [arg]

      for item in arg:
          ...  # do the thing

I like to think of my libraries as mini programs for users, and I hate when validation is too strict, when it could be so easy to fix. Like when a phone number validator insists on (XXX)XXX-XXXX or XXX.XXX.XXXX or XXXXXXXXXX when it could just ignore everything that isn't a number and make sure there are 10 of them.
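
The lenient phone validator described above could be sketched like this; `normalize_phone` is a hypothetical helper and the ten-digit rule is taken straight from the comment.

```python
def normalize_phone(raw: str) -> str:
    """Keep only the ASCII digits and require exactly ten of them."""
    digits = "".join(ch for ch in raw if "0" <= ch <= "9")
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}: {raw!r}")
    return digits


# All three formats normalize to the same string.
for text in ["(555)123-4567", "555.123.4567", "5551234567"]:
    print(normalize_phone(text))
```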

This sounds like a nice idea in theory, and makes a lot of sense for polished, publicly visible libraries where convenience trumps simplicity, but the edge cases can lead to confusing failures and bloat otherwise simple code — as you noted, your example code appears to work for arbitrary objects but actually fails for `str` or `bytes`.
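
One common way to handle the `str`/`bytes` edge case, sketched with a hypothetical `as_items` helper: treat strings and bytes as single values before falling back to the `Iterable` check.

```python
from collections.abc import Iterable


def as_items(arg):
    """Return arg as a list, treating str and bytes as single values."""
    if isinstance(arg, (str, bytes)) or not isinstance(arg, Iterable):
        return [arg]
    return list(arg)


print(as_items("abc"))      # ['abc'], not ['a', 'b', 'c']
print(as_items(42))         # [42]
print(as_items((1, 2)))     # [1, 2]
```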

A great case study in the issues here is Pandas, which routinely allows arguments to be columns, lists of columns, string column labels, lists of string column labels, and so on. It works surprisingly well, but at the cost of inventing a new semantic distinction between `list` objects and other sequence types like `tuple` — someone unfamiliar with Pandas who thinks “Why does this need to be a list comprehension when a generator expression will do?” is likely introducing a bug.

Another subtle issue is that code permissive with inputs is harder to extend via wrapper code. Suppose you have a function that does some sort of processing for any number of given datetimes, but also accepts integer seconds since 1970-01-01, a formatted date string, or any mixed sequence of these types. If you need to write a wrapper that first rounds all times to the most recent hour, your task is much easier if the only accepted type is `Iterable[datetime]`.
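
To illustrate the wrapper point, a sketch under the strict `Iterable[datetime]` contract; `process` is a stand-in for the real function and `process_hourly` is the hypothetical wrapper.

```python
from datetime import datetime
from typing import Iterable, List


def process(times: Iterable[datetime]) -> List[str]:
    # Stand-in for the real processing function.
    return [t.isoformat() for t in times]


def process_hourly(times: Iterable[datetime]) -> List[str]:
    # With a single accepted type, the wrapper is one uniform transformation.
    floored = (t.replace(minute=0, second=0, microsecond=0) for t in times)
    return process(floored)


print(process_hourly([datetime(2021, 9, 1, 14, 37, 5)]))
```

If `process` also took integers, strings, and mixed sequences, the wrapper would need to replicate all of that dispatch before it could round anything.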


I’d speculate it’s meant to mimic Perl’s `qw()` operator, which is like `str.split()` in Python. The module was originally written for contexts where you’re processing SQL result sets with fixed schemas, and before Python these tasks were traditionally handled in Perl. Python inherits a lot of these loose traditions. Similarly, some parts of the standard-lib (`sys`, `os`) follow shell- or C-like naming conventions that would seem bizarre to someone who’s never used a shell prompt.

> I always wondered why is that a space separated string, when a list can work as well.

Saves a bunch of typing. 5 chars:

  'x y'
vs 10:

  ['x', 'y']

With a non-trivial example that uses readable attribute names, a single, long, space-delimited string becomes more a burden than a convenience, I think. Also, the amount of time saved typing is minuscule in comparison to all the rest of the development work that'll happen.

> With a non-trivial example that uses readable attribute names, a single, long, space-delimited string becomes more a burden than a convenience, I think.

For a suitable definition of “non-trivial” and “readable” (where the former is “long list of attributes” and the latter is “long attribute names”), I’d agree, but plenty of real, serious namedtuple use is for namedtuples with small numbers of short attribute names, and those are more readable (in the literal sense) this way.

OTOH, for the nontrivial, static uses, you probably want to skip right past namedtuple() with a static list of string literals for names to typing.NamedTuple with its dataclass-like syntax, including type hints, since it's more readable and also supports typing.

> Also, the amount of time saved typing is minuscule in comparison to all the rest of the development work that'll happen.

Sure, but if you start passing on providing (or, on the other side, using) conveniences because each is small in isolation, the aggregate cost ends up being high.


Fair points!

Yeah, I exclusively use the typing.NamedTuple declaration these days, because it's:

  - less redundant
  - easy to add a per-attribute comment if needed
  - great when encapsulating disparate data to be able to have concrete types listed

Also keep in mind interactive use.

The space-based approach is nicer when working in a repl, even though I probably wouldn't use it.


but you now have to explain it to everyone every time

They are dynamically creating a named tuple class (or prototype). The namedtuple implementation in the python standard library indeed accepts a space separated list of field names. Once it's been defined (as above) then you can create instances (records) like "Point(2,3)"

> Are you dynamically creating a named tuple (a record?)

No, it's creating a namedtuple type, aka a subclass of `tuple`. So the fieldnames are literally the names given to the tuple's items: Point is a pair (a two-uple) whose 0th item can be accessed as `x` and 1st item `y`.


> Are you dynamically creating a named tuple (a record?) by passing a space separated list of field names? Why?

Very little in python is bound statically. This is akin to a type definition. The type will behave as an ordered tuple that can be indexed but also alias these attribute names to those ordinals.

    assert(a == Point(a, b).x)
    assert(b == Point(a, b).y)
    assert(a == Point(a, b)[0])

> Are you dynamically creating a named tuple (a record?) by passing a space separated list of field names? Why?

I believe that's called "procedural record interface" in Scheme and it does have its uses, for example if you need to create records for data the structure of which you don't know in advance.


It's effectively a 'macro' for metaprogramming.

Speaking of namedtuple, I would encourage anybody who uses Python and wants to learn a thing or two to read the source code for them. At least one of the things you learn should probably fall in the "what not to do" category. There's a lot going on in there to support all that magic you see from the outside, and it's a little scary in there.

This*1000.

I have been spending an embarrassing amount of time trying to merge a one-off version of dataclasses with dataclasses to pick up the codegen based on type hints. This stuff is nasty and subtle under the covers. I would rather be dealing with a real closure or with real compositional capabilities. Dataclasses are gross in the weeds.

For example, y = dataclass(x) mutates its argument. That is, y is x.
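
A small sketch of that behaviour with default arguments (`X` and `Y` are throwaway names); passing options such as `slots=True` on newer Pythons is a different case, since a new class has to be built.

```python
from dataclasses import dataclass


class X:
    a: int
    b: int = 0


Y = dataclass(X)

# The decorator returns the very class it was given, modified in place.
print(Y is X)
# X itself now has the generated __init__, __repr__ and __eq__.
print(X(1))
```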


In practice the hashtable implementation in Python is so fast, particularly for cases like

  {"x": 51.2, "y": -74.1 }
 
that you don't gain a lot from namedtuples most of the time. I quit benching namedtuples for applications like that a long time ago.

The main problem of dicts is that they're really heavy memory-wise, even with key-shared dicts and stuff.

Your dict is 232 bytes, the equivalent tuple is 56.


I thought I'd add to this (because I was surprised) that the equivalent namedtuple is also 56 bytes (I expected it to be larger) and the equivalent dataclass is a mere 48 bytes. (although there's overhead for defining a namedtuple or a dataclass, on the order of a constant 1kb).

Although there seems to be some kind of trickery happening there, because if I make the class accept 3 floats instead of two, neither the class nor the instance get larger.


> and the equivalent dataclass is a mere 48 bytes

> Although there seems to be some kind of trickery happening there, because if I make the class accept 3 floats instead of two, neither the class nor the instance get larger.

The dataclass stores stuff in a normal instance __dict__ and sys.getsizeof is not recursive.

namedtuples are variable-size instances, they store the attributes “inline”.


How much does automatic interning of string literals help out here?

sys.getsizeof is not recursive so that's just the size of the collection itself, excluding the stuff it references (so both keys and values are additional, here the strings are literal so they're interned and part of the program constants).
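
A sketch of both points without quoting byte counts, which vary by Python version and platform: the reported size of a container ignores what it references, while tuple-based instances grow with the number of fields. `P2` and `P3` are made-up names.

```python
import sys
from collections import namedtuple

# The container's reported size ignores how large its elements are.
small = ("",)
large = ("x" * 1_000_000,)
print(sys.getsizeof(small) == sys.getsizeof(large))   # True

# Tuples (and namedtuples) store their item pointers inline, so they grow per field.
P2 = namedtuple("P2", "x y")
P3 = namedtuple("P3", "x y z")
print(sys.getsizeof(P2(1.0, 2.0)) < sys.getsizeof(P3(1.0, 2.0, 3.0)))   # True
```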

It doesn't. That's the size assuming interning of string literals.

Glurk! Thanks.

a.x is far easier to both read and write than a["x"] (67% is operator syntax compared to 33% in a.x).

Not to mention that the first case gets assistance from autocomplete in an IDE or ipython session. That speeds up typing long, descriptive names so much.

this and () => {} syntax have always been moderate to strong wins in my book for js when creative-coding

Immutability is nice though.

Also it's nice for bad references to trigger AttributeError which is almost always a design error, whereas KeyError is not always so.

I haven't tried it but type hints are probably smart enough to find errors like this statically with namedtuple, but probably not so with a dict.


I think you need to use typing.NamedTuple to get typing support. On the other hand, you can use TypedDict [1] to get type hints on dictionaries.

1: https://www.python.org/dev/peps/pep-0589/
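
A side-by-side sketch with made-up `PointDict` and `PointNT` types. The static part is an expectation about checkers such as mypy rather than something this snippet verifies; the runtime part shows the KeyError versus AttributeError difference mentioned earlier.

```python
from typing import NamedTuple, TypedDict


class PointDict(TypedDict):
    x: float
    y: float


class PointNT(NamedTuple):
    x: float
    y: float


d: PointDict = {"x": 51.2, "y": -74.1}   # still a plain dict at runtime
p = PointNT(51.2, -74.1)

# A type checker is expected to flag d["z"] and p.z before the code runs.
# At runtime, the two fail differently:
for label, bad_access in [("dict", lambda: d["z"]), ("namedtuple", lambda: p.z)]:
    try:
        bad_access()
    except (KeyError, AttributeError) as exc:
        print(label, type(exc).__name__)
```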


In cases where you don't care about immutability, I'd think of it as a better version of TypedDict (though TypedDict still has its place). It makes my IDE more helpful, makes my code more self-documenting, and allows mypy to tell me when I'm being dumb.

That's only because Python is so slow in general and everything else is implemented in terms of hashtable lookups too.

You could define __slots__ on the namedtuple that would not use a hashtable for lookups.

Such a confusing title without "Python" in there, especially when it's specifically about some Python feature.

The syntax is a lot more intuitive in Julia:

    julia> point = (x=1, y=2)
    (x = 1, y = 2)

    julia> point.x
    1

    julia> point.y
    2

    julia> dump(point)
    NamedTuple{(:x, :y), Tuple{Int64, Int64}}
      x: Int64 1
      y: Int64 2

And to create one as an anonymous type, you can use the `@NamedTuple` macro:

    julia> @NamedTuple{x::Int, y::String}
    NamedTuple{(:x, :y), Tuple{Int64, String}}

I guess one difference is that when you inspect it, it doesn't indicate that it is a `point`, just that it's a named tuple of two variables, so it's not exactly equivalent.

They are also type stable, strongly typed, and have the same overhead as a struct. So an array of NamedTuples takes the minimal space and allocation.


