Data Classification: Does Python still have a need for class without dataclass? (glyph.im)
129 points by ingve on Feb 14, 2023 | 123 comments



Using dataclass for all basic state classes whose primary purpose is public state with sensible init makes sense. But for the many other purposes of classes that need custom init, advanced field defaults, no equality, custom repr/str, advanced inheritance, etc dataclasses give you nothing but a bunch of magic to turn off or work around.

Whether you're using Kotlin data class, a C# record, a Scala case class, or whatever, we've seen in practice that these are only good for (often immutable) public state and data composition. Business objects with more logic and more than simple state/member manipulation are only hampered by dataclass field approaches. If Python embraced immutability and composition more at its core, then using dataclasses in these cases would make more sense.


> But for the many other purposes of classes that need custom init,

write your own `__init__` method then, as you would anyway. Dataclass' default init is way better than that of `object`.

> advanced field defaults

same

> no equality

set eq=False, all dataclass methods are optional

> custom repr/str

write __repr__ and __str__ methods, same as init, dataclass' default is much nicer than that of `object`

> advanced inheritance

What's the issue with inheritance? what's "advanced" inheritance, you mean mixins?

> etc dataclasses give you nothing but a bunch of magic to turn off or work around.

Python `object` has all kinds of default behaviors too, which are just worse. default behaviors aren't "magic". dataclasses currently manufactures its methods at runtime, so that's maybe "magic", but all of that could be built into Python, then it would just be, "default behavior".
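
To make the opt-outs above concrete, here is a minimal sketch, assuming a made-up `Session` class (the name and fields are illustrative only). It shows `eq=False` suppressing the generated `__eq__` and a hand-written `__repr__` being left alone by the decorator.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # keep identity-based equality, as with a plain class
class Session:
    user: str
    token: str

    def __repr__(self) -> str:
        # A hand-written __repr__ is not overwritten by the decorator.
        return f"Session(user={self.user!r}, token=<hidden>)"


a = Session("ann", "s3cret")
b = Session("ann", "s3cret")
print(a)        # Session(user='ann', token=<hidden>)
print(a == b)   # False: eq=False means no generated __eq__
```

The generated `__init__` is still used here; defining your own `__init__` in the class body would replace it in the same way.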


For a while, inheritance with dataclasses was fairly broken. They added kw_only in 3.10, which definitely helped. Even outside of dataclasses, Python's MRO can be a footgun with multiple inheritance.

This video has some good explanations of inheritance behavior in python https://www.youtube.com/watch?v=X1PQ7zzltz4


Re __init__ and advanced defaults, these may be achieved/set using __post_init__.
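
A small sketch of that, assuming a hypothetical `Rectangle` class: the generated `__init__` assigns the declared fields, then `__post_init__` validates them and fills in a derived field that is excluded from the constructor.

```python
from dataclasses import dataclass, field


@dataclass
class Rectangle:
    width: float
    height: float
    area: float = field(init=False)  # computed, not an __init__ parameter

    def __post_init__(self) -> None:
        # Runs right after the generated __init__ has set width and height.
        if self.width < 0 or self.height < 0:
            raise ValueError("dimensions must be non-negative")
        self.area = self.width * self.height


r = Rectangle(2.0, 3.0)
print(r)  # Rectangle(width=2.0, height=3.0, area=6.0)
```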


> dataclasses give you nothing but a bunch of magic to turn off or work around.

I don't think anything is particularly magical. At most you have a `blah=False` or you just override the thing manually and the decorator 'just works'.

Maybe some really insane inheritance stuff? IDK

I haven't run into any case where I didn't end up wanting `@dataclass`


That was similar to my initial reaction, but after trying some things in another comment I'm kind of sold on dataclasses. As long as it raised a syntax error if a class combined __init__ with arg: type, then I think you can still do everything else you would in a normal class. It's just a shortcut for when that's all you need.


As a Python instructor, I find classes a pain to teach in general. OO seems to have an aura of being hard to understand, especially for folks without a CS background. self is weird until you explain that __init__ is not really a constructor. We get through it but there are bumps in the road.

I don't teach dataclasses until I teach intermediate classes because I like my students to understand decorators beforehand.

This makes me wonder if I should experiment with teaching dataclasses instead of __init__ (and just hand-wave decorators and types)...


I've been a Python instructor for 5+ years and I've never taught my students that __init__ is not a constructor. That is, I teach that __init__ works just like a constructor.

Sure technically (the best type of correct) it is not a constructor.

Where would this be important? Maybe it could be taught as part of "language lawyer" course.

I teach OOP around lesson 10. Each lesson is about 2 hours.

Complete beginners to programming are able to create basic classes and even use inheritance.

Some people like OOP some stick with imperative style. Some migrate to more functional style (which I personally advocate).

Going through this old thread: https://stackoverflow.com/questions/6578487/init-as-a-constr... puts me firmly into the "this is hair-splitting" camp.


> Where would this be important

FAANG interviews?

Knowing that __new__() comes before __init__() is probably useful, as the deeper one gets into a language the more important it is to understand the lifecycle of an object. Otherwise, the only special thing I can think of is understanding how __new__() can be used to implement a singleton.
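
A rough sketch of that lifecycle, using a hypothetical `Config` class: `__new__` creates (or reuses) the instance, and `__init__` then runs on whatever instance came back. The `calls` list exists only to record the order.

```python
class Config:
    _instance = None
    calls = []  # records the order in which the hooks run

    def __new__(cls):
        Config.calls.append("__new__")
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        Config.calls.append("__init__")


first = Config()
second = Config()
print(first is second)   # True
print(Config.calls)      # ['__new__', '__init__', '__new__', '__init__']
```

Note that `__init__` runs again on the second call even though no new object was created, which is one of the gotchas of this style of singleton.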


I’ve been through a couple of those. I’ve never been asked language specific questions. Most FAANG code interviews don’t care much about which language you choose. I’ve also not seen them care if you use OO or don’t. Either way works - as long as your algorithm is good.

In short this is irrelevant for those interviews. It’s definitely relevant if you do python as a day job.


No, it's the opposite. Big tech doesn't care about these. Internally their language styles are as generic as possible. They want you to avoid warts and gotchas and make every OOP language usage look the same so that you can actually jump around between different projects.


I think that muddies the water a bit too much and relies on too much magic. Self might be weird, but it's one of the most central concepts in python. Especially if someone is coming from javascript. Dataclasses don't quite follow the same rules as regular classes, especially with inheritance. I think waiting to understand decorators is a good call.


On a side note, I absolutely love how your username itself is a dunder identifier, well played!


To answer the question of the author, yes compatibility is sufficient reason not to do this. Please don't push this idea. I also don't agree with much of the reasoning or conclusions but others have identified many of the reasons elsewhere so I won't repeat them.


I think:

    data Point3D(x, y, z) from Vector: ...
is a terrible syntax, there are plenty of legitimate use cases where you'd have an overwhelming amount of fields in a class/structure, and listing them line by line (as well as explicitly specifying accessors to handle the couple special cases) would end up much more readable. I'd rather see something more along these lines:

    data Parrot(Animal):
        name
        age: int
        alive: bool = False
        purchased_in: Shop | None = field(default_factory=get_shop)
        ...
(Move "field" to builtins, etc.)

But overall this proposal seems like a no-brainer, even before the addition of the match statement. The "classic" __init__ is a jarring violation of DRY, baked right into one of the most common idioms in the language, and gets in the way of not just the "QoL" features like destructuring or type checking, but also of the efforts to build more efficient implementations that could pre-allocate object memory without speculative analysis, or optimise the dict lookup away.

The question is not "shall we", but "why wasn't it done in 2008".


You made me think "I think: [all of python] is a terrible syntax". At some point in the past, python crossed over into perl or php territory for me. The semantics, runtime, culture, support, library availability, etc. are all great. But jiminy no syntax can keep up with the decades of layers of cruft these languages accumulate. I'm starting to look at C# the same way.


You probably want a Lisp or a Forth instead then. Maybe (maybe!) Lua.


Python is a multi-paradigm language.

And not only that, I think there are many reasons to want to continue having non-dataclass classes. Dataclass fits best with record-type data and not with encapsulated data or classes with mostly private attributes.


The Python baseline is that a class does not know, and doesn't pretend to know, what attributes it could contain, everything is welcome unless extra code is added to repress rogue assignments.

Trivial getters and setters, __init__ methods that copy their parameters to members of the same name, and other boring Java-like patterns are certainly served well by convenience tools like the attrs library or dataclasses, but distinguishing records from serious classes requires significant skill.


> [...], everything is welcome unless extra code is added to repress rogue assignments.

I think slots do that, don't they?


Yes, and dataclasses offer an even nicer notation.


Dataclasses make it very easy to define such classes indeed but they don't disallow extra attributes. So they really compose with a lot of other techniques


I'm not sure what you mean here, GP was saying dataclasses themselves don't prevent rogue assignments, but they have a nicer syntax for generating a `__slots__` automatically based on type-hints/class attributes which absolutely does.


I see, new in Py 3.10 and new to me


> I think there are many reasons to want to continue having non-dataclass classes.

Even just one would help us evaluate your claim!

> Dataclass fits best with record-type data and not with encapsulated data or classes with mostly private attributes.

How does it "not fit"? How would you know if a class were a dataclass? Would it break something?

For me, not having to write a constructor is a small win right away, having the class have a useful `str` representation is nice, automatically comparable and hashable are nice, I can also make immutable classes which is very nice, etc, etc.

Unless you really have no constructors or visible members, what's the drawback?

I don't use `dataclasses` everywhere but more and more when I start any class I make it a `dataclass`.
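
For anyone who hasn't seen those wins side by side, a short sketch with a hypothetical `Version` class (`order=True` is an extra I added to show comparisons beyond equality):

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    major: int
    minor: int = 0


v = Version(3, 10)
print(v)                          # Version(major=3, minor=10)
print(v == Version(3, 10))        # True
print(v < Version(3, 11))         # True
print(len({v, Version(3, 10)}))   # 1: frozen + eq makes instances hashable
```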


For my general uses, pydantic and dataclasses are very similar, and I prefer pydantic over dataclass for the following reasons:

1. pydantic by default ignores extra fields - it's useful when I make an API call and want to extract and validate only certain fields from the response, while dataclasses throw errors if you pass fields the class doesn't declare, and this behavior can't be disabled

2. pydantic is more customizable - I can overwrite the BaseModel if I have a custom need

3. pydantic seems to integrate better with python typing - I'm not sure how to explain this one, but it feels more natural and dynamic

I'm not sure what the performance concerns are, though
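
On point 1, a standard-library workaround is possible, though it is not built in. This is only a sketch: `from_dict_ignoring_extras` is a hypothetical helper, the payload is made up, and unlike pydantic it does no validation or type coercion.

```python
from dataclasses import dataclass, fields


@dataclass
class User:
    id: int
    name: str


def from_dict_ignoring_extras(cls, data):
    """Hypothetical helper: keep only keys that match init fields of cls."""
    allowed = {f.name for f in fields(cls) if f.init}
    return cls(**{k: v for k, v in data.items() if k in allowed})


payload = {"id": 1, "name": "ann", "avatar_url": "https://example.invalid/a.png"}

try:
    User(**payload)  # unexpected keyword argument 'avatar_url'
except TypeError as exc:
    print("plain construction fails:", exc)

print(from_dict_ignoring_extras(User, payload))  # User(id=1, name='ann')
```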


Pydantic is much, much, much slower than normal dataclasses. However, it has to be, because it does validation and dataclasses do not. So I use normal dataclasses for internal constructs, which need to be fast, and pydantic for anything that comes from outside, can't be trusted, and has to be validated.


Pydantic models also have a model.construct(**src) form which skips validation. According to Pydantic's docs [1], this makes it 30x faster.

[1] https://docs.pydantic.dev/usage/models/#creating-models-with...


What I seriously love about pydantic is the ability to just write the default value of `tags: list[str] = []` and not worry about all of the instances of the class sharing the same single instance of the default list.


Does that work now? I seem to remember having to use `tags: list[str] = Field(default_factory=list)` a lot


Pydantic, yes, you can just set the default initializer to = [] or = {}. I don't think that works with dataclasses yet.
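
For comparison, this is roughly what the standard-library side looks like (a sketch with made-up class names): a bare `[]` default is rejected when the class is defined, and `field(default_factory=list)` is the spelling that gives each instance its own list.

```python
from dataclasses import dataclass, field

try:
    @dataclass
    class Broken:
        tags: list[str] = []  # rejected at class-definition time
except ValueError as exc:
    print("dataclass refused:", exc)


@dataclass
class Post:
    tags: list[str] = field(default_factory=list)


a, b = Post(), Post()
a.tags.append("python")
print(b.tags)  # []: each instance gets its own list
```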


The intent of pydantic is to be a parser of (possibly recursive, with composition of classes) key-value "languages". It could, in principle, evolve in a completely different direction than dataclasses/msgpack/dataclassy, etc.

This is like -- there are many libraries that do PCA, from sklearn to statsmodels to plain scipy to genomics tools. But in a glue-language workflow you should choose a tool for its semantics.


There's also dataclassy, which stays closer to the standard library dataclasses while improving on them.


I enjoy novel ideas, they are fun to entertain, but this is quite a spicy take. I kind of understand where OP is coming from, but if you start to understand python internals, you would quickly see why it's completely untenable.

    data Point3D(x, y, z) from Vector:
is never, ever, ever, ever gonna happen. It's ugly, alien to the rest of python style, will be a massive pain to implement in the lexer/parser, adds another keyword and mechanism, and most importantly, makes the word "data" unusable as a variable name. Dataclasses are "just" decorator functions (there is some special optimization and meta magic under the hood, but still just a callable).

There's a glaring reason why we can't "just make all classes dataclasses" - it would break the entire data model of the language. "class" keyword is basically syntactic sugar for the "type" builtin. Everything (with a few small exceptions) you touch in python is a subclass of object. Changing the behavior of class to by default behave like dataclass would have knock-on effects across the entire language on every single python3 codebase on the planet.
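
To illustrate the "sugar for type" point, a simplified sketch (it ignores metaclass resolution and other details of the class statement; `Point` and `PointViaType` are made-up names):

```python
class Point:
    dims = 2

    def norm1(self):
        return abs(self.x) + abs(self.y)


# Roughly what the class statement does: call type(name, bases, namespace).
PointViaType = type(
    "PointViaType",
    (object,),
    {"dims": 2, "norm1": lambda self: abs(self.x) + abs(self.y)},
)

p = PointViaType()
p.x, p.y = 3, -4
print(p.norm1())                    # 7
print(type(Point) is type)          # True
print(type(PointViaType) is type)   # True
```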

Ok, even assuming we could surmount the insurmountable breaking changes, would we even want default classes to be dataclasses? Not in the slightest. Python may not have true private variables, but libraries/APIs leverage dunders, properties, and hidden state to do all kinds of OOPy things. You can't do that with dataclasses. Many times you have objects bound to attributes which don't have concrete types, such as endpoints in HTTP frameworks and anything with dynamic dispatch (yeah there's Callable, but it is tricky to type check).

Dataclass also has a self-documentation effect: it says "this thing is kind of like a record". Not a primitive, not a function, not a router, view, thread, tree node, god-class or stack frame. If I see dataclass, I almost always expect to be able to de/serialize it to some sort of message. Complex state, anything unpickleable, or exotic types don't really belong in a dataclass.


Off the top of my head, python classes which shouldn't/can't be dataclasses:

The entire modules typing, os, sys, struct, inspect, json.

Anything which is a metaclass or relies on metaclass behavior (you have to be careful combining dataclass with other metaclasses, it's possible but with footguns akimbo).

Datetime actually should be a dataclass, but that api is so crufty and unintuitive that it's never gonna happen, and trying to shoehorn it to a dataclass would break oodles of libraries and applications.

threading: Thread, Lock, RLock, anything which uses locks

numpy.ndarray, pandas.Series, pandas.DataFrame, torch.Tensor. Numerous scikit-learn models.

Probably the vast majority of c extensions.

pydantic.BaseModel, fastapi.FastAPI, fastapi.APIRouter (same with flask/django), celery.Task.

I could go on but I think I've made my point. There are plenty of non-dataclass use cases out there.


Integrating `dataclass` more into the language builtins might be nice, if only because it may allow/encourage a more native and performant implementation. Using a dataclass right now results in slightly slower class operations than handwritten types, and much slower import times.

I maintain another dataclass-like library[1] that's written fully as a C extension. Moving this code to C means these types are typically 5-10x faster for common operations[2]. It'd be nice if the builtin dataclasses were equally performant.

[1]: https://github.com/jcrist/msgspec

[2]: https://jcristharif.com/msgspec/benchmarks.html#benchmark-st...


Dataclasses are nice but they have a big downside: extending a dataclass with a default property can raise an error.

If you run this code:

    import dataclasses

    @dataclasses.dataclass
    class Base:
        a: int
        b: int = 1

    @dataclasses.dataclass
    class Child(Base):
        c: int
Then you'll get this error:

  TypeError: non-default argument 'c' follows default argument


that's because you are expecting dataclasses to create a default `__init__()` for you, which it can't do with that declaration. set init=False if you don't want it to do that.

    import dataclasses

    @dataclasses.dataclass
    class Base:
        a: int
        b: int = 1

    @dataclasses.dataclass(init=False)
    class Child(Base):
        c: int
if you want the init to be generated for you, you have to tell it how, given that "b" has a default value. so "c" needs to be keyword only here:

    import dataclasses

    @dataclasses.dataclass
    class Base:
        a: int
        b: int = 1

    @dataclasses.dataclass
    class Child(Base):
        _: dataclasses.KW_ONLY
        c: int


Looks like that works but it’s unintuitive. Pydantic can handle this situation without the weird KW_ONLY thing, so dataclasses should too


hm it implicitly decides to be KW_ONLY? OK

does not work:

    from pydantic import dataclasses

    @dataclasses.dataclass
    class Base:
        a: int
        b: int = 1

    @dataclasses.dataclass
    class Child(Base):
        c: int

error

    File "pydantic/dataclasses.py", line 265, in pydantic.dataclasses.dataclass
    ...
    File "/opt/python-3.10.0/lib/python3.10/dataclasses.py", line 539, in _init_fn
      raise TypeError(f'non-default argument {f.name!r} '
    TypeError: non-default argument 'c' follows default argument

it works with KW_ONLY, though it's awkward that pydantic.dataclasses doesn't supply it directly; you need to split the imports between pydantic.dataclasses and dataclasses.


Extend from pydantic.BaseModel instead of using the dataclasses decorator


not SQLAlchemy compatible, interest lost :)


I didn’t mean “always use Pydantic instead of dataclasses”. I meant “dataclasses have a default property problem but it’s solvable since Pydantic handles them well”


I personally just use `dataclass(kw_only=True)`, which should really be the default in most cases anyways.


I think the same.

`slots=True` should also be the default.


Nice, I think kw_only=True and slots=True solve a lot of my dataclass problems. Wish they defaulted to true


> Nice, I think kw_only=True and slots=True solve a lot of my dataclass problems. Wish they defaulted to true

While it doesn’t solve reading other people’s code with the defaults you’d like to have, dataclass being a decorator and not a special syntax means that you can just derive a new decorator from it with the defaults you want.


> you can just derive a new decorator from it with the defaults you want.

This could hurt basic static analysis.


> This could hurt basic static analysis.

That’s a good point. I know some of the python static analysis tools can “see through” some simple indirection, but don't know if even a simple wrapper injecting arguments would work with any (or which) of them for this purpose.


It's ugly but you can fix this with `dataclass_transform`


`frozen=True` is really nice also if your data does not change :)


I think the difficulty here is that data classes beyond the most simple cases have awkward extension points, see the docs on post-initialization https://docs.python.org/3/library/dataclasses.html#post-init...

I am partial to a dataclass keyword but… it’s tough. The problem with declarative stuff like this is that sometimes you really just want to do things in a certain order and now you have to communicate that (the keyword argument hack is an example of this)


Kind of a radical idea but IMHO in a high-level language (i.e. not C++ or Rust) the real mistake is structs/classes with mutable fields. None of those problems exist in that world.

It's not very ergonomic, but if you start from that assumption you could have ad-hoc syntactic support that desugars to a limited version of an optics library.


It baffles me that the PEP to add frozendict is still in draft state. It's obnoxious to try and create truly immutable datatypes in Python for... I have no idea why.


frozendict (PEP 416) isn’t in draft, it is rejected

frozenmap (PEP 603) is in draft.


Ah right, I'm not sure how they're significantly different. Regardless, not having an immutable dictionary-type is just... frustrating.


I’ve used types.MappingProxyType on occasion for this


That's ultimately what I ended up using, but it seems not ideal considering the underlying data is still mutable and you still have to maintain a reference to it somewhere to prevent GC


We always use dataclasses with frozen=True to make them immutable.
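
A minimal sketch of what that looks like in practice, with a hypothetical `Money` class: assignment raises `FrozenInstanceError`, and "changing" a value means building a new instance with `dataclasses.replace`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Money:
    amount: int
    currency: str


price = Money(100, "EUR")
try:
    price.amount = 120
except FrozenInstanceError as exc:
    print("blocked:", exc)

# "Changing" a frozen instance means building a new one.
raised = replace(price, amount=120)
print(price, raised)
```

Keep in mind that `frozen=True` is shallow: a frozen dataclass holding a list can still have that list mutated.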


Same. I treat them as immutable DTOs [0], or "value objects" if you will instead of "just a plain class". An unfortunate drawback is that not everyone is familiar with the concept of immutability, especially in the world of Python. It can also be confusing to explain how "frozen dataclasses are not actually the Python classes you know" :-/

[0] https://en.wikipedia.org/wiki/Data_transfer_object


This should be the default, with the fewest keystrokes and the least visual clutter. Mutable state should be the special case that you need extra syntax for.


So, like Erlang?

What’s an example of a language that does this by default?


Idiomatic usages of ML-family languages (Haskell, OCaml) are very close to that.


Common Lisp Object System might do something like that?


Rust


I'd rather gouge my eyes out than work with Rust code. Horrific.


I still don't understand why you would use classes without methods. That's what dicts are for.

About 10 years ago someone who I thought was pretty smart said it was because dots are nicer to type, fair enough I guess. It just always seemed silly to me to go through setting up a class but not getting the benefits of using methods.


1. Dots are nicer to type for sure 2. It's a bit easier to define types for class attributes than dictionaries IMO 3. You can absolutely have methods on dataclasses


There is a slight distinction between a dictionary and a class. Dict is a mutable key-value storage, so can be used as a collection of objects (or values) and benefit from its dynamism. Dataclasses are meant to define an object, so the emphasis is on clearly declaring the purpose of it upfront, enhancing readability.


There is clear value in being able to pass around several pieces of information about the same thing without having to individually name them every time (less typing, you can decide to include more members at a later time, less overhead in passing by reference, users can treat it as an opaque type without needing to know the contents, etc).


This. I look at old code I wrote and cringe at things like

    def foo(...):
       ...
       return yhat, sse, aic, bic

    yhat, _, _, _ = foo(ytrue)
    plot(ytrue, yhat)
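
A sketch of the named alternative, assuming a hypothetical `Fit` result type. The "model" here just predicts the mean, and the `aic`/`bic` values are dummies; only the shape of the return value matters.

```python
from typing import NamedTuple


class Fit(NamedTuple):
    yhat: list[float]
    sse: float
    aic: float
    bic: float


def foo(ytrue: list[float]) -> Fit:
    # Placeholder "model": predict the mean; aic/bic are dummy values.
    mean = sum(ytrue) / len(ytrue)
    yhat = [mean] * len(ytrue)
    sse = sum((y - h) ** 2 for y, h in zip(ytrue, yhat))
    return Fit(yhat=yhat, sse=sse, aic=0.0, bic=0.0)


result = foo([1.0, 2.0, 3.0])
print(result.yhat)   # [2.0, 2.0, 2.0]
print(result.sse)    # 2.0
```

Callers pick fields by name instead of counting underscores, and adding a fifth field later doesn't break code that reads attributes.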


I'm in the same boat. I've been using TypedDict all over the place and it's quite nice! The only plus side to dots I had was the IDE auto-filling it, and my IDE seems to work just fine with TypedDict.


Classes can change an attribute to a property while keeping the interface intact. Like turning an SQL table into an updateable view.
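
For example (a sketch with a made-up `Celsius` class): `degrees` could start life as a plain attribute and later become a validating property without callers changing anything.

```python
class Celsius:
    """`degrees` looks like a plain attribute to callers but is validated."""

    def __init__(self, degrees: float) -> None:
        self.degrees = degrees  # goes through the property setter below

    @property
    def degrees(self) -> float:
        return self._degrees

    @degrees.setter
    def degrees(self, value: float) -> None:
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._degrees = value


t = Celsius(20.0)
t.degrees = 25.0      # callers still use plain attribute syntax
print(t.degrees)      # 25.0
```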


Dicts are much more annoying to give types for. They are also slower.


classes themselves are types, which is semantically valuable and also provides for a straightforward inheritance pattern. attributes on classes are easier to type. There's pep-589 TypedDict but it's not very easy to use right now.


IDE autocompletion is better with classes than with dicts.


Documenting your type.


Future-proofed, self-documenting code?


dataclasses document the full set of attributes they can hold, along with their types.


Having to specify the names and, if you hate yourself, the types of class members up front is a major limitation. Only when it is possible to do so (because the class is a boring one) does the repetitiveness of typical __init__ methods become an issue.


Please, I am very curious about the kind of "non boring" classes you are talking about. If you could show an example. I couldn't come up with anything that I would define as reasonable where we couldn't define the types beforehand.


Fairly contrived example, but something like this. Where you are accepting dirty input and need to normalize it. In this example I suppose you could set the type as int|str|Decimal|float|bytes, but the normalization still needs to happen somewhere.

    class Food:
        def __init__(self, temp):
            if str(temp).lower().strip() in ['hot', 'warm', 'cold']:
                self.temp = temp.lower().strip()
            else:
                try:
                    temp_int = int(temp)
                    if temp_int > 120:
                        self.temp = 'hot'
                    elif temp_int < 60:
                        self.temp = 'cold'
                    else:
                        self.temp = 'warm'
                except ValueError:
                    raise ValueError(f'temp must be  a number or one of the following: "hot" "warm" "cold", received: {temp}')


Pydantic dataclasses support __post_init__.

This is based on code I have in my editor right now. Typing straight into the comment box so it might be wrong.

    from typing import Literal, Union

    from pydantic.dataclasses import dataclass

    Toy = Literal['toy']

    @dataclass
    class Spec:
        sample_size: Union[float, Toy]

        def __post_init__(self):
            self.sample_size = 0.1 if self.sample_size == 'toy' else self.sample_size


here's how that is much nicer with dataclasses:

    import dataclasses
    from typing import Literal


    @dataclasses.dataclass
    class Food:

        temp: Literal["hot", "cold", "warm"] = dataclasses.field(init=False)

        temp_arg: dataclasses.InitVar[str | int]

        def __post_init__(self, temp_arg):
            # optional bonus, get the list of temps from the annotation at runtime
            # possible_temps = self.__annotations__["temp"].__args__

            if str(temp_arg).lower().strip() in ["hot", "warm", "cold"]:
                self.temp = temp_arg.lower().strip()
            else:
                try:
                    temp_int = int(temp_arg)
                    if temp_int > 120:
                        self.temp = "hot"
                    elif temp_int < 60:
                        self.temp = "cold"
                    else:
                        self.temp = "warm"
                except ValueError:
                    raise ValueError(
                        f"temp must be  a number or one of the "
                        f'following: "hot" "warm" "cold", received: {temp_arg}'
                    )


Other than having type annotations, in what way is this nicer?

Seems way more verbose and less readable.


it's fully typed for one thing

the previous example was not verbose enough IMO, the intent here is clear and the two variables with different purposes are kept separate


He could've used different variable names in his example too, which makes me believe variable clashing was in fact intentional. And while it's definitely a controversial approach, there are cases where it can be considered more readable (e.g. where provider and consumer of "temp" are isolated entities which have their own meanings of "temp").


pep-484 typing hates when you use the same variable name for two different purposes and I can't say I disagree with that. Being strict about what kinds of types you attach to variables and allowing variables to be statically typed is a good idea that reduces bugs.


We have two variables with the same name, but one is a constructor argument and the other is an instance attribute. PEP-484 is absolutely fine with that; its downsides (and benefits) are purely in the human domain.

> allowing variables to be statically typed is a good idea that reduces bugs

Not when it comes at the cost of readability. I'd argue more bugs (and definitely more severe bugs) are caused by poor readability than by lack of static types.


ORMs come to mind. There's typically a lot of reflection and dynamic instantiation magic baked into its classes, which you really have to hack around typing to make it work nicely (or even at all).


Thunks-style functors, where you bind a callback to a private variable, to be later called, often with kwargs. Like what most HTTP frameworks do. You can use Callable but the actual function prototype is unknown.


How is that a major limitation?

You only have to specify them 'up front' in terms of how the code is laid out. But in terms of which order you write the code in, you can add them as you need them.

Specifying the names and types is extremely useful for readers (and users) of your code.


Anything requiring e.g. setattr, getattr, delattr? Without looking far,

https://github.com/python-attrs/attrs/blob/main/src/attr/_ma...


If you are defining dataclasses, do yourself a favor and define them kw_only=True.


For one category of classes-- representing data-- there's little point in non-dataclass classes. But for other kinds of classes, it's important to be able to control the constructor/dependencies. For example, suppose I have a provider class that abstracts the interface to some external system. My system needs to operate in at least 3 different contexts for this data provider: 1) connecting to an external database, 2) connecting to the local filesystem for simulations and experiments, and 3) unit tests. I have several implementations of this class using different mechanisms. Each of these take different things in their constructor: one takes a database connection as a dependency, another takes a directory name for the local filesystem, and another takes an in-memory dictionary or maybe doesn't need anything in its constructor.

Having each of these classes have a clearly defined type-annotated constructor makes a lot of sense and makes things more self-documented. If everything were a dataclass then this situation would not be explicitly stated in the constructors.
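
A rough sketch of that setup, with hypothetical names throughout (`Provider`, `InMemoryProvider`, `FileProvider`, `report`); the database-backed variant is left out to keep it self-contained. Each implementation declares its own dependency in an explicit, typed constructor, while callers depend only on the shared interface.

```python
from pathlib import Path
from typing import Protocol


class Provider(Protocol):
    def get(self, key: str) -> str: ...


class InMemoryProvider:
    """For unit tests: takes a plain dict."""

    def __init__(self, data: dict[str, str]) -> None:
        self._data = data

    def get(self, key: str) -> str:
        return self._data[key]


class FileProvider:
    """For local simulations: takes a directory name."""

    def __init__(self, directory: str) -> None:
        self._directory = Path(directory)

    def get(self, key: str) -> str:
        return (self._directory / key).read_text()


def report(provider: Provider, key: str) -> str:
    return f"{key}={provider.get(key)}"


print(report(InMemoryProvider({"region": "eu"}), "region"))  # region=eu
```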


I think of dataclasses as glorified tabular data that is not of size to justify numpy.

Once the need to tart up the class methods or do any inheritance crops up, make a "proper" class.

None of that has a technical basis, but there is value in being methodical with the coding choices.


Probably a controversial opinion, but I would like `self` removed, and have the class body be an init function. Something like:

```

dataclass NormalizedPoint(x: float, y: float):

    _norm = math.sqrt(x**2 + y**2)
    x /= _norm
    y /= _norm

    def __add__(other: NormalizedPoint):
        return NormalizedPoint(x + other.x, y + other.y)
```

In other words, I would really like some construct which sticks as close as possible to a normal function, but that also returns a (typed) object with everything in its scope.


> Probably a controversial opinion, but I would like `self` removed, and have the class body be an init function.

That would be weird in classes that override the constructor (__new__), which is what is actually called (not __init__, except contingent on the result of the constructor) when the class is used as a callable.

Not to mention the need to define methods in the class body, and be able to distinguish them from local functions in the constructor/initializer, which is a pretty good argument against the class body being the constructor or initializer.
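
The "contingent on the result of the constructor" part can be seen in a small sketch (the `Maybe` class is made up): when `__new__` returns something that is not an instance of the class, `__init__` is never called.

```python
class Maybe:
    initialised = 0  # counts how often __init__ actually ran

    def __new__(cls, value):
        if value is None:
            return None  # not an instance of cls, so __init__ is skipped
        return super().__new__(cls)

    def __init__(self, value):
        Maybe.initialised += 1
        self.value = value


print(Maybe(None))            # None
print(Maybe.initialised)      # 0
print(Maybe(5).value)         # 5
print(Maybe.initialised)      # 1
```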


I agree that in the general case it does not work, but if we were to introduce an alternate dataclass/class definition syntax, I feel like something lightweight like this would be nice. (Local functions could just be prefixed with underscores, no need to distinguish them from private methods.)


If we get rid of normal classes, how do things like Django ORM work?

It's a pretty significant package, if you're proposing this kind of thing, I assume there's an answer to this...


Maybe I didn't have enough coffee, but what's really the point of "data" keyword? Seems like its only goal is to confuse. So you write that data thingy and later figure out you actually need a class, then what?

What's the difference between data and class? Because of what it keeps? So why not have "methods" for classes that only have methods or "notdecided" if you don't yet know what sort of class you are going to end up with.

I don't follow.


Dataclasses already exist and have throngs of users. It's for them. If you've never used dataclasses maybe it doesn't make sense to you.

Then, I agree Python is way too big a language now, but I'd rather start removing things like the walrus and operator overloading.


Operator overloading is what makes the majority of science/math heavy libraries (numpy, scipy, pandas, torch, scikit) work. It ain't going anywhere.

Walrus is fine. It's easy enough to google "python colon equals" and figure it out.


The idea is that dataclasses are for when you just want to store some data in a pre-structured format. It is almost the same thing as a named tuple.

It trades away the flexibility of processing the incoming data in __init__ to make defining the class simpler.

I agree that the name is confusing, I think maybe @autoinit would have been better.
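
To make the named tuple comparison concrete, here is a small sketch with the same two fields declared both ways (the class names `PointDC` and `PointNT` are made up for illustration). The main visible differences are mutability and tuple behavior:

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class PointDC:
    x: float
    y: float = 0.0


class PointNT(NamedTuple):
    x: float
    y: float = 0.0


dc = PointDC(1.0)
nt = PointNT(1.0)

dc.y = 2.0       # dataclass instances are mutable by default
x, y = nt        # named tuples unpack and compare like plain tuples

try:
    nt.y = 2.0   # named tuple fields cannot be reassigned
    nt_is_mutable = True
except AttributeError:
    nt_is_mutable = False
```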


Why can't the Python interpreter figure it out, like "oh, this class only stores data, so it must be a data class"? No need for decorators etc.

What would be the consequences of treating a non-data class that only contains data as a data class? Likely none.


Good luck introspecting a given Python object and ascertaining "this class only stores data". It's literally impossible given the dynamism of Python.

The consequences are exposing state that you don't necessarily want to expose, missing opportunities for optimization (losing the ability to make fine-grained memory/performance tradeoffs), implying something is serializable when it isn't, and collisions with metaclasses and inheritance weirdness.


Data classes are just regular classes but with some extra syntactic sugar(magic) to automate some common stuff.


There's all kinds of use cases for classes without dataclass. For example, just the other day I wrote a leaky bucket rate limiter class. It uses the __aenter__ and __aexit__ methods to do things in async with blocks.

I love dataclasses, but there needs to be a compelling reason to break everyone's code and the reason can't just be "I don't want one import and one decorator".


why do you think you can't define `__aenter__` and `__aexit__` on a dataclass?
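
As a minimal sketch of that point: the decorator does not get in the way of the async context manager protocol. This is not the parent's rate limiter and not a real leaky bucket (it is not safe for concurrent tasks); `RateLimiter`, `min_interval` and `entered` are names invented for the example:

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class RateLimiter:
    min_interval: float  # hypothetical: minimum seconds between entries
    _last: float = field(default=0.0, init=False, repr=False)
    entered: int = field(default=0, init=False)

    async def __aenter__(self):
        # Sleep until at least min_interval has passed since the last entry.
        wait = self._last + self.min_interval - time.monotonic()
        if wait > 0:
            await asyncio.sleep(wait)
        self._last = time.monotonic()
        self.entered += 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


async def main():
    limiter = RateLimiter(min_interval=0.01)
    for _ in range(3):
        async with limiter:
            pass
    return limiter


limiter = asyncio.run(main())
```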


I don't. But it's really not a dataclass - it doesn't need slots etc.


Having used Python since Python 1.x, I find data classes kind of nice for the convenience they bring, but at the same time they are a weird hybrid. I would rather have had some native struct and variant (think Rust enum) types, which would have required new syntax.

In a way it feels like data classes are a strange hybrid.

Similarly I see all kinds of type theoretical issues with type checking popping up.


It's just unfortunately becoming clear Python is going the way of C++.


This would have been a good discussion 10 years ago, but these days it's pydantic first, hardly ever use dataclass


Would switching all classes to the data class style mean losing the ability to differentiate between class variables and object variables?

I'm going to go try it in a minute, but I'm specifically thinking of setting a mutable variable to use as a way to cache the results of a long running process across objects. Or even just any variable to use with class methods. I'll update soon

edit:

I tried it out and it looks like it works just fine

    import time
    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class MyCoolDataClass:
        arg: Any
        # Not annotated, so these stay class attributes shared by all instances.
        _things = {'last_updated': 0, 'data': []}
        _things_timeout = 300

        def _update_things(self):
            """some long process"""
            time.sleep(10)
            self._things['last_updated'] = time.time()
            self._things['data'] = ['some things']

        @property
        def things(self):
            seconds_since = time.time() - self._things.get('last_updated', 0)
            if seconds_since >= self._things_timeout:
                self._update_things()
            return self._things.get('data', [])
which then works as expected, sharing the cache between objects.

    start_time = time.time()
    for i in range(10):
        obj = MyCoolDataClass(i)
        print(obj.things, int(time.time() - start_time))

    ['some things'] 10
    ['some things'] 10
    ...

So yeah, I think I might be on board with dataclasses by default. As long as we can still add __init__ for when we want to do something extra with the variable

edit2:

If we switch to dataclasses things like this would need to raise a syntax error:

    @dataclass
    class MyOtherDataClass:
        arg: any
        def __init__(self, special_arg:str):
            self.special_arg = special_arg.lower().strip()
As it stands now defining __init__ completely removes any definitions set in the class.
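
A quick way to check what happens here, as far as I understand the current behavior: `@dataclass` does not overwrite an `__init__` that the class body already defines, but it still registers the annotated names as fields. So `arg` is a field that never gets set, and the generated `__repr__` and `__eq__` would fail with `AttributeError` on such an instance:

```python
from dataclasses import dataclass, fields


@dataclass
class MyOtherDataClass:
    arg: object

    def __init__(self, special_arg: str):
        # The hand-written __init__ wins; the generated one is not installed.
        self.special_arg = special_arg.lower().strip()


obj = MyOtherDataClass("  Hello ")
field_names = [f.name for f in fields(MyOtherDataClass)]

print(field_names)          # 'arg' is still registered as a field
print(obj.special_arg)      # set by the custom __init__
print(hasattr(obj, "arg"))  # the field itself was never assigned
```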

I think the killer feature would be if we added django forms style validate methods for validation and normalization. imagine if we could do this:

    class Drink:
        temp: str
        container: str

        def _validate_temp(self, temp):
            temp = temp.strip().lower()
            if temp not in ['hot', 'cold', 'room']:
                raise ValidationError(f'"{temp}" is not a valid temperature')
            return temp

        def _validate_container(self, container):
            container = container.strip().lower()
            if container not in ['mug', 'glass', 'sippy cup']:
                raise ValidationError(f'"{container}" is not a valid container')
            return container

        def _validate(self, kwargs):
            if kwargs['temp'] == 'cold' and kwargs['container'] == 'mug':
                raise ValidationError("Don't dirty a mug for a cold drink!")
            
Sure, it goes against explicit is better than implicit, but that could be fixed with decorators. It would also be nice to have a concise way to say what values are valid, and a handful of standard normalization processes, but I think you get the idea.
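
For what it's worth, a rough approximation of this is already possible with `__post_init__`, just without the per-field hook naming. A minimal sketch, where `ValidationError` is a stand-in exception defined locally (not Django's):

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Stand-in for a framework-provided validation error."""


@dataclass
class Drink:
    temp: str
    container: str

    def __post_init__(self):
        # Normalize and validate each field after the generated __init__ runs.
        self.temp = self.temp.strip().lower()
        if self.temp not in ("hot", "cold", "room"):
            raise ValidationError(f'"{self.temp}" is not a valid temperature')

        self.container = self.container.strip().lower()
        if self.container not in ("mug", "glass", "sippy cup"):
            raise ValidationError(f'"{self.container}" is not a valid container')

        # Cross-field check, like the _validate method above.
        if self.temp == "cold" and self.container == "mug":
            raise ValidationError("Don't dirty a mug for a cold drink!")


tea = Drink("  Hot ", "MUG")
```

Note that this only runs at construction time; later assignments such as `tea.temp = "lukewarm"` are not validated.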


Typically only the things that you annotate end up as instance variables for dataclasses.
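
A small sketch of that rule (the class and attribute names are made up): the annotated name becomes a field and an `__init__` parameter, while the un-annotated one stays an ordinary class attribute shared by every instance. Annotating with `typing.ClassVar` is the explicit way to get the same effect.

```python
from dataclasses import dataclass, fields


@dataclass
class Example:
    arg: int                # annotated: a field and an __init__ parameter
    shared = {"data": []}   # not annotated: a plain class attribute


a, b = Example(1), Example(2)
a.shared["data"].append("x")  # mutates the one dict shared via the class

field_names = [f.name for f in fields(Example)]
```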


Does Java still have a use for variables without final?

Does Javascript still have a use for functions without async/await?

This is just a repackaging of an age-old argument, imo.

When they invented the screw, it did not preclude the nail. /shrug


I'm always astonished by Python. For ages "dataclasses" have been called struct or record in other languages and are fast first class values.

In Python, they invent a weird, slow concept and discuss it for a decade, probably while writing 10 PEPs in the process. Through conferences, Stackoverflow, etc. people are slowly convinced that the concept is natural.


Python has had named tuples since 2008 and slots for attributes since 1999, both of which are fast first class values.

I’m always astonished by ignorant people feigning knowledge of their areas of ignorance.
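
For reference, a minimal sketch of the two features named above (`Point` and `SlotPoint` are illustrative names):

```python
from collections import namedtuple

# A named tuple: immutable, tuple-compatible, fields accessible by name.
Point = namedtuple("Point", ["x", "y"])


class SlotPoint:
    # __slots__ replaces the per-instance __dict__ with fixed attribute slots.
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
s = SlotPoint(1, 2)

try:
    s.z = 3  # not declared in __slots__, so this is rejected
    accepted_extra = True
except AttributeError:
    accepted_extra = False
```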


This reminds me of a Perl guy once asking me why Python calls hashes dicts. I said that a dict is a key-value object with a defined set of included methods. The fact that it uses a hash under the hood is just an implementation detail. If a better technique comes along, Python is free to change it. Also, if someone wanted to do something crazy like make an implementation of Python where every dict was a NoSQL database, that would be fine as long as it implemented the same behavior.

For structs you have to remember that in python everything is reference passed, so some of the differences between structs and classes that you see in other languages do not apply.

But I suspect the real practical reason is that python already has structs, which are used for interoperability with C structs. Or any structured binary format really.
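
A short sketch of what the `struct` module is for, assuming a made-up record layout of a little-endian unsigned 16-bit id followed by a 32-bit float:

```python
import struct

# Hypothetical layout: "<" little-endian, "H" uint16, "f" float32.
RECORD = struct.Struct("<Hf")

packed = RECORD.pack(7, 1.5)              # Python values -> raw bytes
record_id, value = RECORD.unpack(packed)  # raw bytes -> Python values
```

So "struct" in Python already names a bytes-packing tool rather than a record type, which may be part of why the record type got a different name.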


It is the same concept. In Rust, structs are defined with the struct keyword and can have methods. And there is no class keyword for classes.

In Haskell, structs are defined with the keyword data. Maybe that is where the inspiration comes from.

https://lotz84.github.io/haskellbyexample/ex/structs

In any case, it seems each language chooses equally confusing keywords. Syntax is not a big deal IMO if the concept is clear.


> In haskell, structs are defined with the keyword data. Maybe that is where the inspiration comes from.

In Haskell, Rust and Python you can also just have tuples, if you want to. In Haskell you can use `newtype` or `type` to name your tuples and use them as records.

Of course, as you say, you can also use the `data` keyword in Haskell. Or you can use Church encoding to use functions as records.

Lots of (confusing) possibilities.


Indeed. My intention was to answer OP's claim that Python is "weird" because

> "dataclasses" have been called struct or record in other languages


Oh, definitely. I agree with you!



