Are bytes... lesser known, or an especially advanced topic?
I didn't think so, but then when interviewing juniors (who invariably have "expert Python" on their CVs), nobody ever seems to know them, or anything about string handling or Unicode.
Are the basics about how stuff is moved around inside computers just completely opaque to people learning to program nowadays? I learned that stuff at secondary (high) school, so internally my brain just counts it as "foundational knowledge".
They're an artifact of the way Python 3 went to Unicode.
Early Python was ASCII-oriented. Type "str" was used both for text strings and arrays of bytes. "str" was treated much like arrays of "char" in C - it was the basic type for binary I/O.
Then came Unicode and Python 3. Characters and bytes were no longer the same thing.
Strings and arrays of bytes had to be split somehow. As with most languages that had to
retrofit Unicode, this didn't go well.
Rust, which didn't have a retrofit problem, did a clean separation. There are arrays of u8,
and there is "str", which must be valid UTF-8 sequences. There is no implicit conversion.
This is straightforward.
Python didn't do it that way. Type "bytes" in Python 3 prints as an ASCII string, not an array of numbers. It's close to the "str" from Python 2. The usual string-like operations, such as "split", are defined for "bytes". This was kind of weird, but was supposed to simplify converting old code from Python 2 to Python 3. It didn't really help all that much.
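To make that concrete, here's a small sketch of how "bytes" prints ASCII-style and supports string-like methods, while indexing reveals it's really a sequence of integers:

```python
# bytes prints using ASCII characters where possible, and supports
# string-like methods such as split():
data = b"GET /index.html HTTP/1.1"
print(data)              # b'GET /index.html HTTP/1.1'
print(data.split(b" "))  # [b'GET', b'/index.html', b'HTTP/1.1']

# Unlike Python 2's str, indexing a bytes object yields an int,
# not a one-character string:
print(data[0])           # 71, the byte value of 'G'
```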
There's also "bytearray", which is a mutable version of "bytes".
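A quick sketch of the difference: "bytearray" allows in-place mutation where "bytes" does not.

```python
buf = bytearray(b"hello")
buf[0] = ord("H")         # in-place mutation; bytes would raise TypeError here
buf.extend(b", world")
print(bytes(buf))         # b'Hello, world'
```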
It's one of those messes left over from the ASCII to Unicode transition era, along with UTF-16, byte
order marks, "wchar_t", HTML character sets, HTTP headers, and all the character representations in SQL.
Python 2 has the type "str" (roughly equivalent to Python 3's "bytes") and the type "unicode" (roughly equivalent to Python 3's "str").
The whole decade-long mess was largely about fixing quirks in Windows' path and filename encoding handling bullshit. And they didn't even manage to fix the problem in the end.
Raw byte string literals (b'') and their type in the language are details you can be unaware of if you didn't live through the Python 2->3 transition or don't deal with binary data buffers.
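For anyone who skipped that era, a minimal sketch of the literal and the explicit conversions between the two types:

```python
s = "café"
b = b"caf\xc3\xa9"   # bytes literal: \xc3\xa9 is the UTF-8 encoding of é

print(type(s).__name__, type(b).__name__)  # str bytes

# Conversion is always explicit in Python 3:
print(s.encode("utf-8") == b)   # True
print(b.decode("utf-8") == s)   # True
```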
I was surprised that bytes were a lesser known type in Python, but now that I think about what you're saying I guess it's true. The bytes type doesn't get surfaced if you're only working with higher level libraries that do the bytes to str conversion for you.
I used to think that only external libraries (e.g. Flask) do hidden conversions to shield the user from the str/bytes distinction, but even the built-in MIME implementation does that kind of thing, which means it will break 8-bit S/MIME messages. (And when you get bitten by that hard and go off implementing your own MIME parser, the fact that bytes has all the string methods like .split() comes in very handy.)
There is a difference: a packet on the wire is inherently bytes, while interaction with the OS is a can of worms. On UNIX these are just opaque byte blobs, but with the expectation that they mean something to humans, so they are either strings in the locale encoding or UTF-8 encoded strings; then there is Windows, where this kind of interface uses something like UTF-16. In general, the name of the thing you use to interact with the OS is probably a meaningful string of Unicode code points, but it does not have to be, and you should have some kind of plan for what to do when it is not. Doing that in a way that abstracts away the conventions of the underlying OS is almost certainly impossible.
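Python's own plan for this is the "surrogateescape" error handler (PEP 383): filename bytes that aren't valid in the expected encoding are smuggled into a str as lone surrogates, so the original bytes survive a round trip. A minimal sketch with a made-up filename:

```python
raw = b"report-\xff.txt"                       # \xff is not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")  # \xff becomes a lone surrogate

print(ascii(name))  # 'report-\udcff.txt'

# The original, un-decodable bytes round-trip losslessly:
print(name.encode("utf-8", "surrogateescape") == raw)  # True
```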
Certainly not an advanced topic, but lesser known (compared to string or list) is probably an accurate description. This article seems like it's aimed at Python learners, rather than experienced programmers.
I've noticed I have to drastically recalibrate when I'm looking at Python learning materials. I'm trying to find something good for onboarding people, but even the "advanced content" tends to be stuff like this. I'd consider advanced to be more along the lines of "your function returns a generator, here's how to write the type annotation," or "worked examples showing how to express computations using numpy so that your program runs 30x faster," or "here's how to write a context manager."
At the same time, NumPy is so ubiquitous in science that I wouldn't even consider it intermediate; it falls under basics.
Really it's just a case of Python being able to handle a ton of different use cases that sometimes overlap, you can be really deep into it in one area but never touch another.
It's still a good reminder anyway. It's not something you often have to worry about; I just ran into this distinction handling raw HTTP data. I was trying to check for a string value in a byte string, and it took a bit to realize I was effectively doing: 'test' == b'test', which is, of course, False.
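For anyone who hasn't hit this: the equality check fails silently, while a membership check at least fails loudly.

```python
payload = b"test"

print(payload == "test")   # False: bytes and str never compare equal
print(payload == b"test")  # True

# Membership tests fail louder:
# "test" in payload  ->  TypeError: a bytes-like object is required, not 'str'

# Decode explicitly before comparing against text:
print(payload.decode("ascii") == "test")  # True
```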
No, not really. Most of the article discusses utf8 encoding which isn’t an “advanced” topic either but I never took the time to learn it.
> who all invariably have “expert python” on their CVs
This is to get around HR filters, where they ask your experience with various tools on a 1-10 scale. If you’re only seeing the “experts” it’s possible the people who used more reasonable assessments of their abilities were simply rejected.
Driveby answer to your question, sorry I can't expand: yes, for the most part.
Even when these topics are covered directly in a university course, this knowledge doesn't stick around to become "foundational". Especially for those that don't have a hardware, embedded, or otherwise resource-constrained background or area of interest.
There is a wide gap between "known to juniors/mid level/low senior programmers" and "interesting enough to post on hacker news," which often surprises people. People can be productive for a long time having no knowledge of concepts that HN readers deem essential.
In most of the Python shops I’ve worked in, people have no idea how async actually works (and why they shouldn’t make sync I/O calls in their endpoints) or even the basics of performance (how Python allocates memory, how to dodge the GIL without eating a bunch of pickling costs, etc). Never mind stuff like import path rules, metaclasses, packaging, etc. It often feels like the median Python (or JavaScript for that matter) user has a very different mindset than Rust, Go, C, etc programmers (in my experience, Python programmers also are much less likely to have used other languages extensively).
Sure, but I've never worked somewhere that employed people to write "dead simple batch tasks". If you're writing async endpoints, it feels reasonable to assume that people know how an event loop works (or at least not to make sync calls in an async function).
Personally I think an API endpoint is "dead simple" but if your API endpoint is async, you still need to know not to call sync functions in it. The fact that such a large share of Python developers don't understand this concerns me.
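A minimal sketch of why this matters (no particular framework assumed, just bare asyncio): a sync sleep inside an async function blocks the whole event loop, so ten "concurrent" handlers run one after another.

```python
import asyncio
import time

async def blocking_handler():
    time.sleep(0.05)           # sync call: stalls every other coroutine

async def cooperative_handler():
    await asyncio.sleep(0.05)  # async call: lets other coroutines run

async def main():
    t0 = time.monotonic()
    await asyncio.gather(*(cooperative_handler() for _ in range(10)))
    concurrent = time.monotonic() - t0

    t0 = time.monotonic()
    await asyncio.gather(*(blocking_handler() for _ in range(10)))
    serial = time.monotonic() - t0

    # 10 cooperative handlers finish in ~0.05s; the blocking ones take ~0.5s
    print(f"cooperative: {concurrent:.2f}s, blocking: {serial:.2f}s")

asyncio.run(main())
```

(For genuinely blocking work, the usual escape hatch is `loop.run_in_executor` or `asyncio.to_thread`.)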
This has been my experience too. I came from an HPC scientific background and ended up doing Django after leaving academia (for an engineering company, so still pretty maths-y). When I was hiring, I really had to reset expectations to some degree about whether people had the capacity to learn this stuff, rather than knowing it already.
> It often feels like the median Python (or JavaScript for that matter) user has a very different mindset than Rust, Go, C, etc programmers (in my experience, Python programmers also are much less likely to have used other languages extensively).
Totally agree. One example of "python-brain" that I see a lot at my employer is using pandas dataframes as the only data structure, beyond _maybe_ the occasional list or dict. Any entity with multiple attributes is represented as more columns in a dataframe, rather than via objects.
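For contrast, a hypothetical sketch of the alternative (the `Sensor` entity here is made up): model each entity as a plain class rather than as extra dataframe columns.

```python
from dataclasses import dataclass

@dataclass
class Sensor:
    name: str
    unit: str
    reading: float

sensors = [
    Sensor("t1", "C", 21.5),
    Sensor("t2", "C", 19.0),
]

# Plain Python operations, no dataframe machinery needed:
hottest = max(sensors, key=lambda s: s.reading)
print(hottest.name)  # t1
```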
I think this may have less to do with "python-brain" and more with "data-science brain". If a person is well-versed in data science concepts and has been trained to use Pandas DataFrames and Series for everything, that's what they'll lean on. After all, it's some kind of in-memory object that can hold many values and has a way to label them with column labels and indices.
Chances are somewhat good that these people weren't computer science majors to begin with. For example, math or biology majors who have moved away from R to Python might know a great deal about data but not much about compsci.
For people who use Python in a DevOps context, they'll likely be exposed to more OOP concepts and lean more heavily on classes.
Yeah, I've seen a lot of people write absurd dataframe monstrosities that ended up being slower than the naive Python loop-over-a-list solution, never mind 10x the code. But I've also seen plenty of non-data-science examples.
I’ve been writing Python for 20 years, and I’m nowhere near “expert”. I don’t think people realize how incredibly complex Python is. Things like metaclasses and import rules alone are mind bogglingly complex.
After 5 years of Go I might reasonably have been called an expert (I started using Go in 2012)—there wasn’t much left in the language to learn apart from the things that had changed. But with Python there’s tons and tons left for me to learn even ignoring all of the changes.
I reckon I've used at least once probably 50% of the list, but that's over 12 years of using Python, and I'd only consider myself fluent with a very small number of the total.
I was remarking about core language features (the runtime, import resolution, the runtime type system, packaging, C extensions, etc). I was absolutely not talking about proficiency with the standard library in either case.