
Try passing a std::string in from C/C++ code. Python 3 treats it as a byte string, the kind you write with a 'b' prefix, and will not silently convert back and forth between the two types.

Sure, I am picking out one particular use case, but it isn't uncommon to wrap C code in Python scripts to munge data going in and out of it.
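A rough sketch of what that looks like from the Python side, using ctypes as a stand-in for whatever wrapper is really in use. The buffer here is hypothetical; it only plays the role of a char array handed back by C code, and the choice of UTF-8 is an assumption the caller has to make.

```python
import ctypes

# A C-style char buffer standing in for data returned by wrapped C code
# (hypothetical; a real wrapper would get this from the library call).
buf = ctypes.create_string_buffer("naïve".encode("utf-8"))

raw = buf.value              # ctypes hands back bytes, not str
text = raw.decode("utf-8")   # the caller states the encoding explicitly

try:
    "prefix: " + raw         # mixing str and bytes is rejected
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__
```

The explicit decode call is the "casting" that Python 2 used to do implicitly.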

You are right though. String manipulation does appear to be easier than in Python 2.




That's because std::string does not carry any encoding information; it is basically a wrapper around bytes (hopefully I'm not misreading this, I'm far from an expert C/C++ programmer). Because of this, Python can't make any assumptions about encoding or decoding without risking getting it wrong.

"Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters)."

https://stackoverflow.com/questions/1010783/what-encoding-do...

http://www.cplusplus.com/reference/string/string/


b"" is not "a byte string". It's a raw byte sequence:

    >>> type(b"foo")
    <class 'bytes'>
    >>> b'foo'[0]
    102

It can hold any bytes; it just happens that one way to construct and represent it uses a string-like syntax as a convenience for developers. But you can build it in other ways, or make it hold data in any other format:

    >>> bytes([102, 111, 111])
    b'foo'
    >>> import struct
    >>> struct.unpack('<I', b'\x01\x01\x00\x02')  # explicit little-endian unsigned int
    (33554689,)

Also, std::string is exactly that, a raw byte sequence, with some string operations attached to it. But you don't have any encoding attached to it: https://stackoverflow.com/questions/1010783/what-encoding-do...

So it makes sense that Python treats it as a raw bytes array (what you call "a byte string"): it has no way to know whether it is UTF-8 or CP850 unless you tell it.
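A small illustration of why guessing is not an option: the same bytes are valid under both encodings but decode to different text. The sample word is just an example chosen for this sketch.

```python
raw = b"caf\xc3\xa9"  # the UTF-8 encoding of "café"

as_utf8 = raw.decode("utf-8")    # two bytes become one character
as_cp850 = raw.decode("cp850")   # each byte becomes its own character

same_text = as_utf8 == as_cp850  # False: the encoding changes the meaning
```

Neither call fails, so nothing in the bytes themselves says which reading is the intended one.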

But because of C/C++ experience or habits from Python 2, one tends to confuse the concept of text (represented by the type "str" in Python) with one specific low-level implementation (the raw bytes array).

Python explicitly avoids this problem by requiring that either you know what the data is (UTF-8 text, a big-endian number, etc.) or you don't (a raw bytes array). Manipulating text as a raw byte sequence by hand would be the equivalent of directly manipulating the IEEE 754 representation of a number: it's not what you want in a high-level scripting language, which is why Python 3 doesn't do that anymore.
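To make the IEEE 754 comparison concrete, here is a minimal sketch that puts a float and a piece of text side by side. In both cases the bytes are only a representation, and you get the value back by naming the format. The specific values are arbitrary.

```python
import struct

# A float and its big-endian IEEE 754 double representation
number = 1.5
number_bytes = struct.pack(">d", number)
number_back = struct.unpack(">d", number_bytes)[0]

# A str and its UTF-8 representation
text = "héllo"
text_bytes = text.encode("utf-8")
text_back = text_bytes.decode("utf-8")
```

You would not normally do arithmetic on `number_bytes`, and for the same reason Python 3 does not let you treat `text_bytes` as text.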


> Try passing a std::string in from C/C++ code. It treats it as a byte string

Because that's exactly what it is? std::string is a bytes buffer, not actual text. There's no guarantee that the contents of std::string will be in any encoding, let alone a specific one.
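For instance, a buffer can hold bytes that are not valid in the encoding you might assume. A quick sketch with made-up bytes:

```python
raw = bytes([0xFF, 0xFE, 0x00, 0x80])  # arbitrary bytes, not valid UTF-8

try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Choosing an error policy is still an explicit decision by the caller
lossy = raw.decode("utf-8", errors="replace")
```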


You need to use `Py_BuildValue` with the `s` format unit to turn that into a Python str; it decodes the C string as UTF-8.



