
Msgpack can't differentiate between raw binary data and text strings - ericz
https://github.com/msgpack/msgpack/issues/121
======
drewcrawford
Bikeshedding at its finest.

I ran into this issue a few months ago, on a cross-platform project involving
four languages, each of which takes a distinctly different view of strings
from the other three. Although that situation is a common objection to
supporting strings in the issue thread, it took me just a couple of hours to
extend msgpack to support strings, in a reasonable-enough-for-me way, on each
platform.

The proposals in the thread are a lot better than mine. And I suppose it's
pretty antisocial / arrogant for me to just roll my own implementation without
consulting anybody. But in three years[0] of talking about the problem,
nothing had gotten done. Meanwhile, my code shipped a long time ago.

I do this a lot--fork people's projects to solve my problems and don't merge
back changes--and I feel guilty for not being more participatory with the
project maintainers. But the fact is that the expected cost of getting
embroiled in a flamewar like this is high (whether it is over architecture,
whitespace convention, "behavior by design", "Jim's already working on that",
etc.), whereas the benefit to me of getting my changes merged upstream is
essentially zero. So my antisocial behavior continues to be positively
reinforced.

Does anyone else have this problem? Or do people just enjoy flamewars more
than I do, or have the persuasive skills to avoid them?

[0] <https://github.com/msgpack/msgpack/issues/13>

~~~
vidarh
I generally take the approach that I fix the issue, add a comment and ask if
they want a pull request and don't do anything else unless the maintainer
expresses interest. If they do express interest, I'll go pretty far in trying
to clean up my fixes to make them suitable, as long as they still solve my
problem. If they don't express any interest, oh well, my fork will be there
and a comment will be there to point other people to a viable solution.

A lot of the time the response is very welcoming. E.g. I recently provided a
substantial patch to Beaneater (Beanstalkd client library for Ruby) and the
maintainers were all over it immediately, and we got it merged in quickly.

The benefit of making the effort is being able to keep up with upstream
without having to reapply patches. But that benefit is limited (often I will
prefer to stay with an "old" known entity rather than track upstream, as long
as security concerns don't force me to upgrade), so I don't spend a lot of
time pursuing it.

I strongly believe code speaks louder than words in this kind of situation,
and often shipping code will be more likely to get acceptance than engaging in
discussions.

------
Groxx
... by design. Because there's no "string" type. This is a bug report about
high level implementations that don't encode and decode in reversible ways,
contrary to the msgpack protocol.
<http://wiki.msgpack.org/display/MSGPACK/Format+specification>

Really, for a protocol that values minimal space usage, not defining a string
type is probably a good thing. Use the one that produces the fewest bytes in
your application - it may not be UTF-8.
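To make that concrete, here is a rough Python sketch of the raw family from the linked spec: a fixraw header for short payloads, raw 16 / raw 32 headers for longer ones, and opaque bytes either way. The helper names `pack_raw`/`unpack_raw` are invented for illustration; this is not the official library.

```python
def pack_raw(data: bytes) -> bytes:
    """Pack opaque bytes using fixraw / raw 16 / raw 32 headers."""
    n = len(data)
    if n < 32:                                    # fixraw: 101XXXXX
        return bytes([0xA0 | n]) + data
    elif n < 1 << 16:                             # raw 16: 0xda + 2-byte big-endian length
        return b"\xda" + n.to_bytes(2, "big") + data
    else:                                         # raw 32: 0xdb + 4-byte big-endian length
        return b"\xdb" + n.to_bytes(4, "big") + data

def unpack_raw(packed: bytes) -> bytes:
    """Recover the opaque bytes; the caller decides whether they are text."""
    first = packed[0]
    if first >> 5 == 0b101:                       # fixraw
        return packed[1:1 + (first & 0x1F)]
    elif first == 0xDA:                           # raw 16
        return packed[3:3 + int.from_bytes(packed[1:3], "big")]
    elif first == 0xDB:                           # raw 32
        return packed[5:5 + int.from_bytes(packed[1:5], "big")]
    raise ValueError("not a raw object")

# Text goes in as application-chosen bytes and comes back as bytes;
# the format never records which encoding (if any) was used.
round_tripped = unpack_raw(pack_raw("héllo".encode("utf-8")))
```

Note that the decoder returns `bytes`, not `str`: whether those bytes are UTF-8, UTF-16, or not text at all is entirely the application's call, which is exactly the property being argued about.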

Also:

> _For instance, the objective C wrapper is currently broken because it tries
> to decode all raw bytes into high-level strings (through UTF-8 decoding)
> because using a text string (NSString) is the only way to populate a
> NSDictionary (map)._

Well there's your problem:
<https://github.com/msgpack/msgpack-objectivec/blob/master/MessagePackParser.m#L28>
It's a buggy wrapper that's trying to be convenient. And NSString keys are by
no means the only way to populate an NSDictionary, and it doesn't look like
the Objective-C wrapper requires this:
<https://github.com/msgpack/msgpack-objectivec/blob/master/MessagePackPacker.m#L70>

------
bengotow
I'm fine with the fact that Msgpack does not differentiate between binary data
and text strings. Sure, it requires a schema, but if you're concerned with
data size and parsing speed, you should choose an encoding appropriate for
your task anyway.

The bigger problem is that Msgpack is advertised as being "like JSON, but fast
and small." To me, that makes it sound like I can replace JSON messages with
Msgpack messages and be done, and that's not at all the case, because I need
to add a schema layer. I think the "like JSON" comparison is what is really
causing this frustration with the format.

~~~
chubs
Hi, I wrote (with another guy) the objective-c wrapper.

You might be misinformed re schema layers, as msgpack does convert to and from
a dictionary in much the same way that JSON does, keeping all your dictionary
keys (which are strings) intact. In fact, we originally used it as a drop-in
replacement for JSON.

As for the data vs string issue, it was designed originally to be as
conveniently similar to JSON as possible - which is why you don't get back a
dictionary full of NSData objects that you then need to convert manually to
NSStrings; it does that automatically for you. This was a convenience vs
correctness tradeoff. People who say it's wrong are quite right. They're very
welcome to fork it, submit patches with options to return raw NSData, or
create a new wrapper - it wouldn't take a competent dev very long to re-write
what we did.

Now, I've not used MessagePack in quite a while - I've simply found that
gzipped JSON is usually almost as good.

~~~
bengotow
Thanks for replying. When I explored Msgpack, it was the Objective-C library
that I tried using. Overall it was a great experience - nice work on the
wrapper. I think you're right - I was trying to use Msgpack to do more than I
could do with JSON (namely, to transmit NSData without having to stringify
it). When I realized all the NSData objects were being automatically converted
into strings when the data structure was inflated, I figured I'd need to
prevent that behavior and do it on only certain keys (which would need to be
specified somehow, hence my thought of a schema). Thanks for clarifying!

------
jrmg
The conflict here seems to be between people who think any arbitrary valid
msgpack stream should be decodable into a specific object graph, and those who
assume msgpack will be used to implement a protocol where only messages of a
predefined format should be allowed - hence the decoding app will know
beforehand what should be a string and what shouldn't.

The conflict is unresolvable until the participants agree on which of these
two distinct things msgpack should be.

~~~
dietrichepp
I don't see the conflict here. Spend one bit per string encoding whether it's
UTF-8 data or a binary blob.

* The people who use it to implement protocols already have to deal with types, e.g., expected a number but got a string. So one more type is not a big deal.

* The people who use it to create discoverable profiles will... use JSON no matter how good MessagePack gets.

That's not the direction I was headed when I started writing, but I don't
think the first group you mentioned even exists.

------
stock_toaster
msgpack always seemed an odd thing to me. Compressed json (gzip, lz*) is
small, and fast (see: <http://news.ycombinator.com/item?id=4091051>). If you
need structure, use protocol buffers or thrift.

I actually like tnetstrings for backend messaging, but I don't see it used
very often. json is pretty damn ubiquitous these days.

~~~
nitrogen
It's for speed. I write web software in Ruby that runs on embedded devices.
This software needs to communicate with a C backend that is consuming much of
the available CPU power, so there's not much time left to waste on parsing the
messaging protocol. I did some tests and found that JSON is significantly
slower than simple key-value pair parsing (implemented as a state machine in
C), which itself is half as fast as msgpack. This doesn't even consider the
overhead of gzip compression.
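As a rough stand-in for the kind of flat key-value protocol described above, here is a single-pass parser next to stdlib JSON. The newline-separated "key=value" wire format is invented for illustration, and note the caveat: nitrogen's numbers were for C implementations, so in CPython the C-accelerated json module may well win this race.

```python
import json
import timeit

def parse_kv(message: str) -> dict:
    """One pass over the message; no nesting, no escaping, no type coercion."""
    result = {}
    for line in message.split("\n"):
        if line:
            key, _, value = line.partition("=")
            result[key] = value
    return result

kv_msg = "sensor=temp1\nvalue=21.5\nunit=C"
json_msg = '{"sensor": "temp1", "value": "21.5", "unit": "C"}'

# Both formats carry the same payload here:
assert parse_kv(kv_msg) == json.loads(json_msg)

# Relative cost on your machine (absolute numbers will differ):
kv_time = timeit.timeit(lambda: parse_kv(kv_msg), number=100_000)
json_time = timeit.timeit(lambda: json.loads(json_msg), number=100_000)
```

The point of the trade-off is visible in the parser itself: a flat key-value grammar needs no recursion, no quoting rules, and no type dispatch, which is why a hand-rolled C state machine for it can be so cheap.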

Edit: here are my results, which lack the numbers for msgpack except a mention
at the end (I hate linking to Posterous, but I haven't moved my blog yet):
<http://nitrogen.posterous.com/164964342>

~~~
duaneb
> I write web software in Ruby that runs on embedded devices.

Ok I gotta ask.... Why on earth would you do that? It seems like an exercise
in masochism. I wasn't even aware that ruby would compile on embedded arches.

~~~
nitrogen
High-end embedded Linux (1.2GHz ARM CPUs). It saved development time, but I am
considering rewriting everything in C. This is an example of what it's for
(self-promoting link in 3, 2, 1...): <http://www.nitrogenlogic.com/>

~~~
duaneb
Ahh, you mean literally embedded (perfectly legitimate use of the word). I
associate it with DSPs and extremely low power chips where something like ruby
is an absurdity. ARM is a great target for it, though.

------
Confusion
Well, then the information about whether some sequence of bytes is a string
needs to be communicated out of band. That's a perfectly acceptable design
decision, but one that may lead potential users to favor alternatives that
include that information in band.

The discussion is pointless if the objectives of the participants differ and
none is willing to compromise.

------
lnanek2
Reading TFA, apparently there is no string type, so nothing in it is a string.
It's all binary data - a byte array, or whatever that means in your language
of choice. If an application, or a library the application uses, converts a
string into binary data before packing and converts it back after, that's
none of the format's business.

------
ambrop7
The solution to these problems is for everyone to be completely ignorant of
any character encoding and just deal with octets. If the octets represent
UTF-8 text, then UTF-8 decoding should happen only when the text needs to be
presented or interpreted in some way. Any automatic encoding or decoding of
UTF-8 (such as what Python 3 does) is stupid.

EDIT: A common example of implicit and wrong handling of character encoding is
when a file gets created with a name containing invalid characters, and your
Linux file manager is unable to delete it. This can happen because the file
manager assumes the file names it gets from the OS are text, and decodes them
lossily. When it wants to delete the file it encodes the text back, but the
result differs from the original file name bytes. The error happens because
the file manager tries to decode the name as text too early - it should keep
the original octets as the reference to the file, and decode them only when
it needs to display the name.
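The "keep the original octets" approach can be sketched in Python, which has an error handler (surrogateescape, from PEP 383) for exactly this problem: undecodable bytes in a file name survive a decode/encode round trip instead of being dropped or replaced. The file name here is invented for illustration.

```python
raw_name = b"report-\xff.txt"          # \xff is not valid UTF-8

# Naive decoding destroys information: \xff becomes U+FFFD, and
# re-encoding no longer matches the on-disk name - the broken file
# manager behavior described above.
lossy = raw_name.decode("utf-8", errors="replace")
assert lossy.encode("utf-8") != raw_name

# surrogateescape smuggles each undecodable byte through as a surrogate
# code point, so the original octets can be recovered exactly when the
# name is handed back to the OS.
display = raw_name.decode("utf-8", errors="surrogateescape")
assert display.encode("utf-8", errors="surrogateescape") == raw_name
```

The `display` string is only suitable for showing to the user; for the actual unlink call, the program should hand the OS back the recovered original bytes.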

------
Beltiras
I can very easily froth at the mouth when it comes to character encoding
problems. It's one of those things that should never even be a problem, but
ends up consuming hours upon hours spent consulting arcane, cobwebby specs.

