
Do binary files contain more 1s or 0s? - _6pyv
https://coil.com/p/redacted
======
tlb
The code ignores leading zero bits in each byte, which is why it gets the
surprising result of more ones than zeros.

~~~
Enginerrrd
I wonder about fixed format headers and wrappers with a default bias too.

------
Aardwolf
"2 / 8 = 0.29"

0.25 according to my calculations

Pictures and /dev/random should give 0.5, not 0.57. Otherwise the pictures are
not properly compressed and the random generation is not proper, which I don't
really believe. Maybe it's a similar error to what happened with the 0.29?

The mean byte value in decimal should be 127.5.

I guess something goes wrong in the Python code's hexadecimal -> decimal ->
binary conversion, followed by counting instances of '0' and '1' in a string.
(It's also inefficient to round-trip through decimal, and to involve strings at
all, when the original bytes are already binary in the computer.) Does it
remove leading zeros perhaps?
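
A quick way to check both expectations is to count over every possible byte value once, which stands in for uniformly random data. This is only a sketch: the variable names are made up here, and `bin(b)[2:]` is assumed to be what strips the leading zeros.

```python
# One copy of every byte value 0..255, i.e. a perfectly uniform sample.
all_bytes = bytes(range(256))

# Count 1 bits using zero-padded 8-digit strings.
total_ones = sum(format(b, "08b").count("1") for b in all_bytes)
total_bits = 8 * len(all_bytes)

# Number of digits left when leading zeros are dropped, as bin() does.
stripped_digits = sum(len(bin(b)[2:]) for b in all_bytes)

print(total_ones / total_bits)                 # 0.5
print(round(total_ones / stripped_digits, 2))  # 0.57
print(sum(all_bytes) / len(all_bytes))         # 127.5
```

With padding the ratio is exactly 0.5. Without it, uniform bytes come out at about 0.57, which would be consistent with the figure quoted above.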

~~~
jhoechtl
Well spotted. And embarrassing for the OP.

~~~
chmod775
Mistakes like this happen to everyone, and are hardly something you need to be
ashamed of.

~~~
noobermin
Tbh it helps to send articles to friends to look over before you go live and
post to HN.

------
oconnor663
The reason you're measuring more 1's than 0's is this line:

    
    
      binary = bin(decimal)[2:]
    

That's because while `bin(0xff)` is indeed "0b11111111", `bin(0x00)` is "0b0".
Python omits your leading zeros.

A smaller nit, these lines:

    
    
      hexadecimal = binascii.hexlify(byte)
      decimal = int(hexadecimal, 16)
    

can be just

    
    
      decimal = byte[0]
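
Putting both points together, a minimal sketch of a count that keeps the leading zeros might look like this. The function name `count_bits` and the sample input are illustrative, not taken from the original post.

```python
def count_bits(data):
    """Return (ones, zeros) for a bytes object, counting all 8 bits per byte."""
    ones = 0
    zeros = 0
    for byte in data:  # iterating over bytes yields ints in 0..255
        bits = format(byte, "08b")  # zero-padded, e.g. 5 -> '00000101'
        ones += bits.count("1")
        zeros += bits.count("0")
    return ones, zeros


print(count_bits(b"\x00\xff\x05"))  # (10, 14)
```

`format(byte, "08b")` does the same job as `bin(byte)[2:].zfill(8)`.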

~~~
edent
D'oh! I swear I had a `.zfill(8)` in there. Looks like the whole thing is
wrong. Oh well!

------
japaget
This page is devoid of content except for a single "_" character. (I viewed
the page in Chrome, Safari, Edge, and Firefox with identical results.) What am
I missing?

~~~
episteme
Guessing it was edited after the mistakes were pointed out

------
kazinator
There are at least two silly things in this code.

One is the hex conversion. Compare:

    
    
      >>> bin(int(binascii.hexlify(b"\x05"), 16))[2:]
      '101'
    

versus just:

    
    
      >>> bin(b"\x05"[0])[2:]
      '101'
    

Secondly, the code should just count the 1 bits; there is no need for a count
of zeros.
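
A sketch of that approach, assuming the input is a `bytes` object (the function name `ones_fraction` is made up for illustration). Counting only the 1s also sidesteps the leading-zero problem, since the total number of bits is simply 8 per byte:

```python
def ones_fraction(data):
    """Fraction of 1 bits in data; the zeros are implied by the total bit count."""
    if not data:
        raise ValueError("empty input has no bits to count")
    # The '0b' prefix from bin() contains no '1', so it does not affect the count.
    ones = sum(bin(byte).count("1") for byte in data)
    return ones / (8 * len(data))


print(ones_fraction(b"\x05"))  # 0.25
```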

------
h2odragon
It is of course vitally important that the 1's and the 0's even out, otherwise
you might overflow the bit bucket and jam the tubes.

(This is actually a thing: encoding schemes like MFM and RLL in magnetic media
ensure there aren't too many sequential 1 bits etc. See also "Gray codes".)
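
To see the quantity such codes care about, here is a rough sketch that measures the longest run of identical bits in some data. It does not implement MFM or RLL; `longest_run` is a hypothetical helper for illustration only.

```python
from itertools import groupby


def longest_run(data):
    """Length of the longest run of identical consecutive bits in data."""
    bits = "".join(format(byte, "08b") for byte in data)
    # groupby() yields one group per maximal run of equal characters.
    return max((len(list(group)) for _, group in groupby(bits)), default=0)


print(longest_run(b"\x0f\xf0"))  # 8
print(longest_run(b"\xaa"))      # 1
```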

~~~
twic
Also 8b/10b and 64/66b encoding in ethernet:

[https://en.wikipedia.org/wiki/8b/10b_encoding](https://en.wikipedia.org/wiki/8b/10b_encoding)

[https://en.wikipedia.org/wiki/64b/66b_encoding](https://en.wikipedia.org/wiki/64b/66b_encoding)

------
jandrese
I would assume < 0.5 on average since English strings are pretty common and
the top bit is usually 0. Zero padding is fairly common in files too, while 1
padding is relatively rare.
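
A small sketch of both effects, using an arbitrary ASCII sentence and an assumed 64 bytes of zero padding (neither comes from a real file):

```python
def ones_fraction(data):
    """Fraction of 1 bits in a non-empty bytes object."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones / (8 * len(data))


sample = b"The quick brown fox jumps over the lazy dog."  # illustrative text

# ASCII code points are below 128, so the top bit of every byte is 0.
top_bits_set = sum(byte >> 7 for byte in sample)
print(top_bits_set)  # 0

# Zero padding adds 0 bits only, so it can only pull the fraction down.
padded = sample + bytes(64)
print(ones_fraction(sample), ones_fraction(padded))
```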

------
imjasonmiller
> Consider a file which only contains the letter A. It has the following bits:
> 10000001

I thought A was 65, not 129?
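
Easy enough to check in Python:

```python
code = ord("A")
bits = format(code, "08b")  # zero-padded 8-bit representation

print(code, bits)           # 65 01000001
print(bits.count("1") / 8)  # 0.25
print(int("10000001", 2))   # 129
```

So a file containing only "A" has two 1 bits out of eight.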

------
isoprophlex
Jesus this is some dumb shit.

In python, the string returned by

    
    
        bin(decimal)
    

drops the leading zeros

    
    
        >>> bin(3)
        '0b11'
        >>> bin(513)
        '0b1000000001'
        >>> bin(1024)
        '0b10000000000'
    

I don't want to turn this into a personal attack or an overly snarky comment
but it's really something to try and implement a bit counting algo by hand and
then fail to observe the way bin() behaves

~~~
dang
> I don't want to turn this into a personal attack or an overly snarky comment

That's a fine intention, but then you shouldn't lead with "Jesus this is some
dumb shit" or end with "it's really something to try and [...] fail." Doing it
that way breaks the site guidelines. Would you mind reviewing them?
[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

Note that the leading guideline in the "comments" section is _Be kind._

~~~
isoprophlex
Belated, but, yeah. You're right. I blew my top a bit there...

------
uwydr
There's a Linux application called "ent" that calculates the entropy of a
file. I use it to see if it's worth it to compress a file before I ship it
somewhere else.
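
For a rough idea of the kind of number involved, here is a sketch of Shannon entropy in bits per byte. This is not how "ent" is implemented, just the textbook formula, and the function name is made up.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data):
    """Shannon entropy of the byte distribution, from 0.0 up to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(entropy_bits_per_byte(bytes(range(256))))  # 8.0
print(entropy_bits_per_byte(b"aaaa"))            # 0.0
```

Values close to 8 suggest the data is already dense and compression is unlikely to gain much.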

~~~
gigama
Elegant solution, learned something new today, thanks for posting.

$ ent -bc <filename>

