
Just did it for 28,000 C files. Here are the results:

    a  0.772163
    b  1.2679
    c  1.78209
    d  1.1195
    e  0.881398
    f  1.47252
    g  0.924242
    h  0.358954
    i  1.06756
    j  0.835313
    k  1.41458
    l  0.981729
    m  1.08955
    n  0.9156
    o  0.73849
    p  1.74468
    q  4.2497
    r  1.21577
    s  1.05023
    t  1.03627
    u  1.2967
    v  1.77662
    w  0.396003
    x  13.7292
    y  0.47566
    z  3.78748

The numbers are (relative frequency in C) / (relative frequency in English). So "b" is slightly more common in C than in English, while "w" is much more common in English than in C.
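
For anyone who wants to reproduce this on their own corpus, here's a minimal sketch of the ratio calculation. The English frequency table below is a made-up stand-in covering only a few letters (real tables vary by corpus), and this is not the code actually used:

```python
from collections import Counter

# Made-up English letter frequencies (fractions of all letters), a few
# letters only for illustration; real published tables vary by corpus.
ENGLISH_FREQ = {'e': 0.127, 't': 0.091, 'a': 0.082, 'b': 0.015, 'w': 0.024}

def letter_ratios(text, english=ENGLISH_FREQ):
    """(relative frequency in text) / (relative frequency in English)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    return {c: (counts[c] / total) / f for c, f in english.items()}
```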

The raw counts for symbol characters:

    _  22890057
    ,  10895692
    )  10749798
    (  10745839
    *  9211904
    ;  8187969
    -  6628768
    =  5878296
    >  4428291
    /  3468260
    .  3011078
    {  2212412
    }  2211783
    "  2120264
    &  1647188
    :  1032587
    +  962554
    #  909859
    [  889538
    ]  888722
    <  839910
    |  643903
    %  583092
    !  561462
    \  540456
    '  454201
    @  131199
    ?  112488
    ~  84629
    ^  19064
    $  17922
    `  7272
    [space] 74199965
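
The counting itself is straightforward; something like this sketch would do it (the SYMBOLS set just mirrors the table above plus space — it's an assumption, not the script that produced these numbers):

```python
from collections import Counter
from pathlib import Path

# Symbol characters to tally: the set from the table above, plus space.
SYMBOLS = set("_,()*;-=></.{}\"&:+#[]|%!\\'@?~^$` ")

def symbol_counts(paths):
    """Tally each symbol character across the given source files."""
    counts = Counter()
    for p in paths:
        # errors="replace" so the odd non-UTF-8 byte doesn't abort the run
        text = Path(p).read_text(errors="replace")
        counts.update(c for c in text if c in SYMBOLS)
    return counts
```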



What would be interesting here would be a difference analysis or regression giving the preference for any given key in a given language. E.g.: '|' is highly predictive of shell, '$' of perl, '()' for lisp. Might be fun to do in R.
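
A crude version of that idea, minus the regression: score a snippet against per-language symbol profiles and pick the best match. The profiles below are made-up toy numbers purely to illustrate the shape of it (sketched in Python rather than R):

```python
import math
from collections import Counter

# Toy per-language symbol frequencies — invented for illustration, not
# measured from any corpus.
PROFILES = {
    "shell": {"|": 0.05, "$": 0.04, "(": 0.01, ")": 0.01, ";": 0.02},
    "perl":  {"|": 0.01, "$": 0.08, "(": 0.02, ")": 0.02, ";": 0.04},
    "lisp":  {"|": 0.001, "$": 0.001, "(": 0.10, ")": 0.10, ";": 0.01},
}

def guess_language(snippet, profiles=PROFILES, floor=1e-4):
    """Pick the language whose symbol profile best explains the snippet,
    by summed log-likelihood of the profiled symbols it contains."""
    counts = Counter(c for c in snippet
                     if any(c in prof for prof in profiles.values()))
    scores = {lang: sum(n * math.log(prof.get(c, floor))
                        for c, n in counts.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)
```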


I really need to do reading and research on this, but I'm pretty sure that's what Hidden Markov Models are for. You could watch a webpage go from HTML to javascript and back!
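
For the curious, a two-state character-level HMM decoded with Viterbi is enough to sketch the idea. Every probability below is invented for illustration — a real model would be trained on labeled pages:

```python
import math

# Two hidden states (which language is being emitted) and a coarse
# observation alphabet: '<', '>', or anything else. All numbers made up.
STATES = ("html", "js")
START = {"html": 0.5, "js": 0.5}
TRANS = {"html": {"html": 0.95, "js": 0.05},
         "js":   {"html": 0.05, "js": 0.95}}
EMIT = {"html": {"<": 0.3, ">": 0.3, "other": 0.4},
        "js":   {"<": 0.02, ">": 0.05, "other": 0.93}}

def viterbi(chars):
    """Most likely hidden-state sequence for a string of characters."""
    obs = [c if c in ("<", ">") else "other" for c in chars]
    V = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            ptr[s] = prev
            row[s] = V[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
        V.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```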


Could I ask for one more piece of data: the total number of characters, and maybe the number of lines? That way symbol/alpha/line ratios could be compared across languages.


Yeah, when I get a chance I'll gather together some stats on all the languages I have data on (about 40).

Finder reports 630,942,867 bytes for the whole directory. Assuming most files will be plain ASCII, that should give a good approximation for the total number of characters.


Based on those numbers I gathered some stats about keyboard layouts:

* 18% of all characters are symbols, 12% are spaces and 70% are alphabetic

* 20% of all non-space characters are symbols and 80% are alphabetic.

* US kb layout users need to use shift for 64% of symbols

* Finnish/Swedish kb layout users need to use shift for 73% of symbols and AltGr for 7% of symbols.

* Fi/Swe layout users thus need 25% more modifier keypresses for symbols.

Conclusion: Fi/Swe layout sucks.

edit: https://gist.github.com/1205728 python script used to get these numbers (percentages calculated with OOo Calc).
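
The Shift calculation is simple given the counts above; a sketch (the key sets encode the standard US layout, and this is not the gist's actual code):

```python
# Symbols that need Shift on a US layout, and those that don't.
US_SHIFTED = set('~!@#$%^&*()_+{}|:"<>?')
US_UNSHIFTED = set("`-=[]\\;',./ ")

def shift_fraction(symbol_counts):
    """Fraction of typed symbols that require Shift on a US layout."""
    shifted = sum(n for c, n in symbol_counts.items() if c in US_SHIFTED)
    total = sum(n for c, n in symbol_counts.items()
                if c in US_SHIFTED or c in US_UNSHIFTED)
    return shifted / total if total else 0.0
```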


Do you know why you have slightly different numbers of (){}[] characters? In C or C++ shouldn't those all be paired up to match?


It can probably be accounted for by comments (e.g. people sometimes comment out half a block). Although the comments should be left out, so as not to mix the C and the English.


Also string and character literals. It's common to write something like

    if (c == '{')
and not need to test for the matching one.
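
If you wanted to check pairing properly, you'd strip comments and literals first. A rough sketch — the regex is deliberately crude and won't handle every edge case (line continuations inside literals, trigraphs, etc.):

```python
import re

# Remove C comments and string/character literals before counting brackets.
LITERALS = re.compile(
    r'''/\*.*?\*/           # block comments
      | //[^\n]*            # line comments
      | "(?:\\.|[^"\\])*"   # string literals
      | '(?:\\.|[^'\\])*'   # character literals
    ''', re.S | re.X)

def brace_balance(code):
    """Count (, ), {, } with comments and literals stripped out."""
    stripped = LITERALS.sub(" ", code)
    return {c: stripped.count(c) for c in "(){}"}
```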


Heatmap for symbols only: http://i.imgur.com/yc6fe.png

