
50 terms most predictive of a submission making it to the front page - baccheion
Results are based on stories submitted in 2015.

Overall ROC AUC score: 0.634 (front page: 60.00%, not front page: 84.70%)

      0.480619587060     pdf
      0.400951486155     yc
      0.357299773818     c
      0.345074611885     2013
      0.312349086474     2014
      0.282104533149     2012
      0.267570843408     language
      0.261375936103     2011
      0.241650745481     go
      0.219169898626     haskell
      0.211257263299     show hn
      0.205623463348     rust
      0.199571502981     i
      0.195598572648     programming
      0.195274408452     lisp
      0.186177768937     2010
      0.178721616659     a
      0.167683265531     theory
      0.164806775214     fast
      0.163041352661     linux
      0.160414928257     2009
      0.159121703294     released
      0.158462774218     yc w15
      0.157937056948     0
      0.155722711980     w15
      0.153727530604     memory
      0.152325392129     openbsd
      0.140109894354     compiler
      0.139065804971     an
      0.135471235636     open-source
      0.131606742813     deep
      0.131399688219     unix
      0.131269232599     gnu
      0.131168543633     kernel
      0.125287242638     show
      0.124609324199     firefox
      0.124602305508     os
      0.123616770730     who
      0.120643387605     computer
      0.117908824287     the
      0.115902369932     modern
      0.114483061102     hn what
      0.112337592986     ocaml
      0.111354775352     programmer
      0.109343194868     postgresql
      0.108654447345     math
      0.107721823851     2008
      0.106274292219     in go
      0.106071633308     2006
      0.104932567748     perl
======
dosy
So, for example, a hypothetical post,

    
    
      Show HN: The C Language, YC Guide [pdf, 2013]
    

should be an instant hit, right?

~~~
baccheion
If the list were to be used, I suppose it would serve as a loose guideline.
Maybe it reveals a general trend in which topics are preferred and which are
not.

~~~
dosy
Curious how you would calculate the value of two such terms occurring
together, from the values of single terms given. Is that possible from this
list of values?
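
One hedged reading, not confirmed anywhere in this thread: if the values are
coefficients of a linear model such as logistic regression over title terms,
then the contributions of separate terms simply add in log-odds space. The
sketch below assumes exactly that. The names `combined_score` and
`probability` are made up for illustration, and the `intercept` is unknown, so
0.0 is only a placeholder and just relative comparisons mean anything. It also
cannot recover an interaction between two terms unless the pair has its own
entry, as "show hn" and "in go" do.

```python
import math

# A few weights copied from the list in the original post.
# ASSUMPTION: they are log-odds coefficients of a linear model.
WEIGHTS = {
    "pdf": 0.480619587060,
    "yc": 0.400951486155,
    "c": 0.357299773818,
    "2013": 0.345074611885,
    "language": 0.267570843408,
    "show hn": 0.211257263299,
}


def combined_score(terms, weights=WEIGHTS):
    """Sum the per-term weights; terms missing from the table contribute 0."""
    return sum(weights.get(term, 0.0) for term in terms)


def probability(score, intercept=0.0):
    """Logistic link. The real intercept is not given, so 0.0 is a placeholder."""
    return 1.0 / (1.0 + math.exp(-(score + intercept)))


# Two terms occurring together: under the additive assumption, just add them.
pair = combined_score(["pdf", "yc"])
print(pair, probability(pair))
```

If the scores came from a different kind of model, this addition would not be
valid, and the list alone would not be enough to answer the question.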

------
Tomte
I would have expected „random number“ or „rng“ to be on that list. For quite
some time (much less now) it was like catnip.

------
karmakaze
How is this score calculated? I can't see how 'i' and 'a' would be
predictive.

~~~
cimmanom
Could it be that headlines written in the first person (“I”), or with full
grammar (“a”) rather than dropping the article as abbreviated headlines often
do, correlate with upvotes?

------
baccheion
Also, 100 terms most predictive of a submission _not_ making it to the front
page:

    
    
      -0.335386489547     startup
      -0.331723905544     2015
      -0.321593118669     app
      -0.306937335575     your
      -0.305739531214     how to
      -0.275438550569     this
      -0.261565592652     business
      -0.252649164518     product
      -0.250614203448     mobile
      -0.236041160710     marketing
      -0.227196421746     top
      -0.208139598304     with
      -0.206031814574     5
      -0.203087091676     ios
      -0.202457032685     design
      -0.201021718651     watch
      -0.200267475193     startups
      -0.197466134506     ask
      -0.196357335391     or
      -0.192562683469     10
      -0.191253124976     best
      -0.190867070325     ask hn
      -0.187721441778     cloud
      -0.187394374070     android
      -0.186461809237     smart
      -0.184024063073     you
      -0.183827664018     tips
      -0.182653896122     growth
      -0.181372850037     for
      -0.178198606780     could
      -0.162472422631     blog
      -0.162207059285     java
      -0.160644447613     development
      -0.159487418681     social
      -0.157294135483     should
      -0.156980003088     bitcoin
      -0.150609220130     iphone
      -0.148979317953     tech
      -0.148345714371     testing
      -0.147454333035     change
      -0.145491827860     list
      -0.145485290331     to
      -0.144015642286     3
      -0.143708682318     robot
      -0.142186986230     tools
      -0.140812013948     twitter
      -0.140696100278     rails
      -0.140548788801     software
      -0.138527298008     future
      -0.138172121531     good
      -0.138015521103     internet
      -0.137281744329     facebook
      -0.136342150691     security
      -0.134144777413     content
      -0.133091842596     awesome
      -0.133049592053     angularjs
      -0.133019163138     create
      -0.131147662198     meet
      -0.128568740027     live
      -0.125766592272     wordpress
      -0.125681496867     star
      -0.125433963958     here's
      -0.124980970020     test
      -0.123513256155     day
      -0.123227292738     podcast
      -0.123085547655     feedback
      -0.122558159240     uber
      -0.122365526765     bill
      -0.121846476127     things
      -0.121766619177     online
      -0.121674711692     entrepreneurs
      -0.121271063379     vr
      -0.120835224059     devops
      -0.120704156113     website
      -0.120668008266     resources
      -0.119873591378     tutorial
      -0.119600975052     6
      -0.119263351612     most
      -0.118987167145     api
      -0.118767754130     apps
      -0.118683692890     digital
      -0.116745925093     will
      -0.116477896000     data
      -0.116317401689     needs
      -0.116223838757     need
      -0.115050697065     market
      -0.114878154258     3d
      -0.114105916526     more
      -0.111918004178     help
      -0.111764422735     apple
      -0.111326594562     new
      -0.110914386417     year
      -0.110475338587     customer
      -0.109564041456     technology
      -0.109468606136     iot
      -0.109381535069     application
      -0.109146062602     4
      -0.108483540034     solution
      -0.108171407112     music
      -0.107249340464     drone

~~~
gitgud
Maybe write a blog post or show the methodologies used to arrive at these
results.

Just posting numbers with no way to verify them is unscientific.
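
For reference, the thread never says how the scores were produced. A minimal
sketch of one common way to get a ranked term list like this, assuming a
logistic regression over unigram and bigram title features, might look as
follows. Everything here is hypothetical: the `TOY_TITLES` data is made up, and
`ngrams` and `fit` are illustrative names, not the poster's code.

```python
import math
from collections import defaultdict

# Made-up titles with labels (1 = front page, 0 = not). Purely illustrative.
TOY_TITLES = [
    ("the c programming language pdf", 1),
    ("a fast lisp compiler", 1),
    ("rust memory theory pdf", 1),
    ("top 10 startup marketing tips", 0),
    ("how to grow your mobile app", 0),
    ("best startup growth tips", 0),
]


def ngrams(title, n_max=2):
    """Lowercase, split on whitespace, return unigrams followed by bigrams."""
    words = title.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats.append(" ".join(words[i:i + n]))
    return feats


def fit(examples, epochs=200, lr=0.1):
    """Batch gradient descent on logistic loss; returns (weights, bias)."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        grad = defaultdict(float)
        grad_bias = 0.0
        for title, label in examples:
            feats = ngrams(title)
            z = bias + sum(weights[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label  # derivative of the loss with respect to z
            grad_bias += err
            for f in feats:
                grad[f] += err
        bias -= lr * grad_bias
        for f, g in grad.items():
            weights[f] -= lr * g
    return dict(weights), bias


weights, bias = fit(TOY_TITLES)

# Rank terms from most to least predictive of the positive class.
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
for term, w in ranked[:5]:
    print(f"{w: .6f}  {term}")
```

A real run would also need a held-out set and some regularization before the
numbers could be verified or compared with the ROC AUC quoted in the post.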

