
Show HN: Neural network that impersonates writers - jacob_plaster
https://github.com/JacobPlaster/ann-writer
======
nl
I haven't looked at the code, but glancing at the results leaves me thinking
it might need more work.

The output seems to me to be around the level a Markov chain might produce.
Karpathy's RNN code produces much, much better results[1].

I wonder if manually extracting features and training the network on those is a
mistake? RNNs tend to work well on text because they encode an understanding of
the parse tree themselves.
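
For a concrete sense of that baseline, here is a minimal word-level Markov
chain text generator. This is a plain-Python sketch of the general technique,
not code from either project:

```python
import random
from collections import defaultdict

def build_chain(text, n=2):
    """Map each n-word state to the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - n):
        chain[tuple(words[i:i + n])].append(words[i + n])
    return chain

def generate(chain, length=30, seed=0):
    """Random-walk the chain: repeatedly sample a follower of the
    last n words. This is all a Markov text generator does."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Trained on a novel, this produces locally plausible but globally incoherent
text, which is roughly the level of the linked samples.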

[1] [https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)

------
strong_ai
this doesn't look like a neural net to me. from NeuralNetwork.py

    
    
      from sklearn.neighbors import KNeighborsClassifier
      # Create a sperate neural network for each identifier
      for index in range(0, len(NaturalLanguageObject._Identifiers)):
           nn = KNeighborsClassifier()
           self._Networks.append(nn)

~~~
ching_wow_ka
So then it seems like the author of the code doesn't understand that "NN"
means "Nearest Neighbors" and not "Neural Network"?

He mentions that he used sklearn's Neural Network libraries in his blog post,
but sklearn doesn't have any aside from RBM.

~~~
jacob_plaster
I was confused about the difference between an SVM and a neural network. Easy
mistake, I guess. The whole goal of this project was educational, so I'm still
happy with the outcome.

~~~
stuxnet79
OMG, how embarrassing for OP. This is what scares me about technical blogging:
messing up unknowingly in an area I'm not experienced in and getting scathing
critiques from my fellow hackers. Keep your chin up, OP, and next time remember
to do your homework. +1 for the effort anyway.

------
bearzoo
I am afraid this author has no idea what he is doing and is loosely throwing
around terms he does not understand. What the hell was his normalization
procedure? It's dangerous to readers who don't know the field well and will
come away confused.

------
Turing_Machine
I ran a Markov chain text generator on _Finnegans Wake_ once. It came out
looking much the same. :-)

------
frisco
Fun hack. If anything, it highlights how compelling deep learning and RNNs
are: no messing with NLP, no messing with building other features or adding up
classifiers, etc. The manual feature engineering means it might work better on
a smaller dataset, but even then probably not.

For comparison with Andrej Karpathy's RNN code
([http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/))
training on the "HarryPotter(xxlarge).txt" (76K) file using the default
hyperparameters and a batch size of 25 gets me:

    
    
      > But Atfa the loom proset! No contarin — mibll,’s just pucking to live
      > note left them hard and fitther, clooked of course little happered to
      > trige on the fistpened. Their knew Harry mear from the shind-beas
      > eveided, at Uncle Vernon’s thepped to spept were pelled and beadn
      > Harry, distine dy use. Harry had in a amalout, into the fish sfary door.
    

The difference here is tokenizing on words vs letters: the RNN code is trying
to learn the structure of English from completely zero whereas the code here
gets to work with well-formed words from the beginning. But otherwise, the
results in the linked post are about as silly semantically:

    
    
      > Input: "Harry don't look"
      > Output: "Harry don't look , incredibly that a year for been parents in .
      >   followers , Harry , and Potter was been curse . Harry was up a year ,
      >   Harry was been curse "
    

EDIT: Updated the RNN output text. Was sampling from a checkpoint file for a
different input corpus. Got confused by the long similar-looking filenames.
Doesn't change the overall point though.
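
A toy illustration of that tokenization difference, not tied to either
codebase:

```python
sentence = "Harry never looked back."

# Word-level tokens: the model starts from well-formed words
# and only has to learn how to order them.
word_tokens = sentence.split()

# Character-level tokens: the model must additionally learn
# spelling and word boundaries from scratch.
char_tokens = list(sentence)

print(word_tokens)       # ['Harry', 'never', 'looked', 'back.']
print(len(char_tokens))  # 24 tokens for a four-word sentence
```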

~~~
kylebgorman
I just can't agree that a simple, linear-time operation like tokenizing words
and building basic n-gram models out of them is the tedious problem you seem
to be implying, nor do I feel a solution to this very-solved problem is
"compelling". Word tokenization and n-gram models are simple, unreasonably
effective, and very fast. If character-based RNNs do better (albeit far more
slowly during training), great, but there's nothing to see here; let's move along.

As I've posted here before, people have been training character n-gram models
and getting language modeling performances comparable to those from word-based
models---without using neural networks---for at least a decade. That it works
with RNNs is no surprise because it worked just fine with the much more
constrained predecessor technology.
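
A character n-gram model of the kind described is only a few lines. This is my
own sketch of the general idea, not code from any of the cited work:

```python
from collections import Counter, defaultdict

def char_ngram_counts(text, n=3):
    """Count n-grams: the model predicts the next character
    from the preceding n-1 characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        history, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[history][nxt] += 1
    return counts

def prob(counts, history, ch):
    """Maximum-likelihood estimate of P(ch | history), no smoothing."""
    total = sum(counts[history].values())
    return counts[history][ch] / total if total else 0.0
```

Sampling from these conditional distributions generates text; real language
models add smoothing for unseen histories.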

~~~
frisco
My problem isn't that the feature engineering is expensive or tedious, it's
that it's privileging a lot of information that NNs learn from the data. Yeah
ok, Markov models (n-grams) are simple and fast and produce good results for
generating representative text.

Deep RNNs are simple and produce good results for a huge, diverse range of
problems with no new domain information. As Andrej Karpathy wrote:

> Sometimes the ratio of how simple your model is to the quality of the
> results you get out of it blows past your expectations, and this was one of
> those times.

N-grams don't have nearly the power (e.g. longer-than-N-range structure like
grammar) and don't generalize nearly as well, which makes them a lot less
surprising.

------
caf
Have you considered the copyright on the Harry Potter training data?

~~~
mccracken
fair use - education

~~~
bhaak
I would still be cautious. There is no need to use exactly this text.

Anyway, there are countries that don't have a fair use doctrine, so in those
countries your repository could not legally be used.

IMHO, this is an unnecessary use of copyrighted material when there are
thousands of equally well suited texts that have fallen out of copyright.

~~~
mccracken
What you say is true, although I'm not sure it matters. There are heaps of
art forms which are out of copyright, so why do people critique/parody new
films? Because they're relevant and people understand the content.

------
achompas
> I decided to use scikit's machine learning libraries. [...] The writer I
> create uses multiple SVM engines. One large neural network for the sentence
> structuring and multiple small networks for the algorithm which selects
> words from a vocabulary.

This person has no idea what they're talking about. sklearn has no neural
network code whatsoever.

EDIT: this feels like a testament to sklearn's greatness, honestly.

------
w_t_payne
I'd be interested to know if this could be turned into a tool that tells you
how well your writing (or coding) matches the "house style" (mostly for
technical documentation, requirements specs, etc.).

I'd be even more interested if it could be turned into a sublime text plugin
that highlights words / phrases that deviate most strongly from the house
style.
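
One crude way to sketch that idea: train a character n-gram model on a "house
style" corpus and score new text by its average log-probability under the
model. All names here are hypothetical, and this is a toy, not a real style
checker:

```python
import math
from collections import Counter, defaultdict

def train(corpus, n=3):
    """Count character n-grams in the house-style corpus."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        counts[corpus[i:i + n - 1]][corpus[i + n - 1]] += 1
    return counts

def style_score(counts, text, n=3):
    """Average log-probability of `text` under the model; higher
    means closer to the house style. Add-one smoothing over a
    fixed alphabet keeps unseen characters finite."""
    alphabet = set(text) | {c for f in counts.values() for c in f}
    logp, steps = 0.0, 0
    for i in range(len(text) - n + 1):
        hist, ch = text[i:i + n - 1], text[i + n - 1]
        freq = counts[hist]
        logp += math.log((freq[ch] + 1) / (sum(freq.values()) + len(alphabet)))
        steps += 1
    return logp / max(steps, 1)
```

A plugin could score each sentence this way and highlight the lowest-scoring
ones as the strongest deviations from the house style.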

~~~
jacob_plaster
Good idea. I guess that was my main goal really: learning the structure of
some text. The vocabulary generator was a pretty recent addition, which is why
it is quite inaccurate.

------
scorpwarp23
This is brilliant! I tried it out. Waiting for a larger data set! +1

