
Show HN: Generating fun Stack Exchange questions using Markov chains - Findus23
https://se-simulator.lw1.at/
======
Findus23
Hi everyone, I hope you like my latest side project!

I'm an astronomy student who likes programming in his free time. This time I
wanted to write something that handles larger amounts of data. As I recently
came across the Stack Exchange data dump, which includes all questions and
answers, I had the idea of using it to create Markov chains for (nearly)
every Stack Exchange site. The website displays the resulting content, which
is often surprisingly coherent and entertaining, and allows upvotes/downvotes
so that the best questions get to the front page. And as a bonus, I created a
quiz where one can guess which site a random question is based on.

If you are interested in my other projects, check out
[https://lw1.at](https://lw1.at). If you want to see the code, everything is
open source and can be found here: [https://github.com/Findus23/se-simulator](https://github.com/Findus23/se-simulator)

Please excuse the very minimal design, but after writing a lot of Single Page
Applications I wanted to go the opposite way and write a website with less than
25 KB.

~~~
froindt
> Please excuse the very minimal design, but after writing a lot of Single Page
Applications I wanted to go the opposite way and write a website with less than
25 KB.

The world could use more of this.

I took the simple quiz and my first question was from the Russian language
site. I hadn't considered that Markov chains could work in other languages
too. I wonder if there are any significant differences?

~~~
Findus23
Hi, sorry everyone for the delay in answering, but timezones weren't nice to
me.

I really like minimalist pages, but I would be lying if I said the complete
opposite (like [https://lw1.at](https://lw1.at)) wasn't also fun to program.

The whole text generation is language-independent [1]. I just split the source
text into words (well, technically NLTK tokens) and merge them back together
after generating the text.

Therefore it even kind of works for Japanese [2] or Arabic. It's just a bit
ugly as a lot of questions on non-English sites are in English and the chains
end up as a mix.

[1] [https://github.com/Findus23/se-simulator/blob/master/markov.py](https://github.com/Findus23/se-simulator/blob/master/markov.py) and [https://github.com/Findus23/se-simulator/blob/master/text_generator.py](https://github.com/Findus23/se-simulator/blob/master/text_generator.py)

[2] [https://se-simulator.lw1.at/s/japanese.stackexchange.com](https://se-simulator.lw1.at/s/japanese.stackexchange.com)
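The split-and-rejoin idea is simple enough to sketch in a few lines of plain Python. To be clear, this is not the project's actual code (which uses markovify and NLTK tokens); it is just a minimal word-level Markov chain to illustrate why the approach is language-independent:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()  # the real project uses NLTK tokens instead
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=10, rng=random):
    """Random-walk the chain and join the words back together."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor
        out.append(rng.choice(followers))
    return " ".join(out)
```

Because the only unit is "token", the same logic works on Russian, Japanese, or Arabic text as long as it can be split into tokens and joined back.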

------
Waterluvian
Markov chains are so much fun. They produce believable relevant text that
ultimately makes no sense, which is basically a definition of comedy. And
they're also super simple to understand and implement. I can have lots of fun
without having to do any wild natural language processing.

~~~
Findus23
I have to fully agree. The language-processing part is really simple (partly
due to the really cool markovify library [1]).

You may also like my older project [2] even though it is partly in German. I'm
using Markov Chains for the titles and some custom regex-based language
processing for the descriptions.

[1]
[https://github.com/jsvine/markovify/](https://github.com/jsvine/markovify/)

[2] [https://nonsense.lw1.at/](https://nonsense.lw1.at/)

------
koolba
> Do Greeks driving affect the whaling industry?

I’ve always wondered this as well.

~~~
arbie
> Like all animal abuse in particular: can one truly know?

Comedy gold.

------
Findus23
I have now written a bit more on how to hopefully get this to run locally and
how everything works here:

[https://github.com/Findus23/se-simulator#se-simulator](https://github.com/Findus23/se-simulator#se-simulator)

------
bcaa7f3a8bbc
Ham Radio: [https://se-simulator.lw1.at/q/which-mode-describes-this](https://se-simulator.lw1.at/q/which-mode-describes-this)

> If you can already be synchronized when it comes through the use of your
> test. That's a switching powersupply. I disable AGC in my comments above as
> an antenna analyzer that works depends on the Pi transmit frequency that
> isn't necessary to send an SWL, but let's dig further by adding another
> radial... You have bigger problems.

__Any sufficiently advanced technology is indistinguishable from magic.__

------
chris_wot
I'd like to see Stack Exchange moderator responses using Markov chains. Like
"Stop answering this guy as he is posing useless questions", or "This
duplicate is considered not relevant".

------
everdev
Beautiful.

> Configure anonymous DDOS attacks on internal servers?

~~~
bcaa7f3a8bbc
> How to brute force against quantum computer

> This may or may not be a major flaw with your function F is for the second
> bytes of randomness, this requires access to the algorithm.

> Okamoto-Tanaka Revisited: Fully Authenticated Diffie-Hellman with Schnorr
> signatures and KCDSA are two obvious caveats to this question is off-
> topic...

[https://se-simulator.lw1.at/q/how-to-brute-force-against-quantum-computer](https://se-simulator.lw1.at/q/how-to-brute-force-against-quantum-computer)

------
exabrial
Ok some of these I actually want to know the answer to... Like what
windsurfing equipment is good for deep sea fishing

------
aetherspawn
> Essential windsurfing equipment to fish?

> Any additional info will be suspicious, they've had the card

------
pavel_lishin
> _Remove Broken Lightbulb from the toe?_

> _As it is horizontal? This means the color and material like them a few
> sheets of paper on the internet?_

I don't know how much effort you put into this, but that alone was absolutely
worth it.

------
dkersten
> What is an open-commercial license?

> Why did they determine that he used / invisibility cloak technology?

> How did newton APPROXIMATE THE AREA UNDER THESE PARTICULAR CURVES

------
stared
I am curious how those results would compare with RNN models, such as the ones
in Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural
Networks" [http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

(E.g. as they are able to learn code grammar.)

~~~
Findus23
Hi, I was already expecting a question about neural networks :)

When I had the idea I also toyed around with word-rnn and similar RNN
libraries. The results I got were pretty good, but training was extremely
resource-intensive. CUDA gave an 8x boost, but training one of the smaller
sites still took 20 minutes on my simple graphics card, while creating the
Markov chain is done in 2 minutes.

I also have absolutely no experience with Machine Learning and just the setup
was already quite an experience. So I stuck with what I know and stayed "the
traditional way".

~~~
stared
Interesting. Yes, training takes way more time. In some really big projects
like [https://blog.openai.com/unsupervised-sentiment-neuron/](https://blog.openai.com/unsupervised-sentiment-neuron/): "We first
trained a multiplicative LSTM with 4,096 units on a corpus of 82 million
Amazon reviews to predict the next character in a chunk of text. Training took
one month across four NVIDIA Pascal GPUs, with our model processing 12,500
characters per second."

For smaller ones (and smaller datasets) a few hours of GPU time is considered
fast (at ~$1/h it is not much!). If you want to run it online, there is
[https://neptune.ml/](https://neptune.ml/) (no setup, $5 free credit for
computing; full disclosure: created by my colleagues).

In any case, I would be excited to see it on a site with code (like SO) or
formulae (math or stats), especially as I am a big fan of Stack Exchange and
its analysis (vide [http://p.migdal.pl/tagoverflow/](http://p.migdal.pl/tagoverflow/) :)).

~~~
Findus23
Sounds nice, but my plan was to find out how much data I can handle easily on
my plain simple desktop PC.

You may have seen that you can filter by site [1]. Code quite ruined my
chains, as it didn't just appear in blocks but rather everywhere, so I went
the easy way and filtered out all code blocks.

I didn't filter math as I couldn't find a proper way to do it, but you can see
that it gets quite messy [2].

[1] [https://se-simulator.lw1.at/s/math.stackexchange.com](https://se-simulator.lw1.at/s/math.stackexchange.com)

[2] [https://se-simulator.lw1.at/q/a-e-3-how-to-span](https://se-simulator.lw1.at/q/a-e-3-how-to-span)
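The kind of filtering described above can be sketched in a few lines: the data dump stores post bodies as HTML, so code can be dropped by stripping `<pre>`/`<code>` elements before the text is fed to the chain. This is an illustration with a naive regex approach, not the project's actual code:

```python
import re

# Remove code blocks (<pre>...</pre>) and inline code (<code>...</code>)
# first, then strip any remaining HTML tags, leaving plain prose.
CODE_RE = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def strip_code(html):
    """Return the prose of a post body with code and markup removed."""
    text = CODE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())
```

Math is harder to filter this way because MathJax formulas are delimited by `$`/`$$` inside ordinary text rather than wrapped in a dedicated HTML element.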

~~~
stared
Well, then taking a subset of math and leaving it overnight will work. (No
need to run it on ALL sites.)

------
jpatokal
Turns out the random mishmash of pseudoscience and conspiracy theories that is
Skeptics.SE is hilarious fodder for a Markov chain:

[https://se-simulator.lw1.at/q/do-greeks-driving-affect-the-whaling-industry](https://se-simulator.lw1.at/q/do-greeks-driving-affect-the-whaling-industry)

------
fourthark
This is a lot of fun.

Why are some of the choice buttons colored in the easy quiz? Sometimes it
seemed like a hint, sometimes just random.

~~~
krallja
For the colors, see
[https://stackexchange.com/sites](https://stackexchange.com/sites) -- it
seems to be an approximation of the actual sites' themes/branding.

~~~
Findus23
Exactly, I tried to keep the themes of the individual sites visible.
Unfortunately the colors from the API are quite ugly and don't represent the
sites properly, so I did some manual collecting:

[https://github.com/Findus23/se-simulator/blob/master/extra_data.py](https://github.com/Findus23/se-simulator/blob/master/extra_data.py)

------
jeroen
Is there supposed to be a close button on the yellowish popup? It's not on
screen on my iPhone SE.

~~~
Findus23
Hm, there ought to be one in the top right corner.

Can you try tapping there even though you don't see it?

------
mmirate
Next step: post these questions to the Stack Exchange sites from whence they
were generated.

------
pbhjpbhj
I wonder: if you split the corpus and only used low-voted questions/answers
for one corpus and edited questions + high-voted answers for the other ...
could we tell the difference in the output chains?

SE often has translated questions, or questions of low quality, IME.

~~~
Findus23
In a way I did that for Stack Overflow. All posts together are 60 GB, which is
a bit too much to handle the whole chain on my desktop PC. So I only used
questions and answers with a score >= 10 there, to get it down to a similar
size as the other sites. It still took half an hour to parse the XML and
another hour to create the chain.
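Parsing a 60 GB Posts.xml without loading it into memory is the kind of job `xml.etree.ElementTree.iterparse` handles well. A rough sketch of the score filter described above (the `row`/`Score`/`Body` names match the data dump's Posts.xml schema, but treat this as an illustration, not the project's actual code):

```python
import xml.etree.ElementTree as ET

def high_scored_bodies(source, min_score=10):
    """Yield post bodies with Score >= min_score, streaming row by row."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            if int(elem.get("Score", "0")) >= min_score:
                yield elem.get("Body", "")
            elem.clear()  # free the finished row so memory stays flat
    return
```

Clearing each element after use is what keeps this feasible on a desktop PC: only one row is held in memory at a time instead of the whole tree.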

------
sudouser
awesome, reminds me of ‘how is babby formed’

