Hi everyone, I hope you like my latest side project!
I'm an astronomy student who likes programming in his free time. This time I wanted to write something that handles larger amounts of data, and as I recently came across the Stack Exchange data dump, which includes all questions and answers, I had the idea of using it to create Markov chains for (nearly) every Stack Exchange site.
The website displays the resulting content, which is often surprisingly coherent and entertaining, and allows upvotes/downvotes so that the best questions get to the front page.
And as a bonus, I created a quiz where one can guess which site a random question is based on.
Please excuse the very minimal design, but after writing a lot of Single Page Applications I wanted to go the opposite way and write a website with less than 25KB.
>Please excuse the very minimal design, but after writing a lot of Single Page Applications I wanted to go the opposite way and write a website with less than 25KB.
The world could use more of this.
I took the simple quiz and my first question was from the Russian language site. I hadn't considered that Markov chains could work in other languages too. I wonder if there are any significant differences?
Hi, sorry everyone for the delay in answering, but timezones weren't nice to me.
I really like minimalist pages, but I would be lying if I said the complete opposite (like https://lw1.at) wasn't also fun to program.
The whole text generation is language-independent [1]. I just split the source up into words (well, technically NLTK tokens) and merge them back together after generating the text.
Therefore it even kind of works for Japanese [2] or Arabic. It's just a bit ugly, as a lot of questions on non-English sites are in English and the chains end up as a mix.
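For anyone who wants to play with the idea, here is a minimal sketch of this token-based approach, assuming the markovify and nltk libraries (the corpus file name and state size are just illustrative, not the project's actual code):

```python
import markovify
import nltk  # requires a one-time nltk.download("punkt") for the tokenizer


class TokenizedText(markovify.Text):
    # Split each sentence into NLTK tokens instead of whitespace-separated words ...
    def word_split(self, sentence):
        return nltk.word_tokenize(sentence)

    # ... and merge the generated tokens back together afterwards.
    def word_join(self, words):
        return " ".join(words)


with open("questions.txt") as f:  # hypothetical file with the cleaned post texts
    model = TokenizedText(f.read(), state_size=2)

print(model.make_sentence())  # may return None if no valid sentence is found in time
```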
I'm not going to lie, I started reading this thinking it was an example. I was very impressed, as Markov chain-generated sentences often break down after a common joining word.
I think it works better than other examples simply because your typical StackOverflow question is more often than not a fragmented sentence missing a few words, which are then meant to be inferred from the body of the question.
I suspect this has to do with the question length limit.
Markov chains are so much fun. They produce believable relevant text that ultimately makes no sense, which is basically a definition of comedy. And they're also super simple to understand and implement. I can have lots of fun without having to do any wild natural language processing.
I have to fully agree. The language-processing part is really simple (partly due to the really cool markovify library [1]).
You may also like my older project [2] even though it is partly in German. I'm using Markov Chains for the titles and some custom regex-based language processing for the descriptions.
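As a rough illustration only (not the actual code of that project), generating titles from a file with one existing title per line could look like this with markovify:

```python
import markovify

with open("titles.txt") as f:  # hypothetical file: one existing title per line
    title_model = markovify.NewlineText(f.read())

# make_short_sentence caps the length, which is handy for title-sized output.
print(title_model.make_short_sentence(80))
```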
> If you can already be synchronized when it comes through the use of your test. That's a switching powersupply. I disable AGC in my comments above as an antenna analyzer that works depends on the Pi transmit frequency that isn't necessary to send an SWL, but let's dig further by adding another radial... You have bigger problems.
Any sufficiently advanced technology is indistinguishable from magic.
I'd like to see Stack Exchange moderator responses using Markov chains. Like "Stop answering this guy as he is posing useless questions", or "This duplicate is considered not relevant".
> This may or may not be a major flaw with your function F is for the second bytes of randomness, this requires access to the algorithm.
> Okamoto-Tanaka Revisited: Fully Authenticated Diffie-Hellman with Schnorr signatures and KCDSA are two obvious caveats to this question is off-topic...
Hi, I was already expecting a question about Neural Networks :)
When I had the idea I also toyed around with word-rnn and similar RNN libraries. The results I got were pretty good, but training was extremely resource-consuming. CUDA gave an 8x boost, but training one of the smaller sites still took 20 minutes on my simple graphics card, while creating the Markov chain is done in 2 minutes.
I also have absolutely no experience with Machine Learning, and just the setup was already quite an experience. So I stuck with what I know and went "the traditional way".
Interesting. Yes, training takes way more time. In some really big projects, like https://blog.openai.com/unsupervised-sentiment-neuron/, it can take far longer:
"We first trained a multiplicative LSTM with 4,096 units on a corpus of 82 million Amazon reviews to predict the next character in a chunk of text. Training took one month across four NVIDIA Pascal GPUs, with our model processing 12,500 characters per second."
For smaller ones (and smaller datasets) a few hours of GPU time is considered fast (at ~$1/h it is not much!). If you want to use it online, there is https://neptune.ml/ (no setup, $5 free credit for computing; full disclosure: created by my colleagues).
In any case, I would be excited to see it on some site with code (like SO) or formulae (math or stats). Especially as I am a big fan of StackExchange and analyzing it (see http://p.migdal.pl/tagoverflow/ :)).
Sounds nice, but my plan was to find out how much data I can handle easily on my plain simple desktop PC.
You may have seen that you can filter by site [1].
Code quite ruined my chains, as it didn't appear in blocks but rather everywhere, so I went the easy way and filtered out all code blocks.
I didn't filter math, as I couldn't find a proper way to do it, but you can see that it gets quite messy [2].
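Roughly, stripping the code works like this (a simplified sketch assuming BeautifulSoup for the HTML handling, not the exact code from the repo):

```python
from bs4 import BeautifulSoup


def strip_code(body_html):
    """Drop <pre> and <code> elements from a post body so snippets never enter the chain."""
    soup = BeautifulSoup(body_html, "html.parser")
    for tag in soup.find_all(["pre", "code"]):
        tag.decompose()  # remove the element together with its contents
    return soup.get_text(" ", strip=True)


print(strip_code("<p>Try <code>rm -rf tmp/</code> instead.</p>"))  # -> "Try instead."
```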
Exactly, I tried to keep the themes of the individual sites visible.
Unfortunately the colors from the API are quite ugly and don't represent the sites properly, so I did some manual collecting:
I wonder: if you split the corpus and only used low-voted questions/answers for one corpus and edited questions + high-voted answers for the other ... could we tell the difference in the output chains?
SE often has translated questions, or questions of low quality, IME.
In a way I did that for Stack Overflow. All posts together are 60GB, which is a bit too much to build the whole chain on my desktop PC.
So I only used questions and answers with a score >= 10 there to get it down to a similar size as the other sites.
It still took half an hour to parse the XML and another hour to create the chain.
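The parsing itself is just one streaming pass over Posts.xml, roughly like this (a simplified sketch; the row attributes follow the public data dump schema, the rest is illustrative and not the exact repo code):

```python
import xml.etree.ElementTree as ET

MIN_SCORE = 10


def high_scored_bodies(path="Posts.xml"):
    """Yield the Body of every post with Score >= MIN_SCORE without loading the file into memory."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # grab the root element from its start event
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            if int(elem.get("Score", "0")) >= MIN_SCORE:
                yield elem.get("Body", "")
            root.clear()  # drop already-processed rows so memory stays flat


for body in high_scored_bodies():
    pass  # feed each body into the cleaning and chain-building steps above
```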
If you are interested in my other projects, check out https://lw1.at. If you want to see the code, everything is Open Source and can be found here: https://github.com/Findus23/se-simulator