I'd also like to say that I highly recommend everyone does projects like this from time to time. I don't know any way to gain programming skills faster than making small throwaway projects using new tools and techniques.
Although now that I think about it, I could probably get close by using an 8-bit greyscale image and then applying the palette in the client. That would probably halve the image size.
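A rough sketch of what applying the palette on the receiving side could look like, assuming the image arrives as one palette index per byte. The palette values and the `apply_palette` name are made up for illustration, not taken from the site, and a real client would do this in browser JavaScript rather than Python:

```python
# Hypothetical 3-colour palette: index -> (R, G, B).
PALETTE = [(255, 255, 255), (0, 0, 0), (255, 0, 0)]

def apply_palette(indices, palette):
    """Expand one-byte palette indices into RGBA bytes (alpha fixed at 255)."""
    out = bytearray()
    for index in indices:
        r, g, b = palette[index]
        out += bytes((r, g, b, 255))
    return bytes(out)
```

The indexed form carries 1 byte per pixel and only grows to 4 bytes per pixel after it has reached the client.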
Take the bare bones of this project:
- Canvas.
- Websockets.
That's literally it. You'll need to know how to draw on a canvas, and how to send and receive WebSocket messages. You can quite happily keep the current state of the canvas in an in-memory array, perhaps saving it to a file every few minutes in case the server crashes. Then, when that's done, you can swap out your in-memory array for a Redis bitfield and change the WebSocket messages from JSON to binary. Both of those should only be a few tens of lines of changes, but after that you'll be able to support tens of thousands of simultaneous users with hundreds, if not thousands, of changes per second.
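To make the "in-memory array plus binary messages" idea concrete, here is a minimal sketch. The canvas size, the one-byte-per-pixel layout and the 5-byte packet format (`x`, `y`, `color`) are all assumptions for illustration, not the format the real site uses:

```python
import struct

WIDTH, HEIGHT = 1000, 1000           # assumed canvas size
canvas = bytearray(WIDTH * HEIGHT)   # one palette index per pixel, all zero at start

# Assumed wire format: little-endian uint16 x, uint16 y, uint8 color = 5 bytes.
EDIT = struct.Struct("<HHB")

def encode_edit(x, y, color):
    """Pack a single pixel edit into a 5-byte binary message."""
    return EDIT.pack(x, y, color)

def apply_edit(buf, packet):
    """Decode a binary edit and write it into the in-memory canvas."""
    x, y, color = EDIT.unpack(packet)
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        raise ValueError("edit is outside the canvas")
    buf[y * WIDTH + x] = color
    return x, y, color
```

For comparison, the same edit written as JSON (`{"x": 3, "y": 2, "color": 7}`) is several times larger than the 5-byte packet.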
The thing that's complex about this project is the number of users it has to support at once. Lessen the requirements a little and you'll come up with a simple project.
Looks like you genuinely felt it is out of the scope of a lot of novices (which it well might be at least for a weekend project), and you genuinely were looking for tips.
Just look for projects that would make yourself or others happy. Doesn't matter how small or large, whether it would take an hour or a decade, the hardest part is getting past line zero. Once you have that, the next hardest part is getting to line one; wash, rinse, repeat.
The good news is everything is relative and everyone has different skill sets - what's hard for everyone else might be a cinch for you due to your childhood fascination with blending fishing line with champagne corks - who knows? ;)
Try not to be discouraged, just stick with it and be infinitely inquisitive, throw caution to the wind and dive right in.
Good luck and happy sailing!
I think all my technology choices worked out fine. I dumped server-sent events halfway through in favour of websockets because WS support binary packets. But that was a pretty easy change affecting at most 50 lines of code.
I still wish we had an efficient (native) solution for broadcasting an event sourcing log out to 100k+ browser clients, with catchup ('subscribe from kafka offset X'). Nodejs is handling the load better than I expected it to, but a native code solution would be screaming fast. It should be relatively simple to implement, too. Just, unless my google-fu is failing me I don't think anyone has done it yet.
Seems to me like you just described long-polling, which you dismissed in the article as "so 2005".
Yes, I did dismiss it out of hand in the article. The longer response is this:
In this instance long polling would require every request to be terminated at the origin server. I need to terminate at the origin server because every connection will start at a different version. The origin server in this case is running JS, and I don't want to send 100k messages from javascript every second. Performance is good enough, but barely. And with that many objects floating around the garbage collector starts causing mischief.
The logic for that endpoint is really simple - it just subscribes to a kafka topic from a client-requested offset and sends all messages to the client. It would be easy to write in native code, and it would perform great. After the per-client prefix, each message is just broadcast to all clients, so you could probably implement it using some really efficient zero-copy broadcast code.
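The shape of that endpoint can be sketched without Kafka at all. This is a single-threaded, in-memory stand-in, not the real implementation: the `Broadcaster` class, its method names and the idea of using list indices as offsets are all assumptions made for the example.

```python
class Broadcaster:
    """In-memory stand-in for 'subscribe from offset X, then get everything live'."""

    def __init__(self):
        self.log = []      # the message at index i has offset i
        self.clients = []  # callables taking (offset, message)

    def subscribe(self, from_offset, send):
        # Per-client prefix: replay whatever this client has not seen yet.
        for offset in range(from_offset, len(self.log)):
            send(offset, self.log[offset])
        # From here on the client receives the same messages as everyone else.
        self.clients.append(send)

    def publish(self, message):
        offset = len(self.log)
        self.log.append(message)
        for send in self.clients:
            send(offset, message)
        return offset
```

Only the replay loop in `subscribe` is per-client work. Everything in `publish` is identical for every client, which is the part a native implementation could hand to efficient broadcast code.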
The other approach is to bunch edits behind constantly-created URLs, and use long-hanging GETs to fetch them. I mentioned that in the blog post, but it's not long-polling - there's no poll. It's just old-school long-hanging GETs. I think that would work, but it requires an HTTP request/response for each client, for each update. A native solution using websockets would be better. (From memory WS have a per-frame overhead of only about 4 bytes)
Nice work on all this!
The server that both the app and my blog are on only has 1 core and 1.5gb of ram. I didn't pay close enough attention to kafka, and it turns out kafka was using half of the available ram, starving the rest of the processes. And it didn't help that people's bots were thrashing POST requests at nginx to edit the page.
I've just done some high fructose server maintenance, spinning up a new machine with 4 cores and 8 gigs of ram. The old site is proxying all traffic across, and once DNS propagates it'll stop being hit at all. Hopefully that'll ease the congestion.
Edit: at the time of writing everything seems back up and happy. Nginx was running out of open file descriptors, kafka was eating all the ram and ghost (my blogging platform) wasn't sending the right cache-control headers.
A few tweaks and a bit more CPU to play with and everything seems happier now.
How did you go about preventing spam? If you feel it might give the spammers the information to circumvent your measures, consider writing a blog post later.
How many users did the site receive over the course of the past 2 days?
Really cool weekend project btw, this is literally the first time I have seen someone follow through on the "I can build X in a weekend" claims.
I can't really hide anything - all the code is on github, though it's ... um, ... not pretty. It might be better if nobody looks. http://github.com/josephg/sephsplace
But the system at the moment is really simple: The site only accepts 10 edits per 10 seconds from any IP address. After that your edits get rejected until the next 10 second window begins. You can write bots to draw things for you (and lots of the images you see are drawn this way). But drawing big objects is slow. So that's ok. I think that's a reasonable compromise between bots being powerful and humans being powerful.
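For anyone who wants to try the same scheme, a fixed-window limiter like the one described (10 edits per 10 second window, per IP) can be sketched in a few lines. The class and parameter names are invented for the example, and the injectable `clock` is only there so it can be tested without sleeping:

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` edits per `window` seconds for each IP address."""

    def __init__(self, limit=10, window=10.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.counts = {}  # ip -> (window index, edits seen in that window)

    def allow(self, ip):
        window_index = int(self.clock() // self.window)
        seen_index, count = self.counts.get(ip, (window_index, 0))
        if seen_index != window_index:
            count = 0  # a new window has begun, so the counter resets
        if count >= self.limit:
            self.counts[ip] = (window_index, count)
            return False
        self.counts[ip] = (window_index, count + 1)
        return True
```

Note that this sketch never forgets an IP address, so a long-running server would want to prune `counts` now and then.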
The giant Angela Merkel image (and some other smut that I deleted) was drawn by someone proxying edits through about 200 IP addresses. I don't know if they're using Tor, or have access to a botnet, or are using an anonymizing proxy or something. I could tell they were all the same botnet because all the requests had the user agent 'python-requests/2.10.0'. (I have an IP address list if anyone wants to take a look.)
Anyway, I figured those addresses are probably hard to replace - so I let them do it, harvested the addresses and banned them all. Worse, I made it impossible to tell which servers are banned - the server replies to banned servers like normal - their edits just never appear.
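The "reply like normal, but drop the edit" trick is tiny in code. A sketch, where the response shape, the `handle_edit` name and the example address (from the reserved documentation range) are all made up:

```python
banned = {"203.0.113.7"}  # placeholder address, not a real banned IP
applied = []              # edits that actually reach the canvas

def handle_edit(ip, edit):
    """Return the same response to everyone; only apply edits from unbanned IPs."""
    if ip not in banned:
        applied.append(edit)
    # Banned and unbanned clients get an identical reply.
    return {"status": 200, "ok": True}
```

From the outside, the only way to notice the ban is to check whether the pixel actually changed.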
I caught about 2/3rds of their addresses before they started making their headers match real browser traffic, but I think I ruined their fun and they stopped.
I have a few more tricks up my sleeve I could pull out if that little war escalated further. For example, I could always add a captcha you have to fill out when you open the page. The captcha would generate a token that you would have to provide with each edit. Rate limiting would then be by-token. Bots would still work, but you would have to give the token to your script. But getting around the rate limiting would be harder.
> How many users did the site receive over the course of the past 2 days?
Um, I'm embarrassed to say I don't know! I don't have a log of which IP addresses generated each edit. Monitoring and logging seemed much less important at the time. It's obvious in retrospect, but I wish I'd sent each edit along with a timestamp and some metadata into a separate kafka queue when I received them. That way I would have a complete audit log to play with now. I have all the edits - but I have no way of knowing who did what, or how many unique visitors I've had.
The site was called 'blograffiti.com' - Remnants of it still exist on the wayback machine.
Fantastic work by the way. And thanks for the post about it all. I love weekend projects like this. That's exactly what kicked blograffiti off. (And most of my most valuable projects, come to think of it!)
I'm thinking a grid of Litecoin addresses would work well to limit abuse of infrastructure to gain advantage, while still allowing bot activity. A payment to a given address would last for the amount paid divided by the cost of ownership per time period.
I have been doing the same intuitively for as long as I can remember but never stopped to realize this or why. I wonder what else I've learned by doing like this that I now use unconsciously.
Teachers typically want to first make up requirements and use-cases, then functional design, then technical design, then either code and tests or first tests then code (depending on the teacher)... Basically, you wait till 60-70% of the work is done to discover design flaws. Later on we had some Agile stuff as well, but more as a "this also exists" rather than "this is how it's done". Doing some prototyping and benchmarking to see whether something works at all was never part of anything.
One, exactly one subject ever had a performance requirement: 1000 simultaneous users in a multiplayer game. And it had to work over Java RMI (which makes no sense at all). I was the only person of two classes who pushed for (and was finally granted) the use of raw sockets. I was the only person who took this as a challenge and ran thousands of prototype clients on the school's computing cluster on a Saturday night so I wasn't taking anyone's compute time. They never even looked at it. But next Wednesday is the last thing I will ever have to do for that school (unless I have to resit) and I'm so happy I'm done with their shit and can do my own thing next. Properly.
And also arguably the university teaching focuses on these companies. That is why you have all of these fancy ways of encapsulating dependencies and wrapping them into oblivion.
Funnily enough a really similar thing happens in other degrees. I studied Industrial Engineering and can calculate whatever you want about the kinematics of a robot arm, but it wasn't until I teamed up with some friends to learn how to make one from scratch that we really knew what it was all about.
For example, first draw = free. 2nd redraw 10 seconds, 3rd redraw 20 seconds... capped at 5 minutes. Not that the actual implementation is that important.
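One way to read that schedule, as a sketch: the wait grows by 10 seconds for every time the pixel has already been drawn, up to the 5 minute cap. Counting draws per pixel is an assumption about what was meant, and the function name is made up:

```python
def cooldown_seconds(times_drawn_before, step=10, cap=300):
    """Wait time before drawing on a pixel that was already drawn N times."""
    return min(times_drawn_before * step, cap)
```

So a fresh pixel costs nothing (`cooldown_seconds(0)` is 0), the next two redraws cost 10 and 20 seconds, and the cap is reached once a pixel has been drawn 30 times.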
One of the first things that happened when the site went up was that someone started drawing something, and then someone else immediately started spamming junk pixels over the top of it. Despite being able to draw literally anywhere else in the world they thought the best use of their time was to ruin someone else's creation. It was kind of disappointing to witness.
I might make that change now actually - the simplest form of that is very easy. I can just make white pixels cheaper to draw on, and for everything else there are stiffer rate limiting penalties. (Which isn't quite what you said, but I think it's the MVP version of it)
(Edit: This is implemented now. You can draw over 25 white pixels in each 10 second window, but only 10 colored pixels)
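One possible way to get those two numbers out of a single limiter is a per-window budget where the cost of an edit depends on what is already under it. This is a guess at a workable scheme, not how the site does it: the budget of 50 and the costs of 2 and 5 are chosen only because they give 25 white or 10 colored edits per window.

```python
WHITE = 0          # assumed palette index for white
BUDGET = 50        # assumed budget per IP per 10 second window
COST_WHITE = 2     # 50 / 2 = 25 edits over white pixels
COST_COLORED = 5   # 50 / 5 = 10 edits over colored pixels

def edit_cost(current_color):
    """Drawing over a white pixel is cheaper than drawing over anything else."""
    return COST_WHITE if current_color == WHITE else COST_COLORED

def spend(remaining, current_color):
    """Return (new remaining budget, whether the edit is allowed)."""
    cost = edit_cost(current_color)
    if cost > remaining:
        return remaining, False
    return remaining - cost, True
```

A side effect of a shared budget is that mixed edits trade off against each other, for example 5 colored overwrites leave room for 12 white ones in the same window.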
So much for my big talk about performance numbers. I'm fixing it as fast as I can.
I've spun up a new, much bigger server to handle the load. I'm just waiting for DNS to propagate and it should start running much smoother.
I have to admit, I remember stumbling across your comment when you accepted the challenge and in my mind I scoffed, thinking you were never going to do it. Boy was I wrong! Once again, this is super cool.
I feel like there's two kinds of people who make bold statements like that: There's young people who are suffering from the Dunning-Kruger effect - inexperienced but think they're hot shit. Then there's people who've actually done a lot of hackathon-type events and as a result know what it takes to pull them off successfully. (Time, caffeine, and a deep familiarity with your tools.)
Congratulations on following through, and demonstrating your core premise!
What were the top things that you felt weren't captured by that premise--for instance, undocumented decisions that you had to discover on your own, or cases where you made tradeoffs that led to unexpected complexity? Were they mainly around bot-mitigation?
> that much of the time and difficulty in doing something novel is making many of the tough decisions, and that once those design and technical decisions are made (and revealed), it seems "obvious" to others, and is judged simple in comparison.
Yes - one of the things that drew me to the project was how building this in an event-sourcing style fits so well here. Doing it that way solves some of the architecture problems reddit talked about in their blog. It seems obvious to me that this is a good approach, but obviously not everyone shares that view!
> What were the top things that you felt weren't captured by that premise--for instance, undocumented decisions that you had to discover on your own, or cases where you made tradeoffs that led to unexpected complexity? Were they mainly around bot-mitigation?
That's a great question, but I didn't spend much time surprised.
The thing I was most concerned about was kafka, but integrating kafka turned out to be delightfully easy. I had to write some code to buffer recent operations in my server for catchup - I wish kafka had an API for that, but that wasn't hard to work around.
I think getting notifications working would have been a time sink but I explicitly removed them from the spec so I wouldn't have to deal with them.
It took me way too long to get kafka actually running through systemd on my linode. But I've spent enough time with apt-get that I wasn't surprised, just disappointed.
I was surprised how quickly people started drawing smut, and how much time I needed to spend early on cleaning things up or writing tools to remove large bot-drawn genitals.
There are still a lot of decisions around rate limiting that I feel uneasy about. I worry that reddit's 5 minute rule wouldn't work for a little website like mine. I allow ~1 edit per second. Is that a good idea? I don't know. It's an expensive experiment to try different values and see what happens because there's a community involved. And I don't have reddit's huge user base. But maybe I'm being unnecessarily risk averse by allowing so many edits. Forcing slow editing is bolder - it requires a longer commitment to draw, but is probably also much more satisfying to people who create content.
A couple of days ago I remember reading how difficult it was to deploy Oracle on Linux and how Docker made this a breeze. I wonder if Kafka would fall under the same premise.
I think it's a nice tool for deployment and making reproducible builds, but a lot of other things become harder through docker - like managing a database's data, and communication between local processes.
Maybe the tooling has improved in the last few years, but I've gone back to the raw unix coalface.
It doesn't have to be this way. If you use shared folders to persist data on the host you are in no worse position than you would be in if you used a natively installed app, persistence-wise.
I think Docker's focus on orchestration (which makes business sense for them) is the reason why running DBs in containers got a bad reputation. But really, if you use shared dirs with the host and view containers as processes you can use them for DBs too.
IPC with containers OTOH forces you to architect the system as a bunch of microservices, which is usually not a bad idea either.
You don't even need to use a dependency like Kafka. We built a tool for tracking the lifecycle of software we ship, it uses Event Sourcing with snapshots and is open source: https://tech.fundingcircle.com/blog/2016/09/06/shipping-in-f...
I have no clue who those people are. It's just anarchy and the only thing we have in common is a canvas :-)
I haven't used it before, and the comments on your writeup make it sound more approachable than I'd expected.
However, it seems to me that Kafka is unnecessary in this system. It's clear that, at least in the final version, the system isn't designed to scale beyond one application server. For one thing, you're storing the ban list on the local filesystem. So it's definitely not 12-Factor compliant. And you're storing a local snapshot of the image. So why send the edits out to Kafka, only to have them come back in to the same process?
But right now to help deal with load the process is running across 2 machines. I just had to manually copy the ban list and snapshot database files across. When the server came up it pulled the snapshot version out of the database file, caught up from the kafka log and went to work.
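That startup sequence (load snapshot, replay the log from the snapshot's version, start serving) is the core of the event-sourcing approach. A toy sketch, where a Python list stands in for the kafka log, a tiny grid stands in for the canvas, and all names are invented:

```python
WIDTH, HEIGHT = 4, 2  # tiny canvas, just for the example

def restore(snapshot_version, snapshot_pixels, log):
    """Rebuild current state from a snapshot plus the edits made after it.

    `log` is the full ordered edit log; entry i is the edit that takes the
    canvas from version i to version i + 1.
    """
    pixels = bytearray(snapshot_pixels)  # copy so the snapshot stays untouched
    version = snapshot_version
    for x, y, color in log[snapshot_version:]:
        pixels[y * WIDTH + x] = color
        version += 1
    return version, pixels
```

Because every server replays the same log in the same order, two machines that start from the same copied snapshot file end up with identical canvases.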
Having a nice solution to distribute those files would be lovely - but I made the whole project start to finish in 2 days. I'm not going for 12 factor compliance here.
You are an inspiration! If people grasp what you've done then you've lowered the fear people have of copying things. That should lead to more attempts, failures, improvements, competition, a true catalyst!
> Seph's law: Programming is 95% decisions and 5% typing.
:) In real projects there's usually the other 95% of the time spent reading the existing code and figuring out how it works. But that's much harder to fit in a glib saying.
One decision that I don't agree on is the choice to send messages in order.
I usually prefer to flood the client with messages, attach a timestamp (a monotonically increasing integer) to each message, and have the client re-order everything.
It is cheaper from the server's point of view, and the work is done by the clients.
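For reference, the client-side piece of that approach is roughly a reorder buffer keyed on the integer attached to each message. This is only a sketch with invented names, and it assumes the integers are consecutive with no gaps:

```python
class ReorderBuffer:
    """Hold out-of-order messages and release them in sequence order."""

    def __init__(self, next_seq=0):
        self.next_seq = next_seq  # the next sequence number to deliver
        self.pending = {}         # seq -> message, received but not yet deliverable

    def receive(self, seq, message):
        """Store a message and return every message that is now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate, ignore it
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def missing(self):
        """Sequence numbers the client would have to ask the server for again."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]
```

Nothing after a gap is delivered until the missing message turns up, so the client also needs some way to re-request whatever `missing()` reports.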
Is there any reason why you picked your specific solution?
Just a technical question, that I am very curious about. I guess that there are concerns that I am overlooking at the moment...
And if you did that the client would need to be able to track and fetch lost messages. That adds client complexity and server complexity for the extra endpoints.
My imagination is haunted by premonitions of bugs. In one ghostly image I see edits getting silently lost sometimes and not knowing why. Just, sometimes if I draw a line, on your screen you see the occasional block missing until you refresh your browser. In another premonition I imagine a packet reordering system thinking it's missing a single operation and waiting on it forever. To the user it looks like everything has frozen completely.
We have well-implemented protocols that deliver messages in order. I see no reason not to use them.
The work to keep everything in order definitely has to be done somewhere. You are doing it twice: once by actually sending the messages in order, and again in the TCP stack. Clearly the one in the TCP stack "comes for free" for anybody reasonable.
Like yours, my mind is also haunted by possible bugs. However, in cases like this I prefer to borrow from the Erlang philosophy and embrace possible failure. The way I model this problem is that the packets usually arrive in the same order, or one close enough, to the order in which they were sent. It is rare that a single packet gets lost, but if this happens I want to be able to re-request it without blocking the whole rendering.
I would have accepted receiving messages slightly out of order, with a guarantee on something like the last 10/20 messages.
Would it require some reimplementation of the TCP stack? For sure! Would it be the common case? Definitely not! Would it make the architecture more resilient? I believe so but I may be wrong.
Now, to be clear, your work is amazing! I am really glad that you shared it and it is a pleasure to have this kind of technical conversation. Given your technical and time constraints I would have done just the same.
Just wondering if you have any thoughts on my counter points.
Happy Easter!
I'm curious about the algorithm for bans, once you're done with the project I hope that you will disclose it.
I replied to another comment with details:
Someone get this man a chocolate egg, he deserves it!
Spamming swastikas, though, this one is a known modus operandi of trolls worldwide.
This reference is better than a swastika though. http://i.imgur.com/sFwteao.png
