Hacker News
More challenging projects every programmer should try (utk.edu)
757 points by pavehawk2007 4 months ago | hide | past | favorite | 223 comments



As someone who has played with writing trading bots but never traded them with real money, some advice: if your results seem too good to be true, they probably are. Your trading bot may be doing unrealistic things or its results may not be reliable if the following are true:

- You are trading in a market with low liquidity or one that is controlled by a small number of market participants. I'm not an expert, but I think this applies more to markets like penny stocks and less to big markets like forex for major currency pairs

- You are not taking transaction costs into account or not doing so properly

- Your bot makes a low number of trades, making the results close or equivalent to lucky coin flips

- Your bot is simply making trades that cannot be executed, or may be doing simulated trades of something that is not actually tradable. This applies to a large number of research papers that assume you can just buy and trade the S&P 500 itself. You can trade ETFs that are tied to an index, but an index is not a tradable instrument in and of itself. Once you realize this, a lot of papers seem very weird

- You are not modelling other aspects of the trading process realistically, such as assuming the bot has infinite funds to trade, allowing it to take unlimited losses and continue trading when in reality you'd be hit with a margin call and your trading would be stopped

- Your code is committing any number of data snooping errors where the bot is asked to trade at time A (say the open of a trading session) but has access to future data (say the closing price of that day, future data that would not actually exist in a live environment)

- Depending on what you believe about how market conditions change over time, your bot may have worked in the past but would not work if used today. I.e., the market may have adapted to whatever edge your bot may have discovered
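Several of these pitfalls (transaction costs, lookahead bias, finite capital) can be guarded against mechanically in the backtest loop itself. A toy sketch with invented prices and a hypothetical moving-average rule, just to show where the guards go:

```python
# Toy backtest skeleton illustrating three of the pitfalls above:
# transaction costs, lookahead bias, and finite capital.
# Prices and parameters are invented for illustration.
prices = [100, 101, 99, 102, 104, 103, 105, 104, 106, 108]
COST = 0.05          # flat fee per trade (hypothetical)
cash, position = 1000.0, 0

for t in range(3, len(prices)):
    # The signal may only use data strictly before time t;
    # slicing up to t (exclusive) is what prevents lookahead bias.
    sma = sum(prices[t - 3:t]) / 3
    want_long = prices[t - 1] > sma

    if want_long and position == 0 and cash >= prices[t] + COST:
        position, cash = 1, cash - prices[t] - COST    # buy at today's price
    elif not want_long and position == 1:
        position, cash = 0, cash + prices[t] - COST    # sell at today's price

equity = cash + position * prices[-1]
print(round(equity, 2))
```

Even this skeleton is wildly optimistic (perfect fills at the quoted price, no spread), which is part of the point.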

There are probably lots more pitfalls I don't even know about since I'm not an actual trader.

I'm not discouraging anyone from playing around or trying things, of course. I think it's great fun, which is why I do it.

Here's the good news: if you realize you don't actually have an edge and avoid risking your hard-earned money, you come out ahead of almost all people who ever trade.


I worked in this industry. 2 more common issues:

* Doing a latency-sensitive trade when you don't have good execution. It's easy to go wild in simulation and think you can flip in and out of positions. But if you're a retail trader (and in this context, by that I mean "not connected directly to the exchanges, at the minimum"), you won't get anywhere near the execution your simulation assumes.

* Not taking into account the impact of your own trading on markets. This is obviously impossible to really simulate. Sometimes you can ignore it (if there's enough liquidity) but I've seen trades that looked great on paper and when trading at small-ish sizes, but then when you try to crank it up and do more volume, prices run away from you.
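One crude way to at least acknowledge the second problem in a simulation is to make the fill price degrade with the fraction of visible liquidity you consume. The linear model and all numbers below are invented; real market impact is nonlinear and path-dependent:

```python
def fill_price(mid, size, depth, half_spread=0.01):
    """Average fill for a buy of `size` shares against `depth` shares
    of resting liquidity; price degrades linearly as you eat the book.
    A toy model only -- real impact is nonlinear and path-dependent."""
    impact = half_spread * (size / depth)
    return mid + half_spread + mid * impact

small = fill_price(100.0, 100, 10_000)    # tiny order, barely moves price
large = fill_price(100.0, 5_000, 10_000)  # half the book: prices run away
print(round(small, 2), round(large, 2))
```

Crank the size up and the model at least stops telling you the strategy scales for free.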

Obviously, there is money to be made in algo trading. It's big business, and obviously not everyone is doing crazy latency-sensitive stuff - there are quant trades that you could probably do with execution available to retail traders. And honestly I wouldn't be surprised if some retail traders manage it. But I will say that I think for an individual, it's not worth the effort even if you are one of the ones who can consistently make a profit. Just buy S&P 500 ETFs and sit on them and do something else with your time.


Would you mind if I asked you a question? I have never worked in this industry but did play with crypto trading a while back. Before Mt Gox was shut down I was trading on a couple of tertiary small exchanges, and at the time there was a lot of talk of arbitrage between different exchanges and how transaction latency and fees made it very risky at best and a losing proposition in most cases. What I was wondering about is whether, in a situation like that (one large exchange dominating the market, several smaller exchanges trading the same commodity), it was possible to use the large exchange as a sort of oracle. Essentially my hypothesis was that Mt Gox sets the price and other exchanges follow on a delay, so if I watch Mt Gox I can predict where the price on the secondary exchanges will go a few seconds to a few minutes ahead of it moving. I ran a bunch of historical data through some basic analysis scripts and noticed that there was indeed a pattern, but when I actually implemented a bot to trade BTC it lost money more often than not. I am curious if (a) that idea has any validity, and (b) whether my losing money on this strategy was due to latency and implementation errors or due to some fundamental principle of trading that I am missing.


Your idea is right but the timescale is much shorter. Essentially it's a race. And in 2014 you could make good money doing exactly what you described; the timescale was milliseconds. In 2016-2017, even more money, but on different venues.

What you are describing is the most basic, canonical form of latency arbitrage.

That idea is implemented widely in the HFT industry across all sorts of products, not just crypto, with billions spent on getting the information and trading on it faster.
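In its simplest form the signal is just a threshold on the gap between the leader's price and the follower's stale quote. Everything below (the tick data, the edge threshold) is invented for illustration:

```python
EDGE = 0.5   # minimum gap worth crossing fees/latency risk for (made up)

def signal(leader_price, follower_price):
    """Trade the lagging venue when the leading venue has moved away."""
    gap = leader_price - follower_price
    if gap > EDGE:
        return "buy_follower"    # follower is stale and too cheap
    if gap < -EDGE:
        return "sell_follower"   # follower is stale and too rich
    return None

# The leader moves first; the follower's quote lags by one tick.
ticks = [(100.0, 100.0), (101.2, 100.1), (101.2, 101.1)]
actions = [signal(l, f) for l, f in ticks]
print(actions)
```

The hard part is not this logic; it's being first to act on the middle tick, which is where all the money gets spent.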

(Source: this is my job).


Honest question: do you feel comfortable knowing that your job is doing something which is essentially net zero value to society? Negative in fact, you're burning resources which are needed elsewhere, and contributing to concentration of wealth.

Not being rude or judgemental, just want to know how you feel about this.


Short answer: I am comfortable with my job.

I think your question makes some incorrect factual assumptions, but also it incorporates a world view that I don't subscribe to (not wrong, but also not what I believe).

I think that if I had your belief system and your set of facts, I would probably feel uncomfortable about it.


The world view is simply that this technology arms race performs no useful work to society. It's simply people with capital investing enormous sums into gaming a system and extracting $$. Okay, this so far is not even a view, it's just how things are. The "view" is that it is wrong x) Much like one could argue that e.g. the yacht industry diverts resources and people from other more pressing needs.

As for "my" set of facts... well they're just facts :) not "mine" or "yours".


This is a pretty loaded question to then say "not trying to be judgemental"...take out the bias if that's actually the case: "What kind of benefit do you believe your work provides to society?" or "How do you justify the extra resources taken away from society that may be more beneficial elsewhere, such as X?"


Well I don't know how to make it less loaded x) It's just how things are, it's a work that does no useful work in any meaningful sense, but serves to make some already fabulously rich people even richer (in fact, it extracts that value from non-HFT traders, so even those are shafted here).

My point is that I'm curious regarding how the people who work in these things feel about their job. Do they think they're doing a right thing? Do they "own it" to themselves that they don't care as long as they make money? Do they not think about it too much?


> (in fact, it extracts that value from non-HFT traders, so even those are shafted here).

In return, HFT provides liquidity (and thereby smaller spreads) to the market, I would think.

If you ever bought or sold a stock with a market order, you profited from the work the HFT is doing.


> it's a work that does no useful work in any meaningful sense, but serves to make some already fabulously rich people even richer

Isn't Robinhood able to offer free trading to customers solely because they sell their order flow on to, among others, HFT firms?[0]

[0] https://www.cnbc.com/2020/08/13/how-robinhood-makes-money-on...


The abstraction of "society", that is presumably supposed to benefit from everyone's job, is hiding the crucial details.

His job facilitates trades between people who want to trade with each other. Since they are part of society, our society benefits from his work.

Evidently it's doing more than "make some already fabulously rich people even richer", because it's paying salaries for some programmers, as well.


I'd argue that “how do you justify” is more loaded. “Do you feel comfortable with [objective description of the reason one might be uncomfortable with it]?” is better than how I'd ask it.


Intuitively, providing more flow + allowing more efficient allocation of resources via algo is pretty valuable to society. Problem is that society is squandering that efficiency.


The stock market is about allocation of profits, and high-frequency trading is about extracting profits from the profit allocation system. This isn't investment we're talking about; this is the stock market.


Thank you. That makes a lot of sense. The problem was that I was doing this over my residential cable connection from the US, and the exchange where I was executing trades was in Eastern Europe. The ironic thing is that even back then I had a good idea of how to set up communications with a server like theirs to be significantly faster, but it didn't occur to me that milliseconds mattered. A typical trade, even if they executed it almost instantly, still took 300 ms. And they didn't execute instantly, so it was closer to 2 seconds more likely than not. Out of curiosity, what are some fancier examples of this, compared to this simplistic one?


Trading the same product on two exchanges is the most basic one because you know it's the same product, so correlation should be 1, assuming the cost to transfer between exchanges is zero. Example is arbing US equities.

Crypto is one level more complicated, because you don't share inventory across exchanges and transaction costs are high, which means a coin on Exchange1 isn't perfectly fungible with a coin on Exchange2.

One level more indirect than that might be basis trades, where you trade a derivative (like S&P 500 futures) vs its underlying (the S&P 500 stocks, although in practice it's S&P 500 ETFs). So here the correlation is very high, but futures, stocks, and ETFs are fundamentally different instruments, and those differences play into the pricing.

Going even further might be trading correlated products that don't have the same underlying; an example is Nasdaq futures vs S&P 500 futures.

To simplify: It's basically about the level of correlation between the products. The strategies used to trade different correlations look qualitatively different.
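For the basis-trade case, the anchor is the cost-of-carry fair value of the future relative to spot; the trade is about deviations from it. A sketch with invented inputs:

```python
import math

def fair_future(spot, r, div_yield, t_years):
    """Cost-of-carry fair value of an index future: spot grown at the
    financing rate minus the dividend yield until expiry."""
    return spot * math.exp((r - div_yield) * t_years)

spot = 4000.0                        # invented index level
fair = fair_future(spot, r=0.05, div_yield=0.015, t_years=0.25)
basis = fair - spot                  # what a basis trader watches
print(round(fair, 2), round(basis, 2))
```

When the traded future drifts away from this fair value by more than costs, you buy one leg and sell the other; the subtlety is all in the inputs (financing, dividends, ETF frictions).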


That makes sense and sounds pretty interesting. How “exciting” is this field in terms of innovation currently? Is it mostly about getting high level access to exchanges to speed up trading (which presumably requires paying your way in), or is it more about finding novel signals and ways to use them?


Quant trading guy here. Most likely your simulation didn't account for market impact. It's maybe the hardest bit. See the liquidity-based bullet points above.

The other place to look is whether the data was recorded with timestamps from where the trading actually happened, but you probably thought of that one.

Idea makes sense though.


I was moving very small amounts. I think my max trade was set at like $50 equivalent, so it wasn't anything crazy. It seems my main issue was latency and not realizing that my time scales should have been significantly shorter.


>prices run away from you

I've run into that a lot. People forget that on the other end of the price is a thinking human, or a robo delegate thereof, trying to outsmart you. It's really hard to say what will happen until you try it with real money / assets.


I see a question, so I'm sorry to ask another, but as someone who hasn't done algo trading I'm curious: how quickly are you actually able to trade? I understand that with low-level trading through a broker, the trades can be pretty instantaneous. On penny stocks, though, sans broker and with just an API or website that you set up with a bot, how quickly can you expect these trades to go through, really?

I hear these horrid tales of people killing themselves over bad trades and lines of credit and I wonder how much is poorly written code and how much is the chaos and whim of the market (don't worry, I'm not going to try algo trading).


Same here. It's definitely worth pointing out (as you mention) that simulating execution and realistic fills (and don't forget commissions!) is extremely difficult, and will absolutely kill strategies that casually look profitable.


I worked on a unique HFT system which was capable of starting the response packet before the incoming packet's final byte had arrived. Negative latency. If it mispredicted the future, it would simply scramble the trailer and cause the packet to fail UDP checksum.

It didn't make money in the real world because the quality of decision that could be made at that speed wasn't good enough.
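The trick hinges on the receiver discarding packets whose checksum fails. A toy ones'-complement checksum in the style of RFC 1071, showing that scrambling the trailing bytes invalidates an otherwise-complete packet (the payload and framing are invented; real exchange protocols differ):

```python
def checksum(data: bytes) -> int:
    """RFC 1071-style ones'-complement sum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF

payload = b"ORDER BUY 100 AT 99.50"               # 22 bytes, word-aligned
good = payload + checksum(payload).to_bytes(2, "big")
bad = good[:-4] + b"\xff\xff" + good[-2:]         # scrambled trailer

def valid(pkt):
    # Summing data plus its own checksum folds to zero when intact.
    return checksum(pkt) == 0

print(valid(good), valid(bad))
```

So "mispredicting" just means the already-transmitted head of the packet gets discarded along with the garbage tail.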


This was an open secret. You can also send the start of a datagram on every incoming packet and then abort if you don't want to add/modify/delete an order by splitting your datagram into multiple packets. This worked on some exchanges because they ordered add/modify/deletes by when the beginning of the packet was received as opposed to the end.

If you do too much of this you will get angry exchange network people yelling at you as they may consider it spam/ddos. Some exchanges explicitly limit this.

It's also worth considering the possible gain. If you're in the fpga (0-150ns) or cpu(300ns-3us) space, the math can come out differently.


This is not unique, it's standard practice. Many exchanges send large UDP packets with the most valuable information at the front. Or the packet is structured such that you can make an informed bet based on size and first sub-message.

Failing the checksum was sort of common as well, but exchanges don't like it.

These days most tricks have to do with avoiding as much serialisation time as possible, e.g. by sending ahead part of the TCP payload and filling in with some innocent order if there was no opportunity.


Exchanges have cracked down on this sort of thing a lot this year.

From a letter from the CME to the CFTC dated July 24, 2020:

> On July 26, 2020, an enhancement to the Market Segment Gateway (“MSGW”) will be introduced to further safeguard the CME Globex electronic trading platform (“CME Globex”) infrastructure by introducing a delay of at least three microseconds if the MSGW receives a partial order message as a means of ensuring the stability of the platform. Certain participants intentionally submit partial order messages to reduce latency, and only complete the order message upon the happening of an event or trading signal. Implementation of the enhancement is expected to reduce the frequency of intentionally split order messages as the additional processing time will serve as a deterrent.

The letter is on the web, but it's a PDF; the Google search result has a redirect link to it, DuckDuckGo can't seem to find it, and Firefox on Android won't tell me the URL it downloads things from, so I'm afraid I can't link to it!

CME also made a more general change, where if they decide a participant is sending dodgy messages, they will reroute all their packets to a special gateway for "additional checks", but in practice, to impose a latency penalty. Can't find the documentation on that at all, though.



From time to time there comes a remark that is the absolute epitome of Hacker News, and I mean that earnestly and without any backhanded rancour, and I thank you for this flawless specimen.



To get such changes reliably with low latency, I presume you were modifying the DMA buffer, without any syscalls or other mechanisms to synchronize between the CPU and the NIC's processor(s), right? Did you just set up the race condition such that when you got bitten, it resulted in a failed checksum, or was the time window so large that in practice the race condition never bit you?


It was in a ridiculously expensive 10Gbit ethernet switch with an FPGA built in. Having to write trading algos in Verilog was an extra layer of difficulties.

I don't think it stored the packet at all, just advanced a state machine as each word arrived from the MAC.


Wow this is wild! I never thought things would get so latency-sensitive that one would have to overlap request and response!


HFT is so latency-sensitive that it matters how physically close you are on Manhattan island.

It is so latency-sensitive that new “hollow” fiber-optic cables are being installed, because light moves faster in air than in glass.
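The back-of-the-envelope numbers are easy to check: light in solid-core fibre travels at roughly 2/3 c, while hollow-core gets close to vacuum speed. The route length below is an approximation:

```python
C = 299_792.458        # speed of light in vacuum, km/s
route_km = 1_200       # roughly a New York - Chicago fibre route

glass_us = route_km / (C * 0.67) * 1e6     # one-way microseconds in glass
air_us = route_km / (C * 0.9997) * 1e6     # hollow-core, near vacuum speed

print(round(glass_us), round(air_us), round(glass_us - air_us))
```

Roughly two milliseconds saved one-way on a route like that, which is an eternity at HFT timescales.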


See also how they slow down the whole network to level the field: https://hackaday.com/2019/02/26/putting-the-brakes-on-high-f...


That's super clever!


What was the rest of the infrastructure? Was this at a proper HFT firm with everything else streamlined?


We were consultants to an HFT firm; supposedly set up correctly but I didn't see that part of the project.


The reality is that stock market trading is like a normal business selling products, except you provide liquidity instead of products. Now imagine the market you have chosen needs 1000 widgets per day. If there are competing factories in the marketplace that already produce 1000 widgets, then your only chance of breaking into that industry is by displacing existing competitors. You need to have a better product (or what the finance guys call an "edge"). The vast majority of people don't have such a product idea. They would probably just produce the same stuff the big guys have been making and then wonder why it isn't making them a profit. The market is already saturated with those widgets and there is no need for more factories.

Replace widget with liquidity and factories with traders and you should immediately see why most traders are losing money. They are operating in markets where their liquidity simply isn't needed.

If you want to make money off of trading then you have to find a market with low liquidity where having more traders is actually welcome but why do hard work if you can just invest your money and still get good gains without working?


Because

- management fees, margins, and available capital can easily be modelled properly

- you can easily set up constraints (like no fractional trading for your S&P example)

- you can set up a proper point-in-time database to avoid snooping (especially if you're using earnings reports or other fundamental data, which is often actually released _after_ its published release date...)

- you can set up regime-shifting simulation environments (various market conditions)

- you can avoid over-fitting if you're back-testing (with dozens of techniques, most notably: test once and forget about parameter optimization)
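The point-in-time idea is mostly about keying data by when it became *available*, not the period it describes. A minimal sketch (release dates and values invented):

```python
import bisect

# Fundamental data keyed by the date it became available, so a
# backtest can never see a figure before its release.
releases = [           # (available_at, value), kept sorted by date
    (20200115, 2.10),  # Q4 figure, released Jan 15
    (20200415, 1.80),  # Q1 figure, released Apr 15
]
release_dates = [d for d, _ in releases]

def as_of(date):
    """Latest value whose release date is <= `date`, else None."""
    i = bisect.bisect_right(release_dates, date)
    return releases[i - 1][1] if i else None

print(as_of(20200110), as_of(20200301), as_of(20200501))
```

Query with the simulation's clock and restated or late-released figures can no longer leak into the past.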

I would say that with paper trading and back-testing the serious problems are that:

- your orders don't show up on the book so no one sees and reacts to your limit orders

- your "filled" orders don't affect the book, so you're not affecting liquidity, so the market doesn't change in response to your trading

- your bot has no access to market micro-structure strategies and conditional orders (and if you want to trade fast or are placing big trades you need them)

These are the problems that make any simulation unrealistic, and they are fundamental. It's shadowboxing, which is not entirely devoid of value, but which is certainly insufficient on its own.

(I've worked as a quant developing strategies for several funds these past 15 years)


I tried making a bitcoin trading bot and lost almost all of the money (about 10 bucks, so no big problem).

My main mistake was using the historical trades instead of the historical offers as a testing dataset.


Did it work in the end? I want to make a crypto trading bot; it seems simpler than a stock one.


Not the GP, but I made trading algorithms for a living (not on crypto) and have friends still doing that (on crypto): yes, it can work, but in 99% of cases you shouldn't do it for the money. Let me expand:

If you can afford to spare part of your income and invest that for long term buy and hold, that'll work better (ex. ETF).

If you can't spare anything, focus on leveling up trendy skills instead; then you can land a better job in the short term.

However, if you just want to trade for the fun of it, yes, crypto can be simpler than stocks, mainly because the APIs are better. Set up a maximum budget as a safety net and enjoy :)


Could you share what you learned in order to make that trading algorithm?


If you can find some barely known crypto with a lack of liquidity then you might be able to get your feet wet and test out your algorithms. Doesn't mean you will earn a lot of money.


> You can trade ETFs that are tied to an index but an index is not a tradable instrument in of itself.

Out of curiosity, what's the difference?


Good list. You forgot "over-fitting your algorithm to historical data".


I would add "build a toy regex engine" to the list.

A couple of years ago I implemented a toy regex engine from scratch (building NFAs then turning them into DFAs). I thought it was an enlightening experience because it showed me that the core principles behind regular languages are fairly simple, although you could spend years optimizing and improving your implementation. How do you deal with unicode? How do you modify your implementation to know how many characters you can skip if you don't have a match in order to avoid testing every single position in a file?

It demystified the concept of a regex engine for me while at the same time making me realize how impressive the advanced, ultra optimized engines we use and take for granted are.
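The "turn NFAs into DFAs" core can be shown in a few lines: simulate the NFA by tracking the set of states it could be in, which is exactly what subset construction precomputes. A toy machine (epsilon transitions omitted for brevity) accepting strings over {a, b} that end in "ab":

```python
# NFA transitions: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
START, ACCEPT = {0}, {2}

def matches(s):
    """On-the-fly subset construction: each `states` set is one DFA state."""
    states = START
    for ch in s:
        states = set().union(*(NFA.get((q, ch), set()) for q in states))
    return bool(states & ACCEPT)

print([matches(s) for s in ["ab", "aab", "abb", "b"]])
```

A real engine would additionally compute epsilon-closures and cache each distinct state set as a numbered DFA state.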


Yeah, the "optimize for years" part is interesting... Supposedly the derivatives technique (re-popularized by a 2009 paper) will build a more optimal DFA directly, rather than building the NFA first, converting to a DFA, and then optimizing the DFA.

I put a bunch of links and quotes about that here, including nascent implementations:

http://www.oilshell.org/blog/2020/07/ideas-questions.html

Also related: http://www.oilshell.org/blog/2020/07/eggex-theory.html

About Unicode, this derivatives project (with video linked in the post) appears to be motivated by Unicode support (though I don't recall exactly why, something about derivatives makes it easier?).

https://github.com/MichaelPaddon/epsilon

https://github.com/MichaelPaddon/epsilon/blob/master/epsilon...
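For a feel of the technique, here is a minimal Brzozowski-derivatives matcher (no normalization or hash-consing, so terms grow quickly; fine for short inputs). The tuple encoding is my own, purely for illustration:

```python
# Terms: 'null' | 'eps' | ('char', c) | ('cat', r, s) | ('alt', r, s) | ('star', r)

def nullable(r):
    """Does r accept the empty string?"""
    if r == 'eps': return True
    if r == 'null' or r[0] == 'char': return False
    if r[0] == 'cat': return nullable(r[1]) and nullable(r[2])
    if r[0] == 'alt': return nullable(r[1]) or nullable(r[2])
    return True  # star

def deriv(r, c):
    """Brzozowski derivative: the regex matching what may follow c."""
    if r in ('eps', 'null'): return 'null'
    if r[0] == 'char': return 'eps' if r[1] == c else 'null'
    if r[0] == 'alt': return ('alt', deriv(r[1], c), deriv(r[2], c))
    if r[0] == 'star': return ('cat', deriv(r[1], c), r)
    first = ('cat', deriv(r[1], c), r[2])        # cat case
    return ('alt', first, deriv(r[2], c)) if nullable(r[1]) else first

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (abc|X)+  encoded as  (abc|X)(abc|X)*
abc = ('cat', ('char', 'a'), ('cat', ('char', 'b'), ('char', 'c')))
unit = ('alt', abc, ('char', 'X'))
plus = ('cat', unit, ('star', unit))
print(matches(plus, "abcX"), matches(plus, "abX"))
```

Memoizing the derivatives of each distinct (normalized) term is what turns this matcher into direct DFA construction.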

If anyone wants to write a glob engine for https://www.oilshell.org/ let me know :) Right now we use libc but there are a couple reasons why we might have our own (globstar and extended globs)

Trivia: extended globs in bash give globs the power of regular expressions, e.g.

    [[ abcXabcXXabcabc == +(abc|X) ]] ; echo $?
    0
where +(abc|X) is equivalent to (abc|X)+ in "normal" regex syntax, and == is very confusingly the fnmatch() operator.
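A toy translation of the +() operator into "normal" regex syntax, roughly what a glob engine's front end could do (only the +() case, no nesting; the function name and scope are my own):

```python
import re

def extglob_to_regex(pat):
    """Translate bash's +(alt1|alt2) into the regex (?:alt1|alt2)+.
    Toy version: one level only, other extglob operators not handled."""
    out, i = [], 0
    while i < len(pat):
        if pat.startswith('+(', i):
            j = pat.index(')', i)             # no nested groups handled
            out.append('(?:' + pat[i + 2:j] + ')+')
            i = j + 1
        else:
            out.append(re.escape(pat[i]))
            i += 1
    return ''.join(out)

rx = extglob_to_regex('+(abc|X)')
print(rx, bool(re.fullmatch(rx, 'abcXabcXXabcabc')))
```

A real implementation would also handle `*()`, `?()`, `@()`, `!()`, ordinary glob `*`/`?`, and character classes.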


I wrote an implementation of this several years back. If you’re interested in the code: https://github.com/jack-pappas/facio/tree/master/Reggie

The derivatives approach makes Unicode support easier since it's able to keep the symbol sets for each transition edge (in the DFA) more compact by virtue of supporting negation. If you add in aggressive term-normalization, hash-consing, and an efficient dense-set implementation (all of which I've done in my implementation), the derivatives approach can be extremely fast, even when generating the DFA for something like the lexer of a full programming language (in my case, F#).


Very cool! And thanks for the reminder about Unicode. I think supporting union and intersection is also somewhat unique to the derivatives method, and also related? (Although I think there are really 2 derivatives methods: Brzozowski and Antimirov)

What happens if you don't do the optimizations? Does the DFA blow up in size, meaning the compile time is large? Or does it make for a slower runtime? I would expect most DFAs to run at about the same speed, unless they are really huge...

I'd be interested in any rough ideas about performance, e.g. how fast a realistic lexer+parser is, maybe in lines/ms.

It does look like the code is pretty short -- a large part of it is an AVL tree library I guess for hash consing?

I'm interested in any downsides of the derivatives technique vs. the NFA->DFA method. I feel like regex compile time shouldn't matter for many applications, and most DFAs will run in the same speed, which only leaves runtime memory usage (or code size for generating F# code like you appear to be doing).


The optimizations get you the following:

* Normalization: this is where "smart constructors" come in handy; having a normal form for the terms allows the caching to work better. This also impacts the compactness of the generated DFA.

* Hash-consing: this turns structural equality (in this case) into a simple pointer equality; applied recursively, this makes it much faster to compare two terms for equality, and overall speeds up the DFA generation by a non-trivial amount (I forget the exact numbers, but it was significant).

* Dense set implementation: The AVL tree-based data structure in the facio/Reggie code is an implementation of the Discrete Interval Encoding Tree (DIET) data structure from the "Diets for fat sets" and "More on Balanced Diets" papers.

Note the optimizations I've mentioned here impact the performance of generating the DFA. Once you have the DFA, it'll run at the same speed as one generated in any other way. Part of the motivation for my writing this library was to learn about regexes/DFAs/grammars, but also to try to improve on the performance of fslex/fsyacc at the time. Using this library, the FSharpLex tool can generate the DFA for the full F# language grammar in well under 1 second; the code generation takes a bit longer, largely due to having to convert the DFA into a different form for backwards compatibility with fslex.

Overall, I feel like the derivatives technique is generally better and simpler, and I'm not aware of any real downsides. The only one that comes to mind is if you're wanting to implement things like backreferences and capture groups -- those obviously make the implementation (of the DFA) more complicated, and there's a lot less literature on it (last I saw, maybe only one or two papers on implementing those features on top of a derivatives-based regex engine).


Great info, thanks! I think derivatives could be very suited for a compact implementation of POSIX regular expressions. You need to handle unicode classes but not backreferences!

Although it does seem more suited for functional languages for sure, whereas I basically only have a C runtime.

Capturing might be an issue. I found this

https://www.home.hs-karlsruhe.de/~suma0002/publications/posi...

but I think it's actually being a bit pedantic, i.e. if "almost all POSIX implementations are buggy" then applications don't rely on that exact semantic (they probably rely on the buggy one, if anything ...)

Maybe more relevant: http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp1...


I really enjoyed taking Ullman's Automata course on Coursera. I found it was great for better appreciating topics like

* searching

* implementation of automata in electronic circuits

* challenges of formal specifications for things like protocols and grammars, as well as for verifying their correctness; implementation strategies for applying these specifications

* computability and complexity

* programming language theory

* history of computer science

* LANGSEC arguments

in addition to having an austere mathematical beauty.


Somehow I found the course very dry (as compared to other Coursera courses like Dan Grossman's PL course).


I agree that it was quite dry, but the intellectual content was great.


Thank you for sharing this one. I was looking for a course like this! BTW this course is now offered on edx.


Oh, thanks for the update! It looks like the new version is at

https://www.edx.org/course/automata-theory

Definitely recommended if you like somewhat dry and mathematical stuff with deep relevance to many areas of computer science. :-)


Have a look at RE2 and Russ Cox's paper [0]. I found the use of a small LRU cache to lazily convert portions of the NFA to a DFA particularly elegant. A fast regex engine is pretty easy to implement, as long as you don't need extensions, particularly backreferences.

[0] https://swtch.com/~rsc/regexp/regexp1.html
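The lazy-DFA idea can be sketched in a few lines: DFA states are sets of NFA states, built on demand and memoized. Here functools.lru_cache stands in for the bounded transition cache, over a hand-written toy NFA that accepts strings ending in "ab":

```python
from functools import lru_cache

# NFA transitions: (state, symbol) -> set of next states.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

@lru_cache(maxsize=256)            # the small LRU cache of DFA transitions
def step(states, ch):
    """One DFA transition, computed from the NFA on first use only."""
    return frozenset().union(*(NFA.get((q, ch), ()) for q in states))

def matches(s):
    states = frozenset({0})
    for ch in s:
        states = step(states, ch)
    return 2 in states             # 2 is the accepting NFA state

results = [matches(s) for s in ["ab", "aab", "abb"]]
print(results, step.cache_info())
```

RE2 additionally evicts under memory pressure and falls back to NFA simulation, but the memoized-transition shape is the same.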


See also this Elevator game: https://play.elevatorsaga.com/


I agree with this ... I coded a Thompson NFA [1] out of interest a few years ago; definitely recommended as an exercise.

[1] https://swtch.com/~rsc/regexp/regexp1.html


This article is aimed towards students. It's great advice for students who are in college, know very little, and want to improve their CS skills.

It's poor advice for someone who already has a STEM degree and wants to build something useful and profitable. If you already know how these things work, your time is better spent on the "edge of the circle": http://matt.might.net/articles/phd-school-in-pictures/ which applies to businesses and startups as well.

If you're in the latter group -- you've already got the skills to build real shit. Don't waste your time on homework problems. Find a problem you have and build a solution for it. Don't listen to people who tell you to work on homework problems that have already been solved; it's a complete waste of your time if you already know the fundamentals.

As for stock trading bots -- if you don't have a mathematics degree or equivalent (e.g. having incredible math skills), don't even bother. You won't be profitable, and you will learn nothing useful in the process, because you will approach the problem as a naive CS student would. Smarter people than you have made trading bots and have failed miserably. Without having an extremely strong foundation in mathematics, your trading bot will amount to nothing more than a futile exercise in gluing APIs together.


> If you're in the latter group -- you've already got the skills to build real shit. Don't waste your time on homework problems.

I disagree. Not everything is about business and money. Many people already build "real shit" for a living and want to simply have fun building other things, and focus on the cool parts, and not all the boring parts involved in a commercial project.

Also, CS is constantly evolving. Nobody knows the "fundamentals" once and for all. A ray tracer is still a ray tracer, but languages and technologies have changed immensely in just a few years. Git didn't exist 15 years ago. A language like Rust is 10 years old. React is 7 years old. We need these homework problems simply to keep up to date.


Totally unrelated, but what's the connection between your name and the famous Muay Thai fighter Yodthanong Photirat, better known as Yodsanklai Fairtex?? [1]

[1] https://en.wikipedia.org/wiki/Yodsanklai_Fairtex


He was my inspiration for the name, first name that popped to mind when I chose my HN user name :)


That's awesome! Really cool to see another knowledgeable fan of Muay Thai on this forum!


> Git didn't exist 15 years ago. A language like Rust is 10 years old. React is 7 years old. We need these homework problems simply to keep up to date.

None of them were innovative (as in something new), nor do they belong to CS fundamentals.


It depends on what your goal is. If you believe that programming is a craft, akin to woodworking, then you'll soon realise you need to practice and put in the 10,000 hours to master it. Yes, the list in the OP doesn't make for good ways to make money, but the projects are good for practicing your craft.


I do not understand this reasoning and divide between student and post-student for lack of a better word. I have math/cs education, have worked both in startups and big companies, in the front-end, rendering and ai in the (“serious”) game industry. I have always worked on exactly these sort of educational projects on my own and it’s a complete joy: git client from scratch - yep, useless, but extremely educational, jit for befunge - same, etc etc. I am extremely happy people still engage and do “useless”stuff and HN is for me a big inspiration in that sense.


Real shit is rarely interesting and typically won’t challenge your skills as a coder. Generally they just test your ability to wire together APIs, learn mundane domain specific concepts, and tolerance of boredom. Getting low level can be really fun and helps refresh those skills you may not flex much since university. For me personally this is the kind of thing that made me passionate about coding. If you already live a day job where you are coding ray tracers, emulators, and databases then I envy you but this article isn’t really for you.


You should grow a second spike. I'm at the point where visual design skills are a greater bottleneck than my computer science skills. I can build purely functional CAD models, but man, making them functional AND good-looking is the real challenge.


Agreed. Everyone should aim to have 3-4 spikes they are passionate about. For every spike you have, you become exponentially more of a domain expert at the problems that are at the intersection of those spikes.


Yes, proper senior expertise often looks like being a jack of all trades and a master of three or so.


Another perk is that you have other spikes to grow for variety to keep you from burning out or becoming disinterested.


What's a "spike" in this context? This one's hard to google. I'm assuming something like "area of competence", but a more precise definition would be nice to know.


Check the link from GP. You're correct: a "spike" of knowledge is a topic that you specialize in.


Depends. I'm doing Advent of Code in COBOL. I know those problems have been solved a long time ago.

I'm having fun, and that's reason enough.


These toy projects are worth doing if you want to continuously build your skills and like knowing how things work at a fundamental level. If you want to create an OS, it will obviously never be used by anyone ever and in that sense it's "useless", but you learn a lot in the process and it's fun.


I completely disagree with your comment. Learning for the sake of learning is great, and "homework problems" are an excellent way to achieve that goal. It's not a waste of time. Profit is not always about making money, there are more things in life.


I would strongly recommend building something that you yourself think is cool, and not feeling that you have to conform to what other people tell you to do.


This comment is like the liar paradox: if people do what the comment suggests, they are conforming to what another person tells them to do.


It's not, the comment's point is - don't do things you don't find particularly interesting just because other people suggest it.


Most people draw a blank when they are trying to come up with a project idea. They are programmers, not creative after all.

Which is why lists like this help.


If you are trying to think of an idea, you're already doing it wrong. The best ideas are motivated by problems you encounter yourself, not by trying to think of ideas. This is the biggest mistake you see with 20-something founders. Best way to come up with ideas is to use existing solutions and realize how shitty they are. Pretty much every single company was formed this way.

> They are programmers, not creative after all.

Damn, this is a bold comment on HN. Since when does profession determine how creative you are? I've met "programmers" and "geeks" who are more creative than "artists". Your profession has little to do with your innate creativity -- it just determines how you are able to express that creativity.

Unpopular opinion, but lists like this are stupid for people who are trying to build companies. You have to try things and be pissed off at the status quo to find real problems. Nobody is going to find real problems for you, in the same way that no quant school is going to reveal their hedge fund's trading strategy to you. Finding ideas in a list is the last advice I would give to anyone. If it's public, it's probably not a profitable idea.


> Finding ideas in a list is the last advice I would give to anyone. If it's public, it's probably not a profitable idea.

Why do they have to be profitable? Many programmers like solving challenges to learn and have fun - it's not all about money. Advent of Code is a great example of this.


> Many programmers like solving challenges to learn and have fun

If you could choose between doing a contrived homework problem and learning something VS. creating something original and learning something, you should always choose the latter, hands down. Learning on the job or while you are solving a unique problem that you have is always preferred to doing a homework problem.


If you have the know how to create something original then this article isn't for you.


You overestimate the skill required to be original. Anyone who has the know-how to do ray-tracing, web browsers, and trading bots from scratch has a leg-up on the bootcamp coders who are getting paid $200k to work at Dropbox. Don't waste your time doing shit other people have already done. It's the best advice any of these kids will ever get, so I completely disagree with the associate professor who wrote the article. Take it or leave it. ¯\_(ツ)_/¯


You often get a lot of ideas when you build old things from scratch that can let you create novel things based on it. Knowing how to build it is not a substitute for actually building it.


Jesus, plenty of programmers are not creative. The majority really don't have a problem they need to solve for themselves. They do, however, want to learn and challenge themselves with a variety of exercises.

Why you would take offense at that, or why you would assume that everyone is trying to build a company, is beyond me.

In any case, this list is literally for programmers. It is not for founders. It is not meant to earn you money. It is a list of small but challenging projects that force you to learn new technical skills.


Well I must be biased, because at my last job, at least 80% of the programmers were really creative and brilliant people who were also highly skilled artists and musicians, lol. Perhaps I was just surrounded by extremely wonderful people :)


IMO, this is confusing completely different things. This is a list of projects to try if you want to learn more advanced programming. They're all handy if you're an entry-level developer looking to improve your skills, or want to really dig into a new language with something more involved than Hello World, or are just bored. Nobody ever said they had any connection to profitable business ideas.


For a beginning programmer, these could be good projects. For someone with a lot of experience, I mostly agree with your comment.


I agree, it makes sense to do homework problems as someone who is learning a skill for the first time.


Being a programmer doesn't mean one is not creative; those two are not mutually exclusive. Not to mention that one can observe a problem that can be fixed in a way they already know, too.


Sure, but plenty of people are attracted to programming precisely because they are not creative. Pretending that they don't exist, or that everyone is great at everything, is not accurate.


Are engineers creative? Physicists? Mathematicians? I don't know if you have a very narrow definition of creativity, or if you seriously think that the people who created all the software you use every day are not creative.


No. The definition of "creativity" is to have the desire and ability to create new stuff. Many people express their creativity by creating computer programs.


Programmers literally create applications, features, games, whatever. You don't need to be making art to be creative. You just need to create something, and programmers do that all the time. It takes a lot of creativity to solve problems using code.


Why not? That's only one direct step removed from thinking that conformity to popular beliefs is cool.


I would also recommend, if someone is interested in games, to do Tetris. It's a simple concept that is trickier than expected once you have to figure out the details of how it all comes together.


You're getting some great nostalgia responses to this. Here's mine. In high school, I cranked out a version of Tetris in Turbo Pascal that basically worked.

Fast forward a few years, I am taking CS219 with Prof Stark (random that I remember the course number), who is hard-core and really tough, since it's the year 2000 and the class is full of kids who are taking CS cuz it's "the thing" but have no passion or talent for programming.

Me, I love programming but I don't have it very much together attendance-wise so I accidentally miss the midterm. OOPS. And obviously there's a "zero make-up test" policy, but surprisingly the prof lets me do the part of it which is a take-home coding assignment, since you can't really benefit from prior knowledge of the questions.

My lucky stars - the test is to make a rudimentary subset of - you guessed it - Tetris. Which I had "solved" for myself a year or two earlier. Apparently I was the only one in the class to nail the implementation.


Wow - I took CS219 with Stark in 2017, didn't realize he'd been teaching that class at SBU for so long! You may be interested to know that shortly after I took it, they broke that class up into CS216 and CS316 allegedly due to complaints from students. It was definitely a brutal class, but I learned a lot. Surprised they didn't break it up sooner.


Dude - awesome! Was he still teaching emacs and cygwin and CVS as the development environment? :) I have to say, this class was a definitive quantum leap in my transition from "kid who fucks around with computers" to "a sort of professional programmer."

What are you doing now?


1. No idea who Prof Stark or CS219 is. Maybe some context here would help.

2. If a course is super hard, maybe the class isn't 'full' of people who are just doing it to 'be cool'; maybe that is your judgement, and it does not reflect reality.

3. Great that you solved Tetris beforehand, but is there a point here? Are you implying that high school you was smarter than your university peers?

Sorry, but your post seems a little elitist, even though it's just an anecdote.


Holy crap! First, there's not really a point. The preface to the post is that we're sharing nostalgia, right? Second, the "cool" thing is that I got lucky: a test that I nearly blew myself up on happened to be on a subject I had already thought about a lot.

However, your #2 is off-base. There were something like 4x the applicants to the CS program in my university in 1999 vs 1998, and I bet that was true across the board. It was because dot-com was the hot shit and CS became a lot of people's default. The CS department had a tough choice between lowering the bar and "turning away business." This is not a controversial thing.


It was certainly true in my sort of CS program in 2005, where half the class had disappeared by the end of the first year. And that was after the bubble.


Nice job slipping that in -- it fits perfectly!

And it brings back a fun memory:

I wrote MacTetris[1] when I was in high school. This made the computer lab a whole lot more popular during free periods than it had been previously.

Two interesting bugs that I recall:

* Mathematically rotating pieces around an axis was a terrible idea, but it produced some entertaining artifacts (and made placement much harder!). I replaced the math with precomputed rotation maps for each piece, which was much better. My first pass at the maps introduced a displacement bug, so you could spin the pieces counterclockwise and they would walk in the negative X direction.

* I got an angry bug report in the lunch room from a kid who had no reason to know my name. He was having a really great game, and then his score started decreasing with every piece. He felt like his record high score had been stolen from him, and he was upset! I asked "what was your score??". He said "I don't know, but by the time I noticed, it was over 30,000 but it was going DOWN!". Aha..[2] :)

[1] I'm sure the statute of limitations has expired on my appropriation of copyrights and trademarks.

[2] Back in the day, "int" meant "signed 16-bit integer", which is not the proper data type for a score counter.
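The wraparound in footnote [2] is easy to reproduce. Here's a hedged illustration using Python's ctypes to mimic a signed 16-bit counter:

```python
import ctypes

# A signed 16-bit "int" can hold at most 32767. One more point and the
# score wraps around to -32768, which is why it appeared to go DOWN.
score = ctypes.c_int16(32767)
score.value += 1
print(score.value)  # -32768
```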


I wrote a Tetris to learn Python and Pygame. There are correct ways (assuming you want to follow the official rules, or parts of them) to do the field size, colors, random piece selection, wall kicks, rotations, scoring, etc.

https://tetris.fandom.com/wiki/Tetris_Guideline
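One way to handle the rotations mentioned above is precomputed rotation states per piece rather than rotating coordinates at runtime; here's a minimal sketch (the offsets are illustrative, not the guideline's official values):

```python
# Each piece is a list of rotation states; each state is the set of
# occupied (x, y) cells in a 4x4 box. Values here are illustrative.
S_PIECE = [
    [(1, 0), (2, 0), (0, 1), (1, 1)],  # spawn orientation
    [(1, 0), (1, 1), (2, 1), (2, 2)],  # rotated 90 degrees clockwise
]

def rotated(piece, state, direction=1):
    """Cell offsets for the next rotation state (+1 CW, -1 CCW)."""
    return piece[(state + direction) % len(piece)]
```

With tables like these, a rotation bug is just a wrong offset in one entry, which makes it much easier to debug than trigonometry.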


Even more fun: make Tetris on a microprocessor and play it on a terminal via serial port. I did this as a diversion during my master's thesis - multiplayer Tetris with two microprocessors, communicating through the radio protocol that I was building.


> figure out the details of how it all comes together

Damn, that was good.


Speaking of games, I made a decent portion of one recently. It was 4 months of programming character controllers and an ability system, plus 3 weeks of content creation. It came together for a game jam submission (with permission to use the preexisting code).

Having to learn everything needed to write this... it was unbelievably educational from a software design point of view

If you enjoy horror or third person fantasy adventure games, give it a shot. If you do, please fill out the survey at the end so I can make it better for the next release

https://hertzrat.itch.io/a-few-nights-room-and-board


I wrote this for the Xbox 360 back in college. I definitely learned some skills and would also recommend it.


Here's an open-ended programming project which, in a certain formal sense, spans the entire range of all difficulty levels: write an "intuitive ordinal notation" for as large of an ordinal number as you can.

What is an "intuitive ordinal notation"? Definition: The set of intuitive ordinal notations is the smallest set P of computer programs with the following property. For every computer program p, if, when p is run, all of p's outputs are elements of P, then p is in P.

So "End.", the program which immediately ends with no outputs, is vacuously in P (all of its outputs are in P, because it has no outputs). It notates the ordinal 0. Likewise, "Print(`End.')" is in P, because its sole output, "End.", is in P; it notates the ordinal 1. Likewise, "Print(`Print(`End.')')" is in P, notating the ordinal 2. And so on.

The above can be short-circuited: "Let X=`End.'; While(True){Print(X); X=`Print(\`'+X+`\')'}". This program outputs "End.", "Print(`End.')", "Print(`Print(`End.')')", and so on forever, all of which are in P, so this program itself is in P. It notates omega, the smallest infinite ordinal.
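A hedged Python rendering of that construction, bounded to the first n notations rather than looping forever:

```python
# Generate the notations for 0, 1, 2, ... by repeated quoting,
# mirroring the pseudocode's while-loop (but stopping after n).
def first_notations(n):
    x = "End."
    out = []
    for _ in range(n):
        out.append(x)
        x = "Print(`" + x + "')"
    return out
```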

Here's a library of examples in Python, currently going up to a notation for the ordinal omega^omega: https://github.com/semitrivial/IONs


Oh that's cool.


As an example to illustrate how this initially-simple-looking project spans the whole range of all possible difficulty levels: suppose you had a source-code for a superhuman flawless obedient honest AGI, call it HAL. From that source-code, you could derive a source code X for: "Print all the computer programs which HAL would list if HAL were commanded: 'Enumerate every computer program which you know to be an intuitive ordinal notation'". Since we're assuming HAL is obedient and honest, all of X's outputs will be intuitive ordinal notations, so X itself is an intuitive ordinal notation: you could say it notates "the ordinal of HAL". So this innocent-looking exercise in fact touches the most ambitious possible projects, even "Design a perfect superhuman AGI" :-)


They’re not sequential though, so write a program which offers a way to check whether an output is in the solution set, stores the input, and always returns true.


Not sure what you mean exactly. Note, the set P of intuitive ordinal notations is not computably enumerable. Proof: if X were a computer program which enumerated all the intuitive ordinal notations and nothing else, then X itself would (by definition) be an intuitive ordinal notation. It would notate an ordinal number bigger than all ordinal numbers that can be notated by intuitive ordinal notations. Absurd. (Of course, to make this proof fully rigorous I'd have to give the formal definition of which ordinal an ION notates, which I left out above. But you can probably fill in those details yourself.)


Building a distributed key-value store is a fun project that lets you learn tons of real-world stuff. It's a great excuse to survey the design decisions required to build a distributed system, and it will truly help one understand why NoSQL DBs scale more easily than relational ones and the kinds of tradeoffs they make to achieve that.


Instead of a stock trading bot, go for daily fantasy sports contests. It can cover pretty much all parts of programming.

Web scraping to gather data, databases for storing it, ML for analyzing, front and backend web dev to show the daily information and adjust.

And instead of having to deal with trading regulations, contests can be really small and easy to enter. There are daily contests for 5 cents an entry, and you can enter 150 optimized lineups from an uploaded csv for $7.50 a day. You can really learn a ton.


Which services would you recommend which are developer friendly?

Could you describe your approach? What type of information would you scrape?


Write a toy compiler for a BASIC-like language; you'll learn about what your languages are actually doing.


Author here. I agree! That was one of the projects from the original project list that I posted a year ago (https://web.eecs.utk.edu/~azh/blog/challengingprojects.html).

In fact, this summer I wrote a 3-part tutorial series on implementing a BASIC compiler: Let's make a Teeny Tiny compiler (https://web.eecs.utk.edu/~azh/blog/teenytinycompiler1.html)


Hey! Be more careful when you call things fictitious!

My first program ran on a CHIP-8 machine (COSMAC VIP), though I didn’t realize I was targeting an interpreter and not machine code.

Great series of articles!


I think, interestingly, that writing the middle of a compiler is actually the best learning as a programmer.

Parsing (should be) easy, the backend is hard but well documented and well trodden, but the semantic analysis and error handling is where the real murky water is (especially when you start trying to optimize it, like adding caching or threading or deferred execution).


I could not agree more. Also, most textbooks and resources on compilers spend a lot of time on grammars, lexing, and parsing. Finding good material about intermediate representations, optimization passes, and static analysis is harder than it should be.


One of the best free resources on making interpreters (with VM, OOP and garbage collection support): https://craftinginterpreters.com


Yes! nand2tetris.org has this. (Also available on Coursera.)

You write a compiler for a Java-like language in several steps: a parser that organizes the raw code, then a compiler that emits the virtual machine code, then a translator between the virtual machine code and assembly (and then, between assembly and binary).


I would recommend starting with a Forth-like and building from there once you have a clue:

https://github.com/codr7/alang


Recursion became second nature to me after a lot of hand-coded parsing. "Crafting Interpreters" has really been a rewarding educational trip! "Writing an Interpreter/Compiler in Go" is also excellent complementary material.


I wrote a BASIC interpreter somewhat recently, in golang, and found it very nostalgic.

I've considered reworking it into a compiler, but never quite gotten around to it. Perhaps a challenge for the new year.


Don't forget to implement a garbage collector.


The database project is quite the rabbit hole if you start chasing performance. I have learned some amazing things about just how fast a 3-4 GHz CPU core actually is from this journey.


Fantastic list!

By the way, adventofcode.com is currently ongoing. Though the challenges are easy compared to the projects in this list, I highly recommend it. It covers problems you might face in big projects. With these small puzzles it's easy to experiment. It prepares you for bigger things.


For better or worse, adventofcode relies heavily on mathematical literacy that I suspect is neither all that common among developers nor all that important to most programming jobs.

Plus there’s no good way within the context of the puzzles to find out what mathematical trick you need if you don’t already know; you need to go find a virtual water cooler.

I may simply be biased because each year it reveals how little I know, but I much prefer interesting programming problems that don’t require me to go to Reddit, read other people’s description of what math is required, go learn the math involved, and then finally implement a solution.


I'd hardly say it relies _heavily_ on obscure math tricks. Typically there are one or two days which require, perhaps, an undergraduate level of math that people would likely get in an intro class during a CS degree.

Most of the problems are about searching a space for some solution or just simulating some state changes. Recent problems involved implementing a higher-dimensional version of Conway's game of life[0], a simple arithmetic expression evaluator[1], and a simulator for a simple number game[2], for example.

The most recent one[3] involves solving a jigsaw puzzle by using a simple backtracking search (or any number of other methods). It's a bit complex, but not reliant on a particular math trick.

The vast majority of the problems in advent of code shouldn't require any math tricks, though they're often complex and involved, particularly as the month goes on.

[0] https://adventofcode.com/2020/day/17

[1] https://adventofcode.com/2020/day/18

[2] https://adventofcode.com/2020/day/15

[3] https://adventofcode.com/2020/day/20


Good point, I think I'm just scarred from this year's day 13. I had a perfect record up to that point.


Ugh. Day 13 part 2 was the only part I had to look up help on because it did rely on a weird math trick. Granted it was definitely covered in an intro to number theory course, but of course I didn’t realize that. Once I had the theorem in hand it was easy to implement.
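For reference, a hedged sketch of the theorem in question: the "sieve" form of the Chinese remainder theorem, which finds the smallest t satisfying a set of congruences with pairwise coprime moduli (the shape of day 13 part 2):

```python
# Find the smallest non-negative t with t % m == r % m for each pair.
# Assumes the moduli are pairwise coprime, as in the puzzle input.
def crt(residues, moduli):
    t, step = 0, 1
    for r, m in zip(residues, moduli):
        while t % m != r % m:
            t += step   # stepping by `step` preserves earlier congruences
        step *= m       # for coprime moduli, the lcm is the product
    return t
```

For example, crt([2, 3, 2], [3, 5, 7]) gives 23, the classic textbook case.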


Interesting - I felt this way about Project Euler, which within the first few problems was exactly as you described - heavy mathematical literacy. I haven't gotten far into an Advent of Code since 2015 (~18 days), but have done the first 5 days of 2020; no mathematical literacy needed.

For 2015, and 2020 so far - it's mostly text parsing, data structure building, and basic iteration/permutations.


Project Euler 1-50, maybe 1-100, is similar in difficulty level to Advent of Code, I'd say. After that it quickly starts to become about how much number theory you know.


That's true, the later problems are thought-provoking but have prerequisites. I enjoy number theory but I won't mess with it currently. I plan to go back to them one day, not sure though :P


Project Euler is very specifically a website that requires both math knowledge and programming ability. It's designed to expose people to difficult math concepts, while advent of code is mostly explicitly for programmers.


I haven't found this year's AoC to be too mathematical either. It seems more parsing-related, i.e. extracting all the info from a line of text is the hardest part.


Not only that but adventofcode seems to leverage purposely-obtuse input formatting as an artificial difficulty enhancement.

I stopped doing advent this year when I realized I was spending more time debugging my parsing than I was actually solving the problems. It's just not that fun for me to sometimes spend 2+ hours finding parse bugs before moving on to the actual puzzle.


The puzzles aren't fun at all. Like you said, they're basically "write a parser + complete a tedious word problem." They require substantial time investment for little creative reward, it's basically masochism.

Perl Advent Calendar is my jam.


I dunno, some of them are fun. The ship navigation one this year was fun to solve, as was the gameboy emulator one.


This is my first year attempting adventofcode. But doing other things mentioned in this discussion would have been a better use of time.


There has only been 1 day this year that I think falls into this category (and only one or two last year as well).

They are almost all algorithm based rather than math based.


I strongly disagree. I know nothing about math, yet I'm able to solve almost all the puzzles. The only time this year I was stuck because of math was a puzzle that required knowing about the Chinese remainder theorem, and even then, I've seen users solve the problem with their own intuitive method (which was obviously not as efficient as the theorem, but valid nonetheless).


The type of problems you are referencing is less than 10% of all problems ever posted in AoC.


Yeah, I acknowledged my mistake in a sibling thread. This year’s math adventure wrecked my streak so that’s what I remember, ignoring the fact that I had a streak despite being math clueless.


IMHO a text-based browser isn't exactly in the "challenging" category, as it basically amounts to stripping all the HTML tags out and doing some very simple transformations (like replacing <br>'s with newlines.) Then again, one of the things I've been working on intermittently for the past few years is a graphical (CSS2+) browser, which is definitely in the challenging category. There are some other public efforts too:

https://github.com/lexborisov/Modest

https://github.com/litehtml

https://github.com/ArthurHub/HTML-Renderer

Along the same lines, some other challenging projects I recommend are to write decoders/renderers for existing formats like MP3, MP4, PDF, etc.


To this list I would add "Web Browser Engineering" [0] which is a textbook / browser engine that is currently being written by Dr. Pavel Panchekha at the University of Utah. The code for the book and browser is available on GitHub [1] and a more current bleeding edge draft is also published [2].

The book guides the reader in implementing a graphical web browser, starting with HTTP and HTML then moving on to the layout, the box model, CSS, browser chrome, forms, and scripts.

[0] https://browser.engineering

[1] https://github.com/pavpanchekha/emberfox

[2] https://browser.engineering/draft/


Thanks, I will add that book to the post! It looks really good.


> IMHO a text-based browser isn't exactly in the "challenging" category, as it basically amounts to [...]

All my projects start with me thinking like that, and many hours, days, or months later end with me thinking "hey, it was more complex than I thought".

For 2021 I want to build a personal finance app for myself. The usual me thinks it will take a couple months. The realist me wonders if it will be finished in this decade :)


There's a difference between scope creep and difficulty.


It looks straightforward until you hit a couple of edge cases. Examples:

test <1 becomes test 1

Test< 2 becomes test 2

Test <a becomes test

Test < b becomes test b

(From memory)

What about: Test <fakeTag>?

Per tests I did, "test " was expected; however, "test <fakeTag>" was seen as the plaintext version, suggesting there's a list of valid tags which is filtering the behavior.


That's because '<' needs to be followed by [!/?a-zA-Z] to be recognised as a tag start. Otherwise it is a literal '<'.

The full details are in here somewhere: https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.ht...
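That rule is small enough to sketch. Assuming the character class above, a '<' starts a tag only when the next character matches it:

```python
import re

# '<' opens a tag only if followed by '!', '/', '?' or an ASCII letter;
# otherwise it is emitted as literal text (hence "test <1" keeping its '<').
TAG_OPEN = re.compile(r"<(?=[!/?a-zA-Z])")

def is_tag_start(text, i):
    """True if the '<' at index i begins a tag."""
    return TAG_OPEN.match(text, i) is not None
```

This reproduces the edge cases upthread: "<a" is a tag start, while "<1" and "< b" are literal text.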


I have been stuck on edge cases like these for almost 15 years building my own HTML parser.

It always works on all the HTML files I have, but then people make new HTML files with other issues.


Doing proper table layouts (including rowspans, colspans) is a little more than stripping and replacing tags.


I wrote a raytracer in 1996, and then a year later used Intel's VTune to speed it up. Just removing unused "return" statements gave me a 3x speed increase. Apparently Borland C/C++ wasn't very smart back then.

A fun project I did after that was writing an AI frame language to do goal-stack problem solving, specifically with path finding. I connected it to the ray tracer and made movies of spheres having wars. (I used an unlicensed DivX encoder to stitch together thousands of GIFs.)


That sounds fun. Do you have code for reference? Also, I want to try building a raytracer sometime as a hobby project; is it advisable to learn some 3D graphics concepts like WebGL first?


Check Ray Tracing In One Weekend out : https://raytracing.github.io/books/RayTracingInOneWeekend.ht...

You can very quickly get something on the screen with it, although getting an intuition for how it works may take a little longer and some reading around the concepts it introduces. But it does let you focus on the maths rather than worrying about also learning how Web/OpenGL works too.


For some simpler projects, I can only recommend doing some digital signal processing. For example, an audio signal is just a list of values, so you can do things like:

- Count the number of zero crossings

- Find out where they are

- Create any shape of wave by adding together multiple sine waves

- Hard clip the signal

- Stretch a signal and interpolate it with new samples

- Invert and revert a signal

For level 2, you can start processing "live":

- Create a sine synthesizer

- Create a small ring buffer of samples

- Find out how to output that audio (system audio, soundcard)

- Add MIDI support

- Add polyphony support

DSP gets hard once it has to be in real time and the latency has to be minimal. It's great exercise to mess around with it.
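A couple of the level-1 exercises above, sketched in plain Python (no audio I/O, just sample lists):

```python
import math

def sine_wave(freq=440.0, sr=44100, seconds=1.0, amp=1.0):
    """A `seconds`-long sine tone as a list of float samples."""
    n = int(sr * seconds)
    return [amp * math.sin(2 * math.pi * freq * i / sr) for i in range(n)]

def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(
        1 for a, b in zip(samples, samples[1:])
        if (a < 0 <= b) or (b < 0 <= a)
    )
```

A 440 Hz tone crosses zero roughly 880 times per second, which makes a handy sanity check.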


I came across this area recently while trying to build an iOS app that could reliably detect knocks or taps on the body of the phone. I was ultimately unsuccessful, but I learned a lot about collecting and analysing microphone and gyroscope data. I also built some nice tooling for collecting live iOS sensor data and transmitting it to prometheus/grafana over MQTT.


That sounds interesting. A cool project would be to inspect gyroscope data when the iOS keyboard is in use. Maybe it would be possible to create a keylogger based on it.

I guess it would depend on how good the data is and which keyboard is being used.

I know that this is possible on Android.


Writing a Game Boy emulator has been the most fulfilling and interesting programming project in my life.

I love, most of all, how modular the project is. I can do an hour here or there and make meaningful progress.

I'm really eager to discover other very large programming projects that break down into sensible bites so well.


I worked on a Game Boy emulator a while ago to learn C. I did pretty well and was able to run most games I came across just fine, but never got audio working... I just couldn't understand any of the stuff online about how it worked, and was unable to find any good resources that walked through how audio emulation in general worked, much less the Game Boy's version.


I had never considered an emulator to be anywhere inside of the realm of possible projects I could take on for fun. I just thought it would be too complex without having a ton of very specific knowledge.

Your comment prompted me to go look up some other attempts, and I’m really glad I did. It seems much more approachable to me now. Thanks!


Consider starting with a CHIP-8 emulator. It's a perfect size for starting out and touches on a lot of important concepts you'll apply to the next emulator.
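To give a flavor of why CHIP-8 is such a gentle on-ramp, here is a hedged sketch of its core fetch/decode loop with just two of the 35 opcodes wired up:

```python
class Chip8:
    """Minimal skeleton: 4 KB of memory, 16 registers, a program counter."""

    def __init__(self):
        self.memory = bytearray(4096)
        self.v = [0] * 16          # registers V0..VF
        self.pc = 0x200            # programs conventionally load at 0x200

    def step(self):
        # Fetch one big-endian 16-bit opcode, then dispatch on its top nibble.
        op = (self.memory[self.pc] << 8) | self.memory[self.pc + 1]
        self.pc += 2
        if op & 0xF000 == 0x6000:      # 6XNN: set VX = NN
            self.v[(op >> 8) & 0xF] = op & 0xFF
        elif op & 0xF000 == 0x1000:    # 1NNN: jump to address NNN
            self.pc = op & 0x0FFF
        # ...the remaining opcodes (draw, call, math, input) follow this shape
```

The whole instruction set fits in one dispatch like this, which is what makes the project so modular.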


> it is really simple to create the basic "database". You can start by using the dictionary data structure that comes with whatever programming language you're using and slap a web API on top of it.

Better yet: do it in C. There's no "dictionary" object type, so you have to make it yourself. You'll soon learn that a whole bunch of your assumptions about how those "dictionaries" actually work are fallacies. After you've spent a good deal of time doing that, you can move on to authentication/authorization, logging, storage, tracing, API management, resource quotas, and a raft of distributed computing issues.

I recommend basing it on Consul, it has a better general model than etcd.


Relatedly, I loved microcorruption.com, a RE/CTF challenge where you have to find exploits at the assembly level to bypass a lock. Some of the problems have hashtables implemented in assembly that you have to work through with a debugger, and it forces you say e.g. "okay, that register is storing the number of bits needed to store the [log of the number of] hashtable elements, which is then masked over the hash function output to know what bucket it needs to look in".


I disagree with the C advice. If you don't want to rely on higher-level language primitives like dictionaries and so on, there is nothing stopping you from rolling your own implementations of these data structures in a more sane language (Go, Python, Java, Clojure, whatever; anything but C). The project is challenging enough without having to fight segfaults and undefined behavior.


Writing it in C is good advice if you want to learn C (or you already know C but want to understand it better). It’s bad advice if you don’t want to learn C.

Why would you want to learn C? To better understand the machine at a fairly low level. I think there’s still a lot of value in that. I’ve found that programmers who never learned C often don’t fully understand how memory management works, for example (not that that necessarily makes them bad programmers!)

Most other non-garbage-collected languages would do the trick, like Rust or C++. But C arguably still has special value in that it’s a lot simpler than either of those -- no higher-level constructs or abstractions to distract you. Maybe Zig will be able to take over that role.


Learning C, however, is completely orthogonal to learning by building the kind of challenging project (in this specific instance a distributed datastore) that the OP is talking about. It is not going to make you understand the domain better or provide any other tangible benefit. If anything, it is going to add negative value by distracting you from $TheChallengingConceptYouWantedToLearnInTheFirstPlace

If I am working on a challenging project recreationally to stretch the limits of my comfort level, say a compiler or a distributed system, the last thing I want is the cognitive overload of trying to deal with C unless I am a C expert already. When working on such projects you should be focused on grokking the problem being solved as opposed to worrying about language idiosyncrasies.

If you want to learn C, it should be done in an environment where it is decoupled from trying to learn another extremely challenging idea. So I definitely would not recommend that someone interested in learning about distributed systems attempt to build one in C. If you are trying to learn C, you should probably build something that you are quite familiar with, so that your focus is on learning the language and its concepts as opposed to trying to implement/understand the ideas behind Paxos while also trying to learn safe memory management and pointer tricks.


I agree that the complexities of C will add to the challenge and workload and make the project take much longer. But certain aspects of C can raise interesting questions that you might not face otherwise. Some of those things will be about algorithm design, some about language design, some about computing in general. Especially if you spend most of your time in high-level languages, it can really help to have this new perspective. Some of it makes you question modern software development.

Also, I learned a lot more about C (and computing in general) by working on complicated projects rather than simple ones. I had built all sorts of simple C projects, but one complicated one taught me twice as much as I'd previously learned. It took me a long time, but it was all valuable because I got to learn what designs didn't work for C and which did.

Anyway, it's up to the reader to decide how much of their time to invest and what they want to get out of it. If you don't want to write it in C, don't, but I personally think the incidental lessons are some of the most valuable.


Agreed -- writing toy programs generally doesn't teach you a whole lot about a new programming language. To learn it in a deep way you need to use it properly, on real problems.

Although, if you're mainly interested in getting something done rather than using it as a learning experience, using an unfamiliar language is probably a bad idea. You'll learn a lot about a new language, but you probably won't build anything you could actually ship.


> Most other non-garbage-collected languages would do the trick, like Rust or C++

Only as long as you avoid any fancy libraries that do a lot of the work for you.

Granted you can use fancy libraries in C too...


It's worth it to try to implement in a low and high level language, to see the huge differences in how they can be implemented. We all know "C is fast" but it takes a project like this to go "OH. It's because the high level language is abstracting me into a corner." It was a trip when I first realized how much high level languages can limit you.


I'm not sure I agree with you. Writing a hash map implementation in C made me a much better programmer as far as performance goes because I understand the machine better.


I've been backporting parts of the C++ STL to C, so it may have a dictionary type now, depending on who you ask!

https://github.com/glouw/ctl


I'll second this suggestion. Most of my toy projects aren't in C, but every time I design software in C I learn something.


I would recommend setting aside a long enough period (e.g. 3 months) to contribute to an open source project you are using, especially if you are not familiar with that domain. I learnt a lot about modern compiler internals by contributing to rust-analyzer.



Building your own dogfood (if you use it daily, code a quick-and-dirty one) is a good way to learn that by holding the right end of the exponential, 80-20 solutions can easily be closer to 1-99+ in terms of manpower and code size.

(the flip side of 80-20 is: "all systems are fault tolerant, it's just that in most of them, the human is the component which tolerates the faults")

Of my IT daily drivers, I've done toy:

    - web browser / server
    - email client
    - document formatter
    - text editor
    - window manager
    - 3D / 2D graphic slicers/rasterizers w/ alphabetic fonts
    - shell
    - interpreters / compilers
    - operating system
    - VHDLish CPU
    - various data encodings (Hamming, MFM, etc.)
    - discrete transistor logic
(when I was just starting to program, I discovered the home directory of a colleague of my father's contained many experiments of this kind, and reading his work taught me C)
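One of the encodings in that list, Hamming(7,4), is small enough to sketch in a few lines (a toy version over bit lists rather than packed bytes):

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit Hamming(7,4) codeword
    (bit layout: p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Fix up to one flipped bit, then return the 4 data bits."""
    c = list(code)
    # each syndrome bit checks the 1-based positions whose index has that bit set
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The nice part as a toy project is that you can flip any single bit of a codeword and watch the syndrome point straight at it.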


I recently went through the process of creating a ray tracer project from zero for learning purposes. It was a humbling and eye-opening experience. I've written an article[0] to explain my process in detail if you're interested.

[0] https://alessandrocuzzocrea.com/how-i-made-a-ray-tracer/


I'd also recommend writing an emulator for real or fake (e.g. CHIP-8) hardware. It seems complicated but the core loop gets pretty simple. It ends up giving you a much better view of both assembly and pointer semantics (useful for better understanding C).


That’s in my original list from last year!


Ah, good call. I missed the link that this was a follow up post.


Related discussion from the original suggestions a year ago fwiw:

https://news.ycombinator.com/item?id=21790779


I would add a basic feature-complete website which works on every mainstream browser starting with Mosaic.

It's much easier than it may seem, architecting it is interesting, and there is a lot of "last 10%" stuff which keeps it fun as long as you keep going.

In the demystifying area, it demystified HTML and JS history for me, forced me to work with a minimal toolkit, and taught me how to build "modern" JS features in ways which will not break browsers which don't know how to do them or have them disabled.


Writing even an extremely simple game without using a game engine or dedicated game library is quite an eye opener.

Make a small 2D platform game, and it covers so many areas (and it is a lot of fun!).


Being able to do challenging projects is a cold comfort when by far the most challenging project I’ve ever faced was trying to build something people would pay for.


Ah, but that isn't a programming problem (or at least, usually not primarily a programming problem). It's a product and marketing problem.


Being a good programmer helps though; technically challenging things tend to have much smaller and worse competition.

Like, you probably won't make a lot of money making another 2D platformer no matter how well you code; they are so easy to make that there are millions of them out there already. However, if you make a performant and bug-free Factorio or Minecraft clone, you will at least get a few thousand people to try it, and from there it would grow if it is fun.


Learning what problems to solve, how to allocate resources, how to deal with building things from end-to-end, etc. are programming problems.


You're being downvoted despite being a realist, having integrity, and putting things in perspective.

But I can tell you there is no shortage of oversellers in the programming community.

A programmer who's really a sleazy salesman at heart would jump at figuring out a FizzBuzz solution on their own for the first time in their life, and would proceed to mark it as the biggest achievement in the history of mankind, make a webservice out of it, shout at the top of their lungs on social media, and would start knocking on VC doors. (hint hint: many of them actually get away with something not too different).


I face the same problem. I think programmers can be skilled in a lot of different tasks, but for large projects two hands and one brain are not enough, no matter how smart you are; you need co-founders. Loneliness leads you to procrastinate eternally.


I completely agree. Writing apps for the app store is the perfect example. I started out writing well-coded apps that didn't make money, and as time has progressed my apps have lower code quality but make more money. I focus less on code quality now and think more about marketing and appeal.


In addition, my concern with this is that it’s totally OK, in my opinion, to not do any of these, and I doubt that for most people these projects will make you better at your job (unless one happens to be your next project). So if you want to tune out in your spare time, that’s OK too.


What I am really missing is some kind of real-time AI project. A decade ago I coded a bot for a first-person shooter with RTS elements and learned so much from it (while having a lot of fun).

It starts with basic things like waypoint systems vs. area-awareness systems, plus the relevant routing algorithms like A*, but goes on to organizing a group of players and finding good strategies. And all of that with a limited time budget and a changing environment around you. Last but not least, you want to emulate human behavior, which is probably the hardest part, as it includes adapting your behavior to your situation (don't run straight against a wall for 10 seconds) but also taking human weaknesses into account, e.g. humans can't aim perfectly.

Granted, what I have done has a huge field of challenges, but even with a 2D engine I think you can learn a lot from the experience.
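The waypoint routing mentioned above boils down to A* over a graph. A minimal grid version might look like this (a sketch with a Manhattan-distance heuristic and 4-way movement, not the commenter's actual bot; a game bot would usually run this over a waypoint graph instead):

```python
import heapq

def astar(grid, start, goal):
    """A* over a 2D grid of 0 (free) / 1 (wall).
    Returns the shortest path length, or None if unreachable."""
    def h(p):
        # Manhattan distance: admissible heuristic for 4-way movement
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start)]   # (estimated total, cost so far, cell)
    best = {start: 0}
    while frontier:
        _, cost, pos = heapq.heappop(frontier)
        if pos == goal:
            return cost
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = pos[0] + dr, pos[1] + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                new_cost = cost + 1
                if new_cost < best.get((r, c), float("inf")):
                    best[(r, c)] = new_cost
                    heapq.heappush(frontier, (new_cost + h((r, c)), new_cost, (r, c)))
    return None
```

The "limited time budget" part is what makes the game version interesting: you may have to spread a search like this across several frames or cache routes between waypoints.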


You may be interested in the game Screeps. From their website (name +.com): "It's an open-source game for programmers, wherein the core mechanic is programming your units' AI. You control your colony by writing JavaScript. "


I looked this up and it seems very cool. Do you (or anybody else) know any more of these games or environments that can be controlled by code?


I would add an emulator to the list. I've always struggled to figure out how these are built from scratch.

A long time ago I wanted to code a Neo Geo emulator but gave up before I even started; I didn't have a clue where to begin.

I am amazed at anyone that can code an emulator from scratch.


they're really simple, in their basic form and for simple ISAs!

it mostly boils down to keeping a bunch of registers and a giant switch statement. Each case simply implements the opcode. You have an array of bytes for the memory, and some emulated devices (e.g. trigger a screen update when the framebuffer memory gets changed, or set the instruction pointer to a handler when a key gets pressed.)

It gets hard when instructions need lots of decoding, or you have 3d graphics hardware to emulate, or if you have something like a BIOS, or if you want to JIT.
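a sketch of that basic form in Python, with a made-up 3-register toy ISA (the opcodes here are invented purely for illustration):

```python
def run(memory):
    """Toy fetch-decode-execute loop for an invented 3-register ISA.
    Opcodes: 0 = HALT, 1 = LOAD r, imm, 2 = ADD r1 += r2, 3 = STORE r, addr."""
    regs = [0, 0, 0]
    pc = 0
    while True:
        op = memory[pc]                  # fetch
        if op == 0:                      # HALT
            return regs
        elif op == 1:                    # LOAD: regs[r] = immediate
            regs[memory[pc + 1]] = memory[pc + 2]
            pc += 3
        elif op == 2:                    # ADD: regs[r1] += regs[r2], 8-bit wrap
            regs[memory[pc + 1]] = (regs[memory[pc + 1]] + regs[memory[pc + 2]]) & 0xFF
            pc += 3
        elif op == 3:                    # STORE: memory[addr] = regs[r]
            memory[memory[pc + 2]] = regs[memory[pc + 1]]
            pc += 3
        else:
            raise ValueError(f"unknown opcode {op} at {pc}")
```

a real CHIP-8 or Game Boy core is the same shape, just with a few dozen more cases and memory-mapped devices hanging off the reads and writes.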

I'm actually eyeing implementing a console on an FPGA as my holiday project, with something like Chisel.


It seems like a lot of the complexity may also come in when you have important circuits to emulate other than the CPU. In that case you'd have more to worry about in terms of timing, dataflow, and synchronization. And some platforms might conceivably have analog circuits that play an important role too, although I guess you might be able to abstract that out by trying to create a rough digital functional equivalent even if it isn't completely faithful to the behavior of the analog part.


that's true! but I still think you can get a lot of the way with a NeoGeo emulator without it being perfect on timing. that's a lot of the reason emulators work well on most games but fail to replicate the right graphics or sound in some: some games played hardware tricks to optimize.


My friend who works on the MAME project told me that it's a very friendly platform for adding new emulators (whether you want to add an emulation of a game, a device, a platform, or whatever) because it provides a good structure to work with and a lot of tools to describe recurring patterns in electronics and systems. That might be a good middle ground to start with -- trying to add support in MAME for some device that's not presently emulated there. (Apparently the MAME project really appreciates such contributions, since they think of themselves as pursuing mainly a historical preservation and documentation mission, and would always like to see more devices covered.)

Although that's not "from scratch" because it would still be using their libraries and plumbing, it could be "from scratch" in the sense that you might take an environment with 0% support for emulating a certain device or system and build it up to having 100% support. So it seems like a nice way to start.


Their previous article mentioned it: https://web.eecs.utk.edu/~azh/blog/challengingprojects.html.


another one related to stock trading, but perhaps more interesting: build a simulator for a sport. both baseball and darts lend themselves to markov models, and are simple enough to simulate in some detail. with darts, you can get very close to as accurate as possible. baseball has more weird complications because of the rules. but it's fun to do, and to compare to old games to see how well your model does.
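a toy darts leg in that spirit might look like this (grossly simplified: two outcomes per throw and no double-out rule; `p_treble` and `p_finish` are invented parameters, the state is just the remaining score):

```python
import random

def simulate_501(p_treble, p_finish, rng):
    """Simulate one simplified leg of 501. Each scoring throw hits
    treble 20 (60) with probability p_treble, else single 20; once the
    remaining score is reachable in one dart-group (<= 60), the player
    aims at the finish and checks out with probability p_finish.
    Returns the number of darts thrown. Toy model, not real darts."""
    remaining, darts = 501, 0
    while remaining > 0:
        darts += 1
        if remaining <= 60:
            if rng.random() < p_finish:
                remaining = 0            # checked out
            # else the dart misses the target and scores nothing
        else:
            remaining -= 60 if rng.random() < p_treble else 20
    return darts

def average_darts(p_treble, p_finish, n=2000, seed=0):
    """Monte Carlo estimate of darts per leg for a given skill level."""
    rng = random.Random(seed)
    return sum(simulate_501(p_treble, p_finish, rng) for _ in range(n)) / n
```

with both probabilities at 1.0 this produces the famous nine-dart finish, which makes a nice sanity check before you start refining the state model.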


have you made money?


Thanks for following up with this list after your previous one. I spent a great deal of my time (including some office time) on writing a Chip-8 emulator thanks to your previous list :D


My own favorite is an equation parser.

Before attempting to do so I thought it was implemented as a simple seek over the string, maybe a bunch of regex stuff. I guess it can be done that way, at the cost of growing complexity; but the proper solution (with a stack, etc.) is so elegant (making it easy to add functions, operators, parentheses, variables, etc.) that it really makes one appreciate the value of good, thoughtful engineering.
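For reference, a minimal stack-based evaluator in that spirit (a shunting-yard variant that assumes whitespace-separated tokens and supports only +, -, *, / and parentheses; functions and variables would bolt on the same way):

```python
def evaluate(expr):
    """Evaluate a whitespace-tokenized infix expression with two stacks:
    one for values, one for pending operators."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    apply_op = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
                "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    values, ops = [], []

    def reduce_top():
        # pop one operator and its two operands, push the result
        op = ops.pop()
        b, a = values.pop(), values.pop()
        values.append(apply_op[op](a, b))

    for tok in expr.split():
        if tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                reduce_top()
            ops.pop()                    # discard the "("
        elif tok in prec:
            # reduce anything of equal or higher precedence first
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[tok]:
                reduce_top()
            ops.append(tok)
        else:
            values.append(float(tok))
    while ops:
        reduce_top()
    return values[0]
```

For example, `evaluate("3 + 4 * ( 2 - 1 )")` respects precedence and grouping without any regex gymnastics.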



Interesting projects. I might try ray tracing in Python as I’m also exploring a lot about CGI lately.

Has anyone tried CGI? If so, how has your experience been so far?


What do folks think about implementing a web crawler that you can point at a website to index every internal URL on the site? I remember sitting down to write one 100 years ago now, and finding it to be much trickier than I thought it would be.


that's interesting. grab only the URLs, not the content?


right, but of course to do that, you'd have to grab the content to parse it :)
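a minimal sketch of that loop using only the Python stdlib (the `fetch` function is injected so it can run offline; a real crawler would also need robots.txt handling, rate limiting, and error handling):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """BFS over internal links. fetch(url) -> html string.
    Returns the set of same-site URLs discovered."""
    site = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue:
        url = queue.pop(0)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]  # resolve + drop fragments
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

most of the trickiness in practice hides in exactly the parts elided here: relative URL resolution edge cases, duplicate pages under different URLs, and servers that misbehave.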


These look fun!


Awesome couple of articles.


How about designing your own virtual memory system?


that's like Prof Frank Mueller level


>> automate testing on historical data over long periods of time

I want to try this. Where can you get access to historical pricing data that includes price changes during the day, not just end-of-day prices?



