We only had Nicholas for the last two weeks (he was on break). Turning our hacknight project into a real CTF was his intern project last year; his help at launch was more or less indispensable. Our scheduling was not discretionary.
I'm sure Stripe has a similar story.
We all agree, more careful spacing would be great. Oh well! :)
Level0 was fun; Level1 threw me immediately into "this looks like it's gonna suck" territory. I know Git fairly well, but I don't know it that well. I poked at the shell script a bit, but I just ain't seeing it. It's also a lot harder to work on without the easy benchmarking that Level0 provided. How do I know if I'm actually improving?
Props in general, I just feel like Level1 was kind of a brick wall for me.
It's not calling the git command specifically that's problematic; it's calling any external command at all. So you pretty much have to recode the hot part in another language, where you manipulate buffers directly instead of piping data around.
I did exactly that, but my SHA-1 (in Ruby) is ending up different from Git's. Even weirder, when I take the header + body and use an online SHA-1 tool, I get a third, different value (not equal to git-hash-object's nor to mine in Ruby). Is the body supposed to be modified to add LEDGER content?
Look again at Object Storage under the Git Objects link and you should get it. You're probably not quite matching what it does; it's a simple SHA-1, so you're probably just missing one little part of the string to hash. You're on the right track: first test that your hashes match git hash-object, and once you have that, you just need to solve it faster, and recover when you discover that someone else got there first.
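For reference, a minimal Ruby sketch of that scheme from the docs (the file name is just an illustration):

    require 'digest/sha1'

    # Git hashes "<type> <size>\0<content>" with SHA-1; 'some_file' is a
    # placeholder for whatever you're hashing.
    content = File.read('some_file')
    puts Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")

The result should match `git hash-object some_file`; a commit object works the same way, with type "commit" and the commit text as the content.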
It is possible in Ruby; that's what I used first. The reference implementation is in Ruby too, though, so a single process may not cut it unless you are lucky. Keep trying if you are solving it but not quite making it in time, as you're competing against a variable completion time on the other side.
The link to the git docs is fairly illuminating - there's only a very small part you need to understand. In particular, look at Object Storage under the Git Objects link.
It is a bit tricky because you're competing against the bot miners' random timings, and you need to mine before the bots do: speed up finding a valid commit hash, then make the commit. I hope that is useful without giving too much away.
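Concretely, the hot loop ends up something like this sketch (assuming the level works the way it looks from the outside: you vary a counter in the commit message until the commit's SHA-1 compares lexicographically below the difficulty string; every name here is illustrative):

    require 'digest/sha1'

    # Hypothetical miner: tree, parent, and difficulty come from the repo;
    # the author/committer lines and the message are placeholders.
    def mine(tree, parent, difficulty)
      counter = 0
      loop do
        body = "tree #{tree}\n" \
               "parent #{parent}\n" \
               "author CTF user <me@example.com> 0 +0000\n" \
               "committer CTF user <me@example.com> 0 +0000\n" \
               "\nGive me a Gitcoin\n\ncounter: #{counter}\n"
        sha1 = Digest::SHA1.hexdigest("commit #{body.bytesize}\0#{body}")
        return [sha1, body] if sha1 < difficulty  # plain string comparison
        counter += 1
      end
    end

Once it returns, you still have to write the object and update the ref faster than the bots do.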
Getting the following output when pushing my code:
remote: > Your submission has been placed in the queue.
remote: > Kicking off a build for your submission (running `./build.sh`).
remote: > `./build.sh` succeeded
remote: > Kicking off 3 trials. Here goes...
remote: > Started running Trial 0
remote: > Started running Trial 1
remote: > ERROR: There was an internal error scoring Trial 0. If this error persists, please let us know at email@example.com and include the following error token: err_3MRLt7sxWqqD4C
7fcfd3d..17d0e35 master -> master
My code appears to be working, because running ./test/harness says the tests pass.
Are we allowed to discuss solutions here? Or does everybody have to work out the results separately?
I don't know Ruby, but I can basically guess what it is doing:
1. Read a list of words from a file, using an input path or a default path
2. Take runtime input from the user: any number of words before the carriage return (I guess the words are separated by spaces)
3. Loop over the input words, comparing each one, lowercased, against the words in the list loaded from the file. If there is a match, print the original word; otherwise print the word wrapped in some marker characters (see the sketch below).
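In Ruby, that guess corresponds to roughly the following (a sketch reconstructed from the description above, not the exact reference script; the angle-bracket marker is an assumption):

    path = ARGV[0] || '/usr/share/dict/words'
    entries = File.read(path).split("\n")
    output = $stdin.read.gsub(/[^ \n]+/) do |word|
      # Array#include? scans the whole list: O(M) per word, O(N*M) total.
      entries.include?(word.downcase) ? word : "<#{word}>"
    end
    print output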
It's an O(N*M) algorithm. If the data set is small, we won't see much difference. Otherwise, we need a better algorithm to improve the speed.
1. Sort the list from the file (removing duplicates)
2. Binary-search the sorted list, halving the search range each time, to reduce the number of comparisons
3. Sort the input list (keeping counts of duplicates)
4. Print the duplicates directly, without comparing them again
This way the runtime improves to O(N*lg M). This may not be the best algorithm; I'd like to hear better solutions from you. (A sketch of steps 1-2 follows.)
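A quick Ruby sketch of steps 1-2 (using Array#bsearch; the duplicate handling from steps 3-4 is left out):

    path = ARGV[0] || '/usr/share/dict/words'
    entries = File.read(path).split("\n").uniq.sort
    output = $stdin.read.gsub(/[^ \n]+/) do |word|
      # bsearch is O(lg M) per word once the list is sorted and deduplicated.
      hit = entries.bsearch { |e| word.downcase <=> e }
      hit ? word : "<#{word}>"
    end
    print output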
Of course. If it were in Java, I'd definitely do that; we use HashMap everywhere. Since the original code looks like this, I thought Ruby didn't support hashes.
We usually implement code at the application level, so we don't really care about the underlying algorithms. But if we have to implement at the system level, we do care about how memory is manipulated and the number of operations, i.e. Big O.
The CTF does not give a description of the problem, just the code. And it encourages new users not to change the algorithm too much, just to save a few steps, which implies making only minor changes to the code.
Look at my other comment; a hash may not always be better than a list.
Yes, agreed. I was sort of misled. It'd be better if the CTF gave out a problem description followed by the sample code, or didn't give any sample code at all.
For me, it makes more sense to receive the detailed information first, especially the system conditions and restrictions for troubleshooting and problem solving, followed by brainstorming, instead of going directly to fixing code. We are used to the pattern of making the minimum code change, especially in production. If the code should be completely rewritten, then we need to know the requirements and re-implement it. Sounds like we have different conventions, though.
Unfortunately, I cannot agree with you completely.
If there is no language constraint and no system resource constraint, then for the problem as we understand it so far, using Java would be the fastest and easiest way, without a hashmap.
Load the complete file as a string (depending on the size of the data set; up to 2^31 - 1 characters), then use the String.indexOf() method to get the best result.
The underlying algorithm of indexOf() is implemented by the JVM in C code, which is much faster than any other implementation.
My gut told me it's weird to use a hashmap for string lookup; everybody knows a hashmap is for looking up key-value pairs. The real reasons for not using a hashmap here are:
1. A hashmap's lookup is O(1), but that doesn't count the build cost: if the data set is huge, it takes a long time to build the hashmap, since every element added beyond the initialCapacity needs a rehash.
2. The underlying implementation of indexOf() uses some algorithm, an automaton or something similar, to do a fast search within a string.
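For comparison's sake, the same "string index" idea translated into Ruby (the delimiters and names here are illustrative):

    # Keep the dictionary as one big newline-delimited string and search it
    # directly; each lookup scans the string, so it's O(dictionary size).
    dict = "\n" + File.read(ARGV[0] || '/usr/share/dict/words') + "\n"
    word = "hello"                      # illustrative input
    puts dict.include?("\n#{word.downcase}\n")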
So there are lots of alternative solutions; don't assume there is only one. I'm not in this field, and I'm not interested in getting into it too much, but I don't think the best answer is that tiny change.
This is why I suggested considering whether you are doing application-level optimization or changing a system-level algorithm. Building software is a lot more than code manipulation; understanding requirements is the first step in the SDLC (Software Development Life Cycle).
I'm not sure what you mean by "many simple problems". Is this one example of that? I'm trying to compare the two alternative solutions to this problem:
1) hashmap
2) string index
Which is better? I think about what I would do if I were using it in my web application.
First, the answer still depends on the size of the data set. If the dictionary is huge, say up to 4GB, I'd use string indexing over a hashmap, because the memory space is expensive. And how to break it into multiple substrings is another performance-tuning issue.
If the problem is simple enough, with a not-too-large data set, a hash will work.
When I mentioned "one way", I mean the "best way". So now you are talking about "The best answer is one that passes the test". So do you mean that all the answers which can pass the test are the best answers, or there is only one best answer which can pass the test? I don't put my personal preference on the problem solving. I'm always looking for the best solution for a particular problem under certain condition and constraints. Once we figure out the answer, coding implementation using which language does not matter that much, unless Ruby does not support the same algorithm of string.indexOf() as Java does.
> So do you mean that all the answers which can pass the test are the best answers
Yes, what I was trying to say was that the performance required is just that to pass the test, not more, and the dataset is a few MB here, not 4GB. This is actually really similar to a lot of problems in real life; you can spend ages trying to find a platonic solution when a simple solution works fine given the dataset and requirements (return an answer within 200ms for example). Sometimes simpler is better, and even if you can improve the solution, it won't really matter to whoever pays the bills.
There are lots of solutions though; I tried a few just out of curiosity, and you can of course improve on a hashmap. The possible solutions to a problem this small are pretty similar whatever language you choose, and sometimes, when a dataset is this small, more complex solutions are slower (unless you preindex).
I have no intention of choosing complicated solutions for a simple issue. First, I'm always concerned about the size of the data set; as my first sentence said, if the data set is not big enough, you can't tell much difference. Second, I didn't insist on any solution. So far I have proposed three of them, and would like to discuss the pros and cons under different conditions with you all. Some people in other posts told me that the first solution, using sorting + binary search, is actually way faster than a hashmap and deserves more than 2000 points; I tried to verify whether that's true, but nobody answered. Some people helped by providing the Big O comparison between hash lookup and a linked list, which was also constructive. But did you learn from it that building a hashmap costs more than O(1)? Lastly, I proposed the Java solution with string indexing; did you ever hear about that?
I don't need to spend much time finding the solutions; they are off the top of my head. Depending on the conditions, I'll use different solutions. 4GB is the upper bound of string indexing; if it's being used for web indexing, that's still not enough. If, as in this case, it is used for enterprise-level document indexing with a few MB, any of them is fine. The difference is just the score you get.
I guess eventually you will pick the hashmap solution, because you have the index built into the lookup, which makes more sense than the other solutions; but you (or they) didn't give out the conditions in the first place. How can we discuss on that basis? Looks like I pretty much wasted my time on this issue, and the lesson learned is that keeping things simple is better. Thank you.
Thank you for your advice. I'm really new to this game, and I don't even know how to use GitHub. Don't laugh at me; I have plenty of experience coding by myself and reading/modifying other people's code. I don't have time to play the CTF, but I'm curious how other people are thinking about it, especially about distributed systems.
Coding is the last thing we need to be concerned about, since various languages are available for implementation. System architecture design, data modeling, application performance, security, and algorithms are more important.
Regarding the language itself, I understand that Ruby and Python are quick and easy, but they're not for real production systems. Lots of teenagers are using them now. Maybe you are not happy to hear that. Here are two blog articles about it:
Thanks for reading. Is there any conflict between the two?
The original words were "You're giving them a throat to choke". A minimum set of tools is always necessary, whether on Windows or Linux, especially from open source. Just make sure you don't put your application at risk of needing re-implementation later.
I don't look down on teenagers who are able to build websites using those languages. Maybe this type of exercise is really good for, or suited to, them. And coding is definitely the no. 1 skill they possess.
The people I'm addressing are the hackers aiming to run startup companies, and those who are going to grow into software engineers.
I'd love to know whether you have a reference showing that, with a large data set, hashing is faster than a linked list. I don't have a reference at hand, but I know that hashing has a higher cost from a speed and memory point of view.
I don't see that the original code is a huge issue per se, one that requires 19 levels(?) of improvement. I'm curious how you guys move on, though. I'd appreciate it if you keep posting your solutions and progress.
I guess by the end of the entire program, people may learn how to send a query to multiple indexing servers in a concurrent (I prefer this to "distributed") system and then gather all the results together. In the end, it shows how advanced the algorithms are that Google's search engine uses to index terabytes of data.
Looks like one simple change to a hashmap may not be the best answer; may I draw that conclusion from your case? The best score has now gone up to almost 3000. I can't imagine how many possible approaches are out there, and that one of them could be that excellent. Just curious.
The hash _solution_ is O(n + m) in that it requires iterating through the words read from the dictionary file (O(m)) to fill the hash and then iterating through the inputs (O(n)), each time doing an O(1) hash lookup. You can think of it as: you need to read each word from the dictionary file once and you need to check each input word once; kind of hard to go faster than that.
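In Ruby, that solution is only a few lines; a sketch (the angle-bracket output marker is an assumption carried over from the earlier description):

    require 'set'

    # O(m) to build the set, then O(1) per input word: O(n + m) overall.
    path = ARGV[0] || '/usr/share/dict/words'
    entries = Set.new(File.read(path).split("\n"))
    output = $stdin.read.gsub(/[^ \n]+/) do |word|
      entries.include?(word.downcase) ? word : "<#{word}>"
    end
    print output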
Yes, I thought about it exactly the way you explained. But my point is about trying to save more time, which may or may not be necessary depending on the size of the input data set. If we use caching, we can save the trip to the hashmap for each duplicated word in the input.
However, in Big O terms, constant factors do not count. So it's O(n) either way: when n is near m, O(2n) is still O(n); when n >> m or n << m, it is O(n) or O(m) respectively. But at runtime there might be some difference. Again, in most normal applications using a relational database, we will not use huge amounts of memory to store an extremely large hashmap; we'll put a cap on its size. Indexing unstructured data is a different story.
>save the trip to look up the hashmap for each duplicated word of the input.
The time required to do a hashmap lookup should be small and of constant time. If you did cache it somehow, you would need a way to do lookups of what was cached (which might require another hashmap lookup.)
I'm saying that trying to cache hashmap lookups might not save you any time and is probably not worth thinking about. Another poster has indicated that the solution described above (without trying to do some kind of caching of duplicates) should be sufficient to pass the Stripe CTF stage, so I would recommend just implementing that.
Good point. It really depends on the size of the data set.
If the input data set size m << the dictionary data size n, caching is worth trying (sketched at the end of this comment).
Otherwise, the input word list should be loaded into an ArrayList rather than a LinkedList, which is the most expensive list to use when we need to access by index number. Using a Set would lose the order.
(I'm just brainstorming steps for discussion rather than writing code. I'd like to hear the best solution, and the steps to improve it, from you guys.)
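As for what the caching idea would look like, here is a hypothetical memoized lookup in Ruby (note that with an O(1) set lookup it buys little, which is the point made above):

    require 'set'

    entries = Set.new(File.read('/usr/share/dict/words').split("\n"))
    # Memoize per distinct word; repeated words skip the downcase + lookup.
    cache = Hash.new { |h, w| h[w] = entries.include?(w.downcase) }
    words = $stdin.read.split
    puts(words.map { |w| cache[w] ? w : "<#{w}>" }.join(' '))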
It looks like you have another in-progress Git operation, and we limit you to one concurrent operation. Waiting 2s (attempt 1/3)
It looks like you have another in-progress Git operation, and we limit you to one concurrent operation. Waiting 2s (attempt 2/3)
It looks like you have another in-progress Git operation, and we limit you to one concurrent operation. Waiting 2s (attempt 3/3)
ERROR: Timed out waiting for your other Git operation to complete. Try again once it has finished.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Yep, that happens if you try pushing from multiple clients at once (in order to prevent people from spamming us with solutions). If you're still getting that even after a few minutes, ping us at firstname.lastname@example.org and we'll see what's going on.
I tried once, but it stopped. I waited a couple of minutes and then Ctrl+C'd it.
warning: push.default is unset; its implicit value is changing in
Git 2.0 from 'matching' to 'simple'. To squelch this message
and maintain the current behavior after the default changes, use:
git config --global push.default matching
To squelch this message and adopt the new behavior now, use:
git config --global push.default simple
See 'git help config' and search for 'push.default' for further information.
(the 'simple' mode was introduced in Git 1.7.11. Use the similar mode
'current' instead of 'simple' if you sometimes use older versions of Git)
Counting objects: 37, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (33/33), done.
Writing objects: 100% (36/36), 897.66 KiB | 0 bytes/s, done.
Total 36 (delta 4), reused 14 (delta 0)
Well, my first thought was to use C++11 to do the same thing, to give me a better basis for moving forward, but I can't seem to use anything outside the C++ standard libraries. That's cool, because we can always statically link everything and be done with it, and then:
version `GLIBCXX_3.4.18' not found (required by ./level0)
The missing package that started all this was the Boost regex library. I sent an email asking for libboost-dev to be installed; that should be more than enough for any problem you could throw at a C++ developer.
I know the problem is that I am still dynamically linking glibc, and whatever system I am running on doesn't have a new enough glibc. I am going to take another crack at it later and try to get it fully statically linked, but once you get to statically linking the C runtime I need to invoke my Google-fu, and that's more time than I had on my lunch break.
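In case it helps anyone else: the GLIBCXX error above usually just means libstdc++ is still dynamically linked. Something along these lines should produce a self-contained binary (a sketch from memory, with level0.cpp standing in for your source file; note that fully static glibc has known caveats, e.g. around NSS name lookups):

    g++ -std=c++11 -O2 -static -static-libgcc -static-libstdc++ level0.cpp -o level0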
Looks like the git repositories used to submit solutions are over SSH? Unfortunately this won't work in an environment that requires outbound traffic to go over an authenticated proxy. Any chance of getting git-over-HTTPS support?
I have not gotten the test harness (written in Python) to run on Windows yet, but the level0 executable is just a Ruby script and can be run on Windows with some modifications. If you run "python test/harness", it will pull the dictionary down locally into the test folder. Once that's done, make the following modification to the level0 file:
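Presumably along these lines, swapping the default dictionary path for the local copy (an assumption based on the next sentence; the exact edit may differ):

    # Assumed change: fall back to the locally downloaded dictionary
    # instead of /usr/share/dict/words, which doesn't exist on Windows.
    path = ARGV.length > 0 ? ARGV[0] : 'test/data/words'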
This assumes Ruby is in your PATH and that "test/data/words" is the renamed words dictionary file. It's a little work, but you can definitely do it on Windows. I'm working on getting the Python test harness to work on Windows.
Note that `vagrant up` NEEDS to be run from an admin console for Level 2. The sample application is written in node, and npm wants to make symlinks on the mounted Windows filesystem, which is apparently a privileged operation.
It's a bit of a shame the test harness isn't Windows compatible; it's rather small, and it doesn't look like the fixes required would have been very difficult.
I figured I could either spend the next week fighting script execution on Windows or just spin up a Linux VM and be good to go. I wound up installing Ubuntu manually on a VirtualBox instance. I'd heard of Vagrant, but didn't realize what it offered!
Yes, it seems like it is changing every minute or so (or perhaps that's just my imagination), which makes it pretty difficult to test all the hashes, even trying millions of them. I might have to switch to a faster language :(
Not sure yet! We'd love to package it up nicely for people to run, but I can't make any promises about when we'll have a chance to do so. Note that it should be pretty easy for us to just distribute the source code for each level, which would give you most of the functionality of these levels.