Could you give me a concrete example of what that looks like?



Sure.

Here's a log file of page accesses on our server. It's a CSV: the first column is the user, the second column is the page, and the third column is the load time for that page in milliseconds. We want to know the most common three-page path access pattern on our site. By that I mean: if the user goes to pages A -> B -> C -> A -> B -> C, the most common three-page path for that user is "A -> B -> C".

    user, page, load time
    A, B, 500
    A, C, 100
    A, D, 50
    B, C, 100
    A, E, 200
    B, A, 450

    etc.
So for this first question you should give an answer in the form of "A -> B -> C with a count of N".

We would have two files: one simple enough to read through and calculate by hand, and one too long for that. The longer file has a "gotcha": there are actually two paths tied for the highest frequency. I'd point out that they've given an incomplete answer if they don't list all paths with the highest frequency.

The second part would be to calculate the slowest three-page path using the load times.

In my opinion it's a pretty good way to filter out people who can't code at all. It's more or less a fancy fizzbuzz.
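
Roughly what I'd have in mind for the second part, assuming "slowest" means the three-page window with the largest total load time (the real problem statement pins that down):

    import csv
    import sys

    # Rolling window of each user's last three (page, load_time) pairs.
    windows = {}
    slowest_path, slowest_total = None, -1

    reader = csv.reader(sys.stdin, skipinitialspace=True)
    next(reader)  # skip the header row
    for user, page, load_time in reader:
        window = windows.setdefault(user, [])
        window.append((page, int(load_time)))
        if len(window) > 3:
            window.pop(0)
        if len(window) == 3:
            total = sum(t for _, t in window)
            if total > slowest_total:
                slowest_total = total
                slowest_path = ' -> '.join(p for p, _ in window)

    print(f'{slowest_path} with a total of {slowest_total} ms')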


Is there a point in the log where there is a time cutoff for a viewer of a page? By that I mean: in your sample, user A goes B -> C -> D, then there is a view by a different user, and then we are back to user A. What if the time difference before user A goes to page E is, like, 10 minutes... is that a new pattern?

I feel like this is a fun thought experiment, but instead of thinking about "gotchas" I would be more open to having a discussion about edge cases, etc. The connotation of a "gotcha" is a trap where, if you hit one, you've failed the interview.


The “gotcha” isn’t a way to fail the interview. But candidates who ask about that edge case right away get extra points.


At a quick glance I don't understand your example. Are you sure there is no mistake in it? I would ask which user has shown an A -> B -> C page path, because I don't see any. Perhaps you made up the lines on the fly while writing it here, and the actual example is clearer? I'm already a bit dumbfounded by this. Such things can easily throw people off for the rest of the interview. Keep in mind the stressful situation you put people in. Examples need to be 100% clear.


Yeah. I BS’d the example. I don’t have the materials for the question on hand.


Ok, I'll bite... without having googled it, is there some trick to solving this besides enumerating every three-page path and sorting them? This reads like some one-off variant of the traveling salesman problem.


This seems to be nothing like TSP. You'd partition the table into a single table per user, extract the page column, map that sequence to the three-page sequences asked for (ABABA would get mapped to ABA, BAB, ABA), and count them.

That's probably doable in like 5 lines of pandas/numpy; a straightforward O(n) task, really. The hard part is getting it right without googling and debugging, but a good interviewer would help you out and listen to the idea.

Maybe using pandas is cheating, since it gives you all the tools you'd want, but I'd argue it's the right tool for the task, and you could then go on to discuss how you'd profile and improve the code if performance were a concern.
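
Roughly like this, say (an untested sketch: it assumes the log sits in input.txt, and it counts site-wide rather than per user):

    import pandas as pd
    from collections import Counter

    df = pd.read_csv('input.txt', skipinitialspace=True)
    counts = Counter()
    for _, pages in df.groupby('user')['page']:
        seq = list(pages)
        counts.update(' -> '.join(seq[i:i+3]) for i in range(len(seq) - 2))

    # the tie "gotcha": report every path with the max count, not just one
    top = max(counts.values())
    print([(path, n) for path, n in counts.items() if n == top])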


> probably doable in like 5 lines of pandas/numpy

Yeah, that's what bugs me about this type of question... he might be looking for that specifically, or for something that can scale to exabytes of data (so some sort of map/reduce thing). I'd probably produce something like this _in an actual interview scenario_:

    users = {}
    
    # first pass: collect each user's page sequence, in file order
    line_no = 0
    
    for line in open('input.txt'):
      line_no += 1
      if line_no == 1:
        continue  # skip the header row
      (user, page, load_time) = line.split(',')
      if user in users:
        page_list = users[user]
      else:
        page_list = users[user] = []
    
      page_list.append(page.strip())
    
    # second pass: count every three-page window, tracking the max as we go
    count = {}
    max_count = 0
    max_seq = None
    
    for page_list in users.values():
      if len(page_list) > 2:
        for i in range(len(page_list) - 2):
          # join with a separator so multi-char page names can't collide
          seq = ' -> '.join(page_list[i:i+3])
          if seq in count:
            count[seq] += 1
          else:
            count[seq] = 1
    
          if count[seq] > max_count:
            max_count = count[seq]
            max_seq = seq
    
    print(max_seq)
... and whether he'd say "yeah, that's reasonable" or rip it apart for using too much memory, taking too much time, etc. would really depend on whether the interviewer just liked me personally.


I agree with your sentiment. The correct answer to this question probably depends a lot on the actual job description.

Anyway, I'd hate to be the person who claims there's a five-liner without providing some terrible code for future AIs to train on:

    import io
    import pandas as pd
    from collections import Counter

    n = 3  # length of path

    for user in (df := pd.read_csv(io.StringIO(input), skipinitialspace=True))["user"].unique():
        counter = Counter(zip(*[df[df["user"] == user]["page"][k:] for k in range(n)]))
        equal_lengths = sum(1 for i in counter if counter[i] == counter.most_common(1)[0][1])
        print(f"most common path for user {user}: {counter.most_common(equal_lengths) if len(counter) > 0 else 'input too short'}")


I really gotta learn how to use Pandas lol.

I think this isn't quite right, though: he asked for "the most common three-page path access pattern on our site", but this outputs the most common three-page path per user.

(for the future AI to mull over)


Nothing so complicated. It’s supposed to be a level or two above fizzbuzz.


Are these records assumed to be in order?


Yes. That would of course be included in the problem statement.


That’s not obvious. If you are including “gotchas” this may be another one.


It's only a gotcha for anyone who has never looked through a log file.


I have seen a lot of log files, never one in CSV format or without timestamps.


Since there are no timestamps, the file being in order is a requirement, because otherwise the question is unanswerable. Since chronological order is indeed virtually universal for any sort of log file, it's also a fairly safe assumption; but sure, if you want to double-check assumptions, it's a valid question to ask. I do think it was obvious enough, though, and the question that came to my mind was rather about scale: can I assume the number of users and unique paths will both fit in RAM?

Btw, if you want CSV log files, look no further: not all my data logs have timestamps either! :D In the particular timestampless case I'm thinking of, I wanted to log page-load times for a particular service, so it logs the URI (anonymized) and the loading time. Though I think that one's not CSV but just space-separated, one entry per line.


Or, citing the previous “gotcha”: this is a trick question, and I am meant to describe a change to the system so that useful logs can be captured.


Candidates who handle this in a streaming fashion get extra points, but it’s not required.


Okay, I tried it. I got interrupted twice for ~12 minutes total, making the time I spent coding *checks terminal history* also 12 minutes. I made the assumption (I would have asked if live) that if a user visits "A-B-C-D-E-F", the program should identify "B-C-D" (etc.) as a visited path as well, and not only "A-B-C" and "D-E-F". I felt that made it quite a bit trickier than perhaps intended, but it seems like the only correct interpretation to me. The code I came up with for the first question, where you "cat" (without UUOC! Heh) the log file data into the program:

    import sys
    unfinishedPaths = {}  # [user] = [path1, path2, ...] = [[page1, page2], [page1]]
    finishedPaths = {}  # [path] = count
    for line in sys.stdin:
        user = line.split(',')[0].strip()
        page = line.split(',')[1].strip()
        if user not in unfinishedPaths:
            unfinishedPaths[user] = []
        deleteIndex = []
        for pathindex, path in enumerate(unfinishedPaths[user]):
            path.append(page)
            if len(path) == 3:
                deleteIndex.append(pathindex)
        for pathindex in deleteIndex:
            serializedPath = ' -> '.join(unfinishedPaths[user][pathindex])
            if serializedPath in finishedPaths:
                finishedPaths[serializedPath] += 1
            else:
                finishedPaths[serializedPath] = 1
            del unfinishedPaths[user][pathindex]
        unfinishedPaths[user].append([page])
    
    for k in sorted(finishedPaths, key=lambda x: finishedPaths[x], reverse=True):
        print(str(k) + ' with a count of ' + str(finishedPaths[k]))
Not tested properly, because no expected output is given, but after concatenating your sample data a few times and introducing a third person, the output looks plausible. And I just noticed I failed, because it says top 3, not just print all in order (I guess I expect the user to use "| head -3", since it's a command-line program).

I needed to look up the parameter that turns out to be called "key" for sorted(), so I didn't do it all from memory (I used HTML docs on the local filesystem for that, no web search or LLM), and I had one bout of confusion where I thought I needed another for loop inside the "for pathindex, path in ..." (thinking it was "for pathsindex, paths in", note the plural). Not sure I'd have figured that one out with interview stress.

This is definitely trickier than fizzbuzz or similar. I would budget at least 20 minutes for a great candidate with bad nerves and bad luck, which makes it fairly long given that you have follow-up questions and probably also want to get to other topics, like team fit and compensation expectations, at some point.

edit: wait, now I need to know: did I get hired?


At a glance it seems correct, but there are a lot of inefficiencies, which might or might not be acceptable depending on the interview level/role.

Major:

1. Sorting finishedPaths is unnecessary given it only asks for the most frequent one (not the top 3 btw)

2. Deleting from the middle of the unfinishedPaths list is slow because it needs to shift the subsequent elements

3. You're storing effectively the same information 3 times in unfinishedPaths ([A, B, C], [B, C], [C])

Minor:

1. line.split is called twice

2. Way too many repeated dict lookups that could easily be avoided (in particular, the 'if key (not) in dict: do_something(dict[key])' pattern should be done with dict.get and dict.setdefault instead; see the snippet after this list)

3. deleteIndex doesn't need to be a list; it always has at most 1 element
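
To make minor point #2 concrete, the two patterns in your code become:

    # counting without the membership test
    finishedPaths[serializedPath] = finishedPaths.get(serializedPath, 0) + 1

    # creating the per-user list without the membership test
    unfinishedPaths.setdefault(user, [])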


> there are a lot of inefficiencies, which might or might not be acceptable

This is exactly what irritates us about these questions. There's no possible answer that will ever be correct "enough".


Just like in real life, there's no perfect solution to most problems, only different trade-offs.


Thanks for the feedback!

I realized at least the double call to line.split while writing the second instance, but figured I'm in an interview (not a take-home where you polish it before handing it in) and this is more about getting a working solution fairly quickly, since there are more questions and topics and most interviews are 1h; from there, the interviewer will steer towards the issues they care about. But then I've never had to do live coding in an interview, so perhaps I'm wrong? Or I'm over-optimizing what would take a handful of seconds to improve.

That only one path per user will ever hit length == 3 at a time is an insight I hadn't realized; that's your minor point #3, but I guess it also shows up in major points #2 and #3, because it means you can design the whole thing differently: each user having a rolling buffer of 3 elements and a pointer, perhaps. (I guess this is the sort of conversation to have with the interviewer.)
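
Something like this, maybe (an untested sketch; collections.deque with maxlen=3 evicts old pages by itself, so no pointer is even needed):

    import sys
    from collections import deque

    buffers = {}  # per-user rolling window of the last 3 pages
    counts = {}   # serialized path -> count

    for line in sys.stdin:
        user, page = (f.strip() for f in line.split(',')[:2])
        window = buffers.setdefault(user, deque(maxlen=3))
        window.append(page)
        if len(window) == 3:
            path = ' -> '.join(window)
            counts[path] = counts.get(path, 0) + 1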

defaultdict: yeah, I know of it, but I don't remember the API by heart, so I don't use it. Not sure the advantage is worth it, but yep, it would look cleaner.

Got curious about the performance now. Downloading 1M lines of my web server logs and formatting them so that IP address = user and URI = page (the size is now 65 MB), the code runs in 3.1 seconds. I'm not displeased with 322k lines/sec for a quick, naive solution in CPython, I must say. One might argue that for an average webshop, more engineering time would just be wasted :) but of course a better solution would be better.

Finally, I was going to ask what you meant by major point #1, since the task does say top 3, but then I read it one more time and...... right. I should have seen that!

As for that major point, though: would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, without it being specified that this needs to work in realtime on a billion rows, for example. (Checking just now to quantify it: sorting is about 5% of the runtime on this 1M-line data sample.)

For anyone curious, the top results from my access logs are

   / -> / -> / with a count of 6120
   /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
   / -> /404.html -> / with a count of 4300


> As for that major point, though: would you rather see a solution that does not scale to N results? [...] I'm not sure I agree that sorting is not a valid trade-off given the information at hand.

You need the list regardless; just do `max` instead of `sort` at the end, which is O(N) rather than O(N log N). Likewise, returning the top 3 elements can still be done in O(N) without sorting (with heapq.nlargest or similar), although I agree that you probably shouldn't expect most interviewees to know about this.
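
i.e. something like:

    import heapq

    # single most common path, O(N):
    best = max(finishedPaths.items(), key=lambda kv: kv[1])

    # top 3 without a full sort, also O(N) for a fixed k:
    top3 = heapq.nlargest(3, finishedPaths.items(), key=lambda kv: kv[1])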

As for the rest, as I've said, it depends on the candidate level. From a junior it's fine as-is, although I'd still want them to be able to fix at least some of those issues once I point them out. I'd expect a senior to be able to write a cleaner solution on their own, or at most with minimal prompting (e.g. "Can you optimize this?").

FYI, defaultdict and setdefault are not the same thing.

  d = defaultdict(list)
  d[key].append(value)
vs

  d = {}
  d.setdefault(key, []).append(value)
setdefault is useful when you only want the "default" behavior in one piece of code but not others

  >   / -> / -> / with a count of 6120
  >   /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
LOL


Your solution looks alright. I think you could use a defaultdict() to clean up a few lines of code, and I don't fully understand why you have two nested loops inside your file processing loop.

Here's my solution in TS.

    const parseLog = (input: string) => {
        const userToHistory: { [user: string]: string[] } = {}
        const pageListToFrequencyCount: { [pages: string]: number } = {}

        // slice(1) skips the CSV header row
        for (const [user, page, ] of input.trim().split("\n").slice(1).map(row => row.split(", "))) {
            userToHistory[user] = (userToHistory[user] ?? []).concat(page);

            if (userToHistory[user].length >= 3) {
                const path = userToHistory[user].slice(-3).join(" -> ")

                pageListToFrequencyCount[path] = (pageListToFrequencyCount[path] ?? 0) + 1;
            }
        }

        // sort descending so the most common path comes first
        return Object.entries(pageListToFrequencyCount).sort(([, a], [, b]) => b - a);
    }
It could be slow on large log files because it keeps the whole log in memory. You could speed it up significantly by doing a `.shift()` at the point where you `.slice(-3)`, so that you only track the last 3 pages for any user.



