Show HN: My Single-File Python Script I Used to Replace Splunk in My Startup (github.com/dicklesworthstone)
313 points by eigenvalue 11 months ago | 79 comments
My immediate reaction to today's news that Splunk was being acquired was to comment in the HN discussion for that story:

"I hated Splunk so much that I spent a couple days a few months ago writing a single 1200 line python script that does absolutely everything I need in terms of automatic log collection, ingestion, and analysis from a fleet of cloud instances. It pulls in all the log lines, enriches them with useful metadata like the IP address of the instance, the machine name, the log source, the datetime, etc. and stores it all in SQlite, which it then exposes to a very convenient web interface using Datasette.

I put it in a cronjob and it's infinitely better (at least for my purposes) than Splunk, which is just a total nightmare to use, and can be customized super easily and quickly. My coworkers all prefer it to Splunk as well. And oh yeah, it's totally free instead of costing my company thousands of dollars a year! If I owned CSCO stock I would sell it-- this deal shows incredibly bad judgment."

I had been meaning to clean it up a bit and open-source it but never got around to it. However, someone asked today in response to my comment if I had released it, so I figured now would be a good time to go through it and clean it up, move the constants to an .env file, and create a README.

This code is obviously tailored to my own requirements for my project, but if you know Python, it's extremely straightforward to customize it for your own logs (plus, some of the logs are generic, like systemd logs, and the output of netstat/ss/lsof, which it combines to get a table of open connections by process over time for each machine-- extremely useful for finding code that is leaking connections!). And I also included the actual sample log files from my project that correspond to the parsing functions in the code, so you can easily reason by analogy to adapt it to your own log files.
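
Roughly, the ingestion pattern looks like this (a simplified sketch with made-up names, not the actual functions from the repo):

  # Simplified sketch of the enrich-and-store pattern (hypothetical names,
  # not the real parsing functions from the repo).
  import sqlite3
  from datetime import datetime, timezone

  def parse_line(line: str) -> dict:
      # Swap in a parser for your own log format here.
      return {"raw": line.rstrip("\n")}

  def ingest(lines, host_ip, host_name, source, db_path="logs.db"):
      conn = sqlite3.connect(db_path)
      conn.execute(
          """CREATE TABLE IF NOT EXISTS logs
             (ingested_at TEXT, host_ip TEXT, host_name TEXT, source TEXT, raw TEXT)"""
      )
      now = datetime.now(timezone.utc).isoformat()
      rows = [(now, host_ip, host_name, source, parse_line(l)["raw"]) for l in lines]
      conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?, ?)", rows)
      conn.commit()
      conn.close()
  # Browse the result with Datasette:  datasette logs.db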

As many people pointed out in responses to my comment, this is obviously not a real replacement for Splunk for enterprise users who are ingesting terabytes a day from thousands of machines and hundreds of sources. If it were, hopefully someone would be paying me $28 billion for it instead of me giving it away for free! But if you don't have a huge number of machines and really hate using Splunk while wasting thousands of dollars, this might be for you.




"This simple tool solves X at my org" is probably the most underrated type of project. There's not enough room to overcomplicate something that isn't a core part of the business, it must be practical to maintain, simple&stupid enough so that onboarding is not a hurdle, etc.

I encourage everyone to share their own "splunk in 1kloc of Python" projects! Some of my own:

- https://github.com/rollcat/judo is Ansible without Python or YAML

- https://github.com/rollcat/zfs-autosnap manages rolling ZFS snapshots


Thanks. Based on the dismissive replies to my original comment in the Splunk acquisition discussion, I thought this would get a lot of hostile takes: that it was dumb, that I reinvented the wheel because I didn't want to spend two weeks figuring out OpenTelemetry and tools X, Y, and Z, that it was trivial, that it wouldn't scale, etc.

But people are actually being surprisingly nice and friendly! I guess people just really hate Splunk!


> reinvented the wheel

I hate this meme. It's as if cars, trains, and airplanes all use the same wheels. Or that wheels under my stove, my tiny filing dresser, and my shopping cart are all the same.

Oh yeah, re-inventing the wheel, what a stupid idea and something we obviously don't frequently do and for good reasons.

This meme is almost as bad as the horrible misquoted "premature optimisation is the root of all evil".


I suggest you sell it to Oracle, get some popcorn and watch the Cisco vs Oracle log war begin!


mired in antitrust lawsuits


Your project sounds like something that definitely could come in handy. I forked it as a "bookmark". I particularly like the idea of storing the data in a local SQLite database. Not everything needs to be "web scale".


The real power is that with Datasette you can instantly turn that SQLite DB into a full-fledged responsive web app with arbitrarily complex filtering and sorting, which all gets serialized to the URL so you can share it and bookmark it for future use.


Not only that, but with https://litestream.io/ things become even more interesting.

I'm currently using this for a small application to easily backup databases in docker containers.


> But people are actually being surprisingly nice and friendly! I guess people just really hate Splunk!

Best things in life come through love and passion. Frustration can be a good motivator but don't let it guide you.

> I thought this would get a lot of hostile takes [...]

To be entirely honest with you, recognizing and praising the good parts is a lot easier than giving proper feedback on what needs to be improved;)


For me, it's configinator[0]. Write a spec file for a config like [1], get a Go file that loads a config from environment variables like [2]. Code-gen only, no reflection, fairly type-safe, supports enums, string, bool, and int64. I made it because it was gross to add new config vars in a project at work, and it's come in handy a lot!

[0] https://github.com/olafal0/configinator

[1] https://github.com/olafal0/configinator/blob/0576a53970bcb4d...

[2] https://github.com/olafal0/configinator/blob/0576a53970bcb4d...


My org's apps heavily use this simple key-value interface built on sqlite: https://github.com/aaviator42/StorX

There's also a bunch of other purpose-built tiny utilities on that GitHub account: https://github.com/aaviator42?tab=repositories


Your software is cool, but the description is a bit unfair to Ansible. Ansible works by solving for desired state. This software runs scripts; it replaces "for host in ...; do ssh $host < script.sh; done".


This is fair criticism.

When I first created Judo, I envisioned some sort of a standard library as a sibling project, sort of an "executable rosetta stone" for Unix, where you could declaratively say things like "ensure this user exists", "ensure this package is installed".

In practice I found out it's fairly easy to just write your scripts to be idempotent. It was the secret "2. and do not overcomplicate things" step that most initially simple software seems to gradually forget about.
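
A toy illustration of what I mean by idempotent (made-up example, not from Judo):

  # An idempotent step only changes the system when it is not already in
  # the desired state, so running it twice is the same as running it once.
  from pathlib import Path

  def ensure_line(path: str, line: str) -> None:
      p = Path(path)
      existing = p.read_text().splitlines() if p.exists() else []
      if line not in existing:      # already applied? then do nothing
          with p.open("a") as f:
              f.write(line + "\n")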


Last time I checked, ansible playbooks were also essentially just sequential steps to execute via ssh, albeit in yaml format. There was certainly no way to describe a desired state, nor did ansible consistently bring a system into some desired state. Two executions of the same playbook could result in very different system states, for example, depending on what happened in between.

The only systems I am aware of which are reliably capable of "solving for desired state" are nix and guix.


You can run sequential commands with ansible, but then you're just using it as a replacement for "ssh $host < script.sh". You would be missing out on most of the usefulness of the tool. It's meant to be used declaratively. The manual describes it quite well. In the same category of tools are puppet and indeed nix, but with one important difference for the latter: nix is also a package manager, which allows for a more fine-grained state specification.


True, nix is also a package manager, and that is the crucial step necessary to actually provide a declarative interface to system state. Without it you can only get the leaky abstraction that is ansible.

To illustrate my point, imagine this playbook:

  ---
  - name: Bring system into some state
    hosts: localhost
  
    tasks:
      - name: Install hello
        ansible.builtin.apt:
          name: hello
        become: true
Apply it and you get GNU hello installed. Now remove the package installation step, which really is just a glorified "ssh $host < script.sh":

  ---
  - name: Bring system into some state
    hosts: localhost
  
    tasks:
Apply that and you will still have hello installed, even though it was removed from the "declared state". This just pretends to be declarative but really isn't; the tasks are still imperative steps.

Not to blame ansible for this; it's just that ansible is built on a foundation that inherently makes declarative system management impossible to begin with, and ansible is more or less the best thing you could do given those constraints.


Ansible is definitely all about solving for a specified target state and ensuring that it is reached. It's built into the very syntax of ansible, which is how it can be used totally declaratively. And if you stick to the native, idiomatic ansible way of doing everything (as opposed to doing hacky stuff with ad hoc shell commands), you get automatic idempotence and other nice stuff "for free".


So, the last time I used ansible, which was quite a while ago to be fair, there was a builtin way to install packages with apt on the target system. You could add packages to a list to be installed and that would work. But removing them from that "declared state" would not remove them on the target system on the next playbook run. You would have to add an explicit uninstall command. And that is where ansible failed to be declarative. Did that change in the meantime?

Ansible might provide idempotence for the builtin things (although I would argue it doesn't, at least not on a bit-by-bit level, since you can't pin down specific versions of package repositories and stuff like that), but to be declarative it would need to provide a 1-to-1 mapping from declared state to running system state. And if what I described above is still the case, then it simply does not do that.

In my experience, ansible tries to build a declarative interface to an imperative mode of system management, which works to some extent, but breaks down in more complex cases because building this declarative interface can only be a leaky abstraction without the right foundation.


> There's not enough room to overcomplicate something that isn't a core part of the business: it has to be practical to maintain, and simple and stupid enough that onboarding is not a hurdle, etc.

You would think. But no, there is lots of room to make it overcomplicated without the corresponding effort to manage the complexity.


Don’t worry, that’s just tech debt and we will deeefinitely come back to it next sprint.


Quickly skimming, some points that would irritate me if I had to maintain this script:

* Importing Paramiko but regularly calling `ssh` via subprocess

* Unused functions like `execute_network_commands_func`

* Sharing state via a global instead of creating a class

Overall it's fit for purpose, but it makes a lot of assumptions about the host and client machines. As you said in the thread, you're running a very small number of servers (fewer than 30). I've written similar things over the years and they are great for what you need.

When I heavily used Splunk (back in 2013) I was in an application production support team that managed over 100 production servers for over a dozen applications, and there were dozens of other teams in similar situations across the company. The Splunk instance was managed by a central team, made minimal assumptions about the client environment, had well-defined permissions, understood common and esoteric logging formats, and could reinterpret the log structure at query time. A script like this is not competing in that kind of situation.


Thanks for the feedback. I do use Paramiko for some things. I tried to use it for everything in the project but ran into some weird stuff that wouldn't work reliably for me, which is why I switched some of it over to using SSH directly via subprocess (it was a few months ago so I don't even remember now what it was; I believe it was also performance related, since I'm trying to SSH to tons of machines at the same time concurrently).

I guess I did forget to use execute_network_commands_func. I'm using the ruff linter extension in VS Code now, which would have flagged that for me, but I wasn't back when I made this.

I don't think globals are so awful for certain things. I prefer a more functional approach where you have simple composable standalone functions instead of classes. Obviously classes have a role, but I find they sometimes overly complicate things and make the logic harder to follow and debug.

Anyway, I do appreciate that someone took the time to actually read through the code!


> I don't think globals are so awful for certain things. I prefer a more functional approach where you have simple composable standalone functions instead of classes. Obviously classes have a role, but I find they sometimes overly complicate things and make the logic harder to follow and debug.

But "globals" and "composable standalone functions" are contradictory, if you're mutating global state your function is neither composable nor standalone.

What you've got is a poor man's class instance using global instead of self.


It's a single script. Globals are fine--they're even marked as such.


> It's a single script

It's over 1,200 lines of code; it's not like it's 100 lines that can fit on a single screen.

> Globals are fine--they're even marked as such

I would argue that globals in this context are not fine from a code maintainability point of view.

By using globals here it's hard to know from a function call whether it's going to mutate global state or not. If all of these functions were methods of the same class instance, and other functions were just functions or part of some other class, you would get a clear grouping of the calls that mutate that state.

In general I would argue if you are ever in the situation of "I have more than two or three functions that are related to each other and they all need to mutate the same state so I use a mutable global" or "I pass around the mutable state via arguments" then make a class! It creates an obvious semantic grouping of callables.
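
A rough sketch of what I mean (hypothetical names, not code from the script under discussion):

  # State that several related functions mutate lives on one class
  # instance instead of in module-level globals.
  class LogCollector:
      def __init__(self):
          self.hosts: list[str] = []
          self.failures: list[str] = []

      def add_host(self, host: str) -> None:
          self.hosts.append(host)

      def record_failure(self, host: str) -> None:
          self.failures.append(host)

  collector = LogCollector()
  collector.add_host("10.0.0.5")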


* barely any comments and not a single docstring in the entire kiloline file


It's not that much code and it has sensible function names. I appreciate that OP took the time to share his tool with us.


I find comments annoying to read and write and distracting. I’d rather fit more code on the screen at once and instead focus on making the variable names and function names really descriptive and clear so you immediately grasp what it’s doing from context alone. Nowadays, if you really need comments to tell you what code is doing, you can just throw it into ChatGPT and get it that way.


I really like the tool you made, and I appreciate you helping your company save money as well! I don't think it matters that this isn't a perfect fit for everyone else (as you said, this was something you made to solve your problem). But boy do I disagree with making "the variable names and function names really descriptive and clear so you immediately grasp what it's doing from context alone". What counts as a descriptive function or variable name is extremely dependent on how familiar you are with the context the program operates in. Take `execute_network_commands_func` from above: this descriptive name says nothing about which network commands are executed. With docstrings it would be so easy to detail the input and output of this function.


I disagree with comments being distracting. You can overdo them, but one good comment - like "baz = 1 # the default is 1 instead of 0, because most real-world production servers foobars 1s into 7s" can save days of frustration.

In general, comments often add very valuable context to your code in a way that readable function/variable names can't.


Oh, how I wish I had your scripts (and insights!) when I was analyzing Unix logs in 1986, looking for the footprints of an intruder...


I'm kinda glad you didn't; it might have made the book I read as a kid (and again as an adult, and again with my offspring) less interesting somehow.


Uh, yes, h0p3 ... I didn't exactly start on that adventure thinking I'd write a book. Chasing after those hackers was orthogonal to my work in astronomy and the Keck telescope.


Yes, sir. I appreciate that. I think you made that very clear in the book as well. I'll agree that having OP's tooling back then would likely have been quite useful to you and others. I'm often terrible with words. What I meant to say was: thank you for writing the book. Your story has been an important part of my family's lives for three generations (your work is also mentioned 3 times in my ℍ𝕪𝕡𝕖𝕣𝔱𝔢𝔵𝔱, and, prominently in my record of reaching out to others out of the blue [I've a habit of knocking on doors with low success rates]). Never thought I'd have the chance to say that to you. =D. `/salute`. Thank you, sir.


Thanks to you for brightening my afternoon -- oh, you brought a smile to my face.

My happy wishes to you and your family!

=Cliff


You should write about that sometime! /s


Thanks for the comment! Going to check out your book now— I somehow hadn’t heard of it before despite it being right down my alley!


Please purchase 30,000 copies of the paperback -- at a nickel royalty per book, it'll help with my kids' tuition this month.


I was about to ask you to get around the campfire and tell the story again, but I see other commenters got ahead of me :). I'll be getting another Klein bottle soon for a gift, if you still do those :).

Hope you're doing well!


My smiles to you Mercer: it's fun to look back over my shoulder to a slowly vanishing time, when the Arpanet backbone ran at 4800 baud and a 1 megabyte Unix workstation was hot stuff...


A long long time ago, I used a series of tail -f's and unix pipes to aggregate logs, and grep, less, and awk to analyse them. There were about 20 different services written in C++, each producing over 1GB of logs each day. I managed to debug some fairly complex algorithmic trading bugs that way. Twenty years later, I still can't fathom why we're spending so much money on Splunk, DataDog, and the like.


Financialization and mediocre developers. I haven't worked with too many people I could actually trust to even emit logs correctly, let alone develop a tool to collect and aggregate them.

I've also been told, time and again, in no uncertain terms, to "buy as much as possible". We've reached the logical conclusion of SaaS-everything: every company just cobbles together expensive, overcomplicated computers from other expensive, overcomplicated computer providers, resulting in expensive, bloated systems that barely work.


Buying everything and SaaSing the whole place up is a true killjoy. I giggle with joy whenever I am allowed to write code. And then a support request comes in that I get assigned to, “thing in SaaS doesn’t work please fix”. And all you have to debug that SaaS is their UI. The checkbox in question is on, you notice, so it can only be a bug on their side. Off to contacting support as the only available avenue. Incredibly boring.


Yay, I get to write code! Oh, it's Pulumi code that wraps Terraform to deploy Elasticsearch and configure networking to allow logs from a Kubernetes cluster we deployed the same way.


Volume. 1GB of data per day is rounding error. If you have tens of thousands of servers, each generating hundreds of gigabytes of data per day, tail -f and grep don't scale especially well.


And I bet a hang glider can't fly from New York to Paris, either! The nerve!

Recall that the poster said this was for a small startup. If you're Google, by all means, use Google logging tools. If you aren't, then solve the problem you have, not the problem your résumé needs.


The guy asked

> Twenty years later, I still can't fathom why we're spending so much money on Splunk, DataDog, and the like.

And the poster above answered that question


They scale perfectly fine, as long as you filter locally before aggregating. Lo and behold:

  mkdir -p /tmp/ssh_output
  while read -r ssh_host; do
      # -n keeps ssh from draining the host list on stdin
      ssh -n "$ssh_host" grep 'keyword' /var/my/app.log > "/tmp/ssh_output/${ssh_host}.log" &
  done < ssh_hosts.txt
  wait
  cat /tmp/ssh_output/*.log
  rm -rf /tmp/ssh_output
Tweak as needed. Truncation of results and real-time tailing are left as an exercise to the reader.


100GB of logs per day? what kind of applications are that chatty?


Yeah, the solution here is to get rid of 98% of the logging.


Probably Java/JVM... Never seen something where all kinds of libraries log more.


Log level configuration is a cheap solution in this case.


What did you use for visualization in that stack? The fact that I can "|" (pipe) my data and make bar and pie charts is what really does it for me. What's really money is trivially being able to see requests coming in overlaid on a world map. I was sold the first time I saw that because it let me fix an issue that would have taken me hours to suss out just grepping around.

More power to you for using sed, awk, and grep; they're powerful tools and every computer person should know how to use them. But if you're hung up on only using sed, awk, and grep for emotional reasons, that's self-limiting. We have better tools today, and you don't get hero points for using shittier tools when there are better ones available to you.

https://www.splunk.com/en_us/blog/tips-and-tricks/mapping-wi...


Thanks for the share. I still find it hilarious how Python is installed by default on most distros. I was working on some compression tools, and the OS didn't come with zip/unzip tools by default, but the Python standard library's zipfile did.

https://docs.python.org/3/library/zipfile.html
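
For reference, the script version is only a few lines (file names here are just placeholders):

  # Stdlib-only equivalent of unzip, no external tools needed.
  from zipfile import ZipFile

  with ZipFile("somefile.zip") as zf:
      zf.extractall("target-dir/")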


> the OS didn't come with zip/unzip tools by default

Some versions of tar are able to extract zip files.

Try

    tar xf somefile.zip
It might or might not work with the version in your OS


You don't even have to write a custom script around the library:

  python -m zipfile -e monty.zip target-dir/
https://docs.python.org/3/library/zipfile.html#command-line-...


I used to work at a splunk shop. It was used for alerting, graphing & prediction. It was critical to how the company functioned.

There was lots of stuff that relied on splunk, and we had splunk specialists who knew the magic splunkQL to get the graph/data they wanted.

However, we managed to remove most of the need for splunk by using graphite/grafana. It took about 2-3 years, but it meant that non-techs could create dashboards and alerts.

As someone once told me, splunk is the most expensive way you can ignore your data.


I love this!

Log analysis isn't one of the core use-cases for Datasette, but I've done my own experiments with it that have worked pretty well - anything up to 10GB or so of data is likely to work just fine if you pipe it into SQLite, and you could go a lot larger than that with a bit of tuning.

I added some features to my sqlite-utils CLI tool a while back to help with log ingestion as well: https://simonwillison.net/2022/Jan/11/sqlite-utils/
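
For example, a rough sketch with the sqlite-utils Python API (illustrative only, not taken from the linked post):

  # Pipe raw log lines into SQLite, then browse the table with Datasette.
  import sqlite_utils

  db = sqlite_utils.Database("logs.db")
  with open("app.log") as f:
      db["lines"].insert_all({"line": line.rstrip("\n")} for line in f)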


Neat! Definitely a better solution for single source logs. Splunk is ridiculous and Cisco acquiring it isn't going to make that better.

For others with a bit more complex needs, take a look at the free (or paid) versions of Graylog Open[1].

It's really improved over the years. I had messed with Graylog in its early days but was turned off by it. A few years back, I encountered someone doing some neat stuff with it, and it looked much improved. I stood up a "pilot project" to test, and it's now been running for years; several different people use it for their areas of responsibility.

It does log collection/transforming and graphing and dashboarding and we use the everloving crap out of it at work. I wish I could publicly post some of the stuff we're doing with it.

It takes input from just about any source.

1. https://graylog.org/products/source-available/


Fetching logs regularly sounds hard? Wouldn't you need to keep track of the position of all files, with heuristics around file rotations? And if something catastrophic happens, the most interesting data would be in that last block which couldn't be polled?

Normally you'd avoid all that complexity by shipping logs the other way, sending from each machine. That way you can keep state locally should you need to. All unix-like systems do this out of the box, and almost all software supports the syslog protocol to directly stream logs. But you can also use something like filebeat and a bunch of other modern alternatives.

The analyzer can then run locally on the log server and a whole lot of complexity just disappears.
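
For example, a minimal push-style setup with Python's stdlib syslog handler (the log host name is a placeholder):

  # The application streams its own records to a central syslog server,
  # so nothing has to be pulled from the machine later.
  import logging
  import logging.handlers

  logger = logging.getLogger("myapp")
  logger.setLevel(logging.INFO)
  logger.addHandler(
      logging.handlers.SysLogHandler(address=("loghost.internal", 514))
  )
  logger.info("user login ok")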


I considered doing it the way you described, but then you need to deploy software on every single one of your machines and make sure it's running, that it's not accidentally using up 99% of your CPU (I've had bad experiences with the monitoring agents for Splunk and Netdata misbehaving and slowing down the machines and causing problems), etc. Whereas with the "pull" approach I used in my tool, you don't need to deploy ANY software to the machines you are monitoring-- you just connect with SSH and grab the files you need and do all the work on your control node.
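
For what it's worth, the general shape of the pull approach is roughly this (a simplified sketch, not the actual code from the repo):

  # Fan out over SSH from the control node; no agents on the fleet.
  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  HOSTS = ["10.0.0.5", "10.0.0.6"]  # placeholder fleet

  def pull_log(host: str) -> bytes:
      # One ssh invocation per host, reading the remote file to stdout.
      return subprocess.run(
          ["ssh", "-n", host, "cat", "/var/log/syslog"],
          capture_output=True, check=True,
      ).stdout

  with ThreadPoolExecutor(max_workers=20) as pool:
      logs = dict(zip(HOSTS, pool.map(pull_log, HOSTS)))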


One way or the other, your hosts are running your application, and you are already deploying software on every single host. But I hear you with some of the agents. That's why I mentioned syslog.

It's already there, it's supported by most logging packages, and it's dead simple. No additional software required. All text. What it doesn't do is structured logging, but analyzing on the log host is often enough.

Agents aren't all that bad however, and you're likely already running some agent like icinga or zabbix for regular monitoring.


Having a Netdata agent take your machine's CPU to 99% shouldn't happen. Not sure when the last time you tried it was, but a lot of recent improvements have been made to the Netdata Agent.

Also, with Netdata you can achieve the same architecture design using a Netdata Parent that could be your "control node", to which you stream the metrics from the nodes you want to keep running with as little load as possible. You can even offload the health engine and the machine learning.

Take a look at https://learn.netdata.cloud/docs/streaming/ and https://learn.netdata.cloud/docs/configuring/how-to-optimize...


I hear you, but once something like that burns me, I am very loath to risk it again if the cost/benefit ratio seems unfavorable. While it's nice to have the pretty dashboards for Netdata, it's not worth even a small risk of it breaking or degrading my systems.


> Wouldn't you need to keep track of the position of all files

rsync --append is your friend.


That takes neither rotated nor truncated log files into account. The easiest way, and what most log shippers do, is following inodes.

It's also not as effective as streaming them to their intended target directly. Syslog can write a complementary local copy too should you wish to keep one.

Logs have been a thing for the past forty years. In order to reinvent them, it is good to be acquainted with the standard systems.
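
The inode-following idea fits in a few lines (simplified sketch; real shippers also handle truncation, partial lines, and so on):

  # Remember (inode, offset) between polls; if the inode changed the file
  # was rotated, so start reading from the beginning of the new file.
  import os

  def read_new_lines(path, state):  # state = {"inode": ..., "offset": ...}
      st = os.stat(path)
      if st.st_ino != state.get("inode"):
          state.update(inode=st.st_ino, offset=0)
      with open(path, "rb") as f:
          f.seek(state["offset"])
          chunk = f.read()
          state["offset"] = f.tell()
      return chunk.splitlines()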


"Hell hath no fury like a Python dev annoyed"

:-)

Thank you for sharing!


Hell hath no fury like any dev annoyed by something they could build themselves...


This is kinda ironic. The founders of Splunk originally created the product, because they realized that every sysadmin used their own, single-file script to analyze logs. They cleaned the scripts up and productionized what emerged. The reason Splunk grew to a billion-dollar company in the first place was that those sysadmins preferred to switch over to something that was more enterprise-grade. Life is cyclical.


That's cool!

I disagree though that the deal shows any bad judgment on Cisco's part; the gravamen of whether the acquisition was good is not whether many software developers can quickly develop replacements for their own use-cases, or how ergonomic the software is, but whether Splunk is a profitable business with a bunch of paying subscriptions/contracts that aren't going to go away any time soon.


So it requires Redis. Would have loved it if it were just a simple script or binary.


Thanks for sharing.


> If I owned CSCO stock I would sell it-- this deal shows incredibly bad judgment."

That may be so, but beware that acquisitions usually increase stock price rather than decrease it.


Is that over time or immediately after the acquisition? https://www.google.com/finance/quote/CSCO:NASDAQ?&window=5D


The results should show up pretty quickly. Maybe that was pricing in the acquisition, or the acquisition is nothing compared to the whole company and that's quarterly earnings. Not sure.


Ah, the hubris of the single developer who believes they can replace a battle-tested product from a company with innumerable decades of combined human effort.

Glad it works for you!


I mean, it seems like it's working for them. Not every single startup needs the same solutions as a larger company, especially when the solution is as expensive as Splunk!


"log files of several several gigabytes ... Process them in minutes"

Thanks but no thanks.


I misspoke there-- meant to say:

"The application has been tested with log files several gigabytes in size from dozens of machines and can process all of it in minutes."

That's the time it takes to connect to 20+ machines, download multiple gigs of log files from all of them, and parse/ingest all the data into SQLite. If you have a big machine with a lot of cores and a lot of RAM, it's incredibly performant for what it does.



