I was going to comment the same thing: what's wrong with `cat`, whose job is literally to concatenate files? Or even [uncompressed] `tar` archives, which are basically just a list of files with some headers?
Love this. I created (half-jokingly, but only half) the concept of a monofile (inspired by our monorepo) in our team. I have not managed to convince my colleagues to switch yet, but maybe this package can help. Unironically, I find that in larger Python projects, combining various related sub-100-LOC files into one big sub-1000-LOC file can work magic on circular import errors and remove hundreds of lines of import statements.
I suspect that from the usage in the code, it knows that there is a module foo and a submodule subfoo with a function bar() in it, and it can look directly in the file for the definition of bar().
It would be another story if we used this opportunity to mangle the submodule names, for example, but that's the kind of hidden control flow that nobody wants in their codebase.
Also, it is not some dark art of imports: it is pretty standard at this point, since it's one of the sanest ways of breaking circular dependencies between your modules, and the ability to override a module's __getattr__ was introduced specifically for this use case. (I couldn't find the specific PEP that introduced it, sorry.)
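For reference, module-level __getattr__ came in with PEP 562 (Python 3.7). A minimal sketch of the pattern, with made-up package and submodule names (foo/subfoo are just placeholders):

    # foo/__init__.py
    # Resolve submodules lazily on first access instead of importing them at
    # package-import time; a common way to break circular imports.
    import importlib

    _lazy_submodules = {"subfoo"}  # hypothetical submodule names

    def __getattr__(name):
        # Only called when normal attribute lookup on the module fails.
        if name in _lazy_submodules:
            module = importlib.import_module(f".{name}", __name__)
            globals()[name] = module  # cache so __getattr__ isn't hit again
            return module
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

With this in place, `import foo` stays cheap and `foo.subfoo.bar()` only triggers the subfoo import on first use.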
I usually do this with docker/podman compose files for dev environments.
I see people creating all kinds of mounts and volumes, but I just embed files inline under the `configs` top-level key. I even embed shell scripts that way to do one-shot/initialization tasks.
The goal is to just have one compose.yml file that the developer can spin up for a local development reproduction of what they need. It's quite nice.
I once had a 4k-line JavaScript file (a Vuex module), which I navigated using / in vim, and which came with another 20k lines of tests (also in a single file). I would say 5k lines is the real ceiling.
I've been dreaming of a tool which resembles this, at least in spirit.
I want to figure out how to structure a codebase such that a failing test can spit out a CID for that failure, so that it can be remotely recreated (you'd have to be running IPFS so that the remote party can pull the content from you, or maybe you push it to some kind of hub before you share it).
It would be the files relevant to that failure--both code files and data files, stdin, env vars... a reproducible build of a test result.
It would be handy for reporting bugs or getting LLM help. The remote party could respond with a similar "try this" hash which the tooling would then understand how to apply (fetching the necessary bits from their machine, or the hub). Sort of like how Unison resolves functions by cryptographic hash, except this is a link to a function call, so it's got inputs and outputs too.
Of course, that's a long way from vomiting everything into a text file; I need to establish functional dependencies at as small a granularity as possible, but this feels like the first step on a path that eventually gets us there.
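As a zeroth-order approximation, a content hash over everything a replay would need already gives a stable failure ID. The sketch below uses a plain SHA-256 digest of a manifest as a stand-in for a real IPFS CID, and all the field names are invented:

    import hashlib
    import json
    import os
    from pathlib import Path

    def failure_id(code_files, data_files, stdin_text, env_keys):
        """Digest everything needed to replay one failing test.

        A real version would publish the bundle to IPFS (or a hub) and hand out
        the resulting CID; a SHA-256 over a canonical JSON manifest stands in
        for that here.
        """
        manifest = {
            "files": {
                p: hashlib.sha256(Path(p).read_bytes()).hexdigest()
                for p in sorted(str(f) for f in [*code_files, *data_files])
            },
            "stdin": stdin_text,
            "env": {k: os.environ.get(k, "") for k in sorted(env_keys)},
        }
        canonical = json.dumps(manifest, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()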
Hmm, you could probably make a proof of concept in a weekend specifically in the TypeScript/JavaScript ecosystem, as it's already heavily reliant on bundlers.
The process could be:
1. Define a new/temporary bundler entry point
2. Copy the failing code into the file
3. Bundle without minification
It'd probably be best to reduce scope by limiting it to a specific testing framework and building it as an extension for that framework, e.g. Jest.
You're talking sense, but I'm kinda wanting to do it at the subprocess level so that caller and callee need not use the same language (I was talking in terms of tests but tests are just a special kind of function).
Whether to use Node.js or Python or Rust (and which version thereof) will be as much a part of the bundled function as its code. I figure I'll wrap Nix so it can replicate the environments; then I'll just have to do the runtime stuff.
It'd be nice if something similar were available to traverse, say, directories of writings in Markdown, Word, LibreOffice, etc., and output a single text file so I have all my writings in one place. Plus allow plug-ins to extract from more exotic file types not originally included.
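A plug-in registry keyed on file extension would cover that shape of tool. A rough sketch (only plain-text formats are handled here; a Word/ODF extractor would be registered the same way, backed by something like python-docx or odfpy, and the paths are made up):

    from pathlib import Path

    EXTRACTORS = {}  # extension -> function returning plain text

    def extractor(*exts):
        def register(fn):
            for ext in exts:
                EXTRACTORS[ext] = fn
            return fn
        return register

    @extractor(".md", ".txt")
    def read_plain(path: Path) -> str:
        return path.read_text(errors="replace")

    # A .docx/.odt plug-in would decorate another function the same way,
    # delegating to python-docx or odfpy; omitted to keep this dependency-free.

    def combine(root: str, out: str = "all_writings.txt") -> None:
        with open(out, "w") as sink:
            for path in sorted(Path(root).expanduser().rglob("*")):
                fn = EXTRACTORS.get(path.suffix.lower())
                if fn and path.is_file():
                    sink.write(f"\n===== {path} =====\n")
                    sink.write(fn(path))

    combine("~/writings")  # hypothetical directory of notes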
That's what I was thinking too. It looks like someone just reinvented tar, and given how it's a JavaScript thing I'm wondering if it's a zoomer who didn't know tar existed and the HN crowd would set them straight. But then I come into the comments here and people are posting about how absolutely brilliant it is, so surely I'm missing something… right?
I can imagine the token counts to be off the charts. How would an LLM handle this input? LLM output quality already drops quite hard at about 3,000 tokens, let alone 128k.
Seems like repopack only packs the repo. How do you apply the refactors back to the project? Is it something that Claude projects does automatically somehow?
I have a bash script which is very similar to this, except instead of dumping it all into one file, it opens all the matched files as tabs in Zed. Since Zed's AI features let you dump all, or a subset, of open tabs into context, this works great. It gives me a chance to curate the context a little more. And what I'm working on is probably already in an open tab anyway.
Can you go one more step? Is there a way to not just dump someone's project into a plain text file, but somehow intelligently craft it into a ready-to-go prompt? I could use that!
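A rough sketch of what that extra step could look like: wrap the dump in a task-specific preamble before it ever hits the clipboard. The template wording and function name below are invented for illustration, not anything repopack itself does:

    from pathlib import Path

    PROMPT_TEMPLATE = """\
    You are reviewing the project below. Each file starts with a marker line
    like `===== path/to/file =====`.

    Task: {task}
    Answer with unified diffs against the original file paths.

    {dump}
    """

    def build_prompt(root: str, task: str, exts=(".py",)) -> str:
        # Concatenate matching files with path markers, then wrap in the template.
        parts = []
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix in exts:
                parts.append(f"===== {path} =====\n{path.read_text(errors='replace')}")
        return PROMPT_TEMPLATE.format(task=task, dump="\n\n".join(parts))

    print(build_prompt(".", "Find and fix any circular-import problems."))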
Cool! I'd like to see an indication of the total number of tokens in the output, so I know right away which LLM I can use this prompt with, or, if it's too large, I can relaunch the script excluding more files to reduce the number of tokens in the output.
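For a ballpark figure you don't even need the tool's cooperation; counting over the generated file works today. Using tiktoken's cl100k_base encoding is an assumption that approximates OpenAI models; other vendors tokenize differently:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("out.txt", encoding="utf-8", errors="replace").read()
    # disallowed_special=() keeps encode() from raising if the dump happens to
    # contain literal special-token strings like "<|endoftext|>".
    n_tokens = len(enc.encode(text, disallowed_special=()))
    print(f"{n_tokens} tokens; fits a 128k context: {n_tokens < 128_000}")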
One feature you could add is allowing the user to map changes in the concatenated file back to the original files.
For example, if an LLM edits the concatenated file, I would want it to return the corresponding filenames and line numbers of the original files.
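One low-tech way to get there is to have the packer emit an explicit marker line before each file and ask the model to preserve it; splitting the edited blob back out is then trivial. A sketch, assuming a `===== path =====` marker convention rather than whatever repopack actually emits:

    import re
    from pathlib import Path

    MARKER = re.compile(r"^===== (.+?) =====$", re.MULTILINE)

    def split_dump(dump: str) -> dict:
        """Map each original path to its (possibly LLM-edited) contents."""
        pieces = MARKER.split(dump)  # ['', path1, body1, path2, body2, ...]
        return {pieces[i]: pieces[i + 1].strip("\n") + "\n"
                for i in range(1, len(pieces), 2)}

    def write_back(dump: str, root: str = ".") -> None:
        # Overwrite each original file with the edited body from the dump.
        for rel_path, body in split_dump(dump).items():
            target = Path(root) / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(body)

Line numbers for a report then fall out of difflib.unified_diff between each original file and its returned body.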
We use a C compiler for embedded systems that doesn't support link-time optimization (unless you pay for the pro version, that is). I have been thinking about some tool like this that merges all C source files for compilation.
That's called a "unity" build, isn't it? I was under the impression that it was a relatively well-known technique, such that there are existing tools to merge a set of source files into a single .c file.
Unless I am misunderstanding you, you could easily do this by #including all your a.c, b.c, etc. into one file, input.c, and feeding that to the compiler.
We did this for a home-grown SoC with a gcc port for which there was no linker.
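If maintaining that include list by hand gets tedious, the "tool" can be a few lines of script that regenerates it. A sketch that assumes the sources live under src/ (the usual unity-build caveat applies: file-scope statics and macros from all files now share one translation unit):

    from pathlib import Path

    # Regenerate a unity/jumbo build file that #includes every .c under src/,
    # so the compiler sees the whole program as a single translation unit.
    sources = sorted(Path("src").rglob("*.c"))
    with open("unity.c", "w") as out:
        out.write("/* auto-generated: compile this file instead of src/*.c */\n")
        for src in sources:
            out.write(f'#include "{src.as_posix()}"\n')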
> A vomitorium is a passage situated below or behind a tier of seats in an amphitheatre or a stadium through which large crowds can exit rapidly at the end of an event.
> A commonly held but erroneous notion is that Ancient Romans designated spaces called vomitoria for the purpose of literal vomiting, as part of a binge-and-purge cycle
The name links up nicely with AI enshittification. Although if you wanted to be pedantic, for that metaphor to work you'd really want to call it "gorge" or something more related to ingestion rather than vomiting. (I'm aware that a vomitorium was the exit from a Roman stadium, so it's not really about throwing up either).
- Dump all .py files into out.txt (for copy/paste into an LLM)
> find . -name "*.py" -exec cat {} + > out.txt
- Sort all .py files by number of lines
> find . -name '*.py' -exec wc -l {} + | sort -n