I was poking through your source a little and I noticed a common inefficiency in the way Python code that handles text is often written. Because strings in Python are immutable, repeatedly concatenating onto one is very inefficient and not very kind to the GC, so it's much more efficient and Pythonic to build up your string as a list of strings and then join them at the end. For example, you could rewrite your do_info function a bit like this:
def do_info(self, line):
    """Display information about the current page."""
    renderer = self.get_renderer()

    def get_renderer_class():
        return str(renderer.__class__).lstrip("<class '__main__.").rstrip("'>") if renderer else "None"

    def get_page_lists(url):
        return [l for l in self.list_lists() if self.list_has_url(url, l)]

    def get_list_status(l):
        if self.list_is_system(l):
            return ""
        status = "normal list"
        if self.list_is_subscribed(l):
            status = "subscription"
        elif self.list_is_frozen(l):
            status = "frozen list"
        return f"({status})"

    url, mode = unmode_url(self.current_url)
    out = [
        f"{renderer.get_page_title()}\n\n",
        f"URL : {url}\n",
        f"Mime : {renderer.get_mime()}\n",
        f"Cache : {netcache.get_cache_path(url)}\n",
        f"Renderer : {get_renderer_class()}\n\n"
    ]
    lists = get_page_lists(url)
    if lists:
        out.append("Page appeared in the following lists:\n")
        for l in lists:
            out.append(f" • {l}\t{get_list_status(l)}\n")
    return ''.join(out)
Because it was such a common problem, CPython tries to detect this pattern of repeated string concatenation and optimize it internally (resizing the string in place when it can), but that's an implementation detail rather than something to rely on.
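If you want to see the difference on your own machine, here is a quick micro-benchmark sketch (the fragments are synthetic and have nothing to do with your real workload, so the numbers are only indicative):

import timeit

def concat(parts):
    out = ""
    for p in parts:
        out += p           # repeated concatenation; relies on CPython's in-place resize trick
    return out

def join(parts):
    return "".join(parts)  # single pass over all fragments

parts = [f"line {i}\n" for i in range(10_000)]
print("+=   :", timeit.timeit(lambda: concat(parts), number=200))
print("join :", timeit.timeit(lambda: join(parts), number=200))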
Have you measured that your code is significantly faster than the original one?
I was not aware of that problem. do_info is trivial anyway, so performance is not an issue there. But in ansicat, that might significantly speed up the HTML rendering (which is quite expensive, with images converted to ANSI strings by Chafa).
So I would be curious to see some benchmarks and would happily accept patches improving performance in ansicat.
If HTML rendering really is a big bottleneck, a bytearray could be even faster, provided you accept the compromise of manipulating encoded strings. And a rope data structure could buy even more performance.
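To make the bytearray idea concrete, a minimal sketch (placeholder fragments, assuming everything is UTF-8) would be:

buf = bytearray()
for fragment in ("Title\n", "Line 1\n", "Line 2\n"):  # placeholder fragments
    buf += fragment.encode("utf-8")                   # in-place append, amortized O(1)
result = buf.decode("utf-8")                          # decode once at the very end

The compromise is that every fragment gets encoded up front and the result is only decoded once at the end.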
> The offline content is stored in ~/.cache/offpunk/ as plain .gmi/.html files. The structure of the Gemini-space is tentatively recreated. One key element of the design is to avoid any database. The cache can thus be modified by hand, content can be removed, used or added by software other than offpunk.
> An archiving HTTP proxy and on-disk archival format for websites.
I've been meaning to set up something like that so that all my regular web browsing is auto-archived at some level.
It would sure be neat if the archive formats could be compatible. It would allow for a setup where everything I've seen with my eyes is then immediately accessible programmatically or in a terminal. I feel that could open up some significant productivity advantages, especially now that LLMs are in the terminal too.
Offpunk author here: the goal of "netcache" is to allow access to the cache by more tools. It would be quite easy to build a proxy like webcrystal but for any URL. Something like https://localhost:666/news.ycombinator.com/
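A rough sketch of what that could look like (untested, serves raw cached bytes only over plain HTTP, reuses netcache.get_cache_path from the snippet above, and uses an unprivileged port instead of 666):

from http.server import BaseHTTPRequestHandler, HTTPServer
import netcache  # offpunk's netcache module

class CacheHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # /news.ycombinator.com/item?id=123 -> https://news.ycombinator.com/item?id=123
        target = "https://" + self.path.lstrip("/")
        cache_file = netcache.get_cache_path(target)
        try:
            with open(cache_file, "rb") as f:
                body = f.read()
        except (TypeError, OSError):
            self.send_error(404, "Not in cache")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 6666), CacheHandler).serve_forever()

Pointing any client at http://localhost:6666/news.ycombinator.com/ would then serve whatever netcache has on disk for that page.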
I’ll confess to two crimes: a) I’ve come across offpunk at least three times already and haven’t tried it yet, even though it looks really neat! b) I often have ideas for interesting hacks that I write down but then drop the ball on. This is one of them, but I’m sharing it in case you’re interested and it by chance helps with the awesome work you are doing.
Here is my vision:
- mitmproxy is running in transparent mode with a plugin/custom code that does continuous archiving of all traffic, perhaps with the help of your netcache tool instead of the webcrystal I mentioned. It’s set as the gateway. (A rough addon sketch follows this list.)
- the terminal user can then quickly retrieve the history list, fzf-select from it, page/dump the text for cut and paste, or perhaps pipe it straight into some LLM tools pipeline for summarizing or extracting.
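Something in the spirit of this mitmproxy addon sketch (untested; the on-disk layout is made up here and would need to be mapped onto whatever netcache actually expects):

# archive_addon.py -- run with: mitmdump --mode transparent -s archive_addon.py
import os
from mitmproxy import http

ARCHIVE_ROOT = os.path.expanduser("~/.cache/webarchive")  # hypothetical location

class Archiver:
    def response(self, flow: http.HTTPFlow) -> None:
        if not flow.response or not flow.response.content:
            return
        # e.g. news.ycombinator.com/item?id=123 -> ~/.cache/webarchive/news.ycombinator.com/item?id=123
        rel = (flow.request.host + flow.request.path).lstrip("/")
        if rel.endswith("/"):
            rel += "index"  # crude stand-in for directory-style URLs
        dest = os.path.join(ARCHIVE_ROOT, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(flow.response.content)

addons = [Archiver()]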
Perhaps I’m stating the obvious above and it’s already implied by my prior comment. I guess the one detail I’m adding is that with mitmproxy’s transparent mode it could be completely transparent to the regular desktop user: no per-website configs, no proxy settings.
I run my desktops in a VM using vfio-pci GPU passthrough on the physical workstations and then put all my traffic through another pfSense VM, so re-routing traffic for me is already a fundamental part of my setup.
Perhaps this all sounds a bit complicated and yes it is. But there are numerous reasons I feel it can have a huge payoff.
Thank you for your open source work and I hope I’ll have some time soon to play around with offpunk. It looks like a fantastic tool with many potential uses.
Could Offpunk be made as (or split into) two parts: the browser and a persistent caching proxy? The proxy's persistence could be enabled so that any content browsed with any browser gets cached. There could be issues with modern web/JS pages not being able to run naively from cache, so it would be safer to store the rendered documents.
I used to daydream about having a web proxy that could store every page I visited (instead of having to manually save interesting pages... something I do a lot). But I never had the storage space for that, and the bloat in web pages has grown faster than the size of disk I can afford. Since I started using Offpunk for Gemini some time ago I at least get a complete saved record of all the Gemini pages I read. 65 MB in one year. Far more realistic to maintain than with web pages.
> But I never had the storage space for that, and the bloat in web pages has grown faster than the size of disk I can afford.
Is this really true for you? Seems surprising to me given how cheap large HDD storage is now. I have trouble believing even bloated web pages are that large relatively speaking. I guess I should try to do it and find out, probably I’m wrong and it’s much more data than I’m expecting.
Gemini seems really neat, I should have investigated earlier.
ssh kiosk@gemini.circumlunar.space
lets one get a sense of what it is about. The first bookmark found me some spaces.
Two years of using nearly exclusively offpunk without ever trimming the cache:
2.5G gemini
1.3G gopher
1.5G http
23G https
With the exception of dynamic webpages, every single page I’ve read in those two years is there. With pictures. And with every single page linked by those webpages. And all the pages linked on HN.
That’s very awesome and by itself a compelling argument for one to consider adopting offpunk.
I’d imagine putting the archive in a git repo with annex/LFS, and along with the built-in time-stamping it becomes an activity archive as well. Lots of interesting use cases if combined with LLMs and RAG, for example.
“A few weeks ago I was researching technology XYZ and there was an open source python package that looked neat, but I can’t remember what it was. Could you use my browsing cache and prepare a report on what I was reading about and summarize it please?”
Excellent pointer with warcprox, I hadn’t seen it. I’m noticing mitmproxy, warcprox, webcrystal, and also obviously offpunk are all python.
It seems there should be some mashup of them all that can produce a solution. One that also involves using offpunk to access the archive in the terminal.
Mitmproxy caught my eye with its transparent mode [1] and the idea that the client/user VM might not even need configuration in my setup (the vfio-pci GPU passthrough desktop OS approach). The archive/cache produced by the archiver VM could just be NFS-mounted over a private bridge interface between the desktop VM and the archive VM.
I like these kinds of projects because, probably due to some nerdy/geeky aspects of my personality, they keep me excited about computer stuff in general, but I have to admit I will almost surely never have a use case for them, except for playing with them for a few minutes. I just did, and it was fun to see how my own web page renders in it (I have to say, way better than it does in graphical but not up-to-date browsers that lack some semi-recent CSS features…).
That's called 'art'. It's someone's self-expression that you connect to. I can't speak for the author, but that's how I look at my experience of engaging with, and especially of creating, such projects.
It's art, in the medium and using the tools the author knows. Many associate art with certain mediums, like paint on canvas. But does the painter make a sculpture or build the perfect engine? The programmer, when they feel the drive to make art, makes a program.
(Another signal of art, IMHO, is the author's seeming disinterest in global adulation. They aren't aiming for virality or influencer status or the next startup with an exit, not pivoting to the trending thing; they are making what they love.)
For me the use cases of offpunk (or any cli/tui program) are mainly 2 things:
1. Keeping my computer environment on a remote VPS, so I have an always-on machine with high-speed internet access. I read books, visit websites and write code on the same machine. These are mainly text tasks, so a mosh connection is enough. Occasionally, when I want to view some images, I use chafa to preview them. Because the machine is always on, syncing and backup are easier to do.
2. But sometimes either you are offline or the VPS is offline. In that situation I switch to a Raspberry Pi Zero (packed inside a mint tin). The computing resources are limited, so doing everything in a cli/tui makes things faster.
Completely. The concept of "tour" allows me to queue everything I want to read (I also have a "to_read" list, which replaced Pocket for the longer reads I would do later).
What is incredible is when I get to the end of my tour before finishing my cup of tea. I feel like I've finished what I had to read for today. I then go and empty the "to_read" list (which is never empty but stays between 10 and 30 items all the time).
I also create lists for stuff I want to bookmark and surprise myself by triaging them: rereading and deciding whether or not I want to keep an article in a list. If yes, I add some comments in the list (lists are simple gemtext files that can be edited with "list edit my_list").
Well, it probably won't play YouTube video or be able to run Google Spreadsheet. It's for the simple web (simple being a meliorative here, in the spirit of the Gemini project that this TUI browser supports).
More specifically: it is aimed at the read-only web. YouTube videos can be downloaded with yt-dlp. There is probably something for downloading spreadsheets to CSV, but nothing for interacting with them.