Hacker News new | past | comments | ask | show | jobs | submit login
Offpunk 2.0 (ploum.net)
99 points by todsacerdoti on Nov 25, 2023 | hide | past | favorite | 28 comments



I was poking through your source a little and I noticed a common inefficiency in the way Python text-handling code is often written. Because strings in Python are immutable, repeatedly concatenating onto one is very inefficient and not very kind to the GC; it's much more efficient and Pythonic to build the string up as a list of parts and join them at the end. For example, you could rewrite your do_info function a bit like this:

  def do_info(self, line):
    """Display information about the current page."""
    renderer = self.get_renderer()

    def get_renderer_class():
        # Note: str.lstrip()/str.rstrip() strip *character sets*, not
        # prefixes/suffixes, so trimming "<class '__main__." that way can
        # eat into the class name itself; __name__ gives it directly.
        return renderer.__class__.__name__ if renderer else "None"

    def get_page_lists(url):
        return [l for l in self.list_lists() if self.list_has_url(url, l)]

    def get_list_status(l):
        if self.list_is_system(l):
            return ""
        status = "normal list"
        if self.list_is_subscribed(l):
            status = "subscription"
        elif self.list_is_frozen(l):
            status = "frozen list"
        return f"({status})"

    url, mode = unmode_url(self.current_url)
    out = [
        f"{renderer.get_page_title()}\n\n",
        f"URL      :   {url}\n",
        f"Mime     :   {renderer.get_mime()}\n",
        f"Cache    :   {netcache.get_cache_path(url)}\n",
        f"Renderer :   {get_renderer_class()}\n\n"
    ]

    lists = get_page_lists(url)
    if lists:
        out.append("Page appeared in the following lists:\n")
        for l in lists:
            out.append(f" • {l}\t{get_list_status(l)}\n")

    return ''.join(out)


Because this was such a common problem, CPython actually optimizes in-place concatenation (s += t) when the string has no other references, resizing the buffer instead of copying, so the quadratic behavior often doesn't show up in practice. That's an implementation detail, though, so the join idiom is still the safe choice.
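As a rough illustration (a sketch, not a rigorous benchmark; absolute numbers depend on your machine and CPython version), timeit from the standard library can compare the two approaches:

```python
# Compare repeated += concatenation against building a list and joining once.
import timeit

def concat(n=10_000):
    s = ""
    for i in range(n):
        s += f"line {i}\n"  # may reallocate/copy on each iteration
    return s

def join(n=10_000):
    parts = [f"line {i}\n" for i in range(n)]
    return "".join(parts)  # one final allocation for the whole string

print("concat:", timeit.timeit(concat, number=50))
print("join:  ", timeit.timeit(join, number=50))
```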

Have you measured that your code is significantly faster than the original one?


I was not aware of that problem. do_info is trivial anyway, so performance is not an issue there. But in ansicat, that might significantly speed up HTML rendering (which is quite expensive, with images transformed into ANSI strings by Chafa).

So I would be curious to have some benchmark and would happily accept patches improving performances in ansicat.


If HTML rendering is really a big bottleneck, a bytearray could be even faster, provided you accept the compromise of manipulating encoded bytes. And a rope for even more performance.
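A minimal sketch of the bytearray idea, assuming the output is UTF-8 throughout (the names here are illustrative, not Offpunk's actual API):

```python
# Accumulate pre-encoded output in a mutable bytearray, decoding once at the end.
buf = bytearray()
for chunk in ["Title\n", "=====\n", "Body text\n"]:
    buf += chunk.encode("utf-8")  # in-place append; reallocation is amortized
page = buf.decode("utf-8")
print(page)
```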

But yes, benchmarks are very much necessary.


I’ve learned quite quickly that you never do any performance work without profiling. Never! Even if the optimisation looks straightforward.
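In Python, cProfile from the standard library is enough to check where the time actually goes before optimizing anything (render_page here is just a stand-in workload, not real Offpunk code):

```python
import cProfile
import io
import pstats

def render_page():
    # Stand-in workload for whatever code is being optimized.
    return "".join(str(i) for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
render_page()
profiler.disable()

# Print the ten most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```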


The project page says:

> The offline content is stored in ~/.cache/offpunk/ as plain .gmi/.html files. The structure of the Gemini-space is tentatively recreated. One key element of the design is to avoid any database. The cache can thus be modified by hand, content can be removed, used or added by software other than offpunk.

One ambition I have is to set up

https://github.com/davidfstr/webcrystal

> An archiving HTTP proxy and on-disk archival format for websites.

so that all my regular web browsing is auto-archived at some level.

It would sure be neat if the archive formats could be compatible. That would allow a setup where everything I’ve seen with my eyes is immediately accessible programmatically or in a terminal. I feel that could open up some significant productivity advantages, especially in the age of LLMs in the terminal.


Offpunk author here: the goal of "netcache" is to allow access to the cache by more tools. It would be quite easy to build a proxy like webcrystal but for any URL. Something like https://localhost:666/news.ycombinator.com/
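A minimal sketch of such a proxy using only the standard library. It assumes the on-disk layout mirrors protocol/host/path under ~/.cache/offpunk/ (the real netcache layout may differ, and the port and helper names are made up for illustration):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "offpunk"

def cache_path(url_path, root=CACHE_ROOT, protocol="https"):
    # Map a request path like /news.ycombinator.com/ onto the cache tree.
    return root / protocol / url_path.lstrip("/")

class CacheHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = cache_path(self.path)
        if target.is_file():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(target.read_bytes())
        else:
            self.send_error(404, "not in cache")

def serve(port=6660):
    # Blocking; run in a terminal or background it.
    HTTPServer(("localhost", port), CacheHandler).serve_forever()
```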


I’ll confess to two crimes: a) I’ve come across offpunk at least 3 times already and haven’t tried it yet (it looks really neat!); b) I often have ideas for interesting hacks that I write down but then drop the ball on. This is one of them, but I’m sharing it in case you’re interested and it happens to help with the awesome work you are doing.

Here is my vision:

- mitmproxy runs in transparent mode with a plugin/custom code that does continuous archiving of all traffic, perhaps with the help of your netcache tool instead of the webcrystal I mentioned. It’s set as the gateway.

- a terminal user can then quickly retrieve a history list, fzf-select from it, page/dump text for cut and paste, or perhaps pipe straight into some LLM tool pipeline for summarizing or extracting.

Perhaps I’m stating the obvious above, and it can be inferred from my prior comment. I guess the one detail I’m adding is that, using mitmproxy’s transparent mode, it could be completely invisible to the regular desktop user: no per-website configs, no proxy settings.

I run my desktops in a VM using vfio-pci GPU passthrough on the physical workstations and then put all my traffic through another pfSense VM, so re-routing traffic for me is already a fundamental part of my setup.

Perhaps this all sounds a bit complicated and yes it is. But there are numerous reasons I feel it can have a huge payoff.

Thank you for your open source work and I hope I’ll have some time soon to play around with offpunk. It looks like a fantastic tool with many potential uses.


Is Offpunk made of (or could it be made as) two parts: the browser and a persistent caching proxy? Persistence in the proxy could be enabled so that any content browsed by any browser gets cached. There could be issues with modern web/JS pages not being able to naively run from the cache, so it’s safer to store the rendered documents.


It is made of 3 different parts:

- netcache (caching and network)

- ansicat (terminal rendering)

- offpunk (browsing)

So you could use only the cache part.


I used to daydream about having a web proxy that could store every page I visited (instead of having to manually save interesting pages... something I do a lot). But I never had the storage space for that, and the bloat in web pages has grown faster than the size of disk I can afford. Since I started using Offpunk for Gemini some time ago I at least get a complete saved record of all the Gemini pages I read. 65 MB in one year. Far more realistic to maintain than with web pages.


> But I never had the storage space for that, and the bloat in web pages has grown faster than the size of disk I can afford.

Is this really true for you? Seems surprising to me given how cheap large HDD storage is now. I have trouble believing even bloated web pages are that large relatively speaking. I guess I should try to do it and find out, probably I’m wrong and it’s much more data than I’m expecting.

Gemini seems really neat, I should have investigated earlier.

ssh kiosk@gemini.circumlunar.space

Lets one get a sense of what it is about. Choosing "1 bookmarks" from the menu found me some spaces.


Two years of using nearly exclusively offpunk without ever trimming the cache:

  2.5G  gemini
  1.3G  gopher
  1.5G  http
   23G  https

With the exception of dynamic webpages, every single page I’ve read in those two years is there. With pictures. And with every single page linked by those webpages. And all the pages linked on HN.


That’s very awesome and by itself a compelling argument for one to consider adopting offpunk.

I’d imagine putting the archive in a git repo with annex/lfs; along with the built-in time-stamping, that makes it an activity archive as well. Lots of interesting use cases if combined with LLMs and RAG, for example.

“A few weeks ago I was researching technology XYZ and there was an open source python package that looked neat, but I can’t remember what it was. Could you use my browsing cache and prepare a report on what I was reading about and summarize it please?”


I already do that sort of thing with grep and find in the cache folder. Managed to find many things that way.
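Because the cache is plain files, ordinary shell tools work on it directly. For example (shown here against a throwaway directory standing in for ~/.cache/offpunk/, so the snippet is self-contained):

```shell
# Build a tiny stand-in cache, then search it the same way.
cache=$(mktemp -d)
mkdir -p "$cache/gemini/example.org"
echo "offpunk is an offline-first browser" > "$cache/gemini/example.org/index.gmi"

# Every cached file mentioning a keyword, case-insensitively.
grep -ril "offpunk" "$cache"

# All cached gemtext files.
find "$cache" -name '*.gmi'
```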

It should be said that the cache doesn’t have any versioning. When a new version of a file is downloaded, it replaces the old one.

That’s also why the cached date is displayed in the title of the page in Offpunk.


I've looked into archiving all the pages I visit as well, and warcprox[1] has been bookmarked for a while now.

Hard drive storage being so cheap, in the ~$15/TB range, makes this more feasible even for video archival.

[1] https://github.com/internetarchive/warcprox


Excellent pointer with warcprox, I hadn’t seen it. I’m noticing that mitmproxy, warcprox, webcrystal, and also obviously offpunk are all Python.

It seems there should be some mashup of them all that can produce a solution. One that also involves using offpunk to access the archive in the terminal.

Mitmproxy caught my eye with transparent mode [1] and the idea that the client/user VM may not even need configuration in my setup (the vfio-pci GPU passthrough desktop OS approach). The archive/cache produced by the archiver VM could just be NFS-mounted over a private bridge interface between the desktop VM and the archive VM.

[1] http://docs.mitmproxy.org.s3-website-us-west-2.amazonaws.com...


You can now start accessing the raw cache with "netcache --offline". Or you can access it by hand: the cache is only made of files stored in folders.


I like these kinds of projects because, probably due to some nerdy/geeky aspects of my personality, they keep me excited about computer stuff in general. But I have to admit I will almost surely never have a use case for them, except for playing with them for a few minutes. I just did, and it was fun to see how my own web page renders in it (I have to say, way better than it does in graphical but not-up-to-date browsers that lack some semi-recent CSS features…).


That's called 'art'. It's someone's self-expression that you connect to. I can't speak for the author, but that's how I look at my experience of engaging with, and especially of creating, such projects.

It's art, in the medium and using the tools the author knows. Many associate art with certain mediums, like paint on canvas. But does the painter make a sculpture or build the perfect engine? The programmer, when they feel the drive to make art, makes a program.

(Another signal of art, IMHO, is the author's seeming disinterest in global adulation. They aren't aiming for virality or influencer status or the next startup with an exit, not pivoting to the trending thing; they are making what they love.)

But again, I can't speak for this author at all.


Well, I’m the author and, as a writer, I’m very interested in global adulation. For my books. Not my software ;-)


For me the use cases of offpunk (or any CLI/TUI program) are mainly two things:

1. Keeping my computer environment on a remote VPS, so I have an always-on machine with high-speed internet access. I read books, visit websites and write code on the same machine. These are mainly text tasks, so a mosh connection is enough. Occasionally, when I want to view some images, I use chafa to preview them. Because the machine is always on, syncing and backup are easier to do.

2. But sometimes either you are offline or the VPS is offline. In that situation I switch to a Raspberry Pi Zero (packed inside a mint tin). Its compute resources are limited, so doing everything in a CLI/TUI makes things faster.


Do you notice yourself becoming more mindful of what you visit while using this as an every day tool?


Completely. The concept of "tour" allows me to queue everything I want to read (I also have a "to_read" list, which replaced Pocket for the longer reads I would do later).

What is incredible is when I get to the end of my tour before finishing my cup of tea. I feel like I’ve finished what I had to read for today. I then go empty the "to_read" list (which is never empty, but stays between 10 and 30 items all the time).

I also create lists for stuff I want to bookmark and surprise myself by triaging them: rereading and deciding whether or not I want to keep an article in a list. If yes, I add some comments in the list (lists are simple gemtext files that can be edited with "list edit my_list").
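For context, since lists are plain gemtext, such a file might look something like this (a hypothetical example; the exact format Offpunk writes may differ):

```
# to_read

=> gemini://example.org/long-essay.gmi  A long essay to finish later
=> https://example.com/article.html     Worth rereading; keep
```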


This would have been awesome in the 90s


Is this for any sort of web content?


Well, it probably won't play YouTube video or be able to run Google Spreadsheet. It's for the simple web (simple being a meliorative here, in the spirit of the Gemini project that this TUI browser supports).


More specifically: it is aimed at the read-only web. YouTube videos can be downloaded with yt-dlp. There is probably something for downloading spreadsheets as CSV, but nothing for interacting with them.



