The Overflow Offline project (stackoverflow.blog)
273 points by donutshop on Oct 20, 2022 | 39 comments



I remember already being able to use certain Stack Exchange sites with Kiwix before, as well as the Arch Wiki, Wikipedia without images, and some other great resources. It is nice to see that they actually pay attention to this use case, and I look forward to updated workflows with Kiwix or similar in the future. Latency is way better that way, even with good and stable internet. OpenZIM[1] tooling is also useful for turning any page into an archive for use with Kiwix.

I also have great memories from a University exam where we were allowed to have laptops that were not connected to the internet.

[1] https://wiki.openzim.org/wiki/OpenZIM


What was the test score? ;)


This has been available for a while, but it's great to see some acknowledgement, especially since the most recent data set was stuck at 2019 for a while.

Here are the datasets: http://download.kiwix.org/zim/stack_exchange/

It's not clear to me why the data set shrank between 2019/3 and 2022/6; was something excluded? Compression improvements?

> stackoverflow.com_en_all_2019-02.zim 2019-03-12 19:53 134G

> stackoverflow.com_en_all_2022-05.zim 2022-06-17 12:36 75G
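
If you do grab one of these, note that they're far too big to pull into memory; a streaming download keeps things sane. A minimal Python sketch (the URL is just the directory above joined with one of the listed file names, and the chunk size is arbitrary):

    import requests

    # One of the ZIM files from the Kiwix listing above; it's ~75 GB, so stream it to disk.
    URL = ("http://download.kiwix.org/zim/stack_exchange/"
           "stackoverflow.com_en_all_2022-05.zim")

    with requests.get(URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open("stackoverflow.com_en_all_2022-05.zim", "wb") as out:
            # 8 MiB chunks keep memory use flat regardless of file size.
            for chunk in resp.iter_content(chunk_size=8 * 1024 * 1024):
                out.write(chunk)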


The data isn't stuck. The data is available here:

https://archive.org/details/stackexchange

It's the "official" place to get the data.

I've downloaded it several times and extracted my own contributions.
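
For anyone who wants to do the same, here's roughly what that looks like in Python. It assumes you've extracted a per-site 7z archive from the dump and that Posts.xml uses the usual row attributes (Id, OwnerUserId, PostTypeId, Title, Score); the user id is a made-up placeholder:

    import xml.etree.ElementTree as ET

    MY_USER_ID = "123456"  # placeholder: your numeric user id on that site

    # Posts.xml from the extracted dump; each post is a <row .../> element.
    # iterparse streams the file so multi-GB dumps don't blow up memory.
    for _, elem in ET.iterparse("Posts.xml", events=("end",)):
        if elem.tag != "row":
            continue
        if elem.get("OwnerUserId") == MY_USER_ID:
            post_type = elem.get("PostTypeId")  # 1 = question, 2 = answer (other types exist)
            if post_type in ("1", "2"):
                kind = "question" if post_type == "1" else "answer"
                print(kind, elem.get("Id"), elem.get("Score"), elem.get("Title") or "")
        elem.clear()  # release the element once processed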


The article states:

> ... to ensure that an up-to-date version of our dataset is easily available for those who need it, and will work to improve its readability and reduce its size so there is less friction for end users...


This is great! Too many services today become completely unusable when they encounter technical problems, get hacked, or are simply lost over time. Having an easily accessible offline copy is always reassuring: it shows that survivability does not depend on just a few people, and that the project is fundamentally about the information, not an organization.


This project has really been make-or-break for us being able to run the coding-in-prisons program, as well as the dev shop that develops ed tech for prisons at Unlocked Labs. Love that the project has been formalized; it will make a huge difference in the lives of many justice-involved individuals who just want to return to their communities as productive citizens.


Never heard of Unlocked Labs before reading this comment, but that sounds like an outstanding project. Keep up the good work. You didn't promote yourself, but (no affiliation here) this is what I found after searching: https://unlockedlabs.org/

I was once a guest in a maximum security prison for a few hours. It gave me the eerie realization that my own freedom is entirely up to the guards and staff. When that door closes you are no longer free - you're at the mercy of a huge system. It's a terrible feeling. Talking with some of the inmates made me realize how much I had subconsciously dehumanized them as a group. It was eye opening.

Listening to stories of people who have been in prison, even if you never meet them, can help build empathy. Here's a good interview with a guy who taught himself to program while in prison: https://corecursive.com/prison-programming-with-rick-wolter/

If you're ever interviewing a formerly incarcerated person for a job, I'd encourage you to try very hard to keep your biases in check. Building a life after prison seems like a supremely difficult task.

[edit] That's what I get for reading the comments before the article! Unlocked Labs is linked in the 2nd sentence. :)


Love this. Reminds me of the other Kiwix projects to make MediaWiki services like Wikipedia available offline[1]. The entirety of English Wikipedia is ~50GB of text and ~100GB of images.

1. https://wiki.kiwix.org/


I feel like their homepage could be greatly improved. It doesn't really make it obvious what a great capability it provides.


I downloaded the databases earlier this year via torrent and attached them to a local SQL Server instance. Some notes:

- The files are big. The data files were 486 GB and the log file (I haven't tried shrinking it yet) is 1.4 TB

- There were no foreign key relationships defined, nor indexes on commonly queried columns. They're easy enough to add as you work on tuning it (see the sketch below).

- To be usable by humans you'll need to set up some full-text indexes for searching (Stack Overflow itself uses Elasticsearch); SQL Server's full-text catalog doesn't get you where you need to be.

https://stackexchange.com/performance
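
Not the exact tuning I ended up with, but to give a flavor of the kind of indexes you add by hand, here's a sketch. It assumes the usual dump schema as imported into SQL Server (dbo.Posts with OwnerUserId/ParentId, dbo.Comments with PostId) and a local instance reachable via pyodbc; connection details and index names are placeholders:

    import pyodbc

    # Placeholder connection string for a local SQL Server instance.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=localhost;DATABASE=StackOverflow;"
        "Trusted_Connection=yes;TrustServerCertificate=yes;"
    )
    conn.autocommit = True  # index builds are long-running; don't hold one big transaction
    cur = conn.cursor()

    # Columns the dump leaves unindexed but that almost every query touches.
    statements = [
        "CREATE INDEX IX_Posts_OwnerUserId ON dbo.Posts (OwnerUserId)",
        "CREATE INDEX IX_Posts_ParentId ON dbo.Posts (ParentId)",
        "CREATE INDEX IX_Comments_PostId ON dbo.Comments (PostId)",
    ]
    for sql in statements:
        print("running:", sql)
        cur.execute(sql)

    conn.close()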


You can also run FreeCodeCamp locally https://github.com/freeCodeCamp/freeCodeCamp/blob/main/docs/...

And I funded work to run that on an Android phone https://play.google.com/store/apps/details?id=space.atrailin...


There was a recent HN post for codequestion, which builds an offline semantic index (using https://github.com/neuml/txtai) on the archive.org Stack Overflow dumps - https://news.ycombinator.com/item?id=33110219

GitHub: https://github.com/neuml/codequestion

Article: https://medium.com/neuml/find-answers-with-codequestion-2-0-...
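
For anyone curious what the txtai side of that looks like, here's a toy sketch of the idea (not codequestion's actual pipeline; the model name and data are stand-ins): index a few question titles, then query them semantically.

    from txtai.embeddings import Embeddings

    # Stand-in question titles; codequestion indexes the real archive.org dumps.
    questions = [
        "How do I parse JSON in Python?",
        "What is the difference between git merge and git rebase?",
        "How can I undo the last commit in git?",
    ]

    # Any sentence-transformers model works; this one is just an example.
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
    embeddings.index([(uid, text, None) for uid, text in enumerate(questions)])

    # search() returns (id, score) pairs ranked by semantic similarity.
    for uid, score in embeddings.search("revert my most recent git commit", 1):
        print(score, questions[uid])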


I've always wondered if we could force web apps into some sort of "default" offline mode, with something like offline://url.here. Very cool of Stack Overflow.


I like this idea; I wonder if there's a way to get Firefox to support this via the settings. There's already support for file:///, ftp://, etc.


Nice to find Kiwix again. Shameless plug, I made my own Kiwix alternative for macOS: https://github.com/technusm1/kiwings


So this is a desktop app, but it uses the server as part of it? The normal Kiwix desktop client doesn't do that right?

I'll throw in my own shameless plug: self-host your own Stack Overflow, Wikipedia, etc. on Sandstorm: https://apps.sandstorm.io/app/5uh349d0kky2zp5whrh2znahn27gwh... It obviously uses kiwix-serve as well. It's 3 years old; I need to make a better clip for updating it.


I originally built this because Kiwix desktop didn't work with TED videos for me. Kiwix-serve always works nicely, plus I wanted to learn SwiftUI.


How alive is Sandstorm these days? I loved the concept and was sad when the company died.


A rather small number of people are working on it, but it's chugging along. Kenton still has the keys, so he does basic vital things. A few new apps are in the pipeline. I'm still rather interested, and am developing another app that I expect to get a lot of attention when it's done.

There's a fund:

https://opencollective.com/sandstormcommunity

The bigger it gets, the more time can be diverted toward it. I was recently paid to upgrade the Etherpad app. But in theory it could go toward core development (which wouldn't be me).


To me this basically seems like boat programming made respectable.

Of course, if you asked me, it always was. You couldn't assume great connectivity then and you often still can't today.


My favorite consultant I ever worked with was a boat programmer. You hired him for super-specialized MySQL magic, so he (well, his company) charged a pretty substantial hourly rate, and he apparently had enough revenue/leverage to get his company to foot the bill for two separate satellite internet connections on his boat. I feel like I would get lonely, but it's definitely a vibe.


To me this is like programming during the first 10 years, when the "Internet" was local BBSes, magazines talked about CompuServe and Prodigy, and connection rates were impossible, so we had to get by with what came in magazines and from the local library.


They could actually try to build a Copilot competitor off their data. /s


I see the "/s" but I actually do wonder if integrating the "prompt" behavior into the question box would help cut down on the absolutely staggering number of duplicate questions. Regrettably, I'm not enough of a GPT expert to know what percentage of the time it would generate gibberish thus making the duplication question problem _worse_


Pretty sure they already show suggestions when the user is typing the question topic, though idk if they catch much. I guess the question here is whether some kind of AI would pop up more relevant suggestions, all the way to “your question is already answered, doofus”.


Copilot themselves probably swiped it all already, since SO answers are licensed under CC-BY-SA, which to GitHub is equivalent to ‘just take it all and ignore the author’, with the possible small difference that the uploader doesn't agree to the GitHub ToS.


It would be interesting to see how many times a Copilot competitor trained on it gave correct code vs. wrong code for a given case.


I would suspect that would differ depending on whether it was trained on the question's code versus the accepted answer's (or most-upvoted answer's?) code.


Could probably trim it down 10-20% and call it "The Collected Works of Jon Skeet".


I've already downloaded documentation, like the Python API docs or the cppreference website, as a PDF or HTML archive.

I don't know if the same is available for HTML, JS, CSS, or OpenGL.


You should check out Zeal; it's an offline documentation browser with existing documentation packages for HTML and a whole bunch of other things.

https://zealdocs.org/


https://devdocs.io/ exposes a huge catalog of indexed and searchable collections of documentation for a wide variety of languages, libraries, and subjects, including HTML, JS, and CSS – though, the only GL I see is WebGL – and _all_ of it can be downloaded to an IndexedDB for offline use.

It's been a very handy tool in my toolbelt.


This is awesome! At first I thought this only supported Stack Overflow and not the other 170+ StackExchange forums, but it looks like it does (or will?). From the blog:

> “We built the Sotoki (Stack Overflow to Kiwix) scraper in such a way that it can capture each and every one of the 180 Stack Exchange websites.”

Unclear to me if "can" means "does" or "will soon" or just "could"


It already does - everything from the technical stack exchanges to the sites on cooking and gardening :)


I assume this is so people can train AI on it. It's just hard to say that outright because some people don't like the idea.


It has already been possible to download the dumps for a long time.

https://archive.org/details/stackexchange


I mean, the first paragraphs of the article explain what this is for.


This is an amazing dose of humility.



