Building personal search infrastructure for your knowledge and code (beepb00p.xyz)
661 points by october_sky on Jan 27, 2020 | 157 comments

I've given up with trying to find The One True Note Taking Tool, so have ended up writing my own thing that I tinker with now and again to tune it to exactly what I need.

It's essentially a simple web server that sits on top of a bunch of markdown files.

The frontend renders the markdown using markdown-it and supports KaTeX for simple inline mathy things, along with the extended markdown stuff like tables etc. I've even made it so that you can drag and drop files (including images) into the edit box and it will upload them to the server and render the correct markdown syntax so they can be rendered when you look at the note.
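A minimal sketch of the "render the correct markdown syntax" step for an uploaded file. It is written in Python purely for illustration; the function name, the `/uploads/` path and the list of image suffixes are all assumptions, not the commenter's code.

```python
from pathlib import PurePosixPath

# Hypothetical set of extensions treated as images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}


def markdown_for_upload(url_path: str) -> str:
    """Return the markdown snippet to drop into the edit box for an uploaded file."""
    path = PurePosixPath(url_path)
    label = path.name
    if path.suffix.lower() in IMAGE_SUFFIXES:
        # Images use the "!" prefix so they render inline.
        return f"![{label}]({url_path})"
    # Everything else becomes an ordinary link.
    return f"[{label}]({url_path})"


if __name__ == "__main__":
    print(markdown_for_upload("/uploads/diagram.png"))
    print(markdown_for_upload("/uploads/paper.pdf"))
```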

Alongside the files, the data is also stored in a SQLite database file with some metadata, and I'm using the Full Text Search (FTS5) engine to support search which seems to work ok.

If the database gets corrupted it can just be rebuilt, it's really just there to augment the notes. If I stop developing it or want to move on, the notes are there as text files.
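For anyone curious what "the database can just be rebuilt" might look like, here is a rough sketch using Python's built-in `sqlite3` module. It assumes the SQLite build bundled with your Python has FTS5 enabled (most do, but not all); the table name `notes_fts` and both function names are made up for the example.

```python
import sqlite3
from pathlib import Path


def rebuild_index(notes_dir: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Drop and recreate the FTS5 index from the markdown files on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS notes_fts")
    # 'path' is stored but not tokenized; only 'body' is searchable.
    conn.execute("CREATE VIRTUAL TABLE notes_fts USING fts5(path UNINDEXED, body)")
    rows = [
        (str(p.relative_to(notes_dir)), p.read_text(encoding="utf-8"))
        for p in sorted(notes_dir.rglob("*.md"))
    ]
    conn.executemany("INSERT INTO notes_fts (path, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list:
    """Return the paths of notes matching an FTS5 query, best match first."""
    cur = conn.execute(
        "SELECT path FROM notes_fts WHERE notes_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cur]
```

Because the markdown files stay the source of truth, losing the database only costs one call to `rebuild_index`.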

It works well enough in a mobile browser, although admittedly a bit rubbish if you need offline access.

Works well enough for me. I might open source it one day but I think I'd need to clean up the code a bit first :)

EDIT: the core of the tool was mostly inspired by this article https://golang.org/doc/articles/wiki/

This sounds a lot like a tool I built for myself [1], sans the database. I agree that Markdown + KaTeX with a local server seems like the right move for most technical people. Lots of things like encryption, backups, and basic text search can be done via other Unix tools. I also agree that the big win is owning your data long-term, even if you get tired of maintaining the software.

[1] https://github.com/gwgundersen/anno

Sir! I have to say seeing you here that I appreciate your contributions.

I have a similar tool I've been using for years now. It is built on sqlite and uses the fts extension to provide full-text search.

Some nice features I've added over the years: bookmarklet and automatic page-screenshotting, tags (and smart auto-tagging), everything markdown supports, file upload and attach, media embed (YouTube link becomes player, eg). Oh, I can also attach email reminders and make to-do lists (with little checkboxes and everything). It started out very simple and has grown over time. Sqlite is a great foundation for projects like this. Strongly recommend.

Same here: I take a lot of notes and find that SimpleNote/Notepad++/NV work for me for note taking but not for note management, so I have also built several tools to manage them. For now I use a very simple full-text search (qgrep) and fzf (with my own modifications) to do search-as-you-type over notes and source code. qgrep is good for really quick indexing, and its incremental indexing works for me, but I'm hitting some problems: it works well for code search but not so well for notes. However, I don't feel like using anything that starts a server. I just don't think I have that many notes.

I did something similar for a few years... then my web app was hacked. I realized what a big liability it was to have such precious stuff inside an app with such limited security. Since then I’ve learned to make do with simpler tools. Omni Outliner is my favorite!

Sourcegraph CEO here. I see the doc mentions Sourcegraph for code search (cool!). Something like ripgrep is indeed better for your case, a single person who just needs to search code in local directories on their own machine. I made a PR for our docs at https://github.com/sourcegraph/sourcegraph/pull/8075 that should clarify this.

Sourcegraph is a web-based code search tool that automatically syncs and indexes many repositories from your organization's code host(s). It's intended for every developer at an organization to use for searching across all of the organization's code (and for navigating/cross-referencing with code intelligence). It's self hosted and usually there is 1 Sourcegraph instance per organization. If you love local+personal code search, I bet you and your teammates would love organization-wide code search, so give Sourcegraph a try (https://docs.sourcegraph.com/#quickstart). :)

Thank you for replying and updating the docs, appreciate it!

I still wish it was easier, it's such a cool tool :) In theory it should be possible to set up inotify watches on local repositories and reindex on changes (perhaps with some throttling logic if it's too heavy), although I understand it's harder than it sounds and my usecase is probably somewhat marginal. I might set it up anyway if my personal infrastructure ever settles.
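The "reindex on changes, with some throttling" idea could be sketched roughly like this. Note that it polls file modification times instead of using inotify, since the Python standard library has no inotify binding. The `reindex` callback and the interval values are placeholders, not anything Sourcegraph provides.

```python
import os
import time
from pathlib import Path


def snapshot(repo: Path) -> dict:
    """Map each file under repo to its last-modified time, skipping .git."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(repo):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # prune in place
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                state[full] = os.stat(full).st_mtime_ns
            except FileNotFoundError:
                pass  # file vanished between listing and stat
    return state


class Throttle:
    """Allow an action at most once every min_interval seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_run = None

    def ready(self) -> bool:
        now = self.clock()
        if self.last_run is None or now - self.last_run >= self.min_interval:
            self.last_run = now
            return True
        return False


def watch(repo: Path, reindex, min_interval: float = 30.0, poll_every: float = 2.0):
    """Poll repo forever; call reindex(repo) when files changed and the throttle allows."""
    throttle = Throttle(min_interval)
    indexed = snapshot(repo)
    while True:
        time.sleep(poll_every)
        current = snapshot(repo)
        # If the throttle says "not yet", 'indexed' stays stale and we retry next poll.
        if current != indexed and throttle.ready():
            reindex(repo)
            indexed = current
```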

Another great option for local code/repo search is Hound. I maintain an instance of it at my workplace, but it's so lightweight and easy to deploy that I could easily imagine running an instance of it on my laptop for offline personal use.


YES! We have been using Hound for several years now. Having all of our org's hundreds of repos searchable in one spot, in a LIGHTNING FAST manner, has been invaluable in helping our various teams keep up with the legacy sprawl and effectively remove old features and all their dependencies from our sprawly systems. I even wrote a microservice that uses GitLab global hooks to keep Hound up to date without polling, and a little C# config generator that runs as a cron job on our GitLab instance and redeploys Hound with the newest repos included.

Hound falls short on the access control front (we wrapped our instance with a SAML proxy): it's still a case of either you can search every piece of software for 'password' or you don't have any access at all. Having to index a specific branch instead of all of them kinda stinks too; for those two specific reasons we have been eyeing Sourcegraph, especially as the GitLab integration matures.

I can't emphasize enough how fast hound is and how pleasurable it is having a regex based code search that doesn't make me wait.
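A config generator like the one mentioned above could be as small as the sketch below. It is in Python instead of C#, and it assumes Hound's `config.json` takes a `dbpath` plus a `repos` map whose entries each carry a `url`. That matches my recollection of the Hound README, but do verify it against the version you deploy. The repo names and URLs are made up.

```python
import json


def hound_config(repo_urls: dict, dbpath: str = "data") -> str:
    """Render a Hound-style config.json from a {name: clone_url} mapping."""
    repos = {name: {"url": url} for name, url in sorted(repo_urls.items())}
    return json.dumps({"dbpath": dbpath, "repos": repos}, indent=2)


if __name__ == "__main__":
    # Hypothetical repos; a real generator would list them via the GitLab API.
    print(hound_config({
        "billing": "https://gitlab.example.com/org/billing.git",
        "frontend": "https://gitlab.example.com/org/frontend.git",
    }))
```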

Yeah, the access control thing is not ideal— I have my instance behind Apache for the active directory plugin. Potentially as a hack you could run multiple Hound instances and reverse proxy the correct one based on a user's role? Might be easier to just add in proper support upstream. :)

Anyway, for now I'm at a small enough org that everyone still just sees everything, and it's been super valuable.

As far as competition with other tools, the infrastructure team at my org has their Elastic instance plugged into our GitLab, but most of the engineers agree that Hound is better— it's faster, it does regex, and it doesn't do goofy stuff like return pages of the same result from everyone's fork of the same repo.

A similar but more structured (though perhaps Hound supports a similar feature set) code searching tool is OpenGrok [1]. It's a bit more setup as it uses Apache Tomcat, but once it is set up it is an incredibly fast and useful code querying tool with really useful abilities to x-ref functions/structures and highlight uses of variables, and it integrates git info as well. If you've ever used elixir.bootlin.com to go through the Linux source code, OpenGrok is effectively a more feature-packed open source version of that. I highly recommend taking a look to anyone who spends a lot of time digging through code.

[1] https://oracle.github.io/opengrok/

Recently interviewed for a PM role at Sourcegraph, so I read everything Sourcegraph shares online and it was amazing to see your plans and OKR's for the future being laid out in the open. Kudos on running an open, successful organization.

After reading about your masterplan I would love to know your thoughts on the question presented regarding phase 2.

Will coding in the future be more like writing a novel or like knowing how to read+write? I feel the latter will eventually be true as the human-machine interface becomes more 'native'.

I'm not familiar with your product, so this question might be either overly simple or entirely outside your wheelhouse, but does Sourcegraph have any integration with IDEs? Like if my organization had a module in a repository that did image compression (random example), could I hit a hotkey and search for it, then have the plugin pull the module out of a different project and insert it into what I'm working on?

Sourcegraph has editor plugins (https://docs.sourcegraph.com/integration/editor) that give you editor hotkeys for (1) multi-repository search and (2) go-to-file on Sourcegraph (in your web browser, so you can read the code without ruining your editor state or share the URL of your current file with a teammate).

That advanced use case you mentioned isn't supported, but it sounds very cool. It's in the realm of things we'd like to offer someday. If anyone's interested in hacking on that (and making a PR to https://github.com/sourcegraph/sourcegraph), I'd be happy to screenshare with them and give them some pointers.

I was in an Uber with one of your engineers who was heading to Gophercon. He seemed cool so I'm going to assume you're all cool people.

Seems like more than half of this comment is just speaking about Sourcegraph, something sqs themselves acknowledge is not the right tool here. I know you're the CEO, but maybe you can avoid pushing your product when it's not relevant :)

Thank you for updating the documentation to clarify the use case though!

Wow, the prices seem extremely high to me for a search engine across code repos.

$30/person is almost double what Stack Overflow charges, and that product can act as a frontend to search not just code but any type of documents, with voting, tagging, analytics on what confuses people the most and more.

It would be hard for me to justify even $10/person for something like Sourcegraph in my company (a Fortune 500 ecommerce brand), for the highest enterprise tier of functionality.

$30/person per month for the lowest tier? Boy, I wish I knew of companies willing to pay that. None in my experience ever have been.

> in my company (a Fortune 500 ecommerce brand)

My strategic advice is to get whatever's best in class, and not worry about $X0/month. Compared to what you should be spending on devs that rounds to free.

I’ve never heard of any company that does that. In all 6 of the large tech companies I’ve worked for, all of which were very profitable, this type of per-head cost would be a HUGE blocker to being allowed to requisition the tool.

Meta-observation. This topic seems to be getting a lot of attention on HN over the last few months, indicating massive interest. Further, looking at the landscape of developments in this space (past all the me-too Markdown note taking apps): Evernote seems to have a fading presence on the landscape, Notion seems to be a (too?) well-funded behemoth startup, Roam is trying some exciting things, and Tiago Forte is putting together some interesting things under the BASB banner. (Any others? Oh btw, there’s also Perkeep)

It’s amazing for how long Emacs’ Org-mode has been largely unparalleled! Apart from the revered desktop setup, there are now a bunch of mobile offerings including Organice — not quite slick, but definitely useful.

I‘m sincerely rooting for more experiments in this area. I would love to be able to write by hand or speak to my memex (multi-modal interaction). Vannevar Bush’s “As we may think” has languished uncourted for pitifully long. In some ways, this was supposed to be the first “killer app” for personal computing.

> It’s amazing for how long Emacs’ Org-mode has been largely unparalleled!

I use org-mode all the day but frankly OneNote is great too!

If OneNote would save in plain text and have a cross-platform gui I would use it (even if it's resource-sucking electron)

I use onenote extensively and the mobile app has gotten steadily a little slower and worse. I would love to have a simpler, faster mobile app that would sync to my computer in plaintext so I could use vim with it....

Basically, onenote is almost there but I would love to leave it

OneNote is the only note taking app that I've come remotely close to "successfully" using.

I especially love how it automatically cites and links to whatever you copy and paste from the web. That alone is so valuable for documenting workflow and how-to write-ups.

However the combination of me using a desktop less and mobile more, plus Microsoft's attempts to turn Office into a web app have soured me to it. That and the limitations mentioned above. I'd love to be able to export to a wiki style interface, but I cringe at thinking about what that html would look like (a la Word's html export).

But I have yet to find anything that I like better. Or will consistently use as much.

The OneNote format is documented. If I knew Haskell I might have a go at adding support to OneNote, but if anyone wants to have a crack at it I'd support you.

> It’s amazing for how long Emacs’ Org-mode has been largely unparalleled!

After jumping into org last year, I think it's because orgmode has a solid foundation for organizing stuff that's infinitely customizable with elisp. Roam, Evernote, OneNote: they just don't have the flexibility. The lack of customizability is a feature in itself: it's easy to pick up.

On the other hand, orgmode has a fairly vibrant community that will keep improving orgmode for many years to come.

My hackerspace is working on a tool to put this knowledge-in-a-computer to work.

Essentially, it's a knowledge management system that makes input almost frictionless. This is then mapped into a shareable ontology graph on which algorithms can be executed. Valuable data can be extracted from here.

For example: do you need to find a team with a specialized couple of skills? Have applicants send their verified graphs and use those relations to find the best fit.

Or, alternatively, someone who's learned a trade/skill can share their dense knowledge with a community, to direct learning more effectively.

It's at a very early stage, for now purely for the fun of it. But if there's interest or suggestions (definitely some hard problems to solve) we could focus more efforts towards that.

>Evernote seems to have a fading presence on the landscape

My understanding is that they're throwing their presence away? Maybe they pivoted to enterprise, I don't know, but for at least a couple years all I've heard about them was people talking about what to use instead.

I’ve started retiring my use of Evernote because I simply don’t have any idea whether they’re going to be around for the long run. I’ve been using Emacs for nearly 30 years; it’s about time I learned org-mode.

I think for a long time they simply had no idea what road they should walk down. So they tried a bunch of stuff, walked in circles and wasted money left and right. Now they have settled on what they always were and are trying to maintain the established userbase, not scaring them away with too much innovation and instead delivering conservative improvements.

This came with a shutdown of the free users' heaven and more focus on the paying customers, which is why many people seem to be cranky about Evernote. A similar thing seems to be happening at the moment with Dropbox too, BTW.

It's a ripe space. I'm using notion mostly right now, but I've also used:

-Coda.io (big, more scriptable player)

-Hypernote (super new player, but with a cool new take on inter-note relationships)

-Tiddlywiki (super customizable, really fast -- but also has a fair amount of footguns)

-Airtable (only played with it a few times but it's usually mentioned in the same breath as notion, I notice)

Hopefully someday we'll achieve Alan Kay's dream :)

If only Notion was private (like for example Omnifocus.) I can't imagine uploading all my private data to the cloud of a "free" app.

OmniFocus is more expensive but I gladly pay to prevent my data from being analyzed & sold.

I don't disagree -- though they do have a paid version, and a modest team size, which seems promising

Airtable is fantastic for so many things, I only wish there was a cheaper "personal" plan with more records, for a couple of bucks / month.

I can't really justify $10 / month for "just for fun" personal projects, and the 1200 records / base is too limited for many ideas (and also 5000 records for $10/month is on the low side as well, even if putting it as a company expense)

Yes, I know, they got to eat and everything, and maybe cost vs income is not feasible for personal accounts.

Have you looked at Wiki.js? I played around with it a little bit and it seemed nice

I've only started using Tiddlywiki, so I didn't get the chance to dive deep. Can you mention some of the footguns?

Just trying to make it save its state is difficult enough, there's a bunch of hoops to jump through and at least I have never actually managed to do it.

It's one of those things I try every few years, fail and quit.

Ok, I know what you mean. That kept me away as well for a while. What made it work for me was to use the desktop app first. Then, I made a setup with nodejs and that's what I'm currently using.

> Evernote seems to have a fading presence on the landscape

They've made massive changes over the past year. They'll even have a Linux app coming out soon!

Really? What massive changes? I use it regularly but haven't noticed anything massive.

(1) For note taking I stumbled across anno[1] via[2] two weeks ago. It's a Python Flask application which you run on your localhost. You write markdown which gets stored locally as a file and is rendered as HTML using pandoc[3]. It's really basic but I love it.

(2) For physical documents I use a Fujitsu ScanSnap iX500[4] for scanning. A runtime license of ABBYY FineReader for OCR is included. The resulting PDF has embedded text which I extract using pdftotext[5]. I wrote a Python application to search and tag these documents. It loads all the text in-memory, which is perfectly fine as I have < 10,000 documents. I have been using it for 5 years and it works OK.

[1] https://github.com/gwgundersen/anno

[2] https://news.ycombinator.com/item?id=22033792

[3] https://pandoc.org/

[4] https://www.fujitsu.com/global/products/computing/peripheral...

[5] https://en.wikipedia.org/wiki/Pdftotext
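A rough idea of what the in-memory search in (2) could look like. This is only a sketch, not the commenter's application: the function names are made up, and the only external call is `pdftotext <file> -`, which writes the extracted text to stdout.

```python
import subprocess
from collections import defaultdict


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on an OCR'd PDF; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def search_documents(corpus: dict, query: str) -> list:
    """Return names of documents containing every word of query (case-insensitive)."""
    words = query.lower().split()
    return sorted(
        name for name, text in corpus.items()
        if all(word in text.lower() for word in words)
    )


def add_tag(tags: dict, name: str, label: str) -> None:
    """Attach a free-form tag to a document name."""
    tags[name].add(label)


if __name__ == "__main__":
    # In real use the corpus would be {path: extract_text(path)} for every scan.
    corpus = {"scan-001.pdf": "Gas bill January 2020", "scan-002.pdf": "Income tax return"}
    tags = defaultdict(set)
    for hit in search_documents(corpus, "gas bill"):
        add_tag(tags, hit, "utilities")
    print(dict(tags))
```

With fewer than 10,000 documents a linear scan like this is typically fast enough that no index is needed.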

Actually, what has been bugging me recently is the inability to "tag" photos on my iPhone. All I want is to snap a copy of my bill/invoice/whatever, tag it with "gas bill" and let it upload to iCloud/Dropbox. From there I am sure I can process it further by looking for "gas bill", but there seems to be no obvious way to do it (I even looked into EXIF data), and I guess it will have to wait till I learn iOS coding.

Touch and hold , then tap an option. Custom: Tap , tap Enter New Tag, type a custom tag, and tap Done. Create additional custom tags: Tap , tap Enter New Tag, type a custom tag, and tap Done. Add more than one custom tag to a photo: Tap , and tap each tag you want to add (so a checkmark appears next to it).

Is this a real UX, or something you'd like? (This isn't how either Apple or Google Photos works)

Have you looked into apps like Scanbot [1]

1: https://scanbot.io/en/index.html

Totally unrelated but I love these "how I built my version of" threads - I learn about tech and projects I never knew existed

ok carry on please, diversion over :/)

I have a ScanSnap scanner too (mine's an S1500 - I have had it for c10 years or so and it still works perfectly) and it's great to be able to search what used to be paper documents quickly and easily. It saves a lot of physical space as well, most documents I scan then shred immediately once I've verified the scan is good and backed up.

There are some reasonably good OCR tools on Linux now as well - I've been pretty happy with Tesseract[0]. It was an absolute pain to script everything to "just work" when I press the button on my scanner though.

Recoll[1] works very well for indexing documents for me including my OCRd scans. When that's not enough, I revert to pdfgrep.

0. https://github.com/tesseract-ocr/tesseract

1. https://www.lesbonscomptes.com/recoll/

I've been looking for a good multi-document feed scanner. Do you have experience using the iX500 with Linux, or gscan2pdf?

My usecase would be scanning multi page documents with minimal effort, and saving to PDF somewhere.

I thought about Linux but while it should be possible to use the iX500 with Linux you would lose OCR. I did some tests and compared the OCR of the included ABBYY FineReader with Tesseract[1]. Tesseract was not good enough for my use case. So I still use the iX500 on Windows.

[1] https://en.wikipedia.org/wiki/Tesseract_(software)

Does anyone else find that the simple act of writing notes helps them remember and process better? I spent forever trying to find an ideal note-taking solution, but now I just write things in a single notebook. I rarely review my notes, but I find that simply writing thoughts down consistently has improved my memory and understanding of new concepts.

Yes, the effectiveness of note-taking (particularly handwritten) on memory has been a subject of scientific interest for a while.

I feel it shares aspects with Rubber Duck Debugging: The effort of taking something you "know" and forcing it back out through other brain-circuits (i.e. language and/or simulating a social interaction) helps to fill gaps that your brain would otherwise skip over. The act of hearing/seeing your output also causes other parts of your brain to analyze it as if it were someone else's thought.

I suspect our consciousness isn't nearly as unified as we like to believe.

This has changed back and forth over time for me.

Tests in middle school, I could recall writing things down, even the part of the page I wrote them in.

By college I would write TODO's down and lose them, and not be able to recall what I wrote down. Misplacing the note was more likely than forgetting the task, so I stopped writing them down.

I should try to measure this again because right now I couldn't tell you which works better.

One of the most uncomfortable things about getting older is that in your teens and 20's you spent all this time figuring out who you are, what you like, what you're good at and what you struggle with. Age, changes in health, coping mechanisms, changes in perspective all fuck around with this and you can find yourself in situations you should avoid or avoiding situations you could embrace.

It's like a weird mid-life crisis.

This certainly applies for me personally. My theory is that it ties in with the sort of 'geographic' memory where when you think of something, you might not be able to remember exactly what it is, but you can remember pretty precisely that it's in the middle of a certain notebook, on a heavily marked-up page, in the bottom left corner. By tying things to a location which you can remember, placing it in a bit of a context, it's easier to hold on to. I also find, and for this I have no explanation at all, that I can remember sequences of numbers and code very well, better than anything else. I couldn't tell you the date I started or left my job 2 employers ago, but I could rattle off my 7-digit numeric security code for the door no problem. The brain is weird.

Well, you also used that 7 digit code a lot more often than you ever had to recall your start or end dates.

I think note-taking is a way to "materialize" thoughts. By which I mean a thought is a bunch of uncertain values within a rough area, and they come and go fast. But writing it down makes the values clear and specific, and gives you something to remember instead of something to think.

Hm, basically it's making a real, final decision instead of playing with a bunch of potential decisions which are all somewhat equal, but also kinda fuzzy.

This is well-known. Handwriting has been found in research to work very well, though digital note-taking also works. Those diligent students were rewarded, though it's very good to learn this yourself!

I've found Freemind to work well enough for me. Search not needed as I browse the graph easily enough.

Very much so. For me, note-taking also forces basic internal thought (I guess what you called process) as well, which is great. As I'm writing down the note, the next line has a relatively high percentage chance of being "wait, no that can't be right because X".

I am now dabbling with reviewing them, although not sure what that will lead to, as they are so unstructured. There are generally a few gems in there to be remembered, but mostly spur of the moment gibberish!

I think there is a lot of truth to this, however I also (when I remember to) like to review notes I wrote a month or so later, and check to see if they make any sense to me. If they don't it means I didn't understand the concept as well as I thought I did and it is worth going back over the source material.

I wrote and use daily http://onemodel.org (AGPL, uses postgres), for many reasons listed there :) . One way to think of its current state is a text-mode, easy-to-learn (i hope) infinite mind map of things, where I store and can query effectively everything: calendar, reminders, quasi-anki-like knowledge review, journal, automatic activity log, notes on subjects, very efficiently for the user. (It also stores documents, but that is not very smooth compared to other document systems, nor is browser integration smooth at all.)

Edit: It also has a very basic security model (private, public, unspecified), and with that in mind, can export trees of notes as html or as outline documents (text), with or w/o indentation & numbering, which I've found very useful. And anything can be in as many places in the tree as is helpful. The export to simple html, I use to generate my 2 web sites.

(I plan to move it to Rust, and maybe sqlite, eventually, as well as add features like anki, internal code attached to entity classes for cheap internal customization/automation, etc, but have been slow lately.)

(Edit: it is currently only self-hosted by each user. Have considered doing hosting for other users, and might some day.)

Looks interesting but honestly I had trouble keeping my attention focused enough to read through the intro page.

A little CSS (max-width: 700px; margin: 0 auto;) on the body would go very far.

Thanks for the comment. That was debated slightly, in a previous HN discussion, where some pointed out that using browser defaults appeals more (to some people, I don't know if the majority), especially if they have particular needs. I admit to insensitivity to such things, but I will make a note to try your suggestion sometime. :)

I remember this one. It looked interesting, but was lacking the "big picture" of the data. It seems easy to get lost in the data and lose your trail.

Thanks. Can you elaborate, including on what would be a fix for you?

For me, the big picture is I organize everything in ways that work well for me, which I have tried to mention on the web site (in screen shots and some org ideas somewhere). Like, todos, historical things, documents, contacts I have (orgs and people), calendar + tickler file (so I don't have to think about things until the date I should start thinking about it, but I don't forget, if I check it habitually), habit reminders and other review/study material, and notes by topics organized in ways I can find things. I have a top level list/hierarchy/outline (actually a few of them, and anything can link to anything else for quick reference, depending on the convenience of the moment for lookup), or I can remember some search terms ("<x-company> main" to get a phone # for x-company). I also have standard patterns (with some support in the software for making data look like templates) for details about contacts or other things, logging journal notes, conversation notes with businesses or doctors or whatever, and it then becomes easy to refer to history. Then anything is basically available via a few to several keystrokes, to get exactly what I want. There is also text search, or queries by date. It seems like one would have to do that with any kind of mind map, org-mode, or note system: organize things and/or search for them in a way that helps oneself as the user. Maybe some pre-fabricated forms or examples of that would help someone get started though...

(some edits for clarity above, and)

Edit: Also, when navigating in to one's data, one can then hit 0 or ESC to go back out the way you came, even holding down ESC to go back to the top level. I also tried to make it so the UI shows what can be done at any given time, if one reads the screen.

Is any of that relevant, or do you have something else in mind? Thanks again for the feedback.

(If one has possible future interest, there is an announcements list, and feedback is also appreciated.)

telnet demo seems to be down at the moment: Trying

True; sorry about that. Maybe I should remove that from the web site until I decide better. But the best thing is probably to check the screen shots via the web site, then install/try it if you like...

Edit: I have removed mention of the telnet demo from the site. If there were sufficient real interest I would put it back (or consider hosting the system for others). If so, email me via the mailing list at the site, or via the address at the site footer. Thanks.

If you have possible future interest, there is an announcements list.

A great time saver for me was simply setting up better bash history and search capabilities[1].

I wrote a wrapper function, sbh (search bash history) that allows me to input date strings like "2 months ago", or "last week", which narrows the search. Linux 'date' function with --date string arg is pretty powerful[2].

1 - https://spin.atomicobject.com/2016/05/28/log-bash-history/

2 - https://www.thegeekstuff.com/2013/05/date-command-examples/
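Not the commenter's actual `sbh`, but the idea can be sketched in Python. It assumes a history file written with `HISTTIMEFORMAT` set, so each command is preceded by a `#<epoch>` line. The tiny date parser here is a stand-in for `date --date`: it handles only "N units ago" and "last unit", and treats a month as 30 days and a year as 365.

```python
import re
from datetime import datetime, timedelta

# Approximate unit lengths in days (an assumption; GNU date does calendar arithmetic).
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def parse_since(text: str, now: datetime) -> datetime:
    """Turn '2 months ago' or 'last week' into a cutoff datetime."""
    text = text.strip().lower()
    m = re.fullmatch(r"(\d+) (day|week|month|year)s? ago", text)
    if m:
        return now - timedelta(days=int(m.group(1)) * UNIT_DAYS[m.group(2)])
    m = re.fullmatch(r"last (day|week|month|year)", text)
    if m:
        return now - timedelta(days=UNIT_DAYS[m.group(1)])
    raise ValueError(f"unsupported date string: {text!r}")


def read_history(lines):
    """Yield (timestamp, command) pairs from history lines with '#<epoch>' markers."""
    stamp = None
    for line in lines:
        line = line.rstrip("\n")
        if re.fullmatch(r"#\d+", line):
            stamp = datetime.fromtimestamp(int(line[1:]))
        elif stamp is not None and line:
            yield stamp, line


def search_history(lines, pattern: str, since: str, now: datetime = None) -> list:
    """Return commands containing pattern that were run after the 'since' cutoff."""
    now = now or datetime.now()
    cutoff = parse_since(since, now)
    return [cmd for ts, cmd in read_history(lines) if ts >= cutoff and pattern in cmd]
```

Typical use would be something like `search_history(open(os.path.expanduser("~/.bash_history")), "rsync", "2 months ago")`, after an `import os`.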

This reminds me of something similar: the CEO of Wolfram developed a nice way of record keeping: https://writings.stephenwolfram.com/2019/02/seeking-the-prod...

By the way, is there, by chance, a "note taking/indexing tool from photo"? I'd like to be able to take a photo of a title/abstract of a computer science paper with my phone, and then be able to find it by approximate date and keywords. (I use Android. Seems like something relatively easy to hack, actually, on top of Google Photos.)

Evernote does character recognition quite well. I don't know if there are any others but would be good to have something else too so I can leave Evernote for Notion.

I've been thinking a lot about how I manage my own data lately (notes, photos, code, reference material, etc) and have concluded that the primary feature I'm looking for is longevity. I'm saddened by the amount of data I've lost over the years, either because of hard disk failures or third-party services going out of business/making it difficult to extract things/getting too expensive.

In light of this, I'm biasing toward simple file formats managed by tools I write myself, and optimizing for cost in a way that I otherwise don't, since any recurring costs incurred by the system are effectively a lifelong commitment. I am relying on S3 for primary storage (so that it is accessible anywhere) but with a sync to offline backup.

So far, I've implemented a personal Zettelkasten tool (with built-in spaced repetition, so doubles as an Anki replacement) and a search engine that's based on Presto (via AWS Athena) so that I don't need to keep an Elasticsearch instance alive. I'm planning to build out other repository tools as I go.

It's been very liberating to build tools that are never meant to be used by anyone other than myself, and with the confidence that the tools don't matter too much anyway since the underlying files are stored in evergreen formats.

what's the optimal setup for long-term, large-scale (personal) data storage?

I want to build one big Backup. Some initial research has pointed me to something like Bacula to manage the data backup process from a machine. With the 3-2-1 rule, I know I also need my Backup itself to have at least 3 copies, in at least 2 different forms (cloud/hard disk), at least one of which is off-site from me.

As an individual, do you or anybody else know the best way to implement such a system? Should I buy one giant hard drive, use many hard drives to create a RAID array, something else?
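For what it's worth, the 3-2-1 rule as stated above (at least 3 copies, at least 2 different forms, at least one off-site) is easy to check mechanically against a backup plan. A minimal sketch; the `BackupCopy` class, its field names, and the example plan are all made up for illustration and say nothing about which hardware to buy:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the backup (hypothetical structure, for illustration only)."""
    name: str
    medium: str    # the "form", e.g. "hard disk" or "cloud"
    offsite: bool  # True if the copy lives somewhere other than where you do


def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 distinct media, with >= 1 off-site."""
    media = {c.medium for c in copies}
    return (
        len(copies) >= 3
        and len(media) >= 2
        and any(c.offsite for c in copies)
    )


# An example plan: two local disks plus one cloud copy.
plan = [
    BackupCopy("desktop drive", "hard disk", offsite=False),
    BackupCopy("external drive", "hard disk", offsite=False),
    BackupCopy("cloud bucket", "cloud", offsite=True),
]

if __name__ == "__main__":
    print("plan satisfies 3-2-1:", satisfies_3_2_1(plan))
```

Dropping the cloud copy from `plan` fails all three conditions at once, which is a decent argument for having more than the bare minimum.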

Oooh. I've been wrestling with this problem for a while now.

Basically I'm working on a tiered system. Files/dirs are categorized by size (<10MB, <25GB, >25GB) and by sensitivity (public, confidential, secure), and importance is usually proportional to security. I have fortunately found that security is usually inverse to size. Anything where it makes sense goes on GitHub/GitLab. Confidential small stuff (sans keys) is just stored in Gmail/Drive. Big, boring stuff (music, ebooks) is just kept on external hard drives.
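A rough sketch of how that size/sensitivity tiering could be written down. The thresholds and the three sensitivity labels come from the description above; the tier names, the function names, and especially the mapping to destinations are one reading of it, not the actual system, and anything the description doesn't cover comes back as "undecided":

```python
MB = 10**6
GB = 10**9
SENSITIVITIES = ("public", "confidential", "secure")


def size_tier(size_bytes: int) -> str:
    """Bucket a file or directory by size: <10MB, <25GB, otherwise large."""
    if size_bytes < 10 * MB:
        return "small"
    if size_bytes < 25 * GB:
        return "medium"
    return "large"  # exactly 25GB lands here; the description leaves that case open


def suggest_destination(size_bytes: int, sensitivity: str) -> str:
    """Suggest where something goes (assumed mapping, for illustration only)."""
    if sensitivity not in SENSITIVITIES:
        raise ValueError(f"unknown sensitivity: {sensitivity!r}")
    tier = size_tier(size_bytes)
    if sensitivity == "secure":
        return "encrypted archive on NAS + cloud backup"
    if sensitivity == "confidential":
        return "gmail/drive" if tier == "small" else "undecided"
    # public data from here on
    return "external hard drive" if tier == "large" else "git hosting"


if __name__ == "__main__":
    print(suggest_destination(2 * MB, "confidential"))
    print(suggest_destination(100 * GB, "public"))
```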

Secure, ultra-important stuff, I don't really have a system for.

The system I'm leaning towards is just encrypt archives and store the key/password securely, and store it like you would any boring data, with a local NAS and a cloud backup service of some sort, or just stored on drives offsite.

Do you feel comfortable using cloud storage for so much of your content? My ideal is to be entirely self-backed-up. I want a personal git server, photo archive, etc. With bandwidth, service costs, vendor issues (dealing with google seems like a nightmare from reading online).

How did you construct your NAS? Is it a single system, or multiple hard drives/storage solutions connected to your network?

It depends. Github is not going down. Gmail is not going down. If they do, it's Bug-out-bag time, and I am working on curating what information subset I need for that.

Ideally, yes, I would have my own entire backup system, but I frankly don't trust myself enough to do it right, hence some redundancy in the cloud.

The NAS I am still designing actually :p

You mention S3 and Athena, but also that you're building for longevity. Are you planning for the future obsolescence of AWS, or going to cross that bridge when you get to it?

The S3 files are mirrored to a local drive as a collection of plain .md, .jpg, etc. The Athena search index is secondary in importance to the source data and not necessarily permanent (presumably the options for "take this folder full of files and let me search it" will only improve over time).

That being said, one of the reasons I chose S3 vs. other AWS services or other companies is because I expect it to be around for a very long time. (Just because I've preserved the option of migrating away doesn't mean I relish the idea.)

I'd really like a personal "correlate all the things!" setup that has a plugin architecture for any source and creates a time series and document-based store of whatever I want. Tweets, e-mails, text messages, time tracking, etc.

There are lots of tools that do the individual moving parts, but a personal aggregator of everything would be interesting. Basically, a tool that lets you become your own personal data broker—just for your own personal data.
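To make the plugin idea a bit more concrete, here is a minimal sketch: each source is a function that yields timestamped events, a registry collects them, and the aggregator merges everything into one time-ordered list. All names here (`Event`, `register`, `timeline`) and the two fake sources are invented for illustration; real plugins would read from exports or APIs instead of yielding hard-coded data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    when: datetime
    source: str
    text: str


# Registry of source plugins: name -> function yielding Events.
PLUGINS: Dict[str, Callable[[], Iterable[Event]]] = {}


def register(name: str):
    """Decorator that adds a source plugin to the registry."""
    def wrap(fn):
        PLUGINS[name] = fn
        return fn
    return wrap


@register("tweets")
def fake_tweets():
    # Stand-in data; a real plugin would parse an export or call an API.
    yield Event(datetime(2020, 1, 27, 9, 0), "tweets", "reading about personal search")


@register("email")
def fake_email():
    yield Event(datetime(2020, 1, 26, 18, 30), "email", "note to self: back up notes")


def timeline() -> List[Event]:
    """Collect events from every registered plugin, oldest first."""
    events = [event for plugin in PLUGINS.values() for event in plugin()]
    return sorted(events, key=lambda e: e.when)


if __name__ == "__main__":
    for e in timeline():
        print(e.when.isoformat(), e.source, e.text)
```

The nice property of this shape is that adding a source never touches the aggregator, which is about all "plugin architecture" needs to mean at personal scale.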

I'm kind of working on that too :) https://github.com/karlicoss/my

I wrote a post on some data that I collect and have/will integrate: https://beepb00p.xyz/my-data.html#consumers

I only skimmed through and the combined breadth + intent of your projects seems very, very interesting — I mean it speaks to me. So, way to go! Mad props, please keep it up!

If you ask me, this is the shape of things to come.

Thanks! :) I wish it was easier to share with other people, lots of things are tedious to set up

I (and many others I'm sure) have been thinking about similar things as well. Not really sure how it would work though. Anyone care to brainstorm an architecture that can support this?

My problem with a lot of services listed below, is they all eventually go away, and all your data is off somewhere else. Unless you store your data locally in a human readable format (markdown) you are just putting all your data into a system that WILL go away at some point in the future.

Google has already had 2-3 services to manage your data that they have closed down. Maybe they are the ones that taught me not to trust your data with anything on the web.

Even something like Evernote is iffy, they seem like they are constantly on the verge of shutting down.

Although I do find it sad that the human race as a whole puts so little value into this type of software, and so much value into sports and politics.

http://onemodel.org , described briefly elsewhere in this discussion page and more at that site, is self-hosted, which today means installing postgres and editing one config file, doing backups & upgrades (but I might be able to help some).

Maybe I could host for others sometime if there were sufficient interest. And/or move it to sqlite.

Yeah, it seems neither self-hosting (onemodel) nor letting someone else host (you or Evernote) is particularly attractive, because the chance of data loss is always there.

Is it possible there is a solution that makes the data more permanent and allows multiple parties to backup the same sources, or something similar? Some sort of federation protocol maybe.

Thanks. That is on the future roadmap (though I have been slow lately): selective sharing/copying/synch. I encourage anyone with possible interest to sign up for the announcements list at least, and maybe decide sometime to help. :)

https://en.wikipedia.org/wiki/Blosxom ... I got started in 2004 with that setup: http://www.robertames.com/blog.cgi/entries/bloxsom-started.h...

...a bit contrarian compared to the WordPress and BlogSpot frenzy at the time, but I've been happy with it.

[rames@...:~/blog/entries]$ find . -type f | wc -l
331
[rames@...:~/blog/entries]$ find . -type f | xargs -n1 cat | wc -c
574481

It's been very stable over ~15 years, but I think it might be time to adopt SQLite, at least as a caching layer. ;-)

Someone shared this on HN yesterday - https://labs.tomasino.org/gnu-recutils/

It's a set of unix-style tools that let you treat text files as databases.

This is what I've moved to: https://joplinapp.org

It's just plain markdown and syncs to any cloud provider or a webdav share. Butt-ugly especially on iOS, but it works and there is no vendor lock-in.

Honestly, the risk of me losing my local data is much higher than a note service shutting down. It has happened countless times and I'm just too sloppy with the backups. Purely personal of course but for me your argument is reversed.

It's been mentioned a few times in these comments, but I want to add a +1 for Roam[1]. Note-taking/personal knowledge tool that's very, very different from anything I've seen before -- closest thing I can compare it to is Wikipedia. It's still in beta with some rough edges, but VERY worth checking out.

[1] roamresearch.com

Worth mentioning the pricing [0]:

$30 / month

$10,000 / lifetime

[0]: https://twitter.com/Conaw/status/1214855473876201472

"trackcmp.net" keeps breaking navigation for me on roamresearch.com, and that tracker is not even https (classic unsafe warnings from Chrome). Weird. Unfortunately, that makes the whole thing look shady, and I can't even get to the create account screen after signing up. :/

Maybe they'd do better to ease up on the tracking, especially for a "give us all your documentation" service.

Looks neat but it's a service, which is kinda the opposite of rolling something for yourself.

Isn't this just a desktop-wiki with auto-linking?

> all digital trace I'm leaving (tweets, internet comments, annotations)

I would be open to the idea of a tool which combines the entirety of my digital presence at any point in time in a single platform. Kinda like a dynamically updated list which updates itself - every time a linked account makes a comment, 'likes' a post or performs any activity that may link it back to me.

I'm building this: https://histre.com/. It has Hacker News, Telegram, and web browsing (notes, bookmarks, history) integrations already. Up next: Emacs org-mode exports, integrations with Pocket and Pinboard.

Here is a bit longer comment on that which I made earlier today: https://news.ycombinator.com/item?id=22160026

This is cool, I'd dig a $2-5/mo unlimited account for 1 person/team with the same unlimited settings.

Thanks swozey. Can you please send me an email? k@histre.com

We've been building something to solve this exact problem. We started looking at how much we could derive from email notifications but parsing unstructured emails is very error-prone and typically incomplete so we decided to lean on external APIs where possible. Essentially what we made was a platform to index your data across services in one searchable document store or build new applications natively on this database where all of your data is one place. If you're interested in getting early access (we're still in early alpha) you can sign up on http://www.aspen.cloud

It's barely a year old, but I think Timeliner is kind of trying to be this, as something you run yourself to protect against disappearing cloud services:


This gave me a harebrained idea for a browser extension that drops every web page you visit into a private Lucene database.

You might want to look into WorldBrain's Memex extension. I think it crawls every page you visit into IndexedDB so you can do full-text searches later.

I’ve been building essentially that, but more than just websites: https://apse.io. I am really pleased with how it works - just released v2.0!

I was kind of having the same idea, except any site you bookmark gets added to a personal web crawler, and then you have your own search site for things you find interesting.

This exists on iPhone/iPad! DevonThink2Go, local crawl/search + optional encrypted sync over self-hosted WebDAV or public cloud services. Can also take/search markdown notes.

No mentions of https://tiddlywiki.com/?

TiddlyWiki was great until all the browsers stopped supporting writing to local files. Now saving changes is a pain, which is making me look for something else.

Maybe this solves your problems? It creates a database in your browser's LocalStorage.


And the database can be set up to sync with a self-hosted CouchDB instance.

While that works, the original appeal of Tiddlywiki was that you could open a file in your browser, type away, and save naturally. Once you get into "self-hosted", you just have a regular old wiki. I used it everyday for several years but gave up once the transition happened. I keep trying to go back, but it just can't compete with files edited in a text editor and stored locally.

I used to run it on Node, but then I switched to Notion; I prefer the Notion way much more. However, I'm looking to move away from Notion to something self-hosted.

My coworker used to swear by TW and every once in a while when I read browser release notes I have wondered if it still works for him.

It sounds like you're saying that nobody bothered to modify it to use LocalStorage, which is a surprise.

I run it on a webdav server like caddy, there is also this ruby script you can start. Works ok, set it up once and you forget about it.

Everything I write about (journal + other things, task lists and what not) is written in plain markdown files currently (about to move it to TiddlyWiki, one of these days...) and to get search, I just use `the-silver-searcher` which searches the entire directory of my files. Simple and scalable (got around 9k documents by now)
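For anyone without the-silver-searcher installed, the same "search the whole directory of markdown files" idea can be approximated in a few lines of Python. This is a plain substring scan, not how `ag` works internally and nowhere near as fast; the function name `search_notes` and the `~/notes` path are placeholders:

```python
from pathlib import Path


def search_notes(root, query):
    """Yield (path, line_number, line) for each line containing query.

    Matching is case-insensitive; every *.md file under root is scanned.
    """
    needle = query.lower()
    for path in sorted(Path(root).rglob("*.md")):
        # errors="replace" keeps the scan going if a file has odd bytes in it.
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line.lower():
                    yield path, lineno, line.rstrip("\n")


if __name__ == "__main__":
    # "~/notes" is a placeholder; point it at wherever your files live.
    for path, lineno, line in search_notes(Path.home() / "notes", "backup"):
        print(f"{path}:{lineno}: {line}")
```

A linear scan like this has no index to corrupt or rebuild, which is most of the appeal of the plain-files approach.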

My eternal frustration in this space is that my employer has strict firewalls, web filtering and data-loss prevention software, and remote access is over Citrix with no copy-paste. Consequently, if I build a knowledge base, it is stuck inside the firewall. Equally, if I build it outside, I can't use it at work.

Why don't you host it on an ec2 instance? They won't be blocking amazon ip. Where do you work?

There's definitely no external access without going through the web proxy. And a new uncategorized site would be blocked by the web proxy - and it wouldn't pass review.

I work in a highly regulated industry...

Any workaround would be grounds for termination. So there's no point to my comment really - just curious if anyone else is in the same boat.

Can't you use a personal mobile device with a 4G connection to access a knowledge database outside the firewall, without moving any data across your employer's network? As long as the data you wish to read/write is not sensitive in itself, and it's mostly just plaintext notes that you can read/write from any device, I don't see the issue with that.

Secure physical sites (e.g. some military bases) may require you to place personal electronics in a lockable cabinet. You have to use a paper notepad if you don't have a device certified by the local security team. Using a non-certified device can result in being evicted or prosecuted.

> just curious if anyone else is in the same boat.

Yes. Spent a few years building a knowledge base in an offline application. Now I have a new job that doesn't allow me to install software. So all my notes stay at home.

Maybe one day I'll make an export to PDF and use that at work. But I will miss the editing functionality.

On one hand I understand the need to prevent data leaks, malware, etc. On the other hand -- am I supposed to memorize literally everything? Or search everything on Stack Exchange over and over again, hoping that the explanation is there, is correct, and is up to date? Figuring out stuff and making simple notes is my strength. Memory is my weakness. This sucks.

I wonder if it would have been better to install some wiki software on my private website and build my knowledge base online. Reading unknown webs is not forbidden in my current job (the web filter apparently uses blacklists). But maybe in my next job it will be, who knows.

I won't tell you how to circumvent security, but a simple solution seems to be getting your domain 'categorized'. Keep it clean (no malware, not wordpress), strong transport security, and maybe behind basic auth. This should allow you access to the site as needed, from your work environment.

TLS is MITMd, DLP will run on the decrypted content.

I’m in the same situation, but I can use personal devices at work, and my personal interests and work interests don’t share much overlap, so in practice it’s not much of a problem.

Every now and then I find an external link I want to share with myself, but I’ll just send myself an email with it.

Hey, author here. Happy to answer any questions!

Is there a way we can subscribe to the blog?

> Ideally I want to be able to do fulltext realtime search over anything that I ever had in my visual field. Not even necessarily text, but audio and video as well.

Where I find all these systems break down is recall. They're designed for someone who can recall a word or phrase that was in the content. I can usually recall "It was about X" or "The document/web page/image looked like Y". But an actual word? The author's name? Not a chance.

While a more difficult problem, if the tool is to live up to the "Future" section of this page, it's got to go a long way beyond what's in the source data, to what's thought of by the user.

This topic comes up again and again. I collected some notes about this here: https://github.com/albertz/wiki/blob/master/personal-knowled...

E.g. one piece of software I started to use is nvALT, via: https://www.macstories.net/links/organizing-everything-with-...

But I'm nowhere near a perfect and complete solution yet...

The successor to nvALT, nvUltra, is currently in private beta. I'm looking forward to its release!


And still Mac only :(

I have fewer notes after getting fed up with notes. Managing notes is really time-consuming, so I manage logs instead. I just log everything I do, with each task on its own new page. It's append-only.

For notes that I mutate, I just keep a personal website, and I try to keep it as a cheatsheet, as compact as possible, so I don't need to manage it.

So: an append-only log in Quip, with a new folder for each task.

Mutative cheatsheet super compact pages in personal website.

Oh, and for quick snippets: Alfred.

That's it.

I self host a confluence server. All my content is available to me offline. Might be a bit overkill, but I have knowledge bases for all my work. If there is a web page I come across I can just copy/paste the content into a new post. Everything is searchable. It really is great. They offer a starter license, which is $10 per year:


I use a few things for this (on windows):

- For notes, OneNote, though I'm always on the lookout for an alternative with decent UI and syncing, but using open file formats. Full text search simple enough with this. Code formatting isn't good but there's an addin where the free version formats it as it was copied.

- To search local files, Voidtools Everything is great. Searching instantly by filename is a real time saver.

- If I want full text search of a large base of documents, I used Likasoft Archivarius which cost me $30 about 10 years ago and is still handy. It's the only local desktop search I've found that supports full text indexing of tons of formats like outlook .ost, etc and can look inside archive files

- For backups I've continued to stick with external drives, mirrored periodically with Freefilesync. 3 copies - one as master, two mirrors ensuring one is offsite.

Take a look at Standard Notes. It is privacy focussed with encryption but has markdown and code editors and can be self hosted

Thanks, looks interesting. I find Markdown a great idea in theory, but have found very few examples of wysiwyg markdown editors that work 'as you'd expect'. For me that means:

- Bullets with multiple indents going from 1 to 1) a. etc

- Table handling

- Usual formatting like heading levels etc

And there seem to be lots of flavours of markdown too, just to add another layer to things.

I love SN and have been using it for a few months. No complaints thus far. I will add that for me, the Simple Markdown Editor module is better than the other one (forget the name).

I wish things like https://piggydb.net/ had more momentum or competitors... personal knowledge databases seem to be such a tough niche to tackle.

Edit: since there is a new project, here are more details from years back: http://www.linux-magazine.com/Issues/2014/160/Workspace-Pigg...

If you're into this sort of thing, you might want to checkout Roamresearch: https://roamresearch.com/

Seems similar to ZIM (https://zim-wiki.org/), except proprietary/hosted? I've just started using zim - can someone more experienced compare the two?

We could fill a whole internet with each personal method for storing, classifying and accessing. We're missing an OS for our own memory.

I wish there was a method for printing QR codes or URLs on paper that would be the reverse of scanning a QR code. This would make it easy to write complex URLs in your paper diary/techo/commonplace book/notebook.

I keep my knowledge in a private Git repo managed by https://www.gitbook.com/. So far it works out great. Going to make it public soon.

That's cool, please drop me an email (or just share here?) when you release it, I'm collecting (https://beepb00p.xyz/tags.html#exobrain) other people's wikis!

Thanks for making your notes public. It inspired some further thinking for my org-mode setup.

Tried the custom webapp and DB solution for a while. Wasn't publicly portable enough (for others to copy paste/export easily).

Currently using markdown files in git repos.

The holy grail [https://beepb00p.xyz/pkm-search.html#future] of this really resonated with me and fully mirrors what I've been thinking about the past few months. In my observations, it's input capture, information organization, and subsequent retrieval:

Information Capture:

Input Capture - You're going to have all-encompassing tracking and recording of all activity, but you want configurable privacy over the extent to which your daily conversations and observations of external things you encounter and are exposed to are recorded. Capturing input needs to be holistic and incorporate all properties of encounters and new information.

Potential sources of input:

- Vision: point-of-view recording; see Snapchat Spectacles, etc. as primitive examples.

- Audio (voice notes and multi-party conversations): voice calls, video, etc. and other forms of audio transmission where there is more than a single party in the interaction.

- Digital interactions: you will need to keep track of the web pages you visit and at what times, conversations you see on Twitter, etc.

Properties and cues must be extrapolated from the information that is captured on input. In the case of audio, transcriptions are sufficient for search and retrieval purposes. However, since video is a visual medium, it includes significantly more properties that need to be accounted for.

The aim here is to identify sufficient data points (cues) and represent them in such a way that you can easily search across things you have encountered even when you only recall a certain property or cue. This is because human beings tend to remember things in fragments: for instance, you might remember a certain color on a page that you visited within the last 6 months and nothing else.

So long as you are capturing sufficient input and actions then you should be able to go back to any given point in time. How and where are you going to store this information? Storing everything is going to be a large amount of data. The essence of the information and context must be preserved. If you want to wind back to an arbitrary position in time with the original context intact, you want to retain as much as you can in the most efficient manner possible, so determining which data points to retain is essential. (Once the content structure has been figured out, this will be viable).

Examples of Primary Cues:

- Time: humans generally keep track of things in a linear time-based fashion.

- Color: invokes emotion and is memorable.

- Physical Location: the efficiency of information retrieval is highly influenced by the location at which it is originally synthesized, encountered, and stored.

- Keywords: the default conventional mode. Can and should be extracted from video/imagery and audio.

- Imagery: search for images based on their contents and ambience.

Potential Secondary Cue — Music - see historical associated input and actions while certain music was played. (What else?)

Meta Cues — Subjects - Automated tagging of keywords/encountered content.

Any combination of these queries is possible, but ultimately the killer feature is the ability to backtrack through time to find a certain piece of information that is made available thanks to the always-on recorded nature of your interactions with the physical and digital worlds combined.
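As a toy illustration of combining cues, here is a sketch where each captured item carries a time, a location, some colors, and some keywords, and a query filters on whichever cues you happen to remember. The `Memory` record, the `recall` function, and the sample data are all invented for this example; extracting the cues from real video or audio is the hard part and is not shown.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Memory:
    """One captured item plus the cues extracted from it (hypothetical schema)."""
    when: datetime
    location: str
    colors: FrozenSet[str]
    keywords: FrozenSet[str]
    summary: str


def recall(
    memories: Iterable[Memory],
    *,
    after: Optional[datetime] = None,
    before: Optional[datetime] = None,
    color: Optional[str] = None,
    location: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[Memory]:
    """Return memories matching every cue that was supplied, oldest first.

    Cues left as None are ignored, so any combination can be used.
    """
    hits = []
    for m in memories:
        if after is not None and m.when < after:
            continue
        if before is not None and m.when > before:
            continue
        if color is not None and color not in m.colors:
            continue
        if location is not None and m.location != location:
            continue
        if keyword is not None and keyword not in m.keywords:
            continue
        hits.append(m)
    return sorted(hits, key=lambda m: m.when)


# Made-up sample data.
MEMORIES = [
    Memory(datetime(2019, 9, 3, 14, 0), "office", frozenset({"orange"}),
           frozenset({"backup", "nas"}), "blog post about home NAS builds"),
    Memory(datetime(2019, 12, 20, 22, 15), "home", frozenset({"orange", "black"}),
           frozenset({"search", "notes"}), "forum thread on note search"),
    Memory(datetime(2020, 1, 10, 8, 30), "train", frozenset({"blue"}),
           frozenset({"search"}), "podcast episode on indexing"),
]

if __name__ == "__main__":
    # "I remember an orange page, sometime after last October."
    for m in recall(MEMORIES, color="orange", after=datetime(2019, 10, 1)):
        print(m.when.date(), m.summary)
```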

Knowing what to store, and how, + displaying it needs to be worked on further.

http://onemodel.org, described elsewhere here and moreso at that site, tries to model arbitrary knowledge and has a vision encompassing any kind of info one wanted to be tracked (again, more at the site). (Edit: If you have possible future interest, there is an announcements list.)

I've been pondering on building something like this for a while.

For now, I've settled on sphinx because it can be easily exported to dash, and tied in to an alfred workflow for search.

I use a vim plugin called vimwiki and I export my todos and notes into HTML. Works fine for me.

I just email links, code, docs, etc. to myself with descriptive subjects and tags.

I basically live in Evernote. Will gradually transition to personal tooling.

Most stuff (links, photos, docs, etc) I just email it to myself

Is there anything for people who don't use Vim?

emacs has org-mode.

A search infrastructure for my knowledge would require access to wetware. Code I can see working.

Ya lost me at $(emacs)!
