I'm building a personal search engine/document management system that uses tags similar to how del.icio.us worked. URLs and screenshots can be saved via the browser, or by instructing the system to crawl a page (which gets done with Firefox/webdriver). It's like a split-brained version of the Grub crawler. It also supports uploading PDFs and images.
Tags, objects, labels, synthesized commentary, etc. are provided by machine learning models and GPT3. Eventually the pipelines will be customizable, so running a plant identification model will be possible. Full text search and analytics are provided via a customized Solr deployment manager. I've built a unique UI for it based on my original cut of a simple timeseries interface at Loggly. I love using it, but have no idea if others will want to pay for it. I seriously hate ads, trackers and user privacy violations.
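To make the Solr piece concrete, here's a minimal sketch of the kind of full-text query such a system might issue. The core name ("documents"), field names, host, and sort field are illustrative assumptions, not the actual deployment's schema.

```python
# Hypothetical sketch of a full-text Solr search with tag filters.
# Core name, field names, and host are assumptions.
from urllib.parse import urlencode

def build_solr_query(text: str, tags=None, rows: int = 10) -> str:
    """Build a Solr select URL for full-text search, optionally filtered by tags."""
    params = [
        ("q", text),
        ("defType", "edismax"),            # forgiving full-text query parser
        ("qf", "title description body"),  # assumed searchable fields
        ("rows", rows),
        ("sort", "updated desc"),          # newest documents first (assumed field)
    ]
    for tag in (tags or []):
        params.append(("fq", f"tags:{tag}"))  # each tag becomes a filter query
    return "http://localhost:8983/solr/documents/select?" + urlencode(params)

url = build_solr_query("tagging", tags=["HackerNews"])
```

Keeping tags as `fq` filter queries (rather than folding them into `q`) lets Solr cache each tag filter independently across searches.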
merry-zebra|> !crawl https://news.ycombinator.com/item?id=31459103
merry-zebra|> Please wait while I index https://news.ycombinator.com/item?id=31459103.
merry-zebra|> Site has been indexed. An image of the site will be added in ~10 seconds.
merry-zebra|> ...
merry-zebra|> updated 2022-05-21T18:55:06Z
merry-zebra|> ID UmXyyk3tZJdGZW4uv
merry-zebra|> title What Happened to Tagging? (2019) | Hacker News
merry-zebra|> description The article discusses the potential reasons why "tagging" (i.e. adding labels to content for organizational purposes) has declined in popularity in recent years, despite its usefulness.
merry-zebra|> URL https://news.ycombinator.com/item?id=31459103
merry-zebra|> Tags #What, #Happened, #Tagging, #2019, #HackerNews, #News
merry-zebra|> ...
merry-zebra|> To search me for the document, click on one of the action links.
system=> Do you have any comments about this webpage, @merry-zebra?
merry-zebra|> I find tagging to be extremely useful for organizing content. I think the decline in popularity is likely due to the fact that it can be time consuming to tag everything, and people are often lazy. However, I think it is worth the effort to tag things, as it makes it much easier to find what you're looking for later on.
I've built something similar for my crawler at biztoc.com — it does OpenGraph extraction, body detection, summarization, tag & entity extraction, sentiment analysis, oEmbed, stock symbol detection, screenshots & favicons, etc.
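For anyone curious what the OpenGraph step looks like, here's a minimal sketch using only the standard library; a production crawler would likely use a real HTML library, and this is not biztoc's actual code.

```python
# Minimal OpenGraph extraction sketch: collect <meta property="og:...">
# tags into a dict. Stdlib-only for illustration.
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Gather og:* meta tags while parsing."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.og[prop[3:]] = a["content"]  # strip the "og:" prefix

def extract_opengraph(html: str) -> dict:
    p = OpenGraphParser()
    p.feed(html)
    return p.og

doc = ('<head><meta property="og:title" content="What Happened to Tagging?">'
       '<meta property="og:type" content="article"></head>')
og = extract_opengraph(doc)  # {'title': 'What Happened to Tagging?', 'type': 'article'}
```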
Great job on this! It looks fantastic and I think you'll do well. I like how you moved to token use for logins. Passwords are dumb.
I thought about these types of features for Mitta.us (which is NOT done, but operational), but it was too much work. Glad I put it off, because you did a much better job.
What's the best way to reach you? I am building something with a lot of the same ideas and would love to talk shop. Trying to network with others in the collaborative search/organization/knowledge space.
Thanks! I use GPT3 to synthesize a title and description from the URL, and to generate a description when the site simply lacks one. I use webdriver running Firefox to image the site. Some DOM information that typically isn't blocked can also be pulled, but that isn't implemented yet.
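As a sketch of the imaging step: the system drives Firefox through webdriver, but the same shape can be shown with Firefox's own headless `--screenshot` flag, which is the simpler stand-in used below (command layout and window size are assumptions, not the actual pipeline).

```python
# Sketch of headless page imaging. The real system uses webdriver;
# this variant shells out to Firefox's built-in --screenshot flag.
import subprocess

def screenshot_cmd(url: str, out_path: str, width: int = 1280, height: int = 2000):
    """Build the headless-Firefox command without running it."""
    return [
        "firefox", "--headless",
        "--screenshot", out_path,
        f"--window-size={width},{height}",
        url,
    ]

def capture(url: str, out_path: str = "page.png") -> None:
    # Requires a local Firefox install; times out rather than hanging forever.
    subprocess.run(screenshot_cmd(url, out_path), check=True, timeout=60)

cmd = screenshot_cmd("https://news.ycombinator.com/item?id=31459103", "hn.png")
```

Separating command construction from execution keeps the pure part unit-testable even on machines without Firefox.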
My argument for these companies to allow a "scraper" like mine is that I'm adding their full URL and tags for the user, on the user's behalf. I'm not scraping URLs or doing breadth/depth crawls. I ask for a single page the user gives me, then take an image only that user can see, unless they choose to share it with someone over email or Slack.
When a site blocks "crawlers" from certain IP ranges, the user can fall back on an extension I've written for Chrome/Firefox that images the screen and uploads it. This adds the site to the index just as if they had asked the system to crawl it. I gave up on scrolling the window, however; Chrome now limits screen grabs to one every 0.5 seconds.
It also supports image uploads, so if the user prefers their own screenshotting method, they can simply upload the image. Extraction of text and synthesis of titles and descriptions can be handled by GPT3 (as well as URL synthesis from keywords, command translation and Solr query synthesis).
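The description-synthesis step might look something like the following, using the GPT-3-era Completions endpoint over plain `urllib`. The prompt wording, model name, truncation length, and token limit are illustrative assumptions, not the system's actual values.

```python
# Hedged sketch of GPT3 description synthesis via the Completions API.
# Prompt, model, and limits are assumptions for illustration.
import json
import urllib.request

API_URL = "https://api.openai.com/v1/completions"

def build_description_payload(url: str, page_text: str) -> dict:
    """Assemble a completion request asking for a one-sentence description."""
    prompt = (
        f"Write a one-sentence description of the page at {url} "
        f"based on this text:\n\n{page_text[:2000]}\n\nDescription:"
    )
    return {
        "model": "text-davinci-002",  # assumed GPT-3-era model
        "prompt": prompt,
        "max_tokens": 80,
        "temperature": 0.3,           # keep the summary close to the source
    }

def synthesize_description(api_key: str, url: str, page_text: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_description_payload(url, page_text)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"].strip()

payload = build_description_payload("https://example.com", "Tagging declined...")
```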
I'm working on training a model to classify an upload as a plain image, a web page screenshot, or a desktop screenshot.