I'm building a personal search engine/document management system that uses tags similar to how del.icio.us worked. URLs and screenshots can be saved via the browser, or by instructing the system to crawl a page (which gets done with Firefox/webdriver). It's like a split-brained version of the Grub crawler. It also supports uploading PDFs and images.
Tags, objects, labels, synthesized commentary, etc. are provided by machine learning models and GPT3. Eventually the pipelines will be customizable, so running a plant identification model will be possible. Full text search and analytics are provided via a customized Solr deployment manager. I've built a unique UI for it based on my original cut of a simple timeseries interface at Loggly. I love using it, but have no idea if others will want to pay for it. I seriously hate ads, trackers and user privacy violations.
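To make the Solr piece concrete, here's a minimal sketch of the kind of full-text query such a system might issue. The core name ("documents"), field names, host, and sort field are illustrative assumptions, not the actual deployment's schema.

```python
# Hypothetical sketch of a full-text Solr search with tag filters.
# Core name, field names, and host are assumptions.
from urllib.parse import urlencode

def build_solr_query(text: str, tags=None, rows: int = 10) -> str:
    """Build a Solr select URL for full-text search, optionally filtered by tags."""
    params = [
        ("q", text),
        ("defType", "edismax"),            # forgiving full-text query parser
        ("qf", "title description body"),  # assumed searchable fields
        ("rows", rows),
        ("sort", "updated desc"),          # newest documents first (assumed field)
    ]
    for tag in (tags or []):
        params.append(("fq", f"tags:{tag}"))  # each tag becomes a filter query
    return "http://localhost:8983/solr/documents/select?" + urlencode(params)

url = build_solr_query("tagging", tags=["HackerNews"])
```

Keeping tags as `fq` filter queries (rather than folding them into `q`) lets Solr cache each tag filter independently across searches.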
merry-zebra|> !crawl https://news.ycombinator.com/item?id=31459103
merry-zebra|> Please wait while I index https://news.ycombinator.com/item?id=31459103.
merry-zebra|> Site has been indexed. An image of the site will be added in ~10 seconds.
merry-zebra|> ...
merry-zebra|> updated 2022-05-21T18:55:06Z
merry-zebra|> ID UmXyyk3tZJdGZW4uv
merry-zebra|> title What Happened to Tagging? (2019) | Hacker News
merry-zebra|> description The article discusses the potential reasons why "tagging" (i.e. adding labels to content for organizational purposes) has declined in popularity in recent years, despite its usefulness.
merry-zebra|> URL https://news.ycombinator.com/item?id=31459103
merry-zebra|> Tags #What, #Happened, #Tagging, #2019, #HackerNews, #News
merry-zebra|> ...
merry-zebra|> To search me for the document, click on one of the action links.
system=> Do you have any comments about this webpage, @merry-zebra?
merry-zebra|> I find tagging to be extremely useful for organizing content. I think the decline in popularity is likely due to the fact that it can be time consuming to tag everything, and people are often lazy. However, I think it is worth the effort to tag things, as it makes it much easier to find what you're looking for later on.
I've built something similar for my crawler at biztoc.com — it does OpenGraph extraction, body detection, summarization, tag & entity extraction, sentiment analysis, oEmbed, stock symbol detection, screenshots & favicons, etc.
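For anyone curious what the OpenGraph step looks like, here's a minimal sketch using only the standard library; a production crawler would likely use a real HTML library, and this is not biztoc's actual code.

```python
# Minimal OpenGraph extraction sketch: collect <meta property="og:...">
# tags into a dict. Stdlib-only for illustration.
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Gather og:* meta tags while parsing."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.og[prop[3:]] = a["content"]  # strip the "og:" prefix

def extract_opengraph(html: str) -> dict:
    p = OpenGraphParser()
    p.feed(html)
    return p.og

doc = ('<head><meta property="og:title" content="What Happened to Tagging?">'
       '<meta property="og:type" content="article"></head>')
og = extract_opengraph(doc)  # {'title': 'What Happened to Tagging?', 'type': 'article'}
```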
Great job on this! It looks fantastic and I think you'll do well. I like how you moved to token use for logins. Passwords are dumb.
I thought about these types of features for Mitta.us (which is NOT done, but operational), but it was too much work. Glad I put it off, because you did a much better job.
What's the best way to reach you? I am building something with a lot of the same ideas and would love to talk shop. Trying to network with others in the collaborative search/organization/knowledge space.
Thanks! I use GPT3 to synthesize a title and description from the URL, and to generate a description when the site simply lacks one. I use webdriver running Firefox to image the site. Some DOM information that typically isn't blocked can also be pulled, but that isn't implemented yet.
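As a sketch of the imaging step: the system drives Firefox through webdriver, but the same shape can be shown with Firefox's own headless `--screenshot` flag, which is the simpler stand-in used below (command layout and window size are assumptions, not the actual pipeline).

```python
# Sketch of headless page imaging. The real system uses webdriver;
# this variant shells out to Firefox's built-in --screenshot flag.
import subprocess

def screenshot_cmd(url: str, out_path: str, width: int = 1280, height: int = 2000):
    """Build the headless-Firefox command without running it."""
    return [
        "firefox", "--headless",
        "--screenshot", out_path,
        f"--window-size={width},{height}",
        url,
    ]

def capture(url: str, out_path: str = "page.png") -> None:
    # Requires a local Firefox install; times out rather than hanging forever.
    subprocess.run(screenshot_cmd(url, out_path), check=True, timeout=60)

cmd = screenshot_cmd("https://news.ycombinator.com/item?id=31459103", "hn.png")
```

Separating command construction from execution keeps the pure part unit-testable even on machines without Firefox.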
My argument for these companies to allow a "scraper" like mine is that I'm adding their full URL and tags for the user, on the user's behalf. I'm not scraping URLs or doing breadth/depth crawls. I ask for a single page the user gives me, then take an image only that user can see, unless they choose to share it with someone over email or Slack.
When a site blocks "crawlers" from certain IP ranges, the user can fall back on an extension I've written for Chrome/Firefox that images the screen and uploads it. This adds the site to the index just as if they had asked the system to crawl it. I gave up on scrolling the window, however; Chrome now limits screen grabs to one every 0.5 seconds.
It also supports image uploads, so if the user prefers their own screenshotting method, they can simply upload the image. Extraction of text and synthesis of titles and descriptions can be handled by GPT3 (as well as URL synthesis from keywords, command translation and Solr query synthesis).
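The description-synthesis step might look something like the following, using the GPT-3-era Completions endpoint over plain `urllib`. The prompt wording, model name, truncation length, and token limit are illustrative assumptions, not the system's actual values.

```python
# Hedged sketch of GPT3 description synthesis via the Completions API.
# Prompt, model, and limits are assumptions for illustration.
import json
import urllib.request

API_URL = "https://api.openai.com/v1/completions"

def build_description_payload(url: str, page_text: str) -> dict:
    """Assemble a completion request asking for a one-sentence description."""
    prompt = (
        f"Write a one-sentence description of the page at {url} "
        f"based on this text:\n\n{page_text[:2000]}\n\nDescription:"
    )
    return {
        "model": "text-davinci-002",  # assumed GPT-3-era model
        "prompt": prompt,
        "max_tokens": 80,
        "temperature": 0.3,           # keep the summary close to the source
    }

def synthesize_description(api_key: str, url: str, page_text: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_description_payload(url, page_text)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"].strip()

payload = build_description_payload("https://example.com", "Tagging declined...")
```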
I'm working on training a model to classify an upload as a plain image, a web page screenshot, or a desktop screenshot.