Hacker News | renegat0x0's comments

Not a web scraper, but web crawler software. It lets you choose the crawling method (Selenium, among others) and returns data as JSON (status code, text contents, etc.).

[1] https://github.com/rumca-js/crawler-buddy


I recently tried to integrate Gmail into my app [0], and I sank too much time into it. I decided supporting Gmail is not worth it.

Gmail to SQLite describes 6 steps to get credentials working, but that was not my experience. After those 6 steps:

- Google said that my app was not published, so I published it

- Google said the app cannot be internal, because I am not a Workspace user, so it has to be an external app

- then it said I cannot use the app until it is verified

- for verification they wanted to know my domain, address, and other details

- they wanted my justification for the requested scopes

- they wanted a video explaining how the app is going to be used

- they will take some time to verify the data I provided

It all looks like a maze of settings, and asking any of my users to jump through the hoops Google requires is simply too much.

Links:

[0] https://github.com/rumca-js/Django-link-archive


The steps Google makes people jump through just for API keys are absolutely insane.

Does anybody have insight as to why it’s so bad?


Probably because if someone gets API access to your email account, it is game over. And people are careless, so some of them are going to click yes on some scammy app. And then they will blame Google for not protecting them.

Because otherwise tons of people would anonymously create API keys with extremely wide scopes for small, low-quality apps.

When those inevitably get used for nefarious purposes, Google's image suffers as a result.


Use regular old IMAP with an app password.

Don't jump through their hoops.
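A minimal sketch of the IMAP route in Python, using only the standard library. The app password is generated in the Google account's security settings and is used in place of the account password:

```python
import imaplib
import email
from email.header import decode_header

def decode_subject(raw):
    """Decode a possibly MIME-encoded Subject header into a plain str."""
    out = []
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            out.append(text.decode(charset or "utf-8", errors="replace"))
        else:
            out.append(text)
    return "".join(out)

def fetch_recent_subjects(user, app_password, n=10):
    """Return subjects of the n most recent inbox messages over IMAP."""
    with imaplib.IMAP4_SSL("imap.gmail.com") as conn:
        conn.login(user, app_password)  # app password, not the account password
        conn.select("INBOX", readonly=True)
        _, data = conn.search(None, "ALL")
        ids = data[0].split()[-n:]
        subjects = []
        for msg_id in ids:
            _, msg_data = conn.fetch(msg_id, "(RFC822.HEADER)")
            msg = email.message_from_bytes(msg_data[0][1])
            subjects.append(decode_subject(msg["Subject"] or ""))
        return subjects
```

No OAuth consent screens, no verification queue; the trade-off is that the user has to dig the app-password page out of their account settings first.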


Every year the IMAP option ("app passwords") gets buried deeper and deeper in the settings.

Indeed. Quite a hassle to enable now. Multiple requirements including 2FA.

Recently I had some problems with parsing RSS in Python with feedparser.

I decided to take some risk and write my own version.
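A hand-rolled replacement can stay small if you only need the basics. A minimal sketch with the standard library, assuming plain RSS 2.0 (no Atom, no namespaces):

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Parse an RSS 2.0 document into a list of item dicts.

    Only the fields a crawler typically needs are kept;
    all other RSS nuance is intentionally dropped.
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items
```

Unlike feedparser, `ET.fromstring` raises on malformed XML instead of limping along, so a real version needs its own error handling for feeds found in the wild.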


Please correct me if I am wrong.

What grinds my gears about the email monopoly is that you cannot easily create or integrate an email client now.

To access mail, it is no longer enough to provide a user, a password, and whatever else. There are ways to provide access without compromising security.

To access Gmail? The user needs to grant access from some bullshit console settings and such. I asked ChatGPT why Thunderbird is not required to do this, and it said Thunderbird has pregenerated keys, and big corporations can operate like that. I have not verified that. It sounded credible. So annoying.


Nice project! Good job!

Now somebody might also find interesting what I have done.

- I decided that implementing an RSS reader for the 100th time is really stupid, so naturally I wrote my own [0]

- my RSS reader is in the form of an API [1], which I use for crawling

- it can be installed via Docker. The user only has to parse JSON from the API; no need to deal with requests, browsers, or status codes

- my weapon of choice is Python. There is the Python feedparser package, but I had problems using it in parallel because of some XML shenanigans and errors

- my reader serves a crawling purpose, so I am only interested in the most basic elements, like thumbnails, so most of the RSS nuance is lost

- detects feeds from sites automatically
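Feed autodiscovery usually means scanning the page head for `<link rel="alternate">` tags pointing at RSS or Atom. A minimal sketch of that idea with the standard library (not the project's actual code):

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect feed URLs from <link rel="alternate" type=...> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # Common case only; rel can technically hold multiple tokens.
        if a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            if a.get("href"):
                self.feeds.append(a["href"])

def find_feeds(html):
    parser = FeedLinkFinder()
    parser.feed(html)
    return parser.feeds
```

The hrefs are often relative, so a real crawler would resolve them against the page URL with `urllib.parse.urljoin`.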

Links:

[0] https://github.com/rumca-js/crawler-buddy/blob/main/src/webt...

[1] https://github.com/rumca-js/crawler-buddy



While I did not put the Internet in a box, I put "search" in a box.

https://github.com/rumca-js/Internet-Places-Database

It contains Internet links, channels, etc. Work in progress; you can find various domains, channels, and more.


Some time ago I created an RSS client. RSS feeds operate as sources of data. I have extended them to be able to parse pages and collect links.

Currently I have decided that I can add "Email" as a source, so I can read not only news but also emails in my app.


I also did a thing in my bookmarking software.

I added an "advanced" button for any link, which shows a menu where you can navigate to Google Translate, the Internet Archive, schema validation, and whois pages.

With the menu I can check links and navigate easily.

https://github.com/rumca-js/Django-link-archive/blob/main/RE...
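A sketch of how such a menu's targets might be built in Python. Every URL pattern below is an illustrative assumption, not taken from the project:

```python
from urllib.parse import quote, urlparse

def helper_links(url):
    """Build 'advanced menu' style helper URLs for a bookmarked link.

    The service URL patterns here are illustrative assumptions.
    """
    encoded = quote(url, safe="")
    host = urlparse(url).netloc
    return {
        # The Wayback Machine accepts the raw URL appended to the path.
        "internet_archive": "https://web.archive.org/web/" + url,
        "google_translate": "https://translate.google.com/translate?u=" + encoded,
        "schema_validator": "https://validator.schema.org/#url=" + encoded,
        "whois": "https://www.whois.com/whois/" + host,
    }
```

The menu then just renders one anchor per entry, so adding a new helper service is a one-line change.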


I love bookmarks. I have made an RSS app with a bookmarking mechanism.

My links about bookmarks and links are below.

https://github.com/rumca-js/Django-link-archive - Django app, RSS client, simple web crawler, under construction

https://github.com/rumca-js/RSS-Link-Database - my bookmarks repository

https://github.com/rumca-js/RSS-Link-Database-2025 - RSS links for year 2025

https://github.com/rumca-js/RSS-Link-Database-2024 - RSS links for year 2024

https://github.com/rumca-js/RSS-Link-Database-2023 - RSS links for year 2023

https://github.com/rumca-js/Internet-Places-Database - I also maintain a list of domains found on the Internet

https://rumca-js.github.io/search - search that uses links maintained in zip files
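Serving search from static zip archives can work like this sketch. The archive layout here (JSON records inside a zip) is an assumption, not the project's actual format:

```python
import io
import json
import zipfile

def search_zip(zip_bytes, query):
    """Scan JSON link records stored inside a zip and return matching entries.

    Assumes each .json member holds a list of {"title": ..., "link": ...} dicts.
    """
    results = []
    query = query.lower()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue
            for entry in json.loads(zf.read(name)):
                haystack = (entry.get("title", "") + " " + entry.get("link", "")).lower()
                if query in haystack:
                    results.append(entry)
    return results
```

Keeping the data in zips means the whole "search engine" can be hosted as static files, with the scan done client-side or in a tiny script.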

https://rumca-js.github.io/music - my music library, browsable

https://rumca-js.github.io/bookmarks - my bookmarks, browsable


Damn this is internet gold. Gonna be lost in those repos for days, man.

Awesome work!

