Well, CT logs are a data dump; they are not searchable. Ingesting all that data in near-real time and making it searchable in a useful and fast way (especially with wildcards) is actually quite challenging!
Have you considered adding a monitoring feature where a user can enter a domain to watch and then be notified whenever a "similar" domain comes across the ingestion pipeline?
This would be useful for early detection of potential impersonation/typo-squatting domains typically used for phishing/scams.
Something as simple as a configurable Levenshtein distance/Jaro-Winkler similarity check across the CN and SANs of all new certs, maybe? (The user could tune the threshold to control how "noisy" they want their feed to be.)
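To make it concrete, here's a rough Python sketch of the kind of check I mean. It's pure stdlib; the threshold, the monitored list, and the example names are all made up for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def flag_similar(cert_names, monitored, max_distance=2):
    """Yield (cert_name, monitored_domain, distance) for close matches."""
    for name in cert_names:                   # CN + SAN entries of a new cert
        for domain in monitored:
            d = levenshtein(name.lstrip("*."), domain)  # drop wildcard prefix
            if 0 < d <= max_distance:         # 0 = exact match, skip those
                yield name, domain, d

# Example: "paypa1.com" is 1 edit away from "paypal.com".
for hit in flag_similar(["paypa1.com", "example.org"], ["paypal.com"]):
    print(hit)  # ('paypa1.com', 'paypal.com', 1)
```

`max_distance` is the "noisiness" knob: 1 catches only single-character swaps, while 3+ will start matching unrelated domains.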
I also noticed you are ingesting/storing flowers-to-the-world.com certs. Not sure what stage of optimization you are at, but blacklisting/ignoring these certs in my ingestion pipeline helped me avoid storing unnecessary data.
I'm not sure, but I believe that domain is used by Google internally for testing purposes.
For example, if you search for "google", it returns 120k+ results, and these useless results are at the front.
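For what it's worth, the filter can be conceptually as small as this sketch (the deny-list contents and the cert-name shape are just illustrative):

```python
DENYLIST = {"flowers-to-the-world.com"}

def denied(name: str) -> bool:
    """True if `name` or any parent domain of it is on the deny-list."""
    labels = name.lstrip("*.").split(".")   # drop wildcard prefix first
    return any(".".join(labels[i:]) in DENYLIST for i in range(len(labels)))

def should_store(cert_names: list[str]) -> bool:
    """Drop a cert only when every name on it is deny-listed."""
    return not all(denied(n) for n in cert_names)

print(should_store(["flowers-to-the-world.com"]))  # False -> skip this cert
print(should_store(["example.org"]))               # True  -> store it
```

Only skipping certs where *every* name is deny-listed means a cert that mixes a deny-listed name with a real one still gets stored.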
> I also noticed you are ingesting/storing flowers-to-the-world.com certs. Not sure what stage of optimization you are at, but blacklisting/ignoring these certs in my ingestion pipeline helped me avoid storing unnecessary data.
The goal is to have something exhaustive, so I'll keep them. But you are right that I probably should not put them at the front.
Not sure how important it is, though, as these results shouldn't match many queries.
I am not using certstream, as we'd lose data on the first network error. The way this is designed is more "rsync for CT logs" than a stream => storage system.
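To illustrate the "rsync for CT logs" idea, here's a rough sketch (not the actual implementation): persist a cursor, fetch entries in batches, and only advance the cursor once a batch is stored, so a network error just means re-fetching. The log URL, cursor file, batch size, and `store` function below are placeholders; the /ct/v1/get-sth and /ct/v1/get-entries endpoints come from RFC 6962:

```python
import json
import pathlib
import requests

LOG = "https://ct.googleapis.com/logs/argon2024"  # placeholder log URL
CURSOR = pathlib.Path("cursor.json")              # placeholder cursor file
BATCH = 256                                       # placeholder batch size

def store(entries) -> None:
    ...  # hypothetical: decode the entries and write them to your database

def load_cursor() -> int:
    return json.loads(CURSOR.read_text())["next"] if CURSOR.exists() else 0

def save_cursor(next_index: int) -> None:
    CURSOR.write_text(json.dumps({"next": next_index}))

def sync_once() -> None:
    # get-sth tells us how far the log currently extends (RFC 6962, 4.3)
    sth = requests.get(f"{LOG}/ct/v1/get-sth", timeout=30).json()
    tree_size, start = sth["tree_size"], load_cursor()
    while start < tree_size:
        end = min(start + BATCH, tree_size) - 1
        resp = requests.get(f"{LOG}/ct/v1/get-entries",
                            params={"start": start, "end": end}, timeout=30)
        entries = resp.json()["entries"]  # the log may return fewer than asked
        if not entries:
            break
        store(entries)
        start += len(entries)
        save_cursor(start)  # advance only after the batch is durably stored
```

A stream-first design has no such cursor to fall back on: anything dropped during a disconnect is simply gone unless you re-walk the log anyway.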
How strange. I just tried this out and I see two unauthorised subdomains, one of them an actual "spam" website. However, I don't even know how to delete a subdomain that doesn't show up in my domain registrar or Cloudflare!