Seconded - I love just & Justfile. Such an upgrade after trying to force things into package.json scripts. Chaining commands, optional CLI arguments, comments, simple variables, etc. Very simple and a breath of fresh air.
Also, we turned up 2,000 domains that redirect to a very shady site called happyfamilymedstore[dot]com. Stuff like avanafill[dot]com, pfzviagra[dot]com, prednisoloneotc[dot]com. These domains made it into the Tranco 100k somehow.
Lately, happyfamilymedstore has mysteriously always been in the top ~ten Google Images results for super niche bicycle parts searches I do. They seem to have ripped off an insane number of images that get reposted on their domain.
What most of them do is use WordPress exploits to get into random WordPress sites run by people who know nothing about managing a website and are on a $3/mo shared hosting account.
After they get into these random WordPress sites, they then embed links back to their sketchy site in obscure places, so that the owners of the site don't notice, but search bots do. They usually leave the WordPress site alone, but will create a user account to get back in later if WordPress patches an exploit. All of this exploiting and link adding is automated, so it is done entirely by crawlers and bots.
This is done tens of thousands or even millions of times over. All of these sketchy backlinks eventually add up, even if they are low quality, and provide higher ranking for the site they all point to.
Think of websites like mommy blogs, diet diaries, family sites, personal blogs, and random service companies (plumbers, pest control, restaurants, etc.) that had their nephew throw up a WordPress site instead of hiring a professional.
I don't mean to pick on WordPress, but it really is the most common culprit in these attacks, because so many WordPress sites are operated by people who aren't informed about basic security. Plus, WordPress is open source, so exploits get discovered by reading the source code, and attackers will sell those exploits instead of reporting them. So WordPress is in an endless cycle of chasing exploits and patching them.
You can have a separate system, even a locally running desktop app do that. You can still have a database, complex HTML templating, and image resizing! You just do it offline as a preprocessing step instead of online dynamically for each page view.
Unfortunately, this approach never took off, even though it scales trivially to enormous sites and traffic levels.
I recently tried to optimise a CMS system that was streaming photos from the database to the web tier, which then resized and even optimised them on the fly. Even with caching, the overheads were obscene: over 100 cores could barely push out 200 Mbps of content. Meanwhile a single-core VM can easily do 1 Gbps of static content!
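As a sketch of that offline-preprocessing idea (the `posts` table, its columns, and the page template are all made up for illustration): pull rows out of the database once, render them through a template, and write plain HTML files that any static server can push out at wire speed.

```python
import pathlib
import sqlite3
import string

# Hypothetical page template; a real site would use a proper templating engine.
PAGE = string.Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)

def build_site(db_path, out_dir):
    """Pre-render every post to a static HTML file, once, offline.

    Assumes a 'posts' table with slug/title/body columns.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(db_path)
    for slug, title, body in con.execute("SELECT slug, title, body FROM posts"):
        # One static file per post; re-run the whole build when content changes.
        (out / f"{slug}.html").write_text(PAGE.substitute(title=title, body=body))
    con.close()
```

Image resizing fits the same pattern: do it once in the build step, write the resized files next to the HTML, and the web tier never touches a pixel.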
Here's a rough scheme I came up with (I never implemented it, though):
1. Use GitHub Pages to serve content.
2. Use GitHub login to authenticate using just JS.
3. Use JS to implement a rich text editor and other editing features.
4. When you're done editing, your browser creates a commit and pushes it using the GitHub API.
5. GitHub rebuilds your website and a few seconds later it reflects the changes. JavaScript with localStorage can reflect the changes instantly to improve the editor experience.
6. Comments could be implemented with a fork/pull request. Of course that implies your users are registered on GitHub, so it may not be appropriate for every blog. Or just use an external commenting system.
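Step 4 above could go through GitHub's "create or update file contents" endpoint, which produces a commit in a single call. The sketch below only builds the request (owner, repo, path, and token are placeholders); a real editor would also pass the existing file's blob SHA when updating a file, and would make the call from the browser with fetch rather than from Python.

```python
import base64
import json

def build_commit_request(owner, repo, path, text, message, token):
    """Build URL, JSON body, and headers for
    PUT /repos/{owner}/{repo}/contents/{path}."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    body = {
        "message": message,
        # The contents API requires the new file content as base64.
        "content": base64.b64encode(text.encode()).decode(),
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return url, json.dumps(body), headers
```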
So, essentially a site generated with Jekyll, hosted on GitHub Pages with Utterances [0] for comments and updated with GitHub Actions.
I don’t know if https://github.dev version of Visual Studio Code supports extensions/plugins, but if so, then there is also a rich text editor for markdown ready.
All that’s left would be an instant refresh for editing.
There are plenty of places that you can go to on this planet with little to no law enforcement. Don't be surprised if you end up dead there. Handling global crime is very difficult.
I recently saw and reported one to a local business.
If you typed in the domain and visited directly, it wouldn't redirect to the scam site. But if you clicked on a link from a google search, then it would redirect.
That probably makes it harder for small website owners to notice, since they're not clicking their own Google search results.
It happens through search engine optimization (SEO): a mix of planting links, reviews, and other tactics. Think of it like this: what would you do to get people talking about your site? You'd somehow put links, conversations, reviews, quotes, etc. in front of them.
I wonder if there might be a way to map all these using t-SNE to discrete grid locations? Maybe even an autoencoder. I'd love to see what features it could pick out.
I don't see their data set though. hmmm.
maybe I'll just have to crawl it on my own if I want to do it.
You can use t-SNE (or even better, UMAP or one of its variations) to create a 2D point cloud, and then use something like RasterFairy [1] to map the 2D positions to the cells of a grid. It usually works well.
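For a feel of that grid-snapping step, here's a toy, pure-Python stand-in for RasterFairy: greedily drop each 2D point into the nearest free cell. RasterFairy solves this as a proper assignment problem, so this greedy version is only an approximation, and it assumes the grid has at least as many cells as points.

```python
import math

def snap_to_grid(points, cols, rows):
    """Assign each 2D point (x, y) to the nearest still-free grid cell.

    Returns {point_index: (col, row)}. Requires len(points) <= cols * rows.
    """
    free = {(c, r) for r in range(rows) for c in range(cols)}
    placement = {}
    # Place points in input order; a real solver optimises globally instead.
    for i, (x, y) in enumerate(points):
        best = min(free, key=lambda cell: math.hypot(cell[0] - x, cell[1] - y))
        placement[i] = best
        free.remove(best)  # each cell holds exactly one point
    return placement
```

Feeding it the 2D embedding coordinates (rescaled to roughly span the grid) gives you one image per cell, which is what the icon-map style layouts do.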
The map is pretty neat. Also see the giant analysis for the dataset at https://iconmap.io/blog. Turns out that a lot of folks have messed up their favicons.
Hey, great start. I spend half my day in CSVs and I am definitely your target audience. Most of the time I use bat, visidata or tabview. In many ways tabview is the best, though recently the project has been abandoned.
tv looks excellent. Fun name. I think if you added a couple of features it would ascend to my toolbox:
(1) scrolling (horizontal and vertical)
(2) better command line parsing. Running "tv" without stdin or arguments should produce an error/help. Running "tv xyz.csv" should read that file.
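Point (2) might look something like this with argparse (the prog name and messages are just placeholders, not tv's actual interface): accept an optional file argument, fall back to piped stdin, and error out with usage help when neither is present.

```python
import argparse
import sys

def read_input(argv=None):
    """Return CSV text from a file argument, or from stdin when piped."""
    parser = argparse.ArgumentParser(prog="tv", description="view a CSV")
    parser.add_argument("file", nargs="?", help="CSV file (defaults to stdin)")
    args = parser.parse_args(argv)
    if args.file:
        with open(args.file) as f:
            return f.read()
    # No file and no pipe: print usage and exit instead of hanging on a tty.
    if sys.stdin.isatty():
        parser.error("no input: pass a file or pipe CSV on stdin")
    return sys.stdin.read()
```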
I have worked on this problem many times, at many companies. I am working on it again, actually. Usually some combination of scoring and persisting results in CSVs for human review.
(edit: I am at a desktop now and I can say a bit more)
Here is the process in a nutshell:
1. Create a fast hashing algorithm to find rows that might be dups. It needs to be fast because you have lots of rows. This is where SimHash, MinHash, etc. come into play. I've had good luck using simhash(name) and persisting it. Unfortunately you need to measure the hamming distance between simhashes to calculate a similarity score. This can be slow depending on your approach.
2. Create a slower scoring algorithm that measures the similarity between two rows. Think about a weighted average of diffs, where you pick the weights based on your intuition about the fields. In your case you have handy discrete fields, so this won't be too hard. The hardest field is name. Start with something simple and improve it over time. Blank fields can be scored as 0.5, meaning "unknown". Hashing photos can help here too.
3. Use (1) to find things that might be dups, then score them with (2). Dump your potential dups to a CSV for human review. As another poster indicated, I've found human review to be essential. It's easy for a human to see that "Super Mario 2" and "Super Mario 3" are very different.
4. Parse your CSV to resolve the dups as you see fit.
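A minimal sketch of steps 1 and 3, assuming word-level SimHash features (real systems bucket hashes by prefix or band instead of comparing all pairs, which is O(n²) here; the distance threshold is arbitrary):

```python
import hashlib
import itertools

def simhash(text, bits=64):
    """64-bit SimHash over word features: similar texts get hashes
    that differ in only a few bit positions."""
    weights = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def candidate_pairs(names, max_dist=8):
    """Steps 1 + 3: hash every row, keep pairs whose hashes are close,
    then hand those to the slower scorer / human review."""
    hashes = [simhash(n) for n in names]
    return [(names[i], names[j])
            for i, j in itertools.combinations(range(len(names)), 2)
            if hamming(hashes[i], hashes[j]) <= max_dist]
```

The output of candidate_pairs is exactly what you'd dump to the review CSV in step 3.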
With regard to (1), I wonder: why would calculating the Hamming distance be slow? In Python you can easily do it like this:
hamming_dist = bin(a^b).count("1")
It relies on string operations, but takes ~1 microsecond on an old i5-7200U to compare 32-bit numbers. In Python 3.10 we'll get int.bit_count(), which gives the same result without this kind of trick (and a ~6x speedup on the operation, though I suspect the XOR and integer handling in Python may already account for much of the running time).
If you need to go faster, you can compute the Hamming distance with basically two assembly instructions: XOR and POPCNT. I haven't gone that low-level in a long time, but you should be able to get into the nanosecond range with those.