Seconded - I love just & Justfile. Such an upgrade after trying to force things into package.json scripts. Chaining commands, optional CLI arguments, comments, simple variables, etc. Very simple and a breath of fresh air.
Also, we turned up 2,000 domains that redirect to a very shady site called happyfamilymedstore[dot]com. Stuff like avanafill[dot]com, pfzviagra[dot]com, prednisoloneotc[dot]com. These domains made it into the Tranco 100k somehow.
Lately, happyfamilymedstore has mysteriously always been in the top ~ten Google Images results for super niche bicycle parts searches I do. They seem to have ripped off an insane number of images that get reposted on their domain.
What most of them do is use WordPress exploits to get into random WordPress sites run by people who know nothing about managing a website and are on a $3/mo shared hosting account.
After they get into these random WordPress sites, they then embed links back to their sketchy site in obscure places, so that the owners of the site don't notice, but search bots do. They usually leave the WordPress site alone, but will create a user account to get back in later if WordPress patches an exploit. All of this exploiting and link adding is automated, so it is done entirely by crawlers and bots.
This is done tens of thousands or even millions of times over. All of these sketchy backlinks eventually add up, even if they are low quality, and provide higher ranking for the site they all point to.
Think of websites like mommy blogs, diet diaries, family sites, personal blogs, and random service companies (plumbers, pest control, restaurants, etc.) that had their nephew throw up a WordPress site instead of hiring a professional.
I don't mean to pick on WordPress, but it really is the most common culprit in these attacks, because so many WordPress sites are operated by people who aren't informed about basic security. Plus, WordPress is open source, so exploits get discovered by reading the source code, and attackers will sell those exploits instead of reporting them. So WordPress is in an endless cycle of chasing exploits and patching them.
You can have a separate system, even a locally running desktop app do that. You can still have a database, complex HTML templating, and image resizing! You just do it offline as a preprocessing step instead of online dynamically for each page view.
Unfortunately, this approach never took off, even though it scales trivially to enormous sites and traffic levels.
I recently tried to optimise a CMS system that was streaming photos from the database to the web tier, which then resized and even optimised them on the fly. Even with caching, the overheads were obscene: over 100 cores could barely push out 200 Mbps of content. Meanwhile a single-core VM can easily do 1 Gbps of static content!
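As a sketch of that offline-preprocessing idea (the `posts` table, its columns, and the page template are all made up for illustration): pull rows out of the database once, render them through a template, and write plain HTML files that any static server can push out at wire speed.

```python
import pathlib
import sqlite3
import string

# Hypothetical page template; a real site would use a proper templating engine.
PAGE = string.Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)

def build_site(db_path, out_dir):
    """Pre-render every post to a static HTML file, once, offline.

    Assumes a 'posts' table with slug/title/body columns.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(db_path)
    for slug, title, body in con.execute("SELECT slug, title, body FROM posts"):
        # One static file per post; re-run the whole build when content changes.
        (out / f"{slug}.html").write_text(PAGE.substitute(title=title, body=body))
    con.close()
```

Image resizing fits the same pattern: do it once in the build step, write the resized files next to the HTML, and the web tier never touches a pixel.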
Here's a rough scheme I came up with (I never implemented it, though):
1. Use GitHub Pages to serve content.
2. Use GitHub login to authenticate using just JS.
3. Use JS to implement a rich text editor and other editing features.
4. When you're done editing, your browser creates a commit and pushes it using the GitHub API.
5. GitHub rebuilds your website and a few seconds later it reflects the changes. JavaScript with localStorage can reflect the changes instantly to improve the editor experience.
6. Comments could be implemented with a fork/pull request. Of course that implies your users are registered on GitHub, so it may not be appropriate for every blog. Or just use an external commenting system.
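Step 4 above could go through GitHub's "create or update file contents" endpoint, which produces a commit in a single call. The sketch below only builds the request (owner, repo, path, and token are placeholders); a real editor would also pass the existing file's blob SHA when updating a file, and would make the call from the browser with fetch rather than from Python.

```python
import base64
import json

def build_commit_request(owner, repo, path, text, message, token):
    """Build URL, JSON body, and headers for
    PUT /repos/{owner}/{repo}/contents/{path}."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    body = {
        "message": message,
        # The contents API requires the new file content as base64.
        "content": base64.b64encode(text.encode()).decode(),
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return url, json.dumps(body), headers
```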
So, essentially a site generated with Jekyll, hosted on GitHub Pages with Utterances [0] for comments and updated with GitHub Actions.
I don’t know if https://github.dev version of Visual Studio Code supports extensions/plugins, but if so, then there is also a rich text editor for markdown ready.
All that’s left would be an instant refresh for editing.
There are plenty of places that you can go to on this planet with little to no law enforcement. Don't be surprised if you end up dead there. Handling global crime is very difficult.
I recently saw and reported one to a local business.
If you typed in the domain and visited directly, it wouldn't redirect to the scam site. But if you clicked on a link from a google search, then it would redirect.
That probably makes it harder for small website owners to notice, since they're not clicking their own Google search results.
It happens through search engine optimization (SEO): a mix of planting links, reviews, and other tactics. Think of it like this: what would you do to get people talking about your site? You'd somehow put links, conversations, reviews, quotes, etc. in front of them.
I wonder if there might be a way to map all these using t-SNE to discrete grid locations? Maybe even an autoencoder. I'd love to see what features it could pick out.
I don't see their data set though. hmmm.
maybe I'll just have to crawl it on my own if I want to do it.
You can use t-SNE (or even better, UMAP or one of its variations) to create a 2D point cloud, and then use something like RasterFairy [1] to map the 2D positions to the cells of a grid. It usually works well.
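For a feel of that grid-snapping step, here's a toy, pure-Python stand-in for RasterFairy: greedily drop each 2D point into the nearest free cell. RasterFairy solves this as a proper assignment problem, so this greedy version is only an approximation, and it assumes the grid has at least as many cells as points.

```python
import math

def snap_to_grid(points, cols, rows):
    """Assign each 2D point (x, y) to the nearest still-free grid cell.

    Returns {point_index: (col, row)}. Requires len(points) <= cols * rows.
    """
    free = {(c, r) for r in range(rows) for c in range(cols)}
    placement = {}
    # Place points in input order; a real solver optimises globally instead.
    for i, (x, y) in enumerate(points):
        best = min(free, key=lambda cell: math.hypot(cell[0] - x, cell[1] - y))
        placement[i] = best
        free.remove(best)  # each cell holds exactly one point
    return placement
```

Feeding it the 2D embedding coordinates (rescaled to roughly span the grid) gives you one image per cell, which is what the icon-map style layouts do.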
The map is pretty neat. Also see the giant analysis for the dataset at https://iconmap.io/blog. Turns out that a lot of folks have messed up their favicons.
Hey, great start. I spend half my day in CSVs and I am definitely your target audience. Most of the time I use bat, visidata or tabview. In many ways tabview is the best, though recently the project has been abandoned.
tv looks excellent. Fun name. I think if you added a couple of features it would ascend to my toolbox:
(1) scrolling (horizontal and vertical)
(2) better command line parsing. Running "tv" without stdin or arguments should produce an error/help. Running "tv xyz.csv" should read that file.
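Point (2) might look something like this with argparse (the prog name and messages are just placeholders, not tv's actual interface): accept an optional file argument, fall back to piped stdin, and error out with usage help when neither is present.

```python
import argparse
import sys

def read_input(argv=None):
    """Return CSV text from a file argument, or from stdin when piped."""
    parser = argparse.ArgumentParser(prog="tv", description="view a CSV")
    parser.add_argument("file", nargs="?", help="CSV file (defaults to stdin)")
    args = parser.parse_args(argv)
    if args.file:
        with open(args.file) as f:
            return f.read()
    # No file and no pipe: print usage and exit instead of hanging on a tty.
    if sys.stdin.isatty():
        parser.error("no input: pass a file or pipe CSV on stdin")
    return sys.stdin.read()
```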
I have worked on this problem many times, at many companies. I am working on it again, actually. Usually some combination of scoring and persisting results in CSVs for human review.
(edit: I am at a desktop now and I can say a bit more)
Here is the process in a nutshell:
1. Create a fast hashing algorithm to find rows that might be dups. It needs to be fast because you have lots of rows. This is where SimHash, MinHash, etc. come into play. I've had good luck using simhash(name) and persisting it. Unfortunately you need to measure the hamming distance between simhashes to calculate a similarity score. This can be slow depending on your approach.
2. Create a slower scoring algorithm that measures the similarity between two rows. Think about a weighted average of diffs, where you pick the weights based on your intuition about the fields. In your case you have handy discrete fields, so this won't be too hard. The hardest field is name. Start with something simple and improve it over time. Blank fields can be scored as 0.5, meaning "unknown". Hashing photos can help here too.
3. Use (1) to find things that might be dups, then score them with (2). Dump your potential dups to a CSV for human review. As another poster indicated, I've found human review to be essential. It's easy for a human to see that "Super Mario 2" and "Super Mario 3" are very different.
4. Parse your CSV to resolve the dups as you see fit.
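A minimal sketch of steps 1 and 3, assuming word-level SimHash features (real systems bucket hashes by prefix or band instead of comparing all pairs, which is O(n²) here; the distance threshold is arbitrary):

```python
import hashlib
import itertools

def simhash(text, bits=64):
    """64-bit SimHash over word features: similar texts get hashes
    that differ in only a few bit positions."""
    weights = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def candidate_pairs(names, max_dist=8):
    """Steps 1 + 3: hash every row, keep pairs whose hashes are close,
    then hand those to the slower scorer / human review."""
    hashes = [simhash(n) for n in names]
    return [(names[i], names[j])
            for i, j in itertools.combinations(range(len(names)), 2)
            if hamming(hashes[i], hashes[j]) <= max_dist]
```

The output of candidate_pairs is exactly what you'd dump to the review CSV in step 3.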
With regard to (1), I wonder: why would calculating the Hamming distance be slow? In Python you can easily do it like this:
hamming_dist = bin(a^b).count("1")
It relies on string operations, but takes ~1 microsecond on an old i5-7200U to compare 32-bit numbers. In Python 3.10 we'll get int.bit_count(), which gives the same result without this kind of trick (and a ~6x speedup on the operation, though I suspect the XOR and integer handling in Python may already account for much of the running time).
If you need to go faster, you can compute the Hamming distance with basically two assembly instructions: XOR and POPCNT. I haven't gone that low-level in a long time, but you should be able to get into the nanosecond range with those.