Hacker News
Show HN: A Python Script to Download Thousands of Wallpapers at Once (github.com/geekspin)
27 points by geekspin on April 23, 2018 | 24 comments

Side question - are there any websites like wallhaven, but with fewer people and less anime? I'm thinking of the type of content Chromecast uses, or /r/TechnologyPorn.

Have you seen http://interfacelift.com/ ?

What about https://unsplash.com/ ?

You might want to make the downloading part async - it's often faster to download 5 images at once than to download them one after another.
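For example, here's a minimal asyncio sketch of concurrent downloads; `download_image` is a hypothetical stand-in that simulates the network request with a short sleep rather than a real HTTP call:

```python
import asyncio
import time

async def download_image(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request;
    # the sleep simulates ~100 ms of network latency.
    await asyncio.sleep(0.1)
    return url

async def download_all(urls):
    # Start every download at once and wait for all of them together.
    return await asyncio.gather(*(download_image(u) for u in urls))

urls = [f"https://example.com/wall{i}.jpg" for i in range(5)]
start = time.perf_counter()
results = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - start
# The five simulated downloads overlap, so the total time is
# close to one download's latency rather than five times it.
```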

In the choices section you use input("text") properly in one place but not in others. You decode the codes in a couple of different ways; the dict approach is nicer, but consistency matters too. Also, I'm not sure you handle bad input (default to all?)...
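A sketch of the dict approach with a fallback for bad input - the codes and category names here are made up for illustration, not taken from the script:

```python
# Hypothetical mapping of menu codes to categories.
CATEGORIES = {"1": "general", "2": "anime", "3": "people"}

def decode_choice(raw: str) -> str:
    # dict.get() with a default cleanly handles unexpected input.
    return CATEGORIES.get(raw.strip(), "all")

decode_choice("2")       # a valid code maps to its category
decode_choice("banana")  # anything unrecognised defaults to "all"
```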

Personally, I'd pull the meat out of the loops in main() into functions - GetImageList() and GetImage(). That code is relatively complex, so it would be easier to read and to spot errors in those bits in isolation.
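Roughly like this - the function names, the regex, and the URL pattern are illustrative guesses at what the script does, not taken from it:

```python
import re

def get_image_list(page_html: str) -> list[str]:
    # Pull wallpaper IDs out of one search-results page.
    return re.findall(r'wallpaper/(\d+)', page_html)

def get_image_url(image_id: str) -> str:
    # Build the full-size URL for a single wallpaper ID.
    return ("https://wallpapers.wallhaven.cc/wallpapers/full/"
            f"wallhaven-{image_id}.jpg")

html = '<a href="https://alpha.wallhaven.cc/wallpaper/617853">...</a>'
ids = get_image_list(html)
url = get_image_url(ids[0])
```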

Unless you're in dire need of thousands of wallpapers (most of which you're going to delete anyway), it's better not to hammer the website. I'd even limit the download rate.
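A simple way to throttle, assuming a `fetch` callable that does the actual request (here replaced by a dummy function so the sketch is self-contained):

```python
import time

def polite_map(fetch, urls, delay=1.0):
    # Hypothetical throttle: pause `delay` seconds between requests
    # so the server isn't hammered by a tight download loop.
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results

# Dummy fetch; a real one would do the HTTP request and save the file.
fetched = polite_map(lambda u: u.upper(), ["a", "b", "c"], delay=0.01)
```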

Agreed - I looked at the code hoping for a good example of async in Python, as suggested by the "at once" in the title.

Using i and j in loops makes code much less self-explanatory, especially when you could use "page_index" and "image_index" instead.

Even better, replacing

    for i in range(len(imgid)):

and similar lines with:

    for i, img in enumerate(imgid):

would allow one to get rid of all these list accessors.

It looks like this downloads them one-by-one instead of downloading at the same time.

Is there a simple Python equivalent to Java's .parallelStream().forEach() that would allow these calls to easily be run in parallel?

Sure thing...

    from multiprocessing import Pool

    with Pool(8) as p:
        p.map(function, sequence)


I see that you explicitly specify 8 as the number of processes here. In Java, parallelStream() will pick a sane default for you if you haven't previously specified (based on the number of available processors, I believe). Is something like that possible in Python?

The ThreadPool class picks a sane default (number of cores), but I believe it uses Python threads instead of processes.
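Yes - if you omit the size, both `Pool()` and `ThreadPool()` default to `os.cpu_count()` workers, much like `parallelStream()`'s default. A quick sketch using the thread-backed pool (which, unlike `Pool`, can also map a lambda, since nothing needs pickling):

```python
import os
from multiprocessing.pool import ThreadPool

# With no argument, the pool sizes itself to os.cpu_count().
with ThreadPool() as pool:
    squares = pool.map(lambda x: x * x, range(8))

workers = os.cpu_count()  # the default the pool just used
```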

You could just use threads here too - CPython releases the GIL during IO.

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor() as executor:
        executor.map(function, sequence)

Every function has side effects. Could've written the whole thing in one function. Also, PEP8.

One function? Could've written the whole thing in one `wget` call.

teach me how.

Here's a start:

  for x in $(curl -s https://alpha.wallhaven.cc/random | pcregrep -o1 "https://alpha.wallhaven.cc/wallpaper/(\d+)" | sort | uniq) ; do wget "https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-$x.jpg" ; done

bash, curl, pcregrep, sort, uniq, and wget are not what I'd call "one wget call".

Hardcore bash users would probably call that 'easy'. I have someone like that on my team - holy crap, some of the bash stuff they can come up with.

Hardcore bash sure sounds like fun, but when things start getting too big or messy I usually find a Python script with some `subprocess` [0] tricks to be way easier on the eye.

[0]: https://docs.python.org/3/library/subprocess.html
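For example, a minimal `subprocess.run` call that captures a command's output (using `echo` as a stand-in for something like `wget`):

```python
import subprocess

# Run an external command and capture its stdout as text.
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
output = result.stdout.strip()
```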

I hope none of that stuff makes it into your main codebase - at least not if there's even a remote chance it will need maintenance by someone other than the person who wrote it. Which is almost certainly the case, unless it's temporary/throwaway code.

I guess wget has a useful spidering function that could probably page through the website's search results, downloading all the preview and full-size images. You'd have to do the login bit as a separate call and fetch the first search page's URL yourself?

All code has side effects. Splitting code into appropriate functions makes those side effects much more manageable.
