
Ask HN: Best Way to Refactor This Python Method? - dhruvkar
I haven't coded in a team before and have had very few opportunities to learn from more experienced programmers. I find myself running into this a lot: large if/else blocks that have repeated partial parts.

Is there a pattern for this? I haven't been able to successfully google this sort of issue either.

This is a Selenium scraper for an ERP system we use, and I'm trying to find specific URLs given an integer or a range of integers. Couldn't reproduce it here because of the character limit.

* self.address is shorthand for browsing to a URL

* self._transfer_session() transfers the Selenium session to a Requests session.

https://pastebin.com/Ff93nKr1
======
staticautomatic
There's a lot that's unclear here but I'll try.

I'd start by splitting out the URL generation, the request, and the parsing into
separate functions.

First write a function that generates a list of all the URLs you want to
scrape. It'll take your config and number of pages as args. If the list is
going to be super long then have it return a generator.
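
Roughly this shape (the names and the URL pattern are guesses, since I can't see your config):

    def generate_urls(base_url, hold_numbers):
        """Yield (hold_no, url) pairs so the caller can iterate lazily
        instead of building a huge list up front."""
        for hold_no in hold_numbers:
            # hypothetical URL pattern; substitute whatever the ERP actually uses
            yield hold_no, f"{base_url}/holds/{hold_no}"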

Then you'd have a function for the request that takes only the URL as an arg
and returns the response content. You'll want to catch session-related
exceptions and HTTP status code issues before parsing.
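
Something like this, assuming you keep passing the requests.Session around (the timeout and error handling here are just illustrative):

    import requests

    def fetch(session, url):
        """Fetch one page and return its content, surfacing HTTP errors
        before any parsing happens."""
        resp = session.get(url, timeout=30)
        resp.raise_for_status()  # turns 4xx/5xx into exceptions you can catch
        return resp.content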

The parsing function should just take the response content and `hold_no` as
args. You should add some code to check the objects you're generating during
parsing to make sure they actually contain what you're after. For example,
your `rows` variable could end up as an empty list for a variety of reasons
that aren't exceptions but are things you'd want to catch nonetheless: maybe
the element(s) can't be found because they haven't loaded yet, the returned
HTML is malformed, or you simply made a typo in your own code. Under the
hood, lxml is always going to return an empty list when it can't find
something.
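
For example, something like this (the XPath is made up since the real markup isn't in the paste):

    from lxml import html

    def parse_holds(content, hold_no):
        """Extract the hold links from one page of response content."""
        tree = html.fromstring(content)
        rows = tree.xpath("//table[@id='holds']//tr")  # hypothetical selector
        if not rows:
            # lxml returns [] instead of raising, so check for it explicitly
            raise ValueError(f"no rows found while parsing hold {hold_no}")
        return [a.get("href") for row in rows for a in row.findall(".//a")]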

It's hard to say how to structure this given that it's part of some bigger
thing we can't see. I'd probably class the whole thing: generate the full list
of URLs to scrape and set up your empty `holdsurl` list both in the init. Then
I'd probably write a class method that just runs the whole thing and returns
the `holdsurl` list, but there are other equally good options.
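
Very roughly, leaning on the functions sketched above and with names I made up:

    class HoldScraper:
        def __init__(self, session, base_url, hold_numbers):
            self.session = session
            self.urls = list(generate_urls(base_url, hold_numbers))  # (hold_no, url) pairs
            self.holdsurl = []

        def run(self):
            """Fetch and parse every URL, returning the collected hold links."""
            for hold_no, url in self.urls:
                content = fetch(self.session, url)
                self.holdsurl.extend(parse_holds(content, hold_no))
            return self.holdsurl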

One thing I don't understand is why you're transferring the session to
Requests at all. I don't see anything you're doing with the Session() object
that you wouldn't already be able to do with Selenium/soup/lxml.

~~~
dhruvkar
Thanks for taking the time to reply with the limited information provided.

>> why you're transferring the session to Requests at all

Only for speed. Selenium is very slow. Requests seems to return responses
faster. With tens (or hundreds) of URLs to scrape, this adds up.
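
For reference, the transfer is basically just copying the Selenium cookies into a requests.Session, something like this simplified sketch (the real _transfer_session may differ in the details):

    import requests

    def transfer_session(driver):
        """Simplified sketch of the cookie handoff from Selenium to Requests."""
        session = requests.Session()
        for cookie in driver.get_cookies():
            session.cookies.set(cookie["name"], cookie["value"])
        return session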

>> I'd probably class the whole thing, generate the full list of urls to
scrape and setup your empty `holdsurl` list both in the init. Then I'd
probably write a class method that just runs the whole thing and returns the
`holdsurl` list

This seems to be the way to go. Having it all in one method is making it hard to read.

Thanks for the other tips as well, like adding checks to the parsing. I'll be
adding those in.

~~~
staticautomatic
Sure thing. Yeah, Requests is definitely faster. If you really want to get as
much speed out of this as possible you'd go asynchronous, but that's another
can of worms.

This level of optimization may be overkill, but if you want a speed boost (not
sure how much of one, though) while still making the requests synchronously,
you could use urllib instead of Requests: since you're only fetching the page
and not actually using Requests' abstractions, you're paying extra overhead
for nothing. You could also just use lxml directly instead of soup. As with
Requests, you're incurring the overhead of calling soup even though most of
what you're using soup for is the optional lxml parser. I don't know what soup
uses for locating stuff in the tree, but if it's not lxml, it's probably
slower than lxml.
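
That path looks roughly like the sketch below, though you'd have to pass your session cookies along yourself (e.g. as a Cookie header), which is part of what Requests handles for you:

    from urllib.request import Request, urlopen
    from lxml import html

    def fetch_and_parse(url, cookie_header):
        """Fetch with urllib and parse with lxml directly, skipping both
        Requests and BeautifulSoup."""
        req = Request(url, headers={"Cookie": cookie_header})
        with urlopen(req, timeout=30) as resp:
            content = resp.read()
        return html.fromstring(content)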

