

Ask HN: what could go wrong? - dhbradshaw

I'm building a website that lets people aggregate the numbers they care about into one spot.

Right now, the group that it is most popular with is authors, who use it to get alerts when they get a new review on Amazon.

They have suggested that I make it possible to track their author rank on Amazon. I've been playing with that and I have found that regex is a nice way to go for that particular job. (I've been using xpaths and selectors up to this point.) So soon I'll probably add that as a specialized function to my website.

Because regexes are so useful (not for parsing but for finding known patterns), I'm tempted to make it possible to create automatic scrapers using regexes. But it seems the kind of thing you want to research a bit first.
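
For illustration, here is a minimal sketch of the "find a known pattern" idea in Python. The sample text and the `#1,234 in ...` shape are made up for the example, not Amazon's actual markup, and `extract_rank` is a hypothetical helper name.

```python
import re

# Assumed shape of the rank text: "#1,234 in Books". The real page may differ.
RANK_RE = re.compile(r"#(\d[\d,]*)\s+in\b")

def extract_rank(page_text):
    """Return the first rank found as an int, or None if nothing matches."""
    match = RANK_RE.search(page_text)
    if match is None:
        return None
    # Drop thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))
```

Something like `extract_rank("Author Rank: #1,234 in Books")` would pull out the number without parsing the surrounding HTML at all, which is the appeal over xpaths for this kind of job.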
======
mtmail
Does your target audience understand regular expressions? I like the approach
import.io took: you go to one or more pages with their browser, select the
fields you're interested in and they build the extraction (xpath, css
selectors) for you. An engineer can take that configuration and instruct the
scraper to call a URL and get JSON back. Even with their special browser, help
pages, and videos, I had trouble explaining it to a non-technical person.

"Normal" regular expressions are probably fine. Only with heavy backtracking
or lookahead might it be possible to create enough complexity that a regex
takes too long. Wrapping the match in a block with a fixed timeout should work.
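
A rough sketch of the fixed-timeout idea in Python, for illustration only. Python's built-in `re` module has no timeout option, so this version assumes it is acceptable to run each user-supplied pattern in a short-lived child interpreter and kill it if it runs too long. The name `findall_with_timeout` and the 2-second default are hypothetical choices, not anything from the thread.

```python
import json
import subprocess
import sys

# Child script: reads {"pattern", "text"} as JSON on stdin and
# prints the list of matches as JSON on stdout.
_CHILD = r"""
import json, re, sys
job = json.load(sys.stdin)
print(json.dumps(re.findall(job["pattern"], job["text"])))
"""

def findall_with_timeout(pattern, text, timeout_s=2.0):
    """Run re.findall in a child process.

    Returns the list of matches, or None if the pattern is invalid
    or the match did not finish within timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=json.dumps({"pattern": pattern, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        # The child exited with an error, e.g. re.error on a bad pattern.
        return None
    return json.loads(proc.stdout)
```

A pattern such as `(a+)+$` run against a long string of `a` characters followed by a `b` is the classic slow case; with this wrapper it simply comes back as `None` instead of tying up the server. Starting a process per match is expensive, so a real service would likely want a worker pool or a non-backtracking engine instead.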

~~~
dhbradshaw
Good question about the regular expressions. The indie author community is
made up of a bit of everyone. They learn aggressively and they communicate
well. It seems to me that if I make something available it will soon be
adapted to a use and shared.

import.io looks awesome. I was also looking at ParseHub, which seems very
impressive.

Thanks for the idea of a fixed timeout.

~~~
romanhn
Do take a look at Google's RE2 library
([https://code.google.com/p/re2/](https://code.google.com/p/re2/)) - it is
designed to handle user-entered regexes safely.

