Ask HN: So, I need regular expressions...

Piskvorrr · on May 27, 2013

See, that's exactly what the maxim is about: "I know, I'll use regular expressions, [because regular expressions are the universal text manipulation tool]." Now you indeed have two problems, because regex is not an appropriate tool here.

You are absolutely not looking for a regular expression (as the grammatical rules and their exceptions are not all that easy to reimplement in regexp syntax), you are looking for a dictionary (a set of key-value pairs, i.e. "American zpelling" => "British spelling" ;)). Then iterate through each word in the document, see if it's a dictionary key, if so, replace with the relevant value. No regex needed.
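For anyone following along, the lookup described above might be sketched like this in Python (not AppleScript, so treat it as an illustration of the idea only). The three word pairs and the helper names `match_case` and `to_british` are made up for the example; a real style guide would supply its own list.

```python
from itertools import groupby

# Hypothetical sample entries, not a real style guide.
AMERICAN_TO_BRITISH = {
    "color": "colour",
    "analyze": "analyse",
    "center": "centre",
}

def match_case(source, replacement):
    """Carry the capitalisation of the original word over to its replacement."""
    if source.isupper():
        return replacement.upper()
    if source[0].isupper():
        return replacement.capitalize()
    return replacement

def to_british(text):
    """Walk the text word by word and swap any word that is a dictionary key."""
    pieces = []
    # groupby splits the text into alternating runs of letters and non-letters,
    # so punctuation and spacing pass through untouched.
    for is_word, chars in groupby(text, key=str.isalpha):
        chunk = "".join(chars)
        if is_word:
            british = AMERICAN_TO_BRITISH.get(chunk.lower())
            if british:
                chunk = match_case(chunk, british)
        pieces.append(chunk)
    return "".join(pieces)

print(to_british("The Color of the center, analyze it."))
```

Words that are not keys are left alone, which is the main advantage over pattern-based rules: nothing changes unless it is explicitly listed.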

seanccox · on May 27, 2013

That's very helpful. Since I am just starting out, I often find myself trying to use the wrong tool for a job, because I simply don't know what the hell I'm doing. I can use a dictionary as you describe to correct the spelling issues, then I suppose regex can be used on certain formatting problems (if need be)?

Since I am just getting started, can you tell me how I would go about building such a dictionary? For instance, should I create a text file with the word pairs (American, British), or two different files for each?

Thanks for the direction, and apologies if these questions seem foolish. As I indicated, I made something that works, but is slow and buggy. Getting direction on how to move forward really is a big help. Cheers.

Piskvorrr · on May 27, 2013

Probably keep the pairs together. Not sure what AppleScript uses for hashmaps, but you probably want to load your dictionary into a hashmap or something like that, and look words up in that (much faster than on-disk access).
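A rough Python sketch of "keep the pairs together, load them once", assuming one tab-separated pair per line. The file layout, the name `spellings.tsv`, and the `load_pairs` helper are assumptions for illustration; the thread does not specify a format.

```python
import csv
import io

def load_pairs(lines):
    """Build an in-memory dict from tab-separated 'american<TAB>british' lines."""
    pairs = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        american, british = (cell.strip() for cell in row)
        pairs[american.lower()] = british
    return pairs

# Stand-in for open("spellings.tsv", encoding="utf-8"); the file name is hypothetical.
sample = io.StringIO("color\tcolour\nanalyze\tanalyse\n\ncenter\tcentre\n")
SPELLINGS = load_pairs(sample)
print(SPELLINGS["color"])
```

The file is read once at startup; every later lookup hits the in-memory dict rather than the disk.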

jaachan · on May 27, 2013

Unless you also want to fix things like spacing and whatnot. Some formatting rules can be pretty complex. But I agree that a dictionary would be an essential part of the spellcheck side.

jaachan · on May 27, 2013

I've got no experience in Applescript, so my advice might not be worth much.

That said, since you have three problems, solve them each on their own:

* First, do something really simple with regex, to make sure you can use it in your app. In JS, say, it'd be something like:

  alert('test'.replace(/st/, 'ssst'));

  Then you know the regex engine itself is working.

* Next, you need to figure out how to change text in the document. Not sure if you have already got this covered.

* Then it's a matter of applying the two. Note that regex functions return the changed string, rather than change the string itself. In the example listed in the documentation it does:

  set s to change "^[ ]+" into "" in s with regexp

  i.e. first change the string (change "^[ ]+" into "" in s with regexp), then store it (set s to <expr>).
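The "returns the changed string" point is easy to trip over, so here is the same idea as a tiny Python sketch. It mirrors the AppleScript line above using Python's `re.sub`; the sample text is invented.

```python
import re

line = "   indented text"

# Calling the function alone does nothing lasting: the result is thrown away.
re.sub(r"^[ ]+", "", line)
untouched = line  # still has its leading spaces

# Store the returned string, which is what `set s to ...` does in the AppleScript example.
line = re.sub(r"^[ ]+", "", line)

print(repr(untouched), repr(line))
```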

All in all pretty vague advice I guess.

logn · on May 27, 2013

Sounds like you're going down a terrible path, but I'll help you fan the flames:

http://www.regular-expressions.info/lookaround.html

(?<!si)ze

And by saying this is a terrible path, I mean, spelling and grammar are highly irregular with all sorts of exceptions. Unless you can distill down rules that will always hold, like 'z' will never be turned to an 's' if 'si' precedes it, then I think you're better off just building a dictionary of substitutions. If you have to for some reason express this in regex only, then build a simple wordlist of substitutions and run a script on it to convert it to regex, e.g., \b(analyze|theorize|rationalize)\b
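A sketch of the "wordlist in, regex out" script, in Python. The three words come from the example above; their replacements and the names `build_pattern` and `apply_substitutions` are assumptions added for illustration.

```python
import re

# Illustrative wordlist; the replacements are assumed, not taken from the thread.
SUBSTITUTIONS = {
    "analyze": "analyse",
    "theorize": "theorise",
    "rationalize": "rationalise",
}

def build_pattern(words):
    """Turn a wordlist into one alternation anchored on word boundaries."""
    # re.escape guards against words containing regex metacharacters.
    alternatives = "|".join(re.escape(word) for word in sorted(words))
    return re.compile(r"\b(" + alternatives + r")\b")

PATTERN = build_pattern(SUBSTITUTIONS)

def apply_substitutions(text):
    """Replace each matched word with its entry from the wordlist."""
    return PATTERN.sub(lambda match: SUBSTITUTIONS[match.group(1)], text)

print(PATTERN.pattern)
print(apply_substitutions("We analyze and theorize; analyzed stays."))
```

Because of the `\b` anchors, only whole words on the list are touched, so "analyzed" is left as it is unless it is added to the list too.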

McUsr · on May 27, 2013

Hello.

Use sed in a do shell script. A short example (written on the fly):

  set mytext to "size"
  -- no -n flag, so sed prints the result; quoted form protects the text from the shell
  set fixed to do shell script "sed 's/size/sise/g' <<< " & quoted form of mytext

I have written a script that lets you copy your terminal command line once you have figured it all out, so that you can paste it into a do shell script. It can be found here: <http://macscripter.net/viewtopic.php?pid=163300#p163300>. Macscripter.net is also a great forum for AppleScript questions.

seanccox · on May 27, 2013

Wow... apologies for the above ramble. As an editor, I probably should have proofed this before making it public. This is what learning regex can do to an otherwise thorough and comprehensible writer.

johnny22 · on May 27, 2013

Isn't Stack Overflow a better place for such a question?

saidulu401 · on May 27, 2013

Yes. Stack Overflow is a question and answer site for professional and enthusiast programmers.

seanccox · on May 27, 2013

OK, great. Didn't know about that. I'm about as new to this as new gets. I'll ask there.

cturner · on May 27, 2013

This is a really exciting problem space, and I feel some envy at the thought of your task :)

I haven't worked with Quark, but years ago I did something similar working with InDesign. We needed to produce education resources at a fast pace, including work from many contributors into a single document, with editing phases as well.

I started out thinking that InDesign was the centre of the system. But I had to change to a mindset where the backbone of the system was a chain of tools. InDesign was just one tool plugged into that system, focused on doing the things it was good at.

Building it this way, I was able to deal with textual issues in a different part of the toolchain, and had complete flexibility about what languages I could use to do it (i.e. not stuck with AppleScript).

Here's an example for how this flow could work involving quark:

* Your users create their source material in whatever format. Some will give you word, others smile, others text.

* Consolidation phase. Consolidate the bulk of your content into a single place.

    You could abuse quark for this purpose, but if you
    do: completely ignore layout concerns. The purpose of
    this phase is to get all your content (images, text)
    in a single place, associated with one another.
    Then you'd export to XML. Again: no layout in
    this phase.

* Review phases. Here, use your textual tools to work with the text. Get all the content right.

* Layout phase. Now quark, and layout. Feed your completed text into quark, and arrange it.

If new content arrives, have a mechanism so you can feed it straight into the review layer, and then into quark for re-layout.

    > My goal is to script the editorial style guide of
    > the magazine I work for, thus side-stepping a lot of
    > formatting/spell checking that we do, so I can focus
    > on fact checking.

If you were to break it up like I mentioned above, you could even have different phases for fact checking and regionalisation. And you could see your data flow through the stages, and have diffs available to supply to interested parties, such as the content author. This allows you to add powerful new kinds of markup that don't need to appear in the final document. For example, XML markup for fact-checking events, "<check time="20130527-1225" person="joe">Confirmed with call to witness, see full writeup at [blah].</check>".
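To make the markup idea concrete, here is a rough Python sketch using the standard library's `xml.etree.ElementTree`. Only the `<check>` element and its `time` and `person` attributes come from the example above; the `<article>` and `<p>` wrappers, the sample sentences, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical review-layer document; only <check> is taken from the example above.
ARTICLE = (
    "<article><p>The meeting ran late."
    '<check time="20130527-1225" person="joe">Confirmed with call to witness.</check>'
    " A vote followed.</p></article>"
)

def fact_checks(root):
    """Collect (time, person, note) for every <check> annotation."""
    return [(c.get("time"), c.get("person"), c.text or "") for c in root.iter("check")]

def publishable_text(root):
    """Return the text with <check> annotations left out, keeping what follows them."""
    parts = []

    def walk(node):
        if node.tag != "check":
            parts.append(node.text or "")
            for child in node:
                walk(child)
        # Text after a closing tag belongs to the surrounding content, so always keep it.
        parts.append(node.tail or "")

    walk(root)
    return "".join(parts)

root = ET.fromstring(ARTICLE)
print(fact_checks(root))
print(publishable_text(root))
```

The review layer keeps the annotations; the layout step only ever sees the output of something like `publishable_text`.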

You wouldn't be doomed to an existence of XML editing to do this - you might have a tool sitting on top of it. But to get it going you could just edit the XML manually.

There are other advantages.

Say you wanted to publish to web and to paper and to braille, all from the same content. No problem. Just take your textually-correct document layer, and pass it to the braille processor or the web processor. You'll get exactly the same content as went into the paper published version.

You could feed that layer into a datastore and run a textual search program against it.

Or you might want to publish a single volume of all of the year's content, with a single index. Again, it becomes straightforward.

Now you control the text.

If you want to discuss in more detail, feel free to contact me offline.

seanccox · on May 27, 2013

Happy to do so. I'll drop you a line there.