

Ask HN: So, I need regular expressions... - seanccox

I understand that now I have two problems. Actually, since I'm trying to use these regex through AppleScript, it appears that I have three.

I've tried using Smile from Satimage, but I can't seem to establish an environment in which I can run a regex on the text I have (which is in QuarkXPress). I can test that my expressions work, but I can't manage to have them applied to the text in Quark, much less change the text.

Structurally, I feel I understand what I need to happen: activate Quark, select the document, find the offending text (given the regex parameters), replace the text (if it meets certain conditions), else ignore the text.

My goal is to script the editorial style guide of the magazine I work for, thus side-stepping a lot of formatting/spell checking that we do, so I can focus on fact checking.

I have a script that already does the formatting, actually, but I wanted to write it in regex so that it might run faster and fulfill more goals. For example, we switch all manner of words with the letter 'z' to a spelling with the letter 's', as is customary in British spelling: "analyze" to "analyse", "capitalization" to "capitalisation", you get the idea. I've had trouble with my script thus far, because it introduces certain errors (e.g. "size" to "sise"). I thought regex could prevent this, and that if I could learn it, I could go on to solving other, more complex, problems. But at this stage, I can't even get a script with regex in it to launch and work on the text.

Ideas?
======
cturner
This is a really exciting problem space, and I feel some envy at the thought
of your task :)

I haven't worked with Quark, but years ago I did something similar working
with InDesign. We needed to produce education resources at a fast pace,
including work from many contributors into a single document, with editing
phases as well.

I started out thinking that InDesign was the centre of the system. But I had
to change to a mindset where the backbone of the system was a chain of tools.
InDesign was just one tool plugged into that system, focused on doing the
things it was good at.

Building it this way, I was able to deal with textual issues in a different
part of the toolchain, and had complete flexibility about what languages I
could use to do it (i.e. I was not stuck with AppleScript).

Here's an example of how this flow could work with Quark:

* Your users create their source material in whatever format. Some will give you Word, others Smile, others text.

* Consolidation phase. Consolidate the bulk of your content into a single place. You could abuse Quark for this purpose, but if you do, completely ignore layout concerns. The purpose of this phase is to get all your content (images, text) in a single place, associated with one another. Then you'd export to XML. Again: no layout in this phase.

* Review phases. Here, use your textual tools to work with the text. Get all the content right.

* Layout phase. Now Quark, and layout. Feed your completed text into Quark, and arrange it.

If new content arrives, have a mechanism so you can feed it straight into the
review layer, and then into Quark for re-layout.

    
    
        > My goal is to script the editorial style guide of
        > the magazine I work for, thus side-stepping a lot of
        > formatting/spell checking that we do, so I can focus
        > on fact checking.
    

If you were to break it up like I mentioned above, you could even have
different phases for fact checking and regionalisation. And you could see
your data flow through the stages, and have diffs available to supply to
interested parties, such as the content author. This allows you to add
powerful new kinds of markup that don't need to appear in the final
document. For example, XML markup for fact-checking events: "<check
time="20130527-1225" person="joe">Confirmed with call to witness, see full
writeup at [blah].</check>".

You wouldn't be doomed to an existence of XML editing to do this - you might
have a tool sitting on top of it. But to get it going you could just edit the
XML manually.
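
To make the review-layer markup idea concrete, here is a minimal Python sketch of a tool that pulls `<check>` annotations out of the text before it goes on to layout. Only the `<check>` element and its `time`/`person` attributes come from the example above; the `<article>` and `<p>` wrappers, the sample sentences, and the `split_checks` helper are made up for illustration, and a real Quark XML export would look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical review-layer fragment. The wrappers and sentences are invented;
# only the <check> element mirrors the example in the comment above.
SOURCE = (
    '<article><p>The witness saw the event.'
    '<check time="20130527-1225" person="joe">Confirmed with call to witness.</check>'
    ' The story continues here.</p></article>'
)

def split_checks(xml_text):
    """Return (publishable_text, check_records) for one XML fragment."""
    root = ET.fromstring(xml_text)
    checks = []
    # Snapshot the elements first so removing children does not disturb the walk.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "check":
                continue
            checks.append({**child.attrib, "note": child.text or ""})
            # Keep the text that follows the <check> element before dropping it.
            tail = child.tail or ""
            siblings = list(parent)
            index = siblings.index(child)
            if index == 0:
                parent.text = (parent.text or "") + tail
            else:
                previous = siblings[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return "".join(root.itertext()), checks

text, checks = split_checks(SOURCE)
print(text)
print(checks)
```

The clean text can go on to the layout phase while the check records stay behind as an audit trail.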

There are other advantages.

Say you wanted to publish to web and to paper and to braille all off the
same content. No problem. Just take your textually-correct document layer, and
pass it to the braille processor or the web processor. You'll get exactly the
same content as went into the paper published version.

You could feed that layer into a datastore and run a textual search program
against it.

Or you might want to publish a single volume of all of the year's content,
with a single index. Again, it becomes straightforward.

Now you control the text.

If you want to discuss in more detail, feel free to contact me offline.

~~~
seanccox
Happy to do so. I'll drop you a line there.

------
Piskvorrr
See, that's _exactly_ what the maxim is about: "I know, I'll use regular
expressions, [because regular expressions are the universal text manipulation
tool]." Now you indeed have two problems, because regex is not an appropriate
tool here.

You are _absolutely not_ looking for a regular expression (as the grammatical
rules and their exceptions are not all that easy to reimplement in regexp
syntax), you are looking for a dictionary (a set of key-value pairs, i.e.
"American zpelling" => "British spelling" ;)). Then iterate through each word
in the document, see if it's a dictionary key, if so, replace with the
relevant value. No regex needed.
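
A rough Python sketch of that dictionary lookup, in case it helps. The two entries in `SPELLINGS` are taken from the original question; the helper names are hypothetical, and the one trivial pattern is there only to split the text into words, not to encode any spelling rule.

```python
import re

# Hypothetical word list: American spelling -> British spelling.
# A real style guide would supply a much longer mapping.
SPELLINGS = {
    "analyze": "analyse",
    "capitalization": "capitalisation",
}

def match_case(replacement, original):
    """Copy simple capitalisation (ALL CAPS or Leading Cap) from the original."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement[0].upper() + replacement[1:]
    return replacement

def anglicise(text):
    """Replace each whole word that is a dictionary key; leave the rest alone."""
    def swap(match):
        word = match.group(0)
        british = SPELLINGS.get(word.lower())
        return match_case(british, word) if british else word
    # The pattern only finds words; the dictionary decides what changes.
    return re.sub(r"[A-Za-z]+", swap, text)

print(anglicise("Analyze the size and capitalization."))
# prints: Analyse the size and capitalisation.
```

Because "size" is not a key, it is never touched, which is exactly the error the blanket z-to-s rule was introducing.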

~~~
jaachan
Unless you also want to fix things like spacing and whatnot. Some formatting
rules can be pretty complex. But I agree that a dictionary would be an
essential part of the spellchecking.

------
jaachan
I've got no experience in AppleScript, so my advice might not be worth much.

That said, since you have three problems, solve them each on their own:

* First, do something really simple with regex, to make sure you can use them in your app. In JS, say, it'd be something like:

    
    
      alert('test'.replace(/st/, 'ssst'));
    

  Then you know the regex engine itself is working.

* Next, you need to figure out how to change text in the document. Not sure if you have already got this covered.

* Then it's a matter of combining the two. Note that regex functions return the changed string, rather than changing the string itself. The example in the documentation does:

    
    
      set s to change "^[ ]+" into "" in s with regexp 
    

i.e. first change the string (change "^[ ]+" into "" in s with regexp), then
store it (set s to <expr>).
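
The same "returns a new string" behaviour can be seen in a small Python sketch. Here Python's `re.sub` stands in for the Satimage `change` command (an assumption made purely for illustration), and the variable names are invented.

```python
import re

original = "   indented line"

# re.sub builds and returns a new string; `original` itself is not modified,
# so calling it without storing the result changes nothing.
re.sub(r"^[ ]+", "", original)
print(repr(original))

# Store the result to keep the change, as `set s to ...` does in AppleScript.
stripped = re.sub(r"^[ ]+", "", original)
print(repr(stripped))
```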

All in all pretty vague advice I guess.

------
logn
Sounds like you're going down a terrible path, but I'll help you fan the
flames:

<http://www.regular-expressions.info/lookaround.html>

(?<!si)ze

And by saying this is a terrible path, I mean that spelling and grammar are
highly irregular, with all sorts of exceptions. Unless you can distill down
rules that will always hold, like 'z' will never be turned to an 's' if 'si'
precedes it, I think you're better off just building a dictionary of
substitutions. If for some reason you have to express this in regex only,
then build a simple wordlist of substitutions and run a script on it to
convert it to regex, e.g.,
\b(analyze|theorize|rationalize)\b
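
A minimal Python sketch of that wordlist-to-regex step, under some assumptions: the three verbs come from the pattern above, their British replacements and the helper names are filled in for illustration, and matching is case-sensitive to keep the example short.

```python
import re

# Hypothetical substitutions; only the three verbs come from the comment above.
SUBSTITUTIONS = {
    "analyze": "analyse",
    "theorize": "theorise",
    "rationalize": "rationalise",
}

def build_pattern(words):
    """Turn a word list into one alternation anchored at word boundaries."""
    # Longest first, so a longer entry is tried before any shorter one.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b")

PATTERN = build_pattern(SUBSTITUTIONS)

def apply_substitutions(text):
    """Replace every listed word; anything not on the list is left untouched."""
    return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(1)], text)

print(PATTERN.pattern)
print(apply_substitutions("We analyze the size, then theorize."))
```

The generated pattern is only as good as the wordlist behind it, which is the point: the list is the thing to maintain, and the regex is just a compiled form of it.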

------
McUsr
Hello.

Use sed in a do shell script. Short example (written on the fly):

    
    
      -- no -n flag, so sed prints its result; quoted form escapes the text for the shell
      set mytext to "analyze"
      set fixed to do shell script "sed 's/analyze/analyse/g' <<< " & quoted form of mytext
    
    

I have written a script that lets you copy your terminal command line, once
you have figured it all out, so that you can paste it into a do shell script.
It can be found here:
<http://macscripter.net/viewtopic.php?pid=163300#p163300>
Macscripter.net is also a great forum for AppleScript questions.

------
seanccox
Wow... apologies for the above ramble. As an editor, I probably should have
proofed this before making it public. This is what learning regex can do to an
otherwise thorough and comprehensible writer.

------
johnny22
Isn't Stack Overflow a better place for such a question?

~~~
saidulu401
Yes. Stack Overflow is a question and answer site for professional and
enthusiast programmers.

~~~
seanccox
OK, great. Didn't know about that. I'm about as new to this as new gets. I'll
ask there.

