Hacker News
Ask HN: So, I need regular expressions...
5 points by seanccox on May 27, 2013 | 13 comments
I understand that now I have two problems. Actually, since I'm trying to use these regexes through AppleScript, it appears that I have three.

I've tried using Smile from Satimage, but I can't seem to establish an environment in which I can run a regex on the text I have (which is in QuarkXPress). I can test that my expressions work, but I can't manage to have them applied to the text in Quark, much less change the text.

Structurally, I feel I understand what I need to happen: activate Quark, select the document, find the offending text (given the regex parameters), replace the text (if it meets certain conditions), else ignore the text.

My goal is to script the editorial style guide of the magazine I work for, thus side-stepping a lot of formatting/spell checking that we do, so I can focus on fact checking.

I have a script that already does the formatting, actually, but I wanted to write it in regex so that it might run faster and fulfill more goals. For example, we switch all manner of words with the letter 'z' to a spelling with the letter 's', as is customary in British spelling: "analyze" to "analyse", "capitalization" to "capitalisation", you get the idea. I've had trouble with my script thus far, because it introduces certain errors (e.g. "size" to "sise"). I thought regex could prevent this, and that if I could learn it, I could go on to solving other, more complex, problems. But at this stage, I can't even get a script with regex in it to launch and work on the text.

Ideas?



See, that's exactly what the maxim is about: "I know, I'll use regular expressions, [because regular expressions are the universal text manipulation tool]." Now you indeed have two problems, because regex is not an appropriate tool here.

You are absolutely not looking for a regular expression (as the grammatical rules and their exceptions are not all that easy to reimplement in regexp syntax), you are looking for a dictionary (a set of key-value pairs, i.e. "American zpelling" => "British spelling" ;)). Then iterate through each word in the document, see if it's a dictionary key, if so, replace with the relevant value. No regex needed.
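A rough sketch of that dictionary lookup in Python, in case it helps to see it spelled out. The three entries in US_TO_UK and the helper names are made up for illustration, and the one small regex here is only used to split the text into words, not to encode any spelling rules.

```python
import re

# Hypothetical sample entries; a real table would come from the style guide.
US_TO_UK = {
    "analyze": "analyse",
    "capitalization": "capitalisation",
    "organize": "organise",
}

# Only used to find word boundaries, not to express spelling rules.
WORD = re.compile(r"[A-Za-z]+")

def match_case(original, replacement):
    """Copy simple capitalisation (ALL CAPS or Initial cap) onto the replacement."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement[0].upper() + replacement[1:]
    return replacement

def britishise(text):
    """Replace each whole word that appears as a key in US_TO_UK."""
    def swap(match):
        word = match.group(0)
        replacement = US_TO_UK.get(word.lower())
        return match_case(word, replacement) if replacement else word
    return WORD.sub(swap, text)

print(britishise("Analyze the size of the capitalization."))
# Analyse the size of the capitalisation.
```

Because only whole words found in the table are touched, "size" is left alone without needing any special rule.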


That's very helpful. Since I am just starting out, I often find myself trying to use the wrong tool for a job, because I simply don't know what the hell I'm doing. I can use a dictionary as you describe to correct the spelling issues, then I suppose regex can be used on certain formatting problems (if need be)?

Since I am just getting started, can you tell me how I would go about building such a dictionary? For instance, should I create a text file with the word pairs (American, British), or two different files for each?

Thanks for the direction, and apologies if these questions seem foolish. As I indicated, I made something that works, but is slow and buggy. Getting direction on how to move forward really is a big help. Cheers.


Probably keep the pairs together. Not sure what Applescript uses for hashmaps, but you probably want to load your dictionary into a hashmap or something like that, and lookup in that (much faster than on-disk access).
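For what it's worth, here is a minimal Python sketch of keeping the pairs together in one file and loading them into a hashmap. The "american,british" row format, the load_pairs name and the spellings.csv file name are all assumptions, not anything AppleScript or Quark prescribes.

```python
import csv
import io

def load_pairs(lines):
    """Build a dict from 'american,british' rows; blank lines and # comments are skipped."""
    table = {}
    rows = csv.reader(
        line for line in lines if line.strip() and not line.startswith("#")
    )
    for row in rows:
        american, british = (cell.strip() for cell in row)
        table[american.lower()] = british.lower()
    return table

# Stand-in for open("spellings.csv", encoding="utf-8"); the file name is hypothetical.
sample = io.StringIO("# american,british\nanalyze,analyse\ncolor,colour\n\n")
pairs = load_pairs(sample)
print(pairs["color"])  # colour
```

The file is read once at startup; every lookup after that is an in-memory dict access.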


Unless you also want to fix things like spacing and whatnot. Some formatting rules can be pretty complex. But I agree that a dictionary would be an essential part of the spellcheck.


I've got no experience in AppleScript, so my advice might not be worth much.

That said, since you have three problems, solve them each on their own:

* First, do something really simple with regex, to make sure you can use it in your app. In JS, say, it'd be something like:

  alert('test'.replace(/st/, 'ssst'));

Then you know the regex engine itself is working.

* Next, you need to figure out how to change text in the document. Not sure if you have already got this covered.

* Then it's a matter of combining the two. Note that regex functions return the changed string, rather than change the string itself. The example listed in the documentation does:

  set s to change "^[ ]+" into "" in s with regexp

i.e. first change the string (change "^[ ]+" into "" in s with regexp), then store it (set s to <expr>).

All in all pretty vague advice I guess.
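The "returns the changed string" point is the same in most languages. A tiny Python illustration, using the same "^[ ]+" pattern as the AppleScript line; the sample string is made up:

```python
import re

s = "   indented line"

re.sub(r"^[ ]+", "", s)      # result is thrown away; s itself is not modified
unchanged = s

s = re.sub(r"^[ ]+", "", s)  # store the returned string, like `set s to ...`

print(repr(unchanged))  # '   indented line'
print(repr(s))          # 'indented line'
```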


Sounds like you're going down a terrible path, but I'll help you fan the flames:

http://www.regular-expressions.info/lookaround.html

(?<!si)ze

And by saying this is a terrible path, I mean, spelling and grammar are highly irregular with all sorts of exceptions. Unless you can distill down rules that will always exist, like 'z' will never be turned to an 's' if 'si' precedes it, then I think you're better just building a dictionary of substitutions. If you have to for some reason express this in regex only, then build a simple wordlist of substitutions and run a script on it to convert it to regex, e.g., \b(analyze|theorize|rationalize)\b
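A sketch of that "convert the wordlist to a regex" step in Python, assuming a small hypothetical substitution table. It also shows why the lookbehind on its own is not enough: it protects "size" but still rewrites other words it should not touch.

```python
import re

# Hypothetical wordlist; in practice this would be generated from the substitution table.
SUBSTITUTIONS = {
    "analyze": "analyse",
    "theorize": "theorise",
    "rationalize": "rationalise",
}

# Longest words first, so a longer entry is never shadowed by a shorter one.
alternation = "|".join(
    re.escape(word) for word in sorted(SUBSTITUTIONS, key=len, reverse=True)
)
pattern = re.compile(rf"\b({alternation})\b")

def apply_wordlist(text):
    """Replace only whole words that appear in SUBSTITUTIONS (case-sensitive)."""
    return pattern.sub(lambda m: SUBSTITUTIONS[m.group(1)], text)

print(pattern.pattern)
# \b(rationalize|theorize|analyze)\b
print(apply_wordlist("We theorize about size, then analyze."))
# We theorise about size, then analyse.

# The lookbehind rule alone spares "size" but mangles "prize" and "frozen".
lookbehind = re.compile(r"(?<!si)ze")
print(lookbehind.sub("se", "size prize frozen"))
# size prise frosen
```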


Hello.

Use sed in a do shell script. Short example (on the fly):

  -- No -n flag: sed has to print the edited text for do shell script to return it
  set mytext to "sise"
  set fixed to do shell script "sed 's/sise/size/g' <<< " & quoted form of mytext

I have written a script that lets you copy your terminal command line, once you have figured it all out, so that you can paste it into a do shell script. It can be found here: <http://macscripter.net/viewtopic.php?pid=163300#p163300>. Macscripter.net is also a great forum for AppleScript questions.


Wow... apologies for the above ramble. As an editor, I probably should have proofed this before making it public. This is what learning regex can do to an otherwise thorough and comprehensible writer.


Isn't Stack Overflow a better place for such a question?


Yes. Stack Overflow is a question and answer site for professional and enthusiast programmers.


OK, great. Didn't know about that. I'm about as new to this as new gets. I'll ask there.


This is a really exciting problem space, and I feel some envy at the thought of your task :)

I haven't worked with Quark, but years ago I did something similar working with InDesign. We needed to produce education resources at a fast pace, combining work from many contributors into a single document, with editing phases as well.

I started out thinking that InDesign was the centre of the system. But I had to change to a mindset where the backbone of the system was a chain of tools. InDesign was just one tool plugged into that system, focused on doing the things it was good at.

Building it this way, I was able to deal with textual issues in a different part of the toolchain, and had complete flexibility about what languages I could use to do it (i.e. I was not stuck with AppleScript).

Here's an example of how this flow could work with Quark:

* Your users create their source material in whatever format. Some will give you Word, others Smile, others plain text.

* Consolidation phase. Consolidate the bulk of your content into a single place. You could abuse Quark for this purpose, but if you do, completely ignore layout concerns. The purpose of this phase is to get all your content (images, text) in a single place, associated with one another. Then you'd export to XML. Again: no layout in this phase.

* Review phases. Here, use your textual tools to work with the text. Get all the content right.

* Layout phase. Now Quark, and layout. Feed your completed text into Quark, and arrange it.

If new content arrives, have a mechanism so you can feed it straight into the review layer, and then into quark for re-layout.
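To make the review phase a bit more concrete, here is a minimal Python sketch that runs a text fix over every piece of text in an XML export. The element names in EXPORT are invented for illustration and are not any real Quark schema, and fix_text is just a stand-in for whatever textual tools you end up with.

```python
import xml.etree.ElementTree as ET

# Hypothetical export; the element names are invented for illustration
# and do not reflect any real Quark XML schema.
EXPORT = (
    "<article><title>How to analyze data</title>"
    "<p>We <b>analyze</b> the size.</p></article>"
)

def fix_text(text):
    """Stand-in for the review tools (dictionary lookup, style rules, ...)."""
    return text.replace("analyze", "analyse")

def review(xml_source):
    """Apply fix_text to all text in the document and return the new XML."""
    root = ET.fromstring(xml_source)
    for element in root.iter():
        # Text lives both inside an element (.text) and after its closing tag (.tail).
        if element.text:
            element.text = fix_text(element.text)
        if element.tail:
            element.tail = fix_text(element.tail)
    return ET.tostring(root, encoding="unicode")

print(review(EXPORT))
```

The markup passes through untouched, so the corrected file can go straight on to the layout phase.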

    > My goal is to script the editorial style guide of
    > the magazine I work for, thus side-stepping a lot of
    > formatting/spell checking that we do, so I can focus
    > on fact checking.
If you were to break it up like I mentioned above, you could even have different phases for fact checking and regionalisation. And you could see your data flow through the stages, and have diffs available to supply to interested parties, such as the content author. This allows you to add powerful new kinds of markup that don't need to appear in the final document. For example, XML markup for fact-checking events, "<check time="20130527-1225" person="joe">Confirmed with call to witness, see full writeup at [blah].</check>".

You wouldn't be doomed to an existence of XML editing to do this - you might have a tool sitting on top of it. But to get it going you could just edit the XML manually.
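As a hedged sketch of what a small tool on top of that XML might do, the Python below pulls out the <check> annotations as records and produces the publishable text without them. The <check> element follows the example above; the surrounding <p> element, the sample sentences and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# The <check> element follows the example above; the <p> wrapper and the
# sample sentences are assumptions made for illustration.
SOURCE = (
    "<p>The meeting took place on Tuesday."
    '<check time="20130527-1225" person="joe">Confirmed with call to witness.</check>'
    " Ten people attended.</p>"
)

def collect_checks(root):
    """Return one record per <check> element: its attributes plus the note text."""
    return [dict(c.attrib, note=c.text or "") for c in root.iter("check")]

def publishable_text(element):
    """Concatenate all text under element, skipping the contents of <check>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != "check":
            parts.append(publishable_text(child))
        # Text after a child's closing tag belongs to the parent, so always keep it.
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(SOURCE)
print(collect_checks(root))
print(publishable_text(root))
```

The same source file then serves both audiences: the fact-checking log for the editors, and the clean text for layout.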

There are other advantages.

Say you wanted to publish to web and to paper and to braille, all off the same content. No problem. Just take your textually-correct document layer, and pass it to the braille processor or the web processor. You'll get exactly the same content as went into the paper published version.

You could feed that layer into a datastore and run a textual search program against it.

Or you might want to publish a single volume of all of the year's content, with a single index. Again, it becomes straightforward.

Now you control the text.

If you want to discuss in more detail, feel free to contact me offline.


Happy to do so. I'll drop you a line there.



