

Ask HN: How do you split strings (to get keywords)? - gopher

First trial: one splits on whitespace, but this fails on punctuation and special characters.

Second trial: you use an alphanumeric whitelist and split on anything else, but what about umlauts? What about Hebrew or Cyrillic?

Third trial: split on characters < 32, whitespace, and punctuation; this works somehow but is ugly. What would you do to get keywords from a string?
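For concreteness, here is roughly what the three trials look like in Python, and where each falls short:

    import re

    s = "ham, egg."

    # Trial 1: split on whitespace; punctuation sticks to the words.
    print(s.split())  # ['ham,', 'egg.']

    # Trial 2: ASCII alphanumeric whitelist; drops umlauts, Hebrew, Cyrillic.
    print(re.findall(r"[A-Za-z0-9]+", "Ветчина, яйцо"))  # []

    # Trial 3: split on control characters, whitespace, and punctuation;
    # works, but the character class must be maintained by hand.
    print(re.split(r"[\x00-\x20.,;:!?]+", s))  # ['ham', 'egg', '']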
======
TallGuyShort
It depends very heavily on the origin of the string, as that determines the
special cases that need to be dealt with. Can you provide more details?

edit: Based on what you said in your original post, I would keep a list of
possible delimiters (which would probably need additions for some time),
tokenize the string against it, and discard any token that appears in a
second list of words that don't matter (conjunctions, articles,
prepositions, etc.). Before discarding said tokens, you'd also want to check
whether they're operators used in your app, or anything like that.
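A minimal sketch of that idea, assuming made-up delimiter and stop-word
lists (both would grow over time in practice):

    import re

    # Placeholder lists; real ones would be extended as new input shows up.
    DELIMITERS = " \t\n.,;:!?()[]\"'"
    STOP_WORDS = {"a", "an", "the", "and", "or", "in", "of", "to"}

    def extract_keywords(text):
        # Split on any run of delimiter characters, then drop stop words.
        pattern = "[" + re.escape(DELIMITERS) + "]+"
        tokens = re.split(pattern, text.lower())
        return [t for t in tokens if t and t not in STOP_WORDS]

    print(extract_keywords("The ham, and the egg."))  # ['ham', 'egg']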

~~~
gopher
I'm thinking of something like a comment or an abstract; security is not an
issue here because input validation and escaping are done elsewhere.

Basically, I think of a string like "ham, egg." which should result in "ham"
and "egg", and "Ветчина, яйцо." (Russian for "ham, egg.") should likewise
result in "Ветчина" and "яйцо".

The challenge is that you cannot whitelist all possible characters as there
are (imho) too many charsets.

~~~
TallGuyShort
Well, barring the practice of specifying meaningful characters, the only
thing I can come up with is to have your program use statistics to take its
best guess at which characters are 'special'. Say 95% of the characters fall
between 65 and 90, and every now and then there's a 44-32 pair. Then your
program could be pretty sure that 44-32 (comma, space) is a delimiter, and
that 65-90 (A-Z) is the range of characters used in keywords. (The above
values are ASCII codes.)

However, that does nothing to eliminate words like 'in' and 'of' in a query,
which you may want to do. It isn't very practical, I think, and you probably
want to look at more manageable ways to list possible delimiters, etc.,
although the above could help you determine which charset you're using.
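A toy Python version of that statistical guess (purely illustrative: the
coverage threshold is a made-up knob, and short or mixed-script input will
fool it):

    def guess_alphabet(text, coverage=0.9):
        # Find the narrowest contiguous code-point range covering
        # `coverage` of the input characters; call it the keyword
        # alphabet, and treat every character outside it as a delimiter.
        codes = sorted(ord(c) for c in text)
        window = int(len(codes) * coverage)
        span, lo, hi = min(
            (codes[i + window - 1] - codes[i],
             codes[i], codes[i + window - 1])
            for i in range(len(codes) - window + 1)
        )
        delimiters = {c for c in set(text) if not lo <= ord(c) <= hi}
        return (lo, hi), delimiters

    rng, delims = guess_alphabet("PORTMANTEAU, ABBREVIATION, ONOMATOPOEIA")
    print(rng, delims)  # (65, 86) {' ', ','} (set order may vary)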

------
dannyr
How about term extraction?

<http://developer.yahoo.com/search/content/V1/termExtraction.html>

------
mbrubeck
_"Second trial, you use a alpha-numeric whitelist and split on anything else,
but what about umlauts? What about hebrew or cyrillic?"_

A multi-lingual version of this could use the Unicode "General Category"
character classes (Letter, Mark, Number, Punctuation, Symbol, Separator,
Other).
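For example, in Python the unicodedata module exposes those categories, so a
splitter can keep Letters, Marks, and Numbers and treat everything else as a
separator, with no per-script whitelist:

    import unicodedata

    def keywords(text):
        # Keep runs of Letter (L), Mark (M), and Number (N) characters;
        # split on everything else (Punctuation, Symbol, Separator, Other).
        out, word = [], []
        for c in text:
            if unicodedata.category(c)[0] in "LMN":
                word.append(c)
            elif word:
                out.append("".join(word))
                word = []
        if word:
            out.append("".join(word))
        return out

    print(keywords("ham, egg."))       # ['ham', 'egg']
    print(keywords("Ветчина, яйцо."))  # ['Ветчина', 'яйцо']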

------
alanthonyc
Not sure what your main goal is, but in my compilers project class, we used
lexical analyzers to break out tokens from the input stream.

Try looking up "Lex" or "Flex"; these were the tools we used. There may be
better ones around now.

Here's a quick google: <http://dinosaur.compilertools.net/>
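Not Lex itself, but the same idea fits in a few lines of Python: an ordered
list of regex rules consumed from the front of the input (the rule set here
is a made-up minimal one):

    import re

    # Lex-style rules: (pattern, token type), tried in order at each position.
    RULES = [
        (re.compile(r"\w+", re.UNICODE), "WORD"),
        (re.compile(r"\s+"), None),  # skip whitespace
        (re.compile(r"."), None),    # skip anything else (punctuation etc.)
    ]

    def tokenize(text):
        pos = 0
        while pos < len(text):
            for pattern, kind in RULES:
                m = pattern.match(text, pos)
                if m:
                    if kind:
                        yield kind, m.group()
                    pos = m.end()
                    break

    print(list(tokenize("ham, egg.")))  # [('WORD', 'ham'), ('WORD', 'egg')]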

------
pedalpete
I just found LingPipe on Monday and haven't had a chance to try it yet, but
it does 'entity extraction' as part of its text-mining toolkit. Not sure if
that is what you're looking for. It's a Java library.
<http://alias-i.com/lingpipe/>

Anybody have any comments about it?

