

How to Split Sentences (2014) - f00biebletch
http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html

======
diasks2
I did an analysis of different sentence segmentation tools when I was working
on my own rule-based segmenter. The results can be found in this README
([https://github.com/diasks2/pragmatic_segmenter](https://github.com/diasks2/pragmatic_segmenter)).

I think this blog post almost hits on the key point in the middle: in my
opinion it is important to test all of the edge cases. The problem with most
corpora typically used to test segmenters is that 80-90% of the sentences are
the same kind (i.e. a regular sentence ending in a period). Thus a segmenter
that simply split the text at every period would still show an 80-90% accuracy
rate. This is why I am trying to develop a standardized set of edge cases:
[https://github.com/diasks2/pragmatic_segmenter#the-golden-rules](https://github.com/diasks2/pragmatic_segmenter#the-golden-rules)
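
To make the accuracy point concrete, here is a minimal sketch in Python. The corpus, the `naive_split` baseline, and the `accuracy` helper are all made up for illustration (they are not taken from pragmatic_segmenter or any real corpus), and accuracy is counted per test case rather than per boundary.

```python
import re

def naive_split(text):
    """Deliberately naive baseline: split after every period followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def accuracy(cases, segmenter):
    """Fraction of (text, expected) cases the segmenter reproduces exactly."""
    correct = sum(segmenter(text) == expected for text, expected in cases)
    return correct / len(cases)

# Toy corpus (hypothetical): 8 regular cases, each two plain sentences.
plain = ["The cat sat.", "The dog ran.", "It rained.", "We stayed in.",
         "She left early.", "He stayed late.", "I like tea.", "You like coffee."]
regular = [(f"{a} {b}", [a, b]) for a, b in zip(plain, plain[1:] + plain[:1])]

# Two edge cases where a period does not end the sentence.
edge = [
    ("The meeting is at 5 p.m. on Friday. Bring notes.",
     ["The meeting is at 5 p.m. on Friday.", "Bring notes."]),
    ("Dr. Smith arrived. He was late.",
     ["Dr. Smith arrived.", "He was late."]),
]

print(f"mixed corpus: {accuracy(regular + edge, naive_split):.0%}")  # 80%
print(f"edge cases:   {accuracy(edge, naive_split):.0%}")            # 0%
```

On the mixed corpus the naive splitter looks respectable, while on the edge cases alone it gets nothing right, which is the gap a dedicated edge-case set is meant to expose.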

~~~
Xeoncross
Great work. I love the "Golden Rules" list you compiled. It seems like teams
develop their NLP systems without sharing a common training set, which leaves
some teams without tests for things like the "a.m. / p.m." case.

~~~
diasks2
See my comment below for some of the reasons I've had issues trying to test
against the commonly used segmentation corpora. I completely agree it would be
great if there were a free (as in both speech and beer) common training set.
One key requirement would be that it provide either the exact text that should
be run through the segmenter or exact instructions on how to produce that text
(see the issue I mention below about the ambiguity around how to actually test
against the Brown corpus).
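
As a rough illustration of "provide the exact text", a shared test file could store the raw input string next to the expected sentences, so nobody has to guess how the input was reconstructed. The JSON layout, field names, and `failing_cases` helper here are hypothetical, not an existing format.

```python
import json

# Hypothetical shared test file: each case stores the exact input string
# (whitespace included) alongside the expected sentences.
SUITE_JSON = """
[
  {"name": "two plain sentences",
   "text": "It rained. We stayed in.",
   "expected": ["It rained.", "We stayed in."]},
  {"name": "time abbreviation",
   "text": "We met at 9 a.m. on Monday. It went well.",
   "expected": ["We met at 9 a.m. on Monday.", "It went well."]}
]
"""

def failing_cases(suite_json, segmenter):
    """Return the names of cases where segmenter(text) differs from expected."""
    cases = json.loads(suite_json)
    return [c["name"] for c in cases if segmenter(c["text"]) != c["expected"]]

def no_split(text):
    """Stand-in segmenter that never splits, used only to exercise the harness."""
    return [text]

print(failing_cases(SUITE_JSON, no_split))
```

Any segmenter that can be wrapped as a function from a string to a list of strings can be plugged in the same way, and every tool is then scored on byte-identical input.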

------
kylebgorman
Self-promotion: I wrote an open-source sentence splitter tool that outperforms
the state of the art on the "standard split". It is also very fast.

[http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/](http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/)
(link to GitHub repo in post)

------
jbrooksuk
I've just added this kind of support to node-summary
([https://github.com/jbrooksuk/node-summary](https://github.com/jbrooksuk/node-summary)),
which seems to make a bit of a positive difference in the tests.

------
andrewtbham
Is anyone aware of a sentence segmenter for poorly written English that is
missing some punctuation, like text from chat sessions? It could also be
useful for normal sentence segmentation, i.e. if you ignore the punctuation,
can you still detect the sentence boundaries?

