

Ask HN: What can I do with vast amounts of text?  - philtar

I've started collecting a lot of data in Arabic. The thing with Arabic is that there's Modern Standard Arabic (MSA), which is used in formal communication, and then each country's dialect, which is used in informal communication. I have a lot (tens of millions of sentences) of each, and I'm working on tagging them as either MSA or dialect (and which dialect it is).

But now what? Natural language processing? Something else? Any low-hanging fruit I can work on?
======
agibsonccc
I could understand if you don't want to reveal the nature of the data, but
what kind of data is it? IM conversations? Reviews? Data-mining-wise, it would
be neat to build something like a named entity tagger or do sentiment
analysis with it. If you're really ambitious, you could try something like
relation extraction.

There are a lot of ways to monetize data like that, especially with some of
the tools already mentioned here.

~~~
philtar
The majority of it is newspapers, tweets, Facebook statuses, Wikipedia, and TV
show and movie scripts.

I've been able to do sentiment analysis with a high (80%+) level of accuracy
using only a fraction of what I currently have. I don't think it's really
monetizable even though it's (afaik) the only one of its kind. It's not that
hard to make.

I'm interested in named entity tagging and other things like that that I know
for a fact don't exist for Arabic or any of its dialects. Who would pay for
that, though?

~~~
agibsonccc
Since you have all of this text, I'm assuming you have access to that
audience. You could easily fill a language-specific niche that few others
serve.

Here's an example:
[http://blog.repustate.com/arabic-sentiment-analysis-a-long-j...](http://blog.repustate.com/arabic-sentiment-analysis-a-long-journey-is-complete/2012/09/17/)

Many of the text analysis APIs on the web only do English, Spanish, and maybe
Chinese.

If you could come up with a semi-decent system, you could talk to marketers,
and even to other developers.

For example, I've had people come to me for something as obscure as
recognizing locations in text so they could do geotagging in a mobile app.

If you have any particular industry verticals you're familiar with, there's
probably a niche somewhere.

------
lsiebert
Low-hanging fruit: find and publish character and word frequencies. Character
frequencies could be used to design an Arabic equivalent of the Dvorak layout,
or something like that. That could actually sell.
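
A rough sketch of the frequency counting, in case it helps. It uses only the Python standard library; the function name `char_and_word_freq` and the two sample strings are made up for illustration, and splitting on whitespace is a simplifying assumption that real Arabic text (punctuation, clitics) would outgrow quickly.

```python
from collections import Counter
import unicodedata


def char_and_word_freq(sentences):
    """Count letters and whitespace-separated tokens over an iterable of sentences."""
    chars = Counter()
    words = Counter()
    for sentence in sentences:
        # Keep letters only: Unicode categories starting with "L".
        # This skips spaces, digits, punctuation, and combining diacritics.
        chars.update(ch for ch in sentence if unicodedata.category(ch).startswith("L"))
        # Naive tokenization (assumption): split on whitespace.
        words.update(sentence.split())
    return chars, words


# Toy input, invented for this example (not real corpus data).
sample = ["من انا", "انا هنا"]
chars, words = char_and_word_freq(sample)

print(words.most_common(3))
print(chars.most_common(3))
```

Since the function takes any iterable, you could pass it a generator that reads your files line by line instead of loading tens of millions of sentences into memory.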

If you are tagging, take a look at supervised machine learning.
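
To make "supervised" concrete, here is a minimal sketch of one possible approach: a naive Bayes classifier over character trigrams, written with the standard library only. This is my own illustration, not something from your pipeline. The class name `NaiveBayesTagger`, the label strings, and the toy phrases are all assumptions, and a real tagger would need far more labeled data plus a held-out test set.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesTagger:
    """Multinomial naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()             # label -> number of sentences
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n)
            self.ngram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence, self.n)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            # Add-one smoothing, with one extra slot for unseen n-grams.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(doc_count / total)  # log prior
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: one phrase per label, chosen by me for illustration only.
train_sentences = ["ماذا تريد", "عايز ايه", "شو بدك"]
train_labels = ["msa", "egyptian", "levantine"]

tagger = NaiveBayesTagger().fit(train_sentences, train_labels)
new_sentence = "عايز اروح"
print(tagger.predict(new_sentence))
```

Character n-grams are a common choice for language and dialect identification because they need no tokenizer, but treat that as a starting point to evaluate rather than a recommendation.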

------
yareally
What do you want to do with the data? That can alter the tools/software one
wishes to use with it.

~~~
philtar
What are my options?

~~~
yareally
I'm just trying to understand your intentions for having the data, but it
sounds like you didn't have a set reason for collecting the data other than
for the sake of recording it? Not that one has to have a reason; I just
presumed you did. If so, then you're looking not only for tools to work with
the data, but also for a reason to use the data to begin with?

~~~
philtar
Yeah. I collected it for the sake of doing anything with it. Now that I have
it all I'm not sure what to do with it.

~~~
yareally
I guess what you can statistically gain from it depends on how the sample data
is skewed. If it was collected fairly randomly without favoring any one
audience, then there are lots of options. If not, then there are more
restrictions.

For example, collecting data from somewhere that appeals to the general
population is going to give different results than collecting data from a
development community.

Tech-wise, you could throw all the data into a full-text search engine like
Sphinx, or alternatively use a library like Lucene or NLTK.

[http://nltk.org/](http://nltk.org/)

[http://lucene.apache.org/](http://lucene.apache.org/)

[http://sphinxsearch.com/](http://sphinxsearch.com/)
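
If it helps to see what a full-text engine does at its core, here is a toy inverted index in plain Python. It is only a sketch of the idea and not how Sphinx or Lucene are actually implemented; the `InvertedIndex` class and the sample documents are hypothetical, and whitespace tokenization is an assumption that would need replacing for real Arabic text.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each token to the set of documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids
        self.docs = []                    # document id -> original text

    def add(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for token in text.lower().split():
            self.postings[token].add(doc_id)
        return doc_id

    def search(self, query):
        """Return documents containing every token in the query (AND search)."""
        tokens = query.lower().split()
        if not tokens:
            return []
        # .get avoids creating empty postings for unknown tokens.
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.docs[i] for i in sorted(matches)]


# Placeholder documents, invented for this example.
index = InvertedIndex()
index.add("breaking news from cairo")
index.add("football news tonight")
index.add("cairo weather report")

print(index.search("cairo news"))
```

The real engines add ranking, stemming, and on-disk storage on top of this, which is why they are worth using once the corpus gets large.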

------
xmpir
Check out [http://gate.ac.uk/](http://gate.ac.uk/), but I'm not sure whether it
supports Arabic.

