
The IIT Bombay English-Hindi Parallel Corpus [pdf] - aq3cn
https://arxiv.org/abs/1710.02855
======
mongodude
For a long time I have been waiting for Indian universities, especially the
IITs, to invest in building and publishing such corpora. As the founder of an
AI/ML startup, I am surprised at the appalling lack of datasets available to
work on Indian problems. Contrast this with Chinese universities, which have
built some world-class datasets for building NLP solutions in Mandarin. Our
sentiment analysis works in 8 different languages, but none of them is an
Indian language, despite our being in India!

~~~
woodson
The data set is released as CC-BY-NC and, thus, cannot be used in commercial
applications.

~~~
rspeer
And also cannot be used in open-source projects.

CC-BY-NC amounts to saying: you can play around with this in demos and
academic projects that no lawyer would ever go after anyway, but you can't use
it for real.

~~~
dylz
Is that entirely a bad thing? If a commercial, for-profit company wants to
enter this field, then it can pay for the data or license it?

~~~
rspeer
And what if an open source project wants to enter this field?

~~~
snugghash
For open source commercial, it's the same as for-profit. They'll probably get
a discount.

For open source non-profit, it'd still be NC and therefore legal?

~~~
rspeer
"Non-profit" is a whole different kettle of fish than "non-commercial". It
doesn't mean you're not selling anything. It doesn't mean you're morally good.
It just means you registered for a particular business status with particular
restrictions. And it's not what CC is talking about.

A key part of the Open Source Definition is that you do not lose your
permission to use and copy the code based on what you do with it. A project
that you aren't allowed to use anymore if you start making money from it is
not Open Source.

"Non-commercial" code restrictions are more like the thing where you're
allowed to look at the Windows source code, if you're an academic and you ask
nicely and you won't ever do anything with it.

------
gumby
Since it's IIT Bombay, I hope for Marathi some day soon.

I know different people have different priorities :-)

~~~
praneshp
Does it matter that it is in Bombay? I thought the prof/student body would be
diverse due to how the intake works.

Nothing against Marathi, I feel like Telugu, Tamil etc would also have equal
chances of happening there.

~~~
gumby
Oh not really, I do agree that languages like Tamil, Telugu, Bengali and
others are sort of “more deserving” in that they have more speakers than
Marathi. I just don’t speak them and probably never will :-( .

The “Bombayness” was only that Mumbai is in Maharashtra. There really are
great folks from all over at that institution. My comment was NOT intended to
favor any community over another! I see someone downvoted my comment, and they
might understandably have thought that was my intention.

~~~
praneshp
Actually I was saying the total opposite of what you inferred. I thought that
despite being in Mumbai, IIT-B would have a very diverse spread of folks due
to the methods of intake (JEE / all-India recruitment of professors). That means
Marathi might not be the language of choice, because it's highly likely that
the researcher is Tamil/Telugu. Unlike someplace like, say, VJTI, that's
likely to have a large Marathi population.

(I didn't downvote btw.)

