Hacker News
The IIT Bombay English-Hindi Parallel Corpus [pdf] (arxiv.org)
103 points by aq3cn 10 months ago | 13 comments

I have been waiting a long time for Indian universities, especially the IITs, to invest in building and publishing such corpora. As the founder of an AI/ML startup, I am surprised at the appalling lack of datasets available for working on Indian problems. Contrast this with Chinese universities, which have built some world-class datasets for building NLP solutions in Mandarin. Our sentiment analysis works in 8 different languages, but none of them is an Indian language, despite our being in India!

The data set is released as CC-BY-NC and, thus, cannot be used in commercial applications.

And also cannot be used in open-source projects.

CC-BY-NC amounts to saying: you can play around with this in demos and academic projects that no lawyer would ever go after anyway, but you can't use it for real.

Is that entirely a bad thing? If a commercial, for-profit company wants to enter this field, can't it pay for the data or license it?

And what if an open source project wants to enter this field?

For open source commercial, it's the same as for-profit. They'll probably get a discount.

For open source non-profit, it'd still be NC and therefore legal?

"Non-profit" is a whole different kettle of fish than "non-commercial". It doesn't mean you're not selling anything. It doesn't mean you're morally good. It just means you registered for a particular business status with particular restrictions. And it's not what CC is talking about.

A key part of the Open Source Definition is that you do not lose your permission to use and copy the code based on what you do with it. A project that you aren't allowed to use anymore if you start making money from it is not Open Source.

"Non-commercial" code restrictions are more like the thing where you're allowed to look at the Windows source code, if you're an academic and you ask nicely and you won't ever do anything with it.

A lot of universities are open to granting a separate commercial license when contacted. They charge for their efforts, which is fair. Whether public universities should charge, given that we already fund them through our taxes, is a different issue. Source: I am also a cofounder at an AI startup, and we often buy licenses to use academic datasets commercially.

I guess it’s not the most demanding requirement at the moment. But happy to see progress.

Btw, where do you work? Can we see your work?

Since it's IIT Bombay, I hope for Marathi some day soon.

I know different people have different priorities :-)

Does it matter that it is in Bombay? I thought the prof/student body would be diverse due to how the intake works.

Nothing against Marathi, but I feel like Telugu, Tamil, etc. would have an equal chance of happening there.

Oh not really, I do agree that languages like Tamil, Telugu, Bengali and others are sort of “more deserving” in that they have more speakers than Marathi. I just don’t speak them and probably never will :-( .

The “Bombayness” was only that Mumbai is in Maharashtra. There really are great folks from all over at that institution. My comment was NOT intended to favor any community over another! I see someone downvoted my comment, and they might understandably have thought that was my intention.

Actually, I was saying the total opposite of what you inferred. I thought that despite being in Mumbai, IIT-B would have a very diverse spread of folks due to its methods of intake (JEE / all-India recruitment of professors). That means Marathi might not be the language of choice, because it's highly likely that the researcher is Tamil or Telugu. Unlike someplace like, say, VJTI, which is likely to have a large Marathi population.

(I didn't downvote btw.)
