Using aligned word vectors for instant translations with Python and Rust (instantdomainsearch.com)
110 points by beau 6 days ago | 35 comments





We've released the underlying Rust implementation here: https://github.com/InstantDomain/instant-distance with Python bindings at https://pypi.org/project/instant-distance — feedback welcome!
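Usage follows the usual build-then-search HNSW pattern. A rough sketch of what that looks like from Python (the class and method names below are illustrative, not necessarily the real bindings API; see the README on PyPI for the exact names):

    import instant_distance

    # Toy 2-dimensional points; real use would be 300-dim word vectors.
    points = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

    # NOTE: Config/Hnsw.build/Search are assumed names for this sketch.
    config = instant_distance.Config()
    hnsw, ids = instant_distance.Hnsw.build(points, config)

    search = instant_distance.Search()
    hnsw.search([0.4, 0.6], search)
    for candidate in search:
        print(candidate)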

For Linux, in the Makefile change the copy command to

cp target/release/libinstant_distance.so instant-distance-py/test/instant_distance.so

and it works: built and running. The main tree was macOS-only.

Here's resource consumption in a sample run.

Time: 4.49s, Memory: 1552 MB.

Single word. Three langs including en.
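If you want to reproduce this kind of measurement, here's a minimal stdlib sketch (Linux reports ru_maxrss in kilobytes; on macOS it's bytes):

    import resource
    import time

    start = time.perf_counter()
    # ... build the index and run the lookup here ...
    elapsed = time.perf_counter() - start

    # Peak resident set size of this process; kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"Time: {elapsed:.2f}s, Memory: {peak_kb / 1024:.0f} MB")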


How did you figure this out? I've done lots of Linux software build troubleshooting as a result of using Gentoo, Buildroot, and pacaur, but this doesn't ring any bells as a common issue.

They probably tried it, found that it couldn't locate the .dylib (the macOS shared-library format), opened the three-line Makefile, and fixed it to copy a .so instead.

Did you try spaCy's most_similar method? It's written in Cython, so it's presumably quite fast as well. Thanks for the Rust implementation, though; I will most likely use this.
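For reference, that route looks roughly like this, assuming a model with word vectors installed, e.g. en_core_web_md (Vectors.most_similar does a batched brute-force search in Cython):

    import spacy

    nlp = spacy.load("en_core_web_md")

    # most_similar expects a 2D array of query vectors.
    query = nlp.vocab["hello"].vector.reshape(1, -1)
    keys, _rows, scores = nlp.vocab.vectors.most_similar(query, n=10)
    for key, score in zip(keys[0], scores[0]):
        print(nlp.vocab.strings[key], float(score))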

I’ve not much to say on the actual lib; it seems great! However, don’t feel compelled to put all your Rust code into a single lib.rs. You can split your work into several files and use ‘pub use’ and ‘mod’ in lib.rs to re-export your functions & types into a public API of your choosing.

cargo check and format time might also slightly improve!


Funny, I often say the opposite. Don't feel compelled to split up your lib.rs. It's really refreshing to see a nice, compact library in one or two files. Much easier to follow, especially compared to "type per file". Of course, there are limits, but for a small lib like this, I personally would keep it to one file, or maybe two.

I have a fair bit of experience writing Rust code, and the current layout is entirely deliberate. I find module sizes of about 400-800 lines of code optimal in terms of my ability to find things, versus the unnecessary complexity of having to skip around files when changing something that touches an API boundary.

This webpage uses a significant amount of CPU constantly for no apparent reason (as far as I can see, it is mostly a static webpage). What the hell? Is it mining crypto in the background?

Sorry, this page had a useEffect/setState render loop. We are running react@experimental with concurrent mode, and missed the error. Rolling out a fix now. Thanks!

At a quick glance it seems like some React component is constantly re-rendering.

Quick glance in this case: I took a couple-second snapshot on the Performance tab and saw a lot of React-related calls.


Compiling rust.

> For example, here are the results of translating the English word "hello":

> Language: fr, Translation: bonjours

> Language: fr, Translation: bonsoir

> Language: fr, Translation: salutations

> Language: it, Translation: buongiorno

> Language: it, Translation: buonanotte

> Language: fr, Translation: rebonjour

> Language: it, Translation: auguri

> Language: fr, Translation: bonjour,

> Language: it, Translation: buonasera

> Language: it, Translation: chiamatemi

Is it just me, or are these machine translations worse than... Google Translate?


These results are less accurate than Google Translate's, but they are far faster and far less expensive to generate: https://cloud.google.com/translate/pricing. Our goal here is speed. We want to search through many possibilities as quickly as possible.

The word vectors have been aligned in multiple languages. Using an approximate nearest neighbor search we are able to find the nearest vector to the input in multiple languages very quickly.
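In miniature, with brute-force cosine similarity standing in for the HNSW index (file names follow the fastText aligned-vectors release):

    import numpy as np

    # Read a fastText .vec file into (words, row-normalized matrix).
    def load_vectors(path, limit=50000):
        words, rows = [], []
        with open(path, encoding="utf-8") as f:
            next(f)  # header line: "<count> <dim>"
            for i, line in enumerate(f):
                if i >= limit:
                    break
                parts = line.rstrip().split(" ")
                words.append(parts[0])
                rows.append(np.asarray(parts[1:], dtype=np.float32))
        mat = np.vstack(rows)
        mat /= np.linalg.norm(mat, axis=1, keepdims=True)
        return words, mat

    en_words, en_vecs = load_vectors("wiki.en.align.vec")
    fr_words, fr_vecs = load_vectors("wiki.fr.align.vec")

    # Because the spaces are aligned, the nearest French vectors to the
    # English vector are translation candidates.
    query = en_vecs[en_words.index("hello")]
    scores = fr_vecs @ query  # cosine similarity on unit vectors
    for i in np.argsort(-scores)[:5]:
        print(fr_words[i], float(scores[i]))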

To keep the example simple, we did not try to filter the data through hand-built language dictionaries. Instead, we simply drop words in other languages that also appear in the English .vec file. Words like "ciao" appear frequently enough in otherwise-English sentences that the example code drops it from Italian, so it is not shown in the results:

% curl -s "https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki..." | grep -n ciao
50393:ciao 0.0120 ...

One improvement would be to filter out any words that do not appear in a hand-curated dictionary, instead of filtering out words that already appear in English. We decided not to show how to do this because we'd already introduced a few concepts, like aligned word vectors and approximate nearest-neighbor search, and wanted to keep the example as simple as possible.
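The English-overlap filter itself is tiny; roughly:

    # Drop any Italian word that also appears in the English .vec
    # vocabulary ("ciao", "hotel", ...).
    def vocab(path):
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the "<count> <dim>" header
            return {line.split(" ", 1)[0] for line in f}

    italian = vocab("wiki.it.align.vec") - vocab("wiki.en.align.vec")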


Google Translate is state of the art, so I’m not sure why that would be surprising. That said, is there something wrong with the translations offered?

> Google Translate is state of the art

For French/English/German, DeepL is much better IME.


> That said, is there something wrong with the translations offered?

I think in French hello = "bonjour" and hi = "salut"... not sure where "bonjours" and "salutations" came from.


The Italian "auguri" means "best wishes"; "chiamatemi" means "call me". Neither is a plausible translation of "hello". The obvious one, "ciao", is missing.

I thought Hello was invented with the telephone. Prior to that, English greetings were good morning/evening. What do Italians and French say when they pick up the phone? Allora?

"Bonjours" doesn't exist, and "salutations" is a tad quirky, but OK in informal settings, especially when addressing many people at once.

No, "bonjours" exists (it's simply the plural of "bonjour" used as a noun), but the contexts in which it is used are very infrequent, so it's weird to find it in that list.

Compare "I said my hellos and goodbyes." in English. It's definitely a word, just so uncommon as to be largely irrelevant in most practical contexts.

It's very clearly semantically related though? I don't understand the complaint here.

It seems to be a very domain-specific solution: they are trying to present variations of words in customer-requested domain names if the name is already taken.

Like you type in “stargazer.com”, the system sees it’s already registered, and returns a “sorry sir it’s taken” page, with similar words listed as “but maybe try these words: astronomer, observatory, telescope, shooting star...”.

So it’s not serious translation, more of an inexpensive quick dictionary search. I guess it’s okay for its intended purposes.
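Something like this flow, as a toy sketch (the vocabulary, vectors, and availability check below are all stand-ins):

    import numpy as np

    # Stand-in vocabulary and random unit vectors; real code would load
    # the aligned fastText vectors instead.
    words = ["stargazer", "astronomer", "observatory", "telescope"]
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(words), 8)).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    def is_available(domain):
        return False  # stub; a real check would query a registrar API

    def suggest(label, n=3):
        query = vecs[words.index(label)]
        order = np.argsort(-(vecs @ query))
        neighbors = [words[i] for i in order if words[i] != label]
        return [w + ".com" for w in neighbors[:n]]

    if not is_available("stargazer.com"):
        print("taken; maybe try:", suggest("stargazer"))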


It would be better to run the vectors through an attention layer if you want sentence-to-sentence translation.

Was disappointed this can't translate from Python to Rust.

Can something like this be done to compare subsequences of the COVID genetic code to SARS and other virus genetic codes? It would be interesting to see how much overlap there is, and it would further the research into where it came from.

The full genome of COVID-19 is available:

https://www.snapgene.com/resources/coronavirus-resources/?re...


Bioinformaticians have been able to do that with traditional algorithms for years (dynamic programming gets you a long way toward computing an edit distance, for example).

It was probably the first thing done once the COVID-19 genome was made public. A quick googling gave me this summary of the results: https://www.news-medical.net/health/How-Does-the-SARS-Virus-...
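The DP kernel is tiny; e.g. plain Levenshtein edit distance over two sequences:

    # Classic O(len(a) * len(b)) dynamic-programming edit distance;
    # Needleman-Wunsch alignment is the same table with biology-specific
    # scoring.
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(
                    prev[j] + 1,               # deletion
                    cur[j - 1] + 1,            # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = cur
        return prev[-1]

    print(edit_distance("ACGTTGCA", "ACTTGGCA"))  # 2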


Great article, thank you

It sounds like you're thinking of "sequence alignment", which is a pretty standard bioinformatics tool.

BLAST (Basic Local Alignment Search Tool) is one common version, and the NIH's NCBI has a variety of nice online tools here: https://blast.ncbi.nlm.nih.gov/Blast.cgi

Note that it does take a little bit of background knowledge to interpret: some motifs are just really common, others are shared.
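If you want to script it, Biopython wraps the same web service. A sketch (NCBIWWW.qblast is a real call, but it hits NCBI's servers and can be slow; the sequence here is just a placeholder fragment):

    from Bio.Blast import NCBIWWW

    fragment = "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCT"
    handle = NCBIWWW.qblast("blastn", "nt", fragment)
    print(handle.read()[:500])  # results come back as XML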


Nice example.

The short text, and the fact that your application would tolerate or even celebrate catchy neologisms, play to fastText's strengths.


Thank you!

> fast translates to vite in French

Only as an adverb; otherwise it should be "rapide".


At first glance at the title, I thought it was translating Python code to Rust code.

It may not be a bad candidate for writing the Rust part in Python and then running it through py2many to generate the Rust.


