
Perhaps it's better to set it to 2 years, then. A clock that's counting time towards its own death. There is some grim poetry in that.


Do you just use development boards with these? And which ones?


You don't really need a dev board for these; you can just stick them in a breadboard and go. They're already in a DIP package and don't need any extra circuitry besides whatever you're going to connect them to. You can also program them using basically any other microcontroller you may have, or with a cheap USB programmer, so they're pretty cheap to get started with (though it's not exactly as user-friendly as plugging an Arduino into USB).


The ATtiny can be programmed with an Arduino Uno, which is what I used. It's pretty easy. But there are also cheaper boards made specifically for the ATtiny.


Can any of these be programmed with Rust?


Yes! Theoretically anything that Rust has support for (https://forge.rust-lang.org/platform-support.html), but I think most of the recent work has gone into ARM Cortex-M support. Jorge Aparicio in particular has worked to make embedded programming in Rust a joy (see https://japaric.github.io/discovery/ for an introduction). If you use STM32F MCUs you will probably have the lowest friction. I've heard of other people using the NXP Kinetis lineup as well. Do mind, though, that there are not many peripheral libraries yet; I'm currently fighting to get USB working myself. Of course, you could still use the vendor-provided C libraries and link against them if you don't want to go 100% Rust.

The best part: You can use many of Rust's high level zero-cost abstractions. Just take a look at this example code which compiles to very efficient assembly: https://github.com/japaric/discovery/blob/116fe76491d661b5ee...


Seriously curious about this. If not these ones, which other microcontrollers have decent or better Rust support? Is there a list somewhere? I'm keen to find more practical uses for my Rust skills so I can enjoy using Rust without having to find excuses to use it where I'd be more productive in other languages.


Your sibling comment has some good links, but a decent way to think about it is, from hard to easy:

* does it have an LLVM backend?

* is it listed on the forge page?

* is it listed as having std support?

The farther you go down this list, the simpler it gets to use Rust for it.

The latest work has been on ARM boards, plus a ton of work on AVR, which brings 8/16-bit support.


How can this be used for full-text search, e.g. with Lucene? The first step in indexing a document for full-text search is reducing each word to its base form, and similarly for a search string. While it's not a difficult problem in English, in some languages (e.g. Hebrew) it's notoriously hard to figure out the base form of a word and further disambiguate its meaning, as the only way to do so is based on context. So how can you easily build a stemmer/lemmatizer on top of these tools to perform such a task?


This presentation from 2015 [1] answers your question. The basic idea is to create a list of keywords and/or phrases for your corpus. It can either be done manually or automatically using gensim or a similar tool. Then use word2vec to create vectors for the keywords and phrases. Cluster the vectors and use the clusters as "synonyms" at both index and query time using a Solr synonyms file.
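
A minimal sketch of that pipeline (not from the talk itself; gensim + scikit-learn, with the corpus, keyword list and cluster count as placeholders):

    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans

    # corpus: list of tokenized documents; keywords: your curated terms/phrases (both assumed to exist)
    model = Word2Vec(corpus, vector_size=100, min_count=5)
    keywords = [w for w in keywords if w in model.wv]
    vectors = [model.wv[w] for w in keywords]

    # Cluster the keyword vectors (needs at least n_clusters keywords; try several sizes)
    labels = KMeans(n_clusters=500).fit_predict(vectors)

    # Write a Solr-style synonyms file mapping each keyword to its cluster id,
    # to be applied at both index and query time
    with open("synonyms.txt", "w") as f:
        for word, label in zip(keywords, labels):
            f.write(f"{word} => cluster_{label}\n")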

You can also use Brown clustering [3] to create the clusters. It does a good job and is faster to compute than clustered word vectors. However, clustered word vectors typically have better semantic performance.

1. https://www.slideshare.net/mobile/lucidworks/implementing-co...

2. Demo source: https://github.com/DiceTechJobs/ConceptualSearch

3. https://github.com/percyliang/brown-cluster


Some clarifying questions:

1) Do words in a generic corpus (such as Wikipedia) actually form well-separated clusters?

2) Is it correct that you find word clusters in the corpus as a preprocessing step (as opposed to at indexing or query time)?

3) Do I understand correctly that you use all words in clusters as synonyms and pass them to Solr at query/indexing time? Is it query time, index time or both?

4) Given a language where words have many syntactic forms (e.g. buy-bought-buying), how does it work with clusters? Do both syntactic forms and synonyms end up in the same cluster? Wouldn't it be beneficial to treat many of these different forms as the same word (i.e. perform stemming) and only list truly different, but closely related concepts as synonyms?


1. It should. The talk recommends using multiple cluster sizes (e.g. 50, 500, 5000) and giving more weight in the query to smaller clusters. Ideally you would run word2vec on your own domain-specific corpus and then cluster, but that only works if your corpus is of sufficient size.

2. Correct. The goal of the pre-processing step is to generate a Solr synonyms file which can be added to your index mapping.

3a. You could use all the words, but in general I would advise against it. Using all the words from Wikipedia or Google News would be similar to using a thesaurus, which can add a lot of noise. For example, the word "cocoa" could mean chocolate, a city in Florida, or a programming framework. It is better to use a list of domain-specific keywords and phrases as a filter for which words are added to the Solr synonyms file. However, if your corpus is Wikipedia, Google News, or something equally generic, then using all the words makes sense.

3b. It must be both query and index time. For example, the phrase "java developer" would have the mapping "java developer => cluster_15" in the synonyms file. In order for the search terms "java developer" to match cluster_15, "cluster_15" must be indexed in place of "java developer".

4. The different forms will most likely end up in the same cluster, but stemming would guarantee it.


4) But recall that the language in question is one where stemming is hard. If you expand every form of every word in Hebrew, you obtain something like 600,000 words, and many of them have completely different meanings due to syntactic coincidences and short roots. So, ideally, the first step would be to a) determine what exactly each word in the document means in the given context, and b) replace it with an unambiguous identifier.

For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding).


I don't know anything about Hebrew, but maybe lemmatization would be better than stemming, since it takes meaning and context into account. It is also possible that it is an unnecessary step for clustered word vectors in your use case. If it were me, I would try without stemming/lemmatization first.

EDIT

I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec.

1. https://github.com/synhershko/HebMorph


Please forgive me for attempting to milk as much as possible from this discussion - I just don't have many opportunities to get useful advice on this subject, and I've been mulling it over for a long time.

> But maybe lemmatization would be better than stemming

You're right, I'm using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.

> It is also possible that it is an unnecessary step for clustered word vectors for your use case

I don't focus on a specific use case; I'm just trying to find a way to enable full-text search for Hebrew. Searching based on concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But I'm just thinking about what a typical cluster would look like, and I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match on all these forms will inevitably create a large number of false positives due to the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep reiterating my original question of whether this system can first be used for lemmatization and only then for everything else.


For English NLP, I often stem first because it usually reduces noise. I think your main concern is that the abundance of homonyms will increase noise, which is certainly possible. Because I don't know Hebrew, I don't have any intuition on what may work. My advice is to experiment: cluster some Hebrew text without lemmatization, cluster with lemmatization using that Hebrew analyzer I linked, and see what the results are. Also, maybe a literature review will yield experiments done with Hebrew and word embeddings/vectors. Sorry I cannot be of more help.

EDIT

I found this paper which may answer your question about lemmatization and word vectors.

http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...


Thanks, I know about HebMorph. Its authors don't want it to be used for commercial purposes (at least not for free), which limits its usability beyond simple experiments. As to your second link, it confirms my suspicion that lemmatization is important for Hebrew, but the code referenced in its footnotes is equally hostile to commercial usage. I was really hoping word2vec or other new tools would enable building a lemmatizer from scratch without much hassle.

Thanks for your advice, anyway.


I think using word vectors for lemmatization is an interesting idea and on the cutting edge. Here is a paper which discusses it. https://link.springer.com/chapter/10.1007/978-3-662-49192-8_...


Thanks! That paper is extremely helpful. Still, there is one thing missing to complete the picture for me. At the input, I have a list of words that I want to index or query. When indexing, they usually form a sentence; when querying, they might just be keywords. But in both cases the words will usually be selected by the user/author in such a way that a human who reads all the words of the input together is able to disambiguate the meaning of every word. That disambiguation step is precisely what I'm missing.

Let's say the user entered three words: A B C. You look up each of them among the vectors and discover that there are three matching vectors for A, four for B and five for C (and for the sake of generality let's assume that the input has more than just 3 words, so it's impractical to test every subset of them for co-occurrence). How do you jointly select the correct vector for each word?


Actually, I have an idea, albeit not without some doubts.

Let x1 be the number of vectors matching A, x2 the number of vectors matching B, and so on up to xn. Let c1..cn be a particular selection of vectors. Now my main assumption here is that in order to determine which of these vectors are most often encountered together in the same context [1], our goal is to find j that maximizes sum_{i from 1 to n, i!=j}[d_i], where d_i=(c_j dot c_i) if the dot product is nonnegative, otherwise d_i=0. I'm not sure this is valid, primarily because I don't know whether summing these dot products adds apples to apples or apples to oranges.

Then in order to find the best selection of vectors c1..cn we can iterate on every vector v_k matching A and dot v_k with every vector matching B, then pick the maximum m2 (or 0 if it's negative); dot v_k with every vector matching C, then pick the maximum m3; etc. Thus, for the k'th iteration we obtain the selection of vectors that maximizes M_1k=sum_i[m_i]. After we're done with all x1 iterations, we pick the best such selection M1=max_k[M_1k]. This is all done in O(x1(x2+x3+...+xn)) time.

Next, we repeat the above process for all x2 vectors matching B and obtain M2, etc, etc. Ultimately, we pick the selection of vectors that produced the highest M_t across all choices of t. Overall, we get O((sum_i[xi])^2), which seems fast enough. What do you think?

[1] One obvious problem is this limits the number of contexts we match against to just one.
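
For concreteness, here is a rough sketch of the procedure I have in mind (purely illustrative; candidates[i] is a hypothetical list holding the vectors that match the i-th word):

    import numpy as np

    def best_selection(candidates):
        # candidates[i]: list of candidate sense vectors for word i.
        # Returns the selection with the highest total agreement with a single anchor vector.
        best_score, best_pick = float("-inf"), None
        for j, anchors in enumerate(candidates):
            for anchor in anchors:                    # every vector matching word j
                pick, score = [], 0.0
                for i, options in enumerate(candidates):
                    if i == j:
                        pick.append(anchor)
                        continue
                    dots = [max(0.0, float(np.dot(anchor, v))) for v in options]
                    k = int(np.argmax(dots))          # option most aligned with the anchor
                    pick.append(options[k])
                    score += dots[k]
                if score > best_score:
                    best_score, best_pick = score, pick
        return best_pick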


Each token would have one vector from word2vec. A token could be a word or a phrase, depending on the pre-processing. The words in a phrase are usually concatenated with an underscore (e.g. "machine_learning"). I recommend gensim if you want/need phrases.


Ah, you're right, word2vec assigns one vector to each word, as opposed to one vector to each meaning. Then the problem remains: we can't differentiate between homonyms.

But it seems it's been solved, too: https://github.com/sbos/AdaGram.jl


There is also sense2vec which I think tries to do something similar. https://explosion.ai/blog/sense2vec-with-spacy


Also, this makes me wonder what other things you can do with vectors. If you compute the dot product of a verb or a noun with a "singular" - "plural" difference vector, will it give a positive value for plurals and a negative one for singulars (or vice versa)?
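
A quick way I could test this (sketch using gensim vectors wv, assumed to be already trained; the word pairs are just examples):

    import numpy as np

    # Build a number-difference axis from a few known singular/plural pairs
    pairs = [("cars", "car"), ("dogs", "dog"), ("ideas", "idea")]
    plural_axis = np.mean([wv[p] - wv[s] for p, s in pairs], axis=0)

    for w in ["tables", "table", "runs", "run"]:
        print(w, float(np.dot(wv[w], plural_axis)))   # sign should hint at plural vs singular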


No idea. Experiment!


Thank you! This sounds exactly what I was imagining. Very exciting!


For integrating into Solr, I've used Word2vec to improve the ranking of the synonyms it finds (boosting synonyms by how similar they are to the query). In English, Word2vec tends to treat plurals as synonyms.
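
A rough sketch of what I mean (gensim KeyedVectors wv assumed loaded; the similarity then becomes the per-term boost, e.g. term^0.83, in the Solr query):

    def synonym_boosts(wv, query_term, topn=5):
        # Expand the query term and use cosine similarity as the boost weight
        return [(syn, round(sim, 2)) for syn, sim in wv.most_similar(query_term, topn=topn)]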

I did a talk on this with more details: https://www.slideshare.net/GarySieling/ai-with-the-best-buil...

Sense2Vec might also help: https://github.com/explosion/sense2vec

There is also a project called Vespa, which looks potentially interesting as a replacement for Lucene - https://github.com/vespa-engine/vespa


The dice talk mentions that as well (weighting by word2vec similarity). It's important to note that word2vec, LSA, sense2vec, etc. all find words that are RELATED but not necessarily SYNONYMOUS. For instance, antonyms like black and white or rich and poor often appear in the same contexts and have the same word type, but are opposite in meaning. Similarly, politicians on opposite ends of the political spectrum will usually get assigned similar vectors and the same cluster, as they tend to appear in similar contexts. These models use the context (the word window in word2vec and GloVe, the document in LSA) to determine a measure of similarity, but the context for antonyms is typically very similar.

Attaching the part-of-speech tag to each word before pushing it through these models can help, as it enforces that grammatical relation, but it won't address all of these issues (e.g. black and white are both adjectives). In my experience, if people mostly search for nouns in your search engine (e.g. job search), this issue is less of a concern, but it can still cause problems. Finally, conceptual search can also help with precision - by matching across all concepts within a document, you can help disambiguate its meaning, when you have words that have multiple meanings.


> Finally, conceptual search can also help with precision - by matching across all concepts within a document, you can help disambiguate its meaning, when you have words that have multiple meanings

Thanks! I've written up something along these lines here: https://news.ycombinator.com/item?id=15592196

I'd love to hear your opinion if this is going to work.


Run Doc2Vec (or Word2Vec) on a large corpus of text, or download pretrained vectors. To compute a document vector, take a linear combination of the word vectors in the document, weighted by TF-IDF. Now that you have a vector for each document, build a fast index with a library called Annoy: it can do very fast similarity search in vector space over millions of documents. I think this approach is faster than grep and doesn't need to bother with stemming. It will automatically know that "machine learning" and "neural nets" are related, so it does a kind of fuzzy search.
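
A minimal sketch of that setup (a gensim KeyedVectors wv, a docs list of token lists, and an idf dict are all assumed to exist already):

    import numpy as np
    from annoy import AnnoyIndex

    dim = wv.vector_size

    def doc_vector(tokens):
        # TF-IDF-weighted average of the word vectors in a document
        vecs = [wv[t] * idf.get(t, 1.0) for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    index = AnnoyIndex(dim, "angular")
    for i, tokens in enumerate(docs):
        index.add_item(i, doc_vector(tokens).tolist())
    index.build(10)  # 10 trees; more trees = better recall, slower build

    similar_ids = index.get_nns_by_item(0, 10)  # the 10 documents most similar to document 0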


If you wanted it to know that "machine learning" and "neural networks" were related, wouldn't you need to do some type of entity extraction first, since Word2vec is run on tokens?


You can use Gensim:

    from gensim.models.phrases import Phrases
    # corpus: an iterable of tokenized sentences, e.g. [["machine", "learning", "rocks"], ...]
    bigrams = Phrases(corpus)
or you could rank bigrams by count(w1+w2)^2/(count(w1)*count(w2))

Many variations on this formula work, but the idea is to compare the count of the bigram to the counts of the unigrams.

By the way, you do bigram identification before Word2Vec to have specialized vectors for bigrams as well.
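
A small sketch of that count-based scoring (sentences is assumed to be an iterable of token lists):

    from collections import Counter

    def score_bigrams(sentences):
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        # count(w1 w2)^2 / (count(w1) * count(w2)), as in the formula above
        return {bg: c * c / (unigrams[bg[0]] * unigrams[bg[1]]) for bg, c in bigrams.items()}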

Besides this method, there is one great way to identify n-grams: use Wikipedia titles. It's quite an extensive list that covers most of the important named entities, locations and multi-word topic names; or go directly to http://wiki.dbpedia.org/ for a huge list with millions of n-grams. Cross-reference it with your text corpus and you get a nice, clean list.


The original word2vec source code comes with a probabilistic phrase detection tool. Keyword: word2phrase.


Good to know, thanks!


As an alternative to tf-idf, there's an interesting property of the word embeddings generated by word2vec: they're sorted by frequency, with the most common words at the top of the list.

So if you insert them into a database in the same order, you can just use their primary key as the weight for a word. This also has the advantage of filtering out stop words without any additional processing.
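
Something like this, for example (a sketch; vocab.txt is a hypothetical file with one word per line in word2vec's frequency order, and the stop-word cutoff is arbitrary):

    rank = {}
    with open("vocab.txt") as f:
        for i, line in enumerate(f, start=1):
            rank[line.split()[0]] = i   # the "primary key" doubles as a rarity weight

    def weight(word, stopword_cutoff=100):
        r = rank.get(word, 0)
        return 0 if r <= stopword_cutoff else r   # very frequent words act as stop words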


If I understand correctly, this forgoes Lucene entirely. I would really like something that can be integrated into Lucene/Solr due to the availability of all the infrastructure built around it.

> works faster than grep

I didn't quite get the connection to grep.


Suppose you have gigabytes of text; Annoy will find matching articles faster and more precisely than grepping with keywords.


Lucene is faster and better than grep too. Annoy may be better than Lucene's "more like this" query, which is for finding documents in an index similar to a given set of documents. But how would it be helpful for keyword search, which is what is being asked about?


I know, inverted index search is fast; it is the basic search engine algorithm. But there is a difference in the quality of the top-ranked results. With word vectors you can ensure the topic of the whole document is what you want. Many documents mix topics, and some keywords appear by mistake in the wrong place, for example because scraping web text is imperfect and might capture extra text.


Then we have no hope of ever engineering Mars to have enough air for us to breath?


A Martian atmosphere (not sure about a lunar atmosphere) would be lost on a timescale of tens of millions of years. That's very quick in cosmological timescales, but slow enough that any human effort to create or replenish it could be very successful.


That said, the Earth's atmosphere has a mass of about five billion billion tons. For reference, as a species we produce about ten billion tons of concrete each year. This is just to give a sense of the scale of effort involved in replacing a planetary atmosphere.


Depending on the volume of surface ice (especially at the poles), might it be possible to produce an atmosphere on a massive scale via orbital lenses or mirrors? With recent advances in solar sail technology, I can't imagine the implementation would be too far removed from current capabilities.


Think about how large a lens you're talking about. Even if it were one hundred meters across and able to collect 100% of the sun's energy passing through it, the amount of energy produced would be utterly insignificant compared to the problem we're discussing. And how would you get a one-hundred-meter lens to Mars orbit? Even after that, you have to consider that we want an oxygen atmosphere, not one made of water vapor.

I don't know what you mean by "too far removed from current capabilities", but I doubt we'll even start working on the problem for two or three centuries.

No, it seems more likely to me that we'd use our growing knowledge of genetics and psychology to hack out the part of ourselves that needs to be outside, and opt for a purely enclosed existence on Mars.


The IKAROS sail (launched 2010) is 196 m^2, using aluminum as a reflector (about 90% efficiency). With regards to transferring a lens to orbit, that is a self-solving problem - a large lens or mirror can both be used as a solar sail, potentially even hauling additional mass to Mars orbit.

Mars receives 593 W/m^2 of solar flux, so each IKAROS-sized reflector could deliver about 100 kW of power. Given the expected difficulty of large-scale terraforming and colonization efforts, the cost of, say, the equivalent of 10,000 IKAROS-sized mirrors (~1 GW, comparable to a large nuclear plant) seems relatively minor.
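
A quick back-of-the-envelope check of those figures (using only the numbers quoted above):

    flux = 593          # W/m^2, solar flux at Mars
    area = 196          # m^2, IKAROS-sized sail
    efficiency = 0.9    # ~90% for an aluminum reflector

    per_sail = flux * area * efficiency             # ~105,000 W, i.e. roughly 100 kW
    print(per_sail / 1e3, "kW per sail")
    print(10_000 * per_sail / 1e9, "GW for 10,000 sails")   # ~1 GW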

Whether that would be more cost-effective than shipping an equivalently powerful reactor or other generator is questionable - it will presumably depend on our lifting capacity.


That's about 250x the JWST. Given that we're talking about a society that has developed far enough to be sending people to Mars and terraforming the landscape, I don't think that 8 doublings would be an unreasonable multiplier of current capability. Still a flagship, many-decade mission though.


The James Webb mirror is a rigid, astonishingly precise focusing element. A solar energy mirror could instead be merely approximately parabolic, made of a foil instead of cryogenic, made of metalized kapton instead of gold-plated beryllium, would forego focusing elements, and so on.

However, the thought of a telescope-quality mirror 250x the size of the JWST is pretty amazing :)


With lenses you could also burn the soil to produce gases that add to the atmosphere... but we want a breathable atmosphere, with lots of oxygen and very little CO2. And that is a bit harder...

So you also couldn't just vaporize the ice; you would need to split it up. Solar heat could be sufficient, but the water would then be missed everywhere else on Mars where life wants to grow.

And Mars is dry.

So I would first use the water in enclosed habitats. And then, if there is plenty of water left, one could start to think about boiling it off into the atmosphere...

But there might be other options, once you have lots and lots of autonomous machines and rockets available and almost unlimited fuel (sun?). But without that? Not a chance...


I wonder if it would be a reasonable colonization process to establish a non-breathable atmosphere, allowing postprocessing facilities to later separate oxygen from airborne water vapour.

The notable benefits of such an approach would be the ability to easily deliver asteroid-based water deposits (aim them at Mars and let reentry do the rest), as well as a significant simplification of ground-based colonization technology: it's far easier to build resilient habitats for a non-breathable atmosphere than for a vacuum, and the risk of accidents and the difficulty of venturing outside are much, much lessened (a face mask or filter and an oxygen tank instead of a bulky spacesuit), not to mention the radiation protection afforded by a thick atmosphere.


True, I think I would agree; there are many immediate benefits.

But it has to be carefully considered, as it might hinder a long-term plan for a nice, breathable atmosphere.

But once we reach Mars, we will probably be busy at first with primitive things such as life support...


Also, many terraforming proposals consider adding an artificial dynamo (using, say, superconducting rings) to protect both the atmosphere and life on the surface. The energy cost of maintaining an artificial dynamo would be less than the incremental cost of maintaining an atmosphere and dealing with the health risks from radiation.


We could also use the resources on Mars or the Moon more sparingly; building dome bases would provide that.

We could also bore tunnels and put caps on the holes afterward, and have some of those caps channel light into the tunnels too.


NASA proposes building artificial magnetic field to restore Mars’ atmosphere: https://www.universetoday.com/134052/nasa-proposes-magnetic-...


Interesting, it only needs a 1-2 Tesla magnetic field. MRI machines go up to 3 T, which means this isn't too outlandish with known technology. The main issue is that they probably want a Mars-sized field at 1-2 T.


Does that mean it needs a bigger coil?


I don't know the rate of atmospheric loss, but it might be possible to introduce air at a high enough rate to outstrip the losses, making for a sustainable atmosphere. It would presumably depend on the availability of ice and other volatiles.


Unlikely because you can't use a noun as a verb :)

Someday though we may be able to breathe on Mars


Asking as someone who barely has any clue in this field: is there a way to use this for full-text search, e.g. with Lucene? I know from experience that for some languages (e.g. Hebrew) there are no good stemmers available out of the box, so can you easily build a stemmer/lemmatizer (or even something more powerful? [1]) on top of word2vec or fastText?

[1] E.g., for each word in a document or a search string, it would generate not just its base form, but also a list of top 3 base forms that are different, but similar in meaning to this word's base form (where the meaning is inferred based on context).


You can do all that and more. For example, to find lexical variations of a word, just compute word vectors for the corpus and then search for the vectors most similar to a root word that also share its first few letters (the first 3 or 4). It's almost perfect at finding not only legitimate variations, but also misspellings.
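
A small sketch of that trick (gensim KeyedVectors wv assumed trained on your corpus; the prefix length and topn are arbitrary):

    def lexical_variants(wv, root, prefix_len=4, topn=200):
        prefix = root[:prefix_len]
        return [w for w, _ in wv.most_similar(root, topn=topn) if w.startswith(prefix)]

    # e.g. lexical_variants(wv, "connect") might surface "connected", "connecting", "conect", ...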

In general, if you want to search over millions of documents, use Annoy from Spotify. It can index millions of vectors (document vectors for this application) and find similar documents in logarithmic time, so you can search in large tables by fuzzy meaning.

https://github.com/spotify/annoy


I'd rather have a specific system of rules governing what I can and cannot say than discover post factum that what I said was illegal based on lawyers' consensus.

While realistically such consensus is required in some cases anyway due to the imprecise nature of lawmaking, we should still strive for a clear ruleset as a baseline, precisely so that the little guy is as informed of his rights as he reasonably can be.


Sure, there could be common guidelines. I'm also talking civil law: bad speech could maybe get you fined, or banned from certain public speaking facilities, if it's proven in a court of law that you acted inappropriately. I mean, we have these already, and we manage just fine. Defamation and public indecency, for example.

And it doesn't need to go very far. Inciting violence, threats to one's well-being, or insults towards a group, spoken to a crowd of, say, 10+ people, whether verbally, in writing, or by gesture. That would include any medium which assumes an audience of more than 10 people. Yes, I'd include online forums, tweets, etc. in that.

Like, I can't think of any socially relevant commentary that requires you to say things of that sort, unless it's because you're basically rebutting someone else doing the same. And again, we don't need to jail people; a fine would be a good start.

You can even express very inappropriate ideas without resorting to these. For example: "If X were truly less intelligent, it could be that this alone would explain why they are poorer overall, and so no amount of extra care would help, because their intelligence is an innate limitation." No insults, no threats, no violence. Yet still incredibly controversial. Now the people you'd say this to might be tempted to insult you back, but if they did, they'd be fined, so instead they'd also have to be civil. For example: "That's true, but there's no proof that X are innately less intelligent, and there are known examples of Xs who have demonstrated higher than average intelligence, so it's a moot argument, and one that seems to imply a certain innate bias against X from Y. If it were true, and this shows it, that Y had an innate racism against X, then it's possible Y wants to keep X poor."

Please prove me otherwise, but I don't see what's so hard about enforcing some civility in speech when expressed to a sizable audience.


Other people have given you enough arguments against censorship above. I'll just add that "inciting violence, threats to one's well-being, or insults towards a group" are all quite vague and thus contradict the goal of having a clear ruleset.


I understand your point, and others' points. But I'm not seeing real arguments that aren't just: "believe it! the government will use this to censor and oppress the people, and we'll all end up in a dystopian authoritarian regime with our freedoms gone, and everyone afraid to speak up."

Maybe, but it's probably more likely we'll just end up in a place where media and public discourse become slightly more civil, tone down the insults and focus a little more on substance to make their claims.

I similarly don't have a way to prove this would be the outcome, I admit. I rely on a different premise: that we're a reasonable democracy, and that just as we work to fight collusion and corruption, we'd manage to control abuse of such laws in cases where they'd be used to oppress and control in a non-democratic way. I feel that if you don't believe our society can do this, then we're doomed to lose our freedoms to an authoritarian regime anyway.


It seems we are sinking ever deeper into a victimised state of mind where any opinion you express may, and under certain circumstances will, be used against you, even by people in your closest circle. So the best course of action is usually to avoid exposing your opinions, especially on controversial subjects. Just pretend to be a nice person, smile a lot and never say anything explicit.

Criticism can get you in trouble; avoid it if you can. If forced to choose sides, try to guess which one is less likely to direct any kind of negativity towards you. Counterintuitively, it will usually be the more radical side, as radicals are often younger and more energetic in forcing their opinion on others, so the last thing you want is to offend them.

In short, be a spineless, shapeless wuss with an appearance of calm confidence and a lot of theatrical sympathy.


I did it exactly on time. I'm not sure what's meant by an "interlocking part" of circles in Q30, though.


Most of the kids applying were atheists or agnostics. They were Jews by ethnicity, not by religion.


I thought every ethnicity had Jews.


This is a source of major confusion. Ultimately, it boils down to how one identifies him- or herself or, alternatively, how one is identified by others. Jews in the USSR were officially considered an ethnicity like any other, based on blood. Judaism as a religion was suppressed under Communist rule. But even before that, there was a lot of assimilation and secularization going on. Ultimately, by the 1970s there were very few practitioners of Judaism left.

The word "Jew" ("Yevrey") in Russian usually refers to the ethnicity. They use a different word ("Iudjey") when they want to describe someone as a Judaism practitioner.

