I did a lot of the ML work for this. Let me know if you have any questions.
The title might be a little ambitious since we only have two embeddings right now, but it really is our goal to have embeddings for everything. You can see some of our upcoming embeddings at https://www.basilica.ai/available-embeddings/.
We basically want to do for these other datatypes what word2vec did for NLP. We want to turn getting good results with images, audio, etc. from a hard research problem into something you can do on your laptop with scikit.
I.e. do you plan to do the work to align the embeddings of different types of media so that similar contents end up near each other, e.g. an audio recording of a snippet gets a similar embedding to the equivalent text, and a somewhat similar embedding to a picture of what's being described?
In the future I think we'll try to embed into a single space on a best-effort basis, assuming we can find the engineering resources. It will be really hard for some data types, but for the big ones like text/image/audio it isn't that hard, and will probably be valuable to people.
In my mind "embedding" carries the connotation that you're moving into a smaller space that's easier to work with, and where things which are similar in some way are near each other.
When I was learning about SVMs way back when, we said "non-linear projection" instead of "embedding".
So "projection" sounds closer to what you describe.
So my original comment is not quite right either. An embedding is a structure-preserving map, while a projection isn't necessarily structure preserving. To be an embedding, such a mapping must preserve order "both ways".
I think that a projection from p dimensions to q<p dimensions isn't structure preserving.
At this point a mathematician may be able to explain more.
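For reference, the standard order-theoretic statement of "preserving order both ways" for a map f : X -> Y between ordered sets is:

    $$ x_1 \le_X x_2 \iff f(x_1) \le_Y f(x_2) \qquad \text{for all } x_1, x_2 \in X $$

Note the double implication: a lossy projection can collapse distinct points onto the same image, so at best it satisfies the forward direction, which is why it fails to be an embedding.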
So, I don't know a ton about Word2Vec, which probably doesn't help, but I do understand that it makes NLP tasks much easier, since you're going from a massive space (the English language) into a smaller embedding that you learn.
That being said, how are you embedding images? Is it based on how similar they are? If so, what does "similarity" mean? Also, what dataset was leveraged?
Any more info on how you do this task would be awesome :).
We're embedding images by feeding them through a deep neural net and using the activations of an intermediate layer as an embedding.
You can read https://arxiv.org/abs/1403.6382 to learn more about this technique if you're interested.
Our launch model is trained on ImageNet, which has enough variety that it usually generalizes well even when your dataset is very dissimilar to the input distribution. We're planning to train on a wider variety of image data in the future, but we wanted to get something into people's hands quickly.
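If you're curious what that looks like mechanically, here's a sketch of the general technique using a pretrained torchvision ResNet (not our actual pipeline; the file name is just an example):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # A pretrained ImageNet classifier; we throw away the final
    # classification layer and keep the 2048-d activations feeding it.
    resnet = models.resnet50(pretrained=True)
    resnet.fc = torch.nn.Identity()  # replace the head with a pass-through
    resnet.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return resnet(img).squeeze(0).numpy()  # 2048-d embedding

    vec = embed("some_image.jpg")  # placeholder file name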
In particular, semantically similar words are close to each other after embedding, so the space ends up with semantically meaningful clusters.
Our embeddings have the same property. If you embed two similar images, they'll end up closer to each other than two dissimilar images. (Where "similar" depends on the training details, but that's true for word2vec as well.)
If you're doing clustering or instance retrieval, you probably want to PCA the number of dimensions down to 200 or so. (In fact, we do this in the tutorial at https://www.basilica.ai/tutorials/how-to-train-an-image-mode... .)
If you're training a big regression, you'll probably get better results with the larger embedding.
We decided to err on the side of making the embeddings too big, because it's very easy to reduce the number of dimensions on the user's end, and impossible to increase it.
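The reduction itself is a couple of lines with scikit (here `embeddings` stands in for the (n_samples, n_dims) array you'd get back from us, with at least a few hundred rows):

    from sklearn.decomposition import PCA

    pca = PCA(n_components=200)
    reduced = pca.fit_transform(embeddings)     # (n_samples, 200)
    print(pca.explained_variance_ratio_.sum())  # variance retained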
If you are interested in how to build and use embeddings for search and classification yourself, I wrote a completely open source tutorial here: https://blog.insightdatascience.com/the-unreasonable-effecti...
Specifically, I am trying to think of an example where the user cares about a vector representation of something, but doesn't care about how that vector representation was obtained.
I can think of why it would be useful: the ML examples given, or perhaps a compression application.
However, in each of these cases, it would seem that the user has the skill to spin up their own, and a lot of motivation to do so and understand it.
1. Transfer learning / data volume. If you have a small image dataset, embedding it using an embedding trained on a much larger image dataset is really really helpful. In our tutorial (https://www.basilica.ai/tutorials/how-to-train-an-image-mode...), we get good results with only 2-3k animal pictures, which is only possible because of the transfer learning aspect of embeddings.
You could do transfer learning yourself, if you have the time and expertise. And for a domain like images, it's really easy to find big public datasets. But long-term we're hoping to have embeddings for a lot of areas where there aren't good public datasets, and pool data from all our customers to produce a better embedding than any of them could alone.
2. Ease of use. You can take a Basilica image embedding, feed it into a linear regression running on your laptop CPU, and get really good results (see the sketch after this list). To get equally good results on your own, you'd need to run tensorflow on GPUs. This is harder than it sounds for a lot of people.
3. Exploration. Because of the other two points, if you have a thought like "huh, I wonder if including these images would improve our pricing model", you can whip up some code and train a model in a few minutes to check. Maybe if it's a big model you go grab lunch while it trains.
If you're doing everything from scratch in tensorflow, it can take days to try the same thing. This activation energy reduces the amount of experimentation people do. It's bad for the same reasons having a multi-day compile/test loop would be bad.
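To make point 2 concrete, the whole experiment loop is roughly this sketch (`get_embeddings`, `image_paths`, and `y` are placeholders for your own API client, files, and labels):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = get_embeddings(image_paths)          # one vector per image
    clf = LogisticRegression(max_iter=1000)  # a linear model; laptop-friendly
    print(cross_val_score(clf, X, y, cv=5).mean())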
I agree with what you're saying here. I just wonder how it would work in practice.
So imagine I have this monster text or image, and I want to know if it looks like another text or image.
I send each to Basilica, it gives me back two vectors and I compare the vectors.
I use the cosine of the vectors as a similarity score, and let's say it comes out to be 0.6.
However, I think this is too low, and I want to tweak my algorithm.
At this point, doesn't the question of how the vector was generated come to the fore? Did you get rid of common words? How did you treat stems? What biases did you introduce during training?
Furthermore, these questions come up right away, and they seem fundamental to whatever the main practice is.
In other words, can I even experiment or get started without knowing how word2vec works?
I'm going to narrow in on the question of how to go about tweaking a model that uses an embedding, since I think it's a really interesting topic.
To use your first example, let's say you're doing the image similarity task. You probably wouldn't be computing the cosine distance on the embeddings directly. You'd probably normalize and then do PCA to reduce the number of dimensions to 200 or so.
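That front half of the pipeline is short (a sketch; `embs` is your (n_samples, n_dims) array of image embeddings, and the 0.6-style score from your example falls out of the cosine step):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import normalize

    embs = normalize(embs)  # unit-length rows
    reduced = PCA(n_components=200).fit_transform(embs)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    score = cosine(reduced[0], reduced[1])  # similarity of images 0 and 1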
If you weren't getting good results, you'd have a few options. You could fiddle with the normalization and PCA steps, which can have a big effect. You could also include other handcrafted features alongside the embedding. But let's say you have a fundamental problem, like your similarity score is paying too much attention to the background of your images rather than the foreground.
There are two major approaches to solving that sort of problem with embeddings: preprocessing or postprocessing. You could preprocess the images before embedding them to de-emphasize the backgrounds (e.g. by cropping more tightly to what you care about). You could also postprocess the embeddings. For example, you could label which of your images have similar backgrounds, and instead of naive PCA you could extract components that maximally explain variance while having minimal predictive power for background.
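Here's a deliberately crude sketch of that postprocessing idea (real approaches would use something more principled, but this shows the shape of it; `embs` and the hand-labeled background labels `bg` are yours):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    comps = PCA(n_components=200).fit_transform(embs)

    keep = []
    for i in range(comps.shape[1]):
        # how well does component i alone predict the background label?
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              comps[:, [i]], bg, cv=3).mean()
        if acc < 0.6:  # arbitrary threshold; tune for your data
            keep.append(i)

    cleaned = comps[:, keep]  # background-heavy directions dropped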
Our bet is that the simplicity of using Basilica will give a much easier experience that doesn't require complex infrastructure or training, while still giving very good results thanks to transfer learning. The amount of data needed is also much smaller than if you were training a model from scratch.
See Bolukbasi (https://arxiv.org/pdf/1607.06520.pdf)
and Caliskan (http://science.sciencemag.org/content/356/6334/183)
While these examples are solely language based, it is easy to imagine the transfer to other domains.
We don't have any concrete plans to tackle this right now but it is something we're definitely mindful of. Thanks for the links! We'll be sure to go through them.
In case you didn't click around the page, note that one of the applications they are proposing sounds similar:
> Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.
The sentence embeddings are trained on translation tasks. An embedding that works for both images and sentences can be found by training on an image-captioning task.
The point I'm asking about is that there may be many ways to embed a "data type", depending on what you might want to use the embedding for. Someone brought up board game states. You could imagine embedding images of board games directly. That embedding would only contain information about the game state if it was trained for the appropriate task.
Kind of surprisingly, though, if you get your embedding by training a deep neural network to do a fairly general task -- like denoising autoencoding, or classification with many classes -- it ends up being useful for a wide variety of other tasks. (You get the embedding out of the neural network by taking the activations of an intermediate layer.)
In some sense you'd expect this, since you'd hope that the intermediate layers of the neural network are learning general features -- if they were learning totally nongeneral features, it would be overfitting -- but I found it surprising when I first learned about it.
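As a toy version of the autoencoder case (a sketch in PyTorch; after training on a reconstruction loss against the clean input, `model.encoder(x)` is the embedding):

    import torch
    import torch.nn as nn

    class DenoisingAE(nn.Module):
        def __init__(self, in_dim=784, emb_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
            self.decoder = nn.Sequential(
                nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

        def forward(self, x):
            noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input
            z = self.encoder(noisy)                # intermediate-layer embedding
            return self.decoder(z)                 # trained to reconstruct x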
I can't really think of a barrier to this. Detecting the file format is straightforward, and generic image/text/etc. embeddings work surprisingly well. (In fact, you can actually get some generalization gains by training subword text embeddings on corpora in multiple languages.)
If we wanted to able to use specific embeddings (e.g. photos vs. line art, English vs. German), we could probably do it by running the data through a generic embedding, and then seeing which cluster of training data it's closest to and running it through that specific embedding.
It would be really important in this case to make sure that all the specific embeddings are embedding into the same space, in case people have a mixed dataset, but that's very doable.
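The routing step could be as simple as nearest-centroid in the generic space (a sketch; `generic_embed`, `centroids`, and `specific_embedders` are hypothetical names):

    import numpy as np

    # centroids: (k, d) mean generic embedding of each specialty's training data
    # specific_embedders: k functions that all embed into the same shared space
    def route_and_embed(x):
        g = generic_embed(x)
        specialty = int(np.argmin(np.linalg.norm(centroids - g, axis=1)))
        return specific_embedders[specialty](x)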
Generally, you'll want to extract features (e.g. word counts) and then apply a clustering algorithm to group related documents together. The precise details are the subject of thousands of papers, each one doing things slightly differently.
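A baseline version of that recipe in scikit (`docs` is your list of document strings; the cluster count is a guess you'd tune):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    X = TfidfVectorizer(max_features=10000,
                        stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=20, n_init=10).fit_predict(X)  # related docs share a label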
No production use cases yet. This is the first usable release, and it's the bare minimum we felt we could build before showing it to people.
> Are you a YC company?
We have a YC interview on Friday, so hopefully in a few days I'll be able to say yes.
Yes. We intend to run this as a cloud service API for now.
In the very-long term, I want us to literally have embeddings for everything people want to embed, which will probably include the states of popular boardgames.
I'm not sure how we'll get there. Maybe we'll have community embeddings, or an embedding marketplace, or we'll abstract away the process of creating an embedding so well that we can create simple embeddings just by pointing our code at a dataset. But I'd like to get there eventually.
In the less-very-long term, we're still focusing on embeddings that are either very general and useful across a lot of domains (e.g. images, text), or embeddings that have clear and immediate business value (e.g. resumes), since running GPUs is expensive.
You can read more about this technique in https://arxiv.org/abs/1403.6382 if you're interested.
>Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.
Wonderful! We were in dire need of yet another black-box criterion employers can use to reject candidates.
“We’re sorry to inform you that we choose not to go on with your application. You see, for this position we’re looking for someone with a different embedding.”
For what it's worth, people are doing job candidate clustering anyway right now. It's just that most people are doing it with keyword search or something.
Doing it with embeddings instead would probably increase the quality of the clustering at the cost of some interpretability (i.e. you wouldn't be able to say "we didn't show your resume to this employer because it didn't contain both 'Java' and 'Agile'", but maybe that's a good thing).
It's sort of a hard philosophical question how much you care about transparency/interpretability vs. quality, especially for socially important tasks like hiring.
It would be like complaining about an agricultural innovation that's projected to reduce starvation, because it doesn't also cure diabetes. I wouldn't want such progress to "just stop" because they're leaving a given problem unsolved.
resumes of candidates
- resumes of employees you fired
+ resumes of employees you promoted
= resumes of candidates you should hire
Worse, the person who looks at the rejection decision may have no idea that it boils down to this.
 - https://arxiv.org/pdf/1607.06520.pdf
Further, they cherry-picked the most-potentially-offensive examples, in some cases dependent on the increased 'fuzziness' of more-outlier tokens (like `computer_programmer`).
You can test analogies against the popular GoogleNews word-vector set here – http://bionlp-www.utu.fi/wv_demo/ – but it has the same repeated-word suppression (the query words are filtered out of the results).
So yes, when you try "man : computer_programmer :: woman : _?_" you indeed get back `homemaker` as #1 (and `programmer` a bit further down, and `computer_programmer` nowhere, since it's filtered, thus unclear where it would have ranked).
But if you use the word `programmer` (which I believe is more frequent in the corpus than the `computer_programmer` bigram, and thus a stronger vector), you get back words closely-related to 'programmer' as the top-3, and 23 other related words before any strongly-woman-gendered professions (`costume_designer` and `seamstress`).
You can try lots of other roles you might have expected to be somewhat gendered in the corpus – `firefighter`, `architect`, `mechanical_engineer`, `lawyer`, `doctor` – but continue to get back mostly ungendered analogy-solutions above gendered ones.
So: while word-vectors can encode such stereotypes, some of the headline examples are not representative.
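If you want to poke at this locally, gensim will run the same analogy queries against the GoogleNews vectors once you've downloaded the standard GoogleNews-vectors-negative300.bin file (note gensim's most_similar also drops the query words from its results, i.e. the same suppression):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # "man : programmer :: woman : ?" via vector arithmetic
    print(wv.most_similar(positive=["woman", "programmer"],
                          negative=["man"], topn=10))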
For example, if you trained only on a corpus of circa-1950 newspapers, would «“man” - “homosexual” ~= “pervert”» hold, or something similar? I remember from my teenage years (as late as the 90s!) that some UK politicians spoke as if they thought like that.
I also wonder what biases it could reveal in me which I am currently unaware of… and how hard it may be to accept the error exists or to improve myself once I do. There’s no way I’m flawless, after all.
If it did, what conclusion would you be able to draw?
As far as I know, there's no theoretical justification for thinking that word vectors are guaranteed to capture meaningful semantic content. Empirically, sometimes they do; other times, the relationships are noise or garbage.
I am wholeheartedly in favor of trying to examine one's own biases, but you shouldn't trust an ad-hoc algorithm to be the arbiter of what those biases are.
> resumes of candidates
> - resumes of employees you fired
> + resumes of employees you promoted
> = resumes of candidates you should hire
Basilica might reinforce that hard work when evaluating candidates.
Or you could use the techniques described in your citation to allow Basilica to help de-bias the hiring process.
e.g. if you fire all the black people and don't promote women, guess which resumes the AI will send you.
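For what it's worth, the core "neutralize" step from the Bolukbasi paper is simple enough to sketch (this is only the first piece of their method; assume `wv` maps words to numpy vectors, e.g. a gensim model, and one definitional pair stands in for the several they use):

    import numpy as np

    def unit(v):
        return v / np.linalg.norm(v)

    g = unit(wv["he"] - wv["she"])  # estimated bias direction

    def neutralize(v):
        # remove the component of v that lies along the bias direction
        return v - (v @ g) * g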
For the most part, people don't get fired over skills. They get fired for failures of execution or behavior: someone screws up a production deployment or makes lewd comments to a coworker. That's hard to pick up from a resume.
That doesn't mean you can't raise concerns. Someone else raised the same concern that you did in a fine way: https://news.ycombinator.com/item?id=18348005. When in doubt, emulate them and ask a simple question.
There are a number of off-the-shelf models that would give you image/sentence embeddings easily. Anyone with a sufficient understanding of embeddings/word2vec would have no trouble training an embedding catered to the specific application, with much better quality.
For NLP applications, the corpus quality dictates the quality of the embedding if you use simple W2V. Word2Vec trained on the Google News corpus is not going to be useful for a chatbot, for instance. Different models also give different quality of embedding. As an example, if you use Google BERT (a bidirectional Transformer) you get world-class performance on many NLP applications.
Embeddings are so model/application specific that I don't see how a generic embedding could be useful in serious applications. Training a model these days is easy to do. Calling the TensorFlow API is probably easier than calling the Basilica API in 99% of cases.
I'd be curious whether the embedding is "aligned", in the sense that the embedding of the word "cat" is close to the embedding of a picture of a cat. I think that would be interesting and useful. I don't see how Basilica solves that problem by taking the top layers off ResNet, though.
I appreciate the developer API etc, but as an ML practitioner this feels like a troll.
Maybe, but training/curating data appropriate for your application isn't. It's not in that state right now, but I think this service could save you time if they had a domain-relevant embedding ready to roll for your application that performed decently well -- that would save you a lot of time gathering training data and help you focus on the "business logic" ML needs that accept the embeddings as input.
That said, they'd need to be more performant than, say, GloVe 2B, which I can get for free off of torchtext, meaning they have to do the domain-specific heavy-lifting.
Apologies for the super long response, but you had a lot of points.
> Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?
Hopefully you're missing something, or we've been wasting a great deal of our time ;)
> There are a number of off-the-shelf models that would give you image/sentence embeddings easily. Anyone with a sufficient understanding of embeddings/word2vec would have no trouble training an embedding catered to the specific application, with much better quality.
For images and text, it's definitely true that you can train your own embeddings with an off-the-shelf model. But I think it's more likely that we end up in a place where a small number of people train a bunch of really good models and everyone else uses them.
I think this for three reasons:
1. It's what we've seen with word2vec. The vast majority of people who use word2vec aren't training it themselves; they're downloading pretrained weights.
2. Most people don't have enough data to train a good embedding themselves. There are good public datasets for images and text, but we're planning to produce embeddings for more niche verticals too.
Keep in mind that modern deep neural nets are very data hungry, and the problem gets worse every year. In a few years I think we're going to be in a spot where getting state of the art performance requires a lot of compute, and more data than most people have access to.
3. Prebuilt embeddings drastically speed up development. If you have a traditional model, and you think feeding some images into it might improve it, you can test that hypothesis in twenty minutes with Basilica. We've talked to a lot of teams that have high-dimensional data lying around which they think might improve their models, but they aren't sure, and they can't really justify a week or two of someone's time to explore it.
> For NLP applications, the corpus quality dictates the quality of the embedding if you use simple W2V. Word2Vec trained on the Google News corpus is not going to be useful for a chatbot, for instance. Different models also give different quality of embedding. As an example, if you use Google BERT (a bidirectional Transformer) you get world-class performance on many NLP applications.
> Embeddings are so model/application specific that I don't see how a generic embedding could be useful in serious applications. Training a model these days is easy to do. Calling the TensorFlow API is probably easier than calling the Basilica API in 99% of cases.
It's definitely true that you usually want your input distribution to be reasonably close to the distribution the embedding was trained on. (Although it's worth noting that having a different distribution for your embedding acts as a form of regularization, and sometimes that matters more than the problems you get from the distributional shift.)
I think you're overstating the case though. An embedding trained on a wide variety of sources will perform really well on a lot of tasks, and often other things like amount of data you trained on matters more than distributional similarity.
You may find https://research.fb.com/wp-content/uploads/2018/05/exploring... interesting, especially the end of section 3.1.2. The paper trains a giant network on billions of Instagram images, and then explores both fine-tuning it on Imagenet and using the features of the last layer as inputs to a logistic regression (which they call "feature transfer" rather than "embedding").
The logistic regression trained on the Instagram features gets 83.6% top-1 accuracy, compared to 85.4% for full network fine-tuning and 80.9% for a ResNeXt model trained directly on ImageNet.
In other words, the effect of the larger training set dominated the distributional shift.