Show HN: Basilica – word2vec for anything (basilica.ai)
153 points by hiphipjorge 5 months ago | 77 comments

Hey all,

I did a lot of the ML work for this. Let me know if you have any questions.

The title might be a little ambitious since we only have two embeddings right now, but it really is our goal to have embeddings for everything. You can see some of our upcoming embeddings at https://www.basilica.ai/available-embeddings/.

We basically want to do for these other datatypes what word2vec did for NLP. We want to turn getting good results with images, audio, etc. from a hard research problem into something you can do on your laptop with scikit.

In the case of images I can just take an off-the-shelf pre-trained model like ResNet as a feature extractor, so why should I use a cloud service for that? I don't quite get what the benefit is. Are your embeddings better? If so, can you show it? And in terms of transfer learning, wouldn't fine-tuning the convolutional layers perform far better anyway?

Do you plan to have these embeddings (and the many future ones) share a single 'semantic vector space', or will each of them be separate?

I.e. do you plan to do the work to align the embeddings of different types of media so that the contents are somewhat similar and e.g. an audio recording of a snippet gets a similar embedding to the equivalent text and a somewhat similar embedding to a picture that's being described?

We aren't currently doing this.

In the future I think we'll try to embed into a single space on a best-effort basis, assuming we can find the engineering resources. It will be really hard for some data types, but for the big ones like text/image/audio it isn't that hard, and will probably be valuable to people.

What's different between an "embedding" and a projection, which I believe is the more standard term for this kind of transformation?

"Embedding" is the term I've heard used for this most often. It's definitely the term that seems to dominate in the literature. (Just to pick a random paper off my reading list: https://arxiv.org/abs/1709.03856 .)

In my mind "embedding" carries the connotation that you're moving into a smaller space that's easier to work with, and where things which are similar in some way are near each other.

The TensorFlow docs agree with you. This field has such jargon proliferation, it's hard to keep up.


Yeah, I agree. The number of new nouns per year is kind of ridiculous.

Projection is a type of embedding. But you can't really describe what LTSA, UMAP, etc. do as projection. LTSA "unrolls" data rather than projecting it.

On the contrary, I'd say an "unrolling" is a non-linear projection. Wikipedia suggests that embedding is a type of projection. I hadn't heard the term before today.


Embedding is the ML term for a non-linear projection.

Amusingly, it seems you and the other reply are contradicting each other, saying both that an embedding is a specific type of projection and that a projection is a specific type of embedding.

When I was learning about SVMs way back when, we said "non-linear projection" instead of "embedding".

I mean both are correct given the literature at this point. Maybe the use of embedding captures the notion that you are mapping through a NN which is a variable function of sorts instead of a single or set of linear or orthonormal functions.

The mathematical term "embedding" refers to finding a lower-dimensional representation of some high(er)-dimensional data. Whether this is done with a single function (maybe hand-chosen) or a neural network doesn't matter.

I don't think that's quite right. According to Wikipedia, an embedding is: https://en.wikipedia.org/wiki/Embedding

So Projection sounds closer to what you describe. https://en.wikipedia.org/wiki/Projection_(mathematics)

So my original comment is not quite right either. An embedding is a structure-preserving map. A projection isn't necessarily structure-preserving. "To be an embedding, such a mapping must preserve order 'both ways'" (http://mathworld.wolfram.com/Embedding.html).

I think that a projection from p dimensions to q<p dimensions isn't structure preserving.

At this point a mathematician may be able to explain more.

SVMs project the data into a higher dimensional space, not lower. A mixed blob of dots on a plane can't be cleanly separated by a line into two colors, but if you blow them out into a cube by some arbitrary projection, a plane will be able to separate them cleanly.
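A minimal sketch of the lifting idea described above, assuming a toy dataset where one class sits near the origin and the other surrounds it. The feature map `z = x^2 + y^2`, the sample points, and the threshold are all illustrative assumptions; real SVMs usually apply such a map implicitly through a kernel rather than computing it explicitly.

```python
# Toy illustration: lift 2-D points into 3-D so that a plane can separate them.
# The data and the feature map z = x^2 + y^2 are illustrative assumptions.

def lift(point):
    """Map (x, y) to (x, y, x^2 + y^2)."""
    x, y = point
    return (x, y, x * x + y * y)

# The inner class sits near the origin; the outer class surrounds it,
# so no straight line in the plane separates the two.
inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2), (-0.1, -0.3)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]

def predict(point, threshold=1.0):
    """Classify with the plane z = threshold in the lifted space."""
    _, _, z = lift(point)
    return "outer" if z > threshold else "inner"

inner_ok = all(predict(p) == "inner" for p in inner)
outer_ok = all(predict(p) == "outer" for p in outer)
print(inner_ok, outer_ok)
```

In the lifted space the decision rule is a single linear threshold on the third coordinate, which is the sense in which a plane separates what a line could not.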

Hi there Lucy!

So, I don't know a ton about Word2Vec which probably doesn't help, but I do understand that it makes tasks, like NLP, much easier since you're going from this massive space (the English language) into a smaller embedding that you learn.

That being said, how are you embedding images? Is it based on how similar they are, if so what does "similarity" mean? Also, what dataset was leveraged?

Any more info on how you do this task would be awesome :).


We're embedding images by feeding them through a deep neural net and using the activations of an intermediate layer as an embedding.

You can read https://arxiv.org/abs/1403.6382 to learn more about this technique if you're interested.

Our launch model is trained on ImageNet, which has enough variety that it usually generalizes well even when your dataset is very dissimilar to the input distribution. We're planning to train on a wider variety of image data in the future, but we wanted to get something into people's hands quickly.

Isn't the point of word2vec that embeddings are semantically meaningful vectors?


In particular, semantically similar words are close to each other after embedding, so the space ends up with semantically meaningful clusters.

Our embeddings have the same property. If you embed two similar images, they'll end up closer to each other than two dissimilar images. (Where "similar" depends on the training details, but that's true for word2vec as well.)

Most other word embeddings have hundreds of dimensions, not thousands. Are you able to hint at what causes this difference? Do you see better downstream task performance?

It depends on the task.

If you're doing clustering or instance retrieval, you probably want to PCA the number of dimensions down to 200 or so. (In fact, we do this in the tutorial at https://www.basilica.ai/tutorials/how-to-train-an-image-mode... .)

If you're training a big regression, you'll probably get better results with the larger embedding.

We decided to err on the side of making the embeddings too big, because it's very easy to reduce the number of dimensions on the user's end, and impossible to increase it.
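As a rough sketch of the dimensionality reduction mentioned above, PCA can be written with NumPy's SVD. The 2048-dimensional random matrix is only a stand-in for real embeddings (the actual embedding size is an assumption here), and `pca_reduce` is a hypothetical helper name.

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Project the rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 2048))  # stand-in for real embeddings
reduced = pca_reduce(embeddings, n_components=200)
print(reduced.shape)  # (500, 200)
```

Going from the large embedding to a small one is a single matrix product, which is why shipping the larger vector and letting the user shrink it is the cheaper direction.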

Can you give a brief history of the use of the word 'embedding'?

Interesting idea, but seems to very much fall within the category of something you would often want to build in-house. I always imagined the right level of abstraction was closer to spacy's, a framework that lets you easily embed all the things.

If you are interested in how to build and use embeddings for search and classification yourself, I wrote a completely open source tutorial here: https://blog.insightdatascience.com/the-unreasonable-effecti...

What is the use case for this? (And this is a general point for AI cloud APIs)

Specifically, I am trying to think of an example where the user cares about a vector representation of something, but doesn't care about how that vector representation was obtained.

I can think of why it would be useful: the ML examples given, or perhaps a compression application.

However, in each of these cases, it would seem that the user has the skill to spin up their own, and a lot of motivation to do so and understand it.

Apologies for the long answer, but this touches on a lot of interesting points:

1. Transfer learning / data volume. If you have a small image dataset, embedding it using an embedding trained on a much larger image dataset is really really helpful. In our tutorial (https://www.basilica.ai/tutorials/how-to-train-an-image-mode...), we get good results with only 2-3k animal pictures, which is only possible because of the transfer learning aspect of embeddings.

You could do transfer learning yourself, if you have the time and expertise. And for a domain like images, it's really easy to find big public datasets. But long-term we're hoping to have embeddings for a lot of areas where there aren't good public datasets, and pool data from all our customers to produce a better embedding than any of them could alone.

2. Ease of Use. You can take a Basilica image embedding, feed it into a linear regression running on your laptop CPU, and get really good results. To get equally good results on your own, you'd need to run tensorflow on GPUs. This is harder than it sounds for a lot of people.

3. Exploration. Because of the other two points, if you have a thought like "huh, I wonder if including these images would improve our pricing model", you can whip up some code and train a model in a few minutes to check. Maybe if it's a big model you go grab lunch while it trains.

If you're doing everything from scratch in tensorflow, it can take days to try the same thing. This activation energy reduces the amount of experimentation people do. It's bad for the same reasons having a multi-day compile/test loop would be bad.
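A sketch of the "linear model on top of embeddings" workflow from point 2, under stated assumptions: synthetic vectors stand in for embeddings returned by a service (no Basilica client call is shown, since its interface is not described in this thread), and a plain least-squares classifier stands in for the regression mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "embeddings": two classes whose means differ along a hidden direction.
n_per_class, dim = 200, 64
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
class0 = rng.normal(size=(n_per_class, dim)) - 3.0 * direction
class1 = rng.normal(size=(n_per_class, dim)) + 3.0 * direction
X = np.vstack([class0, class1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split into training and held-out sets.
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

def fit_linear(X, y):
    """Least-squares fit of targets in {-1, +1}, with a bias column appended."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w

def predict(w, X):
    """Predict class 1 where the linear score is positive."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return (A @ w > 0).astype(int)

w = fit_linear(X_train, y_train)
accuracy = float((predict(w, X_test) == y_test).mean())
print(f"held-out accuracy: {accuracy:.2f}")
```

The whole fit is one CPU-side linear solve, which is the kind of minutes-scale experiment the comment has in mind.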

Hi mlucy,

I agree with what you're saying here. I just wonder how it would work in practice.

So imagine I have this monster text or image, and I want to know if it looks like another text or image.

I send each to Basilica, it gives me back two vectors and I compare the vectors.

I use the cosine of the vectors as a similarity score, and let's say it comes out to be 0.6.

However, I think this is too low, and I want to tweak my algorithm.

At this point, doesn't the question of how the vector was generated come to the front? Did you get rid of common words, how did you treat stems, and so on? Or what biases did you introduce in training?

Furthermore, these questions come up right away, and they seem fundamental to whatever the main practice is.

In other words, can I even experiment or start without knowing how the word2vec works?
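For reference, the similarity score described in this comment can be sketched in a few lines. The two vectors are made-up stand-ins for embeddings returned by a service, chosen so that the score matches the 0.6 used in the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the two items being compared.
vec_a = [3.0, 4.0]
vec_b = [5.0, 0.0]
score = cosine_similarity(vec_a, vec_b)
print(round(score, 3))  # 0.6
```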

You're definitely right that you sometimes need to know the exact details of how an embedding is produced, especially if you're doing cutting-edge work. That's one of the things we really need to improve documentation-wise. I'd like to have a page for each embedding that talks about how it's generated, what to watch out for while working with it, etc. etc.

I'm going to narrow in on the question of how to go about tweaking a model that uses an embedding, since I think it's a really interesting topic.

To use your first example, let's say you're doing the image similarity task. You probably wouldn't be computing the cosine distance on the embeddings directly. You'd probably normalize and then do PCA to reduce the number of dimensions to 200 or so.

If you weren't getting good results, you'd have a few options. You could fiddle with the normalization and PCA steps, which can have a big effect. You could also include other handcrafted features alongside the embedding. But let's say you have a fundamental problem, like your similarity score is paying too much attention to the background of your images rather than the foreground.

There are two major approaches to solving that sort of problem with embeddings: preprocessing or postprocessing. You could preprocess the images before embedding them to de-emphasize the backgrounds (e.g. by cropping more tightly to what you care about). You could also postprocess the embeddings. For example, you could label which of your images have similar backgrounds, and instead of naive PCA you could extract components that maximally explain variance while having minimal predictive power for background.
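A simplified sketch of the postprocessing idea, assuming binary background labels. This is not the variance-versus-predictive-power decomposition described above; it is a simpler swapped-in technique that only removes the single direction separating the two label groups' means. The data, the labels, and the helper name `remove_direction` are all hypothetical.

```python
import numpy as np

def remove_direction(embeddings, labels):
    """Project out the direction between the two label-group means."""
    labels = np.asarray(labels)
    diff = embeddings[labels == 1].mean(axis=0) - embeddings[labels == 0].mean(axis=0)
    unit = diff / np.linalg.norm(diff)
    # Subtract each row's component along `unit`.
    return embeddings - np.outer(embeddings @ unit, unit)

rng = np.random.default_rng(0)
background = rng.integers(0, 2, size=200)  # hypothetical background labels (0 or 1)
embeddings = rng.normal(size=(200, 16))    # stand-in embeddings
embeddings[:, 0] += 5.0 * background       # background information leaks into one axis

cleaned = remove_direction(embeddings, background)

gap_before = abs(
    embeddings[background == 1, 0].mean() - embeddings[background == 0, 0].mean()
)
gap_after = np.linalg.norm(
    cleaned[background == 1].mean(axis=0) - cleaned[background == 0].mean(axis=0)
)
print(gap_before > 1.0, gap_after < 1e-8)
```

After the projection the two background groups have the same mean embedding, so a similarity score computed on `cleaned` can no longer exploit that particular direction; the usual normalization and PCA steps would then run on the cleaned vectors.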

I definitely agree you should add more documentation on how the word vector model(s) are generated. Also, you may want to offer a set of models that the user can choose from. For example, Wikipedia is good for a general language use case, but for something more technical, such as finance, SEC filings are a better data source.

Hi, Jorge from Basilica here.

Our bet is that the simplicity of using Basilica will provide a much easier experience that doesn't require complex infrastructure and training and will provide very good results thanks to transfer learning. The amount of data needed for this is also much smaller than if you were training a model from scratch.

How do you plan to counter the harmful societal biases that embeddings embody?

See Bolukbasi (https://arxiv.org/pdf/1607.06520.pdf) and Caliskan (http://science.sciencemag.org/content/356/6334/183)

While these examples are solely language based, it is easy to imagine the transfer to other domains.

Hi. Jorge from Basilica here.

We don't have any concrete plans to tackle this right now but it is something we're definitely mindful of. Thanks for the links! We'll be sure to go through them.


Actually solving this problem is enormously difficult; it is practically unsolved, isn't it? It seems unfair to judge small projects for not solving bias issues that Google and Amazon can't even deal with.

Those companies get flak for it too, e.g. https://slate.com/business/2018/10/amazon-artificial-intelli...

In case you didn't click around the page, note that one of the applications they are proposing sounds similar:

> Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.

Aren't these embeddings task-specific? For example a word2vec embedding is found by letting the embedder participate in a task to predict a word given words around it, on a particular corpus of text.

Embeddings of sentences are trained on translation tasks. An embedding that works for both images and sentences is found by training on a picture captioning task.

The point I'm asking about is that there may be many ways to embed a "data type", depending on what you might want to use the embedding for. Someone brought up board game states. You could imagine embedding images of board games directly. That embedding would only contain information about the game state if it was trained for the appropriate task.

You can definitely improve performance by choosing an embedding closely related to your task. In the future we're hoping to have more embeddings for specialized tasks.

Kind of surprisingly, though, if you get your embedding by training a deep neural network to do a fairly general task -- like denoising autoencoding, or classification with many classes -- it ends up being useful for a wide variety of other tasks. (You get the embedding out of the neural network by taking the activations of an intermediate layer.)

In some sense you'd expect this, since you'd hope that the intermediate layers of the neural network are learning general features -- if they were learning totally nongeneral features, it would be overfitting -- but I found it surprising when I first learned about it.
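A minimal sketch of "taking the activations of an intermediate layer", using a tiny fully connected network. The layer sizes are hypothetical and the random weights stand in for a trained model; a real image model would be a much larger convolutional network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network: 32 inputs -> 16 hidden -> 8 hidden -> 4 class scores.
# Random weights stand in for weights learned on a general task.
sizes = [32, 16, 8, 4]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, stop_after=None):
    """Run the network; return activations after layer `stop_after` (1-based) if given."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        h = h @ W + b
        if i < len(weights):
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
        if stop_after == i:
            return h
    return h

x = rng.normal(size=32)
scores = forward(x)                   # full network output, shape (4,)
embedding = forward(x, stop_after=2)  # second hidden layer, shape (8,)
print(scores.shape, embedding.shape)
```

The embedding is simply what the network has computed just before its task-specific output layer, which is why it can be reused for tasks other than the one the network was trained on.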

You quote a target of 200ms per embedding, not sure if it's one type of embedding in particular. I am using Infersent (a sentence embedding from FAIR https://github.com/facebookresearch/InferSent) for filtering and they quote a number of 1000 sentences per second on a generic GPU. That's 200 times faster than your number, but it is a local API so I am comparing apples to oranges. Yet it's hard to imagine you are spending 1ms embedding and 199 on API overhead. I am sure I have missed a 0 here or there, but I don't see where, other than theirs is a batch number (batch size 128) and maybe yours is a single embedding number. Can you please clarify? Thanks

So I am going to answer it myself. On batched data, it's a lot faster than 200ms per embedding and I'd say on a par with Infersent. On the other hand, I couldn't get statistical performance in the same ballpark as Infersent and I had to backtrack. This was training a logistic regression on the embeddings to filter some text streams according to my preferences. If I had, I would have preferred Basilica as Infersent is py2 only, hard to install and distribute and a battery killer on my laptop. Its vectors are also 4x bigger. I experienced some server errors and the team at Basilica was very responsive and fixed it, very pleased with the interaction. It would be important IMHO to publish some benchmark results for these embeddings, as it's usually done in the universal embedding literature, or serve published embeddings with known performance when licensing terms are favorable.

Another update: on their v2 sentence embedding, Basilica is ahead of Infersent for my task. Well done Basilica!

How much does this depend on the data type? I.e. do you need people to specify: this is an image, this is a resume, this is an English resume, etc. Could you ever get to a point where you can just feed it general data, not knowing more than that it's 1s and 0s?

That's a really interesting idea.

I can't really think of a barrier to this. Detecting the file format is straightforward, and generic image/text/etc. embeddings work surprisingly well. (In fact, you can actually get some generalization gains by training subword text embeddings on corpora in multiple languages.)

If we wanted to be able to use specific embeddings (e.g. photos vs. line art, English vs. German), we could probably do it by running the data through a generic embedding, and then seeing which cluster of training data it's closest to and running it through that specific embedding.

It would be really important in this case to make sure that all the specific embeddings are embedding into the same space, in case people have a mixed dataset, but that's very doable.
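The routing step described above could be sketched as a nearest-centroid lookup. The centroid values and the embedder names (`photo`, `line_art`) are hypothetical, and the generic embedding is assumed to have been computed already.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical centroids of each specialised embedder's training data,
# expressed in the generic embedding space.
centroids = {
    "photo": [1.0, 0.0],
    "line_art": [0.0, 1.0],
}

def route(generic_embedding):
    """Return the name of the specialised embedder whose centroid is nearest."""
    return min(centroids, key=lambda name: distance(generic_embedding, centroids[name]))

print(route([0.9, 0.2]))  # photo
print(route([0.1, 0.8]))  # line_art
```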

Slightly different topic, but what are some approaches to categorizing webpages? I have thousands of web links that I want to organize with tags. Is there a software technique to group them by related topics?

The task is known as document clustering https://en.wikipedia.org/wiki/Document_clustering or topic modeling https://en.wikipedia.org/wiki/Topic_model

Generally, you'll want to extract features (e.g. word counts) and then apply a clustering algorithm to group related documents together. The precise details are the subject of thousands of papers, each one doing things slightly differently.
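One very small instance of that recipe, as a sketch: word counts as features, cosine similarity between them, and a greedy single-pass grouping in place of a proper clustering algorithm such as k-means. The page texts and the 0.3 threshold are made up for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Bag-of-words features: lowercase alphabetic tokens and their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    clusters = []  # list of (seed_counts, [document indices])
    for i, doc in enumerate(docs):
        counts = word_counts(doc)
        for seed, members in clusters:
            if cosine(seed, counts) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((counts, [i]))
    return [members for _, members in clusters]

pages = [
    "python tutorial for machine learning",
    "sourdough bread recipe with rye flour",
    "machine learning with python and numpy",
    "easy bread recipe for beginners",
]
groups = cluster(pages)
print(groups)  # [[0, 2], [1, 3]]
```

For real web links you would first fetch and strip each page to text, and would likely want stop-word removal or TF-IDF weighting so that words like "for" and "with" do not drive the similarity.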

Is this actually 'for anything'? I see references to sentences and images. If I, for example, wanted to compare audio samples, how would it work?

"Word2vec for anything" is where we want to get to. Right now we only support images and text, but you can see the other data types on our roadmap at https://www.basilica.ai/available-embeddings/ .

Hi Lucy. Looks great. Do you have any production use cases you can tell us about? Are you a YC company?


No production use cases yet. This is the first usable release, and it's the bare minimum we felt we could build before showing it to people.

> Are you a YC company?

We have a YC interview on Friday, so hopefully in a few days I'll be able to say yes.

So the actual code is closed-source?

Hi, Jorge from Basilica here.

Yes. We intend to run this as a cloud service API for now.

Do you think board game states might be a good target later?

Sort of depends on how late "later" is.

In the very-long term, I want us to literally have embeddings for everything people want to embed, which will probably include the states of popular boardgames.

I'm not sure how we'll get there. Maybe we'll have community embeddings, or an embedding marketplace, or we'll abstract away the process of creating an embedding so well that we can create simple embeddings just by pointing our code at a dataset. But I'd like to get there eventually.

In the less-very-long term, we're still focusing on embeddings that are either very general and useful across a lot of domains (e.g. images, text), or embeddings that have clear and immediate business value (e.g. resumes), since running GPUs is expensive.

Probably not, board game states are different for different games, so I doubt this will be a big enough niche.

How are you embedding images?

We're feeding them through a deep neural network and using the activations of an intermediate layer as an embedding.

You can read more about this technique in https://arxiv.org/abs/1403.6382 if you're interested.

For most purposes, taking a decent ImageNet model and ripping off a couple final layers works reasonably well.

>Job Candidate Clustering

>Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.

Wonderful! We were in dire need of yet another black-box criterion by which employers can reject candidates.

“We’re sorry to inform you that we choose not to go on with your application. You see, for this position we’re looking for someone with a different embedding.”


For what it's worth, people are doing job candidate clustering anyway right now. It's just that most people are doing it with keyword search or something.

Doing it with embeddings instead would probably increase the quality of the clustering at the cost of some interpretability (i.e. you wouldn't be able to say "We didn't show your resume to this employer because it didn't contain both the word 'Java' and the word 'Agile'", but maybe that's a good thing).

It's sort of a hard philosophical question how much you care about transparency/interpretability vs. quality, especially for socially important tasks like hiring.

Right, it's a little absurd to complain about flaws in a potential hiring filter without realizing how incredibly flawed current hiring is, relative to some unrealized ideal.

Respectfully I disagree, if something is bad then the fact that the other thing might be bad doesn't make it ok. Either be demonstrably better (this isn't) or stop.

The solution does claim to be better along some axes; your statement would seem to imply that you can't work towards a solution to any problem until all of them are solved. That seems like a recipe for never improving anything, no?

It would be like complaining about an agricultural innovation that's projected to reduce starvation, because it doesn't also cure diabetes. I wouldn't want such progress to "just stop" because they're leaving a given problem unsolved.

If something doesn't work, sometimes it's better to try something else. Especially when the only way to see if something is better is to try it.


    king
  - man
  + woman
  = queen

    resumes of candidates
  - resumes of employees you fired
  + resumes of employees you promoted
  = resumes of candidates you should hire

Unless you have thousands of fired and promoted employees, it may easily end up "Sorry, most of our promoted employees are Indians, because they were our founder's close friends and joined earlier, and that one guy fired for flirting with a customer was the only one from UIUC. Your name doesn't sound sufficiently Indian and you graduated from UIUC. Bye."

Worse, the person who looks at the rejection decision may have no idea that it boils down to this.


      computer programmer
    - man
    + woman
    = homemaker

[0] - https://arxiv.org/pdf/1607.06520.pdf

Note that some of this research, especially early, overstated the 'bias' here because they didn't realize that the default 'analogy' routines specifically rule-out returning any word that was also in the prompt words. So, even if closest word-vector after the `man->woman` translation was the same role (as is often the case), you wouldn't see it in the answer.

Further, they cherry-picked the most-potentially-offensive examples, in some cases dependent on the increased 'fuzziness' of more-outlier tokens (like `computer_programmer`).

You can test analogies against the popular GoogleNews word-vector set here – http://bionlp-www.utu.fi/wv_demo/ – but it has this same repeated-word-suppression.

So yes, when you try "man : computer_programmer :: woman : _?_" you indeed get back `homemaker` as #1 (and `programmer` a bit further down, and `computer_programmer` nowhere, since it's filtered, thus unclear where it would have ranked).

But if you use the word `programmer` (which I believe is more frequent in the corpus than the `computer_programmer` bigram, and thus a stronger vector), you get back words closely-related to 'programmer' as the top-3, and 23 other related words before any strongly-woman-gendered professions (`costume_designer` and `seamstress`).

You can try lots of other roles you might have expected to be somewhat gendered in the corpus – `firefighter`, `architect`, `mechanical_engineer`, `lawyer`, `doctor` – but continue to get back mostly ungendered analogy-solutions above gendered ones.

So: while word-vectors can encode such stereotypes, some of the headline examples are not representative.
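The input-suppression effect described in this comment can be reproduced on a toy vocabulary. The 3-D vectors below are hand-made purely to illustrate the mechanism and say nothing about any real corpus; `analogy` and its `exclude_inputs` flag are hypothetical names, not a specific library's API.

```python
import math

# Hand-made toy vectors, chosen only to illustrate the filtering effect.
vectors = {
    "man":        [0.1, 0.1, -1.0],
    "woman":      [0.1, 0.1, 1.0],
    "programmer": [3.0, 0.0, 0.0],
    "developer":  [2.5, 0.8, 0.0],
    "homemaker":  [0.0, 3.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, exclude_inputs=True):
    """Rank words by cosine similarity to b - a + c (a : b :: c : ?)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    skip = {a, b, c} if exclude_inputs else set()
    candidates = [w for w in vectors if w not in skip]
    return sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)

filtered = analogy("man", "programmer", "woman")
unfiltered = analogy("man", "programmer", "woman", exclude_inputs=False)
print(filtered[0])    # developer
print(unfiltered[0])  # programmer
```

With the filter on, the prompt word can never be returned, so the reported answer is whatever ranks next; with the filter off, the nearest vector here is the original role itself.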

One thing I’ve been tempted to research but never had time for myself: can one use that aspect of word embeddings to automatically detect and quantify prejudice?

For example, if you trained only on the corpus of circa 1950 newspapers, would «“man” - “homosexual” ~= “pervert”» or something similar? I remember from my teenage years (as late as the 90s!) that some UK politicians spoke as if they thought like that.

I also wonder what biases it could reveal in me which I am currently unaware of… and how hard it may be to accept the error exists or to improve myself once I do. There’s no way I’m flawless, after all.

> For example, if you trained only on the corpus of circa 1950 newspapers, would «“man” - “homosexual” ~= “pervert”» or something similar?

If it did, what conclusion would you be able to draw?

As far as I know, there's no theoretical justification for thinking that word vectors are guaranteed to capture meaningful semantic content. Empirically, sometimes they do; other times, the relationships are noise or garbage.

I am wholeheartedly in favor of trying to examine one's own biases, but you shouldn't trust an ad-hoc algorithm to be the arbiter of what those biases are.

I think there's a further problem that there's never been a shortage of evidence, about things like this. The point is, prejudice and discrimination are not evidence-based in the first place. People who support existing unjust structures are generally strongly motivated to turn a blind eye. Even people who don't support them are - it's simply far easier and more socially advantageous to stop worrying and love the bomb.

I think this is a large part of what goes on in the digital humanities - to varying degrees of success. The problem, as usual, is not that there isn't an abundance of evidence. It's simply that nobody reads sociology papers except sociologists.

In this formulation wouldn't Basilica reflect the existing biases of the organization?

    resumes of candidates
  - resumes of employees you fired
  + resumes of employees you promoted
  = resumes of candidates you should hire

It's a lot of hard work to reduce bias in promotions and terminations.

Basilica might reinforce that hard work when evaluating candidates.

Or you could use the techniques described in your citation to allow Basilica to help de-bias the hiring process.

The second one being (I assume this is your point) merely a way to copy all of your existing biases without being able to see them.

E.g. if you fire all the black people and don't promote women, guess what resumes Artificial Intelligence will send you.

I don’t think this approach will give you a good signal.

For the most part, people don’t get fired due to their skills. They get fired for lacking in execution or behavior. Someone screws up a production deployment or makes lewd comments about a coworker. This is hard to spot in a resume.

This breaks the Show HN guidelines (https://news.ycombinator.com/showhn.html) as well as the HN guidelines (https://news.ycombinator.com/newsguidelines.html), which ask you not to post shallow dismissals, especially of others' work. That's particularly important in Show HN threads. We don't want a culture where the reflex is to be a jerk and kick things.

That doesn't mean you can't raise concerns. Someone else raised the same concern that you did in a fine way: https://news.ycombinator.com/item?id=18348005. When in doubt, emulate them and ask a simple question.

"Your CV must have fell between the cracks in multidimensional vector space"

Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?

There are a number of off-the-shelf models that would give you image/sentence embeddings easily. Anyone with sufficient understanding of embeddings/word2vec would have no trouble training an embedding that is catered to the specific application, with much better quality.

For NLP applications, the corpus quality dictates the quality of the embedding if you use simple W2V. Word2Vec trained on the Google News corpus is not gonna be useful for a chatbot, for instance. Different models also give different quality of embedding. As an example, if you use Google BERT (a bidirectional Transformer) then you would get world-class performance in many NLP applications.

The embedding is so model/application specific that I don't see how a generic embedding would be useful in serious applications. Training a model these days is so easy to do. Calling the TensorFlow API is probably easier than calling the Basilica API 99% of the time.

I'd be curious if the embedding is "aligned", in the sense that an embedding of the word "cat" is close to the embedding of a picture of a cat. I think that would be interesting and useful. I don't see how Basilica solves that problem by taking the top layers off ResNet though.

I appreciate the developer API etc, but as an ML practitioner this feels like a troll.

> Calling the TensorFlow API is probably easier than calling the Basilica API 99% of the time.

Maybe, but training/curating data appropriate for your application isn't. It's not in that state right now but I think this service could save you some time if they had a domain-relevant embedding ready to roll for your application and it performed decently well -- that would save you a lot of time gathering training data and help you focus on the "business logic" ML needs that accept the embeddings as input.

That said, they'd need to be more performant than, say, GloVe 2B, which I can get for free off of torchtext, meaning they have to do the domain-specific heavy-lifting.

Hi there :)

Apologies for the super long response, but you had a lot of points.

> Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?

Hopefully you're missing something, or we've been wasting a great deal of our time ;)

> There are a number of off-the-shelf models that would give you image/sentence embeddings easily. Anyone with sufficient understanding of embeddings/word2vec would have no trouble training an embedding that is catered to the specific application, with much better quality.

For images and text, it's definitely true that you can train your own embeddings with an off-the-shelf model. But I think it's more likely that we end up in a place where a small number of people train a bunch of really good models and everyone else uses them.

I think this for three reasons:

1. It's what we've seen with word2vec. The vast majority of people that use word2vec aren't training it themselves, they're downloading pretrained weights.

2. Most people don't have enough data to train a good embedding themselves. There are good public datasets for images and text, but we're planning to produce embeddings for more niche verticals too.

Keep in mind that modern deep neural nets are very data hungry, and the problem gets worse every year. In a few years I think we're going to be in a spot where getting state of the art performance requires a lot of compute, and more data than most people have access to.

3. Prebuilt embeddings drastically speed up development. If you have a traditional model, and you think feeding some images into it might improve it, you can test that hypothesis in twenty minutes with Basilica. We've talked to a lot of teams that have high-dimensional data lying around which they think might improve their models, but they aren't sure, and they can't really justify a week or two of someone's time to explore it.

> For NLP applications, the corpus quality dictates the quality of the embedding if you use simple W2V. Word2Vec trained on the Google News corpus is not gonna be useful for a chatbot, for instance. Different models also give different quality of embedding. As an example, if you use Google BERT (a bidirectional Transformer) then you would get world-class performance in many NLP applications.

> The embedding is so model/application specific that I don't see how a generic embedding would be useful in serious applications. Training a model these days is so easy to do. Calling the TensorFlow API is probably easier than calling the Basilica API 99% of the time.

It's definitely true that you usually want your input distribution to be reasonably close to the distribution the embedding was trained on. (Although it's worth noting that having a different distribution for your embedding acts as a form of regularization, and sometimes that matters more than the problems you get from the distributional shift.)

I think you're overstating the case though. An embedding trained on a wide variety of sources will perform really well on a lot of tasks, and often other things like the amount of data you trained on matter more than distributional similarity.

You may find https://research.fb.com/wp-content/uploads/2018/05/exploring... interesting, especially the end of section 3.1.2. The paper trains a giant network on billions of Instagram images, and then explores both fine-tuning it on Imagenet and using the features of the last layer as inputs to a logistic regression (which they call "feature transfer" rather than "embedding").

The logistic regression trained on the Instagram features gets 83.6% top-1 accuracy, compared to 85.4% for full network fine-tuning and 80.9% for a ResNeXt model trained directly on ImageNet.

In other words, the effect of the larger training set dominated the distributional shift.
