
Show HN: Basilica – word2vec for anything - hiphipjorge
https://www.basilica.ai/
======
mlucy
Hey all,

I did a lot of the ML work for this. Let me know if you have any questions.

The title might be a little ambitious since we only have two embeddings right
now, but it really is our goal to have embeddings for _everything_. You can
see some of our upcoming embeddings at [https://www.basilica.ai/available-
embeddings/](https://www.basilica.ai/available-embeddings/).

We basically want to do for these other datatypes what word2vec did for NLP.
We want to turn getting good results with images, audio, etc. from a hard
research problem into something you can do on your laptop with scikit.
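
To give a concrete sense of what that looks like on the consumer side -- a minimal sketch, with made-up file names, once you've fetched embeddings for your images:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Embedding vectors fetched from the API and saved locally
    # (file names here are made up for illustration).
    X = np.load("image_embeddings.npy")   # shape (n_samples, embedding_dim)
    y = np.load("labels.npy")             # shape (n_samples,)

    # A plain linear model on top of the embeddings is often enough.
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=5).mean())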

~~~
xapata
What's the difference between an "embedding" and a projection, which I believe
is the more standard term for this kind of transformation?

~~~
mlucy
"Embedding" is the term I've heard used for this most often. It's definitely
the term that seems to dominate in the literature. (Just to pick a random
paper off my reading list:
[https://arxiv.org/abs/1709.03856](https://arxiv.org/abs/1709.03856) .)

In my mind "embedding" carries the connotation that you're moving into a
smaller space that's easier to work with, and where things which are similar
in some way are near each other.

~~~
xapata
The TensorFlow docs agree with you. This field has such jargon proliferation,
it's hard to keep up.

[https://developers.google.com/machine-learning/crash-
course/...](https://developers.google.com/machine-learning/crash-
course/embeddings/video-lecture)

~~~
mlucy
Yeah, I agree. The number of new nouns per year is kind of ridiculous.

------
e_ameisen
Interesting idea, but it seems to fall very much into the category of something
you would often want to build in-house. I always imagined the right level of
abstraction was closer to spaCy's: a framework that lets you easily embed all
the things.

If you are interested in how to build and use embeddings for search and
classification yourself, I wrote a completely open source tutorial here:
[https://blog.insightdatascience.com/the-unreasonable-
effecti...](https://blog.insightdatascience.com/the-unreasonable-
effectiveness-of-deep-learning-representations-4ce83fc663cf)

------
projectramo
What is the use case for this? (And this is a general point for AI cloud APIs)

Specifically, I am trying to think of an example where the user cares about a
vector representation of something, but doesn't care about how that vector
representation was obtained.

I can think of why it would be useful: the ML examples given, or perhaps a
compression application.

However, in each of these cases, it would seem that the user has the skill to
spin up their own, and a lot of motivation to do so and understand it.

~~~
mlucy
Apologies for the long answer, but this touches on a lot of interesting
points:

1\. Transfer learning / data volume. If you have a small image dataset,
embedding it using an embedding trained on a much larger image dataset is
really really helpful. In our tutorial
([https://www.basilica.ai/tutorials/how-to-train-an-image-
mode...](https://www.basilica.ai/tutorials/how-to-train-an-image-model/)), we
get good results with only 2-3k animal pictures, which is only possible
because of the transfer learning aspect of embeddings.

You could do transfer learning yourself, if you have the time and expertise.
And for a domain like images, it's really easy to find big public datasets.
But long-term we're hoping to have embeddings for a lot of areas where there
_aren't_ good public datasets, and pool data from all our customers to
produce a better embedding than any of them could alone.

2\. Ease of Use. You can take a Basilica image embedding, feed it into a
linear regression running on your laptop CPU, and get really good results. To
get equally good results on your own, you'd need to run tensorflow on GPUs.
This is harder than it sounds for a lot of people.

3\. Exploration. Because of the other two points, if you have a thought like
"huh, I wonder if including these images would improve our pricing model", you
can whip up some code and train a model in a few minutes to check. Maybe if
it's a big model you go grab lunch while it trains.

If you're doing everything from scratch in tensorflow, it can take days to try
the same thing. This activation energy reduces the amount of experimentation
people do. It's bad for the same reasons having a multi-day compile/test loop
would be bad.
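
For concreteness, the "do images help our pricing model?" experiment from point 3 can be about this much code (a rough sketch; the file names and the use of ridge regression are just illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Invented file names: existing tabular features, target prices,
    # and image embeddings for the same listings.
    X_tab = np.load("listing_features.npy")          # (n, k)
    X_img = np.load("listing_image_embeddings.npy")  # (n, d)
    y = np.load("prices.npy")                        # (n,)

    baseline = cross_val_score(Ridge(), X_tab, y, cv=5, scoring="r2").mean()
    with_img = cross_val_score(Ridge(), np.hstack([X_tab, X_img]), y,
                               cv=5, scoring="r2").mean()
    print(f"R^2 without images: {baseline:.3f}, with images: {with_img:.3f}")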

~~~
projectramo
Hi mlucy,

I agree with what you're saying here. I just wonder how it would work in
practice.

So imagine I have this monster text or image, and I want to know if it looks
like another text or image.

I send each to Basilica, it gives me back two vectors and I compare the
vectors.

I use the cosine similarity of the vectors as a score, and let's say it comes
out to be 0.6.

However, I think this is too low, and I want to tweak my algorithm.

At this point, doesn't the question of how the vector was generated come to
the front? Did you get rid of common words, how did you treat stems, and so
on? What biases did you introduce into training?

Furthermore, these questions come up right away, and they seem fundamental to
whatever the main practice is.

In other words, can I even experiment or get started without knowing how the
word2vec-style model works?

~~~
mlucy
You're definitely right that you sometimes need to know the exact details of
how an embedding is produced, especially if you're doing cutting-edge work.
That's one of the things we really need to improve documentation-wise. I'd
like to have a page for each embedding that talks about how it's generated,
what to watch out for while working with it, etc. etc.

I'm going to narrow in on the question of how to go about tweaking a model
that uses an embedding, since I think it's a really interesting topic.

To use your first example, let's say you're doing the image similarity task.
You probably wouldn't be computing the cosine distance on the embeddings
directly. You'd probably normalize and then do PCA to reduce the number of
dimensions to 200 or so.
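
Concretely, that step might look like this (a quick scikit-learn sketch; 200 components is just the ballpark figure above):

    import numpy as np
    from sklearn.preprocessing import normalize
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import cosine_similarity

    emb = np.load("image_embeddings.npy")   # raw embeddings, shape (n, d)

    # Normalize, then reduce to ~200 dimensions before comparing.
    reduced = PCA(n_components=200).fit_transform(normalize(emb))

    # Similarity between the first two images.
    print(cosine_similarity(reduced[:1], reduced[1:2])[0, 0])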

If you weren't getting good results, you'd have a few options. You could
fiddle with the normalization and PCA steps, which can have a big effect. You
could also include other handcrafted features alongside the embedding. But
let's say you have a fundamental problem, like your similarity score is paying
too much attention to the background of your images rather than the
foreground.

There are two major approaches to solving that sort of problem with
embeddings: preprocessing or postprocessing. You could preprocess the images
before embedding them to de-emphasize the backgrounds (e.g. by cropping more
tightly to what you care about). You could also postprocess the embeddings.
For example, you could label which of your images have similar backgrounds,
and instead of naive PCA you could extract components that maximally explain
variance while having minimal predictive power for background.
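
One crude way to do that last kind of postprocessing: fit a linear classifier on your background labels and project its direction out of the embeddings before the PCA step. A sketch (the background labels and file names are hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.decomposition import PCA

    X = np.load("image_embeddings.npy")      # (n, d)
    bg = np.load("background_labels.npy")    # (n,) hypothetical 0/1 labels

    # Direction in embedding space most predictive of the background.
    w = LogisticRegression(max_iter=1000).fit(X, bg).coef_[0]
    w /= np.linalg.norm(w)

    # Remove that component from every embedding, then reduce as usual.
    X_debiased = X - np.outer(X @ w, w)
    X_reduced = PCA(n_components=200).fit_transform(X_debiased)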

~~~
rpedela
I definitely agree you should add more documentation about how the word vector
model(s) are generated. Also, you may want to offer a set of models the user
can choose from. For example, Wikipedia is good for a general language use
case, but for something more technical, such as finance, SEC filings are a
better data source.

------
ASpring
How do you plan to counter the harmful societal biases that embeddings embody?

See Bolukbasi
([https://arxiv.org/pdf/1607.06520.pdf](https://arxiv.org/pdf/1607.06520.pdf))
and Caliskan
([http://science.sciencemag.org/content/356/6334/183](http://science.sciencemag.org/content/356/6334/183))

While these examples are solely language based, it is easy to imagine the
transfer to other domains.

~~~
hiphipjorge
Hi. Jorge from Basilica here.

We don't have any concrete plans to tackle this right now but it is something
we're definitely mindful of. Thanks for the links! We'll be sure to go through
them.

------
gugagore
Aren't these embeddings task-specific? For example, a word2vec embedding is
found by training the embedder on the task of predicting a word given the
words around it, on a particular corpus of text.

Sentence embeddings are trained on translation tasks. An embedding that works
for both images and sentences is found by training on an image-captioning
task.

The point I'm asking about is that there may be many ways to embed a "data
type", depending on what you might want to use the embedding for. Someone
brought up board game states. You could imagine embedding images of board
games directly. That embedding would only contain information about the game
state if it was trained for the appropriate task.

~~~
mlucy
You can definitely improve performance by choosing an embedding closely
related to your task. In the future we're hoping to have more embeddings for
specialized tasks.

Kind of surprisingly, though, if you get your embedding by training a deep
neural network to do a fairly general task -- like denoising autoencoding, or
classification with many classes -- it ends up being useful for a wide variety
of other tasks. (You get the embedding out of the neural network by taking the
activations of an intermediate layer.)

In some sense you'd expect this, since you'd hope that the intermediate layers
of the neural network are learning general features -- if they were learning
totally nongeneral features, it would be overfitting -- but I found it
surprising when I first learned about it.

------
piccolbo
You quote a target of 200ms per embedding; I'm not sure if that's for one type
of embedding in particular. I am using Infersent (a sentence embedding from FAIR
[https://github.com/facebookresearch/InferSent](https://github.com/facebookresearch/InferSent))
for filtering, and they quote a number of 1,000 sentences per second on a
generic GPU. That's 200 times faster than your number, but it is a local API,
so I am comparing apples to oranges. Yet it's hard to imagine you are spending
1ms embedding and 199ms on API overhead. I am sure I have missed a 0 here or
there, but I don't see where, other than that theirs is a batch number (batch
size 128) and maybe yours is a single-embedding number. Can you please clarify?
Thanks

~~~
piccolbo
So I am going to answer it myself. On batched data, it's a lot faster than
200ms per embedding and I'd say on a par with Infersent. On the other hand, I
couldn't get statistical performance in the same ballpark as Infersent and I
had to backtrack. This was training a logistic regression on the embeddings to
filter some text streams according to my preferences. Had I gotten comparable
accuracy, I would have preferred Basilica, as Infersent is py2-only, hard to
install and distribute, and a battery killer on my laptop. Its vectors are also
4x bigger. I experienced some server errors, and the team at Basilica was very
responsive and fixed them; I was very pleased with the interaction. It would be
important IMHO to publish some benchmark results for these embeddings, as is
usually done in the universal-embedding literature, or to serve published
embeddings with known performance when licensing terms are favorable.

~~~
piccolbo
Another update: on their v2 sentence embedding, Basilica is ahead of Infersent
for my task. Well done, Basilica!

------
jdoliner
How much does this depend on the data type? I.e. do you need people to
specify: this is an image, this is a resume, this is an English resume, etc.
Could you ever get to a point where you can just feed it general data, not
knowing more than that it's 1s and 0s?

~~~
mlucy
That's a really interesting idea.

I can't really think of a barrier to this. Detecting the file format is
straightforward, and generic image/text/etc. embeddings work surprisingly
well. (In fact, you can actually get some generalization gains by training
subword text embeddings on corpora in multiple languages.)

If we wanted to be able to use specific embeddings (e.g. photos vs. line art,
English vs. German), we could probably do it by running the data through a
generic embedding, and then seeing which cluster of training data it's closest
to and running it through that specific embedding.

It would be really important in this case to make sure that all the specific
embeddings are embedding into the same space, in case people have a mixed
dataset, but that's very doable.
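
A sketch of that routing idea, with placeholder embedders standing in for the real generic and specialized models:

    import numpy as np
    from sklearn.cluster import KMeans

    # Placeholder embedders for illustration; in practice these would call
    # the generic model and the per-cluster specialized models.
    def embed_generic(item):
        return np.random.rand(512)

    def embed_specialized(cluster_id, item):
        return np.random.rand(512)

    # Cluster the generic embeddings of the training data once.
    training_items = list(range(1000))
    train_generic = np.stack([embed_generic(x) for x in training_items])
    router = KMeans(n_clusters=4).fit(train_generic)

    def embed(item):
        # Route each new item to the specialized embedding whose training
        # cluster it is closest to in the generic space.
        cluster = router.predict(embed_generic(item).reshape(1, -1))[0]
        return embed_specialized(cluster, item)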

------
pkaye
Slightly different topic, but what are some approaches to categorizing
webpages? For example, I have thousands of web links I want to organize with
tags. Is there a software technique to group them by related topics?

~~~
yorwba
The task is known as document clustering
[https://en.wikipedia.org/wiki/Document_clustering](https://en.wikipedia.org/wiki/Document_clustering)
or topic modeling
[https://en.wikipedia.org/wiki/Topic_model](https://en.wikipedia.org/wiki/Topic_model)

Generally, you'll want to extract features (e.g. word counts) and then apply a
clustering algorithm to group related documents together. The precise details
are the subject of thousands of papers, each one doing things slightly
differently.
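
A minimal version of that pipeline with scikit-learn (you'd substitute the scraped text of your own links):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Raw text of each page, e.g. scraped and stripped of HTML beforehand.
    pages = ["first page text ...", "second page text ...", "third page text ..."]

    # Word-count-style features (TF-IDF here), then a clustering algorithm.
    X = TfidfVectorizer(stop_words="english").fit_transform(pages)
    labels = KMeans(n_clusters=2).fit_predict(X)
    print(labels)   # one cluster id per page, usable as a rough topic tag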

------
Lerc
Is this actually 'for anything'? I see references to sentences and images. If
I, for example, wanted to compare audio samples, how would it work?

~~~
mlucy
"Word2vec for anything" is where we want to get to. Right now we only support
images and text, but you can see the other data types on our roadmap at
[https://www.basilica.ai/available-
embeddings/](https://www.basilica.ai/available-embeddings/) .

------
kolleykibber
Hi Lucy. Looks great. Do you have any production use cases you can tell us
about? Are you a YC company?

~~~
mlucy
Thanks!

No production use cases yet. This is the first usable release, and it's the
bare minimum we felt we could build before showing it to people.

> Are you a YC company?

We have a YC interview on Friday, so hopefully in a few days I'll be able to
say yes.

------
msla
So the actual code is closed-source?

~~~
hiphipjorge
Hi, Jorge from basilica here.

Yes. We intend to run this as a cloud service API for now.

------
captn3m0
Do you think board game states might be a good target later?

~~~
mlucy
Sort of depends on how late "later" is.

In the very-long term, I want us to literally have embeddings for everything
people want to embed, which will probably include the states of popular
boardgames.

I'm not sure how we'll get there. Maybe we'll have community embeddings, or an
embedding marketplace, or we'll abstract away the process of creating an
embedding so well that we can create simple embeddings just by pointing our
code at a dataset. But I'd like to get there eventually.

In the less-very-long term, we're still focusing on embeddings that are either
very general and useful across a lot of domains (e.g. images, text), or
embeddings that have clear and immediate business value (e.g. resumes), since
running GPUs is expensive.

------
asdfghjl
How are you embedding images?

~~~
mlucy
We're feeding them through a deep neural network and using the activations of
an intermediate layer as an embedding.

You can read more about this technique in
[https://arxiv.org/abs/1403.6382](https://arxiv.org/abs/1403.6382) if you're
interested.
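
If you want to try the general technique yourself, the standard recipe is to take a pretrained classifier and keep the activations just before the final layer. A sketch with torchvision (not our actual model; the image file name is arbitrary):

    import torch
    from torchvision import models, transforms
    from PIL import Image

    # Pretrained ImageNet classifier with the final classification layer
    # removed, so the output is the last pooling layer's activations.
    resnet = models.resnet50(pretrained=True)
    extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("cat.jpg")).unsqueeze(0)   # any image file
    with torch.no_grad():
        embedding = extractor(img).flatten(1)   # shape (1, 2048)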

------
aaaaaaaaaab
>Job Candidate Clustering

>Basilica lets you easily cluster job candidates by the text of their resumes.
A number of additional features for this category are on our roadmap,
including a source code embedding that will let you cluster candidates by what
kind of code they write.

Wonderful! We were in dire need of yet another black-box criterion based on
which employers can reject candidates.

“We’re sorry to inform you that we have chosen not to move forward with your
application. You see, for this position we’re looking for someone with a
different _embedding_.”

~~~
panarky
word2vec:

        king
      - man
      + woman
      --------
      = queen

Basilica:

        resumes of candidates
      - resumes of employees you fired
      + resumes of employees you promoted
      ---------------------------------------
      = resumes of candidates you should hire
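
(Aside: for anyone who hasn't tried it, the first analogy really is a couple of lines with gensim and a pretrained model -- a rough sketch, assuming you've downloaded the Google News vectors separately:)

    from gensim.models import KeyedVectors

    # Pretrained Google News vectors, downloaded separately.
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # -> [('queen', ...)]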

~~~
kvb
word2vec[0]:

          computer programmer
        - man
        + woman
        ---------------------
        = homemaker

Basilica?

[0] -
[https://arxiv.org/pdf/1607.06520.pdf](https://arxiv.org/pdf/1607.06520.pdf)

~~~
ben_w
One thing I’ve been tempted to research but never had time for myself: can one
use that aspect of word embeddings to automatically detect and quantify
prejudice?

For example, if you trained only on a corpus of circa-1950 newspapers, would
«“man” - “homosexual” ~= “pervert”» or something similar hold? I remember
from my teenage years (as late as the 90s!) that some UK politicians spoke as
if they thought like that.

I also wonder what biases it could reveal in me which I am currently unaware
of… and how hard it may be to accept the error exists or to improve myself
once I do. There’s no way I’m flawless, after all.
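
To make the first question concrete: a crude, WEAT-style association score (in the spirit of the Caliskan paper linked upthread) might look something like this -- a sketch with gensim, using a hypothetical corpus-specific vector file and placeholder word lists:

    import numpy as np
    from gensim.models import KeyedVectors

    # Hypothetical vectors trained on the period corpus in question.
    wv = KeyedVectors.load_word2vec_format("1950s_newspapers.bin", binary=True)

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def association(word, positive_attrs, negative_attrs):
        # Mean similarity to one attribute set minus the other:
        # a crude, WEAT-style association score.
        return (np.mean([cos(wv[word], wv[a]) for a in positive_attrs])
                - np.mean([cos(wv[word], wv[a]) for a in negative_attrs]))

    positive = ["honest", "decent", "respectable"]     # placeholder word lists
    negative = ["criminal", "deviant", "dangerous"]
    print(association("homosexual", positive, negative))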

~~~
teraflop
> For example, if you trained only on a corpus of circa-1950 newspapers, would
> «“man” - “homosexual” ~= “pervert”» or something similar hold?

If it did, what conclusion would you be able to draw?

As far as I know, there's no theoretical justification for thinking that word
vectors are guaranteed to capture meaningful semantic content. Empirically,
sometimes they do; other times, the relationships are noise or garbage.

I am wholeheartedly in favor of trying to examine one's own biases, but you
shouldn't trust an ad-hoc algorithm to be the arbiter of what those biases
are.

~~~
pasabagi
I think there's a further problem that there's never been a shortage of
evidence, about things like this. The point is, prejudice and discrimination
are not evidence-based in the first place. People who support existing unjust
structures are generally strongly motivated to turn a blind eye. Even people
who don't support them are - it's simply far easier and more socially
advantageous to stop worrying and love the bomb.

------
mathena
Am I really missing something here, or is this thing complete nonsense with no
actual use cases whatsoever in practice?

There are a number of off-the-shelf models that would give you image/sentence
embeddings easily. Anyone with a sufficient understanding of embeddings/word2vec
would have no trouble training an embedding that is catered to their specific
application, with much better quality.

For NLP applications, the corpus quality dictates the quality of the embedding
if you use simple word2vec. Word2vec trained on the Google News corpus is not
going to be useful for a chatbot, for instance. Different models also give
different quality of embeddings. As an example, if you use Google's BERT (a
bidirectional Transformer), you would get world-class performance in many NLP
applications.

Embeddings are so model/application-specific that I don't see how a generic
embedding would be useful in serious applications. Training a model these days
is so easy to do. Calling a TensorFlow API is probably easier than calling the
Basilica API 99% of the time.

I'd be curious if the embedding is "aligned", in the sense that an embedding
of the word "cat" is close to the embedding of a picture of a cat. I think that
would be interesting and useful. I don't see how Basilica solves that problem
by taking the top layers off ResNet, though.

I appreciate the developer API etc, but as an ML practitioner this feels like
a troll.

~~~
vladf
> Calling a TensorFlow API is probably easier than calling the Basilica API
> 99% of the time.

Maybe, but training/curating data appropriate for your application isn't. It's
not in that state right now, but I think this service could be worthwhile if
they had a domain-relevant embedding ready to roll for your application and it
performed decently well -- that would save you a lot of time gathering
training data and let you focus on the "business logic" ML needs that take the
embeddings as input.

That said, they'd need to be more performant than, say, GloVe 2B, which I can
get for free off of torchtext, meaning they have to do the domain-specific
heavy-lifting.
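
(For reference, grabbing that free baseline is about this much code -- a sketch; the "6B" set is just one of the pretrained GloVe options torchtext ships with:)

    import torch
    from torchtext.vocab import GloVe

    # Pretrained GloVe vectors ("6B" set); downloaded on first use.
    glove = GloVe(name="6B", dim=300)

    cat, dog = glove["cat"], glove["dog"]
    print(torch.nn.functional.cosine_similarity(cat, dog, dim=0).item())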

