
The chemfp project: problems selling free software - dalke
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8#Sec24
======
dalke
I started the chemfp project in part to see if I could develop a self-funded
free/open source product in my field, cheminformatics. (In short, storing and
searching chemical information on a computer. Chemfp does very fast Jaccard-
Tanimoto similarity search for "short"/O(1024 bit) bitstrings.)
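To make that concrete: the whole computation reduces to two popcounts. A
minimal sketch (my illustrative names, not chemfp's actual API; int.bit_count
needs Python 3.10+):

    # Tanimoto/Jaccard similarity of two fingerprints packed into Python
    # ints: popcount of the intersection over popcount of the union.
    def tanimoto(fp1: int, fp2: int) -> float:
        both = (fp1 & fp2).bit_count()    # bits set in both fingerprints
        either = (fp1 | fp2).bit_count()  # bits set in either fingerprint
        return both / either if either else 0.0  # define T(0, 0) as 0.0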

The answer: no.

The section I linked to highlights some of the problems I had selling software
under the principles of free software. For example: How do I provide a demo
if I always provide MIT-licensed source code? Academics expect discounts, but
they are also the ones most likely to redistribute the code. Which is not a
wrong thing to do! But it affects the economics in a way I could never
resolve, compared to proprietary/"software hoarding" licensing models.

As an HN note, I contracted a couple people to help improve the popcount
implementations. HN user nkurz developed and tweaked the AVX2 implementation,
and proofread the paper. Thanks nkurz! As a result, chemfp is, I believe, the
fastest single-threaded Tanimoto search implementation for CPUs available, and
most likely memory bandwidth limited, not CPU limited.
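
For a sense of what that kernel does, here is a pure-Python sketch of the
brute-force threshold scan. The real kernel is C with specialized popcount
implementations, so treat this as an illustration, not chemfp's API:

    # Brute-force threshold search over packed fingerprints (illustration
    # only; chemfp's kernel is C with AVX2/POPCNT over contiguous memory).
    def threshold_search(query: int, targets: list[int], threshold: float):
        hits = []
        for i, target in enumerate(targets):
            both = (query & target).bit_count()
            either = (query | target).bit_count()
            if either and both / either >= threshold:
                hits.append((i, both / either))
        return hits

A typical query might use threshold=0.7, a common cutoff for "similar" in
this field.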

(Note: the mods asked me to repost. My earlier post is at
[https://news.ycombinator.com/item?id=23598470](https://news.ycombinator.com/item?id=23598470)
.)

~~~
alicemaz
have you considered offering it as a cloud platform? we're doing something
along these lines, niche scientific software (biological modeling,
bioinformatics) as a paid hosted service. still at the prototype stage! so I
can't comment on how well the business model actually works yet lol

but the idea is our mathematician will be able to publish whatever novel math
she develops, and we may eventually open source the math core as a reference
impl, but we'll keep all the cluster management and other supporting
infrastructure code proprietary. sort of a "if you want to run it on your
desktop, go ahead! if you want to actually scale this up for big jobs, we've
done all the legwork already so it's really in your best interests to just pay
us." I think open source ideals are good and worthy but from a business
perspective, you capture value by providing value that can't be got without
you. relying on customer goodwill is particularly difficult because in any
large org, the people who will feel goodwill toward you and the people who can
authorize purchases are in two different departments

also fwiw I think if you wanted to do the model you described in the paper
unchanged, gpl is a much better choice than mit. copyleft actually serves as a
wonderful poison pill: you can try us out for free, but if you want to ship
us, you need to pay for a proprietary license or legal will nail you to the
wall. whereas mit, there's no stick. I've seen affero used by several projects
for this express purpose: you _have_ to buy a proprietary license because agpl
is so onerous you just can't use the code for commercial purposes at all

interesting project btw, I love seeing stuff like this!

~~~
dalke
Thanks!

Yes, I've considered a cloud platform. There are several big difficulties
with that.

First, data. It's easy to grab public data from PubChem, ChEMBL, and a few
other projects, and make a service. But why would anyone pay for it given that
PubChem, ChEMBL, ChemSpider, and others already provide free search services
of that data?

There's search-as-improved-sales, like how Sigma-Aldrich lets people do a
substructure search to find chemicals available for sale.

There's value-add data. eMolecules includes data from multiple vendors, to
help those who want to purchase compounds more cheaply.

Or there's ZINC, which already provides search for their data.

So you can see there's plenty of competition for no-cost search. I don't have
the ability to add significant new capabilities that people are willing to pay
for.

Note also there's a non-trivial maintenance cost to keep the data sets up-to-
date.

Second, the queries themselves may be proprietary. I talked with one of the
eMolecules people. Pharmaceutical companies will block network access to
public services to reduce the temptation of internal users to do a query using
a potential $1 billion molecular structure (or potential $0 structure).
eMolecules instead has NDAs with many pharmas which legally bind them. Managing
these negotiations takes experience I don't have, and neither do I have the
right contacts at those pharmas.

Sequences don't have quite the same connection between sequence and profit as
molecules do.

BTW, part of the conclusion of my work is that people don't need a cluster for
search - they can handle nearly all data sets on their laptop, so there
shouldn't be a need to scale up any more. And small molecule data has a much
smaller growth curve than sequence data, so Moore's Law is keeping up.

My first customer, who continues to be a customer, said outright that they
would not buy if it were under GPL.

Since my paying customers are pharmaceutical companies who, as a near-rule,
don't redistribute software, it doesn't really matter if they don't
redistribute under MIT or don't redistribute under GPL.

I came into the project in part to see if FOSS could be self-supporting _on
its own_. AGPL is often used as a stick to try to get people to use a
commercial
license - the implicit view of the two-license model is that FOSS is not
sustainable. Which is now my conclusion, for this project and field.

------
zokier
I think to truly appreciate FOSS as a model, one needs to shift away from
thinking of software as an asset to be monetized and toward seeing it as a
liability that needs to be managed and maintained. Then the benefit of FOSS
becomes clear: by publishing your software, there is the possibility of
sharing that burden with others instead of carrying it alone.

~~~
taneq
I agree. I can't see how you can 'sell free software' as a standalone product.
You build and evangelise free software while selling feature requests,
services and support to the users.

~~~
dalke
You can sell a proprietary product, right? With a restrictive software
license?

That means customers are willing to pay $base + $yearly renewal for the
product.

Why aren't they willing to pay the _same price_ for the _same product_ but
with an open source license?

I really don't understand why they don't.

I'll go one further - how much will people pay for an open source license over
a source available license with a right to modify, no time limits/renewal
requirement, but no distribution right?

Answer: all but one of my customers jumped at the chance to reduce the cost by
switching from the MIT license to a not-quite-open-source license.

Which means they don't really value the redistribution right.

And I saw this at one small conference about industry use of open source. The
organizers - who use chemfp! - stated at the start that the biggest reason
they love open source is because it's "free" (meaning no cost), not the
principles of software freedom nor the improved development methodology of
open source.

I tried selling feature requests, services and support to the users. That was
my original plan, and it worked _so long as_ those feature requests were easy
and there were enough of them.

But consider that the upgrade to Python 3 took two months. Who pays for that?
The first customer who wanted Python 3 support five years ago, paying $20K for
a feature request which everyone else gets for free? There's an incentive to
wait on a feature request in the hope that someone else will pay for it.
Whereas the sales model - even as free software - lets me split the cost among
multiple customers who need that feature, and across a few years.

I also pointed out that selling services is a disincentive to developing good
documentation and good APIs. I feel like there's a sweet spot where if I were
to skimp on the documentation a bit, there would be an increased chance of
getting consulting work.

~~~
imtringued
>the biggest reason they love open source is because it's "free" (meaning no
cost), not the principles of software freedom nor the improved development
methodology of open source.

I'm a cheapskate but that's still pretty weird to me. Open source software is
free because the entire idea behind it is that users don't get excluded. It's
more about being accessible than about not charging money.

There was a dual-licensed HTML component that I was going to use at work, but
the commercial licensing conditions (not the price) were pretty bad: per-user
licensing with a strict upper limit on both active users and the number of
apps, even though we don't know how many people are going to use the software,
most users are only going to use it for one hour per month, and we would
probably integrate it into a library that is automatically included in every
one of our applications to maintain consistency, even when the commercial
component is not actively being used in a given project.

Paying $100/month or maybe a little more for a commercial license with few
restrictions that I can just plop in would have been a no-brainer, but since
I'd have to constantly play license Tetris, it's going to cost my company more
time than the product is worth in the long run. It's not a lack of money that
forced me to go with an open source project that also happens to be free. It's
the massive headaches caused by the commercial one.

~~~
dalke
My running hypothesis is that many people see open source as a way to avoid
dealing with upstream developers.

If I "pip install" a package which brings in a lot of other packages, I don't
need to have any relationship with any of those developers. It Just Works.

I don't have to know about their projects, find their web sites, read their
calls for funding, learn their licensing options, etc. I don't have to worry
about billing. It Just Works.

Even if the price is $100, the fact that it doesn't Just Work means the
effective price is far higher.

I decided to focus on industrial customers who were used to software in the
EUR ~5-20K/yr range (rather than the ~$1000/yr range), so the overhead costs
are proportionally smaller. That's also why I try to make the code fit into
the "Just Works" framework, e.g., on Linux-based OSes:

    pip install chemfp -i https://chemfp.com/packages/

------
dekhn
It's funny just how much the implementations described in the paper map to how
modern search engines implement retrieval. The same is true for BLAST and
other search engines.

(it's a very readable paper and I enjoy the frank expression of view, even if
I have a vastly different perspective on how to accelerate problems like this)

~~~
dalke
There's a deep connection between what I do and text retrieval in general.

Take a look at the early work in IR in the 1940s and 1950s, at
[https://en.wikipedia.org/wiki/Information_retrieval#Timeline](https://en.wikipedia.org/wiki/Information_retrieval#Timeline)

1947, "Hans Peter Luhn (research engineer at IBM since 1941) began work on a
mechanized punch card-based system for searching chemical compounds"

1950s, "invention of citation indexing (Eugene Garfield)" \- Garfield's
earlier work was with chemical structure data sets, and his PhD work was on
the linguists of chemical compound names.

1950: "The term "information retrieval" was coined by Calvin Mooers." \- that
was presented at an American Chemical Society (ACS) meeting that year, and in
the 1940s Mooers developed an early version of what is now called a connection
table, hand-waved a substructure search algorithm which was implemented a few
years later. (I'm a Mooers fanboy!)

Many of the early IR events were at ACS meetings - the concept of an "inverted
index" was presented at one, as I recall.

This is because in the 1940s, chemical search was Big Data, with >1 million
records containing many structured data search fields, and demand for chemical
record search from many organizations.

So many of the core concepts are the same, though in cheminformatics we've
switched to a lossy encoding of molecular features to a bitstring fingerprint
since we tend to look more at global similarity than local similarity, and
there are a lot of possible features to encode.
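
For instance, generating such a fingerprint with RDKit (one of the toolkits
chemfp works with; this snippet is illustrative) hashes local atom
environments into a fixed-length bit vector:

    # Encode a molecule's circular substructure features into a
    # 1024-bit Morgan fingerprint using RDKit (illustrative example).
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")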

Thank you for writing that it was a very readable paper. I have received very
little feedback of any sort about the publication, and have been worried that
it was too verbose, pedantic, or turgid.

~~~
dekhn
It's a bit verbose, and I really think it's several papers (the technical
details of the package are one, the open source positioning is another). But
it's readable - a person outside the field (say, a search engineer at Google)
could sit down, read this and immediately recognize what you were trying to
achieve ("implement popcnt" used to be a popular question), and then
immediately suggest ways to get the output results faster by using a cluster
:)

~~~
dalke
Indeed, it is several papers. There are two journals in my field - one I can't
read because it's behind a paywall and one that's expensive to publish in. I
chose the latter, but couldn't afford multiple months of rent in order to
publish several papers. :(

A blog post I wrote years ago used to be part of the "implement popcnt"
literature -
-
[http://www.dalkescientific.com/writings/diary/archive/2008/0...](http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html)
. It's now outdated, and actual low-level programmers have done better, but it
still gets mentioned in passing in postings like the one referenced on HN last
year at
[https://news.ycombinator.com/item?id=20914479](https://news.ycombinator.com/item?id=20914479)
.

~~~
dekhn
It's really extraordinary how tightly coupled modern innovation in scientific
fields is to processor implementations. I suspect you and I share a keen
interest in the path by which we got to this enviable situation.

------
rurban
I'm sceptical that a good single-CPU search can compete with massively
parallel HW, like this one:
[https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale](https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale)

~~~
dalke
Sure. It can't. Even GPUs will beat a CPU. In my paper I commented:

> GPU memory bandwidth is an order of magnitude higher than CPU bandwidth, so
> a GPU implementation of the Tanimoto search kernel should be about ten times
> faster. Chemfp has avoided GPU support so far because it’s not clear that
> the demand for similarity search justifies dedicated hardware, especially if
> the time to load the data into the GPU is slower than the time to search it
> on the CPU. GPUs are more likely to be appropriate for clustering mid-sized
> datasets where the fingerprints fit into GPU memory.

Corporate compound sets have ~5 million records. That can be searched on a
laptop in about 50ms.

A large data set containing physically measured properties is ~100M records,
which takes a bit over a second. The largest data sets people search, with
synthetically generated compounds, are around 1G records. Those require
distributed computing. But most people don't work with them.
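
Back-of-the-envelope, for why this is bandwidth-bound rather than
compute-bound: a 1024-bit fingerprint is 128 bytes, so a linear scan of 5M
fingerprints reads about 640 MB, and doing that in 50ms works out to ~13 GB/s
- roughly what a single core can stream from DRAM. The popcounts are
essentially free; moving the bytes is the cost.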

They say the best camera is the one you have with you. Most people have a CPU
with them. Fewer have massively parallel HW with them.

