

Ask HN: Fuzzy Logic SQL select framework - ssanders82

Hi, I've recently been reading up on fuzzy logic. The crux of it seems to be allowing members to have partial inclusion in a set, instead of our current set paradigm of "it's in or it's out. As an example, consider the set "Long Rivers". The Amazon would have a 1.0 membership in this set whereas the Mississippi might have - let's say - 0.8 membership, the Ohio River 0.5, and the stream in your backyard would be 0. This number quantifies how much the item belongs in the set.<p>An analogous example would be our everyday sql selects. Let's say a company is searching for young employees with great sales records, to consider for promotions. "SELECT * FROM Employee WHERE Age&#60;30 and Sales&#62;100000 ORDER BY Sales desc,Age asc". This statement will miss the 21 year-old with $90,000 in sales and the 31 year-old with $500,000, although those people may be bright young stars as well. Widening the search parameters waters down the results and the black &#38; white nature of it will always miss those on the cusp.<p>What the company <i>actually</i> wants is to do a sql statement "SELECT young employees with great sales records".<p>One solution would be fuzzy logic. They want employees that fall into two sets - 1.) young and 2.) good sales. The fuzzy solution would say, ok, every employee selling over $100,000 per year is a member of the Good Sales set with a membership value of 1.0. Over $90,000 is 0.8. $75,000 is 0.6.<p>Also, anyone less than 30 years old is a member of the Young Employee set with membership 1.0. 31 years old is 0.8. 35 is 0.5.<p>Once you define those parameters, by definition the membership of an employee in the two sets is the lowest membership he has in either set. Our precocious 21 year-old would be 0.8 (he has 1.0 in Young, and 0.8 in Good Sales) and our older but productive 31 year-old would also be 0.8 (0.8 Young and 1.0 Good Sales).<p>Our new query is something like "SELECT * FROM Employees WHERE Membership_YoungAndGoodSales&#62;0 ORDER BY Membership_YoungAndGoodSales DESC, Sales DESC, Age ASC". This first returns all employees with perfect matches (1.0 membership in both sets) but scales down to include partial matches as well that might warrant a further look, as long as they have a non-zero membership in both sets.<p>I'm also testing this now with NBA games - instead of selecting teams that have scored 110 points per game AND have held opponents to 90 points per game AND (etc.), I just want to "SELECT high-scoring teams with good defense AND (etc.)"<p>Anyway, I was just curious if there was any existing db framework or code to deal with this. The major challenge seems to be coming up with the partial membership weights (e.g. a 31 year-old is 0.8 Young...why not 0.7 or 0.9?)<p>I was thinking that it would be possible to write a db selection framework that works with existing sql filter statements without modification - it could pre-process it and return exact matches first, then partials (for fields where it has membership information), ordering by the membership weight DESC. Anyway, feedback?
======
mdasen
So, you can just think of this like a math problem. To get a score, you can
take the reciprocal of the person's age (so, someone 30 becomes 1/30) and then
(for readability's sake) multiply it by a constant (say, 30) to make someone
30 a 1.0, someone 31 becomes .96, someone 35 becomes .85, etc.

If you want it to vary more or less from that norm, you can use exponents and
logs. In your example, you wanted .8 for 31 and .5 for 35. I think raising it
to the 4th or 5th power will get you around that mark. So, you can make it
really sensitive to change in value or really insensitive to change depending
on the log or exp that you use.

Likewise, do the same thing with sales (only in this case, since higher is
better, we don't need to do the reciprocal thing). So, 100,000 becomes your
baseline and that's a 1.0 (just divide by 100,000). Here, linear looks good
since a person drawing 200k should be twice as good as the person drawing
100k, but you could use exponents or logs just like the previous example.

Then, you multiply them together. So, someone 30 years old who gets 100,000 in
sales has a 1.0 rating. Someone 25 years old with 100,000 in sales has a
2.0736 rating. Someone 35 with 500,000 in sales has a 2.69 rating. And so on.

The SQL would look something like this:

    
    
      SELECT *, (POW((1/age)*30, 4)*(sales/100000)) as the_awesome_score FROM table_name ORDER BY the_awesome_score
    

Basically, the difficulty in fuzzy matching is creating a common measurement.
Once we have both items on a common scale (here with age 30 and 100,000 in
sales being a 1.0), you can just multiply them and then you have your score.
You can easily adjust the numbers to make it more or less sensitive to change
on either side of the equation to suit your needs.

------
bayareaguy
This was a hot topic in databases 20 years ago and there have been various
academic projects which implemented user defined types and operators to do
this for systems like PostgreSQL.

Most of them haven't produced any useful code but a quick search turned up
this more promising site:
<http://calypso.cs.put.poznan.pl/~sqlf_j/en/index.php>

    
    
      SQLf_j is a query system, that enables fuzzy queries to relational
      database managements systems (currently: MySQL and PostgreSQL). The
      queries are written in SQLf which is an extension to SQL. SQLf_j
      works as a translator, that translates fuzzy queries to standard,
      SQL queries, and executes them with RDBMS.
    

There also seems to be some interesting stuff here:
<http://www.ecst.csuchico.edu/~juliano/Fuzzy>

------
Tangurena
If I were doing this, I'd look into writing CLR code to integrate with SQL
Server 2005/2008. As long as you could define classes and functions in .NET,
you could use them as datatypes or functions in SQL Server. For membership,
you might want to define them as functions in code, separate from the db.

<http://msdn.microsoft.com/en-us/library/ms345136.aspx>

[http://www.amazon.com/Professional-SQL-
Server-2005-Programmi...](http://www.amazon.com/Professional-SQL-
Server-2005-Programming/dp/0470054034/)

 _a 31 year-old is 0.8 Young...why not 0.7 or 0.9?_

Well, that depends on the problem domain. If you're dealing with US
Presidents, 31 would be 1.0 young. And 50 would probably be 0.9 young. If
you're dealing with dogs, 31 would be 0.0 young (I'm not aware of any dogs
living more than 29). Sit down with a bunch of domain experts (one at a time)
and poll them. You don't sit them together because they'll probably say things
like "yeah, that answer."

I've not done much research into this for quite some time (and way back then,
I was interested more in GIS). And a quick survey seems to say "nope, no
frameworks." <http://www.lcc.uma.es/~ppgg/FSQL.html>

------
frig
Sorry not to be more specific, but someone did almost exactly what you're
sketching. I'll get back to you once I've found the paper, or if I've not
found it again.

At first pass this might be it, but it looks different from what I remember:

<http://www.lcc.uma.es/~ppgg/FSQL.html>

What I saw basically translated "fuzzy" queries into equivalent SQL queries.

Actually just spin through these:

[http://scholar.google.com/scholar?hl=en&lr=&client=s...](http://scholar.google.com/scholar?hl=en&lr=&client=safari&q=fuzzy+sql&btnG=Search)

There's lots.

------
leonidtineo
At Universidad Simón Bolívar, Venezuela, we have developed three general
purpose fuzzy querying systems: SQLfi, SQLf-pl and PostgreSQLf. We have
several in proceedings communications about them. If you are interested in
using or contribute to the development: For PostgreSQLf contact professor José
Tomas Cadenas <jtcadenas@ldc.usb.ve>, for SQLf-pl contact professor Leonid
Tineo <leonid@usb.ve>, for SQLfi contact engineer Juan Carlos Eduardo
<jceduardo@gmail.com>. SQLfi is also disponible at
<http://xica.bd.cesma.usb.ve/sqlfiv4>

------
russell
I don't have a compelling need for a framework. If I had the need for such a
query, I would probably write a simple query to gather a superset of the
candidates, or not, if it looked like most of the records would be in the set.
Then I would write a Python program that would rank the candidate according to
whatever. It could all be done in SQL, but Python would be quicker in elapsed
time to a solution.

If this were a frequent run, I might do it purely in SQL, but not a framework,
SQL is sufficiently high level that I dont see much of a gain.

~~~
brand
Agreed. You'd be simply creating a composite value; your statements might look
messy, but there's nothing stopping you from doing this with SQL.

Additionally, you're still choosing arbitrary values, and there are still
going to be data points 'just outside' your selection - it may be more smooth,
but it's just a weighting of the same metrics.

------
arthole
because your fuzzy allocation is going to change from table to table, it would
probably be best to keep a master table of fuzzy allocations.

something along the lines of

columnname, lowvalue, highvalue, weight

that way, you could keep all your weightings for whatever column you wanted to
test on and do your joins on the low and high values.

it would also make it easier to modify your weightings and see how the queries
you generate change their results without having to change any sql. And if you
wanted multiple kinds of weightings for one column name you would just add a
name field.

eg:

create table fuzzy (name VARCHAR(10) NOT NULL, colname VARCHAR(30) NOT NULL,
lowval DECIMAL(10,2) NOT NULL, hival DECIMAL(10,2) NOT NULL, weight
DECIMAL(10,2) NOT NULL, PRIMARY KEY (name, colname, weight), INDEX fuzzy1
(name, colname, weight), INDEX fuzzy2 (name, colname, lowval, hival), INDEX
fuzzy3 (name, colname, hival, lowval) );

------
statictype
Sorry, ompletely unrelated, but I also recently looking at doing some 'mining'
on NBA stats. Where do you get your raw stats from?

~~~
ssanders82
I was using covers.com, it has historical results, spreads, and Over/Under
lines for several years back.

------
hboon
It is probably better to build such a framework on top of a search engine
rather than SQL.

