

Automatic Database Normalization? - zackmorris

Lately I've been working on projects that were started by someone else, and the
databases tend to be a mess. For example, instead of having separate Members
and Organizations tables (like for a user participating in politics), the
members are repeated over several rows, with the Organization Name field set
differently on each row. Then if the member's street address is updated, the
SQL UPDATE changes the Address field on all of the rows.

At first glance this seems horrible, and I'm tempted to make a table to store
member-organization entries. But unfortunately there are several hundred
queries in the app, and this is nontrivial to write, much less exhaustively
test.

So it got me thinking: why is this the programmer's responsibility? Why not
just make a database where you can do anything you want, and the contents of
each field would be stored once uniquely, with the duplicates represented by
pointers or hashes under the hood? Then run in a mode that guarantees atomic
updates over any number of rows.

Then the database could infer relationships from the queries that are executed
instead of from how the data is organized. It would generate the relationship
graph automagically, so the programmer just has to sign off on it. It could
even run in a mode where, if the code tries to execute an operation that
breaks the graph, a notification could be sent to IT or whatever.

So this isn't really NoSQL; it's more like unstructured or self-organizing
SQL. Has anyone heard of something like this? I googled a bit, but all I found
were these unhelpful links:

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4564486&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4564486

http://stackoverflow.com/questions/4477257/is-there-any-software-that-can-normalize-data-bases-tables

http://www.codinghorror.com/blog/2008/07/maybe-normalizing-isnt-normal.html

Thanks!
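To make the update anomaly concrete, here's a hypothetical minimal
reproduction using SQLite; the table and column names are made up to match
the description above:

```python
import sqlite3

# One row per member-organization pair, with the member's fields repeated,
# as described above. A purely illustrative schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE members (
        name TEXT,
        address TEXT,
        organization_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [
        ("Alice", "12 Oak St", "City Council"),
        ("Alice", "12 Oak St", "School Board"),
        ("Alice", "12 Oak St", "Parks Committee"),
    ],
)

# Updating one member's address has to touch every duplicated row.
cur = conn.execute(
    "UPDATE members SET address = ? WHERE name = ?", ("34 Elm St", "Alice")
)
print(cur.rowcount)  # 3 rows changed for a single logical fact
```

If any of those three writes were missed (say, by a buggy WHERE clause), the
member would end up with two different addresses in the same table.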
======
tom_b
Lots of homegrown dbs look like this. They started life as a doc, then a
spreadsheet, got sucked into Access, and now are living in SQL Server or
Oracle dbs. I have seen this a bunch in research environments where the tech
is a "one-off" to support the science. I usually do a little data cleaning and
maybe try some data warehouse-style restructuring so that you can report on
the existing data.

You might have luck replacing your current table mess with views (with the
same names as the tables) while you fix the structure to a more normalized
schema. But you might be better off starting from scratch and migrating to a
new "version" of the app while moving the data along to it.

As to your self-organizing data idea, I have wondered if something kind of
cool could come out of probabilistic graphical models, but I haven't dug into
the theory (there is a free Coursera course on it, I think) enough to say.

My guess is that what will get you in trouble with self-organizing data is
that contextual relationships currently interpreted by your end users (and
important/critical to them) will get lost in any algorithmic attempt to clean
the data.

You might also want to investigate the social sciences data scene - those
folks do lots of statistical/data hacking on qualitative data that involves
weirdness about coding "grey area" text answers into quantitatively tractable
code numbers. It looks and sounds like hell to me . . .

------
zackmorris
Hey thanks for your comments, views sound like the way to go for now.

I can see how to go from normalized data to views, but what if there was a
way to go the other direction? Like, the SQL that generated the view could be
run in reverse: when you update the view, it would know which fields to
update in the original, pure data.
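That reverse direction does exist in SQL today: several engines support
updatable views, and SQLite does it with INSTEAD OF triggers that route
writes against a view back to the base tables. A minimal sketch, with
illustrative schema names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE organization (id INTEGER PRIMARY KEY, org_name TEXT);
    CREATE TABLE member_organization (
        member_id INTEGER REFERENCES member(id),
        org_id INTEGER REFERENCES organization(id)
    );

    -- The view reproduces the old one-row-per-pair layout.
    CREATE VIEW members_flat AS
        SELECT m.id AS member_id, m.name, m.address, o.org_name
        FROM member m
        JOIN member_organization mo ON mo.member_id = m.id
        JOIN organization o ON o.id = mo.org_id;

    -- An INSTEAD OF trigger routes updates on the view to the base table,
    -- so the address is stored and changed exactly once.
    CREATE TRIGGER members_flat_update
    INSTEAD OF UPDATE OF address ON members_flat
    BEGIN
        UPDATE member SET address = NEW.address WHERE id = NEW.member_id;
    END;

    INSERT INTO member VALUES (1, 'Alice', '12 Oak St');
    INSERT INTO organization VALUES (1, 'City Council'), (2, 'School Board');
    INSERT INTO member_organization VALUES (1, 1), (1, 2);
""")

# The app keeps issuing its old-style UPDATE against the "flat" layout...
conn.execute("UPDATE members_flat SET address = '34 Elm St' WHERE name = 'Alice'")

# ...but every row of the view reflects the single underlying write.
rows = conn.execute("SELECT DISTINCT address FROM members_flat").fetchall()
print(rows)  # [('34 Elm St',)]
```

This is roughly the views-with-the-same-names migration strategy suggested
above: the hundreds of existing queries keep working while the data
underneath is normalized.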

If someone got all of that working, then I really question why database
organization matters. I know it's important for seeing relationships and all
that, but in reality, data just exists. The relationships are a level above
that. I really think that someone could make a database that works more like a
zip file, where it wouldn't matter how you laid it out, for the most part it
would just look like a spreadsheet and each entity would just have all of the
data in one row. This is how "regular" folks view things.

Also, this is a bit off topic, but automatic normalization has a lot in
common with the semantic web. I just don't think it's all that difficult of a
problem, but waiting around for humans to do it manually is never going to
happen. I guess I'm just naive, but all the semantic web needs is iteration,
and the relationships will refine and converge quickly. I don't understand
why it's not happening.

------
caw
I read the title of this, and I thought this was going to be asking about
automatically creating normalized schemas given the information.

Yes, you can eventually do it yourself, but I only ever have to create
database schemas once every 6 months or so. If I could plug in the information
using some higher level language, and it can crunch out the tables for me,
then that would be pretty cool. I know enough about databases to check to see
if the schema is right, or compare it to my constraints.

Or in the OP's case, this system would take the existing tables, and give you
some sort of migration scripts to get the data in normalized form.
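A hand-written version of such a migration script can be sketched in SQLite.
It assumes, naively, that the member name is a reliable key for
deduplication, which real-world data rarely guarantees; the table names are
illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical flat source table in the shape the OP describes.
conn.execute("CREATE TABLE members_flat (name TEXT, address TEXT, org_name TEXT)")
conn.executemany("INSERT INTO members_flat VALUES (?, ?, ?)", [
    ("Alice", "12 Oak St", "City Council"),
    ("Alice", "12 Oak St", "School Board"),
    ("Bob", "9 Pine Rd", "City Council"),
])

# One-off migration: deduplicate the entities, then rebuild the
# member-organization pairs as a join table.
conn.executescript("""
    CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT UNIQUE, address TEXT);
    CREATE TABLE organization (id INTEGER PRIMARY KEY, org_name TEXT UNIQUE);
    CREATE TABLE member_organization (member_id INTEGER, org_id INTEGER);

    INSERT INTO member (name, address)
        SELECT DISTINCT name, address FROM members_flat;
    INSERT INTO organization (org_name)
        SELECT DISTINCT org_name FROM members_flat;
    INSERT INTO member_organization
        SELECT m.id, o.id
        FROM members_flat f
        JOIN member m ON m.name = f.name
        JOIN organization o ON o.org_name = f.org_name;
""")

counts = [conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("member", "organization", "member_organization")]
print(counts)  # [2, 2, 3]: 2 members, 2 organizations, 3 memberships
```

The hard part an automated tool would have to solve is exactly that naive
assumption: deciding which repeated column groups are really the same entity.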

------
byoung2
_and I'm tempted to make a table to store member-organization entries. But
unfortunately there are several hundred queries in the app and this is
nontrivial to write, much less exhaustively test_

It is possible to normalize the database and still present the denormalized
form to the app using MySQL views (I'm assuming MySQL here). You have your
member table, organization table, and a member_organization join table with all
your foreign key constraints, and you create a view as a select join.

