Graphical Data Mining

At work, we recently decided to standardise on a graphical data mining tool rather than rely on me writing code.  This was intended to lower the barrier to entry, make processes more repeatable and so on.

In the process of implementing this I’ve been appalled at just how sorry the state of graphical data mining is.  I know that graphical programming was trialled in the 90s and died a natural death, but I would’ve thought a few more lessons would have been learned from that; it is more than ten years on now!

I trialled five software packages: Oracle Data Miner, SAS Enterprise Miner, PASW Modeler (which was known as Clementine at the time), Orange and KNIME.  Every one of them failed most of the test criteria.

Oracle Data Miner was by far the easiest to use: automatically normalising variables when appropriate, dropping highly correlated variables, and so on.  However, it was terrible when I attempted to scale to a large number of models.  The GUI left piles of cruft after itself, made marking of models as ‘production’ almost impossible, and provided almost no facilities for overriding the decisions it made (e.g. tuning the clusters).  The killer mistake, though, was its (lack of) ability to productionise code: its PL/SQL package output quite simply didn’t work (it would sometimes even crash!).
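For readers unfamiliar with the preprocessing Oracle Data Miner automates, dropping highly correlated variables is straightforward to do by hand.  A minimal sketch in Python with pandas (my own illustration, not Oracle’s algorithm; the 0.95 threshold is an arbitrary choice):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one column of each pair whose absolute Pearson
    correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    # and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
                   'c': [1, 0, 1, 0]})
print(list(drop_correlated(df).columns))  # 'b' is dropped
```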

SAS Enterprise Miner does a lot of things right, but its user interface is awful, with frequent unexplained crashes.  Some specific examples: it doesn’t support nested loops; it is limited to 32 values for a class variable, but this limit is not enforced, so later steps simply crash at random; dragging a node in between two others won’t connect it up; and it doesn’t support undo.

Steps that you will almost always want to perform, such as missing value imputation or normalising attributes, are easy enough to do, but they’re not done by default - you must always remember to add them.  To build an SVM with standard steps (sample, train/test split, impute, scale, test) requires finding and briefly considering half a dozen different nodes.  Why not default to normal practice and only require extra steps when you want to do something unusual?
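The standard steps above can be sketched in a few lines of code.  This is a minimal illustration using scikit-learn (not any of the GUI tools reviewed), on a synthetic dataset with holes punched in it so the imputation step has something to do:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real table (sample).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # introduce missing values

# Train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Impute, scale, and fit the SVM as one pipeline, so the same
# preprocessing is applied consistently at train and test time.
model = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # missing value imputation
    ('scale', StandardScaler()),                 # normalise attributes
    ('svm', SVC()),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test
```

That the whole workflow is a dozen transparent lines in code is part of why the half-dozen-nodes dance in the GUI tools grates.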

I won’t go into detail on PASW, Orange or KNIME, except to note that despite being free, KNIME had the best interface of all the tools I looked at.  Ultimately I decided SAS was the best, warts and all, due to its underlying power.  I do wonder what the designers were thinking: data mining best practice is now pretty well defined, so why not design your tool to make best practice the easiest option?