CME Connect

I've been involved for a long time in using data to make marketing relevant.  About three years ago I co-founded a startup that helps companies build their in-house marketing capability.  The company is called CME Connect and you can read more about it at http://www.cmeconnect.com.  In this post I'd like to talk about why I made this decision and why I decided to concentrate on improving in-house marketing capabilities.

I care very deeply about making marketing data-driven.  I believe that the majority of marketers worldwide make decisions based on their intuition rather than facts.  As a result, customers are bombarded by millions of essentially irrelevant messages.  The only reason marketers have got away with this is that customers do not have access to sophisticated filtering.

This is slowly changing.  Google Inbox automatically analyses what you open and will only put messages in your inbox if it thinks you're going to open them.  Others get relegated to the promotions subfolder that rarely gets read.  I believe that within five years, we will all have personal assistants like 'Siri', 'Cortana', 'Echo' and 'Ok Google' that filter out marketing material as effectively as spam has been eliminated.

Permission marketing simply won't be relevant any more.  Only relevant messages will be allowed through regardless of 'permission'.  I believe the primary way that brands will achieve relevance is by using data to understand what their customers want to know about.  Therefore I really want to help brands do exactly this.

The other big point of difference is that CME Connect helps you build your in-house marketing capability.  We are not walking around saying "give us your problems and we'll take them away".  My belief is that taking problems away is just too expensive.  I believe that using data to drive decisions is a fundamental change in behaviour, not a project.  Therefore outsourcing to an agency like Datamine or Affinity is the wrong approach for most brands because they can't afford to permanently outsource everything.

I have no problem with the services that Datamine or Affinity offer.  In fact I think they do a pretty good job, unlike most marketing agencies (Saatchi, Aim Proximity, DraftFCB, ...) which pretend to know data but have actually just employed a couple of analysts without changing their business practices.  Don't get me wrong, most agencies are far better at advertising than me, but their data analytics really sucks.

That brings me back to insourcing.  I simply don't have time to look after everyone's marketing, and even if I could employ and train enough people, I believe the outsourcing model does not change business processes sufficiently.  Therefore I believe the right approach is to raise the bar by helping everyone deliver better data-driven marketing.

In conclusion, I believe the world of the near future will be dominated by artificially intelligent personal assistants that carefully analyse all inbound messages on behalf of each person.  I believe that the only brands that will survive in this new world are those that use artificial intelligence to understand what each of their customers wants, and that I can make more of a difference by providing products and services to many companies than I could by helping a couple of brands that can afford to pay for my expertise.

Email Frequency

How often should you email your email subscribers?  Every month? Every week? Every day? Every hour?  The traditional answer to this question is that you should define email frequency rules that your company strictly enforces.

There are lots of problems with this approach.  From a pragmatic perspective it is really hard because you'll get desperate stakeholders begging you for an exception to get their huge product announcement out just hours after you deployed a routine weekly update.  Turning down the requests on the grounds that you're just enforcing contact rules is likely to be career limiting.  However, before you get all comfortable that you're at least doing the right thing by your customers, research in this field has been pretty definitive: contact frequency rules are useless, and if you have something relevant to say then you can email your customers every fifteen minutes.

Of course the operative word in the previous paragraph is relevance.  You live and breathe what your company offers, and the latest widget is pretty big news for you.  Your customers have a far more balanced life and are unlikely to be as excited by the widget as you are.  Contact rules came about as a way to prevent excited marketers from telling their subscribers about every little piece of news.

One of the emails I help with is for a sports team. Some of our fans want to hear everything that goes on while others only want to hear the biggest announcements of the season.

It's not just different levels of engagement. Some fans are interested in player profiles or match reports while others want discounted tickets and special deals.

Best practice is to not use frequency rules at all. Instead score how relevant this message is for each of your subscribers and only send it to those that pass the threshold.

This can't be achieved using campaign management software and is one of the major benefits provided by sophisticated marketing automation platforms.
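A sketch of what that per-subscriber scoring might look like.  The interest profiles, tags and threshold here are all invented for illustration; a real platform would derive the scores from behavioural data rather than hand-written sets.

```python
# Hypothetical relevance scoring: send a message only to subscribers whose
# interest profile overlaps enough with the message's tags.

def relevance(message_tags, interests):
    """Fraction of the message's tags that match this subscriber's interests."""
    if not message_tags:
        return 0.0
    return len(message_tags & interests) / len(message_tags)

def select_recipients(message_tags, subscribers, threshold=0.5):
    """Only subscribers that pass the relevance threshold get the email."""
    return [name for name, interests in subscribers.items()
            if relevance(message_tags, interests) >= threshold]

subscribers = {
    "superfan":  {"player-profiles", "match-reports", "tickets", "deals"},
    "casual":    {"tickets", "deals"},
    "stats-fan": {"player-profiles", "match-reports"},
}

routine_news = {"player-profiles", "match-reports"}
print(select_recipients(routine_news, subscribers))  # the casual fan is spared
```

The frequency question then answers itself: a superfan passes the threshold almost every time, while the casual fan only hears about tickets and deals.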


AB Testing

One of the most common tasks in marketing analytics (after post-campaign analytics) is determining which variation of a campaign was most effective.  Small tweaks to your subject line will increase your open rate, and small tweaks to your content will increase your CTR and eventual conversion rate.

This analysis works.  Some email deployment tools even have it built in: you set up a few variations, deploy slowly, and based on the relative open rates from the first recipients / test subjects, the tool automatically discards unsuccessful variations so the rest of the deployment gets the best one.

My problem with the whole approach is that it treats your audience as homogeneous.  What if one version resonates best with women and the other with men?  To me the problem is not about simple maximisation, it is about selecting something relevant.

Another problem with the technique is that the goal should not be maximising your open rate.  It should be maximising your relevance.  If a recipient is not going to complete the purchase then getting them to open or click through is wasting that person's time and destroying your reputation with them.  

I'd like to propose an alternative approach.  Instead, let's score each message based on its affinity to the person - i.e. a winner-takes-all problem rather than a global maximisation problem.  Reframed this way we get a very different solution: we need to represent both every message and every person in space, and the task is to find the closest message for each person.  Naturally we start off not knowing where each person (or message) is, so initial feedback is needed.
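A toy version of that winner-takes-all framing, with invented two-dimensional positions standing in for the learned ones:

```python
# Each person gets the single message closest to them in a shared feature
# space, rather than everyone getting the best-performing variant on average.
import math

messages = {"adventure-offer": (0.9, 0.1), "romance-offer": (0.1, 0.9)}
people   = {"alice": (0.8, 0.3), "bob": (0.2, 0.7)}

def closest_message(person_vec):
    """Winner-takes-all: return the nearest message for this person."""
    return min(messages, key=lambda m: math.dist(messages[m], person_vec))

assignments = {p: closest_message(v) for p, v in people.items()}
print(assignments)
```

With no feedback yet the positions are unknown, so the early sends double as the experiments that place people and messages in the space.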

The practical implications of this are simple.  Once we can build an understanding of which products you will like, we can make far better recommendations.  That is a well studied problem with plenty of good solutions.  For example we can use purchase history to predict your next purchase and then present it in your next email.  We can use your website behaviour to infer the probability of you clicking on two different messages in order to select the right news for you.  We can even manually place new products in space by looking at the products they are similar to, in order to send relevant messages without purchase or clickthrough history.

So why would you do this?  Because if done well you can create an expectation that your emails are worth reading even when nobody else's are.  This isn't easy... you will need to be significantly more relevant than people expect for approximately ten emails in a row before the pattern is noticed.  A single slip-up will consign your brand to the same position as your competition. However the benefits are huge.  Customers will automatically open and carefully read your emails without you having to shout to gain their attention.  

Anyway, please comment or otherwise get in contact with me if you find this interesting.  I'd like to make 2017 the year where broadcast advertising starts to die. 

Deployment best practice and the cloud

I have now worked for a number of years in cloud environments and in traditional environments.  But something I just noticed is how much easier the cloud makes it to implement best practice in development, testing and UAT. Let me give some examples:

Best practice says that the configuration of TEST / UAT should be the same as production.  I think most people attempt this in all environments, but it's very easy to accidentally do one little thing slightly differently in one environment, and from then on it's impossible to get them to converge again.  Within the cloud you don't keep a TEST environment; you clone production whenever you want to do testing.  As a result the test environment is always at exactly the same patch level and with an identical configuration in every other way.

Best practice says that the QA / UAT server should have the same hardware specifications as the production environment.  That's completely impractical in a normal environment: production is high performance and high reliability because everything depends on it working perfectly.  The number of additional surprises that get caught in UAT by having a perfect replica is almost never enough to justify the cost.  However in a cloud environment the cost of having UAT mirror production is almost zero - just clone production and call the clone UAT.  If production uses database mirroring then so will UAT - no more bugs due to slightly different database configurations.

Best practice says you should develop and test using real data.  But snapshotting production for QA is a pain, especially to do regularly.  Not so in the cloud, where it would be harder to boot a QA environment that differs from production than one which is identical.

Best practice says you should develop deployment scripts which automate the upgrade to the next version.  Such scripts are very hard to test in a normal environment because you have to repeatedly go back to the old version as part of testing.  However in a cloud environment you can easily clone production and test the deployment script as many times as necessary until it runs perfectly.  You can even determine exactly how long the deployment takes, and if it will affect users during the deployment.

Best practice says you should run UAT and production in parallel for a while to validate they give identical results.  That's pretty easy to do when you can just spin up another server, but next to impossible where it would mean dedicating a server for a few days just to check it gives the same results.

Best practice says you should only upgrade a few users initially to ensure everything continues to run smoothly.  I'd hate to think how you'd even start to do that without the cloud - something like a front server that redirects requests based on logins?  Incidentally, this is something Xero does not do despite having a cloud backend.

Every time I've used the cloud I've been disappointed by how expensive it is, and amazed at how valuable it is during the release process.  I wonder if there's a middle ground where a private cloud provides the ability to use a pool of cheap hardware while providing all of the benefits.

 

Is DM cold-calling?

Don Marti replied to my post in favour of direct marketing.  I still disagree with him but I can see where he's coming from.

In brief the idea of carefully building a profile of you in order to select the product you are most likely to be interested in creeps him out, whereas I would just say it's trying to be relevant.  If you regularly walk into a wine warehouse then you'll see a thousand different bottles of wine nicely categorised and be left to choose what you think is best, while if you regularly walk into a small independent store you'll be directed straight to what the sales assistant thinks you'll want. 

Computers have been attempting to automate what that sales assistant does on a massive scale.  I find that helpful, but I can see why other people would find it creepy.  Ignoring the amount of effort involved, and privacy for a moment, it means that if you surf a website about how to fold reusable nappies and then go to an online shop, it would show you the eco-friendly variety of nappies.  People like Don would rather manually choose the eco category themselves than have a computer judge them and present a different experience to what other people receive.  Fair enough.

Don also raised an interesting point as a brief aside which I've been putting a lot of thought into:

Is DM the equivalent of a cold call?

My initial reaction was that this can't be right: you have to opt in to DM, whereas you never opt in to a cold call.  But after thinking about it a lot I decided the parallel has some merit.  If I open a web browser and go to a store to buy some wine then, well, I have essentially walked into a store.  However if exactly that same store sent me an email telling me to go to their store because they've got some wine I would love then, no matter how carefully selected that message is and regardless of whether I initially opted in, they have interrupted my day.

So I think he's right: DM is the store deliberately interrupting your day to remind you they exist.  If you had to proactively go to the website in order to be shown the carefully picked wine then I wonder if Don would still object - is it the creep factor or the interruption factor?  From a marketing perspective I know waiting for the customer to come to you is far less effective than interrupting them.  If you want to make a sale to that customer then you somehow have to remind them that you exist, and if you simply wait for them to come to you then your competitor will remind them that they exist first, and you'll lose the sale.  It's an interesting problem - how do you stay top-of-mind without wasting everyone's time?  Perhaps micro-payments are the answer?  At least that way they're compensating you for the interruption.  They have the nice side benefit of reducing spam's ROI too.

Visual Studio

I've always preferred a minimalist approach to coding.  While I transitioned with most people to using an IDE in the early 90s, I soon transitioned back again.  I found that when I was writing code I didn't want distractions, and invariably both the editor and the build tools integrated into IDEs were vastly inferior to those available in specialist tools.

To be specific, how many IDEs would enable you to type say ":1,$ s/\([a-z]*\) \([a-z]*\)/\2 \1/" and have the first two words of every line swapped?  I bounced between emacs and vi a bit as to which editor I preferred, but while I periodically tried the latest IDEs I always found them woefully inferior.  Build commands were similar: I was able to develop a custom preprocessor, integrate source control, or even FTP required files, all as part of build scripts using make, while IDEs struggled with anything more than a dependency tree.  They even had an irritating habit of trying to recompile your code when you had just edited the comments, something that's easy to override using specialist tools.
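For anyone less fluent in vi, a rough Python equivalent of that substitution (applied to every adjacent pair of words, not just the first pair on each line):

```python
import re

text = "hello world foo bar"
# Capture two adjacent lowercase words and emit them in reverse order.
swapped = re.sub(r"([a-z]+) ([a-z]+)", r"\2 \1", text)
print(swapped)  # "world hello bar foo"
```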

When using Windows you don't get much choice except to use an IDE.  Microsoft has been pushing Visual Studio for many years - offering it as a free download, bundling it with IronPython, moving DTS and SSRS into it, and generally expecting you to use it everywhere.

Not long ago I upgraded to Visual Studio 2012 and, much to my surprise, I was absolutely blown away.  Firstly, this version supports both programming and database development, whereas the previous BIDS install didn't want to integrate with a C# install.  But the quality of the development ecosystem integration is where the latest version really shines.  Source control isn't just built in, it's beautifully integrated throughout the program; build servers, automated regression testing, agile-style task management - it's all amazingly slick.

So, well done Microsoft.  After twenty years of ignoring IDEs, you've released some software which is so good that you've changed how I work.

Incidentally, 2013 came out the other day.  It's a solid incremental update, with the most interesting feature being nice deployment management.  Nothing to get excited about, but after such a brilliant release in 2012 I had expected Microsoft to follow up with a dud.

Targeted Advertising Considered Harmful?

Don Marti recently wrote an article criticising targeted advertising (link).  As someone who spends his life helping to make messages more relevant, it's a little off-putting to be told your professional existence is harmful.  Having read his article I would like to take this opportunity to refute it.  I would like to take a later opportunity to write a counter-article on why I think targeted advertising is the best thing for consumers since advertising, but I'll save that for later and just concentrate on Don's article.

The core thesis behind Don's argument is a pretty simple logical chain.  Before seeing advertising you know nothing, but after seeing a big ATL campaign you know one company is willing to spend big bucks promoting their product while others are not, and from that you infer the company you saw believes in their product, and if they believe in it then it is more likely to be of value to you.  BTL advertising which Don dislikes does not follow this chain because it is so cheap, and so does not provide him with the same assurance.

I believe the flaw in this chain is the link between spend and belief in a product.  As a marketer my objective is to remind customers about my product so that it is front-of-mind and they are more likely to buy it next time they're out.  I assess a particular campaign based on the cost of that campaign against the incremental profit that campaign generates.  There are a few exceptions, like brand positioning or burning cash for market share, but they don't matter here so I'll ignore them.   The thing about this simple ROI assessment is that my belief in the product never comes into it - I trust the product team to make a great product but my ATL advertisements will go to people for whom it isn't the right choice.

For example, imagine you are CMO of Carpet Court - a carpet retailer.  Working with your media agency you have to decide between billboards, bus-backs, TVC, and whatnot.  Say your last billboard campaign cost $50k and generated $300k in incremental revenue.  Because the margins in carpet retailing are pretty good (30%), I would be quite happy running that campaign again.

Now say I'm the CMO of a very specialist carpet retailer which just happens to be a more suitable choice for Don.  Because I'm smaller, the last billboard campaign I ran generated only $100k in incremental revenue, so I will not do that again.  So Don is sitting there on his morning commute and he sees the Carpet Court billboard but not the specialist billboard, and goes with Carpet Court "because they back their product with advertising dollars".  He's made the wrong choice.
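The two decisions fall straight out of the campaign arithmetic above (using the stated 30% margin):

```python
def campaign_profit(cost, incremental_revenue, margin_pct=30):
    """Incremental gross profit generated by the campaign, net of its cost."""
    return incremental_revenue * margin_pct // 100 - cost

# Carpet Court: $50k billboard generating $300k incremental revenue.
print(campaign_profit(50_000, 300_000))  # 40000 -> happily run it again

# The specialist: same $50k billboard, only $100k incremental revenue.
print(campaign_profit(50_000, 100_000))  # -20000 -> never again
```

Note that belief in the product appears nowhere in the calculation - only expected return does.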

Another way of looking at it is cost per sale.  As a big established retailer, ATL often generates good cost per sale because so many of the people who see it are considerers for your product.  But as any sort of challenger or niche firm with a more suitable product the cost per sale of a big ATL campaign is completely unjustified.  

So the core problem with Don's thesis is that spending money on advertising relates only to expected return, it has nothing to do with my belief in the quality of the product.  With this link broken, all the following conclusions no longer follow.

What should I ask you?

I've tended to use this blog as a place to put my observations of things I have learned, rather than a place to discuss problems I'm trying to solve.  But something which has come up a few times recently, and which I haven't found a good solution to, is question selection.  This post is about the game twenty questions: how should I decide which questions to ask you to give me the best chance of guessing the answer?

Some context... Say I wanted to predict something about you - whether you're going to buy a particular product, or what category I should classify you into.  I can produce a dataset with a whole lot of things I know about other people and whether they bought or their category, and then I could train a model such as an SVM and apply it to your particular data.  All machine learning 101 so far.

Collaborative filtering takes this a little bit further.  Based on observing a number of responses to questions from many people, I can estimate the probability that you will give a particular answer to a particular question.  This makes things significantly more interesting because it largely eliminates the need for a dataset where virtually every data point is known.

To provide a concrete example consider the problem of predicting how much you'd like a particular movie.  In the basic case I could collect a whole lot of information about people and how much they like the movie, build my model and apply it to you.  But that's not terribly realistic because I'm unlikely to be able to ask most people the meaningful questions for predicting whether they like the movie (namely: whether they liked similar movies) because most people won't have seen virtually all of the similar movies.  There are techniques for reducing the impact of missing values such as Amelia, but they only scale so far - certainly not to most people having only seen a few dozen movies.

Collaborative filtering techniques such as RBMs or SVD help out a lot here.  They'll let you know what the best estimate is based on the unbelievably sparse dataset of movie scores.
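A minimal item-based sketch of the idea, with invented ratings; real collaborative filtering (RBMs, SVD and friends) does something far more principled with the sparsity, but the shape of the estimate is the same:

```python
# Minimal item-based collaborative filtering over a sparse ratings dict.
ratings = {
    "ann":  {"Alien": 5, "Aliens": 4, "Amelie": 1},
    "bob":  {"Alien": 4, "Amelie": 1},
    "cara": {"Amelie": 5, "Chocolat": 4},
    "dave": {"Aliens": 5},
}

def similarity(m1, m2):
    """Crude co-rating similarity: 1 minus mean absolute difference / 4."""
    common = [(r[m1], r[m2]) for r in ratings.values() if m1 in r and m2 in r]
    if not common:
        return 0.0
    return 1 - sum(abs(a - b) for a, b in common) / (4 * len(common))

def predict(user, movie):
    """Similarity-weighted average of the user's existing ratings."""
    seen = ratings[user]
    weights = {m: similarity(movie, m) for m in seen}
    total = sum(weights.values())
    if total == 0:
        return None   # no overlap at all: nothing to estimate from
    return sum(seen[m] * w for m, w in weights.items()) / total

print(predict("dave", "Alien"))   # Aliens fans look like Alien fans
```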

But the problem I want to pose is a bit harder still.  Everything we've talked about so far assumes a fixed set of knowledge about people.  What if I could ask you a few questions before telling you what you'd score a movie?  This seems far more interesting and meaningful: I could walk up to you and say "Do you like action films, or romance?" and immediately get a much better estimate than before.  To some extent attribute importance is a good answer here - at least it will tell you which question you should ask first.

But what should you ask second?  The second highest scoring attribute is usually at almost the same angle as the highest and so contributes almost nothing to our understanding.  Factor analysis or similar can produce the second best question in general which might be good enough for the second or even the third question we ask, but is a pretty poor choice for say the tenth question we ask.  

To illustrate, imagine how hard the game twenty questions would be if you had to write down all twenty questions in advance of hearing a single answer.  So the problem is how to effectively fold the new knowledge into our understanding such that the next question we ask is as close to the best question as possible.  On trivial datasets this is easy - we can recompute the probability distribution for the answer based on the revised input space - but real problem spaces have far too few datapoints to consider this approach.
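On a toy dataset, that recompute-as-you-go idea can be sketched directly; the candidates and attributes below are invented, and real problem spaces are far too sparse for this exact approach:

```python
# Greedy question selection for a toy twenty-questions game: after each
# answer, re-split the remaining candidates and ask whichever question
# divides them most evenly (maximum information for a yes/no answer).
candidates = {
    "cat":   {"mammal": True,  "pet": True,  "big": False},
    "dog":   {"mammal": True,  "pet": True,  "big": True},
    "whale": {"mammal": True,  "pet": False, "big": True},
    "gecko": {"mammal": False, "pet": True,  "big": False},
}

def best_question(remaining):
    """Pick the question whose yes/no split of the remaining set is most even."""
    questions = next(iter(remaining.values())).keys()
    return min(questions,
               key=lambda q: abs(sum(a[q] for a in remaining.values()) * 2
                                 - len(remaining)))

def ask(remaining, question, answer):
    """Keep only the candidates consistent with the answer."""
    return {n: a for n, a in remaining.items() if a[question] == answer}

q1 = best_question(candidates)      # splits the four candidates 2/2
remaining = ask(candidates, q1, True)
q2 = best_question(remaining)       # recomputed in light of the answer
print(q1, q2, sorted(remaining))
```

The point of the example is that the second-best question changes once the first answer arrives - exactly what a fixed, precomputed question list cannot capture.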

So far the best I've come up with is to set up an RBM where the input predicts the output, and then to simulate the impact on accuracy when a single new answer is provided.  Apart from being slow this is not particularly elegant, and I'm hoping someone can think of a better idea.

How would you design the best computer program to play 20 questions? 

Eating your own dogfood

Site location is not a hard problem, though getting the data for a trustworthy result can be time-consuming.  First you have to get the location of all competitors, apply a sensible drive-time buffer, score every potential customer in terms of their estimated value, and run it through a heat-map to find under-serviced areas.
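A back-of-envelope sketch of those steps, with invented customer values and a fixed-radius catchment standing in for a real drive-time buffer:

```python
# Score grid cells by the customer value they could capture, zero out
# anything already inside a competitor's catchment, and the highest
# remaining cell is the most under-serviced location.
import math

customers   = [((1, 1), 50), ((1, 2), 80), ((8, 8), 120), ((9, 8), 60)]
competitors = [(1, 1)]     # existing practices
catchment   = 3.0          # crude stand-in for a drive-time buffer

def underserviced(cells):
    scores = {}
    for cell in cells:
        if any(math.dist(cell, c) <= catchment for c in competitors):
            continue       # already serviced by a competitor
        scores[cell] = sum(v for loc, v in customers
                           if math.dist(cell, loc) <= catchment)
    return max(scores, key=scores.get)

grid = [(x, y) for x in range(10) for y in range(10)]
print(underserviced(grid))
```

The real work, as the rest of this post shows, is not this calculation but getting trustworthy locations and values to feed into it.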

Recently my wife was looking to buy a practice and we were surprised by the lack of supply.  This raised the question: is the market saturated, or are there simply not enough practices built?  Also, should residential or work addresses be used for customer locations?

Finding the appropriate number of dentists per head of population proved surprisingly difficult.  Research into theoretical levels does not help much, especially when coupled with quite weak research into who doesn’t go to the dentist regularly. Similarly the quality of geocoded practice data was terrible.  Most practices were located correctly but there were numerous duplicates, dental labs which do not see patients and businesses that have long since closed down.  

What all this leads to is that the time required for the analysis blows out.  That’s fine if it’s a work environment and the client’s willing to pay, but when you’re trying to slip the analysis into the evening before going to bed it’s a lot more annoying.  


EC2 for databases

EC2, or Elastic Compute Cloud, is Amazon's cloud computing offering.  You rent servers from Amazon at a low hourly rate, micropayment style.  Every byte going into your server, every byte you store for a month - it all adds up on your monthly bill.  This approach applies not just to the hardware; you can effectively rent Windows and SQL Server as well.  The end effect is very low initial costs but slightly higher annual costs - perfect for startups.

Where it gets interesting is that you do not rent physical machines but virtual machines.  This means that you can clone your virtual machines, destroy them, scale them up, etc. to your heart’s content.  If you want to deploy a big change to production and have pre-prod running for a week then you can - and simply blow it away when you’re finished.  You don’t have to predict future computing needs - simply rent what you need now and rent what you need later well... later.   Think of the benefits that virtualisation provided a few years ago to the server room - Amazon’s cloud offering makes a similar advance to the ease of system management.

Also remember how, when virtualisation first became popular, it had terrible performance?  Sadly we see something similar here.  Amazon's instances do not have hard drives; they're connected to their storage via the network port.  This means that at best you are going to get gigabit networking, and at times you will get much worse (I've had as low as 100kB/s).

This brings us back to databases.  Normal databases are heavily I/O bound, with 15k drives commonplace and many organisations transitioning to solid state.  The idea of trying to run a large database on a single USB hard drive would normally be considered a joke, and yet that's effectively what Amazon provides.  The performance of a complete system is therefore... disappointing, even when its basic specs (RAM, CPU, etc.) suggest it should be quite acceptable.

And there’s nothing you can do about it, a larger instance will mitigate the issue slightly by giving you more of the network bandwidth but fundamentally you’re limited to gigabit ethernet and even if you control the entire physical machine, you won’t control the SAN.

Unfortunately it gets worse.  Most databases I’m involved with are used for reporting and modern reporting tools are quite data intensive, generating beautiful visualisations which summarise a large amount of underlying data.  The problem is how you get a large amount of data from the database to the visualisation tool - with Amazon in Virginia and me in New Zealand, I’m lucky if I can sustain 1MB/s - perfectly acceptable for browsing the internet but extremely painful when trying to run reporting tools that were designed to operate correctly over a LAN.

So while cloud computing is interesting, I would advise you to hold off deploying substantial databases to EC2 until Amazon sorts out their I/O issues.

Fixing Operational Data in a DW

I obtain data generated by cheap, low-powered devices that make mistakes.  Sometimes the data simply will not match the specification because the firmware mucked up (a bit flip, perhaps) and there's nothing anybody can do to prevent it.

So what should I do about it?  Rejecting it leads to a clean data warehouse but is unrealistic - we've lost the information that a transaction happened, even if we don't know exactly what transaction.  Silently fixing the data is also unrealistic because it breaks the ability to reload the data warehouse and obtain the same state.


The solution I've come up with is simple but required quite a bit of coding.  A GUI is put over the Operational Data Store section of the data warehouse, allowing changes.  Every time anybody changes the data there, a trigger kicks off a few actions:

1. The original source file is backed up.
2. The original data is deleted from the data warehouse.
3. A file that simulates the generated data is created.
4. The simulated file is loaded.

As far as I can see, this allows any authorised user to fix data in the data warehouse using either a table editing GUI or SQL, while maintaining the data warehouse’s integrity.  However I’m still sitting on the idea and seeing if there are any flaws in it - drop me an email if you can think of any.
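For concreteness, here is a toy in-memory sketch of those four actions.  The file names, JSON format and dict "warehouse" are all invented for illustration; the real version is a trigger sitting behind the ODS GUI:

```python
# Toy version of the four trigger actions, using plain files and a dict
# standing in for the warehouse.
import json
import pathlib
import shutil
import tempfile

def apply_fix(source_file, warehouse, txn_id, corrected_row):
    source = pathlib.Path(source_file)
    # 1. Back up the original source file.
    shutil.copy(source, source.with_suffix(".bak"))
    # 2. Delete the original data from the warehouse.
    warehouse.pop(txn_id, None)
    # 3. Write a file that simulates what the device should have sent.
    fixed = source.with_suffix(".fixed")
    fixed.write_text(json.dumps({txn_id: corrected_row}))
    # 4. Load the simulated file, so a full reload reproduces this state.
    warehouse.update(json.loads(fixed.read_text()))
    return warehouse

tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "device_0042.json"
src.write_text(json.dumps({"t1": {"amount": -999}}))   # the garbled row
wh = {"t1": {"amount": -999}}
print(apply_fix(src, wh, "t1", {"amount": 42}))
```

Because the fix is expressed as a replacement source file rather than an in-place edit, reloading the warehouse from the files reproduces the corrected state.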

Foreach Slowing Down?

I’ve got a server running SQL Server 2005 on EC2.  I know there’s criticism of EC2 for database use due to S3’s relative disk speed but this is running fast enough - and you’ve got to admit it’s a breeze to deploy a new server!

Recently though, it's been acting really strange.  First, the install onto a brand new (virtual) machine broke, with .NET somehow becoming corrupted.  I find this extremely odd:

1. Create machine
2. Run Windows Update
3. .NET does not work

In the end I fixed it by downloading the .NET update and running it manually, but this isn't the sort of thing that I can imagine squeezing past MS's quality control - some weird interaction between EC2 and Windows?  I am sure running a virtual RHEL and then a virtual Windows on top of that isn't healthy.

Anyway, the issue that I haven't fixed is that I'm getting files sent to me bundled together in a zip file and processing them using a standard foreach loop.  Running locally this works fine.  Running remotely, however, the load gradually slows down to a crawl.  Clearly there is a funny interaction going on somewhere, but what?  And where?

To be honest, I'm hoping to not have to resolve this, instead migrating off 2005 and hoping it just goes away.  It just seems too random to be something that's easy to find.  However I thought I'd put it out there in case either someone reading this happens to know the answer, or this post helps someone else in a similar situation.

 

Migrating SSIS

I recently inherited a small data warehouse developed in SSIS 2005.  It was only written three months ago so you’d expect it to be pretty compatible with modern technology. What I found was a horror story - the highlights from which are outlined below in case anybody else ever comes across similar issues.

Java Zip

If you search Google for how to unzip in SQL Server, the number one hit is some VB.net code that calls a Java class.  It works just fine in 2005, but unfortunately that class is not distributed in 2008 (some people claim installing J# adds it in, while others claim that does not help).

In 2008 things are much more complex.  One option is to rewrite the script using System.IO.Compression, but this is far lower level, which makes it quite unattractive as an option.  Another is to install a third-party zip library - again, not the simplest.  I eventually decided the best solution was to call out to an executable (7-Zip), which proved painless and quite efficient.
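The call-out approach is simple enough to sketch.  Shown here in Python rather than a VB.net script task, with an assumed 7-Zip install path; `x` extracts with full paths, `-o` (no space) sets the output directory, and `-y` answers all prompts with yes:

```python
import subprocess

SEVEN_ZIP = r"C:\Program Files\7-Zip\7z.exe"   # assumed install location

def unzip_command(archive, dest):
    """Build the 7-Zip extraction command line."""
    return [SEVEN_ZIP, "x", archive, f"-o{dest}", "-y"]

cmd = unzip_command(r"C:\loads\daily.zip", r"C:\loads\extracted")
print(cmd)
# subprocess.run(cmd, check=True)   # uncomment once 7-Zip is installed
```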

SQLNCLI.1

In 2005, the default connection to SQL Server is made using the SQL Native Client.  This is deprecated in 2008 in favour of a new native client - from memory, SQLNCLI10.  It isn’t just phased out slowly: the 2005 client is not even installed with 2008.

The migration wizard automatically converts this for you - unfortunately, the wizard also attempts to connect to the database at various points in the migration, using SQLNCLI.1.  As a result your migration will fail because the wizard cannot connect to the database.  The only workaround I have found is to install the 2005 client on 2008 so that it is available to the wizard.

Memory of past connections

One of the bugs in Visual Studio 2008 is that it keeps some metadata about the way things used to be.  One particularly annoying instance of this is that the package ‘remembers’ the old and invalid SQLNCLI.1 connections and keeps migrating back to them when the package is closed.  The only workaround I’ve found is to quit Visual Studio and use find and replace on the XML files.
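Since .dtsx packages are plain XML, the find-and-replace can be scripted rather than done by hand.  A minimal sketch in Python - the directory layout and the replacement provider string ("SQLNCLI10" here) are my assumptions; check what the wizard actually writes into your converted packages before running anything like this:

```python
from pathlib import Path

def fix_connections(package_dir: Path) -> int:
    # Swap the retired 2005 provider for the 2008 one in every package
    # file in the directory.  .dtsx files are plain XML, so a simple
    # text substitution is enough.  Returns how many files were changed.
    changed = 0
    for dtsx in package_dir.glob("*.dtsx"):
        xml = dtsx.read_text(encoding="utf-8")
        if "SQLNCLI.1" in xml:
            dtsx.write_text(xml.replace("SQLNCLI.1", "SQLNCLI10"),
                            encoding="utf-8")
            changed += 1
    return changed
```

Run it with Visual Studio closed, or the IDE will happily ‘remember’ the old connections right back over your edits.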

Critical metadata in XML

One issue I had with the package was spelling errors - e.g. Retail misspelt as Retial.  Naturally I corrected these as I found them, but I did not check the XML support files.  Since the package parses the support file before loading, and a failure there means the support file is ignored, correcting the spelling in the GUI while neglecting to correct it in the XML will corrupt the package.

Conclusion

At the end of the process I believe I spent longer on the migration than it would have taken me to recreate the entire package in 2008 (by visually selecting the same options).  There are two big annoyances in what Microsoft has done: firstly, by changing connection managers, Microsoft has added a lot of work; and secondly, by hard-coding critical data outside the GUI, it has made the wizard almost useless.  I see that SQL Server 2010 repeats the same mistake, requiring a migration from OLE (the only option supported in 2005 and the default in 2008) to ADO.

I won’t even pretend OLE is better than ADO - but I do not think it is good development practice to have the default in one release unusable by default in the very next release - you have to give people a workable migration strategy.

Poetry

CI Wizz, never boring

Have a chat, his name is Corrin

First impressions that you’ll find

Is that he has a fine tuned mind

OK, so he’s really really smart

But you know he has a caring heart

Working with Corrin can be really quite a shock

When you find out that he’s as logical as Spock

Always quick, never ever slow

The NZ champion of the board game GO

Became known as a foosball hustler

But not renowned for being muscular


Although an expert of Oracle and programming SQL

Moving to SAS, for the team he’s made it less of a hell

Master of the information, big and small

This guy really seems to know it all

Moving Fly Buys to the new data warehouse

Writing requirements quietly as a weremouse

A fan of the quotes lists, he collected the datum

Averse to photos of other people’s scrotums

Quitting LNZL, his career’s in the crapper!

But au contraire, he’s off to Snapper

The kindest guy that you ever did know

Corrin, we’ll miss you after you go

So it is with a final shake hand

We say goodbye to Mister Lakeland

Annoyances in SAS Enterprise Guide

I mentioned in my previous post that I selected SAS as the best tool for graphical data mining.  As part of that decision, we decided to move the entire analytics process to graphical programming in SAS.  What was previously custom SQL and PL/SQL is now diagrams in Enterprise Guide (EG).

I don’t regret that decision, and some of the things I have seen EG do elegantly and simply have blown me away.  Even more than that, seeing it do the job of four different tools adequately has more than justified the decision to standardise on the SAS platform.  It is a very fine tool, but this post is about the things in Enterprise Guide that make you go “What were they thinking!”.

Graphical programming is a problem that has been attempted for a very long time - I first used it in HyperCard in the early 90s, but I’m sure it’s much older than that.  Therefore, it surprises me that so many of the issues I’ve pointed out below are things that have long since been resolved elsewhere.


Graphical Data Mining

At work, we recently decided to standardise on a graphical data mining tool rather than rely on me writing code.  This was intended to lower the barrier to entry, make processes more repeatable and so on.

In the process of implementing this I’ve been appalled at just how sorry the state of graphical data mining is.  I know that graphical programming was trialed in the 90s and died a natural death, but I would’ve thought a few more lessons would have been learned from that - it is more than ten years on now!

I trialed five software packages: Oracle Data Miner, SAS Enterprise Miner, PASW Modeler (which was known as Clementine at the time), Orange and KNIME.  Every one of them failed most of the test criteria.

Oracle Data Miner was by far the easiest to use: automatically normalising variables when appropriate, dropping highly correlated variables, and so on.  However, it was terrible when I attempted to scale to a large number of models.  The GUI left piles of cruft behind, made marking models as ‘production’ almost impossible, and provided almost no facilities for overriding the decisions it made (e.g. tuning the clusters).  However, the killer mistake was its (lack of) ability to productionise code.  Its PL/SQL package output quite simply didn’t work (it would sometimes even crash!).

SAS Enterprise Miner does a lot of things right, but its user interface is awful, with frequent unexplained crashes.  Some specific examples: it doesn’t support nested loops; it is limited to 32 values for a class variable, but the limit is not enforced - later steps just crash at random; dragging a node in between two others won’t connect it up; and it doesn’t support undo.

Steps that you will almost always want to perform, such as missing value imputation or normalising attributes, are easy enough to do, but they’re not done by default - you must always remember to add them.  To build an SVM with standard steps (sample, train/test split, impute, scale, test) requires finding and configuring half a dozen different nodes.  Why not default to normal practice and only require extra steps when you want to do something unusual?
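For comparison, this is what “default to normal practice” looks like when a tool chains the standard steps for you - a sketch in Python with scikit-learn standing in for the Enterprise Miner nodes, on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real modelling dataset.
X, y = make_classification(n_samples=500, random_state=0)

# Train/test split, impute, scale, fit, test - the same standard
# steps, declared once as a pipeline instead of hunted down as
# half a dozen separate nodes.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(SimpleImputer(), StandardScaler(), SVC())
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))
```

The point isn’t the library - it’s that imputation and scaling are part of the default chain, so forgetting them is impossible rather than routine.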

I won’t go into detail on PASW, Orange or KNIME, except to note that despite being free, KNIME had the best interface of all the tools I looked at.  Ultimately I decided SAS was the best despite its warts, due to its underlying power.  I do wonder what the designers were thinking: data mining best practice is now pretty well defined, so why not design your tool to make best practice the easiest option?


Quantile Regression

Quantile regression is a variant of regression which instead of minimising error (finding the mean), aims to have a certain percentage of observations above and below the target.  In its simplest form the target is 50% and so the algorithm will find the median.
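The mechanics fit in a few lines: quantile regression minimises the ‘pinball’ (check) loss, and at a 50% target the constant that minimises this loss is exactly the median.  An illustrative sketch in Python (not from any of the tools discussed here):

```python
import numpy as np

def pinball_loss(y, pred, tau):
    # Check (pinball) loss: under-predictions cost tau per unit,
    # over-predictions cost (1 - tau) per unit.
    err = y - pred
    return np.mean(np.where(err >= 0, tau * err, (tau - 1) * err))

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier

# Search for the constant prediction minimising the loss at tau = 0.5.
candidates = np.linspace(0, 100, 10001)
losses = [pinball_loss(y, c, 0.5) for c in candidates]
best = candidates[np.argmin(losses)]
print(best)  # ~3.0, the median - unmoved by the outlier at 100
```

Set tau to 0.95 instead and the minimiser slides up towards the top of the distribution - that is the whole trick.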

That’s enough theory; for people who want more I’d suggest Wikipedia to start with and this great paper for the details.  I’m more interested in applications, so I’ll concentrate on them.  If normal regression is about answering the question ‘what’s most likely to happen’, then quantile regression is about answering the question ‘how likely is this to happen’.  Some practical examples follow:

How confident can I be that this campaign will make a profit?

How confident can I be that this patient doesn’t have cancer?

What’s the most I can reasonably expect this person to spend?

What’s the least revenue I can reasonably expect this store to make?

Such questions can be attempted with regular regression if the errors are assumed to be normally distributed - predict the expected value and add a standard deviation or two to increase confidence.  In practice I’ve found this a very poor approximation.  Say somebody is an ideal candidate for spending all their money: they will be predicted to spend quite a bit, but after adding a couple of standard deviations we’ll predict they will spend more than their entire income!
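The failure of the normal approximation is easy to demonstrate on skewed data.  A sketch in Python, using a hypothetical right-skewed ‘spend’ distribution: the mean-plus-two-standard-deviations shortcut lands well above the actual 95th percentile, which is the overshoot described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed spend data: most people spend a little,
# a few spend a great deal (lognormal, a classic spend shape).
spend = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)

mean, sd = spend.mean(), spend.std()
normal_approx = mean + 2 * sd           # the "add two sigmas" shortcut
true_q95 = np.quantile(spend, 0.95)     # what we actually wanted

print(round(normal_approx, 1), round(true_q95, 1))
```

On this data the shortcut noticeably overstates the 95th percentile; a quantile regression targeting 0.95 would estimate it directly instead.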

Quantile regression hasn’t made it into many tools yet, you’re pretty much limited to R and SAS if you want to give it a whirl.  And even then, it’s an optional add-in to R and in SAS it’s marked experimental and effectively can’t be called from Enterprise Miner.

To all those statistical pedants out there, yes I have oversimplified my explanation above.  But for the kinds of uses I make, the description above is accurate enough.