Wolf, Goat, Cabbage, and LLMs

The wolf, goat, and cabbage logic puzzle is well known, and any LLM can solve it. Or can they? Have a look at this very slightly tweaked puzzle:

A farmer needs to transport a wolf, a goat, and a cabbage across a river using a small boat. The boat can carry only the farmer and one other item (either the wolf, the goat, or the cabbage) at a time. However, if left together without the farmer's supervision: The wolf will eat the cabbage. The goat will eat the wolf. The challenge is to figure out how the farmer can get all three items across the river safely. What is the sequence of moves that the farmer needs to make?

All I’ve done is switch the roles of the wolf and the goat, and I’d expect anybody able to solve the original problem to have no trouble with this adjusted version. Yet no LLM out there (as of 2024-05-30) can solve it. They either leave the wolf alone with that tasty cabbage, or they take an invalid action such as teleporting the cabbage. Even trying to correct them by pointing out their mistakes is an exercise in futility: they’ll fix one thing and reintroduce another error, and eventually you’ll walk yourself in a circle.

I eventually solved it by forcing the model to output a Python program, running that program, and printing its output. You can see my work here: https://chatgpt.com/share/5129aa63-8d98-432a-bdab-5436a74a96df. You can also see how much tweaking I needed to do. The final code is simple enough:

def river_crossing_solution():
    from collections import deque

    # Initial and goal state setup: (Farmer, Wolf, Goat, Cabbage), True = start side
    start = (True, True, True, True)     # everyone begins on the start side
    goal = (False, False, False, False)  # goal is to get all four to the other side

    # Define all forbidden states
    forbidden = [
        (False, True, True, False),  # Wolf and Goat without Farmer on the start side
        (True, False, False, True),  # Wolf and Goat without Farmer on the goal side
        (False, True, False, True),  # Wolf and Cabbage without Farmer on the start side
        (True, False, True, False)   # Wolf and Cabbage without Farmer on the goal side
    ]

    # Define possible moves (Wolf, Goat, Cabbage, or none - Farmer alone)
    possible_moves = [('wolf',), ('goat',), ('cabbage',), ()]
    index_of = {'wolf': 1, 'goat': 2, 'cabbage': 3}

    # Breadth-first search (BFS) to find a shortest solution
    queue = deque([(start, [])])  # queue of (state, path taken to reach it)
    visited = {start}

    while queue:
        current_state, path = queue.popleft()

        # Check if we've reached the goal
        if current_state == goal:
            return path

        # Generate and validate new states
        for move in possible_moves:
            # Skip the move if the item and the farmer are not on the same side
            if move and current_state[index_of[move[0]]] != current_state[0]:
                continue

            new_state = list(current_state)
            new_state[0] = not new_state[0]  # Farmer always changes side
            for item in move:
                new_state[index_of[item]] = not new_state[index_of[item]]
            new_state = tuple(new_state)

            # Queue the new state if it is safe and has not been seen before
            if new_state not in forbidden and new_state not in visited:
                visited.add(new_state)
                queue.append((new_state, path + [move]))

    return "No solution found"

# Execute the function and print the solution path
print(river_crossing_solution())

Why am I posting about this? Because I have a pet theory that I’d like to share:

My theory on LLM architecture

As an avid go player with a PhD involving neural networks I’m naturally familiar with the technical architecture of Google DeepMind’s AlphaGo - the program that trounced the world champion go player back in 2016. But I will do a brief recap here since I suspect I’m in the minority.

The core design is to split the problem into three: a value network, a policy network and a governance system to control their use. If you want a more detailed explanation then I recommend this article.

There used to be a website called ‘Leela Zero Ply’ which dumped the policy network and simply played the highest-value move. Its output reminds me a lot of state-of-the-art LLMs. Superficially the moves produced by LeelaZeroPly look amazing, but dig carefully and they turn out to be built on a foundation of sand. You can tear LeelaZeroPly to pieces, provided you’re at least 6th dan and really take your time on every move.

The analogy to LLMs is that their output looks totally plausible unless you know the topic intimately and carefully read what they said. Look closely, though, and you’ll find they’re writing what sounds plausible, and any relationship with the facts is largely coincidental.

My test above, where I showed GPT-4 failing miserably at the wolf, goat, cabbage problem, works because the model has an enormous amount of training data telling it never to put a goat and a cabbage together. Then it sees a prompt saying the opposite. But recall that LLMs don’t encode their training data as a series of facts; they encode it as associations, and so the prompt cannot override those associations.

The Policy Network

The role of the policy network is to perform logical manipulation of data, aka search. In the case of Go it’s Monte Carlo Tree Search, but that might not be the case everywhere. The point, though, is that because mistakes by the value network are rare, the policy network can focus on searching very deeply.
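To make that division of labour concrete, here is a toy sketch (nothing to do with AlphaGo’s or Leela’s actual code): a fast but noisy evaluator stands in for the network, and a search layer looks several moves ahead instead of trusting a single greedy evaluation. The game, the moves and the noise level are all made up.

import random

def true_value(state):
    # Hidden ground truth that the evaluator only approximates.
    return -abs(state - 21)

def network_eval(state, noise=3.0):
    # The "network": fast, mostly right, occasionally wrong.
    return true_value(state) + random.gauss(0, noise)

MOVES = [1, 2, 3]  # each turn we may add 1, 2 or 3 to the state

def greedy_move(state):
    # Zero ply: play whichever move the evaluator likes best right now.
    return max(MOVES, key=lambda m: network_eval(state + m))

def search_move(state, depth=4):
    # Search layer: explore `depth` moves ahead and back the best line up to the root.
    def best_leaf(s, d):
        if d == 0:
            return network_eval(s)
        return max(best_leaf(s + m, d - 1) for m in MOVES)
    return max(MOVES, key=lambda m: best_leaf(state + m, depth - 1))

print("greedy picks:", greedy_move(10), "| search picks:", search_move(10))

The point isn’t the toy numbers; it’s the split: a cheap evaluation available everywhere, plus a layer on top that does the systematic exploration.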

Phrased differently, this interaction enables an LLM to perform symbolic manipulation, just as I did in finally solving the puzzle by getting GPT-4 to produce and execute a Python program. The value network is capable of turning my prompt into a good program, and the equivalent of the policy network is capable of executing Python code.

I’d particularly highlight that the workaround I used is how most people solve weaknesses in LLMs. They use a layer of redirection (LoRA, RAG) so that the value network is not left to solve the problem on its own.

Conclusion

We can already do amazing things with LLMs. I think we still have another monumental advance coming, where some company (Meta?) manages to work out how to integrate Policy.

Programming and LLMs

Many, many years ago I implemented a neural network as part of my PhD in natural language understanding. I developed a semantic embedding model, and predicted the next structure using softmax.

Then I left uni, got a real job, and barely touched language processing again. Most work in data science was direct marketing, and was based on transaction and interaction data. I vaguely tracked the literature on text processing, but more out of habit than anything.

It’s been incredible to watch the advance of LLMs from the sidelines. BERT was my wake up call. When it came out in 2018, I realised all my ‘old ways’ of doing NLP were now redundant and it was time for me to upskill. Still, life got in the way and I was busy with work, so that it wasn’t until last year that I really sat down and got my head around all the advances in the field.

I don’t want to get into the technical side of LLMs here and now. What I want to focus on is how much of an enabler they are for work. I learned programming way back in the 1980s but I was never terribly good. I wrote programs to get a task done rather than production-grade code intended to live for years. As my work shifted into consulting, I largely dropped programming beyond the occasional couple-hundred-line script.

That’s changed with LLMs. I can ask one to fix up my syntax, scan my code for bugs, or look up API documentation for me. It’s all stuff I can do without an LLM, but the difference in time between me stumbling through Stack Overflow and me asking an LLM is roughly an order of magnitude.

In the last six months I’ve completed an app for syncing my bank transactions from Open Banking (Akahu) to YNAB, and am about half way through an app for smart scheduling of tasks. I’ve also implemented SMS inside Open Dental, and generally used it to make myself faster at work.

It’s hard to overstate how much difference this is going to make in the world. I think the productivity of ‘knowledge workers’ is going to double almost overnight. Will that mean more is produced? Mass redundancies?

Anyway, I’m certainly enjoying being able to work faster.

Designing an effective newsletter

Designing effective newsletters has been a significant proportion of my life for the last few years. I will list a few of the things I’ve seen work well, as well as some common mistakes.

My background is in measuring marketing effectiveness for enterprises. A few years ago I started a company called CME Connect and we work with clients to build their marketing capability. The most common reason clients get us in is they want help with their newsletters. This background will affect my answer in a few ways:

  1. My focus will be on how to design a newsletter that is profitable. A beautiful newsletter which does not increase revenue or decrease costs has very little interest for me.
  2. Branding and HTML coding are both important but they’re not my areas of specialty so I will not go into detail.
  3. Some of the things I talk about will be irrelevant for smaller companies. For example managing multiple stakeholders and sign-off.
  4. We have developed technology for addressing the issues I talk about here and I will use that as an example of how to approach the issue. Most other enterprise-level software has similar features and even if it doesn’t, you can usually achieve the same thing manually with a bit more work. Please let me know if you need to do something I describe using different software.

Return on Marketing Investment

The thing I try to have in the back of my mind when designing newsletters is ROI (or, more precisely, return on marketing investment - ROMI). Creating newsletters takes a lot of time and it’s important to keep your budget in mind. Every newsletter you send has an expected ROMI based on how it impacts lifetime value (mainly through reducing churn and generating sales).

The point is that the newsletter generates incremental returns, and improvements to the newsletter make it incrementally better. It is extremely easy to spend thousands of dollars on extracting the last few dollars of return. The best way to design a newsletter is quite different to the way you would design the best newsletter!

The best way to design a newsletter is to spend the appropriate amount of time on it so as to maximise long-term revenue. Everything else is a collection of techniques for how to do this. How to identify things with long-term impacts. How to estimate the costs and benefits of tasks. I find that the single biggest difference is simply remembering that you’re there to create a profit, not to create a newsletter, and everything you do should be evaluated against that goal. Keep that in the back of your mind and it will make a huge difference.

Our software makes it easy to estimate ROMI, which is a handy time-saver, but it’s easy enough to ballpark yourself. Estimate the incremental difference that this work will create per customer, and multiply that by the number of customers.

The reason I’m going on and on about this is that it comes up every day. For example:

  • You sit down with your CMO, walk through the assumptions showing that a newsletter should have a good ROMI, and sign off a budget to run it for a year.
  • You are deciding whether to hire an HTML coder and use the same process to decide to do so.
  • A product manager suggests writing some special copy for people who have already purchased. You work out how many people this is, estimate the value of the better message, and decide that a generic message is actually better value.
  • You’re considering replacing a photo taken by your assistant with one taken by a professional photographer. This will cost about $600. You guess it will increase the value to each customer by about $0.0005 in incremental gross profit, and you’re sending to one million people. That works out to a $500 return on a $600 investment, so you stick with the existing photo (there’s a quick sketch of this arithmetic after the list).
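To make the ballpark concrete, here is the arithmetic from the photographer example as a few lines of Python (the figures are the ones quoted above, not real data):

def romi_ballpark(incremental_profit_per_customer, n_customers, cost):
    # Incremental return = per-customer uplift multiplied by audience size.
    incremental_return = incremental_profit_per_customer * n_customers
    return incremental_return, incremental_return - cost

# Professional photo: $0.0005 uplift per customer, one million recipients, $600 cost.
ret, net = romi_ballpark(0.0005, 1_000_000, 600)
print(f"return ${ret:,.0f}, net ${net:,.0f}")  # return $500, net $-100, so keep the old photo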

At first you are unlikely to know your incremental ROMI. I would encourage you to not get too hung up on this. Just take a guess, write it down and act based on that guess. The important thing is to measure what your real ROMI was after the campaign and adjust your future guesses accordingly.

If you do this well then a lot of very important benefits fall into place naturally. For example, actively trying to build relationships with less active customers often looks bad because they have low engagement statistics. However, there are typically a lot of them and they have much more potential for incremental returns, so the ROMI is often better than the engagement numbers suggest. The emphasis on engaging with each and every customer is one of the biggest differences between a marketing automation system and a campaign management system.

What to talk about

What do your customers want to read about? Do they want great offers? Behind-the-scenes interviews? Customer success stories? Funny jokes? An effective customer research programme is essential for driving your newsletter programme. It’s important to realise that the priorities of your company, such as new product launches, are generally not those of your customers, and you need to balance the two.

I am personally a huge fan of tracking via ‘read more’ links, much like Quora uses. If a customer clicks on the read more link then clearly they’re interested. If they don’t then you got it wrong and just wasted their time. Do that too often and they’ll unsubscribe.

This approach has downsides. It requires writing teaser copy (aka clickbait) rather than summarising the subject efficiently so that interested customers do not need to ‘read more’. An alternative method is running a research programme. I’m not a huge fan of this because what people say and what they do are so different. However, the general idea of getting information from a proxy is worth remembering.

Depending on the sophistication of your software (and the ROMI), I am a big fan of crafting different messages for every customer. For example if 20% of your customers want to hear what your management team is up to and 60% would rather not then the only way to give everyone what they want is by building an optimal newsletter for every customer. I get into this a bit more later but for now I’ll simply note that this requires tracking every individual’s interests and preferences rather than general trends. Also note that this is the sort of feature offered by enterprise marketing automation but not by the SMB solutions. You’ll need to decide what the ROMI is and whether that justifies an enterprise solution.

You can probably see from this that I take an extremely data-driven approach to content selection. I should point out that I use data to guide me, but I consider it absolutely essential that the marketer retains final control.

Message Eligibility, Relevance and ROI

There are three main things to think about when deciding what content a customer should receive. People sometimes use these terms interchangeably and I think this leads to mistakes. I would encourage you to get a system which allows you to work with all three concepts independently even if your organisation is not currently sophisticated enough to take full advantage of it.

Message eligibility is very simple. Who are we allowed to tell this message to? For example if you want to reward your VIPs with a special deal, then only VIPs are eligible for that message. If you are selling alcohol then only people with a verified age will be allowed to see the message. I use strict rules to handle eligibility, e.g. ‘VIP = TRUE’.

Just because someone is eligible for a message does not make it relevant for them. A 40 year old teetotaler is unlikely to be very interested in a discount on alcohol. I measure relevance using propensity models. For new products or in situations where I do not have a good propensity model, I simply build the best one I can. CME has a built-in tool for automatically generating simple propensity models based on transactions which makes this quick and easy. If this is not easy in your organisation then (at the risk of sounding like a broken record) remember ROMI - it could well be that a very simple intuition-based model costs so much less than employing a statistical modeller that you’re better off saving money.

Your company’s ROI is also critically important. Say you’re selling two products, one with low margin and one with high margin. The customer has a slightly higher propensity for the low margin product. You will need to decide how to weight the customer’s interests against your company’s. Essentially, do you value short-term profit or building long-term loyalty?

I would strongly encourage you to think very carefully about this. It’s all very well to say that my role as a marketer is to build customer loyalty, but at some point you need to translate that into incremental profit. If you do decide as a company that loyalty is the priority then how are you going to manage a senior product owner begging for their product to be promoted when customer feedback has been poor?

I personally like to combine the message relevance with ROI to give an overall message score for this customer. This is the expected incremental profit from sending this message to this customer. The formula is:

(probability of purchase if they receive this message - probability of purchase if they do not receive this message) * profit from purchase + impact on lifetime value
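As a minimal sketch of that formula (the propensities, profit and lifetime-value figures below are invented; in practice they come from your propensity models and margin data):

def message_score(p_buy_with, p_buy_without, profit_per_purchase, ltv_impact=0.0):
    # Expected incremental profit from sending this message to this customer.
    uplift = p_buy_with - p_buy_without
    return uplift * profit_per_purchase + ltv_impact

# e.g. 4% vs 3% purchase probability, $50 profit per sale, a small positive LTV effect:
print(message_score(0.04, 0.03, 50.0, ltv_impact=0.20))  # roughly $0.70 per send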

Message Banks

The value of building a bank of approved content is something I discovered by accident. Every organisation I’ve worked in went crazy as the newsletter deadline approached, with stakeholders desperately wanting changes to their message and/or targeting. That was, until I saw one organisation where it was managed differently, and ever since I’ve done my best to get other companies to adopt their approach.

I feel a bit like I’m sharing my secret sauce but… the magic was to have the newsletter deploy on time every day, completely automatically, with whatever content had been approved. I’ll try to break apart all the important components:

  • Newsletters are composed of message blocks (CME refers to them simply as messages and the entire newsletter as a communication). Everything you do, from design to approval, should be centred around these messages, not the whole newsletter.
  • Get your messages approved early. Each message has both its creative (HTML) and criteria (eligibility, relevance) as described above. You can organise with stakeholders to get this approved potentially weeks before the message starts to go out.
  • Decentralise as much as possible. Stakeholders are rarely interested in working inside the marketing system itself, so send them an email with the message to approve and log their response. Where possible, have the different business units write and approve their own messages. You can always require central approval as well if necessary.
  • Have plenty of spare messages. For example, if you decide to average four messages per newsletter, write four reasonable messages to get started. You can keep improving your newsletter by spending longer on it, but there is no real panic if a particular message gets pulled at the last moment.
  • Try to avoid large deployments. If you deploy to ten million people once a month then you’ll get stakeholders pushing for changes at the last minute. If you deploy to 300k a day then the pressure is much lower. Sometimes this is impossible but doing it where possible means you have more resources for the exceptions.
  • Remember that recipients receive a whole newsletter, not just message blocks. You need to periodically check how the whole thing comes together. For example you might want a business rule that a newsletter contains the following blocks: transactional, promotional, feel-good and strategic. Without this you might end up sending someone four feel-good messages and no promotions (see the sketch after this list).
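To show how a message bank plus a block-type rule might come together, here is a toy sketch; the message names, scores and eligibility rules are all hypothetical, and a real system would pull them from your approval workflow and propensity models.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    name: str
    block: str                        # 'transactional', 'promotional', 'feel_good' or 'strategic'
    eligible: Callable[[dict], bool]  # strict eligibility rule
    score: Callable[[dict], float]    # relevance x ROI for this customer

REQUIRED_BLOCKS = ["transactional", "promotional", "feel_good", "strategic"]

def build_newsletter(customer, bank):
    chosen = []
    for block in REQUIRED_BLOCKS:  # business rule: one block of each type where possible
        candidates = [m for m in bank if m.block == block and m.eligible(customer)]
        if candidates:
            chosen.append(max(candidates, key=lambda m: m.score(customer)))
    return chosen

# A VIP-only deal only ever appears for VIP customers; everyone gets their statement.
bank = [
    Message("vip_deal", "promotional", lambda c: c.get("vip", False), lambda c: 2.5),
    Message("standard_deal", "promotional", lambda c: True, lambda c: 1.0),
    Message("your_statement", "transactional", lambda c: True, lambda c: 3.0),
]
print([m.name for m in build_newsletter({"vip": True}, bank)])  # ['your_statement', 'vip_deal']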

Make it look good

Good copywriting is important. You can probably tell from reading my answer that it isn’t my strength! The difference that employing a copywriter makes is huge.

Sticking to your brand guidelines is also very important. It should be immediately obvious that this newsletter comes from you. Good creative will include plenty of white space, relevant graphics and the correct fonts.

There are some technical details to writing emails which do not come up elsewhere. HTML in emails lags years behind the web. Three of the main problems are:

  • Security. The web will allow you to run things like JavaScript, which is not possible in email.
  • Mobile. While an increasingly high proportion of websites are consumed on mobile, email is several years ahead in this regard.
  • HTML engine. For various reasons, getting your email to look the same on different clients is much harder than it is for your website.

I would recommend learning a decent email HTML editor and also getting a licence for Litmus. Testing using Litmus is a far better use of time than having half a dozen different clients installed. CME integrates with Litmus so that you can easily see how your email looks on a bunch of different clients. Even if the tool you’re using does not do this, it is pretty quick and easy to load up Litmus.

Finally, I recommend making this an explicit step in your approval process. When you get a stakeholder to sign off that they’re happy with a message, they tend to concentrate on the content of the message. Explicitly put on a marketing hat and ask: ‘Does this read well? Is it on brand? Does it look good in most clients?’

Keep improving

I’ve hinted at this throughout the answer but I want to call it out explicitly. Monitor how you’re tracking and keep trying to improve. Try A/B segmentation where you send the same message phrased in different ways to different audiences and measure your ROMI.

One thing to be aware of is that we find it takes about seven relevant emails in a row before a customer realises they like receiving emails from you. A single good email will not produce a significant increase in engagement. This also means you get very little benefit from having a great newsletter but lousy solus messaging. Along similar lines, it’s important that you don’t change too quickly.

I mentioned feedback above when talking about deciding what is relevant. I believe it is very important to read the responses you’re getting to your newsletter and work to incorporate this feedback into your design. Do not send from noreply@company.com - instead create a dialog. CME is somewhat unique in including a feature for managing replies. If your software does not do this then redirect the replies to your inbound service team.

Lastly it’s not just your marketing effectiveness that you should keep improving. Think about your costs - if you’re spending a long time getting some data each week then organise a data feed or do without it.

Reduce ongoing work

This probably all sounds like a lot of work. It is. Building an effective newsletter requires a lot of things to be going right. What I would encourage is three things:

  1. Run a trial for a few months in order to demonstrate the newsletter’s ROMI to stakeholders. CME Connect offers free trials, as do most other marketing automation systems. The point of the trial is to get wide buy-in for investing.
  2. Build as much as possible upfront; you want your weekly deployments to be as efficient as possible. We like to build a marketing datamart with integration from CRM, web analytics and so on. This is a fair bit of work but it means everything else is easier. You never have to scramble for the data you need at deployment time.
  3. Choose the right size of marketing platform. If your marketing budget is $20,000/year then a powerful system like CME is just not a good investment. You’ll be better off with something simpler - there’s no point having lots of levers to pull if you don’t have the time to manage them! Conversely, if marketing is worth millions to your company then do not go out and buy a cheap tool, because you’ll spend too long making it jump through hoops. Simple campaign tools like ExactTarget, MailChimp and similar can be made to do everything I’ve outlined above but it is incredibly hard. You will end up simply not doing half the work.

Conclusion

This is probably a longer response than you were looking for. The reason my company has been successful is that lots of companies struggle with the points I’ve described. Often what happens is that a company buys our competitor’s product and then can’t effectively integrate it with their business processes.

Designing good newsletters is more about good data-driven marketing than it is about technology. Go back to first principles and other things fall into place. That’s why I started my answer by talking about ROMI. Along similar lines, don’t stress too much about getting it all right at the start - get a good platform and gradually improve from there.

The best way to design a newsletter is to work out what each customer wants to hear and tell them in a regular and timely manner. The better you can do that, the better your newsletters will be.

PS: This answer was originally written for Quora but I put a lot of time into thinking it through and wanted to keep a copy.

CME Connect

I've been involved for a long time with using data to make marketing relevant.  About three years ago I co-founded a startup which helps companies to build their in-house marketing capability.  The company is called CME Connect and you can read more about it at http://www.cmeconnect.com.  In this post I'd like to talk about why I made this decision and why I decided to concentrate on improving in-house marketing capabilities.

I care very deeply about making marketing data-driven.  I believe that the majority of marketers worldwide make decisions based on their intuition rather than facts.  As a result, customers are bombarded by millions of essentially irrelevant messages.  The only reason marketers have got away with this is that customers do not have access to sophisticated filtering.

This is slowly changing.  Google Inbox automatically analyses what you open and will only put messages in your inbox if it thinks you're going to open them.  Others get relegated to the promotions subfolder that rarely gets read.  I believe that within five years' time, we will all have personal assistants like 'Siri', 'Cortana', 'Echo' and 'Ok Google' that filter out marketing material as effectively as spam has been eliminated.

Permission marketing simply won't be relevant any more.  Only relevant messages will be allowed through regardless of 'permission'.  I believe the primary way that brands will achieve relevance is by using data to understand what their customers want to know about.  Therefore I really want to help brands do exactly this.

The other big point of difference is that CME Connect helps you build your in-house marketing capability.  We are not walking around saying "give us your problems and we'll take them away".  My belief is that taking problems away is just too expensive.  I believe that using data to drive decisions is a fundamental change in behaviour, not a project.  Therefore outsourcing to an agency like Datamine or Affinity is the wrong approach for most brands because they can't afford to permanently outsource everything.

I have no problem with the services that Datamine or Affinity offer.  In fact I think they do a pretty good job, unlike most marketing agencies (Saachi, Aim Proximity, DraftFCB, ...) which pretend to know data but actually have just employed a couple of analysts without changing their business practices.  Don't get me wrong, most agencies are far better at advertising than me, but their data analytics really sucks.

That brings me back to insourcing.  I simply don't have time to look after everyone's marketing, and even if I could employ and train enough people, I believe the outsourcing model does not change business processes sufficiently.  Therefore I believe the right approach is to raise the bar by helping everyone deliver better data-driven marketing.

In conclusion, I believe the world of the near future will be dominated by artificially intelligent personal assistants that carefully analyse all inbound messages on behalf of each person.  I believe that the only brands that will survive in this new world are those that use artificial intelligence to understand what each of their customers wants, and that I can make more of a difference by providing products and services to many companies than I could by helping a couple of brands that can afford to pay for my expertise.

Email Frequency

How often should you email your email subscribers?  Every month? Every week? Every day? Every hour?  The traditional answer to this question is that you should define email frequency rules that your company strictly enforces.

There are lots of problems with this approach.  From a pragmatic perspective it is really hard because you'll get desperate stakeholders begging you for an exception to get their huge product announcement out just hours after you deployed a routine weekly update.  Turning down the requests on the grounds that you're just enforcing contact rules is likely to be career limiting.  However, before you get all comfortable that you're at least doing the right thing by your customers, research in this field has been pretty definitive: contact frequency rules are useless, and if you have something relevant to say then you can email your customers every fifteen minutes.

Of course the operative word in the previous paragraph is relevant.  You live and breathe what your company offers, and the latest widget is pretty big news for you.  Your customers have a more balanced life and are unlikely to be as excited by the widget as you are.  Contact rules came about as a way to prevent excited marketers from telling their subscribers about every little piece of news.

One of the emails I help with is for a sports team. Some of our fans want to hear everything that goes on while others only want to hear the biggest announcements of the season.

It's not just different levels of engagement. Some fans are interested in player profiles or match reports while others want discounted tickets and special deals.

Best practice is to not use frequency rules at all. Instead score how relevant this message is for each of your subscribers and only send it to those that pass the threshold.

This can't be achieved using campaign management software and is one of the major benefits provided by sophisticated marketing automation platforms.
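As a toy sketch of scoring-then-thresholding (the relevance scores and the threshold here are made up; in practice they come from your propensity models and your ROMI):

THRESHOLD = 0.5  # tune this against ROMI rather than a fixed contact rule

def recipients_for(message, subscribers, relevance):
    # relevance(subscriber, message) returns a score in [0, 1] from whatever model you have.
    return [s for s in subscribers if relevance(s, message) >= THRESHOLD]

scores = {"superfan": 0.9, "casual_fan": 0.2}
print(recipients_for("mid_season_update", scores, lambda s, m: scores[s]))
# ['superfan'] - the casual fan simply doesn't get this one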


AB Testing

One of the most common tasks in marketing analytics (after post-campaign analytics) is determining which variation of a campaign was most effective.  Small tweaks to your subject line will increase your open rate, and small tweaks to your content will increase your CTR and eventual conversion rate.

This analysis works.  Some email deployment tools even have it built in, so that you can set up a few variations, deploy slowly, and, based on the relative open rates from the first recipients / test subjects, the tool will automatically discard unsuccessful variations so that the rest of the deployment gets the best one.

My problem with the whole approach is that it treats your audience as homogeneous.  What if one version resonates best with women and the other with men?  To me the problem is not about simple maximisation, it is about selecting something relevant.

Another problem with the technique is that the goal should not be maximising your open rate.  It should be maximising your relevance.  If a recipient is not going to complete the purchase then getting them to open or click through is wasting that person's time and destroying your reputation with them.  

I'd like to propose an alternative approach.  Instead, let's score each message based on its affinity to the person - i.e. a winner-takes-all problem rather than a global maximisation problem.  Reframed this way we get a very different solution: we need to represent every message and every person in the same space, and the task is to find the closest message for each person.  Naturally we start off not knowing where each person (or message) is, so initial feedback is needed.

The practical implementation of this is simple.  Where we can build an understanding of which products you will like, we can make far better recommendations.  That is a well studied problem with plenty of good solutions.  For example we can use purchase history to predict your next purchase and then present it in your next email.  We can use your website behaviour to infer the probability of you clicking on two different messages in order to select the right news for you.  We can even manually place new products in space by looking at products they are similar to, in order to send relevant messages without purchase or clickthrough history.
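As a minimal sketch of the winner-takes-all idea, with made-up two-dimensional vectors standing in for whatever embedding your purchase and click history actually gives you:

import numpy as np

messages = {
    "player_profile":  np.array([0.9, 0.1]),
    "ticket_discount": np.array([0.1, 0.9]),
}

def best_message(person_vec, messages):
    # Winner-takes-all: pick the message closest (by cosine similarity) to this person.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(messages, key=lambda name: cosine(person_vec, messages[name]))

print(best_message(np.array([0.2, 0.8]), messages))  # 'ticket_discount'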

So why would you do this?  Because if done well you can create an expectation that your emails are worth reading even when nobody else's are.  This isn't easy... you will need to be significantly more relevant than people expect for approximately ten emails in a row before the pattern is noticed.  A single slip-up will consign your brand to the same position as your competition. However the benefits are huge.  Customers will automatically open and carefully read your emails without you having to shout to gain their attention.  

Anyway, please comment or otherwise get in contact with me if you find this interesting.  I'd like to make 2017 the year where broadcast advertising starts to die. 

Deployment best practice and the cloud

I have now worked for a number of years in cloud environments and in traditional environments.  But something I just noticed is how much easier the cloud makes it to implement best practice in development, testing and UAT. Let me give some examples:

Best practice says that the configuration of TEST / UAT should be the same as production.  I think most people attempt this in all environments, but it's very easy to accidentally do one little thing slightly differently in one environment and from then on it's impossible to get them to converge again.  Within the cloud you don't keep a TEST environment, you clone production whenever you want to do testing.  As a result the test environment is always at exactly the same patch level and with an identical configuration in every other way.
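As an illustration of the clone-production step, here is a hedged modern sketch using boto3 (the instance id, AMI name and instance type are placeholders, and networking, tags and teardown are all omitted):

import boto3

ec2 = boto3.client("ec2")

# Snapshot production into an AMI...
image = ec2.create_image(InstanceId="i-0123456789abcdef0", Name="prod-clone-for-uat")
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# ...and boot an identical test box from it, at the same patch level and configuration.
ec2.run_instances(ImageId=image["ImageId"], MinCount=1, MaxCount=1, InstanceType="m5.large")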

Best practice says that the QA / UAT server should have the same hardware specifications as the production environment.  That's completely impractical in a normal environment: production is high performance and high reliability because everything depends on it working perfectly.  The number of additional surprises that get caught in UAT by having a perfect replica is almost never enough to justify the cost.  However in a cloud environment the cost of having UAT mirror production is almost zero - just clone production and call the clone UAT.  If production uses database mirroring then so will UAT - no more bugs due to slightly different database configurations.

Best practice says you should develop and test using real data.  But snapshotting production for QA is a pain, especially to do regularly.  Not so in the cloud, where it would be harder to boot a QA environment that differs from production than one which is identical.

Best practice says you should develop deployment scripts which automate the upgrade to the next version.  Such scripts are very hard to test in a normal environment because you have to repeatedly go back to the old version as part of testing.  However in a cloud environment you can easily clone production and test the deployment script as many times as necessary until it runs perfectly.  You can even determine exactly how long the deployment takes, and if it will affect users during the deployment.

Best practice says you should run UAT and production in parallel for a while to validate they give identical results.  That's pretty easy to do when you can just spin up another server, but next to impossible where it would mean dedicating a server for a few days just to check it gives the same results.

Best practice says you should only upgrade a few users initially to ensure everything continues to run smoothly.  I'd hate to think how you'd even start to do that without the cloud - something like a front server that redirects requests based on logins?  Incidentally, this is something Xero does not do despite having a cloud backend.

Every time I've used the cloud I've been disappointed by how expensive it is, and amazed at how valuable it is during the release process. I wonder if there's a middle ground where a private cloud provides the ability to use a pool of cheap hardware while providing all of the benefits.


Is DM cold-calling?

Don Marti replied to my post in favour of direct marketing.  I still disagree with him but I can see where he's coming from.

In brief the idea of carefully building a profile of you in order to select the product you are most likely to be interested in creeps him out, whereas I would just say it's trying to be relevant.  If you regularly walk into a wine warehouse then you'll see a thousand different bottles of wine nicely categorised and be left to choose what you think is best, while if you regularly walk into a small independent store you'll be directed straight to what the sales assistant thinks you'll want. 

Computers have been attempting to automate what that sales assistant does on a massive scale.  I find that helpful, but I can see why other people would find it creepy.  Ignoring the amount of effort involved or privacy for a moment, it means that if you surf a website about how to fold reusable nappies and then go to an online shop, it would show you the eco-friendly variety of nappies.  People like Don would rather manually choose the eco category themselves than have a computer judge them and present a different experience to what other people receive.  Fair enough.

Don also raised an interesting point as a brief aside which I've been putting a lot of thought into:

Is DM the equivalent of a cold call?

My initial reaction was that this can't be right: you have to opt in to DM, whereas you never opt in to a cold call.  But after thinking about it a lot I decided the parallel has some merit.  If I open a web browser and go to a store to buy some wine then, well, I have essentially walked into a store.  However if exactly that same store sent me an email telling me to go to their store because they've got some wine I would love then, no matter how carefully selected that message is and regardless of whether I initially opted in, they have interrupted my day.

So I think he's right, DM is the store deliberately interrupting your day to remind you they exist.  If you had to proactively go to the website in order to be shown the carefully picked wine then I wonder if Don would still object - is it the creep factor or the interruption factor? From a marketing perspective I know waiting for the customer to come to you is far less effective than interrupting them.  If you want to make a sale to that customer then you somehow have to remind them that you exist, and if you simply wait for them to come to you then your competitor will remind them that they exist first, and you'll lose the sale.  It's an interesting problem - how do you stay top-of-mind without wasting everyone's time? Perhaps micro-payments are the answer? At least that way they're compensating you for the interruption. They have the nice side benefit of reducing spam's ROI too.

Visual Studio

I've always preferred a minimalist approach to coding.  While I transitioned with most people to using an IDE in the early 90s, I soon transitioned back again.  I found that when I was writing code I didn't want distractions, and invariably both the editor and the build program integrated with IDEs were vastly inferior to those available in specialist tools.

To be specific, how many IDEs would enable you to type say ":1,$ s/([a-z]*) ([a-z]*)/\2 \1/" and have the first and second words of every line swapped? I bounced between emacs and vi a bit as to which editor I preferred, but while I periodically tried the latest IDEs I always found them woefully inferior.  Build commands were similar: I was able to develop a custom preprocessor, integrate source control, or even FTP required files, all as part of build scripts using make, while IDEs struggled with anything more than a dependency tree.  They even had an irritating habit of trying to recompile your code when you had just edited the comments, something that's easy to override using specialist tools.
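For anyone who does not read ex substitutions, here is the same first-and-second-word swap expressed in Python:

import re

line = "hello world again"
# Swap the first two words on the line, leaving the rest untouched.
print(re.sub(r"([a-z]*) ([a-z]*)", r"\2 \1", line, count=1))  # "world hello again"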

When using Windows you don't get much choice except to use an IDE.  Microsoft has been pushing Visual Studio for many years - offering it as a free download, bundling it with IronPython, moving DTS and SSRS into it, and generally expecting you to use it everywhere.

Not long ago I upgraded to Visual Studio 2012 and, much to my surprise, I was absolutely blown away.  Firstly this version supports both programming and database development, whereas the previous BIDS install didn't want to integrate with a C# install.  But the quality of the development ecosystem integration is where the latest version really shines.  Source control isn't just built in, it's beautifully integrated throughout the program; build servers, automated regression testing, agile-style task management - it's all amazingly slick.

So, well done Microsoft.  After twenty years of ignoring IDEs, you've released some software which is so good that you've changed how I work.

Incidentally 2013 came out the other day.  It's a solid incremental update with the most interesting feature being nice deployment management.  Nothing to get excited about but after such a brilliant release in 2012 I had expected Microsoft to release a dud.

Targeted Advertising Considered Harmful?

Don Marti recently wrote an article criticising targeted advertising (link).  As someone who spends his life helping to make messages more relevant, it's a little off-putting to be told your professional existence is harmful.  Having read his article I would like to take this opportunity to refute it.  I would like to take a later opportunity to write a counter-article on why I think targeted advertising is the best thing for consumers since advertising, but I'll save that for later and just concentrate on Don's article.

The core thesis behind Don's argument is a pretty simple logical chain.  Before seeing advertising you know nothing, but after seeing a big ATL campaign you know one company is willing to spend big bucks promoting their product while others are not, and from that you infer the company you saw believes in their product, and if they believe in it then it is more likely to be of value to you.  BTL advertising which Don dislikes does not follow this chain because it is so cheap, and so does not provide him with the same assurance.

I believe the flaw in this chain is the link between spend and belief in a product.  As a marketer my objective is to remind customers about my product so that it is front-of-mind and they are more likely to buy it next time they're out.  I assess a particular campaign based on the cost of that campaign against the incremental profit that campaign generates.  There are a few exceptions, like brand positioning or burning cash for market share, but they don't matter here so I'll ignore them.   The thing about this simple ROI assessment is that my belief in the product never comes into it - I trust the product team to make a great product but my ATL advertisements will go to people for whom it isn't the right choice.

For example, imagine you are CMO of Carpet Court - a carpet retailer.  Working with your media agency you have to decide between billboards, bus-backs, TVC, and whatnot.  Say your last billboard campaign costs $50k, and generates $300k in incremental revenue.  Because the margins in carpet retailing are pretty good (30%), I would be quite happy running that campaign again.  

Now say I'm the CMO of a very specialist carpet retailer which just happens to be a more suitable choice for Don.  Because I'm smaller, the last billboard campaign I ran generated only $100k in incremental revenue and so I will not do that again.  So Don is sitting there on his morning commute and he sees the Carpet Court billboard but not the specialist billboard, and so goes with Carpet Court "because they back their product with advertising dollars".  He's made the wrong choice.
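The arithmetic behind those two decisions, assuming (and this is an assumption - the text only quotes the 30% figure for carpet retailing generally) the same margin and campaign cost for both retailers:

margin = 0.30  # carpet retailing margin quoted above
cost = 50_000  # billboard campaign cost

for name, incremental_revenue in [("Carpet Court", 300_000), ("specialist retailer", 100_000)]:
    incremental_profit = incremental_revenue * margin - cost
    print(f"{name}: {incremental_profit:+,.0f}")  # +40,000 versus -20,000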

Another way of looking at it is cost per sale.  As a big established retailer, ATL often generates good cost per sale because so many of the people who see it are considerers for your product.  But as any sort of challenger or niche firm with a more suitable product the cost per sale of a big ATL campaign is completely unjustified.  

So the core problem with Don's thesis is that spending money on advertising relates only to expected return; it has nothing to do with my belief in the quality of the product.  With this link broken, the conclusions that follow from it no longer hold.

What should I ask you?

I've tended to use this blog as a place to put my observations of things I have learned, rather than a place to discuss problems I'm trying to solve.  But something which has come up a few times recently, and which I haven't found a good solution to, is question selection. This post is about the game twenty questions.  How should I decide which questions to ask you to give me the best chance of guessing the answer?

Some context... Say I wanted to predict something about you, whether you're going to buy a particular product or what category I should classify you into.  I can produce a dataset with a whole lot of things I know about other people and whether they bought or their category, and then I could train a model such as an SVM and apply it to your particular data.  All machine learning 101 so far.

Collaborative filtering takes this a little bit further.  Based on observing a number of responses to questions from many people I can estimate the probability that you will give a particular answer to a particular question.  This gets things significantly more interesting because it largely eliminates the need to have a dataset where virtually every data point is known. 

To provide a concrete example consider the problem of predicting how much you'd like a particular movie.  In the basic case I could collect a whole lot of information about people and how much they like the movie, build my model and apply it to you.  But that's not terribly realistic because I'm unlikely to be able to ask most people the meaningful questions for predicting whether they like the movie (namely: whether they liked similar movies) because most people won't have seen virtually all of the similar movies.  There are techniques for reducing the impact of missing values such as Amelia, but they only scale so far - certainly not to most people having only seen a few dozen movies.

Collaborative filtering techniques such as RBMs  or SVMs help out a lot here.  They'll let you know what the best estimate is based on the unbelievably sparse dataset of movie scores.  

But the problem I want to pose is a bit harder still.  Everything we've talked about so far assumes a fixed set of knowledge about people.  What if I could ask you a few questions before telling you what you'd score a movie?  This seems far more interesting and meaningful: I could walk up to you and say "Do you like action films, or romance?" and immediately get a much better estimate than before.  To some extent attribute importance is a good answer here - at least it will tell you which question you should ask first.

But what should you ask second?  The second highest scoring attribute is usually at almost the same angle as the highest and so contributes almost nothing to our understanding.  Factor analysis or similar can produce the second best question in general which might be good enough for the second or even the third question we ask, but is a pretty poor choice for say the tenth question we ask.  

To illustrate imagine the game twenty questions and how hard it would be if you had to write down all twenty questions in advance of hearing a single answer.  So the problem is how to effectively fold the new knowledge into our understanding such that the next question we ask is as close to the best question as possible.  On trivial datasets this is easy - we can recompute the probability distribution for the answer based on the revised input space, but real problem spaces have far too few datapoints to consider this approach.  

So far the best I've come up with is to set up a RBM where the input predicts the output, and then simulating the impact on accuracy when a single new answer is provided.  Apart from being slow this is not particularly elegant and I'm hoping someone can think of a better idea.  
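For comparison, here is a minimal sketch of the greedy baseline on a toy, fully-observed dataset: always ask the unanswered question whose yes/no split over the remaining candidates is closest to even.  It sidesteps the sparsity problem entirely - which is exactly why it does not scale - but it shows the shape of the computation.  The people and questions are invented.

people = {
    "action_fan": {"likes_action": 1, "likes_romance": 0, "likes_scifi": 1},
    "romantic":   {"likes_action": 0, "likes_romance": 1, "likes_scifi": 0},
    "scifi_nerd": {"likes_action": 1, "likes_romance": 0, "likes_scifi": 1},
    "omnivore":   {"likes_action": 1, "likes_romance": 1, "likes_scifi": 1},
}

def next_question(candidates, asked):
    questions = {q for p in candidates.values() for q in p} - asked
    def balance(q):
        yes = sum(p[q] for p in candidates.values())
        return abs(yes - len(candidates) / 2)  # closest to an even split = most informative
    return min(questions, key=balance)

def apply_answer(candidates, question, answer):
    return {name: p for name, p in candidates.items() if p[question] == answer}

asked, candidates = set(), dict(people)
q = next_question(candidates, asked)          # 'likes_romance' splits the candidates most evenly
candidates = apply_answer(candidates, q, 0)   # suppose the answer is "no"
print(q, sorted(candidates))                  # likes_romance ['action_fan', 'scifi_nerd']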

How would you design the best computer program to play 20 questions? 

Eating your own dogfood

Site location is not a hard problem, though getting the data for a trustworthy result can be time consuming.  First you have to get the location of all competitors, apply a sensible drive-time buffer, score every potential customer in terms of their estimated value, and run it through a heat-map to find under-serviced areas.
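A toy sketch of that pipeline, with straight-line distance standing in for drive time and random numbers standing in for customer value (a real version would use geocoded practices and a proper demand estimate):

import numpy as np

competitors = np.array([[2.0, 3.0], [7.0, 8.0]])  # existing practice locations (made up)
catchment_km = 2.5                                # stand-in for a drive-time buffer

xs, ys = np.meshgrid(np.arange(0, 10, 1.0), np.arange(0, 10, 1.0))
cells = np.stack([xs.ravel(), ys.ravel()], axis=1)
customer_value = np.random.default_rng(0).uniform(0, 100, len(cells))  # stand-in for real demand data

dist_to_nearest = np.min(np.linalg.norm(cells[:, None, :] - competitors[None, :, :], axis=2), axis=1)
score = np.where(dist_to_nearest > catchment_km, customer_value, 0.0)  # zero out already-serviced cells

print("most under-serviced cell:", cells[np.argmax(score)])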

Recently my wife was looking to buy a practice and we were surprised by the lack of supply.  This raised the question: is the market saturated, or are there simply not enough practices built?  It also raised the question of whether residential or work addresses should be used for customer locations.

Finding the appropriate number of dentists per head of population proved surprisingly difficult.  Research into theoretical levels does not help much, especially when coupled with quite weak research into who doesn’t go to the dentist regularly. Similarly the quality of geocoded practice data was terrible.  Most practices were located correctly but there were numerous duplicates, dental labs which do not see patients and businesses that have long since closed down.  

What all this leads to is that the time required for the analysis blows out.  That’s fine if it’s a work environment and the client’s willing to pay, but when you’re trying to slip the analysis into the evening before going to bed it’s a lot more annoying.  


EC2 for databases

EC2, or Elastic Compute Cloud, is Amazon’s cloud computing offering.  You rent servers from Amazon at a low hourly rate, micropayment style.  Every byte going into your server, every byte you store for a month, it all adds up on your monthly bill.  This approach applies not just to the hardware, you can effectively rent Windows and SQL Server as well. The end effect is very low initial costs but slightly higher annual costs - perfect for startups.

Where it gets interesting is that you do not rent physical machines but virtual machines.  This means that you can clone your virtual machines, destroy them, scale them up, etc. to your heart’s content.  If you want to deploy a big change to production and have pre-prod running for a week then you can - and simply blow it away when you’re finished.  You don’t have to predict future computing needs - simply rent what you need now and rent what you need later well... later.   Think of the benefits that virtualisation provided a few years ago to the server room - Amazon’s cloud offering makes a similar advance to the ease of system management.

Also remember how when virtualisation first became popular it had terrible performance? Sadly we see something similar here.  Amazon’s instances do not have hard drives, they’re connected to their storage via the network port.  This means that at best you are going to get gigabit networking and at times you will get much worse (I’ve had as low as 100kB/s).

This brings us back to databases. Normal databases are heavily I/O bound, with 15k drives commonplace and many organisations transitioning to solid state.  The idea of trying to run a large database on a single USB hard drive would normally be considered a joke, and yet that’s effectively what Amazon provides. The performance of a complete system is therefore ... disappointing, even though on basic specs (RAM, CPU, etc.) it would look quite acceptable.

And there’s nothing you can do about it, a larger instance will mitigate the issue slightly by giving you more of the network bandwidth but fundamentally you’re limited to gigabit ethernet and even if you control the entire physical machine, you won’t control the SAN.

Unfortunately it gets worse.  Most databases I’m involved with are used for reporting and modern reporting tools are quite data intensive, generating beautiful visualisations which summarise a large amount of underlying data.  The problem is how you get a large amount of data from the database to the visualisation tool - with Amazon in Virginia and me in New Zealand, I’m lucky if I can sustain 1MB/s - perfectly acceptable for browsing the internet but extremely painful when trying to run reporting tools that were designed to operate correctly over a LAN.

So while cloud computing is interesting, I would advise you to hold off deploying substantial databases to EC2 until Amazon sorts out their I/O issues.

Fixing Operational Data in a DW

I obtain data generated by cheap, low powered devices that make mistakes.  Sometimes the data simply will not match the specification because the firmware mucked up (a bit flip perhaps) and there’s nothing anybody can do to prevent it.

So what should I do about it?  Rejecting it leads to a clean data warehouse but is unrealistic - we’ve lost the information that a transaction happened even if we don’t know exactly what transaction.  Fixing the data in place is also unrealistic, because it breaks the ability to reload the data warehouse and obtain the same state.


The solution I’ve come up with is simple but required quite a bit of coding.  A GUI is put over the Operational Data Store section of the data warehouse allowing changes.  Every time anybody changes the data there, a trigger kicks off a few actions:

  1. The original source file is backed up.
  2. The original data is deleted from the data warehouse.
  3. A file that simulates the generated data is created.
  4. The simulated file is loaded.

As far as I can see, this allows any authorised user to fix data in the data warehouse using either a table editing GUI or SQL, while maintaining the data warehouse’s integrity.  However I’m still sitting on the idea and seeing if there are any flaws in it - drop me an email if you can think of any.
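For concreteness, here is the flow of those four steps sketched in Python with stand-in functions; the real implementation is a GUI over the ODS plus database triggers and the normal load job, not a script.

import shutil
from pathlib import Path

def delete_from_warehouse(source_file):
    print(f"-- delete the rows loaded from {source_file.name}")  # stand-in

def write_source_format(path, rows):
    path.write_text("\n".join(rows))  # stand-in for the device's real file format

def load_into_warehouse(path):
    print(f"-- run the normal load job on {path.name}")  # stand-in

def apply_manual_fix(source_file: Path, corrected_rows, backup_dir: Path):
    shutil.copy2(source_file, backup_dir / source_file.name)  # 1. back up the original source file
    delete_from_warehouse(source_file)                        # 2. delete the original data
    fixed = source_file.with_suffix(".corrected")
    write_source_format(fixed, corrected_rows)                # 3. write a file simulating the corrected data
    load_into_warehouse(fixed)                                # 4. load it through the normal path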

Foreach Slowing Down?

I’ve got a server running SQL Server 2005 on EC2.  I know there’s criticism of EC2 for database use due to S3’s relative disk speed but this is running fast enough - and you’ve got to admit it’s a breeze to deploy a new server!

Recently though, it’s been acting really strange.  First the install onto a brand new (virtual) machine broke, with .NET somehow becoming corrupted.  I find this extremely odd:

  1. Create machine
  2. Run Windows Update
  3. .NET does not work

In the end I fixed it by downloading the .net update and running it manually but this isn’t the sort of thing that I can imagine squeezing past MS’s quality control - some weird interaction between EC2 and Windows?  I am sure running a virtual RHEL and then a virtual Windows on top of that isn’t healthy.

Anyway, the issue that I haven’t fixed is that I’m getting files sent to me bundled together in a zip file and processing them using a standard foreach loop.  Running locally this works fine.  Running remotely, however, the load gradually slows down to a crawl.  Clearly there is a funny interaction going on somewhere, but what? And where?

To be honest, I’m hoping not to have to resolve this, by migrating off 2005 and hoping it just goes away.  It just seems too random to be something that’s easy to find.  However I thought I’d put it out there in case either someone reading this happens to know the answer, or this post helps someone else in a similar situation.


Migrating SSIS

I recently inherited a small data warehouse developed in SSIS 2005.  It was only written three months ago so you’d expect it to be pretty compatible with modern technology. What I found was a horror story - the highlights from which are outlined below in case anybody else ever comes across similar issues.

Java Zip

If you search Google for how to unzip in SQL Server, the number one hit is some VB.NET code that calls a Java class.  It works just fine in 2005, but unfortunately that class is not distributed with 2008 (some people claim installing J# adds it back, while others claim that does not help).

In 2008 things are much more complex.  One option is to rewrite the script using System.IO.Compression, but that is far lower level, which makes it quite unattractive.  Another is to install a third-party zip library - again, not the simplest.  I eventually decided the best solution was to call out to an executable (7-Zip), which proved painless and quite efficient.
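
For illustration, the shell-out approach looks something like the sketch below.  It is Python rather than the actual VB.NET script task, and the 7-Zip install path and folder names are assumptions.

import subprocess
from pathlib import Path

SEVEN_ZIP = r"C:\Program Files\7-Zip\7z.exe"  # assumed install location

def unzip(archive: Path, destination: Path) -> None:
    destination.mkdir(parents=True, exist_ok=True)
    # "x" extracts with full paths; "-y" answers yes to any prompts.
    subprocess.run([SEVEN_ZIP, "x", str(archive), f"-o{destination}", "-y"],
                   check=True)

unzip(Path(r"C:\inbound\daily_files.zip"), Path(r"C:\inbound\extracted"))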

SQLNCLI.1

In 2005, the default connection to SQL Server uses the SQL Native Client (SQLNCLI).  This is deprecated in 2008 in favour of a new native client - from memory, SQLNCLI10.  It isn’t just being phased out slowly; the 2005 client is not even installed on 2008.

The migration wizard automatically converts this for you - unfortunately, the wizard also attempts to connect to the database at various points in the migration using SQLNCLI.1.  As a result the migration fails because the wizard cannot connect to the database.  The only workaround I have found is to install the 2005 client on 2008 so that it is available to the wizard.

Memory of past connections

One of the bugs in Visual Studio 2008 is that it keeps some metadata about the way things used to be.  One particularly annoying instance of this is that the package ‘remembers’ the old, invalid SQLNCLI.1 connections and keeps migrating back to them whenever the package is closed.  The only workaround I’ve found is to quit Visual Studio and use find and replace on the XML files.
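
The find and replace itself is easy to script.  Below is a rough sketch in Python, assuming the packages sit in a single folder and the provider string appears literally as SQLNCLI.1 in the .dtsx XML.  Back the files up before trying anything like this, and check which provider name (SQLNCLI10 or a versioned variant) your install actually expects.

from pathlib import Path

package_dir = Path(r"C:\projects\warehouse\packages")  # assumed location

for dtsx in package_dir.glob("*.dtsx"):
    xml = dtsx.read_text(encoding="utf-8")
    if "SQLNCLI.1" in xml:
        # Keep a copy of the original before rewriting the provider string.
        dtsx.with_name(dtsx.name + ".bak").write_text(xml, encoding="utf-8")
        dtsx.write_text(xml.replace("SQLNCLI.1", "SQLNCLI10"), encoding="utf-8")
        print(f"updated {dtsx.name}")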

Critical metadata in XML

One issue I had with the package was spelling errors - e.g. Retail misspelt as Retial.  Naturally I corrected these as I found them, but I did not check the XML support files.  Since the package parses the support file before loading, and a failure there means the support file is ignored, correcting the spelling in the GUI while neglecting to correct it in the XML will corrupt the package.

Conclusion

At the end of the process I believe I spent longer on the migration than it would have taken me to recreate the entire package in 2008 (by visually selecting the same options).  There are two big annoyances in what Microsoft has done: firstly, by changing connection managers, Microsoft has added a lot of work; secondly, by hard-coding critical data outside the GUI, Microsoft has made the wizard almost useless.  I see that SQL Server 2010 repeats the same mistake, requiring a migration from OLE DB (the only option supported in 2005 and the default in 2008) to ADO.NET.

I won’t even pretend OLE DB is better than ADO.NET - but I do not think it is good development practice for the default in one release to be unusable by default in the very next release.  You have to give people a workable migration strategy.

Poetry

CI Wizz, never boring

Have a chat, his name is Corrin

First impressions that you’ll find

Is that he has a fine tuned mind

OK, so he’s really really smart

But you know he has a caring heart

Working with Corrin can be really quite a shock

When you find out that he’s as logical as Spock

Always quick, never ever slow

The NZ champion of the board game GO

Became known as a foosball hustler

But not renowned for being muscular


Although an expert of Oracle and programming SQL

Moving to SAS, for the team he’s made it less of a hell

Master of the information, big and small

This guy really seems to know it all

Moving Fly Buys to the new data warehouse

Writing requirements quietly as a weremouse

A fan of the quotes lists, he collected the datum

Averse to photos of other people’s scrotums

Quitting LNZL, his career’s in the crapper!

But au contraire, he’s off to Snapper

The kindest guy that you ever did know

Corrin, we’ll miss you after you go

So it is with a final shake hand

We say goodbye to Mister Lakeland

Annoyances in SAS Enterprise Guide

I mentioned in my previous post that I selected SAS as the best tool for graphical data mining.  As part of that decision, we decided to move the entire analytics process to graphical programming in SAS.  What was previously custom SQL and PL/SQL is now diagrams in Enterprise Guide (EG).

I don’t regret that decision, and some of the things I have seen EG do elegantly and simply have blown me away.  Even more than that, seeing it do the job of four different tools adequately has more than justified the decision to standardise on the SAS platform.  It is a very fine tool, but this post is about the things in Enterprise Guide that make you go “What were they thinking?!”.

Graphical programming is a problem that has been attempted for a very long time - I first used it in HyperCard in the early 90s, and I’m sure it’s much older than that.  It therefore surprises me that so many of the issues I point out below are things that were solved elsewhere long ago.


Graphical Data Mining

At work, we recently decided to standardise on a graphical data mining tool rather than rely on me writing code.  This was intended to lower the barrier to entry, make processes more repeatable and so on.

In the process of implementing this I’ve been appalled at just how sorry the state of graphical data mining is.  I know graphical programming was trialled in the 90s and died a natural death, but I would’ve thought a few more lessons would have been learned from that - it is more than ten years on now!

I trialled five software packages: Oracle Data Miner, SAS Enterprise Miner, PASW Modeler (which was known as Clementine at the time), Orange and KNIME.  Every one of them failed most of the test criteria.

Oracle Data Miner was by far the easiest to use: automatically normalising variables when appropriate, dropping highly correlated variables, and so on.  However, it was terrible when I attempted to scale to a large number of models.  The GUI left piles of cruft behind, made marking models as ‘production’ almost impossible, and provided almost no facilities for overriding the decisions it makes (e.g. tuning the clusters).  The killer mistake, though, was its (lack of) ability to productionise code.  Its PL/SQL package output quite simply didn’t work (it would sometimes even crash!).

SAS Enterprise Miner does a lot of things right, but its user interface is awful, with frequent unexplained crashes.  Some specific examples: it doesn’t support nested loops; it is limited to 32 values for a class variable, but the limit is not enforced - later steps just crash at random; dragging a node in between two others won’t connect it up; and it doesn’t support undo.

Steps that you will almost always want to perform, such as missing value imputation or normalising attributes, are easy enough to do, but they’re not done by default - you must always remember to add them.  Building an SVM with the standard steps (sample, train/test split, impute, scale, test) requires finding and briefly considering half a dozen different nodes.  Why not default to normal practice and only require extra steps when you want to do something unusual?
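
For comparison, the same “standard steps” are only a few lines of code.  Here is a minimal sketch using Python’s scikit-learn (not one of the tools reviewed here), with synthetic data standing in for the real attributes.

from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # missing value imputation
    ("scale", StandardScaler()),                   # normalise attributes
    ("svm", SVC(kernel="rbf")),                    # the actual model
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))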

I won’t go into detail on PASW, Orange or KNIME, except to note that despite being free, KNIME had the best interface of all the tools I looked at.  Ultimately I decided SAS was the best despite its warts, because of its underlying power.  I do wonder what the designers were thinking: data mining best practice is now pretty well defined, so why not design your tool to make best practice the easiest option?


Quantile Regression

Quantile regression is a variant of regression which, instead of minimising squared error (and so estimating the mean), aims to have a chosen percentage of observations fall above and below the prediction.  In its simplest form the target is 50%, so the algorithm finds the median.

That’s enough theory; for people who want more I’d suggest Wikipedia to start with and this great paper for the details.  I’m more interested in applications, so I’ll concentrate on them.  If normal regression answers the question ‘what’s most likely to happen?’, then quantile regression answers the question ‘how likely is this to happen?’.  Some practical examples follow:

How confident can I be that this campaign will make a profit?

How confident can I be that this patient doesn’t have cancer?

What’s the most I can reasonably expect this person to spend?

What’s the least revenue I can reasonably expect this store to make?

Such questions can be attempted with regular regression if errors are assumed to be normally distributed - predict the expected value and add a standard deviation or two to increase confidence.  In practice I’ve found this a very poor approximation.  Say somebody is an ideal candidate for spending all their money: they will be predicted to spend quite a bit, but after adding a couple of standard deviations we’ll predict they will spend more than their entire income!

Quantile regression hasn’t made it into many tools yet; you’re pretty much limited to R and SAS if you want to give it a whirl.  And even then, it’s an optional add-on package in R, and in SAS it’s marked experimental and effectively can’t be called from Enterprise Miner.
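
For anyone who wants to experiment with the idea, here is a minimal sketch in Python using statsmodels’ QuantReg on synthetic data; the spend-versus-income setup and the 90th percentile are purely illustrative assumptions.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 100_000, size=500)
# Skewed, income-dependent spend: exactly the case where "mean plus two
# standard deviations" behaves badly.
spend = 0.1 * income * rng.lognormal(mean=0.0, sigma=0.5, size=500)

X = sm.add_constant(income)
median_fit = sm.QuantReg(spend, X).fit(q=0.5)  # the conditional median
upper_fit = sm.QuantReg(spend, X).fit(q=0.9)   # "the most I can reasonably expect"

print(median_fit.params)
print(upper_fit.params)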

To all the statistical pedants out there: yes, I have oversimplified the explanation above.  But for the kinds of uses I put it to, the description is accurate enough.