I've tended to use this blog as a place to put my observations of things I have learned, rather than a place to discuss problems I'm trying to solve. But something which I've had come up a few times recently and haven't found a good solution to is question selection. This post is about the game twenty questions. How should I decide the questions to ask you which give me the best chance of guessing an answer.

Some context... Say I wanted to predict something about you, whether you're going to buy a particular product or what category I should classify you into. I can produce a dataset with a whole lot of things I know about other people and whether they bought or their category, and then I could train a model such as a SVM and apply it to your particular data. All machine learning 101 so far.

Collaborative filtering takes this a little bit further. Based on observing a number of responses to questions from many people I can estimate the probability that you will give a particular answer to a particular question. This gets things significantly more interesting because it largely eliminates the need to have a dataset where virtually every data point is known.

To provide a concrete example consider the problem of predicting how much you'd like a particular movie. In the basic case I could collect a whole lot of information about people and how much they like the movie, build my model and apply it to you. But that's not terribly realistic because I'm unlikely to be able to ask most people the meaningful questions for predicting whether they like the movie (namely: whether they liked similar movies) because most people won't have seen virtually all of the similar movies. There are techniques for reducing the impact of missing values such as Amelia, but they only scale so far - certainly not to most people having only seen a few dozen movies.

Collaborative filtering techniques such as RBMs or SVMs help out a lot here. They'll let you know what the best estimate is based on the unbelievably sparse dataset of movie scores.

But the problem I want to pose is a bit harder still. Everything we've talked about so far assumes a fixed set of knowledge about people. What if I could as you a few questions before telling you what you'd score a movie? This seems far more interesting and meaningful: I could walk up to you and say: "Do you like action films, or romance?" and immediately get a much better estimate than before. To some extend attribute importance is a good answer here - at least it will tell you which question you should ask first.

But what should you ask second? The second highest scoring attribute is usually at almost the same angle as the highest and so contributes almost nothing to our understanding. Factor analysis or similar can produce the second best question in general which might be good enough for the second or even the third question we ask, but is a pretty poor choice for say the tenth question we ask.

To illustrate imagine the game twenty questions and how hard it would be if you had to write down all twenty questions in advance of hearing a single answer. So the problem is how to effectively fold the new knowledge into our understanding such that the next question we ask is as close to the best question as possible. On trivial datasets this is easy - we can recompute the probability distribution for the answer based on the revised input space, but real problem spaces have far too few datapoints to consider this approach.

So far the best I've come up with is to set up a RBM where the input predicts the output, and then simulating the impact on accuracy when a single new answer is provided. Apart from being slow this is not particularly elegant and I'm hoping someone can think of a better idea.

How would you design the best computer program to play 20 questions?

Corrin's Website