Tuesday, August 18, 2009

Rapid innovation using online experiments

Erik Brynjolfsson and Michael Schrage at MIT Sloan Management Review have an interesting take on the value of A/B tests in their article, "The New, Faster Face of Innovation".

Some excerpts:
Technology is transforming innovation at its core, allowing companies to test new ideas at speeds -- and prices -- that were unimaginable even a decade ago. They can stick features on Web sites and tell within hours how customers respond. They can see results from in-store promotions, or efforts to boost process productivity, almost as quickly.

The result? Innovation initiatives that used to take months and megabucks to coordinate and launch can often be started in seconds for cents.

That makes innovation, the lifeblood of growth, more efficient and cheaper. Companies are able to get a much better idea of how their customers behave and what they want ... Companies will also be willing to try new things, because the price of failure is so much lower.
The article goes on to discuss Google, Wal-Mart, and Amazon as examples and to talk about the cultural changes necessary for rapid experimentation and innovation (such as switching to a bottom-up, data-driven organization and reducing management control).

I am briefly quoted in the article, making the point that even failed experiments have value because failures teach us about what paths might lead to success.

10 comments:

jeremy said...

They can stick features on Web sites and tell within hours how customers respond.

Greg: Could you explain to me what gives companies the confidence that any new change can be assessed within hours?

There are some features on search engines that I, personally, don't even notice or never use... even after those features have been there for years.

Are there really so many users that enough people notice a change the first time it is introduced and use it immediately? Really?

anand kishore said...

the link to the article seems to be broken

Greg Linden said...

Oops, sorry about that, Anand! Fixed!

Greg Linden said...

Hi, Jeremy. It comes down to whether the sample size is sufficient for differences between the two groups to reach statistical significance. Who is in the test and control groups (the A and B of an A/B test) is often narrowed to people who have a chance to see the feature, either by just viewing the page or by clicking some related content.
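
To make that concrete, here is a minimal sketch (in Python, with made-up numbers, not anything from a real system) of the kind of check involved: a two-proportion z-test comparing click-through rates in the A and B groups, counting only users who had a chance to see the feature.

import math

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    # Two-sided z-test for a difference in click-through rate between groups.
    p_a = clicks_a / users_a
    p_b = clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)   # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts, restricted to users who actually saw the page.
z, p = two_proportion_z_test(clicks_a=480, users_a=10000,
                             clicks_b=540, users_b=10000)
print("z = %.2f, p = %.4f" % (z, p))   # call it significant if p < 0.05

If the p-value comes in below the chosen threshold, the difference between A and B is unlikely to be noise; if not, you need more users or a longer test.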

jeremy said...

Who is in the test and control groups (the A and B of an A/B test) often is narrowed to people who have a chance to see the feature

Let me put it this way: When interfaces change all the time, I (personally) have the tendency to ignore anything new, and just go to the 2 or 3 things that I normally use. So if some other feature is added or subtracted, I tend to not pay attention. I've almost got a learned aversion to things that I know are just going to change again, anyway.

If most people are like me, in that they ignore most of what they don't already know, then it wouldn't really matter if you had 2% or 100% sample size. Because most people aren't going to use your innovation, anyway. Not because they don't want to, but because of change blindness.

http://en.wikipedia.org/wiki/Change_blindness

So I guess what I am asking is: What is the relationship between change blindness and sample size? Suppose you had a 100% sample. How would you know whether B is better than A, given that users are already attuned to A, and don't really know or understand what B does? How does the change blindness get factored into the calculations?

(Any papers would be appreciated..)

Greg Linden said...

Hi, Jeremy. If a large group of users in both the test and control groups don't notice the change, the behavior of many people in the two groups will look the same. That shows up as the change failing to pass significance tests unless you increase the sample (usually by increasing the duration of the test).
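
As a rough back-of-the-envelope illustration of that point (made-up numbers, using the standard approximate sample-size formula for comparing two proportions): if only a fraction of users notice the feature, the lift you can observe shrinks by roughly that fraction, and the sample you need grows roughly with the inverse square of the lift.

import math

def users_per_group(p_base, lift, z_alpha=1.96, z_beta=0.84):
    # Approximate users needed per group to detect p_base -> p_base + lift
    # at alpha = 0.05 (two-sided) with 80% power.
    p1, p2 = p_base, p_base + lift
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
          z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / lift ** 2
    return math.ceil(n)

base_rate = 0.05     # hypothetical baseline click-through rate
true_lift = 0.005    # lift among users who actually notice the feature

for notice_fraction in (1.0, 0.5, 0.1):
    observed_lift = true_lift * notice_fraction   # dilution from inattention
    print("%3.0f%% notice -> about %s users per group"
          % (notice_fraction * 100, format(users_per_group(base_rate, observed_lift), ",")))

With these illustrative numbers, going from everyone noticing to one user in ten noticing takes the required sample from tens of thousands per group to millions, which in practice means running the test much longer.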

Does that make sense? Or am I still missing something in your question?

jeremy said...

Correction, I think I mean inattentional blindness, not change blindness.

http://en.wikipedia.org/wiki/Inattentional_blindness

My question is the same, though: If users have been doing things one way for years, and suddenly you introduce a new way of doing things, does A/B testing really tell you which way is better? The new way might indeed be better, but users are going to be reluctant to use it, due to inattention or habituation. So even with a 100% sample size, how do you correct/normalize for that?

jeremy said...

Greg, I think you do understand my question.

So the way you overcome that problem, for a given sample size, is to increase the duration, so that inattention/habituation patterns start to melt away? That does make sense.

I'm still left wondering how you know how long it takes to overcome various habituations. How long do you increase the duration of the experiment for? Keep going and going, until you actually get a significant difference?

Dan Rosler said...

If you believe that there is the risk of change blindness from existing customers, then perhaps you could also consider looking at the performance behavior of only those users who are new customers. Depending on the rate of customer acquisition, you may need to run the test longer to gain a significant sample, but you will have ruled out the change blindness noise from pre-existing users.

jeremy said...

@Dan: That solves the problem of change blindness in existing customers. But how would you then compare the new features/system/whatever against the old customers?

Is it a fair comparison to look at a new feature with new (inexperienced) customers, versus an old feature with old (experienced) customers?

You might be able to get rid of change blindness, but now you have experience bias to contend with. The old customers might do better, simply because they're experienced. Even if the new feature really is a better feature.

Know what I'm saying?

This is where I'm still scratching my head when it comes to all the evangelism around A/B testing, rapid innovation, etc. I'm not saying I'm correct to dismiss it. I'm not saying I'm incorrect. Obviously there are many people here with orders of magnitude more experience than me in doing it. And it also seems to be what everyone (all the successful companies... the Amazons and the web search engines like G/Y/B) is proclaiming. So it must work, by an appeal-to-authority argument.

But I don't want just the appeal-to-authority argument. I have honest concerns about how one knows whether one is really testing what one thinks one is testing. Sure, you can always just do something. And you can even create situations in which you can observe statistically-significant differences between those situations. But have you, in the end, really built something that does what you think it does? How do you know, if you can't really get good measurements, but have to make all these assumptions about experience, change blindness, etc.?

One thing I've heard is that these issues all wash out, in the statistics. But, do they really wash out, or do they wash in?

I don't want to keep putting Greg on the hook to answer me, so if someone else can help enlighten me, I'd really like to understand it better.