October 13, 2017
You want to run some tests and you’ve got the buy-in to do so…now what?
One of the most common questions we got at ConversionXL was “what should I test?” If you’ve got a good process, there should be no shortage of test ideas. But problems also come up when you have a bunch of great ideas and no way to prioritize them.
You can’t just run them all at the same time, so there must be a way to strategically prioritize them for maximum efficiency.
This article will explore some ideas like ROI, systems thinking, resource allocation, and tactical prioritization. It will present some common ways that those who run experiments prioritize them, as well as how you can optimize your optimization program over time. Let’s start with the process of discovering great testing opportunities.
Opinions, arbitrary experiments, and other time wasters
When we get into testing, it’s typically because we heard about another company, perhaps a competitor, getting a big win from A/B testing. In its most extreme form, we see that this company got some absurd lift from some micro-change on their website (e.g. a 50% increase in conversions from changing a button from green to orange). It’s great, on one level, that this case study got you interested in testing (after all, there’s a strong business case for it). But on the other hand, reading things like this is super misleading as to what experimentation actually looks like, what to expect, and what actually creates impact.
In addition, it’s almost always the case that when a number looks unbelievable, it’s probably wrong (See: Twyman’s Law). In blog posts like this, you’re probably looking at a false positive that an opportunistic growth team turned into a content/PR piece (especially if they don’t release any data like sample sizes or test cycles).
In reality, the best way to find impact is not by testing:
Random things on your site or product
Opinions from your boss
Things your competitors have done
Things you found on a listicle written by a blogger who hasn’t ever run a test
The best way to find testing victories is to implement a process with which you consistently unearth user insights and discoveries.
A better way: build an insight-generation process
A process tells you:
Where are the problems?
Why are they problems?
How to prioritize solutions and experiments.
The discovery of what matters is the best way to come up with tests and to prioritize them in the future. Of course, there are a few ways you can go about discovering what matters. One of those ways is, quite simply, to begin with the right questions. Ask things like:
Whose problem are we solving?
What do they think they need?
What words do they use to describe their problem?
What are they thinking when they see our offer?
Which networks are they already hanging out in?
Is there a smarter way we can onboard new users?
How can we make this segment of customers use this product feature more often?
When you begin asking important business questions, you can see where your data tracking is inaccurate or incomplete. In addition, you can see where you can find answers in the data you have. Finally, you can plan a) research or b) tests to discover the answers to some of those questions. For instance, take the question “what do our customers think they need?” This question, at its core, is about user motivation, but it’s also about voice of customer. We likely can’t reach any answers through historical digital analytics, but we can probably find some insights through user testing, customer surveys, and customer interviews. From there, we can set up a tactical plan to capture some of those insights, such as using an on-site survey (using something like Hotjar):
You can also learn what matters from past experiments. When you think of it that way, inconclusive tests are much more valuable than they first seem.
Andrew Anderson talks about this in his Discipline-Based Testing Methodology. By emphasizing discovery and reducing the value of “right” and “wrong” ideas, this approach takes the pressure off of single experience victories and sets you up for success in a systems sense. In other words, the value of your program isn’t determined by whether any one hypothesis is right. If you design your experiments correctly, you can get value even from being “wrong”:
“What does “inconclusive” really mean? Is it just that you didn’t get the answer you were hoping for? Or does it mean that the thing you are testing has little to no influence?
Knowing something has little influence is incredibly valuable, so that is far from inconclusive…
…Does copy matter on this page? Well, if I have tested out a large beta range of 10 options, and they all fail to move the needle, then I can be pretty sure copy doesn’t matter. Likewise, if eight of them fail to move the needle but two do, that tells me it is the execution.”
If what you thought would win does so, your test is marginally valuable. If what you thought would lose actually wins, your test is incredibly meaningful. You mitigated the risk that would have been brought on by your faulty instincts, and instead allowed the process to show you the way to a more effective user experience. Ego be damned, that’s what experimentation is about.
Especially in the beginning of growth projects, or periodically throughout the year, it helps to have an insight-generation process. At CXL, we used our ResearchXL model for conversion research. It was specifically set up to find A/B testing opportunities that would maximize value early on by solving the biggest problems first. I won’t dive too deeply here (you can read more here), but it consists of six prongs of user research:
Digital analytics analysis
Mouse tracking analysis
With a formal process like this, you still need to ask critical business questions. But implementing something holistic such as ResearchXL allows you to do so from multiple angles, including qualitative methods like user testing and surveys, technical ideas, and quantitative behavioral insights using digital analytics data.
One more point: it’s not all about solving obvious current problems with growth experiments. It’s also about innovation and allowing creativity to flourish (while also minding ROI and efficiency of course). A good way to do that is with a formula Geoff Daigle recently wrote about. As he explains, some of the best breakthroughs have come through some version of this formula:
So, take a subsection of users, a desired action (using some new feature for instance), and a theme. The theme is the most ambiguous part here, because it doesn’t specify where this theme comes from. My understanding is that it can be from any empirical source of insight, which could be an academic study in behavioral psychology, a series of user tests, past experiments, etc.
So a theme could be that users generally respond favorably to social proof on landing pages and sign up more easily. You could have learned that from past experiments as well as through reading classics like Cialdini’s Influence. I assume the more data points you have to support a theme, the more confident you can be in its power.
One final point: we can’t always accurately predict what our biggest problems are or what themes will work. Most importantly, we almost never know why something worked.
This gets into the narrative fallacy a bit, or the tendency for humans to see patterns in disparate data points. We’ll get into this later when it comes to prioritization, but if an experiment is very cheap to run and the expected value outweighs the cost of running the test, it’s likely worth it to run the test. According to this HBR article, some of the biggest wins at Bing came without any underlying user behavior theories:
“At Bing some of the biggest breakthroughs were made without an underlying theory. For example, even though Bing was able to improve the user experience with those subtle changes in the colors of the type, there are no well-established theories about color that could help it understand why. Here the evidence took the place of theory.”
How do I prioritize growth experiment ideas?
Growth is not about the tactics, and if you place the emphasis on individual ideas and experiments, you’re bound to lose in the long term. The Paul Graham quote about growth hacking comes into play here: “Whenever you hear anyone talk about ‘growth hacks,’ just mentally translate it in your mind into ‘bullshit’.”
And what do you do with a big list of tactics anyway? Try them all out at once? If so, your site might look like Ling’s Cars:
…which works for Ling’s Cars, but won’t work for you. Or do you test them out one at a time, going down the list? Well, we all have a limited amount of traffic (even Amazon), so that’s going to take you years and years. Truth is, we all have a ton of great (and poor) ideas, and we need some way to prioritize them. That’s why almost everyone I know in optimization and growth operates on some sort of system with which to choose experiments. Furthermore, almost every system boils down to two variables: impact and cost.
Expected value and how we weigh the value of experiment ideas
“The expected value (EV) is an anticipated value for a given investment. In statistics and probability analysis, the EV is calculated by multiplying each of the possible outcomes by the likelihood each outcome will occur, and summing all those values. By calculating expected values, investors can choose the scenario most likely to give them the desired scenario.”
Here’s a simple demonstration. Let’s say I give you the opportunity to roll a die for $1, and every time you roll a 3 you win $5. Is it a good decision to take the roll? In this case, no.
To calculate the expected value, we can take the win rate (1⁄6) times the reward ($5), which gives us $0.83. That’s less than $1, so the expected value is less than the cost.
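The die-roll arithmetic above can be sketched in a few lines of Python. This is only an illustration of the definition (payoff times probability, summed over outcomes); the function and variable names are made up for this example.

```python
from fractions import Fraction


def expected_value(outcomes):
    """Sum payoff * probability over (payoff, probability) pairs."""
    return sum(payoff * prob for payoff, prob in outcomes)


# Rolling a 3 pays $5 (probability 1/6); any other face pays nothing (5/6).
die_roll = [(5, Fraction(1, 6)), (0, Fraction(5, 6))]
cost_to_play = 1  # dollars per roll

ev = expected_value(die_roll)
print(f"Expected payout: ${float(ev):.2f}")   # Expected payout: $0.83
print(f"Worth taking: {ev > cost_to_play}")   # Worth taking: False
```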
It’s easy to use this math on experimentation as well. If it costs you $1,500 to run a test, you have a 10% experiment win rate, your overall revenue affected on this test would be $1,000,000, and you have a 10% average lift, what is the Expected Value of this test?
The value of our win would be roughly $1 million times our average lift of 10%, so $100,000. This is our reward. Our win rate is 10%, so we can multiply that by our reward of $100,000 to get an expected value of $10,000. Since the test costs only $1,500, it makes perfect sense to run it.
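Here is the same calculation as a rough Python sketch, using the numbers from the example. The function name and parameters are hypothetical, and the model assumes a losing test is worth nothing, which is part of why the result is only a rough guide.

```python
def experiment_expected_value(revenue_affected, average_lift, win_rate):
    """Chance of a win times the value of a win (losses assumed to be worth $0)."""
    reward_if_win = revenue_affected * average_lift
    return win_rate * reward_if_win


test_cost = 1_500  # dollars to run the test

ev = experiment_expected_value(
    revenue_affected=1_000_000,  # revenue touched by the test
    average_lift=0.10,           # average lift when a test wins
    win_rate=0.10,               # share of tests that win
)
net = ev - test_cost

print(f"Expected value: ${ev:,.0f}")      # Expected value: $10,000
print(f"Net of cost:    ${net:,.0f}")     # Net of cost:    $8,500
print(f"Run the test:   {ev > test_cost}")  # Run the test:   True
```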
Note: this is super simplified. There are other factors involved like opportunity costs, organizational feasibility, interaction effects, external validity factors, and time constraints. This exercise is just to demonstrate the logic that most prioritization systems come from. The fact is, most of them come down to this question: given the resource costs, will the value be worth it to run the test?
In fact, a high impact/low cost test will almost always be at the top of the list. If a test with high impact potential is virtually free to run, it’s almost better to ask “why wouldn’t we run this test?” As I mentioned, most frameworks operate on some variety of the above two variables. The most popular framework I’ve seen in the context of growth is the ICE framework, popularized by Sean Ellis and GrowthHackers.com. This is based on three variables:
Impact
Confidence
Ease
You already know about impact and ease (i.e. resources). Confidence is simply a lever you can use, based on insights from past experiments, conversion research, or simply theories that are usually effective in influencing behavior. In conversion optimization, the PIE framework is quite popular (and super similar). It stands for:
Potential
Importance
Ease
Potential here is a function of how much room for improvement there is, where importance is sort of its “visibility” in terms of traffic, use, and value. Ease is the same.
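Frameworks like ICE and PIE can be sketched as a simple scoring function over a backlog. The example below assumes each variable is graded from 1 to 10 and that the score is the plain average of the three; the article does not specify how the variables are combined, so treat the averaging, the idea names, and the grades as illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # assumed 1-10 scale
    confidence: int  # assumed 1-10 scale
    ease: int        # assumed 1-10 scale

    @property
    def ice_score(self) -> float:
        # Assumption: a plain average of the three grades.
        return (self.impact + self.confidence + self.ease) / 3


# Hypothetical backlog with made-up grades.
backlog = [
    Idea("Rewrite pricing page headline", impact=7, confidence=6, ease=9),
    Idea("Rebuild onboarding flow", impact=9, confidence=5, ease=2),
    Idea("Add social proof to landing page", impact=6, confidence=8, ease=9),
]

# Highest score first.
ranked = sorted(backlog, key=lambda idea: idea.ice_score, reverse=True)
for idea in ranked:
    print(f"{idea.ice_score:.1f}  {idea.name}")
```

The same structure works for PIE by renaming the three fields to potential, importance, and ease.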
At CXL, we created a conversion optimization specific framework called PXL that values objective evidence and a binary scale for grading. We thought it was a bit sketchy to subjectively grade your own ideas on a scale of 10 (is your confidence a 4 or a 7? What’s the quantitative difference?). This framework, too, values resources and impact, but it forces you to bring data and evidence to the table.
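To show what a binary, evidence-based scale looks like in practice, here is a minimal sketch. The checks below are hypothetical stand-ins, not the actual PXL criteria; the point is only that each question is answered yes or no and the answers are summed, so there is no arguing over a 4 versus a 7.

```python
# Hypothetical yes/no checks; these are NOT the real PXL questions.
EVIDENCE_CHECKS = (
    "supported_by_user_testing",
    "supported_by_analytics",
    "supported_by_survey_feedback",
    "on_high_traffic_page",
    "cheap_to_implement",
)


def binary_score(answers):
    """Count the checks answered yes; a missing answer counts as no."""
    return sum(1 for check in EVIDENCE_CHECKS if answers.get(check, False))


# Made-up ideas with made-up answers.
ideas = {
    "Shorten checkout form": {
        "supported_by_user_testing": True,
        "supported_by_analytics": True,
        "on_high_traffic_page": True,
    },
    "Change button color": {
        "cheap_to_implement": True,
    },
}

scores = {name: binary_score(answers) for name, answers in ideas.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score}  {name}")
```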
In reality, it doesn’t matter which trademarked acronym you use, just that you’re balancing risk and reward in some intelligent way. Is the cost high? Then the possibility of a high reward better be worth the cost.
At its core, experimentation prioritization—whatever framework you use—is a simplified version of decision theory. How much should we spend in resources for a desirable reduction in uncertainty? These frameworks help you make decisions and hopefully get more out of your optimization program. Be careful not to undervalue ideas, though. As I mentioned above, if a test is sufficiently cheap to run, it’s usually worthwhile to do it. A recent article in HBR from Ron Kohavi and Stefan Thomke talks about that:
“In 2012 a Microsoft employee working on Bing had an idea about changing the way the search engine displayed ad headlines. Developing it wouldn’t require much effort—just a few days of an engineer’s time—but it was one of hundreds of ideas proposed, and the program managers deemed it a low priority. So it languished for more than six months, until an engineer, who saw that the cost of writing the code for it would be small, launched a simple online controlled experiment—an A/B test—to assess its impact. Within hours the new headline variation was producing abnormally high revenue, triggering a “too good to be true” alert. Usually, such alerts signal a bug, but not in this case. An analysis showed that the change had increased revenue by an astonishing 12%—which on an annual basis would come to more than $100 million in the United States alone—without hurting key user-experience metrics. It was the best revenue-generating idea in Bing’s history, but until the test its value was underappreciated.”
Andrew Anderson also talks a lot about devaluing the “I think this will BLANK” part of a test idea, and instead working to maximize the efficiency of the program itself, and to maximize your chances of ROI on a given experiment. In an article on CXL’s blog, this is what he told me regarding roadmapping and prioritization:
“One of our key disciplines is that you make plans around the resources you have, not grab resources to match the plans you have. What that means is that we keep a massive backlog of ideas, and then we see what makes sense when we are getting to the end of the test and based on our other efforts. That way, we always have tests working, but we are flexible. We also prioritize our sites by the number of experiences they can handle, as well as tests by the population of the pages we handle. We try to keep one larger test going at a time, but that can take as much as a quarter. We then keep 3-4 medium tests going at all times, and the rest are small tests, tests that can be coded in a few seconds (the previously mentioned font test was one of those). This means we never have a set lists of tests, but we have a roadmap of resources and larger tests as well as focus areas. This also allows us to always be able to slot tests based on what we learn and where they can best be exploited.”
So there are many ways to do this, but most of them revolve around resources, impact, ROI, and efficiency (all of them should).
While there is a whole math and science to how we choose and prioritize decisions and experiments to invest in, there are a few things you really need to keep in mind to be effective.
Have a process. One-off “I think BLANK” opinion-based ideas won’t get you far.
Solve impactful business problems.
Prioritize based on impact and resources.
Don’t under- or over-value individual ideas.
Continually tweak your program to maximize results.
As I mentioned, there’s a lot to learn about growth program management, test prioritization, and roadmapping. Here’s some further reading: