We put a lot of effort into crafting the first version of our app store listing. When we start from zero, the content is whatever we think will perform best: which images, video and copy will convince merchants to install our app, and which keywords will rank us for relevant searches. But at this point it’s guesswork.
Once that first version has been up for a while, we’ll want to optimise it to find a new version that brings more traffic and converts to more installs. A/B testing is how we turn these optimisations from opinions and guesswork into a science. It’s a tool that leads to compound growth through changes we know have improved our traffic and installs. It also helps us avoid changes that we thought would improve things but have insidiously hurt one or more traffic sources.
Over two years, I experimented a lot with our app store listings. These experiments often led to unexpected results that we wouldn’t have noticed if we’d just made the changes without A/B testing them. A/B testing is a way to improve your Shopify App Store listing with confidence rather than guesswork, for more traffic and installs.
What is A/B testing?
A/B testing, in its simplest form, is having two versions of your app listing: A and B. We want to see whether the B version’s changes positively or negatively affect our rankings, listing views and install rate. We do this by comparing the performance of the B version against the A version.
To make an accurate comparison, we want the conditions (traffic sources and volume, ranking algorithms) for the A and B versions to be as similar as possible, so that we’re testing in the same environment.
We want enough users to be exposed to both versions before we can say the difference between A and B is significant enough to be confident we’re right. Consider testing a new, improved cereal recipe on only two people versus a trial we run on 2,000. If those first two both say they like the new version, can we say it’s better? What if 1,500 say they like the new version and 500 say they don’t? That’s much better. We have greater confidence because the larger sample size gives us statistical significance.
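To make that concrete, here’s a minimal sketch in plain Python of the 1,500-out-of-2,000 result above: a one-proportion z-test against the ‘no preference’ baseline of 50%, plus a 95% confidence interval for the observed preference rate. The numbers are the illustrative ones from the cereal example, not real data.

```python
from math import sqrt
from statistics import NormalDist

n = 2000       # tasters in the larger trial
k = 1500       # tasters who preferred the new recipe
p_hat = k / n  # observed preference rate: 0.75

# One-proportion z-test against the "no preference" null of p = 0.5
p0 = 0.5
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)  # one-sided: "the new recipe is preferred"

# 95% confidence interval around the observed rate
margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)

print(f"preference rate: {p_hat:.1%} ± {margin:.1%}")
print(f"z = {z:.1f}, one-sided p-value ≈ {p_value:.2g}")
# Plugging the same 75% rate in with n = 2 would give a margin of roughly
# ±60 percentage points, which tells us nothing.
```

With 2,000 tasters the margin is under two percentage points, which is why the bigger trial lets us call the result with confidence.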
On the web, most A/B tests run in parallel by using a bit of JavaScript. It buckets a user into either A or B and then shows them their version. On our Shopify app listing, we can’t run arbitrary JavaScript like that. Instead, we run the A version for a time, change the app listing, and then run this B version for a time.
We look at our analytics to see how the B version performed versus the A version, calculate whether we have a statistically significant result and then decide what to do about it. Hopefully, we’ll see a clear improvement or deterioration in our rankings, views or install rate, in which case we’ll either leave the B version in and go on to try something else, or go back to the A version. Sometimes, though, a test runs for a while without ever reaching statistical significance. I’ll talk about that scenario too.
Why A/B test your Shopify app listing
We’re so deep into our product, messaging and branding that we can’t see how it’ll look to others. Humans behave in unpredictable ways. If we just do what we think is best to improve our text, graphics or video, we’re blindly making changes without knowing if those changes are improving things or not.
A little better is to measure our overall metrics. If we see our total installations increase after making some app listing changes, we could conclude that the modifications improved things. But they may not have. We may not be comparing like-for-like: was the improvement caused by something completely different, like a new traffic source? Or we may not even have enough data before or after to reach a statistically significant conclusion.
Our A/B test doesn’t need to be complicated. It is just a more controlled way of making changes than watching metrics go up or down. A/B testing gives us an excellent way to pile change upon change, compounding our app store success instead of relying on guesswork.
For a while, I was subjectively making changes to our app listings. The first couple of changes gave me double-digit percentage lifts in views and installs. Awesome. Then I hit a wall. I tried a few things that didn’t move the needle. It seemed like it wasn’t worthwhile to optimise our listings any more. But I proved myself wrong when I started applying the same A/B testing techniques to our app listing that we were using in product development. Doing that began to result in incremental improvements again.
And clarity on unexpected deteriorations too! We found that adding a video to our listings strangely decreased the install rate. Similarly, for the key benefits images, some variants decreased our install rate. When it came to keyword optimisation, rankings improved enormously with specific changes that also increased our listing views and install rate.
I could go on and on about what we found, but what matters more is the process we used to discover these things. If you apply the same A/B testing method to your app listing changes, you will get more views and installs for your Shopify app.
Is A/B testing right for you?
If you’re starting from zero, or with little traffic to your app listing, A/B testing will not work. It relies on having enough visitors to the A and B versions to reach the statistical significance I wrote about earlier. If your app is new and you’re looking for your first hundred installs, A/B testing won’t get you there.
My situation was that our apps had tens of thousands of users. We were getting traffic to our app listing internally in the app store from keywords, collections and categories. The incremental improvements that A/B testing gave us meant we could deliver tangible increases in monthly recurring revenue (MRR).
A/B testing was a highly effective way to build on our existing success to improve our ranking, app listing views and install rate.
Designing an A/B test
When I worked at a large online retailer, we had a salesperson for an A/B testing tool come in with his presentation. He clicked to a slide. It showed a 60% increase in product page conversion rate from changing the ‘add to cart’ button from blue to red. Our online merchandising manager gasped. I rolled my eyes. Sure, it’s possible, but this kind of dramatic swing doesn’t happen often. A well-designed A/B test for your app listing page will be something deeper than tweaking a colour. Individual versions tend to give incremental growth that, in aggregate over time, compounds into considerable improvements.
The other extreme (which, incidentally, I saw at the same online retailer!) is changing too much and then not being able to work out what happened. If we go in and change our app name, tagline, key benefits and key benefit images, add a new video, and then our install rate drops, what caused it? We would have no idea which of those changes, alone or in combination, was responsible.
There are two schools of thought here: those who say an A/B test should make the smallest detectable change possible, for confidence in the result, and those who say that making a few coherent changes at once leads to faster, stronger results. What I found best was to have a single hypothesis. That single hypothesis may mean we change just one aspect of our app listing, or more than one.
An example was when I analysed our keywords. I could see we were ranking poorly for one key feature of our app: structured data. My hypothesis was that ranking better for keywords related to structured data would lead to more views and installs, without compromising our existing rankings. Running a test where I edited just the description would have been slow to produce results, and weak ones at that. Instead, I decided to edit the description, key benefits and a key benefits image.
An example of where we needed to make just one change was adding a video to our app listing. We hypothesised that adding a video would lead to a better install rate. To prove that, we needed to isolate the video as the only change in the B version. Adding anything else to the A/B test would muddy the results.
How to measure the effectiveness of app listing changes
As I mentioned at the start, most A/B tests run their A and B versions simultaneously, putting users in an A or B bucket in parallel. This exposes each version to similar external factors, like traffic source fluctuations or seasonality. In our case, we can’t do that; instead, we run the A version for a time and then the B version for a time. Not running both versions in parallel means we have to pay more attention to these external factors: things we are not testing, but which can influence our test. These are confounders.
I’ll cover potential confounders when looking to improve our install rate, app listing views or both.
Confounders when measuring install rate improvements
Install rate is the percentage of app listing views that lead to a user pushing the ‘add app’ button. You can set up this event in Google Analytics.
App reviews have a substantial effect on our install rate. Having some negative reviews or a shining positive one on the first page will affect our install rate. It can be a significant confounding factor in our A/B test.
If a few negative reviews come in when we start running the B version, we have introduced a confounder. Under the same conditions, our B version might have increased our install rate versus the A version, but with the new negative reviews, our measured install rate may now be worse.
When you choose a period to measure the B version against the A version, check that the reviews, especially those on the first page, are roughly the same.
Another confounder is the source of traffic to your app listing. Some sources have significantly different install rates than others. I bucketed all app store keyword searches for our brand name, like [plug in seo] and [plugin seo], and compared them to non-brand keywords like [seo] and [json-ld]. The install rate for brand name searches was over 10% higher than for non-brand.
Traffic sources with overperforming or underperforming install rates can be confounders if the volume of listing views from one of these sources is higher or lower in the B period than in the A period. When measuring, check that the number of listing views from these sources is roughly the same in the B versus A periods, or simply exclude those sources from your calculations.
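As an illustration of that check, here’s a minimal sketch in plain Python that buckets keyword traffic into brand and non-brand and reports views and install rate per bucket. The CSV file name, its columns and the brand keyword list are assumptions; adapt them to whatever your analytics export actually looks like.

```python
import csv
from collections import defaultdict

BRAND_TERMS = {"plug in seo", "plugin seo"}  # assumed brand keyword list


def bucket(keyword: str) -> str:
    """Classify an app store search keyword as brand or non-brand."""
    return "brand" if keyword.strip().lower() in BRAND_TERMS else "non-brand"


totals = defaultdict(lambda: {"views": 0, "installs": 0})

# Assumed export format: one row per keyword, columns keyword,views,installs
with open("keyword_traffic.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        b = bucket(row["keyword"])
        totals[b]["views"] += int(row["views"])
        totals[b]["installs"] += int(row["installs"])

for b, t in sorted(totals.items()):
    rate = t["installs"] / t["views"] if t["views"] else 0.0
    print(f"{b}: {t['views']} views, install rate {rate:.1%}")
```

Run it for the A period and the B period separately: if the brand share of views differs a lot between the two, either exclude those keywords from the comparison or factor that in when reading the result.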
Other sources with distinct install rates to look out for include an app store homepage feature, social media and email campaigns.
Confounders when measuring increases in app listing views
An app listing view is when a user lands on your app listing page.
Unfortunately, we can’t see how often users encounter our app in app store lists, so we can’t measure this part of the funnel with 100% accuracy. However, I’ve found that by following some simple rules, we can still say with confidence whether our B version performs better.
The first thing is to understand what we think our optimisation could affect. Changing our app icon won’t affect the search keyword ranking, but it might make our result more attractive to click on, affecting our clickthrough rate and increasing total listing views. Changing our app name or tagline can affect search keyword ranking and clickthrough rate. Changing the other textual elements of our listing (description, key benefits, integrations, pricing) can affect search keyword ranking (and of course install rate). Changing the other graphics (key benefits images, screenshots and the video) will not affect listing views. Collection and category ranking won’t be affected by any change you make to the app listing.
I like to keep it simple and focus just on the app store search surface when experimenting with increasing app listing views. Other surfaces and sources fluctuate a lot and are affected by things outside our control. There’s a lot to be gained from just looking at app store search ranking. The one caveat is not to go for a spammy app name or tagline. These can hit clickthrough rate hard.
Reviews are also a confounder here. Keep an eye on significant changes in these when choosing the period to measure your A and B versions.
A final confounding factor affecting both install rate and app listing views is when in the month and year you’re measuring. Shopify merchants may be more or less likely to install your app at certain times. If you have over a year’s worth of data, you will see a pattern of peaks and troughs. It was quiet for my apps in December and busier in January, with the rest of the year flat apart from Black Friday Cyber Monday (BFCM). Your profile may be different.
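One way to get a feel for your own profile before choosing A and B measurement windows is to roll installs up by month. A minimal sketch in plain Python, where the file name and column names are assumptions about your analytics export:

```python
import csv
from collections import Counter

monthly = Counter()

# Assumed export columns: date (ISO format, e.g. 2021-12-03) and installs
with open("installs.csv", newline="") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        month = row["date"][:7]  # "2021-12-03" -> "2021-12"
        monthly[month] += int(row["installs"])

for month in sorted(monthly):
    print(month, monthly[month])
# Look for recurring peaks and troughs (a quiet December, a BFCM spike) and
# avoid comparing an A period in a trough against a B period on a peak.
```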
Measuring and determining a winner
Once you’ve worked out what to measure, update your app listing and publish it to create your B version. Go off and do something else and let that listing do its work. You’ll want to make sure you have enough impressions and views to make your result statistically significant, and that can take time.
A straightforward calculator for measuring the statistical significance of an install rate change is Neil Patel’s. There are more complicated solutions, and you can play with all kinds of statistical functions, but I found that his calculator does the job well.
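If you’d rather compute it yourself, here’s a minimal sketch in plain Python of a standard two-proportion z-test, one common way to check significance for this kind of before-and-after comparison. The view and install counts are made-up illustrations, not real data.

```python
from math import sqrt
from statistics import NormalDist


def install_rate_significance(views_a, installs_a, views_b, installs_b):
    """Two-sided two-proportion z-test for a change in install rate."""
    rate_a = installs_a / views_a
    rate_b = installs_b / views_b
    pooled = (installs_a + installs_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, p_value


# Made-up example: 4,000 views / 600 installs in the A period,
# 4,200 views / 700 installs in the B period.
rate_a, rate_b, p = install_rate_significance(4000, 600, 4200, 700)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p-value: {p:.3f}")
# A p-value below 0.05 is the usual (conventional, if arbitrary) threshold
# for calling the change statistically significant.
```

It’s the same style of test as the cereal example earlier, just comparing two observed rates against each other rather than one rate against a fixed baseline.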
The result of your A/B test can be that we’re confident B performs better than A, that A performs better than B, or that we don’t know. If we don’t know, sometimes we can wait and gather more data until the result becomes statistically significant. Sometimes we never get there. It’s a fact that the changes we’ve lovingly crafted just might not change user behaviour.
It’s then your call about what to do. Because my app listings were popular, I would usually wait up to a month for a result, giving it more than enough time to reach statistical significance. If it didn’t get to significance by then, I’d make a subjective call. If it were something like a rebranding of screenshots, my call would be to leave the B version in place. The change played into a more extensive product branding, and it’s easy to see how that wouldn’t improve the install rate. If it were something like changing a key benefit, I’d probably go back to the A version and try something else since it achieved nothing.
If your test aims to improve app listing views, you’re dealing with absolute numbers. We can’t see how many impressions our app gets, so I give these changes a fixed period to run, often two weeks, and then dig into how keyword rankings and views changed. As a sanity check, I look at the install rate here too: it may fall if the listing starts to appear for less relevant keywords.
Once you’ve run a couple of these app listing A/B tests, you’ll start to see profitable compound growth of your app listing views and improvement in your install rate as your app appears in more relevant places, and its content speaks more deeply to your users.