A/B testing: mindset, lifecycle & tracking plan

Killian Saint cricq · Published in Studocu Tech · 14 min read · Dec 8, 2020

At StuDocu, being data-driven is part of our mindset, and we do not leave changes to chance. To offer our users the best experience, we would rather dive deep into the data and make conscious decisions. A/B testing, which has been part of the company culture since our very early days, comes in really handy.

Conceptually, A/B testing is a method to compare two or more variants of something (for instance an email, or an element on a website or an app) and determine which performs best.

Supported upstream by user research, A/B testing helps corroborate or invalidate hypotheses about how to optimise parts of your website.

All of our UX/UI changes are therefore implemented as experiments that we monitor with a set of tools and processes before making a decision.

In this series of two articles, we will try to cover every topic we find important to consider before A/B testing: from the A/B test lifecycle to our workflow and technical implementation, along with some hypothetical and real-life examples.

Terminology

The tools we use very much shape the way we work with A/B testing, as well as the terminology we employ. To make sure you understand the next sections, here is a brief summary.

Cards & Boards

Our project management tool is Trello. Our cards are written and organised across multiple boards. Cards can be design or technical specifications, as well as A/B test summaries; boards can be about product areas, as well as specific concepts such as design, A/B testing, and development.

Events & Properties

Our emailing and analytics tools are respectively Customer.io and Mixpanel. By sending them events with properties when a user takes certain actions, the former can trigger automatic email campaigns, while the latter lets us measure how these actions actually perform. In practice, we send events to Segment, which in turn dispatches them to both of these services.

Events are used to succinctly describe actions. Event properties are used to detail actions.

Imagine reading one of my other tech articles and following me or clapping for it. If we were part of the Medium team, we would probably want to track these features by sending specific events:

Example of a Medium tracking model

Remember: events represent user-based actions. You should therefore read “Author Followed” as “a user followed an author”, and “Article Clapped” as “a user clapped for an article”. The first event would allow us to trigger an automatic campaign and email you something like “Congrats, you’ve just followed Killian!”. But most importantly, extending these events with a few more properties would allow us to properly A/B test them — hang on, we will expand on this later in this article.

The author section of an article, as of October 2020.
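To make this more tangible, here is a minimal sketch of what sending such an event could look like with Segment’s analytics.js snippet on the front end. The click handler and the property are purely illustrative; only the event name comes from the model above.

```typescript
// Minimal typing for the analytics.js snippet that Segment injects on the page.
declare const analytics: {
  track(event: string, properties?: Record<string, unknown>): void;
};

// Hypothetical click handler: the property shown here is illustrative,
// only the event name comes from the tracking model above.
function onFollowButtonClick(authorName: string): void {
  analytics.track('Author Followed', {
    'Author Name': authorName, // which author was followed
  });
}
```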

A/B test lifecycle

  1. Track meaningful events and properties
  2. Pinpoint sub-performing features & funnels to improve
  3. Think of improvements & set goals and hypotheses (UX research, design)
  4. Implement & run the A/B test
  5. Check the A/B test data regularly
  6. Analyse the results once significant & revise the hypotheses
  7. (optional) Iterate with a new A/B test

Track meaningful events and properties

A/B testing starts with tracking, and tracking starts with setting standards. Agreeing on naming conventions with the team makes it easier to implement new events and properties — or to find what you need. The key is to be consistent. At StuDocu, we try to stick to the following rules:

  1. Event names should start with the subject of the action and end with the past participle while being short and self-explanatory.
  2. Properties about the same concept should be grouped.
  3. Properties describing the same type of data should have the same structure.

Keeping the above rules in mind, we can extend our previous tracking model with more events and properties:

Example of a Medium tracking model

Rule #2 is mainly about how properties are prefixed:

  • Every property describing an author starts with “Author” — and the same goes for an article and the currently logged-in user.
  • In our model, claps could be called a second-level concept, as they always belong to a parent concept (here an article, an author, or a user). Therefore, the word “Claps” comes second in the described properties.

Rule #3 is mainly about how properties are suffixed, and more generally how they are named:

  • Rather than, for instance, “Time” and “Number”, properties describing dates and numbers respectively end with “Date” and “Count”.
  • When describing properties about claps, the model is consistent in the sense that we chose to have “[Concept] Claps Received Count” and “[Concept] Claps Given” rather than “[Concept] Received Claps Count” and “[Concept] Given Claps”: the word order always stays the same.

Remember: there is no correct answer per se when coming up with a tracking model — you are, yourself, defining the correct answers for the future.
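To illustrate how the three rules play out together, here is a hypothetical slice of such a tracking model expressed as a TypeScript type. The exact properties are illustrative, not our real model.

```typescript
// Rule #1: subject first, past participle last ("a user clapped for an article").
// Rules #2 and #3: shared prefix per concept, consistent suffix per data type.
type ArticleClappedProperties = {
  'Article Title': string;
  'Article Claps Received Count': number; // counts end with "Count" (rule #3)
  'Article Published Date': string;       // dates end with "Date" (rule #3)
  'Author Name': string;
  'Author Claps Received Count': number;  // same word order as the article property
};

// Hypothetical helper signature, only here to show how the names fit together.
declare function track(event: 'Article Clapped', properties: ArticleClappedProperties): void;
```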

Pinpoint sub-performing features

Did you notice the undocumented “Article Viewed” event in the table above? Implementing such events comes in really handy when you want to look at conversion rates. For instance, what percentage of users who see an article end up clapping for it, or following its author?

Example of what the Article Viewed to Author Followed funnel would look like on Mixpanel, as of October 2020.

As shown in the above example using fake data, it could be that, over the past 30 days, about 15% of users ended up following an author after viewing an article within the same session. In reality, you would be able to compare these figures with other funnels, features, and metrics, and know or assume which funnels, features and metrics could be improved upon. Bear in mind that an A/B test idea can also come from your gut feelings, and that is totally okay. Once implemented, the A/B test will validate your assumptions (or not), and you will always learn from the results.

From now on, we are going to consider that following authors is really important for the platform. Therefore, we want to increase engagement with that specific feature. As you may know, it is also possible to follow authors through their profile. But because articles have a lot more traffic than profiles, we want to improve the feature only via the former. Keeping that in mind, we need to revise our tracking model to be able to know where the action was performed:

By filtering out any “Author Followed” event whose new “Follow Origin” property is set to “Profile”, we will be able to focus on the usage of the feature on articles only when looking into the data.
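As a sketch, the revised event could be sent like this, with the new property attached from both surfaces. Only the event name and the “Follow Origin” property come from the model; the rest is illustrative.

```typescript
// Minimal typing for the analytics.js snippet that Segment injects on the page.
declare const analytics: {
  track(event: string, properties?: Record<string, unknown>): void;
};

// The same event is fired from both surfaces; "Follow Origin" records where
// the click happened so that profile follows can be filtered out later.
function trackAuthorFollowed(origin: 'Article' | 'Profile'): void {
  analytics.track('Author Followed', {
    'Follow Origin': origin,
  });
}
```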

Think of improvements

Improvements can vary from the simplest changes, such as adjusting the copy or updating the design, to the most complex changes, such as rethinking the feature. Inspiration for improvements can come from taking a look at other websites with the same feature, for instance, or from gathering insights from your team and from user testing sessions — more on that in the next article of the series.

How many variants should you implement?

As we mentioned in the introduction, A/B testing allows you to compare several variants of the same page. The maximum number of variants you can implement depends on the traffic on your website and the engagement with the feature you want to A/B test. The less a feature is used, or the more variants you implement, the longer you will need to wait to see statistically significant effects: in other words, to be confident that the effects you are seeing are not due to chance or randomness.

This probabilistic limit is also the reason why you should try not to run too many A/B tests on the same page, or at least on the same sections of the same page. You can use some mathematical tools to overcome these issues, but they introduce problems of their own. Since everybody would have to keep the different combinations of variants in mind, A/B testing on top of existing A/B tests also introduces the following issues:

  • Designing and implementing a test is complicated. Its variants can become intertwined with other variants of other tests, which multiplies the amount of work.
  • Your product is harder to know and understand. You cannot expect everybody to know every part of your product in detail, but having combinations of variants and tests influencing each other on the same page certainly does not help.
  • It is prone to errors and bugs. Checking every combination of every variant is tedious, and a small update of a piece of code might break one of them at any time.
  • Feature-thinking, designing, implementing, and quality assurance: since everybody in the team needs to keep the test in mind, your processes are slowed down from beginning to end.

For these reasons, at StuDocu, we try to stick to a maximum of two variants per A/B test, on top of not having overlapping tests. Most of the time, we implement only one variant.

In our example case, we could try to put more emphasis on the call to action (CTA) to follow an article’s author. We are going to turn it green, the color used for the “Publish” CTA when you write an article:

A variant of the author section of an article, as of October 2020.

Users will be assigned randomly and will keep seeing one or the other variant.
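One common way to make that assignment both random and sticky is to derive the variant from a stable user identifier. The sketch below is a generic illustration under that assumption, not necessarily how we do it at StuDocu:

```typescript
type Variant = 'A' | 'B';

// Deterministic assignment: the same user always lands in the same bucket
// for a given test, so the experience stays consistent across sessions.
function assignVariant(userId: number, testName: string): Variant {
  const input = `${testName}:${userId}`;
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0; // simple unsigned 32-bit hash
  }
  return hash % 2 === 0 ? 'A' : 'B';
}
```

Salting the hash with the test name means different tests do not all end up with the exact same split of users.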

Implement & run the A/B test

When a test is running, you need a way to see which variant is performing the best. Let’s call our A/B test “Test Article Follow Button”, our variants “Light” and “Green”, and update our tracking model:

By tracking the test variant on “Article Viewed”, we will be able to break down the funnel we saw in the Pinpoint sub-performing features section for each of them — more on that below.
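Concretely, that could look like the sketch below. The property key is an assumption (we simply reuse the test name here), while the event and variant names come from the model above.

```typescript
// Minimal typing for the analytics.js snippet that Segment injects on the page.
declare const analytics: {
  track(event: string, properties?: Record<string, unknown>): void;
};

// The property key is an assumption: here we simply reuse the test name,
// and send the variant the user was assigned to along with the page view.
function trackArticleViewed(assignedVariant: 'Light' | 'Green'): void {
  analytics.track('Article Viewed', {
    'Test Article Follow Button': assignedVariant, // lets us break funnels down per variant
  });
}
```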

Nothing much more to explain here. Let your amazing development team do its magic!

Check A/B test data regularly

Regularly checking how the A/B test behaves is key. You should check not only the data you collect, at least once a week, but also whether the feature works as intended. A bug introduced in one of the variants at any time might significantly impact the results of your A/B test, and you want to avoid that at all costs.

Initial check

Once the A/B test is live, you will want to check your data after just a few hours. It is important to keep the following question in mind: are the variants diverging so much that something might be wrong? It might turn out that nothing is wrong, but when the variants do diverge strongly, double-checking the implementation of the feature and of the tracking is wise and can save you some precious time later on.

Example of what the Article Viewed to Author Followed funnel would look like on Mixpanel, broken down by our test, as of October 2020.

Looking at the data just one day after the feature was released, we can already see a few noteworthy data points:

  • The control variant — the initial, original variant of the feature — behaves about the same as in the funnel we saw in the Pinpoint sub-performing features section (~15.5% conversion).
  • The new variant is performing better, with an increase of about 12% (~17.4% conversion).

However, as highlighted earlier, users might perform the “Author Followed” action on the author page rather than the article page, where the change was actually introduced. In order to exclude them, the funnel setup would need to be slightly updated:

Example of a setup for the Article Viewed to Author Followed funnel on Mixpanel, as of October 2020.

This way, a bigger difference would appear between the two test groups, as the funnel is narrowed down to users who followed an author using the button whose style was updated. In any case, given the figures above, we can be pretty confident that nothing is broken and that the A/B test is working just fine.
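For illustration only, the narrowed step boils down to a filter like the following. This mirrors the logic of the funnel setup, not how Mixpanel actually computes funnels.

```typescript
interface TrackedEvent {
  name: string;
  properties: Record<string, unknown>;
}

// Only "Author Followed" events that originated from an article are counted,
// which is exactly what the updated funnel step filters on.
function countArticleFollows(events: TrackedEvent[]): number {
  return events.filter(
    (event) =>
      event.name === 'Author Followed' &&
      event.properties['Follow Origin'] === 'Article',
  ).length;
}
```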

The initial check is also a good moment to check your other important Key Performance Indicators (KPIs). It can be that the A/B test you just introduced drastically decreases the engagement with some of your critical features, in which case you might want to turn it off. In the case of Medium, important KPIs could be actually reading an article or purchasing a subscription.

Finally, you should think of the negative impacts your A/B test might have. In our example, turning a button green might reduce the conversion to other actions, such as saving or clapping the article, as it draws more attention.

In short, the outcome of the initial check should be a set of funnels that will help you monitor the A/B test.

The set should at least include:

  • One or more funnels about the A/B test itself
  • Funnels about the main KPIs you want to keep an eye on
  • Funnels about features that could be negatively impacted by the A/B test

Note that we are keeping our example simple for the sake of the article. In reality, we could check many other funnels with very different conversion criteria. Here are a few examples:

  • Are users reading more articles? Are these especially from authors they follow? Instead of looking at conversions within 1 session, we would probably want to look at a month’s worth of data.
  • As following authors is a feature for logged-in users only, are users logging in and registering more?
  • Are our recommended articles getting more claps? If the algorithm is partly based on the authors a user follows, it could lead to better results — or not.

Routine checks

Depending on the criticality of the A/B test, taking a look at the data on a weekly basis is a good start.

The routine checks are a good moment to check the same things as during the initial check. I want to stress that you want to avoid bugs impacting your A/B tests at all costs, as they would compromise your results and force the tests to run longer. If a bug makes it to production, it should be live for as short a time as possible. Of course, a robust quality assurance process prevents most bugs. Otherwise, a monitoring dashboard of your crucial funnels should theoretically help you unveil them. But neither of these strategies is flawless, and routinely checking can be really helpful to discover complex or hidden bugs.

By looking at the A/B test results often, you might realise that your funnels need further refinement. For instance, while writing this article I thought of many more ways to narrow down our example funnel. To keep the article (kind of…) short, I did not implement all of them, but here is one: we should probably exclude users who already follow the author from the first step, by tracking a new “User Is Following Author” boolean property, since they cannot follow them again.

The other important metric to look at is the statistical significance that we mentioned earlier in the Think of improvements section. If you use Mixpanel, it computes and shows it in the “Sig” column. Otherwise, you can use the Effin A/B Test Calculator Chrome plugin or any other convenient tool that is compatible with your browser; a rough sketch of the underlying test is given after the list below.

Your A/B test is ready for the next step when the following two requirements are met:

  • The statistical significance of the set of funnels you picked is high enough — say 0.95, for instance, which is the threshold we use at StuDocu.
  • Enough people got to see both variants — depending on the traffic on your website and on the part you are testing.
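For the curious, significance for a conversion funnel like ours is typically assessed with a two-proportion z-test. The sketch below is a generic illustration of that test, not the exact computation Mixpanel or the plugin performs.

```typescript
interface VariantResult {
  users: number;       // users who entered the funnel
  conversions: number; // users who completed it
}

// Two-sided two-proportion z-test; |z| > 1.96 corresponds to ~95% confidence.
function isSignificantAt95(control: VariantResult, variant: VariantResult): boolean {
  const p1 = control.conversions / control.users;
  const p2 = variant.conversions / variant.users;
  // Pooled conversion rate under the null hypothesis of no difference.
  const pooled =
    (control.conversions + variant.conversions) / (control.users + variant.users);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.users + 1 / variant.users),
  );
  const z = (p2 - p1) / standardError;
  return Math.abs(z) > 1.96;
}
```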

Analyse the results & revise the hypotheses

You are now several months past the conception of your A/B test idea. First and foremost: congrats! The length of this article is a metaphor for how long the journey of A/B testing is :-)

Before analysing the results, it is really important to accept and keep the following in mind:

  • It is ok if the A/B test did not give positive results.
  • It is ok to doubt and struggle with your hypotheses.

When implementing an A/B test, you are very likely to believe in the new variant, as your intent was to improve something. It is natural to want to find positive results, and it is difficult to look at them without bias. On the other hand, it is not easy to give up on so much work and reach the conclusion that everything your team built actually needs to be removed.

But A/B testing is about experimenting; it is about learning the way users behave on your platform. The results do not have to be positive: the very fact that they are not is a learning moment in itself. Spend your energy trying to understand why you are witnessing such results.

If you cannot figure it out, remember that teamwork is key — even when analysing results! Talk to your colleagues who are involved in the product, gather opinions, discuss assumptions. You might even end up extending the set of funnels you instantiated at the initial check, and discover new angles to look at your A/B test.

Sometimes, there are no right answers, but only a collection of hypotheses and correlations that will lead you to think one or another way. If you feel completely lost, you can always refer back to the user testing reports or initiate new sessions to put together some qualitative findings. Sleep on it, then get back to analysing and discussing the results with your team again.

Example of results of the A/B test about following authors on Medium, screenshot from the Effin Amazing Tool.

After a month, the number of users in the control and the new variant is 421,089 and 421,188 respectively, of which 65,314 and 75,764 respectively converted to following the article’s author using the button on the article itself. The new variant performed almost 16% better. This could very well be because the button is much more visible in the new variant.
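For completeness, here is the arithmetic behind those figures, using the example counts quoted above:

```typescript
const control = { users: 421_089, conversions: 65_314 };
const green = { users: 421_188, conversions: 75_764 };

const controlRate = control.conversions / control.users; // ≈ 0.155, i.e. ~15.5%
const greenRate = green.conversions / green.users;        // ≈ 0.180, i.e. ~18.0%
const relativeLift = greenRate / controlRate - 1;         // ≈ 0.16, i.e. almost 16% better
```

Feeding the same counts into the z-test sketch from the previous section would yield a z-score far beyond 1.96, consistent with the difference being statistically significant.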

Your set of funnels also includes some of your main KPIs and features that could have been negatively impacted by your A/B test. These must be checked too, and the outcome of your A/B test will come down to a trade-off between negatively and positively impacted funnels. Is the decrease in users performing this action worth the increase in users performing that action?

Iterate with a new A/B test

An A/B test does not necessarily need a follow-up; however, whether it was successful or not, it can be followed up on. If the results were successful, you might want to promote the feature even more, or A/B test other parts of the website that share the same feature. If the results were not successful, you might want to try again with some new variants. Alternatively, you might think that it is not worth spending more time A/B testing this feature and that you want to focus on something else entirely.

Assuming we are going forward with our brand new follow button, we might want to iterate and test it on the profile page:

The header of an author’s profile, as of November 2020.

We would have to look at other funnels, such as “Profile Viewed” to “Author Followed” with the origin set to “Profile”; we would have to make sure not to decrease conversion to logging in or registering from this same page; etc. Basically, we would have to start the whole process again!

Real-life example

The follow button example was not picked at random. In fact, at StuDocu, we tested a similar component on our website.

In a nutshell, StuDocu is a platform where students can share and access their study documents, which are notably organised in courses (you might call them “modules”, depending on where you come from). It is possible for students to follow courses so they can easily access them on their dashboard and be notified when new documents are uploaded.

The copy of the button was “Follow”, which was rather vague and not very explicit — this was brought up multiple times by people in user testing sessions. Therefore, we decided to implement a very simple test, where we replaced “Follow” with “Add to My Courses”. On the dashboard, we labeled their courses “My Courses” rather than “Following”. With such a simple A/B test, we managed to increase the feature usage by more than 10%.

To be continued…

In this article, we have set the basis for a strong A/B testing mindset. In the next article, we will go over two equally important matters: our A/B testing workflow and our very own technical implementation. As you may know, a solid workflow and a robust yet flexible implementation are indispensable for doing A/B testing at scale.

Follow me — and increase the conversion from “Article Viewed” to “Author Followed” ;-) — to be notified when the next article is published!
