LukeW | Granular Bucket Testing

A recent discussion on the Interaction Design list about the utility of product design research got me thinking about how the bucket testing of Web sites has seemed to change over time.

Bucket testing, otherwise known as A/B testing, is a methodology for gauging the impact of different product designs on a Web site’s metrics. For those unfamiliar, the basic premise is to run two versions of a single Web page or a set of Web pages simultaneously in order to measure the difference in clicks, traffic, transactions, and more between the two.
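To make "measure the difference" concrete, here is a minimal sketch in Python of comparing one metric between two buckets. It assumes the metric is a simple conversion count per bucket and uses a two-proportion z-test, which is one common choice and not something the tools discussed here necessarily use. The function name `compare_conversion` and its parameters are hypothetical.

```python
import math


def compare_conversion(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    """Compare conversion rates of bucket A (control) and bucket B (variant).

    Hypothetical helper: conv_* are conversion counts, n_* are visitor counts.
    Uses a pooled two-proportion z-test (an assumption, not the only option).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b

    # Pooled rate under the assumption that both buckets behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    # Guard against a zero standard error (e.g. no conversions at all).
    z = (rate_b - rate_a) / std_err if std_err > 0 else 0.0

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "lift": rate_b - rate_a,
        "z": z,
        "p_value": p_value,
    }
```

Calling `compare_conversion(100, 1000, 150, 1000)` would report the difference in conversion rate between the two versions along with a rough indication of whether that difference is larger than chance alone would explain.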

I first began using bucket testing when designing Web sites with an already established and often quite profitable user base. Because we were experimenting with very different interactions or visual presentations, we needed a way to see the impact of our changes without migrating all our existing users to a new design. Bucket testing provided a great way to send a small amount of traffic (usually less than 5%) to a different user interface without negatively impacting the bottom line if our new design had unintended negative consequences.
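As a rough illustration of sending a small slice of traffic to a new design, the sketch below assigns each user to a bucket by hashing a user identifier. This is an assumed approach for illustration, not a description of the system we actually used; `assign_bucket`, the experiment name, and the 5% default are all hypothetical.

```python
import hashlib


def assign_bucket(user_id: str, experiment: str, variant_share: float = 0.05) -> str:
    """Deterministically map a user to 'control' or 'variant'.

    Hypothetical helper: hashing the experiment name together with the user
    ID keeps a given user in the same bucket on every visit, and keeps
    separate experiments from splitting users identically.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()

    # Take the first 8 hex digits as an integer in [0, 2**32) and scale to [0, 1).
    position = int(digest[:8], 16) / 2**32

    # Users whose position falls below the share see the new design.
    return "variant" if position < variant_share else "control"
```

Because the assignment depends only on the identifier, most users keep seeing the existing design, and the small group that sees the new one sees it consistently.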

In this context, the point of bucket testing was to confirm we were making the right decisions when we made big changes. Since then, the technology for running bucket tests has grown, and as a result, it is easier than ever before to pit two iterations of a design against each other. This has led to bucket testing not only of pages but also of individual features, UI elements, and even details such as the text color of a set of words.

At a BAYCHI panel in 2005, Marissa Mayer discussed Google’s user experience design process: “User Interface decisions follow a scientific process that reduces the role of opinions. Products are usability tested and live tested to verify the validity of design options and even single variable testing (like black text vs. red text) occurs.” - User Experience: the Google Way

The problem with this type of nuanced bucket testing is that it isolates individual design elements from the rest of a product design, and any designer will tell you it is the sum of the parts that makes up the whole. A cohesive integration and layout of all the elements within an interface design is what enables effective communication with end users.

Testing individual elements like font colors and incremental feature variations in bucket tests is unlikely to drive changes that really make a significant impact on the bottom line. Small changes most often only enable small opportunities.

Highly granular bucket testing also has the potential to damage the integrity (and thereby effectiveness) of a page or set of pages because it only evaluates individual elements. The best performing versions of these elements are then (frequently crudely) stitched together into an “optimized” design. This of course opens up the possibility of Frankenstein design.

In my experience, the value of bucket testing comes from understanding the impact of significant changes on an existing product and its users. Excessive testing of minor variations in an interface design has the potential to undermine that value through isolated evaluations of interface elements and the assumption that these “top performers” can simply be pasted together to create an optimal design.