Universal Search: A Retrospective

When Test Pilot was launched in May, Universal Search was among the first cohort of experiments. It is ambitious in scope: an attempt to both gain a better understanding of how users interact with the address bar, and to make content recommendations for in-flight searches while still respecting users’ privacy. It has been one of the more popular experiments: it was the first to reach 100,000 downloads, has the lowest uninstall rate, has sustained over 20,000 daily users over the last month, and was the subject of a beautiful laptop sticker:

Universal Search laptop sticker
Credit: Bryan Bell

On November 30th, Universal Search will break further ground as the first Test Pilot experiment to be retired. Since it depends on a server with substantial maintenance costs, we will be pushing a self-uninstalling update to the extension and decommissioning the server on that date.

This essay looks back on the project, providing an overview of how it worked, what we learned, and where future study might focus.

How it worked

Like all Test Pilot experiments, Universal Search was an extension: a Firefox add-on that modified the behavior of the browser. When installed, it sent user keystrokes in the address bar to a recommendation server. That server guessed what the user was trying to type and tried to provide a content recommendation. A user seeking their social media fix might type f and be recommended Facebook.

Awesome Bar search for 'space'

In some cases, the server went a step further and identified the nature of the content being recommended, allowing it to enhance the recommendation with metadata specific to that type. A user looking for Michael Jordan’s cinematic opus might have been recommended SpaceX’s website when typing space, but should have been presented with the IMDb page for Space Jam (augmented with release date, plot summary, and rating) when extending the query to space j.

Awesome Bar search for 'space-j'

What we learned

At the outset, we were looking to answer a few broad questions:

A focus on privacy

This isn’t a completely novel concept; other browsers have previously offered features like this. But often this means sending sensitive information about your browsing habits to untrusted third parties like advertising networks. We wanted to do this in a Mozilla sort of way that respected your privacy while still trying to help you get to your information more quickly.

Our data collection and privacy policies were clear and concise, and we made efforts to scrub potentially identifying data. Since the server’s code is open source, you can verify that we are only collecting what we claim to. All requests to third parties were proxied by our recommendation server, so no data providers had insight into your browsing habits.

The second hard problem

As the joke goes, there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

The Universal Search name was an homage to a Google project that added context to search results: prior art for a complex problem. It seemed to the team members that the name struck the perfect balance of suitable and whimsical.

Upon installation, the extension removed the separate search box for which Firefox is somewhat notorious. This was done to preserve the integrity of our experiment; if users didn’t break their habits and start using the address bar to search, our measurements of user behavior would have been unreliable.

This was the most obvious UI change that Universal Search made. The extension also appeared to unify the address bar and search box, which sounds a lot like the word universal. Not only was removing the search box an unpopular design decision, but exit survey feedback showed that users commonly thought that change was the experiment itself. Better naming (perhaps something akin to “Smart Search”) might have reduced experiment attrition and provided better results.

Experimenting with XUL

Our original approach to the extension involved replacing the entire address bar dropdown with an <iframe> element, using a pubsub broker to communicate with the browser. It was an interesting approach, with many benefits:

That all sounds great, but there was one major problem: introducing a vastly different experience would make it challenging to identify the source of any behavioral changes. They might have been attributable to the new design, to differences in responsiveness, or to any of a number of new features we considered.

Ultimately we abandoned that approach, opting instead for a simpler XUL-based one that inserted a single recommendation above the existing results. Though we lost some of the benefits of the <iframe> and our extension was likely to break with any upstream changes to the address bar, we were much better able to test our core research question: do users engage with browser-provided recommendations?

Injecting the recommendation with XUL proved to be challenging, but we now have a much better understanding of what is required to experiment with search results, and have made concrete plans to make the address bar more extensible for future experimentation. Through Embedded WebExtensions and WebExtension Experiments, these capabilities will also be available to WebExtensions when they land.

Recommendation quality

A common thread among feedback was concern over the quality of the recommendations; they sometimes felt like ads and often looked biased. Since we relied on search engines for weighting by relevancy, we were heavily influenced by website SEO. A search for linux resulted in a link to Linux Mint, which scores highly in Bing’s autocomplete engine. This could have been avoided by removing the long tail of domains that exist on the internet, instead only recommending a subset of higher-relevancy ones.
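The long-tail filtering suggested above can be sketched as a simple allowlist check. This is only an illustration under stated assumptions: the allowlist contents, the function names, and the candidate URLs are all hypothetical (reserved example domains), and nothing here reflects how the recommendation server was actually written.

```python
from urllib.parse import urlsplit


def registered_host(url: str) -> str:
    """Lower-cased hostname with a leading "www." removed (a simplification)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def keep_known_domains(candidate_urls, allowed_domains):
    """Keep only candidates whose host appears in the curated allowlist."""
    return [url for url in candidate_urls if registered_host(url) in allowed_domains]


# Hypothetical allowlist and candidates, using reserved example domains.
ALLOWED = {"popular.example", "reference.example"}
candidates = [
    "https://www.popular.example/",
    "https://long-tail.example/landing",
    "https://reference.example/wiki/Linux",
]
kept = keep_known_domains(candidates, ALLOWED)
print(kept)
```

How such an allowlist would be curated (for example, from a public ranking of popular sites) is left open here; the sketch only shows where the filter would sit between the data sources and the recommendation.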

The recommendation server also suffered from a different sort of bias: ethnocentricity. Search engines use information about the users to try to provide more relevant results, customizing them to what they guess are your age, gender, interests, location, and language. Yahoo’s BOSS API, another of the data sources, allows consumers like Universal Search to specify region and language explicitly. We should have taken advantage of this, passing along the user’s information. Instead, Yahoo inferred the region from where the recommendation server was querying (the United States), then chose the default language for that region (English). This resulted in poor recommendation quality for users outside the United States and for those who speak languages other than English.

For example, acceptance criteria of the first version of the recommendation server included the search query f recommending Facebook. Similarly, users in Russia might expect v to take them to VKontakte, a popular social networking site in the country. Instead, it recommended Verizon Wireless, a company that doesn’t operate in Russia.
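One plausible way to pass the user’s region and language along is to derive them from the browser’s Accept-Language header. The sketch below assumes that header is available to the server; the function names and the "language" and "region" parameter keys are hypothetical and are not the BOSS API’s actual parameter names.

```python
from typing import Dict, List, Optional, Tuple


def preferred_locales(accept_language: str) -> List[Tuple[str, Optional[str]]]:
    """Return (language, region) pairs ordered by descending q-value."""
    weighted = []
    for position, item in enumerate(accept_language.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag or tag == "*":
            continue
        quality = 1.0  # A tag without an explicit q-value defaults to 1.
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        subtags = tag.split("-")
        language = subtags[0].lower()
        # Simplification: treat the first two-letter subtag as the region.
        region = next(
            (s.upper() for s in subtags[1:] if len(s) == 2 and s.isalpha()), None
        )
        weighted.append((-quality, position, language, region))
    weighted.sort(key=lambda entry: entry[:2])
    return [(language, region) for _, _, language, region in weighted]


def search_params(query: str, accept_language: str) -> Dict[str, str]:
    """Build request parameters; the key names are hypothetical placeholders."""
    params = {"q": query}
    locales = preferred_locales(accept_language)
    if locales:
        language, region = locales[0]
        params["language"] = language
        if region:
            params["region"] = region
    return params


print(search_params("v", "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7"))
```

With the user’s top preference forwarded this way, a query for v from a Russian-language browser would reach the data source tagged as Russian rather than inheriting the server’s location.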

Power user bias

Test Pilot’s audience has not been well studied, but one might reason that it is more technical and more engaged with Mozilla than a general-population audience: participation in Test Pilot is voluntary, and users are recruited through a variety of means that may be more likely to reach technical users, including e-mail newsletters, the Test Pilot Discourse forum, about:home snippets, and promotion in public meetings.

Table 1: Most common queries from 2016/05/03 to 2016/10/02.

Query         n        Proportion
windows 10    3,171    0.0307%
what is       3,109    0.0301%
is            3,081    0.0298%
android       3,060    0.0296%
firefox       3,039    0.0294%
linux         3,018    0.0292%
python        2,993    0.0290%
free          2,991    0.0290%
download      2,942    0.0285%
why           2,940    0.0285%
windows       2,934    0.0284%
ubuntu        2,927    0.0283%
java          2,912    0.0282%
get           2,896    0.0280%
apple         2,842    0.0275%
mac           2,839    0.0275%
minecraft     2,820    0.0273%
can           2,815    0.0273%
microsoft     2,776    0.0269%
open          2,744    0.0266%

  • Total n = 10,329,662.
  • Queries for 2016/09/20 and 2016/09/21 were lost in an outage.
  • Excludes queries prospectively containing personally-identifiable information, per data collection policies.

As Table 1 shows, the most common queries support that notion, with common searches including programming languages and Linux distributions. This reflects a clear bias in our audience that was not accounted for in collection or analysis.

Perhaps the most important finding differentiates between two possible uses of search in the address bar: navigation, where the user knows their ultimate destination before they begin typing, and discovery, where the user isn’t sure where to find what they’re looking for. Though Universal Search was explicitly attempting to improve discovery, we were surprised that the dominant use case appeared to be navigational.

Table 2: Effect of deduplication on recommendation selection

              Results deduplicated
Selected      Yes          No
Yes           52,632       128,845
No            765,586      1,258,569
CTR           6.433%       9.287%

Note: significant at p < .01

We stumbled upon this accidentally while adding a somewhat obvious UX feature. We didn’t want the Universal Search recommendation to duplicate information, so we omitted it if its URL matched one that already existed in the results (e.g. if it was an already-existing tab, bookmark, or history item). We expected the introduction of this feature to improve engagement, but it actually reduced the clickthrough rate (CTR; see Table 2), indicating that users wanted to revisit past pages, and that the recommendation was much less useful when those pages were excluded.
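The figures in Table 2 can be reproduced from the raw counts. The essay does not say which significance test was used, so the Pearson chi-square test below is an assumption; the function names are illustrative only.

```python
import math


def ctr(selected: int, not_selected: int) -> float:
    """Clickthrough rate: selections divided by total impressions."""
    return selected / (selected + not_selected)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts copied from Table 2 (columns: deduplicated, not deduplicated).
dedup_ctr = ctr(52_632, 765_586)
no_dedup_ctr = ctr(128_845, 1_258_569)
chi2, p_value = chi_square_2x2(52_632, 128_845, 765_586, 1_258_569)

print(f"CTR with deduplication:    {dedup_ctr:.3%}")
print(f"CTR without deduplication: {no_dedup_ctr:.3%}")
print(f"chi-square = {chi2:.1f}, p < .01: {p_value < 0.01}")
```

Under that assumption, the test statistic is very large, which is consistent with the note under the table.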

Further support can be found by examining clickthrough rates of the types of recommendations Universal Search offered. There were three distinct types: TLDs, where an entire website was recommended; Wikipedia articles, where a specific Wikipedia article was recommended; and movies, where the IMDb page for a movie or television show was presented.

Table 3: Effect of recommendation type on recommendation selection

              Recommendation type
Selected      TLD          Wikipedia    Movie
Yes           20,068       2,165        148
No            237,407      58,020       3,031
CTR           7.794%       3.597%       4.656%

  • Significant at p < .01.
  • Only includes queries after 2016/10/03, when movie cards were introduced.

More generally, it stands to reason that TLD results would be a more common endpoint for navigational searches, while Wikipedia and movie results would more likely be the result of a discovery search. The data supported this (see Table 3): clickthrough rates were significantly higher for TLD results than for either movies or Wikipedia articles. This implies that the address bar is commonly used as an interface to frecent results, and that the most effective recommendations were the ones that supported that use.
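With three recommendation types, the comparison in Table 3 becomes a 2 x 3 contingency table. As a hedged sketch (again assuming a Pearson chi-square test, which the essay does not confirm), the statistic can be computed from expected counts:

```python
import math


def chi_square_2xk(selected, not_selected):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    col_totals = [s + n for s, n in zip(selected, not_selected)]
    total = sum(col_totals)
    chi2 = 0.0
    for row in (selected, not_selected):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Counts copied from Table 3 (columns: TLD, Wikipedia, movie).
selected = [20_068, 2_165, 148]
not_selected = [237_407, 58_020, 3_031]

rates = [s / (s + n) for s, n in zip(selected, not_selected)]
chi2 = chi_square_2xk(selected, not_selected)
# A 2 x 3 table has two degrees of freedom, where P(X > chi2) = exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

for label, rate in zip(("TLD", "Wikipedia", "Movie"), rates):
    print(f"{label:<10} CTR = {rate:.3%}")
print(f"chi-square = {chi2:.1f}, p < .01: {p_value < 0.01}")
```

This tests only whether clickthrough rate differs across the three types overall; it does not by itself establish which pairs differ.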

Chart 1: Frequency of selection by position

Another indicator that users lean on the address bar for navigation is the degree to which users preferred the earliest-position items in the result set (see Chart 1). That users would favor early results is expected, but over half of all address bar interactions resulted in the top item being chosen.

One way the navigation and discovery hypothesis was tested was by introducing the movie result type. We detected when a recommendation was for a movie or television series, and augmented the recommendation with information specific to movies: the release year, run time, genres, rating on IMDb, and an image of the promotional poster. This additional information would be useful if the user was attempting to discover information, but would provide little benefit if they were simply navigating to a known destination.

Table 4: Effect of movie cards on recommendation selection

              Shown movie cards
Selected      Yes          No
Yes           13,710       14,156
No            192,691      187,001
CTR           6.642%       7.037%

Note: significant at p < .01

Movie cards were shown (or not) as a randomly sampled A/B test. As Table 4 shows, overall CTR was lower in the population that was shown movie cards. Thanks to a large sample size, this small effect was decidedly significant, indicating that the contextual information may be harmful to the perceived utility of the recommendation, as measured by clickthrough rate.
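To see how an effect this small can still be significant, it helps to look at the size of the difference alongside its uncertainty. The sketch below assumes a two-proportion z-test with a normal-approximation confidence interval; the essay does not state which test was actually used, and the function name is illustrative.

```python
import math


def compare_proportions(sel_a: int, n_a: int, sel_b: int, n_b: int):
    """Difference in proportions, two-sided p-value, and a 95% interval."""
    p_a, p_b = sel_a / n_a, sel_b / n_b
    diff = p_a - p_b
    # Pooled standard error for the test of "no difference".
    pooled = (sel_a + sel_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Unpooled standard error for the interval around the observed difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, p_value, ci


# Counts copied from Table 4: group A saw movie cards, group B did not.
shown_n = 13_710 + 192_691
hidden_n = 14_156 + 187_001
diff, p_value, ci = compare_proportions(13_710, shown_n, 14_156, hidden_n)

print(f"difference in CTR: {diff * 100:+.3f} percentage points")
print(f"95% interval: {ci[0] * 100:+.3f} to {ci[1] * 100:+.3f} percentage points")
print(f"p < .01: {p_value < 0.01}")
```

The difference is well under half a percentage point, yet with roughly 200,000 impressions in each group the interval excludes zero.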

This finding warrants further study. Though it seems clear that Universal Search’s users were using the address bar for navigation more frequently than for discovery, this could be explained as an intersection of two factors:

  1. The previously-discussed power user bias could correlate with faster and more predetermined operations, making it less likely that they see and parse the recommendation.
  2. Habit. Firefox’s address bar has largely remained unchanged since Firefox 3, nearly a decade ago, and users may just be trained to use it in a specific way: for navigation.

A longitudinal Shield study that investigates changes in user behavior as they become habituated to the new functionality could eliminate both of these biases and provide clearer insight into these findings.

What comes next?

Building on these findings, Mozilla has launched a new initiative: Context Graph, a grander vision of a privacy-respecting recommendation engine for the web. Early efforts include two projects: Miracle, a project to gather data to train an experimental recommendation engine; and Heatmap, which annotates a user’s history with the ways they interact with pages. Combined, these could form a basis for a better recommendation engine than the one offered by Universal Search.

Early efforts on Universal Search looked at ways to uncover and store data about websites you visit. This spirit has been continued with two projects: Fathom, a framework for extracting meaning from the DOM, and page-metadata-parser, a Fathom implementation in use by the Activity Stream Test Pilot experiment.

Though the experiment will be ending, we’d still love to hear about your experiences. A survey has been set up for structured feedback, but we’d also love to hear your freeform thoughts on the Test Pilot Discourse, where you can discuss the experiment with the project team.

With gratitude

This project wouldn’t have been possible without the experience and expertise of a massive group of people. A surely-incomplete list: