A good hash is hard to find

I’m sure anyone who has worked with an A/B testing framework will nod knowingly when I say: the devil is in the details. When you’re looking for small signals in noisy human behavior, the tiniest bit of bias or extra variance is amplified and can easily lead to false positives or missed opportunities. Recently, at OfferUp, we discovered that something as simple as choosing the right hash function can have surprisingly non-simple implications.

Hash me if you can

Two key assumptions underlying A/B testing are:

  1. Within a given experiment, individuals are assigned to a test group randomly. This could be violated, for instance, if users of a certain version of the app fail to be assigned to a test and are by default always in control.
  2. Assignment across tests is independent. Knowing a user’s assignment in one test tells you nothing about their assignment in another test.

If both of these assumptions are met, differences from user to user should be averaged out and you can say that any difference in outcomes must be due to the test intervention. In practice, it’s also desirable for a user to remain in the same variant throughout the duration of a test. Subjecting users to constant change is bad for UX and bad for experimentation because there’s often an accommodation period.

A common solution for assigning users to test buckets is via a hashing layer. This approach is deterministic — the same user identifier gets hashed to the same bucket — and it avoids the scaling challenges of using caching to track a user’s assignment.

In its basic form, the user bucketing scheme at OfferUp for a given experiment is:

  • Take a user ID and combine it with the experiment ID / experiment salt using an XOR.
  • Run the resulting user + experiment object through a hash.
  • Finally, mod the hash output to get a fixed range of outcomes, e.g. (x % 10), and then assign a user to a bucket based on the mod-ed value (a minimal sketch of this scheme follows below).
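
To make the mechanics concrete, here is a minimal Python sketch of that scheme. Our production implementation is in Java and differs in detail; the bucket count, salt, and user ID below are made up for illustration.

```python
# Minimal sketch of the bucketing scheme described above (illustrative values only).
NUM_BUCKETS = 10  # assumed bucket count for this example

def assign_bucket(user_id: int, experiment_salt: int, num_buckets: int = NUM_BUCKETS) -> int:
    combined = user_id ^ experiment_salt   # combine user ID and experiment salt via XOR
    hashed = hash(combined)                # run the combined value through a hash function
    return hashed % num_buckets            # mod the hash output into a fixed range of buckets

# The assignment is deterministic: the same user and salt always land in the same bucket.
assert assign_bucket(12345, 0xC0FFEE) == assign_bucket(12345, 0xC0FFEE)
```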

It’s a mod mod world

We discovered an issue with user randomization while debugging an unexpected A/B test result. We noticed that users in the control group were making more offers per person than the users in the test group, which was not a metric the test was expected to affect. However, we were also running an A/B test that did target offers per user. On a lark, we checked for correlation between the experiments and found that the assignments were not independent according to a chi-squared test.
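
A check along those lines can be done in a few lines of Python. The sketch below uses made-up salts, a two-bucket split, and scipy’s chi-squared test; it’s illustrative rather than our exact tooling, but it shows the kind of dependence such a test can flag.

```python
# Sketch of a cross-experiment independence check (salts, sample size, and the
# use of scipy are illustrative assumptions, not our exact tooling).
import numpy as np
from scipy.stats import chi2_contingency

def assignments(user_ids, salt, num_buckets=2):
    return [hash(uid ^ salt) % num_buckets for uid in user_ids]

user_ids = range(100_000)
exp_a = assignments(user_ids, salt=12345)   # hypothetical experiment 1
exp_b = assignments(user_ids, salt=67890)   # hypothetical experiment 2

# Build a 2x2 contingency table of joint bucket assignments across the two experiments.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(exp_a, exp_b):
    table[a, b] += 1

chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p_value:.3g}")  # a tiny p-value means the assignments are not independent
```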

Our investigation finally led us to the hashing step itself as the root cause: the hash function we were using was generating correlated assignments across experiments. In some cases, depending on the particular pair of experiment salts, the correlation was negligible, while in other cases it became substantial. In fact, this problem is present in most non-cryptographic hash functions. Buried in a 2007 KDD paper from Microsoft is a mention of this issue:

“And if the hash function has characteristics (instances where a perturbation of the key produces a predictable perturbation of the hash code), then correlations may occur between experiments. Few hash functions are sound enough to be used in this technique.
We tested this technique using several popular hash functions and a methodology similar to the one we used on the pseudorandom number generators[…] We found that only the cryptographic hash function MD5 generated no correlations between experiments. SHA256 (another cryptographic hash) came close, requiring a five-way interaction to produce a correlation. The .NET string hashing function failed to pass even a two-way interaction test.”
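
As a point of comparison, swapping a cryptographic hash into the same kind of scheme is straightforward. The sketch below uses Python’s hashlib with MD5 and a made-up key encoding (a salt:user_id string rather than the XOR step above):

```python
# Sketch of MD5-based bucketing via hashlib; the key encoding here is an
# assumption for illustration, not our production format.
import hashlib

def md5_bucket(user_id: int, experiment_salt: str, num_buckets: int = 10) -> int:
    key = f"{experiment_salt}:{user_id}".encode("utf-8")   # combine salt and user ID
    digest = hashlib.md5(key).hexdigest()                  # cryptographic hash of the key
    return int(digest, 16) % num_buckets                   # mod down to a bucket index
```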

…fool me twice, SHAme on me

To get an intuition for the mechanics behind correlated hash bucketing, I’ve created a simplified demo of the phenomenon in Python. Our experimentation backend is actually written in Java, but the principles are the same in Python. Both languages’ default hashing functions are optimized for performance and not intended to be cryptographically secure.

Built-In Hash Function

In this demo I’m assuming a simple A/B test scenario where half the incoming users are assigned to bucket A and half are assigned to bucket B. For a pair of experiment salts, I’ve taken a set of 1,000 sequential user IDs and run them through Python’s built-in hashing function. I then mod-ed the result and bucketed each user into the appropriate variant.
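
A condensed sketch of that demo is below; the two experiment salts are made-up values, and the 50/50 A/B split is taken from the mod-ed output.

```python
# Simplified reconstruction of the demo; the two salts are made-up values.
user_ids = range(1000)                                 # 1,000 sequential user IDs
salts = {"experiment_1": 1001, "experiment_2": 2002}   # illustrative experiment salts

def bucket(user_id: int, salt: int) -> str:
    modded = hash(user_id ^ salt) % 10                 # mod-ed output of the built-in hash
    return "A" if modded < 5 else "B"                  # 50/50 split into the two variants

assignments = {name: [bucket(uid, salt) for uid in user_ids]
               for name, salt in salts.items()}
```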

Below I’ve plotted the mod-ed outputs and buckets for each experiment salt individually. You can clearly see that the outputs are non-random: the buckets of sequential users are related. In practice, however, this isn’t really problematic. Users are still distributed 50/50 across the buckets, and the mod-ing ensures that each bucket has a mix of both old and new users.
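
For completeness, a plot along those lines can be produced with something like the following matplotlib sketch (not the original plotting code):

```python
# Sketch of the plotting step (matplotlib usage and salts are illustrative).
import matplotlib.pyplot as plt

user_ids = range(1000)
salts = {"experiment_1": 1001, "experiment_2": 2002}   # same made-up salts as above

fig, axes = plt.subplots(len(salts), 1, sharex=True)
for ax, (name, salt) in zip(axes, salts.items()):
    modded = [hash(uid ^ salt) % 10 for uid in user_ids]   # mod-ed built-in hash outputs
    ax.scatter(user_ids, modded, s=2)
    ax.set_title(name)
    ax.set_ylabel("hash % 10")
axes[-1].set_xlabel("sequential user ID")
plt.tight_layout()
plt.show()
```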