The Art of Scrambling


Every software engineer has been told a million times by salty old college professors, or by senior programmers, to always write good hashing functions. We all know that good hashes are critical to writing efficient searches and that if a hashing function produces too many collisions, the search efficiency can degrade from constant-time performance to linear-time performance. Ouch.

So, I’m going to share some handy code I’ve developed to evaluate the effectiveness of a hashing function. Basically, the idea is that if your hashing function produces a 32-bit integer, then you should be using all 32 of those bits. Over the long run, each bit should be set to zero about half the time and to one about half the time. If, for some reason, you’re setting bit 7 to one every single time, then you’re not really using that bit at all. So, the hashes coming out of your function are, effectively, only 31-bit hashes.

Likewise, if bit 18 is being set to zero 25% of the time and set to one 75% of the time, then it’s a little bit more tricky. To determine how many effective bits are being contributed to the hash from the bits in this position, first assume that a perfect hashing function would generate an average bit value of 0.5 at each bit position. Then, calculate the distance (the absolute value of the difference) between that ideal average and the actual average.

This tells us how far off the average bit value is from where it’s supposed to be. In this case, the distance is 0.25. But, since we have to account for that error in both directions from the ideal mean, we have to double it, resulting in a total bit error of 0.5. Subtracting this value from 1.0 results in the number of effective bits supplied at this bit position. In the current example, we only get 0.5 bits of information at this bit position. Not very good.

In code, the calculation looks like this:

double effBits = 1.0d - (Math.abs(average - 0.5d) * 2);

After calculating the number of effective bits for all 32 bit-positions in the hash integer, we can see how many total bits of information were actually generated by the hashing function. Then, we can calculate the overall effectiveness of the hashing function by raising 2 to the power of n, where n is the number of effective bits in our hashes, and then dividing that number by 2^32. This overall effectiveness number tells us the percentage of the possible integer values that can actually be generated by the hashing function in question.

Since this calculation relies on exponentiation, if our hash only provides 31 effective bits of information, the overall effectiveness will only be 50%. With 30 effective bits, the overall effectiveness decreases to 25%. So, obviously, it’s extremely important that our hashing functions have as high an effective bit count as possible.

Here’s the complete code for hash-testing class:

01  /**
02   * Copyright 2006 Benji Smith < benji@benjismith.net >
03   * 
04   * Licensed under the MIT License
05   * http://opensource.org/licenses/mit-license.php
06   */
07  package net.benjismith.hashing;
08  
09  import java.util.ArrayList;
10  import java.util.List;
11  import java.util.Random;
12  
13  public class HashTest {
14  
15    public static void main(String[] args) {
16      List<String> strings = makeRandomStrings(100000);
17      testHashes(strings);
18    }
19  
20    public static void testHashes(List objects) {
21      double[] bitSums = new double[32];
22      // Generate a running sum of bit values, for each of the 32
23      // bit positions for the hashes of the objects in this list.
24      for (Object obj : objects) {
25        for (int j = 0; j < 32; j++) {
26          int hash = obj.hashCode();
27          int thisBit = ((hash >> j) & 1);
28          bitSums[j] += (double) thisBit;
29        }
30      }
31      double effectiveBitSum = 0.0d;
32      System.out.println("average values and effective bit-counts:");
33      for (int i = 0; i < 32; i++) {
34        // Get the average bit value at this bit position.
35        double average = (double) bitSums[i] / objects.size();
36        // Get the number of effective bits at this position.
37        double effBits = 1.0d - (Math.abs(average - 0.5d) * 2);
38        System.out.printf("  bit %02d: (avg %.5f, eff %.5f)\\n",
39            i, average, effBits);
40        // Accumulate the effective bits across all positions.
41        effectiveBitSum += effBits;
42      }
43      // Estimate the number of unique hash values that could be
44      // produced by this hashing function.
45      double unique = Math.pow(2, effectiveBitSum);
46      // Determine the overall effectiveness of this hashing function
47      // (as a ration of the estimated unique values to the size of 
48      // the complete 32-bit integer domain).
49      double overallEff = unique / Math.pow(2, 32);
50  
51      System.out.printf("\\nEffective Bits: %.5f\\n", effectiveBitSum);
52      System.out.printf("Unique Hash Values: %,.2f\\n", unique);
53      System.out.printf("Overall Effectiveness: %.8f\\n", overallEff);
54    }
55  
56    public static List<String> makeRandomStrings(int numberToMake) {
57      char[] lowercase = "abcdefghijklmnopqrstuvwxyz".toCharArray();
58      Random r = new Random(System.nanoTime());
59      List<String> strings = new ArrayList<String>();
60      StringBuilder b;
61      for(int i = 0; i < numberToMake; i++) {
62        // Make a random ten-character lower-case string
63        b = new StringBuilder();
64        for (int j = 0; j < 10; j++) {
65          b.append(lowercase[r.nextInt(lowercase.length)]);
66        }
67        strings.add(b.toString());
68      }
69      return strings;
70    }
71  
72  }

Notice I’ve also included a method for generating a list of random ten-character lower-case strings. I did so because I wanted to severely limit the domain of inputs that will be fed into our hashing function. Most real-world applications expect certain kinds of limitations on the input data (i.e., usernames must be shorter than 12 characters), and the hashing function that we use for those data needs to utilize the whole integer space, even if the input domain is narrow.

I’ve also included a main() method that generates one hundred thousand of those strings and then feeds them into the testHashes() method.

When we run this code, as-is, it uses the hashCode() method from the String class, and the output of the testHashes() method verifies that Java’s built-in String hashing is very good:

average values and effective bit-counts:
  bit 00: (avg 0.49886, eff 0.99772)
  bit 01: (avg 0.49899, eff 0.99798)
  bit 02: (avg 0.49800, eff 0.99600)
  bit 03: (avg 0.50113, eff 0.99774)
  bit 04: (avg 0.49829, eff 0.99658)
  bit 05: (avg 0.49946, eff 0.99892)
  bit 06: (avg 0.49892, eff 0.99784)
  bit 07: (avg 0.50290, eff 0.99420)
  bit 08: (avg 0.49891, eff 0.99782)
  bit 09: (avg 0.50210, eff 0.99580)
  bit 10: (avg 0.50012, eff 0.99976)
  bit 11: (avg 0.50143, eff 0.99714)
  bit 12: (avg 0.50162, eff 0.99676)
  bit 13: (avg 0.50192, eff 0.99616)
  bit 14: (avg 0.50038, eff 0.99924)
  bit 15: (avg 0.49767, eff 0.99534)
  bit 16: (avg 0.50122, eff 0.99756)
  bit 17: (avg 0.50054, eff 0.99892)
  bit 18: (avg 0.49772, eff 0.99544)
  bit 19: (avg 0.49976, eff 0.99952)
  bit 20: (avg 0.50027, eff 0.99946)
  bit 21: (avg 0.50163, eff 0.99674)
  bit 22: (avg 0.50003, eff 0.99994)
  bit 23: (avg 0.49987, eff 0.99974)
  bit 24: (avg 0.49924, eff 0.99848)
  bit 25: (avg 0.49933, eff 0.99866)
  bit 26: (avg 0.49542, eff 0.99084)
  bit 27: (avg 0.49690, eff 0.99380)
  bit 28: (avg 0.49894, eff 0.99788)
  bit 29: (avg 0.49872, eff 0.99744)
  bit 30: (avg 0.49920, eff 0.99840)
  bit 31: (avg 0.49966, eff 0.99932)

Effective Bits: 31.91714
Unique Hash Values: 4,055,239,568.46
Overall Effectiveness: 0.94418404

With 31.92 effective bits and an overall effectiveness of 94.42%, this hashing function is nearly perfect.

But let’s see what happens if we code our own hash function, making some of the mistakes that beginners often make. On line 26 of this class, instead of calling the obj.hashCode() method, we’ll call our own badHash() method, which I’ve defined as follows:

public static int badHash(Object obj) {
  int hashCode = 0;
  for (char c : ((String) obj).toCharArray()) {
    hashCode += c;
  }
  return hashCode;
}

Ignoring the obviously reckless String cast, this method is problematic because it produces ridiculously bad hashes. Here’s the output of the testHashes method:

average values and effective bit-counts:
  bit 00: (avg 0.49960, eff 0.99920)
  bit 01: (avg 0.50264, eff 0.99472)
  bit 02: (avg 0.49915, eff 0.99830)
  bit 03: (avg 0.50076, eff 0.99848)
  bit 04: (avg 0.50117, eff 0.99766)
  bit 05: (avg 0.47724, eff 0.95448)
  bit 06: (avg 0.61683, eff 0.76634)
  bit 07: (avg 0.00849, eff 0.01698)
  bit 08: (avg 0.00084, eff 0.00168)
  bit 09: (avg 0.00084, eff 0.00168)
  bit 10: (avg 0.99916, eff 0.00168)
  bit 11: (avg 0.00000, eff 0.00000)
  bit 12: (avg 0.00000, eff 0.00000)
  bit 13: (avg 0.00000, eff 0.00000)
  bit 14: (avg 0.00000, eff 0.00000)
  bit 15: (avg 0.00000, eff 0.00000)
  bit 16: (avg 0.00000, eff 0.00000)
  bit 17: (avg 0.00000, eff 0.00000)
  bit 18: (avg 0.00000, eff 0.00000)
  bit 19: (avg 0.00000, eff 0.00000)
  bit 20: (avg 0.00000, eff 0.00000)
  bit 21: (avg 0.00000, eff 0.00000)
  bit 22: (avg 0.00000, eff 0.00000)
  bit 23: (avg 0.00000, eff 0.00000)
  bit 24: (avg 0.00000, eff 0.00000)
  bit 25: (avg 0.00000, eff 0.00000)
  bit 26: (avg 0.00000, eff 0.00000)
  bit 27: (avg 0.00000, eff 0.00000)
  bit 28: (avg 0.00000, eff 0.00000)
  bit 29: (avg 0.00000, eff 0.00000)
  bit 30: (avg 0.00000, eff 0.00000)
  bit 31: (avg 0.00000, eff 0.00000)

Effective Bits: 6.73120
Unique Hash Values: 106.24
Overall Effectiveness: 0.00000002

Yikes. The output is terrible.

Even though we’re using integer hashes, we’re only putting about seven bits worth of data into those hashes, resulting in only about 100 unique values being generated by this hashing algorithm.

Because our hash function only included addition operations, and because all of the input strings were very short, the resultant set of hashes only ever utilizes the bits in the first eleven positions. And only the first six positions have a reasonable level of effectiveness.

Back to the drawing board.

This time, rather than using an additive operation, we’ll use a multiplicative operation, and the bits will be more effectively utilized. We’ll continue to use our badHash() method, but we’ll re-write it as follows:

public static int badHash(Object obj) {
  int hashCode = 1;
  for (char c : ((String) obj).toCharArray()) {
    hashCode *= c;
  }
  return hashCode;
}

The results of this new function are somewhat better, but by no means ideal:

average values and effective bit-counts:
  bit 00: (avg 0.00097, eff 0.00194)
  bit 01: (avg 0.00574, eff 0.01148)
  bit 02: (avg 0.01811, eff 0.03622)
  bit 03: (avg 0.04075, eff 0.08150)
  bit 04: (avg 0.07593, eff 0.15186)
  bit 05: (avg 0.12403, eff 0.24806)
  bit 06: (avg 0.18123, eff 0.36246)
  bit 07: (avg 0.24167, eff 0.48334)
  bit 08: (avg 0.29770, eff 0.59540)
  bit 09: (avg 0.35290, eff 0.70580)
  bit 10: (avg 0.39766, eff 0.79532)
  bit 11: (avg 0.43170, eff 0.86340)
  bit 12: (avg 0.45399, eff 0.90798)
  bit 13: (avg 0.47442, eff 0.94884)
  bit 14: (avg 0.48314, eff 0.96628)
  bit 15: (avg 0.49155, eff 0.98310)
  bit 16: (avg 0.49588, eff 0.99176)
  bit 17: (avg 0.49580, eff 0.99160)
  bit 18: (avg 0.49805, eff 0.99610)
  bit 19: (avg 0.49972, eff 0.99944)
  bit 20: (avg 0.49909, eff 0.99818)
  bit 21: (avg 0.49655, eff 0.99310)
  bit 22: (avg 0.49874, eff 0.99748)
  bit 23: (avg 0.49830, eff 0.99660)
  bit 24: (avg 0.50323, eff 0.99354)
  bit 25: (avg 0.50023, eff 0.99954)
  bit 26: (avg 0.50103, eff 0.99794)
  bit 27: (avg 0.50191, eff 0.99618)
  bit 28: (avg 0.49976, eff 0.99952)
  bit 29: (avg 0.49562, eff 0.99124)
  bit 30: (avg 0.50135, eff 0.99730)
  bit 31: (avg 0.50091, eff 0.99818)

Effective Bits: 24.08068
Unique Hash Values: 17,742,180.61
Overall Effectiveness: 0.00413092

Using a simple multiplicative accumulation, we’re now effectively utilizing more than 24 of the 32 available bits. That’s a significant improvement over what we had before, but it’s still not very good. With less than 18 million unique hash values, we’re getting less than half of one percent of the overall effectiveness that we should be getting from a 32-bit hash. What’s interesting now, though, is that the effective bits are concentrated at the high-order side of the integer, while the low-order bits are extremely ineffective.

To understand why bit-position 00 only contains 0.00194 effective bits of information, keep in mind that only odd integers will ever have a value of 1 in this bit position. Each of the ten characters in our random strings has a 50% chance of being odd. But the probability is only 0.0009765 that all ten of the characters will have an odd value. If even one of them is even, then the bit at position 00 will be zero.

All of the ones at that bit position are considered to be information, but only an equal number of zeros at that position contain any information. The other zeros (the vast majority of them) are just there by accident. So the total amount of bit information at position zero is 2 * 0.0009765, or 0.001953. (Since this test operates on a sample of data, rather than on the whole population, there’s a very small margin of error involved, but we’ll continue to ignore it.)

Let’s try again. This time, we’ll concentrate on writing a function that scatters the information from our input bits throughout all 32 bit-positions in the integer space. Typically, a bitwise XOR operation is extremely effective at accomplishing this. So, with just one additional operation, here’s the new version of our badHash() method:

public static int badHash(Object obj) {
  int hashCode = 1;
  for (char c : ((String) obj).toCharArray()) {
    hashCode *= c;
    hashCode ^= c;
  }
  return hashCode;
}

Re-running the testHashes() method with this new algorithm produces the following results:

average values and effective bit-counts:
  bit 00: (avg 0.33368, eff 0.66736)
  bit 01: (avg 0.42638, eff 0.85276)
  bit 02: (avg 0.45648, eff 0.91296)
  bit 03: (avg 0.48878, eff 0.97756)
  bit 04: (avg 0.50503, eff 0.98994)
  bit 05: (avg 0.50894, eff 0.98212)
  bit 06: (avg 0.50156, eff 0.99688)
  bit 07: (avg 0.49924, eff 0.99848)
  bit 08: (avg 0.49787, eff 0.99574)
  bit 09: (avg 0.50231, eff 0.99538)
  bit 10: (avg 0.50198, eff 0.99604)
  bit 11: (avg 0.49680, eff 0.99360)
  bit 12: (avg 0.49778, eff 0.99556)
  bit 13: (avg 0.49876, eff 0.99752)
  bit 14: (avg 0.50005, eff 0.99990)
  bit 15: (avg 0.50064, eff 0.99872)
  bit 16: (avg 0.49567, eff 0.99134)
  bit 17: (avg 0.50107, eff 0.99786)
  bit 18: (avg 0.50183, eff 0.99634)
  bit 19: (avg 0.50224, eff 0.99552)
  bit 20: (avg 0.50261, eff 0.99478)
  bit 21: (avg 0.50241, eff 0.99518)
  bit 22: (avg 0.50103, eff 0.99794)
  bit 23: (avg 0.49770, eff 0.99540)
  bit 24: (avg 0.49911, eff 0.99822)
  bit 25: (avg 0.50122, eff 0.99756)
  bit 26: (avg 0.49926, eff 0.99852)
  bit 27: (avg 0.50009, eff 0.99982)
  bit 28: (avg 0.49738, eff 0.99476)
  bit 29: (avg 0.49688, eff 0.99376)
  bit 30: (avg 0.50199, eff 0.99602)
  bit 31: (avg 0.50046, eff 0.99908)

Effective Bits: 31.29262
Unique Hash Values: 2,630,372,545.36
Overall Effectiveness: 0.61243133

Much better. Although we’re still 0.7074 bits away from a perfect hashing algorithm, the results are finally acceptable, with the hashes from this function covering over 61% of the 32-bit integer space. We could probably spend a little more time tweaking these results, especially since we can see that the majority of the ineffectiveness is concentrated in the lowest four bit-positions.

But I’ll leave that as an exercise for the reader.

The important point I’m trying to convey in this article is that generating effective hashing functions isn’t very difficult at all, as long as you measure your results along the way, making corrections to the algorithm whenever you identify bit-positions that are being ineffectively utilized.

Leave a Reply

You must be logged in to post a comment.