A Brief Discussion on the .NET String HashCode

  • Updated on 9th Sep 2022

In the scenario of distributed caching, for a large number of caches (where the key is usually a string) that need to be evenly distributed across multiple (let’s assume N) server nodes, a more mainstream approach is the modulo algorithm: Hash(key) % N, which means: perform a hash operation on the key and then take the modulo with N. This is also known as the consistent hashing algorithm. Since it is not related to the main topic of this article, I will write another piece to discuss it further.

However, in .NET, the GetHashCode method of the string itself results in different return values when GetHashCode is run twice on the same string value. Here is an excerpt from the official documentation explaining this:

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

The hash code itself is not guaranteed to be stable. Hash codes for identical strings can differ across .NET implementations, across .NET versions, and across .NET platforms (such as 32-bit and 64-bit) for a single version of .NET. In some cases, they can even differ by application domain. This implies that two subsequent runs of the same program may return different hash codes.

As a result, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection, and they should never be persisted.

Since consistent hashing requires that the same string value has the same hash code, the GetHashCode method of the String itself does not meet the requirements of consistent hashing for cache keys.

In Java, the hash code for the same string value is consistent; you can refer to its algorithm description to implement the same method in .NET. According to the Java documentation, the hash code algorithm for a String object is as follows:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Based on integer operations, where s[i] is the ASCII value of the i-th character in the string, n is the length of the string, and ^ denotes exponentiation.

Why use 31 as the multiplier? Since the goal is to choose a larger prime number, why not 29, 37, or even 97?

Below is a highly upvoted answer from StackOverflow for the question:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

That is to say, the choice of 31 is because it is an odd prime number, which on one hand can produce a more scattered hash, meaning that the hash values of different strings are generally not the same. When choosing a coefficient, it is best to select a longer one that also avoids overflow in the multiplication result. The larger the hash address calculated, the fewer so-called “collisions” there will be, and the efficiency of lookups will also increase. However, it cannot be too large, as in Java multiplication, if the numbers are too large, it can lead to overflow issues.

Another aspect is the higher computational efficiency. A good feature of 31 is that multiplication can be replaced with shift and subtraction for better performance: 31 * i == (i << 5) - i. Modern virtual machines will perform this optimization.

The algorithm has a case where a string and its palindrome have the same hashCode() return value. For example, the strings “gdejicbegh” and “hgebcijedg” both have a hashCode() return value of -801038016. However, this does not affect the consistent hashing under distributed caching.

Based on the interpretation of the above Java string hashCode() algorithm, it is easy to derive a method for calculating the hash code of a string in C#. The core code is as follows:

private int GetHashCode(string text)
{
    if (text.Length <= 0)
    {
        return 0;
    }

    Span<byte> bytes = Encoding.UTF8.GetBytes(text);

    int hashCode = 0;

    foreach (var b in bytes)
    {
        //hashCode = (hashCode << 5) - hashCode + b
        hashCode = 31 * hashCode + b;
    }

    return hashCode;
}

If there is no need to match the hashCode() results of Java strings, the following implementation can also be used:

private int GetHash(string text)
{
   using (var md5 = MD5.Create())
   {
   	var hash = md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(key));
   	return BitConverter.ToInt32(hash, 0);
   }
}