(In)security of E-mail Hashes

While hashing of e-mail addresses is sometimes proposed as a way to make storage of these addresses more secure, this article aims to show that it should not be relied on to provide anonymisation of personally identifying information and does not provide as much security as assumed.

Lately I’ve seen quite a few posts online about e-mail hashes as a way to securely store e-mail addresses when these should be anonymised and are not needed for marketing purposes. This might be out of a desire to be more easily GDPR compliant, or simply out of security considerations.


This is a short write-up to explain why this is not actually as secure as imagined and should be considered a form of pseudonymisation rather than anonymisation. As such, it will not make GDPR compliance easier and should not be considered a reliable way to secure personally identifying information.


Defining an E-mail Address

An e-mail address can be defined as a string with the following notation:


<local-part>@<domain>


In almost all cases, the <local-part> will be an unquoted string using the characters:


  • A-Z

  • a-z

  • 0-9

  • any in !#$%&'*+-/=?^_`{|}~

  • .


While in real life there are some restrictions on how these characters can be used, to reduce complexity we can say that the alphabet of the <local-part> contains 84 elements. However, most servers treat e-mail addresses case-insensitively and will simply lowercase all characters in the address. As such, the total number of elements should realistically be considered 58 instead.


When cracking e-mail addresses, the domains can generally be disregarded. This is because domains do not follow a random distribution. The great majority of all e-mail addresses will have one of the following domains:


  • gmail.com

  • yahoo.com

  • hotmail.com

  • aol.com

  • outlook.com

  • icloud.com


As such, it does not make sense to brute force the e-mail domain if we can crack almost all e-mail addresses by trying one of these 6.


Cracking E-mail Addresses

The number of attempts needed to exhaustively try all possible options for an e-mail address with a local part of length n and one of the 6 given domains can now be defined as:


\begin{aligned} f(n) = 58^n \cdot 6 \end{aligned}


For different values of n this gives us:


\begin{aligned}
f(1) &= 348 \\
f(2) &= 20\ 184 \\
f(3) &\approx 1.17 \cdot 10^6 \\
f(4) &\approx 6.79 \cdot 10^7 \\
f(5) &\approx 3.94 \cdot 10^9 \\
f(6) &\approx 2.28 \cdot 10^{11} \\
f(7) &\approx 1.32 \cdot 10^{13} \\
f(8) &\approx 7.68 \cdot 10^{14}
\end{aligned}
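These figures can be reproduced with a few lines of Python. This is only a minimal sketch: the names ALPHABET_SIZE, DOMAIN_COUNT and search_space are illustrative and do not come from any library.

```python
# Values taken from the text: 58 usable characters, 6 candidate domains.
ALPHABET_SIZE = 58
DOMAIN_COUNT = 6


def search_space(n: int) -> int:
    """Number of candidate addresses whose local part has exactly n characters."""
    return ALPHABET_SIZE ** n * DOMAIN_COUNT


if __name__ == "__main__":
    for n in range(1, 9):
        # .3g keeps three significant digits, matching the rounding used above.
        print(f"f({n}) = {search_space(n):.3g}")
```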


With an RTX 4090 graphics card we can compute about 4333.9 megahashes (million attempts) per second for HMAC-SHA256. This means that exhaustively trying all possible candidates would take:


\begin{aligned}
(\text{time} \circ f)(1) &\approx 0.08\ \mu s \\
(\text{time} \circ f)(2) &\approx 4.66\ \mu s \\
(\text{time} \circ f)(3) &\approx 270\ \mu s \\
(\text{time} \circ f)(4) &\approx 15.7\ ms \\
(\text{time} \circ f)(5) &\approx 909\ ms \\
(\text{time} \circ f)(6) &\approx 53\ s \\
(\text{time} \circ f)(7) &\approx 51\ min \\
(\text{time} \circ f)(8) &\approx 49\ h
\end{aligned}
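The conversion from search-space size to wall-clock time is a single division by the hash rate. The sketch below assumes the 4333.9 MH/s figure quoted above holds steady for the whole run; the helper names crack_seconds and human are made up for this example.

```python
HASHES_PER_SECOND = 4333.9e6  # benchmark figure quoted in the text


def crack_seconds(n: int, alphabet: int = 58, domains: int = 6) -> float:
    """Seconds needed to try every local part of exactly n characters."""
    return alphabet ** n * domains / HASHES_PER_SECOND


def human(seconds: float) -> str:
    """Format a duration with a unit that keeps the number readable."""
    units = [(1e-3, 1e6, "µs"), (1.0, 1e3, "ms"), (60.0, 1.0, "s"), (3600.0, 1 / 60, "min")]
    for limit, factor, unit in units:
        if seconds < limit:
            return f"{seconds * factor:.3g} {unit}"
    return f"{seconds / 3600:.3g} h"


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, human(crack_seconds(n)))
    # Trying every length from 1 to 8 in sequence adds only a little to the n = 8 case.
    print("total:", human(sum(crack_seconds(n) for n in range(1, 9))))
```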


In around 50 hours, using only a single graphics card, we can crack all e-mail addresses with a local part of up to 8 characters simply by brute forcing them.
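To make the attack concrete, here is a small CPU-only sketch of such a brute-force search. It rests on several assumptions that are not from the text: the stored value is taken to be a plain, unkeyed SHA-256 of the lowercased address, the alphabet is cut down to letters, digits and three common punctuation characters so that the demo finishes quickly, and the function names are invented for this example. A real attacker would use a GPU tool rather than Python.

```python
import hashlib
import itertools
import string

# Reduced alphabet (assumption, chosen only to keep the demo fast).
ALPHABET = string.ascii_lowercase + string.digits + "._-"
DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "aol.com", "outlook.com", "icloud.com"]


def email_hash(address: str) -> str:
    """Hypothetical storage scheme: hex SHA-256 of the lowercased address."""
    return hashlib.sha256(address.lower().encode("utf-8")).hexdigest()


def brute_force(target_hex: str, max_len: int = 3):
    """Try every local part up to max_len characters on each common domain."""
    for n in range(1, max_len + 1):
        for chars in itertools.product(ALPHABET, repeat=n):
            local = "".join(chars)
            for domain in DOMAINS:
                candidate = f"{local}@{domain}"
                if email_hash(candidate) == target_hex:
                    return candidate
    return None  # not found within the searched space


if __name__ == "__main__":
    stored = email_hash("ab@aol.com")
    print(brute_force(stored, max_len=2))
```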


In reality, however, most attackers would only use brute forcing for the shortest of e-mail addresses. As length increases, it makes more sense to switch to more sophisticated approaches such as mask-based, Markov-chain-based or PRINCE attacks, where we do not simply try every permutation of characters, but try to guess likely sequences. This is far more effective at limiting the number of guesses needed, because most e-mail addresses are not random strings: they are often composed of words or names with dots or underscores as word delimiters.
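As a rough illustration of guessing likely sequences, the sketch below combines names with common separators and suffixes. This is a plain wordlist-combination generator, a much cruder technique than the approaches named above, and the word lists, separators and suffixes are hypothetical placeholders rather than real attack data.

```python
import itertools


def likely_local_parts(first_names, last_names,
                       separators=("", ".", "_"),
                       suffixes=("", "1", "123")):
    """Yield plausible local parts such as 'jane.doe' or 'j_doe123'."""
    for first, last in itertools.product(first_names, last_names):
        for sep in separators:
            for suffix in suffixes:
                yield f"{first}{sep}{last}{suffix}"      # full first name
                yield f"{first[0]}{sep}{last}{suffix}"   # initial only


if __name__ == "__main__":
    # Hypothetical, tiny word lists; a real list would hold many thousands of names.
    for local in likely_local_parts(["jane"], ["doe"]):
        print(local)
```

Every candidate produced this way is then hashed against each common domain, exactly as in an exhaustive search, but the number of guesses grows with the size of the word lists instead of exponentially with the length of the local part.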


With this, it should be quite obvious why hashing of e-mail addresses should be considered pseudonymisation instead of anonymisation. For a sizeable chunk of the stored e-mail addresses, reversing the hash is entirely tractable.


But I Am Using a Secret

When using HMAC (a keyed hash, not encryption) to generate e-mail hashes, you need to pass a secret key to the hash function. This secret makes it exponentially more difficult to reverse the hash, as an attacker would need to recover the secret as well. In short: all of the previously discussed attacks would become intractable.


Does that mean that our hashing approach is saved? Not really, at least not from a GDPR perspective and also not from a security perspective. The best-case scenario would be for no one to be able to tractably get access to the plaintext e-mail addresses, not even lead developers or database administrators. However, someone, somewhere, must have access to the secret. At some point the secret needs to be injected into the application, and whoever has access to it has a reliable method of reversing these hashes.


Additionally, if the secret ever gets leaked, all the hashes in the database are no longer secure.
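A short sketch shows why the secret only helps while it stays secret. It uses HMAC-SHA256 from Python's standard library; the secret value, the candidate list and the function names are hypothetical, and a real attacker would feed in brute-force or wordlist candidates instead of a two-item list.

```python
import hashlib
import hmac


def keyed_email_hash(address: str, secret: bytes) -> str:
    """HMAC-SHA256 of the lowercased address under an application secret."""
    return hmac.new(secret, address.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def recover(target_hex: str, secret: bytes, candidates):
    """With the secret in hand, guessing works exactly as for an unkeyed hash."""
    for candidate in candidates:
        if hmac.compare_digest(keyed_email_hash(candidate, secret), target_hex):
            return candidate
    return None


if __name__ == "__main__":
    secret = b"hypothetical-application-secret"
    stored = keyed_email_hash("Jane.Doe@gmail.com", secret)
    guesses = ["bob@aol.com", "jane.doe@gmail.com"]
    print(recover(stored, secret, guesses))          # insider or leaked secret
    print(recover(stored, b"wrong-secret", guesses))  # outsider without the secret
```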


This is why, for high-security applications, I argue that e-mail hashes cannot provide as much security as we believe they do. They are quite easy to crack and do not have high intrinsic entropy.


Note that other hashing algorithms are possible and can significantly increase the time needed to crack these hashes. For most purposes, however, these hashing algorithms are prohibitively expensive. While we can scale a hashing algorithm to take 500ms or so to complete a single hash, this would then mean that our server needs to spend 500ms calculating that hash each time a user wishes to authenticate. Also note that this does not meaningfully alter the time complexity of cracking; it simply introduces a large constant factor.
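The constant-factor point can be sketched as follows. PBKDF2 is used here only because it ships with Python's standard library; the text does not name a specific algorithm, and the fixed application-wide salt, the iteration count and the function names are assumptions made for the example.

```python
import hashlib

# Assumption: lookups must be deterministic, so one fixed salt is shared by all rows.
APP_SALT = b"hypothetical-application-wide-salt"


def slow_email_hash(address: str, iterations: int = 200_000) -> str:
    """Deliberately expensive hash; raising iterations raises the cost per guess."""
    data = address.lower().encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", data, APP_SALT, iterations).hex()


def crack_seconds(n: int, seconds_per_hash: float,
                  alphabet: int = 58, domains: int = 6) -> float:
    """Exhaustive-search time for local parts of length n at a given cost per hash."""
    return alphabet ** n * domains * seconds_per_hash


if __name__ == "__main__":
    fast, slow = 1 / 4333.9e6, 0.5  # GPU rate from the text vs. 500 ms per hash
    for n in (4, 5):
        print(n, crack_seconds(n, fast), crack_seconds(n, slow))
    # Each extra character still multiplies the work by 58, whatever the cost per hash.
    print(crack_seconds(5, slow) / crack_seconds(4, slow))
```

The slower hash multiplies every row by the same factor, which protects long or unusual addresses well, but the growth per extra character is unchanged and the legitimate server pays the same per-hash cost.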

Published on March 4, 2025