Hash functions are miraculous, helpful tools we have no right to expect to exist. They seem far too convenient! They have many uses, including determining if two files are the same, creating unique (shorter) names, and encryption and digital security
Hash functions
A hash function \(H\) maps data of arbitrary size to a fixed size such that
\(H(x)\) is an easy to compute, deterministic function
If \(x\not= y\) then \(H(x)\not=H(y)\) with high probability
\(H(x)\) appears random over its range as \(x\) varies
Examples. The “IT” hash function: first five letters of last name + first letter first name, which has the J. Smith problem. A person’s phone, zip, or social security number act similarly to hash functions.
In practice, we want the high probability in (2) to mean a collision probability \(\approx 10^{-40}\), not one in one hundred1.
A hash functions is called cryptographic if it also satisfies
Given \(y\) it is very hard to find \(x\) with \(H(x)=y\).
SHA256 is a common cryptographic hash function. It is available in Python.
import hashlibh1 = hashlib.sha256('The quick brown fox jumps over the lazy dog'.encode('utf-8')).digest().hex()h2 = hashlib.sha256(b'The quick brown fox jumps over the lazy dog').digest().hex()h3 = hashlib.sha256(b'The quick brown fox jumps over the lazy dog.').hexdigest()print(h1, h2, h3, sep='\n')
Adding a period to the end of the string completely changes the hash, illustrating (3).
The output of hash can be interpreted as an integer. For SHA256, it is a very large integer, between 0 and \(2^{256}\approx 10^{77}\). Note that you must carefully specify input and output formats, especially when dealing the [Unicode strings] (https://www.mynl.com/blog?id=e74df0049a9a03999462358a4a8ad821). Since the hash is so large, the probability of a hash collision is extremely low. How low?
Aside: The Birthday Problem and hash collisions
The Birthday Problem asks how many people you need so there is at least a 50 percent chance two share a birthday. The answer is only 23 people. Hash collisions are analogous to the Birthday Problem: two documents (people) with the same hash (birthday).
Approximately \(\sqrt{2Np}\) documents are needed before there is a \(p\) probability of collision given a hash space size of \(N\)2. For SHA256, \(N=2^{256}=10^{77}\) and so a \(10^{-3}\) collision probability requires about \(1.5\times 10^{37}\) documents. That’s enough for every person on earth to compute 1 billion hashes per second for five times the age of the universe!
Aside: Proof of work and Bitcoin mining = compute hashes
The (in)famous Bitcoin mining process involves finding a nonce (number used once) to append to a coded string so that the result has a small hash, i.e., one with several leading zeros. The following Python code illustrates.
My machine took about 6 seconds to find six leading zeros.
Currently (Feb 2022), the Bitcoin network hash rate3 is \(2\times 10^{20}\) hashes per second (in November 2018 it was \(4\times 10^{19}\), in March 2024 it was \(6 \times 10^{}\)). It is estimated that miner’s electricity consumption approximates that of Austria or Finland. How small are the hashes being found? Recent block hashes start:
Our simple code found nine leading zeros. Currently, Bitcoin finds 19. Remember, the difficulty increases exponentially. The problem is analogous to simulating random numbers in a spreadsheet and hitting F9 until one starting 0.0xxx appears. You’ll wait ten times as long for 0.00xxx, etc.
The discrete logarithm problem
The logarithm is the inverse to the exponential function: if \(a=b^c\) then \(c\) is the logarithm of \(a\) to base \(b\). With regular numbers, the exponential function \(x\mapsto x^c\) is a nice smooth function and it is easy to compute its inverse using numerical methods. The picture changes using modular arithmetic (remainders).
Setup. Let \(p\) be a large prime number with a large prime \(q\) dividing \(p-1\)
E.g. \(p=23\), \(p-1=22\) has factor \(q=11\)
Non-e.g. if \(p=29\) then \(p-1=28\) has many factors
Condition on \(p\) is needed to ensure the discrete logarithm problem is hard
Find \(g<q\) with \(g, g^2, \dots,g^{q-1}, g^q\equiv 1\) all distinct mod \(p\)
E.g. \(g=3\) and work mod 23 = remainder after dividing by 23
given \(n\), find \(a\) such that \(g^a\equiv n\pmod{p}\).
Remember, \(\log_g(n)=a\) means \(g^a=g^{\log_g(n)}=n\), hence name logarithm. Unlike the case for regular numbers, the remainders of powers of \(g\) bounce around randomly (see figures). We can’t use a Newton-Raphson method to home-in on the answer. And in practice you work with really big\(p\) and \(q\). Given \(g\), \(a\), and \(p\), computing \(A=g^a\mod p\) is easy. But recovering \(a\) from \(A\) is hard. Exponentiation is an example of a one-way function. These ideas are illustrated in the next figure.
Cryptography and Digital Signatures
Many techniques in cryptography and digital signatures are based on the ingenious use of basic modular arithmetic. Internet security relies on addition, multiplication and division, and the difficulty of the discrete logarithm problem. The next three sections explain how two actors can create a shared secret, send a message using Public Key Encryption, and digitally sign a document.
Creating a shared secret using hash functions
ElGamel Public Key Encryption
Digital Signatures
Digital signatures are improve over traditional signatures in three important ways.
Signer authentication: the verifier is assured that signature has been created only by sender who possess the corresponding secret private key, \(a\).
Message integrity: if the message is modified then the signature fails, making the signature tamper evident.
Non-repudiation: the existence of a signature proves it came from the sender. The sender cannot later repudiate signing.
In contrast, a wet-ink signature can be forged, or the document can be altered, or the signature can be denied.
Powers of 3, red dot, mod 23 move randomly around the subgroup of \(F^{\times}_{23}\) of order 11, blue dots.Powers of 115, red dot, mod 359 move randomly around the subgroup of \(F^{\times}_{359}\) of order 179, blue dots.Powers of 3 modulo 100043; 100042 = 2 x 50021 is twice a prime. The “graph” of the exponential function looks completely random, making it very difficult to compute the inverse and solve the discrete logarithm problem.Figure 1: Alice picks \(a\), computes \(A=g^a \pmod p\) and sends it to Bob. Bob picks \(b\), computes \(B=g^b\pmod p\) and sends it to Alice. Both Alice and Bob can compute \(K=B^a=A^b=g^{ab}\pmod p\), but no one else can, and it was never shared electronically. It is a shared secret between Alice and Bob.Figure 2: Bob wants to send Alice a message \(m\). Alice picks \(a\) and publishes a public key \(A=g^a\pmod p\). Bob picks a nonce \(k\) and computes \(B=g^k\pmod p\), the shared secret \(K=A^k\pmod p\), and encodes the message as \(Km\pmod p\). Bob sends the pair \((B, Km)\) to Alice. Knowing \(B\), Alice can compute \(K=B^a=g^{ak}\) and then decode \(m=(K^{-1}Km)\). The shared secret \(K\) is never communicated publicly.Figure 3: To sign \(m\), Alice picks a nonce \(k\), computes \(R=g^k\pmod p\) to hide it, solves \(kS=m+Ra\) for \(k\), and communicates the pair \((R,S)\) to Bob as the signature. Bob validates the signature by computing \(g^mA^R=g^{m+Ra}\) and \(R^S=g^{kS}\) and checking they are equal. If an Imposter, who does not know Alice’s secret \(a\) tries to sign the message, they must solve \(kS=m+Ra\), one equation with two unknowns. They do know \(A=g^a\), and they can pick \(k\) and \(R\), so they try to solve \(g^{kS}=R^S=g^{m+Ra}=g^mA^R\). But, solving \(R^S=g^mA^R\) for \(R\) and \(S\) is (supposed to be) as hard as the discrete logarithm problem. Thus, an Imposter cannot sign the message.
E.g. for birthday problem \(p=1/2\), \(N=365\) and \(\sqrt{2Np}=19\). Approximation relies on \(p\approx -\log(1-p)\), only true for smaller \(p\). Using \((-2N\log(1-p))^{1/2}=22.49\) is very close to correct answer, 23.↩︎
Number of hashes computed per second by all miners.↩︎