Home About Contact

Hashing(Part 2): making a basic hash function

The accompanying video can be found here.


A recap and a question

So, in the last article, I covered some of the basics of hashing, what it is, what it is used for, and desirable characteristics in hash functions. This time, I want to cover more about how hashing works. This will be by no means an in-depth discussion, that could be an entire book, or series of books. Instead, I want to explain some of the basic principals and techniques behind how hash functions work by looking at how we might develop a basic hash function, and what we want to achieve. So, a question: is n mod 10 a hash function?

Removing information

In general, in a hash function, information from the input is lost to form a constant length hash, and the most basic hash you could create is to truncate an input to the length of the hash. To deal with cases where the input is shorter than the hash, we could just pad it with zeros. A major downside of hashing like this is it is very possible to have two slightly different outputs generate the same hash. Let's take a quick look at an example, defining our hashing algorithm as truncating the input to 10 characters, or padding it to 10 with zeros. We will now generate hashes for the following inputs:

input1
input2
Hello world!
jhfajw3784y23h51a8
Lorem ipsum dolor
Lorem ipsum color

The hashes are:

InputsOutputs
input1input10000
input2input20000
Hello world!Hello worl
fa819jw784h51a8fa819jw784
Lorem ipsum dolorLorem ipsu
Lorem ipsum colorLorem ipsu

Here, we can readily see that all the first 5 are different, but the hashes would indicate that the fifth and sixth inputs are the same. This is because the hash function does not use the entire input.

This is where the question I asked at the start comes in, and yes, it is a basic hash function, with an input n, and the hash is the remainder when the input is divided by 10. We can express this in a more general way as n mod m, where n is the input, and m is some number, but I'll expand on the effect of changing m in a minute.

We can improve our hashes by using this function, but for the sake of simplicity, let's take some random numbers as inputs to a hash function. Let's say:

3
98
291
292
293

While I have used numbers here for simplicity's sake, you could use this for the previous examples, you'd just have to convert the inputs into some numerical form. The hashes for these inputs with this function are:

InputsOutputs
33
988
2911
2922
2933

So as you can see here, we already have an improvement over the last hash function in that the last three values are different, but now we have a different problem. 3 and 293 have the same hash. The problem here becomes apparent if we just take the numbers 1 to 30 and calculate the hashes we get the following:

inputhashinputhashinputhash
11111211
22122222
33133233
44144244
55155255
66166266
77177277
88188288
99199299
100200300

So basically, anything ending in 1 has a hash of 1, anything ending in 2 has a hash of 2, and so on. Now, a simple solution could be to divide by 100 instead of 10, and this would mitigate the issue to some extent, with the new hashes as follows. Basically, we can control the length of the hash with m.

Unfortunately, this method does not produce a characteristic often desired in hash functions that I mentioned in the first article in this series, that of irreversibility. With the current hash function we are considering, it is very easy to work out the possible inputs that would produce a given hash. For example if we take the hash value of 5, we can immediately work out the set of inputs that produces the hash, in this case: 5, 15, 25, 35, .... While this pattern and predictability is only a small issue where hashing is used to ensure data integrity, it is a massive issue in security applications like password authentication. Basically, what we need, is a hash function that cannot be reversed and used to work out the input, I might explain why another time.

This is where an effect I mentioned in the last article, the avalanche effect comes in. To demonstrate this effect, here are those numbers from 1-30 hashed with a function I also mentioned last video: SHA256.

inputhash
14355a46b19d348dc2f57c046f8ef63d4538ebb936000f3c9ee954a27460dd865n
253c234e5e8472b6ac51c1ae1cab3fe06fad053beb8ebfd8977b010655bfdd3c3n
31121cfccd5913f0a63fec40a6ffd44ea64f9dc135c66634ba001d10bcf4302a2n
47de1555df0c2700329e815b93b32c571c3ea54dc967b89e81ab73b9972b72d1dn
5f0b5c2c2211c8d67ed15e75e656c7862d086e9245420892a7de62cd9ec582a06n
606e9d52c1720fca412803e3b07c4b228ff113e303f4c7ab94665319d832bbfb7n
710159baf262b43a92d95db59dae1f72c645127301661e0a3ce4e38b295a97c58n
8aa67a169b0bba217aa0aa88a65346920c84c42447c36ba5f7ea65f422c1fe5d8n
92e6d31a5983a91251bfae5aefa1c0a19d8ba3cf601d0e8a706b4cfa9661a6b8an
10917df3320d778ddbaa5c5c7742bc4046bf803c36ed2b050f30844ed206783469n
1125d4f2a86deb5e2574bb3210b67bb24fcc4afb19f93a7b65a057daa874a9d18en
12a1fb50e6c86fae1679ef3351296fd6713411a08cf8dd1790a4fd05fae8688164n
131a252402972f6057fa53cc172b52b9ffca698e18311facd0f3b06ecaaef79e17n
149a92adbc0cee38ef658c71ce1b1bf8c65668f166bfb213644c895ccb1ad07a25n
15238903180cc104ec2c5d8b3f20c5bc61b389ec0a967df8cc208cdc7cd454174fn
16e6c21e8d260fe71882debdb339d2402a2ca7648529bc2303f48649bce0380017n
1754183f4323f377b737433a1e98229ead0fdc686f93bab057ecb612daa94002b5n
187ee29791fc17e986b97128845622b077fb45e349fdb80523fac9dba879b4ad60n
19a9742eb8ee320e006666aef25ae9aeed948247f3125c9cafa7cf97b7e7467dd5n
205378796307535df3ec8d8b15a2e2dc5641419c3d3060cfe32238c0fa973f7aa3n
216e2ae11dad0616f66bbb2b6e6556f580bb987fd911d7132aa6bee2bfc7cc7b52n
22f14b4987904bcb5814e4459a057ed4d20f58a633152288a761214dcd28780b56n
23076320a2a08267b4c026d06573bba408ea68841e73cdc20e62cce59de165ece3n
2468ca3fba3b7e864770cb61aeb306d4bd4354b68ab4dd38450860c5d823e42a53n
2564aeb9975f234becd55bb4635e6e2f2da7a6b7bf0a896f0c07763bdfbfb31420n
2620d2add851bf39edcc6b5e830930f962db7e19dffa2cab8610eed551eda6c302n
27932ab0a0e4191d32c0af7b3f565b7b180dbe9869378abc5816f9add54b806e7fn
289961d158a7e0e2f990765971a9e490af826c0743b7d603020f34cc8944319fcbn
293840bc236ee03aacbb1ef7d5108ddfa347c59f10b68d4174affbb53140f31273n
30f4ccd05b3271c386ee55d9876c7450012a3b361e5065c09dc22075e38b3cc35cn

Aside from the obvious fact that these hashes are much larger, we can also observe something else, they are all very different. These differences are the manifestation of the avalanche effect.

This effect is one of the most commonly sought after features in hash functions, and it is closely related to another effect called the butterfly effect. The butterfly effect is where in a system, a small change to the inputs has a large effect on the output. As simple example of this, consider the equation y=x^2. In this case, let us use $x$ as our input, and y as our output. If we graph this for when x is greater than 0, we get this. As x increases, y also increases, but not at the same rate. Lets start at x equals 0. Here, y also equals 0. increasing x to one gives us y equals 1, but increasing x to 2, we get y equals 4, at x equals 3, y equals 9. At x equals 4, y equals 16, but you get the idea. As x, the input, gets larger, y, the output also increases, but not at the same rate, and this effect gets even more pronounced as x gets larger. Lets take x = 100, and x = 101 for example. Here the y values are 10,000, and 10,201, respectively. With x being 2 and 3, the difference between the outputs was 5, here it is much larger, 201 to be precise, and this is only a fairly tame example. Using the values 2, 3, 100, and 101, just for fun lets change that equation slightly to y=2^x. Now our values are very different, and the effect is much more pronounced.

xydifference
2
3
4
9
4
100
101
1267650600228229401496703205376
2535301200456458802993406410752
1267650600228229401496703205376

Make that much, much more pronounced, and bear in mind that this is a very small and simple system. Anyway, that was a bit of a sidetrack, lets get back to the avalanche effect in hash functions. In general, to create a good irreversible hash function, we apply a variety of techniques, and I am not going to explain SHA256 right now, but I will mention a couple more key principles that are important to good hashing, and in fact cryptography in general. They are confusion and diffusion. Confusion is where each bit of the hash is dependent on several separate parts of the input, and there is no way to find the connection between the parts of input based on the hash.

Illustration of confusion

Diffusion is where changing one bit of the input changes roughly half of the hash.

Illustration of confusion

That's all for this time, but I hope to go over an application of hash functions in the next video.

Reference articles / further reading

Confusion and diffusion - Wikipedia