Hashing(Part 2): making a basic hash function

The accompanying video can be found here.

A recap and a question

So, in the last article, I covered some of the basics of hashing, what it is, what it is used for, and desirable characteristics in hash functions. This time, I want to cover more about how hashing works. This will be by no means an in-depth discussion, that could be an entire book, or series of books. Instead, I want to explain some of the basic principals and techniques behind how hash functions work by looking at how we might develop a basic hash function, and what we want to achieve. So, a question: is n mod 10 a hash function?

Removing information

In general, in a hash function, information from the input is lost to form a constant length hash, and the most basic hash you could create is to truncate an input to the length of the hash. To deal with cases where the input is shorter than the hash, we could just pad it with zeros. A major downside of hashing like this is it is very possible to have two slightly different outputs generate the same hash. Let's take a quick look at an example, defining our hashing algorithm as truncating the input to 10 characters, or padding it to 10 with zeros. We will now generate hashes for the following inputs:

input1
input2
Hello world!
jhfajw3784y23h51a8
Lorem ipsum dolor
Lorem ipsum color

The hashes are:

Inputs	Outputs
input1	input10000
input2	input20000
Hello world!	Hello worl
fa819jw784h51a8	fa819jw784
Lorem ipsum dolor	Lorem ipsu
Lorem ipsum color	Lorem ipsu

Here, we can readily see that all the first 5 are different, but the hashes would indicate that the fifth and sixth inputs are the same. This is because the hash function does not use the entire input.

This is where the question I asked at the start comes in, and yes, it is a basic hash function, with an input n, and the hash is the remainder when the input is divided by 10. We can express this in a more general way as n mod m, where n is the input, and m is some number, but I'll expand on the effect of changing m in a minute.

We can improve our hashes by using this function, but for the sake of simplicity, let's take some random numbers as inputs to a hash function. Let's say:

While I have used numbers here for simplicity's sake, you could use this for the previous examples, you'd just have to convert the inputs into some numerical form. The hashes for these inputs with this function are:

Inputs	Outputs
3	3
98	8
291	1
292	2
293	3

So as you can see here, we already have an improvement over the last hash function in that the last three values are different, but now we have a different problem. 3 and 293 have the same hash. The problem here becomes apparent if we just take the numbers 1 to 30 and calculate the hashes we get the following:

input	hash	input	hash	input	hash
1	1	11	1	21	1
2	2	12	2	22	2
3	3	13	3	23	3
4	4	14	4	24	4
5	5	15	5	25	5
6	6	16	6	26	6
7	7	17	7	27	7
8	8	18	8	28	8
9	9	19	9	29	9
10	0	20	0	30	0

So basically, anything ending in 1 has a hash of 1, anything ending in 2 has a hash of 2, and so on. Now, a simple solution could be to divide by 100 instead of 10, and this would mitigate the issue to some extent, with the new hashes as follows. Basically, we can control the length of the hash with m.

Unfortunately, this method does not produce a characteristic often desired in hash functions that I mentioned in the first article in this series, that of irreversibility. With the current hash function we are considering, it is very easy to work out the possible inputs that would produce a given hash. For example if we take the hash value of 5, we can immediately work out the set of inputs that produces the hash, in this case: 5, 15, 25, 35, .... While this pattern and predictability is only a small issue where hashing is used to ensure data integrity, it is a massive issue in security applications like password authentication. Basically, what we need, is a hash function that cannot be reversed and used to work out the input, I might explain why another time.

This is where an effect I mentioned in the last article, the avalanche effect comes in. To demonstrate this effect, here are those numbers from 1-30 hashed with a function I also mentioned last video: SHA256.

input	hash
1	4355a46b19d348dc2f57c046f8ef63d4538ebb936000f3c9ee954a27460dd865n
2	53c234e5e8472b6ac51c1ae1cab3fe06fad053beb8ebfd8977b010655bfdd3c3n
3	1121cfccd5913f0a63fec40a6ffd44ea64f9dc135c66634ba001d10bcf4302a2n
4	7de1555df0c2700329e815b93b32c571c3ea54dc967b89e81ab73b9972b72d1dn
5	f0b5c2c2211c8d67ed15e75e656c7862d086e9245420892a7de62cd9ec582a06n
6	06e9d52c1720fca412803e3b07c4b228ff113e303f4c7ab94665319d832bbfb7n
7	10159baf262b43a92d95db59dae1f72c645127301661e0a3ce4e38b295a97c58n
8	aa67a169b0bba217aa0aa88a65346920c84c42447c36ba5f7ea65f422c1fe5d8n
9	2e6d31a5983a91251bfae5aefa1c0a19d8ba3cf601d0e8a706b4cfa9661a6b8an
10	917df3320d778ddbaa5c5c7742bc4046bf803c36ed2b050f30844ed206783469n
11	25d4f2a86deb5e2574bb3210b67bb24fcc4afb19f93a7b65a057daa874a9d18en
12	a1fb50e6c86fae1679ef3351296fd6713411a08cf8dd1790a4fd05fae8688164n
13	1a252402972f6057fa53cc172b52b9ffca698e18311facd0f3b06ecaaef79e17n
14	9a92adbc0cee38ef658c71ce1b1bf8c65668f166bfb213644c895ccb1ad07a25n
15	238903180cc104ec2c5d8b3f20c5bc61b389ec0a967df8cc208cdc7cd454174fn
16	e6c21e8d260fe71882debdb339d2402a2ca7648529bc2303f48649bce0380017n
17	54183f4323f377b737433a1e98229ead0fdc686f93bab057ecb612daa94002b5n
18	7ee29791fc17e986b97128845622b077fb45e349fdb80523fac9dba879b4ad60n
19	a9742eb8ee320e006666aef25ae9aeed948247f3125c9cafa7cf97b7e7467dd5n
20	5378796307535df3ec8d8b15a2e2dc5641419c3d3060cfe32238c0fa973f7aa3n
21	6e2ae11dad0616f66bbb2b6e6556f580bb987fd911d7132aa6bee2bfc7cc7b52n
22	f14b4987904bcb5814e4459a057ed4d20f58a633152288a761214dcd28780b56n
23	076320a2a08267b4c026d06573bba408ea68841e73cdc20e62cce59de165ece3n
24	68ca3fba3b7e864770cb61aeb306d4bd4354b68ab4dd38450860c5d823e42a53n
25	64aeb9975f234becd55bb4635e6e2f2da7a6b7bf0a896f0c07763bdfbfb31420n
26	20d2add851bf39edcc6b5e830930f962db7e19dffa2cab8610eed551eda6c302n
27	932ab0a0e4191d32c0af7b3f565b7b180dbe9869378abc5816f9add54b806e7fn
28	9961d158a7e0e2f990765971a9e490af826c0743b7d603020f34cc8944319fcbn
29	3840bc236ee03aacbb1ef7d5108ddfa347c59f10b68d4174affbb53140f31273n
30	f4ccd05b3271c386ee55d9876c7450012a3b361e5065c09dc22075e38b3cc35cn

Aside from the obvious fact that these hashes are much larger, we can also observe something else, they are all very different. These differences are the manifestation of the avalanche effect.

This effect is one of the most commonly sought after features in hash functions, and it is closely related to another effect called the butterfly effect. The butterfly effect is where in a system, a small change to the inputs has a large effect on the output. As simple example of this, consider the equation y=x^2. In this case, let us use $x$ as our input, and y as our output. If we graph this for when x is greater than 0, we get this. As x increases, y also increases, but not at the same rate. Lets start at x equals 0. Here, y also equals 0. increasing x to one gives us y equals 1, but increasing x to 2, we get y equals 4, at x equals 3, y equals 9. At x equals 4, y equals 16, but you get the idea. As x, the input, gets larger, y, the output also increases, but not at the same rate, and this effect gets even more pronounced as x gets larger. Lets take x = 100, and x = 101 for example. Here the y values are 10,000, and 10,201, respectively. With x being 2 and 3, the difference between the outputs was 5, here it is much larger, 201 to be precise, and this is only a fairly tame example. Using the values 2, 3, 100, and 101, just for fun lets change that equation slightly to y=2^x. Now our values are very different, and the effect is much more pronounced.

x	y	difference
2 3	4 9	4
100 101	1267650600228229401496703205376 2535301200456458802993406410752	1267650600228229401496703205376

Make that much, much more pronounced, and bear in mind that this is a very small and simple system. Anyway, that was a bit of a sidetrack, lets get back to the avalanche effect in hash functions. In general, to create a good irreversible hash function, we apply a variety of techniques, and I am not going to explain SHA256 right now, but I will mention a couple more key principles that are important to good hashing, and in fact cryptography in general. They are confusion and diffusion. Confusion is where each bit of the hash is dependent on several separate parts of the input, and there is no way to find the connection between the parts of input based on the hash.

Illustration of confusion

Diffusion is where changing one bit of the input changes roughly half of the hash.

Illustration of confusion

That's all for this time, but I hope to go over an application of hash functions in the next video.

Reference articles / further reading

Confusion and diffusion - Wikipedia

input	hash	input	hash	input	hash
1	1	11	1	21	1
2	2	12	2	22	2
3	3	13	3	23	3
4	4	14	4	24	4
5	5	15	5	25	5
6	6	16	6	26	6
7	7	17	7	27	7
8	8	18	8	28	8
9	9	19	9	29	9
10	0	20	0	30	0

input	hash	input	hash	input	hash
1	1	11	1	21	1
2	2	12	2	22	2
3	3	13	3	23	3
4	4	14	4	24	4
5	5	15	5	25	5
6	6	16	6	26	6
7	7	17	7	27	7
8	8	18	8	28	8
9	9	19	9	29	9
10	0	20	0	30	0

input	hash	input	hash	input	hash
1	1	11	1	21	1
2	2	12	2	22	2
3	3	13	3	23	3
4	4	14	4	24	4
5	5	15	5	25	5
6	6	16	6	26	6
7	7	17	7	27	7
8	8	18	8	28	8
9	9	19	9	29	9
10	0	20	0	30	0