The accompanying video can be found here.
So, in the last article, I covered some of the basics of hashing, what it is, what it is used for, and desirable characteristics in hash functions. This time, I want to cover more about how hashing works. This will be by no means an in-depth discussion, that could be an entire book, or series of books. Instead, I want to explain some of the basic principals and techniques behind how hash functions work by looking at how we might develop a basic hash function, and what we want to achieve. So, a question: is n mod 10
a hash function?
In general, in a hash function, information from the input is lost to form a constant length hash, and the most basic hash you could create is to truncate an input to the length of the hash. To deal with cases where the input is shorter than the hash, we could just pad it with zeros. A major downside of hashing like this is it is very possible to have two slightly different outputs generate the same hash. Let's take a quick look at an example, defining our hashing algorithm as truncating the input to 10 characters, or padding it to 10 with zeros. We will now generate hashes for the following inputs:
input1 input2 Hello world! jhfajw3784y23h51a8 Lorem ipsum dolor Lorem ipsum color
The hashes are:
Inputs | Outputs |
---|---|
input1 | input10000 |
input2 | input20000 |
Hello world! | Hello worl |
fa819jw784h51a8 | fa819jw784 |
Lorem ipsum dolor | Lorem ipsu |
Lorem ipsum color | Lorem ipsu |
Here, we can readily see that all the first 5 are different, but the hashes would indicate that the fifth and sixth inputs are the same. This is because the hash function does not use the entire input.
This is where the question I asked at the start comes in, and yes, it is a basic hash function, with an input n
, and the hash is the remainder when the input is divided by 10. We can express this in a more general way as n mod m
, where n
is the input, and m
is some number, but I'll expand on the effect of changing m
in a minute.
We can improve our hashes by using this function, but for the sake of simplicity, let's take some random numbers as inputs to a hash function. Let's say:
3 98 291 292 293
While I have used numbers here for simplicity's sake, you could use this for the previous examples, you'd just have to convert the inputs into some numerical form. The hashes for these inputs with this function are:
Inputs | Outputs |
---|---|
3 | 3 |
98 | 8 |
291 | 1 |
292 | 2 |
293 | 3 |
So as you can see here, we already have an improvement over the last hash function in that the last three values are different, but now we have a different problem. 3
and 293
have the same hash. The problem here becomes apparent if we just take the numbers 1 to 30 and calculate the hashes we get the following:
input | hash | input | hash | input | hash |
---|---|---|---|---|---|
1 | 1 | 11 | 1 | 21 | 1 |
2 | 2 | 12 | 2 | 22 | 2 |
3 | 3 | 13 | 3 | 23 | 3 |
4 | 4 | 14 | 4 | 24 | 4 |
5 | 5 | 15 | 5 | 25 | 5 |
6 | 6 | 16 | 6 | 26 | 6 |
7 | 7 | 17 | 7 | 27 | 7 |
8 | 8 | 18 | 8 | 28 | 8 |
9 | 9 | 19 | 9 | 29 | 9 |
10 | 0 | 20 | 0 | 30 | 0 |
So basically, anything ending in 1
has a hash of 1
, anything ending in 2
has a hash of 2
, and so on. Now, a simple solution could be to divide by 100 instead of 10, and this would mitigate the issue to some extent, with the new hashes as follows. Basically, we can control the length of the hash with m
.
Unfortunately, this method does not produce a characteristic often desired in hash functions that I mentioned in the first article in this series, that of irreversibility. With the current hash function we are considering, it is very easy to work out the possible inputs that would produce a given hash. For example if we take the hash value of 5
, we can immediately work out the set of inputs that produces the hash, in this case: 5, 15, 25, 35, ...
. While this pattern and predictability is only a small issue where hashing is used to ensure data integrity, it is a massive issue in security applications like password authentication. Basically, what we need, is a hash function that cannot be reversed and used to work out the input, I might explain why another time.
This is where an effect I mentioned in the last article, the avalanche effect comes in. To demonstrate this effect, here are those numbers from 1-30
hashed with a function I also mentioned last video: SHA256.
input | hash |
---|---|
1 | 4355a46b19d348dc2f57c046f8ef63d4538ebb936000f3c9ee954a27460dd865n |
2 | 53c234e5e8472b6ac51c1ae1cab3fe06fad053beb8ebfd8977b010655bfdd3c3n |
3 | 1121cfccd5913f0a63fec40a6ffd44ea64f9dc135c66634ba001d10bcf4302a2n |
4 | 7de1555df0c2700329e815b93b32c571c3ea54dc967b89e81ab73b9972b72d1dn |
5 | f0b5c2c2211c8d67ed15e75e656c7862d086e9245420892a7de62cd9ec582a06n |
6 | 06e9d52c1720fca412803e3b07c4b228ff113e303f4c7ab94665319d832bbfb7n |
7 | 10159baf262b43a92d95db59dae1f72c645127301661e0a3ce4e38b295a97c58n |
8 | aa67a169b0bba217aa0aa88a65346920c84c42447c36ba5f7ea65f422c1fe5d8n |
9 | 2e6d31a5983a91251bfae5aefa1c0a19d8ba3cf601d0e8a706b4cfa9661a6b8an |
10 | 917df3320d778ddbaa5c5c7742bc4046bf803c36ed2b050f30844ed206783469n |
11 | 25d4f2a86deb5e2574bb3210b67bb24fcc4afb19f93a7b65a057daa874a9d18en |
12 | a1fb50e6c86fae1679ef3351296fd6713411a08cf8dd1790a4fd05fae8688164n |
13 | 1a252402972f6057fa53cc172b52b9ffca698e18311facd0f3b06ecaaef79e17n |
14 | 9a92adbc0cee38ef658c71ce1b1bf8c65668f166bfb213644c895ccb1ad07a25n |
15 | 238903180cc104ec2c5d8b3f20c5bc61b389ec0a967df8cc208cdc7cd454174fn |
16 | e6c21e8d260fe71882debdb339d2402a2ca7648529bc2303f48649bce0380017n |
17 | 54183f4323f377b737433a1e98229ead0fdc686f93bab057ecb612daa94002b5n |
18 | 7ee29791fc17e986b97128845622b077fb45e349fdb80523fac9dba879b4ad60n |
19 | a9742eb8ee320e006666aef25ae9aeed948247f3125c9cafa7cf97b7e7467dd5n |
20 | 5378796307535df3ec8d8b15a2e2dc5641419c3d3060cfe32238c0fa973f7aa3n |
21 | 6e2ae11dad0616f66bbb2b6e6556f580bb987fd911d7132aa6bee2bfc7cc7b52n |
22 | f14b4987904bcb5814e4459a057ed4d20f58a633152288a761214dcd28780b56n |
23 | 076320a2a08267b4c026d06573bba408ea68841e73cdc20e62cce59de165ece3n |
24 | 68ca3fba3b7e864770cb61aeb306d4bd4354b68ab4dd38450860c5d823e42a53n |
25 | 64aeb9975f234becd55bb4635e6e2f2da7a6b7bf0a896f0c07763bdfbfb31420n |
26 | 20d2add851bf39edcc6b5e830930f962db7e19dffa2cab8610eed551eda6c302n |
27 | 932ab0a0e4191d32c0af7b3f565b7b180dbe9869378abc5816f9add54b806e7fn |
28 | 9961d158a7e0e2f990765971a9e490af826c0743b7d603020f34cc8944319fcbn |
29 | 3840bc236ee03aacbb1ef7d5108ddfa347c59f10b68d4174affbb53140f31273n |
30 | f4ccd05b3271c386ee55d9876c7450012a3b361e5065c09dc22075e38b3cc35cn |
Aside from the obvious fact that these hashes are much larger, we can also observe something else, they are all very different. These differences are the manifestation of the avalanche effect.
This effect is one of the most commonly sought after features in hash functions, and it is closely related to another effect called the butterfly effect. The butterfly effect is where in a system, a small change to the inputs has a large effect on the output. As simple example of this, consider the equation y=x^2
. In this case, let us use $x$ as our input, and y
as our output. If we graph this for when x
is greater than 0, we get this. As x
increases, y
also increases, but not at the same rate. Lets start at x
equals 0. Here, y
also equals 0. increasing x
to one gives us y
equals 1, but increasing x
to 2, we get y
equals 4, at x
equals 3, y
equals 9. At x
equals 4, y
equals 16, but you get the idea. As x
, the input, gets larger, y
, the output also increases, but not at the same rate, and this effect gets even more pronounced as x
gets larger. Lets take x = 100
, and x = 101
for example. Here the y
values are 10,000, and 10,201, respectively. With x
being 2 and 3, the difference between the outputs was 5, here it is much larger, 201 to be precise, and this is only a fairly tame example. Using the values 2, 3, 100, and 101, just for fun lets change that equation slightly to y=2^x
. Now our values are very different, and the effect is much more pronounced.
x | y | difference |
---|---|---|
2 3 | 4 9 | 4 |
100 101 | 1267650600228229401496703205376 2535301200456458802993406410752 | 1267650600228229401496703205376 |
Make that much, much more pronounced, and bear in mind that this is a very small and simple system. Anyway, that was a bit of a sidetrack, lets get back to the avalanche effect in hash functions. In general, to create a good irreversible hash function, we apply a variety of techniques, and I am not going to explain SHA256 right now, but I will mention a couple more key principles that are important to good hashing, and in fact cryptography in general. They are confusion and diffusion. Confusion is where each bit of the hash is dependent on several separate parts of the input, and there is no way to find the connection between the parts of input based on the hash.
Diffusion is where changing one bit of the input changes roughly half of the hash.
That's all for this time, but I hope to go over an application of hash functions in the next video.