Hashing 101: What is hashing?

The accompanying video can be found here.

Why did I write this?

I originally planned to write this article at some point after finishing my blockchain series, but that didn't work out. As it turns out, I need this explanation as a basis for explaining blockchains more easily. Note that this is not a technical explanation. That will come in the next article/video on hashing, which goes into more depth and focuses on how hashing works, not just what it does and why it is useful. To see the article, click here.

Like blockchains, hashing is commonly misunderstood, so the first thing I want to do is set the most prevalent misconception straight: hashing is not the same as encryption. Some of the confusion may come from the fact that both hashing and encryption belong to the same field of computer science, cryptography. Encryption is designed to be reversed by anyone holding the right key, whereas hashing, as explained further down, is designed not to be reversed at all. So, I repeat, hashing is NOT the same as encryption.

The explanation

Moving on, the simplest explanation of what hashing does is as follows: it is a process that takes a piece of data as input and produces another piece of data as output. This output is called a hash, and for a given hash function it is always the same size, no matter the size of the input. The output is determined by an algorithm called a hash function.
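To make the fixed-size property concrete, here is a minimal sketch using Python's standard hashlib module. The choice of SHA-256 and the sample inputs are my own assumptions for illustration; any other hash function would show the same behavior with its own output length.

```python
import hashlib

# Inputs of very different sizes (arbitrary values picked for illustration).
inputs = [b"", b"a", b"hello world", b"x" * 1_000_000]

for data in inputs:
    digest = hashlib.sha256(data).hexdigest()
    # SHA-256 always produces 256 bits, which is 64 hexadecimal characters.
    print(f"{len(data):>9} bytes in -> {len(digest)} hex characters out")

# Collect the output lengths so they can be compared directly.
digest_lengths = [len(hashlib.sha256(data).hexdigest()) for data in inputs]
```

Whether the input is empty or a million bytes long, the hash comes out at the same length.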

The main purpose of a hash function is to generate a repeatable, fixed-size value that is, for practical purposes, unique to its input. Hash functions are not just used in blockchains, as I alluded to before. They are also used for general detection of data modification, in some search algorithms, and as a key part of storing passwords safely and securely.
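As a rough sketch of the data modification use case: record the hash of some data once, then recompute it later and compare. The function names (fingerprint, is_unmodified) and the sample messages are hypothetical, made up for this example, and SHA-256 is again just an assumed choice.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, recorded_hash: str) -> bool:
    """Recompute the hash and compare it with the one recorded earlier."""
    return fingerprint(data) == recorded_hash


original = b"Pay Alice 10 coins"
recorded = fingerprint(original)  # stored when the data was known to be good

tampered = b"Pay Alice 90 coins"  # a single character has been changed

print(is_unmodified(original, recorded))  # True
print(is_unmodified(tampered, recorded))  # False
```

Only the short hash has to be stored to detect a change later, not a second copy of the data itself.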

There are some key points that I want to make about hashing and hashes. Firstly, for a given input, a hash function will always produce the same hash. However, there are many hash functions (some common ones you may have heard of are MD5 and SHA256), and in general, different functions will produce different hashes from the same input. These hashes can differ in both length and content. Another key behavior of hash functions is that for two different inputs, even if they are only very slightly different, the two hashes are almost certainly very different. For example, let's take the aforementioned MD5 hash function and hash these two highly original but slightly different sentences.

The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazi dog.

The following two lines of seemingly random text are the hashes. Despite the two sentences differing by only one character, the hashes are very different.

0d7006cd055e94cf614587e1d2ae0c8e
049079f0bf716a1a1c1c744f17519c20

Just to demonstrate the previous point about the differences in the hashes from different hash functions, here are the same sentences hashed with the SHA256 function.

b47cc0f104b62d4c7c30bcd68fd8e67613e287dc4ad8c310ef10cbadea9c4380
325acbd77808f2c02695c84f9cec253e8c89882df28eddf38d3f524332f854f5
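If you want to try this comparison yourself, the following sketch hashes both sentences with MD5 and SHA256 using Python's standard hashlib module. The helper names (hex_digest, differing_positions) are my own. Bear in mind that a hash depends on the exact bytes that go in, so details such as the text encoding or a trailing newline will change the result, and what you see may not match the values printed above character for character.

```python
import hashlib

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazi dog.",
]


def hex_digest(algorithm: str, text: str) -> str:
    """Hash the UTF-8 bytes of the text with the named algorithm."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


for algorithm in ("md5", "sha256"):
    first, second = (hex_digest(algorithm, s) for s in sentences)
    print(algorithm)
    print(" ", first)
    print(" ", second)
    print("  positions that differ:", differing_positions(first, second),
          "of", len(first))
```

The MD5 hashes come out at 32 hexadecimal characters and the SHA256 hashes at 64, and within each pair most positions differ even though the inputs differ by a single letter.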

One last point I want to make about hashing is that it is nearly impossible to reverse the hashing process. That is, you can't take a hash and work out which input was used to generate it. In general, the only feasible approach is brute force: try every possible input and compare each output with the hash. This can take a very long time, often so long that it is impractical. Even if a matching input is found, it is not guaranteed to be the original input, as it is possible for two different inputs to have the same hash. This is known as a collision.
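To give a feel for what brute force looks like, here is a toy sketch. It assumes we already know the input is exactly three lowercase letters, which is a huge head start that a real attacker would rarely have. The function names are hypothetical and SHA-256 is an assumed choice.

```python
import hashlib
from itertools import product
from string import ascii_lowercase
from typing import Optional


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of the text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def brute_force(target_hash: str, length: int) -> Optional[str]:
    """Try every lowercase string of the given length until one matches."""
    for letters in product(ascii_lowercase, repeat=length):
        candidate = "".join(letters)
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None  # no input of this length and alphabet produces the hash


target = sha256_hex("dog")     # pretend we only know this hash
print(brute_force(target, 3))  # dog
```

Three lowercase letters means at most 26^3 = 17,576 attempts, which takes a fraction of a second. Ten lowercase letters would already mean about 1.4 × 10^14 attempts, and real inputs are not limited to lowercase letters or a known length, which is why the search quickly becomes impractical.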

That just about covers the bare minimum about hashing. From here, you can move on and watch/read part 2, or the blockchain series, if they are available.