MD5哈希的具体原理

首先我们假设我们将要对信息 $a$ 进行MD5哈希处理,我们知道每个字节有 8 个比特,我们可以把 $a$ 分成若干个 512 比特的分组,最后的分组补一个 0x80 的字符后,若该分组长度不等于 448 比特,则一直补字符 0x00 直到最后的分组为 448 比特,即若最后的分组长度大于 448 则多加一个分组然后一直补到下一个分区达到 448 比特。

之后我们再把信息 $a$ 的长度(不算后来的 0x80 以及 0x00)以小端的方法储存在最后的分区的最后 64 比特中,届时最后的分区也会达到 512 比特的长度。之后我们会通过64轮一系列的比特运算,代换,移位等方式,最后生成一个 128 比特的哈希信息,一般我们会把最后的哈希信息转为 16 进制信息进行输出。

Python 代码实现

我们不妨先把常量表打好:

# T Table
T = [[0xD76AA478, 0xE8C7B756, 0x242070DB, 0xC1BDCEEE, 0xF57C0FAF, 0x4787C62A, 0xA8304613, 0xFD469501, 0x698098D8, 0x8B44F7AF, 0xFFFF5BB1, 0x895CD7BE, 0x6B901122, 0xFD987193, 0xA679438E, 0x49B40821], 
     [0xF61E2562, 0xC040B340, 0x265E5A51, 0xE9B6C7AA, 0xD62F105D, 0x02441453, 0xD8A1E681, 0xE7D3FBC8, 0x21E1CDE6, 0xC33707D6, 0xF4D50D87, 0x455A14ED, 0xA9E3E905, 0xFCEFA3F8, 0x676F02D9, 0x8D2A4C8A], 
     [0xFFFA3942, 0x8771F681, 0x6D9D6122, 0xFDE5380C, 0xA4BEEA44, 0x4BDECFA9, 0xF6BB4B60, 0xBEBFBC70, 0x289B7EC6, 0xEAA127FA, 0xD4EF3085, 0x04881D05, 0xD9D4D039, 0xE6DB99E5, 0x1FA27CF8, 0xC4AC5665], 
     [0xF4292244, 0x432AFF97, 0xAB9423A7, 0xFC93A039, 0x655B59C3, 0x8F0CCC92, 0xFFEFF47D, 0x85845DD1, 0x6FA87E4F, 0xFE2CE6E0, 0xA3014314, 0x4E0811A1, 0xF7537E82, 0xBD3AF235, 0x2AD7D2BB, 0xEB86D391]]

ROLStep = [[7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22], 
           [5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20], 
           [4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23], 
           [6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21]]

然后我们再来写一些简单的逻辑函数,就像上面说的,我们每个 512 比特的分区都会经历 64 轮的比特运算,这 64 轮每 16 轮会换一个逻辑函数,所以总共有 fghi 四个逻辑函数,并且在后面三个逻辑函数中,我们会有信息代换,所以有 rho_2rho_3 以及 rho_4 三个函数:

# Logic Functions
f = lambda b, c, d : (b & c) | ((~b) & d)
g = lambda b, c, d : (b & d) | (c & (~d))
h = lambda b, c, d : b ^ c ^ d
i = lambda b, c, d : c ^ (b | (~d))
rho_2 = lambda i : (1 + 5 * (i)) % 16
rho_3 = lambda i : (5 + 3 * (i)) % 16
rho_4 = lambda i : (7 * (i)) % 16

之后我们每轮比特运算之后会有比特左循环移位,上面那个 ROLSteps 就是记录每轮之后移动的位数,我们接下来写左循环移位的函数:

def ROLs(x, y):
    x, y = int(x), int(y)
    mask1 = (1 << y) - 1
    return ((x >> (32 - y)) & mask1) | ((x << y) & ~mask1)

除了这些函数以外,我们需要注意的是 MD5 哈希是使用小端的方式对数据进行处理的,但是我们一般数据都是使用大端的方式储存,所以我们要专门写函数去转换小端和大端,下面我们来写加载和储存 32 比特大端以及储存 64 比特大段的函数,由于 64 比特大端只会在储存信息长度的时候用到,没有需要加载的场景,故不用写加载 64 比特大端的函数:

# Read 32bit Litte Endian From y to x
def load32L(y):
    # print(y)
    x = (((ord(y[3]) & 255) << 24) + ((ord(y[2]) & 255) << 16) + ((ord(y[1]) & 255) << 8) + (ord(y[0]) & 255))
    return x

# Store 32bit Little Endian From x to y
def store32L(x):
    y = [""] * 4
    y[0] = chr(x & 255)
    y[1] = chr((x>>8) & 255)
    y[2] = chr((x>>16) & 255)
    y[3] = chr((x>>24) & 255)
    return "".join(y)

def gen64LS(x):
    x, y = int(x), ""
    y += chr(x & 255)
    y += chr((x>>8) & 255)
    y += chr((x>>16) & 255)
    y += chr((x>>24) & 255)
    y += chr((x>>32) & 255)
    y += chr((x>>40) & 255)
    y += chr((x>>48) & 255)
    y += chr((x>>56) & 255)
    return y

之后我们可以来定义每轮比特运算的总函数了:

# Compress Hash Functions
def FF(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((f(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + f(b, c, d) + M + T[0][s]) % (2 ** 32), ROLStep[0][s])) % (2 ** 32)
    return d, a, b, c
    
def GG(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((g(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + g(b, c, d) + M + T[1][s]) % (2 ** 32), ROLStep[1][s])) % (2 ** 32)
    return d, a, b, c
    
def HH(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((h(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + h(b, c, d) + M + T[2][s]) % (2 ** 32), ROLStep[2][s])) % (2 ** 32)
    return d, a, b, c
    
def II(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((i(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + i(b, c, d) + M + T[3][s]) % (2 ** 32), ROLStep[3][s])) % (2 ** 32)
    return d, a, b, c

当我们对信息处理的时候,就是先把 512 比特分区以小段的方式读入,变成 16 个 64 比特的数值,并且定义 4 个状态寄存器 abcd 以及他们的初始值 0x674523010xEFCDAB890x98BADCFE0x10325476,之后我们顺序执行 FFGGHH 以及 II,对每个 512 比特分区执行上述步骤后四个状态寄存器的值就是我们的 MD5 哈希值了:

def compressMD5(msg, a, b, c, d):
    for i in range(int(len(msg) / 64)):
        aa, bb, cc, dd = a, b, c, d
        s = msg[64 * i : 64 * (i + 1)]
        w = [0] * 16
        for j in range(16):
            w[j] = load32L(s[4 * j: 4 * (j + 1)])
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[j])
            a, b, c, d = FF(a, b, c, d, w[j], j)
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_2(j)])
            a, b, c, d = GG(a, b, c, d, w[rho_2(j)], j)
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_3(j)])
            a, b, c, d = HH(a, b, c, d, w[rho_3(j)], j)
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_4(j)])
            a, b, c, d = II(a, b, c, d, w[rho_4(j)], j)
        
        # a, b, c, d = (a + aa) & 0xFFFFFFFF, (b + bb) & 0xFFFFFFFF, (c + cc) & 0xFFFFFFFF, (d + dd) & 0xFFFFFFFF
        a, b, c, d = a + aa, b + bb, c + cc, d + dd
    return a, b, c, d

最后我们把函数整合一下,并且在得到四个寄存器的值之后使用 16 进制的方式展现出来:

def hexDigest(hexNum):
    s = ""
    s += "%0.2x"%(hexNum & 255)
    s += "%0.2x"%((hexNum >> 8) & 255)
    s += "%0.2x"%((hexNum >> 16) & 255)
    s += "%0.2x"%((hexNum >> 24) & 255)
    return s

def hashMD5(msg):
    a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    msg = str(msg)
    strlen, endLen = len(msg), len(msg) % 64
    segments = []
    msg += chr(0x80)
    endLen += 1
    fill = 0

    while endLen > 56:
        while endLen < 64:
            msg += chr(fill)
            endLen = (endLen + 1) % 64
            # print("endLen =", endLen, "filled =", chr(fill), "Msg =", msg)
        endLen = 0

    while endLen < 56:
        msg += chr(fill)
        endLen = (endLen + 1) % 64

    msg = msg + gen64LS(strlen * 8)
    a, b, c, d = compressMD5(msg, a, b, c, d)
    # print(hex(a), hex(b), hex(c), hex(d))

    output = hexDigest(a) + hexDigest(b) + hexDigest(c) + hexDigest(d)

    return output

最后我们可以与系统函数库 hashlib 中的 md5 函数互相验证一下,即可知道算法正确与否。

完整代码

import hashlib

# T Table
T = [[0xD76AA478, 0xE8C7B756, 0x242070DB, 0xC1BDCEEE, 0xF57C0FAF, 0x4787C62A, 0xA8304613, 0xFD469501, 0x698098D8, 0x8B44F7AF, 0xFFFF5BB1, 0x895CD7BE, 0x6B901122, 0xFD987193, 0xA679438E, 0x49B40821], 
     [0xF61E2562, 0xC040B340, 0x265E5A51, 0xE9B6C7AA, 0xD62F105D, 0x02441453, 0xD8A1E681, 0xE7D3FBC8, 0x21E1CDE6, 0xC33707D6, 0xF4D50D87, 0x455A14ED, 0xA9E3E905, 0xFCEFA3F8, 0x676F02D9, 0x8D2A4C8A], 
     [0xFFFA3942, 0x8771F681, 0x6D9D6122, 0xFDE5380C, 0xA4BEEA44, 0x4BDECFA9, 0xF6BB4B60, 0xBEBFBC70, 0x289B7EC6, 0xEAA127FA, 0xD4EF3085, 0x04881D05, 0xD9D4D039, 0xE6DB99E5, 0x1FA27CF8, 0xC4AC5665], 
     [0xF4292244, 0x432AFF97, 0xAB9423A7, 0xFC93A039, 0x655B59C3, 0x8F0CCC92, 0xFFEFF47D, 0x85845DD1, 0x6FA87E4F, 0xFE2CE6E0, 0xA3014314, 0x4E0811A1, 0xF7537E82, 0xBD3AF235, 0x2AD7D2BB, 0xEB86D391]]

ROLStep = [[7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22], 
           [5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20], 
           [4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23], 
           [6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21]]

# Logic Functions
f = lambda b, c, d : (b & c) | ((~b) & d)
g = lambda b, c, d : (b & d) | (c & (~d))
h = lambda b, c, d : b ^ c ^ d
i = lambda b, c, d : c ^ (b | (~d))
rho_2 = lambda i : (1 + 5 * (i)) % 16
rho_3 = lambda i : (5 + 3 * (i)) % 16
rho_4 = lambda i : (7 * (i)) % 16

def ROLs(x, y):
    x, y = int(x), int(y)
    mask1 = (1 << y) - 1
    return ((x >> (32 - y)) & mask1) | ((x << y) & ~mask1)

# Compress Hash Functions
def FF(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((f(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + f(b, c, d) + M + T[0][s]) % (2 ** 32), ROLStep[0][s])) % (2 ** 32)
    return d, a, b, c
    
def GG(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((g(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + g(b, c, d) + M + T[1][s]) % (2 ** 32), ROLStep[1][s])) % (2 ** 32)
    return d, a, b, c
    
def HH(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((h(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + h(b, c, d) + M + T[2][s]) % (2 ** 32), ROLStep[2][s])) % (2 ** 32)
    return d, a, b, c
    
def II(a, b, c, d, M, s):
    # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((i(b, c, d)) % (2 ** 32)))
    a = (b + ROLs((a + i(b, c, d) + M + T[3][s]) % (2 ** 32), ROLStep[3][s])) % (2 ** 32)
    return d, a, b, c

# Read 32bit Litte Endian From y to x
def load32L(y):
    # print(y)
    x = (((ord(y[3]) & 255) << 24) + ((ord(y[2]) & 255) << 16) + ((ord(y[1]) & 255) << 8) + (ord(y[0]) & 255))
    return x

# Store 32bit Little Endian From x to y
def store32L(x):
    y = [""] * 4
    y[0] = chr(x & 255)
    y[1] = chr((x>>8) & 255)
    y[2] = chr((x>>16) & 255)
    y[3] = chr((x>>24) & 255)
    return "".join(y)

def gen64LS(x):
    x, y = int(x), ""
    y += chr(x & 255)
    y += chr((x>>8) & 255)
    y += chr((x>>16) & 255)
    y += chr((x>>24) & 255)
    y += chr((x>>32) & 255)
    y += chr((x>>40) & 255)
    y += chr((x>>48) & 255)
    y += chr((x>>56) & 255)
    return y

def compressMD5(msg, a, b, c, d):
    for i in range(int(len(msg) / 64)):
        aa, bb, cc, dd = a, b, c, d
        s = msg[64 * i : 64 * (i + 1)]
        w = [0] * 16
        for j in range(16):
            w[j] = load32L(s[4 * j: 4 * (j + 1)])
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[j])
            a, b, c, d = FF(a, b, c, d, w[j], j)
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_2(j)])
            a, b, c, d = GG(a, b, c, d, w[rho_2(j)], j)
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_3(j)])
            a, b, c, d = HH(a, b, c, d, w[rho_3(j)], j)
        for j in range(16):
            # print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_4(j)])
            a, b, c, d = II(a, b, c, d, w[rho_4(j)], j)
        
        # a, b, c, d = (a + aa) & 0xFFFFFFFF, (b + bb) & 0xFFFFFFFF, (c + cc) & 0xFFFFFFFF, (d + dd) & 0xFFFFFFFF
        a, b, c, d = a + aa, b + bb, c + cc, d + dd
    return a, b, c, d

def hexDigest(hexNum):
    s = ""
    s += "%0.2x"%(hexNum & 255)
    s += "%0.2x"%((hexNum >> 8) & 255)
    s += "%0.2x"%((hexNum >> 16) & 255)
    s += "%0.2x"%((hexNum >> 24) & 255)
    return s

def hashMD5(msg):
    a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    msg = str(msg)
    strlen, endLen = len(msg), len(msg) % 64
    segments = []
    msg += chr(0x80)
    endLen += 1
    fill = 0

    while endLen > 56:
        while endLen < 64:
            msg += chr(fill)
            endLen = (endLen + 1) % 64
            # print("endLen =", endLen, "filled =", chr(fill), "Msg =", msg)
        endLen = 0

    while endLen < 56:
        msg += chr(fill)
        endLen = (endLen + 1) % 64

    msg = msg + gen64LS(strlen * 8)
    a, b, c, d = compressMD5(msg, a, b, c, d)
    # print(hex(a), hex(b), hex(c), hex(d))

    output = hexDigest(a) + hexDigest(b) + hexDigest(c) + hexDigest(d)

    return output

# Main Program
if __name__ == "__main__":
    message = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#@$%^&*()'

    print("Original:", message)
    print("My MD5 Hash     :", hashMD5(message.encode("UTF-8").decode("UTF-8")))
    print("HashLib MD5 Hash:", hashlib.md5(message.encode("UTF-8")).hexdigest())