Optimize crc32 standard(poly:0x04C11DB7) and crc32
castagnoli(poly:0x1EDC6F41) with arm crc32 extension instructions.
For example, crc32 standard caculates(lookup crc32 table) 1812 bytes data,
reduced the time from 118 us to 14 us through optimization.
Performance improved ~700%
Signed-off-by: Jinliang Li <lijinliang1@lixiang.com>