Code Monkey home page Code Monkey logo

aes-vhdl's Introduction

AES-VHDL

VHDL Implementation of AES Algorithm

There are simple VHDL implementations of AES-128 encryption and decryption algorithms in this repository. This is actually my first experience in the VHDL implementation!

What is AES?

There might be a few people who deal with computers and hardwares, but have not heard the name of AES yet which is a famous symmetric block cipher. If you are among those who are not familiar with the AES, [1] is a good reference to understand how this algorithm works. This implementation is also based on [1].

Pipelined VS Loop Unrolled

Two famous kinds of hardware implementation techniques, are pipelining and loop-unrolling techniques. In pipelining, registers are inserted between each combinational processing element so that each input data block can be processed simultaneously in each processing element. An overview of a pipelined implementation of AES encryption algorithm is depicted in the following shape, where the round-i depicts the i'th round of AES encryption algorithm.

AES-pipelined-picture

The number of rounds of AES-128 encryption is 10 and an architecture implementing this cipher is called fully pipelined, when all data blocks of 10 rounds can be processed simultaneously. For a fully pipelined implementation of AES-128 ten 128-bit data registers are needed. The more data block we want to process simultaneously, the more registers and therefore, the more area we need for implementation.

In contrast to pipelining, in a loop-unrolling technique one or multiple rounds of the algorithm are processed in the same clock cycle. In the smallest case of a loop-unrolled implementation of AES which is depicted in the following shape, only one round of the algorithm is implemented as a combinational processing element and a data register is also used to store the result obtained in the previous clock cycle.

AES-loop-unrolled-picture

Therefore the next plaintext must be entered after 10 clock cycles, while in a fully pipelined architecture, in each clock cycle one new plaintext can be entered to the process of encryption. Although pipelined architecture has more throughput than the loop-unrolled architecture, it uses more area than the loop-unrolled one. Therefore fully pipelined architecture offer the highest performance and is suitable for the cases in which the area is not constrained. In contrast to fully pipelined architecture, the smallest case of loop-unrolled, which is sometimes called the round based implementation, uses the lowest area and is suitable for area constrained applications. This implementation is a loop-unrolled one, however, converting it to it's pipelined alternative is not too hard.

Encryption

For each round of AES encryption, a different sub-key is used as the round key, which is produced by the keyschedule algorithm. The following shape represents one round of keyschedule algorithm.

keyschedule-aes

If someone wants to use a fixed key, it is preferable to calculate all sub-keys once and use a lookup table to store sub-keys, instead of implementing keyschedule algorithm and recalculate the sub-keys frequently. This strategy is especially suitable for software implementations, where memory is not as constrained as hardware implementations. However, in this repository, the keyschedule is implemented based on the loop-unrolled technique to calculate the sub-keys on the fly. In other words, there is a dedicated part implementing keyschedule algorithm based on the loop-unrolled technique, to produce the required sub-key for each round on the fly.

Architecture

AES-Encryption-Architecture

Synthesis Report (Spartan6-xc6slx75-3fgg676)

You can find the details here : report.

Advanced HDL Synthesis Report

=========================================================================
Macro Statistics
# RAMs                                                 : 20
 256x8-bit single-port block Read Only RAM             : 4
 256x8-bit single-port distributed Read Only RAM       : 16
# Registers                                            : 264
 Flip-Flops                                            : 264
# Multiplexers                                         : 4
 128-bit 2-to-1 multiplexer                            : 3
 8-bit 2-to-1 multiplexer                              : 1
# Xors                                                 : 68
 128-bit xor2                                          : 1
 24-bit xor2                                           : 1
 32-bit xor2                                           : 3
 8-bit xor2                                            : 47
 8-bit xor3                                            : 16
=========================================================================

Device Utilization Summary

---------------------------
Selected Device : 6slx75fgg676-3 

Slice Logic Utilization: 
 Number of Slice Registers:             264  out of  93296     0%  
 Number of Slice LUTs:                 1104  out of  46648     2%  
    Number used as Logic:              1104  out of  46648     2%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:   1104
   Number with an unused Flip Flop:     840  out of   1104    76%  
   Number with an unused LUT:             0  out of   1104     0%  
   Number of fully used LUT-FF pairs:   264  out of   1104    23%  
   Number of unique control sets:         1

IO Utilization: 
 Number of IOs:                         387
 Number of bonded IOBs:                 387  out of    408    94%  

Specific Feature Utilization:
 Number of Block RAM/FIFO:                1  out of    172     0%  
    Number using Block RAM only:          1
 Number of BUFG/BUFGCTRLs:                1  out of     16     6%  

Timing Summary

---------------
Speed Grade: -3

   Minimum period: 5.813ns (Maximum Frequency: 172.031MHz)
   Minimum input arrival time before clock: 4.823ns
   Maximum output required time after clock: 5.588ns
   Maximum combinational path delay: No path found

Throughput

According to the synthesis report produced by the ISE Design Suite, the minimum clock period for our implementation when Spartan6-xc6slx75-3fgg676 is used as the target device is 5.813 nano seconds. Therefore, the throughput is equal to: 128 bits/(10*5.813 ns) = 2.2 Gb/s.

Synthesis Report (Artix7-xc7a200t-3-ffg1156)

You can find the details here: report

Advanced HDL Synthesis Report

=========================================================================
Macro Statistics
# RAMs                                                 : 20
 256x8-bit single-port block Read Only RAM             : 4
 256x8-bit single-port distributed Read Only RAM       : 16
# Registers                                            : 264
 Flip-Flops                                            : 264
# Multiplexers                                         : 4
 128-bit 2-to-1 multiplexer                            : 3
 8-bit 2-to-1 multiplexer                              : 1
# Xors                                                 : 68
 128-bit xor2                                          : 1
 24-bit xor2                                           : 1
 32-bit xor2                                           : 3
 8-bit xor2                                            : 47
 8-bit xor3                                            : 16
=========================================================================

Device Utilization Summary

---------------------------
Selected Device : 7a200tffg1156-3 

Slice Logic Utilization: 
 Number of Slice Registers:             264  out of  269200     0%  
 Number of Slice LUTs:                 1104  out of  134600     0%  
    Number used as Logic:              1104  out of  134600     0%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:   1104
   Number with an unused Flip Flop:     840  out of   1104    76%  
   Number with an unused LUT:             0  out of   1104     0%  
   Number of fully used LUT-FF pairs:   264  out of   1104    23%  
   Number of unique control sets:         1

IO Utilization: 
 Number of IOs:                         387
 Number of bonded IOBs:                 387  out of    500    77%  

Specific Feature Utilization:
 Number of Block RAM/FIFO:                1  out of    365     0%  
    Number using Block RAM only:          1
 Number of BUFG/BUFGCTRLs:                1  out of     32     3%  

Timing Summary

---------------
Speed Grade: -3

   Minimum period: 3.397ns (Maximum Frequency: 294.366MHz)
   Minimum input arrival time before clock: 1.649ns
   Maximum output required time after clock: 1.669ns
   Maximum combinational path delay: No path found

Throughput

According to the synthesis report produced by the ISE Design Suite, the minimum clock period for our implementation when Artix7-xc7a200t-3-ffg1156 is used as the target device is 3.397 nano seconds. Therefore the throughput is equal to: 128 bits/(10*3.397 ns) = 3.77 Gb/s.

Synthesis Report (Zynq-xc7z100-2-ffg1156)

You can find the details here: report

Advanced HDL Synthesis Report

=========================================================================
Macro Statistics
# RAMs                                                 : 20
 256x8-bit single-port block Read Only RAM             : 4
 256x8-bit single-port distributed Read Only RAM       : 16
# Registers                                            : 264
 Flip-Flops                                            : 264
# Multiplexers                                         : 4
 128-bit 2-to-1 multiplexer                            : 3
 8-bit 2-to-1 multiplexer                              : 1
# Xors                                                 : 68
 128-bit xor2                                          : 1
 24-bit xor2                                           : 1
 32-bit xor2                                           : 3
 8-bit xor2                                            : 47
 8-bit xor3                                            : 16
=========================================================================

Device Utilization Summary

---------------------------

Selected Device : 7z100ffg1156-2 

Slice Logic Utilization: 
 Number of Slice Registers:             264  out of  554800     0%  
 Number of Slice LUTs:                 1104  out of  277400     0%  
    Number used as Logic:              1104  out of  277400     0%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:   1104
   Number with an unused Flip Flop:     840  out of   1104    76%  
   Number with an unused LUT:             0  out of   1104     0%  
   Number of fully used LUT-FF pairs:   264  out of   1104    23%  
   Number of unique control sets:         1

IO Utilization: 
 Number of IOs:                         387
 Number of bonded IOBs:                 387  out of    400    96%  

Specific Feature Utilization:
 Number of Block RAM/FIFO:                1  out of    755     0%  
    Number using Block RAM only:          1
 Number of BUFG/BUFGCTRLs:                1  out of     32     3%  

Timing Summary

---------------
Speed Grade: -2

   Minimum period: 3.369ns (Maximum Frequency: 296.789MHz)
   Minimum input arrival time before clock: 1.653ns
   Maximum output required time after clock: 1.540ns
   Maximum combinational path delay: No path found

Throughput

According to the synthesis report produced by the ISE Design Suite, the minimum clock period for our implementation when Zynq-xc7z100-2-ffg1156 is used as the target device is 3.369 nano seconds. Therefore the throughput is equal to 128 bits/(10*3.369 ns) = 3.80 Gb/s. Note that the throughput for a fully pipelined design operating in ECB mode is 128 bits/(3.369 ns) = 38 Gb/s.

Simulation

In order to verify the correctness of the encryption's implementation, a testbech has been prepared, to compare two different ciphertexts obtained from this implementation, with the expected true ciphertexts taken from [1] as the reference. This implementation passes the verification and you can see a capture of the waveform obtained from the simulation via ISim simulator, in the following picture.

AES-Encryption Simulation

Test Vector

To validate the correctness of this implementation, I've used the test vectors provided by NIST, which is publicly avialble via the following link:

https://csrc.nist.gov/CSRC/media/Projects/Cryptographic-Standards-and-Guidelines/documents/examples/AES_Core128.pdf

For example, according to the NIST reference, the encryption of plaintext 6BC1BEE2 2E409F96 E93D7E11 7393172A under the key 2B7E1516 28AED2A6 ABF71588 09CF4F3C, must yield the ciphertext: 3AD77BB4 0D7A3660 A89ECAF3 2466EF97. It should be noted that in these representations each byte is denoted by two hex digits. However, the bytes are arranged column-wise. Hence, to convert these representations into the 128-bit hex values, the order of bytes should be reversed as the last cell (bottom right) in the state array of AES has been considered the most significant byte in my implementation. For instance, the hex vectors corresponding to the above test vectors in our implementation are as follows:

plaintext:  x"2a179373117e3de9969f402ee2bec16b"
key:        x"3c4fcf098815f7aba6d2ae2816157e2b"
ciphertext: x"97ef6624f3ca9ea860367a0db47bd73a"

Decryption

In order to implement the decryption algorithm of AES-128, one must invert the SubBytes, ShiftRows and MixColumns operations. Let's depict the inverse of SubBytes, ShiftRows and MixColumns operations by InvSubBytes, InvShiftRows and InvMixColumns respectively

While the area required to implement InvSubBytes and InvShiftRows are the same as the area required for implementation of SubBytes and ShiftRows respectively, the hardware implementation of InvMixColumns is a challenging problem since it's matrix is more dense than the MixColumn's matrix and therefore, uses more area than the MixColumns. In this implementation we use a technique taken from page 55 of [1] to overcome this problem. According to this technique, InvMixColumns can be implemented as a simple preprocessing step followed by the MixColumn step.

For a given key, the sub-keys must be used in reverse order to decrypt a ciphertext. Therefore, the encryption's keyschedule can not be used for the decryption and a different keyschedule algorithm is needed. If one has access to the last round sub-key and perform the inverse of keyschedule round function on it repeatedly, sub-keys are obtained in reverse order. In our implementation for decryption algorithm, it is assumed that the last round sub-key is stored as the decryption key for the later reuse and it is assumed to be the input of the decryption's keyschedule algorithm.

Architecture

AES-Decryption Architecture

Synthesis Report (Spartan6-xc6slx75-3fgg676)

You can find the details here: report

Advanced HDL Synthesis Report

=========================================================================
Macro Statistics
# RAMs                                                 : 20
 256x8-bit single-port distributed Read Only RAM       : 20
# Adders/Subtractors                                   : 2
 5-bit adder                                           : 1
 8-bit subtractor                                      : 1
# Registers                                            : 260
 Flip-Flops                                            : 260
# Multiplexers                                         : 3
 128-bit 2-to-1 multiplexer                            : 3
# Xors                                                 : 107
 128-bit xor2                                          : 1
 24-bit xor2                                           : 1
 32-bit xor2                                           : 3
 8-bit xor2                                            : 86
 8-bit xor3                                            : 16
=========================================================================

Device Utilization Summary

---------------------------

Selected Device : 6slx75fgg676-3 

Slice Logic Utilization: 
 Number of Slice Registers:             264  out of  93296     0%  
 Number of Slice LUTs:                 1474  out of  46648     3%  
    Number used as Logic:              1474  out of  46648     3%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:   1478
   Number with an unused Flip Flop:    1214  out of   1478    82%  
   Number with an unused LUT:             4  out of   1478     0%  
   Number of fully used LUT-FF pairs:   260  out of   1478    17%  
   Number of unique control sets:         2

IO Utilization: 
 Number of IOs:                         387
 Number of bonded IOBs:                 387  out of    408    94%  

Specific Feature Utilization:
 Number of BUFG/BUFGCTRLs:                1  out of     16     6%  

Timing Summary

---------------
Speed Grade: -3

   Minimum period: 6.740ns (Maximum Frequency: 148.361MHz)
   Minimum input arrival time before clock: 4.726ns
   Maximum output required time after clock: 5.102ns
   Maximum combinational path delay: No path found

Throughput

According to the synthesis report produced by the ISE Design Suite, the minimum clock period for our implementation when Spartan6-xc6slx75-3fgg676 is used as the target device is 6.740 nano seconds. Therefore the throughput is equal to: 128 bits/(10*6.740 ns) = 1.90 Gb/s.

Synthesis Report (Artix7-xc7a200t-3-ffg1156)

You can find the details here: report

Advanced HDL Synthesis Report

=========================================================================
Macro Statistics
# RAMs                                                 : 20
 256x8-bit single-port distributed Read Only RAM       : 20
# Adders/Subtractors                                   : 2
 5-bit adder                                           : 1
 8-bit subtractor                                      : 1
# Registers                                            : 260
 Flip-Flops                                            : 260
# Multiplexers                                         : 3
 128-bit 2-to-1 multiplexer                            : 3
# Xors                                                 : 107
 128-bit xor2                                          : 1
 24-bit xor2                                           : 1
 32-bit xor2                                           : 3
 8-bit xor2                                            : 86
 8-bit xor3                                            : 16
=========================================================================

Device Utilization Summary

---------------------------
Selected Device : 7a200tffg1156-3 

Slice Logic Utilization: 
 Number of Slice Registers:             260  out of  269200     0%  
 Number of Slice LUTs:                 1490  out of  134600     1%  
    Number used as Logic:              1490  out of  134600     1%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:   1490
   Number with an unused Flip Flop:    1230  out of   1490    82%  
   Number with an unused LUT:             0  out of   1490     0%  
   Number of fully used LUT-FF pairs:   260  out of   1490    17%  
   Number of unique control sets:         2

IO Utilization: 
 Number of IOs:                         387
 Number of bonded IOBs:                 387  out of    500    77%  

Specific Feature Utilization:
 Number of BUFG/BUFGCTRLs:                1  out of     32     3%  

Timing Summary

---------------
Speed Grade: -3

   Minimum period: 3.940ns (Maximum Frequency: 253.790MHz)
   Minimum input arrival time before clock: 1.177ns
   Maximum output required time after clock: 1.356ns
   Maximum combinational path delay: No path found

Throughput

According to the synthesis report produced by the ISE Designe Suite, the minimum clock period for our implementation when Artix7-xc7a200t-3-ffg1156 is used as the target device is 3.940 nano seconds. Therefore the throughput is equal to: 128 bits/(10*3.940 ns) = 3.25 Gb/s.

Synthesis Report (Zynq-xc7z100-2-ffg1156)

You can find the details here: report

Advanced HDL Synthesis Report

=========================================================================
Macro Statistics
# RAMs                                                 : 20
 256x8-bit single-port distributed Read Only RAM       : 20
# Adders/Subtractors                                   : 2
 5-bit adder                                           : 1
 8-bit subtractor                                      : 1
# Registers                                            : 260
 Flip-Flops                                            : 260
# Multiplexers                                         : 3
 128-bit 2-to-1 multiplexer                            : 3
# Xors                                                 : 107
 128-bit xor2                                          : 1
 24-bit xor2                                           : 1
 32-bit xor2                                           : 3
 8-bit xor2                                            : 86
 8-bit xor3                                            : 16
=========================================================================

Device Utilization Summary

---------------------------
Selected Device : 7z100ffg1156-2 

Slice Logic Utilization: 
 Number of Slice Registers:             260  out of  554800     0%  
 Number of Slice LUTs:                 1490  out of  277400     0%  
    Number used as Logic:              1490  out of  277400     0%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:   1490
   Number with an unused Flip Flop:    1230  out of   1490    82%  
   Number with an unused LUT:             0  out of   1490     0%  
   Number of fully used LUT-FF pairs:   260  out of   1490    17%  
   Number of unique control sets:         2

IO Utilization: 
 Number of IOs:                         387
 Number of bonded IOBs:                 387  out of    400    96%  

Specific Feature Utilization:
 Number of BUFG/BUFGCTRLs:                1  out of     32     3%  

Timing Summary

---------------
Speed Grade: -2

   Minimum period: 3.324ns (Maximum Frequency: 300.806MHz)
   Minimum input arrival time before clock: 1.185ns
   Maximum output required time after clock: 1.243ns
   Maximum combinational path delay: No path found

Throughput

According to the synthesis report produced by the ISE Designe Suite, the minimum clock period for our implementation when Zynq-xc7z100-2-ffg1156 is used as the target device is 3.324 nano seconds. Therefore the throughput is equal to: 128 bits/(10*3.324 ns) = 3.85 Gb/s.

Simulation

In order to verify the correctness of the decryption's implementation, a testbech has been prepared in which the equality of the derived plaintext with the refernce values taken from [1] is checked for two different ciphertexts and keys. As you can see in the following picture, this implementation passes the verification and hence it works correctly.

AES-Decryption Simulation

References

[1] The Design of Rijndael: AES - The Advanced Encryption Standard

aes-vhdl's People

Contributors

hadipourh avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

aes-vhdl's Issues

AES-DEC controller.vhd modelsim simulation error

controller.vhd at simulation throws range error for:
49 round_constant <= rcon_table((i + 1)*8 - 1 downto 8*i);
'integer range 0 to (2**4) - 1'
So it is going out of range. A simple workaround is to range it from 0 to 10 however I do not know it that is neat.

Decryption key

what if I do not have the last round sub-key and I only have the key from the encryption key, do I have to generate all the keys for all the round and use the last key generated or is there another way?

Verification

Can you recheck if the core even works for the following example which is provided in the TB but commented out:

  --p = 3243f6a8885a308d313198a2e0370734
  --k = 2b7e151628aed2a6abf7158809cf4f3c		
  --c = 3925841d02dc09fbdc118597196a0b32

Converting to pipeline

Hi, Can I check how would you convert it to pipeline architecture, I have tried coding out the method which stores all the round keys for pipeline architecture but I wanted to try using this method which calculate the round key on the fly which might help increase the speed

i like to make a contact with you

Hello MR hadipour
i will be very happy when i see someone from my country in GIT
i would like to contact you and ask a few questions if it doesnt bother you .
please tell me if you have whatsapp or telegram or skype ..
regards
Mohsen Mirzaei

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.