Add rtl for friendly_modulo
@@ -1,153 +1,36 @@
# Notes

# Overall Notes

Since we are designing this for a 64-bit datapath, we need to be able to compute 64 bits every cycle. The ChaCha20 hash works on groups of 16 32-bit words, or 512-bit blocks, at a time. Logically it might make more sense to have a datapath of 128 bits.

We need to support 25Gbps, and we will have 2 datapaths, tx and rx.

On the other hand, each operation is a 32-bit operation. It might make more sense for timing reasons, then, to have each operation registered. But will this be able to match the throughput that we need?

Each quarter round generates 4 words, and each cycle updates all 128 bits at once. We can do 4 of the quarter rounds at once, so at the end of each cycle we will generate 512 bits.
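
For reference, the quarter round itself (RFC 8439) is four add/xor/rotate steps on 32-bit words; a minimal Python sketch to make the operation widths concrete:

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    # Four add/xor/rotate steps per RFC 8439, all on 32-bit words.
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d
```
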
At full speed, then, the core would generate 512 bits per cycle, but we only need to generate 64 bits per cycle. We could do only 1 quarter round at a time, which would generate only 128 bits per cycle, but we would need some sort of structure to reorder the state so that it is ready to xor with the incoming data. We could even make this parameterizable, but that would be the next step if we actually need to support 100Gbps encryption.

So in summary, we will have a single QuarterRound module which generates 128 bits of output. We will have a scheduling block which schedules which 4 words of state go into the quarter round module, and a de-interleaver which takes the output from the quarter round module and re-orders it into the correct order to combine with the incoming data. There is also an addition in there somewhere (the final add of the input state to the round output).

To support AEAD, the first ChaCha20 block (counter 0) becomes the key for the Poly1305 block. This can be done in parallel with the second block, which becomes the cipher keystream, at the expense of double the gates. Otherwise, there would be a delay between packets while this key is generated.

At a 128-bit datapath, 25Gbps works out to about 200MHz, but let's aim for 250MHz.

Okay, so we did some timing tests, and we can easily do 1 round of ChaCha20 in a single cycle on a Titanium FPGA at 250MHz (it actually closes timing at ~350-400MHz).

# ChaCha20 Notes

So it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, which is 6.4Gbps at 250MHz. We will then need 2 of these for 10Gbps.
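
A quick sanity check of those numbers, assuming the 250MHz target:

```python
F_CLK = 250e6      # target clock from the timing tests above
BLOCK_BITS = 512
CYCLES = 20        # 1 round per cycle, 20 rounds per block

bits_per_cycle = BLOCK_BITS / CYCLES        # 25.6 bits/cycle
print(bits_per_cycle * F_CLK / 1e9)         # 6.4 Gbps per core
```
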
So in order to use multiple cores, we would calculate 1024 bits in 20 cycles. Then we would put those bits into a memory or something and start calculating the next 1024 bits. Those bits would all be used up in 16 cycles (1024 bits at 64 bits/cycle), but the throughput still checks out. Once they are used, we load the memory with the new output.

This puts a 20-cycle minimum on small packets, since the core is not completely pipelined. That puts a hard cap at 12.5Mpps. At 42-byte packets this is 4.2Gbps, and for 64-byte packets it is 6.4Gbps. In order to saturate a 10Gbps link, you would need packets of at least 100 bytes.
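
The packet-rate math, again assuming 250MHz and the 20-cycle minimum:

```python
F_CLK = 250e6
LATENCY = 20                  # cycles before a small packet's keystream is ready
pps = F_CLK / LATENCY         # 12.5 Mpps hard cap

for size_bytes in (42, 64, 100):
    print(size_bytes, "B:", pps * size_bytes * 8 / 1e9, "Gbps")   # 4.2, 6.4, 10.0
```
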
This is with the 20-cycle minimum, though in reality it would be more like 25 or 30 cycles with the final addition, scheduling, pipelining, etc. Adding more cores increases the throughput for larger packets, but does nothing for small packets, since the latency is the same. To solve this, we could instantiate the entire core twice, so that we could handle 2 minimum-size packets at the same time.

If we say there is a 30-cycle latency, the worst case is 2.8Gbps. Doubling the number of cores gives 5.6Gbps; quadrupling gives 11.2Gbps. This would of course more than quadruple the area, since we need 4x the cores as well as the mux and demux between them.

This could be configurable at compile time, though. The number of ChaChas per core would also be configurable, but at the moment I choose 2.

Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if we want 8 QRs per core instead of 4 for timing reasons.

Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's kind of a lot. A fully pipelined design would use 322\*20\*4, or 25k XLR. If we can pass timing using 10k LUTs then that would be nice. We get a peak throughput of 50Gbps; it's just that the latency kills our packet rate. If we reduce the latency to 25 cycles and have 2 alternating cores, our packet rate would be 20Mpps, increasing with every cycle we take off. I think that is good. This would result in 5k XLR, which is not so bad.

ChaCha20 operates on 512-bit blocks. Each round is made of 4 quarter rounds, which are identical except for which 32-bit words are used. We can use the same 32-bit quarter round 4 times in a row, but we need to store the rest of the state between operations, so memory usage might be similar to doing all 4 at once, while the logic would be only 25% as much. Because we switch between odd and even rounds, the data used in one round is not the data used in the other round; the word groupings are sketched below.
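
For reference, the odd ("column") and even ("diagonal") groupings from RFC 8439, reusing the quarter_round sketch from earlier in these notes:

```python
# Word-index groupings per RFC 8439: odd (column) rounds and even
# (diagonal) rounds feed different 4-word groups to the QR module.
COLUMN_QRS   = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONAL_QRS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

def double_round(state):
    # One column round then one diagonal round; a time-multiplexed QR
    # module would walk these groups one at a time.
    for group in (COLUMN_QRS, DIAGONAL_QRS):
        for a, b, c, d in group:
            state[a], state[b], state[c], state[d] = quarter_round(
                state[a], state[b], state[c], state[d])
    return state
```
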
Okay, so starting over now: our clock speed cannot be 250MHz; the best we can do is 200MHz. If we assume this same 25-cycle latency, that's 4Gbps per block (4096Mbps, to be exact), so we would need 3 of them to surpass 10Gbps. So now we need 3 blocks instead of 2.

# Poly1305

We are barely going to be able to pass timing at 180MHz. Maybe the fully pipelined core is a better idea, but we can just fully pipeline a quarter stage and generate 512 bits every 4 clock cycles. This would give us a theoretical throughput of 32Gbps, and we would not have to worry about latency and small packets slowing us down. Let's experiment with what that would look like.
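
A quick check of that throughput claim; this assumes the 250MHz target rather than the 180MHz figure above:

```python
F_CLK = 250e6
print(512 / 4 * F_CLK / 1e9)   # 512 bits every 4 cycles -> 32.0 Gbps
```
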
## Parallel Operation

For our single round it is using 1024 adders, which almost sounds like it is instantiating 8 quarter rounds instead of just 4. Either way, we can say that a quarter round is 128ff + 128add + 250lut.

We can calculate in parallel, but we need to calculate r^n, where n is the number of parallel stages. Ideally we would have the number of parallel stages be equal to the latency of the full stage; that way it could be fully pipelined. For example, if it took 8 cycles per block, we would have 8 parallel calculations. This requires you to calculate r^n as well as every intermediate value: if we do 8, then we need to calculate r^1, r^2, r^3, etc. This takes n-1 multiplies, only log2(n) of which are sequentially dependent (see the r-power tree at the end of these notes). A sketch of the interleaved evaluation is below.

So pipelining 20 of these gives 10k LUTs. Not so bad.

Actually it's 88k LUTs... and 512ff \* 4 \* 20 = 40k FF.
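
Back to the parallel evaluation: a minimal Python sketch under the assumptions above (illustrative names, not the RTL's). Each of the n lanes multiplies by r^n per step, and lane j is folded back in with a final multiply by r^(n-j):

```python
import random

P = 2**130 - 5

def poly1305_serial(blocks, r):
    # Reference Horner evaluation: sum(m[i] * r^(q-i)) mod P,
    # where blocks are the already-padded message blocks.
    acc = 0
    for m in blocks:
        acc = (acc + m) * r % P
    return acc

def poly1305_parallel(blocks, r, n):
    # n interleaved accumulators, each stepping by r^n.
    assert len(blocks) % n == 0   # keep the sketch short
    rn = pow(r, n, P)
    acc = [0] * n
    for k in range(0, len(blocks), n):
        for j in range(n):
            acc[j] = (acc[j] * rn + blocks[k + j]) % P
    # Lane j still owes a factor of r^(n-j).
    return sum(acc[j] * pow(r, n - j, P) for j in range(n)) % P

blocks = [random.getrandbits(129) for _ in range(16)]
r = random.getrandbits(124)
assert poly1305_parallel(blocks, r, 8) == poly1305_serial(blocks, r)
```
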
Let's just leave the design as is for now, even if it's overkill. The hardware would support up to 40Gbps, and technically the FPGA has 16 lanes, so it could do 160Gbps in total if we designed a custom board for it (or 120Gbps if we used FMC connectors).

If we only use a single quarter round multiplexed between all 4, then the same quarter round module can have 2 different blocks going through it at once.

The new one multiplexes 4 quarter rounds onto 1 QR module, which reduces the logic usage down to only 46k LE, of which the vast majority is flops (2k FF per round, 0.5k LUT).

# Modulo 2^130-5

We can use a trick here to do the modulo reduction much faster.

If we split the bits at 2^130, leaving 129 high bits and 130 low bits, we now have a 129-bit value multiplied by 2^130, plus the 130-bit value. We know that 2^130 mod 2^130-5 is 5, so we can replace that 2^130 with 5 and add, then repeat that step again.

Ex.

    x = x1*2^130 + x2
    x mod 2^130-5 = (x1*5 + x2) mod 2^130-5;  let x3 = x1*5 + x2
    x3 = x4*2^130 + x5
    x mod 2^130-5 = (x4*5 + x5) mod 2^130-5
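
The same reduction as a small Python reference model (my names, not the RTL's; this matches what the poly1305_friendly_modulo module below computes with a shift amount of 0):

```python
import random

P = 2**130 - 5
MASK = 2**130 - 1

def friendly_mod(x):
    # Two rounds of the 2^130 -> 5 substitution, then one conditional
    # subtract; enough to fully reduce any value up to 259 bits wide.
    x = (x >> 130) * 5 + (x & MASK)   # first round
    x = (x >> 130) * 5 + (x & MASK)   # second round
    return x - P if x >= P else x

for _ in range(1000):
    v = random.getrandbits(259)
    assert friendly_mod(v) == v % P
```
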
And let's do the math to verify that we only need two rounds. The maximum value that we could possibly get is 2^131-1, and the maximum value for r is 0x0ffffffc0ffffffc0ffffffc0fffffff. Multiplying these together gives us:

    0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001

Applying the first round to this we get

    0x1ffffff81ffffff81ffffff81ffffffd * 5 + 0x3f0000003f0000003f0000003f0000001
    = 0x48fffffdc8fffffdc8fffffdc8ffffff2

Applying the second round to this we get

    1 * 5 + 0x8fffffdc8fffffdc8fffffdc8ffffff2 = 0x8fffffdc8fffffdc8fffffdc8ffffff7

and this is indeed the correct answer. The bottom part is 130 bits, but since we put in the max values and it didn't overflow, I don't think it will overflow here.
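
And the same worst case checked mechanically, using the friendly_mod sketch above:

```python
R_MAX = 0x0ffffffc0ffffffc0ffffffc0fffffff   # maximum clamped r
x = (2**131 - 1) * R_MAX
assert x == 0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001
assert friendly_mod(x) == x % P == 0x8fffffdc8fffffdc8fffffdc8ffffff7
```
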
131+128 = 259 bits, so we only have to do this once.

    0xb83fe991ca75d7ef2ab5cba9cccdfd938b73fff384ac90ed284034da565ecf
    0x19471c3e3e9c1bfded81da3736e96604a

Kind of curious now: at what point does a ripple-carry adder using the dedicated CI/CO ports become slower than a more complex adder like carry-lookahead or carry-save (Wallace tree)?

The r-power tree; each line is one level of dependent multiplies:

    r*r = r^2
    r*r^2 = r^3    r^2*r^2 = r^4
    r^4*r = r^5    r^2*r^4 = r^6    r^3*r^4 = r^7    r^4*r^4 = r^8
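
The same schedule in Python, as a sketch; it pairs factors slightly differently than the tree above, but the depth is the same (n-1 multiplies, ceil(log2(n)) dependent levels):

```python
P = 2**130 - 5

def r_powers(r, n):
    # pw[i] = r^i mod P, built from pw[i//2] and pw[(i+1)//2], so each
    # power is ready after ceil(log2(i)) dependent multiply levels.
    pw = [1, r % P]
    for i in range(2, n + 1):
        pw.append(pw[i // 2] * pw[(i + 1) // 2] % P)
    return pw[1:]   # [r^1, r^2, ..., r^n]

r = 12345
assert r_powers(r, 8) == [pow(r, i, P) for i in range(1, 9)]
```
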
@@ -4,4 +4,10 @@ tests:
    modules:
      - "poly1305_core"
    sources: "sources.list"
    waves: True
  - name: "friendly_modulo"
    toplevel: "poly1305_friendly_modulo"
    modules:
      - "poly1305_friendly_modulo"
    sources: "sources.list"
    waves: True

ChaCha20_Poly1305_64/sim/poly1305_friendly_modulo.py (new file, 62 lines)
@@ -0,0 +1,62 @@
import logging
import random

import cocotb
from cocotb.clock import Clock
from cocotb.triggers import Timer, RisingEdge

PRIME = 2**130 - 5

CLK_PERIOD = 4


class TB:
    def __init__(self, dut):
        self.dut = dut

        self.log = logging.getLogger("cocotb.tb")
        self.log.setLevel(logging.INFO)

        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())

    async def cycle_reset(self):
        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)

    async def _cycle_reset(self, rst, clk):
        rst.setimmediatevalue(0)
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 1
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 0
        await RisingEdge(clk)
        await RisingEdge(clk)


@cocotb.test()
async def test_sanity(dut):
    tb = TB(dut)

    await tb.cycle_reset()

    value_a = random.randint(1, 2**(130+16))

    # value_a = PRIME + 1000000
    tb.dut.i_valid.value = 1
    tb.dut.i_val.value = value_a
    tb.dut.i_shift_amount.value = 0  # split at bit 130 (no extra shift)
    await RisingEdge(tb.dut.i_clk)
    tb.dut.i_valid.value = 0
    tb.dut.i_val.value = 0

    await RisingEdge(tb.dut.o_valid)
    value = tb.dut.o_result.value.integer

    tb.log.info("expected: %x", value_a % PRIME)
    tb.log.info("received: %x", value)
    assert value == value_a % PRIME

    await Timer(1, "us")

ChaCha20_Poly1305_64/src/poly1305_friendly_modulo.sv (new file, 48 lines)
@@ -0,0 +1,48 @@
module poly1305_friendly_modulo #(
    parameter WIDTH = 130,
    parameter MDIFF = 5, // modulo difference: the modulus is 2^WIDTH - MDIFF
    parameter SHIFT_SIZE = 26
) (
    input logic i_clk,
    input logic i_rst,

    input logic i_valid,
    input logic [2*WIDTH-1:0] i_val,
    input logic [2:0] i_shift_amount,

    output logic o_valid,
    output logic [WIDTH-1:0] o_result
);

localparam WIDE_WIDTH = WIDTH + $clog2(MDIFF);
localparam [WIDTH-1:0] PRIME = (1 << WIDTH) - MDIFF;

logic [WIDE_WIDTH-1:0] high_part_1, high_part_2;
logic [WIDTH-1:0] low_part_1, low_part_2;

logic [WIDE_WIDTH-1:0] intermediate_val;
logic [WIDTH-1:0] final_val;

logic [WIDE_WIDTH-WIDTH-1:0] unused_final;

logic [2:0] valid_sr;

// Result of the first substitution round: high*MDIFF + low.
assign intermediate_val = high_part_1 + WIDE_WIDTH'(low_part_1);

// Conditional subtract folds the result into [0, PRIME).
assign o_result = (final_val >= PRIME) ? final_val - PRIME : final_val;

assign o_valid = valid_sr[2];

always_ff @(posedge i_clk) begin
    if (i_rst)
        valid_sr <= '0;
    else
        valid_sr <= {valid_sr[1:0], i_valid};

    // Stage 1: split at bit WIDTH (adjustable by i_shift_amount) and
    // substitute the 2^WIDTH factor with MDIFF, since 2^WIDTH mod PRIME = MDIFF.
    high_part_1 <= WIDTH'({3'b0, i_val} >> (WIDTH - (i_shift_amount*SHIFT_SIZE))) * MDIFF;
    low_part_1 <= WIDTH'(i_val << (i_shift_amount*SHIFT_SIZE));

    // Stage 2: one more round of the same substitution.
    high_part_2 <= (intermediate_val >> WIDTH) * MDIFF;
    low_part_2 <= intermediate_val[WIDTH-1:0];

    // Stage 3: final sum; any carry bits land in unused_final.
    {unused_final, final_val} <= high_part_2 + WIDE_WIDTH'(low_part_2);
end

endmodule

@@ -3,4 +3,5 @@ chacha20_block.sv
chacha20_pipelined_round.sv
chacha20_pipelined_block.sv

poly1305_core.sv
poly1305_friendly_modulo.sv