diff --git a/ChaCha20_Poly1305_64/doc/notes.md b/ChaCha20_Poly1305_64/doc/notes.md
index ac4f2b2..5240428 100644
--- a/ChaCha20_Poly1305_64/doc/notes.md
+++ b/ChaCha20_Poly1305_64/doc/notes.md
@@ -1,153 +1,36 @@
-# Notes
+# Overall Notes

-Since we are designing this for a 64 bit datapath, we need to be able to
-compute 64 bits every cycle. The ChaCha20 hash works on groups of 16x32 bit
-words, or 512-bit blocks at a time. Logically it might make more sense to
-have a datapath of 128 bits.
+We need to support 25Gbps, and we will have 2 datapaths, TX and RX.

-On the other hand, each operation is a 32 bit operation. It might make more
-sense for timing reasons, then, to have each operation registered. But will
-this be able to match the throughput that we need?
-
-Each quarter round generates 4 words. Each cycle updates all 128 bits at once.
-We can do 4 of the quarter rounds at once, so at the end of each cycle we will
-generate 512 bits.
-
-At full speed then, the core would generate 512 bits per cycle, but we would
-only need to generate 64 bits per cycle. We could only do 1 quarter round at
-once, which would only generate 128 bits per cycle, but we would need some sort
-of structure to reorder the state such that it is ready to XOR with the
-incoming data. We could even make this parameterizable, but that would be the
-next step if we actually need to support 100Gbps encryption.
-
-So in summary, we will have a single QuarterRound module which generates 128
-bits of output. We will have a scheduling block which schedules which 4 words
-of state go into the quarter round module, and a de-interleaver which takes the
-output from the quarter round module and re-orders it to be in the correct
-order to combine with the incoming data. There is also an addition in there
-somewhere.
-
-To support AEAD, the first block becomes the key for the Poly1305 block. This
-can be done in parallel with the second block, which becomes the cipher, at the
-expense of double the gates. Otherwise, there would be a delay in between
-packets as this is generated.
+At a 128 bit datapath, this is 200MHz, but let's aim for 250MHz.

-Okay so we did some timing tests and we can easily do 1 round of ChaCha20 in a
-single cycle on a Titanium FPGA at 250MHz (~350-400 MHz).

+# ChaCha20 Notes

-So then it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or
-6.4Gbps. So then we will need 2 of these for 10Gbps.
-
-So in order to use multiple cores, we would calculate 1024 bits in 20 cycles.
-Then we would put those bits into a memory or something and start calculating
-the next 1024 bits. Those bits would all be used up in 16 cycles (but the
-throughput still checks out). Once they are used, we load the memory with the
-new output.
-
-This puts a 20 cycle minimum on small packets since the core is not completely
-pipelined. This puts a hard cap at 12.5Mpps. At 42 byte packets, this is
-4.2Gbps, and for 64 byte packets is 6.4Gbps. In order to saturate the link, you
-would need packets of at least 100 bytes.
-
-This is with the 20 cycle minimum, though in reality it would be more like 25
-or 30 with the final addition, scheduling, pipelining etc. Adding more cores
-increases the throughput for larger packets, but does nothing for small packets
-since the latency is the same. To solve this, we could instantiate the entire
-core twice, such that we could handle 2 minimum size packets at the same time.
-
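+A quick Python sketch to sanity check these packet-rate numbers (the 250MHz
+clock, the 20/30 cycle block latencies, and the `worst_case_gbps` helper are
+assumptions taken from these notes, not anything in the RTL):
+
+```python
+CLOCK_HZ = 250e6  # target clock from the timing tests above
+
+def worst_case_gbps(latency_cycles, cores=1, packet_bytes=42):
+    # One minimum-size packet per `latency_cycles` per core caps the rate.
+    packets_per_sec = cores * CLOCK_HZ / latency_cycles
+    return packets_per_sec * packet_bytes * 8 / 1e9
+
+worst_case_gbps(20)                   # 12.5 Mpps -> 4.2 Gbps at 42 bytes
+worst_case_gbps(20, packet_bytes=64)  # 6.4 Gbps at 64 bytes
+worst_case_gbps(30)                   # ~2.8 Gbps worst case
+worst_case_gbps(30, cores=4)          # ~11.2 Gbps with 4 cores
+```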
-If we say there is a 30 cycle latency, the worst case is 2.8Gbps. Doubling the
-number of cores gives 5.6Gbps, quadrupling the number of cores gives 11.2Gbps.
-This would of course more than quadruple the area, since we need 4x the cores
-as well as the mux and demux between them.
-
-This could be configurable at compile time though. The number of ChaChas per
-core would also be configurable, but at the moment I choose 2.
-
-Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if
-we want 8 QRs per core instead of 4 for timing reasons.
-
-Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's kind
-of a lot. A fully pipelined design would use 322\*20\*4 or 25k XLR. If we can
-pass timing using 10k LUTs then that would be nice. We get a peak throughput
-of 50Gbps, it's just that the latency kills our packet rate. If we reduce the
-latency to 25 cycles and have 2 alternating cores, our packet rate would be
-20Mpps, increasing with every cycle we take off. I think that is good. This
-would result in 5k XLR which is not so bad.
+ChaCha20 operates on 512 bit blocks. Each round is made of 4 quarter
+rounds, which are identical except for which 32 bit words of state they
+operate on. We can use the same 32 bit quarter round 4 times in a row,
+but we need to store the rest of the state between operations, so memory
+usage might be similar to if we just did all 4 at once, but the logic
+would only be 25% as much. Because we alternate between column and
+diagonal rounds, the words used in one round are not the words used in
+the next.
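+
+For reference, a Python sketch (not RTL) of the quarter round and the two
+index patterns it is applied with, straight from RFC 8439:
+
+```python
+MASK32 = 0xffffffff
+
+def rotl32(x, n):
+    return ((x << n) | (x >> (32 - n))) & MASK32
+
+def quarter_round(s, a, b, c, d):
+    # The same 32 bit operation every time; only the indices change.
+    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
+    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
+    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
+    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)
+
+def double_round(s):
+    # Column round then diagonal round: the four quarter rounds within a
+    # round touch disjoint words, so one QR module can be time-multiplexed.
+    for idx in [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]:
+        quarter_round(s, *idx)
+    for idx in [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]:
+        quarter_round(s, *idx)
+```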
-Okay so starting over now, our clock speed cannot be 250MHz, the best we can
-do is 200MHz. If we assume this same 25 cycle latency, that's 4Gbps per block,
-so we would need 3 of them to surpass 10Gbps (each is 4096), so now we need 3
-blocks instead of 2.
+# Poly1305

-We are barely going to be able to pass at 180MHz. Maybe the fully pipelined
-core is a better idea, but we can just fully pipeline a quarter stage, and
-generate 512 bits every 4 clock cycles. This would give us a theoretical
-throughput of 32Gbps, and we would not have to worry about latency and small
-packets slowing us down. Let's experiment with what that would look like.
+## Parallel Operation

-For our single round it's using 1024 adders, which almost sounds like it is
-instantiating 8 quarter rounds instead of just 4. Either way, we can say that
-a quarter round is 128 FF + 128 add + 250 LUT.
+We can calculate in parallel, but we need to calculate r^n, where n is the
+number of parallel stages. Ideally the number of parallel stages would be
+equal to the latency of the full stage; that way it could be fully
+pipelined. For example, if it took 8 cycles per block, we would have 8
+parallel calculations. This requires calculating r^n, as well as every
+intermediate power. If we do 8,
-So pipelining 20 of these gives 10k LUTs. Not so bad.
+then we need to calculate r^1, r^2, r^3, etc. Computing all of these takes
+n-1 multiplies in total, but only log2(n) sequential multiply levels, since
+each level can be built from the powers of the previous one. For n = 8 we
+need:
-Actually it's 88k LUTs... it's 512 FF * 4 * 20 = 40k FF.
-
-Let's just leave it for now even if it's overkill. The hardware would support
-up to 40Gbps, and technically the FPGA has 16 lanes so could do 160Gbps in
-total, if we designed a custom board for it (or 120 if we used FMC connectors).
-
-If we only use a single quarter round multiplexed between all 4, then the same
-quarter round module can have 2 different blocks going through it at once.
-
-The new one multiplexes 4 quarter rounds between 1 QR module, which reduces
-the logic usage down to only 46k LE, of which the vast majority is flops
-(2k FF per round, 0.5k LUT).
-
-
-# Modulo 2^130-5
-
-We can use a trick here to do the modulo reduction much faster.
-
-If we split the bits at 2^130, leaving 129 high bits and 130 low bits, we now
-have a 129 bit value multiplied by 2^130, plus the 130 bit value. We know that
-2^130 mod 2^130-5 is 5, so we can replace that 2^130 with 5 and add, then
-repeat that step again.
-
-Ex.
-
-x = x1*2^130 + x2
-x mod (2^130-5) = (x1*5 + x2) mod (2^130-5), so let x3 = x1*5 + x2
-x3 = x4*2^130 + x5
-x mod (2^130-5) = (x4*5 + x5) mod (2^130-5)
-
-
-And let's do the math to verify that we only need two rounds. The maximum
-value that we could possibly get is 2^131-1, and the maximum value for R is
-0x0ffffffc0ffffffc0ffffffc0fffffff. Multiplying these together gives us
-0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001.
-
-Applying the first round to this we get
-
-0x1ffffff81ffffff81ffffff81ffffffd * 5 + 0x3f0000003f0000003f0000003f0000001
-= 0x48fffffdc8fffffdc8fffffdc8ffffff2
-
-Applying the second round to this we get
-
-1 * 5 + 0x8fffffdc8fffffdc8fffffdc8ffffff2 = 0x8fffffdc8fffffdc8fffffdc8ffffff7
-
-This is indeed the correct answer. The bottom part is 130 bits, but since we
-put in the max values and it didn't overflow, I don't think it will overflow
-here.
-
-131+128 = 259 bits, so we only have to do this once:
-
-0xb83fe991ca75d7ef2ab5cba9cccdfd938b73fff384ac90ed284034da565ecf
-0x19471c3e3e9c1bfded81da3736e96604a
-
-
-Kind of curious now: at what point does a ripple carry adder using dedicated
-CI/CO ports become slower than a more complex adder like carry lookahead or
-carry save (Wallace tree)?
+r\*r = r^2
+r\*r^2 = r^3, r^2\*r^2 = r^4
+r^4\*r = r^5, r^2\*r^4 = r^6, r^3\*r^4 = r^7, r^4\*r^4 = r^8
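+
+A sketch of that schedule in Python, to check the multiply count (the
+`power_schedule` helper is made up for illustration, not part of the design):
+
+```python
+P = 2**130 - 5
+
+def power_schedule(r, n=8):
+    # powers[k] will hold r^k mod P.
+    powers = {1: r % P}
+    level = 1
+    while level < n:
+        # Each new power only uses powers that already exist, so all the
+        # products within a level are independent: n-1 multiplies in total,
+        # in ceil(log2(n)) sequential levels.
+        for k in range(level + 1, min(2 * level, n) + 1):
+            powers[k] = (powers[level] * powers[k - level]) % P
+        level *= 2
+    return powers
+```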
\ No newline at end of file
diff --git a/ChaCha20_Poly1305_64/sim/poly1305.yaml b/ChaCha20_Poly1305_64/sim/poly1305.yaml
index 481c07e..ce869e6 100644
--- a/ChaCha20_Poly1305_64/sim/poly1305.yaml
+++ b/ChaCha20_Poly1305_64/sim/poly1305.yaml
@@ -4,4 +4,10 @@ tests:
     modules:
       - "poly1305_core"
     sources: "sources.list"
+    waves: True
+  - name: "friendly_modulo"
+    toplevel: "poly1305_friendly_modulo"
+    modules:
+      - "poly1305_friendly_modulo"
+    sources: "sources.list"
     waves: True
\ No newline at end of file
diff --git a/ChaCha20_Poly1305_64/sim/poly1305_friendly_modulo.py b/ChaCha20_Poly1305_64/sim/poly1305_friendly_modulo.py
new file mode 100644
index 0000000..a1a2a31
--- /dev/null
+++ b/ChaCha20_Poly1305_64/sim/poly1305_friendly_modulo.py
@@ -0,0 +1,62 @@
+import logging
+import random
+
+import cocotb
+from cocotb.clock import Clock
+from cocotb.triggers import Timer, RisingEdge
+
+PRIME = 2**130 - 5
+
+CLK_PERIOD = 4
+
+
+class TB:
+    def __init__(self, dut):
+        self.dut = dut
+
+        self.log = logging.getLogger("cocotb.tb")
+        self.log.setLevel(logging.INFO)
+
+        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())
+
+    async def cycle_reset(self):
+        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)
+
+    async def _cycle_reset(self, rst, clk):
+        rst.setimmediatevalue(0)
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+        rst.value = 1
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+        rst.value = 0
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+
+
+@cocotb.test()
+async def test_sanity(dut):
+    tb = TB(dut)
+
+    await tb.cycle_reset()
+
+    value_a = random.randint(1, 2**(130+16))
+
+    # value_a = PRIME + 1000000
+    tb.dut.i_shift_amount.value = 0  # plain split at bit 130
+    tb.dut.i_valid.value = 1
+    tb.dut.i_val.value = value_a
+    await RisingEdge(tb.dut.i_clk)
+    tb.dut.i_valid.value = 0
+    tb.dut.i_val.value = 0
+
+    await RisingEdge(tb.dut.o_valid)
+    value = tb.dut.o_result.value.integer
+
+    tb.log.info("expected: %x", value_a % PRIME)
+    tb.log.info("got:      %x", value)
+    assert value == value_a % PRIME
+
+    await Timer(1, "us")
\ No newline at end of file
diff --git a/ChaCha20_Poly1305_64/src/poly1305_friendly_modulo.sv b/ChaCha20_Poly1305_64/src/poly1305_friendly_modulo.sv
new file mode 100644
index 0000000..5e398be
--- /dev/null
+++ b/ChaCha20_Poly1305_64/src/poly1305_friendly_modulo.sv
@@ -0,0 +1,48 @@
+// Two-stage reduction modulo 2^130-5, using 2^130 == 5 (mod 2^130-5):
+// fold the high bits down with a multiply-by-MDIFF and add, twice, then
+// do a final conditional subtract.
+module poly1305_friendly_modulo #(
+    parameter WIDTH = 130,
+    parameter MDIFF = 5, // modulo difference: we reduce mod 2^WIDTH - MDIFF
+    parameter SHIFT_SIZE = 26
+) (
+    input  logic               i_clk,
+    input  logic               i_rst,
+
+    input  logic               i_valid,
+    input  logic [2*WIDTH-1:0] i_val,
+    input  logic [2:0]         i_shift_amount,
+
+    output logic               o_valid,
+    output logic [WIDTH-1:0]   o_result
+);
+
+localparam WIDE_WIDTH = WIDTH + $clog2(MDIFF);
+localparam [WIDTH-1:0] PRIME = {WIDTH{1'b1}} - (MDIFF - 1); // 2^WIDTH - MDIFF
+
+logic [WIDE_WIDTH-1:0] high_part_1, high_part_2;
+logic [WIDTH-1:0]      low_part_1, low_part_2;
+
+logic [WIDE_WIDTH-1:0] intermediate_val;
+logic [WIDTH-1:0]      final_val;
+
+logic [2:0] unused_final;
+
+logic [2:0] valid_sr;
+
+assign intermediate_val = high_part_1 + WIDE_WIDTH'(low_part_1);
+
+// Final conditional subtract; assumes two folds are enough, see notes.md.
+assign o_result = (final_val >= PRIME) ? final_val - PRIME : final_val;
+
+assign o_valid = valid_sr[2];
+
+always_ff @(posedge i_clk) begin
+    if (i_rst)
+        valid_sr <= '0;
+    else
+        valid_sr <= {valid_sr[1:0], i_valid};
+end
+
+always_ff @(posedge i_clk) begin
+    // Stage 1: split at bit 130 (optionally pre-shifted by i_shift_amount
+    // limbs of SHIFT_SIZE bits) and fold the high part down with *MDIFF.
+    high_part_1 <= WIDTH'({3'b0, i_val} >> (130 - (i_shift_amount*SHIFT_SIZE))) * MDIFF;
+    low_part_1  <= WIDTH'(i_val << (i_shift_amount*SHIFT_SIZE));
+
+    // Stage 2: fold the (at most 3 bit) carry-out of stage 1 the same way.
+    high_part_2 <= (intermediate_val >> WIDTH) * MDIFF;
+    low_part_2  <= intermediate_val[WIDTH-1:0];
+
+    {unused_final, final_val} <= high_part_2 + WIDE_WIDTH'(low_part_2);
+end
+
+endmodule
\ No newline at end of file
diff --git a/ChaCha20_Poly1305_64/src/sources.list b/ChaCha20_Poly1305_64/src/sources.list
index b1502ca..4aac61c 100644
--- a/ChaCha20_Poly1305_64/src/sources.list
+++ b/ChaCha20_Poly1305_64/src/sources.list
@@ -3,4 +3,5 @@
 chacha20_block.sv
 chacha20_pipelined_round.sv
 chacha20_pipelined_block.sv
-poly1305_core.sv
\ No newline at end of file
+poly1305_core.sv
+poly1305_friendly_modulo.sv
\ No newline at end of file