Add rtl for friendly_modulo

Byron Lathi
2025-10-27 19:19:43 -07:00
parent 003527ee0d
commit 06d5949aa7
5 changed files with 142 additions and 142 deletions

View File

@@ -1,153 +1,36 @@
# Overall Notes
Since we are designing this for a 64-bit datapath, we need to be able to compute
64 bits every cycle. The ChaCha20 block function works on a 4x4 state of 32-bit
words, i.e. 512-bit blocks at a time, so logically it might make more sense to
have a datapath of 128 bits.
We need to support 25Gbps, and we will have 2 datapaths, tx and rx.
On the other hand, each operation is a 32-bit operation, so for timing reasons
it might make more sense to have each operation registered. But would that
still be able to match the throughput that we need?
Each quarter round generates 4 words, and each cycle updates all 128 bits at
once. We can do 4 of the quarter rounds at once, so at the end of each cycle we
will generate 512 bits.
At full speed, then, the core would generate 512 bits per cycle, but we would
only need to generate 64 bits per cycle. We could do only 1 quarter round at a
time, which would generate 128 bits per cycle, but we would need some sort of
structure to reorder the state so that it is ready to xor with the incoming
data. We could even make this parameterizable, but that would be the next step
if we actually need to support 100Gbps encryption.
So in summary, we will have a single QuarterRound module which generates 128
bits of output. We will have a scheduling block which selects which 4 words of
state go into the quarter round module, and a de-interleaver which takes the
output from the quarter round module and re-orders it into the correct order to
combine with the incoming data. There is also the final addition of the
original state in there somewhere.
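As a sanity check on the scheduling idea (which 4 state words feed the shared
QuarterRound module on each use), here is a plain-Python reference of the
ChaCha20 quarter round and the column/diagonal index schedule; the names are
illustrative only and are not meant to match the RTL interface.

```python
def quarter_round(a, b, c, d):
    # One ChaCha20 quarter round: 4 words in, 4 words (128 bits) out.
    m = 0xffffffff
    a = (a + b) & m; d ^= a; d = ((d << 16) | (d >> 16)) & m
    c = (c + d) & m; b ^= c; b = ((b << 12) | (b >> 20)) & m
    a = (a + b) & m; d ^= a; d = ((d <<  8) | (d >> 24)) & m
    c = (c + d) & m; b ^= c; b = ((b <<  7) | (b >> 25)) & m
    return a, b, c, d

# Word indices fed to the QR module: even rounds use columns, odd rounds diagonals.
COLUMNS   = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

def double_round(state):
    # Two rounds = 8 uses of the single QR module; 10 double rounds = 20 rounds.
    for idx in COLUMNS + DIAGONALS:
        words = quarter_round(*(state[i] for i in idx))
        for i, w in zip(idx, words):
            state[i] = w
    return state

state = list(range(16))   # dummy state, just to exercise the schedule
for _ in range(10):
    double_round(state)
```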
To support AEAD, the first ChaCha20 block (counter 0) becomes the one-time key
for the Poly1305 block. This can be computed in parallel with the second block,
which starts the cipher keystream, at the expense of double the gates.
Otherwise, there would be a delay between packets while this key is generated.
At a 128-bit datapath this is 200MHz, but let's aim for 250MHz.
# ChaCha20 Notes
Okay, so we did some timing tests and we can easily do 1 round of ChaCha20 in a
single cycle on a Titanium FPGA at 250MHz (~350-400 MHz achievable).
So then it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or
6.4Gbps. So we would need 2 of these for 10Gbps.
So in order to use multiple cores, we would calculate 1024 bits in 20 cycles.
Then we would put those bits into a memory or something and start calculating
the next 1024 bits. Those bits would all be used up in 16 cycles (but the
throughput still checks out). Once they are used, we load the memory with the
new output.
This puts a 20-cycle minimum on small packets since the core is not completely
pipelined, which puts a hard cap at 12.5Mpps. At 42-byte packets this is
4.2Gbps, and for 64-byte packets it is 6.4Gbps. In order to saturate the link,
you would need packets of at least 100 bytes.
This is with the 20-cycle minimum, though in reality it would be more like 25
or 30 with the final addition, scheduling, pipelining, etc. Adding more cores
increases the throughput for larger packets but does nothing for small packets
since the latency is the same. To solve this, we could instantiate the entire
core twice, so that we could handle 2 minimum-size packets at the same time.
If we say there is a 30-cycle latency, the worst case is 2.8Gbps. Doubling the
number of cores gives 5.6Gbps, quadrupling the number of cores gives 11.2Gbps.
This would of course more than quadruple the area, since we need 4x the cores
as well as the mux and demux between them.
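A quick back-of-the-envelope helper for the packet-rate numbers above (same
assumptions as stated: 250MHz clock, sizes in bytes):

```python
def worst_case_gbps(latency_cycles, cores=1, pkt_bytes=42, clk_mhz=250):
    # Small-packet limit: each packet occupies a core for the full latency,
    # so packets/s = cores * clk / latency and throughput = that * packet size.
    pps = cores * clk_mhz * 1e6 / latency_cycles
    return pps * pkt_bytes * 8 / 1e9

print(worst_case_gbps(20, pkt_bytes=42))   # 20-cycle floor, 42B  -> ~4.2 Gbps
print(worst_case_gbps(20, pkt_bytes=64))   # 20-cycle floor, 64B  -> ~6.4 Gbps
print(worst_case_gbps(30, cores=1))        # 30-cycle latency     -> ~2.8 Gbps
print(worst_case_gbps(30, cores=2))        # 2 cores              -> ~5.6 Gbps
print(worst_case_gbps(30, cores=4))        # 4 cores              -> ~11.2 Gbps
```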
This could be configurable at compile time, though. The number of ChaChas per
core would also be configurable, but at the moment I choose 2.
Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if
we want 8 QRs per core instead of 4 for timing reasons.
Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's kind
of a lot. A fully pipelined design would use 322\*20\*4 or 25k XLR. If we can
pass timing using 10k LUTs then that would be nice. We get a peak throughput of
50Gbps, it's just that the latency kills our packet rate. If we reduce the
latency to 25 cycles and have 2 alternating cores, our packet rate would be
20Mpps, increasing with every cycle we take off. I think that is good. This
would result in 5k XLR, which is not so bad.
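Spelling out the XLR arithmetic above (322 XLR per quarter round, as measured):

```python
XLR_PER_QR = 322

print(32 * XLR_PER_QR)            # 4*2*4 = 32 QR modules   -> ~10.3k XLR
print(64 * XLR_PER_QR)            # with 8 QRs instead of 4 -> ~20.6k XLR
print(XLR_PER_QR * 20 * 4)        # fully pipelined, 20 rounds x 4 QRs -> ~25.8k XLR
print(250e6 / 25 * 2 / 1e6)       # 25-cycle latency, 2 alternating cores -> 20 Mpps
```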
ChaCha20 operates on 512-bit blocks. Each round is made of 4 quarter rounds,
which are identical except for which 32-bit words they use. We can use the same
32-bit quarter round 4 times in a row, but we need to store the rest of the
state between operations, so memory usage might be similar to doing all 4 at
once, while the logic would only be 25% as much. Because we switch between odd
and even rounds, the data used in one round is not the data used in the other
round.
Okay, so starting over now: our clock speed cannot be 250MHz, the best we can
do is 200MHz. If we assume this same 25-cycle latency, that's about 4Gbps per
block (each is 4.096Gbps), so we would need 3 of them to surpass 10Gbps, so now
we need 3 blocks instead of 2.
We are barely going to be able to pass at 180MHz. Maybe the fully pipelined
core is a better idea, but we can just fully pipeline a quarter stage and
generate 512 bits every 4 clock cycles. This would give us a theoretical
throughput of 32Gbps, and we would not have to worry about latency and small
packets slowing us down. Let's experiment with what that would look like.
For our single round it is using 1024 adders, which almost sounds like it is
instantiating 8 quarter rounds instead of just 4. Either way, we can say that a
quarter round is 128ff + 128add + 250lut.
So pipelining 20 of these gives 10k LUTs. Not so bad.
Actually it's 88k LUTs... and it's 512ff \* 4 \* 20 = 40k ff.
Let's just leave it for now even if it's overkill. The hardware would support
up to 40Gbps, and technically the FPGA has 16 lanes so it could do 160Gbps in
total, if we designed a custom board for it (or 120 if we used FMC connectors).
If we only use a single quarter round multiplexed between all 4, then the same
quarter round module can have 2 different blocks going through it at once.
The new version multiplexes 4 quarter rounds onto 1 QR module, which reduces
the logic usage down to only 46k LE, of which the vast majority is flops (2k ff
per round, 0.5k LUT).
# Poly1305
## Parallel Operation
We can calculate blocks in parallel, but then we need to calculate r^n, where n
is the number of parallel stages. Ideally the number of parallel stages would
equal the latency of the full stage, so that it could be fully pipelined. For
example, if it took 8 cycles per block, we would have 8 parallel accumulations.
This requires you to calculate r^n as well as every intermediate power: if we
do 8, then we need to calculate r^1, r^2, r^3, etc. Computing r^n alone takes
log2(n) squarings; getting all the intermediate powers takes n-1 multiplies in
total, but they can be arranged into log2(n) levels (see the multiply schedule
at the end of these notes).
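To convince ourselves the r^n trick is algebraically sound, here is a
plain-Python check of the accumulation identity. It ignores the r clamping and
the 2^128 pad bit of real Poly1305, assumes the block count is a multiple of n,
and the names are just for illustration.

```python
import random

P = (1 << 130) - 5

def poly1305_serial(blocks, r):
    # Reference accumulation: h = (h + c) * r mod p for every block c.
    h = 0
    for c in blocks:
        h = (h + c) * r % P
    return h

def poly1305_parallel(blocks, r, n):
    # n independent accumulators, each stepping by r^n instead of r.
    assert len(blocks) % n == 0
    rn = pow(r, n, P)
    lanes = [0] * n
    for k in range(0, len(blocks), n):
        for j in range(n):
            lanes[j] = (lanes[j] * rn + blocks[k + j]) % P
    # Re-align the lanes: lane j still owes a factor of r^(n-j).
    return sum(h * pow(r, n - j, P) for j, h in enumerate(lanes)) % P

blocks = [random.getrandbits(129) for _ in range(32)]
r = random.getrandbits(124)
assert poly1305_serial(blocks, r) == poly1305_parallel(blocks, r, 8)
```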
# Modulo 2^130-5
We can use a trick here to do the modulo reduction much faster.
If we split the value at bit 130, leaving 129 high bits and 130 low bits, we
now have a 129-bit value multiplied by 2^130, plus a 130-bit value. We know
that 2^130 mod (2^130-5) is 5, so we can replace that factor of 2^130 with 5
and add, then repeat that step once more.
Ex.
x = x1\*2^130 + x2
x mod (2^130-5) = (x1\*5 + x2) mod (2^130-5); let x3 = x1\*5 + x2
x3 = x4\*2^130 + x5
x mod (2^130-5) = (x4\*5 + x5) mod (2^130-5)
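A quick software model of this two-step reduction (plain Python, keeping full
precision all the way through, including in the final compare):

```python
import random

WIDTH = 130
MDIFF = 5
PRIME = (1 << WIDTH) - MDIFF      # 2^130 - 5
MASK  = (1 << WIDTH) - 1

def friendly_modulo(x):
    # Fold the high part back in, weighted by MDIFF, twice...
    x = (x >> WIDTH) * MDIFF + (x & MASK)    # first fold
    x = (x >> WIDTH) * MDIFF + (x & MASK)    # second fold
    # ...then one conditional subtract brings the result below PRIME.
    return x - PRIME if x >= PRIME else x

for _ in range(1000):
    v = random.getrandbits(2 * WIDTH)        # up to a 260-bit product
    assert friendly_modulo(v) == v % PRIME
```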
And let's do the math to verify that we only need two rounds. The maximum value
that we could possibly get is 2^131-1, and the maximum value for R is
0x0ffffffc0ffffffc0ffffffc0fffffff. Multiplying these together gives us
0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001.
Applying the first round to this we get
0x1ffffff81ffffff81ffffff81ffffffd \* 5 + 0x3f0000003f0000003f0000003f0000001
= 0x48fffffdc8fffffdc8fffffdc8ffffff2.
Applying the second round to this we get
1 \* 5 + 0x8fffffdc8fffffdc8fffffdc8ffffff2 = 0x8fffffdc8fffffdc8fffffdc8ffffff7,
and this is indeed the correct answer. The bottom part is 130 bits, but since
we put in the max values and it didn't overflow, I don't think it will overflow
here.
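Re-running that worked example in Python, just to double-check the arithmetic:

```python
PRIME = (1 << 130) - 5
MASK  = (1 << 130) - 1

h_max = (1 << 131) - 1                       # maximum accumulator value
r_max = 0x0ffffffc0ffffffc0ffffffc0fffffff   # maximum clamped R
x = h_max * r_max
assert x == 0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001

x3 = (x >> 130) * 5 + (x & MASK)             # first round
assert x3 == 0x48fffffdc8fffffdc8fffffdc8ffffff2

x4 = (x3 >> 130) * 5 + (x3 & MASK)           # second round
assert x4 == 0x8fffffdc8fffffdc8fffffdc8ffffff7
assert x4 == x % PRIME                       # matches a plain Python modulo
```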
131+128 = 259 bits, only have to do this once
0xb83fe991ca75d7ef2ab5cba9cccdfd938b73fff384ac90ed284034da565ecf
0x19471c3e3e9c1bfded81da3736e96604a
Kind of curious now: at what point does a ripple-carry adder using the
dedicated CI/CO ports become slower than a more complex adder like carry
lookahead or carry save (Wallace tree)?
Multiply schedule for the powers of r (one level per line; the multiplies
within a line are independent):
r\*r = r^2
r\*r^2 = r^3, r^2\*r^2 = r^4
r^4\*r = r^5, r^2\*r^4 = r^6, r^3\*r^4 = r^7, r^4\*r^4 = r^8
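The same schedule in Python, to confirm it reaches r^8 in 3 levels with
n-1 = 7 multiplies:

```python
import random

P = (1 << 130) - 5
r = random.getrandbits(124)

p = {1: r}
p[2] = p[1] * p[1] % P                              # level 1
p[3] = p[1] * p[2] % P; p[4] = p[2] * p[2] % P      # level 2
p[5] = p[4] * p[1] % P; p[6] = p[2] * p[4] % P      # level 3
p[7] = p[3] * p[4] % P; p[8] = p[4] * p[4] % P
assert all(p[k] == pow(r, k, P) for k in p)
```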

View File

@@ -4,4 +4,10 @@ tests:
    modules:
      - "poly1305_core"
    sources: "sources.list"
    waves: True
  - name: "friendly_modulo"
    toplevel: "poly1305_friendly_modulo"
    modules:
      - "poly1305_friendly_modulo"
    sources: "sources.list"
    waves: True

View File

@@ -0,0 +1,62 @@
import logging
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import Timer, RisingEdge, FallingEdge
from cocotb.queue import Queue
from cocotbext.axi import AxiStreamBus, AxiStreamSource
import random
PRIME = 2**130-5
CLK_PERIOD = 4
class TB:
    def __init__(self, dut):
        self.dut = dut

        self.log = logging.getLogger("cocotb.tb")
        self.log.setLevel(logging.INFO)

        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())

    async def cycle_reset(self):
        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)

    async def _cycle_reset(self, rst, clk):
        # Active-high reset pulse, two cycles in each phase.
        rst.setimmediatevalue(0)
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 1
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 0
        await RisingEdge(clk)
        await RisingEdge(clk)


@cocotb.test()
async def test_sanity(dut):
    tb = TB(dut)

    await tb.cycle_reset()

    value_a = random.randint(1, 2**(130+16))
    # value_a = PRIME + 1000000

    # Present the value for one cycle; a shift amount of 0 selects the plain
    # split at bit 130.
    tb.dut.i_shift_amount.value = 0
    tb.dut.i_valid.value = 1
    tb.dut.i_val.value = value_a
    await RisingEdge(tb.dut.i_clk)
    tb.dut.i_valid.value = 0
    tb.dut.i_val.value = 0

    await RisingEdge(tb.dut.o_valid)
    value = tb.dut.o_result.value.integer

    print(value_a % PRIME)
    print(value)
    assert value == value_a % PRIME

    await Timer(1, "us")

View File

@@ -0,0 +1,48 @@
module poly1305_friendly_modulo #(
    parameter WIDTH = 130,
    parameter MDIFF = 5, // modulo difference
    parameter SHIFT_SIZE = 26
) (
    input  logic               i_clk,
    input  logic               i_rst,

    input  logic               i_valid,
    input  logic [2*WIDTH-1:0] i_val,
    input  logic [2:0]         i_shift_amount,

    output logic               o_valid,
    output logic [WIDTH-1:0]   o_result
);

    // high*MDIFF + low needs a few bits of headroom above WIDTH.
    localparam WIDE_WIDTH = WIDTH + $clog2(MDIFF);
    localparam [WIDTH-1:0] PRIME = (1 << WIDTH) - MDIFF;

    logic [WIDE_WIDTH-1:0] high_part_1, high_part_2;
    logic [WIDTH-1:0]      low_part_1, low_part_2;

    logic [WIDE_WIDTH-1:0] intermediate_val;
    logic [WIDTH-1:0]      final_val;
    logic [2:0]            unused_final;

    logic [2:0] valid_sr;

    // Combine the first fold: high*MDIFF + low.
    assign intermediate_val = high_part_1 + WIDE_WIDTH'(low_part_1);

    // Final conditional subtract brings the result below PRIME.
    assign o_result = (final_val >= PRIME) ? final_val - PRIME : final_val;
    assign o_valid = valid_sr[2];

    always_ff @(posedge i_clk) begin
        valid_sr <= {valid_sr[1:0], i_valid};

        // Stage 1: split at bit 130 (moved down by i_shift_amount*SHIFT_SIZE,
        // which scales the input by that power of two) and weight the high
        // part by MDIFF, since 2^130 mod (2^130-MDIFF) = MDIFF.
        high_part_1 <= WIDTH'({3'b0, i_val} >> (130 - (i_shift_amount*SHIFT_SIZE))) * MDIFF;
        low_part_1 <= WIDTH'(i_val << (i_shift_amount*SHIFT_SIZE));

        // Stage 2: fold the now-small high part back in once more.
        high_part_2 <= (intermediate_val >> WIDTH) * 5;
        low_part_2 <= intermediate_val[WIDTH-1:0];

        // Stage 3: final sum; any bits above WIDTH land in unused_final.
        {unused_final, final_val} <= high_part_2 + WIDE_WIDTH'(low_part_2);
    end

endmodule

View File

@@ -3,4 +3,5 @@ chacha20_block.sv
chacha20_pipelined_round.sv
chacha20_pipelined_block.sv
poly1305_core.sv
poly1305_core.sv
poly1305_friendly_modulo.sv