Actually randomize testing

Add rtl for friendly_modulo
Do poly1305 with absolutely no modulo operators
2025-10-27 20:13:21 -07:00 · 2025-10-27 19:19:43 -07:00 · 2025-10-26 16:09:16 -07:00 · 2025-10-26 15:43:58 -07:00 · 2025-10-26 15:43:36 -07:00 · 2025-10-24 18:46:30 -07:00
17 changed files with 1053 additions and 98 deletions
--- a/ChaCha20_Poly1305_64/doc/notes.md
+++ b/ChaCha20_Poly1305_64/doc/notes.md
@@ -1,108 +1,36 @@
-# Notes
+# Overall Notes

-Since we are designing this for a 64 bit datapath, we need to be able to
-compute 64 bits every cycle. The ChaCha20 hash works on groups of 16x32, or
-512-bit blocks at a time. Logically it might make more sense to have a datapath
-of 128 bits.
+We need to support 25Gbps, and we will have 2 datapaths, tx and rx

-On the other hand, each operation is a 32 bit operation. It might make more
-sense for timing reasons then to have each operation registered. But will this
-be able to match the throughput that we need?
-
-Each quarter round generates 4 words. Each cycle updates all 128 bits at once.
-We can do 4 of the quarter rounds at once, so at the end of each cycle we will
-generate 512 bits.
-
-At full speed then, the core would generate 512 bits per cycle. but we would
-only need to generate 64 bits per cycle. We could only do 1 quarter cycle at
-once, which would only generate 128 bits per cycle, but we would need some sort
-of structure to reorder the state such that it is ready to xor with the
-incoming data. We could even make this parameterizable, but that would be the
-next step if we actually need to support 100Gbps encryption.
-
-So in summary, we will have a single QuarterRound module which generates 128
-bits of output. We will have a scheduling block which schedules which 4 words
-of state go into the quarter round module, and a de-interleaver which takes the
-output from the quarter round module and re-orders it to be in the correct
-order to combine with the incoming data. there is also an addition in there
-somewhere.
-
-To support AEAD, The first round becomes the key for the Poly1305 block. This
-can be done in parallel with the second round, which becomes the cipher, at the
-expense of double the gates. Otherwise, there would be a delay in between
-packets as this is generated.
+With a 128-bit datapath, this requires roughly 200MHz (25Gbps / 128 bits is about 195MHz), but let's aim for 250MHz.


-Okay so we did some timing tests and we can easily do 1 round of ChaCha20 in a
-single cycle on a Titanium FPGA at 250MHz (~350-400 MHz)
+# ChaCha20 Notes

-So then it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or
-6.4Gbps. So then we will need 2 of these for 10Gbps.
-
-So in order to use multiple cores, we would calculate 1024 bits in 20 cycles.
-Then we would put those bits into a memory or something and start calculating
-the next 1024 bits. Those bits would all be used up in 16 cycles, (but the
-throughput still checks out). Once they are used, we load the memory with the
-new output.
-
-This puts a 20 cycle minimum on small packets since the core is not completely
-pipelined. This puts a hard cap at 12.5Mpps. At 42 byte packets, this is
-4.2Gbps, and for 64 byte packets is 6.4Gbps. In order to saturate the link, you
-would need packets of at least 100 bytes.
-
-This is with the 20 cycle minimum, though in reality it would be more like 25
-or 30 with the final addition, scheduling, pipelining etc. Adding more cores
-increases the throughput for larger packets, but does nothing for small packets
-since the latency is the same. To solve this, we could instantiate the entire
-core twice, such that we could handle 2 minimum size packets at the same time.
-
-If we say there is a 30 cycle latency, the worst case is 2.8Gbps. Doubling the
-number of cores gives 5.6, quadrupling the number of cores gives 11.2Gbps. This
-would of course more than quadruple the area since we need 4x the cores as
-well as the mux and demux between them.
-
-This could be configurable at compile time though. The number of ChaChas per
-core would also be configurable, but at the moment I choose 2. 
-
-Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if
-we want to 8 QRs per core instead of 4 for timing reasons.
-
-Each QR is 322 XLR, so just the QR would be either 10k or 20k XLR.. That's kind
-of a lot. A fully pipelined design would use 322\*20\*4 or 25k XLR. If we can
-pass timing using 10k luts then that would be nice. We get a peak throughput
-of 50Gbps, it's just that the latency kills our packet rate. If we reduce the
-latency to 25 cycles and have 2 alternating cores, our packet rate would be
-20Mpps, increasing with every cycle we take off. I think that is good. This
-would result in 5k XLR which is not so bad.
+ChaCha20 operates on 512-bit blocks. Each round is made of 4 quarter
+rounds, which are identical except for which 32-bit words they operate
+on. We can reuse a single quarter round module 4 times in a row, but we
+need to store the rest of the state between operations, so memory usage
+might be similar to doing all 4 at once, while the logic would only be
+25% as much. Because we alternate between odd (column) and even
+(diagonal) rounds, the words grouped together in one round are not the
+same words grouped together in the next round.


-Okay so starting over now, our clock speed cannot be 250MHz, the best we can do
-is 200MHz. If we assume this same 25 cycle latency, thats 4Gbps per block, so
-we would need 3 of them to surpass 10Gbps (each is 4096) so now we need 3 blocks
-instead of 2.
+# Poly1305

-We are barely going to be able to pass at 180MHz. maybe the fully pipelined
-core is a better idea, but we can just fully pipeline a quarter stage, and 
-generate 512 bits every 4 clock cycles. This would give us a theoretical
-throughput of 32Gbps, and we would not have to worry about latency and small
-packets slowing us down. Lets experiment with what that would look like. 
+## Parallel Operation

-For our single round its using 1024 adders, which almost sounds like it is
-instantiating 8 quarter rounds instead of just 4. Either way, we can say that
-a quarter round is 128ff + 128add + 250lut.
+We can process blocks in parallel, but this requires r^n, where n is the number of
+parallel lanes. Ideally the number of lanes equals the latency of the full
+multiply-and-reduce stage, so that the design is fully pipelined. For example, if
+a block takes 8 cycles, we would run 8 parallel calculations. This requires
+r^n as well as every intermediate power. If we do 8,

-So pipelining 20 of these gives 10k luts. Not so bad.
+then we need to calculate r^1, r^2, r^3, etc. This takes n-1 multiplies in total, but only log2(n) sequential multiply stages.

+For n = 8 we need:

-Actually it's 88k luts... it's 512ff * 4 * 20 = 40k ff
-
-Lets just leave it for now even if its overkill. The hardware would support up to
-40Gbps, and technically the FPGA has 16 lanes so could do 160Gbps in total, if
-we designed a custom board for it (or 120 if we used FMC connectors).
-
-If we only use a single quarter round multiplexed between all 4, then the same
-quarter round module can have 2 different blocks going through it at once.
-
-The new one multiplexes 4 quarter rounds between 1 QR module which reduces the
-logic usage down to only 46k le, of which the vast majority is flops (2k ff per round,
-0.5k lut)
+r\*r    = r^2
+r\*r^2  = r^3   r^2\*r^2 = r^4
+r^4\*r  = r^5   r^2\*r^4 = r^6  r^3\*r^4    = r^7   r^4\*r^4    = r^8
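The power table in the Poly1305 notes can be checked with a small software model. The following is a minimal Python sketch, not the RTL from this commit: the function names (`power_table`, `poly_sequential`, `poly_parallel`) are hypothetical, the lane count is assumed to be a power of two, the block count is assumed to be a multiple of the lane count, and key clamping and the final addition of `s` are left out. It builds r^1..r^n in log2(n) stages and confirms that n interleaved lanes, each stepping by r^n, give the same accumulator as the sequential loop.

```python
P = (1 << 130) - 5  # Poly1305 prime


def power_table(r, n):
    """Return ([r^0 .. r^n] mod P, multiply count, stage count).

    Assumes n is a power of two. Each stage doubles the number of known
    powers by multiplying all of them with the current highest power.
    """
    powers = [1, r % P]
    mults = 0
    stages = 0
    have = 1  # highest exponent computed so far
    while have < n:
        top = powers[have]
        for k in range(1, have + 1):
            powers.append(powers[k] * top % P)  # r^k * r^have = r^(k+have)
            mults += 1
        have *= 2
        stages += 1
    return powers, mults, stages


def poly_sequential(blocks, r):
    """Plain Poly1305 accumulation: acc = (acc + block) * r mod P."""
    acc = 0
    for c in blocks:
        acc = (acc + c) * r % P
    return acc


def poly_parallel(blocks, r, n):
    """Same accumulation using n interleaved lanes that each step by r^n."""
    assert len(blocks) % n == 0, "sketch assumes a whole number of lane groups"
    powers, _, _ = power_table(r, n)
    lanes = [0] * n
    for i, c in enumerate(blocks):
        j = i % n
        lanes[j] = (lanes[j] * powers[n] + c) % P
    # Lane j still needs a factor of r^(n-j) to line up with the sequential form.
    return sum(lanes[j] * powers[n - j] for j in range(n)) % P


if __name__ == "__main__":
    r = 0x0806D5400E52447C036D555408BED685  # arbitrary example value
    blocks = [(0x1234567 * (i + 1)) ** 3 % (1 << 129) for i in range(24)]
    _, mults, stages = power_table(r, 8)
    print(mults, stages)
    print(poly_parallel(blocks, r, 8) == poly_sequential(blocks, r))
```

The final combination step is why every intermediate power is needed and not only r^n: each lane is scaled by a different power of r before the lanes are summed.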
--- a/ChaCha20_Poly1305_64/doc/poly1305.drawio
+++ b/ChaCha20_Poly1305_64/doc/poly1305.drawio
@@ -0,0 +1,146 @@
+<mxfile host="Electron" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/26.2.2 Chrome/134.0.6998.178 Electron/35.1.2 Safari/537.36" version="26.2.2">
+  <diagram name="Page-1" id="gIy_vrPza4QP03Kn0wfk">
+    <mxGraphModel dx="655" dy="442" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
+      <root>
+        <mxCell id="0" />
+        <mxCell id="1" parent="0" />
+        <mxCell id="GA09nmFLpfHeItamLD5O-24" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.25;exitDx=0;exitDy=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-1" target="GA09nmFLpfHeItamLD5O-21">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-25" value="r" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-24">
+          <mxGeometry x="0.5579" y="-1" relative="1" as="geometry">
+            <mxPoint x="9" y="5" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-35" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.75;exitDx=0;exitDy=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-1" target="GA09nmFLpfHeItamLD5O-34">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-38" value="s" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-35">
+          <mxGeometry x="-0.6624" y="1" relative="1" as="geometry">
+            <mxPoint x="-9" y="-9" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-1" value="r/s" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
+          <mxGeometry x="360" y="200" width="80" height="40" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-2" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.25;entryDx=0;entryDy=0;" edge="1" parent="1" target="GA09nmFLpfHeItamLD5O-1">
+          <mxGeometry width="50" height="50" relative="1" as="geometry">
+            <mxPoint x="320" y="210" as="sourcePoint" />
+            <mxPoint x="410" y="270" as="targetPoint" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-3" value="otk" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-2">
+          <mxGeometry x="-0.3946" y="1" relative="1" as="geometry">
+            <mxPoint x="-22" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-10" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-4" target="GA09nmFLpfHeItamLD5O-6">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-4" value="64-&amp;gt;128" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
+          <mxGeometry x="175" y="130" width="50" height="20" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-5" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" target="GA09nmFLpfHeItamLD5O-4">
+          <mxGeometry width="50" height="50" relative="1" as="geometry">
+            <mxPoint x="120" y="140" as="sourcePoint" />
+            <mxPoint x="290" y="110" as="targetPoint" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-15" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-6" target="GA09nmFLpfHeItamLD5O-14">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-40" value="data_one_extended" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-15">
+          <mxGeometry x="-0.3532" y="-1" relative="1" as="geometry">
+            <mxPoint x="7" y="29" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-6" value="bit add" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
+          <mxGeometry x="240" y="120" width="40" height="40" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-8" value="" style="endArrow=classic;html=1;rounded=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" target="GA09nmFLpfHeItamLD5O-6">
+          <mxGeometry width="50" height="50" relative="1" as="geometry">
+            <mxPoint x="260" y="80" as="sourcePoint" />
+            <mxPoint x="290" y="70" as="targetPoint" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-9" value="tkeep" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-8">
+          <mxGeometry x="-0.699" relative="1" as="geometry">
+            <mxPoint y="-16" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-11" value="P" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
+          <mxGeometry x="540" y="180" width="80" height="20" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-18" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-12" target="GA09nmFLpfHeItamLD5O-14">
+          <mxGeometry relative="1" as="geometry">
+            <Array as="points">
+              <mxPoint x="340" y="100" />
+            </Array>
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-36" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-12" target="GA09nmFLpfHeItamLD5O-34">
+          <mxGeometry relative="1" as="geometry">
+            <Array as="points">
+              <mxPoint x="400" y="60" />
+              <mxPoint x="660" y="60" />
+            </Array>
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-12" value="acc" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
+          <mxGeometry x="360" y="80" width="80" height="40" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-14" value="+" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
+          <mxGeometry x="320" y="120" width="40" height="40" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-32" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-21" target="GA09nmFLpfHeItamLD5O-31">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-21" value="*" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
+          <mxGeometry x="440" y="120" width="40" height="40" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-22" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-14" target="GA09nmFLpfHeItamLD5O-21">
+          <mxGeometry relative="1" as="geometry">
+            <mxPoint x="460" y="140" as="targetPoint" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-41" value="data_post_add" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-22">
+          <mxGeometry x="-0.1925" y="-1" relative="1" as="geometry">
+            <mxPoint x="8" y="19" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-26" value="%" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
+          <mxGeometry x="560" y="120" width="40" height="40" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-29" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.51;entryY=1.007;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-11" target="GA09nmFLpfHeItamLD5O-26">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-30" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1.022;entryY=0.482;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-26" target="GA09nmFLpfHeItamLD5O-12">
+          <mxGeometry relative="1" as="geometry">
+            <Array as="points">
+              <mxPoint x="580" y="99" />
+            </Array>
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-33" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-31" target="GA09nmFLpfHeItamLD5O-26">
+          <mxGeometry relative="1" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-31" value="reg" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
+          <mxGeometry x="500" y="130" width="40" height="20" as="geometry" />
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-37" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-34">
+          <mxGeometry relative="1" as="geometry">
+            <mxPoint x="720" y="140" as="targetPoint" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-39" value="tag" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-37">
+          <mxGeometry x="0.6531" y="2" relative="1" as="geometry">
+            <mxPoint x="17" y="2" as="offset" />
+          </mxGeometry>
+        </mxCell>
+        <mxCell id="GA09nmFLpfHeItamLD5O-34" value="+" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
+          <mxGeometry x="640" y="120" width="40" height="40" as="geometry" />
+        </mxCell>
+      </root>
+    </mxGraphModel>
+  </diagram>
+</mxfile>
--- a/ChaCha20_Poly1305_64/poly1305_timing_test/constraints.sdc
+++ b/ChaCha20_Poly1305_64/poly1305_timing_test/constraints.sdc
@@ -0,0 +1 @@
+create_clock -period 2.5 -name clk [get_ports i_clk]
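For context on the constraint above: a 2.5 ns period is a 400 MHz clock, which is tighter than the 250 MHz target in notes.md. The commit does not say why the timing test is constrained this far beyond the target. The arithmetic can be sketched as follows; the helper names are hypothetical.

```python
def required_clock_mhz(line_rate_gbps, datapath_bits):
    """Clock needed to move line_rate_gbps through a datapath_bits-wide bus."""
    return line_rate_gbps * 1e9 / datapath_bits / 1e6


def constraint_mhz(period_ns):
    """Clock frequency implied by an SDC create_clock period in nanoseconds."""
    return 1e3 / period_ns


if __name__ == "__main__":
    needed = required_clock_mhz(25, 128)   # 25 Gbps on a 128-bit datapath
    constrained = constraint_mhz(2.5)      # period from constraints.sdc
    print(f"needed {needed} MHz, constrained {constrained} MHz")
```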
--- a/ChaCha20_Poly1305_64/poly1305_timing_test/mult_timing_test.sv
+++ b/ChaCha20_Poly1305_64/poly1305_timing_test/mult_timing_test.sv
@@ -0,0 +1,42 @@
+module mult_timing_test(
+    input i_clk,
+
+    input logic [132:0] data_a,
+    input logic [127:0] data_b,
+    
+    output logic [260:0] data_z
+);
+
+logic [132:0] data_a_reg;
+logic [127:0] data_b_reg;
+
+
+logic [260:0] partial_result [7];
+
+logic [260:0] data_z_temp_1[4];
+logic [260:0] data_z_temp_2_0, data_z_temp_2_1;
+
+always @(posedge i_clk) begin
+    data_a_reg <= data_a;
+    data_b_reg <= data_b;
+
+    for (int i = 0; i < 7; i++) begin
+        partial_result[i] <= data_a_reg[i*19 +: 19] * data_b_reg; // 7 limbs x 19 bits = 133 bits, matching the 19*i shifts below
+    end
+    
+
+    data_z_temp_1[0] <= (partial_result[0] << (19*0)) + (partial_result[1] << (19*1));
+    data_z_temp_1[1] <= (partial_result[2] << (19*0)) + (partial_result[3] << (19*1));
+    data_z_temp_1[2] <= (partial_result[4] << (19*0)) + (partial_result[5] << (19*1));
+    data_z_temp_1[3] <= (partial_result[6] << (19*0));
+
+    data_z_temp_2_0 <= data_z_temp_1[0] + (data_z_temp_1[1] << (19*2));
+    data_z_temp_2_1 <= data_z_temp_1[2] + (data_z_temp_1[3] << (19*2));
+
+    // The upper half holds partial products 4..6, so it is weighted by 2^(19*4).
+    data_z <= data_z_temp_2_0 + (data_z_temp_2_1 << (19*4));
+
+
+end
+
+endmodule
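A software reference for the adder tree can help when checking the multiplier in simulation. The sketch below is a Python model under stated assumptions, not a description of what the RTL in this commit computes: it assumes `data_a` is split into seven 19-bit limbs (7 x 19 = 133), that limb i is weighted by 2^(19*i), and that the three adder stages are combined with shifts of 19, 38 and 76 bits. The function name `limb_multiply` is hypothetical.

```python
LIMB_BITS = 19
NUM_LIMBS = 7  # 7 * 19 = 133 bits, the width of data_a


def limb_multiply(a, b):
    """Multiply a 133-bit value by a 128-bit value via 19-bit limbs of a."""
    assert 0 <= a < (1 << (LIMB_BITS * NUM_LIMBS))
    assert 0 <= b < (1 << 128)
    mask = (1 << LIMB_BITS) - 1

    # Stage 0: one partial product per limb of a.
    partial = [((a >> (LIMB_BITS * i)) & mask) * b for i in range(NUM_LIMBS)]

    # Stage 1: pair neighbouring partial products.
    t1 = [
        partial[0] + (partial[1] << LIMB_BITS),
        partial[2] + (partial[3] << LIMB_BITS),
        partial[4] + (partial[5] << LIMB_BITS),
        partial[6],
    ]

    # Stage 2: pair the pairs (each pair spans two limbs).
    t2_0 = t1[0] + (t1[1] << (2 * LIMB_BITS))
    t2_1 = t1[2] + (t1[3] << (2 * LIMB_BITS))

    # Stage 3: the upper half starts at limb 4.
    return t2_0 + (t2_1 << (4 * LIMB_BITS))


if __name__ == "__main__":
    a = (1 << 133) - 1
    b = (1 << 128) - 1
    print(limb_multiply(a, b) == a * b)
```

Comparing a simulated `data_z` against `a * b` for random inputs would catch any mismatch between the slice width and the shift amounts.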
--- a/ChaCha20_Poly1305_64/poly1305_timing_test/poly1305_timing_test.peri.xml
+++ b/ChaCha20_Poly1305_64/poly1305_timing_test/poly1305_timing_test.peri.xml
@@ -0,0 +1,122 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<efxpt:design_db name="poly1305_timing_test" device_def="Ti375N1156" version="2025.1.110" db_version="20251999" last_change_date="Sat Jul  5 07:15:12 2025" xmlns:efxpt="http://www.efinixinc.com/peri_design_db" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.efinixinc.com/peri_design_db peri_design_db.xsd ">
+    <efxpt:device_info>
+        <efxpt:iobank_info>
+            <efxpt:iobank name="2A" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2A_MODE_SEL"/>
+            <efxpt:iobank name="2B" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2B_MODE_SEL"/>
+            <efxpt:iobank name="2C" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2C_MODE_SEL"/>
+            <efxpt:iobank name="2D" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2D_MODE_SEL"/>
+            <efxpt:iobank name="2E" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2E_MODE_SEL"/>
+            <efxpt:iobank name="4A" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4A_MODE_SEL"/>
+            <efxpt:iobank name="4B" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4B_MODE_SEL"/>
+            <efxpt:iobank name="4C" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4C_MODE_SEL"/>
+            <efxpt:iobank name="4D" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4D_MODE_SEL"/>
+            <efxpt:iobank name="BL0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL0_MODE_SEL"/>
+            <efxpt:iobank name="BL1" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL1_MODE_SEL"/>
+            <efxpt:iobank name="BL2" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL2_MODE_SEL"/>
+            <efxpt:iobank name="BL3" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL3_MODE_SEL"/>
+            <efxpt:iobank name="BR0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR0_MODE_SEL"/>
+            <efxpt:iobank name="BR1" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR1_MODE_SEL"/>
+            <efxpt:iobank name="BR3" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR3_MODE_SEL"/>
+            <efxpt:iobank name="BR4" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR4_MODE_SEL"/>
+            <efxpt:iobank name="TL0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TL0_MODE_SEL"/>
+            <efxpt:iobank name="TL1_TL5" iostd="3.3 V LVCMOS" is_dyn_voltage="false">
+                <efxpt:mode_sel_name>
+                    <efxpt:pin_name bank_name="TL1" value="TL1_MODE_SEL"/>
+                    <efxpt:pin_name bank_name="TL5" value="TL5_MODE_SEL"/>
+                </efxpt:mode_sel_name>
+            </efxpt:iobank>
+            <efxpt:iobank name="TR0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR0_MODE_SEL"/>
+            <efxpt:iobank name="TR1" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR1_MODE_SEL"/>
+            <efxpt:iobank name="TR2" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR2_MODE_SEL"/>
+            <efxpt:iobank name="TR3" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR3_MODE_SEL"/>
+            <efxpt:iobank name="TR5" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR5_MODE_SEL"/>
+        </efxpt:iobank_info>
+        <efxpt:ctrl_info>
+            <efxpt:ctrl name="cfg" ctrl_def="CONFIG_CTRL0" clock_name="" is_clk_invert="false" cbsel_bus_name="cfg_CBSEL" config_ctrl_name="cfg_CONFIG" ena_capture_name="cfg_ENA" error_status_name="cfg_ERROR" um_signal_status_name="cfg_USR_STATUS" is_remote_update_enable="false" is_user_mode_enable="false">
+                <efxpt:gen_param>
+                    <efxpt:param name="remote_update_retries" value="0" value_type="int"/>
+                </efxpt:gen_param>
+            </efxpt:ctrl>
+        </efxpt:ctrl_info>
+        <efxpt:seu_info>
+            <efxpt:seu name="seu" block_def="CONFIG_SEU0" mode="auto" ena_detect="false" wait_interval="16500000">
+                <efxpt:gen_pin>
+                    <efxpt:pin name="seu_START" type_name="START" is_bus="false"/>
+                    <efxpt:pin name="seu_INJECT_ERROR" type_name="INJECT_ERROR" is_bus="false"/>
+                    <efxpt:pin name="seu_RST" type_name="RST" is_bus="false"/>
+                    <efxpt:pin name="seu_CONFIG" type_name="CONFIG" is_bus="false"/>
+                    <efxpt:pin name="seu_ERROR" type_name="ERROR" is_bus="false"/>
+                    <efxpt:pin name="seu_DONE" type_name="DONE" is_bus="false"/>
+                </efxpt:gen_pin>
+            </efxpt:seu>
+        </efxpt:seu_info>
+        <efxpt:clkmux_info>
+            <efxpt:clkmux name="GCLKMUX_B" block_def="GCLKMUX_B" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
+                <efxpt:gen_pin>
+                    <efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
+                </efxpt:gen_pin>
+            </efxpt:clkmux>
+            <efxpt:clkmux name="GCLKMUX_L" block_def="GCLKMUX_L" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
+                <efxpt:gen_pin>
+                    <efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
+                </efxpt:gen_pin>
+            </efxpt:clkmux>
+            <efxpt:clkmux name="GCLKMUX_R" block_def="GCLKMUX_R" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
+                <efxpt:gen_pin>
+                    <efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
+                </efxpt:gen_pin>
+            </efxpt:clkmux>
+            <efxpt:clkmux name="GCLKMUX_T" block_def="GCLKMUX_T" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
+                <efxpt:gen_pin>
+                    <efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
+                    <efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
+                </efxpt:gen_pin>
+            </efxpt:clkmux>
+        </efxpt:clkmux_info>
+    </efxpt:device_info>
+    <efxpt:gpio_info>
+        <efxpt:global_unused_config state="input with weak pullup"/>
+    </efxpt:gpio_info>
+    <efxpt:pll_info/>
+    <efxpt:osc_info/>
+    <efxpt:lvds_info/>
+    <efxpt:mipi_info/>
+    <efxpt:jtag_info/>
+    <efxpt:ddr_info/>
+    <efxpt:mipi_dphy_info/>
+    <efxpt:pll_ssc_info/>
+    <efxpt:quad_lane_info/>
+    <efxpt:quad_pcie_info/>
+    <efxpt:lane_10g_info/>
+    <efxpt:lane_1g_info/>
+    <efxpt:raw_serdes_info/>
+    <efxpt:soc_info/>
+</efxpt:design_db>
--- a/ChaCha20_Poly1305_64/poly1305_timing_test/poly1305_timing_test.xml
+++ b/ChaCha20_Poly1305_64/poly1305_timing_test/poly1305_timing_test.xml
@@ -0,0 +1,110 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<efx:project name="poly1305_timing_test" description="" last_change="1752448578" sw_version="2025.1.110" last_run_state="pass" last_run_flow="bitstream" config_result_in_sync="true" design_ood="sync" place_ood="sync" route_ood="sync" xmlns:efx="http://www.efinixinc.com/enf_proj" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.efinixinc.com/enf_proj enf_proj.xsd">
+    <efx:device_info>
+        <efx:family name="Titanium"/>
+        <efx:device name="Ti375N1156"/>
+        <efx:timing_model name="C4"/>
+    </efx:device_info>
+    <efx:design_info def_veri_version="sv_09" def_vhdl_version="vhdl_2008" unified_flow="false">
+        <efx:top_module name="mult_timing_test"/>
+        <efx:design_file name="../src/poly1305_core.sv" version="default" library="default"/>
+        <efx:design_file name="../../common/sim/sub/taxi/src/axis/rtl/taxi_axis_if.sv" version="default" library="default"/>
+        <efx:design_file name="../sim/poly1305_core_wrapper.sv" version="default" library="default"/>
+        <efx:design_file name="mult_timing_test.sv" version="default" library="default"/>
+        <efx:top_vhdl_arch name=""/>
+    </efx:design_info>
+    <efx:constraint_info>
+        <efx:sdc_file name="constraints.sdc"/>
+        <efx:inter_file name=""/>
+    </efx:constraint_info>
+    <efx:sim_info/>
+    <efx:misc_info/>
+    <efx:ip_info/>
+    <efx:synthesis tool_name="efx_map">
+        <efx:param name="work_dir" value="work_syn" value_type="e_string"/>
+        <efx:param name="write_efx_verilog" value="on" value_type="e_bool"/>
+        <efx:param name="allow-const-ram-index" value="0" value_type="e_option"/>
+        <efx:param name="blackbox-error" value="1" value_type="e_option"/>
+        <efx:param name="blast_const_operand_adders" value="1" value_type="e_option"/>
+        <efx:param name="bram_output_regs_packing" value="1" value_type="e_option"/>
+        <efx:param name="bram-push-tco-outreg" value="0" value_type="e_option"/>
+        <efx:param name="create-onehot-fsms" value="0" value_type="e_option"/>
+        <efx:param name="fanout-limit" value="0" value_type="e_integer"/>
+        <efx:param name="hdl-compile-unit" value="1" value_type="e_option"/>
+        <efx:param name="hdl-loop-limit" value="20000" value_type="e_integer"/>
+        <efx:param name="infer-clk-enable" value="3" value_type="e_option"/>
+        <efx:param name="infer-sync-set-reset" value="1" value_type="e_option"/>
+        <efx:param name="enable-mark-debug" value="1" value_type="e_option"/>
+        <efx:param name="max_ram" value="-1" value_type="e_integer"/>
+        <efx:param name="max_mult" value="-1" value_type="e_integer"/>
+        <efx:param name="max-bit-blast-mem-size" value="10240" value_type="e_integer"/>
+        <efx:param name="min-sr-fanout" value="0" value_type="e_integer"/>
+        <efx:param name="min-ce-fanout" value="0" value_type="e_integer"/>
+        <efx:param name="mode" value="speed" value_type="e_option"/>
+        <efx:param name="mult-auto-pipeline" value="1" value_type="e_integer"/>
+        <efx:param name="mult-decomp-retime" value="1" value_type="e_option"/>
+        <efx:param name="operator-sharing" value="1" value_type="e_option"/>
+        <efx:param name="optimize-adder-tree" value="1" value_type="e_option"/>
+        <efx:param name="optimize-zero-init-rom" value="1" value_type="e_option"/>
+        <efx:param name="peri-syn-instantiation" value="0" value_type="e_option"/>
+        <efx:param name="peri-syn-inference" value="0" value_type="e_option"/>
+        <efx:param name="ram-decomp-mode" value="0" value_type="e_option"/>
+        <efx:param name="retiming" value="2" value_type="e_option"/>
+        <efx:param name="seq_opt" value="1" value_type="e_option"/>
+        <efx:param name="seq-opt-sync-only" value="0" value_type="e_option"/>
+        <efx:param name="use-logic-for-small-mem" value="64" value_type="e_integer"/>
+        <efx:param name="use-logic-for-small-rom" value="64" value_type="e_integer"/>
+        <efx:param name="max_threads" value="-1" value_type="e_integer"/>
+        <efx:param name="dsp-input-regs-packing" value="1" value_type="e_option"/>
+        <efx:param name="dsp-output-regs-packing" value="1" value_type="e_option"/>
+        <efx:param name="dsp-mac-packing" value="1" value_type="e_option"/>
+        <efx:param name="insert-carry-skip" value="1" value_type="e_option"/>
+        <efx:param name="pack-luts-to-comb4" value="0" value_type="e_option"/>
+        <efx:dynparam name="asdf" value="asdf"/>
+    </efx:synthesis>
+    <efx:place_and_route tool_name="efx_pnr">
+        <efx:param name="work_dir" value="work_pnr" value_type="e_string"/>
+        <efx:param name="verbose" value="off" value_type="e_bool"/>
+        <efx:param name="load_delaym" value="on" value_type="e_bool"/>
+        <efx:param name="optimization_level" value="TIMING_3" value_type="e_option"/>
+        <efx:param name="seed" value="1" value_type="e_integer"/>
+        <efx:param name="placer_effort_level" value="5" value_type="e_option"/>
+        <efx:param name="max_threads" value="-1" value_type="e_integer"/>
+        <efx:param name="print_critical_path" value="10" value_type="e_integer"/>
+        <efx:param name="classic_flow" value="off" value_type="e_noarg"/>
+        <efx:param name="beneficial_skew" value="on" value_type="e_option"/>
+    </efx:place_and_route>
+    <efx:bitstream_generation tool_name="efx_pgm">
+        <efx:param name="mode" value="active" value_type="e_option"/>
+        <efx:param name="width" value="1" value_type="e_option"/>
+        <efx:param name="enable_roms" value="smart" value_type="e_option"/>
+        <efx:param name="spi_low_power_mode" value="on" value_type="e_bool"/>
+        <efx:param name="io_weak_pullup" value="on" value_type="e_bool"/>
+        <efx:param name="oscillator_clock_divider" value="DIV8" value_type="e_option"/>
+        <efx:param name="bitstream_compression" value="on" value_type="e_bool"/>
+        <efx:param name="enable_external_master_clock" value="off" value_type="e_bool"/>
+        <efx:param name="active_capture_clk_edge" value="negedge" value_type="e_option"/>
+        <efx:param name="jtag_usercode" value="0xFFFFFFFF" value_type="e_string"/>
+        <efx:param name="release_tri_then_reset" value="on" value_type="e_bool"/>
+        <efx:param name="four_byte_addressing" value="off" value_type="e_bool"/>
+        <efx:param name="generate_bit" value="on" value_type="e_bool"/>
+        <efx:param name="generate_bitbin" value="off" value_type="e_bool"/>
+        <efx:param name="generate_hex" value="on" value_type="e_bool"/>
+        <efx:param name="generate_hexbin" value="off" value_type="e_bool"/>
+        <efx:param name="cold_boot" value="off" value_type="e_bool"/>
+        <efx:param name="cascade" value="off" value_type="e_option"/>
+    </efx:bitstream_generation>
+    <efx:debugger>
+        <efx:param name="work_dir" value="work_dbg" value_type="e_string"/>
+        <efx:param name="auto_instantiation" value="off" value_type="e_bool"/>
+        <efx:param name="profile" value="NONE" value_type="e_string"/>
+    </efx:debugger>
+    <efx:security>
+        <efx:param name="randomize_iv_value" value="on" value_type="e_bool"/>
+        <efx:param name="iv_value" value="" value_type="e_string"/>
+        <efx:param name="enable_bitstream_encrypt" value="off" value_type="e_bool"/>
+        <efx:param name="enable_bitstream_auth" value="off" value_type="e_bool"/>
+        <efx:param name="encryption_key_file" value="NONE" value_type="e_string"/>
+        <efx:param name="auth_key_file" value="NONE" value_type="e_string"/>
+    </efx:security>
+</efx:project>
--- a/ChaCha20_Poly1305_64/sim/do_poly_1305.py
+++ b/ChaCha20_Poly1305_64/sim/do_poly_1305.py
@@ -0,0 +1,121 @@
+from typing import List
+
+from modulo_theory import friendly_modular_mult, friendly_modulo
+
+def mask_r(r: int) -> int:
+    r_bytes = r.to_bytes(16, "little")
+
+    r_masked = bytearray(r_bytes)
+    r_masked[3] &= 15
+    r_masked[7] &= 15
+    r_masked[11] &= 15
+    r_masked[15] &= 15
+    r_masked[4] &= 252
+    r_masked[8] &= 252
+    r_masked[12] &= 252
+
+
+    r_masked = int.from_bytes(r_masked, "little")
+
+    return r_masked
+
+
+def poly1305(message: bytes, r: int, s: int):
+    r = mask_r(r)
+    p = 2**130-5
+    acc = 0
+
+    chunks = [message[i:i+16] for i in range(0, len(message), 16)]
+
+    for chunk in chunks:
+        # Pad using the chunk's byte count; bit_length() miscounts blocks whose high bytes are zero
+        block = int.from_bytes(chunk, "little")
+        block += 1 << (8*len(chunk))
+
+        acc = ((acc+block)*r) % p
+
+    acc += s
+
+    return acc & (2**128-1)
+
+def parallel_poly1305(message: bytes, r: int, s: int, lanes: int):
+    r = mask_r(r)
+    p = 2**130-5
+
+    r_powers = [1, r]
+
+    for l_pow_log2 in range(3):
+        l_pow = 2**l_pow_log2
+        for r_pow in range(1,l_pow+1):
+            r_powers.append(friendly_modular_mult(r_powers[l_pow], r_powers[r_pow]))
+    
+    acc = [0]*lanes
+
+    blocks = [message[i:i+16] for i in range(0, len(message), 16)]
+
+    lane_blocks = [blocks[i:i+lanes] for i in range(0, len(blocks), lanes)]
+
+    for i, lane_block in enumerate(lane_blocks):
+        for j, chunk in enumerate(lane_block):
+            idx = i*lanes + j
+            power = min(lanes, len(blocks) - idx)
+
+            # The byte count comes from the chunk itself (tkeep in hardware), so there is
+            # no division and no miscount when the high bytes of a block are zero
+            lane = int.from_bytes(chunk, "little") + (1 << (8*len(chunk)))
+
+            acc[j] = friendly_modular_mult(acc[j] + lane, r_powers[power])
+
+    combined_acc = friendly_modulo(sum(acc), 0)
+    combined_acc += s
+
+    return combined_acc & (2**128-1)
+
+
+def test_regular():
+    r = 0xa806d542fe52447f336d555778bed685
+    s = 0x1bf54941aff6bf4afdb20dfb8a800301
+    
+    golden_result = 0xa927010caf8b2bc2c6365130c11d06a8
+
+    msg = b"Cryptographic Forum Research Group"
+
+    result = poly1305(msg, r, s)
+
+    print(f"{result:x}")
+    assert result == golden_result, f"expected {golden_result:x}"
+
+def test_parallel():
+    r = 0xa806d542fe52447f336d555778bed685
+    s = 0x1bf54941aff6bf4afdb20dfb8a800301
+    
+    golden_result = 0xa927010caf8b2bc2c6365130c11d06a8
+
+    msg = b"Cryptographic Forum Research Group"
+
+    result = parallel_poly1305(msg, r, s, 8)
+    
+    print(f"{result:x}")
+    assert result == golden_result, f"expected {golden_result:x}"
+
+
+def test_on_long_string():
+    r = 0xa806d542fe52447f336d555778bed685
+    s = 0x1bf54941aff6bf4afdb20dfb8a800301
+
+    msg = b"Very long message with lots of words that is very long and requires a lot of cycles to complete because of how long it is"
+
+    regular_result = poly1305(msg, r, s)
+    parallel_result = parallel_poly1305(msg, r, s, 8)
+
+    print(f"{regular_result:x}")
+    assert parallel_result == regular_result, f"parallel gave {parallel_result:x}"
+
+
+def main():
+    test_regular()
+    test_parallel()
+    test_on_long_string()
+
+if __name__ == "__main__":
+    main()
--- a/ChaCha20_Poly1305_64/sim/modulo_theory.py
+++ b/ChaCha20_Poly1305_64/sim/modulo_theory.py
@@ -0,0 +1,87 @@
+import random
+
+PRIME = 2**130-5
+
+def modulo_theory_simple(loops: int):
+    prime = 97
+
+    for _ in range(loops):
+        value_a = random.randint(1,97)
+        value_b = random.randint(1,97)
+
+        value_a_high = value_a // 10
+        value_a_low = value_a % 10  # Ignore this modulo, in base 2 it is a mask
+
+        prod_high = value_a_high * value_b
+        prod_low = value_a_low * value_b
+
+        mod_high = (prod_high*10) % prime
+        mod_low = prod_low % prime
+
+        mod_sum = (mod_high + mod_low) % prime
+
+        mod_conventional = (value_a * value_b) % prime
+
+        if mod_sum != mod_conventional:
+            print(f"{value_a}")
+            print(f"{value_b}")
+            print(f"{mod_sum=}")
+            print(f"{mod_conventional=}")
+
+def modulo_theory_full(loops: int):
+    for _ in range(loops):
+        value_a = random.randint(1,PRIME)
+        value_b = random.randint(1,2**128)
+
+        a_partials = [(value_a >> 26*i) & (2**26-1) for i in range(5)]
+
+        prods = [a_partial * value_b for a_partial in a_partials]
+
+        mods = [friendly_modulo(prod, 26*i) for i, prod in enumerate(prods)]
+
+
+        mod_sum = friendly_modulo(sum(mods), 0)
+
+        mod_conventional = (value_a * value_b) % PRIME
+
+        if mod_sum != mod_conventional:
+            print(f"{value_a}")
+            print(f"{value_b}")
+            print(f"{mod_sum=}")
+            print(f"{mod_conventional=}")
+
+def friendly_modular_mult(value_a: int, value_b: int) -> int:
+    a_partials = [(value_a >> 26*i) & (2**26-1) for i in range(4)] + [value_a >> 104]  # top limb unmasked: acc + block can reach 131 bits
+
+    prods = [a_partial * value_b for a_partial in a_partials]
+
+    mods = [friendly_modulo(prod, 26*i) for i, prod in enumerate(prods)]
+
+
+    mod_sum = friendly_modulo(sum(mods), 0)
+
+    return mod_sum
+
+def friendly_modulo(val: int, shift_amount: int) -> int:
+    high_part = val >> (130-shift_amount)
+    low_part = (val << shift_amount) & (2**130-1)
+
+    high_part *= 5
+
+    val = high_part + low_part
+
+    high_part = val >> 130
+    low_part = val & (2**130-1)
+
+    high_part *= 5
+
+    val = high_part + low_part
+
+    if val >= PRIME:
+        val -= PRIME
+
+    return val
+
+if __name__ == "__main__":
+    #modulo_theory_simple(10000000)
+    modulo_theory_full(100000)
--- a/ChaCha20_Poly1305_64/sim/poly1305.yaml
+++ b/ChaCha20_Poly1305_64/sim/poly1305.yaml
@@ -0,0 +1,13 @@
+tests:
+  - name: "poly1305_core"
+    toplevel: "poly1305_core_wrapper"
+    modules: 
+      - "poly1305_core"
+    sources: "sources.list"
+    waves: True
+  - name: "friendly_modulo"
+    toplevel: "poly1305_friendly_modulo"
+    modules:
+      - "poly1305_friendly_modulo"
+    sources: "sources.list"
+    waves: True
--- a/ChaCha20_Poly1305_64/sim/poly1305_core.py
+++ b/ChaCha20_Poly1305_64/sim/poly1305_core.py
@@ -0,0 +1,75 @@
+import logging
+
+
+import cocotb
+from cocotb.clock import Clock
+from cocotb.triggers import Timer, RisingEdge, FallingEdge
+from cocotb.queue import Queue
+
+from cocotbext.axi import AxiStreamBus, AxiStreamSource
+
+CLK_PERIOD = 4
+
+
+class TB:
+    def __init__(self, dut):
+        self.dut = dut
+
+        self.log = logging.getLogger("cocotb.tb")
+        self.log.setLevel(logging.INFO)
+
+        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())
+
+        self.s_data_axis = AxiStreamSource(AxiStreamBus.from_prefix(dut, ""), dut.i_clk, dut.i_rst)
+
+    async def cycle_reset(self):
+        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)
+
+    async def _cycle_reset(self, rst, clk):
+        rst.setimmediatevalue(0)
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+        rst.value = 1
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+        rst.value = 0
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+
+@cocotb.test
+async def test_sanity(dut):
+    tb = TB(dut)
+
+    await tb.cycle_reset()
+
+    s = 0x1bf54941aff6bf4afdb20dfb8a800301
+    r = 0xa806d542fe52447f336d555778bed685
+    r_masked = 0x0806d5400e52447c036d555408bed685
+
+    result = 0xa927010caf8b2bc2c6365130c11d06a8
+
+    msg = b"Cryptographic Forum Research Group"
+
+
+    tb.dut.i_otk.value = ((r << 128) | s)
+    tb.dut.i_otk_valid.value = 1
+    await RisingEdge(tb.dut.i_clk)
+    tb.dut.i_otk_valid.value = 0
+    await RisingEdge(tb.dut.i_clk)
+
+    dut_s = tb.dut.u_dut.poly1305_s.value.integer
+    dut_r = tb.dut.u_dut.poly1305_r.value.integer
+
+    assert dut_s == s
+    assert dut_r == r_masked
+
+    await tb.s_data_axis.send(msg)
+
+    await RisingEdge(tb.dut.o_tag_valid)
+    tag = tb.dut.o_tag.value.integer
+    
+    tb.log.info(f"tag: {tag:x}")
+
+    assert tag == result
+
+    await Timer(1, "us")
--- a/ChaCha20_Poly1305_64/sim/poly1305_core_wrapper.sv
+++ b/ChaCha20_Poly1305_64/sim/poly1305_core_wrapper.sv
@@ -0,0 +1,40 @@
+module poly1305_core_wrapper(
+    input i_clk,
+    input i_rst,
+
+    input  [255:0] i_otk,
+    input          i_otk_valid,
+
+    output [127:0] o_tag,
+    output         o_tag_valid,
+
+    input  [127:0] tdata,
+    input  [15:0] tkeep,
+    input  [15:0] tstrb,
+    input  tlast,
+    input  tvalid,
+    output tready
+);
+
+taxi_axis_if #(.DATA_W(128)) s_data_axis();
+
+assign s_data_axis.tdata = tdata;
+assign s_data_axis.tkeep = tkeep;
+assign s_data_axis.tstrb = tstrb;
+assign s_data_axis.tlast = tlast;
+assign s_data_axis.tvalid = tvalid;
+assign tready = s_data_axis.tready;
+
+poly1305_core u_dut (
+    .i_clk          (i_clk),
+    .i_rst          (i_rst),
+    .i_otk          (i_otk),
+    .i_otk_valid    (i_otk_valid),
+
+    .o_tag          (o_tag),
+    .o_tag_valid    (o_tag_valid),
+
+    .s_data_axis    (s_data_axis)
+);
+
+endmodule
--- a/ChaCha20_Poly1305_64/sim/poly1305_friendly_modulo.py
+++ b/ChaCha20_Poly1305_64/sim/poly1305_friendly_modulo.py
@@ -0,0 +1,91 @@
+import logging
+
+
+import cocotb
+from cocotb.clock import Clock
+from cocotb.triggers import Timer, RisingEdge, FallingEdge
+from cocotb.queue import Queue
+
+from cocotbext.axi import AxiStreamBus, AxiStreamSource
+
+import random
+
+PRIME = 2**130-5
+
+CLK_PERIOD = 4
+
+
+class TB:
+    def __init__(self, dut):
+        self.dut = dut
+
+        self.log = logging.getLogger("cocotb.tb")
+        self.log.setLevel(logging.INFO)
+
+        self.input_queue = Queue()
+
+        self.expected_queue = Queue()
+        self.output_queue = Queue()
+
+        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())
+
+        cocotb.start_soon(self.run_input())
+        cocotb.start_soon(self.run_output())
+
+    async def cycle_reset(self):
+        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)
+
+    async def _cycle_reset(self, rst, clk):
+        rst.setimmediatevalue(0)
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+        rst.value = 1
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+        rst.value = 0
+        await RisingEdge(clk)
+        await RisingEdge(clk)
+
+    async def write_input(self, value: int, shift_amount: int):
+        await self.input_queue.put((value, shift_amount))
+        await self.expected_queue.put((value << (shift_amount*26)) % PRIME)
+
+    async def run_input(self):
+        while True:
+            value, shift_amount = await self.input_queue.get()
+            self.dut.i_valid.value = 1
+            self.dut.i_val.value = value
+            self.dut.i_shift_amount.value = shift_amount
+            await RisingEdge(self.dut.i_clk)
+            self.dut.i_valid.value = 0
+            self.dut.i_shift_amount.value = 0
+            self.dut.i_val.value = 0
+
+    async def run_output(self):
+        while True:
+            await RisingEdge(self.dut.i_clk)
+            if self.dut.o_valid.value:
+                await self.output_queue.put(self.dut.o_result.value.integer)
+
+@cocotb.test
+async def test_sanity(dut):
+    tb = TB(dut)
+
+    await tb.cycle_reset()
+
+    count = 1024
+
+    for _ in range(count):
+        await tb.write_input(random.randint(1,2**(130+16)), random.randint(0, 4))
+
+    fail = False
+
+    for _ in range(count):
+        sim_val = await tb.expected_queue.get()
+        dut_val = await tb.output_queue.get()
+
+        if sim_val != dut_val:
+            tb.log.info(f"{sim_val:x} -> {dut_val:x}")
+            fail = True
+
+    assert not fail
--- a/ChaCha20_Poly1305_64/sim/sources.list
+++ b/ChaCha20_Poly1305_64/sim/sources.list
@@ -1 +1,4 @@
-../src/sources.list
+poly1305_core_wrapper.sv
+
+../src/sources.list
+../../common/sim/sub/taxi/src/axis/rtl/taxi_axis_if.sv
--- a/ChaCha20_Poly1305_64/src/chacha20_poly1305_64.sv
+++ b/ChaCha20_Poly1305_64/src/chacha20_poly1305_64.sv
@@ -0,0 +1,24 @@
+module chacha20_poly1305_64 (
+    input i_clk,
+    input i_rst,
+
+    taxi_axis_if.snk s_ctrl_axis,
+    taxi_axis_if.snk s_data_axis,
+    taxi_axis_if.src m_data_axis
+);
+
+//TODO the rest of this
+
+// control axis decoder.
+
+localparam R_MASK = 128'h0ffffffc0ffffffc0ffffffc0fffffff;
+
+chacha20_pipelined_block u_chacha20_pipelined_block (
+
+);
+
+poly1305 u_poly1305 (
+
+);
+
+endmodule
--- a/ChaCha20_Poly1305_64/src/poly1305_core.sv
+++ b/ChaCha20_Poly1305_64/src/poly1305_core.sv
@@ -0,0 +1,101 @@
+module poly1305_core #(
+
+) (
+    input i_clk,
+    input i_rst,
+
+    input  [255:0] i_otk,
+    input          i_otk_valid,
+
+    output [127:0] o_tag,
+    output         o_tag_valid,
+
+    taxi_axis_if.snk s_data_axis
+);
+
+// Incoming data must be 128 bits wide and packed: tkeep is all ones on every beat except the last, the last beat has no gaps, and tdata must be zero wherever tkeep is 0
+
+
+localparam R_MASK = 128'h0ffffffc0ffffffc0ffffffc0fffffff;
+localparam P130M5 = 258'h3fffffffffffffffffffffffffffffffb;
+
+logic [127:0] poly1305_r, poly1305_s;
+logic [129:0] accumulator, accumulator_next;
+
+logic [129:0] data_one_extended;
+logic [130:0] data_post_add, data_post_add_reg;
+
+logic [257:0] data_post_mul, data_post_mul_reg;
+
+logic [257:0] modulo_stage, modulo_stage_next;
+
+logic [2:0] phase;
+
+logic [3:0] valid_sr;
+
+function logic [129:0] tkeep_expand (input [15:0] tkeep);
+    tkeep_expand = '0;
+    for (int i = 0; i < 16; i++) begin
+        tkeep_expand[i*8 +: 8] = {8{tkeep[i]}};
+    end
+endfunction
+
+// only ready in phase 0
+assign s_data_axis.tready = phase == 0;
+assign o_tag_valid = valid_sr[3];
+
+always_ff @(posedge i_clk) begin
+    valid_sr <= {valid_sr[2:0], s_data_axis.tlast & s_data_axis.tvalid & s_data_axis.tready & (phase == 0)};
+    data_post_add_reg <= data_post_add;
+    data_post_mul_reg <= data_post_mul;
+    modulo_stage <= modulo_stage_next;
+
+    if (i_otk_valid) begin
+        poly1305_r <= i_otk[255:128] & R_MASK;
+        poly1305_s <= i_otk[127:0];
+    end
+
+    if (s_data_axis.tvalid && phase == 0) begin
+        phase <= 1;
+    end
+
+    if (phase == 1) begin
+        phase <= 2;
+    end
+
+    if (phase == 2) begin
+        phase <= 3;
+    end
+
+    if (phase == 3) begin
+        accumulator <= accumulator_next;
+        phase <= '0;
+    end
+
+    if (i_rst) begin // placed last so reset overrides the assignments above
+        phase <= '0;
+        valid_sr <= '0;
+    end
+end
+
+always_comb begin
+    accumulator_next = accumulator;
+    data_post_mul = '0;
+
+    // phase == 0
+    data_one_extended = (tkeep_expand(s_data_axis.tkeep) + 1) | {2'b0, s_data_axis.tdata};
+    data_post_add = data_one_extended + accumulator;
+
+    // phase == 1
+    data_post_mul = data_post_add_reg * poly1305_r;
+
+    // phase == 2
+    modulo_stage_next = (data_post_mul_reg[257:130] * 5) + 258'(data_post_mul_reg[129:0]);
+
+    // phase == 3
+    accumulator_next = 130'((modulo_stage[257:130] * 5) + 258'(modulo_stage[129:0]));
+end
+
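+assign o_tag = accumulator[127:0] + poly1305_s; // NOTE: accumulator is only partially reduced; values in [2^130-5, 2^130) would still need a final subtract of P130M5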
+
+endmodule
--- a/ChaCha20_Poly1305_64/src/poly1305_friendly_modulo.sv
+++ b/ChaCha20_Poly1305_64/src/poly1305_friendly_modulo.sv
@@ -0,0 +1,48 @@
+module poly1305_friendly_modulo #(
+    parameter WIDTH = 130,
+    parameter MDIFF = 5,        // modulo difference
+    parameter SHIFT_SIZE = 26
+) (
+    input  logic                i_clk,
+    input  logic                i_rst,
+
+    input  logic                i_valid,
+    input  logic [2*WIDTH-1:0]  i_val,
+    input  logic [2:0]          i_shift_amount,
+
+    output logic                o_valid,
+    output logic [WIDTH-1:0]    o_result
+);
+
+localparam WIDE_WIDTH = WIDTH + $clog2(MDIFF);
+localparam [WIDTH-1:0]  PRIME = (1 << WIDTH) - MDIFF;
+
+logic [WIDE_WIDTH-1:0] high_part_1, high_part_2;
+logic [WIDTH-1:0] low_part_1, low_part_2;
+
+logic [WIDE_WIDTH-1:0] intermediate_val;
+logic [WIDTH-1:0]      final_val;
+
+logic [2:0] unused_final;
+
+logic [2:0] valid_sr;
+
+assign intermediate_val = high_part_1 + WIDE_WIDTH'(low_part_1);
+
+assign o_result = (final_val >= PRIME) ? final_val - PRIME : final_val;
+
+assign o_valid = valid_sr[2];
+
+always_ff @(posedge i_clk) begin
+    valid_sr <= {valid_sr[1:0], i_valid};
+
+    high_part_1 <= WIDTH'({3'b0, i_val} >> (WIDTH - (i_shift_amount*SHIFT_SIZE))) * MDIFF;
+    low_part_1 <= WIDTH'(i_val << (i_shift_amount*SHIFT_SIZE));
+
+    high_part_2 <= (intermediate_val >> WIDTH) * MDIFF;
+    low_part_2 <= intermediate_val[WIDTH-1:0];
+
+    {unused_final, final_val} <= high_part_2 + WIDE_WIDTH'(low_part_2);
+end
+
+endmodule
--- a/ChaCha20_Poly1305_64/src/sources.list
+++ b/ChaCha20_Poly1305_64/src/sources.list
@@ -1,4 +1,7 @@
 chacha20_qr.sv
 chacha20_block.sv
 chacha20_pipelined_round.sv
-chacha20_pipelined_block.sv
+chacha20_pipelined_block.sv
+
+poly1305_core.sv
+poly1305_friendly_modulo.sv
Author	SHA1	Message	Date
Byron Lathi	2fd1136154	Actually randomize testing	2025-10-27 20:13:21 -07:00
Byron Lathi	06d5949aa7	Add rtl for friendly_modulo	2025-10-27 19:19:43 -07:00
Byron Lathi	003527ee0d	Do poly1305 with absolutely no modulo operators	2025-10-26 16:09:16 -07:00
Byron Lathi	fd50ecc4f0	Calculate r powers ahead of time	2025-10-26 15:43:58 -07:00
Byron Lathi	faef39c4d3	Add modulo theory	2025-10-26 15:43:36 -07:00
Byron Lathi	5e3b7be854	Add parallel implementation	2025-10-24 18:46:30 -07:00
Byron Lathi	d9651e9074	Add poly1305 python implementation	2025-10-24 08:25:35 -07:00
Byron Lathi	80e3faeae6	ramblings	2025-07-14 11:10:43 -07:00
Byron Lathi	2b57079205	Add poly1305 and synthesis test Wow this does not come even close to passing timing. Need to be smarter	2025-07-05 07:30:18 -07:00
Byron Lathi	7f91a8af32	Get poly1305 core to kind of work	2025-07-04 10:49:48 -07:00
				`@@ -0,0 +1 @@`
				`create_clock -period 2.5 -name clk [get_ports i_clk]`
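
Review note: the scripts in this diff rely on the identity 2^130 ≡ 5 (mod 2^130 - 5) to replace every `%` with a shift, a mask, and a multiply by 5. The sketch below is one minimal, self-checking way to exercise that idea outside the simulator. The names `fold_mod_p`, `poly1305_tag`, and `self_check` are illustrative only and do not appear in the diff; the fold here assumes inputs below 2^260 and is not a model of the pipelined RTL. The key, message, and tag are the same test vector used by `test_regular` above.

```python
import random

P = 2**130 - 5
MASK130 = 2**130 - 1
R_CLAMP = 0x0ffffffc0ffffffc0ffffffc0fffffff  # same value as R_MASK in the RTL


def fold_mod_p(val: int) -> int:
    """Reduce val (assumed below 2**260) modulo 2**130 - 5 without a modulo operator."""
    for _ in range(2):
        # 2**130 is congruent to 5 (mod P), so bits above bit 129 fold back in as 5 * high
        val = (val >> 130) * 5 + (val & MASK130)
    # After two folds val is below 2 * P, so one conditional subtract finishes the reduction
    return val - P if val >= P else val


def poly1305_tag(message: bytes, r: int, s: int) -> int:
    """Serial Poly1305 using fold_mod_p; pads each block by its byte count."""
    r &= R_CLAMP
    acc = 0
    for i in range(0, len(message), 16):
        chunk = message[i:i + 16]
        block = int.from_bytes(chunk, "little") + (1 << (8 * len(chunk)))
        acc = fold_mod_p((acc + block) * r)
    return (acc + s) & (2**128 - 1)


def self_check(trials: int = 1000, seed: int = 0) -> None:
    """Compare the fold against Python's % and check the tag used in this diff."""
    rng = random.Random(seed)
    for _ in range(trials):
        val = rng.getrandbits(260)
        assert fold_mod_p(val) == val % P

    r = 0xa806d542fe52447f336d555778bed685
    s = 0x1bf54941aff6bf4afdb20dfb8a800301
    msg = b"Cryptographic Forum Research Group"
    assert poly1305_tag(msg, r, s) == 0xa927010caf8b2bc2c6365130c11d06a8


if __name__ == "__main__":
    self_check()
    print("ok")
```

Seeding the generator keeps a failing case reproducible, which may be useful when the same values are later replayed against `poly1305_friendly_modulo` in cocotb.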