Compare commits

...

10 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Byron Lathi | 2fd1136154 | Actually randomize testing | 2025-10-27 20:13:21 -07:00 |
| Byron Lathi | 06d5949aa7 | Add rtl for friendly_modulo | 2025-10-27 19:19:43 -07:00 |
| Byron Lathi | 003527ee0d | Do poly1305 with absolutely no modulo operators | 2025-10-26 16:09:16 -07:00 |
| Byron Lathi | fd50ecc4f0 | Calculate r powers ahead of time | 2025-10-26 15:43:58 -07:00 |
| Byron Lathi | faef39c4d3 | Add modulo theory | 2025-10-26 15:43:36 -07:00 |
| Byron Lathi | 5e3b7be854 | Add parallel implementation | 2025-10-24 18:46:30 -07:00 |
| Byron Lathi | d9651e9074 | Add poly1305 python implementation | 2025-10-24 08:25:35 -07:00 |
| Byron Lathi | 80e3faeae6 | ramblings | 2025-07-14 11:10:43 -07:00 |
| Byron Lathi | 2b57079205 | Add poly1305 and synthesis test. Wow this does not come even close to passing timing. Need to be smarter | 2025-07-05 07:30:18 -07:00 |
| Byron Lathi | 7f91a8af32 | Get poly1305 core to kind of work | 2025-07-04 10:49:48 -07:00 |
17 changed files with 1053 additions and 98 deletions

@@ -1,108 +1,36 @@
# Notes
# Overall Notes
Since we are designing this for a 64 bit datapath, we need to be able to
compute 64 bits every cycle. The ChaCha20 hash works on groups of 16x32, or
512-bit blocks at a time. Logically it might make more sense to have a datapath
of 128 bits.
We need to support 25Gbps, and we will have 2 datapaths, tx and rx
On the other hand, each operation is a 32 bit operation. It might make more
sense for timing reasons then to have each operation registered. But will this
be able to match the throughput that we need?
Each quarter round generates 4 words. Each cycle updates all 128 bits at once.
We can do 4 of the quarter rounds at once, so at the end of each cycle we will
generate 512 bits.
At full speed then, the core would generate 512 bits per cycle, but we would
only need to generate 64 bits per cycle. We could do only 1 quarter round at
once, which would only generate 128 bits per cycle, but we would need some sort
of structure to reorder the state such that it is ready to xor with the
incoming data. We could even make this parameterizable, but that would be the
next step if we actually need to support 100Gbps encryption.
So in summary, we will have a single QuarterRound module which generates 128
bits of output. We will have a scheduling block which schedules which 4 words
of state go into the quarter round module, and a de-interleaver which takes the
output from the quarter round module and re-orders it to be in the correct
order to combine with the incoming data. there is also an addition in there
somewhere.
To support AEAD, the first round becomes the key for the Poly1305 block. This
can be done in parallel with the second round, which becomes the cipher, at the
expense of double the gates. Otherwise, there would be a delay in between
packets as this is generated.
At a 128 bit datapath, this is 200MHz, but let's aim for 250MHz
Okay so we did some timing tests and we can easily do 1 round of ChaCha20 in a
single cycle on a Titanium FPGA at 250MHz (~350-400 MHz)
# ChaCha20 Notes
So then it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or
6.4Gbps. So then we will need 2 of these for 10Gbps.
So in order to use multiple cores, we would calculate 1024 bits in 20 cycles.
Then we would put those bits into a memory or something and start calculating
the next 1024 bits. Those bits would all be used up in 16 cycles, (but the
throughput still checks out). Once they are used, we load the memory with the
new output.
This puts a 20 cycle minimum on small packets since the core is not completely
pipelined. This puts a hard cap at 12.5Mpps. At 42 byte packets, this is
4.2Gbps, and for 64 byte packets is 6.4Gbps. In order to saturate the link, you
would need packets of at least 100 bytes.
This is with the 20 cycle minimum, though in reality it would be more like 25
or 30 with the final addition, scheduling, pipelining etc. Adding more cores
increases the throughput for larger packets, but does nothing for small packets
since the latency is the same. To solve this, we could instantiate the entire
core twice, such that we could handle 2 minimum size packets at the same time.
If we say there is a 30 cycle latency, the worst case is 2.8Gbps. Doubling the
number of cores gives 5.6, quadrupling the number of cores gives 11.2Gbps. This
would of course more than quadruple the area since we need 4x the cores as
well as the mux and demux between them.
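
The packet-rate figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check under the stated assumptions (a 250MHz clock, one packet in flight per core, latency measured in cycles); the function names are made up for illustration and do not appear in the repository.

```python
def packet_rate_pps(clock_hz: float, latency_cycles: int, cores: int = 1) -> float:
    """Packets per second when each core handles one packet per latency window."""
    return cores * clock_hz / latency_cycles


def throughput_bps(clock_hz: float, latency_cycles: int, packet_bytes: int,
                   cores: int = 1) -> float:
    """Bits per second achieved for back-to-back packets of a fixed size."""
    return packet_rate_pps(clock_hz, latency_cycles, cores) * packet_bytes * 8


def min_saturating_bytes(clock_hz: float, latency_cycles: int, link_bps: float,
                         cores: int = 1) -> float:
    """Smallest packet size (bytes) at which the packet-rate cap no longer limits the link."""
    return link_bps / (packet_rate_pps(clock_hz, latency_cycles, cores) * 8)


if __name__ == "__main__":
    clk = 250e6  # assumed clock frequency from the notes
    print(packet_rate_pps(clk, 20) / 1e6, "Mpps with a 20 cycle latency")
    print(throughput_bps(clk, 20, 42) / 1e9, "Gbps at 42 byte packets")
    print(throughput_bps(clk, 20, 64) / 1e9, "Gbps at 64 byte packets")
    print(min_saturating_bytes(clk, 20, 10e9), "bytes needed to saturate 10Gbps")
    print(throughput_bps(clk, 30, 42) / 1e9, "Gbps at 42 bytes with a 30 cycle latency")
```

Passing `cores=2` or `cores=4` scales the results linearly, which matches the doubling and quadrupling estimates in the paragraph above.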
This could be configurable at compile time though. The number of ChaChas per
core would also be configurable, but at the moment I choose 2.
Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if
we want 8 QRs per core instead of 4 for timing reasons.
Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's kind
of a lot. A fully pipelined design would use 322\*20\*4 or 25k XLR. If we can
pass timing using 10k LUTs then that would be nice. We get a peak throughput
of 50Gbps, it's just that the latency kills our packet rate. If we reduce the
latency to 25 cycles and have 2 alternating cores, our packet rate would be
20Mpps, increasing with every cycle we take off. I think that is good. This
would result in 5k XLR which is not so bad.
Chacha20 operates on 512 bit blocks. Each round is made of 4 quarter
rounds, which are the same except for which 32 bit words are used. We can
use the same 32 bit quarter round 4 times in a row, but we need to
store the rest of the round between operations, so memory usage
might be similar to if we just did all 4 at once, but the logic
would only be 25% as much. Because we switch between odd and even
rounds, the data used in one round is not the data used in the other
round.
Okay so starting over now, our clock speed cannot be 250MHz, the best we can do
is 200MHz. If we assume this same 25 cycle latency, that's 4Gbps per block, so
we would need 3 of them to surpass 10Gbps (each is 4.096Gbps), so now we need 3 blocks
instead of 2.
# Poly1305
We are barely going to be able to pass at 180MHz. Maybe the fully pipelined
core is a better idea, but we can just fully pipeline a quarter stage, and
generate 512 bits every 4 clock cycles. This would give us a theoretical
throughput of 32Gbps, and we would not have to worry about latency and small
packets slowing us down. Let's experiment with what that would look like.
## Parallel Operation
For our single round its using 1024 adders, which almost sounds like it is
instantiating 8 quarter rounds instead of just 4. Either way, we can say that
a quarter round is 128ff + 128add + 250lut.
We can calculate in parallel but we need to calculate r^n, where n is the number of
parallel stages. Ideally we would have the number of parallel stages be equal to the
latency of the full stage, that way we could have it be fully pipelined. For
example, if it took 8 cycles per block, we would have 8 parallel calculations. This
requires you to calculate r^n, as well as every intermediate value. If we do 8,
So pipelining 20 of these gives 10k luts. Not so bad.
then we need to calculate r^1, r^2, r^3, etc. This takes log2(n) sequential multiply stages (n-1 multiplies in total).
we need
Actually it's 88k LUTs... it's 512ff * 4 * 20 = 40k ff
Let's just leave it for now even if it's overkill. The hardware would support up to
40Gbps, and technically the FPGA has 16 lanes so could do 160Gbps in total, if
we designed a custom board for it (or 120 if we used FMC connectors).
If we only use a single quarter round multiplexed between all 4, then the same
quarter round module can have 2 different blocks going through it at once.
The new one multiplexes 4 quarter rounds between 1 QR module which reduces the
logic usage down to only 46k le, of which the vast majority is flops (2k ff per round,
0.5k lut)
r\*r = r^2
r\*r^2 = r^3 r^2\*r^2 = r^4
r^4\*r = r^5 r^2\*r^4 = r^6 r^3\*r^4 = r^7 r^4\*r^4 = r^8

@@ -0,0 +1,146 @@
<mxfile host="Electron" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/26.2.2 Chrome/134.0.6998.178 Electron/35.1.2 Safari/537.36" version="26.2.2">
<diagram name="Page-1" id="gIy_vrPza4QP03Kn0wfk">
<mxGraphModel dx="655" dy="442" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="GA09nmFLpfHeItamLD5O-24" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.25;exitDx=0;exitDy=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-1" target="GA09nmFLpfHeItamLD5O-21">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-25" value="r" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-24">
<mxGeometry x="0.5579" y="-1" relative="1" as="geometry">
<mxPoint x="9" y="5" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-35" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.75;exitDx=0;exitDy=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-1" target="GA09nmFLpfHeItamLD5O-34">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-38" value="s" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-35">
<mxGeometry x="-0.6624" y="1" relative="1" as="geometry">
<mxPoint x="-9" y="-9" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-1" value="r/s" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="360" y="200" width="80" height="40" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-2" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.25;entryDx=0;entryDy=0;" edge="1" parent="1" target="GA09nmFLpfHeItamLD5O-1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="320" y="210" as="sourcePoint" />
<mxPoint x="410" y="270" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-3" value="otk" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-2">
<mxGeometry x="-0.3946" y="1" relative="1" as="geometry">
<mxPoint x="-22" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-10" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-4" target="GA09nmFLpfHeItamLD5O-6">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-4" value="64-&amp;gt;128" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="175" y="130" width="50" height="20" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-5" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" target="GA09nmFLpfHeItamLD5O-4">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="120" y="140" as="sourcePoint" />
<mxPoint x="290" y="110" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-15" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-6" target="GA09nmFLpfHeItamLD5O-14">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-40" value="data_one_extended" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-15">
<mxGeometry x="-0.3532" y="-1" relative="1" as="geometry">
<mxPoint x="7" y="29" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-6" value="bit add" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
<mxGeometry x="240" y="120" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-8" value="" style="endArrow=classic;html=1;rounded=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" target="GA09nmFLpfHeItamLD5O-6">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="260" y="80" as="sourcePoint" />
<mxPoint x="290" y="70" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-9" value="tkeep" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-8">
<mxGeometry x="-0.699" relative="1" as="geometry">
<mxPoint y="-16" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-11" value="P" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="540" y="180" width="80" height="20" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-18" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-12" target="GA09nmFLpfHeItamLD5O-14">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="340" y="100" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-36" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-12" target="GA09nmFLpfHeItamLD5O-34">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="400" y="60" />
<mxPoint x="660" y="60" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-12" value="acc" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="360" y="80" width="80" height="40" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-14" value="+" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
<mxGeometry x="320" y="120" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-32" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-21" target="GA09nmFLpfHeItamLD5O-31">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-21" value="*" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
<mxGeometry x="440" y="120" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-22" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-14" target="GA09nmFLpfHeItamLD5O-21">
<mxGeometry relative="1" as="geometry">
<mxPoint x="460" y="140" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-41" value="data_post_add" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-22">
<mxGeometry x="-0.1925" y="-1" relative="1" as="geometry">
<mxPoint x="8" y="19" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-26" value="%" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
<mxGeometry x="560" y="120" width="40" height="40" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-29" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.51;entryY=1.007;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-11" target="GA09nmFLpfHeItamLD5O-26">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-30" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1.022;entryY=0.482;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-26" target="GA09nmFLpfHeItamLD5O-12">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="580" y="99" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-33" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-31" target="GA09nmFLpfHeItamLD5O-26">
<mxGeometry relative="1" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-31" value="reg" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="500" y="130" width="40" height="20" as="geometry" />
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-37" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="GA09nmFLpfHeItamLD5O-34">
<mxGeometry relative="1" as="geometry">
<mxPoint x="720" y="140" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-39" value="tag" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="GA09nmFLpfHeItamLD5O-37">
<mxGeometry x="0.6531" y="2" relative="1" as="geometry">
<mxPoint x="17" y="2" as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="GA09nmFLpfHeItamLD5O-34" value="+" style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" vertex="1" parent="1">
<mxGeometry x="640" y="120" width="40" height="40" as="geometry" />
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>

@@ -0,0 +1 @@
create_clock -period 2.5 -name clk [get_ports i_clk]

@@ -0,0 +1,42 @@
module mult_timing_test(
    input i_clk,
    input logic [132:0] data_a,
    input logic [127:0] data_b,
    output logic [260:0] data_z
);
    logic [132:0] data_a_reg;
    logic [127:0] data_b_reg;
    // data_a is treated as 7 limbs of 19 bits (7*19 = 133); each limb times
    // data_b is one partial product.
    logic [260:0] partial_result [7];
    logic [260:0] data_z_temp_1[4];
    logic [260:0] data_z_temp_2_0, data_z_temp_2_1;
    always_ff @(posedge i_clk) begin
        data_a_reg <= data_a;
        data_b_reg <= data_b;
        // Limb width matches the 19-bit shifts used in the adder tree below.
        for (int i = 0; i < 7; i++) begin
            partial_result[i] <= data_a_reg[i*19 +: 19] * data_b_reg;
        end
        // Adder tree stage 1: combine adjacent limbs.
        data_z_temp_1[0] <= (partial_result[0] << (19*0)) + (partial_result[1] << (19*1));
        data_z_temp_1[1] <= (partial_result[2] << (19*0)) + (partial_result[3] << (19*1));
        data_z_temp_1[2] <= (partial_result[4] << (19*0)) + (partial_result[5] << (19*1));
        data_z_temp_1[3] <= (partial_result[6] << (19*0));
        // Stage 2: combine pairs of limb pairs.
        data_z_temp_2_0 <= data_z_temp_1[0] + (data_z_temp_1[1] << (19*2));
        data_z_temp_2_1 <= data_z_temp_1[2] + (data_z_temp_1[3] << (19*2));
        // Stage 3: the upper half holds limbs 4..6, so it is weighted by 2^(19*4).
        data_z <= data_z_temp_2_0 + (data_z_temp_2_1 << (19*4));
    end
endmodule
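
A software model makes it easier to check the adder tree that this timing test pipelines. The sketch below is a hypothetical Python reference, assuming the 133-bit operand is split into seven 19-bit limbs (as the 19-bit shift amounts and the 133-bit port width suggest); `limb_multiply`, `LIMB_BITS` and `NUM_LIMBS` are illustrative names that do not exist in the repository.

```python
LIMB_BITS = 19  # assumed limb width, taken from the shift amounts in the RTL
NUM_LIMBS = 7   # 7 * 19 = 133 bits, the width of data_a


def limb_multiply(a: int, b: int) -> int:
    """Multiply a (up to 133 bits) by b (up to 128 bits) via 19-bit limbs of a."""
    assert 0 <= a < 1 << (LIMB_BITS * NUM_LIMBS)
    assert 0 <= b < 1 << 128
    mask = (1 << LIMB_BITS) - 1

    # One partial product per limb of a, each at most 19 + 128 = 147 bits wide.
    partial = [((a >> (i * LIMB_BITS)) & mask) * b for i in range(NUM_LIMBS)]

    # Stage 1: add adjacent partial products, the odd one shifted by one limb.
    stage1 = [
        partial[0] + (partial[1] << LIMB_BITS),
        partial[2] + (partial[3] << LIMB_BITS),
        partial[4] + (partial[5] << LIMB_BITS),
        partial[6],
    ]

    # Stage 2: add pairs of stage-1 sums, the upper one shifted by two limbs.
    stage2_lo = stage1[0] + (stage1[1] << (2 * LIMB_BITS))
    stage2_hi = stage1[2] + (stage1[3] << (2 * LIMB_BITS))

    # Stage 3: the upper half covers limbs 4..6, so it is shifted by four limbs.
    return stage2_lo + (stage2_hi << (4 * LIMB_BITS))
```

Comparing `limb_multiply(a, b)` against `a * b` for random operands would be one way to generate expected values for a testbench around the module.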

@@ -0,0 +1,122 @@
<?xml version="1.0" encoding="UTF-8"?>
<efxpt:design_db name="poly1305_timing_test" device_def="Ti375N1156" version="2025.1.110" db_version="20251999" last_change_date="Sat Jul 5 07:15:12 2025" xmlns:efxpt="http://www.efinixinc.com/peri_design_db" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.efinixinc.com/peri_design_db peri_design_db.xsd ">
<efxpt:device_info>
<efxpt:iobank_info>
<efxpt:iobank name="2A" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2A_MODE_SEL"/>
<efxpt:iobank name="2B" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2B_MODE_SEL"/>
<efxpt:iobank name="2C" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2C_MODE_SEL"/>
<efxpt:iobank name="2D" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2D_MODE_SEL"/>
<efxpt:iobank name="2E" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="2E_MODE_SEL"/>
<efxpt:iobank name="4A" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4A_MODE_SEL"/>
<efxpt:iobank name="4B" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4B_MODE_SEL"/>
<efxpt:iobank name="4C" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4C_MODE_SEL"/>
<efxpt:iobank name="4D" iostd="1.8 V LVCMOS" is_dyn_voltage="false" mode_sel_name="4D_MODE_SEL"/>
<efxpt:iobank name="BL0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL0_MODE_SEL"/>
<efxpt:iobank name="BL1" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL1_MODE_SEL"/>
<efxpt:iobank name="BL2" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL2_MODE_SEL"/>
<efxpt:iobank name="BL3" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BL3_MODE_SEL"/>
<efxpt:iobank name="BR0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR0_MODE_SEL"/>
<efxpt:iobank name="BR1" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR1_MODE_SEL"/>
<efxpt:iobank name="BR3" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR3_MODE_SEL"/>
<efxpt:iobank name="BR4" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="BR4_MODE_SEL"/>
<efxpt:iobank name="TL0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TL0_MODE_SEL"/>
<efxpt:iobank name="TL1_TL5" iostd="3.3 V LVCMOS" is_dyn_voltage="false">
<efxpt:mode_sel_name>
<efxpt:pin_name bank_name="TL1" value="TL1_MODE_SEL"/>
<efxpt:pin_name bank_name="TL5" value="TL5_MODE_SEL"/>
</efxpt:mode_sel_name>
</efxpt:iobank>
<efxpt:iobank name="TR0" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR0_MODE_SEL"/>
<efxpt:iobank name="TR1" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR1_MODE_SEL"/>
<efxpt:iobank name="TR2" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR2_MODE_SEL"/>
<efxpt:iobank name="TR3" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR3_MODE_SEL"/>
<efxpt:iobank name="TR5" iostd="3.3 V LVCMOS" is_dyn_voltage="false" mode_sel_name="TR5_MODE_SEL"/>
</efxpt:iobank_info>
<efxpt:ctrl_info>
<efxpt:ctrl name="cfg" ctrl_def="CONFIG_CTRL0" clock_name="" is_clk_invert="false" cbsel_bus_name="cfg_CBSEL" config_ctrl_name="cfg_CONFIG" ena_capture_name="cfg_ENA" error_status_name="cfg_ERROR" um_signal_status_name="cfg_USR_STATUS" is_remote_update_enable="false" is_user_mode_enable="false">
<efxpt:gen_param>
<efxpt:param name="remote_update_retries" value="0" value_type="int"/>
</efxpt:gen_param>
</efxpt:ctrl>
</efxpt:ctrl_info>
<efxpt:seu_info>
<efxpt:seu name="seu" block_def="CONFIG_SEU0" mode="auto" ena_detect="false" wait_interval="16500000">
<efxpt:gen_pin>
<efxpt:pin name="seu_START" type_name="START" is_bus="false"/>
<efxpt:pin name="seu_INJECT_ERROR" type_name="INJECT_ERROR" is_bus="false"/>
<efxpt:pin name="seu_RST" type_name="RST" is_bus="false"/>
<efxpt:pin name="seu_CONFIG" type_name="CONFIG" is_bus="false"/>
<efxpt:pin name="seu_ERROR" type_name="ERROR" is_bus="false"/>
<efxpt:pin name="seu_DONE" type_name="DONE" is_bus="false"/>
</efxpt:gen_pin>
</efxpt:seu>
</efxpt:seu_info>
<efxpt:clkmux_info>
<efxpt:clkmux name="GCLKMUX_B" block_def="GCLKMUX_B" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
<efxpt:gen_pin>
<efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
</efxpt:gen_pin>
</efxpt:clkmux>
<efxpt:clkmux name="GCLKMUX_L" block_def="GCLKMUX_L" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
<efxpt:gen_pin>
<efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
</efxpt:gen_pin>
</efxpt:clkmux>
<efxpt:clkmux name="GCLKMUX_R" block_def="GCLKMUX_R" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
<efxpt:gen_pin>
<efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
</efxpt:gen_pin>
</efxpt:clkmux>
<efxpt:clkmux name="GCLKMUX_T" block_def="GCLKMUX_T" is_mux_bot0_dyn="false" is_mux_bot7_dyn="false">
<efxpt:gen_pin>
<efxpt:pin name="" type_name="ROUTE0" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE1" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE2" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="ROUTE3" is_bus="false" is_clk="true" is_clk_invert="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_0" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_OUT_7" is_bus="false"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_0" is_bus="true"/>
<efxpt:pin name="" type_name="DYN_MUX_SEL_7" is_bus="true"/>
</efxpt:gen_pin>
</efxpt:clkmux>
</efxpt:clkmux_info>
</efxpt:device_info>
<efxpt:gpio_info>
<efxpt:global_unused_config state="input with weak pullup"/>
</efxpt:gpio_info>
<efxpt:pll_info/>
<efxpt:osc_info/>
<efxpt:lvds_info/>
<efxpt:mipi_info/>
<efxpt:jtag_info/>
<efxpt:ddr_info/>
<efxpt:mipi_dphy_info/>
<efxpt:pll_ssc_info/>
<efxpt:quad_lane_info/>
<efxpt:quad_pcie_info/>
<efxpt:lane_10g_info/>
<efxpt:lane_1g_info/>
<efxpt:raw_serdes_info/>
<efxpt:soc_info/>
</efxpt:design_db>

@@ -0,0 +1,110 @@
<?xml version="1.0" encoding="UTF-8"?>
<efx:project name="poly1305_timing_test" description="" last_change="1752448578" sw_version="2025.1.110" last_run_state="pass" last_run_flow="bitstream" config_result_in_sync="true" design_ood="sync" place_ood="sync" route_ood="sync" xmlns:efx="http://www.efinixinc.com/enf_proj" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.efinixinc.com/enf_proj enf_proj.xsd">
<efx:device_info>
<efx:family name="Titanium"/>
<efx:device name="Ti375N1156"/>
<efx:timing_model name="C4"/>
</efx:device_info>
<efx:design_info def_veri_version="sv_09" def_vhdl_version="vhdl_2008" unified_flow="false">
<efx:top_module name="mult_timing_test"/>
<efx:design_file name="../src/poly1305_core.sv" version="default" library="default"/>
<efx:design_file name="../../common/sim/sub/taxi/src/axis/rtl/taxi_axis_if.sv" version="default" library="default"/>
<efx:design_file name="../sim/poly1305_core_wrapper.sv" version="default" library="default"/>
<efx:design_file name="mult_timing_test.sv" version="default" library="default"/>
<efx:top_vhdl_arch name=""/>
</efx:design_info>
<efx:constraint_info>
<efx:sdc_file name="constraints.sdc"/>
<efx:inter_file name=""/>
</efx:constraint_info>
<efx:sim_info/>
<efx:misc_info/>
<efx:ip_info/>
<efx:synthesis tool_name="efx_map">
<efx:param name="work_dir" value="work_syn" value_type="e_string"/>
<efx:param name="write_efx_verilog" value="on" value_type="e_bool"/>
<efx:param name="allow-const-ram-index" value="0" value_type="e_option"/>
<efx:param name="blackbox-error" value="1" value_type="e_option"/>
<efx:param name="blast_const_operand_adders" value="1" value_type="e_option"/>
<efx:param name="bram_output_regs_packing" value="1" value_type="e_option"/>
<efx:param name="bram-push-tco-outreg" value="0" value_type="e_option"/>
<efx:param name="create-onehot-fsms" value="0" value_type="e_option"/>
<efx:param name="fanout-limit" value="0" value_type="e_integer"/>
<efx:param name="hdl-compile-unit" value="1" value_type="e_option"/>
<efx:param name="hdl-loop-limit" value="20000" value_type="e_integer"/>
<efx:param name="infer-clk-enable" value="3" value_type="e_option"/>
<efx:param name="infer-sync-set-reset" value="1" value_type="e_option"/>
<efx:param name="enable-mark-debug" value="1" value_type="e_option"/>
<efx:param name="max_ram" value="-1" value_type="e_integer"/>
<efx:param name="max_mult" value="-1" value_type="e_integer"/>
<efx:param name="max-bit-blast-mem-size" value="10240" value_type="e_integer"/>
<efx:param name="min-sr-fanout" value="0" value_type="e_integer"/>
<efx:param name="min-ce-fanout" value="0" value_type="e_integer"/>
<efx:param name="mode" value="speed" value_type="e_option"/>
<efx:param name="mult-auto-pipeline" value="1" value_type="e_integer"/>
<efx:param name="mult-decomp-retime" value="1" value_type="e_option"/>
<efx:param name="operator-sharing" value="1" value_type="e_option"/>
<efx:param name="optimize-adder-tree" value="1" value_type="e_option"/>
<efx:param name="optimize-zero-init-rom" value="1" value_type="e_option"/>
<efx:param name="peri-syn-instantiation" value="0" value_type="e_option"/>
<efx:param name="peri-syn-inference" value="0" value_type="e_option"/>
<efx:param name="ram-decomp-mode" value="0" value_type="e_option"/>
<efx:param name="retiming" value="2" value_type="e_option"/>
<efx:param name="seq_opt" value="1" value_type="e_option"/>
<efx:param name="seq-opt-sync-only" value="0" value_type="e_option"/>
<efx:param name="use-logic-for-small-mem" value="64" value_type="e_integer"/>
<efx:param name="use-logic-for-small-rom" value="64" value_type="e_integer"/>
<efx:param name="max_threads" value="-1" value_type="e_integer"/>
<efx:param name="dsp-input-regs-packing" value="1" value_type="e_option"/>
<efx:param name="dsp-output-regs-packing" value="1" value_type="e_option"/>
<efx:param name="dsp-mac-packing" value="1" value_type="e_option"/>
<efx:param name="insert-carry-skip" value="1" value_type="e_option"/>
<efx:param name="pack-luts-to-comb4" value="0" value_type="e_option"/>
<efx:dynparam name="asdf" value="asdf"/>
</efx:synthesis>
<efx:place_and_route tool_name="efx_pnr">
<efx:param name="work_dir" value="work_pnr" value_type="e_string"/>
<efx:param name="verbose" value="off" value_type="e_bool"/>
<efx:param name="load_delaym" value="on" value_type="e_bool"/>
<efx:param name="optimization_level" value="TIMING_3" value_type="e_option"/>
<efx:param name="seed" value="1" value_type="e_integer"/>
<efx:param name="placer_effort_level" value="5" value_type="e_option"/>
<efx:param name="max_threads" value="-1" value_type="e_integer"/>
<efx:param name="print_critical_path" value="10" value_type="e_integer"/>
<efx:param name="classic_flow" value="off" value_type="e_noarg"/>
<efx:param name="beneficial_skew" value="on" value_type="e_option"/>
</efx:place_and_route>
<efx:bitstream_generation tool_name="efx_pgm">
<efx:param name="mode" value="active" value_type="e_option"/>
<efx:param name="width" value="1" value_type="e_option"/>
<efx:param name="enable_roms" value="smart" value_type="e_option"/>
<efx:param name="spi_low_power_mode" value="on" value_type="e_bool"/>
<efx:param name="io_weak_pullup" value="on" value_type="e_bool"/>
<efx:param name="oscillator_clock_divider" value="DIV8" value_type="e_option"/>
<efx:param name="bitstream_compression" value="on" value_type="e_bool"/>
<efx:param name="enable_external_master_clock" value="off" value_type="e_bool"/>
<efx:param name="active_capture_clk_edge" value="negedge" value_type="e_option"/>
<efx:param name="jtag_usercode" value="0xFFFFFFFF" value_type="e_string"/>
<efx:param name="release_tri_then_reset" value="on" value_type="e_bool"/>
<efx:param name="four_byte_addressing" value="off" value_type="e_bool"/>
<efx:param name="generate_bit" value="on" value_type="e_bool"/>
<efx:param name="generate_bitbin" value="off" value_type="e_bool"/>
<efx:param name="generate_hex" value="on" value_type="e_bool"/>
<efx:param name="generate_hexbin" value="off" value_type="e_bool"/>
<efx:param name="cold_boot" value="off" value_type="e_bool"/>
<efx:param name="cascade" value="off" value_type="e_option"/>
</efx:bitstream_generation>
<efx:debugger>
<efx:param name="work_dir" value="work_dbg" value_type="e_string"/>
<efx:param name="auto_instantiation" value="off" value_type="e_bool"/>
<efx:param name="profile" value="NONE" value_type="e_string"/>
</efx:debugger>
<efx:security>
<efx:param name="randomize_iv_value" value="on" value_type="e_bool"/>
<efx:param name="iv_value" value="" value_type="e_string"/>
<efx:param name="enable_bitstream_encrypt" value="off" value_type="e_bool"/>
<efx:param name="enable_bitstream_auth" value="off" value_type="e_bool"/>
<efx:param name="encryption_key_file" value="NONE" value_type="e_string"/>
<efx:param name="auth_key_file" value="NONE" value_type="e_string"/>
</efx:security>
</efx:project>

View File

@@ -0,0 +1,121 @@
from typing import List
from modulo_theory import friendly_modular_mult, friendly_modulo

def mask_r(r: int) -> int:
    # Clamp r: clear the top 4 bits of bytes 3, 7, 11, 15 and the low 2 bits of bytes 4, 8, 12
    r_bytes = r.to_bytes(16, "little")
    r_masked = bytearray(r_bytes)
    r_masked[3] &= 15
    r_masked[7] &= 15
    r_masked[11] &= 15
    r_masked[15] &= 15
    r_masked[4] &= 252
    r_masked[8] &= 252
    r_masked[12] &= 252
    r_masked = int.from_bytes(r_masked, "little")
    return r_masked

def poly1305(message: bytes, r: int, s: int):
    r = mask_r(r)
    p = 2**130-5
    acc = 0
    # Keep the raw 16-byte chunks so the padding bit can be placed from the chunk length
    chunks = [message[i:i+16] for i in range(0, len(message), 16)]
    for chunk in chunks:
        block = int.from_bytes(chunk, "little")
        # bit_length() would misplace the padding bit when the top bytes of a chunk are zero
        block += 1 << (8*len(chunk))
        acc = ((acc+block)*r) % p
    acc += s
    return acc & (2**128-1)
def parallel_poly1305(message: bytes, r: int, s: int, lanes: int):
    r = mask_r(r)
    p = 2**130-5
    # Build r_powers[k] = r**k mod p for k = 0..8 by doubling, so lanes must be <= 8
    r_powers = [1, r]
    for l_pow_log2 in range(3):
        l_pow = 2**l_pow_log2
        for r_pow in range(1,l_pow+1):
            r_powers.append(friendly_modular_mult(r_powers[l_pow], r_powers[r_pow]))
    acc = [0]*lanes
    blocks = [int.from_bytes(message[i:i+16], "little") for i in range(0, len(message), 16)]
    lane_blocks = [blocks[i:i+lanes] for i in range(0, len(blocks), lanes)]
    for i, lane_block in enumerate(lane_blocks):
        for j, lane in enumerate(lane_block):
            idx = i*lanes + j
            # The last block in each lane advances by r**(blocks remaining), all others by r**lanes
            power = min(lanes, len(blocks) - idx)
            # Take the padding position from the message length; bit_length() misplaces it
            # when the top bytes of a block are zero, and this also avoids the division
            byte_length = min(16, len(message) - 16*idx)
            lane += 1 << (8*byte_length)
            # friendly_modular_mult splits its first argument into 5x26 bits, so pass the
            # reduced power there; acc[j] + lane can exceed 130 bits
            acc[j] = friendly_modular_mult(r_powers[power], acc[j] + lane)
    combined_acc = friendly_modulo(sum(acc), 0)
    combined_acc += s
    return combined_acc & (2**128-1)
def test_regular():
    r = 0xa806d542fe52447f336d555778bed685
    s = 0x1bf54941aff6bf4afdb20dfb8a800301
    golden_result = 0xa927010caf8b2bc2c6365130c11d06a8
    msg = b"Cryptographic Forum Research Group"
    result = poly1305(msg, r, s)
    print(f"{golden_result:x}")
    print(f"{result:x}")

def test_parallel():
    r = 0xa806d542fe52447f336d555778bed685
    s = 0x1bf54941aff6bf4afdb20dfb8a800301
    golden_result = 0xa927010caf8b2bc2c6365130c11d06a8
    msg = b"Cryptographic Forum Research Group"
    result = parallel_poly1305(msg, r, s, 8)
    print(f"{golden_result:x}")
    print(f"{result:x}")

def test_on_long_string():
    r = 0xa806d542fe52447f336d555778bed685
    s = 0x1bf54941aff6bf4afdb20dfb8a800301
    msg = b"Very long message with lots of words that is very long and requires a lot of cycles to complete because of how long it is"
    regular_result = poly1305(msg, r, s)
    parallel_result = parallel_poly1305(msg, r, s, 8)
    print(f"{regular_result:x}")
    print(f"{parallel_result:x}")

def main():
    test_regular()
    test_parallel()
    test_on_long_string()

if __name__ == "__main__":
    main()
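
The lane-based evaluation in this file relies on the fact that a block followed by more blocks in the same lane can be advanced by `r**lanes`, while the last block of a lane only needs `r` raised to the number of blocks remaining. The following is a minimal sketch of that identity, not the code from this commit: `horner_serial` and `horner_lanes` are hypothetical names, the blocks are assumed to already carry their padding bit, and plain `%` and `pow` stand in for the modulo-free helpers.

```python
P = 2**130 - 5

def horner_serial(blocks, r):
    """Evaluate sum(blocks[i] * r**(n - i)) mod P one block at a time."""
    acc = 0
    for block in blocks:
        acc = ((acc + block) * r) % P
    return acc

def horner_lanes(blocks, r, lanes):
    """Evaluate the same polynomial with block i accumulated in lane i % lanes."""
    n = len(blocks)
    acc = [0] * lanes
    for idx, block in enumerate(blocks):
        # Advance by r**lanes unless this is the final block of its lane,
        # which only needs r**(blocks remaining, counting itself).
        power = min(lanes, n - idx)
        lane = idx % lanes
        acc[lane] = ((acc[lane] + block) * pow(r, power, P)) % P
    # The lanes are independent partial sums, so one final addition combines them.
    return sum(acc) % P
```

Each block ends up multiplied by `r**(n - idx)` in both functions, which is why the two results should agree for any lane count of at least 1.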

View File

@@ -0,0 +1,87 @@
import random

PRIME = 2**130-5

def modulo_theory_simple(loops: int):
    prime = 97
    for _ in range(loops):
        value_a = random.randint(1,97)
        value_b = random.randint(1,97)
        value_a_high = value_a // 10
        value_a_low = value_a % 10 # Ignore this modulo, in base 2 it is a mask
        prod_high = value_a_high * value_b
        prod_low = value_a_low * value_b
        mod_high = (prod_high*10) % prime
        mod_low = prod_low % prime
        mod_sum = (mod_high + mod_low) % prime
        mod_conventional = (value_a * value_b) % prime
        if mod_sum != mod_conventional:
            print(f"{value_a}")
            print(f"{value_b}")
            print(f"{mod_sum=}")
            print(f"{mod_conventional=}")

def modulo_theory_full(loops: int):
    for _ in range(loops):
        value_a = random.randint(1,PRIME)
        value_b = random.randint(1,2**128)
        a_partials = [(value_a >> 26*i) & (2**26-1) for i in range(5)]
        prods = [a_partial * value_b for a_partial in a_partials]
        mods = [friendly_modulo(prod, 26*i) for i, prod in enumerate(prods)]
        mod_sum = friendly_modulo(sum(mods), 0)
        mod_conventional = (value_a * value_b) % PRIME
        if mod_sum != mod_conventional:
            print(f"{value_a}")
            print(f"{value_b}")
            print(f"{mod_sum=}")
            print(f"{mod_conventional=}")
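def friendly_modular_mult(value_a: int, value_b: int) -> int:
    # value_a is split into five 26-bit limbs, so it must be below 2**130; value_b may be wider
    a_partials = [(value_a >> 26*i) & (2**26-1) for i in range(5)]
    prods = [a_partial * value_b for a_partial in a_partials]
    mods = [friendly_modulo(prod, 26*i) for i, prod in enumerate(prods)]
    mod_sum = friendly_modulo(sum(mods), 0)
    return mod_sum

def friendly_modulo(val: int, shift_amount: int) -> int:
    # Computes (val << shift_amount) % PRIME using 2**130 = 5 (mod PRIME):
    # bits at or above 2**130 are folded back in as five times their value
    high_part = val >> (130-shift_amount)
    low_part = (val << shift_amount) & (2**130-1)
    high_part *= 5
    val = high_part + low_part
    # A second fold absorbs the carry produced by the first one
    high_part = val >> 130
    low_part = val & (2**130-1)
    high_part *= 5
    val = high_part + low_part
    # After two folds val is only slightly above 2**130 at most, so one subtraction is enough
    if val >= PRIME:
        val -= PRIME
    return val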
if __name__ == "__main__":
    #modulo_theory_simple(10000000)
    modulo_theory_full(100000)
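
The reduction in this file works because 2**130 is congruent to 5 modulo 2**130 - 5, so everything above bit 129 can be multiplied by 5 and added back to the low 130 bits. A minimal sketch of that idea follows; `fold_once` and `reduce_by_folding` are hypothetical names, and the loop is a simplification of the fixed two-fold sequence used above, chosen here so that the sketch accepts inputs of any width.

```python
P = 2**130 - 5

def fold_once(val: int) -> int:
    """Replace the bits at or above 2**130 by five times their value."""
    return (val >> 130) * 5 + (val & (2**130 - 1))

def reduce_by_folding(val: int) -> int:
    """Reduce val modulo P using only shifts, masks, a multiply by 5 and one subtraction."""
    # Each fold strictly shrinks any value that is 2**130 or larger,
    # so the loop terminates with val below 2**130.
    while val >> 130:
        val = fold_once(val)
    # val is now in [0, 2**130), which overlaps [P, 2**130) by five values.
    if val >= P:
        val -= P
    return val
```

In hardware the loop would be unrolled to a fixed number of folds, which is only valid when the input width is bounded, as it is for the 26-bit partial products here.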

View File

@@ -0,0 +1,13 @@
tests:
  - name: "poly1305_core"
    toplevel: "poly1305_core_wrapper"
    modules:
      - "poly1305_core"
    sources: "sources.list"
    waves: True
  - name: "friendly_modulo"
    toplevel: "poly1305_friendly_modulo"
    modules:
      - "poly1305_friendly_modulo"
    sources: sources.list
    waves: True

View File

@@ -0,0 +1,75 @@
import logging

import cocotb
from cocotb.clock import Clock
from cocotb.triggers import Timer, RisingEdge, FallingEdge
from cocotb.queue import Queue
from cocotbext.axi import AxiStreamBus, AxiStreamSource

CLK_PERIOD = 4

class TB:
    def __init__(self, dut):
        self.dut = dut
        self.log = logging.getLogger("cocotb.tb")
        self.log.setLevel(logging.INFO)
        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())
        self.s_data_axis = AxiStreamSource(AxiStreamBus.from_prefix(dut, ""), dut.i_clk, dut.i_rst)

    async def cycle_reset(self):
        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)

    async def _cycle_reset(self, rst, clk):
        rst.setimmediatevalue(0)
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 1
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 0
        await RisingEdge(clk)
        await RisingEdge(clk)

@cocotb.test
async def test_sanity(dut):
    tb = TB(dut)
    await tb.cycle_reset()
    s = 0x1bf54941aff6bf4afdb20dfb8a800301
    r = 0xa806d542fe52447f336d555778bed685
    r_masked = 0x0806d5400e52447c036d555408bed685
    result = 0xa927010caf8b2bc2c6365130c11d06a8
    msg = b"Cryptographic Forum Research Group"
    tb.dut.i_otk.value = ((r << 128) | s)
    tb.dut.i_otk_valid.value = 1
    await RisingEdge(tb.dut.i_clk)
    tb.dut.i_otk_valid.value = 0
    await RisingEdge(tb.dut.i_clk)
    dut_s = tb.dut.u_dut.poly1305_s.value.integer
    dut_r = tb.dut.u_dut.poly1305_r.value.integer
    assert dut_s == s
    assert dut_r == r_masked
    await tb.s_data_axis.send(msg)
    await RisingEdge(tb.dut.o_tag_valid)
    tag = tb.dut.o_tag.value.integer
    tb.log.info(f"tag: {tag:x}")
    assert tag == result
    await Timer(1, "us")
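
The test above checks that the core stores `r` in clamped form. As a sketch of where the `r_masked` constant comes from, the snippet below applies the 128-bit mask used in the RTL (`R_MASK`) and, for comparison, the byte-by-byte clamp. The function names `clamp_r` and `clamp_r_bytewise` are hypothetical and exist only for this illustration.

```python
R_MASK = 0x0ffffffc0ffffffc0ffffffc0fffffff

def clamp_r(r: int) -> int:
    """Clamp r with a single AND, as the RTL does."""
    return r & R_MASK

def clamp_r_bytewise(r: int) -> int:
    """Clamp r one byte at a time on its little-endian encoding."""
    b = bytearray(r.to_bytes(16, "little"))
    for i in (3, 7, 11, 15):
        b[i] &= 0x0F  # clear the top four bits of these bytes
    for i in (4, 8, 12):
        b[i] &= 0xFC  # clear the bottom two bits of these bytes
    return int.from_bytes(b, "little")
```

Both forms should give the same value for any 128-bit `r`; the single AND is the natural choice in hardware because it costs no logic beyond tying bits low.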

View File

@@ -0,0 +1,40 @@
module poly1305_core_wrapper(
input i_clk,
input i_rst,
input [255:0] i_otk,
input i_otk_valid,
output [127:0] o_tag,
output o_tag_valid,
input [127:0] tdata,
input [15:0] tkeep,
input [15:0] tstrb,
input tlast,
input tvalid,
output tready
);
taxi_axis_if #(.DATA_W(128)) s_data_axis();
assign s_data_axis.tdata = tdata;
assign s_data_axis.tkeep = tkeep;
assign s_data_axis.tstrb = tstrb;
assign s_data_axis.tlast = tlast;
assign s_data_axis.tvalid = tvalid;
assign tready = s_data_axis.tready;
poly1305_core u_dut (
.i_clk (i_clk),
.i_rst (i_rst),
.i_otk (i_otk),
.i_otk_valid (i_otk_valid),
.o_tag (o_tag),
.o_tag_valid (o_tag_valid),
.s_data_axis (s_data_axis)
);
endmodule

View File

@@ -0,0 +1,91 @@
import logging

import cocotb
from cocotb.clock import Clock
from cocotb.triggers import Timer, RisingEdge, FallingEdge
from cocotb.queue import Queue
from cocotbext.axi import AxiStreamBus, AxiStreamSource

import random

PRIME = 2**130-5
CLK_PERIOD = 4

class TB:
    def __init__(self, dut):
        self.dut = dut
        self.log = logging.getLogger("cocotb.tb")
        self.log.setLevel(logging.INFO)
        self.input_queue = Queue()
        self.expected_queue = Queue()
        self.output_queue = Queue()
        cocotb.start_soon(Clock(self.dut.i_clk, CLK_PERIOD, units="ns").start())
        cocotb.start_soon(self.run_input())
        cocotb.start_soon(self.run_output())

    async def cycle_reset(self):
        await self._cycle_reset(self.dut.i_rst, self.dut.i_clk)

    async def _cycle_reset(self, rst, clk):
        rst.setimmediatevalue(0)
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 1
        await RisingEdge(clk)
        await RisingEdge(clk)
        rst.value = 0
        await RisingEdge(clk)
        await RisingEdge(clk)

    async def write_input(self, value: int, shift_amount: int):
        await self.input_queue.put((value, shift_amount))
        await self.expected_queue.put((value << (shift_amount*26)) % PRIME)

    async def run_input(self):
        while True:
            value, shift_amount = await self.input_queue.get()
            self.dut.i_valid.value = 1
            self.dut.i_val.value = value
            self.dut.i_shift_amount.value = shift_amount
            await RisingEdge(self.dut.i_clk)
            self.dut.i_valid.value = 0
            self.dut.i_shift_amount.value = 0
            self.dut.i_val.value = 0

    async def run_output(self):
        while True:
            await RisingEdge(self.dut.i_clk)
            if self.dut.o_valid.value:
                await self.output_queue.put(self.dut.o_result.value.integer)

@cocotb.test
async def test_sanity(dut):
    tb = TB(dut)
    await tb.cycle_reset()
    count = 1024
    for _ in range(count):
        await tb.write_input(random.randint(1,2**(130+16)), random.randint(0, 4))
    fail = False
    for _ in range(count):
        sim_val = await tb.expected_queue.get()
        dut_val = await tb.output_queue.get()
        if sim_val != dut_val:
            tb.log.info(f"{sim_val:x} -> {dut_val:x}")
            fail = True
    assert not fail

View File

@@ -1 +1,4 @@
../src/sources.list
poly1305_core_wrapper.sv
../src/sources.list
../../common/sim/sub/taxi/src/axis/rtl/taxi_axis_if.sv

View File

@@ -0,0 +1,24 @@
module chacha20_poly1305_64 (
input i_clk,
input i_rst,
taxi_axis_if.snk s_ctrl_axis,
taxi_axis_if.snk s_data_axis,
taxi_axis_if.src m_data_axis
);
//TODO the rest of this
// control axis decoder.
localparam R_MASK = 128'h0ffffffc0ffffffc0ffffffc0fffffff;
chacha20_pipelined_block u_chacha20_pipelined_block (
);
poly1305 u_poly1305 (
);
endmodule

View File

@@ -0,0 +1,101 @@
module poly1305_core #(
) (
input i_clk,
input i_rst,
input [255:0] i_otk,
input i_otk_valid,
output [127:0] o_tag,
output o_tag_valid,
taxi_axis_if.snk s_data_axis
);
// incoming data must be 128 bit and packed, i.e. tkeep is 1 except for the last beat with no gaps
localparam R_MASK = 128'h0ffffffc0ffffffc0ffffffc0fffffff;
localparam P130M5 = 258'h3fffffffffffffffffffffffffffffffb;
logic [127:0] poly1305_r, poly1305_s;
logic [129:0] accumulator, accumulator_next;
logic [129:0] data_one_extended;
logic [130:0] data_post_add, data_post_add_reg;
logic [257:0] data_post_mul, data_post_mul_reg;
logic [257:0] modulo_stage, modulo_stage_next;
logic [2:0] phase;
logic [3:0] valid_sr;
function logic [129:0] tkeep_expand (input [15:0] tkeep);
tkeep_expand = '0;
for (int i = 0; i < 16; i++) begin
tkeep_expand[i*8 +: 8] = {8{tkeep[i]}};
end
endfunction
// only ready in phase 0
assign s_data_axis.tready = phase == 0;
assign o_tag_valid = valid_sr[3];
always_ff @(posedge i_clk) begin
    valid_sr <= {valid_sr[2:0], s_data_axis.tlast & s_data_axis.tvalid & s_data_axis.tready & (phase == 0)};
    data_post_add_reg <= data_post_add;
    data_post_mul_reg <= data_post_mul;
    modulo_stage <= modulo_stage_next;
    if (i_otk_valid) begin
        poly1305_r <= i_otk[255:128] & R_MASK;
        poly1305_s <= i_otk[127:0];
    end
    if (s_data_axis.tvalid && phase == 0) begin
        phase <= 1;
    end
    if (phase == 1) begin
        phase <= 2;
    end
    if (phase == 2) begin
        phase <= 3;
    end
    if (phase == 3) begin
        accumulator <= accumulator_next;
        phase <= '0;
    end
    // Reset goes last so it overrides the assignments above: the last
    // non-blocking assignment to a signal in a block is the one that takes effect.
    if (i_rst) begin
        phase <= '0;
        valid_sr <= '0;
    end
end
always_comb begin
accumulator_next = accumulator;
data_post_mul = '0;
// phase == 0
data_one_extended = (tkeep_expand(s_data_axis.tkeep) + 1) | {2'b0, s_data_axis.tdata};
data_post_add = data_one_extended + accumulator;
// phase == 1
data_post_mul = data_post_add_reg * poly1305_r;
// phase == 2
modulo_stage_next = (data_post_mul_reg[257:130] * 5) + 258'(data_post_mul_reg[129:0]);
// phase == 3
accumulator_next = 130'((modulo_stage[257:130] * 5) + 258'(modulo_stage[129:0]));
end
assign o_tag = accumulator[127:0] + poly1305_s;
endmodule
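
The `data_one_extended` expression in this module places the Poly1305 padding bit without a shifter: expanding `tkeep` into a byte mask and adding 1 carries through all the ones and leaves a single bit just above the last valid byte. A small Python model of that trick is sketched below. It assumes, as the comment in the module requires, that `tkeep` is contiguous from bit 0 and that the bytes not covered by `tkeep` are zero. The name `one_extend` is hypothetical; `tkeep_expand` mirrors the RTL function.

```python
def tkeep_expand(tkeep: int) -> int:
    """Expand each of the 16 tkeep bits into a full byte of ones."""
    mask = 0
    for i in range(16):
        if (tkeep >> i) & 1:
            mask |= 0xFF << (8 * i)
    return mask

def one_extend(tdata: int, tkeep: int) -> int:
    """Set the bit just above the last valid byte, then merge in the data."""
    # For n valid bytes the mask is 2**(8*n) - 1, so adding 1 gives 2**(8*n).
    return (tkeep_expand(tkeep) + 1) | tdata
```

For a full beat (`tkeep` of `0xFFFF`) the added bit lands at position 128, which is why the RTL signal is 130 bits wide rather than 128.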

View File

@@ -0,0 +1,48 @@
module poly1305_friendly_modulo #(
parameter WIDTH = 130,
parameter MDIFF = 5, // modulo difference
parameter SHIFT_SIZE = 26
) (
input logic i_clk,
input logic i_rst,
input logic i_valid,
input logic [2*WIDTH-1:0] i_val,
input logic [2:0] i_shift_amount,
output logic o_valid,
output logic [WIDTH-1:0] o_result
);
localparam WIDE_WIDTH = WIDTH + $clog2(MDIFF);
localparam [WIDTH-1:0] PRIME = (1 << WIDTH) - MDIFF;
logic [WIDE_WIDTH-1:0] high_part_1, high_part_2;
logic [WIDTH-1:0] low_part_1, low_part_2;
logic [WIDE_WIDTH-1:0] intermediate_val;
logic [WIDTH-1:0] final_val;
logic [2:0] unused_final;
logic [2:0] valid_sr;
assign intermediate_val = high_part_1 + WIDE_WIDTH'(low_part_1);
assign o_result = (final_val >= PRIME) ? final_val - PRIME : final_val;
assign o_valid = valid_sr[2];
always_ff @(posedge i_clk) begin
valid_sr <= {valid_sr[1:0], i_valid};
high_part_1 <= WIDTH'({3'b0, i_val} >> (WIDTH - (i_shift_amount*SHIFT_SIZE))) * MDIFF;
low_part_1 <= WIDTH'(i_val << (i_shift_amount*SHIFT_SIZE));
high_part_2 <= (intermediate_val >> WIDTH) * MDIFF;
low_part_2 <= intermediate_val[WIDTH-1:0];
{unused_final, final_val} <= high_part_2 + WIDE_WIDTH'(low_part_2);
end
endmodule

View File

@@ -1,4 +1,7 @@
chacha20_qr.sv
chacha20_block.sv
chacha20_pipelined_round.sv
chacha20_pipelined_block.sv
chacha20_pipelined_block.sv
poly1305_core.sv
poly1305_friendly_modulo.sv