# Notes

Since we are designing this for a 64-bit datapath, we need to be able to
compute 64 bits every cycle. The ChaCha20 block function works on a state of
16 32-bit words, or 512-bit blocks at a time. Logically it might make more
sense to have a datapath of 128 bits.

On the other hand, each operation is a 32-bit operation. It might make more
sense for timing reasons, then, to have each operation registered. But will
this be able to match the throughput that we need?

Each quarter round generates 4 words, i.e. updates 128 bits at once each
cycle. We can do 4 of the quarter rounds at once, so at the end of each cycle
we will have updated all 512 bits of state.

At full speed, then, the core would update 512 bits per cycle, but we would
only need to generate 64 bits per cycle. We could do only 1 quarter round at a
time, which would only update 128 bits per cycle, but we would need some sort
of structure to reorder the state such that it is ready to xor with the
incoming data. We could even make this parameterizable, but that would be the
next step if we actually need to support 100Gbps encryption.

So in summary, we will have a single QuarterRound module which generates 128
bits of output. We will have a scheduling block which schedules which 4 words
of state go into the quarter round module, and a de-interleaver which takes
the output from the quarter round module and re-orders it to be in the correct
order to combine with the incoming data. There is also the final addition of
the initial state in there somewhere.
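
For reference, here is a minimal software model of what those two pieces have
to do (a Python sketch of my own, not the RTL): the RFC 8439 quarter round on
four 32-bit words, plus the column/diagonal word groups that the scheduling
block would feed into the QR module on alternating rounds.

```python
MASK32 = 0xffffffff

def rotl32(x, n):
    # 32-bit left rotate.
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    # One ChaCha20 quarter round (RFC 8439) over four 32-bit words.
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

# Word groups the scheduler feeds the QR module: columns on even rounds,
# diagonals on odd rounds (the state is a 4x4 grid of 32-bit words).
COLUMN_ROUND   = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONAL_ROUND = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

def one_round(state, groups):
    # state: list of 16 32-bit words. With 4 QR instances this is one cycle;
    # with a single QR module it is 4 passes through the same hardware.
    for ia, ib, ic, id_ in groups:
        state[ia], state[ib], state[ic], state[id_] = quarter_round(
            state[ia], state[ib], state[ic], state[id_])
    return state
```

The de-interleaver is then just the inverse of this index permutation, applied
before the final add of the initial state and the xor with the data.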

To support AEAD, the first ChaCha20 block (counter 0) becomes the one-time key
for the Poly1305 block. This can be done in parallel with the second block,
which starts the cipher keystream, at the expense of double the gates.
Otherwise, there would be a delay in between packets while this is generated.
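
As a cross-check on that ordering, RFC 8439 derives the Poly1305 one-time key
from the ChaCha20 block computed with counter 0 and starts the cipher
keystream at counter 1. A rough sketch, assuming some chacha20_block(key,
counter, nonce) helper that returns a 64-byte block (the helper name is mine,
not part of the design):

```python
def aead_setup(chacha20_block, key, nonce):
    # Block with counter 0: its first 32 bytes become the Poly1305 one-time
    # key. In hardware this is the block we could compute in parallel with
    # the first keystream block, at the cost of a second set of round logic.
    poly1305_key = chacha20_block(key, 0, nonce)[:32]

    # Blocks with counter 1, 2, ...: keystream xored with the packet payload.
    def keystream():
        counter = 1
        while True:
            yield chacha20_block(key, counter, nonce)
            counter += 1

    return poly1305_key, keystream()
```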

Okay, so we did some timing tests and we can easily do 1 round of ChaCha20 in
a single cycle on a Titanium FPGA at 250MHz (roughly 350-400 MHz is
achievable).

So then it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or
6.4Gbps. So then we will need 2 of these for 10Gbps.
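
Written out as a quick calculation (assuming the 250MHz clock from the timing
test above):

```python
import math

CLOCK_HZ = 250e6
CYCLES_PER_BLOCK = 20                                     # one round per cycle

bits_per_cycle = 512 / CYCLES_PER_BLOCK                   # 25.6
gbps_per_core = bits_per_cycle * CLOCK_HZ / 1e9           # 6.4 Gbps
cores_for_10g = math.ceil(10e9 / (gbps_per_core * 1e9))   # 2
```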

So in order to use multiple cores, we would calculate 1024 bits in 20 cycles.
Then we would put those bits into a memory or something and start calculating
the next 1024 bits. Those bits would all be used up in 16 cycles (but the
throughput still checks out). Once they are used, we load the memory with the
new output.

This puts a 20 cycle minimum on small packets since the core is not completely
pipelined. This puts a hard cap at 12.5Mpps. At 42 byte packets, this is
4.2Gbps, and for 64 byte packets it is 6.4Gbps. In order to saturate the link,
you would need packets of at least 100 bytes.
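
The packet-rate math as a small helper (a sketch; the cycles_per_packet and
cores arguments also cover the 30-cycle and multi-core cases discussed below):

```python
def latency_limited_gbps(packet_bytes, cycles_per_packet=20,
                         clock_hz=250e6, cores=1):
    # A non-pipelined core is busy for cycles_per_packet cycles per packet,
    # so the packet rate is capped at cores * clock / cycles_per_packet.
    pps = cores * clock_hz / cycles_per_packet
    return packet_bytes * 8 * pps / 1e9

latency_limited_gbps(42)                                 # ~4.2 Gbps, 12.5 Mpps
latency_limited_gbps(64)                                 # ~6.4 Gbps
latency_limited_gbps(100)                                # ~10 Gbps, break-even
latency_limited_gbps(42, cycles_per_packet=30, cores=4)  # ~11.2 Gbps
```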

This is with the 20 cycle minimum, though in reality it would be more like 25
or 30 with the final addition, scheduling, pipelining, etc. Adding more cores
increases the throughput for larger packets, but does nothing for small
packets since the latency is the same. To solve this, we could instantiate the
entire core twice, such that we could handle 2 minimum-size packets at the
same time.

If we say there is a 30 cycle latency, the worst case is 2.8Gbps. Doubling the
number of cores gives 5.6Gbps, quadrupling the number of cores gives 11.2Gbps.
This would of course more than quadruple the area, since we need 4x the cores
as well as the mux and demux between them.

This could be configurable at compile time, though. The number of ChaChas per
core would also be configurable, but at the moment I choose 2.

Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if
we want 8 QRs per core instead of 4 for timing reasons.

Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's
kind of a lot. A fully pipelined design would use 322\*20\*4 or 25k XLR. If we
can pass timing using 10k luts then that would be nice. We get a peak
throughput of 50Gbps, it's just that the latency kills our packet rate. If we
reduce the latency to 25 cycles and have 2 alternating cores, our packet rate
would be 20Mpps, increasing with every cycle we take off. I think that is
good. This would result in 5k XLR, which is not so bad.

Okay, so starting over now: our clock speed cannot be 250MHz, the best we can
do is 200MHz. If we assume this same 25 cycle latency, that's 4Gbps per block
(4.096, to be exact), so we would need 3 of them to surpass 10Gbps. So now we
need 3 blocks instead of 2.

We are barely going to be able to pass at 180MHz. Maybe the fully pipelined
core is a better idea, but we can just fully pipeline a quarter stage and
generate 512 bits every 4 clock cycles. This would give us a theoretical
throughput of 32Gbps, and we would not have to worry about latency and small
packets slowing us down. Let's experiment with what that would look like.

For our single round it's using 1024 adders, which almost sounds like it is
instantiating 8 quarter rounds instead of just 4. Either way, we can say that
a quarter round is 128ff + 128add + 250lut.

So pipelining 20 of these gives 10k luts. Not so bad.

Actually it's 88k luts... it's 512ff * 4 * 20 = 40k ff.

Let's just leave it for now even if it's overkill. The hardware would support
up to 40Gbps, and technically the FPGA has 16 lanes so it could do 160Gbps in
total, if we designed a custom board for it (or 120Gbps if we used FMC
connectors).

If we only use a single quarter round multiplexed between all 4, then the same
quarter round module can have 2 different blocks going through it at once.

The new one multiplexes 4 quarter rounds between 1 QR module, which reduces
the logic usage down to only 46k le, of which the vast majority is flops
(2k ff per round, 0.5k lut).

# Modulo 2^130-5

We can use the usual trick here to do the modular reduction much faster.

If we split the bits at 2^130, leaving 129 high bits and 130 low bits, we now
have a 129 bit value multiplied by 2^130, plus the 130 bit value. We know that
2^130 mod 2^130-5 is 5, so we can replace that 2^130 with 5 and add, then
repeat that step again.

Ex.

x = x1*2^130 + x2
x mod 2^130-5 = x1*5 + x2 -> x1*5 + x2 = x3
x3 = x4*2^130 + x5
x mod 2^130-5 = x4*5 + x5
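
The same split-and-fold as a quick software model (Python big ints standing in
for the wide adder/multiplier; the final conditional subtract is only there in
case a fully reduced value is needed):

```python
P = (1 << 130) - 5
LOW_MASK = (1 << 130) - 1

def fold(x):
    # Split at bit 130 and use 2^130 mod (2^130-5) = 5:
    # high*2^130 + low becomes high*5 + low, which stays congruent mod P.
    return (x >> 130) * 5 + (x & LOW_MASK)

def reduce_mod_p(x):
    # Two folds take a ~259-bit product down to at most 131 bits; one
    # conditional subtract then gives the canonical residue.
    x = fold(fold(x))
    return x - P if x >= P else x
```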

And let's do the math to verify that we only need two rounds. The maximum
value that we could possibly get is 2^131-1 and the maximum value for R is
0x0ffffffc0ffffffc0ffffffc0fffffff. Multiplying these together gives us
0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001.

Applying the first round to this we get

0x1ffffff81ffffff81ffffff81ffffffd * 5 + 0x3f0000003f0000003f0000003f0000001
= 0x48fffffdc8fffffdc8fffffdc8ffffff2

Applying the second round to this we get

1 * 5 + 0x8fffffdc8fffffdc8fffffdc8ffffff2 = 0x8fffffdc8fffffdc8fffffdc8ffffff7

And this is indeed the correct answer. The bottom part is 130 bits, but since
we put in the max values and it didn't overflow, I don't think it will
overflow here.
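
Quick arbitrary-precision check of the worked example:

```python
p = (1 << 130) - 5
r_max = 0x0ffffffc0ffffffc0ffffffc0fffffff
x = ((1 << 131) - 1) * r_max

assert x == 0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001
assert x % p == 0x8fffffdc8fffffdc8fffffdc8ffffff7
```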

131 + 128 = 259 bits, so we only have to do this once.

0xb83fe991ca75d7ef2ab5cba9cccdfd938b73fff384ac90ed284034da565ecf
0x19471c3e3e9c1bfded81da3736e96604a

Kind of curious now: at what point does a ripple carry adder using dedicated
CI/CO ports become slower than a more complex adder like carry lookahead or
carry save (Wallace tree)?
|