To support AEAD, the first ChaCha20 block becomes the key for the Poly1305 block. This can be done in parallel with the second block, which becomes the cipher keystream, at the expense of double the gates. Otherwise, there would be a delay in between packets as this key is generated.
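
As a point of reference, this is the split RFC 8439 defines: the ChaCha20 block with counter 0 supplies the Poly1305 one-time key, and blocks with counter 1 upward supply the cipher keystream. A minimal software model of that (function names are just illustrative, nothing from this repo):

```python
import struct

MASK32 = 0xffffffff

def rotl32(x, n):
    return ((x << n) & MASK32) | (x >> (32 - n))

def quarter_round(s, a, b, c, d):
    # The ChaCha QR: 4 adds, 4 XORs, 4 rotates on 32-bit words
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)

def chacha20_block(key, nonce, counter):
    # State: 4 constants, 8 key words, counter, 3 nonce words (RFC 8439)
    init = [0x61707865, 0x3320646e, 0x79622d32, 0x6b206574]
    init += list(struct.unpack("<8L", key)) + [counter] + list(struct.unpack("<3L", nonce))
    s = list(init)
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(s, 0, 4, 8, 12); quarter_round(s, 1, 5, 9, 13)
        quarter_round(s, 2, 6, 10, 14); quarter_round(s, 3, 7, 11, 15)
        quarter_round(s, 0, 5, 10, 15); quarter_round(s, 1, 6, 11, 12)
        quarter_round(s, 2, 7, 8, 13); quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16L", *[(a + b) & MASK32 for a, b in zip(s, init)])

def aead_key_material(key, nonce, n_blocks):
    # Block 0 -> Poly1305 one-time key; blocks 1..n -> cipher keystream
    otk = chacha20_block(key, nonce, 0)[:32]
    keystream = b"".join(chacha20_block(key, nonce, i) for i in range(1, n_blocks + 1))
    return otk, keystream
```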

Okay so we did some timing tests and we can easily do 1 round of ChaCha20 in a single cycle on a Titanium FPGA at 250MHz (~350-400 MHz).

So it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or 6.4Gbps at 250MHz. That means we will need 2 of these for 10Gbps.
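
Quick sanity check of that arithmetic (just the numbers above, nothing design-specific):

```python
clock_hz = 250e6            # clock from the timing test above
cycles_per_block = 20       # 20 rounds at 1 round per cycle
block_bits = 512

bits_per_cycle = block_bits / cycles_per_block        # 25.6
gbps_per_core = bits_per_cycle * clock_hz / 1e9       # 6.4 Gbps
cores_for_10g = -(-10e9 // (gbps_per_core * 1e9))     # ceiling division -> 2
print(bits_per_cycle, gbps_per_core, cores_for_10g)
```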

So in order to use multiple cores, we would calculate 1024 bits in 20 cycles. Then we would put those bits into a memory or something and start calculating the next 1024 bits. Those bits would all be used up in 16 cycles (1024 bits at 64 bits per cycle), but the throughput still checks out. Once they are used, we load the memory with the new output.
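
Why the throughput still checks out even though the buffer drains faster than it refills: the 16-cycle figure implies a 64-bit datapath (my assumption, since 1024/16 = 64), and on average the two ChaChas still produce more than the ~40 bits/cycle a 10G link needs at this clock.

```python
clock_hz = 250e6
produced_bits, produce_cycles = 1024, 20   # two ChaChas, 20 cycles per pair of blocks
drain_width = 64                           # assumed datapath width

drain_cycles = produced_bits / drain_width            # 16 cycles to empty the buffer
avg_bits_per_cycle = produced_bits / produce_cycles   # 51.2 bits/cycle sustained
print(drain_cycles, avg_bits_per_cycle * clock_hz / 1e9)   # 16.0, 12.8 Gbps
```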

This puts a 20-cycle minimum on small packets since the core is not completely pipelined, which is a hard cap of 12.5Mpps. At 42-byte packets this is 4.2Gbps, and for 64-byte packets it is 6.4Gbps. In order to saturate the link, you would need packets of at least 100 bytes.
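
Those packet-rate numbers, worked out:

```python
clock_hz = 250e6
min_cycles = 20                                  # per-packet floor from above

pps = clock_hz / min_cycles                      # 12.5 Mpps hard cap
for size in (42, 64):
    print(size, size * 8 * pps / 1e9)            # 4.2 Gbps, 6.4 Gbps
print(10e9 / pps / 8)                            # 100 bytes needed to saturate 10G
```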

This is with the 20-cycle minimum, though in reality it would be more like 25 or 30 with the final addition, scheduling, pipelining, etc. Adding more cores increases the throughput for larger packets, but does nothing for small packets since the latency is the same. To solve this, we could instantiate the entire core twice, such that we could handle 2 minimum-size packets at the same time.

If we say there is a 30-cycle latency, the worst case is 2.8Gbps. Doubling the number of cores gives 5.6Gbps, quadrupling the number of cores gives 11.2Gbps. This would of course more than quadruple the area since we need 4x the cores as well as the mux and demux between them.
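
Worked out for minimum-size (42-byte) packets at 250MHz with a 30-cycle latency:

```python
clock_hz = 250e6
latency_cycles = 30
packet_bits = 42 * 8

pps_per_core = clock_hz / latency_cycles              # ~8.33 Mpps per full core
for n_cores in (1, 2, 4):
    print(n_cores, round(n_cores * pps_per_core * packet_bits / 1e9, 2))   # 2.8, 5.6, 11.2 Gbps
```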

This could be configurable at compile time though. The number of ChaChas per core would also be configurable, but at the moment I choose 2.

Just counting the quarter rounds, there are 4 * 2 * 4 = 32 QR modules, or 64 if we want to use 8 QRs per core instead of 4 for timing reasons.

Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's kind of a lot. A fully pipelined design would use 322 * 20 * 4 or 25k XLR. If we can pass timing using 10k luts then that would be nice. We get a peak throughput of 50Gbps, it's just that the latency kills our packet rate. If we reduce the latency to 25 cycles and have 2 alternating cores, our packet rate would be 20Mpps, increasing with every cycle we take off. I think that is good. This would result in 5k XLR, which is not so bad.
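
Tallying those area figures (XLR being the Efinix logic/routing cell; the 322-per-QR cost is the number quoted above):

```python
xlr_per_qr = 322

for n_qr in (32, 64):                 # iterative options from above
    print(n_qr, n_qr * xlr_per_qr)    # 10304 and 20608 XLR
print(322 * 20 * 4)                   # 25760 XLR for the fully pipelined design
```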

Okay so starting over now, our clock speed cannot be 250MHz; the best we can do is 200MHz. If we assume this same 25-cycle latency, that's 4Gbps per block (4096Mbps each, to be exact), so we would need 3 of them to surpass 10Gbps; now we need 3 blocks instead of 2.
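
Same arithmetic at the reduced clock:

```python
clock_hz = 200e6
latency_cycles = 25
block_bits = 512

gbps_per_block = block_bits / latency_cycles * clock_hz / 1e9   # 4.096 Gbps
print(gbps_per_block, -(-10e9 // (gbps_per_block * 1e9)))       # 4.096, 3 blocks
```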

We are barely going to be able to pass at 180MHz. Maybe the fully pipelined core is a better idea, but we can just fully pipeline a quarter stage and generate 512 bits every 4 clock cycles. This would give us a theoretical throughput of 32Gbps, and we would not have to worry about latency and small packets slowing us down. Let's experiment with what that would look like.
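
What that variant buys us at a few candidate clocks (128 bits per cycle, since a 512-bit block comes out every 4 cycles; the 32Gbps above matches the original 250MHz estimate):

```python
block_bits = 512
cycles_per_block = 4          # one block every 4 cycles once the pipe is full

for clock_mhz in (180, 200, 250):
    print(clock_mhz, block_bits / cycles_per_block * clock_mhz / 1e3)   # 23.04, 25.6, 32.0 Gbps
```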

For our single round it's using 1024 adders, which almost sounds like it is instantiating 8 quarter rounds instead of just 4. Either way, we can say that a quarter round is 128ff + 128add + 250lut.
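
Where the 1024 comes from: a quarter round does four 32-bit additions, so 128 adder bits per QR, and eight QRs' worth is 1024.

```python
add_bits_per_qr = 4 * 32      # four 32-bit adds per ChaCha quarter round

print(4 * add_bits_per_qr)    # 512  -> what 4 QRs would report
print(8 * add_bits_per_qr)    # 1024 -> matches the synthesis report above
```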

So pipelining 20 of these gives 10k luts. Not so bad.

Actually it's 88k luts... it's 512ff * 4 * 20 = 40k ff.

Let's just leave it for now even if it's overkill. The hardware would support up to 40Gbps, and technically the FPGA has 16 lanes so it could do 160Gbps in total if we designed a custom board for it (or 120Gbps if we used FMC connectors).