# Notes

Since we are designing this for a 64-bit datapath, we need to be able to
compute 64 bits every cycle. The ChaCha20 block function works on a state of
16 32-bit words, or 512-bit blocks at a time. Logically it might make more
sense to have a datapath of 128 bits.

On the other hand, each operation is a 32-bit operation. It might make more
sense for timing reasons, then, to have each operation registered. But will
this be able to match the throughput that we need?

Each quarter round generates 4 words, i.e. updates 128 bits at once each
cycle. We can do 4 of the quarter rounds at once, so at the end of each cycle
we will have updated all 512 bits of state.

At full speed, then, the core would update 512 bits per cycle, but we would
only need to generate 64 bits per cycle. We could do only 1 quarter round at a
time, which would only update 128 bits per cycle, but we would need some sort
of structure to reorder the state such that it is ready to xor with the
incoming data. We could even make this parameterizable, but that would be the
next step if we actually need to support 100Gbps encryption.

So in summary, we will have a single QuarterRound module which generates 128
bits of output. We will have a scheduling block which schedules which 4 words
of state go into the quarter round module, and a de-interleaver which takes
the output from the quarter round module and re-orders it to be in the correct
order to combine with the incoming data. There is also the final addition of
the initial state in there somewhere.
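
For reference, here is a minimal software model of what those two pieces have
to do (a Python sketch of my own, not the RTL): the RFC 8439 quarter round on
four 32-bit words, plus the column/diagonal word groups that the scheduling
block would feed into the QR module on alternating rounds.

```python
MASK32 = 0xffffffff

def rotl32(x, n):
    # 32-bit left rotate.
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    # One ChaCha20 quarter round (RFC 8439) over four 32-bit words.
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

# Word groups the scheduler feeds the QR module: columns on even rounds,
# diagonals on odd rounds (the state is a 4x4 grid of 32-bit words).
COLUMN_ROUND   = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONAL_ROUND = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

def one_round(state, groups):
    # state: list of 16 32-bit words. With 4 QR instances this is one cycle;
    # with a single QR module it is 4 passes through the same hardware.
    for ia, ib, ic, id_ in groups:
        state[ia], state[ib], state[ic], state[id_] = quarter_round(
            state[ia], state[ib], state[ic], state[id_])
    return state
```

The de-interleaver is then just the inverse of this index permutation, applied
before the final add of the initial state and the xor with the data.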

To support AEAD, the first ChaCha20 block (counter 0) becomes the one-time key
for the Poly1305 block. This can be done in parallel with the second block,
which starts the cipher keystream, at the expense of double the gates.
Otherwise, there would be a delay in between packets while this is generated.
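
As a cross-check on that ordering, RFC 8439 derives the Poly1305 one-time key
from the ChaCha20 block computed with counter 0 and starts the cipher
keystream at counter 1. A rough sketch, assuming some chacha20_block(key,
counter, nonce) helper that returns a 64-byte block (the helper name is mine,
not part of the design):

```python
def aead_setup(chacha20_block, key, nonce):
    # Block with counter 0: its first 32 bytes become the Poly1305 one-time
    # key. In hardware this is the block we could compute in parallel with
    # the first keystream block, at the cost of a second set of round logic.
    poly1305_key = chacha20_block(key, 0, nonce)[:32]

    # Blocks with counter 1, 2, ...: keystream xored with the packet payload.
    def keystream():
        counter = 1
        while True:
            yield chacha20_block(key, counter, nonce)
            counter += 1

    return poly1305_key, keystream()
```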

Okay, so we did some timing tests and we can easily do 1 round of ChaCha20 in
a single cycle on a Titanium FPGA at 250MHz (roughly 350-400 MHz is
achievable).

So then it will take 20 cycles to calculate 512 bits, or 25.6 bits/cycle, or
6.4Gbps. So then we will need 2 of these for 10Gbps.
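
Written out as a quick calculation (assuming the 250MHz clock from the timing
test above):

```python
import math

CLOCK_HZ = 250e6
CYCLES_PER_BLOCK = 20                                     # one round per cycle

bits_per_cycle = 512 / CYCLES_PER_BLOCK                   # 25.6
gbps_per_core = bits_per_cycle * CLOCK_HZ / 1e9           # 6.4 Gbps
cores_for_10g = math.ceil(10e9 / (gbps_per_core * 1e9))   # 2
```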

So in order to use multiple cores, we would calculate 1024 bits in 20 cycles.
Then we would put those bits into a memory or something and start calculating
the next 1024 bits. Those bits would all be used up in 16 cycles (but the
throughput still checks out). Once they are used, we load the memory with the
new output.

This puts a 20 cycle minimum on small packets since the core is not completely
pipelined. This puts a hard cap at 12.5Mpps. At 42 byte packets, this is
4.2Gbps, and for 64 byte packets it is 6.4Gbps. In order to saturate the link,
you would need packets of at least 100 bytes.
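
The packet-rate math as a small helper (a sketch; the cycles_per_packet and
cores arguments also cover the 30-cycle and multi-core cases discussed below):

```python
def latency_limited_gbps(packet_bytes, cycles_per_packet=20,
                         clock_hz=250e6, cores=1):
    # A non-pipelined core is busy for cycles_per_packet cycles per packet,
    # so the packet rate is capped at cores * clock / cycles_per_packet.
    pps = cores * clock_hz / cycles_per_packet
    return packet_bytes * 8 * pps / 1e9

latency_limited_gbps(42)                                 # ~4.2 Gbps, 12.5 Mpps
latency_limited_gbps(64)                                 # ~6.4 Gbps
latency_limited_gbps(100)                                # ~10 Gbps, break-even
latency_limited_gbps(42, cycles_per_packet=30, cores=4)  # ~11.2 Gbps
```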

This is with the 20 cycle minimum, though in reality it would be more like 25
or 30 with the final addition, scheduling, pipelining, etc. Adding more cores
increases the throughput for larger packets, but does nothing for small
packets since the latency is the same. To solve this, we could instantiate the
entire core twice, such that we could handle 2 minimum-size packets at the
same time.

If we say there is a 30 cycle latency, the worst case is 2.8Gbps. Doubling the
number of cores gives 5.6Gbps, quadrupling the number of cores gives 11.2Gbps.
This would of course more than quadruple the area, since we need 4x the cores
as well as the mux and demux between them.

This could be configurable at compile time, though. The number of ChaChas per
core would also be configurable, but at the moment I choose 2.

Just counting the quarter rounds, there are 4\*2\*4 = 32 QR modules, or 64 if
we want 8 QRs per core instead of 4 for timing reasons.

Each QR is 322 XLR, so just the QRs would be either 10k or 20k XLR. That's
kind of a lot. A fully pipelined design would use 322\*20\*4 or 25k XLR. If we
can pass timing using 10k luts then that would be nice. We get a peak
throughput of 50Gbps, it's just that the latency kills our packet rate. If we
reduce the latency to 25 cycles and have 2 alternating cores, our packet rate
would be 20Mpps, increasing with every cycle we take off. I think that is
good. This would result in 5k XLR, which is not so bad.

Okay, so starting over now: our clock speed cannot be 250MHz, the best we can
do is 200MHz. If we assume this same 25 cycle latency, that's 4Gbps per block
(4.096, to be exact), so we would need 3 of them to surpass 10Gbps. So now we
need 3 blocks instead of 2.

We are barely going to be able to pass at 180MHz. Maybe the fully pipelined
core is a better idea, but we can just fully pipeline a quarter stage and
generate 512 bits every 4 clock cycles. This would give us a theoretical
throughput of 32Gbps, and we would not have to worry about latency and small
packets slowing us down. Let's experiment with what that would look like.

For our single round it's using 1024 adders, which almost sounds like it is
instantiating 8 quarter rounds instead of just 4. Either way, we can say that
a quarter round is 128ff + 128add + 250lut.

So pipelining 20 of these gives 10k luts. Not so bad.

Actually it's 88k luts... it's 512ff * 4 * 20 = 40k ff.

Let's just leave it for now even if it's overkill. The hardware would support
up to 40Gbps, and technically the FPGA has 16 lanes so it could do 160Gbps in
total, if we designed a custom board for it (or 120Gbps if we used FMC
connectors).

If we only use a single quarter round multiplexed between all 4, then the same
quarter round module can have 2 different blocks going through it at once.

The new one multiplexes 4 quarter rounds between 1 QR module, which reduces
the logic usage down to only 46k le, of which the vast majority is flops
(2k ff per round, 0.5k lut).

# Modulo 2^130-5

We can use the usual trick here to do the modular reduction much faster.

If we split the bits at 2^130, leaving 129 high bits and 130 low bits, we now
have a 129 bit value multiplied by 2^130, plus the 130 bit value. We know that
2^130 mod 2^130-5 is 5, so we can replace that 2^130 with 5 and add, then
repeat that step again.

Ex.

x = x1*2^130 + x2
x mod 2^130-5 = x1*5 + x2 -> x1*5 + x2 = x3
x3 = x4*2^130 + x5
x mod 2^130-5 = x4*5 + x5
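
The same split-and-fold as a quick software model (Python big ints standing in
for the wide adder/multiplier; the final conditional subtract is only there in
case a fully reduced value is needed):

```python
P = (1 << 130) - 5
LOW_MASK = (1 << 130) - 1

def fold(x):
    # Split at bit 130 and use 2^130 mod (2^130-5) = 5:
    # high*2^130 + low becomes high*5 + low, which stays congruent mod P.
    return (x >> 130) * 5 + (x & LOW_MASK)

def reduce_mod_p(x):
    # Two folds take a ~259-bit product down to at most 131 bits; one
    # conditional subtract then gives the canonical residue.
    x = fold(fold(x))
    return x - P if x >= P else x
```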

And let's do the math to verify that we only need two rounds. The maximum
value that we could possibly get is 2^131-1 and the maximum value for R is
0x0ffffffc0ffffffc0ffffffc0fffffff. Multiplying these together gives us
0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001.

Applying the first round to this we get

0x1ffffff81ffffff81ffffff81ffffffd * 5 + 0x3f0000003f0000003f0000003f0000001
= 0x48fffffdc8fffffdc8fffffdc8ffffff2

Applying the second round to this we get

1 * 5 + 0x8fffffdc8fffffdc8fffffdc8ffffff2 = 0x8fffffdc8fffffdc8fffffdc8ffffff7

And this is indeed the correct answer. The bottom part is 130 bits, but since
we put in the max values and it didn't overflow, I don't think it will
overflow here.
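
Quick arbitrary-precision check of the worked example:

```python
p = (1 << 130) - 5
r_max = 0x0ffffffc0ffffffc0ffffffc0fffffff
x = ((1 << 131) - 1) * r_max

assert x == 0x7fffffe07fffffe07fffffe07ffffff7f0000003f0000003f0000003f0000001
assert x % p == 0x8fffffdc8fffffdc8fffffdc8ffffff7
```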

131 + 128 = 259 bits, so we only have to do this once.

0xb83fe991ca75d7ef2ab5cba9cccdfd938b73fff384ac90ed284034da565ecf
0x19471c3e3e9c1bfded81da3736e96604a

Kind of curious now: at what point does a ripple carry adder using dedicated
CI/CO ports become slower than a more complex adder like carry lookahead or
carry save (Wallace tree)?
|