crypto/ChaCha20_Poly1305_64/doc/notes.md
2026-01-18 21:58:56 -08:00


# Overall Notes
We need to support 25 Gbps, and we will have 2 datapaths, tx and rx.
At a 128-bit datapath width this is about 200 MHz, but let's aim for 250 MHz.
# ChaCha20 Notes
ChaCha20 operates on 512-bit blocks. Each round is made of 4 quarter
rounds, which are identical except for which 32-bit words they operate
on. We can use the same 32-bit quarter-round unit 4 times in a row, but
we need to store the rest of the state between operations, so memory
usage might be similar to doing all 4 at once, while the logic would
only be 25% as much. Because we alternate between odd (column) and even
(diagonal) rounds, the word grouping used in one round is not the
grouping used in the other.
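As a reference for the quarter-round reuse idea, here is a software model of the RFC 8439 quarter round (the standard algorithm, not our RTL):

```python
def quarter_round(a, b, c, d):
    """ChaCha20 quarter round (RFC 8439) on four 32-bit words."""
    mask = 0xFFFFFFFF

    def rotl(x, n):
        # 32-bit left rotate
        return ((x << n) | (x >> (32 - n))) & mask

    a = (a + b) & mask; d = rotl(d ^ a, 16)
    c = (c + d) & mask; b = rotl(b ^ c, 12)
    a = (a + b) & mask; d = rotl(d ^ a, 8)
    c = (c + d) & mask; b = rotl(b ^ c, 7)
    return a, b, c, d
```

A full round applies this to 4 disjoint groups of the 16-word state (columns on odd rounds, diagonals on even rounds), which is what makes one shared quarter-round unit plus state storage viable.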
# Poly1305
## Parallel Operation
We can calculate in parallel, but we need to calculate r^n, where n is the number of
parallel stages. Ideally the number of parallel stages would equal the
latency of the full stage, so that it could be fully pipelined. For
example, if it took 8 cycles per block, we would have 8 parallel calculations. This
requires calculating r^n as well as every intermediate power: if we do 8,
then we need to calculate r^1, r^2, r^3, etc. This takes log2(n) sequential
multiply steps; we need:
- step 1: r\*r = r^2
- step 2: r\*r^2 = r^3, r^2\*r^2 = r^4
- step 3: r^4\*r = r^5, r^2\*r^4 = r^6, r^3\*r^4 = r^7, r^4\*r^4 = r^8
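That schedule can be sanity-checked with a quick software model (the log2(n) round structure is the point; the modulus is the Poly1305 prime):

```python
P1305 = (1 << 130) - 5  # Poly1305 prime

def key_powers(r, n=8, p=P1305):
    """Compute r^1..r^n mod p in log2(n) rounds. Within a round every
    multiply is independent, so at most n//2 multipliers run at once."""
    powers = {1: r % p}
    have = 1  # highest power computed so far
    while have < n:
        # these multiplies have no data dependence on each other
        for k in range(1, have + 1):
            powers[have + k] = (powers[have] * powers[k]) % p
        have *= 2
    return [powers[i] for i in range(1, n + 1)]
```

For n = 8 this runs 1, then 2, then 4 multiplies per round, which is why n/2 multiply blocks with feedback are enough.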
We can do all of these in parallel, so we need 4 (n/2) multiply blocks that feed back
on themselves, with some kind of FSM to control it. This can be done while another
block is being hashed, but there will be a delay between when the key is ready from
the chacha block and when the powers are ready, so there needs to be a FIFO in between.
Basically we have to wait until we see that the accumulator was written with our index.
At reset, though, the accumulator is unwritten, so we need to pretend that it was written.
Let's just write out what we want to happen:
1. The index starts at 0. We accept new data and send it through the pipeline.
2. We increment the index to 1.
3. We accept new data and send it through the pipeline.
4. We increment the index to 2.
5. We need to wait until index 0 is written before we can say we are ready.
6. If index 1 is written then we still need to say we are ready, though.
7. We can just use a 1 to indicate that it is a valid write then?
So in the shift register we just need to record whether each entry is a valid write,
so always 1?
But if we send in index 0, then index 1, the current index will be 0,
and eventually the final index will always be 0. We need to store what
the last written index is.
We can just say the last written one was 2, I guess.
We also need an input that tells it to reset the accumulator.
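A toy model of that ready/index bookkeeping, under my own reading of the steps above (the slot count and interface names here are made up for illustration):

```python
class AccumulatorTracker:
    """Issue pipeline indices round-robin; an index may only be reused
    once its previous result was written back to the accumulator.
    At reset every slot is treated as already written."""

    def __init__(self, n_slots):
        self.written = [True] * n_slots  # pretend all were written at reset
        self.index = 0
        self.n = n_slots

    def ready(self):
        # ready to accept data only if the slot we are about to reuse is safe
        return self.written[self.index]

    def issue(self):
        assert self.ready()
        i = self.index
        self.written[i] = False  # result for this index is now in flight
        self.index = (i + 1) % self.n
        return i

    def writeback(self, i):
        self.written[i] = True
```

Initializing every slot to "written" is the reset trick from above; after that, readiness only depends on the one slot about to be reused, not on write order.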
What if instead of calculating all the way up to r^16 I just calculated up to r^8
and then had 2 parallel blocks?
Let's think about the worst-case throughput. The theoretical layout would have
8 of these in parallel. A minimum-size packet of 64 bytes, for example, is 512
bits. This is less than 128*8, so it would only take one round. Therefore, we
take 16 cycles to do 64 bytes, or 32 bits per cycle. This is only 1/4 of our
target throughput of 128 bits per cycle. If the packet is big enough to fill the
second phase of the multiplier, then it can run in parallel and give up to 256
bits per 16 cycles. For this to happen, the packet size must be greater than
128*16 bits, or 256 bytes. I would really like to reach our target throughput
with 64-byte packets, so we may need more, smaller multipliers that can run in
parallel, at the cost of latency for larger packets.
A 64-byte packet is 512 bits, which takes up 4 128-bit lanes. If we have a group
of 2 multipliers, they can do 128\*2\*2 bits per 16 cycles, or 512 bits per 16
cycles, which is 32 bits per cycle as we said earlier. To hit our target of 128
bits per cycle we just instantiate 4 of these groups. This results in the same
number of multipliers (8), but configured to prioritize throughput over latency.
We need a demux or similar steering logic between the groups.
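The grouping arithmetic, written out (the 16-cycle multiply pass and 128-bit lanes are taken from these notes; this is just the bookkeeping, not a performance model):

```python
LANE_BITS = 128
CYCLES_PER_PASS = 16  # one multiplier pass, per the notes

def throughput(groups, lanes_per_group):
    """Bits per cycle when every group is kept busy."""
    return groups * lanes_per_group * LANE_BITS / CYCLES_PER_PASS

# one group of 2 multipliers, 2 lanes each, covers a 64-byte packet:
# 512 bits per 16 cycles = 32 bits per cycle
assert throughput(1, 4) == 32
# four such groups (8 multipliers total) hit the 128 bit/cycle target
assert throughput(4, 4) == 128
```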
If we only do 4 effective lanes in parallel, then we only need to run the multiply
loop twice: r -> r^2, then r^3 and r^4.
This will take 26 cycles, which is not ideal. Could we figure out a way to do all
of these powers in one step, taking only 13 cycles?
Alternatively, we could do only a single parallel step and just calculate r^2. This
would mean we have 8 different hashes going on at the same time, and it would drastically
increase latency, but I think that is a fair tradeoff.
So basically we need to store incoming data as 128-bit words. We will first get
r and s as 128-bit words. We store both and start work on squaring r. We will
also be receiving data this whole time at 128 bits per cycle, which we store in
a FIFO. Once r^2 is calculated, we start running data through the multiplier, with
a counter that tells us when we should be using r and when we should be using
r^2. We only have one special case to worry about: on the last value we use
r instead of r^2. We also need to remember to store the outputs on both
last cycles. Since we are storing the data in a FIFO, we will know which is the
last. There is also a possibility that the last word will not be a full 128 bits, so
we need to handle adding the leading 1 as well.
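A software model of that r/r^2 interleaving against a serial reference (polynomial accumulation only; clamping of r and the final s addition are left out, and the lane split is my interpretation of the scheme above):

```python
P1305 = (1 << 130) - 5  # Poly1305 prime

def poly_serial(blocks, r, p=P1305):
    """Reference: h = ((m0*r + m1)*r + ...) mod p, one multiply per block."""
    h = 0
    for m in blocks:
        h = (h + m) * r % p
    return h

def poly_two_lane(blocks, r, p=P1305):
    """Two interleaved accumulators, each stepping by r^2; only the very
    last block is multiplied by r, then the lanes are summed."""
    r2 = r * r % p
    h = [0, 0]
    last = len(blocks) - 1
    for i, m in enumerate(blocks):
        mult = r if i == last else r2
        h[i % 2] = (h[i % 2] + m) * mult % p
    return (h[0] + h[1]) % p

def pad_block(chunk: bytes) -> int:
    """Poly1305 block encoding: little-endian value with a leading 1 bit,
    which also covers a short (non-128-bit) final word."""
    return int.from_bytes(chunk, "little") | (1 << (8 * len(chunk)))
```

The two-lane version matches the serial reference for any block count, which is exactly the property the r/r^2 counter logic has to preserve.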
We can use 1 multiplier, 2 data FIFOs, and 2 constant buffers.
The utilization of the multiplier is kinda low though, since it's only used
once per packet instead of every 16 bytes.