crypto/ChaCha20_Poly1305_64/doc/notes.md
2026-01-18 21:58:56 -08:00

# Overall Notes

We need to support 25 Gbps, and we will have 2 datapaths, tx and rx.

At a 128-bit datapath, this is about 200 MHz, but let's aim for 250 MHz.
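As a quick sanity check on the clock target, the required frequency for the assumed line rate and datapath width:

```python
# Required clock for 25 Gbps on a 128-bit datapath.
line_rate_bps = 25e9
datapath_bits = 128

f_required_mhz = line_rate_bps / datapath_bits / 1e6
print(f_required_mhz)  # 195.3125 -> round up to 200 MHz, aim for 250 MHz
```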

# ChaCha20 Notes

ChaCha20 operates on 512-bit blocks. Each round is made of 4 quarter rounds, which are identical except for which 32-bit words they operate on. We can use the same 32-bit quarter-round logic 4 times in a row, but we need to store the rest of the state between operations, so memory usage might be similar to doing all 4 at once, while the logic would be only 25% as much. Because we alternate between odd (column) and even (diagonal) rounds, the word groupings used in one round are not the groupings used in the other.
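A small Python reference model of the quarter round and the odd/even (column/diagonal) index sets, to make the reuse idea concrete. This is just a behavioral model, not the RTL:

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    # 32-bit rotate left.
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(state, ia, ib, ic, id_):
    # One ChaCha20 quarter round, operating in place on 4 of the 16 words.
    a, b, c, d = state[ia], state[ib], state[ic], state[id_]
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    state[ia], state[ib], state[ic], state[id_] = a, b, c, d

# Odd rounds use the columns, even rounds the diagonals; the quarter-round
# logic itself is identical, only the word selection differs.
COLUMNS   = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

def double_round(state):
    for idx in COLUMNS:
        quarter_round(state, *idx)
    for idx in DIAGONALS:
        quarter_round(state, *idx)
```

The quarter-round test vector from RFC 8439 (a=0x11111111, b=0x01020304, c=0x9b8d6f43, d=0x01234567) can be used to check the model.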

# Poly1305

## Parallel Operation

We can calculate in parallel, but we need to calculate r^n, where n is the number of parallel stages. Ideally the number of parallel stages would equal the latency of the full stage; that way the design could be fully pipelined. For example, if it took 8 cycles per block, we would have 8 parallel calculations. This requires calculating r^n as well as every intermediate power. If we do 8,

then we need to calculate r^1, r^2, r^3, and so on up to r^8. With enough multipliers running in parallel, this takes log2(n) multiply steps.

we need

r * r     = r^2
r * r^2   = r^3
r^2 * r^2 = r^4
r^4 * r   = r^5
r^2 * r^4 = r^6
r^3 * r^4 = r^7
r^4 * r^4 = r^8

Not all of these can run in a single step (r^3 needs r^2 first), but they fit into log2(8) = 3 steps, so we need 4 (n/2) multiply blocks that feed back on themselves, with some kind of FSM to control it. This can be done while another block is being hashed, but there will be a delay between when the key is ready from the ChaCha block and when the powers are ready, so there needs to be a FIFO in between.
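A sketch of that power ladder in Python: each step doubles the set of available powers, so r^1..r^8 takes log2(8) = 3 steps and never more than n/2 = 4 independent multiplies per step. This assumes n is a power of two; the function name is made up for this model:

```python
P1305 = (1 << 130) - 5  # the Poly1305 prime

def power_ladder(r, n):
    # Compute r^1 .. r^n in log2(n) steps. In each step, every new power
    # is (existing power) * (highest power so far), so all multiplies in
    # a step are independent and can run on parallel multiplier blocks.
    pw = {1: r % P1305}
    have = 1  # we currently hold r^1 .. r^have
    while have < n:
        step = {have + k: (pw[k] * pw[have]) % P1305
                for k in range(1, have + 1)}  # at most n/2 multiplies
        pw.update(step)
        have *= 2
    return [pw[i] for i in range(1, n + 1)]
```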

Basically we have to wait until we see that the accumulator was written with our index. At reset, though, the accumulator is unwritten, so we need to pretend that it was written.

Let's just write out what we want to happen:

  1. The index starts at 0. We accept new data and send it through the pipeline.
  2. We increment the index to 1.
  3. We accept new data and send it through the pipeline.
  4. We increment the index to 2.
  5. We need to wait until index 0 is written before we can say we are ready.
  6. If index 1 is written, then we still need to say we are ready though.
  7. We can just use the 1 to indicate that it is a valid write then?

So in the shift register we just need to mark whether it is a valid write or not, so always 1?

But if we send in 0, then send in 1, then the current index will be 0, and eventually the final index will always be 0. We need to store which index was the last one written.

We can just say the last written one was 2, I guess.

We also need an input that tells it to reset the accumulator.
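One way to make the bookkeeping above concrete: track a per-index busy bit instead of comparing against the shift-register contents. At reset every index pretends to have been written, and a reset input returns the tracker to that state. All names here are made up for this sketch:

```python
class LaneTracker:
    """Tracks which accumulator indices are safe to issue new data on."""

    def __init__(self, n_lanes):
        self.n = n_lanes
        self.reset()

    def reset(self):
        # At reset the accumulator is actually unwritten, so we pretend
        # every index was already written back.
        self.written = [True] * self.n
        self.idx = 0

    def ready(self):
        # Ready only when the accumulator slot for the current index has
        # been written (or was never issued since reset).
        return self.written[self.idx]

    def issue(self):
        # Accept new data for the current index and advance round-robin.
        assert self.ready()
        i = self.idx
        self.written[i] = False  # now in flight through the pipeline
        self.idx = (i + 1) % self.n
        return i

    def writeback(self, i):
        # The pipeline wrote the accumulator for index i.
        self.written[i] = True
```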

What if instead of calculating all the way up to r^16 I just calculated up to r^8 and then just had 2 parallel blocks?

Let's think about the worst-case throughput. The theoretical layout would have 8 of these in parallel. A minimum-size packet of 64 bytes, for example, is 512 bits. This is less than 128*8, so it would only take one pass. Therefore, we take 16 cycles to do 64 bytes, or 32 bits per cycle, which is only 1/4 of our target throughput of 128 bits per cycle.

If the packet is large enough to fill the second phase of the multiplier, then it can run in parallel and give up to 256 bits per 16 cycles per multiplier. For this to happen, the packet size must be greater than 128*16 bits, or 256 bytes. I would really like to reach our target throughput with 64-byte packets, so we may need more, smaller multipliers that can run in parallel, at the cost of latency for larger packets.

A 64-byte packet is 512 bits, which takes up 4 128-bit lanes. If we have a group of 2 multipliers, they can do 128*2*2 bits per 16 cycles, or 512 bits per 16 cycles, which is 32 bits per cycle as we said earlier. To hit our target of 128 bits per cycle we just instantiate 4 of these groups. This results in the same number of multipliers (8), but configured differently to prioritize throughput over latency.
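The arithmetic above can be checked with a small model. Assumptions baked in: each multiplier holds `phases` blocks in flight per 16-cycle pass, each group works on its own packet, and the groups are always kept fed:

```python
import math

def min_packet_throughput(lanes, phases, groups,
                          packet_bits=512, block_bits=128, pass_cycles=16):
    # Blocks in one packet (default: a minimum-size 64-byte packet).
    blocks = math.ceil(packet_bits / block_bits)
    # Blocks one group can absorb per pass of the multiplier pipeline.
    slots = lanes * phases
    passes = math.ceil(blocks / slots)
    cycles = passes * pass_cycles
    # Aggregate bits per cycle with every group busy on its own packet.
    return groups * packet_bits / cycles

# One wide group of 8 multipliers: a 64-byte packet fills only 4 lanes.
print(min_packet_throughput(lanes=8, phases=1, groups=1))  # 32.0
# Four groups of 2 multipliers, 2 phases each: same 8 multipliers total.
print(min_packet_throughput(lanes=2, phases=2, groups=4))  # 128.0
```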

We need to have a demux or similar go between the groups.

If we only do 4 effective lanes in parallel, then we only need to run the multiply loop twice:

r -> r^2, then {r * r^2, r^2 * r^2} -> {r^3, r^4}

This will take 26 cycles, which is not ideal. Could we figure out a way to do all of these powers in one step, taking only 13 cycles?

Alternatively, we could do only a single parallel step and just calculate r^2. This would mean we have 8 different hashes going on at the same time, and would drastically increase latency, but I think that is a fair tradeoff.

So basically we need to store incoming data as 128-bit words. We will first get r and s as 128-bit words. We store both and start work on squaring r. We will also be receiving data this whole time at 128 bits per cycle, which we store in a FIFO. Once r^2 is calculated, we start running data through the multiplier, with a counter that tells us when we should be using r and when we should be using r^2. We only have one special case to worry about: when we get the last value we use r instead of r^2. We also need to remember to store the outputs on both last cycles. Since we are storing the data in a FIFO, we will know which word is the last. There is also a possibility that the last word will not be a full 128 bits, so we need to handle adding the leading 1 as well.
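The r / r^2 scheduling can be sanity-checked against a serial Poly1305 reference. This is a behavioral sketch (the two-lane version only handles an even number of full 16-byte blocks): both lanes multiply by r^2 each step, except that the lane holding the final block uses r on its last step, exactly as described above.

```python
P = (1 << 130) - 5  # the Poly1305 prime

def clamp_r(key16):
    # Poly1305 clamps certain bits of r before use.
    return int.from_bytes(key16, "little") & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF

def blocks_of(msg):
    # Each block gets a leading 1 appended (a high byte, little-endian).
    return [int.from_bytes(msg[i:i + 16] + b"\x01", "little")
            for i in range(0, len(msg), 16)]

def poly1305_serial(r_key, s_key, msg):
    # Reference model: h = (h + c_i) * r mod p for each block.
    r, s = clamp_r(r_key), int.from_bytes(s_key, "little")
    h = 0
    for c in blocks_of(msg):
        h = (h + c) * r % P
    return (h + s) % (1 << 128)

def poly1305_two_lane(r_key, s_key, msg):
    # Two-lane sketch: requires an even number of full 16-byte blocks.
    assert len(msg) % 32 == 0
    r, s = clamp_r(r_key), int.from_bytes(s_key, "little")
    r2 = r * r % P
    cs = blocks_of(msg)
    h_even = h_odd = 0
    for i in range(0, len(cs), 2):
        last = (i + 2 == len(cs))
        h_even = (h_even + cs[i]) * r2 % P
        # The lane that ends on the final block multiplies by r, not r^2.
        h_odd = (h_odd + cs[i + 1]) * (r if last else r2) % P
    h = (h_even + h_odd) % P
    return (h + s) % (1 << 128)
```

The serial model can be checked against the RFC 8439 Poly1305 test vector, and the two-lane model against the serial one on any even-block message.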

We can use 1 multiplier, 2 data FIFOs, and 2 constant buffers.

The utilization of that multiplier is kind of low, though, since it's only used once per packet instead of every 16 bytes.