# Overall Notes

We need to support 25 Gbps, and we will have two datapaths, TX and RX.

At a 128-bit datapath this is about 200 MHz, but let's aim for 250 MHz.
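
As a sanity check on that clock target (the 128-bit width and 25 Gbps line rate are just the numbers above):

```python
# Minimum clock needed to push 25 Gbps through a 128-bit datapath.
line_rate_bps = 25e9
datapath_bits = 128
print(line_rate_bps / datapath_bits / 1e6, "MHz")  # ~195.3 MHz, so 250 MHz leaves some margin
```
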
# ChaCha20 Notes

ChaCha20 operates on 512-bit blocks. Each round is made of 4 quarter rounds,
which are identical except for which 32-bit words of the state they operate
on. We can reuse the same 32-bit quarter-round block 4 times in a row, but we
need to store the rest of the state between operations, so memory usage might
be similar to doing all 4 at once, while the logic would only be 25% as much.
Because we alternate between odd (column) and even (diagonal) rounds, the word
grouping used in one round is not the grouping used in the other.
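
For reference, the quarter round itself is just 32-bit adds, XORs, and fixed rotates (the RFC 8439 definition), which is why instantiating it once vs. four times is mostly an area/latency tradeoff:

```python
def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def quarter_round(a, b, c, d):
    """One ChaCha20 quarter round on four 32-bit words (RFC 8439)."""
    a = (a + b) & 0xFFFFFFFF; d = rotl32(d ^ a, 16)
    c = (c + d) & 0xFFFFFFFF; b = rotl32(b ^ c, 12)
    a = (a + b) & 0xFFFFFFFF; d = rotl32(d ^ a, 8)
    c = (c + d) & 0xFFFFFFFF; b = rotl32(b ^ c, 7)
    return a, b, c, d

# Test vector from RFC 8439 section 2.1.1
print([hex(w) for w in quarter_round(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)])
# expected: ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```
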
# Poly1305

## Parallel Operation

We can calculate in parallel, but we need to calculate r^n, where n is the
number of parallel stages. Ideally the number of parallel stages would equal
the latency of the full stage, so that it could be fully pipelined. For
example, if it took 8 cycles per block, we would have 8 parallel calculations.
This requires calculating r^n as well as every intermediate power: if we do 8,
then we need r^1, r^2, r^3, and so on up to r^8. This takes log2(n) rounds of
dependent multiplies (n-1 multiplies in total) if we can run several
multipliers in parallel, as worked out below.
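
One way to see why r^n shows up: split the message blocks across n lanes, step each lane with r^n, and give each lane a final fix-up power of r. This is a math-level sketch (plain integers mod the Poly1305 prime), not the datapath; the plan further down folds the fix-up into the last multiply instead:

```python
import random

P = (1 << 130) - 5  # the Poly1305 prime

def serial(ms, r):
    """Plain Poly1305-style accumulation: h = sum of m_i * r^(N-i)."""
    h = 0
    for m in ms:
        h = (h + m) * r % P
    return h

def n_lane(ms, r, n):
    """Block i goes to lane i % n; each lane is stepped with r^n.
    Assumes len(ms) is a multiple of n for this sketch."""
    rn = pow(r, n, P)
    acc = [0] * n
    for i, m in enumerate(ms):
        acc[i % n] = (acc[i % n] * rn + m) % P
    # lane j's newest block is block N-n+j, whose serial coefficient is r^(n-j),
    # so each lane still owes one final multiply by r^(n-j)
    return sum(a * pow(r, n - j, P) for j, a in enumerate(acc)) % P

ms = [random.randrange(1 << 128) for _ in range(16)]  # 16 toy blocks
r = random.randrange(1 << 124)
assert n_lane(ms, r, 8) == serial(ms, r)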

We need:

    round 1:  r*r = r^2
    round 2:  r*r^2 = r^3    r^2*r^2 = r^4
    round 3:  r^4*r = r^5    r^2*r^4 = r^6    r^3*r^4 = r^7    r^4*r^4 = r^8
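
A quick software model of that schedule (a sketch only, with p = 2^130 - 5 standing in for the real multiplier; the point is the grouping into log2(n) rounds of independent multiplies):

```python
P = (1 << 130) - 5  # the Poly1305 prime

def powers_of_r(r, n=8):
    """Compute r^1 .. r^n in ceil(log2(n)) rounds; within a round every product
    uses only powers from earlier rounds, so they can run on parallel multipliers."""
    powers = {1: r % P}
    known = 1   # highest exponent computed so far
    rounds = 0
    while known < n:
        top = min(2 * known, n)
        # r^k = r^known * r^(k-known): these products are independent of each other
        for k in range(known + 1, top + 1):
            powers[k] = powers[known] * powers[k - known] % P
        known = top
        rounds += 1
    return powers, rounds  # rounds == 3 for n == 8, using at most n//2 multipliers per round

powers, rounds = powers_of_r(3, 8)
assert all(powers[k] == pow(3, k, P) for k in range(1, 9)) and rounds == 3
```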

We can do all of these in parallel, so we need 4 (n/2) multiply blocks that
feed back on themselves, with some kind of FSM to control them. This can be
done while another block is being hashed, but there will be a delay between
when the key is ready from the ChaCha block and when the powers are ready, so
there needs to be a FIFO in between.

Basically we have to wait until we see that the accumulator was written with
our index. At reset, though, the accumulator is unwritten, so we need to
pretend that it was written.

Let's just write out what we want to happen:

1. The index starts at 0. We accept new data and send it through the pipeline.
2. We increment the index to 1.
3. We accept new data and send it through the pipeline.
4. We increment the index to 2.
5. We need to wait until index 0 is written before we can say we are ready.
6. If index 1 is written, then we still need to say we are ready though.
7. So we can just use a 1 to indicate that it is a valid write?

So in the shift register we just need to say whether it is a valid write or
not, so always 1?

But if we send in 0, then send in 1, then the current index will be 0, and
eventually the final index will always be 0. We need to store what the last
written one is.

We can just say the last written one was 2, I guess.

We also need an input that tells it to reset the accumulator; a toy model of
this bookkeeping is sketched below.
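
A toy software model (not RTL) of the behaviour described above, assuming a hypothetical pipeline depth N and that a block tagged with index i may only be issued once the previous use of index i has been written back; at reset every index is treated as already written. The separate accumulator-reset input is not modelled here:

```python
N = 3  # hypothetical pipeline depth / number of in-flight indices

class IndexTracker:
    def __init__(self):
        # At reset nothing is in flight, so pretend every index has already been written.
        self.written = [True] * N
        self.issue_idx = 0        # next tag to hand out
        self.last_written = None  # which index holds the newest accumulator value

    def ready(self):
        # We can accept new data only if the tag we are about to reuse has been retired.
        return self.written[self.issue_idx]

    def issue(self):
        assert self.ready()
        idx = self.issue_idx
        self.written[idx] = False
        self.issue_idx = (idx + 1) % N
        return idx

    def writeback(self, idx):
        # The pipeline finished the block tagged `idx` and wrote the accumulator.
        self.written[idx] = True
        self.last_written = idx

t = IndexTracker()
a = t.issue(); b = t.issue(); c = t.issue()   # fill the pipeline: indices 0, 1, 2
assert not t.ready()                          # index 0 not written back yet, so stall
t.writeback(a)
assert t.ready() and t.last_written == a      # oldest slot retired; we can issue again
```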

What if, instead of calculating all the way up to r^16, I just calculated up
to r^8 and then had 2 parallel blocks?

Let's think about the worst-case throughput. The theoretical layout would have
8 of these in parallel. A minimum-size packet of 64 bytes, for example, is 512
bits. This is less than 128*8 bits, so it would only take one pass. Therefore
we take 16 cycles to do 64 bytes, or 32 bits per cycle, which is only 1/4 of
our target throughput of 128 bits per cycle.

If the packet is big enough to fit into the second phase of the multiplier,
then it can run in parallel and give up to 256 bits per 16 cycles. For this to
happen, the packet size must be greater than 128*16 bits, or 256 bytes. I
would really like to be able to reach our target throughput with 64-byte
packets, so we may need more, smaller multipliers that can run in parallel, at
the cost of latency for larger packets.

A 64-byte packet is 512 bits, which takes up 4 128-bit lanes. If we have a
group of 2 multipliers, they can do 128\*2\*2 bits per 16 cycles, or 512 bits
per 16 cycles, which is 32 bits per cycle as we said earlier. To hit our
target of 128 bits per cycle we just instantiate 4 of these groups. This
results in the same number of multipliers (8), but configured differently to
prioritize throughput over latency.
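
A toy comparison of the two layouts on minimum-size packets, under the assumptions above (16-cycle multiplier pass, 128-bit blocks, a second phase that lets each multiplier absorb two blocks per pass, and each group working on its own packet):

```python
LANE_BITS = 128
PASS_CYCLES = 16
PHASES = 2  # blocks one multiplier can absorb per pass when there is enough data

def aggregate_bits_per_cycle(packet_bits, mults_per_group, groups):
    """Aggregate throughput when every group chews on its own small packet."""
    group_capacity = mults_per_group * PHASES * LANE_BITS  # bits per 16-cycle pass
    useful_bits = min(packet_bits, group_capacity)         # a small packet can't fill a big group
    return groups * useful_bits / PASS_CYCLES

print(aggregate_bits_per_cycle(512, mults_per_group=8, groups=1))  # 32.0  -> one wide group
print(aggregate_bits_per_cycle(512, mults_per_group=2, groups=4))  # 128.0 -> four groups of two
```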

We need a demux or something similar to distribute packets between the groups.

If we only do 4 effective lanes in parallel, then we only need to run the
multiply loop twice:

    r -> r^2,   then r^2 -> (r^3, r^4)

This will take 26 cycles, which is not ideal. Could we figure out a way to do
all of these powers in one step, taking only 13 cycles?

Alternatively, we could do only a single parallel step and just calculate r^2.
This would mean we have 8 different hashes going on at the same time, and
would drastically increase latency, but I think that is a fair tradeoff.

So basically we need to store incoming data as 128-bit words. We will first
get r and s as 128-bit words. We store both and start work on squaring r. We
will also be receiving data this whole time at 128 bits per cycle, which we
store in a FIFO. Once r^2 is calculated, we start running data through the
multiplier, with a counter that tells us when we should be using r and when
we should be using r^2. We only have one case to worry about: when we get the
last value we use r instead of r^2. We also need to remember to store the
outputs on both last cycles. Since we are storing the data in a FIFO, we will
know which is the last. There is also a possibility that the final block will
not be a full 128 bits, so we need to handle adding the leading 1 as well.
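
A small software model of that r/r^2 scheduling (Poly1305 polynomial only: 16-byte little-endian blocks with the leading 1 appended; clamping of r and the final addition of s are outside this sketch). The interleaved version is checked against the plain serial loop, including a partial final block:

```python
import os

P = (1 << 130) - 5  # the Poly1305 prime

def blocks(msg):
    """16-byte little-endian blocks with the leading 1 appended (handles a short final block)."""
    for i in range(0, len(msg), 16):
        chunk = msg[i:i + 16]
        yield int.from_bytes(chunk, "little") + (1 << (8 * len(chunk)))

def poly_serial(msg, r):
    h = 0
    for m in blocks(msg):
        h = (h + m) * r % P
    return h

def poly_two_lane(msg, r):
    """Two interleaved accumulators, each stepped with r^2, except that the very
    last block of the message is stepped with r (the 'use r instead of r^2' case
    above); the two accumulators are summed at the end."""
    r2 = r * r % P
    bs = list(blocks(msg))
    acc = [0, 0]
    for i, m in enumerate(bs):
        mult = r if i == len(bs) - 1 else r2
        acc[i % 2] = (acc[i % 2] + m) * mult % P
    return (acc[0] + acc[1]) % P

for n in (16, 17, 33, 64, 1500):  # includes partial final blocks and an MTU-ish size
    msg = os.urandom(n)
    r = int.from_bytes(os.urandom(16), "little")
    assert poly_serial(msg, r) == poly_two_lane(msg, r)
```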

We can use 1 multiplier, 2 data FIFOs, and 2 constant buffers.

The utilization of the multiplier is kind of low, though, since it is only
used once per packet instead of every 16 bytes.