Whatever I was working on
@@ -63,4 +63,56 @@ the last written one is.
We can just say the last written one was 2 I guess

We also need an input that tells it to reset the accumulator.

What if, instead of calculating all the way up to r^16, I just calculated up to
r^8 and then had 2 parallel blocks?

Let's think about the worst-case throughput. The theoretical layout would have
8 of these in parallel. A minimum-size packet of 64 bytes, for example, is 512
bits. This is less than 128*8 bits, so it would only take one round. Therefore,
we take 16 cycles to do 64 bytes, or 32 bits per cycle. This is only 1/4 of our
target throughput. In order to reach our target throughput of 128 bits per cycle,

If the packet is big enough to fit into the second phase of the multiplier, then
it can run in parallel and give up to 256 bits per 16 cycles. In order for this
to happen, the packet size must be greater than 128*16 bits, or 256 bytes. I
would really like to be able to reach our target throughput with 64 byte packets,
so we may need to have more, smaller multipliers that can run in parallel, at the
cost of latency for larger packets.

A 64 byte packet is 512 bits, which takes up four 128 bit lanes. If we have a
group of 2 multipliers, they can do 128\*2\*2 bits per 16 cycles, or 512 bits per
16 cycles, which is 32 bits per cycle as we said earlier. To hit our target of 128
bits per cycle we just instantiate 4 of them. This results in the same number of
multipliers (8), but configured differently to prioritize throughput over latency.
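
The arithmetic above can be checked with a small Python sketch. This is only a
back-of-the-envelope model, not the design itself: the function name
`bits_per_cycle` and its parameters are made up for illustration, and it assumes
one round always costs 16 cycles and that a packet never spans groups.

```python
BLOCK_BITS = 128        # width of one lane, from the notes
CYCLES_PER_ROUND = 16   # cycles per multiplier round, from the notes


def bits_per_cycle(packet_bytes, lanes, groups=1):
    """Rough throughput for back-to-back packets of one size.

    lanes  -- effective 128-bit lanes that one group absorbs per round
    groups -- independent groups, each working on its own packet (assumption)
    """
    packet_bits = packet_bytes * 8
    capacity = lanes * BLOCK_BITS            # bits one group takes per round
    rounds = -(-packet_bits // capacity)     # ceiling division
    cycles = rounds * CYCLES_PER_ROUND
    return groups * packet_bits / cycles


# One block of 8 lanes, 64 byte packets: a single half-empty round.
print(bits_per_cycle(64, lanes=8))             # 32.0
# Four groups of 4 effective lanes, each hashing its own 64 byte packet.
print(bits_per_cycle(64, lanes=4, groups=4))   # 128.0
```

Both layouts use 8 multipliers in the notes' accounting; the second one just
keeps all of them busy on minimum-size packets.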

We need a demux or something similar to go between the groups.

If we only do 4 effective lanes in parallel, then we only need to do the multiply
loop twice.

r->r^2

This will take 26 cycles, which is not ideal. Could we figure out a way to do all
of these powers in one step, taking only 13 cycles?
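
As a rough sketch of where the 26 comes from: if each pass through the multiply
loop costs 13 cycles (inferred from "twice" giving 26; the notes do not state it
directly) and the top power is reached by repeated squaring, the setup cost
grows with log2 of the lane count. The helper name `power_setup_cycles` is
hypothetical.

```python
CYCLES_PER_MULTIPLY = 13  # assumption: one pass through the multiply loop


def power_setup_cycles(lanes):
    """Cycles to reach r^lanes by squaring, one squaring after another."""
    if lanes < 1 or lanes & (lanes - 1):
        raise ValueError("lanes must be a power of two")
    squarings = lanes.bit_length() - 1   # log2(lanes)
    return squarings * CYCLES_PER_MULTIPLY


print(power_setup_cycles(4))  # 26: two sequential squarings
print(power_setup_cycles(2))  # 13: only r^2 is needed
```

This only counts the squaring chain. Intermediate powers such as r^3 would need
an extra multiply, which this model ignores.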

Alternatively, we could only do a single parallel step and just calculate r^2.
This would mean we have 8 different hashes going on at the same time, and would
drastically increase latency, but I think that is a fair tradeoff.

So basically we need to store incoming data as 128 bit words. We will first get
r and s as 128 bit words. We store both and start work on squaring r. We will
also be receiving data this whole time at 128 bits per cycle, which we store in
a FIFO. Once r^2 is calculated, we start running the data through the multiplier,
with a counter that tells us when we should be using r and when we should be
using r^2. We only have 1 special case to worry about: when we get the last
value, we use r instead of r^2. We also need to remember to store the outputs on
both last cycles. Since we are storing the data in a FIFO, we will know which is
the last. There is also a possibility that the data will not be a full 128 bits,
so we need to handle adding the leading 1 as well.
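
Here is a minimal software model of the r / r^2 scheme described above, as a
sanity check rather than RTL. It assumes the hash is Poly1305 (the notes talk
about r, s, 128 bit words and a leading 1 but never name it), so the modulus
2^130 - 5, the clamping mask and the little-endian block layout are assumptions
taken from that algorithm. All function names are made up for illustration.

```python
P = (1 << 130) - 5        # Poly1305 prime (assumed algorithm)
MASK128 = (1 << 128) - 1


def clamp(r):
    """Clear the bits of r that Poly1305 requires to be zero."""
    return r & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF


def blocks(msg):
    """Split into 16 byte words and add the leading 1 above each word.

    A short final word gets its 1 right after its last byte, which is the
    "not a full 128 bits" case from the notes.
    """
    return [int.from_bytes(msg[i:i + 16] + b"\x01", "little")
            for i in range(0, len(msg), 16)]


def poly1305_serial(msg, r, s):
    """Reference: one accumulator, multiply by r after every word."""
    r = clamp(r)
    acc = 0
    for m in blocks(msg):
        acc = (acc + m) * r % P
    return (acc + s) & MASK128


def poly1305_two_lane(msg, r, s):
    """Two interleaved accumulators that multiply by r^2.

    Only the very last word is multiplied by r instead of r^2. The two lane
    outputs (one from each of the last two words) are summed at the end.
    """
    r = clamp(r)
    r2 = r * r % P
    words = blocks(msg)
    lanes = [0, 0]
    for i, m in enumerate(words):
        power = r if i == len(words) - 1 else r2
        lanes[i % 2] = (lanes[i % 2] + m) * power % P
    return (sum(lanes) % P + s) & MASK128


# Arbitrary example values for r and s, not from any test vector.
r = 0x0123456789ABCDEFFEDCBA9876543210
s = 0xFEDCBA98765432100123456789ABCDEF
msg = bytes(range(70))  # 4 full words plus a 6 byte tail
assert poly1305_serial(msg, r, s) == poly1305_two_lane(msg, r, s)
```

The two lanes never depend on each other until the final sum, which is what
lets a pipelined multiplier alternate between them.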

We can use 1 multiplier, 2 data FIFOs, and 2 constant buffers.

The utilization of the multiplier is kinda low though, since it's only used
once per packet instead of every 16 bytes.