The parser circuit processes only packets that have a TCP tag and a destination port number set beforehand. These packets are analyzed. The parameters of TCP, e.g., Sequential Number (SN#), Acknowledge Number (ACK#), Window Size, Flags, and Source Port Number are extracted and transferred to the TCP state manager circuit that manages the TCP connection and requests transmission of management packets that have SYN, FIN, and RST flags.
Unfortunately it does not actually specify what the header looks like, just what is in it. It sounds like a TCP packet, though. I wonder if the TCP packet also has an IP header and an Ethernet header?
In this case, then, it appears that the external circuit is responsible for asking to open the connection and for keeping track of the sequence numbers. The actual TCP engine handles the handshakes, retransmission, and acking.
I would like to offload even more, though. The only things that the application level needs to provide are the ports, and possibly a subset of the flags. The sequence numbers and window sizes can be handled completely in hardware.
Here is a TCP header. We don't need a lot of these things:
Sequence Number
Ack Number
Checksum
Window Size
Urgent Pointer
Honestly it's easier to say what we do need (sketched in code after this list):
Source Port
Destination Port
Flag Subset
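Roughly, the split could look like this. This is only a sketch of the standard TCP header annotated with who would own each field in this design; the hw/app split is my own reading of the plan, not anything from the paper.

```c
#include <stdint.h>

/* Sketch of the TCP header with a note on who would own each field.
 * On the wire, data offset and the flags share 16 bits; they are split
 * out here only for readability. */
struct tcp_header {
    uint16_t src_port;     /* app: fixed when the socket is created             */
    uint16_t dst_port;     /* app: fixed when the socket is created             */
    uint32_t seq_num;      /* hw: tracked per connection                        */
    uint32_t ack_num;      /* hw: tracked per connection                        */
    uint8_t  data_offset;  /* hw: constant if we never send options             */
    uint8_t  flags;        /* app provides a subset (SYN/FIN); hw adds ACK etc. */
    uint16_t window;       /* hw: flow control handled in hardware              */
    uint16_t checksum;     /* hw: computed as the packet streams out            */
    uint16_t urgent_ptr;   /* hw: always zero, urgent data unsupported          */
};
```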
We need some way of telling the TCP engine that we want to create or close a connection. To open a connection, we could send a blank packet that just has the SYN flag. The TCP engine will handle receiving the SYN-ACK and sending the ACK without forwarding to the application.
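From the application's side, open and close could then be as thin as this. The flag values come from the TCP header; `engine_submit` is a placeholder for however the descriptor actually reaches the engine.

```c
#include <stdint.h>
#include <stddef.h>

/* Flag bit values from the TCP header; engine_submit() is a placeholder
 * for whatever the real submission interface ends up being. */
#define TCP_FLAG_FIN 0x01
#define TCP_FLAG_SYN 0x02

void engine_submit(uint16_t src_port, uint16_t dst_port,
                   uint8_t flags, const void *data, size_t len);

/* Open is a zero-length send with only SYN set; the engine completes the
 * SYN / SYN-ACK / ACK exchange without involving the application. */
void tcp_open(uint16_t src_port, uint16_t dst_port) {
    engine_submit(src_port, dst_port, TCP_FLAG_SYN, NULL, 0);
}

/* Close looks the same with FIN. */
void tcp_close(uint16_t src_port, uint16_t dst_port) {
    engine_submit(src_port, dst_port, TCP_FLAG_FIN, NULL, 0);
}
```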
OK, after some review, my understanding is this: TCP is a streaming protocol. You do not send packets into the socket, you just send bytes, and it is then up to the TCP implementation to break the stream into chunks that fit under the MTU. The part I am still confused about is how they are reassembled on the other end. How does the TCP implementation know that the sender is done? In the implementations I have read, you just call read() or recv() or whatever and then check the size.
Do you have to close the connection in order to see this?
I think the best way to figure this out is just with experimentation. We can create simple TCP servers in both Python and C and see how they behave when we send, e.g., 5 kB of data, which has to go out as multiple packets.
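Something like this would do for the C half of the experiment: accept one connection and print how many bytes each recv() call returns (port 5000 is arbitrary).

```c
/* Minimal C receiver for the experiment: accept one connection and print
 * how many bytes each recv() call returns. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    int one = 1;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5000);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 1);

    int conn = accept(srv, NULL, NULL);
    char buf[8192];
    ssize_t n;
    while ((n = recv(conn, buf, sizeof(buf), 0)) > 0)
        printf("recv() returned %zd bytes\n", n);   /* one line per wakeup */

    close(conn);
    close(srv);
    return 0;
}
```

Pushing 5 kB at it from another machine with something like nc and a file redirect, and watching whether it shows up as one recv() or several, is exactly the experiment.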
I think it is actually the push flag. Look at this Wireshark capture:
This HTTP response is made up of 3 packets: 28, 29, and 30. Packets 28 and 29 are marked as TCP and do not have the push flag set. The final packet, which is actually marked as HTTP, does have the push flag set.
Based on this, and also on experiments I did with a simple Python server and netcat, the receive syscall does not return until it sees a TCP packet with the push flag set.
Question: what happens if we send multiple packets while the server is not receiving? Will it return both of them at once, or the first one and then the other?
So then: the actual interface that an application has is to open and close sockets, and to read and write bytes to the socket. Nowhere does the application care about packets; it is always just a stream of bytes.
So when we send our data from the application, we have to send it to a socket. Traditionally this would be a file descriptor or something. In our system, we would have the data we want to send in memory and then trigger a DMA to send it to the networking engine. So when we trigger the DMA we would supply the source address, the length, and the socket pointer. The socket pointer would point to a structure holding the IP addresses and the ports, and other stuff as well.
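The transmit request might then boil down to a three-field descriptor, something like this (all names are placeholders):

```c
#include <stdint.h>

/* What the application hands over on a send: where the bytes are, how
 * many there are, and which connection they belong to. */
struct socket_ctx;                 /* IPs, ports, sequence numbers, windows; sketched further down */

struct tx_dma_request {
    uint32_t           src_addr;   /* address of the data in memory         */
    uint32_t           length;     /* number of bytes to stream out         */
    struct socket_ctx *socket;     /* pointer to the connection's context   */
};
```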
But what about receiving data?
In Linux we can receive data with the socket open even if we are not actively receiving. The data needs to be stored in a buffer until the client asks for it, so there needs to be a buffer in the kernel. But that is more of a software issue. The hardware issue is: how do we know which socket the data belongs to when we don't explicitly tell it? Well, the answer is in the ports: we check the incoming port against all of the active connections. If it matches, then we store the data in whatever buffer that socket has specified.
And how will we look up which ports are active? Basically, we get a destination port and need to turn that into a pointer. Something like a linked list could be very slow with a lot of connections and does not scale. We could try something like a CAM, where we look up the port and, if it matches, the entry gives us the address. This would limit the number of active connections, though. We could do some combination of the two, where we check the CAM and, if the port is not in there, we fall back to the linked list. To keep that from wasting too much time, we could put a bloom filter in front so we only walk the list when the port might actually be there.
A linked list might be hard to memory-manage, though. We could instead have RAM inside the module where the entries are stored, and then loop over that: slower than the CAM, but faster than reading from DRAM.
Each element in the RAM would only be 6 bytes, so our total number of connections is limited by how much RAM we spend. For example, if we want to support, say, 16 entries in the CAM and 1024 in the memory, that would be 6 kB of memory. The FPGA has 128 kB of RAM, so that's not too much. We could also limit it to something smaller like 256; it can change at compile time.
We can either scan the whole memory or keep track of what is in there so we know where to stop; otherwise we waste one clock cycle for every element in the memory. The memory also has to be dual-ported, with one write port and one read port.
But think: do we need all of these connections? For example, my laptop right now has 24 connections open, and the 6502 system will definitely need fewer. We could just make the CAM bigger, say 32 entries, and ditch the memory. After all, this is not a high-performance system.
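As a software model, the 32-entry CAM would behave like this. In hardware all the compares happen in parallel in a single cycle; the loop is only here to show the behaviour, and the entry layout follows the 2-byte port plus 4-byte pointer sizing above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Software model of the 32-entry port-lookup CAM. */
#define CAM_ENTRIES 32

struct cam_entry {
    bool     valid;
    uint16_t local_port;   /* key                                             */
    uint32_t ctx_addr;     /* value: address of the socket context in memory */
};

static struct cam_entry cam[CAM_ENTRIES];

/* Returns the context address for a destination port, or 0 if no active
 * connection is using that port (in which case the packet is dropped). */
uint32_t cam_lookup(uint16_t dst_port) {
    for (int i = 0; i < CAM_ENTRIES; i++)
        if (cam[i].valid && cam[i].local_port == dst_port)
            return cam[i].ctx_addr;
    return 0;
}
```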
The state manager is responsible for keeping track of the status of the TCP connection and is also what loads the socket data from memory. The socket data structure includes the source and destination IP addresses and ports, the RX and TX sequence numbers, and the congestion and flow control windows.
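Sketching that context as a struct; the field names are mine, the contents just follow the list above.

```c
#include <stdint.h>

/* Sketch of the per-connection context the state manager loads from memory. */
struct socket_ctx {
    uint32_t local_ip;
    uint32_t remote_ip;
    uint16_t local_port;
    uint16_t remote_port;

    uint32_t tx_seq;    /* next sequence number we will send           */
    uint32_t rx_seq;    /* next sequence number we expect to receive   */

    uint32_t cwnd;      /* congestion window, in bytes                 */
    uint32_t rwnd;      /* peer's advertised (flow control) window     */

    uint8_t  state;     /* IDLE, SYN_SENT, SYN_RECVD, ESTABLISHED, ... */
    uint8_t  flags;     /* e.g. a "data ready to read" bit set on PSH  */
};
```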
TX Control is responsible for handling the TX path. It sends out the data it is given and keeps track of the windows to make sure it has not sent more than they allow. It also updates the sequence numbers and handles retransmission when requested by the state manager. It maintains the state of the TX buffer as well, so it might actually make more sense to have the data from the DMA go through TX Control. TX Control needs to maintain the buffer because it is a circular buffer and the DMA is streaming.
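The core check TX Control has to make before emitting more bytes is whether both windows have room. A minimal version, using the socket_ctx sketch above; snd_una (the oldest unacknowledged sequence number) is borrowed from the TCP spec, not from this design.

```c
#include <stdint.h>
#include <stdbool.h>

/* Is there room in both the congestion and flow control windows for
 * `len` more bytes? */
bool tx_may_send(const struct socket_ctx *s, uint32_t snd_una, uint32_t len) {
    uint32_t in_flight = s->tx_seq - snd_una;                 /* sent but not yet acked */
    uint32_t window    = s->cwnd < s->rwnd ? s->cwnd : s->rwnd;
    return in_flight + len <= window;
}
```

If the check fails, TX Control simply holds the data in the circular buffer until ACKs open the window back up.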
Parse reads the incoming packets and checks both the destination IP and the destination port. If the IP does not match, the packet is dropped immediately. The parser then looks up the destination port in the CAM. If it matches, it sends the address of the socket structure to the TCP state manager, which then DMAs it into its local memory. From there, the packet goes into RX Control, where the TCP part of it is handled: the sequence numbers are checked, and ACKs are requested to be sent out of TX. If the received packet carries data, the data gets sent to the RX Buffer. If there is a PSH packet, then we have to do something, idk; that's kind of where my thought process ended for now.
Yeah, so the push flag can either generate an interrupt or just set a flag in the context that there is data to be read. The context is written back to memory after each packet is received, and an interrupt can also be generated at that point.
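Putting the receive path together, roughly. This builds on the cam_lookup and socket_ctx sketches above; every other helper below is a placeholder for one of the hardware blocks, not a real function.

```c
#include <stdint.h>

#define TCP_FLAG_PSH   0x08
#define CTX_DATA_READY 0x01

/* Placeholders for the hardware blocks named above. */
void load_context(uint32_t addr, struct socket_ctx *ctx);         /* state manager DMA in  */
void store_context(uint32_t addr, const struct socket_ctx *ctx);  /* state manager DMA out */
void rx_buffer_write(struct socket_ctx *ctx, const uint8_t *p, uint32_t len);
void request_ack(struct socket_ctx *ctx);                         /* ask TX to emit an ACK */

void on_rx_packet(uint32_t our_ip, uint32_t dst_ip, uint16_t dst_port,
                  uint32_t seq, uint8_t flags,
                  const uint8_t *payload, uint32_t len) {
    if (dst_ip != our_ip)
        return;                                /* Parse: wrong IP, drop immediately */

    uint32_t ctx_addr = cam_lookup(dst_port);  /* Parse: port -> context address    */
    if (ctx_addr == 0)
        return;                                /* no active connection on this port */

    struct socket_ctx ctx;
    load_context(ctx_addr, &ctx);

    if (seq == ctx.rx_seq && len > 0) {        /* RX Control: in-order data         */
        rx_buffer_write(&ctx, payload, len);
        ctx.rx_seq += len;
    }
    request_ack(&ctx);                         /* ack what we have either way       */

    if (flags & TCP_FLAG_PSH)
        ctx.flags |= CTX_DATA_READY;           /* or raise an interrupt instead     */

    store_context(ctx_addr, &ctx);             /* context written back every packet */
}
```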
Is writing the context back to memory after every single packet going to be a problem? I don't think so, because of how slow the Ethernet link is. The core is running at 100 MHz and the MII at 100 Mbps. The minimum-size packet is 20+20+44, or 84 bytes; that is 672 bits, which takes 6.72 µs on the wire, or 672 clock cycles at 100 MHz for each packet. This should be plenty of time.
With 672 clock cycles between packets we should also still have time to support receiving independent streams. The core runs ~8x faster than the data arrives, and that's assuming an 8-bit interface.
But we still need to think carefully about how we handle the contexts. Right now, the plan is this:
We create a socket by creating a context struct.
We open the socket by setting a bit in the context struct and then DMAing that to the controller with no data.
Upon receiving the context pointer, the controller will read it into its memory, notice the SYN flag set, and try to create the connection.
It will also associate the local port with the pointer in the CAM.
So the problem I see with this is that we can't host servers, because there is no way to listen on ports. But in fact there is: we can create the context struct and DMA it to the controller without actually telling the controller to create the connection. The controller will still associate it in the CAM and then do nothing. When a SYN packet is received for that port, it will hit the CAM and we will load that context from memory, see that it is in the IDLE state, send the SYN/ACK, and move it into the SYN_RECVD state.
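The passive open is just a small state transition. A sketch using the state names from the discussion above, with IDLE playing the role of TCP's LISTEN state:

```c
/* How a listening socket comes up when a SYN arrives. */
enum tcp_state { IDLE, SYN_SENT, SYN_RECVD, ESTABLISHED, CLOSED };

enum tcp_state on_syn(enum tcp_state current) {
    switch (current) {
    case IDLE:              /* context was DMAed in but never "opened"       */
        /* send_syn_ack() */
        return SYN_RECVD;
    case SYN_RECVD:         /* duplicate SYN: resend the SYN/ACK, stay put   */
        /* send_syn_ack() */
        return SYN_RECVD;
    default:                /* SYN on an already-open connection: ignore here */
        return current;
    }
}
```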
I think we are going the wrong way with this. As long as a single port manager is not too many LUTs, I think it would be better to just statically assign these per port and then free them when the connection is closed. No more packet-type DMA or anything like that.
So we have a new idea for the actual stream block, but we need to think about the DMA. We could do something where we have a linked list of DMA descriptors, where each descriptor is one packet?
However, the point of TCP is that we do not have packets; it's just a stream. So we should have an even simpler DMA: address, length, go, where writing to length is also the go. The sending part is easy; it's the receiving part that is hard.
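So the whole TX DMA interface could shrink to two registers. The base address and 32-bit widths here are made up, and on the real 6502 bus these would presumably be byte-wide registers; this just shows the "writing length is also go" idea.

```c
#include <stdint.h>

/* Two-register TX DMA: set the source address, then write the length,
 * and the length write doubles as the "go" signal. */
#define DMA_ADDR   (*(volatile uint32_t *)0xF000u)   /* source address            */
#define DMA_LENGTH (*(volatile uint32_t *)0xF004u)   /* byte count; write starts  */

void stream_send(uint32_t src, uint32_t len) {
    DMA_ADDR   = src;   /* where the bytes come from         */
    DMA_LENGTH = len;   /* this write kicks off the transfer */
}
```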
We could preload the address, but then we do not know the length; or we could have the data stored in a memory inside the stream block, but we don't have a ton of memory in the FPGA.
It looks like in Linux they use ring buffers in memory: when data is received, the NIC writes it into the ring buffer and then raises an interrupt to the CPU. Then it is up to software to read from the ring buffer.
The same thing happens when we transmit: we write data to the ring buffer and the NIC notices that and starts forming packets.
Since we do not have a lot of CPU, though, can we make the ring buffers in hardware? This doesn't entirely solve the problem, but it gives the data plenty of room to land before the CPU is ready to handle it. A downside is that the data has to be copied from one area of memory to another, but given that the memory is much faster than the CPU, this isn't a huge problem. Moving a 4 kB block of data would take about 41 microseconds in the best case; the CPU is running at 1 MHz, so that's only 41 clock cycles.
With this setup, though, does each stream need its own ring buffer in order to present a stream-like interface? Yes, unless we somehow tag the data going into the ring buffer. The thing about the ring buffers is that they are not split up by packets and should present a stream interface, which means multiple writes could be combined into a single packet. But that is not actually required on the send side: since we are the ones sending, we can send whatever we want and leave the server to handle it. On the receive side, though, we need to combine the packets back into a single contiguous stream of data. If multiple streams are active they could get interleaved, and if one process is not reading, its data will block up the queue.
One thing we could do is have the ring buffer managers in hardware, but then have the actual memory areas be configurable in software, so you could decide at runtime how large the ring buffers are going to be. I think this is the best way forward.
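A sketch of what software would hand to a hardware ring buffer manager: the base and size are picked at runtime, and the head/tail indices are what the hardware and the CPU actually move. Names are placeholders.

```c
#include <stdint.h>

/* Per-stream ring buffer: the manager lives in hardware, but software
 * picks where it lives and how big it is. */
struct ring_buffer {
    uint32_t base;   /* start address in SDRAM, set by software           */
    uint32_t size;   /* capacity in bytes (power of two), set by software */
    uint32_t head;   /* write index, advanced by hardware as data arrives */
    uint32_t tail;   /* read index, advanced by the CPU as it consumes    */
};

/* With head/tail as free-running counters and a power-of-two size, the
 * occupancy is a plain subtraction, and wrap-around comes out of masking
 * when the actual addresses are formed. */
static inline uint32_t ring_used(const struct ring_buffer *rb) {
    return rb->head - rb->tail;
}

static inline uint32_t ring_free(const struct ring_buffer *rb) {
    return rb->size - ring_used(rb);
}
```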
But where does the data that goes into the TX buffer come from? We still need a DMA engine that reads the data from SDRAM and then writes it back to SDRAM. This seems kind of dumb, but it's kind of the only choice we have. So we still need the M2S/S2M DMA that we had before, but this time it comes after the circular buffer.