Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

spidernet: driver documentation

Documentation for the spidernet driver.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

Authored by Linas Vepstas and committed by Jeff Garzik
3213e3ab 5f309b90

+204
Documentation/networking/spider_net.txt

The Spidernet Device Driver
===========================

Written by Linas Vepstas <linas@austin.ibm.com>

Version of 7 June 2007

Abstract
========
This document sketches the structure of portions of the spidernet
device driver in the Linux kernel tree. The spidernet is a gigabit
ethernet device built into the Toshiba southbridge commonly used
in the SONY Playstation 3 and the IBM QS20 Cell blade.

The Structure of the RX Ring.
=============================
The receive (RX) ring is a circular linked list of RX descriptors,
together with three pointers into the ring that are used to manage its
contents.

The elements of the ring are called "descriptors" or "descrs"; they
describe the received data. This includes a pointer to a buffer
containing the received data, the buffer size, and various status bits.

There are three primary states that a descriptor can be in: "empty",
"full" and "not-in-use". An "empty" or "ready" descriptor is ready
to receive data from the hardware. A "full" descriptor has data in it,
and is waiting to be emptied and processed by the OS. A "not-in-use"
descriptor is neither empty nor full; it is simply not ready. It may
not even have a data buffer in it, or is otherwise unusable.

During normal operation, on device startup, the OS (specifically, the
spidernet device driver) allocates a set of RX descriptors and RX
buffers. These are all marked "empty", ready to receive data. This
ring is handed off to the hardware, which sequentially fills in the
buffers, and marks them "full". The OS follows up, taking the full
buffers, processing them, and re-marking them empty.

This filling and emptying is managed by three pointers: the "head"
and "tail" pointers, managed by the OS, and a hardware current
descriptor pointer (GDACTDPA).
The GDACTDPA points at the descr
currently being filled. When this descr is filled, the hardware
marks it full, and advances the GDACTDPA by one. Thus, when there is
flowing RX traffic, every descr behind it should be marked "full",
and everything in front of it should be "empty". If the hardware
discovers that the current descr is not empty, it will signal an
interrupt, and halt processing.

The tail pointer tails or trails the hardware pointer. When the
hardware is ahead, the tail pointer will be pointing at a "full"
descr. The OS will process this descr, and then mark it "not-in-use",
and advance the tail pointer. Thus, when there is flowing RX traffic,
all of the descrs in front of the tail pointer should be "full", and
all of those behind it should be "not-in-use". When RX traffic is not
flowing, then the tail pointer can catch up to the hardware pointer.
The OS will then note that the current tail is "empty", and halt
processing.

The head pointer (somewhat mis-named) follows after the tail pointer.
When traffic is flowing, then the head pointer will be pointing at
a "not-in-use" descr. The OS will perform various housekeeping duties
on this descr. This includes allocating a new data buffer and
dma-mapping it so as to make it visible to the hardware. The OS will
then mark the descr as "empty", ready to receive data. Thus, when there
is flowing RX traffic, everything in front of the head pointer should
be "not-in-use", and everything behind it should be "empty". If no
RX traffic is flowing, then the head pointer can catch up to the tail
pointer, at which point the OS will notice that the head descr is
"empty", and it will halt processing.

Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty".
All of the
other descrs in the ring should be "empty" as well.

The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.

A typical example of the output, for a nearly idle system, might be

    net eth1: Total number of descrs=256
    net eth1: Chain tail located at descr=20
    net eth1: Chain head is at 20
    net eth1: HW curr desc (GDACTDPA) is at 21
    net eth1: Have 1 descrs with stat=x40800101
    net eth1: HW next desc (GDACNEXTDA) is at 22
    net eth1: Last 255 descrs with stat=xa0800000

In the above, the hardware has filled in one descr, number 20. Both
head and tail are pointing at 20, because it has not yet been emptied.
Meanwhile, hw is pointing at 21, which is free.

The "Have nnn descrs" line refers to the descrs starting at the tail: in
this case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" line
refers to all of the rest of the descrs, from the last status change.
The "nnn" is a count of how many descrs have exactly the same status.

The status x4... corresponds to "full" and status xa... corresponds
to "empty". The actual value printed is RXCOMST_A.

In the device driver source code, a different set of names is
used for these same concepts, so that

    "empty"      == SPIDER_NET_DESCR_CARDOWNED  == 0xa
    "full"       == SPIDER_NET_DESCR_FRAME_END  == 0x4
    "not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf


The RX RAM full bug/feature
===========================

As long as the OS can empty out the RX buffers at a rate faster than
the hardware can fill them, there is no problem.
If, for some reason,
the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
pointer will catch up to the head, notice the not-empty condition,
and stop. However, RX packets may still continue arriving on the wire.
The spidernet chip can save some limited number of these in local RAM.
When this local RAM fills up, the spider chip will issue an interrupt
indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
will be set in GHIINT1STS). When the RX RAM full condition occurs,
a certain bug/feature is triggered that has to be specially handled.
This section describes the special handling for this condition.

When the OS finally has a chance to run, it will empty out the RX ring.
In particular, it will clear the descriptor on which the hardware had
stopped. However, once the hardware has decided that a certain
descriptor is invalid, it will not restart at that descriptor; instead
it will restart at the next descr. This potentially will lead to a
deadlock condition, as the tail pointer will be pointing at this descr,
which, from the OS point of view, is empty; the OS will be waiting for
this descr to be filled. However, the hardware has skipped this descr,
and is filling the next descrs. Since the OS doesn't see this, there
is a potential deadlock, with the OS waiting for one descr to fill,
while the hardware is waiting for a different set of descrs to become
empty.

A call to show_rx_chain() at this point indicates the nature of the
problem. A typical print when the network is hung shows the following:

    net eth1: Spider RX RAM full, incoming packets might be discarded!
    net eth1: Total number of descrs=256
    net eth1: Chain tail located at descr=255
    net eth1: Chain head is at 255
    net eth1: HW curr desc (GDACTDPA) is at 0
    net eth1: Have 1 descrs with stat=xa0800000
    net eth1: HW next desc (GDACNEXTDA) is at 1
    net eth1: Have 127 descrs with stat=x40800101
    net eth1: Have 1 descrs with stat=x40800001
    net eth1: Have 126 descrs with stat=x40800101
    net eth1: Last 1 descrs with stat=xa0800000

Both the tail and head pointers are pointing at descr 255, which is
marked xa... which is "empty". Thus, from the OS point of view, there
is nothing to be done. In particular, there is the implicit assumption
that everything in front of the "empty" descr must surely also be empty,
as explained in the last section. The OS is waiting for descr 255 to
become non-empty, which, in this case, will never happen.

The HW pointer is at descr 0. This descr is marked 0x4.. or "full".
Since it's already full, the hardware can do nothing more, and thus has
halted processing. Notice that descrs 0 through 253 are all marked
"full" (127 + 1 + 126 = 254 descrs), while descrs 254 and 255 are empty.
(The "Last 1 descrs" is descr 254, since tail was at 255.) Thus, the
system is deadlocked, and there can be no forward progress; the OS
thinks there's nothing to do, and the hardware has nowhere to put
incoming data.

This bug/feature is worked around with the spider_net_resync_head_ptr()
routine. When the driver receives RX interrupts, but an examination
of the RX chain seems to show it is empty, then it is probable that
the hardware has skipped a descr or two (sometimes dozens under heavy
network conditions). The spider_net_resync_head_ptr() subroutine will
search the ring for the next full descr, and the driver will resume
operations there.
Since this will leave "holes" in the ring, there
is also a spider_net_resync_tail_ptr() that will skip over such holes.

As of this writing, the spider_net_resync() strategy seems to work very
well, even under heavy network loads.


The TX ring
===========
The TX ring uses a low-watermark interrupt scheme to make sure that
the TX queue is appropriately serviced for large packet sizes.

For packet sizes greater than about 1 KByte, the kernel can fill
the TX ring faster than the device can drain it. Once the ring
is full, the netdev is stopped. When there is room in the ring,
the netdev needs to be reawakened, so that more TX packets are placed
in the ring. The hardware can empty the ring about four times per jiffy,
so it's not appropriate to wait for the poll routine to refill, since
the poll routine runs only once per jiffy. The low-watermark mechanism
marks a descr about 1/4 of the way from the bottom of the queue, so
that an interrupt is generated when the descr is processed. This
interrupt wakes up the netdev, which can then refill the queue.
For large packets, this mechanism generates a relatively small number
of interrupts, about 1K/sec. For smaller packets, this will drop to zero
interrupts, as the hardware can empty the queue faster than the kernel
can fill it.


======= END OF DOCUMENT ========
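
The RX ring bookkeeping and the skipped-descr stall described in the
RX sections above can be sketched as a small simulation. This is a toy
model in Python, not the driver's code: the RxRing class, its methods
and the demo() helper are hypothetical names invented for illustration.
Only the three state names and their codes (0xa, 0x4, 0xf) are taken
from this document. The resync() method illustrates the general idea of
skipping ahead to the next "full" descr; it is an assumption about the
approach and does not reproduce spider_net_resync_head_ptr() or
spider_net_resync_tail_ptr().

```python
# State codes as listed in this document (top nibble of the status word).
EMPTY, FULL, NOT_IN_USE = 0xA, 0x4, 0xF


class RxRing:
    """Toy model of the RX ring: a list of states plus three indexes."""

    def __init__(self, size):
        self.state = [EMPTY] * size  # startup: every descr is "empty"
        self.hw = 0                  # stands in for GDACTDPA
        self.tail = 0
        self.head = 0

    def _next(self, i):
        return (i + 1) % len(self.state)

    def hw_receive(self):
        """Hardware fills the current descr; False means it had to halt."""
        if self.state[self.hw] != EMPTY:
            return False
        self.state[self.hw] = FULL
        self.hw = self._next(self.hw)
        return True

    def os_process(self):
        """OS drains "full" descrs at the tail; returns how many."""
        n = 0
        while self.state[self.tail] == FULL:
            self.state[self.tail] = NOT_IN_USE
            self.tail = self._next(self.tail)
            n += 1
        return n

    def os_refill(self):
        """OS re-arms "not-in-use" descrs at the head; returns how many."""
        n = 0
        while self.state[self.head] == NOT_IN_USE:
            # A real driver would allocate and DMA-map a buffer here.
            self.state[self.head] = EMPTY
            self.head = self._next(self.head)
            n += 1
        return n

    def resync(self):
        """Hypothetical resync: move tail to the next "full" descr, then
        let head skip the "empty" holes left behind it."""
        for _ in range(len(self.state)):
            if self.state[self.tail] == FULL:
                break
            self.tail = self._next(self.tail)
        else:
            return False  # no "full" descr anywhere; tail is back at start
        while self.head != self.tail and self.state[self.head] == EMPTY:
            self.head = self._next(self.head)
        return True


def demo(size=8):
    ring = RxRing(size)
    while ring.hw_receive():       # flood: hardware fills the ring, halts
        pass
    ring.os_process()
    ring.os_refill()
    ring.hw = ring._next(ring.hw)  # modelled quirk: restart one descr later
    for _ in range(3):
        ring.hw_receive()
    stalled = ring.os_process()    # tail descr is "empty": nothing happens
    ring.resync()
    recovered = ring.os_process()
    ring.os_refill()
    return stalled, recovered, (ring.hw, ring.tail, ring.head)
```

With the default ring of 8 descrs, demo() returns (0, 3, (4, 4, 4)).
The OS makes no progress while its tail sits on the skipped descr.
After the resync it handles the three packets, and all three pointers
come to rest on the same "empty" descr, which matches the idle
condition described in the RX ring section.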