HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17     27622        7     6823
  128    758150     464364        21      9301       10     7738
  256    445632     774646        42     15507       21    12906
  512    232666     994445    241292     19147   241192     1062
 1024    119061    1000003    872519     19258   872511        0
 1440     85193    1000003    946576     19505   946569        0

Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that the poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more often we could not clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is a
possibility that, under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....

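For instance, reading off the first and last rows of the table above: at
890 Kpps the whole 1M-packet run generated only 17 rx interrupts, whereas
at 85 Kpps the same 1M packets generated 946,576 rx interrupts, i.e. very
nearly one interrupt per input packet.
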
0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts, or at least the events that send
packets up the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document moves the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
	I)  What is known as Clear-on-read (COR):
	when you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move everything to dev->poll().

	II) Clear-on-write (COW):
	 i) You clear the status by writing a 1 to the bit location you
		want cleared.
		These are the majority of the NICs and they work best with NAPI.
		Put only receive events in dev->poll(); leave the rest in
		the old interrupt handler.
	 ii) Whatever you write in the status register clears everything ;->
		We can't seem to find any chips supported by Linux which do
		this. If you know of such a chip, please email us.
		Move everything to dev->poll().

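For illustration only, a clear-on-write type (i) acknowledgement might look
roughly like the sketch below; the register name, bit masks and ioaddr are
hypothetical, the point is simply that writing a 1 clears only the bits
you name:

	/* ack only the events handled in the irq handler; writing a 1
	 * to a bit clears that bit and nothing else */
	writel(status & (tx_related | link_interrupt), ioaddr + IntrStatus);
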
C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). A packet might sneak in during the period
we are enabling interrupts. We only get to know about such a packet when the
next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition,
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to further
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick up the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
This implies that receive is totally lockless because of the guarantee that
only one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
For example, link/MII and tx-complete interrupts continue functioning just
the same old way. This improves the latency of processing these events. It
is also assumed that the receive interrupt is the largest cause of noise.
Note this might not always be true.
[According to Manfred Spraul, the winbond insists on sending one
tx-complete interrupt for each packet (although this can be mitigated)].
For such broken chips, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Adds the device to the poll list for this CPU, assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for the device, specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)

Removes the interface from the CPU poll list: it must be in the poll list
on the current CPU. This primitive is called by dev->poll() when
it completes its work. The device cannot be out of the poll list at this
point; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.

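To make the relationship between a), b) and c) concrete: netif_rx_schedule()
is essentially the two halves glued together, roughly along the lines of
this sketch (illustrative only, not a copy of the kernel header):

	static inline void netif_rx_schedule(struct net_device *dev)
	{
		if (netif_rx_schedule_prep(dev))
			__netif_rx_schedule(dev);
	}
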
Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets on the current CPU before yielding to the network
subsystem (so other devices also get an opportunity to send to the stack).

The dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
Each driver is responsible for decrementing budget by the total number of
packets sent.
	The total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver simply sends
up to the requested number of packets to the stack, if it can.

More on dev->poll() below, after the interrupt changes are explained.

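Before moving on to the interrupt changes, here is a minimal sketch of the
budget/quota accounting contract described above (the names are those used
in the my_poll() example later in this document; "received" is simply
however many packets this call pushed up the stack):

	dev->quota -= received;		/* per-device allowance */
	*budget    -= received;		/* per-CPU allowance */
	return 0;	/* 0 == done; return 1 if there is work left */
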
2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method.
e.g.:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;

3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D. Becker
interrupt handler:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{

	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	int work_count = my_work_count;
	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) || more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();  /* don't ack rx and rxnobuf here */
/************************ end note ***********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell system we have work to be done. */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts  */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note ***********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) || more_work_to_be_done(status));
/************************ end note ***********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------

We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) and b) a packet arriving and finding no DMA buffers
available (rxnobuff).
This also means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI, i.e. in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in the running state
and was successfully added to the core poll list. If we get a zero value
we can _almost_ assume we are already on the list (rather than not running),
the logic being that you shouldn't get an interrupt if the device is not
running. We rectify this by disabling rx and rxnobuf interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may appear to have
disappeared. These functionalities are actually still around;
in fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D. Becker receive_packets(dev) to my_poll().

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by the interrupt handler */
static void receive_packets(struct net_device *dev)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4 are just examples */
		rx_status = le32_to_cpu(*(u32 *)(rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE + 4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err(rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab an skb */
		skb = dev_alloc_skb(pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx(skb);
			.
			.
		} else {  /* OOM */
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++tp->cur_rx) % RX_RING_SIZE;
		received++;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
	.
	.

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll(struct net_device *dev, int *budget)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_BUF_LEN;
	int received = 0;
	/* maximum packets to send to the stack */
/************************ note *********************************/
	int rx_work_limit = dev->quota;

/************************ end note *****************************/
    do {  /* outer loop starts here */

	clear_rx_status_register_bit();

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4 are just examples */
		rx_status = le32_to_cpu(*(u32 *)(rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE + 4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err(rx_status, dev, tp, ioaddr);
			return 1;
		}

/************************ note *********************************/
		if (--rx_work_limit < 0) { /* we got packets, but no quota */
			/* store current ring pointer state */
			tp->cur_rx = cur_rx;

			/* Refill the Rx ring buffers if they are needed */
			refill_rx_ring(dev);
			goto not_done;
		}
/**********************  end note ******************************/

		/* grab an skb */
		skb = dev_alloc_skb(pkt_size + 2);
		if (skb) {
			.
			.
/************************ note *********************************/
			netif_receive_skb(skb);
/**********************  end note ******************************/
			.
			.
		} else {  /* OOM */
			/* seems very driver specific ... common is to just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++tp->cur_rx) % RX_RING_SIZE;
		received++;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring(dev);

	/* no packets on ring; but new ones can arrive since we last
	   checked  */
	status = read_interrupt_status_reg();
	if (rx status is not set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
	}
	/* done! at least that's what it looks like ;->
	   if new packets came in after our last check on the status bits
	   they'll be caught by the while check and we go back and clear them,
	   since we haven't exceeded our quota */
    } while (rx_status_is_set);

done:

/************************ note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If RX ring is not full we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuf_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in the irq handler (which are done
	 *    to schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq was raised after the beginning of the outer loop
	 *    (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0;   /* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1;  /* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0;  /* we'll take it from here so tell core "done" */

/************************ end note *****************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the OOM condition.
5) A new outer do { } while() loop has been added. This serves the purpose
of ensuring that, if a new packet comes in after we think we are all set
and done, and we have not exceeded our quota, we continue sending packets up.

-----------------------------------------------------------
Poll timer code will need to do the following:

a)

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If RX ring is not full we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();

	else netif_rx_schedule(dev);  /* we are back on the poll list */

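A possible shape for the hypothetical start_poll_timer()/restart_timer()
helpers used in the oom path is sketched below, using the classic
timer_list API of that era; the oom_timer field and the 20 ms retry
interval are purely illustrative assumptions:

---------
static void my_poll_timer(unsigned long data)
{
	struct net_device *dev = (struct net_device *)data;
	struct my_private *tp = (struct my_private *)dev->priv;

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		mod_timer(&tp->oom_timer, jiffies + HZ/50); /* still OOM: retry */
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */
}

static void start_poll_timer(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;

	init_timer(&tp->oom_timer);
	tp->oom_timer.function = my_poll_timer;
	tp->oom_timer.data = (unsigned long)dev;
	mod_timer(&tp->oom_timer, jiffies + HZ/50);
}
---------
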
5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this. The top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this in later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packet storm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that it is much harder to resolve the resource allocation
issue (a.k.a. the lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is
not necessary.

APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here:

1) status/int which honors level-triggered IRQs

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally, the concept of level-triggered IRQs in association with a status
and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be turned off
(but CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt-enable bit on while
the status bit is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things, then when we enable
interrupts there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new ones can arrive while we are doing this */
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                         * where the comments are ;-> an irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you have just finished grabbing all the packets from the rx ring ... you check
if the status bit says there are more packets just in ... it says none; you
then enable rx interrupts again; if a new packet just came in during this
check, we are counting on the fact that CSR5 will be set in that small window
of opportunity and that by re-enabling interrupts, we would actually trigger
an interrupt to register the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level-triggered IRQs. Normally,
irqs may be lost while being masked and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts()
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs()
		goto restart_poll
	}
---------

Basically, netif_rx_complete() removes us from the poll list; but because a
new packet might come in during the race window and would otherwise never be
noticed, we check for new input and attempt to re-add ourselves to the
poll list.


APPENDIX 3: Scheduling issues
==============================
As seen, NAPI moves processing to the softirq level. Linux uses ksoftirqd as
the general solution to schedule softirqs to run before the next interrupt,
and puts them under scheduler control. This also prevents consecutive
softirqs from monopolizing the CPU. It has the further effect that the
priority of ksoftirqd needs to be considered when running very CPU-intensive
applications alongside networking, to get the proper softirq/user balance.
Increasing ksoftirqd priority to 0 (possibly more) is reported to cure
problems with low network performance at high CPU load.

Most used processes in a GigE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
=================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton  <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>