Fri, 01 May 2009

r8169 NETDEV WATCHDOG transmit timed out problem

I recently built a new home server box using an Intel Atom board (BOXD945GCLF2, Atom 330 dual core 1.6GHz to be exact) and ran into a strange problem where the box would crash with an error like this:
[322865.976030] ------------[ cut here ]------------
[322865.976038] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf6/0x18b()
[322865.976043] Hardware name:
[322865.976047] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[322865.976051] Modules linked in: ipt_MASQUERADE xt_limit xt_helper xt_multiport xt_DSCP xt_tcpudp xt_state ipt_LOG ipt_REJECT iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter iptable_mangle ip_tables x_tables ipv6 fuse loop hid_pl hid_cypress hid_zpff hid_gyration hid_sony hid_ntrig hid_samsung hid_microsoft hid_tmff hid_monterey hid_ezkey hid_apple hid_a4tech hid_logitech ff_memless hid_cherry hid_sunplus hid_petalynx hid_belkin hid_chicony usbhid hid ds2490 wire cn serio_raw 8139too i2c_i801 rng_core 8139cp parport_pc evdev i2c_core floppy parport ehci_hcd uhci_hcd button thermal processor iTCO_wdt thermal_sys usbcore
[322865.976152] Pid: 0, comm: swapper Not tainted 2.6.29.1 #1
[322865.976156] Call Trace:
[322865.976167]  [] warn_slowpath+0x80/0xb6
[322865.976176]  [] cpumask_next_and+0x23/0x33
[322865.976184]  [] find_busiest_group+0x2fa/0x7e2
[322865.976193]  [] sched_clock_cpu+0x136/0x147
[322865.976200]  [] dev_watchdog+0xf6/0x18b
[322865.976207]  [] hrtimer_forward+0x10c/0x124
[322865.976214]  [] scheduler_tick+0x9c/0x1a3
[322865.976220]  [] getnstimeofday+0x4c/0xcf
[322865.976227]  [] lapic_next_event+0x10/0x13
[322865.976233]  [] dev_watchdog+0x0/0x18b
[322865.976241]  [] run_timer_softirq+0x14a/0x1b4
[322865.976247]  [] dev_watchdog+0x0/0x18b
[322865.976254]  [] __do_softirq+0x8c/0x130
[322865.976260]  [] do_softirq+0x45/0x53
[322865.976266]  [] irq_exit+0x35/0x62
[322865.976272]  [] smp_apic_timer_interrupt+0x71/0x7b
[322865.976280]  [] apic_timer_interrupt+0x28/0x30
[322865.976287]  [] mwait_idle+0x4c/0x5a
[322865.976293]  [] cpu_idle+0x60/0x7a
[322865.976298] ---[ end trace f9e87d98b4ee5218 ]---
[322866.001730] r8169: eth0: link up
It would always happen while transferring large amounts of data out from the server through the onboard gigabit Ethernet, listed in lspci as:
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
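
Since the timeout only shows up now and then, it can help to scan the kernel log for it rather than wait for the next crash. Here is a minimal Python sketch that picks out the watchdog line shown above. The function name and the regular expression are my own, based only on the one log line in this post, so other kernels may word the message differently.

```python
import re

# Matches lines like:
#   [322865.976047] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
# The pattern is an assumption based on the single log line quoted above.
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): transmit timed out"
)


def find_watchdog_timeouts(log_text):
    """Return a list of (timestamp, interface, driver) for each timeout line."""
    hits = []
    for line in log_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("iface"), m.group("driver")))
    return hits
```

Feed it the text of dmesg or a saved kernel log and it returns one tuple per timeout, which makes it easy to see how often the NIC is falling over.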

Sometimes it would freeze up the machine for a minute or two, and other times it crashed and rebooted. Tracking the problem down was quite a pain, since it only happened occasionally when transferring large amounts of data. Searching for a fix was hard too: I found many others with the same problem on this Realtek NIC, but no one had a solution. Eventually I stumbled upon this post describing the same problem, where the last reply is someone saying they were going to try the pci=nomsi boot option. I guess it worked for him and so he never posted back. I tried it out myself and it seems to have fixed the problem.
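
After adding the option to the bootloader's kernel line and rebooting, it is worth checking that the kernel actually saw it. The sketch below parses the kernel command line (normally read from /proc/cmdline) and looks for nomsi among the pci= options, which the kernel accepts as a comma-separated list. The function names are mine, made up for illustration.

```python
def pci_options(cmdline):
    """Collect every option passed via pci=... on the kernel command line."""
    opts = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= takes a comma-separated list, e.g. pci=nomsi,noaer
            opts.extend(token[len("pci="):].split(","))
    return opts


def msi_disabled_at_boot(cmdline):
    """Return True if the kernel was booted with pci=nomsi."""
    return "nomsi" in pci_options(cmdline)


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the server itself, msi_disabled_at_boot(read_cmdline()) should come back True once the option is in place.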

The pci=nomsi option disables MSI (Message Signaled Interrupts) system-wide, so devices fall back to legacy pin-based interrupts. MSI is a feature of PCI 2.2 and later (and of PCI Express). It seems to cause trouble with some hardware and driver combinations, since disabling it turns up as the fix for a number of different problems with PCI devices not working so well.
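
To see whether the NIC is really using MSI before and after the change, you can look at /proc/interrupts, where the column after the per-CPU counters names the interrupt type (something like PCI-MSI-edge or IO-APIC-fasteoi). Below is a rough Python sketch of that check. The exact layout of /proc/interrupts differs between kernel versions, so treat the parsing as an assumption; the function names are my own.

```python
def interrupt_chip_for(interrupts_text, device):
    """Return the interrupt type string for `device`, or None if not listed.

    Assumes the usual /proc/interrupts layout: a header row with one
    column per CPU, then rows of 'IRQ: counts... type device[, device]'.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return None
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        rest = fields[1 + ncpus:]  # skip the IRQ label and per-CPU counts
        if len(rest) < 2:
            continue
        # Devices sharing an IRQ are listed comma-separated at the end.
        names = [t.rstrip(",") for t in rest[1:]]
        if device in names:
            return rest[0]
    return None


def uses_msi(interrupts_text, device):
    """Return True if `device` is listed with an MSI interrupt type."""
    chip = interrupt_chip_for(interrupts_text, device)
    return chip is not None and "MSI" in chip
```

Calling uses_msi(open("/proc/interrupts").read(), "eth0") should report True with MSI active and False after booting with pci=nomsi, assuming the interface is up and named eth0.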

posted at: 01:37 | path: /debian

