Poor ethernet latency

Hi,

I have an AML-S905X-CC currently running Debian 12 with Linux 6.1.92. The network latency and ping response time are generally poor. I know it is a 100Mbps port, but my really slow Raspberry Pi Model B performs more than twice as well latency-wise on its 100Mbps port.

The ping latency is on the order of 1.0ms to 1Gbps servers on my 1Gbps LAN, whereas the RPiB manages 0.3 to 0.35ms.

The RPiB has a configuration option, “smsc95xx.turbo_mode=0”, that improves latency, but I’ve not seen any way of reducing latency on the Le Potato board.
It means the slow and ancient RPiB is a far better GPS Stratum 1 reference than the allegedly much more performant Libre Computer board.

Anyone with any ideas or configurations for improving this?

From an AML-S805X-AC (same chip as AML-S905X-CC in a different package), to a gigabit device on the same switch. Never seen a generic kernel get 0.3ms.

debian-12-aml-s805x-ac:~$ ping debian-12-aml-a311d-cc -A -c 20
PING debian-12-aml-a311d-cc (192.168.12.173) 56(84) bytes of data.
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=1 ttl=64 time=0.740 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=2 ttl=64 time=0.687 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=3 ttl=64 time=0.679 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=4 ttl=64 time=0.671 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=5 ttl=64 time=0.674 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=6 ttl=64 time=0.486 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=7 ttl=64 time=0.695 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=8 ttl=64 time=0.512 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=9 ttl=64 time=0.677 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=10 ttl=64 time=0.686 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=11 ttl=64 time=0.683 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=12 ttl=64 time=0.664 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=13 ttl=64 time=0.414 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=14 ttl=64 time=0.678 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=15 ttl=64 time=0.667 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=16 ttl=64 time=0.665 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=17 ttl=64 time=0.678 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=18 ttl=64 time=0.682 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=19 ttl=64 time=0.678 ms
64 bytes from 192.168.12.173: icmp_seq=20 ttl=64 time=0.676 ms

--- debian-12-aml-a311d-cc ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 49ms
rtt min/avg/max/mdev = 0.414/0.649/0.740/0.078 ms, ipg/ewma 2.590/0.661 ms

From gigabit to gigabit

debian-12-aml-s905d3-cc:~$ ping debian-12-aml-a311d-cc -A -c 20
PING debian-12-aml-a311d-cc.lan (192.168.12.173) 56(84) bytes of data.
64 bytes from debian-12-aml-a311d-cc.lan (192.168.12.173): icmp_seq=1 ttl=64 time=0.736 ms
64 bytes from debian-12-aml-a311d-cc.lan (192.168.12.173): icmp_seq=2 ttl=64 time=0.669 ms
64 bytes from debian-12-aml-a311d-cc.lan (192.168.12.173): icmp_seq=3 ttl=64 time=0.683 ms
64 bytes from debian-12-aml-a311d-cc.lan (192.168.12.173): icmp_seq=4 ttl=64 time=0.627 ms
64 bytes from debian-12-aml-a311d-cc.lan (192.168.12.173): icmp_seq=5 ttl=64 time=0.634 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=6 ttl=64 time=0.519 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=7 ttl=64 time=0.402 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=8 ttl=64 time=0.619 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=9 ttl=64 time=0.620 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=10 ttl=64 time=0.608 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=11 ttl=64 time=0.603 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=12 ttl=64 time=0.598 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=13 ttl=64 time=0.608 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=14 ttl=64 time=0.598 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=15 ttl=64 time=0.611 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=16 ttl=64 time=0.615 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=17 ttl=64 time=0.604 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=18 ttl=64 time=0.616 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=19 ttl=64 time=0.474 ms
64 bytes from debian-12-aml-a311d-cc (192.168.12.173): icmp_seq=20 ttl=64 time=0.611 ms

--- debian-12-aml-a311d-cc.lan ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 88ms
rtt min/avg/max/mdev = 0.402/0.602/0.736/0.068 ms, ipg/ewma 4.655/0.600 ms

There are a few important things that affect network latency and trade off against throughput: CPU frequency, packet offloading, DMA, and interrupt handling. But getting 0.75ms is typical. As you’ve stated, yours only gets 1.0ms.
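For what it’s worth, a couple of the knobs listed above (packet offloading in particular) can be inspected and adjusted from userspace with ethtool. A sketch, assuming the interface is named end0; whether disabling each offload actually helps latency is workload-dependent:

```shell
# Show the current state of a few offloads that batch packets (good for
# throughput, bad for latency), then turn them off. Requires root.
ethtool -k end0 | grep -E '^(generic-receive-offload|generic-segmentation-offload|tcp-segmentation-offload)'
ethtool -K end0 gro off gso off tso off
```

Settings made this way do not survive a reboot, so they need to be reapplied from a boot script or the interface configuration.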

There are a number of things that may affect ethernet latency. I have to disagree that 0.75 ms is normal, unless maybe the AML chips with stock config (in their Bookworm image) are that bad. Can’t recall when I last used one straight stock…

First, summarizing a few quick pings:

GHz to GHz (apu3s):
rtt min/avg/max/mdev = 0.169/0.189/0.219/0.019 ms

GHz to 100MHz (apu3 to Beagle Bone Black):
rtt min/avg/max/mdev = 0.328/0.385/0.572/0.063 ms

GHz to 100MHz (apu3 to frite):
rtt min/avg/max/mdev = 0.238/0.277/0.319/0.023 ms

These are all somewhat tweaked for NTP performance (so latency over throughput). For the frite network interface:
ethtool -C end0 rx-usecs 25 tx-usecs 1 tx-frames 0

The frite NIC/driver has some quirks: 25 is the smallest rx delay setting it will accept, and it reports the value as 24 afterwards.

After setting frite's rx-usecs to 384, it did get worse:
rtt min/avg/max/mdev = 0.625/0.650/0.730/0.031 ms

And I seem to recall that the default setting was weirdly large, but I'm not going to reboot just to check the number. So I guess this matters after all, at least for the 805/905 chips. tx-usecs didn't change pings, either incoming or outgoing...
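Since ethtool settings are lost on reboot, here is a sketch of how the coalescing settings above might be made persistent under Debian’s traditional ifupdown configuration (the stanza and addressing method here are hypothetical; only the ethtool line matches what I actually set):

```shell
# /etc/network/interfaces fragment (hypothetical example):
# reapply the low-latency coalescing settings whenever end0 comes up.
auto end0
iface end0 inet dhcp
    post-up ethtool -C end0 rx-usecs 25 tx-usecs 1 tx-frames 0
```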

EEE (Energy-Efficient Ethernet) can also delay things. IIRC the frite doesn’t accept setting it to off, but the switches involved here do have it disabled. Oh yeah, the frite rejects even --show-eee, oh well.
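On hardware whose driver does support it, checking and disabling EEE looks like this (interface name eth0 assumed; expect a brief link renegotiation when changing it):

```shell
# Query Energy-Efficient Ethernet status, then disable it so the PHY
# doesn't add wake-up delays to idle links. Requires root.
ethtool --show-eee eth0
ethtool --set-eee eth0 eee off
```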

What has probably had the biggest effect on NTP performance (after reducing the coalesce delays as much as possible; see the next posting for more) has been disabling the deeper idle state and, for the BBB and frite, choosing the performance (non-)scaling cpufreq governor. For the frite:

#!/bin/sh

# disable low power states to improve interrupt response time (for chrony, mostly)
for c in 0 1 2 3
do
	echo 1 >/sys/bus/cpu/devices/cpu$c/cpuidle/state1/disable
done

# there is only one cpufreq governor to bind them, one to run them...
echo performance >/sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor


exit 0

This doesn’t seem to change the power draw much if at all for the frite, at least when it’s lightly loaded.
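A quick way to double-check that the script above took effect, reading back the same sysfs paths it writes (a sketch; exact CPU count varies by board):

```shell
# Governor should read "performance"; each CPU's state1 disable flag
# should read 1 after the script has run.
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor
grep . /sys/bus/cpu/devices/cpu[0-3]/cpuidle/state1/disable
```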

With all this (and more that I’ve tried along the way), I find the frite (and the sweetp) make less good NTP servers than the old single-core BBB. The apu3 leaves both of them in the dust, but those have been EOL for a couple of years now. This is mostly in the context of using them with a GPS PPS source on a local ethernet. Talking to anything out there in the Interwebs is doing great if it’s only a few msec off, of course.

Just for the record, I am not a timenut: no rubidium or cesium references in the building, nope. :wink:

Well, I had reason to reboot the frite this morning, and with this discussion fresh in my mind I even remembered to comment out the settings in the interfaces config file (traditional Debian manual network config). And the stock settings are… clearly not concerned with low latency:

# ethtool -c end0
Coalesce parameters for end0:
   ...
rx-usecs: 246
rx-frames: 0
  ...
tx-usecs: 1000
tx-frames: 25

Ping from GHz to frite, then vice-versa:
rtt min/avg/max/mdev = 0.478/0.502/0.550/0.022 ms
rtt min/avg/max/mdev = 0.475/0.494/0.546/0.019 ms

And NTP offsets go from ~25us to ~135us, as expected for an increase in receive delay of ~220us (the calculated offset changes by half the added delay; it’s a long story). [Oops, I was thinking of this paper, with error analysis in sections 6 and later.] And that’s why the outdated Beagle Bone Black makes a better NTP stratum 1 reference than the AML boards.
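The “half the added delay” part can be sketched numerically: NTP estimates the offset as ((T2-T1)+(T3-T4))/2, so a delay added to only one direction (here the receive timestamp T2) shifts the estimate by half that amount. The timestamps below are made-up round numbers; the 220us matches the coalescing increase above:

```shell
#!/bin/sh
# NTP offset estimate = ((T2-T1) + (T3-T4)) / 2, values in microseconds.
# Adding D to the receive timestamp T2 shifts the estimate by D/2.
T1=0; T2=300; T3=400; T4=700; D=220
base=$(( ((T2 - T1) + (T3 - T4)) / 2 ))
skew=$(( ((T2 + D - T1) + (T3 - T4)) / 2 ))
echo "offset without added delay: ${base} us"   # 0 us
echo "offset with added rx delay: ${skew} us"   # 110 us
```

A 110us shift on top of a ~25us baseline lands right at the ~135us observed above.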

I’m willing to believe that similarly latency-hostile defaults on the a311d account for the other ~200us of ping latency, and could probably be mitigated using ethtool as I described previously.

Everything is a trade-off between stability, throughput, and latency. The Linux driver for the PHY and MAC sets the delay and data calibration to auto by default.

These must be manually calibrated, along with the interrupt handling by the CPU/Linux/IRQ/AHB priority subsystems. The comparison with a single-core BeagleBone is not apt, because multi-core systems running modern Linux are inherently more complex than low-complexity devices running simpler versions of Linux. Modern devices are optimized for throughput, not latency.

You must tune for your application and cannot expect the default configuration to fit all of your needs. This is why we provide all of the code in open-source format.

You can also use this datasheet for ideas, as the IP is very similar for the most part: https://ww1.microchip.com/downloads/aemDocuments/documents/OTH/ProductDocuments/DataSheets/LAN83C185-Data-Sheet-DS00002808A.pdf

And you wonder why I’m less than impressed with your documentation? A datasheet for IP that is “very similar for the most part”… and it’s for the PHY, which is the wrong part of the network hardware for the coalescing issue.

“Everything is a trade-off between stability, throughput, and latency” is like an AI’s description of, well, most anything: technically correct but hardly relevant in context. Oh, for the record, the BBB is running Debian Bookworm, not some archaic, simpler version of Linux.

Anyway, I’m satisfied that your Spuds are simply not the answer to my NTP plans (1), so I’m certainly not going to try to fix your device’s driver code, if it is indeed just the code limiting the range of the rx-coalescing setting for no good reason rather than a baked-in hardware problem. That shouldn’t interfere with using one as a networked print server, for example.

Let me be clear about the default setting: you have your opinion of it, but THAT WAS NEVER WHAT I USED EXCEPT FOR THE TESTING ABOVE. I don’t expect general-purpose systems to be properly set up out of the box; they never are.

(1) You can make a pretty decent local NTP server out of any of these little Spuds and a GPS with PPS. Certainly better than the FC-NTP-MINI that someone in China has been churning out for years now. But if there’s another good time source that doesn’t have the 25 (or 135 by default) usec offset, they can work together to confuse the NTP clients; that was what first led me to recognize the built-in offset error in these boards.

We don’t make the SoCs. You’re barking up the wrong tree.

But this is the tree that designed them, no?

Woof!