What should I do when hwloc reports "operating system" warnings?
When the operating system reports invalid locality
information (because of either software or hardware bugs), hwloc may fail to
insert some objects in the topology because they cannot fit in the already
built tree of resources. If so, hwloc will report a warning like the following.
The object causing this error is ignored, the discovery continues but the
resulting topology will miss some objects and may be asymmetric (see also What
happens if my topology is asymmetric?).
****************************************************************************
* hwloc has encountered what looks like an error from the operating system.
*
* L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology script.
****************************************************************************
These errors are common on large AMD platforms because of BIOS and/or Linux
kernel bugs causing invalid L3 cache information. In the above example, the
hardware reports an L3 cache that is shared by 2 cores in the first NUMA node
and 4 cores in the second NUMA node. That is wrong: it should actually be shared
by all 6 cores in a single NUMA node. The resulting topology will miss some L3
caches.
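The overlap described for the example warning can be checked with plain bit arithmetic on the two cpusets it prints. The following is only an illustrative Python sketch: the helper name `members` is made up here, and it assumes that each set bit stands for one core, as the explanation above does. The cpuset of the second NUMA node is not shown in the warning, so the sketch only lists the cores that fall outside the first node.

```python
# cpusets copied from the example warning message
l3_cpuset = 0x000003F0     # the L3 cache
numa0_cpuset = 0x0000003F  # NUMANode P#0


def members(mask):
    """Return the indexes of the bits that are set in a cpuset mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Cores of the L3 that are inside / outside the first NUMA node
inside = l3_cpuset & numa0_cpuset
outside = l3_cpuset & ~numa0_cpuset

# "Intersects without inclusion": the sets overlap, but neither contains the other
intersects = inside != 0
l3_in_numa = outside == 0
numa_in_l3 = (numa0_cpuset & ~l3_cpuset) == 0

print("L3 cores:             ", members(l3_cpuset))
print("inside NUMANode P#0:  ", members(inside))
print("outside NUMANode P#0: ", members(outside))
print("intersects without inclusion:",
      intersects and not l3_in_numa and not numa_in_l3)
```

Two of the six cores of the L3 fall inside NUMANode P#0 and four fall outside it, which is why the cache cannot be placed either inside or above that NUMA node in the tree.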
If your application does not care about cache sharing, or if you do not plan to
request cache-aware binding in your process launcher, you can likely ignore this
error (and hide it by setting HWLOC_HIDE_ERRORS=1 in your environment).
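When the hwloc-based program is started from a script rather than from an interactive shell, the variable can be set only for that child process. The following is a minimal Python sketch under that assumption; the function name `env_hiding_hwloc_errors` is hypothetical, and only the variable name HWLOC_HIDE_ERRORS and its value 1 come from the text above.

```python
import os


def env_hiding_hwloc_errors(base=None):
    """Return a copy of an environment with HWLOC_HIDE_ERRORS=1 added.

    The original mapping (os.environ by default) is left untouched.
    """
    env = dict(os.environ if base is None else base)
    env["HWLOC_HIDE_ERRORS"] = "1"
    return env


env = env_hiding_hwloc_errors()
print(env["HWLOC_HIDE_ERRORS"])
```

The resulting dictionary can then be passed as the `env` argument of `subprocess.run` when launching the application, so that the warning is hidden for that run only and the rest of the session still sees it.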
Some platforms report similar warnings about conflicting Packages and NUMANodes.
Upgrading the BIOS and/or the operating system may help. Otherwise, as
explained in the message, reporting this issue to the hwloc developers (by
sending the tarball that is generated by the hwloc-gather-topology script on
this platform) is a good way to determine whether this is a software (operating
system) bug or a hardware (BIOS, etc.) bug.