
x86/intel_rdt/cqm: Documentation for resctrl based RDT Monitoring

Add a description of resctrl based RDT(resource director technology)
monitoring extension and its usage.

[Tony: Added descriptions for how monitoring and allocation are measured
and some cleanups]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-3-git-send-email-vikas.shivappa@linux.intel.com
Vikas Shivappa 2017-07-25 14:14:21 -07:00 committed by Thomas Gleixner
parent c39a0e2c88
commit 1640ae9471
1 changed files with 278 additions and 38 deletions


@ -6,8 +6,8 @@ Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>
This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".
To use the feature mount the file system:
@ -17,6 +17,13 @@ mount options are:
"cdp": Enable code/data prioritization in L3 cache allocations.
RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.
The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.
Info directory
--------------
@ -24,7 +31,12 @@ Info directory
The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
Cache resource(L3/L2) subdirectory contains the following files:
Each subdirectory contains the following files with respect to
allocation:
Cache resource(L3/L2) subdirectory contains the following files
related to allocation:
"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
@ -36,7 +48,8 @@ Cache resource(L3/L2) subdirectory contains the following files:
"min_cbm_bits": The minimum number of consecutive bits which
must be set when writing a mask.
Memory bandwidth(MB) subdirectory contains the following files:
Memory bandwidth(MB) subdirectory contains the following files
with respect to allocation:
"min_bandwidth": The minimum memory bandwidth percentage which
user can request.
@ -52,48 +65,152 @@ Memory bandwitdh(MB) subdirectory contains the following files:
non-linear. This field is purely informational
only.
Resource groups
---------------
If RDT monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids": The number of RMIDs available. This is the
upper bound for how many "CTRL_MON" + "MON"
groups can be created.
"mon_features": Lists the monitoring events if
monitoring is enabled for the resource.
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
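As a rough illustration of how the "L3_MON" info files might be inspected, here is a minimal shell sketch. The function name show_l3_mon_info is hypothetical, the default mount point /sys/fs/resctrl is taken from the examples in this document, and the assumption that "mon_features" holds one event per line is not stated in the text above.

```shell
# Print the L3_MON info files, if RDT monitoring is available.
# Assumption: the file system is mounted at /sys/fs/resctrl unless
# another root is passed as the first argument.
show_l3_mon_info() (
    root=${1:-/sys/fs/resctrl}
    dir="$root/info/L3_MON"
    if [ ! -d "$dir" ]; then
        echo "no L3_MON directory: RDT monitoring not available" >&2
        return 1
    fi
    for f in num_rmids mon_features max_threshold_occupancy; do
        [ -r "$dir/$f" ] || continue
        # Join multi-line files (assumed: one event per line) onto one line.
        printf '%s: %s\n' "$f" "$(tr '\n' ' ' < "$dir/$f" | sed 's/ *$//')"
    done
)
```

Calling show_l3_mon_info with no argument reads from the default mount point. It returns non-zero when the "L3_MON" directory is absent, which is what a control-only system would be expected to show.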
Resource alloc and monitor groups
---------------------------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.
There are three files associated with each group:
On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.
"tasks": A list of tasks that belongs to this group. Tasks can be
added to a group by writing the task ID to the "tasks" file
(which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2)
and clone(2) are added to the same group as their parent.
If a pid is not in any sub partition, it is in root partition
(i.e. default partition).
On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.
"cpus": A bitmask of logical CPUs assigned to this group. Writing
a new mask can add/remove CPUs from this group. Added CPUs
are removed from their previous group. Removed ones are
given to the default (root) group. You cannot remove CPUs
from the default group.
Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.
"cpus_list": One or more CPU ranges of logical CPUs assigned to this
group. Same rules apply like for the "cpus" file.
All groups contain the following files:
"schemata": A list of all the resources available to this group.
Each resource has its own line and format - see below for
details.
"tasks":
Reading this file shows the list of all tasks that belong to
this group. Writing a task id to the file will add a task to the
group. If the group is a CTRL_MON group the task is removed from
whichever previous CTRL_MON group owned the task and also from
any MON group that owned the task. If the group is a MON group,
then the task must already belong to the CTRL_MON parent of this
group. The task is removed from any previous MON group.
When a task is running the following rules define which resources
are available to it:
"cpus":
Reading this file shows a bitmask of the logical CPUs owned by
this group. Writing a mask to this file will add and remove
CPUs to/from this group. As with the tasks file a hierarchy is
maintained where MON groups may only include CPUs owned by the
parent CTRL_MON group.
"cpus_list":
Just like "cpus", only using ranges of CPUs instead of bitmasks.
When control is enabled all CTRL_MON groups will also contain:
"schemata":
A list of all the resources available to this group.
Each resource has its own line and format - see below for details.
When monitoring is enabled all MON groups will also contain:
"mon_data":
This contains a set of files organized by L3 domain and by
RDT event. E.g. on a system with two L3 domains there will
be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
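The "mon_data" layout described above lends itself to a small helper that adds up one event over all L3 domains of a group. The following is only a sketch: group_event_total is a hypothetical name, and skipping non-numeric file contents is a defensive assumption rather than documented behavior.

```shell
# Sum one event (default: llc_occupancy) over every mon_L3_* domain
# directory found under <group>/mon_data.
group_event_total() (
    group=$1
    event=${2:-llc_occupancy}
    total=0
    for f in "$group"/mon_data/mon_L3_*/"$event"; do
        [ -r "$f" ] || continue
        value=$(cat "$f")
        # Assumption: skip anything that is not a plain decimal number.
        case $value in
            ''|*[!0-9]*) continue ;;
        esac
        total=$((total + value))
    done
    echo "$total"
)
```

For example, group_event_total /sys/fs/resctrl/p1 would print the combined LLC occupancy of group "p1" across all sockets, and passing mbm_total_bytes as the second argument selects a different event file.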
Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:
1) If the task is a member of a non-default group, then the schemata
for that group is used.
for that group is used.
2) Else if the task belongs to the default group, but is running on a
CPU that is assigned to some specific group, then the schemata for
the CPU's group is used.
CPU that is assigned to some specific group, then the schemata for the
CPU's group is used.
3) Otherwise the schemata for the default group is used.
Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
then RDT events for the task will be reported in that group.
2) If a task is a member of the default CTRL_MON group, but is running
on a CPU that is assigned to some specific group, then the RDT events
for the task will be reported in that group.
3) Otherwise RDT events for the task will be reported in the root level
"mon_data" group.
Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.
The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.
Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each resource group is mapped to these IDs based on the
kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware, so creation of a "CTRL_MON" directory may fail if we run out
of either CLOSIDs or RMIDs, and creation of a "MON" group may fail if we
run out of RMIDs.
max_threshold_occupancy - generic concepts
------------------------------------------
Note that an RMID, once freed, may not be immediately available for use
because cache lines of its previous user are still tagged with it.
Such RMIDs are therefore placed on a limbo list and checked again once
the cache occupancy has gone down. If at some point the system has many
limbo RMIDs that are not yet ready to be used, the user may see -EBUSY
during mkdir.
max_threshold_occupancy is a user-configurable value that determines the
occupancy at which an RMID can be freed.
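Because a failing mkdir may be transient while RMIDs sit on the limbo list, a script might retry group creation a few times before giving up. This is a sketch under stated assumptions: mkdir_retry is a hypothetical helper, and since mkdir(1) does not report errno to the shell, it retries on any failure, not only on -EBUSY.

```shell
# Create a group directory, retrying if mkdir fails.
# Usage: mkdir_retry <dir> [tries] [delay_seconds]
# Assumption: a failure may be a transient -EBUSY caused by limbo RMIDs.
# Other errors (e.g. the directory already exists) are retried as well.
mkdir_retry() (
    dir=$1
    tries=${2:-5}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$tries" ]; do
        if mkdir "$dir" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt will follow.
        [ "$i" -le "$tries" ] && sleep "$delay"
    done
    echo "could not create $dir after $tries attempts" >&2
    return 1
)
```

A call such as mkdir_retry /sys/fs/resctrl/mon_groups/m01 5 1 would try up to five times, one second apart. If limbo RMIDs are persistently unavailable, adjusting max_threshold_occupancy, as described above, is the documented tuning knob.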
Schemata files - general concepts
---------------------------------
@ -143,22 +260,22 @@ SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
L3 details (code and data prioritization disabled)
--------------------------------------------------
L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:
L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:
L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L2 details
----------
L2 schemata file details
------------------------
L2 cache does not support code and data prioritization, so the
schemata format is always:
@ -185,6 +302,8 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
Examples for RDT allocation usage:
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
@ -410,3 +529,124 @@ void main(void)
/* code to read and write directory contents */
resctrl_release_lock(fd);
}
Examples for RDT Monitoring along with allocation usage:
Reading monitored data
----------------------
Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.
Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks
The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").
Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.
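To relate the masks above to the "50%" figures, the share of a cache that a mask covers can be computed by counting its set bits. This is a minimal sketch: cbm_percent is a hypothetical helper, and the width of a full mask (four bits in this example) must be supplied by the caller.

```shell
# Print the percentage of cache covered by a capacity bit mask.
# Usage: cbm_percent <hex mask without 0x prefix> <bits in a full mask>
cbm_percent() (
    cbm=$((0x$1))
    width=$2
    bits=0
    # Count the set bits in the mask.
    while [ "$cbm" -gt 0 ]; do
        bits=$((bits + (cbm & 1)))
        cbm=$((cbm >> 1))
    done
    echo $((bits * 100 / width))
)
```

With the four-bit masks of this example, cbm_percent 3 4 and cbm_percent c 4 both print 50, while the default group's mask f gives 100. The result says nothing about which half of the cache is used, only how much of it.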
Create monitor groups and assign a subset of tasks to each monitor group.
# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks
Fetch data (data shown in bytes)
# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000
The parent CTRL_MON group shows the aggregated data.
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000
Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>
Fetch the data
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000
Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------
Assume a system like HSW (Haswell) has only CQM and no CAT support. In
this case resctrl will still mount, but CTRL_MON directories cannot be
created. The user can still create different MON groups within the root
group and so monitor all tasks, including kernel threads.
This can also be used to profile a job's cache footprint before
assigning it to an allocation group.
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02
# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain(socket) 0.
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789
Example 4 (Monitor real time tasks)
-----------------------------------
A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1
Move the cpus 4-7 over to p1
# echo f0 > p1/cpus
View the llc occupancy snapshot
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000
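The mask f0 written to the "cpus" file in Example 4 is the bitmask form of the CPU range 4-7. As a sketch of that conversion, the hypothetical helper below builds the hex mask from a contiguous range. It assumes CPU numbers below 32, so that the result fits in a single mask word.

```shell
# Print the hex bitmask for a contiguous range of logical CPUs.
# Usage: cpu_range_mask <first cpu> <last cpu>
# Assumption: last cpu < 32, so no multi-word mask formatting is needed.
cpu_range_mask() (
    first=$1
    last=$2
    count=$((last - first + 1))
    # Build a run of <count> one-bits, then shift it up to <first>.
    printf '%x\n' $(( ((1 << count) - 1) << first ))
)
```

Here cpu_range_mask 4 7 prints f0. Since the "cpus_list" file accepts ranges directly, writing 4-7 to p1/cpus_list should have the same effect without any conversion.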