1
0
Fork 0

Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache QoS (RDT/CAR) updates from Thomas Gleixner:
 "Add support for pseudo-locked cache regions.

  Cache Allocation Technology (CAT) allows on certain CPUs to isolate a
  region of cache and 'lock' it. Cache pseudo-locking builds on the fact
  that a CPU can still read and write data pre-allocated outside its
  current allocated area on cache hit. With cache pseudo-locking data
  can be preloaded into a reserved portion of cache that no application
  can fill, and from that point on will only serve cache hits. The cache
  pseudo-locked memory is made accessible to user space where an
  application can map it into its virtual address space and thus have a
  region of memory with reduced average read latency.

  The locking is not perfect and gets totally screwed by WBINDV and
  similar mechanisms, but it provides a reasonable enhancement for
  certain types of latency sensitive applications.

  The implementation extends the current CAT mechanism and provides a
  generally useful exclusive CAT mode on which it builds the extra
  pseude-locked regions"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  x86/intel_rdt: Disable PMU access
  x86/intel_rdt: Fix possible circular lock dependency
  x86/intel_rdt: Make CPU information accessible for pseudo-locked regions
  x86/intel_rdt: Support restoration of subset of permissions
  x86/intel_rdt: Fix cleanup of plr structure on error
  x86/intel_rdt: Move pseudo_lock_region_clear()
  x86/intel_rdt: Limit C-states dynamically when pseudo-locking active
  x86/intel_rdt: Support L3 cache performance event of Broadwell
  x86/intel_rdt: More precise L2 hit/miss measurements
  x86/intel_rdt: Create character device exposing pseudo-locked region
  x86/intel_rdt: Create debugfs files for pseudo-locking testing
  x86/intel_rdt: Create resctrl debug area
  x86/intel_rdt: Ensure RDT cleanup on exit
  x86/intel_rdt: Resctrl files reflect pseudo-locked information
  x86/intel_rdt: Support creation/removal of pseudo-locked region
  x86/intel_rdt: Pseudo-lock region creation/removal core
  x86/intel_rdt: Discover supported platforms via prefetch disable bits
  x86/intel_rdt: Add utilities to test pseudo-locked region possibility
  x86/intel_rdt: Split resource group removal in two
  x86/intel_rdt: Enable entering of pseudo-locksetup mode
  ...
hifive-unleashed-5.1
Linus Torvalds 2018-08-13 16:01:46 -07:00
commit 30de24c7dd
8 changed files with 2965 additions and 75 deletions

View File

@ -29,7 +29,11 @@ mount options are:
L2 and L3 CDP are controlled seperately.
RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".
The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
@ -65,6 +69,29 @@ related to allocation:
some platforms support devices that have their
own settings for cache use which can over-ride
these bits.
"bit_usage": Annotated capacity bitmasks showing how all
instances of the resource are used. The legend is:
"0" - Corresponding region is unused. When the system's
resources have been allocated and a "0" is found
in "bit_usage" it is a sign that resources are
wasted.
"H" - Corresponding region is used by hardware only
but available for software use. If a resource
has bits set in "shareable_bits" but not all
of these bits appear in the resource groups'
schematas then the bits appearing in
"shareable_bits" but no resource group will
be marked as "H".
"X" - Corresponding region is available for sharing and
used by hardware and software. These are the
bits that appear in "shareable_bits" as
well as a resource group's allocation.
"S" - Corresponding region is used by software
and available for sharing.
"E" - Corresponding region is used exclusively by
one resource group. No sharing allowed.
"P" - Corresponding region is pseudo-locked. No
sharing allowed.
Memory bandwitdh(MB) subdirectory contains the following files
with respect to allocation:
@ -151,6 +178,9 @@ All groups contain the following files:
CPUs to/from this group. As with the tasks file a hierarchy is
maintained where MON groups may only include CPUs owned by the
parent CTRL_MON group.
When the resouce group is in pseudo-locked mode this file will
only be readable, reflecting the CPUs associated with the
pseudo-locked region.
"cpus_list":
@ -163,6 +193,21 @@ When control is enabled all CTRL_MON groups will also contain:
A list of all the resources available to this group.
Each resource has its own line and format - see below for details.
"size":
Mirrors the display of the "schemata" file to display the size in
bytes of each allocation instead of the bits representing the
allocation.
"mode":
The "mode" of the resource group dictates the sharing of its
allocations. A "shareable" resource group allows sharing of its
allocations while an "exclusive" resource group does not. A
cache pseudo-locked region is created by first writing
"pseudo-locksetup" to the "mode" file before writing the cache
pseudo-locked region's schemata to the resource group's "schemata"
file. On successful pseudo-locked region creation the mode will
automatically change to "pseudo-locked".
When monitoring is enabled all MON groups will also contain:
"mon_data":
@ -379,6 +424,170 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
Cache Pseudo-Locking
--------------------
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.
The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:
- Create a CAT allocation CLOSNEW with a CBM matching the schemata
from the user of the cache region that will contain the pseudo-locked
memory. This region must not overlap with any current CAT allocation/CLOS
on the system and no future overlap with this cache region is allowed
while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
pseudo-locked region is protected as long as its CBM does not appear in
any CAT allocation. Even though the cache pseudo-locked region will from
this point on not appear in any CBM of any CLOS an application running with
any CLOS will be able to access the memory in the pseudo-locked region since
the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
user-space as a character device.
Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
“locked” data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.
It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application self needs to ensure it remains affine to the correct cores.
Pseudo-locking is accomplished in two stages:
1) During the first stage the system administrator allocates a portion
of cache that should be dedicated to pseudo-locking. At this time an
equivalent portion of memory is allocated, loaded into allocated
cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
pseudo-locked memory into its address space.
Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:
1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
"pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
bits within the schemata should be "unused" according to the "bit_usage"
file.
On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.
An example of cache pseudo-locked region creation and usage can be found below.
Cache Pseudo-Locking Debugging Interface
---------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:
1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
from these measurements are best visualized using a hist trigger (see
example below). In this test the pseudo-locked region is traversed at
a stride of 32 bytes while hardware prefetchers and preemption
are disabled. This also provides a substitute visualization of cache
hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
available. Depending on the levels of cache on the system the pseudo_lock_l2
and pseudo_lock_l3 tracepoints are available.
WARNING: triggering this measurement uses from two (for just L2
measurements) to four (for L2 and L3 measurements) precision counters on
the system, if any other measurements are in progress the counters and
their corresponding event registers will be clobbered.
When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement on the pseudo-locked region depends on the number, 1 or 2,
written to this debugfs file. Since the measurements are recorded with the
tracing infrastructure the relevant tracepoints need to be enabled before the
measurement is triggered.
Example of latency debugging interface:
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set:
# :> /sys/kernel/debug/tracing/trace
# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist
# event histogram
#
# trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
#
{ latency: 456 } hitcount: 1
{ latency: 50 } hitcount: 83
{ latency: 36 } hitcount: 96
{ latency: 44 } hitcount: 174
{ latency: 48 } hitcount: 195
{ latency: 46 } hitcount: 262
{ latency: 42 } hitcount: 693
{ latency: 40 } hitcount: 3204
{ latency: 38 } hitcount: 3484
Totals:
Hits: 8192
Entries: 9
Dropped: 0
Example of cache hits/misses debugging:
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
# :> /sys/kernel/debug/tracing/trace
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
Examples for RDT allocation usage:
Example 1
@ -502,7 +711,172 @@ siblings and only the real time threads are scheduled on the cores 4-7.
# echo F0 > p0/cpus
4) Locking between applications
Example 4
---------
The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
to overlap with that allocation.
In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl
First, we observe that the default group is configured to allocate to all L2
cache:
# cat schemata
L2:0=ff;1=ff
We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group:
# mkdir p0
# echo 'L2:0=0x3;1=0x3' > p0/schemata
# cat p0/mode
shareable
# echo exclusive > p0/mode
-sh: echo: write error: Invalid argument
# cat info/last_cmd_status
schemata overlaps
To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
# echo 'L2:0=0xfc;1=0xfc' > schemata
# echo exclusive > p0/mode
# grep . p0/*
p0/cpus:0
p0/mode:exclusive
p0/schemata:L2:0=03;1=03
p0/size:L2:0=262144;1=262144
A new resource group will on creation not overlap with an exclusive resource
group:
# mkdir p1
# grep . p1/*
p1/cpus:0
p1/mode:shareable
p1/schemata:L2:0=fc;1=fc
p1/size:L2:0=786432;1=786432
The bit_usage will reflect how the cache is used:
# cat info/L2/bit_usage
0=SSSSSSEE;1=SSSSSSEE
A resource group cannot be forced to overlap with an exclusive resource group:
# echo 'L2:0=0x1;1=0x1' > p1/schemata
-sh: echo: write error: Invalid argument
# cat info/last_cmd_status
overlaps with exclusive group
Example of Cache Pseudo-Locking
-------------------------------
Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
region is exposed at /dev/pseudo_lock/newlock that can be provided to
application for argument to mmap().
# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl
Ensure that there are bits available that can be pseudo-locked, since only
unused bits can be pseudo-locked the bits to be pseudo-locked needs to be
removed from the default resource group's schemata:
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSSSS
# echo 'L2:1=0xfc' > schemata
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSS00
Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask:
# mkdir newlock
# echo pseudo-locksetup > newlock/mode
# echo 'L2:1=0x3' > newlock/schemata
On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist:
# cat newlock/mode
pseudo-locked
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSSPP
# ls -l /dev/pseudo_lock/newlock
crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
/*
* Example code to access one page of pseudo-locked cache region
* from user space.
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
/*
* It is required that the application runs with affinity to only
* cores associated with the pseudo-locked region. Here the cpu
* is hardcoded for convenience of example.
*/
static int cpuid = 2;
int main(int argc, char *argv[])
{
cpu_set_t cpuset;
long page_size;
void *mapping;
int dev_fd;
int ret;
page_size = sysconf(_SC_PAGESIZE);
CPU_ZERO(&cpuset);
CPU_SET(cpuid, &cpuset);
ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
if (ret < 0) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
if (dev_fd < 0) {
perror("open");
exit(EXIT_FAILURE);
}
mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
dev_fd, 0);
if (mapping == MAP_FAILED) {
perror("mmap");
close(dev_fd);
exit(EXIT_FAILURE);
}
/* Application interacts with pseudo-locked memory @mapping */
ret = munmap(mapping, page_size);
if (ret < 0) {
perror("munmap");
close(dev_fd);
exit(EXIT_FAILURE);
}
close(dev_fd);
exit(EXIT_SUCCESS);
}
Locking between applications
----------------------------
Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.
@ -510,7 +884,7 @@ to/from multiple files, must be atomic.
As an example, the allocation of an exclusive reservation of L3 cache
involves:
1. Read the cbmmasks from each directory
1. Read the cbmmasks from each directory or the per-resource "bit_usage"
2. Find a contiguous set of bits in the global CBM bitmask that is clear
in any of the directory cbmmasks
3. Create a new directory

View File

@ -35,7 +35,9 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o
obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o intel_rdt_ctrlmondata.o
obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o
obj-$(CONFIG_INTEL_RDT) += intel_rdt_ctrlmondata.o intel_rdt_pseudo_lock.o
CFLAGS_intel_rdt_pseudo_lock.o = -I$(src)
obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/

View File

@ -859,6 +859,8 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}
static enum cpuhp_state rdt_online;
static int __init intel_rdt_late_init(void)
{
struct rdt_resource *r;
@ -880,6 +882,7 @@ static int __init intel_rdt_late_init(void)
cpuhp_remove_state(state);
return ret;
}
rdt_online = state;
for_each_alloc_capable_rdt_resource(r)
pr_info("Intel RDT %s allocation detected\n", r->name);
@ -891,3 +894,11 @@ static int __init intel_rdt_late_init(void)
}
late_initcall(intel_rdt_late_init);
static void __exit intel_rdt_exit(void)
{
cpuhp_remove_state(rdt_online);
rdtgroup_exit();
}
__exitcall(intel_rdt_exit);

View File

@ -80,6 +80,34 @@ enum rdt_group_type {
RDT_NUM_GROUP,
};
/**
* enum rdtgrp_mode - Mode of a RDT resource group
* @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations
* @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed
* @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking
* @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations
* allowed AND the allocations are Cache Pseudo-Locked
*
* The mode of a resource group enables control over the allowed overlap
* between allocations associated with different resource groups (classes
* of service). User is able to modify the mode of a resource group by
* writing to the "mode" resctrl file associated with the resource group.
*
* The "shareable", "exclusive", and "pseudo-locksetup" modes are set by
* writing the appropriate text to the "mode" file. A resource group enters
* "pseudo-locked" mode after the schemata is written while the resource
* group is in "pseudo-locksetup" mode.
*/
enum rdtgrp_mode {
RDT_MODE_SHAREABLE = 0,
RDT_MODE_EXCLUSIVE,
RDT_MODE_PSEUDO_LOCKSETUP,
RDT_MODE_PSEUDO_LOCKED,
/* Must be last */
RDT_NUM_MODES,
};
/**
* struct mongroup - store mon group's data in resctrl fs.
* @mon_data_kn kernlfs node for the mon_data directory
@ -94,6 +122,43 @@ struct mongroup {
u32 rmid;
};
/**
* struct pseudo_lock_region - pseudo-lock region information
* @r: RDT resource to which this pseudo-locked region
* belongs
* @d: RDT domain to which this pseudo-locked region
* belongs
* @cbm: bitmask of the pseudo-locked region
* @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread
* completion
* @thread_done: variable used by waitqueue to test if pseudo-locking
* thread completed
* @cpu: core associated with the cache on which the setup code
* will be run
* @line_size: size of the cache lines
* @size: size of pseudo-locked region in bytes
* @kmem: the kernel memory associated with pseudo-locked region
* @minor: minor number of character device associated with this
* region
* @debugfs_dir: pointer to this region's directory in the debugfs
* filesystem
* @pm_reqs: Power management QoS requests related to this region
*/
struct pseudo_lock_region {
struct rdt_resource *r;
struct rdt_domain *d;
u32 cbm;
wait_queue_head_t lock_thread_wq;
int thread_done;
int cpu;
unsigned int line_size;
unsigned int size;
void *kmem;
unsigned int minor;
struct dentry *debugfs_dir;
struct list_head pm_reqs;
};
/**
* struct rdtgroup - store rdtgroup's data in resctrl file system.
* @kn: kernfs node
@ -106,16 +171,20 @@ struct mongroup {
* @type: indicates type of this rdtgroup - either
* monitor only or ctrl_mon group
* @mon: mongroup related data
* @mode: mode of resource group
* @plr: pseudo-locked region
*/
struct rdtgroup {
struct kernfs_node *kn;
struct list_head rdtgroup_list;
u32 closid;
struct cpumask cpu_mask;
int flags;
atomic_t waitcount;
enum rdt_group_type type;
struct mongroup mon;
struct kernfs_node *kn;
struct list_head rdtgroup_list;
u32 closid;
struct cpumask cpu_mask;
int flags;
atomic_t waitcount;
enum rdt_group_type type;
struct mongroup mon;
enum rdtgrp_mode mode;
struct pseudo_lock_region *plr;
};
/* rdtgroup.flags */
@ -148,6 +217,7 @@ extern struct list_head rdt_all_groups;
extern int max_name_width, max_data_width;
int __init rdtgroup_init(void);
void __exit rdtgroup_exit(void);
/**
* struct rftype - describe each file in the resctrl file system
@ -216,22 +286,24 @@ struct mbm_state {
* @mbps_val: When mba_sc is enabled, this holds the bandwidth in MBps
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
* @plr: pseudo-locked region (if any) associated with domain
*/
struct rdt_domain {
struct list_head list;
int id;
struct cpumask cpu_mask;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
struct delayed_work mbm_over;
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
u32 *ctrl_val;
u32 *mbps_val;
u32 new_ctrl;
bool have_new_ctrl;
struct list_head list;
int id;
struct cpumask cpu_mask;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
struct delayed_work mbm_over;
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
u32 *ctrl_val;
u32 *mbps_val;
u32 new_ctrl;
bool have_new_ctrl;
struct pseudo_lock_region *plr;
};
/**
@ -351,7 +423,7 @@ struct rdt_resource {
struct rdt_cache cache;
struct rdt_membw membw;
const char *format_str;
int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
int (*parse_ctrlval) (void *data, struct rdt_resource *r,
struct rdt_domain *d);
struct list_head evt_list;
int num_rmid;
@ -359,8 +431,8 @@ struct rdt_resource {
unsigned long fflags;
};
int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d);
int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d);
int parse_cbm(void *_data, struct rdt_resource *r, struct rdt_domain *d);
int parse_bw(void *_buf, struct rdt_resource *r, struct rdt_domain *d);
extern struct mutex rdtgroup_mutex;
@ -368,7 +440,7 @@ extern struct rdt_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
int __init rdtgroup_init(void);
extern struct dentry *debugfs_resctrl;
enum {
RDT_RESOURCE_L3,
@ -439,13 +511,32 @@ void rdt_last_cmd_printf(const char *fmt, ...);
void rdt_ctrl_update(void *arg);
struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
bool rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
u32 _cbm, int closid, bool exclusive);
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
u32 cbm);
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
int rdtgroup_tasks_assigned(struct rdtgroup *r);
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, u32 _cbm);
bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int update_domains(struct rdt_resource *r, int closid);
void closid_free(int closid);
int alloc_rmid(void);
void free_rmid(u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);

View File

@ -64,9 +64,10 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
return true;
}
int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d)
int parse_bw(void *_buf, struct rdt_resource *r, struct rdt_domain *d)
{
unsigned long data;
char *buf = _buf;
if (d->have_new_ctrl) {
rdt_last_cmd_printf("duplicate domain %d\n", d->id);
@ -87,7 +88,7 @@ int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d)
* are allowed (e.g. FFFFH, 0FF0H, 003CH, etc.).
* Additionally Haswell requires at least two bits set.
*/
static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
{
unsigned long first_bit, zero_bit, val;
unsigned int cbm_len = r->cache.cbm_len;
@ -122,22 +123,64 @@ static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
return true;
}
struct rdt_cbm_parse_data {
struct rdtgroup *rdtgrp;
char *buf;
};
/*
* Read one cache bit mask (hex). Check that it is valid for the current
* resource type.
*/
int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
int parse_cbm(void *_data, struct rdt_resource *r, struct rdt_domain *d)
{
unsigned long data;
struct rdt_cbm_parse_data *data = _data;
struct rdtgroup *rdtgrp = data->rdtgrp;
u32 cbm_val;
if (d->have_new_ctrl) {
rdt_last_cmd_printf("duplicate domain %d\n", d->id);
return -EINVAL;
}
if(!cbm_validate(buf, &data, r))
/*
* Cannot set up more than one pseudo-locked region in a cache
* hierarchy.
*/
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
rdtgroup_pseudo_locked_in_hierarchy(d)) {
rdt_last_cmd_printf("pseudo-locked region in hierarchy\n");
return -EINVAL;
d->new_ctrl = data;
}
if (!cbm_validate(data->buf, &cbm_val, r))
return -EINVAL;
if ((rdtgrp->mode == RDT_MODE_EXCLUSIVE ||
rdtgrp->mode == RDT_MODE_SHAREABLE) &&
rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) {
rdt_last_cmd_printf("CBM overlaps with pseudo-locked region\n");
return -EINVAL;
}
/*
* The CBM may not overlap with the CBM of another closid if
* either is exclusive.
*/
if (rdtgroup_cbm_overlaps(r, d, cbm_val, rdtgrp->closid, true)) {
rdt_last_cmd_printf("overlaps with exclusive group\n");
return -EINVAL;
}
if (rdtgroup_cbm_overlaps(r, d, cbm_val, rdtgrp->closid, false)) {
if (rdtgrp->mode == RDT_MODE_EXCLUSIVE ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
rdt_last_cmd_printf("overlaps with other group\n");
return -EINVAL;
}
}
d->new_ctrl = cbm_val;
d->have_new_ctrl = true;
return 0;
@ -149,8 +192,10 @@ int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
* separated by ";". The "id" is in decimal, and must match one of
* the "id"s for this resource.
*/
static int parse_line(char *line, struct rdt_resource *r)
static int parse_line(char *line, struct rdt_resource *r,
struct rdtgroup *rdtgrp)
{
struct rdt_cbm_parse_data data;
char *dom = NULL, *id;
struct rdt_domain *d;
unsigned long dom_id;
@ -167,15 +212,32 @@ next:
dom = strim(dom);
list_for_each_entry(d, &r->domains, list) {
if (d->id == dom_id) {
if (r->parse_ctrlval(dom, r, d))
data.buf = dom;
data.rdtgrp = rdtgrp;
if (r->parse_ctrlval(&data, r, d))
return -EINVAL;
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
/*
* In pseudo-locking setup mode and just
* parsed a valid CBM that should be
* pseudo-locked. Only one locked region per
* resource group and domain so just do
* the required initialization for single
* region and return.
*/
rdtgrp->plr->r = r;
rdtgrp->plr->d = d;
rdtgrp->plr->cbm = d->new_ctrl;
d->plr = rdtgrp->plr;
return 0;
}
goto next;
}
}
return -EINVAL;
}
static int update_domains(struct rdt_resource *r, int closid)
int update_domains(struct rdt_resource *r, int closid)
{
struct msr_param msr_param;
cpumask_var_t cpu_mask;
@ -220,13 +282,14 @@ done:
return 0;
}
static int rdtgroup_parse_resource(char *resname, char *tok, int closid)
static int rdtgroup_parse_resource(char *resname, char *tok,
struct rdtgroup *rdtgrp)
{
struct rdt_resource *r;
for_each_alloc_enabled_rdt_resource(r) {
if (!strcmp(resname, r->name) && closid < r->num_closid)
return parse_line(tok, r);
if (!strcmp(resname, r->name) && rdtgrp->closid < r->num_closid)
return parse_line(tok, r, rdtgrp);
}
rdt_last_cmd_printf("unknown/unsupported resource name '%s'\n", resname);
return -EINVAL;
@ -239,7 +302,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
struct rdt_domain *dom;
struct rdt_resource *r;
char *tok, *resname;
int closid, ret = 0;
int ret = 0;
/* Valid input requires a trailing newline */
if (nbytes == 0 || buf[nbytes - 1] != '\n')
@ -253,7 +316,15 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
}
rdt_last_cmd_clear();
closid = rdtgrp->closid;
/*
* No changes to pseudo-locked region allowed. It has to be removed
* and re-created instead.
*/
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
ret = -EINVAL;
rdt_last_cmd_puts("resource group is pseudo-locked\n");
goto out;
}
for_each_alloc_enabled_rdt_resource(r) {
list_for_each_entry(dom, &r->domains, list)
@ -272,17 +343,27 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
ret = -EINVAL;
goto out;
}
ret = rdtgroup_parse_resource(resname, tok, closid);
ret = rdtgroup_parse_resource(resname, tok, rdtgrp);
if (ret)
goto out;
}
for_each_alloc_enabled_rdt_resource(r) {
ret = update_domains(r, closid);
ret = update_domains(r, rdtgrp->closid);
if (ret)
goto out;
}
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
/*
* If pseudo-locking fails we keep the resource group in
* mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service
* active and updated for just the domain the pseudo-locked
* region was requested for.
*/
ret = rdtgroup_pseudo_lock_create(rdtgrp);
}
out:
rdtgroup_kn_unlock(of->kn);
return ret ?: nbytes;
@ -318,10 +399,18 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (rdtgrp) {
closid = rdtgrp->closid;
for_each_alloc_enabled_rdt_resource(r) {
if (closid < r->num_closid)
show_doms(s, r, closid);
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
for_each_alloc_enabled_rdt_resource(r)
seq_printf(s, "%s:uninitialized\n", r->name);
} else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
seq_printf(s, "%s:%d=%x\n", rdtgrp->plr->r->name,
rdtgrp->plr->d->id, rdtgrp->plr->cbm);
} else {
closid = rdtgrp->closid;
for_each_alloc_enabled_rdt_resource(r) {
if (closid < r->num_closid)
show_doms(s, r, closid);
}
}
} else {
ret = -ENOENT;

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,43 @@
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM resctrl
#if !defined(_TRACE_PSEUDO_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_PSEUDO_LOCK_H
#include <linux/tracepoint.h>
TRACE_EVENT(pseudo_lock_mem_latency,
TP_PROTO(u32 latency),
TP_ARGS(latency),
TP_STRUCT__entry(__field(u32, latency)),
TP_fast_assign(__entry->latency = latency),
TP_printk("latency=%u", __entry->latency)
);
TRACE_EVENT(pseudo_lock_l2,
TP_PROTO(u64 l2_hits, u64 l2_miss),
TP_ARGS(l2_hits, l2_miss),
TP_STRUCT__entry(__field(u64, l2_hits)
__field(u64, l2_miss)),
TP_fast_assign(__entry->l2_hits = l2_hits;
__entry->l2_miss = l2_miss;),
TP_printk("hits=%llu miss=%llu",
__entry->l2_hits, __entry->l2_miss));
TRACE_EVENT(pseudo_lock_l3,
TP_PROTO(u64 l3_hits, u64 l3_miss),
TP_ARGS(l3_hits, l3_miss),
TP_STRUCT__entry(__field(u64, l3_hits)
__field(u64, l3_miss)),
TP_fast_assign(__entry->l3_hits = l3_hits;
__entry->l3_miss = l3_miss;),
TP_printk("hits=%llu miss=%llu",
__entry->l3_hits, __entry->l3_miss));
#endif /* _TRACE_PSEUDO_LOCK_H */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE intel_rdt_pseudo_lock_event
#include <trace/define_trace.h>

View File

@ -20,7 +20,9 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/cacheinfo.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/sysfs.h>
#include <linux/kernfs.h>
@ -55,6 +57,8 @@ static struct kernfs_node *kn_mondata;
static struct seq_buf last_cmd_status;
static char last_cmd_status_buf[512];
struct dentry *debugfs_resctrl;
void rdt_last_cmd_clear(void)
{
lockdep_assert_held(&rdtgroup_mutex);
@ -121,11 +125,65 @@ static int closid_alloc(void)
return closid;
}
static void closid_free(int closid)
void closid_free(int closid)
{
closid_free_map |= 1 << closid;
}
/**
* closid_allocated - test if provided closid is in use
* @closid: closid to be tested
*
* Return: true if @closid is currently associated with a resource group,
* false if @closid is free
*/
static bool closid_allocated(unsigned int closid)
{
return (closid_free_map & (1 << closid)) == 0;
}
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
*
* Each resource group is associated with a @closid. Here the mode
* of a resource group can be queried by searching for it using its closid.
*
* Return: mode as &enum rdtgrp_mode of resource group with closid @closid
*/
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid)
{
struct rdtgroup *rdtgrp;
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (rdtgrp->closid == closid)
return rdtgrp->mode;
}
return RDT_NUM_MODES;
}
static const char * const rdt_mode_str[] = {
[RDT_MODE_SHAREABLE] = "shareable",
[RDT_MODE_EXCLUSIVE] = "exclusive",
[RDT_MODE_PSEUDO_LOCKSETUP] = "pseudo-locksetup",
[RDT_MODE_PSEUDO_LOCKED] = "pseudo-locked",
};
/**
* rdtgroup_mode_str - Return the string representation of mode
* @mode: the resource group mode as &enum rdtgroup_mode
*
* Return: string representation of valid mode, "unknown" otherwise
*/
static const char *rdtgroup_mode_str(enum rdtgrp_mode mode)
{
if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES)
return "unknown";
return rdt_mode_str[mode];
}
/* set uid and gid of rdtgroup dirs and files to that of the creator */
static int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
{
@ -207,8 +265,12 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (rdtgrp) {
seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n",
cpumask_pr_args(&rdtgrp->cpu_mask));
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n",
cpumask_pr_args(&rdtgrp->plr->d->cpu_mask));
else
seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n",
cpumask_pr_args(&rdtgrp->cpu_mask));
} else {
ret = -ENOENT;
}
@ -394,6 +456,13 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
goto unlock;
}
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
ret = -EINVAL;
rdt_last_cmd_puts("pseudo-locking in progress\n");
goto unlock;
}
if (is_cpu_list(of))
ret = cpulist_parse(buf, newmask);
else
@ -509,6 +578,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
return ret;
}
/**
* rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group
* @r: Resource group
*
* Return: 1 if tasks have been assigned to @r, 0 otherwise
*/
int rdtgroup_tasks_assigned(struct rdtgroup *r)
{
struct task_struct *p, *t;
int ret = 0;
lockdep_assert_held(&rdtgroup_mutex);
rcu_read_lock();
for_each_process_thread(p, t) {
if ((r->type == RDTCTRL_GROUP && t->closid == r->closid) ||
(r->type == RDTMON_GROUP && t->rmid == r->mon.rmid)) {
ret = 1;
break;
}
}
rcu_read_unlock();
return ret;
}
static int rdtgroup_task_write_permission(struct task_struct *task,
struct kernfs_open_file *of)
{
@ -570,13 +665,22 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
return -EINVAL;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
return -ENOENT;
}
rdt_last_cmd_clear();
if (rdtgrp)
ret = rdtgroup_move_task(pid, rdtgrp, of);
else
ret = -ENOENT;
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
ret = -EINVAL;
rdt_last_cmd_puts("pseudo-locking in progress\n");
goto unlock;
}
ret = rdtgroup_move_task(pid, rdtgrp, of);
unlock:
rdtgroup_kn_unlock(of->kn);
return ret ?: nbytes;
@ -662,6 +766,94 @@ static int rdt_shareable_bits_show(struct kernfs_open_file *of,
return 0;
}
/**
* rdt_bit_usage_show - Display current usage of resources
*
* A domain is a shared resource that can now be allocated differently. Here
* we display the current regions of the domain as an annotated bitmask.
* For each domain of this resource its allocation bitmask
* is annotated as below to indicate the current usage of the corresponding bit:
* 0 - currently unused
* X - currently available for sharing and used by software and hardware
* H - currently used by hardware only but available for software use
* S - currently used and shareable by software only
* E - currently used exclusively by one resource group
* P - currently pseudo-locked by one resource group
*/
static int rdt_bit_usage_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct rdt_resource *r = of->kn->parent->priv;
u32 sw_shareable = 0, hw_shareable = 0;
u32 exclusive = 0, pseudo_locked = 0;
struct rdt_domain *dom;
int i, hwb, swb, excl, psl;
enum rdtgrp_mode mode;
bool sep = false;
u32 *ctrl;
mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
list_for_each_entry(dom, &r->domains, list) {
if (sep)
seq_putc(seq, ';');
ctrl = dom->ctrl_val;
sw_shareable = 0;
exclusive = 0;
seq_printf(seq, "%d=", dom->id);
for (i = 0; i < r->num_closid; i++, ctrl++) {
if (!closid_allocated(i))
continue;
mode = rdtgroup_mode_by_closid(i);
switch (mode) {
case RDT_MODE_SHAREABLE:
sw_shareable |= *ctrl;
break;
case RDT_MODE_EXCLUSIVE:
exclusive |= *ctrl;
break;
case RDT_MODE_PSEUDO_LOCKSETUP:
/*
* RDT_MODE_PSEUDO_LOCKSETUP is possible
* here but not included since the CBM
* associated with this CLOSID in this mode
* is not initialized and no task or cpu can be
* assigned this CLOSID.
*/
break;
case RDT_MODE_PSEUDO_LOCKED:
case RDT_NUM_MODES:
WARN(1,
"invalid mode for closid %d\n", i);
break;
}
}
for (i = r->cache.cbm_len - 1; i >= 0; i--) {
pseudo_locked = dom->plr ? dom->plr->cbm : 0;
hwb = test_bit(i, (unsigned long *)&hw_shareable);
swb = test_bit(i, (unsigned long *)&sw_shareable);
excl = test_bit(i, (unsigned long *)&exclusive);
psl = test_bit(i, (unsigned long *)&pseudo_locked);
if (hwb && swb)
seq_putc(seq, 'X');
else if (hwb && !swb)
seq_putc(seq, 'H');
else if (!hwb && swb)
seq_putc(seq, 'S');
else if (excl)
seq_putc(seq, 'E');
else if (psl)
seq_putc(seq, 'P');
else /* Unused bits remain */
seq_putc(seq, '0');
}
sep = true;
}
seq_putc(seq, '\n');
mutex_unlock(&rdtgroup_mutex);
return 0;
}
static int rdt_min_bw_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@ -740,6 +932,269 @@ static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
return nbytes;
}
/*
* rdtgroup_mode_show - Display mode of this resource group
*/
static int rdtgroup_mode_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdtgroup *rdtgrp;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
return -ENOENT;
}
seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode));
rdtgroup_kn_unlock(of->kn);
return 0;
}
/**
* rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other
* @r: Resource to which domain instance @d belongs.
* @d: The domain instance for which @closid is being tested.
* @cbm: Capacity bitmask being tested.
* @closid: Intended closid for @cbm.
* @exclusive: Only check if overlaps with exclusive resource groups
*
* Checks if provided @cbm intended to be used for @closid on domain
* @d overlaps with any other closids or other hardware usage associated
* with this domain. If @exclusive is true then only overlaps with
* resource groups in exclusive mode will be considered. If @exclusive
* is false then overlaps with any resource group or hardware entities
* will be considered.
*
* Return: false if CBM does not overlap, true if it does.
*/
bool rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
u32 _cbm, int closid, bool exclusive)
{
unsigned long *cbm = (unsigned long *)&_cbm;
unsigned long *ctrl_b;
enum rdtgrp_mode mode;
u32 *ctrl;
int i;
/* Check for any overlap with regions used by hardware directly */
if (!exclusive) {
if (bitmap_intersects(cbm,
(unsigned long *)&r->cache.shareable_bits,
r->cache.cbm_len))
return true;
}
/* Check for overlap with other resource groups */
ctrl = d->ctrl_val;
for (i = 0; i < r->num_closid; i++, ctrl++) {
ctrl_b = (unsigned long *)ctrl;
mode = rdtgroup_mode_by_closid(i);
if (closid_allocated(i) && i != closid &&
mode != RDT_MODE_PSEUDO_LOCKSETUP) {
if (bitmap_intersects(cbm, ctrl_b, r->cache.cbm_len)) {
if (exclusive) {
if (mode == RDT_MODE_EXCLUSIVE)
return true;
continue;
}
return true;
}
}
}
return false;
}
/**
* rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive
*
* An exclusive resource group implies that there should be no sharing of
* its allocated resources. At the time this group is considered to be
* exclusive this test can determine if its current schemata supports this
* setting by testing for overlap with all other resource groups.
*
* Return: true if resource group can be exclusive, false if there is overlap
* with allocations of other resource groups and thus this resource group
* cannot be exclusive.
*/
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
int closid = rdtgrp->closid;
struct rdt_resource *r;
struct rdt_domain *d;
for_each_alloc_enabled_rdt_resource(r) {
list_for_each_entry(d, &r->domains, list) {
if (rdtgroup_cbm_overlaps(r, d, d->ctrl_val[closid],
rdtgrp->closid, false))
return false;
}
}
return true;
}
/**
* rdtgroup_mode_write - Modify the resource group's mode
*
*/
static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct rdtgroup *rdtgrp;
enum rdtgrp_mode mode;
int ret = 0;
/* Valid input requires a trailing newline */
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;
buf[nbytes - 1] = '\0';
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
return -ENOENT;
}
rdt_last_cmd_clear();
mode = rdtgrp->mode;
if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) ||
(!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) ||
(!strcmp(buf, "pseudo-locksetup") &&
mode == RDT_MODE_PSEUDO_LOCKSETUP) ||
(!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED))
goto out;
if (mode == RDT_MODE_PSEUDO_LOCKED) {
rdt_last_cmd_printf("cannot change pseudo-locked group\n");
ret = -EINVAL;
goto out;
}
if (!strcmp(buf, "shareable")) {
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
ret = rdtgroup_locksetup_exit(rdtgrp);
if (ret)
goto out;
}
rdtgrp->mode = RDT_MODE_SHAREABLE;
} else if (!strcmp(buf, "exclusive")) {
if (!rdtgroup_mode_test_exclusive(rdtgrp)) {
rdt_last_cmd_printf("schemata overlaps\n");
ret = -EINVAL;
goto out;
}
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
ret = rdtgroup_locksetup_exit(rdtgrp);
if (ret)
goto out;
}
rdtgrp->mode = RDT_MODE_EXCLUSIVE;
} else if (!strcmp(buf, "pseudo-locksetup")) {
ret = rdtgroup_locksetup_enter(rdtgrp);
if (ret)
goto out;
rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP;
} else {
rdt_last_cmd_printf("unknown/unsupported mode\n");
ret = -EINVAL;
}
out:
rdtgroup_kn_unlock(of->kn);
return ret ?: nbytes;
}
/**
* rdtgroup_cbm_to_size - Translate CBM to size in bytes
* @r: RDT resource to which @d belongs.
* @d: RDT domain instance.
* @cbm: bitmask for which the size should be computed.
*
* The bitmask provided associated with the RDT domain instance @d will be
* translated into how many bytes it represents. The size in bytes is
* computed by first dividing the total cache size by the CBM length to
* determine how many bytes each bit in the bitmask represents. The result
* is multiplied with the number of bits set in the bitmask.
*/
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
struct rdt_domain *d, u32 cbm)
{
struct cpu_cacheinfo *ci;
unsigned int size = 0;
int num_b, i;
num_b = bitmap_weight((unsigned long *)&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == r->cache_level) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
}
return size;
}
/**
* rdtgroup_size_show - Display size in bytes of allocated regions
*
* The "size" file mirrors the layout of the "schemata" file, printing the
* size in bytes of each region instead of the capacity bitmask.
*
*/
static int rdtgroup_size_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
struct rdt_domain *d;
unsigned int size;
bool sep = false;
u32 cbm;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
return -ENOENT;
}
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
seq_printf(s, "%*s:", max_name_width, rdtgrp->plr->r->name);
size = rdtgroup_cbm_to_size(rdtgrp->plr->r,
rdtgrp->plr->d,
rdtgrp->plr->cbm);
seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
goto out;
}
for_each_alloc_enabled_rdt_resource(r) {
seq_printf(s, "%*s:", max_name_width, r->name);
list_for_each_entry(d, &r->domains, list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
size = 0;
} else {
cbm = d->ctrl_val[rdtgrp->closid];
size = rdtgroup_cbm_to_size(r, d, cbm);
}
seq_printf(s, "%d=%u", d->id, size);
sep = true;
}
seq_putc(s, '\n');
}
out:
rdtgroup_kn_unlock(of->kn);
return 0;
}
/* rdtgroup information files for one cache resource. */
static struct rftype res_common_files[] = {
{
@ -791,6 +1246,13 @@ static struct rftype res_common_files[] = {
.seq_show = rdt_shareable_bits_show,
.fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE,
},
{
.name = "bit_usage",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_bit_usage_show,
.fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE,
},
{
.name = "min_bandwidth",
.mode = 0444,
@ -853,6 +1315,22 @@ static struct rftype res_common_files[] = {
.seq_show = rdtgroup_schemata_show,
.fflags = RF_CTRL_BASE,
},
{
.name = "mode",
.mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.write = rdtgroup_mode_write,
.seq_show = rdtgroup_mode_show,
.fflags = RF_CTRL_BASE,
},
{
.name = "size",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_size_show,
.fflags = RF_CTRL_BASE,
},
};
static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
@ -883,6 +1361,103 @@ error:
return ret;
}
/**
* rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file
* @r: The resource group with which the file is associated.
* @name: Name of the file
*
* The permissions of named resctrl file, directory, or link are modified
* to not allow read, write, or execute by any user.
*
* WARNING: This function is intended to communicate to the user that the
* resctrl file has been locked down - that it is not relevant to the
* particular state the system finds itself in. It should not be relied
* on to protect from user access because after the file's permissions
* are restricted the user can still change the permissions using chmod
* from the command line.
*
* Return: 0 on success, <0 on failure.
*/
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name)
{
struct iattr iattr = {.ia_valid = ATTR_MODE,};
struct kernfs_node *kn;
int ret = 0;
kn = kernfs_find_and_get_ns(r->kn, name, NULL);
if (!kn)
return -ENOENT;
switch (kernfs_type(kn)) {
case KERNFS_DIR:
iattr.ia_mode = S_IFDIR;
break;
case KERNFS_FILE:
iattr.ia_mode = S_IFREG;
break;
case KERNFS_LINK:
iattr.ia_mode = S_IFLNK;
break;
}
ret = kernfs_setattr(kn, &iattr);
kernfs_put(kn);
return ret;
}
/**
* rdtgroup_kn_mode_restore - Restore user access to named resctrl file
* @r: The resource group with which the file is associated.
* @name: Name of the file
* @mask: Mask of permissions that should be restored
*
* Restore the permissions of the named file. If @name is a directory the
* permissions of its parent will be used.
*
* Return: 0 on success, <0 on failure.
*/
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask)
{
struct iattr iattr = {.ia_valid = ATTR_MODE,};
struct kernfs_node *kn, *parent;
struct rftype *rfts, *rft;
int ret, len;
rfts = res_common_files;
len = ARRAY_SIZE(res_common_files);
for (rft = rfts; rft < rfts + len; rft++) {
if (!strcmp(rft->name, name))
iattr.ia_mode = rft->mode & mask;
}
kn = kernfs_find_and_get_ns(r->kn, name, NULL);
if (!kn)
return -ENOENT;
switch (kernfs_type(kn)) {
case KERNFS_DIR:
parent = kernfs_get_parent(kn);
if (parent) {
iattr.ia_mode |= parent->mode;
kernfs_put(parent);
}
iattr.ia_mode |= S_IFDIR;
break;
case KERNFS_FILE:
iattr.ia_mode |= S_IFREG;
break;
case KERNFS_LINK:
iattr.ia_mode |= S_IFLNK;
break;
}
ret = kernfs_setattr(kn, &iattr);
kernfs_put(kn);
return ret;
}
static int rdtgroup_mkdir_info_resdir(struct rdt_resource *r, char *name,
unsigned long fflags)
{
@ -1224,6 +1799,9 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
if (atomic_dec_and_test(&rdtgrp->waitcount) &&
(rdtgrp->flags & RDT_DELETED)) {
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
rdtgroup_pseudo_lock_remove(rdtgrp);
kernfs_unbreak_active_protection(kn);
kernfs_put(rdtgrp->kn);
kfree(rdtgrp);
@ -1289,10 +1867,16 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
rdtgroup_default.mon.mon_data_kn = kn_mondata;
}
ret = rdt_pseudo_lock_init();
if (ret) {
dentry = ERR_PTR(ret);
goto out_mondata;
}
dentry = kernfs_mount(fs_type, flags, rdt_root,
RDTGROUP_SUPER_MAGIC, NULL);
if (IS_ERR(dentry))
goto out_mondata;
goto out_psl;
if (rdt_alloc_capable)
static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
@ -1310,6 +1894,8 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
goto out;
out_psl:
rdt_pseudo_lock_release();
out_mondata:
if (rdt_mon_capable)
kernfs_remove(kn_mondata);
@ -1447,6 +2033,10 @@ static void rmdir_all_sub(void)
if (rdtgrp == &rdtgroup_default)
continue;
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
rdtgroup_pseudo_lock_remove(rdtgrp);
/*
* Give any CPUs back to the default group. We cannot copy
* cpu_online_mask because a CPU might have executed the
@ -1483,6 +2073,8 @@ static void rdt_kill_sb(struct super_block *sb)
reset_all_ctrls(r);
cdp_disable_all();
rmdir_all_sub();
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
static_branch_disable_cpuslocked(&rdt_mon_enable_key);
static_branch_disable_cpuslocked(&rdt_enable_key);
@ -1682,6 +2274,114 @@ out_destroy:
return ret;
}
/**
* cbm_ensure_valid - Enforce validity on provided CBM
* @_val: Candidate CBM
* @r: RDT resource to which the CBM belongs
*
* The provided CBM represents all cache portions available for use. This
* may be represented by a bitmap that does not consist of contiguous ones
* and thus be an invalid CBM.
* Here the provided CBM is forced to be a valid CBM by only considering
* the first set of contiguous bits as valid and clearing all bits.
* The intention here is to provide a valid default CBM with which a new
* resource group is initialized. The user can follow this with a
* modification to the CBM if the default does not satisfy the
* requirements.
*/
static void cbm_ensure_valid(u32 *_val, struct rdt_resource *r)
{
/*
* Convert the u32 _val to an unsigned long required by all the bit
* operations within this function. No more than 32 bits of this
* converted value can be accessed because all bit operations are
* additionally provided with cbm_len that is initialized during
* hardware enumeration using five bits from the EAX register and
* thus never can exceed 32 bits.
*/
unsigned long *val = (unsigned long *)_val;
unsigned int cbm_len = r->cache.cbm_len;
unsigned long first_bit, zero_bit;
if (*val == 0)
return;
first_bit = find_first_bit(val, cbm_len);
zero_bit = find_next_zero_bit(val, cbm_len, first_bit);
/* Clear any remaining bits to ensure contiguous region */
bitmap_clear(val, zero_bit, cbm_len - zero_bit);
}
/**
* rdtgroup_init_alloc - Initialize the new RDT group's allocations
*
* A new RDT group is being created on an allocation capable (CAT)
* supporting system. Set this group up to start off with all usable
* allocations. That is, all shareable and unused bits.
*
* All-zero CBM is invalid. If there are no more shareable bits available
* on any domain then the entire allocation will fail.
*/
static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
{
u32 used_b = 0, unused_b = 0;
u32 closid = rdtgrp->closid;
struct rdt_resource *r;
enum rdtgrp_mode mode;
struct rdt_domain *d;
int i, ret;
u32 *ctrl;
for_each_alloc_enabled_rdt_resource(r) {
list_for_each_entry(d, &r->domains, list) {
d->have_new_ctrl = false;
d->new_ctrl = r->cache.shareable_bits;
used_b = r->cache.shareable_bits;
ctrl = d->ctrl_val;
for (i = 0; i < r->num_closid; i++, ctrl++) {
if (closid_allocated(i) && i != closid) {
mode = rdtgroup_mode_by_closid(i);
if (mode == RDT_MODE_PSEUDO_LOCKSETUP)
break;
used_b |= *ctrl;
if (mode == RDT_MODE_SHAREABLE)
d->new_ctrl |= *ctrl;
}
}
if (d->plr && d->plr->cbm > 0)
used_b |= d->plr->cbm;
unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1);
unused_b &= BIT_MASK(r->cache.cbm_len) - 1;
d->new_ctrl |= unused_b;
/*
* Force the initial CBM to be valid, user can
* modify the CBM based on system availability.
*/
cbm_ensure_valid(&d->new_ctrl, r);
if (bitmap_weight((unsigned long *) &d->new_ctrl,
r->cache.cbm_len) <
r->cache.min_cbm_bits) {
rdt_last_cmd_printf("no space on %s:%d\n",
r->name, d->id);
return -ENOSPC;
}
d->have_new_ctrl = true;
}
}
for_each_alloc_enabled_rdt_resource(r) {
ret = update_domains(r, rdtgrp->closid);
if (ret < 0) {
rdt_last_cmd_puts("failed to initialize allocations\n");
return ret;
}
rdtgrp->mode = RDT_MODE_SHAREABLE;
}
return 0;
}
static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
struct kernfs_node *prgrp_kn,
const char *name, umode_t mode,
@ -1700,6 +2400,14 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
goto out_unlock;
}
if (rtype == RDTMON_GROUP &&
(prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) {
ret = -EINVAL;
rdt_last_cmd_puts("pseudo-locking in progress\n");
goto out_unlock;
}
/* allocate the rdtgroup. */
rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
if (!rdtgrp) {
@ -1840,6 +2548,10 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
ret = 0;
rdtgrp->closid = closid;
ret = rdtgroup_init_alloc(rdtgrp);
if (ret < 0)
goto out_id_free;
list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
if (rdt_mon_capable) {
@ -1850,15 +2562,16 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
ret = mongroup_create_dir(kn, NULL, "mon_groups", NULL);
if (ret) {
rdt_last_cmd_puts("kernfs subdir error\n");
goto out_id_free;
goto out_del_list;
}
}
goto out_unlock;
out_del_list:
list_del(&rdtgrp->rdtgroup_list);
out_id_free:
closid_free(closid);
list_del(&rdtgrp->rdtgroup_list);
out_common_fail:
mkdir_rdt_prepare_clean(rdtgrp);
out_unlock:
@ -1945,6 +2658,21 @@ static int rdtgroup_rmdir_mon(struct kernfs_node *kn, struct rdtgroup *rdtgrp,
return 0;
}
static int rdtgroup_ctrl_remove(struct kernfs_node *kn,
struct rdtgroup *rdtgrp)
{
rdtgrp->flags = RDT_DELETED;
list_del(&rdtgrp->rdtgroup_list);
/*
* one extra hold on this, will drop when we kfree(rdtgrp)
* in rdtgroup_kn_unlock()
*/
kernfs_get(kn);
kernfs_remove(rdtgrp->kn);
return 0;
}
static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp,
cpumask_var_t tmpmask)
{
@ -1970,7 +2698,6 @@ static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp,
cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
update_closid_rmid(tmpmask, NULL);
rdtgrp->flags = RDT_DELETED;
closid_free(rdtgrp->closid);
free_rmid(rdtgrp->mon.rmid);
@ -1979,14 +2706,7 @@ static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp,
*/
free_all_child_rdtgrp(rdtgrp);
list_del(&rdtgrp->rdtgroup_list);
/*
* one extra hold on this, will drop when we kfree(rdtgrp)
* in rdtgroup_kn_unlock()
*/
kernfs_get(kn);
kernfs_remove(rdtgrp->kn);
rdtgroup_ctrl_remove(kn, rdtgrp);
return 0;
}
@ -2014,13 +2734,19 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
* If the rdtgroup is a mon group and parent directory
* is a valid "mon_groups" directory, remove the mon group.
*/
if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn)
ret = rdtgroup_rmdir_ctrl(kn, rdtgrp, tmpmask);
else if (rdtgrp->type == RDTMON_GROUP &&
is_mon_groups(parent_kn, kn->name))
if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn) {
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
ret = rdtgroup_ctrl_remove(kn, rdtgrp);
} else {
ret = rdtgroup_rmdir_ctrl(kn, rdtgrp, tmpmask);
}
} else if (rdtgrp->type == RDTMON_GROUP &&
is_mon_groups(parent_kn, kn->name)) {
ret = rdtgroup_rmdir_mon(kn, rdtgrp, tmpmask);
else
} else {
ret = -EPERM;
}
out:
rdtgroup_kn_unlock(kn);
@ -2046,7 +2772,8 @@ static int __init rdtgroup_setup_root(void)
int ret;
rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
KERNFS_ROOT_CREATE_DEACTIVATED,
KERNFS_ROOT_CREATE_DEACTIVATED |
KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK,
&rdtgroup_default);
if (IS_ERR(rdt_root))
return PTR_ERR(rdt_root);
@ -2102,6 +2829,29 @@ int __init rdtgroup_init(void)
if (ret)
goto cleanup_mountpoint;
/*
* Adding the resctrl debugfs directory here may not be ideal since
* it would let the resctrl debugfs directory appear on the debugfs
* filesystem before the resctrl filesystem is mounted.
* It may also be ok since that would enable debugging of RDT before
* resctrl is mounted.
* The reason why the debugfs directory is created here and not in
* rdt_mount() is because rdt_mount() takes rdtgroup_mutex and
* during the debugfs directory creation also &sb->s_type->i_mutex_key
* (the lockdep class of inode->i_rwsem). Other filesystem
* interactions (eg. SyS_getdents) have the lock ordering:
* &sb->s_type->i_mutex_key --> &mm->mmap_sem
* During mmap(), called with &mm->mmap_sem, the rdtgroup_mutex
* is taken, thus creating dependency:
* &mm->mmap_sem --> rdtgroup_mutex for the latter that can cause
* issues considering the other two lock dependencies.
* By creating the debugfs directory here we avoid a dependency
* that may cause deadlock (even though file operations cannot
* occur until the filesystem is mounted, but I do not know how to
* tell lockdep that).
*/
debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
return 0;
cleanup_mountpoint:
@ -2111,3 +2861,11 @@ cleanup_root:
return ret;
}
void __exit rdtgroup_exit(void)
{
debugfs_remove_recursive(debugfs_resctrl);
unregister_filesystem(&rdt_fs_type);
sysfs_remove_mount_point(fs_kobj, "resctrl");
kernfs_destroy_root(rdt_root);
}