
KVM/arm fixes for 5.3

- A bunch of switch/case fall-through annotation, fixing one actual bug
- Fix PMU reset bug
- Add missing exception class debug strings

Merge tag 'kvmarm-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

alistair/sunxi64-5.4-dsi
Paolo Bonzini 2019-08-09 16:53:39 +02:00
commit 0e1c438c44
5794 changed files with 623707 additions and 111796 deletions

.gitignore

@ -30,6 +30,7 @@
*.lz4 *.lz4
*.lzma *.lzma
*.lzo *.lzo
*.mod
*.mod.c *.mod.c
*.o *.o
*.o.* *.o.*


@ -1770,7 +1770,6 @@ S: USA
N: Dave Jones N: Dave Jones
E: davej@codemonkey.org.uk E: davej@codemonkey.org.uk
W: http://www.codemonkey.org.uk
D: Assorted VIA x86 support. D: Assorted VIA x86 support.
D: 2.5 AGPGART overhaul. D: 2.5 AGPGART overhaul.
D: CPUFREQ maintenance. D: CPUFREQ maintenance.
@ -3120,7 +3119,7 @@ S: France
N: Rik van Riel N: Rik van Riel
E: riel@redhat.com E: riel@redhat.com
W: http://www.surriel.com/ W: http://www.surriel.com/
D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround D: Linux-MM site, Documentation/admin-guide/sysctl/*, swap/mm readaround
D: kswapd fixes, random kernel hacker, rmap VM, D: kswapd fixes, random kernel hacker, rmap VM,
D: nl.linux.org administrator, minor scheduler additions D: nl.linux.org administrator, minor scheduler additions
S: Red Hat Boston S: Red Hat Boston


@ -11,7 +11,7 @@ Description:
Kernel code may export it for complete or partial access. Kernel code may export it for complete or partial access.
GPIOs are identified as they are inside the kernel, using integers in GPIOs are identified as they are inside the kernel, using integers in
the range 0..INT_MAX. See Documentation/gpio for more information. the range 0..INT_MAX. See Documentation/admin-guide/gpio for more information.
/sys/class/gpio /sys/class/gpio
/export ... asks the kernel to export a GPIO to userspace /export ... asks the kernel to export a GPIO to userspace


@ -1,6 +1,6 @@
rfkill - radio frequency (RF) connector kill switch support rfkill - radio frequency (RF) connector kill switch support
For details to this subsystem look at Documentation/rfkill.txt. For details to this subsystem look at Documentation/driver-api/rfkill.rst.
What: /sys/class/rfkill/rfkill[0-9]+/claim What: /sys/class/rfkill/rfkill[0-9]+/claim
Date: 09-Jul-2007 Date: 09-Jul-2007


@ -423,23 +423,6 @@ Description:
(e.g. driver restart on the VM which owns the VF). (e.g. driver restart on the VM which owns the VF).
sysfs interface for NetEffect RNIC Low-Level iWARP driver (nes)
---------------------------------------------------------------
What: /sys/class/infiniband/nesX/hw_rev
What: /sys/class/infiniband/nesX/hca_type
What: /sys/class/infiniband/nesX/board_id
Date: Feb, 2008
KernelVersion: v2.6.25
Contact: linux-rdma@vger.kernel.org
Description:
hw_rev: (RO) Hardware revision number
hca_type: (RO) Host Channel Adapter type (NEX020)
board_id: (RO) Manufacturing board id
sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4) sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4)
----------------------------------------------------- -----------------------------------------------------


@ -1,6 +1,6 @@
rfkill - radio frequency (RF) connector kill switch support rfkill - radio frequency (RF) connector kill switch support
For details to this subsystem look at Documentation/rfkill.txt. For details to this subsystem look at Documentation/driver-api/rfkill.rst.
For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in
Documentation/ABI/removed/sysfs-class-rfkill. Documentation/ABI/removed/sysfs-class-rfkill.


@ -61,7 +61,7 @@ Date: October 2002
Contact: Linux Memory Management list <linux-mm@kvack.org> Contact: Linux Memory Management list <linux-mm@kvack.org>
Description: Description:
The node's hit/miss statistics, in units of pages. The node's hit/miss statistics, in units of pages.
See Documentation/numastat.txt See Documentation/admin-guide/numastat.rst
What: /sys/devices/system/node/nodeX/distance What: /sys/devices/system/node/nodeX/distance
Date: October 2002 Date: October 2002


@ -120,3 +120,23 @@ Description: These files show the system reset cause, as following: ComEx
the last reset cause. the last reset cause.
The files are read only. The files are read only.
Date: June 2019
KernelVersion: 5.3
Contact: Vadim Pasternak <vadimp@mellanox.com>
Description: These files show the system reset cause, as following:
COMEX thermal shutdown; watchdog power off or reset was derived
by one of the next components: COMEX, switch board or by Small Form
Factor mezzanine, reset requested from ASIC, reset caused by BIOS
reload. Value 1 in file means this is reset cause, 0 - otherwise.
Only one of the above causes could be 1 at the same time, representing
only last reset cause.
The files are read only.
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_thermal
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_wd
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_asic
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_reload_bios
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sff_wd
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_swb_wd
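
The description above says at most one reset_* file reads 1, and that file names the last reset cause. A minimal shell sketch of that lookup follows. The function name `last_reset_cause` and its directory argument are hypothetical and are not part of the kernel tree; on real hardware the argument would be the matching `/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*` directory.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): print the name of the reset_*
# attribute in directory $1 whose value is 1, i.e. the last reset cause.
last_reset_cause() {
    dir=$1
    for f in "$dir"/reset_*; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "1" ]; then
            basename "$f"
            return 0
        fi
    done
    # No attribute reported 1.
    return 1
}
```

The helper only reads the files, which matches the read-only nature of these attributes.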


@ -29,4 +29,4 @@ Description:
17 - sectors discarded 17 - sectors discarded
18 - time spent discarding 18 - time spent discarding
For more details refer to Documentation/iostats.txt For more details refer to Documentation/admin-guide/iostats.rst


@ -15,7 +15,7 @@ Description:
9 - I/Os currently in progress 9 - I/Os currently in progress
10 - time spent doing I/Os (ms) 10 - time spent doing I/Os (ms)
11 - weighted time spent doing I/Os (ms) 11 - weighted time spent doing I/Os (ms)
For more details refer Documentation/iostats.txt For more details refer Documentation/admin-guide/iostats.rst
What: /sys/block/<disk>/<part>/stat What: /sys/block/<disk>/<part>/stat


@ -45,7 +45,7 @@ Description:
- Values below -2 are rejected with -EINVAL - Values below -2 are rejected with -EINVAL
For more information, see For more information, see
Documentation/laptops/disk-shock-protection.txt Documentation/admin-guide/laptops/disk-shock-protection.rst
What: /sys/block/*/device/ncq_prio_enable What: /sys/block/*/device/ncq_prio_enable


@ -376,10 +376,42 @@ Description:
supply. Normally this is configured based on the type of supply. Normally this is configured based on the type of
connection made (e.g. A configured SDP should output a maximum connection made (e.g. A configured SDP should output a maximum
of 500mA so the input current limit is set to the same value). of 500mA so the input current limit is set to the same value).
Use preferably input_power_limit, and for problems that can be
solved using power limit use input_current_limit.
Access: Read, Write Access: Read, Write
Valid values: Represented in microamps Valid values: Represented in microamps
What: /sys/class/power_supply/<supply_name>/input_voltage_limit
Date: May 2019
Contact: linux-pm@vger.kernel.org
Description:
This entry configures the incoming VBUS voltage limit currently
set in the supply. Normally this is configured based on
system-level knowledge or user input (e.g. This is part of the
Pixel C's thermal management strategy to effectively limit the
input power to 5V when the screen is on to meet Google's skin
temperature targets). Note that this feature should not be
used for safety critical things.
Use preferably input_power_limit, and for problems that can be
solved using power limit use input_voltage_limit.
Access: Read, Write
Valid values: Represented in microvolts
What: /sys/class/power_supply/<supply_name>/input_power_limit
Date: May 2019
Contact: linux-pm@vger.kernel.org
Description:
This entry configures the incoming power limit currently set
in the supply. Normally this is configured based on
system-level knowledge or user input. Use preferably this
feature to limit the incoming power and use current/voltage
limit only for problems that can be solved using power limit.
Access: Read, Write
Valid values: Represented in microwatts
What: /sys/class/power_supply/<supply_name>/online, What: /sys/class/power_supply/<supply_name>/online,
Date: May 2007 Date: May 2007
Contact: linux-pm@vger.kernel.org Contact: linux-pm@vger.kernel.org


@ -0,0 +1,30 @@
What: /sys/class/power_supply/wilco-charger/charge_type
Date: April 2019
KernelVersion: 5.2
Description:
What charging algorithm to use:
Standard: Fully charges battery at a standard rate.
Adaptive: Battery settings adaptively optimized based on
typical battery usage pattern.
Fast: Battery charges over a shorter period.
Trickle: Extends battery lifespan, intended for users who
primarily use their Chromebook while connected to AC.
Custom: A low and high threshold percentage is specified.
Charging begins when level drops below
charge_control_start_threshold, and ceases when
level is above charge_control_end_threshold.
What: /sys/class/power_supply/wilco-charger/charge_control_start_threshold
Date: April 2019
KernelVersion: 5.2
Description:
Used when charge_type="Custom", as described above. Measured in
percentages. The valid range is [50, 95].
What: /sys/class/power_supply/wilco-charger/charge_control_end_threshold
Date: April 2019
KernelVersion: 5.2
Description:
Used when charge_type="Custom", as described above. Measured in
percentages. The valid range is [55, 100].
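
The three attributes above can be exercised together as in the following shell sketch. It checks the documented ranges before writing anything. The function name `set_custom_thresholds` is hypothetical, and writing `charge_type` before the thresholds is an assumption, not something the ABI text specifies. On a real system the directory argument would be `/sys/class/power_supply/wilco-charger`.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): select the Custom charging
# algorithm and write both thresholds.
# $1 = power-supply directory, $2 = start threshold, $3 = end threshold.
set_custom_thresholds() {
    dir=$1 start=$2 end=$3
    # Ranges taken from the ABI text: start in [50, 95], end in [55, 100].
    [ "$start" -ge 50 ] && [ "$start" -le 95 ] || return 1
    [ "$end" -ge 55 ] && [ "$end" -le 100 ] || return 1
    # Assumed ordering: pick the algorithm first, then set the thresholds.
    echo Custom > "$dir/charge_type" &&
    echo "$start" > "$dir/charge_control_start_threshold" &&
    echo "$end" > "$dir/charge_control_end_threshold"
}
```

Rejecting out-of-range values in userspace avoids relying on the driver's error code, which the text above does not describe.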


@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org
Description: Description:
The powercap/ class sub directory belongs to the power cap The powercap/ class sub directory belongs to the power cap
subsystem. Refer to subsystem. Refer to
Documentation/power/powercap/powercap.txt for details. Documentation/power/powercap/powercap.rst for details.
What: /sys/class/powercap/<control type> What: /sys/class/powercap/<control type>
Date: September 2013 Date: September 2013


@ -1,6 +1,6 @@
switchtec - Microsemi Switchtec PCI Switch Management Endpoint switchtec - Microsemi Switchtec PCI Switch Management Endpoint
For details on this subsystem look at Documentation/switchtec.txt. For details on this subsystem look at Documentation/driver-api/switchtec.rst.
What: /sys/class/switchtec What: /sys/class/switchtec
Date: 05-Jan-2017 Date: 05-Jan-2017


@ -34,7 +34,7 @@ Description: CPU topology files that describe kernel limits related to
present: cpus that have been identified as being present in present: cpus that have been identified as being present in
the system. the system.
See Documentation/cputopology.txt for more information. See Documentation/admin-guide/cputopology.rst for more information.
What: /sys/devices/system/cpu/probe What: /sys/devices/system/cpu/probe
@ -103,7 +103,7 @@ Description: CPU topology files that describe a logical CPU's relationship
thread_siblings_list: human-readable list of cpu#'s hardware thread_siblings_list: human-readable list of cpu#'s hardware
threads within the same core as cpu# threads within the same core as cpu#
See Documentation/cputopology.txt for more information. See Documentation/admin-guide/cputopology.rst for more information.
What: /sys/devices/system/cpu/cpuidle/current_driver What: /sys/devices/system/cpu/cpuidle/current_driver


@ -31,7 +31,7 @@ Description:
To control the LED display, use the following : To control the LED display, use the following :
echo 0x0T000DDD > /sys/devices/platform/asus_laptop/ echo 0x0T000DDD > /sys/devices/platform/asus_laptop/
where T control the 3 letters display, and DDD the 3 digits display. where T control the 3 letters display, and DDD the 3 digits display.
The DDD table can be found in Documentation/laptops/asus-laptop.txt The DDD table can be found in Documentation/admin-guide/laptops/asus-laptop.rst
What: /sys/devices/platform/asus_laptop/bluetooth What: /sys/devices/platform/asus_laptop/bluetooth
Date: January 2007 Date: January 2007


@ -36,3 +36,13 @@ KernelVersion: 3.5
Contact: "AceLan Kao" <acelan.kao@canonical.com> Contact: "AceLan Kao" <acelan.kao@canonical.com>
Description: Description:
Resume on lid open. 1 means on, 0 means off. Resume on lid open. 1 means on, 0 means off.
What: /sys/devices/platform/<platform>/fan_boost_mode
Date: Sep 2019
KernelVersion: 5.3
Contact: "Yurii Pavlovskyi" <yurii.pavlovskyi@gmail.com>
Description:
Fan boost mode:
* 0 - normal,
* 1 - overboost,
* 2 - silent
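
The numeric values listed above can be mapped to their names as in this short shell sketch. The function name `fan_boost_mode_name` is hypothetical; only the 0/1/2 meanings come from the ABI text.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): translate a fan_boost_mode value
# into the name given in the ABI description.
fan_boost_mode_name() {
    case "$1" in
        0) echo normal ;;
        1) echo overboost ;;
        2) echo silent ;;
        # Anything else is not described by the ABI text.
        *) echo unknown; return 1 ;;
    esac
}
```

A caller would pass the file contents, for example `fan_boost_mode_name "$(cat /sys/devices/platform/<platform>/fan_boost_mode)"`, with `<platform>` replaced by the actual platform device name.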


@ -1,7 +1,7 @@
What: /sys/devices/platform/<i2c-demux-name>/available_masters What: /sys/devices/platform/<i2c-demux-name>/available_masters
Date: January 2016 Date: January 2016
KernelVersion: 4.6 KernelVersion: 4.6
Contact: Wolfram Sang <wsa@the-dreams.de> Contact: Wolfram Sang <wsa+renesas@sang-engineering.com>
Description: Description:
Reading the file will give you a list of masters which can be Reading the file will give you a list of masters which can be
selected for a demultiplexed bus. The format is selected for a demultiplexed bus. The format is
@ -12,7 +12,7 @@ Description:
What: /sys/devices/platform/<i2c-demux-name>/current_master What: /sys/devices/platform/<i2c-demux-name>/current_master
Date: January 2016 Date: January 2016
KernelVersion: 4.6 KernelVersion: 4.6
Contact: Wolfram Sang <wsa@the-dreams.de> Contact: Wolfram Sang <wsa+renesas@sang-engineering.com>
Description: Description:
This file selects/shows the active I2C master for a demultiplexed This file selects/shows the active I2C master for a demultiplexed
bus. It uses the <index> value from the file 'available_masters'. bus. It uses the <index> value from the file 'available_masters'.


@ -212,7 +212,7 @@ The standard 64-bit addressing device would do something like this::
If the device only supports 32-bit addressing for descriptors in the If the device only supports 32-bit addressing for descriptors in the
coherent allocations, but supports full 64-bits for streaming mappings coherent allocations, but supports full 64-bits for streaming mappings
it would look like this: it would look like this::
if (dma_set_mask(dev, DMA_BIT_MASK(64))) { if (dma_set_mask(dev, DMA_BIT_MASK(64))) {
dev_warn(dev, "mydev: No suitable DMA available\n"); dev_warn(dev, "mydev: No suitable DMA available\n");


@ -1,4 +1,8 @@
ACPI considerations for PCI host bridges .. SPDX-License-Identifier: GPL-2.0
========================================
ACPI considerations for PCI host bridges
========================================
The general rule is that the ACPI namespace should describe everything the The general rule is that the ACPI namespace should describe everything the
OS might use unless there's another way for the OS to find it [1, 2]. OS might use unless there's another way for the OS to find it [1, 2].
@ -131,12 +135,13 @@ address always corresponds to bus 0, even if the bus range below the bridge
[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4: [4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
QWord/DWord/Word Address Space Descriptor (.1, .2, .3) QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
General Flags: Bit [0] Ignored General Flags: Bit [0] Ignored
Extended Address Space Descriptor (.4) Extended Address Space Descriptor (.4)
General Flags: Bit [0] Consumer/Producer: General Flags: Bit [0] Consumer/Producer:
1This device consumes this resource
0This device produces and consumes this resource * 1 This device consumes this resource
* 0 This device produces and consumes this resource
[5] ACPI 6.2, sec 19.6.43: [5] ACPI 6.2, sec 19.6.43:
ResourceUsage specifies whether the Memory range is consumed by ResourceUsage specifies whether the Memory range is consumed by


@ -0,0 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
======================
PCI Endpoint Framework
======================
.. toctree::
:maxdepth: 2
pci-endpoint
pci-endpoint-cfs
pci-test-function
pci-test-howto


@ -1,41 +1,51 @@
CONFIGURING PCI ENDPOINT USING CONFIGFS .. SPDX-License-Identifier: GPL-2.0
Kishon Vijay Abraham I <kishon@ti.com>
=======================================
Configuring PCI Endpoint Using CONFIGFS
=======================================
:Author: Kishon Vijay Abraham I <kishon@ti.com>
The PCI Endpoint Core exposes configfs entry (pci_ep) to configure the The PCI Endpoint Core exposes configfs entry (pci_ep) to configure the
PCI endpoint function and to bind the endpoint function PCI endpoint function and to bind the endpoint function
with the endpoint controller. (For introducing other mechanisms to with the endpoint controller. (For introducing other mechanisms to
configure the PCI Endpoint Function refer to [1]). configure the PCI Endpoint Function refer to [1]).
*) Mounting configfs Mounting configfs
=================
The PCI Endpoint Core layer creates pci_ep directory in the mounted configfs The PCI Endpoint Core layer creates pci_ep directory in the mounted configfs
directory. configfs can be mounted using the following command. directory. configfs can be mounted using the following command::
mount -t configfs none /sys/kernel/config mount -t configfs none /sys/kernel/config
*) Directory Structure Directory Structure
===================
The pci_ep configfs has two directories at its root: controllers and The pci_ep configfs has two directories at its root: controllers and
functions. Every EPC device present in the system will have an entry in functions. Every EPC device present in the system will have an entry in
the *controllers* directory and every EPF driver present in the system the *controllers* directory and every EPF driver present in the system
will have an entry in the *functions* directory. will have an entry in the *functions* directory.
::
/sys/kernel/config/pci_ep/ /sys/kernel/config/pci_ep/
.. controllers/ .. controllers/
.. functions/ .. functions/
*) Creating EPF Device Creating EPF Device
===================
Every registered EPF driver will be listed in controllers directory. The Every registered EPF driver will be listed in controllers directory. The
entries corresponding to EPF driver will be created by the EPF core. entries corresponding to EPF driver will be created by the EPF core.
::
/sys/kernel/config/pci_ep/functions/ /sys/kernel/config/pci_ep/functions/
.. <EPF Driver1>/ .. <EPF Driver1>/
... <EPF Device 11>/ ... <EPF Device 11>/
... <EPF Device 21>/ ... <EPF Device 21>/
.. <EPF Driver2>/ .. <EPF Driver2>/
... <EPF Device 12>/ ... <EPF Device 12>/
... <EPF Device 22>/ ... <EPF Device 22>/
In order to create a <EPF device> of the type probed by <EPF Driver>, the In order to create a <EPF device> of the type probed by <EPF Driver>, the
user has to create a directory inside <EPF DriverN>. user has to create a directory inside <EPF DriverN>.
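
The directory creation described above can be sketched in shell as follows. The function name `create_epf_device` and its arguments are hypothetical; on a real system the root would be `/sys/kernel/config/pci_ep`, and the driver directory must already have been created by the EPF core.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): create an EPF device directory
# inside an existing EPF driver directory.
# $1 = pci_ep configfs root, $2 = EPF driver name, $3 = new device name.
create_epf_device() {
    root=$1 driver=$2 device=$3
    # The driver directory is created by the EPF core, so never create it here.
    [ -d "$root/functions/$driver" ] || return 1
    mkdir "$root/functions/$driver/$device"
}
```

Failing when the driver directory is missing keeps the sketch from masking the case where the EPF driver is not registered.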
@ -44,34 +54,37 @@ Every <EPF device> directory consists of the following entries that can be
used to configure the standard configuration header of the endpoint function. used to configure the standard configuration header of the endpoint function.
(These entries are created by the framework when any new <EPF Device> is (These entries are created by the framework when any new <EPF Device> is
created) created)
::
.. <EPF Driver1>/ .. <EPF Driver1>/
... <EPF Device 11>/ ... <EPF Device 11>/
... vendorid ... vendorid
... deviceid ... deviceid
... revid ... revid
... progif_code ... progif_code
... subclass_code ... subclass_code
... baseclass_code ... baseclass_code
... cache_line_size ... cache_line_size
... subsys_vendor_id ... subsys_vendor_id
... subsys_id ... subsys_id
... interrupt_pin ... interrupt_pin
*) EPC Device EPC Device
==========
Every registered EPC device will be listed in controllers directory. The Every registered EPC device will be listed in controllers directory. The
entries corresponding to EPC device will be created by the EPC core. entries corresponding to EPC device will be created by the EPC core.
::
/sys/kernel/config/pci_ep/controllers/ /sys/kernel/config/pci_ep/controllers/
.. <EPC Device1>/ .. <EPC Device1>/
... <Symlink EPF Device11>/ ... <Symlink EPF Device11>/
... <Symlink EPF Device12>/ ... <Symlink EPF Device12>/
... start ... start
.. <EPC Device2>/ .. <EPC Device2>/
... <Symlink EPF Device21>/ ... <Symlink EPF Device21>/
... <Symlink EPF Device22>/ ... <Symlink EPF Device22>/
... start ... start
The <EPC Device> directory will have a list of symbolic links to The <EPC Device> directory will have a list of symbolic links to
<EPF Device>. These symbolic links should be created by the user to <EPF Device>. These symbolic links should be created by the user to
@ -81,7 +94,7 @@ The <EPC Device> directory will also have a *start* field. Once
"1" is written to this field, the endpoint device will be ready to "1" is written to this field, the endpoint device will be ready to
establish the link with the host. This is usually done after establish the link with the host. This is usually done after
all the EPF devices are created and linked with the EPC device. all the EPF devices are created and linked with the EPC device.
::
| controllers/ | controllers/
| <Directory: EPC name>/ | <Directory: EPC name>/
@ -102,4 +115,4 @@ all the EPF devices are created and linked with the EPC device.
| interrupt_pin | interrupt_pin
| function | function
[1] -> Documentation/PCI/endpoint/pci-endpoint.txt [1] :doc:`pci-endpoint`


@ -1,11 +1,13 @@
PCI ENDPOINT FRAMEWORK .. SPDX-License-Identifier: GPL-2.0
Kishon Vijay Abraham I <kishon@ti.com>
:Author: Kishon Vijay Abraham I <kishon@ti.com>
This document is a guide to use the PCI Endpoint Framework in order to create This document is a guide to use the PCI Endpoint Framework in order to create
endpoint controller driver, endpoint function driver, and using configfs endpoint controller driver, endpoint function driver, and using configfs
interface to bind the function driver to the controller driver. interface to bind the function driver to the controller driver.
1. Introduction Introduction
============
Linux has a comprehensive PCI subsystem to support PCI controllers that Linux has a comprehensive PCI subsystem to support PCI controllers that
operates in Root Complex mode. The subsystem has capability to scan PCI bus, operates in Root Complex mode. The subsystem has capability to scan PCI bus,
@ -19,26 +21,30 @@ add endpoint mode support in Linux. This will help to run Linux in an
EP system which can have a wide variety of use cases from testing or EP system which can have a wide variety of use cases from testing or
validation, co-processor accelerator, etc. validation, co-processor accelerator, etc.
2. PCI Endpoint Core PCI Endpoint Core
=================
The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
library, the Endpoint Function library, and the configfs layer to bind the library, the Endpoint Function library, and the configfs layer to bind the
endpoint function with the endpoint controller. endpoint function with the endpoint controller.
2.1 PCI Endpoint Controller(EPC) Library PCI Endpoint Controller(EPC) Library
------------------------------------
The EPC library provides APIs to be used by the controller that can operate The EPC library provides APIs to be used by the controller that can operate
in endpoint mode. It also provides APIs to be used by function driver/library in endpoint mode. It also provides APIs to be used by function driver/library
in order to implement a particular endpoint function. in order to implement a particular endpoint function.
2.1.1 APIs for the PCI controller Driver APIs for the PCI controller Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This section lists the APIs that the PCI Endpoint core provides to be used This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI controller driver. by the PCI controller driver.
*) devm_pci_epc_create()/pci_epc_create() * devm_pci_epc_create()/pci_epc_create()
The PCI controller driver should implement the following ops: The PCI controller driver should implement the following ops:
* write_header: ops to populate configuration space header * write_header: ops to populate configuration space header
* set_bar: ops to configure the BAR * set_bar: ops to configure the BAR
* clear_bar: ops to reset the BAR * clear_bar: ops to reset the BAR
@ -51,110 +57,116 @@ by the PCI controller driver.
The PCI controller driver can then create a new EPC device by invoking The PCI controller driver can then create a new EPC device by invoking
devm_pci_epc_create()/pci_epc_create(). devm_pci_epc_create()/pci_epc_create().
*) devm_pci_epc_destroy()/pci_epc_destroy() * devm_pci_epc_destroy()/pci_epc_destroy()
The PCI controller driver can destroy the EPC device created by either The PCI controller driver can destroy the EPC device created by either
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
pci_epc_destroy(). pci_epc_destroy().
*) pci_epc_linkup() * pci_epc_linkup()
In order to notify all the function devices that the EPC device to which In order to notify all the function devices that the EPC device to which
they are linked has established a link with the host, the PCI controller they are linked has established a link with the host, the PCI controller
driver should invoke pci_epc_linkup(). driver should invoke pci_epc_linkup().
*) pci_epc_mem_init() * pci_epc_mem_init()
Initialize the pci_epc_mem structure used for allocating EPC addr space. Initialize the pci_epc_mem structure used for allocating EPC addr space.
*) pci_epc_mem_exit() * pci_epc_mem_exit()
Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init(). Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init().
2.1.2 APIs for the PCI Endpoint Function Driver
APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This section lists the APIs that the PCI Endpoint core provides to be used This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver. by the PCI endpoint function driver.
*) pci_epc_write_header() * pci_epc_write_header()
The PCI endpoint function driver should use pci_epc_write_header() to The PCI endpoint function driver should use pci_epc_write_header() to
write the standard configuration header to the endpoint controller. write the standard configuration header to the endpoint controller.
*) pci_epc_set_bar() * pci_epc_set_bar()
The PCI endpoint function driver should use pci_epc_set_bar() to configure The PCI endpoint function driver should use pci_epc_set_bar() to configure
the Base Address Register in order for the host to assign PCI addr space. the Base Address Register in order for the host to assign PCI addr space.
Register space of the function driver is usually configured Register space of the function driver is usually configured
using this API. using this API.
*) pci_epc_clear_bar() * pci_epc_clear_bar()
The PCI endpoint function driver should use pci_epc_clear_bar() to reset The PCI endpoint function driver should use pci_epc_clear_bar() to reset
the BAR. the BAR.
*) pci_epc_raise_irq() * pci_epc_raise_irq()
The PCI endpoint function driver should use pci_epc_raise_irq() to raise The PCI endpoint function driver should use pci_epc_raise_irq() to raise
Legacy Interrupt, MSI or MSI-X Interrupt. Legacy Interrupt, MSI or MSI-X Interrupt.
*) pci_epc_mem_alloc_addr() * pci_epc_mem_alloc_addr()
The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to
allocate memory address from EPC addr space which is required to access allocate memory address from EPC addr space which is required to access
RC's buffer RC's buffer
*) pci_epc_mem_free_addr() * pci_epc_mem_free_addr()
The PCI endpoint function driver should use pci_epc_mem_free_addr() to The PCI endpoint function driver should use pci_epc_mem_free_addr() to
free the memory space allocated using pci_epc_mem_alloc_addr(). free the memory space allocated using pci_epc_mem_alloc_addr().
2.1.3 Other APIs Other APIs
~~~~~~~~~~
There are other APIs provided by the EPC library. These are used for binding There are other APIs provided by the EPC library. These are used for binding
the EPF device with EPC device. pci-ep-cfs.c can be used as reference for the EPF device with EPC device. pci-ep-cfs.c can be used as reference for
using these APIs. using these APIs.
*) pci_epc_get() * pci_epc_get()
Get a reference to the PCI endpoint controller based on the device name of Get a reference to the PCI endpoint controller based on the device name of
the controller. the controller.
*) pci_epc_put() * pci_epc_put()
Release the reference to the PCI endpoint controller obtained using Release the reference to the PCI endpoint controller obtained using
pci_epc_get() pci_epc_get()
*) pci_epc_add_epf() * pci_epc_add_epf()
Add a PCI endpoint function to a PCI endpoint controller. A PCIe device Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
can have up to 8 functions according to the specification. can have up to 8 functions according to the specification.
*) pci_epc_remove_epf() * pci_epc_remove_epf()
Remove the PCI endpoint function from PCI endpoint controller. Remove the PCI endpoint function from PCI endpoint controller.
*) pci_epc_start() * pci_epc_start()
The PCI endpoint function driver should invoke pci_epc_start() once it The PCI endpoint function driver should invoke pci_epc_start() once it
has configured the endpoint function and wants to start the PCI link. has configured the endpoint function and wants to start the PCI link.
*) pci_epc_stop() * pci_epc_stop()
The PCI endpoint function driver should invoke pci_epc_stop() to stop The PCI endpoint function driver should invoke pci_epc_stop() to stop
the PCI LINK. the PCI LINK.
2.2 PCI Endpoint Function(EPF) Library
PCI Endpoint Function(EPF) Library
----------------------------------
The EPF library provides APIs to be used by the function driver and the EPC The EPF library provides APIs to be used by the function driver and the EPC
library to provide endpoint mode functionality. library to provide endpoint mode functionality.
2.2.1 APIs for the PCI Endpoint Function Driver APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This section lists the APIs that the PCI Endpoint core provides to be used This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver. by the PCI endpoint function driver.
*) pci_epf_register_driver() * pci_epf_register_driver()
The PCI Endpoint Function driver should implement the following ops: The PCI Endpoint Function driver should implement the following ops:
* bind: ops to perform when an EPC device has been bound to an EPF device * bind: ops to perform when an EPC device has been bound to an EPF device
@ -166,50 +178,54 @@ by the PCI endpoint function driver.
The PCI Function driver can then register the PCI EPF driver by using The PCI Function driver can then register the PCI EPF driver by using
pci_epf_register_driver(). pci_epf_register_driver().
*) pci_epf_unregister_driver() * pci_epf_unregister_driver()
The PCI Function driver can unregister the PCI EPF driver by using The PCI Function driver can unregister the PCI EPF driver by using
pci_epf_unregister_driver(). pci_epf_unregister_driver().
*) pci_epf_alloc_space() * pci_epf_alloc_space()
The PCI Function driver can allocate space for a particular BAR using The PCI Function driver can allocate space for a particular BAR using
pci_epf_alloc_space(). pci_epf_alloc_space().
*) pci_epf_free_space() * pci_epf_free_space()
The PCI Function driver can free the allocated space The PCI Function driver can free the allocated space
(using pci_epf_alloc_space) by invoking pci_epf_free_space(). (using pci_epf_alloc_space) by invoking pci_epf_free_space().
2.2.2 APIs for the PCI Endpoint Controller Library APIs for the PCI Endpoint Controller Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This section lists the APIs that the PCI Endpoint core provides to be used This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint controller library. by the PCI endpoint controller library.
*) pci_epf_linkup() * pci_epf_linkup()
The PCI endpoint controller library invokes pci_epf_linkup() when the The PCI endpoint controller library invokes pci_epf_linkup() when the
EPC device has established the connection to the host. EPC device has established the connection to the host.
2.2.2 Other APIs Other APIs
~~~~~~~~~~
There are other APIs provided by the EPF library. These are used to notify There are other APIs provided by the EPF library. These are used to notify
the function driver when the EPF device is bound to the EPC device. the function driver when the EPF device is bound to the EPC device.
pci-ep-cfs.c can be used as reference for using these APIs. pci-ep-cfs.c can be used as reference for using these APIs.
*) pci_epf_create() * pci_epf_create()
Create a new PCI EPF device by passing the name of the PCI EPF device. Create a new PCI EPF device by passing the name of the PCI EPF device.
This name will be used to bind the EPF device to an EPF driver. This name will be used to bind the EPF device to an EPF driver.
*) pci_epf_destroy() * pci_epf_destroy()
Destroy the created PCI EPF device. Destroy the created PCI EPF device.
*) pci_epf_bind() * pci_epf_bind()
pci_epf_bind() should be invoked when the EPF device has been bound to pci_epf_bind() should be invoked when the EPF device has been bound to
an EPC device. an EPC device.
*) pci_epf_unbind() * pci_epf_unbind()
pci_epf_unbind() should be invoked when the binding between EPC device pci_epf_unbind() should be invoked when the binding between EPC device
and EPF device is lost. and EPF device is lost.


@ -1,5 +1,10 @@
PCI TEST .. SPDX-License-Identifier: GPL-2.0
Kishon Vijay Abraham I <kishon@ti.com>
=================
PCI Test Function
=================
:Author: Kishon Vijay Abraham I <kishon@ti.com>
Traditionally PCI RC has always been validated by using standard Traditionally PCI RC has always been validated by using standard
PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards. PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards.
@ -23,65 +28,76 @@ The PCI endpoint test device has the following registers:
8) PCI_ENDPOINT_TEST_IRQ_TYPE 8) PCI_ENDPOINT_TEST_IRQ_TYPE
9) PCI_ENDPOINT_TEST_IRQ_NUMBER 9) PCI_ENDPOINT_TEST_IRQ_NUMBER
*) PCI_ENDPOINT_TEST_MAGIC * PCI_ENDPOINT_TEST_MAGIC
This register will be used to test BAR0. A known pattern will be written This register will be used to test BAR0. A known pattern will be written
and read back from MAGIC register to verify BAR0. and read back from MAGIC register to verify BAR0.
*) PCI_ENDPOINT_TEST_COMMAND: * PCI_ENDPOINT_TEST_COMMAND
This register will be used by the host driver to indicate the function This register will be used by the host driver to indicate the function
that the endpoint device must perform. that the endpoint device must perform.
Bitfield Description: ======== ================================================================
Bit 0 : raise legacy IRQ Bitfield Description
Bit 1 : raise MSI IRQ ======== ================================================================
Bit 2 : raise MSI-X IRQ Bit 0 raise legacy IRQ
Bit 3 : read command (read data from RC buffer) Bit 1 raise MSI IRQ
Bit 4 : write command (write data to RC buffer) Bit 2 raise MSI-X IRQ
Bit 5 : copy command (copy data from one RC buffer to another Bit 3 read command (read data from RC buffer)
RC buffer) Bit 4 write command (write data to RC buffer)
Bit 5 copy command (copy data from one RC buffer to another RC buffer)
======== ================================================================
*) PCI_ENDPOINT_TEST_STATUS * PCI_ENDPOINT_TEST_STATUS
This register reflects the status of the PCI endpoint device. This register reflects the status of the PCI endpoint device.
Bitfield Description: ======== ==============================
Bit 0 : read success Bitfield Description
Bit 1 : read fail ======== ==============================
Bit 2 : write success Bit 0 read success
Bit 3 : write fail Bit 1 read fail
Bit 4 : copy success Bit 2 write success
Bit 5 : copy fail Bit 3 write fail
Bit 6 : IRQ raised Bit 4 copy success
Bit 7 : source address is invalid Bit 5 copy fail
Bit 8 : destination address is invalid Bit 6 IRQ raised
Bit 7 source address is invalid
Bit 8 destination address is invalid
======== ==============================
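The COMMAND and STATUS bit layouts above can be modelled in userspace C, as a minimal sketch. Only the bit positions come from the tables; the macro names (`CMD_*`, `STATUS_*`) and the helpers `read_result()` and `has_address_error()` are hypothetical and are not taken from the driver source.

```c
#include <stdint.h>

/* PCI_ENDPOINT_TEST_COMMAND bits (positions from the table above;
 * names are invented for this sketch). */
#define CMD_RAISE_LEGACY_IRQ    (1u << 0)
#define CMD_RAISE_MSI_IRQ       (1u << 1)
#define CMD_RAISE_MSIX_IRQ      (1u << 2)
#define CMD_READ                (1u << 3)
#define CMD_WRITE               (1u << 4)
#define CMD_COPY                (1u << 5)

/* PCI_ENDPOINT_TEST_STATUS bits (positions from the table above;
 * names are invented for this sketch). */
#define STATUS_READ_SUCCESS     (1u << 0)
#define STATUS_READ_FAIL        (1u << 1)
#define STATUS_WRITE_SUCCESS    (1u << 2)
#define STATUS_WRITE_FAIL       (1u << 3)
#define STATUS_COPY_SUCCESS     (1u << 4)
#define STATUS_COPY_FAIL        (1u << 5)
#define STATUS_IRQ_RAISED       (1u << 6)
#define STATUS_SRC_ADDR_INVALID (1u << 7)
#define STATUS_DST_ADDR_INVALID (1u << 8)

/* Returns 1 if the status reports a successful read, 0 if it reports a
 * failed read, and -1 if neither read bit is set. */
int read_result(uint32_t status)
{
	if (status & STATUS_READ_FAIL)
		return 0;
	if (status & STATUS_READ_SUCCESS)
		return 1;
	return -1;
}

/* Returns 1 if either the source or destination address was flagged
 * as invalid, 0 otherwise. */
int has_address_error(uint32_t status)
{
	return (status & (STATUS_SRC_ADDR_INVALID | STATUS_DST_ADDR_INVALID)) != 0;
}
```

A host-side test would typically write one `CMD_*` bit, wait for the interrupt, and then pass the value read back from the STATUS register to helpers like these.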
*) PCI_ENDPOINT_TEST_SRC_ADDR * PCI_ENDPOINT_TEST_SRC_ADDR
This register contains the source address (RC buffer address) for the This register contains the source address (RC buffer address) for the
COPY/READ command. COPY/READ command.
*) PCI_ENDPOINT_TEST_DST_ADDR * PCI_ENDPOINT_TEST_DST_ADDR
This register contains the destination address (RC buffer address) for This register contains the destination address (RC buffer address) for
the COPY/WRITE command. the COPY/WRITE command.
*) PCI_ENDPOINT_TEST_IRQ_TYPE * PCI_ENDPOINT_TEST_IRQ_TYPE
This register contains the interrupt type (Legacy/MSI) triggered This register contains the interrupt type (Legacy/MSI) triggered
for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands. for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands.
Possible types: Possible types:
- Legacy : 0
- MSI : 1
- MSI-X : 2
*) PCI_ENDPOINT_TEST_IRQ_NUMBER ====== ==
Legacy 0
MSI 1
MSI-X 2
====== ==
* PCI_ENDPOINT_TEST_IRQ_NUMBER
This register contains the ID of the triggered interrupt. This register contains the ID of the triggered interrupt.
Admissible values: Admissible values:
- Legacy : 0
- MSI : [1 .. 32] ====== ===========
- MSI-X : [1 .. 2048] Legacy 0
MSI [1 .. 32]
MSI-X [1 .. 2048]
====== ===========
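The admissible IRQ_TYPE and IRQ_NUMBER combinations listed above can be checked with a small helper. This is a sketch only: the enum `irq_type` and the function `irq_number_valid()` are hypothetical names, while the numeric values and ranges are copied from the two tables.

```c
#include <stdbool.h>

/* Values of PCI_ENDPOINT_TEST_IRQ_TYPE as listed above; the enum and
 * its constant names are invented for this sketch. */
enum irq_type {
	IRQ_TYPE_LEGACY = 0,
	IRQ_TYPE_MSI    = 1,
	IRQ_TYPE_MSIX   = 2,
};

/* Returns true when 'number' is an admissible IRQ_NUMBER value for the
 * given IRQ_TYPE value, false otherwise (including unknown types). */
bool irq_number_valid(int type, unsigned int number)
{
	switch (type) {
	case IRQ_TYPE_LEGACY:
		return number == 0;
	case IRQ_TYPE_MSI:
		return number >= 1 && number <= 32;
	case IRQ_TYPE_MSIX:
		return number >= 1 && number <= 2048;
	default:
		return false;
	}
}
```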


@ -1,38 +1,51 @@
PCI TEST USERGUIDE .. SPDX-License-Identifier: GPL-2.0
Kishon Vijay Abraham I <kishon@ti.com>
===================
PCI Test User Guide
===================
:Author: Kishon Vijay Abraham I <kishon@ti.com>
This document is a guide to help users use pci-epf-test function driver This document is a guide to help users use pci-epf-test function driver
and pci_endpoint_test host driver for testing PCI. The list of steps to and pci_endpoint_test host driver for testing PCI. The list of steps to
be followed in the host side and EP side is given below. be followed in the host side and EP side is given below.
1. Endpoint Device Endpoint Device
===============
1.1 Endpoint Controller Devices Endpoint Controller Devices
---------------------------
To find the list of endpoint controller devices in the system: To find the list of endpoint controller devices in the system::
# ls /sys/class/pci_epc/ # ls /sys/class/pci_epc/
51000000.pcie_ep 51000000.pcie_ep
If PCI_ENDPOINT_CONFIGFS is enabled If PCI_ENDPOINT_CONFIGFS is enabled::
# ls /sys/kernel/config/pci_ep/controllers # ls /sys/kernel/config/pci_ep/controllers
51000000.pcie_ep 51000000.pcie_ep
1.2 Endpoint Function Drivers
To find the list of endpoint function drivers in the system: Endpoint Function Drivers
-------------------------
To find the list of endpoint function drivers in the system::
# ls /sys/bus/pci-epf/drivers # ls /sys/bus/pci-epf/drivers
pci_epf_test pci_epf_test
If PCI_ENDPOINT_CONFIGFS is enabled If PCI_ENDPOINT_CONFIGFS is enabled::
# ls /sys/kernel/config/pci_ep/functions # ls /sys/kernel/config/pci_ep/functions
pci_epf_test pci_epf_test
1.3 Creating pci-epf-test Device
Creating pci-epf-test Device
----------------------------
PCI endpoint function device can be created using the configfs. To create PCI endpoint function device can be created using the configfs. To create
pci-epf-test device, the following commands can be used pci-epf-test device, the following commands can be used::
# mount -t configfs none /sys/kernel/config # mount -t configfs none /sys/kernel/config
# cd /sys/kernel/config/pci_ep/ # cd /sys/kernel/config/pci_ep/
@ -42,7 +55,7 @@ The "mkdir func1" above creates the pci-epf-test function device that will
be probed by pci_epf_test driver. be probed by pci_epf_test driver.
The PCI endpoint framework populates the directory with the following The PCI endpoint framework populates the directory with the following
configurable fields. configurable fields::
# ls functions/pci_epf_test/func1 # ls functions/pci_epf_test/func1
baseclass_code interrupt_pin progif_code subsys_id baseclass_code interrupt_pin progif_code subsys_id
@ -51,67 +64,83 @@ configurable fields.
The PCI endpoint function driver populates these entries with default values The PCI endpoint function driver populates these entries with default values
when the device is bound to the driver. The pci-epf-test driver populates when the device is bound to the driver. The pci-epf-test driver populates
vendorid with 0xffff and interrupt_pin with 0x0001 vendorid with 0xffff and interrupt_pin with 0x0001::
# cat functions/pci_epf_test/func1/vendorid # cat functions/pci_epf_test/func1/vendorid
0xffff 0xffff
# cat functions/pci_epf_test/func1/interrupt_pin # cat functions/pci_epf_test/func1/interrupt_pin
0x0001 0x0001
1.4 Configuring pci-epf-test Device
Configuring pci-epf-test Device
-------------------------------
The user can configure the pci-epf-test device using configfs entry. In order The user can configure the pci-epf-test device using configfs entry. In order
to change the vendorid and the number of MSI interrupts used by the function to change the vendorid and the number of MSI interrupts used by the function
device, the following commands can be used. device, the following commands can be used::
# echo 0x104c > functions/pci_epf_test/func1/vendorid # echo 0x104c > functions/pci_epf_test/func1/vendorid
# echo 0xb500 > functions/pci_epf_test/func1/deviceid # echo 0xb500 > functions/pci_epf_test/func1/deviceid
# echo 16 > functions/pci_epf_test/func1/msi_interrupts # echo 16 > functions/pci_epf_test/func1/msi_interrupts
# echo 8 > functions/pci_epf_test/func1/msix_interrupts # echo 8 > functions/pci_epf_test/func1/msix_interrupts
1.5 Binding pci-epf-test Device to EP Controller
Binding pci-epf-test Device to EP Controller
--------------------------------------------
In order for the endpoint function device to be useful, it has to be bound to In order for the endpoint function device to be useful, it has to be bound to
a PCI endpoint controller driver. Use the configfs to bind the function a PCI endpoint controller driver. Use the configfs to bind the function
device to one of the controller drivers present in the system. device to one of the controller drivers present in the system::
# ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/
Once the above step is completed, the PCI endpoint is ready to establish a link Once the above step is completed, the PCI endpoint is ready to establish a link
with the host. with the host.
1.6 Start the Link
Start the Link
--------------
In order for the endpoint device to establish a link with the host, the _start_ In order for the endpoint device to establish a link with the host, the _start_
field should be populated with '1'. field should be populated with '1'::
# echo 1 > controllers/51000000.pcie_ep/start # echo 1 > controllers/51000000.pcie_ep/start
2. RootComplex Device
2.1 lspci Output RootComplex Device
==================
Note that the devices listed here correspond to the value populated in 1.4 above lspci Output
------------
Note that the devices listed here correspond to the value populated in 1.4
above::
00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01)
01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500
2.2 Using Endpoint Test function Device
Using Endpoint Test function Device
-----------------------------------
pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint
tests. To compile this tool the following commands should be used: tests. To compile this tool the following commands should be used::
# cd <kernel-dir> # cd <kernel-dir>
# make -C tools/pci # make -C tools/pci
or if you desire to compile and install in your system: or if you desire to compile and install in your system::
# cd <kernel-dir> # cd <kernel-dir>
# make -C tools/pci install # make -C tools/pci install
The tool and script will be located in <rootfs>/usr/bin/ The tool and script will be located in <rootfs>/usr/bin/
2.2.1 pcitest.sh Output
pcitest.sh Output
~~~~~~~~~~~~~~~~~
::
# pcitest.sh # pcitest.sh
BAR tests BAR tests


@ -0,0 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
=======================
Linux PCI Bus Subsystem
=======================
.. toctree::
:maxdepth: 2
:numbered:
pci
picebus-howto
pci-iov-howto
msi-howto
acpi-info
pci-error-recovery
pcieaer-howto
endpoint/index


@ -1,13 +1,16 @@
The MSI Driver Guide HOWTO .. SPDX-License-Identifier: GPL-2.0
Tom L Nguyen tom.l.nguyen@intel.com .. include:: <isonum.txt>
10/03/2003
Revised Feb 12, 2004 by Martine Silbermann
email: Martine.Silbermann@hp.com
Revised Jun 25, 2004 by Tom L Nguyen
Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
Copyright 2003, 2008 Intel Corporation
1. About this guide ==========================
The MSI Driver Guide HOWTO
==========================
:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox
:Copyright: 2003, 2008 Intel Corporation
About this guide
================
This guide describes the basics of Message Signaled Interrupts (MSIs), This guide describes the basics of Message Signaled Interrupts (MSIs),
the advantages of using MSI over traditional interrupt mechanisms, how the advantages of using MSI over traditional interrupt mechanisms, how
@ -15,7 +18,8 @@ to change your driver to use MSI or MSI-X and some basic diagnostics to
try if a device doesn't support MSIs. try if a device doesn't support MSIs.
2. What are MSIs? What are MSIs?
==============
A Message Signaled Interrupt is a write from the device to a special A Message Signaled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU. address which causes an interrupt to be received by the CPU.
@ -29,7 +33,8 @@ Devices may support both MSI and MSI-X, but only one can be enabled at
a time. a time.
3. Why use MSIs? Why use MSIs?
=============
There are three reasons why using MSIs can give an advantage over There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts. traditional pin-based interrupts.
@ -61,14 +66,16 @@ Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller. in a network card or each port in a storage controller.
4. How to use MSIs How to use MSIs
===============
PCI devices are initialised to use pin-based interrupts. The device PCI devices are initialised to use pin-based interrupts. The device
driver has to set up the device to use MSI or MSI-X. Not all machines driver has to set up the device to use MSI or MSI-X. Not all machines
support MSIs correctly, and for those machines, the APIs described below support MSIs correctly, and for those machines, the APIs described below
will simply fail and the device will continue to use pin-based interrupts. will simply fail and the device will continue to use pin-based interrupts.
4.1 Include kernel support for MSIs Include kernel support for MSIs
-------------------------------
To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
option enabled. This option is only available on some architectures, option enabled. This option is only available on some architectures,
@ -76,14 +83,15 @@ and it may depend on some other options also being set. For example,
on x86, you must also enable X86_UP_APIC or SMP in order to see the on x86, you must also enable X86_UP_APIC or SMP in order to see the
CONFIG_PCI_MSI option. CONFIG_PCI_MSI option.
4.2 Using MSI Using MSI
---------
Most of the hard work is done for the driver in the PCI layer. The driver Most of the hard work is done for the driver in the PCI layer. The driver
simply has to request that the PCI layer set up the MSI capability for this simply has to request that the PCI layer set up the MSI capability for this
device. device.
To automatically use MSI or MSI-X interrupt vectors, use the following To automatically use MSI or MSI-X interrupt vectors, use the following
function: function::
int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
unsigned int max_vecs, unsigned int flags); unsigned int max_vecs, unsigned int flags);
@ -101,12 +109,12 @@ any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs. pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
To get the Linux IRQ numbers passed to request_irq() and free_irq() and the To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
vectors, use the following function: vectors, use the following function::
int pci_irq_vector(struct pci_dev *dev, unsigned int nr); int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
Any allocated resources should be freed before removing the device using Any allocated resources should be freed before removing the device using
the following function: the following function::
void pci_free_irq_vectors(struct pci_dev *dev); void pci_free_irq_vectors(struct pci_dev *dev);
@ -126,7 +134,7 @@ The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
as possible, likely up to the limit supported by the device. If nvec is as possible, likely up to the limit supported by the device. If nvec is
larger than the number supported by the device it will automatically be larger than the number supported by the device it will automatically be
capped to the supported limit, so there is no need to query the number of capped to the supported limit, so there is no need to query the number of
vectors supported beforehand: vectors supported beforehand::
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES) nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES)
if (nvec < 0) if (nvec < 0)
@ -135,7 +143,7 @@ vectors supported beforehand:
If a driver is unable or unwilling to deal with a variable number of MSI If a driver is unable or unwilling to deal with a variable number of MSI
interrupts it can request a particular number of interrupts by passing that interrupts it can request a particular number of interrupts by passing that
number to pci_alloc_irq_vectors() function as both 'min_vecs' and number to pci_alloc_irq_vectors() function as both 'min_vecs' and
'max_vecs' parameters: 'max_vecs' parameters::
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES); ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
if (ret < 0) if (ret < 0)
@ -143,23 +151,24 @@ number to pci_alloc_irq_vectors() function as both 'min_vecs' and
The most notorious example of the request type described above is enabling The most notorious example of the request type described above is enabling
the single MSI mode for a device. It could be done by passing two 1s as the single MSI mode for a device. It could be done by passing two 1s as
'min_vecs' and 'max_vecs': 'min_vecs' and 'max_vecs'::
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES); ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
if (ret < 0) if (ret < 0)
goto out_err; goto out_err;
Some devices might not support using legacy line interrupts, in which case Some devices might not support using legacy line interrupts, in which case
the driver can specify that only MSI or MSI-X is acceptable: the driver can specify that only MSI or MSI-X is acceptable::
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX); nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
if (nvec < 0) if (nvec < 0)
goto out_err; goto out_err;
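The 'min_vecs'/'max_vecs' behaviour described in this section (the request is capped to what the device supports and fails when the minimum cannot be met) can be illustrated with a toy userspace model. This is not the kernel implementation: `model_alloc_vectors()` and its `supported` parameter are hypothetical, and returning -ENOSPC when the minimum cannot be satisfied is an assumption of this sketch.

```c
#include <errno.h>

/* Toy model of the vector-count negotiation.
 *   supported: number of vectors the (imaginary) device can provide
 *   min_vecs:  smallest count the driver can work with
 *   max_vecs:  largest count the driver would like
 * Returns the number of vectors granted, or a negative errno value. */
int model_alloc_vectors(unsigned int supported, unsigned int min_vecs,
			unsigned int max_vecs)
{
	/* Cap the request to what the device supports. */
	unsigned int granted = max_vecs < supported ? max_vecs : supported;

	/* Fail if even the capped count is below the driver's minimum. */
	if (granted < min_vecs)
		return -ENOSPC;

	return (int)granted;
}
```

Passing the same value as both minimum and maximum, as in the single-MSI example, turns the call into an all-or-nothing request.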
4.3 Legacy APIs Legacy APIs
-----------
The following old APIs to enable and disable MSI or MSI-X interrupts should The following old APIs to enable and disable MSI or MSI-X interrupts should
not be used in new code: not be used in new code::
pci_enable_msi() /* deprecated */ pci_enable_msi() /* deprecated */
pci_disable_msi() /* deprecated */ pci_disable_msi() /* deprecated */
@ -174,9 +183,11 @@ number of vectors. If you have a legitimate special use case for the count
of vectors we might have to revisit that decision and add a of vectors we might have to revisit that decision and add a
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently. pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
4.4 Considerations when using MSIs Considerations when using MSIs
------------------------------
4.4.1 Spinlocks Spinlocks
~~~~~~~~~
Most device drivers have a per-device spinlock which is taken in the Most device drivers have a per-device spinlock which is taken in the
interrupt handler. With pin-based interrupts or a single MSI, it is not interrupt handler. With pin-based interrupts or a single MSI, it is not
@ -188,7 +199,8 @@ acquire the spinlock. Such deadlocks can be avoided by using
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
and acquire the lock (see Documentation/kernel-hacking/locking.rst). and acquire the lock (see Documentation/kernel-hacking/locking.rst).
4.5 How to tell whether MSI/MSI-X is enabled on a device How to tell whether MSI/MSI-X is enabled on a device
----------------------------------------------------
Using 'lspci -v' (as root) may show some devices with "MSI", "Message Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
@ -196,7 +208,8 @@ has an 'Enable' flag which is followed with either "+" (enabled)
or "-" (disabled). or "-" (disabled).
5. MSI quirks MSI quirks
==========
Several PCI chipsets or devices are known not to support MSIs. Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs: The PCI stack provides three ways to disable MSIs:
@ -205,7 +218,8 @@ The PCI stack provides three ways to disable MSIs:
2. on all devices behind a specific bridge 2. on all devices behind a specific bridge
3. on a single device 3. on a single device
5.1. Disabling MSIs globally Disabling MSIs globally
-----------------------
Some host chipsets simply don't support MSIs properly. If we're Some host chipsets simply don't support MSIs properly. If we're
lucky, the manufacturer knows this and has indicated it in the ACPI lucky, the manufacturer knows this and has indicated it in the ACPI
@ -219,7 +233,8 @@ on the kernel command line to disable MSIs on all devices. It would be
in your best interests to report the problem to linux-pci@vger.kernel.org in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel. including a full 'lspci -v' so we can add the quirks to the kernel.
5.2. Disabling MSIs below a bridge Disabling MSIs below a bridge
-----------------------------
Some PCI bridges are not able to route MSIs between busses properly. Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge. In this case, MSIs must be disabled on all devices behind the bridge.
@ -230,7 +245,7 @@ as the nVidia nForce and Serverworks HT2000). As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can. Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing: enable MSIs on that bridge by doing::
echo 1 > /sys/bus/pci/devices/$bridge/msi_bus echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
@ -244,7 +259,8 @@ below this bridge.
Again, please notify linux-pci@vger.kernel.org of any bridges that need Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling. special handling.
5.3. Disabling MSIs on a single device Disabling MSIs on a single device
---------------------------------
Some devices are known to have faulty MSI implementations. Usually this Some devices are known to have faulty MSI implementations. Usually this
is handled in the individual device driver, but occasionally it's necessary is handled in the individual device driver, but occasionally it's necessary
@ -252,7 +268,8 @@ to handle this with a quirk. Some drivers have an option to disable use
of MSI. While this is a convenient workaround for the driver author, of MSI. While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated. it is not good practice, and should not be emulated.
5.4. Finding why MSIs are disabled on a device Finding why MSIs are disabled on a device
-----------------------------------------
From the above three sections, you can see that there are many reasons From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device. Your first step should why MSIs may not be enabled for a given device. Your first step should
@ -260,8 +277,8 @@ be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine. You should also check your .config to be sure you for your machine. You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI. have enabled CONFIG_PCI_MSI.
Then, 'lspci -t' gives the list of bridges above a device. Reading Then, 'lspci -t' gives the list of bridges above a device. Reading
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1) `/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1)
or disabled (0). If 0 is found in any of the msi_bus files belonging or disabled (0). If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled. to bridges between the PCI root and the device, MSIs are disabled.
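The rule in the paragraph above can be written as a short check. This sketch assumes the msi_bus values have already been read from sysfs for every bridge between the PCI root and the device; the function name `msi_enabled_on_path()` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* msi_bus[i] holds the value (0 or 1) read from the msi_bus file of the
 * i-th bridge between the PCI root and the device. MSIs are considered
 * disabled as soon as any bridge on the path reports 0. */
bool msi_enabled_on_path(const int *msi_bus, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (msi_bus[i] == 0)
			return false;
	}
	return true;
}
```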


@ -1,12 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
PCI Error Recovery ==================
------------------ PCI Error Recovery
February 2, 2006 ==================
Current document maintainer:
Linas Vepstas <linasvepstas@gmail.com> :Authors: - Linas Vepstas <linasvepstas@gmail.com>
updated by Richard Lary <rlary@us.ibm.com> - Richard Lary <rlary@us.ibm.com>
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009 - Mike Mason <mmlnx@us.ibm.com>
Many PCI bus controllers are able to detect a variety of hardware Many PCI bus controllers are able to detect a variety of hardware
@ -63,7 +64,8 @@ mechanisms for dealing with SCSI bus errors and SCSI bus resets.
Detailed Design Detailed Design
--------------- ===============
Design and implementation details below, based on a chain of Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005. public email discussions with Ben Herrenschmidt, circa 5 April 2005.
@ -73,30 +75,33 @@ pci_driver. A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent. The and the actual recovery steps taken are platform dependent. The
arch/powerpc implementation will simulate a PCI hotplug remove/add. arch/powerpc implementation will simulate a PCI hotplug remove/add.
This structure has the form: This structure has the form::
struct pci_error_handlers
{
int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
int (*mmio_enabled)(struct pci_dev *dev);
int (*slot_reset)(struct pci_dev *dev);
void (*resume)(struct pci_dev *dev);
};
The possible channel states are: struct pci_error_handlers
enum pci_channel_state { {
pci_channel_io_normal, /* I/O channel is in normal state */ int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
pci_channel_io_frozen, /* I/O to channel is blocked */ int (*mmio_enabled)(struct pci_dev *dev);
pci_channel_io_perm_failure, /* PCI card is dead */ int (*slot_reset)(struct pci_dev *dev);
}; void (*resume)(struct pci_dev *dev);
};
Possible return values are: The possible channel states are::
enum pci_ers_result {
PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ enum pci_channel_state {
PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ pci_channel_io_normal, /* I/O channel is in normal state */
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ pci_channel_io_frozen, /* I/O to channel is blocked */
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ pci_channel_io_perm_failure, /* PCI card is dead */
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ };
};
Possible return values are::
enum pci_ers_result {
PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
};
A driver does not have to implement all of these callbacks; however, A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected(). If a callback if it implements any, it must implement error_detected(). If a callback
@ -134,16 +139,17 @@ shouldn't do any new IOs. Called in task context. This is sort of a
All drivers participating in this system must implement this call. All drivers participating in this system must implement this call.
The driver must return one of the following result codes: The driver must return one of the following result codes:
- PCI_ERS_RESULT_CAN_RECOVER:
Driver returns this if it thinks it might be able to recover - PCI_ERS_RESULT_CAN_RECOVER
the HW by just banging IOs or if it wants to be given Driver returns this if it thinks it might be able to recover
a chance to extract some diagnostic information (see the HW by just banging IOs or if it wants to be given
mmio_enable, below). a chance to extract some diagnostic information (see
- PCI_ERS_RESULT_NEED_RESET: mmio_enable, below).
Driver returns this if it can't recover without a - PCI_ERS_RESULT_NEED_RESET
slot reset. Driver returns this if it can't recover without a
- PCI_ERS_RESULT_DISCONNECT: slot reset.
Driver returns this if it doesn't want to recover at all. - PCI_ERS_RESULT_DISCONNECT
Driver returns this if it doesn't want to recover at all.
The next step taken will depend on the result codes returned by the The next step taken will depend on the result codes returned by the
drivers. drivers.
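The result codes for error_detected() listed above can be illustrated with a small user-space sketch. The enums are stand-ins that mirror the listings in this document, `struct pci_dev` is reduced to a single hypothetical field, and the policy (disconnect on permanent failure, ask for a reset when frozen, otherwise try to recover) is only one plausible choice, not what any particular driver does.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins mirroring the enums shown in this document. */
enum pci_channel_state {
	pci_channel_io_normal,		/* I/O channel is in normal state */
	pci_channel_io_frozen,		/* I/O to channel is blocked */
	pci_channel_io_perm_failure,	/* PCI card is dead */
};

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical, heavily reduced device structure for illustration only. */
struct pci_dev {
	int io_stopped;	/* set once the driver has stopped issuing new I/O */
};

/* Sketch of an error_detected() callback under the assumptions above. */
static enum pci_ers_result example_error_detected(struct pci_dev *dev,
						   enum pci_channel_state state)
{
	assert(dev != NULL);

	/* The driver should not start any new I/O from this point on. */
	dev->io_stopped = 1;

	switch (state) {
	case pci_channel_io_perm_failure:
		/* The card is dead; do not attempt recovery. */
		return PCI_ERS_RESULT_DISCONNECT;
	case pci_channel_io_frozen:
		/* I/O is blocked; this sketch asks for a slot reset. */
		return PCI_ERS_RESULT_NEED_RESET;
	case pci_channel_io_normal:
	default:
		/* I/O still works; try to recover without a reset. */
		return PCI_ERS_RESULT_CAN_RECOVER;
	}
}
```

Returning PCI_ERS_RESULT_CAN_RECOVER here would let the platform proceed to the MMIO Enabled step, where the driver could extract diagnostic information before deciding whether a reset is needed.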
@ -159,25 +165,27 @@ then recovery proceeds to STEP 4 (Slot Reset).
If the platform is unable to recover the slot, the next step If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure). is STEP 6 (Permanent Failure).
>>> The current powerpc implementation assumes that a device driver will .. note::
>>> *not* schedule or semaphore in this routine; the current powerpc
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.
>>> The current powerpc implementation doesn't much care if the device The current powerpc implementation assumes that a device driver will
>>> attempts I/O at this point, or not. I/O's will fail, returning *not* schedule or semaphore in this routine; the current powerpc
>>> a value of 0xff on read, and writes will be dropped. If more than implementation uses one kernel thread to notify all devices;
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH thus, if one device sleeps/schedules, all devices are affected.
>>> assumes that the device driver has gone into an infinite loop Doing better requires complex multi-threaded logic in the error
>>> and prints an error to syslog. A reboot is then required to recovery implementation (e.g. waiting for all notification threads
>>> get the device working again. to "join" before proceeding with recovery.) This seems excessively
complex and not worth implementing.
The current powerpc implementation doesn't much care if the device
attempts I/O at this point, or not. I/O's will fail, returning
a value of 0xff on read, and writes will be dropped. If more than
EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
assumes that the device driver has gone into an infinite loop
and prints an error to syslog. A reboot is then required to
get the device working again.
STEP 2: MMIO Enabled STEP 2: MMIO Enabled
------------------- --------------------
The platform re-enables MMIO to the device (but typically not the The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected DMA), and then calls the mmio_enabled() callback on all affected
device drivers. device drivers.
@ -192,34 +200,36 @@ link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
>>> The following is proposed; no platform implements this yet: .. note::
>>> Proposal: All I/O's should be done _synchronously_ from within
>>> this callback, errors triggered by them will be returned via The following is proposed; no platform implements this yet:
>>> the normal pci_check_whatever() API, no new error_detected() Proposal: All I/O's should be done _synchronously_ from within
>>> callback will be issued due to an error happening here. However, this callback, errors triggered by them will be returned via
>>> such an error might cause IOs to be re-blocked for the whole the normal pci_check_whatever() API, no new error_detected()
>>> segment, and thus invalidate the recovery that other devices callback will be issued due to an error happening here. However,
>>> on the same segment might have done, forcing the whole segment such an error might cause IOs to be re-blocked for the whole
>>> into one of the next states, that is, link reset or slot reset. segment, and thus invalidate the recovery that other devices
on the same segment might have done, forcing the whole segment
into one of the next states, that is, link reset or slot reset.
The driver should return one of the following result codes: The driver should return one of the following result codes:
- PCI_ERS_RESULT_RECOVERED - PCI_ERS_RESULT_RECOVERED
Driver returns this if it thinks the device is fully Driver returns this if it thinks the device is fully
functional and thinks it is ready to start functional and thinks it is ready to start
normal driver operations again. There is no normal driver operations again. There is no
guarantee that the driver will actually be guarantee that the driver will actually be
allowed to proceed, as another driver on the allowed to proceed, as another driver on the
same segment might have failed and thus triggered a same segment might have failed and thus triggered a
slot reset on platforms that support it. slot reset on platforms that support it.
- PCI_ERS_RESULT_NEED_RESET - PCI_ERS_RESULT_NEED_RESET
Driver returns this if it thinks the device is not Driver returns this if it thinks the device is not
recoverable in its current state and it needs a slot recoverable in its current state and it needs a slot
reset to proceed. reset to proceed.
- PCI_ERS_RESULT_DISCONNECT - PCI_ERS_RESULT_DISCONNECT
Same as above. Total failure, no recovery even after Same as above. Total failure, no recovery even after
reset driver dead. (To be defined more precisely) reset driver dead. (To be defined more precisely)
The next step taken depends on the results returned by the drivers. The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
@ -293,31 +303,33 @@ device will be considered "dead" in this case.
Drivers for multi-function cards will need to coordinate among Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot" themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53cxx2 or global device initialization. For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0: driver performs device init only from PCI function 0::
+ if (PCI_FUNC(pdev->devfn) == 0) + if (PCI_FUNC(pdev->devfn) == 0)
+ sym_reset_scsi_bus(np, 0); + sym_reset_scsi_bus(np, 0);
Result codes: Result codes:
- PCI_ERS_RESULT_DISCONNECT - PCI_ERS_RESULT_DISCONNECT
Same as above. Same as above.
Drivers for PCI Express cards that require a fundamental reset must Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function. set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types: PCI card types::
+ /* Set EEH reset type to fundamental if required by hba */ + /* Set EEH reset type to fundamental if required by hba */
+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) + if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
+ pdev->needs_freset = 1; + pdev->needs_freset = 1;
+ +
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure). Failure).
>>> The current powerpc implementation does not try a power-cycle .. note::
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should. The current powerpc implementation does not try a power-cycle
reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
However, it probably should.
STEP 5: Resume Operations STEP 5: Resume Operations
@ -370,44 +382,43 @@ The current policy is to turn this into a platform policy.
That is, the recovery API only requires that: That is, the recovery API only requires that:
- There is no guarantee that interrupt delivery can proceed from any - There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the device on the segment starting from the error detection and until the
slot_reset callback is called, at which point interrupts are expected slot_reset callback is called, at which point interrupts are expected
to be fully operational. to be fully operational.
- There is no guarantee that interrupt delivery is stopped, that is, - There is no guarantee that interrupt delivery is stopped, that is,
a driver that gets an interrupt after detecting an error, or that detects a driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NOTHANDLED. It's up to the platform to deal with that return IRQ_NOTHANDLED. It's up to the platform to deal with that
condition, typically by masking the IRQ source during the duration of condition, typically by masking the IRQ source during the duration of
the error handling. It is expected that the platform "knows" which the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal interrupts are routed to error-management capable slots and can deal
with temporarily disabling that IRQ number during error processing (this with temporarily disabling that IRQ number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices platforms aren't supposed to share interrupts between many devices
anyway :) anyway :)
>>> Implementation details for the powerpc platform are discussed in .. note::
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
>>> As of this writing, there is a growing list of device drivers with Implementation details for the powerpc platform are discussed in
>>> patches implementing error recovery. Not all of these patches are in the file Documentation/powerpc/eeh-pci-error-recovery.txt
>>> mainline yet. These may be used as "examples":
>>>
>>> drivers/scsi/ipr
>>> drivers/scsi/sym53c8xx_2
>>> drivers/scsi/qla2xxx
>>> drivers/scsi/lpfc
>>> drivers/next/bnx2.c
>>> drivers/next/e100.c
>>> drivers/net/e1000
>>> drivers/net/e1000e
>>> drivers/net/ixgb
>>> drivers/net/ixgbe
>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
>>> drivers/net/qlge
The End As of this writing, there is a growing list of device drivers with
------- patches implementing error recovery. Not all of these patches are in
mainline yet. These may be used as "examples":
- drivers/scsi/ipr
- drivers/scsi/sym53c8xx_2
- drivers/scsi/qla2xxx
- drivers/scsi/lpfc
- drivers/net/bnx2.c
- drivers/net/e100.c
- drivers/net/e1000
- drivers/net/e1000e
- drivers/net/ixgb
- drivers/net/ixgbe
- drivers/net/cxgb3
- drivers/net/s2io.c
- drivers/net/qlge


@ -1,14 +1,19 @@
PCI Express I/O Virtualization Howto .. SPDX-License-Identifier: GPL-2.0
Copyright (C) 2009 Intel Corporation .. include:: <isonum.txt>
Yu Zhao <yu.zhao@intel.com>
Update: November 2012 ====================================
-- sysfs-based SRIOV enable-/disable-ment PCI Express I/O Virtualization Howto
Donald Dutile <ddutile@redhat.com> ====================================
1. Overview :Copyright: |copy| 2009 Intel Corporation
:Authors: - Yu Zhao <yu.zhao@intel.com>
- Donald Dutile <ddutile@redhat.com>
1.1 What is SR-IOV Overview
========
What is SR-IOV
--------------
Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
capability which makes one physical device appear as multiple virtual capability which makes one physical device appear as multiple virtual
@ -23,9 +28,11 @@ Memory Space, which is used to map its register set. VF device driver
operates on the register set so it can be functional and appear as a operates on the register set so it can be functional and appear as a
real existing PCI device. real existing PCI device.
2. User Guide User Guide
==========
2.1 How can I enable SR-IOV capability How can I enable SR-IOV capability
----------------------------------
Multiple methods are available for SR-IOV enablement. Multiple methods are available for SR-IOV enablement.
In the first method, the device driver (PF driver) will control the In the first method, the device driver (PF driver) will control the
@ -43,105 +50,123 @@ checks, e.g., check numvfs == 0 if enabling VFs, ensure
numvfs <= totalvfs. numvfs <= totalvfs.
The second method is the recommended method for new/future VF devices. The second method is the recommended method for new/future VF devices.
2.2 How can I use the Virtual Functions How can I use the Virtual Functions
-----------------------------------
The VFs are treated as hot-plugged PCI devices in the kernel, so they The VFs are treated as hot-plugged PCI devices in the kernel, so they
should be able to work in the same way as real PCI devices. A VF should be able to work in the same way as real PCI devices. A VF
requires a device driver that is the same as a normal PCI device's. requires a device driver that is the same as a normal PCI device's.
3. Developer Guide Developer Guide
===============
3.1 SR-IOV API SR-IOV API
----------
To enable SR-IOV capability: To enable SR-IOV capability:
(a) For the first method, in the driver:
(a) For the first method, in the driver::
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
'nr_virtfn' is the number of VFs to be enabled.
(b) For the second method, from sysfs: 'nr_virtfn' is the number of VFs to be enabled.
(b) For the second method, from sysfs::
echo 'nr_virtfn' > \ echo 'nr_virtfn' > \
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
To disable SR-IOV capability: To disable SR-IOV capability:
(a) For the first method, in the driver:
(a) For the first method, in the driver::
void pci_disable_sriov(struct pci_dev *dev); void pci_disable_sriov(struct pci_dev *dev);
(b) For the second method, from sysfs:
(b) For the second method, from sysfs::
echo 0 > \ echo 0 > \
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
To enable auto probing VFs by a compatible driver on the host, run To enable auto probing VFs by a compatible driver on the host, run
command below before enabling SR-IOV capabilities. This is the command below before enabling SR-IOV capabilities. This is the
default behavior. default behavior.
::
echo 1 > \ echo 1 > \
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
To disable auto probing VFs by a compatible driver on the host, run To disable auto probing VFs by a compatible driver on the host, run
command below before enabling SR-IOV capabilities. Updating this command below before enabling SR-IOV capabilities. Updating this
entry will not affect VFs which are already probed. entry will not affect VFs which are already probed.
::
echo 0 > \ echo 0 > \
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
3.2 Usage example Usage example
-------------
The following piece of code illustrates the usage of the SR-IOV API. The following piece of code illustrates the usage of the SR-IOV API.
::
static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id) static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
{ {
pci_enable_sriov(dev, NR_VIRTFN); pci_enable_sriov(dev, NR_VIRTFN);
...
return 0;
}
static void dev_remove(struct pci_dev *dev)
{
pci_disable_sriov(dev);
...
}
static int dev_suspend(struct pci_dev *dev, pm_message_t state)
{
...
return 0;
}
static int dev_resume(struct pci_dev *dev)
{
...
return 0;
}
static void dev_shutdown(struct pci_dev *dev)
{
...
}
static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
{
if (numvfs > 0) {
...
pci_enable_sriov(dev, numvfs);
...
return numvfs;
}
if (numvfs == 0) {
....
pci_disable_sriov(dev);
... ...
return 0; return 0;
} }
}
static struct pci_driver dev_driver = { static void dev_remove(struct pci_dev *dev)
.name = "SR-IOV Physical Function driver", {
.id_table = dev_id_table, pci_disable_sriov(dev);
.probe = dev_probe,
.remove = dev_remove, ...
.suspend = dev_suspend, }
.resume = dev_resume,
.shutdown = dev_shutdown, static int dev_suspend(struct pci_dev *dev, pm_message_t state)
.sriov_configure = dev_sriov_configure, {
}; ...
return 0;
}
static int dev_resume(struct pci_dev *dev)
{
...
return 0;
}
static void dev_shutdown(struct pci_dev *dev)
{
...
}
static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
{
if (numvfs > 0) {
...
pci_enable_sriov(dev, numvfs);
...
return numvfs;
}
if (numvfs == 0) {
....
pci_disable_sriov(dev);
...
return 0;
}
}
static struct pci_driver dev_driver = {
.name = "SR-IOV Physical Function driver",
.id_table = dev_id_table,
.probe = dev_probe,
.remove = dev_remove,
.suspend = dev_suspend,
.resume = dev_resume,
.shutdown = dev_shutdown,
.sriov_configure = dev_sriov_configure,
};
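The usage example above ignores the return value of pci_enable_sriov(). A minimal user-space sketch of a sriov_configure callback that propagates errors might look like the following. The `struct pci_dev` fields and the bodies of pci_enable_sriov() and pci_disable_sriov() are mocks written for this illustration (the real functions live in the kernel); only the shape of dev_sriov_configure() is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mock of the device: only what this sketch needs. */
struct pci_dev {
	int totalvfs;	/* maximum number of VFs the device supports */
	int numvfs;	/* number of VFs currently enabled */
};

/* Mock: returns 0 on success or a negative errno value on failure. */
static int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
{
	if (nr_virtfn < 0 || nr_virtfn > dev->totalvfs)
		return -EINVAL;
	dev->numvfs = nr_virtfn;
	return 0;
}

/* Mock: disabling cannot fail in this sketch. */
static void pci_disable_sriov(struct pci_dev *dev)
{
	dev->numvfs = 0;
}

/* Returns the number of VFs enabled, 0 after disabling, or a negative errno. */
static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
{
	int err;

	assert(dev != NULL);

	if (numvfs == 0) {
		pci_disable_sriov(dev);
		return 0;
	}
	if (numvfs < 0)
		return -EINVAL;

	err = pci_enable_sriov(dev, numvfs);
	if (err)
		return err;	/* report the failure instead of hiding it */

	return numvfs;
}
```

With this shape, a failed request leaves the previously enabled VF count untouched and hands the error code back to the caller.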


@ -1,10 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
How To Write Linux PCI Drivers ==============================
How To Write Linux PCI Drivers
==============================
by Martin Mares <mj@ucw.cz> on 07-Feb-2000 :Authors: - Martin Mares <mj@ucw.cz>
updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006 - Grant Grundler <grundler@parisc-linux.org>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The world of PCI is vast and full of (mostly unpleasant) surprises. The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is the PCI support have different requirements (erm, "features"), the result is the PCI support
@ -15,8 +17,7 @@ PCI device drivers.
A more complete resource is the third edition of "Linux Device Drivers" A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from: LDD3 is available for free (under Creative Commons License) from:
http://lwn.net/Kernel/LDD3/.
http://lwn.net/Kernel/LDD3/
However, keep in mind that all documents are subject to "bit rot". However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here. Refer to the source code if things are not working as described here.
@ -25,9 +26,8 @@ Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list. "Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
Structure of PCI drivers
0. Structure of PCI drivers ========================
~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI drivers "discover" PCI devices in a system via pci_register_driver(). PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified. a new device, the driver with a matching "description" will be notified.
@ -42,24 +42,25 @@ pointers and thus dictates the high level structure of a driver.
Once the driver knows about a PCI device and takes ownership, the Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization: driver generally needs to perform the following initialization:
Enable the device - Enable the device
Request MMIO/IOP resources - Request MMIO/IOP resources
Set the DMA mask size (for both coherent and streaming DMA) - Set the DMA mask size (for both coherent and streaming DMA)
Allocate and initialize shared control data (pci_allocate_coherent()) - Allocate and initialize shared control data (pci_allocate_coherent())
Access device configuration space (if needed) - Access device configuration space (if needed)
Register IRQ handler (request_irq()) - Register IRQ handler (request_irq())
Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
Enable DMA/processing engines - Enable DMA/processing engines
When done using the device, and perhaps the module needs to be unloaded, When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps: the driver needs to take the following steps:
Disable the device from generating IRQs
Release the IRQ (free_irq()) - Disable the device from generating IRQs
Stop all DMA activity - Release the IRQ (free_irq())
Release DMA buffers (both streaming and coherent) - Stop all DMA activity
Unregister from other subsystems (e.g. scsi or netdev) - Release DMA buffers (both streaming and coherent)
Release MMIO/IOP resources - Unregister from other subsystems (e.g. scsi or netdev)
Disable the device - Release MMIO/IOP resources
- Disable the device
Most of these topics are covered in the following sections. Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h> . For the rest look at LDD3 or <linux/pci.h> .
@ -70,99 +71,38 @@ completely empty or just returning an appropriate error codes to avoid
lots of ifdefs in the drivers. lots of ifdefs in the drivers.
pci_register_driver() call
==========================
1. pci_register_driver() call PCI device drivers call ``pci_register_driver()`` during their
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI device drivers call pci_register_driver() during their
initialization with a pointer to a structure describing the driver initialization with a pointer to a structure describing the driver
(struct pci_driver): (``struct pci_driver``):
field name Description .. kernel-doc:: include/linux/pci.h
---------- ------------------------------------------------------ :functions: pci_driver
id_table Pointer to table of device ID's the driver is
interested in. Most drivers should export this
table using MODULE_DEVICE_TABLE(pci,...).
probe This probing function gets called (during execution The ID table is an array of ``struct pci_device_id`` entries ending with an
of pci_register_driver() for already existing
devices or later if a new device gets inserted) for
all PCI devices which match the ID table and are not
"owned" by the other drivers yet. This function gets
passed a "struct pci_dev *" for each device whose
entry in the ID table matches the device. The probe
function returns zero when the driver chooses to
take "ownership" of the device or an error code
(negative number) otherwise.
The probe function always gets called from process
context, so it can sleep.
remove The remove() function gets called whenever a device
being handled by this driver is removed (either during
deregistration of the driver or when it's manually
pulled out of a hot-pluggable slot).
The remove function always gets called from process
context, so it can sleep.
suspend Put device into low power state.
suspend_late Put device into low power state.
resume_early Wake device from low power state.
resume Wake device from low power state.
(Please see Documentation/power/pci.txt for descriptions
of PCI Power Management and the related functions.)
shutdown Hook into reboot_notifier_list (kernel/sys.c).
Intended to stop any idling DMA operations.
Useful for enabling wake-on-lan (NIC) or changing
the power state of a device before reboot.
e.g. drivers/net/e100.c.
err_handler See Documentation/PCI/pci-error-recovery.txt
The ID table is an array of struct pci_device_id entries ending with an
all-zero entry. Definitions with static const are generally preferred. all-zero entry. Definitions with static const are generally preferred.
Each entry consists of: .. kernel-doc:: include/linux/mod_devicetable.h
:functions: pci_device_id
vendor,device Vendor and device ID to match (or PCI_ANY_ID) Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID)
subdevice,
class Device class, subclass, and "interface" to match.
See Appendix D of the PCI Local Bus Spec or
include/linux/pci_ids.h for a full list of classes.
Most drivers do not need to specify class/class_mask
as vendor/device is normally sufficient.
class_mask limit which sub-fields of the class field are compared.
See drivers/scsi/sym53c8xx_2/ for example of usage.
driver_data Data private to the driver.
Most drivers don't need to use driver_data field.
Best practice is to use driver_data as an index
into a static list of equivalent device types,
instead of using it as a pointer.
Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up
a pci_device_id table. a pci_device_id table.
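As a rough illustration of an ID table ending with an all-zero entry, here is a self-contained user-space sketch. The structure and the PCI_DEVICE() macro are simplified stand-ins for the kernel definitions, the vendor and device numbers are made up, and example_match_id() is a hypothetical helper that compares only vendor and device (the kernel's matching also considers the subsystem and class fields).

```c
#include <assert.h>
#include <stddef.h>

#define PCI_ANY_ID (~0u)

/* Simplified stand-in for the kernel's struct pci_device_id. */
struct pci_device_id {
	unsigned int vendor, device;		/* vendor and device ID or PCI_ANY_ID */
	unsigned int subvendor, subdevice;	/* subsystem IDs or PCI_ANY_ID */
	unsigned int class, class_mask;		/* class, subclass, prog-if triple */
	unsigned long driver_data;		/* data private to the driver */
};

/* Stand-in for PCI_DEVICE(): match vendor/device, any subsystem. */
#define PCI_DEVICE(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID

/* driver_data is used as an index into a static list of board types. */
enum example_board { EXAMPLE_BOARD_A, EXAMPLE_BOARD_B };

static const struct pci_device_id example_ids[] = {
	{ PCI_DEVICE(0x1234, 0x0001), .driver_data = EXAMPLE_BOARD_A },
	{ PCI_DEVICE(0x1234, 0x0002), .driver_data = EXAMPLE_BOARD_B },
	{ 0 }	/* all-zero terminating entry */
};

/* Hypothetical helper: walk the table until the all-zero entry. */
static const struct pci_device_id *
example_match_id(const struct pci_device_id *ids,
		 unsigned int vendor, unsigned int device)
{
	assert(ids != NULL);

	for (; ids->vendor || ids->subvendor || ids->class_mask; ids++) {
		if ((ids->vendor == PCI_ANY_ID || ids->vendor == vendor) &&
		    (ids->device == PCI_ANY_ID || ids->device == device))
			return ids;
	}
	return NULL;
}
```

Using driver_data as a small index rather than a pointer keeps the table `static const` and also makes it possible to supply the value through the new_id interface.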
New PCI IDs may be added to a device driver pci_ids table at runtime New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below: as shown below::
echo "vendor device subvendor subdevice class class_mask driver_data" > \ echo "vendor device subvendor subdevice class class_mask driver_data" > \
/sys/bus/pci/drivers/{driver}/new_id /sys/bus/pci/drivers/{driver}/new_id
All fields are passed in as hexadecimal values (no leading 0x). All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional. Users The vendor and device fields are mandatory, the others are optional. Users
need pass only as many optional fields as necessary: need pass only as many optional fields as necessary:
o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
o class and classmask fields default to 0 - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
o driver_data defaults to 0UL. - class and classmask fields default to 0
- driver_data defaults to 0UL.
Note that driver_data must match the value used by any of the pci_device_id Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver. This makes the driver_data field mandatory entries defined in the driver. This makes the driver_data field mandatory
@ -175,29 +115,31 @@ When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver. automatically calls the remove hook for all devices handled by the driver.
1.1 "Attributes" for driver functions/data "Attributes" for driver functions/data
--------------------------------------
Please mark the initialization and cleanup functions where appropriate Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>): (the corresponding macros are defined in <linux/init.h>):
====== =================================================
__init Initialization code. Thrown away after the driver __init Initialization code. Thrown away after the driver
initializes. initializes.
__exit Exit code. Ignored for non-modular drivers. __exit Exit code. Ignored for non-modular drivers.
====== =================================================
Tips on when/where to use the above attributes: Tips on when/where to use the above attributes:
o The module_init()/module_exit() functions (and all - The module_init()/module_exit() functions (and all
initialization functions called _only_ from these) initialization functions called _only_ from these)
should be marked __init/__exit. should be marked __init/__exit.
o Do not mark the struct pci_driver. - Do not mark the struct pci_driver.
o Do NOT mark a function if you are not sure which mark to use. - Do NOT mark a function if you are not sure which mark to use.
Better to not mark the function than mark the function wrong. Better to not mark the function than mark the function wrong.
How to find PCI devices manually
2. How to find PCI devices manually ================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI drivers should have a really good reason for not using the PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices. pci_register_driver() interface to search for PCI devices.
@ -207,17 +149,17 @@ E.g. combined serial/parallel port/floppy controller.
A manual search may be performed using the following constructs: A manual search may be performed using the following constructs:
Searching by vendor and device ID: Searching by vendor and device ID::
struct pci_dev *dev = NULL; struct pci_dev *dev = NULL;
while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL) while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL)
configure_device(dev); configure_device(dev);
Searching by class ID (iterate in a similar way): Searching by class ID (iterate in a similar way)::
pci_get_class(CLASS_ID, dev) pci_get_class(CLASS_ID, dev)
Searching by both vendor/device and subsystem vendor/device ID: Searching by both vendor/device and subsystem vendor/device ID::
pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).
@ -230,21 +172,20 @@ the pci_dev that they return. You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put(). decrement the reference count on these devices by calling pci_dev_put().
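The reference counting rule can be made concrete with a user-space mock. Everything below is an assumption made for illustration: `example_bus` is a made-up device list, and the bodies of pci_get_device() and pci_dev_put() only imitate the behavior described here (the reference on the `from` argument is dropped, and the returned device carries a new reference).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced device structure with an explicit refcount. */
struct pci_dev {
	unsigned short vendor, device;
	int refcount;
};

/* Made-up bus contents; IDs are illustrative only. */
static struct pci_dev example_bus[] = {
	{ 0x1234, 0x0001, 0 },
	{ 0x5678, 0x0002, 0 },
	{ 0x1234, 0x0001, 0 },
};

#define EXAMPLE_BUS_LEN (sizeof(example_bus) / sizeof(example_bus[0]))

/* Mock: continue after 'from', drop its reference, take one on the match. */
static struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device,
				      struct pci_dev *from)
{
	size_t i = from ? (size_t)(from - example_bus) + 1 : 0;

	if (from)
		from->refcount--;

	for (; i < EXAMPLE_BUS_LEN; i++) {
		if (example_bus[i].vendor == vendor &&
		    example_bus[i].device == device) {
			example_bus[i].refcount++;
			return &example_bus[i];
		}
	}
	return NULL;
}

/* Mock: release a reference obtained from pci_get_device(). */
static void pci_dev_put(struct pci_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/* A loop that runs to completion leaves no references behind. */
static int example_count_devices(unsigned int vendor, unsigned int device)
{
	struct pci_dev *dev = NULL;
	int n = 0;

	while ((dev = pci_get_device(vendor, device, dev)) != NULL)
		n++;
	return n;
}
```

A caller that stops iterating early, or that keeps a returned pointer, still holds a reference and has to release it with pci_dev_put().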
Device Initialization Steps
3. Device Initialization Steps ===========================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As noted in the introduction, most PCI drivers need the following steps As noted in the introduction, most PCI drivers need the following steps
for device initialization: for device initialization:
Enable the device - Enable the device
Request MMIO/IOP resources - Request MMIO/IOP resources
Set the DMA mask size (for both coherent and streaming DMA) - Set the DMA mask size (for both coherent and streaming DMA)
Allocate and initialize shared control data (pci_allocate_coherent()) - Allocate and initialize shared control data (pci_allocate_coherent())
Access device configuration space (if needed) - Access device configuration space (if needed)
Register IRQ handler (request_irq()) - Register IRQ handler (request_irq())
Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
Enable DMA/processing engines. - Enable DMA/processing engines.
The driver can access PCI config space registers at any time. The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away...but (Well, almost. When running BIST, config space can go away...but
@ -252,26 +193,29 @@ that will just result in a PCI Bus Master Abort and config reads
will return garbage). will return garbage).
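
The ordering of the initialization steps matters most on the error path. Below is a rough userspace sketch of the usual "goto unwind" shape of a probe routine, under the assumption that each step is reduced to a stub which either succeeds or fails on request. `struct demo_dev`, `do_step()`, `undo_step()` and `demo_probe()` are hypothetical; the comments name the kernel calls each stub stands in for. The unwind follows the order this document gives for shutdown: disable the device first, then release the regions.

```c
#include <string.h>

enum { STEP_ENABLE, STEP_REQUEST_REGIONS, STEP_SET_DMA_MASK, STEP_REQUEST_IRQ };

/* Hypothetical device model: one injectable failure plus an action trace. */
struct demo_dev {
    int fail_step;   /* step that should fail, or -1 for none */
    char log[64];    /* one character per action performed */
};

static void log_tag(struct demo_dev *d, char tag)
{
    size_t len = strlen(d->log);

    if (len + 1 < sizeof(d->log)) {
        d->log[len] = tag;
        d->log[len + 1] = '\0';
    }
}

/* Perform one setup step; returns 0 on success, -1 on injected failure. */
static int do_step(struct demo_dev *d, int step, char tag)
{
    if (step == d->fail_step)
        return -1;
    log_tag(d, tag);
    return 0;
}

static void undo_step(struct demo_dev *d, char tag)
{
    log_tag(d, tag);
}

static int demo_probe(struct demo_dev *d)
{
    int err;

    err = do_step(d, STEP_ENABLE, 'E');           /* pci_enable_device() */
    if (err)
        return err;
    err = do_step(d, STEP_REQUEST_REGIONS, 'R');  /* pci_request_region() */
    if (err)
        goto err_disable;
    err = do_step(d, STEP_SET_DMA_MASK, 'D');     /* DMA mask: nothing to undo */
    if (err)
        goto err_disable_release;
    err = do_step(d, STEP_REQUEST_IRQ, 'I');      /* request_irq() */
    if (err)
        goto err_disable_release;
    return 0;

err_disable_release:
    undo_step(d, 'e');                            /* pci_disable_device() */
    undo_step(d, 'r');                            /* pci_release_region() */
    return err;
err_disable:
    undo_step(d, 'e');                            /* pci_disable_device() */
    return err;
}
```

The trace string makes the behaviour easy to inspect: uppercase letters are setup actions, lowercase letters are the matching teardown actions.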
3.1 Enable the PCI device Enable the PCI device
~~~~~~~~~~~~~~~~~~~~~~~~~ ---------------------
Before touching any device registers, the driver needs to enable Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will: the PCI device by calling pci_enable_device(). This will:
o wake up the device if it was in suspended state,
o allocate I/O and memory regions of the device (if BIOS did not),
o allocate an IRQ (if BIOS did not).
NOTE: pci_enable_device() can fail! Check the return value. - wake up the device if it was in suspended state,
- allocate I/O and memory regions of the device (if BIOS did not),
- allocate an IRQ (if BIOS did not).
[ OS BUG: we don't check resource allocations before enabling those .. note::
resources. The sequence would make more sense if we called pci_enable_device() can fail! Check the return value.
pci_request_resources() before calling pci_enable_device().
Currently, the device drivers can't detect the bug when when two .. warning::
devices have been allocated the same range. This is not a common OS BUG: we don't check resource allocations before enabling those
problem and unlikely to get fixed soon. resources. The sequence would make more sense if we called
pci_request_resources() before calling pci_enable_device().
Currently, the device drivers can't detect the bug when when two
devices have been allocated the same range. This is not a common
problem and unlikely to get fixed soon.
This has been discussed before but not changed as of 2.6.19:
http://lkml.org/lkml/2006/3/2/194
This has been discussed before but not changed as of 2.6.19:
http://lkml.org/lkml/2006/3/2/194
]
pci_set_master() will enable DMA by setting the bus master bit pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if in the PCI_COMMAND register. It also fixes the latency timer value if
@ -288,8 +232,8 @@ pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval. Mem-Wr-Inval.
3.2 Request MMIO/IOP resources Request MMIO/IOP resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --------------------------
Memory (MMIO), and I/O port addresses should NOT be read directly Memory (MMIO), and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical" as the PCI "bus address" might have been remapped to a "host physical"
@ -304,9 +248,10 @@ Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device(). calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range. The idea is to prevent two devices colliding on the same address range.
[ See OS BUG comment above. Currently (2.6.19), The driver can only .. tip::
determine MMIO and IO Port resource availability _after_ calling See OS BUG comment above. Currently (2.6.19), The driver can only
pci_enable_device(). ] determine MMIO and IO Port resource availability _after_ calling
pci_enable_device().
Generic flavors of pci_request_region() are request_mem_region() Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges). (for MMIO ranges) and request_region() (for IO Port ranges).
@ -316,12 +261,13 @@ BARs.
Also see pci_request_selected_regions() below. Also see pci_request_selected_regions() below.
3.3 Set the DMA mask size Set the DMA mask size
~~~~~~~~~~~~~~~~~~~~~~~~~ ---------------------
[ If anything below doesn't make sense, please refer to .. note::
Documentation/DMA-API.txt. This section is just a reminder that If anything below doesn't make sense, please refer to
drivers need to indicate DMA capabilities of the device and is not Documentation/DMA-API.txt. This section is just a reminder that
an authoritative source for DMA interfaces. ] drivers need to indicate DMA capabilities of the device and is not
an authoritative source for DMA interfaces.
While all drivers should explicitly indicate the DMA capability While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than (e.g. 32 or 64 bit) of the PCI bus master, devices with more than
@ -342,23 +288,23 @@ Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
("consistent") data. ("consistent") data.
3.4 Setup shared control data Setup shared control data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory. See Documentation/DMA-API.txt for a full description of memory. See Documentation/DMA-API.txt for a full description of
the DMA APIs. This section is just a reminder that it needs to be done the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device. before enabling DMA on the device.
3.5 Initialize device registers Initialize device registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ---------------------------
Some drivers will need specific "capability" fields programmed Some drivers will need specific "capability" fields programmed
or other "vendor specific" register initialized or reset. or other "vendor specific" register initialized or reset.
E.g. clearing pending interrupts. E.g. clearing pending interrupts.
3.6 Register IRQ handler Register IRQ handler
~~~~~~~~~~~~~~~~~~~~~~~~ --------------------
While calling request_irq() is the last step described here, While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device. this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use. This step can often be deferred until the device is opened for use.
@ -396,6 +342,7 @@ and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors. pci_alloc_irq_vectors.
There are (at least) two really good reasons for using MSI: There are (at least) two really good reasons for using MSI:
1) MSI is an exclusive interrupt vector by definition. 1) MSI is an exclusive interrupt vector by definition.
This means the interrupt handler doesn't have to verify This means the interrupt handler doesn't have to verify
its device caused the interrupt. its device caused the interrupt.
@ -410,24 +357,23 @@ See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage. of MSI/MSI-X usage.
PCI device shutdown
4. PCI device shutdown ===================
~~~~~~~~~~~~~~~~~~~~~~~
When a PCI device driver is being unloaded, most of the following When a PCI device driver is being unloaded, most of the following
steps need to be performed: steps need to be performed:
Disable the device from generating IRQs - Disable the device from generating IRQs
Release the IRQ (free_irq()) - Release the IRQ (free_irq())
Stop all DMA activity - Stop all DMA activity
Release DMA buffers (both streaming and consistent) - Release DMA buffers (both streaming and consistent)
Unregister from other subsystems (e.g. scsi or netdev) - Unregister from other subsystems (e.g. scsi or netdev)
Disable device from responding to MMIO/IO Port addresses - Disable device from responding to MMIO/IO Port addresses
Release MMIO/IO Port resource(s) - Release MMIO/IO Port resource(s)
4.1 Stop IRQs on the device Stop IRQs on the device
~~~~~~~~~~~~~~~~~~~~~~~~~~~ -----------------------
How to do this is chip/device specific. If it's not done, it opens How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if) the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device. the IRQ is shared with another device.
@ -446,16 +392,16 @@ MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem. are not susceptible to the "screaming interrupt" problem.
4.2 Release the IRQ Release the IRQ
~~~~~~~~~~~~~~~~~~~ ---------------
Once the device is quiesced (no more IRQs), one can call free_irq(). Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled, This function will return control once any pending IRQs are handled,
"unhook" the drivers IRQ handler from that IRQ, and finally release "unhook" the drivers IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it. the IRQ if no one else is using it.
4.3 Stop all DMA activity Stop all DMA activity
~~~~~~~~~~~~~~~~~~~~~~~~~ ---------------------
It's extremely important to stop all DMA operations BEFORE attempting It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data. Failure to do so can result in memory to deallocate DMA control data. Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash. corruption, hangs, and on some chip-sets a hard crash.
@ -467,8 +413,8 @@ While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past. didn't get this step right in the past.
4.4 Release DMA buffers Release DMA buffers
~~~~~~~~~~~~~~~~~~~~~~~ -------------------
Once DMA is stopped, clean up streaming DMA first. Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream" I.e. unmap data buffers and return buffers to "upstream"
owners if there is one. owners if there is one.
@ -478,8 +424,8 @@ Then clean up "consistent" buffers which contain the control data.
See Documentation/DMA-API.txt for details on unmapping interfaces. See Documentation/DMA-API.txt for details on unmapping interfaces.
4.5 Unregister from other subsystems Unregister from other subsystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --------------------------------
Most low level PCI device drivers support some other subsystem Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem. driver isn't losing resources from that other subsystem.
@ -487,31 +433,30 @@ If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded. the subsystem attempts to call into a driver that has been unloaded.
4.6 Disable Device from responding to MMIO/IO Port addresses Disable Device from responding to MMIO/IO Port addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --------------------------------------------------------
io_unmap() MMIO or IO Port resources and then call pci_disable_device(). io_unmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device(). This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device(). Do not access device registers after calling pci_disable_device().
4.7 Release MMIO/IO Port Resource(s) Release MMIO/IO Port Resource(s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --------------------------------
Call pci_release_region() to mark the MMIO or IO Port range as available. Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver. Failure to do so usually results in the inability to reload the driver.
How to access PCI config space
==============================
5. How to access PCI config space You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ space of a device represented by `struct pci_dev *`. All these functions return
0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
You can use pci_(read|write)_config_(byte|word|dword) to access the config text string by pcibios_strerror. Most drivers expect that accesses to valid PCI
space of a device represented by struct pci_dev *. All these functions return 0
when successful or an error code (PCIBIOS_...) which can be translated to a text
string by pcibios_strerror. Most drivers expect that accesses to valid PCI
devices don't fail. devices don't fail.
If you don't have a struct pci_dev available, you can call If you don't have a struct pci_dev available, you can call
pci_bus_(read|write)_config_(byte|word|dword) to access a given device `pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
and function on that bus. and function on that bus.
If you access fields in the standard portion of the config header, please If you access fields in the standard portion of the config header, please
@ -522,10 +467,10 @@ pci_find_capability() for the particular capability and it will find the
corresponding register block for you. corresponding register block for you.
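
To make the capability lookup less abstract, here is a small userspace sketch of the kind of list walk that pci_find_capability() performs on the driver's behalf. It operates on a plain 256-byte snapshot of config space rather than on a real device. `find_cap_offset()` and the `CFG_*` macros are illustrative names; the offsets mirror the standard header layout (status register at 0x06 with the capability-list bit 0x10, capabilities pointer at 0x34).

```c
#include <stdint.h>

#define CFG_STATUS          0x06  /* low byte of the status register */
#define CFG_STATUS_CAP_LIST 0x10  /* set when a capability list is present */
#define CFG_CAP_PTR         0x34  /* offset of the first capability */

/*
 * Walk the capability list in a 256-byte config-space snapshot.
 * Each entry is: byte 0 = capability ID, byte 1 = offset of next entry.
 * Returns the offset of the capability with the given ID, or 0 if absent.
 */
static uint8_t find_cap_offset(const uint8_t cfg[256], uint8_t cap_id)
{
    int ttl = 48;  /* bound the walk in case the list is corrupt or loops */
    uint8_t pos;

    if (!(cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST))
        return 0;

    pos = cfg[CFG_CAP_PTR] & 0xfc;  /* entries are dword aligned */
    while (pos >= 0x40 && ttl-- > 0) {
        if (cfg[pos] == cap_id)
            return pos;
        pos = cfg[pos + 1] & 0xfc;
    }
    return 0;
}
```

A real driver should keep using pci_find_capability(); the sketch only shows why a fixed offset cannot be assumed, since the position of a capability depends on how the device chains its list.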
Other interesting functions
===========================
6. Other interesting functions ============================= ================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
bus and slot and number. If the device is bus and slot and number. If the device is
found, its reference count is increased. found, its reference count is increased.
@ -539,11 +484,11 @@ pci_set_drvdata() Set private driver data pointer for a pci_dev
pci_get_drvdata() Return private driver data pointer for a pci_dev pci_get_drvdata() Return private driver data pointer for a pci_dev
pci_set_mwi() Enable Memory-Write-Invalidate transactions. pci_set_mwi() Enable Memory-Write-Invalidate transactions.
pci_clear_mwi() Disable Memory-Write-Invalidate transactions. pci_clear_mwi() Disable Memory-Write-Invalidate transactions.
============================= ================================================
Miscellaneous hints
7. Miscellaneous hints ===================
~~~~~~~~~~~~~~~~~~~~~~
When displaying PCI device names to the user (for example when a driver wants When displaying PCI device names to the user (for example when a driver wants
to tell the user what card has it found), please use pci_name(pci_dev). to tell the user what card has it found), please use pci_name(pci_dev).
@ -559,9 +504,8 @@ on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers. to be handled by platform and generic code, not individual drivers.
Vendor and device identifications
8. Vendor and device identifications =================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Do not add new device or vendor IDs to include/linux/pci_ids.h unless they Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers. You can add private definitions in are shared across multiple drivers. You can add private definitions in
@ -575,28 +519,27 @@ There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
and https://github.com/pciutils/pciids. and https://github.com/pciutils/pciids.
Obsolete functions
9. Obsolete functions ==================
~~~~~~~~~~~~~~~~~~~~~
There are several functions which you might come across when trying to There are several functions which you might come across when trying to
port an old driver to the new PCI interface. They are no longer present port an old driver to the new PCI interface. They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking. having sane locking.
================= ===========================================
pci_find_device() Superseded by pci_get_device() pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys() pci_find_subsys() Superseded by pci_get_subsys()
pci_find_slot() Superseded by pci_get_domain_bus_and_slot() pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
pci_get_slot() Superseded by pci_get_domain_bus_and_slot() pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
================= ===========================================
The alternative is the traditional PCI device driver that walks PCI The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged. device lists. This is still possible but discouraged.
MMIO Space and "Write Posting"
10. MMIO Space and "Write Posting" ==============================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting a driver from using I/O Port space to using MMIO space Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting" often requires some additional changes. Specifically, "write posting"
@ -609,14 +552,14 @@ the CPU before the transaction has reached its destination.
Thus, timing sensitive code should add readl() where the CPU is Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work. The classic "bit banging" expected to wait before doing other work. The classic "bit banging"
sequence works fine for I/O Port space: sequence works fine for I/O Port space::
for (i = 8; --i; val >>= 1) { for (i = 8; --i; val >>= 1) {
outb(val & 1, ioport_reg); /* write bit */ outb(val & 1, ioport_reg); /* write bit */
udelay(10); udelay(10);
} }
The same sequence for MMIO space should be: The same sequence for MMIO space should be::
for (i = 8; --i; val >>= 1) { for (i = 8; --i; val >>= 1) {
writeb(val & 1, mmio_reg); /* write bit */ writeb(val & 1, mmio_reg); /* write bit */
@ -633,4 +576,3 @@ handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl(). Most x86 platforms will allow expected to not respond to a readl(). Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail"). (e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail").
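
The effect of write posting discussed in this section can be modelled in a few lines of ordinary C. This is a toy model with made-up names (`struct posted_bus`, `model_writeb()`, `model_readb()`, `bus_flush()`), not a description of any particular bridge: writes are queued and only reach the "device" when a read forces the queue to drain, or when the queue fills up.

```c
#include <stdint.h>

#define POSTED_MAX 16

/* Hypothetical model of a bridge that buffers (posts) MMIO writes. */
struct posted_bus {
    uint8_t pending[POSTED_MAX];  /* writes accepted but not yet delivered */
    int npending;
    uint8_t device_reg;           /* last value that reached the device */
    int delivered;                /* number of writes the device has seen */
};

/* Deliver all buffered writes to the device, in order. */
static void bus_flush(struct posted_bus *bus)
{
    int i;

    for (i = 0; i < bus->npending; i++) {
        bus->device_reg = bus->pending[i];
        bus->delivered++;
    }
    bus->npending = 0;
}

/* Like writeb(): returns as soon as the write is queued. */
static void model_writeb(struct posted_bus *bus, uint8_t val)
{
    if (bus->npending == POSTED_MAX)
        bus_flush(bus);           /* buffer full: forced drain */
    bus->pending[bus->npending++] = val;
}

/* Like readb(): cannot complete until earlier writes have landed. */
static uint8_t model_readb(struct posted_bus *bus)
{
    bus_flush(bus);
    return bus->device_reg;
}
```

In the model, a delay placed directly after `model_writeb()` would time nothing on the device side, because the device has not seen the write yet. Placing `model_readb()` before the delay is what ties the delay to the moment the write arrived.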


@ -1,21 +1,29 @@
The PCI Express Advanced Error Reporting Driver Guide HOWTO .. SPDX-License-Identifier: GPL-2.0
T. Long Nguyen <tom.l.nguyen@intel.com> .. include:: <isonum.txt>
Yanmin Zhang <yanmin.zhang@intel.com>
07/29/2006
===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================
1. Overview :Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
- Yanmin Zhang <yanmin.zhang@intel.com>
1.1 About this guide :Copyright: |copy| 2006 Intel Corporation
Overview
===========
About this guide
----------------
This guide describes the basics of the PCI Express Advanced Error This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with well as how to enable the drivers of endpoint devices to conform with
PCI Express AER driver. PCI Express AER driver.
1.2 Copyright (C) Intel Corporation 2006.
1.3 What is the PCI Express AER Driver? What is the PCI Express AER Driver?
-----------------------------------
PCI Express error signaling can occur on the PCI Express link itself PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express or on behalf of transactions initiated on the link. PCI Express
@ -30,17 +38,19 @@ The PCI Express AER driver provides the infrastructure to support PCI
Express Advanced Error Reporting capability. The PCI Express AER Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions: driver provides three basic functions:
- Gathers the comprehensive error information if errors occurred. - Gathers the comprehensive error information if errors occurred.
- Reports error to the users. - Reports error to the users.
- Performs error recovery actions. - Performs error recovery actions.
AER driver only attaches root ports which support PCI-Express AER AER driver only attaches root ports which support PCI-Express AER
capability. capability.
2. User Guide User Guide
==========
2.1 Include the PCI Express AER Root Driver into the Linux Kernel Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------
The PCI Express AER Root driver is a Root Port service driver attached The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. If a user wants to use it, the driver to the PCI Express Port Bus driver. If a user wants to use it, the driver
@ -48,7 +58,8 @@ has to be compiled. Option CONFIG_PCIEAER supports this capability. It
depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER = y. CONFIG_PCIEAER = y.
2.2 Load PCI Express AER Root Driver Load PCI Express AER Root Driver
--------------------------------
Some systems have AER support in firmware. Enabling Linux AER support at Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable the same time the firmware handles AER may result in unpredictable
@ -56,30 +67,34 @@ behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage. Specification for details regarding _OSC usage.
2.3 AER error output AER error output
----------------
When a PCIe AER error is captured, an error message will be output to When a PCIe AER error is captured, an error message will be output to
console. If it's a correctable error, it is output as a warning. console. If it's a correctable error, it is output as a warning.
Otherwise, it is printed as an error. So users could choose different Otherwise, it is printed as an error. So users could choose different
log level to filter out correctable error messages. log level to filter out correctable error messages.
Below shows an example: Below shows an example::
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
0000:50:00.0: [20] Unsupported Request (First) 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 0000:50:00.0: [20] Unsupported Request (First)
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
In the example, 'Requester ID' means the ID of the device who sends In the example, 'Requester ID' means the ID of the device who sends
the error message to root port. Pls. refer to pci express specs for the error message to root port. Pls. refer to pci express specs for
other fields. other fields.
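
Two of the fields in the sample output can be decoded mechanically. The sketch below assumes the conventional (non-ARI) Requester ID layout of bus in bits 15:8, device in bits 7:3 and function in bits 2:0, and treats the status word as a plain bit mask. `struct pcie_bdf`, `decode_requester_id()` and `first_status_bit()` are illustrative helpers, not kernel functions.

```c
#include <stdint.h>

/* Bus/device/function triple extracted from a Requester ID. */
struct pcie_bdf {
    unsigned bus, dev, fn;
};

/* Split a 16-bit Requester ID into bus[15:8], device[7:3], function[2:0]. */
static struct pcie_bdf decode_requester_id(uint16_t id)
{
    struct pcie_bdf r;

    r.bus = (id >> 8) & 0xff;
    r.dev = (id >> 3) & 0x1f;
    r.fn = id & 0x07;
    return r;
}

/* Return the index of the lowest set bit in a status word, or -1 if none. */
static int first_status_bit(uint32_t status)
{
    int bit;

    for (bit = 0; bit < 32; bit++)
        if (status & (UINT32_C(1) << bit))
            return bit;
    return -1;
}
```

Applied to the sample, `id=0500` splits into bus 0x05, device 0, function 0, and the status value `00100000` has only bit 20 set, which corresponds to the `[20] Unsupported Request` line in the log.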
2.4 AER Statistics / Counters AER Statistics / Counters
-------------------------
When PCIe AER errors are captured, the counters / statistics are also exposed When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented at in the form of sysfs attributes which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
3. Developer Guide Developer Guide
===============
To enable AER aware support requires a software driver to configure To enable AER aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks. the AER capability structure within its device and to provide callbacks.
@ -120,7 +135,8 @@ hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to errors because device specific errors will still get sent directly to
the device driver. the device driver.
3.1 Configure the AER capability structure Configure the AER capability structure
--------------------------------------
AER aware drivers of PCI Express component need change the device AER aware drivers of PCI Express component need change the device
control registers to enable AER. They also could change AER registers, control registers to enable AER. They also could change AER registers,
@ -128,9 +144,11 @@ including mask and severity registers. Helper function
pci_enable_pcie_error_reporting could be used to enable AER. See pci_enable_pcie_error_reporting could be used to enable AER. See
section 3.3. section 3.3.
3.2. Provide callbacks Provide callbacks
-----------------
3.2.1 callback reset_link to reset pci express link callback reset_link to reset pci express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This callback is used to reset the pci express physical link when a This callback is used to reset the pci express physical link when a
fatal error happens. The root port aer service driver provides a fatal error happens. The root port aer service driver provides a
@ -140,13 +158,15 @@ upstream ports should provide their own reset_link functions.
In struct pcie_port_service_driver, a new pointer, reset_link, is In struct pcie_port_service_driver, a new pointer, reset_link, is
added. added.
::
pci_ers_result_t (*reset_link) (struct pci_dev *dev); pci_ers_result_t (*reset_link) (struct pci_dev *dev);
Section 3.2.2.2 provides more detailed info on when to call Section 3.2.2.2 provides more detailed info on when to call
reset_link. reset_link.
3.2.2 PCI error-recovery callbacks PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The PCI Express AER Root driver uses error callbacks to coordinate The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question with downstream device drivers associated with a hierarchy in question
@ -161,7 +181,8 @@ definitions of the callbacks.
Below sections specify when to call the error callback functions. Below sections specify when to call the error callback functions.
3.2.2.1 Correctable errors Correctable errors
~~~~~~~~~~~~~~~~~~
Correctable errors pose no impacts on the functionality of Correctable errors pose no impacts on the functionality of
the interface. The PCI Express protocol can recover without any the interface. The PCI Express protocol can recover without any
@ -169,13 +190,16 @@ software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors. correctable error status register accordingly and logs these errors.
3.2.2.2 Non-correctable (non-fatal and fatal) errors Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If an error message indicates a non-fatal error, performing link reset If an error message indicates a non-fatal error, performing link reset
at upstream is not required. The AER driver calls error_detected(dev, at upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) to all drivers associated within a hierarchy in pci_channel_io_normal) to all drivers associated within a hierarchy in
question. for example, question. for example::
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort
If Upstream port A captures an AER error, the hierarchy consists of If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint. Downstream port B and EndPoint.
@ -199,53 +223,72 @@ function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled. to mmio_enabled.
3.3 helper functions helper functions
----------------
::
int pci_enable_pcie_error_reporting(struct pci_dev *dev);
3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error pci_enable_pcie_error_reporting enables the device to send error
messages to root port when an error is detected. Note that devices messages to root port when an error is detected. Note that devices
don't enable the error reporting by default, so device drivers need don't enable the error reporting by default, so device drivers need
call this function to enable it. call this function to enable it.
3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev); ::
int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting disables the device to send error pci_disable_pcie_error_reporting disables the device to send error
messages to root port when an error is detected. messages to root port when an error is detected.
3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); ::
int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);`
pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable
error status register. error status register.
3.4 Frequent Asked Questions Frequent Asked Questions
------------------------
Q: What happens if a PCI Express device driver does not provide an Q:
error recovery handler (pci_driver->err_handler is equal to NULL)? What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?
A: The devices attached with the driver won't be recovered. If the A:
error is fatal, kernel will print out warning messages. Please refer The devices attached with the driver won't be recovered. If the
to section 3 for more information. error is fatal, kernel will print out warning messages. Please refer
to section 3 for more information.
Q: What happens if an upstream port service driver does not provide Q:
callback reset_link? What happens if an upstream port service driver does not provide
callback reset_link?
A: Fatal error recovery will fail if the errors are reported by the A:
upstream ports who are attached by the service driver. Fatal error recovery will fail if the errors are reported by the
upstream ports who are attached by the service driver.
Q: How does this infrastructure deal with driver that is not PCI Q:
Express aware? How does this infrastructure deal with driver that is not PCI
Express aware?
A: This infrastructure calls the error callback functions of the A:
driver when an error happens. But if the driver is not aware of This infrastructure calls the error callback functions of the
PCI Express, the device might not report its own errors to root driver when an error happens. But if the driver is not aware of
port. PCI Express, the device might not report its own errors to root
port.
Q: What modifications will that driver need to make it compatible Q:
with the PCI Express AER Root driver? What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?
A: It could call the helper functions to enable AER in devices and A:
cleanup uncorrectable status register. Pls. refer to section 3.3. It could call the helper functions to enable AER in devices and
cleanup uncorrectable status register. Pls. refer to section 3.3.
4. Software error injection Software error injection
========================
Debugging PCIe AER error recovery code is quite difficult because it Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error is hard to trigger real hardware errors. Software based error
@ -261,6 +304,7 @@ After reboot with new kernel or insert the module, a device file named
Then, you need a user space tool named aer-inject, which can be gotten Then, you need a user space tool named aer-inject, which can be gotten
from: from:
https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
More information about aer-inject can be found in the document comes More information about aer-inject can be found in the document comes


@ -1,16 +1,23 @@
The PCI Express Port Bus Driver Guide HOWTO .. SPDX-License-Identifier: GPL-2.0
Tom L Nguyen tom.l.nguyen@intel.com .. include:: <isonum.txt>
11/03/2004
1. About this guide ===========================================
The PCI Express Port Bus Driver Guide HOWTO
===========================================
:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004
:Copyright: |copy| 2004 Intel Corporation
About this guide
================
This guide describes the basics of the PCI Express Port Bus driver This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver. register/unregister with the PCI Express Port Bus Driver.
2. Copyright 2004 Intel Corporation
3. What is the PCI Express Port Bus Driver What is the PCI Express Port Bus Driver
=======================================
A PCI Express Port is a logical PCI-PCI Bridge structure. There A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch are two types of PCI Express Port: the Root Port and the Switch
@ -30,7 +37,8 @@ support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers. and handled by corresponding service drivers.
4. Why use the PCI Express Port Bus Driver? Why use the PCI Express Port Bus Driver?
========================================
In existing Linux kernels, the Linux Device Driver Model allows a In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI physical device to be handled by only a single driver. The PCI
@ -51,28 +59,31 @@ PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below: advantages of using the PCI Express Port Bus driver are listed below:
- Allow multiple service drivers to run simultaneously on - Allow multiple service drivers to run simultaneously on
a PCI-PCI Bridge Port device. a PCI-PCI Bridge Port device.
- Allow service drivers implemented in an independent - Allow service drivers implemented in an independent
staged approach. staged approach.
- Allow one service driver to run on multiple PCI-PCI Bridge - Allow one service driver to run on multiple PCI-PCI Bridge
Port devices. Port devices.
- Manage and distribute resources of a PCI-PCI Bridge Port - Manage and distribute resources of a PCI-PCI Bridge Port
device to requested service drivers. device to requested service drivers.
5. Configuring the PCI Express Port Bus Driver vs. Service Drivers Configuring the PCI Express Port Bus Driver vs. Service Drivers
===============================================================
5.1 Including the PCI Express Port Bus Driver Support into the Kernel Including the PCI Express Port Bus Driver Support into the Kernel
-----------------------------------------------------------------
Including the PCI Express Port Bus driver depends on whether the PCI Including the PCI Express Port Bus driver depends on whether the PCI
Express support is included in the kernel config. The kernel will Express support is included in the kernel config. The kernel will
automatically include the PCI Express Port Bus driver as a kernel automatically include the PCI Express Port Bus driver as a kernel
driver when the PCI Express support is enabled in the kernel. driver when the PCI Express support is enabled in the kernel.
5.2 Enabling Service Driver Support Enabling Service Driver Support
-------------------------------
PCI device drivers are implemented based on Linux Device Driver Model. PCI device drivers are implemented based on Linux Device Driver Model.
All service drivers are PCI device drivers. As discussed above, it is All service drivers are PCI device drivers. As discussed above, it is
@ -89,9 +100,11 @@ header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result an identity mismatch, which prevents Failure to do so will result an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver. the PCI Express Port Bus driver from loading a service driver.
5.2.1 pcie_port_service_register pcie_port_service_register
~~~~~~~~~~~~~~~~~~~~~~~~~~
::
int pcie_port_service_register(struct pcie_port_service_driver *new) int pcie_port_service_register(struct pcie_port_service_driver *new)
This API replaces the Linux Driver Model's pci_register_driver API. A This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always calls pcie_port_service_register at service driver should always calls pcie_port_service_register at
@ -99,69 +112,76 @@ module init. Note that after service driver being loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver. necessary since these calls are executed by the PCI Port Bus driver.
5.2.2 pcie_port_service_unregister pcie_port_service_unregister
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::
void pcie_port_service_unregister(struct pcie_port_service_driver *new) void pcie_port_service_unregister(struct pcie_port_service_driver *new)
pcie_port_service_unregister replaces the Linux Driver Model's pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver. It's always called by service driver when a pci_unregister_driver. It's always called by service driver when a
module exits. module exits.
5.2.3 Sample Code Sample Code
~~~~~~~~~~~
Below is sample service driver code to initialize the port service Below is sample service driver code to initialize the port service
driver data structure. driver data structure.
::
static struct pcie_port_service_id service_id[] = { { static struct pcie_port_service_id service_id[] = { {
.vendor = PCI_ANY_ID, .vendor = PCI_ANY_ID,
.device = PCI_ANY_ID, .device = PCI_ANY_ID,
.port_type = PCIE_RC_PORT, .port_type = PCIE_RC_PORT,
.service_type = PCIE_PORT_SERVICE_AER, .service_type = PCIE_PORT_SERVICE_AER,
}, { /* end: all zeroes */ } }, { /* end: all zeroes */ }
}; };
static struct pcie_port_service_driver root_aerdrv = { static struct pcie_port_service_driver root_aerdrv = {
.name = (char *)device_name, .name = (char *)device_name,
.id_table = &service_id[0], .id_table = &service_id[0],
.probe = aerdrv_load, .probe = aerdrv_load,
.remove = aerdrv_unload, .remove = aerdrv_unload,
.suspend = aerdrv_suspend, .suspend = aerdrv_suspend,
.resume = aerdrv_resume, .resume = aerdrv_resume,
}; };
Below is a sample code for registering/unregistering a service Below is a sample code for registering/unregistering a service
driver. driver.
::
static int __init aerdrv_service_init(void) static int __init aerdrv_service_init(void)
{ {
int retval = 0; int retval = 0;
retval = pcie_port_service_register(&root_aerdrv); retval = pcie_port_service_register(&root_aerdrv);
if (!retval) { if (!retval) {
/* /*
* FIX ME * FIX ME
*/ */
} }
return retval; return retval;
} }
static void __exit aerdrv_service_exit(void) static void __exit aerdrv_service_exit(void)
{ {
pcie_port_service_unregister(&root_aerdrv); pcie_port_service_unregister(&root_aerdrv);
} }
module_init(aerdrv_service_init); module_init(aerdrv_service_init);
module_exit(aerdrv_service_exit); module_exit(aerdrv_service_exit);
6. Possible Resource Conflicts Possible Resource Conflicts
===========================
Since all service drivers of a PCI-PCI Bridge Port device are Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, below lists a few of possible resource allowed to run simultaneously, below lists a few of possible resource
conflicts with proposed solutions. conflicts with proposed solutions.
6.1 MSI and MSI-X Vector Resource MSI and MSI-X Vector Resource
-----------------------------
Once MSI or MSI-X interrupts are enabled on a device, it stays in this Once MSI or MSI-X interrupts are enabled on a device, it stays in this
mode until they are disabled again. Since service drivers of the same mode until they are disabled again. Since service drivers of the same
@ -179,7 +199,8 @@ driver. Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq. In addition, the interrupt mode is stored call request_irq/free_irq. In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device. in the field interrupt_mode of struct pcie_device.
6.3 PCI Memory/IO Mapped Regions PCI Memory/IO Mapped Regions
----------------------------
Service drivers for PCI Express Power Management (PME), Advanced Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
@ -188,7 +209,8 @@ registers accessed are independent of each other. This patch assumes
that all service drivers will be well behaved and not overwrite that all service drivers will be well behaved and not overwrite
other service driver's configuration settings. other service driver's configuration settings.
6.4 PCI Config Registers PCI Config Registers
--------------------
Each service driver runs its PCI config operations on its own Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in capability structure except the PCI Express capability structure, in


@ -1,3 +1,7 @@
==================
Control Groupstats
==================
Control Groupstats is inspired by the discussion at Control Groupstats is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
@ -19,9 +23,9 @@ about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
information will not be available. information will not be available.
To extract cgroup statistics a utility very similar to getdelays.c To extract cgroup statistics a utility very similar to getdelays.c
has been developed, the sample output of the utility is shown below has been developed, the sample output of the utility is shown below::
~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a" ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0 sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup" ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2 sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2


@ -1,5 +1,6 @@
================
Delay accounting Delay accounting
---------------- ================
Tasks encounter delays in execution when they wait Tasks encounter delays in execution when they wait
for some kernel resource to become available e.g. a for some kernel resource to become available e.g. a
@ -39,7 +40,9 @@ in detail in a separate document in this directory. Taskstats returns a
generic data structure to userspace corresponding to per-pid and per-tgid generic data structure to userspace corresponding to per-pid and per-tgid
statistics. The delay accounting functionality populates specific fields of statistics. The delay accounting functionality populates specific fields of
this structure. See this structure. See
include/linux/taskstats.h include/linux/taskstats.h
for a description of the fields pertaining to delay accounting. for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative It will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim etc. delay seen for cpu, sync block I/O, swapin, memory reclaim etc.
@ -61,13 +64,16 @@ also serves as an example of using the taskstats interface.
Usage Usage
----- -----
Compile the kernel with Compile the kernel with::
CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASKSTATS=y CONFIG_TASKSTATS=y
Delay accounting is enabled by default at boot up. Delay accounting is enabled by default at boot up.
To disable, add To disable, add::
nodelayacct nodelayacct
to the kernel boot options. The rest of the instructions to the kernel boot options. The rest of the instructions
below assume this has not been done. below assume this has not been done.
@ -78,40 +84,43 @@ The utility also allows a given command to be
executed and the corresponding delays to be executed and the corresponding delays to be
seen. seen.
General format of the getdelays command General format of the getdelays command::
getdelays [-t tgid] [-p pid] [-c cmd...] getdelays [-t tgid] [-p pid] [-c cmd...]
Get delays, since system boot, for pid 10 Get delays, since system boot, for pid 10::
# ./getdelays -p 10
(output similar to next case)
Get sum of delays, since system boot, for all pids with tgid 5 # ./getdelays -p 10
# ./getdelays -t 5 (output similar to next case)
Get sum of delays, since system boot, for all pids with tgid 5::
# ./getdelays -t 5
CPU count real total virtual total delay total CPU count real total virtual total delay total
7876 92005750 100000000 24001500 7876 92005750 100000000 24001500
IO count delay total IO count delay total
0 0 0 0
SWAP count delay total SWAP count delay total
0 0 0 0
RECLAIM count delay total RECLAIM count delay total
0 0 0 0
Get delays seen in executing a given simple command Get delays seen in executing a given simple command::
# ./getdelays -c ls /
bin data1 data3 data5 dev home media opt root srv sys usr # ./getdelays -c ls /
boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
bin data1 data3 data5 dev home media opt root srv sys usr
boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
CPU count real total virtual total delay total CPU count real total virtual total delay total
6 4000250 4000000 0 6 4000250 4000000 0
IO count delay total IO count delay total
0 0 0 0
SWAP count delay total SWAP count delay total
0 0 0 0
RECLAIM count delay total RECLAIM count delay total
0 0 0 0


@ -0,0 +1,14 @@
.. SPDX-License-Identifier: GPL-2.0
==========
Accounting
==========
.. toctree::
:maxdepth: 1
cgroupstats
delay-accounting
psi
taskstats
taskstats-struct


@ -35,14 +35,14 @@ Pressure interface
Pressure information for each resource is exported through the Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io. respective file in /proc/pressure/ -- cpu, memory, and io.
The format for CPU is as such: The format for CPU is as such::
some avg10=0.00 avg60=0.00 avg300=0.00 total=0 some avg10=0.00 avg60=0.00 avg300=0.00 total=0
and for memory and IO: and for memory and IO::
some avg10=0.00 avg60=0.00 avg300=0.00 total=0 some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=0.00 total=0
The "some" line indicates the share of time in which at least some The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource. tasks are stalled on a given resource.
@ -77,9 +77,9 @@ To register a trigger user has to open psi interface file under
/proc/pressure/ representing the resource to be monitored and write the /proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll(). used to wait for trigger events using select(), poll() or epoll().
The following format is used: The following format is used::
<some|full> <stall amount in us> <time window in us> <some|full> <stall amount in us> <time window in us>
For example writing "some 150000 1000000" into /proc/pressure/memory For example writing "some 150000 1000000" into /proc/pressure/memory
would add 150ms threshold for partial memory stall measured within would add 150ms threshold for partial memory stall measured within
@ -115,18 +115,20 @@ trigger is closed.
Userspace monitor usage example Userspace monitor usage example
=============================== ===============================
#include <errno.h> ::
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>
/* #include <errno.h>
* Monitor memory partial stall with 1s tracking window size #include <fcntl.h>
* and 150ms threshold. #include <stdio.h>
*/ #include <poll.h>
int main() { #include <string.h>
#include <unistd.h>
/*
* Monitor memory partial stall with 1s tracking window size
* and 150ms threshold.
*/
int main() {
const char trig[] = "some 150000 1000000"; const char trig[] = "some 150000 1000000";
struct pollfd fds; struct pollfd fds;
int n; int n;
@ -165,7 +167,7 @@ int main() {
} }
return 0; return 0;
} }
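The trigger string hard-coded in the monitor above follows the `<some|full> <stall amount in us> <time window in us>` format. A small helper that builds such a string could look like the sketch below. `psi_format_trigger` is a hypothetical name, and rejecting a stall amount larger than the window is an assumption made here as a sanity check, not something this text states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a PSI trigger string "<some|full> <stall us> <window us>" in buf.
 * Returns the string length on success, -1 on invalid input or if buf
 * is too small. Hypothetical helper, for illustration only.
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
		       unsigned long stall_us, unsigned long window_us)
{
	int n;

	assert(buf != NULL && kind != NULL);

	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;
	/* Assumption: a stall larger than its window can never be reached. */
	if (stall_us > window_us)
		return -1;

	n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* encoding error or truncated output */
	return n;
}
```

With `"some"`, `150000` and `1000000` as arguments, the helper produces the same string as the `trig[]` constant in the example.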
Cgroup2 interface Cgroup2 interface
================= =================


@ -1,5 +1,6 @@
====================
The struct taskstats The struct taskstats
-------------------- ====================
This document contains an explanation of the struct taskstats fields. This document contains an explanation of the struct taskstats fields.
@ -10,16 +11,24 @@ There are three different groups of fields in the struct taskstats:
the common fields and basic accounting fields are collected for the common fields and basic accounting fields are collected for
delivery at do_exit() of a task. delivery at do_exit() of a task.
2) Delay accounting fields 2) Delay accounting fields
These fields are placed between These fields are placed between::
/* Delay accounting fields start */
and /* Delay accounting fields start */
/* Delay accounting fields end */
and::
/* Delay accounting fields end */
Their values are collected if CONFIG_TASK_DELAY_ACCT is set. Their values are collected if CONFIG_TASK_DELAY_ACCT is set.
3) Extended accounting fields 3) Extended accounting fields
These fields are placed between These fields are placed between::
/* Extended accounting fields start */
and /* Extended accounting fields start */
/* Extended accounting fields end */
and::
/* Extended accounting fields end */
Their values are collected if CONFIG_TASK_XACCT is set. Their values are collected if CONFIG_TASK_XACCT is set.
4) Per-task and per-thread context switch count statistics 4) Per-task and per-thread context switch count statistics
@ -31,31 +40,33 @@ There are three different groups of fields in the struct taskstats:
Future extension should add fields to the end of the taskstats struct, and Future extension should add fields to the end of the taskstats struct, and
should not change the relative position of each field within the struct. should not change the relative position of each field within the struct.
::
struct taskstats { struct taskstats {
1) Common and basic accounting fields::
1) Common and basic accounting fields:
/* The version number of this struct. This field is always set to /* The version number of this struct. This field is always set to
* TAKSTATS_VERSION, which is defined in <linux/taskstats.h>. * TAKSTATS_VERSION, which is defined in <linux/taskstats.h>.
* Each time the struct is changed, the value should be incremented. * Each time the struct is changed, the value should be incremented.
*/ */
__u16 version; __u16 version;
/* The exit code of a task. */ /* The exit code of a task. */
__u32 ac_exitcode; /* Exit status */ __u32 ac_exitcode; /* Exit status */
/* The accounting flags of a task as defined in <linux/acct.h> /* The accounting flags of a task as defined in <linux/acct.h>
* Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG. * Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG.
*/ */
__u8 ac_flag; /* Record flags */ __u8 ac_flag; /* Record flags */
/* The value of task_nice() of a task. */ /* The value of task_nice() of a task. */
__u8 ac_nice; /* task_nice */ __u8 ac_nice; /* task_nice */
/* The name of the command that started this task. */ /* The name of the command that started this task. */
char ac_comm[TS_COMM_LEN]; /* Command name */ char ac_comm[TS_COMM_LEN]; /* Command name */
/* The scheduling discipline as set in task->policy field. */ /* The scheduling discipline as set in task->policy field. */
__u8 ac_sched; /* Scheduling discipline */ __u8 ac_sched; /* Scheduling discipline */
__u8 ac_pad[3]; __u8 ac_pad[3];
@ -64,26 +75,27 @@ struct taskstats {
__u32 ac_pid; /* Process ID */ __u32 ac_pid; /* Process ID */
__u32 ac_ppid; /* Parent process ID */ __u32 ac_ppid; /* Parent process ID */
/* The time when a task begins, in [secs] since 1970. */ /* The time when a task begins, in [secs] since 1970. */
__u32 ac_btime; /* Begin time [sec since 1970] */ __u32 ac_btime; /* Begin time [sec since 1970] */
/* The elapsed time of a task, in [usec]. */ /* The elapsed time of a task, in [usec]. */
__u64 ac_etime; /* Elapsed time [usec] */ __u64 ac_etime; /* Elapsed time [usec] */
/* The user CPU time of a task, in [usec]. */ /* The user CPU time of a task, in [usec]. */
__u64 ac_utime; /* User CPU time [usec] */ __u64 ac_utime; /* User CPU time [usec] */
/* The system CPU time of a task, in [usec]. */ /* The system CPU time of a task, in [usec]. */
__u64 ac_stime; /* System CPU time [usec] */ __u64 ac_stime; /* System CPU time [usec] */
/* The minor page fault count of a task, as set in task->min_flt. */ /* The minor page fault count of a task, as set in task->min_flt. */
__u64 ac_minflt; /* Minor Page Fault Count */ __u64 ac_minflt; /* Minor Page Fault Count */
/* The major page fault count of a task, as set in task->maj_flt. */ /* The major page fault count of a task, as set in task->maj_flt. */
__u64 ac_majflt; /* Major Page Fault Count */ __u64 ac_majflt; /* Major Page Fault Count */
2) Delay accounting fields: 2) Delay accounting fields::
/* Delay accounting fields start /* Delay accounting fields start
* *
* All values, until the comment "Delay accounting fields end" are * All values, until the comment "Delay accounting fields end" are
@ -134,7 +146,8 @@ struct taskstats {
/* version 1 ends here */ /* version 1 ends here */
3) Extended accounting fields 3) Extended accounting fields::
/* Extended accounting fields start */ /* Extended accounting fields start */
/* Accumulated RSS usage in duration of a task, in MBytes-usecs. /* Accumulated RSS usage in duration of a task, in MBytes-usecs.
@ -145,15 +158,15 @@ struct taskstats {
*/ */
__u64 coremem; /* accumulated RSS usage in MB-usec */ __u64 coremem; /* accumulated RSS usage in MB-usec */
/* Accumulated virtual memory usage in duration of a task. /* Accumulated virtual memory usage in duration of a task.
* Same as acct_rss_mem1 above except that we keep track of VM usage. * Same as acct_rss_mem1 above except that we keep track of VM usage.
*/ */
__u64 virtmem; /* accumulated VM usage in MB-usec */ __u64 virtmem; /* accumulated VM usage in MB-usec */
/* High watermark of RSS usage in duration of a task, in KBytes. */ /* High watermark of RSS usage in duration of a task, in KBytes. */
__u64 hiwater_rss; /* High-watermark of RSS usage */ __u64 hiwater_rss; /* High-watermark of RSS usage */
/* High watermark of VM usage in duration of a task, in KBytes. */ /* High watermark of VM usage in duration of a task, in KBytes. */
__u64 hiwater_vm; /* High-water virtual memory usage */ __u64 hiwater_vm; /* High-water virtual memory usage */
/* The following four fields are I/O statistics of a task. */ /* The following four fields are I/O statistics of a task. */
@ -164,17 +177,23 @@ struct taskstats {
/* Extended accounting fields end */ /* Extended accounting fields end */
4) Per-task and per-thread statistics 4) Per-task and per-thread statistics::
__u64 nvcsw; /* Context voluntary switch counter */ __u64 nvcsw; /* Context voluntary switch counter */
__u64 nivcsw; /* Context involuntary switch counter */ __u64 nivcsw; /* Context involuntary switch counter */
5) Time accounting for SMT machines 5) Time accounting for SMT machines::
__u64 ac_utimescaled; /* utime scaled on frequency etc */ __u64 ac_utimescaled; /* utime scaled on frequency etc */
__u64 ac_stimescaled; /* stime scaled on frequency etc */ __u64 ac_stimescaled; /* stime scaled on frequency etc */
__u64 cpu_scaled_run_real_total; /* scaled cpu_run_real_total */ __u64 cpu_scaled_run_real_total; /* scaled cpu_run_real_total */
6) Extended delay accounting fields for memory reclaim 6) Extended delay accounting fields for memory reclaim::
/* Delay waiting for memory reclaim */ /* Delay waiting for memory reclaim */
__u64 freepages_count; __u64 freepages_count;
__u64 freepages_delay_total; __u64 freepages_delay_total;
}
::
}


@ -1,5 +1,6 @@
=============================
Per-task statistics interface Per-task statistics interface
----------------------------- =============================
Taskstats is a netlink-based interface for sending per-task and Taskstats is a netlink-based interface for sending per-task and
@ -65,7 +66,7 @@ taskstats.h file.
The data exchanged between user and kernel space is a netlink message belonging The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface. to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format The messages are in the format::
+----------+- - -+-------------+-------------------+ +----------+- - -+-------------+-------------------+
| nlmsghdr | Pad | genlmsghdr | taskstats payload | | nlmsghdr | Pad | genlmsghdr | taskstats payload |
@ -167,15 +168,13 @@ extended and the number of cpus grows large.
To avoid losing statistics, userspace should do one or more of the following: To avoid losing statistics, userspace should do one or more of the following:
- increase the receive buffer sizes for the netlink sockets opened by - increase the receive buffer sizes for the netlink sockets opened by
listeners to receive exit data. listeners to receive exit data.
- create more listeners and reduce the number of cpus being listened to by - create more listeners and reduce the number of cpus being listened to by
each listener. In the extreme case, there could be one listener for each cpu. each listener. In the extreme case, there could be one listener for each cpu.
Users may also consider setting the cpu affinity of the listener to the subset Users may also consider setting the cpu affinity of the listener to the subset
of cpus to which it listens, especially if they are listening to just one cpu. of cpus to which it listens, especially if they are listening to just one cpu.
Despite these measures, if the userspace receives ENOBUFS error messages Despite these measures, if the userspace receives ENOBUFS error messages
indicated overflow of receive buffers, it should take measures to handle the indicated overflow of receive buffers, it should take measures to handle the
loss of data. loss of data.
----
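The first measure listed above, a larger receive buffer, can be requested with the standard `SO_RCVBUF` socket option. The sketch below assumes the listener already holds an open socket descriptor; `grow_rcvbuf` is a hypothetical helper name, and creating the netlink socket itself is not shown.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Ask the kernel for a larger receive buffer on socket fd.
 * Returns the size the kernel reports afterwards, or -1 on error.
 * The kernel may clamp the request (on Linux, to net.core.rmem_max),
 * and Linux reports a doubled value to cover bookkeeping overhead,
 * so the result need not equal the requested size.
 */
int grow_rcvbuf(int fd, int bytes)
{
	int actual = 0;
	socklen_t len = sizeof(actual);

	assert(bytes > 0);

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
		return -1;
	return actual;
}
```

Reading the value back matters because the request can be silently reduced; a listener that still sees ENOBUFS after this call has to fall back on the other measures or handle the loss.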


@ -20,7 +20,7 @@ driver. The aoetools are on sourceforge.
http://aoetools.sourceforge.net/ http://aoetools.sourceforge.net/
The scripts in this Documentation/aoe directory are intended to The scripts in this Documentation/admin-guide/aoe directory are intended to
document the use of the driver and are not necessary if you install document the use of the driver and are not necessary if you install
the aoetools. the aoetools.
@ -86,7 +86,7 @@ Using sysfs
a convenient way. Users with aoetools should use the aoe-stat a convenient way. Users with aoetools should use the aoe-stat
command:: command::
root@makki root# sh Documentation/aoe/status.sh root@makki root# sh Documentation/admin-guide/aoe/status.sh
e10.0 eth3 up e10.0 eth3 up
e10.1 eth3 up e10.1 eth3 up
e10.2 eth3 up e10.2 eth3 up


@ -1,5 +1,3 @@
:orphan:
======================= =======================
ATA over Ethernet (AoE) ATA over Ethernet (AoE)
======================= =======================


@ -11,7 +11,7 @@
# udev_rules="/etc/udev/rules.d/" # udev_rules="/etc/udev/rules.d/"
# bash# ls /etc/udev/rules.d/ # bash# ls /etc/udev/rules.d/
# 10-wacom.rules 50-udev.rules # 10-wacom.rules 50-udev.rules
# bash# cp /path/to/linux/Documentation/aoe/udev.txt \ # bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \
# /etc/udev/rules.d/60-aoe.rules # /etc/udev/rules.d/60-aoe.rules
# #

[Image file: 22 KiB before, 22 KiB after]

[Image file: 17 KiB before, 17 KiB after]


@ -1,3 +1,7 @@
================================
kernel data structure for DRBD-9
================================
This describes the in kernel data structure for DRBD-9. Starting with This describes the in kernel data structure for DRBD-9. Starting with
Linux v3.14 we are reorganizing DRBD to use this data structure. Linux v3.14 we are reorganizing DRBD to use this data structure.
@ -10,7 +14,7 @@ device is represented by a block device locally.
The DRBD objects are interconnected to form a matrix as depicted below; a The DRBD objects are interconnected to form a matrix as depicted below; a
drbd_peer_device object sits at each intersection between a drbd_device and a drbd_peer_device object sits at each intersection between a drbd_device and a
drbd_connection: drbd_connection::
/--------------+---------------+.....+---------------\ /--------------+---------------+.....+---------------\
| resource | device | | device | | resource | device | | device |

View File

@ -0,0 +1,30 @@
.. SPDX-License-Identifier: GPL-2.0
.. The here included files are intended to help understand the implementation
Data flows that Relate some functions, and write packets
========================================================
.. kernel-figure:: DRBD-8.3-data-packets.svg
:alt: DRBD-8.3-data-packets.svg
:align: center
.. kernel-figure:: DRBD-data-packets.svg
:alt: DRBD-data-packets.svg
:align: center
Sub graphs of DRBD's state transitions
======================================
.. kernel-figure:: conn-states-8.dot
:alt: conn-states-8.dot
:align: center
.. kernel-figure:: disk-states-8.dot
:alt: disk-states-8.dot
:align: center
.. kernel-figure:: node-states-8.dot
:alt: node-states-8.dot
:align: center

View File

@ -1,4 +1,9 @@
==========================================
Distributed Replicated Block Device - DRBD
==========================================
Description Description
===========
DRBD is a shared-nothing, synchronously replicated block device. It DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability is designed to serve as a building block for high availability
@ -7,10 +12,8 @@ Description
Please visit http://www.drbd.org to find out more. Please visit http://www.drbd.org to find out more.
The here included files are intended to help understand the implementation .. toctree::
:maxdepth: 1
DRBD-8.3-data-packets.svg, DRBD-data-packets.svg data-structure-v9
relates some functions, and write packets. figures
conn-states-8.dot, disk-states-8.dot, node-states-8.dot
The sub graphs of DRBD's state transitions

View File

@ -11,4 +11,3 @@ digraph peer_states {
Unknown -> Primary [ label = "connected" ] Unknown -> Primary [ label = "connected" ]
Unknown -> Secondary [ label = "connected" ] Unknown -> Secondary [ label = "connected" ]
} }

View File

@ -1,35 +1,37 @@
This file describes the floppy driver. =============
Floppy Driver
=============
FAQ list: FAQ list:
========= =========
A FAQ list may be found in the fdutils package (see below), and also A FAQ list may be found in the fdutils package (see below), and also
at <http://fdutils.linux.lu/faq.html>. at <http://fdutils.linux.lu/faq.html>.
LILO configuration options (Thinkpad users, read this) LILO configuration options (Thinkpad users, read this)
====================================================== ======================================================
The floppy driver is configured using the 'floppy=' option in The floppy driver is configured using the 'floppy=' option in
lilo. This option can be typed at the boot prompt, or entered in the lilo. This option can be typed at the boot prompt, or entered in the
lilo configuration file. lilo configuration file.
Example: If your kernel is called linux-2.6.9, type the following line Example: If your kernel is called linux-2.6.9, type the following line
at the lilo boot prompt (if you have a thinkpad): at the lilo boot prompt (if you have a thinkpad)::
linux-2.6.9 floppy=thinkpad linux-2.6.9 floppy=thinkpad
You may also enter the following line in /etc/lilo.conf, in the description You may also enter the following line in /etc/lilo.conf, in the description
of linux-2.6.9: of linux-2.6.9::
append = "floppy=thinkpad" append = "floppy=thinkpad"
Several floppy related options may be given, example: Several floppy related options may be given, example::
linux-2.6.9 floppy=daring floppy=two_fdc linux-2.6.9 floppy=daring floppy=two_fdc
append = "floppy=daring floppy=two_fdc" append = "floppy=daring floppy=two_fdc"
If you give options both in the lilo config file and on the boot If you give options both in the lilo config file and on the boot
prompt, the option strings of both places are concatenated, the boot prompt, the option strings of both places are concatenated, the boot
prompt options coming last. That's why there are also options to prompt options coming last. That's why there are also options to
restore the default behavior. restore the default behavior.
@ -38,21 +40,23 @@ restore the default behavior.
Module configuration options Module configuration options
============================ ============================
If you use the floppy driver as a module, use the following syntax: If you use the floppy driver as a module, use the following syntax::
modprobe floppy floppy="<options>"
Example: modprobe floppy floppy="<options>"
modprobe floppy floppy="omnibook messages"
If you need certain options enabled every time you load the floppy driver, Example::
you can put:
options floppy floppy="omnibook messages" modprobe floppy floppy="omnibook messages"
If you need certain options enabled every time you load the floppy driver,
you can put::
options floppy floppy="omnibook messages"
in a configuration file in /etc/modprobe.d/. in a configuration file in /etc/modprobe.d/.
The floppy driver related options are: The floppy driver related options are:
floppy=asus_pci floppy=asus_pci
Sets the bit mask to allow only units 0 and 1. (default) Sets the bit mask to allow only units 0 and 1. (default)
@ -70,8 +74,7 @@ in a configuration file in /etc/modprobe.d/.
Tells the floppy driver that you have only one floppy controller. Tells the floppy driver that you have only one floppy controller.
(default) (default)
floppy=two_fdc floppy=two_fdc / floppy=<address>,two_fdc
floppy=<address>,two_fdc
Tells the floppy driver that you have two floppy controllers. Tells the floppy driver that you have two floppy controllers.
The second floppy controller is assumed to be at <address>. The second floppy controller is assumed to be at <address>.
This option is not needed if the second controller is at address This option is not needed if the second controller is at address
@ -84,8 +87,7 @@ in a configuration file in /etc/modprobe.d/.
floppy=0,thinkpad floppy=0,thinkpad
Tells the floppy driver that you don't have a Thinkpad. Tells the floppy driver that you don't have a Thinkpad.
floppy=omnibook floppy=omnibook / floppy=nodma
floppy=nodma
Tells the floppy driver not to use Dma for data transfers. Tells the floppy driver not to use Dma for data transfers.
This is needed on HP Omnibooks, which don't have a workable This is needed on HP Omnibooks, which don't have a workable
DMA channel for the floppy driver. This option is also useful DMA channel for the floppy driver. This option is also useful
@ -144,14 +146,16 @@ in a configuration file in /etc/modprobe.d/.
described in the physical CMOS), or if your BIOS uses described in the physical CMOS), or if your BIOS uses
non-standard CMOS types. The CMOS types are: non-standard CMOS types. The CMOS types are:
0 - Use the value of the physical CMOS == ==================================
1 - 5 1/4 DD 0 Use the value of the physical CMOS
2 - 5 1/4 HD 1 5 1/4 DD
3 - 3 1/2 DD 2 5 1/4 HD
4 - 3 1/2 HD 3 3 1/2 DD
5 - 3 1/2 ED 4 3 1/2 HD
6 - 3 1/2 ED 5 3 1/2 ED
16 - unknown or not installed 6 3 1/2 ED
16 unknown or not installed
== ==================================
(Note: there are two valid types for ED drives. This is because 5 was (Note: there are two valid types for ED drives. This is because 5 was
initially chosen to represent floppy *tapes*, and 6 for ED drives. initially chosen to represent floppy *tapes*, and 6 for ED drives.
@ -162,8 +166,7 @@ in a configuration file in /etc/modprobe.d/.
Print a warning message when an unexpected interrupt is received. Print a warning message when an unexpected interrupt is received.
(default) (default)
floppy=no_unexpected_interrupts floppy=no_unexpected_interrupts / floppy=L40SX
floppy=L40SX
Don't print a message when an unexpected interrupt is received. This Don't print a message when an unexpected interrupt is received. This
is needed on IBM L40SX laptops in certain video modes. (There seems is needed on IBM L40SX laptops in certain video modes. (There seems
to be an interaction between video and floppy. The unexpected to be an interaction between video and floppy. The unexpected
@ -199,47 +202,54 @@ in a configuration file in /etc/modprobe.d/.
Sets the floppy DMA channel to <nr> instead of 2. Sets the floppy DMA channel to <nr> instead of 2.
floppy=slow floppy=slow
Use PS/2 stepping rate: Use PS/2 stepping rate::
" PS/2 floppies have much slower step rates than regular floppies.
PS/2 floppies have much slower step rates than regular floppies.
It's been recommended that take about 1/4 of the default speed It's been recommended that take about 1/4 of the default speed
in some more extreme cases." in some more extreme cases.
Supporting utilities and additional documentation: Supporting utilities and additional documentation:
================================================== ==================================================
Additional parameters of the floppy driver can be configured at Additional parameters of the floppy driver can be configured at
runtime. Utilities which do this can be found in the fdutils package. runtime. Utilities which do this can be found in the fdutils package.
This package also contains a new version of mtools which allows to This package also contains a new version of mtools which allows to
access high capacity disks (up to 1992K on a high density 3 1/2 disk!). access high capacity disks (up to 1992K on a high density 3 1/2 disk!).
It also contains additional documentation about the floppy driver. It also contains additional documentation about the floppy driver.
The latest version can be found at fdutils homepage: The latest version can be found at fdutils homepage:
http://fdutils.linux.lu http://fdutils.linux.lu
The fdutils releases can be found at: The fdutils releases can be found at:
http://fdutils.linux.lu/download.html http://fdutils.linux.lu/download.html
http://www.tux.org/pub/knaff/fdutils/ http://www.tux.org/pub/knaff/fdutils/
ftp://metalab.unc.edu/pub/Linux/utils/disk-management/ ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
Reporting problems about the floppy driver Reporting problems about the floppy driver
========================================== ==========================================
If you have a question or a bug report about the floppy driver, mail If you have a question or a bug report about the floppy driver, mail
me at Alain.Knaff@poboxes.com . If you post to Usenet, preferably use me at Alain.Knaff@poboxes.com . If you post to Usenet, preferably use
comp.os.linux.hardware. As the volume in these groups is rather high, comp.os.linux.hardware. As the volume in these groups is rather high,
be sure to include the word "floppy" (or "FLOPPY") in the subject be sure to include the word "floppy" (or "FLOPPY") in the subject
line. If the reported problem happens when mounting floppy disks, be line. If the reported problem happens when mounting floppy disks, be
sure to mention also the type of the filesystem in the subject line. sure to mention also the type of the filesystem in the subject line.
Be sure to read the FAQ before mailing/posting any bug reports! Be sure to read the FAQ before mailing/posting any bug reports!
Alain Alain
Changelog Changelog
========= =========
10-30-2004 : Cleanup, updating, add reference to module configuration. 10-30-2004 :
Cleanup, updating, add reference to module configuration.
James Nelson <james4765@gmail.com> James Nelson <james4765@gmail.com>
6-3-2000 : Original Document 6-3-2000 :
Original Document

View File

@ -0,0 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0
===========================
The Linux RapidIO Subsystem
===========================
.. toctree::
:maxdepth: 1
floppy
nbd
paride
ramdisk
zram
drbd/index

View File

@ -1,3 +1,4 @@
==================================
Network Block Device (TCP version) Network Block Device (TCP version)
================================== ==================================
@ -28,4 +29,3 @@ max_part
nbds_max nbds_max
Number of block devices that should be initialized (default: 16). Number of block devices that should be initialized (default: 16).

View File

@ -1,15 +1,17 @@
===================================
Linux and parallel port IDE devices Linux and parallel port IDE devices
===================================
PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net> PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>
1. Introduction 1. Introduction
===============
Owing to the simplicity and near universality of the parallel port interface Owing to the simplicity and near universality of the parallel port interface
to personal computers, many external devices such as portable hard-disk, to personal computers, many external devices such as portable hard-disk,
CD-ROM, LS-120 and tape drives use the parallel port to connect to their CD-ROM, LS-120 and tape drives use the parallel port to connect to their
host computer. While some devices (notably scanners) use ad-hoc methods host computer. While some devices (notably scanners) use ad-hoc methods
to pass commands and data through the parallel port interface, most to pass commands and data through the parallel port interface, most
external devices are actually identical to an internal model, but with external devices are actually identical to an internal model, but with
a parallel-port adapter chip added in. Some of the original parallel port a parallel-port adapter chip added in. Some of the original parallel port
adapters were little more than mechanisms for multiplexing a SCSI bus. adapters were little more than mechanisms for multiplexing a SCSI bus.
@ -28,47 +30,50 @@ were to open up a parallel port CD-ROM drive, for instance, one would
find a standard ATAPI CD-ROM drive, a power supply, and a single adapter find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
that interconnected a standard PC parallel port cable and a standard that interconnected a standard PC parallel port cable and a standard
IDE cable. It is usually possible to exchange the CD-ROM device with IDE cable. It is usually possible to exchange the CD-ROM device with
any other device using the IDE interface. any other device using the IDE interface.
The document describes the support in Linux for parallel port IDE The document describes the support in Linux for parallel port IDE
devices. It does not cover parallel port SCSI devices, "ditto" tape devices. It does not cover parallel port SCSI devices, "ditto" tape
drives or scanners. Many different devices are supported by the drives or scanners. Many different devices are supported by the
parallel port IDE subsystem, including: parallel port IDE subsystem, including:
MicroSolutions backpack CD-ROM - MicroSolutions backpack CD-ROM
MicroSolutions backpack PD/CD - MicroSolutions backpack PD/CD
MicroSolutions backpack hard-drives - MicroSolutions backpack hard-drives
MicroSolutions backpack 8000t tape drive - MicroSolutions backpack 8000t tape drive
SyQuest EZ-135, EZ-230 & SparQ drives - SyQuest EZ-135, EZ-230 & SparQ drives
Avatar Shark - Avatar Shark
Imation Superdisk LS-120 - Imation Superdisk LS-120
Maxell Superdisk LS-120 - Maxell Superdisk LS-120
FreeCom Power CD - FreeCom Power CD
Hewlett-Packard 5GB and 8GB tape drives - Hewlett-Packard 5GB and 8GB tape drives
Hewlett-Packard 7100 and 7200 CD-RW drives - Hewlett-Packard 7100 and 7200 CD-RW drives
as well as most of the clone and no-name products on the market. as well as most of the clone and no-name products on the market.
To support such a wide range of devices, PARIDE, the parallel port IDE To support such a wide range of devices, PARIDE, the parallel port IDE
subsystem, is actually structured in three parts. There is a base subsystem, is actually structured in three parts. There is a base
paride module which provides a registry and some common methods for paride module which provides a registry and some common methods for
accessing the parallel ports. The second component is a set of accessing the parallel ports. The second component is a set of
high-level drivers for each of the different types of supported devices: high-level drivers for each of the different types of supported devices:
=== =============
pd IDE disk pd IDE disk
pcd ATAPI CD-ROM pcd ATAPI CD-ROM
pf ATAPI disk pf ATAPI disk
pt ATAPI tape pt ATAPI tape
pg ATAPI generic pg ATAPI generic
=== =============
(Currently, the pg driver is only used with CD-R drives). (Currently, the pg driver is only used with CD-R drives).
The high-level drivers function according to the relevant standards. The high-level drivers function according to the relevant standards.
The third component of PARIDE is a set of low-level protocol drivers The third component of PARIDE is a set of low-level protocol drivers
for each of the parallel port IDE adapter chips. Thanks to the interest for each of the parallel port IDE adapter chips. Thanks to the interest
and encouragement of Linux users from many parts of the world, and encouragement of Linux users from many parts of the world,
support is available for almost all known adapter protocols: support is available for almost all known adapter protocols:
==== ====================================== ====
aten ATEN EH-100 (HK) aten ATEN EH-100 (HK)
bpck Microsolutions backpack (US) bpck Microsolutions backpack (US)
comm DataStor (old-type) "commuter" adapter (TW) comm DataStor (old-type) "commuter" adapter (TW)
@ -83,9 +88,11 @@ support is available for almost all known adapter protocols:
ktti KT Technology PHd adapter (SG) ktti KT Technology PHd adapter (SG)
on20 OnSpec 90c20 (US) on20 OnSpec 90c20 (US)
on26 OnSpec 90c26 (US) on26 OnSpec 90c26 (US)
==== ====================================== ====
2. Using the PARIDE subsystem 2. Using the PARIDE subsystem
=============================
While configuring the Linux kernel, you may choose either to build While configuring the Linux kernel, you may choose either to build
the PARIDE drivers into your kernel, or to build them as modules. the PARIDE drivers into your kernel, or to build them as modules.
@ -94,10 +101,10 @@ In either case, you will need to select "Parallel port IDE device support"
as well as at least one of the high-level drivers and at least one as well as at least one of the high-level drivers and at least one
of the parallel port communication protocols. If you do not know of the parallel port communication protocols. If you do not know
what kind of parallel port adapter is used in your drive, you could what kind of parallel port adapter is used in your drive, you could
begin by checking the file names and any text files on your DOS begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the the adapter chip itself. That's usually sufficient to identify the
correct device. correct device.
You can actually select all the protocol modules, and allow the PARIDE You can actually select all the protocol modules, and allow the PARIDE
subsystem to try them all for you. subsystem to try them all for you.
@ -105,8 +112,9 @@ subsystem to try them all for you.
For the "brand-name" products listed above, here are the protocol For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use: and high-level drivers that you would use:
================ ============ ====== ========
Manufacturer Model Driver Protocol Manufacturer Model Driver Protocol
================ ============ ====== ========
MicroSolutions CD-ROM pcd bpck MicroSolutions CD-ROM pcd bpck
MicroSolutions PD drive pf bpck MicroSolutions PD drive pf bpck
MicroSolutions hard-drive pd bpck MicroSolutions hard-drive pd bpck
@ -119,8 +127,10 @@ and high-level drivers that you would use:
Hewlett-Packard 5GB Tape pt epat Hewlett-Packard 5GB Tape pt epat
Hewlett-Packard 7200e (CD) pcd epat Hewlett-Packard 7200e (CD) pcd epat
Hewlett-Packard 7200e (CD-R) pg epat Hewlett-Packard 7200e (CD-R) pg epat
================ ============ ====== ========
2.1 Configuring built-in drivers 2.1 Configuring built-in drivers
---------------------------------
We recommend that you get to know how the drivers work and how to We recommend that you get to know how the drivers work and how to
configure them as loadable modules, before attempting to compile a configure them as loadable modules, before attempting to compile a
@ -143,7 +153,7 @@ protocol identification number and, for some devices, the drive's
chain ID. While your system is booting, a number of messages are chain ID. While your system is booting, a number of messages are
displayed on the console. Like all such messages, they can be displayed on the console. Like all such messages, they can be
reviewed with the 'dmesg' command. Among those messages will be reviewed with the 'dmesg' command. Among those messages will be
some lines like: some lines like::
paride: bpck registered as protocol 0 paride: bpck registered as protocol 0
paride: epat registered as protocol 1 paride: epat registered as protocol 1
@ -158,10 +168,10 @@ the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this). documentation about this).
As an example, let's assume that you have a MicroSolutions PD/CD drive As an example, let's assume that you have a MicroSolutions PD/CD drive
with unit ID number 36 connected to the parallel port at 0x378, a SyQuest with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
EZ-135 connected to the chained port on the PD/CD drive and also an EZ-135 connected to the chained port on the PD/CD drive and also an
Imation Superdisk connected to port 0x278. You could give the following Imation Superdisk connected to port 0x278. You could give the following
options on your boot command: options on your boot command::
pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36 pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
@ -169,24 +179,27 @@ In the last option, pf.drive1 configures device /dev/pf1, the 0x378
is the parallel port base address, the 0 is the protocol registration is the parallel port base address, the 0 is the protocol registration
number and 36 is the chain ID. number and 36 is the chain ID.
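The field layout described above can be sketched as a small shell helper. This is only an illustration: `parse_drive_opt` is a hypothetical name, it handles only the boot command form (`<driver>.drive<N>=...`), and it assumes the third value is a chain ID, which the text says applies only to some devices.

```shell
#!/bin/sh
# Split a PARIDE boot option such as "pf.drive1=0x378,0,36" into its fields.
# parse_drive_opt is a hypothetical helper, not part of the kernel tree.
parse_drive_opt() {
    opt=$1
    name=${opt%%=*}          # e.g. "pf.drive1"
    values=${opt#*=}         # e.g. "0x378,0,36"
    driver=${name%%.*}       # e.g. "pf"
    unit=${name##*drive}     # e.g. "1"
    # Split the comma-separated values into positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $values
    IFS=$old_ifs
    # A missing third value is reported as "none".
    echo "driver=$driver unit=$unit port=$1 protocol=$2 chain_id=${3:-none}"
}

parse_drive_opt "pf.drive1=0x378,0,36"
```

Options without a chain ID, such as `pd.drive0=0x378,1`, are reported with `chain_id=none`.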
Please note: while PARIDE will work both with and without the Please note: while PARIDE will work both with and without the
PARPORT parallel port sharing system that is included by the PARPORT parallel port sharing system that is included by the
"Parallel port support" option, PARPORT must be included and enabled "Parallel port support" option, PARPORT must be included and enabled
if you want to use chains of devices on the same parallel port. if you want to use chains of devices on the same parallel port.
2.2 Loading and configuring PARIDE as modules 2.2 Loading and configuring PARIDE as modules
----------------------------------------------
It is much faster and simpler to get to understand the PARIDE drivers It is much faster and simpler to get to understand the PARIDE drivers
if you use them as loadable kernel modules. if you use them as loadable kernel modules.
Note 1: using these drivers with the "kerneld" automatic module loading Note 1:
system is not recommended for beginners, and is not documented here. using these drivers with the "kerneld" automatic module loading
system is not recommended for beginners, and is not documented here.
Note 2: if you build PARPORT support as a loadable module, PARIDE must Note 2:
also be built as loadable modules, and PARPORT must be loaded before the if you build PARPORT support as a loadable module, PARIDE must
PARIDE modules. also be built as loadable modules, and PARPORT must be loaded before
the PARIDE modules.
To use PARIDE, you must begin by To use PARIDE, you must begin by::
insmod paride insmod paride
@ -195,8 +208,8 @@ among other tasks.
Then, load as many of the protocol modules as you think you might need. Then, load as many of the protocol modules as you think you might need.
As you load each module, it will register the protocols that it supports, As you load each module, it will register the protocols that it supports,
and print a log message to your kernel log file and your console. For and print a log message to your kernel log file and your console. For
example: example::
# insmod epat # insmod epat
paride: epat registered as protocol 0 paride: epat registered as protocol 0
@ -205,22 +218,22 @@ example:
paride: k971 registered as protocol 2 paride: k971 registered as protocol 2
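Since the protocol number depends on the order in which the modules were loaded, it may help to read it back from the log rather than guess it. A minimal sketch, assuming the log lines have the form shown above; `protocol_number` is a hypothetical helper name, and the sample input is fed in with printf so the block runs without a real device:

```shell
#!/bin/sh
# Print the registration number of a PARIDE protocol, reading kernel log
# lines on stdin. protocol_number is a hypothetical helper name.
protocol_number() {
    # Keep only the number from matching lines; if the protocol was
    # registered more than once, report the most recent registration.
    sed -n "s/^.*paride: $1 registered as protocol \([0-9][0-9]*\).*$/\1/p" | tail -n 1
}

# Sample input copied from the log messages shown above.
printf '%s\n' \
    'paride: epat registered as protocol 0' \
    'paride: k971 registered as protocol 2' |
    protocol_number k971
```

On a live system the input would presumably come from `dmesg | protocol_number k971` instead.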
Finally, you can load high-level drivers for each kind of device that Finally, you can load high-level drivers for each kind of device that
you have connected. By default, each driver will autoprobe for a single you have connected. By default, each driver will autoprobe for a single
device, but you can support up to four similar devices by giving their device, but you can support up to four similar devices by giving their
individual co-ordinates when you load the driver. individual co-ordinates when you load the driver.
For example, if you had two no-name CD-ROM drives both using the For example, if you had two no-name CD-ROM drives both using the
KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
you could give the following command: you could give the following command::
# insmod pcd drive0=0x378,1 drive1=0x3bc,1 # insmod pcd drive0=0x378,1 drive1=0x3bc,1
For most adapters, giving a port address and protocol number is sufficient, For most adapters, giving a port address and protocol number is sufficient,
but check the source files in linux/drivers/block/paride for more but check the source files in linux/drivers/block/paride for more
information. (Hopefully someone will write some man pages one day !). information. (Hopefully someone will write some man pages one day !).
As another example, here's what happens when PARPORT is installed, and As another example, here's what happens when PARPORT is installed, and
a SyQuest EZ-135 is attached to port 0x378: a SyQuest EZ-135 is attached to port 0x378::
# insmod paride # insmod paride
paride: version 1.0 installed paride: version 1.0 installed
@ -237,46 +250,47 @@ Note that the last line is the output from the generic partition table
scanner - in this case it reports that it has found a disk with one partition. scanner - in this case it reports that it has found a disk with one partition.
2.3 Using a PARIDE device 2.3 Using a PARIDE device
--------------------------
Once the drivers have been loaded, you can access PARIDE devices in the Once the drivers have been loaded, you can access PARIDE devices in the
same way as their traditional counterparts. You will probably need to same way as their traditional counterparts. You will probably need to
create the device "special files". Here is a simple script that you can create the device "special files". Here is a simple script that you can
cut to a file and execute: cut to a file and execute::
#!/bin/bash #!/bin/bash
# #
# mkd -- a script to create the device special files for the PARIDE subsystem # mkd -- a script to create the device special files for the PARIDE subsystem
# #
function mkdev { function mkdev {
mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1 mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
} }
# #
function pd { function pd {
D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) ) D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
mkdev pd$D b 45 $[ $1 * 16 ] mkdev pd$D b 45 $[ $1 * 16 ]
for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do mkdev pd$D$P b 45 $[ $1 * 16 + $P ] do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
done done
} }
# #
cd /dev cd /dev
# #
for u in 0 1 2 3 ; do pd $u ; done for u in 0 1 2 3 ; do pd $u ; done
for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
# #
# end of mkd # end of mkd
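The naming and minor-number arithmetic in the pd function above can be checked without creating any device nodes. The following is a dry-run sketch under a few assumptions: `pd_nodes` is a hypothetical helper, it uses POSIX `$(( ))` arithmetic and an octal printf escape in place of the bash-specific `$[ ]` and `\x` forms, and it only prints what mkd would pass to mknod.

```shell
#!/bin/sh
# Dry run of the pd naming scheme used by mkd: print each device name with
# its type, major and minor number instead of calling mknod.
# pd_nodes is a hypothetical helper; it needs no root privileges.
pd_nodes() {
    unit=$1
    # Unit 0 maps to "a", unit 1 to "b", and so on (ASCII 97 is "a").
    letter=$(printf "\\$(printf '%03o' $((unit + 97)))")
    # Whole-disk node: minor is unit * 16.
    echo "pd$letter b 45 $((unit * 16))"
    # Partition nodes 1..15 follow the whole-disk minor.
    part=1
    while [ "$part" -le 15 ]; do
        echo "pd$letter$part b 45 $((unit * 16 + part))"
        part=$((part + 1))
    done
}

pd_nodes 1
```

Each unit therefore occupies a block of 16 minors under major 45, one for the whole disk and 15 for partitions.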
With the device files and drivers in place, you can access PARIDE devices With the device files and drivers in place, you can access PARIDE devices
like any other Linux device. For example, to mount a CD-ROM in pcd0, use: like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
mount /dev/pcd0 /cdrom mount /dev/pcd0 /cdrom
If you have a fresh Avatar Shark cartridge, and the drive is pda, you If you have a fresh Avatar Shark cartridge, and the drive is pda, you
might do something like: might do something like::
fdisk /dev/pda -- make a new partition table with fdisk /dev/pda -- make a new partition table with
partition 1 of type 83 partition 1 of type 83
@ -289,41 +303,46 @@ might do something like:
Devices like the Imation superdisk work in the same way, except that Devices like the Imation superdisk work in the same way, except that
they do not have a partition table. For example to make a 120MB they do not have a partition table. For example to make a 120MB
floppy that you could share with a DOS system: floppy that you could share with a DOS system::
mkdosfs /dev/pf0 mkdosfs /dev/pf0
mount /dev/pf0 /mnt mount /dev/pf0 /mnt
2.4 The pf driver 2.4 The pf driver
------------------
The pf driver is intended for use with parallel port ATAPI disk The pf driver is intended for use with parallel port ATAPI disk
devices. The most common devices in this category are PD drives devices. The most common devices in this category are PD drives
and LS-120 drives. Traditionally, media for these devices are not and LS-120 drives. Traditionally, media for these devices are not
partitioned. Consequently, the pf driver does not support partitioned partitioned. Consequently, the pf driver does not support partitioned
media. This may be changed in a future version of the driver. media. This may be changed in a future version of the driver.
2.5 Using the pt driver 2.5 Using the pt driver
------------------------
The pt driver for parallel port ATAPI tape drives is a minimal driver. The pt driver for parallel port ATAPI tape drives is a minimal driver.
It does not yet support many of the standard tape ioctl operations. It does not yet support many of the standard tape ioctl operations.
For best performance, a block size of 32KB should be used. You will For best performance, a block size of 32KB should be used. You will
probably want to set the parallel port delay to 0, if you can. probably want to set the parallel port delay to 0, if you can.
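One way to request a 32KB block size is through tar's blocking factor, which counts 512-byte records. The sketch below only prints the resulting command; `blocking_factor` is a hypothetical helper, `/some/directory` is a placeholder, and `/dev/pt0` assumes the device node created by the mkd script.

```shell
#!/bin/sh
# Convert a block size in bytes into a tar blocking factor (512-byte records).
# blocking_factor is a hypothetical helper name.
blocking_factor() {
    echo $(($1 / 512))
}

# 32768 bytes is the 32KB block size recommended above. The command is
# printed rather than executed, so no tape drive is needed to try this.
echo "tar -b $(blocking_factor 32768) -cf /dev/pt0 /some/directory"
```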
2.6 Using the pg driver 2.6 Using the pg driver
------------------------
The pg driver can be used in conjunction with the cdrecord program The pg driver can be used in conjunction with the cdrecord program
to create CD-ROMs. Please get cdrecord version 1.6.1 or later to create CD-ROMs. Please get cdrecord version 1.6.1 or later
from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
your parallel port should ideally be set to EPP mode, and the "port delay" your parallel port should ideally be set to EPP mode, and the "port delay"
should be set to 0. With those settings it is possible to record at 2x should be set to 0. With those settings it is possible to record at 2x
speed without any buffer underruns. If you cannot get the driver to work speed without any buffer underruns. If you cannot get the driver to work
in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only. in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
3. Troubleshooting 3. Troubleshooting
==================
3.1 Use EPP mode if you can 3.1 Use EPP mode if you can
----------------------------
The most common problems that people report with the PARIDE drivers The most common problems that people report with the PARIDE drivers
concern the parallel port CMOS settings. At this time, none of the concern the parallel port CMOS settings. At this time, none of the
@ -332,6 +351,7 @@ If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure. using your CMOS setup procedure.
3.2 Check the port delay 3.2 Check the port delay
-------------------------
Some parallel ports cannot reliably transfer data at full speed. To Some parallel ports cannot reliably transfer data at full speed. To
offset the errors, the PARIDE protocol modules introduce a "port offset the errors, the PARIDE protocol modules introduce a "port
@ -347,23 +367,25 @@ read the comments at the beginning of the driver source files in
linux/drivers/block/paride. linux/drivers/block/paride.
3.3 Some drives need a printer reset 3.3 Some drives need a printer reset
-------------------------------------
There appear to be a number of "noname" external drives on the market There appear to be a number of "noname" external drives on the market
that do not always power up correctly. We have noticed this with some that do not always power up correctly. We have noticed this with some
drives based on OnSpec and older Freecom adapters. In these rare cases, drives based on OnSpec and older Freecom adapters. In these rare cases,
the adapter can often be reinitialised by issuing a "printer reset" on the adapter can often be reinitialised by issuing a "printer reset" on
the parallel port. As the reset operation is potentially disruptive in the parallel port. As the reset operation is potentially disruptive in
multiple device environments, the PARIDE drivers will not do it multiple device environments, the PARIDE drivers will not do it
automatically. You can however, force a printer reset by doing: automatically. You can however, force a printer reset by doing::
insmod lp reset=1 insmod lp reset=1
rmmod lp rmmod lp
If you have one of these marginal cases, you should probably build If you have one of these marginal cases, you should probably build
your paride drivers as modules, and arrange to do the printer reset your paride drivers as modules, and arrange to do the printer reset
before loading the PARIDE drivers. before loading the PARIDE drivers.
3.4 Use the verbose option and dmesg if you need help 3.4 Use the verbose option and dmesg if you need help
------------------------------------------------------
While a lot of testing has gone into these drivers to make them work While a lot of testing has gone into these drivers to make them work
as smoothly as possible, problems will arise. If you do have problems, as smoothly as possible, problems will arise. If you do have problems,
@ -373,7 +395,7 @@ clues, then please make sure that only one drive is hooked to your system,
and that either (a) PARPORT is enabled or (b) no other device driver and that either (a) PARPORT is enabled or (b) no other device driver
is using your parallel port (check in /proc/ioports). Then, load the is using your parallel port (check in /proc/ioports). Then, load the
appropriate drivers (you can load several protocol modules if you want) appropriate drivers (you can load several protocol modules if you want)
as in: as in::
# insmod paride # insmod paride
# insmod epat # insmod epat
@ -394,12 +416,14 @@ by e-mail to grant@torque.net, or join the linux-parport mailing list
and post your report there. and post your report there.
3.5 For more information or help 3.5 For more information or help
---------------------------------
You can join the linux-parport mailing list by sending a mail message You can join the linux-parport mailing list by sending a mail message
to to:
linux-parport-request@torque.net linux-parport-request@torque.net
with the single word with the single word::
subscribe subscribe
@ -412,6 +436,4 @@ have in your mail headers, when sending mail to the list server.
You might also find some useful information on the linux-parport You might also find some useful information on the linux-parport
web pages (although they are not always up to date) at web pages (although they are not always up to date) at
http://web.archive.org/web/*/http://www.torque.net/parport/ http://web.archive.org/web/%2E/http://www.torque.net/parport/


@ -1,7 +1,8 @@
==========================================
Using the RAM disk block device with Linux Using the RAM disk block device with Linux
------------------------------------------ ==========================================
Contents: .. Contents:
1) Overview 1) Overview
2) Kernel Command Line Parameters 2) Kernel Command Line Parameters
@ -42,7 +43,7 @@ rescue floppy disk.
2a) Kernel Command Line Parameters 2a) Kernel Command Line Parameters
ramdisk_size=N ramdisk_size=N
============== Size of the ramdisk.
This parameter tells the RAM disk driver to set up RAM disks of N k size. The This parameter tells the RAM disk driver to set up RAM disks of N k size. The
default is 4096 (4 MB). default is 4096 (4 MB).
@ -50,16 +51,13 @@ default is 4096 (4 MB).
2b) Module parameters 2b) Module parameters
rd_nr rd_nr
===== /dev/ramX devices created.
/dev/ramX devices created.
max_part max_part
======== Maximum partition number.
Maximum partition number.
rd_size rd_size
======= See ramdisk_size.
See ramdisk_size.
3) Using "rdev -r" 3) Using "rdev -r"
------------------ ------------------
@ -71,11 +69,11 @@ to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
prompt/wait sequence is to be given before trying to read the RAM disk. Since prompt/wait sequence is to be given before trying to read the RAM disk. Since
the RAM disk dynamically grows as data is being written into it, a size field the RAM disk dynamically grows as data is being written into it, a size field
is not required. Bits 11 to 13 are not currently used and may as well be zero. is not required. Bits 11 to 13 are not currently used and may as well be zero.
These numbers are no magical secrets, as seen below: These numbers are no magical secrets, as seen below::
./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000 ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000 ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
Consider a typical two floppy disk setup, where you will have the Consider a typical two floppy disk setup, where you will have the
kernel on disk one, and have already put a RAM disk image onto disk #2. kernel on disk one, and have already put a RAM disk image onto disk #2.
@ -92,20 +90,23 @@ sequence so that you have a chance to switch floppy disks.
The command line equivalent is: "prompt_ramdisk=1" The command line equivalent is: "prompt_ramdisk=1"
Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word. Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
So to create disk one of the set, you would do: So to create disk one of the set, you would do::
/usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0 /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
/usr/src/linux# rdev /dev/fd0 /dev/fd0 /usr/src/linux# rdev /dev/fd0 /dev/fd0
/usr/src/linux# rdev -r /dev/fd0 49152 /usr/src/linux# rdev -r /dev/fd0 49152
If you make a boot disk that has LILO, then for the above, you would use: If you make a boot disk that has LILO, then for the above, you would use::
append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1" append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
Since the default start = 0 and the default prompt = 1, you could use:
Since the default start = 0 and the default prompt = 1, you could use::
append = "load_ramdisk=1" append = "load_ramdisk=1"
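The rdev word arithmetic above (bit 15 for the prompt flag, bit 14 for the load flag, bits 0 to 10 for the start offset) can be sketched as a small shell function. The name rdev_word and its argument order are illustrative only and are not part of rdev:

```shell
# rdev_word PROMPT LOAD START_KB
# Compose the 16-bit word passed to "rdev -r" (function name is hypothetical).
rdev_word() {
    prompt=$1 load=$2 start=$3
    # The offset must fit in bits 0-10 (RAMDISK_IMAGE_START_MASK = 0x07FF).
    if [ "$start" -lt 0 ] || [ "$start" -gt 2047 ]; then
        echo "start offset out of range: $start" >&2
        return 1
    fi
    word=$start
    if [ "$load" -eq 1 ]; then
        word=$((word | 0x4000))    # RAMDISK_LOAD_FLAG, bit 14
    fi
    if [ "$prompt" -eq 1 ]; then
        word=$((word | 0x8000))    # RAMDISK_PROMPT_FLAG, bit 15
    fi
    echo "$word"
}

# Example: prompt=1, load=1, start=0 gives 49152, the value used with
# "rdev -r /dev/fd0 49152" above.
```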
4) An Example of Creating a Compressed RAM Disk 4) An Example of Creating a Compressed RAM Disk
---------------------------------------------- -----------------------------------------------
To create a RAM disk image, you will need a spare block device to To create a RAM disk image, you will need a spare block device to
construct it on. This can be the RAM disk device itself, or an construct it on. This can be the RAM disk device itself, or an
@ -120,11 +121,11 @@ a) Decide on the RAM disk size that you want. Say 2 MB for this example.
Create it by writing to the RAM disk device. (This step is not currently Create it by writing to the RAM disk device. (This step is not currently
required, but may be in the future.) It is wise to zero out the required, but may be in the future.) It is wise to zero out the
area (esp. for disks) so that maximal compression is achieved for area (esp. for disks) so that maximal compression is achieved for
the unused blocks of the image that you are about to create. the unused blocks of the image that you are about to create::
dd if=/dev/zero of=/dev/ram0 bs=1k count=2048 dd if=/dev/zero of=/dev/ram0 bs=1k count=2048
b) Make a filesystem on it. Say ext2fs for this example. b) Make a filesystem on it. Say ext2fs for this example::
mke2fs -vm0 /dev/ram0 2048 mke2fs -vm0 /dev/ram0 2048
@ -133,11 +134,11 @@ c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
d) Compress the contents of the RAM disk. The level of compression d) Compress the contents of the RAM disk. The level of compression
will be approximately 50% of the space used by the files. Unused will be approximately 50% of the space used by the files. Unused
space on the RAM disk will compress to almost nothing. space on the RAM disk will compress to almost nothing::
dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz
e) Put the kernel onto the floppy e) Put the kernel onto the floppy::
dd if=zImage of=/dev/fd0 bs=1k dd if=zImage of=/dev/fd0 bs=1k
@ -146,13 +147,13 @@ f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
(possibly larger) kernel onto the same floppy later without overlapping (possibly larger) kernel onto the same floppy later without overlapping
the RAM disk image. An offset of 400 kB for kernels about 350 kB in the RAM disk image. An offset of 400 kB for kernels about 350 kB in
size would be reasonable. Make sure offset+size of ram_image.gz is size would be reasonable. Make sure offset+size of ram_image.gz is
not larger than the total space on your floppy (usually 1440 kB). not larger than the total space on your floppy (usually 1440 kB)::
dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400 dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc. g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
have 2^15 + 2^14 + 400 = 49552. have 2^15 + 2^14 + 400 = 49552::
rdev /dev/fd0 /dev/fd0 rdev /dev/fd0 /dev/fd0
rdev -r /dev/fd0 49552 rdev -r /dev/fd0 49552
@ -160,15 +161,17 @@ g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
That is it. You now have your boot/root compressed RAM disk floppy. Some That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe. users may wish to combine steps (d) and (f) by using a pipe.
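A minimal sketch of combining steps (d) and (f) with a pipe. The function name pack_ramdisk and its parameters are assumptions made for illustration; substitute your own device names, image size and offset:

```shell
# pack_ramdisk SRC DST SEEK_KB COUNT_KB
# Read COUNT_KB kilobytes from SRC, compress them, and write the result
# to DST starting SEEK_KB kilobytes in (function name is hypothetical).
pack_ramdisk() {
    dd if="$1" bs=1k count="$4" 2>/dev/null | gzip -9 |
        dd of="$2" bs=1k seek="$3" 2>/dev/null
}

# Example matching the steps above (2 MB RAM disk, 400 kB offset):
#   pack_ramdisk /dev/ram0 /dev/fd0 400 2048
```

Because no intermediate file is created, check the compressed size separately if you need to confirm that offset plus image size still fits on the floppy.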
--------------------------------------------------------------------------
Paul Gortmaker 12/95 Paul Gortmaker 12/95
Changelog: Changelog:
---------- ----------
10-22-04 : Updated to reflect changes in command line options, remove 10-22-04 :
Updated to reflect changes in command line options, remove
obsolete references, general cleanup. obsolete references, general cleanup.
James Nelson (james4765@gmail.com) James Nelson (james4765@gmail.com)
12-95 : Original Document 12-95 :
Original Document


@ -1,7 +1,9 @@
========================================
zram: Compressed RAM based block devices zram: Compressed RAM based block devices
---------------------------------------- ========================================
* Introduction Introduction
============
The zram module creates RAM based block devices named /dev/zram<id> The zram module creates RAM based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored (<id> = 0, 1, ...). Pages written to these disks are compressed and stored
@ -12,9 +14,11 @@ use as swap disks, various caches under /var and maybe many more :)
Statistics for individual zram devices are exported through sysfs nodes at Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/ /sys/block/zram<id>/
* Usage Usage
=====
There are several ways to configure and manage zram device(-s): There are several ways to configure and manage zram device(-s):
a) using zram and zram_control sysfs attributes a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org). b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).
@ -22,7 +26,7 @@ In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes. IOW, zram and zram_control sysfs attributes.
In order to get a better idea about zramctl please consult util-linux In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help'. Please be informed documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl, should that zram maintainers do not develop/maintain util-linux or zramctl, should
you have any questions please contact util-linux@vger.kernel.org you have any questions please contact util-linux@vger.kernel.org
@ -30,19 +34,23 @@ Following shows a typical sequence of steps for using zram.
WARNING WARNING
======= =======
For the sake of simplicity we skip error checking parts in most of the For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors. examples below. However, it is your sole responsibility to handle errors.
zram sysfs attributes always return negative values in case of errors. zram sysfs attributes always return negative values in case of errors.
The list of possible return codes: The list of possible return codes:
-EBUSY -- an attempt to modify an attribute that cannot be changed once
the device has been initialised. Please reset device first; ======== =============================================================
-ENOMEM -- zram was not able to allocate enough memory to fulfil your -EBUSY an attempt to modify an attribute that cannot be changed once
needs; the device has been initialised. Please reset device first;
-EINVAL -- invalid input has been provided. -ENOMEM zram was not able to allocate enough memory to fulfil your
needs;
-EINVAL invalid input has been provided.
======== =============================================================
If you use 'echo', the returned value is set by the 'echo' utility, If you use 'echo', the returned value is set by the 'echo' utility,
and, in the general case, something like: and, in the general case, something like::
echo 3 > /sys/block/zram0/max_comp_streams echo 3 > /sys/block/zram0/max_comp_streams
if [ $? -ne 0 ]; if [ $? -ne 0 ];
@ -51,7 +59,11 @@ and, in general case, something like:
should suffice. should suffice.
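Since the examples in this document skip error checking, a small wrapper along the following lines may help. It is only a sketch: write_attr is a hypothetical helper name, and it reports failure through the exit status of 'echo' rather than the kernel's negative error code:

```shell
# write_attr VALUE PATH
# Write VALUE to a sysfs attribute and return non-zero if the write fails
# (helper name is hypothetical).
write_attr() {
    if ! echo "$1" > "$2"; then
        echo "failed to write '$1' to $2" >&2
        return 1
    fi
}

# Example:
#   write_attr 3 /sys/block/zram0/max_comp_streams || exit 1
```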
1) Load Module: 1) Load Module
==============
::
modprobe zram num_devices=4 modprobe zram num_devices=4
This creates 4 devices: /dev/zram{0,1,2,3} This creates 4 devices: /dev/zram{0,1,2,3}
@ -59,6 +71,8 @@ num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1. pre-created. Default: 1.
2) Set max number of compression streams 2) Set max number of compression streams
========================================
Regardless of the value passed to this attribute, ZRAM will always Regardless of the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPU - thus allocate multiple compression streams - one per online CPU - thus
allowing several concurrent compression operations. The number of allowing several concurrent compression operations. The number of
@ -66,16 +80,20 @@ allocated compression streams goes down when some of the CPUs
become offline. There is no single-compression-stream mode anymore, become offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or have only 1 CPU online. unless you are running a UP system or have only 1 CPU online.
To find out how many streams are currently available: To find out how many streams are currently available::
cat /sys/block/zram0/max_comp_streams cat /sys/block/zram0/max_comp_streams
3) Select compression algorithm 3) Select compression algorithm
===============================
Using comp_algorithm device attribute one can see available and Using comp_algorithm device attribute one can see available and
currently selected (shown in square brackets) compression algorithms, currently selected (shown in square brackets) compression algorithms,
change selected compression algorithm (once the device is initialised change selected compression algorithm (once the device is initialised
there is no way to change compression algorithm). there is no way to change compression algorithm).
Examples: Examples::
#show supported compression algorithms #show supported compression algorithms
cat /sys/block/zram0/comp_algorithm cat /sys/block/zram0/comp_algorithm
lzo [lz4] lzo [lz4]
@ -83,20 +101,23 @@ Examples:
#select lzo compression algorithm #select lzo compression algorithm
echo lzo > /sys/block/zram0/comp_algorithm echo lzo > /sys/block/zram0/comp_algorithm
For the time being, the `comp_algorithm' content does not necessarily For the time being, the `comp_algorithm` content does not necessarily
show every compression algorithm supported by the kernel. We keep this show every compression algorithm supported by the kernel. We keep this
list primarily to simplify device configuration and one can configure list primarily to simplify device configuration and one can configure
a new device with a compression algorithm that is not listed in a new device with a compression algorithm that is not listed in
`comp_algorithm'. The thing is that, internally, ZRAM uses Crypto API `comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
and, if some of the algorithms were built as modules, it's impossible and, if some of the algorithms were built as modules, it's impossible
to list all of them using, for instance, /proc/crypto or any other to list all of them using, for instance, /proc/crypto or any other
method. This, however, has an advantage of permitting the usage of method. This, however, has an advantage of permitting the usage of
custom crypto compression modules (implementing S/W or H/W compression). custom crypto compression modules (implementing S/W or H/W compression).
4) Set Disksize 4) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'. Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes. The value can be either in bytes or you can use mem suffixes.
Examples: Examples::
# Initialize /dev/zram0 with 50MB disksize # Initialize /dev/zram0 with 50MB disksize
echo $((50*1024*1024)) > /sys/block/zram0/disksize echo $((50*1024*1024)) > /sys/block/zram0/disksize
@ -111,10 +132,13 @@ since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful. size of the disk when not in use so a huge zram is wasteful.
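To compare a suffixed value with a plain byte count, the conversion can be sketched in shell. This assumes that the K, M and G suffixes denote powers of 1024; the function name mem_to_bytes is illustrative:

```shell
# mem_to_bytes VALUE
# Convert a value such as 256K, 50M or 1G to bytes, assuming 1024-based
# suffixes (function name is hypothetical).
mem_to_bytes() {
    case $1 in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;    # no suffix: already in bytes
    esac
}

# Example: "mem_to_bytes 50M" prints the same number as $((50*1024*1024)).
```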
5) Set memory limit: Optional 5) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'. Set memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes. The value can be either in bytes or you can use mem suffixes.
In addition, you could change the value in runtime. In addition, you could change the value in runtime.
Examples: Examples::
# limit /dev/zram0 with 50MB memory # limit /dev/zram0 with 50MB memory
echo $((50*1024*1024)) > /sys/block/zram0/mem_limit echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
@ -126,7 +150,11 @@ Examples:
# To disable memory limit # To disable memory limit
echo 0 > /sys/block/zram0/mem_limit echo 0 > /sys/block/zram0/mem_limit
6) Activate: 6) Activate
===========
::
mkswap /dev/zram0 mkswap /dev/zram0
swapon /dev/zram0 swapon /dev/zram0
@ -134,6 +162,7 @@ Examples:
mount /dev/zram1 /tmp mount /dev/zram1 /tmp
7) Add/remove zram devices 7) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device zram provides a control interface, which enables dynamic (on-demand) device
addition and removal. addition and removal.
@ -142,44 +171,51 @@ In order to add a new /dev/zramX device, perform read operation on hot_add
attribute. This will return either new device's device id (meaning that you attribute. This will return either new device's device id (meaning that you
can use /dev/zram<id>) or error code. can use /dev/zram<id>) or error code.
Example: Example::
cat /sys/class/zram-control/hot_add cat /sys/class/zram-control/hot_add
1 1
To remove the existing /dev/zramX device (where X is a device id) To remove the existing /dev/zramX device (where X is a device id)
execute execute::
echo X > /sys/class/zram-control/hot_remove echo X > /sys/class/zram-control/hot_remove
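A sketch of scripting device addition: read hot_add once, check that the result is a device id rather than an error, and print the device node. The function name zram_hot_add and its optional directory argument are assumptions for illustration:

```shell
# zram_hot_add [CONTROL_DIR]
# Request a new zram device and print its node name
# (function name and the optional argument are hypothetical).
zram_hot_add() {
    ctl=${1:-/sys/class/zram-control}
    id=$(cat "$ctl/hot_add") || return 1
    case $id in
        ''|*[!0-9]*)
            echo "hot_add failed: $id" >&2
            return 1
            ;;
    esac
    echo "/dev/zram$id"
}

# Example:
#   dev=$(zram_hot_add) || exit 1
```

Note that each read of hot_add creates a device, so the value should be read exactly once and stored.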
8) Stats: 8) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/ Per-device statistics are exported as various nodes under /sys/block/zram<id>/
A brief description of exported device attributes. For more details please A brief description of exported device attributes. For more details please
read Documentation/ABI/testing/sysfs-block-zram. read Documentation/ABI/testing/sysfs-block-zram.
====================== ====== ===============================================
Name access description Name access description
---- ------ ----------- ====================== ====== ===============================================
disksize RW show and set the device's disk size disksize RW show and set the device's disk size
initstate RO shows the initialization state of the device initstate RO shows the initialization state of the device
reset WO trigger device reset reset WO trigger device reset
mem_used_max WO reset the `mem_used_max' counter (see later) mem_used_max WO reset the `mem_used_max` counter (see later)
mem_limit WO specifies the maximum amount of memory ZRAM can use mem_limit WO specifies the maximum amount of memory ZRAM can
to store the compressed data use to store the compressed data
writeback_limit WO specifies the maximum amount of write IO zram can writeback_limit WO specifies the maximum amount of write IO zram
write out to backing device as 4KB unit can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature writeback_limit_enable RW show and set writeback_limit feature
max_comp_streams RW the number of possible concurrent compress operations max_comp_streams RW the number of possible concurrent compress
operations
comp_algorithm RW show and change the compression algorithm comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out backing_dev RW set up backend storage for zram to write out
idle WO mark allocated slot as idle idle WO mark allocated slot as idle
====================== ====== ===============================================
User space is advised to use the following files to read the device statistics. User space is advised to use the following files to read the device statistics.
File /sys/block/zram<id>/stat File /sys/block/zram<id>/stat
Represents block layer statistics. Read Documentation/block/stat.txt for Represents block layer statistics. Read Documentation/block/stat.rst for
details. details.
File /sys/block/zram<id>/io_stat File /sys/block/zram<id>/io_stat
@ -188,23 +224,31 @@ The stat file represents device's I/O statistics not accounted by block
layer and, thus, not available in zram<id>/stat file. It consists of a layer and, thus, not available in zram<id>/stat file. It consists of a
single line of text and contains the following stats separated by single line of text and contains the following stats separated by
whitespace: whitespace:
failed_reads the number of failed reads
failed_writes the number of failed writes ============= =============================================================
invalid_io the number of non-page-size-aligned I/O requests failed_reads The number of failed reads
failed_writes The number of failed writes
invalid_io The number of non-page-size-aligned I/O requests
notify_free Depending on device usage scenario it may account notify_free Depending on device usage scenario it may account
a) the number of pages freed because of swap slot free a) the number of pages freed because of swap slot free
notifications or b) the number of pages freed because of notifications
REQ_OP_DISCARD requests sent by bio. The former ones are b) the number of pages freed because of
sent to a swap block device when a swap slot is freed, REQ_OP_DISCARD requests sent by bio. The former ones are
which implies that this disk is being used as a swap disk. sent to a swap block device when a swap slot is freed,
which implies that this disk is being used as a swap disk.
The latter ones are sent by filesystem mounted with The latter ones are sent by filesystem mounted with
discard option, whenever some data blocks are getting discard option, whenever some data blocks are getting
discarded. discarded.
============= =============================================================
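Because io_stat is a single whitespace-separated line, it can be labelled with a few lines of shell. A minimal sketch, with print_io_stat being a hypothetical helper name and the field order taken from the table above:

```shell
# print_io_stat LINE
# Label the four whitespace-separated io_stat fields
# (helper name is hypothetical).
print_io_stat() {
    set -- $1    # split the line on whitespace into $1..$4
    if [ $# -lt 4 ]; then
        echo "unexpected io_stat format" >&2
        return 1
    fi
    printf 'failed_reads=%s\nfailed_writes=%s\ninvalid_io=%s\nnotify_free=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Example:
#   print_io_stat "$(cat /sys/block/zram0/io_stat)"
```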
File /sys/block/zram<id>/mm_stat File /sys/block/zram<id>/mm_stat
The stat file represents device's mm statistics. It consists of a single The stat file represents device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace: line of text and contains the following stats separated by whitespace:
================ =============================================================
orig_data_size uncompressed size of data stored in this disk. orig_data_size uncompressed size of data stored in this disk.
This excludes same-element-filled pages (same_pages) since This excludes same-element-filled pages (same_pages) since
no memory is allocated for them. no memory is allocated for them.
@ -223,58 +267,71 @@ line of text and contains the following stats separated by whitespace:
No memory is allocated for such pages. No memory is allocated for such pages.
pages_compacted the number of pages freed during compaction pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages huge_pages the number of incompressible pages
================ =============================================================
File /sys/block/zram<id>/bd_stat File /sys/block/zram<id>/bd_stat
The stat file represents device's backing device statistics. It consists of The stat file represents device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace: a single line of text and contains the following stats separated by whitespace:
============== =============================================================
bd_count size of data written in backing device. bd_count size of data written in backing device.
Unit: 4K bytes Unit: 4K bytes
bd_reads the number of reads from backing device bd_reads the number of reads from backing device
Unit: 4K bytes Unit: 4K bytes
bd_writes the number of writes to backing device bd_writes the number of writes to backing device
Unit: 4K bytes Unit: 4K bytes
============== =============================================================
9) Deactivate
=============
::
9) Deactivate:
swapoff /dev/zram0 swapoff /dev/zram0
umount /dev/zram1 umount /dev/zram1
10) Reset: 10) Reset
Write any positive value to 'reset' sysfs node =========
echo 1 > /sys/block/zram0/reset
echo 1 > /sys/block/zram1/reset Write any positive value to 'reset' sysfs node::
echo 1 > /sys/block/zram0/reset
echo 1 > /sys/block/zram1/reset
This frees all the memory allocated for the given device and This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again resets the disksize to zero. You must set the disksize again
before reusing the device. before reusing the device.
* Optional Feature Optional Feature
================
= writeback writeback
---------
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory. to backing storage rather than keeping it in memory.
To use the feature, admin should set up backing device via To use the feature, admin should set up backing device via::
"echo /dev/sda5 > /sys/block/zramX/backing_dev" echo /dev/sda5 > /sys/block/zramX/backing_dev
before disksize setting. It supports only partitions at the moment. before disksize setting. It supports only partitions at the moment.
If admin want to use incompressible page writeback, they could do via If admin want to use incompressible page writeback, they could do via::
"echo huge > /sys/block/zramX/write" echo huge > /sys/block/zramX/write
To use idle page writeback, first, user need to declare zram pages To use idle page writeback, first, user need to declare zram pages
as idle. as idle::
"echo all > /sys/block/zramX/idle" echo all > /sys/block/zramX/idle
From now on, any pages on zram are idle pages. The idle mark From now on, any pages on zram are idle pages. The idle mark
remains until someone requests access to the block. remains until someone requests access to the block.
IOW, unless there is an access request, those pages are still idle pages. IOW, unless there is an access request, those pages are still idle pages.
Admin can request writeback of those idle pages at right timing via Admin can request writeback of those idle pages at right timing via::
"echo idle > /sys/block/zramX/writeback" echo idle > /sys/block/zramX/writeback
With the command, zram writes back idle pages from memory to the storage. With the command, zram writes back idle pages from memory to the storage.
@ -285,7 +342,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature. To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
any writeback. IOW, if admin want to apply writeback budget, he should any writeback. IOW, if admin want to apply writeback budget, he should
enable writeback_limit_enable via enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable $ echo 1 > /sys/block/zramX/writeback_limit_enable
@ -296,7 +353,7 @@ until admin set the budget via /sys/block/zramX/writeback_limit.
assigned via /sys/block/zramX/writeback_limit is meaningless.) assigned via /sys/block/zramX/writeback_limit is meaningless.)
If admin want to limit writeback as per-day 400M, he could do it If admin want to limit writeback as per-day 400M, he could do it
like below. like below::
$ MB_SHIFT=20 $ MB_SHIFT=20
$ 4K_SHIFT=12 $ 4K_SHIFT=12
@ -305,16 +362,16 @@ like below.
$ echo 1 > /sys/block/zram0/writeback_limit_enable $ echo 1 > /sys/block/zram0/writeback_limit_enable
If admin want to allow further write again once the bugdet is exausted, If admin want to allow further write again once the bugdet is exausted,
he could do it like below he could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit /sys/block/zram0/writeback_limit
If admin want to see remaining writeback budget since he set, If admin want to see remaining writeback budget since he set::
$ cat /sys/block/zramX/writeback_limit $ cat /sys/block/zramX/writeback_limit
If admin want to disable writeback limit, he could do If admin want to disable writeback limit, he could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable $ echo 0 > /sys/block/zramX/writeback_limit_enable
@ -326,25 +383,35 @@ budget in next setting is user's job.
If admin want to measure writeback count in a certain period, he could If admin want to measure writeback count in a certain period, he could
know it via /sys/block/zram0/bd_stat's 3rd column. know it via /sys/block/zram0/bd_stat's 3rd column.
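The budget written to writeback_limit above is a count of 4K pages, so the 400M example reduces to a pair of shifts. A minimal sketch of that arithmetic follows; the function names and ``PAGE_SHIFT`` are hypothetical stand-ins for illustration and are not part of zram. Note that a POSIX shell does not accept a variable name starting with a digit, so ``4K_SHIFT`` would need a different name (such as ``PAGE_SHIFT``) before the quoted commands could run as written.

```python
# Sketch of the writeback_limit arithmetic; names here are illustrative only.
MB_SHIFT = 20    # 1 MiB = 2**20 bytes
PAGE_SHIFT = 12  # one 4K page = 2**12 bytes (the text's 4K_SHIFT)


def writeback_limit_pages(megabytes: int) -> int:
    """Convert a budget in MiB into the number of 4K pages."""
    return (megabytes << MB_SHIFT) >> PAGE_SHIFT


def pages_to_megabytes(pages: int) -> int:
    """Convert a remaining budget in 4K pages back into whole MiB."""
    return (pages << PAGE_SHIFT) >> MB_SHIFT


if __name__ == "__main__":
    budget = writeback_limit_pages(400)
    # This is the value that would be echoed into writeback_limit.
    print(budget)
    print(pages_to_megabytes(budget))
```

The reverse conversion may help when reading the remaining budget back from writeback_limit, assuming the file reports the same 4K-page unit it accepts.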
= memory tracking memory tracking
===============
With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
zram block. It could be useful to catch cold or incompressible zram block. It could be useful to catch cold or incompressible
pages of the process with*pagemap. pages of the process with*pagemap.
If you enable the feature, you could see block state via If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state". The output is as follows, /sys/kernel/debug/zram/zram0/block_state". The output is as follows::
300 75.033841 .wh. 300 75.033841 .wh.
301 63.806904 s... 301 63.806904 s...
302 63.806919 ..hi 302 63.806919 ..hi
First column is zram's block index. First column
Second column is access time since the system was booted zram's block index.
Third column is state of the block. Second column
(s: same page access time since the system was booted
w: written page to backing store Third column
h: huge page state of the block:
i: idle page)
s:
same page
w:
written page to backing store
h:
huge page
i:
idle page
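The three-column block_state format described above can be decoded with a few lines of code. This is a sketch under stated assumptions: the helper names are hypothetical, the input is the sample text quoted above rather than a live debugfs read, and treating ``.`` as "flag not set" is inferred from that sample.

```python
# Sketch of a block_state line parser; not an official zram tool.
from typing import List, NamedTuple

# Flag letters and meanings as listed in the documentation.
STATE_FLAGS = {
    "s": "same page",
    "w": "written page to backing store",
    "h": "huge page",
    "i": "idle page",
}


class BlockState(NamedTuple):
    index: int          # zram's block index
    access_time: float  # access time in seconds since boot
    flags: List[str]    # decoded state flags, in the order they appear


def parse_block_state_line(line: str) -> BlockState:
    """Split one line into index, access time and decoded state flags."""
    index, access_time, state = line.split()
    # Assumption: '.' marks a flag position that is not set.
    flags = [STATE_FLAGS[ch] for ch in state if ch != "."]
    return BlockState(int(index), float(access_time), flags)


if __name__ == "__main__":
    sample = """\
300 75.033841 .wh.
301 63.806904 s...
302 63.806919 ..hi
"""
    for raw in sample.splitlines():
        block = parse_block_state_line(raw)
        print(block.index, block.access_time, ", ".join(block.flags))
```

On a real system the same function could be applied to each line of the block_state file, provided the file uses the layout shown in the sample.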
First line of above example says 300th block is accessed at 75.033841sec First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing and the block's state is huge so it is written back to the backing

@ -90,9 +90,9 @@ the disk is not available then you have three options:
run a null modem to a second machine and capture the output there run a null modem to a second machine and capture the output there
using your favourite communication program. Minicom works well. using your favourite communication program. Minicom works well.
(3) Use Kdump (see Documentation/kdump/kdump.rst), (3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
extract the kernel ring buffer from old memory with using dmesg extract the kernel ring buffer from old memory with using dmesg
gdbmacro in Documentation/kdump/gdbmacros.txt. gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.
Finding the bug's location Finding the bug's location
-------------------------- --------------------------

@ -3,7 +3,7 @@ Control Groups
============== ==============
Written by Paul Menage <menage@google.com> based on Written by Paul Menage <menage@google.com> based on
Documentation/cgroup-v1/cpusets.rst Documentation/admin-guide/cgroup-v1/cpusets.rst
Original copyright statements from cpusets.txt: Original copyright statements from cpusets.txt:
@ -76,7 +76,7 @@ On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/cgroup-v1/cpusets.rst) allow access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow
you to associate a set of CPUs and a set of memory nodes with the you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup. tasks in each cgroup.

@ -49,7 +49,7 @@ hooks, beyond what is already present, required to manage dynamic
job placement on large systems. job placement on large systems.
Cpusets use the generic cgroup subsystem described in Cpusets use the generic cgroup subsystem described in
Documentation/cgroup-v1/cgroups.rst. Documentation/admin-guide/cgroup-v1/cgroups.rst.
Requests by a task, using the sched_setaffinity(2) system call to Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and include CPUs in its CPU affinity mask, and using the mbind(2) and

@ -1,5 +1,3 @@
:orphan:
======================== ========================
Control Groups version 1 Control Groups version 1
======================== ========================

@ -10,7 +10,7 @@ Because VM is getting complex (one of reasons is memcg...), memcg's behavior
is complex. This is a document for memcg's internal behavior. is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed. Please note that implementation details can be changed.
(*) Topics on API should be in Documentation/cgroup-v1/memory.rst) (*) Topics on API should be in Documentation/admin-guide/cgroup-v1/memory.rst)
0. How to record usage ? 0. How to record usage ?
======================== ========================
@ -327,7 +327,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
You can see charges have been moved by reading ``*.usage_in_bytes`` or You can see charges have been moved by reading ``*.usage_in_bytes`` or
memory.stat of both A and B. memory.stat of both A and B.
See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what value should
be written to move_charge_at_immigrate. be written to move_charge_at_immigrate.
9.10 Memory thresholds 9.10 Memory thresholds

@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for future changes must be reflected in this document. Documentation for
v1 is available under Documentation/cgroup-v1/. v1 is available under Documentation/admin-guide/cgroup-v1/.
.. CONTENTS .. CONTENTS
@ -1014,7 +1014,7 @@ All time durations are in microseconds.
A read-only nested-key file which exists on non-root cgroups. A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for CPU. See Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details. Documentation/accounting/psi.rst for details.
Memory Memory
@ -1355,7 +1355,7 @@ PAGE_SIZE multiple when read back.
A read-only nested-key file which exists on non-root cgroups. A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for memory. See Shows pressure stall information for memory. See
Documentation/accounting/psi.txt for details. Documentation/accounting/psi.rst for details.
Usage Guidelines Usage Guidelines
@ -1498,7 +1498,7 @@ IO Interface Files
A read-only nested-key file which exists on non-root cgroups. A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for IO. See Shows pressure stall information for IO. See
Documentation/accounting/psi.txt for details. Documentation/accounting/psi.rst for details.
Writeback Writeback
@ -2124,7 +2124,7 @@ following two functions.
a queue (device) has been associated with the bio and a queue (device) has been associated with the bio and
before submission. before submission.
wbc_account_io(@wbc, @page, @bytes) wbc_account_cgroup_owner(@wbc, @page, @bytes)
Should be called for each data segment being written out. Should be called for each data segment being written out.
While this function doesn't care exactly when it's called While this function doesn't care exactly when it's called
during the writeback session, it's the easiest and most during the writeback session, it's the easiest and most

@ -1,5 +1,3 @@
:orphan:
============= =============
Device Mapper Device Mapper
============= =============

Some files were not shown because too many files have changed in this diff