# System Setup Guide
This document is designed for customers participating in the Software Development Platform for Intel® Data
Center GPU Max 1100 Series program who receive the following system configuration:
- D50DNP server with two Intel® Xeon 8480+ CPUs (Sapphire Rapids 350W TDP)
- Two Intel® Data Center GPU Max 1100 PCIe cards (300W TDP each) with an Xe Link x2 bridge card.
The intent is to provide an end-to-end view of system setup and test content from the perspective
of this specific configuration. This includes instructions for:
- BIOS and operating system installation
- Driver and tool installation
- Readiness validation with example workloads
For simplicity, this guide focuses on Ubuntu. Intel GPU drivers support 3 baseline operating systems: Ubuntu, RHEL, and SLES. Other operating systems have similar steps.
For more information about the host system, see [Intel Server D50DNP Family Technical Product Specification](../../pdf/Intel_Server_D50DNP_SDP_Technical_Product_Specification_Rev_2.pdf). Additional information about GPU is available in the [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) documentation.
## Components
System firmware and BIOS are pre-installed. The following table lists all preinstalled firmware components and their versions.
| Firmware component | Version | Details |
| ------------------ | --------------------------------- | ------------ |
| IFWI | PVC2_1.23335 | preinstalled |
| AMC Firmware | PVC_AMC_V_6.7.0.0 | preinstalled |
| System Firmware | SE5C741.86B.01.01.0004.2303280404 | preinstalled |
Software components are expected to be installed by the end user. Systems were tested with the following components:
| Software component | Version | Details |
| -------------------------- | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| OS | Ubuntu* 22.04 LTS (Jammy) | 5.15 kernel |
| GPU Driver | 2328 Production Release | [General-Purpose GPU documentation](https://dgpu-docs.intel.com/) |
| Intel® oneAPI Base toolkit | 2023.2.0-49384 | [Base toolkit documentation](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) |
| Intel® oneAPI HPC toolkit | 2023.2.0-49438 | [HPC toolkit documentation](https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html) |
| Intel® oneAPI AI toolkit | 2023.2.0.48997 | [AI tools documentation](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html) |
| Intel® XPU Manager | xpu-smi 1.2.21 | [XPU Manager releases](https://github.com/intel/xpumanager/releases) |
| Workload: DGEMM | - | [oneMKL repository](https://github.com/oneapi-src/oneAPI-samples/tree/release/2023.2/Libraries/oneMKL/) |
| Workload: BabelSTREAM | - | [BabelSTREAM repository](https://github.com/UoB-HPC/BabelStream) |
| Workload: BERT Large | - | [BERT Large documentation](https://github.com/IntelAI/models/blob/master/quickstart/language_modeling/pytorch/bert_large/training/gpu/DEVCATALOG.md#datasets) |
Driver kernel build versions are frequently updated to enhance security and fix bugs. DKMS patches rely on matching the kernel branch, not the minor build number. For example, with the kernel package 5.15.0-76-generic, only the 5.15 branch is required; the specific 0-76 build number is not a concern. Intel releases are regularly validated with the latest OSV builds, ensuring compatibility with any Ubuntu 5.15 build.
## Setting up BIOS
Follow these steps to configure the required BIOS settings for full performance of ML and AI workloads.
1. Enter BIOS [F2] and load default values [F9] to align with the validated setup.
{width=600}
```{note} All the settings covered in this setup are defaults. No changes are necessary if the defaults are already applied. The following steps will verify that the expected settings are in use.
```
2. Open the `Advanced` options and verify the `Processor Configuration`.
Enable Intel® Hyper-Threading Tech (Intel® Hyper-Threading Technology). This feature is used for improving
the Instructions per Cycle (IPC).
{width=600}
3. Open the `Advanced` options and verify the `Power & Performance` settings. Choose the `Balanced Performance` option. This setting weights optimization toward performance while conserving energy.
{width=600}
4. Open the `Advanced` options and verify the `PCI configuration` settings. Set `MMIO High Base` to `56T` for MMIO optimization. Set `Memory Mapped I/O` size to `1024G`.
{width=600}
## Installing Ubuntu 22.04 and the GPU driver
We recommend using the Ubuntu 22.04 Server (Jammy). Although the installation steps for RHEL* and SLES* should also work, the following steps have been verified with the Intel® Server Board D50DNP and Intel® Data Center GPU Max 1100 Series.
1. Download Ubuntu 22.04 LTS from the [Ubuntu website](https://ubuntu.com/download/server).
2. Start Ubuntu 22.04 LTS x86\_64 installation, press F6 to select boot device, for example, USB.

3. Select the following settings:
```{note} Internet access is required for the following steps. Add a proxy server address if needed.
```
- Language:

- `Ubuntu Server` as the base for the installation:

- `Use an entire disk` as the storage configuration. At least 650 GB is required to execute all the validation workloads.

- Accept the default options and create a user. To match the steps in this document, set up 'user1'.
- Select `Install OpenSSH server`, which is disabled by default, to enable remote SSH login and SCP to the server.

Wait for installation to finish, remove installation media, and then log in.
4. Check whether 5.15.0-xx-generic kernel is loaded.
```(bash)
uname -r
```
Example output:
```(bash)
5.15.0-84-generic
```
Kernel driver build versions are frequently updated for security and bug fixes. DKMS patches depend on matching the kernel branch, not the minor build number. For example, with the kernel package 5.15.0-76-generic, only the 5.15 branch is required; the specific 0-76 build number is not important. Intel releases are regularly validated with the latest OSV builds, so any Ubuntu 5.15 build is expected to work.
5. Follow the [driver installation steps](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps) to install the latest production driver, including compute and media runtimes and development packages.
6. Update the boot loader options by adding `pci=realloc=off` and disabling hang check to `GRUB\_CMDLINE\_LINUX\_DEFAULT` in `/etc/default/grub`.
```(bash)
sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="… i915.enable_hangcheck=0 pci=realloc=off"
sudo update-grub
```
7. Reboot the system.
```(bash)
sudo reboot
```
If Secure Boot is enabled in the BIOS, you might see a prompt during the reboot. Ensure you select `Enroll MOK` to allow the new kernel to take effect.
8. List the group assigned ownership of the render nodes and the groups you are a member of:
```(bash)
stat -c "%G" /dev/dri/render*
groups ${USER}
```
If a group is listed for the render node but not for the user, add the user to the group using gpasswd. The following command adds the active user to the render group and spawns a new shell with that group active:
```(bash)
sudo gpasswd -a ${USER} render
newgrp render
```
9. Verify the device is working with the i915 driver.
```(bash)
$ sudo apt-get install hwinfo
$ hwinfo --display
```
Example output for each Max 1100 card:
```(bash)
...
274: PCI 2900.0: 0380 Display controller
[Created at pci.386]
Unique ID: W2eL.+ER_Ec9Ujm4
Parent ID: wIUg.xbjkZcxCQYD
SysFS ID: /devices/pci0000:26/0000:26:01.0/0000:27:00.0/0000:28:01.0/0000:29:00.0
SysFS BusID: 0000:29:00.0
Hardware Class: graphics card
Model: "Intel Display controller"
Vendor: pci 0x8086 "Intel Corporation"
Device: pci 0x0bda
SubVendor: pci 0x8086 "Intel Corporation"
SubDevice: pci 0x0000
Revision: 0x2f
Driver: "i915"
Driver Modules: "i915"
Memory Range: 0x3afe3f000000-0x3afe3fffffff (ro,non-prefetchable)
Memory Range: 0x3a7000000000-0x3a7fffffffff (ro,non-prefetchable)
IRQ: 787 (341 events)
Module Alias: "pci:v00008086d00000BDAsv00008086sd00000000bc03sc80i00"
Driver Info #0:
Driver Status: i915 is active
Driver Activation Cmd: "modprobe i915"
Config Status: cfg=new, avail=yes, need=no, active=unknown
Attached to: #210 (PCI bridge)
```
10. Perform a smoke test on the compute stack. This is not a comprehensive test; it only verifies that the GPU OpenCL runtime can be loaded. Additional tests are required to ensure full functionality.
```(bash)
clinfo -l
Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
-- Device #1: Intel(R) Data Center GPU Max 1100
```
11. Update the device name.
The new GPU name:
```(bash)
sudo /sbin/update-pciids
lspci |grep Display
9a:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
ca:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
```
Previous GPU name:
```(bash)
lspci |grep Display
9a:00.0 Display controller [0380]: Intel Corporation Device [8086:0bda] (rev 2f)
ca:00.0 Display controller [0380]: Intel Corporation Device [8086:0bda] (rev 2f)
```
## Example workloads
The following workloads have been validated with this Max 1100 configuration:
- [oneMKL matrix multiply(DGEMM)](https://dgpu-docs.intel.com/solutions/max-sw/hpc/DGEMM.html)
- [Stream Triad (BabelSTREAM)](https://dgpu-docs.intel.com/solutions/max-sw/hpc/BabelSTREAM.html)
- [BERT Large Training](https://github.com/IntelAI/models/blob/master/quickstart/language_modeling/pytorch/bert_large/training/gpu/DEVCATALOG.md)
Documentation for each workload contains steps describing the installation of the necessary oneAPI toolkits.
## Intel® Xe link setup
The Intel® Data Center GPU Max 1100 can run in one, two, or four card configurations.
Two and four card configurations can use Intel® Xe link connections for direct all-to-all card-to-card communication.
### Disabling and enabling the Xe link
To disable and enable Xe Link, simply turn IAF on or off.
Disabling IAF:
```(bash)
$ sudo su
$ timeout --signal=SIGINT 5 modprobe -r iaf
$ modprobe -r iaf
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do echo 0 > /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
```
Enabling IAF:
```(bash)
$ sudo su
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do echo 1 > /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
modprobe iaf
```
Setting ptrace_scope to 0:
```(bash)
$ sysctl -w kernel.yama.ptrace_scope=0
```
### Verifying Xe link status
To verify Xe link status, use the XPU manager.
Checking the status:
```(bash)
$ xpu-smi config -d 0
```
When enabled, available Xe link ports are displayed.

Check the number of Xe link ports and lanes per Xe link port: 6 ports and 4 lanes:
```(bash)
$ xpu-smi discovery -d 0
```

Check Xe link health status:
```(bash)
$ xpu-smi health -l -c 5
```

### Xe link bandwidth test
Accessing remote memory using IPC can be tested with Intel MPI.
The following command runs a simple Intel MPI test using Xe link to check bandwidth.
Before trying this command, install oneAPI base and HPC toolkits as described in [Matrix Multiply (DGEMM) example GPU workloads instructions](https://dgpu-docs.intel.com/solutions/max-sw/hpc/DGEMM.html).
```(bash)
$ source /opt/intel/oneapi/setvars.sh
$ env I_MPI_OFFLOAD=1 mpirun -n 2 IMB-MPI1-GPU pingpong sendrecv -mem_alloc_type device -msglog 28
```
## Tools
This section describes the available tools available that can help with application development and optimization.
### Intel® XPU Manager
Intel® XPU Manager is a free and open-source tool for monitoring and managing Intel Data Center GPUs.
It is designed to simplify administration, maximize reliability and uptime, and improve utilization.
For more information, see [Intel® XPU System Management Interface User Guide](https://github.com/intel/xpumanager/blob/master/doc/smi_user_guide.md).
### GDB – PVC debugger
[GDB](https://www.intel.com/content/www/us/en/docs/distribution-for-gdb/get-started-guide-linux/2023-2/overview.html) is installed on the machine as a part of the oneAPI base toolkit, so no extra step is needed to use it.
The following configuration is required to debug GPU using GDB. It is a one-time requirement on the
system.
**Prerequisite steps**
Before setting up the GDB debugger, follow these steps.
1. Add the following two variables to `GRUB_CMDLINE_LINUX_DEFAULT=""` in `/etc/default/grub` `"i915.debug_eu=1 i915.enable_hangcheck=0"`.
```(bash)
$ sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="i915.debug_eu=1 i915.enable_hangcheck=0"
$ sudo update-grub
$ sudo reboot
```
2. Disable preemption timeout on GPU.
```
$ ACTION=="add|bind",SUBSYSTEM=="pci",DRIVER=="i915",RUN+="/bin/bash -c
'for i in /sys/$DEVPATH/drm/card?/engine/[rc]cs*/preempt_timeout_ms; do
echo 0 > $i; done'"
$ udevadm trigger -s pci --action=add
```
3. Ensure preemption timeout is set correctly.
```
$ find /sys/devices -regex '.*/drm/card[0-9]*/engine/[rc]cs[0-9]*/preempt_timeout_ms' -exec echo {} \; -exec cat {} \;
```
4. Set up GDB debugger.
```
$ source /opt/intel/oneapi/setvars.sh
$ export ZET_ENABLE_PROGRAM_DEBUGGING=1
$ python3 /path/to/intel/oneapi/diagnostics/latest/diagnostics.py --filter debugger_sys_check --force
```
5. Compile the program.
```
$ mkdir array-transform
$ cd array-transform
$ wget https://raw.githubusercontent.com/oneapi-src/oneAPIsamples/master/Tools/ApplicationDebugger/arraytransform/src/array-transform.cpp
$ icpx -fsycl -g -O0 array-transform.cpp -o array-transform
$ export ONEAPI_DEVICE_SELCTOR=level_zero:0
$ gdb-oneapi array-transform
```
6. FRun GBD from the GDB console.
```
(gdb) run
```
Reference output:
```
Starting program: /home/user1/workload/array-transform/array-transform
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffca416640 (LWP 46007)]
[Thread 0x7fffca416640 (LWP 46007) exited]
[New Thread 0x7fffc9a15640 (LWP 46008)]
[Thread 0x7fffc9a15640 (LWP 46008) exited]
intelgt: gdbserver-ze started for process 46004.
[New Thread 0x7fffc8ff4640 (LWP 46023)][SYCL] Using device: [Intel(R) Data Center GPU Max 1100] from [Intel(R) Level-Zero]
success; result is correct.
[Thread 0x7fffc8ff4640 (LWP 46023) exited]
[Inferior 1 (process 46004) exited normally]
Detaching from process 1
[Inferior 2 (device [9a:00.0]) detached]
Detaching from process 2
[Inferior 3 (device [ca:00.0]) detached]
intelgt: inferior 2 (gdbserver-ze) has been removed.
intelgt: inferior 3 (gdbserver-ze) has been removed.
```
7. Quit GDB console.
```
(gdb) quit
```
### Intel® VTune™ Profiler
This section describes how to use Intel® VTune™ Profiler with a DGEMM workload to analyze the
performance of the Intel GPU MAX 1100.
The following steps assume the working directory is `/home/user1/workload/benchmark/DGEMM`. See [DGEMM workload](https://dgpu-docs.intel.com/solutions/max-sw/hpc/DGEMM.html) for setup steps.
Test setup:
```(bash)
$ sudo su
$ source /opt/intel/oneapi/setvars.sh
$ cd /home/user1/workload/benchmark/DGEMM
$ export ONEAPI_DEVICE_SELECTOR=level_zero:0
$ /dgemm.mkl
```
In your system configuration, you should not see any error message, such as “Failed to start profiling because the scope of the ptrace() system call application is limited.” However, if you encounter this error, set the value of the kernel.yama.ptrace_scope sysctl option to 0 with the following command:
```(bash)
$ sysctl -w kernel.yama.ptrace_scope=0
```
For more information, see the [Intel® VTune™ Profiler User Guide](https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-2/error-ptrace-sys-call-scope-limited.html).
VTune is a component of oneAPI Base Toolkit, so no additional installation is required. Run it using the following command. For a detailed description of the parameters, refer to the [VTune User Guide](https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-2/overview.html).
```(bash)
$ vtune -collect gpu-hotspots -k characterization-mode=overview -k collect-programming-api=true -data-limit=0 --duration 20 -- ./dgemm.mkl
```
## Support for loaned systems
If you need support during the sample period, either submit a service request or call the customer support center.
### Submitting service requests
Flow these steps to submit a service request.
1. Log in to the [support portal](https://www.intel.com/support/gpumaxsupport).
2. Select **Intel® Data Center GPU Max 1100** and choose **Create Request**.
3. Describe your issue on the next screen and select **Check For Answers**.
4. Choose **Continue to Request Creation**.
5. Provide answers to additional questions and click **Submit Request**.
A confirmation window will appear informing you a new case number has been created. You can
expect a response within 24 hours.
### Call the customer support center
The customer support center is open Monday to Friday from 8 AM to 5 PM PST. To reach the center, please call: (+1) 855-816-1934.