# System Setup Guide

This document is designed for customers participating in the Software Development Platform for Intel® Data
Center GPU Max 1100 Series program who receive the following system configuration: 

- D50DNP server with two Intel® Xeon 8480+ CPUs (Sapphire Rapids 350W TDP)
- Two Intel® Data Center GPU Max 1100 PCIe cards (300W TDP each) with an X<sup>e</sup> Link x2 bridge card. 
 
The intent is to provide an end-to-end view of system setup and test content from the perspective
of this specific configuration. This includes instructions for:
 - BIOS and operating system installation
 - Driver and tool installation
 - Readiness validation with example workloads 

For simplicity, this guide focuses on Ubuntu. Intel GPU drivers support three baseline operating systems: Ubuntu, RHEL, and SLES. The steps for the other operating systems are similar.

For more information about the host system, see [Intel Server D50DNP Family Technical Product Specification](../../pdf/Intel_Server_D50DNP_SDP_Technical_Product_Specification_Rev_2.pdf). Additional information about the GPU is available in the [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) documentation.


## Components

System firmware and BIOS are preinstalled. The following table lists all preinstalled firmware components and their versions.
 
| Firmware component | Version                           | Details      |
| ------------------ | --------------------------------- | ------------ |
| IFWI               | PVC2_1.23335                      | preinstalled |
| AMC Firmware       | PVC_AMC_V_6.7.0.0                 | preinstalled |
| System Firmware    | SE5C741.86B.01.01.0004.2303280404 | preinstalled |

Software components are expected to be installed by the end user. Systems were tested with the following components:

| Software component         | Version                   | Details                                                                                                                           |
| -------------------------- | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| OS                         | Ubuntu* 22.04 LTS (Jammy) | 5.15 kernel                                                                                                                       |
| GPU Driver                 | 2328 Production Release   | [General-Purpose GPU documentation](https://dgpu-docs.intel.com/)                                                                                                      |
| Intel® oneAPI Base toolkit | 2023.2.0-49384            | [Base toolkit documentation](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html)                                                  |
| Intel® oneAPI HPC toolkit  | 2023.2.0-49438            | [HPC toolkit documentation](https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html)                                                   |
| Intel® oneAPI AI toolkit   | 2023.2.0.48997            | [AI tools documentation](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html)                                          |
| Intel® XPU Manager         | xpu-smi 1.2.21            | [XPU Manager releases](https://github.com/intel/xpumanager/releases)                                                                                      |
| Workload: DGEMM            | -                         | [oneMKL repository](https://github.com/oneapi-src/oneAPI-samples/tree/release/2023.2/Libraries/oneMKL/)                                                |
| Workload: BabelSTREAM      | -                         | [BabelSTREAM repository](https://github.com/UoB-HPC/BabelStream)                                                                                            |
| Workload: BERT Large       | -                         | [BERT Large documentation](https://github.com/IntelAI/models/blob/master/quickstart/language_modeling/pytorch/bert_large/training/gpu/DEVCATALOG.md#datasets) |

Kernel builds are updated frequently to improve security and fix bugs. DKMS patches depend on the kernel branch, not the minor build number. For example, with the kernel package 5.15.0-76-generic, only the 5.15 branch is required; the specific 0-76 build number is not a concern. Intel releases are regularly validated with the latest OSV builds, so any Ubuntu 5.15 build is expected to be compatible.
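
The branch rule above can be sketched as a small shell check. This is a minimal illustration, not part of the validated setup; the function names `kernel_branch` and `kernel_branch_matches` are hypothetical helpers introduced here.

```bash
# Print the "major.minor" branch of a kernel release string,
# for example "5.15" for "5.15.0-76-generic".
kernel_branch() {
    echo "$1" | cut -d. -f1,2
}

# Return 0 if the release belongs to the expected branch (default: 5.15).
kernel_branch_matches() {
    local release="$1" expected="${2:-5.15}"
    [ "$(kernel_branch "$release")" = "$expected" ]
}

# Check the running kernel; the build number after the branch is ignored.
if kernel_branch_matches "$(uname -r)"; then
    echo "kernel $(uname -r) is on the 5.15 branch"
else
    echo "kernel $(uname -r) is not on the 5.15 branch"
fi
```

Only the first two dot-separated fields are compared, which mirrors the statement that the build number is not a concern.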


## Setting up BIOS

Follow these steps to configure the required BIOS settings for full performance of ML and AI workloads.

1. Enter BIOS [F2] and load default values [F9] to align with the validated setup.

![BIOS main screen](images/BIOS1.png){width=600}

```{note} All the settings covered in this setup are defaults. No changes are necessary if the defaults are already applied. The following steps will verify that the expected settings are in use.
```

2. Open the `Advanced` options and verify the `Processor Configuration`.

Enable `Intel® Hyper-Threading Tech` (Intel® Hyper-Threading Technology). This feature improves
instructions per cycle (IPC).

![BIOS Processor Configuration Screen](images/BIOS2.png){width=600}
  
3. Open the `Advanced` options and verify the `Power & Performance` settings. Choose the `Balanced Performance` option. This setting weights optimization toward performance while conserving energy. 

![BIOS Power/Performance Screen](images/BIOS3.png){width=600}

4. Open the `Advanced` options and verify the `PCI configuration` settings. Set `MMIO High Base` to `56T` for MMIO optimization. Set `Memory Mapped I/O` size to `1024G`.

![BIOS PCI Configuration Screen](images/BIOS4.png){width=600}


## Installing Ubuntu 22.04 and the GPU driver

We recommend Ubuntu 22.04 Server (Jammy). The installation steps for RHEL* and SLES* should also work, but the following steps have been verified on Ubuntu with the Intel® Server Board D50DNP and Intel® Data Center GPU Max 1100 Series.

1. Download Ubuntu 22.04 LTS from the [Ubuntu website](https://ubuntu.com/download/server).

2. Start the Ubuntu 22.04 LTS x86\_64 installation. Press F6 to select the boot device, for example, USB.

   ![OS Install Grub Options](images/OS1.png)

3. Select the following settings:

   ```{note} Internet access is required for the following steps.  Add a proxy server address if needed.
   ```
   - Language:

     ![OS Install Language Options](images/OS2.png)
     
   - `Ubuntu Server` as the base for the installation:

     ![OS Install Ubuntu Server](images/OS3.png)
     
   - `Use an entire disk` as the storage configuration. At least 650 GB is required to execute all the validation workloads.

     ![OS Install Storage Config](images/OS4.png)
     
   - Accept the default options and create a user. To match the steps in this document, name the user `user1`.
   
   - Select `Install OpenSSH server`, which is disabled by default, to enable remote SSH login and SCP to the server.

     ![OS Install OpenSSH](images/OS5.png)

   Wait for installation to finish, remove installation media, and then log in. 

4. Check that a 5.15.0-xx-generic kernel is loaded.

   ```(bash)
   uname -r 
   ```
 
   Example output:
   ```(bash)
   5.15.0-84-generic
   ```

   Kernel builds are updated frequently for security and bug fixes. DKMS patches depend on the kernel branch, not the minor build number, so only the 5.15 branch is required and the build number (`0-84` in the example output) is not important. Intel releases are regularly validated with the latest OSV builds, so any Ubuntu 5.15 build is expected to work.


5. Follow the [driver installation steps](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps) to install the latest production driver, including compute and media runtimes and development packages.

6. Update the boot loader options: add `pci=realloc=off` and disable hang check (`i915.enable_hangcheck=0`) in `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`.

   ```(bash)
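   # Edit the file and append the two options to the existing value:
   sudo vi /etc/default/grub
   #   GRUB_CMDLINE_LINUX_DEFAULT="… i915.enable_hangcheck=0 pci=realloc=off"
   # Regenerate the GRUB configuration:
   sudo update-grub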
   ```

7. Reboot the system.

   ```(bash)
   sudo reboot
   ```
   
   If Secure Boot is enabled in the BIOS, you might see a prompt during the reboot. Ensure you select `Enroll MOK` to allow the new kernel to take effect.

8. List the group assigned ownership of the render nodes and the groups you are a member of:

   ```(bash)
   stat -c "%G" /dev/dri/render* 
   groups ${USER}
   ``` 
   
   If a group is listed for the render node but not for the user, add the user to the group using gpasswd. The following command adds the active user to the render group and spawns a new shell with that group active:
   
   ```(bash)
   sudo gpasswd -a ${USER} render 
   newgrp render 
   ```

9. Verify the device is working with the i915 driver.

   ```(bash)
   $ sudo apt-get install hwinfo
   $ hwinfo --display 
   ```
   
   Example output for each Max 1100 card: 
   
   ```(bash)
   ...
   274: PCI 2900.0: 0380 Display controller 
     [Created at pci.386] 
     Unique ID: W2eL.+ER_Ec9Ujm4   
     Parent ID: wIUg.xbjkZcxCQYD 
     SysFS ID: /devices/pci0000:26/0000:26:01.0/0000:27:00.0/0000:28:01.0/0000:29:00.0 
     SysFS BusID: 0000:29:00.0 
     Hardware Class: graphics card 
     Model: "Intel Display controller" 
     Vendor: pci 0x8086 "Intel Corporation" 
     Device: pci 0x0bda 
     SubVendor: pci 0x8086 "Intel Corporation" 
     SubDevice: pci 0x0000 
     Revision: 0x2f 
     Driver: "i915" 
     Driver Modules: "i915" 
     Memory Range: 0x3afe3f000000-0x3afe3fffffff (ro,non-prefetchable)   
     Memory Range: 0x3a7000000000-0x3a7fffffffff (ro,non-prefetchable) 
     IRQ: 787 (341 events) 
     Module Alias: "pci:v00008086d00000BDAsv00008086sd00000000bc03sc80i00" 
     Driver Info #0: 
       Driver Status: i915 is active 
       Driver Activation Cmd: "modprobe i915" 
     Config Status: cfg=new, avail=yes, need=no, active=unknown 
     Attached to: #210 (PCI bridge) 
   ```
 
10. Perform a smoke test on the compute stack. This is not a comprehensive test; it only verifies that the GPU OpenCL runtime can be loaded. Additional tests are required to ensure full functionality.

   ```(bash)
   clinfo -l 
   
   Platform #0: Intel(R) OpenCL Graphics 
    +-- Device #0: Intel(R) Data Center GPU Max 1100 
     -- Device #1: Intel(R) Data Center GPU Max 1100 
   ```

11. Update the PCI ID database so that `lspci` reports the full device name.

   After the update, `lspci` shows the new GPU name:
   
   ```(bash)
   sudo /sbin/update-pciids
   lspci |grep Display
   
   9a:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
   ca:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
   ```
   
   Before the update, `lspci` shows only the generic device name:
   
   ```(bash)
   lspci |grep Display
   
   9a:00.0 Display controller [0380]: Intel Corporation Device [8086:0bda] (rev 2f)
   ca:00.0 Display controller [0380]: Intel Corporation Device [8086:0bda] (rev 2f)
   ```
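
Two of the checks above can be scripted: confirming that the boot options from step 6 took effect after the reboot, and counting the GPUs by the PCI ID `8086:0bda` shown in step 11. The following is a minimal sketch; `has_kernel_param` and `count_max1100` are hypothetical helper names, and the second function assumes its input comes from `lspci -nn`, which prints numeric IDs in brackets.

```bash
# Return 0 if a kernel command line (passed as a string) contains
# the given parameter as a whole word.
has_kernel_param() {
    local cmdline="$1" param="$2" word
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Read `lspci -nn` output on stdin and print the number of devices
# with vendor:device ID 8086:0bda (Data Center GPU Max 1100).
count_max1100() {
    grep -c '\[8086:0bda\]'
}

# Example usage on the target system (left commented out here):
#   cmdline="$(cat /proc/cmdline)"
#   for p in i915.enable_hangcheck=0 pci=realloc=off; do
#       has_kernel_param "$cmdline" "$p" || echo "missing boot option: $p"
#   done
#   lspci -nn | count_max1100   # two cards in this guide's configuration
```

Matching on the numeric ID works both before and after `update-pciids`, because only the descriptive name changes.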

## Example workloads

The following workloads have been validated with this Max 1100 configuration:

- [oneMKL matrix multiply (DGEMM)](https://dgpu-docs.intel.com/solutions/max-sw/hpc/DGEMM.html)
- [Stream Triad (BabelSTREAM)](https://dgpu-docs.intel.com/solutions/max-sw/hpc/BabelSTREAM.html)
- [BERT Large Training](https://github.com/IntelAI/models/blob/master/quickstart/language_modeling/pytorch/bert_large/training/gpu/DEVCATALOG.md)

Documentation for each workload contains steps describing the installation of the necessary oneAPI toolkits. 

## Intel® X<sup>e</sup> link setup

The Intel® Data Center GPU Max 1100 can run in one, two, or four card configurations. 
Two and four card configurations can use Intel® X<sup>e</sup> link connections for direct all-to-all card-to-card communication.

### Disabling and enabling the X<sup>e</sup> link 

To disable and enable X<sup>e</sup> Link, simply turn IAF on or off.

Disabling IAF:

```(bash)
$ sudo su
$ timeout --signal=SIGINT 5 modprobe -r iaf
$ modprobe -r iaf
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do echo 0 > /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
```

Enabling IAF:

```(bash)
$ sudo su
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do echo 1 > /sys/class/drm/card$i/iaf_power_enable ; done;
$ for i in {0..1}; do cat /sys/class/drm/card$i/iaf_power_enable ; done;
$ modprobe iaf
```
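
The repeated `for` loops above can be wrapped in one function. This is a sketch only: `set_iaf_power` is a hypothetical helper, it covers just the `iaf_power_enable` write and read-back, and it must run as root on the real system. Unloading the `iaf` module before disabling, and loading it after enabling, still has to be done as shown above. The optional second argument exists so the function can be tried against a scratch directory instead of `/sys/class/drm`.

```bash
# Write 0 (disable) or 1 (enable) to iaf_power_enable for every card
# that exposes it, then print the value read back from each node.
set_iaf_power() {
    local value="$1" drm_root="${2:-/sys/class/drm}" node
    case "$value" in
        0|1) ;;
        *) echo "usage: set_iaf_power 0|1 [drm_root]" >&2; return 2 ;;
    esac
    for node in "$drm_root"/card[0-9]*/iaf_power_enable; do
        [ -e "$node" ] || continue   # glob matched nothing
        echo "$value" > "$node"
        echo "$node: $(cat "$node")"
    done
    return 0
}

# Example usage as root (left commented out here):
#   modprobe -r iaf && set_iaf_power 0    # disable
#   set_iaf_power 1 && modprobe iaf       # enable
```

Globbing for the sysfs node avoids hard-coding the `{0..1}` card range, so the same function applies to configurations with more cards.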

Setting ptrace_scope to 0:

```(bash)
$ sysctl -w kernel.yama.ptrace_scope=0
```

### Verifying X<sup>e</sup> link status

To verify X<sup>e</sup> link status, use the XPU manager.

Checking the status:

```(bash)
$ xpu-smi config -d 0
```

When enabled, available X<sup>e</sup> link ports are displayed.
![Xelink on](images/xelink_config_on.jpg)


Check the number of X<sup>e</sup> link ports and the number of lanes per port. The expected values are 6 ports with 4 lanes each:

```(bash)
$ xpu-smi discovery -d 0 
```

![Xelink ports and lanes](images/xelink_ports_lanes.jpg)

Check X<sup>e</sup> link health status:

```(bash)
$ xpu-smi health -l -c 5
```
![Xelink health](images/xelink_health.jpg)

### X<sup>e</sup> link bandwidth test

Accessing remote memory using IPC can be tested with Intel MPI.

The following command runs a simple Intel MPI test using X<sup>e</sup> link to check bandwidth.

Before trying this command, install the oneAPI Base and HPC toolkits as described in [Matrix Multiply (DGEMM) example GPU workloads instructions](https://dgpu-docs.intel.com/solutions/max-sw/hpc/DGEMM.html).

```(bash)
$ source /opt/intel/oneapi/setvars.sh
$ env I_MPI_OFFLOAD=1 mpirun -n 2 IMB-MPI1-GPU pingpong sendrecv -mem_alloc_type device -msglog 28
```
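
To read the result of the run above, two small helpers can be useful. This is a sketch under stated assumptions: it assumes `-msglog 28` caps the message size at 2^28 bytes, and that the benchmark prints result rows that start with a byte count and end with a `Mbytes/sec` column, as Intel MPI Benchmarks tables usually do. Both function names are hypothetical.

```bash
# Largest message size implied by "-msglog N": 2^N bytes.
msglog_bytes() {
    echo $(( 1 << $1 ))
}

# Read benchmark output on stdin and print the largest value found in
# the last column of the numeric result rows (assumed to be Mbytes/sec).
peak_mbytes_per_sec() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 4 {
             if ($NF + 0 > max) { max = $NF + 0; best = $NF }
         }
         END { if (best != "") print best; else print "n/a" }'
}

# Example usage on the target system (left commented out here):
#   msglog_bytes 28        # 268435456 bytes, that is 256 MiB
#   env I_MPI_OFFLOAD=1 mpirun -n 2 IMB-MPI1-GPU pingpong \
#       -mem_alloc_type device -msglog 28 | peak_mbytes_per_sec
```

When several benchmarks run in one invocation, the function reports the maximum across all of their tables, so pipe one benchmark at a time to get per-benchmark peaks.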

## Tools

This section describes the available tools that can help with application development and optimization.

### Intel® XPU Manager

Intel® XPU Manager is a free and open-source tool for monitoring and managing Intel Data Center GPUs.
It is designed to simplify administration, maximize reliability and uptime, and improve utilization.

For more information, see [Intel® XPU System Management Interface User Guide](https://github.com/intel/xpumanager/blob/master/doc/smi_user_guide.md).

### GDB – PVC debugger

[GDB](https://www.intel.com/content/www/us/en/docs/distribution-for-gdb/get-started-guide-linux/2023-2/overview.html) is installed on the machine as a part of the oneAPI base toolkit, so no extra step is needed to use it.

The following configuration is required to debug GPU code with GDB. It is a one-time requirement on the
system.

**Prerequisite steps**
Before setting up the GDB debugger, follow these steps.

1. Add the following two parameters, `i915.debug_eu=1` and `i915.enable_hangcheck=0`, to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`. Then update GRUB and reboot.

   ```(bash)
   $ sudo vi /etc/default/grub
   GRUB_CMDLINE_LINUX_DEFAULT="i915.debug_eu=1 i915.enable_hangcheck=0"
   $ sudo update-grub
   $ sudo reboot
   ```

2. Disable preemption timeout on GPU.

   ```   
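   # Add this udev rule as a single line in a rules file under /etc/udev/rules.d/:
   ACTION=="add|bind",SUBSYSTEM=="pci",DRIVER=="i915",RUN+="/bin/bash -c 'for i in /sys/$DEVPATH/drm/card?/engine/[rc]cs*/preempt_timeout_ms; do echo 0 > $i; done'"
   # Then re-trigger the PCI device events so that the rule is applied:
   $ udevadm trigger -s pci --action=add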
   ```

3. Ensure preemption timeout is set correctly.

   ```
   $ find /sys/devices -regex '.*/drm/card[0-9]*/engine/[rc]cs[0-9]*/preempt_timeout_ms' -exec echo {} \; -exec cat {} \;
   ```
   
4. Set up GDB debugger.

   ```
   $ source /opt/intel/oneapi/setvars.sh
   $ export ZET_ENABLE_PROGRAM_DEBUGGING=1
   $ python3 /path/to/intel/oneapi/diagnostics/latest/diagnostics.py --filter debugger_sys_check --force
   ```

5. Download and compile the sample program, then start it under GDB.

   ```
   $ mkdir array-transform
   $ cd array-transform
   $ wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/Tools/ApplicationDebugger/array-transform/src/array-transform.cpp
   $ icpx -fsycl -g -O0 array-transform.cpp -o array-transform
   $ export ONEAPI_DEVICE_SELECTOR=level_zero:0
   $ gdb-oneapi array-transform
   ```

6. Run the program from the GDB console.

   ```
   (gdb) run
   ```
   
   Reference output:
   ```
   Starting program: /home/user1/workload/array-transform/array-transform
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
   [New Thread 0x7fffca416640 (LWP 46007)]
   [Thread 0x7fffca416640 (LWP 46007) exited]
   [New Thread 0x7fffc9a15640 (LWP 46008)]
   [Thread 0x7fffc9a15640 (LWP 46008) exited]
   intelgt: gdbserver-ze started for process 46004.
   [New Thread 0x7fffc8ff4640 (LWP 46023)][SYCL] Using device: [Intel(R) Data Center GPU Max 1100] from [Intel(R) Level-Zero]
   success; result is correct.
   [Thread 0x7fffc8ff4640 (LWP 46023) exited]
   [Inferior 1 (process 46004) exited normally]
   Detaching from process 1
   [Inferior 2 (device [9a:00.0]) detached]
   Detaching from process 2
   [Inferior 3 (device [ca:00.0]) detached]
   intelgt: inferior 2 (gdbserver-ze) has been removed.
   intelgt: inferior 3 (gdbserver-ze) has been removed.
   ```

7. Quit GDB console.

   ```
   (gdb) quit
   ```
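
The check in step 3 prints every `preempt_timeout_ms` value and leaves the comparison to the reader. A scripted variant, using the same `find` expression, could look like the following sketch. `check_preempt_timeouts` is a hypothetical helper; the optional argument replaces `/sys/devices` with another root directory so that the function can be tried without GPU hardware.

```bash
# Return 0 if every preempt_timeout_ms file under the root is 0,
# 1 if any file holds another value, 2 if no such file is found.
check_preempt_timeouts() {
    local root="${1:-/sys/devices}" files file value status=0
    files="$(find "$root" -regex '.*/drm/card[0-9]*/engine/[rc]cs[0-9]*/preempt_timeout_ms' 2>/dev/null)"
    if [ -z "$files" ]; then
        echo "no preempt_timeout_ms files found under $root" >&2
        return 2
    fi
    for file in $files; do
        value="$(cat "$file")"
        if [ "$value" != "0" ]; then
            echo "$file: $value (expected 0)"
            status=1
        fi
    done
    return "$status"
}

# Example usage on the target system (left commented out here):
#   check_preempt_timeouts && echo "preemption timeout is disabled"
```

A non-zero return after a reboot suggests that the udev rule from step 2 was not applied.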

### Intel® VTune™ Profiler

This section describes how to use Intel® VTune™ Profiler with a DGEMM workload to analyze the
performance of the Intel® Data Center GPU Max 1100.

The following steps assume the working directory is `/home/user1/workload/benchmark/DGEMM`. See [DGEMM workload](https://dgpu-docs.intel.com/solutions/max-sw/hpc/DGEMM.html) for setup steps.

Test setup:

```(bash)
$ sudo su
$ source /opt/intel/oneapi/setvars.sh
$ cd /home/user1/workload/benchmark/DGEMM
$ export ONEAPI_DEVICE_SELECTOR=level_zero:0
$ ./dgemm.mkl
```

With this system configuration, you should not see error messages such as “Failed to start profiling because the scope of the ptrace() system call application is limited.” If you do encounter this error, set the `kernel.yama.ptrace_scope` sysctl option to 0 with the following command:

```(bash)
$ sysctl -w kernel.yama.ptrace_scope=0
```

For more information, see the [Intel® VTune™ Profiler User Guide](https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-2/error-ptrace-sys-call-scope-limited.html).
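
Before starting a collection, the current setting can be read from `/proc/sys/kernel/yama/ptrace_scope`. The following sketch wraps that check; `ptrace_scope_is_open` is a hypothetical helper name, and the optional argument only exists so the function can be pointed at a different file.

```bash
# Return 0 if ptrace_scope is 0, 1 if it holds another value,
# 2 if the file cannot be read (for example, Yama is not enabled).
ptrace_scope_is_open() {
    local file="${1:-/proc/sys/kernel/yama/ptrace_scope}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    [ "$(cat "$file")" = "0" ]
}

# Example usage on the target system (left commented out here):
#   ptrace_scope_is_open || sudo sysctl -w kernel.yama.ptrace_scope=0
```

A value set with `sysctl -w` does not persist across reboots, so the check is worth repeating after a restart.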

VTune is a component of oneAPI Base Toolkit, so no additional installation is required. Run it using the following command. For a detailed description of the parameters, refer to the [VTune User Guide](https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-2/overview.html).

```(bash)
$ vtune -collect gpu-hotspots -k characterization-mode=overview -k collect-programming-api=true -data-limit=0 --duration 20 -- ./dgemm.mkl
```

## Support for loaned systems

If you need support during the sample period, either submit a service request or call the customer support center.

### Submitting service requests

Follow these steps to submit a service request.

1. Log in to the [support portal](https://www.intel.com/support/gpumaxsupport).
2. Select **Intel® Data Center GPU Max 1100** and choose **Create Request**.
3. Describe your issue on the next screen and select **Check For Answers**.
4. Choose **Continue to Request Creation**.
5. Provide answers to additional questions and click **Submit Request**.

A confirmation window appears, informing you that a new case number has been created. You can
expect a response within 24 hours.

### Calling the customer support center

The customer support center is open Monday to Friday from 8 AM to 5 PM PST. To reach the center, please call: (+1) 855-816-1934.