Core Programming Guide

Drivers and Devices

The API architecture exposes both physical and logical abstraction of the underlying devices capabilities. The device, sub-device and memory are exposed at physical level while command queues, events and synchronization methods are defined as logical entities. All logical entities will be bound to device level physical capabilities.

Device discovery APIs enumerate the accelerators functional features. These APIs provide interface to query information like compute unit count within the device or sub device, available memory and affinity to the compute, user managed cache size and work submission command queues.

Drivers

A driver object represents a collection of physical devices in the system accessed by the same Level-Zero driver.

  • The application may query the number of Level-Zero drivers installed on the system, and their respective handles, using zeDriverGet.

  • More than one driver may be available in the system. For example, one driver may support two GPUs from one vendor, another driver supports a GPU from a different vendor, and finally a different driver may support an FPGA.

  • Driver objects are read-only, global constructs. i.e. Multiple calls to zeDriverGet will return identical driver handles.

  • A driver handle is primarily used during device discovery and during creation and management of contexts.

Device

A device object represents a physical device in the system that supports Level-Zero.

  • The application may query the number devices supported by a driver, and their respective handles, using zeDeviceGet.

  • Device objects are read-only, global constructs. i.e. Multiple calls to zeDeviceGet will return identical device handles.

  • A device handle is primarily used during creation and management of resources that are specific to a device.

  • The application is responsible for sharing memory and explicit submission and synchronization across multiple devices.

  • Device may expose sub-devices that allow finer-grained control of physical or logical partitions of a device.

The following diagram illustrates the relationship between the driver, device and other objects described in this document.

../_images/core_device.png

Initialization and Discovery

The Level-Zero API must be initialized by calling zeInit before calling any other API function. This function will load all Level-Zero driver(s) in the system into memory for the current process, for use by all Host threads. Simultaneous calls to zeInit are thread-safe and only one instance of each driver will be loaded.

The following pseudo-code demonstrates a basic initialization and device discovery sequence:

// Initialize the driver
zeInit(0);

// Discover all the driver instances
uint32_t driverCount = 0;
zeDriverGet(&driverCount, nullptr);

ze_driver_handle_t* allDrivers = allocate(driverCount * sizeof(ze_driver_handle_t));
zeDriverGet(&driverCount, allDrivers);

// Find a driver instance with a GPU device
ze_driver_handle_t hDriver = nullptr;
ze_device_handle_t hDevice = nullptr;
for(i = 0; i < driverCount; ++i) {
    uint32_t deviceCount = 0;
    zeDeviceGet(allDrivers[i], &deviceCount, nullptr);

    ze_device_handle_t* allDevices = allocate(deviceCount * sizeof(ze_device_handle_t));
    zeDeviceGet(allDrivers[i], &deviceCount, allDevices);

    for(d = 0; d < deviceCount; ++d) {
        ze_device_properties_t device_properties;
        zeDeviceGetProperties(allDevices[d], &device_properties);

        if(ZE_DEVICE_TYPE_GPU == device_properties.type) {
            hDriver = allDrivers[i];
            hDevice = allDevices[d];
            break;
        }
    }

    free(allDevices);
    if(nullptr != hDriver) {
        break;
    }
}

free(allDrivers);
if(nullptr == hDevice)
    return; // no GPU devices found

...

Contexts

A context is a logical object used by the driver for managing all memory, command queues/lists, modules, synchronization objects, etc.

  • A context handle is primarily used during creation and management of resources that may be used by multiple devices.

  • For example, memory is not implicitly shared across all devices supported by a driver. However, it is available to be explicitly shared.

The following pseudo-code demonstrates a basic context creation:

// Create context
ze_context_desc_t ctxtDesc = {
   ZE_STRUCTURE_TYPE_CONTEXT_DESC,
   nullptr,
   0
};
zeContextCreate(hDriver, &ctxtDesc, &hContext);

An application may optionally create multiple contexts using zeContextCreate.

  • The primary usage-model for multiple contexts is isolation of memory and objects for multiple libraries within the same process.

  • The same context may be used simultaneously on multiple Host threads.

The following pseudo-code demonstrates a basic context creation and activation sequence:

// Create context(s)
zeContextCreate(hDriver, &ctxtDesc, &hContextA);
zeContextCreate(hDriver, &ctxtDesc, &hContextB);

zeMemAllocHost(hContextA, &desc, 80, 0, &ptrA);
zeMemAllocHost(hContextB, &desc, 88, 0, &ptrB);

memcpy(ptrA, ptrB, 0xe); // ok
zeMemGetAllocProperties(hContextA, ptrB, &props, &hDevice); // illegal: Context A has no knowledge of ptrB

If a device was hung or reset, then the context is no longer valid and all APIs will return ZE_RESULT_ERROR_DEVICE_LOST when any object associated with that context is used. All pointers to memory allocations and handles to objects (including other contexts) created on the context will be invalid and should no longer be used. An application can use zeContextGetStatus at any time to check the status of a context.

In order to recover, the context must be destroyed using zeContextDestroy. After the device is reset, the application can create a new context and continue operation. An application must call zeDeviceGetStatus to confirm the device has been reset and update the OS handle attached to the device handle. Otherwise, even after the device has been reset, the call to zeContextCreate will fail.

Memory and Images

Memory is visible to the upper-level software stack as unified memory with a single virtual address space covering both the Host and a specific device.

For GPUs, the API exposes two levels of the device memory hierarchy:

  1. Local Device Memory: can be managed at the device and/or sub device level.

  2. Device Cache(s):

    • Last Level Cache (L3) can be controlled through memory allocation APIs.

    • Low Level Cache (L1) can be controlled through program language intrinsics.

The API allows allocation of buffers and images at device and sub device granularity with full cacheablity hints.

  • Buffers are transparent memory accessed through virtual address pointers

  • Images are opaque objects accessed through handles

The memory APIs provide allocation methods to allocate either device, host or shared memory. The APIs enable both implicit and explicit management of the resources by the application or runtimes. The interface also provides query capabilities for all memory objects.

There are two types of allocations:

  1. Memory - linear, unformatted allocations for direct access from both the host and device.

  2. Images - non-linear, formatted allocations for direct access from the device.

Memory

Linear, unformatted memory allocations are represented as pointers in the host application. A pointer on the Host has the same size as a pointer on the device.

Types

Three types of allocations are supported. The type of allocation describes the ownership of the allocation:

  1. Host allocations are owned by the host and are intended to be allocated out of system memory.

    • Host allocations are accessible by the host and one or more devices.

    • The same pointer to a host allocation may be used on the host and all supported devices; they have address equivalence.

    • Host allocations are not expected to migrate between system memory and device local memory.

    • Host allocations trade off wide accessibility and transfer benefits for potentially higher per-access costs, such as over PCI express.

  2. Device allocations are owned by a specific device and are intended to be allocated out of device local memory, if present.

    • Device allocations generally trade off access limitations for higher performance.

    • With very few exceptions, device allocations may only be accessed by the specific device that they are allocated on, or copied to another device or Host allocation.

    • The same pointer to a device allocation may be used on any supported device.

  3. Shared allocations share ownership and are intended to migrate between the host and one or more devices.

    • Shared allocations are accessible by at least the host and an associated device.

    • Shared allocations may be accessed by other devices in some cases.

    • Shared allocations trade off transfer costs for per-access benefits.

    • The same pointer to a shared allocation may be used on the host and all supported devices.

A Shared System allocation is a sub-class of a Shared allocation, where the memory is allocated by a system allocator (such as malloc or new) rather than by an allocation API. Shared system allocations have no associated device; they are inherently cross-device. Like other shared allocations, shared system allocations are intended to migrate between the host and supported devices, and the same pointer to a shared system allocation may be used on the host and all supported devices.

In summary:

Name

Initial Location

Accessible By

Migratable To

Host

Host

Host

Yes

Host

N/A

Any Device

Yes (perhaps over PCIe)

Device

No

Device

Specific Device

Host

No

Host

No

Specific Device

Yes

Device

N/A

Another Device

Optional (may require p2p)

Another Device

No

Shared

Host, Specific Device, or Unspecified

Host

Yes

Host

Yes

Specific Device

Yes

Device

Yes

Another Device

Optional (may require p2p)

Another Device

Optional

Shared System

Host

Host

Yes

Host

Yes

Device

Yes

Device

Yes

Devices may support different capabilities for each type of allocation. Supported capabilities are:

  • ZE_MEMORY_ACCESS_CAP_FLAG_RW - if a device supports access (read or write) to allocations of the specified type.

  • ZE_MEMORY_ACCESS_CAP_FLAG_ATOMIC - if a device support atomic operations on allocations of the specified type. Atomic operations may include relaxed consistency read-modify-write atomics and atomic operations that enforce memory consistency for non-atomic operations.

  • ZE_MEMORY_ACCESS_CAP_FLAG_CONCURRENT - if a device supports concurrent access to allocations of the specified type. Concurrent access may be from another device that supports concurrent access, or from the host. Devices that support concurrent access but do not support concurrent atomic access must write to unique non-overlapping memory locations to avoid data races and hence undefined behavior.

  • ZE_MEMORY_ACCESS_CAP_FLAG_CONCURRENT_ATOMIC - if a device supports concurrent atomic operations on allocations of the specified type. Concurrent atomic operations may be from another device that supports concurrent atomic access, or from the host. Devices that support concurrent atomic access may use atomic operations to enforce memory consistency with other devices that support concurrent atomic access, or with the host.

At a minimum, drivers will assign unique physical pages for each device and shared memory allocation. However, it is undefined behavior for an application to access memory outside of the allocation size requested. The actual page size used for an allocation can be queried from ze_memory_allocation_properties_t.pageSize using zeMemGetAllocProperties. Applications should implement usage-specific allocators from device memory pools (e.g., small and/or fixed-sized allocations, lock-free, etc.).

Furthermore, drivers may oversubscribe some shared allocations. When and how such oversubscription occurs, including which allocations are evicted when the working set changes, are considered implementation details.

The required matrix of capabilities are:

Allocation Type

RW Access

Atomic Access

Concurrent Access

Concurrent Atomic Access

Host

Required

Optional

Optional

Optional

Device

Required

Optional

Optional

Optional

Shared

Required

Optional

Optional

Optional

Shared (Cross-Device)

Optional

Optional

Optional

Optional

Shared System (Cross-Device)

Optional

Optional

Optional

Optional

Cache Hints, Prefetch, and Memory Advice

Cacheability hints may be provided via separate host and device allocation flags when memory is allocated.

Shared allocations may be prefetched to a supporting device via the zeCommandListAppendMemoryPrefetch API. Prefetching may allow memory transfers to be scheduled concurrently with other computations and may improve performance.

Additionally, an application may provide memory advice for a shared allocation via the zeCommandListAppendMemAdvise API, to override driver heuristics or migration policies. Memory advice may avoid unnecessary or unprofitable memory transfers and may improve performance.

Both prefetch and memory advice are asynchronous operations that are appended into command lists.

Reserved Device Allocations

If an application needs finer grained control of physical memory consumption for device allocations then it can reserve a range of the virtual address space and map this to physical memory as needed. This provides flexibility for applications to manage large dynamic data structures which can grow and shrink over time while maintaining optimal physical memory usage.

Reserving Virtual Address Space

Virtual memory can be reserved using zeVirtualMemReserve. The reservation starting address and size must be page aligned. Applications should query the page size for the allocation using zeVirtualMemQueryPageSize.

The following pseudo-code demonstrates a basic sequence for reserving virtual memory:

// Query page size for our 1MB allocation.
size_t pageSize;
size_t allocationSize = 1048576;
zeVirtualMemQueryPageSize(hContext, hDevice, allocationSize, &pageSize);

// Reserve 1MB of virtual address space.
size_t reserveSize = align(allocationSize, pageSize);

void* ptr = nullptr;
zeVirtualMemReserve(hContext, nullptr, reserveSize, &ptr);

Growing Virtual Address Reservations

An application may wish to reserve an address range starting at a specific virtual address. This may be useful when there is a need to grow a reservation. However, if the implementation is not able to reserve the new allocation at the requested starting address then it will find a new suitable range with a different starting address. If the application requires a specific starting address then the application should ensure that the return address from zeVirtualMemReserve matches the starting address it wants. If they are different then the application may want to create a new larger reservation and remap the physical memory from the first reservation to this new reservation and free the old reservation.

// Reserve another 1MB of virtual address space that is contiguous with previous reservation.
void* newptr = (uint8_t*)ptr + reserveSize;
void* retptr;
zeVirtualMemReserve(hContext, newptr, reserveSize, &retptr);

if (retptr != newptr)
{
    // Free new reservation as it's not what we want due to incorrect starting address.
    zeVirtualMemFree(hContext, retptr, reserveSize);

    // Make new larger 2MB reservation and remap physical pages to this.
    size_t pageSize;
    size_t largerAllocationSize = 2097152;
    zeVirtualMemQueryPageSize(hContext, hDevice, largerAllocationSize, &pageSize);

    // Reserve 2MB of virtual address space.
    size_t largerReserveSize = align(largerAllocationSize, pageSize);

    void* ptr = nullptr;
    zeVirtualMemReserve(hContext, nullptr, largerReserveSize, &ptr);

    // Remap physical pages from original reservation to our new larger reservation.
    ...

    // Free original reservation that we were trying to grow.
    zeVirtualMemFree(hContext, ptr, reserveSize);
}

Physical Memory

Physical memory is explicitly represented in the API as physical memory objects that are reservations of physical pages. The application will use zePhysicalMemCreate to create a physical memory object.

The following pseudo-code demonstrates a basic sequence for creating a physical memory object:

// Create 1MB physical memory object
ze_physical_mem_handle_t hPhysicalAlloc;
size_t physicalSize = align(allocationSize, pageSize);
ze_physical_mem_desc_t pmemDesc = {
    ZE_STRUCTURE_TYPE_PHYSICAL_MEM_DESC,
    nullptr,
    0, // flags
    physicalSize // size
};

zePhysicalMemCreate(hContext, hDevice, &pmemDesc, &hPhysicalAlloc);

Mapping Virtual Memory Pages

Reserved virtual memory pages can be mapped to physical memory using zeVirtualMemMap. An application can map the entire reserved virtual address range or can sparsely map the reserved virtual address range using one or more physical memory objects. Once mapped, the physical pages for a physical memory object can be faulted in for devices that support on-demand paging. In addition, the residency API can be used to control residency of these physical pages.

The following pseudo-code demonstrates mapping a 1MB reservation into physical memory:

// Map entire 1MB reservation and set access to read/write.
zeVirtualMemMap(hContext, ptr, reserveSize, hPhysicalAlloc, 0,
    ZE_MEMORY_ACCESS_ATTRIBUTE_READWRITE);

Access Attributes

Access attributes can be set for a range of pages when mapping virtual memory pages with zeVirtualMemMap or with zeVirtualMemSetAccessAttribute. In addition, an application can query access attributes for a page aligned virtual memory range.

size_t accessRangeSize;
ze_memory_access_attribute_t access;
zeVirtualMemGetAccessAttribute(hContext, ptr, reserveSize, &access, &accessRangeSize);

// Expecting entire range to have the same access attribute and it be read/write.
assert(accessRangeSize == reserveSize);
assert(access == ZE_MEMORY_ACCESS_ATTRIBUTE_READWRITE);

Sparse Mappings

Applications may desire to reserve large virtual address ranges to make available to its custom allocators. These ranges can be sparsely mapped using one or more physical memory objects. It is recommended that the application queries the page size for each suballocation to ensure the implementation can use an optimal page size for the mappings based on the alignments used for starting address and size used.

The following example makes a 1GB reserved allocation and then makes both 128KB and 8MB sub-allocations.

../_images/core_reserved_suballocations.png
// Reserve 1GB of virtual address space to manage.
size_t pageSize;
size_t allocationSize = 1048576000;
zeVirtualMemQueryPageSize(hContext, hDevice, allocationSize, &pageSize);

size_t reserveSize = align(allocationSize, pageSize);

void* ptr = nullptr;
zeVirtualMemReserve(hContext, nullptr, reserveSize, &ptr);

...

// Sub-allocate 128KB of our 1GB allocation.
size_t subAllocSize = 131072;
zeVirtualMemQueryPageSize(hContext, hDevice, subAllocSize, &pageSize);

// Create physical memory object for our 128KB sub-allocation.
size_t subAllocAlignedSize = align(subAllocSize, pageSize);
ze_physical_mem_desc_t pmemDesc = {
    ZE_STRUCTURE_TYPE_PHYSICAL_MEM_DESC,
    nullptr,
    0, // flags
    subAllocAlignedSize // size
};
ze_physical_mem_handle_t hPhysicalAlloc;
zePhysicalMemCreate(hContext, hDevice, &pmemDesc, &hPhysicalAlloc);

// Find suitable 128KB sub-allocation that matches page alignments.
...

zeVirtualMemMap(hContext, subAllocPtr, subAllocAlignedSize, hPhysicalAlloc, 0,
    ZE_MEMORY_ACCESS_ATTRIBUTE_READWRITE);

...

// Sub-allocate 8MB of our 1GB allocation.
size_t subAllocDiffSize = 8388608;
zeVirtualMemQueryPageSize(hContext, hDevice, subAllocDiffSize, &pageSize);
...

Images

An image is used to store multi-dimensional and format-defined memory. An image’s contents may be stored in an implementation-specific encoding and layout in memory for optimal device access (e.g., tile swizzle patterns, lossless compression, etc.). There is no support for direct Host access to an image’s content. However, when an image is copied to a Host-accessible memory allocation, its contents will be implicitly decoded to be implementation-independent.

// Specify single component FLOAT32 format
ze_image_format_t format = {
    ZE_IMAGE_FORMAT_LAYOUT_32, ZE_IMAGE_FORMAT_TYPE_FLOAT,
    ZE_IMAGE_FORMAT_SWIZZLE_R, ZE_IMAGE_FORMAT_SWIZZLE_0, ZE_IMAGE_FORMAT_SWIZZLE_0, ZE_IMAGE_FORMAT_SWIZZLE_1
};

ze_image_desc_t imageDesc = {
    ZE_STRUCTURE_TYPE_IMAGE_DESC,
    nullptr,
    0, // read-only
    ZE_IMAGE_TYPE_2D,
    format,
    128, 128, 0, 0, 0
};
ze_image_handle_t hImage;
zeImageCreate(hContext, hDevice, &imageDesc, &hImage);

// upload contents from host pointer
zeCommandListAppendImageCopyFromMemory(hCommandList, hImage, nullptr, pImageData, nullptr, 0, nullptr);
...

A format descriptor is a combination of a format layout, type, and a swizzle. The format layout describes the number of components and their corresponding bit widths. The type describes the data type for all of these components with some exceptions that are described below. The swizzles associate how the image components are mapped into XYZW/RGBA channels of the kernel. It is allowed to replicate components into the channels.

The following table describes which types are required for each layout.

Format layout

UINT

SINT

UNORM

SNORM

FLOAT

8

Required

Required

Required

Required

Unsupported

8_8

Required

Required

Required

Required

Unsupported

8_8_8_8

Required

Required

Required

Required

Unsupported

16

Required

Required

Required

Required

Required

16_16

Required

Required

Required

Required

Required

16_16_16_16

Required

Required

Required

Required

Required

32

Required

Required

Required

Required

Required

32_32

Required

Required

Required

Required

Required

32_32_32_32

Required

Required

Required

Required

Required

10_10_10_2

Required

Required

Required

Required

Required

11_11_10

Unsupported

Unsupported

Unsupported

Unsupported

Required

5_6_5

Unsupported

Unsupported

Required

Unsupported

Unsupported

5_5_5_1

Unsupported

Unsupported

Required

Unsupported

Unsupported

4_4_4_4

Unsupported

Unsupported

Required

Unsupported

Unsupported

Device Cache Settings

There are two methods for device and kernel cache control:

  1. Cache Size Configuration: Ability to configure larger size for SLM vs Data per Kernel instance.

  2. Runtime Hint/preference for application to allow access to be Cached or not in Device Caches. For GPU device this is provided via two ways:

    • During Image creation via Flag

    • Kernel instruction

The following pseudo-code demonstrates a basic sequence for Cache size configuration:

// Configure cache to support larger SLM
// Note: The cache setting is applied to each kernel.
zeKernelSetCacheConfig(hKernel, ZE_CACHE_CONFIG_FLAG_LARGE_SLM);

External Memory Import and Export

External memory handles may be imported from other APIs, or exported for use in other APIs. Importing and exporting external memory is an optional feature. Devices may describe the types of external memory handles they support using zeDeviceGetExternalMemoryProperties.

Importing and exporting external memory is supported for device memory allocations and images.

The following pseudo-code demonstrates how to allocate and export an external memory handle for a device memory allocation as a Linux dma_buf:

// Set up the request for an exportable allocation
ze_external_memory_export_desc_t export_desc = {
    ZE_STRUCTURE_TYPE_EXTERNAL_MEMORY_EXPORT_DESC,
    nullptr, // pNext
    ZE_EXTERNAL_MEMORY_TYPE_FLAG_DMA_BUF
};

// Link the request into the allocation descriptor and allocate
alloc_desc.pNext = &export_desc;
zeMemAllocDevice(hContext, &alloc_desc, size, alignment, hDevice, &ptr);

...

// Set up the request to export the external memory handle
ze_external_memory_export_fd_t export_fd = {
    ZE_STRUCTURE_TYPE_EXTERNAL_MEMORY_EXPORT_FD,
    nullptr, // pNext
    ZE_EXTERNAL_MEMORY_TYPE_FLAG_OPAQUE_FD,
    0 // [out] fd
};

// Link the export request into the query
alloc_props.pNext = &export_fd;
zeMemGetAllocProperties(hContext, ptr, &alloc_props, nullptr);

The following pseudo-code demonstrates how to import a Linux dma_buf as an external memory handle for a device memory allocation:

// Set up the request to import the external memory handle
ze_external_memory_import_fd_t import_fd = {
    ZE_STRUCTURE_TYPE_EXTERNAL_MEMORY_IMPORT_FD,
    nullptr, // pNext
    ZE_EXTERNAL_MEMORY_TYPE_FLAG_DMA_BUF,
    fd
};

// Link the request into the allocation descriptor and allocate
alloc_desc.pNext = &import_fd;
zeMemAllocDevice(hContext, &alloc_desc, size, alignment, hDevice, &ptr);

Another example, which the following pseudo-code demonstrates, is how to import a Linux dma_buf as an external memory handle for Images:

// Set up the request to import the external memory handle
ze_external_memory_import_fd_t import_fd = {
    ZE_STRUCTURE_TYPE_EXTERNAL_MEMORY_IMPORT_FD,
    nullptr, // pNext
    ZE_EXTERNAL_MEMORY_TYPE_FLAG_DMA_BUF,
    fd
};

// Link the request into the allocation descriptor and allocate
image_desc.pNext = &import_fd; // extend ze_image_desc_t

// Setup matching image properties for imported image.
image_desc.width = import_width;
...

zeImageCreate(hContext, hDevice, &image_desc, &hImage);

Command Queues and Command Lists

The following are the motivations for separating a command queue from a command list:

  • Command queues are mostly associated with physical device properties, such as the number of input streams.

  • Command queues provide (near) zero-latency access to the device.

  • Command lists are mostly associated with Host threads for simultaneous construction.

  • Command list construction can occur independently of command queue submission.

The following diagram illustrates the hierarchy of command lists and command queues to the device:

../_images/core_queue.png

Command Queue Groups

A command queue group represents a physical input stream, which represents one or more physical device engines.

Discovery

The following pseudo-code demonstrates a basic sequence for discovery of command queue groups:

// Discover all command queue groups
uint32_t cmdqueueGroupCount = 0;
zeDeviceGetCommandQueueGroupProperties(hDevice, &cmdqueueGroupCount, nullptr);

ze_command_queue_group_properties_t* cmdqueueGroupProperties = (ze_command_queue_group_properties_t*)
    allocate(cmdqueueGroupCount * sizeof(ze_command_queue_group_properties_t));
zeDeviceGetCommandQueueGroupProperties(hDevice, &cmdqueueGroupCount, cmdqueueGroupProperties);


// Find a command queue type that support compute
uint32_t computeQueueGroupOrdinal = cmdqueueGroupCount;
for( uint32_t i = 0; i < cmdqueueGroupCount; ++i ) {
    if( cmdqueueGroupProperties[ i ].flags & ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COMPUTE ) {
        computeQueueGroupOrdinal = i;
        break;
    }
}

if(computeQueueGroupOrdinal == cmdqueueGroupCount)
    return; // no compute queues found

Command Queues

A command queue represents a logical input stream to the device, tied to a physical input stream.

Creation

  • At creation time, the command queue is explicitly bound to a command queue group via its ordinal.

  • Multiple command queues may be created that use the same command queue group. For example, an application may create a command queue per Host thread with different scheduling priorities.

  • Multiple command queues created for the same command queue group on the same context, may also share the same physical hardware context.

  • The maximum number of command queues an application can create is limited by device-specific resources; e.g., the maximum number of logical hardware contexts supported by the device. This can be queried from ze_device_properties_t.maxHardwareContexts.

  • All command lists executed on a command queue are guaranteed to only execute on an engine from the command queue group which it is assigned; e.g., copy commands in a compute command list / queue will execute via the compute engine, not the copy engine.

  • The command queue index provides a mechanism for an application to indicate which command queues can execute concurrently (different indices) vs. those that cannot (same indices).

  • There is no guarantee command lists submitted to command queues with different indices will execute concurrently, only a possibility that they might execute concurrently.

The following pseudo-code demonstrates a basic sequence for creation of command queues:

// Create a command queue
ze_command_queue_desc_t commandQueueDesc = {
    ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC,
    nullptr,
    computeQueueGroupOrdinal,
    0, // index
    0, // flags
    ZE_COMMAND_QUEUE_MODE_DEFAULT,
    ZE_COMMAND_QUEUE_PRIORITY_NORMAL
};
ze_command_queue_handle_t hCommandQueue;
zeCommandQueueCreate(hContext, hDevice, &commandQueueDesc, &hCommandQueue);
...

Execution

  • Command lists submitted to a command queue are immediately submitted to the device for execution.

  • Submitting multiple commands lists in a single submission allows an implementation the opportunity to optimize across command lists.

  • Command queue submission is free-treaded, allowing multiple Host threads to share the same command queue.

  • If multiple Host threads enter the same command queue simultaneously, then execution order is undefined.

  • Command lists can only be executed on a command queue with an identical command queue group ordinal.

  • If a device contains multiple sub-devices, then command lists submitted to a device-level command queue may be optimized by the driver to fully exploit the concurrency of the sub-devices by distributing command lists across sub-devices.

  • If the application prefers to opt-out of these optimizations, such as when the application plans to perform this distribution itself, then it should use ZE_COMMAND_QUEUE_FLAG_EXPLICIT_ONLY. Only command lists created using ZE_COMMAND_LIST_FLAG_EXPLICIT_ONLY can be executed on a command queue created using ZE_COMMAND_QUEUE_FLAG_EXPLICIT_ONLY.

Destruction

  • The application is responsible for making sure the device is not currently executing from a command queue before it is deleted. This is typically done by tracking command queue fences, but may also be handled by calling zeCommandQueueSynchronize.

Command Lists

A command list represents a sequence of commands for execution on a command queue.

Creation

  • A command list is created for a device to allow device-specific appending of commands.

  • A command list is created for execution on a specific type of command queue, specified using the command queue group ordinal.

  • A command list can be copied to create another command list. The application may use this to copy a command list for use on a different device.

Appending

  • There is no implicit binding of command lists to Host threads. Therefore, an application may share a command list handle across multiple Host threads. However, the application is responsible for ensuring that multiple Host threads do not access the same command list simultaneously.

  • By default, commands are started in the same order in which they are appended. However, an application may allow the driver to optimize the ordering by using ZE_COMMAND_LIST_FLAG_RELAXED_ORDERING. Reordering is guaranteed to only occur between barriers and synchronization primitives.

  • By default, commands submitted to a command list are optimized for execution by balancing both device throughput and Host latency.

  • For very low-level latency usage-models, applications should use immediate command lists.

  • For usage-models where maximum throughput is desired, applications should use ZE_COMMAND_LIST_FLAG_MAXIMIZE_THROUGHPUT. This flag will indicate to the driver it may perform additional device-specific optimizations.

  • If a device contains multiple sub-devices, then commands submitted to a device-level command list may be optimized by the driver to fully exploit the concurrency of the sub-devices by distributing commands across sub-devices. If the application prefers to opt-out of these optimizations, such as when the application plans to perform this distribution itself, then it should use ZE_COMMAND_LIST_FLAG_EXPLICIT_ONLY.

The following pseudo-code demonstrates a basic sequence for creation of command lists:

// Create a command list
ze_command_list_desc_t commandListDesc = {
    ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC,
    nullptr,
    computeQueueGroupOrdinal,
    0 // flags
};
ze_command_list_handle_t hCommandList;
zeCommandListCreate(hContext, hDevice, &commandListDesc, &hCommandList);
...

Submission

  • There is no implicit association between a command list and a command queue. Therefore, a command list may be submitted to any or multiple command queues.

  • By definition, a command list cannot be executed concurrently on multiple command queues.

  • The application is responsible for calling close before submission to a command queue.

  • Command lists do not inherit state from other command lists executed on the same command queue. i.e. each command list begins execution in its own default state.

  • A command list may be submitted multiple times. It is up to the application to ensure that the command list can be executed multiple times. For example, events must be explicitly reset prior to re-execution.

The following pseudo-code demonstrates submission of commands to a command queue, via a command list:

...
// finished appending commands (typically done on another thread)
zeCommandListClose(hCommandList);

// Execute command list in command queue
zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, nullptr);

// synchronize host and device
zeCommandQueueSynchronize(hCommandQueue, UINT32_MAX);

// Reset (recycle) command list for new commands
zeCommandListReset(hCommandList);
...

Recycling

  • A command list may be recycled to avoid the overhead of frequent creation and destruction.

  • The application is responsible for making sure the device is not currently executing from a command list before it is reset. This should be handled by tracking a completion event associated with the command list.

  • The application is responsible for making sure the device is not currently executing from a command list before it is deleted. This should be handled by tracking a completion event associated with the command list.

Low-Latency Immediate Command Lists

A special type of command list can be used for very low-latency submission usage-models.

  • An immediate command list is both a command list and an implicit command queue.

  • An immediate command list is created using a command queue descriptor.

  • Commands submitted to an immediate command list are immediately executed on the device.

  • An immediate command list is not required to be closed or reset. However, usage will be honored, and expected behaviors will be followed.

The following pseudo-code demonstrates a basic sequence for creation and usage of immediate command lists:

// Create an immediate command list
ze_command_queue_desc_t commandQueueDesc = {
    ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC,
    nullptr,
    computeQueueGroupOrdinal,
    0, // index
    0, // flags
    ZE_COMMAND_QUEUE_MODE_DEFAULT,
    ZE_COMMAND_QUEUE_PRIORITY_NORMAL
};
ze_command_list_handle_t hCommandList;
zeCommandListCreateImmediate(hContext, hDevice, &commandQueueDesc, &hCommandList);

// Immediately submit a kernel to the device
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);
...

Synchronization Primitives

There are two types of synchronization primitives:

  1. Fences - used to communicate to the host that command queue execution has completed.

  2. Events - used as fine-grain host-to-device, device-to-host or device-to-device execution and memory dependencies.

The following are the motivations for separating the different types of synchronization primitives:

  • Allows device-specific optimizations for certain types of primitives:

    • Fences may share device memory with all other fences within the same command queue.

    • Events may be implemented using pipelined operations as part of the program execution.

    • Fences are implicit, coarse-grain execution and memory barriers.

    • Events optionally cause fine-grain execution and memory barriers.

  • Allows distinction on which type of primitive may be shared across devices.

Generally. Events are generic synchronization primitives that can be used across many different usage-models, including those of fences. However, this generality comes with some cost in memory overhead and efficiency.

Fences

A fence is a heavyweight synchronization primitive used to communicate to the host that command list execution has completed.

  • A fence is associated with a single command queue.

  • A fence can only be signaled from a device’s command queue (e.g. between execution of command lists) and can only be waited upon from the host.

  • A fence guarantees both execution completion and memory coherency, across the device and host, prior to being signaled.

  • A fence only has two states: not signaled and signaled.

  • A fence doesn’t implicitly reset. Signaling a signaled fence (or resetting an unsignaled fence) is valid and has no effect on the state of the fence.

  • A fence can only be reset from the Host.

  • A fence cannot be shared across processes.

The following pseudo-code demonstrates a sequence for creation, submission and querying of a fence:

// Create fence
ze_fence_desc_t fenceDesc = {
    ZE_STRUCTURE_TYPE_FENCE_DESC,
    nullptr,
    0 // flags
};
ze_fence_handle_t hFence;
zeFenceCreate(hCommandQueue, &fenceDesc, &hFence);

// Execute a command list with a signal of the fence
zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, hFence);

// Wait for fence to be signaled
zeFenceHostSynchronize(hFence, UINT32_MAX);
zeFenceReset(hFence);
...

The primary usage model for fences is to notify the Host when a command list has finished execution to allow:

  • Recycling of memory and images

  • Recycling of command lists

  • Recycling of other synchronization primitives

  • Explicit memory residency.

The following diagram illustrates fences signaled after command lists on execution:

../_images/core_fence.png

Events

An event is used to communicate fine-grain host-to-device, device-to-host or device-to-device dependencies have completed.

  • An event can be:

    • Signaled from within a device’s command list and waited upon within the same command list

    • Signaled from within a device’s command list and waited upon from the host, another command queue or another device

    • Signaled from the host, and waited upon from within a device’s command list.

  • An event only has two states: not signaled and signaled.

  • An event doesn’t implicitly reset. Signaling a signaled event (or resetting an unsignaled event) is valid and has no effect on the state of the event.

  • An event can be explicitly reset from the Host or device.

  • An event can be appended into multiple command lists simultaneously.

  • An event can be shared across devices and processes.

  • An event can invoke an execution and/or memory barrier; which should be used sparingly to avoid device underutilization.

  • There are no protections against events causing deadlocks, such as circular waits scenarios.

    • These problems are left to the application to avoid.

  • An event intended to be signaled by the host, another command queue or another device after command list submission to a command queue may prevent subsequent forward progress within the command queue itself.

    • This can create bubbles in the pipeline or deadlock situations if not correctly scheduled.

An event pool is used for creation of individual events:

  • An event pool reduces the cost of creating multiple events by allowing underlying device allocations to be shared by events with the same properties

  • An event pool can be shared via Inter-Process Communication; allowing sharing blocks of events rather than sharing each individual event

The following pseudo-code demonstrates a sequence for creation and submission of an event:

// Create event pool
ze_event_pool_desc_t eventPoolDesc = {
    ZE_STRUCTURE_TYPE_EVENT_POOL_DESC,
    nullptr,
    ZE_EVENT_POOL_FLAG_HOST_VISIBLE, // all events in pool are visible to Host
    1 // count
};
ze_event_pool_handle_t hEventPool;
zeEventPoolCreate(hContext, &eventPoolDesc, 0, nullptr, &hEventPool);

ze_event_desc_t eventDesc = {
    ZE_STRUCTURE_TYPE_EVENT_DESC,
    nullptr,
    0, // index
    0, // no additional memory/cache coherency required on signal
    ZE_EVENT_SCOPE_FLAG_HOST  // ensure memory coherency across device and Host after event completes
};
ze_event_handle_t hEvent;
zeEventCreate(hEventPool, &eventDesc, &hEvent);

// Append a signal of an event into the command list after the kernel executes
zeCommandListAppendLaunchKernel(hCommandList, hKernel1, &launchArgs, hEvent, 0, nullptr);

// Execute the command list with the signal
zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, nullptr);

// Wait on event to complete
zeEventHostSynchronize(hEvent, 0);
...

The following diagram illustrates a dependency between command lists using events:

../_images/core_event.png

Kernel Timestamp Events

A kernel timestamp event is a special type of event that records device timestamps at the start and end of the execution of kernels. The primary motivation for kernel timestamps is to provide a duration of execution. For consistency and orthogonality, kernel timestamps are also supported for non-kernel operations. Kernel timestamps execute along a device timeline but because of limited range may wrap unexpectedly. Because of this, the temporal order of two kernel timestamps shouldn’t be inferred despite coincidental START/END values. {x}CommandListAppendWriteGlobalTimestamp provides a similar mechanism but with maximum range. Timestamps from {x}CommandListAppendWriteGlobalTimestamp and kernel timestamp events should not be inferred as equivalent even if reported within identical ranges.

// Get timestamp frequency
const double timestampFreq = NS_IN_SEC / device_properties.timerResolution;
const uint64_t timestampMaxValue = ~(-1 << device_properties.kernelTimestampValidBits);

// Create event pool
ze_event_pool_desc_t tsEventPoolDesc = {
    ZE_STRUCTURE_TYPE_EVENT_POOL_DESC,
    nullptr,
    ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP, // all events in pool are kernel timestamps
    1 // count
};
ze_event_pool_handle_t hTSEventPool;
zeEventPoolCreate(hContext, &tsEventPoolDesc, 0, nullptr, &hTSEventPool);

ze_event_desc_t tsEventDesc = {
    ZE_STRUCTURE_TYPE_EVENT_DESC,
    nullptr,
    0, // index
    0, // no additional memory/cache coherency required on signal
    0  // no additional memory/cache coherency required on wait
};
ze_event_handle_t hTSEvent;
zeEventCreate(hEventPool, &tsEventDesc, &hTSEvent);

// allocate memory for results
ze_device_mem_alloc_desc_t tsResultDesc = {
    ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
    nullptr,
    0, // flags
    0  // ordinal
};
ze_kernel_timestamp_result_t* tsResult = nullptr;
zeMemAllocDevice(hContext, &tsResultDesc, sizeof(ze_kernel_timestamp_result_t), sizeof(uint32_t), hDevice, &tsResult);

// Append a signal of a timestamp event into the command list after the kernel executes
zeCommandListAppendLaunchKernel(hCommandList, hKernel1, &launchArgs, hTSEvent, 0, nullptr);

// Append a query of a timestamp event into the command list
zeCommandListAppendQueryKernelTimestamps(hCommandList, 1, &hTSEvent, tsResult, nullptr, hEvent, 1, &hTSEvent);

// Execute the command list with the signal
zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, nullptr);

// Wait on event to complete
zeEventHostSynchronize(hEvent, 0);

// Calculation execution time(s)
double globalTimeInNs = ( tsResult->global.kernelEnd >= tsResult->global.kernelStart )
    ? ( tsResult->global.kernelEnd - tsResult->global.kernelStart ) * timestampFreq
    : (( timestampMaxValue - tsResult->global.kernelStart) + tsResult->global.kernelEnd + 1 ) * timestampFreq;

double contextTimeInNs = ( tsResult->context.kernelEnd >= tsResult->context.kernelStart )
    ? ( tsResult->context.kernelEnd - tsResult->context.kernelStart ) * timestampFreq
    : (( timestampMaxValue - tsResult->context.kernelStart) + tsResult->context.kernelEnd + 1 ) * timestampFreq;
...

Barriers

There are two types of barriers:

  1. Execution Barriers - used to communicate execution dependencies between commands within a command list or across command queues, devices and/or Host.

  2. Memory Barriers - used to communicate memory coherency dependencies between commands within a command list or across command queues, devices and/or Host.

The following pseudo-code demonstrates a sequence for submission of a brute-force execution and global memory barrier:

zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

// Append a barrier into a command list to ensure hKernel1 completes before hKernel2 begins
zeCommandListAppendBarrier(hCommandList, nullptr, 0, nullptr);

zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);
...

Execution Barriers

Commands executed on a command list are only guaranteed to start in the same order in which they are submitted; i.e. there is no implicit definition of the order of completion.

  • Fences provide implicit, coarse-grain control to indicate that all previous commands must complete prior to the fence being signaled.

  • Events provide explicit, fine-grain control over execution dependencies between commands; allowing more opportunities for concurrent execution and higher device utilization.

The following pseudo-code demonstrates a sequence for submission of a fine-grain execution-only dependency using events:

ze_event_desc_t event1Desc = {
    ZE_STRUCTURE_TYPE_EVENT_DESC,
    nullptr,
    0, // index
    0, // no additional memory/cache coherency required on signal
    0  // no additional memory/cache coherency required on wait
};
ze_event_handle_t hEvent1;
zeEventCreate(hEventPool, &event1Desc, &hEvent1);

// Ensure hKernel1 completes before signaling hEvent1
zeCommandListAppendLaunchKernel(hCommandList, hKernel1, &launchArgs, hEvent1, 0, nullptr);

// Ensure hEvent1 is signaled before starting hKernel2
zeCommandListAppendLaunchKernel(hCommandList, hKernel2, &launchArgs, nullptr, 1, &hEvent1);
...

Memory Barriers

Commands executed on a command list are not guaranteed to maintain memory coherency with other commands; i.e. there is no implicit memory or cache coherency.

  • Fences provide implicit, coarse-grain control to indicate that all caches and memory are coherent across the device and Host prior to the fence being signaled.

  • Events provide explicit, fine-grain control over cache and memory coherency dependencies between commands; allowing more opportunities for concurrent execution and higher device utilization.

The following pseudo-code demonstrates a sequence for submission of a fine-grain memory dependency using events:

ze_event_desc_t event1Desc = {
    ZE_STRUCTURE_TYPE_EVENT_DESC,
    nullptr,
    0, // index
    ZE_EVENT_SCOPE_FLAG_DEVICE, // ensure memory coherency across device before event signaled
    0  // no additional memory/cache coherency required on wait
};
ze_event_handle_t hEvent1;
zeEventCreate(hEventPool, &event1Desc, &hEvent1);

// Ensure hKernel1 memory writes are fully coherent across the device before signaling hEvent1
zeCommandListAppendLaunchKernel(hCommandList, hKernel1, &launchArgs, hEvent1, 0, nullptr);

// Ensure hEvent1 is signaled before starting hKernel2
zeCommandListAppendLaunchKernel(hCommandList, hKernel2, &launchArgs, nullptr, 1, &hEvent1);
...

Range-based Memory Barriers

Range-based memory barriers provide explicit control of which cachelines require coherency.

The following pseudo-code demonstrates a sequence for submission of a range-based memory barrier:

zeCommandListAppendLaunchKernel(hCommandList, hKernel1, &launchArgs, nullptr, 0, nullptr);

// Ensure memory range is fully coherent across the device after hKernel1 and before hKernel2
zeCommandListAppendMemoryRangesBarrier(hCommandList, 1, &size, &ptr, nullptr, 0, nullptr);

zeCommandListAppendLaunchKernel(hCommandList, hKernel2, &launchArgs, nullptr, 0, nullptr);
...

Modules and Kernels

There are multiple levels of constructs needed for executing kernels on the device:

  1. Modules represent a single translation unit that consists of kernels that have been compiled together.

  2. Kernels represent the kernel within the module that will be launched directly from a command list.

The following diagram provides a high-level overview of the major parts of the system.

../_images/core_module.png

Modules

Modules can be created from an IL or directly from native format using zeModuleCreate.

The following pseudo-code demonstrates a sequence for creating a module from an OpenCL kernel:

__kernel void image_scaling( __read_only  image2d_t src_img,
                             __write_only image2d_t dest_img,
                                          uint WIDTH,     // resized width
                                          uint HEIGHT )   // resized height
{
    int2       coor = (int2)( get_global_id(0), get_global_id(1) );
    float2 normCoor = convert_float2(coor) / (float2)( WIDTH, HEIGHT );

    float4    color = read_imagef( src_img, SMPL_PREF, normCoor );

    write_imagef( dest_img, coor, color );
}
...
// OpenCL C kernel has been compiled to SPIRV IL (pImageScalingIL)
ze_module_desc_t moduleDesc = {
    ZE_STRUCTURE_TYPE_MODULE_DESC,
    nullptr,
    ZE_MODULE_FORMAT_IL_SPIRV,
    ilSize,
    pImageScalingIL,
    nullptr,
    nullptr
};
ze_module_handle_t hModule;
zeModuleCreate(hContext, hDevice, &moduleDesc, &hModule, nullptr);
...

Module Build Options

Module build options can be passed with ze_module_desc_t as a string.

Build Option

Description

Default

Device Support

-ze-opt-disable

Disable optimizations.

Disabled

All

-ze-opt-level

Specifies optimization level for compiler. Levels are implementation specific.

  • 0 is no optimizations (equivalent to ze-opt-disable)

  • 1 is optimize minimally (may be the same as 2)

  • 2 is optimize more (default)

2

All

-ze-opt-greater-than-4GB-buffer-required

Use 64-bit offset calculations for buffers.

Disabled

GPU

-ze-opt-large-register-file

Increase number of registers available to threads.

Disabled

GPU

-ze-opt-has-buffer-offset-arg

Extend stateless to stateful optimization to more cases with the use of additional offset (e.g. 64-bit pointer to binding table with 32-bit offset).

Disabled

GPU

-g

Include debugging information.

Disabled

GPU

Module Specialization Constants

SPIR-V supports specialization constants that allow certain constants to be updated to new values during runtime execution. Each specialization constant in SPIR-V has an identifier and default value. The zeModuleCreate function allows for an array of constants and their corresponding identifiers to be passed in to override the constants in the SPIR-V module.

// Spec constant overrides for group size.
ze_module_constants_t specConstants = {
    3,
    pGroupSizeIds,
    pGroupSizeValues
};
// OpenCL C kernel has been compiled to SPIRV IL (pImageScalingIL)
ze_module_desc_t moduleDesc = {
    ZE_STRUCTURE_TYPE_MODULE_DESC,
    nullptr,
    ZE_MODULE_FORMAT_IL_SPIRV,
    ilSize,
    pImageScalingIL,
    nullptr,
    &specConstants
};
ze_module_handle_t hModule;
zeModuleCreate(hContext, hDevice, &moduleDesc, &hModule, nullptr);
...

Note: Specialization constants are only handled at module create time and therefore if you need to change them then you’ll need to compile a new module.

Module Build Log

The zeModuleCreate function can optionally generate a build log object ze_module_build_log_handle_t.

...
ze_module_build_log_handle_t buildlog;
ze_result_t result = zeModuleCreate(hContext, hDevice, &desc, &module, &buildlog);

// Only save build logs for module creation errors.
if (result != ZE_RESULT_SUCCESS)
{
    size_t szLog = 0;
    zeModuleBuildLogGetString(buildlog, &szLog, nullptr);

    char_t* strLog = allocate(szLog);
    zeModuleBuildLogGetString(buildlog, &szLog, strLog);

    // Save log to disk.
    ...

    free(strLog);
}

zeModuleBuildLogDestroy(buildlog);

Dynamically Linked Modules

Modules may be interdependent, i.e., a module may contain functions and global variables that are used and defined by different modules. Such a module is said to have both import as well as export linkage requirements. Private variables are not transferrable between linked modules, i.e., their visibility is limited to the module in which they are defined. Only global variables or static values passed to linked functions are visible between linked modules. All the import linkage requirements of a module must be satisfied before a kernel can be created from that module. Modules that have no imports do not need to be linked. Dynamically linking modules together is performed using zeModuleDynamicLink. Modules cannot have ambiguous import dependencies, i.e., imported functions and global variables must only be defined once in any given set of modules passed to zeModuleDynamicLink. Imports are linked only once. Once all the import dependencies of a module have been linked, the use of that fully import-linked module in subsequent calls to zeModuleDynamicLink will not cause the imports of the module to be re-linked.

The zeModuleDynamicLink function can optionally generate a link log object ze_module_build_log_handle_t.

...
ze_module_build_log_handle_t linklog;
ze_result_t result = zeModuleDynamicLink(numModules, &hModules, &hLinklog);

// Check if there are linking errors
if (result == ze_RESULT_ERROR_MODULE_LINK_FAILURE) {
  size_t szLog = 0;
  zeModuleBuildLogGetString(linklog, &szLog, nullptr);

  char_t* strLog = allocate(szLog);
  zeModuleBuildLogGetString(linklog, &szLog, strLog);

  // Save log to disk.
  ...

  free(strLog);
}

zeModuleBuildLogDestroy(linklog);

Module Caching with Native Binaries

Disk caching of modules is not supported by the driver. If a disk cache for modules is desired, then it is the responsibility of the application to implement this using zeModuleGetNativeBinary.

...
// compute hash for pIL and check cache.
...

if (cacheUpdateNeeded)
{
    size_t szBinary = 0;
    zeModuleGetNativeBinary(hModule, &szBinary, nullptr);

    uint8_t* pBinary = allocate(szBinary);
    zeModuleGetNativeBinary(hModule, &szBinary, pBinary);

    // cache pBinary for corresponding IL
    ...

    free(pBinary);
}

Also, note that the native binary will retain all debug information that is associated with the module. This allows debug capabilities for modules that are created from native binaries.

Built-in Kernels

Built-in kernels are not supported but can be implemented by an upper level runtime or library using the native binary interface.

Kernels

A Kernel is a reference to a kernel within a module and it supports both explicit and implicit kernel arguments along with data needed for launch.

The following pseudo-code demonstrates a sequence for creating a kernel from a module:

ze_kernel_desc_t kernelDesc = {
    ZE_STRUCTURE_TYPE_KERNEL_DESC,
    nullptr,
    0, // flags
    "image_scaling"
};
ze_kernel_handle_t hKernel;
ze_result_t result = zeKernelCreate(hModule, &kernelDesc, &hKernel);

// Check if there are unresolved imports
if (result == ze_RESULT_ERROR_INVALID_MODULE_UNLINKED) {
   // Call zeModuleDynamicLink to dump out a link log
   ...
}

// Check to see if the kernel "image_scaling" was found in the supplied module
if (result == ze_RESULT_ERROR_INVALID_KERNEL_NAME) {
   // Kernel "image_scaling" not found in module!
   ...
}

...

Kernel Properties

Use zeKernelGetProperties to query invariant properties from a Kernel object.

...
ze_kernel_properties_t kernelProperties;
zeKernelGetProperties(hKernel, &kernelProperties);
...

See ze_kernel_properties_t for more information for kernel properties.

Execution

Kernel Group Size

The group size for a kernel can be set using zeKernelSetGroupSize. If a group size is not set prior to appending a kernel into a command list then a default will be chosen. The group size can be updated over a series of append operations. The driver will copy the group size information when appending the kernel into the command list.

zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, 1);

...

The API supports a query for suggested group size when providing the global size. This function ignores the group size that was set on the kernel using zeKernelSetGroupSize.

// Find suggested group size for processing image.
uint32_t groupSizeX;
uint32_t groupSizeY;
zeKernelSuggestGroupSize(hKernel, imageWidth, imageHeight, 1, &groupSizeX, &groupSizeY, nullptr);

zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, 1);

...

Kernel Arguments

Kernel arguments represent only the explicit kernel arguments that are within brackets e.g. func(arg1, arg2, …).

The following pseudo-code demonstrates a sequence for setting kernel arguments and launching the kernel:

// Bind arguments
zeKernelSetArgumentValue(hKernel, 0, sizeof(ze_image_handle_t), &src_image);
zeKernelSetArgumentValue(hKernel, 1, sizeof(ze_image_handle_t), &dest_image);
zeKernelSetArgumentValue(hKernel, 2, sizeof(uint32_t), &width);
zeKernelSetArgumentValue(hKernel, 3, sizeof(uint32_t), &height);

ze_group_count_t launchArgs = { numGroupsX, numGroupsY, 1 };

// Append launch kernel
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

// Update image pointers to copy and scale next image.
zeKernelSetArgumentValue(hKernel, 0, sizeof(ze_image_handle_t), &src2_image);
zeKernelSetArgumentValue(hKernel, 1, sizeof(ze_image_handle_t), &dest2_image);

// Append launch kernel
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

...

Kernel Launch

In order to launch a kernel on the device an application must call one of the AppendLaunchKernel-style functions for a command list. The most basic version of these is zeCommandListAppendLaunchKernel which takes a command list, kernel handle, launch arguments, and an optional synchronization event used to signal completion. The launch arguments contain thread group dimensions.

// compute number of groups to launch based on image size and group size.
uint32_t numGroupsX = imageWidth / groupSizeX;
uint32_t numGroupsY = imageHeight / groupSizeY;

ze_group_count_t launchArgs = { numGroupsX, numGroupsY, 1 };

// Append launch kernel
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

The function zeCommandListAppendLaunchKernelIndirect allows the launch parameters to be supplied indirectly in a buffer that the device reads instead of the command itself. This allows for the previous operations on the device to generate the parameters.

ze_group_count_t* pIndirectArgs;

...
zeMemAllocDevice(hContext, &desc, sizeof(ze_group_count_t), sizeof(uint32_t), hDevice, &pIndirectArgs);

// Append launch kernel - indirect
zeCommandListAppendLaunchKernelIndirect(hCommandList, hKernel, &pIndirectArgs, nullptr, 0, nullptr);

Cooperative Kernels

Cooperative kernels allow sharing of data and synchronization across all launched groups in a safe manner. To support this there is a zeCommandListAppendLaunchCooperativeKernel that allows launching groups that can cooperate with each other. The command list must be submitted to a command queue that was created with an ordinal of a command queue group that has the ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COOPERATIVE_KERNELS flags set. The maximum number of groups for a cooperative kernel launch may be determined by calling zeKernelSuggestMaxCooperativeGroupCount.

// query the maximum cooperative kernel launch for the kernel
uint32_t maxGroupCount;
zeKernelSuggestMaxCooperativeGroupCount(hKernel, &maxGroupCount);

// the total group count must be smaller than the queried maximum
assert(numGroupsX * numGroupsY * numGroupsZ < maxGroupCount);

ze_group_count_t launchArgs = { numGroupsX, numGroupsY, numGroupsZ };

// Append launch cooperative kernel
zeCommandListAppendLaunchCooperativeKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

Sampler

The API supports Sampler objects that represent state needed for sampling images from within kernels. The zeSamplerCreate function takes a sampler descriptor (ze_sampler_desc_t):

Sampler Field

Description

Address Mode

Determines how out-of-bounds accesses are handled. See ze_sampler_address_mode_t.

Filter Mode

Specifies which filtering mode to use. See ze_sampler_filter_mode_t.

Normalized

Specifies whether coordinates for addressing image are normalized [0,1] or not.

The following pseudo-code demonstrates the creation of a sampler object and passing it as a kernel argument:

// Setup sampler for linear filtering and clamp out of bounds accesses to edge.
ze_sampler_desc_t desc = {
    ZE_STRUCTURE_TYPE_SAMPLER_DESC,
    nullptr,
    ZE_SAMPLER_ADDRESS_MODE_CLAMP,
    ZE_SAMPLER_FILTER_MODE_LINEAR,
    false
    };
ze_sampler_handle_t sampler;
zeSamplerCreate(hContext, hDevice, &desc, &sampler);
...

// The sampler can be passed as a kernel argument.
zeKernelSetArgumentValue(hKernel, 0, sizeof(ze_sampler_handle_t), &sampler);

// Append launch kernel
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

Advanced

Environment Variables

The following table documents the supported knobs for overriding default functional behavior.

Category

Name

Values

Description

Device

ZE_AFFINITY_MASK

list

Forces driver to only report devices (and sub-devices) as specified by values

ZE_ENABLE_PCI_ID_DEVICE_ORDER

{0, 1}

Forces driver to report devices from lowest to highest PCI bus ID

Memory

ZE_SHARED_FORCE_DEVICE_ALLOC

{0, 1}

Forces all shared allocations into device memory

Affinity Mask

The affinity mask allows an application or tool to restrict which devices, and sub-devices, are visible to 3rd-party libraries or applications in another process, respectively. The affinity mask affects the number of handles returned from zeDeviceGet and zeDeviceGetSubDevices. The affinity mask is specified via an environment variable as a comma-seperated list of device and/or subdevice ordinals. The values are specific to system configuration; e.g., the number of devices and the number of sub-devices for each device. The values are specific to the order in which devices are reported by the driver; i.e., the first device maps to ordinal 0, the second device to ordinal 1, and so forth. If the affinity mask is not set, then all devices and sub-devices are reported; as is the default behavior.

The order of the devices reported by the zeDeviceGet is implementation-specific and not affected by the order of devices in the affinity mask. The order of the devices reported by the zeDeviceGet can be forced to be consistent by setting the ZE_ENABLE_PCI_ID_DEVICE_ORDER environment variable.

The following examples demonstrate proper usage for a system configuration of two devices, each with four sub-devices:

  • 0, 1: all devices and sub-devices are reported (same as default)

  • 0: only device 0 is reported;with all its sub-devices

  • 1: only device 1 is reported as device 0; with all its sub-devices

  • 0.0: only device 0, sub-device 0 is reported as device 0

  • 1.1, 1.2: only device 1 is reported as device 0; with its sub-devices 1 and 2 reported as sub-devices 0 and 1, respectively

  • 0.2, 1.3, 1.0, 0.3: both device 0 and 1 are reported; device 0 reports sub-devices 2 and 3 as sub-devices 0 and 1, respectively; device 1 reports sub-devices 0 and 3 as sub-devices 0 and 1, respectively; the order is unchanged.

Sub-Device Support

The API allows support for sub-devices which can enable finer-grained control of scheduling and memory allocation to a sub-partition of the device. There are functions to query and obtain sub-devices, but outside of these functions there are no distinctions between sub-devices and devices. Sub-devices are not required to represent unique partitions of a device; i.e. multiple sub-devices may share the same physical hardware. Furthermore, a sub-device can be partitioned into more sub-devices; e.g. down to a single slice.

Use zeDeviceGetSubDevices to confirm sub-devices are supported and to obtain a sub-device handle. There are additional device properties in ze_device_properties_t for sub-devices. These can be used to confirm a device is a sub-device and to query the sub-device id. This may be used by libraries to determine if an input device handle represents a device or sub-device.

A driver is required to make device memory allocations on the parent device visible to its sub-devices. However, when using a sub-device handle, the driver will attempt to place any device memory allocations in the local memory that is attached to the sub-device. These allocations are only visible to the sub-device, its sub-devices, and so forth. If the API call returns ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY, then the application may attempt to retry using the parent device.

When creating command queues for a sub-device, the application must determine the ordinal from calling zeDeviceGetCommandQueueGroupProperties using the sub-device handle. See ze_command_queue_desc_t for more details.

A 16-byte unique device identifier (uuid) can be obtained for a device or sub-device using zeDeviceGetProperties.

// Query for all sub-devices of the device
uint32_t subdeviceCount = 0;
zeDeviceGetSubDevices(hDevice, &subdeviceCount, nullptr);

ze_device_handle_t* allSubDevices = allocate(subdeviceCount * sizeof(ze_device_handle_t));
zeDeviceGetSubDevices(hDevice, &subdeviceCount, &allSubDevices);

// Desire is to allocate and dispatch work to sub-device 2.
assert(subdeviceCount >= 3);
ze_device_handle_t hSubdevice = allSubDevices[2];

// Query sub-device properties.
ze_device_properties_t subdeviceProps;
zeDeviceGetProperties(hSubdevice, &subdeviceProps);

assert(subdeviceProps.flags & ZE_DEVICE_PROPERTY_FLAG_SUBDEVICE); // Ensure that we have a handle to a sub-device.
assert(subdeviceProps.subdeviceId == 2);    // Ensure that we have a handle to the sub-device we asked for.

void* pMemForSubDevice2;
zeMemAllocDevice(hContext, &desc, memSize, sizeof(uint32_t), hSubdevice, &pMemForSubDevice2);
...

Device Residency

For devices that do not support page-faults, the driver must ensure that all pages that will be accessed by the kernel are resident before program execution. This can be determined by checking ze_device_properties_t.flags for ZE_DEVICE_PROPERTY_FLAG_ONDEMANDPAGING.

In most cases, the driver implicitly handles residency of allocations for device access. This can be done by inspecting API parameters, including kernel arguments. However, in cases where the devices does not support page-faulting and the driver is incapable of determining whether an allocation will be accessed by the device, such as multiple levels of indirection, there are two methods available:

  1. The application may set the ZE_KERNEL_FLAG_FORCE_RESIDENCY flag during program creation to force all device allocations to be resident during execution.

  2. Explcit zeContextMakeMemoryResident APIs are included for the application to dynamically change residency as needed. (Windows-only)

If the application does not properly manage residency for these cases then the device may experience unrecoverable page-faults.

The following pseudo-code demonstrates a sequence for using coarse-grain residency control for indirect arguments:

struct node {
    node* next;
};
node* begin = nullptr;
zeMemAllocHost(hContext, &desc, sizeof(node), 1, &begin);
zeMemAllocHost(hContext, &desc, sizeof(node), 1, &begin->next);
zeMemAllocHost(hContext, &desc, sizeof(node), 1, &begin->next->next);

// 'begin' is passed as kernel argument and appended into command list
zeKernelSetIndirectAccess(hKernel, ZE_KERNEL_INDIRECT_ACCESS_FLAG_HOST);
zeKernelSetArgumentValue(hKernel, 0, sizeof(node*), &begin);
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);

...

zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, nullptr);
...

The following pseudo-code demonstrates a sequence for using fine-grain residency control for indirect arguments:

struct node {
    node* next;
};
node* begin = nullptr;
zeMemAllocHost(hContext, &desc, sizeof(node), 1, &begin);
zeMemAllocHost(hContext, &desc, sizeof(node), 1, &begin->next);
zeMemAllocHost(hContext, &desc, sizeof(node), 1, &begin->next->next);

// 'begin' is passed as kernel argument and appended into command list
zeKernelSetArgumentValue(hKernel, 0, sizeof(node*), &begin);
zeCommandListAppendLaunchKernel(hCommandList, hKernel, &launchArgs, nullptr, 0, nullptr);
...

// Make indirect allocations resident before enqueuing
zeContextMakeMemoryResident(hContext, hDevice, begin->next, sizeof(node));
zeContextMakeMemoryResident(hContext, hDevice, begin->next->next, sizeof(node));

zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, hFence);

// wait until complete
zeFenceHostSynchronize(hFence, UINT32_MAX);

// Finally, evict to free device resources
zeContextEvictMemory(hContext, hDevice, begin->next, sizeof(node));
zeContextEvictMemory(hContext, hDevice, begin->next->next, sizeof(node));
...

Interoperability with Other APIs

Level-Zero includes general-purpose interoperability mechanisms for memory allocations (both images and device memory) and modules.

Memory allocations may be shared between Level-Zero and other APIs via External Memory Import and Export. Level-Zero supports exporting memory allocations for use in other APIs and importing memory allocations from other APIs.

Modules may be shared between Level-Zero and other APIs via native format binaries, see zeModuleGetNativeBinary and ZE_MODULE_FORMAT_NATIVE.

The following pseudo-code demonstrates interoperability with OpenCL from a OpenCL cl_program to a Level-Zero Kernel:

void* clDeviceBinary;
size_t clDeviceBinarySize;
clGetProgramInfo(cl_program, CL_PROGRAM_BINARIES, clDeviceBinary, &clDeviceBinarySize);

ze_module_desc_t desc = {
    ZE_MODULE_FORMAT_NATIVE,
    clDeviceBinarySize,
    clDeviceBinary
};
zeModuleCreate(hContext, hDevice, &desc, &hModule, nullptr);
zeKernelCreate(hModule, nullptr, hKernel); // same Kernel as OpenCL in Level-Zero

Inter-Process Communication

The API allows sharing of memory objects across different device processes. Since each process has its own virtual address space, there is no guarantee that the same virtual address will be available when the memory object is shared in new process. There are a set of APIs that makes it easier to share the memory objects with ease.

There are two types of Inter-Process Communication (IPC) APIs for using Level-Zero allocations across processes:

  1. Memory

  2. Events

Memory

The following code examples demonstrate how to use the memory IPC APIs:

  1. First, the allocation is made, packaged, and sent on the sending process:

void* dptr = nullptr;
zeMemAllocDevice(hContext, &desc, size, alignment, hDevice, &dptr);

ze_ipc_mem_handle_t hIPC;
zeMemGetIpcHandle(hContext, dptr, &hIPC);

// Method of sending to receiving process is not defined by Level-Zero:
send_to_receiving_process(hIPC);
  1. Next, the allocation is received and un-packaged on the receiving process:

// Method of receiving from sending process is not defined by Level-Zero:
ze_ipc_mem_handle_t hIPC;
hIPC = receive_from_sending_process();

void* dptr = nullptr;
zeMemOpenIpcHandle(hContext, hDevice, hIPC, 0, &dptr);
  1. Each process may now refer to the same device memory allocation via its dptr. Note, there is no guaranteed address equivalence for the values of dptr in each process.

  2. To cleanup, first close the handle in the receiving process:

zeMemCloseIpcHandle(hContext, dptr);
  1. Finally, free the device pointer in the sending process:

zeMemFree(hContext, dptr);

Events

The following code examples demonstrate how to use the event IPC APIs:

  1. First, the event pool is created, packaged, and sent on the sending process:

// create event pool
ze_event_pool_desc_t eventPoolDesc = {
    ZE_STRUCTURE_TYPE_EVENT_POOL_DESC,
    nullptr,
    ZE_EVENT_POOL_FLAG_IPC | ZE_EVENT_POOL_FLAG_HOST_VISIBLE,
    10 // count
};
ze_event_pool_handle_t hEventPool;
zeEventPoolCreate(hContext, &eventPoolDesc, 1, &hDevice, &hEventPool);

// get IPC handle and send to another process
ze_ipc_event_pool_handle_t hIpcEvent;
zeEventPoolGetIpcHandle(hEventPool, &hIpcEventPool);
send_to_receiving_process(hIpcEventPool);
  1. Next, the event pool is received and un-packaged on the receiving process:

// get IPC handle from other process
ze_ipc_event_pool_handle_t hIpcEventPool;
receive_from_sending_process(&hIpcEventPool);

// open event pool
ze_event_pool_handle_t hEventPool;
zeEventPoolOpenIpcHandle(hContext, hIpcEventPool, &hEventPool);
  1. Each process may now refer to the same device event allocation via its handle:

    1. Receiving process creates event at location

    ze_event_handle_t hEvent;
    ze_event_desc_t eventDesc = {
        ZE_STRUCTURE_TYPE_EVENT_DESC,
        nullptr,
        5, // index
        0, // no additional memory/cache coherency required on signal
        ZE_EVENT_SCOPE_FLAG_HOST, // ensure memory coherency across device and Host after event signaled
    };
    zeEventCreate(hEventPool, &eventDesc, &hEvent);

    // submit kernel and signal event when complete
    zeCommandListAppendLaunchKernel(hCommandList, hKernel, &args, hEvent, 0, nullptr);
    zeCommandListClose(hCommandList);
    zeCommandQueueExecuteCommandLists(hCommandQueue, 1, &hCommandList, nullptr);

b. Sending process creates event at same location
    ze_event_handle_t hEvent;
    ze_event_desc_t eventDesc = {
        ZE_STRUCTURE_TYPE_EVENT_DESC,
        nullptr,
        5,
        0, // no additional memory/cache coherency required on signal
        ZE_EVENT_SCOPE_FLAG_HOST, // ensure memory coherency across device and Host after event signaled
    };
    zeEventCreate(hEventPool, &eventDesc, &hEvent);

    zeEventHostSynchronize(hEvent, UINT32_MAX);

Note, there is no guaranteed address equivalence for the values of hEvent in each process.
  1. To cleanup, first close the pool handle in the receiving process:

zeEventDestroy(hEvent);
zeEventPoolCloseIpcHandle(&hEventPool);
  1. Finally, free the event pool handle in the sending process:

zeEventDestroy(hEvent);
zeEventPoolDestroy(hEventPool);

Peer-to-Peer Access and Queries

Peer to Peer API’s provide capabilities to marshall data across Host to Device, Device to Host and Device to Device. The data marshalling API can be scheduled as asynchronous operations or can be synchronized with kernel execution through command queues. Data coherency is maintained by the driver without any explicit involvement from the application.

Devices may be linked together within a node by a scale-up fabric and depending on the configuration, the fabric can support atomics, compute kernel remote access, and data copies.

The following Peer-to-Peer functionalities are provided through the API: