Execution Model
This section describes the execution environment common to all oneIPL functionality. The execution environment includes how data is provided to computational routines in Use of Queues, support for multiple devices in Device Usage, the synchronous and asynchronous execution models in Asynchronous Execution, and thread safety guarantees in Host Thread Safety.
Use of Queues
The sycl::queue, defined in the SYCL 2020 specification, is used to specify the device, and the features enabled on that device, on which a task will be enqueued.
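For example, a queue bound to a particular device might be constructed as follows (a minimal sketch; device selection is application-specific):

#include <sycl/sycl.hpp>

// Create a queue targeting a GPU, falling back to the default
// device if no GPU is present.
sycl::queue make_queue() {
    try {
        return sycl::queue{sycl::gpu_selector_v};
    } catch (const sycl::exception&) {
        return sycl::queue{sycl::default_selector_v};
    }
}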
oneIPL shall have two forms of computational routines: class-based Member Functions and stand-alone Non-Member Functions. As these forms may interact with the sycl::queue in different ways, a section for each one is provided to describe the assumptions.
Non-Member Functions
Each oneIPL non-member computational routine takes a sycl::queue reference as its first parameter and a std::vector<sycl::event> as its last parameter, which shall have an empty default value.
Typically, the following template signature must be implemented, with these parameters:

ComputeT - the type in which computations are performed
SrcImageT or SrcBatchT - the source image/batch type
DstImageT or DstBatchT - the destination image/batch type
An API might have multiple input/output type template arguments and function arguments.
In addition to the input and output images, the algorithmic parameters must be defined in a separate class with the _spec postfix. If a <function> requires additional arguments, they shall be provided in the <function>_spec class and passed as the corresponding argument to the function. The spec shall be parameterized by the required template arguments.
Example: Gaussian filter API:

template <typename ComputeT = float, typename SrcImageT, typename DstImageT>
sycl::event gaussian(sycl::queue& queue,
                     SrcImageT& src,
                     DstImageT& dst,
                     const gaussian_spec_t<SrcImageT::format_v, typename SrcImageT::data_t, ComputeT>& spec,
                     const std::vector<sycl::event>& dependencies = {});
The ComputeT template argument shall define the datatype used for internal computations and shall have a default value of float.
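For illustration, a call to this API might look as follows. This is only a sketch: the ipl::image construction and the gaussian_spec_t constructor parameters are hypothetical, as oneIPL defines them elsewhere.

sycl::queue queue{sycl::gpu_selector_v};

// Hypothetical image construction; the actual ipl::image
// constructors are defined by the oneIPL image API.
using image_t = ipl::image</* format, data type, ... */>;
image_t src{queue /* , size, data ... */};
image_t dst{queue /* , size ... */};

// Hypothetical spec construction (e.g. kernel size and sigma).
gaussian_spec_t<image_t::format_v, typename image_t::data_t, float>
    spec{/* kernel size, sigma ... */};

// ComputeT defaults to float; the call returns an event that
// tracks completion of the device work.
sycl::event e = gaussian(queue, src, dst, spec);
e.wait();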
If an implementation for a specific type is available, it shall be provided as a specialization. The generic algorithm shall work as in the following pseudocode template:
template <typename ComputeT = float,
          typename SrcImageT,
          typename DstImageT>
sycl::event algorithm(...)
{
    // ... some implementation code; device kernel part:
    // 1. Cast data from the SrcImageT::data_t type to ComputeT
    // 2. Perform computations using the ComputeT datatype
    // 3. Cast data from ComputeT to the DstImageT::data_t type
}
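The specialization pattern might look roughly like this (a sketch with placeholder types; half_image_t is hypothetical, not a oneIPL name):

// Hypothetical full specialization: a tuned path selected when
// ComputeT is sycl::half and both images already hold half data,
// so the intermediate casts of the generic template are skipped.
template <>
sycl::event algorithm<sycl::half, half_image_t, half_image_t>(
    sycl::queue& queue, half_image_t& src, half_image_t& dst,
    const std::vector<sycl::event>& dependencies)
{
    return queue.submit([&](sycl::handler& h) {
        h.depends_on(dependencies);
        h.single_task([=]() { /* tuned half-precision kernel */ });
    });
}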
If ComputeT is not available on the device, the implementation shall handle it as an error.
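For instance, an implementation might verify device support for a compute type before launching a kernel, along these lines (a sketch; the exception type thrown is a placeholder, as the actual types are defined in Error Handling):

// Reject a double-precision ComputeT on devices without fp64 support.
if constexpr (std::is_same_v<ComputeT, double>) {
    if (!queue.get_device().has(sycl::aspect::fp64)) {
        throw unsupported_device_error{};  // placeholder exception type
    }
}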
All computation performed by the routine shall be done on the hardware device(s) associated with this queue, with possible aid from the host, unless otherwise specified. In the case of an ordered queue, all computation shall also be ordered with respect to other kernels as if enqueued on that queue.
A particular oneIPL implementation may not support the execution of a given oneIPL routine on the specified device(s). In this case, the implementation shall either perform the computation on the host or throw an exception. See Error Handling for the possible exceptions.
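From the calling application's side, this means a routine call may need to be guarded, for example (a sketch reusing the hypothetical gaussian call from above):

try {
    sycl::event e = gaussian(queue, src, dst, spec);
    e.wait_and_throw();
} catch (const sycl::exception& ex) {
    // The routine could not execute on the device and no host
    // fallback was performed; handle or report the failure.
    std::cerr << "oneIPL call failed: " << ex.what() << '\n';
}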
Member Functions
oneIPL class-based APIs, used for ipl::image, require a sycl::queue as an argument to the constructor. The implementation shall submit explicit copy operations and check the USM pointer type if one is an argument of the constructor. The check can be disabled for performance reasons; the recommended implementation is to control the checks through build options.
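Such a check can be expressed with the standard SYCL USM query, roughly as follows (a sketch; the macro guarding the check is a hypothetical build option, and the exception type is a placeholder):

#ifndef IPL_DISABLE_USM_CHECKS  // hypothetical build option
// Verify that the user-provided pointer is USM memory known to
// this queue's context before accepting it.
sycl::usm::alloc kind =
    sycl::get_pointer_type(user_ptr, queue.get_context());
if (kind == sycl::usm::alloc::unknown) {
    throw invalid_argument_error{};  // placeholder exception type
}
#endif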
Device Usage
oneIPL itself does not provide any interfaces for controlling device usage: for instance, controlling the number of cores used on the CPU, or the number of execution units on a GPU. However, such functionality may be available by partitioning a sycl::device instance into subdevices, when supported by the device. When given a queue associated with such a subdevice, a oneIPL implementation shall only perform computation on that subdevice.
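For example, a CPU device that supports partitioning might be split into subdevices as follows (a minimal sketch; partitioning support varies by device):

sycl::device cpu{sycl::cpu_selector_v};

// Partition the CPU into subdevices of 4 compute units each,
// if the device supports this partitioning scheme.
auto subdevices = cpu.create_sub_devices<
    sycl::info::partition_property::partition_equally>(4);

// A queue on a subdevice restricts oneIPL computation to it.
sycl::queue sub_queue{subdevices[0]};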
Asynchronous Execution
The oneIPL API shall be designed to allow asynchronous execution of computational routines, to facilitate concurrent usage of multiple devices in the system. Each computational routine enqueues work to be performed on the selected device, and may (but is not required to) return before execution completes.
Hence, it is the calling application’s responsibility to ensure that any inputs are valid until computation is complete, and likewise to wait for computation completion before reading any outputs. This behavior can be achieved automatically when using DPC++ buffers, or manually when using Unified Shared Memory (USM) pointers, as described in the sections below.
Unless otherwise specified, asynchronous execution is allowed, but not guaranteed, by any oneIPL computational routine, and may vary between implementations and/or versions. oneIPL implementations must clearly document whether execution is guaranteed to be asynchronous for each supported routine. Regardless, calling applications shall not launch any oneIPL computational routine with a dependency on a future oneIPL API call, even if this computational routine executes asynchronously (i.e. a oneIPL implementation may assume no antidependencies are present). This guarantee allows oneIPL implementations to reserve resources for execution without risking deadlock.
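In practice, the returned event is what the application uses to observe completion, for example (a sketch reusing the hypothetical gaussian call from above):

// The call may return before the device work completes.
sycl::event e = gaussian(queue, src, dst, spec);

// ... unrelated host work may proceed here ...

// Inputs must stay valid and outputs must not be read until
// the computation is known to be complete.
e.wait();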
Synchronization
ipl::image objects automatically manage synchronization between kernel launches linked by a data dependency (read-after-write, write-after-write, or write-after-read). oneIPL routines are not required to perform any additional synchronization of ipl::image arguments. However, all routines return a sycl::event and accept a std::vector<sycl::event>, so explicit synchronization between user-provided kernels and library calls can be performed.
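Explicit chaining between a user kernel and a library call might then look like this (a sketch; the user kernel bodies are placeholders):

// A user kernel producing data for the library call.
sycl::event produce = queue.submit([&](sycl::handler& h) {
    h.single_task([=]() { /* ... write src data ... */ });
});

// Make the library call wait for the user kernel...
sycl::event filtered = gaussian(queue, src, dst, spec, {produce});

// ...and make a follow-up user kernel wait for the library call.
queue.submit([&](sycl::handler& h) {
    h.depends_on(filtered);
    h.single_task([=]() { /* ... read dst data ... */ });
});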
Host Thread Safety
All oneIPL member and non-member functions shall be host thread safe. That is, they may be safely called simultaneously from concurrent host threads. However, oneIPL objects may not be shared between concurrent host threads unless otherwise specified.