.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
..
.. SPDX-License-Identifier: CC-BY-4.0

.. highlight:: cpp
.. default-domain:: cpp

.. _alg_kmeans:

=======
K-Means
=======

The K-Means algorithm solves the :capterm:`clustering` problem by partitioning
:math:`n` feature vectors into :math:`k` clusters minimizing some criterion.
Each cluster is characterized by a representative point, called *a centroid*.

.. |t_math| replace:: `Training <kmeans_t_math_>`_
.. |t_lloyd| replace:: `Lloyd's <kmeans_t_math_lloyd_>`_
.. |t_input| replace:: `train_input <kmeans_t_api_input_>`_
.. |t_result| replace:: `train_result <kmeans_t_api_result_>`_
.. |t_op| replace:: `train(...) <kmeans_t_api_>`_
.. |i_math| replace:: `Inference <kmeans_i_math_>`_
.. |i_lloyd| replace:: `Lloyd's <kmeans_i_math_lloyd_>`_
.. |i_input| replace:: `infer_input <kmeans_i_api_input_>`_
.. |i_result| replace:: `infer_result <kmeans_i_api_result_>`_
.. |i_op| replace:: `infer(...) <kmeans_i_api_>`_

=============== =========================== ======== =========== ============
 **Operation**   **Computational methods**     **Programming Interface**
--------------- --------------------------- ---------------------------------
   |t_math|             |t_lloyd|             |t_op|   |t_input|  |t_result|
   |i_math|             |i_lloyd|             |i_op|   |i_input|  |i_result|
=============== =========================== ======== =========== ============

------------------------
Mathematical formulation
------------------------

.. _kmeans_t_math:

Training
--------
Given the training set :math:`X = \{ x_1, \ldots, x_n \}` of
:math:`p`-dimensional feature vectors and a positive integer :math:`k`, the
problem is to find a set :math:`C = \{ c_1, \ldots, c_k \}` of
:math:`p`-dimensional centroids that minimize the objective function

.. math::
   \Phi_{X}(C) = \sum_{i = 1}^n d^2(x_i, C),

where :math:`d^2(x_i, C)` is the squared Euclidean distance from :math:`x_i`
to the closest centroid in :math:`C`,

.. math::
   d^2(x_i, C) = \min_{1 \leq j \leq k} \| x_i - c_j \|^2, \quad 1 \leq i \leq n.

The expression :math:`\|\cdot\|` denotes the :math:`L_2` norm.

.. note::
   In the general case, :math:`d` may be an arbitrary distance function. The
   current version of the oneDAL spec defines only the Euclidean distance case.
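To make the objective concrete, the sketch below computes :math:`d^2(x_i, C)` and
:math:`\Phi_{X}(C)` for a toy data set. This is a plain, self-contained C++
illustration of the formulas above, not part of the oneDAL programming interface;
all names in it (``squared_distance``, ``objective``, and so on) are ours.

.. code-block:: cpp

   #include <algorithm>
   #include <cassert>
   #include <cmath>
   #include <cstddef>
   #include <limits>
   #include <vector>

   // A p-dimensional feature vector.
   using vec = std::vector<double>;

   // Squared Euclidean distance ||a - b||^2 between two p-dimensional vectors.
   double squared_distance(const vec& a, const vec& b) {
       double sum = 0.0;
       for (std::size_t d = 0; d < a.size(); ++d) {
           const double diff = a[d] - b[d];
           sum += diff * diff;
       }
       return sum;
   }

   // d^2(x, C): squared distance from x to the closest centroid in C.
   double squared_distance_to_closest(const vec& x,
                                      const std::vector<vec>& centroids) {
       double best = std::numeric_limits<double>::infinity();
       for (const vec& c : centroids)
           best = std::min(best, squared_distance(x, c));
       return best;
   }

   // Phi_X(C): the K-Means objective, summed over all feature vectors.
   double objective(const std::vector<vec>& data,
                    const std::vector<vec>& centroids) {
       double phi = 0.0;
       for (const vec& x : data)
           phi += squared_distance_to_closest(x, centroids);
       return phi;
   }

   int main() {
       // Four 2-dimensional points and two centroids.
       const std::vector<vec> data = { {0, 0}, {0, 1}, {10, 0}, {10, 1} };
       const std::vector<vec> centroids = { {0, 0.5}, {10, 0.5} };
       // Each point lies 0.5 away from its nearest centroid,
       // so Phi = 4 * 0.25 = 1.
       assert(std::abs(objective(data, centroids) - 1.0) < 1e-12);
       return 0;
   }

A good clustering is one whose centroids make this sum small; the training
stage searches for such a set :math:`C`.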
.. _kmeans_t_math_lloyd:

Training method: *Lloyd's*
~~~~~~~~~~~~~~~~~~~~~~~~~~
Lloyd's method [Lloyd82]_ iteratively updates the centroids by applying the
alternating *Assignment* and *Update* steps, where :math:`t` denotes the index
of the current iteration, e.g.,
:math:`C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}` is the set of centroids
at the :math:`t`-th iteration. The method requires the initial centroids
:math:`C^{(1)}` to be specified at the beginning of the algorithm
(:math:`t = 1`).

**(1) Assignment step:** Assign each feature vector :math:`x_i` to the nearest
centroid. :math:`y_i^{(t)}` denotes the label (cluster index) assigned to the
feature vector :math:`x_i`.

.. math::
   y_i^{(t)} = \mathrm{arg}\min_{1 \leq j \leq k} \| x_i - c_j^{(t)} \|^2,
   \quad 1 \leq i \leq n.

Each feature vector from the training set :math:`X` is assigned to exactly one
centroid, so that :math:`X` is partitioned into :math:`k` disjoint sets
(clusters)

.. math::
   S_j^{(t)} = \big\{ \; x_i \in X : \; y_i^{(t)} = j \; \big\},
   \quad 1 \leq j \leq k.

**(2) Update step:** Recalculate the centroids by averaging the feature
vectors assigned to each cluster.

.. math::
   c_j^{(t + 1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x,
   \quad 1 \leq j \leq k.

Steps (1) and (2) are repeated until the following **stop condition**,

.. math::
   \sum_{j=1}^k \big\| c_j^{(t)} - c_j^{(t+1)} \big\|^2 < \varepsilon,

is satisfied or the number of iterations exceeds the maximal value :math:`T`
defined by the user.

.. _kmeans_i_math:

Inference
---------
Given the inference set :math:`X' = \{ x_1', \ldots, x_m' \}` of
:math:`p`-dimensional feature vectors and the set
:math:`C = \{ c_1, \ldots, c_k \}` of centroids produced at the training
stage, the problem is to predict the index
:math:`y_j' \in \{ 0, \ldots, k-1 \}`, :math:`1 \leq j \leq m`, of the
centroid in accordance with a method-defined rule.
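For illustration, the Lloyd training procedure described in the training section
above (assignment step, update step, and the stop condition) can be put together
as a plain, self-contained C++ sketch. This is didactic code only, not the oneDAL
programming interface; it omits concerns a real implementation must handle, and
the policy of keeping an empty cluster's previous centroid is one possible choice,
not something the spec mandates.

.. code-block:: cpp

   #include <algorithm>
   #include <cassert>
   #include <cmath>
   #include <cstddef>
   #include <limits>
   #include <utility>
   #include <vector>

   using vec = std::vector<double>;

   // Squared Euclidean distance ||a - b||^2.
   double squared_distance(const vec& a, const vec& b) {
       double sum = 0.0;
       for (std::size_t d = 0; d < a.size(); ++d) {
           const double diff = a[d] - b[d];
           sum += diff * diff;
       }
       return sum;
   }

   // Lloyd's iterations: alternate the assignment and update steps, stopping
   // when the total squared centroid movement falls below epsilon or after
   // max_iterations (the user-defined T). Returns the final centroids.
   std::vector<vec> lloyd(const std::vector<vec>& data,
                          std::vector<vec> centroids, // C^(1), initial centroids
                          std::size_t max_iterations,
                          double epsilon) {
       const std::size_t k = centroids.size();
       const std::size_t p = data.front().size();
       for (std::size_t t = 0; t < max_iterations; ++t) {
           // (1) Assignment step: label each vector with its nearest centroid.
           std::vector<std::size_t> labels(data.size(), 0);
           for (std::size_t i = 0; i < data.size(); ++i) {
               double best = std::numeric_limits<double>::infinity();
               for (std::size_t j = 0; j < k; ++j) {
                   const double d2 = squared_distance(data[i], centroids[j]);
                   if (d2 < best) { best = d2; labels[i] = j; }
               }
           }
           // (2) Update step: recompute each centroid as the mean of its cluster.
           std::vector<vec> updated(k, vec(p, 0.0));
           std::vector<std::size_t> counts(k, 0);
           for (std::size_t i = 0; i < data.size(); ++i) {
               for (std::size_t d = 0; d < p; ++d)
                   updated[labels[i]][d] += data[i][d];
               ++counts[labels[i]];
           }
           for (std::size_t j = 0; j < k; ++j) {
               if (counts[j] == 0) { updated[j] = centroids[j]; continue; }
               for (std::size_t d = 0; d < p; ++d)
                   updated[j][d] /= static_cast<double>(counts[j]);
           }
           // Stop condition: total squared movement of the centroids.
           double movement = 0.0;
           for (std::size_t j = 0; j < k; ++j)
               movement += squared_distance(centroids[j], updated[j]);
           centroids = std::move(updated);
           if (movement < epsilon) break;
       }
       return centroids;
   }

   int main() {
       const std::vector<vec> data = { {0, 0}, {0, 1}, {10, 0}, {10, 1} };
       // Deliberately poor initial centroids; the iterations recover the two
       // obvious clusters, with centroids (0, 0.5) and (10, 0.5).
       const auto c = lloyd(data, { {0, 0}, {1, 0} }, 100, 1e-9);
       assert(std::abs(c[0][1] - 0.5) < 1e-9);
       assert(std::abs(c[1][0] - 10.0) < 1e-9);
       return 0;
   }

Note that the result depends on the initial centroids :math:`C^{(1)}`; Lloyd's
method converges to a local minimum of the objective, not necessarily the
global one.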
.. _kmeans_i_math_lloyd:

Inference method: *Lloyd's*
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lloyd's inference method computes :math:`y_j'` as the index of the centroid
closest to the feature vector :math:`x_j'`,

.. math::
   y_j' = \mathrm{arg}\min_{1 \leq l \leq k} \| x_j' - c_l \|^2,
   \quad 1 \leq j \leq m.

-------------
Usage example
-------------

Training
--------
.. onedal_code:: oneapi::dal::kmeans::example::run_training

Inference
---------
.. onedal_code:: oneapi::dal::kmeans::example::run_inference

---------------------
Programming Interface
---------------------
All types and functions in this section shall be declared in the
``oneapi::dal::kmeans`` namespace and be available via inclusion of the
``oneapi/dal/algo/kmeans.hpp`` header file.

Descriptor
----------
.. onedal_class:: oneapi::dal::kmeans::descriptor

Method tags
~~~~~~~~~~~
.. onedal_tags_namespace:: oneapi::dal::kmeans::method

Task tags
~~~~~~~~~
.. onedal_tags_namespace:: oneapi::dal::kmeans::task

Model
-----
.. onedal_class:: oneapi::dal::kmeans::model

.. _kmeans_t_api:

Training :expr:`train(...)`
---------------------------

.. _kmeans_t_api_input:

Input
~~~~~
.. onedal_class:: oneapi::dal::kmeans::train_input

.. _kmeans_t_api_result:

Result
~~~~~~
.. onedal_class:: oneapi::dal::kmeans::train_result

Operation
~~~~~~~~~
.. onedal_func:: oneapi::dal::kmeans::train

.. _kmeans_i_api:

Inference :expr:`infer(...)`
----------------------------

.. _kmeans_i_api_input:

Input
~~~~~
.. onedal_class:: oneapi::dal::kmeans::infer_input

.. _kmeans_i_api_result:

Result
~~~~~~
.. onedal_class:: oneapi::dal::kmeans::infer_result

Operation
~~~~~~~~~
.. onedal_func:: oneapi::dal::kmeans::infer