# k-Nearest Neighbors Classification (k-NN)¶

$$k$$-NN classification algorithm infers the class for the new feature vector by computing majority vote of the $$k$$ nearest observations from the training set.

 Operation Computational methods Programming Interface Training Brute-force k-d tree train(…) train_input train_result Inference Brute-force k-d tree infer(…) infer_input infer_result

## Mathematical formulation¶

### Training¶

Let $$X = \{ x_1, \ldots, x_n \}$$ be the training set of $$p$$-dimensional feature vectors, let $$Y = \{ y_1, \ldots, y_n \}$$ be the set of class labels, where $$y_i \in \{ 0, \ldots, c-1 \}$$, $$1 \leq i \leq n$$. Given $$X$$, $$Y$$ and the number of nearest neighbors $$k$$, the problem is to build a model that allows distance computation between the feature vectors in training and inference sets at the inference stage.

#### Training method: brute-force¶

The training operation produces the model that stores all the feature vectors from the initial training set $$X$$.

#### Training method: k-d tree¶

The training operation builds a $$k$$-$$d$$ tree that partitions the training set $$X$$ (for more details, see k-d Tree).

### Inference¶

Let $$X' = \{ x_1', \ldots, x_m' \}$$ be the inference set of $$p$$-dimensional feature vectors. Given $$X'$$, the model produced at the training stage and the number of nearest neighbors $$k$$, the problem is to predict the label $$y_j'$$ for each $$x_j'$$, $$1 \leq j \leq m$$, by performing the following steps:

1. Identify the set $$N(x_j') \subseteq X$$ of the $$k$$ feature vectors in the training set that are nearest to $$x_j'$$ with respect to the Euclidean distance.

2. Estimate the conditional probability for the $$l$$-th class as the fraction of vectors in $$N(x_j')$$ whose labels $$y_j$$ are equal to $$l$$:

(1)$P_{jl} = \frac{1}{| N(x_j') |} \Big| \big\{ x_r \in N(x_j') : y_r = l \big\} \Big|, \quad 1 \leq j \leq m, \; 0 \leq l < c.$
3. Predict the class that has the highest probability for the feature vector $$x_j'$$:

(2)$y_j' = \mathrm{arg}\max_{0 \leq l < c} P_{jl}, \quad 1 \leq j \leq m.$

#### Inference method: brute-force¶

Brute-force inference method determines the set $$N(x_j')$$ of the nearest feature vectors by iterating over all the pairs $$(x_j', x_i)$$ in the implementation defined order, $$1 \leq i \leq n$$, $$1 \leq j \leq m$$. The final prediction is computed according to the equations (1) and (2).

#### Inference method: k-d tree¶

K-d tree inference method traverses the $$k$$-$$d$$ tree to find feature vectors associated with a leaf node that are closest to $$x_j'$$, $$1 \leq j \leq m$$. The set $$\tilde{n}(x_j')$$ of the currently-known nearest $$k$$-th neighbors is progressively updated during tree traversal. The search algorithm limits exploration of the nodes for which the distance between the $$x_j'$$ and respective part of the feature space is not less than the distance between $$x_j'$$ and the most distant feature vector from $$\tilde{n}(x_j')$$. Once tree traversal is finished, $$\tilde{n}(x_j') \equiv N(x_j')$$. The final prediction is computed according to the equations (1) and (2).

## Usage example¶

### Training¶

knn::model<> run_training(const table& data,
const table& labels) {
const std::int64_t class_count = 10;
const std::int64_t neighbor_count = 5;
const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};

const auto result = train(knn_desc, data, labels);

return result.get_model();
}


### Inference¶

table run_inference(const knn::model<>& model,
const table& new_data) {
const std::int64_t class_count = 10;
const std::int64_t neighbor_count = 5;
const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};

const auto result = infer(knn_desc, model, new_data);

print_table("labels", result.get_labels());
}


## Programming Interface¶

All types and functions in this section shall be declared in the oneapi::dal::knn namespace and be available via inclusion of the oneapi/dal/algo/knn.hpp header file.

### Descriptor¶

template <typename Float = float,
typename Method = method::by_default,
class descriptor {
public:
explicit descriptor(std::int64_t class_count,
std::int64_t neighbor_count);

std::int64_t get_class_count() const;
descriptor& set_class_count(std::int64_t);

std::int64_t get_neighbor_count() const;
descriptor& set_neighbor_count(std::int64_t);
};

template<typename Float = float, typename Method = method::by_default, typename Task = task::by_default>
class descriptor
Template Parameters
• Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

• Method – Tag-type that specifies an implementation of algorithm. Can be method::bruteforce or method::kd_tree.

• Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Constructors

descriptor(std::int64_t class_count, std::int64_t neighbor_count)

Creates a new instance of the class with the given class_count and neighbor_count property values.

Properties

std::int64_t neighbor_count

The number of neighbors $$k$$.

Getter & Setter
std::int64_t get_neighbor_count() const
descriptor & set_neighbor_count(std::int64_t)
Invariants
std::int64_t class_count

The number of classes $$c$$.

Getter & Setter
std::int64_t get_class_count() const
descriptor & set_class_count(std::int64_t)
Invariants

#### Method tags¶

namespace method {
struct bruteforce {};
struct kd_tree {};
using by_default = bruteforce;
} // namespace method

struct bruteforce

Tag-type that denotes brute-force computational method.

struct kd_tree

Tag-type that denotes k-d tree computational method.

using by_default = bruteforce

Alias tag-type for brute-force computational method.

namespace task {
struct classification {};
using by_default = classification;

struct classification

Tag-type that parameterizes entities used for solving classification problem.

using by_default = classification

### Model¶

template <typename Task = task::by_default>
class model {
public:
model();
};

class model
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Constructors

model()

Creates a new instance of the class with the default property values.

### Training train(...)¶

#### Input¶

template <typename Task = task::by_default>
class train_input {
public:
train_input(const table& data = table{},
const table& labels = table{});

const table& get_data() const;
train_input& set_data(const table&);

const table& get_labels() const;
train_input& set_labels(const table&);
};

class train_input
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Constructors

train_input(const table &data = table{}, const table &labels = table{})

Creates a new instance of the class with the given data and labels property values.

Properties

const table &data

The training set $$X$$. Default value: table{}.

Getter & Setter
const table & get_data() const
train_input & set_data(const table &)
const table &labels

Vector of labels $$y$$ for the training set $$X$$. Default value: table{}.

Getter & Setter
const table & get_labels() const
train_input & set_labels(const table &)

#### Result¶

template <typename Task = task::by_default>
class train_result {
public:
train_result();

};

class train_result
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Constructors

train_result()

Creates a new instance of the class with the default property values.

Public Methods

The trained $$k$$-NN model.

#### Operation¶

template <typename Float, typename Method, typename Task>

template<typename Float, typename Method, typename Task>

Runs the training operation for $$k$$-NN classifier. For more details see oneapi::dal::train.

Template Parameters
• Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

• Method – Tag-type that specifies an implementation of algorithm. Can be method::bruteforce or method::kd_tree.

• Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Parameters
• desc – Descriptor of the algorithm.

• input – Input data for the training operation.

Preconditions
input.data.has_data == true
input.labels.has_data == true
input.data.row_count == input.labels.row_count
input.labels.column_count == 1
input.labels[i] >= 0
input.labels[i] < desc.class_count

### Inference infer(...)¶

#### Input¶

template <typename Task = task::by_default>
class infer_input {
public:
const table& data = table{});

infer_input& set_model(const model&);

const table& get_data() const;
infer_input& set_data(const table&);
};

class infer_input
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Constructors

Creates a new instance of the class with the given model and data property values.

Properties

The trained $$k$$-NN model. Default value: model<Task>{}.

Getter & Setter
const model< Task > & get_model() const
infer_input & set_model(const model &)
const table &data

The dataset for inference $$X'$$. Default value: table{}.

Getter & Setter
const table & get_data() const
infer_input & set_data(const table &)

#### Result¶

template <typename Task = task::by_default>
class infer_result {
public:
infer_result();

const table& get_labels() const;
};

class infer_result
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Constructors

infer_result()

Creates a new instance of the class with the default property values.

Public Methods

const table &get_labels() const

The predicted labels.

#### Operation¶

template <typename Float, typename Method, typename Task>

template<typename Float, typename Method, typename Task>

Runs the inference operation for $$k$$-NN classifier. For more details see oneapi::dal::infer.

Template Parameters
• Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

• Method – Tag-type that specifies an implementation of algorithm. Can be method::bruteforce or method::kd_tree.

• Task – Tag-type that specifies type of the problem to solve. Can be task::classification.

Parameters
• desc – Descriptor of the algorithm.

• input – Input data for the inference operation.

Preconditions
input.data.has_data == true
Postconditions
result.labels.row_count == input.data.row_count
result.labels.column_count == 1
result.labels[i] >= 0
result.labels[i] < desc.class_count