SNN is a fast and exact fixed-radius nearest neighbor search algorithm [1]. It uses the first principal component of the data to prune the search space and speeds up Euclidean distance computations using high-level BLAS routines. SNN is implemented in native Python. On many problems, SNN is faster than KDtree and Balltree in the scikit-learn package. There is also a C++ implementation of SNN.
To reproduce the experiments from the paper [1], see the instructions and code in the exp
subfolder.
The native Python implementation of SNN can be installed by:
pip install snnpy
Here is an example illustrating the use of SNN:
import numpy as np
from snnpy import *
from time import time
n_samples = 100000
n_dim = 100
radius = 3.5
rng = np.random.RandomState(0)
X = rng.random_sample((n_samples, n_dim))
# build SNN model
st = time()
snn_model = build_snn_model(X)
print("SNN index time:", time()-st)
# will be faster if return_dist is False, then no distance information come out
# query neighbors of X[0]
st = time()
ind,dist = snn_model.query_radius(X[0], radius, return_distance=True)
# If remove the returning of the associated distance, use: ind, dist = snn_model.query_radius(X[0], radius, return_distance=False)
sort_ind = np.argsort(dist)
print("SNN query time:", time()-st)
# print total number and top five indices
print("number of neighbors:", len(ind))
print("indices of closest five:", ", ".join([str(i) for i in ind[sort_ind][:5]]))
# EXAMPLE OUTPUT
# SNN index time: 0.2224433422088623
# SNN query time: 0.009207725524902344
# number of neighbors: 550
# indices of closest five: 0, 27279, 69983, 65906, 97095
Compare this to sklearn's KDTree:
from sklearn.neighbors import KDTree
st = time()
tree = KDTree(X)
print("KDTree index time:", time()-st)
st = time()
ind2 = tree.query_radius(X[0].reshape(1, -1), radius)
print("KDTree query time:", time()-st)
print("number of neighbors:", len(ind2[0]))
# KDTree index time: 7.597502946853638
# KDTree query time: 0.08962678909301758
# number of neighbors: 550
We also support multi-point queries which exploit single-threaded BLAS-3 (multi-threading is under development):
ind = snn_model.radius_batch_query(X[:10], radius)
snn_model.radius_batch_query
uses Numba for further speedup. For the first run, Numba has to go through the code and optimize it, adding extra overhead. Each subsequent run of batch queries will be much faster.
The C++ version of SNN has dependencies on CBLAS, LAPACK and openMP. Reference LAPACK is available from GitHub. LAPACK releases are available on netlib (using MKL can yield further speedup).
Please modify the CMakeList.txt
file to reflect your LAPACK location.
If liblpacke-dev
as well as libopenblas-dev
were installed using your package manager (for example apt
) there is no need to further modify the CMakeLists.txt
.
git clone https://github.com/nla-group/snn.git
cd snn
cmake -S . -B ./build/
cmake --build ./build/
cp *.a /usr/lib
cp include/*.h /usr/include
After installation, you can include "snn.h"
in your code, and compile it by linking the libsnn.a, CBLAS and LAPACK libraries.
For example, you can use g++ with GSL BLAS and LAPACK by typing g++ your_code.cpp libsnn.a -o output -llapacke -lgslcblas -lm -W
in Ubuntu.
SNN has an easy-to-use API, and it works with int
, float
and double
data types stored in column major order.
We first prepare the data:
int rows = 10;
int cols = 3;
double df[rows*cols] = {
0.5488135 , 0.54488318, 0.43758721, 0.38344152, 0.56804456,
0.0871293 , 0.77815675, 0.79915856, 0.11827443, 0.94466892,
0.71518937, 0.4236548 , 0.891773, 0.79172504, 0.92559664,
0.0202184 , 0.87001215, 0.46147936, 0.63992102, 0.52184832,
0.60276338, 0.64589411, 0.96366276, 0.52889492, 0.07103606,
0.83261985, 0.97861834, 0.78052918, 0.14335329, 0.41466194
};
/* data -- column major order
0.548813 0.715189 0.602763
0.544883 0.423655 0.645894
0.437587 0.891773 0.963663
0.383442 0.791725 0.528895
0.568045 0.925597 0.0710361
0.0871293 0.0202184 0.83262
0.778157 0.870012 0.978618
0.799159 0.461479 0.780529
0.118274 0.639921 0.143353
0.944669 0.521848 0.414662
*
Now we create the SNN index, specifying the number of objects and feature dimensions in the data:
// index SNN model
SNN_MODEL<double, double> snn_model_test(df, rows, cols);
Querying neighbors of a single data point:
// query data
double query[cols] = {0.5488135 , 0.71518937, 0.60276338};
// create the variable storing neighbors ID
vector<int> knnID;
// create the variable storing neighbors' distance to the query
vector<double> knnDist;
// employ single query, the number 0.4 refers to the search radius (range)
snn_model_test.radius_single_query(query, 0.4, &knnID, &knnDist);
We can also query neighbors of multiple data points at once:
// employ two queries, parallel compute by openMP
double query_batch[2*cols] = {0.5488135, 0.944669, 0.71518937, 0.521848, 0.60276338, 0.414662};
vector<vector<int> > batch_knnID;
vector<vector<double> > batch_knnDist;
snn_model_test.radius_batch_query(query_batch, 0.4, &batch_knnID, &batch_knnDist, 2);
for (int j=0; j<2; j++){
cout << "knnID" << endl;
for (auto i: batch_knnID[j]){
cout << i << " ";
}
cout << "\nknnDist" << endl;
for (auto i: batch_knnDist[j]){
cout << i << " ";
}
cout << endl;
}
/* output
knnID
3 0 1 7
knnDist
0.196627 0 0.294734 0.398299
knnID
9 7
knnDist
3.35193e-07 0.398342
*/
All the content in this repository is licensed under the MIT License.
Chen X, Güttel S. 2024. Fast and exact fixed-radius neighbor search based on sorting. PeerJ Computer Science 10:e1929 https://doi.org/10.7717/peerj-cs.1929