When working with high-dimensional data, we typically face the problem of defining a distance between points that takes into account their underlying structure. Many tasks in machine learning, such as clustering, classification, dimensionality reduction, and deep learning, require a notion of distance or similarity between points, and this distance has to capture as much relevant information as possible in order to give good results.

For this scenario, a good hypothesis is that the dataset consists of samples drawn from an unknown density distribution supported on a lower-dimensional manifold. In some cases this can be corroborated, and a few degrees of freedom suffice to parametrize the points in the dataset. Moreover, there are theoretical results showing that, independently of the dimension of the ambient space, points lie near some lower-dimensional structure.

Typically, real data is modeled as a collection of points living on a manifold with an intrinsic dimension lower than that of the ambient space.

Naturally, the number of points needed to fill a $d$-dimensional set scales exponentially with the dimension of the space, making it impossible for a finite set of points to do so. More technically, the Johnson-Lindenstrauss Lemma states that given a dataset of $n$ points, there always exists a projection of the points into $\mathbb{R}^k$, with $k = O(\varepsilon^{-2} \log n)$, that preserves the Euclidean distance between all pairs of points with an error smaller than $\varepsilon$, independently of the dimension of the dataset.
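As a quick sanity check of this phenomenon, here is a minimal sketch (the dimensions, sample size, and seed are arbitrary illustrative choices) that projects Gaussian points with a random matrix and compares all pairwise distances before and after:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

n, D, k = 50, 5000, 500            # n points in ambient dimension D, projected to k
X = rng.normal(size=(n, D))

# Random Gaussian projection, scaled so distances are preserved in expectation.
P = rng.normal(size=(D, k)) / np.sqrt(k)
Y = X @ P

# Ratio of every pairwise distance after vs. before the projection.
ratio = pdist(Y) / pdist(X)
print(ratio.min(), ratio.max())    # both close to 1
```

Even though the target dimension is ten times smaller, every one of the $\binom{n}{2}$ distances is distorted by only a few percent.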

## The Fermat distance

For the sake of the explanation, let us consider first that we know the density $f \colon \mathcal{M} \to \mathbb{R}_{>0}$, where $\mathcal{M}$ is a $d$-dimensional manifold embedded in $\mathbb{R}^D$. As previously mentioned, typically $d \ll D$, but this is not a required condition. On $\mathcal{M}$ we have defined some distance, for example the Euclidean distance inherited from $\mathbb{R}^D$, and we write $\ell$ for the corresponding length of a curve. Given two points $x, y \in \mathcal{M}$ and a parameter $\alpha \geq 0$, we define the Fermat distance between $x$ and $y$ as

$$ \mathcal{D}_{f, \alpha}(x, y) = \inf_{\gamma} \int_{\gamma} \frac{1}{f^{\alpha}} \, d\ell, $$

where the minimization is carried over all rectifiable curves $\gamma \subset \mathcal{M}$ that start at $x$ and end at $y$.

- Given the input distance $\ell$, $\mathcal{D}_{f, \alpha}$ defines a (new) distance that takes into consideration both the manifold and the density.
- For $\alpha = 0$, the Fermat distance is identical to the length of the geodesic on $\mathcal{M}$, recovering the Isomap distance.
- For higher values of $\alpha$, paths that cross regions with low density are disfavored.
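A minimal numerical sketch of this definition: assuming a hypothetical density that is high along the horizontal axis, we can discretize two curves between the same endpoints and approximate the integral above with a midpoint rule (the density, the curves, and the values of $\alpha$ here are all illustrative, not from the original text):

```python
import numpy as np

def fermat_length(path, density, alpha):
    """Approximate the integral of f(gamma)^(-alpha) dl over a polygonal path."""
    mids = 0.5 * (path[1:] + path[:-1])                    # segment midpoints
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)    # segment lengths
    return np.sum(density(mids) ** (-alpha) * seg)

# Hypothetical density: high along the x-axis, decaying away from it.
density = lambda p: np.exp(-5.0 * p[:, 1] ** 2)

t = np.linspace(0, 1, 200)[:, None]
straight = np.hstack([t, np.zeros_like(t)])          # stays in high density
detour   = np.hstack([t, 0.5 * np.sin(np.pi * t)])   # dips into low density

for alpha in (0.0, 1.0, 3.0):
    print(alpha, fermat_length(straight, density, alpha),
                 fermat_length(detour, density, alpha))
```

For $\alpha = 0$ the costs are just the curve lengths; as $\alpha$ grows, the detour through low density becomes increasingly more expensive than the straight path.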

Let’s consider the example of a bimodal distribution supported on a lower-dimensional manifold $\mathcal{M}$, with modes at points $x_1$ and $x_2$ and with different dispersions (see image below). If we consider a point $z$ on the manifold, both the Euclidean and the geodesic distance will make $z$ nearer to $x_1$ than to $x_2$. On the contrary, since the density takes lower values along every path that connects $z$ with $x_1$, the Fermat distance $\mathcal{D}_{f, \alpha}(z, x_1)$ will be higher than $\mathcal{D}_{f, \alpha}(z, x_2)$ for a choice of $\alpha$ large enough.

On the other hand, for $\alpha > 0$ the optimal path that realizes the Fermat distance is most likely to cross regions with high density. In that way, we can think of it as the most likely continuous transformation of $x$ into $y$.

The Fermat distance is defined as an optimization problem that selects a path minimizing some functional. As its name suggests, the definition of the Fermat distance resembles a least-action principle in optics.

In optics, the path followed by light in a medium is a minimizer of the optical path length, the integral of the refractive index of the medium along the path. This least-action formulation of geometric optics is known as Fermat’s Principle.

The refractive index is related to the speed of light in the medium. More specifically, if $c$ is the speed of light in vacuum and $v$ is the speed of light in the medium, the refractive index is defined as $\eta = c / v$. Then, the path integral

$$ \int_{\gamma} \eta \, d\ell = c \int_{\gamma} \frac{d\ell}{v} $$

represents, up to the constant $c$, the time required by a ray of light to traverse the path $\gamma$. Regions with lower refractive index are regions where light can move faster. In that way, the role of the refractive index in the definition of the Fermat distance is played by the inverse power of the density $f^{-\alpha}$: *the light moves faster in regions with high density*.

### How to estimate the Fermat distance?

Even assuming full knowledge of the density $f$ and its support $\mathcal{M}$, finding the optimal path that realizes the Fermat distance could be computationally expensive. Moreover, in a real-case scenario we only have information about $\mathcal{M}$ through a finite set of samples $Q_n = \{q_1, \ldots, q_n\}$ with common density $f$. One approach could be to first estimate the density and then, in a second step, estimate the Fermat distance.

However, let us see how it is possible to estimate the Fermat distance directly, using mathematical results from percolation theory. Given two points $x$ and $y$ and a sequence of points $(q_1, \ldots, q_K)$ with $q_j \in Q_n$ for all $j$, we say that $(q_1, \ldots, q_K)$ is a path from $x$ to $y$ if $q_1$ and $q_K$ are the nearest points in $Q_n$ to $x$ and $y$, respectively. Given a parameter $\beta > 1$, the Fermat distance estimator is defined as

$$ D_{Q_n, \beta}(x, y) = \min \left\{ \sum_{j=1}^{K-1} \| q_{j+1} - q_j \|^{\beta} \,:\, (q_1, \ldots, q_K) \text{ is a path from } x \text{ to } y \right\}. $$
Starting at $q_1$, this represents the minimum cost of reaching $q_K$ when the local cost of moving from a point $q_j$ to a point $q_{j+1}$ is $\| q_{j+1} - q_j \|^{\beta}$.
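Since every candidate path is a sequence of sample points, the estimator between sample points can be computed as a shortest path in the complete graph whose edge costs are the Euclidean distances raised to the power $\beta$. A minimal sketch with illustrative synthetic data, using SciPy’s `dijkstra`:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

rng = np.random.default_rng(1)
Q = rng.uniform(size=(30, 2))      # a small synthetic sample Q_n
beta = 2.0

# Complete graph: the edge between q_i and q_j costs ||q_i - q_j||^beta.
W = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1) ** beta

# The sample Fermat distance between every pair is the shortest path in W.
D = dijkstra(W)
print(D[0, 1], W[0, 1])            # the optimal path may use intermediate points
```

Because $\beta > 1$, many short hops are cheaper than one long jump, so $D$ is often strictly smaller than the direct edge cost $W$.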

At first glance, the relationship between the Fermat distance and its estimator is not immediate. However, it can be proved that if we rescale $D_{Q_n, \beta}$ adequately and take the limit as the number of samples goes to infinity, then

$$ n^{(\beta - 1)/d} \, D_{Q_n, \beta}(x, y) \xrightarrow[n \to \infty]{} \mu \, \mathcal{D}_{f, \alpha}(x, y), $$

where the parameter $\alpha$ satisfies the relation $\alpha = (\beta - 1)/d$, with $d$ the intrinsic dimension of the manifold and $\mu$ some constant that depends on $\beta$ and $d$.

To better understand this result, let’s consider the simple case where $\beta = 2$, so the contribution of each segment $\| q_{j+1} - q_j \|^{2}$ is proportional to the area of a circle with center at the midpoint between $q_j$ and $q_{j+1}$ and radius $\| q_{j+1} - q_j \| / 2$. Then, the Fermat distance estimator is proportional to the total area of the shaded region in the figure above. In high-density regions consecutive sample points are closer together, so these circles are smaller. Consequently, a path that goes through regions with high density will have lower cost, just as the Fermat distance does.

### Algorithmic Implementation

The minimization over all possible paths that connect two given points can be treated as a shortest-path problem in the graph whose nodes are the points of $Q_n$ and where the weight of the edge between two points $q_i$ and $q_j$ is $\| q_i - q_j \|^{\beta}$. The classical method to compute all pairwise distances is the Floyd-Warshall algorithm, which requires $O(n^3)$ operations (a lot!). However, we can perform the two following relaxations of the problem:

Since the power $\beta > 1$ discourages large distances between consecutive points of the optimal path, we can restrict the search to paths where each $q_{j+1}$ is a $k$-nearest neighbour of $q_j$. Even more, if we choose $k$ of order $\log n$, then with high probability the estimator of the Fermat distance doesn’t change. In such case, running Dijkstra’s algorithm from every node requires $O(n^2 \log^2 n)$ operations.
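A sketch of this relaxation, building the $k$-nearest-neighbour graph with SciPy (the sample, $\beta$, and the exact multiple of $\log n$ used for $k$ are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
Q = rng.uniform(size=(500, 2))
n = len(Q)
beta = 2.0
k = 3 * int(np.ceil(np.log(n)))        # k of order log n

# Sparse k-nearest-neighbour graph with edge costs ||q_i - q_j||^beta.
dist, idx = cKDTree(Q).query(Q, k + 1)           # column 0 is the point itself
rows = np.repeat(np.arange(n), k)
G = csr_matrix((dist[:, 1:].ravel() ** beta, (rows, idx[:, 1:].ravel())),
               shape=(n, n))

# Dijkstra on the sparse kNN graph instead of the complete graph.
D = dijkstra(G, directed=False)
print(np.isfinite(D).all())            # True when the kNN graph is connected
```

The graph has only $nk$ edges instead of $n^2$, which is what makes the Dijkstra-based computation tractable for larger samples.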

Furthermore, we can consider a subset of $l$ points $\{\lambda_1, \ldots, \lambda_l\} \subset Q_n$, which we are going to call the *landmarks*, selected with some rule, for example a random sample of size $l$ from $Q_n$. Then, we compute the shortest path from each of the $\lambda_i$ to the rest of the points in $O(l \, n \log^2 n)$ operations using Dijkstra’s algorithm. Finally, by the triangle inequality, it is possible to bound the Fermat distance between any $x$ and $y$ as

$$ \max_{i} \big| D_{Q_n, \beta}(\lambda_i, x) - D_{Q_n, \beta}(\lambda_i, y) \big| \;\leq\; D_{Q_n, \beta}(x, y) \;\leq\; \min_{i} \big( D_{Q_n, \beta}(\lambda_i, x) + D_{Q_n, \beta}(\lambda_i, y) \big). $$

If the lower and upper bounds coincide, we have the exact value of $D_{Q_n, \beta}(x, y)$; if not, we have an interval where the true value lies.
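A sketch of these landmark bounds (the sample, the number of landmarks, and the pair of points are all illustrative choices):

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

rng = np.random.default_rng(3)
Q = rng.uniform(size=(200, 2))
beta = 2.0
W = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1) ** beta

# Run Dijkstra only from a few randomly chosen landmarks.
landmarks = rng.choice(len(Q), size=10, replace=False)
D_land = dijkstra(W, indices=landmarks)          # shape (10, 200)

# Triangle-inequality bounds on the distance between an arbitrary pair.
x, y = 17, 133
lower = np.max(np.abs(D_land[:, x] - D_land[:, y]))
upper = np.min(D_land[:, x] + D_land[:, y])
print(lower, upper)        # the true D(x, y) lies in [lower, upper]
```

Only $l$ runs of Dijkstra are needed instead of $n$, at the price of obtaining an interval rather than an exact value for most pairs.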

An implementation in Python of all these variants can be found at www.aristas.com.ar/fermat.

## References

*Weighted Geodesic Distance Following Fermat’s Principle*
F. Sapienza, P. Groisman, M. Jonckheere
6th International Conference on Learning Representations (ICLR), 2018.

*Nonhomogeneous Euclidean first-passage percolation and distance learning*
P. Groisman, M. Jonckheere, F. Sapienza
