map

Sunday, 7 February 2016

How cursed is the curse of dimensionality? Simply explained.

Why does an alien look for neighbours longer than an earthling? This post explains!.

"The curse of dimensionality" is a jargon expression originating from statistical and machine learning. It refers to the issue of a drastic increase of the volume of the boundary layer as compared to the volume of the interior of a unit cube or a sphere as dimension grows.

An example: imagine, you are a climate data scientist and want to track the ice coverage of a remote mountain over time. The mountain is not just very far, but also very steep. Traveling would exhaust your budget, not to mention that you would have to be a professional climber to be able to get to the top to perform a measurement. Repeatedly.

Compared to that hard way, you could find a lot of ease in simply downloading satellite data with an image of the region of your interest. But also here pitfalls await: even if you consider yourself a lucky person, it is almost unavoidable, that some of the pixels contain no information. This may happen, for instance, due to the clouds hovering over the 'blind' point at the time when the image was captured.

What to do? Common sense is prompting, that if most of the points around the 'blind' point are covered with snow, it is likely that this point will lie under snow too. And vice versa: if most of the points around (k closest representatives) display a naked rock, we would believe, that the 'blind' point is a rocky surface as well.

At this stage we would instruct our data-imputing algorithm to go and find k nearest points and 'interview' them for their values. The value that they collectively vote for should be accepted. This procedure is called the k-nearest neighbours (KNN) classification.

How far would the algorithm need to wander until it discovers all the k interviewees? Our hope is that the space exploration would not take it too far, i.e. it will not have to reach the boundary... otherwise it may fall of the cliff :) By the boundary we mean 10% of the most extreme values on each axis.

If our entire space was a unit cube, in dimension 1 the main exploration space (the interior) would be a segment (0.05, 0.95) with volume 0.9, and the boundary would have volume 1-0.9 = 0.1. In dimension 2 the boundary would look as the space enclosed in-between two concentric squares. The volume of the interior is now (0.9)^2. Hence, the volume of the boundary is 1-(0.9)^2 = 0.19. The boundary now is almost twice as large as in dimension 1.

Following this logic, one concludes, that in dimension 100 the volume of the boundary constitutes 1-(0.9)^100 = 0.9999734, i.e. the boundary takes almost all of the space. Which means, if you forbid your algorithms to wander too close to the 'cliff', there will be almost nowhere to go.

Fortunately, your satellite image is two dimensional and, you might think, it is nice to be aware of the dimensionality issues in oder to feel empathic towards an alien living on a 100-dimensional planet with its 100-dimensional mountains.

Yes and no. The trap is that even on Earth we run into high-dimensionality troubles sooner than we expect...

This time imagine, you are a Big Data analyst with Zetabites of data in your pocket. In an urge of explaining an outcome by all possible factors that are available to you, you still would like to forbid the outliers - extreme values in the boundary layer... Do you see where it is going?...

Now a bit of play-around: to better train your empathy to the 100-dimensional alien, here is a code in R programming language that computes the volume of the boundary in dimensionality d using the Monte Carlo method. The plot below the code displays the dependency between dimensionality and the boundary volume.

#this function return approximate volume of a 
#d-dimensional boundary of a unit cube
#n-number of throws for the Monte Carlo method
boundary_volume <- function(n,d){
  #generate n d-dimensional random vectors
  throws <- data.frame(matrix(runif(n*d), nrow = n, ncol = d))
  count <-0
  for (i in 1:n){ 
    #test all generated points on belonging to the boundary
    if (min(throws[i,])<0.05 || max(throws[i,])>0.95) 
      count <- count+1
  }
  #compare numerical and theoretical values
  print(paste('d=',d,': Numerical = ', format(count/n*100, digits=3), '%, Theoretical=', format((1-(0.9)^d)*100, digits=3), '%;',sep=''))
  #return the answer as percentage
  return (count/n*100)
}

n <- 10000
volume <- rep(NA,20)
for (d in 1:20){
  print(paste('d=',d,': Elapsed time = ', format(system.time(volume[d] <- boundary_volume(n,d))[3], digits=4),sep=''))
}

plot(volume, type = 'b',col = "red", xlab = 'Dimension', ylab= 'Boundary volume,%')

No comments:

Post a Comment