Talk:K-medians clustering


I teach a university course on Machine Learning, and just came across this.

I don't have the cited articles handy for reference, but I'm not sure about the claim that using medians rather than means minimizes distances under the taxicab metric rather than the Euclidean metric. Consider the 1-dimensional case, where taxicab and Euclidean distance are the same. Now imagine a skewed distribution, like a Gaussian that has been stretched to the right of its center. This stretching doesn't move the median, but it does move the mean to the right to minimize the sum of distances. And since this is 1-dimensional, that should hold whether we're talking about taxicab, Euclidean, or Minkowski distances more generally.

In statistics, there are common reasons for using the median instead of the mean (basically to do with the relative importance of the number of instances that have a given amount of error vs. the magnitude of that error). So the method doesn't really need the taxicab/Euclidean argument to motivate its use theoretically. As long as we're not sure, I'd suggest being conservative and removing this claim. Alternatively, maybe someone can explain why the claim is actually true despite my argument above, tell me what I'm missing, or point to a more readily available reference.

RichardBergmair 10:22, 10 Mar 2014 (UTC)

Edit: I did some research, and what is mentioned here is indeed true. In 2D (or higher-dimensional) space, there is no accepted definition for the median (see this link). Therefore, you have to look at each dimension separately. This lecture helped me understand; they mention the formula: Engineeru (talk) 14:33, 3 August 2018 (UTC)
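
For anyone who wants to check this numerically, here is a minimal sketch (not taken from the cited lecture or articles). It assumes NumPy and an arbitrary skewed 2-D sample, and the helper names `l1_cost` and `sq_euclid_cost` are made up for illustration. It compares the per-dimension median and the mean as cluster centers. The point relevant to the 1-dimensional argument above is that the distinction is between summed absolute distances (minimized by the median) and summed squared distances (minimized by the mean), which differ even in 1-D.

```python
import numpy as np

# Arbitrary skewed sample for illustration (assumption: exponential in each coordinate).
rng = np.random.default_rng(0)
points = rng.exponential(scale=1.0, size=(501, 2))


def l1_cost(center, pts):
    """Sum of taxicab (L1) distances from every point to the center."""
    return np.abs(pts - center).sum()


def sq_euclid_cost(center, pts):
    """Sum of squared Euclidean distances from every point to the center."""
    return ((pts - center) ** 2).sum()


# Per-dimension median and ordinary mean as candidate cluster centers.
median_center = np.median(points, axis=0)
mean_center = points.mean(axis=0)

print("L1 cost at median:      ", l1_cost(median_center, points))
print("L1 cost at mean:        ", l1_cost(mean_center, points))
print("Squared cost at median: ", sq_euclid_cost(median_center, points))
print("Squared cost at mean:   ", sq_euclid_cost(mean_center, points))
```

On a skewed sample like this, the median center should give the lower taxicab cost and the mean center the lower squared Euclidean cost, because each cost splits into independent per-coordinate sums.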