Talk:Selection algorithm

Latest comment: 2 months ago by David Eppstein in topic Incorrect, unsourced (WP:OR?) illustration

No mention of min-max heap as a linear time selection algorithm

edit

No where on the page is the min-max heap mentioned, which is also a valid deterministic linear time solution to find the kth smallest (or largest) element in a list. It uses the linear time selection algorithm to build the min-max-median heap though. — Preceding unsigned comment added by Dhruvbird (talkcontribs) 15:30, 21 October 2012 (UTC)Reply

Select does not find the true median!

edit

lols, median might b found in linear time, recursively, assuming we got the solution 4 a smaller group, 4 median, n use this result etc... :P 4 ur delite — Preceding unsigned comment added by 93.118.212.31 (talk) 12:34, 27 July 2011 (UTC)Reply

Quoting the text:

Select is then called recursively on this sublist of n/5 elements to find their true median.

Since Select does not find the true median of the n elements, it does not find the true median of the n/5 elements in the general case. It would be more appropriate to call this value the "pivot" or "approximate median". Philippe.beaudoin (talk) 20:47, 5 April 2011 (UTC)Reply

Why is a group size of 5 used?

edit

I know it is not arbitrary chosen, but I cannot any article backing this up. If somebody would know why we can add it to the contents of this page. —Preceding unsigned comment added by 145.120.14.117 (talk) 21:21, 27 January 2011 (UTC)Reply

  • 5 is just the smallest value that works. It's from the original paper. There ought to be a citation for the fact that any larger (fixed) value would also work. Dcoetzee 04:42, 28 July 2011 (UTC)Reply

How to re-generate visualization table for proof of 40%/60%

edit
import random

x = range(100)
random.shuffle(x)

tuples = []
while x:
 tuples.append(sorted([x.pop() for _ in range(5)]))

tuples = sorted(tuples, key=lambda tuple:tuple[2])

pivot = tuples[int(100/5 / 2)-1][2]

for i in range(5):
 print '! '
 def colorized(n):
  return 'bgcolor='+('gray' if n<pivot else 'red' if n==pivot else 'white')+'|'+str(n)
 print '| '+'||||'.join(colorized(tuple[i]) for tuple in tuples)
 print '|-'

—Preceding unsigned comment added by Ninjagecko (talkcontribs) 15:09, 3 December 2009

Summary?

edit

If I understand it correctly the point of this article is that we can generalize an O(n) approach for finding (singular) min() and max() to finding k minimum or maximum values by keeping a small sorted side list of k items (perhaps in a b-tree or just using binary search and insertions). The idea is that as we go any values that don't fit inside the range of our current k-list are simply skipped, any that do are inserted (pushing one value off the end of the k-list). This would be O(n) * O(log k) time which, given he pre-condition that k is small, reduces to O(n) time overall. This would be particularly handy, for example, when the original list is too large to fit into memory ... one can still find k smallest or largest items in a single pass over the data (files, for example). (Even for really huge n and k that's also too large to fit in core, the upper and lower bounds of k would still fit and a random access method through the storage for k would still be relatively efficient).

The problem I'm having is understanding the part about median of medians. A diagram would be great. Is it that we're finding the median for a subset (say 5 items), then finding median for successive subsets (say 5 subsets), then finding the median of the medians, and so on? This sounds a little like the old RRDtool "compression" strategy (store data at one minute granularity for the most recent n-hours, then average those to hourly granularity, then to daily, and so on). Am I understanding that correctly? Is that mathematically a valid approach? Are the results of this process really the median for all possible data sets? There are no degenerate data set patterns that would significantly skew the finally results?

If what I'm saying here is true, how can we clarify the article? (Or am I just being dense)? JimD (talk) 20:19, 2 August 2009 (UTC)Reply

You're not being dense. Median of Medians is an incorrect algorithm. It does not produce the true median. Example:
median(3, 1, 4, 1, 5, 9, 2, 6, 5) = median(median(3, 1, 4), median(1, 5, 9), median(2, 6, 5)) = 5. But the actual median of the original set is 4. Therefore, though median of medians may provide an efficient method for approximating the median most of the time, it is far from mathematically accurate.
The whole point in using the median-of-median algorithm is to find an element that can act as as pivot element for partitioning.
It does not matter if the median-of-median algorithm does not produce the true median.
There are no degenerate data patterns that can make it go wrong.
This guarantees that the selection algorithm is O(N)
In David Musser's paper on IntroSort and IntraSelect, Musser said if QuickSort went quadratic, you could swap to HeapSort. His hybrid algorithm meant the worse case was O(N * log N) for sorting.
For IntraSelect, Musser said QuickSelect could be used. But he did not say what algorithm should be used if QuickSelect went quadratic. The Median-of-Median's is that algorithm.
Stephen Howe (talk) 21:58, 30 April 2016 (UTC)Reply

Benefit of Non-strict Comparisons

edit

"By using non-strict comparisons (<= and >=) instead, we would find the minimum element with maximum index; this approach also has some small performance benefits." What performance benefits, exactly?

On some machines it can cause less branch instructions, and so generate less icache flushes. The loop also has to have been unrolled or have other unconditional instructions in it. This effect is so small and machine-dependent however that I'll go ahead and remove this claim. Deco 02:08, 24 November 2005 (UTC)Reply

quicksortOnlyFirstK not recursive?

edit

Shouldn't the quicksortOnlyFirstK pseudocall call itsself instead of quicksort, or am I missing something obvious.--ISee 07:22, 26 April 2006 (UTC)Reply

Fixed that after verifying that part of the algorithm--ISee 11:10, 26 April 2006 (UTC)Reply

Minimum or maximum

edit

Is it necessary to say "minimum or maximum" everywhere? It adds unnecessary wordiness, and it makes the times when it's not used that much more confusing. Here's an example:

Note that there may be multiple minimum or maximum elements. Because the comparisons above are strict, these algorithms finds the minimum element with minimum index. By using non-strict comparisons (≤ and ≥) instead, we would find the minimum element with maximum index.

Two of those minimums should be changed to minimum or maximum.

What would be even better, though, would be to just generalize the whole article to be "finding the maximum" and give a note that everything can be flipped for the reverse case. --71.71.238.231 08:07, 15 June 2006 (UTC)Reply

I'm not sure what you're arguing. It already says "minimum or maximum" in only one place. The others are referring to something more subtle than that - the case where there are equal elements and we have to select which one is chosen. Deco 08:21, 15 June 2006 (UTC)Reply
It already says "minimum or maximum" in only one place. Search again. That exact text appears four times. Which says nothing about forms like "minimum/maximum." --71.71.238.231 02:53, 17 June 2006 (UTC)Reply
Oh, I was just referring to the portion you quoted above. You're right, it's a good idea to avoid the verbosity and just mention it once at the beginning. Deco 04:28, 17 June 2006 (UTC)Reply


Why dont u simply sort the given list L and return the k-th element in L? -- Orz 04:25, 29 August 2006 (UTC)Reply

Because that takes O(n log n) time, which is dramatically slower for large n. The article discusses this. Deco 05:37, 29 August 2006 (UTC)Reply
It does indeed. Thanks for clearing that out, Decon

swap in partition function

edit

is there an error in one line from the partition function? in particular, shouldn't the line:

swap a[pivotIndex] and a[right]

be replaced with the line:

swap a[pivotIndex] and a[left]

mwb

correction

edit

or perhaps the swap line was correct, but a the end of partition you need to swap a[right] back to a position between the two subsegments...

mwb

Yes... that line was there. Someone just removed it and introduced several other errors. I've reverted them. Deco 23:21, 11 October 2006 (UTC)Reply

The pseudocode still seems incomplete:

function select(list, left, right, k)

    if left = right // If the list contains only one element
        return list[left]  // Return that element
    // select pivotIndex between left and right
    pivotNewIndex := partition(list, left, right, pivotIndex)

Above, pivotIndex never got declared or introduced nor initialized. — Preceding unsigned comment added by NetGhost123 (talkcontribs) 11:00, 6 October 2012 (UTC) Instead of the comment // select pivotIndex between left and right it would be easier to read if a dummy function were used such as: pivotIndex = SelectPivotIndex( left, right) // There are various ways to select the pivot index. — Preceding unsigned comment added by NetGhost123 (talkcontribs) 11:06, 6 October 2012 (UTC)Reply

Linear minimum/maximum algorithms

edit

I don't understand why the index of the minimum/maximum element seen so far is stored in the described algorithm. In the example pseudocode, it is not used for anything. Ahy1 12:05, 29 December 2006 (UTC)Reply

It's part of the result. Often you want not only the minimum/maximum value, but the index at which it occurred, for later processing. Deco 14:50, 29 December 2006 (UTC)Reply

Remarks

edit

A selection algorithm is not only for finding the kth smallest (or kth largest) number in a list. In Genetic algorithms they use other shemes for selection than the smallest value for example. Or in real-time/embedded systems there exists selection algorithms, that just take a msg and take an action according to the type of message. So it isn't all about finding the smallest values...

These are just different things with the same name. I'll clarify that the term use here is specialized. Dcoetzee 00:08, 5 May 2007 (UTC)Reply

Introselect

edit

There is a variation of selection that David Musser refers to. Introselect has an analogous relationship to quickselect that Intrasort has to quicksort - all designed to ensure both average and worse-case performance of O(n). But Musser in his paper does not say what algorithm would be swopped to if quickselect goes bad. I think this page should to refer to Introselect.

There is also another variation of finding the kth of n elements - where the elements remain undisturbed. Naturally the overhead is considerably higher and I dont know what the average big-O performance is like. [User:Stephen Howe|Stephen Howe] 21:50, 4 May 2007 (UTC)

Good idea, introselect ought to be mentioned. Dcoetzee 00:08, 5 May 2007 (UTC)Reply

Error?

edit

In the selection function, shouldn't we compare pivotNewIndex - left + 1 with k, which means selection of kth smallest number in list(left:right).

I don't think so, you have to think of it as shrinking the target area around k, which remains relative to the whole array.--Sam Brightman (talk) 16:14, 7 January 2009 (UTC)Reply

primitive algorithm to select k elements - O(n) not O(kn)

edit

Section : Repeated application of simple selection algorithms

O(n) to select k-th element. O(n) to compare all elements to it. —Preceding unsigned comment added by 132.72.98.166 (talk) 13:03, 12 November 2007 (UTC)Reply

No, it's O(kn), because it can't just magically pick out the kth smallest element on the first try - it has to first identify each smaller element. Dcoetzee 13:26, 12 November 2007 (UTC)Reply
Read the article you are commenting on, and it'll show you a great "magical" algorithm for choosing the k-th smallest element in linear time. The GP is correct, it requires only O(n) time to choose the k smallest elements. 142.20.202.196 (talk) 16:49, 25 January 2011 (UTC)Reply

A C# implementation of the primative Algo shows an error. Would any compSci care to comment?

        public double BasicSelection(List<double> myList, int k)
        {
            for (int i = 0; i < k; i++)
            {
                int minIndex = i;
                double minValue = myList[i];
                for (int j = i+1; j < myList.Count; j++)
                {
                    if (myList[j] < minValue)
                    {
                        minIndex = j;
                        minValue = myList[j];
                    }
                    //now swap them around
                    double temp1 = myList[i];
                    double temp2 = myList[minIndex];
                    myList[i] = temp2;
                    myList[minIndex] = temp1;
                }
            }
            return myList[k];
        }

—Preceding unsigned comment added by 88.110.61.196 (talkcontribs) 12:01, 17 April 2008

Question regarding the nonlinear general selection algorithm

edit

One of the advantages is claimed to be "after locating the jth smallest element, it requires only O(j + (k-j)2) time to find the kth smallest element, or only O(k) for kj".

I didn't understand this part. Suppose we've found the minimum (j=1), and we are going to find the second smallest element (k=2). Can we find it in O(1) time? —Preceding unsigned comment added by Roy hu (talkcontribs) 22:50, 22 January 2009 (UTC)Reply

Yeah, this seems wrong, it should be O((n-j)*(k-j)). Esrogs (talk) 03:28, 26 February 2013 (UTC)Reply

Advanced algorithms missing

edit

See problem 13 (5.3.3) The Donald Knuth, The art of computer programming, Sorting and Searching.

Median can be found in 3/2 n + O(n^(2/3) log(n)) comparisons, due to R. W. Floyd. —Preceding unsigned comment added by Atulsvasu (talkcontribs) 07:19, 9 December 2009

Bug in linear general selection algorithm

edit

The proof of O(n) running time for the linear general selection algorithm implicitly assumes all elements are different (with the given partition and select function)

Indeed, when running with a series of n equal elements with the given partition and select function, this algorithm runs in O(n**2). I guess the following fix would be ok : partition should put all elements equal to the pivot together, and return two indexes i and j such that list[i], list[i+1], ..., list[j] are equal to the pivot, and all elements before i are smaller (not just smaller or equal) than the pivot and all elements after j are greater (not just greater or equal). Then select has to check whether the k-th value is before i, after j or between them.

(I have no access to Musser's article to check what he says about this)

Judicael (talk) 09:46, 4 January 2010 (UTC)Reply

Bug in quicksortOnlyFirstK

edit

I am trying to implement this algorithm and find a problem with this line:

            quicksortFirstK(list, pivotNewIndex+1, right, k-pivotNewIndex)

I believe it should be

            quicksortFirstK(list, pivotNewIndex+1, right, k)

Reasoning: Our index which we are sorting to should never change. By doing k - pivotNewIndex we change that index.

I believe the intent of the author who put this algorithm in here was to change it so that the first k indexes AFTER left (which now equals pivotNewIndex+1 and so this k would then make sense) rather than the absolute k. In order for this assumption to be true you would have to change the if statement to if pivotNewIndex - left < k. This change could also fix the algorithm, but would be slower and less clear.

I have it implemented it in java and it will not work correctly until I set that and it works perfect (no excessive sorting above k) with it set how I have it fixed.

You are correct. The "k-pivotNewIndex" would be correct for a function that always took a modified "list" pointer and a "left" boundary being always zero. I've updated the main page. --bcwhite —Preceding unsigned comment added by Bcwhite (talkcontribs) 18:15, 29 June 2010 (UTC)Reply

Agreed as well - though why has the main page changed back to the error? (23rd July 2010)

BFPRT Worst Case O(n) or Not?

edit

hi, i think main idea abt this algo could lead us to a linear time, evn for the worst case, by considerring 2 groups of 4 pivots, 2 from each group (2 groups of 2 pivots n produce the next 2 pivots 4 the group-double) . i think in this current configuration, BFPRT median of medians looks more like N^(log_5(3)) guard assured which is not linear (worst case not linear) Florin Matei,Romania 93.118.212.93 (talk) 04:51, 9 October 2012 (UTC)Reply

worst case O(n), ha? scary pic, u mean, isnt it? Romanians can do it! oh, yeah, that beaf of urs 2 Lol — Preceding unsigned comment added by 93.118.212.93 (talk) 16:50, 8 July 2012 (UTC)Reply

In the beginning of the partition based section it is claimed that BFPRT introduced a worst case linear time algorithm, while the algorithm described is (correctly) explained to be O(n^2) worst case. Was there another algorithm (like the median of medians) in the paper or is the claim wrong? Needs a citation in any case... Otus (talk) 06:55, 19 July 2010 (UTC)Reply

The writing is just confusing - somebody jumbled it up pretty bad since I originally wrote it. BFPRT (the same thing as the "median-of-median" algorithm) is indeed worst-case linear, as you can easily verify by reading the abstract of the paper cited. But the section "partition-based general selection algorithm" describes quickselect, the algorithm it is based on, which is expected-time linear and worst-case quadratic. Dcoetzee 10:00, 19 July 2010 (UTC)Reply

smallest K items

edit
      if pivotNewIndex < k // questionable
            findFirstK(list, pivotNewIndex+1, right, k)

This code is listed in the article. If pivotNewIdex is among the first K items, then we need to return the left partition ie list[left..pivotNewIndex-1] because these elements are definitely among the smallest K items. Then, we need to recur on the right partition but with a reduced value for k. If we initially wanted k=1000 items, now we just want 950 items if the left partition already provides the smallest 50 items.

The algorithm as is could crash if k exceeds the range ie K > (right - pivorNewindex) 208.123.162.2 (talk) 07:50, 2 October 2010 (UTC)Reply

think! its fun...

possible new ideas abt median:a challenge 2 think, like reverse engineering-reengineering

edit
CONST n = 10000
 CLS
 DIM a(n)
 FOR i = 1 TO n: a(i) = RND: NEXT i

l = n alfa:

FOR i = 1 TO l / 2
 IF a(i) > a(i + l / 2) THEN
    t = a(i): a(i) = a(i + l / 2): a(i + l / 2) = t
 END IF
 NEXT

l = l / 2

FOR i = 1 TO l / 2
 IF a(i) < a(i + l / 2) THEN
    t = a(i): a(i) = a(i + l / 2): a(i + l / 2) = t
 END IF
 NEXT

l = l / 2 IF l > 10 THEN GOTO alfa FOR i = 1 TO l: PRINT a(i); : NEXT i


there seems to b algos with the 2 n 3 numbers 4 sorting, floodfill,muls, median of medians evn 4 factorizations, an heuristic one. good luck .Florin,Romania,38 old — Preceding unsigned comment added by 93.118.212.93 (talk) 16:33, 8 July 2012 (UTC) median of median algoReply

hard 2 b proved by students

considerring 2^(2*n) no. of elements

comparing 1:1 after their position keeping mins in first, max of each comp in second

considerring groups of mins (first group)

procedind similary but keeping the big no. group

continuing alternatin till group is 2^n long sorting getting median

this shoud b n/4...3n/4 posn

hard to b proved like i said — Preceding unsigned comment added by 93.118.212.93 (talk) 16:19, 12 July 2012 (UTC)Reply

Lower(?) bounds

edit

The section titled Lower Bounds is a little disconnected. It mentions that Knuth discusses these, but then jumps to a paragraph about an upper bound in his book. A trap for the unwary? I can edit in the best known lower bounds, but should the section be retitled Upper and Lower Bounds for clarity? Hardmath (talk) 14:17, 10 July 2014 (UTC)Reply

Space complexity

edit

The section "Space complexity" says "The required space complexity of selection is easily seen to be k + O(1) (or n − k if k > n/2)". In what sense? There exists a simple algorithm with O(1) space complexity and O(n^2) time complexity: for each element of the input array, test whether it is the desired quantile by running through the array and counting how many elements are smaller than it and how many are equal. Jaan Vajakas (talk) 17:53, 25 May 2015 (UTC)Reply

Parallel Algorithms missing

edit

This article should be expanded with a discussion on paralellization. OpenMP has reduction for min and max, not median. Also GPU's should help getting the median of a large dataset, but some algorithms are better suited for the task than others. 132.183.13.70 (talk) 17:08, 11 March 2016 (UTC)Reply

Hoare's algorithm

edit

This page redirects from Hoare's algorithm and mentions Hoare once, but never specifically defines the algorithm. --Quantum7 07:39, 1 April 2016 (UTC)Reply

What is this sorcery? O(n + k log k) partial sort?

edit

The article currently says:

Rather than sorting the whole list or array, one can instead use partial sorting to select the k smallest or k largest elements. The kth smallest (resp., kth largest element) is then the largest (resp., smallest element) of the partially sorted list – this then takes O(1) to access in an array and O(k) to access in a list. This is more efficient than full sorting, but less efficient than simply selecting, and takes O(n + k log k) time, due to the sorting of the k elements.

What is this algorithm that does a partial sort in "O(n + k log k)" time? It's not presented in the article at all. This should either be clarified or removed. — Preceding unsigned comment added by 46.208.172.118 (talk) 21:00, 24 April 2017 (UTC)Reply

It is almost certainly heapsort and the entire algorithm is heapselect. Stephen Howe (talk) 14:04, 16 April 2019 (UTC)Reply

I have a theory (though it's not well explained in the article). Did this text refer to using introselect followed by extraction of elements above the returned element, followed by as many of the returned element as needed to reach "k" elements and then a O(k log k) sort? — Preceding unsigned comment added by 2A00:79E0:D:207:E459:A6FF:470E:2439 (talk) 16:15, 26 April 2017 (UTC)Reply

GA Review

edit
GA toolbox
Reviewing
This review is transcluded from Talk:Selection algorithm/GA1. The edit link for this section can be used to add comments to the review.

Reviewer: RoySmith (talk · contribs) 14:31, 5 August 2023 (UTC)Reply

Starting review RoySmith (talk) 14:31, 5 August 2023 (UTC)Reply

Problem statement

edit
  • it should be possible to sort the values into an order from smallest to largest; for instance, they may be numbers... Maybe it's better to say "they may be real numbers", to explicitly exclude the non-orderable complex numbers.
  • some sources may assume that the values are all distinct from each other.... The word "source" could refer to "source code" or "a reference in the literature". I'm pretty sure you mean the later, but perhaps a different word would remove the ambiguity.

Algorithms

edit
  • selection of the kth smallest value in a collection of values can be performed very simply... Delete "very". Or maybe "very simply" -> "trivially".
  • if the output of the sorting algorithm is an array, jump to its kth element... I assume the intent of "is an array" is that it's some data structure which can be indexed in constant time, not strictly an Array (data structure).
  • A careful design of these factories leads to an algorithm that, when applied to median-finding, uses at most 2.942 comparisons. I don't have access to the source; I assume it explains where 2.942 comes from. Is it feasible to give a short description here of how that mysterious number is derived?
    • Searching for the source will find alternative copies on Google Scholar. I have deliberately avoided linking to them because it is not clear that they are linkable (copies made available by the author or publisher rather than pirated by others). The paper is not illuminating about where that specific constant comes from, and doesn't even give the algorithm in full detail, instead referring to Dor's PhD thesis. —David Eppstein (talk) 20:56, 5 August 2023 (UTC)Reply

Language support

edit
  • Python ... A linear-time selection algorithm is not included. needs a better citation than a link to a code repository.
    • The only book sources that I can find that state which algorithm this uses also state an incorrect analysis (the same incorrect analysis, maybe copied from each other or from some third source): see e.g. Hetland Python Algorithms or Aziz Elements of Programming Interviews in Python. I don't think it would be appropriate to refer readers to sources that are outright incorrect in this way. (It is not just that the analysis is weaker than it needs to be: it states incorrectly that each heap operation takes time logarithmic in the number of items to be selected rather than in the total number of items.) There are even more book sources that correctly advise the reader that the Python selection methods should only be used for small k relative to n (again, seemingly copied from each other or from some third source) without providing any detail about how small is small or why the supplied method has this limitation: see e.g. Padmanabhan Programming with Python, Martelli Python Cookbook, Beazley Python Cookbook, or Hellmann The Python 3 Standard Library by Example. The source code is primary rather than secondary but it is at least commented, freely available, definitive, and not vague or wrong. But maybe you have some better idea how to find a secondary source with the same information. Or do you think we should just tell readers the same thing the copied-from-each-other programmer guides all say, "only use this when k is small" but not how small or why? I suppose another possibility would be to refer to Hetland for the fact that Python uses heap selection, include a note on the reference stating that Hetland's analysis is wrong, and state that this choice of algorithm is why programmers are warned only to use this for small k (citing another one of the book sources). Would that make you happier, knowing that the sources are secondary published books but also that they are wrong? —David Eppstein (talk) 21:53, 5 August 2023 (UTC)Reply
  • If a WP:RS is available, it would be interesting to explore why the STL and Python implementations differ (one provides O(1), the other doesn't). Is there something intrinsic to the languages which drove this? As an aside, I did a little hunting and was surprised to find that Java (or even guava) doesn't supply anything.
    • Um. STL is not O(1). It is O(n). Python makes a number of odd non-performance-based design decisions. But one possibility is simply that they are averse to randomization and that the deterministic O(n) methods are not very practical. Another is that, when you're programming in Python, doing anything using the library is fast and doing anything with your own code is slow, because compiled C versus interpreted Python. So maybe they thought that heapselect was good enough to be classified as fast in the same way and didn't worry much about the details. Anyway, I don't know of sourcing for why these two implementations made this design decision. In the Python case it might be found in the developer mailing lists, but that would be even less of a reliable source than the released code. —David Eppstein (talk) 21:58, 5 August 2023 (UTC)Reply

Approximate algorithms

edit

As a thought for possible expansion, there's lots of places where you want "approximately the k-th largest". A search results page will want to return the k-th best matches but it's OK if that's approximate. If relaxing the exactness criteria gives to a significant speed improvement, that may be a good tradeoff. It would be interesting to explore variations on selection algorithms which allow this.

Specific GA criteria

edit

Did you know nomination

edit
The following is an archived discussion of the DYK nomination of the article below. Please do not modify this page. Subsequent comments should be made on the appropriate discussion page (such as this nomination's talk page, the article's talk page or Wikipedia talk:Did you know), unless there is consensus to re-open the discussion at this page. No further edits should be made to this page.

The result was: promoted by Vaticidalprophet (talk15:32, 25 August 2023 (UTC)Reply

Improved to Good Article status by David Eppstein (talk). Self-nominated at 04:14, 9 August 2023 (UTC). Post-promotion hook changes for this nom will be logged at Template talk:Did you know nominations/Selection algorithm; consider watching this nomination, if it is successful, until the hook appears on the Main Page.Reply

Incorrect, unsourced (WP:OR?) illustration

edit
 
Failed attempt to determine the median of five values using six comparisons

This article has an illustration from editor David Eppstein that claims to illustrate how a 5-sample median can be determined using six comparisons. The underlying selection algorithm is unsourced. Either the algorithm is not well enough explained (which therefore needs to be fixed) or it does in fact not work. The specific 5-sample sequence (3,4,2,3,2) has the median 3, but for this example the provided algorithm appears to yield the incorrect number 2. Someone please either clarify the algorithm or remove the illustration if it is - as I suspect - in fact not describing a valid, general method for determining a 5-sample median. Thanks. Lklundin (talk) 14:31, 30 September 2023 (UTC)Reply

Your illustration performs the algorithm incorrectly. The third line should look like
  4
 / \
3   3   2
|
2
(After comparing the top elements 3 and 4 of the 2-3 and 3-4 pairs, we add an adge from 3 up to 4 but leave the 2-3 and 3-4 pairs as they are; where does your 3-3 edge come from?) There are many sources for a six-comparison five-element median; the one we use is Knuth. —David Eppstein (talk) 16:06, 30 September 2023 (UTC)Reply
OK. What is not self-evident and what seems to be missing from the illustration, is what exactly is swapped following a comparison. One (natural?) assumption, that just the two nodes connected by the yellow line segment are (conditionally) swapped, is not correct per my uploaded image. From your additional comments (and the logic of the problem), I now understand that in line two and in line four, what is (conditionally) swapped is pairs of nodes connected by a vertical line segment. Nevertheless, in line three still only two nodes are swapped. This is apparently so because although the left node has a vertical connection, its connecting node has been discarded (and marked as red). The idea of this illustration and the plenty of information it gives is a good one, but if the illustration is not clear to me then others may be confused as well. Thanks. Lklundin (talk) 19:17, 30 September 2023 (UTC)Reply
Nothing is swapped. The comparisons are made between the elements indicated in yellow. Once those comparisons are made, a partially-ordered set isomorphic to the next one in the sequence results. —David Eppstein (talk) 20:23, 30 September 2023 (UTC)Reply
As an engineer designing median filters and using this article to better understand selection algorithms, I find the illustration incomprehensible. Could you provide a more specific reference than "the one we use is Knuth"?
John G Hasler (talk) 14:36, 30 August 2024 (UTC)Reply
Sure. Let me copy it from the references of the article for you:
Knuth, Donald E. (1998). "Section 5.3.3: Minimum-comparison selection". The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 207–219. ISBN 0-201-89685-0. —David Eppstein (talk) 18:53, 30 August 2024 (UTC)Reply
Thank you. I don't think my 1973 edition is quite the same as your 1998 one. While certainly relevant section 5.3.3 is much shorter and contains little about medians. One of the exercises is quite illuminating, though.
I was able to understand your illustration but only after basically re-inventing it myself. I think it needs more detail to be useful. John G Hasler (talk) 22:52, 31 August 2024 (UTC)Reply
Such as?
Labeling the elements to make a decision tree, for instance, would be an obvious thing to try but in this instance it would miss the point. The point of the illustration is that after factoring out symmetries and forgetting the detailed ordering of eliminated elements you don't need case analysis: there is only one choice of what your state looks like after each step. But adding labels would undo that factoring-out and give you many different cases (a tree with 127 nodes and 64 leaves, unlikely to be informative or readable). —David Eppstein (talk) 23:18, 31 August 2024 (UTC)Reply