Talk:Selection algorithm/GA1

GA Review

Reviewer: RoySmith (talk · contribs) 14:31, 5 August 2023 (UTC)

Starting review. RoySmith (talk) 14:31, 5 August 2023 (UTC)

Problem statement

"it should be possible to sort the values into an order from smallest to largest; for instance, they may be numbers..." Maybe it's better to say "they may be real numbers", to explicitly exclude the non-orderable complex numbers.
- Computers cannot represent arbitrary real numbers. Changed from "numbers or ..." to "integers, floating-point numbers, or ...". —David Eppstein (talk) 20:41, 5 August 2023 (UTC)
"some sources may assume that the values are all distinct from each other...." The word "source" could refer to "source code" or "a reference in the literature". I'm pretty sure you mean the latter, but perhaps a different word would remove the ambiguity.
- Replaced by "works on this problem". —David Eppstein (talk) 20:41, 5 August 2023 (UTC)

Algorithms

"selection of the kth smallest value in a collection of values can be performed very simply..." Delete "very". Or maybe "very simply" -> "trivially".
- Removed both words. —David Eppstein (talk) 20:45, 5 August 2023 (UTC)
"if the output of the sorting algorithm is an array, jump to its kth element..." I assume the intent of "is an array" is that it's some data structure which can be indexed in constant time, not strictly an Array (data structure).
- The intent is Array (data type). Link added to make this clearer. Also, although constant time would be typical, it is not a necessary component of this method. —David Eppstein (talk) 20:48, 5 August 2023 (UTC)
"A careful design of these factories leads to an algorithm that, when applied to median-finding, uses at most 2.942n comparisons." I don't have access to the source; I assume it explains where 2.942 comes from. Is it feasible to give a short description here of how that mysterious number is derived?
- Searching for the source will find alternative copies on Google Scholar. I have deliberately avoided linking to them because it is not clear that they are linkable (copies made available by the author or publisher rather than pirated by others). The paper is not illuminating about where that specific constant comes from, and doesn't even give the algorithm in full detail, instead referring to Dor's PhD thesis. —David Eppstein (talk) 20:56, 5 August 2023 (UTC)
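
For illustration, the sort-and-index method raised earlier in this thread can be sketched in Python as follows. The function name kth_smallest_by_sorting and the 1-indexed convention for k are assumptions made for this example; they are not taken from the article. As noted in the reply above, constant-time indexing is typical for the sorted output but is not essential to the method.

```python
def kth_smallest_by_sorting(values, k):
    """Return the k-th smallest value (1-indexed) by sorting a copy.

    Hypothetical helper for illustration only. The cost is dominated
    by the sort, so this takes O(n log n) time for n values.
    """
    ordered = sorted(values)  # sorted() returns a new list
    if not 1 <= k <= len(ordered):
        raise ValueError("k must be between 1 and the number of values")
    return ordered[k - 1]  # jump to the k-th element of the sorted array


print(kth_smallest_by_sorting([7, 2, 9, 4, 1], 2))
```

Repeated values need no special handling here, because sorting places equal values next to each other.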

Language support

"Python ... A linear-time selection algorithm is not included." This needs a better citation than a link to a code repository.
- The only book sources that I can find that state which algorithm this uses also state an incorrect analysis (the same incorrect analysis, maybe copied from each other or from some third source): see e.g. Hetland, Python Algorithms, or Aziz, Elements of Programming Interviews in Python. I don't think it would be appropriate to refer readers to sources that are outright incorrect in this way. (It is not just that the analysis is weaker than it needs to be: it states incorrectly that each heap operation takes time logarithmic in the number of items to be selected rather than in the total number of items.) There are even more book sources that correctly advise the reader that the Python selection methods should only be used for small k relative to n (again, seemingly copied from each other or from some third source) without providing any detail about how small is small or why the supplied method has this limitation: see e.g. Padmanabhan, Programming with Python; Martelli, Python Cookbook; Beazley, Python Cookbook; or Hellmann, The Python 3 Standard Library by Example. The source code is primary rather than secondary but it is at least commented, freely available, definitive, and not vague or wrong. But maybe you have some better idea how to find a secondary source with the same information. Or do you think we should just tell readers the same thing the copied-from-each-other programmer guides all say, "only use this when k is small" but not how small or why? I suppose another possibility would be to refer to Hetland for the fact that Python uses heap selection, include a note on the reference stating that Hetland's analysis is wrong, and state that this choice of algorithm is why programmers are warned only to use this for small k (citing another one of the book sources). Would that make you happier, knowing that the sources are secondary published books but also that they are wrong? —David Eppstein (talk) 21:53, 5 August 2023 (UTC)
If a WP:RS is available, it would be interesting to explore why the STL and Python implementations differ (one provides O(1), the other doesn't). Is there something intrinsic to the languages which drove this? As an aside, I did a little hunting and was surprised to find that Java (or even Guava) doesn't supply anything.
- Um. STL is not O(1). It is O(n). Python makes a number of odd non-performance-based design decisions. But one possibility is simply that they are averse to randomization and that the deterministic O(n) methods are not very practical. Another is that, when you're programming in Python, doing anything using the library is fast and doing anything with your own code is slow, because compiled C versus interpreted Python. So maybe they thought that heapselect was good enough to be classified as fast in the same way and didn't worry much about the details. Anyway, I don't know of sourcing for why these two implementations made this design decision. In the Python case it might be found in the developer mailing lists, but that would be even less of a reliable source than the released code. —David Eppstein (talk) 21:58, 5 August 2023 (UTC)
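
To make the contrast in this thread concrete, here is a minimal Python sketch of two ways to select the k-th smallest value. The first calls the standard library's heapq.nsmallest, the library facility discussed above; the sketch shows only how it is called and makes no claim about its internals. The second is a textbook randomized quickselect with expected linear time, included only as an example of the kind of randomized method mentioned in the reply; it is not the code used by the STL or by any other library. The function names and the 1-indexed k are assumptions for this example.

```python
import heapq
import random


def kth_smallest_heap(values, k):
    """k-th smallest (1-indexed) using heapq.nsmallest from the standard library."""
    if k < 1:
        raise ValueError("k must be at least 1")
    smallest = heapq.nsmallest(k, values)  # the k smallest values, in sorted order
    if len(smallest) < k:
        raise ValueError("k exceeds the number of values")
    return smallest[-1]


def quickselect(values, k, rng=random):
    """k-th smallest (1-indexed) by randomized quickselect (illustrative sketch).

    Expected O(n) time; a run of bad pivots can make it slower.
    """
    items = list(values)
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and the number of values")
    target = k - 1  # 0-indexed rank still being searched for
    while True:
        pivot = rng.choice(items)
        lower = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if target < len(lower):
            items = lower  # the answer lies among the smaller values
        elif target < len(lower) + equal_count:
            return pivot  # the pivot, or a value equal to it, has the target rank
        else:
            target -= len(lower) + equal_count
            items = [x for x in items if x > pivot]


data = [5, 3, 8, 3, 1, 9, 2]
print(kth_smallest_heap(data, 4), quickselect(data, 4))
```

Counting the values equal to the pivot separately keeps the quickselect sketch correct when the input contains repeated values.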

Approximate algorithms

As a thought for possible expansion, there are many places where you want "approximately the k-th largest". A search results page will want to return the k best matches, but it's OK if that's approximate. If relaxing the exactness criterion gives a significant speed improvement, that may be a good tradeoff. It would be interesting to explore variations on selection algorithms which allow this.
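
One simple way to trade exactness for speed, shown here only as an illustration and not drawn from the article or its sources, is to select from a random sample instead of the full input. The sketch below assumes the values are held in a list; the function name approx_kth_smallest and the sample_size parameter are hypothetical. It returns a value whose rank is likely to be near k, with no guarantee on how near.

```python
import random


def approx_kth_smallest(values, k, sample_size, rng=None):
    """Estimate the k-th smallest (1-indexed) value from a random sample.

    Hypothetical sketch: `values` must be a sequence such as a list.
    Only the sample is sorted, so the work is O(m log m) for a sample
    of size m, plus the cost of drawing the sample.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    rng = rng or random.Random()
    m = max(1, min(sample_size, n))
    sample = sorted(rng.sample(values, m))  # sample without replacement
    # Scale the target rank from the full input down to the sample.
    j = max(1, min(m, round(k * m / n)))
    return sample[j - 1]


data = list(range(1000))
random.shuffle(data)
print(approx_kth_smallest(data, 500, sample_size=100))
```

When sample_size is at least the number of values, the sample is the whole input and the result is exact; smaller samples give faster but rougher answers.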

Specific GA criteria

MOS:LEADLENGTH argues for a longer lead. Personally, I think the lead is fine as is, but mentioning this for completeness.
Earwig didn't report any issues, and I didn't dig any further.
No problems with neutrality or stability.
Illustrations are relevant, captioned, and appropriately licensed.
@David Eppstein: OK, that's pretty much it, I'll toss it back to you to address the issues above. RoySmith (talk) 15:51, 5 August 2023 (UTC)
Thanks! I'm still traveling today but should have more time to work on this starting tomorrow. —David Eppstein (talk) 15:57, 5 August 2023 (UTC)
I'm still a little uncomfortable with using a link to a source repository as a citation, but I'm not going to hold things up over that. I would however suggest that https://github.com/python/cpython is the official repository, so at least link to that instead. That SVN repo must be ancient. RoySmith (talk) 15:52, 6 August 2023 (UTC)
