Wikipedia:Reference desk/Archives/Mathematics/2021 October 6

October 6

Testing for biased/unbiased sampling from a distribution

Hi guys, I have a stats question that I'm not quite sure how to formulate, but I think this comes close. I have a (small) set of latitudes, call it set L. I have a number of (smaller) subsets of latitudes, call each set S, that contain latitudes from L where a certain event was observed. I am now trying to figure out whether these events happened independently of latitude - i.e., whether an S can be regarded as a random sample from L.

My first impulse was to test for equivalence of distributions (using a permutation test - I stopped messing around with Welch's/KS once I realized that these can't be regarded as continuous values). However, I realized that this is flawed, because of course S is composed of samples from the same distribution; basically what I was testing for was "is S large enough to be recognizable as L", which isn't really the question. I guess what I actually want to test for is "is S the result of unbiased sampling from L", and I couldn't come up with a way to address that. Any ideas?

The image shows the L set (left) and an S set (right), the latter with a p-value for a Fisher-Pitman permutation test for equal distributions - but as I said, that is probably inapplicable. --Elmidae (talk · contribs) 18:57, 6 October 2021 (UTC)[reply]

Your data sets appear to be from a discrete distribution. Our article on the Kolmogorov–Smirnov test contains a section Discrete and mixed null distribution. Is that not applicable? The relatively low statistical power may be a problem, but I'm not sure much can be done about that unless we have a parametric model for the underlying distribution of the L data set.  --Lambiam 21:49, 6 October 2021 (UTC)[reply]
I know how to test whether the distributions are equal (the Fisher-Pitman permutation test does that just fine, or I could permute the KS test). But I suspect my question is not solved by testing for distribution equality. I know S is from the same distribution as L, but I need to figure out whether it represents a random sample. Isn't that a different question? --Elmidae (talk · contribs) 22:09, 6 October 2021 (UTC)[reply]
No, a biased sample would have a different probability distribution even if it has the same sample space. The sample space tells you what the possible values are, and the distribution also tells you how probable those values are. Introducing a bias changes the probabilities, so it gives a different distribution. --Amble (talk) 22:15, 6 October 2021 (UTC)[reply]
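As a quick numerical illustration of this point (all numbers invented, and the poleward-weighting rule is an arbitrary choice for the demonstration): a subset drawn uniformly from a latitude pool and a subset drawn with selection probability increasing toward the poles share the same sample space, but their empirical distributions differ.

import numpy as np

rng = np.random.default_rng(0)
L = rng.uniform(-60.0, 60.0, size=200)  # hypothetical latitude pool

# Unbiased subset: every latitude equally likely to be picked.
unbiased = rng.choice(L, size=30, replace=False)

# Biased subset: selection probability grows with distance from the equator.
w = np.abs(L) + 1.0
biased = rng.choice(L, size=30, replace=False, p=w / w.sum())

print(np.mean(np.abs(L)), np.mean(np.abs(unbiased)), np.mean(np.abs(biased)))
# The biased subset piles up toward the poles: same sample space,
# different probabilities, hence a different distribution.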
Hmm. I see. So I should probably just try to find an equality test that can deal well with small sample sizes, and go by that result. Thanks! --Elmidae (talk · contribs) 22:22, 6 October 2021 (UTC)[reply]
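A minimal sketch of the permuted KS test mentioned above, adapted to the "is S a random sample from L" framing: since S is a subset of L, the null hypothesis "S is a uniformly drawn subset of L" can be simulated directly by drawing many random size-|S| subsets of L. The function name, draw count, and seed here are placeholders.

import numpy as np
from scipy.stats import ks_2samp

def ks_subset_pvalue(L, S, n_draws=10_000, seed=0):
    """Compare KS(L, S) against KS(L, random size-|S| subset of L)."""
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    S = np.asarray(S, dtype=float)
    observed = ks_2samp(L, S).statistic
    hits = sum(
        ks_2samp(L, rng.choice(L, size=len(S), replace=False)).statistic >= observed
        for _ in range(n_draws)
    )
    # Empirical p-value, counting the observed arrangement itself.
    return (hits + 1) / (n_draws + 1)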
Many nonparametric tests can be found on the page Category:Nonparametric statistics. Many are not equality tests, though. I searched for papers comparing the power of nonparametric equality tests but found nothing useful.  --Lambiam 05:56, 7 October 2021 (UTC)[reply]
In addition to KS and ranking tests like the Mann–Whitney U test, there are empirical ways to test for sampling bias. Run a bootstrap on the L distribution, and calculate mutual information of L and the bootstrap sample. This will yield a distribution of MI values. Then compare the MI of L and S against the distribution to calculate an empirical p-value. If you have some sensible priors, a Bayesian approach might be applicable, too. --{{u|Mark viking}} {Talk} 18:41, 7 October 2021 (UTC)[reply]
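One possible reading of this recipe in Python, using the Kullback–Leibler divergence between binned empirical distributions as the information-theoretic statistic; the binning, the smoothing constant, and the resample counts are assumptions of this sketch rather than part of the suggestion above.

import numpy as np
from scipy.stats import entropy  # entropy(p, q) gives the KL divergence D(p || q)

def kl_from_pool(sample, L, bins, eps=1e-9):
    """KL divergence of a sample's binned distribution from that of L."""
    p = np.histogram(sample, bins=bins)[0] + eps  # smoothing: avoid empty bins
    q = np.histogram(L, bins=bins)[0] + eps
    return entropy(p, q)

def bootstrap_pvalue(L, S, n_boot=10_000, n_bins=10, seed=0):
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    bins = np.histogram_bin_edges(L, bins=n_bins)
    observed = kl_from_pool(S, L, bins)
    # Null distribution: divergence of bootstrap resamples of size |S| from L.
    null = np.array([
        kl_from_pool(rng.choice(L, size=len(S), replace=True), L, bins)
        for _ in range(n_boot)
    ])
    return (np.sum(null >= observed) + 1) / (n_boot + 1)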
I'm not entirely clear about the differences between the permutation test I have been using and bootstrapping. I believe for permutation, you calculate the basic test statistic T0 on the two sets, then perform many rearrangements of units between the sets and compute a statistic Ti for each, then derive a p-value from the fraction of instances where Ti ≥ T0. Whereas for bootstrapping, you create a new distribution from many resamples and calculate the p-value from the comparison of the two distributions. Still, I've seen people say that permutation and bootstrapping can be considered equivalent methods for small sample sizes. Not sure I see that... but neither makes any assumptions about the underlying distributions, which is the main point here, I guess. --Elmidae (talk · contribs) 20:59, 7 October 2021 (UTC)[reply]
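The procedure described above, as a generic sketch; the absolute difference of means is just a placeholder statistic, and the permutation count and seed are arbitrary.

import numpy as np

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """P-value = fraction of rearrangements whose statistic Ti is at
    least as extreme as the observed T0 (difference of means here)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    t0 = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # rearrange units between the sets
        ti = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        if ti >= t0:
            hits += 1
    return (hits + 1) / (n_perm + 1)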
Both permutation and bootstrap are resampling methods, so they are conceptually related. The permutation approach samples without replacement, while the bootstrap typically samples with replacement. Sampling without replacement introduces a mild sampling bias, because the no-replacement restriction adds a constraint not present in the original sampling process. --{{u|Mark viking}} {Talk} 22:41, 7 October 2021 (UTC)[reply]
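The distinction in two lines of numpy (the array contents are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
pooled = np.arange(10)

reshuffle = rng.permutation(pooled)                    # permutation: without replacement
bootstrap = rng.choice(pooled, size=10, replace=True)  # bootstrap: duplicates possible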
Oh, I missed that you were specifically talking about mutual information - which is a thing based on Kullback–Leibler divergence, I see? New territory - looks interesting though! --Elmidae (talk · contribs) 21:03, 7 October 2021 (UTC)[reply]