Draft:Ball divergence

Ball divergence is a non-parametric two-sample statistical test for data in metric spaces. It measures the difference between two population probability distributions by integrating the squared difference of the two measures over all balls in the space^[1]. Under the conditions stated in the Properties section, its value is zero if and only if the two probability measures are the same. Like many non-parametric tests, the ball divergence test obtains its p-value through a permutation test.

Background

Distinguishing between two unknown samples in multivariate data is an important and challenging task. A commonly used non-parametric two-sample test is the energy distance test^[2]. However, the effectiveness of the energy distance test relies on moment conditions, which makes it less effective for extremely imbalanced data (where one sample is disproportionately larger than the other). To address this issue, Chen, Dou, and Qiao proposed a non-parametric multivariate test based on ensemble subsampling nearest neighbors (ESS-NN) for imbalanced data^[3]. This method handles imbalanced data and increases the power of the test by fixing the size of the smaller group while increasing the size of the larger group.

Additionally, Gretton et al. introduced the maximum mean discrepancy (MMD) for the two-sample problem^[4]. Both ESS-NN and MMD require additional parameter settings, such as the number of groups $k$ in ESS-NN and the kernel function in MMD. Ball divergence addresses the two-sample test problem for extremely imbalanced samples without introducing additional parameters.

Definition

We begin with the population ball divergence. Suppose that we have a metric space $(V,\|\cdot \|)$ , where the norm $\|\cdot \|$ induces a metric $\rho$ on $V$ : for two points $u,v$ in $V$ , $\rho (u,v)=\|u-v\|$ . We write ${\bar {B}}(u,\rho (u,v))$ for the closed ball with center $u$ and radius $\rho (u,v)$ . Then the population ball divergence of two Borel probability measures $\mu ,\nu$ is

$BD(\mu ,\nu )=\iint _{V\times V}[\mu -\nu ]^{2}({\bar {B}}(u,\rho (u,v)))(\mu (du)\mu (dv)+\nu (du)\nu (dv)).$

For convenience, we can decompose the ball divergence into two parts:

$A=\iint _{V\times V}[\mu -\nu ]^{2}({\bar {B}}(u,\rho (u,v)))\mu (du)\mu (dv),$

and

$C=\iint _{V\times V}[\mu -\nu ]^{2}({\bar {B}}(u,\rho (u,v)))\nu (du)\nu (dv).$

Thus $BD(\mu ,\nu )=A+C.$
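
As a reading aid, the integrand can be written out in full. This assumes the usual interpretation of $[\mu -\nu ]^{2}(\cdot )$ as the squared difference of the two measures of a set, which matches the squared differences of proportions in the sample version:

$[\mu -\nu ]^{2}({\bar {B}}(u,\rho (u,v)))=\left(\mu ({\bar {B}}(u,\rho (u,v)))-\nu ({\bar {B}}(u,\rho (u,v)))\right)^{2}.$

In particular, if $\mu =\nu$ the integrand vanishes for every ball, so $A=C=0$ and $BD(\mu ,\nu )=0$ .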

Next, we introduce the sample ball divergence. Let $\delta (x,y,z)=I(z\in {\bar {B}}(x,\rho (x,y)))$ indicate whether the point $z$ lies in the ball ${\bar {B}}(x,\rho (x,y))$ . Given two independent samples $\{X_{1},\ldots ,X_{n}\}$ from $\mu$ and $\{Y_{1},\ldots ,Y_{m}\}$ from $\nu$ , define

${\begin{aligned}&A_{ij}^{X}={\frac {1}{n}}\sum _{u=1}^{n}\delta \left(X_{i},X_{j},X_{u}\right),\quad A_{ij}^{Y}={\frac {1}{m}}\sum _{v=1}^{m}\delta \left(X_{i},X_{j},Y_{v}\right),\\&C_{kl}^{X}={\frac {1}{n}}\sum _{u=1}^{n}\delta \left(Y_{k},Y_{l},X_{u}\right),\quad C_{kl}^{Y}={\frac {1}{m}}\sum _{v=1}^{m}\delta \left(Y_{k},Y_{l},Y_{v}\right),\end{aligned}}$

where $A_{ij}^{X}$ is the proportion of the sample from $\mu$ located in the ball ${\bar {B}}\left(X_{i},\rho \left(X_{i},X_{j}\right)\right)$ and $A_{ij}^{Y}$ is the proportion of the sample from $\nu$ located in the same ball. Similarly, $C_{kl}^{X}$ and $C_{kl}^{Y}$ are the proportions of the samples from $\mu$ and $\nu$ , respectively, located in the ball ${\bar {B}}\left(Y_{k},\rho \left(Y_{k},Y_{l}\right)\right)$ . The sample versions of $A$ and $C$ are

$A_{n,m}={\frac {1}{n^{2}}}\sum _{i,j=1}^{n}\left(A_{ij}^{X}-A_{ij}^{Y}\right)^{2},\qquad C_{n,m}={\frac {1}{m^{2}}}\sum _{k,l=1}^{m}\left(C_{kl}^{X}-C_{kl}^{Y}\right)^{2}.$

Finally, the sample ball divergence is

$BD_{n,m}=A_{n,m}+C_{n,m}.$
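
The sample definitions above can be evaluated directly by brute force. The following is a minimal sketch rather than a reference implementation: the function name `ball_divergence` is hypothetical, the Euclidean norm is assumed for $\rho$ , and the approach stores arrays of size $n^{2}(n+m)$ , so it is only suitable for small samples.

```python
import numpy as np

def ball_divergence(x, y):
    """Brute-force sample ball divergence BD_{n,m} = A_{n,m} + C_{n,m}.

    Assumes the Euclidean norm; x and y are array-likes of shape
    (n,) or (n, d) and (m,) or (m, d).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]

    # Pairwise distances: dxx[i, j] = rho(X_i, X_j), and so on.
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    dyy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    dyx = dxy.T

    # delta(X_i, X_j, z) = 1 when rho(X_i, z) <= rho(X_i, X_j).
    # Axis 2 runs over the points z being counted inside each ball.
    a_x = (dxx[:, None, :] <= dxx[:, :, None]).mean(axis=2)  # A^X_ij
    a_y = (dxy[:, None, :] <= dxx[:, :, None]).mean(axis=2)  # A^Y_ij
    c_x = (dyx[:, None, :] <= dyy[:, :, None]).mean(axis=2)  # C^X_kl
    c_y = (dyy[:, None, :] <= dyy[:, :, None]).mean(axis=2)  # C^Y_kl

    a_nm = ((a_x - a_y) ** 2).mean()  # (1/n^2) * sum over i, j
    c_nm = ((c_x - c_y) ** 2).mean()  # (1/m^2) * sum over k, l
    return float(a_nm + c_nm)

print(ball_divergence([0.0, 1.0], [2.0, 3.0]))  # 0.875
```

For the toy samples $\{0,1\}$ and $\{2,3\}$ , working through the definitions by hand gives $A_{n,m}=C_{n,m}=0.4375$ , which agrees with the printed value. Passing the same sample twice returns zero, since every pair of proportions then coincides.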

Properties

1. Given two Borel probability measures $\mu$ and $\nu$ on a finite-dimensional Banach space $V$ , we have $BD(\mu ,\nu )\geq 0$ , where equality holds if and only if $\mu =\nu$ .

2. Suppose $\mu$ and $\nu$ are two Borel probability measures on a separable Banach space $V$ , with supports $S_{\mu }$ and $S_{\nu }$ . If $S_{\mu }=V$ or $S_{\nu }=V$ , then $BD(\mu ,\nu )\geq 0$ , where equality holds if and only if $\mu =\nu$ .

3. Consistency: We have

$BD_{n,m}{\xrightarrow[{n,m\rightarrow \infty }]{\text{ a.s. }}}BD(\mu ,\nu ),$ where ${\frac {n}{n+m}}\rightarrow \tau$ for some $\tau \in [0,1]$ .

The asymptotic null distribution in property 4 uses the following quantities. Define $\xi (x,y,z_{1},z_{2})=\delta (x,y,z_{1})\cdot \delta (x,y,z_{2})$ , and let

$Q\left(x,y;x^{\prime },y^{\prime }\right)=\left(\phi _{A}^{(2,0)}\left(x,x^{\prime }\right)+\phi _{A}^{(1,1)}(x,y)+\phi _{A}^{(1,1)}\left(x^{\prime },y^{\prime }\right)+\phi _{A}^{(0,2)}\left(y,y^{\prime }\right)\right),$

where

${\begin{aligned}\phi _{A}^{(2,0)}\left(x,x^{\prime }\right)=&E\left[\xi \left(X_{1},X_{2},x,x^{\prime }\right)\right]+E\left[\xi \left(X_{1},X_{2},Y,Y_{3}\right)\right]\\&-E\left[\xi \left(X_{1},X_{2},x,Y\right)\right]-E\left[\xi \left(X_{1},X_{2},x^{\prime },Y_{3}\right)\right]\\\phi _{A}^{(1,1)}(x,y)=&E\left[\xi \left(X_{1},X_{2},x,X_{3}\right)\right]+E\left[\xi \left(X_{1},X_{2},y,Y_{3}\right)\right]\\&-E\left[\xi \left(X_{1},X_{2},x,y\right)\right]-E\left[\xi \left(X_{1},X_{2},X_{3},Y_{3}\right)\right]\\\phi _{A}^{(0,2)}\left(y,y^{\prime }\right)=&E\left[\xi \left(X_{1},X_{2},X,X_{3}\right)\right]+E\left[\xi \left(X_{1},X_{2},y,y^{\prime }\right)\right]\\&-E\left[\xi \left(X_{1},X_{2},X,y\right)\right]-E\left[\xi \left(X_{1},X_{2},X,y^{\prime }\right)\right].\end{aligned}}$

The function $Q\left(x,y;x^{\prime },y^{\prime }\right)$ has the spectral decomposition

$Q\left(x,y;x^{\prime },y^{\prime }\right)=\sum _{k=1}^{\infty }\lambda _{k}f_{k}(x,y)f_{k}\left(x^{\prime },y^{\prime }\right),$

where $\lambda _{k}$ and $f_{k}$ are the eigenvalues and eigenfunctions of $Q$ . For $k=1,2,\ldots$ , let $Z_{1k},Z_{2k}$ be i.i.d. $N(0,1)$ random variables, and let

${\begin{aligned}a_{k}^{2}(\tau )&=(1-\tau )E_{X}\left[E_{Y}f_{k}(X,Y)\right]^{2},\quad b_{k}^{2}(\tau )=\tau E_{Y}\left[E_{X}f_{k}(X,Y)\right]^{2},\\\theta &=2E\left[E\left(\delta \left(X_{1},X_{2},X\right)\left(1-\delta \left(X_{1},X_{2},Y\right)\right)\mid X_{1},X_{2}\right)\right].\end{aligned}}$

4. Asymptotic distribution under the null hypothesis: Suppose that both $n$ and $m\rightarrow \infty$ in such a way that ${\frac {n}{n+m}}\rightarrow \tau$ with $0\leq \tau \leq 1$ . Under the null hypothesis, we have

${\frac {nm}{n+m}}BD_{n,m}{\xrightarrow[{n,m\rightarrow \infty }]{d}}\sum _{k=1}^{\infty }2\lambda _{k}\left[\left(a_{k}(\tau )Z_{1k}+b_{k}(\tau )Z_{2k}\right)^{2}-\left(a_{k}^{2}(\tau )+b_{k}^{2}(\tau )\right)\right]+\theta .$

5. Distribution under the alternative hypothesis: Let

$\delta _{1,0}^{2}=\operatorname {Var} \left(g^{(1,0)}(X)\right)\quad {\text{ and }}\quad \delta _{0,1}^{2}=\operatorname {Var} \left(g^{(0,1)}(Y)\right).$

Suppose that both $n$ and $m\rightarrow \infty$ in such a way that ${\frac {n}{n+m}}\rightarrow \tau$ with $0\leq \tau \leq 1$ . Under the alternative hypothesis, we have

${\sqrt {\frac {nm}{n+m}}}\left(BD_{n,m}-BD(\mu ,\nu )\right){\xrightarrow[{n,m\rightarrow \infty }]{d}}N\left(0,(1-\tau )\delta _{1,0}^{2}+\tau \delta _{0,1}^{2}\right).$

6. The test based on $BD_{n,m}$ is consistent against any general alternative $H_{1}$ . More specifically, $\lim _{n\rightarrow \infty }\operatorname {Var} _{H_{1}}\left(BD_{n,m}\right)=0$ and $\Delta (\eta ):=\liminf _{n\rightarrow \infty }\left(E_{H_{1}}BD_{n,m}-E_{H_{0}}BD_{n,m}\right)>0.$ More importantly, $\Delta (\eta )$ can also be expressed as $\Delta (\eta )\equiv BD(\mu ,\nu ),$ which is independent of $\eta$ .
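
The limiting null distribution in property 4 involves the eigenvalues $\lambda _{k}$ , which depend on the underlying distributions, so in practice the p-value is obtained by permutation. The sketch below shows one way this could be done. It is illustrative only: the function names `bd_from_distances` and `permutation_test` are hypothetical, the Euclidean norm is assumed, and the default of 199 permutations and the add-one correction in the p-value are conventional choices rather than part of the definition.

```python
import numpy as np

def bd_from_distances(d, ix, iy):
    """Sample ball divergence from a pooled distance matrix d.

    ix and iy are integer index arrays selecting the two samples.
    """
    def half(center, other):
        # Balls are centred at, and sized by, points of `center`.
        dcc = d[np.ix_(center, center)]
        dco = d[np.ix_(center, other)]
        radius = dcc[:, :, None]
        own = (dcc[:, None, :] <= radius).mean(axis=2)
        oth = (dco[:, None, :] <= radius).mean(axis=2)
        return ((own - oth) ** 2).mean()

    # half(ix, iy) is A_{n,m}; half(iy, ix) is C_{n,m}.
    return float(half(ix, iy) + half(iy, ix))

def permutation_test(x, y, n_perm=199, seed=0):
    """Return (observed statistic, permutation p-value)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    z = np.vstack([x, y])
    n, total = len(x), len(z)

    # Distances are computed once; permuting labels only re-indexes them.
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    idx = np.arange(total)
    observed = bd_from_distances(d, idx[:n], idx[n:])

    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(total)
        stat = bd_from_distances(d, perm[:n], perm[n:])
        exceed += int(stat >= observed)

    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + exceed) / (n_perm + 1)

stat, p_value = permutation_test([0.0, 1.0, 2.0], [10.0, 11.0, 12.0, 13.0])
print(stat, p_value)
```

Reusing the pooled distance matrix means each permutation costs only the ball-counting step, which matters because the statistic is recomputed `n_perm` times.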