The picture below shows the red blood cell hemoglobin concentration and the red blood cell volume of two groups of people: the Anemia group and the Control group (i.e. the group of people without anemia). As expected, people with anemia have a lower red blood cell volume and a lower red blood cell hemoglobin concentration than those without anemia.
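Let $x_i$ denote the $i$-th observation (red blood cell volume and red blood cell hemoglobin concentration), and let $z_i$ denote the group to which $x_i$ belongs, with $z_i = 0$ when $x_i$ belongs to the Anemia group and $z_i = 1$ when $x_i$ belongs to the Control group. Also, $z_i \sim \operatorname{Categorical}(k, \phi)$, where $k = 2$, $\phi_j \geq 0$, and $\sum_j \phi_j = 1$. See Categorical distribution.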
If $z_i$ is known, the parameters can be estimated quite simply by maximum likelihood estimation. If $z_i$ is unknown, the estimation is much more complicated.[2]
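To illustrate why the labeled case is simple, the following is a minimal sketch of the closed-form maximum likelihood estimates when every group label is observed. The function name `fit_labeled_gaussians`, the variable names, and the synthetic numbers are assumptions made for illustration only; they are not the data from the picture.

```python
import numpy as np

def fit_labeled_gaussians(x, z, k=2):
    """Maximum likelihood estimates when every label z_i is observed."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z)
    n, d = x.shape
    phi = np.zeros(k)            # group proportions
    mu = np.zeros((k, d))        # per-group means
    sigma = np.zeros((k, d, d))  # per-group covariance matrices
    for j in range(k):
        xj = x[z == j]                           # observations labeled j
        phi[j] = len(xj) / n                     # fraction of points in group j
        mu[j] = xj.mean(axis=0)                  # sample mean of group j
        centered = xj - mu[j]
        sigma[j] = centered.T @ centered / len(xj)  # MLE covariance (divides by n_j)
    return phi, mu, sigma

# Synthetic two-group data with arbitrary, made-up centers (illustration only).
rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal([70.0, 28.0], 3.0, size=(200, 2)),  # stand-in for group 0
    rng.normal([90.0, 33.0], 3.0, size=(300, 2)),  # stand-in for group 1
])
z = np.repeat([0, 1], [200, 300])

phi, mu, sigma = fit_labeled_gaussians(x, z)
```

With labels available, each estimate only needs counts and averages within a group, so no iteration is required.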
Because $z$ is a latent variable (i.e. not observed), in the unlabeled scenario the Expectation Maximization (EM) algorithm is needed to estimate $z$ together with the other parameters. This problem is generally set up as a Gaussian mixture model (GMM), since the data in each group are normally distributed.[3]
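As a sketch of what the GMM formulation amounts to, the model can be written out as follows. The symbols $\mu_j$ and $\Sigma_j$ for the mean and covariance of group $j$ are assumed notation here, chosen to match the usual GMM convention.

```latex
\begin{aligned}
z_i &\sim \operatorname{Categorical}(\phi), \qquad j \in \{0, 1\},\\
x_i \mid z_i = j &\sim \mathcal{N}(\mu_j, \Sigma_j),\\
p(x_i) &= \sum_{j} \phi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j).
\end{aligned}
```

The last line is the marginal density of an observation once the unobserved label has been summed out. This is the quantity whose likelihood has no closed-form maximizer.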
In machine learning, the latent variable $z$ is regarded as a hidden pattern underlying the data, which the observer cannot see directly. Here $x_i$ is the observed data, while $\phi$, $\mu$, and $\Sigma$ are the parameters of the model. With the EM algorithm, the underlying pattern $z$ in the data $x_i$ can be found along with estimates of the parameters. The wide range of problems of this kind in machine learning is what makes the EM algorithm so important.
The EM algorithm consists of two steps: the E-step and the M-step. First, the model parameters and the $z_i$ can be randomly initialized. In the E-step, the algorithm guesses the value of each $z_i$ based on the current parameters; in the M-step, it updates the model parameters based on the E-step's guess of $z_i$. These two steps are repeated until convergence is reached.
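The two steps can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a reference implementation: the "guess" of each group label is represented as a soft membership probability `w[i, j]`, the initial means are drawn at random from the data, the initial covariances are set to the overall sample covariance, and convergence is checked on the log-likelihood. The names `em_gmm` and `gaussian_pdf` and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(expo) / norm

def em_gmm(x, k=2, n_iter=500, tol=1e-8, seed=0):
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    rng = np.random.default_rng(seed)
    # Initialization (an assumption of this sketch): equal proportions,
    # means at randomly chosen data points, shared overall covariance.
    phi = np.full(k, 1.0 / k)
    mu = x[rng.choice(n, size=k, replace=False)]
    sigma = np.array([np.cov(x.T) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: soft guess of each label given the current parameters.
        dens = np.column_stack(
            [phi[j] * gaussian_pdf(x, mu[j], sigma[j]) for j in range(k)]
        )
        total = dens.sum(axis=1, keepdims=True)
        w = dens / total  # w[i, j] = P(z_i = j | x_i, current parameters)
        # M-step: weighted versions of the labeled-data estimates.
        nk = w.sum(axis=0)
        phi = nk / n
        mu = (w.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / nk[j]
        # Stop when the log-likelihood (under the pre-update parameters) stalls.
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return phi, mu, sigma, w

# Synthetic unlabeled data with arbitrary, made-up centers (illustration only).
rng = np.random.default_rng(1)
x = np.vstack([
    rng.normal([70.0, 28.0], 3.0, size=(200, 2)),
    rng.normal([90.0, 33.0], 3.0, size=(300, 2)),
])
phi, mu, sigma, w = em_gmm(x)
labels = w.argmax(axis=1)  # hard assignment derived from the soft guesses
```

The component order returned by EM is arbitrary, so component 0 need not correspond to any particular group. This sketch also omits safeguards such as covariance regularization and log-domain arithmetic, which a production implementation would likely need.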