Gaussian Finite Mixture Model
A model-based clustering method.
Finite mixture models address the situation in which a sample is drawn from a population consisting of a finite number of components (subpopulations). WLOG, we adopt the common form of the density for a $G$-component mixture, which gives the likelihood

$$L(\Psi) = \prod_{i=1}^{n} f(x_i; \Psi) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g f_g(x_i; \theta_g), \qquad \Psi = (\pi_1, \pi_2, \dots, \pi_{G-1}, \theta_1, \dots, \theta_G)'$$

where $\pi_G = 1 - \sum_{g=1}^{G-1} \pi_g$, so only $G-1$ mixing proportions are free parameters.
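For concreteness, here is a minimal numpy sketch of this likelihood (on the log scale), assuming univariate Gaussian components $f_g = N(\mu_g, \sigma_g^2)$; the function name and this choice of component density are illustrative, not prescribed by the notes.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, pi, mu, sigma):
    """Log-likelihood of a G-component univariate Gaussian mixture.

    x: (n,) observations; pi: (G,) mixing proportions summing to 1;
    mu, sigma: (G,) component means and standard deviations.
    """
    # densities[i, g] = f_g(x_i; theta_g)
    densities = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    # f(x_i; Psi) = sum_g pi_g * f_g(x_i; theta_g)
    mixture_density = densities @ pi
    return np.sum(np.log(mixture_density))
```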
If we use the conventional method of maximum likelihood, we will not obtain an explicit solution for the parameters. For example, differentiating the log-likelihood with respect to $\pi_g$ (using $\pi_G = 1 - \sum_{g=1}^{G-1} \pi_g$) gives

$$\frac{\partial l}{\partial \pi_g} = \sum_{i=1}^{n} \frac{f_g(x_i; \theta_g) - f_G(x_i; \theta_G)}{\sum_{h=1}^{G} \pi_h f_h(x_i; \theta_h)} = 0$$

which has no closed-form solution. To use the Expectation-Maximization (EM) algorithm, we recast this as an incomplete-data problem: the observed data are $X$, and the unobserved data are the component memberships of the observations,
$$z = (z_1', \dots, z_n')', \qquad y_i = (x_i', z_i')'$$
where

$$Z_{ig} = \begin{cases} 1, & \text{if the } i\text{th obs is from the } g\text{th component} \\ 0, & \text{otherwise} \end{cases}$$

so $Z_i$ follows a Multinomial$(1, \tilde{\pi})$ distribution: the mixing proportions $\pi_g$ act as prior membership probabilities, and the E-step below computes the corresponding posterior probabilities.
Now the complete-data likelihood for Ψ has the form
$$L_c(\Psi; y) = \prod_{i=1}^{n} f(y_i; \Psi) = \prod_{i=1}^{n} f(x_i, z_i; \Psi) = \prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \pi_g f_g(x_i; \theta_g) \right]^{z_{ig}}$$

Now we can formulate a general E-step and M-step for fitting a mixture model. The complete-data log-likelihood is
$$l_c(\Psi; y) = \sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig} \log\left[ \pi_g f_g(x_i; \theta_g) \right] = \sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig} \left[ \log \pi_g + \log f_g(x_i; \theta_g) \right]$$
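As a sketch of $l_c$, continuing the illustrative univariate Gaussian assumption, with the unobserved $z$ encoded as a one-hot matrix:

```python
import numpy as np
from scipy.stats import norm

def complete_data_log_likelihood(x, z, pi, mu, sigma):
    """l_c(Psi; y) = sum_i sum_g z_ig * [log pi_g + log f_g(x_i; theta_g)].

    z: (n, G) one-hot component membership indicators (unobserved in practice).
    """
    # log_f[i, g] = log f_g(x_i; theta_g)
    log_f = norm.logpdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(z * (np.log(pi)[None, :] + log_f))
```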
E-step

Given $\Psi^{(0)} = (\tilde{\pi}^{(0)}, \tilde{\theta}^{(0)})$ from initialization (or the previous iteration), estimate $Z_{ig}^{(1)}$ using Bayes' theorem:
$$Z_{ig}^{(1)} = E\left(Z_{ig} \mid \tilde{\pi}^{(0)}, \tilde{\theta}^{(0)}\right) = P(Z_{ig} = 1 \mid x_i, \Psi^{(0)}) = \frac{P(x_i \mid Z_{ig} = 1, \Psi^{(0)}) \cdot P(Z_{ig} = 1 \mid \Psi^{(0)})}{P(x_i \mid \Psi^{(0)})} = \frac{\pi_g^{(0)} f_g(x_i; \theta_g^{(0)})}{\sum_{h=1}^{G} \pi_h^{(0)} f_h(x_i; \theta_h^{(0)})}$$

which is needed in the $Q$ function:
$$\begin{aligned} Q(\Psi; \Psi^{(0)}) &= E\left( l_c(\Psi; \tilde{y}) \mid \Psi^{(0)} \right) = E\left( \sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig} \left[ \log \pi_g + \log f_g(x_i; \theta_g) \right] \,\middle|\, \tilde{\pi}^{(0)}, \tilde{\theta}^{(0)} \right) \\ &= \sum_{i=1}^{n} \sum_{g=1}^{G} E\left( Z_{ig} \mid \tilde{\pi}^{(0)}, \tilde{\theta}^{(0)} \right) \left[ \log \pi_g + \log f_g(x_i; \theta_g) \right] \end{aligned}$$

with

$$Z_{ig}^{(1)} = E\left( Z_{ig} \mid \tilde{\pi}^{(0)}, \tilde{\theta}^{(0)} \right) = \frac{\pi_g^{(0)} f_g(x_i; \tilde{\theta}_g^{(0)})}{\sum_{h=1}^{G} \pi_h^{(0)} f_h(x_i; \tilde{\theta}_h^{(0)})}$$
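A minimal numpy sketch of computing the $Z_{ig}^{(1)}$'s, again under the illustrative univariate Gaussian assumption (the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma):
    """Posterior membership probabilities Z_ig^(1) via Bayes' theorem.

    Returns an (n, G) matrix whose rows sum to 1.
    """
    # Numerator: pi_g^(0) * f_g(x_i; theta_g^(0)) for every (i, g) pair
    weighted = pi[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    # Denominator: the mixture density f(x_i; Psi^(0)), summed over components
    return weighted / weighted.sum(axis=1, keepdims=True)
```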
M-step

Find $\Psi = (\tilde{\pi}, \tilde{\theta})$ that maximizes $Q(\Psi; \Psi^{(0)})$, based on the $Z_{ig}^{(1)}$ just computed:
$$Q(\Psi; \Psi^{(0)}) = \sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig}^{(1)} \left[ \log \pi_g + \log f_g(x_i; \tilde{\theta}_g) \right] = \underbrace{\sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig}^{(1)} \log \pi_g}_{Q_1(\tilde{\pi};\, \tilde{\pi}^{(0)})} + \underbrace{\sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig}^{(1)} \log f_g(x_i; \tilde{\theta}_g)}_{Q_2(\tilde{\theta};\, \tilde{\theta}^{(0)})}$$
To maximize $Q_1$ subject to the constraint $\sum_{g=1}^{G} \pi_g = 1$ (equivalently $1 - \sum_g \pi_g = 0$), we introduce a Lagrange multiplier $\lambda$ and work with

$$Q_1 = \sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig}^{(1)} \log \pi_g + \lambda \left( 1 - \sum_{g=1}^{G} \pi_g \right)$$
Then maximizing $Q_1$ gives the values of $\pi_g$ based on $Z_{ig}^{(1)}$ (terms in the sum over $g$ with a different index vanish when differentiating with respect to $\pi_g$):

$$\frac{\partial Q_1}{\partial \pi_g} = \sum_{i=1}^{n} \frac{Z_{ig}^{(1)}}{\pi_g} - \lambda = 0 \quad \Rightarrow \quad \pi_g = \frac{\sum_{i=1}^{n} Z_{ig}^{(1)}}{\lambda}$$
$\lambda$ is also unknown; differentiating with respect to $\lambda$ recovers the constraint, and since each observation's posterior probabilities sum to one ($\sum_{g=1}^{G} Z_{ig}^{(1)} = 1$):

$$\frac{\partial Q_1}{\partial \lambda} = 1 - \sum_{g=1}^{G} \pi_g = 0 \;\Rightarrow\; 1 - \frac{\sum_{g=1}^{G} \sum_{i=1}^{n} Z_{ig}^{(1)}}{\lambda} = 1 - \frac{\sum_{i=1}^{n} \sum_{g=1}^{G} Z_{ig}^{(1)}}{\lambda} = 1 - \frac{\sum_{i=1}^{n} 1}{\lambda} = 1 - \frac{n}{\lambda} = 0 \;\therefore\; \hat{\lambda} = n$$

and
$$\hat{\pi}_g = \frac{\sum_{i=1}^{n} Z_{ig}^{(1)}}{n}, \qquad g = 1, \dots, G$$
This is an intuitive result: we use the estimated memberships to find the share of the sample that should come from each component. Note that $Z_{ig}^{(1)}$ is an expectation, a continuous number between 0 and 1, not binary like the true unobserved membership indicators.
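A one-line numpy sketch of this update, assuming `Z` is the $(n, G)$ matrix of posterior probabilities from the E-step (the function name is illustrative):

```python
import numpy as np

def update_pi(Z):
    """M-step for the mixing proportions: pi_g = sum_i Z_ig / n.

    Z: (n, G) posterior membership probabilities; rows sum to 1,
    so the returned proportions also sum to 1.
    """
    return Z.mean(axis=0)
```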
For $Q_2$, we now specialize to a Gaussian mixture of linear regressions, where in component $g$ the response is $y_i \mid \tilde{x}_i \sim N(\tilde{x}_i \tilde{\beta}_g, \sigma_g^2)$:

$$\begin{aligned} Q_2 &= \sum_{g=1}^{G} \sum_{i=1}^{n} Z_{ig}^{(1)} \log f(y_i; \tilde{x}_i, \tilde{\beta}_g, \sigma_g^2) = \sum_{g=1}^{G} \sum_{i=1}^{n} Z_{ig}^{(1)} \log\left( \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp\left( -\frac{(y_i - \tilde{x}_i \tilde{\beta}_g)^2}{2\sigma_g^2} \right) \right) \\ &= \sum_{g=1}^{G} \sum_{i=1}^{n} Z_{ig}^{(1)} \left( -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log \sigma_g^2 - \frac{(y_i - \tilde{x}_i \tilde{\beta}_g)^2}{2\sigma_g^2} \right) \end{aligned}$$
With $\tilde{\beta}_g = (\beta_{0g}, \beta_{1g}, \dots, \beta_{pg})'$ for $p$ covariates and an intercept, the gradient is

$$\frac{\partial Q_2}{\partial \tilde{\beta}_g} = \left[ \frac{\partial Q_2}{\partial \beta_{0g}}, \frac{\partial Q_2}{\partial \beta_{1g}}, \dots, \frac{\partial Q_2}{\partial \beta_{pg}} \right]^T$$

For $j = 0, 1, \dots, p$:
$$\frac{\partial Q_2}{\partial \beta_{jg}} = \sum_{i=1}^{n} Z_{ig}^{(1)} \frac{\partial}{\partial \beta_{jg}} \left( -\frac{(y_i - \tilde{x}_i \tilde{\beta}_g)^2}{2\sigma_g^2} \right) = \frac{1}{\sigma_g^2} \sum_{i=1}^{n} Z_{ig}^{(1)} (y_i - \tilde{x}_i \tilde{\beta}_g) x_{ij} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \left[ Z_{ig}^{(1)} y_i x_{ij} - Z_{ig}^{(1)} \tilde{x}_i \tilde{\beta}_g x_{ij} \right] = 0$$

Bringing this back to vector form for all the $\beta$'s:
$$\frac{\partial Q_2}{\partial \tilde{\beta}_g} = \sum_{i=1}^{n} Z_{ig}^{(1)} (y_i - \tilde{x}_i \tilde{\beta}_g) \tilde{x}_i^T = 0 \;\Rightarrow\; \sum_{i=1}^{n} Z_{ig}^{(1)} y_i \tilde{x}_i^T = \sum_{i=1}^{n} Z_{ig}^{(1)} \tilde{x}_i \tilde{\beta}_g \tilde{x}_i^T$$

Rewriting in matrix form:
$$X^T Z_g^{(1)} \tilde{Y} = \left( X^T Z_g^{(1)} X \right) \tilde{\beta}_g \quad \Rightarrow \quad \tilde{\beta}_g^{(1)} = \hat{\tilde{\beta}}_g = \left( X^T Z_g^{(1)} X \right)^{-1} X^T Z_g^{(1)} \tilde{Y}$$

where

$$Z_g^{(1)} = \operatorname{diag}\left( Z_{1g}^{(1)}, \dots, Z_{ng}^{(1)} \right) = \begin{bmatrix} Z_{1g}^{(1)} & 0 & \cdots & 0 \\ 0 & Z_{2g}^{(1)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & Z_{ng}^{(1)} \end{bmatrix}$$
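These normal equations are exactly weighted least squares with weights $Z_{ig}^{(1)}$. A minimal numpy sketch (the function name is illustrative; `np.linalg.solve` is used in place of the explicit inverse for numerical stability):

```python
import numpy as np

def update_beta(X, y, z_g):
    """Weighted least squares M-step update for beta_g.

    X: (n, p+1) design matrix including the intercept column;
    y: (n,) responses; z_g: (n,) posterior probabilities Z_ig^(1).
    Solves (X' Z_g X) beta_g = X' Z_g y with Z_g = diag(z_g).
    """
    W = np.diag(z_g)  # the diagonal weight matrix Z_g^(1)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```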
Differentiating with respect to $\sigma_g^2$, only the terms for component $g$ survive, so we can isolate $g$ and work with a single sum over $i$:
$$\frac{\partial Q_2}{\partial \sigma_g^2} = \sum_{i=1}^{n} Z_{ig}^{(1)} \left[ -\frac{1}{2\sigma_g^2} + \frac{(y_i - \tilde{x}_i \tilde{\beta}_g)^2}{2(\sigma_g^2)^2} \right] = \sum_{i=1}^{n} Z_{ig}^{(1)} \left[ \frac{(y_i - \tilde{x}_i \tilde{\beta}_g)^2 - \sigma_g^2}{2(\sigma_g^2)^2} \right] = 0$$

$$\Rightarrow\; \sum_{i=1}^{n} Z_{ig}^{(1)} (y_i - \tilde{x}_i \tilde{\beta}_g)^2 - \sum_{i=1}^{n} Z_{ig}^{(1)} \sigma_g^2 = 0 \;\Rightarrow\; \sigma_g^2 = \frac{\sum_{i=1}^{n} Z_{ig}^{(1)} (y_i - \tilde{x}_i \tilde{\beta}_g)^2}{\sum_{i=1}^{n} Z_{ig}^{(1)}}$$

Plugging in the result for $\tilde{\beta}_g^{(1)}$ from the previous step, we have:
$$\sigma_g^{2(1)} = \hat{\sigma}_g^2 = \frac{(\tilde{Y} - X \tilde{\beta}_g^{(1)})^T Z_g^{(1)} (\tilde{Y} - X \tilde{\beta}_g^{(1)})}{\operatorname{trace}(Z_g^{(1)})}$$

where the trace sums all the $Z_{ig}^{(1)}$'s back.
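Putting the E-step and both M-step updates together, here is a compact sketch of the full iteration for the mixture of linear regressions; the function name, random initialization, and fixed iteration count are illustrative choices, not prescribed by the derivation.

```python
import numpy as np
from scipy.stats import norm

def em_mixture_regression(X, y, G, n_iter=100, seed=0):
    """EM for a G-component Gaussian mixture of linear regressions.

    Assumes y_i | x_i, Z_ig = 1 ~ N(x_i' beta_g, sigma_g^2), as in the
    derivation above. X: (n, p+1) design matrix with intercept column.
    """
    rng = np.random.default_rng(seed)
    n, p1 = X.shape
    pi = np.full(G, 1.0 / G)                  # start from equal proportions
    beta = rng.normal(size=(G, p1))           # (G, p+1) coefficient vectors
    sigma2 = np.full(G, np.var(y))
    for _ in range(n_iter):
        # E-step: Z[i, g] proportional to pi_g * N(y_i; x_i' beta_g, sigma_g^2)
        means = X @ beta.T                    # (n, G) component fitted values
        Z = pi[None, :] * norm.pdf(y[:, None], loc=means,
                                   scale=np.sqrt(sigma2)[None, :])
        Z /= Z.sum(axis=1, keepdims=True)
        # M-step
        pi = Z.mean(axis=0)                   # pi_g = sum_i Z_ig / n
        for g in range(G):
            w = Z[:, g]
            # Weighted least squares: (X' W X) beta_g = X' W y
            beta[g] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
            resid = y - X @ beta[g]
            # sigma_g^2 = sum_i Z_ig * resid_i^2 / sum_i Z_ig  (trace of Z_g)
            sigma2[g] = np.sum(w * resid**2) / np.sum(w)
    return pi, beta, sigma2
```

In practice one would monitor the observed-data log-likelihood and stop when it changes by less than a tolerance, rather than running a fixed number of iterations.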