Application of finite mixture models to explore subpopulations in Crohn’s disease patients

The frequencies of the number of weeks in which the ibdsc value was greater than 100 were summarized.

As can be seen from Table 1, the frequency decreases across the weeks but increases again in the 7th week. Moreover, the mean is 2.09 and the variance is 4.973; since the variance exceeds the mean, there may be over-dispersion.
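
The comparison of mean and variance can be made explicit with a dispersion index (variance divided by mean), which equals 1 for a Poisson distribution. The short sketch below uses only the two summary values quoted from Table 1; the variable names are illustrative.

```python
# Summary statistics quoted in the text (Table 1).
mean_weeks = 2.09
var_weeks = 4.973

# Dispersion index: 1 for a Poisson distribution, above 1 under over-dispersion.
dispersion_index = var_weeks / mean_weeks
print(f"variance / mean = {dispersion_index:.2f}")
```

A ratio well above 1, as here, is what motivates looking beyond a single Poisson distribution.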

Table 1 Frequency and summary measures of weeks

Table 2 gives the number of observations in each week across the treatment groups. The treatment groups do not have the same number of observations over the seven weeks. In dose 0, there are 37 observations in the first week, and the number decreases over the seven weeks; in dose 1, there are 40 observations in the first week and a different number in each subsequent week, and the same holds for all treatment groups. The total number of observations over all treatments is highest in the first week (170) and smallest in the last week (94). This indicates that more subjects missed their follow-ups as the weeks went on. Moreover, the total number of observations is highest in treatment group 1 (203) and smallest in treatment group 3 (176). This difference may arise because the number of subjects with IBD score > 100 is highest in treatment group 1 and smallest in treatment group 3.

Table 2 Number of observations in each week across treatment groups for IBD score > 100

A histogram of the number of weeks for which the ibdsc is greater than 100 is given in Fig. 3. The distribution of the number of weeks appears to have multiple modes, which might not be easily described by standard distributions. In addition, the variance is higher than the mean (Table 1), which might indicate over-dispersion. Therefore, one way to take these problems into account is to model the underlying heterogeneity using a finite mixture model.

Since the number of weeks is count data, it is reasonable to assume a Poisson distribution. Nevertheless, the Poisson assumption of equal mean and variance is not met (the variance is higher than the mean, Table 1), which suggests over-dispersion, so a single Poisson model may not be appropriate. Both the histogram and the summary statistics point to heterogeneity and over-dispersion in the number of weeks. For this reason, it is sensible to consider a model that accounts for multimodality and over-dispersion, and a finite Poisson mixture model was therefore fitted.
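
For reference, a finite Poisson mixture can be written in the following generic form. The symbols k (number of components), π_j (mixing proportions), λ_j (component means), and λ̄ (overall mean) are standard notation assumed here, not definitions taken from the study.

$$P(Y = y) = \sum_{j=1}^{k} \pi_j \, \frac{e^{-\lambda_j} \lambda_j^{y}}{y!}, \qquad \pi_j > 0, \quad \sum_{j=1}^{k} \pi_j = 1$$

$$E(Y) = \sum_{j=1}^{k} \pi_j \lambda_j = \bar{\lambda}, \qquad \mathrm{Var}(Y) = \bar{\lambda} + \sum_{j=1}^{k} \pi_j \left(\lambda_j - \bar{\lambda}\right)^2 \ge E(Y)$$

The second line shows why such a model can absorb over-dispersion: whenever the component means differ, the variance exceeds the mean.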


Finite mixture model fitting

After observing that the variance is larger than the mean and that the histogram in Fig. 3 appears multimodal, we proceeded to model selection. We fitted three models: model one with a single component (assuming unimodality), model two, a mixture model with two components, and model three, a mixture model with three components. The AIC and BIC of the three models are given in Table 3.

Table 3 AIC and BIC values of three fitted models

As Table 3 shows, both the AIC and the BIC are smallest for the model with two components, which indicates that this model has the best fit. Therefore, all subsequent analyses were done using this model.

To estimate the number of components, nonparametric maximum likelihood estimation (NPMLE) was performed. The final fitted mixture model had two components and a log-likelihood value of -556.2931, and is given as follows:
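
The comparison of one, two, and three components by AIC and BIC can be sketched as follows. This is a minimal illustration, not the authors' procedure: it fits the mixtures with a plain EM algorithm and uses simulated counts, because the study data are not reproduced here. The sample size, the simulation settings, and all function names are assumptions made for the example.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson


def mixture_loglik(y, weights, rates):
    """Log-likelihood of counts y under a finite Poisson mixture."""
    log_joint = np.log(weights) + poisson.logpmf(y[:, None], rates[None, :])
    return logsumexp(log_joint, axis=1).sum()


def fit_poisson_mixture(y, k, n_iter=1000, tol=1e-9):
    """Fit a k-component Poisson mixture by EM; return (weights, rates, loglik)."""
    y = np.asarray(y)
    # Heuristic start: spread the rates over the empirical quantiles of y.
    rates = np.quantile(y, (np.arange(k) + 0.5) / k) + 0.1
    weights = np.full(k, 1.0 / k)
    loglik = mixture_loglik(y, weights, rates)
    for _ in range(n_iter):
        # E-step: posterior probability that observation i belongs to component j.
        log_joint = np.log(weights) + poisson.logpmf(y[:, None], rates[None, :])
        resp = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M-step: update mixing proportions and component means.
        n_j = np.maximum(resp.sum(axis=0), 1e-12)
        weights = n_j / n_j.sum()
        rates = np.maximum(resp.T @ y / n_j, 1e-8)
        new_loglik = mixture_loglik(y, weights, rates)
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return weights, rates, loglik


# Simulated data (assumption): two Poisson subpopulations, hypothetical n.
rng = np.random.default_rng(0)
n = 292
in_first = rng.random(n) < 0.637
y = np.where(in_first, rng.poisson(0.789, n), rng.poisson(4.430, n))

criteria = {}
for k in (1, 2, 3):
    _, _, ll = fit_poisson_mixture(y, k)
    n_params = 2 * k - 1  # k rates plus (k - 1) free mixing proportions
    criteria[k] = {"AIC": -2 * ll + 2 * n_params,
                   "BIC": -2 * ll + n_params * np.log(n)}
    print(k, round(criteria[k]["AIC"], 1), round(criteria[k]["BIC"], 1))

best_k = min(criteria, key=lambda k: criteria[k]["BIC"])
```

The model with the smallest criterion value is preferred; BIC penalizes each extra parameter by log(n) rather than 2, so it tends to favor fewer components than AIC.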

$$Y_i \mid P \sim \text{Poisson}(\lambda_i), \quad \text{where } P \sim \begin{pmatrix} 0.789 & 4.430 \\ 0.637 & 0.363 \end{pmatrix}, \quad \text{for } i = 1, 2.$$
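
As a quick plausibility check, the mean and variance implied by the fitted two-component mixture can be computed from the estimates above and compared with the sample values reported in Table 1 (mean 2.09, variance 4.973). The sketch assumes the first row of the mixing distribution holds the component means and the second row the mixing proportions; variable names are illustrative.

```python
import numpy as np

# Estimates quoted from the fitted mixing distribution.
rates = np.array([0.789, 4.430])    # component means (lambda)
weights = np.array([0.637, 0.363])  # mixing proportions (pi)

# Mixture mean, and variance = mean + between-component variance of the rates.
mix_mean = np.sum(weights * rates)
mix_var = mix_mean + np.sum(weights * (rates - mix_mean) ** 2)
print(round(mix_mean, 3), round(mix_var, 3))
```

The implied mean is about 2.11 and the implied variance about 5.18, both of the same order as the observed summary measures, which a single Poisson distribution (variance equal to mean) could not reproduce.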

The plot of the gradient function against the parameter value (lambda) is presented in Fig. 4. From this figure, it can be clearly observed that the gradient function is less than or equal to one, which indicates that the estimated parameters of the distribution function are the nonparametric maximum likelihood estimates (NPMLE), and that these estimates are unique, as the gradient function equals one at the estimated support points. Classification was done based on the fitted mixture model, and as we can see from Table 4, the empirical proportions are very close to the estimated mixing proportions (π̂). Most of the patients, 196 (67.4%), were classified in the first component.
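
Classification from a fitted mixture is usually done by assigning each observation to the component with the highest posterior probability. The sketch below applies that rule to the two-component estimates quoted above; whether the authors used exactly this rule is an assumption, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import poisson

# Estimates quoted from the fitted two-component mixture.
rates = np.array([0.789, 4.430])
weights = np.array([0.637, 0.363])


def posterior_probs(y):
    """Posterior probability of each component for every count in y."""
    joint = weights * poisson.pmf(np.asarray(y)[..., None], rates)
    return joint / joint.sum(axis=-1, keepdims=True)


counts = np.arange(0, 8)               # possible numbers of weeks: 0 to 7
post = posterior_probs(counts)
labels = post.argmax(axis=1) + 1       # 1 = first component, 2 = second
for y, p, lab in zip(counts, post, labels):
    print(y, np.round(p, 3), lab)
```

Under these estimates, the rule assigns counts of at most 2 to the first component and counts of 3 or more to the second.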

Model extension

The deviance of the single-component model corrected for the covariates was 1049.3, and that of the two-component model was 948.8. The change in deviance between these models is large, which suggests that there is evidence for a mixture even after correcting for the patient characteristics (covariates) in the model.

The results are given in Table 4. Adjusting for treatment and IBD score at baseline, the fitted models for population 1 (component 1) and population 2 (component 2), respectively, are given below:
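
As a worked check of the two deviances quoted above (the symbol ΔD is notation introduced here for the difference):

$$\Delta D = 1049.3 - 948.8 = 100.5$$

As a general caution rather than a result of this study, the usual chi-square reference distribution for a deviance difference is not guaranteed to hold when models with different numbers of mixture components are compared, so this value is best read as descriptive evidence.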

$$\log(\mu_1) = -4.335 + 0.10\,\text{dose1} + 0.0086\,\text{dose2} - 0.250\,\text{dose3} + 0.042\,\text{ibdsc0}$$

$$\log(\mu_2) = 0.537 + 0.138\,\text{dose1} + 0.008\,\text{dose2} + 0.160\,\text{dose3} + 0.0087\,\text{ibdsc0}$$

In both components, the effect of treatment was not significant (the p-values for all treatment groups are larger than the 0.05 significance level, Table 4). Therefore, the treatment does not completely explain the presence of potential clusters in the outcome. On the other hand, the effect of ibdsc0 was significant in both subpopulations (p-values < 0.05, Table 4). Exp(0.042) = 1.043 and exp(0.008) = 1.008 are the factors by which the mean count (µ) is multiplied per unit change in ibdsc0 for subpopulations 1 and 2, respectively. This suggests that the patient characteristic (ibdsc0) contributes to explaining the presence of potential clusters in the outcome.
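
Because both component models use a log link, each coefficient converts to a multiplicative factor on the mean count through exponentiation. The sketch below uses the coefficients from the two fitted equations. It assumes that the dose terms are 0/1 indicators with dose 0 as the reference group, and the baseline score of 100 is a hypothetical value chosen only for illustration. Note that the component 2 equation reports 0.0087 for ibdsc0 while the text uses 0.008; the code follows the equation.

```python
import math

# Coefficients from the two fitted component models.
coef = {
    "component 1": {"intercept": -4.335, "dose1": 0.10, "dose2": 0.0086,
                    "dose3": -0.250, "ibdsc0": 0.042},
    "component 2": {"intercept": 0.537, "dose1": 0.138, "dose2": 0.008,
                    "dose3": 0.160, "ibdsc0": 0.0087},
}


def rate_ratio(beta):
    """Factor by which the mean count is multiplied per unit change in a covariate."""
    return math.exp(beta)


def predicted_mean(c, ibdsc0, dose=None):
    """Mean count for one component; dose=None is the reference group (assumed coding)."""
    eta = c["intercept"] + c["ibdsc0"] * ibdsc0
    if dose is not None:
        eta += c[dose]
    return math.exp(eta)


for name, c in coef.items():
    # ibdsc0 = 100 is a hypothetical baseline score, not a value from the study.
    print(name, round(rate_ratio(c["ibdsc0"]), 4),
          round(predicted_mean(c, ibdsc0=100), 3))
```

Per-unit factors close to 1 can still matter in practice, because the baseline score varies over many units, so the effect compounds multiplicatively.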

Components, their relationship, and covariates

As Table 1 shows, the variance is higher than the mean (4.973 > 2.09), which indicates over-dispersion. Using a mixture model to account for data with such potential heterogeneity is therefore important, and this justifies the use of the Poisson mixture model.

The fitted Poisson models for the two components show that, although the effect of ibdsc0 is significant in both components, the evidence is stronger in component 1, whose p-value (0.0001) is much smaller than the p-value (0.0422) for component 2. In addition, the multiplying factor of the mean count (µ) per unit change in the inflammatory bowel disease score at baseline (ibdsc0) is higher for component 1 (1.043) than for component 2 (1.008).

We can also observe that subpopulation 1 contains more subjects (n₁ = 196) than subpopulation 2 (n₂ = 96). However, only weeks 1 and 2 are included in component 1, whereas the rest (weeks 3, 4, 5, 6 and 7) are included in component 2. In line with the number of subjects, the proportion is also higher for component 1 (0.674) than for component 2 (1 - 0.674 = 0.326).

It should also be mentioned that the empirical proportions (Pro) for both components are close to the estimated mixture proportions (π̂). This indicates that the model’s assumptions about the mixture structure and latent subpopulations are consistent with the actual data distribution; that is, the observed data align well with the model’s estimated parameters.

Table 4 Final number of components and classification of observations; estimates of covariates

Clinical implication of the predictors

The significance of the inflammatory bowel disease score at baseline (ibdsc0) suggests that the initial severity of IBD symptoms strongly influences disease evolution or outcomes over the 7-week period. Higher baseline IBD scores may predict more weeks with severe disease (IBD > 100), highlighting the importance of early and accurate assessment of baseline disease severity in clinical settings. Moreover, patients with higher baseline IBD scores may require closer monitoring and possibly more aggressive or tailored therapeutic interventions.

The finding that the different treatment doses did not have a statistically significant effect on the number of weeks with IBD > 100 raises concerns about the efficacy of these treatments in the context of this study. One possible explanation is that the treatment is not effective in altering the course of the disease within the 7-week period. Alternatively, the study design, duration, or sample size may not have been sufficient to detect true effects.
