当前位置：首页 > ds >正文

【LCMM】纵向轨迹模型,组轨迹模型

ds 2025/7/12 18:03:30

latent_class_mixed_models

基础知识

增长混合模型（GMM）和潜在类别增长模型（LCGA）的核心区别确实主要在于是否允许类别内存在随机效应，但两者的差异还涉及模型灵活性、假设和应用场景等方面。以下是详细对比：

随机效应的处理
GMM（增长混合模型）

允许类别内随机效应：每个潜在类别中，个体的增长参数（如截距、斜率）可以围绕类别均值存在随机变异。例如，同一类别的个体可能有不同的截距和斜率，服从某种分布（如正态分布）。

更灵活：适用于数据中存在“类别内异质性”的情况，即同一类别内个体的轨迹不完全一致。

LCGA（潜在类别增长分析）

无随机效应：每个潜在类别内的所有个体具有完全相同的增长参数（固定效应），即类别内的轨迹是严格同质的（无个体变异）。

更严格：假设同一类别的个体轨迹完全相同，可能忽略实际存在的个体差异。

模型复杂度与假设
GMM

参数更多：需要估计随机效应的方差-协方差矩阵，模型更复杂。

样本量需求更大：由于参数更多，通常需要更大的样本量以保证估计稳定性。

适用性：适合探索“类别内异质性”的复杂场景，例如研究不同亚群体的发展轨迹时，同时考虑群体间和群体内的差异。

LCGA

参数更少：仅需估计类别均值和残差方差，模型更简单。

收敛更容易：计算复杂度低，适合小样本或初步探索性分析。

适用性：适用于假设类别内个体高度同质的情况，例如明确划分的干预组或自然分组。

实际应用中的选择
模型选择依据：

通过信息准则（如BIC、AIC）或似然比检验（需注意模型嵌套关系）比较。

若GMM中随机效应的方差接近零，则可简化为LCGA。

权衡：

GMM灵活性高，但可能过拟合；LCGA更简洁，但可能忽略重要变异。

实践建议：先尝试LCGA，若模型拟合不佳（如残差变异大），再考虑GMM。

其他差异
协变量纳入：两者均可纳入协变量预测类别成员资格（通过多项Logit模型），但GMM还可分析协变量对类别内随机效应的影响。

轨迹形状：LCGA通常假设线性或多项式增长，而GMM可结合更复杂的非线性随机效应。

示例1-GROW

TH MIXTURE MODEL

This repository holds code associated with the publication titled: “Identifying typical trajectories in longitudinal data: modelling strategies and interpretations” (DOI: https://doi.org/10.1007/s10654-020-00615-6 ).

# Load the necessary packages for latent class mixed models and spline functions
library(lcmm)
library(splines)# input dataset df: 
# data frame with columns pseudonymised patient episode identifier,
# time from blood culture collection to clinical parameter measurements,
# and C-reactive protein  values (Box-Cox transformed).# Define the natural spline basis for 'time'
# using the 25th, 50th, and 75th percentiles as knots and
# 1st and 99th percentiles as boundary knots
timeSplineKnots <- quantile(df$time, c(0.25, 0.5, 0.75))
timeSplineBoundary <- quantile(df$time, c(0.01, 0.99))
timeSpline <- ns(df$time, knots = timeSplineKnots, Boundary.knots = timeSplineBoundary)# Renaming columns for interpretability
timeSplineColNames <- paste0("T", 1:ncol(timeSpline))
colnames(timeSpline) <- timeSplineColNames# Binding the natural spline terms to the original dataframe
df <- cbind(df, timeSpline)# Fitting a series of latent class mixed models with varying number of classes (ng)
# Beginning with a standard linear mixed model (LCMM_CRP_1)
LCMM_CRP_1 <- lcmm(CRP_trans ~ T1 + T2 + T3 + T4,random = ~ T1 + T2 + T3 + T4,subject = "EpisodeID",nproc = 20, # Number of cores to use for parallel processingdata = df
)m2d <- gridsearch(hlme(normMMSE ~ age65+I(age65^2)+CEP,random =~ age65+I(age65^2), # 随机效应部分subject = 'ID',data=paquid, ng = 2, mixture=~age65+I(age65^2)),rep=100, maxiter=30, minit=m1)# Fitting latent class mixed models with multiple classes
# using LCMM_CRP_1 as the starting values (B = LCMM_CRP_1)
LCMM_CRP_2 <- lcmm(CRP_trans ~ T1 + T2 + T3 + T4,mixture = ~ T1 + T2 + T3 + T4,random = ~ T1 + T2 + T3 + T4, maxiter = 10000,subject = "EpisodeID", ng = 2, B = LCMM_CRP_1,nproc = 20, # Number of cores to use for parallel processingdata = df
)# The process continues, increasing ng for each model to test for better fitting class numbers
# The `update` function is used to change the number of classes while keeping the rest of the model the same.
LCMM_CRP_3 <- update(LCMM_CRP_2, ng = 3)
LCMM_CRP_4 <- update(LCMM_CRP_3, ng = 4)
LCMM_CRP_5 <- update(LCMM_CRP_4, ng = 5)
LCMM_CRP_6 <- update(LCMM_CRP_5, ng = 6)postprob(d4)conv<-ifelse(d4$conv==1, "TRUE", "FALSE")k<-paste(d4$AIC, d4$BIC, exp(d4$loglik), conv)k<-gsub(" ", ",", k)pp<-d4$pprobpp<-cbind(pp[,3:(ncol(pp))], d4$pprob$class)postprob(mod4)

示例2

## Package lcmm 
##  
## Cecile Proust-Lima, Viviane Philipps, Amadou Diakite & Benoit Liquet: lcmm: Extended Mixed Models Using Latent Classes and Latent Processes. 
## R package version 1.81. https://CRAN.R-project.org/package=lcmm 
# Note: Data needs to be in long format  library(lcmm)#load data
data <- read.table("Data.csv",sep=",", header= TRUE, na.strings = NA)#recenter and scale age
data$age12 <- (data$age - 12)/10### Estimation of mixed-effect models and latent class mixed-effect models for continuous Gaussian outcomes ### #Estimate the model with only one class (ng = 1)
m1 <- hlme(bmi ~ age12+I(age12^2),random =~ age12+I(age12^2),subject = 'ID', data = data) #estimate the model with two classes (ng = 2): 
m2 <- hlme(bmi ~ age12+I(age12^2),random =~ age12+I(age12^2), mixture=~age12+I(age12^2), subject = 'ID', data = data, ng = 2, B=m1) # ng=2
postprob(m2)# Estimate the model with two classes using gridsearch:
m2_gridsearch <- gridsearch(hlme(bmi ~ age12+I(age12^2),random =~ age12+I(age12^2), subject = 'ID', data = data, ng = 2, mixture=~age12+I(age12^2)),rep=100, maxiter=50, minit=m1) #ng=2
postprob(m2_gridsearch)#get summary of the models
summarytable(m1, m2, m2_gridsearch, which = c("G", "loglik", "conv", "npm", "AIC", "BIC", "SABIC", "entropy", "%class"))### Estimation of mixed-effect models and latent class mixed-effect models 
###   for continuous non-Gaussian or ordinal outcomes ### 
m1_lcmm <- lcmm(fussy_eating ~ age12+I(age12^2),random =~ age12+I(age12^2), subject = 'ID', data = data, link='thresholds') 
summary(m1_lcmm) ### Estimation of mixed-effect models and latent class mixed-effect models 
###   for multiple longitudinal markers measuring the same underlying latent process ### 
m1_mult_lin <- multlcmm(BMI + fussy_eating ~ age12+I(age12^2) + time + I(time^2/10), random =~ time +I(time^2 / 10), subject= 'ID', data = data, randomY = TRUE, cor = BM(time), link = c('linear','thresholds'))
summary(m1_mult_lin)### Estimation of mixed-effect models and latent class mixed-effect models 
###  for longitudinal markers and time to event (possibly with competing risks) ### 
m1_jointlcmm <- Jointlcmm(BMI~ age12+I(age12^2),random=~age12+I(age12^2),subject='ID', data=data, ng=1, survival = Surv(Tevent,Event)~ time+event, hazard="3-quant-splines", hazardtype="PH") #ng=1
summary(m1_jointlcmm)

示例3

############################################################################################################
############# Growth modeling for ECHO data
############################################################################################################library(tidyverse)
library(config)
library(here)
library(lcmm)
Sys.setenv(R_CONFIG_ACTIVE = "mike") # 'default')#
config <- config::get()# gm_df <- read.csv(here(config$ECHO_rolling_window_8_LIWC_path,config$ECHO_rolling_window_8_LIWC_name))############# rLSM.D modelsgm_df <- ECHO_smoothed_rLSM %>% # This df comes from runing the rLSM script trhough line 124 (stops before aggregating to file level)mutate(rLSM = rowMeans(select(.,contains('.rLSM')),na.rm = TRUE)) %>%drop_na(text_agg) %>%filter(Speaker == 'D') %>%select(File, rLSM) %>%drop_na() %>%# mutate(#   rLSM = scale(rLSM,center = TRUE)# ) %>%group_by(File) %>%mutate(id = row_number(),# t = case_when(id < (max(id)/5) ~ 0,#           id < (max(id/5))*2 ~ 1,#           id < (max(id/5))*3 ~ 2,#           id < (max(id/5))*4 ~ 3,#           TRUE ~ 4)t = case_when(id < (max(id)/10) ~ 0,id < (max(id/10))*2 ~ 1,id < (max(id/10))*3 ~ 2,id < (max(id/10))*4 ~ 3,id < (max(id/10))*5 ~ 4,id < (max(id/10))*6 ~ 5,id < (max(id/10))*7 ~ 6,id < (max(id/10))*8 ~ 7,id < (max(id/10))*9 ~ 8,TRUE ~ 9)) %>%ungroup() %>%select(-id) %>%group_by(File,t) %>%summarize(rLSM = mean(rLSM)) %>%ungroup() %>% mutate(group_id = group_indices(., File))m1.rLSM.D.v2 <- lcmm::hlme(rLSM ~ t,# random = ~ t,subject = 'group_id',data = gm_df
)m2.rLSM.D.v2 <- lcmm::hlme(rLSM ~ t,# random = ~ t,mixture = ~ t,ng = 2,subject = 'group_id',data = gm_df,B = random(m1.rLSM.D.v2))# m2g.rLSM.D.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.D.v2, 
#   hlme(rLSM ~ t,
#     # random = ~ 1 + t,
#     mixture = ~ t,
#     ng = 2,
#     subject = 'group_id',
#     data = gm_df))m3.rLSM.D.v2 <- lcmm::hlme(rLSM ~ t,# random = ~ 1 + t,mixture = ~ t,ng = 3,subject = 'group_id',data = gm_df,B = random(m1.rLSM.D.v2)
)# m3g.rLSM.D.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.D.v2, 
#   hlme(rLSM ~ t,
#        random = ~ 1 + t,
#        mixture = ~ t,
#        ng = 3,
#        subject = 'group_id',
#        data = gm_df))m4.rLSM.D.v2 <- lcmm::hlme(rLSM ~ t,# random = ~ 1 + t,mixture = ~ t,ng = 4,subject = 'group_id',data = gm_df,B = random(m1.rLSM.D.v2)
)# m4g.rLSM.D.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.D.v2, 
#   hlme(rLSM ~ t,
#        random = ~ 1 + t,
#        mixture = ~ t,
#        ng = 4,
#        subject = 'group_id',
#        data = gm_df))m5.rLSM.D.v2 <- lcmm::hlme(rLSM ~ t,# random = ~ 1 + t,mixture = ~ t,ng = 5,subject = 'group_id',data = gm_df,B = random(m1.rLSM.D.v2)
)
# 
# m5g.rLSM.D.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.D.v2, 
#   hlme(rLSM ~ t,
#        random = ~ 1 + t,
#        mixture = ~ 1 + t,
#        ng = 5,
#        subject = 'group_id',
#        data = gm_df))summarytable(m1.rLSM.D.v2,m2.rLSM.D.v2,m3.rLSM.D.v2, m4.rLSM.D.v2,m5.rLSM.D.v2,# m4, m4g,#m5,m5g,which = c("G", "loglik", "conv", "npm", "AIC", "BIC", "SABIC", "entropy","ICL", "%class"))
# summaryplot(m1.rLSM.D,m2.rLSM.D,m2g.rLSM.D,m3.rLSM.D,m3g.rLSM.D,
#             # m4, m4g,#m5,m5g,
#             which = c("BIC", "entropy","ICL"))summary(m2.rLSM.D.v2)data_pred0 <- data.frame(t = seq(0,9,length.out = 50), CEP = 0)
data_pred1 <- data.frame(t = seq(0,9,length.out = 50), CEP = 1)
pred0 <- predictY(m2.rLSM.D, data_pred0, var.time = "t")
pred1 <- predictY(m2.rLSM.D, data_pred1, var.time = "t")plot(pred0, col=c("red","navy"), lty=1,lwd=5,ylab="normMMSE",legend=NULL,  main="Predicted trajectories for normMMSE ",ylim=c(.6,.8))
plot(pred1, col=c("red","navy"), lty=2,lwd=3,legend=NULL,add=TRUE)
legend(x="topright",legend=c("class1 :","CEP-","CEP+","class2:","CEP-","CEP+"), col=c(rep("red",3),rep("navy",3)), lwd=2, lty=c(0,1,2,0,1,2), ncol=2, bty="n", cex = 0.7)rLSM.D.2Class <- m2.rLSM.D.v2$pprob %>%full_join(gm_df, by = 'group_id') %>%group_by(class,t) %>%mutate(t = t+1) %>%summarize(sd = sd(rLSM,na.rm = TRUE),rLSM = mean(rLSM)) %>%ungroup() %>%mutate(class = as.factor(class)) %>%ggplot(aes(x = as.factor(t), y = rLSM, color = class, group = class)) + geom_point() + geom_line() + ggthemes::theme_tufte() +# geom_errorbar(aes(ymin=rLSM-sd, ymax=rLSM+sd), width=.2,#               position=position_dodge(.9)#               ) +scale_fill_brewer(palette = "Set2") +labs(title = 'Mean rw.rLSM.D over time within encounters by growth class',x = 'Decile of turns within encounter',y = 'Mean rw.LSM.D')############# rLSM.P modelsgm_df_p <- ECHO_smoothed_rLSM %>% # This df comes from runing the rLSM script trhough line 124 (stops before aggregating to file level)mutate(rLSM = rowMeans(select(.,contains('.rLSM')),na.rm = TRUE)) %>%drop_na(text_agg) %>%filter(Speaker == 'P') %>%select(File, rLSM) %>%drop_na() %>%group_by(File) %>%mutate(id = row_number(),# t = case_when(id < (max(id)/5) ~ 0,#           id < (max(id/5))*2 ~ 1,#           id < (max(id/5))*3 ~ 2,#           id < (max(id/5))*4 ~ 3,#           TRUE ~ 4)t = case_when(id < (max(id)/10) ~ 0,id < (max(id/10))*2 ~ 1,id < (max(id/10))*3 ~ 2,id < (max(id/10))*4 ~ 3,id < (max(id/10))*5 ~ 4,id < (max(id/10))*6 ~ 5,id < (max(id/10))*7 ~ 6,id < (max(id/10))*8 ~ 7,id < (max(id/10))*9 ~ 8,TRUE ~ 9)) %>%ungroup() %>%select(-id) %>%group_by(File,t) %>%summarize(rLSM = mean(rLSM)) %>%ungroup() %>% mutate(group_id = group_indices(., File))m1.rLSM.P.v2 <- lcmm::hlme(rLSM ~ t,random = ~ 1 + t,subject = 'group_id',data = gm_df_p
)m2.rLSM.P.v2 <- lcmm::hlme(rLSM ~ t,random = ~ 1 + t,mixture = ~ t,ng = 2,subject = 'group_id',data = gm_df_p,B = random(m1.rLSM.P.v2)
)# m2g.rLSM.P.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.P.v2, 
#   hlme(rLSM ~ t,
#        random = ~ 1 + t,
#        mixture = ~ t,
#        ng = 2,
#        subject = 'group_id',
#        data = gm_df_p))m3.rLSM.P.v2 <- lcmm::hlme(rLSM ~ t,random = ~ 1 + t,mixture = ~ t,ng = 3,subject = 'group_id',data = gm_df_p,B = random(m1.rLSM.P.v2)
)# m3g.rLSM.P.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.P.v2, 
#   hlme(rLSM ~ t,
#        random = ~ 1 + t,
#        mixture = ~ t,
#        ng = 3,
#        subject = 'group_id',
#        data = gm_df_p))m4.rLSM.P.v2 <- lcmm::hlme(rLSM ~ t,random = ~ 1 + t,mixture = ~ t,ng = 4,subject = 'group_id',data = gm_df_p,B = random(m1.rLSM.P.v2)
)# m4g.rLSM.P.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.P.v2, 
#   hlme(rLSM ~ t,
#        random = ~ 1 + t,
#        mixture = ~ t,
#        ng = 4,
#        subject = 'group_id',
#        data = gm_df_p))m5.rLSM.P.v2 <- lcmm::hlme(rLSM ~ t,random = ~ 1 + t,mixture = ~ t,ng = 5,subject = 'group_id',data = gm_df_p,B = random(m1.rLSM.P.v2)
)# m5g.rLSM.P.v2 <- lcmm::gridsearch(
#   rep = 100, maxiter = 30, minit = m1.rLSM.P.v2, 
#   hlme(rLSM ~ t,
#        # random = ~ 1 + t,
#        mixture = ~ 1 + t,
#        ng = 5,
#        subject = 'group_id',
#        data = gm_df))summarytable(m1.rLSM.P.v2,m2.rLSM.P.v2,m3.rLSM.P.v2,m4.rLSM.P.v2,m5.rLSM.P.v2,which = c("G", "loglik", "conv", "npm", "AIC", "BIC", "SABIC", "entropy","ICL", "%class"))
# summaryplot(m1.rLSM.P,m2.rLSM.P,m2g.rLSM.P,m3.rLSM.P,m3g.rLSM.P,
#             # m4, m4g,#m5,m5g,
#             which = c("BIC", "entropy","ICL"))summary(m2.rLSM.P)data_pred0 <- data.frame(t = seq(0,9,length.out = 50), CEP = 0)
data_pred1 <- data.frame(t = seq(0,9,length.out = 50), CEP = 1)
pred0 <- predictY(m2.rLSM.P, data_pred0, var.time = "t")
pred1 <- predictY(m2.rLSM.P, data_pred1, var.time = "t")plot(pred0, col=c("red","navy"), lty=1,lwd=5,ylab="normMMSE",legend=NULL,  main="Predicted trajectories for normMMSE ",ylim=c(0,1))
plot(pred1, col=c("red","navy"), lty=2,lwd=3,legend=NULL,add=TRUE)
legend(x="topright",legend=c("class1 :","CEP-","CEP+","class2:","CEP-","CEP+"), col=c(rep("red",3),rep("navy",3)), lwd=2, lty=c(0,1,2,0,1,2), ncol=2, bty="n", cex = 0.7)rLSM.P.2Class <- m2.rLSM.P.v2$pprob %>%full_join(gm_df, by = 'group_id') %>%mutate(t = t+1) %>%group_by(class,t) %>%summarize(sd = sd(rLSM,na.rm = TRUE),rLSM = mean(rLSM)) %>%# ungroup() %>%mutate(class = as.factor(class)) %>%ggplot(aes(x = as.factor(t), y = rLSM, color = class, group = class)) + geom_point() + geom_line() + ggthemes::theme_tufte() +scale_fill_brewer(palette = "Set2") +labs(title = 'Mean rw.rLSM.P over time within encounters by growth class',x = 'Decile of turns within encounter',y = 'Mean rw.LSM.P')############# classes to outcomesclass_df <- m2.rLSM.P$pprob %>%full_join(gm_df, by = 'group_id') %>%rename(class.P = class) %>%select(group_id, class.P, File) %>%full_join(m2.rLSM.D$pprob, by = 'group_id') %>%rename(class.D = class) %>%# select(-c(prob1,prob2,group_id)) %>%distinct()
table(class_df$class.D,class_df$class.P)#open survey data
ECHO_survey_data <- haven::read_dta(here(config$ECHO_survey_data_path, config$ECHO_survey_data_name))#Open ID key file
ECHO_ID_key <- read_csv('ECHO_ID_key.csv')#merge GROWTH_MODELING_FILE and ECHO_ID_key by "File"
class_df <- left_join(class_df, ECHO_ID_key, by = "File" ) #merge GROWTH_MODELING_FILE and ECHO_survey_data by "tapeid" to include survey data
class_df <- left_join(class_df, ECHO_survey_data, by = "tapeid")# Transforming cdspeak from continuous to dichotomous variable "cdspeakhigh"
class_df <- class_df %>%rowwise() %>%mutate(cdspeakhigh = ifelse(cdspeak >= 3, 1, 0))#Making the provider id variable
class_df  <- class_df %>%mutate(provider_id = str_sub(File, 1, 4))#Making the site_name variable: only three sites in data set: JC, OC, SC
class_df <- class_df %>%mutate(site_name = str_sub(provider_id, 1, 2))#creating site_id variable by converting JC==0, OC==1, SC==2
class_df <- class_df %>%mutate(site_id = ifelse(site_name == "JC", 0,ifelse(site_name == "OC", 1,2)))#changing site_id to factor
class_df <- class_df %>%mutate(site_id = factor(site_id))class_df <- class_df %>%mutate(provider_id = factor(provider_id))table(class_df$class.D,class_df$cdspeakhigh)
chisq.test(class_df$class.D,class_df$cdspeakhigh, correct = FALSE)# gginference::ggchisqtest(
#   chisq.test(class_df$class.D,class_df$cdspeakhigh, correct = FALSE)
# )
table(class_df$class.P,class_df$cdspeakhigh)
chisq.test(class_df$class.P,class_df$cdspeakhigh, correct = FALSE)class_df <- class_df %>% mutate(x = paste0(class.D,'D_',class.P,'P'))
table(class_df$x,class_df$cdspeakhigh)
chisq.test(class_df$x,class_df$cdspeakhigh, correct = FALSE)library(gtsummary)
class_df %>%select(cdspeakhigh, x) %>%mutate(cdspeakhigh = as.factor(cdspeakhigh),x = as.factor(x)) %>%drop_na() %>%gtsummary::tbl_summary(by = x) %>%gtsummary::add_p(pvalue_fun = ~style_pvalue(.x, digits = 2))### Pulling pprob
pprob_df <- m2.rLSM.P$pprob %>%full_join(gm_df, by = 'group_id') %>%rename(class.P = class, P.prob1 = prob1, P.prob2 = prob2) %>%select(-t,-rLSM) %>%full_join(m2.rLSM.D$pprob, by = 'group_id') %>%rename(class.D = class, D.prob1 = prob1, D.prob2 = prob2) %>%select(-c(group_id)) %>%distinct()write_csv(pprob_df, 'ECHO_pprob.csv')

lcmm::gridsearch

第一段代码是单次拟合，计算速度较快，但可能陷入局部最优。

第二段代码通过网格搜索（rep = 100）和更宽松的迭代限制（maxiter = 30）提高找到全局最优解的可能性，但计算时间显著增加。

第一段代码可能是为了快速测试模型结构。

第二段代码更适用于正式分析，通过网格搜索确保参数估计的稳定性。

测试

以下是基于R语言的组轨迹模型（GBTM）或去掉随机效应变为LCMM，分析步骤，适用于纵向血压、体重数据，并纳入年龄、性别作为协变量，分析心血管疾病发病风险的示例代码：

# 1. 准备数据与加载包
# 加载所需包
library(lcmm)
library(tidyverse)
library(traj)# 模拟数据集（假设数据结构）
set.seed(123)
n <- 500  # 样本量
data <- data.frame(id = rep(1:n, each=4),time = rep(0:3, n),  # 0-3表示4个时间点age = rep(rnorm(n, 50, 5), each=4),sex = rep(sample(c("Male", "Female"), n, replace=TRUE), each=4),sbp = rnorm(n*4, 120 + 0.5*time, 10),  # 收缩压随时间轻微上升weight = rnorm(n*4, 70 - 0.3*time, 5), # 体重随时间轻微下降cvd = rep(rbinom(n, 1, 0.2), each=4)   # 结局：是否发生心血管疾病
)
# 2. 拟合血压的组轨迹模型
# 使用lcmm包拟合潜类别混合模型
# 模型设定：时间固定效应，随机截距+斜率，按类别分组
bp_model <- hlme(fixed = sbp ~ time + age + sex,   # 固定效应：时间、年龄、性别mixture = ~ time,                 # 类别间时间效应不同random = ~ time,                  # 随机效应：截距和斜率ng = 3,                           # 假设3个类别subject = "id",                   # 个体IDdata = data,B = random                        # 随机初始值
)# 查看模型结果
summary(bp_model)
plot(bp_model, which = "trajectory")  # 绘制轨迹图
# 3. 选择最佳类别数（K）
# 比较不同K值的BIC
bic_values <- sapply(2:4, function(k) {model <- hlme(sbp ~ time + age + sex, mixture = ~time, random = ~time, ng = k, data = data)return(model$BIC)
})
plot(2:4, bic_values, type="b", xlab="Number of Classes", ylab="BIC")
# 选择BIC最小的K值
# 4. 分配个体到轨迹组
# 提取类别概率并分配最可能类别
data$bp_class <- as.factor(bp_model$pprob$class)# 查看各类别占比
table(data$bp_class) / n
# 5. 分析轨迹组与心血管疾病的关系
# 逻辑回归：轨迹组预测CVD发病（需每条个体保留唯一结局）
data_unique <- data %>% group_by(id) %>% slice(1) %>% ungroup()
glm_model <- glm(cvd ~ bp_class + age + sex, family = binomial, data = data_unique
)
summary(glm_model)
# 6. 对体重进行类似分析
# 重复步骤2-5，将因变量替换为weight。