Feature importance: xgb.importance • xgboost

Creates a data.table of feature importances.

Usage

xgb.importance(
  model = NULL,
  feature_names = getinfo(model, "feature_name"),
  trees = NULL
)

Arguments

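model: Object of class xgb.Booster.
feature_names: Character vector used to overwrite the feature names of the model. The default, getinfo(model, "feature_name"), uses the names stored in the model.
trees: An integer vector of (1-based) tree indices to include in the importance calculation (only for the "gbtree" booster). The default (NULL) parses all trees. This can be useful, for example, in multiclass classification to get feature importances for each class separately.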
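
The tree indices for one class can be computed with plain index arithmetic. The sketch below assumes that a multiclass "gbtree" model grows one tree per class in each boosting round and stores them round by round, so the trees for class k sit at positions k, k + num_classes, and so on (this is the layout the multiclass example on this page relies on). `class_tree_indices` is a hypothetical helper name, not part of xgboost.

```r
# Hypothetical helper (not part of xgboost): 1-based tree indices that belong
# to one class, assuming one tree per class per boosting round.
class_tree_indices <- function(class_id, num_classes, nrounds) {
  stopifnot(class_id >= 1L, class_id <= num_classes)
  seq(from = class_id, by = num_classes, length.out = nrounds)
}

# Three classes, five rounds: 15 trees in total, split into three index sets.
idx_by_class <- lapply(1:3, class_tree_indices, num_classes = 3L, nrounds = 5L)
idx_by_class[[2]]
# 2 5 8 11 14
```

Each of these vectors can then be passed as the trees argument to obtain the importance table for a single class.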

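Value

A data.table with the following columns:

For a tree model:

Features: Names of the features used in the model.
Gain: Fractional contribution of each feature to the model, based on the total gain of the feature's splits. A higher percentage means a more important feature.
Cover: Metric of the number of observations related to this feature.
Frequency: Percentage of times the feature has been used in the trees.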
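
To make the three columns concrete, the sketch below recomputes them by hand from a small, made-up table of splits. It assumes that each column is reported as a share that sums to 1 across features; the `splits` data and its column names are hypothetical and only illustrate the arithmetic, not the library's internal code.

```r
# Made-up split records: one row per split across some trees (illustrative only).
splits <- data.frame(
  feature = c("dose", "len", "dose", "len", "len"),
  gain    = c(6.0, 3.0, 2.0, 0.5, 0.5),
  cover   = c(60, 30, 25, 20, 15)
)

# Sum gain and cover per feature, and count how often each feature is used.
totals <- aggregate(cbind(gain, cover) ~ feature, data = splits, FUN = sum)
totals$n_splits <- as.vector(table(splits$feature)[totals$feature])

# Turn the totals into shares so that each column sums to 1.
importance <- data.frame(
  Features  = totals$feature,
  Gain      = totals$gain / sum(totals$gain),
  Cover     = totals$cover / sum(totals$cover),
  Frequency = totals$n_splits / sum(totals$n_splits)
)

# Rank features by their share of the total gain.
importance[order(-importance$Gain), ]
```

In this toy table "dose" has the larger share of gain even though "len" is used in more splits, which is why the three columns can rank features differently.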

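For a linear model:

Features: Names of the features used in the model.
Weight: Linear coefficient of this feature.
Class: Class label (only for multiclass models). For objects of class xgboost (as produced by xgboost()), it will be a factor, while for objects of class xgb.Booster (as produced by xgb.train()), it will be a zero-based integer vector.

If feature_names is not provided and model does not contain feature names, the index of each feature is used instead. Because the index is extracted from the model dump (which is based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (as is usual in R).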
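
When only indices are available, they can be mapped back to column names by shifting them by one. This is a minimal sketch, assuming the indices arrive as zero-based integers (or strings holding such integers) and that the training column names are known; `zero_based_to_names` is a hypothetical helper, not an xgboost function.

```r
# Hypothetical helper: translate zero-based feature indices into R column names.
zero_based_to_names <- function(index, column_names) {
  index <- as.integer(index)
  stopifnot(all(index >= 0L), all(index < length(column_names)))
  column_names[index + 1L]  # R vectors are 1-based, so shift by one
}

cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
zero_based_to_names(c("2", "0", "3"), cols)
# "Petal.Length" "Sepal.Length" "Petal.Width"
```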

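Details

This function works for both linear and tree models.

For linear models, the importance is the absolute magnitude of the linear coefficients. To obtain a meaningful ranking by importance for linear models, the features need to be on the same scale (which is also recommended when using L1 or L2 regularization).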
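
The scale dependence is easy to see with any linear fit. The sketch below uses base R's lm() as a stand-in for a linear booster (it is not "gblinear" and applies no regularization), on simulated data with hypothetical variable names. The same feature expressed in different units gets a very different coefficient, while standardizing with scale() makes the magnitudes comparable.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)      # first feature, roughly unit scale
x2_m <- rnorm(n)    # second feature, measured in metres
y <- 2 * x1 + 1 * x2_m + rnorm(n, sd = 0.1)

# Same information, but the second feature is now expressed in millimetres.
x2_mm <- x2_m * 1000
raw_fit <- lm(y ~ x1 + x2_mm)
raw_importance <- abs(coef(raw_fit))[-1]  # drop the intercept
raw_importance  # the x2_mm coefficient is about 1000 times smaller

# After standardizing, coefficient magnitudes are comparable again.
scaled <- as.data.frame(scale(cbind(x1 = x1, x2 = x2_mm)))
scaled$y <- y
scaled_fit <- lm(y ~ x1 + x2, data = scaled)
scaled_importance <- abs(coef(scaled_fit))[-1]
scaled_importance
```

Ranking by raw_importance would make the second feature look negligible purely because of its units; the standardized fit removes that artifact.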

Examples

# binary classification using "gbtree":
data("ToothGrowth")
x <- ToothGrowth[, c("len", "dose")]
y <- ToothGrowth$supp
model_tree_binary <- xgboost(
  x, y,
  nrounds = 5L,
  nthreads = 1L,
  booster = "gbtree",
  max_depth = 2L
)
xgb.importance(model_tree_binary)

# binary classification using "gblinear":
model_linear_binary <- xgboost(
  x, y,
  nrounds = 5L,
  nthreads = 1L,
  booster = "gblinear",
  learning_rate = 0.3
)
xgb.importance(model_linear_binary)

# multi-class classification using "gbtree":
data("iris")
x <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Species
model_tree_multi <- xgboost(
  x, y,
  nrounds = 5L,
  nthreads = 1L,
  booster = "gbtree",
  max_depth = 3L
)
# all classes clumped together:
xgb.importance(model_tree_multi)
# inspect importances separately for each class:
num_classes <- 3L
nrounds <- 5L
xgb.importance(
  model_tree_multi, trees = seq(from = 1, by = num_classes, length.out = nrounds)
)
xgb.importance(
  model_tree_multi, trees = seq(from = 2, by = num_classes, length.out = nrounds)
)
xgb.importance(
  model_tree_multi, trees = seq(from = 3, by = num_classes, length.out = nrounds)
)

# multi-class classification using "gblinear":
model_linear_multi <- xgboost(
  x, y,
  nrounds = 5L,
  nthreads = 1L,
  booster = "gblinear",
  learning_rate = 0.2
)
xgb.importance(model_linear_multi)