Create a DMatrix from External Data - xgb.ExtMemDMatrix • xgboost

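Create a special type of XGBoost 'DMatrix' object from external data supplied by an xgb.DataIter() object. The data may be passed in batches drawn from a larger dataset that might not fit entirely in memory.

The data supplied by the iterator is accessed on demand, as many times as needed, without being concatenated. Note, however, that fields such as 'label' will be concatenated across the multiple calls to the data iterator.

For more information, see the guide "Using XGBoost External Memory Version": https://docs.xgboost.com.cn/en/stable/tutorials/external_memory.html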

Usage

xgb.ExtMemDMatrix(
  data_iterator,
  cache_prefix = tempdir(),
  missing = NA,
  nthread = NULL
)

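Arguments

data_iterator

A data iterator structure as returned by xgb.DataIter(), which includes an environment shared between function calls, and functions to access the data in batches on demand.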

cache_prefix

The path of the cache file. The caller must create all the directories in this path beforehand.
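
Because the caller is responsible for creating every directory in this path, one way to prepare it is sketched below. The subdirectory name `xgb_cache` is a hypothetical choice made for illustration; nothing in the package requires it.

```r
# Hypothetical cache location: a dedicated subdirectory of the session's temp dir.
cache_dir <- file.path(tempdir(), "xgb_cache")

# Create the directory and any missing parents; stay quiet if it already exists.
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Fail early if the directory could not be created.
stopifnot(dir.exists(cache_dir))
```

The resulting `cache_dir` could then be supplied as the `cache_prefix` argument.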

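missing

A float value to represent missing values in data.

Note that, while functions like xgb.DMatrix() can take a generic NA and interpret it correctly for different types such as numeric and integer, an NA value passed here will not be adapted to different input types.

For example, in R's integer type, missing values are represented by the integer -2147483648 (since the machine 'integer' type has no inherent 'NA' value). Hence, if one passes NA, which xgb.ExtMemDMatrix() and xgb.QuantileDMatrix.from_iterator() interpret as a floating-point NaN, these integer missing values will not be treated as missing. This should not pose any problem for numeric types, since they do have an inherent NaN value.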
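
One way to sidestep this is to convert integer columns to numeric before handing a batch over, so that integer NA values become floating-point NA values. The sketch below assumes the batches are data frames; the helper name `as_numeric_batch` is hypothetical and not part of xgboost.

```r
# Hypothetical helper (not part of xgboost): convert the integer columns of a
# data.frame to numeric, leaving all other columns untouched.
as_numeric_batch <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.integer(col)) as.numeric(col) else col
  })
  df
}

# Toy batch with one integer column and one numeric column, both with an NA.
batch <- data.frame(a = c(1L, NA, 3L), b = c(0.5, NA, 2.5))
converted <- as_numeric_batch(batch)

# Column 'a' is now stored as double and its missing entry is preserved.
stopifnot(is.double(converted$a), is.na(converted$a[2]))
```

The converted batch would then be passed to xgb.DataBatch() in place of the original one.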

nthread

Number of threads used for creating the DMatrix.

Value

An 'xgb.DMatrix' object, with subclass 'xgb.ExtMemDMatrix', in which the data is not held internally but is accessed through the iterator when needed.

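Details

Be aware that constructing an external data DMatrix caches the data on disk, in a compressed format, under the path supplied in cache_prefix.

External data is not supported for the exact tree method.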
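
A small guard along the following lines could catch an unsupported configuration before training starts. The function name `check_ext_mem_params` is hypothetical, and the sketch assumes the parameters are held in a plain named list with a `tree_method` entry.

```r
# Hypothetical guard (not part of xgboost): reject the exact tree method
# when the training data is an external-memory DMatrix.
check_ext_mem_params <- function(params) {
  if (identical(params[["tree_method"]], "exact")) {
    stop("tree_method = 'exact' is not supported with external data")
  }
  invisible(params)
}

# A list using the 'hist' tree method passes the check unchanged.
params <- check_ext_mem_params(list(tree_method = "hist", nthread = 1))
```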

See also

xgb.DataIter(), xgb.DataBatch(), xgb.QuantileDMatrix.from_iterator()

Examples

data(mtcars)

# This custom environment will be passed to the iterator
# functions at each call. It is up to the user to keep
# track of the iteration number in this environment.
iterator_env <- as.environment(
  list(
    iter = 0,
    x = mtcars[, -1],
    y = mtcars[, 1]
  )
)

# Data is passed in two batches.
# In this example, batches are obtained by subsetting the 'x' variable.
# This is not advantageous to do, since the data is already loaded in memory
# and can be passed in full in one go, but there can be situations in which
# only a subset of the data will fit in the computer's memory, and it can
# be loaded in batches that are accessed one-at-a-time only.
iterator_next <- function(iterator_env) {
  curr_iter <- iterator_env[["iter"]]
  if (curr_iter >= 2) {
    # there are only two batches, so this signals end of the stream
    return(NULL)
  }

  if (curr_iter == 0) {
    x_batch <- iterator_env[["x"]][1:16, ]
    y_batch <- iterator_env[["y"]][1:16]
  } else {
    x_batch <- iterator_env[["x"]][17:32, ]
    y_batch <- iterator_env[["y"]][17:32]
  }
  on.exit({
    iterator_env[["iter"]] <- curr_iter + 1
  })

  # Function 'xgb.DataBatch' must be called manually
  # at each batch with all the appropriate attributes,
  # such as feature names and feature types.
  return(xgb.DataBatch(data = x_batch, label = y_batch))
}

# This moves the iterator back to its beginning
iterator_reset <- function(iterator_env) {
  iterator_env[["iter"]] <- 0
}

data_iterator <- xgb.DataIter(
  env = iterator_env,
  f_next = iterator_next,
  f_reset = iterator_reset
)
cache_prefix <- tempdir()

# DMatrix will be constructed from the iterator's batches
dm <- xgb.ExtMemDMatrix(data_iterator, cache_prefix, nthread = 1)

# After construction, can be used as a regular DMatrix
params <- xgb.params(nthread = 1, objective = "reg:squarederror")
model <- xgb.train(data = dm, nrounds = 2, params = params)

# Predictions can also be called on it, and should be the same
# as if the data were passed differently.
pred_dm <- predict(model, dm)
pred_mat <- predict(model, as.matrix(mtcars[, -1]))