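Create a special type of XGBoost 'DMatrix' object from external data supplied by an xgb.DataIter()
object, potentially passed in batches from a larger dataset that might not fit entirely in memory.
The data supplied by the iterator is accessed on demand, possibly multiple times, without being concatenated. Note, however, that fields like 'label' will be **concatenated** across the multiple calls to the data iterator.
For more information, see the guide 'Using XGBoost External Memory Version': https://docs.xgboost.com.cn/en/release_3.0.0/tutorials/external_memory.html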
Usage
xgb.ExtMemDMatrix(
  data_iterator,
  cache_prefix = tempdir(),
  missing = NA,
  nthread = NULL
)
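Arguments
- data_iterator
A data iterator structure, as returned by xgb.DataIter(), which includes an environment shared between function calls and functions to access the data in batches on demand.
- cache_prefix
The path of the cache files. The caller must create all directories in this path beforehand.
- missing
A float value that represents missing values in the data.
Note that, while functions like xgb.DMatrix() can take a generic NA and interpret it correctly for different types such as numeric and integer, an NA value passed here will not be adapted to the input type. For example, R's integer type represents missing values by the integer -2147483648 (since machine 'integer' types have no inherent 'NA' value). Hence, if NA is passed, xgb.ExtMemDMatrix() and xgb.QuantileDMatrix.from_iterator() interpret it as a floating-point NaN, and these integer missing values will not be treated as missing. This should not pose any problem for numeric types, since they do have an inherent NaN value.
- nthread
The number of threads used for creating the DMatrix.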
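Two of the arguments above carry preconditions that are easy to miss: the directories in cache_prefix must already exist, and integer NA values are not matched by missing = NA. The following is a minimal base R sketch of one way to prepare for both. The directory name, the helper integers_to_double(), and the toy data frame are hypothetical and not part of xgboost. Appending a file-name stem to the directory is an assumption based on the argument's name; the example further below simply passes tempdir().

```r
# Sketch only: base R preparation steps; all names here are illustrative.

# 1. cache_prefix: make sure every directory in the path exists.
cache_dir <- file.path(tempdir(), "xgb_extmem_cache")  # hypothetical location
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
# Assumption: the value is used as a prefix for cache file names.
cache_prefix <- file.path(cache_dir, "cache")

# 2. missing: coerce integer columns to double so that NA_integer_
#    becomes NA_real_, which is a NaN at the floating-point level.
integers_to_double <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.integer(col)) as.numeric(col) else col
  })
  df
}

# Toy batch with one integer column and one numeric column.
batch <- data.frame(a = c(1L, NA, 3L), b = c(0.5, 1.5, NA))
batch_num <- integers_to_double(batch)
```

A batch coerced this way could then be handed to xgb.DataBatch() inside the iterator's next function, so that the default missing = NA should cover its missing entries.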
Examples
data(mtcars)
# This custom environment will be passed to the iterator
# functions at each call. It is up to the user to keep
# track of the iteration number in this environment.
iterator_env <- as.environment(
  list(
    iter = 0,
    x = mtcars[, -1],
    y = mtcars[, 1]
  )
)
# Data is passed in two batches.
# In this example, batches are obtained by subsetting the 'x' variable.
# This is not advantageous to do, since the data is already loaded in memory
# and can be passed in full in one go, but there can be situations in which
# only a subset of the data will fit in the computer's memory, and it can
# be loaded in batches that are accessed one-at-a-time only.
iterator_next <- function(iterator_env) {
  curr_iter <- iterator_env[["iter"]]
  if (curr_iter >= 2) {
    # there are only two batches, so this signals end of the stream
    return(NULL)
  }
  if (curr_iter == 0) {
    x_batch <- iterator_env[["x"]][1:16, ]
    y_batch <- iterator_env[["y"]][1:16]
  } else {
    x_batch <- iterator_env[["x"]][17:32, ]
    y_batch <- iterator_env[["y"]][17:32]
  }
  on.exit({
    iterator_env[["iter"]] <- curr_iter + 1
  })
  # Function 'xgb.DataBatch' must be called manually
  # at each batch with all the appropriate attributes,
  # such as feature names and feature types.
  return(xgb.DataBatch(data = x_batch, label = y_batch))
}
# This moves the iterator back to its beginning
iterator_reset <- function(iterator_env) {
  iterator_env[["iter"]] <- 0
}
data_iterator <- xgb.DataIter(
  env = iterator_env,
  f_next = iterator_next,
  f_reset = iterator_reset
)
cache_prefix <- tempdir()
# DMatrix will be constructed from the iterator's batches
dm <- xgb.ExtMemDMatrix(data_iterator, cache_prefix, nthread = 1)
# After construction, can be used as a regular DMatrix
params <- xgb.params(nthread = 1, objective = "reg:squarederror")
model <- xgb.train(data = dm, nrounds = 2, params = params)
# Predictions can also be called on it, and should be the same
# as if the data were passed differently.
pred_dm <- predict(model, dm)
pred_mat <- predict(model, as.matrix(mtcars[, -1]))
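The example above hard-codes two batches of 16 rows. As a rough sketch of how the same iterator pattern might generalize to any number of batches, the code below precomputes row indices with a hypothetical helper, make_batches(), and stores them in the iterator environment. The helper name and the choice of four batches are illustrative assumptions. xgb.DataBatch() is called with the same arguments as above, and the xgboost package is assumed to be installed when the iterator is actually run.

```r
# Hypothetical helper: split row indices 1..n into contiguous batches.
make_batches <- function(n, n_batches) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx * n_batches / n)))
}

batches <- make_batches(nrow(mtcars), 4)

# The environment keeps the iteration counter together with the
# precomputed row indices for each batch.
iterator_env <- as.environment(
  list(
    iter = 0,
    x = mtcars[, -1],
    y = mtcars[, 1],
    batches = batches
  )
)

iterator_next <- function(iterator_env) {
  curr_iter <- iterator_env[["iter"]]
  if (curr_iter >= length(iterator_env[["batches"]])) {
    # all batches were consumed, so this signals end of the stream
    return(NULL)
  }
  rows <- iterator_env[["batches"]][[curr_iter + 1]]
  iterator_env[["iter"]] <- curr_iter + 1
  xgboost::xgb.DataBatch(
    data = iterator_env[["x"]][rows, ],
    label = iterator_env[["y"]][rows]
  )
}

# This moves the iterator back to its beginning
iterator_reset <- function(iterator_env) {
  iterator_env[["iter"]] <- 0
}
```

These functions can be passed to xgb.DataIter() in the same way as in the example above. In a real out-of-memory setting, each call would more likely read its batch from disk than subset an object that is already in memory.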