Dataset API
Forest Dataset
The ForestDataset class is a wrapper around the data needed to sample one or more tree ensembles.
Its core elements are:
- Covariates: Features / variables used to partition the forests. Stored internally as a (column-major) Eigen::MatrixXd.
- Basis: [Optional] basis vector used to define a “leaf regression”: a partitioned linear model in which the covariates define the partitions and the basis defines the regression variables. Also stored internally as a (column-major) Eigen::MatrixXd.
- Sample Weights: [Optional] case weights for every observation in a training dataset. These may be heteroskedastic variance parameters or simply survey / case weights. Stored internally as an Eigen::VectorXd.
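To make the "leaf regression" idea concrete, the following is a minimal stand-alone sketch of a partitioned linear model with a single split. It is not the library's implementation: the struct SplitLeafRegression, its fields, and the function Predict are hypothetical names introduced only for illustration. The covariates decide which leaf an observation falls into, and the basis supplies the regression variables inside that leaf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-split "tree" with a linear model in each leaf.
// None of these names come from the library; they only illustrate the idea.
struct SplitLeafRegression {
  std::size_t split_feature;      // index of the covariate used to partition
  double threshold;               // covariate <= threshold goes to the left leaf
  std::vector<double> left_coef;  // one coefficient per basis column (left leaf)
  std::vector<double> right_coef; // one coefficient per basis column (right leaf)
};

// Prediction for one observation: the covariates pick the leaf,
// the basis is multiplied by that leaf's coefficients.
double Predict(const SplitLeafRegression& model,
               const std::vector<double>& covariates,
               const std::vector<double>& basis) {
  const std::vector<double>& coef =
      covariates[model.split_feature] <= model.threshold ? model.left_coef
                                                         : model.right_coef;
  assert(coef.size() == basis.size());
  double prediction = 0.0;
  for (std::size_t j = 0; j < basis.size(); ++j) {
    prediction += coef[j] * basis[j];
  }
  return prediction;
}
```

With a basis consisting of a single constant column, this reduces to an ordinary tree with constant leaf values.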
-
class ForestDataset
API for loading and accessing data used to sample tree ensembles.
Public Functions
-
inline ForestDataset()
Default constructor. No data is loaded at construction time.
-
inline void AddCovariates(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)
Copy / load covariates from a raw memory buffer (often a pointer to the data in an R matrix or numpy array).
- Parameters:
data_ptr – Pointer to the first element of a contiguous array of data storing a covariate matrix
num_row – Number of rows in the covariate matrix
num_col – Number of columns / covariates in the covariate matrix
is_row_major – Whether the data in data_ptr are organized in row-major (true) or column-major (false) fashion
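The role of is_row_major can be illustrated with a small stand-alone sketch that copies a raw buffer into column-major storage. This is an assumption-laden illustration, not the library's code: CopyToColumnMajor and ValueAt are hypothetical helpers, and a std::vector stands in for the internal Eigen::MatrixXd.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): copy a num_row x num_col
// matrix from a raw buffer into column-major storage.
std::vector<double> CopyToColumnMajor(const double* data_ptr,
                                      std::size_t num_row,
                                      std::size_t num_col,
                                      bool is_row_major) {
  std::vector<double> out(num_row * num_col);
  for (std::size_t i = 0; i < num_row; ++i) {
    for (std::size_t j = 0; j < num_col; ++j) {
      // Element (i, j) sits at i * num_col + j in a row-major buffer
      // and at j * num_row + i in a column-major buffer.
      const std::size_t src = is_row_major ? i * num_col + j : j * num_row + i;
      out[j * num_row + i] = data_ptr[src];
    }
  }
  return out;
}

// Hypothetical accessor: read element (row, col) from column-major storage.
double ValueAt(const std::vector<double>& col_major, std::size_t num_row,
               std::size_t row, std::size_t col) {
  return col_major[col * num_row + row];
}
```

As general background, numpy arrays are row-major by default while R matrices are column-major, so the flag passed typically depends on which language the buffer came from.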
-
inline void AddBasis(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)
Copy / load a basis matrix from a raw memory buffer (often a pointer to the data in an R matrix or numpy array).
- Parameters:
data_ptr – Pointer to the first element of a contiguous array of data storing a basis matrix
num_row – Number of rows in the basis matrix
num_col – Number of columns in the basis matrix
is_row_major – Whether the data in data_ptr are organized in row-major (true) or column-major (false) fashion
-
inline void AddVarianceWeights(double *data_ptr, data_size_t num_row)
Copy / load variance weights from a raw memory buffer (often a pointer to the data in an R vector or numpy array).
- Parameters:
data_ptr – Pointer to the first element of a contiguous array of data storing weights
num_row – Number of rows in the weight vector
-
inline bool HasCovariates()
Whether or not a ForestDataset has (yet) loaded covariate data.
-
inline bool HasBasis()
Whether or not a ForestDataset has (yet) loaded basis data.
-
inline bool HasVarWeights()
Whether or not a ForestDataset has (yet) loaded variance weights.
-
inline data_size_t NumObservations()
Number of observations (rows) in the dataset.
-
inline int NumCovariates()
Number of covariate columns in the dataset.
-
inline int NumBasis()
Number of basis columns in the dataset. This is 0 if the dataset has not been provided a basis matrix.
-
inline double CovariateValue(data_size_t row, int col)
Returns a dataset’s covariate value stored at (row, col).
- Parameters:
row – Row number to query in the covariate matrix
col – Column number to query in the covariate matrix
-
inline double BasisValue(data_size_t row, int col)
Returns a dataset’s basis value stored at (row, col).
- Parameters:
row – Row number to query in the basis matrix
col – Column number to query in the basis matrix
-
inline double VarWeightValue(data_size_t row)
Returns a dataset’s variance weight stored at element row.
- Parameters:
row – Index to query in the weight vector
-
inline Eigen::MatrixXd &GetCovariates()
Return a reference to the raw Eigen::MatrixXd storing the covariate data.
- Returns:
Reference to internal Eigen::MatrixXd
-
inline Eigen::MatrixXd &GetBasis()
Return a reference to the raw Eigen::MatrixXd storing the basis data.
- Returns:
Reference to internal Eigen::MatrixXd
-
inline Eigen::VectorXd &GetVarWeights()
Return a reference to the raw Eigen::VectorXd storing the variance weights.
- Returns:
Reference to internal Eigen::VectorXd
-
inline void UpdateBasis(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)
Update the data in the internal basis matrix to new values stored in a raw double array.
- Parameters:
data_ptr – Pointer to the first element of a contiguous array of data storing a basis matrix
num_row – Number of rows in the basis matrix
num_col – Number of columns in the basis matrix
is_row_major – Whether the data in data_ptr are organized in row-major (true) or column-major (false) fashion
Random Effects Dataset
The RandomEffectsDataset class is a wrapper around the data needed to sample (additive) random effects.
Its core elements are:
- Basis: Vector of variables that have group-specific random coefficients. In the simplest additive group random effects model, this is a constant intercept of all ones. Stored internally as a (column-major) Eigen::MatrixXd.
- Group Indices: Integer-valued indices of group membership. In a model with three groups, these indices would typically be 0, 1, and 2 (remapped from perhaps more descriptive labels in R or Python). Stored internally as a std::vector of integers.
- Sample Weights: [Optional] case weights for every observation in a training dataset. These may be heteroskedastic variance parameters or simply survey / case weights. Stored internally as an Eigen::VectorXd.
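The remapping of descriptive group labels to contiguous integer indices, mentioned above, might look like the following stand-alone sketch. RemapGroupLabels is a hypothetical helper, not a library function, and assigning indices in order of first appearance is an assumption made for this example; the R and Python interfaces may order groups differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of the library): map descriptive labels to
// contiguous indices 0, 1, 2, ... in order of first appearance.
std::vector<std::int32_t> RemapGroupLabels(const std::vector<std::string>& labels) {
  std::unordered_map<std::string, std::int32_t> index_of;
  std::vector<std::int32_t> indices;
  indices.reserve(labels.size());
  for (const std::string& label : labels) {
    auto it = index_of.find(label);
    if (it == index_of.end()) {
      // First time this label is seen: assign the next unused index.
      const std::int32_t next = static_cast<std::int32_t>(index_of.size());
      it = index_of.emplace(label, next).first;
    }
    indices.push_back(it->second);
  }
  return indices;
}
```

The resulting vector has one entry per observation, which is the shape that AddGroupLabels expects.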
-
class RandomEffectsDataset
API for loading and accessing data used to sample (additive) random effects.
Public Functions
-
inline RandomEffectsDataset()
Default constructor. No data is loaded at construction time.
-
inline void AddBasis(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)
Copy / load a basis matrix from a raw memory buffer (often a pointer to the data in an R matrix or numpy array).
- Parameters:
data_ptr – Pointer to the first element of a contiguous array of data storing a basis matrix
num_row – Number of rows in the basis matrix
num_col – Number of columns in the basis matrix
is_row_major – Whether the data in data_ptr are organized in row-major (true) or column-major (false) fashion
-
inline void AddVarianceWeights(double *data_ptr, data_size_t num_row)
Copy / load variance weights from a raw memory buffer (often a pointer to the data in an R vector or numpy array).
- Parameters:
data_ptr – Pointer to the first element of a contiguous array of data storing weights
num_row – Number of rows in the weight vector
-
inline void AddGroupLabels(std::vector<int32_t> &group_labels)
Copy / load group indices for random effects.
- Parameters:
group_labels – Vector of integers with one element per row of the basis matrix (num_row elements), where each element is the group label of the corresponding observation.
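To show how per-observation group labels and the basis fit together, here is a minimal stand-alone sketch of a group-specific linear term. It is an illustration under stated assumptions rather than the library's sampler: RandomEffectTerm and the coefs argument are hypothetical, and a column-major std::vector stands in for the basis matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the library): for each observation i,
// compute sum_j basis(i, j) * coefs[group_labels[i]][j].
// basis is column-major with num_row rows and num_col columns;
// coefs[g] holds num_col coefficients for group g.
std::vector<double> RandomEffectTerm(
    const std::vector<double>& basis, std::size_t num_row, std::size_t num_col,
    const std::vector<std::int32_t>& group_labels,
    const std::vector<std::vector<double>>& coefs) {
  assert(basis.size() == num_row * num_col);
  // One group label per observation, matching the rows of the basis.
  assert(group_labels.size() == num_row);
  std::vector<double> out(num_row, 0.0);
  for (std::size_t i = 0; i < num_row; ++i) {
    // at() throws std::out_of_range if a label has no coefficient vector.
    const std::vector<double>& beta =
        coefs.at(static_cast<std::size_t>(group_labels[i]));
    assert(beta.size() == num_col);
    for (std::size_t j = 0; j < num_col; ++j) {
      out[i] += basis[j * num_row + i] * beta[j];
    }
  }
  return out;
}
```

With a single basis column of all ones, each observation simply receives its group's intercept, which corresponds to the simplest additive group random effects model described above.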
-
inline data_size_t NumObservations()
Number of observations (rows) in the dataset.
-
inline bool HasBasis()
Whether or not a RandomEffectsDataset has (yet) loaded basis data.
-
inline bool HasVarWeights()
Whether or not a RandomEffectsDataset has (yet) loaded variance weights.
-
inline bool HasGroupLabels()
Whether or not a RandomEffectsDataset has (yet) loaded group labels.
-
inline double BasisValue(data_size_t row, int col)
Returns a dataset’s basis value stored at (row, col).
- Parameters:
row – Row number to query in the basis matrix
col – Column number to query in the basis matrix
-
inline double VarWeightValue(data_size_t row)
Returns a dataset’s variance weight stored at element row.
- Parameters:
row – Index to query in the weight vector
-
inline int32_t GroupId(data_size_t row)
Returns a dataset’s group label stored at element row.
- Parameters:
row – Index to query in the group label vector
-
inline Eigen::MatrixXd &GetBasis()
Return a reference to the raw Eigen::MatrixXd storing the basis data.
- Returns:
Reference to internal Eigen::MatrixXd
-
inline Eigen::VectorXd &GetVarWeights()
Return a reference to the raw Eigen::VectorXd storing the variance weights.
- Returns:
Reference to internal Eigen::VectorXd
-
inline std::vector<int32_t> &GetGroupLabels()
Return a reference to the raw std::vector storing the group labels.
- Returns:
Reference to internal std::vector