Dataset API

Forest Dataset

The ForestDataset class is a wrapper around data needed to sample one or more tree ensembles. Its core elements are

  • Covariates: Features / variables used to partition the forests. Stored internally as a (column-major) Eigen::MatrixXd.

  • Basis: [Optional] basis vector used to define a “leaf regression”, a partitioned linear model in which the covariates define the partitions and the basis defines the regression variables. Also stored internally as a (column-major) Eigen::MatrixXd.

  • Sample Weights: [Optional] case weights for every observation in a training dataset. These may be heteroskedastic variance parameters or simply survey / case weights. Stored internally as an Eigen::VectorXd.
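To make the "leaf regression" idea above concrete, here is a minimal sketch using only the C++ standard library. The `Stump` struct and `PredictLeafRegression` function are hypothetical names introduced for illustration; they are not part of the ForestDataset API and do not reflect how the library stores trees. The sketch assumes a single split, with one coefficient vector per leaf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-split "tree" used only for illustration; not a library type.
struct Stump {
  int split_feature;                // covariate column used to partition
  double split_value;               // observations with x <= split_value go left
  std::vector<double> left_coefs;   // regression coefficients in the left leaf
  std::vector<double> right_coefs;  // regression coefficients in the right leaf
};

// Predict for one observation: the covariates choose the leaf, and the basis
// supplies the regression variables used within that leaf.
double PredictLeafRegression(const Stump& tree,
                             const std::vector<double>& covariates,
                             const std::vector<double>& basis) {
  const std::vector<double>& coefs =
      (covariates[tree.split_feature] <= tree.split_value) ? tree.left_coefs
                                                            : tree.right_coefs;
  assert(coefs.size() == basis.size());
  double prediction = 0.0;
  for (std::size_t k = 0; k < basis.size(); ++k) {
    prediction += coefs[k] * basis[k];
  }
  return prediction;
}
```

With a basis consisting of a single column of ones, this reduces to an ordinary constant-leaf tree, which is why the basis is optional.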

class ForestDataset

API for loading and accessing data used to sample tree ensembles.

Public Functions

inline ForestDataset()

Default constructor. No data is loaded at construction time.

inline void AddCovariates(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)

Copy / load covariates from a raw memory buffer (often a pointer to data in an R matrix or numpy array)

Parameters:
  • data_ptr – Pointer to first element of a contiguous array of data storing a covariate matrix

  • num_row – Number of rows in the covariate matrix

  • num_col – Number of columns / covariates in the covariate matrix

  • is_row_major – Whether the data in data_ptr are stored in row-major (true) or column-major (false) order
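The role of is_row_major can be illustrated with a small standalone sketch. R matrices are stored column-major, while numpy arrays default to row-major (C order), so the same logical matrix can arrive in either layout. The function below, `CopyToColumnMajor`, is a hypothetical helper written for this illustration and is not the library's implementation; it only shows the index arithmetic involved in copying either layout into column-major storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a num_row x num_col matrix from a raw buffer into column-major storage.
// Illustrative only; the library performs its own copy into an Eigen::MatrixXd.
std::vector<double> CopyToColumnMajor(const double* data_ptr, std::size_t num_row,
                                      std::size_t num_col, bool is_row_major) {
  assert(data_ptr != nullptr);
  std::vector<double> out(num_row * num_col);
  for (std::size_t i = 0; i < num_row; ++i) {
    for (std::size_t j = 0; j < num_col; ++j) {
      // Row-major: element (i, j) sits at offset i * num_col + j.
      // Column-major: element (i, j) sits at offset j * num_row + i.
      const std::size_t src = is_row_major ? i * num_col + j : j * num_row + i;
      out[j * num_row + i] = data_ptr[src];
    }
  }
  return out;
}
```

Passing the wrong flag does not fail loudly: the buffer is simply read in the wrong order, so the flag should match how the calling language laid out the data.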

inline void AddBasis(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)

Copy / load basis matrix from a raw memory buffer (often a pointer to data in an R matrix or numpy array)

Parameters:
  • data_ptr – Pointer to first element of a contiguous array of data storing a basis matrix

  • num_row – Number of rows in the basis matrix

  • num_col – Number of columns in the basis matrix

  • is_row_major – Whether the data in data_ptr are stored in row-major (true) or column-major (false) order

inline void AddVarianceWeights(double *data_ptr, data_size_t num_row)

Copy / load variance weights from a raw memory buffer (often a pointer to data in an R vector or numpy array)

Parameters:
  • data_ptr – Pointer to first element of a contiguous array of data storing weights

  • num_row – Number of rows in the weight vector

inline bool HasCovariates()

Whether or not a ForestDataset has (yet) loaded covariate data.

inline bool HasBasis()

Whether or not a ForestDataset has (yet) loaded basis data.

inline bool HasVarWeights()

Whether or not a ForestDataset has (yet) loaded variance weights.

inline data_size_t NumObservations()

Number of observations (rows) in the dataset.

inline int NumCovariates()

Number of covariate columns in the dataset.

inline int NumBasis()

Number of bases (basis columns) in the dataset. This is 0 if no basis matrix has been provided.

inline double CovariateValue(data_size_t row, int col)

Returns a dataset’s covariate value stored at (row, col)

Parameters:
  • row – Row number to query in the covariate matrix

  • col – Column number to query in the covariate matrix

inline double BasisValue(data_size_t row, int col)

Returns a dataset’s basis value stored at (row, col)

Parameters:
  • row – Row number to query in the basis matrix

  • col – Column number to query in the basis matrix

inline double VarWeightValue(data_size_t row)

Returns a dataset’s variance weight stored at element row

Parameters:
  • row – Index to query in the weight vector

inline Eigen::MatrixXd &GetCovariates()

Return a reference to the raw Eigen::MatrixXd storing the covariate data.

Returns:

Reference to internal Eigen::MatrixXd

inline Eigen::MatrixXd &GetBasis()

Return a reference to the raw Eigen::MatrixXd storing the basis data.

Returns:

Reference to internal Eigen::MatrixXd

inline Eigen::VectorXd &GetVarWeights()

Return a reference to the raw Eigen::VectorXd storing the variance weights.

Returns:

Reference to internal Eigen::VectorXd

inline void UpdateBasis(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)

Update the data in the internal basis matrix to new values stored in a raw double array.

Parameters:
  • data_ptr – Pointer to first element of a contiguous array of data storing a basis matrix

  • num_row – Number of rows in the basis matrix

  • num_col – Number of columns in the basis matrix

  • is_row_major – Whether the data in data_ptr are stored in row-major (true) or column-major (false) order

Random Effects Dataset

The RandomEffectsDataset class is a wrapper around data needed to sample (additive) random effects. Its core elements are

  • Basis: Vector of variables that have group-specific random coefficients. In the simplest additive group random effects model, this is a constant intercept of all ones. Stored internally as a (column-major) Eigen::MatrixXd.

  • Group Indices: Integer-valued indices of group membership. In a model with three groups, these indices would typically be 0, 1, and 2 (remapped from perhaps more descriptive labels in R or Python). Stored internally as a std::vector of integers.

  • Sample Weights: [Optional] case weights for every observation in a training dataset. These may be heteroskedastic variance parameters or simply survey / case weights. Stored internally as an Eigen::VectorXd.
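As a rough sketch of how the basis and group indices above combine, the function below computes the random effects contribution for one observation as the dot product of its basis row with the coefficient vector of its group. `RandomEffectTerm` and the `group_coefs` layout are hypothetical, chosen for illustration only; they are not part of the RandomEffectsDataset API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Random effects contribution for one observation: the group index selects a
// coefficient vector, which is then multiplied against the basis row.
// group_coefs[g] is assumed to hold the coefficients for group g.
double RandomEffectTerm(const std::vector<double>& basis_row, std::int32_t group_id,
                        const std::vector<std::vector<double>>& group_coefs) {
  assert(group_id >= 0 &&
         static_cast<std::size_t>(group_id) < group_coefs.size());
  const std::vector<double>& coefs =
      group_coefs[static_cast<std::size_t>(group_id)];
  assert(coefs.size() == basis_row.size());
  double term = 0.0;
  for (std::size_t k = 0; k < basis_row.size(); ++k) {
    term += basis_row[k] * coefs[k];
  }
  return term;
}
```

When the basis is a single column of ones, the term is just the group's intercept, which matches the simplest additive group random effects model described above.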

class RandomEffectsDataset

API for loading and accessing data used to sample (additive) random effects.

Public Functions

inline RandomEffectsDataset()

Default constructor. No data is loaded at construction time.

inline void AddBasis(double *data_ptr, data_size_t num_row, int num_col, bool is_row_major)

Copy / load basis matrix from a raw memory buffer (often a pointer to data in an R matrix or numpy array)

Parameters:
  • data_ptr – Pointer to first element of a contiguous array of data storing a basis matrix

  • num_row – Number of rows in the basis matrix

  • num_col – Number of columns in the basis matrix

  • is_row_major – Whether the data in data_ptr are stored in row-major (true) or column-major (false) order

inline void AddVarianceWeights(double *data_ptr, data_size_t num_row)

Copy / load variance weights from a raw memory buffer (often a pointer to data in an R vector or numpy array)

Parameters:
  • data_ptr – Pointer to first element of a contiguous array of data storing weights

  • num_row – Number of rows in the weight vector

inline void AddGroupLabels(std::vector<int32_t> &group_labels)

Copy / load group indices for random effects.

Parameters:
  • group_labels – Vector of integers with as many elements as num_row in the basis matrix, where each element corresponds to the group label for a given observation.
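Group labels often start life as descriptive strings in R or Python and need to be remapped to integers before being passed in. The sketch below shows one way to do that with the standard library. `RemapGroupLabels` is a hypothetical helper, and numbering groups in order of first appearance is an assumption made for this example; the source does not say which ordering the R or Python wrappers use.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Map arbitrary string labels to 0-based integer indices, numbering groups in
// the order in which they first appear (an illustrative choice).
std::vector<std::int32_t> RemapGroupLabels(const std::vector<std::string>& labels) {
  std::map<std::string, std::int32_t> index_of;
  std::vector<std::int32_t> out;
  out.reserve(labels.size());
  for (const std::string& label : labels) {
    auto it = index_of.find(label);
    if (it == index_of.end()) {
      // A new label receives the next unused index.
      it = index_of.emplace(label, static_cast<std::int32_t>(index_of.size())).first;
    }
    out.push_back(it->second);
  }
  // One index per observation, matching the number of rows in the basis.
  assert(out.size() == labels.size());
  return out;
}
```

The resulting vector has one entry per observation, so it can be checked against the number of rows in the basis matrix before loading.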

inline data_size_t NumObservations()

Number of observations (rows) in the dataset.

inline bool HasBasis()

Whether or not a RandomEffectsDataset has (yet) loaded basis data.

inline bool HasVarWeights()

Whether or not a RandomEffectsDataset has (yet) loaded variance weights.

inline bool HasGroupLabels()

Whether or not a RandomEffectsDataset has (yet) loaded group labels.

inline double BasisValue(data_size_t row, int col)

Returns a dataset’s basis value stored at (row, col)

Parameters:
  • row – Row number to query in the basis matrix

  • col – Column number to query in the basis matrix

inline double VarWeightValue(data_size_t row)

Returns a dataset’s variance weight stored at element row

Parameters:
  • row – Index to query in the weight vector

inline int32_t GroupId(data_size_t row)

Returns a dataset’s group label stored at element row

Parameters:
  • row – Index to query in the group label vector

inline Eigen::MatrixXd &GetBasis()

Return a reference to the raw Eigen::MatrixXd storing the basis data.

Returns:

Reference to internal Eigen::MatrixXd

inline Eigen::VectorXd &GetVarWeights()

Return a reference to the raw Eigen::VectorXd storing the variance weights.

Returns:

Reference to internal Eigen::VectorXd

inline std::vector<int32_t> &GetGroupLabels()

Return a reference to the raw std::vector storing the group labels.

Returns:

Reference to internal std::vector

Other Classes and Types

enum StochTree::FeatureType

Integer encoding of feature types.

Values:

enumerator kNumeric

Numeric feature

enumerator kOrderedCategorical

Ordered categorical feature

enumerator kUnorderedCategorical

Unordered categorical feature