Decision Tree API

Tree

The fundamental building block of the C++ tree interface is the Tree class.

class Tree

API for constructing decision trees (splitting, pruning, setting parameter values)

Public Functions

void CloneFromTree(Tree *tree)

Copy the structure and parameters of another tree. If the Tree object calling this method already has a non-root tree structure / parameters, this will be erased and replaced with a copy of tree.

Parameters:

treeTree to be cloned

void Reset()

Reset tree to empty vectors and default values of boolean / integer variables.

void Init(int output_dimension = 1)

Initialize the tree with a single root node.

int AllocNode()

Allocate a new node and return the node’s ID.

void DeleteNode(std::int32_t nid)

Deletes node indexed by node ID.

void ExpandNode(std::int32_t nid, int split_index, double split_value, double left_value, double right_value)

Expand a node based on a numeric split rule.

void ExpandNode(std::int32_t nid, int split_index, std::vector<std::uint32_t> const &categorical_indices, double left_value, double right_value)

Expand a node based on a categorical split rule.

void ExpandNode(std::int32_t nid, int split_index, double split_value, std::vector<double> left_value_vector, std::vector<double> right_value_vector)

Expand a node based on a numeric split rule.

void ExpandNode(std::int32_t nid, int split_index, std::vector<std::uint32_t> const &categorical_indices, std::vector<double> left_value_vector, std::vector<double> right_value_vector)

Expand a node based on a categorical split rule.

void ExpandNode(std::int32_t nid, int split_index, TreeSplit &split, double left_value, double right_value)

Expand a node based on a generic split rule.

void ExpandNode(std::int32_t nid, int split_index, TreeSplit &split, std::vector<double> left_value_vector, std::vector<double> right_value_vector)

Expand a node based on a generic split rule.

inline bool IsRoot()

Whether or not a tree is a “stump” consisting of a single root node.

json to_json()

Convert tree to JSON and return JSON in-memory.

void from_json(const json &tree_json)

Load from JSON.

Parameters:

tree_json – In-memory json object (of type nlohmann::json)

inline void CollapseToLeaf(std::int32_t nid, double value)

Collapse an internal node to a leaf node, deleting its children from the tree.

Parameters:
  • nid – Node id of the new leaf node

  • value_vector – New leaf value

inline void CollapseToLeaf(std::int32_t nid, std::vector<double> value_vector)

Collapse an internal node to a leaf node, deleting its children from the tree.

Parameters:
  • nid – Node id of the new leaf node

  • value_vector – New leaf vector value

template<typename Func>
inline void WalkTree(Func func) const

Iterate through all nodes in this tree.

Template Parameters:

Func – Function object type, must map std::int32_t to bool.

Parameters:

func – Function that accepts a node index and returns False when iteration through a given branch of the tree should stop and True otherwise.

inline bool HasVectorOutput() const

Whether or not a tree has vector output.

Getters

inline std::int32_t OutputDimension() const

Dimension of tree output.

inline std::int32_t Parent(std::int32_t nid) const

Index of the node’s parent.

Parameters:

nid – ID of node being queried

inline std::int32_t LeftChild(std::int32_t nid) const

Index of the node’s left child.

Parameters:

nid – ID of node being queried

inline std::int32_t RightChild(std::int32_t nid) const

Index of the node’s right child.

Parameters:

nid – ID of node being queried

inline std::int32_t DefaultChild(std::int32_t nid) const

Index of the node’s “default” child (potentially used in the case of a missing feature at prediction time)

Parameters:

nid – ID of node being queried

inline std::int32_t SplitIndex(std::int32_t nid) const

Feature index defining the node’s split rule.

Parameters:

nid – ID of node being queried

inline bool IsLeaf(std::int32_t nid) const

Whether the node is a leaf node.

Parameters:

nid – ID of node being queried

inline bool IsRoot(std::int32_t nid) const

Whether the node is root.

Parameters:

nid – ID of node being queried

inline bool IsDeleted(std::int32_t nid) const

Whether the node has been deleted.

Parameters:

nid – ID of node being queried

inline double LeafValue(std::int32_t nid) const

Get parameter value of a node (typically though not necessarily a leaf node)

Parameters:

nid – ID of node being queried

inline double LeafValue(std::int32_t nid, std::int32_t dim_id) const

Get parameter value of a node (typically though not necessarily a leaf node) at a given output dimension.

Parameters:
  • nid – ID of node being queried

  • dim_id – Output dimension being queried

inline std::int32_t MaxLeafDepth() const

Get maximum depth of all of the leaf nodes.

inline std::vector<double> LeafVector(std::int32_t nid) const

Get vector-valued parameters of a node (typically leaf)

Parameters:

nid – ID of node being queried

inline double SumSquaredNodeValues(std::int32_t nid) const

Sum of squared parameter values for a given node (typically though not necessarily a leaf node)

Parameters:

nid – ID of node being queried

inline double SumSquaredLeafValues() const

Sum of squared values for all leaves in a tree.

inline bool HasLeafVector(std::int32_t nid) const

Tests whether the leaf node has a non-empty leaf vector.

Parameters:

nid – ID of node being queried

inline double Threshold(std::int32_t nid) const

Get split threshold of the node.

Parameters:

nid – ID of node being queried

inline std::vector<std::uint32_t> CategoryList(std::int32_t nid) const

Get list of all categories belonging to the left child node. Categories are integers ranging from 0 to (n-1), where n is the number of categories in that particular feature. This list is assumed to be in ascending order.

Parameters:

nid – ID of node being queried

inline TreeNodeType NodeType(std::int32_t nid) const

Get the type of a node (i.e. numeric split, categorical split, leaf)

Parameters:

nid – ID of node being queried

inline bool HasCategoricalSplit() const

Query whether this tree contains any categorical splits.

inline std::vector<std::int32_t> const &GetInternalNodes() const

Get indices of all internal nodes.

inline std::vector<std::int32_t> const &GetLeaves() const

Get indices of all leaf nodes.

inline std::vector<std::int32_t> const &GetLeafParents() const

Get indices of all leaf parent nodes.

inline std::int32_t GetDepth(std::int32_t nid) const

Get the depth of a node.

Parameters:

nid – node id

inline std::int32_t NumNodes() const noexcept

Get the total number of nodes including deleted ones in this tree.

inline std::int32_t NumDeletedNodes() const noexcept

Get the total number of deleted nodes in this tree.

inline std::int32_t NumValidNodes() const noexcept

Get the total number of valid nodes in this tree.

inline void SetLeftChild(std::int32_t nid, std::int32_t left_child)

Identify left child node.

Setters

Parameters:
  • nid – ID of node being modified

  • left_child – ID of the left child node

inline void SetRightChild(std::int32_t nid, std::int32_t right_child)

Identify right child node.

Parameters:
  • nid – ID of node being modified

  • right_child – ID of the right child node

inline void SetChildren(std::int32_t nid, std::int32_t left_child, std::int32_t right_child)

Identify two child nodes of the node and the corresponding parent node of the child nodes.

Parameters:
  • nid – ID of node being modified

  • left_child – ID of the left child node

  • right_child – ID of the right child node

inline void SetParent(std::int32_t child_node, std::int32_t parent_node)

Identify parent node.

Parameters:
  • child_node – ID of child node

  • parent_node – ID of the parent node

inline void SetParents(std::int32_t nid, std::int32_t left_child, std::int32_t right_child)

Identify parent node of the left and right node ids.

Parameters:
  • nid – ID of parent node

  • left_child – ID of the left child node

  • right_child – ID of the right child node

void SetNumericSplit(std::int32_t nid, std::int32_t split_index, double threshold)

Create a numerical split.

Parameters:
  • nid – ID of node being updated

  • split_index – Feature index to split

  • threshold – Threshold value

void SetCategoricalSplit(std::int32_t nid, std::int32_t split_index, std::vector<std::uint32_t> const &category_list)

Create a categorical split.

Parameters:
  • nid – ID of node being updated

  • split_index – Feature index to split

  • category_list – List of categories to belong to either the right child node or the left child node. Set categories_list_right_child parameter to indicate which node the category list should represent.

void SetLeaf(std::int32_t nid, double value)

Set the leaf value of the node.

Parameters:
  • nid – ID of node being updated

  • value – Leaf value

void SetLeafVector(std::int32_t nid, std::vector<double> const &leaf_vector)

Set the leaf vector of the node; useful for multi-output trees.

Parameters:
  • nid – ID of node being updated

  • leaf_vector – Leaf vector

void PredictLeafIndexInplace(ForestDataset *dataset, std::vector<int32_t> &output, int32_t offset, int32_t max_leaf)

Obtain a 0-based leaf index for each observation in a ForestDataset. Internally, trees are stored as vectors of node information, and the leaves_ vector gives us node IDs for every leaf in the tree. Here, we would like to know, for every observation in a dataset, which leaf number it is mapped to. Since the leaf numbers themselves do not carry any information, we renumber them from 0 to leaves_.size()-1.

Note: this is a tree-level helper function for an ensemble-level function. It assumes the creation of:

  1. a vector of column indices of size dataset.NumObservations() x ensemble.NumTrees(), stored in “tree-major” order

  2. a running counter of the number of tree-observations already indexed in the ensemble

    (used as offsets for the leaf number computed and returned here) Users running this function for a single tree may simply pre-allocate an output vector as

    std::vector<int32_t> output(dataset->NumObservations()) and set the offset to 0.

Parameters:
  • dataset – Dataset with which to predict leaf indices from the tree

  • output – Pre-allocated output vector storing a matrix of column indices, with “rows” corresponding to observations in dataset and “columns” corresponding to trees in an ensemble

  • offset – Bookkeeping index that determines where in output vector that column indices should be unpacked

  • max_leaf – Largest leaf value mapped so far. (Leaf indices serve as sparse column indices, so it is important that leaf values be unique to each tree.)

void PredictLeafIndexInplace(Eigen::MatrixXd &covariates, std::vector<int32_t> &output, int32_t offset, int32_t max_leaf)

Obtain a 0-based leaf index for each observation in a ForestDataset. Internally, trees are stored as vectors of node information, and the leaves_ vector gives us node IDs for every leaf in the tree. Here, we would like to know, for every observation in a dataset, which leaf number it is mapped to. Since the leaf numbers themselves do not carry any information, we renumber them from 0 to leaves_.size()-1.

Note: this is a tree-level helper function for an ensemble-level function. It assumes the creation of:

  1. a vector of column indices of size dataset.NumObservations() x ensemble.NumTrees(), stored in “tree-major” order

  2. a running counter of the number of tree-observations already indexed in the ensemble

    (used as offsets for the leaf number computed and returned here) Users running this function for a single tree may simply pre-allocate an output vector as

    std::vector<int32_t> output(dataset->NumObservations()) and set the offset to 0.

Parameters:
  • covariates – Eigen matrix with which to predict leaf indices

  • output – Pre-allocated output vector storing a matrix of column indices, with “rows” corresponding to observations in covariates and “columns” corresponding to trees in an ensemble

  • offset – Bookkeeping index that determines where in output vector that column indices should be unpacked

  • max_leaf – Largest leaf value mapped so far. (Leaf indices serve as sparse column indices, so it is important that leaf values be unique to each tree.)

void PredictLeafIndexInplace(Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> &covariates, std::vector<int32_t> &output, int32_t offset, int32_t max_leaf)

Obtain a 0-based leaf index for each observation in a ForestDataset. Internally, trees are stored as vectors of node information, and the leaves_ vector gives us node IDs for every leaf in the tree. Here, we would like to know, for every observation in a dataset, which leaf number it is mapped to. Since the leaf numbers themselves do not carry any information, we renumber them from 0 to leaves_.size()-1.

Note: this is a tree-level helper function for an ensemble-level function. It assumes the creation of:

  1. a vector of column indices of size dataset.NumObservations() x ensemble.NumTrees(), stored in “tree-major” order

  2. a running counter of the number of tree-observations already indexed in the ensemble

    (used as offsets for the leaf number computed and returned here) Users running this function for a single tree may simply pre-allocate an output vector as

    std::vector<int32_t> output(dataset->NumObservations()) and set the offset to 0.

Parameters:
  • covariates – Eigen matrix with which to predict leaf indices

  • output – Pre-allocated output vector storing a matrix of column indices, with “rows” corresponding to observations in covariates and “columns” corresponding to trees in an ensemble

  • offset – Bookkeeping index that determines where in output vector that column indices should be unpacked

  • max_leaf – Largest leaf value mapped so far. (Leaf indices serve as sparse column indices, so it is important that leaf values be unique to each tree.)

Tree Split

Numeric and categorical splits are represented by a TreeSplit class.

class TreeSplit

Representation of arbitrary tree split rules, including numeric split rules (X[,i] <= c) and categorical split rules (X[,i] in {2,4,6,7})

Public Functions

inline TreeSplit(double split_value)

Construct a numeric TreeSplit.

Parameters:

split_value – Numeric cutoff defining a new split rule

inline TreeSplit(std::vector<std::uint32_t> &split_categories)

Construct a categorical TreeSplit.

Parameters:

split_categories – Vector of category indices defining a new (unordered) categorical split rule

inline bool NumericSplit()

Whether or not a TreeSplit rule is numeric.

inline bool SplitTrue(double fvalue)

Whether a given covariate value is True or False on the rule defined by a TreeSplit object.

Parameters:

fvalue – Value of the covariate

inline double SplitValue()

Numeric cutoff value defining a TreeSplit object.

inline std::vector<std::uint32_t> SplitCategories()

Categories defining a TreeSplit object.