Decision Trees

- General information
- Decision tree components
- Criteria for node splitting
- Overfitting/underfitting concepts
- Decision tree pruning
- Decision tree types

A decision tree is a supervised, non-parametric classifier that takes as input a vector of attribute values and returns a decision

- Input/output values can be both numerical and categorical
- Missing values are allowed
- Classification structure is explicit and easily interpretable (Friedl & Brodley, 1997)

“A classification procedure that recursively partitions a data set into smaller subdivisions on the basis of a set of tests defined at each branch (or node)” (Friedl & Brodley, 1997)

A decision tree classification model generated in R using a PlanetScope image as input (Digital Numbers, DN).

- Nodes: testing different attributes for splitting
- Edges: connecting to the next node or leaf
- Leaves: terminal nodes predicting the outcomes

- Classification problems: predicting discrete values
- Regression problems: predicting continuous values
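Both problem types can be illustrated with scikit-learn (an assumption of mine: the slides show an R model, but the concepts are library-independent):

```python
# Both problem types with scikit-learn (assumed library; the slides use R,
# but the same concepts apply in any implementation).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 0], [1, 1], [0, 1], [1, 0]]

# Classification: predict a discrete class label
y_class = [0, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)

# Regression: predict a continuous value
y_reg = [0.1, 0.9, 0.2, 0.8]
reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg)
```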

The main steps are:

- select the best attribute as the root node
- for each value of this attribute, create a new child node
- split the samples among the child nodes
- if a subset is pure, make it a terminal node (leaf)
- else, continue splitting
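The steps above can be sketched as a recursive procedure in plain Python (an illustrative toy implementation using information gain; names such as `build_tree` are my own, not from the slides):

```python
# Toy recursive tree builder: select the best attribute, create a child
# per attribute value, stop when a subset is pure (illustrative sketch).
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # pure subset -> terminal node (leaf) predicting the single class
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    # select the best attribute: highest information gain
    def gain(a):
        total = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total -= len(sub) / len(labels) * entropy(sub)
        return total

    best = max(attributes, key=gain)
    node = {}
    # for each value of this attribute, create a child node and recurse
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        node[(best, v)] = build_tree(sub_rows, sub_labels,
                                     [a for a in attributes if a != best])
    return node
```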

Source: Artificial Intelligence: A Modern Approach (Stuart Russell and Peter Norvig, 2009)


Example: splitting on the Patrons attribute (restaurant-waiting example from Russell & Norvig, 2009)

The splits defined at each internal node of a decision tree are estimated from the training data using a statistical criterion:

- Information gain
- Variance reduction (for regression problems)
- Gini impurity

pᵢ = probability of an object in the subset belonging to class i
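Using this pᵢ, the Gini impurity of a subset is G = 1 − Σᵢ pᵢ²; a minimal Python sketch (the function name is my own):

```python
# Gini impurity: G = 1 - sum_i p_i^2, where p_i is the fraction of the
# subset belonging to class i (illustrative sketch).
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

gini_impurity(["a", "a", "b", "b"])  # 0.5: maximally impure two-class split
gini_impurity(["a", "a", "a", "a"])  # 0.0: pure subset
```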

Entropy

- a measure of uncertainty, i.e. how pure a subset/tree node is
- measured in bits

p = number of positive examples

n = number of negative examples

H = entropy

k = values an attribute can take
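Assembling the symbols above into the standard formulas (following the notation of Russell & Norvig, 2009):

```latex
% Entropy of a variable V with k possible values v_k
H(V) = -\sum_{k} P(v_k)\,\log_2 P(v_k)

% For a boolean output with p positive and n negative examples,
% writing q = p/(p+n):
H = -\bigl( q \log_2 q + (1-q)\log_2 (1-q) \bigr)
```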


There are two main options for handling numerical attributes:

- Binary decision: consider all possible splits according to the attribute values (computationally expensive)
- Discretization: convert the attribute into an ordinal categorical feature (cheaper)

Overfitting: learning noise on top of the signal

- Presence of noise
- Lack of representative samples

Underfitting: the model is too simple to find the patterns in the data

Too large: overfitting

- Poor generalization to new samples (low bias/high variance)


Too small: underfitting

- Poor fit to the data (high bias/low variance)


- Do not grow a tree that is too large
- Pruning: reducing the size of the decision tree
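One concrete pruning strategy is CART cost-complexity pruning, exposed in scikit-learn through the `ccp_alpha` parameter (a sketch on a toy dataset, not the slides' PlanetScope example; the alpha value is an arbitrary choice of mine):

```python
# Cost-complexity pruning (CART-style) via scikit-learn's ccp_alpha:
# larger alpha prunes more aggressively, yielding a smaller tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree is smaller, trading training fit for generalization.
print(full.tree_.node_count, pruned.tree_.node_count)
```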

Pros

- Not a black-box classifier: i.e. users can easily understand the decisions
- Can handle both categorical and numerical data
- Can handle missing data
- Irrelevant attributes are easily discarded
- Fast

Cons

- Axis-aligned splitting of the data: class boundaries that are oblique to the feature axes must be approximated by many successive splits

'Simple' decision trees

- Classification and Regression Tree (CART) (Breiman et al., 1984)

Ensembles of decision trees (weak learners):

- Boosting: trees are trained sequentially and their results combined by a weighted average
- Bagging: trees are trained on bootstrap samples and their results averaged for the final prediction, e.g. Random Forests (Breiman, 2001)
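As an illustration of bagging, a Random Forest grows each tree on a bootstrap sample and averages the predictions (scikit-learn sketch; the dataset choice is mine):

```python
# Bagging in practice: a Random Forest averages many decision trees,
# each grown on a bootstrap sample of the data (Breiman, 2001).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())  # mean cross-validated accuracy
```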

- Non-parametric and transparent classifier
- Addresses both classification and regression problems
- Different options for splitting decision tree nodes: information gain, Gini impurity, variance reduction
- Finding the best tree means balancing overfitting against underfitting
- Popular implementations of decision trees: CART, Random Forests

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32

Friedl, M.A., & Brodley, C.E. (1997). Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment, 61, 399-409

Russell, S., & Norvig, P. (2009). Artificial intelligence: a modern approach (3rd ed.). Pearson