Classification: Basic Concepts, Decision Trees, and Model Evaluation

5 important questions on Classification: Basic Concepts, Decision Trees, and Model Evaluation

Within a decision tree we distinguish three types of nodes.
Which are they?

  • Root node.
    • The very first node of the tree (i.e., the starting point).
    • Represents the entire dataset being split based on the most significant attribute.
  • Internal node.
    • Any node between the root and the leaves.
    • Each represents a test/decision on a feature.
    • Internal nodes keep splitting the data until the stopping condition is reached.
  • Leaf node.
    • The endpoints of the decision tree.
    • From this point, no further splitting happens.
    • Represents:
      • In classification: the final class label.
      • In regression trees: a predicted value.
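The three node types can be sketched as a small program. This is a minimal illustration, assuming a tree of nested dicts in which root and internal nodes test one attribute and leaves hold a class label; the attribute names and tree shape are invented for the example.

```python
# A minimal sketch of the three node types: root/internal nodes test an
# attribute, leaf nodes hold the final class label (names are illustrative).
def classify(node, record):
    """Walk from the root to a leaf, applying each node's test."""
    while "label" not in node:              # root/internal node: keep testing
        attr = node["attribute"]
        node = node["children"][record[attr]]
    return node["label"]                    # leaf node: no further splitting

# The root tests "outlook"; its children are internal or leaf nodes.
tree = {
    "attribute": "outlook",
    "children": {
        "sunny": {"attribute": "windy",
                  "children": {True: {"label": "no"},
                               False: {"label": "yes"}}},
        "rainy": {"label": "no"},
    },
}

print(classify(tree, {"outlook": "sunny", "windy": False}))  # yes
```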

What is a child node?

  • Child nodes are the new nodes you create when you split a node based on some attribute.
  • You can distinguish between parent and child nodes:
    • Parent node: The node you split.
    • Child node: the resulting subgroups from that split.
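The parent/child relationship above can be sketched as a split routine: one child subgroup per attribute value. This is a hedged toy example; the records and attribute names are invented.

```python
from collections import defaultdict

# Splitting a parent node's records on one attribute produces the child
# subgroups: one child per attribute value (data here is illustrative).
def split(records, attribute):
    children = defaultdict(list)
    for rec in records:
        children[rec[attribute]].append(rec)   # route record to its child
    return dict(children)

parent = [{"outlook": "sunny", "play": "no"},
          {"outlook": "rainy", "play": "yes"},
          {"outlook": "sunny", "play": "yes"}]

children = split(parent, "outlook")
print(sorted(children))        # ['rainy', 'sunny']
print(len(children["sunny"]))  # 2
```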

What are the four major advantages of classifying based on a decision tree?

  • Inexpensive.
    • Inexpensive to construct.
  • Fast.
    • Extremely fast at classifying unknown records.
  • Easily interpretable.
    • Easy to interpret for small-sized trees.
  • Comparable Accuracy.
    • Accuracy is comparable to that of other classification techniques for many simple data sets.

What are the three core practical issues of classification?

  • Underfitting and Overfitting.
    • Underfitting:
      • Model is too simple.
      • Doesn't capture the underlying patterns in the data.
      • Performs poorly on both training and test data.
    • Overfitting:
      • Model is more complex than necessary.
      • It memorised the training data instead of generalising.
      • Performs very well on training data, but poorly on test data.
        • I.e., model performance cannot be assessed by training error alone.
  • Missing Values.
    • Affects decision tree construction in several ways:
      • How to distribute an instance with a missing value to child nodes.
      • How impurity measures are computed.
      • How a test instance with a missing value is classified.
  • Cost of Classification.

Can you describe the common reasons for Overfitting and how we can fix this?

  • Noise: Exceptional cases or outliers in the training set can lead the model to make wrong classifications when generalised to unseen data.
    • We should accept that errors due to exceptional cases are unavoidable and establish the minimum error rate achievable by any classifier.
  • Lack of Representative Samples: If you don't have sufficient training samples, or they are not representative of the population you want to generalise to, overfitting may occur.
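The overfitting symptom described above (perfect training performance, poor test performance) can be illustrated with an extreme toy case: a classifier that simply memorises every training record. The data points here are invented for illustration.

```python
# A memorising classifier: a pure lookup table over the training set.
# It achieves zero training error but has learned nothing that
# generalises to unseen records (data is invented for illustration).
train = {(1, 1): "a", (2, 1): "a", (9, 9): "b", (3, 2): "a"}  # (x, y) -> class

def memorising_classifier(point):
    # Pure lookup: no generalisation at all.
    return train.get(point, "unknown")

train_acc = sum(memorising_classifier(p) == c
                for p, c in train.items()) / len(train)
print(train_acc)                      # 1.0: zero training error
print(memorising_classifier((8, 9)))  # unknown: fails on an unseen record
```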
