Summary: Introduction To Data Mining | 9780321321367 | Pang Ning Tan, et al
Read the summary and the most important questions on Introduction to data mining | 9780321321367 | Pang-Ning Tan ; Michael Steinbach ; Vipin Kumar.
-
1 Introduction
The Data Miner's Toolkit can be broken down into two fundamental tasks. Which two tasks are we talking about?
- Predictive Tasks.
- Goal: Predict the future.
- Use known data to predict unknown values.
- E.g., Will this customer spend more than $100 today?
- Common tasks:
- Classification.
- Regression.
- Deviation/Anomaly Detection.
- Descriptive Tasks.
- Goal: Understand the present.
- Finds human-interpretable patterns.
- Requires post-processing to validate and explain results.
- E.g., What are our main customer groups? What do our best customers have in common?
- Common Tasks:
- Clustering.
- Association Rule Discovery.
- Sequential Pattern Discovery.
-
What is (one of) the most famous discoveries made by data mining?
- Men who bought diapers on a Friday night were also likely to buy beer.
- Highlights Data Mining's power to discover surprising, non-obvious and useful patterns.
-
Deviation/Anomaly Detection is a predictive task of data mining. Can you explain what it is?
- Identify abnormal behaviour.
- By training the model on what "normal" behaviour looks like, it can more accurately predict which data points, patterns, or events differ strongly from that "normal" behaviour.
- These cases are called anomalies, outliers, or deviations.
- By identifying them, we can detect problems, risks, or rare but important events.
- Common Applications:
- Credit Card Fraud Detection (which transactions are anomalies and potentially fraudulent?).
- Network Intrusion Detection (Detect unusual patterns in network traffic that might indicate an intrusion).
- Identifying Disease and ecosystem disturbances.
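
The idea of learning "normal" behaviour and flagging strong deviations can be sketched with a simple z-score rule. This is only a minimal illustration, not the method the book prescribes: the transaction amounts, the function names and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev

def fit_normal_profile(normal_values):
    """Learn what "normal" looks like: mean and sample standard deviation."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value whose z-score exceeds the (assumed) threshold."""
    mu, sigma = profile
    return abs(value - mu) / sigma > threshold

# Hypothetical transaction amounts regarded as normal for one card holder.
normal_amounts = [42.0, 38.5, 51.0, 45.2, 40.1, 47.8, 39.9, 44.3]
profile = fit_normal_profile(normal_amounts)

# New transactions to score against the learned profile.
new_amounts = [43.0, 49.5, 950.0]
flags = [is_anomaly(amount, profile) for amount in new_amounts]
print(flags)
```

Only the last amount lies far outside the learned range, so it is the one reported as a potential anomaly.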
-
2 Data
What is data preprocessing and what are the seven key preprocessing techniques?
- Data Preprocessing.
- Making raw data more suitable for data mining analysis.
- Involves: Improving data quality, adapting data for specific algorithms, improving algorithm performance.
- Seven key preprocessing techniques:
- Aggregation.
- Sampling.
- Dimensionality Reduction.
- Feature subset selection.
- Feature Creation.
- Discretisation and Binarisation.
- Attribute Transformation.
-
Can you describe the preprocessing technique called Aggregation?
- Combining two or more attributes/objects into a single attribute/object.
- Pros:
- Reduces data size.
- Changes the scale by providing a higher-level view.
- Makes data more stable (i.e., less variability for aggregate quantities like averages).
- Cons:
- Potential loss of interesting details.
- E.g., taking daily sales figures and aggregating them into monthly/yearly totals.
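
The daily-to-monthly example can be sketched as follows. The sales records and the function name are hypothetical; the point is only that many daily objects collapse into a few monthly ones, at the cost of the day-level detail.

```python
from collections import defaultdict

# Hypothetical daily sales records: (ISO date, amount).
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
    ("2024-02-20", 50.0),
    ("2024-02-28", 25.0),
]

def aggregate_by_month(records):
    """Sum daily amounts into one total per "YYYY-MM" month."""
    totals = defaultdict(float)
    for date, amount in records:
        month = date[:7]  # keep only the "YYYY-MM" prefix
        totals[month] += amount
    return dict(totals)

monthly = aggregate_by_month(daily_sales)
print(monthly)  # {'2024-01': 200.0, '2024-02': 275.0}
```

Five daily records become two monthly totals: the data is smaller and smoother, but it is no longer possible to see on which days the sales happened.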
-
What are the best ways for Dimensionality Reduction?
- Linear Algebra Techniques.
- Principal Components Analysis (PCA).
- Singular Value Decomposition.
- Feature Subset Selection.
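
As a rough illustration of PCA, the sketch below reduces 2-dimensional points to a single coordinate along the direction of greatest variance. It assumes made-up data and uses plain power iteration on the covariance matrix to find the first principal component; real implementations normally rely on an eigendecomposition or SVD routine from a numerical library instead.

```python
def principal_direction(points, iterations=100):
    """Estimate the first principal component by power iteration."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Assumed start vector; it must not be orthogonal to the true direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(points, means, direction):
    """Map each point to its coordinate along the principal direction."""
    return [sum((p[j] - means[j]) * direction[j] for j in range(len(direction)))
            for p in points]

# Hypothetical data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, direction = principal_direction(points)
projected = project(points, means, direction)
print(direction, projected)
```

Because the points lie on one line, a single projected coordinate per point preserves all of the variation, which is the best case for dimensionality reduction.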
-
Similarity and Dissimilarity are both proximity measures. Can you explain how they differ from each other?
- Similarity.
- Numerical measure showing how alike two data objects are.
- The more alike, the higher.
- Often a range of [0, 1].
- Dissimilarity.
- Numerical measure showing how different two data objects are.
- The more different, the higher.
- Minimum dissimilarity is often 0, but the upper limit varies.
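
A small sketch of the two notions, assuming Euclidean distance as the dissimilarity and the common transformation s = 1 / (1 + d) to turn it into a similarity in (0, 1]. Both choices are just one option among several; the points are made up.

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: 0 for identical points, unbounded above."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity_from_distance(d):
    """One common (assumed) transformation: maps [0, inf) to (0, 1]."""
    return 1.0 / (1.0 + d)

a = (0.0, 0.0)
b = (3.0, 4.0)

d_ab = euclidean_distance(a, b)       # larger means more different
s_ab = similarity_from_distance(d_ab)  # larger means more alike
s_aa = similarity_from_distance(euclidean_distance(a, a))
print(d_ab, s_ab, s_aa)
```

Identical objects get dissimilarity 0 and similarity 1; as the distance grows without bound, the similarity shrinks toward 0.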
-
Similarity measures, like the Jaccard similarity measure, have two of the same properties a metric (distance) has. What two properties are they, and why doesn't the other property apply?
- Properties in common:
- Positivity (non-negativity).
- Symmetry.
- Similarity measures generally do not satisfy the triangle inequality.
- Similarity doesn't measure "length" or "shortest path".
- Similarity measures overlap, resemblance, or angle.
- A can overlap B, and B can overlap C, but A and C might share practically nothing in common.
- This would break the triangle inequality.
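
The overlap argument can be checked numerically with the Jaccard similarity on small sets. The baskets below are hypothetical, and the check applies the triangle inequality directly to the similarity values, s(x, z) <= s(x, y) + s(y, z), which is one way to read the claim above.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)

def satisfies_triangle(f, x, y, z):
    """Check f(x, z) <= f(x, y) + f(y, z) for one triple."""
    return f(x, z) <= f(x, y) + f(y, z)

# A overlaps B, B overlaps C, yet A and C share nothing.
A = {"beer", "diapers"}
B = {"diapers", "milk"}
C = {"milk", "bread"}
chain = (jaccard(A, B), jaccard(B, C), jaccard(A, C))

# Two identical baskets compared via an unrelated one violate the inequality.
X = {"beer", "diapers"}
Y = {"bread"}
Z = {"beer", "diapers"}
holds = satisfies_triangle(jaccard, X, Y, Z)
print(chain, holds)
```

In the second triple, X and Z are identical (similarity 1) while each has similarity 0 to Y, so 1 <= 0 + 0 fails. Positivity and symmetry still hold in every case.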
-
Like Cosine Similarity, the Extended Jaccard Coefficient is also a vector-based similarity measure. When would you use the Extended Jaccard Coefficient over Cosine Similarity?
- Absolute quantities.
- When, besides the proportions (do both customers have product X in their basket?), the quantity matters too (Customer A has bought 3 apples, Customer B has bought 8 apples).
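
The difference shows up when two baskets have the same proportions but different quantities. The sketch below uses made-up purchase counts and the standard formulas cos(x, y) = x·y / (||x|| ||y||) and EJ(x, y) = x·y / (||x||² + ||y||² − x·y); the function names are chosen for the example.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Depends only on the angle between the vectors, not their length."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def extended_jaccard(x, y):
    """Tanimoto coefficient: also sensitive to vector magnitude."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Hypothetical baskets as (apples, pears): B buys twice as much of everything.
basket_a = [3, 1]
basket_b = [6, 2]

cos_ab = cosine_similarity(basket_a, basket_b)
ej_ab = extended_jaccard(basket_a, basket_b)
print(cos_ab, ej_ab)
```

Cosine similarity treats the two baskets as essentially identical because their proportions match, while the Extended Jaccard Coefficient drops to about 0.67 because the absolute quantities differ.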
-
What is the Extended Jaccard Coefficient (Tanimoto Coefficient) and how does it differ from the normal one?
- Non-binary attributes.
- The normal Jaccard coefficient only works with binary attributes (absence (0) vs presence (1)).
- The Extended Jaccard Coefficient can be used for all positive numbers.
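
A quick check of how the two coefficients relate, using hypothetical vectors: on binary data the extended formula EJ(x, y) = x·y / (||x||² + ||y||² − x·y) gives the same value as the ordinary Jaccard coefficient, but it also accepts counts.

```python
def binary_jaccard(x, y):
    """Ordinary Jaccard: matching 1s divided by positions where either is 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either

def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient for numeric vectors."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return xy / (xx + yy - xy)

# Binary (presence/absence) vectors: both measures agree.
x_bin = [1, 1, 0, 1]
y_bin = [1, 0, 0, 1]
j_bin = binary_jaccard(x_bin, y_bin)
ej_bin = extended_jaccard(x_bin, y_bin)

# Count vectors: only the extended coefficient is meaningful here.
x_cnt = [3, 0, 2]
y_cnt = [8, 1, 2]
ej_cnt = extended_jaccard(x_cnt, y_cnt)
print(j_bin, ej_bin, ej_cnt)
```

For binary vectors, x·y counts the shared 1s and ||x||² + ||y||² − x·y counts the positions where at least one vector has a 1, which is why the two formulas coincide.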