Summary: Introduction To Data Mining | 9780321321367 | Pang Ning Tan, et al

PLEASE KNOW!!! There are just 44 flashcards and notes available for this material. This summary might not be complete. Please search similar or other summaries.
Read the summary and the most important questions on Introduction to data mining | 9780321321367 | Pang-Ning Tan ; Michael Steinbach ; Vipin Kumar.

  • 1 Introduction

    This is a preview. There are 8 more flashcards available for chapter 1

  • The Data Miner's Toolkit can be broken down into two fundamental tasks. Which two tasks are we talking about?

    • Predictive Tasks.
      • Goal: Predict the future.
        • Use known data to predict unknown values.
        • E.g., Will this customer spend more than $100 today?
      • Common tasks:
        • Classification.
        • Regression.
        • Deviation/Anomaly Detection.
    • Descriptive Tasks.
      • Goal: Understand the present.
        • Finds human-interpretable patterns.
        • Requires post-processing to validate and explain results.
        • E.g., What are our main customer groups, and what do our best customers have in common?
      • Common Tasks:
        • Clustering.
        • Association Rule Discovery.
        • Sequential Pattern Discovery.
  • What is (one of) the most famous discoveries made by data mining?

    • Men who bought diapers on a Friday night were also likely to buy beer.
    • Highlights Data Mining's power to discover surprising, non-obvious and useful patterns
  • Deviation/Anomaly Detection is a predictive task of data mining. Can you explain what it is?

    • Identify abnormal behaviour.
      • By training the model on what "normal" behaviour looks like, it can more accurately predict which data points, patterns, or events differ strongly from that "normal" behaviour.
      • These cases are called anomalies, outliers, or deviations.
      • By identifying them, we can detect problems, risks, or rare but important events.


    Common Applications:
    • Credit Card Fraud Detection (which transactions are anomalies and potentially fraudulent?).
    • Network Intrusion Detection (Detect unusual patterns in network traffic that might indicate an intrusion).
    • Identifying Disease and ecosystem disturbances
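
The idea of learning "normal" behaviour first and then flagging what deviates from it can be sketched in a few lines. This is a minimal illustration, not the book's method: it assumes a single numeric attribute, uses a z-score rule with a hypothetical threshold of 3 standard deviations, and the transaction amounts are made up.

```python
from statistics import mean, stdev

def fit_normal_profile(values):
    """Summarise 'normal' behaviour as the mean and sample standard deviation."""
    return mean(values), stdev(values)

def find_anomalies(values, profile, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    if sigma == 0:
        # Every training value was identical, so any different value is a deviation.
        return [v for v in values if v != mu]
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transactions (in dollars) that are known to be normal.
normal_transactions = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0, 43.5]
profile = fit_normal_profile(normal_transactions)

# New, unlabelled transactions to screen.
new_transactions = [40.5, 950.0, 43.0]
anomalies = find_anomalies(new_transactions, profile)
print(anomalies)
```

Only the 950.0 transaction lies far outside the learned profile, so it is the one flagged as a potential fraud case.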
  • 2 Data

    This is a preview. There are 17 more flashcards available for chapter 2

  • What is data preprocessing and what are the seven key preprocessing techniques?

    • Data Preprocessing.
      • Making raw data more suitable for data mining analysis.
      • Involves: Improving data quality, adapting data for specific algorithms, improving algorithm performance.
    • Seven key preprocessing techniques:
      • Aggregation.
      • Sampling.
      • Dimensionality Reduction.
      • Feature subset selection.
      • Feature Creation.
      • Discretisation and Binarisation.
      • Attribute Transformation.
  • Can you describe the preprocessing technique called Aggregation?

    • Combining two or more attributes/objects into a single attribute/object.
    • Pros:
      • Reduces data size.
      • Changes the scale by providing a higher-level view.
      • Makes data more stable (i.e., less variability for aggregate quantities like averages).
    • Cons:
      • Potential loss of interesting details.
    • E.g., taking daily sales figures and aggregating them into monthly/yearly totals.
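
The sales example above can be sketched as follows. The figures and the `aggregate_by_month` helper are hypothetical; the sketch only illustrates that aggregation shrinks the data and that aggregate quantities (here monthly averages) vary less than the raw values.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical daily sales records: (ISO date, amount).
daily_sales = [
    ("2024-01-05", 120.0),
    ("2024-01-19", 80.0),
    ("2024-02-02", 150.0),
    ("2024-02-16", 70.0),
    ("2024-03-08", 90.0),
    ("2024-03-22", 100.0),
]

def aggregate_by_month(records):
    """Group daily amounts by their 'YYYY-MM' prefix."""
    groups = defaultdict(list)
    for date, amount in records:
        groups[date[:7]].append(amount)
    return dict(groups)

by_month = aggregate_by_month(daily_sales)
monthly_totals = {month: sum(amounts) for month, amounts in by_month.items()}
monthly_averages = {month: mean(amounts) for month, amounts in by_month.items()}

daily_amounts = [amount for _, amount in daily_sales]
print(monthly_totals)
print(pvariance(daily_amounts), pvariance(list(monthly_averages.values())))
```

Six daily records collapse into three monthly ones, and the variance of the monthly averages is much smaller than the variance of the daily amounts. The cost is that the individual daily swings are no longer visible.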
  • What are common techniques for Dimensionality Reduction?

    • Linear Algebra Techniques.
      • Principal Components Analysis (PCA).
      • Singular Value Decomposition.
    • Feature Subset Selection.
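
To make PCA concrete, here is a small sketch that finds the first principal component of a two-attribute data set and projects the points onto it, reducing two dimensions to one. It is an illustration under stated assumptions: it uses power iteration on the covariance matrix in pure Python rather than a library SVD routine, and the data points are made up so that they lie close to the line y = 2x.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of the direction of largest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    # Assumes the data is not constant, so the norm never becomes zero.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, direction):
    """Map each row to a single coordinate along `direction`."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]

# Hypothetical points lying close to the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
means, component = first_principal_component(points)
scores = project(points, means, component)
print(component, scores)
```

The recovered direction has a slope close to 2, and each point is now described by one score instead of two attributes. For real data sets a tested linear algebra library would be the more sensible choice.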
  • Similarity and Dissimilarity are both proximity measures. Can you explain how they differ from each other?

    • Similarity.
      • Numerical measure showing how alike two data objects are.
      • The more alike, the higher.
      • Often a range of [0, 1].
    • Dissimilarity.
      • Numerical measure showing how different two data objects are.
      • The more different, the higher.
      • Minimum dissimilarity is often 0, but the upper limit varies.
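
The relationship between the two kinds of measure can be illustrated with a short sketch. It assumes Euclidean distance as the dissimilarity and uses s = 1 / (1 + d) as one possible transformation into a similarity in the range (0, 1]; other transformations exist, and the points are made up.

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: 0 for identical points, unbounded above."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_to_similarity(d):
    """One possible mapping from a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

a = (0.0, 0.0)
b = (3.0, 4.0)

d_ab = euclidean_distance(a, b)
s_ab = distance_to_similarity(d_ab)
s_aa = distance_to_similarity(euclidean_distance(a, a))
print(d_ab, s_ab, s_aa)
```

Identical objects get dissimilarity 0 and similarity 1, and as the distance grows the similarity shrinks towards 0.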
  • Similarity measures, like the Jaccard similarity measure, have two of the same properties a metric (distance) has. What two properties are they, and why doesn't the other property apply?

    • Properties in common:
      • Positivity (non-negativity).
      • Symmetry. 
    • The triangle inequality typically does not carry over to similarity measures.
      • Similarity doesn't measure "length" or a "shortest path".
      • Similarity measures overlap, resemblance, or angle.
      • A can overlap B, and B can overlap C, but A and C might share practically nothing in common.
        • Being similar to a common neighbour therefore gives no triangle-inequality-style guarantee about how similar A and C are.
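
The overlap argument can be checked directly with the Jaccard coefficient on sets. This is a minimal sketch with made-up shopping baskets; treating two empty sets as fully similar is a convention chosen here, not something the summary specifies.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Convention assumed here: two empty sets count as identical.
    return len(a & b) / len(a | b)

# Hypothetical baskets: A overlaps B, B overlaps C, but A and C share nothing.
basket_a = {"bread", "milk"}
basket_b = {"milk", "beer"}
basket_c = {"beer", "diapers"}

print(jaccard(basket_a, basket_b))
print(jaccard(basket_b, basket_c))
print(jaccard(basket_a, basket_c))
```

A and B are moderately similar, as are B and C, yet A and C have similarity 0. The values are still non-negative and symmetric, which are the two properties the measure does share with a metric.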
  • Like Cosine Similarity, the Extended Jaccard Coefficient is also a vector-based similarity measure. When would you use the Extended Jaccard Coefficient over the Cosine Similarity?

    • Absolute quantities.
      • Use it when, besides which products appear in both baskets and in what proportions, the quantities themselves matter too (e.g., Customer A has bought 3 apples, Customer B has bought 8 apples).
      • Cosine Similarity ignores the length of the vectors, whereas the Extended Jaccard Coefficient is sensitive to it.
  • What is the Extended Jaccard Coefficient (Tanimoto Coefficient) and how does it differ from the normal one?

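    • Non-binary attributes.
      • The normal Jaccard coefficient is defined only for binary attributes (absence (0) vs presence (1)).
      • The Extended Jaccard Coefficient also works for non-binary attributes such as positive counts, and it reduces to the normal Jaccard coefficient when the attributes are binary.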
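
Both points can be seen in a small sketch using the usual vector form of the Extended Jaccard (Tanimoto) Coefficient, x·y / (‖x‖² + ‖y‖² − x·y). The customer vectors are hypothetical purchase counts, and the functions assume non-zero vectors.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Depends only on the angle between x and y, not on their lengths."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def extended_jaccard(x, y):
    """Tanimoto coefficient: x.y / (|x|^2 + |y|^2 - x.y). Assumes non-zero vectors."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Hypothetical purchase counts (apples, pears): same proportions, double the quantity.
customer_a = (3, 1)
customer_b = (6, 2)
cos_ab = cosine_similarity(customer_a, customer_b)
ej_ab = extended_jaccard(customer_a, customer_b)

# Binary presence/absence vectors: one shared item out of three distinct items.
binary_x = (1, 1, 0, 0)
binary_y = (0, 1, 1, 0)
ej_binary = extended_jaccard(binary_x, binary_y)
print(cos_ab, ej_ab, ej_binary)
```

Cosine similarity treats the two customers as essentially identical because their baskets have the same proportions, while the Extended Jaccard Coefficient drops to about 0.67 because the quantities differ. On the binary vectors the result is 1/3, the same as the normal Jaccard coefficient (one shared item, three items in the union).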