Lecture 1: An Introduction to Data Mining

CPaT

Computing Practice and Theory

What is data mining?

  • Looking for hidden patterns in data
  • How is it different from statistics?
  • How is it different from machine learning?
  • Think of some examples: Google advertising, predicting property values, meteorology, government census data for social services, spending trends in tax returns, employees working remotely, smuggling detection, financial records, airplane data for engineering, pictures for object recognition, optimizing software, voice recognition, music genome, song recognition, funding for public projects, intrusion detection (IDS), Netflix

Topics for the quarter

  • What are concepts, instances, and attributes?
  • Knowledge representation: Rule Lists, Trees, Linear Models
  • Training a machine learning system
  • Cleaning and Transforming Data
  • Bayesian Networks
  • Clustering
  • Neural Networks
  • Regression

Readings for this week

  • Chapter 1 in Witten

Data

  • What kinds of data are there?
  • Why data needs to be cleaned/preprocessed: missing values, inconsistent values
  • Summarizing data: mean, standard deviation, min, max, quartiles
  • Attribute subset selection: finding a minimal set of attributes that adequately describes the concept
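
To make the cleaning and summarizing bullets above concrete, here is a minimal sketch using only Python's standard library. The price values, the use of None for a missing value, and the rule that a negative price counts as inconsistent are all assumptions made up for illustration; they do not come from the readings. Dropping bad values is also only one possible cleaning strategy.

```python
import statistics

# Hypothetical values for one numeric attribute (house prices in $1000s).
# None marks a missing value; -1 is an inconsistent entry, since a price
# cannot be negative. Both conventions are assumptions for this sketch.
raw_prices = [250, 310, None, 275, -1, 420, 305, None, 290]


def clean(values):
    """Drop missing (None) and inconsistent (negative) values."""
    return [v for v in values if v is not None and v >= 0]


def summarize(values):
    """Return count, mean, sample standard deviation, min, max, and quartiles."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


prices = clean(raw_prices)
summary = summarize(prices)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the count before and after cleaning shows how many instances were lost, which is worth checking before trusting any of the summary numbers.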