Introduction to Data Mining SLIDES BY: SHREE JASWAL Books 2 Text Books: 1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3nd Edition 2. Business Intelligence: Data Mining and Optimization for Decision Making by Carlo Vercellis ,Wiley India Publications Chp1 Slides by: Shree Jaswal Topics to be covered 3 What is Data Mining; Kind of patterns to be mined; Technologies used; Major issues in Data Mining Chp1 Slides by: Shree Jaswal Evolution of database system technology 4 Chp1 Slides by: Shree Jaswal What is a data warehouse? 5 Chp1 Slides by: Shree Jaswal Example of a datawarehouse framework used by an electronics store 6 Chp1 Slides by: Shree Jaswal Chp1 Slides by: Shree Jaswal 7 Why Data Mining? 8 The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, digital cameras, YouTube We are drowning in data, but starving for knowledge! “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets Chp1 Slides by: Shree Jaswal Why Data Mining?—Potential Applications 9 Data analysis and decision support Market analysis and management Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers) Other Applications Chp1 Text mining (news group, email, documents) and Web mining Stream data mining Bioinformatics and bio-data analysis Slides by: Shree Jaswal Ex. 1: Market Analysis and Management 10 Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association Customer profiling—What types of customers buy what products (clustering or classification) Customer requirement analysis Identify the best products for different groups of customers Predict what factors will attract new customers Provision of summary information Chp1 Multidimensional summary reports Statistical summary information (data central tendency and variation) Slides by: Shree Jaswal Ex. 2: Corporate Analysis & Risk Management 11 Finance planning and asset evaluation cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) Resource planning summarize and compare the resources and spending Competition Chp1 monitor competitors and market directions group customers into classes and a class-based pricing procedure set pricing strategy in a highly competitive market Slides by: Shree Jaswal Ex. 3: Fraud Detection & Mining Unusual Patterns 12 Approaches: Clustering & model construction for frauds, outlier analysis Applications: Health care, retail, credit card service, telecomm. Auto insurance: ring of collisions Money laundering: suspicious monetary transactions Medical insurance Professional patients, ring of doctors, and ring of references Unnecessary or correlated screening tests Telecommunications: phone-call fraud Retail industry Chp1 Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm Analysts estimate that 38% of retail shrink is due to dishonest employees Anti-terrorism Slides by: Shree Jaswal What Is Data Mining? 13 Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: Is everything “data mining”? Simple search and query processing (Deductive) expert systems Chp1 Slides by: Shree Jaswal Examples: What is (not) Data Mining? What is not Data What is Data Mining? Mining? – Look up phone – Certain names are more number in phone directory prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area) – Query a Web – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,) search engine for information about “Amazon” KDD Process: Several Key Steps 15 Learning the application domain relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation Find useful features, dimensionality/variable reduction, invariant representation Choosing functions of data mining summarization, classification, regression, association, clustering Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge Chp1 Slides by: Shree Jaswal Knowledge Discovery (KDD) Process 16 This is a view from typical database systems and data Pattern Evaluation warehousing communities Data mining plays an essential role in the knowledge discovery process Data Mining Task-relevant Data Data Warehouse Data Cleaning Data Integration Chp1 Databases Slides by: Shree Jaswal Selection Example: A Web Mining Framework 17 Web mining usually involves Data cleaning Data integration from multiple sources Warehousing the data Data cube construction Data selection for data mining Data mining Presentation of the mining results Patterns and knowledge to be used or stored into knowledgebase Chp1 Slides by: Shree Jaswal Data Mining in Business Intelligence 18 Increasing potential to support business decisions End User Decision Making Data Presentation Visualization Techniques Business Analyst Data Mining Information Discovery Data Analyst Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems Chp1 Slides by: Shree Jaswal DBA KDD Process: A Typical View from ML and Statistics 19 Input Data Data Mining Data PreProcessing Data integration Normalization Feature selection Dimension reduction Pattern discovery Association & correlation Classification Clustering Outlier analysis ………… PostProcessing Pattern Pattern Pattern Pattern evaluation selection interpretation visualization This is a view from typical machine learning and statistics communities Chp1 Slides by: Shree Jaswal Decisions in Data Mining(1) 20 Data to be mined Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks Knowledge to be mined (or: Data mining functions) Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. Descriptive vs. predictive data mining : Descriptive mining tasks characterize properties of data in a target data set & predictive mining tasks perform induction on current data in order to make predictions Multiple/integrated functions and mining at multiple levels Chp1 Slides by: Shree Jaswal Decisions in Data Mining(2) 21 Techniques utilized Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc. Applications adapted Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. Chp1 Slides by: Shree Jaswal What Kinds of Patterns Can Be Mined? 22 1. 2. 3. 4. 5. Chp1 Generalization Association and Correlation Analysis Classification Cluster Analysis Outlier Analysis Slides by: Shree Jaswal Data Mining Function: (1) Generalization 23 Multidimensional concept description: Characterization and discrimination Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region Data characterization is a summarization of the general characteristics or features of a target class of data Data cube technology for computing Scalable methods for computing (i.e., materializing) multidimensional aggregates OLAP (online analytical processing) Examples of Output forms : pie charts, MDD cubes, bar charts, curves etc. Chp1 Slides by: Shree Jaswal Data Mining Function: (1) Generalization contd. 24 Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes Data cube technology for computing Drill down on any dimension Discriminant rules: Discrimination descriptions expressed in the form of rules Output forms : same as that of data characterization along with discrimination descriptions Chp1 Slides by: Shree Jaswal Data Mining Function: (2) Association and Correlation Analysis 25 Frequent patterns (or frequent itemsets) What items are frequently purchased together in your mart? Eg. Milk & bread Association, correlation vs. causality A typical association rule Computer software [1%, 50%] (support, confidence) Confidence means that if one buys a computer there is a 50% chance that she will buy software too. A 1% support means that 1% of all transactions under analysis show that computer & software are purchased together Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold Chp1 Slides by: Shree Jaswal Data Mining Function: (3) Classification 26 Classification and label prediction Construct models (functions) based on some training examples Describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (gas mileage) Predict some unknown class labels Typical methods Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, … Typical applications: Chp1 Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, … Slides by: Shree Jaswal Various forms of a classification model 27 Chp1 Slides by: Shree Jaswal Data Mining Function: (4) Cluster Analysis 28 Unsupervised learning (i.e., Class label is unknown) Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns Data objects are clustered or grouped based on the principle of maximizing intraclass similarity and minimizing interclass similarity Many methods and applications Chp1 Slides by: Shree Jaswal A 2D plot of customer data with respect to customer locations in a city showing 3 data clusters Chp1 Slides by: Shree Jaswal 29 Data Mining Function: (5) Outlier Analysis 30 Outlier analysis (anomaly mining) Outlier: A data object that does not comply with the general behavior of the data Noise or exception? ― One person’s garbage could be another person’s treasure Methods: by product of clustering or regression analysis, … Useful in fraud detection, rare events analysis Chp1 Slides by: Shree Jaswal Are All the “Discovered” Patterns Interesting? 31 Data mining may generate thousands of patterns: Not all of them are interesting Suggested approach: Human-centered, query-based, focused mining Interestingness measures A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measures Chp1 Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. Eg. A large Earthquake often follows a cluster of small quakes Slides by: Shree Jaswal Find All and Only Interesting Patterns? 32 Find all the interesting patterns: Completeness Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns? Heuristic vs. exhaustive search Association vs. classification vs. clustering Search for only interesting patterns: An optimization problem Can a data mining system find only the interesting patterns? Approaches First generate all the patterns and then filter out the uninteresting ones Generate Chp1 only the interesting patterns—mining query optimization Slides by: Shree Jaswal Other Pattern Mining Issues 33 Precise patterns vs. approximate patterns Association and correlation mining: possible find sets of precise patterns But approximate patterns can be more compact and sufficient How to find high quality approximate patterns?? Gene sequence mining: approximate patterns are inherent How to derive efficient approximate pattern mining algorithms?? Constrained vs. non-constrained patterns Chp1 Why constraint-based mining? What are the possible kinds of constraints? How to push constraints into the mining process? Slides by: Shree Jaswal Data Mining: Confluence of Multiple Disciplines 34 Machine Learning Applications Algorithm Chp1 Slides by: Shree Jaswal Pattern Recognition Data Mining Database Technology Statistics Visualization High-Performance Computing Why Confluence of Multiple Disciplines? 35 Tremendous amount of data Algorithms must be highly scalable to handle such as tera-bytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications Chp1 Slides by: Shree Jaswal Applications of Data Mining 36 Web page analysis: from web page classification, clustering to PageRank & HITS algorithms Collaborative analysis & recommender systems Basket data analysis to targeted marketing Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue) From major dedicated data mining systems/tools (e.g., SAS, MS SQL- Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining Chp1 Slides by: Shree Jaswal Major Issues in Data Mining (1) 37 Mining Methodology Mining various and new kinds of knowledge Mining knowledge in multi-dimensional space Data mining: An interdisciplinary effort Boosting the power of discovery in a networked environment Handling noise, uncertainty, and incompleteness of data Pattern evaluation and pattern- or constraint-guided mining User Interaction Chp1 Interactive mining Incorporation of background knowledge Presentation and visualization of data mining results Slides by: Shree Jaswal Major Issues in Data Mining (2) 38 Efficiency and Scalability Efficiency and scalability of data mining algorithms Parallel, distributed, stream, and incremental mining methods Diversity of data types Handling complex types of data Mining dynamic, networked, and global data repositories Data mining and society Chp1 Social impacts of data mining Privacy-preserving data mining Invisible data mining Slides by: Shree Jaswal Summary 39 Data mining: Discovering interesting patterns and knowledge from massive amount of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of data Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining technologies and applications Major issues in data mining Chp1 Slides by: Shree Jaswal
© Copyright 2024