Objectives Motivation: Why data warehouse? What is a data warehouse? Whyy separate p DW? Conceptual modeling of DW Data Mart Data Warehousing Architectures Data Warehouse Development Data Warehouse Vendors R l Real-time DW ` ` ` ` ` Data Warehousing and OLAP Lecture 2/DMBI/IKI83403T/MTI/UI ` ` ` Yudho Giri Sucahyo, Ph.D, CISA ([email protected]) Faculty of Computer Science, University of Indonesia ` 2 Motivation: Why data warehouse? ` ` ` ` ` Construction of data warehouses (DW) involves data cleaning and data integration Æ important preprocessing step for data mining (DM). DW provide OLAP for the interactive analysis of multidimensional data,, which facilitates effective DM. Data mining functions can be integrated with OLAP operations to enhance interactive mining of knowledge. DW will provide an effective platform for DM. Whil DWs While DW are nott requirements i t to t do d DM, DM DW store t massive amounts of data that can be uses for DM. [DO] 3 What is a data warehouse? [JH] Defined in many different ways, but not rigorously. ` ` ` A decision support database that is maintained separately from the organization’s ODB. Support information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” — W. H. Inmon Case Study 2: Continental Airlines flies high with its real-time data warehouse ` ` 4 What is a data warehouse? [ET] Data warehouse A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format. Characteristics ` ` ` ` ` Subject oriented, Integrated, Time Variant, Non-volatile Web-based, Relational/multidimensional, Client/server, Real-time Include metadata Data warehousing ` ` ` Process of constructing and using data warehouses. Requires data integration, data cleaning, and data consolidation. Subject Oriented Organized around major subjects, such as customer, product, sales. ` Provide P id a simple i l and d concise i view i around d particular subject issues by excluding data that are not useful in the decision support process. ` Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. processing ` 5 6 Integrated Integrate multiple, heterogeneous data sources ` ` Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems. ` Relational databases, flat-files, on-line transaction records Data cleaningg and data integration g techniques q are applied ` ` Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources ` ` 7 ` Operational database: current value data. ` Data warehouse data: provide information from a historical perspective (e (e.g., g past 5-10 years) Every key structure in the data warehouse ` E.g., Hotel price: currency, tax, breakfast covered, etc. Wh data When d is i moved d to the h warehouse, h it i is i converted. d 8 ` Contains an element of time, explicitly or implicitly ` But the key of operational data may or may not contain “time element”. Non volatile Non-volatile Ap physically y y separate p store of data transformed from the ` ` Data Warehouse vs vs. Heterogeneous DBMS ` Traditional heterogeneous DB integration: operational environment. ` Build wrappers/mediators on top of multiple, heterogeneous databases. Ex: IBM Data Joiner, Informix DataBlade O Operational i l update d off data d does d not occur in i the h d data ` Q Query d driven i approach: h warehouse environment. ` ` When a query is posed to a client site, a metadata-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. There queries are then mapped and sent to local query processors. The results returned from the different sites are integrated d into a global l b l answer set. ` Complex information filtering and integration processes, compete for resources. resources ` Inefficient and potentially expensive for frequent queries, especially for Does not require transaction processing, recovery, and concurrencyy control mechanisms ` Requires only two operations in data accessing: ` i i i l lloading initial di off data d and d access off data. d queries requireing q q g aggregations. gg g 9 10 Data Warehouse vs vs. Heterogeneous DBMS (2) ` Using DW Æ update-driven approach ` ` ` ` ` ` Major task of ODB Æ OLTP: ` Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. DW serve ffor data d analysis l i and d decision d i i making ki Æ OLAP ` Distinct Features (OLTP vs. OLAP) Information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike OLTP, OLTP DW do not contain the most current information information. DW brings high performance to the integrated heterogeneous DB system since data are copied, copied preprocessed, preprocessed integrated, integrated annotated, summarized, and restructured into one data store. Query processing in DW does not interfere with the processing at local sources DW can store and integrate g historical information and support pp complex multidimensional queries. 11 DW vs. vs ODB ` ` ` ` ` ` 12 User and U d system orientation: i i customer vs. market k Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. vs star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read read-only only but complex queries OLTP vs OLAP Why Separate DW? OLTP OLAP users Clerk IT professional Clerk, Knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated repetitive historical, summarized, multidimensional integrated, consolidated ad-hoc lots of scans unit of work read/write index/hash on prim. key short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response usage access ` ` ` ` 14 Why Separate DW? (2) Different functions and different data: ` ` ` missing data: Decision support requires historical data which operational DBs do not typically maintain. So, data in ODB is p for decision making. g usuallyy far from complete data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources. ODB contain t i detailed d t il d raw data d t (transactions) (t ti ) which hi h need d to t be b consolidated before analysis. data qquality: y different sources typically yp y use inconsistent data representations, codes and formats which have to be reconciled. Conceptual Modeling of DW ` Data Cube: ` ` see TSBD Lecture Notes on Visualization of Data Cubes M d li data Modeling d t warehouses: h dimensions di i & measurements t ` ` ` 15 DBMS — tuned for OLTP: access methods, indexing, concurrency control, recovery Warehouse — tuned for OLAP: complex OLAP queries, computation of large groups of data at summarized levels, multidimensional view,, consolidation. Processing OLAP queries in operational databases would degrade the performance of operational tasks. In ODB, concurrency control and recovery mechanisms (locking, logging) are required to ensure the consistency and d robustness b off transactions. i OLAP Æ read only access. No need for concurrency control and recovery. recovery ` 13 ` High g performance p for both systems: y ` Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables, tables one for each dimension). Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables. Fact constellations: Multiple fact tables share dimension tables. ` Also known as galaxy schema 16 Example of Star Schema Snowflake Schema Product Date Sales Fact Table Date Product Year Product Month Date Month Year Day Month Date Product Store Store StoreID City State Country Region Cust Customer dollar_sales ` City Country Bandung, Bogor keduanya ada di Jawa Barat 17 City State State Potensi Redundansi ` Store Store CustId C tN CustName CustCity CustCountry unit_sales Yen_sales Measurements Cust StoreID Cit City Customer unit_sales dollar_sales State Country Country Region CustId CustName CustCity CustCountry Yen_sales Measurements 18 View of Warehouses and Hierarchies Data Cube D t Date 19 ProductNo ProdName ProdDesc Category QOH Sales Fact Table ` Importing data ` Table Browsing ` Dimension creation ` Dimension browsing ` Cube buildingg ` Cube browsing TV PC VCR sum 1Qtr 2Qtr 3Qtr 4Qtr sum Total annual sales of TV in U.S.A. USA U.S.A Ca ada Canada Mexico sum 20 Country Day Month Year Year ProductNo ProdName ProdDesc C Category QOH Data Cube Typical OLAP Operations ` ` ` ` ` ` ` 21 ` Visualization OLAP capabilities p Interactive manipulation Data Mart ` ` ` ` DW collects information about subjects that span the entire organization, such as customers, products, sales, assets, and personnel. Its scope is enterprise-wide. For DW, fact constellation schema is commonly used since it can model multiple, interrelated subjects. Data Mart is a subset of a DW, focuses on a particular subject. Its scope is department-wide. Typically, a data mart consisting off a single l subject b area (e.g. ( marketing, k operations). For Data Mart, star or snowflake schema are commonly used since both are geared towards modeling single subjects, bj t although lth h th the star t schema h iis more popular. l 23 ` Roll up (drill-up): summarize data ` by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up ` from higher level summary to lower level summary or detailed data, data or introducing new dimensions Slice and dice: ` project and select Pivot (rotate): ` reorient the cube, visualization, 3D to series of 2D planes. Other operations ` drill d ill across: iinvolving l i ((across)) more than th one fact f t table. t bl ` drill through: through the bottom level to its back-end relational tables. More info: ` www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org 22 Data Mart ` ` A data mart can be either dependent or independent. A dependent data mart is a subset that is created directly from the DW. ` ` ` ` ` Consistent data model Providing quality data DW must be constructed first Ensures that the user viewing the same version of the data that are accessed d by b allll other h d data warehouse h users An independent data mart is a small warehouse designed f ad for department, and d iits source is i not an EDW. EDW 24 Data Warehousing Process Overview Data Warehousing Process Overview ` The major components of a data warehousing process ` ` ` ` ` ` 25 Data Warehousing Architectures 27 Data sources ` Legacy systems, external data providers (e.g. BPS), OLTP, ERP Systems Data extraction Data loading Comprehensive database Metadata Middleware tools 26 Data Warehousing Architectures 28 Data Warehousing Architectures Data Warehousing Architectures 29 30 Data Integration and the ETL Process ` Various integration technologies: ` Enterprise Application Integration (EAI) ` ` ` ` ` ETL ` ` ` ` Enterprise Information Integration (EII) ` ` 31 A technology that provides a vehicle for pushing data from source systems t iinto t a data d t warehouse h Integrating application functionality and is focused on sharing functionality across systems Traditionally, API. Nowadays, SOA (web services). Data Integration and the ETL Process An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases A mechanism for pulling data from source systems to satisfy a request for information. 32 60-70% of the time in a data-centric project. Extraction: Reading data from one or more databases Transformation ` Converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a DW Load ` Putting the data into d the DW Data Warehouse Development ` Direct benefits ` ` ` ` ` Data Warehouse Development ` Some best practices for implementing a DW (Weir, 2002): ` Allowing end users to perform extensive analysis in numerous ways A consolidated view off corporate data (i.e ( a single version off the truth) Better and more timely information Enhanced system performance. DW frees production processing because some operational system reporting requirements are moved to DSS Simplification of data access ` ` ` ` ` ` ` ` ` 34 33 Data Warehouse Vendors ` ` ` ` ` ` ` ` ` ` ` Computer Associates DataMirror Data Advantage g Groupp Dell Computer Embarcadero Technologies Business Objects HP Hummingbird H Hyperion IBM Informatica 35 ` ` ` ` ` ` ` Microsoft Oracle SAS Siemens Sybase Teradata Please visit: ` ` Data Warehousing Institute (tdwi com) (tdwi.com) DM Review (dmreview.com) Project must fit with corporate strategy and business objectives There must be complete buy-in to the project by executives, managers and users managers, It is important to manage user expectations about the completed project The data warehouse must be built incrementally Build in adaptability M Managed d bby bboth h IT and d bbusiness i professionals f i l Develop a business/supplier relationship O l lload Only dd data t that th t have h been b cleansed l d and d are off a quality lit understood by the organization Do not overlook training requirements Be politically aware Data Warehouse Vendors ` Six gguidelines to considered when developing p ga vendor list: 1 1. 2. 3. 4. 5. 6. 36 Financial strength ERP linkages Q lifi d consultants Qualified l Market share Industry experience Established partnerships p p Real time DW Real-time Real time DW Real-time Traditionally, updated on a weekly basis. Unsuitable for some businesses. Real-time ((active)) data warehousing g The process of loading and providing data via a data warehouse as theyy become available Levels of data warehouses: ` ` ` ` 1. 2. 3. 4. 5. Reports what happened Some analysis occurs Provides p prediction capabilities, p , Operationalization Becomes capable p of makingg events happen pp 37 38 Real time DW Real-time From DW to DM [JH] ` Three kinds of data warehouse applications ` ` ` 39 Information processing ` supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing ` multidimensional analysis of data warehouse data ` supports basic OLAP operations, slice-dice, drilling, pivoting Data mining ` knowledge discovery from hidden patterns ` supports associations, constructing analytical models, performing classification and prediction, and presenting the mining i i results lt using i visualization i li ti ttools. l 40 References ` ` ` [JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. [ET] Efraim Turban et al., Decision Support and Business Intelligence Systems, Pearson, 2007. [DO] David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007. 41
© Copyright 2024