1. What is data mining? The extraction of non-trivial, previously unknown but potentially useful information from large databases is called data mining.
2. What is meta-data? It is a database containing data about data. It is used by the end-users.
3. What is data warehousing? Data warehouses are large databases, built on operational databases, designed for strategic decision support.
4. Give the empirical cycle of learning. It is defined by a cyclic diagram linking analysis, observation, theory and prediction.
5. Define supervised learning algorithm. Algorithms which need the control of a human operator during their execution are called supervised learning algorithms.
6. Give the two important types of multi-processing machines. Symmetric multi-processing (SMP) and massively parallel processing (MPP).
7. Give some factors where a machine is superior to man. Speed, accuracy, ability to handle complexity, and freedom from fatigue in repetitive tasks.
8. What is de-duplication of data? De-duplication is the removal of cases where the same data is represented by several records in the database, for example because of spelling mistakes in the entries.
9. Give some data mining techniques. Query tools, OLAP, decision trees, association rules, neural networks, genetic algorithms and visualization.
10. Define KDD. Data mining is also referred to as Knowledge Discovery in Databases (KDD). Strictly, the overall discovery process is called KDD and the extraction step within it is called data mining.
11. What is DSS? It stands for Decision Support System. Data are arranged in such a way as to help the DSS.
12. What is learning? An individual learns how to carry out a certain task by making a transition from a situation in which the task cannot be carried out to a situation in which the same task can be carried out under the same circumstances.
13. What is a data mart? Smaller, local data warehouses are called data marts.
14. Define unsupervised learning algorithm. Algorithms which do not need human intervention during their execution are unsupervised learning algorithms.
15. Give the six stages of KDD. Data selection, cleaning, enrichment, coding, data mining and reporting.
16. What is domain inconsistency? Pollution in the data is called domain inconsistency; data which gives impossible information can be called so.
17. Give the data flow within warehousing. • Extract and load the data. • Clean and transform the data. • Back up and archive the data. • Manage queries.
18. Give the functions of a query manager. It is responsible for directing and managing queries.
19. What are dimension tables? The reference tables surrounding the fact table are called dimension tables.
20. Define data mart. A subset of the information content of a data warehouse that is stored in its own database, summarized or in detail.
21. What are the types of partitioning? Horizontal, vertical and hardware partitioning.
22. Give any two ways in which a summary table differs from a fact table. 1. Use of intelligent or non-intelligent keys. 2. Horizontal partitioning strategy.
23. Give the areas which form the cost of data marting. Hardware and software, network access, and time-window constraints.
24. What is the use of a unique identifier? It is used to avoid any confusion occurring between two fields of the same name from different sources.
25. What is aggregating a dimension? In a summary table, each row represents an aggregation at a level within a dimension.
26. Give the functions of a load manager. It extracts and loads data, performing simple transformations before and during the load.
27. What is a fact table? The central table containing the factual (transaction) data is called the fact table.
28. Give the uses of meta-data. • To map data sources to a common view of information. • To automate the production of summary tables. • To direct queries to the appropriate data source.
29. What is a starflake schema? The combination of a denormalized star schema and a normalized snowflake schema is referred to as a starflake schema.
30. Why are aggregations performed? Aggregations are performed in order to speed up common queries; this benefit must offset the operational cost of creating and managing the aggregations.
31. Give the purpose of creating a data mart. • To speed up queries by reducing the volume of data to be scanned. • To segment data onto different hardware platforms.
32. What is subsuming the dimension? It is the technique of designing summary tables to retain all the dimensions that have not been aggregated.
33. What are the hardware architectures used as server platforms in data warehousing solutions? • Symmetric multi-processing (SMP). • Massively parallel processing (MPP).
34. Define an SMP machine. It is a set of CPUs that share memory and disk. This is sometimes called a "shared-everything" environment.
35. What is a cluster? A set of loosely coupled SMP machines connected by a high-speed interconnect. Each machine in a cluster has its own CPU and memory, but they share access to disk. Thus, these systems are called "shared-disk" systems.
36. Define an MPP machine. It is made up of many loosely coupled nodes linked together by a high-speed interconnect.
37. Define daemons. The warehouse application managers are likely to have a number of service and background processes running. These processes are called daemons.
38. What are data extracts? They are subsets of data that are off-loaded from the server machine to other machines.
39. List the operations that affect the day-to-day running of the data warehouse. • Performance monitoring. • Scheduling. • Event monitoring.
40. Define problem management. It is an area that needs to be clearly defined and documented; this whole area needs to be bottomed out early in the data warehouse project.
41. What tasks need to be accomplished overnight? • Data roll-up. • Obtaining the data. • Data transformation.
42. What features are required for the management of backups? • Scheduling. • Backup data tracking. • Database awareness.
43. What are the data warehouse process managers? • Load manager. • Warehouse manager. • Query manager.
44. What are the responsibilities of the load manager? • Data source interaction. • Data transformation. • Data load.
45. What are the responsibilities of the warehouse manager? • Data movement. • Meta-data management. • Performance monitoring and tuning. • Data archiving.
46. What approaches are followed in data load? • Load from flat files. • 3GL programs for extracting and loading data.
47. Define archiving. The storage of data offline is archiving. In a data warehouse environment it is common to keep online only the data required by the business. Legal and other requirements often mean that older data needs to be stored over a long period. This data needs to be in a format that can be quickly restored to the data warehouse if required.
48. What are the responsibilities of the query manager? • User access to the data. • Query scheduling. • Query monitoring.
49. What are the two forms taken by vertical partitioning? • Normalization. • Row splitting.
50. Define normalization. It is the standard relational method of database organization. It allows common fields to be collapsed into single rows, thereby reducing space usage.
51. Differentiate row splitting and normalization. • Row splitting tends to leave a one-to-one mapping between the partitions. • Normalization can leave a one-to-many mapping.
52. List the uses of meta-data. • Data transformation and load. • Data management. • Query generation.
53. What information is required for a source data field? • Unique identifier. • Name. • Type. • Location.
54. What information is required for a destination field? • Unique identifier. • Name. • Type. • Table name.
55. What is base data? Base data is used to distinguish between the original data load destination and any copies, such as aggregations, in which the data may later appear.
56. List some aggregation functions. Min, max, average, sum. (A small illustrative sketch of summary-table aggregation follows question 100.)
57. What tools are used to manage data warehousing? • System management tools. • Data warehouse process management tools.
58. What managers are involved in the system manager? • Configuration manager. • Schedule manager. • Event manager.
59. What is the aim of row splitting? It is used to speed up access to a large table by reducing its size.
60. What areas must be addressed in hardware partitioning? • Maximizing the processing power available. • Maximizing disk and I/O performance. • Avoiding bottlenecks on a single CPU. • Avoiding bottlenecks on I/O throughput.
61. Define SLA. SLA stands for Service Level Agreement, which acts as a limit on the size of any partitioning scheme.
62. What are the steps and guidelines for determining facts and dimensions? i) Look for the elemental transactions within the business process; this identifies entities that are candidates to be the fact table. ii) Determine the key dimensions that apply to each fact; this identifies entities that are candidates to be dimension tables.
63. What techniques are used to design time into the fact table? • Storing the physical date. • Storing an offset from the inherent start of the table. • Storing a date range.
64. What are the two ways used to structure the foreign keys within a fact table? • Using intelligent keys. • Using non-intelligent keys.
65. What is an aggregation strategy? An aggregation strategy relies on the fact that most common queries will analyze either a subset or an aggregation of the detailed data. A large number of queries in a typical decision support data warehouse will ask questions that analyze data dimensionally.
66. What is the need for aggregation? It is an essential component of decision support data warehouses. It allows us to provide cost-effective query performance by avoiding the need for substantial investment in processing power in order to meet performance requirements.
67. How is a summary table different from a fact table? i) Use of intelligent or non-intelligent keys. ii) Use of abbreviations decoded on the fly. iii) Use of embedded reference keys.
68. Give the reasons for creating data marts. • To speed up queries by reducing the volume of data to be scanned. • To structure data in a form suitable for a user access tool.
69. What are summaries? Summaries are data marted in the same way as they would have been designed within the data warehouse; in other words, each summary table is designed to utilize all the dimension data in the starflake schema.
70. In which areas do data marting strategies incur costs? i) Hardware and software. ii) Network access. iii) Time-window constraints.
71. Give the factors on which the feasibility of data marts depends. • Time window available. • Network capacity. • Volume of data being transferred.
72. Define network access. The process of loading data into each data mart will be affected by the available capacity of the physical connections between the items of hardware; this is called network access.
73. What are the two distinct phases of load? i) Daily processing. ii) Overnight processing.
74. What operations are involved in overnight processing? i) Data transformation and load. ii) Aggregation and index creation. iii) Backup.
75. What is daily processing? It is centered on the user queries.
76. What are the categories of disk requirement? • Database requirements: a) administration, b) aggregation. • Non-database requirements: a) operating system requirements, b) data warehouse requirements.
77. What do we need to know to calculate the actual size? • The average size of the data in each field. • The position of each data field in the table. • The RDBMS storage format of each data type.
78. What measures are involved in the SLA? • Average query response time. • Scan rates. • I/O throughput rates.
79. What do you mean by data load? It is the entry point into the system and provides the first opportunity to improve performance. It is the crucial part of the overnight processing.
80. What types of query are involved in data warehousing? • Fixed queries. • Ad hoc queries.
81. What is the aim of the data warehouse designer? To get queries to run against the aggregated data rather than the raw fact data.
82. What data is maintained by the query manager? • Query syntax. • Query explain plan. • CPU resource used. • Memory resource used.
83. Define complete backup. The entire database is backed up at the same time. This includes all the database data files, control files and journal files.
84. Define cold backup. It is a backup taken while the database is completely shut down. In a multi-instance environment, all instances of the database must be shut down.
85. Define hot backup. The terminology comes from the fact that the database engine is up and running, hence "hot". There are special requirements that need to be observed if the hot backup is to be valid.
86. What are the technologies involved in tape backup? i) Tape media. ii) Stand-alone tape drives. iii) Tape stackers. iv) Tape silos.
87. Give the advantage of disk-to-disk backup. One advantage is that it gives an online copy of the latest backup.
88. Define operational system. It is a system that has responsibilities to the operations of the business.
89. Define mission-critical system. It is a system that the business absolutely depends on to function.
90. What are the two categories of SLA? 1. User requirements. 2. System requirements.
91. What are the key items involved in an SLA? • Priority of network access. • Network availability guarantees. • Network bandwidth guarantees. • Priority of access to the backup hardware.
92. What is data shipping? Data shipping is where a process requests the data to be shipped to the location where the process is running.
93. What is function shipping? Function shipping is where the function to be performed is moved to the location of the data.
94. Differentiate intra-statement and inter-statement operations. Intra-statement operations are separate operations that occur within the confines of a single SQL statement. In contrast, inter-statement operations take place between statements.
95. What is a rule-based optimizer? A rule-based optimizer uses known rules to perform this function. It will have, for example, a rule that says it is preferable to use an index to access a table rather than scan the whole table.
96. What is a cost-based optimizer? A cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement.
97. Define degree of parallelism. The degree of parallelism is the number of processes that will be used to simultaneously perform any single operation.
98. Define RAID. It stands for Redundant Array of Inexpensive Disks; it is a technology that provides resilience against disk failure, so that the loss of an individual disk does not mean a loss of data.
99. Define auditing. It is a specific subset of security that is often mandated by organizations.
100. What is a covert channel? It is a method whereby restricted information is inadvertently given away by implication rather than by design.
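Several of the questions above (19, 20, 25, 27, 56 and 65) revolve around fact tables, dimension tables and summary (aggregation) tables. The following is a minimal, illustrative Python sketch of the idea: it joins an invented fact table to an invented product dimension and pre-computes a summary keyed on one dimension level. All table and column names are assumptions made for the example, not part of any real schema.

from collections import defaultdict

# Hypothetical star-schema fragment: a fact table of sales transactions
# and a product dimension table (all names and values are invented).
fact_sales = [
    {"product_id": 1, "date": "2024-01-05", "amount": 120.0},
    {"product_id": 2, "date": "2024-01-05", "amount": 75.5},
    {"product_id": 1, "date": "2024-01-06", "amount": 60.0},
]
dim_product = {
    1: {"name": "Widget", "product_line": "Hardware"},
    2: {"name": "Manual", "product_line": "Books"},
}

def build_summary(fact_rows, dimension):
    """Aggregate the fact table at the product_line level of the
    product dimension: each output row holds min/max/avg/sum of amount."""
    grouped = defaultdict(list)
    for row in fact_rows:
        line = dimension[row["product_id"]]["product_line"]
        grouped[line].append(row["amount"])
    summary = {}
    for line, amounts in grouped.items():
        summary[line] = {
            "min": min(amounts),
            "max": max(amounts),
            "avg": sum(amounts) / len(amounts),
            "sum": sum(amounts),
        }
    return summary

if __name__ == "__main__":
    # Queries at the product-line level can now read the small summary
    # table instead of scanning the raw fact data (compare question 81).
    print(build_summary(fact_sales, dim_product))

Running the sketch prints one summary row per product line, which is exactly the kind of pre-aggregated table the questions on aggregation describe.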
16 MARKS:

1..........An Introduction to Data Mining

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.

The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database — for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step.
Databases can be larger in both depth and breadth:
• More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High-performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
The most commonly used techniques in data mining are:
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique. (A small sketch of this technique follows this list.)
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
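As a concrete illustration of the nearest neighbor method in the list above, here is a minimal k-nearest-neighbor classifier in plain Python. It is a sketch only: the two-feature records and class labels are invented, distance is plain Euclidean, and the class is chosen by a simple majority vote over the k most similar historical records.

import math
from collections import Counter

# Invented historical dataset: (feature_1, feature_2) -> class label.
history = [
    ((1.0, 1.2), "low"),
    ((0.9, 0.8), "low"),
    ((3.1, 3.4), "high"),
    ((2.9, 3.0), "high"),
]

def knn_classify(record, training, k=3):
    """Classify `record` from the classes of its k nearest neighbors."""
    distances = []
    for features, label in training:
        d = math.dist(record, features)  # Euclidean distance
        distances.append((d, label))
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k most similar historical records.
    return Counter(nearest_labels).most_common(1)[0][0]

if __name__ == "__main__":
    print(knn_classify((1.1, 1.0), history, k=3))  # expected: "low"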
Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.

2..............An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.

Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business – summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.
This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
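The OLAP paragraph above describes summarizing a measure "by product line, region, and other key perspectives". The following hedged Python sketch shows one way such a two-dimensional summary (a simple cross-tab) could be computed from flat records; the dimension names and sales figures are invented for illustration and do not come from the white paper.

from collections import defaultdict

# Invented flat records: one row per sale, with two dimensions and a measure.
rows = [
    {"product_line": "Hardware", "region": "North", "sales": 100.0},
    {"product_line": "Hardware", "region": "South", "sales": 250.0},
    {"product_line": "Books",    "region": "North", "sales": 40.0},
    {"product_line": "Hardware", "region": "North", "sales": 60.0},
]

def cross_tab(records, row_dim, col_dim, measure):
    """Summarize `measure` along two dimensions, OLAP-style."""
    table = defaultdict(lambda: defaultdict(float))
    for r in records:
        table[r[row_dim]][r[col_dim]] += r[measure]
    return {k: dict(v) for k, v in table.items()}

if __name__ == "__main__":
    # e.g. {'Hardware': {'North': 160.0, 'South': 250.0}, 'Books': {'North': 40.0}}
    print(cross_tab(rows, "product_line", "region", "sales"))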
3..................Glossary of Data Mining Terms:

analytical model: A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
anomalous data: Data that result from errors (for example, data entry keying errors) or that represent unusual events. Anomalous data should be examined carefully because it may carry important information.
artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to specific variable(s) you are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."
clustering: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to all available variables.
data cleansing: The process of ensuring that all values in a dataset are consistent and correctly recorded.
data mining: The extraction of hidden predictive information from large databases.
data navigation: The process of viewing different dimensions, slices, and levels of detail of a multidimensional database. See OLAP.
data visualization: The visual interpretation of complex relationships in multidimensional data.
data warehouse: A system for storing and delivering massive quantities of data.
decision tree: A tree-shaped structure that represents a set of decisions. These decisions generate rules for the classification of a dataset. See CART and CHAID.
dimension: In a flat or relational database, each field in a record represents a dimension. In a multidimensional database, a dimension is a set of similar entities; for example, a multidimensional sales database might include the dimensions Product, Time, and City.
exploratory data analysis: The use of graphical and descriptive statistical techniques to learn about the structure of a dataset.
genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
linear model: An analytical model that assumes linear relationships in the coefficients of the variables being studied.
linear regression: A statistical technique used to find the best-fitting linear relationship between a target (dependent) variable and its predictors (independent variables).
logistic regression: A linear regression that predicts the proportions of a categorical target variable, such as type of customer, in a population.
multidimensional database: A database designed for on-line analytical processing. Structured as a multidimensional hypercube with one axis per dimension.
multiprocessor computer: A computer that includes multiple processors connected by a network. See parallel processing.
nearest neighbor: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called a k-nearest neighbor technique.
non-linear model: An analytical model that does not assume linear relationships in the coefficients of the variables being studied.
OLAP: On-line analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.
outlier: A data item whose value falls outside the bounds enclosing most of the other corresponding values in the sample. May indicate anomalous data. Should be examined carefully; may carry important information.
parallel processing: The coordinated use of multiple processors to perform computational tasks. Parallel processing can occur on a multiprocessor computer or on a network of workstations or PCs.
predictive model: A structure and process for predicting the values of specified variables in a dataset.
prospective data analysis: Data analysis that predicts future trends, behaviors, or events based on historical data.
RAID: Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of data for high-performance computer systems.
retrospective data analysis: Data analysis that provides insights into trends, behaviors, or events that have already occurred.
rule induction: The extraction of useful if-then rules from data based on statistical significance.
SMP: Symmetric multiprocessor. A type of multiprocessor computer in which memory is shared among the processors.
terabyte: One trillion bytes.
time series analysis: The analysis of a sequence of measurements made at specified time intervals. Time is usually the dominating dimension of the data.
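To make the glossary entries for "linear regression" and "predictive model" concrete, here is a minimal least-squares fit of a straight line y = a + b*x in plain Python. The sample numbers are invented; this is a sketch of the standard closed-form formula, not code from the white paper.

# Invented sample: predictor x and target y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

def fit_line(x, y):
    """Least-squares estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b

if __name__ == "__main__":
    a, b = fit_line(xs, ys)
    print(f"a = {a:.2f}, b = {b:.2f}")   # roughly a = 0.14, b = 1.96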
4...........Incremental Concept Learning for Bounded Data Mining:

Important refinements of concept learning in the limit from positive data that considerably restrict the accessibility of input data are studied. Let c be any concept; every infinite sequence of elements exhausting c is called a positive presentation of c. In all the learning models considered, the learning machine computes a sequence of hypotheses about the target concept from a positive presentation of it. With iterative learning, the learning machine, in making a conjecture, has access to its previous conjecture and the latest data items coming in. In k-bounded example-memory inference (k is a priori fixed) the learner is allowed to access, in making a conjecture, its previous hypothesis, its memory of up to k data items it has already seen, and the next element coming in. In the case of k-feedback identification, the learning machine, in making a conjecture, has access to its previous conjecture, the latest data item coming in, and, on the basis of this information, it can compute k items and query the database of previous data to find out, for each of the k items, whether or not it is in the database (k is again a priori fixed). In all cases, the sequence of conjectures has to converge to a hypothesis correctly describing the target concept.
Our results are manyfold. An infinite hierarchy of more and more powerful feedback learners, in dependence on the number k of queries allowed to be asked, is established. However, the hierarchy collapses to 1-feedback inference if only indexed families of infinite concepts are considered, and moreover, its learning power is then equal to learning in the limit. But it remains infinite for concept classes of only infinite r.e. concepts. Both k-feedback inference and k-bounded example-memory identification are more powerful than iterative learning but incomparable to one another. Furthermore, there are cases where redundancy in the hypothesis space is shown to be a resource increasing the learning power of iterative learners. Finally, the union of at most k pattern languages is shown to be iteratively inferable.

5............Knowledge Discovery In Databases: Tools

The amount of data being collected in databases today far exceeds our ability to reduce and analyze data without the use of automated analysis techniques. Many scientific and transactional business databases grow at a phenomenal rate. A single system, the astronomical survey application SCICAT, is expected to exceed three terabytes of data at completion [4]. Knowledge discovery in databases (KDD) is the field that is evolving to provide automated analysis solutions.
Knowledge discovery is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data" [6]. In [5], a clear distinction between data mining and knowledge discovery is drawn. Under their conventions, the knowledge discovery process takes the raw results from data mining (the process of extracting trends or patterns from data) and carefully and accurately transforms them into useful and understandable information. This information is not typically retrievable by standard techniques but is uncovered through the use of AI techniques.
KDD is a growing field: there are many knowledge discovery methodologies in use and under development. Some of these techniques are generic, while others are domain-specific. The purpose of this paper is to present the results of a literature survey outlining the state-of-the-art in KDD techniques and tools. The paper is not intended to provide an in-depth introduction to each approach; rather, we intend it to acquaint the reader with some KDD approaches and potential uses.

Background
Although there are many approaches to KDD, six common and essential elements qualify each as a knowledge discovery technique. The following are basic features that all KDD techniques share (adapted from [5] and [6]):
• All approaches deal with large amounts of data
• Efficiency is required due to volume of data
• Accuracy is an essential element
• All require the use of a high-level language
• All approaches use some form of automated learning
• All produce some interesting results
Large amounts of data are required to provide sufficient information to derive additional knowledge. Since large amounts of data are required, processing efficiency is essential. Accuracy is required to assure that discovered knowledge is valid. The results should be presented in a manner that is understandable by humans. One of the major premises of KDD is that the knowledge is discovered using intelligent learning techniques that sift through the data in an automated process. For this technique to be considered useful in terms of knowledge discovery, the discovered knowledge must be interesting; that is, it must have potential value to the user. KDD provides the capability to discover new and meaningful information by using existing data. KDD quickly exceeds the human capacity to analyze large data sets.
The amount of data that requires processing and analysis in a large database exceeds human capabilities, and the difficulty of accurately transforming raw data into knowledge surpasses the limits of traditional databases. Therefore, the full utilization of stored data depends on the use of knowledge discovery techniques.

6............Knowledge Discovery In Databases: Techniques

KDD Techniques
Learning algorithms are an integral part of KDD. Learning techniques may be supervised or unsupervised. In general, supervised learning techniques enjoy a better success rate as defined in terms of usefulness of discovered knowledge. According to [1], learning algorithms are complex and generally considered the hardest part of any KDD technique.
Machine discovery is one of the earliest fields that has contributed to KDD [5]. While machine discovery relies solely on an autonomous approach to information discovery, KDD typically combines automated approaches with human interaction to assure accurate, useful, and understandable results.
There are many different approaches that are classified as KDD techniques. There are quantitative approaches, such as the probabilistic and statistical approaches. There are approaches that utilize visualization techniques. There are classification approaches such as Bayesian classification, inductive logic, data cleaning/pattern discovery, and decision tree analysis. Other approaches include deviation and trend analysis, genetic algorithms, neural networks, and hybrid approaches that combine two or more techniques.
Because of the ways that these techniques can be used and combined, there is a lack of agreement on how these techniques should be categorized. For example, the Bayesian approach may be logically grouped with probabilistic approaches, classification approaches, or visualization approaches. For the sake of organization, each approach described here is included in the group that it seemed to fit best. However, this selection is not intended to imply a strict categorization.

Probabilistic Approach
This family of KDD techniques utilizes graphical representation models to compare different knowledge representations. These models are based on probabilities and data independencies. They are useful for applications involving uncertainty and applications structured such that a probability may be assigned to each "outcome" or bit of discovered knowledge. Probabilistic techniques may be used in diagnostic systems and in planning and control systems [2]. Automated probabilistic tools are available both commercially and in the public domain.

Statistical Approach
The statistical approach uses rule discovery and is based on data relationships. An "inductive learning algorithm can automatically select useful join paths and attributes to construct rules from a database with many relations" [8]. This type of induction is used to generalize patterns in the data and to construct rules from the noted patterns. Online analytical processing (OLAP) is an example of a statistically-oriented approach. Automated statistical tools are available both commercially and in the public domain.

Classification Approach
Classification is probably the oldest and most widely-used of all the KDD approaches [11]. This approach groups data according to similarities or classes. There are many types of classification techniques and numerous automated tools available.
The Bayesian Approach to KDD "is a graphical model that uses directed arcs exclusively to form an [sic] directed acyclic graph" [2].
Although the Bayesian approach uses probabilities and a graphical means of representation, it is also considered a type of classification. Bayesian networks are typically used when the uncertainty associated with an outcome can be expressed in terms of a probability. This approach relies on encoded domain knowledge and has been used for diagnostic systems. Other pattern recognition applications, including the Hidden Markov Model, can be modeled using a Bayesian approach [3]. Automated tools are available both commercially and in the public domain.
Pattern Discovery and Data Cleaning is another type of classification that systematically reduces a large database to a few pertinent and informative records [7]. If redundant and uninteresting data is eliminated, the task of discovering patterns in the data is simplified. This approach works on the premise of the old adage, "less is more". The pattern discovery and data cleaning techniques are useful for reducing enormous volumes of application data, such as those encountered when analyzing automated sensor recordings. Once the sensor readings are reduced to a manageable size using a data cleaning technique, the patterns in the data may be more easily recognized. Automated tools using these techniques are available both commercially and in the public domain.
The Decision Tree Approach uses production rules, builds a directed acyclic graph based on data premises, and classifies data according to its attributes. This method requires that data classes are discrete and predefined [11]. According to [5], the primary use of this approach is for predictive models that may be appropriate for either classification or regression techniques. Tools for decision tree analysis are available commercially and in the public domain.

Deviation and Trend Analysis
Pattern detection by filtering important trends is the basis for this KDD approach. Deviation and trend analysis techniques are normally applied to temporal databases. A good application for this type of KDD is the analysis of traffic on large telecommunications networks.

Other Approaches
Neural networks may be used as a method of knowledge discovery. Neural networks are particularly useful for pattern recognition, and are sometimes grouped with the classification approaches. There are tools available in the public domain and commercially. Genetic algorithms, also used for classification, are similar to neural networks although they are typically considered more powerful. There are tools for the genetic approach available commercially.

Hybrid Approach
A hybrid approach to KDD combines more than one approach and is also called a multi-paradigmatic approach. Although implementation may be more difficult, hybrid tools are able to combine the strengths of various approaches. Some of the commonly used methods combine visualization techniques, induction, neural networks, and rule-based systems to achieve the desired knowledge discovery. Deductive databases and genetic algorithms have also been used in hybrid approaches. There are hybrid tools available commercially and in the public domain.

7.............Knowledge Discovery In Databases: Data Selection & Cleaning:

Most organizations with a global reach, government and civilian, collect, generate and/or publish more data and information in a week than a single human can comprehend in a lifetime. Vannevar Bush described the problem as early as 1945 in an Atlantic Monthly article, "As We May Think."
He wrote, "...There is a growing mountain of research ...[and] the difficulty seems to be, not so much that we publish unduly, rather, that publication has been extended far beyond our present ability to make real use of the record..." While Bush described the inability of scientists to keep abreast of the massive amount of research data generated during World War II, his assessment of the situation is equally applicable to the problem facing large organizations today, namely, how and what can be learned from data collected during the daily performance of business.
Knowledge Discovery In Databases (KDD) is a process an organization can use to tap into the hidden knowledge contained within the large volume of data an organization collects. The following is a skeletal view of the KDD process and some suggestions on how an organization can increase the effectiveness of the KDD effort.

The KDD Process
Most current sources describe KDD as a process consisting of six phases: Data Selection, Cleaning, Enrichment, Coding, Data Mining and Reporting. I have added two additional phases: Knowledge Requirement and Action. Using the two additional steps not only increases the efficiency of the process but also results in more valuable data. The KDD process is iterative and tunable. The output of and the lessons learned from one phase can be applied to the next phase to improve output.

Knowledge Requirement
Before a KDD process can begin, a requirement for knowledge must be identified. This doesn't mean an organization must know what specific additional knowledge it seeks beforehand. Instead, management should seek to recognize a pattern or patterns of data. By analyzing this information, management can formulate a strategy for determining a knowledge requirement, which when implemented will produce the kind of information needed to make informed business decisions. Note the word "business" can apply to many different knowledge domains, whether it's the business of conducting efficient military operations, tracking drug smugglers, or selling baby food.
By formally adopting a knowledge requirement phase in its KDD effort, an organization can improve the KDD process effectiveness. The improvement comes from analyzing and documenting what forms of knowledge are gained given a stated knowledge requirement and implementing each of the phases of KDD - data selection, cleaning, enrichment and data mining. This plan also includes the data mining techniques employed and the actions taken based upon knowledge gained.

Data Selection
As mentioned above, the data used in the KDD process is selected based on an evaluation of its potential to yield knowledge. Often these data sets are not in the best possible form for the KDD process, and the evaluation is really just an organization's best guess as to the value of the data. Generally speaking, while an experienced user is an asset to an organization, the empowerment individual users gain through experience and training is a problem that stands in the way of an effective KDD effort. As users gain proficiency and independence in solving business problems with their PCs, data which is vital to an organization is captured in databases that are poorly designed and implemented. There is no thought or plan for future integration with other organizational databases or strategy to capitalize on the investment of data already collected. The concept of the data warehouse is a solution to this problem.
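A minimal sketch, under invented names, of the phase structure described under "The KDD Process" above: each phase is a plain function, and the phases are chained so that the output of one feeds the next. It illustrates only the flow of the process, not any real mining algorithm.

# Hypothetical, simplified KDD pipeline: each phase is a function that
# takes the output of the previous phase (all names are invented).

def select(raw_records):
    """Data selection: keep only the records judged relevant."""
    return [r for r in raw_records if r.get("relevant", True)]

def clean(records):
    """Cleaning: drop records with missing key fields."""
    return [r for r in records if r.get("customer") and r.get("amount") is not None]

def enrich(records, demographics):
    """Enrichment: attach external demographic data by customer name."""
    return [{**r, **demographics.get(r["customer"], {})} for r in records]

def code(records):
    """Coding: normalize the numeric measure so values cluster comparably."""
    return [{**r, "amount": r["amount"] / 1000.0} for r in records]

def mine(records):
    """Data mining: here, a trivial 'pattern' - average amount per region."""
    totals, counts = {}, {}
    for r in records:
        region = r.get("region", "unknown")
        totals[region] = totals.get(region, 0.0) + r["amount"]
        counts[region] = counts.get(region, 0) + 1
    return {region: totals[region] / counts[region] for region in totals}

def report(findings):
    """Reporting: present the result to the decision maker."""
    for region, avg in findings.items():
        print(f"{region}: average coded amount {avg:.2f}")

if __name__ == "__main__":
    raw = [
        {"customer": "A", "amount": 1200.0, "relevant": True},
        {"customer": "B", "amount": None, "relevant": True},    # dropped by clean()
        {"customer": "C", "amount": 800.0, "relevant": False},  # dropped by select()
    ]
    demo = {"A": {"region": "North"}}
    report(mine(code(enrich(clean(select(raw)), demo))))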
The Data Warehouse
The data warehouse is designed to centralize an organization's data. However, structuring and implementing an organization's data warehouse means more than just rounding up all the problematic data stores with an eye towards making them accessible in a unified manner. An effectively designed data warehouse facilitates the KDD process.

Cleaning
If an organization's data collection techniques and database structure guidelines were properly employed, the data contained in an organization's database would already be clean. But since the majority of organizations have not implemented and provided guidance for a true data warehouse, the data selected for the KDD process must be cleaned. I was recently asked in what phase of KDD an organization could expect to spend most of its time, and hence, money. The answer is in the cleaning phase, especially if steps had not been taken to ensure the reliability of the data initially. Data cleaning is labor intensive and involves examining the data for completeness and integrity. A few examples of dirty data include: missing field information, names that are the same but spelled slightly differently, duplicate data, or data that has been intentionally altered in an attempt to gain services fraudulently. The causes of dirty or noisy data are many, and each business domain has unique data cleanliness issues. Pattern matching algorithms exist to help in the data cleaning effort but as of yet they are not a complete solution.

8.............Knowledge Discovery In Databases: Enrichment and Coding, Data Mining & Reporting:

Enrichment and Coding
In the enrichment phase, data, which may be obtained from a source external to the organization, is added to the selected and cleaned data to add value to it. For example, in a retail KDD effort, customer shopping data is combined with third party demographic data, which maps customer names to addresses, number and types of cars they own, how many and what ages are their children, etc., perhaps to produce a target market group.
In the data coding phase raw data is converted into a form that lends itself to be more effectively mined. It's sort of like cloud salting. Attributes can be, and often are, added to enhance the discovery of patterns within the data. Values, such as dates, times, salaries or other data that express numerical values, are normalized in such a manner that informational clusters are easily formed and associations can be more readily identified. The following record is an example of one possible form of data normalization. This record contains fields for name, date of birth, age, salary, number of degrees, number of cars owned, number of magazines subscribed to and number of children:
Doe | John | 4 February 1961 | 39 | 145000 | 2 | 3 | 6 | 2 |
The focus of a coding transformation on this type of data might be to bring the salary and age fields to within the same value order as the other numeric fields. Dividing 39 by 10 and 145000 by 100000 would do the trick, yielding the following record:
Doe | John | 4 February 1961 | 3.9 | 1.45 | 2 | 3 | 6 | 2 |
The numerical points of this record, when plotted, will fall closer to each other than if the data had been plotted in its original form. Depending on the data mining technique used on this particular kind of data set, this could prove to be an extremely valuable transformation.
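The scaling step just described is easy to express directly. The following hedged Python sketch reproduces the worked example above (age divided by 10, salary divided by 100,000); the field names are invented for illustration.

# Reproduce the coding transformation from the example record above:
# bring age and salary into the same value order as the other numeric fields.

record = {
    "last": "Doe", "first": "John", "dob": "4 February 1961",
    "age": 39, "salary": 145000,
    "degrees": 2, "cars": 3, "magazines": 6, "children": 2,
}

# Per-field scale factors; only age and salary need rescaling here.
SCALES = {"age": 10, "salary": 100000}

def code_record(rec, scales):
    """Return a copy of the record with the scaled (coded) numeric fields."""
    coded = dict(rec)
    for field, divisor in scales.items():
        coded[field] = rec[field] / divisor
    return coded

if __name__ == "__main__":
    print(code_record(record, SCALES))
    # age -> 3.9, salary -> 1.45, all other fields unchanged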
Data Mining
Data mining is the term once used to describe the entire KDD process; or rather, data mining was the KDD process. Through research, trial and error, data mining practitioners discovered that effective data mining required good, clean, properly enriched and intelligently coded data. The data mining phase is where the KDD rubber hits the road. Some practitioners describe it as an anything-goes affair and advocate applying any technique that helps extract knowledge from data. Many useful data mining techniques exist. A few of these include query tools, statistical methods, data visualization, Online Analytical Processing (OLAP), case-based learning (k's nearest neighbor), decision trees, association rules, neural networks and genetic algorithms. The "k's" nearest neighbor refers to an association or grouping algorithm that compares each point or data element with a number of its neighbors. The letter k represents a number, i.e., find 5 of the nearest neighbors to a particular data element.

Levels of Knowledge and the Use of Visualization Techniques
There are four levels of knowledge that exist within an organization's data:
• Shallow knowledge can be easily extracted with query tools, and it's estimated that shallow knowledge accounts for 80% of the knowledge in an organization's data.
• Multi-dimensional knowledge can be mined using OLAP tools.
• Hidden knowledge must be found using pattern recognition or machine learning algorithms.
• Deep knowledge is interesting in that you must have a clue to its existence before you can find it.

Reporting and Action
The reporting phase makes the knowledge gained from the KDD process available to the right people in the right format. Knowledge is power but it is useless if not acted upon. To reap the full benefits of a KDD effort, organizations should adopt a formal knowledge action process through which knowledge gained via a KDD effort is formally reviewed to see how it can best be used within the organization.

9.............. Security on Data Warehousing:

Defining warehouse security:
The Data Warehouse Center controls access with user IDs. When a user logs on, the user ID is compared to the warehouse users that are defined in the Data Warehouse Center to determine whether the user is authorized to access the Data Warehouse Center. You can authorize additional users to access the Data Warehouse Center by defining new warehouse users. The user ID for the new user does not require authorization to the operating system or the warehouse control database. The user ID exists only within the Data Warehouse Center.
In the Data Warehouse Center, warehouse groups include the authorization to perform certain tasks and access objects in the Data Warehouse Center. To authorize one or more users to perform tasks, you must define a warehouse group, and then add the users to the group. To define security for your data warehouse:
1. Define warehouse users.
2. Define warehouse groups.
Data warehousing systems present special security issues including:
• The degree of security appropriate to summaries and aggregates of data,
• The security appropriate for the exploration data warehouse, specifically designed for browsing and ad hoc queries, and
• The uses and abuses of data encryption as a method of enhancing privacy.
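A minimal sketch of the user-and-group model just described: users are known only to the warehouse layer, groups carry task authorizations, and a user is authorized for a task only through group membership. All class and task names are invented for illustration; this is not the Data Warehouse Center API.

# Toy model of warehouse users and groups (invented names, not a real API).

class WarehouseGroup:
    def __init__(self, name, allowed_tasks):
        self.name = name
        self.allowed_tasks = set(allowed_tasks)  # tasks members may perform
        self.members = set()                     # user IDs in this group

class WarehouseCenter:
    def __init__(self):
        self.users = set()    # warehouse user IDs (independent of OS logins)
        self.groups = {}      # group name -> WarehouseGroup

    def define_user(self, user_id):
        self.users.add(user_id)

    def define_group(self, name, allowed_tasks):
        self.groups[name] = WarehouseGroup(name, allowed_tasks)

    def add_user_to_group(self, user_id, group_name):
        if user_id in self.users:
            self.groups[group_name].members.add(user_id)

    def is_authorized(self, user_id, task):
        """A user is authorized only via membership in a group that allows the task."""
        return any(user_id in g.members and task in g.allowed_tasks
                   for g in self.groups.values())

if __name__ == "__main__":
    dwc = WarehouseCenter()
    dwc.define_user("alice")                                            # step 1: define warehouse users
    dwc.define_group("admins", {"define_subject_area", "run_process"})  # step 2: define warehouse groups
    dwc.add_user_to_group("alice", "admins")
    print(dwc.is_authorized("alice", "run_process"))      # True
    print(dwc.is_authorized("alice", "drop_warehouse"))   # False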
10………….Backup and Recovery of the Data Warehouse

Recoverability has various levels: recovery from disk failure, human error, software failure, fire, and so on, requires different procedures. Oracle provides only part of the solution. Organizations must decide how much to spend on backup and recovery by considering the business cost of a long outage. The NOLOGGING option enables you to perform certain operations without the overhead of generating a log. Even without logging, you can avoid disk failure if you use disk mirroring or RAID technology.
If you load your warehouse from tapes every day or week, you might satisfactorily recover from all failures simply by saving copies of the tape in several remote locations and reloading from tape when something goes wrong. At the other end of the spectrum, you could both mirror disks and take backups and archive logs, and maintain a remote standby system. The mirrored disks prevent loss of availability for disk failure, and also protect against total loss in the event of human error (such as dropping the wrong table) or software error (such as disk block corruption). In the event of fire, power failure, or other problems at the primary site, the backup site prevents long outages.

11………….Tuning Database Performance: Elements of performance

Performance is the way a computer system behaves given a particular work load. Performance is measured in terms of system response time, throughput, and availability. Performance is also affected by:
• The resources available in your system
• How well those resources are used and shared.
In general, you tune your system to improve its cost-benefit ratio. Specific goals could include:
• Processing a larger, or more demanding, work load without increasing processing costs (for example, increasing the work load without buying new hardware or using more processor time)
• Obtaining faster system response times, or higher throughput, without increasing processing costs
• Reducing processing costs without degrading service to your users
Translating performance from technical terms to economic terms is difficult. Performance tuning certainly costs money in terms of user time as well as processor time, so before you undertake a tuning project, weigh its costs against its possible benefits. Some of these benefits are tangible:
• More efficient use of resources
• The ability to add more users to the system.
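Response time and throughput, the two measures named at the start of this section, can be observed with a few lines of instrumentation. The sketch below uses an invented do_query placeholder standing in for real database work; it times each request and derives average response time and throughput.

import time

def do_query():
    """Placeholder for real database work (invented for this sketch)."""
    time.sleep(0.01)

def measure(workload_size=50):
    """Run a batch of queries and report response time and throughput."""
    response_times = []
    start = time.perf_counter()
    for _ in range(workload_size):
        t0 = time.perf_counter()
        do_query()
        response_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    avg_response = sum(response_times) / len(response_times)
    throughput = workload_size / elapsed        # queries per second
    print(f"average response time: {avg_response * 1000:.1f} ms")
    print(f"throughput: {throughput:.1f} queries/second")

if __name__ == "__main__":
    measure()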
12………Designing Database: Relational databases

A relational database is a database that is treated as a set of tables and manipulated in accordance with the relational model of data. It contains a set of objects used to store, manage, and access data. Examples of such objects are tables, views, indexes, functions, triggers, and packages.
A partitioned relational database is a relational database whose data is managed across multiple partitions (also called nodes). This separation of data across partitions is transparent to users of most SQL statements. However, some data definition language (DDL) statements take partition information into consideration (for example, CREATE DATABASE PARTITION GROUP). (Data definition language is the subset of SQL statements used to describe data relationships in a database.)
A federated database is a relational database whose data is stored in multiple data sources (such as separate relational databases). The data appears as if it were all in a single large database and can be accessed through traditional SQL queries. Changes to the data can be explicitly directed to the appropriate data source.

Structured Query Language (SQL)
SQL is a standardized language for defining and manipulating data in a relational database. In accordance with the relational model of data, the database is treated as a set of tables, relationships are represented by values in tables, and data is retrieved by specifying a result table that can be derived from one or more base tables.
SQL statements are executed by a database manager. One of the functions of the database manager is to transform the specification of a result table into a sequence of internal operations that optimize data retrieval. The transformation occurs in two phases: preparation and binding. All executable SQL statements must be prepared before they can be executed. The result of preparation is the executable or operational form of the statement. The method of preparing an SQL statement and the persistence of its operational form distinguish static SQL from dynamic SQL.

13…………Administering Database Systems: Moving data between typed tables

The DB2(R) export and import utilities can be used to move data out of, and into, typed tables. Typed tables may be in a hierarchy. Data movement across hierarchies can include:
• Movement from one hierarchy to an identical hierarchy.
• Movement from one hierarchy to a sub-section of a larger hierarchy.
• Movement from a sub-section of a large hierarchy to a separate hierarchy.
The IMPORT CREATE option allows you to create both the table hierarchy and the type hierarchy.
Identification of types in a hierarchy is database dependent. This means that in different databases, the same type has a different identifier. Therefore, when moving data between these databases, a mapping of the same types must be done to ensure that the data is moved correctly. Before each typed row is written out during an export operation, an identifier is translated into an index value. This index value can be any number from one to the number of relevant types in the hierarchy. Index values are generated by numbering each type when moving through the hierarchy in a specific order. This order is called the traverse order. It is the order of proceeding top-to-bottom, left-to-right through all of the supertables and subtables in the hierarchy.
The traverse order is important when moving data between table hierarchies, because it determines where the data is moved in relation to other data. One method is to proceed from the top of the hierarchy (or the root table), down the hierarchy (subtables) to the bottom subtable, then back up to its supertable, down to the next "right-most" subtable(s), then back up to the next higher supertable, down to its subtables, and so on.
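The traverse order described above (top-to-bottom, left-to-right through supertables and subtables, numbering each type as it is visited) can be read as a depth-first, pre-order walk of the hierarchy. Here is a hedged Python sketch with an invented toy hierarchy; it only demonstrates how index values could be assigned in that order, not how the DB2 utilities implement it.

# Toy table hierarchy (invented names): each supertable maps to its subtables,
# listed left to right.
hierarchy = {
    "Person": ["Employee", "Customer"],
    "Employee": ["Manager", "Architect"],
    "Customer": [],
    "Manager": [],
    "Architect": [],
}

def traverse_order(root, children):
    """Number each table top-to-bottom, left-to-right (depth-first, pre-order)."""
    order = {}
    def visit(table):
        order[table] = len(order) + 1   # next index value
        for sub in children.get(table, []):
            visit(sub)
    visit(root)
    return order

if __name__ == "__main__":
    # Person=1, Employee=2, Manager=3, Architect=4, Customer=5
    print(traverse_order("Person", hierarchy))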
14……….Monitoring database systems: Database system monitor data organization

The database system monitor stores information it collects in entities called monitor elements (these were previously known as data elements). Each monitor element stores information regarding one specific aspect of the state of the database system. In addition, monitor elements are identified by unique names and store a certain type of information. The following are the available element types in which monitor elements store data:
Counter: Counts the number of times an activity occurs. Counter values increase during monitoring. Most counter elements can be reset.
Gauge: Indicates the current value for an item. Gauge values can go up and down depending on database activity (for example, the number of locks held). Gauge elements cannot be reset.
Water mark: Indicates the highest (maximum) or lowest (minimum) value an element has reached since monitoring was started. Water mark elements cannot be reset.
Information: Provides reference-type details of your monitoring activities. This can include items such as partition names, aliases, and path details. Information elements cannot be reset.
Timestamp: Indicates the date and time that an activity took place by providing the number of seconds and microseconds that have elapsed since January 1, 1970. For the snapshot monitor and event monitors, the collection of timestamp elements is controlled by the TIMESTAMP monitor switch. While this switch is on by default, you should turn it off for performance reasons if CPU utilization on the database instance approaches 100%. Timestamp elements cannot be reset.
Time: Returns the number of seconds and microseconds spent on an activity. For the snapshot monitor and event monitors, the collection of most time elements is controlled by the TIMESTAMP monitor switch. While this switch is on by default, you should turn it off for performance reasons if CPU utilization on the database instance approaches 100%. Some time elements can be reset.
Monitor elements collect data for one or more logical data groups. A logical data group is a collection of monitor elements that gather database system monitoring information for a specific scope of database activity. Monitor elements are sorted in logical data groups based on the levels of information they provide. For example, while snapshot monitoring, the Total Sort Time monitor element returns database (dbase), application (appl), and statement (stmt) information; hence, it appears in each of the logical data groups listed in parentheses.
Although many monitor elements are used by both the snapshot monitor and event monitors, they each use a distinct set of logical data groups. This is because the scopes of database activity for which you can capture a snapshot differ from those for which you can collect event data. Practically speaking, the overall set of monitor elements accessible from the snapshot monitor is different from those accessible from event monitors.
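A small, hypothetical Python sketch of three of the element types listed above (counter, gauge, and water mark), showing the behavioral differences the text describes: counters only increase and can be reset, gauges move up and down, and water marks remember the extreme value reached. This is an illustration, not the DB2 monitor interface.

# Illustrative monitor element types (not the real DB2 interfaces).

class Counter:
    """Counts occurrences of an activity; increases only, can be reset."""
    def __init__(self):
        self.value = 0
    def increment(self, n=1):
        self.value += n
    def reset(self):
        self.value = 0

class Gauge:
    """Tracks the current value of an item (e.g. locks held); goes up and down."""
    def __init__(self):
        self.value = 0
    def set(self, current):
        self.value = current

class HighWaterMark:
    """Remembers the highest value seen since monitoring started."""
    def __init__(self):
        self.value = None
    def observe(self, current):
        if self.value is None or current > self.value:
            self.value = current

if __name__ == "__main__":
    deadlocks, locks_held, max_connections = Counter(), Gauge(), HighWaterMark()
    deadlocks.increment()
    locks_held.set(12); locks_held.set(7)      # gauge moves up and down
    for c in (3, 9, 5):
        max_connections.observe(c)             # water mark keeps the maximum
    print(deadlocks.value, locks_held.value, max_connections.value)  # 1 7 9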
15……….Data Redundancy

There are essentially three levels of data redundancy that enterprises should consider when weighing their data warehouse options:
• "Virtual" or "point-to-point" data warehouses
• Central data warehouses
• Distributed data warehouses

There is no one best approach. Each option fits a specific set of requirements, and a data warehousing strategy may ultimately include all three.

"Virtual" or "Point-to-Point" Data Warehouses
A virtual or point-to-point data warehousing strategy means that end-users are allowed to access operational databases directly, using whatever tools are enabled on the "data access network". This approach provides the ultimate in flexibility as well as the minimum amount of redundant data that must be loaded and maintained, but it can also put the largest unplanned query load on operational systems. Virtual warehousing is often an initial strategy in organizations where there is a broad but largely undefined need to get at operational data from a relatively large class of end-users, and where the likely frequency of requests is low. Virtual data warehouses often provide a starting point for organizations to learn what end-users are really looking for. Figure 3 shows a virtual data warehouse within the Data Warehouse Architecture.

Central Data Warehouses
Central data warehouses are what most people think of when they are first introduced to the concept of a data warehouse. The central data warehouse is a single physical database that contains all of the data for a specific functional area, department, division, or enterprise. Central data warehouses are often selected where there is a common need for informational data and there are large numbers of end-users already connected to a central computer or network. A central data warehouse is real: the data stored in it is accessible from one place and must be loaded and maintained on a regular basis. Normally, such warehouses are built around an advanced RDBMS or some form of multi-dimensional informational database server.

Distributed Data Warehouses
Distributed data warehouses are just what their name implies: data warehouses in which certain components are distributed across a number of different physical databases. Increasingly, large organizations are pushing decision-making down to lower and lower levels of the organization and, in turn, pushing the data needed for decision-making down (or out) to the LAN or local computer serving the local decision-maker. Distributed data warehouses usually involve the most redundant data and, as a consequence, the most complex loading and updating processes.

16………….Developing Data Warehouses

Developing a good data warehouse is no different from any other IT project; it requires careful planning, requirements definition, design, prototyping and implementation. The first and most important element is a planning process that determines what kind of data warehouse strategy the organization is going to start with.

Developing a Data Warehouse Strategy
Before developing a data warehouse, it is critical to develop a balanced data warehousing strategy that is appropriate for the organization's needs and its user population. Who is the audience? What is the scope? What type of data warehouse should we build?

There are a number of strategies by which organizations can get into data warehousing. One way is to establish a "virtual data warehouse" environment. A virtual data warehouse is created by (1) installing a set of data access, data directory and process management facilities, (2) training the end-users, (3) monitoring how the data warehouse facilities are actually used, and then (4) based on actual usage, creating a physical data warehouse to support the high-frequency requests.

A second strategy is simply to build a copy of the operational data from a single operational system and enable access to it through a set of information access tools. This strategy has the advantage of being both simple and fast. Unfortunately, if the existing data is of poor quality and/or access to the data has not been thought through, this approach can create a number of significant problems.

Ultimately, the optimal data warehousing strategy is to select a user population based on value to the enterprise and analyse their issues, questions and data access needs. Based on these needs, prototype data warehouses are built and populated so the end-users can experiment and modify their requirements. Once there is general agreement on the needs, the data can be acquired from existing operational systems across the enterprise and/or from external data sources and loaded into the data warehouse.
If required, information access tools can also be enabled to allow end-users to access the data they need with their own favourite tools, or to allow the creation of high-performance multi-dimensional information access systems using the core data warehouse as the basis.

Evolving a Data Warehouse Architecture
The Data Warehouse Architecture in Figure 1 is simply a framework for understanding data warehousing and how its components fit together. Only the most sophisticated organizations will be able to put together such an architecture the first time out. What the Data Warehouse Architecture provides, then, is a kind of roadmap to design toward. Coupled with an understanding of the options at hand, it offers a useful way of determining whether the organization is moving toward a reasonable data warehousing framework. One of the keys to data warehousing is flexibility: the more successful a data warehouse strategy is, the more end-users are going to want to add to it.

Designing Data Warehouses
Designing data warehouses is very different from designing traditional operational systems. For one thing, data warehouse users typically don't know nearly as much about their wants and needs as operational users do. Second, designing a data warehouse often involves thinking in terms of much broader, and more difficult to define, business concepts than does designing an operational system; in this respect, data warehousing is quite close to Business Process Reengineering (BPR). Finally, the ideal design strategy for a data warehouse is often outside-in as opposed to top-down.

17………….Managing Data Warehouses

Data warehouses are not magic; they take a great deal of very hard work. In many cases data warehouse projects are viewed as a stopgap measure to get users off our backs or to provide something for nothing, but data warehouses require careful management and marketing. A data warehouse is a good investment only if end-users can actually get at vital information faster and more cheaply than they can using current technology. As a consequence, management has to think seriously about how they want their warehouses to perform and how they are going to get the word out to the end-user community, and they have to recognize that maintaining the data warehouse structure is as critical as maintaining any other mission-critical application. In fact, experience has shown that data warehouses quickly become one of the most used systems in any organization. Management, especially IT management, must also understand that embarking on a data warehousing program creates new demands on their operational systems: demands for better data, for consistent data, and for different kinds of data.

18………..Updating Data Warehouse Data

Updating data warehouse data includes periodically extracting data from operational systems, cleansing and transforming the data, and loading the new data into the data warehouse. Each data update also includes tasks to synchronize cubes, if Microsoft® SQL Server™ 2000 Analysis Services is used for online analytical processing (OLAP), and to update any data marts that are part of the data warehouse.
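The following is a minimal sketch of such a periodic update, written in Python with the built-in sqlite3 module rather than the Transact-SQL, DTS, or bcp tooling discussed below. The table and column names (dim_customer, fact_sales) are hypothetical; the point is the flow described in this section and the next: extract the update rows, verify their referential integrity incrementally against the existing dimension data, and load only the rows that pass.

```python
# Minimal sketch of a periodic data warehouse update. Python's built-in
# sqlite3 module stands in for the warehouse platform; all table, column
# and row values are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_sales (customer_key INTEGER, amount REAL)")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Extract: new rows pulled from an operational system (simulated here).
update_rows = [(1, 250.0), (2, 99.5), (7, 40.0)]   # customer 7 is unknown

# Verify referential integrity incrementally: keep only rows whose
# customer_key already exists in the customer dimension.
known = {k for (k,) in conn.execute("SELECT customer_key FROM dim_customer")}
clean, rejected = [], []
for row in update_rows:
    (clean if row[0] in known else rejected).append(row)

# Load: insert the verified rows; rejected rows go back for manual review.
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)
conn.commit()
print("loaded:", len(clean), "rejected:", rejected)   # loaded: 2 rejected: [(7, 40.0)]
```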
The process of extracting, cleansing, and transforming data for a periodic update is essentially the same as the process used in the initial loading of the data warehouse, although the update process is often much less complex and more automated than the initial load. Procedures and automated tasks developed during the initial load can reduce the amount of manual effort required during updates, and corrections to source operational systems identified and implemented during the initial load also reduce the number of inconsistencies and errors that must be addressed during updates. However, manual intervention is often still required during updates to ensure the data is ready for loading into the data warehouse.

One difference between the initial data load and subsequent updates is that referential integrity should be verified incrementally on the update data before it is loaded into the data warehouse and made available to users. Updates often include additions and changes to dimension tables as well as the addition of rows to the fact tables. The new and changed data should be checked for internal consistency and verified against existing data in the warehouse before it is loaded.

After the update data has been made ready for loading, you can use Transact-SQL, Data Transformation Services (DTS), or the bcp utility to update the data warehouse tables. Depending on the design and implementation of the presentation applications that provide end-users with access to data warehouse data, you may need to take the data warehouse offline during the update to prevent inconsistencies in query results.

19………..Data Warehouses or Data Marts?

For many years, all systems that extracted production and other data from source systems and stored it in useful ways in a decision-support system were called data warehouses. Over the past few years, a distinction has been articulated between data warehouses and data marts, even as the overall market continues to be called data warehousing. General industry discussions indicate that, as of January 1997, approximately 70 to 80 percent of all data warehouses in production are, in fact, data marts. Intelligent Solutions, Inc., compares data warehouses to data marts in the following way:
• A data warehouse is constructed to meet the information needs of the entire enterprise; a data mart is constructed to meet the needs of a specific business unit or function.
• A data warehouse is designed to optimize integration and management of source data; a data mart is designed to optimize delivery of decision-support information.
• A data warehouse manages extensive amounts of atomic-level history; a data mart is primarily focused on managing summary and/or sample data.
• A data warehouse is owned and managed by the enterprise Information Systems (IS) organization; a data mart can be owned and managed by a line-of-business IS group.

20…………..DATA MART:

A data mart (DM) is a specialized version of a data warehouse (DW). Like data warehouses, data marts contain a snapshot of operational data that helps business people strategize based on analyses of past trends and experiences. The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data. A data mart configuration emphasizes easy access to relevant information. There can be multiple data marts inside a single corporation, each one relevant to one or more business units for which it was designed.
Data marts may or may not be dependent on or related to other data marts in a single corporation. If the data marts are designed using conformed facts and dimensions, then they will be related. In some deployments, each department or business unit is considered the owner of its data mart, including all of the hardware, software and data.[1] This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business-unit ownership does not hold true for shared dimensions such as customer, product, etc.

Design schemas
• Star schema (dimensional model): a fairly popular design choice, as it enables a relational database to emulate the analytical functionality of a multidimensional database (a sketch appears at the end of this section).
• Snowflake schema

Reasons for creating a data mart
• Eases access to frequently needed data.
• Creates a collective view for a group of users.
• Improves end-user response time.
• Ease of creation.
• Lower cost than implementing a full data warehouse.
• Potential users are more clearly defined than in a full data warehouse.

Dependent data mart
According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:
• A need for a special data model or schema: e.g., to restructure the data for OLAP.
• Performance: to offload the data mart to a separate computer for greater efficiency, or to obviate the need to manage that workload on the centralized data warehouse.
• Security: to separate an authorized data subset selectively.
• Expediency: to bypass the data governance and authorizations required to incorporate a new application into the enterprise data warehouse.
• Proving ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse.
• Politics: a coping strategy for IT (Information Technology) in situations where a user group has more influence than funding, or is not a good citizen on the centralized data warehouse.
• Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.

According to the Inmon school of data warehousing, tradeoffs inherent in data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and an inability to leverage enterprise sources of data.
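As referenced in the Design schemas list above, the following is a minimal sketch of a star schema for a hypothetical sales data mart, using Python's built-in sqlite3 module. The table and column names are illustrative only: a central fact table references the surrounding dimension tables, and a typical data mart query aggregates the facts by dimension attributes.

```python
# Minimal sketch of a star schema for a hypothetical sales data mart: a
# central fact table (fact_sales) whose rows reference the surrounding
# dimension tables (dim_date, dim_product). All names and values are
# invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date,
                          product_key INTEGER REFERENCES dim_product,
                          amount REAL);
INSERT INTO dim_date    VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO dim_product VALUES (10, 'Books'), (11, 'Games');
INSERT INTO fact_sales  VALUES (1, 10, 120.0), (1, 11, 80.0), (2, 10, 60.0);
""")

# A typical data mart query: total sales by month and product category,
# joining the fact table to its dimensions.
for row in conn.execute("""
    SELECT d.calendar_month, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.calendar_month, p.category
"""):
    print(row)
```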