2 MARKS:
1. What is data mining?
Extraction of non-trivial, previously unknown but potentially useful information from large databases is called data mining.
2. What is Meta-data?
It is a database containing data about data. It is used by the end-users.
3. What is Data-warehousing?
Data warehouses are large databases, built from operational databases, that are designed for strategic decision support.
4. Give the empirical cycle of learning?
It is the cycle: observation → analysis → theory → prediction, with predictions feeding back into new observations.
5. Define Supervised Learning Algorithm:
Algorithms which need the control of a human operator during their execution are called supervised learning algorithms.
6. Give the 2 important types of multi-processing machines.
Symmetric multi-processing and Massively parallel-processing.
7. Give some factors where machine is superior to man.
Speed, accuracy, the ability to handle complexity, and the fact that a machine does not tire of repetition.
8. What is de-duplication of data?
De-duplication is the removal of cases where the same data is represented as several records in the database, for example due to spelling mistakes in the entries.
9. Give some data mining techniques.
Query tools, OLAP, decision trees, association rules, neural networks, genetic algorithms and visualization.
10. Define KDD.
Data mining is, in other terms, referred to as Knowledge Discovery in Databases (KDD). The overall discovery process is called KDD, and the extraction step within it is called data mining.
11. What is DSS?
It stands for Decision Support System. Data are arranged in such a way as to help the DSS.
12. What is learning?
An individual learns how to carry out a certain task by making a transition from a situation in which the task cannot be carried out to a situation in which the same task can be carried out under the same circumstances.
13. What is a data mart?
Smaller and local data warehouses are called data marts.
14. Define Unsupervised learning algorithm.
Algorithms which do not need human intervention during their execution are unsupervised learning algorithms.
15. Six stages of KDD.
Data selection, Cleaning, Enrichment, Coding, Data mining and Reporting.
16. What is domain inconsistency?
Pollution in data is called Domain inconsistency. The data which gives impossible information can be called so.
17. Give the data flow within warehousing.
* Extract and load the data.
* Clean and transform the data.
* Backup and archive data.
* Manage queries.
18. Functions of a Query manager.
It is responsible for directing and managing queries.
19. What are the dimension tables?
The reference tables surrounding the fact tables are called dimension
tables.
20. Define Data-mart.
A subset of the information content of a data warehouse that is stored in its own database, either summarized or in detail.
21. Types of Partitioning?
Horizontal, Vertical and Hardware Partitioning.
22. Give any 2 ways in which a summary table differs from a fact table.
1. Use of Intelligent or Non-intelligent keys.
2. Horizontal Partitioning strategy.
23. Give the areas which form the cost of data marting.
Hardware and Software, Network access and Time-window constraints.
24. What is the use of a Unique identifier?
It is used to avoid any confusion occurring between two fields of the
same name from different sources.
25. What is aggregating a dimension?
In a summary table, each row represents an aggregation of a level within a dimension.
26. Give the functions of a load manager.
Extracts and loads data performing simple transformations before and
during load.
27. What is a fact table?
The central table containing the factual transaction data is called the fact table.
28. Give the uses of Meta data.
• To map data source to common view of information.
• Automate the production of summary table.
• Direct query to appropriate data source.
29. What is star-flake schema?
The combination of a denormalized star and a normalized snowflake is referred to as a star-flake schema.
30. Why are aggregations performed?
Aggregations are performed in order to speed up common queries and to
offset operational cost of creating and managing aggregations.
31. Give the purpose of creating a data mart.
• Speed-up queries by reducing volume of data to be scanned.
• Segment data into different hardware platforms.
32. What is subsuming the dimension?
It is the technique of designing summary tables to retain all the dimensions that have not been aggregated.
33. What are the hardware architectures used as server platforms in data warehousing solutions?
• Symmetric multi-processing
• Massively parallel-processing
34. Define an SMP machine.
It is a set of CPUs that share memory and disk. This is sometimes called a “Shared Everything Environment”.
35. What is a Cluster?
A set of loosely coupled SMP machines connected by a high-speed interconnect. Each machine in a cluster has its own CPU and memory, but they share access to disk. Thus, these systems are called “shared disk systems”.
36. Define an MPP machine.
It is made up of many loosely coupled nodes. These nodes will be linked
together by a high speed interconnection.
37. Define Daemons.
The warehousing application managers are likely to have a number of service and background processes running. These processes are called daemons.
38. What are data extracts?
They are subsets of data that are off-loaded from the server machine to
other machines.
39. List the operations that affect the running of the data warehouse.
• Performance monitoring.
• Scheduling.
• Event monitoring.
40. Define Problem Management.
It is an area that needed to be clearly defined and documented. This whole
area needs to be bottomed out early in the data warehouse project.
41. What are the tasks that need to be accomplished in overnight processing?
• Data roll-up.
• Obtaining the data.
• Data transformation.
42. What are the features required for the management of back-ups?
• Scheduling.
• Backup data tracking.
• Database awareness.
43. What are the data warehouse process managers?
• Load Manager.
• Warehouse Manager.
• Query Manager.
44. What is the Load manager responsible for?
• Data source interaction.
• Data transformation.
• Data load.
45. What are the responsibilities of the Warehouse manager?
• Data movement.
• Meta data management.
• Performance monitoring and tuning.
• Data archiving.
46. What are the approaches followed in Data load?
• Load from flat files.
• 3GL programs for extracting and loading data.
47. Define Archiving.
The storage of data offline is archiving. In a data warehouse environment it is common to keep online only the data required by the business. Legal and other requirements often mean that older data needs to be stored over a long period. This data needs to be in a format that can be quickly restored to the data warehouse if required.
48. What are the responsibilities of the Query manager?
• User access to the data.
• Query Scheduling.
• Query monitoring.
49. What are the two forms taken by vertical partitioning?
• Normalization.
• Row splitting.
50. Define Normalization.
It is a standard relational method of database organization. It allows
common fields to be collapsed into single rows, thereby reducing space
usage.
51. Difference between row splitting and Normalization.
• Row splitting tends to leave a 1-to-1 mapping between the partitions.
• Normalization leads to a 1-to-many mapping.
52. List the uses of Meta data.
• Data transformation and Load
• Data management
• Query generation.
53. What is the information required in source data field?
• Unique identifier
• Name
• Type
• Location
54. What is the information required in destination field?
• Unique identifier
• Name
• Type
• Table name
55. What is base data?
Base data is used to distinguish between the original data load
destination and any copies, such as aggregations, in which the data
may later appear.
56. List some aggregation functions.
Min, max, average, sum
57. What are the tools used to manage data warehousing?
• System management tools.
• Data warehouse process management tools.
58. What are the managers involved in system management?
• Configuration manager.
• Schedule manager.
• Event manager.
59. What is the aim of row splitting?
It is used to speed up the access to the large table by reducing its size.
60. What are the areas that must be addressed in Hardware partitioning?
• Maximizing the processing power available.
• Maximizing disk and I/O performance.
• Avoiding bottlenecks on a single CPU.
• Avoiding bottlenecks on I/O throughput.
61. Define SLA.
SLA stands for Service Level Agreement, which acts as a limit on the
size of any partitioning scheme.
62. What are the steps and guidelines for determining facts and dimensions?
i) Look for the elemental transactions within the business process. This identifies entities that are candidates to be fact tables.
ii) Determine the key dimensions that apply to each fact. This identifies entities that are candidates to be dimension tables.
63. What are the techniques used to design time into the fact table?
• Storing the physical date.
• Storing an offset from the inherent start of the table.
• Storing a date range.
64. What are the two ways used to structure the foreign keys within a fact table?
• Using intelligent keys.
• Using non-intelligent keys.
65. What is an aggregation?
Aggregation strategy relies on the fact that most common queries will
analyze either a subset or an aggregation of the detailed data. A large
number of queries in a typical decision support data warehouse will
ask questions that analyze data dimensionally.
66. What is the need for aggregation?
It is an essential component of decision support data warehouses. It allows us to provide cost-effective query performance, by avoiding the need for substantial investment in processing power in order to address performance requirements.
67. How is the summary table different from the fact table?
i) Use of intelligent or non-intelligent keys.
ii) Use of abbreviations decoded on the fly.
iii) Use of embedded reference keys.
68. Give the reasons for creating data marts.
• To speed up queries by reducing the volume of data to be scanned.
• To structure data in a form suitable for a user access tool.
69. What are summaries?
Summaries are data marted in the same way as they would have been designed within the data warehouse; in other words, each summary table is designed to utilize all the dimension data in the star-flake schema.
70. Which are the areas where data marting strategies incur costs?
i) Hardware and Software.
ii) Network access.
iii) Time-window constraints.
71. Give the factors where data marts are feasible.
• Time window available.
• Network capacity.
• Volume of data being transferred.
72. Define Network access.
The process of loading data into each data mart will be affected by the available capacity of the physical connections between the items of hardware; this is called network access.
73. What are the two distinct phases of load?
i) Daily processing.
ii) Overnight processing.
74. What are the operations involved in overnight processing?
i) Data transformation and load.
ii) Aggregation and index creation.
iii) Backup.
75. What is Daily processing?
It is centered on the user queries.
76. What are the categories of Disk requirement?
• Database requirements:
  a) Administration.
  b) Aggregation.
• Non-database requirements:
  a) Operating System requirements.
  b) Data warehouse requirements.
77. What will we need to know for calculating the actual size?
• The average size of the data in each field.
• The position of each data field in the table.
• The RDBMS storage format of each data type.
78. What are the measures involved in the SLA?
• Average query response time.
• Scan rates.
• I/O throughput rates.
79. What do you mean by data load?
It is the entry point into the system; it provides the first opportunity to improve performance. This is the crucial part of the overnight processing.
80. What are the types of query involved in the data warehousing?
• Fixed queries
• Ad-hoc queries.
81. What is the aim of the data warehouse designer?
To get queries to run against the aggregated data rather than the raw fact data.
82. What are the data maintained in the query manager?
• Query syntax
• Query execution plan
• CPU resource used
• Memory resource used
83. Define complete backup.
The entire database is backed up at the same time. This includes all the database data files, control files and journal files.
84. Define Cold backup.
It is the backup taken while the database is completely shut down. In a multi-instance environment, all the instances of the database must be shut down.
85. Define hot backup.
The terminology comes from the fact that the database engine is up and running, hence hot. There are special requirements that need to be observed if the hot backup is to be valid.
86. What are the technologies involved in tape concepts?
i) Tape media.
ii) Stand-alone tape drives.
iii) Tape stackers.
iv) Tape silos.
87. Give the advantage of disk-to-disk backup.
One advantage is that it gives an online copy of the latest backup.
88. Define Operational System.
It is a system that has responsibilities to the operations of the business.
89. Define Mission critical system.
It is a system that the business absolutely depends on to function.
90. What are the two categories of SLA?
1. User requirements.
2. System requirements.
91. What are the key items involved in an SLA?
• Priority of network access.
• Network availability guarantees.
• Network bandwidth guarantees.
• Priority of access to the backup hardware.
92. What is data shipping?
Data shipping is where a process requests for the data to be shipped to
the location where the process is running.
93. What is function shipping?
Function shipping is where the function to be performed is moved to
the location of the data.
94. Differentiate intra- and inter-statement operations.
Intra-statement operations are separate operations that occur within the confines of the SQL statement itself. In contrast, inter-statement operations take place between statements.
95. What is Rule-based Optimizer?
Rule-based optimizer uses known rules to perform this function. It will
have for example, a rule that says it is preferable to use an index to
access a table rather than scan the whole table.
96. What is Cost-based Optimizer?
A cost-based optimizer uses stored statistics about the tables and their
indices to calculate the best strategy for executing the SQL statement.
97. Define Degree of Parallelism.
The degree of parallelism is the number of processes that will be used
to simultaneously perform any single operation.
98. Define RAID.
It stands for Redundant Array of Inexpensive Disks and it is a technology to provide resilience against disk failure, so that the loss of an individual disk does not mean a loss of data.
99. Define Auditing.
It is a specific subset of security that is often mandated by organizations.
100. What is a Covert Channel?
It is a method whereby restricted information is inadvertently given away by implication rather than by design.
16 MARKS:
1..........An Introduction to Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new
technology with great potential to help companies focus on the most important information in their
data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make
proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining
move beyond the analyses of past events provided by retrospective tools typical of decision support
systems.
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store scanner
data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through
an immense amount of material, or intelligently probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business
opportunities by providing these capabilities:
• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step.
Databases can be larger in both depth and breadth:
• More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
The most commonly used techniques in data mining are:
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique (a minimal sketch follows this list).
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
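As a concrete illustration of the nearest neighbor method listed above, here is a minimal sketch (not a production implementation); the toy historical records, field meanings and distance choice are invented for the example.

```python
# Minimal k-nearest-neighbour sketch: classify a record by the classes of the
# k most similar records in a historical dataset (similarity here is plain
# Euclidean distance over numeric fields). Data values are invented.
import math
from collections import Counter

def euclidean(a, b):
    """Distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(history, new_record, k=3):
    """history: list of (features, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(history, key=lambda item: euclidean(item[0], new_record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy historical dataset: (age, income in thousands) -> creditworthiness
history = [((25, 30), "Bad"), ((45, 90), "Good"),
           ((35, 60), "Good"), ((22, 20), "Bad")]
print(knn_classify(history, (40, 80), k=3))  # expected: "Good"
```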
Many of these technologies have been in use for more than a decade in specialized analysis tools that
work with relatively small volumes of data. These capabilities are now evolving to integrate directly
with industry-standard data warehouse and OLAP platforms. The appendix to this white paper
provides a glossary of data mining terms.
2..............An Architecture for Data Mining
To best apply these advanced techniques, they must be fully integrated with a data warehouse as well
as flexible interactive business analysis tools. Many data mining tools currently operate outside of the
warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when
new insights require operational implementation, integration with the warehouse simplifies the
application of results from data mining. The resulting analytic data warehouse can be applied to
improve business processes throughout the organization, in areas such as promotional campaign
management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for
advanced analysis in a large data warehouse.
Figure 1 - Integrated Data Mining Architecture
The ideal starting point is a data warehouse containing a combination of internal data tracking all
customer contact coupled with external market data about competitor activity. Background
information on potential customers also provides an excellent basis for prospecting. This warehouse
can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on,
and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model
to be applied when navigating the data warehouse. The multidimensional structures allow the user to
analyze the data as they want to view their business – summarizing by product line, region, and other
key perspectives of their business. The Data Mining Server must be integrated with the data
warehouse and the OLAP server to embed ROI-focused business analysis directly into this
infrastructure. An advanced, process-centric metadata template defines the data mining objectives for
specific business issues like campaign management, prospecting, and promotion optimization.
Integration with the data warehouse enables operational decisions to be directly implemented and
tracked. As the warehouse grows with new decisions and results, the organization can continually
mine the best practices and apply them to future decisions.
This design represents a fundamental shift from conventional decision support systems. Rather than
simply delivering data to the end user through query and reporting software, the Advanced Analysis
Server applies users’ business models directly to the warehouse and returns a proactive analysis of
the most relevant information. These results enhance the metadata in the OLAP Server by providing a
dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and
other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
3..................Glossary of Data Mining Terms:
analytical model
A structure and process for analyzing a dataset. For example, a decision tree is a model
for the classification of a dataset.
anomalous data
Data that result from errors (for example, data entry keying errors) or that represent
unusual events. Anomalous data should be examined carefully because it may carry
important information.
artificial neural
networks
Non-linear predictive models that learn through training and resemble biological neural
networks in structure.
classification
The process of dividing a dataset into mutually exclusive groups such that the members
of each group are as "close" as possible to one another, and different groups are as "far"
as possible from one another, where distance is measured with respect to specific
variable(s) you are trying to predict. For example, a typical classification problem is to
divide a database of companies into groups that are as homogeneous as possible with
respect to a creditworthiness variable with values "Good" and "Bad."
clustering
The process of dividing a dataset into mutually exclusive groups such that the members
of each group are as "close" as possible to one another, and different groups are as "far"
as possible from one another, where distance is measured with respect to all available
variables.
data cleansing
The process of ensuring that all values in a dataset are consistent and correctly recorded.
data mining
The extraction of hidden predictive information from large databases.
data navigation
The process of viewing different dimensions, slices, and levels of detail of a
multidimensional database. See OLAP.
data visualization
The visual interpretation of complex relationships in multidimensional data.
data warehouse
A system for storing and delivering massive quantities of data.
decision tree
A tree-shaped structure that represents a set of decisions. These decisions generate rules
for the classification of a dataset. See CART and CHAID.
dimension
In a flat or relational database, each field in a record represents a dimension. In a
multidimensional database, a dimension is a set of similar entities; for example, a
multidimensional sales database might include the dimensions Product, Time, and City.
exploratory data
analysis
The use of graphical and descriptive statistical techniques to learn about the structure of
a dataset.
genetic algorithms
Optimization techniques that use processes such as genetic combination, mutation, and
natural selection in a design based on the concepts of natural evolution.
linear model
An analytical model that assumes linear relationships in the coefficients of the variables
being studied.
linear regression
A statistical technique used to find the best-fitting linear relationship between a target
(dependent) variable and its predictors (independent variables).
logistic regression
A linear regression that predicts the proportions of a categorical target variable, such as
type of customer, in a population.
multidimensional
database
A database designed for on-line analytical processing. Structured as a multidimensional
hypercube with one axis per dimension.
multiprocessor
computer
A computer that includes multiple processors connected by a network. See parallel
processing.
nearest neighbor
A technique that classifies each record in a dataset based on a combination of the classes
of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes
called a k-nearest neighbor technique.
non-linear model
An analytical model that does not assume linear relationships in the coefficients of the
variables being studied.
OLAP
On-line analytical processing. Refers to array-oriented database applications that allow
users to view, navigate through, manipulate, and analyze multidimensional databases.
outlier
A data item whose value falls outside the bounds enclosing most of the other
corresponding values in the sample. May indicate anomalous data. Should be examined
carefully; may carry important information.
parallel processing
The coordinated use of multiple processors to perform computational tasks. Parallel
processing can occur on a multiprocessor computer or on a network of workstations or
PCs.
predictive model
A structure and process for predicting the values of specified variables in a dataset.
prospective data
analysis
Data analysis that predicts future trends, behaviors, or events based on historical data.
RAID
Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of
data for high-performance computer systems.
retrospective data
analysis
Data analysis that provides insights into trends, behaviors, or events that have already
occurred.
rule induction
The extraction of useful if-then rules from data based on statistical significance.
SMP
Symmetric multiprocessor. A type of multiprocessor computer in which memory is shared
among the processors.
terabyte
One trillion bytes.
time series analysis
The analysis of a sequence of measurements made at specified time intervals. Time is
usually the dominating dimension of the data.
4...........Incremental Concept Learning for Bounded Data Mining:
Important refinements of concept learning in the limit from positive data considerably restricting the
accessibility of input data are studied. Let c be any concept; every infinite sequence of elements
exhausting c is called a positive presentation of c. In all learning models considered, the learning
machine computes a sequence of hypotheses about the target concept from a positive presentation of
it. With iterative learning, the learning machine, in making a conjecture, has access to its previous
conjecture and the latest data items coming in. In k-bounded example-memory inference (k is a priori
fixed) the learner is allowed to access, in making a conjecture, its previous hypothesis, its memory of
up to k data items it has already seen, and the next element coming in. In the case of k-feedback
identification, the learning machine, in making a conjecture, has access to its previous conjecture, the
latest data item coming in, and, on the basis of this information, it can compute k items and query the
database of previous data to find out, for each of the k items, whether or not it is in the database (k is
again a priori fixed). In all cases, the sequence of conjectures has to converge to a hypothesis
correctly describing the target concept. Our results are manyfold. An infinite hierarchy of more and
more powerful feedback learners in dependence on the number k of queries allowed to be asked is
established. However, the hierarchy collapses to 1-feedback inference if only indexed families of
infinite concepts are considered, and moreover, its learning power is then equal to learning in the
limit. But it remains infinite for concept classes of only infinite r.e. concepts. Both k-feedback
inference and k-bounded example-memory identification are more powerful than iterative learning but
incomparable to one another. Furthermore, there are cases where redundancy in the hypothesis space
is shown to be a resource increasing the learning power of iterative learners. Finally, the union of at
most k pattern languages is shown to be iteratively inferable.
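The iterative-learning model above can be made concrete with a small sketch. The concept class below (sets of multiples of an unknown number d, given as a stream of positive examples) is chosen purely for illustration; the point is that each new conjecture is computed from the previous conjecture and the latest data item only, and the conjectures converge in the limit.

```python
# A toy iterative learner: the learner sees a positive presentation (a stream
# of elements of the target concept) and may use only its previous conjecture
# and the latest item to form the next conjecture. Target concepts are
# "all multiples of d"; the conjecture is the current guess for d.
import math

def iterative_update(previous_conjecture, new_item):
    """One step: the next conjecture depends only on the previous one and the new item."""
    if previous_conjecture is None:      # nothing seen yet
        return new_item
    return math.gcd(previous_conjecture, new_item)

presentation = [12, 18, 30, 6, 24]       # positive examples of "multiples of 6"
conjecture = None
for item in presentation:
    conjecture = iterative_update(conjecture, item)
    print("saw", item, "-> conjecture: multiples of", conjecture)
```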
5............Knowledge Discovery In Databases: Tools
The amount of data being collected in databases today far exceeds our ability to reduce and analyze
data without the use of automated analysis techniques. Many scientific and transactional business
databases grow at a phenomenal rate. A single system, the astronomical survey application SCICAT,
is expected to exceed three terabytes of data at completion [4]. Knowledge discovery in databases
(KDD) is the field that is evolving to provide automated analysis solutions.
Knowledge discovery is defined as ``the non-trivial extraction of implicit, unknown, and potentially
useful information from data'' [6]. In [5], a clear distinction between data mining and knowledge
discovery is drawn. Under their conventions, the knowledge discovery process takes the raw results
from data mining (the process of extracting trends or patterns from data) and carefully and accurately
transforms them into useful and understandable information. This information is not typically
retrievable by standard techniques but is uncovered through the use of AI techniques.
KDD is a growing field: There are many knowledge discovery methodologies in use and under
development. Some of these techniques are generic, while others are domain-specific. The purpose of
this paper is to present the results of a literature survey outlining the state-of-the-art in KDD
techniques and tools. The paper is not intended to provide an in-depth introduction to each approach;
rather, we intend it to acquaint the reader with some KDD approaches and potential uses.
Background
Although there are many approaches to KDD, six common and essential elements qualify each as a
knowledge discovery technique. The following are basic features that all KDD techniques share
(adapted from [5] and [6]):
• All approaches deal with large amounts of data
• Efficiency is required due to volume of data
• Accuracy is an essential element
• All require the use of a high-level language
• All approaches use some form of automated learning
• All produce some interesting results
Large amounts of data are required to provide sufficient information to derive additional knowledge.
Since large amounts of data are required, processing efficiency is essential. Accuracy is required to
assure that discovered knowledge is valid. The results should be presented in a manner that is
understandable by humans. One of the major premises of KDD is that the knowledge is discovered
using intelligent learning techniques that sift through the data in an automated process. For this
technique to be considered useful in terms of knowledge discovery, the discovered knowledge must be
interesting; that is, it must have potential value to the user.
KDD provides the capability to discover new and meaningful information by using existing data. KDD
quickly exceeds the human capacity to analyze large data sets. The amount of data that requires
processing and analysis in a large database exceeds human capabilities, and the difficulty of
accurately transforming raw data into knowledge surpasses the limits of traditional databases.
Therefore, the full utilization of stored data depends on the use of knowledge discovery techniques.
6............Knowledge Discovery In Databases: Techniques
KDD Techniques
Learning algorithms are an integral part of KDD. Learning techniques may be supervised or
unsupervised. In general, supervised learning techniques enjoy a better success rate as defined in
terms of usefulness of discovered knowledge. According to [1], learning algorithms are complex and
generally considered the hardest part of any KDD technique.
Machine discovery is one of the earliest fields that has contributed to KDD [5]. While machine
discovery relies solely on an autonomous approach to information discovery, KDD typically combines
automated approaches with human interaction to assure accurate, useful, and understandable results.
There are many different approaches that are classified as KDD techniques. There are quantitative
approaches, such as the probabilistic and statistical approaches. There are approaches that utilize
visualization techniques. There are classification approaches such as Bayesian classification, inductive
logic, data cleaning/pattern discovery, and decision tree analysis. Other approaches include deviation
and trend analysis, genetic algorithms, neural networks, and hybrid approaches that combine two or
more techniques.
Because of the ways that these techniques can be used and combined, there is a lack of agreement on
how these techniques should be categorized. For example, the Bayesian approach may be logically
grouped with probabilistic approaches, classification approaches, or visualization approaches. For the
sake of organization, each approach described here is included in the group that it seemed to fit best.
However, this selection is not intended to imply a strict categorization.
Probabilistic Approach
This family of KDD techniques utilizes graphical representation models to compare different knowledge
representations. These models are based on probabilities and data independencies. They are useful for
applications involving uncertainty and applications structured such that a probability may be assigned
to each ``outcome'' or bit of discovered knowledge. Probabilistic techniques may be used in
diagnostic systems and in planning and control systems [2]. Automated probabilistic tools are
available both commercially and in the public domain.
Statistical Approach
The statistical approach uses rule discovery and is based on data relationships. An ``inductive
learning algorithm can automatically select useful join paths and attributes to construct rules from a
database with many relations'' [8]. This type of induction is used to generalize patterns in the data
and to construct rules from the noted patterns. Online analytical processing (OLAP) is an example of a
statistically-oriented approach. Automated statistical tools are available both commercially and in the
public domain.
Classification Approach
Classification is probably the oldest and most widely-used of all the KDD approaches [11]. This
approach groups data according to similarities or classes. There are many types of classification
techniques and numerous automated tools available.
The Bayesian Approach to KDD ``is a graphical model that uses directed arcs exclusively to form an
[sic] directed acyclic graph'' [2]. Although the Bayesian approach uses probabilities and a graphical
means of representation, it is also considered a type of classification.
Bayesian networks are typically used when the uncertainty associated with an outcome can be
expressed in terms of a probability. This approach relies on encoded domain knowledge and has been
used for diagnostic systems. Other pattern recognition applications, including the Hidden Markov
Model, can be modeled using a Bayesian approach [3]. Automated tools are available both
commercially and in the public domain.
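As a hedged, minimal illustration of the Bayesian idea, the sketch below implements a naive Bayes classifier, the simplest special case of a Bayesian network in which every attribute depends only on the class. The records, attribute names and smoothing choice are invented for the example and are not drawn from the cited sources.

```python
# Minimal naive Bayes sketch: pick the class that maximises
# P(class) * product over attributes of P(value | class), with simple
# add-one smoothing so unseen values do not zero out the product.
from collections import Counter, defaultdict

def train(records):
    """records: list of (attributes_dict, class_label)."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)          # (class, attribute) -> value counts
    for attrs, label in records:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(model, attrs):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, 0.0
    for label, count in class_counts.items():
        score = count / total                    # prior P(class)
        for attr, value in attrs.items():
            counts = value_counts[(label, attr)]
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

records = [({"income": "high", "owns_home": "yes"}, "Good"),
           ({"income": "high", "owns_home": "no"},  "Good"),
           ({"income": "low",  "owns_home": "no"},  "Bad"),
           ({"income": "low",  "owns_home": "yes"}, "Bad")]
model = train(records)
print(classify(model, {"income": "high", "owns_home": "yes"}))  # expected: "Good"
```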
Pattern Discovery and Data Cleaning is another type of classification that systematically reduces a
large database to a few pertinent and informative records [7]. If redundant and uninteresting data is
eliminated, the task of discovering patterns in the data is simplified. This approach works on the
premise of the old adage, ``less is more''. The pattern discovery and data cleaning techniques are
useful for reducing enormous volumes of application data, such as those encountered when analyzing
automated sensor recordings. Once the sensor readings are reduced to a manageable size using a
data cleaning technique, the patterns in the data may be more easily recognized. Automated tools
using these techniques are available both commercially and in the public domain.
The Decision Tree Approach uses production rules, builds a directed acyclical graph based on data
premises, and classifies data according to its attributes. This method requires that data classes are
discrete and predefined [11]. According to [5], the primary use of this approach is for predictive
models that may be appropriate for either classification or regression techniques. Tools for decision
tree analysis are available commercially and in the public domain.
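To make the decision tree approach tangible, here is a minimal sketch reduced to a one-level tree (a decision stump): it picks the single discrete attribute whose values best separate the predefined classes and then emits if-then production rules. The records, attribute names and class labels are invented for illustration.

```python
# A one-level decision tree ("stump"): choose the discrete attribute whose
# values best separate the predefined classes, then emit if-then rules that
# assign each value its majority class.
from collections import Counter, defaultdict

def class_counts_by_value(records, attr):
    """Map each value of `attr` to a Counter of class labels seen with it."""
    by_value = defaultdict(Counter)
    for attrs, label in records:
        by_value[attrs[attr]][label] += 1
    return by_value

def build_stump(records, attributes):
    """Return (best_attribute, {value: majority_class}) minimising misclassifications."""
    def errors(attr):
        return sum(sum(c.values()) - c.most_common(1)[0][1]
                   for c in class_counts_by_value(records, attr).values())
    best = min(attributes, key=errors)
    rules = {value: counts.most_common(1)[0][0]
             for value, counts in class_counts_by_value(records, best).items()}
    return best, rules

records = [({"outlook": "sunny", "windy": "no"},  "play"),
           ({"outlook": "sunny", "windy": "yes"}, "play"),
           ({"outlook": "rain",  "windy": "yes"}, "stay"),
           ({"outlook": "rain",  "windy": "no"},  "stay")]
attr, rules = build_stump(records, ["outlook", "windy"])
for value, label in rules.items():
    print(f"IF {attr} = {value} THEN class = {label}")
```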
Deviation and Trend Analysis
Pattern detection by filtering important trends is the basis for this KDD approach. Deviation and trend
analysis techniques are normally applied to temporal databases. A good application for this type of
KDD is the analysis of traffic on large telecommunications networks.
Other Approaches
Neural networks may be used as a method of knowledge discovery. Neural networks are particularly
useful for pattern recognition, and are sometimes grouped with the classification approaches. There
are tools available in the public domain and commercially. Genetic algorithms, also used for
classification, are similar to neural networks although they are typically considered more powerful.
There are tools for the genetic approach available commercially.
Hybrid Approach
A hybrid approach to KDD combines more than one approach and is also called a multi-paradigmatic
approach. Although implementation may be more difficult, hybrid tools are able to combine the
strengths of various approaches. Some of the commonly used methods combine visualization
techniques, induction, neural networks, and rule-based systems to achieve the desired knowledge
discovery. Deductive databases and genetic algorithms have also been used in hybrid approaches.
There are hybrid tools available commercially and in the public domain.
7.............Knowledge Discovery In Databases: Data Selection & Cleaning:
Most organizations with a global reach - government and civilian - collect, generate and/or publish
more data and information in a week than a single human can comprehend in a lifetime. Vannevar
Bush described the problem as early as 1945 in an Atlantic Monthly article, "As We May Think." He
wrote,"...There is a growing mountain of research ...[and] the difficulty seems to be, not so much that
we publish unduly, rather, that publication has been extended far beyond our present ability to make
real use of the record..." While Bush described the inability of scientists to keep abreast of the
massive amount of research data generated during World War II, his assessment of the situation is
equally applicable to the problem facing large organizations today, namely, how and what can be
learned from data collected during the daily performance of business.
Knowledge Discovery In Databases (KDD) is a process an organization can use to tap into the hidden
knowledge contained within the large volume of data an organization collects. The following is a
skeletal view of the KDD process and some suggestions on how an organization can increase the
effectiveness of the KDD effort.
The KDD Process
Most current sources describe KDD as a process consisting of six phases: Data Selection, Cleaning,
Enrichment, Coding, Data Mining and Reporting. I have added two additional phases: Knowledge
Requirement and Action. Using the two additional steps not only increases the efficiency of the
process but also results in more valuable data. The KDD process is iterative and tunable. The output
of and the lessons learned from one phase can be applied to the next phase to improve output.
Knowledge Requirement
Before a KDD process can begin a requirement for knowledge must be identified. This doesn't mean
an organization must know what specific additional knowledge it seeks beforehand. Instead,
management should seek to recognize a pattern or patterns of data. By analyzing this information,
management can formulate a strategy for determining a knowledge requirement, which when
implemented will produce the kind of information needed to make informed business decisions. Note
the word "business" can apply to many different knowledge domains, whether it's the business of
conducting efficient military operations, tracking drug smugglers, or selling baby food.
By formally adopting a knowledge requirement phase in its KDD effort, an organization can improve
the KDD process effectiveness. The improvement comes from analyzing and documenting what forms
of knowledge are gained given a stated knowledge requirement and implementing each of the phases
of KDD - data selection, cleaning, enrichment and data mining. This plan also includes the data mining
techniques employed and the actions taken based upon knowledge gained.
Data Selection
As mentioned above, the data used in the KDD process is selected based on an evaluation of its
potential to yield knowledge. Often these data sets are not in the best possible form for the KDD
process and the evaluation is really just an organization's best guess as to the value of the data.
Generally speaking, while an experienced user is an asset to an organization, the empowerment
individual users gain through experience and training is a problem that stands in the way of an
effective KDD effort. As users gain proficiency and independence in solving business problems with
their PCs, data, which is vital to an organization, is captured in databases that are poorly designed
and implemented. There is no thought or plan for future integration with other organizational
databases or strategy to capitalize on the investment of data already collected. The concept of the
data warehouse is a solution to this problem.
The Data Warehouse
The data warehouse is designed to centralize an organization's data. However, structuring and
implementing an organization's data warehouse means more than just rounding up all the problematic
data stores with an eye towards making them accessible in a unified manner. An effectively designed
data warehouse facilitates the KDD process.
Cleaning
If an organization's data collection techniques and database structure guidelines were properly
employed, the data contained in an organization's database would already be clean. But since the
majority of organizations have not implemented and provided guidance for a true data warehouse the
data selected for the KDD process must be cleaned. I was recently asked in what phase of the KDD
could an organization expect to spend most of its time, and hence, money. The answer is in the
cleaning phase, especially if steps had not been taken to ensure the reliability of the data initially.
Data cleaning is labor intensive and involves examining the data for completeness and integrity. A few
examples of dirty data include: missing field information, names that are the same but spelled slightly
differently, duplicate data or data that has been intentionally altered in an attempt to gain services
fraudulently. The causes of dirty or noisy data are many and each business domain has unique data
cleanliness issues. Pattern matching algorithms exist to help in the data cleaning effort but as of yet
they are not a complete solution.
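One of the dirty-data cases mentioned above (names that are the same but spelled slightly differently) can be attacked with simple pattern matching. The sketch below uses a similarity ratio from the standard library; the threshold and sample names are illustrative, and real de-duplication would need domain-specific rules on top of this.

```python
# Flag pairs of records whose name fields look like spelling variants of the
# same name, using difflib's similarity ratio. Threshold and data are invented.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

records = ["John Doe", "Jon Doe", "Jane Smith", "John  Doe", "J. Smith"]
suspected_duplicates = [(x, y) for i, x in enumerate(records)
                        for y in records[i + 1:] if similar(x, y)]
print(suspected_duplicates)   # e.g. pairs of "John Doe" spelling variants
```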
8.............Knowledge Discovery In Databases: Enrichment and Coding, Data Mining &
Reporting:
Enrichment and Coding
In the enrichment phase, data, which may be obtained from an external source outside the
organization, is added to the selected and cleaned data to add value to it. For example, in a retail KDD
effort, customer shopping data is combined with third party demographic data, which maps customer
names to addresses, number and types of cars they own, how many and what ages are their children,
etc., perhaps to produce a target market group.
In the data coding phase raw data is converted into a form that lends itself to be more effectively
mined. It's sort of like cloud salting. Attributes can be, and often are, added to enhance the discovery
of patterns within the data. Values, such as dates, times, salaries or other data that express numerical
values are normalized in such a manner that informational clusters are easily formed and associations
can be more readily identified.
The following record is an example of one possible form of data normalization. This record contains
fields for name, date of birth, age, salary, number of degrees, number of cars owned, number of
magazines subscribed to and number of children:
Doe | John | 4 February 1961 | 39 | 145000 | 2 | 3 | 6 | 2 |
The focus of a coding transformation on this type of data might be to bring the salary and age fields to
within the same value order as the other numeric fields. Dividing 39 by 10 and 145000 by 100000
would do the trick, yielding the following record:
Doe | John | 4 February 1961 | 3.9 | 1.45 | 2 | 3 | 6 | 2 |
The numerical points of this record, when plotted, will fall closer to each other than if the data had
been plotted in its original form. Depending on the data mining technique used on this particular kind
of data set, this could prove to be an extremely valuable transformation.
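The scaling just described can be written down directly; the following sketch applies the divide-by-10 and divide-by-100000 coding to the sample record from the text (field positions are assumed to follow that record).

```python
# Coding transformation from the example above: bring the age and salary
# fields into the same value order as the other numeric fields.
record = ["Doe", "John", "4 February 1961", 39, 145000, 2, 3, 6, 2]

def code_record(rec):
    coded = list(rec)
    coded[3] = rec[3] / 10          # age: 39 -> 3.9
    coded[4] = rec[4] / 100000      # salary: 145000 -> 1.45
    return coded

print(code_record(record))
# ['Doe', 'John', '4 February 1961', 3.9, 1.45, 2, 3, 6, 2]
```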
Data Mining
Data mining is the term once used to describe the entire KDD process, or rather, data mining was the
KDD process. Through research, trial and error, data mining practitioners discovered that effective
data mining required good, clean, properly enriched and intelligently coded data.
The data mining phase is where the KDD rubber hits the road. Some practitioners describe it as an
anything goes affair and advocate applying any technique that helps extract knowledge from data.
Many useful data mining techniques exist. A few of these include query tools, statistical methods, data
visualization, Online Analytical Processing (OLAP), case-based learning (k's nearest neighbor),
decision trees, association rules, neural networks and genetic algorithms. The "k's" nearest neighbor
refers to an association or grouping algorithm that compares each point or data element with a
number of its neighbors. The letter k represents a number, i.e., find 5 of the nearest neighbors to a
particular data element.
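Among the techniques listed, association rules are easy to illustrate with the two standard measures, support and confidence. The transactions below are invented; the sketch only shows how the measures are computed, not how candidate rules are searched for.

```python
# Support: fraction of transactions containing an item set.
# Confidence of "if A then B": support(A and B) / support(A).
def support(transactions, items):
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # about 0.67
```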
Levels of Knowledge and the Use of Visualization Techniques
There are four levels of knowledge that exist within an organization's data:
• Shallow knowledge can be easily extracted with query tools and it's estimated that shallow knowledge accounts for 80% of the knowledge in an organization's data.
• Multi-dimensional knowledge can be mined using OLAP tools.
• Hidden knowledge must be found using pattern recognition or machine learning algorithms.
• Deep knowledge is interesting in that you must have a clue to its existence before you can find it.
Reporting and Action
The reporting phase makes the knowledge gained from the KDD process available to the right people
in the right format. Knowledge is power but it is useless if not acted upon. To reap the full benefits of
a KDD effort, organizations should adopt a formal knowledge action process through which knowledge
gained via a KDD effort is formally reviewed to see how it can best be used within the organization.
9.............. Security on Data Warehousing:
Defining warehouse security :
The Data Warehouse Center controls access with user IDs. When a user logs on, the user ID is
compared to the warehouse users that are defined in the Data Warehouse Center to determine
whether the user is authorized to access the Data Warehouse Center. You can authorize additional
users to access the Data Warehouse Center by defining new warehouse users.
The user ID for the new user does not require authorization to the operating system or the warehouse
control database. The user ID exists only within the Data Warehouse Center.
In the Data Warehouse Center, warehouse groups include the authorization to perform certain tasks
and access objects in the Data Warehouse Center. To authorize one or more users to perform tasks,
you must define a warehouse group, and then add the users to the group.
To define security for your data warehouse:
1. Define warehouse users.
2. Define warehouse groups.
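The users-and-groups model described above can be illustrated generically. The sketch below is not the Data Warehouse Center API, just an assumed, minimal way of checking whether a user ID belongs to a warehouse group that carries the required authorization; group names and tasks are invented.

```python
# Toy group-based authorization check: a user is authorized for a task if any
# warehouse group containing that user also grants the task.
warehouse_groups = {
    "warehouse_admins":  {"members": {"alice"}, "tasks": {"define_users", "define_groups", "run_queries"}},
    "warehouse_readers": {"members": {"bob"},   "tasks": {"run_queries"}},
}

def authorized(user_id, task):
    return any(user_id in group["members"] and task in group["tasks"]
               for group in warehouse_groups.values())

print(authorized("bob", "run_queries"))    # True
print(authorized("bob", "define_users"))   # False
```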
Data warehousing systems present special security issues including:
• The degree of security appropriate to summaries and aggregates of data,
• The security appropriate for the exploration data warehouse, specifically designed for browsing and ad hoc queries, and
• The uses and abuses of data encryption as a method of enhancing privacy.
10………….Backup and Recovery of the Data Warehouse
Recoverability has various levels: recovery from disk failure, human error, software failure, fire, and
so on, requires different procedures. Oracle provides only part of the solution. Organizations must
decide how much to spend on backup and recovery by considering the business cost of a long outage.
The NOLOGGING option enables you to perform certain operations without the overhead of generating
a log. Even without logging, you can protect against disk failure if you use disk mirroring or RAID technology. If
you load your warehouse from tapes every day or week, you might satisfactorily recover from all
failures simply by saving copies of the tape in several remote locations and reloading from tape when
something goes wrong.
At the other end of the spectrum, you could both mirror disks and take backups and archive logs, and
maintain a remote standby system. The mirrored disks prevent loss of availability for disk failure, and
also protect against total loss in the event of human error (such as dropping the wrong table) or
software error (such as disk block corruption). In the event of fire, power failure, or other problems at
the primary site, the backup site prevents long outages.
11………….Tuning Database Performance:
Elements of performance
Performance is the way a computer system behaves given a particular work load. Performance is
measured in terms of system response time, throughput, and availability. Performance is also affected
by:
• The resources available in your system
• How well those resources are used and shared.
In general, you tune your system to improve its cost-benefit ratio. Specific goals could include:
• Processing a larger, or more demanding, work load without increasing processing costs (for example, to increase the work load without buying new hardware or using more processor time)
• Obtaining faster system response times, or higher throughput, without increasing processing costs
• Reducing processing costs without degrading service to your users
Translating performance from technical terms to economic terms is difficult. Performance tuning
certainly costs money in terms of user time as well as processor time, so before you undertake a
tuning project, weigh its costs against its possible benefits. Some of these benefits are tangible:
• More efficient use of resources
• The ability to add more users to the system.
12………Designing Database:
Relational databases
A relational database is a database that is treated as a set of tables and manipulated in accordance
with the relational model of data. It contains a set of objects used to store, manage, and access data.
Examples of such objects are tables, views, indexes, functions, triggers, and packages.
A partitioned relational database is a relational database whose data is managed across multiple
partitions (also called nodes). This separation of data across partitions is transparent to users of most
SQL statements. However, some data definition language (DDL) statements take partition information
into consideration (for example, CREATE DATABASE PARTITION GROUP). (Data definition language is
the subset of SQL statements used to describe data relationships in a database.)
A federated database is a relational database whose data is stored in multiple data sources (such as
separate relational databases). The data appears as if it were all in a single large database and can be
accessed through traditional SQL queries. Changes to the data can be explicitly directed to the
appropriate data source.
Structured Query Language (SQL)
SQL is a standardized language for defining and manipulating data in a relational database. In
accordance with the relational model of data, the database is treated as a set of tables, relationships
are represented by values in tables, and data is retrieved by specifying a result table that can be
derived from one or more base tables.
SQL statements are executed by a database manager. One of the functions of the database manager
is to transform the specification of a result table into a sequence of internal operations that optimize
data retrieval. The transformation occurs in two phases: preparation and binding.
All executable SQL statements must be prepared before they can be executed. The result of
preparation is the executable or operational form of the statement. The method of preparing an SQL
statement and the persistence of its operational form distinguish static SQL from dynamic SQL.
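The prepare-then-execute idea can be seen in miniature with the standard-library sqlite3 module: the statement text carries parameter markers, the database manager turns it into an executable form, and values are bound when the statement runs. The table and data are invented; this is a sketch of dynamic SQL in one particular database manager, not of any system named above.

```python
# Parameterized (dynamic) SQL with sqlite3: statements are prepared at run
# time and ? markers are bound to values on execution.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 75.5), ("North", 40.0)])

# The SELECT specifies a result table derived from the base table.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)                 # ('North', 160.0) and ('South', 75.5)
conn.close()
```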
13…………Administering Database Systems: Moving data between typed tables
The DB2(R) export and import utilities can be used to move data out of, and into, typed tables. Typed
tables may be in a hierarchy. Data movement across hierarchies can include:
• Movement from one hierarchy to an identical hierarchy.
• Movement from one hierarchy to a sub-section of a larger hierarchy.
• Movement from a sub-section of a large hierarchy to a separate hierarchy.
The IMPORT CREATE option allows you to create both the table hierarchy and the type hierarchy.
Identification of types in a hierarchy is database dependent. This means that in different databases,
the same type has a different identifier. Therefore, when moving data between these databases, a
mapping of the same types must be done to ensure that the data is moved correctly.
Before each typed row is written out during an export operation, an identifier is translated into an
index value. This index value can be any number from one to the number of relevant types in the
hierarchy. Index values are generated by numbering each type when moving through the hierarchy in
a specific order. This order is called the traverse order. It is the order of proceeding top-to-bottom,
left-to-right through all of the supertables and subtables in the hierarchy. The traverse order is
important when moving data between table hierarchies, because it determines where the data is
moved in relation to other data.
One method is to proceed from the top of the hierarchy (or the root table), down the hierarchy
(subtables) to the bottom subtable, then back up to its supertable, down to the next "right-most"
subtable(s), then back up to next higher supertable, down to its subtables, and so on.
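The traverse order described above amounts to a depth-first, pre-order walk of the table hierarchy. The following sketch numbers each table in that order; the hierarchy used is an invented example, not a DB2 structure.

```python
# Number the types in a table hierarchy in traverse order: top to bottom,
# left to right (depth-first, pre-order).
def traverse_order(hierarchy, root):
    """hierarchy: dict mapping each table to its subtables, left to right."""
    order, stack = [], [root]
    while stack:
        table = stack.pop()
        order.append(table)
        # push children right-to-left so the leftmost subtable is visited first
        stack.extend(reversed(hierarchy.get(table, [])))
    return {table: index + 1 for index, table in enumerate(order)}

hierarchy = {"Person": ["Employee", "Customer"], "Employee": ["Manager"]}
print(traverse_order(hierarchy, "Person"))
# {'Person': 1, 'Employee': 2, 'Manager': 3, 'Customer': 4}
```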
14……….Monitoring database systems: Database system monitor data organization
The database system monitor stores information it collects in entities called monitor elements (these
were previously known as data elements). Each monitor element stores information regarding one
specific aspect of the state of the database system. In addition, monitor elements are identified by
unique names and store a certain type of information.
The following are the available element types in which monitor elements store data:
Counter
Counts the number of times an activity occurs. Counter values increase during monitoring.
Most counter elements can be reset.
Gauge
Indicates the current value for an item. Gauge values can go up and down depending on
database activity (for example, the number of locks held). Gauge elements can not be reset.
Water mark
Indicates the highest (maximum) or lowest (minimum) value an element has reached since
monitoring was started. Water mark elements can not be reset.
Information
Provides reference-type details of your monitoring activities. This can include items such as
partition names, aliases, and path details. Information elements can not be reset.
Timestamp
Indicates the date and time that an activity took place by providing the number of seconds
and microseconds that have elapsed since January 1, 1970. For the snapshot monitor and
event monitors, the collection of timestamp elements is controlled by the TIMESTAMP
monitor switch. While this switch is on by default, you should turn it off for performance
reasons if CPU utilization on the database instance approaches 100%. Timestamp elements
can not be reset.
Time
Returns the number of seconds and microseconds spent on an activity. For the snapshot
monitor and event monitors, the collection of most time elements is controlled by the
TIMESTAMP monitor switch. While this switch is on by default, you should turn it off for
performance reasons if CPU utilization on the database instance approaches 100%. Some
time elements can be reset.
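The behaviours of the element types above can be sketched with a few toy classes; this is an illustration of the concepts only, not the DB2 monitor interface.

```python
# Toy illustrations of three monitor element behaviours: a counter that only
# increases and can be reset, a gauge that moves up and down with activity,
# and a water mark that remembers the highest value observed.
class MonitorCounter:
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
    def reset(self):
        self.value = 0

class Gauge:
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value       # gauges can go up or down

class WaterMark:
    def __init__(self):
        self.high = None
    def observe(self, value):
        self.high = value if self.high is None else max(self.high, value)

locks_held, deadlocks, max_locks_held = Gauge(), MonitorCounter(), WaterMark()
for current_locks in (3, 7, 2):
    locks_held.set(current_locks)
    max_locks_held.observe(current_locks)
deadlocks.increment()
print(locks_held.value, deadlocks.value, max_locks_held.high)  # 2 1 7
```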
Monitor elements collect data for one or more logical data groups. A logical data
group is a collection of monitor elements that gather database system monitoring
information for a specific scope of database activity. Monitor elements are sorted into
logical data groups based on the levels of information they provide. For example,
during snapshot monitoring, the Total Sort Time monitor element returns database
(dbase), application (appl), and statement (stmt) information; hence, it appears in
each of the logical data groups listed in parentheses.
Although many monitor elements are used by both the snapshot monitor and event
monitors, they each use a distinct set of logical data groups. This is because the
scopes of database activity for which you can capture a snapshot differ from those
for which you can collect event data. Practically speaking, the overall set of monitor
elements accessible from the snapshot monitor is different from those accessible
from event monitors.
15……….Data Redundancy
There are essentially three levels of data redundancy that enterprises should think about when
considering their data warehouse options:
• "Virtual" or "Point-to-Point" Data Warehouses
• Central Data Warehouses
• Distributed Data Warehouses
There is no one best approach. Each option fits a specific set of requirements, and a data warehousing
strategy may ultimately include all three options.
"Virtual" or "Point-to-Point" Data Warehouses
A virtual or point-to-point data warehousing strategy means that end-users are allowed to access
operational databases directly, using whatever tools are enabled on the "data access network". This
approach provides the ultimate in flexibility as well as the minimum amount of redundant data that
must be loaded and maintained. This approach can also put the largest unplanned query load on
operational systems.
As we will see, virtual warehousing is often an initial strategy in organizations where there is a broad
but largely undefined need to get at operational data from a relatively large class of end-users and
where the likely frequency of requests is low. Virtual data warehouses often provide a starting point
for organizations to learn what end-users are really looking for. Figure 3 below shows a Virtual Data
Warehouse within the Data Warehouse Architecture.
Central Data Warehouses
Central Data Warehouses are what most people think of when they first are introduced to the concept
of data warehouse. The central data warehouse is a single physical database that contains all of the
data for a specific functional area, department, division, or enterprise. Central Data Warehouses are
often selected where there is a common need for informational data and there are large numbers of
end-users already connected to a central computer or network.
Central Data Warehouses are real. The data stored in the data warehouse is accessible from one place
and must be loaded and maintained on a regular basis. Normally, data warehouses are built around
advanced RDBMSs or some form of multi-dimensional informational database server.
Distributed Data Warehouses
Distributed Data Warehouses are just what their name implies. They are data warehouses in which
certain components of the data warehouse are distributed across a number of different physical
databases. Increasingly, large organizations are pushing decision-making down to lower and lower
levels of the organization and in turn pushing the data needed for decision making down (or out) to
the LAN or local computer serving the local decision-maker.
Distributed Data Warehouses usually involve the most redundant data and, as a consequence, the most
complex loading and updating processes.
16………….Developing Data Warehouses
Developing a good data warehouse is no different from any other IT project; it requires careful
planning, requirements definition, design, prototyping and implementation. The first and most
important element is a planning process that determines what kind of data warehouse strategy the
organization is going to start with.
Developing a Data Warehouse Strategy
Before developing a data warehouse, it is critical to develop a balanced data warehousing strategy
that is appropriate for the organization's needs and its user population. Who is the audience? What is the scope?
What type of data warehouse should we build?
There are a number of strategies by which organizations can get into data warehousing. One way is to
establish a "Virtual Data Warehouse" environment. A Virtual Data Warehouse is created by: (1)
installing a set of data access, data directory and process management facilities, (2) training the
end-users, (3) monitoring how the data warehouse facilities are actually used, and then (4) based on
actual usage, creating a physical data warehouse to support the high-frequency requests.
A second strategy is simply to build a copy of the operational data from a single operational system
and enable access to it through a series of information access tools. This strategy has the
advantage of being both simple and fast. Unfortunately, if the existing data is of poor quality and/or
the access to the data has not been thought through, then this approach can create a number of
significant problems.
Ultimately, the optimal data warehousing strategy is to select a user population based on value to the
enterprise and do an analysis of their issues, questions and data access needs. Based on these needs,
prototype data warehouses are built and populated so the end-users can experiment and modify their
requirements. Once there is general agreement on the needs, then the data can be acquired from
existing operational systems across the enterprise and/or from external data sources and loaded into
the data warehouse. If it is required, information access tools can also be enabled to allow end-users
to have access to required data using their own favorite tools, or to allow for the creation of
high-performance multi-dimensional information access systems using the core data warehouse as the
basis.
Evolving a Data Warehouse Architecture
The Data Warehouse Architecture in Figure 1 is simply a framework for understanding data
warehousing and how the components of data warehousing fit together. Only the most sophisticated
organizations will be able to put together such an architecture the first time out. What the Data
Warehouse Architecture provides then is a kind of roadmap that can be used to design toward.
Coupled with an understanding of the options at hand, the Data Warehouse Architecture provides a
useful way of determining if the organization is moving toward a reasonable data warehousing
framework.
One of the keys to data warehousing is flexibility. It is critical to keep in mind that the more successful
a data warehouse strategy is, the more end-users are going to want to add to it.
Designing Data Warehouses
Designing data warehouses is very different from designing traditional operational systems. For one
thing, data warehouse users typically don't know nearly as much about their wants and needs as
operational users. Second, designing a data warehouse often involves thinking in terms of much
broader, and more difficult to define, business concepts than does designing an operational system. In
this respect, data warehousing is quite close to Business Process Reengineering (BPR). Finally, the
ideal design strategy for a data warehouse is often outside-in as opposed to top-down.
17…………. Managing Data Warehouses
Data Warehouses are not magic; they take a great deal of very hard work. In many cases data
warehouse projects are viewed as a stopgap measure to get users off our backs or to provide
something for nothing. But data warehouses require careful management and marketing. A data
warehouse is a good investment only if end-users actually can get at vital information faster and
cheaper than they can using current technology. As a consequence, management has to think
seriously about how they want their warehouses to perform and how they are going to get the word
out to the end-user community. And management has to recognize that the maintenance of the data
warehouse structure is as critical as the maintenance of any other mission-critical application. In fact,
experience has shown that data warehouses quickly become one of the most used systems in any
organization.
Management, especially IT management, must also understand that if they embark on a data
warehousing program, they are going to create new demands upon their operational systems:
demands for better data, demands for consistent data, demands for different kinds of data.
18………..Updating Data Warehouse Data
Updating data warehouse data includes periodically extracting data from operational systems,
cleansing and transforming the data, and loading the new data into the data warehouse. Each data
update also includes tasks that must be accomplished to synchronize cubes if Microsoft® SQL
Server™ 2000 Analysis Services is used for online analytical processing (OLAP), and to update any
data marts that are part of the data warehouse.
The process of extracting, cleansing, and transforming data for a periodic update is essentially the
same as the process used in the initial loading of the data warehouse, although the update process is
often much less complex and more automated than the initial load process. Procedures and
automated tasks developed during the initial load process can reduce the amount of manual effort
required during updates. Corrections to source operational systems identified and implemented during
the initial load also reduce the number of inconsistencies and errors that must be addressed during
updates. However, it is often the case that manual intervention is required during updates to ensure
the data is ready for loading into the data warehouse.
One difference between the initial data load and data updates is that referential integrity verification
should be performed incrementally on the update data before it is loaded into the data warehouse and
made available to users. Updates often include additions and changes to dimension tables as well as
the addition of rows to the fact tables. The new and changed data should be checked for internal
consistency as well as verified against existing data in the data warehouse before it is loaded into the
data warehouse.
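For example, the incremental referential-integrity check described above might look like the
following Python sketch; the table layouts, column names, and the check_update_batch() helper are
assumptions for illustration and are not part of SQL Server, DTS, or bcp.

# Minimal sketch (assumed tables and column names) of verifying an update batch
# before it is loaded: every foreign key in the new fact rows must resolve against
# the dimension keys already in the warehouse plus the dimension rows arriving in
# the same batch.

existing_customer_keys = {101, 102, 103}        # keys already in the customer dimension
new_dimension_rows = [{"customer_key": 104, "name": "Acme Ltd"}]
new_fact_rows = [
    {"customer_key": 101, "sales_amount": 250.0},
    {"customer_key": 104, "sales_amount": 75.0},   # satisfied by the new dimension row
    {"customer_key": 999, "sales_amount": 10.0},   # orphan: would break referential integrity
]

def check_update_batch(existing_keys, dimension_rows, fact_rows):
    """Return the fact rows whose customer_key has no matching dimension row."""
    valid_keys = set(existing_keys) | {row["customer_key"] for row in dimension_rows}
    return [row for row in fact_rows if row["customer_key"] not in valid_keys]

orphans = check_update_batch(existing_customer_keys, new_dimension_rows, new_fact_rows)
if orphans:
    # Reject or repair these rows before loading the batch into the warehouse.
    print("Rows failing referential integrity:", orphans)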
After the update data has been made ready for loading into the data warehouse, you can use
Transact-SQL, Data Transformation Services (DTS), or the bcp utility to update the data warehouse
tables. Depending on the design and implementation of the presentation applications that provide
access to data warehouse data for end users, you may need to take the data warehouse offline during
the update to prevent inconsistencies in query results.
19………..Data Warehouses or Data Marts?
For many years, all systems that extracted production and other data from source systems and stored
it in useful ways in a decision-support system were called data warehouses. Over the past few years,
a distinction has been articulated between data warehouses and data marts, even as the overall
market continues to be called data warehousing.
General industry discussions indicate that, as of January 1997, approximately 70 percent to 80
percent of all data warehouses currently in production are, in fact, data marts.
Intelligent Solutions, Inc., compares data warehouses to data marts in the following way:
Data Warehouse: Constructed to meet the information needs of the entire enterprise.
Data Mart: Constructed to meet the needs of a specific business unit or function.

Data Warehouse: Designed to optimize integration and management of source data.
Data Mart: Designed to optimize delivery of decision support information.

Data Warehouse: Manages extensive amounts of atomic-level history.
Data Mart: Primarily focused on managing summary and/or sample data.

Data Warehouse: Owned and managed by enterprise Information Systems (IS) organizations.
Data Mart: Can be owned and managed by a Line of Business IS group.
20…………..DATA MART:
A data mart (DM) is a specialized version of a data warehouse (DW). Like data warehouses, data
marts contain a snapshot of operational data that helps business people to strategize based on
analyses of past trends and experiences. The key difference is that the creation of a data mart is
predicated on a specific, predefined need for a certain grouping and configuration of select data. A
data mart configuration emphasizes easy access to relevant information.
There can be multiple data marts inside a single corporation; each one relevant to one or more
business units for which it was designed. DMs may or may not be dependent or related to other data
marts in a single corporation. If the data marts are designed using conformed facts and dimensions,
then they will be related. In some deployments, each department or business unit is considered the
owner of its data mart including all the hardware, software and data.[1] This enables each department
to use, manipulate and develop their data any way they see fit, without altering information inside
other data marts or the data warehouse. In other deployments where conformed dimensions are
used, this business unit ownership will not hold true for shared dimensions like customer, product,
etc.
Design schemas
• Star schema (dimensional model): a fairly popular design choice, as it enables a relational
database to emulate the analytical functionality of a multidimensional database (see the sketch
after this list).
• Snowflake schema
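As an illustration of how a star schema lets a relational layout answer multidimensional questions,
here is a small Python sketch using pandas; the table names, columns, and figures are hypothetical
examples, not data from any real mart.

import pandas as pd

# Hypothetical star schema: one fact table (sales) whose rows carry foreign keys
# into two dimension tables (product, date). All names and numbers are made up.
product_dim = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})
date_dim = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month": ["2024-01", "2024-01"],
    "quarter": ["2024-Q1", "2024-Q1"],
})
sales_fact = pd.DataFrame({
    "product_key": [1, 1, 2],
    "date_key": [20240101, 20240102, 20240102],
    "units_sold": [10, 4, 7],
    "revenue": [100.0, 40.0, 91.0],
})

# Joining the fact table to its dimensions and grouping by dimension attributes
# emulates a multidimensional "slice": revenue by category and month.
cube_slice = (
    sales_fact
    .merge(product_dim, on="product_key")
    .merge(date_dim, on="date_key")
    .groupby(["category", "month"], as_index=False)["revenue"]
    .sum()
)
print(cube_slice)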
Reasons for creating a data mart
• Easy access to frequently needed data
• Creates a collective view for a group of users
• Improves end-user response time
• Ease of creation
• Lower cost than implementing a full data warehouse
• Potential users are more clearly defined than in a full data warehouse
Dependent data mart
According to the Inmon school of data warehousing, a dependent data mart is a logical subset
(view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following
reasons:
• A need for a special data model or schema: e.g., to restructure for OLAP
• Performance: to offload the data mart to a separate computer for greater efficiency or to
obviate the need to manage that workload on the centralized data warehouse
• Security: to separate an authorized data subset selectively
• Expediency: to bypass the data governance and authorizations required to incorporate a new
application on the Enterprise Data Warehouse
• Proving Ground: to demonstrate the viability and ROI (return on investment) potential of an
application prior to migrating it to the Enterprise Data Warehouse
• Politics: a coping strategy for IT (Information Technology) in situations where a user group
has more influence than funding or is not a good citizen on the centralized data warehouse
• Politics: a coping strategy for consumers of data in situations where a data warehouse team
is unable to create a usable data warehouse
According to the Inmon school of data warehousing, tradeoffs inherent in data marts include limited
scalability, duplication of data, data inconsistency with other silos of information, and the inability
to leverage enterprise sources of data.