CS5483 review questions:
1. How to differentiate between HTML and XML?
Answer:
• The tags used to mark up HTML documents and the structure of HTML
documents are predefined.
• The author of an HTML document can only use tags that are defined in the
HTML standard.
• XML allows authors to define their own tags and their own document
structure.
As a result, XML is extensible (authors can define application-specific markup),
but HTML is not. XML complements HTML rather than replacing it: XML is used to
structure and describe Web data, while HTML is used to format and display the
same data. XML can also store data inside HTML documents (Data Islands), and
can be used to exchange and store data.
2. How to implement cardinality data semantic in DTD?
Answer:
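One-to-one cardinality
[DTD Graph: Element A (attributes A1, A2) with child Element B (attributes B1, B2)]
DTD
<!ELEMENT A (B)>
<!ATTLIST A A1 CDATA #REQUIRED>
<!ATTLIST A A2 CDATA #REQUIRED>
<!ELEMENT B EMPTY>
<!ATTLIST B B1 CDATA #REQUIRED>
<!ATTLIST B B2 CDATA #REQUIRED>
One-to-many cardinality
DTD
The declarations are the same as above, except that element A may contain any
number of B elements:
<!ELEMENT A (B)*>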
3. How to define various data semantics in relational database?
Answer:
One-to-one cardinality / one-to-many cardinality / total participation
Relation President (President_name, Race, *Nation_name)
Relation Nation (Nation_name, Nation_size)
where underlined attributes are primary keys and attributes prefixed with "*" are foreign keys
Many-to-many cardinality
Relation Student (Student_id, Student_name)
Relation Course (Course_id, Course_name)
Relation take (*Student_id, *Course_id)
Is-a relationship
Relation Male (Name, Height)
Relation Father (*Name, Birth_date)
Disjoint generalization / overlap generalization:
Relation Boat_person (Name, Birth_date, Birth_place)
Relation Refugee (*Name, Open_center)
Relation Non-refugee (*Name, Detention_center)
Categorization
Relation Department (Borrower_card, Department_id)
Relation Doctor (Borrower_card, Doctor_name)
Relation Hospital (Borrower_card, Hospital_name)
Relation Borrower (*Borrower_card, Return_date, File_id)
Aggregation:
Relation Student (Student_no, Student_name)
Relation Course (Course_no, Course_name)
Relation Takes (*Student_no, *Course_no, *Instructor_name)
Relation Instructor (Instructor_name, Department)
Partial participation:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_no, Employee_name, &Department_id)
where "&" means that a null value is allowed
Weak entity:
Relation Hotel (Hotel_name, Ranking)
Relation Room (*Hotel_name, Room_no, Room_size)
N-ary relationship:
Relation Engineer (Employee_id, Employee_name)
Relation Skill (Skill_name, Years_experience)
Relation Project (Project_id, Start_date, End_date)
Relation Skill_used (*Employee_id, *Skill_name, *Project_id)
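The President/Nation relations above can be tried out with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the column types, the choice of President_name and Nation_name as primary keys (the underlining is not visible in these notes), the helper try_insert, and the sample rows are all illustrative. NOT NULL on the foreign key stands for total participation and UNIQUE stands for the one-to-one case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Nation (
    Nation_name TEXT PRIMARY KEY,          -- assumed primary key
    Nation_size INTEGER
);
CREATE TABLE President (
    President_name TEXT PRIMARY KEY,       -- assumed primary key
    Race TEXT,
    -- NOT NULL: total participation; UNIQUE: at most one president per nation
    Nation_name TEXT NOT NULL UNIQUE REFERENCES Nation(Nation_name)
);
""")

# Illustrative rows (not from the notes).
conn.execute("INSERT INTO Nation VALUES ('ExampleLand', 1000)")
conn.execute("INSERT INTO President VALUES ('Example President', 'n/a', 'ExampleLand')")


def try_insert(sql):
    """Return True if the statement succeeds, False on a constraint violation."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False
```

Dropping UNIQUE would give the one-to-many case, and dropping NOT NULL would give partial participation (the "&" notation above).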
4. Is user supervision needed in schemas integration? Why or why not?
Answer:
• Data semantics define the relationships between data according to users’ data
requirements.
• Data semantics are represented in a database conceptual schema, such as an EER
model or a DTD Graph.
• Only relevant data can be integrated for an application.
• Data relevance depends on users’ data requirements for that application.
• Data consistency is the standardization of data domain values and formats.
Inconsistent data must be transformed before data integration.
• User supervision means user input about users’ data requirements, which
drive database design and schema integration.
As a result, user supervision is needed in schema integration in order to meet
users’ data requirements.
5. How to develop a data warehouse?
Answer:
1. Planning
2. Gathering Data Requirements and Modeling
3. Physical Database Design and Development
4. Data Mapping and Transformation
5. Data Extraction and Load
6. Automating the Data Management Process
7. Application Development: Creating the starter sets of reports
8. Data Validation and Testing
9. Training
10. Rollout
6. What are the differences among various star schemas?
Answer:
Simple star schema – Every component of the fact table's primary key is also a
foreign key to the primary key of a dimension table.
Multiple Fact Tables – More than one fact table is related to the same
dimension tables.
Outboard Tables – A dimension table contains a foreign key that references the
primary key of another dimension table.
Multi-Star schema – The fact table has its own primary key, which does not
consist only of references to dimension tables.
Snowflake schema – A star schema that stores all dimensional information in
third normal form.
7. How to implement various OLAP operations by use of SQL?
Answer:
Roll-up:
select sum(amount), area
from SALES
where (area='Kowloon') group by area
Drill-down:
select sum(amount), the_date
from SALES
where (the_date='2003-Dec-31')
or (the_date='2003-Dec-30')
or… …or (the_date='2003-Dec-2')
or (the_date='2003-Dec-1') group by the_date
Slice:
select sum(amount), storecode
from SALES
where (storecode='292') group by storecode
Dice:
select sum(amount), area
from SALES
where ( (area='HK') or (area='NT') or (area='Kowloon'))
and (the_date='2003-Dec-24')
group by area
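The four queries above can be run against a toy SALES table with Python's built-in sqlite3 module. This is a sketch only: the rows, the extra store codes, and the ISO date format (used here so that a date range can replace the long chain of "or" conditions) are assumptions, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (storecode TEXT, area TEXT, the_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO SALES VALUES (?, ?, ?, ?)",
    [   # illustrative rows
        ("292", "Kowloon", "2003-12-24", 100.0),
        ("292", "Kowloon", "2003-12-31", 50.0),
        ("293", "Kowloon", "2003-12-24", 30.0),
        ("101", "HK", "2003-12-24", 70.0),
        ("201", "NT", "2003-12-30", 20.0),
    ],
)


def query(sql, params=()):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


# Roll-up: aggregate individual sales up to the area level.
roll_up = query(
    "SELECT area, SUM(amount) FROM SALES WHERE area = ? GROUP BY area", ("Kowloon",))

# Drill-down: break the December total down to individual days.
drill_down = query(
    "SELECT the_date, SUM(amount) FROM SALES "
    "WHERE the_date BETWEEN ? AND ? GROUP BY the_date ORDER BY the_date",
    ("2003-12-01", "2003-12-31"))

# Slice: fix one value of one dimension (a single store).
slice_ = query(
    "SELECT storecode, SUM(amount) FROM SALES WHERE storecode = ? GROUP BY storecode",
    ("292",))

# Dice: restrict two dimensions at once (areas and date).
dice = query(
    "SELECT area, SUM(amount) FROM SALES "
    "WHERE area IN ('HK', 'NT', 'Kowloon') AND the_date = ? "
    "GROUP BY area ORDER BY area",
    ("2003-12-24",))
```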
8. How to implement various OLAP operations by use of MDX?
Answer:
Roll-up:
SELECT [SALES].[AMOUNT] ON COLUMNS,
[store].[Kowloon] ON ROWS
FROM SALES
Drill-down:
SELECT [SALES].[AMOUNT] ON COLUMNS,
{[time].[2003].[Q4].[Dec].[31],
[time].[2003].[Q4].[Dec].[30],… …,
[time].[2003].[Q4].[Dec].[2],
[time].[2003].[Q4].[Dec].[1]} ON ROWS FROM SALES
Slice:
SELECT [SALES].[AMOUNT] ON COLUMNS,
[store].[Kowloon].[292] ON ROWS FROM SALES
Dice:
SELECT [SALES].[AMOUNT] ON COLUMNS,
{[store].[HK], [store].[NT], [store].[Kowloon]} ON ROWS
FROM SALES WHERE [time].[2003].[Q4].[Dec].[24]
9. What is the definition and application of Apriori Algorithm?
Answer:
Apriori algorithm:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
We can apply the Apriori algorithm to cross-marketing of two different products,
that is, to estimate the probability that selling one product leads to selling
another product.
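The pseudocode above can be sketched in Python as follows. Assumptions: min_support is taken as an absolute transaction count, candidates are generated by joining frequent k-itemsets and pruning those with an infrequent k-subset, and the function names and the basket data in the usage note are illustrative.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        prev = set(current)
        # C(k+1): join two frequent k-itemsets, prune if any k-subset is infrequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in prev for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Count every candidate that is contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # L(k+1): candidates that reach min_support.
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the support counts."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]
```

For example, with the made-up baskets [{"bread", "milk"}, {"bread", "diaper", "beer"}, {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}, {"bread", "milk", "beer"}] and min_support = 3, the only frequent pair is {bread, milk}, and confidence(freq, {"bread"}, {"milk"}) gives the cross-selling probability of milk given bread.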
10. What is the definition and application of Frequent Pattern Tree Algorithm?
Answer:
Frequent Pattern Tree algorithm:
Step 1: Create a table of candidate data items in descending order of frequency
(support count).
Step 2: Build the Frequent Pattern Tree by inserting each event (transaction),
with its candidate data items sorted in that order.
Step 3: Link the table entries to the matching nodes in the tree.
We can apply the Frequent Pattern Tree algorithm in cross-product marketing.
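The three steps can be sketched in Python as below. This covers tree construction only, not the mining of patterns from the tree. The names FPNode, build_fp_tree and item_support, the absolute min_support count, and the alphabetical tie-break between equally frequent items are assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.next = None  # node-link to another node holding the same item


def build_fp_tree(transactions, min_support):
    # Step 1: table of frequent items in descending order of support.
    support = Counter(item for t in transactions for item in set(t))
    order = [item for item, count in
             sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))
             if count >= min_support]
    rank = {item: r for r, item in enumerate(order)}
    header = {item: None for item in order}
    root = FPNode(None, None)
    # Step 2: insert each transaction, items sorted in table order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Step 3: link the table entry to the new tree node.
                child.next = header[item]
                header[item] = child
            child.count += 1
            node = child
    return root, header, order


def item_support(header, item):
    """Follow the node-links of one item and add up the counts."""
    total, node = 0, header[item]
    while node is not None:
        total += node.count
        node = node.next
    return total
```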
11. What is the definition and application of Sequential Pattern?
Answer:
A sequential pattern is defined as an ordered set of pages that satisfies a given
support and is maximal (i.e. it is not contained in any other sequence that is
also frequent). In other words, a sequential pattern is an ordered set of web
pages browsed by a user in a session.
The support level of a sequential pattern is
(frequent forward-ordered occurrences of web pages X1, X2, …, Xn) / (each customer/user)
The application of sequential patterns is to locate a user’s web page browsing
access path from the home page forward in a website.
12. What is the definition and application of Maximal Frequent Forward
Sequence?
Answer:
A forward sequence is obtained by removing any backward traversals. Each raw
session is transformed into forward references (i.e. the backward traversals and
reloads/refreshes are removed), from which the traversal patterns are then mined
using improved level-wise algorithms.
The maximal frequent forward sequences are the sequence patterns that meet
the user’s support level requirement.
The application of maximal frequent forward sequences is to locate the “frequent”
access paths of users who browse web pages in a website.
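The transformation of a raw session into forward references can be sketched as follows. This is one possible reading of the description above, not necessarily the exact procedure used in the course: a page that already appears on the current path counts as a backward traversal, and an immediate repeat of the last page counts as a reload. The function name is illustrative.

```python
def maximal_forward_references(session):
    """Split one raw click sequence into its maximal forward references."""
    path, result, moving_forward = [], [], False
    for page in session:
        if path and page == path[-1]:
            continue  # reload/refresh: ignore
        if page in path:
            # Backward traversal: the path reached so far is maximal,
            # so emit it once and back up to the revisited page.
            if moving_forward:
                result.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward:
        result.append(list(path))
    return result
```

For an illustrative session A B C D C B E G H G W A O U O V, the function yields the forward references ABCD, ABEGH, ABEGW, AOU and AOV. Counting how often each sequence occurs across sessions then gives the support used to decide which ones are frequent.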
13. What is the definition and application of K-means algorithm?
Answer:
The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low. Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed as the clusters’ centroid or
center of gravity.
An application of the k-means algorithm is customer segmentation (e.g. identifying
the cluster of good customers) in customer relationship management.
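A minimal k-means sketch in plain Python is given below. Assumptions: the first k points are used as the initial centroids (real implementations usually pick them at random), similarity is measured by Euclidean distance, and the function name and max_iter parameter are illustrative.

```python
import math


def kmeans(points, k, max_iter=100):
    """Partition points (tuples of floats) into k clusters."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster: converged
        assignment = new_assignment
        # Recompute each cluster mean (the centroid).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment
```

With two obviously separated groups of 2-D points, such as three points near (1, 1) and three near (9, 9), the two returned centroids are the means of the two groups.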
14. What is the definition and application of K-Medoid algorithm?
Answer:
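The k-medoid algorithm partitions a set of n objects into k clusters like k-means,
but represents each cluster by a medoid, an actual object of the cluster, instead
of the mean of its objects. The related k-modes algorithm clusters categorical
data by replacing the means of clusters with modes, using new dissimilarity
measures to deal with categorical objects and a frequency-based method to update
the modes of clusters.
An application of the k-medoid algorithm is in customer relationship management (CRM).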
15. How to implement Genetic Algorithm with user supervision?
Answer:
Step 1: Initialize a population P of n elements as potential solutions.
Step 2: Until a specified termination condition is satisfied:
2a: Use a fitness function to evaluate each element of the current population. If an
element passes the fitness criteria, it remains in P.
2b: The population now contains m elements (m <= n). Use genetic operators to
create (n – m) new elements. Add the new elements to the population.
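The steps above can be sketched in Python as below. Assumptions: elements are bit strings, the user-supplied fitness function is where the supervision enters, the fitness criterion is "be in the top half of the population", the termination condition is a fixed number of generations, and the genetic operators are one-point crossover plus a 10% chance of flipping one bit. All names and parameter values are illustrative.

```python
import random


def genetic_algorithm(fitness, length, n=20, generations=50, seed=0):
    """Return (best element, best fitness seen at the start of each generation)."""
    rng = random.Random(seed)
    # Step 1: initial population P of n random bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    history = []
    for _ in range(generations):  # Step 2: fixed number of generations
        # 2a: evaluate every element; those passing the criterion remain in P.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: n // 2]
        # 2b: create the n - m missing elements with genetic operators.
        children = []
        while len(survivors) + len(children) < n:
            mom, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)      # one-point crossover
            child = mom[:cut] + dad[cut:]
            if rng.random() < 0.1:              # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history
```

For instance, genetic_algorithm(sum, 16) searches for the bit string with the most ones. Because survivors are carried over unchanged, the best fitness never decreases from one generation to the next.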
16. How to implement Genetic Algorithm without user supervision?
Answer:
A genetic algorithm can also be a powerful unsupervised clustering technique. For
example, given a set of candidate solutions, each a set of centroids for
clustering, we can apply a genetic algorithm, using crossover and mutation, to
locate the best solution, that is, the centroids with the minimum Euclidean
distance between the centroids and the data points in each cluster.
17. How to implement Decision Tree with a C4.5 Algorithm?
Answer:
Begin
Partition (S)
    If (all records in S are of the same class or only 1 record found in S)
        then return;
    For each attribute Ai do
        evaluate splits on attribute Ai;
    Use the best split found (the one with the most information gain) to
    partition S into S1 and S2, growing the tree by two branches;
    Partition (S1);
    Partition (S2);
    (partitioning repeats for S1 and S2 until the stop-growing criteria are met)
After the tree-growing phase, perform the tree-pruning phase by removing the
branches whose removal achieves a smaller prediction error rate;
End;
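The "evaluate splits" step can be illustrated with a small information gain calculation. This is a sketch for categorical attributes only and leaves out tree growing and pruning. The function names, the dictionary-per-record representation, and the attribute names in the usage note are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [label for r, label in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_split(rows, labels, attrs):
    """Pick the attribute whose split yields the most information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))
```

If the class is fully determined by a hypothetical attribute "a" and unrelated to "b", then information_gain is 1 bit for "a" and 0 for "b" on a balanced two-class sample, so best_split returns "a".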
18. What kind of user requirement is suitable for using Decision Tree in data
mining?
Answer:
The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for exploratory
knowledge discovery. Decision trees can handle high dimensional data. Their
representation of acquired knowledge in tree form is intuitive and generally easy
to assimilate by humans. The learning and classification steps of decision tree
induction are simple and fast. In general, decision tree classifications have good
accuracy.
19. How to implement neural network by using Backpropagation Algorithm?
Answer:
The procedure for training and testing a neural network is:
Step 1: Separate the dataset into training and testing sets.
Step 2: Select the neural network topology.
Step 3: Initialize the weights (W) and biases (θ).
Step 4: Training Phase:
Step 4.1: Propagate the inputs forward by computing the output values of the
hidden layers’ nodes and of the output node from the input nodes’ values
and the current weights.
Step 4.2-4.4: Backpropagation steps: update the weights and biases based on the
adjustments derived from the errors on the sample data.
Step 4.5: Compute the total error.
Step 4.6: Terminate processing if the maximum number of epochs over all sample
inputs is exceeded or the total error rate is less than a pre-determined level.
Repeat steps 4.1-4.5 until processing terminates.
Step 5: Testing Phase:
Test the derived weights and biases (from the training data) against the testing data.
Step 6: If the error rate is acceptable, then the derived weights and biases are the
pattern learned by the data mining; else repeat all three phases again.
20. Define various formulas and their justifications used in neural network?
Answer:
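Submit the training set and compute the layers’ responses:
$$O_j = \frac{1}{1 + e^{-I_j}}, \quad \text{where } I_j = \sum_i W_{ij} O_i + \theta_j$$
Update the weights of the output layer:
$$W_{ij} = W_{ij} + \Delta W_{ij}, \quad \text{where } \Delta W_{ij} = \eta E_j O_i, \quad E_j = O_j (1 - O_j)(T_j - O_j)$$
Update the weights of the hidden layer:
$$W_{ij} = W_{ij} + \Delta W_{ij}, \quad \text{where } \Delta W_{ij} = \eta E_j O_i, \quad E_j = O_j (1 - O_j) \sum_k E_k W_{jk}$$
Update the biases:
$$\theta_j = \theta_j + \Delta\theta_j, \quad \text{where } \Delta\theta_j = \eta E_j$$
Their justification:
[Figure: network fragment with nodes i and j, weights Wkj and Wji, and outputs Yk, Yi, Yj]
Yi and Yj = outputs of nodes i and j
$$\Delta W_{ji} = -\eta \frac{dE}{dW_{ji}} = -\eta \, \frac{dE}{dy_i} \, \frac{dy_i}{dS_i} \, \frac{dS_i}{dW_{ji}}$$
where η = learning rate, S_i = input to the transfer function of node i, E = error
$$\frac{dy_i}{dS_i} = \frac{d}{dS_i}\left(\frac{1}{1 + e^{-S_i}}\right) = \left(1 - \frac{1}{1 + e^{-S_i}}\right)\left(\frac{1}{1 + e^{-S_i}}\right) = (1 - y_i)\, y_i$$
where e ≈ 2.718
$$\frac{dS_i}{dW_{ji}} = y_j$$
$$\frac{dE}{dy_i} = \frac{d}{dy_i}\left(\frac{1}{2}\sum_m (d_m - y_m)^2\right) = -(d_i - y_i)$$
Therefore
$$\Delta W_{ji} = \eta \, y_j (1 - y_i)(d_i - y_i)\, y_i = \eta E_i y_j, \quad \text{where } E_i = y_i (1 - y_i)(d_i - y_i)$$
is the new error term. This has the same form as the output-layer update
ΔWij = η Ej Oi given above, with node i here playing the role of the output node
and d_i the target value.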
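The update formulas can be checked numerically with one training step for a tiny network. This is a sketch assuming a single hidden layer and a single output node; the function names, the in-place list representation of weights and biases, and the starting values are illustrative and not part of the original answer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(x, target, w_ih, b_h, w_ho, b_o, eta=0.5):
    """One forward and backward pass; updates weights/biases in place.

    w_ih[i][j]: weight from input i to hidden node j
    w_ho[j]:    weight from hidden node j to the single output node
    Returns the network output computed before the update.
    """
    # Forward: O_j = 1 / (1 + e^(-I_j)), I_j = sum_i W_ij O_i + theta_j
    hidden = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(len(x))) + b_h[j])
              for j in range(len(b_h))]
    out = sigmoid(sum(w_ho[j] * hidden[j] for j in range(len(hidden))) + b_o[0])
    # Output error: E_j = O_j (1 - O_j)(T_j - O_j)
    e_out = out * (1 - out) * (target - out)
    # Hidden error: E_j = O_j (1 - O_j) sum_k E_k W_jk (using the old weights)
    e_hid = [hidden[j] * (1 - hidden[j]) * e_out * w_ho[j]
             for j in range(len(hidden))]
    # Updates: delta W_ij = eta E_j O_i and delta theta_j = eta E_j
    for j in range(len(hidden)):
        w_ho[j] += eta * e_out * hidden[j]
        b_h[j] += eta * e_hid[j]
        for i in range(len(x)):
            w_ih[i][j] += eta * e_hid[j] * x[i]
    b_o[0] += eta * e_out
    return out


# Illustrative starting values for a 2-2-1 network.
w_ih = [[0.1, -0.2], [0.3, 0.4]]
b_h = [0.0, 0.0]
w_ho = [0.2, -0.1]
b_o = [0.0]
first = train_step([1.0, 0.0], 1.0, w_ih, b_h, w_ho, b_o)
for _ in range(200):
    last = train_step([1.0, 0.0], 1.0, w_ih, b_h, w_ho, b_o)
```

Repeating the step on the same sample moves the output from its first value towards the target 1.0, which is the behaviour the derivation predicts.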
21. What is the best approach of doing data conversion and why?
Answer:
The best approach to data conversion is Logical Level Translation, because this
approach does not need to deal with physical data type conversion, which can be
handled by the source schema and the target schema. If a data type is not
defined in the target schema, one can also use Customer Data Type
conversion for this special case.
22. How to translate relational schema into XML DTD with validation?
Answer:
One-to-one cardinality
[Figure: schema translation from an EER Model (Entity A with attributes A1, A2,
related 1:1 through relationship R to Entity B with attributes B1, B2) to a DTD
Graph (Element A with attributes A1, A2 and child Element B with attributes B1, B2)]
Relational Schema
Relation A(A1, A2)
Relation B(B1, B2, *A1)
DTD
<!ELEMENT A (B)>
<!ATTLIST A A1 CDATA #REQUIRED>
<!ATTLIST A A2 CDATA #REQUIRED>
<!ELEMENT B EMPTY>
<!ATTLIST B B1 CDATA #REQUIRED>
<!ATTLIST B B2 CDATA #REQUIRED>
One-to-many cardinality
[Figure: schema translation from an EER Model (Entity A with attributes A1, A2,
related 1:n through relationship R to Entity B with attributes B1, B2) to a DTD
Graph (Element A with attributes A1, A2 and child Element B, occurrence *, with
attributes B1, B2)]
Relational Schema
Relation A(A1, A2)
Relation B(B1, B2, *A1)
DTD
<!ELEMENT A (B)*>
<!ATTLIST A A1 CDATA #REQUIRED>
<!ATTLIST A A2 CDATA #REQUIRED>
<!ELEMENT B EMPTY>
<!ATTLIST B B1 CDATA #REQUIRED>
<!ATTLIST B B2 CDATA #REQUIRED>
Many-to-many cardinality
[Figure: schema translation from an EER Model (Entity A with attributes A1, A2,
related m:n through relationship R to Entity B with attributes B1, B2) to a DTD
Graph (Element A with attributes A1, A2, A_id; Element R with attributes A_idref,
B_idref; Element B with attributes B1, B2, B_id)]
Relational Schema
Relation A(A1, A2)
Relation B(B1, B2)
Relation R(*A1, *B1)
DTD
<!ELEMENT A EMPTY>
<!ATTLIST A A1 CDATA #REQUIRED>
<!ATTLIST A A2 CDATA #REQUIRED>
<!ATTLIST A A_id ID #REQUIRED>
<!ELEMENT R EMPTY>
<!ATTLIST R A_idref IDREF #REQUIRED>
<!ATTLIST R B_idref IDREF #REQUIRED>
<!ELEMENT B EMPTY>
<!ATTLIST B B1 CDATA #REQUIRED>
<!ATTLIST B B2 CDATA #REQUIRED>
<!ATTLIST B B_id ID #REQUIRED>
The justification of the translations is:
The preservation of the data dependencies between the foreign keys and their
referenced primary keys before and after schema translation.
For example, in many-to-many cardinality, we can preserve the inclusion
dependencies (ID):
ID (in relational schema): R.A1 ⊆ A.A1
R.B1 ⊆ B.B1
ID (in XML schema):
R.A_idref ⊆ A.A_id
R.B_idref ⊆ B.B_id
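The inclusion dependencies in the XML schema can be checked on a sample document with Python's built-in xml.etree.ElementTree. This is a sketch only: the standard library parser does not validate against a DTD, so the check is done by hand, and the root element db, the function name, and the sample attribute values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def inclusion_holds(root, ref_elem, ref_attr, id_elem, id_attr):
    """Check that every ref_elem.ref_attr value occurs as some id_elem.id_attr."""
    ids = {e.get(id_attr) for e in root.iter(id_elem)}
    refs = {e.get(ref_attr) for e in root.iter(ref_elem)}
    return refs <= ids


# Hypothetical document; the wrapping <db> element is not part of the DTD above.
sample = """
<db>
  <A A1="a1" A2="x" A_id="A.a1"/>
  <B B1="b1" B2="y" B_id="B.b1"/>
  <R A_idref="A.a1" B_idref="B.b1"/>
</db>
"""
root = ET.fromstring(sample)
a_ok = inclusion_holds(root, "R", "A_idref", "A", "A_id")   # R.A_idref within A.A_id
b_ok = inclusion_holds(root, "R", "B_idref", "B", "B_id")   # R.B_idref within B.B_id
```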