CS5483 review questions: 1. How to differentiate between HTML and XML? Answer: • The tags used to markup HTML documents and the structure of HTML documents are predefined. • The author of HTML documents can only use tags that are defined in the HTML standard. • XML allows the author to define his own tags and his own document structure. As a result, XML allows programming, but HTML cannot. XML is a supplement to HTML such that it is not a replacement for HTML. It will be used to structure and describe the Web data while HTML will be used to format and display the same data. XML can also store data insider HTML documents (Data Islands)., and can be used to exchange and store data. 2. How to implement cardinality data semantic in DTD? Answer: one-to-one cardinality DTDGraph Element A Element B A1 A2 B1 B2 One-to-many cardinality DTD <!ELEMENT A(B)> <!ATTLISTAA1CDATA#REQUIRED> <!ATTLISTAA2CDATA#REQUIRED> <!ELEMENT BEMPTY> <!ATTLISTBB1CDATA#REQUIRED> <!ATTLISTBB2CDATA#REQUIRED> 3. How to define various data semantics in relational database? Answer: One-to-one cardinality / one-to-many cardinality / total participation Relation President (President_name, Race, *Nation_name) Relation Nation (Nation_name, Nation_size) Where underlined are primary keys and "*" prefixed are foreign keys Many-to-many cardinality Relation Student (Student_id, Student_name) Relation Course (Course_id, Course_name) Relation take (*Student_id, *Course_id) Is-a relationship Relation Male (Name, Height) Relation Father (*Name, Birth_date) Disjoint generalization / overlap generalization: Relation Boat_person (Name, Birth_date, Birth_place) Relation Refugee (*Name, Open_center) Relation Non-refugee (*Name, Detention_center) Categorization Relation Department (Borrower_card, Department_id) Relation Doctor (Borrower_card, Doctor_name) Relation Hospital (Borrower_card, Hospital_name) Relation Borrower (*Borrower_card, Return_date, File_id) Aggregation: Relation Student (Student_no, Student_name) Relation Course (Course_no, Course_name) Relation Takes (*Student_no, *Course_no, *Instructor_name) Relation Instructor (Instructor_name, Department) Partial participation: Relation Department (Department_id, Department_name) Relation Employee (Employee_no, Employee_name, &Department_id) Where & means that null value is allowed Weak entity: Relation Hotel (Hotel_name, Ranking) Relation Room (*Hotel_name, Room_no, Room_size) N-ary relationship: Relation Engineer (Employee_id, Employee_name) Relation Skill (Skill_name, Years_experience) Relation Project (Project_id, Start_date, End_date) Relation Skill_used (*Employee_id, *Skill_name, *Project_id) 4. Is user supervision needed in schemas integration? Why or why not? Answer: • Data semantics define the relationships between data for users’ data requirements. • Data semantics are presented in database conceptual schema such as EER model and DTD Graph. • Only relevant data can be integrated for an application. • Data relevance depends on users’ data requirements for an application. • Data consistency are the standard of data domain value and format. Inconsistent data must be transformed before data integration. • User supervision means users input for users’ data requirements, and which are for database design and schema integration As a result, user supervision is needed in schema integration in order to meet users’ data requirements. 5. How to develop a data warehouse? Answer: Planning 2. Gathering Data Requirements and Modeling 3. Physical Database Design and Development 4. Data Mapping and Transformation 5. Data Extraction and Load 6. Automating the Data Management Process 7. Application Development-Creating the starter sets 8. Data Validation and Testing 9. Training 10. Rollout of reports 6. What are the differences among various star schemas? Answer: Simple star schema – All primary keys in the fact tables are also foreign keys to the primary keys of dimension tables. Multiple Fact Tables – More than one fact tables are related to the same dimension tables. Outboard Tables – Dimension tables contain a foreign key that references the primary key in another dimension table. Multi-Star schema – The fact table in the simple star schema has its own primary keys without referencing dimension tables. Snowflaked schema – Snowflake schema is a star schema which stores all dimensional information in third normal form. 7. How to implement various OLAP operations by use of SQL? Answer: Roll-up: select sum(amount), area from SALES where (area='Kowloon') group by area Drill-down: select sum(amount), the_date from SALES where (the_date='2003-Dec-31') or (the_date='2003-Dec-30') or… …or (the_date='2003-Dec-2') or (the_date='2003-Dec-1') group by the_date Slice: select sum(amount), storecode from SALES where (storecode='292') group by storecode Dice: select sum(amount), area from SALES where ( (area='HK') or (area='NT') or (area='Kowloon')) and (the_date='2003-Dec-24') group by area 8. How to implement various OLAP operations by use of MDX? Roll-up: SELECT [SALES].[AMOUNT] ON COLUMNS, [store].[Kowloon] ON ROWS FROM SALES Drill-down: SELECT [SALES].[AMOUNT] ON COLUMNS, [time].[2003].[Q4].[Dec].[31], [time].[2003].[Q4].[Dec].[30],… …, [time].[2003].[Q4].[Dec].[2], [time].[2003].[Q4].[Dec].[1] ON ROWS FROM SALES Slice: SELECT [SALES].[AMOUNT] ON COLUMNS, [store].[Kowloon].[292] ON ROWS FROM SALES Dice: SELECT [SALES].[AMOUNT] ON COLUMNS, [store].[HK],[store].[NT], [store].[Kowloon] ON ROWS FROM SALES WHERE [time].[2003].[Q4].[Dec].[24] 9. What is the definition and application of Apriori Algorithm? Answer: Apriori algorithm: Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=∅; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return ∪k Lk; We can apply Apriori algorithm to find out the cross marketing sell of two different products, that is, the probability of selling a product leads selling of another product. 10. What is the definition and application of Frequent Pattern Tree Algorithm? Answer: Frequent Pattern Tree algorithm: Step 1: Create a table of candidate data items in descending order. Step 2: Build the Frequent Pattern Tree according to each event of the candidate data items. Step 3: Link the table with the tree. We can apply Frequent Pattern Tree algorithm in cross products marketing. 11. What is the definition and application of Sequential Pattern? Answer: A sequential pattern is defined as an ordered set of pages that satisfies a given support and is maximal (i.e. it has no subsequence that is also frequent). In other words, sequential pattern is the ordered set of web pages browsed by a user in a session. The support level of sequential patterns is Frequent forward ordering web pages occurrences of X1, X2…Xn Each Customer/User The application of sequential pattern is to locate an user’s web pages browsing access path from the home page forward in a website. 12. What is the definition and application of Maximal Frequent Forward Sequence? Answer: Forward sequences is to remove any backward traversals. Each raw session is transformed into forward reference (i.e. remove the backward traversals and reloads/refreshes), from which the traversal patterns are then mined using improved level-wise algorithms. The maximal frequent forward sequence is the sequence patterns that can meet the users support level requirements. The application of maximal frequent forward sequence is to locate the “frequent” access pattern path of a user who browses web pages in a website. 13. What is the definition and application of K-means algorithm? Answer: The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the clusters’ centroid or center of gravity. The application of k-means algorithm can be classification of good customers in customer relationship management. 14. What is the definition and application of K-Medoid algorithm? Answer: K-Medoid algorithm cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update modes of clusters. An application of k-Meodid algorithm is in CRM. 15. How to implement Genetic Algorithm with user supervision? Answer: Step 1: Initialize a population P of n elements as a potential solution. Step 2: Until a specified termination condition is satisfied: 2a: Use a fitness function to evaluate each element of the current solution. If an element passes the fitness criteria, it remains in P. 2b: The population now contains m elements (m <= n). Use genetic operators to create (n – m) new elements. Add the new elements to the population. 16. How to implement Genetic Algorithm without user supervision? Answer: Genetic algorithm can also be a powerful unsupervised clustering technique. For example, given a set of solutions of centroids for clustering, we can apply genetic algorithm by using crossover and mutation to locate the best solution, that is, the centroids with minimum Euclidean distance between centroids and data points in each cluster. 17. How to implement Decision Tree with a C4.5 Algorithm? Answer: Begin Partition (S) If (all records in S are of the same class or only 1 record found in S) then return; For each attribute Ai do evaluate splits on attribute Ai; Use best split found to partition S into S1 and S2 to grow a tree with two Partition (S1) and Partition (S2) which has the most information gain. Repeat partitioning for Partition (S1) and (S2) until it meets tree stop growing criteria; After growing tree phase, perform pruning tree phase by removing branches that can achieve a smaller prediction error rate; End; 18. What kind of user requirement is suitable for using Decision Tree in data mining? Answer: The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle high dimensional data. Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifications have good accuracy. 19. How to implement neural network by using Backpropagation Algorithm? Answer: The procedure of processing neural network is: Step 1: Separate dataset into training and testing sets. Step 2: Select the Neural Network Topology Step 3: Initialize the weights (W) and biases (Φ). Step 4: Training Phase: Step 4.1: Propagate the inputs forward by computing output of each hidden layers’ nodes values and output node value according to the input nodes’ values and initial weights.. Step 4.2-4.4: Backpropagate steps: updating Weights and biases based on the derived adjusted weights and biases for sample data matches Step 4.5: Computer total error Step 4.6: Terminate processing if exceeding Epoch run for all sample inputs or total error rate is less than pre-determined level Repeat step 4.2-4.5 until processing terminates Step 5: Testing Phase: Test the derived weights and biases (from the training data) against testing data. Step 6: If error rate is acceptable, then the derived weights and biases is the correct pattern from the data mining; else repeat all three phases again. 20. Define various formulas and their justifications used in neural network? Answer: Submit training set & compute layers’ responses Oj=1 / (1 + e –Ij; )of which Ij =Σi Wij Oi +θj Update the weights of output layer: Wij = Wij + ∆Wij of which ∆Wij = (η) Ej Oi ; Ej = Oj (1 - Oj )( Tj - Oj ) Update the weights of the hidden layer: Wij = Wij + ∆Wij of which ∆Wij = (η) Ej Oi ; Ej = Oj (1 - Oj ) Σk Ek Wjk Update the biases: θj = θj + ∆θj Their justification: of which ∆θj = (η) Ej Wkj Wji i j Yk Yi Output Yi Yj Yi and Yj = output of note i & j Wji = -ŋ (dE) = - ŋ (dE) (dyi) (dSi) dW dyi dSi dWji Where ŋ = learning rate, S = transfer function, E = error rate dyi = d ( 1 ) ( 1 ) = (1 – yi) yi dSi dSi 1 + e-Si Where e = 2.718 dSi = yj dWji dE =d Therefore dyi dyi = (1 - 1 + e-Si Σ (dm – ym)2) (1 1 = ) 1 + e-Si - (di – yi) 2 Wji = ŋ yj (1-yi) (di-yi) yi = ŋ Ej yi where Ej = yj (1-yi) (di-yi) as a new error rate 21. What is the best approach of doing data conversion and why? Answer: The best approach of doing data conversion is by using Logical Level Translation because this approach does not need to deal with physical data type conversion which can be handled by the source schema and target schema. In case of data type not defined in the target schema, one can also use Customer Data Type conversion for this special case. 22. How to translate relational schema into XML DTD with validation? Answer: One-to-one cardinality EER Model DTD Graph A1 A2 Entity A Element A A1 A2 1 R 1 Schema B1 B2 Entity B Element B Translation B1 B2 Relational Schema DTD Relation A(A1, A2) Relation B(B1, B2, *A1) <!ELEMENT A(B)> <!ATTLIST A A1 CDATA #REQUIRED> <!ATTLIST A A2 CDATA #REQUIRED> <!ELEMENT B EMPTY> <!ATTLIST B B1 CDATA #REQUIRED> <!ATTLIST B B2 CDATA #REQUIRED> One-to-many cardinality DTD Graph EER Model Entity A A1 A2 Element A A1 A2 1 * R Schema n Entity B B1 B2 Translation Element B B1 B2 Relational Schema DTD Relation A(A1, A2) Relation B(B1, B2, *A1) <!ELEMENT A(B)*> <!ATTLIST A A1 CDATA #REQUIRED> <!ATTLIST A A2 CDATA #REQUIRED> <!ELEMENT B EMPTY> <!ATTLIST B B1 CDATA #REQUIRED> <!ATTLIST B B2 CDATA #REQUIRED> Many-to-many cardinality EER Model Entity A DTD Graph A1 A2 m R A1 n Entity B A2 B1 B2 Relation A(A1, A2) Relation B(B1, B2) Relation R(*A1, *B1) Element R Element B A_id B2 B_id Schema Translation Relational Schema B1 Element A A_idref B_idref DTD <!ELEMENT A EMPTY> <!ATTLIST A A1 CDATA #REQUIRED> <!ATTLIST A A2 CDATA #REQUIRED> <!ATTLIST A A_id ID #REQUIRED> <!ELEMENT R EMPTY> <!ATTLIST R A_idref IDREF #REQUIRED> <!ATTLIST R B_idref IDREF #REQUIRED> <!ELEMENT B EMPTY> <!ATTLIST B B1 CDATA #REQUIRED> <!ATTLIST B B2 CDATA #REQUIRED> <!ATTLIST B B_id ID #REQUIRED> The justification of the translations are: The preservation of data dependencies between the foreign keys and their referred primary key before and after schema translation. For example, in many to many cardinality, we can preserve their inclusion dependency (ID): ID (in relational schema): R.A1 ⊆ A.A1 R.B1 ⊆ B.B1 ID (in XML schema): R.A_idref ⊆ A.A_id R.B_idref ⊆ B.A_id