STAT Lab 7: The Good, the Bad and the Ugly

Setting: In the Define phase it is likely that a number of decision blocks were identified on the Process Map. These decision blocks can direct product flow depending on a product evaluation, and an incorrect assessment can add significant cost or cause customer dissatisfaction. In the Measure phase the ability to make a product evaluation is assessed. If variable data measurements are possible, a Gage R&R Study is appropriate to evaluate the gaging process. However, many decisions result in attribute data such as good/bad or return/process (nominal data), or excellent/good/fair/marginal/bad (ordinal data). Thus, Measurement System Analysis (MSA) studies must also address attribute data. For example, in a bank the check processing operators must decide whether a check can be processed or must be returned due to errors such as no signature, unreadable writing, or differing numeric and written amounts. In this lab you are evaluating the penny processing department of a penny production operation. Old pennies are inspected by assessors and a good/bad determination is made. Are the assessors repeatable in this determination? Do assessors make accurate classifications? Bad pennies are melted and reprocessed; good pennies are sent into circulation.

Learning Outcome: Students will be able to conduct an Attribute Measurement System Study. They will understand how to measure within- and between-appraiser variation and how this assessment can be used for process improvement and cost reduction.

Tools: All MSA studies must address operational definitions of the characteristics being evaluated. An operational definition for attribute data clearly defines what is being evaluated and how to make a classification. Frequently, examples of products in each state are used at the location where classifications are made. If good operational definitions exist, different appraisers should classify products in the same manner. For example, a credit card processor must determine whether a bill should be processed if the store's name is unreadable, the date is incorrect, etc. The decision to process or return the bill must be based on criteria established in an operational definition for "when to return a bill" (i.e., the definition of a defective bill). An Attribute Measurement System Study is conducted with the process "as is," but improvement opportunities, such as revising operational definitions, are noted.

An Attribute Study starts by selecting the number of appraisers/operators (m), samples/parts (n) and trials/evaluations (r) in which each appraiser reclassifies all samples. First, select m ≥ 2 appraisers who normally make assessment evaluations. Do not provide special instructions on how to make the assessments for the study; the normal measurement process is what is being evaluated (i.e., the "as is" process). Second, select n samples that ideally span the range of operations and capture long-term variation. Third, determine an expert assessor or assessors to evaluate each sample; this assessment is called the "reference value" for a sample. Finally, select the number of times each appraiser will reclassify all samples (r). For reasonable precision of estimates, it is desirable that n ≥ 20 and r ≥ 2. Label the samples to be evaluated. Randomly submit samples to each appraiser for evaluation, and evaluate all samples in random order for each trial. Appraisers should not compare results, and no appraiser should be aware of their evaluations from previous trials.
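The randomization described above can be scripted if desired. The following is a minimal sketch, assuming m appraisers, n labeled samples, and r trials as defined in this section; the function name, default values, and output format are illustrative only and are not part of the lab procedure.

```python
# Minimal sketch of a randomized run order for an attribute MSA study.
# Assumes m appraisers, n labeled samples, and r trials, as defined above.
# The function name and output layout are illustrative, not a required format.
import random

def run_order(m=2, n=40, r=2, seed=None):
    rng = random.Random(seed)
    order = []  # (appraiser, trial, sample) triples in presentation order
    for appraiser in range(1, m + 1):
        for trial in range(1, r + 1):
            samples = list(range(1, n + 1))
            rng.shuffle(samples)  # fresh random order of all samples for every trial
            order.extend((appraiser, trial, s) for s in samples)
    return order

# Example: print the order for a small study (m=2 appraisers, n=5 samples, r=2 trials).
for appraiser, trial, sample in run_order(m=2, n=5, r=2, seed=1):
    print(f"Appraiser {appraiser}, Trial {trial}: evaluate sample {sample}")
```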
The data for an Attribute Measurement Study with m = 2 appraisers and a two-outcome (0 or 1) evaluation can be summarized in a frequency table:

                        Operator B
                     B=0     B=1     Total
Operator A   A=0      a       b       a+b
             A=1      c       d       c+d
             Total   a+c     b+d       N

where N = a + b + c + d = nr, the total number of paired evaluations. The case for two appraisers and two possible outcomes (0 or 1) is presented here. Cohen (1960, Educational and Psychological Measurement) considers extensions. Also, Fleiss (1981, Statistical Methods for Rates and Proportions, Wiley) gives the case for more than two outcomes and appraisers; the user should consider Kendall's coefficient in that case. Using AIAG notation, define

po = observed proportion of evaluations where the appraisers agree = (a + d)/N.

pe = expected proportion of agreement if the appraisers assigned values independently (by chance) = Pr(A=0 and B=0) + Pr(A=1 and B=1) = [(a+b)/N] x [(a+c)/N] + [(c+d)/N] x [(b+d)/N].

Minitab Commands: Stat>Quality Tools>Attribute Agreement Analysis. In the Options box check Cohen's kappa. Data should be arranged in stacked columns: Sample, Appraiser, Trial, Value (0,1) and Reference. For any 2x2 table, kappa can be obtained by checking the kappa box under Other Stats in Stat>Tables>Cross Tabulation.

A) Repeatability Measure

The study is conducted using repeat trials with the same parts. For each trial it is essential that the appraiser be unaware of the previous assessment determinations. With these data, it is possible to determine the within-appraiser repeatability. Minitab refers to this analysis as "Within Appraiser" and reports the number and percentage of parts where all trials agree:

% Appraiser Repeatability = 100 x (Number of parts where all trials agree)/n

Differences between appraisers indicate the best and worst performers in terms of repeatability; in either case, better operational definitions or training could be useful. Larger sample sizes (n) are desirable since confidence intervals are important for determining significant differences.

B) Reproducibility Measure

Using po and pe, Cohen proposed the kappa (κ) statistic

Interrater Agreement = κ = (po - pe)/(1 - pe).

When κ = 0, appraisers agree no more than would be expected by chance. When κ = 1, appraisers have perfect agreement. AIAG and ASQ suggest the following conclusions be drawn from kappa:

kappa        Assessment of Appraiser Agreement
κ ≥ 0.9      Excellent agreement, no action required
κ ≥ 0.75     Good agreement, improvement possible
κ ≤ 0.4      Poor agreement, improvement essential
κ ≤ 0        Agreement less than by chance

Since kappa quantifies appraiser agreement, it is a measure of reproducibility. It is desirable to report the 2x2 frequency table with kappa for every pair of appraisers. In Minitab the 2x2 tables can be prepared several ways. One approach starts from the stacked data format with columns Sample, Appraiser, Trial, Assessment Value and Reference Value. To create tables that use the nr trials to make appraiser A vs. appraiser B or appraiser X vs. Reference comparisons, unstack the data to create the format

Sample   Trial    A    B    C    Reference
  1        1      .    .    .       .
  2        1      .    .    .       .
  ...
  n        1      .    .    .       .
  1        2      .    .    .       .
  2        2      .    .    .       .
  ...
  n        2      .    .    .       .
  etc.

where the A, B, C columns hold the 0/1 assessment values. The 2x2 tables can then be obtained from Stat>Tables>Cross Tabulation after unstacking the original data (Data>Unstack Columns, with subscripts in Appraiser). In the Other Stats box, check kappa. When considering all appraisers as a system, the mnr assessment values and reference values will need to be stacked.
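The kappa calculation defined above can be verified by hand or with a short script. The following minimal sketch assumes a 2x2 table with cells a, b, c, d laid out as in the Operator A vs. Operator B frequency table above; the function name is illustrative. As a check, the appraiser A vs. appraiser B table given in Question 1(b) (a=44, b=3, c=6, d=97) reproduces the kappa of 0.862944 reported there.

```python
# Minimal sketch of Cohen's kappa for a 2x2 table with cells a, b, c, d as in
# the Operator A vs. Operator B frequency table above. Illustrative only;
# Minitab's Attribute Agreement Analysis / Cross Tabulation report the same statistic.

def cohen_kappa_2x2(a, b, c, d):
    n_pairs = a + b + c + d                  # N = nr paired evaluations
    po = (a + d) / n_pairs                   # observed proportion of agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n_pairs ** 2  # chance agreement
    return (po - pe) / (1 - pe)

# Check against the appraiser A vs. appraiser B table in Question 1(b).
print(round(cohen_kappa_2x2(44, 3, 6, 97), 6))  # expected: 0.862944
```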
C) Accuracy Measure

The study is conducted using repeat trials of the n samples. Comparison of the assessments to the reference standard enables a measure of appraiser accuracy. AIAG (2002, p. 132) defines the percentage agreement with the reference as

% Appraiser Effectiveness = 100 x (Number of samples where all trials agree with the reference)/n

This percentage is reported for each appraiser. Minitab refers to this analysis as "Each Appraiser vs Standard" and reports the percentage of samples where all trials agree with the reference. The 2x2 table of appraiser vs. reference provides the data to compute kappa; this value of kappa assesses the agreement of an appraiser with the reference. Also, the 2x2 tables provide measures of an appraiser's misclassifications: the "Miss Rate" (classifying a part as 1 when the reference is 0) and the "False Alarm Rate" (classifying a part as 0 when the reference is 1). The following criteria are suggested to evaluate the above percentages:

Decision                  Effectiveness   Miss Rate   False Alarm Rate
Appraiser acceptable         ≥ 90%          ≤ 2%           ≤ 5%
Appraiser unacceptable       < 80%          > 5%           > 10%

D) System Measures

The above measures consider each individual appraiser. It is also meaningful to consider all appraisers collectively as the process "gage," which defines the gaging system. AIAG (p. 131) suggests

System % Effectiveness Score = 100 x (Number of samples where all appraisers completely agree)/n

This percentage is reported in Minitab in the "Between Appraisers" section. There are two ways to measure the system's ability to classify accurately:

System % Effectiveness Score vs. Reference = 100 x (Number of samples where all trials for all appraisers agree with the reference)/n

Minitab reports this percentage in the "All Appraisers vs. Standard" section. A 2x2 table for this case would appear as:

                     Reference
                     0       1
Appraisers    0
              1

where the total number of samples classified is n. In this system table, the diagonal frequencies are the samples where all appraisers on all trials agree with the reference. The kappa statistic for this table is a measure of the uniformity of appraiser agreement with the reference.

Alternatively, all trials for all appraisers can be compared to the reference. The total frequencies in this 2x2 table sum to mnr. The diagonal agreements require only that an appraiser agree with the reference, not necessarily with the other appraisers. Here the kappa statistic is a measure of system accuracy. A summary Effectiveness table can be used to collect the accuracy measures:

Table 1. System and Appraiser Comparison to Reference Standard

Appraiser   kappa   Effectiveness   Miss Rate   False Alarm Rate
A
B
C
System
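The accuracy measures in Sections C and D can also be checked by hand. The sketch below is illustrative only: the data layout (a list of 0/1 trial values per sample) and the miss/false-alarm denominators (all trials on reference-bad and reference-good parts, respectively) are assumptions, since the exact denominators are not defined above; confirm them against the AIAG manual before reporting.

```python
# Illustrative sketch of the Section C accuracy measures, assuming 0 = bad and
# 1 = good as in this lab. The data layout, function name, and the miss/false
# alarm denominators (all trials on reference-bad / reference-good parts) are
# assumptions for illustration, not definitions taken from AIAG.

def accuracy_measures(assessments, reference):
    """assessments: {sample: [0/1 value per trial]}; reference: {sample: 0 or 1}."""
    n = len(reference)
    # Effectiveness: samples where every trial agrees with the reference value.
    agree_all = sum(all(v == reference[s] for v in vals)
                    for s, vals in assessments.items())
    # Flatten to (assessed value, reference value) pairs, one per trial.
    calls = [(v, reference[s]) for s, vals in assessments.items() for v in vals]
    misses = sum(v == 1 for v, ref in calls if ref == 0)        # bad part called good
    false_alarms = sum(v == 0 for v, ref in calls if ref == 1)  # good part called bad
    bad_trials = sum(ref == 0 for _, ref in calls)
    good_trials = sum(ref == 1 for _, ref in calls)
    return {
        "% Effectiveness": 100 * agree_all / n,
        "% Miss Rate": 100 * misses / bad_trials,
        "% False Alarm Rate": 100 * false_alarms / good_trials,
    }

# Tiny made-up example: 3 samples, 2 trials each, for one appraiser.
reference = {1: 0, 2: 1, 3: 1}
appraiser_a = {1: [0, 1], 2: [1, 1], 3: [1, 1]}
print(accuracy_measures(appraiser_a, reference))
```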
Instructions: Students will form groups of 2-4 people. Each student should perform the Minitab analysis; students are encouraged to help others. While not evaluating pennies, one student should assist the appraiser with the logistics of the appraising process and the other students should work on the lab questions below. During the last 10 minutes of class, students should brainstorm answers to the questions in their groups. Written answers should be a student's own work.

Lab procedure:

1) The penny factory requires you to inspect the n = 40 pennies given to your group. If your process rejects too many pennies, there is added cost in reprocessing good pennies. If you fail to reject a bad penny, customers are unhappy and other penny suppliers will take your customers. You have not yet done a QFD matrix, so you are not entirely sure which characteristics are important to your customers. However, the accepted standard in your facility is that a penny is bad (coded as 0) if the date is not easily readable.

2) Your materials consist of two bags, one holding the 40 pennies to classify. That bag also contains a sealed envelope with an expert evaluator's assessment of each penny. Do NOT open the envelope until the lab is completed. The expert's assessments will serve as the reference value for each penny.

3) There will be a minimum of 2 appraisers; depending on time, there may be up to 4 appraisers. Each appraiser will complete 2 trials (one "do over") of evaluating the 40 pennies.

4) Appraiser 1, Trial 1: Randomly select a penny from the bag and record a 1 (good) or 0 (bad) for it. Each penny should have a number on it; record this number along with your evaluation (0 or 1). Take up to 15 seconds per penny, completing both trials within 10 minutes. After recording a 0 or 1 for a penny, place the penny in the second bag.

5) Appraiser 1, Trial 2: Mix the second bag, which now holds the 40 pennies that have been evaluated once. Repeat the evaluation process, but record your results on a NEW sheet of paper. Results from your first trial should not influence Trial 2.

6) Once an appraiser completes both trials, select another appraiser and go to Step 4.

Questions: Review the syllabus criteria for format. Validate a few Minitab calculations by hand. Turn in answers next week in class.

1) A go/no-go attribute gage needs to be evaluated since the customer is receiving rejects from a capable process (AIAG p. 127). Obtain the data from STAT8140 Lab7 Data, Sheet Q1. Three appraisers (m=3) randomly evaluate n=50 samples that span the specification range.

(a) Reorganize the data into a standard format with rows = Samples and columns Appraiser A (Trials 1, 2, 3), Appraiser B (Trials 1, 2, 3), Appraiser C (Trials 1, 2, 3), Reference.

(b) Report the 2x2 tables (appraiser A vs. appraiser B, etc., and appraiser A vs. reference, etc.). These tables should appear from Minitab as:

Rows: A   Columns: B

        0      1    All
0      44      3     47
1       6     97    103
All    50    100    150

Cell Contents: Count
Kappa: 0.862944

Rows: A   Columns: Reference

        0      1    All
0      45      3     48
1       5     97    102
All    50    100    150

Cell Contents: Count
Kappa: 0.878788

(c) Repeatability: Assess appraiser repeatability. Run the Minitab Attribute Agreement Analysis.

(d) Reproducibility: Assess appraiser reproducibility. Compute and interpret the kappa statistics.

(e) Accuracy: Assess appraiser accuracy.

(f) Combine the data for all appraisers vs. the reference into one "system" 2x2 table with frequencies totaling mnr, as shown below. Compute and interpret the kappa statistic.

Rows: Value   Columns: Reference

        0      1    All
0     132     16    148
1      12    290    302
All   144    306    450

Cell Contents: Count
Kappa: 0.858070

(g) Prepare an Effectiveness table as shown in Table 1 and interpret the results.

(h) Prepare a one-paragraph management summary with recommendations. Suggest possible next steps to improve the process and reduce cost.

2) Analyze the penny data following the Question 1 (a) – (h) steps. Use the kappa statistic where appropriate.
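For cross-checking the Minitab tables in Question 1 (and the penny data in Question 2) outside Minitab, the following sketch shows one way to unstack the study data and build a 2x2 appraiser comparison table with pandas. The file name and the column names (Sample, Appraiser, Trial, Value, Reference) are assumptions based on the stacked layout described in the Minitab Commands section; adjust them to match the actual workbook. The kappa function from the earlier sketch can then be applied to the four cell counts.

```python
# Hypothetical sketch for Question 1(a)-(b): unstack the stacked study data and
# build a 2x2 appraiser-vs-appraiser table. The CSV name and column names are
# assumptions; export Sheet Q1 of the workbook to CSV (or use pd.read_excel).
import pandas as pd

df = pd.read_csv("lab7_q1_stacked.csv")  # assumed columns: Sample, Appraiser, Trial, Value, Reference

# (a) Standard format: one row per sample, one column per (Appraiser, Trial), plus Reference.
wide = df.pivot_table(index="Sample", columns=["Appraiser", "Trial"], values="Value")
wide["Reference"] = df.groupby("Sample")["Reference"].first()
print(wide.head())

# (b) 2x2 table of appraiser A vs. appraiser B, pooling all nr trials.
a = df[df["Appraiser"] == "A"].sort_values(["Sample", "Trial"])["Value"].to_numpy()
b = df[df["Appraiser"] == "B"].sort_values(["Sample", "Trial"])["Value"].to_numpy()
print(pd.crosstab(a, b, rownames=["A"], colnames=["B"]))
```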
© Copyright 2024