STAT Lab 7: The Good, the Bad and the Ugly

Setting: In the Define phase it is likely that a number of decision blocks were identified on the Process
Map. These decision blocks can direct product flow depending on a product evaluation. An incorrect
assessment can add significant cost or customer dissatisfaction. In the Measure phase the ability to
make a product evaluation is assessed. If variable data measurements are possible, a Gage R&R Study is
appropriate to evaluate the gaging process. However, many decisions result in attribute data such as
good/bad, return/process (nominal data) or excellent/good/fair/marginal/bad (ordinal data). Thus,
Measurement System Analysis (MSA) studies must also address attribute data. For example, in a bank
the check processing operators must decide whether a check can be processed or must be returned due
to errors such as no signature, an unreadable check, or differing numeric and written amounts. In this lab you are
evaluating the penny processing department of a penny production operation. Old pennies are
inspected by assessors and a good/bad determination made. Are the assessors repeatable in this
determination? Do assessors make accurate classifications? Bad pennies are melted and reprocessed;
good pennies are sent into circulation.
Learning Outcome: Students will be able to conduct an Attribute Measurement System Study. They will
understand how to measure within and between appraiser variation and how this assessment can be
used for process improvement and cost reduction.
Tools: All MSA studies must address operational definitions of the characteristics being evaluated. An
operational definition for attribute data clearly defines what is being evaluated and how to make a
classification. Frequently, examples of products in each state are used at the location where
classifications are made. If good operational definitions exist, different appraisers should classify
products in the same manner. For example, a credit card processor must determine if a bill should be
processed if the store’s name is unreadable, the date is incorrect, etc. The decision to process or return
the bill must be based on criteria established in an operational definition for “when to return a bill” (i.e.,
a definition of a defective bill). An Attribute Measurement System Study is conducted with the process
“as is,” but improvement opportunities, such as revising operational definitions, should be noted.
An Attribute Study starts by selecting the number of appraisers/operators (m), samples/parts (n)
and trials/evaluations (r) to reclassify all samples. First, select m≥2 appraisers who normally make
assessment evaluations. Do not provide special instructions on how to make the assessments for the
study. The normal measurement process is what is being evaluated (i.e., the “as is” process). Second,
select n samples that ideally span the range of operations and capture long-term variation. Third,
determine an expert assessor or assessors to evaluate each sample. This assessment is called a
“reference value” for a sample. Finally, select the number of times each assessor will reclassify all
samples (r). For reasonable precision of estimates, it is desirable that n≥20 and r≥2. Label the samples
to evaluate. Randomly submit samples to each appraiser for evaluation. Evaluate all samples in random
order for each trial. Appraisers should not compare results and no appraiser should be aware of their
evaluations from previous trials.
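As an aside, the randomized presentation order can be prepared in advance. The short Python sketch below (Python and the variable names are illustrative only; they are not part of the AIAG procedure) generates an independent random order of the labeled samples for each appraiser and trial:

    import random

    n_samples = 20                 # n labeled samples (n >= 20 recommended)
    appraisers = ["A", "B"]        # m >= 2 appraisers who normally make the assessments
    trials = 2                     # r >= 2 repeat evaluations per appraiser

    # Build one independent random presentation order per appraiser per trial
    # so no appraiser sees the samples in the same order twice.
    run_orders = {}
    for appraiser in appraisers:
        for trial in range(1, trials + 1):
            order = list(range(1, n_samples + 1))
            random.shuffle(order)
            run_orders[(appraiser, trial)] = order

    for (appraiser, trial), order in sorted(run_orders.items()):
        print(f"Appraiser {appraiser}, Trial {trial}: {order}")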
The data for an Attribute Measurement Study for m=2 appraisers in a two outcome (0 or 1)
evaluation results in a frequency table:

                              Operator B
                          B=0       B=1       Total
    Operator A    A=0      a         b         a+b
                  A=1      c         d         c+d
                  Total   a+c       b+d         N
where N = a+b+c+d is the total number of paired evaluations (nr when the n samples are each evaluated
on r trials). The case for two appraisers and two possible outcomes (0 or 1) is presented here. Cohen
(1960, Educational and Psychological Measurement) considers extensions. Also, Fleiss (1981, Statistical Methods for Rates and Proportions,
Wiley) gives the case for more than two outcomes and appraisers. The user should consider Kendall’s
coefficient in this case.
Using AIAG notation, define

    po = observed proportion of evaluations where appraisers agree
       = (a + d)/N

    pe = expected proportion of evaluations where appraisers assigned values
         independently (by chance)
       = Pr(A=0 and B=0) + Pr(A=1 and B=1)
       = [(a + b)/N] x [(a + c)/N] + [(c + d)/N] x [(b + d)/N].
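As a minimal numerical sketch of these two quantities (the cell counts below are hypothetical, not data from this lab):

    # Hypothetical cell counts from the 2x2 frequency table above
    a, b, c, d = 18, 2, 3, 17      # a: A=0,B=0   b: A=0,B=1   c: A=1,B=0   d: A=1,B=1
    N = a + b + c + d              # total number of paired evaluations

    p_o = (a + d) / N              # observed proportion of agreement
    p_e = ((a + b) / N) * ((a + c) / N) + ((c + d) / N) * ((b + d) / N)  # chance agreement

    print(f"po = {p_o:.4f}, pe = {p_e:.4f}")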
Minitab Commands: Stat>Quality Tools>Attribute Agreement Analysis. In Options box check Cohen’s
kappa. Data should be arranged in stacked columns: Sample, Appraiser, Trial, Value (0,1) and Reference.
For any 2x2 table, to get kappa check the kappa box in Other Stats from Stat>Tables>Cross Tabulation.
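For illustration, a few rows of data in this stacked layout might look like the sketch below (pandas is used only to display a hypothetical worksheet; the values are made up):

    import pandas as pd

    # Hypothetical stacked worksheet: one row per (sample, appraiser, trial) evaluation
    stacked = pd.DataFrame({
        "Sample":    [1, 1, 1, 1, 2, 2, 2, 2],
        "Appraiser": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "Trial":     [1, 2, 1, 2, 1, 2, 1, 2],
        "Value":     [1, 1, 1, 0, 0, 0, 0, 0],
        "Reference": [1, 1, 1, 1, 0, 0, 0, 0],
    })
    print(stacked)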
A) Repeatability Measure
The study is conducted using repeat trials with the same parts. For each trial it is essential that the
appraiser be unaware of the previous assessment determinations. With this data, it is possible to
determine the within appraiser repeatability. Minitab refers to this analysis as “Within Appraiser” and
reports the number and percentage of parts where all trials agree:
% Appraiser Repeatability = 100 x (Number of parts where all trials agree)/n
Differences between appraisers indicate which appraisers are best and worst in terms of repeatability. In either
case, better operational definitions or training could be useful. Larger sample sizes (n) are desirable
since confidence intervals are important for determining whether differences are significant.
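A minimal sketch of the within-appraiser calculation, assuming one appraiser's 0/1 calls are stored per sample across the r trials (the data are hypothetical):

    # Hypothetical results for one appraiser: sample number -> list of r trial calls (0/1)
    calls = {
        1: [1, 1],
        2: [0, 0],
        3: [1, 0],   # the two trials disagree on this sample
        4: [0, 0],
        5: [1, 1],
    }

    n = len(calls)
    matched = sum(1 for trial_calls in calls.values() if len(set(trial_calls)) == 1)
    pct_repeatability = 100 * matched / n   # % of samples where all trials agree
    print(f"% Appraiser Repeatability = {pct_repeatability:.1f}%")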
B) Reproducibility Measure
Using po and pe, Cohen proposed a kappa (κ) statistic
Interrater Agreement = κ = (po - pe)/(1 - pe).
When κ = 0 appraisers agree no more than would be expected by chance. When κ = 1 appraisers have
perfect agreement. AIAG and ASQ suggest the following conclusions be drawn from kappa:
    kappa         Assessment of Appraiser Agreement
    κ ≥ 0.9       Excellent agreement, no action required
    κ ≥ 0.75      Good agreement, improvement possible
    κ ≤ 0.4       Poor agreement, improvement essential
    κ ≤ 0         Agreement less than by chance
Since kappa quantifies appraiser agreement, it is a measure of reproducibility. It is desirable to report
the 2x2 frequency table with kappa for every pair of appraisers.
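A minimal sketch of the kappa calculation for one pair of appraisers, assuming their 0/1 calls are listed in the same sample-and-trial order (hypothetical data):

    # Hypothetical paired calls, one entry per (sample, trial) combination
    A = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
    B = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]

    # Build the 2x2 cell counts a, b, c, d defined earlier
    a = sum(1 for x, y in zip(A, B) if x == 0 and y == 0)
    b = sum(1 for x, y in zip(A, B) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(A, B) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)
    N = a + b + c + d

    p_o = (a + d) / N
    p_e = ((a + b) / N) * ((a + c) / N) + ((c + d) / N) * ((b + d) / N)
    kappa = (p_o - p_e) / (1 - p_e)
    print(f"kappa = {kappa:.3f}")

The counts a, b, c, and d are the same quantities displayed in the Cross Tabulation output, so a hand-computed kappa can be checked against Minitab's value.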
In Minitab the 2x2 tables can be prepared several ways. One approach starts from the stacked
data format with columns Sample, Appraiser, Trial, Assessment Value and Reference Value. To create
tables that use the nr trials to make appraiser A vs. appraiser B or appraiser X vs. Reference comparisons,
unstack the data to create the format
    Sample   Trial      A              B              C          Reference
      1        1     Values (0,1)      .              .              .
      2        1        .              .              .              .
      .        .        .              .              .              .
      n        1        .              .              .              .
      1        2        .              .              .              .
      2        2        .              .              .              .
      .        .        .              .              .              .
      n        2        .              .              .              .
     etc.
The 2x2 tables can be obtained from Stat>Tables>Cross Tabulation after unstacking the original data
(Data>Unstack Columns> subscripts in Appraiser). In the Other Stats box, check kappa. When
considering all appraisers as a system, the mnr values and reference values will need to be stacked.
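Outside Minitab, the same unstacking can be sketched with pandas (column names follow the stacked layout described above; the data are hypothetical):

    import pandas as pd

    # Hypothetical stacked data: one row per (sample, appraiser, trial)
    stacked = pd.DataFrame({
        "Sample":    [1, 1, 1, 1, 2, 2, 2, 2],
        "Appraiser": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "Trial":     [1, 2, 1, 2, 1, 2, 1, 2],
        "Value":     [1, 1, 1, 0, 0, 0, 0, 0],
        "Reference": [1, 1, 1, 1, 0, 0, 0, 0],
    })

    # Unstack: one row per (Sample, Trial), one 0/1 column per appraiser
    wide = stacked.pivot_table(index=["Sample", "Trial"], columns="Appraiser",
                               values="Value", aggfunc="first").reset_index()

    # Carry the reference value along (one reference per sample)
    reference = stacked.groupby("Sample")["Reference"].first()
    wide["Reference"] = wide["Sample"].map(reference)

    # 2x2 appraiser A vs. appraiser B table over the nr trials, as in Cross Tabulation
    print(pd.crosstab(wide["A"], wide["B"], margins=True))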
C) Accuracy Measure
The study is conducted using repeat trials of the n samples. Comparison of assessments to the
reference standard enables a measure of appraiser accuracy. AIAG (2002 p.132) defines the quantity to
measure the percentage agreement with the reference:
% Appraiser Effectiveness = 100 x (Number of samples where all trials agree with the reference)/n
This percentage is reported for each appraiser. Minitab refers to this analysis as “Each Appraiser vs
Standard” and reports the percentage of samples where all trials agree with the standard.
The 2x2 table appraiser vs. reference provides the data to compute kappa. This value of kappa
assesses the agreement of an appraiser with the reference. Also, the 2x2 tables provide a measure of an
appraiser’s failure to detect a category. This is called either a “Miss Rate” (classifying a part as 1 when it
is 0 from the reference) or a “False Alarm Rate” (classifying a part as 0 when it is 1 from the reference).
The following criteria are suggested to evaluate the above percentages:

    Decision                   Effectiveness    Miss Rate    False Alarm Rate
    Appraiser Acceptable          ≥ 90%           ≤ 2%            ≤ 5%
    Appraiser Unacceptable        < 80%           > 5%           > 10%
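A minimal sketch of these appraiser-vs.-reference percentages, assuming each sample's reference value and one appraiser's trial calls are available (hypothetical data, with 0 = bad and 1 = good; the miss and false alarm rates here are computed per individual trial decision on parts of the relevant reference category):

    # Hypothetical data for one appraiser: sample -> (reference value, list of r trial calls)
    # Coding: 0 = bad, 1 = good
    data = {
        1: (1, [1, 1]),
        2: (0, [0, 0]),
        3: (0, [1, 1]),   # bad part called good on both trials (misses)
        4: (1, [0, 1]),   # good part called bad on one trial (a false alarm)
        5: (1, [1, 1]),
    }
    n = len(data)

    # % Effectiveness: samples where every trial call agrees with the reference
    effective = sum(1 for ref, calls in data.values() if all(c == ref for c in calls))
    pct_effectiveness = 100 * effective / n

    # Miss rate: "good" calls made on bad parts; false alarm rate: "bad" calls on good parts
    bad_calls  = [c for ref, calls in data.values() if ref == 0 for c in calls]
    good_calls = [c for ref, calls in data.values() if ref == 1 for c in calls]
    miss_rate        = 100 * sum(c == 1 for c in bad_calls) / len(bad_calls)
    false_alarm_rate = 100 * sum(c == 0 for c in good_calls) / len(good_calls)

    print(f"Effectiveness = {pct_effectiveness:.0f}%, Miss Rate = {miss_rate:.0f}%, "
          f"False Alarm Rate = {false_alarm_rate:.0f}%")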
D) System Measures
The above measures consider each individual appraiser. It is meaningful to consider all appraisers
collectively as the process “gage” which defines the gaging system. AIAG (p.131) suggests
System % Effectiveness Score = 100 x (Number of samples where all appraisers completely agree)/n.
This percentage is reported in Minitab in the “Between Appraisers” section.
There are two ways to measure the system’s ability to classify accurately.
System % Effectiveness Score vs. Reference = 100 x (Number of samples where all trials for all appraisers agree with the reference)/n
Minitab reports this percentage in the “All Appraisers vs. Standard” section. A 2x2 table for this case
would appear as:
                       Reference
                      0        1
    Appraisers   0
                 1
where the total number of samples classified is n. In this system table, the diagonal frequencies count
samples where all appraisers on all trials agree with the reference. The kappa statistic for this table is a
measure of the uniformity of appraiser agreement with the reference.
Alternatively, all trials for all appraisers can be compared to the reference. The total frequencies in
the 2x2 table sum to mnr. The diagonal agreements require only that an appraiser agree with the
reference, but not necessarily with other appraisers. Here the kappa statistic is a measure of the system
accuracy.
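A minimal sketch of the system-level scores and the pooled kappa versus the reference, assuming all appraisers' calls for every trial are collected per sample (hypothetical data):

    # Hypothetical system data: sample -> (reference value, {appraiser: [trial calls]})
    data = {
        1: (1, {"A": [1, 1], "B": [1, 1], "C": [1, 1]}),
        2: (0, {"A": [0, 0], "B": [0, 1], "C": [0, 0]}),
        3: (1, {"A": [1, 1], "B": [1, 1], "C": [0, 1]}),
        4: (0, {"A": [0, 0], "B": [0, 0], "C": [0, 0]}),
    }
    n = len(data)

    # Between appraisers: every appraiser gives the same call on every trial
    agree = sum(1 for ref, calls in data.values()
                if len({c for trial_calls in calls.values() for c in trial_calls}) == 1)
    pct_between = 100 * agree / n

    # All appraisers vs. standard: every call on the sample matches the reference
    agree_ref = sum(1 for ref, calls in data.values()
                    if all(c == ref for trial_calls in calls.values() for c in trial_calls))
    pct_vs_reference = 100 * agree_ref / n
    print(f"System % Effectiveness = {pct_between:.0f}%, vs. Reference = {pct_vs_reference:.0f}%")

    # Pooled mnr calls vs. reference: kappa here measures overall system accuracy
    pairs = [(c, ref) for ref, calls in data.values()
             for trial_calls in calls.values() for c in trial_calls]
    n00 = sum(1 for call, ref in pairs if call == 0 and ref == 0)
    n01 = sum(1 for call, ref in pairs if call == 0 and ref == 1)
    n10 = sum(1 for call, ref in pairs if call == 1 and ref == 0)
    n11 = sum(1 for call, ref in pairs if call == 1 and ref == 1)
    N = n00 + n01 + n10 + n11
    p_o = (n00 + n11) / N
    p_e = ((n00 + n01) / N) * ((n00 + n10) / N) + ((n10 + n11) / N) * ((n01 + n11) / N)
    kappa = (p_o - p_e) / (1 - p_e)
    print(f"Pooled system kappa vs. reference = {kappa:.3f}")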
A summary Effectiveness table can be used to collect the accuracy measures:
Table 1. System and Appraiser Comparison to Reference Standard
    Appraiser    kappa    Effectiveness    Miss Rate    False Alarm Rate
    A
    B
    C
    System
Instructions: Students will form groups of 2-4 people. Each student should perform the Minitab analysis.
Students are encouraged to help others. While not evaluating pennies, each student should attempt to
answer the lab questions below. One student should assist the appraiser with the logistics of the
appraising process. During the last 10 minutes of class, students should brainstorm answers to the
questions in their groups. Written answers should be each student's own work. Lab procedure:
1) The penny factory requires you to inspect the n=40 pennies given to your group. If your
process rejects too many pennies, there is added cost in reprocessing ok pennies. If you fail
to reject a bad penny, customers are unhappy and other penny suppliers will take your
customers. You have not yet done a QFD matrix so you are not entirely sure what
characteristics are important to your customers. However, the accepted standard in your
facility is that a penny is bad (coded as 0) if the date is not easily readable.
2) Your materials consist of two bags, one having the 40 pennies to classify. This bag also contains
a sealed envelope with an expert evaluator’s assessment of each penny. Do NOT open
the envelope until the lab is completed. The expert’s assessments will serve as the reference
value for every penny.
3) There will be a minimum of 2 appraisers. Depending on time, there may be up to 4
appraisers. Each appraiser will complete 2 trials (one “do over”) of evaluating the 40 pennies.
4) Appraiser 1, Trial 1: randomly select a penny from the bag and record a 1 (good) or 0 (bad) for
each penny. Each penny should have a number on it, record this number and your evaluation
(0 or 1). Take up to 15 seconds per penny, completing both trials within 10 minutes. After
recording a 0 or 1 for a penny, place the penny in a different bag.
5) Appraiser 1, Trial 2: Mix the second bag, now having 40 pennies that have been evaluated
once. Repeat the evaluation process, but record your results on a NEW sheet of paper.
Results from your first trial should not influence Trial 2.
6) Once an appraiser completes both trials, select another appraiser. Go to Step 4.
Questions: Review syllabus criteria for format. Validate a few Minitab calculations by hand. Turn in
answers next week in class.
1) A go/no go attribute gage needs to be evaluated since the customer is receiving rejects from
a capable process (AIAG p.127). Obtain data from STAT8140 Lab7 Data in Sheet Q1. Three
appraisers (m=3) randomly evaluate n=50 samples that span the specification range.
(a) Reorganize the data into a standard format with rows = Samples and columns = Appraiser A (Trials
1,2,3), Appraiser B (Trials 1,2,3), Appraiser C (Trials 1,2,3), Reference.
(b) Report 2x2 tables (appraiser A vs. appraiser B, etc. and appraiser A vs. reference, etc.)
These tables should appear from Minitab as:
Rows: A    Columns: B

           0       1     All
    0     44       3      47
    1      6      97     103
    All   50     100     150

Cell Contents: Count
Kappa 0.862944

Rows: A    Columns: Reference2

           0       1     All
    0     45       3      48
    1      5      97     102
    All   50     100     150

Cell Contents: Count
Kappa 0.878788
(c) Repeatability: Assess appraiser repeatability. Run Minitab Attribute Agreement Analysis.
(d) Reproducibility: Assess appraiser reproducibility. Compute and interpret kappa statistics.
(e) Accuracy: Assess appraiser accuracy.
(f) Combine data for all appraisers vs reference into one “system” 2x2 table with frequencies
totaling mnr as shown below. Compute and interpret the kappa statistic.
Rows: Value    Columns: Reference

            0       1     All
    0     132      16     148
    1      12     290     302
    All   144     306     450

Cell Contents: Count
Kappa 0.858070
(g) Prepare an Effectiveness table as shown in Table 1 and interpret the results.
(h) Prepare a one paragraph management summary with recommendations. Suggest possible
next steps to improve the process and reduce cost.
2) Analyze the penny data following steps (a)–(h) of question 1. Use the kappa statistic where
appropriate.