1
What Is Measurement?
Overview
This chapter covers the basics of measurement. It aims to provide an
understanding of measurement and of the measure development procedures currently discussed in the literature. Measurement error is introduced, and the
steps in the measure development process and empirical procedures to assess
error are described. The chapter uses many examples based on data to illustrate measurement issues and procedures. Treatment of the material is at an
intuitive, nuts-and-bolts level with numerous illustrations.
The chapter is not intended to be a statistics primer; rather, it provides
sufficient bases from which to seek out more advanced treatment of empirical procedures. It should be noted that many issues are involved in using
the statistical techniques discussed here appropriately. The discussion in this
chapter aims to provide an introduction to specific statistical techniques, and
the references cited here provide direction for in-depth reading.
What Is Measurement Error?
Measurement is essential to empirical research. Typically, a method used to
collect data involves measuring many things. Understanding how any one
thing is measured is central to understanding the entire research method.
Scientific measurement has been defined as “rules for assigning numbers to
objects in such a way as to represent quantities of attributes” (e.g., Nunnally,
1978, p. 3). Measurement “consists of rules for assigning symbols to objects
so as to (1) represent quantities of attributes numerically (scaling) or
(2) define whether the objects fall in the same or different categories with
respect to a given attribute (classification)” (Nunnally & Bernstein, 1994,
p. 3).1 The attributes of objects, as well as people and events, are the underlying concepts that need to be measured. This element of the definition of
measurement highlights the importance of finding the most appropriate
attributes to study in a research area. This element also emphasizes understanding what these attributes really mean, that is, fully understanding the
underlying concepts being measured. Rules refer to everything that needs to
be done to measure something, whether measuring brain activity, attitude
toward an object, organizational emphasis on research and development, or
stock market performance. Therefore, these rules include a range of things
that occur during the data collection process, such as how questions are
worded and how a measure is administered. Numbers are central to the definition of measurement for several reasons: (a) Numbers are standardized and
allow communication in science, (b) numbers can be subjected to statistical
analyses, and (c) numbers are precise. But underneath the façade of precise,
analyzable, standardized numbers is the issue of accuracy and measurement
error.
The very idea of scientific measurement presumes that there is a thing
being measured (i.e., an underlying concept). A concept and its measurement
can be distinguished. A measure of a concept is not the concept itself, but
one of several possible error-filled ways of measuring it.2 A distinction can
be drawn between conceptual and operational definitions of concepts.
A conceptual definition describes a concept in terms of other concepts
(Kerlinger, 1986; Nunnally, 1978). For instance, stock market performance
is an abstract notion in people’s minds. It can be defined conceptually in
terms of growth in value of stocks; that is, by using other concepts such as
value and growth. An operational definition describes the operations that
need to be performed to measure a concept (Kerlinger, 1986; Nunnally,
1978). An operational definition is akin to rules in the definition of measurement discussed earlier in the chapter, and refers to everything that needs
to be done to measure something. The Dow Jones average is one measure
of stock market performance. This operational definition involves tracking
the stock value of a specific set of companies. It is by no means a perfect
measure of stock market performance. It is one error-filled way of measuring the concept of stock market performance.
The term construct is used to refer to a concept that is specifically defined
for scientific study (Kerlinger, 1986). In Webster’s New World Dictionary,
construct means “to build, form or devise.” This physical meaning of the
word construct is similar to the scientific meaning: Constructs are concepts
devised or built to meet scientific specifications. These specifications include
precisely defining the construct, elaborating on what it means, and relating
it to existing research. Words that are acceptable for daily conversation
would not fit the specifications for science in terms of clear and precise
definitions. “I am going to study what people think of catastrophic events”
is a descriptive statement that may be acceptable at the preliminary stages
of research. But several concepts in this statement need precise description,
such as “think of,” which may separate into several constructs, and “catastrophe,” which has to be distinguished from other descriptors of events.
Scientific explanations are essentially words, and some of these words relate
to concepts. Constructs are words devised for scientific study.3 In science,
though, these words need to be used carefully and defined precisely.
When measuring something, error is any deviation from the “true” value,
whether it is the true value of the amount of cola consumed in a period of time,
the level of extroversion, or the degree of job satisfaction.4 Although this true
value is rarely known, particularly when measuring psychological variables,
this hypothetical notion is useful to understand measurement error inherent in
scientific research. Such error can have a pattern to it or be “all over the
place.” Thus, an important distinction can be drawn between consistent (i.e.,
systematic) error and inconsistent (i.e., random) error (Appendix 1.1). This
distinction highlights two priorities in minimizing error. One priority is to
achieve consistency,5 and the second is to achieve accuracy.
Simple explanations of random and systematic error are provided below,
although subsequent discussions will introduce nuances. Consider using a
weighing machine in a scenario where a person’s weight is measured twice
in the space of a few minutes with no apparent change (no eating and no
change in clothing). If the weighing machine shows different readings, there
is random error in measurement. In other words, the error has no pattern to
it and is inconsistent. Alternatively, the weighing machine may be off in one
direction, say, consistently showing a reading that is 5 pounds higher than
the accurate value. In other words, the machine is consistent across readings
with no apparent change in the weight being measured. Such error is called
systematic error because there is a consistent pattern to it. It should be noted,
though, that on just one reading, the nature of error is not clear. Even if the
true value is independently known through some other method, consistency
still cannot be assessed in one reading. Multiple readings suggest the inconsistent or consistent nature of any error in the weighing machine, provided
the weight of the target person has not changed. Similarly, repetition either
across time or across responses to similar items clarifies the consistent or
inconsistent nature of error in empirical measurement, assuming the phenomenon across time is unchanged.
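To make the distinction concrete, the following sketch simulates the weighing machine example (the specific numbers are arbitrary and purely illustrative): repeated readings with random error scatter unpredictably around the true value, whereas readings with systematic error are consistent but consistently off.

import numpy as np

rng = np.random.default_rng(0)
true_weight = 150.0  # hypothetical true value, in pounds

# Random error: readings scatter around the true value with no pattern.
random_readings = true_weight + rng.normal(0, 2.0, size=5)

# Systematic error: every reading is consistently about 5 pounds too high.
systematic_readings = true_weight + 5.0 + rng.normal(0, 0.1, size=5)

print(random_readings.round(1))      # inconsistent values around 150
print(systematic_readings.round(1))  # consistent values near 155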
If the weighing machine is “all over the place” in terms of error (i.e.,
random), conclusions cannot be drawn about the construct being measured.
Random error in measures attenuates relationships (Nunnally, 1978); that
is, it restricts the ability of a measure to be related to other measures. The
phrase “all over the place” is, in itself, in need of scientific precision, which
will be provided in subsequent pages. Random error has to be reduced
before proceeding with any further analyses. This is not to suggest no
random error at all, but just that such error should be reasonably low.
Certainly, a viable approach is to collect multiple observations of such readings in the hope that the random errors average out. Such an assumption
may work when random errors are small in magnitude and when the measure actually captures the construct in question. The danger here, though, is
that the measure may not capture any aspect of the intended construct.
Therefore, the average of a set of inaccurate items remains inaccurate.
Measures of abstract concepts in the social sciences, such as attitudes, are
not as clear-cut as weighing machines. Thus, it may not be clear if, indeed,
a construct, such as an attitude, is being captured with some random error
that averages out. The notion that errors average out is one reason for using
multiple items, as discussed subsequently.
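A brief simulation sketch (with invented parameters) illustrates both points from the preceding paragraphs: random error in a single item attenuates its correlation with another variable, and averaging several items that do capture the construct recovers much, though not all, of that correlation. If the items did not capture the construct at all, no amount of averaging would help.

import numpy as np

rng = np.random.default_rng(1)
n = 10_000
construct = rng.normal(0, 1, n)                       # hypothetical true scores
criterion = 0.6 * construct + rng.normal(0, 0.8, n)   # a variable related to the construct

item = construct + rng.normal(0, 1.5, n)              # one item with substantial random error
items = construct[:, None] + rng.normal(0, 1.5, (n, 10))
scale = items.mean(axis=1)                            # average of ten such items

print(np.corrcoef(construct, criterion)[0, 1])  # correlation of the construct itself with the criterion
print(np.corrcoef(item, criterion)[0, 1])       # attenuated by random error
print(np.corrcoef(scale, criterion)[0, 1])      # averaging recovers much of the correlation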
Measures that are relatively free of random error are called reliable
(Nunnally, 1978). There are some similarities between the use of reliability
in measurement and the common use of the term. For example, a person
who is reliable in sticking to a schedule is probably consistently on time. A
reliable friend is dependable and predictable, can be counted on, and is consistent. However, there are some differences as well. A reliable person who
is consistent but always 15 minutes late would still be reliable in a measurement context. Stated in extreme terms, reliability in measurement actually
could have nothing to do with accuracy in measurement, because reliability
relates only to consistency. Without some degree of consistency, the issue of
accuracy may not be germane. If a weighing machine is all over the place,
there is not much to be said about its accuracy. Clearly, there are shades of
gray here in that some small amount of inconsistency is acceptable. Waiting
for perfect consistency before attempting accuracy may not be as efficient or
pragmatic as achieving reasonable consistency and approximate accuracy.
Perfect consistency may not even be possible in the social sciences, given the
inherent nature of phenomena being studied.
Whether a measure is accurate or not is the realm of validity, or the
degree to which a measure is free of random and systematic error (Nunnally,
1978). If the weighing machine is consistent, then whether it is accurate
becomes germane. If the weighing machine is consistent but off in one
direction, it is reliable but not valid. In other words, there is consistency
in the nature of error. To say a measure is accurate begs the question,
“Accurate in doing what?” Here, validity refers to accuracy in measuring
the intended construct.6
Random and systematic errors are the two types of errors that are pervasive in measurement across the sciences. Low reliability and low validity
have consequences. Random error may reduce correlations between a measure and other variables, whereas systematic error may increase or reduce
correlations between two variables. A brief overview of the measure development process is provided next. The purpose here is to cover the basic
issues briefly to provide a background for more in-depth understanding of
types of measurement error. However, the coverage is markedly different
from treatment elsewhere in the literature. Presentation is at a level to enable
intuitive understanding. Statistical analyses are presented succinctly with
many illustrations and at an intuitive level.
Overview of Traditional Measure Development Procedures
A number of steps have been suggested in the measure development
process (Churchill, 1979; Gerbing & Anderson, 1988) that are adapted
here and discussed.7 As in many stepwise processes, these steps are often
blurred and iterative. The process is illustrated in Figure 1.1. The steps in
the process emphasize that traversing the distance from the conceptual to
the operational requires a systematic process. Rather than considering a concept and moving directly to item generation and use of a resulting measure,
the distance between the conceptual and the operational has to be spanned
carefully and iteratively.
Conceptual and Operational Definitions
The very idea of measurement suggests an important distinction between a
concept and its measurement. Hence, the literature distinguishes between
conceptual and operational definitions of constructs (i.e., concepts specifically designed for scientific study) (Kerlinger, 1986). Scientific research is
about constructing abstract devices; nevertheless, the term construct is
similar in several respects to constructing concrete things. In developing
Figure 1.1   Steps in the Measure Development Process (concept; domain delineation; measure design and item generation; reliability testing; dimensionality testing; validity testing; validation)
constructs, the level of abstraction is an important consideration. If constructs are too concrete, then they are not as useful for theoretical generalization, although their measurement may be more direct. If constructs are
too abstract, their direct measurement is difficult, although such constructs
can be used for developing medium-level constructs that are measurable. For
example, Freud’s concepts of id and superego may not be directly measurable (not easily, anyway), yet they can be used to theorize and derive
medium-level constructs. On the other hand, a construct, such as response
time or product sales, is more concrete in nature but lacks the explanatory
ability of a more abstract construct. Response time is often of interest in cognitive psychology in that it is an indicator of an underlying construct, such
as cognitive effort. By itself, response time may not have the same level of
theoretical importance.
Figure 1.2   Distances Between Conceptual and Operational Domain for Physical and Psychological Constructs and Measures (the concepts of preference for numerical information [PNI], weight, and length in the conceptual domain, and the corresponding measurements of PNI, weight, and length in the operational domain)
As the abstractness of a construct increases, the distance between the
conceptual and the operational definitions increases (Figure 1.2). Other than
a few dimensions such as length (for short distances, not for astronomical
ones!) and time (at least in recent times, an issue that will be revisited in a
subsequent chapter), the operational definition or measure of a construct is
indirect (e.g., weight or temperature) (Nunnally, 1978). For example, length
can be defined conceptually as the shortest distance between two points.
Measurement of length follows directly, at least for short lengths. The same
could be said for time. However, weight or temperature involve more
abstract conceptual definitions, as well as larger distances between the conceptual and the operational. Measurement is more indirect, such as through
the expansion of mercury for temperature, that is, a correlate of temperature, or gravitational pull on calibrated weights for weight.
In moving to the social sciences, the distance between the conceptual
and the operational can be large, for example, in measuring attitudes
toward objects or issues. Larger distances between the conceptual and the
operational have at least two implications. As the distance increases, so,
too, do measurement error and the number of different ways in which
something can be measured (Figure 1.3). This is akin to there being several
ways of getting to a more distant location, and several ways of getting lost
as well!
Figure 1.3   Distance Between the Conceptual and the Operational, Potential Operationalizations and Error (the single concept of preference for numerical information in the conceptual domain and several alternative measures of PNI in the operational domain)
Preference for numerical information, or PNI, is used as a sample construct in this chapter. This construct relates to an enduring proclivity toward
numbers across several contexts, a relatively abstract construct. In contrast,
a less abstract construct may be consumers’ preference for numerical information or employees’ preference for numerical information, essentially
restricting the context of the construct (Figure 1.4). These context-specific
constructs may be at a level of abstraction that is useful in deriving theoretical generalizations. In contrast, consider usage of calorie information as a
construct (Figure 1.5). Here, the numerical context is further restricted.
Although this construct may be useful in the realm of consumption studies,
it is relatively concrete for purposes of understanding the use of numerical
information in general. The point is that there is less theoretical generalization across different domains when a construct is relatively concrete.
Abstract constructs allow for broader generalization. Suppose that a basic
preference for numerical information is expected to cause higher performance in numerically oriented tasks. From the perspective of understanding
numerical information, this is an argument at an abstract level with broader
generalization than one linking usage of calorie information to, say, performance in choosing a brand of cereal. In fact, a modified form of the latter
scenario of calorie information and choice of cereals may be a way to test
the former scenario involving preference for numerical information and performance in numerically oriented tasks. But such an empirical test would
Figure 1.4   Levels of Abstractness of Constructs (preference for numerical information; employees’ preference for numerical information; consumers’ preference for numerical information)

Figure 1.5   Abstractness of Constructs (concepts such as preference for numerical information and consumers’ preference for numerical information, down to phenomena such as usage of calorie information)
have to be developed by carefully deriving ways of measuring the concepts
being studied and choosing the best way to operationalize them. Empirical
tests in specific, narrow contexts ideally should be developed from broader
theoretical and methodological considerations, thereby choosing the best
setting in which a theory is tested.
In conceptually defining a construct, care should be taken that it is sufficiently different
from other existing constructs. In other words, the construct should not be
redundant and should make a sizable contribution in terms of explanatory
power. Conducting science requires linking the work to past knowledge.
Existing research is an important guide in defining a construct. The words
that are used to define a construct have to communicate to a larger scientific
audience. Defining constructs in idiosyncratic ways that run counter to
their use in the extant literature precludes or hinders such communication.
This is not to preclude the redefinition of existing constructs. In fact, it is
necessary to constantly evaluate construct definitions. However, the
point being made here is that redefinitions should be supported with compelling rationale. Existing theory in the area should be used to define a construct to be distinct from other constructs. The distinction should warrant
the creation of a new construct. Thus, the hurdle for the development of new
constructs is usually high. Conceptually rich constructs enable theoretical
generalizations of importance and interest to a discipline.
The distance between the conceptual and operational can also lead
to confounding between measurement and theory. Sometimes, discussions
of conceptual relationships between constructs and hypotheses about
these relationships may confound constructs with their operationalizations,
essentially mixing two different levels of analysis (e.g., arguing for the relationship between PNI and numerical ability on the basis of specific items of
the PNI scale, rather than at a conceptual level). This highlights the need to
keep these two levels of analysis separate while iterating between them in
terms of issues, such as conceptual definitions of constructs and rationale
for conceptual relationships between constructs. Iterative analysis between
these two levels is common and necessary; however, a clear understanding
and maintenance of the distinction is critical. Alternatively, measures that
aim to assess a specific construct may indeed assess a related construct,
either an antecedent or an outcome (say, preference for numerical information measuring numerical ability or preference for qualitative information),
thereby confounding constructs. Constructs may also have multiple dimensions, each with a different relationship with other constructs (say, usage of
numbers and enjoyment of numbers, with the former having a stronger
relationship with a measure of numerical ability) and that may need to be
clarified. Such clarification often occurs as research in an area progresses
and theoretical sophistication leads to sophistication in measurement and
vice versa.
Domain Delineation
The next step in the process of developing a measure of a construct is to
delineate its domain (Churchill, 1979; DeVellis, 1991). This step involves
explicating what the construct is and is not. Related constructs can be used
to explain how the focal construct is both similar to and different from them.
This is also a step where the proposed dimensionality of the construct is
described explicitly. For instance, a measure of intelligence would be explicated in terms of different dimensions, such as quantitative and verbal intelligence. The domain of the construct is described, thus clearly delineating
what the construct is and is not. At its core, this step involves understanding
what is being measured by elaborating on the concept, before the measure is
designed and items in the measure are generated. Careful domain delineation
should precede measure design and item generation as a starting point. This
argument is not intended to preclude iterations between these two steps; iterations between measure design and item generation, and domain delineation
are invaluable and serve to clarify the domain further. Rather, the point is to
consider carefully what abstract concept is being measured as a starting
point before attempting measure design and item generation and iterating
between these steps. Collecting data or using available data without attention to underlying concepts is not likely to lead to the development of knowledge at a conceptual level.
The goal of domain delineation is to explicate the construct to the point
where a measure can be designed and items can be generated. Domain delineation is a step in the conceptual domain and not in the operational domain.
Therefore, different ways of measuring the construct are not considered in
this step. Domain delineation precedes such consideration to enable fuller
understanding of the construct to be measured. Several issues should be
considered here, including using past literature and relating the construct
to other constructs. In other words, the construct has to be placed in the
context of existing knowledge, thus motivating its need and clarifying
its uniqueness. The construct should be described in different ways, in terms
of what is included and what is not included by the domain. Such an
approach is an effective way of clarifying a construct and distinguishing
it from related constructs. For example, preference for numerical information
has to be clearly differentiated from numerical ability. This is similar
to clarifying the exact meaning of related words, whether it is happiness
versus contentment, bluntness versus honesty, and so on. In fact, if anything,
scientific research can be distinguished in terms of the precision with which
words are used, all the more so with words that denote constructs, the focus
of scientific inquiry. Just as numbers are used in scientific measurement
because they are standardized and they facilitate communication, words
should be used to convey precise meaning in order to facilitate communication. Incisive conceptual thinking, akin to using a mental scalpel, should carefully clarify, distinguish, and explicate constructs well before data collection
or even measure development is attempted. Such thinking will also serve
to separate a construct from its antecedents and effects. A high hurdle is usually and appropriately placed in front of new constructs in terms of clearly
distinguishing them from existing constructs.
A worthwhile approach to distinguishing related constructs is to construct scenarios where different levels of each construct coexist. For instance,
if happiness and contentment are two related constructs, constructing examples where different levels of happiness versus contentment coexist can serve
to clarify the domain of each at a conceptual level as well, distinguishing
a construct from what it is not and getting to the essence of it conceptually. The process of moving between the conceptual and the operational can
clarify both.
The continuous versus categorical nature of the construct, as well as its
level of specificity (e.g., too broad vs. too narrow; risk aversion vs. financial
risk aversion or physical risk aversion; preference for numerical information
vs. consumers’ preference for numerical information) needs to be considered.
Constructs vary in their level of abstractness, which should be clarified in
domain delineation. The elements of the domain of the construct need to be
explicated (e.g., satisfaction, liking, or interest with regard to, say, a construct
relating to preference for some type of information). The purpose here is
to carefully understand what exactly the domain of the construct covers,
such as tendencies versus attitudes versus abilities. Visual representation is
another useful way of thinking through the domain. An iterative process
where items are developed at an operational level can, in turn, help in
domain delineation as well by concretizing the abstract domain.
Domain delineation is an important step that enables understanding the
conceptual as it relates to the operational. Delineation may well change
the conceptual definition. Such iteration is common in measure development.
The key is to allow sufficient iteration to lead to a well-thought-out measurement process. Domain delineation may help screen out constructs that
are too broad or narrow. If constructs are too abstract, intermediate-level
constructs may be preferable. Figure 1.6 illustrates a narrow versus broad
operationalization of the domain of preference for numerical information.
Figure 1.6 also illustrates how domain delineation can facilitate item generation by identifying different aspects of the domain. Thus, during the
Figure 1.6   Domain Delineation and Item Generation (the domain of PNI spans enjoyment of numbers, satisfaction with numbers, liking of numbers, interest in numbers, importance of numbers, and usage of numbers; a broad operationalization draws items N1 through N9 from across these aspects, whereas a narrow operationalization draws items N1 through N4 from only a few of them)
subsequent step of item generation, items can be generated to cover aspects
such as enjoyment and importance. In this way, the distance between the
conceptual and the operational is bridged.
Domain delineation is demonstrated using an example of the preference for numerical information (PNI) scale (Exhibit 1.1). Several points are
noteworthy. First, PNI is distinguished from ability in using numerical
information, and several aspects of preference are discussed. PNI is also
distinguished from statistics and mathematics. In addition, PNI is described
in terms of its level of abstraction in being a general construct rather than
specific to a context such as consumer settings or classroom settings. In this
respect, it can be distinguished from other scales such as attitude toward
statistics and enjoyment of mathematics. The distinctive value of the PNI
construct is thus illustrated by showing that it is a unique construct that
is different from potentially related constructs, and that it has the potential
to explain important phenomena. As discussed earlier, any proposed new
construct has to answer the “So what?” question. After all, any construct
can be proposed and a measure developed. There are different ways of
motivating a construct: The PNI scale is motivated in a more generic
way because of lack of existing theory, but alternatively, a scale could be
motivated by placing it in the context of past theory or by identifying a gap
in past theory.
Exhibit 1.1
The Preference for Numerical Information Scale: Definition,
Domain Delineation, and Item Generation
The errors discussed earlier are illustrated using data from a published scale, the preference for numerical information scale (Viswanathan, 1993). Preference for numerical information is defined as a preference or proclivity toward using numerical
information and engaging in thinking involving numerical information. Several
aspects of this definition need to be noted. First, the focus is on preference or proclivity rather than ability, because the aim here is to focus on attitude toward numerical information. Second, the focus is solely on numerical information, rather than on
domains such as statistics or mathematics, in order to isolate numerical information
from domains that involve the use of such information. Third, PNI is conceptualized
as a broad construct that is relevant in a variety of settings by using a general
context rather than a specific context such as, say, an academic setting.
Item Generation
Items were generated for the PNI scale in line with an operationalization of the definition of the construct. As mentioned earlier, three aspects of importance in the definition are the focus on preference or proclivity, the focus on numerical information, and
the use of a general rather than a specific context. The domain of the construct was operationalized by using parallel terms that represent numerical information, such as
numbers, numerical information, and quantitative information. Proclivity or preference
for numerical information was operationalized using a diverse set of elements or aspects,
such as the extent to which people enjoy using numerical information (e.g., “I enjoy
work that requires the use of numbers”), liking for numerical information (e.g., “I don’t
like to think about issues involving numbers”), and perceived need for numerical information (e.g., “I think more information should be available in numerical form”). Other
aspects were usefulness (e.g., “Numerical information is very useful in everyday life”),
importance (e.g., “I think it is important to learn and use numerical information to make
well informed decisions”), perceived relevance (e.g., “I don’t find numerical information
to be relevant for most situations”), satisfaction (e.g., “I find it satisfying to solve day-to-day problems involving numbers”), and attention/interest (e.g., “I prefer not to pay
attention to information involving numbers”). The use of information in a general
rather than a specific context was captured by wording items to be general (“I prefer not
to pay attention to information involving numbers”) rather than specific to any context.
A pool of 35 items was generated in line with the operationalization described
above in the form of statements with which a respondent could agree or disagree
to varying degrees. These items were generated to cover the range of aspects of preference for numerical information listed above (such as satisfaction and usefulness).
A total of 20 items were chosen from this pool and inspected in terms of content
for coverage of these different aspects, usage of different terms to represent numerical information, and generality of context. The items were also chosen such that
an equal number were positively or negatively worded with respect to preference
for numerical information. The response format was a 7-point scale labeled at the
extremes as strongly agree and strongly disagree.
Items of the PNI Scale

Each item is rated on a 7-point scale with the endpoints labeled strongly disagree (1) and strongly agree (7).

1. I enjoy work that requires the use of numbers.
2. I find information easy to understand if it does not involve numbers.
3. I find it satisfying to solve day-to-day problems involving numbers.
4. Numerical information is very useful in everyday life.
5. I prefer not to pay attention to information involving numbers.
6. I think more information should be available in numerical form.
7. I don’t like to think about issues involving numbers.
8. Numbers are not necessary for most situations.
9. Thinking is enjoyable when it does not involve quantitative information.
10. I like to make calculations using numerical information.
11. Quantitative information is vital for accurate decisions.
12. I enjoy thinking based on qualitative information.
13. Understanding numbers is as important in daily life as reading or writing.
14. I easily lose interest in graphs, percentages, and other quantitative information.
15. I don’t find numerical information to be relevant for most situations.
16. I think it is important to learn and use numerical information to make well-informed decisions.
17. Numbers are redundant for most situations.
18. Learning and remembering numerical information about various issues is a waste of time.
19. I like to go over numbers in my mind.
20. It helps me to think if I put down information as numbers.
EXHIBIT SOURCE: Adapted from Viswanathan, M., Measurement of individual differences
in preference for numerical information, in Journal of Applied Psychology, 78(5), 741–752.
Copyright © 1993. Reprinted by permission of the American Psychological Association.
Figure 1.7   Visual Representation for PNI of Latent Construct, Items, and Error (the latent PNI construct causes responses on items N1 through N20, each of which is also influenced by its own error term, ε1 through ε20)
Measure Design and Item Generation
Measure design and item generation follows domain delineation as illustrated for the PNI scale (Exhibit 1.1). Before specific items can be generated,
the design of the measure needs to be determined. Important here is the need
to think beyond measures involving agreement-disagreement with statements, and also beyond self-report measures. Measures can range from
observational data to behavioral inventories. Later in the book, a variety of
measures from different disciplines are reviewed to highlight the importance
of creative measure design. The basic model that relates a construct to items
in a measure is shown in Figure 1.7. A latent construct causes responses on
items of the measure. The items are also influenced by error. The error term
is the catch-all for everything that is not the construct.
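The model in Figure 1.7 can be sketched as a small simulation, purely for illustration (the loadings and error variances below are invented): each observed item response is generated by the latent construct plus an item-specific error term that absorbs everything else.

import numpy as np

rng = np.random.default_rng(2)
n_respondents, n_items = 200, 20

pni = rng.normal(0, 1, n_respondents)                 # latent construct, never observed directly
loadings = rng.uniform(0.5, 0.9, n_items)             # assumed strength of each item's link to PNI
errors = rng.normal(0, 1, (n_respondents, n_items))   # catch-all for everything that is not PNI

items = pni[:, None] * loadings + errors              # observed item responses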
Redundancy is a virtue during item generation, with the goal being to
cover important aspects of the domain. Even trivial redundancy—that is,
minor grammatical changes in the way an item is worded—is acceptable at
this stage (DeVellis, 1991). Measure development is inductive in that the
actual effects of minor wording differences are put to empirical test, and
items that pass the test are retained. It is impossible to anticipate how people
may respond to each and every nuance in item wording. Therefore, items
are tested by examining several variations. All items are subject to interpretation. The aim is to develop items that most people in the relevant
population would interpret in a similar fashion. Generic recommendations
to enable such interpretation include avoiding ambiguous, lengthy, complex,
or double-barreled items. Items also should not be worded to be extreme
(e.g., I hate numbers), as discussed subsequently.
Unit of analysis is another consideration in item development (Sirotnik,
1980). For instance, items designed to capture individual perceptions about
self versus other versus groups suggest different types of wording. Items such
as “My workplace is friendly” versus “My colleagues view our workplace
as being friendly” versus “Our work group views our workplace as friendly”
capture different perceptions—an individual’s perception of the workplace,
colleagues’ perceptions of the workplace (as measured by an individual’s perception of these perceptions), and the group’s perceptions of the workplace
(as measured by an individual’s perception of these perceptions), respectively.
Level of abstraction is an important consideration with item generation.
Items vary in their level of abstractness. For instance, items could be designed
at a concrete level about what people actually do in specific situations versus
at a more abstract level in terms of their attitudes. On one hand, concrete
items provide context and enable respondents to provide a meaningful
response (say, a hypothetical item on the PNI scale such as, “I use calorie
information on packages when shopping”). On the other hand, items should
ideally reflect the underlying construct and not other constructs, the latter
being a particular problem when it occurs for a subset of items, as discussed
later. Because PNI is a more global construct that aims to cover different
contexts, including the consumer context, an item specific to the consumer
context may not be appropriate when other items are at a global level. Such
an item also does not parallel other items on the PNI scale that are more
global in their wording, the point being that the wording of items should be
parallel in terms of level of abstraction. A different approach may well identify specific subdomains (e.g., social, economic) and develop items that are
specific to each such subdomain. Thus, the subdomains are based on contexts
where preference for numerical information plays out. In such an approach,
items at the consumer level may be appropriate. Each of these subdomains
essentially would be measured separately, the issue of items worded to be parallel in terms of the level of abstraction being germane for each subdomain.
Another approach is to divide up the domain into subdomains, such as enjoyment of numbers and importance of numbers; that is, in terms of the different aspects of preference. Each subdomain could be conceptualized as a
separate dimension and items developed for each dimension. Of course, the
hypothesized dimensionality has to be tested empirically.
A large pool of items is important, as are procedures to develop a large
pool. Several procedures can be employed to develop a pool of items,
such as asking experts to generate items and asking experts or respondents
to evaluate items in the degree to which they capture defined constructs
(Haynes, Richard, & Kubany, 1995). There may be a temptation to quickly
put together a few items because item generation seems intuitive in the
social sciences. However, the importance of systematic procedures as a
way to develop a representative set or sample of items that covers the
domain of the construct cannot be overemphasized. These procedures,
which take relatively small additional effort in the scheme of things,
can greatly improve the chances of developing a representative measure.
Moreover, it should be noted that item generation is an iterative process
that involves frequently traversing the pathways between the conceptual
and the operational. If many items are quickly put together in the hope
that errors average out, the average may be of fundamentally inaccurate
items, and therefore not meaningful.
The procedures discussed so far relate to the content validity of a measure,
that is, whether a measure adequately captures the content of a construct.
There are no direct empirical measures of content validity, only assessment
based on whether (a) a representative set of items was developed and
(b) acceptable procedures were employed in developing items (Nunnally,
1978). Hence, assessment of content validity rests on the explication of the
domain and its representation in the item pool. The proof of the pudding is
in the procedures used to develop the measure rather than any empirical indicator. In fact, indicators of reliability can be enhanced at the expense of content validity by representing one or a few aspects of the domain (as shown
through narrow operationalization in Figure 1.6) and by trivial redundancies
among items. An extreme scenario here is, of course, repetition of the same
item, which likely enhances reliability at the expense of validity.
Internal Consistency Reliability
After item generation, measures typically are assessed for internal consistency reliability, and items are deleted or modified. Internal consistency frequently is the first empirical test employed and assesses whether items within
a set are consistent with each other, that is, covary with each other. Internal
consistency procedures assess whether a set of items fits together or belongs
together. Responses are collected on a measure from a sample of respondents. Intercorrelations among items and correlations between each item and
the total score are used to purify measures, using an overall indicator of
internal consistency reliability, coefficient alpha (Cronbach, 1951).
Illustrative descriptions are provided using the PNI scale (Exhibit 1.2).
Means for the items in the PNI scale are shown. Means close to the middle
of the scale are desirable to allow for sufficient variance and the ability to
covary with other items. If an item does not vary, it cannot covary, and measurement is about capturing variation as it exists. (A parallel example is in
experimental design, where, say, the dependent variable is recall of information shown earlier, wherein either one or two pieces of information could be
shown leading to very high recall and small variance, or a hundred pieces of
information shown leading to very low recall and small variance. In both
instances, lack of variation precludes appropriate tests of hypotheses.) As an
illustration, Item 18 in Exhibit 1.2 (Learning and remembering numerical
information about various issues is a waste of time) has a mean of 5.24; it
was subsequently found to have low item-to-total correlations in some studies and was replaced. The item is worded in somewhat extreme terms, referring
to “waste of time.” This is akin to having items with words such as “never”
or “have you ever” or “do you hate.” Items 4 and 5 are additional examples
relating to usefulness of numbers and not paying attention to numbers. Item
17 relates to numbers being redundant for most situations—perhaps another
example of somewhat extreme wording. An important issue here is that the
tone of the item can lead to high or low means, thus inhibiting the degree to
which an item can vary with other items. The item has to be valenced in one
direction or the other (i.e., stated positively or negatively with respect to the
construct in question) to elicit levels of agreement or disagreement (because
disagreement with a neutral statement such as “I neither like nor dislike
numbers” is ambiguous). However, if it is worded in extreme terms (e.g., “I
hate . . .”), then item variance may be reduced. Therefore, items need to be
moderately valenced. The name of the game is, of course, variation, and satisfactory items have the ability to vary with other items. Scale variance is the
result of items with relatively high variance that covary with other items.
This discussion is also illustrative of the level of understanding of the relationship between item characteristics and empirical results that is essential in
measure development. Designers of measures, no matter how few the items,
have to do a lot more than throw some items together.
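A minimal screening sketch along these lines, assuming a respondents-by-items matrix of 7-point responses already reverse scored where needed (the data file name and the cutoffs used for flagging are illustrative, not prescriptive):

import numpy as np

# Hypothetical data file: rows are respondents, columns are items, values 1-7.
responses = np.loadtxt("pni_responses.csv", delimiter=",")

means = responses.mean(axis=0)
sds = responses.std(axis=0, ddof=1)

for i, (m, s) in enumerate(zip(means, sds), start=1):
    # Flag items with extreme means or little variance for a closer look at wording.
    flag = "  <- check wording" if m < 2.5 or m > 5.5 or s < 1.0 else ""
    print(f"Item {i:2d}: mean = {m:.2f}, sd = {s:.2f}{flag}")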
Item-to-total correlations are typically examined to assess the extent
to which items are correlated with the total. As shown in Appendix 1.1, a
matrix of responses is analyzed in terms of the correlation between an item
and the total score. The matrix shows individual responses and total scores.
The degree to which an individual item covaries with other items can be
assessed in a number of ways, such as through examining intercorrelations
between items or the correlation between an item and the total scale.
Exhibit 1.2
Internal Consistency of PNI Scale
Analysis of 20-Item PNI Scale
Results for the 20-item PNI scale (Viswanathan, 1993) are presented below for a
sample of 93 undergraduate students at a midwestern U.S. university. Means and
standard deviations can be examined. Extreme means or low standard deviations
suggest potential problems, as discussed in the chapter.
Item    Mean      Std. Dev.
N1      4.3407    2.0397
N2      3.7253    1.9198
N3      4.3956    1.9047
N4      5.2747    1.6304
N5      5.3297    1.3000
N6      3.9890    1.6259
N7      5.0000    1.6063
N8      4.7363    1.6208
N9      4.3516    1.6250
N10     4.5495    1.9551
N11     5.0769    1.3269
N12     3.4286    1.3755
N13     5.0989    1.3586
N14     4.3297    1.7496
N15     4.8571    1.2163
N16     5.0769    1.3600
N17     5.0220    1.2109
N18     5.2418    1.1489
N19     3.7033    1.7670
N20     4.2527    1.9096
The average interitem correlation provides an overall indicator of the internal
consistency between items. It is the average of correlations between all possible pairs
of items in the 20-item PNI scale.
Average interitem correlation = .2977
Item-to-total statistics report the correlation between an item and the total score,
as well as the value of coefficient alpha if a specific item were deleted. Items with low
correlations with the total score are candidates for deletion. Item 12 actually has a
negative correlation with the total score and should be deleted. It should be noted
that all analyses are after appropriate reverse scoring of items such that higher scores
reflect higher PNI.
Item-Total Statistics

Item    Corrected Item-Total Correlation    Alpha If Item Deleted
N1      .7736                               .8839
N2      .3645                               .8975
N3      .7451                               .8857
N4      .5788                               .8911
N5      .4766                               .8939
N6      .6322                               .8896
N7      .7893                               .8853
N8      .4386                               .8949
N9      .4781                               .8938
N10     .7296                               .8861
N11     .4843                               .8937
N12     −.3603                              .9143
N13     .5073                               .8931
N14     .6287                               .8895
N15     .4617                               .8943
N16     .6095                               .8904
N17     .5301                               .8927
N18     .3947                               .8958
N19     .5323                               .8924
N20     .6281                               .8894
Coefficient alpha is the overall indicator of internal consistency as explained in the
chapter.
Alpha = .8975
Standardized item alpha = .8945
The analysis was repeated after deleting Item 12.
Analysis After Deletion of Item 12 (19-item version)
Average interitem correlation = .3562
Item-Total Statistics

Item    Corrected Item-Total Correlation    Alpha If Item Deleted
N1      .7705                               .9043
N2      .3545                               .9160
N3      .7479                               .9052
N4      .5834                               .9097
N5      .4823                               .9121
N6      .6219                               .9088
N7      .7889                               .9047
N8      .4432                               .9131
N9      .4811                               .9122
N10     .7299                               .9056
N11     .5071                               .9115
N13     .5175                               .9113
N14     .6373                               .9083
N15     .4672                               .9124
N16     .6237                               .9088
N17     .5306                               .9111
N18     .3950                               .9138
N19     .5381                               .9109
N20     .6339                               .9084

Alpha = .9143
Standardized item alpha = .9131
The analysis is repeated after deleting another item, Item 2, which has a relatively
low item-to-total correlation.
Analysis After Deletion of Item 2 (18-item version)
Average interitem correlation = .3715
Item-Total Statistics

Item    Corrected Item-Total Correlation    Alpha If Item Deleted
N1      .7650                               .9063
N3      .7442                               .9069
N4      .6025                               .9111
N5      .4806                               .9140
N6      .6144                               .9108
N7      .7916                               .9062
N8      .4417                               .9151
N9      .4589                               .9147
N10     .7290                               .9073
N11     .5298                               .9129
N13     .5277                               .9129
N14     .6453                               .9098
N15     .4726                               .9142
N16     .6307                               .9104
N17     .5156                               .9132
N18     .3829                               .9159
N19     .5405                               .9128
N20     .6374                               .9101

Alpha = .9160
Standardized item alpha = .9141
The effect of the number of items in a scale on reliability is noteworthy. Below,
10- and 5-item versions of the PNI scale are presented.
Analysis of 10- and 5-Item Versions of the PNI Scale
Reliability Analysis—10-item version
Average interitem correlation = .3952
Item-Total Statistics

Item    Corrected Item-Total Correlation    Alpha If Item Deleted
N1      .7547                               .8411
N2      .3885                               .8723
N3      .7489                               .8424
N4      .5212                               .8611
N5      .4872                               .8634
N6      .5702                               .8575
N7      .8033                               .8402
N8      .3696                               .8718
N9      .4761                               .8643
N10     .7521                               .8417

Alpha = .8688
Standardized item alpha = .8673
Reliability Analysis—5-Item Version
Average interitem correlation = .3676
Item-Total Statistics

Item    Corrected Item-Total Correlation    Alpha If Item Deleted
N1      .7016                               .6205
N2      .3189                               .7705
N3      .6888                               .6351
N4      .4648                               .7198
N5      .4357                               .7298

Alpha = .7475
Standardized item alpha = .7440
Items that are internally consistent would have high correlation with the total
score. Consider a scale that is assumed to consist of a number of internally
consistent items. The higher an individual’s response on a particular item,
the higher the individual’s response is likely to be on other items in the scale,
and the higher the individual’s total score. A person high on PNI would be
expected to have a high total score as well as high scores on individual items
(once the items have been reverse scored so that higher values reflect higher
PNI; for instance, an item such as “I don’t like numbers” would be reverse
scored so that a response of a 7 is equivalent to a 1, and so on). Items with
low correlations with the total score are deleted or modified. Such items are
assumed to be lacking internal consistency. They do not covary or are not
consistent with the total score or with other items.
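A sketch of these computations, using arbitrary simulated responses in place of actual data (the choice of which items are negatively worded is illustrative): negatively worded items on a 7-point scale are reverse scored as 8 minus the response, and each item's corrected item-to-total correlation is computed against the sum of the remaining items.

import numpy as np

rng = np.random.default_rng(3)

# Simulated 7-point responses standing in for real data (rows = respondents, columns = items)
responses = rng.integers(1, 8, size=(100, 5)).astype(float)

# Reverse score negatively worded items (here, illustratively, the second and fourth items)
# so that higher values always reflect higher PNI: on a 1-7 scale, reversed value = 8 - response.
negative_items = [1, 3]
responses[:, negative_items] = 8 - responses[:, negative_items]

# Corrected item-to-total correlation: each item against the total of the remaining items.
total = responses.sum(axis=1)
for j in range(responses.shape[1]):
    rest = total - responses[:, j]
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"Item {j + 1}: corrected item-to-total r = {r:.3f}")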
Item 12 (Exhibit 1.2 and Figure 1.8) is instructive in that it has a negative correlation with the total score after reverse scoring. The item pertains
to qualitative information and is employed here on the assumption that
higher PNI may be associated with lower preference for qualitative information. In other words, rather than directly capture preference for numerical information, the item is premised on a contingent relationship between
PNI and preference for qualitative information. In fact, for some individuals, high preferences for both types of information may coexist. The key
lesson here is the importance of directly relating items to the constructs
being measured. Essentially, the item confounds constructs it purports to
measure. The negative correlation may also have been the result of some
respondents misreading the term qualitative as quantitative. It should be
noted here that the item was appropriately reverse-scored before data
analyses, including computing correlations.
Another example from a different research area is illustrative. Consider
a measure that attempts to capture the type of strategy a firm adopts in
order to compete in the marketplace. A strategy may consist of many different aspects, such as emphasis on new product development and emphasis on research and development. Therefore, self-report items could be
generated to measure strategy regarding emphasis on research and development and emphasis on new product development (e.g., “We emphasize
new product development to a greater degree than our competition,” with
responses ranging from strongly disagree to strongly agree). Suppose also
that there is an empirical relationship between the type of environment in
which firms operate (e.g., uncertain environments) and the type of strategy
in which firms engage; that is, in uncertain environments, firms tend to
invest in research and development and new product development. Then,
an item generated to measure strategy about how uncertain the environment is (e.g., “Our company operates in an uncertain environment,” with
Figure 1.8   Indirect Versus Direct Measurement of Constructs (Item 12 taps preference for qualitative information and reaches preference for numerical information only indirectly)
responses from strongly disagree to strongly agree) attempts to tap into
strategy on the basis of a contingent relationship between the type of strategy and the type of environment. Such an item assumes that firms adopt a
particular strategy in a certain environment. As far as possible, items should
be developed to directly assess the construct they aim to measure. Otherwise, substantive relationships between constructs are confounded with
measures of single constructs. Subtler issues arise depending on the type
of indicator that is being developed, as discussed subsequently. Direct
measurement does not imply repetitive, similar items. Any item typically
captures a construct in some context. Creative measurement (and perhaps
interesting items from the respondents’ perspective) involves different ways
of capturing a construct in different contexts. A key point to note is that
ideally, items should not confound constructs but rather use contexts in
which the focal construct plays out.
Coefficient alpha is an indicator of the internal consistency reliability of
the entire scale. A goal in internal consistency procedures is to maximize
coefficient alpha, or the proportion of variance attributable to common
sources (Cronbach, 1951; DeVellis, 1991). Items are deleted on this basis to
achieve a higher coefficient alpha, and this process is repeated until the marginal gain in alpha is minimal. Beyond a point, the marginal increase in
alpha may not warrant additional deletion of items. In fact, items with
moderate correlations with the total may be retained if they capture some
unique aspect of the domain not captured by other items. The average
interitem correlation, along with coefficient alpha, provides a summary,
and researchers have suggested that its ideal range is between 0.2 and 0.4
(Briggs & Cheek, 1988). The rationale for this guideline is that relatively
low correlations suggest that a common core does not exist, whereas
relatively high correlations suggest trivial redundancies and a narrow operationalization of one subdomain of the overall construct. For example, for
the PNI scale, if all the correlations in Exhibit 1.2 between items are close
to 1, this may suggest that all the items in the measure capture only some
narrow aspect of the domain of the construct. This could happen if items
are merely repetitions of each other with trivial differences. In addition to
item deletion on purely empirical grounds, the nature of the item needs to
be taken into account. If an item has moderate correlation with the total
score, yet captures some unique aspect of a construct not captured by other
items, it may well be worth retaining. This is not to maximize reliability but
rather to trade off reliability to increase validity, an issue to be discussed
subsequently. Purely empirical or purely conceptual approaches are not
sufficient in measure development and validation. The iteration between
empirical results and conceptual examination is essential.
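The average interitem correlation reported in Exhibit 1.2 is simply the mean of the correlations between all distinct pairs of items; a sketch of the computation, assuming a respondents-by-items response matrix, follows.

import numpy as np

def average_interitem_correlation(responses):
    """Mean correlation across all distinct pairs of items.

    responses: 2-D array, rows = respondents, columns = items (reverse scored as needed)
    """
    r = np.corrcoef(responses, rowvar=False)   # item intercorrelation matrix
    pairs = r[np.triu_indices_from(r, k=1)]    # upper triangle, excluding the diagonal
    return pairs.mean()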
The definition and computation of coefficient alpha are discussed below
(adapted from DeVellis, 1991). Exhibit 1.2 presents coefficient alpha for
several versions of the PNI scale. The computation of coefficient alpha is
illustrated in Appendix 1.1. A variance covariance matrix is shown for a
three-item scale. Considering extreme scenarios in internal consistency is
useful in clarifying the typical scenarios that fall in the middle. It is possible
that none of the items covaries with each other (i.e., all nondiagonal items
are zero). It is also possible that all items covary perfectly with each other
(i.e., a correlation of 1 and covariances depending on the unit of measurement). Coefficient alpha is the proportion of total variance that is due to
common sources. (Note the plural “sources” for subsequent discussion in
factor analysis.) Variation attributable to common sources is indicated by
covariances between items. The correction term in Appendix 1.1 standardizes alpha so that it equals 0 and 1, respectively, in these two extreme scenarios. A simple example using three items with perfect intercorrelations illustrates the need for a correction term.
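The computation itself can be sketched as follows; this is an illustrative Python fragment based on the definition above (applied to raw responses rather than to the PNI data), not a reproduction of Appendix 1.1:

import numpy as np

def coefficient_alpha(responses):
    # responses: array of shape (n_respondents, K items)
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    # Covariation among items is what makes the total variance exceed the sum
    # of the item variances; K/(K - 1) is the correction term that stretches
    # the index so the extreme scenarios yield 0 and 1.
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Three items that covary perfectly yield alpha = 1; three uncorrelated
# items yield alpha near 0.
base = np.random.default_rng(1).normal(size=100)
perfect = np.column_stack([base, base, base])
print(round(coefficient_alpha(perfect), 3))   # 1.0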
How does coefficient alpha measure reliability (where reliability is the minimization of random error)? Coefficient alpha is the proportion of variance attributable to common sources. These common sources are presumably the construct in question, although this remains to be demonstrated and has to be assumed
for the moment. Anything not attributable to common sources is assumed to
be random error, that is, variation that is not systematically related to the
common core that items share. The proportion of variance attributable to common sources is, in effect, a squared correlation; hence, reliability is expressed in units of variance, the square of the correlation between the measure and a hypothetical true score.
A qualification is noteworthy here. Internal consistency reliability is essentially about the degree to which a set of items taps into some common sources
of variance, presumably the construct being measured. The basic premise here
in deleting items that are inconsistent based on their lack of correlation with
the total scale is that most of the items in the scale tap into this basic construct. However, whether the scale indeed taps into the construct in question
is determined subsequently through procedures for assessing validity. Hence,
it is quite possible that a small number of items being deleted indeed represent the construct in question, whereas the majority of items retained tap into
an unintended construct. This issue highlights the importance of examining
items both conceptually and empirically in conjunction.
As the number of items in a scale increases, coefficient alpha increases.
(Exhibit 1.2 reports alphas for 5-, 10-, and 20-item versions.) Why is this
happening?8 Note that this would happen even if the best five items on the
basis of item-to-total correlation in the scale are used and compared with
the 20-item scale. The mathematical answer to this question is that as
the number of items increases, the number of covariance terms (k² − k)
increases much faster than the number of variance terms (k). As the number
of items increases, the number of ways in which items covary increases at
an even more rapid rate (e.g., if the number of items is 2, 3, 4, 5, 6, 7, 8, 9,
and 10, the number of covariance terms is 2, 6, 12, 20, 30, 42, 56, 72, and
90, respectively). But what does this mean intuitively? As the number of
items increases, so does the number of ways in which they covary with each
other and contribute to a total score with a high degree of variation. Each
item covaries with other items and captures aspects of the domain.
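The arithmetic can be illustrated with the Spearman-Brown form of standardized alpha, holding the average inter-item correlation constant (the value of .30 below is hypothetical, not a PNI result):

def standardized_alpha(k, mean_r):
    # Standardized alpha from K items with average inter-item correlation mean_r
    return k * mean_r / (1 + (k - 1) * mean_r)

for k in (5, 10, 20):
    covariance_terms = k * k - k
    print(k, covariance_terms, round(standardized_alpha(k, 0.30), 2))
# 5 items  ->  20 covariance terms, alpha ~ 0.68
# 10 items ->  90 covariance terms, alpha ~ 0.81
# 20 items -> 380 covariance terms, alpha ~ 0.90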
During this stage of measure development, the emphasis is on tapping
into common sources, presumably the underlying construct. Whether the
common sources are multiple sources or a single source is in the purview of
dimensionality. However, whether the common sources that are reliably
measured are the construct in question is in the purview of validity. Of
course, whether any aspect of the true score is captured is an issue in the
realm of validity. It is possible that the items in a measure capture no aspect
of a construct. In summary, internal consistency assesses whether items
covary with each other or belong together and share common sources. The
word sources is noteworthy because dimensionality addresses the question
of the exact number of distinguishable sources. Hence, internal consistency
is a distinct form of reliability that hinges on items sharing a common core,
presumably the construct in question. Reliability procedures attempt to
assess the common variance in a scale. Internal consistency assesses the
degree to which items covary together or are consistent with each other. In
other words, if a person has a high score on one item, that person should
have a high score on another item as well, items being scored such that
higher values on them denote higher values on the underlying construct.
Dimensionality, as discussed subsequently, assesses whether the common
core in question is, indeed, a single core, or if multiple factors reflect the
hypothesized number of dimensions with items loading onto their specified
factors and covarying more closely within a factor than across. A factor at
an empirical level is formed by a subset or combination of items that covary
more closely with each other than with other items.
Test-Retest Reliability
Internal consistency reliability is often supplemented with test-retest reliability. Typically, the same scale is administered twice with an interval of a
few weeks, with recommendations ranging anywhere from 4 to 6 weeks and
higher. More importantly, the interval has to fit the specific research study in
terms of allowing sufficient time to minimize recall of responses in the previous administration, just as the weighing machine in the example earlier in the
chapter does not have memory of the previous weighing. The logic of test-retest reliability is simply that individuals who score higher (or lower) in one administration should score higher (or lower) in the second. In
other words, the ordering of scores should be approximately maintained. A
key assumption of test-retest reliability is that the true score does not change
between test and retest. For instance, in the weighing machine example, the
person does not eat or change clothes before being weighed again. In a sense,
test-retest reliability represents a one-to-one correspondence with the concept
of reliability, which is centered on replicability. Assessment of reliability
involves asking a question: If the measure were repeated, would similar (i.e.,
consistent) scores be obtained? Test-retest reliability offers direct evidence of
such consistency. Internal consistency reliability treats different items as replications of a measure.
Scales (e.g., the PNI scale) are evaluated by examining the overall test-retest
correlation. The computation of test-retest reliability is shown in Figure 1.9.
As shown in the figure, test-retest reliability is, indeed, the test-retest correlation itself, the square of the hypothetical correlation between a measure and
Figure 1.9    Computation of Test-Retest Reliability Formula

[Figure: the true PNI score is measured at two points in time, t1 and t2.]

t1 = score at test at Time 1
t2 = score at retest at Time 2
r(t1,T) = correlation between the score at Time 1 and the true score T
r(t2,T) = correlation between the score at Time 2 and the true score T
r(t1,t2) = test-retest correlation

r(t1,T) · r(t2,T) = r(t1,t2)
r(t1,T)² = r(t1,t2)          [assuming r(t1,T) = r(t2,T)]
r(t1,T)² = proportion of variance attributable to the true score
Test-retest reliability = test-retest correlation
the true score. Individual items could also be examined by this criterion and
deleted on this basis. Details for the PNI scale are presented (Exhibit 1.3).
Noteworthy here is that some items may perform well on test-retest reliability
but not on internal consistency, or vice versa, an issue discussed subsequently.
Item 8, for example, has a high item-to-total correlation yet a low test-retest
correlation for both the 1-week and 12-week intervals. This could be due to a
variety of factors. Inconsistent administration across test and retest or distracting settings could cause low test-retest correlation. Similarly, item wording in Item 8, “Numbers are not necessary for most situations,” may lead to
inconsistent interpretations or inconsistent responses across time depending on
the “situations” that respondents recall in responding to the item. Means at
test versus retest can also be examined.
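Computationally, this amounts to a correlation between the two administrations, for the total score and, if desired, for individual items; the sketch below uses hypothetical scores for eight respondents:

import numpy as np

# Hypothetical total scores for 8 respondents at test (t1) and retest (t2)
t1 = np.array([5.2, 3.1, 4.4, 6.0, 2.8, 4.9, 3.6, 5.5])
t2 = np.array([5.0, 3.4, 4.1, 5.8, 3.0, 4.6, 3.9, 5.7])

test_retest_reliability = np.corrcoef(t1, t2)[0, 1]  # the test-retest correlation
print(round(test_retest_reliability, 2))
print(t1.mean().round(2), t2.mean().round(2))        # means at test versus retest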
Exhibit 1.3    Test-Retest Reliability of a Modified PNI Scale

The PNI scale was administered to 106 undergraduate students at a midwestern U.S. university twice with an interval of 1 week (Viswanathan, 1993). Longer intervals should be used, and the PNI scale has been assessed for test-retest reliability using a 12-week interval (Viswanathan, 1994).

The following are new items that replaced items from the original scale:

2. I think quantitative information is difficult to understand.
12. I enjoy thinking about issues that do not involve numerical information.
18. It is a waste of time to learn information containing a lot of numbers.

Test-Retest Correlations

                        1-week interval    12-week interval
Total scale                  .91**              .73**
Item 1                       .87**              .69**
Item 2                       .50**              .46**
Item 3                       .74**              .52**
Item 4                       .62**              .30**
Item 5                       .56**              .42**
Item 6                       .66**              .56**
Item 7                       .65**              .66**
Item 8                       .28**              .00
Item 9                       .74**              .42**
Item 10                      .76**              .73**
Item 11                      .58**              .30**
Item 12                      .56**              .31**
Item 13                      .60**              .39**
Item 14                      .60**              .60**
Item 15                      .46**              .50**
Item 16                      .64**              .45**
Item 17                      .42**              .39**
Item 18                      .35**              .25**
Item 19                      .74**              .56**
Item 20                      .60**              .54**
Mean score at test          4.64               4.64
Mean score at retest        4.62               4.56

**p < .01.

Dimensionality—Exploratory Factor Analysis

Reliability through internal consistency assesses whether a set of items is tapping into a common core as measured through the degree to which they covary with each other. But whether the common core consists of a specific
set of dimensions is in the realm of dimensionality, assessed through factor
analysis. At the outset, it should be noted that there are many issues involved
in using factor analysis appropriately; the references cited here provide direction for in-depth reading. This discussion aims to provide an introduction to
the topic.
Factor analysis is an approach in which variables are reduced to combinations of variables, or factors (Comrey & Lee, 1992; Hair, Anderson, Tatham,
& Black, 1998). The assumption in performing factor analysis on a set of
items is that an underlying factor or set of factors exists that is a combination
of individual items. A correlation matrix for all items in a scale may offer
some indication of the outcome of factor analysis and is well worth examining. When subsets of items have sizably higher correlations among each other
and lower correlations with items outside of the subset, these items are likely
to form a factor. For example, if there are two distinct factors, the matrix of
correlations would have two distinct subsets of items with high correlations
within items from the same subset and lower correlations among items from
different subsets. At an intuitive level, this suggests that if a respondent provided a high (low) rating in response to an item, he or she was more likely to
provide a high (low) rating in response to another item within the subset
rather than across subsets. Items in a subset or items that belong in a factor
covary more closely than do items from different factors. What each factor is
has to be determined by the researcher. For example, two factors may be
found among items just based on whether they are positively worded or negatively worded. Responses to positively worded items may covary among
each other to a much greater degree than they do with negatively worded
items. A set of ratings of attributes of grocery stores (e.g., location, times of
operation, aisle space, etc.) may reduce to a couple of factors such as atmosphere and convenience. Similarly, ratings of automobiles on attributes such as
shoulder room, gas mileage, and so on may reduce to underlying factors such
as comfort, safety, and economy. In these examples, factor analysis can be
used to understand what consumers are really looking for. Thus, when appropriately used by the researcher, factor analysis can provide insight into
responses at the item level (e.g., why someone likes a grocery store in terms
of underlying abstract factors, such as convenience, based on responses to
concrete items such as checkout speed and location). Clearly, the results of
factor analysis reflect the set of items and the content covered, that is, the
questions asked and the aspects on which data were collected to begin with.
Potential factors are precluded if items exclude specific subdomains.
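The correlation pattern described above can be simulated: when two subsets of items reflect two different underlying factors, correlations are visibly higher within subsets than across them. The sketch below uses two hypothetical factors and six items:

import numpy as np

rng = np.random.default_rng(7)
n = 300
factor_a = rng.normal(size=n)          # e.g., "convenience"
factor_b = rng.normal(size=n)          # e.g., "atmosphere"

# Items 1-3 driven by factor A, items 4-6 by factor B, plus unique noise
items = np.column_stack(
    [0.8 * factor_a + 0.6 * rng.normal(size=n) for _ in range(3)] +
    [0.8 * factor_b + 0.6 * rng.normal(size=n) for _ in range(3)])

print(np.corrcoef(items, rowvar=False).round(2))
# Correlations within items 1-3 and within items 4-6 are high;
# correlations across the two subsets hover near zero.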
For a subset to rise to the level of a factor, both empirical outcomes and
conceptual examination have to be taken into account. In a sense, when
items are generated to represent the domain of a construct, each subdomain
may separate into a distinct factor. At an extreme, each item could be
considered a separate factor. But such an approach is both an extremely
inefficient way of conducting scientific research and very unlikely to lead to
conceptually rich theory and abstract generalizations. In practice, whether a
set of items rises to the level of a factor has to be determined by both empirical outcomes and conceptual examination of items.
Factor analysis results include correlations or loadings between individual
items and factors. If a single factor is expected, then this is likely to be the
first factor, with item correlations or loadings for this factor being high. The
construct in question, rather than incidental factors being measured, is likely
to be dominant and explain the most variation. Individual items can be evaluated by assessing their loading or correlation with the factor considered to
represent the overall construct.
Another distinction in factor analysis is between principal components analysis and common factor analysis (Hair et al., 1998). Essentially, variance for each item can be divided into error variance, specific variance, and
common variance shared with other items. In other words, responses to each
item reflect some error, some content unique to the item, and some content
shared with other items. An item on the PNI scale has some content unique
to the item and some content, presumably preference for numerical information, shared with other items. The content unique to an item would
depend on the specific context of the item and the wording. In a narrow
sense, the item is a measure of some unique and likely very concrete “thing.”
If one item on the PNI scale is influenced by social desirability in wording
and others are not, then social desirability contributes to unique content in
an item that does not covary with other items. Each item also has unique
content in the way it is worded. Principal components analysis uses all the
variance available and is a pure data reduction technique, whereas common
factor analysis uses only the common variance shared by variables and is
more appropriate for conceptually driven examination of the data. The
nature of factor extraction that is meaningful for scales with a common core
is one that works off the common variance across items rather than one that
is a pure data reduction technique (e.g., principal component analysis). This
is because such common variance shared among items reflects the conceptual meaning of a factor that relates to an underlying construct. Using a pure
data reduction technique that takes advantage of all variation and chance
correlations is inconsistent with the goal of capturing conceptual meaning
shared across items within a factor.
In performing common factor analysis, the communality of each item is
estimated, which is an indicator of the variance in each item that is shared
with other items (Hair et al., 1998; see also Appendix 1.1). Every item has
variance, but not all of it is shared with other items or is reflected in
covariance between an item and other items. Only the shared portion of an
item’s variance is employed in common factor analysis. Such an approach
corresponds to the conceptualization of a measure as consisting of items
that share content. An approach that uses all the variance in an item
regardless of such variance being shared is akin to a pure data reduction
technique that capitalizes on available variance rather than isolating
shared variance.
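One simplified way to see the difference is a single, non-iterated pass of principal axis factoring, in which estimated communalities (here, squared multiple correlations) replace the 1s on the diagonal of the correlation matrix before it is factored, so that only shared variance drives the solution. This sketch is illustrative and not the exact routine of any particular statistical package:

import numpy as np

def principal_axis_loadings(R, n_factors):
    # One (non-iterated) pass of principal axis factoring on a correlation matrix R
    R = np.asarray(R, dtype=float)
    # Squared multiple correlations as initial communality estimates
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    reduced = R.copy()
    np.fill_diagonal(reduced, smc)          # factor only the shared variance
    eigenvalues, eigenvectors = np.linalg.eigh(reduced)
    order = np.argsort(eigenvalues)[::-1][:n_factors]
    return eigenvectors[:, order] * np.sqrt(np.clip(eigenvalues[order], 0, None))

Principal components analysis, by contrast, would factor the correlation matrix as is, using all available variance.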
The results of factor analysis include the number of factors extracted and
the variance explained by each factor. Appendix 1.1 presents simple illustrations of exploratory factor analysis. In unrotated factor analysis, the first factor represents the best linear combination of items, the second factor the
second best, and so on (Hair et al., 1998). The second factor should be based
on the remaining variance after extraction of the first factor in order to be
orthogonal to it (i.e., mathematically independent factors or no correlations
between factors) (Hair et al., 1998). Unrotated factor matrices are rotated to
improve interpretation of the loadings of items on factors. Rotation involves
turning the axes on their origin and can lead to improved interpretation
of results (Figure 1.10). Rotation redistributes variance from earlier to later
factors, whereas earlier factors usually explain considerable variance in unrotated factor analysis. Rotation can be orthogonal or oblique. In oblique rotation, factors may be correlated to each other (Figure 1.10).9 The purpose of
rotation is to facilitate interpretation of factors by simplifying loadings to be
closer to 1 or 0. Different types of rotation serve different purposes, each with its limitations. Varimax rotation is one such approach, often used when multiple dimensions are anticipated and particularly useful when multiple dimensions are indeed present. However, such an approach may lead to multiple factors even when there is a single underlying dimension.
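A compact sketch of the varimax criterion (adapted to plain numpy; the algorithm is standard, but this is an illustrative implementation rather than the routine of any particular package) shows that rotation only turns the axes so that loadings move toward 1 or 0:

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Orthogonally rotate a loading matrix by the varimax criterion
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        u, s, vt = np.linalg.svd(
            L.T @ (rotated ** 3
                   - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))))
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return L @ rotation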
In factor analysis, a judgment has to be made as to the number of meaningful factors underlying the data by comparing the percentage of variance
explained by each extracted factor (reflected in eigenvalues) in light of expectations. For instance, a dominant factor would be indicated by the first factor having a much higher eigenvalue (or percent of variance explained) than
the second factor. The noteworthy point here is that not all extracted factors
are meaningful, and several approaches are used to assess the number of
meaningful factors. In essence, the relative variances explained by each factor are compared in deciding how many factors to retain. A scree test plots
the variance explained by each factor to identify discontinuities in terms of
a drop-off in variance explained (Hair et al., 1998). Later factors are generally thought to contain a higher degree of unique variance (Hair et al., 1998).
Figure 1.10    Orthogonal Factor Rotation

[Figure: six items (X1–X6) plotted against factor axes F1 and F2; varimax rotation turns the axes by about 21 degrees to F1′ and F2′, and Direct Oblimin rotation yields the oblique axes F1″ and F2″.]

F1 and F2 = orthogonal factors (axes) before rotation; F1′ and F2′ = orthogonal factors after varimax rotation; F1″ and F2″ = oblique factors after Direct Oblimin rotation (the angle between the two oblique axes is 66.42 degrees).

SOURCE: Kim, J.-O., & Mueller, C., Introduction to factor analysis: What it is and how to do it, p. 57, copyright © 1978. Reprinted by permission of Sage Publications, Inc.
The scree test plots factors in their order of extraction by eigenvalues
(Figure 1.11). Such a plot is usually vertical to begin with and then becomes
horizontal, the point at which this occurs being the cutoff for the maximum
number of factors to extract. Another approach is to consider any factor
with an eigenvalue greater than 1, suggesting that the factor explains more
variance than any individual item (Hair et al., 1998).10
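Both criteria can be read from the eigenvalues of the item correlation matrix, as in the sketch below; the matrix R is hypothetical, and a scree plot would simply chart these eigenvalues against factor number:

import numpy as np

def eigenvalue_screen(R):
    # Eigenvalues of a correlation matrix R, for scree inspection and the
    # eigenvalue-greater-than-1 rule
    eigenvalues = np.sort(np.linalg.eigvalsh(np.asarray(R, dtype=float)))[::-1]
    kaiser_count = int((eigenvalues > 1.0).sum())
    return eigenvalues, kaiser_count

R = np.array([[1.0, 0.6, 0.5, 0.1],
              [0.6, 1.0, 0.5, 0.1],
              [0.5, 0.5, 1.0, 0.1],
              [0.1, 0.1, 0.1, 1.0]])
values, retained = eigenvalue_screen(R)
print(values.round(2), retained)  # inspect the drop-off and how many roots exceed 1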
It should be noted that coefficient alpha and internal consistency relate to
whether there are common underlying sources, the plural being the key here,
and should not be used to make inferences about the dimensionality of a
construct. This is illustrated using an example where there are clearly two
factors, but coefficient alpha is extremely high (Figure 1.12). Whether there
is one or a specific number of underlying sources is in the realm of dimensionality and factor analysis.
Figure 1.11    Eigenvalue Plot for Scree Test Criterion

[Figure: plot of eigenvalues (vertical axis, roughly 0 to 4) against factor number (horizontal axis, 1 to 12); the curve drops steeply and then levels off, and the straightening of the line indicates that no more than five factors should be extracted.]

SOURCE: Kim, J.-O., & Mueller, C., Introduction to factor analysis: What it is and how to do it, p. 45, copyright © 1978. Reprinted by permission of Sage Publications, Inc.
Dimensionality analyses for the PNI scale are presented in Exhibit 1.4.11
Key indicators here at this exploratory stage include the variance explained
by a dominant factor and factor loadings of individual items. It should be
noted that some items have medium to low loadings with the first factor,
whereas other items have high loadings with the first factor and high loadings with other factors as well. What are these other factors? They could represent some content domain or some wording or methodological aspect of
the measure that is shared by a subset of items. For example, items worded
such that higher scores suggest higher PNI (i.e., positively worded items)
may be more closely related with each other than with items worded such
that lower scores suggest higher PNI (i.e., negatively worded items). Rotated
factor analysis can serve to simplify matrices.
Dimensionality—Confirmatory Factor Analysis and Structural Equation Modeling
Exploratory factor analysis provides initial evidence of dimensionality,
but confirmatory factor analysis is required for conclusive evidence.
Hypothetical correlation matrix for a 20-item scale with two factors comprising
Items 1–10 and Items 11–20, respectively.
Items 1–10 with Items 1–10:     1 on the diagonal, .9 off the diagonal
Items 11–20 with Items 11–20:   1 on the diagonal, .9 off the diagonal
Items 1–10 with Items 11–20:    0 throughout
α = [K / (K − 1)] · [K·r̄ / (1 + (K − 1)·r̄)]

r̄ = (100 × 0 + 90 × .9) / 190 = 0.426

α = [20 / (20 − 1)] · [20 · (0.426) / (1 + 19 · (0.426))] = 0.986

Figure 1.12    Coefficient α Versus Factor Analysis
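The point of Figure 1.12 can be checked numerically. The sketch below builds the hypothetical matrix (correlations of .9 within Items 1–10 and within Items 11–20, zero across blocks) and shows that the average inter-item correlation is the .426 used in the figure, while exactly two large eigenvalues emerge, signaling two factors despite the very high internal consistency:

import numpy as np

k = 20
R = np.zeros((k, k))
R[:10, :10] = 0.9          # Items 1-10 correlate .9 with each other
R[10:, 10:] = 0.9          # Items 11-20 correlate .9 with each other
np.fill_diagonal(R, 1.0)   # items are uncorrelated across the two blocks

off_diagonal = R[~np.eye(k, dtype=bool)]
print(round(off_diagonal.mean(), 3))             # average inter-item correlation: 0.426
print(np.sort(np.linalg.eigvalsh(R))[::-1][:3])  # two eigenvalues of 9.1, then 0.1s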
Exploratory factor analysis assumes that all items are related to all factors, whereas confirmatory factor analysis imposes a more restrictive model in which items have prespecified loadings on certain factors, and the factors themselves may be related to each other.
Exhibit 1.4
Exploratory Factor Analysis of the 20-Item PNI Scale
Results for the original 20-item PNI scale are presented below for the same sample
as in Exhibit 1.2 (Viswanathan, 1993). Correlations between items provide clues to
the likely factor structure that may underlie the data. Below, the factor matrix is presented and is the result of exploratory factor analysis. The correlation between each
item and each factor is presented here. A useful way to examine such a matrix is by
highlighting the high loadings that each item has on one or more factors. Here, many
items have their highest loading on the first factor.
Factor Matrix

          Factor 1    Factor 2    Factor 3    Factor 4
N1         .79874      .13506     −.01212     −.09337
N2         .36629      .25747      .27185      .10054
N3         .77987      .27999     −.14529     −.12853
N4         .63339     −.22174     −.29342      .00671
N5         .54453     −.02839     −.03588      .56900
N6         .63067      .21084      .10678     −.00046
N7         .81937      .07605      .03373      .28493
N8         .47641     −.32387      .11877      .01884
N9         .45266      .15707      .05986      .10623
N10        .72986      .21385     −.33176     −.08889
N11        .57628     −.47323     −.27856      .06503
N12       −.31912      .17510      .16366      .10411
N13        .55337     −.24397      .05221     −.23473
N14        .67780      .11840      .09128     −.27664
N15        .51572     −.28529      .30472     −.08219
N16        .67509     −.29761      .00241     −.14682
N17        .57455     −.12036      .56461      .04140
N18        .42251      .15295      .41612     −.05502
N19        .58197      .44522     −.22408     −.06153
N20        .66761     −.08386     −.17123      .13250
The communality of each variable, or the variance it shares with other variables, is
reported below.
Variable    Communality
N1            .66510
N2            .28447
N3            .72421
N4            .53649
N5            .62237
N6            .45359
N7            .75948
N8            .34632
N9            .24444
N10           .69640
N11           .63786
N12           .17012
N13           .42356
N14           .55830
N15           .44697
N16           .56587
N17           .66510
N18           .37809
N19           .59091
N20           .49961
Eigenvalues and the percentage of variance explained by each factor are presented below. Such information is used to decide on the number of meaningful factors, as discussed in the chapter.

Factor    Eigenvalue    Pct. of Var.    Cum. Pct.
1          7.32600         36.6            36.6
2          1.17967          5.9            42.5
3          1.13261          5.5            48.0
4          0.66097          3.3            51.3
Exploratory factor analysis uses a general model no
matter what the substantively motivated constraints are. Confirmatory factor analysis allows more precise specification of the relationship between
items and factors.
It should be emphasized strongly that, although exploratory factor analysis can be used in preliminary stages of measure development, it should be
followed up with confirmatory factor analysis (Gerbing & Anderson, 1988).
Confirmatory factor analysis tests carefully specified models of the relationship between items and factors (Gerbing & Anderson, 1988). Confirmatory
factor analysis also provides overall indexes of fit between the proposed
model and the data, which range in value from 0 to 1. Exploratory factor
analysis employs more of a shotgun approach by allowing all items to be
related with all factors (Figure 1.13).
[Figure: two panel path diagrams. In the exploratory factor analysis panel, both factors (I and N) are connected to all six items S1–S6, each with its error term ε1–ε6. In the confirmatory factor analysis panel, I is connected only to S1–S3 and N only to S4–S6.]

Figure 1.13    Confirmatory Factor Analysis Versus Exploratory Factor Analysis for the Speech Quality Scale

NOTES: I = Intelligibility factor of speech quality; N = Naturalness factor of speech quality; S1–S3 = Items for intelligibility; S4–S6 = Items for naturalness.
Whereas exploratory factor analysis considers internal consistency
among items of a measure, confirmatory factor analysis considers external
consistency across items of different measures or dimensions (Gerbing &
Anderson, 1988). As shown in Figure 1.13 and illustrated in Appendix 1.1,
the relationship between items from different dimensions is exclusively
through the relationship between the dimensions. Therefore, external consistency is assessed by comparing the observed correlation between two
items of different dimensions or constructs with the predicted correlation
that arises out of the hypothesized relationship between items and measures. Confirmatory factor analysis applies the criterion of external consistency wherein relationships between items across factors are assessed
(Appendix 1.1). In essence, the observed correlations between items are
compared to the hypothesized correlations in light of the specified model.
Moreover, overall fit indexes can be computed with confirmatory factor
analysis. Residuals between items denote the degree to which observed relationships deviate from hypothesized relationships. A positive residual
between two items suggests that the model underpredicts the relationship
between two items and vice versa.
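The logic of external consistency can be sketched directly: given prespecified loadings and a factor correlation, the model-implied correlation between items from different factors is the product of their loadings and the factor correlation, and residuals are observed minus implied values. The loadings below are hypothetical rather than estimates from the speech quality data:

import numpy as np

# Hypothetical 2-factor model: items 1-3 load on factor A, items 4-6 on factor B
loadings = np.array([[0.8, 0.0],
                     [0.7, 0.0],
                     [0.6, 0.0],
                     [0.0, 0.8],
                     [0.0, 0.7],
                     [0.0, 0.6]])
phi = np.array([[1.0, 0.5],
                [0.5, 1.0]])                     # hypothesized factor correlation

implied = loadings @ phi @ loadings.T            # model-implied item correlations
np.fill_diagonal(implied, 1.0)
print(implied[0, 3].round(2))                    # item 1 with item 4: 0.8 * 0.5 * 0.8 = 0.32

# residuals = observed_correlations - implied  (a positive residual means the
# model underpredicts the relationship between the two items)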
Confirmatory factor analysis is demonstrated using a speech quality scale
(Exhibit 1.5). Confirmatory factor analysis allows isolation of items that measure multiple factors that are of substantive importance (see Appendix 1.1
for simplified illustration and Figures 1.13 and 1.14). In a multidimensional
measure, each item ideally should measure only one dimension. Two items
measuring different dimensions may have an unduly large relationship, suggesting that they are influenced by a common factor, perhaps one of the
factors being measured or a different factor. If only one item is influenced
by an additional unknown factor, that is usually tolerable. In fact, all items
will probably have some unique variance. However, items that measure
more than one substantive factor have to be identified and deleted. For
instance, in Figure 1.14, reproduced from Gerbing and Anderson (1988),
Items 4 and 7 measure more than one factor.
Confirmatory factor analysis falls under the broader umbrella of structural equation modeling (SEM). The caution about appropriate use of statistical techniques and the introductory nature of the material covered here
should be reemphasized for structural equation modeling. Structural equation modeling combines econometric and psychometric approaches by
simultaneously assessing structural and measurement models; the former
deals with relationships between constructs, whereas the latter deals with
relationships between constructs and their measures (Figure 1.15). In a
typical econometric approach, such as regression, each measure is entered
into the analysis without accounting for measurement error. For example,
considerable random error in one or more measures in regression may decrease the observed relationship between two measures. Nevertheless, the observed relationship is assumed to reflect the true relationship between constructs. SEM combines the econometric and psychometric traditions to evaluate relationships between constructs while accounting for measurement error. At the measurement level, the advantage of SEM is in terms of specifying a precise model and testing it using confirmatory factor analysis (Figure 1.16). At the theoretical level, SEM allows assessment of relationships between constructs while accounting for measurement error. For instance, as discussed earlier, random error and unreliability reduce the ability of a measure to correlate with other measures. SEM estimates the relationship between two constructs while taking into account the degree of reliability of their measures.

[Figure: path diagram of two correlated factors, ξ1 with indicators x1–x5 and ξ2 with indicators x6–x10; two additional factors, ξ3 and ξ4, each influence a pair of items drawn from across the two sets.]

Figure 1.14    Example of Confirmatory Factor Analysis

SOURCE: Adapted from Gerbing, D. W., & Anderson, J. C., An updated paradigm for scale development incorporating unidimensionality and its assessment, Journal of Marketing Research, 25(5), pp. 186–192, copyright © 1988. Reprinted by permission of the American Marketing Association.

NOTE: ξ1 and ξ2 are two moderately correlated factors. x1–x5 and x6–x10 are indicators for ξ1 and ξ2, respectively. ξ3 and ξ4 are additional factors that provide a source of common covariance for two pairs of items across the two sets.
Exhibit 1.5
Confirmatory Factor Analysis on the Speech Quality Scale
The quality of text-to-speech (TTS) systems can be assessed effectively only on the
basis of reliable and valid listening tests. The method for these tests must be rigorous in voice sample presentation, respondent selection, and questionnaire preparation. These tests involve preparing several samples of synthesized output from
multiple TTS systems, randomizing the system-sentence combinations, and asking
listeners to score each output audio. Johnston (1996) notes that opinion tests of
speech quality are the basis of speech quality assessment. The Mean Opinion Scale
(MOS) has been the recommended measure of text-to-speech quality (ITU-T
Recommendation, 1994). It consists of seven 5-point scales that assess overall
impression, listening effort, comprehension problems, articulation, pronunciation,
speaking rate, and pleasantness (Lewis, 2001). Items in this scale are presented in
the Exhibit 1.5 Figure.
Past research lacks explication of the factors or dimensions of speech quality.
Moreover, past research employed exploratory factor analysis to assess factor structure, a procedure appropriate at preliminary stages of measure development that
needs to be followed up with testing using confirmatory factor analysis. An item in
the MOS asks the respondent to rate the overall quality of the synthesized speech clip
on a scale from 1 to 5 (Exhibit 1.5 Figure). The other items relate to various aspects
of synthetic speech such as listening effort, pronunciation, speed, pleasantness, naturalness, audio flow, ease of listening, comprehension, and articulation (Exhibit 1.5
Figure). Responses are gathered on the 5-point scales with appropriate phrase
anchors. The MOS combines an item on overall sound quality with other items that
are more specific and relate to different facets of speech quality.
Several issues are noteworthy with respect to the MOS. At a conceptual level, a
central issue is the factor structure of the domain of speech quality. A factor is essentially a linear combination of variables (Hair et al., 1998). In this context, factor
analysis is conducted to assess the number of factors that are extracted and to assess
the degree to which items correlate with specific factors. In terms of dimensionality,
a variety of different results have been reported using exploratory factor analysis,
including two factors referred to as intelligibility and naturalness and a separate
speaking rate item (Kraft & Portele, 1995; Lewis, 2001) and one factor (Sonntag,
Portele, Haas, & Kohler, 1999). More recently, Lewis suggested a revised version of
the MOS with modified 7-point response scales. Results suggested two factors, with
the speaking rate item loading on the intelligibility factor. Moreover, past research
has typically employed exploratory factor analysis and has not followed up with subsequent confirmatory factor analysis, as recommended in the psychometric literature
(Gerbing & Anderson, 1988). Confirmatory factor analysis offers a test of factor
structure by testing specific models and providing overall indexes of fit.
Also lacking in past research is an explication of the domain of the speech quality construct through a description of underlying factors such as intelligibility. Such
conceptual examination should ideally precede item generation and empirical assessment. Intelligibility is related to the extent to which words and sentences can be
understood; therefore, items should tap into factors that assess listening effort, pronunciation, speaking rate, comprehension problems, and articulation (Exhibit 1.5 Figure). These are aspects of speech that contribute to intelligibility. Therefore, contrary to results that suggest that speaking rate is a separate factor, it belongs conceptually in the intelligibility factor. Naturalness relates to the degree to which speech is similar to natural human speech; hence, items such as naturalness, ease of listening, pleasantness, and audio flow are relevant (see Exhibit 1.5 Figure). These are impressions about the speech and the feeling it engenders in respondents. This should be contrasted with specific aspects of respondents' cognition, such as speaking rate, listening effort, and pronunciation. Thus, conceptually, intelligibility and naturalness relate to specific cognitions about aspects of the speech versus broader impressions and feelings about the speech, respectively. Central here from a procedural viewpoint is the importance of explicating the domain of speech quality prior to testing through confirmatory factor analysis.

Exhibit 1.5 Figure    Speech Quality Scale*

Overall Impression: How do you rate the quality of the audio you just heard? (Excellent / Good / Fair / Poor / Very poor)

Listening Effort: How would you describe the effort you were required to make in order to understand the message? (Complete relaxation possible; no effort required / Attention necessary; no appreciable effort required / Moderate effort required / Considerable effort required / No meaning understood with any feasible effort)

Pronunciation: Did you notice anomalies in pronunciation? (No / Yes, but not annoying / Yes, slightly annoying / Yes, annoying / Yes, very annoying)

Speaking Rate: The average speed of delivery was: (Just right / Slightly fast or slightly slow / Fairly fast or fairly slow / Very fast or very slow / Extremely fast or extremely slow)

Pleasantness: How would you describe the pleasantness of the voice? (Very pleasant / Pleasant / Neutral / Unpleasant / Very unpleasant)

Ease of Listening: Would it be easy or difficult to listen to this voice for long periods of time? (Very easy / Easy / Neutral / Difficult / Very difficult)

Comprehension Problems: Did you find certain words hard to understand? (Never / Rarely / Occasionally / Often / All of the time)

Naturalness: How would you rate the naturalness of the audio? (Very natural / Natural / Neutral / Unnatural / Very unnatural)

Audio Flow: How would you describe the continuity or flow of the audio? (Very smooth / Smooth / Neutral / Discontinuous / Very discontinuous)

Articulation: Were the sounds in the audio distinguishable? (Very clear / Clear / Neutral / Less clear / Much less clear)

Acceptance: Do you think that this voice can be used for an interactive telephone or wireless hand-held information service system? (Yes / No)

*Items 1 to 9 of the scale described in the text refer to the consecutive items from listening effort to articulation.
Another issue with the MOS is the inclusion of an item that is global in nature—
assessing overall speech quality—with other items that are more specific to aspects of
speech quality, such as articulation and pronunciation. Such an approach is problematic; the scale should consist of either global items or specific items. The global
approach essentially asks for overall impressions of speech quality, whereas the specific approach has items representing different aspects of speech quality. For example, if different items relate to different factors such as intelligibility and naturalness, then a global item would relate to both factors, being broader than either. In fact, this argument is supported by inconsistent results that have been obtained. Although this global item was thought to belong to the naturalness factor, Lewis (2001) unexpectedly found that it related to the intelligibility factor. All of these issues were examined through confirmatory factor analysis.

Exhibit 1.5 Table    Results for Confirmatory Factor Analysis on Speech Quality Scale

Dataset              1            2            3             4            5            6
No. of factors    1      2     1      2     1       2     1       2     1      2     1      2
n               128    128   128    128   128     128   128     128   128    128   128    128
df               27     26    27     26    27      26    27      26    27     26    27     26
Chi-square     72.8   39.1* 49.5  38.3^* 118.2   57.6* 102.8   64.3*  88.0  52.8*  78.3  49.2*
NFI            0.95   0.97  0.96   0.97  0.90    0.95  0.90    0.94  0.92   0.95  0.94   0.97
NNFI           0.96   0.99  0.98   0.99  0.89    0.96  0.90    0.95  0.92   0.96  0.95   0.98
CFI            0.97   0.99  0.98   0.99  0.92    0.97  0.92    0.96  0.94   0.97  0.96   0.98
IFI            0.97   0.99  0.98   0.99  0.92    0.97  0.92    0.96  0.94   0.97  0.96   0.98
SRMR           0.06   0.04  0.05   0.04  0.08    0.06  0.07    0.06  0.07   0.05  0.05   0.05

NOTES: ^ p > .05; p < .05 for all other chi-square values. *Chi-square values of 1- versus 2-factor models significantly different at .05 level. n = sample size; df = degrees of freedom; NFI = normed fit index; NNFI = non-normed fit index; CFI = comparative fit index; IFI = incremental fit index; SRMR = standardized root mean square residual. The results for the 3-factor model (df = 24) with speaking rate as a separate item were identical to those for the 2-factor model with the following exceptions: Dataset 1: Chi-square = 39.0, NNFI = 0.98; Dataset 2: Chi-square = 36.5; Dataset 3: Chi-square = 57.4; Dataset 4: Chi-square = 64.2, NNFI = 0.94; Dataset 6: Chi-square = 47.6, NNFI = 0.97.
One hundred and twenty-eight employees of a U.S. company rated six systems on
the single item on overall speech quality, the 9-item speech quality scale, and the single item on system acceptability. Because each respondent rated more than one system
and the purpose here was to generate independent observations within a dataset, each
dataset analyzed related to independent responses to a particular system. Moreover,
some systems were slightly different across subsets of respondents. Therefore, the first
step was to identify datasets of responses to exactly the same system. This led to four
data sets of 64 respondents, each rating identical systems. These four datasets provided the best quality of data of independent observations of identical systems. Six
more datasets provided 128 independent observations; however, the systems rated
varied slightly across respondents. Sample sizes for these datasets met some of the criteria in past research for factor analysis (i.e., greater than five times the number of
items [> 45], or greater than 100 in the case of the larger datasets [Iacobucci, 1994]),
although more stringent criteria have been suggested as a function of factors such as
the communality of items (MacCallum, Widaman, Zhang, & Hong, 1999). Given the
high communality of items of speech quality, less stringent criteria appear to be appropriate here.
Several models were tested through confirmatory factor analysis. Several points
are noteworthy with the overwhelming pattern of results. First, the overall levels of fit
of both 1- and 2-factor models (i.e., Items 1, 2, 3, 8, and 9 on intelligibility and Items
4, 5, 6, and 7 on naturalness in Exhibit 1.5 Figure) are satisfactory by accepted (e.g.,
> 0.90) (Bagozzi & Yi, 1988; Bentler & Bonnet, 1980) and even by conservative
norms (e.g., > 0.95) (Hu & Bentler, 1998). All individual items had significant loadings on hypothesized factors. Whereas the 2-factor model improved on the 1-factor
model, the 1-factor model had a high level of fit. These results suggest that both 1- and
2-factor formulations of the 9-item scale are strongly supported through confirmatory
factor analysis. Contrary to the notion that the item on speaking rate loads on a
separate factor, here, speaking rate appears to be a part of the intelligibility factor.
An alternative model with speaking rate as a separate factor did not improve on the
2-factor fit indexes, although fit levels were so high as to allow little or no room
for improvement. A model with speaking rate loading on the intelligibility factor
led to superior fit as opposed to a model with speaking rate loading on the naturalness factor, consistent with the argument that speaking rate loads primarily on the
intelligibility factor.
Exploratory factor analysis is more of a shotgun approach, where all items can
be related to all factors. Confirmatory factor analysis can be used to carefully specify the relationship between items and factors, and between factors. Overall fit
indexes can be computed, and alternative models can be compared.
Input and Output From LISREL 8.5
for Confirmatory Factor Analysis
The program statements for confirmatory factor analysis are shown below. The first (DA) line states the number of input variables, the type of matrix to be analyzed (KM, a correlation matrix), and the number of observations. The LA line specifies the labels for the variables. This is followed by the correlation matrix (KM), the model (MO) statement specifying nine observed variables and one factor, the label for the factor (LK), and the output request (OU).
DA NI = 9 MA = KM NO = 128
LA
s1 s2 s3 s4 s5 s6 s7 s8 s9
KM
1.000000
0.595594  1.000000
0.478988  0.375190  1.000000
0.636631  0.448744  0.401614  1.000000
0.614881  0.567760  0.348624  0.756280  1.000000
0.618288  0.546916  0.410717  0.642101  0.722303  1.000000
0.647556  0.523697  0.398602  0.783353  0.798083  0.731203  1.000000
0.582192  0.534391  0.378980  0.398120  0.349791  0.401245  0.394849  1.000000
0.628814  0.561461  0.359818  0.647491  0.651060  0.624736  0.663076  0.483979  1.000000
MO NX = 9 NK = 1 LX = FR PH = ST
LK
sq
OU
Edited results are shown below.
DA NI = 9 NO = 128 NG = 1 MA = CM
SE
123456789/
MO NX = 9 NK = 1 LX = FU, FI PH = SY, FR TD = DI, FR
LK
sq
FI PH(1,1)
FR LX(1,1) LX(2,1) LX(3,1) LX(4,1) LX(5,1) LX(6,1) LX(7,1) LX(8,1) LX(9,1)
VA 1.00 PH(1,1)
PD
OU ME = ML RS XM
TI DA NI = 9 NO = 128
Number of Input Variables    9
Number of Y - Variables      0
Number of X - Variables      9
Number of ETA - Variables    0
Number of KSI - Variables    1
Number of Observations     128
TI DA NI = 9 NO = 128
Covariance Matrix
          s1     s2     s3     s4     s5     s6     s7     s8     s9
s1      1.00
s2      0.60   1.00
s3      0.48   0.38   1.00
s4      0.64   0.45   0.40   1.00
s5      0.61   0.57   0.35   0.76   1.00
s6      0.62   0.55   0.41   0.64   0.72   1.00
s7      0.65   0.52   0.40   0.78   0.80   0.73   1.00
s8      0.58   0.53   0.38   0.40   0.35   0.40   0.39   1.00
s9      0.63   0.56   0.36   0.65   0.65   0.62   0.66   0.48   1.00
TI DA NI = 9 NO = 128
Parameter Specifications
LAMBDA-X (free parameters 1 through 9)
        s1   s2   s3   s4   s5   s6   s7   s8   s9
sq       1    2    3    4    5    6    7    8    9

THETA-DELTA (free parameters 10 through 18)
        s1   s2   s3   s4   s5   s6   s7   s8   s9
        10   11   12   13   14   15   16   17   18
TI DA NI = 9 NO = 128
Number of Iterations = 9
LISREL Estimates (Maximum Likelihood)
Loadings of each item on the factor are listed below with associated standard errors
and t-values.
LAMBDA-X
          sq
s1      0.77   (0.08)   10.11
s2      0.65   (0.08)    8.08
s3      0.48   (0.09)    5.63
s4      0.84   (0.07)   11.54
s5      0.87   (0.07)   12.13
s6      0.81   (0.07)   10.94
s7      0.89   (0.07)   12.61
s8      0.52   (0.08)    6.15
s9      0.77   (0.08)   10.19
PHI
          sq
        1.00

THETA-DELTA
s1      0.41   (0.06)   7.21
s2      0.57   (0.08)   7.58
s3      0.77   (0.10)   7.81
s4      0.29   (0.04)   6.71
s5      0.25   (0.04)   6.38
s6      0.34   (0.05)   6.96
s7      0.21   (0.04)   6.02
s8      0.73   (0.09)   7.78
s9      0.40   (0.06)   7.19
Squared Multiple Correlations for X - Variables
        s1     s2     s3     s4     s5     s6     s7     s8     s9
      0.59   0.43   0.23   0.71   0.75   0.66   0.79   0.27   0.60
Various goodness-of-fit indexes are presented below.
Goodness of Fit Statistics
Degrees of Freedom = 27
Minimum Fit Function Chi-Square = 72.82 (P = 0.00)
Normal Theory Weighted Least Squares Chi-Square = 87.95 (P = 0.00)
Estimated Non-Centrality Parameter (NCP) = 60.95
90 Percent Confidence Interval for NCP = (36.28 ; 93.23)
Minimum Fit Function Value = 0.57
Population Discrepancy Function Value (F0) = 0.48
90 Percent Confidence Interval for F0 = (0.29 ; 0.73)
Root Mean Square Error of Approximation (RMSEA) = 0.13
90 Percent Confidence Interval for RMSEA = (0.10 ; 0.16)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.00
Expected Cross-Validation Index (ECVI) = 0.98
90 Percent Confidence Interval for ECVI = (0.78 ; 1.23)
ECVI for Saturated Model = 0.71
ECVI for Independence Model = 6.10
Chi-Square for Independence Model with 36 Degrees of Freedom = 756.24
Independence AIC = 774.24
Model AIC = 123.95
Saturated AIC = 90.00
Independence CAIC = 808.91
Model CAIC = 193.29
Saturated CAIC = 263.34
Normed Fit Index (NFI) = 0.90
Non-Normed Fit Index (NNFI) = 0.92
Parsimony Normed Fit Index (PNFI) = 0.68
Comparative Fit Index (CFI) = 0.94
Incremental Fit Index (IFI) = 0.94
Relative Fit Index (RFI) = 0.87
Critical N (CN) = 82.91
Root Mean Square Residual (RMR) = 0.061
Standardized RMR = 0.061
Goodness of Fit Index (GFI) = 0.87
Adjusted Goodness of Fit Index (AGFI) = 0.78
Parsimony Goodness of Fit Index (PGFI) = 0.52
TI DA NI = 9 NO = 128
Fitted Covariance Matrix
          s1     s2     s3     s4     s5     s6     s7     s8     s9
s1      1.00
s2      0.50   1.00
s3      0.37   0.32   1.00
s4      0.65   0.55   0.41   1.00
s5      0.67   0.57   0.42   0.73   1.00
s6      0.62   0.53   0.39   0.68   0.70   1.00
s7      0.68   0.58   0.43   0.74   0.77   0.72   1.00
s8      0.40   0.34   0.25   0.44   0.45   0.42   0.46   1.00
s9      0.60   0.50   0.37   0.65   0.67   0.63   0.69   0.40   1.00
Fitted Residuals
          s1      s2      s3      s4      s5      s6      s7      s8      s9
s1      0.00
s2      0.09    0.00
s3      0.11    0.06    0.00
s4     −0.01   −0.10    0.00    0.00
s5     −0.05    0.00   −0.07    0.03    0.00
s6     −0.01    0.02    0.02   −0.04    0.02    0.00
s7     −0.03   −0.06   −0.03    0.04    0.03    0.01    0.00
s8      0.18    0.19    0.13   −0.04   −0.10   −0.02   −0.07    0.00
s9      0.03    0.06   −0.01    0.00   −0.02    0.00   −0.02    0.08    0.00
Summary Statistics for Fitted Residuals
Smallest Fitted Residual = −0.10
Median Fitted Residual = 0.00
Largest Fitted Residual = 0.19
Stemleaf Plot
−1|00
−0|7765
−0|44332221110000000000000
0|12223334
0|6689
1|13
1|89
The standardized residuals between pairs of variables are useful to examine. For
example, the residual between Items 1 and 8 is positive and high, whereas the residual between Items 2 and 4 is negative and high. It should be noted that Items 4, 5, 6,
and 7 form the naturalness factor, and Items 1, 2, 3, 8, and 9 form the intelligibility
factor. In a 1-factor model, these residuals are consistent with two underlying factors.
Items 1, 2, 3, 8, and 9 tend to have large negative residuals or small residuals with
Items 4, 5, 6, and 7. Residuals within each set of items tend to be positive.
Standardized Residuals
          s1      s2      s3      s4      s5      s6      s7      s8      s9
s1        —
s2      2.37      —
s3      2.30    1.06      —
s4     −0.37   −3.08   −0.11      —
s5     −2.25    0.08   −2.05    1.57      —
s6     −0.22    0.49    0.45   −1.71    0.94      —
s7     −1.70   −2.14   −0.97    2.40    2.11    0.64      —
s8      4.00    3.52    1.96   −1.07   −3.07   −0.53   −2.27      —
s9      1.05    1.45   −0.30   −0.09   −0.85   −0.11   −1.14    1.80      —
Summary Statistics for Standardized Residuals
Smallest Standardized Residual = −3.08
Median Standardized Residual = 0.00
Largest Standardized Residual = 4.00
Stemleaf Plot
− 3|11
− 2|3311
− 1|77110
− 0|85432111000000000
0|14569
1|11468
2|01344
3|5
4|0
Largest Negative Standardized Residuals
Residual for s4 and s2 −3.08
Residual for s8 and s5 −3.07
Largest Positive Standardized Residuals
Residual for s8 and s1 4.00
Residual for s8 and s2 3.52
Edited results for a 2-factor model are presented below for the same data.
MO NX = 9 NK = 2 PH = ST
LK
sq1 sq2
FR LX(1,1) LX(2,1) LX(3,1) LX(4,2) LX(5,2) LX(6,2) LX(7,2) LX(8,1) LX(9,1)
OU
LISREL Estimates (Maximum Likelihood)
LAMBDA-X
          sq1                     sq2
s1      0.84   (0.07)  11.22       —
s2      0.72   (0.08)   8.97       —
s3      0.53   (0.09)   6.09       —
s4       —                       0.85   (0.07)  11.74
s5       —                       0.88   (0.07)  12.47
s6       —                       0.81   (0.07)  10.85
s7       —                       0.91   (0.07)  13.04
s8      0.63   (0.08)   7.59       —
s9      0.79   (0.08)  10.26       —
PHI
          sq1      sq2
sq1      1.00
sq2      0.86     1.00
        (0.04)
        24.26
THETA-DELTA
s1      0.30   (0.05)   5.60
s2      0.48   (0.07)   6.93
s3      0.72   (0.09)   7.60
s4      0.28   (0.04)   6.47
s5      0.22   (0.04)   5.93
s6      0.35   (0.05)   6.90
s7      0.18   (0.03)   5.31
s8      0.60   (0.08)   7.32
s9      0.38   (0.06)   6.33
Goodness of Fit Statistics
Degrees of Freedom = 26
Minimum Fit Function Chi-Square = 39.10 (P = 0.048)
Normal Theory Weighted Least Squares Chi-Square = 38.56 (P = 0.054)
Estimated Non-Centrality Parameter (NCP) = 12.56
90 Percent Confidence Interval for NCP = (0.0 ; 33.28)
Minimum Fit Function Value = 0.31
Population Discrepancy Function Value (F0) = 0.099
90 Percent Confidence Interval for F0 = (0.0 ; 0.26)
Root Mean Square Error of Approximation (RMSEA) = 0.062
90 Percent Confidence Interval for RMSEA = (0.0 ; 0.10)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.30
Expected Cross-Validation Index (ECVI) = 0.60
90 Percent Confidence Interval for ECVI = (0.50 ; 0.77)
ECVI for Saturated Model = 0.71
ECVI for Independence Model = 6.10
Chi-Square for Independence Model with 36 Degrees of Freedom = 756.24
Independence AIC = 774.24
Model AIC = 76.56
Saturated AIC = 90.00
Independence CAIC = 808.91
Model CAIC = 149.74
Saturated CAIC = 263.34
Normed Fit Index (NFI) = 0.95
Non-Normed Fit Index (NNFI) = 0.97
Parsimony Normed Fit Index (PNFI) = 0.68
Comparative Fit Index (CFI) = 0.98
Incremental Fit Index (IFI) = 0.98
Relative Fit Index (RFI) = 0.93
Critical N (CN) = 149.23
Root Mean Square Residual (RMR) = 0.043
Standardized RMR = 0.043
Goodness of Fit Index (GFI) = 0.94
Adjusted Goodness of Fit Index (AGFI) = 0.89
Parsimony Goodness of Fit Index (PGFI) = 0.54
Summary Statistics for Fitted Residuals
Smallest Fitted Residual = −0.13
Median Fitted Residual = 0.00
Largest Fitted Residual = 0.08
Stemleaf Plot
−12|0
−10|
−8|9
−6|75
−4|725
−2|9753
−0|63876421000000000
0|6936
2|2246
4|457711
6|95
8|0
Standardized Residuals
          s1      s2      s3      s4      s5      s6      s7      s8      s9
s1        —
s2     −0.33      —
s3      1.11   −0.08      —
s4      0.75   −2.00    0.33      —
s5     −0.86    0.63   −1.18    0.43      —
s6      1.02    1.12    0.86   −2.06    0.53      —
s7     −0.32   −1.14   −0.31    1.09   −0.18   −0.08      —
s8      1.81    1.91    0.83   −1.50   −3.28   −0.82   −2.66      —
s9     −1.91   −0.22   −1.45    2.06    1.69    2.01    1.71   −0.46      —
Summary Statistics for Standardized Residuals
Smallest Standardized Residual = −3.28
Median Standardized Residual = 0.00
Largest Standardized Residual = 2.06
Stemleaf Plot
− 3|3
− 2|710
− 1|95421
− 0|9853332211000000000
0|3456889
1|01117789
2|01
Largest Negative Standardized Residuals
Residual for s8 and s5 −3.28
Residual for s8 and s7 −2.66
EXHIBIT SOURCE: Adapted from Computer Speech and Language, 19(1), Viswanathan, M.,
& Viswanathan, M., Measuring speech quality for text-to-speech systems: Development and
assessment of a modified mean opinion score (MOS) scale, pp. 55–83, Copyright 2005.
Reprinted with permission from Elsevier.
Figure 1.15    Combined Measurement Component and Structural Component of the Covariance Structure Model

[Figure: path diagram in which latent variables ξ1, ξ2, and ξ3 (with indicators x1–x6 and errors δ1–δ6) are related to latent variables η1 and η2 (with indicators y1–y4 and errors ε1–ε4) through the γ coefficients; η1 and η2 are related to each other through β12 and β21, with equation errors ζ1 and ζ2, and the ξs covary (φ12, φ13, φ23).]

SOURCE: Long, J. S., Covariance structure models: An introduction to LISREL, p. 18, copyright © 1983. Reprinted by permission of Sage Publications, Inc.

NOTES: The measurement model for the independent variables consists of the following: ξ = latent variable; x = observed variable; δ = unique factor or error; λx = loading of x on ξ; for example, x1 = λx11·ξ1 + δ1.

The measurement model for the dependent variables consists of the following: η = latent variable; y = observed variable; ε = unique factor or error; λy = loading of y on η; for example, y1 = λy11·η1 + ε1.

The structural model relating independent to dependent latent variables consists of the following: η is related to ξ by γ, with error represented by ζ; for example, η1 = γ11·ξ1 + γ12·ξ2 + β12·η2 + ζ1. The β coefficients (β12, β21) relate the dependent latent variables to each other.
Figure 1.16    The Measurement Component of the Covariance Structure Model

[Path diagram: observed variables x1–x4, with errors δ1–δ4, load on latent variables ξ1 and ξ2; observed variables y1–y5, with errors ε1–ε5, load on latent variables η1, η2, and η3; covariances connect the latent variables, and correlated errors (θδ13, θε13, θε25) connect selected pairs of error terms.]

SOURCE: Long, J. S., Covariance structure models: An introduction to LISREL, p. 18, copyright © 1983. Reprinted by permission of Sage Publications, Inc.

As illustrated in Figures 1.15 and 1.16, and as explained in the adjoining
notations and equations, the structural model relates latent variables to each
other, whereas the measurement model relates to the operationalizations of
latent variables through observed variables. In specifying the measurement
model, items could be aggregated into parcels or subsets. Method factors
could be specified. For instance, if specific methods are used to collect data
on specific items, they could be incorporated into the model as separate factors in addition to the latent variables. For example, in Figure 1.14, ξ3 could
be considered as a method factor for items x2 and x7. Thus, the effect of specific methods is explicitly incorporated. Correlated errors between items also
can be specified, as shown in Figure 1.16 between x1 and x3 or between y1
and y3. Considerable caution and strong rationale are necessary to specify
method factors and correlated error terms. Otherwise, this approach essentially can be misused to fit the model to the data.
A key issue in structural equation modeling is identification, or whether
model parameters can be uniquely determined from the data (Bollen, 1989;
Kaplan, 2000). Another issue relates to procedures used to estimate the
model, such as maximum likelihood—used in the example on speech quality
in Exhibit 1.5—and generalized least squares (Kaplan, 2000). A considerable
amount of work has focused on the nature of statistical tests and overall fit
indexes, and their vulnerability to factors such as sample size. The chi-square
statistic reflects the level of mismatch between the sample and fitted covariance
matrices (Hu & Bentler, 1998); hence, a nonsignificant chi-square is desirable. However, this statistic is influenced by a number of factors, including
sample size. On the other hand, fit indexes quantify the degree of fit. They
have been classified in several ways, such as absolute (i.e., directly assessing
goodness of a model) versus incremental (assessing goodness of a model in
comparison with a baseline model, such as a model in confirmatory factor
analysis where each item is a distinct factor). Specific fit indexes, such as the
goodness-of-fit index, the adjusted goodness-of-fit index, the normed fit
index (ranging from 0 to 1), and the non-normed fit index, are each vulnerable to various factors, such as sample size (e.g., Bollen, 1986; Mulaik et al.,
1989). Fit indexes have been compared on criteria such as small sample bias
and sensitivity to model misspecification (Hu & Bentler, 1998). The standardized root mean square residual and the root mean square error of
approximation are other indexes used to gauge fit. Considerable literature has
compared various indexes of fit. As an exemplar of such literature, Hu and
Bentler (1998) recommend the standardized root mean square residual along with one of several other indexes, including the comparative fit index.
The comparative fit index currently is the most recommended index (Bagozzi
& Edwards, 1998). Important here is the usage of several indexes, including
those currently recommended in the literature and generally produced in
programs such as LISREL 8.5, as well as consideration and presentation of
full information on the results, ranging from chi-square to multiple fit
indexes to variance extracted and standardized root mean square residuals.
Computations for some fit indexes are shown in Appendix 1.1. What specific
value constitutes acceptable fit is another area with a wealth of literature.
Some guidelines are provided in Chapter 5.
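To make the preceding formulas concrete, the following minimal sketch in Python (assuming NumPy is available) computes several of the indexes discussed above from the chi-square statistics that programs such as LISREL report. Only the independence-model chi-square and degrees of freedom below are taken from Exhibit 1.5; the model chi-square, model degrees of freedom, and sample size are hypothetical illustrative values, not results from the speech quality example.

import numpy as np

def fit_indexes(chi_m, df_m, chi_i, df_i, n):
    """Selected fit indexes computed from the model (chi_m, df_m) and
    independence-model (chi_i, df_i) chi-square statistics; n = sample size."""
    nfi = (chi_i - chi_m) / chi_i                                     # Normed Fit Index
    nnfi = ((chi_i / df_i) - (chi_m / df_m)) / ((chi_i / df_i) - 1)   # Non-Normed Fit Index
    d_m, d_i = max(chi_m - df_m, 0.0), max(chi_i - df_i, 0.0)
    cfi = 1.0 - d_m / max(d_m, d_i)                                   # Comparative Fit Index
    rmsea = np.sqrt(d_m / (df_m * (n - 1)))                           # RMSEA
    return nfi, nnfi, cfi, rmsea

# Independence-model values from Exhibit 1.5; model values and n are hypothetical
print(fit_indexes(chi_m=38.6, df_m=26, chi_i=756.24, df_i=36, n=127))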
In comparing SEM with the traditional approach to measurement, the
traditional approach remains a necessary step in developing measures and assessing reliability, dimensionality, and validity. In using SEM, a key issue is
that many different models may fit the data. Therefore, SEM could be used
for confirmatory factor analysis after some initial work has been completed on a measure such as the PNI scale. Preliminary empirical and conceptual work would serve to purify measures before using SEM. SEM can
also be used for assessing different types of validity, as discussed subsequently. If confirmatory factor analysis is used in an exploratory fashion,
the findings have to be confirmed with new data. If this procedure is used
to test alternative models, then the chosen model should be tested with
new data. Such testing is extremely important, because many different
models can lead to good fit. Furthermore, a model could be modified in
many ways to achieve fit (for example, in a simple unidimensional model,
correlated errors could be used to achieve fit).

Figure 1.17    Hierarchical Factor Structure

[Diagram: Verbal IQ and Quantitative IQ relate to Verbal intelligence and Quantitative intelligence, respectively, which in turn load on General intelligence.]

If two items of the same
construct or of different constructs are influenced by an identifiable factor
(say, the same measure at different points in time), then using a model with
correlated error may be useful. Noteworthy here is that there are many
ways to improve fit, and plausible reasoning should precede modeling.
Furthermore, SEM ideally should not be used during the early stages of
measure development for item purification.
It should be noted that hierarchical factor analyses can also be performed
using structural equation modeling, wherein a set of factors loads onto higher
level factors, such as, say, quantitative and verbal intelligence loading onto
general intelligence. Thus, second-order factor analysis would involve an intermediate level of factors that loads onto higher order factors (Figure 1.17).
Validity
Whether a measure captures the intended construct, or whether the core
tapped by a measure is the intended core, is the purview of validity.
Assessing validity is like searching for an object at the bottom of the ocean
with a searchlight, not knowing what the object looks like! Such searches
may employ certain predetermined criteria (i.e., target areas on which to
focus that are akin to the representative domain and to content validity, in terms of what the object might look like and what indicators to look for in the ocean).
When multiple searchlights and search crews are used and converge on the
same location (i.e., multiple measures), then one type of evidence of validity is provided. When an unseen object has to be located, different types of
evidence are needed to locate it.
The distinction between reliability and validity, and how one can actually
be unrelated to the other, is brought out by the following stark example
modified from Nunnally (1978). Consider an exercise where a stone of a certain weight is thrown as far as possible over 20 trials with sufficient rest
between trials and under conditions with no wind and comfortable weather.
Is this a reliable measure of intelligence? Yes, if the same measure a week
later yields the same average throwing distance. In the social sciences, the criterion for consistency is often the relative standing of a set of people at test
versus retest as reflected in a correlation; requiring exactly the same throwing distance at test and retest would be overly stringent given the nature of the phenomenon being studied. Note that multiple trials and rest between trials may
minimize variations because of arm weariness and may average out errors.
Whether throwing stones captures intelligence is the purview of validity.
This is an extreme scenario because a reliable measure is typically likely to
have some degree of validity as a result of procedures employed to capture
the content of a construct. However, the scenario is instructive in illustrating that a reliable measure is nothing more than a replicable measure. Hence,
in colloquial terms, a friend who is late but consistently late by the same amount of time would still be considered reliable in the measurement world.
Several types of validity need to be considered (Churchill, 1979;
Cronbach & Meehl, 1955; Nunnally, 1978). Very brief descriptions are provided below, demonstrated in Exhibit 1.6, and listed in Figure 1.18.
1. Content validity (subjective judgment of content, following proper procedures to delineate content domain, using a representative set of items, assessing if items make sense)
2. Face validity (Does the measure look like it is measuring what it is supposed
to measure?)
3. Known-groups validity (Does the measure distinguish between groups that
are known to differ on the construct, such as differences in scores on measures between people with or without specific medical conditions?)
4. Predictive validity (Does the measure predict what it is supposed to predict,
such as an external criterion, say GRE or university entrance exam and grades
in college?)
5. Convergent validity (Does the measure correlate or converge with another
measure of the same construct?)
6. Discriminant validity (Is the measure of a construct not correlated with measures of constructs to which it is not expected to be related?)
7. Nomological validity (Does the measure of a construct relate to measures of
other constructs with which it is theoretically expected to be correlated; that
is, considering a nomological or theoretical network of constructs, does the
measure behave in theoretically expected ways?)
8. Construct validity (Does a measure measure what it aims to measure; does a
measure or operationalization correspond to the underlying construct it is
aiming to measure?)
Exhibit 1.6
Validity Analysis of the PNI Scale
Correlations Between PNI Scale and Other Scales

           PNI      MEN      MVL      MTH      ATF      ATC
  MEN     0.67
  MVL     0.56     0.41
  MTH     0.74     0.93     0.73
  ATF     0.57     0.42     0.46     0.50
  ATC     0.51     0.65     0.41     0.66     0.49
  ATS     0.61     0.64     0.49     0.69     0.79     0.92

NOTES: All correlations were significant at the 0.01 level. MEN = Enjoyment of mathematics scale; MVL = Value of mathematics scale; MTH = Total attitude toward mathematics scale; ATF = Attitude toward statistics field scale; ATC = Attitude toward statistics course scale; ATS = Total attitude toward statistics scale. All scales scored such that higher scores indicate more positive attitudes toward statistics, mathematics, and so on.

           PNI
  AMB    −0.24**
  NFC     0.30**
  SOC     0.03

NOTE: Scales scored such that higher scores indicate more tolerance for ambiguity and higher need for cognition.
**p < .01.
Scale Descriptions
Attitude Toward Mathematics Scale (MATH). This scale is divided into (a) an
Enjoyment of Mathematics scale (MATHEN)—“a liking for mathematical problems and a liking for mathematical terms, symbols, and routine computations”
(Aiken, 1974, p. 67); sample item: “Mathematics is enjoyable and stimulating to
me” (Aiken, 1974, p. 68); and (b) a Value of Mathematics scale (MATHVAL),
which relates to “recognizing the importance and relevance of mathematics to
individuals and to society” (Aiken, 1974, p. 67); sample item: “Mathematics is not
important for the advance of civilization and society” (Aiken, 1974, p. 68).
Intolerance for Ambiguity Scale (AMB). This scale measures intolerance for ambiguity, defined as “a tendency to perceive or interpret information marked by vague, incomplete, fragmented, multiple,
probable, unstructured, uncertain, inconsistent, contrary, contradictory, or unclear
meanings as actual or potential sources of psychological discomfort or threat”
(Norton, 1975, p. 608). A sample item is “I do not believe that in the final analysis
there is a distinct difference between right and wrong” (Norton, 1975, p. 616).
Attitude Toward Statistics Scale (STAT). This scale is a measure of attitudes held
by college students toward an introductory course in statistics (Wise, 1985). This
scale is divided into Attitude Toward Course (STCOURS) and Attitude Toward the
Field (STFIELD) subscales.
Need for Cognition (NFC). This scale measures need for cognition, defined as “the tendency of an individual to
engage in and enjoy thinking” (Cacioppo & Petty, 1982, p. 116).
Results
The relationship between the PNI scale and the Social Desirability scale (Crowne
& Marlowe, 1964) was assessed to examine whether responses to items indicating
a higher (or lower) preference for numerical information may be partially explained
by a motive to appear socially desirable, and to provide evidence for the discriminant validity of the PNI scale. A possible explanation for such responses may be
based on a perception that it is more socially desirable to indicate greater preference
for numerical information. This may be particularly likely given the composition
of the sample, which consisted of undergraduate students at a midwestern U.S.
university. Having taken quantitative courses, students may be likely to indicate a
greater preference for numerical information as a means of appearing socially desirable. PNI had no significant correlation with the 33-item Social Desirability scale
(r = 0.03). Therefore, it appears that social desirability is not a significant factor in
explaining responses to items on the PNI scale, with the result providing evidence
for the discriminant validity of the PNI scale.
The relationship between PNI and Need for Cognition—the “tendency of an
individual to engage in and enjoy thinking” (Cacioppo & Petty, 1982, p. 116)—was
assessed to provide evidence of the nomological validity of the PNI scale. Because
the PNI scale is argued to tap individuals’ preference for engaging in thinking
involving numerical information (as captured by aspects such as enjoyment and liking), a positive relationship was expected between PNI and Need for Cognition.
Individuals who have a tendency to enjoy thinking may be more likely to enjoy
thinking based on a particular type of information (i.e., numerical information) than
individuals who do not enjoy thinking. PNI had a positive correlation with Need for
Cognition (r = 0.30, p < .01). The significant correlation provides evidence for the
claim that the PNI scale taps proclivity toward thinking based on one type of
information (i.e., numerical information). However, the size of the correlation suggests that a tendency to enjoy thinking per se is not strongly related to a tendency
to enjoy thinking based on information in a numerical form, perhaps because
numerical information is just one of several types of information that could be used
in thinking. This is a form of evidence for nomological validity in tapping enjoyment
of thinking based on one type of information.
Given the numerical content in statistics and mathematics, positive correlations
between preference for numerical information and attitudes toward statistics and
mathematics were expected in order to provide evidence of the nomological validity of the PNI scale. Positive correlations were obtained between the PNI scale and
the Enjoyment of Mathematics scale (r = 0.67, p < .01); the PNI scale and the Value
of Mathematics scale (r = 0.56, p < .01); and the PNI scale and the total Attitude
Toward Mathematics scale (r = 0.74, p < .01). Positive correlations were also
obtained between the PNI scale and the attitude toward statistics course scale
(r = .57, p < .01); the PNI scale and the attitude to statistics field scale (r = 0.51,
p < .01); and the PNI scale and the total statistics scale (r = 0.61, p < .01). The
results suggest that PNI has moderate to strong relationships with the various
subscales relating to mathematics and statistics, thereby providing evidence for the
nomological validity of the PNI scale due to the overlap between these scales in
terms of numerical content. The PNI scale had comparable or higher correlations
with the subscales of attitude toward mathematics (statistics), such as the
Enjoyment of Mathematics and Value of Mathematics scales, than the correlations
between these subscales, possibly because PNI overlaps with both subscales in
terms of numerical content, whereas the subscales overlap in terms of mathematical
(statistical) content.
EXHIBIT SOURCE: Adapted from Viswanathan (1993).
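As a minimal sketch of how validity correlations such as those in Exhibit 1.6 and their significance levels can be computed, the following Python fragment (assuming NumPy and SciPy are available) correlates two sets of scores; the arrays below are hypothetical totals, not the original data.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical total scores for the same respondents on two measures
pni_totals = np.array([31, 58, 83, 102, 122, 75, 64, 90, 48, 110])
nfc_totals = np.array([40, 55, 70, 88, 101, 66, 59, 80, 45, 95])

r, p = pearsonr(pni_totals, nfc_totals)  # Pearson correlation and two-tailed p-value
print(f"r = {r:.2f}, p = {p:.3f}")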
Figure 1.18    Types of Validity

[Diagram: content validity, predictive validity, convergent validity, discriminant validity, and nomological validity shown as components of construct validity.]
Content validity relates to whether a measure is representative of a domain
and whether appropriate procedures have been employed for the development
of items. Content validity comes into question if a measure is construed to represent just one or a few aspects of the domain. Assessments of content validity
focus on whether aspects of the domain of a construct have been excluded and
aspects of the domain of a distinctly different construct have been included.
Content validity is a subjective assessment of the representativeness of a measure and of the procedures used to delineate the domain and develop items. Whereas most other validity assessments are mainly empirical in nature, content validity is not, yet it remains a central indicator of a measure's representativeness. For the PNI scale, content validity evidence rests on documentation of procedures for
domain delineation and item generation (Exhibit 1.1). Face validity relates to
whether a measure looks like it is measuring what it is supposed to measure,
a very preliminary judgment about validity.
Several other types of validity depend on empirical evidence. Convergent
validity tests assess the degree to which two measures of a construct converge
(i.e., are highly correlated or strongly related). Convergent validity between
two measures is somewhat akin to internal consistency between items of
a measure. The logic of convergent validity is that a measure of a construct
should converge with another validated measure of the same construct, assuming such a measure is available. However, convergence alone is not definitive
evidence of validity because both measures could be invalid. No two measures
in practice are likely to be exactly identical and lead to perfect convergence. In
fact, such similar measures may not serve to test convergent validity if they are
really trivially different from each other. The aim here is to attempt different
approaches to measuring the same construct, which generally may translate to
less-than-perfect correlation among alternative measures. Two different ways
of measuring the same thing may not, in practice, be identical. Just as items get
at different aspects of a domain, different ways or methods of measuring
something may get at different aspects of a construct. Different approaches
also differ methodologically, contributing to the less-than-perfect correlation
to be expected between two measures of the same construct. In fact, as knowledge in an area progresses, researchers may well conclude that what were
originally considered two measures of the same construct are really measures
of different constructs. But this is one way in which knowledge in an area
progresses, different ways of attempting to measure constructs being central
in advancing such knowledge. Convergence between measures of the same
construct is a matter of degree for a variety of reasons. Note that convergent
validity is not involved when measures of two different constructs are shown to be
related. Rather, it is useful to restrict this type of validity to the relationship
between two measures of the same construct. Relationships between measures
of related constructs are captured under nomological validity.
Nomological validity demonstrates that a measure is related to a
construct to which it is theoretically expected to be related. Discriminant
validity assesses the degree to which a measure is not related to constructs to
which it is not supposed to be related. Hence, it is the empirical counterpart
of the notion that domain delineation should spell out what a construct is not.
Significant relationships with constructs to which a measure is expected to
be related provide evidence of nomological validity, whereas nonsignificant
relationships with constructs to which a measure is expected to be unrelated
provide evidence of discriminant validity. However, such a statement is
polarized in that evidence of small relationships could also provide evidence
of either type of validity (e.g., the relationship between PNI and need for cognition; see Exhibit 1.6). For example, if the expected theoretical relationship
is small and such a pattern is found, then it could provide evidence of either
nomological or discriminant validity.
Discriminant validity and nomological validity are two sides of the same
coin. A small relationship between a measure and another measure could
provide evidence of both and could well be considered nomological validity; that is, are measures of constructs behaving in theoretically expected
ways in nonrelationships or in small relationships? A simple version of
discriminant validity is to show that a measure of a construct is unrelated
to a measure of a construct to which it is not supposed to be related. For
example, a measure of intelligence could be shown to have discriminant
validity by showing a nonrelationship with stone throwing, a measure of
arm strength. A measure of preference for numerical information could be
shown to be unrelated to a measure of extroversion. Clearly, a completely
unrelated measure could be selected from an infinite set to demonstrate
discriminant validity. Often, measures are shown to be unrelated to individual differences in social desirability in order to demonstrate that the
measure in question is not eliciting responses based on respondents’ needs
to appear socially desirable. Such assessments play an important role in
showing that scores on measures are not driven by response tendencies unrelated to content.
However, showing that a measure is unrelated to measures of other
constructs may be a weak form of evidence of discriminant validity. After all,
a construct would likely be unrelated to numerous constructs from completely unrelated domains. The need for a reasonable test arises only when
there is some possibility of a relationship and a specific psychometric purpose in assessing the relationship, such as showing that a measure is not
influenced by social desirability. For the PNI scale (Exhibit 1.6), its relationship with social desirability is assessed to show that responses are not
being influenced by social desirability—a plausible alternative for students
rating preference for numerical information. In a similar vein, showing that
a measure is unrelated to a few measures from a large pool of potentially
unrelated measures in itself does not provide sufficient evidence about the
discriminant validity of the measure. Perhaps stronger evidence can be provided by showing that a measure has a small relationship with a measure of
a construct to which it is likely to be related; that is, these related measures
are indeed tapping into different constructs. For the PNI scale (Exhibit 1.6),
its relationship with need for cognition is assessed to show a significant but
small relationship. A subtler form of discriminant validity is to show a differential relationship between two related measures and a third measure.
Examples of each of these types of evidence are presented for the PNI
scale (Exhibit 1.6). Here, the PNI scale and a measure of a construct to
which it is strongly related are shown to have a differential relationship with
a third measure. Thereby, the PNI scale is shown to be distinct from a measure of a construct to which it is strongly related.
Predictive validity is related to whether a measure can predict an outcome,
and it has relevance in practical settings. In descriptive research, predictive
validity is often subsumed under nomological validity. Known-groups validity is a form of predictive validity wherein a measure is shown to distinguish
between known groups in terms of scores (e.g., differences in scores on measures between people with or without specific medical conditions in clinical
settings, or differences in PNI scores for undergraduate majors in mathematics vs. art).
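A known-groups comparison of this kind amounts to testing whether mean scores differ across groups expected to differ on the construct. The following is a minimal sketch in Python (assuming NumPy and SciPy); the PNI totals for mathematics and art majors are hypothetical.

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical PNI totals for two groups expected to differ on the construct
math_majors = np.array([110, 98, 121, 105, 117, 99, 108])
art_majors = np.array([82, 75, 90, 70, 88, 79, 85])

t, p = ttest_ind(math_majors, art_majors)  # independent-samples t test
print(f"t = {t:.2f}, p = {p:.4f}")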
Construct validity is an umbrella term that asks the basic question, Does
a measure measure the construct it aims to measure? There is no single empirical coefficient for construct validity, only accumulating degrees of evidence for it, drawing on all the types of validity listed above.
Construct validity is the most important and most difficult form of validity
to establish. It is akin to establishing causality between variables in substantive research, except that here the link in question is between a construct and its measure.
Although Exhibit 1.6 presents correlations between measures, it should
be noted that structural equation modeling can be used to assess different
types of validity while accounting for measurement error, such as unreliability in measures. Relationships between a focal measure and measures of
other constructs in nomological validity tests can be assessed while accounting for measurement error. Discriminant validity can be shown by differential relationships between a target measure of a construct and a measure of
a related construct and a third variable (Judd, Jessor, & Donovan, 1986).
Discriminant validity could also be demonstrated by specifying a one- versus
two-factor model of measures of two different constructs and showing that
the two-factor model has superior fit.
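The one- versus two-factor comparison is typically evaluated with a chi-square difference test for nested models; a significant difference favors the less restrictive two-factor model and thus supports discriminant validity. The sketch below (in Python with SciPy) uses hypothetical fit statistics.

from scipy.stats import chi2

# Hypothetical chi-square statistics for nested confirmatory factor models
chi_one_factor, df_one_factor = 96.4, 35   # all items forced onto one factor
chi_two_factor, df_two_factor = 41.2, 34   # items on two correlated factors

delta_chi = chi_one_factor - chi_two_factor
delta_df = df_one_factor - df_two_factor
p = chi2.sf(delta_chi, delta_df)  # p-value for the chi-square difference
print(f"delta chi-square = {delta_chi:.1f}, df = {delta_df}, p = {p:.4f}")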
An example of the entire measure development process for a somewhat
different construct is presented in Exhibit 1.7. This construct is, in some
ways, more complex than PNI and is used to illustrate a different scenario.
Exhibit 1.7
Developing a Scale of Consumer Literacy: A Different Type of
Scale
Whereas the scales used as examples earlier relate to attitudes, other scales, such as intelligence tests, may be ability-related. Some scales may be on the margin between ability
and attitude, such as the consumer literacy scale for people with low levels of literacy.
First Phase: Literature Review and Exploratory Interviews
In the first phase, exploratory research is undertaken to (a) examine parallel measures of ability and skill through a literature review; (b) assess adult education tests
and textbooks for examples in consumer contexts; (c) examine assessment tools in
literacy, adult education, and, more broadly, education, such as the National Adult
Literacy Survey (1993); and (d) interview educators in adult education. The aim of
this phase is to develop comprehensive grounding before developing the measure and
conceptualizing the construct.
Second Phase: Conceptualization and Domain Delineation
In the second phase, the domain of consumer literacy is carefully delineated. In
conceptually defining a construct, the researcher should show that it is sufficiently different from other existing constructs and that it makes a sizable contribution in terms of explanatory power.
This step involves explicating what the construct is and what it is not. This is also a
step where the proposed dimensionality of the construct is described explicitly.
Keeping in focus the need to have a measure at the low end of the literacy continuum, this delineation of domain needs to list the basic skills necessary to complete
fundamental tasks as consumers. A matrix of basic reading, writing, and mathematical skills versus consumer tasks should be created to provide a complete delineation.
This listing of skills and associated consumer tasks needs to be examined by several
individuals with different types of expertise, such as consumer researchers, adult
education teachers, education researchers, and students at adult education centers
who have progressed through to completion of the GED and further education.
A team of experts can be formed consisting of teachers/directors at adult education
centers, researchers in consumer behavior, and researchers in educational measurement. Experts can be provided with conceptual definitions of various dimensions of
consumer literacy and associated listings of skills and consumer tasks and asked to
evaluate the definitions and the domain delineation for completeness and accuracy.
Such careful delineation of the domain is very important in any measure development
process, all the more so for a construct as complex as consumer literacy.
Third Phase: Measure Construction and Content Validity Assessment
In the third phase, the measure should be constructed through item generation. In
this phase, a large pool of items (i.e., consumer tasks) is developed to identify specific
aspects of the domain explicated in the previous phase. A large pool of items is
important, as are appropriate procedures to develop individual items. Redundancy is
a virtue at this stage in the process, with the goal being to cover important aspects of
the domain (DeVellis, 1991). Researchers have documented a host of issues that need
to be addressed in developing items (DeVellis, 1991; Haynes et al., 1995). Several
such procedures should be employed, such as asking experts to generate items on the
basis of definitions of specific dimensions of consumer literacy, and asking experts to
evaluate items in light of conceptual definitions. Thus, item generation draws on the
team of experts mentioned in Phase 2. Through these means, the ever-pertinent issue
of understanding of items by low-literate individuals is also addressed. Thus, items
and usage conditions are assessed together and modified. The procedures discussed
so far relate to the content validity of a measure, or whether a measure adequately
captures the content of a construct. There are no empirical measures of content validity, only assessments based on whether (a) a representative set of items was developed
and (b) appropriate procedures were employed in developing items (Nunnally,
1978). Hence, assessment of content validity rests on the explication of the domain
and its representation in the item pool.
Fourth Phase: Reliability and Validity Assessment
In the fourth phase, empirical studies should be conducted to assess the reliability,
dimensionality, and validity of the consumer literacy measure. Convenience samples
can be drawn for all studies from students at adult education centers. This method also
allows for access to the records of participants on important and relevant tests. The
consumer literacy measure is broad in scope and ideally applies to low-literate individuals in general. In this regard, students at adult education centers should be distinguished from other functionally low-literate consumers in their motivation to become
functionally literate. Nevertheless, the choice of students at adult education centers
greatly facilitates efficient access and recruitment of a group that is very difficult to sample and provides a sound starting point. Convenience sampling is suited for these studies rather than probabilistic sampling because the aim is not to establish population
estimates, but rather to use correlational analysis to examine relationships between
items and measures. Several pilot tests should be conducted in which students at adult
education centers complete the measure. Through observation and explicit feedback
from students and teachers, the measure can be modified as needed. Such pilot testing
is expected to be carried out in many stages, each using a small set of respondents, with
the measure being adjusted between stages. Improvements in the measure during pilot
testing, both in changing individual items and in addressing administration procedures,
are likely to be considerable and to go a long way toward minimizing sources of
measurement error in the final measure. Four to five larger scale studies are
required in this phase, each employing sample sizes of 100–150. The first two to
three large-scale studies should aim to assess and purify the scale through assessment of the reliability of individual items, and through initial assessment of validity
and dimensionality. Subsequent studies with a purified measure are more confirmatory and assess dimensionality as well as test validity in a detailed manner.
After item generation, measures typically are assessed for internal consistency
reliability, and items are deleted or modified. Internal consistency frequently is the first
empirical test employed and assesses whether the items in a set are consistent with each
other or belong together. Such internal consistency assessment is pertinent in evaluating items within each dimension of consumer literacy for consistency in response.
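A minimal sketch of how coefficient alpha, the usual index of internal consistency, could be computed for the items of one dimension is shown below in Python (assuming NumPy); the response matrix is hypothetical, with rows as respondents and columns as items.

import numpy as np

def coefficient_alpha(responses):
    """Cronbach's alpha: k/(k - 1) * (1 - sum of item variances / variance of totals)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 5 respondents x 4 items
x = np.array([[2, 3, 2, 3],
              [4, 4, 5, 4],
              [5, 6, 5, 6],
              [3, 3, 4, 3],
              [6, 7, 6, 6]], dtype=float)
print(f"alpha = {coefficient_alpha(x):.2f}")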
An important form of reliability in this context is test-retest reliability, where the
measure is completed twice by the same individuals with an interval of, typically, a
few weeks. In a sense, test-retest reliability represents a one-to-one correspondence
with the concept of reliability, which is centered on replicability or consistency. Test-retest reliability offers direct evidence of such consistency over time.
Exploratory factor analysis should be employed for preliminary evaluation of
the dimensionality of the measure by assessing the degree to which individual items
are correlated with respective dimensions or factors. However, confirmatory
factor analysis is required for conclusive evidence (Gerbing & Anderson, 1988).
Confirmatory factor analysis imposes a more restrictive model, where items have
prespecified relationships with certain factors or dimensions.
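One common, if rough, heuristic for a preliminary look at dimensionality in exploratory factor analysis is to inspect the eigenvalues of the inter-item correlation matrix, for instance counting eigenvalues greater than 1 as a first indication of the number of factors. The sketch below (Python with NumPy) uses a hypothetical response matrix.

import numpy as np

# Hypothetical responses: 8 respondents x 4 items
x = np.array([[2, 3, 6, 5],
              [4, 4, 3, 2],
              [5, 6, 2, 3],
              [3, 3, 5, 5],
              [6, 7, 1, 2],
              [2, 2, 6, 6],
              [5, 5, 2, 2],
              [4, 5, 4, 3]], dtype=float)

corr = np.corrcoef(x, rowvar=False)           # inter-item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest to smallest
print("eigenvalues:", np.round(eigenvalues, 2))
print("factors suggested (eigenvalue > 1):", int((eigenvalues > 1).sum()))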
Item-response theory (Hambleton, Swaminathan, & Rogers, 1991) offers another
scaling technique that is particularly relevant here, because sets of items may require
similar levels of skill. Therefore, responses to such sets of items may be similar. Item-response theory can specify the relationship between a respondent’s level and the
probability of a specific response. This approach can be used to assess measurement
accuracy at different levels of ability and also to construct measures that have comparable accuracy across different levels of ability. The use of item-response theory for
the National Adult Literacy Survey (1993) provides guidance in this regard.
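The basic idea of item-response theory can be illustrated with the two-parameter logistic model, in which the probability of answering an item correctly depends on the respondent's ability and on the item's discrimination and difficulty. The sketch below (Python with NumPy) uses hypothetical parameter values.

import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic item response function:
    theta = ability, a = item discrimination, b = item difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Probability of a correct response at several ability levels for a
# hypothetical item of average difficulty and moderate discrimination
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(p_correct(theta, a=1.2, b=0.0), 2))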
Tests of different types of validity take measure development to the realm of cross-construct relationships. Whereas reliability establishes consistency and the lack of
random error, validity pertains to whether a measure measures what it purports to
measure (i.e., the lack of random and systematic error). Several types of validity need
to be considered, and very brief descriptions are provided below along with examples of possible tests.
Content validity relates to a subjective judgment of content, based on adherence
to proper procedures to delineate content domain and on generation of a representative set of items. For example, content validity could be assessed and enhanced by
a team of experts reviewing the conceptual definition, domain delineation, and item
generation. Convergent validity relates to whether the measure correlates or converges with another measure of the same construct. This type of validity is not
directly applicable here because no other measure of consumer literacy is available.
However, a proxy measure can be created from current functional literacy measures
using tasks oriented toward consumer economics. Other types of validity include
known-groups validity, where a measure is shown to distinguish between known
groups, such as individuals with or without an illness in clinical settings. In this
context, this form of validity can be evaluated by assessing differences in the consumer literacy measure for students with 0 to 4th- versus 5th- to 8th- versus 9th- to
12th-grade reading levels. Predictive validity is related to whether a measure can
predict a criterial outcome. In this context, one way in which predictive validity can
be assessed is by relating the measure to performance in specific consumer tasks.
Nomological validity demonstrates that a measure is related to a construct to which
it is theoretically expected to be related. Examination of the relationship between
consumer literacy and traditional functional literacy tests is an example of a nomological validity test. Discriminant validity assesses whether the measure of a construct
correlates weakly or not at all with measures of constructs to which it is expected to
relate weakly or not at all, respectively. Small or moderate correlations between a
measure of consumer literacy and basic measures of reading literacy would provide
a form of evidence of discriminant validity.
The aim of all of the measurement procedures described above is to provide
evidence of construct validity, an umbrella term that asks the basic question, “Does
a measure measure the construct it aims to measure?” There is no empirical coefficient for construct validity, just increasing amounts of evidence for it.
In all phases of measure assessment, both individual items and administration
procedures need to be adjusted carefully. This is critical for the proposed measure;
both the items of a measure and the conditions of their use have to be carefully
assessed to minimize measurement error.
It should be noted that validity tests rest on several assumptions (Figure
1.19). To validate a measure of a construct by relating it to another measure
of a different construct (say, in a test of nomological validity), it is important
to use a reliable and valid measure of this different construct. It is also
important to have sufficient support for the hypothesized relationship between the two constructs. Thus, only one of the three branches in Figure 1.19 is being tested, and the results can be interpreted clearly.
Looking at the broad picture of the relationship between reliability and
validity, reliability is a necessary but not sufficient condition for validity.
Although reliability and validity are both a matter of degree, generally
speaking, a valid measure is reliable, but a reliable measure is not necessarily valid. Reliability is about random error, and if random error has not
been reduced, the issue of validity does not arise.12

Figure 1.19    Assumptions of Validity Tests

[Diagram: constructs X and Y, measured by x and y, respectively, with links from x to X, from X to Y, and from Y to y.]

Assumptions:
− Construct X is related to Construct Y
− y is a reliable and valid measure of Y
∴ rxy can be used to assess rxX, because rxy = rxX · rXY · rYy

If a measure is reliable, it means that a large proportion of the total variance is due to common
sources. Then, it is pertinent to discuss validity. Even if a measure looks
face valid, if it is not reliable, it is not valid. Reliability and validity have
been viewed as being the ends of a continuum where reliability relates to
agreement between the same or similar methods, whereas validity relates to
agreement between different methods (Campbell & Fiske, 1959). This view
is consistent with the difference between reliability and convergent validity
discussed here.
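To illustrate the expression in Figure 1.19 with purely hypothetical values, suppose the measure x correlates 0.8 with its construct X, the construct-level relationship rXY is 0.5, and the measure y correlates 0.9 with its construct Y. The observed correlation would then be roughly rxy = 0.8 × 0.5 × 0.9 = 0.36, noticeably weaker than the construct-level relationship; this is why unreliable or invalid measures of Y undermine a validity test of x.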
The multitrait-multimethod approach systematically examines the effect
of using identical versus different methods (Campbell & Fiske, 1959). It can
be illustrated using the PNI construct, essentially adapting similar matrices
presented in past research (Campbell & Fiske, 1959) (see Figure 1.20).
Consider three different ways of measuring preference for numerical information: (a) a self-report scale as shown earlier (self-report), (b) a rating by
observers based on a half-hour of observation of behavior in a controlled
setting on the degree to which individuals spend time with numerically
oriented tasks (observation), and (c) a diary that individuals complete of
numerically oriented tasks in which they engaged every day for a few weeks
(diary method). Consider also two other traits that are measured through
these three methods: Need for precision (NFP) (Viswanathan, 1997) and
social desirability (SOC). A hypothetical pattern of correlations is shown in
Figure 1.20 and represents data from the same individuals on these different
methods and traits.
Figure 1.20    Multitrait-Multimethod Matrix

[Matrix of hypothetical correlations among three traits (PNI, NFP, and SOC), each measured by three methods (self-report, observation, and diary). The reliability diagonal, the validity (monotrait-heteromethod) diagonal, the heterotrait-monomethod triangles, and the heterotrait-heteromethod triangles are marked; selected values from the matrix are discussed in the text.]
The reliability diagonal refers to the reliability of the measures and is
essentially the correlation of a measure with itself. This is consistent with
indicators of reliability, such as coefficient alpha, being the proportion of
variance attributable to common sources and test-retest reliability being
the test-retest correlation. Reliability is the relationship of a measure with
itself. The validity diagonal (that is, heteromethod-monotrait) lists correlations between measures of the same construct using different methods. This
is convergent validity, the relationship between two different measures of
the same construct. The heterotrait-monomethod triangle shows the relationship between measures when using the same method. These correlations may
have been inflated by the use of the same method, say, capturing individuals’ tendencies to respond in certain ways to certain methods, such as in using
the positive end of self-report scales. This is an example of a type of systematic error—consistent differences across individuals over and above the
construct being measured—that may inflate or deflate correlations. Whereas
random error reduces observed correlations, systematic error may inflate or
deflate observed correlations. A simplified example of systematic error was
illustrated with the weighing machine example. Subsequent chapters will
introduce a nuanced discussion of systematic error. To examine the effect
of method, these correlations have to be compared to the heterotraitheteromethod triangle. The effect of using the same method can be gauged by
comparing correlations between measures using different methods versus the
same method. For example, the correlation between PNI and NFP is considerably lower when using observation for NFP and self-report for PNI (0.42)
than when using self-report for both (0.65). Therefore, the differences in correlations using the same versus different methods provide an estimate of the
effect of a common method. In contrast, the diary method does not have as
much of a common method effect as self-report (0.52 vs. 0.45). Social desirability has a correlation of 0.24 and 0.35 with PNI and NFP, respectively,
using self-report. However, for PNI, the corresponding correlations in the
heteromethod-heterotrait triangle are considerably lower (0.01 and 0.05)
but remain smaller, though still sizable, for NFP (0.27 and 0.25). This pattern points to
problems in the self-report NFP scale in terms of being associated with social
desirability, an undesirable characteristic in this context. On the other hand,
the diary method of assessing NFP does not have this problem.
The multitrait-multimethod approach outlined above suffers from several
problems (Bagozzi & Yi, 1991). One problem is the absence of clear standards for ascertaining when any particular criterion is met, because only
rules of thumb are available. Another problem is the inability to assess
separate amounts of trait, method, and random error in the data, all
confounded by examining only the raw correlations. Several assumptions are
made, such as no correlations between trait and method factors, all traits
being equally influenced by method factors, and method factors being uncorrelated (Bagozzi & Yi, 1991). The use of structural equation modeling for
multitrait-multimethod matrices has several advantages, such as providing
measures of the overall degree of fit and providing information about
whether and how well convergent and discriminant validity are achieved
(Bagozzi & Yi, 1991). Figure 1.21 illustrates such an approach where
method factors are incorporated into the model. Other approaches to modeling the relationships include using correlated error terms between items
that share the same method.
Figure 1.21    Using Structural Equation Modeling for Multitrait-Multimethod Data

[Path diagram: nine observed variables x1–x9, with errors ε1–ε9, load on three correlated trait factors (PNI, NFP, SOC) and on three correlated method factors (self-report, observation, diary).]

General Issues in Measurement

One issue about constructs is noteworthy: The strength and relevance of
constructs may vary across people. The notion of a metatrait has been used
to refer to possessing or not possessing a trait (Britt, 1993). Moreover, many
questions used in a research design are really just that and do not purport to
assess an underlying abstract construct. For instance, they may assess specific
behaviors, such as spending on entertainment. Nevertheless, examination of
the intended thing being measured using the measure development process
described can be very beneficial. Such examination can lead to insights, precise wording of the question to enable consistent interpretation, and a broadened examination of multiple issues about this phenomenon through several
questions. For example, with entertainment spending, pertinent questions
relate to what entertainment is and what is included or excluded.
A few general points are noteworthy with regard to psychometrics. First,
relatively large sample sizes are essential for various psychometric procedures (Guadagnoli & Velicer, 1988; Nunnally, 1978). Small samples lead to
large sampling errors and add new forms of uncertainty into psychometric
estimates of reliability and validity. Coefficient alpha has a confidence
interval associated with it as well that is based on sampling error
(Appendix 1.1). Second, lack of sufficient variation inhibits the ability of an item to covary or correlate with other items and with the total measure; because covariations, reflected statistically in correlations, are the underpinning of psychometric analyses, restricted variation undermines those analyses. Third, although reliability and validity
are the key concepts in measurement, dimensionality is discussed separately
earlier. However, it is subsumed within validity, the aim being to understand
what a measure is capturing both conceptually and empirically. Fourth, and
more generally, measure validation is an ongoing process whereby measures
are refined; items are added, modified, or deleted; dimensions are expanded;
and constructs are redefined. Therefore, no single study is definitive in terms
of measure validation. Fifth, empirical testing of measures rests on relative
ordering, correlations being largely determined by relative standing between
variables (Nunnally, 1978). Relationships examined in substantive studies in
the social sciences are correlational in nature as well, reflecting the nature of
the phenomenon and the precision with which relationships can be specified.
Sixth, an attenuation formula to allow for unreliability in studying the relationship between two variables is available in the literature and presented in
Appendix 1.1. However, there is no substitute for well-designed, reliable,
valid measures to begin with. More generally, data analysis may not be able
to account for problems in research design. Finally, generalizability studies
may be a useful alternative by formalizing the effects of occasions and items
(Cronbach, Rajaratnam, & Gleser, 1963; Rentz, 1987). Generalizability
studies assess measurement error across facets or conditions of measurement, such as items, methods, and occasions. This approach relates to generalizing from observations to a universe of generalization consisting of the
conditions of measurement of facets. Effects of facets and interactions
between them are analyzed in this approach. The generalizability approach
blurs the distinction between reliability and validity in that methods could
be a facet of generalization (Rentz, 1987). However, this approach may be
complex to design and implement (Peter, 1979).
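As a minimal sketch of the correction for attenuation referred to above, the classical formula divides the observed correlation by the square root of the product of the two measures' reliabilities; the Python fragment below (assuming NumPy) uses hypothetical numbers.

import numpy as np

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the construct-level correlation from the observed correlation
    between two measures and their reliabilities."""
    return r_xy / np.sqrt(rel_x * rel_y)

# Hypothetical observed correlation of .35 between measures with reliabilities of .70 and .80
print(round(correct_for_attenuation(0.35, 0.70, 0.80), 2))  # approximately 0.47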
Summary
Accurate measurement is central to scientific research. There are two basic
types of measurement error in all of scientific research—random error and
systematic error. Measures relatively free of random error are reliable,
and measures relatively free of random and systematic error are reliable
and valid.13 The measure development process consists of a series of steps to
develop reliable and valid measures. It starts out by carefully understanding
what is being measured, through definition of the construct and delineation
of its domain. Rather than proceed directly from an abstract construct to its
concrete measurement, the distance between the conceptual and the operational has to be traversed carefully and iteratively. Internal consistency reliability, test-retest reliability, dimensionality, and validity tests are performed
on measures, with items being added, modified, or deleted along the way.
Measure development should ideally combine empirical assessment with
conceptual examination. A purely empirical approach may neglect the content of items, whereas a purely conceptual approach neglects empirical reality
and how respondents are actually interpreting an item. A purely conceptual
approach that neglects empirical results rests on the notion that the measure or
individual items are reliable and valid no matter what the outcomes say. A
purely empirical approach rests on the assumption that, somehow, item content and domain representation are irrelevant once data are collected. For
instance, an item that has moderate item-to-total correlation may still be worth
retaining if it is the only item that captures a certain aspect of the construct’s
domain. Even an item with low item-to-total correlation may be worth editing
if its content is uniquely capturing some aspect of a construct’s domain.
Similarly, conceptual reasoning for an item cannot overrule poor empirical outcomes, which are essentially suggesting problems in the item’s interpretation.
Low reliability and low validity have consequences. Random error may
reduce correlations between a measure and other variables, whereas systematic error may increase or reduce correlations between two variables. If there
is a key theme in all of this, it is that a construct and its measure are not the
same, and imperfect measurement of any construct has to be taken explicitly
into account.
A metaphor that can be used to understand measurement is to consider
the universe and the location of specific planets or stars, which are like
abstract concepts. Although we are unable to see them, they have to be
located through paths from Earth to them, akin to a path from a measure to
the underlying concept. Reliability refers to whether the paths (measures)
can be reproduced across time. Convergent validity refers to whether
two different paths (measures) converge on the same planet (construct).
Discriminant validity refers to whether the path (measure) to one planet
(construct) is different from a path (measure) leading to a different planet
(construct). Nomological validity refers to whether the path (measure)
relates to other known paths (measures) to different planets (constructs) in
expected ways in light of where the target planet (construct) is expected to
be. Construct validity refers to whether the path (measure) indeed leads to
the intended planet (construct).
The book has so far provided a brief overview of traditional measure
development procedures. The rest of the book will provide more in-depth
understanding of measurement error leading to important insights for
measure development and methodological design. Thus, the rest of the
material will break new ground in understanding measurement error and
methodological design.
Notes
1. Nunnally (1978) is cited throughout the book wherein both the Nunnally
(1978) book and the later edition by Nunnally and Bernstein (1994) can be cited.
2. The term measure is sometimes used interchangeably with the term scale in
this book.
3. Many words used in science, of course, refer to things other than constructs,
such as umbrella terms referring to topic areas, stimuli, or artifacts, such as a not-for-profit organization. Any stimulus object could be viewed in terms of numerous
underlying constructs, with the specific object having specific levels of magnitude on
these constructs. A not-for-profit organization, for instance, needs to be differentiated from other organizations by isolating the key dimensions of distinction. A construct can be viewed as having levels that are either along a quantitative continuum
or are qualitatively distinct.
4. Whether a true value really exists is a relevant issue. When dealing with psychological phenomena, responses are often crystallized where questions are posed.
The notion of a true value that exists when all the factors of a response situation
are controlled for is somewhat like the notion of infinity, a useful concept. What
usually matters in the study of relationships in the social sciences is the accurate
relative ordering of individuals or stimuli.
5. Consistency is a central element of scientific research. A key characteristic of
science is replicability.
6. Can measurement error be categorized in some other way? Perhaps error
can be classified further in terms of having constant versus variable effects.
Subsequent discussion will address this issue.
7. Researchers have suggested alternative approaches to the measure development process. For example, the C-OAR-SE procedure (Rossiter, 2002) emphasizes
content validity. The steps in this procedure—reflected in the acronym—are
Construct definition (where the construct is defined initially in relation to object,
attribute, and rater entity, e.g., IBM’s [object] service quality [attribute] as perceived
by IBM’s managers [raters]), Object classification (where raters are interviewed to
classify object as concrete or abstract and items are developed accordingly),
Attribute classification (where raters are interviewed to classify attributes similarly
as objects), Rater identification (where the rater is identified), Scale formation
(where the scale is formed from object and attribute items, response categories are
chosen, and items are pretested), and Enumeration.
8. As the number of items in a measure increases, it could be argued that there
is a greater likelihood of picking items to cover different aspects of a domain, akin
to randomly picking states in the United States and covering the geographic area.
In practice, though, the opposite may happen sometimes, where items are picked
from one or a few subdomains. Moreover, the issue of representing different aspects
of the domain is not one of reliability but one of validity. In fact, representing a
broad domain may actually lower reliability while enhancing validity.
9. For oblique rotations, both a factor pattern matrix and a factor structure matrix are reported in factor analysis. The former, which is easier to interpret and is usually reported, shows the unique loadings of items on factors, whereas the latter shows correlations between items and factors that incorporate the correlations among factors (Hair et al., 1998).
10. Another issue in factor analysis is the use of factor scores, which are estimates
of the values of common factors. Problems with factor scoring procedures suggest
that careful evaluation of factor scores is needed before using them (Grice, 2001).
11. This type of factor analysis is referred to as R factor analysis and is distinguished from Q factor analysis, where individuals load on factors and correlations
are computed across stimuli (Hair et al., 1998; McKeown & Thomas, 1988). For
example, in a Q sort method, 53 respondents ranked 60 statements on morality by
ordering them from –5 to +5 (i.e., most unlike to most like respondents’ views)
(McKeown & Thomas, 1988). The rankings were then correlated and factor analyzed. Groups of individuals loaded on specific factors. The Q factor analysis
approach is not used often because of problems with computation (Hair et al., 1998).
12. One way in which reliability and validity have been illustrated is through bullet holes in a shooting target, wherein the bull’s-eye is analogous to the construct that a measure aims to capture. A set of bullet holes close together suggests consistency and reliability, whereas a set of scattered bullet holes suggests unreliability. A set of bullet holes clustered closely around the bull’s-eye suggests both reliability and validity.
13. These statements describe reliability and validity in a mathematical sense, in terms of the types of error. It is possible for a measure to be perfectly reliable, and even valid, in capturing an unintended construct. Validity therefore relates conceptually to whether a measure measures the construct it purports to measure; measuring a different construct reliably and validly would not, of course, meet this requirement. For instance, the stone-throwing exercise may be a reliable measure of intelligence, and perhaps a valid measure of arm strength, but it is not a valid measure of intelligence because it lacks content validity. The mathematical description of reliability and validity in terms of error therefore works, in some sense, only when it is assumed that the intended construct is captured along with error; in other words, the observed score is the sum of the true score and error. In extreme scenarios, where the true score term is absent to begin with because an unintended construct is being measured, the mathematical expression does not fully reflect the concept of validity. Observed scores on an unintended construct may mistakenly be viewed as reflecting scores on the intended construct. In the case of the stone-throwing exercise, for instance, internal consistency may be achieved, yet the pattern of responses would be unrelated to true scores on intelligence. It is noteworthy that the issue of reliability is internal to a measure.
Appendix 1.1
Selected Metrics and Illustrations in Measurement
Random and Systematic Errors
Xo = Xt + Xr + Xs
Xo = observed score
Xt = true score
Xr = random error
Xs = systematic error
SOURCE: Adapted from Churchill, 1979.
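To make the decomposition concrete, the following is a minimal simulation sketch (not from the source; the distributions and the constant bias of 0.3 are assumed purely for illustration) showing how a systematic error shifts the mean of observed scores while random error only adds noise around the true scores.

```python
# Sketch of Xo = Xt + Xr + Xs with simulated (assumed) values.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

true_score = rng.normal(4.0, 1.0, n)       # Xt: the construct being measured
random_error = rng.normal(0.0, 0.5, n)     # Xr: noise that averages out to about 0
systematic_error = 0.3                     # Xs: a constant upward bias

observed = true_score + random_error + systematic_error  # Xo

print(round(observed.mean() - true_score.mean(), 2))     # ~0.3, the systematic error
print(round(np.corrcoef(observed, true_score)[0, 1], 2)) # < 1 because of random error
```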
Illustration of Internal Consistency
Sample Item Responses for the PNI Scale After Reverse Scoring Items
Respondents    N1   N2   N3   N4   N5   N6   N7   ...   Total
1               1    2    2    1    2    2    …    …      31
2               3    3    4    2    3    3    …    …      58
3               4    4    5    3    3    3    …    …      83
99              5    6    6    4    4    5    …    …     102
100             7    7    6    5    6    7    …    …     122
NOTE: Responses are reverse scored so that higher values denote higher preference for
numerical information.
Covariance:

σxy = Σ (xi – x̄)(yi – ȳ)/(n – 1), summed over i = 1 to n

Correlation:

Correlationxy = Covariancexy/√(σ²x · σ²y)

σ²x = x true score variance
σ²y = y true score variance
Item-to-total correlation = Cov (N1, Total)/Sq rt [Var(N1)Var(Tot)]
Cov = Covariance
Var = Variance
Sq rt = Square root
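As a rough illustration of these computations, here is a small Python sketch (not from the book) that reproduces the covariance, correlation, and item-to-total correlation for item N1, using the sample responses and total scores tabulated above.

```python
# Item-to-total correlation for N1 from the internal consistency illustration.
import numpy as np

n1 = np.array([1, 3, 4, 5, 7])             # item N1 for respondents 1, 2, 3, 99, 100
total = np.array([31, 58, 83, 102, 122])   # total scores over all items

cov_n1_total = np.cov(n1, total, ddof=1)[0, 1]                       # covariance
item_total_r = cov_n1_total / np.sqrt(n1.var(ddof=1) * total.var(ddof=1))

print(round(cov_n1_total, 2), round(item_total_r, 3))  # same r as np.corrcoef(n1, total)[0, 1]
```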
Computation of coefficient alpha (DeVellis, 1991; Nunnally & Bernstein, 1994, p. 232)

Coefficient alpha = kr̄/[1 + (k – 1)r̄]

k = number of items in a measure
r̄ = average interitem correlation

The expression of alpha presented above is based on a variance-covariance matrix of all the items in a measure. Alpha is the proportion of variance due to common sources. The elements in a covariance matrix can be separated into unique variances, represented by diagonal elements, and covariances between items, represented by off-diagonal elements. A three-item example is shown below.

⎡ σ11  σ12  σ13 ⎤
⎢ σ21  σ22  σ23 ⎥
⎣ σ31  σ32  σ33 ⎦

Alpha = [k/(k – 1)][1 – (sum of unique diagonal elements/sum of all elements of the covariance matrix)]

k/(k – 1) is a correction term applied to restrict the range of alpha from 0 to 1.

For example, using coefficient alpha = kr̄/[1 + (k – 1)r̄] for a 3-item scale:
If r̄ = 0, then alpha = 0; if r̄ = 1, then alpha = 3 · 1/[1 + (3 – 1) · 1] = 1.
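The following Python sketch (not from the book) computes alpha both ways, using the reverse-scored item responses N1 through N6 from the internal consistency illustration above; the covariance-based and average-correlation forms generally agree closely but are identical only when item variances are equal.

```python
# Coefficient alpha from a covariance matrix and from the average interitem correlation.
import numpy as np

items = np.array([
    [1, 2, 2, 1, 2, 2],   # rows = respondents 1, 2, 3, 99, 100; columns = items N1..N6
    [3, 3, 4, 2, 3, 3],
    [4, 4, 5, 3, 3, 3],
    [5, 6, 6, 4, 4, 5],
    [7, 7, 6, 5, 6, 7],
])
k = items.shape[1]

# Covariance form: [k/(k-1)] * [1 - (sum of diagonal elements / sum of all elements)]
cov = np.cov(items, rowvar=False, ddof=1)
alpha_cov = (k / (k - 1)) * (1 - cov.trace() / cov.sum())

# Average-interitem-correlation (standardized) form: k*rbar / [1 + (k-1)*rbar]
corr = np.corrcoef(items, rowvar=False)
rbar = corr[np.triu_indices(k, 1)].mean()
alpha_std = k * rbar / (1 + (k - 1) * rbar)

print(round(alpha_cov, 3), round(alpha_std, 3))
```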
Attenuation Formula for Reliability

[Path diagram: constructs X and Y are connected by the true correlation rXY; measure x is connected to construct X by rxX, and measure y is connected to construct Y by ryY.]
x is a measure of construct X
y is a measure of construct Y
rxy = observed correlation between x and y
rXY = true correlation between X and Y
rxX = correlation between measure x and construct X
ryY = correlation between measure y and construct Y

rxy = rxX · rXY · ryY

r²xX = reliability of x; r²yY = reliability of y
(i.e., the proportion of variance attributed to the true score)

Therefore, rxy = √Relx · rXY · √Rely

Relx = reliability of measure x; Rely = reliability of measure y

If rXY = .5, Relx = .5, and Rely = .5,
then rxy = √.5 · .5 · √.5 = .25
SOURCE: Nunnally and Bernstein (1994), p. 241.
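A short Python sketch (not from the book; the function name is illustrative only) of the attenuation relationship, reproducing the worked example:

```python
# Observed correlation implied by a true correlation and two reliabilities.
import math

def attenuated_correlation(r_true, rel_x, rel_y):
    # observed r = sqrt(Rel_x) * true r * sqrt(Rel_y)
    return math.sqrt(rel_x) * r_true * math.sqrt(rel_y)

print(round(attenuated_correlation(0.5, 0.5, 0.5), 2))  # 0.25, as in the example above
```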
Kuder-Richardson Formula 20 for
Reliability of Dichotomous Item Scale
rxx = [N/(N – 1)] · [(S² – Σpq)/S²]
N = number of items
S2 = variance of the total score
p = proportion passing each item; q = 1 – p
SOURCE: Nunnally and Bernstein (1994), p. 235.
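A Python sketch of KR-20 (not from the book; the 0/1 item responses below are made up purely for illustration):

```python
# Kuder-Richardson Formula 20 for dichotomous items.
import numpy as np

items = np.array([
    [1, 1, 0, 1, 0],   # rows = respondents, columns = pass/fail items
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
])

N = items.shape[1]              # number of items
total = items.sum(axis=1)       # total score per respondent
S2 = total.var(ddof=1)          # variance of total scores
p = items.mean(axis=0)          # proportion passing each item
q = 1 - p

kr20 = (N / (N - 1)) * (S2 - (p * q).sum()) / S2
print(round(kr20, 3))
```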
Standard Error of Measurement (SEM) for Reliability
SEM = σx · √(1 – rxx)
σx = Standard deviation of the distribution of total scores on the test
rxx = reliability of the test
SOURCE: Nunnally (1978), p. 239.
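A brief Python sketch (not from the book; the standard deviation of 10, reliability of .84, and observed score of 50 are assumed values) showing how the standard error of measurement brackets an observed score:

```python
# Standard error of measurement and a rough 95% band around an observed score.
import math

sd_total = 10.0      # assumed standard deviation of total test scores
reliability = 0.84   # assumed test reliability

sem = sd_total * math.sqrt(1 - reliability)
observed = 50

print(sem)                                               # 4.0
print(observed - 1.96 * sem, observed + 1.96 * sem)      # approximate 95% band
```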
Illustration of Test-Retest Reliability
Respondents    N1   N2   N3   ...   Total    RN1   RN2   RN3   ...   RTotal
1               1    2    2    …      31       2     2     …    …       29
2               3    3    4    …      58       3     3     …    …       59
3               4    4    5    …      83       3     3     …    …       77
99              5    6    6    …     102       4     5     …    …       98
100             7    7    6    …     122       6     7     …    …      127
NOTES: Responses on items N1, N2, and so on are at test, and RN1, RN2, and so on are at
retest. Items are reverse scored so that higher values denote higher preference for numerical
information. Total and RTotal are total scores at test and retest, respectively.
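A minimal Python sketch (not from the book) of the test-retest computation, correlating the tabulated test and retest totals:

```python
# Test-retest reliability as the correlation between test and retest total scores.
import numpy as np

total = np.array([31, 58, 83, 102, 122])    # test totals for respondents 1, 2, 3, 99, 100
retotal = np.array([29, 59, 77, 98, 127])   # retest totals for the same respondents

print(round(np.corrcoef(total, retotal)[0, 1], 3))   # test-retest reliability estimate
```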
A Simplified Illustration of Exploratory Factor Analysis
The following are equations and computations showing the relationships
between two uncorrelated factors and four items.
F = factor
x = item
l = loading or correlation between an item and a factor
e = error term
Numbers refer to different factors or items.
x1 = l11F1 + l12F2 + e1
x2 = l21F1 + l22F2 + e2
x3 = l31F1 + l32F2 + e3
x4 = l41F1 + l42F2 + e4
In matrix form:

⎡ x1 ⎤   ⎡ l11  l12 ⎤            ⎡ e1 ⎤
⎢ x2 ⎥   ⎢ l21  l22 ⎥   ⎡ F1 ⎤   ⎢ e2 ⎥
⎢ x3 ⎥ = ⎢ l31  l32 ⎥ × ⎣ F2 ⎦ + ⎢ e3 ⎥
⎣ x4 ⎦   ⎣ l41  l42 ⎦            ⎣ e4 ⎦

Communality of x1 = l²11 + l²12.

More generally, communality of xi = Σ l²ij (summed over j = 1 to n), where i is the ith item and j is the jth factor, with n factors.

Variance explained by F1 in eigenvalues = l²11 + l²21 + l²31 + l²41.

More generally, variance explained by Fj in eigenvalues = Σ l²ij (summed over i = 1 to m), where i is the ith item and j is the jth factor, with m items.
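A small Python sketch (not from the book; the 4 × 2 loading matrix is assumed for illustration) of the communality and variance-explained computations:

```python
# Communalities (row sums of squared loadings) and eigenvalues (column sums).
import numpy as np

L = np.array([
    [0.8, 0.1],   # rows = items x1..x4, columns = factors F1, F2
    [0.7, 0.2],
    [0.1, 0.9],
    [0.2, 0.6],
])

communalities = (L ** 2).sum(axis=1)   # e.g., communality of x1 = 0.8**2 + 0.1**2 = 0.65
eigenvalues = (L ** 2).sum(axis=0)     # variance explained by each factor

print(communalities)
print(eigenvalues)
```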
A Simplified Illustration of Confirmatory Factor Analysis
In confirmatory factor analysis, rather than every item being allowed to be
related to every factor (as shown in the equations earlier), specific models are
tested. Consider a model where the first two of four items are hypothesized
to load on the first factor and the last two items are hypothesized to load on
the second factor.
x1 = l11F1 + 0F2 + e1
x2 = l21F1 + 0F2 + e2
x3 = 0F1 + l32F2 + e3
x4 = 0F1 + l42F2 + e4
In matrix form:

⎡ x1 ⎤   ⎡ l11   0  ⎤            ⎡ e1 ⎤
⎢ x2 ⎥   ⎢ l21   0  ⎥   ⎡ F1 ⎤   ⎢ e2 ⎥
⎢ x3 ⎥ = ⎢  0   l32 ⎥ × ⎣ F2 ⎦ + ⎢ e3 ⎥
⎣ x4 ⎦   ⎣  0   l42 ⎦            ⎣ e4 ⎦
Any correlation between items from different factors—say, x1 and x3—
occurs only because of a relationship between F1 and F2. Confirmatory
factor analysis assesses the degree to which the proposed model is consistent with the data in terms of relationships between items belonging to the
same factor (say, x1 and x2) and the relationship between items belonging
to different factors (say, x1 and x3). For instance, if an item actually measures two factors, then its relationship with an item from the second factor
will be high, leading to a poor fit for the overall model. Confirmatory factor
analysis assesses internal consistency across items from the same factors and
external consistency across items from different factors.
Internal and External Consistency
in Confirmatory Factor Analysis
The product rule for internal consistency for the correlation between two indicators i and j of construct ξ (the underlying true score) is

ρij = ρiξ · ρjξ.

The product rule for external consistency for the correlation between two indicators i and p, where p is an indicator of another construct ξ*, is

ρip = ρiξ · ρξξ* · ρpξ*.
SOURCE: Gerbing and Anderson (1988).
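A tiny Python sketch (not from the book; all loadings and the factor correlation are assumed values) of these two product rules:

```python
# Implied correlations under the internal and external consistency product rules.
rho_i_xi = 0.8       # indicator i on construct xi
rho_j_xi = 0.7       # indicator j on construct xi
rho_p_xistar = 0.6   # indicator p on construct xi*
rho_xi_xistar = 0.5  # correlation between the two constructs

rho_ij = rho_i_xi * rho_j_xi                        # internal consistency: 0.56
rho_ip = rho_i_xi * rho_xi_xistar * rho_p_xistar    # external consistency: 0.24

print(rho_ij, rho_ip)
```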
Expressions in Structural Equation Modeling
Reliability of item (Bagozzi, 1991)

ρxi = λ²xi/[λ²xi + var(δi)]

Reliability of measure with p items (Bagozzi, 1991)

ρc = (Σ λxi)²/[(Σ λxi)² + Σ var(δi)], with sums taken over i = 1 to p

Average variance extracted for measure with p items (Fornell & Larcker, 1981)

ρvc = Σ λ²xi/[Σ λ²xi + Σ var(δi)], with sums taken over i = 1 to p
ξ = latent variable
x = observed variable
δ = error
λ = loading of x on ξ
var = variance
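A short Python sketch (not from the book; the standardized loadings are assumed values) computing these three quantities:

```python
# Item reliability, composite reliability, and average variance extracted.
import numpy as np

loadings = np.array([0.8, 0.7, 0.6])   # assumed standardized loadings for a 3-item measure
error_var = 1 - loadings ** 2          # error variances implied for standardized items

item_reliability = loadings ** 2 / (loadings ** 2 + error_var)   # reduces to loading**2 here
composite_reliability = loadings.sum() ** 2 / (loadings.sum() ** 2 + error_var.sum())
ave = (loadings ** 2).sum() / ((loadings ** 2).sum() + error_var.sum())

print(item_reliability)
print(round(composite_reliability, 3), round(ave, 3))
```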
Alternative Overall Fit Indexes
Normed fit index (Bentler & Bonnett, 1980)
(T²b – T²n)/T²b

T²b = chi-square for baseline model
T²n = chi-square for hypothesized model
Incremental fit index (Bollen, 1989)
(T²b – T²n)/(T²b – dfn)

dfb = degrees of freedom for baseline model
dfn = degrees of freedom for hypothesized model

Comparative fit index (Bentler, 1990)

[(T²b – dfb) – (T²n – dfn)]/(T²b – dfb)

Tucker-Lewis index (Tucker & Lewis, 1973)

(T²b/dfb – T²n/dfn)/(T²b/dfb – 1)
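A compact Python sketch (not from the book; the chi-square values and degrees of freedom are assumed) computing the four fit indexes above:

```python
# Normed, incremental, comparative, and Tucker-Lewis fit indexes from chi-squares and dfs.
chi2_b, df_b = 480.0, 36    # baseline model (assumed values)
chi2_n, df_n = 60.0, 24     # hypothesized model (assumed values)

nfi = (chi2_b - chi2_n) / chi2_b
ifi = (chi2_b - chi2_n) / (chi2_b - df_n)
cfi = ((chi2_b - df_b) - (chi2_n - df_n)) / (chi2_b - df_b)
tli = (chi2_b / df_b - chi2_n / df_n) / (chi2_b / df_b - 1)

print(round(nfi, 3), round(ifi, 3), round(cfi, 3), round(tli, 3))
```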
Appendix 1.2
Sample Scales (Response Categories
Adapted to 7-Point Scales for Convenience)
Consumer independent judgment-making (Manning, Bearden, & Madden, 1995¹)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

Prior to purchasing a new brand, I prefer to consult a friend that has experience with the new brand.
When it comes to deciding whether to purchase a new service, I do not rely on experienced friends or family members for advice.
I seldom ask a friend about his or her experiences with a new product before I buy the new product.
I decide to buy new products and services without relying on the opinions of friends who have already tried them.
When I am interested in purchasing a new service, I do not rely on my friends or close acquaintances that have already used the new service to give me information as to whether I should try it.
I do not rely on experienced friends for information about new products prior to making up my mind about whether or not to purchase.
Consumer novelty (Manning et al., 1995¹)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I often seek out information about new products and brands.
I like to go to places where I will be exposed to information about new products and brands.
I like magazines that introduce new brands.
I frequently look for new products and services.
I seek out situations in which I will be exposed to new and different sources of product information.
I am continually seeking new product experiences.
When I go shopping, I find myself spending very little time checking out new products and brands.
I take advantage of the first available opportunity to find out about new and different products.
Material values—Defining success (Richins & Dawson, 1992²)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I admire people who own expensive homes, cars, and clothes.
Some of the most important achievements in life include acquiring material possessions.
I don’t place much emphasis on the amount of material objects people own as a sign of success.
The things I own say a lot about how well I’m doing in life.
I like to own things that impress people.
I don’t pay much attention to the material objects other people own.
Material values—Acquisition centrality (Richins & Dawson, 1992²)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I usually buy only the things I need.
I try to keep my life simple, as far as possessions are concerned.
The things I own aren’t all that important to me.
I enjoy spending money on things that aren’t practical.
Buying things gives me a lot of pleasure.
I like a lot of luxury in my life.
I put less emphasis on material things than most people I know.
Material values—Pursuit of happiness (Richins & Dawson, 1992²)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I have all the things I really need to enjoy life.
My life would be better if I owned certain things I don’t have.
I wouldn’t be any happier if I owned nicer things.
I’d be happier if I could afford to buy more things.
It sometimes bothers me quite a bit that I can’t afford to buy all the things I’d like.
Value consciousness (Lichtenstein, Ridgway, & Netemeyer, 1993³)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I am very concerned about low prices, but I am equally concerned about product quality.
When grocery shopping, I compare the prices of different brands to be sure I get the best value for the money.
When purchasing a product, I always try to maximize the quality I get for the money I spend.
When I buy products, I like to be sure that I am getting my money’s worth.
I generally shop around for lower prices on products, but they still must meet certain quality requirements before I will buy them.
When I shop, I usually compare the “price per ounce” information for brands I normally buy.
I always check prices at the grocery store to be sure I get the best value for the money I spend.
Price consciousness (Lichtenstein et al., 1993³)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I am not willing to go to extra effort to find lower prices.
I will grocery shop at more than one store to take advantage of low prices.
The money saved by finding lower prices is usually not worth the time and effort.
I would never shop at more than one store to find low prices.
The time it takes to find low prices is usually not worth the effort.
Coupon proneness (Lichtenstein et al., 1993³)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

Redeeming coupons makes me feel good.
I enjoy clipping coupons out of the newspaper.
When I use coupons, I feel that I am getting a good deal.
I enjoy using coupons regardless of the amount I save by doing so.
Beyond the money I save, redeeming coupons gives me a sense of joy.
Sale proneness (Lichtenstein et al., 1993³)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

If a product is on sale, that can be a reason for me to buy it.
When I buy a brand that’s on sale, I feel that I am getting a good deal.
I have favorite brands, but most of the time I buy the brand that’s on sale.
I am more likely to buy brands that are on sale.
Compared to most people, I am more likely to buy brands that are on special.
Consumer ethnocentrism (Shimp & Sharma, 1987⁴)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

American people should always buy American-made products instead of imports.
Only those products that are unavailable in the U.S. should be imported.
Buy American-made products. Keep America working.
American products, first, last, and foremost.
Purchasing foreign-made products is un-American.
It is not right to purchase foreign products.
A real American should always buy American-made products.
We should purchase products manufactured in America instead of letting other countries get rich off us.
It is always best to purchase American products.
There should be very little trading or purchasing of goods from other countries unless out of necessity.
Americans should not buy foreign products, because this hurts American business and causes unemployment.
Curbs should be put on all imports.
It may cost me in the long run but I prefer to support American products.
Foreigners should not be allowed to put their products in our markets.
Foreign products should be taxed heavily to reduce their entry into the U.S.
We should buy from foreign countries only those products that we cannot obtain within our own country.
American consumers who purchase products made in other countries are responsible for putting their fellow Americans out of work.
Need for cognition (Perri & Wolfgang, 1988⁵)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I would prefer complex to simple problems.
I like to have the responsibility of handling a situation that requires a lot of thinking.
Thinking is not my idea of fun.
I would rather do something that requires little thought than something that is sure to challenge my thinking abilities.
I try to anticipate and avoid situations where there is a likely chance I will have to think in depth about something.
I find satisfaction in deliberating hard and for long hours.
I only think as hard as I have to.
I prefer to think about small, daily projects to long-term ones.
I like tasks that require little thought once I’ve learned them.
The idea of relying on thought to make my way to the top appeals to me.
I really enjoy a task that involves coming up with new solutions to problems.
Learning new ways to think doesn’t excite me very much.
I prefer my life to be filled with puzzles that I must solve.
The notion of thinking abstractly is appealing to me.
I would prefer a task that is intellectual, difficult, and important to one that is somewhat important but does not require much thought.
I feel relief rather than satisfaction after completing a task that required a lot of mental effort.
It’s enough for me that something gets the job done; I don’t care how or why it works.
I usually end up deliberating about issues even when they do not affect me personally.
Consumer susceptibility to interpersonal influence (Bearden, Netemeyer, & Teel, 1989⁶)

Items rated from 1 = Strongly Disagree to 7 = Strongly Agree.

I often consult other people to help choose the best alternative available from a product class.
If I want to be like someone, I often try to buy the same brands that they buy.
It is important that others like the products and brands I buy.
To make sure I buy the right product or brand, I often observe what others are buying and using.
I rarely purchase the latest fashion styles until I am sure my friends approve of them.
I often identify with other people by purchasing the same products and brands they purchase.
If I have little experience with a product, I often ask my friends about the product.
When buying products, I generally purchase those brands that I think others will approve of.
I like to know what brands and products make good impressions on others.
I frequently gather information from friends or family about a product before I buy.
If other people can see me using a product, I often purchase the brand they expect me to buy.
I achieve a sense of belonging by purchasing the same products and brands that others purchase.
Notes
1. From Manning, K. C., Bearden, W. O., & Madden, T. J., Consumer innovativeness and the adoption process, in Journal of Consumer Psychology, 4(4), pp. 329–345, copyright © 1995. Reprinted by permission.
2. From Richins, M. L., & Dawson, S., Materialism as a consumer value: Measure development and validation, in Journal of Consumer Research, 19, pp. 303–316, copyright © 1992. Reprinted by permission of the American Marketing Association.
3. From Lichtenstein, D. R., Ridgway, N. M., & Netemeyer, R. G., Price perceptions and consumer shopping behavior: A field study, in Journal of Marketing Research, 30, pp. 234–245, copyright © 1993. Reprinted by permission of the American Marketing Association.
4. From Shimp, T. A., & Sharma, S., Consumer ethnocentrism: Construction and validation of the CETSCALE, in Journal of Marketing Research, 24, pp. 280–289, copyright © 1987. Reprinted by permission of the American Marketing Association.
5. From Perri, M., & Wolfgang, A. P., A modified measure of need for cognition, in Psychological Reports, 62, pp. 955–957, copyright © 1988. Reprinted by permission.
6. From Bearden, W. O., Netemeyer, R. G., & Teel, J. E., Measurement of consumer susceptibility to interpersonal influence, in Journal of Consumer Research, 15, pp. 473–481, copyright © 1989. Reprinted by permission.