How to Analyze Data? Aravinda Guntupalli

SPSS windows process
 Data window
 Variable view window
 Output window
 Chart editor window

How to use different file types?

 Excel file
 CSV file
 SPSS file
Types of variables

You can select the type of a variable:
 String
 Numeric

You can also select the level of measurement of a variable:
 Categorical
 Ordinal
 Interval
Why does it matter?
Statistical computations and analyses
assume that the variables have specific
levels of measurement
 Can you compute the average of hair color?
 Does it make sense to compute the
average of educational experience?
 An average requires a variable to be
interval.
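As a rough illustration of why the level of measurement matters, the Python sketch below picks a summary that suits each level: the mode for categorical data, the median for ordinal data and the mean for interval data. The function name `summarize`, the level labels and the sample values are assumptions made for this example; they are not part of SPSS.

```python
from statistics import mean, median_low, mode


def summarize(values, level):
    """Return a summary statistic that is meaningful for the given level."""
    if level == "categorical":
        return mode(values)        # most frequent category
    if level == "ordinal":
        return median_low(values)  # middle rank, always an observed value
    if level == "interval":
        return mean(values)        # arithmetic average
    raise ValueError(f"unknown level of measurement: {level}")


# Hypothetical sample data
hair_color = ["brown", "black", "brown", "blond"]
education_level = [1, 2, 2, 3, 1]  # 1 = low, 2 = middle, 3 = high
age = [20, 30, 40]

typical_hair = summarize(hair_color, "categorical")
typical_education = summarize(education_level, "ordinal")
average_age = summarize(age, "interval")
```

Asking for the mean of `hair_color` would fail, which is the point of the questions above: an average only makes sense for interval data.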

Stock and flow variables
In data analysis it is useful to distinguish
between stock and flow variables.
 Stock variables are measured at a point
in time and flow variables are measured
over a period of time.
 Cross-section data make comparisons
at a given point or in a given period in time,
while time-series data depict evolution
over time.

Manipulate existing data
Compute new variable
You can calculate different variables from
the existing variables.
 For this you need to know the way to
compute your target variable from the
existing variables.
 You can perform operations like addition,
subtraction, division and multiplication of
variables to create a new variable.

Example
 Total output of food grains (addition of
rice, wheat, maize and other grain output)
 Income difference between males and
females (male income – female income)
 Age square variable (age*age)
 GDP Per capita (Total GDP/Population)
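The four example computations can be sketched in Python as follows. This is only an illustration of the arithmetic, not SPSS syntax; the field names and the figures in `records` are hypothetical.

```python
# Hypothetical records; field names and figures are made up for illustration.
records = [
    {"rice": 120.0, "wheat": 80.0, "maize": 40.0, "other": 10.0,
     "male_income": 3200.0, "female_income": 2900.0,
     "age": 30, "gdp": 5_000_000.0, "population": 2_000},
    {"rice": 90.0, "wheat": 60.0, "maize": 30.0, "other": 20.0,
     "male_income": 2800.0, "female_income": 2700.0,
     "age": 45, "gdp": 9_000_000.0, "population": 3_000},
]


def compute_new_variables(row):
    """Return a copy of the row with four derived variables added."""
    new_row = dict(row)
    # Addition: total output of food grains
    new_row["total_grain"] = row["rice"] + row["wheat"] + row["maize"] + row["other"]
    # Subtraction: income difference between males and females
    new_row["income_diff"] = row["male_income"] - row["female_income"]
    # Multiplication: age squared
    new_row["age_sq"] = row["age"] * row["age"]
    # Division: GDP per capita
    new_row["gdp_per_capita"] = row["gdp"] / row["population"]
    return new_row


computed = [compute_new_variables(r) for r in records]
```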

Recode variable
Using SPSS you can recode a variable
into the same variable. How?
 Suppose we have data on mothers' years of
education, from 0 to 22 years, and you need to
do the analysis using only 3 categories:
mothers who did not complete high
school, mothers who completed high
school and mothers who completed
college. How will you do this?

How to perform this?
 Go to the Transform pull-down menu, then
go to Recode, then to Recode into same
variable (if you want to replace the
existing information).
 Select education and move it into the
numeric variable list.
 Define values by clicking Old and new
values.
 Enter the 0-11 range as 1, 12-15 as 2 and
16-22 as 3.
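The recoding rule (0-11 as 1, 12-15 as 2, 16-22 as 3) can be sketched in Python as below. The function name `recode_education` and the sample values are hypothetical. Treating values outside 0-22 as missing is an assumption of this sketch, not something the SPSS dialog does for you.

```python
def recode_education(years):
    """Map years of education (0-22) to the three categories."""
    if years is None:
        return None       # missing stays missing
    if 0 <= years <= 11:
        return 1          # did not complete high school
    if 12 <= years <= 15:
        return 2          # completed high school
    if 16 <= years <= 22:
        return 3          # completed college
    return None           # out of range: treated as missing (assumption)


# Hypothetical years of education for seven mothers
education_years = [0, 8, 11, 12, 15, 16, 22]
education_group = [recode_education(y) for y in education_years]
```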
How to make a new data set?

We will now create a data set of our own.
 Cross-sectional
 Panel
 Time series
Types of variables
 String
 Numeric
Replace missing values
Missing observations can be problematic
in analysis, and some time series
measures cannot be computed if there are
missing values in the series.
 Replace Missing Values creates new time
series variables from existing ones,
replacing missing values with estimates
computed with one of several methods.

Also…
 Default new variable names are the first six
characters of the existing variable used to create
them, followed by an underscore and a sequential
number.
 For example, for the variable PRICE, the new
variable name would be PRICE_1. The new
variables retain any defined value labels from
the original variables.
 Optionally, you can enter variable names to
override the default new variable names.
To Replace Missing Values for Time Series Variables
From the pull down menu choose:
Transform and then Replace Missing
Values
 You can then select the estimation
method you want to use to replace
missing values.
 Select the variable for which you want to
replace missing values.
 Also you can enter variable names to
override the default new variable names.
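To make the idea concrete, here is a minimal Python sketch of two estimation methods, the series mean and linear interpolation, plus the default naming rule described earlier. The function names and the `price` series are hypothetical, and the sketch does not claim to match SPSS's behaviour in every edge case (for example, leading or trailing gaps are simply left missing here).

```python
from statistics import mean


def replace_with_series_mean(series):
    """Fill each missing value (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in series]


def replace_with_linear_interpolation(series):
    """Fill gaps between two observed values along a straight line."""
    result = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = series[left] + step * (i - left)
    return result  # gaps before the first or after the last value stay None


def default_new_name(name, sequence=1):
    """First six characters, an underscore, then a sequential number."""
    return f"{name[:6]}_{sequence}"


# Hypothetical time series with missing observations
price = [10.0, None, 14.0, None, None, 20.0]
price_1 = replace_with_linear_interpolation(price)
price_mean_filled = replace_with_series_mean(price)
new_name = default_new_name("PRICE")
```

The original `price` list is left unchanged, mirroring the way a new variable is created instead of overwriting the existing one.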

Graphs
Boxplot
 A boxplot consists of a box and two tails (whiskers).
 The horizontal line inside the box marks the
position of the median, and the upper and lower
boundaries of the box are the upper and lower quartiles.
 The tails run to the most extreme values.
 In sum, a boxplot shows the structure of the data along
with its skewness and spread.
Drawing a boxplot.
Question: We have recorded the heights in cm of boys in a
class as shown below. We will draw a boxplot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower quartile (QL) = 158
Median (Q2) = 171
Upper quartile (Qu) = 180
[Figure: boxplot of the heights drawn on an axis from 130 to 190 cm]
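The quartiles for the height data can be checked with a short Python sketch using the standard `statistics` module. Quartile conventions differ between packages; the default "exclusive" method used here is an assumption and may not match every tool.

```python
from statistics import quantiles

# Heights in cm of the boys in the class (from the example)
heights_cm = [137, 148, 155, 158, 165, 166, 166, 171, 171,
              173, 175, 180, 184, 186, 186]

# Three cut points that split the sorted data into four equal groups
lower_quartile, median_height, upper_quartile = quantiles(heights_cm, n=4)

# The tails of the boxplot run to the most extreme values
lowest, highest = min(heights_cm), max(heights_cm)

# Height of the box: the interquartile range
box_height = upper_quartile - lower_quartile
```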
Boxplot
[Figure: boxplots by SES group: low (N = 47), middle (N = 95), high (N = 58); vertical axis from 20 to 80]
How to make a boxplot?
 From the menus, choose: Graphs and Boxplot
 Select the icon for Simple and select
Summaries for groups of cases.
 Select Define.
 Select the variable for which you want boxplots,
and move it into the Variable box.
 Select a variable for the category axis and move
it into the Category Axis box. This variable may
be numeric, string, or long string.
Histogram
A histogram is a graphical representation
of a frequency distribution for continuous
data.
The height of each bar is proportional to the
frequency of its class.
Histogram (2)
[Figure: histogram of math score; horizontal axis from 32.5 to 75.0 in steps of 2.5, vertical axis from 0 to 30; Std. Dev = 9.37, Mean = 52.6, N = 200.00]
How to make histogram?
From the menus, choose: Graphs and
Histogram
 Select a numeric variable for Variable in
the Histogram dialog.
 Select Display normal curve to display a
normal curve on the histogram.
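Behind the chart, a histogram is just a count of how many values fall into each class. The Python sketch below shows that counting step. The function name `histogram_counts`, the class limits and the `scores` values are hypothetical; they are not the math score data shown in the figure.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the classes are ignored
            counts[index] += 1
    return counts


# Hypothetical test scores
scores = [33, 41, 44, 47, 52, 52, 55, 58, 61, 63, 68, 71]

# Five classes of width 10: 30-40, 40-50, 50-60, 60-70, 70-80
counts = histogram_counts(scores, start=30, width=10, n_bins=5)

# A text version of the chart: one '#' per case, so bar length equals frequency
bars = ["#" * c for c in counts]
```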

Scatter plot (1)
 To explore the relationship between two
quantitative variables, we can use a scatter plot.
 A scatter diagram plots the value of one
economic variable against the value of another
variable.
 It can be used to reveal whether a relationship
exists and what type of relationship it is.
 For example, a scatter plot can describe the relation
between reading and writing scores.
Scatter plot (2)
[Figure: scatter plot; horizontal axis: writing score (30 to 70); vertical axis from 20 to 80]
Typical Patterns
Positive linear relationship
Negative nonlinear relationship
No relationship
Negative linear relationship
Nonlinear (concave) relationship
How to make scatter plots?
 From the menus, choose: Graphs and Scatter
 Select the icon for Simple.
 Select Define.
 You must select a variable for the Y-axis and a
variable for the X-axis. These variables must be
numeric, but should not be in date format.
 You can select a variable and move it into the
Set Markers by box. This variable may be
numeric or string.
Descriptive statistics
 Descriptive statistics tell you how many valid
cases you have in the data, along with the mean and
standard deviation.
 You can learn about the distribution of a
variable using this command in SPSS.
 How to do this?
 From the menus, choose: Analyse, then
Descriptive statistics, then Frequencies,
Descriptives, Explore or Crosstabs.
 Select the variables.
 Using the Shift or Ctrl key you can select multiple variables.
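The three figures mentioned above (valid cases, mean and standard deviation) can be reproduced with a small Python sketch. The function name `describe` and the `math_score` values are hypothetical. The sample standard deviation (dividing by n - 1) is used here.

```python
from statistics import mean, stdev


def describe(values):
    """Return the number of valid cases, the mean and the standard deviation."""
    valid = [v for v in values if v is not None]  # drop missing cases
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "mean": mean(valid),
        "std_dev": stdev(valid),  # sample standard deviation (n - 1)
    }


# Hypothetical scores with two missing cases
math_score = [48, 52, None, 61, 45, 57, None, 52]
summary = describe(math_score)
```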
Correlation and regression
What is Correlation?
 Research question: What is the relation
between two variables?
 Correlation is a measure of the direction
and degree of linear association between
2 variables.
Interpreting Correlation
Strength       r
very weak      0 - .19
weak           .20 - .39
moderate       .40 - .59
strong         .60 - .79
very strong    .80 - 1.00
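As a sketch of how r is obtained and read, the Python code below computes the Pearson correlation coefficient from its definition and looks up the strength band. The function names and the sample scores are hypothetical. Applying the bands to the absolute value of r, so that negative correlations are labelled by their size, is an assumption of this sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: direction and degree of linear association."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strength_label(r):
    """Translate r into the verbal strength bands."""
    size = abs(r)  # the sign gives the direction, the size gives the strength
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


# Hypothetical reading and writing scores for five students
reading = [40, 50, 55, 60, 70]
writing = [45, 48, 60, 58, 65]
r = pearson_r(reading, writing)
label = strength_label(r)
```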
Relation between hourly pay and age
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .397(a)   .158       .158                3.59608
a. Predictors: (Constant), Age last birthday
R Square values indicate the proportion of
variance in the dependent variable (y)
accounted for by variation in the independent
variable (x). Here, age accounts for about 15.8%
of the variance in hourly pay.
Regression coefficients
Coefficients (a)
                        Unstandardized          Standardized
Model                   B        Std. Error     Beta           t         Sig.
1  (Constant)           1.336    .130                          10.314    .000
   Age last birthday    .231     .004           .397           53.500    .000
a. Dependent Variable: Gross hourly pay (£)
hourly pay = 1.336 + .231 x age + error
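To show where such coefficients come from and how the fitted equation is used, here is a minimal Python sketch. `fit_simple_regression` applies the usual least-squares formulas, and `predict_hourly_pay` plugs an age into the equation hourly pay = 1.336 + .231 x age. The function names and the small check data are hypothetical; the survey data behind the SPSS output are not reproduced here.

```python
def fit_simple_regression(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_hourly_pay(age, intercept=1.336, slope=0.231):
    """Predicted gross hourly pay (£) from the fitted equation, error term omitted."""
    return intercept + slope * age


# Small made-up data set that lies exactly on the line y = 1 + 2x
check_intercept, check_slope = fit_simple_regression([1, 2, 3], [3, 5, 7])

# Predicted pay for a 40-year-old: 1.336 + .231 * 40
pay_at_40 = predict_hourly_pay(40)
```

Each additional year of age raises the predicted hourly pay by the slope, here about £0.23.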
Multivariate Regression Analysis
When do we use Multivariate
Regression Analysis?
To find the relationship between more than
two variables
 y = b0 + b1*x1 + b2*x2 + e
 hours worked (y)
 education (x1)
 income (x2)
Simultaneous regression

hourly pay (£) = -7.827 + .540*education + .217*age
Coefficients (a)
                                  Unstandardized          Standardized
Model                             B         Std. Error    Beta           t          Sig.
1  (Constant)                     -7.827    .253                         -30.988    .000
   Age last birthday              .217      .005          .343           46.457     .000
   Age completed continuous
   full-time education            .540      .011          .355           48.123     .000
a. Dependent Variable: Gross hourly pay (£)
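A fitted model with several predictors is used in the same way as a simple one: multiply each value by its B coefficient and add the constant. The Python sketch below does this with the unstandardized B values from the Coefficients table (constant -7.827, age .217, education .540). The function name `predict` and the example person are hypothetical.

```python
def predict(intercept, coefficients, values):
    """Predicted y = intercept + b1*x1 + b2*x2 + ... (error term omitted)."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Unstandardized B coefficients from the Coefficients table
CONSTANT = -7.827
B_AGE = 0.217
B_EDUCATION = 0.540

# Hypothetical person: 40 years old, left full-time education at 18
predicted_pay = predict(CONSTANT, [B_AGE, B_EDUCATION], [40, 18])

# Holding age fixed, one more year of education adds B_EDUCATION to the prediction
one_more_year = predict(CONSTANT, [B_AGE, B_EDUCATION], [40, 19]) - predicted_pay
```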
What if… we have a dichotomous dependent variable?

Use a dummy dependent variable regression
model:
 Logistic regression model
 Unlike simple linear regression and multiple
regression, in logistic regression the
dependent variable is dichotomous (i.e. 0 or 1).
 In logistic regression more than one
independent variable can be used.
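As a rough sketch of how a logistic regression model produces predictions for a 0/1 outcome, the Python code below passes the linear combination of the independent variables through the logistic function, giving a probability between 0 and 1. The function names, the coefficients and the 0.5 cut-off are illustrative assumptions, not estimates from any data in this presentation.

```python
from math import exp


def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def predict_probability(intercept, coefficients, values):
    """P(y = 1) for one case, given one or more independent variables."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return logistic(z)


def classify(probability, cutoff=0.5):
    """Turn the probability into a 0/1 prediction (the cut-off is an assumption)."""
    return 1 if probability >= cutoff else 0


# Made-up coefficients for two independent variables
p = predict_probability(-1.0, [0.5, 0.25], [2.0, 4.0])
predicted_class = classify(p)
```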

Thank You