EMET2008 Course Material Release 1.0 Juergen Meinecke October 13, 2014

Contents
1 Announcements
2 Slides (weeks 8 through 13)
3 Course Material
3.1 Course Outline
3.2 Reading List
3.3 Exercises
3.4 Assignments
3.5 Illustration of Central Limit Theorem using Monte Carlo Simulation
This website hosts important content for the ANU course EMET2008.
CHAPTER 1
Announcements
Note: Assignment 2 now available online. Please note the revised deadline: Wednesday, 29 October by
2pm!
Midterm Midterm answer key
CHAPTER 2
Slides (weeks 8 through 13)
I am switching to a more conventional presentation style, using slides (instead of writing on the
whiteboard). I hope most of you prefer this.
• Weeks 8 and 9: Slides
• Weeks 10 and 11: Slides
• Weeks 12 and 13: chapters 14 and/or 15 of textbook
CHAPTER 3
Course Material
3.1 Course Outline
Read this entire course outline carefully!
Any items, rules, requirements in this course outline may be subject to changes. When this happens I will
announce it during the lecture. Announcements in the lecture supersede any information contained in this
course outline.
3.1.1 Course Description
This course presents and develops techniques necessary for the quantitative analysis of economic and business problems that are beyond the scope of the simple linear regression model covered in EMET2007 or
STAT2008. Topics include: endogeneity, natural experiments, binary dependent variables, time series regressions and panel data estimation. This is a hands-on course with a focus on applications in economics
as well as business. A standard statistical software package will be used during computer sessions; no special programming skills are required.
Learning Outcome
Upon successful completion of the requirements for this course, students will
• understand the challenges of empirical modelling in economics and business
• understand the shortcomings of the standard linear regression model
• be able to apply important extensions to the linear regression model
• be able to express new econometric methods mathematically
• be able to think clearly about the relationship between data, model and estimation in econometrics
• use statistical software to study actual data sets
Topics Covered
I intend to teach the following set of topics.
• Brief review of OLS estimation (2 weeks)
• Endogeneity: when OLS fails (1 week)
• Instrumental variables estimation (2 weeks)
• Experiments and quasi-experiments (2 weeks)
• Binary dependent variables (2 weeks)
• Panel data and time series models (4 weeks)
If you are interested in any other topic not given here, feel free to let me know as I am happy to adapt the
course and incorporate your ideas and preferences. Note that the numbers of weeks given in
parentheses are only estimates and may differ as we go along.
Prerequisites
To enrol in this course you must have completed
• ECON1101 and
• EMET2007 or STAT2008.
Communication
Important: The official website for this course is
<http://EMET2008.Readthedocs.org>
I will frequently make announcements on the homepage of the Course Website (under “Announcements”).
The official forum for announcements of any kind is the lecture. If necessary, I will contact students
electronically using their official ANU student e-mail address. If you want to contact me, send an e-mail to
[email protected]
E-mail addresses are only to be used when you need to contact staff about administrative or academic
matters. They are NOT to be used for instructional purposes.
Textbook
The textbook for the course is Introduction to Econometrics (third edition, 2012) by Stock and Watson. Chifley
Library has several copies of the textbook. I strongly recommend that you buy a copy of the book as I base
the lecture and practice sessions on it.
Other excellent textbooks include A Guide to Modern Econometrics by Verbeek and Introductory Econometrics:
A Modern Approach 5ed, by Wooldridge.
Software
The econometric software for this course is “Stata”. Here’s a quick wiki summary of what Stata
is: <http://en.wikipedia.org/wiki/Stata>. From my own experience, Stata is a comprehensive, well-documented, powerful and user-friendly statistical software package. We will get to know Stata during the tutorial
in a “learning-by-doing” manner.
Staff
Administrative
For any administrative inquiries or problems (e.g., tutorial enrollment, exam scheduling, supplementary
exams, etc.) you should contact Terry Embling (School of Economics Course Administrator) or Finola
Wijnberg (School of Economics School Administrator).
| Name | Job title | Office | Location | Hours | E-mail |
|---|---|---|---|---|---|
| Terry Embling | Course administrator | HW Arndt Building 25a | Room 1013 | 9:00-16:00 | [email protected] |
| Finola Wijnberg | School administrator | HW Arndt Building 25a | Room 1014 | 9:00-16:00 | [email protected] |
Academic
If you have any academic inquiries or problems regarding the course, please don’t hesitate to contact me:
| Name | Office | Location | Hours | E-mail |
|---|---|---|---|---|
| Juergen Meinecke | HW Arndt Building 25a | Room 1022 | Tue 13:00-16:00 | [email protected] |
Lectures and Tutorials
There will be four hours of contact time per week: a two-hour lecture and a two-hour practice session. You
are expected to attend all of these. If you have persistent time conflicts with any of these class sessions, you
should not be taking the course. Although content will be made available digitally (for example through
audio recordings) you should not treat virtual attendance as a perfect substitute for physical attendance.
The class meets in the following venues at the following times:
| Day | Type | Time | Location |
|---|---|---|---|
| Tuesday | Lecture | 10-12 | CBE Bld LT4 |
| Thursday | Problem Solving | 10-11 | COP GO25 |
| Thursday | Computing Session | 11-12 | COP GO25 |
As you can see, the two-hour practice sessions happen on Thursdays and can be subdivided into a one-hour
problem solving session and a one-hour computing session. We will not always treat these two sub-sessions
as strictly separate; instead, we will regard them as one big practice session that combines theoretical
exercises with computing exercises.
Digital Lecture Delivery
Audio recordings of the Tuesday lecture will be made available on Wattle.
The Thursday sessions (tute and computing) will not be made available on Wattle (they are group learning
sessions and as such do not lend themselves to audio recordings).
Workload
University study requires at least as much time and effort as a full-time job. You are expected to attend all
lectures and tutorials (4 hours per week). You should expect to put in at least 6 hours per week of your own
study time for this course in addition to the 4 hours of lectures and tutorials.
3.1.2 Course Assessment
The following table summarizes the assessable items for the course.
| Assessment item | Due date | Weight |
|---|---|---|
| Assignment 1 | Thursday, week 6 | 10% |
| Midterm exam | Week 7 | 25% |
| Assignment 2 | Friday, week 13 | 10% |
| Final exam | TBA | 45% |
| Practice session participation | Throughout | 10% |
Note, all assessment items are compulsory. If you miss any one item without approval by the School or
College, you will fail the entire course!
Assignments
Working through exercises is an effective method of learning econometrics, as it is with most mathematical
subjects. That means that the assignments are more than simply part of the assessment for the course.
Students will be required to submit two written assignments during the semester.
The assignments will require computer work as well as analytical work. These assignments should be your
own work. You may discuss assignments with classmates, but you should do all your own computing and
writing of the assignments. It is an offense against the University’s regulations to copy from other students’
assignments.
Assignments should be submitted by dropping them into a specially labeled assignment box at the Research
School of Economics. (Contact Terry Embling for details.) The front page of the submitted assignments
must show your name, student number and the course name (EMET2008). Assignments missing any of
this information will receive a mark of zero.
Assignments must be submitted by 2pm on the due date. If you have a university approved excuse for
not handing in an assignment, then the value of the final exam will be increased by 10 percentage points to
compensate for the missed work.
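To make the weighting concrete, the raw course mark implied by the assessment table can be written as a weighted sum. This is only a sketch: the symbols A_1, A_2, M, F and P (each component's mark, expressed as a percentage of its maximum) are notation introduced here and are not part of the official outline, and the result is the raw score before any scaling of grades.

```latex
% Standard case: all five assessment items completed
\[
\text{Raw mark} = 0.10\,A_1 + 0.25\,M + 0.10\,A_2 + 0.45\,F + 0.10\,P
\]
% One assignment (say Assignment 2) missed with a university approved excuse:
% the final exam's weight rises by 10 percentage points, from 45% to 55%
\[
\text{Raw mark} = 0.10\,A_1 + 0.25\,M + 0.55\,F + 0.10\,P
\]
```

In both cases the weights sum to one, so the raw mark stays on a 0 to 100 scale.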
Further details about assignment submission will be given during lectures.
Midterm Examination
The midterm examination will be held during practice session time on Thursday of week 7. The exam
covers all material from weeks 1 through 6 of the course. The exam will be marked out of 100. It is your
responsibility to make yourself available for the midterm examination.
No make-up midterm examination will be offered. Should you miss the midterm exam for a valid reason
(see Rules and Policies below) then your grade will be based solely on your final exam.
Final Examination
Examinable material covers the whole semester, including material already covered in the midterm exam.
The exam will be marked out of 100.
The final exam will be held in the exam period at the end of the semester. Details will be posted on the
ANU exam timetable site.
Practice Session Participation
Your participation is an essential part in the overall learning experience (both for you as well as your classmates!) in the course. I will evaluate you on your participation during the Thursday practice sessions. Feel
free to participate and contribute to the sessions. Do not be afraid to give wrong answers; as long as you are
constructively engaged, there is no such thing as a wrong answer.
Every Thursday after practice sessions I will take note of students who participated in class and at the end of
the semester I will aggregate these numbers to an overall participation mark. Roughly, I will give 10 marks
to regular participators, 5 marks to occasional participators and zero marks to students who rarely or never
participate. Feel free to seek feedback from me during the semester on your participation performance.
Scaling of Grades
Final scores for the course will be determined by scaling the raw score totals to fit a sensible distribution of
grades. Scaling can increase or decrease a mark but does not change the order of marks relative to the other
students in the course. If it is decided that scaling is appropriate, then the final mark awarded in a course
may differ from the aggregation of the raw marks of each assessment component.
3.1.3 Rules and Policies
It is your responsibility to familiarize yourself with the rules and regulations and the policies and procedures that are relevant to your studies at the ANU.
ANU has educational policies, procedures and guidelines, which are designed to ensure that staff and students are aware of the University’s academic standards, and implement them. You can find the University’s
education policies and an explanatory glossary at: ANU Policies.
Students are expected to have read the Student Academic Integrity Policy before the commencement of
their course.
Other key policies include:
• Student Assessment (Coursework)
• Student Surveys and Evaluations
The University also offers a number of support services for students. Information on these is available
online from ANU Studentlife.
3.2 Reading List
Note: This reading list assists you in finding the textbook sections that cover the material discussed during that week’s lecture and practice session. Occasionally, the references provided in the table below go
beyond what was discussed in class. In those cases, the reading is only recommended for deepening your
understanding; it is, however, not required.
| Week | Textbook sections | Key concepts covered |
|---|---|---|
| 0 (assumed knowledge) | chapters 4 through 7 | linear regression with one regressor; linear regression with multiple regressors; hypothesis tests and confidence intervals |
| 1 | 2.5, 2.6, 3.1 | population; random sample; iid; statistical inference; random variable; population mean vs. sample average; estimator vs. estimate; unbiasedness; consistency, law of large numbers, convergence in probability; central limit theorem |
| 2 | 3.1, 3.2, 3.3 | unbiasedness; consistency; efficiency; BLUE; central limit theorem; Monte Carlo simulation; hypothesis testing; confidence interval; standard error |
| 3 | 4.1, 4.2, 4.4, 6.5, 6.7 | brief review of OLS |
| 4 | 1.2, 6.1, 9.2, 9.4 | endogeneity; causal effect; omitted variables bias; sample selection bias; simultaneity bias, reverse causality; measurement error bias |
| 5 | 12.1 through 12.5 | instrumental variables estimation; instrument relevance; instrument exogeneity; TSLS estimation; first stage; reduced form equation; structural equation; weak instrument; rule of thumb |
| 6 | 13.1, 13.2, 13.3 | experiments; randomized control trial; causal effect; treatment effect; internal and external validity; threats to validity |
3.3 Exercises
Note: Answers to exercises will only be provided during class time. If you cannot make it to class, you
will need to see me during consultation times and we will work through the exercises together. (When
you see me during consultation times, I expect you to be prepared. I will never merely provide answers to
exercises. Instead, I want to see good faith effort on your part in which case I will be more than happy to
help you work through the exercises.)
3.3.1 Week 1
Problem Solving
1. Prove that the sample average $\bar{Y}$ is an unbiased estimator for the population mean.
2. Prove that in the linear model $Y_i = \mu + \varepsilon_i$ the ordinary least squares estimator of $\mu$ is equal to the
sample average. Mathematically, minimize the sum of squares $\sum_{i=1}^{n} (Y_i - \hat{\mu})^2$ and show that the
minimum is obtained by setting $\hat{\mu} = \bar{Y}$.
3. Is there a difference between an estimator and an estimate?
Computational
We will use this first practice session to become familiar with Stata.
1. Work through the “Stata for Researchers” website (you can find the link below).
I will use this exercise to teach you some basic tricks you should know about Stata. For Stata help and
support I can highly recommend the following two sources:
• The Social Science Computing Cooperative at the University of Wisconsin provides excellent support
for Stata beginners. Check out their website “Stata for Students”:
<http://www.ssc.wisc.edu/sscc/pubs/stata_students1.htm>
Also, check out their website “Stata for Researchers”:
<http://www.ssc.wisc.edu/sscc/pubs/sfr-intro.htm>
This website should be your first port of call in all things Stata.
• Furthermore, the UCLA Institute for Digital Research and Education provides fantastic resources for
people who are interested in learning Stata:
<http://www.ats.ucla.edu/stat/stata/>
Feel free to use these links throughout the semester to improve your Stata skills!
3.3.2 Week 2
Problem Solving
1. Let $Y_i \sim \text{i.i.d.}(\mu, \sigma^2)$. You have learnt in the lecture that $\hat{\mu}_1 := \bar{Y}_n$ is an unbiased and consistent
estimator for the population mean $\mu$. Are the following estimators also unbiased or consistent for $\mu$?
Discuss!
(a) $\hat{\mu}_2 := 42$ (‘the answer to everything’ estimator)
(b) $\hat{\mu}_3 := \bar{Y}_n + 3/n$
(c) $\hat{\mu}_4 := (Y_1 + Y_2 + Y_3 + Y_4 + Y_5)/5$
2. Excerpt from the website of the Australian Bureau of Statistics:
"The Adult Literacy and Life Skills Survey (ALLS) was conducted in Australia as part of an
international study coordinated by Statistics Canada and the Organisation for Economic
Co-operation and Development (OECD). The ALLS is designed to identify and measure literacy
which can be linked to the social and economic characteristics of people both across and
within countries. The ALLS measured the literacy of a sample of people aged 15 to 74 years.
The ALLS provides information on knowledge and skills in (among others) *Numeracy*, i.e.
the knowledge and skills required to effectively manage and respond to the mathematical
demands of diverse situations."
In a sample of 1,000 randomly selected Australians, the average numeracy score was 312 and the
sample standard deviation was 41. Construct a 95% confidence interval for the population mean of
the numeracy score.
3. Exercise 3.3 parts a and b; Exercise 3.4.
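As a reminder for the confidence interval question above, the usual large-sample 95% confidence interval for a population mean takes the form below. This is a sketch in standard textbook notation (s_Y for the sample standard deviation and n for the sample size are symbols assumed here, and the lecture's notation may differ); plugging in the numbers is left as the exercise.

```latex
\[
\left[\, \bar{Y} - 1.96 \cdot SE(\bar{Y}),\;\; \bar{Y} + 1.96 \cdot SE(\bar{Y}) \,\right],
\qquad
SE(\bar{Y}) = \frac{s_Y}{\sqrt{n}}
\]
```

The critical value 1.96 comes from the standard normal distribution, which is justified here by the central limit theorem and the large sample size.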
Computational
1. Continue working through the “Stata for Researchers” website (as started last week).
2. Empirical Exercise E4.2 parts a and b.
3.3.3 Week 3
Problem Solving
Consider the following linear model for heights:
$Y_i = \beta_0 + \beta_1 X_{i1} + u_i,$
where $Y_i$ is the height of person $i$ and $X_{i1}$ is a gender dummy variable that takes on the value 1 if person $i$
is male and zero otherwise.
1. In that model, what does $\beta_0$ capture? What does $\beta_0 + \beta_1$ capture?
2. Define and derive (mathematically) the OLS estimators of $\beta_0$ and $\beta_1$.
Computational
1. Empirical Exercise E3.1 part d.
2. Empirical Exercise E4.4.
The following Stata do-file is a solution to this exercise. Please feel free to use this as the starting point
for all your future do-files. Just copy and paste it into your Stata do-file editor and save it as a new
do-file. Since it contains the answer to Empirical Exercise 4.4, I gave it the name “E4_4.do”, but you
can choose whatever name you want.
What’s important here is that you need to customize the code below in one place:
• Work directory: This is the location on your computer where you access and store all your files.
This includes the textbook’s data files (the files with a dta-suffix), your self-written Stata do-files
as well as the log-files that are created by your do-files. You choose the work directory; it is likely
different on different computers. For example, on my office desktop computer I created a work
directory called /Users/juergen/EMET2008/Stata.
For the code below to work, you need to keep ALL files in the same work directory! Again, this
includes your textbook’s data files as well as the Stata do-files and log-files that you create:
// ====================================================
// PREAMBLE
// ====================================================
clear all
// clear memory
capture log close
// close any open log files
set more off
// don’t pause when screen fills
// set work directory (put your own path here!):
cd /path/to/location/on/your/computer/where/Stata/files/go
log using E4_4.log, replace
// open new log-file
// ====================================================
// Work on your data set
// ====================================================
use "Growth.dta"
// loading data set (needs to be in work directory)
summarize
scatter growth tradeshare
regress growth tradeshare
margins, at(tradeshare==1)
margins, at(tradeshare==0.5)
regress growth tradeshare if country_name!="Malta"
margins, at(tradeshare==1)
margins, at(tradeshare==0.5)
log close
// close log-file
3.3.4 Week 4
Problem Solving
1. Derive the bias from omitted variables.
2. In a recent applied econometrics research project, I have been interested in the causal effect of academic fraud on labor market outcomes. The broad research question is: Do people who commit
academic fraud (at university) benefit significantly from it? Sounds like a straightforward research
question, but answering it is quite challenging econometrically.
Let’s say the model looks like
$Y_i = \beta_0 + \beta_1 \text{Fraud}_i + \beta_2 \text{Male}_i + \beta_3 \text{Educ}_i + \beta_4 \text{Age}_i + u_i,$
where $Y_i$ are weekly earnings (full time), $\text{Fraud}_i$ is a dummy variable that is equal to one if a person
reported that s/he committed academic fraud during university and zero otherwise. (All other rhs
variables are self-explanatory.)
If I run this regression and obtain the estimate $\hat{\beta}_1$ for $\beta_1$, can I interpret this as the causal effect of
academic fraud on earnings? Discuss!
Computational
1. Empirical Exercise E6.3.
Solution:
// ====================================================
// PREAMBLE
// ====================================================
clear all
// clear memory
capture log close
// close any open log files
set more off
// don’t pause when screen fills
// set work directory (put your own path here!):
cd /path/to/location/on/your/computer/where/Stata/files/go
log using E6_3.log, replace
// open new log-file
// ====================================================
// Work on your data set
// ====================================================
use "./Stock_data/Growth.dta"
drop if (country_name=="Malta")
summarize
reg growth tradeshare yearsschool rev_coups assasinations rgdp60
margins, atmeans
margins, at((mean) _all tradeshare=.771)
// checking for heteroskedasticity
// 1) pedestrian way
predict uhat, res
generate uhatsq = uhat^2
regress uhatsq tradeshare yearsschool rev_coups assasinations rgdp60
// null: homoskedasticity
// check F-stat
// 2) lazy way
reg growth tradeshare yearsschool rev_coups assasinations rgdp60
estat hettest, rhs fstat
log close
// close log-file
2. In EMET2007 you (hopefully!) have learned how to test for homoskedasticity versus heteroskedasticity. How would you do this with Stata? (Use the Growth data set from the previous exercise to
illustrate the test.) If you indeed find that the data is heteroskedastic, how would you correct for it
with Stata?
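One common way to handle the correction part of the question is to keep the OLS coefficients and ask Stata for heteroskedasticity-robust standard errors. The lines below are a minimal sketch, assuming that Growth.dta sits directly in the work directory (the solution above reads it from a Stock_data subfolder) and that Malta is dropped as before.

```stata
// Sketch only: assumes Growth.dta is in the current work directory
use "Growth.dta", clear
drop if country_name == "Malta"

// OLS with conventional (homoskedasticity-only) standard errors
regress growth tradeshare yearsschool rev_coups assasinations rgdp60

// same regression with heteroskedasticity-robust standard errors;
// the coefficient estimates are identical, only the standard errors change
regress growth tradeshare yearsschool rev_coups assasinations rgdp60, vce(robust)
```

Comparing the two outputs side by side shows how much the standard errors, t-statistics and confidence intervals move once homoskedasticity is no longer assumed.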
3.3.5 Week 5
Problem Solving
Consider the simple linear model $Y_i = \beta_0 + \beta_1 X_i + u_i$.
1. Mathematically define the OLS estimator and prove that it is inconsistent under endogeneity.
2. Mathematically define the TSLS estimator and prove that it is consistent under endogeneity.
3. Which of the two estimators is consistent under exogeneity?
4. Research question: Do girls who attend girls’ schools do better in math than girls who attend coed
schools? I give you a data set that includes the following variables:
• score: score in a standardized math test
• girlshs: dummy variable which is equal to 1 if a person attended girls’ school or zero otherwise
• feduc: father’s education
• meduc: mother’s education
• hhinc: household income
(a) You run an OLS estimation of score on girlshs and all the other variables. Will your OLS estimate
of the coefficient on girlshs capture the causal effect of girls’ school on math score? If not, why
not?
(b) What would be a good instrumental variable for girlshs?
Note: this exercise is based on Wooldridge, Introductory Econometrics, A Modern Approach, 5th edition,
chapter 15.
Computational
1. Empirical Exercise E12.2 (Stock and Watson book)
3.3.6 Week 6
Problem Solving
Cool things can be done with randomized control trials. Here I expose you to the work of two recent
economics papers published in a top field journal.
1. We will read and discuss the paper on the effects of home computer use on academic achievement of
school children (written by Fairlie and Robinson).
Here is the paper for download (with my annotations): Fairlie Robinson 2013
2. We will read and discuss the paper on the effects of dropping schools by helicopter on rural villages
in Afghanistan (written by Burde and Linden).
Here is the paper for download (with my annotations): Burde Linden 2013
Computational
1. Empirical Exercise E13.1 (Stock and Watson book)
3.3.7 Week 7
Midterm exam
3.3.8 Week 8
Problem Solving
We will review the midterm exam. In particular: Q1, Q2 and Q5. (The other two questions are easy to
answer if you have read the papers.)
Computational
1. Empirical Exercise E11.1 (Stock and Watson book)
Solution to part (f)
twoway function y = _b[_cons] + _b[age] * x + _b[agesq]* x^2 + _b[colgrad], range(18 65)
3.3.9 Week 9
Problem Solving
Maximum likelihood estimation of probit and logit coefficients.
1. Define the maximum likelihood estimator.
2. Derive the maximum likelihood estimator.
3. Discuss statistical inference for the probit and logit coefficients.
4. Discuss consistency of the probit and logit estimators.
Note: In contrast to the linear probability model (which is a linear model that can be estimated straightforwardly by OLS) the probit and logit models are non-linear (remember that S-shaped curve from the
lecture?). Non-linear models are considerably more difficult to estimate. In this problem solving session I
will try to explain to you the principal idea and math of maximum likelihood estimation of probit and logit
models. In the end, the estimation will need to be done by computers. Luckily, Stata offers a nice set of
commands to help out.
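For orientation, the maximum likelihood estimator for a binary outcome model is usually written as the maximizer of the log-likelihood below. This is a sketch in standard notation, assuming an i.i.d. sample with Y_i equal to 0 or 1 and a regressor vector X_i that includes a constant; the symbols F and b are introduced here and the lecture's notation may differ.

```latex
\[
\hat{\beta}^{ML} = \arg\max_{b} \; \sum_{i=1}^{n}
\Big[ \, Y_i \ln F(X_i' b) + (1 - Y_i) \ln\big(1 - F(X_i' b)\big) \Big]
\]
% probit: F = \Phi, the standard normal c.d.f.
% logit:  F(z) = 1 / (1 + e^{-z})
```

There is no closed-form solution for the maximizer, which is why the estimation is carried out numerically by the computer.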
Computational
1. Empirical Exercise E11.2 (Stock and Watson book)
Solution:
// ====================================================
// PREAMBLE
// ====================================================
clear all
// clear memory
capture log close
// close any open log files
set more off
// don’t pause when screen fills
// set work directory (put your own path here!):
cd /path/to/location/on/your/computer/where/Stata/files/go
log using E11_2.log, replace
// open new log-file
// ====================================================
// Work on your data set
// ====================================================
use "./Stock_data/Smoking.dta"
summarize
******* a ***********
generate agesq = age^2
probit smoker smkban female age agesq hsdrop hsgrad colsome colgrad black hispanic, robust
******* b ***********
* just read off from probit output in part a
******* c ***********
test hsdrop hsgrad colsome colgrad
******* d ***********
margins, at(smkban=(0 1) female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
margins, dydx(smkban) at(female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
******* e ***********
margins, at(smkban=(0 1) female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=
margins, dydx(smkban) at(female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=
******* f ***********
regress smoker smkban female age agesq hsdrop hsgrad colsome colgrad black hispanic, robust
margins, at(smkban=(0 1) female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
margins, dydx(smkban) at(female=0 age=20 agesq=400 hsdrop=1 hsgrad=0 colsome=0 colgrad=0 black=0
margins, at(smkban=(0 1) female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=
margins, dydx(smkban) at(female=1 age=40 agesq=1600 hsdrop=0 hsgrad=0 colsome=0 colgrad=1 black=
log close
// close log-file
3.3.10 Week 10
Problem Solving
We will briefly revisit last week’s problem solving session to summarize ML estimation of probit and logit
models.
Computational
1. Revisit Empirical Exercise E11.2 (Stock and Watson book)
2. Empirical Exercise E10.1 (Stock and Watson book)
(a) Regress lnvio on shall separately for the years 1977 and 1999. What is the causal effect?
(b) Run a pooled regression across all years.
(c) Can you think of an unobserved variable that varies by state but not across time? How about
one that varies across time but not by state?
(d) Reshape your data from long format to wide format. Use the reshaped data to create differenced
variables (between 1999 and 1977) for lnvio and shall.
(e) Run a regression of the differences. What is the causal effect? How does it compare to part (a)?
Why should the estimate be different theoretically?
For the rest of this exercise, reshape your data back into long format. (Simply reload the original
data set.)
(f) Run an (n − 1)-binary regressors estimation of lnvio on shall.
(g) Run a fixed effects estimation of lnvio on shall. Do it in two different ways:
i. Hard way: demean the variables yourself and regress demeaned variables on each other.
ii. Lazy way: use Stata’s inbuilt fixed effect estimation command.
How do the results differ from part (f)?
(h) Add the explanatory variables incarc_rate, density, avginc, pop, pb1064, pw1064, and
pm1029 to the estimation.
(i) Now also control for time fixed effects. Do it in three different (yet equivalent) ways:
i. Entity demeaning with (T − 1)-binary time indicators
ii. Time demeaning with (n − 1)-binary entity indicators
iii. (T − 1)-binary time indicators and (n − 1)-binary entity indicators
(j) Redo the main estimation using the logarithms of rob and mur instead of vio as outcome variables. How do your findings change?
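Parts (f) and (g) above ask for three routes to the same entity fixed effects estimate. The following is a rough sketch of how they might look in Stata, not an official solution. It assumes the data file is called Guns.dta and sits in the work directory, and that the state and year identifiers are named stateid and year; these names are assumptions and should be checked with describe.

```stata
// Sketch only: file name and the identifiers stateid/year are assumptions
use "Guns.dta", clear
capture generate lnvio = ln(vio)   // skipped if lnvio already exists

// (f) (n - 1) binary regressors: one dummy per state, one omitted
regress lnvio shall i.stateid

// (g) i. hard way: demean by state, then regress without a constant
bysort stateid: egen lnvio_bar = mean(lnvio)
bysort stateid: egen shall_bar = mean(shall)
generate lnvio_dm = lnvio - lnvio_bar
generate shall_dm = shall - shall_bar
regress lnvio_dm shall_dm, noconstant

// (g) ii. lazy way: built-in fixed effects estimator
xtset stateid year
xtreg lnvio shall, fe
```

The coefficient on shall should agree across the three regressions. The standard errors from the hand-demeaned regression will generally differ because Stata does not know that the state means were estimated.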
3.3.11 Week 11
Problem Solving
Define and derive the fixed effect estimator.
Computational
Continue working on Empirical Exercise E10.1 (Stock and Watson book), see previous week.
3.4 Assignments
3.4.1 Assignment 1
Instructions
Answer all questions!
This assignment is due at 2.00pm on Thursday, 28 August. It is worth 10% of your final mark for this course.
Hand in your work by putting it in the EMET2008/6008 assignment box (HW Arndt Building 25a, opposite
room 1002). Absolutely no extensions will be given; late assignments will receive zero credit. If you have
a university approved excuse for not handing in this assignment, then the weight of your final exam will
be increased by 10 percentage points to compensate for the missed work.
While I would prefer it if you could provide typed answers, you may also hand in handwritten answers as long
as they are legible and easy to follow. (I will not only mark the correctness of your answers but also the
clarity of the exposition and the transparency with which you communicate your results.) The work that
you hand in should consist of answers to the questions, together with an appendix which contains both the
printout of a complete Stata do-file and a log-file (produced by the do-file) that covers the entire assignment.
Answers should be in sentence form (i.e. single word or single number answers without explanation will be
considered incomplete), but clarity of presentation is important, so try to make your comments/discussion
brief and to the point. Annotated output does not constitute a sufficient answer to any question, but you
should highlight those parts of your output that you explicitly use in your answers.
If you have any questions regarding these instructions, do not hesitate to ask me (during class
meetings, during consultation times, or by e-mail).
Exercise 1
Frankel and Rose (for short: FR), in their 2005 paper “Is Trade Good or Bad for the Environment? Sorting
Out the Causality”, published in The Review of Economics and Statistics (volume 87(1), pages 85-91),
empirically address the question:
Is globalization good or bad for the environment?
In particular, they examine whether countries which are more open to international trade incur more (or
less) environmental damage as a result, controlling for international variations in real growth rates and in
political institutions. FR quantify environmental damage on seven dimensions: SO2 air concentrations,
NO2 air concentrations, particulate matter air concentrations, CO2 air concentrations, deforestation, energy
resources depletion, and rural clean water access. Their analytic focus, however, is primarily on the SO2, NO2, and
particulate matter air pollution impacts.
Overall, FR find that greater trade openness, quantified as
$$\frac{\text{Exports} + \text{Imports}}{\text{GDP}},$$
is actually associated with better environmental outcomes. This result might seem surprising, in view of
the:
"...race-to-the-bottom hypothesis, which says that open countries in general adopt looser
standards of environmental regulation, out of fear of a loss in international competitiveness.
Alternatively, poor open countries may act as pollution havens, adopting lax environmental
standards to attract multinational corporations and export pollution-intensive goods.
Less widely recognized is the possibility of an effect in the opposite direction, which we call
the gains-from-trade hypothesis. If trade raises income, it allows countries to attain more of
what they want, which includes environmental as well as more conventional output. Openness could
have a positive effect on environmental quality (even for a given level of GDP per capita) for a
number of reasons. First, trade can spur managerial and technological innovation, which can have
positive effects on both the economy and the environment. Second, multinational corporations
tend to bring clean state-of-the-art production techniques from high-standard source countries
of origin to host countries. Third is the international ratcheting up of environmental standards
through heightened public awareness. Whereas some environmental gains may tend to occur with any
increase in income, whether taking place in an open economy or not, others may be more likely
when associated with international trade and investment. Whether the race-to-the-bottom effect
in practice dominates the gains-from-trade effect is an empirical question."
(modified quotation from FR 2005)
In this exercise you will replicate FR’s empirical results for the SO2 measure of air pollution and diagnostically check their regression model, so as to assess the credibility of their statistical inference results.
Two features of this exercise should be noted at the outset. First, the FR model is
estimated using only 41 observations. This is an unusually small sample size for a piece of research
in applied econometrics that is published in such a high-quality journal. As we have learned in class,
sample estimators will have an approximately normal distribution (justified by the central limit theorem);
the larger the sample size, the better the approximation. For the purpose of this assignment, we will not
worry further about the small sample size here. We keep it in the back of our minds but are otherwise
happy to use and apply our standard econometric toolkit.
Second, we conduct our analysis here under the assumption that all key explanatory variables are exogenous.
In particular, real per capita income and trade openness are considered to be determined outside of the
model. (We will deviate from that assumption in the next exercise.)
Use the Stata file Frankel_Rose.dta (available for download on Wattle). This file contains data on 41 countries,
collecting (among others) the following variables:
Variable   Description
sulfdm     mean 1990 SO2 (sulfur dioxide) concentration (in micrograms per cubic meter)
inc        logarithm of real per capita GDP (from the Penn World Tables 5.6; in 1990 dollars, PPP adjusted)
incsqr     squared value of the logarithm of real per capita GDP
pwtopen    100 · (Imports + Exports)/GDP, from the Penn World Tables 5.6
polity     index of democratic (+10) versus autocratic (-10) institutions
lareapc    logarithm of land area per capita
oecd       dummy variable which equals 1 if the country is an OECD member country
country    country name
1. Estimate a basic regression model with sulfdm as the outcome variable using the regressors inc,
incsqr, pwtopen, polity, lareapc. Interpret your coefficient estimates.
What does this say about the impact of trade openness on SO2 concentrations – i.e., on the relative
importance of the "race-to-the-bottom" versus "gains-from-trade" hypotheses alluded to earlier?
2. Test the model for heteroskedasticity. Do you conclude that the model is heteroskedastic? (If so,
proceed with the heteroskedasticity-corrected version of the model in everything that follows.)
In the linear regression model, when you correct for heteroskedasticity, how do your coefficient estimates change (vis-a-vis the model in which you do not correct for heteroskedasticity)? What about
the standard errors?
3. Explain (in words, not maths) what the $R^2$ of a regression measures. How does the adjusted $R^2$,
denoted $\bar{R}^2$, differ from this?
Using the adjusted $R^2$ statistic, what is the fraction of the sample variation in sulfur dioxide concentration which is explained by these five explanatory variables? By how much does this fraction decrease
once the openness variable is dropped from the model?
(Note: To have Stata report the value of the adjusted $R^2$, use the command ereturn list after the regress
command: the adjusted $R^2$ will be listed as e(r2_a).)
4. Produce the scatter plot of sulfdm against the crucial independent variable pwtopen. Can you spot
two outliers? Which countries do they correspond to? Are they the driving force behind your estimation results?
5. Estimate a re-specified model, using both the logs of sulfdm and pwtopen. (Recall from your study
of the log-log model in EMET2007 that the coefficient on logpwtopen can be interpreted as the elasticity of sulfdm with respect to pwtopen.) How do your conclusions about the openness effect
change?
6. Check whether the key coefficient in the model is different for OECD countries.
(Note: This exercise is from the book "Fundamentals of Applied Econometrics" by Richard Ashley.)
Exercise 2
In Exercise 1 you used OLS to study the relationship between trade openness and sulfur dioxide levels (as
a proxy for environmental outcomes). That analysis was done under the assumption that all explanatory
variables are exogenous. The actual contribution of the paper by FR is to look deeper and examine the
causal relationship between trade openness and environmental quality while both controlling for income
and appropriately dealing with the likely endogeneity of both income and trade openness. To that end they
used several instrumental variables to deal with these two explanatory variables. You will replicate some
of these results in the current exercise.
Use the Stata file Frankel_Rose.dta. This file contains data on 41 countries, collecting (in addition to the
variables mentioned in the previous exercise) the following instrumental variables:
Instrument        IV for    Description
trade_potential   pwtopen   Trade potential of a country. This variable combines information on a country’s
                            geographical location (number of neighbor countries, access to sea, landlock
                            status), population size, land area and language to construct a measure of
                            potential trade. For example, all else equal, a country with access to the sea will
                            have a higher trade potential than a country that is landlocked. This IV is notably
                            correlated with the endogenous regressor pwtopen while plausibly uncorrelated
                            with environmental outcomes.
inc_exog          inc       Exogenous income of a country. While per capita income inc is likely
                            endogenous, it contains some exogenous components. FR combine information
                            on a country’s lagged income as well as school attainment to construct the
                            exogenous component of income. For example, all else equal, a country with
                            higher average school attainment will have higher per capita income than a
                            country with lower average school attainment. This IV is notably correlated with
                            the endogenous regressor inc while plausibly uncorrelated with environmental
                            outcomes.
inc_exogsqr       incsqr    Since their model specification also includes the square of the logarithm of real
                            per capita GDP (incsqr), FR also define inc_exogsq as the square of
                            inc_exog and use this as an instrument for incsqr.
1. Re-estimate the basic model from Exercise 1, part 1, using instrumental variables estimation instead.
Use all three instruments and make your estimation robust to heteroskedasticity.
What does this say about the impact of trade openness on SO2 concentrations – i.e., on the relative
importance of the "race-to-the-bottom" versus "gains-from-trade" hypotheses alluded to earlier?
2. Examining the first-stage regressions, do all three first-stage models have reasonably high adjusted
$R^2$ values? Do you need to be concerned about weak instruments? (Use the ‘Rule of Thumb’ explained
in the textbook, section 12.3.)
3. Using the insights gained from Exercise 1, re-estimate the model, replacing sulfdm and pwtopen by
their logarithms, logsulfdm and logpwtopen. How do your conclusions about the openness effect
change?
4. Test whether the OLS and 2SLS coefficient estimates are significantly different. (Hint: use the Hausman test; in Stata type help hausman to learn how to use it. Provide a brief explanation of what
the Hausman test does.)
5. In conclusion to Exercises 1 and 2, what is your answer to the question Is globalization good or bad for
the environment? What are the strengths and weaknesses of the econometric analysis conducted here?
Do you see any possible extensions that could help improve your research?
(Note: This exercise is from the book "Fundamentals of Applied Econometrics" by Richard Ashley.)
Exercise 3
In the research paper "Does Size Matter in Australia?", published in The Economic Record (Vol. 86, No. 272,
March 2010, pp.71-83), Michael Kortt and Andrew Leigh address the research question:
Do taller and slimmer workers earn more?
To that effect, they consider the following linear model:
$$W_i = \beta_0 + \beta_1 \text{Height}_i + \beta_2 \text{BMI}_i + \beta_3 X_{i3} + \cdots + \beta_k X_{ik} + u_i.$$
(This equation is my version of equation (1) on page 73 of their paper.) Here, $W_i$ is the log hourly wage
of person $i$, $\text{Height}_i$ represents a person’s height and $\text{BMI}_i$ stands for a person’s body mass index. The
remaining regressors, $X_{i3}, \ldots, X_{ik}$, capture a person’s demographic characteristics, including gender, age
(linear and quadratic) and education.
Obtain a copy of the paper (available online for ANU students and faculty) and answer the following
questions.
1. Kortt and Leigh begin the analysis by estimating all coefficients by OLS. Summarize their OLS results
regarding the two main coefficients of interest, $\beta_1$ and $\beta_2$ (for height and BMI).
2. Would you interpret these estimates as causal? What are the main endogeneity problems in this regression?
3. Explain how Kortt and Leigh attempt to address the endogeneity problem using instrumental variables. How do their findings change?
4. What is the main conclusion of the paper? Do taller and slimmer workers in Australia earn more?
What is the evidence from other countries?
3.4.2 Assignment 2
Instructions
Answer all questions!
This assignment is due at 2.00pm on Wednesday, 29 October. It is worth 10% of your final mark for this
course. Hand in your work by putting it in the EMET2008/6008 assignment box (HW Arndt Building 25a,
opposite room 1002). Absolutely no extensions will be given; late assignments will receive zero credit. If
you have a university approved excuse for not handing in this assignment, then your marks for your final
exam will be weighted up by 10% to compensate for the missed work.
While I would prefer it if you could provide typed answers, you may also hand in written answers as long
as they are legible and easy to follow. (I will not only mark the correctness of your answers but also the
clarity of the exposition and the transparency with which you communicate your results.) The work that
you hand in should consist of answers to the questions, together with an appendix which contains both the
printout of a complete Stata do-file and a log-file (produced by the do-file) that covers the entire assignment.
Answers should be in sentence form (i.e. single word or single number answers without explanation will be
considered incomplete), but clarity of presentation is important, so try to make your comments/discussion
brief and to the point. Annotated output does not constitute a sufficient answer to any question, but you
should highlight those parts of your output that you explicitly use in your answers.
If you have any questions regarding these instructions, do not hesitate to ask me (during class
meetings, in consultation times, or by e-mail).
Exercise 1
The data set PNTSPRD (available on Wattle) contains information from the Las Vegas sport betting market.
The overarching research question is whether the favorite team is more likely to win the game.
Consider the linear probability model
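$$\Pr(\text{favwin} = 1 \mid \text{spread}) = \beta_0 + \beta_1 \, \text{spread},$$
where favwin equals 1 if the favorite team wins and spread is the point spread, a proxy for how strongly that team is favored. A high point spread means that a team is a strong favorite. Here
is a quick primer on point spread betting from Wikipedia: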
The general purpose of spread betting is to create an active market for both sides of a binary
wager. If the wager is simply "Will the favorite win?", more bets are likely to be made for
the favorite, possibly to such an extent that there would be very few betters willing to take
the underdog.
The point spread is essentially a handicap towards the underdog. The wager becomes "Will the
favorite win by more than the point spread?" The point spread can be moved to any level to
create an equal number of participants on each side of the wager. This allows a bookmaker to act
as a market maker by accepting wagers on both sides of the spread. The bookmaker charges a
commission, or vigorish, and acts as the counterparty for each participant. As long as the total
amount wagered on each side is roughly equal, the bookmaker is unconcerned with the actual
outcome; profits instead come from the commissions.
(excerpt taken on October 7, 2014)
1. Explain why, if the spread incorporates all relevant information, we expect $\beta_0 = 0.5$.
2. Estimate the linear probability model. Test the hypothesis $\beta_0 = 0.5$ against a two-sided alternative.
(Make all estimations robust to heteroskedasticity throughout this entire exercise.)
3. Is spread statistically significant? What is the estimated probability that the favored team wins when
spread = 10?
4. Now estimate the model by probit. Interpret and test the hypothesis that the intercept is equal to 0.5.
5. Use the probit model to estimate the probability that the favored team wins when spread = 10. Compare this with the linear probability model.
6. Add the variables favhome, fav25, and und25 to the probit model and test joint significance of these
variables.
7. Redo parts 4, 5, and 6 using the logit model.
8. Which sport is this exercise about?
(Note: This exercise is from the book "Introductory Econometrics: A Modern Approach" by Jeffrey Wooldridge.)
Exercise 2
Krueger and Maleckova, in their paper "Education, Poverty and Terrorism: Is There a Causal Connection?",
published in the Journal of Economic Perspectives (2003), attempt to estimate the causal effect of education and
poverty on terrorism.
1. What is the main research question of the paper?
2. What econometric method do they use to estimate causal effects?
3. What is the main outcome variable?
4. What are the main explanatory variables?
5. What other explanatory variables do they include?
6. What is their main finding?
7. What problems/shortcomings do you see in their research?
Exercise 3
The data set Airfare (available on Wattle) contains information on airfares, passenger volume, flight
distance and market concentration for 1,149 flight routes (connections) for the years 1997 through 2000. The
overarching research question is whether increased competition reduces air fares. (Do connections with
less market concentration have cheaper prices?) The data set contains the following variables:
Variable   Description
year       year: 1997, 1998, 1999, 2000
id         route identifier (the units of analysis are flight routes)
dist       distance of flight route (in miles)
passen     average number of passengers per day
fare       average one-way airfare (in $)
bmktshr    market share of the biggest carrier (proxy variable for market concentration)
The main explanatory variable is bmktshr. A higher value of bmktshr implies higher market concentration on that route and therefore less competition.
Consider the following linear model:
$$\log(\text{fare}_{it}) = \eta_t + \beta_1 \text{bmktshr}_{it} + \beta_2 \log(\text{dist}_{it}) + \beta_3 [\log(\text{dist}_{it})]^2 + \alpha_i + u_{it},$$
where $\eta_t$ means that we allow for different year intercepts.
Make sure you create all necessary variables, in particular the logs of fare and dist (and the square of the latter).
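The role of the route effect in a model of this kind can be illustrated with a small simulation. The following Python sketch is not part of the assignment and does not use the Airfare data (the course itself works in Stata). It generates a hypothetical panel in which a single regressor is correlated with the unit effect, and compares pooled OLS with the within (fixed effects) transformation. The panel dimensions, the true slope of 0.2 and all variable names are assumptions made for illustration only; year effects are omitted to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

n_units, n_years = 500, 4   # hypothetical panel dimensions
true_beta = 0.2             # hypothetical true slope

# Unit effect (plays the role of alpha_i) and a regressor correlated with it
alpha = rng.normal(size=n_units)
x = 0.5 * alpha[:, None] + rng.normal(size=(n_units, n_years))
u = rng.normal(scale=0.5, size=(n_units, n_years))
y = true_beta * x + alpha[:, None] + u


def ols_slope(xv, yv):
    """Slope coefficient from a simple regression of yv on xv (with intercept)."""
    xc = xv - xv.mean()
    return float(np.sum(xc * (yv - yv.mean())) / np.sum(xc ** 2))


# Pooled OLS ignores alpha_i, so the slope absorbs part of the unit effect
pooled_beta = ols_slope(x.ravel(), y.ravel())

# Within transformation: subtract unit-specific means, which removes alpha_i
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
fe_beta = ols_slope(x_within.ravel(), y_within.ravel())

print(f"true slope:   {true_beta:.3f}")
print(f"pooled OLS:   {pooled_beta:.3f}")
print(f"within (FE):  {fe_beta:.3f}")
```

In this artificial setup the pooled estimate is pulled away from the true slope because the regressor is correlated with the unit effect, while the within estimate is not. Whether the same pattern shows up in real data is an empirical matter.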
1. Estimate the above linear model separately for all four years. If ∆bmktshr = 0.1, what is the estimated
percentage increase in fare? (Make all estimations robust to heteroskedasticity throughout this entire
exercise.)
2. Run pooled OLS across all years, i.e. treat the data as if it were one big regression and control for
years by including year dummies. What is your estimate of $\beta_1$? Is it significant?
3. For what value of dist does the relationship between log(fare) and dist become positive?
4. Estimate the linear model using fixed effects. What is the fixed effect estimate of $\beta_1$?
5. Add the logarithm of passen to the model. How do your estimates change? In summary, does higher
concentration (i.e., higher bmktshr) on a route increase air fares? What is your best estimate?
6. Name two characteristics of a route (other than distance) that are captured by $\alpha_i$ and that are correlated
with bmktshr.
(Note: This exercise is from the book "Introductory Econometrics: A Modern Approach" by Jeffrey Wooldridge.)
3.5 Illustration of Central Limit Theorem using Monte Carlo Simulation
The principal problem in econometrics is that we want to learn something about the unknown population
distribution. For example, we may want to know the mean height of Australians. In practice, we can never know
the true population mean; instead we make statistical inferences about the population mean based on one
random sample of size n that is drawn from the population. We have learnt that a good estimator of the
population mean is the sample average $\bar{Y}_n$.
We have also learned that the sample average itself is a random variable. If you draw more than one
random sample from the population you are likely to obtain different estimates of the population mean
when computing the sample average. The central limit theorem helps us understand what the approximate
distribution of the sample average looks like.
To illustrate the CLT we use Monte Carlo simulation. Here is a brief excerpt from Wikipedia explaining
the term:
"Monte Carlo [Simulations] are a broad class of computational algorithms that rely on repeated
random sampling to obtain numerical results; typically one runs simulations many times over in order
to obtain the distribution of an unknown probabilistic entity. The name comes from the resemblance
of the technique to the act of playing and recording results in a real gambling casino. They are
often used in physical and mathematical problems and are most useful when it is difficult or
impossible to obtain a closed-form expression, or infeasible to apply a deterministic algorithm.
Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical
integration and generation of draws from a probability distribution."
(excerpt taken on 30 July 2014)
Monte Carlo simulations are run on computers that are able to quickly calculate thousands (millions) of
sample averages for as many different samples. In an MC simulation we pretend to know what the distribution of $Y_i$ in the population is: we generate an artificial population from which we draw many
different random samples, and we then compute the sample average for each of
these random samples. We are then able to visualize the distribution of $\bar{Y}_n$ by simply looking at a histogram
of the different sample averages.
To be specific, let’s assume that the population values $Y_i$ are actually exponentially distributed with $\lambda = 1$.
(Using the exponential distribution is only an example. We could choose any other distribution with finite variance here;
the CLT would still apply.) If you (vaguely) recall the properties of the exponential distribution, this implies
that the population mean $\mu$ is equal to 1 and the population variance $\sigma^2$ is also equal to 1. If we draw
one random sample of size $n$, the CLT would therefore suggest the following approximate distribution:
$$\bar{Y}_n \sim N(1, 1/n)$$
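The mean and variance in this approximation follow directly from the assumption that the observations in the sample are independent and identically distributed. Spelled out as a short derivation, using only the quantities defined above:

```latex
E[\bar{Y}_n] = \frac{1}{n} \sum_{i=1}^{n} E[Y_i] = \mu = 1,
\qquad
\mathrm{Var}(\bar{Y}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(Y_i) = \frac{\sigma^2}{n} = \frac{1}{n}.
```

The CLT adds the statement about the shape of the distribution: for large $n$ it is approximately normal with exactly this mean and variance.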
In an MC simulation we are in the luxurious position to create an artificial population based on the exponential distribution of, say, 1,000,000 members. We then draw 10,000 random samples of size $n$ (which can
take on the values 1, 5, 10, 30, 100 in the pictures below) from that population and plot the histogram of the 10,000 sample averages. As
you can see in the plots below, as the sample size increases from 1 to 100, the distribution resembles more
and more that of a normal distribution.
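The simulation described above can be sketched in a few lines of Python. This is a minimal sketch and not the code that produced the plots in these notes: the seed, the variable names and the choice to sample with replacement are assumptions made here for illustration. Instead of drawing histograms, it prints the mean and variance of the 10,000 sample averages next to the variance 1/n suggested by the CLT.

```python
import numpy as np

rng = np.random.default_rng(seed=2008)  # arbitrary seed, for reproducibility only

# Artificial population: 1,000,000 draws from the exponential distribution (lambda = 1)
population = rng.exponential(scale=1.0, size=1_000_000)

n_samples = 10_000
results = {}  # maps sample size n to the array of 10,000 sample averages

for n in (1, 5, 10, 30, 100):
    # Draw 10,000 random samples of size n from the population
    # (sampling with replacement, an assumption that keeps the code simple)
    idx = rng.integers(0, population.size, size=(n_samples, n))
    sample_means = population[idx].mean(axis=1)
    results[n] = sample_means
    print(f"n = {n:3d}   mean = {sample_means.mean():.3f}   "
          f"variance = {sample_means.var():.4f}   1/n = {1 / n:.4f}")
```

A histogram of `results[n]` for each n (for example with matplotlib's `hist` function) would then give the kind of picture discussed in the text.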
Next, instead of studying the approximate distribution of $\bar{Y}_n$, we standardize the distribution and thus
study
$$\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} = \frac{\bar{Y}_n - 1}{1/\sqrt{n}} \sim N(0, 1)$$
It is then easier to superimpose the pdf of the standard normal distribution which can then be directly
compared to the histograms. The CLT says that the histograms should get closer and closer to the pdf of
the standard normal distribution (the dashed line) as the sample size grows from 1 to 5 to 10 to 30 to 100.
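As a numerical companion to the histograms, the following Python sketch standardizes simulated sample averages and reports two simple summaries that can be compared with the standard normal distribution, for which the share of values below zero is 0.5 and the skewness is 0. It is a sketch under assumptions: the samples are drawn directly from the exponential distribution rather than from a stored artificial population, and the seed and variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2008)  # arbitrary seed, for reproducibility only

mu, sigma = 1.0, 1.0   # mean and standard deviation of the exponential with lambda = 1
n_samples = 10_000

share_below_zero = {}
skewness = {}

for n in (1, 5, 10, 30, 100):
    # 10,000 samples of size n, drawn directly from the exponential distribution
    samples = rng.exponential(scale=1.0, size=(n_samples, n))
    # Standardized sample average: (Ybar - mu) / (sigma / sqrt(n))
    z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    share_below_zero[n] = float(np.mean(z < 0))
    skewness[n] = float(np.mean(z ** 3))  # third moment of z, which has mean 0 and variance 1
    print(f"n = {n:3d}   share below 0 = {share_below_zero[n]:.3f}   "
          f"skewness = {skewness[n]:.3f}")
```

Both summaries should move toward their standard normal values as n grows, which is the numerical counterpart of the histograms approaching the dashed line.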
This little MC simulation illustrates the CLT, and it also shows us that sample sizes do not necessarily need
to be very large for the sample average to be approximately normally distributed. In practice, a sample size of 30 seems
sufficiently large for that purpose.
I hope you are convinced now that the CLT really ‘works’. The question remains, how do we use the CLT
for practical purposes?