Sample Midterm Exam Questions: CS 689: Fall 2011
Question 1
You are given two sealed envelopes, one containing $ x and the other containing $ 2x, where x is unknown.
You are asked to select an envelope at random, open it, and then decide if you would like to exchange
it with the other envelope. Develop a justification for your decision based on reasoning using expected
values. Is there an inherent paradox in your solution? Contrast decision-making with expected values to
decision-making with likelihoods, by formulating a likelihood-based solution to this problem (where θ is
the unknown parameter which is either $ x or $ 2x).
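If you want to check your expected-value reasoning empirically, here is a minimal simulation sketch (not part of the question; it assumes some fixed distribution over the unknown amount x) that compares the average payoff of always switching against never switching.

```python
import random

def simulate(n_trials=100_000, seed=0):
    """Monte Carlo check of the two-envelope setup: one envelope holds x,
    the other 2x; one envelope is opened uniformly at random."""
    rng = random.Random(seed)
    keep_total, switch_total = 0.0, 0.0
    for _ in range(n_trials):
        x = rng.uniform(1, 100)              # assumed distribution over the unknown x
        envelopes = [x, 2 * x]
        pick = rng.randrange(2)              # open one envelope at random
        keep_total += envelopes[pick]        # strategy 1: never switch
        switch_total += envelopes[1 - pick]  # strategy 2: always switch
    return keep_total / n_trials, switch_total / n_trials

if __name__ == "__main__":
    keep, switch = simulate()
    print(f"average if you keep:   {keep:.2f}")
    print(f"average if you switch: {switch:.2f}")
```

Comparing the two printed averages is a useful starting point for deciding whether the naive expected-value argument for switching holds up.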
Question 2
Let x1 = 0.1, x2 = 0.7, x3 = 1.4, x4 = 2.3 be a set of IID samples from a uniform distribution U [0, θ] for
θ > 0. Draw (by hand) the shape of the likelihood function. Label the axes clearly. As the number of
samples grows, what happens to the shape of the likelihood function?
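If it helps to check a hand-drawn sketch, the following snippet (illustrative only, not part of the question) plots the likelihood of U[0, θ] for the four samples given above.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.1, 0.7, 1.4, 2.3])        # the samples from the question
theta = np.linspace(0.01, 5.0, 1000)

# Likelihood of U[0, theta]: zero until theta reaches max(x), then theta^(-n)
likelihood = np.where(theta >= x.max(), theta ** (-len(x)), 0.0)

plt.plot(theta, likelihood)
plt.xlabel("theta")
plt.ylabel("L(theta)")
plt.title("Likelihood of U[0, theta] for the four samples")
plt.show()
```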
Question 3
For each of the following questions, answer true or false, and provide a detailed justification for each
answer.
• The variance of the sample mean is always higher than the variance of any individual observation.
• Rao-Blackwellization can only be applied to unbiased estimators.
• Every distribution P (X|θ) has a sufficient statistic.
• The likelihood function is always unimodal.
• The EM algorithm attempts to maximize the log-likelihood of the observed data.
Question 4
Consider the problem of linear regression, where the goal is to find the regression of some variable, such
as your height y, on your father’s height, x. Imagine a cloud of points as the given data, where each
point (xi , yi ) represents your height vs. your father’s height (one for each person). Let us define the
standard deviation line as the line that goes through the sample average point (x̄, ȳ) with a slope equal to sign(r) · σy/σx, where sign(r) is the sign (positive or negative) of the correlation coefficient r, and σx and σy are the sample standard deviations of x and y. Derive an expression for the regression line in terms of
these quantities, and explain whether its slope is flatter or steeper than the standard deviation line. Draw
a qualitative diagram (by hand) illustrating these two lines for some sample data points.
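To make the geometry concrete, here is a small plotting sketch (illustrative only, using synthetic father/child heights rather than real data) that draws the standard deviation line as defined above together with a least-squares regression line fit numerically for comparison.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
father = rng.normal(70, 3, size=200)                   # synthetic heights (inches)
child = 0.5 * father + rng.normal(35, 2.5, size=200)

xbar, ybar = father.mean(), child.mean()
sx, sy = father.std(), child.std()
r = np.corrcoef(father, child)[0, 1]

xs = np.linspace(father.min(), father.max(), 2)
sd_slope = np.sign(r) * sy / sx                        # slope of the standard deviation line
reg_slope, reg_intercept = np.polyfit(father, child, 1)  # least-squares fit, for comparison

plt.scatter(father, child, s=8, alpha=0.5)
plt.plot(xs, ybar + sd_slope * (xs - xbar), label="standard deviation line")
plt.plot(xs, reg_intercept + reg_slope * xs, label="regression line")
plt.xlabel("father's height")
plt.ylabel("child's height")
plt.legend()
plt.show()
```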
Question 5
This question concerns the use of EM to infer models from partially missing data. The National Crime
Survey conducted by the US Bureau of the Census interviewed occupants of a certain apartment complex
to determine if they had been victimized by crime during the preceding six month period. Six months
later, the occupants were interviewed again to determine if they had been the victims of a crime in the
intervening months since the first interview. The data resulting from the survey is summarized in the
following table.
                           Second Interview
First Interview        Crime-Free   Victims   Nonrespondents
Crime-Free                    392        55               33
Victims                        76        38                9
Nonrespondents                 31         7              115
part a What is the “missing” data here? Explain how many training instances in this data can be
used by EM in parameter estimation and which instances have to be discarded. Justify your reasoning.
part b Let xij be the complete data cell counts for the (i, j)th entry of the table. Write an expression
for the complete log-likelihood of the data, and derive the maximum likelihood estimator θ̂ij.
part c To model this as a missing data problem, show how to represent each complete data cell count
xij in terms of the observed data and the missing data.
part d Specify the E-step of the EM algorithm. This requires specifying an expression for P (Xm |Xo , θt ), the predictive distribution of the missing data, given the observed data and the current setting θt of the model parameter. You may find it helpful to use the following interesting property of multinomial distributions, whereby the conditional distribution P (x11 | x11 + x12 , θ) = θ11 /(θ11 + θ12 ), i.e. the conditional distribution of a multinomial variable is also a multinomial.
part e (5 points): Give the complete EM algorithm. Given the small size of the problem, it is
possible to compute the results using a calculator, and the EM procedure converges in 3-4 steps. Using
a calculator, show the convergence of EM for this problem, starting with a random guess of the model
parameter θ.
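For checking hand calculations with a calculator, the sketch below (an assumed implementation, not the official solution) runs EM on the 2 × 2 victimization table, treating the nonrespondent row and column as partially observed counts and discarding the doubly missing cell.

```python
import numpy as np

# Fully classified counts: rows = first interview (crime-free, victim),
# columns = second interview (crime-free, victim).
n = np.array([[392.0, 55.0],
              [76.0, 38.0]])
row_only = np.array([33.0, 9.0])   # first interview observed, second missing
col_only = np.array([31.0, 7.0])   # second interview observed, first missing
# The 115 doubly-missing respondents are dropped.

theta = np.full((2, 2), 0.25)      # arbitrary starting guess for the cell probabilities
for step in range(50):
    # E-step: allocate the partially classified counts across cells in
    # proportion to the current cell probabilities.
    x = n.copy()
    x += row_only[:, None] * theta / theta.sum(axis=1, keepdims=True)
    x += col_only[None, :] * theta / theta.sum(axis=0, keepdims=True)
    # M-step: complete-data MLE of the multinomial parameters.
    theta_new = x / x.sum()
    converged = np.allclose(theta_new, theta, atol=1e-10)
    theta = theta_new
    if converged:
        break

print(np.round(theta, 4))
```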
Question 6
[Figure: a factorial HMM with two parallel state chains S(t−1, 1), S(t, 1), S(t+1, 1) and S(t−1, 2), S(t, 2), S(t+1, 2), with observations Y (t−1), Y (t), Y (t+1).]
The above figure shows a factorial HMM, which can be viewed as a variant of a regular HMM. Here,
each state at time t is made up of a vector of state variables S(t, i), where each variable is governed by a
probability distribution that depends only on the value of the corresponding variable at the previous time.
However, the observation Y (t) at time t can depend on all the state variables at that time.
part a Analyze the conditional independence properties of a factorial HMM using the concept of d-separation. In particular, are the state variables at time t marginally independent? Are the state variables
at time t conditionally independent given the observation Y (t) at time t? Are the state variables at time
t conditionally independent of the past history of state variables, given the value of the variables at the
previous time instant t − 1?
part b Suppose there were M parallel chains (instead of 2 as shown in the figure), and each state
variable takes on K values. Show how to convert a factorial HMM into a regular HMM, and give an
expression for the complexity of the forward algorithm for the converted HMM in terms of T (the length
of the sequence), K (the number of values each state variable takes on), and M (the number of chains).
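One way to see the conversion is to merge the M chains into a single chain whose state is the tuple of per-chain states. Below is a minimal sketch of that construction (the per-chain transition matrices A1 and A2 are hypothetical examples, assuming each chain carries its own K × K transition matrix).

```python
import numpy as np
from functools import reduce

def merge_chains(transitions):
    """Combine per-chain transition matrices (each K x K) into the single
    transition matrix of the equivalent regular HMM over tuple-valued states.

    Because each chain evolves independently given its own previous state,
    the joint transition probability factorizes, which corresponds to a
    Kronecker product of the individual matrices."""
    return reduce(np.kron, transitions)

# Example: M = 2 chains with K = 2 states each.
A1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
A2 = np.array([[0.7, 0.3],
               [0.4, 0.6]])
A_joint = merge_chains([A1, A2])
print(A_joint.shape)        # (4, 4), i.e. (K**M, K**M) joint states
```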
part c Is there a reason not to reduce a factorial HMM to a regular HMM? Is the E-step tractable for factorial HMMs (in their original or converted form)?
Question 7
EM was originally formulated as a method for maximum likelihood estimation from incomplete data.
Does EM assume the underlying samples are IID? Explain.
We often want to do full Bayesian estimation, and not assume uniform priors. Specify a modified
auxiliary function Q∗ (θ|θt ) for applying EM to obtain posterior distributions from incomplete data. Explain
the modification needed, if any, to the E-step and the M-step.
Question 8
Give a general procedure for determining if a (continuous) function is convex or concave. Consider the
logistic function
f (x) = 1/(1 + e−x)
Is the logistic function convex or concave? How about log f (x)? Compare the two loss functions we
have studied in class: the absolute loss and the squared loss. Are both these loss functions convex? What
are the pros and cons of using convex loss functions?
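A quick way to experiment with the second-derivative test for convexity is a symbolic sketch (using SymPy as an assumed tool, not part of the question) that computes the second derivative of the logistic function and of its logarithm.

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = 1 / (1 + sp.exp(-x))                        # the logistic function

f2 = sp.simplify(sp.diff(f, x, 2))              # second derivative of f
logf2 = sp.simplify(sp.diff(sp.log(f), x, 2))   # second derivative of log f

print("f''(x)       =", f2)
print("(log f)''(x) =", logf2)

# A convex function has a nonnegative second derivative everywhere;
# a concave one has a nonpositive second derivative everywhere.
# Evaluating the expressions on a few points shows how their signs behave.
print([float(f2.subs(x, v)) for v in (-2, 0, 2)])
print([float(logf2.subs(x, v)) for v in (-2, 0, 2)])
```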
Question 9
part a Derive an expression that relates the Fisher information JX (θ) for an IID dataset of N instances X = x1 , . . . , xN sampled from a univariate distribution P (x|θ) to the Fisher information Jx (θ) of each individual instance.
part b Compute the Fisher information JX (θ) for an IID dataset X of N instances sampled from the
Rayleigh distribution
P (x|θ) = 2θx e−θx² ,   x ≥ 0
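As a numerical sanity check on a hand derivation, the sketch below (illustrative only; the value θ = 2.0 is an arbitrary assumption) estimates the per-sample Fisher information of this density by Monte Carlo, using the identity Jx(θ) = E[(∂/∂θ log P (x|θ))²] and inverse-CDF sampling.

```python
import numpy as np

def sample(theta, n, rng):
    """Inverse-CDF sampling from P(x|theta) = 2*theta*x*exp(-theta*x^2), x >= 0.
    The CDF is F(x) = 1 - exp(-theta*x^2), so x = sqrt(-log(1 - u)/theta)."""
    u = rng.random(n)
    return np.sqrt(-np.log1p(-u) / theta)

def score(x, theta):
    """Derivative w.r.t. theta of log(2*theta*x) - theta*x^2."""
    return 1.0 / theta - x ** 2

theta = 2.0
rng = np.random.default_rng(0)
x = sample(theta, 1_000_000, rng)
J_single = np.mean(score(x, theta) ** 2)   # Monte Carlo estimate of J_x(theta)
print(J_single)                            # compare against the hand-derived value
```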
Question 10
A finite undirected graph G = (V, E) consists of a set of vertices V and a set of edges (u, v) ∈ E where
u, v ∈ V . This question pertains to the analysis of probability distributions over graphs. Denote by A the
adjacency matrix of the graph G, where A(u, v) = 1 if and only if (u, v) ∈ E. Define the random walk
matrix Pr over a graph G to be an |V | × |V | matrix where Pr (i, j) = 1/di if (i, j) ∈ E and 0 otherwise, specifying the probability of a transition from vertex i to vertex j. Here, di denotes the degree of vertex i. Assume the graph is connected,
so all vertices are reachable from every other vertex. Assume |V | = n.
part a Is the random walk matrix symmetric? If you think it is not, give a counterexample. Either
way, justify your answer rigorously.
part b Is the adjacency matrix symmetric? Derive an expression for the random walk matrix in terms
of the adjacency matrix.
part c Let π define a distribution over V so that Σv∈V π(v) = 1. Define the inner product ⟨f, g⟩π of any two functions on a graph f, g : V → R as follows:
⟨f, g⟩π = Σi∈V f (i)g(i)π(i)
Show that this definition of inner product satisfies all the axioms defining the inner product.
part d Starting from any initial distribution π0 , derive an expression for the distribution at time step
t > 0 resulting from a random walk of length t on the graph. Starting from any distribution π, does this
process of doing a random walk over longer and longer time periods converge?
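The convergence behaviour asked about here can be probed numerically. The following sketch (the small example graph is an assumption, not part of the question) builds Pr from an adjacency matrix and repeatedly applies it to an initial distribution.

```python
import numpy as np

# Adjacency matrix of a small connected, non-bipartite undirected graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

degrees = A.sum(axis=1)
P = A / degrees[:, None]              # random walk matrix: P[i, j] = 1/d_i for neighbors j

pi = np.array([1.0, 0.0, 0.0, 0.0])   # start the walk concentrated at vertex 0
for t in range(50):
    pi = pi @ P                       # distribution after one more step

print(np.round(pi, 4))
print(np.round(degrees / degrees.sum(), 4))   # candidate limiting distribution to compare against
```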