
Sample Tests for the Course Computational
Intelligence
Thomas Natschläger
June 16, 2004
The aim of these multiple choice tests is to let students check their understanding of the
topics presented during the lectures. An interactive version of these tests can be found at
http://www.igi.tugraz.at/lehre/CI/tests/index.html on the homepage of the course
(http://www.igi.tugraz.at/lehre/CI), or you can download the interactive version as a zip file
from http://www.igi.tugraz.at/lehre/CI/InteraktiveTests.zip.
Introduction to Machine Learning
1. A learning algorithm is a function which maps each attribute vector a = ⟨a1 , . . . , ad ⟩ to
a target value b.
True False
2. The empirical error on the training set is always lower than the empirical error on the
test set.
True False
3. The true error errorP (H) of the hypothesis H is necessarily larger than the empirical
error errorTk (H) measured on the test set Tk .
True False
4. If the training set L and the test set T are generated by two totally different distributions, then
the larger the test set Tk , the closer the empirical error errorTk (H) is to the true
error errorP (H) of the hypothesis H.
limk→∞ errorTk (H) = errorP (H), where k is the size of the test set Tk .
it may happen that even for very large test sets Tk the empirical error does not approximate the true error very well.
5. Generalization has to do
with the ability of a learning algorithm to find a hypothesis which has a low error
on the training set.
with the ability of a learning algorithm to find a hypothesis which has a low error
on the test set.
with the ability of a learning algorithm to find a hypothesis which performs well on
examples ⟨a, b⟩ which were not used for training.
6. A “good” learning algorithm is a learning algorithm which
has good generalization capabilities.
can find for the training set L a hypothesis H with errorL (H) = 0.
finds for rather small training sets L a hypothesis HL with a small true error.
Neural Networks
1. With a suitable choice of w and w0 , a linear threshold gate g(x) = sign(w · x + w0 ) can
compute every Boolean function f : {−1, 1}n → {−1, 1}.
True False
2. A 3-layer ANN with one output which consists only of linear gates can be simulated
by a single linear gate.
True False
3. Every continuous function f : R → (0, 1) can be approximated arbitrarily well by a feed-forward ANN of sigmoidal gates with one hidden layer.
True False
4. For a binary classification problem, the ∆-rule always finds a solution if the data are
linearly separable.
True False
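As a reference point, here is a minimal sketch of a perceptron-style ∆-rule update for a linear threshold gate; the names X, y, eta and the fixed number of epochs are illustrative assumptions, not the lecture's exact formulation.

    import numpy as np

    def delta_rule(X, y, eta=0.1, epochs=100):
        # X: (N, d) data matrix, y: labels in {-1, +1}
        # a constant 1 is appended to each example so the bias w0 is part of w
        Xa = np.hstack([X, np.ones((X.shape[0], 1))])
        w = np.zeros(Xa.shape[1])
        for _ in range(epochs):
            for x, t in zip(Xa, y):
                if np.sign(w @ x) != t:      # example is misclassified
                    w += eta * t * x         # move the separating hyperplane towards it
        return w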
5. If the data of a binary classification problem are not linearly separable, the ∆-rule is
not applicable.
True False
6. Using a linear program, the weights of a linear threshold gate can be "trained" even in
the non-linearly separable case.
True False
7. The computation of the Fisher discriminant determines the direction vector along which
the projected data show the largest variance.
True False
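For comparison, a minimal sketch of how a Fisher discriminant direction can be computed from two classes X0 and X1 (illustrative names, assuming the usual within-class scatter formulation):

    import numpy as np

    def fisher_direction(X0, X1):
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
        w = np.linalg.solve(Sw, m1 - m0)        # w proportional to Sw^{-1} (m1 - m0)
        return w / np.linalg.norm(w)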
8. The weights of a threshold gate determined via the pseudo-inverse are optimal with
respect to the number of misclassified training examples.
True False
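A minimal sketch of the pseudo-inverse weight computation for a threshold gate (illustrative names); the resulting w is the least-squares solution for the linear activations against the targets y:

    import numpy as np

    def pseudo_inverse_weights(X, y):
        # append a constant 1 so that the bias is part of the weight vector
        Xa = np.hstack([X, np.ones((X.shape[0], 1))])
        return np.linalg.pinv(Xa) @ y      # least-squares solution of Xa w ~ y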
9. The weight vector ⟨w1 , . . . , wn ⟩ of a threshold gate determined by an SVM with a linear
kernel is a linear combination of the training examples.
True False
10. For a linearly separable classification problem, the separating hyperplane with the maximal distance to the training examples is called the optimal hyperplane.
True False
11. When training an SVM, the weights of a threshold gate are determined via gradient descent.
True False
12. The backprop algorithm is guaranteed to find weights for an ANN which constitute a global minimum of the error function.
True False
13. Weight decay is a heuristic to avoid overfitting when training ANNs.
True False
14. With an unfavorable setting of the parameters of backprop with an adaptive learning rate,
the weight vector may start to oscillate.
True False
15. Gradient descent is a technique developed specifically for ANNs for minimizing quadratic
error functions.
True False
16. The advantage of backprop with momentum is that the learning process is accelerated in
"plateaus" of the error function.
True False
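A minimal sketch of a gradient step with momentum (the names grad_E, eta, beta are illustrative); the accumulated velocity keeps the step size up on flat regions of the error function:

    def momentum_step(w, v, grad_E, eta=0.01, beta=0.9):
        # v is an exponentially decaying sum of past gradient steps; on a plateau
        # the individual gradients are small, but v keeps accumulating in one direction
        v = beta * v - eta * grad_E(w)
        return w + v, v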
17. Quasi-Newton and conjugate gradient methods differ only in the way the Hessian
matrix is computed.
True False
18. After whitening the data there are no empirically measurable linear dependencies
left between different attributes.
True False
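A minimal whitening sketch using an eigendecomposition of the empirical covariance (illustrative names); after the transform the empirical covariance of the data is approximately the identity matrix:

    import numpy as np

    def whiten(X, eps=1e-10):
        Xc = X - X.mean(axis=0)                  # center the data
        C = np.cov(Xc, rowvar=False)             # empirical covariance matrix
        evals, evecs = np.linalg.eigh(C)         # C = E diag(evals) E^T
        W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
        return Xc @ W                            # empirical covariance of result ~ identity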
Classification Algorithms
1. For any finite training set, C4.5 can produce a decision tree which makes no errors on
the training set.
True False
2. The performance of a nearest neighbor algorithm depends strongly on the relative scaling
of the attributes.
True False
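A small illustration with hypothetical data of how rescaling one attribute can change a 1-nearest-neighbor decision under the Euclidean distance:

    import numpy as np

    def nn_predict(X, y, q):
        # 1-nearest-neighbor classification under the Euclidean distance
        return y[np.argmin(np.linalg.norm(X - q, axis=1))]

    X = np.array([[1.0, 100.0], [2.0, 200.0]])   # attribute 2 lives on a much larger scale
    y = np.array([0, 1])
    q = np.array([1.1, 180.0])
    print(nn_predict(X, y, q))                                    # attribute 2 dominates: predicts 1
    print(nn_predict(X / [1.0, 100.0], y, q / [1.0, 100.0]))      # after rescaling: predicts 0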
Adaptive Filtering
1. Increasing the step size µ generally results in faster convergence of the LMS algorithm.
True False
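A minimal LMS sketch (illustrative names); the step size mu directly scales each coefficient update and therefore the speed of convergence, while too large a mu can make the adaptation unstable:

    import numpy as np

    def lms(x, d, M=8, mu=0.01):
        # x: input signal, d: desired signal (numpy arrays), M: filter length
        w = np.zeros(M)
        for n in range(M, len(x)):
            u = x[n-M:n][::-1]            # the most recent M input samples
            e = d[n] - w @ u              # local error of the current filter output
            w = w + mu * e * u            # coefficient update, scaled by the step size mu
        return w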
2. The goal of system identification is to build a model of an unknown system.
True False
3. The RLS algorithm usually converges faster than the LMS algorithm.
True False
4. Why is it (usually) not desirable to achieve the global minimum of the mean squared
error (of the whole time-series) for an adaptive filter?
Because the filter should adapt to temporal variation of an unknown system.
Because the wanted signal (e.g., the signal of a local speaker for the application in
echo-cancellation) would be suppressed.
5. An adaptive filter trained using the RLS algorithm with a forgetting factor ρ = 1
has constant coefficient values over time w[n] = w (it does not adapt).
reaches the global minimum of the mean squared error for a time-series at the end
of the time-series.
considers indirectly all past signal samples for the computation of the local error and
the adaptation of the coefficients.
displays the same adaptation behavior as an adaptive filter trained using the LMS
algorithm with µ = 0.
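For comparison, a minimal RLS sketch with forgetting factor rho (illustrative names and initialization); with rho = 1 all past samples enter the error criterion with equal weight:

    import numpy as np

    def rls(x, d, M=8, rho=0.99, delta=100.0):
        # x, d: input and desired signal (numpy arrays), M: filter length
        w = np.zeros(M)
        P = delta * np.eye(M)                  # estimate of the inverse correlation matrix
        for n in range(M, len(x)):
            u = x[n-M:n][::-1]
            k = P @ u / (rho + u @ P @ u)      # gain vector
            e = d[n] - w @ u                   # a priori error
            w = w + k * e
            P = (P - np.outer(k, u @ P)) / rho
        return w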
Gaussian Statistics
1. Consider a 2-dimensional Gaussian process. Find the correct answers:
If the first and second dimension are independent, the cloud of points (xi , yi )i=1,...,N
and the pdf contour necessarily have the shape of a circle.
If the first and second dimension are independent, the cloud of points and the pdf contour have to be elliptic with the principal axes of the ellipse aligned with the abscissa
and ordinate axes (consider a circle as a special case of an ellipse).
The covariance matrix Σ is symmetric. That is, for i, j = 1, . . . , d it holds that cij = cji .
2. Estimation of the parameters of a 2-dimensional normal distribution. Find the correct
answers.
An accurate mean estimate requires more samples than an accurate variance estimate.
Using more data results in a more accurate estimate of the parameters of the normal
distribution.
3. Computation of the log-likelihood for classification instead of the likelihood
gives the same classification results since the logarithm is a monotonically increasing
function.
is computationally beneficial, since we do not have to deal with very small numbers.
turns products for the computation of the likelihood into sums for the computation
of the log-likelihood.
4. For the computation of the log-likelihood for observations x with respect to Gaussian
models according to log p(x|Θ) = 1/2 [−d log(2π) − log(det(Σ)) − (x − µ)T Σ−1 (x − µ)],
we may (for all Gaussian models)
drop the division by 2.
drop the term d log(2π).
drop the term log(det(Σ)).
drop the term (x − µ)T Σ−1 (x − µ).
pre-compute the term log(det(Σ)) for each of the Gaussian models.
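A minimal sketch of this log-likelihood computation (illustrative names), writing out the terms of the formula so it is easy to see which of them depend only on the model and can be pre-computed:

    import numpy as np

    def gaussian_loglik(x, mu, Sigma):
        d = len(mu)
        diff = x - mu
        log_det = np.log(np.linalg.det(Sigma))         # depends only on the model
        maha = diff @ np.linalg.solve(Sigma, diff)     # (x - mu)^T Sigma^{-1} (x - mu)
        return 0.5 * (-d * np.log(2 * np.pi) - log_det - maha)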
Hidden Markov Models
1. The parameters of a Markov model (NOT a hidden Markov model) are:
The set of states.
The prior probabilities (probabilities to start in a certain state).
The state transition probabilities.
The emission probabilities.
2. Find the correct statements.
The (first-order) Markov assumption means that the probability of an event at time
n only depends on the event at time n − 1.
An ergodic HMM allows transitions from each state to any other state.
For speech recognition, ergodic HMMs are usually used to model phoneme sequences
in words.
3. Viterbi algorithm
The Viterbi algorithm finds the most likely state sequence for a given observation
sequence and a given HMM.
The Viterbi algorithm finds the most likely state sequence for a given HMM.
The Viterbi algorithm computes the likelihood of an observation sequence with respect to an HMM (considering all possible state sequences).
In the Viterbi algorithm, at each time step and for each state only one path leading
to this state (the survivor path) and its metric are stored for further processing.
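A minimal Viterbi sketch in log-space (illustrative names: log_pi for the prior, log_A for the transitions, log_B[t, j] for the observation log-likelihoods); at each time step only the best path into each state and its metric are kept:

    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        # log_pi: (N,) log priors, log_A: (N, N) log transition probabilities,
        # log_B[t, j]: log-likelihood of the observation at time t under state j
        T, N = log_B.shape
        delta = log_pi + log_B[0]                 # best path metric per state at t = 0
        psi = np.zeros((T, N), dtype=int)         # back-pointers (survivor paths)
        for t in range(1, T):
            scores = delta[:, None] + log_A       # scores[i, j]: best path into j via i
            psi[t] = np.argmax(scores, axis=0)
            delta = scores[psi[t], np.arange(N)] + log_B[t]
        path = [int(np.argmax(delta))]            # best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(np.max(delta))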