EXERCISES STR-VALIDATOR

19.04.2015
GETTING STARTED
It is assumed that you have installed R (e.g. with RGui or RStudio). To install STR-validator (the package name
has no dash and is all lower case: strvalidator), follow the instructions in the document
strvalidator_installation.pdf, which can be downloaded from the STR-validator website
(https://sites.google.com/site/forensicapps/strvalidator).
To start the graphical user interface type the following two lines in the R console (press enter after each
command to execute):
library(strvalidator)
strvalidator()
Although most STR-validator functions can also be run as R commands, these exercises focus on using
the graphical user interface.
EXERCISE 1 – REPEATABILITY
In this exercise you will load an in-built example dataset with 8 replicates of a positive control sample and
calculate height metrics to evaluate the repeatability of the STR analysis.
a) First load the in-built example data into the R environment by typing data(set1) in the R console.
The name of the dataset is ‘set1’.
b) Select the Workspace tab in STR-validator and click the drop-down list in the Load objects from R
workspace area. Select set1 and click the button Load object. set1 appears in the object list in the
STR-validator workspace for the current project. To inspect the data select set1 and click the View button.
The dataset contains 8 positive controls, 1 negative control and 1 allelic ladder.
c) To enable analysis the GeneMapper-formatted data (Figure 1) must be converted to the STR-validator
format (Figure 2). In the Tools tab click the Slim button. A new window opens. Select dataset set1 in the
drop-down list. Accept the defaults and click the Slim dataset button. The new dataset (set1_slim) is
saved in the STR-validator workspace.
d) Select the Tools tab and click the Height button. Select the set1_slim dataset and uncheck the option
Add result to dataset (by doing this a new dataset will be created). Let the other options remain
checked. Click the Calculate button. The result is now saved and can be viewed using the View or Edit
button in any tab. It should look like Figure 3 (left).
e) We now have the total peak height and number of peaks for each sample in set1, including the negative
control and ladder. But what if we want to exclude artefact peaks, like stutters, and only sum the
heights for actual alleles? Also, we probably want to get rid of the negative control and ladder. To
accomplish this we need to import the known profile of the positive control sample. Type
data(ref1) in the R console and load the dataset into STR-validator as explained in point b).
Convert the dataset as explained in point c), but this time also rename the result to ref1 so as not to
clutter the workspace (overwrite when asked to).
f) Select the Tools tab and click the Filter button. Select the set1_slim dataset and the ref1 reference
dataset. The Check subsetting button is used to see which samples match the reference
samples. Leave the default settings and click the Filter profile button. The known alleles and peak
heights have now been pulled out from each replicate control sample and saved as a new dataset
set1_slim_filter.
g) Repeat point d) with the set1_slim_filter dataset. The result (set1_slim_filter_height) now looks like
Figure 3 (middle). As can be seen the negative control and ladder are now absent. They did not match
the reference sample “PC” and were therefore never included in the filtered dataset.
h) A better metric than total peak height is average peak height (H) since it is easy to relate to the DNA
profile. To calculate H we need to know if a marker is heterozygous or homozygous. Normally this
information is calculated for reference datasets (where you know that all alleles are present), and then
added to other datasets (where you can’t be sure that all alleles are observed). See sections 9.3 and 9.4
in the STR-validator tutorial (version 1.4). In this case, however, we can see from Figure 3 (middle) that
all alleles are present in all samples (i.e. 33 alleles per sample) so we can take a short-cut. In the Tools
tab click the Heterozygous button. Select the set1_slim_filter dataset in the drop-down list. Click
Calculate. This adds a column Heterozygous to the dataset with 1 to indicate heterozygous alleles and
0 to indicate homozygous alleles. The new dataset is saved as set1_slim_filter_het.
i) Repeat point d) with the set1_slim_filter_het dataset. The result (set1_slim_filter_het_height) now
looks like Figure 3 (right). There is a new column H in the dataset. In the current version (1.4.0) there is
no summary function for peak heights. If you would like to continue the analysis in a spreadsheet
program, open the dataset using the Edit button. Select the dataset set1_slim_filter_het_height and
click the Copy to clipboard button. Paste the data into the spreadsheet program and apply e.g. functions
for average and standard deviation to the data. The result can look like Figure 4.
Sample.Name  Marker    Dye  Allele.1  Allele.2  Allele.3  Allele.4  Allele.5  Height.1  Height.2  Height.3  Height.4  Height.5
PC1          AMEL      B    X         OL        Y         NA        NA            2486        81      2850        NA        NA
PC1          D3S1358   B    16        17        18        NA        NA             260      3251      2985        NA        NA
PC1          TH01      B    6         9.3       NA        NA        NA            3357      2687        NA        NA        NA
PC1          D21S11    B    28        29        30.2      31.2      NA             183      2036       180      1942        NA
PC1          D18S51    B    15        16        17        18        NA             161      2051       203      1617        NA
PC1          D10S1248  G    12        13        14        15        NA             168      2142       243      2230        NA
PC1          D1S1656   G    11        12        13        NA        NA             249      3149      3965        NA        NA
FIGURE 1. GENEMAPPER FORMATTED DATA.
Sample.Name  Marker   Dye  Allele  Height
PC1          AMEL     B    X         2486
PC1          AMEL     B    OL          81
PC1          AMEL     B    Y         2850
PC1          D3S1358  B    16         260
PC1          D3S1358  B    17        3251
PC1          D3S1358  B    18        2985
PC1          TH01     B    6         3357
FIGURE 2. STR-VALIDATOR FORMATTED DATA.
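The difference between the two formats in Figures 1 and 2 can be sketched in base R. This is only an illustration of the reshaping idea, not the code behind the Slim button; the object names wide and long are hypothetical, and the three rows are copied from Figure 1 (truncated to three allele columns).

```r
# Sketch: reshape GeneMapper-style (wide) rows into one row per peak (long).
# Illustration only; this is not the implementation used by STR-validator.
wide <- data.frame(
  Sample.Name = "PC1",
  Marker   = c("AMEL", "D3S1358", "TH01"),
  Dye      = "B",
  Allele.1 = c("X", "16", "6"),
  Allele.2 = c("OL", "17", "9.3"),
  Allele.3 = c("Y", "18", NA),
  Height.1 = c(2486, 260, 3357),
  Height.2 = c(81, 3251, 2687),
  Height.3 = c(2850, 2985, NA),
  stringsAsFactors = FALSE
)

allele_cols <- grep("^Allele\\.", names(wide), value = TRUE)
height_cols <- grep("^Height\\.", names(wide), value = TRUE)

# Build one small long-format data frame per marker row, then stack them.
long <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i) {
  data.frame(
    Sample.Name = wide$Sample.Name[i],
    Marker = wide$Marker[i],
    Dye = wide$Dye[i],
    Allele = unlist(wide[i, allele_cols], use.names = FALSE),
    Height = unlist(wide[i, height_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
}))

# Drop the empty allele slots (NA) and renumber the rows.
long <- long[!is.na(long$Allele), ]
rownames(long) <- NULL
print(long)
```

The first seven rows of long correspond to the rows shown in Figure 2.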
Sample.Name     TPH  Peaks    Sample.Name     TPH  Peaks    Sample.Name       H     TPH  Peaks
PC1          109875     59    PC1          103748     33    PC1          3051.4  103748     33
PC2          135465     62    PC2          127760     33    PC2          3757.6  127760     33
PC3          139962     60    PC3          131594     33    PC3          3870.4  131594     33
PC4          115650     60    PC4          108599     33    PC4          3194.1  108599     33
PC5          138392     60    PC5          130122     33    PC5          3827.1  130122     33
PC6          106037     60    PC6           99703     33    PC6          2932.4   99703     33
PC7          118332     61    PC7          111548     33    PC7          3280.8  111548     33
PC8           92227     60    PC8           86923     33    PC8          2556.6   86923     33
NC                0      0
Ladder       111651     82
FIGURE 3. RESULT AFTER COMPLETION OF POINT D) (LEFT), G) (MIDDLE) AND I) (RIGHT) RESPECTIVELY.
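The H values in Figure 3 (right) are consistent with dividing the total peak height by the number of allele copies, counting each heterozygous peak once and each homozygous peak twice (for PC1: 103748 / 34 ≈ 3051.4, i.e. 32 heterozygous peaks plus one homozygous peak). A minimal base R sketch of that arithmetic follows. It is not the package's own code; the function name average_peak_height is hypothetical and the first input is made up for illustration.

```r
# Sketch: average peak height H for one sample.
# average_peak_height is a hypothetical helper, not a strvalidator function.
# height:       peak heights of the known alleles
# heterozygous: 1 for peaks in heterozygous markers, 0 for homozygous markers
average_peak_height <- function(height, heterozygous) {
  tph <- sum(height)                                   # total peak height
  copies <- sum(heterozygous == 1) + 2 * sum(heterozygous == 0)
  tph / copies
}

# Made-up example: two heterozygous markers and one homozygous marker.
height <- c(1000, 1200, 900, 1100, 2200)
heterozygous <- c(1, 1, 1, 1, 0)
H <- average_peak_height(height, heterozygous)

# Same arithmetic with the PC1 totals from Figure 3:
# 33 peaks, of which one belongs to a homozygous marker.
H_pc1 <- 103748 / (32 + 2 * 1)
print(round(c(H = H, H_pc1 = H_pc1), 1))
```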
Sample.Name        H       TPH  Peaks
PC1          3051.41    103748     33
PC2          3757.65    127760     33
PC3          3870.41    131594     33
PC4          3194.09    108599     33
PC5          3827.12    130122     33
PC6          2932.44     99703     33
PC7          3280.82    111548     33
PC8          2556.56     86923     33
Mean:         3308.8  112499.6     33
StdDev:        474.1   16118.5      0
FIGURE 4. EXAMPLE OF EXTERNAL ANALYSIS OF STR-VALIDATOR RESULTS.
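As an alternative to a spreadsheet, the same summary can be computed directly in R. The sketch below assumes the values are typed in from Figure 4; the object names res and summary_stats are hypothetical. In practice the result dataset would be used instead of a hand-typed data frame.

```r
# Sketch: mean and sample standard deviation of H and TPH (values from Figure 4).
res <- data.frame(
  Sample.Name = paste0("PC", 1:8),
  H   = c(3051.41, 3757.65, 3870.41, 3194.09, 3827.12, 2932.44, 3280.82, 2556.56),
  TPH = c(103748, 127760, 131594, 108599, 130122, 99703, 111548, 86923),
  stringsAsFactors = FALSE
)

# sd() computes the sample standard deviation (n - 1 in the denominator).
summary_stats <- data.frame(
  Statistic = c("Mean", "StdDev"),
  H   = c(mean(res$H), sd(res$H)),
  TPH = c(mean(res$TPH), sd(res$TPH))
)
print(summary_stats, digits = 7)
```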
EXERCISE 2 – HETEROZYGOTE- AND INTER-LOCUS BALANCE
You will use the same data as in Exercise 1 and calculate the heterozygote balance and the inter-locus balance.
However, we will start by saving the project before continuing the analysis.
a) Select the Workspace tab and click the Save As button. Locate the folder where you want to save the
project. Click Ok, and then type a project name in the dialog box. Click Ok to save the project. A dialog
box showing the complete path to the project is shown when the project has been saved. The next
time it is enough to click the Save button to save the project.
b) Select the Balance tab and click the Calculate button in the Intralocus and interlocus balance group.
Select the set1_slim_filter dataset and ref1 reference dataset in the respective drop-down list. Select
to calculate the heterozygote balance using the High peak / low peak option. Calculate the
proportional inter-locus balance within each dye. Click the Calculate button. The result is saved as
set1_slim_filter_balance.
c) To plot the balance data click the Plot button in the Intralocus and interlocus balance group. Select the
set1_slim_filter_balance dataset in the drop-down list. Select the option fixed scales under the Axes
expandable option group. Click Hb_vs_Height to plot the heterozygote balance and Lb_vs_Height
to plot the inter-locus balance. The plots can be saved as modifiable plot objects or as images.
d) Click the Summarize button to calculate summary statistics. Select the set1_slim_filter_balance
dataset in the drop-down list. Calculate the 5th percentile per locus. The now rather long name for the
result dataset is set1_slim_filter_balance_table_locus. The name for the result can be changed before
performing a calculation. It can also be changed at any time in the Workspace tab. However, for the
sake of clarity, we will often accept the default name during the exercises (it gives us a summary of the
actions that have been applied to the datasets).
e) To complete the summary statistics we are going to add dye information to the dataset. Click the Dye
button in the Tools tab. Select the set1_slim_filter_balance_table_locus dataset in the drop-down list.
Click Add dye. The result, rounded to two decimals, is shown in Figure 5.
Marker    Hb.n  Hb.Min  Hb.Mean  Hb.Stdv  Hb.Perc.5  Lb.n  Lb.Min  Lb.Mean  Lb.Stdv  Lb.Perc.5  Color   Dye
AMEL         8    0.66     0.85     0.09       0.71     8    0.20     0.21     0.01       0.20  blue    B
D3S1358      8    0.65     0.79     0.11       0.67     8    0.20     0.24     0.03       0.20  blue    B
TH01         8    0.60     0.77     0.12       0.64     8    0.20     0.22     0.02       0.20  blue    B
D21S11       8    0.56     0.84     0.15       0.61     8    0.16     0.17     0.02       0.16  blue    B
D18S51       8    0.61     0.82     0.13       0.66     8    0.13     0.16     0.02       0.13  blue    B
D10S1248     8    0.57     0.86     0.13       0.66     8    0.20     0.23     0.03       0.20  green   G
D1S1656      8    0.79     0.90     0.08       0.80     8    0.24     0.29     0.03       0.25  green   G
D2S1338      8    0.68     0.80     0.10       0.68     8    0.23     0.26     0.02       0.23  green   G
D16S539      8    0.66     0.85     0.10       0.71     8    0.19     0.22     0.02       0.20  green   G
D22S1045     0      NA       NA       NA         NA     8    0.19     0.23     0.03       0.20  yellow  Y
vWA          8    0.68     0.82     0.10       0.69     8    0.18     0.21     0.02       0.18  yellow  Y
D8S1179      8    0.71     0.90     0.09       0.75     8    0.24     0.28     0.02       0.25  yellow  Y
FGA          8    0.68     0.89     0.12       0.70     8    0.24     0.28     0.03       0.24  yellow  Y
D2S441       8    0.72     0.88     0.09       0.74     8    0.32     0.37     0.04       0.33  red     R
D12S391      8    0.58     0.78     0.14       0.61     8    0.15     0.19     0.03       0.15  red     R
D19S433      8    0.68     0.85     0.10       0.71     8    0.22     0.25     0.02       0.22  red     R
SE33         8    0.59     0.75     0.13       0.60     8    0.17     0.19     0.02       0.17  red     R
FIGURE 5. BALANCE SUMMARY STATISTICS FOR THE POSITIVE CONTROL SAMPLES. D22S1045 IS LACKING HB VALUES BECAUSE THE
POSITIVE CONTROL SAMPLE IS HOMOZYGOUS IN THIS MARKER. THE OPTIMAL LOCUS BALANCE IS CALCULATED BY DIVIDING 1 BY THE
NUMBER OF MARKERS IN EACH COLOUR (OR THE TOTAL NUMBER OF MARKERS IF CALCULATING THE OVERALL BALANCE).
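The proportional inter-locus balance within a dye can be illustrated with the blue markers of PC1. The sketch below is not the package's implementation. It assumes that the two tallest peaks per marker in Figure 1 are the true alleles (the ref1 profile is not shown here), and the object names are hypothetical.

```r
# Sketch: proportional inter-locus balance (Lb) within one dye.
# Heights are the allele peaks of the blue markers of PC1, read from Figure 1.
# Assumption: the two tallest peaks per marker are the true alleles.
peaks <- data.frame(
  Marker = rep(c("AMEL", "D3S1358", "TH01", "D21S11", "D18S51"), each = 2),
  Height = c(2486, 2850, 3251, 2985, 3357, 2687, 2036, 1942, 2051, 1617),
  stringsAsFactors = FALSE
)

# Sum of allele peak heights per marker.
locus_sum <- tapply(peaks$Height, peaks$Marker, sum)

# Each marker's share of the total peak height in the dye.
lb <- locus_sum / sum(locus_sum)

# Optimal balance: 1 divided by the number of markers in the dye.
optimal <- 1 / length(locus_sum)

print(round(lb, 2))
print(optimal)
```

With five blue markers the optimal value is 0.20. The shares computed for this single sample are in line with the Lb.Mean column for the blue markers in Figure 5.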
EXERCISE 3 – CREATE AN EPG
DNA profiles can be visualised as EPGs from within STR-validator. Select the Tools tab and click the EPG button.
Select the unfiltered dataset set1_slim in the drop-down list. Select one of the positive control samples in the
sample drop-down list. Click Generate EPG.
EXERCISE 4 – STUTTER RATIOS
We will continue with dataset set1 and calculate stutter ratios. The stutter ratios will be plotted and the result
will be summarized.
a) Select the Stutter tab and click the Calculate button. Select the unfiltered dataset set1_slim and
reference ref1. It is important to note that no stutter filter was used to analyse this data in
GeneMapperID-X. Calculate stutters within the range ±1 repeat unit. Select the option no overlap
between stutters and alleles. Leave the replacement table as is and click the Calculate button.
b) Click the Plot button. Select the set1_slim_stutter dataset in the drop-down list and plot the Ratio vs.
Allele. From the plot you can quickly tell that all observed stutter ratios are below 15%. Both -1 and +1
repeat stutters are observed. Do you observe any notable differences between the loci?
c) Now plot the Ratio vs. Height. What is your conclusion? Close the plotting window.
d) In the Stutter tab click the Summarize button. Select the set1_slim_stutter dataset in the drop-down
list. Calculate the 95th percentile on a per stutter basis. Click the Summarize button. The result is saved
as set1_slim_stutter_table_stutter and should be similar to Figure 6.
Marker    Type  n.alleles  n.stutters   Mean   Stdv  Perc.95    Max
D21S11      -1          2          16  0.080  0.007    0.091  0.093
D18S51      -1          1           8  0.080  0.004    0.086  0.087
D10S1248    -1          1           8  0.078  0.004    0.084  0.085
D2S1338     -1          2          16  0.086  0.018    0.112  0.112
D16S539     -1          2          15  0.064  0.023    0.089  0.094
D22S1045    -1          1           8  0.119  0.005    0.126  0.126
D22S1045     1          1           8  0.058  0.005    0.065  0.069
vWA         -1          2          16  0.091  0.012    0.112  0.125
FGA         -1          2          16  0.063  0.012    0.076  0.077
D2S441      -1          2          16  0.052  0.010    0.065  0.069
D2S441       1          1           1  0.009     NA    0.009  0.009
D12S391     -1          2          16  0.106  0.027    0.140  0.146
D12S391      1          1           1  0.025     NA    0.025  0.025
FIGURE 6. RESULT FROM STUTTER ANALYSIS OF 8 POSITIVE CONTROL SAMPLES.
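A stutter ratio is commonly defined as the height of the stutter peak divided by the height of its parent allele. The sketch below applies that definition to the D21S11 peaks of PC1 from Figure 1. It is an illustration under stated assumptions, not the package's code: alleles 29 and 31.2 are assumed to be the true alleles with 28 and 30.2 as their -1 repeat stutters, and the function name stutter_ratio is hypothetical.

```r
# Sketch: stutter ratio = stutter peak height / parent allele peak height.
# Peak heights for D21S11 in PC1, read from Figure 1.
d21 <- data.frame(
  Allele = c("28", "29", "30.2", "31.2"),
  Height = c(183, 2036, 180, 1942),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not a strvalidator function.
stutter_ratio <- function(stutter_height, allele_height) {
  stutter_height / allele_height
}

# Assumption: rows 1 and 3 are -1 repeat stutters of the alleles in rows 2 and 4.
ratios <- stutter_ratio(d21$Height[c(1, 3)], d21$Height[c(2, 4)])
print(round(ratios, 3))

# Percentiles across many such ratios can be taken with quantile().
p95 <- quantile(ratios, 0.95)
print(p95)
```

Both ratios are below 15%, and they fall within the range reported for D21S11 in Figure 6.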
EXERCISE 5 – CONCORDANCE
In this exercise you will perform a concordance analysis of samples analysed with SGM Plus and ESX 17. Any
discordance is listed and the overall concordance is calculated. The data for this exercise is found in the
Concordance folder.
a) Select the Workspace tab and click the Import button. Locate the file concordance_esx17.txt, give
the dataset a name, e.g. conc_esx, and click the Import button. Repeat the procedure to import the file
concordance_sgm_plus.txt as, for example, conc_sgm. The data have been analysed according to
standard procedures in GeneMapperID-X and artefacts such as stutters and pull-up peaks have been
removed. The data are already in STR-validator format so there is no need to Slim the data.
b) Select the Concordance tab and click the Calculate button. Select the conc_esx dataset and check that
the kit has been correctly detected as ESX17. Click the Add button. Select the conc_sgm dataset and
check that the kit has been correctly detected as SGMPlus. Click the Add button. The kit names in the
Names for analysis kit text field can be changed as they are only labels for the result tables. Click the
Calculate button. Two result tables are created: one for the overall concordance (table_concordance)
and one listing all discordances (table_discordance).
c) Click the Edit button to view the result. Select the table_discordance dataset. It should look like
Figure 7. Select the table_concordance dataset to check the overall result (Figure 8).
Sample.Name  Marker   ESX17  SGMPlus
Sample_02    D3S1358  15,20  15,OL
Sample_04    D16S539  9,13   9
Sample_07    D3S1358  16,21  16,OL
FIGURE 7. A LIST OF DISCORDANCES (TWO OFF-LADDER ALLELES AND ONE FALSE HOMOZYGOTE) FOUND COMPARING THE ESX17 AND
THE SGMPLUS TYPING KIT.
Kits               Samples  Loci  Alleles  Discordances  Concordance
ESX17 vs. SGMPlus       10    10      200             3         98.5
FIGURE 8. THE OVERALL CONCORDANCE BETWEEN ESX17 AND SGMPLUS TYPING KIT.
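The concordance value in Figure 8 can be reproduced with simple arithmetic. The sketch below assumes two alleles per locus and sample (which gives the 200 alleles in Figure 8) and that each discordance is counted as one discordant allele. Both are assumptions made for illustration, and the variable names are hypothetical.

```r
# Sketch: overall concordance as a percentage of compared alleles.
samples <- 10
loci <- 10
alleles_per_locus <- 2        # assumption: diploid, two alleles per locus

alleles <- samples * loci * alleles_per_locus
discordances <- 3             # from Figure 7

concordance <- 100 * (alleles - discordances) / alleles
print(concordance)
```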
EXERCISE 6 – MIXTURES
In this exercise you will perform a validation analysis of mixtures. For each mixture the mixture proportion (Mx)
per marker, the sample Mx average, and the difference from the sample average are calculated. The number of
observed and expected unshared alleles is listed and the percentage profile is calculated. In addition any
drop-in peaks are counted. The data for this exercise is found in the Mixture folder.
a) Select the Workspace tab and click the Import button. Locate the file mixtures.txt, give the dataset
a name, e.g. mix_data, and click the Import button. Repeat the procedure to import the files
ref_major.txt and ref_minor.txt as e.g. mix_ref_major and mix_ref_minor respectively. The data have
been analysed according to standard procedures in GeneMapperID-X and artefacts such as stutters
and pull-up peaks have been removed. The data are already in STR-validator format so there is no need
to Slim the data.
b) Select the Mixture tab and click the Calculate button. Select mix_data in the drop-down list for
datasets. Select the mix_ref_major reference dataset for the major mixture component and the
mix_ref_minor reference dataset for the minor mixture component. Check the options Remove
off-ladder alleles and Ignore drop-out. Click the Calculate button. The result is saved in mix_data_mixture
and is summarized in Figure 9.
Sample.Name    Average Mx  Observed  Expected  Profile  Dropin
major_minor_1       0.255        19        19    100.0       0
major_minor_2       0.160        13        19     68.4       3
FIGURE 9. EXAMPLE OF EXTERNAL ANALYSIS OF STR-VALIDATOR MIXTURE RESULTS.
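The Profile column in Figure 9 is consistent with the observed number of unshared alleles expressed as a percentage of the expected number. A minimal sketch of that calculation in base R follows, using the counts from Figure 9; the object name mix is hypothetical and this is not the package's own code.

```r
# Sketch: percentage profile = 100 * observed / expected unshared alleles.
mix <- data.frame(
  Sample.Name = c("major_minor_1", "major_minor_2"),
  Observed = c(19, 13),
  Expected = c(19, 19),
  stringsAsFactors = FALSE
)

mix$Profile <- round(100 * mix$Observed / mix$Expected, 1)
print(mix)
```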
EXERCISE 7 – KIT MARKER RANGE COMPARISON
Kit range marker plots can easily be created in STR-validator. Select the DryLab tab and click the Plot Kit button.
Check one or more typing kits that you want to plot. Click the plot button. Many plot options can be
customised. Click the plot button again to update the plot. Plots can be saved by clicking the Save as object or
Save as image button.
EXERCISE 8* – DROP-OUT
This is a slightly updated guided exercise from a previous course. In this exercise you will perform a drop-out
analysis from serially diluted samples to estimate the stochastic threshold. This is a more challenging exercise
as it does not include a reference dataset, and there is contamination. The data for this exercise is found in the
Dropout folder. The step-by-step guide guided_exercise_dropout.pdf is available in the same folder.
EXERCISE 9* – DROP-IN
This is a slightly updated guided exercise from a previous course. In this exercise you will perform process
control of extraction negative controls and PCR non-template controls to estimate the probability of drop-in
contamination. This is a more challenging exercise as it includes multiple steps and tools of STR-validator. The
data for this exercise is found in the Contamination folder. The step-by-step guide guided_exercise_dropin.pdf
is available in the same folder.
EXERCISE 10 – RESULT TYPE AND PEAKS
In Chapters 6 and 7 of the tutorial you learn how to perform a result type and peak analysis. This can be an easy
way to compare e.g. two extraction methods using real crime scene samples by categorising the results as e.g. full,
partial, or negative.
EXERCISE 11 – PRECISION
In Chapter 8 of the tutorial you learn how to perform a precision analysis of allelic ladders.