Assessing Programming Ability in Introductory Computer Science: Why Can't Johnny Code?
Robyn McNamara
Computer Science & Software Engineering
Monash University

Can Johnny Program?
● Most second-years can (kind of), but evidence suggests that many can't.
● Not just at Monash: this is a global problem.
  – McCracken report (2002): four universities on three continents found that their second-years lacked the ability to create a program
  – papers from every continent (except Antarctica) indicate problems with the programming ability of CS2 students
    ● Antarctica doesn't seem to have any tertiary programs in IT.

Kinds of assessment
● Formative: feedback to students
  – “how am I doing? how can I improve?”
  – “what are the lecturers looking for?”
● Summative: counts toward final mark
  – pracs, exams, assignments, tests, prac exams, hurdles, etc.

Purpose of CS assessment
● Ensure that students enter the next level of study or work practice with enough knowledge and skill to be able to succeed.
● May include:
  – programming ability
  – grasp of computing theory
  – problem-solving/analytical ability
  – more!
These skills are mutually reinforcing rather than orthogonal.

What's at stake
Inadequate assessment in early years can lead to:
● inadequate preparation for later-year courses
  – watering down content
  – grade inflation
● “hidden curriculum” effects (Snyder)
  – to students, assessment defines curriculum
● poor student morale, which brings
  – attrition
  – plagiarism (Ashworth et al., 1997)

Characteristics of good assessment
● Reliability: is it a good measure?
  – if you test the same concept twice, students should get similar marks (cf. precision)
  – can be evaluated quantitatively using established statistical techniques (AERA et al., 1985)
● Validity: is it measuring the right thing?
  – not directly quantifiable
  – measured indirectly using (e.g.) correlation studies
  – this is what I'm interested in!

Types of validity
● Content validity: assessment needs to
  – be relevant to the course
  – cover all of the course (not just the parts that are easy to assess)

Who discovered the Quicksort algorithm?
a) Donald Knuth
b) C.A.R. Hoare
c) Edsger Dijkstra
d) Alan Turing

Types of validity
● Construct validity: assessment measures the psychological construct (skill, knowledge, attitude) it's supposed to measure.
  – Can't be evaluated directly, so we have to use other forms of validity as a proxy (Cronbach & Quirk, 1976)

You can store several items of the same type in an:
a) pointer
b) array
c) struct
d) variable

Example: time and construct validity
● Allocating too little time for a task threatens validity
  – you end up assessing time management or organizational skills
● Allocating too much time can also threaten validity!
  – students can spend a long time working on programming tasks
  – they can go through many redesign cycles instead of just a few intelligent ones
  – even with an unintelligent heuristic, a student can eventually converge on a “good enough” answer given enough iterations
  – not a true test of problem-solving/design ability: a construct validity threat

Types of validity
● Criterion validity: the assessment results correspond well with other criteria that are expected to measure the same construct
  – “predictive validity”: results are a good predictor of performance in later courses
  – “concurrent validity”: results correlate strongly with results in concurrent assessment (e.g. two parts of the same exam, exam and prac in same year, corequisite courses, etc.)
  – We can measure this!
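
As a minimal sketch of how concurrent validity might be measured, the Python below computes a Pearson correlation coefficient and a least-squares best-fit line between two sets of percentage marks. The function names `pearson_r` and `best_fit_line` are hypothetical, and the marks are made up for illustration; they are not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, returned as (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical percentage marks for five students (illustrative only).
exam_pct = [20.0, 40.0, 60.0, 80.0, 100.0]
prac_pct = [55.0, 60.0, 70.0, 75.0, 90.0]

r = pearson_r(exam_pct, prac_pct)
slope, intercept = best_fit_line(exam_pct, prac_pct)
print(f"r = {r:.2f}; prac = {slope:.3f} * exam + {intercept:.1f}")
```

The intercept of the fitted line is the prac mark the line predicts for a student who scored zero on the exam, so an intercept well above zero suggests that something other than the exam-measured skill is contributing to the prac mark.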

Method
● Took CSE1301 prac and exam results from 2001, only those who had sat both the exam and at least one prac
● Grouped exam questions into
  – multiple choice
  – short answer
  – programming
● Calculated percentage mark for each student in each exam category, plus overall exam and overall prac
● Generated scatterplots and best-fit lines from percentage marks

Predictions
● Programming questions on the exam should be the best predictor of prac mark...
● ...followed by short answer...
● ...with multiple-choice being the worst predictor
  – programming skills are clearly supposed to be assessed by on-paper coding questions and pracs
  – many short-answer questions cover aspects of programming, e.g. syntax
Sounds reasonable, right?

MCQ vs Short Answer
● Strong correlation: 0.8
● Same students, same exam (so same day, same conditions, same level of preparation)

MCQ vs Code
● Correlation 0.82
● Note the X-intercept of 30% for the best-fit line

MCQ vs Prac
● Correlation only 0.55
● We predicted a relatively poor correlation here, so that's OK
● Note the Y-intercept

Short Answer vs Code
● Correlation 0.86
● SA is a better predictor than MCQ; so far so good
● Note the X-intercept at 20: a guesswork effect?

Short Answer vs Prac
● Correlation 0.59
● Stronger than MCQ, as expected, but only slightly.

Code vs Prac
● Correlation still only 0.59: no better than short answer!
● Note that the best-fit line has a Y-intercept of more than 50%!

Exam vs Prac
● Note that someone who got zero for the exam could still expect 45% in the pracs
  – 45% was the hurdle requirement for the pracs

Summary
● Exam programming and lab programming are strongly correlated, so they're measuring something. But...
● Exam programming results are not a better predictor of ability in pracs than short-answer questions, and only slightly better than multiple-choice.
● Something is definitely not right here!

What next?
● I still haven't asked the really important questions:
  – what do we think we're assessing?
  – what do the students think they're preparing for?
  – are pracs or exams better predictors of success in later courses, especially at second-year level?
  – what are the factors that affect success in programming-based assessment tasks, other than programming ability?
  – computer programming and computer science: how are they different? What are the ramifications for our teaching and assessment? (This is a big and probably postdoctoral question.)

What's next?
● Current plan for my PhD research: three stages
● What do we think we're doing?
  – interview lecturers to determine what skills they are trying to assess
● What are we doing?
  – obtain finely grained assessment results for first-year and second-year core subjects for one cohort and analyse these results to see which tasks have highest predictive validity
  – interview students to determine how they approach assessment
● What should we be doing?
  – suggest feasible ways we can improve assessment validity

Bibliography: Reliability and validity
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing.
Cronbach, L. J. (1971). “Test validation”. In R. L. Thorndike (Ed.), Educational Measurement.
Cronbach, L. J. & Quirk, T. J. (1976). “Test validity”. In International Encyclopedia of Education.
Oosterhof, A. (1994). Classroom applications of educational measurement. Macmillan.

Bibliography: General and CS
Ashworth, P., Bannister, P. & Thorne, P. (1997). “Guilty in whose eyes? University students' perceptions of cheating and plagiarism in academic work and assessment”, Studies in Higher Education 22(2), pp. 187-203.
Barros, J. A. et al. (2003). “Using lab exams to ensure programming practice in an introductory programming course”, ITiCSE 2003, pp. 16-20.
Chamillard, A. & Joiner, J. K. (2000). “Evaluating programming ability in an introductory computer science course”, SIGCSE 2000, pp. 212-216.
Daly, C. & Waldron, J. (2001). “Introductory programming, problem solving, and computer assisted assessment”, Proc. 6th Annual International CAA Conference, pp. 95-107.
Daly, C. & Waldron, J. (2004). “Assessing the assessment of programming ability”, SIGCSE 2004, pp. 210-213.
de Raadt, M., Toleman, M. & Watson, R. (2004). “Training strategic problem solvers”, SIGCSE 2004, pp. 48-51.
Knox, D. & Wolz, U. (1996). “Use of laboratories in computer science education: Guidelines for good practice”, ITiCSE 1996, pp. 167-181.
Kuechler, W. L. & Simkin, M. G. (2003). “How well do multiple choice tests evaluate student understanding in computer programming classes?”, Jnl of Information Systems Education 14(4), pp. 389-399.
Lister, R. (2001). “Objectives and objective assessment in CS1”, SIGCSE 2001, pp. 292-297.
McCracken, M. et al. (2002). “A multinational, multi-institutional study of assessment of programming skills of first-year CS students”, SIGCSE 2002, pp. 125-140.
Ruehr, F. & Orr, G. (2002). “Interactive program demonstration as a form of student program assessment”, Jnl of Computing Sciences in Colleges 18(2), pp. 65-78.
Sambell, R. & McDowell, L. (1998). “The construction of the hidden curriculum: Messages and meanings in the assessment of student learning”, Jnl of Assessment and Evaluation in Higher Education 23(4), pp. 391-402.
Snyder, B. R. (1973). The hidden curriculum, MIT Press.
Thomson, K. & Falchikov, N. (1998). “‘Full on until the sun comes out’: The effects of assessment on student approaches to study”, Jnl of Assessment and Evaluation in Higher Education 23(4), pp. 379-390.