How to Use… 

How to Use… Guide to the practical sessions and to working with software for the purposes of the Essex Summer School 2011 course 1E ‘Introduction to multilevel models with applications’

Paul Lambert, July 2011 (Version 2.0 of 20/JUN/2011)

Contents

How to... Undertake the practical sessions for module 1E
   Principles of the practical sessions
   Organisation of the practical sessions
   ..The tremendous importance of workflows..
   Practical sessions 1‐10: Contents
   Software alternatives
How to use… EditPad Lite
How to use… Stata
How to use… SPSS / PASW
How to use… MLwiN
   How to use MLwiN (i): MLwiN the easy way
   How to use MLwiN (ii): MLwiN for serious people
   How to use MLwiN (iii): Using ‘runmlwin’
   Some miscellaneous points on working with MLwiN
How to use… R
References cited
   Software‐specific recommendations
How to... Undertake the practical sessions for module 1E

The next few pages describe topics and materials from the lab sessions. The step‐by‐step implementation instructions for each session are largely to be found within the specific ‘syntax’ files for the relevant sessions (.do, .sps, .mac and .R files), which are distributed electronically during the course of the module. In general terms, the task in the labs is to open up the relevant syntax files and work your way through them, digesting the examples shown (and potentially adding your own notes, adjustments or examples). You’ll ordinarily need to have the analytical software open, along with the relevant tool for working with a syntax file (e.g. ‘syntax window’ or ‘do file editor’). In addition it will typically also be helpful to have open some applications to remind you of where the data is stored, and perhaps a plain text editor allowing you to conveniently open up several other syntax files for purposes of comparison.

• When working with Stata, a typical view of your desktop might be something like:

Description: The first two interfaces you can see in this screenshot are respectively the Stata do file editor (where I write commands and send them to be processed, such as by highlighting the relevant lines and clicking ‘ctrl‐d’); and the main Stata window (here Version 10), which includes the results page. Note that the syntax file open is a modified (personalised) version of the supplied illustrative syntax file – the name has been changed so that I can save the original file plus my own version of it after edits (e.g. with my own comments). Behind the scenes I’ve also got open an ‘EditPad Lite’ session which I’m using to conveniently open up and compare some other syntax files that I don’t particularly want in my do file editor itself; I’ve also got file manager software open showing the data I’m drawing upon (in what Stata will call ‘path1a’); and I’ve got some Internet Explorer (IE) sessions open (I’m looking up the online BHPS documentation, and the Stata listserv, where information on Stata commands is available).
• When working with SPSS, you may have a desktop session something like:

Description: The three SPSS interfaces shown in this screenshot are firstly the syntax editor (where I write commands and send them to be processed, such as by highlighting the relevant lines and clicking ‘ctrl‐r’); then behind this are the ‘data editor’ and the ‘output’ window in SPSS, where I can see the data and results from the commands. Again, the syntax file open is a modified (personalised) version of the supplied illustrative syntax file – the name has been changed so that I can save the original file plus my own version of it after edits (e.g. with my own comments). Whilst working in SPSS I’ve also got open an ‘EditPad Lite’ session which I’m using to keep open and compare some other syntax files that I’m checking against; I’ve also got file manager software open (Windows Explorer) to show the location of the data files I’m using (i.e. the contents of ‘path1a’); and I’ve got some IE sessions open (I’m looking at the BHPS documentation, but it’s late in the day and my self‐discipline is slipping a little, so I’ve got my email open as well, which is never going to help with my multilevel modelling..!).
Principles of the practical sessions

The programme of practicals included within course 1E is intended to help you first in understanding the topics covered in teaching sessions, and second in furnishing you with important skills in applying multilevel analysis techniques to social science data. For the practical sessions we provide this handout, which briefly describes the contents of the sessions and the Essex lab setup, and then gives more extended material describing the various software packages used in the labs. In addition, within the labs you are given copies of extended software‐specific command files which contain the details of the exercises in the form of software code and some degree of descriptive notes. These sample command files will hopefully provide you with some useful worked examples which will help you for quite some time into the future. Supplementing these resources, within the Coursepack you are also given various ‘homework’ exercises which in many instances serve to test the skills covered in the practical sessions.

A lot of material is presented across the programme of practicals, particularly in the details of supplementary exercises or activities. It is easy with software‐based materials to skip through parts of the sessions briefly without properly digesting them. Nevertheless you will get the best out of the module if you work through the practical exercises in full (even if that involves catching up with some aspects of them later on). You should also be prepared for the fact that covering the materials in full will often involve following up on other information sources, such as software help guides and files.

To some extent, part of learning to work effectively with complex data and analysis methods (which is what, arguably, defines multilevel modelling of social science processes) is about teaching yourself to be a better software programmer in terms of using relevant software packages (such as Stata, SPSS and MLwiN). Many social scientists have not previously been trained in this manner. Good programming involves things like using software in a manner which generates a syntactical record of the procedures; keeping your software commands and files well organised; and learning specific programming rules, fragments of code, and code options, as are relevant to the particular software packages (see also the sections immediately below). Much of the work of course 1E, indeed, concerns the development of your skills in using software for handling and analysing social science datasets. To pre‐warn you: for most of us, the development of our programming skills in this domain is a long and often arduous process.

It is possible to trace three themes running through the practical sessions which summarise the approach we are adopting, and the principles of good practice in using software for handling and analysing social science datasets which we follow:
• Documentation for replication
• The integration of data management and data analysis
• The use of ‘real’ data
Documentation for replication

The first principle refers to the importance of working in such a way as to keep a clear linear record of the steps involved in your analysis. This serves as a means to support the long‐term preservation of your work. A useful guiding principle is that after a significant break in activities, you yourself – or another researcher at some future point – could retrace and replicate your analysis, generating exactly the same results. There are several excellent recent texts which explain the importance of the principle of replicability to a model of systematic scientific enquiry with social science research data (e.g. Dale, 2006; Freese, 2007; Kohler and Kreuter, 2009). Long (2009) gives a very useful guide to organising your research materials for this purpose; for another comparable discussion see Altman and Franklin (2010).

In social statistics we work overwhelmingly with software packages which process datasets and generate results. In this domain, documentation for replicability is effectively achieved by (a) the careful and consistent organisation and storage of data files and other related materials, and (b) the systematic use (and storage) of ‘syntactical’ programming to process tasks. ‘Syntactical programming’ involves finding a way to write out software commands in a textual format which may be easily stored, reopened, re‐run and/or edited and rerun. The use of syntactical programming is not universal in the analysis of social science research data – but it should be. Examples include using SPSS ‘syntax files’ and Stata ‘do files’, and almost all software programmes support some sort of equivalent syntactical command representation. A great attraction of programmes such as SPSS and Stata is that their syntactical command language is largely ‘human readable’ (that is, you can often guess what the syntax commands mean even before you look them up, as they often use English language terms). We’ll say more on applying syntactical programming in the supplementary ‘How to use..’ sections below.

The integration of data management and data analysis

The second principle involves recognising the scope for modification and enhancement of quantitative data in the social sciences. We all know that data is ‘socially constructed’, but the impact for analysts, which we sometimes forget, is that many measures in a survey dataset are ‘up for negotiation’ (such as in choices we have regarding category recodes, missing data treatments, derived summary variables, and choices over ‘functional form’). The way that such decisions are actually negotiated in any particular study comprises an element of research which is often labelled ‘data management’. It is our experience that most applied projects require more input to data management than to the data analysis aspects of the work. As such, finding software approaches (or combinations of software approaches) which support seamless movement back and forth between data management and data analysis is an important consideration. In this regard, Stata stands out at the current time as the superior package for social survey research, as it combines extended functionality for both components of the research process (this is the main reason that Stata is used more frequently than other packages in this course).
Other packages or combinations of packages can also be used effectively (our other main means of working involves finding effective ways to combine SPSS with MLwiN – the former has a fairly full range of data management capacity and a good range of data analysis options; the latter has a wider range of multilevel data analysis functions but supports only limited data management tasks). (Aside, and a shameless plug: if you are interested in the methodological topic of data management more generally, we have a related project on this very theme, where amongst other things we provide online data‐oriented resources and training materials, and run workshop sessions, oriented to data management issues in social survey research: see the ‘DAMES’ project at www.dames.org.uk.)

The use of ‘real’ data

The last principle may sound rather trite but is an important part of gaining applied skills in social science data analysis. Statistics courses and texts ordinarily start with simplified illustrative data, such as simulated data and data files which are very small in terms of the number of cases and variables involved. This is for very good reasons – trying anything else can lead to access permission problems, software processing time delays, and onerous requirements for further data management. In this course we do often use simplified datasets with these characteristics; however, as much as possible we also include materials which start from the original source data (e.g. the full version of the data such as you could get from downloading via the UK Data Archive or a similar data provider).

There are quite a few reasons why using full size data and the original measures is a good thing to aspire to. Firstly, in the domain of multilevel modelling techniques, estimation and calculation problems, and the actual magnitude of hierarchical effects, can often be different in ‘real’ datasets compared to those from simplified examples (unfortunately, in ‘real’ applications the former tend to be much greater, and the latter much smaller..!). Secondly, using full size data and original variables very often exposes problems, errors or required adjustments with software routines which may not otherwise be apparent (errors caused by missing data conventions and upper limits on the numbers of categories, cases or variables are common problems here). Thirdly, our experience is that by using ‘real’ data during practical sessions, as course participants you are much more likely to go away from the course in a position to readily apply the analytical methods covered to your own data and analysis requirements.
Organisation of the practical sessions

Materials referred to in the 1E practical sessions will include:
• data files (copies of survey data used in the practicals);
• sample command files (pre‐prepared materials which include programming commands in the language of the relevant software);
• supplementary ‘macros’ or ‘sub‐files’ (further pre‐prepared materials featuring programming commands in relevant languages, usually invoked as a sub‐routine within the main sample command files);
• teaching materials (the Coursepack notes and lecture files).

At each session we will give you example command files for the data analysis packages which take you through the process of carrying out the relevant data analysis operations. The main activities of the lab sessions involve opening the relevant files and working your way through them. During the labs, we do of course expect that you can ask for help or clarification on any parts of the commands which are not clear.

During the workshop you will be able to store data files at a network location within the University of Essex systems. You should allocate a folder for this module, and use subfolders within it to store the materials for this module. A typical setup will look something like:

This screenshot is from my own laptop so it won’t be identical to your machine, but you should have something broadly similar. As illustrated, I’ve made a folder called ‘d:\essex10’ within which I’ve made a number of subfolders for different component materials linked to the course. For example, in the ‘syntax’ subfolder I’ve got copies of the four example command files linked to Practical session 1; although not shown, other things from the module are in the other subfolders, such as a number of other relevant command files in the ‘sub_files’ folder, and some relevant data files in the ‘data’ folder.
An important point to make is that the various command files will need to draw upon other files (e.g. data files) in order to run. To do this, they need to be able to reference the location of the required files. In most applications, we do this by defining ‘macros’ which point to specific ‘paths’ on your computer (see also the software sections below). For the labs to work successfully, it will be necessary to ensure that the command file you are trying to run is pointing to the right paths at the right time. In general, this only requires one specification to be made at the start of the session, for instance whereby in SPSS and in Stata we define ‘macros’ for the relevant paths (also see pages 12/20 of the ‘How to use’ guide on this issue). Sometimes however it can be necessary to edit the full path reference of a particular file in order to be able to access it. For example, in the text below, we show some Stata and SPSS commands which in both cases define a macro (called ‘path3a’) which gives the directory location of the data file ‘aindresp.dta’ or ‘aindresp.sav’, so that subsequent commands calling it will go directly to that path:

Stata example:
global path3a "d:\data\bhps\"
use pid asex using $path3a\aindresp.dta, clear
tab asex

SPSS example:
define !path3a () "d:\data\bhps\" !enddefine.
get file=!path3a+"aindresp.sav".
fre var=asex.
Example command files

Most of the command files are annotated with descriptive comments which try to explain, at relevant points, what they are doing in terms of the data and analysis. In general, to use the example command files, you should:
• Save them into your own filestore space
• Open them with the relevant part of the software application (e.g. the ‘syntax editor’ in SPSS or the ‘do file editor’ in Stata; see also the ‘How to use..’ subsections)
• Save the files with a new name to indicate your personalised copy of the file (i.e. so that you can edit the files without risk of losing the original copies)
• Make any necessary edits to the path locations given in the files (at the top of the relevant files)
• Work through the files implementing the commands and interpreting the outputs, adding your own comments to the files to help you understand the operations you’re performing

Data Files

We’ll use a great many data files over the course of the practical programme. We’re planning to supply them via the network drive, though it may sometimes be easier to transfer them by other means. Many of these data are freely available online, such as the example data files used in the textbooks by Hox (2010), Treiman (2009) and Rabe‐Hesketh and Skrondal (2008).
Some of the data files are versions of complex survey datasets from the UK and from comparative international studies. We can use these for teaching purposes, but you are not allowed to take them away with you from the class. You can in general access these data for free; however, to do so you must register with the relevant data provider (e.g. the UK Data Archive) and agree to any associated conditions of access (e.g. required citations and acknowledgements).

Macros/non‐interactive command files

As well as the example ‘interactive’ command files that you’ll open, work through and probably edit slightly, at certain points we will also supply you with command files in the various software languages which essentially serve as ‘macros’ to perform specific tasks in data management or analysis. These ordinarily do not need to be further edited or adjusted, and it is not particularly necessary to examine their contents, but it will be necessary to take some time to place the files in the right location in order to invoke them successfully at a suitable time.

Where to put your files..

In general it will be effective to store your files (example command files, data files, and macro files) in the allocated network filestore space during the course of the summer school (the ‘M:\ drive’). This filestore should be available to you from any machine on the Essex system. Some of the larger survey data files, however, might not be effectively stored on the network location (due to their excessive size). In some instances, it might be more efficient to store the files on the hard drive of the machine you are using (e.g. the C:\ drive), though if doing this you should be aware that the contents of the hard drives may not be preserved when you log off. The various exercises and command files ought all to be transferable to other machines (e.g. your laptop or your work machine at your home institution), though obviously you will need to install the relevant software if you don’t already have it. At the end of the module, you will probably want to take a copy of all your command files and related notes, plus the freely available example datasets and the macro files, on your memory stick or by some other means, to take back with you.

Analysis of your own data

In general we would very much encourage you to bring your own data to the practical sessions. You’ll get a lot out of the module if you are able to repeat the demonstration analyses on your own data and research problems. It will ordinarily be most effective to undertake the class examples first, then subsequently try to re‐do the analysis in part or in whole on your own data. If we can, we’ll help you in working with your own data within lab sessions, though when the class is busy we need to prioritise people who are working through the class examples.
..The tremendous importance of workflows..

If you haven’t come across the idea of workflows before, then this might simply be a point that, at this stage, we ask you to take in good faith! There are very good expositions of the idea of workflows in the social science data analysis process in, amongst others, Long (2009), Treiman (2009), and Kohler and Kreuter (2008).

A workflow in itself is a representation of a series of tasks which contribute to a project or activity. It can be a useful exercise to conceptualise a research project as a workflow (with components such as data collection, data processing, data analysis, report writing). However, what is a really important contribution is to organise the data and command files that are associated with a project in a consistent style that recognises the relevant contributions to the workflow structure.

What does that involve? The issue is that we want to construct a replicable trail of our data oriented research, which allows us to go all the way from opening the initial data file to producing the publication quality graph or statistical results which are our end products. We need the replicable trail in order to adjust our analysis on the basis of minor changes at any possible stage of the process (or to be able to transfer a record of our work on to others). However, because the trail is long and complex (and not entirely linear), we can only do this, realistically, if we break down our activities into multiple separate components. There are different ways to organise files for these purposes, but a popular and highly effective approach is to design a ‘master’ syntax command file and a series of ‘sub‐files’ which it draws upon. In this model, the sub‐files cover different parts of the research analysis. Personally, my preference is to construct both the master and sub‐files in the principal software package being used, though Long (2009) notes that creating a documentation master file in different software (e.g. MS Excel) is an effective way to record a wider range of activities which span across different software. Here’s an example of a series of tasks being called upon via a Stata format ‘master’ file:
(This screenshot shows the Stata master file, and the sub‐files which are mostly open within the EditPad editor ‐ except for a few other files which I’ve opened in the do file editor. The Stata output file is not visible but is open behind the scenes).
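Since the screenshot itself is not reproduced in this text, here is a minimal sketch of what such a Stata master file might contain (the folder and file names are illustrative assumptions, apart from ‘pre_analysis1.do’ which features in the Excel example below):

* master.do - the top of the replicable trail for the project
* Define path macros once here, so that all sub-files stay transferable
global path_data "d:\essex10\data"
global path_subs "d:\essex10\sub_files"

* Step 1: data management (open the source data; recodes; derived variables)
do $path_subs\pre_analysis1.do

* Step 2: descriptive summaries of the clustered data
do $path_subs\describe_data.do

* Step 3: estimate and store the multilevel models
do $path_subs\run_models.do

* Step 4: export the publication quality graphs and tables
do $path_subs\make_outputs.do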
Here’s an example of a project documentation file that might be constructed in Excel:

Note that the other tabs in the Excel file can be used to show things like author details, the context of the analysis, and the last update time. The file also notes some (though not all) dependencies within the workflow – for instance, step 9 requires step 4 to have been taken (the macro reads in a plain text data file that was generated in Stata by the do file pre_analysis1.do).

In summary, we can’t stress strongly enough the value of organising your data files around a workflow conceptualisation, such as through master and sub‐files. Read the opening chapters of Long (2009), or the other references mentioned above, for more on this theme. We’d encourage you to look at the workshop materials from the ‘DAMES’ research Node, at www.dames.org.uk, for more on this topic.
Practical sessions 1‐10: Contents

Practical 1: Getting started with multilevel software and data
Coverage: A lot of things!
• Get set‐up with your account and file locations for the course.
• You should read every single word of the ‘How to use..’ guide!
• Open up and briefly explore the examples given in each of the four packages.
Command files: p1_stata.do, p1_SPSS.sps, p1_mlwin.mac, p1_R.R. Each provides a few brief exercises in opening the packages and summarising some relevant data. The lab also refers you to extended introductory files covering syntax commands for the analysis of data in Stata or SPSS. These files are available at http://www.longitudinal.stir.ac.uk/workshop_materials.html, called ‘lab0.do’ and ‘lab0.sps’. This site also has information on how to access the data files used in the sessions.
Other notes: Some examples require other sub‐files to be run (see descriptions in the syntax). Covers familiarity with the ways of handling data in the relevant software files, and the mechanisms for the lab session programme. [Don’t expect to complete all of the activities in the time allocated – but work through the background materials over the course of the week]
P2a/b: Examples of two‐level hierarchical data / Estimating two‐level multilevel models
Coverage: Five examples of setting up data with a two‐level hierarchical structure (lab 2a) and estimating variance components models with and without covariates for those models (lab 2b).
Command files: p2a_stata.do / p2b_stata.do; p2a_SPSS.sps / p2b_SPSS.sps; p2b_mlwin.mac; p2b_R.R. The ‘2a’ files set up a number of datasets for further analysis throughout the module. The ‘2b’ files run various ‘random intercepts’ models.
Other notes: Some examples require other sub‐files to be run (see descriptions in the syntax). Understanding of the organisation of two‐level hierarchical data in the relevant packages, and popular analytical approaches which do not use random effects modelling. Understanding of how to specify the principal estimation options for random intercepts models (ML, REML, GLS) via the relevant commands in Stata (xtreg, xtmixed and gllamm), in SPSS (mixed) or in MLwiN, and the typical properties and contributions of random intercepts estimates. Understanding of the interpretation of random intercepts models and of level 2 residuals.

P3a/b: Models for random intercepts and slopes / Interpreting complex multivariate models
Coverage: Introduction to the specification of random coefficients across a variety of packages, replicating the examples discussed in lectures. Some other routines on handling and assessing regression data in the Stata file.
Command files: p3a_stata.do, p3b_stata.do, p3b_spss.sps, p3b_R.R, p3b_mlwin.mac
Other: A recommended exercise after looking over the examples is to generate your own models with the same datasets but trying out different variables and/or variance components. Be wary that some models which you could specify would take a long time to estimate in Stata. A rule of thumb is to give up and hit ‘break’ after 5 minutes of running, and try a different model instead. The speed differences between Stata and other packages are sometimes very substantial.
P4: Data and models with complex clustering
Coverage: An illustration of 3 level models in SPSS, Stata, MLwiN and R, and example cross‐classified model specifications in Stata, MLwiN and R.
Command files: p4_stata.do, p4_spss.sps, p4_mlwin.mac, p4_R.R
Other: Don’t expect the cross‐classified model examples to work smoothly. I found on my laptop that both MLwiN and Stata crashed whilst trying to estimate versions of the BHPS SOC cross‐classified model.

P5: Exploring and exploiting hierarchical data
Coverage: More on techniques for describing clustered data and variables. An example of interpreting higher versus lower level explanatory variables. An example of centring and standardising variables. An example showing the change (decline) in higher level random effects variance with increasing explanatory variables at either the higher or lower level. Looking at the variance covariance matrix (Stata only).
Command files: p5_stata.do, p5_SPSS.sps
Other notes: Requires you to run the programme file ‘variance_summaries.do’ in order to get the ICC diagnostics data.
P6: Alternative statistical treatments for clustered data / Comparing hierarchical and other effects
Coverage: A few selected examples of multilevel models and other modelling approaches for summarising hierarchical patterns – mostly corresponding to the examples presented in the lecture slides (lecture 8).
Command files: p8a_stata.do (just the one pre‐prepared file for this session)
Other: By this stage of the module, we’d encourage you to use the lab time to work with your own data examples as well as the pre‐prepared routines.

P7: Analysing panel data as a multilevel dataset
Coverage: Stata and SPSS examples which illustrate opening up the BHPS annual sample files and generating a panel dataset, then exploring and analysing long format panel data using various methods including multilevel models with random effects.
Command files: p7_stata.do, p7_spss.sps
Other: Requires access to the full BHPS sample. Many of the materials in these files are extracts from the exercises at www.longitudinal.stir.ac.uk; refer to those examples for some further extensions on this topic. At the time of writing (16/7/10) the SPSS syntax file would benefit from further debugging. Since the lab session, both the SPSS and Stata files have been updated on the S drive with minor bug fixes:
• The Stata file now uses xtmixed rather than xtreg for the repeated cross‐sectional model, since xtreg won’t support the model with collinear predictors in Stata v11 (it’s ok in v10).
• There were several typos and erroneous cross‐references within the SPSS command file which have now been adjusted.
P8a/b: Clustered data with categorical measures / Applications of models with binary outcomes
Coverage: Examine binary outcomes data in the context of hierarchical datasets; look at the implications of GLM transformations for binary outcomes; implement some standard binary outcomes models in Stata and MLwiN.
Command files: P6a_stata.do, P6b_stata.do, P6a_spss.sps, P6b_mlwin.mac

P9: Practical applications with different outcomes
Coverage: Syntax files in Stata and MLwiN show a short selection of multilevel model specifications for complex categorical data and event history data using previously prepared data extracts.
Command files: p9_stata.do, p9_mlwin.mac

P10: Review and recap of practical skills
There are no additional materials in session 10. We suggest you use the lab time to review previous materials or to work on applying some of the multilevel models considered to your own data.
Software alternatives

Many different software packages can be used effectively for applied research using complex and clustered social survey data. Various packages support the estimation of a wide range of multilevel models covering random effects models and other specifications. Packages are not uniform, however, in their support for multilevel models, with variations in their coverage of different modelling options; their requirements and approaches to operationalising data for modelling; the statistical algorithms the packages use; the relative estimation times required; the default settings for estimation options and other data options; and the means of displaying and storing the results produced. The Hox (2010) text comments systematically on software options for different modelling requirements, and other comparisons can be found in Twisk (2006) and West et al. (2007), amongst others. A very thorough range of material reviewing alternative software options for multilevel modelling is available at the Centre for Multilevel Modelling website at: http://www.cmm.bristol.ac.uk/learning‐training/multilevel‐m‐software/

In this course we focus upon four packages which bring slightly different contributions to multilevel modelling.
• SPSS is used because it is a widely available general purpose package for data management and data analysis, which also supports the estimation of many of the random effects multilevel models for linear outcomes. Field’s (2009) introductory text on SPSS includes a chapter on multilevel modelling in SPSS.
• Stata is used because it is a popular general purpose package for data management and data analysis which also includes a substantial range of multilevel modelling options. Stata is attractive to applied researchers for many reasons, including its good facilities for storing and summarising estimation results; its support of a wide range of advanced analytical methods which complement a multilevel analysis (e.g. clustering estimators used in Economics); and its wide range of data management functions suited to complex data. Rabe‐Hesketh and Skrondal’s (2008) text provides a very full intermediate level guide to multilevel modelling in Stata.
• MLwiN is used because it is a widely available, purpose built package for applied multilevel modelling which covers almost all random effects modelling permutations, and is also designed to be broadly accessible to non‐specialist users. A further attraction of MLwiN is that its purpose built functionality means that it supports a wider range of estimation options, and achieves substantially more efficient estimation, than many other packages. The MLwiN manual by Rasbash et al. (2009) is a very strong, and accessible, guide to this software.
• R is used occasionally in this module because it is popular freeware that supports most forms of multilevel model estimation (via extension libraries). It is a difficult language for social scientists to work effectively with, however, because it brings with it very high ‘overheads’ in its programming requirements, especially for large and complex data. Gelman and Hill (2007) give an excellent advanced introduction to multilevel modelling which is illustrated with examples in R.

We should stress that many more packages can be used effectively for multilevel modelling analysis. Some of those that are often used for multilevel modelling in social science research, but that we have not mentioned hitherto, are HLM (Raudenbush et al., 2004), M‐Plus (Muthén and Muthén, 2010), and BUGS (e.g. Lunn et al., 2000). Some cross‐over between packages can also be found, such as in linkage routines known as ‘plug‐ins’ which allow a user to invoke one software package from the interface of another – examples include routines for running HLM (hlm, see Reardon, 2006) and MLwiN (runmlwin, see Leckie and Charlton, 2011) from Stata, and tools to run an estimation engine known as ‘Sabre’ in Stata and in R (see Crouchley et al., 2009).

Finally, an exciting software development in the area, being led in the UK, is the construction of a generic interface for specifying and estimating statistical models of ‘arbitrary complexity’. These cover most forms of multilevel models, as well as many other statistical modelling devices. This project is called ‘e‐Stat’ and is expecting to generate its first publicly available resources over the period 2010‐2011 (see http://www.cmm.bristol.ac.uk/research/NCESS‐EStat/).
How to use… EditPad Lite

The first package we’ll introduce is not a statistical analysis package. EditPad Lite is a freely distributed plain text editor, and we go to the trouble of introducing it for two reasons:
• Using a plain text editor to open supplementary files and subfiles is a very good way to perform complex operations and programming without getting into too tangled a mess.
• In our experience, plenty of people haven’t previously used plain text editors, and are initially confused by the software until they’ve got used to using it.

1) Access/installation

To access the package, go to http://www.editpadpro.com/editpadlite.html and follow the links to download the software:
2) More on installation

It is best to click ‘save file’ and run the setup .exe file from a location on your computer. This will extract four component files which allow you to run the software, regardless of where they are stored. You could for instance install the files on a section of your memory stick and run the programme from there, but I would recommend saving them to your network filestore folder and running the software from that location. E.g.:

You can now run the software by double clicking the ‘EditPadLite.exe’ file. To save yourself some time, you might want to copy that .exe file, then ‘paste as shortcut’ onto your desktop (and maybe also add a shortcut key to the shortcut icon), so that you can open up the software rapidly (though such desktop settings might not be preserved across login sessions).
3) Opening files

When you open the software, you are given a blank template from where you can then ask to open one or more files (you can in fact open several files in one call, which can be convenient):

(note how the multiple files are organised under multiple tabs)
4) Editing files

You can use EditPad Lite to edit files (for instance sub‐files and macros which are called upon in other applications). There are numerous handy file editing functionalities in EditPad Lite which you won’t get in a simpler text editor such as Notepad. For instance you can run various types of ‘search’ and ‘replace’ commands, across one or multiple files. The latter is often helpful, for instance if you have built up a suite of related files and you want to change a path specification in all of them.

(The red tabs indicate the contents of the file have been changed and have not yet been saved)

Closing comments on EditPad Lite

Using a plain text editor is very helpful if your work involves programming in any software language. It can help you to deal with complex combinations of files, and it provides a ‘safe’ way of editing things without inadvertently adding in formatting or other characters (a problem that will arise if you try to edit command files in a word processor such as Word). We’ve nominated EditPad Lite for this purpose, though you might find a different editor more helpful or preferable. There are many alternatives, some of which are free to access for some or all users, and others of which require a (typically small) purchase price. Note that if you wish to use EditPad for commercial purposes, the licence requires that you purchase the ‘professional’ version of the package, called ‘EditPadPro’.
How to use… Stata

Stata was first developed in 1984 and was originally used mainly in academic research in economics. From approximately the mid‐1990s its functionalities for social survey data analysis began to filter through to other social science disciplines, and in the last decade it has displaced SPSS as the most popular intermediate‐to‐advanced level statistical analysis package in most academic disciplines which use social survey data (e.g. sociology, educational research, geography).

Stata is popular for many good reasons. The features of Stata that lead me personally to favour this package above others are:
• It supports explicit documentation of complex processes through a concise and ‘human readable’ syntax language
• It supports a wide range of data management functions, including many routines useful for complex survey data which are not readily performed in other packages (e.g. ‘egen’, ‘xtdes’; see the sketch below)
• It supports a very full range of statistical modelling options, including several advanced model specifications which are not widely available elsewhere
• It has excellent graphics capabilities, supporting the specification and export of publication quality graphs (in a syntactical, replicable manner)
• It features very convenient tools for storing the results from multiple models or analyses and compiling them in summary tables or files (e.g. ‘est store’, ‘statsby’)
• It can read online data files and run command files and macros from online locations
• It supports extended add‐on programming capabilities, and benefits from a large, constructive community of user‐contributed extensions (see e.g. http://www.stata.com/links/resources3.html)

In pragmatic terms, most users of Stata are reasonably confident programmers, and getting started with the package does need a little effort in learning about data manipulation and data analysis. This is one reason why Stata is not yet widely taught in introductory social science courses, though, in the UK for example, it is increasingly used in intermediate and advanced level teaching (e.g. MSc programmes or undergraduate social science programmes with extended statistical components).

A common problem with working with Stata is that many institutions do not have site‐level access to the software, and accordingly many individual researchers don’t have access to the package – Stata is generally sold as an ‘n‐user’ package, which means that an institution buys a specified number of copies at any one time. Recently, however, access to Stata for academic researchers has probably been made easier by the Stata ‘GradPlan’, which allows purchase of personal copies of the package for students and faculty at a fairly low price – see http://www.stata.com/order/new/edu/gradplan.html. Stata also comes in several different forms with different upper limits on the scale of data it may handle – ‘Small Stata’ is not normally adequate for working with advanced survey datasets; ‘Intercooled’ Stata (I/C) usually has more than enough capacity to support social survey research analysis (although, when working with a large scale resource, you may occasionally hit upper limits, such as on the number of variables or cases, it is usually possible to find an easy work‐around such as by dropping unnecessary variables); Stata SE and MP offer greater capacity regarding the size of datasets and faster processing power, but they are more expensive to purchase. To my knowledge, most academic researchers use Intercooled Stata.

In summary, many users of Stata favour the package not because it offers one particular functionality which others don’t, but because it offers an integrated set of advanced functionalities covering data management and data analysis which can’t easily be matched by any other software. For other texts which explain the strengths and attractions of Stata, see for example Treiman (2009).
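As a minimal sketch of the sort of data management routine flagged in the list above, the following lines use ‘egen’ to derive a household size variable within a person‐level file (the path macro follows the conventions of this guide; the BHPS‐style identifier ‘ahid’ is an assumption for illustration):

use pid ahid using $path1a\aindresp.dta, clear
egen hhsize = count(pid), by(ahid)    // count of sample persons within each household
tab hhsize    // inspect the distribution of household sizes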
The steps below give you some relevant instructions on working with Stata for the purposes of the module (the examples are mostly from the Practical 1 Stata file). Many online resources on Stata are available; in particular we highlight:
• UCLA’s ATS pages: http://www.ats.ucla.edu/stat/stata/ (features a wide range of materials, including videos of using Stata and routines across the range of the package)
• The CMM’s LEMMA online course: http://www.cmm.bristol.ac.uk/learning‐training/course.shtml (includes detailed descriptions of running basic regression models and of specifying random effects multilevel models in Stata)
• In the first lab session we point you to an illustrative do file which serves as an introduction to Stata, available from www.longitudinal.stir.ac.uk

On opening the programme: when you launch the package, you see the basic Stata window (the image here shows Stata I/C version 10.1). You can customise its appearance (e.g. right click on the results window) – the image on the right is how I set up the windows on my machine, and will be slightly different to what you see on first launching the package in the lab at Essex.
The very first thing you should do at the start of every session is to ask explicitly to open the ‘do file’ editor, with ‘ctrl+8’ or via the GUI. Note below that we can have several do files open at once. Not shown below, but from Stata 11 onwards it is possible to enable various formatting options in the do file editor (e.g. colour coding). It is also possible to set up Stata to run directly from a plain text editor if you wish to (search online for how to do this).
Once you’ve opened a ‘do file’ you can begin running commands by clicking on the segments of the relevant command lines and clicking ‘ctrl+R’. The results are shown in the results window. Error messages are by default shown in red text and lead to the termination of the sequence of commands at that point (unlike in SPSS, which carries on, disregarding the error).

Important: Defining macros for paths. This particular image shows an important component of the start of every session in the module lab exercises. The lines beginning with ‘global’ are ways of defining permanent ‘macros’ for the session. The macros serve to define locations on my machine where my files (e.g. data files) are actually stored. Doing this means that in later commands (e.g. the image below) I can call on files via their macro folder locations rather than their full path – this aids transferability of work between machines/locations. In the above, the macro which I have called ‘path3b’ means that when Stata reads the line:

use $path3b\ghs95.dta, clear

what it reads behind the scenes is:

use c:\data\lda\ghs95.dta, clear
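The ‘global’ definitions themselves appear only in the screenshot, so here is a minimal sketch of what the pairing looks like in a do file (the folder matches the expansion shown above):

global path3b "c:\data\lda"
use $path3b\ghs95.dta, clear    // Stata substitutes the macro to give c:\data\lda\ghs95.dta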
You can also submit commands line by line through the command interface (e.g. if you don’t want to log them in the do file). Note how the ‘review’ window shows lines that were entered through the command window, but it just shows some programming code for commands run through the do file editor.

Note some of the features of the Stata syntax language:

You need to ‘clear’ the dataspace to read in a new file, e.g.:
use $path3b\ghs95.dta, clear

You can’t create a new variable with the same name as an existing one – if it’s there already you need to delete it first, e.g.:
drop fem
gen fem=(sex==2)

The ‘capture’ command suppresses potential error messages, so it is a useful way to make commands generic:
capture drop fem
gen fem=(sex==2)

Using ‘by:’ or ‘if..’ within a command can usefully restrict the specifications:
tab nstays if sex==1
bysort sex: tab nstays if age >= 20 & age <= 30

The ‘numlabel’ command is a useful way to show both numeric values and categorical labels, compared to the default (labels only), e.g.:
tab ecstaa
numlabel _all, add
tab ecstaa

There’s no requirement for full stops at the end of lines, but a carriage return serves as the default delimiter, and so we usually use ‘///’ to extend a command over more than one line:
bysort sex: tab nstays ///
  if age >= 20 & age <= 60 & ecstaa==1
Extension routines are often written by users and made available to the wider community. To exploit them, you need to run either ‘net install’ or ‘ssc install’. An example is finding and installing the ‘tabplot’ extension routine; see the sketch below (the exact code needed may depend on which machine you are working on – you may have to define a folder for the installation that you have permission to write to).

We often run subfiles, or define macros or programmes, by calling upon other do files with the ‘do’ command.
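As a minimal sketch of those two points (assuming the machine permits online installation; the sub‐file name is hypothetical):

findit tabplot    // search Stata's online resources for the user-written routine
ssc install tabplot    // install 'tabplot' from the SSC archive
help tabplot    // confirm the installation and read the documentation

do d:\essex10\sub_files\my_subfile.do    // invoking a sub-file with the 'do' command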
‘est store’ is a great way to collate and review multiple model results (see the sketch below).

Extension: you can write a ‘profile.do’ file to load some starting specifications into your session.

For lots more on Stata, see the links and references given above, or the DAMES Node workshops at dames.org.uk.
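As a minimal sketch of the ‘est store’ workflow mentioned above (the variable and model names are hypothetical):

quietly regress wage educ age    // fit a first model, suppressing the output
est store m1
quietly regress wage educ age female    // fit an expanded model
est store m2
est table m1 m2, stats(N r2) se    // review the stored results side by side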
How to use… SPSS / PASW

Here we will refer to SPSS as shorthand for the software which, in 2008, was officially rebranded as ‘PASW Statistics’, and then renamed again in 2009 as ‘IBM SPSS Statistics’. SPSS has been a market leader as a software tool for social survey data analysis and data management since its first versions in the late 1960s. Some people believe that position is coming to an end, since recent revisions to SPSS seem to make it less suitable for academic research users, whilst other packages such as Stata have emerged that seem to many to be more robust and more wide‐ranging in their capabilities. Nevertheless, most European educational establishments have site‐level access to SPSS, and continue to teach it to students as a first introductory package for survey data analysis; additionally a great many social science researchers continue to work with the package – so the software itself remains a significant part of social survey data analysis.

SPSS has a very good range of useful features for data management and exploratory analysis of survey data. It also has functionality that supports many variants of random effects multilevel models, although few advanced users of multilevel models probably use SPSS for this purpose (principally because other packages have a wider range of model specification options).[1] A common practice amongst multilevel modellers in the social sciences is to use SPSS for data preparation, in combination with a specialist multilevel modelling package such as MLwiN or HLM for the actual statistical analysis. In this module we’ll use SPSS in this way (alongside MLwiN), though we will also illustrate its own multilevel modelling capacity.

In working with SPSS for the preparation or analysis of complex survey data, it is critical to use ‘syntax’ command programming in order to preserve a record of the tasks you undertake. This is often a bit of a burden to adapt to if it is new to you: probably most people are first taught to use the package using non‐syntax methods (i.e. using the drop down or ‘GUI’ interface). For topics such as those covered in this module, however, it is essential to work with syntax commands, which we’ll give you examples of in the relevant lab session files. You can also find out relevant syntax by setting up commands via the GUI menus then clicking ‘paste’, or by using the SPSS help facilities.

The steps below highlight ways of using SPSS effectively for the purpose of this module. Online introductions to SPSS are available from many other sources, and a list of online tutorials is maintained at: http://www.spsstools.net/spss.htm

[1] We should note that SPSS can also be used as the means of specifying a much wider range of multilevel modelling and other statistical models through its ‘plug‐in’ facility supporting R and Python scripts (see e.g. Levesque and SPSS Inc, 2010, for more on this topic).
Your first steps should always be to open the programme, then immediately open a new or existing syntax file. I advise against double clicking on SPSS format data files as a way to open the package and data simultaneously. This might seem a trivial issue, but actually it makes quite a difference to the replicability and consistency of your analysis.
You can run commands via syntax by highlighting all or part of the line or lines you want to run, then clicking ‘run ‐> selection’, or just ‘ctrl+r’. You can write out the commands yourself; or find their details in the manual; or use the ‘paste’ button to retrieve them from the windows link. The above gives you a manual with documentation on all syntax commands. Alternatively, by clicking the ‘syntax help’ icon whilst the cursor is pointing within a particular command, you can go straight to the specific syntax details of the command you’re looking at (see below).
These images depict using the ‘paste’ option to find relevant syntax (for the crosstabs command). Using ‘paste’, you’ll often get a bit more syntax detail than you bargained for since all the default settings are explicitly shown.
Most often, you’ll start with an existing syntax file and work from it, editing and amending that file. A well organised syntax file should start with metadata and path definitions. By default, the syntax editor has colour coding and other command recognition. I personally prefer to turn off annotation settings on the syntax editor (under ‘tools’ and ‘view’), but others prefer to keep them on.

*Path definitions are important! In the above, at the top of the file we’ve used the ‘define..’ lines to specify macros which SPSS will now recognise as the relevant text. For example, when SPSS reads:

get file=!path3b+"ghs95.sav".

in the above ‘get file’ line, it is actually reading:

get file="c:\data\lda\ghs95.sav".
A few syntax tips:
• New commands begin on a new line.
• Command lines can spread over several lines, but commands must always end with a full stop. When a command line is spread over several lines, it is conventional, though not essential, to indent the 2nd, 3rd etc lines by one or more spaces.
• Any line beginning with a ‘*’ symbol is ignored by the SPSS processor and is regarded as a ‘comment’ (you write comments purely for your own benefit).
• Do your best not to use the drop-down menus! Practice pays off where using syntax is concerned – the more you run in syntax, the more comfortable you’ll be with it.
• It’s good to organise your syntax files into master files and sub-files. The master file can call the sub-files with the ‘include’ command.
Some comments on output from SPSS:
• You can manually edit the content of the output windows (e.g. text, fonts, colours).
• You can paste output directly into another application, either as an image (via ‘paste special’) or as an editable graph or table (in some applications).
• Usually, you’re unlikely to want to save a whole output file (it’s big and isn’t transferable), though you can print output files to pdf.
• Many users don’t save SPSS output itself, but merely save their syntax and transcribe important results from the output as and when they need it.
How to use… MLwiN

MLwiN was first released in 1995 as an easy to use Windows style version of the package ‘MLN’. That package supported the preparation of data and the estimation of multilevel models via a DOS style command interface; MLwiN has since retained the MLN estimation functions, and developed a great many new additional routines.

How to use MLwiN (i): MLwiN the easy way

The standard means of using MLwiN is to open the package, then open data files in its format, and input commands through opening the various windows and choosing options. People most often use the Names and Equations windows here – the latter is especially impressive, allowing you to specify models then run them and review their results in an intuitive interface. The MLwiN user’s guide gives a very accessible description of the windows commands options, as does the LEMMA online course covering MLwiN examples (http://www.cmm.bristol.ac.uk/research/Lemma/). One important addition, however, is that it is very useful to have the ‘command interface’ and ‘output’ windows open in the background. These will give you valuable information on what is going on behind the scenes when you operate the package. Here are some comments and description on using MLwiN in the standard way:

When you first open the MLwiN interface…
…the first thing to do is click ‘data manipulation’ and ‘command interface’, then click ‘output’, then tick ‘include output..’ {Note ‐ These materials are oriented to MLwiN v2.20, but they should be compatible with other versions of the software.} The normal next step is to open up an existing MLwiN-format dataset (‘worksheet’)..
..or else to read in data in plain text format, which requires you to specify how many columns there are in the data. When data is entered, most people open the ‘names’ and ‘equations’ windows as their next step. From the equations window, you can click on terms or add terms to develop your model as you wish.
Important: if you read in a worksheet file, it might already contain some saved model settings which are visible via the equations window. This is the popularity analysis dataset as used in Hox (2010). The terms in the equations window show the model that was ‘left’ in MLwiN at the last save – they show the outcome variable (popular); the explanatory variables (cextravl; csex; ctexp; cextrav*ctexp); the random effects design; and the coefficient estimates for that model. MLwiN supports all sorts of other useful windows and functions; we’ll explore some, but not all, over the module.
How to use MLwiN (ii): MLwiN for serious people

Good practice with statistical analysis involves keeping a replicable log of your work. MLwiN has some very nice features, but it is not well designed to support replication of analysis activities. This is because it is programmed on the presumption that most specifications are made through windows interface options (point-and-click specifications). Most people working with MLwiN specify models, and produce other outputs, via the windows links. This can work well up to a point, but it does make documentation difficult. To see the ‘behind the scenes’ commands that MLwiN is processing when responding to point-and-click instructions, ensure the output window has the box ‘include output from system generated commands’ ticked, from which point you can see results in the background. Replication in MLwiN can be achieved – by building up a command file which will run all the processes you are working on – but this is quite hard work. (An alternative is to set a log of your analysis and run your analysis through point-and-click methods – some people do this, but it generates a very long list of commands that isn’t easily navigated.) Building up a command file is hard because the command language (known as ‘MLN’) is not widely documented. If possible, it is usually better to build up your analysis using a template from another file (e.g. the command files we’ll supply in this module). Also, some features of the programme use rather long-winded command language sequences which are burdensome to write out manually (e.g. in moving through multiple iterations when estimating models). Even so, my advice is to persevere with building up a command file in MLwiN covering the model specification you’re working on. To do this, I recommend writing your file in a separate plain text editor, and invoking it with the ‘obey’ command in MLwiN. This is cumbersome, but tractable. I find that a reasonable way to approach this is to explore and refine models via the windows options, but to make sure that when I settle on an approach, I write it into my command file, and then run the command file again to check that this has worked. The following screens show this being done:
The above screens depict how I would typically have MLwiN configured for my own work. I would have the package open, but simultaneously I would also be editing a command file in another editor, which I would periodically run in full in MLwiN. Looking at the MLwiN session, at various points there are other windows that I’ll need to open, but the four windows shown are probably the most important. At the command interface I can manually input text commands. I sometimes do this, and at other times I use it to run a whole subfile, such as with: obey "d:\26jun\new\prep\prep_mlwin1.mac" This runs the whole of the command file (which I’m simultaneously editing in EditPad Lite) through MLwiN (the file includes a ‘logo’ command which sends the log of the results to a second plain text location). Next, in the output window, I can see the results of the commands I send. This includes results emerging from my macro file and from commands sent through the command interface, and it also includes results arising from windows-based command submissions. For example, in this image, I’ve recently input the commands ‘fixe’ and ‘rand’ at the command interface, with the effect of generating a list of the ‘fixed’ and ‘random’ parameter estimates from the last model run. The other windows shown in MLwiN here are the names window (which shows a list of the variables and scalars currently in memory), and the equations window. The equations window is a very attractive feature of MLwiN through which, firstly, you can see the current model represented statistically, along with its parameter estimates, and, secondly, you can manually edit to adjust the terms of the model. A sketch of the sort of command file involved is given below.
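For orientation, here is a minimal sketch of what such a command (.mac) file might contain. The file paths and variable names are hypothetical; the commands are standard MLN command language (see the Command Manual cited later in this section):

note   Example command file - run with: obey "d:\example\model1.mac"
logo "d:\example\model1.log"
retr "d:\example\example.ws"
resp 'score'
iden 2 'school'
iden 1 'pupil'
expl 1 'cons'
expl 1 'standlrt'
setv 2 'cons'
setv 1 'cons'
star
note   (further 'next' commands may be needed to iterate to convergence)
fixe
rand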
How to use MLwiN (iii): Using ‘runmlwin’

A third mechanism exists for running MLwiN as a plug-in from Stata (see Leckie and Charlton, 2011). For Stata users this is particularly attractive because it offers the convenience of working with your data in Stata, combined with the ability to run complex multilevel models using the more efficient (and wider ranging) estimation routines of MLwiN (compared to the relatively slow routines in Stata). Moreover, ‘runmlwin’ allows you to save most model results within your Stata session in a similar way to that which you would otherwise have achieved from running a model in Stata (i.e. you can subsequently use commands such as ‘est table’ on the basis of model results which were calculated in MLwiN). Using runmlwin involves working in Stata, and requires Stata version 11 and the most recent version of MLwiN. As preliminaries, you must set a macro, specifically called ‘MLwiN_path’, which identifies where MLwiN is installed on your machine; you must then install runmlwin using the typical Stata module installation commands.
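A minimal sketch of these preliminaries (the installation path shown is just an example – point it at wherever mlwin.exe sits on your own machine):

* Tell Stata where the MLwiN executable is installed:
global MLwiN_path "C:\Program Files\MLwiN v2.20\mlwin.exe"
* Install runmlwin from the SSC archive:
ssc install runmlwin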
Running a model with runmlwin uses a fairly intuitive syntax (a sketch is given below). The output includes a brief glimpse of MLwiN, which is opened, used, and closed again. Note that by using the ‘viewmacro’ option we also get to see the underlying MLwiN macro used for this analysis.
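A minimal sketch of a two-level random intercept specification (the dataset and variable names here are hypothetical; note that runmlwin expects an explicitly generated constant term):

* Generate a constant, then fit a two-level random intercept model:
generate cons = 1
runmlwin score cons standlrt, level2(school: cons) level1(pupil: cons) nopause viewmacro

The ‘nopause’ option stops MLwiN from waiting for user input before returning control to Stata.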
A great attraction is the storage of model results from MLwiN as items in Stata. In my opinion, runmlwin is a very exciting new contribution to software for multilevel modelling, and will prove a very valuable tool for anyone with access to both packages!

Some miscellaneous points on working with MLwiN

• Sorting data. Some MLwiN models will only work correctly if the data has been sorted in ascending order of the respective level identifiers. In some circumstances, moreover, it may be necessary to ensure that the identifiers begin at 1 and increase contiguously. Generally, it is easiest to arrange identifiers in this way in another package such as Stata or SPSS before reading the data into MLwiN (see the sketch after the reading suggestions below).
• Macro editor. It is possible to edit a macro within the MLwiN session using the link ‘file -> {open/new} macro’. The image below shows an example of this. However, the macro can only be executed in its entirety (there is no function comparable to the SPSS and Stata syntax editors allowing you to run only a selected subset of lines), and on balance you are likely to find this less convenient than executing a macro which is open in a separate plain text editor.
• Saving worksheets. An important feature of MLwiN is that when a worksheet is saved in MLwiN, as well as saving the underlying data, the software automatically saves the current model settings as part of the worksheet. Opening and reviewing the equations window is usually adequate to reveal any existing model specifications, and model specifications can always be cleared from memory with the ‘clear’ command. To follow ‘best practice’ regarding replication standards, it is ideally preferable to ‘clear’ datasets of model specifications first, then re-specify the model from first principles via your working macro. In an exploratory analysis phase, however, people frequently do exploit this functionality to save their current model settings.
• Reading data from other formats. MLwiN is relatively poorly designed for reading data from other software formats. In general, only plain text data can be easily read in, and you will need to add any metadata (e.g. variable names; variable descriptions) manually yourself. An effective way to do this is to write a data input macro for the relevant files, which contains the metadata specifications you want to make and then saves out to MLwiN data format (*.ws), thus allowing you to start an analytical file from that point. There is an illustration of this process in Practical 2b.

Useful user guides on MLwiN are:

Rasbash, J., Steele, F., Browne, W. J., & Goldstein, H. (2009). A User's Guide to MLwiN, v2.10. Bristol: Centre for Multilevel Modelling, University of Bristol. (pdf available: http://www.cmm.bristol.ac.uk/MLwiN/download/manual-print.pdf)

Rasbash, J., Browne, W. J., & Goldstein, H. (2000). MLwiN 2.0 Command Manual. London: Centre for Multilevel Modelling, Institute of Education, University of London. (pdf available: http://www.cmm.bristol.ac.uk/MLwiN/download/commman20.pdf)

(The former covers windows operation of the package, and the latter covers the syntax command language based upon the earlier MLN package.)
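On the first bullet point above, a minimal sketch of the identifier preparation in Stata (variable names are hypothetical):

* Create a contiguous school identifier starting at 1, then sort by the level identifiers:
egen newschool = group(schoolid)
sort newschool pupilid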
How to use… R

R is a free software package which is a popular tool amongst statisticians and a small community of social science researchers with advanced programming skills. It is an ‘object oriented’ programming language which supports a vast range of statistical analysis routines, and many data management tasks, through its ‘base’ or extension commands. Being ‘object oriented’ is important, and means the package behaves in a rather different way to the other packages described above. The other packages essentially have one principal quantitative dataset in memory at any one time, plus they store metadata on that dataset and typically some other statistical results in the form of other scalars and matrices. In the other packages, commands are automatically applied to the variables of the principal dataset. In R, however, different quantitative datasets (‘data frames’), matrices, vectors, scalars and metadata are all stored as different ‘objects’, potentially alongside each other. R therefore works by first defining objects, and then performing operations on one or many objects, however defined. Some researchers are very enthusiastic about R, the common reasons being that it is free and that it often supports exciting statistical models or functions which aren’t available in other packages. However, my perspective is that R isn’t an efficient package for a social survey researcher interested in applied research, since the programming demands of exploiting it are very high and, because it isn’t widely used in applied research, it hasn’t yet developed robust and helpful routines, working interfaces, or documentation standards to address popular social science data-oriented requirements. An important concept in R is the ‘extension library’, which is how ‘shortcut’ programmes to undertake many routines are supplied. In fact, you will rarely use R without exploiting extension libraries. The idea here is that R has a ‘base’ set of commands and support, and that many user-contributed programmes have been written in that base language. Those extensions typically provide shortcut routes to useful analyses. A few extension libraries in R are specifically designed to support random effects multilevel model estimation – e.g. the nlme and lme4 packages (Bates, 2005; Pinheiro & Bates, 2000) – a sketch follows below. R is installed as freeware, and since it is frequently updated it is wise to revisit the distribution site regularly and re-download.
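As a sketch of what multilevel estimation through an extension library looks like (this assumes the lme4 library has been installed; the data frame and variable names are hypothetical):

# Load the extension library, then fit a two-level random intercept model
library(lme4)
m1 <- lmer(score ~ standlrt + (1 | school), data = mydata)
summary(m1)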
When you open R, you will see something like this ‘R console’ (on my machine I use a ‘.rprofile’ settings file, so my starting display is marginally different from the default). The first few lines show me defining a new object (a short vector) and listing the objects in current memory (see the sketch below). R’s basic help functions point to webpages. The general help pages mostly contain generic information, and are not in general provided with worked examples. Many R users get their help from other online sources, e.g. http://www.statmethods.net/
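A minimal sketch of those first console commands (the object name is arbitrary):

x <- c(2, 4, 6, 8)   # define a new object (a short numeric vector)
ls()                 # list the objects currently in memory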
In general, with R, the first thing you should do is open a new or existing script and work from that. Scripts in R work in a similar way to a syntax file in Stata or SPSS – highlight a line or lines, and press ‘ctrl+R’ to run them.
After running commands, output is sent either to the main console or to a separate graphics window, as in the sketch below.
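For instance (again with a hypothetical object in memory):

summary(x)   # summary statistics print to the main console
hist(x)      # the histogram appears in a separate graphics window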
References cited

Altman, M., & Franklin, C. H. (2010). Managing Social Science Research Data. London: Chapman
and Hall.
Bates, D. M. (2005). Fitting linear models in R using the lme4 package. R News, 5(1), 27-30.
Crouchley, R., Stott, D., & Pritchard, J. (2009). Multivariate generalised linear mixed models via
sabreStata (Sabre in Stata). Lancaster: Centre for e-Science, Lancaster University.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research
Methodology, 9(2), 143-158.
Field, A. (2009). Discovering Statistics using SPSS, Third Edition. London: Sage.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology?
Sociological Methods and Research, 36(2), 153-171.
Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression, 2nd Ed. London: Sage.
Gelman, A., & Hill, J. (2007). Data Analysis using Regression and Multilevel/Hierarchical Models.
Cambridge: Cambridge University Press.
Hox, J. (2010). Multilevel Analysis, 2nd Edition. London: Routledge.
Kulas, J. T. (2008). SPSS Essentials: Managing and Analyzing Social Sciences Data. New York: Jossey-Bass.
Kohler, H. P., & Kreuter, F. (2009). Data Analysis Using Stata, 2nd Edition. College Station, Texas:
Stata Press.
Leckie, G., & Charlton, C. (2011). runmlwin: Running MLwiN from within Stata. Bristol: University of
Bristol, Centre for Multilevel Modelling, http://www.bristol.ac.uk/cmm/software/runmlwin/
[accessed 1.6.2011].
Levesque, R., & SPSS Inc. (2010). Programming and Data Management for IBM SPSS Statistics 18:
A Guide for PASW Statistics and SAS users. Chicago: SPSS Inc.
Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
Muthén, L. K., & Muthén, B. O. (1998-2010). Mplus User's Guide, Sixth Edition. Los Angeles,
California: Muthén and Muthén, and http://www.statmodel.com/
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-Effects Models in S and S-Plus. New York: Springer-Verlag.
Rabe-Hesketh, S., & Skrondal, A. (2008). Multilevel and Longitudinal Modelling Using Stata, Second Edition. College Station, TX: Stata Press.
Rafferty, A., & Watham, J. (2008). Working with survey files: using hierarchical data, matching files
and pooling data. Manchester: Economic and Social Data Service, and
http://www.esds.ac.uk/government/resources/analysis/.
Rasbash, J., Steele, F., Browne, W. J., & Goldstein, H. (2009). A User's Guide to MLwiN, v2.10.
Bristol: Centre for Multilevel Modelling, University of Bristol.
Reardon, S. (2006). HLM: Stata Module to invoke and run HLM v6 software from within Stata. Boston:
EconPapers, Statistical Software Components
(http://econpapers.repec.org/software/bocbocode/s448001.htm).
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. (2004). HLM 6: Hierarchical Linear
and Nonlinear Modelling. Chicago: Scientific Software International.
Spector, P. (2008). Data Manipulation with R (Use R). Amsterdam: Springer.
Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey-Bass.
Twisk, J. W. R. (2006). Applied Multilevel Analysis: A Practical Guide. Cambridge: Cambridge
University Press.
West, B. T., Welch, K. B., & Gałecki, A. T. (2007). Linear Mixed Models. Boca Raton, FL: Chapman and Hall.
Software‐specific recommendations:

Stata
Kohler, H. P., & Kreuter, F. (2009). Data Analysis Using Stata, 2nd Edition. College Station, TX: Stata Press.
Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
Rabe-Hesketh, S., & Skrondal, A. (2008). Multilevel and Longitudinal Modelling Using Stata, Second Edition. College Station, TX: Stata Press.
Web: http://www.longitudinal.stir.ac.uk/Stata_support.html

SPSS
Field, A. (2009). Discovering Statistics using SPSS, Third Edition. London: Sage.
Kulas, J. T. (2008). SPSS Essentials: Managing and Analyzing Social Sciences Data. New York: Jossey-Bass.
Levesque, R., & SPSS Inc. (2010). Programming and Data Management for IBM SPSS Statistics 18: A Guide for PASW Statistics and SAS users. Chicago: SPSS Inc.
Web: http://www.spsstools.net/spss.htm

MLwiN
Rasbash, J., Steele, F., Browne, W. J., & Goldstein, H. (2009). A User's Guide to MLwiN, v2.10. Bristol: Centre for Multilevel Modelling, University of Bristol.
Web: http://www.cmm.bristol.ac.uk/MLwiN/download/manuals.shtml

R
Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression, 2nd Ed. London: Sage.
Gelman, A., & Hill, J. (2007). Data Analysis using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
Spector, P. (2008). Data Manipulation with R (Use R). Amsterdam: Springer.
(There is also a useful guide to using R within SPSS in the body of Levesque & SPSS Inc 2010, cited above.)
Web: http://www.ats.ucla.edu/stat/r/