Document 179908

Building a Window to a Data Archive
or
How to Provide Access to Large Statistical Data Files
Tom Boggess
Cornell University
Abstract
the VAX/VMS~ operating system; a variety of
VMS and Unix~ based MicroVAXs~; and the Cornell National Supercomputer Facility's IBM 4381
R14 and IBM 3090-600 with attached array proces-
Researchers that use secondary data files, such as
those produced by the Bureau of the Census, face
the problems of obtaining information about the
contents of such files and of formatting these diverse
files for analytical evaluation. A data archive is
charged with the responsibility to maintain and
provide access to the data. An electronic information and data retrieval system would seem an ideal
interface between a data archive and its users.
sors.
Consulting on the use of files in the CISER Data
Archive and computing consulting is provided by
CISER
o 'Mainframe' system developments include:
o eMS Utility Panels (ISPf'M)
o
o OASIS [Online Archive Statistical Information System]
Phase I (Data Retrieval System)
Phase II (Information Retrieval System)
An information subsystem might allow users to
obtain information about the databases by browsing
or entering specific characteristics of interest, obtaining information about which data bases contain such
information, inform the user about the specifics of
each field or variable in the database, and indicate
any problems with merging disparate databases. A
data retrieval subsystem could let the user select the
databases of interest, subset according to case and
The Problems
There is no one standard format for files distributed by governmental and other agencies. The data
files are frequently organized in a hierarchical
fashion with complicated coding schemes and difficult to decipher codebooks. Further, these data are
rarely completely clean and must be checked for accuracy against printed and other sources.
o Many faculty members depend on graduate students to perform computing work and focus on the
results of that work. Training faculty and graduate
students to access data and become computer literate is a major task.
• Due to the eclectic nature of social science research,
there is no way to predict which files, tables, and
cases a researcher will find valuable.
o
field, merge 'common ease' subsets, obtain descrip-
tive statistics on the subsets, and store the results in
a variety of formats. A complete system would integrate the two subsystems. This presentation will
outline completed work for Phase I (the data retrieval subsystem).
Introduction and Background
The Organization:
Cornell Institute for Social & Economic Research KISER)
There are over 400 social science faculty and research associates at Cornell
o 300+ of these social scientists are CISER members.
o CISER maintains a Data Archive of 7000 separate
data files on over 600 reels of tape.
o The CISER computing team supports use of the
SAS'System and SPSS"'" on an IBM" 4381 as well
as general use of IBM PCs and Apple" Macintoshs~.
o CISER also provides proposal submission support
with local (Cornell) peer review.
o A Survey Research Facility fo, collection and entry
of primary data is another CISER service.
o
OASIS Phase I Solution: Data Retrieval
From the user's point of view, OASIS is simple:
choose a database, select observations, keep variables, and process (including reformat and store) the
data. This process is outlined in the figure below.
Welcome
Keep
Variables
The Social Science Computing Environment at Cornell
o
IBM 4381 Q13 (soon R14) running VM/SP CMS
operating system with a minimum of half the CPU
resources, access to several tape drives (allocated
o
dynamically), and 5.0 gigabytes (Gb) of diskspace.
Other hardware available to social scientists on
campus: IBM 3090-200 with CMS and MVS (batch
only); a primarily instructional DEC® 8500 running
496
files. However, one major data management task is
required: placing a standard merge variable, standard identifying codes, and descriptive names that
are equivalent across files that have similar units of
observation. An example for county-based files is to
match county names and Federal Information Processing Standards (FIPS) codes. Each file tends to
have differing descriptive names and PIPS codes
vary over time, so some standards are set to allow
for these differences.
This system is written using REXX~ executable
programs (EXECs) interfacing SAS System files currently stored only on magnetic disk. See section
'Software Selections' for a discussion the choice of
using the SAS System and RExx.
Value Added by the Data Archive Group
The 'behind the scenes' picture is, as expected,
more complex and requires many person hours per
dataset. First, information about the data from the
codebook or data dictionary is entered into a dBase
III Plus™ file. Each record of this database includes
a variable or 'flag" name, the position of the variable
in the file, a SAS System forty character label, a label
that can be as long as 255 characters (for use with the
retrieval software), the record number and column
position of the variable, the variable length, and the
number of decimal places (if appropriate), and the
year the variable covers.
A user can access any installed dataset by entering 'OASIS' and after an introductory screen the
current list of datasets is presented as shown in the
figure below. The list is currently quite short since
efforts have been primarily directed toward the user
interface. There are some sixty files waiting to be installed.
I.
2.
Once checked for duplications and entry errors, a
text file containing primarily INPUT and LABEL
SAS System statements is created and again checked
for accuracy. This file is then uploaded to the mainframe and edited to include relevant Job Control
Language, Data Control Block (DCB) information
(record length, block size, and record format), a
constrained PROC PRINT, a PROC CONTENTS,
and PROC MEANS. MVS is used since these files
tend to be large and work space is more easily obtained in MVS. When the SAS System file is created,
the archive tape information database is updated
and the tape manager is notified.
STF3<A)
COSTftTI
3_
SEASCS)
4.
":1'.
6.
7.
BEA25(S)
FEtlSTC02
P(')Pt1IGa2
POPtH083
.-:----------+
+---If£LP
un)
CHOOSE II
INfO n
INfO n f'IlIMT
cc:o..-.d1ilELPPF (ff2)
~El.P
PFlOC£S:s <rfUI)
QUIT (£IIIT 0tIS!S)
RlIflING
CORNE1...LA
The follOwing figure presents the initial screen the
user would use to select observations. Typically, a
user of the system would require OASIS documentation that would outline the observations and variables available for each dataset. Phase II of this
project will allow a user to browse the contents of
the files by entering areas of interest and being
presented with alternatives. This information system (sometimes called a meta data system) is still on
the drawing board but would use classic computer
science information retrieval system concepts such
as those advanced by Gerard Salton and others'.
There are five additional files that must be created
and uploaded for use in the OASIS System: a variable description file including the variable name,
length, type, decimal, and long label; an observation
file that contains key information associated with the
unit of observation (e.g., PIPS codes for county-level
data); a help file that provides an overview of the
types of variables included including flag information and references on where to obtain more information from printed sources; a help file that summarizes the number and types of cases (observations) to
be found in the data; and an abstract of the file.
These files are documented and stored on a personal
computer, uploaded to the mainframe, and a computing team member is informed and given complete file information.
You "'""'" ",I\osen
ob"""""ti.,1\S of
the "COlItITY ....,..... .-il.. 1I~INC2. PI ...... ind.",..,..
i~ I"l sele.,H"'I .... " err the opt.ions I isl .... bel .....
1_
Z.
3.
4.
S.
IIL1. Ul&IMITIUNS
flU. U.S. ClIUNTIH
fILL N.Y. COUNTIES
stH1fIIIY IWFllflHflTION
PmTICULFIIl tIlllHTl£S C.......... loKU
Value Added by the Computing Group
+----I n
Once the information and data files have been
readied by the Data Archive group, the next step is
to load these files into the OASIS system. Since
OASIS primarily consists of EXECs that incorporate
text and SAS System files, much of the loading
process is editing EXECs to acknowledge the new
;
;
;
HELl' (f'Fl.
HELl' IlBS
fIE11lJIN (fF4)
cct.oone .... H .... n I
(General Help on tIIoosinOj ~iOllS)
•
(Specific:: llelp .... 1lI>_.... <tt.iuns in tbis 1111\<01><1_) 1
(Het...... 'to f'M! .. i _ 1'101 .... )
;
+------------~---
497
--------------------+
Screen Management
The next figure presents the screen viewed by a
user when selecting variables. Again, OASIS documentation would be required to fully utilize the
system. However, the 'long labels' can be accessed
from this screen using the DESCRIBE command.
LIST DF IJAfIlflBLES FUI BSINU
1--FfPSTCl'l
2--1JlB1.E
3-l.1HECOIl£
4-f1E(l1(lH
>-""""""
..........
>--""""""
0--~-
9-85WISO
IB-85HfPISo;I
11--8SF159
12--1Sf'OP59
13--8S1"U'I59
14--El5lE59
1:1--Ei5f'CSI-59
l6-nflllll.J59
17-1l5N£lEi9
111--B5DIRW
1'J-1l5111f'S<J
28--IISWSIlS9
21--IIS()lU-59
:22-115PII159
2l-----r15F:i1l
24---1I5NF:l9
2S---lISFfIIlm9
216-1l5NFttS9
<>"
UfIR 100l..£S ~
2~f!lV59
<le-B'5F'EOO5')
,,......,.,..
29"-B5H 1N59
41-BSHILI-59
42---tl5SUG59'
55----t!5DUI62
>8-8"'"
3fI-s=iCcrt!l-59
31-B5H~
32-B5t1IlNDS9
".....,..""".
~,,,.
3S-II5loIH59
36--85fIE:iO
3?--B5FIFlIE:I<J
38-85SEflV59
39--e5GOU1S9
REXX EXECs using TERMF commands were selected over ISPF (Interactive System Productivity
Facility) panels due to the belief that when invoked,
the SAS System and ISPF would fracture memory
too severely in the three megabyte virtual machine
the user has available. In fact, many conflicts occurred between the EXECs and running of the SAS
System and use of ISPF is being reevaluated with
machines allocated larger virtual memory. Another
43-8S1P162
44--8SHFP162
4:r-fI5F 162
-
..,,---....
,,~
41-B5PU'162
4'J~lti2
54---1r513D&2
-56--1r.i1'll162
51--1!:5F62
5B--II5HF62
59--1l5FfIIltt62
6t!-1i5NF1'162
61--1l5PIIIU62
62--1l5f11!62
63--El5H 1N62
,,~
64-115CClNS62
Sz-..IISI)lf162
6:!i--Il~
solution being considered is to build a service vir-
tual machine that would invoke the SAS System and
return results to the EXEC or panel that did the call.
SAS / AF'" was not used due to nonexistent support
atComell.
------+
KEU' (I"FO
I
I:EEf' ••••• , ... ( ..... ,
I fILL
I (lONE (PF18) I
fi£LP <~..n
; IlESCfll1i£ •••• __ .(.. _U
; $UlEDI ;
;
J.I£l.I> 'I.ftIS
; WHEIII: _ _ (C,'."') _I .... I ;
;
(Uft f'f7,fF8,+,-, .... Df1Dl t.o _"" .. II .... I~I. list)
;
--------+
COHMAHD _>
Descriptive Statistics
RJliNING
Finally, after naming the subset, the SAS System is
invoked and a subset of the data is created. At this
point the user can subset another dataset available to
the system or perform a variety of tasks on the
dataset just created. These tasks include reformatting the data into SPSS', comma delimited, tab delimited, column aligned, etc.; storing the data on
mainframe disk, sending to another user, or downloading the data to a personal computer (using
special terminal emulation software); merging the
data with data that have similar units of observation;
and so on.
OASIS uses PROC UNIVARIATE PLOT, PROC
FREQ (for non-continuous variables), and PROC
MEANS as descriptive statistics for the user. These
are fairly straight forward procedures and no complicated options are necessary. It was felt that these
procedures provided enough of the basic information needed by the user. Within limits, requests for
other descriptive statistics procedures will be honored. Of course, once created and stored, the user
can run any desired procedure independent of
OASIS.
Managing the Process
OA$IS sw.s..b t:r.ated
Teq>
FII _ _
I. SEASIHe
~t
•
FI1.ty~ ~
Oc:Itobo_ Obs
SSI
B51HC2
8ZlEI1P2
2. BEA25El1P SS2
F
6
fInf' (PFI)
lIEU' c~~
I'I£lWtE ..
Hardware Resource Management
Size
U_ ForlllClt (Hb)
66
31
SAS
9.92
f>2
38
SAS
8.93
StQr~
SAS
tlo"g Stat LIS
Disk space is needed for both storage of the data
and for work space. One potential solution to the
central storage concern is to use optical disk technology. CISER is evaluating an optical disk device by
Data/Ware®that will allow hundreds of gigabytes
of data to be stored in IBM 3380 tape format.
Though this is not a solution for work space, centrally stored SAS System files will be relatively
quickly accessible as compared to tape mounts and
much larger datasets can be made accessible.
"" "" ""
I HEIIG£ •••••• ("-") I SRIII£ • c$fISlllG. I
I S1II1S ..
I FIETlIRt ',..4)
I
; oorc;:
(WIf)}
;
: alMRT.
-------
---
RUtttU HG
COfltELLR
Work space is an ongoing problem. It is not
expected that a large number of users will access
OASIS simultaneously, however the system depends
on an adequate supply of temporary disk space
made available for any user request. The amount of
temporary space defined is determined by the size
of the parent database selected. This is done so the
user can be told immediately if there is inadequate
work space to continue. Another approach is to wait
until the actual subsetting request is made and then
estimate the amount of space needed to complete
the task. H no space is available at this point, it is
more disappointing to the user who must make
Software Selections
Database Management
The SAS System was selected as a database
manager due to price; its ability to handle data
stored on disk or tape; the ability to use the SAS
System in CMS (both semi-interactive and 'batch')
and MVS; the savings in disk space that a SAS System file requires as compared to a more traditional
database management system file; familiarity by inhouse programmers; its user population popularity;
and SAS System versatility as both a database manager and statistical package.
another request at another time. For certain very
498
From Concept to Product
large datasets (50 megabytes or larger) pre-formatted, non-temporary disks are kept ready to handle
the data requests without adversely effecting the
temporary disk pool. Of course, this means that
these disks generally sit idle. Still, the savings of
disk space is substantial when compared to conventional database managers.
There are three basic stages in the completion of a
project of this kind: design, testing, and promotion.
One of the first design questions that needs to be
addressed is 'For whom will this software serve?'
This question is not answered in a day or two with
Another real concern is memory management.
.Three megabytes of virtual memory becomes full
and well fractured when one defines, links, and
accesses multiple disks, opens tens of EXEC files and
invokes the SAS System. If virtual machine storage
limits were increased to four megabytes, system
problems could occur since there seems to be a
minimum of eighty virtual machines attached and a
high load day can see as many as 200 active virtual
machines. One solution is to upgrade the CPU and
. increase main memory and paging. CISER and central computing services is considering just such a
plan.
input from a few. Generally, the answers to basic
design questions need to evolve over time. For
ClSER, the basic use of the system is by faculty,
research associates, and their assistants. However,
since research could include Cooperative Extension
faculty a dilemma was created. Would this be a
system to provide well defined descriptive information that could be made available for county extension agents? Or would it be better if the system
offered 'complete" files with no judgement as to
which table or variable needed to be included? It
was agreed that OASIS incorporate complete files,
but it certainly would have changed the nature of
the software had the alternative been chosen. Perhaps more importantly, the design of the system
Personnel
may have floundered without a clear answer to
Coordinating the efforts of six FIEs across two
'departments' with a faculty oversight committee is
a difficult task. Fortunately, CISER remains small
enough to maintain a family-like atmosphere of
these questions.
Another basic design element was to limit the
scope of the project. What OASIS was not designed
to do is as important as the tasks it is designed to accomplish. For the initial project, only county based
data files stored on magnetic disk in SAS System
format were used. Future OASIS Phase I projects
will evaluate the inclusion of hierarchical files (e.g.,
person records nested in family records nested in
household records), files stored in tape format,
processing the requests on separate processors,
merging on selected variables, using expanded
WHERE options, and so on.
cooperation and communication. At the outset the
project, there was more discussion than production
but finally a prototype system was developed and
has become the OASIS system. Achievements must
be recognized and celebrated. It has been important
for workers on this project to have clearly defined
tasks that have reasonable completion dates.
To advance Phase I of the OASIS system from its
current prototype status to a mature system will
require moving the system to ISPF panels, evaluating other database management systems, and incorporating enhancements. An additional systems
programmer will be required for this effort.
Designing a system is more a consort than a solo
with input from a number of sources on desires and
system limitations. It requires a high level of under-
The building of OASIS Phase II (the information
management subsystem) will require continued
evaluation of this type of system and the creation of
intelligent front end software to aid users in selecting information from available data. This project is
being directed from the data archive group with
input from the computing team.
standing of the task at hand, system possibilities and
limitations, software knowledge, database management techniques, knowledge of the data, and an
understanding of what users need and will use. One
of the more challenging aspects of designing a
mainframe-based system is that today's university
researcher is often more accustomed to what a personal computer can do and is generally not enam-
Another often overlooked activity is documentation. For the Phase I prototype, the computing
manager produced an introduction and tutorial
while the data archive provided information about
each data file including 10ng label' descriptions of
ored with the complications of a mainframe operating system. Thus, the system needs to address that
user knowledge base and be logical, comprehensible, and consistent. The words on the screen must
convey the function they perform. [More than one
long discussion has been generated on the relative
merits of 'QUIT', 'STOP', 'EXIT', and 'DONE'.J
Finally, the design must be forward-looking. It must
allow for changes in the hardware and software and
variables, a numeric list of the observations from
which to choose, and an abstract. It is estimated that
a half-time person will be needed to produce more
complete documentation on the use of the system
and the data files it contains.
499
SAS, and SAS/ AF are registered trademarks of SAS
Institute, Inc., Cary, NC, USA
IBM is a registered trademark of International Busi-
be flexible enough to adapt to major changes. For
example, it is expected that the interactive portion of
this system may reside on a personal computer with
calls to the larger processor(s) (via a sophisticated
network) to perform the subsetting, merging and file
transfer. This will need to be a seamless operation.
If the system is not carefully designed and properly
fragmented, it could prove impossible to alter the
system without a complete overhaul of the software.
ness Machines Corporation.
ISPF and REXX are trademarks of International Business Machines Corporation.
Apple is a registered trademark of Apple Computer,
Inc.
Macintosh is a trademark of Apple Computer, Inc.
DEC is a registered trademark of Digital Equipment
Corporation.
VAX/VMS and MicroVAX are trademarks of Digital
Equipment Corporation.
SPSS'is a trademark of SPSS, Inc.
Unix is a trademark of Bell Laboratories.
dBase III Plus is a trademark of Ashton-Tate.
Data/Ware is a registered trademark of Data/Ware
Development, Inc.
Testing the system is not as straight forward as it
may at first seem. Of course. . it is important to find
programming 'bugs', but logical inconsistencies and
an awkward flow or command name may be just as
damaging to the success of the software. Therefore,
it is as important for a group of users as well as programmers to test the system. Finding a willing set of
faculty researchers is not always an easy task since
research agenda and testing of software rarely jive.
For Additional Information Contact:
Promotion of the completed system is as important as its creation. It is important that all concerned
in the creation of the software believe in it and are
willing and able to use it themselves. Promotion of
the system must include workshops, on-going formal presentations, and quick on-the-fly demonstra-
Tom Boggess
CISER, 382 Uris Hall
Cornell UniverSity
Ithaca, New York
14853
tions.
The reason the OASIS system was created was to
reduce the time it took a researcher to produce a
clean, workable dataset. CISER staff had noticed
that it often took months of working with a complicated data file to get it into a format against which
analytical procedures could be run. If success is
achieved, users should immediately recognize the
value of the software and want to use it. It is still too
early to know whether researchers will embrace this
system, but when shown to a set of social science
institute personnel from universities across the
country, the reaction was most favorable.
1
2
A flag is information about variables that relate to suppression of the data due to privacy concerns. . refusals to
answer, and so on.
Gerard Salton, Dynamic Infonnation and Library Proc-
eSSing, Prentice-Hall, Inc., Englewood Cliffs, N.L 1975.
3
Files in OASIS contain all variables (tables) but not
necessarily all observations.
500