Bibliographic Metadata and HathiTrust

HATHITRUST
A Shared Digital Repository
Bibliographic Metadata
and HathiTrust
ALCTS CaMMS Catalog Management Interest Group Meeting
American Library Association MidWinter Convention
Philadelphia, Pennsylvania, January 25, 2014
Jon Rothman, Head, Library Systems Office, University of Michigan
[email protected]
HathiTrust Mission
To contribute to the common good by
collecting, organizing, preserving,
communicating, and sharing the record of
human knowledge.
HathiTrust Background
• Launched in 2008 by the libraries of the
Committee on Institutional Cooperation (CIC)
and the University of California System.
• Initial focus on digitized book and journal
content
– 10,922,113 total volumes
– 3,563,589 public domain (~33%)
• Currently 91 partner institutions and
continuing to grow.
Partnership
Allegheny College
Arizona State University
Baylor University
Boston College
Boston University
Brandeis University
Brown University
California Digital Library
Carnegie Mellon University
Colby College
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University
Getty Research Institute
Harvard University Library
Indiana University
Iowa State University
Johns Hopkins University
Kansas State University
Lafayette College
Library of Congress
Massachusetts Institute of
Technology
McGill University
Michigan State University
New York Public Library
New York University
North Carolina Central
University
North Carolina State
University
Northwestern University
The Ohio State University
The Pennsylvania State
University
Princeton University
Purdue University
Stanford University
Syracuse University
Temple University
Texas A&M University
Tufts University
Universidad Complutense
de Madrid
University of Alabama
University of Alberta
University of Arizona
University of British Columbia
University of Calgary
University of California
Berkeley
Davis
Irvine
Los Angeles
Merced
Riverside
San Diego
San Francisco
Santa Barbara
Santa Cruz
The University of Chicago
University of Connecticut
University of Delaware
University of Florida
University of Houston
University of Illinois
University of Illinois at
Chicago
The University of Iowa
University of Kansas
University of Maryland
University of Massachusetts,
Amherst
University of Miami
University of Michigan
University of Minnesota
University of Missouri
University of Nebraska-Lincoln
The University of North
Carolina at Chapel Hill
University of Notre Dame
University of Oklahoma
University of Pennsylvania
University of Pittsburgh
University of Queensland
University of Tennessee,
Knoxville
University of Utah
University of Vermont
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Vanderbilt University
Virginia Tech
Wake Forest University
Washington University
Yale University Library
Where does HathiTrust’s bibliographic
metadata come from?
• Bibliographic metadata is provided by depositors of
digital content.
• Metadata must be supplied to HathiTrust before
ingest of digital content can occur.
• The metadata is used in several ways, including
– To act as a manifest of the materials being deposited.
– To identify and track records to their contributor.
– To help in making an initial rights determination about
each volume.
Minimal metadata specifications for
deposited records
• Valid MARC binary or MARCXML structure
• Valid leader and 008
• 245 $$a (or $$k where appropriate)
• A 955 field describing a single item
– Item identifier (usually barcode)
– Item description (enumeration/chronology) for
multi-volume works
• OCLC Number (strongly preferred)
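The checklist above can be sketched as a small validator. This is a minimal illustration using only the Python standard library, not HathiTrust's actual ingest check. It assumes one MARCXML `<record>` per string, treats "valid leader and 008" as a length check only, and assumes the item identifier sits in 955 $b and the OCLC number in 035 $a with an `(OCoLC)` prefix. None of those subfield choices are stated in the slides, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}  # standard MARCXML namespace


def subfield_values(record, tag, code):
    """Non-empty values of one subfield code across all fields with this tag."""
    path = f"m:datafield[@tag='{tag}']/m:subfield[@code='{code}']"
    return [sf.text.strip() for sf in record.findall(path, NS)
            if sf.text and sf.text.strip()]


def check_minimal_record(marcxml):
    """Return (errors, warnings) for a single MARCXML <record> string."""
    try:
        record = ET.fromstring(marcxml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"], []

    errors, warnings = [], []

    # Length checks only: a MARC 21 leader is 24 characters, an 008 is 40.
    leader = record.findtext("m:leader", default="", namespaces=NS)
    if len(leader) != 24:
        errors.append("leader missing or not 24 characters")
    field_008 = record.findtext("m:controlfield[@tag='008']", default="",
                                namespaces=NS)
    if len(field_008) != 40:
        errors.append("008 missing or not 40 characters")

    if not (subfield_values(record, "245", "a")
            or subfield_values(record, "245", "k")):
        errors.append("245 has neither $a nor $k")

    # Assumption: the item identifier (usually a barcode) is in 955 $b.
    if not subfield_values(record, "955", "b"):
        errors.append("no 955 with an item identifier")

    # Assumption: OCLC numbers appear in 035 $a with an (OCoLC) prefix.
    # Missing OCNs are only a warning because they are "strongly preferred".
    if not any(v.startswith("(OCoLC)")
               for v in subfield_values(record, "035", "a")):
        warnings.append("no OCLC number found (strongly preferred)")

    return errors, warnings
```

Returning errors and warnings separately mirrors the slide's distinction between required elements and the strongly preferred OCLC number.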
Duplicate detection
• Simple identifier match at bibliographic level,
using OCLC numbers.
• OCNs are the most ubiquitous and unique identifiers
in the records, but there are issues…
• Records without OCNs
– Some partners didn’t have OCNs in any of their records
– Some have had them in many, but not all, of their records
• Differences in OCN location, prefixes, etc. in records
• Different OCNs for same item.
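A simple identifier match only works once the prefix and location differences noted above are normalized away. The sketch below shows one way that could look. It is an illustration rather than the matching code HathiTrust runs. The prefixes handled (`(OCoLC)`, `ocm`, `ocn`, `on`) and the stripping of leading zeros are assumptions based on common OCLC number conventions, and `normalize_ocn` and `group_by_ocn` are hypothetical names.

```python
import re
from collections import defaultdict

# Optional "(OCoLC)" label, optional ocm/ocn/on prefix, leading zeros, digits.
_OCN_RE = re.compile(
    r"^\s*(?:\(ocolc\))?\s*(?:ocm|ocn|on)?\s*0*([1-9]\d*)\s*$",
    re.IGNORECASE,
)


def normalize_ocn(raw):
    """Return the bare OCLC number as a digit string, or None if unrecognized."""
    match = _OCN_RE.match(raw)
    return match.group(1) if match else None


def group_by_ocn(records):
    """Group record ids by normalized OCN.

    records: iterable of (record_id, [raw OCN strings]).
    Returns (groups, unmatched), where unmatched lists records with no usable OCN.
    """
    groups = defaultdict(list)
    unmatched = []
    for record_id, raw_ocns in records:
        ocns = {n for n in map(normalize_ocn, raw_ocns) if n}
        if not ocns:
            unmatched.append(record_id)
        for ocn in sorted(ocns):
            groups[ocn].append(record_id)
    return dict(groups), unmatched
```

Normalization addresses the prefix and formatting differences only. Records without OCNs end up in `unmatched`, and records carrying different OCNs for the same item still land in separate groups.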
HathiTrust metadata management
• Where
– HathiTrust bibliographic metadata was managed in
the University of Michigan’s Aleph LMS from 2008
until…
– Zephir, a dedicated HathiTrust metadata management
system developed by California Digital Library,
launched in production in early December, 2013.
• Underlying principle
– Records supplied to HathiTrust are not considered
definitive.
– Definitive record lives in the source institution’s own
system and/or WorldCat.
Zephir Functionality
• Keeps all versions of records received from
depositors.
– OCLC number still used for duplicate detection
– Records are clustered rather than merged.
• A weighting algorithm determines the best
bibliographic record in each cluster.
– The best record, together with item-level data for all
ingested items attached to that cluster, is selected
for output.
• Provides a daily output of new/changed records.
Records where none of the associated digital
items have been ingested yet are not included.
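The cluster-then-select behavior described above can be illustrated in a few lines. This is a sketch under stated assumptions, not Zephir's implementation. The slides do not describe the weighting algorithm, so the `score` field stands in for whatever weight it would assign. `BibRecord` and `select_for_output` are hypothetical names, and detecting which records are new or changed since the last export is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BibRecord:
    contributor: str
    ocn: str              # normalized OCLC number used as the cluster key
    score: float          # hypothetical stand-in for the weighting algorithm
    item_ids: list[str]   # item identifiers described by this record


def select_for_output(records, ingested_ids):
    """Cluster records by OCN and pick one record per cluster for output.

    Returns {ocn: (best_record, ingested_item_ids)}. Clusters with no
    ingested items are left out of the result.
    """
    clusters = defaultdict(list)
    for record in records:
        clusters[record.ocn].append(record)  # clustered, never merged

    output = {}
    for ocn, members in clusters.items():
        items = sorted({item for record in members
                        for item in record.item_ids
                        if item in ingested_ids})
        if not items:
            continue  # nothing ingested yet for this cluster
        best = max(members, key=lambda record: record.score)
        output[ocn] = (best, items)
    return output
```

Because every contributed record stays in its cluster, the best record can change later, for example when a higher-scoring record arrives, without losing anything that was submitted earlier.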
Record correction and update
• General policy is not to correct or update the
content of contributors’ records.
• In most cases, contributors are asked to correct
and re-submit records with observed metadata
errors or issues.
• When it’s necessary for a correction to happen
quickly:
– A corrected “shadow record” is created in Zephir and
temporarily takes the place of the contributor record
in outputs.
– Contributor is asked to submit a corrected record.
When corrected record is received, the shadow record
is removed.
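The shadow-record workflow amounts to a temporary override. The sketch below models it with a hypothetical `ContributorSlot` class, and nothing in it reflects Zephir's real data model. It makes two simplifying assumptions: any resubmission from the contributor is treated as the corrected record, and only the latest contributor record is kept, whereas Zephir keeps all versions.

```python
class ContributorSlot:
    """One contributor's record, with an optional temporary shadow record."""

    def __init__(self, contributor_record):
        self.contributor_record = contributor_record
        self.shadow_record = None

    def add_shadow(self, corrected_record):
        """Install a locally corrected copy that overrides output for now."""
        self.shadow_record = corrected_record

    def receive_resubmission(self, record):
        """Store the contributor's corrected record and drop the shadow."""
        self.contributor_record = record
        self.shadow_record = None

    def record_for_output(self):
        """The shadow record wins while it exists."""
        if self.shadow_record is not None:
            return self.shadow_record
        return self.contributor_record
```

For example, a slot created with a faulty record outputs that record until `add_shadow` is called, then outputs the shadow, and returns to the contributor's copy once `receive_resubmission` is called.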
[Diagram: HathiTrust metadata flow. Components shown: Contributor Bibliographic Records; HathiTrust Metadata Management (Zephir); Zephir daily export; Metadata about newly-loaded records; HathiTrust Access Processing; Rights DB; Identifiers of ingested objects; HathiTrust Ingest Framework (Feed); Digital Object Repository; OAI; Hathifiles; Bib API; HathiTrust Catalog; Catalog + Full Text; Individual library catalogs, etc.; WorldCat.]