Historical Corpora

Anke Lüdeling
Humboldt-Universität zu Berlin
Thank you to
Malte Belz, Thomas Krause, Carolin Odebrecht, Laura Perlitz, Felix
Schremmer, Uwe Springmann, Vivian Voigt, Amir Zeldes
and many students
1
goal
We want to build and store historical corpora in a way so that they can be found and analyzed (and
annotated, if necessary) by as many people as possible.
This involves
• clear corpus designs, well-described meta-data, re-usable and standardized formats
• multi-layer annotation in all relevant formats, including normalization on one or more levels
• representation of inter-textuality
• reliable and transparent repositories, ideally embedded in larger structures
• conversion to different formats
• multi-layer search tools
Much will have to be done manually but we want
• evaluation of all of the steps above
• as much computational help as possible (OCR, automatic pre-processing)
2
why?
• how will research with historical corpora change
if we have this environment?
• availability, transparency, replicability
annotation is research
annotated texts can be used for computational linguistics
3
basic models in Laudatio-Annis-SnP world
• a corpus consists of one or several texts; a text can be of any size
• each text can be described by metadata, the corpus as a whole can be described
by metadata; metadata can be about the text and about the corpus construction
• a text can be tokenized in several ways
• each unit (token, span) within a text can be annotated
• annotation can be in different formats; the content of the annotation is not
constrained by the models
• each annotation layer is technically independent of each other annotation layer
____
Zeldes et al. (2009), Zipser & Romary (2010), Odebrecht (2014), Krause & Zeldes (to
appear)
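The model above can be pictured as a small data structure. The following Python sketch is only an illustration: the class and field names (Corpus, Text, Span, tokenizations, layers) are assumptions made for this example and are not the actual data model of Laudatio, ANNIS, or the converter framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """An annotated unit covering tokens start..end (end exclusive) of one tokenization."""
    tokenization: str
    start: int
    end: int
    value: str  # content is free: the model does not constrain it


@dataclass
class Text:
    metadata: dict[str, str]
    # a text can be tokenized in several ways
    tokenizations: dict[str, list[str]] = field(default_factory=dict)
    # each layer is its own list of spans, independent of every other layer
    layers: dict[str, list[Span]] = field(default_factory=dict)

    def annotate(self, layer: str, span: Span) -> None:
        self.layers.setdefault(layer, []).append(span)


@dataclass
class Corpus:
    metadata: dict[str, str]
    texts: list[Text] = field(default_factory=list)


# Hypothetical usage with two tokens from the Ridges example
corpus = Corpus(metadata={"name": "Ridges", "license": "CC-BY"})
text = Text(metadata={"title": "Gart der Gesundheit", "year": "1487", "place": "Ulm"})
text.tokenizations["dipl"] = ["beyfuſz", "wurczel"]
text.annotate("norm", Span("dipl", 0, 1, "beyfusz"))
text.annotate("norm", Span("dipl", 1, 2, "wurczel"))
text.annotate("clean", Span("dipl", 0, 2, "Beifußwurzel"))  # one span over two tokens
corpus.texts.append(text)
```

Because every layer is stored separately, a new layer can be added later without touching the existing ones.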
4
Ridges herbology corpus
• herbal texts from between 1487 and 1870,
29 texts, about 30 pages each
• annotated on many levels
• built in several seminars (and corrected later under the supervision of
Carolin Odebrecht)
• freely available (CC-BY) at the Laudatio repository and at
http://korpling.german.hu-berlin.de/ridges/index_de.html
• examples from:
Johannes von Cuba (Johann Wonneke von Kaub) Gart der Gesundheit,
print from 1487, Ulm
(digitized from a facsimile from Bayerische Staatsbibliothek)
5
normalization
→category annotation
6
diplomatic
• diplomatic versions of historical texts are necessary for many research
questions
ex.: information about the writing conventions
ARthemiſia mater herbaꝝ
7
normalized
• normalized versions of historical texts are necessary for many
research questions
ex.: vocabulary information across space and time
Kräuter → kreüter, krüter, Kraͤut⸗ ter, Kraͤuter, kreutter, Kreuter, …
8
diplomatic and normalized
• it is therefore often useful to have a diplomatic version
and one or more normalized versions of a text
• normalization is categorization, i.e. necessary loss of information
• there can be many ways to normalize a text –
each of them suitable for some research questions but not for others
• in a multi-layer format this is not a problem
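A minimal sketch of why both layers are useful together: once each diplomatic form is aligned with a normalized form, all spelling variants of one word can be looked up through the normalized layer. The function name and the data layout are assumptions made for this illustration; the forms are taken from the Ridges examples in this talk.

```python
from collections import defaultdict

# (diplomatic form, normalized form) pairs
aligned = [
    ("kreüter", "Kräuter"),
    ("krüter", "Kräuter"),
    ("kreutter", "Kräuter"),
    ("Kreuter", "Kräuter"),
    ("halſz", "Hals"),
]


def variants_by_norm(pairs):
    """Group diplomatic spellings under their normalized form."""
    index = defaultdict(set)
    for dipl, norm in pairs:
        index[norm].add(dipl)
    return index


index = variants_by_norm(aligned)
print(sorted(index["Kräuter"]))
```

A different normalization would simply be a second mapping of the same kind, stored next to the first.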
9
normalization: example Ridges
≈ item who carries the root of the
wormwood around his neck
no poisonous animal can harm him
dipl:  Jtem wer […] beyfuſz wurczel an ſeinem halſz tregt kain vergífftíg tíer mag ím nít geſchaden
norm:  Jtem wer der beyfusz wurczel an seinem halsz tregt kain vergifftig tier mag im nit geschaden
clean: Jtem wer der Beifußwurzel an seinem Hals trägt kein giftig Tier mag ihm nicht schaden
10
normalization: example Ridges
• dipl layer: as close to the original as possible
(some decisions necessary)
→ ground truth for OCR
• norm layer: special characters normalized, can be done automatically
• clean layer: words written in the modern German spelling
→ research on variation
→ input for tagging (TreeTagger, Schmid 1994),
this is highly problematic (comparative fallacy, Bley-Vroman, 1983)
 ſelb ſaft  verzeret ycterí cíam
≈ the same juice that/it heals jaundice
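The automatic step from the dipl layer to the norm layer can be sketched as a character mapping. This is only an illustration: the two replacements below are read off the example pairs in this talk (beyfuſz → beyfusz, vergífftíg → vergifftig); the actual Ridges rules may cover more characters, and the function name is an assumption.

```python
import unicodedata

# Assumed mapping: long s -> s, i with acute -> i
CHAR_MAP = str.maketrans({"\u017f": "s", "\u00ed": "i"})


def dipl_to_norm(token: str) -> str:
    """Replace special characters; spelling is otherwise left untouched."""
    # Compose base letter + combining accent first, so both encodings are caught
    composed = unicodedata.normalize("NFC", token)
    return composed.translate(CHAR_MAP)


for dipl in ["beyfu\u017fz", "verg\u00edfft\u00edg", "ge\u017fchaden"]:
    print(dipl, "->", dipl_to_norm(dipl))
```

The step from norm to clean (halsz → Hals, tregt → trägt) is not a character mapping and is not covered by this sketch.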
11
annotation
• many annotation layers are instances of more or less abstract
normalization (close connection to variationist models)
• the layers, categories, and granularity depend on the research
question
• annotation is an iterative process:
one starts with a grammatical model but the annotation categories
and insights change while working with the text
• ideally the annotation reflects the analysis
12
annotation
• just as we wanted different spellings for one 'word', we could want
• different 'words' for one concept (hyperlemma)
sucht - Krankheit 'illness', zehand - alsbald 'soon'
• different words or phrases for one concept
Nierenstein – Stein der in der Blasen wächst und Nieren, Stein in der Blasen oder auch
in den Lenden 'kidney stone'
• different ways to formulate a relative clause
Wer dē beyfuſz beí ím tregt wen̄ er wandert der wírt nít muͤde
≈ who carries wormwood while he is hiking will not become tired
Jtem Stabwurz macht auch wachsen den Bart der langsam hervor kommt
≈ sitherwood will also let grow the beard that comes slowly
Wellíche frawen kínder tragent ˖ díe ſollent eppich ſammen meiden
≈ which women are carrying children – they should avoid celery seeds
• …
13
normalization: target hypotheses
Vn haíſzt darumb arthe miſía ˖ wan̄ der küníg Manfolei ge Arthemíſía /
díe wolt dz díſz kraut nant het aín hauſzfrawē díe híeſz auch alſo genent
würde / vmb tu⸗ gent wíllen díe díſe künígín an dí ſem kraut befand
14
normalization: target hypotheses
as printed:
Vn haíſzt darumb arthe miſía ˖
wan̄ der küníg Manfolei ge
Arthemíſía / díe wolt dz díſz kraut
nant het aín hauſzfrawē díe híeſz
auch alſo genent würde / vmb tu⸗
gent wíllen díe díſe künígín an dí
ſem kraut befand
target hypothesis (third and fourth lines transposed):
Vn haíſzt darumb arthe miſía ˖
wan̄ der küníg Manfolei ge
nant het aín hauſzfrawē díe híeſz
Arthemíſía / díe wolt dz díſz kraut
auch alſo genent würde / vmb tu⸗
gent wíllen díe díſe künígín an dí
ſem kraut befand
≈ and is called arthemisia because the king
called Manfolei (=Mausolus) had a wife
called Arthemisia who wanted this herb to
be called like her because of the virtues
that this queen found in this herb
15
target hypotheses
• the concept stems from learner corpus research; not yet realized in Ridges
(Lüdeling 2008, 2011, Reznicek et al. 2013)
• basic idea:
• if an original text cannot be described by a well-defined grammar
(for whatever reason)
• make minimal changes so that it can be described
• note and analyze the changes
(in this case: transposed lines)
• if possible: hypothesize the reasons for the difference between original and changed version
(in this case: printer's error)
• in this way all information is preserved in a transparent way
• further normalization layers and all other analyses (e.g. syntactic analyses) are
based on the target hypotheses
• again: many decisions necessary (Manfolei)
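The basic idea (make a minimal change, note it, keep the original) can be sketched as follows. Target hypotheses are not yet realized in Ridges, so this is purely illustrative: the Change record and the function transpose_lines are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str
    lines: tuple[int, int]  # 0-based indices of the affected lines
    reason: str             # hypothesized reason for the difference


def transpose_lines(original: list[str], i: int, j: int, reason: str):
    """Return a target hypothesis with lines i and j swapped, plus a record of the change."""
    target = list(original)  # the original stays untouched
    target[i], target[j] = target[j], target[i]
    return target, Change("transposed lines", (i, j), reason)


original = [
    "Vn haíſzt darumb arthe miſía ˖",
    "wan̄ der küníg Manfolei ge",
    "Arthemíſía / díe wolt dz díſz kraut",
    "nant het aín hauſzfrawē díe híeſz",
    "auch alſo genent würde / vmb tu⸗",
    "gent wíllen díe díſe künígín an dí",
    "ſem kraut befand",
]

target, change = transpose_lines(original, 2, 3, "hypothesis: printer's error")
print(change)
```

Both versions and the change record are kept, so later layers can be based on the target hypothesis while the printed text remains available.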
16
annotation
• all of this (and more) is possible in the current models/tools
• why doesn't everybody do it this way?
17
OCR
18
manual and automatic analysis
• the data in many historical corpora are too variable to be annotated
automatically
(spelling variation, syntactic variation, text structure variation;
especially true for diachronic corpora)
• (how) can we use manual annotation of subcorpora for the automatic
analysis of further text?
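One very simple way to reuse manual annotation is to memorize the manually normalized forms and apply them to new text. The sketch below is a naive baseline given for illustration only; it is not a method proposed in this talk, and the function names are assumptions. The training pairs are norm/clean forms from the Ridges example; the unseen token sequence is made up.

```python
from collections import Counter, defaultdict


def train_lexicon(pairs):
    """Map each source form to the target form it was most often annotated with."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def apply_lexicon(lexicon, tokens):
    """Return the normalized tokens and the share of tokens found in the lexicon."""
    out = [lexicon.get(t, t) for t in tokens]  # unknown forms are passed through
    covered = sum(t in lexicon for t in tokens)
    return out, (covered / len(tokens) if tokens else 0.0)


manual = [("halsz", "Hals"), ("tregt", "trägt"), ("kain", "kein"),
          ("tier", "Tier"), ("im", "ihm"), ("nit", "nicht"),
          ("geschaden", "schaden")]
lexicon = train_lexicon(manual)

# Hypothetical unseen tokens
output, coverage = apply_lexicon(lexicon, ["tregt", "nit", "wandert"])
print(output, coverage)
```

The coverage value gives a rough indication of how much of the new text still needs manual work; highly variable spelling keeps it low, which is the problem named above.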
19
OCR
• (early) prints are almost impossible to OCR with traditional methods
• the diplomatic layer of the Ridges corpus can be used as the ground
truth for machine learning in combination with new RNN-based OCR
tools
• pre-processing necessary (cleaning, line alignment)
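With a ground truth available, OCR output can be scored automatically. The slides do not say how the percentages on the following slide were computed; character accuracy based on edit distance is one common choice and is assumed in this sketch. The function names are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ocr_line: str, ground_truth: str) -> float:
    """1 minus the edit distance relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - levenshtein(ocr_line, ground_truth) / len(ground_truth)


# Illustration only: two OCR readings of the same word, neither is a verified ground truth
print(levenshtein("Kreücev", "Kreüter"))
```

Note that this counts Unicode code points, so a letter with a combining diacritic counts as two characters unless both strings are normalized the same way beforehand.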
20
A 1577 Fraktur printing (A. v. Bodenstein)
original page
Abbyy FR 11 (84.65%)
Kreücev
ner erscheinng / vnserer teutscher
zäun oderhagwuryel/ garnicht/
welche der mehrertheil baldierer
fiir rechte Aristolochiam rotundameinsamlend. Diosc. Diser
würget etwas mit wein Myrrhen
vnd Pfeffer getrvncken/ reiniget
die weider von vderssissigem vm
rath dermäter/ treibt auß die an
Lgedurt vn weider menses. Ein
faldgemachtvonndiser würgen
zeitlosenvnanagallidezeüchtvß
spreissel/ dörnvngeschiferte dein.
iZiemitt deschlies ich mein
rede diser zeitvonden zwelffzeichenkreütteren/degären menckltch welle mirs im dessen aufnem
men als dan ichs gethan / had fye
weitleüfftgerdeschridc wellen / f»
sind yegigerkürgeviel vrsachen/
voraußdieweil ichgrosicnkosten
angewendet insüchungder kreü
ter auß eignem willen vndeücrcl/
OCRopus (98.84%;
raw uncorrected output)
Kreüter
ner erſcheinüg/ vnſerer teütſcher
zaun oder hagwurtzel/ gar micht/
welche der mehrertheil balbierer
für rechte Ariſtolochiam rotun⸗
dam einſamlend. Dioſc. Diſer
wurtzel etwas mit wein myrrhen
vnd pfeffer getruncken/ reiniget
die weiber von vberfliſzigem vn⸗
rath der můter/ treibt auſz die an
d geburt vñ weiber menſes. Ein
ſalb gemacht vonn diſer wurtzen
zeitloſen vñ anagallide zeücht vſz
ſpreiſſel/ doͤrnvñ geſchiferte bein.
Hiemitt beſchlies ich mein
rede diſer zeit von den zwelff zei⸗
chen kreütteren/ begaͤren menck⸗
lich welle mirs im beſten aufnem
men als dañ ichs gethan/ hab ſye
weitleüffiger beſchribẽ wellen/ ſo
ſind yetziger kürtze viel vrſach_en/
vorauſz dieweil ich groſſen koſten
angewendet in ſůchung der kreü
ter auſz eignem willen vñ beüt_tel/
ncn⸗
slide by Uwe Springmann
21
table by Felix Schremmer
22
inter-textuality
23
different types of intertextuality
• a text and its translation(s) → parallel corpus
• David Birnbaum:
different manuscripts representing the same text
→ (complicated) parallel corpus
• similar topics across space and time
• text re-use: direct citations, indirect citations, allusions
•…
________
Coffee et al. (2012), Almas & Berti (2013)
24
European developments
• in scientific contexts:
roughly between 1400 and 1700 in many European areas
the lingua franca Latin is replaced by the vernacular languages
25
example: herbal texts in different vernaculars
the early texts in Ridges are compilations (not always direct translations) of
Latin herbal texts ('truth' via authority)
Auícēna ſprícht dz knob lauch beneme vn̄ verdruck díe ge⸗ ſchwulſt des
menſchen in dem leib wo díe ſey
≈ Avicenna says that garlic takes away and heals the tumor of the human in
the body where it is
Serapío ſprí⸗ chet ˖ das ſtabwurcz genűczt ver⸗ zeret vͤberflűſſíg feuchtíkaít díe
ín den daͤrmen ſind dauō aín kranck haít komet genant colíca paſſío
≈ Serapio says that sitherwood consumes superfluous wetness which is in
the intestines, which causes an illness called colica passio.
26
herbal texts in other languages
Czech (herbal text by an unknown monk, 16th century):
Kohož velmi bolí hlava
Ten vezmi černobýl a semenec ozimý a jalovec, vařiž to spolu u víně tak
dlúho, dokud se nebude pukati semenec
[http://vokabular.ujc.cas.cz/moduly/edicni/edice/fa9ef614-5242-4c00-8916-f63d4cfb66ba/plny-text/s-aparatem/folio/43r]
≈ who has a strong headache should take sage brush and the seeds of
winter barley and juniper and boil this together in wine until the seeds
burst
27
28
the same Latin source text
• # Platearius: Liber de simplici medicina dictus circa
instans.# # [….]¶ De absinthio XIX. Absinthium
calidum est in primo gradu. et siccum in secundo.
absinthii duo sunt genera. unum quod dicitur
ponticum: vel quia in ponto insula reperitur: vel
quia ponticum habet saporem: colorem habet
viridem saporem amarissimum. in fine veris
colligitur in umbra exgiccatur. per annum servatur:
reperitur autem subalbidum et minus amarum. et
tale minoris est effectus. ¶ Absinthium dicitur
habere duas contrarias proprietates scilicet
laxativam et constrictivam: constrictivam habet ex
grossitie substantie: et ponticitate: laxativam ex
caliditate: et amaritudine. grossam dicitur habere
substantiam: propter ponticitatem et amaram.
amara enim et pontica grossam dicuntur habere
substantiam quare si interius reciperetur. materia
existente compacta: eam grossicie sua
compactiorem redderet: suaque caliditate quod
humiditatis inerat dissolvendo ab ea. ¶
Díe maiſter ín der ercz⸗ nei ſprechen . das wermůt ſei
haíſz ín dem erſten grad vnd trucken ín dem andern ˖
¶ Platearíus ſpricht das wermut aín wíderwaͤrtíge na
tur an ír hab / wan̄ ſíe laxíert vnd ſtopffet / vnd díe
zway ſind wíder aínander . Vnd darumb ſprícht er das
wermut genűczet ſol werden mít vermíſchung ˖
Pelyňka neb pelynek jest horký na prvním stupni a
suchý na druhým stupni, vyhoní pěnohorku z žaludka
a z třev a přivodí moč a nedá sě opiti. A ktož na každý
den pijí, činí dobrú chut a žádost a jest spomocná
proti žlútennici.
29
research questions
• how did the Latin text influence the compilations?
(lexicon, syntax, morphology, text structure)
• which clear differences between Latin text and vernacular text can be
found? why? ('Differenzbelege', Hinterhölzl 2010)
we need a way of modelling inter-textuality
(pointers with labels from (segments of) one corpus
to (segments of) another corpus)
• this is not yet possible in the Laudatio repository
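The labelled pointers described above could be modelled roughly as follows. This is a sketch of the idea only, since it is not yet possible in the repository: the class names, the token offsets, the corpus name "Latin herbals", and the label value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    corpus: str
    text: str
    start: int  # index of the first token
    end: int    # index after the last token (exclusive)


@dataclass(frozen=True)
class Pointer:
    source: Segment
    target: Segment
    label: str  # type of relation, e.g. translation, citation, allusion


def pointers_from(pointers, corpus, label=None):
    """All pointers that start in the given corpus, optionally filtered by label."""
    return [p for p in pointers
            if p.source.corpus == corpus and (label is None or p.label == label)]


# Hypothetical link from a vernacular passage to its Latin source passage
german = Segment("Ridges", "Gart der Gesundheit", 0, 12)
latin = Segment("Latin herbals", "Circa instans", 0, 10)
links = [Pointer(german, latin, "compilation of")]

print(pointers_from(links, "Ridges"))
```

Keeping the pointers outside both corpora means neither corpus has to be changed when a new relation is added.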
30
summary
• annotation can be research!
• well-understood annotations can be exploited for (useful)
computational modelling
• the model must be extended to allow for the
representation of inter-textuality (and pointers to external resources)
31
wish list
• corpora are built in such a way that the design and format are well-understood
and well-described, and stored in such a way that everybody can
use them (Laudatio!)
• all categorization and research is done in the corpus
• students and researchers know how to do annotation
(tools, guidelines, evaluation)
• it is easy to download, annotate, and resubmit corpus data
32
Danke!
Thank you!
Gratias vobis ago!
33
Laudatio-Workshop 2014
Thomas Krause
Anke Lüdeling
Carolin Odebrecht
Laurent Romary
Peter Schirmbacher
Dennis Zielke
34
We want to build and store historical corpora in a way so that they can be found and
analyzed (and annotated, if necessary) by as many people as possible.
→ Tom Ruette, Tue 9:30-10:30
This involves
• clear corpus designs, well-described meta-data, re-usable and standardized formats
→ Laurent Romary, Tue 4:30-5:30pm
• multi-layer annotation in all relevant formats, including normalization on one or more
levels
• representation of inter-textuality
→ David Birnbaum, Wed 1:30-2:30 pm
• reliable and transparent repositories, ideally embedded in larger structures
→ Carolin Odebrecht, Tue 11am-12pm; Frank Wiegand, Tue 4:30-5:30pm; Maxi Kindling,
Wed 9:30-10:30 am; Dennis Zielke, Ralf Claussnitzer, Dulip Withanage, Wed 9:30-10:30
am; Frank Kühnlenz, Katarzyna Biernacka, Wed 11am-12pm
• conversion to different formats
→ Florian Zipser, Tue 3-4 pm
• multi-layer search tools
→ Thomas Krause, Hagen Hirschmann, Tue 1:30 – 2:30 pm
Much will have to be done manually but we want
• evaluation of all of the steps above
• as much computational help as possible (OCR, automatic pre-processing)
36
organizational information
• registration and coffee breaks in room 3.308 (Haus 3, third floor)
• dinner will be at Ampelmann restaurant;
please register on the list if you want to participate
• the trains and the S-Bahn (Deutsche Bahn) will not be running
between 9 pm tonight and 6 am tomorrow morning
37