Historical Corpora
Anke Lüdeling
Humboldt-Universität zu Berlin
Thank you to Malte Belz, Thomas Krause, Carolin Odebrecht, Laura Perlitz, Felix Schremmer, Uwe Springmann, Vivian Voigt, Amir Zeldes and many students

goal
We want to build and store historical corpora in such a way that they can be found and analyzed (and annotated, if necessary) by as many people as possible. This involves
• clear corpus designs, well-described meta-data, re-usable and standardized formats
• multi-layer annotation in all relevant formats, including normalization on one or more levels
• representation of inter-textuality
• reliable and transparent repositories, ideally embedded in larger structures
• conversion to different formats
• multi-layer search tools
Much will have to be done manually but we want
• evaluation of all of the steps above
• as much computational help as possible (OCR, automatic pre-processing)

why?
• how will research with historical corpora change if we have this environment?
• availability, transparency, replicability
annotation is research
annotated texts can be used for computational linguistics

basic models in Laudatio-Annis-SnP world
• a corpus consists of one or several texts; a text can be of any size
• each text can be described by metadata, and so can the corpus as a whole; metadata can be about the text and about the corpus construction
• a text can be tokenized in several ways
• each unit (token, span) within a text can be annotated
• annotation can be in different formats; the content of the annotation is not constrained by the models
• each annotation layer is technically independent of every other annotation layer
____
Zeldes et al.
(2009), Zipser & Romary (2010), Odebrecht (2014), Krause & Zeldes (to appear)

Ridges herbology corpus
• herbal texts dating from 1487 to 1870; 29 texts, about 30 pages each
• annotated on many levels
• built in several seminars (and corrected later under the supervision of Carolin Odebrecht)
• freely available (CC-BY) at the Laudatio repository and at http://korpling.german.hu-berlin.de/ridges/index_de.html
• examples from: Johannes von Cuba (Johann Wonneke von Kaub), Gart der Gesundheit, print from 1487, Ulm (digitized from a facsimile from Bayerische Staatsbibliothek)

normalization → category annotation

diplomatic
• diplomatic versions of historical texts are necessary for many research questions
ex.: information about the writing conventions
ARthemiſia mater herbaꝝ

normalized
• normalized versions of historical texts are necessary for many research questions
ex.: vocabulary information across space and time
Kräuter → kreüter, krüter, Kraͤut⸗ ter, Kraͤuter, kreutter, Kreuter, …

diplomatic and normalized
• it is therefore often useful to have a diplomatic version and one or more normalized versions of a text
• normalization is categorization, i.e.
necessary loss of information
• there can be many ways to normalize a text – each of them suitable for some research questions but not for others
• in a multi-layer format this is not a problem

normalization: example Ridges
dipl: Jtem wer der beyfuſz wurczel an ſeinem halſz tregt kain vergífftíg tíer mag ím nít geſchaden
norm: Jtem wer der beyfusz wurczel an seinem halsz tregt kain vergifftig tier mag im nit geschaden
clean: Jtem wer der Beifußwurzel an seinem Hals trägt kein giftig Tier mag ihm nicht schaden
≈ item who carries the root of the wormwood around his neck no poisonous animal can harm him

normalization: example Ridges
• dipl layer: as close to the original as possible (some decisions necessary)
→ ground truth for OCR
• norm layer: special characters normalized, can be done automatically
• clean layer: words written in the modern German spelling
→ research on variation
→ input for tagging (TreeTagger, Schmid 1994); this is highly problematic (comparative fallacy, Bley-Vroman 1983)
ſelb ſaft verzeret ycterí cíam
≈ the same juice that/it heals jaundice

annotation
• many annotation layers are instances of more or less abstract normalization (close connection to variationist models)
• the layers, categories, and granularity depend on the research question
• annotation is an iterative process: one starts with a grammatical model but the annotation categories and insights change while working with the text
• ideally the annotation reflects the analysis

annotation
• just as we wanted different spellings for one 'word', we could want
• different 'words' for one concept (hyperlemma)
sucht - Krankheit 'illness', zehand - alsbald 'soon'
• different words or phrases for one concept
Nierenstein – Stein der in der Blasen wächst und Nieren, Stein in der Blasen oder auch in den Lenden 'kidney stone'
• different ways to formulate a relative clause
Wer dē beyfuſz beí ím tregt wen̄ er wandert der wírt nít muͤde
≈ who carries wormwood while he is hiking will not become tired
Jtem
Stabwurz macht auch wachsen den Bart der langsam hervor kommt
≈ sitherwood will also let grow the beard that comes slowly
Wellíche frawen kínder tragent ˖ díe ſollent eppich ſammen meiden
≈ which women are carrying children – they should avoid celery seeds
• …

normalization: target hypotheses
as printed:
Vn haíſzt darumb arthe miſía ˖ wan̄ der küníg Manfolei ge
Arthemíſía / díe wolt dz díſz kraut
nant het aín hauſzfrawē díe híeſz
auch alſo genent würde / vmb tu⸗ gent wíllen díe díſe künígín an dí ſem kraut befand
target hypothesis (second and third line exchanged):
Vn haíſzt darumb arthe miſía ˖ wan̄ der küníg Manfolei ge
nant het aín hauſzfrawē díe híeſz
Arthemíſía / díe wolt dz díſz kraut
auch alſo genent würde / vmb tu⸗ gent wíllen díe díſe künígín an dí ſem kraut befand
≈ and is called arthemisia because the king called Manfolei (=Mausolus) had a wife called Arthemisia who wanted this herb to be called like her because of the virtues that this queen found in this herb

target hypotheses
• the concept stems from learner corpus research; not yet realized in Ridges (Lüdeling 2008, 2011, Reznicek et al. 2013)
• basic idea:
• if an original text cannot be described by a well-defined grammar (for whatever reason)
• make minimal changes so that it can be described
• note and analyze the changes (in this case: transposed lines)
• if possible: hypothesize the reasons for the difference between original and changed version (in this case: printer's error)
• in this way all information is preserved in a transparent way
• further normalization layers and all other analyses (e.g.
syntactic analyses) are based on the target hypotheses
• again: many decisions necessary (Manfolei)

annotation
• all of this (and more) is possible in the current models/tools
• why doesn't everybody do it this way?

OCR

manual and automatic analysis
• the data in many historical corpora are too variable to be annotated automatically (spelling variation, syntactic variation, text structure variation; especially true for diachronic corpora)
• (how) can we use manual annotation of subcorpora for the automatic analysis of further text?

OCR
• (early) prints are almost impossible to OCR with traditional methods
• the diplomatic layer of the Ridges corpus can be used as the ground truth for machine learning in combination with new RNN-based OCR tools
• pre-processing necessary (cleaning, line alignment)

A 1577 Fraktur printing (A. v. Bodenstein)
original page (image not reproduced)
Abbyy FR 11 (84.65%)
Kreücev ner erscheinng / vnserer teutscher zäun oderhagwuryel/ garnicht/ welche der mehrertheil baldierer fiir rechte Aristolochiam rotundameinsamlend. Diosc. Diser würget etwas mit wein Myrrhen vnd Pfeffer getrvncken/ reiniget die weider von vderssissigem vm rath dermäter/ treibt auß die an Lgedurt vn weider menses. Ein faldgemachtvonndiser würgen zeitlosenvnanagallidezeüchtvß spreissel/ dörnvngeschiferte dein. iZiemitt deschlies ich mein rede diser zeitvonden zwelffzeichenkreütteren/degären menckltch welle mirs im dessen aufnem men als dan ichs gethan / had fye weitleüfftgerdeschridc wellen / f» sind yegigerkürgeviel vrsachen/ voraußdieweil ichgrosicnkosten angewendet insüchungder kreü ter auß eignem willen vndeücrcl/
OCRopus (98.84%; raw uncorrected output)
Kreüter ner erſcheinüg/ vnſerer teütſcher zaun oder hagwurtzel/ gar micht/ welche der mehrertheil balbierer für rechte Ariſtolochiam rotun⸗ dam einſamlend. Dioſc. Diſer wurtzel etwas mit wein myrrhen vnd pfeffer getruncken/ reiniget die weiber von vberfliſzigem vn⸗ rath der můter/ treibt auſz die an d geburt vñ weiber menſes.
Ein ſalb gemacht vonn diſer wurtzen zeitloſen vñ anagallide zeücht vſz ſpreiſſel/ doͤrnvñ geſchiferte bein. Hiemitt beſchlies ich mein rede diſer zeit von den zwelff zei⸗ chen kreütteren/ begaͤren menck⸗ lich welle mirs im beſten aufnem men als dañ ichs gethan/ hab ſye weitleüffiger beſchribẽ wellen/ ſo ſind yetziger kürtze viel vrſach_en/ vorauſz dieweil ich groſſen koſten angewendet in ſůchung der kreü ter auſz eignem willen vñ beüt_tel/ ncn⸗
slide by Uwe Springmann

table by Felix Schremmer

inter-textuality

different types of intertextuality
• a text and its translation(s) → parallel corpus
• David Birnbaum: different manuscripts representing the same text → (complicated) parallel corpus
• similar topics across space and time
• text re-use: direct citations, indirect citations, allusions
• …
________
Coffee et al. (2012), Almas & Berti (2013)

European developments
• in scientific contexts: roughly between 1400 and 1700, in many European areas the lingua franca Latin is replaced by the vernacular languages

example: herbal texts in different vernaculars
the early texts in Ridges are compilations (not always direct translations) of Latin herbal texts ('truth' via authority)
Auícēna ſprícht dz knob lauch beneme vn̄ verdruck díe ge⸗ ſchwulſt des menſchen in dem leib wo díe ſey
≈ Avicenna says that garlic takes away and heals the tumor of the human in the body where it is
Serapío ſprí⸗ chet ˖ das ſtabwurcz genűczt ver⸗ zeret vͤberflűſſíg feuchtíkaít díe ín den daͤrmen ſind dauō aín kranck haít komet genant colíca paſſío
≈ Serapio says that sitherwood consumes superfluous wetness which is in the intestines, which causes an illness called colica passio.
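
Comparing the wording of such quotations across texts is easier on a normalized layer than on the diplomatic one. The following is a minimal sketch, not the Ridges pipeline: it assumes that a norm-style layer only replaces the two special characters attested in the Ridges example (ſ → s, í → i). The names NORM_TABLE, to_norm and norm_layer are hypothetical, and abbreviation marks such as ē or n̄ are left unresolved because the slides do not say how they are treated.

```python
import unicodedata

# Hypothetical character table for a norm-style layer. Only the two
# replacements attested in the Ridges example are listed:
# U+017F (long s) -> s and U+00ED (i with acute) -> i.
NORM_TABLE = str.maketrans({"\u017f": "s", "\u00ed": "i"})


def to_norm(dipl_token: str) -> str:
    """Return a norm-style form of one diplomatic token."""
    # Compose characters first so that "i" + combining acute is also matched.
    composed = unicodedata.normalize("NFC", dipl_token)
    return composed.translate(NORM_TABLE)


def norm_layer(dipl_tokens):
    """Pair each diplomatic token with its norm form; both layers are kept."""
    return [(token, to_norm(token)) for token in dipl_tokens]


# Tokens quoted from the diplomatic example above.
quotation = "Auícēna ſprícht dz knob lauch beneme".split()
for dipl, norm in norm_layer(quotation):
    print(f"{dipl}\t{norm}")
```

The diplomatic token is never overwritten: the function returns pairs, in line with the idea that each normalization is one more layer on top of the original.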

herbal texts in other languages
Czech (herbal text by an unknown monk, 16th century):
Kohož velmi bolí hlava
Ten vezmi černobýl a semenec ozimý a jalovec, vařiž to spolu u víně tak dlúho, dokud se nebude pukati semenec
[http://vokabular.ujc.cas.cz/moduly/edicni/edice/fa9ef614-5242-4c00-8916-f63d4cfb66ba/plny-text/s-aparatem/folio/43r]
≈ who has a strong headache should take sage brush and the seeds of winter barley and juniper and boil this together in wine until the seeds burst

the same Latin source text
• Platearius: Liber de simplici medicina dictus circa instans. [….]
¶ De absinthio XIX. Absinthium calidum est in primo gradu. et siccum in secundo. absinthii duo sunt genera. unum quod dicitur ponticum: vel quia in ponto insula reperitur: vel quia ponticum habet saporem: colorem habet viridem saporem amarissimum. in fine veris colligitur in umbra exgiccatur. per annum servatur: reperitur autem subalbidum et minus amarum. et tale minoris est effectus. ¶ Absinthium dicitur habere duas contrarias proprietates scilicet laxativam et constrictivam: constrictivam habet ex grossitie substantie: et ponticitate: laxativam ex caliditate: et amaritudine. grossam dicitur habere substantiam: propter ponticitatem et amaram. amara enim et pontica grossam dicuntur habere substantiam quare si interius reciperetur. materia existente compacta: eam grossicie sua compactiorem redderet: suaque caliditate quod humiditatis inerat dissolvendo ab ea.

¶ Díe maiſter ín der ercz⸗ nei ſprechen . das wermůt ſei haíſz ín dem erſten grad vnd trucken ín dem andern ˖ ¶ Platearíus ſpricht das wermut aín wíderwaͤrtíge na tur an ír hab / wan̄ ſíe laxíert vnd ſtopffet / vnd díe zway ſind wíder aínander . Vnd darumb ſprícht er das wermut genűczet ſol werden mít vermíſchung ˖

Pelyňka neb pelynek jest horký na prvním stupni a suchý na druhým stupni, vyhoní pěnohorku z žaludka a z třev a přivodí moč a nedá sě opiti. A ktož na každý den pijí, činí dobrú chut a žádost a jest spomocná proti žlútennici.
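
Passages like the three above can be linked segment by segment. The following is a minimal sketch, assuming that inter-textual links are stored as labelled pointers between segments of different corpora. The class names, the corpus identifiers and the label "renders" are hypothetical and are not part of any Laudatio or ANNIS format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    corpus: str  # hypothetical corpus identifier
    text: str    # the segment itself, quoted from the passages above


@dataclass(frozen=True)
class Pointer:
    source: Segment
    target: Segment
    label: str   # hypothetical label; the slides do not define a label set


latin = Segment(
    "latin-platearius",
    "Absinthium calidum est in primo gradu. et siccum in secundo.",
)
german = Segment(
    "german-herbal",
    "das wermůt ſei haíſz ín dem erſten grad vnd trucken ín dem andern",
)
czech = Segment(
    "czech-herbal",
    "jest horký na prvním stupni a suchý na druhým stupni",
)

# Each vernacular segment points to the Latin segment it corresponds to.
pointers = [
    Pointer(german, latin, "renders"),
    Pointer(czech, latin, "renders"),
]


def pointing_to(target, all_pointers, label=None):
    """Return the source segments of all pointers that end at `target`."""
    return [
        p.source
        for p in all_pointers
        if p.target == target and (label is None or p.label == label)
    ]


for segment in pointing_to(latin, pointers):
    print(segment.corpus, "->", latin.corpus)
```

Keeping the pointers in a separate list, outside the segments, mirrors the idea that each annotation layer is technically independent of the others: links can be added or relabelled without touching either corpus.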

research questions
• how did the Latin text influence the compilations? (lexicon, syntax, morphology, text structure)
• which clear differences between Latin text and vernacular text can be found? why? ('Differenzbelege', Hinterhölzl 2010)
we need a way of modelling inter-textuality (pointers with labels from (segments of) one corpus to (segments of) another corpus)
• this is not yet possible in the Laudatio repository

summary
• annotation can be research!
• well-understood annotations can be exploited for (useful) computational modelling
• the model must be extended to allow for the representation of inter-textuality (and pointers to external resources)

wish list
• corpora are built so that their design and format are well understood and well described, and stored so that everybody can use them (Laudatio!)
• all categorization and research is done in the corpus
• students and researchers know how to do annotation (tools, guidelines, evaluation)
• it is easy to download, annotate, and resubmit corpus data

Danke! Thank you! Gratias vobis ago!

Laudatio-Workshop 2014
Thomas Krause
Anke Lüdeling
Carolin Odebrecht
Laurent Romary
Peter Schirmbacher
Dennis Zielke

We want to build and store historical corpora in such a way that they can be found and analyzed (and annotated, if necessary) by as many people as possible.
→ Tom Ruette, Tue 9:30-10:30
This involves
• clear corpus designs, well-described meta-data, re-usable and standardized formats
→ Laurent Romary, Tue 4:30-5:30pm
• multi-layer annotation in all relevant formats, including normalization on one or more levels
• representation of inter-textuality
→ David Birnbaum, Wed 1:30-2:30 pm
• reliable and transparent repositories, ideally embedded in larger structures
→ Carolin Odebrecht, Tue 11am-12pm; Frank Wiegand, Tue 4:30-5:30pm; Maxi Kindling, Wed 9:30-10:30 am; Dennis Zielke, Ralf Claussnitzer, Dulip Withanage, Wed 9:30-10:30 am; Frank Kühnlenz, Katarzyna Biernacka, Wed 11am-12pm
• conversion to different formats
→ Florian Zipser, Tue 3-4 pm
• multi-layer search tools
→ Thomas Krause, Hagen Hirschmann, Tue 1:30 – 2:30 pm
Much will have to be done manually but we want
• evaluation of all of the steps above
• as much computational help as possible (OCR, automatic pre-processing)

organizational information
• registration and coffee breaks in room 3.308 (Haus 3, third floor)
• dinner will be at Ampelmann restaurant; please register on the list if you want to participate
• the trains and the S-Bahn (Deutsche Bahn) will not be running between 9 pm tonight and 6 am tomorrow morning