The CORPUSCLE corpus search system

Paul Meurer
Uni Research Computing

Corpuscle

Corpuscle: a corpus management and analysis system for annotated corpora.
•  Newly developed corpus query engine
•  Comparable to Corpus Workbench
•  Coding of hierarchical structures, structured attributes
•  Integrated Web interface
•  Concordances, collocations, word and co-occurrence/distribution statistics
•  Parallel corpus support
•  Searchable user annotations
•  Multimedia support
•  CLARIN CMDI metadata
•  AAI, federated authentication
•  REST/JSON API (useful e.g. for interfacing with R)

Highlights
•  Searchable user annotations
•  Multi-valued and set-valued attributes
•  Coding of multi-word expressions (and more)
•  Corpus text upload
•  CLARIN Federated Content Search endpoint
•  REST/JSON based API

Attributes

Multi-valued attributes:
•  useful for annotation that is not fully disambiguated
•  transparent to the user: queried in the same way as single-valued attributes
•  possible to query for single-valuedness
Set-valued attributes:
•  used for grammatical feature sets
•  can be queried efficiently, with special syntax
Combinations of both:
•  e.g., feature set annotation that is not fully disambiguated

Multi-valued attributes

Ex.: word = fisker, lemma = fisk | fiske

Possible queries:               does match?
•  [ lemma = "fisk" ]           yes
•  [ lemma == "fisk" ]          no
•  [ lemma != "fisk" ]          no
•  [ lemma !!= "fisk" ]         yes
•  [ lemma == "fisk.*" ]        ? (no)

Set-valued attributes

Useful to code grammatical feature sets, or other types of non-atomic annotations.
An attribute can be set-valued and multi-valued at the same time:
•  word = fisker
•  lemma = fisk|fisker
•  morph = ( N m pl ) | ( N m sg )
Example queries with boolean expression syntax:
•  [ morph = ("N" "sg") ]
•  [ morph = ("N" "pl" | "A" !"sup") ]
The reversed index is implemented as a suffix array, which makes this type of query very natural and efficient. No complicated regular expressions have to be evaluated.

MWEs

MWE: Multi-Word Expression
MWEs are difficult to handle because of conflicting needs:
•  We want to treat a MWE as a unit (single lemma form, set of grammatical features)
   –  e.g., “i dag” should be treated as an Adv, not a PP and an N; “Rio de Janeiro” is one place name.
•  But we want to be able to search in a uniform way, without knowing in advance which words are MWEs and which are not, at least on the token level.
•  Counts have to be correct

MWEs: Solution

•  Every word of a MWE is a separate token
•  Lemma and features span the whole MWE
•  Counts always relate to corpus positions

Annotation spans

Annotation spans as an extension of MWE coding (under development):
•  MWEs: An attribute value can have a span
•  Extension: multiple values with multiple spans; directed spans (edges)
•  This allows coding of, e.g., dependency relations, coreference chains, and more

Corpus Text Upload

Users can upload their texts via a Web form to build a corpus
•  Plain text (UTF-8)
•  XML text
Three steps:
•  Corpus definition
•  Text upload
•  Indexing
The corpus is usable right away.
Plans: include annotation workflow (e.g., LAP)

Federated Content Search

CLARIN-FCS
•  Goal: search in heterogeneous, geographically distributed resources in a unified manner
•  Query language: CQL – Contextual Query Language
•  Protocol: SRU – Search and Retrieve via URL
•  Return format: XML (adhering to the CLARIN-CQL schema, extensible)
•  Operations: explain, (scan,) searchRetrieve
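Returning to the multi-valued lemma example (word = fisker, lemma = fisk | fiske): one reading of the four comparison operators can be sketched in Python. The semantics below are inferred only from the yes/no answers on the slide, not taken from the Corpuscle implementation. The function name `matches` is hypothetical, and patterns are assumed to be regular expressions that must match a whole value.

```python
import re

def matches(values, op, pattern):
    """Evaluate one attribute test against a list of alternative values.

    Assumed semantics, inferred from the slide's examples only:
      =    some value matches the pattern
      ==   the attribute is single-valued and that value matches
      !=   no value matches
      !!=  some value does not match
    """
    hits = [re.fullmatch(pattern, v) is not None for v in values]
    if op == "=":
        return any(hits)
    if op == "==":
        return len(values) == 1 and hits[0]
    if op == "!=":
        return not any(hits)
    if op == "!!=":
        return not all(hits)
    raise ValueError(f"unknown operator: {op}")

# Token from the slide: word = fisker, lemma = fisk | fiske
lemma = ["fisk", "fiske"]
for op, pattern in [("=", "fisk"), ("==", "fisk"), ("!=", "fisk"),
                    ("!!=", "fisk"), ("==", "fisk.*")]:
    answer = "yes" if matches(lemma, op, pattern) else "no"
    print(f'[ lemma {op} "{pattern}" ] -> {answer}')
```

Under this reading the last query fails because the lemma is ambiguous, which agrees with the tentative "(no)" on the slide. If `==` instead meant "all values match", the last query would succeed, since both fisk and fiske match "fisk.*".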
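The feature-set queries from the set-valued attributes slide can be modelled as boolean expressions over the features of one reading. The sketch below is a plain in-memory model under stated assumptions, not Corpuscle's suffix-array lookup: each reading is a Python set, `|` is assumed to bind more loosely than juxtaposition, and a token is assumed to match if any of its readings satisfies any alternative. All function and variable names are hypothetical.

```python
def reading_satisfies(reading, alternative):
    """True if one reading (a set of features) meets one alternative.

    An alternative is a list of (feature, wanted) pairs: wanted=True means
    the feature must be present, wanted=False (written !"x") means absent.
    """
    return all((feature in reading) == wanted for feature, wanted in alternative)

def morph_matches(readings, query):
    """True if any reading satisfies any alternative of the query."""
    return any(reading_satisfies(r, alt) for r in readings for alt in query)

# Token from the slide: morph = ( N m pl ) | ( N m sg )
fisker = [{"N", "m", "pl"}, {"N", "m", "sg"}]

# [ morph = ("N" "sg") ]
q1 = [[("N", True), ("sg", True)]]
# [ morph = ("N" "pl" | "A" !"sup") ], read as (N and pl) or (A and not sup)
q2 = [[("N", True), ("pl", True)], [("A", True), ("sup", False)]]

print(morph_matches(fisker, q1))          # the (N m sg) reading satisfies q1
print(morph_matches(fisker, q2))          # the (N m pl) reading satisfies q2
print(morph_matches([{"A", "sup"}], q2))  # a superlative adjective is excluded
```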
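Because SRU requests are plain URLs, an FCS call can be assembled with the Python standard library. The sketch uses the Corpuscle FCS endpoint URL given in these slides and assumes it accepts the standard SRU parameters `query` and `maximumRecords`; only `operation=explain` and `operation=searchRetrieve` actually appear in the slides. The helper name `fcs_url` is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

FCS_ENDPOINT = "http://clarino.uib.no/corpuscle/fcs"  # endpoint from the slides

def fcs_url(operation, **params):
    """Build an SRU request URL for the given operation."""
    return FCS_ENDPOINT + "?" + urlencode({"operation": operation, **params})

print(fcs_url("explain"))
# Assumed standard SRU parameters; "fisk" is a simple CQL term query.
print(fcs_url("searchRetrieve", query="fisk", maximumRecords=10))
```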
Federated Content Search

•  Use case: Weblicht Aggregator
•  http://clarino.uib.no/corpuscle/fcs?operation=explain
•  http://clarino.uib.no/corpuscle/fcs?operation=searchRetrieve

REST/JSON API

REST: REpresentational State Transfer
•  An architectural style, no official standard (unlike SOAP)
•  Web API, client-server model
•  Stateless (in theory, but a session-id token and authentication information are sent with every request)
•  Uses HTTP GET or POST requests
•  Response can be XML, JSON, etc.
•  In our case: JSON

JSON: JavaScript Object Notation
•  Language-independent data format
•  Lightweight, easy to parse, basic data types
•  Parsers for most programming languages
As a result: easy to implement clients for REST/JSON API based services (need e.g. curl, a JSON parser)

Example calls:
•  http://clarino.uib.no/corpuscle/rest?command=get-session
•  http://clarino.uib.no/corpuscle/rest?session=1234&corpus=avis-plain&command=query&query='aske.*'

Problem: federated authentication via IdP to access restricted resources
Two solutions:
1.  Let your program code replicate the user interaction with the IdP and the local SP (this involves parsing of returned HTML pages etc.)
2.  Get an authentication session token from a Web login to the SP and use it in your code, e.g.:
    curl "http://clarino.uib.no/corpuscle/rest?command=get-session&login-index=_90753bc613d96c2v19069332254ca1b8fee4f574d"
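The REST calls shown in these slides can be wrapped in a small Python client. This is a sketch under assumptions: the parameter names (`command`, `session`, `corpus`, `query`, `login-index`) are the ones in the example URLs, but the layout of the JSON responses is not shown in the slides, so `call` simply returns the decoded JSON without interpreting it. The helper names `rest_url` and `call` are hypothetical, and "TOKEN" stands in for a real login token.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

REST_ENDPOINT = "http://clarino.uib.no/corpuscle/rest"  # from the example calls

def rest_url(command, **params):
    """Build a REST call URL; underscores in keyword names become hyphens."""
    fields = {"command": command}
    fields.update({k.replace("_", "-"): v for k, v in params.items()})
    return REST_ENDPOINT + "?" + urlencode(fields)

def call(command, **params):
    """Send the request and decode the JSON body (response layout not assumed)."""
    with urlopen(rest_url(command, **params)) as response:
        return json.load(response)

# Only URLs are built here; nothing is sent.
print(rest_url("get-session"))
print(rest_url("get-session", login_index="TOKEN"))
print(rest_url("query", session="1234", corpus="avis-plain", query="'aske.*'"))
```

`urlencode` percent-encodes the quotes and the asterisk in the query string, which avoids the shell-quoting problems that arise when such URLs are typed directly into a curl command.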