

Using the Nouveau Corpus d'Amsterdam (NCA)
with TIGERSearch

Achim Stein

in progress, version of July 17, 2011

Abstract:

This document is a quick start guide for querying the Nouveau Corpus d'Amsterdam (NCA, Stein et al. 2006). For more information about the corpus, see http://www.uni-stuttgart.de/lingrom/stein/corpus.

TIGERSearch is a tool for querying syntactically annotated text corpora (tree graphs). It was developed at the Institut für Maschinelle Sprachverarbeitung (IMS) at the University of Stuttgart (Lezius, 2002). It includes a complete manual, which, however, focuses mostly on phrase structure grammars.


1 Introduction

1.1 Who should read this document?

This is a quick start guide, written for users of the NCA who need quick results using TIGERSearch (TS). The TS manual is included in the tool (click on the Help icon); PDF and HTML versions can be found in the doc subfolder of the TIGERSearch installation folder.

Note that TIGERSearch will also be the preferred software for the parts of the corpus which have been or will be syntactically annotated in the project Syntactic Reference Corpus of Medieval French (SRCMF). The first texts will be published starting in 2011. See http://www.uni-stuttgart.de/lingrom/forschung/projekte/srcmf.html for more information about the project.

HINT: You may be able to infer most of the TS query language from the examples given in this document. If not, you should study chapter III of the TS manual in order to get acquainted with the basics of the query language.

1.2 Installation of TIGERSearch corpora

On the NCA homepage http://www.uni-stuttgart.de/lingrom/stein/corpus, registered users can download, for Windows or Mac computers,


2 TIGERSearch queries for word-level annotated corpora


2.1 Flat annotation

TIGERSearch was developed for syntactic queries, i.e. for sentence structures which are represented as trees. The NCA, however, has only word-level annotation: part of speech (POS) and lemmas. This means that only a part of the TS query language will be used. This section presents queries for the current version of the NCA (v3, 2011).

Figure 1 shows an example of the NCA annotation, as represented in TS:

Figure 1: Word-level annotated corpus

Each structure is a minimal tree in which all the words depend on a node S (for 'sentence'). In fact, structures are either verses or text lines of the original file; they are sentences only in texts where punctuation has been added manually.

The words are annotated with attribute-value pairs. For chief, the attribute-value pairs are:

word=chief
pos=NOM_obj_masc_sg 
deespos=002
taggerpos=NOM
lemma=chief1_chief2

HINT: Hold your mouse over a node (S or word) for a second, and a popup window will show the attribute-value pairs.

The attribute-value pairs of the S node contain information about the text the structure belongs to. They allow you to restrict the query to certain texts. For most attribute-value pairs, the corpus information window in the left part of the query window explains the possible values or indicates the frequency of the values. Figure 2 shows the query window, where the S node specification restricts the date to the 12th c. (years beginning with 11) and the query searches for the forms amor, amour, amours:

Figure 2: TIGERSearch query window


2.2 Query basics

In the query, each node is specified by a pair of square brackets, which may contain one or more attribute-value expressions, e.g. pos=/NOM.*/. TS searches the corpus for all the nodes which match this expression and displays the results in a new window, the "GraphViewer".

Node expressions can be combined using operators, in order to search for co-occurrences or sequences of words (see section 2.5). Text following double slashes will be ignored, e.g. // This is a comment.

A simple query for pruz looks like this:

// search for a word
[word="pruz"]

Each terminal node of the NCA corpus has the attribute word for the actual word form. Further attributes (terminal features) are:

feature    explanation                                     example
deespos    original annotation of A. Dees' project         513
pos        an easier-to-read version of deespos            VER_pres_3_sg
taggerpos  the category guessed by the TreeTagger          VER
lemma      one or more lemmas suggested by the TreeTagger  faire_farre
src        the dictionary source of the lemma              (not present)
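
Because the lemma attribute may hold several suggestions joined by an underscore (chief1_chief2, faire_farre), an exact string match on a single lemma could miss such nodes. The following is a minimal sketch, not a tested recipe: it assumes that the underscore is the only separator and that faire appears in exactly this spelling among the suggestions. It uses a regular expression (value between slashes, see section 2.3):

```
// faire alone, or as one of several lemma suggestions
[lemma=/(.*_)?faire(_.*)?/]
```

Where the lemma is known to be unambiguous, the fixed string lemma="faire" is the simpler choice.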

2.3 Regular expressions

Values in the node specification must be enclosed in double quotes if they are fixed strings, or in slashes if they are regular expressions. The query for either amor or amour with an optional s uses a regular expression:

[word=/(amor|amour)s?/]

Regular expressions use pre-defined characters and operators. The following table introduces the most important ones (char = character). Note that the example matches given here are not Old French:

symbol  meaning                                expression    finds...
.       any char                               b.ten         baten, beten, boten
+       preceding char at least once           be+ten        beten, beeten, beeeten, ...
?       preceding char at most once            bi?eten       bieten, beten
*       preceding char zero or more times      be*ten        bten, beten, beeten, beeeten, ...
[ ]     possible chars at a position           b[eo]ten      beten, boten; not buten
[a-z]   all chars between a and z              [a-z]aten     aaten, baten, caten, daten, ...
[^ ]    excluded chars at a position           b[^eo]ten     baten, buten; not beten, boten
( | )   OR                                     (Rose|Nelke)  Rose, Nelke
.*      nothing or any combination of chars    b.*ten        beten, bluten, bearbeiten
        (a useful combination)

HINT: Do not use regular expressions if you don't need them, i.e. if you are sure about the value you are searching for: a query using pos="PRO_invar" will be much faster than a query using pos=/PRO_invar.*/
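
The examples in this guide suggest that a regular expression has to match the whole value, not just a part of it. This is an assumption inferred from the queries shown here; the TS manual is the authority. Under that assumption, the two queries below (to be run one at a time) behave differently:

```
// matches only the word form "ment" itself
[word=/ment/]

// matches any word form ending in ment
[word=/.*ment/]
```

This is why patterns for prefixes or suffixes in this guide carry a leading or trailing .* as in /NOM.*/.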


2.4 Combining attributes

The attributes of a node can be combined using &. The following query finds forms ending with ment and having the POS beginning with NOM:

[word=/.*ment/ & pos=/NOM.*/]
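
Besides &, the TS query language also offers | (or) and ! (negation) inside a node description. These operators are not used elsewhere in this guide, so the following is only a sketch that should be checked against the TS manual. The first query accepts forms ending in ment that are tagged as noun or adjective; the second one (run separately) excludes nouns:

```
// ends in ment, POS is noun or adjective
[word=/.*ment/ & (pos=/NOM.*/ | pos=/ADJ.*/)]

// ends in ment, POS is anything but a noun
[word=/.*ment/ & !(pos=/NOM.*/)]
```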


2.5 Sequences of nodes

Precedence: A sequence of two nodes is expressed by the precedence operator . (a dot). The following query finds forms with the lemma aler which immediately precede a preposition:

[lemma="aler"] . [pos=/PRE.*/]


2.6 Variables and sequences of more than two nodes

Operators like . always relate exactly two nodes. If you want to search for three nodes, you have to add a second expression using the dot operator, and use & between the two expressions. A sequence A, B, C will therefore be expressed as A . B & B . C

To make things easier, and to make sure that B refers to the same node, a variable is attached to this node. Variables are declared in the form #name:[ ] and can then be referred to as #name

The following query finds forms of aler, followed by a preposition (PRE) and an article (DET). The expression for the preposition is labelled with the variable pre, and re-used in the second line:

[lemma="aler"] . #pre:[pos=/PRE.*/]
& #pre . [pos=/DET.*/]
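
The same pattern extends to longer sequences: every node that takes part in two relations gets its own variable. As a sketch (the variable name det is chosen freely for this illustration), the query above can be extended by a noun following the article:

```
// aler + preposition + article + noun
[lemma="aler"] . #pre:[pos=/PRE.*/]
& #pre . #det:[pos=/DET.*/]
& #det . [pos=/NOM.*/]
```

A sequence of n nodes thus needs n-1 precedence expressions joined by &.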


2.7 Define distances

The operator . can take modifiers in order to specify fixed or variable distances (see table below). The following query finds en with a following noun at a maximum distance of 3 words (i.e. with at most 2 words between them):

[word="en"] .1,3 [pos=/NOM.*/]

operator  meaning                             example
.         immediately (1 word) before
.n        n words before                      .3
.m,n      between m and n words before        .1,5
.*        any distance before (1)
!.        not immediately before
!.n       not n words before                  !.3
!.m,n     not between m and n words before    !.1,5
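
As an illustration of further rows of the table, here are two variants of the query above, to be run one at a time. They only use operators listed in the table; the choice of en and of nouns is carried over from the example:

```
// en exactly two words before a noun (one word in between)
[word="en"] .2 [pos=/NOM.*/]

// en anywhere before a noun within the same structure
[word="en"] .* [pos=/NOM.*/]
```

Since most structures in the NCA are verses or lines rather than sentences, large distances are limited by the length of the structure.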


2.8 The sentence node: using hierarchical relations in flat structures

Dominance is the vertical relation between two nodes. The higher node dominates the lower node. In the flat structures of our corpus, hierarchical queries are only of limited interest. Two kinds of queries will be introduced here:

  1. Queries which use the information attached to the sentence node (S) in order to restrict the query to a subset of sentences. Keep in mind that most of the bibliographical information (title, date, etc.) is attached to the S node. Restricting the queries to S features therefore means restricting the query to certain text properties. Refer to the corpus information (nonterminal features) and the bibliography distributed with the corpus for more information (see also Gleßgen and Vachon 2011).

  2. Queries for expressions at the beginning or at the end of a sentence.

The operator for dominance is >. Dominance between the S node and a noun would be expressed by [cat="S"] > [pos=/NOM.*/]. (The operator can be modified like the precedence operator ., see the table in 2.7, but this is not relevant here.)

Restrict the query to specific features of S: The following query selects S nodes whose attribute ponctuation has the value oui. Remember that the Corpus d'Amsterdam has no punctuation, but that we have inserted punctuation for some prose texts in the NCA. Only in these texts does S correspond to a sentence.

The S node can be found using [cat="S" & ponctuation="oui"]. But since the feature ponctuation only occurs in S nodes, cat is redundant. It is also sufficient to specify the co-occurrence of the nodes using &, instead of expressing the dominance. The first node selects only structures from texts with punctuation.

The following query therefore finds all forms of voloir immediately followed by a verb in texts with punctuation:

[ponctuation="oui"]
& [lemma="voloir"] . [pos=/VER.*/]

The selection can also combine several features, e.g. texts in prose from the 12th century: [vers="non" & dateManuscrit=/11.*/]
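
Put together with a word-level query, such a selection could look as follows. This is a sketch that simply combines the two examples above; the attribute names vers and dateManuscrit are the ones used in this section, and their possible values should be checked in the corpus information window:

```
// voloir + verb, only in prose texts from 12th-c. manuscripts
[vers="non" & dateManuscrit=/11.*/]
& [lemma="voloir"] . [pos=/VER.*/]
```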

Beginning and end of sentences. The operator > can be modified for the leftmost dominated node using >@l or the rightmost dominated node using >@r.

The following query finds verb-initial sentences (i.e. where a verb node is the leftmost daughter of S). In this case, it is useful to limit the search to texts with punctuation:

[cat="S" & ponctuation="oui"] >@l [pos=/VER.*/]
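
The edge operator can be combined with the precedence operator from section 2.5. As a sketch, the following query looks for verb-initial sentences in which the verb is immediately followed by a pronoun; the variable name v is chosen freely, and the POS prefix PRO is taken from the PRO_invar example in section 2.3:

```
// verb as leftmost daughter of S, immediately followed by a pronoun
[cat="S" & ponctuation="oui"] >@l #v:[pos=/VER.*/]
& #v . [pos=/PRO.*/]
```

Replacing >@l by >@r would address the right edge instead; whether that position is occupied by a word or by a punctuation mark in the punctuated texts is something to verify in the graph viewer first.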


2.9 Exporting the results

TIGERSearch allows you to export results in its own encoding format, TIGER-XML, which is explained in chapter V of the TS documentation. This is useful for creating a new corpus from the results, but most users will prefer a different output format.

Click on Export Matches in the icon bar of the TIGERSearch query window. In the pop-up dialog, choose XML piped through XSLT in the top left pulldown menu (even if you don't know what this means).

Under export to file click on Search and indicate a path and a file name (you may want to use the extension .txt, e.g. export.txt).

Click the radio button for Current matching corpus graphs to export the results of your search (i.e. the sentences displayed in the graph viewer). The other options here are rather self-explanatory.

In the bottom right pulldown menu XML piped through XSLT, try the option sentence format (all tokens, matching tokens marked).

Click Submit and have a look at the resulting file.


3 Using the interface

Since TIGERSearch is Java-based, it looks very similar on the different operating systems. You can use the usual commands for copying and pasting into and out of the query window, or for copying an example displayed at the bottom of the graph viewer.

Mac OS X: TIGERSearch uses the Ctrl-key combinations (Ctrl-c, Ctrl-v) for copying and pasting, not the Cmd key.


3.1 Viewing the results

When the query is finished (or when you cancel it), TIGERSearch opens the result window (TIGERGraphViewer, see fig. 3). You can browse through the matched sentences using the Previous/Next or First/Last buttons. If you have more than one hit per sentence, the buttons for Subgraph will be active.

Figure 3: Result window (TIGERGraphViewer)

The yellow T in the icon bar switches the text window on and off. Further display options (context sentences, features of terminal nodes, colours etc.) are available in the Options menu. Some other functions are not very useful or will not work for the flat structures of our corpus: export of graph images, focus on subtrees etc.

HINT: Since most S structures in the NCA do not correspond to sentences, but rather to verses or lines, it is useful to set the display option Number of context sentences to 1 or 2.


3.2 Save queries: bookmarks

Queries can be saved as bookmarks. Assuming that you have a query in the query window, do as follows:

  1. Activate the Bookmarks tab in the bottom left part of the query window (fig. 2). The bookmark tree will appear.
  2. Right-click on a folder (there will only be one folder at the beginning) and choose Add Bookmark (or Add Group and then add a bookmark).
  3. Saved bookmarks (or groups of bookmarks) can be exported with a right click, then Export as Bookmark File. A dialog will allow you to store the bookmark(s) as an XML file.

If you receive or want to re-use a bookmark file, do as follows:

  1. Put the file somewhere on your computer.
  2. Activate the Bookmarks tab.
  3. Right-click on a folder of the tree and choose Import Bookmark File. A file selection dialog will allow you to select your bookmark file.
  4. After the import, the bookmarks can be selected in the bookmark tree.

Bookmarks can also be changed, renamed, copied and deleted. You can also include the query result in the bookmark.

HINT: Since copying and pasting in and out of the query window is easy, you may prefer to simply save your bookmark collection in a text editor or a word processor.


4 Calculating frequencies: the statistics window

TIGERSearch also has a statistics window. Before you use it, you must have completed a query. The following query finds nouns followed by adjectives coordinated with et.

#nom:[pos=/NOM.*/] . #adj1:[pos=/ADJ.*/]
& #adj1 . #et:[word="et"]
& #et . #adj2:[pos=/ADJ.*/]

Note that the variables #nom and #adj2 are not necessary for the query as such, but they will be needed in the statistics window.

When the search is finished, click on the grid symbol (either in the query window or in the graph viewer) to open the statistics window. Here you can make lists for the terminal nodes (words) using their variables (i.e. #nom, #adj1, and #adj2).

For a first try, click on Default in the icon menu. TIGERSearch suggests column headers for the terminal nodes (and even creates variables if you have forgotten to declare them). In this case, TIGERSearch might suggest a column for #et, which is not useful. Change it to #nom by clicking on the pulldown menu of the column header (change other columns too, if necessary). Then click Build again.

Concordance: Corpus is already selected in the icon menu, otherwise click on it. A click on Build will fill the table, and show you for each sentence the result for the named positions.

Figure 4: Statistics window: concordance

HINT: A double click on a field in the left column (GraphID) will display the sentence in the graph viewer window.

Frequency: Click on Frequency in the icon menu in order to build a frequency list. It will show you that pucele bele et gente is the most frequent combination in this construction.

Figure 5: Statistics window: frequency

Modify tables: If you are interested in the first adjective only (#adj1), right-click on any of the other column headers and select Remove column. When only the desired column is left, click on Build again to obtain a list for this position only.

Export tables: Click on Export in the icon menu. The pop-up dialog will allow you to save the table as a file in either Text, XML or Excel format.

Bibliography

Gleßgen and Vachon 2011 GLESSGEN, Martin-Dietrich ; VACHON, Claire:
L'étude philologique et scriptologique du Nouveau Corpus d'Amsterdam.
In: CASANOVA, Emili (ed.) ; CALVO, Cesáreo (ed.): Actes du XXVI CILPR, València 6-11 septembre 2010.
Berlin : De Gruyter, 2011

Lezius 2002 LEZIUS, Wolfgang:
Ein Suchwerkzeug für syntaktisch annotierte Textkorpora (German).
Stuttgart : Institut für Maschinelle Sprachverarbeitung (IMS), 2002
(University of Stuttgart Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung (AIMS), vol. 8, no. 4)

Stein et al. 2006 STEIN, Achim (ed.) et al.:
Nouveau Corpus d'Amsterdam. Corpus informatique de textes littéraires d'ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen.
Stuttgart : Institut für Linguistik/Romanistik, 2006. -
URL http://www.uni-stuttgart.de/lingrom/stein/corpus/

