in progress, version of July 17, 2011
TIGERSearch is a software tool for querying syntactically annotated text corpora (tree graphs). It was developed at the Institut für Maschinelle Sprachverarbeitung (IMS) at the University of Stuttgart (Lezius, 2002). It includes a complete manual, which, however, focuses mostly on phrase structure grammars.
This is a quick start guide, written for users of the NCA who need quick results using TIGERSearch (TS). You may be able to infer much of the TS query language from the examples given in this document. For more information, you should study chapter IV of the TS manual in order to get acquainted with the basics of the query language. The TS manual is included in the tool (click on the Help icon); PDF and HTML versions can be found in the doc subfolder in the TIGERSearch installation folder.
Note that TIGERSearch will also be the preferred software for the parts of the corpus which have been or will be syntactically annotated in the project Syntactic Reference Corpus of Medieval French (SRCMF). The first texts will be published starting in 2011. See http://www.uni-stuttgart.de/lingrom/forschung/projekte/srcmf.html for more information about the project.
HINT: You may be able to infer most of the TS query language from the examples given in this document. If not, you should study chapter III of the TS manual in order to get acquainted with the basics of the query language.
On the NCA homepage http://www.uni-stuttgart.de/lingrom/stein/corpus, registered users can download, for Windows or Mac computers, NCA3 into the folder TIGERCorpora of your TIGERSearch installation (on Windows systems, the default is C:\TIGERSearch\TIGERCorpora).
TIGERSearch was developed for syntactic queries, i.e. for sentence structures which are represented as trees. The NCA, however, has only word-level annotation: part of speech (POS) and lemmas. This means that only a part of the TS query language will be used. This section will present queries for the current version of the NCA (v3, 2011).
Figure 1 shows an example of the NCA annotation, as represented in TS:
Each structure is a minimal tree, where all the words depend on a node S (for 'sentence': actually, structures are either verses or text lines of the original file; structures are sentences only in texts where punctuation has been added manually).
The words are annotated with attribute-value pairs. For chief, the attribute-value pairs are:
word=chief pos=NOM_obj_masc_sg deespos=002 taggerpos=NOM lemma=chief1_chief2
HINT: Hold your mouse over a node (S or word) for a second, and a popup window will show the attribute-value pairs.
The attribute-value pairs of the S node contain information about the text the structure belongs to. They allow you to restrict the query to certain texts. For most attribute-value pairs, the corpus information window in the left part of the query window explains the possible values, or indicates the frequency of the values. Figure 2 shows the query window where the S node specification restricts the date to the 12th c. (years beginning with 11) and searches for the forms amor, amour, amours:
In the query, each node is specified by a pair of square brackets, which may contain one or more attribute-value expressions, e.g. [pos=/NOM.*/]. TS searches the corpus for all the nodes which match this expression and displays the results in a new window, the GraphViewer.
Node expressions can be combined using operators, in order to search for co-occurrences or sequences of words (see section 2.5).
Text following double slashes will be ignored, e.g. // This is a comment.
A simple query for pruz looks like this:
// search for a word
[word="pruz"]
Each terminal node of the NCA corpus has the attribute word for the actual word form. Further attributes (terminal features) are:
feature | explanation | example |
deespos | original annotation of A. Dees' project | 513 |
pos | an easier to read version of deespos | VER_pres_3_sg |
taggerpos | the category guessed by the TreeTagger | VER |
lemma | one or more lemmas suggested by the TreeTagger | faire_farre |
src | the dictionary source of the lemma | (not present) |
Values in the node specification must be enclosed in double quotes if they are fixed strings, or in slashes if they are regular expressions. The query for either amor or amour with an optional s uses a regular expression:
[word=/(amor|amour)s?/]
Regular expressions use pre-defined characters and operators. The following table introduces the most important ones (char = character). Sorry, the examples for the matches given here are not Old French:
symbol | meaning (char=character) | expression | finds... |
. | any char | b.ten | baten, beten, boten |
+ | preceding char at least once | be+ten | beten, beeten, beeeten, ... |
? | preceding char at most once | bi?eten | bieten, beten |
* | preceding char zero or more times | be*ten | bten, beten, beeten, beeeten, ... |
[ ] | possible chars at a position | b[eo]ten | beten, boten; not buten |
[a-z] | all chars between a and z | [a-z]aten | aaten, baten, caten, daten, ... |
[^ ] | excluded chars at a position | b[^eo]ten | baten, buten; not beten, boten |
( | ) | OR | (Rose|Nelke) | Rose, Nelke |
useful: .* | nothing or any combination of chars | b.*ten | beten, bluten, bearbeiten |
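The patterns in the table can also be tried outside TIGERSearch. The following Python sketch is only an approximation. It assumes that a TS regular expression has to match the whole attribute value (hence re.fullmatch), and that Python's re module behaves like the TS engine for these simple operators. The helper names matches_value and check_examples are hypothetical.

```python
import re


def matches_value(pattern: str, value: str) -> bool:
    """Return True if the pattern matches the whole value (assumed TS behaviour)."""
    return re.fullmatch(pattern, value) is not None


# (pattern, values expected to match, values expected not to match)
EXAMPLES = [
    (r"b.ten", ["baten", "beten", "boten"], ["bten"]),
    (r"be+ten", ["beten", "beeten", "beeeten"], ["bten"]),
    (r"bi?eten", ["bieten", "beten"], ["biieten"]),
    (r"be*ten", ["bten", "beten", "beeten"], ["baten"]),
    (r"b[eo]ten", ["beten", "boten"], ["buten"]),
    (r"b[^eo]ten", ["baten", "buten"], ["beten", "boten"]),
    (r"(Rose|Nelke)", ["Rose", "Nelke"], ["Rosen"]),
    (r"b.*ten", ["beten", "bluten", "bearbeiten"], ["beter"]),
    (r"(amor|amour)s?", ["amor", "amour", "amours"], ["amoureus"]),
]


def check_examples() -> bool:
    """Check every row of EXAMPLES against matches_value."""
    for pattern, hits, misses in EXAMPLES:
        if not all(matches_value(pattern, v) for v in hits):
            return False
        if any(matches_value(pattern, v) for v in misses):
            return False
    return True
```

Calling check_examples() confirms that each pattern accepts the listed matches and rejects the listed non-matches under these assumptions.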
HINT: Do not use regular expressions if you don't need them, i.e. if you are sure about the value you are searching for: a query using pos="PRO_invar" will be much faster than a query using pos=/PRO_invar.*/
The attributes of a node can be combined using &. The following query finds forms ending with ment and having a POS beginning with NOM:
[word=/.*ment/ & pos=/NOM.*/]
Precedence: A sequence of two nodes is expressed by the precedence operator . (a dot). The following query finds forms with the lemma aler which immediately precede a preposition:
[lemma="aler"] . [pos=/PRE.*/]
Operators like . always connect exactly two nodes. If you want to search for three nodes, you have to add a second expression using the dot operator, and use & between the two expressions. A sequence A, B, C will therefore be expressed as A . B & B . C
To make things easier, and to make sure that B refers to the same node, a variable is attached to this node. Variables have the form #name:[ ] and can then be referenced as #name
The following query finds forms of aler, followed by a preposition (PRE) and an article (DET). The expression for the preposition is labelled with the variable pre, and re-used in the second line:
[lemma="aler"] . #pre:[pos=/PRE.*/]
& #pre . [pos=/DET.*/]
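To see what the variable does, the three-node sequence can be imitated on a flat list of tokens. The following Python sketch is not how TIGERSearch works internally. The tokens and their attribute values are invented for illustration, and the function name find_aler_pre_det is hypothetical. The middle token plays the role of #pre: the same position is tested in both conditions.

```python
import re

# Hypothetical tokens; words and tags are invented for illustration.
TOKENS = [
    {"word": "il", "pos": "PRO_pers", "lemma": "il"},
    {"word": "vait", "pos": "VER_pres_3_sg", "lemma": "aler"},
    {"word": "a", "pos": "PRE", "lemma": "a"},
    {"word": "la", "pos": "DET_def", "lemma": "le"},
    {"word": "cort", "pos": "NOM_obj_fem_sg", "lemma": "cort"},
]


def find_aler_pre_det(tokens):
    """Return the start indices i where tokens i, i+1, i+2 are aler, PRE, DET."""
    hits = []
    for i in range(len(tokens) - 2):
        first, pre, det = tokens[i], tokens[i + 1], tokens[i + 2]
        if (
            first["lemma"] == "aler"                # [lemma="aler"]
            and re.fullmatch(r"PRE.*", pre["pos"])  # #pre:[pos=/PRE.*/]
            and re.fullmatch(r"DET.*", det["pos"])  # #pre . [pos=/DET.*/]
        ):
            hits.append(i)
    return hits
```

With the invented tokens above, find_aler_pre_det(TOKENS) returns [1], the position of the form of aler.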
The operator . can take modifiers in order to specify fixed or variable distances (see table below). The following query finds en with a following noun at a maximum distance of 3 words (i.e. with at most 2 words between them):
[word="en"] .1,3 [pos=/NOM.*/]
operator | meaning | example |
. | immediately (1 word) before | |
.n | n words before | .3 |
.m,n | between m and n words before | .1,5 |
.* | any distance before (1) | |
!. | not immediately before | |
!.n | not n words before | !.3 |
!.m,n | not between m and n words before | !.1,5 |
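The distance modifiers can be read as a condition on the difference between two token positions. The following Python sketch illustrates this for the non-negated operators only, on invented tokens. It assumes that a distance of 1 means "immediately before", as stated in the table. The function name precedes and the test data are hypothetical.

```python
import re

# Hypothetical tokens; words and tags are invented for illustration.
TOKENS = [
    {"word": "en", "pos": "PRE"},
    {"word": "la", "pos": "DET_def"},
    {"word": "grant", "pos": "ADJ_qual"},
    {"word": "cité", "pos": "NOM_obj_fem_sg"},
]


def precedes(tokens, left, right, min_dist=1, max_dist=1):
    """Return pairs (i, j) where left(tokens[i]), right(tokens[j])
    and min_dist <= j - i <= max_dist."""
    pairs = []
    for i, tok in enumerate(tokens):
        if not left(tok):
            continue
        for dist in range(min_dist, max_dist + 1):
            j = i + dist
            if j < len(tokens) and right(tokens[j]):
                pairs.append((i, j))
    return pairs


def is_en(tok):
    return tok["word"] == "en"            # [word="en"]


def is_noun(tok):
    return re.fullmatch(r"NOM.*", tok["pos"]) is not None  # [pos=/NOM.*/]
```

Here precedes(TOKENS, is_en, is_noun, 1, 3) corresponds to the operator .1,3 and finds the pair (0, 3), whereas a maximum distance of 2 finds nothing, because two words stand between en and the noun.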
Dominance is the vertical relation between two nodes. The higher node dominates the lower node. In the flat structures of our corpus, hierarchical queries are only of limited interest. Two kinds of queries will be introduced here:
The operator for dominance is >. Dominance between the S node and a noun would be expressed by [cat="S"] > [pos=/NOM.*/]. (The operator can be modified like the precedence operator ., see the table in 2.7, but this is not relevant here.)
Restrict the query to specific features of S: The following query finds S nodes whose attribute ponctuation has the value oui. Remember that the Corpus d'Amsterdam has no punctuation, but that we have inserted punctuation for some prose texts in the NCA. Only in these texts, S corresponds to a sentence.
The S node can be found using [cat="S" & ponctuation="oui"]. But since the feature ponctuation only occurs in S nodes, cat is redundant. It is also sufficient to specify co-occurrence of the feature using &, instead of expressing the dominance. In the following query, the first node selects only structures from texts with punctuation, so that the query finds forms of voloir immediately followed by a verb in these texts:
[ponctuation="oui"] & [lemma="voloir"] . [pos=/VER.*/]
The selection can also combine several features, e.g. texts in prose from the 12th century: [vers="non" & dateManuscrit=/11.*/]
Beginning and end of sentences. The operator > can be modified to select the leftmost dominated node using >@l or the rightmost dominated node using >@r.
The following query finds verb-initial sentences (i.e. where a verb node is the leftmost daughter of S). In this case, it is useful to limit the search to texts with punctuation:
[cat="S" & ponctuation="oui"] >@l [pos=/VER.*/]
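In a flat structure, the leftmost daughter of S is simply the first word. The following Python sketch imitates the verb-initial query on invented data. It is an illustration of the logic, not of the TIGERSearch implementation. The structures, their words and tags, and the function name verb_initial are hypothetical.

```python
import re

# Hypothetical structures: S-level features plus the list of terminal nodes.
STRUCTURES = [
    {"id": 0, "features": {"cat": "S", "ponctuation": "oui"},
     "tokens": [{"word": "Vient", "pos": "VER_pres_3_sg"},
                {"word": "li", "pos": "DET_def"},
                {"word": "rois", "pos": "NOM_suj_masc_sg"}]},
    {"id": 1, "features": {"cat": "S", "ponctuation": "oui"},
     "tokens": [{"word": "Li", "pos": "DET_def"},
                {"word": "rois", "pos": "NOM_suj_masc_sg"},
                {"word": "vient", "pos": "VER_pres_3_sg"}]},
    {"id": 2, "features": {"cat": "S", "ponctuation": "non"},
     "tokens": [{"word": "Vient", "pos": "VER_pres_3_sg"},
                {"word": "il", "pos": "PRO_pers"}]},
]


def verb_initial(structures):
    """Return the ids of punctuated structures whose first word is a verb."""
    ids = []
    for s in structures:
        if s["features"].get("ponctuation") != "oui":  # ponctuation="oui"
            continue
        tokens = s["tokens"]
        # >@l : only the leftmost daughter (the first token) is tested
        if tokens and re.fullmatch(r"VER.*", tokens[0]["pos"]):
            ids.append(s["id"])
    return ids
```

With the invented data, verb_initial(STRUCTURES) returns [0]: structure 1 does not begin with a verb, and structure 2 comes from a text without punctuation. Testing tokens[-1] instead of tokens[0] would correspond to >@r.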
TIGERSearch allows you to export results using its own encoding format TIGER-XML, which is explained in chapter V of the TS documentation. This can be useful for creating a new corpus out of the results, but most users will rather want to produce a different format.
Click on Export Matches in the icon bar of the TIGERSearch query window. In the pop-up dialog, choose XML piped through XSLT in the top left pulldown menu (even if you don't know what this means).
Under export to file click on Search and indicate a path and a file name (you may want to use the extension .txt, e.g. export.txt).
Click the radio button for Current matching corpus graphs to export the results of your search (i.e. the sentences displayed in the graph viewer). The other options here are rather self-explanatory.
In the bottom right pulldown menu XML piped through XSLT, try the option sentence format (all tokens, matching tokens marked).
Click Submit and have a look at the resulting file.
Since TIGER is Java-based, it looks very similar in the different operating systems. You can use the usual commands for copying and pasting in and out of the query window or for copying an example displayed at the bottom of the graph viewer.
Mac OS X: TIGERSearch uses the Ctrl-key combinations (Ctrl-c, Ctrl-v) for copying and pasting, not the Cmd key.
When the query is finished (or when you cancel it), TIGER opens the result window (TIGERGraphViewer, see fig. 3). You can browse through the matched sentences using the Previous/Next or First/Last buttons. If you have more than one hit per sentence, the buttons for Subgraph will be active.
The yellow T in the icon bar switches the text window on and off. Further display options (context sentences, features of terminal nodes, colours etc.) are available in the Options menu. Some other functions are not very useful or will not work for the flat structures of our corpus: export of graph images, focus on subtrees etc.
HINT: Since most S structures in the NCA do not correspond to sentences, but rather to verses or lines, it is useful to set the display option Number of context sentences to 1 or 2.
Queries can be saved as bookmarks. Assuming that you have a query in the query window, do as follows:
If you receive or want to re-use a bookmark file, do as follows:
Bookmarks can also be changed, renamed, copied and deleted. You can also include the query result in the bookmark.
HINT: Since copying and pasting in and out of the query window is easy, you may prefer to simply save your bookmark collection in a text editor or a word processor.
TIGERSearch also has a statistics window. Before you use it, you must have completed a query. The following query finds nouns followed by adjectives coordinated with et.
#nom:[pos=/NOM.*/] . #adj1:[pos=/ADJ.*/] & #adj1 . #et:[word="et"] & #et . #adj2:[pos=/ADJ.*/]
Note that the variables #nom and #adj2 are not necessary for the query as such, but they will be needed in the statistics window.
When the search is finished, click on the grid symbol (either in the query window or in the graph viewer) to open the statistics window. Here you can make lists for the terminal nodes (words) using their variables (i.e. #nom, #adj1, and #adj2).
For a first try, click on Default in the icon menu. TIGERSearch suggests column headers for the terminal nodes (and even creates variables if you have forgotten to declare them). In this case, TIGERSearch might suggest a column for #et, which is not useful. Change it to #nom by clicking on the pulldown menu of the column header (change other columns too, if necessary). Then click Build again.
Concordance: Corpus is already selected in the icon menu, otherwise click on it. A click on Build will fill the table, and show you for each sentence the result for the named positions.
HINT: A double click on a field in the left column (GraphID) will display the sentence in the graph viewer window.
Frequency: Click on Frequency in the icon menu in order to build a frequency list. It will show you that pucele bele et gente is the most frequent combination in this construction.
Modify tables: If you are interested in the first adjective only (#adj1), right-click on any of the other column headers and select Remove column. When only the desired column is left, click on Build again to obtain a list for this position only.
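The difference between the full frequency list and a list for one position can be illustrated with a few lines of Python. This is only a sketch of the counting logic. The rows are invented, not taken from the NCA, and the function name frequency_list is hypothetical. Each row stands for one match, with the columns #nom, #adj1 and #adj2.

```python
from collections import Counter

# Invented rows, one per match: (#nom, #adj1, #adj2). Not corpus data.
ROWS = [
    ("pucele", "bele", "gente"),
    ("dame", "bele", "cortoise"),
    ("pucele", "bele", "gente"),
    ("chevalier", "preu", "hardi"),
]


def frequency_list(rows, columns=None):
    """Count identical rows, optionally keeping only the given column indices."""
    if columns is not None:
        rows = [tuple(row[c] for c in columns) for row in rows]
    # most_common() sorts the combinations by descending frequency
    return Counter(rows).most_common()
```

frequency_list(ROWS) counts complete combinations, while frequency_list(ROWS, columns=[1]) counts the first adjective only, which is what removing the other columns achieves in the statistics window.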
Export tables: Click on Export in the icon menu. The pop-up dialog will allow you to save the table as a file in either Text, XML or Excel format.
The translation was initiated by Achim Stein on 2011-07-17