
Advanced Query how-to (Repubblica)

The advanced query interface allows you to enter queries using the CQP Query Language. In this brief tutorial we show you a few examples of what you can do (you can do many other things as well: see Stefan Evert's CQP tutorial).
As you try the queries suggested here, keep in mind the handy Edit Query option you will see in the left sidebar after you issue a query: it will bring you back to the last query you issued, so that you can edit it rather than starting from scratch.
First, we present positional attributes, then structural attributes.

Positional attributes

Positional attributes pertain to single words. Currently, the "la Repubblica" corpus contains the following positional attributes:
word: the wordform;
lemma: the lemma;
pos: the part-of-speech tag (see the tagset).
If you want to look for a word(form), just put it into double quotes:
"parola"
If you want to look for a sequence of words, put each word into double quotes:
"parola" "parola"
If you want to look for two words in either order, type something like:
"caso" "strano" | "strano" "caso"
Case-insensitive queries require the %c flag. For example:
"bbc" -> this does not match anything
"bbc" %c -> this matches both "Bbc" and "BBC"
You can ignore accents and other diacritics by adding the %d flag. For example, you can find all the ways in which a foreign word is spelt in Italian:
"élite" -> this returns 800-odd results
"elite" -> fewer than 200 results
"elite" %d -> more than the sum of the first two, because it also matches "wrong" spellings such as "èlite"
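The effect of the two flags can be imitated outside CQP. The following is a minimal Python sketch, not a description of how CQP works internally: the helper names strip_diacritics and token_matches are our own, and we assume that %d can be approximated by removing Unicode combining marks from both the pattern and the token.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'élite' -> 'elite' (assumed stand-in for %d)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def token_matches(pattern: str, token: str, ignore_case: bool = False,
                  ignore_diacritics: bool = False) -> bool:
    """Return True if the whole token matches the pattern under the given flags."""
    if ignore_diacritics:
        pattern, token = strip_diacritics(pattern), strip_diacritics(token)
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, token, flags) is not None


print(token_matches("bbc", "BBC"))                              # False
print(token_matches("bbc", "BBC", ignore_case=True))            # True
print(token_matches("elite", "èlite", ignore_diacritics=True))  # True
```

Without the flags the comparison is exact, which is why "bbc" alone finds nothing in a corpus where the name is always capitalised.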
The syntax for including lemmas and POS tags in queries is just a bit more complex: a condition on an attribute goes inside square brackets. For example:
"in" "ogni" [pos="NOUN"]
looks for the sequence "in ogni" followed by any common noun.
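To make the logic of such a sequence query concrete, here is a rough Python sketch that slides a window over a tagged sentence and checks one condition per position. It is only an illustration: the sentence, the helper names (word_is, pos_is, find_sequence) and every tag other than NOUN are invented for the example and are not taken from the corpus.

```python
import re

# Hypothetical tagged tokens as (word, pos) pairs; only NOUN comes from the tutorial.
sentence = [("Piove", "VER"), ("in", "PRE"), ("ogni", "DET"),
            ("città", "NOUN"), (".", "PUN")]


def word_is(value):
    """Condition on the word form, like "in" in a query."""
    return lambda tok: re.fullmatch(value, tok[0]) is not None


def pos_is(value):
    """Condition on the POS tag, like [pos="NOUN"] in a query."""
    return lambda tok: re.fullmatch(value, tok[1]) is not None


def find_sequence(tokens, conditions):
    """Return the start index of every window where each token meets its condition."""
    n = len(conditions)
    return [i for i in range(len(tokens) - n + 1)
            if all(cond(tok) for cond, tok in zip(conditions, tokens[i:i + n]))]


# Counterpart of: "in" "ogni" [pos="NOUN"]
query = [word_is("in"), word_is("ogni"), pos_is("NOUN")]
print(find_sequence(sentence, query))  # [1]
```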
You can look for the lemma "interessante" like this:
[lemma="interessante"]
The query language supports regular expressions. For example, in the following way you can look for all the words beginning with "interess":
"interess.*"
If instead you want to specify a fixed set of suffixes, you can do something like:
"interess(e|i|ante)"
With the following query you find both "picnic" and "pic-nic":
"pic-?nic"
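One point worth keeping in mind is that a CQP regular expression has to match the whole token, not just part of it. The Python sketch below imitates this with re.fullmatch on a small invented word list. The three patterns are our own illustrations of the descriptions above, and Python's regular expression syntax is only assumed to agree with CQP's for simple patterns like these.

```python
import re

# Invented word list for illustration; not drawn from the corpus.
tokens = ["interesse", "interessante", "disinteresse", "picnic", "pic-nic", "picnics"]


def matching_tokens(pattern, tokens):
    """Keep the tokens that the pattern matches from first to last character."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


print(matching_tokens("interess.*", tokens))          # ['interesse', 'interessante']
print(matching_tokens("interess(e|i|ante)", tokens))  # ['interesse', 'interessante']
print(matching_tokens("pic-?nic", tokens))            # ['picnic', 'pic-nic']
```

Note that "disinteresse" and "picnics" are not returned, because the patterns would cover only part of those tokens.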
Now for something more ambitious: the following query finds all adjectives beginning with "pass" and displays both words and POS tags:
[word="pass.*" & pos="ADJ"]; show +pos
There are two new things here: you can use & to specify more than one condition on a token, and you can use show to add/remove attributes to display.
Other ways of combining conditions on the same token:
[word="sott.*" & !(pos="V.*" | pos="ADJ")]; show +pos
The previous query looks for words beginning with "sott" that are neither verbs (i.e., items whose POS tag begins with V) nor adjectives: ! negates the condition in parentheses, and | means "or".
The same query but displaying POSs only:
[word="sott.*" & !(pos="V.*" | pos="ADJ")]; show +pos -word
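The way &, | and ! combine conditions on a single token maps directly onto and, or and not. The following Python sketch applies the same test to a handful of invented tokens. The tags PRE and VER:fin and the helper names attr and keep are assumptions made for the example, so check the tagset for the real values.

```python
import re

# Invented tokens; the tags are placeholders, not necessarily those of the corpus.
tokens = [
    {"word": "sotto", "pos": "PRE"},
    {"word": "sottile", "pos": "ADJ"},
    {"word": "sottolinea", "pos": "VER:fin"},
    {"word": "sottosegretario", "pos": "NOUN"},
    {"word": "sopra", "pos": "PRE"},
]


def attr(token, name, pattern):
    """True if the named attribute of the token matches the pattern as a whole."""
    return re.fullmatch(pattern, token[name]) is not None


def keep(token):
    """Counterpart of: [word="sott.*" & !(pos="V.*" | pos="ADJ")]"""
    return attr(token, "word", "sott.*") and not (
        attr(token, "pos", "V.*") or attr(token, "pos", "ADJ"))


print([t["word"] for t in tokens if keep(t)])  # ['sotto', 'sottosegretario']
print([t["pos"] for t in tokens if keep(t)])   # ['PRE', 'NOUN']
```

The second print corresponds loosely to show +pos -word: the same hits, displayed by tag only.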

More examples:

The following query looks for all forms of the adjective "rosso":
[lemma="rosso" & pos="ADJ"]
This finds all occurrences of "hanno" as an auxiliary:
[word="hanno" & pos="AUX.*"]; show +pos
The following query
[(word="han(no)?")]; show +pos
finds the variants "han" and "hanno".
You can change the context size by specifying:
- the number of characters: set Context 30
- the number of tokens: set Context 10 words
- the number of sentences: set Context 2 s
For example:
"cane"; set Context 1 s; show +pos
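If it helps to picture what a context measured in tokens means, the following Python sketch cuts a window of a given number of words on each side of a hit. The function name kwic and the sample sentence are invented. Character-based and sentence-based contexts are not covered here.

```python
def kwic(tokens, index, context_words):
    """Return (left context, hit, right context) with at most context_words tokens per side."""
    left = tokens[max(0, index - context_words):index]
    right = tokens[index + 1:index + 1 + context_words]
    return " ".join(left), tokens[index], " ".join(right)


# Invented example sentence, already split into tokens.
tokens = "il cane del vicino abbaia tutta la notte".split()
print(kwic(tokens, tokens.index("cane"), 2))  # ('il', 'cane', 'del vicino')
```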

Structural attributes

The other attributes in the "la Repubblica" corpus are encoded as structural attributes.
Structural attributes define spans that encompass words.
At the moment, the corpus contains the following structural attributes:
- corpus: encompasses the whole corpus (not very interesting).
- article: encompasses one article.
- title: the title.
- subtitle: the subtitle.
- summary: the summary.
- text: the body of an article.
The following structural attributes can be used to find articles matching the relevant attribute values (e.g., they can be used to limit queries to articles on a certain topic):
- article_id: a unique id assigned to each article (not very interesting).
- article_author: the author of the article.
- article_gen: the genre of the article (two values: news and commento).
- article_top: the topic of an article (chiesa, cronaca, cultura, economia, meteo, politica, scienze, scuola, società, sport, NOCAT).
- article_year: the year of an article (1985-2000).
You can require that a word be within the span of a certain structural attribute simply by naming the attribute in the search pattern.
For example, to search for the proper noun Formica inside subtitles, you can issue this query:
[pos="NPR" & word="Formica" & subtitle]
Conversely, to look for the same proper noun outside subtitles, you can do:
[pos="NPR" & word="Formica" & !subtitle]
The syntax for limiting hits to spans whose structural attribute has a certain value is a bit awkward: you have to label the pattern and then use the CQP constraint specification syntax, as in:
a:"opportunista" :: a.article_top="politica"
This looks for the word "opportunista" only within articles dealing with politics.
a:"opportunista" :: a.article_top="sport"
This looks for the word "opportunista" only within articles dealing with sport.
Notice that a, in the previous two examples, is an arbitrary string used to "label" the pattern. The part following the double colon specifies a constraint on the labeled pattern.
Just one more example. Let us look for the word "sport" only within subtitles and only in the year 1990:
a:[word="sport" & subtitle] :: a.article_year="1990"
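The two parts of such a query, the condition inside the brackets and the constraint after the double colon, can be thought of as two filters applied to the same labeled token. Here is a small Python sketch of that idea on invented data. The dictionary layout, the in_subtitle field and the function names are all assumptions made for the illustration and say nothing about how the corpus is actually stored.

```python
# Invented tokens: each carries its word form, whether it lies inside a subtitle span,
# and the metadata of the article it belongs to.
tokens = [
    {"word": "sport", "in_subtitle": True, "article_year": "1990", "article_top": "sport"},
    {"word": "sport", "in_subtitle": False, "article_year": "1990", "article_top": "sport"},
    {"word": "sport", "in_subtitle": True, "article_year": "1991", "article_top": "cronaca"},
    {"word": "calcio", "in_subtitle": True, "article_year": "1990", "article_top": "sport"},
]


def token_condition(tok):
    """Counterpart of the bracketed part: [word="sport" & subtitle]"""
    return tok["word"] == "sport" and tok["in_subtitle"]


def global_constraint(tok):
    """Counterpart of the part after the double colon: a.article_year="1990" """
    return tok["article_year"] == "1990"


hits = [i for i, tok in enumerate(tokens)
        if token_condition(tok) and global_constraint(tok)]
print(hits)  # [0]
```

Only the first token passes both filters: the second is not in a subtitle, the third is from 1991, and the fourth is a different word.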
At this point, if you want to learn more about the CQP syntax, we strongly recommend reading Stefan Evert's CQP tutorial. Most of the CWB syntactic expressions he describes can also be used when exploring the "la Repubblica" corpus through the advanced query interface (although, of course, the attributes are different from those in his example corpora).

©2004 SSLMIT (University of Bologna).