
Frequency Lists how-to (Repubblica)

This interface allows you to collect frequency lists from the "la Repubblica" corpus. The results are sent to the e-mail address you specify with the query.
This interface is intended for those who want to extract/study statistical properties of Italian and have their own tools to process a frequency list.
The output of queries is an unordered list of n-tuples and their frequencies. The simple query:
will generate a list of frequencies of all the unigrams in the corpus.
A non-negative offset can be specified to collect n-grams. For example, the following query:
pos+0 pos+1
will generate a frequency list of all the sequences of two POS tags in the corpus.
To collect frequencies of sequences of three POS tags, use:
pos+0 pos+1 pos+2
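Writing the offsets by hand gets tedious for longer n-grams. The following is a minimal Python sketch that builds such a query string, assuming only the attribute+offset syntax shown above; `ngram_query` is a hypothetical helper and not part of the interface.

```python
def ngram_query(attribute: str, n: int) -> str:
    """Build a query such as 'pos+0 pos+1 pos+2' for an n-gram of one attribute.

    Hypothetical helper: it only formats the string, it does not submit it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Offsets start at 0 and are non-negative, as in the examples above.
    return " ".join(f"{attribute}+{offset}" for offset in range(n))


print(ngram_query("pos", 3))
```

The resulting string can then be pasted into the query form.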
You can specify constraints with a syntax like the following:
pos+0=/ADJ/ word+1=/riflession.*/
This generates a frequency list of occurrences of the sequence ADJ followed by a word beginning with "riflession", as in
757    ADJ    riflessione
228    ADJ    riflessioni
Notice that if you want to know what these adjectives are, you must issue the following query:
pos+0=/ADJ/ word+0 word+1=/riflession.*/
This returns something like:
2     ADJ    pubblica     riflessione
1     ADJ    frequenti    riflessioni
10    ADJ    seconda      riflessione
30    ADJ    prima        riflessione
1     ADJ    dotte        riflessioni
3     ADJ    autonoma     riflessione
2     ADJ    future       riflessioni
2     ADJ    possibile    riflessione
40    ADJ    ulteriore    riflessione
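Because the list arrives unordered, a common first step is to parse it and sort it by frequency. The sketch below assumes each result line is a frequency followed by whitespace-separated fields, as in the excerpt above, and that no field contains whitespace; `parse_frequency_list` is a hypothetical name, and the real e-mailed output may need extra handling.

```python
from typing import List, Tuple


def parse_frequency_list(text: str) -> List[Tuple[int, Tuple[str, ...]]]:
    """Turn result lines into (frequency, n-tuple) pairs.

    Assumption: the first field is the count, the remaining fields are the tuple.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append((int(fields[0]), tuple(fields[1:])))
    return rows


sample = """\
2     ADJ    pubblica     riflessione
30    ADJ    prima        riflessione
40    ADJ    ulteriore    riflessione
"""

rows = parse_frequency_list(sample)
rows.sort(key=lambda row: row[0], reverse=True)  # most frequent first
for count, ngram in rows:
    print(count, *ngram, sep="\t")
```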
Given that you asked for an adjective in the first slot, the ADJ column is redundant. To specify a constraint without printing the corresponding attribute, prefix it with a question mark. For example, we can repeat the same query as:
?pos+0=/ADJ/ word+0 word+1=/riflession.*/
Now the results look like this:
2     pubblica     riflessione
1     frequenti    riflessioni
10    seconda      riflessione
30    prima        riflessione
1     dotte        riflessioni
3     autonoma     riflessione
2     future       riflessioni
2     possibile    riflessione
40    ulteriore    riflessione
Here, pos+0=/ADJ/ still constrains the results, but it is no longer displayed.
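If you already have a detailed list and want a coarser one, you can also drop columns locally and sum the frequencies of the rows that become identical. The sketch below does this for a few rows taken from the excerpt above; `collapse` is a hypothetical helper, and the totals reflect only these sample rows, not the whole corpus.

```python
from collections import Counter

# (frequency, (adjective, noun form)) rows copied from the excerpt above.
fine_grained = [
    (2, ("pubblica", "riflessione")),
    (1, ("frequenti", "riflessioni")),
    (30, ("prima", "riflessione")),
    (2, ("future", "riflessioni")),
]


def collapse(rows, keep):
    """Keep only the columns listed in `keep` and sum the counts of merged rows."""
    totals = Counter()
    for count, ngram in rows:
        totals[tuple(ngram[i] for i in keep)] += count
    return totals


by_noun = collapse(fine_grained, keep=[1])  # keep only the second column
for key, total in by_noun.most_common():
    print(total, *key, sep="\t")
```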
You can use CQP-style regular expressions in your queries. For example:
word+0=/[aeiou].*/ pos+0=/ADJ/ ?word+1=/rango/
generates a frequency list of adjectives beginning with a vowel and preceding the word "rango".
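To preview which tokens a pattern would accept before submitting a query, you can try it locally. This sketch uses Python's `re` module and assumes that a pattern must match the whole token, which is how CQP regular expressions are generally understood to behave; Python's regex dialect is not identical to CQP's, so treat this as an approximation. The `matches` helper and the token list are made up for illustration.

```python
import re


def matches(pattern: str, token: str) -> bool:
    """Return True if the pattern matches the entire token (assumed CQP-like behaviour)."""
    return re.fullmatch(pattern, token) is not None


# Illustrative tokens only; they are not taken from the corpus.
for token in ["alto", "infimo", "primo", "riflessioni"]:
    print(token, matches(r"[aeiou].*", token), matches(r"riflession.*", token))
```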
You can use all "la Repubblica" positional (word, pos and lemma) and structural attributes (see Advanced Query how-to in the left sidebar) in your queries.

©2004 SSLMIT (University of Bologna).