Explanation of tagged fields and searching

It is most efficient to search for grammatical information in a corpus if the individual words are tagged. Tagging is done by an application called a tagger, which matches each word, according to its context, to specific grammatical information. When tagging is complete, you have the series of words from the text you are working with, each accompanied by a tag.

Here is a sample tagged text in a corpus.

>>Filename:oral_int_pri_V_JAPEC191099.txt >>Mode:oral >>Register:interview Qué;p3cn;int;;qué; le;p3cs;per;;él; parece;vm;ip;3s;parecer; la;lfs;def;;la; publicación;nfs;com;;publicación; donde;p000;rel;;donde; agentes;nmp;com;;agente; de;en;;;de; 'l;lms;def;;el; FBI;n;prop;;fbi; vinculan;vm;ip;3p;vincular; a;en;;;a; 'l;lms;def;;el; PRI;n;prop;;pri; con;en;;;con; el;lms;def;;el; narcotráfico;nms;com;;narcotráfico; y;con;coor;;y; lavado;vm;ps;0s;lavar; de;en;;;de; dinero;nms;com;;dinero; ?;punc;sent;?;; Por;r;;;por; supuesto;r;;;supuesto; que;que;rel;;que; rechazo;vm;ip;1s;rechazar; todo;d3ms;ind;;todo; ese;d3ms;dem;;ese; tipo;nms;com;;tipo; de;en;;;de; aseveraciones;nfp;com;;aseveración;

You can practice searching for tagged data with this tags practice tool.

The tagger designed for the corpus we are examining attaches a tag to every word. The tag begins after the first semicolon ";" of each tagged word. The information is organized in 'fields', which are something like columns in a spreadsheet. The word itself is the first field; the field right after it (field 2) indicates the part of speech of the word. Each field ends with a semicolon ";". Notice that when you see two semicolons in a row ";;", nothing is specified in that field for the word.

tipos;nmp;com;;tipo;

Field 1	Field 2	Field 3	Field 4	Field 5
This is the specific word that is targeted in the search.	This is the basic part-of-speech of the word.	This information is optional.	This information is also optional.	This is the lemma.
tipos;	nmp;	com;	;	tipo;

These tags follow the Biber, Davies, Tracy-Ventura & Jones schema.

For instance, tipos;nmp;com;;tipo; indicates in the second field that tipos is a noun that is masculine and plural. The third field indicates that the word is a common noun; the fourth field holds no information. The fifth and last field is the lemma, or the basic, dictionary form of the word. The lemma for tipos is tipo; for dormí it would be dormir, and for duermas it would also be dormir.
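As a minimal sketch of the five-field layout, a tagged word can be split on its semicolons. Python is used here only for illustration; the names TaggedWord and parse_tagged_word are hypothetical and not part of the corpus or of any search tool.

```python
from typing import NamedTuple


class TaggedWord(NamedTuple):
    word: str    # field 1: the word as it appears in the text
    pos: str     # field 2: basic part of speech
    info3: str   # field 3: optional information
    info4: str   # field 4: optional information
    lemma: str   # field 5: dictionary form


def parse_tagged_word(token: str) -> TaggedWord:
    # Every field ends with ";", so splitting leaves one empty string at the end.
    parts = token.split(";")
    if len(parts) != 6 or parts[-1] != "":
        raise ValueError(f"expected five ';'-terminated fields: {token!r}")
    return TaggedWord(*parts[:5])


tipos = parse_tagged_word("tipos;nmp;com;;tipo;")
print(tipos.pos, repr(tipos.info4), tipos.lemma)  # nmp '' tipo
```

The empty fourth field comes back as an empty string, which is what the two semicolons in a row ";;" stand for.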

You might be thinking that to look for grammatical information, you must consider every possible type of information that any word can have. Fortunately, that is not necessary. As long as you know what information you are looking for in which field, you can do very powerful searches with regular expressions. The reason is that you can make searches like the following:

Any word that is a masculine, singular noun.
Any word that is in the present subjunctive.
Any determiner followed by a common noun.

Let's go back to the tipos case. This is a noun. Notice that all nouns have an 'n' after the first ; (i.e., at the start of the second field). Thus, to start looking for all nouns in a corpus with this tagging scheme (it is not the only one out there), we could erroneously start our search like the following:

;n ...

However, we need to account for the whole tag all the way from the beginning up to the end of the tag or word. So let's build this search one step at a time. You might find this tedious but you will see that this example extends to a whole bunch of possible searches. Let's start with the table we saw above:

Field 1	Field 2	Field 3	Field 4	Field 5
\w+;	\w+;	\w*;	\w*;	\w+;

Any word has a space before and after it. Thus, since \w here means any non-space character and \w+ means one or more non-space characters (that is, any run of characters), we start our tag search with \w+, assuming any word has one or more characters. Likewise, any word has a lemma, so fields 1 and 5 will always have at least one character. Take the words a, en, and los.

Field 1	Field 2	Field 3	Field 4	Field 5
a;	\w+;	\w*;	\w*;	a;
en;	\w+;	\w*;	\w*;	en;
los;	\w+;	\w*;	\w*;	el;

See technical note 1 below.

You could put a;\w+;\w*;\w*;a; into RegexConcord and get instances of a. Likewise with los;\w+;\w*;\w*;el;.
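To try these patterns outside RegexConcord, note that Python's own \w never matches a semicolon, so the pattern has to be translated first. The sketch below assumes, following technical note 1, that RegexConcord reads \w as "any non-space character"; the helper concord_to_python is hypothetical and handles only this one substitution.

```python
import re


def concord_to_python(pattern: str) -> str:
    # Assumption: RegexConcord treats \w as "any non-space character",
    # so it is spelled out as [^ ] for Python's re module.
    return pattern.replace(r"\w", "[^ ]")


sample = "vinculan;vm;ip;3p;vincular; a;en;;;a; 'l;lms;def;;el; PRI;n;prop;;pri;"
tokens = sample.split()  # tagged words are separated by spaces

pattern = re.compile(concord_to_python(r"a;\w+;\w*;\w*;a;"))
hits = [t for t in tokens if pattern.fullmatch(t)]
print(hits)  # ['a;en;;;a;']
```

Matching each space-separated token as a whole with fullmatch keeps the a; at the start of the pattern from matching the tail end of a longer word.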

In any event, returning to our starting point:

Field 1	Field 2	Field 3	Field 4	Field 5
\w+;	\w+;	\w*;	\w*;	\w+;

At the end of the first field there is a semicolon ";". Thus we start out a search for a word with:

\w+;...

If we want to look for a word that is a noun, we would add to this the n in the second field:

\w+;n\w+

Notice that the search string is enclosed between two \w+ sequences. The last \w+ means: after you find the first ;n sequence, ignore the rest of the characters (in the fields) until you come to the end of the tagged word.

After that first semicolon ";" we have to keep in mind that there may or may not be other characters between the n and the end of the second field (i.e., where the next semicolon ";" is). We really don't care what they are at this point. This is a good place to take advantage of regular expressions, because we can say: check whether the first letter of the second field is an n, then skip any further characters until you reach the next semicolon ";", which marks the end of the field. Finally, since we are not going to specify any information for fields 3 through 5, we can just place a \w+ at the end to say: keep going until you find a space:

\w+;n[^;]*;\w+

Field 1	Field 2	Fields 3 through 5
\w+;	n[^;]*;	\w+
Go all the way up to the first semicolon ";"...	Look to see if there is an n and then run all the way up to the next semicolon ";"...	run to the end of the word...
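The noun search can be tried on a few tagged words from the sample text. This is only a sketch: [^ ]+ stands in for RegexConcord's \w+ (assumed to mean "one or more non-spaces"), and each space-separated token is tested as a whole.

```python
import re

sample = (
    "agentes;nmp;com;;agente; de;en;;;de; 'l;lms;def;;el; FBI;n;prop;;fbi; "
    "vinculan;vm;ip;3p;vincular; con;en;;;con; el;lms;def;;el; "
    "narcotráfico;nms;com;;narcotráfico;"
)

# word ; n + rest of field 2 ; everything up to the next space
noun = re.compile(r"[^ ]+;n[^;]*;[^ ]+")

nouns = [t for t in sample.split() if noun.fullmatch(t)]
print(nouns)
# ['agentes;nmp;com;;agente;', 'FBI;n;prop;;fbi;',
#  'narcotráfico;nms;com;;narcotráfico;']
```

The prepositions de and con are tagged en, which contains an n but does not start with one, so they are correctly left out.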

If we wanted to look for all common nouns, we could refine our search further with a 'com' in the third field:

\w+;n[^;]*;com;\w+

Notice that if we leave the third field unspecified, we are looking for both common and non-common nouns (i.e., all nouns):

\w+;n[^;]*;\w+
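The difference between the two searches shows up on a handful of tokens taken from the sample text. As before, this is a sketch in which [^ ]+ replaces RegexConcord's \w+; the helper name matching_words is hypothetical.

```python
import re

tokens = (
    "agentes;nmp;com;;agente; FBI;n;prop;;fbi; PRI;n;prop;;pri; "
    "dinero;nms;com;;dinero; lavado;vm;ps;0s;lavar;"
).split()

all_nouns = re.compile(r"[^ ]+;n[^;]*;[^ ]+")
common_nouns = re.compile(r"[^ ]+;n[^;]*;com;[^ ]+")


def matching_words(pattern, tagged_tokens):
    # Return only field 1 (the word) of each token the pattern matches in full.
    return [t.split(";")[0] for t in tagged_tokens if pattern.fullmatch(t)]


print(matching_words(all_nouns, tokens))     # ['agentes', 'FBI', 'PRI', 'dinero']
print(matching_words(common_nouns, tokens))  # ['agentes', 'dinero']
```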

Step back for a second and look at the sequence [^;]*;. Again, the [^;]*; is a lifesaver. Whereas \w+ means any word (forgetting its tag), [^;]*; means any field whether it has information in it or not.

How do we get to a particular field? This is where the grouping syntax comes in. You can add a quantifier to a group. Thus:

(?:[^;]*;){2} = any 2 fields

(?:[^;]*;){3} = any 3 fields

So, since every tag has 5 fields (1 is the word, 2 through 4 give grammatical information, and 5 is the lemma) we could simplify our search for all nouns by looking for ;n[^;]*; followed by 3 [^;]*;. That is:

\w+;n[^;]*;(?:[^;]*;){3}

Field 1	Field 2	Fields 3 through 5
\w+;	n[^;]*;	(?:[^;]*;){3}
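A short sketch makes the counting visible: each repetition of the group consumes exactly one field, whether or not the field is empty. The final pattern again uses [^ ]+ for RegexConcord's \w+, which is an assumption about how that tool reads \w.

```python
import re

token = "tipos;nmp;com;;tipo;"

# Each repetition of (?:[^;]*;) consumes one field, empty or not.
two_fields = re.match(r"(?:[^;]*;){2}", token).group()
three_fields = re.match(r"(?:[^;]*;){3}", token).group()
print(two_fields)    # tipos;nmp;
print(three_fields)  # tipos;nmp;com;

# Field 1, a field 2 that starts with n, then exactly three more fields.
noun = re.compile(r"[^ ]+;n[^;]*;(?:[^;]*;){3}")
print(bool(noun.fullmatch(token)))          # True
print(bool(noun.fullmatch("de;en;;;de;")))  # False
```

Because the pattern now accounts for all five fields, a match against the whole token cannot start or stop in the middle of a tag.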

You could use this to isolate a field, like the lemma. Say you were looking for all verbs whose lemma was ser. This would be accomplished by:

\w+;v[^;]*;(?:[^;]*;){2}ser;

Field 1	Field 2	Fields 3 through 4	Field 5
\w+;	v[^;]*;	(?:[^;]*;){2}	ser;
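The lemma search can be sketched as follows. The tokens here are invented for illustration and only modeled on the tags in the sample text (for example, vm;ip;3s as in parece); they do not come from the corpus. [^ ]+ again stands in for RegexConcord's \w+.

```python
import re

# Hypothetical tagged words, modeled on the sample's tag layout.
tokens = (
    "es;vm;ip;3s;ser; son;vm;ip;3p;ser; parece;vm;ip;3s;parecer; "
    "cose;vm;ip;3s;coser; seres;nmp;com;;ser;"
).split()

# word ; verb tag ; skip fields 3 and 4 ; lemma must be exactly "ser"
ser_verb = re.compile(r"[^ ]+;v[^;]*;(?:[^;]*;){2}ser;")

hits = [t.split(";")[0] for t in tokens if ser_verb.fullmatch(t)]
print(hits)  # ['es', 'son']
```

Skipping exactly two fields pins ser; to the start of field 5, so lemmas that merely end in ser (parecer, coser) are rejected, and so is the noun whose lemma is ser.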

Any verb = ;v[^;]*;(?:[^;]*;){3}

Any main verb of a clause = ;vm[^;]*;(?:[^;]*;){3}

Any adjective = ;j[^;]*;(?:[^;]*;){3}
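The same building blocks extend to sequences of words, such as a determiner followed by a common noun. The sketch below checks neighbouring tokens one pair at a time. It assumes that determiners carry a tag beginning with d, which is inferred only from the sample text (todo;d3ms;ind and ese;d3ms;dem) and not confirmed by the schema; [^ ]+ again replaces RegexConcord's \w+.

```python
import re

tokens = (
    "rechazo;vm;ip;1s;rechazar; todo;d3ms;ind;;todo; ese;d3ms;dem;;ese; "
    "tipo;nms;com;;tipo; de;en;;;de; aseveraciones;nfp;com;;aseveración;"
).split()

# Assumption: a field 2 that starts with "d" marks a determiner.
determiner = re.compile(r"[^ ]+;d[^;]*;(?:[^;]*;){3}")
common_noun = re.compile(r"[^ ]+;n[^;]*;com;(?:[^;]*;){2}")

pairs = [
    (first.split(";")[0], second.split(";")[0])
    for first, second in zip(tokens, tokens[1:])
    if determiner.fullmatch(first) and common_noun.fullmatch(second)
]
print(pairs)  # [('ese', 'tipo')]
```

The word todo is also tagged as a determiner here, but it is followed by another determiner rather than a noun, and the preposition de is skipped because its tag is en.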

Technical notes

Technical side note 1

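In most regular expression engines, \w means any word character: a letter, a digit, or the underscore. That definition works well for English, but it often covers only unaccented letters. So, if a sentence you are searching is: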

ao contrário da crença popular, lipsum (lorem ipsum abreviado) não é simplesmente um texto qualquer com um monte de letras.

The sequence \w will only represent these letters, since English uses them:

abcdeilmnopqrstuvx

But we want it to include these as well:

áãçé

This is a problem for languages like Spanish or Portuguese (the language of the sentence above). For example, the search:

\w+ \w+

would find:

ao contr

rather than:

ao contrário

RegexConcord solves this by defining \w as anything that isn't a space; in regular expression terms:

[^ ]

So, any word is:

[^ ]+
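The contrast can be reproduced in Python as a rough illustration. Python 3's \w already matches accented letters by default, so the sketch passes the re.ASCII flag to imitate an engine whose \w covers only unaccented letters, digits, and the underscore.

```python
import re

sentence = "ao contrário da crença popular"

# re.ASCII restricts \w to [a-zA-Z0-9_], the English-only behaviour.
ascii_pair = re.search(r"\w+ \w+", sentence, flags=re.ASCII).group()
print(ascii_pair)     # ao contr

# Treating a word as a run of non-space characters keeps the accented letters.
nonspace_pair = re.search(r"[^ ]+ [^ ]+", sentence).group()
print(nonspace_pair)  # ao contrário
```

Keep in mind that [^ ] also accepts punctuation and semicolons, which is what lets \w+ run across a whole tagged word in the searches above.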