Introduction to regular expressions

Note: This introduction/review of regular expressions contains links to practice exercises. They are meant to help you learn the material as you read the following.

Regular expressions are a search tool that is especially useful to language experts, who tend to understand a bit more of the anatomy of a word than most people. They let you search (a) large quantities of text for (b) precise patterns. Think of them as a text-search system with a wide range of wild card characters. To get up to speed, you need to know how to do the following:

How to search with the basic wild card sequence of: \w
How to search for collections of items, such as any word ending in an 'a' or an 'e'.
How to indicate how many characters or items you want to find.
How to group items.

Special characters

\w = any word character (a letter, a digit, or the underscore; not spaces or punctuation)

E.g.

\w\w\w\wé = a word made up of any 4 letters followed by an 'é', e.g., 'canté', 'hablé'

a \w\w\w\war = the word 'a' followed by a word made up of any 4 letters followed by 'ar', e.g., 'a mandar', 'a cantar'

Comments:

This is one of the most basic sequences in regular expressions (i.e., regex). Because it covers any letter, you can use it to search for a lot of things. If you combine this wild card of sorts with ordinary, non-wild-card characters, you can really narrow down your search. Let's say you need to find all 6-letter words ending in 'dad'. You would search for:

\w\w\wdad

[Actually, it is very rare that one searches for words of a specific length. What you will find most useful is the following: \w+ = any word with one or more characters, which is basically the way to look for any word. See quantifiers below!]
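The two searches above can be tried out in a short Python sketch. It assumes Python's built-in re module rather than the search tool this page links to, and the word list is made up for illustration. re.fullmatch is used so that each pattern has to cover one whole word, which is how the examples above are described.

```python
import re

# Hypothetical word list, only for illustration
words = ["canté", "hablé", "verdad", "ciudad", "casa"]

# Any 4 word characters followed by a literal 'é'
four_plus_e = re.compile(r"\w\w\w\wé")

# Any 3 word characters followed by the literal letters 'dad' (6 letters in all)
six_ending_dad = re.compile(r"\w\w\wdad")

# fullmatch succeeds only if the pattern covers the whole word
ends_in_e = [w for w in words if four_plus_e.fullmatch(w)]
ends_in_dad = [w for w in words if six_ending_dad.fullmatch(w)]

print(ends_in_e)    # ['canté', 'hablé']
print(ends_in_dad)  # ['verdad', 'ciudad']
```

Note that the letters 'dad' are typed as plain characters with no backslash in front of them; only the wild card \w carries a backslash.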

Link to practice page for basics

Character classes

[...] = a subgroup of letters or characters that you specify

[^...] = a subgroup of letters or characters that you want to exclude (i.e., the [^ ... ] means exclude any character in this set)

For example: Instead of looking at sequences of \w, you want to look at sequences of vowels. The character class would be: [aeiouáéíóúü]

[aeiouáéíóúü][aeiouáéíóúü]\w\w\w\w = any word beginning with 2 vowels followed by 4 characters

\w\w\w\w\w\w[áéó] = any 7 letter word ending in 'á', 'é', or 'ó'

\w\w\w\w\w\w[^áéó] = any 7 letter word not ending in 'á', 'é', or 'ó'

Comments:

Character sets are very powerful, as you can use them to narrow down a search to certain letters in certain places in a word. If you need to look for 8 letter words starting with 'pre' followed by a 'b', 'd', or 'g', this can be accomplished quite easily:

pre[bdg]\w\w\w\w

Notice, however, that a character set stands for exactly one character: when the regex 'engine' looks at [bdg], it checks whether the next letter is one of the letters in the set. So, for instance, to look for a combination of two vowels, you would write the set twice and place it somewhere in your search sequence:

[aeiouáéíóúü][aeiouáéíóúü]

If you place a '^' at the beginning of a character set, you tell the regex engine to look for anything that is not in the set. So, you would think that if you typed [^aeiouáéíóúü] you were searching for any letter that is not a vowel. Wrong! It looks for any character that is not a vowel, including spaces and punctuation. So, if you wanted to find all words with two consonants in a row, you should at least include the space ' ' in your exclusion set (along with any punctuation marks you also want to rule out):

[^aeiouáéíóúü ][^aeiouáéíóúü ]

Notice the space at the end of the character set!

So, let's say we were looking for all 4 letter words that end in 2 non vowels. You would write:

\w\w[^aeiouáéíóúü ][^aeiouáéíóúü ]
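To see why the space matters, here is a minimal sketch, again assuming Python's re module and a made-up word list. It first shows that a negated set accepts a space or a comma, and then runs the 4 letter search from above against whole words.

```python
import re

vowels = "aeiouáéíóúü"

# A negated set matches ANY single character outside the set,
# not only consonants
not_vowel = re.compile(f"[^{vowels}]")
print(bool(not_vowel.fullmatch(" ")))  # True: a space is "not a vowel"
print(bool(not_vowel.fullmatch(",")))  # True: so is a comma
print(bool(not_vowel.fullmatch("a")))  # False

# 4 letter words ending in 2 non vowels (space excluded as well)
four_ending_cc = re.compile(rf"\w\w[^{vowels} ][^{vowels} ]")

# Hypothetical word list, only for illustration
words = ["vals", "zinc", "casa", "tren"]
matches = [w for w in words if four_ending_cc.fullmatch(w)]
print(matches)  # ['vals', 'zinc']
```

The variable vowels is only a convenience so the set does not have to be typed three times; the patterns that reach the regex engine are the same ones shown above.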

Try it out for yourself!

Link to practice page character sets

Quantifiers

These specify how many times the preceding character, special character, or character set may repeat.

* = 0 or more
+ = 1 or more

{n} = a certain number

{n,m} = a range of numbers (n=min; m=max)

{n,} = a certain number or more

E.g.

\w{3} = A word with 3 letters

\w{5} = a word with 5 letters

\w{3}í = a word beginning with any 3 letters ending with 'í', e.g., 'salí', 'bebí'

\w{4,5}o = a word with any 4 or 5 letters ending in an 'o', e.g., 'libro', 'canto', 'duermo'

\w*a = a word of any number of any characters ending in an 'a', e.g., 'quisiera', 'podía', 'necesitaba', 'va'

\w+ = any word

\w+a \w+ar = any word ending in an 'a' followed by any word ending in 'ar', e.g., 'para pensar', 'pueda pasar', 'podría entrar'

pre\w+ = any word beginning with 'pre', e.g., 'preparación', 'precisamente', 'prender'

Comments:

Quantifiers do exactly what they say: they tell the regex engine how many repetitions of the preceding item to look for. You could look for a certain number of any characters:

A word with any 4 characters = \w{4}

A word with 1 or more characters = \w+

A word with 9 or more characters (i.e., really long words) = \w{9,}

You could also look for a certain number of character sets:

A word ending in between 2 and 4 vowels = \w+[aeiouáéíóúü]{2,4}
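The quantifiers can be compared side by side in a short sketch. It assumes Python's re module; the helper whole_word_matches and the word list are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical word list, only for illustration
words = ["sol", "libro", "duermo", "va", "preparación", "necesitaba"]

def whole_word_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [w for w in candidates if compiled.fullmatch(w)]

three_letters = whole_word_matches(r"\w{3}", words)
ends_in_o = whole_word_matches(r"\w{4,5}o", words)
ends_in_a = whole_word_matches(r"\w*a", words)
long_words = whole_word_matches(r"\w{9,}", words)

# A quantifier can also follow a character set
vowel_run = whole_word_matches(r"\w+[aeiouáéíóúü]{2,4}", ["podía", "casa"])

print(three_letters)  # ['sol']
print(ends_in_o)      # ['libro', 'duermo']
print(ends_in_a)      # ['va', 'necesitaba']
print(long_words)     # ['preparación', 'necesitaba']
print(vowel_run)      # ['podía']
```

In each case the quantifier applies only to the item immediately before it: in \w{4,5}o the braces count the \w, and the final 'o' is matched exactly once.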

Practice quantifiers

Groupings

For options:

(?:a|b|c) = either a, or b or c

E.g., \w+(?:a|o|as|os) = any word ending in 'a', 'o', 'as', or 'os', e.g., 'casa', 'bajo', 'pasas', 'pasos'

Actually, if you do this search, you will find that many false alarms come up. When we move on to working with the tagged corpus, you'll see that we can tell the regex engine to look for specific grammatical properties in a word very easily, such as whether a verb is in the preterite and whether its infinitive ends in 'ar'.

For quantifiers:

(?:...) followed by any quantifier = the quantifier applies to the whole group

E.g., (?:ar){2} = the sequence 'arar' (i.e., 'ar' twice in a row).

Thus, \w*(?:ar){2}\w* = any word that contains 'arar', with zero or more letters before it and zero or more letters after it, e.g., 'preparar', 'equipararnos', 'amarartiarse', 'separar'

\w*(?:ar)+\w* = the same search, but with one or more 'ar's in a row within a word, e.g., 'escuchar', 'preparar', 'comparaba'
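The grouping searches above can be sketched as follows, assuming Python's re module and a made-up word list. As in the earlier descriptions, each pattern has to cover one whole word.

```python
import re

# Hypothetical word list, only for illustration
words = ["casa", "bajo", "pasas", "pasos", "preparar", "escuchar", "verde"]

# Group used for options: the word must end in one of the alternatives
ending = re.compile(r"\w+(?:a|o|as|os)")

# Group used with a quantifier: the {2} and the + apply to 'ar' as a unit
double_ar = re.compile(r"\w*(?:ar){2}\w*")
any_ar = re.compile(r"\w*(?:ar)+\w*")

with_ending = [w for w in words if ending.fullmatch(w)]
with_arar = [w for w in words if double_ar.fullmatch(w)]
with_ar = [w for w in words if any_ar.fullmatch(w)]

print(with_ending)  # ['casa', 'bajo', 'pasas', 'pasos']
print(with_arar)    # ['preparar']
print(with_ar)      # ['preparar', 'escuchar']
```

Without the group, a quantifier would apply only to the last letter: ar{2} means 'a' followed by two 'r's ('arr'), not 'arar'.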

Practice groupings

Look aheads

(?!...) = negative lookahead (the text that follows must not match ...)

(?<!...) = negative lookbehind (the text that precedes must not match ...)

E.g.,

de(?![aeiou])\w+ = any word, of any length, that begins with 'de' where the 'de' is not immediately followed by a vowel, e.g., 'desde', 'dentro'
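A negative lookahead can be tested with the sketch below. It assumes Python's re module, where a negative lookahead is written (?!...) with no '=' sign; the word list is made up for illustration. The last two lines show what happens if an '=' is placed inside the lookahead.

```python
import re

# Hypothetical word list, only for illustration
words = ["dentro", "desde", "deuda", "deber", "dedo", "casa"]

# 'de', then a check that the next character is NOT a vowel,
# then the rest of the word
de_no_vowel = re.compile(r"de(?![aeiou])\w+")
kept = [w for w in words if de_no_vowel.fullmatch(w)]
print(kept)  # ['dentro', 'desde', 'deber', 'dedo']

# With an '=' inside, the lookahead only rules out a literal '='
# followed by a vowel, so 'deuda' is no longer filtered out
with_equals = re.compile(r"de(?!=[aeiou])\w+")
print(bool(with_equals.fullmatch("deuda")))  # True
```

The lookahead itself consumes no characters: after the check, \w+ still has to match the letters that follow 'de'.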

Now go on to an explanation of regular expressions and tags.

Extras

If your browser/system doesn't allow you to easily type in Spanish characters, you can use one of the following code sequences instead.

á	\x87
é	\x8e
í	\x92
ó	\x97
ú	\x9c
¿	\xc0
¡	\xc1
Á	\xe7
É	\x83
Í	\xea
Ó	\xee
Ú	\xf2
ñ	\x96
Ñ	\x84
ü	\x9f