Spanish corpus linguistics

This site contains tools for working with Spanish corpora. I, Joe Collentine, designed them for my students and, of course, for research colleagues.

You can use the freely available Regex Concord, a concordance app that permits simple searches as well as searches using regular expressions. There are also a number of instructional videos, useful for any corpus research project, that introduce both the basics and the more complex issues of corpus research.

You may also download the freely available Simple Search app. It is currently a bit experimental, containing some metrics that Regex Concord does not have. I encourage you to play around with both.

By the fall semester of 2018 I will update both Regex Concord and Simple Search, so please check back. Write me with any questions you have about the apps.

Here are some tutorials for using regular expressions in your searches. They are useful both for general purposes (i.e., there are other corpus tools that use regular expressions) and for using Regex Concord. You can complete the video and/or the text-based tutorials. Each has an interactive practice component.
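To give a feel for what a regular-expression search over Spanish text looks like, here is a minimal sketch in Python. It is not how Regex Concord works internally; the `concordance` function, its `width` parameter, and the sample sentence are all made up for illustration.

```python
import re

def concordance(text, pattern, width=30):
    """Return one keyword-in-context line for every regex match in text.

    Hypothetical helper for illustration only; not part of Regex Concord.
    """
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample sentence (an assumption, not taken from any corpus).
sample = "Ayer hablé con María. Ellos hablaron mucho y nosotros hablamos poco."

# \bhabl\w+ matches any word that begins with "habl". In Python 3, \w
# also matches accented letters such as "é" when searching a str.
hits = concordance(sample, r"\bhabl\w+")
for line in hits:
    print(line)
```

A single pattern retrieves several forms of hablar at once, which is the main reason to learn regular expressions for corpus work.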

The Swiss Army knife of all corpus work is a text-based word processor. These suggestions handle both regular expressions and international characters (i.e., encodings such as UTF-8) very nicely.

Full tags

The part-of-speech tags that I use are based on the original set of tags that Doug Biber, Mark Davies, Nicole Tracy-Ventura, and Jim Jones used when they developed their initial version of the Corpus del español. I still use these in the corpora that I tag with the Python-based tagger that I have built using NLTK.

Yuly Asención Delaney and I altered this tagging schema a bit to accommodate learner corpora.

Simplified tags

I have learned that these tags can be a bit verbose, so I have developed a simplified Spanish tag set.

Students and colleagues using the corpora that I have tagged may end up working with either version, and so I have provided definitions of both tagging schemes.

This is normally reserved for my students. The link contains corpora that we use in our graduate courses, such as Morfosintaxis española and Sociolingüística.

If you're authorized, you can log in to download Spanish corpora.
