On 27 and 28 May 2013, Dr Andrew Hardie (University of Lancaster) visited our Institute and gave two lectures:
- Annotation and analysis: an overview of tools and techniques
The corpus research infrastructure at Lancaster’s UCREL research centre is built around a number of standard tools for (a) automated annotation at various levels of language, for instance part-of-speech tagging and semantic tagging, and (b) indexing, searching and analysing the resulting data. In this presentation, I will provide an introductory overview of the nature of these tools and how we make them work together. The presentation will conclude with a live (internet connection permitting!) demonstration of the analytic possibilities afforded by the CQPweb software when it operates across fully annotated corpus data, looking in particular at different approaches to collocational phenomena.
- Applying cluster analysis to the problem of text-type classification
(co-author Ghada Mohamed)
This presentation illustrates (a) a new approach to the bottom-up analysis of text types based on cluster analysis, and (b) its cross-linguistic applicability, exemplified through analyses of English and Arabic corpora. Although there exist many different approaches to the classification of texts into categories, most such work can be considered top-down in orientation. Such approaches must, therefore, be complemented by bottom-up approaches where categorisation is based on features internal to the language of the texts; the most widely known approach of this kind is Biber’s (1988) Multi-Dimensional (MD) analysis of English, extended to cross-linguistic text typology by Biber (1995). Biber’s methodology is based on a multivariate statistical technique, factor analysis; this presentation will explore an alternative methodology for establishing text-type categories based on cluster analysis. Work using the British National Corpus and the Leeds Corpus of Contemporary Arabic shows cluster analysis to be a powerful tool for structuring frequency data from the automated retrieval of lexico-grammatical features, provided its output is interpreted with care.
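To give a rough sense of what clustering texts on feature frequencies can look like, here is a minimal sketch in Python. It is not the procedure used in the talk: the abstract does not say which clustering algorithm, distance measure or feature set was used, so the average-linkage agglomerative method, the Euclidean distance on z-scores, the three feature names and all frequency values below are assumptions invented purely for illustration.

```python
import math

# Hypothetical frequencies per 1,000 words of three lexico-grammatical
# features. Text names, features and values are invented for illustration.
FEATURES = ["past_tense", "first_person_pronoun", "nominalisation"]
TEXTS = {
    "news_1": [52.0, 3.0, 30.0],
    "news_2": [49.0, 4.0, 33.0],
    "chat_1": [20.0, 61.0, 5.0],
    "chat_2": [23.0, 58.0, 7.0],
    "paper_1": [15.0, 6.0, 60.0],
    "paper_2": [17.0, 5.0, 57.0],
}


def standardise(rows):
    """Convert each feature column to z-scores so no feature dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column has zero variance.
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def cluster(vectors, k):
    """Bottom-up clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


names = list(TEXTS)
vectors = standardise([TEXTS[n] for n in names])
groups = [sorted(names[i] for i in c) for c in cluster(vectors, k=3)]
for group in groups:
    print(group)
```

The number of clusters (k=3) is itself a choice made by the analyst here, which is one reason output of this kind needs to be interpreted with care: the algorithm will always return k groups, whether or not they correspond to meaningful text types.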