Paul Rayson

On 11th and 12th June 2012, Paul Rayson (Lancaster University) visited our Institute and gave two lectures:

Adapting a semantic field tagging system for Early Modern English

Automatic semantic annotation approaches can be used for many applications such as content analysis of political discourse, opinion mining and sentiment analysis. These applications are becoming very popular for text mining and analytics of material sourced from online social networks and the web. In this talk, I will highlight another application of semantic annotation: conceptual history, or the history of ideas, which entails applying computational and corpus-based techniques to large volumes of transcribed historical documents in order to determine changes in the use and meaning of concepts over time. I will describe in some detail our existing semantic annotation system (USAS), which was developed for modern English, and then explain how we plan to adapt it to historical text collections, in particular those from the Early Modern English period, such as Early English Books Online. The first major issue we have tackled is historical spelling variation: non-modern variants are ubiquitous in these datasets and have been shown to cause significant problems for Natural Language Processing (NLP) tools. I will describe the Variant Detector (VARD) tool, developed at Lancaster, and explain the methods it employs to detect a historical spelling variant and match it to a modern form. The original variant is retained, but the modern form can be used for tagging and in corpus retrieval tools, which makes them more accurate and robust. Finally, I will outline the next steps in the development of the semantic tagger for Early Modern English.
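As a rough illustration of the normalisation idea described in the abstract, the Python sketch below flags a token as a possible variant when it is missing from a modern word list, proposes the closest modern form, and keeps the original alongside it. This is not VARD's actual method: the tiny lexicon, the `normalise` function, the use of `difflib` string similarity and the 0.7 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical toy lexicon; a real system would use a large modern dictionary.
MODERN_LEXICON = {"love", "virtue", "would", "have", "the", "of", "joy", "been", "known"}


def normalise(tokens, lexicon=MODERN_LEXICON, cutoff=0.7):
    """Return (original, modern) pairs for each token.

    A token is treated as a candidate variant when its lowercase form is
    not in the lexicon. The closest lexicon entry above the similarity
    cutoff is proposed as the modern form; otherwise the lowercase token
    is kept unchanged. The original spelling is always retained.
    """
    candidates = sorted(lexicon)  # sorted so results are reproducible
    pairs = []
    for token in tokens:
        lower = token.lower()
        if lower in lexicon:
            # Already a modern form: nothing to normalise.
            pairs.append((token, lower))
            continue
        # Assumption: plain string similarity stands in for variant matching.
        matches = difflib.get_close_matches(lower, candidates, n=1, cutoff=cutoff)
        pairs.append((token, matches[0] if matches else lower))
    return pairs


if __name__ == "__main__":
    for original, modern in normalise(["The", "loue", "of", "vertue"]):
        print(f"{original}\t{modern}")
```

Keeping both forms in each pair mirrors the point made in the abstract: downstream tagging and retrieval can work on the modern form while the original spelling remains available.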

Exploring interoperability between corpus tools

The corpus methodology is now well established in linguistics, and many more researchers are using software tools to produce frequency lists, examine concordances, and extract key words, collocations and n-grams. Academics from related areas such as literary stylistics and translation studies may face only a shallow learning curve with corpus tools. However, given that corpus methods are beginning to reach other areas of the social sciences and humanities (e.g. psychology and history), it is vital that we re-evaluate computational approaches and software support for these methods, because the learning curve will be much steeper for researchers in those communities. In this talk, I will describe a project looking at interoperability between corpus tools, in which we have focused on four core web-based tools (Wmatrix, CQPweb, Intellitext and WordTree). We have surveyed the four tools and other similar systems (Manatee, Sketch Engine, BNCweb, eMargin) and considered ways in which we can link methods and components together. By connecting them we aim to open up new pathways of enquiry and to encourage researchers to try research methods that are established in other disciplinary communities but not yet familiar in their own. Integrating the corpus tools will make it considerably easier for researchers to move from one platform to another and to draw on the affordances of different tools to enhance research outcomes in their own fields. Integration will also help tool developers link to complementary components and interfaces in other existing tools.
