Cross-linguistic stylometric features: A preliminary investigation

Patrick Juola, Georgios Mikros

Research output: Contribution to conference, Paper, peer-review

Abstract

Stylometric analysis (the study of the writing style of a document's author, either to determine his/her identity or personal characteristics) is an important problem in text analysis and information retrieval, with many real-world applications. It is generally limited by a need for reference documents that are representative of the unknown documents to be analyzed. This paper addresses the issue of analysis with highly unrepresentative documents, and specifically the question of whether elements of writing style can be shown to vary systematically with the individual irrespective of the language of writing. We identify fourteen Twitter users who post bilingually in both Spanish and English. An analysis of several standard linguistic and Twitter-specific extra-linguistic variables shows both that there is a substantial amount of individual variation along these variables and, more importantly, that the variations correlate very strongly across languages. In other words, an individual who scores highly along one axis in English is also very likely to score highly on that axis in Spanish. These findings strongly suggest that cross-linguistic individual authorship features can be developed that, in turn, will enable accurate stylistic analysis across language barriers.
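
The cross-language comparison described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the feature (hashtags per token), the corpus layout (author mapped to lists of English and Spanish tweets), and the use of Pearson correlation. The abstract does not state which variables or which statistic the authors actually used.

```python
from math import sqrt


def hashtag_rate(tweets):
    """Hashtags per whitespace-separated token (a hypothetical feature)."""
    tokens = [tok for tweet in tweets for tok in tweet.split()]
    if not tokens:
        return 0.0
    return sum(tok.startswith("#") for tok in tokens) / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def cross_language_correlation(corpus, feature):
    """Correlate one feature across languages, one data point per author.

    `corpus` is assumed to map author -> {"en": [tweets], "es": [tweets]}.
    """
    authors = sorted(corpus)
    english = [feature(corpus[a]["en"]) for a in authors]
    spanish = [feature(corpus[a]["es"]) for a in authors]
    return pearson(english, spanish)
```

A coefficient near 1 for a given feature would correspond to the pattern the abstract reports: authors who score highly on that feature in English also tend to score highly on it in Spanish.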
Original language: English
Publication status: Published - 2016
Externally published: Yes
Event: JADT 2016 - 13th International Conference on Statistical Analysis of Textual Data, France
Duration: 7 Jun 2016 to 10 Jun 2016