Abstract
Stylometric analysis, the study of the writing style of a document's author in order to determine either the author's identity or personal characteristics, is an important problem in text analysis and information retrieval, with many real-world applications. It is generally limited by a need for reference documents that are representative of the unknown documents to be analyzed. This paper addresses the issue of analysis with highly unrepresentative documents, and specifically the question of whether elements of writing style can be shown to vary systematically with the individual irrespective of the language of writing. We identify fourteen Twitter users who post in both Spanish and English. An analysis of several standard linguistic and Twitter-specific extra-linguistic variables shows both that there is substantial individual variation along these variables and, more importantly, that the variations correlate very strongly across languages. In other words, an individual who scores highly along one axis in English is also very likely to score highly on that axis in Spanish. These findings strongly suggest that cross-linguistic individual authorship features can be developed that, in turn, will enable accurate stylistic analysis across language barriers.
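
The cross-language comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the variables or the correlation statistic used, so the feature (`mean_word_length`), the choice of Pearson correlation, the helper `cross_language_correlation`, and the toy tweets below are all assumptions made purely for illustration. The idea is to score each author on one feature in English and in Spanish, then correlate the two score lists across authors.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_word_length(text):
    """Hypothetical stylistic feature: average whitespace-delimited token length."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def cross_language_correlation(tweets_by_author, feature):
    """Correlate one feature across authors between English and Spanish.

    tweets_by_author maps author -> {"en": [tweets], "es": [tweets]}.
    """
    en_scores, es_scores = [], []
    for langs in tweets_by_author.values():
        en_scores.append(feature(" ".join(langs["en"])))
        es_scores.append(feature(" ".join(langs["es"])))
    return pearson(en_scores, es_scores)


# Invented placeholder data, NOT taken from the study.
toy_corpus = {
    "user_a": {"en": ["ok go"], "es": ["si ya"]},
    "user_b": {"en": ["hello there"], "es": ["hola amigo"]},
    "user_c": {"en": ["extraordinary"], "es": ["extraordinario"]},
}

r = cross_language_correlation(toy_corpus, mean_word_length)
print(f"cross-language correlation for mean word length: {r:.3f}")
```

A value of `r` near 1 for a given feature would correspond to the pattern the abstract reports: authors who score highly on that feature in English also tend to score highly in Spanish. With only a handful of authors, any such coefficient should be read cautiously.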
| Original language | English |
|---|---|
| Publication status | Published - 2016 |
| Externally published | Yes |
| Event | JADT 2016 - 13th International Conference on Statistical Analysis of Textual Data, France. Duration: 7 Jun 2016 → 10 Jun 2016 |
Conference
| Conference | JADT 2016 - 13th International Conference on Statistical Analysis of Textual Data |
|---|---|
| Country/Territory | France |
| Period | 7/06/16 → 10/06/16 |