Abstract
This study addresses authorship attribution and author gender identification in a corpus of blogs written in Modern Greek. The corpus contains 20 bloggers equally divided by gender (10 males and 10 females), with 50 blog posts from each author (1,000 posts in total and an overall size of 406,460 words). On this corpus we calculated a number of standard stylometric variables (e.g. word length statistics and various vocabulary “richness” indices) and the 300 most frequent word and character n-grams (character and word unigrams, bigrams, and trigrams). Support Vector Machines (SVM) were trained on these data, and author gender prediction in a 10-fold cross-validation experiment reached 82.6% accuracy, a result comparable to current state-of-the-art author profiling systems. Authorship attribution accuracy reached 85.4%, an equally satisfying result given the large number of candidate authors (n=20).
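The feature extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, whitespace tokenization, lowercasing, and the use of relative frequencies are all assumptions made for illustration, and only two simple stylometric measures (mean word length and type-token ratio) stand in for the wider set of variables the study mentions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (assumed: lowercase, whitespace split)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_ngrams(posts, extractor, n, k=300):
    """Return the k most frequent n-grams across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(extractor(post, n))
    return [gram for gram, _ in counts.most_common(k)]


def feature_vector(post, vocabulary, extractor, n):
    """Relative frequency of each vocabulary n-gram in a single post."""
    grams = extractor(post, n)
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero on empty posts
    return [counts[gram] / total for gram in vocabulary]


def stylometric_features(post):
    """Two illustrative stylometric variables (hypothetical selection)."""
    words = post.lower().split()
    if not words:
        return {"mean_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Vectors built this way, one per post, could then be passed to any SVM implementation and scored with 10-fold cross-validation, for example scikit-learn's `SVC` together with `cross_val_score(..., cv=10)`; that pairing is a suggestion, not something the abstract specifies.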
| Original language | English |
|---|---|
| Number of pages | 12 |
| Publication status | Published - 2012 |
| Externally published | Yes |
| Event | QUALICO 2012 - Belgrade, Serbia. Duration: 26 Apr 2012 → 29 Apr 2012 |
Conference
| Conference | QUALICO 2012 |
|---|---|
| Country/Territory | Serbia |
| City | Belgrade |
| Period | 26/04/12 → 29/04/12 |