Abstract
This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts
(ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them
significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly,
from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of
authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental
setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data
(e.g. blog posts or emails).
The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects
aggregated relevance scores of probable authors. The second approach is a fastText classifier (Joulin et al. 2016) utilising recent
advances in natural language processing such as vector-based word representations obtained through neural network training.
Depending on the number of test samples used for classification, recall at the 30th rank of the prediction lists ranges from 44 to 75 per
cent. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution.
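To make the KNN idea and the rank-based recall measure concrete, the following is a minimal sketch and not the implementation evaluated in the paper. The abstract does not specify the vectorisation, the similarity measure or the aggregation rule, so the bag-of-words vectors, cosine similarity and summed neighbour scores used here are assumptions, as are all function names (`vectorise`, `rank_authors`, `recall_at_rank`).

```python
import math
from collections import Counter, defaultdict


def vectorise(text):
    """Turn a text sample into an L2-normalised bag-of-words vector (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def rank_authors(query, training, k=3):
    """Rank candidate authors by the summed similarity of the k nearest training samples.

    `training` is a list of (author, text) pairs. Summing neighbour scores per
    author is one possible reading of "aggregated relevance scores".
    """
    q = vectorise(query)
    scored = sorted(
        ((cosine(q, vectorise(text)), author) for author, text in training),
        reverse=True,
    )
    totals = defaultdict(float)
    for sim, author in scored[:k]:
        totals[author] += sim
    return sorted(totals, key=totals.get, reverse=True)


def recall_at_rank(ranked_lists, true_authors, rank=30):
    """Share of test cases whose true author appears in the top `rank` predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_authors) if truth in ranked[:rank]
    )
    return hits / len(true_authors)


# Toy data, invented purely for illustration.
training = [
    ("alice", "the cat sat on the mat"),
    ("alice", "the cat likes the mat"),
    ("bob", "stocks fell sharply today"),
    ("carol", "the dog sat on the rug"),
]
ranking = rank_authors("the cat sat on the mat today", training, k=3)
print(ranking)
print(recall_at_rank([ranking], ["alice"], rank=1))
```

At the scale described above, a brute-force comparison against every training sample would be impractical, so a real system would need an index over the vectors; the sketch only illustrates the scoring and evaluation logic.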
Reference
Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. ‘Bag of Tricks for Efficient Text Classification.’ arXiv preprint arXiv:1607.01759,
2016.
Original language | English |
---|---|
Publication status | Published - 2019 |
Event | 14th Biennial Conference of the International Association of Forensic Linguists - Duration: 1 Jul 2019 → 5 Jul 2019 |
Conference
Conference | 14th Biennial Conference of the International Association of Forensic Linguists |
---|---|
Period | 1/07/19 → 5/07/19 |
Keywords
- forensic linguistics
- forensic authorship analysis