Abstract
The idea of authorship attribution is based on two assumptions: (i) that some language users have unique linguistic styles, or quantifiable ‘idiolects’, and (ii) that features characteristic of those styles are likely to recur with a relatively stable frequency in an individual's linguistic output. Studies of individual linguistic variation tend to use sociolinguistically homogeneous data from a single genre, and the few existing cross-genre studies are typically limited to two genres (e.g. Kestemont et al. 2012; Stamatatos 2013). The study reported in this paper takes a different approach: one hundred and twelve participants have shared with us natural language samples from six discourse types. We have collected emails, text messages, university essays, oral interview data, oral image description data, and digital data of Google search behaviour. Each participant’s dataset thus spans a wide range not only of genres but also of communication channels, contexts, and language input modes. The individual datasets consist of roughly 10,000 words each, amounting to a total corpus size of over a million words. Using stylometric classification tools, we have measured within-author and between-author variability and obtained results indicating very low levels of individual stability across genres.
We offer a sociolinguistically based interpretation of the results and discuss their implications for forensic authorship analysis.
References:
Kestemont, M., Luyckx, K., Daelemans, W. and Crombez, T., 2012. Cross-genre authorship verification using unmasking. English Studies, 93(3), pp.340-356.
Stamatatos, E., 2013. On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21(2), pp.421-439.
Original language | English |
---|---|
Publication status | Published - 18 Jul 2022 |
Event | Fourth European Conference of the International Association of Forensic and Legal Linguistics - Porto, Portugal. Duration: 18 Jul 2022 → 21 Jul 2022 |
Conference
Conference | Fourth European Conference of the International Association of Forensic and Legal Linguistics |
---|---|
Country/Territory | Portugal |
City | Porto |
Period | 18/07/22 → 21/07/22 |
Keywords
- forensic linguistics
- authorship analysis
- idiolect