A corpus-based investigation of junk emails

Constantin Orasan, Ramesh Krishnamurthy

    Research output: Chapter in Book/Published conference output (Chapter)

    Abstract

    Almost everyone who has an email account receives unwanted emails from time to time. These emails can be jokes from friends or commercial product offers from unknown people. In this paper we focus on the unwanted messages that try to promote a product or service, or to offer some “hot” business opportunities. These messages are called junk emails. Several methods for filtering junk emails have been proposed, but none considers the linguistic characteristics of junk emails. In this paper, we investigate the linguistic features of a corpus of junk emails and try to decide whether they constitute a distinct genre. Our corpus of junk emails was built from the messages received by the authors over a period of time. Initially, the corpus consisted of 1563 files, but after automatically eliminating duplicates we kept only 673 files, totalling just over 373,000 tokens. To decide whether junk emails constitute a different genre, we compare the corpus with a corpus of leaflets extracted from the BNC and with the whole BNC. Several characteristics at the lexical and grammatical levels were identified.
    Original language: English
    Title of host publication: Proceeding of the Third International Conference on Language Resources and Evaluation (LREC 2002)
    Publisher: ELRA
    Pages: 1773-1780
    Number of pages: 8
    ISBN (Print): 2-951740-80-8
    Publication status: Published - Jun 2002
    Event: 3rd International Conference on Language Resources and Evaluation - Las Palmas, Spain
    Duration: 29 May 2002 - 31 May 2002

    Conference

    Conference: 3rd International Conference on Language Resources and Evaluation
    Country/Territory: Spain
    City: Las Palmas
    Period: 29/05/02 - 31/05/02

    Bibliographical note

    3rd International Conference on Language Resources and Evaluation (LREC-2002), 29-30 May 2002, Las Palmas (ES). Published with the permission of ELRA. This paper was published within the proceedings of the LREC 2002 Conference. © 1998-2010 ELRA - European Language Resources Association. All rights reserved.

    Keywords

    • email
    • unwanted emails
    • junk emails
    • linguistic features
    • lexic
    • grammatic
