Author Profiling Using Semantic and Syntactic Features

Notebook for PAN at CLEF 2019

Document identifier: oai:DiVA.org:ltu-76936
Keywords: Natural Sciences, Computer and Information Sciences, Computer Sciences, Machine Learning
Publication year: 2019
Relevant Sustainable Development Goals (SDGs):
SDG 9 Industry, innovation and infrastructure
The SDG label(s) above have been assigned by OSDG.ai

Abstract:

In this paper, we present an approach for the PAN 2019 Author Profiling challenge. The task is to detect Twitter bots and to classify the gender of human Twitter users as male or female, based on one hundred selected tweets from each profile. Focusing on feature engineering, we explore the semantic categories present in tweets. We combine these semantic features with part-of-speech tags and other stylistic features (e.g. character floodings and the use of capital letters) to form our final feature set. We experimented with different machine learning techniques, including ensemble techniques, and found AdaBoost to be the most successful (attaining an F1-score of 0.99 on the development set). Using this technique, we achieved an accuracy of 89.17% for English-language tweets in the bot detection subtask.
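
As an illustration of the stylistic features named in the abstract, the following is a minimal sketch of how character floodings and capital-letter usage might be computed per author. It is not the authors' implementation: the function name `stylistic_features`, the feature names, and the flooding threshold of three repeated characters are all assumptions made for this example.

```python
import re

# Assumption: a "character flooding" is any character repeated three or
# more times in a row (e.g. "sooooo", "!!!"). The paper may define it differently.
FLOOD_RE = re.compile(r"(.)\1{2,}")


def stylistic_features(tweets):
    """Return two illustrative stylistic features for one author's tweets."""
    n_tweets = max(len(tweets), 1)  # avoid division by zero for empty input
    # Count flooding runs tweet by tweet so runs never span two tweets.
    floodings = sum(len(FLOOD_RE.findall(tweet)) for tweet in tweets)
    letters = [c for tweet in tweets for c in tweet if c.isalpha()]
    capitals = sum(c.isupper() for c in letters)
    return {
        "floodings_per_tweet": floodings / n_tweets,
        "capital_ratio": capitals / max(len(letters), 1),
    }


features = stylistic_features(["Sooooo GOOD", "ok"])
```

In a pipeline like the one the abstract outlines, a vector of this kind would presumably be concatenated with the semantic and part-of-speech features before being passed to a classifier such as AdaBoost.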

Authors

György Kovács

Luleå tekniska universitet; EISLAB; MTA-SZTE Research Group on Artificial Intelligence, Szeged, Hungary

Vanda Balogh

Institute of Informatics, University of Szeged, Szeged, Hungary

Purvnashi Mehta

MindGarage, Kaiserslautern, Germany

Kumar Shridhar

MindGarage, Kaiserslautern, Germany

Pedro Alonso

Luleå tekniska universitet; EISLAB

Marcus Liwicki

Luleå tekniska universitet; EISLAB
