Post developed by Dory Knight-Ingram 

ICYMI (In Case You Missed It), the following work was presented at the 2019 Annual Meeting of the American Political Science Association (APSA). The presentation, titled “Using Neural Networks to Classify Based on Combined Text and Image Content: An Application to Election Incident Observation,” was part of the session “Deep Learning in Political Science” on Friday, August 30, 2019.

A new election forensics process developed by Walter Mebane and Alejandro Pineda uses machine learning to examine not just the text but also the images of Twitter posts considered to be reports of “incidents” from the 2016 US Presidential Election. 

Mebane and Pineda show how to combine text and images into a single supervised learner for prediction in US politics using a multi-layer perceptron. The paper notes that in election forensics, polls are useful, but social media data may offer more extensive and granular coverage. 

The research team gathered individual observation data from Twitter in the months leading up to the 2016 US Presidential Election. Between Oct. 1 and Nov. 8, 2016, the team used Twitter APIs to collect millions of tweets, of which more than 315,180 apparently reported one or more election “incidents” – an individual’s report of their personal experience with some aspect of the election process. 

At first, the research team used only the text associated with tweets. But the researchers note that sometimes the images in a tweet are informative while the text is not: the text alone may not mark a tweet as a report of an election incident, even though the image shows one. 

To solve this problem, the research team implemented some “deep neural network classifier methods that use both text and images associated with tweets. The network is constructed such that its text-focused parts learn from the image inputs, and its image-focused parts learn from the text inputs. Using such a dual-mode classifier ought to improve performance. In principle our architecture should improve performance classifying tweets that do not include images as well as tweets that do,” they wrote.

“Automating analysis for digital content proves difficult because the form of data takes so many different shapes. This paper offers a solution: a method for the automated classification of multi-modal content.” The research team’s model “takes image and text as input and outputs a single classification decision for each tweet – two inputs, one output.” 
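
To make the “two inputs, one output” design concrete, here is a minimal sketch of a dual-input classifier written in Keras. It is not the authors’ code; the layer sizes, vocabulary size, and image dimensions are placeholders chosen only for illustration.

```python
# A minimal sketch (not the authors' code) of a two-input, one-output
# classifier in Keras: one branch for tweet text, one for the tweet image,
# merged into a single incident/non-incident prediction.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20_000   # assumed vocabulary size
MAX_LEN = 50          # assumed maximum tokens per tweet
IMG_SIZE = 64         # assumed (small) image side length

# Text branch: embed tokens, then pool into a fixed-length vector.
text_in = layers.Input(shape=(MAX_LEN,), name="text_tokens")
t = layers.Embedding(VOCAB_SIZE, 128)(text_in)
t = layers.GlobalAveragePooling1D()(t)
t = layers.Dense(64, activation="relu")(t)

# Image branch: a small convolutional stack over the resized tweet image.
img_in = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3), name="image")
m = layers.Conv2D(32, 3, activation="relu")(img_in)
m = layers.MaxPooling2D()(m)
m = layers.Conv2D(64, 3, activation="relu")(m)
m = layers.GlobalAveragePooling2D()(m)
m = layers.Dense(64, activation="relu")(m)

# Fuse the two modalities and emit one classification decision per tweet.
merged = layers.concatenate([t, m])
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid", name="incident")(merged)

model = Model(inputs=[text_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

In a design like this, the two branches are trained jointly, so the gradient signal from the shared output layer lets each branch adjust to the other, which is the intuition behind the authors’ claim that the text-focused parts learn from the image inputs and vice versa.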

The paper describes in detail how the research team processed and analyzed tweet images, which included loading image files in batches, restricting image types to .jpeg or .png, and using small image sizes to keep data processing manageable. 
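
The post does not include the preprocessing code, but the steps it describes map onto a short routine like the following sketch, where the directory name, batch size, and target image size are assumptions for illustration.

```python
# A rough illustration (not the paper's pipeline) of the preprocessing steps
# described above: keep only .jpeg/.png files, resize to a small fixed size,
# and yield the images in batches.
from pathlib import Path

import numpy as np
from PIL import Image

ALLOWED = {".jpg", ".jpeg", ".png"}   # restrict image types
IMG_SIZE = (64, 64)                   # small images keep memory use manageable

def iter_image_batches(image_dir, batch_size=32):
    """Yield batches of normalized image arrays from a directory."""
    paths = [p for p in Path(image_dir).iterdir() if p.suffix.lower() in ALLOWED]
    for start in range(0, len(paths), batch_size):
        batch = []
        for p in paths[start:start + batch_size]:
            img = Image.open(p).convert("RGB").resize(IMG_SIZE)
            batch.append(np.asarray(img, dtype=np.float32) / 255.0)
        yield np.stack(batch)

# Example usage: feed each batch of arrays to a trained model.
# for arrays in iter_image_batches("tweet_images/"):
#     predictions = model.predict(arrays)
```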

The results were mixed.

The researchers trained two models on a sample of 1,278 tweets: one combined text and images, the other used text only. In the text-only model, accuracy steadily increased, topping out at 99%. “Such high performance is testimony to the power of transfer learning,” the authors wrote. 
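
The post does not say which pretrained components supplied the transfer learning. One common form for text models is to initialize the embedding layer from pretrained word vectors and freeze it, as in the illustrative sketch below; the vectors file, vocabulary size, and dimensions are placeholders.

```python
# Illustrative only: one common form of transfer learning for text classifiers
# is to load pretrained word vectors into an embedding layer and freeze it.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # assumed vocabulary size
EMBED_DIM = 100       # assumed pretrained vector dimension

def load_pretrained_matrix(vectors_path, word_index):
    """Build an embedding matrix from a whitespace-separated vectors file."""
    matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype=np.float32)
    with open(vectors_path, encoding="utf-8") as fh:
        for line in fh:
            token, *values = line.rstrip().split(" ")
            row = word_index.get(token)
            if row is not None and row < VOCAB_SIZE:
                matrix[row] = np.asarray(values, dtype=np.float32)
    return matrix

def make_frozen_embedding(matrix):
    """Return an embedding layer whose pretrained weights stay fixed."""
    return layers.Embedding(
        VOCAB_SIZE,
        EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False,   # frozen: pretrained knowledge transfers in unchanged
    )
```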

However, the team was surprised that including the images substantially worsened performance. “Our proof-of-concept combined classifier works. But the model structure and hyperparameter details need to be adjusted to enhance performance. And it’s time to mobilize hardware superior to what we’ve used for this paper. New issues will arise as we do that.”