ICYMI (In Case You Missed It): the following work was presented at the 2020 Annual Meeting of the American Political Science Association (APSA). The presentation, titled “Joint Image-Text Classification Using an Attention-Based LSTM Architecture,” was part of the session “Image Processing for Political Research” on Thursday, September 10, 2020. Post developed by Patrick Wu and Katherine Pearson.

Political science has been enriched by the use of social media data. However, automated text-based classification systems often do not capture image content. Since images provide rich context and information in many tweets, these classifiers do not capture the full meaning of the tweet. In a new paper presented at the 2020 Annual Meeting of the American Political Science Association (APSA), Patrick Wu, Alejandro Pineda, and Walter Mebane propose a new approach for analyzing Twitter data using a joint image-text classifier. 

Human coders of social media data can read both the text of a tweet and any attached image to determine the full meaning of the election incident being described. For example, the authors show the image and tweet below.

[Image: a photo of people waiting in line to vote, accompanying a tweet that reads “Early voting lines in Palm Beach County, Florida #iReport #vote #Florida @CNN”]

If only the text is considered, “Early voting lines in Palm Beach County, Florida #iReport #vote #Florida @CNN”, a reader would not be able to tell that the line was long. Conversely, if the image is considered separately from the text, the viewer would not know that it pictured a polling place. It’s only when the text and image are combined that the message becomes clear. 

MARMOT

A new framework called Multimodal Representations Using Modality Translation (MARMOT) is designed to improve data labeling for research on social media content. MARMOT uses modality translation to generate captions of the images in the data, then uses a model to learn the patterns among the text features, the image caption features, and the image features. This is an important methodological contribution because modality translation replaces more resource-intensive processes and allows the model to learn directly from the data rather than on a separate dataset. MARMOT is also able to process observations that are missing either images or text.
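To make the pipeline concrete, here is a minimal, hypothetical sketch of the modality-translation idea, not the authors' implementation: translate the image into a caption, encode the tweet text, the generated caption, and the image separately, then classify from the fused features. The vocabulary size, encoder architectures, and the generate_caption stub are illustrative placeholders standing in for pretrained components.

```python
# Minimal sketch of joint image-text classification via modality translation.
# All sizes and the caption generator are hypothetical placeholders.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HIDDEN = 5000, 128, 256

def generate_caption(image):
    # Placeholder for a pretrained image captioner ("modality translation").
    # A real system would return a token sequence describing the image.
    return torch.randint(0, VOCAB_SIZE, (image.size(0), 12))

class JointClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        # Shared text encoder for both the tweet text and the generated caption.
        self.text_enc = nn.GRU(EMB_DIM, HIDDEN, batch_first=True)
        # Small CNN image encoder (stand-in for a pretrained vision backbone).
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, HIDDEN),
        )
        # Fusion head learns patterns across the three feature sets.
        self.head = nn.Sequential(nn.Linear(3 * HIDDEN, HIDDEN), nn.ReLU(),
                                  nn.Linear(HIDDEN, num_classes))

    def encode_text(self, tokens):
        _, h = self.text_enc(self.embed(tokens))
        return h.squeeze(0)

    def forward(self, text_tokens, image):
        caption_tokens = generate_caption(image)      # modality translation
        text_feat = self.encode_text(text_tokens)     # tweet text features
        cap_feat = self.encode_text(caption_tokens)   # caption features
        img_feat = self.img_enc(image)                # image features
        # Observations missing an image (or text) could substitute learned
        # "missing modality" vectors here; omitted for brevity.
        return self.head(torch.cat([text_feat, cap_feat, img_feat], dim=-1))

# Toy usage: a batch of 4 tweets, each with 20 text tokens and a 64x64 image.
model = JointClassifier()
logits = model(torch.randint(0, VOCAB_SIZE, (4, 20)), torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```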

Applications

MARMOT was applied to two datasets. The first contains tweets reporting election incidents during the 2016 U.S. general election, originally published in “Observing Election Incidents in the United States via Twitter: Does Who Observes Matter?” All of the tweets contain text, and about a third also include images. MARMOT performed better at classifying these tweets than the text-only classifier used in the original study.

To test MARMOT on a dataset in which every observation includes an image, the authors used the Hateful Memes dataset released by Facebook, where the task is to determine whether a meme is hateful. A multimodal model is useful here because neither the text nor the image may be hateful on its own, yet the combination of the two can create a hateful message. In this application, MARMOT outperformed other multimodal classifiers in terms of accuracy.

Future Directions 

As more and more political scientists use social media data in their research, classifiers will have to become more sophisticated to capture the nuance and meaning that can be packed into small parcels of text and images. The authors plan to continue refining MARMOT and to expand the model to accommodate additional elements such as video, geographical information, and time of posting.