Final Project: Text as Data and Automated Content Analysis
For their final project, students will employ general data manipulation tools and advanced text as data techniques with a view to answering – or at least complementing – an original research question of their choice. The main goal of this assignment is for students to combine the R knowledge and the quanteda- and stringr-based analyses learnt in the final units of this course with the theoretical insights explored in class. In doing so, students should embed their analysis within a theoretical framework, clearly demonstrating how text as data approaches can help them tackle their chosen research problem.
Students can work either individually or in groups of three.
1. Guidelines
1.1 The Research Problem
The project should present a clear research puzzle that the student wants to engage with. This can be in the form of a research question or sub-questions, which themselves can be exploratory (‘what are...?’) or deductive (‘does X...?’). There are no other specific requirements for the research question(s), and they do not need to be related to what students worked on in previous assignments.
1.2 Theoretical Framework/Background
The research question(s) should, ideally, be linked to an established field of literature. To allow for greater flexibility in the choice of a research problem, however, students can also explore a socially relevant topic to which their empirical analysis could contribute. A review of the topic-specific literature is still needed, but students can also make use of journalistic sources to provide a background to their research problem. The key point is that students situate their initial puzzle and empirical analysis within a broader framework.
1.3 Methodology
1.3.1 Data
Students should make use of an existing dataset to conduct their analyses. Some options are already available and documented below, but students do not have to use these datasets. They are free to find (or even collect!) datasets that best reflect their research interests – the only requirement is that the data incorporate text.
• Existing class datasets. Students can choose to use text datasets employed throughout the course. These include:
o A collection of Tweets about the Russian invasion of Ukraine (53K)
o A collection of all US presidential inaugural addresses
o All speeches in the UK parliament in the first seven months of 2020
o A collection of 90K Tweets on the topic of ‘Musk’ collected in April 2022 (the month he announced he would buy Twitter)
• The useNews database available here. It is a collection of over 2.5 million news articles from 81 online news outlets originating from 12 countries for the years 2019 and 2020 (including German-speaking Germany and Austria, and English-speaking UK and US). As all articles are presented in the form of a document feature matrix (dfm), students will need to use quanteda to analyze relevant sections of the dataset. Additionally, the creators of this dataset have made available Facebook engagement metrics for each news article, providing the number of shares, reactions, and comments. While students do not have to use these metrics, they should remember that they are available when formulating their research questions.
• A collection of transcripts of podcast episodes from the Joe Rogan Experience can be found here. The corpus claims to contain more than 833 hours of speech.
Important considerations for the useNews dataset:
• For students using the useNews dataset, a script has been written and uploaded to Ilias that facilitates reading the data into R and selecting the relevant parts of it. Students can find it in Ilias/Final Project/getting_data_final_project
• An Excel sheet lists the name of each outlet as it appears in the data, the country it belongs to, and how many articles are available for 2019 and 2020. When working with the data you must use these exact names: if an outlet is spelled ‘FOX nEws’ in the sheet, you have to spell it ‘FOX nEws’ in R. Students should use this sheet to guide their data selection. Note that some outlets are only available for one year and that, in some cases, outlet names are spelled differently in different years. Students can find this file in Ilias/Final Project/all_outlets_by_country
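The exact-name requirement can be illustrated with a minimal sketch. It does not use the real useNews files or the course script: the toy corpus, the docvar names `outlet` and `year`, and the outlet ‘Other Outlet’ are assumptions made up for illustration, and the actual column names should be taken from the script on Ilias.

```r
library(quanteda)

# Toy stand-in for the useNews data (hypothetical texts and docvars).
toy_corpus <- corpus(
  c(a1 = "Election results dominate the news cycle",
    a2 = "Markets react to the election results",
    a3 = "Local team wins the championship"),
  docvars = data.frame(
    outlet = c("FOX nEws", "FOX nEws", "Other Outlet"),
    year   = c(2019, 2020, 2020)
  )
)
toy_dfm <- dfm(tokens(toy_corpus, remove_punct = TRUE))

# Outlet names are matched exactly, including capitalisation.
fox_2020 <- dfm_subset(toy_dfm, outlet == "FOX nEws" & year == 2020)
n_exact <- ndoc(fox_2020)

# A differently capitalised name raises no error; it simply matches nothing.
n_wrong <- ndoc(dfm_subset(toy_dfm, outlet == "Fox News"))

print(c(exact = n_exact, wrong_spelling = n_wrong))
```

Because a misspelled outlet name returns an empty selection rather than an error, it is worth checking `ndoc()` after every subsetting step.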
1.3.2 Analysis
Students should use the data manipulation and text analysis tools they have learnt throughout the proseminar. Specifically, they are required to use at least one of the quanteda-based automated text classification approaches covered in this block course (dictionary classification, supervised machine learning, and/or unsupervised machine learning), as well as analyses based on the stringr package.
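As a rough illustration of how a stringr step and a quanteda dictionary classification can be combined, consider the sketch below. The example texts, the dictionary `topic_dict`, and its keys and patterns are invented for this illustration; they are not taken from the course materials and are not a recommended dictionary.

```r
library(quanteda)
library(stringr)

# Hypothetical example texts.
texts <- c(d1 = "The economy is growing and jobs are up. #economy",
           d2 = "New climate policy targets emissions. #climate #policy",
           d3 = "Inflation worries hit the economy")

# stringr: count hashtags, then strip them and lower-case the text.
n_hashtags <- str_count(texts, "#\\w+")
clean <- str_to_lower(str_remove_all(texts, "#\\w+"))

# quanteda: tokenise and apply a small, made-up topic dictionary.
toks <- tokens(corpus(clean, docnames = names(texts)), remove_punct = TRUE)
topic_dict <- dictionary(list(
  economy = c("econom*", "job*", "inflation"),
  climate = c("climate", "emission*")
))
topic_dfm <- dfm(tokens_lookup(toks, dictionary = topic_dict))

# One row per document: dictionary hits per topic plus the hashtag count.
result <- convert(topic_dfm, to = "data.frame")
result$n_hashtags <- n_hashtags
print(result)
```

Removing the hashtags before the lookup is a design choice made here so that a tag such as ‘#economy’ is not counted as a dictionary hit; whether that is appropriate depends on the research question.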
1.4 Answering the Research Question
Students must carefully consider what data and methods they need in order to best answer their research question(s). The idea is to provide the best answer possible within the constraints of the available data and the text as data methods they have learnt.
1.5 Structure
The essay should consist of the following sections:
• Introduction
– (Very) briefly introduce the topic, clearly state the research problem, and summarise the strategy of analysis, findings and implications.
• Theoretical Framework/Background
– Embed research problem within a theoretical framework, academic literature, or a broad ‘background’ that situates the puzzle in terms of its social relevance.
• Methodology
– Describe what data you will use for analysis.
– Explain what methods of analysis you will use.
• Results
– Present and interpret findings of your analysis.
– Plots/Figures greatly help communicate your findings.
• Conclusion
– Clearly state whether and how you solved your research problem.
– Briefly discuss what your findings mean for e.g. a field of literature or the topic more generally.
2. Grading
Students will be evaluated on the following aspects:
• Research problem (10%)
– Is it interesting/ambitious? Is it relevant? Is it suitable for a computational approach?
• Framework (20%)
– Was the puzzle clearly and sufficiently integrated within a broader framework? (theory, literature, social background)
– Are research questions clearly stated? Do they clearly follow from the framework provided?
• Methodology and Analysis (40%)
– Analysis will be evaluated in terms of suitability (does it allow the student to answer the Research Question?) and overall quality (how well were the method and analysis applied? Was it sufficiently ambitious? Does the student show independent work? Is the process of data analysis clearly explained?).
• Results (20%)
– Does the student present results in a clear manner? Do they directly speak to the Research Question? Is the interpretation sophisticated? Does the student reflect well on the limitations of their design?
• Academic writing (10%)
– Is the writing sufficiently clear? Does it follow standard academic style?
3. Submission
Students need to submit two separate files by March 3rd, 2023:
• A 1,000 to 1,500-word essay containing the sections outlined above, in Word format.
• An R script showing the student’s coding work.
All papers will be automatically checked for plagiarism. Plagiarism will not be tolerated, and papers that contain it will automatically receive a 1. Plagiarism includes directly copying others’ work and paraphrasing without citation, as well as the use of text generators such as ChatGPT. Make sure you review the University’s policy on what constitutes plagiarism.