---
title: "My answers"
author: "My name"
date: "2024-05-08"
output: html_document
---

## Online Reputation Management

This exercise studies online reputation management through the use of public comments by firms in response to online reviews.
The content is based on the article "Online reputation management: Estimating the impact of management responses on consumer reviews" by Proserpio and Zervas.
The article was published in Marketing Science in 2017 and is available in our course readings.
You should read through the article before answering these questions.
The paper investigates the relationship between a hotel's use of management responses and its online reputation (measured by star rating, `stars`), and seeks to establish a causal effect of management responses on online reputation.
Your goal in this exercise is to explain key arguments and replicate selected results from this paper.
The data for this exercise is located at `data/responses.dta`.^[
Note that the data is stored as a `.dta` format (i.e. a Stata dataset).
We use the package `haven` with its `read_stata()` command to load a Stata dataset.
]

For this exercise you might need the following packages:

```{r, warning= FALSE, message=FALSE}
library(haven)
library(dplyr)
library(tidyr)
library(fixest)
library(purrr)
library(broom)
library(modelsummary)
```

1. Explain why firms might use public responses to reviews to manage their online reputation.
   Should they respond relatively more to positive or negative reviews? Explain why.

Write your answer here


2. Load the data for this exercise and name it `hotels_orm`. 
For this exercise you will only need the rows where `xplatform_dd_obs = 1`.
Keep only the columns `hotel_id, year, stars, after, ta_dummy, first_response, cum_avg_stars_lag, log_count_reviews_lag, t, ash_interval, traveler_type`.

```{r}
# Write your answer here
```
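
As a hedged illustration of the load, filter and select pattern, the sketch below writes a small made-up data frame to a temporary `.dta` file and reads it back with `read_stata()`. The object names (`toy`, `toy_path`, `toy_loaded`) and the column `unused_column` are hypothetical and not part of the course data. Note that the row condition is written with `==` inside `filter()`.

```{r}
library(haven)
library(dplyr)

# Toy data standing in for the real file; all values are made up
toy <- data.frame(
  hotel_id         = c(1, 1, 2, 2),
  stars            = c(4, 5, 3, 2),
  xplatform_dd_obs = c(1, 0, 1, 1),
  unused_column    = c(10, 20, 30, 40)
)

# Write to a temporary Stata file so the example is self-contained
toy_path <- tempfile(fileext = ".dta")
write_dta(toy, toy_path)

toy_loaded <- read_stata(toy_path) %>%
  filter(xplatform_dd_obs == 1) %>%  # keep only the flagged rows
  select(hotel_id, stars)            # keep only the columns we need

toy_loaded
```

The same three steps should carry over to the real file, with the path and column names from the question substituted in.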


3.  Proserpio & Zervas' empirical exercise uses what they call 'cross-platform' difference in differences.
Using your own words, explain their idea conceptually - and justify why it is valid. 
You can use equations or figures, but do so sparingly. (max 7 sentences).

Write your answer here


4. Explain what the 'parallel trends' assumption is. 
   Why is it important in this application?
   Which figure (if any) provides support for the parallel trends assumption?
  
Write your answer here


5. What is the 'Ashenfelter dip'? 
   Why do the authors believe they see a pattern akin to an Ashenfelter dip in their application?

Write your answer here


6. First, let's compute the 'simple' difference-in-differences (DiD) estimate using differences in means.
  (a) Create a data frame with two rows and two columns where the rows take the values of `first_response` = 0 or 1, 
      and the columns take the values of `ta_dummy` = 0 or 1. 
      The values in the data frame should be the respective group means of `stars`.
  (b) Compute the difference between `first_response` = 1 and `first_response` = 0 for each of `ta_dummy` = 0 and `ta_dummy` = 1.
  (c) Compute the difference between the two values in (b) to get your simple DiD estimate.
      What is the estimate that you get?

```{r}
# Write your answer here
```
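
To make the arithmetic in steps (a) to (c) concrete, here is a minimal sketch on made-up data. The eight rows in `toy` are invented for illustration, and the sketch uses base R's `tapply()`, which returns a matrix rather than the data frame requested above.

```{r}
# Toy data: two observations per cell; star ratings are made up
toy <- data.frame(
  first_response = c(0, 0, 1, 1, 0, 0, 1, 1),
  ta_dummy       = c(0, 0, 0, 0, 1, 1, 1, 1),
  stars          = c(3, 4, 3, 4, 3, 3, 4, 4)
)

# (a) 2 x 2 table of group means: rows = first_response, columns = ta_dummy
toy_means <- with(
  toy,
  tapply(stars, list(first_response = first_response, ta_dummy = ta_dummy), mean)
)

# (b) after-minus-before difference within each platform
diff_by_platform <- toy_means["1", ] - toy_means["0", ]

# (c) difference of the two differences
toy_did <- unname(diff_by_platform["1"] - diff_by_platform["0"])
toy_did
```

In this toy example the mean rating does not change on the comparison platform (3.5 before and after) and rises from 3 to 4 on the treated platform, so the DiD estimate is 1.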


The authors use the following regression equation to estimate the difference in difference estimator of the effect of online responses on online reputation:

$Stars_{ijt} = \beta_1 After_{ijt} + \beta_2 TripAdvisor_{ij} + \delta After_{ijt}  \times TripAdvisor_{ij} + X_{ijt}\gamma + \alpha_j + \tau_t + \varepsilon_{ijt}$

where $Stars_{ijt}$ is the star-rating of review $i$ for hotel $j$ in calendar month $t$, $After_{ijt}$ is an indicator
for reviews (on either platform) submitted after hotel $j$ started responding, $TripAdvisor_{ij}$ is an
indicator for TripAdvisor ratings, $X_{ijt}$ are control variables, $\alpha_j$ are hotel fixed effects, $\tau_t$ are calendar-month fixed effects
and $\varepsilon_{ijt}$ is the error term.

The relationship between the variables in the equation above and the variables in the dataset is as follows:^[
  This mapping is *not* immediately obvious, and is one of the (small) perils of using a dataset that you have not constructed yourself from scratch. We hope this clarifies which variables need to be included in the regression.
]

* $After =$ `first_response`, 
* $TripAdvisor =$ `ta_dummy`, and 
* $After_{ijt}  \times TripAdvisor_{ij}$ = `after`
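
One way to sanity-check a mapping like this is to confirm that the interaction column equals the product of the two indicators. The sketch below does so on a made-up four-row data frame. The values in `toy` and the name `mapping_holds` are hypothetical, and whether the check passes on the real data is something to verify rather than assume.

```{r}
# Toy data covering all four indicator combinations; values are made up
toy <- data.frame(
  first_response = c(0, 1, 0, 1),
  ta_dummy       = c(0, 0, 1, 1),
  after          = c(0, 0, 0, 1)
)

# TRUE if `after` is exactly the product of the two indicators in every row
mapping_holds <- all(toy$after == toy$first_response * toy$ta_dummy)
mapping_holds
```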

7. Explain why $\delta$, i.e. the coefficient on the interaction term $After_{ijt} \times TripAdvisor_{ij}$, is the difference-in-differences estimate from the regression (i.e. the coefficient you want to estimate).

Write your answer here


8. Now, let's replicate Table 4 of Proserpio and Zervas. 
  Estimate the model outlined above using the `fixest` package.
  In particular, produce four regression models:

  (a) `model_1` should be the regression equivalent of the simple DiD in 6 using the whole dataset. This estimate is not in Table 4 of Proserpio and Zervas.
  (b) `model_2` should extend `model_1` by adding the fixed effects and use the whole dataset.
  (c) `model_3` should be the same estimating equation as `model_2` but correct for the Ashenfelter dip.
  (d) `model_4` should augment `model_2` by adding the variables `cum_avg_stars_lag` and `log_count_reviews_lag` to $X_{ijt}$, and correct for the Ashenfelter dip.

For each model, standard errors should be clustered by `hotel_id`.

Use the following starter code for estimating each regression model:

```{r, eval = FALSE}
model_x <- feols(YOUR_CODE ~ YOUR_CODE + 
                   # t:ta_dummy is the platform-specific linear time 
                   # trend mentioned in the notes of Table 4;
                   # you need it to reproduce their estimates in models 2 through 4
                   t:ta_dummy
                   |
                   # put any additional fixed effects here (if you need them)
                   # format is var1 + var2
                   YOUR_CODE,
                 data = YOUR_CODE,
                 cluster = ~ YOUR_CODE # which variable denotes the clusters 
                                       # for the standard errors
                 )
```
```
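
For readers new to the formula syntax, the following sketch shows how `feols()` from the `fixest` package combines regressors, fixed effects (after the `|`) and clustered standard errors. It uses simulated data with hypothetical names (`toy`, `unit`, `month`, `x`, `y`) and is not the model from the paper.

```{r}
library(fixest)

# Simulated panel: 20 units observed over 10 months (all values are made up)
set.seed(1)
n <- 200
toy <- data.frame(
  unit  = rep(1:20, each = 10),
  month = rep(1:10, times = 20),
  x     = rnorm(n)
)
toy$y <- 2 * toy$x + toy$unit / 10 + rnorm(n)

# y on x, with unit and month fixed effects; standard errors clustered by unit
toy_fit <- feols(y ~ x | unit + month, data = toy, cluster = ~ unit)
coef(toy_fit)
```

Only `x` appears in the coefficient vector, because the fixed effects listed after the `|` are absorbed rather than reported.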

```{r}
# Write your answer here
```


9. Report your estimates in a regression table.

```{r}
# Write your answer here
```
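
As a rough sketch of how several models can be placed side by side, the example below passes a named list of fitted models to `modelsummary()`. The two `lm()` fits on the built-in `mtcars` data are placeholders chosen only to keep the example self-contained, and the `gof_omit` pattern is one possible choice rather than a requirement.

```{r}
library(modelsummary)

# Placeholder models on a built-in dataset, for illustration only
toy_a <- lm(mpg ~ wt, data = mtcars)
toy_b <- lm(mpg ~ wt + hp, data = mtcars)

# One column per model; the list names become the column headers
modelsummary(
  list("Model A" = toy_a, "Model B" = toy_b),
  stars    = TRUE,               # add significance stars
  gof_omit = "AIC|BIC|Log.Lik"   # drop some goodness-of-fit rows
)
```

The same call also accepts models estimated with `fixest`, so a list of the four models from question 8 can be passed in the same way.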


10. Interpret the value of the coefficients you deem most important. 
    Are these effects significant from a marketing perspective (i.e. should they shape marketing practice)? (max. 7 sentences)

Write your answer here


11. Explain why management responses to reviews can lead to improved hotel ratings. 
  Can you support this argument using any of the results in Proserpio and Zervas' work? 
  Be specific as to which results you can use and how they support your arguments. (max. 10 sentences).

Write your answer here


12. Apart from the "cross-platform design", the difference-in-differences strategy adopted by Proserpio and Zervas differs from the traditional design along one other important dimension. 
What dimension is this?
What effects might it have on the results and their interpretability?  

Write your answer here