Some clarification for Chapter 8 Observer Bias model formulation #13
Comments
Alex, thanks for these comments. It will take me some time to process
them, but I will get to it as soon as I can.
On Sun, Jan 7, 2018 at 7:36 PM, Alex Klibisz ***@***.***> wrote:
Chapter 8 makes an interesting point about Observer Bias on the Red Line,
but it took me a while to understand why the distribution over passengers'
observed wait times is greater than the true wait times. After some thought
it turns out I was assuming a more complicated model than the text. I don't
think either model is unreasonable; my intuition just wasn't on the same
page and I didn't find an explicit reason in the text to invalidate my
model. The correct model might be obvious to most but perhaps the
clarification below will help someone in the future:
The text reads:
The average time between trains, as seen by a random passenger, is
substantially higher than the true average.
Why? Because a passenger is more like (sic) to arrive during a large
interval than a small one. Consider a simple example: suppose that the time
between trains is either 5 minutes or 10 minutes with equal probability. In
that case the average time between trains is 7.5 minutes.
But a passenger is more likely to arrive during a 10 minute gap than a 5
minute gap; in fact, twice as likely. If we surveyed arriving passengers,
we would find that 2/3 of them arrived during a 10 minute gap, and only 1/3
during a 5 minute gap. So the average time between trains, as seen by an
arriving passenger, is 8.33 minutes.
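The arithmetic in the quoted passage can be checked with a short sketch. It assumes only what the passage states: gaps of 5 or 10 minutes, each with probability 1/2, and passengers arriving at a uniformly random time. The names `gaps`, `weights`, and `observed` are illustrative and do not come from the book's code.

```python
from fractions import Fraction

# Gap lengths (minutes) and their probabilities, from the quoted example.
gaps = {5: Fraction(1, 2), 10: Fraction(1, 2)}

# True average gap: each gap weighted by how often it occurs.
true_mean = sum(gap * p for gap, p in gaps.items())

# The chance that a random passenger lands in a gap is proportional
# to gap length times gap probability.
weights = {gap: gap * p for gap, p in gaps.items()}
total = sum(weights.values())
observed = {gap: w / total for gap, w in weights.items()}

# Average gap as seen by an arriving passenger.
observed_mean = sum(gap * p for gap, p in observed.items())

print(float(true_mean))      # 7.5
print(observed[10])          # 2/3
print(float(observed_mean))  # about 8.33
```

Exact fractions are used so the 1/3 and 2/3 split shows up without rounding.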
For this to be true, I believe we have to assume a passenger arriving 0
minutes after the previous train has the same observed waiting time as a
passenger arriving any arbitrary n > 0 minutes after the train. In other
words, a passenger who just missed the previous train and waited the full
gap is treated the same as a passenger who just barely made the train.
My intuition was as follows: In reality, a passenger can arrive at the 9th
minute of a 10 minute gap or the 4th minute of a 5 minute gap. Both
passengers wait 1 minute. If you model it this way, the biased distribution
actually shifts to the left. Why? Let's say there are two passengers
arriving per minute (lam = 2). For a 2 minute gap, you might have the
following wait times for 4 passengers: [0, 0, 1, 1]. For a 3 minute gap,
you might have the following wait times for 6 passengers: [0, 0, 1, 1, 2,
2]. A passenger who waits 0 has arrived just before the train departs.
For an n minute gap, wait time n-1 indicates the passenger arrived within
the first minute after the previous train departed. From the 2-minute and
3-minute gaps above, you can deduce that across all trains P(wait n) <
P(wait n-1). I.e., there is always a chance for a passenger to wait 0
minutes. But for a 5 minute gap, for example, it's impossible to wait 6 minutes.
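As a rough check of the toy example above, the two lists of wait times can be pooled and tallied. This is only a sketch of that one example (one 2-minute gap, one 3-minute gap, two passengers per minute), and the variable names are made up for illustration. In this small tally the waits of 0 and 1 minute come out tied, so the inequality is better read as "no more likely" than as strictly less likely.

```python
from collections import Counter

# Toy wait times from the text: two passengers per minute,
# one 2-minute gap and one 3-minute gap.
waits_2_min_gap = [0, 0, 1, 1]
waits_3_min_gap = [0, 0, 1, 1, 2, 2]

# Pool all passengers and count how many observed each wait time.
counts = Counter(waits_2_min_gap + waits_3_min_gap)
total = sum(counts.values())

# Empirical probability of each wait time, in increasing order of wait.
probs = {wait: counts[wait] / total for wait in sorted(counts)}
mean_wait = sum(wait * p for wait, p in probs.items())

print(probs)  # {0: 0.4, 1: 0.4, 2: 0.2}
print(mean_wait)
```

Longer waits are only possible in longer gaps, which is why the probabilities never increase as the wait grows.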
Here is some code to simulate the process and the resulting histogram.
from math import floor
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
n = 50000  # Number of trains.
l = 2  # Passengers arriving per minute.
T = np.random.normal(10, 2, n)  # True time between trains.
W1 = []  # Passengers' observed waiting time (my initial formulation).
W2 = []  # Passengers' observed waiting time (Think Bayes formulation).
for t in T:
    size = int(floor(t * l))  # This many passengers will end up on the next train.
    W1 += list(np.random.uniform(0, floor(t), size))  # Wait is uniform within the gap.
    W2 += list(np.ones(size) * t)  # Every passenger records the full gap.
bins = int(T.max() - T.min())
# density=True replaces normed=True, which recent Matplotlib versions no longer accept.
# Raw strings keep the backslash in \mu from being read as an escape sequence.
plt.hist(T, color='red', bins=bins, alpha=0.3, density=True, label=r'True wait $\mu=%.3f$' % T.mean())
plt.hist(W1, color='blue', bins=bins, alpha=0.3, density=True, label=r'Observed wait $\mu=%.3f$' % np.mean(W1))
plt.hist(W2, color='green', bins=bins, alpha=0.3, density=True, label=r'Observed wait simplified $\mu=%.3f$' % np.mean(W2))
plt.legend(fontsize=8)
plt.show()
[image: figure_1]
<https://user-images.githubusercontent.com/8015228/34655951-13ef059e-f3e0-11e7-8aa6-f7bbd2a9ee3c.png>
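For reference, the means that the two formulations should produce can be worked out by hand. This is a sketch under idealized assumptions that the simulation only approximates: passengers arrive at a constant rate, the `floor` rounding is ignored, and the gap $T$ has mean $\mu = 10$ and standard deviation $\sigma = 2$ as in the code. The symbols $T_{\text{obs}}$ and $W$ are introduced here for illustration.

```latex
% A gap of length t is seen by a number of passengers proportional to t,
% so the gap as observed by passengers is length-biased:
\[
  \mathbb{E}[T_{\text{obs}}]
    = \frac{\mathbb{E}[T^2]}{\mathbb{E}[T]}
    = \mu + \frac{\sigma^2}{\mu}
    = 10 + \frac{4}{10}
    = 10.4 .
\]
% Within a gap of length t, the wait is uniform on [0, t], with mean t/2:
\[
  \mathbb{E}[W]
    = \frac{\mathbb{E}[T^2]}{2\,\mathbb{E}[T]}
    = \frac{1}{2}\left(\mu + \frac{\sigma^2}{\mu}\right)
    = 5.2 .
\]
```

Under these assumptions the full-gap formulation sits slightly to the right of the true mean of 10, while the uniform-wait formulation is centered near half of that, which matches the leftward shift described in the issue.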
@AllenDowney It's nothing urgent and not explicitly a problem. Just figured I'd post it in case someone else overcomplicates the problem like I did and gets confused at chapter 8. Thanks!