diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 59ece58c..3807260a 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -35,3 +35,19 @@ jobs:
run: python -m unittest tests/imports.py
- name: Run pytest suite
run: pytest
+ build_docs:
+ runs-on: ubuntu-22.04
+ defaults:
+ run:
+ working-directory: ./docs
+ steps:
+ - name: Checkout Forest code from GitHub repo
+ uses: actions/checkout@v3
+ - name: Set up Python
+ uses: actions/setup-python@v4
+ with:
+ python-version: 3.8
+ - name: Install documentation build dependencies
+ run: pip install -r requirements.txt
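+      # -W turns Sphinx warnings into errors, so the docs build fails on any warning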
+ - name: Build the docs
+ run: make html SPHINXOPTS="-W"
diff --git a/docs/source/index.md b/docs/source/index.md
index 9d980fbd..25d0c46b 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -216,6 +216,9 @@ The summary statistics that are generated are listed below:
* - total_mins_out_call
- float
- The duration (minute) of all outgoing calls.
+* - num_uniq_individuals_call_or_text
+ - int
+  - The total number of unique individuals who called or texted the subject, or whom the subject called or texted; that is, the total number of individuals the subject had any kind of communication with.
* - num_s
- int
- The total number of sent SMS.
diff --git a/docs/source/jasmine.md b/docs/source/jasmine.md
index e17e8fa0..324708ad 100644
--- a/docs/source/jasmine.md
+++ b/docs/source/jasmine.md
@@ -85,37 +85,39 @@ You can also tweak the parameters that change the assumptions of the imputation
(6) locations_log (.json)\
- json file created if `save_osm_log` is set to True. It contains information on the places visited by the user, their tags and the time of visit.
-## Description of functions in package:
+## Description of functions in the package:
+
`data2mobmat.py`
This file contains the functions to convert the raw GPS data to a mobility matrix (a 2d numpy array) whose columns are movement status (flight/pause/undecided), starting latitude, starting longitude, starting timestamp, ending latitude, ending longitude, and ending timestamp. This module focuses on summarizing observed data into trajectories, not on unobserved periods; a sketch of the matrix layout follows the function list below.
-- Its main function is `GPS2MobMat` which calls the required functions in the right order (see [[Link to paper | doi....]] for details on the algorithm
+- Its main function is `gps_to_mobmat`, which calls the required functions in the right order (see [[Link to paper | doi....]] for details on the algorithm).
- It contains various functions to calculate distance on the globe: `cartesian`, `shortest_dist_to_great_circle`, `great_circle_dist` and `pairwise_great_circle_dist`
- In addition, it has a few helper functions:
-- `unique`: return a list of unique items in a list
- `collapse_data`: the GPS data is usually sampled at 1 Hz. We collapse the data every 10 seconds and calculate the average to reduce the noise in the raw data.
-- `ExistKnot`: given a matrix with columns [timestamp, latitude, longitude], return if the trajectories depicted by those coordinates can be approximated as a straight line. The parameter $w$ represents the tolerance of deviation. It return 1 if there exists at least one knot in the trajectory and it returns 0 otherwise.
-- `ExtractFlights`: given a matrix with columns [timestamp, latitude, longitude] in a burst period (when the GPS is on), return a summary of trajectories (2d array) with columns as [movement status, start_timestamp, start_latitude, start_longitude, end_timestamp, end_latitude, end_longitude].
-- `InferMobMat`: tidy up the trajectory matrix (infer undecided pieces, combine flights/pauses.)
+- `exist_knot`: given a matrix with columns [timestamp, latitude, longitude], return whether the trajectory depicted by those coordinates can be approximated as a straight line. The parameter $w$ represents the tolerance of deviation. It returns 1 if there exists at least one knot in the trajectory and 0 otherwise.
+- `extract_flights`: given a matrix with columns [timestamp, latitude, longitude] in a burst period (when the GPS is on), return a summary of trajectories (2d array) with columns [movement status, start_timestamp, start_latitude, start_longitude, end_timestamp, end_latitude, end_longitude]. It uses the helper functions `mark_single_measure`, `mark_complete_pause`, `detect_knots` and `prepare_output_data`.
+- `infer_mobmat`: tidy up the trajectory matrix (infer undecided pieces, combine flights/pauses). It uses the helper functions `compute_flight_positions`, `compute_future_flight_positions`, `infer_status_and_positions`, `merge_pauses_and_bridge_gaps` and `correct_missing_intervals`.
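+
+A minimal sketch of the mobility-matrix layout described above (the numeric status codes and the coordinate values here are illustrative assumptions, not the package's actual encoding):
+
+```python
+import numpy as np
+
+# columns: [status, start_lat, start_lon, start_t, end_lat, end_lon, end_t]
+# status (hypothetical encoding for this example): 1 = flight, 2 = pause
+mobmat = np.array([
+    [1, 42.3601, -71.0589, 1600000000, 42.3736, -71.1097, 1600000600],
+    [2, 42.3736, -71.1097, 1600000600, 42.3736, -71.1097, 1600003600],
+])
+print(mobmat.shape)  # (2, 7): two trajectory segments, seven columns each
+```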
`sogp_gps.py`
This file is the core of sparse online Gaussian Process. It covers the algorithm described in [Csato and Opper (2001)](https://eprints.soton.ac.uk/259182/1/gp2.pdf).
-- `K0`: a kernel function to measure the similarity between x1 and x2.
-- `update_K`, `update_k`, `update_e_hat`, `update_gamma`, `update_q`, `update_s_hat`, `update_eta`, `update_alpha_hat`, `update_c_hat`, `update_s`, `update_alpha`, `update_c`, `update_Q`, `update_alpha_vec`, `update_c_mat`, `update_q_mat`, `update_s_mat`: are the updating rules for each parameters in the algorithm.
-- `SOGP`: A key function of this model. Given an 2d array of latitude and longitude, return a basis vector set of fixed size and relevant parameters for the updates in the future.
-- `BV_select`: The master function. Given the observed trajectory matrix, return representative trajectories of a fixed size and relevant parameters for the updates in the future.
+
+- `calculate_k0`: a kernel function to measure the similarity between two points x1 and x2 (see the illustrative sketch after this list).
+- `update_similarity`, `update_similarity_all`, `update_e_hat`, `update_gamma`, `update_q`, `update_s_hat`, `update_eta`, `update_alpha_hat`, `update_c_hat`, `update_s`, `update_alpha`, `update_c`, `update_q_mat`, `update_alpha_vec`, `update_c_mat`, `update_q_mat2`, `update_s_mat`: the updating rules for the parameters in the algorithm.
+- `sogp`: A key function of this model. Given a 2d array of latitudes and longitudes, return a basis vector set of fixed size and the relevant parameters for future updates. It uses the helper functions `calculate_sigma_max`, `update_system_given_gamma_tol`, `update_system_otherwise` and `pruning_bv`.
+- `bv_select`: The master function. Given the observed trajectory matrix, return representative trajectories of a fixed size and the relevant parameters for future updates.
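+
+For intuition, a kernel of this kind scores two points as 1.0 when they are identical and decays toward 0 as they move apart. A generic squared-exponential sketch (an illustration only, not the exact kernel used by `calculate_k0`):
+
+```python
+import numpy as np
+
+def rbf_similarity(x1: np.ndarray, x2: np.ndarray, length_scale: float = 1.0) -> float:
+    # squared-exponential kernel: exp(-||x1 - x2||^2 / (2 * length_scale^2))
+    return float(np.exp(-np.sum((x1 - x2) ** 2) / (2 * length_scale ** 2)))
+
+print(rbf_similarity(np.array([0.0, 0.0]), np.array([0.0, 0.0])))  # 1.0
+print(rbf_similarity(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # ~3.7e-06
+```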
`mobmat2traj.py`
This file imputes the missing trajectories based on the observed trajectory matrix.
-- Its main functions are `ImputeGPS` (for ...) and `Imp2traj` (for ...)
-- It contains two functions that are also used for generating summary statistics: `num_sig_places` (identify number of locations where participant spends x consecutive minutes, and is at least y m away from other locations) and `locate_home` (identify location that a participant spends most time between 9pm and 9 am)
+
+- Its main functions are `impute_gps` (for bi-directional imputation) and `imp_to_traj` (for combining pauses and flights shared by both observed and missing intervals, and for merging consecutive flights with slightly different directions into one longer flight). It uses the helper functions `calculate_delta`, `adjust_delta_if_needed`, `calculate_position`, `update_table`, `forward_impute` and `backward_impute`.
+- It contains two functions that are also used for generating summary statistics: `num_sig_places` (identify the number of locations where the participant spends at least x consecutive minutes and that are at least y meters away from other locations; see the sketch after this list) and `locate_home` (identify the location where a participant spends the most time between 9 pm and 9 am). They use the helper functions `update_existing_place` and `add_new_place`.
- It contains various helper functions:
-- `K1`: the kernel function returns the similarity between the given triplet and every triplet in the basis vector set.
-- `I_flight`: determine if a flight occurs at the current time and location
-- `adjust_direction`: adjust the direction of the sampled flight if it is not likely to happen in the real world.
-- `multiplier`: return a coefficient to accelerate the imputation process based on the duration of the missing interval.
-- `checkbound`: check if the destination will be out of a reasonable range given the sampled flight
-- `create_tables`: initialize three 2d numpy arrays, one to store observed flights, one to store pauses, and one to store missing intervals.
+  - `calculate_k1`: a kernel function that returns the similarity between the given triplet and every triplet in the basis vector set.
+  - `indicate_flight`: determine if a flight occurs at the current time and location.
+  - `adjust_direction`: adjust the direction of the sampled flight if it is not likely to happen in the real world.
+  - `multiplier`: return a coefficient to accelerate the imputation process based on the duration of the missing interval.
+  - `checkbound`: check if the destination would be out of a reasonable range given the sampled flight.
+  - `create_tables`: initialize three 2d numpy arrays, one to store observed flights, one to store pauses, and one to store missing intervals.
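+
+A minimal sketch of the significant-places idea behind `num_sig_places` (the grouping rule, the thresholds, and the input format here are simplifying assumptions for illustration, not the package's implementation):
+
+```python
+import numpy as np
+
+def count_sig_places(pauses, min_minutes=15, min_dist_m=50):
+    """pauses: list of (lat, lon, duration_minutes) tuples."""
+    earth_radius_m = 6.371e6
+    places = []  # each place: [lat, lon, total_minutes]
+    for lat, lon, minutes in pauses:
+        if minutes < min_minutes:
+            continue  # too short to count as a significant stay
+        for place in places:
+            # small-angle approximation of great-circle distance
+            dlat = np.radians(lat - place[0])
+            dlon = np.radians(lon - place[1]) * np.cos(np.radians(lat))
+            if earth_radius_m * np.hypot(dlat, dlon) < min_dist_m:
+                place[2] += minutes  # analogue of update_existing_place
+                break
+        else:
+            places.append([lat, lon, minutes])  # analogue of add_new_place
+    return len(places)
+
+# two pauses at the same spot and one far away -> 2 significant places
+print(count_sig_places([(42.36, -71.06, 30), (42.36, -71.06, 20), (42.37, -71.11, 45)]))
+```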
`traj2stats.py`
This file converts the imputed trajectory matrix to summary statistics.
@@ -123,7 +125,7 @@ This file converts the imputed trajectory matrix to summary statistics.
- `transform_point_to_circle`: transforms a set of coordinates to a shapely circle with a provided radius.
- `get_nearby_locations`: return a dictionary of nearby locations, a dictionary of nearby locations' names, and a dictionary of nearby locations' coordinates.
- `gps_summaries`: converts the imputed trajectory matrix to summary statistics.
-- `gps_quality_check`: checks the data quality of GPS data. If the quality is poor, the imputation will not be executed.
+- `gps_quality_check`: checks the data quality of GPS data. If the quality is poor, the imputation will not be executed.
- `gps_stats_main`: this is the main function of the jasmine module and it calls every function defined before. It is the function you should use as an end user.
## List of summary statistics
diff --git a/docs/source/sycamore.md b/docs/source/sycamore.md
index a9386d39..092ae1e9 100644
--- a/docs/source/sycamore.md
+++ b/docs/source/sycamore.md
@@ -201,3 +201,91 @@ If surveys are sent on a weekly schedule, Sycamore assumes that there is a surve
**What does `surv_inst_flg` mean in the outputs?**
`surv_inst_flg` is a unique identifying number to distinguish different times when the same individual took the same survey. This column is useful for joining outputs together.
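+
+For example, two outputs that share key columns can be joined on `beiwe_id`, `survey id`, and `surv_inst_flg`. A hedged sketch (the file names below are the outputs described later on this page; check which key columns each of your outputs actually contains before joining):
+
+```python
+import pandas as pd
+
+# submits_only.csv always carries surv_inst_flg; join keys may vary by output
+submits = pd.read_csv("submits_only.csv")
+answers = pd.read_csv("answers_data.csv")
+merged = submits.merge(answers, on=["survey id", "beiwe_id"], how="left")
+```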
+
+
+## List of summary statistics
+
+The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The `submits_summary_daily.csv` and `submits_summary_hourly.csv` files contain the same columns, but with additional granularity at the daily or hourly level rather than at the user level.
+
+
+| Variable | Type | Description of Variable |
+|--------------------------------------- |-------------- |------------------------------------------------------------------------------------------------------------- |
+| survey id | str | ID of the survey to which this row applies. Note: If `submits_by_survey_id` is False, surveys will not be aggregated at the survey level (they will only be aggregated by user), so this column will not appear. |
+| year | int | Year of the period over which submits/deliveries are aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv` |
+| month | int | Month of the period over which submits/deliveries are aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv` |
+| day | int | Day over which submits/deliveries are aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv` |
+| hour | int | Hour over which submits/deliveries are aggregated. This is only included in `submits_summary_hourly.csv` |
+| num_surveys | int | Number of surveys scheduled for delivery to the individual during the period |
+| num_submitted_surveys | int | Number of surveys submitted during the period (i.e. surveys on which the user hit submit) |
+| num_opened_surveys | int | Number of surveys opened by the individual during the time period (i.e. the user answered at least one question) |
+| avg_time_to_submit | float | Average time between survey delivery and survey submission, in seconds, for complete surveys |
+| avg_time_to_open | float | Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
+| avg_duration | float | Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
+
+
+The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.
+
+| Variable | Type | Description of Variable |
+|--------------------------------------- |-------------- |------------------------------------------------------------------------------------------------------------- |
+| survey id | str | ID of the survey |
+| delivery_time | str | A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date |
+| submit_flg | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
+| time_to_submit | float | Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank. |
+| time_to_open | float | Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0) |
+| survey_duration | float | Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA)|
+
+
+The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.
+
+| Variable | Type | Description of Variable |
+|--------------------------------------- |-------------- |------------------------------------------------------------------------------------------------------------- |
+| survey id | str | ID of the survey |
+| beiwe_id | str | The participant’s Beiwe ID |
+| question id | str | The ID of the question for this line |
+| question text | str | The question text corresponding to the answer |
+| question type | str | The type of question (radio button, free response, etc.) corresponding to the answer |
+| question answer options | str | The answer options presented to the user (applicable for check box or radio button surveys) |
+| timestamp | str | The Unix timestamp corresponding to the latest time the user was on the question |
+| Local time | str | The local time corresponding to the latest time the user was on the question |
+| last_answer | str | The last answer the user had selected before moving on to the next question or submitting |
+| all_answers | str | A list of all answers the user selected |
+| num_answers | int | The number of different answers selected by the user (the length of the list in all_answers) |
+| first_time | str | The local time corresponding to the earliest time the user was on the question |
+| last_time | str | The local time corresponding to the latest time the user was on the question |
+| time_to_answer | float | The time that the user spent on the question |
+
+
+The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.
+
+| Variable | Type | Description of Variable |
+|--------------------------------------- |-------------- |------------------------------------------------------------------------------------------------------------- |
+| survey id | str | ID of the survey |
+| beiwe_id | str | The participant’s Beiwe ID |
+| question id | str | The ID of the question for this line |
+| num_answers | int | The number of times the question was answered in the given data |
+| average_time_to_answer | float | The average number of seconds the user takes to answer the question |
+| average_number_of_answers | float | Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it. |
+| most_common_answer | str | A user’s most common answer to a question |
+
+
+The following variables are created in the “submits_only.csv” file. This file will always be generated.
+
+| Variable | Type | Description of Variable |
+|--------------------------------------- |-------------- |------------------------------------------------------------------------------------------------------------- |
+| survey id | str | ID of the survey |
+| beiwe_id | str | The participant’s Beiwe ID |
+| surv_inst_flg | int | A “submission flag” which distinguishes submissions that are done by the same individual on the same survey |
+| max_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
+| min_time | str | The earliest time the individual was interacting with the survey that session |
+| time_to_complete | float | Time between min_time and max_time, in seconds (for responses where a survey_timings file was available) |
+
+
+The following variables are created in a csv file for each survey.
+
+| Variable | Type | Description of Variable |
+|--------------------------------------- |-------------- |------------------------------------------------------------------------------------------------------------- |
+| start_time | str | Time this survey submission was started |
+| end_time | str | Time this survey submission was ended |
+| survey_duration | float | Difference between start and end time, in seconds (for surveys where a survey_timings file was available) |
+| question_1, question_2, … | str | Responses to each question in the survey |
+
diff --git a/docs/source/willow.md b/docs/source/willow.md
index ccf79fcb..4d53912a 100644
--- a/docs/source/willow.md
+++ b/docs/source/willow.md
@@ -36,11 +36,12 @@ ___
| num_in_call | int | Total number of incoming calls |
| num_out_call | int | Total number of outgoing calls |
| num_mis_call | int | Total number of missed calls
-| num_uniq_in_call | float | Total number of unique incoming callers |
-| num_uniq_out_call | int | Total number of unique outgoing calls |
-| num_uniq_mis_call | float | Total number of unique callers missed |
+| num_in_caller | float | Total number of unique incoming callers |
+| num_out_caller | int | Total number of unique individuals the user made outgoing calls to |
+| num_mis_caller | float | Total number of unique callers whose calls were missed |
| total_time_in_call | int | Total amount of minutes spent on incoming calls |
| total_time_out_call | int | Total amount of minutes spent on outgoing calls |
+| num_uniq_individuals_call_or_text | float | Total number of unique individuals who called or texted the Beiwe user, or whom the Beiwe user called or texted, i.e., the total number of individuals with any communication contact with the Beiwe user |
| num_s | float | Total number of sent SMS texts |
| num_r | int | Total number of received SMS texts |
| num_mms_s | int | Total number of sent MMS texts |
@@ -52,6 +53,7 @@ ___
| text_reciprocity_incoming | int | The total number of times a text is sent to a unique person without response |
| text_reciprocity_outgoing | int | The total number of times a text is received by a unique person without response |
+
## References
## Contact information for questions:
diff --git a/forest/jasmine/traj2stats.py b/forest/jasmine/traj2stats.py
index 899a9886..bb1699c8 100644
--- a/forest/jasmine/traj2stats.py
+++ b/forest/jasmine/traj2stats.py
@@ -590,6 +590,17 @@ def gps_summaries(
res += [0] * (2 * len(places_of_interest) + 1)
summary_stats.append(res)
continue
+ elif sum(index_rows) == 0 and not split_day_night:
+ # There is no data and it is daily data, so we need to add empty
+ # rows
+ res = [year, month, day] + [0] * 3 + [pd.NA] * 15
+
+ if places_of_interest is not None:
+ # add empty data for places of interest
+ # for daytime/nighttime + other
+ res += [0] * (2 * len(places_of_interest) + 1)
+ summary_stats.append(res)
+ continue
temp = traj[index_rows, :]
# take a subset which is exactly one hour/day,
diff --git a/forest/willow/log_stats.py b/forest/willow/log_stats.py
index 1418871b..a40e1be0 100644
--- a/forest/willow/log_stats.py
+++ b/forest/willow/log_stats.py
@@ -140,6 +140,68 @@ def text_analysis(
)
+def text_and_call_analysis(
+ df_call: pd.DataFrame, df_text: pd.DataFrame, stamp: int, step_size: int
+) -> tuple:
+ """Calculate the summary statistics for anything requiring both call and
+ text data in the given time interval.
+ Args:
+ df_call: pd.DataFrame
+ dataframe of the call data
+ df_text: pd.DataFrame
+ dataframe of the text data
+ stamp: int
+ starting timestamp of the interval
+        step_size: int
+            length of the time interval, in seconds (the interval ends at
+            stamp + step_size)
+
+ Returns:
+ tuple of summary statistics containing:
+            num_uniq_individuals_call_or_text: int
+                number of unique individuals who made incoming calls or
+                texts to the Beiwe user, or to whom the Beiwe user made
+                outgoing calls or texts
+    """
+ # filter the data based on the timestamp
+ if df_call.shape[0] > 0:
+ temp_call = df_call[
+ (df_call["timestamp"] / 1000 >= stamp)
+ & (df_call["timestamp"] / 1000 < stamp + step_size)
+ ]
+ index_in_call = np.array(temp_call["call type"]) == "Incoming Call"
+ index_out_call = np.array(temp_call["call type"]) == "Outgoing Call"
+ index_mis_call = np.array(temp_call["call type"]) == "Missed Call"
+ calls_in = np.array(temp_call["hashed phone number"])[index_in_call]
+ calls_out = np.array(temp_call["hashed phone number"])[index_out_call]
+ calls_mis = np.array(temp_call["hashed phone number"])[index_mis_call]
+
+    else:  # no calls in the data, so there are no unique call numbers
+        calls_in = np.array([])
+        calls_out = np.array([])
+        calls_mis = np.array([])
+
+ if df_text.shape[0] > 0:
+ temp_text = df_text[
+ (df_text["timestamp"] / 1000 >= stamp)
+ & (df_text["timestamp"] / 1000 < stamp + step_size)
+ ]
+
+ index_s = np.array(temp_text["sent vs received"]) == "sent SMS"
+ index_r = np.array(temp_text["sent vs received"]) == "received SMS"
+ texts_in = np.array(temp_text["hashed phone number"])[index_r]
+ texts_out = np.array(temp_text["hashed phone number"])[index_s]
+ else: # no texts were received, so no unique numbers will be used
+ texts_in = np.array([])
+ texts_out = np.array([])
+
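+    # count distinct hashed phone numbers pooled across incoming, outgoing,
+    # and missed calls plus sent and received texts (a union of sets)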
+ num_uniq_individuals_call_or_text = len(np.unique(np.hstack(
+ [calls_in, texts_in, texts_out, calls_out, calls_mis]
+ )))
+ return (
+ num_uniq_individuals_call_or_text,
+ )
+
+
def call_analysis(df_call: pd.DataFrame, stamp: int, step_size: int) -> tuple:
"""Calculate the summary statistics for the call data
in the given time interval.
@@ -148,9 +210,9 @@ def call_analysis(df_call: pd.DataFrame, stamp: int, step_size: int) -> tuple:
df_call: pd.DataFrame
dataframe of the call data
stamp: int
- starting timestamp of the study
+ starting timestamp of the interval
step_size: int
- ending timestamp of the study
+            length of the time interval, in seconds
Returns:
tuple of summary statistics containing:
@@ -232,9 +294,9 @@ def comm_logs_summaries(
df_call: pd.DataFrame
dataframe of the call data
stamp_start: int
- starting timestamp of the study
+ starting timestamp of the interval
stamp_end: int
- ending timestamp of the study
+ ending timestamp of the interval
tz_str: str
timezone where the study was/is conducted
frequency: Frequency class,
@@ -288,13 +350,19 @@ def comm_logs_summaries(
newline += list(call_stats)
else:
newline += [pd.NA] * 8
+ if df_text.shape[0] > 0 or df_call.shape[0] > 0:
+ text_and_call_stats = text_and_call_analysis(
+ df_call, df_text, stamp, step_size
+ )
+ newline += list(text_and_call_stats)
+ else:
+ newline += [pd.NA]
if df_text.shape[0] > 0:
text_stats = text_analysis(df_text, stamp, step_size, frequency)
newline += list(text_stats)
else:
newline += [pd.NA] * 10
-
if frequency == Frequency.DAILY:
newline = [year, month, day] + newline
else:
@@ -311,6 +379,7 @@ def comm_logs_summaries(
"num_mis_caller",
"total_mins_in_call",
"total_mins_out_call",
+ "num_uniq_individuals_call_or_text",
"num_s",
"num_r",
"num_mms_s",
@@ -425,6 +494,48 @@ def log_stats_main(
tz_str,
frequency,
)
+                    # num_uniq_individuals_call_or_text is the cardinality
+                    # of the union of several sets. It should always be at
+                    # least as large as the cardinality of any one of the
+                    # sets, and it should never be larger than the sum of
+                    # the cardinalities of all of the sets (it equals the
+                    # sum only when the sets are disjoint)
+ sum_all_set_cols = pd.Series(
+ [0]*stats_pdframe.shape[0]
+ )
+ for col in [
+ "num_s_tel", "num_r_tel", "num_in_caller",
+ "num_out_caller", "num_mis_caller"
+ ]:
+ sum_all_set_cols += stats_pdframe[col]
+ if (
+ stats_pdframe[
+ "num_uniq_individuals_call_or_text"
+ ] < stats_pdframe[col]
+ ).any():
+ logger.error(
+ "Error: "
+ "num_uniq_individuals_call_or_text "
+ "was found to be less than %s for at "
+ "least one time interval. This error "
+ "comes from an issue with the code,"
+ " not an issue with the input data",
+ col
+ )
+ if (
+ stats_pdframe[
+ "num_uniq_individuals_call_or_text"
+ ] > sum_all_set_cols
+ ).any():
+ logger.error(
+ "Error: "
+ "num_uniq_individuals_call_or_text "
+ "was found to be larger than the sum "
+ "of individual cardinalities for at "
+ "least one time interval. This error "
+ "comes from an issue with the code,"
+ " not an issue with the input data"
+ )
write_all_summaries(bid, stats_pdframe, output_folder)