Prediction Validation

A coding challenge for the Insight Data Engineering Program

Requirements

Python 3.6.5

Problem Statement

Problem Description

Write a program that reads two time-ordered files and calculates the average error within the given window size.

Input Format

actual.txt and predicted.txt: time-ordered, line-separated lists of pipe-delimited records: hour (integer), stock (string), and price (float).
window.txt: an integer greater than 0.

Output Format

comparison.txt: a time-ordered, line-separated list of pipe-delimited records: start hour of the window, end hour of the window, and average error.
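
As a rough illustration of the two formats above, the following sketch parses one input record and formats one output record. The function names `parse_record` and `format_result` and the ticker `AAPL` are hypothetical and do not come from the repository; rounding the average to two decimals follows the approach described later in this README.

```python
def parse_record(line):
    """Split one 'hour|stock|price' line into typed fields."""
    hour, stock, price = line.strip().split("|")
    return int(hour), stock, float(price)


def format_result(start_hour, end_hour, average_error):
    """Build one pipe-delimited line for comparison.txt."""
    return f"{start_hour}|{end_hour}|{average_error:.2f}"


# Hypothetical sample values, for illustration only.
record = parse_record("1|AAPL|100.25\n")
line = format_result(1, 2, 0.19)
```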

Original approach (`pv-ry.py`)

Dollar to penny: Because dollar amounts have exactly two decimal places, float is not a precise way to represent them. I convert every price to an integer number of pennies at the beginning. At the end, when calculating the average errors, I convert back to dollars and round to two decimal places.
Prepare the data structures: dictionaries pred_dict = {hour: {stock: price}} and time_dict = {hour: (count, error_sum)}, plus the variable time_count.
Read and parse predicted.txt
- Start with predicted.txt, since not every stock has a usable prediction.
- Read and parse predicted.txt to build pred_dict.
- Since every predicted price corresponds to an actual price for the same stock, I count the stocks to compare in each hour while building pred_dict and store the count in time_dict.
- The maximum hour is also recorded as time_count.
Compare with actual.txt
- Read and parse actual.txt.
- If a stock is found in pred_dict at the given hour, calculate the difference between the predicted and actual price and add it to error_sum, which completes time_dict.
Calculate the average in each time window
- Start with the time window from hour 0 to hour window-1 (e.g. 0 to 1 when window = 2).
- Slide the time window by removing the previous starting hour and adding the current ending hour.
- Calculate the average error for every time window and write it to the output file.
- Use dictionary.get() to look up values in time_dict, in case some hours have no usable prediction.
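
The steps above can be sketched roughly as follows. This is not the code in `pv-ry.py`: the function names, the use of `Decimal` for the penny conversion, taking the absolute value of the price difference, counting matches during the pass over the actual prices, and returning `None` for a window with no matched prices are all assumptions made for illustration.

```python
from decimal import Decimal


def to_cents(price_text):
    """Convert a price string such as '10.25' to an integer number of pennies."""
    return int((Decimal(price_text) * 100).to_integral_value())


def build_time_dict(predicted_lines, actual_lines):
    """Return {hour: (count, error_sum_in_cents)} for stocks present in both inputs."""
    pred_dict = {}  # {hour: {stock: price_in_cents}}
    for line in predicted_lines:
        hour, stock, price = line.strip().split("|")
        pred_dict.setdefault(int(hour), {})[stock] = to_cents(price)

    time_dict = {}
    for line in actual_lines:
        hour, stock, price = line.strip().split("|")
        hour = int(hour)
        predicted = pred_dict.get(hour, {}).get(stock)
        if predicted is None:
            continue  # no prediction for this stock at this hour
        count, error_sum = time_dict.get(hour, (0, 0))
        # Assumption: the error is the absolute price difference.
        time_dict[hour] = (count + 1, error_sum + abs(to_cents(price) - predicted))
    return time_dict


def window_averages(time_dict, first_hour, last_hour, window):
    """Yield (start, end, average_error_in_dollars or None) for each window."""
    count = 0
    error_sum = 0
    # Fill the first window; .get() covers hours with no matched prices.
    for hour in range(first_hour, first_hour + window):
        c, e = time_dict.get(hour, (0, 0))
        count += c
        error_sum += e

    start = first_hour
    while True:
        end = start + window - 1
        average = round(error_sum / count / 100, 2) if count else None
        yield start, end, average
        if end >= last_hour:
            break
        # Slide: drop the old starting hour, add the new ending hour.
        c, e = time_dict.get(start, (0, 0))
        count -= c
        error_sum -= e
        c, e = time_dict.get(end + 1, (0, 0))
        count += c
        error_sum += e
        start += 1
```

Because each slide only subtracts one hour and adds one hour, the running totals are updated in constant time per window rather than being recomputed from scratch.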

Comment

The test passed, with results within 0.01 of the expected values.
Using dictionaries as the container datatype is efficient, but they may occupy too much memory as the data scales up.

(Aug 31, 2018)

Adapted approach (`pv-ry1.py`)

Use deque from collections in Python's standard library. This is a list-like container with fast appends and pops on either end.
Since the data are time-ordered, we can use a deque as the container datatype when comparing pred_dict with actual.txt, building time_deque, which stores the count and error sum for every hour.
When sliding the time window, we can also keep a small window_deque holding the entries popped from time_deque that may still be needed by later time windows.
Both deque.append() and deque.popleft() run in O(1), so we retain (or even improve) the efficiency while reducing memory usage.
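
A minimal sketch of this idea is shown below. It is not the code in `pv-ry1.py`: the function name, the `(hour, count, error_sum)` tuple layout, and returning `None` for a window with no matched prices are assumptions. Note that the function consumes `time_deque` as it goes.

```python
from collections import deque


def deque_window_averages(time_deque, first_hour, last_hour, window):
    """Return [(start, end, average_error_in_dollars or None), ...].

    time_deque holds (hour, count, error_sum_in_cents) tuples in time order
    and is emptied while the windows are processed.
    """
    window_deque = deque()  # entries that fall inside the current window
    count = 0
    error_sum = 0
    results = []

    last_start = max(first_hour, last_hour - window + 1)
    for start in range(first_hour, last_start + 1):
        end = start + window - 1
        # Move entries that have entered the window from time_deque.
        while time_deque and time_deque[0][0] <= end:
            entry = time_deque.popleft()
            window_deque.append(entry)
            count += entry[1]
            error_sum += entry[2]
        # Discard entries that have fallen out of the window.
        while window_deque and window_deque[0][0] < start:
            _, old_count, old_sum = window_deque.popleft()
            count -= old_count
            error_sum -= old_sum
        average = round(error_sum / count / 100, 2) if count else None
        results.append((start, end, average))
    return results
```

With this layout, only the entries inside the current window stay in `window_deque`, and each entry is appended and popped at most once.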

ziyunch/prediction-validation

Folders and files

Latest commit

History

Repository files navigation

Prediction Validation

Requirements

Problem Statement

Problem Description

Input Format

Output Format

Original approach (pv-ry.py)

Comment

Adapted approach (pv-ry1.py)

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Original approach (`pv-ry.py`)

Adapted approach (`pv-ry1.py`)

Packages