A coding challenge for Insight Data Engineering Program
Python 3.6.5
Write a program that reads in two time-ordered files and calculate the average error within the given window size.
actual.txt
andpredicted.txt
: time-ordered line-separated lists containing pipe-delimitedhour
(integer),stock
(string) andprice
(float);window.txt
: an integer greater than 0.
comparison.txt
: time-ordered line-separated lists containing pipe-delimitedhour
,hour+window
,average error
.
- Dollar to penny Since the dollar currency only contains 2 decimal points, float is not a precise way to represent it. I've change dollar currency into penny, which have integer amounts, at the beginning. In the end when I calculate average errors, I switch to dollar and round to 2 decimal points.
- Prepare the data structure
dictionary:
pred_dict = {hour:{stock:price}}
;time_dict = {hour:(count, error_sum)}
; variable:time_count
. - Read and parse
predicted.txt
- Start with
predict.txt
since not all stock values have useful predicted information. - Read and parse
predict.txt
and prepare dictionarypred_dict
. - Since all the predicted stock price were based on actual price for the same stock, I count the stocks for comparison in each hour when I prepare
pred_dict
and store intime_dict
. - Maximum hour is also counted as
time_count
.
- Start with
- Compare with
actual.txt
- Read and parse
actual.txt
. - If find a stock at given hour in
pred_dict
, calculate the price difference between prediction and actual value and add it intoerror_sum
to finish up buildingtime_dict
.
- Read and parse
- Calculate average in time windows.
- Start with time window from 0 to
window-1
. (e.g.0 to 1
, whenwindow = 2
). - Slide the time window by removing the previous starting hour and adding the current ending hour.
- Calculate average error for every time window and write into file.
- Use
dictionary.get()
to find value from time_dict, in case we don't have good prediction of stock price in certain hours.
- Start with time window from 0 to
- Test passed within 0.01 of the expected value;
- It is efficient to use dictionary as container datatype, but it may occupy to much memory when scale up.
(Aug 31, 2018)
- Use
deque
fromcollections
in Python's standard library. This is a list-like container with fast appends and pops on either end. - Since the data are time-ordered, we could use
deque
as the container datatype when we comparepred_dict
withactual.txt
and buildtime_deque
which save count and sum of error for every hour. - When we slide the time window, we could also make a small
window_deque
to save the data we pop out fromtime_deque
but might be useful for later time window. - The time complexity of
deque.append()
anddeque.leftpop()
is O(1), thus we would retain (or even improve) the efficiency while reduce memory usage.