AD suggested changes to ReadMe for senior role #4
base: main
Conversation
Happy to talk through reasoning.
This does incorporate most of the changes I had in mind. I still think we should make it harder, more confusing, and more repetitive, but I see that you have different thoughts on this.
README.md (outdated)
> ### Task 2: Data Aggregation
> To download 311 service request records, write a script that takes two parameters passed from the command line: number of days and responding agency acronym. For example, if a user wanted to get all service request records created in the last week where DSNY is the responding agency, they would pass `7` and `DSNY` as the parameters. For this exercise, we ask that you download all 311 service requests filed in the **last seven days** where **HPD** is the responding agency. Save the data as a csv named `raw.csv` in a folder called `data`.
The reason I would prefer to have them choose the number of days and responding agency is that it has them read the source data themselves and see what the responding agencies are. Having them save multiple files with names of their choice tests their ability to cache well-named files. I would prefer to see `HPD_last_7.csv` and `DOT_last_10.csv`, with the filenames constructed by the Python code, rather than `data1.csv` and `data2.csv`.

Additionally, if we ask them to read the whole challenge before starting, they will know not to choose one day or 1,000 days, as these don't produce such good plots.
@SashaWeinstein That makes sense to me and tests their data acumen, but I can see @AmandaDoyle's point as well.
README.md
> Write a process to produce a time series table based on the `data/raw.csv` file we created in **Task 1** that has the following fields:
> - `created_date_time`: the timestamp of request creation by date and hour OR just date (previously `created_date_hour`: by date and hour)
> - `complaint_type`: the type of the complaint
Having them pass the complaint type as an argument, and having it be optional, tests something that task 1 doesn't test: optional args require a different implementation on both the argparse side and the data-processing side.
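For illustration, a minimal sketch of what an optional complaint-type argument might look like, assuming pandas and the field names discussed in this thread; the script structure and function names are hypothetical, not taken from the challenge repo:

```python
import argparse
from typing import Optional

import pandas as pd


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Aggregate 311 service requests")
    parser.add_argument("csv_path", help="a cached extract, e.g. data/HPD_last_7.csv")
    # Optional flag: defaults to None when the user omits it entirely
    parser.add_argument("--complaint-type", default=None,
                        help="restrict the aggregation to one complaint type")
    return parser.parse_args()


def aggregate(df: pd.DataFrame, complaint_type: Optional[str]) -> pd.DataFrame:
    # The processing side has to branch on whether the optional arg was given
    if complaint_type is not None:
        df = df[df["complaint_type"] == complaint_type]
    return (df.groupby(["created_date_time", "complaint_type"])
              .size()
              .reset_index(name="count"))


if __name__ == "__main__":
    args = parse_args()
    frame = pd.read_csv(args.csv_path)
    print(aggregate(frame, args.complaint_type))
```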
README.md
> ### Task 4: Spatial data processing
> Create a multi-line plot to show the total service request counts by `created_date_time` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.
I think having them produce multiple plots from the multiple `.csv`s they cached is a good test of writing reusable data viz code that sets axes/titles programmatically based on what it's passed.
I agree with Sasha, especially after the work we've been doing with the QAQC app. It's great having the ability to communicate effective data viz in succinct code, especially when it comes to the little formatting issues that inevitably come up.
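As a sketch of the reusable-plot idea, assuming matplotlib/pandas and the filename convention suggested above (the function name and layout choices here are illustrative only):

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


def plot_requests(csv_path: str, out_dir: str = "data") -> Path:
    """Plot request counts over time, one line per complaint type,
    deriving the title and output filename from the input file."""
    df = pd.read_csv(csv_path, parse_dates=["created_date_time"])
    counts = (df.groupby(["created_date_time", "complaint_type"])
                .size()
                .unstack("complaint_type", fill_value=0))

    ax = counts.plot(figsize=(10, 6))   # one line per complaint type
    stem = Path(csv_path).stem          # e.g. "HPD_last_7"
    ax.set_title(f"311 service requests: {stem.replace('_', ' ')}")
    ax.set_xlabel("Created date/time")
    ax.set_ylabel("Request count")
    ax.legend(title="Complaint type", fontsize="small")

    out_path = Path(out_dir) / f"{stem}.png"
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()
    return out_path


# The same function then covers every cached extract:
for cached in ("data/HPD_last_7.csv", "data/DOT_last_10.csv"):
    plot_requests(cached)
```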
README.md (outdated)
> At Data Engineering, we enhance datasets with geospatial attributes, such as point locations and administrative boundaries. To help us better understand the data from **Python Task 1**, we would like you to join the initial raw data to an NYC administrative boundary. Then create a choropleth map of the 7-day total count of complaints where `HPD` is the responding agency for a specific `complaint_type` of your choice.
> Depending on how you generate the map, you can store the map as a `.png` or `.html` under the `data` folder.
This seems good to me.
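For reference, the join-and-choropleth step could look something like this geopandas sketch; the boundary file path, column names, and the complaint type chosen are placeholders, not part of the challenge:

```python
import geopandas as gpd
import pandas as pd

# Placeholder boundary file: any NYC administrative boundary layer
# (e.g. community districts) exported as GeoJSON would work here.
boundaries = gpd.read_file("data/nyc_boundaries.geojson")

raw = pd.read_csv("data/raw.csv").dropna(subset=["latitude", "longitude"])
points = gpd.GeoDataFrame(
    raw,
    geometry=gpd.points_from_xy(raw["longitude"], raw["latitude"]),
    crs="EPSG:4326",
)

# Assign each request to the boundary polygon that contains it
joined = gpd.sjoin(points, boundaries.to_crs(points.crs), predicate="within")

# Count requests per polygon for one (illustrative) complaint type
one_type = joined[joined["complaint_type"] == "HEAT/HOT WATER"]
counts = one_type.groupby("index_right").size()

boundaries["count"] = counts.reindex(boundaries.index).fillna(0)
ax = boundaries.plot(column="count", cmap="OrRd", legend=True, figsize=(8, 8))
ax.set_axis_off()
ax.figure.savefig("data/choropleth.png")
```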
README.md
> > Note: Depending on your preference, you can use [Postgres](https://www.postgresql.org/), which is preferred; however, if you are familiar with [SQLite](https://docs.python.org/3/library/sqlite3.html) (much easier to set up and use), you can use that too.
> - Set up a PostGIS container using an image. [Here](https://registry.hub.docker.com/r/postgis/postgis/) is the one we use.
> - Load the `data/raw.csv` into a database and name the table `sample_311`. Make sure this process is captured in a script.
> - Perform the same aggregation as in **Python Task 2** in SQL and store the results in a table (same name as the corresponding csv file).
Seems good to me; you've read my thoughts on the file names and on having the interviewee find the image themselves.
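A sketch of the load step against the PostGIS container, assuming pandas plus SQLAlchemy and the image's default credentials (all connection details below are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: values must match how the container was
# started, e.g.
#   docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=postgres postgis/postgis
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

df = pd.read_csv("data/raw.csv")
# Replace the table on re-runs so the load script stays idempotent
df.to_sql("sample_311", engine, if_exists="replace", index=False)
```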
All looks good to me. Language is clear.
Only ask for bash scripting as a bonus item for task 1. The data challenges we got for the data engineering position in the summer/fall of 2022 had a lot of copy/pasted bash code, and the challenge will be faster to complete if we only ask for it once. I got hung up on how to describe the date filter; I'm not sure if this language is sufficiently clear:

> Write a Python script that pulls data from the NYC Open Data API based on two filters. The first filter is on responding agency. The second filter is an integer date filter to only get calls `n` days before the current date.
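To make that language concrete, the two filters could translate to a Socrata query roughly like the sketch below; the dataset ID is what I believe the 311 dataset uses on the NYC Open Data portal, but it should be verified:

```python
from datetime import datetime, timedelta

import requests

# erm2-nwe9 is, to my knowledge, the 311 Service Requests dataset ID on
# the NYC Open Data portal; worth verifying before relying on it.
URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def fetch_requests(days: int, agency: str, limit: int = 50000) -> list:
    """Pull records created in the last `days` days where `agency` responded."""
    cutoff = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        "agency": agency,                        # first filter: responding agency
        "$where": f"created_date > '{cutoff}'",  # second filter: rolling date window
        "$limit": limit,
    }
    resp = requests.get(URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()


records = fetch_requests(7, "HPD")
```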
Add a couple of sentences to remind the interviewee that they need to find new administrative boundaries to aggregate on. I think the original instructions were actually clearer than I assumed, so I'm less sure this upgrade is actually needed. Figured I would let the team give some input.
Add a second bonus task to SQL/Docker task 2. The task is to push an image with the setup and code to Docker Hub so we can pull it down and run the code.
- Include list of administrative boundaries that aren't valid choices for python task 4
- Clarified instructions for python task 1
- Clarified the second introduction paragraph
- Limit bash scripting to task 1
- Don't make borough choropleth
- Clarified instructions in docker bonus task
- Docker Hub bonus task
Does it make sense to close this PR now that we know we want to keep the advanced data challenge on a separate branch from main?