AD suggested changes to ReadMe for senior role #4

Open · wants to merge 13 commits into `main`
Conversation

AmandaDoyle (Member)

Happy to talk through reasoning.


@SashaWeinstein left a comment


This does incorporate most of the changes I had in mind. I still think we should make it harder, more confusing, and more repetitive, but I see that you have different thoughts on this.

README.md Outdated

### Task 2: Data Aggregation
To download 311 service request records, write a script that takes two parameters passed from the command line: number of days and responding agency acronym. For example, if a user wanted to get all service request records created in the last week where DSNY is the responding agency, they would pass `7` and `DSNY` as the parameters. For this exercise, we ask that you download all 311 service requests filed in the **last seven days** where **HPD** is the responding agency. Save the data as a csv named `raw.csv` in a folder called `data`.
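A minimal sketch of such a script, assuming the Socrata CSV endpoint for the NYC 311 Service Requests dataset (the dataset ID, field names, and row limit below are assumptions, not part of the challenge text):

```python
import argparse
from datetime import datetime, timedelta

# Assumed Socrata endpoint for the 311 Service Requests dataset
API_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"

def build_params(days, agency, now=None):
    """Build Socrata query parameters for the last `days` days of one agency."""
    now = now or datetime.utcnow()
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    return {
        "$where": f"created_date >= '{cutoff}' AND agency = '{agency}'",
        "$limit": 500000,  # raise the small default row cap
    }

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Download 311 service requests")
    parser.add_argument("days", type=int, help="look-back window in days, e.g. 7")
    parser.add_argument("agency", help="responding agency acronym, e.g. DSNY")
    return parser.parse_args(argv)

if __name__ == "__main__":
    import pathlib
    import requests
    args = parse_args()
    resp = requests.get(API_URL, params=build_params(args.days, args.agency))
    resp.raise_for_status()
    pathlib.Path("data").mkdir(exist_ok=True)
    pathlib.Path("data/raw.csv").write_bytes(resp.content)
```

Run as, e.g., `python download.py 7 HPD`.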


The reason I would prefer to have them choose the number of days and responding agency is that it has them read the source data themselves and see what the responding agencies are. Having them save multiple files with names of their choice tests their ability to cache well-named files. I would prefer to see `HPD_last_7.csv` and `DOT_last_10.csv`, with the filenames constructed by the Python code, rather than `data1.csv` and `data2.csv`.

Additionally, if we ask them to read the whole challenge before starting, they will know not to choose one day or 1,000 days, as these don't produce such good plots.
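A filename like that can be built directly from the CLI arguments; a minimal sketch (the helper name is hypothetical):

```python
def cache_filename(agency, days):
    """Build a descriptive cache path from the CLI arguments."""
    return f"data/{agency}_last_{days}.csv"
```

For example, `cache_filename("HPD", 7)` yields `data/HPD_last_7.csv`.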


@SashaWeinstein That makes sense to me and tests their data acumen, but I can see @AmandaDoyle's point as well.

Write a process to produce a time series table based on the `data/raw.csv` file we created in **Task 1** that has the following fields:

- `created_date_time`: the timestamp of request creation by date and hour OR just date
- `complaint_type`: the type of the complaint


Having them pass the complaint type as an argument, and having it be optional, tests something that Task 1 doesn't: optional args require a different implementation on both the argparse side and the data processing side.
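A hedged sketch of both sides of that (argument names and row shape are assumptions):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Aggregate 311 requests")
    parser.add_argument("input_csv", help="path to the cached raw csv")
    parser.add_argument("--complaint-type", default=None,
                        help="optional filter; aggregate every type when omitted")
    return parser.parse_args(argv)

def filter_rows(rows, complaint_type=None):
    """Keep all rows when no type is given, otherwise filter to one type."""
    if complaint_type is None:
        return list(rows)
    return [r for r in rows if r["complaint_type"] == complaint_type]
```

The `default=None` sentinel is what lets the data-processing side distinguish "no filter requested" from a real value.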


### Task 4: Spatial data processing
Create a multi-line plot to show the total service request counts by `created_date_time` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.


I think having them produce multiple plots from the multiple `.csv`s they cached is a good test of writing reusable data viz code that sets axes/titles programmatically based on what it's passed.
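A minimal sketch of such a reusable plotting function (the column names and the `source_name` convention are assumptions, not part of the challenge):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_counts(df, source_name, out_path):
    """Multi-line plot of request counts per complaint type.

    Assumes `df` has created_date_time, complaint_type, and count columns.
    The title is derived from `source_name`, so the same function works
    for every cached csv (e.g. HPD_last_7, DOT_last_10).
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    for ctype, grp in df.groupby("complaint_type"):
        ax.plot(grp["created_date_time"], grp["count"], label=ctype)
    ax.set_title(f"311 service requests by complaint type: {source_name}")
    ax.set_xlabel("created_date_time")
    ax.set_ylabel("request count")
    ax.legend(fontsize="small")
    fig.autofmt_xdate()
    fig.savefig(out_path)
    plt.close(fig)
```

Calling it once per cached file keeps the titles and filenames consistent without any copy/paste.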


I agree with Sasha, especially after the work we've been doing with the QAQC app. It's great to be able to communicate effective data viz in succinct code, especially when it comes to the little formatting issues that inevitably come up.

README.md Outdated

At Data Engineering, we enhance datasets with geospatial attributes, such as point locations and administrative boundaries. To help us better understand the data from **Python Task 1**, we would like you to join the initial raw data to an NYC administrative boundary. Then create a choropleth map of the 7-day total count of complaints where `HPD` is the responding agency for a specific `complaint_type` of your choice.
Depending on how you generate the map, you can store the map as a `.png` or `.html` under the `data` folder.


This seems good to me

> Note: Depending on your preference, you can use [Postgres](https://www.postgresql.org/), which is preferred; however, if you are familiar with [SQLite](https://docs.python.org/3/library/sqlite3.html) (much easier to set up and use), you can use that too.
- Set up a PostGIS container using an image. [Here](https://registry.hub.docker.com/r/postgis/postgis/) is the one we use.
- Load the `data/raw.csv` into a database and name the table `sample_311`. Make sure this process is captured in a script.
- Perform the same aggregation as in **Python Task 2** in SQL and store the results in a table (same name as the corresponding csv file).
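A minimal sketch of the SQLite route (the result table name `hpd_last_7` and the hourly `substr` bucketing are assumptions following the naming discussion above):

```python
import csv
import sqlite3

# Aggregation mirroring Python Task 2: counts per hour per complaint type.
# The result table name is assumed to match the corresponding csv file.
AGG_SQL = """
CREATE TABLE hpd_last_7 AS
SELECT substr(created_date, 1, 13) AS created_date_hour,
       complaint_type,
       COUNT(*) AS total
FROM sample_311
GROUP BY substr(created_date, 1, 13), complaint_type
"""

def load_and_aggregate(csv_path, db_path=":memory:"):
    """Load a raw csv into `sample_311`, then build the aggregate table."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute(f"CREATE TABLE sample_311 ({cols})")
        con.executemany(f"INSERT INTO sample_311 VALUES ({marks})", reader)
    con.execute(AGG_SQL)
    con.commit()
    return con
```

The same `CREATE TABLE ... AS SELECT` statement works nearly unchanged against Postgres; only the timestamp-truncation function would differ (e.g. `date_trunc`).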


Seems good to me; you've read my thoughts on the file name and on having the interviewee find the image themselves.


mbh329 commented Aug 4, 2022

All looks good to me. Language is clear.

AmandaDoyle and others added 12 commits August 8, 2022 11:37
- Only ask for bash scripting as a bonus item for Task 1. The data challenges we got for the data engineering position in the summer/fall of 2022 had a lot of copy/pasted bash code. The challenge will be faster to complete if we only ask for it once.
- I got hung up on how to describe the date filter. I'm not sure if this language is sufficiently clear:
  > Write a Python script that pulls data from the NYC Open Data API based on two filters. The first filter is on responding agency. The second filter is an integer date filter to only get calls `n` days before the current date.
- Add a couple of sentences to remind the interviewee that they need to find new administrative boundaries to aggregate on. I think the original instructions were actually clearer than I assumed, so I'm less sure this upgrade is actually needed. Figured I would let the team give some input.
- Add a second bonus task to SQL/Docker Task 2: push an image with the setup and code to Docker Hub so we can pull it down and run the code.
- Include a list of administrative boundaries that aren't valid choices for Python Task 4.
- Clarified instructions for Python Task 1.
- Clarified the second introduction paragraph.
- Clarified instructions in the Docker bonus task.
@SashaWeinstein

Does it make sense to close this PR now that we know we want to keep the advanced data challenge on a separate branch from main?
