Clean socrata data #3

Open
galbwe opened this issue May 18, 2021 · 5 comments

Comments

@galbwe
Collaborator

galbwe commented May 18, 2021

  • In the data scraped from the Socrata API:
    1. Names should not be in all caps.
    2. Nonprofits whose names show an obvious religious affiliation should be filtered out.
  • Cleaning can be done as part of the find_socrata_api_leads task (in tasks.py), or in a separate task in tasks.py that runs after find_socrata_api_leads.
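
A rough sketch of what the two cleaning steps might look like, assuming each lead is a dict with a "name" key. The keyword list, the "name" key, and the function names are all hypothetical and are not taken from tasks.py:

```python
import re
import string

# Hypothetical keyword list; the issue does not say which words count as
# an "obvious" religious affiliation, so this would need review.
RELIGIOUS_KEYWORDS = (
    "church", "ministry", "ministries", "temple",
    "mosque", "synagogue", "parish", "diocese",
)

# Match whole words only, so "Churchill" does not match "church".
_RELIGIOUS_RE = re.compile(
    r"\b(" + "|".join(RELIGIOUS_KEYWORDS) + r")\b", re.IGNORECASE
)


def normalize_name(name):
    """Convert an all-caps name to capitalized words; leave other names alone."""
    if name.isupper():
        # capwords also collapses runs of whitespace and, unlike str.title(),
        # does not capitalize the letter after an apostrophe.
        return string.capwords(name)
    return name


def has_religious_name(name):
    """Return True if the name contains one of the keywords as a whole word."""
    return bool(_RELIGIOUS_RE.search(name))


def clean_leads(leads):
    """Drop leads with religious-sounding names and normalize the rest."""
    return [
        {**lead, "name": normalize_name(lead["name"])}
        for lead in leads
        if not has_religious_name(lead["name"])
    ]
```

One known limitation: acronyms in all-caps names are lowercased too (for example "YMCA OF DENVER" becomes "Ymca Of Denver"), so a list of exceptions may be needed.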
@galbwe added the backend label on May 18, 2021
@galbwe changed the title from "Normalize company name string format" to "Clean socrata data" on Jul 17, 2021
@Nova791

Nova791 commented Aug 4, 2021

Clean Socrata Data

@galbwe added the Hacktoberfest label (Hacktoberfest 2021) on Oct 1, 2021
@kaleeaswari
Collaborator

@galbwe Can I get this assigned?

@galbwe
Collaborator Author

galbwe commented Oct 24, 2021

Hey @kaleeaswari, if you can focus on the Colorado Non-profits (CNP) data, that would be the most helpful. We currently aren't using the Socrata dataset because it is missing most of the information the app needs.

The files you will want to look at are scrape_CNP.py and tasks.py. A good start would be to filter out the records that are missing all of the fields needed for the table on the homepage. There are other things to do, like making sure strings are formatted consistently and removing records for nonprofits with obvious religious affiliations.

@kaleeaswari
Collaborator

Mandatory fields for a record to count as a lead: Name, Contact, Website, SocialMedia. Is that correct?

@galbwe

@galbwe
Collaborator Author

galbwe commented Oct 24, 2021

I think they all have the Name field. If a lead is missing all of Contact, Website, Facebook, Twitter, Instagram, and LinkedIn, then it should be dropped.
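
A minimal sketch of that rule, assuming each lead is a dict keyed by the field names above and that a missing value shows up as an absent key, None, or a blank string. The actual record shape in scrape_CNP.py may differ, and the function names are hypothetical:

```python
# Field names as listed in the comment above; using them as dict keys
# is an assumption about how the scraped records are stored.
CONTACT_FIELDS = (
    "Contact", "Website", "Facebook", "Twitter", "Instagram", "LinkedIn",
)


def has_contact_info(lead):
    """Return True if at least one contact field has a non-blank value."""
    for field in CONTACT_FIELDS:
        value = lead.get(field)
        if isinstance(value, str):
            value = value.strip()  # treat whitespace-only strings as missing
        if value:
            return True
    return False


def drop_empty_leads(leads):
    """Keep only the leads that have some way to reach the nonprofit."""
    return [lead for lead in leads if has_contact_info(lead)]
```

A lead with only a Name, or with every contact field blank, is dropped; a lead with any single field filled in is kept.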

@galbwe removed the Hacktoberfest label (Hacktoberfest 2021) on Nov 2, 2021
Development

No branches or pull requests

3 participants