# Sharechat Fresh Content Scraper

Scraper function loaded from the Sharechat scrapers module. Runs when `content_to_scrape="fresh"` in Config. Scrapes content from the "trending" tab on the tag page, which allows the user to get content posted around or leading up to a particular date and time, determined by the `unix_timestamp` value in Config. The scraped content is in reverse chronological order, since the scraper pages backwards in time from that timestamp.
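
As a rough illustration, the relevant Config values might look like the sketch below; the variable names here are assumptions based on this document, not the actual Config file.

```python
# Hypothetical sketch of the Config values this scraper reads.
# Variable names are assumptions based on this document, not the real Config.
content_to_scrape = "fresh"        # selects this scraper
unix_timestamp = "1596188100"      # scrape posts posted around this date & time
tag_hashes = ["Mz8jDb", "1xKeA9"]  # Sharechat tag hashes to scrape (made up here)
pages = 3                          # number of pages of posts to scrape per tag
```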

Scraper workflow:

The scraper performs the following steps using helper functions imported from the Sharechat helper and S3 Mongo helper modules.

  1. Initializes the S3 and MongoDB connections. This is done first to ensure that any authentication errors are caught before the scraping begins.

  2. Calls `get_fresh_data()`, which does the actual scraping as follows:

    1. Initializes an empty Pandas dataframe df with column labels corresponding to the content that will be scraped.

    2. Starts a loop that scrapes from each tag in the list of tag hashes entered in Config.

      For each tag:

      1. Generates a requests dictionary with `generate_requests_dict()`. The returned dictionary contains the parameters required to replicate the Sharechat API requests described below.

      2. Sends a `requestType66` request to get data about the tag.

      3. Scrapes the JSON response with a helper function called `get_tag_data()`. This returns `tag_name`, `tag_translation`, `tag_genre`, `bucket_name` and `bucket_id`.
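
      A hedged sketch of this tag-metadata step follows; the dictionary keys, request body and response keys below are illustrative assumptions, not the actual Sharechat API.

      ```python
      import requests

      # Hypothetical sketch of the requestType66 step. The URL, request body
      # and response keys are assumptions; in the scraper they come from the
      # dictionary returned by generate_requests_dict().
      def get_tag_data(requests_dict: dict, tag_hash: str) -> dict:
          response = requests.post(
              requests_dict["tag_data_url"],     # assumed key
              json={"tagHash": tag_hash},        # assumed request body
              headers=requests_dict["headers"],  # assumed key
              timeout=30,
          )
          payload = response.json()              # response structure assumed
          return {
              "tag_name": payload.get("tagName"),
              "tag_translation": payload.get("tagTranslation"),
              "tag_genre": payload.get("tagGenre"),
              "bucket_name": payload.get("bucketName"),
              "bucket_id": payload.get("bucketId"),
          }
      ```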

      4. Starts a loop to scrape "n" pages of fresh post data from the tag, where "n" is the number of pages entered in Config.

        For each page:

        1. Sends a `requestType25` request using the helper function `get_response_dict()`. The `unix_timestamp` specified in Config is included in the request body. This returns a JSON response containing posts that were posted around the same time as the Unix timestamp.
        2. Scrapes the JSON response with a helper function called `get_post_data()`. This returns a dataframe called `post_data` containing the following metadata for each post: media_link, timestamp, language, media_type, external_shares, likes, comments, reposts, post_permalink, caption, text, views, profile_page.
          Note that some of the dataframe's column names differ from the metadata labels generated by Sharechat, e.g. 'usc' is renamed to 'external_shares'.
        3. Scrapes the JSON response with a helper function called `get_next_timestamp()`. This returns an earlier Unix timestamp, which is required to scrape the next page of the loop, i.e. the scraper goes backwards in time. This timestamp is included in the request sent by `get_response_dict()` in the next iteration of the loop.
        4. Appends the `post_data` to the main dataframe.
        5. Pauses for 30-35 seconds (a random time delay to avoid bombarding the Sharechat API with requests).
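
        Taken together, the page loop might look roughly like the sketch below; `get_response_dict()`, `get_post_data()` and `get_next_timestamp()` stand in for the helpers described above, and their signatures are assumptions.

        ```python
        import random
        import time
        import pandas as pd

        # Hypothetical sketch of the page loop; helper signatures are assumptions.
        def scrape_tag(requests_dict: dict, unix_timestamp: str, pages: int) -> pd.DataFrame:
            df = pd.DataFrame()
            timestamp = unix_timestamp
            for _ in range(pages):
                response = get_response_dict(requests_dict, timestamp)  # requestType25
                post_data = get_post_data(response)       # one row per post; renames raw
                                                          # labels, e.g. 'usc' -> 'external_shares'
                timestamp = get_next_timestamp(response)  # earlier timestamp -> next page
                df = pd.concat([df, post_data], ignore_index=True)  # append to main df
                time.sleep(random.uniform(30, 35))        # random delay to go easy on the API
            return df
        ```
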
    3. Drops duplicate posts (rows) from the main dataframe.

    4. Transforms all the post timestamps with `datetime.utcfromtimestamp()` in accordance with Tattle's datetime conventions.

    5. Adds a UUID filename to each post, used for identification across S3 and MongoDB.

    6. Adds the `scraped_date` to each post.

    7. Returns the main dataframe.
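
    Steps 3-6 might look roughly like this sketch; the dedupe column and the exact column names are assumptions, apart from those the document lists (e.g. timestamp, post_permalink).

    ```python
    import uuid
    from datetime import datetime
    import pandas as pd

    # Hypothetical sketch of get_fresh_data() steps 3-6.
    def postprocess(df: pd.DataFrame) -> pd.DataFrame:
        df = df.drop_duplicates(subset="post_permalink")             # step 3: dedupe column assumed
        df["timestamp"] = df["timestamp"].apply(
            lambda t: datetime.utcfromtimestamp(int(t)))             # step 4: Tattle convention
        df["filename"] = [uuid.uuid4().hex for _ in range(len(df))]  # step 5: S3/Mongo id
        df["scraped_date"] = datetime.utcnow()                       # step 6
        return df
    ```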

  3. Uploads the scraped data to an S3 bucket with a helper function called `sharechat_s3_upload()` that uses the common `s3_mongo_helper` module. This function returns the dataframe with an **S3 URL** added to each post (row).

    If the S3 upload is successful, the scraper proceeds to step 4. If the S3 upload fails, the scraper jumps to step 8.

  4. Generates thumbnails for the scraped images and videos saved on S3.

  5. Creates and locally saves an HTML file containing the scraped content and thumbnails. This is handy for previewing and sharing the content.

  6. Uploads the scraped data, including the S3 URLs, to MongoDB with a helper function called `sharechat_mongo_upload()` that uses the common `s3_mongo_helper` module.

  7. Locally saves a CSV file containing the scraped content. This is handy for previewing, annotating and sharing the content.

  8. Returns the final, complete dataframe containing all scraped data (posts and their metadata).
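
Putting the workflow together, the top-level control flow might be sketched as follows. Every function name and signature here is an assumption based on this document: `sharechat_s3_upload()` and `sharechat_mongo_upload()` are the helpers described above, and the rest are stand-ins.

```python
# Hypothetical end-to-end sketch of the workflow above. All helper names and
# signatures are assumptions based on this document.
def scrape_fresh_content(tag_hashes, unix_timestamp, pages):
    initialize_s3_and_mongo()                               # step 1: fail fast on auth errors
    df = get_fresh_data(tag_hashes, unix_timestamp, pages)  # step 2
    try:
        df = sharechat_s3_upload(df)                        # step 3: adds an S3 URL per row
    except Exception:
        return df                                           # upload failed: jump to step 8
    generate_thumbnails(df)                                 # step 4
    save_html_preview(df)                                   # step 5: local HTML with thumbnails
    sharechat_mongo_upload(df)                              # step 6
    df.to_csv("sharechat_fresh_data.csv", index=False)      # step 7: local CSV
    return df                                               # step 8
```

Wrapping only the S3 upload in the try block mirrors the branching described above: a failed upload skips the thumbnail, HTML, Mongo and CSV steps but still returns whatever was scraped.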