Commit a16682a

Co-authored-by: Bethany Pietroniro <[email protected].…com>
Co-authored-by: Michael Johnson <[email protected]>
Co-authored-by: Fred-209 <[email protected]>
davidrd123 committed Aug 20, 2023
2 parents 79970be + 45f3bf6 commit a16682a
Showing 87 changed files with 4,835 additions and 62 deletions.
43 changes: 43 additions & 0 deletions .github/workflows/static.yml
@@ -0,0 +1,43 @@
# Simple workflow for deploying static content to GitHub Pages
name: Deploy static content to Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Single deploy job since we're just deploying
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Pages
        uses: actions/configure-pages@v3
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload entire repository
          path: '.'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
72 changes: 71 additions & 1 deletion README.md
@@ -1 +1,71 @@
# Launch-Summarize-Capstone-YT
# Launch School Capstone Project Transcript Analysis

This repository provides tools and methodologies to extract and analyze transcripts from Launch School Capstone Projects. The project consists of a four-step pipeline: fetching video transcripts, processing the transcripts, modeling topic data, and delivering an interactive presentation of the results.

Experience the process live and interact with results at [Capstone Summary Web App](https://launch-summarize-capstone-yt.streamlit.app/).

## Setup & Requirements

1. **Environment Setup**

   Before running any scripts, activate your virtual environment (`.venv`) or Conda environment, then install the dependencies:

   ```bash
   source .venv/bin/activate  # or activate your Conda environment
   pip install -r requirements.txt
   ```

2. **API Credentials**

   Required API keys for YouTube Data and OpenAI should reside in your `.env` file as `YT_DATA_API_KEY` and `OPENAI_API_KEY`, respectively.
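
   For example, a minimal `.env` sketch (placeholder values; substitute your own keys). The updated `get_transcript.py` additionally reads `GOOGLE_CREDENTIALS_BASE64`, a base64-encoded service-account JSON (e.g. the output of `base64 -w0 service_account.json`; flag spelling varies by platform):

   ```bash
   YT_DATA_API_KEY=your-youtube-data-api-v3-key
   OPENAI_API_KEY=your-openai-api-key
   GOOGLE_CREDENTIALS_BASE64=base64-encoded-service-account-json
   ```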

## Step-by-Step Usage

### 1. Fetching Video Transcripts

Running `get_transcript.py` fetches transcripts from YouTube videos. It is tailored to the Launch School Capstone video format (it relies on the video title format to derive the project name, etc.). An interactive CLI menu lets you provide either a video URL or a playlist URL; the script then fetches the transcript for the single video or for each video in the playlist.

```bash
python get_transcript.py
```
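
An illustrative session (the menu comes from the script's `inquirer` prompt; the URL here is a placeholder):

```bash
$ python get_transcript.py
[?] Select video or playlist: Playlist
   Video
 > Playlist

Enter playlist URL: https://www.youtube.com/playlist?list=<playlist-id>
```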

Each fetched transcript is stored under a corresponding `transcripts/<year>/<project_name>/` directory, named after the presentation year and project.
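
For example, a fetched transcript lands in a layout like this (hypothetical project name):

```
transcripts/
└── 2023/
    └── Example_Project/
        └── <video_id>_transcript.txt
```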

### 2. Transcript Processing with GPT Models

`process_transcript_gpt.py` offers control over transcript processing using GPT models. The script produces an interactive CLI menu allowing you to select a project and a GPT model - either 'gpt-4' or 'gpt-3.5-turbo-16k' - for processing.

```bash
python process_transcript_gpt.py
```

Options available within the processing menu include:

- **"Rewrite Transcript in Shorter Form"**: It condenses the transcript retaining key points. If a rewritten form doesn't exist, this operation should be performed first.

- **"Summarize Rewrite in Outline Form"**: Provides an outline summary for the rewritten transcript.

- **"Summarize Transcript in Outline Form"**: Facilitates an outline summary for raw transcripts. Note: This option requires the use of 'gpt-3.5-turbo-16k' for transcripts exceeding 8k tokens.

- **"Get token count of Transcript"** and **"Get token count of Rewrite"**: Both options return the token count for respective transcripts.

### 3. Topic Modeling on Transcripts

The `topic_modeling.py` script applies LDA (Latent Dirichlet Allocation) to surface dominant topics in the transcripts.

```bash
python topic_modeling.py
```

It runs on either the raw transcripts or the rewritten forms (depending on which lines are uncommented in the script) and proposes a set of topics for each document. Options for computing coherence and running a grid search for parameter optimization are available. Uncomment `# pyLDAvis.save_html(lda_viz, 'lda.html')` to output an easy-to-understand interactive HTML visualization.
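
For orientation, a minimal sketch of a gensim LDA flow of this shape (toy documents and parameters; the preprocessing and settings in `topic_modeling.py` may differ):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy stand-ins for transcript text
texts = [
    "we built a containerized deployment pipeline for microservices",
    "our framework load tests graphql apis and records latency metrics",
    "the tool replays user sessions to analyze frontend errors",
]

docs = [t.lower().split() for t in texts]       # naive tokenization
dictionary = corpora.Dictionary(docs)           # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, words in lda.print_topics():
    print(topic_id, words)

# To produce the interactive visualization mentioned above:
# import pyLDAvis, pyLDAvis.gensim_models as gensimvis
# pyLDAvis.save_html(gensimvis.prepare(lda, corpus, dictionary), 'lda.html')
```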

The results also cluster projects by their identified primary topic and measure the strength of association, allowing a comparison of the broad topics covered across projects.

### 4. Interactive Summary Visualization with Streamlit

The final step encompasses an accessible and interactive summary visualization, available via a Streamlit app. To launch the app, run:

```bash
streamlit run view_writeups.py
```

Now you can embark on analyzing past Capstone Project transcripts and discovering valuable insights!
111 changes: 91 additions & 20 deletions get_transcript.py
@@ -1,7 +1,15 @@
import os
import sys
import json
import base64
from pathlib import Path
import requests
import google.auth
import google.auth.transport.requests
import google.oauth2.service_account
import googleapiclient.discovery
import inquirer

from googleapiclient.discovery import build
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
@@ -18,6 +26,20 @@
target_video_url = 'https://www.youtube.com/watch?v=I3sinNeqwZU'


YT_DATA_API_BASE_URL = 'https://www.googleapis.com/youtube/v3'
YT_PLAYLIST_ITEMS_URL = f'{YT_DATA_API_BASE_URL}/playlistItems'

# Retrieve and decode the base64-encoded credentials from the environment
google_credentials_base64 = os.environ["GOOGLE_CREDENTIALS_BASE64"]
google_credentials_info = json.loads(base64.b64decode(google_credentials_base64))

# Build credentials from the service account info
credentials = google.oauth2.service_account.Credentials.from_service_account_info(google_credentials_info, scopes=['https://www.googleapis.com/auth/youtube.force-ssl'])


# Set up the YouTube Data API client
youtube = googleapiclient.discovery.build('youtube', 'v3', credentials=credentials)

def video_url_to_id(video_url):
    # Handle case of further parameters
    if '&' in video_url:
@@ -36,6 +58,7 @@ def video_url_to_playlist_id(video_url):
    return playlist_id

def get_transcript(video_id):
    print(f'Getting transcript for video_id: {video_id}')
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])

@@ -51,17 +74,24 @@ def save_transcript(video_id, playlist_id, transcript):
    capstone_year = video_playlist_to_presentation_year(get_video_playlist(playlist_id))
    transcripts_path = Path('transcripts')

    subfolder_path = transcripts_path / capstone_year / video_title
    # Video title is going to start with 'Launch School Capstone Presentation: <Project Name>'
    # We want to remove the 'Launch School Capstone Presentation: ' part
    project_name = video_info_to_project_name(get_video_info(video_id))

    subfolder_path = transcripts_path / capstone_year / project_name
    subfolder_path.mkdir(parents=True, exist_ok=True)

    # Path(f'transcripts/{capstone_year}/{video_title}').mkdir(parents=True, exist_ok=True)


    if not transcript:
        success, transcript = get_transcript(video_id)
        if not success:
            return False
    file_path = subfolder_path / f'{video_id}_transcript.txt'
    if file_path.exists():
        print(f'File already exists at {file_path} for video_id: {video_id}. Skipping...')
        return False
    with file_path.open('w', encoding='utf-8') as f:
        print(f'Writing transcript to {file_path}')
        f.write(transcript)

    return True
@@ -91,6 +121,11 @@ def get_video_info(video_id):
    response = requests.get(yt_data_url, params=params)
    return response.json()

def video_info_to_project_name(video_info):
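    # Assumes the 'Launch School Capstone Presentation: <Project Name>' title
    # format noted above; a title without ': ' would raise an IndexError here.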
    title = video_info['items'][0]['snippet']['title']
    video_title = title.split(': ')[1]
    return video_title

def video_info_to_title(video_info):
    title = video_info['items'][0]['snippet']['title']
    # Remove invalid characters (e.g. /, \, :, etc.)
@@ -99,31 +134,67 @@ def video_info_to_title(video_info):
    title = title.replace(' ', '_')
    return title

def select_video_or_playlist():
    options = [
        'Video',
        'Playlist'
    ]

def app():
    video_url = input('Enter video url: ')
    if not video_url:
        video_url = target_video_url
    # video_url = input('Enter video url: ')
    # print(f'video_url: {video_url}')
    questions = [
        inquirer.List('video_or_playlist',
                      message='Select video or playlist',
                      choices=options,
                      ),
    ]

    video_id = video_url_to_id(video_url)
    playlist_id = video_url_to_playlist_id(video_url)  # returns None if no playlist ID found, needs to be handled
    video_playlist = get_video_playlist(playlist_id)

    # print(f'Capstone year: {video_playlist_to_presentation_year(video_playlist)}')

    answers = inquirer.prompt(questions)
    return answers['video_or_playlist']

def download_transcript(video_id, playlist_id):
    success, transcript = get_transcript(video_id)
    print(video_id)
    if success:
        print("Success!")
        save_transcript(video_id, playlist_id, transcript)
        save_transcript(video_id, playlist_id, transcript)
    else:
        exit(transcript)

    video_info = get_video_info(video_id)
    print(video_info_to_title(video_info))
def playlist_id_to_video_ids(playlist_id):
    params = {
        'key': YT_DATA_API_KEY,
        'part': 'snippet',
        'playlistId': playlist_id,
        'maxResults': 50
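        # NOTE: the YouTube Data API caps maxResults at 50; longer playlists
        # would need pageToken-based pagination, which is not handled here.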
    }
    response = requests.get(YT_PLAYLIST_ITEMS_URL, params=params)
    playlist_items = response.json()
    video_ids = []
    for item in playlist_items['items']:
        video_ids.append(item['snippet']['resourceId']['videoId'])
    return video_ids

def app():
    answer = select_video_or_playlist()
    if answer == 'Video':
        video_url = input('Enter video url: ')
        video_id = video_url_to_id(video_url)
        playlist_id = video_url_to_playlist_id(video_url)  # returns None if no playlist ID found, needs to be handled
        video_playlist = get_video_playlist(playlist_id)

        print(f'video_playlist: {video_playlist}')
        print(f'Capstone year: {video_playlist_to_presentation_year(video_playlist)}')
        download_transcript(video_id, playlist_id)
    elif answer == 'Playlist':
        # print(f'Capstone year: {video_playlist_to_presentation_year(video_playlist)}')
        playlist_url = input('Enter playlist URL: ')
        playlist_id = video_url_to_playlist_id(playlist_url)
        video_ids = playlist_id_to_video_ids(playlist_id)
        print(f'video_ids: {video_ids}')
        for video_id in video_ids:
            download_transcript(video_id, playlist_id)
    else:
        print('Invalid option selected. Exiting...')
        exit()




if __name__ == '__main__':
41 changes: 41 additions & 0 deletions index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions process_transcript_gpt.py
@@ -152,9 +152,9 @@ def select_model():

def select_option():
    options = [
        'Summarize Transcript in Outline Form',
        'Rewrite Transcript in Shorter Form',
        'Summarize Rewrite in Outline Form',
        'Summarize Transcript in Outline Form',
        'Get token count of Transcript',
        'Get token count of Rewrite',
    ]
@@ -195,7 +195,7 @@ def app():
    transcripts_path = Path('transcripts')

    # Get a list of all (year) subdirectories in the transcripts folder
    year_dirs = [d for d in transcripts_path.iterdir() if d.is_dir()]
    year_dirs = sorted([d for d in transcripts_path.iterdir() if d.is_dir()])

    # Ask the user to select a year
    selected_year_dir = select_directory(year_dirs)
19 changes: 17 additions & 2 deletions requirements.txt
@@ -1,3 +1,18 @@
# spacy==3.3.1
streamlit==0.74.1
protobuf==3.20.0
click==8.0.3
# requests==2.28.1
# nltk==3.7
# matplotlib==3.5.0
# torch==1.6.0
# openai==0.27.8
# gensim==4.3.1
# tiktoken==0.4.0
# numpy==1.24.3
# youtube_transcript_api==0.6.1
# google_api_python_client==2.90.0
# inquirer==3.1.3
# pyLDAvis==3.4.1
# python-dotenv==1.0.0
# scipy==1.10.1
# sentence_transformers==2.2.2
# transformers==4.30.2
35 changes: 35 additions & 0 deletions team_to_topic.json
@@ -0,0 +1,35 @@
{
"Armada": "Containerized Service Management",
"QMantis": "API Load Testing & Metrics",
"Bard": "User Session Analysis & Data Messaging",
"Skopos": "Automated User-Centric Testing",
"Constellation": "API Load Testing & Metrics",
"Bastion": "Automated User-Centric Testing",
"Artemis": "API Load Testing & Metrics",
"Chimera": "Containerized Service Management",
"Triage": "Automated User-Centric Testing",
"Bubble": "Log Analysis & Collaborative Preview",
"Sentinel": "Application Deployment & Testing",
"Kuri": "User Session Analysis & Data Messaging",
"Nexus": "GraphQL API Development & Optimization",
"Cascade": "Containerized Service Management",
"Hypha": "Log Analysis & Collaborative Preview",
"Waypost": "Feature Management",
"Fána": "User Session Analysis & Data Messaging",
"Seymour": "Automated User-Centric Testing",
"Tailslide": "Automated User-Centric Testing",
"Arroyo": "Log Analysis & Collaborative Preview",
"Conifer": "Automated User-Centric Testing",
"Trellis": "Automated User-Centric Testing",
"Symphony": "Containerized Service Management",
"Edamame": "API Load Testing & Metrics",
"Otter": "Web & API Development",
"Seamless": "Containerized Service Management",
"Herald": "User Session Analysis & Data Messaging",
"Test Lab": "Automated User-Centric Testing",
"Fauna": "Feature Flag Management",
"Firefly": "Observability & Monitoring",
"Haifa": "Observability & Monitoring",
"Q Mentis": "GraphQL API Development & Optimization",
"Tailsite": "Feature Flag Management"
}