Commit a16682a

Co-authored-by: Bethany Pietroniro <[email protected].…com>
Co-authored-by: Michael Johnson <[email protected]>
Co-authored-by: Fred-209 <[email protected]>
davidrd123 committed Aug 20, 2023
2 parents 79970be + 45f3bf6 commit a16682a
Showing 87 changed files with 4,835 additions and 62 deletions.
43 changes: 43 additions & 0 deletions .github/workflows/static.yml
@@ -0,0 +1,43 @@
# Simple workflow for deploying static content to GitHub Pages
name: Deploy static content to Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Single deploy job since we're just deploying
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Pages
        uses: actions/configure-pages@v3
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload entire repository
          path: '.'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
72 changes: 71 additions & 1 deletion README.md
@@ -1 +1,71 @@
# Launch-Summarize-Capstone-YT
# Launch School Capstone Project Transcript Analysis

This repository provides tools and methodologies to extract and analyze transcripts from Launch School Capstone Projects. The project consists of a four-step pipeline: fetching video transcripts, processing the transcripts, modeling topic data, and delivering an interactive presentation of the results.

Experience the process live and interact with results at [Capstone Summary Web App](https://launch-summarize-capstone-yt.streamlit.app/).

## Setup & Requirements

1. **Environment Setup**

   Before running any scripts, activate your virtual environment (`.venv`) or Conda environment, then install the dependencies:

   ```bash
   source .venv/bin/activate  # or activate your Conda environment
   pip install -r requirements.txt
   ```

2. **API Credentials**

   Required API keys for YouTube Data and OpenAI should reside in your `.env` file as `YT_DATA_API_KEY` and `OPENAI_API_KEY`, respectively.
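
   For example, a minimal `.env` sketch (placeholder values; substitute your own keys). The updated `get_transcript.py` additionally reads `GOOGLE_CREDENTIALS_BASE64`, a base64-encoded service-account JSON (e.g. the output of `base64 -w0 service_account.json`; flag spelling varies by platform):

   ```bash
   YT_DATA_API_KEY=your-youtube-data-api-v3-key
   OPENAI_API_KEY=your-openai-api-key
   GOOGLE_CREDENTIALS_BASE64=base64-encoded-service-account-json
   ```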

## Step-by-Step Usage

### 1. Fetching Video Transcripts

Running `get_transcript.py` fetches transcripts from YouTube videos. It is tailored to the Launch School Capstone video format (it relies on the video title format to derive the project name, etc.). An interactive CLI menu lets you provide either a video URL or a playlist URL; the script then fetches the transcript for the single video or for each video in the playlist.

```bash
python get_transcript.py
```
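
An illustrative session (the menu comes from the script's `inquirer` prompt; the URL here is a placeholder):

```bash
$ python get_transcript.py
[?] Select video or playlist: Playlist
   Video
 > Playlist

Enter playlist URL: https://www.youtube.com/playlist?list=<playlist-id>
```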

Each fetched transcript is stored under a corresponding `transcripts/<year>/<project_name>/` directory, named after the presentation year and project.
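
For example, a fetched transcript lands in a layout like this (hypothetical project name):

```
transcripts/
└── 2023/
    └── Example_Project/
        └── <video_id>_transcript.txt
```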

### 2. Transcript Processing with GPT Models

`process_transcript_gpt.py` offers control over transcript processing using GPT models. The script produces an interactive CLI menu allowing you to select a project and a GPT model - either 'gpt-4' or 'gpt-3.5-turbo-16k' - for processing.

```bash
python process_transcript_gpt.py
```

Options available within the processing menu include:

- **"Rewrite Transcript in Shorter Form"**: It condenses the transcript retaining key points. If a rewritten form doesn't exist, this operation should be performed first.

- **"Summarize Rewrite in Outline Form"**: Provides an outline summary for the rewritten transcript.

- **"Summarize Transcript in Outline Form"**: Facilitates an outline summary for raw transcripts. Note: This option requires the use of 'gpt-3.5-turbo-16k' for transcripts exceeding 8k tokens.

- **"Get token count of Transcript"** and **"Get token count of Rewrite"**: Both options return the token count for respective transcripts.

### 3. Topic Modeling on Transcripts

The `topic_modeling.py` script applies LDA (Latent Dirichlet Allocation) to surface dominant topics in the transcripts.

```bash
python topic_modeling.py
```

It runs on either the raw transcripts or the rewritten forms (depending on which lines are uncommented in the script) and proposes a set of topics for each document. Options for computing coherence and running a grid search for parameter optimization are available. Uncomment `# pyLDAvis.save_html(lda_viz, 'lda.html')` to output an easy-to-understand interactive HTML visualization.
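
For orientation, a minimal sketch of a gensim LDA flow of this shape (toy documents and parameters; the preprocessing and settings in `topic_modeling.py` may differ):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy stand-ins for transcript text
texts = [
    "we built a containerized deployment pipeline for microservices",
    "our framework load tests graphql apis and records latency metrics",
    "the tool replays user sessions to analyze frontend errors",
]

docs = [t.lower().split() for t in texts]       # naive tokenization
dictionary = corpora.Dictionary(docs)           # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, words in lda.print_topics():
    print(topic_id, words)

# To produce the interactive visualization mentioned above:
# import pyLDAvis, pyLDAvis.gensim_models as gensimvis
# pyLDAvis.save_html(gensimvis.prepare(lda, corpus, dictionary), 'lda.html')
```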

The results also cluster projects by their identified primary topic and measure the strength of association, allowing a comparison of the broad topics covered across projects.

### 4. Interactive Summary Visualization with Streamlit

The final step encompasses an accessible and interactive summary visualization, available via a Streamlit app. To launch the app, run:

```bash
streamlit run view_writeups.py
```

Now you can embark on analyzing past Capstone Project transcripts and discovering valuable insights!
111 changes: 91 additions & 20 deletions get_transcript.py
@@ -1,7 +1,15 @@
import os
import sys
import json
import base64
from pathlib import Path
import requests
import google.auth
import google.auth.transport.requests
import google.oauth2.service_account
import googleapiclient.discovery
import inquirer

from googleapiclient.discovery import build
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
@@ -18,6 +26,20 @@
target_video_url = 'https://www.youtube.com/watch?v=I3sinNeqwZU'


YT_DATA_API_BASE_URL = 'https://www.googleapis.com/youtube/v3'
YT_PLAYLIST_ITEMS_URL = f'{YT_DATA_API_BASE_URL}/playlistItems'

# Retrieve and decode the base64-encoded credentials from the environment
google_credentials_base64 = os.environ["GOOGLE_CREDENTIALS_BASE64"]
google_credentials_info = json.loads(base64.b64decode(google_credentials_base64))

# Build credentials from the service account info
credentials = google.oauth2.service_account.Credentials.from_service_account_info(google_credentials_info, scopes=['https://www.googleapis.com/auth/youtube.force-ssl'])


# Set up the YouTube Data API client
youtube = googleapiclient.discovery.build('youtube', 'v3', credentials=credentials)

def video_url_to_id(video_url):
    # Handle case of further parameters
    if '&' in video_url:
@@ -36,6 +58,7 @@ def video_url_to_playlist_id(video_url):
    return playlist_id

def get_transcript(video_id):
    print(f'Getting transcript for video_id: {video_id}')
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])

@@ -51,17 +74,24 @@ def save_transcript(video_id, playlist_id, transcript):
    capstone_year = video_playlist_to_presentation_year(get_video_playlist(playlist_id))
    transcripts_path = Path('transcripts')

    subfolder_path = transcripts_path / capstone_year / video_title
    # Video title is going to start with 'Launch School Capstone Presentation: <Project Name>'
    # We want to remove the 'Launch School Capstone Presentation: ' part
    project_name = video_info_to_project_name(get_video_info(video_id))

    subfolder_path = transcripts_path / capstone_year / project_name
    subfolder_path.mkdir(parents=True, exist_ok=True)

    # Path(f'transcripts/{capstone_year}/{video_title}').mkdir(parents=True, exist_ok=True)


    if not transcript:
        success, transcript = get_transcript(video_id)
        if not success:
            return False
    file_path = subfolder_path / f'{video_id}_transcript.txt'
    if file_path.exists():
        print(f'File already exists at {file_path} for video_id: {video_id}. Skipping...')
        return False
    with file_path.open('w', encoding='utf-8') as f:
        print(f'Writing transcript to {file_path}')
        f.write(transcript)

    return True
@@ -91,6 +121,11 @@ def get_video_info(video_id):
    response = requests.get(yt_data_url, params=params)
    return response.json()

def video_info_to_project_name(video_info):
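    # Assumes the 'Launch School Capstone Presentation: <Project Name>' title
    # format noted above; a title without ': ' would raise an IndexError here.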
    title = video_info['items'][0]['snippet']['title']
    video_title = title.split(': ')[1]
    return video_title

def video_info_to_title(video_info):
    title = video_info['items'][0]['snippet']['title']
    # Remove invalid characters (e.g. /, \, :, etc.)
@@ -99,31 +134,67 @@ def video_info_to_title(video_info):
    title = title.replace(' ', '_')
    return title

def select_video_or_playlist():
    options = [
        'Video',
        'Playlist'
    ]

def app():
    video_url = input('Enter video url: ')
    if not video_url:
        video_url = target_video_url
    # video_url = input('Enter video url: ')
    # print(f'video_url: {video_url}')
    questions = [
        inquirer.List('video_or_playlist',
                      message='Select video or playlist',
                      choices=options,
                      ),
    ]

    video_id = video_url_to_id(video_url)
    playlist_id = video_url_to_playlist_id(video_url)  # returns None if no playlist ID found, needs to be handled
    video_playlist = get_video_playlist(playlist_id)

    # print(f'Capstone year: {video_playlist_to_presentation_year(video_playlist)}')

    answers = inquirer.prompt(questions)
    return answers['video_or_playlist']

def download_transcript(video_id, playlist_id):
    success, transcript = get_transcript(video_id)
    print(video_id)
    if success:
        print("Success!")
        save_transcript(video_id, playlist_id, transcript)
        save_transcript(video_id, playlist_id, transcript)
    else:
        exit(transcript)

    video_info = get_video_info(video_id)
    print(video_info_to_title(video_info))
def playlist_id_to_video_ids(playlist_id):
    params = {
        'key': YT_DATA_API_KEY,
        'part': 'snippet',
        'playlistId': playlist_id,
        'maxResults': 50
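        # NOTE: the YouTube Data API caps maxResults at 50; longer playlists
        # would need pageToken-based pagination, which is not handled here.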
    }
    response = requests.get(YT_PLAYLIST_ITEMS_URL, params=params)
    playlist_items = response.json()
    video_ids = []
    for item in playlist_items['items']:
        video_ids.append(item['snippet']['resourceId']['videoId'])
    return video_ids

def app():
    answer = select_video_or_playlist()
    if answer == 'Video':
        video_url = input('Enter video url: ')
        video_id = video_url_to_id(video_url)
        playlist_id = video_url_to_playlist_id(video_url)  # returns None if no playlist ID found, needs to be handled
        video_playlist = get_video_playlist(playlist_id)

        print(f'video_playlist: {video_playlist}')
        print(f'Capstone year: {video_playlist_to_presentation_year(video_playlist)}')
        download_transcript(video_id, playlist_id)
    elif answer == 'Playlist':
        # print(f'Capstone year: {video_playlist_to_presentation_year(video_playlist)}')
        playlist_url = input('Enter playlist URL: ')
        playlist_id = video_url_to_playlist_id(playlist_url)
        video_ids = playlist_id_to_video_ids(playlist_id)
        print(f'video_ids: {video_ids}')
        for video_id in video_ids:
            download_transcript(video_id, playlist_id)
    else:
        print('Invalid option selected. Exiting...')
        exit()




if __name__ == '__main__':
41 changes: 41 additions & 0 deletions index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions process_transcript_gpt.py
@@ -152,9 +152,9 @@ def select_model():

def select_option():
    options = [
        'Summarize Transcript in Outline Form',
        'Rewrite Transcript in Shorter Form',
        'Summarize Rewrite in Outline Form',
        'Summarize Transcript in Outline Form',
        'Get token count of Transcript',
        'Get token count of Rewrite',
    ]
@@ -195,7 +195,7 @@ def app():
    transcripts_path = Path('transcripts')

    # Get a list of all (year) subdirectories in the transcripts folder
    year_dirs = [d for d in transcripts_path.iterdir() if d.is_dir()]
    year_dirs = sorted([d for d in transcripts_path.iterdir() if d.is_dir()])

    # Ask the user to select a year
    selected_year_dir = select_directory(year_dirs)
19 changes: 17 additions & 2 deletions requirements.txt
@@ -1,3 +1,18 @@
# spacy==3.3.1
streamlit==0.74.1
protobuf==3.20.0
click==8.0.3
# requests==2.28.1
# nltk==3.7
# matplotlib==3.5.0
# torch==1.6.0
# openai==0.27.8
# gensim==4.3.1
# tiktoken==0.4.0
# numpy==1.24.3
# youtube_transcript_api==0.6.1
# google_api_python_client==2.90.0
# inquirer==3.1.3
# pyLDAvis==3.4.1
# python-dotenv==1.0.0
# scipy==1.10.1
# sentence_transformers==2.2.2
# transformers==4.30.2
35 changes: 35 additions & 0 deletions team_to_topic.json
@@ -0,0 +1,35 @@
{
"Armada": "Containerized Service Management",
"QMantis": "API Load Testing & Metrics",
"Bard": "User Session Analysis & Data Messaging",
"Skopos": "Automated User-Centric Testing",
"Constellation": "API Load Testing & Metrics",
"Bastion": "Automated User-Centric Testing",
"Artemis": "API Load Testing & Metrics",
"Chimera": "Containerized Service Management",
"Triage": "Automated User-Centric Testing",
"Bubble": "Log Analysis & Collaborative Preview",
"Sentinel": "Application Deployment & Testing",
"Kuri": "User Session Analysis & Data Messaging",
"Nexus": "GraphQL API Development & Optimization",
"Cascade": "Containerized Service Management",
"Hypha": "Log Analysis & Collaborative Preview",
"Waypost": "Feature Management",
"Fána": "User Session Analysis & Data Messaging",
"Seymour": "Automated User-Centric Testing",
"Tailslide": "Automated User-Centric Testing",
"Arroyo": "Log Analysis & Collaborative Preview",
"Conifer": "Automated User-Centric Testing",
"Trellis": "Automated User-Centric Testing",
"Symphony": "Containerized Service Management",
"Edamame": "API Load Testing & Metrics",
"Otter": "Web & API Development",
"Seamless": "Containerized Service Management",
"Herald": "User Session Analysis & Data Messaging",
"Test Lab": "Automated User-Centric Testing",
"Fauna": "Feature Flag Management",
"Firefly": "Observability & Monitoring",
"Haifa": "Observability & Monitoring",
"Q Mentis": "GraphQL API Development & Optimization",
"Tailsite": "Feature Flag Management"
}