Commit

Co-authored-by: Bethany Pietroniro <[email protected].…com>
Co-authored-by: Michael Johnson <[email protected]>
Co-authored-by: Fred-209 <[email protected]>

Showing 87 changed files with 4,835 additions and 62 deletions.
New file: GitHub Actions workflow for deploying to GitHub Pages (43 lines):

```yaml
# Simple workflow for deploying static content to GitHub Pages
name: Deploy static content to Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Single deploy job since we're just deploying
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Pages
        uses: actions/configure-pages@v3
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload entire repository
          path: '.'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
```
# Launch School Capstone Project Transcript Analysis

This repository provides tools and methodologies for extracting and analyzing transcripts of Launch School Capstone Project videos. The project is a four-step pipeline: fetching video transcripts, processing the transcripts, topic modeling, and an interactive presentation of the results.

Experience the process live and interact with the results at the [Capstone Summary Web App](https://launch-summarize-capstone-yt.streamlit.app/).

## Setup & Requirements

1. **Virtual Environment Activation**

   Before running any scripts, activate your virtual environment (`.venv`) or Conda environment, then install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. **API Credentials**

   The required API keys for the YouTube Data API and OpenAI should reside in your `.env` file as `YT_DATA_API_KEY` and `OPENAI_API_KEY`, respectively.
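The two keys live in a `.env` file at the project root; a minimal example with placeholder values (not real keys):

```
YT_DATA_API_KEY=your-youtube-data-api-key
OPENAI_API_KEY=your-openai-api-key
```

Such a file is typically loaded at startup with `python-dotenv`, which is referenced in the requirements file.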
## Step-by-Step Usage

### 1. Fetching Video Transcripts

Running `get_transcript.py` fetches transcripts from YouTube videos. It is specifically targeted at the format of the Launch School Capstone videos (it relies on the video title format to derive the project name, etc.). An interactive CLI menu lets you provide either a video URL or a playlist URL; the script then fetches the transcript for the single video, or for each video in the playlist.

```bash
python get_transcript.py
```

Each fetched transcript is stored in a directory named after the project, organized under the corresponding `<year>/<project_name>` directory.
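The storage layout described above can be sketched as follows. This is a minimal illustration, not the script's actual parsing logic: the title pattern and base directory name are assumptions.

```python
from pathlib import Path
import re


def transcript_path(video_title: str, base: str = "transcripts") -> Path:
    """Derive a <year>/<project_name> storage path from a video title.

    The title format assumed here (e.g. 'Armada: Launch School Capstone
    Project (2022)') is hypothetical; the real script's parsing rules
    are not shown in the README.
    """
    match = re.match(r"(?P<project>[^:]+):.*\((?P<year>\d{4})\)", video_title)
    if match is None:
        raise ValueError(f"unrecognized title format: {video_title!r}")
    return Path(base) / match["year"] / match["project"].strip()


print(transcript_path("Armada: Launch School Capstone Project (2022)"))
```

Keying the directory tree on year first keeps each cohort's projects grouped together.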
### 2. Transcript Processing with GPT Models

`process_transcript_gpt.py` offers control over transcript processing using GPT models. The script presents an interactive CLI menu for selecting a project and a GPT model, either `gpt-4` or `gpt-3.5-turbo-16k`, for processing.

```bash
python process_transcript_gpt.py
```

Options available within the processing menu include:

- **"Rewrite Transcript in Shorter Form"**: Condenses the transcript while retaining key points. If a rewritten form doesn't exist yet, perform this operation first.

- **"Summarize Rewrite in Outline Form"**: Produces an outline summary of the rewritten transcript.

- **"Summarize Transcript in Outline Form"**: Produces an outline summary of the raw transcript. Note: transcripts exceeding 8k tokens require `gpt-3.5-turbo-16k`.

- **"Get token count of Transcript"** and **"Get token count of Rewrite"**: Return the token count of the respective text.
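Token counts matter because of the 8k-token threshold noted above. An exact count requires the model's tokenizer (e.g. OpenAI's `tiktoken`, which is referenced in the requirements file), but a rough stdlib-only sketch conveys the idea; the 0.75-words-per-token ratio is a common heuristic for English text, not the script's actual method.

```python
def approx_token_count(text: str) -> int:
    # Heuristic: one GPT token is roughly 0.75 English words on average,
    # so tokens ~ words / 0.75. This is an approximation only; exact
    # counts require the model's tokenizer (e.g. tiktoken).
    words = len(text.split())
    return round(words / 0.75)


transcript = "word " * 6000  # a 6,000-word transcript
print(approx_token_count(transcript))  # 8000 -- right at the 8k threshold
```

A quick check like this tells you whether a raw transcript must go through `gpt-3.5-turbo-16k` rather than `gpt-4`.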
### 3. Topic Modeling on Transcripts

The `topic_modeling.py` script uses LDA (Latent Dirichlet Allocation) to surface dominant topics in the transcripts.

```bash
python topic_modeling.py
```

It operates on either the raw transcripts or the rewritten forms (depending on which lines are uncommented in the script) and proposes a set of topics for each document. Options are available for calculating coherence and running a grid search for parameter optimization. Uncomment `# pyLDAvis.save_html(lda_viz, 'lda.html')` to output an easy-to-understand interactive HTML visualization.

The results also cluster projects by their identified primary topic and measure the strength of the association, allowing a comparison of the broad topics covered across projects.
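LDA operates on a bag-of-words representation of each document. A minimal stdlib-only sketch of that preprocessing step is below; the tokenization rule and the stop-word list are illustrative assumptions, not the script's actual pipeline (which would typically use gensim's dictionary/corpus utilities).

```python
import re
from collections import Counter

# Illustrative subset; a real pipeline uses a full stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}


def bag_of_words(document: str) -> Counter:
    # Lowercase, keep alphabetic tokens, drop stop words: the typical
    # input format an LDA library consumes per document.
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


doc = "Load testing the API and measuring API metrics"
print(bag_of_words(doc))  # 'api' counted twice; stop words dropped
```

LDA then models each document as a mixture of topics over exactly these word counts, which is why the quality of this preprocessing step strongly affects the topics it finds.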
### 4. Interactive Summary Visualization with Streamlit

The final step delivers an accessible, interactive summary visualization via a Streamlit app. To launch the app, run:

```bash
streamlit run view_writeups.py
```

Now you can embark on analyzing the transcripts of previous Capstone Projects and discover valuable insights!
requirements.txt (only the Streamlit stack is pinned as active; the remaining dependencies are commented out):

```text
# spacy==3.3.1
streamlit==0.74.1
protobuf==3.20.0
click==8.0.3
# requests==2.28.1
# nltk==3.7
# matplotlib==3.5.0
# torch==1.6.0
# openai==0.27.8
# gensim==4.3.1
# tiktoken==0.4.0
# numpy==1.24.3
# youtube_transcript_api==0.6.1
# google_api_python_client==2.90.0
# inquirer==3.1.3
# pyLDAvis==3.4.1
# python-dotenv==1.0.0
# scipy==1.10.1
# sentence_transformers==2.2.2
# transformers==4.30.2
```
New file: project-to-primary-topic mapping (JSON):

```json
{
  "Armada": "Containerized Service Management",
  "QMantis": "API Load Testing & Metrics",
  "Bard": "User Session Analysis & Data Messaging",
  "Skopos": "Automated User-Centric Testing",
  "Constellation": "API Load Testing & Metrics",
  "Bastion": "Automated User-Centric Testing",
  "Artemis": "API Load Testing & Metrics",
  "Chimera": "Containerized Service Management",
  "Triage": "Automated User-Centric Testing",
  "Bubble": "Log Analysis & Collaborative Preview",
  "Sentinel": "Application Deployment & Testing",
  "Kuri": "User Session Analysis & Data Messaging",
  "Nexus": "GraphQL API Development & Optimization",
  "Cascade": "Containerized Service Management",
  "Hypha": "Log Analysis & Collaborative Preview",
  "Waypost": "Feature Management",
  "Fána": "User Session Analysis & Data Messaging",
  "Seymour": "Automated User-Centric Testing",
  "Tailslide": "Automated User-Centric Testing",
  "Arroyo": "Log Analysis & Collaborative Preview",
  "Conifer": "Automated User-Centric Testing",
  "Trellis": "Automated User-Centric Testing",
  "Symphony": "Containerized Service Management",
  "Edamame": "API Load Testing & Metrics",
  "Otter": "Web & API Development",
  "Seamless": "Containerized Service Management",
  "Herald": "User Session Analysis & Data Messaging",
  "Test Lab": "Automated User-Centric Testing",
  "Fauna": "Feature Flag Management",
  "Firefly": "Observability & Monitoring",
  "Haifa": "Observability & Monitoring",
  "Q Mentis": "GraphQL API Development & Optimization",
  "Tailsite": "Feature Flag Management"
}
```
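A mapping like this supports the cluster-by-topic comparison described in the topic-modeling step. A minimal sketch of inverting it (the excerpt below is hand-copied from the data above; in the repository the full mapping would be loaded from its JSON file, whose path is not shown here):

```python
from collections import defaultdict

# Excerpt of the project-to-topic mapping above.
mapping = {
    "Armada": "Containerized Service Management",
    "QMantis": "API Load Testing & Metrics",
    "Constellation": "API Load Testing & Metrics",
}

# Invert the mapping: group project names under their primary topic.
by_topic = defaultdict(list)
for project, topic in mapping.items():
    by_topic[topic].append(project)

print(dict(by_topic))
```

Inverting the dictionary turns per-project labels into per-topic clusters, making it easy to see which problem areas attract the most Capstone projects.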