NexisLexis_DataPipeline

This repository provides tools for validating, extracting, processing, and segmenting JSON-formatted media transcripts, tailored for bulk data scraped from LexisNexis. It includes functions for parsing text and metadata, identifying errors, and handling hierarchical data structures, and it segments transcripts by speaker so media content can be analyzed with structured speaker metadata.

Features

  • JSON Structure Validation: Checks each JSON file to ensure the required fields are present (see the first sketch below).
  • Data Extraction: Extracts key metadata and content, including result ID, title, byline, date, source name, highlights, headlines, guest names, and body content.
  • Error Detection: Identifies and logs files with structural issues or missing content for easier debugging.
  • Text Processing: Cleans and formats extracted text, removing video-clip and timestamp markers.
  • Speaker Segmentation: Segments transcripts by speaker using regex patterns, producing structured text segments with associated metadata (see the second sketch below).
  • Multimedia Parsing: Uses BeautifulSoup to parse and extract text elements from HTML-encoded content within JSON files.
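
A minimal sketch of the validation and extraction steps, combined with the BeautifulSoup pass over HTML-encoded body content. The field names in REQUIRED_FIELDS are illustrative assumptions, not the scripts' exact schema:

    import json
    from bs4 import BeautifulSoup

    REQUIRED_FIELDS = {"ResultId", "Title", "Date", "Document"}  # assumed field names

    def validate_and_extract(path):
        """Validate required fields, then extract metadata and body text."""
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"{path}: missing fields {sorted(missing)}")
        # Body arrives as HTML-encoded markup; BeautifulSoup strips the tags.
        body = BeautifulSoup(record["Document"], "html.parser").get_text(" ", strip=True)
        return {"result_id": record["ResultId"], "title": record["Title"],
                "date": record["Date"], "body": body}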

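A second sketch covering the cleaning and segmentation passes. The timestamp, video-clip, and speaker-label patterns are assumptions about typical transcript formatting, not the repository's exact regexes:

    import re

    # Assumed marker formats; the repository's actual patterns may differ.
    TIMESTAMP = re.compile(r"\[\d{2}:\d{2}:\d{2}\]")                     # e.g. [00:12:45]
    VIDEO_CLIP = re.compile(r"\(BEGIN VIDEO CLIP\).*?\(END VIDEO CLIP\)", re.DOTALL)
    SPEAKER = re.compile(r"^([A-Z][A-Z .'\-]+):", re.MULTILINE)          # e.g. "JOHN DOE:"

    def segment_speakers(body):
        """Strip clip and timestamp markers, then split on speaker labels."""
        text = TIMESTAMP.sub(" ", VIDEO_CLIP.sub(" ", body))
        matches = list(SPEAKER.finditer(text))
        segments = []
        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            segments.append({"speaker": m.group(1).strip(),
                             "text": text[m.end():end].strip()})
        return segments
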
Prerequisites

  • Python 3.x
  • Libraries: Install the required libraries with:
    pip install -r requirements.txt

Usage

  1. Clone the Repository:

    git clone https://github.com/mimshiran/NexisLexis_DataPipeline.git
    cd NexisLexis_DataPipeline
  2. Set Root Folder Path: Define the root directory for JSON files by setting root_folder_path in your script.
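
    For example (the path shown is illustrative):

    root_folder_path = "/path/to/your/json/files"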

  3. Run the Script: Execute the script to validate and extract data from JSON files:

    python parse_transcripts.py
  4. Run Speaker Segmentation: To process and segment speaker data, run the segmentation script:

    python parse_segment_speakers.py
  5. Output:

    • Extracted metadata and cleaned content will be stored in the specified output directory.
    • Speaker-segmented data will be saved to a Parquet file, an efficient format for further analysis (see the loading sketch after this list).
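
As a sketch of downstream use, the Parquet output can be loaded with pandas; the filename is an assumption:

    import pandas as pd

    # Filename is illustrative; reading Parquet requires pyarrow or fastparquet.
    segments = pd.read_parquet("output/speaker_segments.parquet")
    print(segments.groupby("speaker").size().sort_values(ascending=False).head())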

Folder Structure

  • /data: Directory to store your JSON files.
  • /output: (Optional) Directory to save processed data files.
  • /segments: (Optional) Directory for modularized scripts.

Error Handling

  • Files with issues (e.g., decoding errors, missing fields) are logged, and their paths are collected in the error_files list for review, as sketched below.
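
A minimal sketch of that pattern, reusing the assumed field names and the illustrative folder path from the sketches above (only the error_files name comes from the repository):

    import json
    import os

    REQUIRED_FIELDS = {"ResultId", "Title", "Date", "Document"}  # assumed, as above
    root_folder_path = "/path/to/your/json/files"                # illustrative

    error_files = []
    for name in os.listdir(root_folder_path):
        if not name.endswith(".json"):
            continue
        path = os.path.join(root_folder_path, name)
        try:
            with open(path, encoding="utf-8") as f:
                record = json.load(f)
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            error_files.append((path, str(exc)))      # decoding/parsing errors
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            error_files.append((path, f"missing fields: {sorted(missing)}"))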

License

This project is licensed under the MIT License. See the LICENSE file for details.
