Data Extraction:

Fetches the HTML content of any given URL, so the source can be a Wikipedia article or any other web page containing relevant text.
Using the Python libraries requests and BeautifulSoup, the script then extracts text from HTML elements such as paragraphs and headings.
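
The extraction step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names `TEXT_TAGS`, `extract_text`, and `fetch_text` are hypothetical, and limiting the extraction to headings and paragraphs is an assumption.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: only headings and paragraphs are treated as text-bearing elements.
TEXT_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6", "p"]


def extract_text(html):
    """Return the text of heading and paragraph elements, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = [el.get_text(" ", strip=True) for el in soup.find_all(TEXT_TAGS)]
    # Skip elements that contain no text at all.
    return "\n".join(piece for piece in pieces if piece)


def fetch_text(url, timeout=10):
    """Download a page and return its extracted text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_text(response.text)
```

Keeping the parsing in `extract_text` separate from the download in `fetch_text` means the parsing can be tried on a literal HTML string without network access.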

Text Processing Steps:

Cleaning Data: Eliminates any symbols or characters that are not relevant to the text content.
Normalization: Converts all text to lowercase to ensure uniformity for subsequent processing steps.
Tokenization: Splits the text into individual words or tokens, facilitating further analysis.
Lemmatization or Stemming: Reduces words to their base or root form to standardize variations of the same word. Users can choose between lemmatization and stemming depending on their requirements.
Stop Words Removal: Filters out common words such as "is," "and," "the," etc., which do not add significant meaning to the text.
Unique Word Extraction: After processing, collects every unique word in the text data, so that only distinct words are considered for analysis. The result is a view of the vocabulary used in the text.
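
The steps above can be sketched as a single function. This is a simplified, standard-library-only illustration, not the notebook's implementation: `STOP_WORDS` is a small illustrative subset, `crude_stem` is a toy suffix stripper standing in for a real stemmer or lemmatizer (for example one from NLTK), and the names `unique_words` and `reduce_word` are hypothetical.

```python
import re

# Illustrative subset only; a real run would likely use a fuller stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def unique_words(text, reduce_word=crude_stem, stop_words=STOP_WORDS):
    """Run the processing steps in order and return the sorted unique words."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)         # cleaning: keep letters and whitespace
    lowered = cleaned.lower()                           # normalization
    tokens = lowered.split()                            # tokenization on whitespace
    reduced = [reduce_word(t) for t in tokens]          # stemming (or lemmatization)
    kept = [t for t in reduced if t not in stop_words]  # stop words removal
    return sorted(set(kept))                            # unique word extraction


print(unique_words("The cats are jumping, and the cat jumped!"))
# → ['cat', 'jump']
```

Because `reduce_word` is a parameter, a proper stemmer's `stem` method or a lemmatizer's `lemmatize` method could be passed in place of the toy default.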

Aalaa4444/Text_Processing-and-Unique_Word_Extraction_fromHTML

Folders and files

Latest commit

History

Repository files navigation

Data Extraction:

Text Processing Steps:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages