The COH-PIAH Detector is a Python program that analyzes text samples to identify potential cases of COH-PIAH infection based on linguistic patterns. It computes a set of text metrics for each sample and compares them against a reference signature to find the sample whose writing style is most similar.
The program analyzes texts using six main linguistic traits:
- Average word length
- Type-Token ratio (lexical diversity)
- Hapax legomena ratio (proportion of words used only once)
- Average sentence length
- Sentence complexity
- Average phrase length
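
As a rough illustration, the three word-level traits could be computed as in the sketch below. The formulas are assumptions based on the usual definitions of these metrics (each ratio is taken over the total number of words), and word_traits is a hypothetical name, not one of the program's functions.

```python
from collections import Counter


def word_traits(words):
    """Return (average word length, type-token ratio, hapax legomena ratio).

    Hypothetical helper: the definitions below are assumed, not taken
    from the program itself.
    """
    if not words:
        raise ValueError("at least one word is required")
    # Lowercase so that 'The' and 'the' count as the same word.
    counts = Counter(word.lower() for word in words)
    total = len(words)
    avg_word_length = sum(len(word) for word in words) / total
    type_token_ratio = len(counts) / total  # distinct words / all words
    hapax_ratio = sum(1 for n in counts.values() if n == 1) / total
    return avg_word_length, type_token_ratio, hapax_ratio


print(word_traits(["the", "cat", "saw", "the", "dog"]))  # (3.0, 0.8, 0.6)
```

In this sample, four of the five words are distinct and three of them occur exactly once, which gives the ratios 0.8 and 0.6.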
- The program first collects a baseline signature: the linguistic trait values of a known infected text, entered by the user
- Users can then submit multiple texts for analysis
- Each text is broken down into its linguistic components (sentences, phrases, words)
- The program calculates a similarity score between each submitted text and the baseline
- The text whose signature is closest to the infected signature is identified as potentially infected
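
The final selection step can be sketched as follows. The sketch assumes that a lower score means a closer match, which is consistent with a score built from absolute differences, and that texts are reported by their 1-based position. The name most_similar_text is hypothetical and the scores are made up for illustration.

```python
def most_similar_text(scores):
    """Return the 1-based position of the text with the lowest score."""
    if not scores:
        raise ValueError("at least one score is required")
    # Lower score = smaller difference from the baseline signature.
    best_index = min(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1


scores = [0.84, 0.12, 0.47]  # made-up scores for three submitted texts
print(most_similar_text(scores))  # prints 2
```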
- le_assinatura(): Collects the baseline linguistic signature
- le_textos(): Reads multiple texts for comparison
- calcula_assinatura(): Calculates the linguistic signature of a given text
- compara_assinatura(): Compares two linguistic signatures
- avalia_textos(): Evaluates multiple texts and identifies the most likely infected one
- separa_sentencas(): Splits text into sentences
- separa_frases(): Splits sentences into phrases
- separa_palavras(): Splits phrases into words
- n_palavras_unicas(): Counts words that appear only once
- n_palavras_diferentes(): Counts distinct words
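
A minimal sketch of how the three splitting steps might work with the re module is shown below. It assumes that ., ! and ? end sentences while , : and ; separate phrases. The English function names are hypothetical stand-ins, not the program's own helpers.

```python
import re


def split_sentences(text):
    """Split text on runs of sentence-ending punctuation (. ! ?)."""
    parts = re.split(r"[.!?]+", text)
    # Drop empty pieces, such as the one left after the final period.
    return [part.strip() for part in parts if part.strip()]


def split_phrases(sentence):
    """Split a sentence on runs of phrase separators (, : ;)."""
    parts = re.split(r"[,:;]+", sentence)
    return [part.strip() for part in parts if part.strip()]


def split_words(phrase):
    """Split a phrase into words on whitespace."""
    return phrase.split()


text = "Hello there, world! How are you?"
sentences = split_sentences(text)
print(sentences)                    # ['Hello there, world', 'How are you']
print(split_phrases(sentences[0]))  # ['Hello there', 'world']
print(split_words("Hello there"))   # ['Hello', 'there']
```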
- Run the program
- Enter the baseline linguistic traits when prompted
- Input the texts you want to analyze
- The program will output which text is most likely infected with COH-PIAH
Example:
python coh_piah_detector.py
- Python 3.x
- re (regular expression) module, included in the Python standard library
The program uses regular expressions for text processing and implements various linguistic analysis algorithms. It calculates similarity scores using the average absolute difference between linguistic traits.
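
The average absolute difference can be sketched as below, assuming each signature is a sequence of trait values in the same order. The name similarity and the sample values are illustrative only.

```python
def similarity(signature_a, signature_b):
    """Return the average absolute difference between two signatures.

    A score of 0.0 means the signatures are identical; larger values
    mean the writing styles differ more.
    """
    if not signature_a or len(signature_a) != len(signature_b):
        raise ValueError("signatures must be non-empty and of equal length")
    total = sum(abs(a - b) for a, b in zip(signature_a, signature_b))
    return total / len(signature_a)


# Made-up signatures with six traits each.
baseline = [4.0, 0.5, 0.25, 80.0, 2.0, 40.0]
candidate = [5.0, 0.5, 0.75, 78.0, 2.5, 38.0]
print(similarity(baseline, candidate))  # 1.0
print(similarity(baseline, baseline))   # 0.0
```

Because the differences are not normalized, traits with larger magnitudes (such as sentence length) contribute more to the score than ratios between 0 and 1.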
- All text analysis is case-insensitive
- The program handles various punctuation marks (., !, ?, :, ;, ,) for sentence and phrase separation
- Empty strings and trailing spaces are properly handled
Feel free to submit issues and enhancement requests. The code above the comment "Não mexer daqui para cima" ("Do not modify anything above this point") should not be changed, as it contains core functionality.