Text Analyzer and Step-through Viewer (Main version v2_4_2)

"Text Analyzer and Step-through Viewer" is a utility to navigate an input text and examine the usage, repetition, and proximity of words. The viewer can jump forwards and backwards in an input text by line or between instances of particular words to see distances between those word instances. I wish to develop the program into a tool for more text analysis as I learn more about areas such as natural language processing. (For example, one might compare authors' word use or examine one's own writing styles with the tool.)

Version 2_4_2, December 2016

By Karl Toby Rosenberg

Instructions

Running the program:

# in the command line:

# plain start:

$ python3 v2_4_2_main_Text_Analyzer_Step_through_Viewer.py test_11.txt

# specify files to add to the reserved files list: 
# (Use absolute file path names here unless the file is in the program directory)

$ python3 v2_4_2_main_Text_Analyzer_Step_through_Viewer.py test_11.txt 
<file_name> <encoding_name e.g. ascii> ... <file_name_k> <encoding_name_k>

# (I will shorten the name of the exectable)

Choosing File(s), Setting-up Working Directory:

The program displays a list of files in the current working directory paired with convenience indices

You are prompted to select a file via a number of commands: (NOTE: sample files included)

<index> to select the file corresponding to the index (then specify the encoding, e.g. 1 for ascii or ascii)
r <index> to select a reserved file

The program keeps a reserved files list/cache so even if you change directory while running the program, the reserved files can be opened. (abs_file_name:encoding_name) pairs specified as command line arguments are added to the list, as well as (abs_file_name:encoding_name) pairs listed in a local reserved_input_info.txt file (one pair for each line, separated by a space) The file name and separator between file and encoding (in the file) can be changed by passing optional arguments to the relevant function: add_file_info_from_file()

cd <path> to change directory (This simulates the usual cd command)
cd sub-commands:
- cd <path> to continue to change directory
- h to select the utility's starting home directory
- enter to confirm the current directory and return to file selection
- p to cancel (even after changing directory without confirming)
- ls to list all available files in the current directory
- dir to list all immediate sub-directories
- 0 to quit the program
ls to list all available files in the current directory
lsr to list all available files in the reserved files list
dir to list all immediate sub-directories
0 to quit the program

Text Step and Word Step:

You are then prompted to start using the text step functionality:

enter a word that appears in the text to be able to step between instances of the word, press enter or an empty line to ignore this functionality
You can always exit the stepper with 0 to choose a(nother) word or select another file Commands to enter:
a number n to step forwards in the text by n lines
a negative number -n to move to a specific line number n NOTE: moving to a line beyond the end of the text will exit text step and prompt for a new word or file
< and > to skip to the previous or next instance of the given word the program will move to the correct line and indicate the position of the word instance, as well as display the number of words between the new and previous instance and the word number of the new instance (word i in the entire text)
qa or help to display the instructions
0 to leave text step for this file (and have the option to choose another word or file, or to list the words in the current file with lsw)

Additional Notes

Possible Explorations:

Tracking distance relationships between different words as opposed to instances of a specific word only would be interesting. This would likely require a more complex system with various graphs and many more calculations. I might try to use databases and/or play with a different language (perhaps C) as outlined below (under the Version 3> heading) Reading the text in smaller chunks may help, as might being more careful about what information is stored as a value for a string key (having pointer arithmetic might help here too.) I would also like to turn this program into something that can be more easily used for comparing texts (e.g. authors' word use habits in terms of distances). I will probably need to compile a sub-set of words (definitely omitting ones such as "a" and "the") related to a specific topic (e.g. philosophical terms when looking at philosophers' writing). It would be good if the program could be developed to have educational or self-analysis applications to some extent. The newer directory changing, file-reserving, and other features are making the program easier to test and use, so the preliminary steps are in-progress.

Work-in-Progress Functionality:

WordInfo class
- extend from the abstract WordInfo class to create a "command" for calculating information based on the analysis dictionary (created from the input text). Assign the derived class a key for retrieval from an output dictionary and implement a calculate method that works upon the analysis dictionary. Simple test examples include finding the longest words in the text or finding all words with a given length

Test and Old Versions

Version 3>

stores positions of the beginning of each line (the index of the starting character with respect to the entire text),
does NOT store the text in a list, reads from file for every line output
requires check for both DOS/Windows and UNIX/UNIX-like system new-lines (\r\n, \n) A problem: This uses seek(), which reads in terms of bytes. Python char strings are of variable length and do not represent actual sizes in bytes in the text. I need to encode any non-ASCII character, and doing so is noticeably slower than when encoding isn't done. In other words, there is a tradeoff between speed and memory use. Options:
Use ASCII-only text so len(Python char) is one byte (This might not a good limitation to have.)
Use another language for the text input and parsing phase (Perhaps C, which has char primitives, though there would be additional challenges related to non-ASCII support. I have been experimenting with the idea in the hash-table C project.)
Something completely different...

Miscellaneous idea:

If I am able to create a system that saves memory by not loading the entire text into memory, then I might choose to store chunks of the text, rather than none at all. For example, for very large texts I could use something similar to demand paging to save only a constant piece of a text at any given point.

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
old		old
old_summer_2016		old_summer_2016
0000_John_Locke_An_Essay_Concerning_Human_Understanding_vol_1_UTF_8.txt		0000_John_Locke_An_Essay_Concerning_Human_Understanding_vol_1_UTF_8.txt
0000_John_Locke_An_Essay_Concerning_Human_Understanding_vol_2_UTF_8.txt		0000_John_Locke_An_Essay_Concerning_Human_Understanding_vol_2_UTF_8.txt
0000_John_Locke_Treatise_of_Government.txt		0000_John_Locke_Treatise_of_Government.txt
0001_John_Stuart_Mill_Considerations_Representative_Government.txt		0001_John_Stuart_Mill_Considerations_Representative_Government.txt
0001_John_Stuart_Mill_on_Liberty.txt		0001_John_Stuart_Mill_on_Liberty.txt
0002_Thomas_Hobbes_The_Leviathan.txt		0002_Thomas_Hobbes_The_Leviathan.txt
0003_Ralph_Waldo_Emerson_Nature.txt		0003_Ralph_Waldo_Emerson_Nature.txt
0006_Shakespeare_Hamlet.txt		0006_Shakespeare_Hamlet.txt
LICENSE.txt		LICENSE.txt
README.md		README.md
Steely_Dan_UTF_8.txt		Steely_Dan_UTF_8.txt
default.txt		default.txt
reserved_input_info.txt		reserved_input_info.txt
test_1.txt		test_1.txt
test_10.txt		test_10.txt
test_11.txt		test_11.txt
test_2.txt		test_2.txt
test_3.txt		test_3.txt
test_4.txt		test_4.txt
test_7.txt		test_7.txt
test_8.txt		test_8.txt
test_9.txt		test_9.txt
v2_4_2_main_Text_Analyzer_Step_through_Viewer.py		v2_4_2_main_Text_Analyzer_Step_through_Viewer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Text Analyzer and Step-through Viewer (Main version v2_4_2)

Instructions

Running the program:

Choosing File(s), Setting-up Working Directory:

You are prompted to select a file via a number of commands: (NOTE: sample files included)

Text Step and Word Step:

Additional Notes

Possible Explorations:

Work-in-Progress Functionality:

Test and Old Versions

About

Releases

Packages

Languages

License

KTRosenberg/Text-Analyzer-and-Step-through-Viewer

Folders and files

Latest commit

History

Repository files navigation

Text Analyzer and Step-through Viewer (Main version v2_4_2)

Instructions

Running the program:

Choosing File(s), Setting-up Working Directory:

You are prompted to select a file via a number of commands: (NOTE: sample files included)

Text Step and Word Step:

Additional Notes

Possible Explorations:

Work-in-Progress Functionality:

Test and Old Versions

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages