Created by Juan Francisco Montero (https://github.com/monp4r/) and Luis Alfonso Martínez
Welcome to the LZ78 Variable-Length Compressor, where efficiency meets simplicity in the world of text file compression.
Based on the LZ78 lossless data compression algorithm published by Abraham Lempel and Jacob Ziv, it consists of a single executable program that can be used both as a compressor and decompressor, depending on the selected menu options. It provides optimal compression ratios while maintaining ease of use.
The LZ78 Variable-Length Compressor has been developed as the final project for the Automatic Information Processing exam in the Double Degree program in Mathematics and Computer Science at the University of Valladolid.
- Efficient Compression: Utilizes advanced LZ78 algorithm to achieve high compression ratios.
- Up to 12-Bit Support: Precision handling of variable-length encoding, accommodating up to 12 bits for optimal compression.
- User-Friendly Interface: Simple command-line interface for seamless compression operations.
- Robust Error Handling: Ensures reliability and robustness during compression, giving you peace of mind.
Ready to experience lightning-fast compression without compromising quality? Let's get started!
To get started with the LZ78 Variable-Length Compressor using the Spyder IDE, follow these steps:
-
Clone the Repository:
Clone the LZ78 Variable-Length Compressor repository to your local machine using Git. Open a terminal or Git Bash and run the following command:
git clone https://github.com/monp4r/lz78_varlen_compressor.git
Open the Spyder IDE (or on python CI) on your local machine. Navigate to the directory where you cloned the LZ78 Variable-Length Compressor repository using the file explorer or terminal.
-
Explore and Run the Code
Once the project is open (in Spyder), you can explore the source code files. The main file,
lz78_varlen_compressor.py
, contains the implementation of the LZ78 Variable-Length Compressor. -
Run the Compressor
To compress a text file using the LZ78 Variable-Length Compressor, you can select the first option to specify the path to your input text file after executing the script.You can do this by clicking the green "play" button in the Spyder IDE toolbar or by pressing F5 (invoke python.exe in cli).
-
View Compression Results
After running the compressor, you can view the compressed output in the console or terminal. The compressed output will consist of a series of indices and phrases generated by the LZ78 Variable-Length Compressor.
-
Decompress Files (Optional)
If desired, you can also decompress files using the provided decompression functionality in the
lz78_varlen_compressor.py
file. Follow similar steps to select a different option and run the script to decompress a compressed text file.
By following these steps, you can easily get started with the LZ78 Variable-Length Compressor using the Spyder IDE.
Please note that in the process of compressing a text file, the default format of the text file in question will be '.txt' in this version of the program. This is to avoid possible errors in the program and to maintain the design conception proposed from the beginning of the creation of the program.
Finally, it is advisable to save the content of each encoding or decoding operation you carry out, in order to avoid losing the content.
The LZ78 Variable-Length Compressor works by analyzing the input text file and identifying recurring patterns. Here's a step-by-step overview of the compression and decompression process:
-
Compression:
- Building the Dictionary: The compressor reads the input text file and builds a dictionary of phrases encountered in the text.
- Variable-Length Encoding: Each phrase in the dictionary is assigned a unique index, and the compressor replaces each occurrence of a phrase with its corresponding index. The indices are encoded using variable-length encoding to optimize storage efficiency.
- Output Generation: The compressor generates the compressed output, consisting of a sequence of indices representing the phrases in the input text.
-
Decompression:
- Dictionary Reconstruction: During decompression, the compressor reconstructs the dictionary using the indices provided in the compressed file.
- Decoding Indices: The decompressor reads the sequence of indices from the compressed file and decodes them using variable-length decoding to retrieve the original phrases.
- Output Reconstruction: As the indices are decoded, the decompressor reconstructs the original text by replacing each index with its corresponding phrase from the dictionary.
The LZ78 Variable-Length Compressor utilizes this process to achieve efficient compression and decompression of text files while maintaining data integrity.
Let's observe the decoding process with this example.
Given a dictionary and the tuples of a message, it is possible to decode it in an orderly manner to obtain the message. For example, the first entry in the dictionary is A, so its decoding will be A itself. The same applies to the second and third entries in the dictionary, being single letters.
For the fourth entry in the dictionary, it is decoded as AC, and so on for all entries. The final message in this case is the result of concatenating all the entries in the dictionary, for this example it is 'ABRACADABRA'.
As we can see, the presence of dictionary entries greatly facilitates decoding, making it almost trivial. It could happen that we are unaware of the entries in our dictionary and only know the relationship between each tuple and its dictionary index. In this case, the easiest way would be to obtain the entry of each dictionary position similarly to how it was done in the previous example.
Decoding | Tuple | Dictionary[Index] | Dictionary Entry |
---|---|---|---|
A | (0,a) | 1 | A |
B | (0,b) | 2 | B |
R | (0,r) | 3 | R |
AC | (1,c) | 4 | AC |
AD | (1,d) | 5 | AD |
AB | (1,b) | 6 | AB |
RA | (3,a) | 7 | RA |
To assess the performance of our dictionary-based compressor, we conducted compression tests on several books from the Gutenberg Project, including Philosophiae Naturalis Principia Mathematica, War and Peace, as well as the file of the book El ingenioso hidalgo don Quijote de la Mancha provided in the virtual campus of the subject.
We've created a comparative table to illustrate the tests conducted with the program, analyzing the uncompressed size, compressed size, and compression factor of each file. The compression factor is defined as the ratio of the compressed size of a file to its uncompressed size, allowing us to understand the percentage of the original size occupied by the compressed file.
File | Uncompressed Size | Compressed Size | Compression Factor |
---|---|---|---|
quijote.txt | 2100 KB | 978 KB | 0.466 |
philosophiae_mathematica.txt | 852 KB | 420 KB | 0.493 |
war_and_peace.txt | 3400 KB | 1500 KB | 0.441 |
a_reptidas.txt | 1000 KB | 3 KB | 0.003 |
checkmates_data.txt | 18800 KB | 3700 KB | 0.197 |
Upon analyzing these data, we observe that the dictionary-based compressor shows significant efficiency in most cases, with compression factors ranging between 0.003 and 0.493. This suggests that the existence of patterns in the character string of the text significantly contributes to the compression capability. However, we also note that the compression factor varies widely depending on the file content, as evidenced by the file "a_reptidas.txt," which contains repeated characters, resulting in extremely efficient compression. In conclusion, these results support the effectiveness of the dictionary-based compressor and underscore the importance of considering the content nature when evaluating data compression techniques.
Here are some potential areas for future development of the LZ78 Variable-Length Compressor:
-
Enhanced Compression Algorithms: Explore and implement advanced compression algorithms to improve the compression ratio and efficiency of the compressor.
-
Graphical User Interface (GUI): Develop a user-friendly GUI application for the LZ78 Variable-Length Compressor, allowing users to compress and decompress files with ease.
-
Optimization Techniques: Investigate optimization techniques to optimize the performance of the compressor, making it faster and more efficient.
-
Support for Additional File Formats: Extend the compressor to support compression and decompression of various file formats, beyond just text files.
-
Error Handling and Recovery: Implement robust error handling mechanisms and recovery strategies to handle errors gracefully during compression and decompression processes.
I extend my sincere gratitude to the following individuals and entities for their invaluable support and contributions to the development of the LZ78 Variable-Length Compressor:
-
Automatic Information Processing Teacher: I express our heartfelt thanks to our Automatic Information Processing teacher for their guidance, mentorship, and encouragement throughout the development of this project.
-
Python Software Foundation: For developing and maintaining the Python programming language, which serves as the foundation for this project.
-
Spyder IDE: For providing an intuitive and powerful integrated development environment for Python programming, facilitating the development process of the LZ78 Variable-Length Compressor.
Angel, P. (2022, April 28). Un Poco de Historia: Algoritmos LZ77, LZ78 y lzw - INCUBAWEB - software y web 2.0. Incubaweb. https://www.incubaweb.com/un-poco-de-historia-algoritmos-lz77-lz78-y-lzw/
Hmong.wiki. (n.d.). LZ77 y lz78 Contenido y eficiencia Teórica https://hmong.es/wiki/LZ77_and_LZ78
Ignacio, F. M. J. (2020). Computación Matemática Con Python: Introducción Al Lenguaje python para Científicos e Ingenieros. Ediciones Universidad de Valladolid.