As mentioned in prior sections, this paper introduces 3 enhancements on top of the LZMW data compression algorithm, and this section describes each of them in detail. The prior arts (LZMW and 2Q) are not described again here except for the parts needed to explain the enhancements; readers are assumed to have prior knowledge of the LZMW and 2Q algorithms.
3.1 Enhancement 1: Find the largest “word pair” instead of the largest word
As mentioned in the description of the prior art in the previous section, the LZMW algorithm, like other variants of LZ78, searches the dictionary for the largest word that matches the string read from the current position of the input-pointer in the input file. The enhancement proposed in this section, termed Enhancement_1 from now on, suggests not always selecting the largest word: in some situations, selecting a smaller matching word produces a better compression ratio. In our previous example that means, instead of writing the code of “jumps” to the output file, it can sometimes yield extra compression to select “j” or “jum” and write its code instead. The input-pointer is then set to the character right after the selected word in the input file; after selecting “jum” and writing its code to the output, the input-pointer is set to the “p” that follows “jum” for reading the next word.
Another example clarifies Enhancement_1. Assume a character set with only 4 characters A, B, C and D, whose character numbers are 0, 1, 2 and 3 respectively. When the dictionary is initialized, it contains the 4 single-character words A, B, C and D with initial codes 0, 1, 2 and 3, and each new word added to the dictionary receives codes from 4 onward until the upper limit of the dictionary is reached (the limit is not defined in this example; just assume one exists). Now consider compressing the following string.
CDBABACCDCBACDC
Setting the input-pointer at the beginning (“C”) and running the prior-art LZMW algorithm, which always matches the largest word in the dictionary, we get the steps and outcomes shown in “Table 2”.
Table 2
Illustration of LZMW Compression
Initial dictionary: A – 0, B – 1, C – 2, D – 3
Read Step | Word Read in the input string | Output Code | New Word | Code of New Word |
1 | C | 2 | - | - |
2 | D | 3 | CD | 4 |
3 | B | 1 | DB | 5 |
4 | A | 0 | BA | 6 |
5 | BA | 6 | ABA | 7 |
6 | C | 2 | BAC | 8 |
7 | CD | 4 | CCD | 9 |
8 | C | 2 | CDC | 10 |
9 | BAC | 8 | CBAC | 11 |
10 | D | 3 | BACD | 12 |
11 | C | 2 | DC | 13 |
Each row of “Table 2” shows the word read from the input file, the code written to the output data, and the new word formed in the dictionary (with a new code) by concatenating the last 2 words read from the input file. The 11 steps indicate that 11 codes are written to the output data during the compression of the above string. So by always selecting the largest matching word, we compress the input with 11 output codes. However, the compression can finish with 10 codes instead of 11 by making a change from step 9 onward, selecting a smaller matching word from the dictionary instead of the largest one.
Of the input file CDBABACCDCBACDC, the first part CDBABACCDC has been read by the end of step 8 and the part BACDC remains, i.e., after step 8 the input-pointer sits at the beginning of BACDC. The dictionary contains the words “B”, “BA” and “BAC”, and “Table 2” shows that “BAC” was selected at step 9 because it is the largest matching word from the position of the input-pointer. But if “BA” had been selected in step 9 instead, the input-pointer would have read the next word starting from the “C” after “BA”, and step 10 would have finished the input file by selecting “CDC”, which was already present in the dictionary with code = 10. The last 2 steps of “Table 2” would then look as shown in “Table 2a”, and the output would need 10 codes instead of 11.
Table 2a
Illustration with a Smaller Word in Step 9
9 | BA | 6 | CBA | 11 |
10 | CDC | 10 | BACDC | 12 |
The alternative approach defined above finishes the compression with fewer output codes, hence it provides a better compression ratio than the previous approach. The reason is that in step 9 of the alternative approach the selected word “BA” is 2 characters long and step 10 selects “CDC”, which is 3 characters long, so the total length of the 2 successive words is 2 + 3 = 5. In the previous approach, the word “BAC” at step 9 was 3 characters and “D” at step 10 was 1 character, for a total of 3 + 1 = 4, which is less than the alternative. The conclusion drawn from this illustration is that the compressor should find the largest “word pair” across 2 subsequent readings instead of the largest single word.
The only way to find the largest word pair is to iterate through all candidate words at a particular step “x” and, for each of them, find the possible next word at step “x + 1”. Step 9 of the above example had 3 options, “B”, “BA” and “BAC”, and we ended up selecting “BA” because the next word is then “CDC”, which makes the combined length of the words of steps 9 and 10 equal to 2 + 3 = 5, more than for any other choice at step 9. To determine which of “B”, “BA” and “BAC” produces the largest word pair together with step 10, we iterate through each of them and see which word the dictionary matches at step 10 for each selection. Selecting “B” at step 9 leaves “A” as the only matching word at step 10, for a combined length of 1 + 1 = 2; selecting “BAC” leads to “D” at step 10 as already seen, for a combined length of 3 + 1 = 4; which proves that selecting “BA” is the most beneficial.
Building on this example of iterating through the candidate words at step 9 and identifying the largest word pair across steps 9 and 10, Enhancement_1 proposes an efficient programmatic approach for finding the largest word pair; it is slower than the prior art but provides a remarkably better compression ratio. The decompression process is unchanged by this approach, so decompression performance is unaffected. The proposed algorithm creates 2 arrays named oldWordArray and newWordArray and keeps all the words matching at a particular step in oldWordArray. In our current example, at step 9, oldWordArray is populated as follows (using 0-based indexing).
oldWordArray[0] = “B”
oldWordArray[1] = “BA”
oldWordArray[2] = “BAC”
The proposed algorithm then loops over oldWordArray and, in each iteration, reads the next set of matching words from the input file into newWordArray. In our example, the 1st iteration picks up “B” from oldWordArray and populates newWordArray[0] = “A”; the 2nd iteration picks up “BA” and populates newWordArray[0] = “C”, newWordArray[1] = “CD” and newWordArray[2] = “CDC” (C, CD and CDC are all present in the dictionary); and the last iteration picks up “BAC” and populates newWordArray[0] = “D”.
Within the loop, the proposed algorithm computes the length of the word pair as the length of oldWordArray[i] plus the length of newWordArray[last_index], where “i” is the index of the current iteration over oldWordArray and “last_index” is the index of the last element of newWordArray. The index “i” and the corresponding newWordArray must be retained for the iteration in which the word-pair length is largest. “Diagram 1” shows oldWordArray and the newWordArray populated in each iteration of the loop, along with the word-pair length found in each iteration.
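The pair-selection loop described above can be sketched in Java (the language of the paper’s experiments). This is a minimal illustration only: the list-based dictionary and the `matchingWords` helper are simplified stand-ins for the real dictionary lookup.

```java
import java.util.ArrayList;
import java.util.List;

public class PairSelect {
    // Return every dictionary word matching the input at pos, shortest first
    // (so the last element is the largest match). Hypothetical stand-in for
    // the real dictionary lookup.
    static List<String> matchingWords(List<String> dict, String input, int pos) {
        List<String> out = new ArrayList<>();
        for (String w : dict)
            if (input.startsWith(w, pos)) out.add(w);
        out.sort((a, b) -> a.length() - b.length());
        return out;
    }

    // Pick the element of oldWordArray maximizing
    // len(oldWord) + len(largest word matching right after it).
    static String bestOldWord(List<String> dict, String input, int pos,
                              List<String> oldWordArray) {
        int largestPair = Integer.MIN_VALUE;
        String best = null;
        for (String cand : oldWordArray) {
            List<String> next = matchingWords(dict, input, pos + cand.length());
            int lastLen = next.isEmpty() ? 0 : next.get(next.size() - 1).length();
            if (cand.length() + lastLen > largestPair) {
                largestPair = cand.length() + lastLen;
                best = cand;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Step 9 of the example: the input-pointer sits at "BACDC".
        List<String> dict = List.of("A", "B", "C", "D", "CD", "DB", "BA",
                                    "ABA", "BAC", "CCD", "CDC");
        String input = "CDBABACCDCBACDC";
        int pos = 10; // start of "BACDC"
        System.out.println(bestOldWord(dict, input, pos,
                matchingWords(dict, input, pos))); // prints "BA": pair length 2 + 3 = 5
    }
}
```

On the example string, the sketch reproduces the choice derived by hand: “BA” wins over “B” (pair length 2) and “BAC” (pair length 4).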
After the loop ends, the proposed algorithm writes the code of oldWordArray[i] to the output data, where “i” is the index of oldWordArray for which the largest word pair was found. It then concatenates the previous word written to the output with oldWordArray[i], adds the new word created by the concatenation to the dictionary, copies newWordArray to oldWordArray (oldWordArray = newWordArray), and starts a new loop on the newly formed oldWordArray to obtain the next newWordArray.
Hence Enhancement_1 uses nested loops: an outer loop that works on a different oldWordArray in each iteration (each step, as shown in “Table 2” or “Table 3”), and an inner loop over the elements of oldWordArray that identifies the largest word pair with the newWordArray created in that iteration, as described above. The selected newWordArray of the current iteration then becomes the oldWordArray of the next iteration of the outer loop. The very first oldWordArray is populated with the first character read from the input file, its only element being oldWordArray[0] = <first character of input file>.
If we compress the input file CDBABACCDCBACDC of the above example using the concept of Enhancement_1 described so far, we go through the steps defined in “Table 3”.
Table 3
Illustration 1 with Enhancement_1
Initial dictionary: A – 0, B – 1, C – 2, D – 3
Read Step | oldWordArray | Selected Index in oldWordArray (0-based) | oldWord = Selected Element of oldWordArray | Output Code | Selected newWordArray | New Word :: Prev oldWord + Current oldWord | New Word Code |
1 | C | 0 | C | 2 | D | - | - |
2 | D | 0 | D | 3 | B | CD | 4 |
3 | B | 0 | B | 1 | A | DB | 5 |
4 | A | 0 | A | 0 | B | BA | 6 |
5 | B | 0 | B | 1 | A | AB | 7 |
6 | A | 0 | A | 0 | C | BA | (6-repeat) |
7 | C | 0 | C | 2 | C, CD | AC | 8 |
8 | C, CD | 1 | CD | 4 | C | CCD | 9 |
9 | C | 0 | C | 2 | B, BA | CDC | 10 |
10 | B, BA | 1 | BA | 6 | C, CD, CDC | CBA | 11 |
11 | C, CD, CDC | 2 | CDC | 10 | - | BACDC | 12 |
Two things stand out when we compare Table 3, containing the steps of Enhancement_1, with Table 2, containing the steps of LZMW. First, the compression still needs 11 steps under Enhancement_1; second, step 5 of Enhancement_1 reads “B” whereas the same step in LZMW (Table 2) reads “BA”. The word “BA” was created in step 4 by concatenating the words “B” and “A” selected in steps 3 and 4, and LZMW could read it in step 5 because it encountered BA in the input file starting from position 5. Enhancement_1 cannot read “BA” there because newWordArray is finalized before the concatenation of the previous step’s oldWord with the current step’s oldWord: the current oldWord cannot be finalized without finalizing newWordArray, since the largest word pair, i.e., “length of oldWord” + “length of the last element of newWordArray”, is the basis for finalizing both. This behavior indicates that Enhancement_1 still needs some refinement before it can serve as a practical improvement over the LZMW algorithm. To resolve this issue, the following modification is suggested for Enhancement_1.
Modification 1: Creating a temporary word:
- Introduce a global variable named “temporaryWord”, which keeps just 1 word at a time. In each step, concatenate the oldWord (the selected oldWordArray[i] as defined in Table 3) with the last element of newWordArray and store the result in “temporaryWord” without assigning any code to it.
- While populating newWordArray in the next step, do not only search for matching words in the dictionary; also check whether the temporaryWord created in the last step matches the input file. If it matches, add it to newWordArray.
- If the temporaryWord receives a new code in the next step, i.e., if temporaryWord = prev oldWord + current oldWord in the concatenation performed in the next step, then add it to the dictionary.
- Overwrite temporaryWord with the new temporary word created in each step.
- The rest of the process continues as before.
If we add this modification to our example, the steps to compress the input file change as shown in Table 4, and compression now completes in 10 steps.
Table 4
Illustration with Enhancement_1 and Modification 1
Initial dictionary: A – 0, B – 1, C – 2, D – 3
Read Step | oldWordArray | Selected Index in oldWordArray (0-based) | oldWord = Selected Element of oldWordArray | Output Code | Selected newWordArray (Includes Current Temporary Word) | New Word :: Prev oldWord + Current oldWord | New Word Code | temporaryWord (for the Next Step) :: oldWord + Last Element of newWordArray |
1 | C | 0 | C | 2 | D | - | - | CD |
2 | D | 0 | D | 3 | B | CD | 4 | DB |
3 | B | 0 | B | 1 | A | DB | 5 | BA |
4 | A | 0 | A | 0 | B, BA | BA | 6 | ABA |
5 | B, BA | 1 | BA | 6 | C | ABA | 7 | BAC |
6 | C | 0 | C | 2 | C, CD | BAC | 8 | CCD |
7 | C, CD | 1 | CD | 4 | C | CCD | 9 | CDC |
8 | C | 0 | C | 2 | B, BA, BAC | CDC | 10 | CBA |
9 | B, BA, BAC | 1 | BA | 6 | C, CD, CDC | CBA | 11 | BACDC |
10 | C, CD, CDC | 2 | CDC | 10 | - | BACDC | 12 | - |
Modification 2: Adjusting oldWordArray
As already observed, the newWordArray of each “Read Step” in Table 4 becomes the oldWordArray of the next step. But before it is used in the next step, a concatenation happens between the previous oldWord and the current oldWord, which not only adds a new word to the dictionary but may also remove the least recently used word from it. If the removed word is part of oldWordArray, or the new word is one that can be inserted into oldWordArray, then oldWordArray must be adjusted before it is used in the next step for iterating through it and finding the largest word pair. So, for Enhancement_1 to be practically implementable, the new word added to (and any word removed from) the dictionary during the concatenation of the previous oldWord and the current oldWord must be captured, and the newWordArray of the current step (i.e., the oldWordArray of the next step) must be adjusted with the result of the concatenation.
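As a sketch, the adjustment described in Modification 2 might look like the following. The helper name and the list-based representation are ours; the real implementation would work against the actual dictionary structure.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjustOldWordArray {
    // Sketch of Modification 2: after the dictionary concatenation, drop any
    // evicted word from oldWordArray and add the newly created word if it
    // also matches the input at the current position of the input-pointer.
    static List<String> adjust(List<String> oldWordArray, String newWord,
                               String removedWord, String input, int pos) {
        List<String> adjusted = new ArrayList<>(oldWordArray);
        adjusted.remove(removedWord); // no-op when the evicted word is not in the array
        if (newWord != null && input.startsWith(newWord, pos) && !adjusted.contains(newWord))
            adjusted.add(newWord);
        adjusted.sort((a, b) -> a.length() - b.length()); // keep the "largest word last" order
        return adjusted;
    }

    public static void main(String[] args) {
        // "CDC" was just added to the dictionary and also matches at the pointer.
        System.out.println(adjust(List.of("C", "CD"), "CDC", null, "CDCBA", 0));
        // prints [C, CD, CDC]
    }
}
```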
Algorithm
The complete compression algorithm of Enhancement_1 is provided in Algorithm 3. The decompression algorithm of Enhancement_1 is the same as that of LZMW, so Algorithm 2 remains a valid decompression algorithm for Enhancement_1.
Algorithm 3
Compression algorithm for Enhancement_1.
Begin
declare input_pointer = start a stream reader for the input data
declare output_pointer = output stream to write the code
initialize the dictionary with single-character words
declare oldWordArray, newWordArray and currentWordArray
oldWordArray[0] = read the first word (a single-character word at this point)
declare oldWord, olderWord, currentOldWord, temporaryWord
declare concatResult (newWord, removedWord)
while not end-of-file loop (outer loop)
if concatResult is not null then
Adjust oldWordArray with concatResult if there's scope to adjust
end if
declare largestWordPair = the smallest value possible
loop on oldWordArray
currentOldWord = current element of oldWordArray
set the input_pointer right after currentOldWord in the input data
currentWordArray = Read all words matching in the dictionary
(from the current position of the input_pointer)
if temporaryWord matches from the current position of the input_pointer then
add temporaryWord to currentWordArray
end if
wordPairLength = length of currentOldWord + length of currentWordArray[last_index]
if largestWordPair < wordPairLength then
largestWordPair = wordPairLength
newWordArray = currentWordArray
oldWord = currentOldWord
end if
end loop
mark oldWord as a "hit" in 2Q (or any other LRU / cache replacement algorithm)
if olderWord is not null then
/*Below function concatenates olderWord and oldWord*/
newWord = concatenate ()
if dictionary is full then
removedWord = remove least recently used word from dictionary using 2q
newCode = code of removedWord
else
newCode = next code in dictionary
end if
assign newCode to newWord
add newWord to the dictionary
include newWord and removedWord to concatResult
end if
/*Below function writes the code of oldWord to the output*/
writeCode (oldWord)
/*Below function concatenates oldWord and the last element of newWordArray*/
temporaryWord = concatenateToTempWord ()
olderWord = oldWord
oldWordArray = newWordArray
end loop
End
The comparison between the prior-art LZMW algorithm and Enhancement_1 is performed using separate source codes written in Java (say Source Code 1 for LZMW and Source Code 2 for Enhancement_1). The same parameters for the dictionary and the 2Q cache replacement algorithm are used in both source codes. The common parameters are as follows: the code length is 16 bits, hence the dictionary size is 2^16 = 65536, meaning the dictionary can hold 65536 words including the 256 single characters of an 8-bit character set. For 2Q, the size of the A1 queue is 65536/4, the Am queue holds the rest of the multi-character words of the dictionary, and the size of the A1-out queue is 4096. With these parameters, both algorithms (LZMW and Enhancement_1) are tested on 5 different data sets, and the test results are disclosed in Table 5.
Table 5
Data Compression Comparison
File Name | Size (Bytes) | Compressed File Size using Prior Art, LZMW (Bytes) | Compressed File Size using Enhancement_1 (Bytes) |
Data_set1.txt | 14,064,134 | 4,650,160 | 4,284,758 |
Data_set2.bmp | 30,108,726 | 20,535,076 | 17,860,094 |
Data_set3.doc | 2,750,976 | 673,110 | 651,544 |
Data_set4.doc | 15,605,248 | 7,511,854 | 7,351,172 |
Data_set5.tif | 72,481,956 | 50,568,720 | 49,261,850 |
3.2 Enhancement 2: A multi-character word will have a minimum of 3 characters
The LZMW algorithm initializes the dictionary with all single characters and then adds multi-character words by concatenating 2 subsequent words found in the input file; Enhancement_1 creates new words the same way. This section introduces another enhancement on top of Enhancement_1 with a small variation in the concatenation process, whose main objective is to never create a word of exactly 2 characters. That is, apart from the single characters, every word in the dictionary will contain 3 or more characters. This change allows the code length to grow beyond 16 bits; the examples in this paper mainly use a 20-bit code length to get a dictionary space of 2^20 = 1048576 words, which produces a better compression ratio. Increasing the code length to 20 bits while keeping 2-character words in the dictionary is not beneficial: a single character is 8 bits and 2 characters are 16 bits, so writing a 20-bit code for a 16-bit word hurts the compression ratio. The enhancement proposed in this section, termed Enhancement_2 from now on, therefore provides 2 concatenation algorithms to replace the current practice of simply concatenating the last 2 words.
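The cost argument can be made concrete with a little arithmetic. The sketch below (the helper name is ours) computes the net bits saved when an n-character word of an 8-bit character set is replaced by a single code:

```java
public class CodeGain {
    // Bits saved when an n-character (8-bit) word is replaced by one code of
    // codeLength bits; a negative result means the code expands the data.
    static int bitsSaved(int n, int codeLength) {
        return 8 * n - codeLength;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++)
            System.out.println(n + " chars: " + bitsSaved(n, 20) + " bits saved");
        // At a 20-bit code length, a 2-character word saves -4 bits (a net loss),
        // while a 3-character word saves +4 bits.
    }
}
```

Note that at the 16-bit code length used earlier, a 2-character word merely breaks even (8 × 2 − 16 = 0), which is why banning 2-character words only becomes important once the code length grows.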
Algorithm 3 of Enhancement_1 above contains 2 different concatenation lines serving different purposes: one concatenates olderWord with oldWord to create a new word in the dictionary, and the other concatenates oldWord with the last word of newWordArray to create a temporary word. Algorithm 2, which describes the decompression of both LZMW and Enhancement_1, contains a similar line that concatenates 2 words to create a new dictionary word. Enhancement_2 changes the concatenation process in all these places, because if the 2 concatenated words have a single character each, the concatenation creates a 2-character word, which is exactly what this enhancement avoids. Hence, to create words with a minimum of 3 characters, this enhancement declares 2 variables named buffer1 and buffer2, capable of holding the previous 2 words read from the input file. The intention is that if each of the previous 2 words contains 1 character, they are kept in the buffers instead of being concatenated into a new word, so that all 3 words can be concatenated later. The highlights of the entire process are provided below.
Initially the dictionary is initialized with all 256 characters of the 8-bit character set, and then reading of the input file starts. From the first step onward, the concatenation process that creates a new word in the dictionary follows the rules defined below.
- If buffer1 is empty (as in the first step), add the current word to buffer1.
- If buffer1 has a single character, buffer2 is empty and the current word has a single character, then add the current word to buffer2 and do not concatenate anything in this step (i.e., do not create a new word).
- If buffer1 has multiple characters, buffer2 is empty and the current word has a single character, then concatenate buffer1 + current word and overwrite buffer1 with the current word for the next step.
- If buffer1, buffer2 and the current word all contain single-character words, then concatenate buffer1 + buffer2 + current word, then overwrite buffer1 with buffer2 and buffer2 with the current word for use in the next step.
- If the current word contains more than one character, check whether buffer2 is empty. If it is not, concatenate buffer1 + buffer2 + current word; otherwise concatenate buffer1 + current word.
Algorithm 4, submitted with this paper, defines the above process under Enhancement_2; it must be applied in both the compression and the decompression process.
Algorithm 4
Enhancement_2 - Concatenation during compression and decompression.
------------------------------------------------------
function concatenate()
------------------------------------------------------
if validNode.level = 1 and buffer1 = null then
buffer1 = validNode
else if validNode.level = 1 and buffer1.level = 1 and buffer2 = null then
buffer2 = validNode
else if validNode.level = 1 and buffer2 = null /*and buffer1.level > 1*/ then
concat buffer1 + validNode
buffer1 = validNode
else if validNode.level = 1 /*and buffer1 and buffer2 with level = 1 for both*/ then
concat buffer1 + buffer2 + validNode
buffer1 = buffer2
buffer2 = validNode
else /*only remaining option is, "if validNode.level > 1"*/
if buffer2 = null then
concat buffer1 + validNode
else
concat buffer1 + buffer2 + validNode
end if
buffer1 = validNode
buffer2 = null
end if
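Algorithm 4 can be turned into a runnable Java sketch under the assumption that `validNode.level` is simply the word's character count; here the word is passed as a `String` and a `newWord` field stands in for the `concat` result:

```java
public class MinThreeConcat {
    static String buffer1 = null, buffer2 = null;
    static String newWord = null; // word created by the latest call, if any

    // Concatenation rules of Enhancement_2: never create a 2-character word.
    // (Assumes the first word read is always a single character, as in the
    // paper, since the dictionary starts with single characters only.)
    static void concatenate(String word) {
        newWord = null;
        if (word.length() == 1 && buffer1 == null) {
            buffer1 = word;
        } else if (word.length() == 1 && buffer1.length() == 1 && buffer2 == null) {
            buffer2 = word;
        } else if (word.length() == 1 && buffer2 == null) { // buffer1 is multi-character
            newWord = buffer1 + word;
            buffer1 = word;
        } else if (word.length() == 1) { // buffer1, buffer2, word: all single characters
            newWord = buffer1 + buffer2 + word;
            buffer1 = buffer2;
            buffer2 = word;
        } else { // current word has multiple characters
            newWord = (buffer2 == null) ? buffer1 + word : buffer1 + buffer2 + word;
            buffer1 = word;
            buffer2 = null;
        }
    }

    public static void main(String[] args) {
        for (String w : new String[] {"C", "D", "B", "ABA"}) {
            concatenate(w);
            System.out.println(w + " -> new word: " + newWord);
        }
    }
}
```

Running the sketch on the word sequence C, D, B, ABA produces no new word for the first two reads, then "CDB" and "DBABA": every word created has at least 3 characters, as intended.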
The second part of Enhancement_2 alters the concatenation process that creates the temporaryWord defined in Enhancement_1. The temporary word is created during the compression process of Enhancement_1 by concatenating oldWord and the last element of newWordArray in each step. Since Enhancement_2 requires every created word to have a minimum of 3 characters, the temporary word must also have a minimum of 3 characters, because depending on the largest word pair, the temporary word may be selected as a word to be inserted into the dictionary in the next step. Hence the temporary word must be built such that if the next step finds “last element of oldWordArray + last element of newWordArray” to be the largest word pair, then the concatenation process that creates the new dictionary word in the next step (defined in Algorithm 4) recreates exactly the temporary word created in this step. The algorithm to create the temporary word by concatenation is included in Algorithm 5.
Algorithm 5
Concatenation to create temporary word with minimum 3 characters.
---------------------------------------------------------------
function concatenateToTempWord()
---------------------------------------------------------------
declare bufferConTempWord (global; its value is retained across steps)
declare oldWord = selected element of oldWordArray
declare newWord = last element of newWordArray
if bufferConTempWord is null then
if length of oldWord = 1 and length of newWord = 1 then
bufferConTempWord = oldWord
else
temporaryWord = concat oldWord + newWord
end if
else
if length of oldWord = 1 then
temporaryWord = concat bufferConTempWord + oldWord + newWord
bufferConTempWord = oldWord
else
temporaryWord = concat oldWord + newWord
bufferConTempWord = null
end if
end if
Algorithm 4 defines a function named concatenate() and Algorithm 5 defines concatenateToTempWord(); these must replace the references to concatenate() and concatenateToTempWord() in the compression algorithm (Algorithm 3) and the decompression algorithm (Algorithm 2) of Enhancement_1 in order to combine Enhancement_1 and Enhancement_2.
The source code containing Enhancement_1 and Enhancement_2 does not show much improvement on the data set files submitted in this paper unless Enhancement_3, described in the next section, is added as well. Hence no comparison table is included at the end of this section; the final comparison table is shown at the end of this paper, with compression ratios obtained by a program that includes all 3 enhancements.
3.3 Enhancement 3: Process to write in the output file
There are 2 ways to write a code in the output file during the compression process. When the compression program reads a word from the input file, it can encounter a single character (8 bits in size) if no other match is found in the dictionary from the current position of the input-pointer, or it can encounter a multi-character word for which a code is assigned in the dictionary. Enhancement_2 in the previous section suggests a 20-bit code length to get a dictionary space of 2^20 words. When writing the codes of words or single characters to the output file, we cannot simply write 8 bits for single characters and 20 bits for multi-character words, because the decompressor would never know whether to read 8 bits or 20 bits from the compressed output to get the next word.
Hence there are 2 ways to write the code to the output file such that the decompressor can unambiguously identify the codes of single-character and multi-character words. One way (say approach_1) is to write 20-bit codes for both single characters and multi-character words, so that the decompressor always reads 20 bits from the compressed file per word. For instance, to write the code of the character ‘A’, whose ASCII value is 65, i.e., “01000001” in 8-bit binary, the compressor prefixes these 8 bits with 12 extra zero bits to make 20 bits; the binary output code for A then becomes “00000000000001000001”.
The other approach (approach_2) writes the 8-bit character code for a single character or the 20-bit dictionary code for a multi-character word with a 1-bit prefix before each of them. The compression program writes ‘0’ before an 8-bit character code and ‘1’ before a 20-bit dictionary code, so the decompressor knows by reading the 0 or 1 whether the code ahead belongs to a single-character or a multi-character word, and reads 8 or 20 bits accordingly.
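The two bit layouts can be illustrated as bit strings; this is a minimal sketch (the helper names are ours) rather than the actual bit-level writer:

```java
public class PrefixCodes {
    // approach_1: every code is zero-padded to a fixed codeLength bits.
    static String approach1(int code, int codeLength) {
        String bits = Integer.toBinaryString(code);
        return "0".repeat(codeLength - bits.length()) + bits;
    }

    // approach_2: a 1-bit flag, then 8 bits for a single character
    // or codeLength bits for a multi-character word.
    static String approach2(int code, boolean singleChar, int codeLength) {
        int width = singleChar ? 8 : codeLength;
        String bits = Integer.toBinaryString(code);
        return (singleChar ? "0" : "1") + "0".repeat(width - bits.length()) + bits;
    }

    public static void main(String[] args) {
        System.out.println(approach1(65, 20));       // 'A' padded to 20 bits
        System.out.println(approach2(65, true, 20)); // flag 0 + 8-bit 'A' = 9 bits total
    }
}
```

For ‘A’ (ASCII 65), approach_1 emits the 20-bit string “00000000000001000001” from the text above, while approach_2 emits only 9 bits, “001000001”.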
The enhancement introduced in this section, termed Enhancement_3 from now on, proposes a proper way to calculate and identify whether approach_1 or approach_2 brings the better compression ratio; the calculation is performed at each step of the compression and decompression process. The choice between approach_1 and approach_2 depends on the number of multi-character words versus the number of single-character words found in the input file.
If the symbol M denotes the number of multi-character words present in an input file and S the number of single-character words present in the same file, then the extra bits (any bits added to the compressed output beyond the 8 bits per single character and 20 bits per multi-character word code) needed by each approach are described below.
For code length of 20 bits, Approach_1 needs 12 extra bits to be prefixed before each 8 bits characters in the file to make them 20 bits, so extra bits needed in approach_1 = 12 x S.
Approach_2 needs 1 extra bit prefixed before each word with single or multiple characters, so extra bits needed in approach_2 = (M + S) x 1 = M + S.
If both approach_1 and approach_2 yield the same compression ratio for an input file, the ratio M/S can be found in the following way.
M + S = 12 * S
or (M + S)/S = 12
or M/S + 1 = 12
or M/S = 11
That means for an input file with M/S = 11 (assuming a 20-bit code length for words), the compression ratios obtained from approach_1 and approach_2 are exactly the same. In the same way it can be shown that for a file with M/S < 11, approach_2 brings the better compression ratio, and for a file with M/S > 11, approach_1 does.
Inequality 1: M/S < 11 or M/S + 1 < 12 or (M + S)/S < 12 or (M + S) < 12 * S → approach_2 needs fewer extra bits.
Inequality 2: M/S > 11 or M/S + 1 > 12 or (M + S)/S > 12 or (M + S) > 12 * S → approach_1 needs fewer extra bits.
Table 6 lists the same 5 data set files shown in Table 5, this time with the counts of multi-character words (M) and single characters (S) collected for each file by running a compression program that includes both Enhancement_1 and Enhancement_2 with a code length of 20 bits. It has been observed across multiple files that the ratio M/S varies and is not consistently greater than or less than 11: in Table 6, M/S for Data_set1.txt is greater than 11, but it is less than 11 for the other 4 files.
Table 6
Count of Words with Multiple and Single Characters
File Name | Number of Single Characters Read and the Code Written to Output (Symbol → S) | Number of Multiple Characters Words Read and the Code Written to Output (Symbol → M) | M/S |
Data_set1.txt | 90925 | 1488670 | 16.3725 |
Data_set2.bmp | 1639122 | 5687146 | 3.4696 |
Data_set3.doc | 101360 | 207853 | 2.0506 |
Data_set4.doc | 3149275 | 1081770 | 0.3435 |
Data_set5.tif | 4333361 | 14574093 | 3.3632 |
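Applying the two extra-bit formulas to the counts in Table 6 confirms the inequalities above; a quick sketch (function names are ours):

```java
public class ExtraBits {
    // Extra bits beyond the raw 8-bit / codeLength-bit payloads, per approach.
    static long extraApproach1(long s, int codeLength) {
        return (long) (codeLength - 8) * s; // zero-padding on every single character
    }

    static long extraApproach2(long m, long s) {
        return m + s; // one flag bit per word, single- or multi-character
    }

    static String better(long m, long s, int codeLength) {
        return extraApproach1(s, codeLength) < extraApproach2(m, s)
                ? "approach_1" : "approach_2";
    }

    public static void main(String[] args) {
        // Data_set1.txt: M/S = 16.37 > 11, so approach_1 should win.
        System.out.println(better(1488670, 90925, 20));   // prints approach_1
        // Data_set4.doc: M/S = 0.34 < 11, so approach_2 should win.
        System.out.println(better(1081770, 3149275, 20)); // prints approach_2
    }
}
```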
Hence Enhancement_3 proposes a generic algorithm that calculates and identifies whether approach_1 or approach_2 is to be used each time a code of a character or word is written to the output file. The calculation is based on the number of single-character and multi-character words read so far in the input file. The same algorithm must be used during decompression to unambiguously identify the approach used during compression.
The ratio M/S for which approach_1 and approach_2 provide the same compression ratio was identified above as 11 for a 20-bit code length. The general formula for a different code length is M/S = code_length − 8 − 1, where 8 is the length of a character in bits. This number (11 for a 20-bit code length) is kept in a variable named “ratioline” to be used in the algorithm.
So the basic purpose of the algorithm included in Algorithm 6 is to compute M/S from the counts of the words read so far from the input file and compare it with the ratioline. The algorithm starts by initializing 2 variables, multiChars and singleChars, to 1, and the respective variable is incremented whenever a word with multiple characters or a single character is encountered in the input file. At each step multiChars/singleChars is compared with the ratioline (11); if the value is greater than the ratioline, the next code is written to the output file following approach_1, otherwise it is written following approach_2. The decompression algorithm, also provided in Algorithm 6, mirrors the compression algorithm and likewise starts by initializing multiChars and singleChars to 1.
Algorithm 6
Proposed compression and decompression algorithm under Enhancement_3.
----------------------------------------------
--Algorithm During Compression ---------------
----------------------------------------------
Function writeCode (word)
----------------------------------------------
code = read the code of the word from dictionary
declare code_length = length of the code assigned to a word in the dictionary
declare ratioline = code_length − 8 − 1
declare multiChars = 1, singleChars = 1 //persist across calls; initialized once
if multiChars / singleChars > ratioline then //approach_1
    if word has single character then
        prefix (code_length − 8) zeros to the code to make its size code_length
    end if
    write the code with size code_length bits to the output file
else //approach_2
    declare oneBit
    if word has single character then
        oneBit = 0
    else
        oneBit = 1
    end if
    write oneBit to the output file
    write the 8-bit code for a single character or the
    code_length-bit code for a multiple-character word to the output file
end if
if word has single character then
    singleChars = singleChars + 1
else
    multiChars = multiChars + 1
end if
----------------------------------------------
--End of Function-----------------------------
----------------------------------------------
--Algorithm During Decompression ---------------
----------------------------------------------
Function readWord ()
----------------------------------------------
declare code_length = length of the code assigned to a word in the dictionary
declare ratioline = code_length − 8 − 1
declare multiChars = 1, singleChars = 1 //persist across calls; initialized once
declare wordToReturn
if multiChars / singleChars > ratioline then //approach_1
    read code_length bits from the compressed input
    wordToReturn = the related word from the dictionary
else //approach_2
    oneBit = read 1 bit from the compressed input
    if oneBit = 0 then
        read 8 bits from the compressed input
        wordToReturn = the related single-character word from the dictionary
    else
        read code_length bits from the compressed input
        wordToReturn = the related multiple-character word from the dictionary
    end if
end if
if wordToReturn is a single-character word then
    singleChars = singleChars + 1
else
    multiChars = multiChars + 1
end if
return wordToReturn
-----------------------
--End of Function------
-----------------------
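The adaptive switching in Algorithm 6 can be sketched in Python. This is a minimal illustration, not the paper's implementation: the class name AdaptiveCoder and the bit-list representation are invented for the example, and it assumes single-character words keep dictionary codes below 256 (as in the initialized dictionary):

```python
CODE_LENGTH = 20                     # bits per dictionary code
RATIOLINE = CODE_LENGTH - 8 - 1      # break-even M/S ratio (11 for 20 bits)

class AdaptiveCoder:
    """Tracks word counts and picks approach_1 or approach_2 per code."""

    def __init__(self):
        # Counters start at 1 and persist across calls, as in Algorithm 6.
        self.multi_chars = 1
        self.single_chars = 1

    def write_code(self, bits, code, is_single_char):
        """Append the bits for one dictionary code to `bits` (list of 0/1)."""
        if self.multi_chars / self.single_chars > RATIOLINE:   # approach_1
            # Every code is written with the full CODE_LENGTH bits;
            # single-character codes are implicitly zero-padded on the left.
            bits.extend(int(b) for b in format(code, f'0{CODE_LENGTH}b'))
        else:                                                  # approach_2
            if is_single_char:
                bits.append(0)                                 # flag bit
                bits.extend(int(b) for b in format(code, '08b'))  # code < 256 assumed
            else:
                bits.append(1)                                 # flag bit
                bits.extend(int(b) for b in format(code, f'0{CODE_LENGTH}b'))
        if is_single_char:
            self.single_chars += 1
        else:
            self.multi_chars += 1

    def read_code(self, bits, pos):
        """Mirror of write_code; returns (code, is_single_char, new_pos)."""
        if self.multi_chars / self.single_chars > RATIOLINE:   # approach_1
            code = int(''.join(map(str, bits[pos:pos + CODE_LENGTH])), 2)
            pos += CODE_LENGTH
            is_single = code < 256    # assumption: single chars keep codes 0-255
        else:                                                  # approach_2
            flag, pos = bits[pos], pos + 1
            width = 8 if flag == 0 else CODE_LENGTH
            code = int(''.join(map(str, bits[pos:pos + width])), 2)
            pos += width
            is_single = (flag == 0)
        if is_single:
            self.single_chars += 1
        else:
            self.multi_chars += 1
        return code, is_single, pos
```

Because the writer and the reader update their counters identically on every word, the decompressor always reconstructs the same approach_1/approach_2 decisions without any extra signaling.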
Algorithm 6 defines the function writeCode() for compression and readWord() for decompression, which can replace the same functions defined in Algorithm 3 and Algorithm 2 of Enhancement_1 to combine Enhancement_3 with Enhancement_1. The process of combining Enhancement_1 and Enhancement_2 was already described in the previous section. Table 7 compares the compression obtained by the program combining all 3 enhancements (Enhancement_1 + Enhancement_2 + Enhancement_3) with a code length of 20 bits (and 22 bits for 2 of the files) against LZMW and Enhancement_1.
Table 7
Data Compression Comparison
File Name | Size (Bytes) | Compressed File Size using Prior Art (Bytes) | Compressed File Size using Enhancement_1 (Bytes) | Size of Zip File using “Ultra” Option (Bytes) | Compressed File Size using Enhancement_1 + Enhancement_2 + Enhancement_3 with 20-Bit Code Length (Bytes) | Compressed File Size using Enhancement_1 + Enhancement_2 + Enhancement_3 with 22-Bit Code Length (Bytes) |
Data_set1.txt | 14,064,134 | 4,650,160 | 4,284,758 | 5,013,497 | 3,949,018 | |
Data_set2.bmp | 30,108,726 | 20,535,076 | 17,860,094 | 17,502,428 | 16,772,126 | |
Data_set3.doc | 2,750,976 | 673,110 | 651,544 | 436,801 | 659,664 | 711,629 |
Data_set4.doc | 15,605,248 | 7,511,854 | 7,351,172 | 6,335,961 | 6,386,205 | 5,586,613 |
Data_set5.tif | 72,481,956 | 50,568,720 | 49,261,850 | 46,329,478 | 43,128,118 | |
With the combination of the 3 enhancements and a code length of 20 bits, all the data set files shown in Table 7 compress better than the zip files created out of them, except Data_set3.doc and Data_set4.doc. Table 6 shows that Data_set3.doc has 101360 + 207853 = 309213 words, while the dictionary can hold 2^20 = 1048576 words in total, which means Data_set3.doc gets fully compressed before the dictionary is full (assuming at most one new word gets created in each step by concatenation); that is why the purpose of Enhancement_2, keeping as many words as possible in the dictionary, has no effect on Data_set3.doc.
Data_set4.doc is much bigger in size, and Table 6 shows it has 3149275 + 1081770 = 4231045 total words, which is much higher than the dictionary capacity of 1048576 words, so we can assume that the dictionary was full during compression. Still, the Table 6 statistics show that the number of single-character words exceeds the number of multiple-character words; hence providing more dictionary space by increasing the code length to 22 bits (2^22 = 4194304) gives it the opportunity to create more words with multiple characters, resulting in compression better than zip, as shown in Table 7.
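The dictionary-capacity argument above is plain arithmetic and can be checked directly (word counts copied from Table 6; note that Data_set4.doc's total even exceeds the 22-bit capacity, so a larger dictionary only delays filling up rather than preventing it):

```python
# Dictionary capacity is 2**code_length entries.
cap_20 = 2 ** 20          # 1048576 entries at 20-bit code length
cap_22 = 2 ** 22          # 4194304 entries at 22-bit code length

data_set3_words = 101360 + 207853     # 309213 total words
data_set4_words = 3149275 + 1081770   # 4231045 total words

print(data_set3_words < cap_20)   # True: Data_set3.doc never fills the dictionary
print(data_set4_words > cap_20)   # True: Data_set4.doc fills a 20-bit dictionary
print(data_set4_words > cap_22)   # True: it still fills a 22-bit dictionary, only later
```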