As mentioned in prior sections, this paper introduces 3 enhancements on top of the LZMW data compression algorithm, and this section describes each of them in detail. The prior arts (LZMW and 2Q) are not described again here except for the parts needed to explain the enhancements; readers are assumed to have prior knowledge of the LZMW and 2Q algorithms.
3.1 Enhancement 1: Find the largest “word pair” instead of the largest word
As mentioned in the description of the prior art in the previous section, the LZMW algorithm, like other variants of LZ78, searches the dictionary for the largest word that matches the string read from the current position of the input-pointer in the input file. The enhancement proposed in this section, termed Enhancement_1 from now on, suggests not always selecting the largest word: in some situations, selecting a smaller matching word produces a better compression ratio. In our previous example that means, instead of writing the code of “jumps” to the output file, it can sometimes yield extra compression to select “j” or “jum” and write its code instead. The input-pointer is then set to the character right after the selected word in the input file; after selecting “jum” and writing its code to the output, the input-pointer is set to the “p” that follows “jum” for reading the next word.
Another example clarifies Enhancement_1. Assume a character set with only 4 characters A, B, C and D, whose character numbers are 0, 1, 2 and 3 respectively. When the dictionary is initialized, it contains the 4 single-character words A, B, C and D with initial codes 0, 1, 2 and 3, and each new word added to the dictionary receives codes from 4 onward until the upper limit of the dictionary is reached (the limit is not defined in this example; just assume one exists). Now consider compressing the following string.
CDBABACCDCBACDC
Setting the input-pointer at the beginning (“C”) and running the prior-art LZMW algorithm, which always matches the largest word in the dictionary, we get the steps and outcomes shown in “Table 2”.
Table 2
Illustration of LZMW Compression
Initial dictionary: A – 0, B – 1, C – 2, D – 3
Read Step | Word Read in the input string | Output Code | New Word | Code of New Word |
1 | C | 2 | - | - |
2 | D | 3 | CD | 4 |
3 | B | 1 | DB | 5 |
4 | A | 0 | BA | 6 |
5 | BA | 6 | ABA | 7 |
6 | C | 2 | BAC | 8 |
7 | CD | 4 | CCD | 9 |
8 | C | 2 | CDC | 10 |
9 | BAC | 8 | CBAC | 11 |
10 | D | 3 | BACD | 12 |
11 | C | 2 | DC | 13 |
Each row of “Table 2” shows the word read from the input file, the code written to the output data, and the new word formed in the dictionary (with a new code) by concatenating the last 2 words read from the input file. The 11 steps indicate that 11 codes are written to the output data during the compression of the above string. So by always selecting the largest matching word, we compress the input with 11 output codes. However, the compression can finish with 10 codes instead of 11 by making a change from step 9 onward, selecting a smaller matching word from the dictionary instead of the largest one.
Of the input file CDBABACCDCBACDC, the first part CDBABACCDC has been read by the end of step 8 and the part BACDC remains, i.e., after step 8 the input-pointer sits at the beginning of BACDC. The dictionary contains the words “B”, “BA” and “BAC”, and “Table 2” shows that “BAC” was selected at step 9 because it is the largest matching word from the position of the input-pointer. But if “BA” had been selected in step 9 instead, the input-pointer would have read the next word starting from the “C” after “BA”, and step 10 would have finished the input file by selecting “CDC”, which was already present in the dictionary with code = 10. The last 2 steps of “Table 2” would then look as shown in “Table 2a”, and the output would need 10 codes instead of 11.
Table 2a
Illustration with a Smaller Word in Step 9
9 | BA | 6 | CBA | 11 |
10 | CDC | 10 | BACDC | 12 |
The alternative approach defined above finishes the compression with fewer output codes, hence it provides a better compression ratio than the previous approach. The reason is that in step 9 of the alternative approach the selected word “BA” is 2 characters long and step 10 selects “CDC”, which is 3 characters long, so the total length of the 2 successive words is 2 + 3 = 5. In the previous approach, the word “BAC” at step 9 was 3 characters and “D” at step 10 was 1 character, for a total of 3 + 1 = 4, which is less than the alternative. The conclusion drawn from this illustration is that the compressor should find the largest “word pair” across 2 subsequent readings instead of the largest single word.
The only way to find the largest word pair is to iterate through all candidate words at a particular step “x” and, for each of them, find the possible next word at step “x + 1”. Step 9 of the above example had 3 options, “B”, “BA” and “BAC”, and we ended up selecting “BA” because the next word is then “CDC”, which makes the combined length of the words of steps 9 and 10 equal to 2 + 3 = 5, more than for any other choice at step 9. To determine which of “B”, “BA” and “BAC” produces the largest word pair together with step 10, we iterate through each of them and see which word the dictionary matches at step 10 for each selection. Selecting “B” at step 9 leaves “A” as the only matching word at step 10, for a combined length of 1 + 1 = 2; selecting “BAC” leads to “D” at step 10 as already seen, for a combined length of 3 + 1 = 4; which proves that selecting “BA” is the most beneficial.
Building on this example of iterating through the candidate words at step 9 and identifying the largest word pair across steps 9 and 10, Enhancement_1 proposes an efficient programmatic approach for finding the largest word pair; it is slower than the prior art but provides a remarkably better compression ratio. The decompression process is unchanged by this approach, so decompression performance is unaffected. The proposed algorithm creates 2 arrays named oldWordArray and newWordArray and keeps all the words matching at a particular step in oldWordArray. In our current example, at step 9, oldWordArray is populated as follows (using 0-based indexing).
oldWordArray[0] = “B”
oldWordArray[1] = “BA”
oldWordArray[2] = “BAC”
The proposed algorithm then loops over oldWordArray and, in each iteration, reads the next set of matching words from the input file into newWordArray. In our example, the 1st iteration picks up “B” from oldWordArray and populates newWordArray[0] = “A”; the 2nd iteration picks up “BA” and populates newWordArray[0] = “C”, newWordArray[1] = “CD” and newWordArray[2] = “CDC” (C, CD and CDC are all present in the dictionary); and the last iteration picks up “BAC” and populates newWordArray[0] = “D”.
Within the loop, the proposed algorithm computes the length of the word pair as the length of oldWordArray[i] plus the length of newWordArray[last_index], where “i” is the index of the current iteration over oldWordArray and “last_index” is the index of the last element of newWordArray. The index “i” and the corresponding newWordArray must be retained for the iteration in which the word-pair length is largest. “Diagram 1” shows oldWordArray and the newWordArray populated in each iteration of the loop, along with the word-pair length found in each iteration.
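The pair-selection loop described above can be sketched in Java (the language of the paper’s experiments). This is a minimal illustration only: the list-based dictionary and the `matchingWords` helper are simplified stand-ins for the real dictionary lookup.

```java
import java.util.ArrayList;
import java.util.List;

public class PairSelect {
    // Return every dictionary word matching the input at pos, shortest first
    // (so the last element is the largest match). Hypothetical stand-in for
    // the real dictionary lookup.
    static List<String> matchingWords(List<String> dict, String input, int pos) {
        List<String> out = new ArrayList<>();
        for (String w : dict)
            if (input.startsWith(w, pos)) out.add(w);
        out.sort((a, b) -> a.length() - b.length());
        return out;
    }

    // Pick the element of oldWordArray maximizing
    // len(oldWord) + len(largest word matching right after it).
    static String bestOldWord(List<String> dict, String input, int pos,
                              List<String> oldWordArray) {
        int largestPair = Integer.MIN_VALUE;
        String best = null;
        for (String cand : oldWordArray) {
            List<String> next = matchingWords(dict, input, pos + cand.length());
            int lastLen = next.isEmpty() ? 0 : next.get(next.size() - 1).length();
            if (cand.length() + lastLen > largestPair) {
                largestPair = cand.length() + lastLen;
                best = cand;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Step 9 of the example: the input-pointer sits at "BACDC".
        List<String> dict = List.of("A", "B", "C", "D", "CD", "DB", "BA",
                                    "ABA", "BAC", "CCD", "CDC");
        String input = "CDBABACCDCBACDC";
        int pos = 10; // start of "BACDC"
        System.out.println(bestOldWord(dict, input, pos,
                matchingWords(dict, input, pos))); // prints "BA": pair length 2 + 3 = 5
    }
}
```

On the example string, the sketch reproduces the choice derived by hand: “BA” wins over “B” (pair length 2) and “BAC” (pair length 4).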
After the loop ends, the proposed algorithm writes the code of oldWordArray[i] to the output data, where “i” is the index of oldWordArray for which the largest word pair was found. It then concatenates the previous word written to the output with oldWordArray[i], adds the new word created by the concatenation to the dictionary, copies newWordArray to oldWordArray (oldWordArray = newWordArray), and starts a new loop on the newly formed oldWordArray to obtain the next newWordArray.
Hence Enhancement_1 uses nested loops: an outer loop that works on a different oldWordArray in each iteration (each step, as shown in “Table 2” or “Table 3”), and an inner loop over the elements of oldWordArray that identifies the largest word pair with the newWordArray created in that iteration, as described above. The selected newWordArray of the current iteration then becomes the oldWordArray of the next iteration of the outer loop. The very first oldWordArray is populated with the first character read from the input file, its only element being oldWordArray[0] = <first character of input file>.
If we compress the input file CDBABACCDCBACDC of the above example using the concept of Enhancement_1 described so far, we go through the steps defined in “Table 3”.
Table 3
Illustration 1 with Enhancement_1
Initial dictionary: A – 0, B – 1, C – 2, D – 3
Read Step | oldWordArray | Selected Index in oldWordArray (0-based) | oldWord = Selected Element of oldWordArray | Output Code | Selected newWordArray | New Word :: Prev oldWord + Current oldWord | New Word Code |
1 | C | 0 | C | 2 | D | - | - |
2 | D | 0 | D | 3 | B | CD | 4 |
3 | B | 0 | B | 1 | A | DB | 5 |
4 | A | 0 | A | 0 | B | BA | 6 |
5 | B | 0 | B | 1 | A | AB | 7 |
6 | A | 0 | A | 0 | C | BA | (6-repeat) |
7 | C | 0 | C | 2 | C, CD | AC | 8 |
8 | C, CD | 1 | CD | 4 | C | CCD | 9 |
9 | C | 0 | C | 2 | B, BA | CDC | 10 |
10 | B, BA | 1 | BA | 6 | C, CD, CDC | CBA | 11 |
11 | C, CD, CDC | 2 | CDC | 10 | - | BACDC | 12 |
Two things stand out when we compare Table 3, containing the steps of Enhancement_1, with Table 2, containing the steps of LZMW. First, the compression still needs 11 steps under Enhancement_1; second, step 5 of Enhancement_1 reads “B” whereas the same step in LZMW (Table 2) reads “BA”. The word “BA” was created in step 4 by concatenating the words “B” and “A” selected in steps 3 and 4, and LZMW could read it in step 5 because it encountered BA in the input file starting from position 5. Enhancement_1 cannot read “BA” there because newWordArray is finalized before the concatenation of the previous step’s oldWord with the current step’s oldWord: the current oldWord cannot be finalized without finalizing newWordArray, since the largest word pair, i.e., “length of oldWord” + “length of the last element of newWordArray”, is the basis for finalizing both. This behavior indicates that Enhancement_1 still needs some refinement before it can serve as a practical improvement over the LZMW algorithm. To resolve this issue, the following modification is suggested for Enhancement_1.
Modification 1: Creating a temporary word:
- Introduce a global variable named “temporaryWord”, which keeps just 1 word at a time. In each step, concatenate the oldWord (the selected oldWordArray[i] as defined in Table 3) with the last element of newWordArray and store the result in “temporaryWord” without assigning any code to it.
- While populating newWordArray in the next step, do not only search for matching words in the dictionary; also check whether the temporaryWord created in the last step matches the input file. If it matches, add it to newWordArray.
- If the temporaryWord receives a new code in the next step, i.e., if temporaryWord = prev oldWord + current oldWord in the concatenation performed in the next step, then add it to the dictionary.
- Overwrite temporaryWord with the new temporary word created in each step.
- The rest of the process continues as before.
If we add this modification to our example, the steps to compress the input file change as shown in Table 4, and compression now completes in 10 steps.
Table 4
Illustration with Enhancement_1 and Modification 1
Initial dictionary: A – 0, B – 1, C – 2, D – 3
Read Step | oldWordArray | Selected Index in oldWordArray (0-based) | oldWord = Selected Element of oldWordArray | Output Code | Selected newWordArray (Includes Current Temporary Word) | New Word :: Prev oldWord + Current oldWord | New Word Code | temporaryWord (for the Next Step) :: oldWord + Last Element of newWordArray |
1 | C | 0 | C | 2 | D | - | - | CD |
2 | D | 0 | D | 3 | B | CD | 4 | DB |
3 | B | 0 | B | 1 | A | DB | 5 | BA |
4 | A | 0 | A | 0 | B, BA | BA | 6 | ABA |
5 | B, BA | 1 | BA | 6 | C | ABA | 7 | BAC |
6 | C | 0 | C | 2 | C, CD | BAC | 8 | CCD |
7 | C, CD | 1 | CD | 4 | C | CCD | 9 | CDC |
8 | C | 0 | C | 2 | B, BA, BAC | CDC | 10 | CBA |
9 | B, BA, BAC | 1 | BA | 6 | C, CD, CDC | CBA | 11 | BACDC |
10 | C, CD, CDC | 2 | CDC | 10 | - | BACDC | 12 | - |
Modification 2: Adjusting oldWordArray
As already observed, the newWordArray of each “Read Step” in Table 4 becomes the oldWordArray of the next step. But before it is used in the next step, a concatenation happens between the previous oldWord and the current oldWord, which not only adds a new word to the dictionary but may also remove the least recently used word from it. If the removed word is part of oldWordArray, or the new word is one that can be inserted into oldWordArray, then oldWordArray must be adjusted before it is used in the next step for iterating through it and finding the largest word pair. So, for Enhancement_1 to be practically implementable, the new word added to (and any word removed from) the dictionary during the concatenation of the previous oldWord and the current oldWord must be captured, and the newWordArray of the current step (i.e., the oldWordArray of the next step) must be adjusted with the result of the concatenation.
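As a sketch, the adjustment described in Modification 2 might look like the following. The helper name and the list-based representation are ours; the real implementation would work against the actual dictionary structure.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjustOldWordArray {
    // Sketch of Modification 2: after the dictionary concatenation, drop any
    // evicted word from oldWordArray and add the newly created word if it
    // also matches the input at the current position of the input-pointer.
    static List<String> adjust(List<String> oldWordArray, String newWord,
                               String removedWord, String input, int pos) {
        List<String> adjusted = new ArrayList<>(oldWordArray);
        adjusted.remove(removedWord); // no-op when the evicted word is not in the array
        if (newWord != null && input.startsWith(newWord, pos) && !adjusted.contains(newWord))
            adjusted.add(newWord);
        adjusted.sort((a, b) -> a.length() - b.length()); // keep the "largest word last" order
        return adjusted;
    }

    public static void main(String[] args) {
        // "CDC" was just added to the dictionary and also matches at the pointer.
        System.out.println(adjust(List.of("C", "CD"), "CDC", null, "CDCBA", 0));
        // prints [C, CD, CDC]
    }
}
```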
Algorithm
The complete compression algorithm of Enhancement_1 is provided in Algorithm 3. The decompression algorithm of Enhancement_1 is the same as that of LZMW, so Algorithm 2 remains a valid decompression algorithm for Enhancement_1.
Algorithm 3
Compression algorithm for Enhancement_1.
Begin
declare input_pointer = start a stream reader for the input data
declare output_pointer = output stream to write the code
initialize the dictionary with single-character words
declare oldWordArray, newWordArray and currentWordArray
oldWordArray[0] = read the first word (a single-character word at this point)
declare oldWord, olderWord, currentOldWord, temporaryWord
declare concatResult (newWord, removedWord)
while not end-of-file loop (outer loop)
if concatResult is not null then
Adjust oldWordArray with concatResult if there's scope to adjust
end if
declare largestWordPair = the smallest value possible
loop on oldWordArray
currentOldWord = current element of oldWordArray
set the input_pointer right after currentOldWord in the input data
currentWordArray = Read all words matching in the dictionary
(from the current position of the input_pointer)
if temporaryWord matches from the current position of the input_pointer then
add temporaryWord to currentWordArray
end if
wordPairLength = length of currentOldWord + length of currentWordArray[last_index]
if largestWordPair < wordPairLength then
largestWordPair = wordPairLength
newWordArray = currentWordArray
oldWord = currentOldWord
end if
end loop
mark oldWord as a "hit" in 2Q (or any other LRU / cache replacement algorithm)
if olderWord is not null then
/*Below function concatenates olderWord and oldWord*/
newWord = concatenate ()
if dictionary is full then
removedWord = remove least recently used word from dictionary using 2q
newCode = code of removedWord
else
newCode = next code in dictionary
end if
assign newCode to newWord
add newWord to the dictionary
include newWord and removedWord to concatResult
end if
/*Below function writes the code of oldWord to the output*/
writeCode (oldWord)
/*Below function concatenates oldWord and the last element of newWordArray*/
temporaryWord = concatenateToTempWord ()
olderWord = oldWord
oldWordArray = newWordArray
end loop
End
The comparison between the prior-art LZMW algorithm and Enhancement_1 is performed using separate source codes written in Java (say Source Code 1 for LZMW and Source Code 2 for Enhancement_1). The same parameters for the dictionary and the 2Q cache replacement algorithm are used in both source codes. The common parameters are as follows: the code length is 16 bits, hence the dictionary size is 2^16 = 65536, meaning the dictionary can hold 65536 words including the 256 single characters of an 8-bit character set. For 2Q, the size of the A1 queue is 65536/4, the Am queue holds the rest of the multi-character words of the dictionary, and the size of the A1-out queue is 4096. With these parameters, both algorithms (LZMW and Enhancement_1) are tested on 5 different data sets, and the test results are disclosed in Table 5.
Table 5
Data Compression Comparison
File Name | Size (Bytes) | Compressed File Size using Prior Art, LZMW (Bytes) | Compressed File Size using Enhancement_1 (Bytes) |
Data_set1.txt | 14,064,134 | 4,650,160 | 4,284,758 |
Data_set2.bmp | 30,108,726 | 20,535,076 | 17,860,094 |
Data_set3.doc | 2,750,976 | 673,110 | 651,544 |
Data_set4.doc | 15,605,248 | 7,511,854 | 7,351,172 |
Data_set5.tif | 72,481,956 | 50,568,720 | 49,261,850 |
3.2 Enhancement 2: A multi-character word will have a minimum of 3 characters
The LZMW algorithm initializes the dictionary with all single characters and then adds multi-character words by concatenating 2 subsequent words found in the input file; Enhancement_1 creates new words the same way. This section introduces another enhancement on top of Enhancement_1 with a small variation in the concatenation process, whose main objective is to never create a word of exactly 2 characters. That is, apart from the single characters, every word in the dictionary will contain 3 or more characters. This change allows the code length to grow beyond 16 bits; the examples in this paper mainly use a 20-bit code length to get a dictionary space of 2^20 = 1048576 words, which produces a better compression ratio. Increasing the code length to 20 bits while keeping 2-character words in the dictionary is not beneficial: a single character is 8 bits and 2 characters are 16 bits, so writing a 20-bit code for a 16-bit word hurts the compression ratio. The enhancement proposed in this section, termed Enhancement_2 from now on, therefore provides 2 concatenation algorithms to replace the current practice of simply concatenating the last 2 words.
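The cost argument can be made concrete with a little arithmetic. The sketch below (the helper name is ours) computes the net bits saved when an n-character word of an 8-bit character set is replaced by a single code:

```java
public class CodeGain {
    // Bits saved when an n-character (8-bit) word is replaced by one code of
    // codeLength bits; a negative result means the code expands the data.
    static int bitsSaved(int n, int codeLength) {
        return 8 * n - codeLength;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++)
            System.out.println(n + " chars: " + bitsSaved(n, 20) + " bits saved");
        // At a 20-bit code length, a 2-character word saves -4 bits (a net loss),
        // while a 3-character word saves +4 bits.
    }
}
```

Note that at the 16-bit code length used earlier, a 2-character word merely breaks even (8 × 2 − 16 = 0), which is why banning 2-character words only becomes important once the code length grows.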
Algorithm 3 of Enhancement_1 above contains 2 different concatenation lines serving different purposes: one concatenates olderWord with oldWord to create a new word in the dictionary, and the other concatenates oldWord with the last word of newWordArray to create a temporary word. Algorithm 2, which describes the decompression of both LZMW and Enhancement_1, contains a similar line that concatenates 2 words to create a new dictionary word. Enhancement_2 changes the concatenation process in all these places, because if the 2 concatenated words have a single character each, the concatenation creates a 2-character word, which is exactly what this enhancement avoids. Hence, to create words with a minimum of 3 characters, this enhancement declares 2 variables named buffer1 and buffer2, capable of holding the previous 2 words read from the input file. The intention is that if each of the previous 2 words contains 1 character, they are kept in the buffers instead of being concatenated into a new word, so that all 3 words can be concatenated later. The highlights of the entire process are provided below.
Initially the dictionary is initialized with all 256 characters of the 8-bit character set, and then reading of the input file starts. From the first step onward, the concatenation process that creates a new word in the dictionary follows the rules defined below.
- If buffer1 is empty (as in the first step), add the current word to buffer1.
- If buffer1 has a single character, buffer2 is empty and the current word has a single character, then add the current word to buffer2 and do not concatenate anything in this step (i.e., do not create a new word).
- If buffer1 has multiple characters, buffer2 is empty and the current word has a single character, then concatenate buffer1 + current word and overwrite buffer1 with the current word for the next step.
- If buffer1, buffer2 and the current word all contain single-character words, then concatenate buffer1 + buffer2 + current word, then overwrite buffer1 with buffer2 and buffer2 with the current word for use in the next step.
- If the current word contains more than one character, check whether buffer2 is empty. If it is not, concatenate buffer1 + buffer2 + current word; otherwise concatenate buffer1 + current word.
Algorithm 4, submitted with this paper, defines the above process under Enhancement_2; it must be applied in both the compression and the decompression process.
Algorithm 4
Enhancement_2 - Concatenation during compression and decompression.
------------------------------------------------------
function concatenate()
------------------------------------------------------
if validNode.level = 1 and buffer1 = null then
buffer1 = validNode
else if validNode.level = 1 and buffer1.level = 1 and buffer2 = null then
buffer2 = validNode
else if validNode.level = 1 and buffer2 = null /*and buffer1.level > 1*/ then
concat buffer1 + validNode
buffer1 = validNode
else if validNode.level = 1 /*and buffer1 and buffer2 with level = 1 for both*/ then
concat buffer1 + buffer2 + validNode
buffer1 = buffer2
buffer2 = validNode
else /*only remaining option is, "if validNode.level > 1"*/
if buffer2 = null then
concat buffer1 + validNode
else
concat buffer1 + buffer2 + validNode
end if
buffer1 = validNode
buffer2 = null
end if
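Algorithm 4 can be turned into a runnable Java sketch under the assumption that `validNode.level` is simply the word's character count; here the word is passed as a `String` and a `newWord` field stands in for the `concat` result:

```java
public class MinThreeConcat {
    static String buffer1 = null, buffer2 = null;
    static String newWord = null; // word created by the latest call, if any

    // Concatenation rules of Enhancement_2: never create a 2-character word.
    // (Assumes the first word read is always a single character, as in the
    // paper, since the dictionary starts with single characters only.)
    static void concatenate(String word) {
        newWord = null;
        if (word.length() == 1 && buffer1 == null) {
            buffer1 = word;
        } else if (word.length() == 1 && buffer1.length() == 1 && buffer2 == null) {
            buffer2 = word;
        } else if (word.length() == 1 && buffer2 == null) { // buffer1 is multi-character
            newWord = buffer1 + word;
            buffer1 = word;
        } else if (word.length() == 1) { // buffer1, buffer2, word: all single characters
            newWord = buffer1 + buffer2 + word;
            buffer1 = buffer2;
            buffer2 = word;
        } else { // current word has multiple characters
            newWord = (buffer2 == null) ? buffer1 + word : buffer1 + buffer2 + word;
            buffer1 = word;
            buffer2 = null;
        }
    }

    public static void main(String[] args) {
        for (String w : new String[] {"C", "D", "B", "ABA"}) {
            concatenate(w);
            System.out.println(w + " -> new word: " + newWord);
        }
    }
}
```

Running the sketch on the word sequence C, D, B, ABA produces no new word for the first two reads, then "CDB" and "DBABA": every word created has at least 3 characters, as intended.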
The second part of Enhancement_2 alters the concatenation process that creates the temporaryWord defined in Enhancement_1. The temporary word is created during the compression process of Enhancement_1 by concatenating oldWord and the last element of newWordArray in each step. Since Enhancement_2 requires every created word to have a minimum of 3 characters, the temporary word must also have a minimum of 3 characters, because depending on the largest word pair, the temporary word may be selected as a word to be inserted into the dictionary in the next step. Hence the temporary word must be built such that if the next step finds “last element of oldWordArray + last element of newWordArray” to be the largest word pair, then the concatenation process that creates the new dictionary word in the next step (defined in Algorithm 4) recreates exactly the temporary word created in this step. The algorithm to create the temporary word by concatenation is included in Algorithm 5.
Algorithm 5
Concatenation to create temporary word with minimum 3 characters.
---------------------------------------------------------------
function concatenateToTempWord()
---------------------------------------------------------------
declare bufferConTempWord (global; its value is retained across steps)
declare oldWord = selected element of oldWordArray
declare newWord = last element of newWordArray
if bufferConTempWord is null then
if length of oldWord = 1 and length of newWord = 1 then
bufferConTempWord = oldWord
else
temporaryWord = concat oldWord + newWord
end if
else
if length of oldWord = 1 then
temporaryWord = concat bufferConTempWord + oldWord + newWord
bufferConTempWord = oldWord
else
temporaryWord = concat oldWord + newWord
bufferConTempWord = null
end if
end if
Algorithm 4 defines a function named concatenate() and Algorithm 5 defines concatenateToTempWord(); these must replace the references to concatenate() and concatenateToTempWord() in the compression algorithm (Algorithm 3) and the decompression algorithm (Algorithm 2) of Enhancement_1 in order to combine Enhancement_1 and Enhancement_2.
The source code containing Enhancement_1 and Enhancement_2 does not show much improvement on the data set files submitted in this paper unless Enhancement_3, described in the next section, is added as well. Hence no comparison table is included at the end of this section; the final comparison table is shown at the end of this paper, with compression ratios obtained by a program that includes all 3 enhancements.
3.3 Enhancement 3: Process to write in the output file
There are 2 ways to write a code in the output file during the compression process. When the compression program reads a word from the input file, it can encounter a single character (8 bits in size) if no other match is found in the dictionary from the current position of the input-pointer, or it can encounter a multi-character word for which a code is assigned in the dictionary. Enhancement_2 in the previous section suggests a 20-bit code length to get a dictionary space of 2^20 words. When writing the codes of words or single characters to the output file, we cannot simply write 8 bits for single characters and 20 bits for multi-character words, because the decompressor would never know whether to read 8 bits or 20 bits from the compressed output to get the next word.
Hence there are 2 ways to write the code to the output file such that the decompressor can unambiguously identify the codes of single-character and multi-character words. One way (say approach_1) is to write 20-bit codes for both single characters and multi-character words, so that the decompressor always reads 20 bits from the compressed file per word. For instance, to write the code of the character ‘A’, whose ASCII value is 65, i.e., “01000001” in 8-bit binary, the compressor prefixes these 8 bits with 12 extra zero bits to make 20 bits; the binary output code for A then becomes “00000000000001000001”.
The other approach (approach_2) writes the 8-bit character code for a single character or the 20-bit dictionary code for a multi-character word with a 1-bit prefix before each of them. The compression program writes ‘0’ before an 8-bit character code and ‘1’ before a 20-bit dictionary code, so the decompressor knows by reading the 0 or 1 whether the code ahead belongs to a single-character or a multi-character word, and reads 8 or 20 bits accordingly.
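The two bit layouts can be illustrated as bit strings; this is a minimal sketch (the helper names are ours) rather than the actual bit-level writer:

```java
public class PrefixCodes {
    // approach_1: every code is zero-padded to a fixed codeLength bits.
    static String approach1(int code, int codeLength) {
        String bits = Integer.toBinaryString(code);
        return "0".repeat(codeLength - bits.length()) + bits;
    }

    // approach_2: a 1-bit flag, then 8 bits for a single character
    // or codeLength bits for a multi-character word.
    static String approach2(int code, boolean singleChar, int codeLength) {
        int width = singleChar ? 8 : codeLength;
        String bits = Integer.toBinaryString(code);
        return (singleChar ? "0" : "1") + "0".repeat(width - bits.length()) + bits;
    }

    public static void main(String[] args) {
        System.out.println(approach1(65, 20));       // 'A' padded to 20 bits
        System.out.println(approach2(65, true, 20)); // flag 0 + 8-bit 'A' = 9 bits total
    }
}
```

For ‘A’ (ASCII 65), approach_1 emits the 20-bit string “00000000000001000001” from the text above, while approach_2 emits only 9 bits, “001000001”.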
The enhancement introduced in this section, termed Enhancement_3 from now on, proposes a proper way to calculate and identify whether approach_1 or approach_2 brings the better compression ratio; the calculation is performed at each step of the compression and decompression process. The choice between approach_1 and approach_2 depends on the number of multi-character words versus the number of single-character words found in the input file.
If the symbol M denotes the number of multi-character words present in an input file and S the number of single-character words present in the same file, then the extra bits (any bits added to the compressed output beyond the 8 bits per single character and 20 bits per multi-character word code) needed by each approach are described below.
For code length of 20 bits, Approach_1 needs 12 extra bits to be prefixed before each 8 bits characters in the file to make them 20 bits, so extra bits needed in approach_1 = 12 x S.
Approach_2 needs 1 extra bit prefixed before each word with single or multiple characters, so extra bits needed in approach_2 = (M + S) x 1 = M + S.
If both approach_1 and approach_2 yield the same compression ratio for an input file, the ratio M/S can be found in the following way.
M + S = 12 * S
or (M + S)/S = 12
or M/S + 1 = 12
or M/S = 11
That means for an input file with M/S = 11 (assuming a 20-bit code length for words), the compression ratios obtained from approach_1 and approach_2 are exactly the same. In the same way it can be shown that for a file with M/S < 11, approach_2 brings the better compression ratio, and for a file with M/S > 11, approach_1 does.
Inequality 1: M/S < 11 or M/S + 1 < 12 or (M + S)/S < 12 or (M + S) < 12 * S → approach_2 needs fewer extra bits.
Inequality 2: M/S > 11 or M/S + 1 > 12 or (M + S)/S > 12 or (M + S) > 12 * S → approach_1 needs fewer extra bits.
Table 6 lists the same 5 data set files shown in Table 5, this time with the counts of multi-character words (M) and single characters (S) collected for each file by running a compression program that includes both Enhancement_1 and Enhancement_2 with a code length of 20 bits. It has been observed across multiple files that the ratio M/S varies and is not consistently greater than or less than 11: in Table 6, M/S for Data_set1.txt is greater than 11, but it is less than 11 for the other 4 files.
Table 6
Count of Words with Multiple and Single Characters
File Name | Number of Single Characters Read and the Code Written to Output (Symbol → S) | Number of Multiple Characters Words Read and the Code Written to Output (Symbol → M) | M/S |
Data_set1.txt | 90925 | 1488670 | 16.3725 |
Data_set2.bmp | 1639122 | 5687146 | 3.4696 |
Data_set3.doc | 101360 | 207853 | 2.0506 |
Data_set4.doc | 3149275 | 1081770 | 0.3435 |
Data_set5.tif | 4333361 | 14574093 | 3.3632 |
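Applying the two extra-bit formulas to the counts in Table 6 confirms the inequalities above; a quick sketch (function names are ours):

```java
public class ExtraBits {
    // Extra bits beyond the raw 8-bit / codeLength-bit payloads, per approach.
    static long extraApproach1(long s, int codeLength) {
        return (long) (codeLength - 8) * s; // zero-padding on every single character
    }

    static long extraApproach2(long m, long s) {
        return m + s; // one flag bit per word, single- or multi-character
    }

    static String better(long m, long s, int codeLength) {
        return extraApproach1(s, codeLength) < extraApproach2(m, s)
                ? "approach_1" : "approach_2";
    }

    public static void main(String[] args) {
        // Data_set1.txt: M/S = 16.37 > 11, so approach_1 should win.
        System.out.println(better(1488670, 90925, 20));   // prints approach_1
        // Data_set4.doc: M/S = 0.34 < 11, so approach_2 should win.
        System.out.println(better(1081770, 3149275, 20)); // prints approach_2
    }
}
```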
Hence Enhancement_3 proposes a generic algorithm that calculates and identifies whether approach_1 or approach_2 is to be used each time a code of a character or word is written to the output file. The calculation is based on the number of single-character and multi-character words read so far in the input file. The same algorithm must be used during decompression to unambiguously identify the approach used during compression.
The ratio M/S for which approach_1 and approach_2 provide the same compression ratio was identified above as 11 for a 20-bit code length. The general formula for a different code length is M/S = code_length − 8 − 1, where 8 is the length of a character in bits. This number (11 for a 20-bit code length) is kept in a variable named “ratioline” to be used in the algorithm.
So the basic purpose of the algorithm included in Algorithm 6 is to compute M/S from the counts of the words read so far from the input file and compare it with the ratioline. The algorithm starts by initializing 2 variables, multiChars and singleChars, to 1, and the respective variable is incremented whenever a word with multiple characters or a single character is encountered in the input file. At each step multiChars/singleChars is compared with the ratioline (11); if the value is greater than the ratioline, the next code is written to the output file following approach_1, otherwise it is written following approach_2. The decompression algorithm, also provided in Algorithm 6, mirrors the compression algorithm and likewise starts by initializing multiChars and singleChars to 1.
Algorithm 6
Proposed compression and decompression algorithm under Enhancement_3.
----------------------------------------------
--Algorithm During Compression ---------------
----------------------------------------------
Function writeCode (word)
----------------------------------------------
code = read the code of the word from dictionary
declare code_length = length of the code assigned to a word in the dictionary
declare ratioline = code_length − 8 − 1
declare multiChars = 1, singleChars = 1 //persist across calls; initialized once
if multiChars / singleChars > ratioline then //approach_1
    if word has single character then
        prefix (code_length − 8) zeros to the code to make its size code_length
    end if
    write the code with size code_length bits to the output file
else //approach_2
    declare oneBit
    if word has single character then
        oneBit = 0
    else
        oneBit = 1
    end if
    write oneBit to the output file
    write the 8-bit code for a single character or the
    code_length-bit code for a multiple-character word to the output file
end if
if word has single character then
    singleChars = singleChars + 1
else
    multiChars = multiChars + 1
end if
----------------------------------------------
--End of Function-----------------------------
----------------------------------------------
--Algorithm During Decompression ---------------
----------------------------------------------
Function readWord ()
----------------------------------------------
declare code_length = length of the code assigned to a word in the dictionary
declare ratioline = code_length − 8 − 1
declare multiChars = 1, singleChars = 1 //persist across calls; initialized once
declare wordToReturn
if multiChars / singleChars > ratioline then //approach_1
    read code_length bits from the compressed input
    wordToReturn = the related word from the dictionary
else //approach_2
    oneBit = read 1 bit from the compressed input
    if oneBit = 0 then
        read 8 bits from the compressed input
        wordToReturn = the related single-character word from the dictionary
    else
        read code_length bits from the compressed input
        wordToReturn = the related multiple-character word from the dictionary
    end if
end if
if wordToReturn is a single-character word then
    singleChars = singleChars + 1
else
    multiChars = multiChars + 1
end if
return wordToReturn
-----------------------
--End of Function------
-----------------------
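The adaptive switching in Algorithm 6 can be sketched in Python. This is a minimal illustration, not the paper's implementation: the class name AdaptiveCoder and the bit-list representation are invented for the example, and it assumes single-character words keep dictionary codes below 256 (as in the initialized dictionary):

```python
CODE_LENGTH = 20                     # bits per dictionary code
RATIOLINE = CODE_LENGTH - 8 - 1      # break-even M/S ratio (11 for 20 bits)

class AdaptiveCoder:
    """Tracks word counts and picks approach_1 or approach_2 per code."""

    def __init__(self):
        # Counters start at 1 and persist across calls, as in Algorithm 6.
        self.multi_chars = 1
        self.single_chars = 1

    def write_code(self, bits, code, is_single_char):
        """Append the bits for one dictionary code to `bits` (list of 0/1)."""
        if self.multi_chars / self.single_chars > RATIOLINE:   # approach_1
            # Every code is written with the full CODE_LENGTH bits;
            # single-character codes are implicitly zero-padded on the left.
            bits.extend(int(b) for b in format(code, f'0{CODE_LENGTH}b'))
        else:                                                  # approach_2
            if is_single_char:
                bits.append(0)                                 # flag bit
                bits.extend(int(b) for b in format(code, '08b'))  # code < 256 assumed
            else:
                bits.append(1)                                 # flag bit
                bits.extend(int(b) for b in format(code, f'0{CODE_LENGTH}b'))
        if is_single_char:
            self.single_chars += 1
        else:
            self.multi_chars += 1

    def read_code(self, bits, pos):
        """Mirror of write_code; returns (code, is_single_char, new_pos)."""
        if self.multi_chars / self.single_chars > RATIOLINE:   # approach_1
            code = int(''.join(map(str, bits[pos:pos + CODE_LENGTH])), 2)
            pos += CODE_LENGTH
            is_single = code < 256    # assumption: single chars keep codes 0-255
        else:                                                  # approach_2
            flag, pos = bits[pos], pos + 1
            width = 8 if flag == 0 else CODE_LENGTH
            code = int(''.join(map(str, bits[pos:pos + width])), 2)
            pos += width
            is_single = (flag == 0)
        if is_single:
            self.single_chars += 1
        else:
            self.multi_chars += 1
        return code, is_single, pos
```

Because the writer and the reader update their counters identically on every word, the decompressor always reconstructs the same approach_1/approach_2 decisions without any extra signaling.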
Algorithm 6 defines the function writeCode() for compression and readWord() for decompression, which can replace the same functions defined in Algorithm 3 and Algorithm 2 of Enhancement_1 to combine Enhancement_3 with Enhancement_1. The process of combining Enhancement_1 and Enhancement_2 was already described in the previous section. Table 7 compares the compression obtained by the program combining all 3 enhancements (Enhancement_1 + Enhancement_2 + Enhancement_3) with a code length of 20 bits (and 22 bits for 2 of the files) against LZMW and Enhancement_1.
Table 7
Data Compression Comparison
File Name | Size (Bytes) | Compressed File Size using Prior Art (Bytes) | Compressed File Size using Enhancement_1 (Bytes) | Size of Zip File using “Ultra” Option (Bytes) | Compressed File Size using Enhancement_1 + Enhancement_2 + Enhancement_3 with 20-Bit Code Length (Bytes) | Compressed File Size using Enhancement_1 + Enhancement_2 + Enhancement_3 with 22-Bit Code Length (Bytes) |
Data_set1.txt | 14,064,134 | 4,650,160 | 4,284,758 | 5,013,497 | 3,949,018 | |
Data_set2.bmp | 30,108,726 | 20,535,076 | 17,860,094 | 17,502,428 | 16,772,126 | |
Data_set3.doc | 2,750,976 | 673,110 | 651,544 | 436,801 | 659,664 | 711,629 |
Data_set4.doc | 15,605,248 | 7,511,854 | 7,351,172 | 6,335,961 | 6,386,205 | 5,586,613 |
Data_set5.tif | 72,481,956 | 50,568,720 | 49,261,850 | 46,329,478 | 43,128,118 | |
With the combination of the 3 enhancements and a code length of 20 bits, all the data set files shown in Table 7 compress better than the zip files created out of them, except Data_set3.doc and Data_set4.doc. Table 6 shows that Data_set3.doc has 101360 + 207853 = 309213 words, while the dictionary can hold 2^20 = 1048576 words in total, which means Data_set3.doc gets fully compressed before the dictionary is full (assuming at most one new word gets created in each step by concatenation); that is why the purpose of Enhancement_2, keeping as many words as possible in the dictionary, has no effect on Data_set3.doc.
Data_set4.doc is much bigger in size, and Table 6 shows it has 3149275 + 1081770 = 4231045 total words, which is much higher than the dictionary capacity of 1048576 words, so we can assume that the dictionary was full during compression. Still, the Table 6 statistics show that the number of single-character words exceeds the number of multiple-character words; hence providing more dictionary space by increasing the code length to 22 bits (2^22 = 4194304) gives it the opportunity to create more words with multiple characters, resulting in compression better than zip, as shown in Table 7.
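The dictionary-capacity argument above is plain arithmetic and can be checked directly (word counts copied from Table 6; note that Data_set4.doc's total even exceeds the 22-bit capacity, so a larger dictionary only delays filling up rather than preventing it):

```python
# Dictionary capacity is 2**code_length entries.
cap_20 = 2 ** 20          # 1048576 entries at 20-bit code length
cap_22 = 2 ** 22          # 4194304 entries at 22-bit code length

data_set3_words = 101360 + 207853     # 309213 total words
data_set4_words = 3149275 + 1081770   # 4231045 total words

print(data_set3_words < cap_20)   # True: Data_set3.doc never fills the dictionary
print(data_set4_words > cap_20)   # True: Data_set4.doc fills a 20-bit dictionary
print(data_set4_words > cap_22)   # True: it still fills a 22-bit dictionary, only later
```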