Current location - Health Preservation Learning Network - Healthy weight loss - A brief history of data compression technology
A brief history of data compression technology
The data compression in the computer is actually similar to the slimming exercise of girls, but it has two major functions. First, it can save space. Take slim girls for example. If eight girls can squeeze into a taxi, how much money will be saved! Second, it can reduce the occupation of bandwidth. For example, we all want to watch DVD blockbusters below 100Kbps on the GPRS network, which is like slimming girls always want to cut out seven halter tops with a foot of cloth. The former needs a breakthrough in data compression technology, while the latter depends on their perseverance and perseverance.

Simply put, if there is no data compression technology, we can't use WinRAR to slim down the attachments in the mail; If there is no data compression technology, the digital recording pen on the market can only record less than 20 minutes of voice; Without data compression technology, it might take half a year to download a movie from the internet ... but how did all this happen? How did data compression technology develop from scratch? More than a thousand years ago, China scholars knew to use the abbreviation "Banma" to refer to Ban Gu and Sima Qian. This custom of advocating simplicity has continued into today's Internet age: when we use "7456" to represent "piss me off" or "B4" to represent "before" on BBS, we should at least know that this is actually the simplest data compression.

Strictly speaking, data compression originated from people's understanding of probability. When we encode text information, if we give shorter encoding to letters with higher probability of occurrence and longer encoding to letters with lower probability of occurrence, the total encoding length can be greatly shortened. Long before the computer appeared, the famous Morse code had successfully practiced this criterion. In Morse code table, each letter corresponds to a unique combination of dots and dashes. The letter E with the highest occurrence probability is encoded as a dot ".",while the letter Z with a lower occurrence probability is encoded as "-..". Obviously, this can effectively shorten the final code length.

Shannon, the father of information theory, first expounded the relationship between probability and information redundancy in mathematical language. Shannon pointed out in the paper "Mathematical Theory of Communication" published in 1948 that any information is redundant, and redundancy is related to the probability or uncertainty of each symbol (number, letter or word) in the information. Shannon used the concept of thermodynamics for reference, called the average information excluding redundancy "information entropy", and gave a mathematical expression to calculate the information entropy. This great paper was later hailed as the pioneering work of information theory, and information entropy also laid the theoretical foundation for all data compression algorithms. In essence, the purpose of data compression is to eliminate redundancy in information. Information entropy and related theorems accurately describe the degree of information redundancy by mathematical means. Using the information entropy formula, people can calculate the limit of information coding, that is, under a certain probability model, the coding length of lossless compression can not be less than the result given by the information entropy formula.

With a complete theory, the next thing is to find a way to realize the specific algorithm and try to make the output of the algorithm close to the limit of information entropy. Of course, most engineers and technicians know that it is not easy to develop a theory from a mathematical formula into a practical technology, just like making a nuclear weapon with only one formula E = MC 2. The process of designing a specific compression algorithm is usually more like a mathematical game. Developers should first find a way to count or estimate the probability of symbols appearing in information as accurately as possible, and then design a set of coding rules to describe each symbol with the shortest code. Statistical knowledge is quite effective for previous work. So far, people have successively realized the static model, semi-static model, adaptive model, Markov model, partial matching prediction model and other probability statistical models. Relatively speaking, the development of coding mode is more tortuous.

In 1948, Shannon proposed the information entropy theory and a simple coding method-Shannon coding. In 1952, R. M. Fano further proposed Fano coding. These early coding methods reveal the basic laws of variable-length coding, which can achieve certain compression effect, but it is still far from the real practical compression algorithm.

The first practical coding method was put forward by D. A. Huffman in his paper 1952. Until today, when discussing binary tree in many textbooks of data structure, this method called huffman encoding will still be mentioned. Huffman encoding is so famous in the computer field that even the invention process of coding itself has become a hot topic. It is said that 1952, the young Hoffman was still a student of MIT. He designed this seemingly simple but far-reaching coding method in order to prove to the teacher that he could not take the final exam of a course.

Huffman encoding has high efficiency, fast operation speed and flexible implementation. Since 1960s, it has been widely used in the field of data compression. For example, COMPACT, a compression program unknown to modern people in the early UNIX system, is actually the concrete implementation of Huffman 0-order adaptive coding. In the early 1980s, huffman encoding appeared in CP/M and DOS systems, and its representative program was called SQ. Today, huffman encoding appears in many famous compression tools and algorithms (such as WinRAR, gzip and JPEG). However, the coding length obtained by huffman encoding is only an approximation of the calculation result of information entropy, and it cannot really approach the limit of information entropy. Because of this, modern compression technology usually only takes Huffman as the final coding method, not the whole data compression algorithm.

Scientists have never given up the ideal of challenging the limit of information entropy. 1968, P. Elias developed Shannon and Fano coding methods, and constructed a more perfect Shannon-Fano-Elias coding from a mathematical point of view. Following the idea of this coding method, in 1976, J. Rissanen proposed a coding method-arithmetic coding, which can successfully approach the limit of information entropy. 1982, Rissanen and G. G. Langdon improved arithmetic coding together. Then, people combine arithmetic coding with the partial matching prediction model (PPM) proposed by J. G. Cleary and I. H. Witten in 1984, and develop an algorithm with nearly perfect compression effect. Today, those universal compression algorithms named after PPMC, PPMD or PPMZ, which are known as the world's first compression effect, are actually the concrete realization of this idea.

For lossless compression, the combination of PPM model and arithmetic coding can approach the limit of information entropy to the greatest extent. It seems that the development of compression technology can stop here. Unfortunately, things are often not as simple as imagined: although arithmetic coding can obtain the shortest coding length, its complexity also makes any concrete implementation of arithmetic coding slow as a snail. Even today, when Moore's Law prevails and the speed of CPU is changing with each passing day, the running speed of arithmetic coding programs can hardly meet the needs of daily applications. No way, if it weren't for the two Jews mentioned later, we wouldn't know when we could use such a convenient and practical compression tool as WinZIP. Reverse thinking is always a magic weapon to win by surprise in the field of science and technology. Just when most people rack their brains to improve Huffman or arithmetic coding, in order to obtain a coding with perfect running speed and compression effect, J Ziff and A Lempel, two smart Jews, broke away from the design ideas of Huffman and arithmetic coding and created a series of compression algorithms that are more effective than huffman encoding and faster than arithmetic coding. We usually use the abbreviations of these two Jewish surnames to collectively refer to these algorithms as LZ series algorithms.

In chronological order, the development of LZ series algorithms is roughly as follows: Ziff and Lempel published a paper entitled "General Algorithm for Sequential Data Compression" in 1977, and the algorithm described in the paper was called LZ77 algorithm by later generations. In 1978, they published the sequel of this paper, "Compressing a single sequence through variable rate coding", and described a compression algorithm named LZ78. 1984, T. A. Welch published a paper entitled "High Performance Data Compression Technology", describing his research results in Sperry Research Center (later incorporated into Unisys). This is a variant of LZ78 algorithm, which was later called LZW algorithm. After 1990, T. C. Bell and others put forward many variants or improved versions of LZ series algorithms.

To tell the truth, the idea of LZ series algorithm is not novel, which has neither profound theoretical background nor complicated mathematical formula. They just continue people's admiration and love for dictionaries for thousands of years, and apply dictionary technology to the field of general data compression in an extremely ingenious way. Generally speaking, when you replace every word in the article with the page number and line number in the dictionary, you have actually mastered the true meaning of LZ series algorithms. Although this idea based on dictionary model is very different from the statistical method initiated by Shannon and Hoffman on the surface, it can actually approach the limit of information entropy. Moreover, it can be proved theoretically that LZ series algorithms still conform to the basic law of information entropy in essence.

The advantages of LZ series algorithms are quickly reflected in the field of data compression, and the number of tools and software using LZ series algorithms has exploded. The compression program using LZW algorithm first appeared on UNIX system, and soon became the compression standard of UNIX world. Secondly, it is the ARC program in MS-DOS environment, and the imitations such as PKWare and PKARC. In 1980s, the famous compression tools LHarc and ARJ were outstanding representatives of LZ77 algorithm.

Today, LZ77, LZ78, LZW algorithms and their variants almost monopolize the whole field of general data compression. We are familiar with the compression tools such as PKZIP, WinZIP, WinRAR, gzip, etc. The file formats are ZIP, GIF, PNG, etc. These are all benefited from LZ series algorithms. Even encrypted file formats such as PGP have chosen LZ series algorithm as their data compression standard.

No one can deny the contribution of two Jews to data compression technology. I just want to emphasize that in the field of engineering technology, one-sided pursuit of theoretical perfection often gets twice the result with half the effort. If we can always think from another angle like Ziff and Lempel, maybe you and I can invent a new algorithm and make a name for ourselves in the history of the technology exhibition. LZ series algorithms basically solve the problem of giving consideration to both speed and compression effect in general data compression. However, in the field of data compression, there is another broader world waiting for us to explore. Shannon's information theory tells us that the more transcendental knowledge of information, the smaller we can compress. In other words, if the design target of the compression algorithm is not any data source, but special data with known basic attributes, the compression effect will be further improved. This reminds us that in addition to developing general compression algorithms, we must also seriously study special compression algorithms for various special data. For example, in today's digital life, the image, audio and video information distributed in various digital devices such as digital cameras, digital tape recorders, digital walkman and digital video cameras must be effectively compressed before they can be stored on hard disks or transmitted through USB cables. In fact, the compression of multimedia information has always been an important topic in the field of data compression, and each branch of it may dominate a certain technical trend in the future, bringing infinite business opportunities to developers of digital products, communication equipment and application software.

Let's talk about the compression of image data. Generally speaking, images can be divided into binary images, gray images, color images and other different types. The compression method of each type of image is also different.

The invention and wide use of fax technology promoted the rapid development of binary image compression algorithm. CCITT (International Telegraph and Telephone Advisory Committee, a subsidiary of ITU) has established a series of image compression standards for fax applications, which are specially used to compress and transmit binary images. These standards generally include CCITT Group 1 and CCITT Group 3, 1980 of Group 2 in the late 1970s, and 1984 of CCITT Group 4. In order to adapt to different types of fax images, the coding methods used in these standards include one-dimensional MH coding and two-dimensional MR coding, in which RLE and huffman encoding are used. Today, when we send and receive faxes in the office or at home, we mostly use CCITT Group 3 compression standard, while some fax devices based on digital networks and TIFF files that store binary images use CCITT Group 4 compression standard. In 1993, Joint Binary Image Expert Group (JBIG) established by CCITT and ISO (International Organization for Standardization) further developed binary image compression into a more general JBIG standard.

In fact, for binary images and discontinuous gray and color images, many general compression algorithms, including LZ series algorithms, can achieve good compression results. For example, the GIF image file format born in 1987 uses LZW compression algorithm, while the PNG format appearing in 1995 is more perfect than GIF format, and it chooses zlib, a variant of LZ77 algorithm, to compress image data. In addition, people have actually constructed many effective image compression algorithms by using the above-mentioned huffman encoding, arithmetic coding and PPM model.

However, for more common gray-scale or color images (such as digital photos), the pixel value changes continuously in space, so the advantages of general compression algorithms are not so obvious. Fortunately, scientists have found that if some unimportant pixel values are allowed to be changed or some precision is allowed to be lost when compressing this kind of image data (we will never tolerate any loss of precision when compressing general data, but when compressing and displaying a digital photo, if some leaves in a forest are slightly dark, people who look at the photo will usually not notice it), we may make a breakthrough in the compression effect. This idea has a revolutionary position in the field of data compression: by losing some precision within the tolerance of users, we can compress images (including audio and video) to one tenth, one hundredth or even one thousandth of the original size, which far exceeds the capacity limit of general compression algorithms. Perhaps, this is similar to the truth that "step back and broaden the horizon" is often said in life.

This compression that allows loss of accuracy is also called lossy compression. In the field of image compression, the famous JPEG standard is a classic in lossy compression algorithm. JPEG standard was formulated by Joint Photographic Experts Group (JPEG) in 1986, and became an international standard after 1994. JPEG takes discrete cosine transform as the core algorithm, and controls the accuracy and size of the image by adjusting the quality coefficient. For continuously changing gray or color images such as photos, JPEG can generally compress the image to one tenth to one twentieth of its original size on the premise of ensuring the image quality. If the image quality is not considered, JPEG can even compress the image to "infinitely small".

The latest development of JPEG standard is JPEG 2000, which was formulated by 1996, and 200 1 officially became an international standard. Compared with JPEG, JPEG 2000 has been greatly improved, among which the most important thing is to replace the discrete cosine transform in JPEG standard with discrete wavelet transform (DWT). Under the same file size, JPEG 2000 compressed images have higher quality and less precision loss than JPEG. As a new standard, JPEG 2000 has not been widely used at present, but many enterprises, including digital camera manufacturers, are optimistic about its application prospect, and the day when JPEG 2000 will fully play its role in the field of image compression should not be particularly far away.

The design idea of losing precision for compression effect in JPEG standard directly affects the compression technology of video data. CCITT has formulated the draft proposal of H.26 1988 for videophone and conference TV. The basic idea of H.26 1 is to use an algorithm similar to JPEG standard to compress each frame image in the video stream, and at the same time, to use motion compensated inter-frame prediction to eliminate redundant information in the time dimension of the video stream. On this basis, 1993, ISO passed the MPEG- 1 standard proposed by the Moving Picture Expert Group (MPEG). MPEG- 1 can effectively encode ordinary quality video data. Most VCD discs we see now use MPEG- 1 standard to compress video data.

In order to support clearer video images, especially high-end applications such as digital TV, ISO proposed a new MPEG-2 standard in 1994 (equivalent to CCITT's H.262 standard). MPEG-2 classifies image quality, which can adapt to video applications with different quality such as ordinary TV programs, conference TV and high-definition digital TV. In our life, MPEG-2 standard can provide high-definition pictures for DVD.

The development of Internet puts forward higher requirements for video compression. Stimulated by new requirements such as content interaction, object editing and random access, ISO adopted MPEG-4 standard in 1999 (equivalent to CCITT's H.263 and H.263+ standards). MPEG-4 standard has higher compression ratio, and supports advanced features such as encoding of concurrent data streams, content-based interactive operation, enhanced random access in time domain, fault tolerance and content-based scale variability. DivX and XviD file formats appearing on the Internet use MPEG-4 standard to compress video data. They can provide high-definition video comparable to DVD with smaller storage space or communication bandwidth, and make our dream of publishing or downloading digital movies on the Internet come true.

Just as video compression is inseparable from the development of TV industry, audio data compression technology was first developed by technicians in the fields of radio broadcasting and voice communication. Among them, the research on speech coding and compression technology is the most active. Since H. Dudley invented vocoder in 1939, people have successively invented speech analysis and processing technologies such as pulse code modulation (PCM), linear prediction (LPC), vector quantization (VQ), adaptive transform coding (ATC) and subband coding (SBC). These speech techniques can not only collect speech features and obtain digital signals, but also generally reduce information redundancy. Like JPEG in the field of image compression, in order to obtain higher coding efficiency, most speech coding technologies allow a certain degree of precision loss. Moreover, in order to better store or transmit the speech signal of binary data, these speech coding technologies always use general compression algorithms such as huffman encoding and arithmetic coding to further reduce redundant information in the data stream after converting the speech signal into digital information.

For ordinary audio information stored in computers and digital appliances (such as digital recording pens and digital walkman), the audio compression standard in MPEG series is the most commonly used compression method. For example, the MPEG- 1 standard provides three optional audio compression standards, namely the first layer, the second layer and the third layer * * *, and MPEG-2 further introduces the AAC (Advanced Audio Coding) audio compression standard. The audio part of MPEG-4 standard supports different types of applications, such as synthetic audio coding and natural audio coding. Among many audio compression standards, MPEG- 1 Layer III is probably the most famous one, which is also the MP3 audio compression standard we often say. From MP3 players to MP3 mobile phones, from the mountain of MP3 files on hard disks to the continuous downloading of MP3 files with copyright disputes on the Internet, MP3 has already surpassed the scope of data compression technology and become a symbol of fashion culture.

Obviously, in the digital age when multimedia information is becoming the mainstream information form, data compression technology still has considerable room for development, especially for images, audio and video. After all, people's pursuit of information quantity and quality is endless. From information entropy to arithmetic coding, from Jews to WinRAR, from JPEG to MP3, the development history of data compression technology is like a sheepskin picture full of innovation, challenges, breakthroughs and changes. Perhaps, we take pains to list the years, figures, standards and documents here, just to tell you that the achievements of our predecessors are just the goals that future generations are expected to surpass. Who knows how many Shannon and Hoffman will appear in the next few years?

Speaking of the future, we can also add some topics related to the development trend of data compression technology.

In 1994, M. Burrows and D. J. Wheeler *** proposed a new general data compression algorithm. The core idea of this algorithm is to sort and transform the character matrix obtained after string rotation. A similar transform algorithm is called burrows-Wheeler transform, or BWT transform for short. The BWT algorithm designed by Burrows and Wheeler is completely different from all previous general compression algorithms, just as Ziff and Lempel found another method. Nowadays, BWT algorithm has achieved great success in the open source compression tool bzip, and the compression effect of bzip on text files is far better than that of the tool software using LZ series algorithms. This can at least show that even in the increasingly mature field of general data compression, we can still find new breakthroughs as long as we keep innovating in concept and technology.

Fractal compression technology is a hot spot in the field of image compression in recent years. This technique originated from the fractal geometry founded by B. Mandelbrot in 1977. M. Barnsley laid a theoretical foundation for fractal compression in the late 1980s. Since 1990s, A. Jacquin and others have proposed many experimental fractal compression algorithms. Today, many people think that fractal compression is the most potential technical system in the field of image compression, but many people disdain it. Regardless of its prospect, the research and development of fractal compression technology reminds us that after decades of rapid development, perhaps we need a new theory or several more effective mathematical models to support and promote the leap of data compression technology.

Artificial intelligence is another key word that may have a significant impact on the future of data compression. Since Shannon thinks that whether or not information can be compressed and to what extent it can be compressed is directly related to the uncertainty of information, it is no longer a fantasy to compress information to one tenth or even one hundredth of the original if artificial intelligence technology is mature and computers can infer subsequent information according to a small number of known contexts, just like people.

After reviewing history, people always like to think about the future. But the future is the future. If we can sort out the future technological development trend only by a few words from you and me, wouldn't the work of technological innovation be very boring? For me, the future is not important. It is important to download some blockbusters on the Internet, and then lie on the sofa and enjoy the infinite happiness brought by data compression.