Talk:Data compression/Archive 1

Archive 1

Cleanup

This page is getting a bit long, and I think that some of the lists could use some cleanup. Zack3rdbb 04:50, 22 December 2006 (UTC)

I've again broken the algorithms into a different list from the implementations. I did this a few years back when I wrote part of the LZ and Huffman pages, but it was reverted back then by someone who seemed unable to tell the difference between an algorithm and an implementation. I also agree about moving the algorithms into the lossy/lossless pages. If my current change sticks I will do that too. Jrlinton


Where should a page referencing lossless compression link? Lossless data compression is more on-topic, but has less information (e.g. about practical algorithms) than data compression.

IMHO data compression should convey the basic concept, discuss the difference between lossy & lossless, and leave it at that. In particular, the algorithm list and discussion should move to the appropriate subtopic. --Robbe

How are we defining "obsolete" here? --Alan D


There's a really detailed public domain article on data compression at http://www.vectorsite.net/ttdcmp1.html#m6

It would be great if someone could expand our articles using this material (unfortunately, I don't really have a clue about this subject!) Enchanter --- I agree that this page should be about data compression, not about hundreds of different implementations of the half dozen basic algorithms. The discussion of MPEG, DEFLATE, PKZIP, etc. should not be on this page, since they are implementations of one or more of the basic algorithms (LZ77, DCT, etc.).

>> I find it disturbing that a contributor to this page claims to have no knowledge of the subject!

Lossy compression on CD-ROM?

In particular, compression of images and sounds can take advantage of limitations of the human sensory system to compress data in ways that are lossy, but nearly indistinguishable from the original by the human eye. This is used on CD-ROM and DVD, as well as in digital cameras.

This is highly suspect. DVD, yes: the video is encoded in a lossy format. Digital cameras, yes: pictures are recorded into lossy formats. CD-ROMs, however, merely store the information on them, and in fact, not only is the information not reduced by lossy means, but redundancy in the form of error-correcting codes is added to safeguard the information that's there. Even if "audio CDs" are meant rather than CD-ROMs, it's still a stretch; sure, the audio is quantized before it's recorded onto the master, but no subsequent compression, lossless or lossy, takes place after quantization. -- Antaeus Feldspar 17:32, 23 Sep 2004 (UTC)

I agree. It would be better to say that lossy compression is used mainly for images, video and audio. --Matt Mahoney 01:17, 25 Jan 2005 (UTC)

Self-extraction points to its disambiguation page, which points to uncompression, which returns to this article, which has no information about uncompression.

Perhaps someone "in the know" could do a bit on uncompression (decompression?).--suburbancow 02:14, 26 January 2006 (UTC)

The "T-code" link under Lossless Compression winds up on a SAP "Transaction Code" page that has nothing to do with data compression. —Preceding unsigned comment added by RobSimpson (talkcontribs) 21:13, 31 March 2009 (UTC)

failure to compress

I find this text from the article problematic:

"indeed, any compression algorithm will necessarily fail to compress at least 99.609 375 % (255/256) of all files of a given length."

Clearly it cannot be true that the algorithms fail to compress 99.6% of the time. I think the intent was to say that any lossless algorithm will fail to compress some files of any given length, but the numbers don't make sense. Also, the percentage of files that would fail would definitely be a function of the length of the files. Kryzx 15:41, 29 March 2006 (UTC)

I agree, I can't really guess at what that sentence is trying to say. Only lossless compression has the possibility of failing to reduce file size, and that doesn't happen that often. Algr 17:44, 30 March 2006 (UTC)
Actually it's true that the vast majority of possible files will get bigger during lossless compression. It's just that the vast majority of possible files are useless random byte sequences, and we seldom try to compress those -- so it seems like compression algorithms almost always make things smaller. You can trivially prove this using the pigeonhole principle:
There are exactly 256^100 possible 100-byte files.
A reduction in size will yield a file between 1 and 99 bytes long.
There are roughly 256^99 possible files between 1 and 99 bytes long.
Each compressed file uniquely maps to a decompressed file -- that is, you can't (correctly) compress two different files and end up with the same compressed file.
256^100 / roughly 256^99 = roughly 256.
That means that, in order to compress the file by at least one byte, we are reducing the number of possible byte sequences down to roughly 1 in 256 of the original possibilities, or about 0.4%. Only about 0.4% of the input sequences will therefore be able to be compressed.
The other 99.6% of the byte sequences will not map to a shorter sequence. Ergo, they must map to an equal or greater length sequence.
So, it is absolutely true that compression algorithms only work about 0.4% of the time. BUT, it only applies when you are talking about compressing "all possible byte sequences of a given length". Nothing we ever produce is actually that random, which is why in practice compression algorithms do indeed work. Additionally, by adding even a single byte to the size you multiply your possible sequences by a factor of 256 -- meaning that while only about 0.4% of the sequences can be compressed, the remaining 99.6% need not (in theory) grow by more than one byte. Egomaniac 13:54, 20 June 2007 (UTC)
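
A minimal sketch of this counting argument in Python; the alphabet size and file length below are arbitrary example values:

```python
# Compare the number of possible N-byte files with the number of strictly
# shorter candidate outputs.

ALPHABET = 256   # possible byte values
N = 100          # file length in bytes

n_inputs = ALPHABET ** N                              # all possible N-byte files
n_shorter = sum(ALPHABET ** k for k in range(1, N))   # all files of 1..N-1 bytes

# At best, only this fraction of inputs can map to a strictly shorter output:
print(n_shorter / n_inputs)   # ~0.0039, i.e. roughly 1 in 256
```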

Predictive coding

I see that predictive coding is presented under lossy encoding, but surely it's possible to use prediction for lossless encoding too.

It seems to me that there are two possibilities:

a. Do perfect predictive coding, which will be lossless, or
b. Do fairly accurate predictive coding, then subtract the prediction from the source and encode the residual. If the residual is encoded without loss then the overall encoding is also lossless.

I am fairly sure that for some applications predictive coding schemes which are completely lossless can be constructed. The article could deal with this. David Martland 09:27, 25 August 2006 (UTC)
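
A small sketch of option (b), assuming the simplest possible predictor (each sample is predicted by the previous one); the signal values are made up:

```python
def encode(samples):
    residuals, prediction = [], 0
    for s in samples:
        residuals.append(s - prediction)  # residual = actual - predicted
        prediction = s                    # next prediction = current sample
    return residuals

def decode(residuals):
    samples, prediction = [], 0
    for r in residuals:
        prediction = prediction + r       # undo the subtraction exactly
        samples.append(prediction)
    return samples

signal = [10, 12, 13, 13, 15, 20, 21]
assert decode(encode(signal)) == signal   # the round trip is lossless
```

As long as the residuals themselves are entropy-coded losslessly, the whole chain stays lossless; loss only appears if the residuals are quantised.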


WHAT NO IMAGES?FAIL! Trigger hurt 15:33, 6 September 2006 (UTC)

compression and encryption

The article currently claims: "Similarly, compressed data can only be understood if the decoding method is known by the receiver. Some compression algorithms exploit this property in order to encrypt data during the compression process."

I'm deleting the second of those sentences, on the basis of:

It is true that some data compression programs also allow people to produce "password protected archives" (with the notable exception of gzip, as mentioned in the gzip FAQ). But my understanding is that they do not "exploit this property" -- instead, those programs first compress the data, then use an unrelated encryption algorithm that would encrypt plain text just as well as compressed text. If there actually exists some compression+encryption algorithm that "exploits this property", please tell me about it or link to it. --68.0.120.35 01:01, 13 December 2006 (UTC)
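
For what it's worth, a minimal sketch of that understanding (compress first, then encrypt with an unrelated cipher); it assumes the third-party Python "cryptography" package is installed, and the sample data is arbitrary:

```python
import zlib
from cryptography.fernet import Fernet

data = b"some repetitive plaintext " * 100

compressed = zlib.compress(data)               # stage 1: generic lossless compression
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(compressed)   # stage 2: independent encryption

# Decryption needs the key; decompression itself needs no secret at all.
assert zlib.decompress(Fernet(key).decrypt(ciphertext)) == data
```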

Source coding merged here

I've redirected "source coding" to here. The term was already bolded as something the article laid claim to. Data compression generally is most of source coding, but perhaps a paragraph might now be worthwhile on other aspects of source coding - eg perhaps the way source coding/data compression may make surprising symbols/runs of data more apparent, which may be interesting from a detection point of view.

Jheald 21:45, 5 March 2007 (UTC)

LZMA in sitx?

I find the assertion that sitx uses LZMA to be extremely dubious. I can't find any other source on the web that supports this claim, and the Stuffit page seems to indicate that Stuffit X uses a variety of other methods. Anybody know where this idea came from?

-Gsnixon 11:29, 6 March 2007 (UTC)

Citations

I was looking through this article for some basic beginner compression information, and although all of it matches up with what I already knew or at least makes sense... it appears to be drastically in need of some citations. I hesitate to tag it as one of my first Wikipedia actions, so I figured I would post first to see what others thought. Jaidan 16:48, 16 March 2007 (UTC)

Comparative

Copied from the French personal discussion page of the author of the comparison.

Interesting compression format/software comparison. Thanks for sharing your results. However, it is missing something important that I didn't find anywhere... What version of each program did you use for the tests? Specifically, what version of 7z did you test? Thanks again.

Thanks. I used the latest versions available at the date of this comparison (January 2007), as specified on the second page. For 7-Zip, I used version 4.43. Sorry for my English: I love French so much that I have not had enough time to learn other languages. ;-) Rlwpx 11:04, 21 April 2007 (UTC)

Oops. The article did mention v4.43... I thought I had looked everywhere. Anyway, thanks for the info. One more question though: did you save the test data, and do you plan to keep updating the results as newer versions come out? For example, I know the new 7z 4.45 runs much faster on my dual-core machine (but I have not tested whether it compresses any better). I would be very interested in checking back here to see the progress of the different archivers. Well, thanks again for a well done comparison :)

Don't worry about the mistake. Of course, I do appreciate 7-Zip and I always use the latest version myself. Since I made a first comparison in 2006 and hope to make a new one next year, you will be happy to know that I have kept a copy of all the test data: I shall give you the results for 7-Zip 4.45 here in a few days. Sincerely yours. Rlwpx
Results:
Comparison of compression efficiency (compressed size in bytes) between 7-Zip 4.43 (January 2007) and 7-Zip 4.45 (Rlwpx, April 2007):
File type   7z 4.43     7z 4.45
.avi        4,524,067   4,523,803
.dll        1,543,179   1,508,356
.doc          147,690     147,655
.exe        3,910,541   3,900,446
.gif        4,620,354   4,620,335
.htm          341,996     333,932
.jpg        4,770,061   4,769,937
.mp3        5,053,813   5,053,700
.mpg        4,879,067   4,878,534
.pdf        4,258,863   4,258,895
.txt        1,270,884   1,377,502
.wav        3,670,225   3,668,346
.zip        5,226,742   5,240,542


This table is incredibly difficult to interpret. The meaning of the numbers is ambiguous. Do the results contain file sizes as numbers split up by spaces, like 5,245,423 = 5 245 423 (split up on different lines)? And what do the numbers in parentheses give? Units, descriptions? Could we add some information to aid in interpreting what we see?

Sorry for the spaces: you should read them as commas (in French, spaces are used where commas are used in English). I'll correct this mistake. For example, 5 245 423 is 5,245,423 bytes, i.e. about 5 MB. Numbers in parentheses indicate the rank of the method for the category: for example, 7z is the second-best method for the compression of avi files, the best is uha (1) and the worst, ranked 20th, is pkz (20). Rlwpx 18:54, 7 June 2007 (UTC)

I have removed this section in its entirety as it fails to differentiate between compression algorithms and compression software (some of which use a variety of different algorithms), and because most of the data sets contain files which are already compressed. A better comparison would plot individual compression algorithms against uncompressed datasets of particular types (e.g., text, executables, raw bitmaps, raw audio). —Psychonaut 22:07, 12 July 2007 (UTC)

I don't agree with you: first, the table does not present software (for example, zip does not mean WinZip) and, second, it is perfectly reasonable to compress already-compressed files: I doubt that there are many people who compress only txt or wav files! Rlwpx 18:23, 14 July 2007 (UTC)

So I can't agree with "Globally, the three best methods tested are rk, uha and 7z." because the table says that the strongest compression is rk (41,699,633) followed by rar (43,593,482) and 7z (44,217,482). Can someone change the text? —Preceding unsigned comment added by 87.160.124.182 (talk) 13:31, August 30, 2007 (UTC)

I think the AVI column should be removed - AVI is a container file for any number of different compression systems. Contents could have been anything from uncompressed video to some form of MPEG or DV. —Preceding unsigned comment added by 82.5.204.14 (talk) 12:14, 9 September 2007 (UTC)

Comparative: copyright?

P.Table(P.T.): PAQ8 (the KGB Archiver is a Windows GUI for the older PAQ7) is much better than all of these, but because of the copyright of the table it can't be copied.

If this is true, then the table should be removed. Text in Wikipedia is released under the GFDL. It's also not the best table in the world, as it includes formats like "AVI" which are highly codec-dependent. ⇌Elektron 17:10, 27 August 2007 (UTC)

I've removed the infringing section. The section without the table is reproduced below. If the original contributor is the copyright holder (the domain name of the site seems to suggest it), then it's original research and also shouldn't be included. The section isn't written very well, the last paragraph seems biased, and there are a bunch of other technical problems with the data (".avi" doesn't really mean anything). Let's hope bots don't think I'm a vandal. ⇌Elektron 16:00, 4 September 2007 (UTC)

Independent comparison of different methods of data compression (Results & Softwares, in French. Airelle, 2007). Numbers in parentheses are the rank of the compression method for the category of file specified above.

  • Text files, such as .htm or .txt, can be compressed to a high degree.
  • Some files are already compressed (e.g. .mp3 or .zip), so the compression rate for such files is poor. Due to the addition of header data, there are often diminishing returns in such compression, sometimes causing the file to actually be slightly larger upon storage.
  • To be more representative of the performance, the global score (/20) is calculated with a non-parametric formula from the sum of the ranks (1 to 20) for each of the 20 tested methods.

(Missing table here)

P.Table(P.T.): PAQ8 (the KGB Archiver is a Windows GUI for the older PAQ7) is much better than all of these, but because of the copyright of the table it can't be copied.

Globally, the three best methods tested are rk, rar and 7z. WinRK and WinRAR are commercial software; 7-Zip is free, open source (LGPL licence) and can be used with Linux.

—Preceding unsigned comment added by Elektron (talkcontribs) 16:00, 4 September 2007 (UTC)

I, Airelle, the author of this comparison, am the original user who added the table to this article. I agree with the GFDL licence. As you wrote, my user name is part of the URL of the external website. Note that I had initially added a simple link to my website, but a user advised me to include the results in the article, so I did.
As I said, globally, the three best methods tested are rk, uha and 7z, because I used a non-parametric method: these three compression formats obtained the best scores (18.2, 17.3 and 16/20). If you consider the global size of the compressed files, the three best compression formats are rk (41,699,633) followed by rar (43,593,482) and 7z (44,217,482). There is no contradiction: it depends on the criterion used.
Lastly, forgive my bad English, but that should not be a problem: anyone can correct the text to improve it. It should not be a criterion for elimination. Rlwpx 10:41, 8 September 2007 (UTC)
I mainly removed it because of the line "but for the copyright of the table it can't be copied" (I'm not sure who added that), suggesting that the table is not GFDL. It's also only one person's results (possibly violating NPOV), and since you added it, it counts as original research. It also can't really be a fair comparison without saying what was in the files you compressed. ⇌Elektron 11:50, 8 September 2007 (UTC)
I don't know who added the line about the copyright. But I do know that I am the author! I can give the list of the files, but it would be "file.ext of x bits, file2.ext of x2 bits" and so on: hmmm... Is it really needed? I think the global size is enough. Rlwpx 12:09, 8 September 2007 (UTC)
File extensions are at best indicative, but for the most part meaningless (see the AVI comment in the section above). The entropy of a source depends on the source — text compression algorithms are typically tested on something like the Calgary Corpus. For pictures, there are standard test images, and presumably freely available videos in various formats (but if you, say, turn a TIFF into a JFIF, the particular algorithm and settings used matter too). For video/audio formats, you need to specify what codec/version/settings are used, but how much these compress largely doesn't matter — your average MP3 gets transmitted as an MP3 (this is also largely true for images), and your average WAV either stays like that or turns into an MP3/Vorbis/WMA/AAC/FLAC/etc. PDFs can use zlib compression internally, PNGs do, GIFs use LZW. How much ZIPs compress is almost completely irrelevant (what was in the ZIP?).
In fact, I mostly see compression used either in torrents (for no good reason, since compressing MP3s/movies doesn't help much and wastes everyone's CPU time), for precompiled programs (no-install-needed ZIPs, compressed disk images, RPM/DEB packages) or source tarballs, and I only bother if it compresses/uncompresses within reasonable time. Your choice of "file formats" introduces bias into the results, and thus any overall result is meaningless. ⇌Elektron 13:45, 9 September 2007 (UTC)
I understand well what you say, but I have seldom seen a comparison with such precise details. I could add, for example, that the .avi files are x mpeg3 and y mpeg2, or that the .zip files are x' exe and y' txt, but the table would become incomprehensible with this additional information of limited interest. I think the important thing is to use the same data files for each compression method so that one can compare them. Rlwpx 09:54, 15 September 2007 (UTC)
The results are meaningless unless you state what data you use. Most respectable comparisons of image/text/whatever compression use standard data. You give a file extension, which has absolutely nothing to do with the data inside the file. The results are also huge and don't really belong in the article - I suggest that they at least be moved to a different article. ⇌Elektron 17:01, 15 September 2007 (UTC)

Comparative (again)

This has too many issues. One of them is a private test case (mentioned before). Another is that it doesn't list the archivers along with the archive format; it just displays the archive format. Also, the parameters used for compression are unknown. These are quite important issues, because different archivers can archive to the same format, but do so differently, and different parameters tend to lead to different ratios as well. This difference, while possibly small, is actually important. Additionally, as mentioned before, what's up with the copyright thing? And while I agree paq8o* probably compresses far better than everything else, unless we have solid evidence we should not publish these things. The copyright issue probably arose because someone wanted to include GFDL-incompatible info in the table, which, unless it counts as fair use, should not even be in Wikipedia. Someone really should fix this.

And, in case you didn't know, rar, a command-line version of WinRAR, is also available as shareware for Linux, and WinRAR works under Wine. 7-Zip is not a Linux native, p7zip is. I'll fix this. --Yugsdrawkcabeht (talk) 15:41, 29 December 2007 (UTC)

Comparative (final)

I am removing the comparative section, because it clearly falls under WP:NOR and, as stated above, is kind of a bunch of crappy data anyway, i.e. .ZIP does not = WinZip, .AVIs are container files . . . if anyone has any strong objections it can always be restored, I guess. Though I will argue against any such action. --SelfStudyBuddyTALK-- 05:30, 15 June 2008 (UTC)

Yes, we should be citing one of the many public compression comparisons, on the web or in the research literature. While I value their intentions and think a comparison is useful, it's far more useful to provide a comparison of methods here than of specific compressors that will rapidly become obsolete, as is done in Managing Gigabytes, which we could cite directly. Dcoetzee 10:31, 25 June 2008 (UTC)

Transparent compression in file systems and archivers

It would be nice to have something on the above topic, with links to appropriate articles and categories. ((Category:Compression file systems)), Transparency (computing), Transparency (data compression), Data compression and ZIP (file format) come to mind. bkil (talk) 13:24, 5 August 2008 (UTC)

Linear Compression

There's a news item today about an advance in Linear Data Compression: http://www.ddj.com/architect/222002847 I came to WP to learn more about Linear Compression... but couldn't find anything! Anyone want to add an article? 75.16.119.163 (talk) 21:54, 21 December 2009 (UTC)

Source vs Entropy Coding

I have it on reliable authority that Source Coding and Entropy Coding are different sides of the topic of data compression.

Source coding is lossy, Entropy coding is lossless - at least if I answered that way in the exam I'm about to sit I would get the marks allocated to the question.

My lecturers describe a two-phase compression scheme such as JPEG as involving a source coding phase first - effectively reducing the number of bits required to store the data by finding another representation (such as a DCT transform, which effectively represents the majority of the information in a few terms, leaving the rest nearly zero). The second phase is to take this alternative representation and encode it using an entropy coding scheme (e.g. Huffman and/or RLE). The redundancy introduced by the first phase renders the data more susceptible to efficient entropy coding.

They're fairly specific about the difference (to the point of broad hints about the exam paper) - why is this not made clearer in the text?

Another point regarding source coding is that the DCT is COMPLETELY reversible without loss of data. The loss only occurs when the DCT transformed image cell is quantised - this is normally done by point-wise division of the transformed cell with a quantisation matrix, followed by a rounding of the values to integers. This quantisation and subsequent rounding is where the loss occurs. Because the majority of the information is contained in a few terms, the reverse process can "fairly faithfully" reconstruct the original "shape" of the intensity information represented within the cell. The amount of quantisation applied dictates the quality of the decompressed image data.
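
A small sketch of that point, using a 1-D DCT in place of JPEG's 8x8 2-D case; the sample block and quantisation step are arbitrary illustrative values:

```python
import numpy as np
from scipy.fftpack import dct, idct

block = np.array([52, 55, 61, 66, 70, 61, 64, 73], dtype=float)

coeffs = dct(block, norm='ortho')
assert np.allclose(idct(coeffs, norm='ortho'), block)   # the DCT alone is lossless

q = 10.0
quantised = np.round(coeffs / q)                 # rounding is where the loss occurs
approx = idct(quantised * q, norm='ortho')       # only an approximate reconstruction
print(np.max(np.abs(approx - block)))
```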

Entropy coding tries to reduce the overall "energy" represented by a data stream by finding lower-energy representations; e.g. Huffman coding assigns the shortest codes to the most common symbols.
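
A minimal Huffman sketch of that last point, showing that more frequent symbols receive shorter codes; the text being counted is just an example string:

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    # Heap items: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

freqs = Counter("this is an example of a huffman tree")
codes = huffman_codes(freqs)
for sym in sorted(codes, key=lambda s: freqs[s], reverse=True):
    print(repr(sym), freqs[sym], codes[sym])   # frequent symbols get short codes
```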

Ian (3rd Yr Undergrad - Computer Science)

See also

This section is too long. I think one could argue that all this could be replaced with Category:Data compression. --Kvng (talk) 23:29, 3 February 2011 (UTC)

Copyedit cleanup

I'm undertaking a medium-sized cleanup of a lot of the copy editing. Most of this actually is due to the merge in from two other articles in November, so most of the work is reintegrating those together. So if it looks like I'm slashing down content it's still much more information than was there before the merge... I'm also integrating almost everything. Very little information that's there now is being removed. But if I have an oversight or there's something wrong please drop me a msg on my talk or here.

This also involves disambiguating a few links so I'm trying to do that while at it. Shadowjams (talk) 03:43, 10 February 2012 (UTC)

Does unused potential factor 4.5 refer to lossy or lossless compression?

Section 3 of the article states:

Outlook and currently unused potential.
It is estimated that the total amount of the information that is stored on the world's storage devices could be furthermore compressed by a remaining average factor of 4.5 : 1 with existing compression algorithms, which means that thanks to compression, the world could store 4.5 times more information on its existing storage devices than it currently does

The section should mention if this refers to lossy or lossless. 93.95.251.162 (talk) 11:53, 16 November 2011 (UTC) Martin.

Presumably it refers to both, as lossless compression of video could almost never be regarded as "optimal" while it obviously is optimal for some formats. Regardless, the whole section in question is fairly trivial, and refers to a reference which (despite an assertion to the contrary both on the article and on the website it links to) is hidden behind a paywall with no free access; it could (and should) probably be removed in the course of tightening this article up. Chris Cunningham (user:thumperward) (talk) 12:34, 23 November 2011 (UTC)
I think you slightly misspoke and mean lossy, not lossless. Shadowjams (talk) 21:52, 7 November 2012 (UTC)

Ancient ideas, as far as I know it

Let's say we have a binary string of N bits in total, of which p% are 1 bits and (100-p)% are 0 bits. We might use, as an avalanche-style compression of the data, a bit-by-bit XOR with an N-bit string that has 1.2*p% 1 bits... etc., obtaining a resulting N-bit string with 0.75*p% 1 bits... This is also applicable for optimizing the performance of multiplications or other functions using a small base of pre-resolved cases... 93.118.212.93 (talk) 20:31, 7 November 2012 (UTC) (take it as an IQ test if you wish, Russian-problem style) A counting system using the digits 0, 1, 3, 7, 15 might work with a counting base b <= ...3, which is to say we get up to 2x data compression compared to natural binary encoding with digits 0 and 1 and counting base b = 2; deserialization starts from the MSB... 93.118.212.93 (talk) 14:00, 23 February 2013 (UTC)

This no sense makes. Asymptotically, if the fraction of "1"s is p (not p%), we could use an entropy-based compression method with roughly N(-p log2 p - (1-p) log2(1-p)) bits. — Arthur Rubin (talk)
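
A quick numerical check of that bound, reading the formula as the binary entropy N·H(p) (an assumption about what the reply intended); N and p below are arbitrary example values:

```python
from math import log2

def entropy_bits(N, p):
    # N * H(p): the approximate lossless lower bound for an i.i.d. bit source.
    if p in (0.0, 1.0):
        return 0.0
    return N * (-p * log2(p) - (1 - p) * log2(1 - p))

print(entropy_bits(1000, 0.5))   # ~1000 bits: no gain when 0s and 1s are balanced
print(entropy_bits(1000, 0.1))   # ~469 bits: a skewed distribution compresses well
```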

Thank you very much for this formula; it really helps in making some a priori evaluation of some other creative (purely creative) ideas in data compression. I understood that once p and (1-p) are not equal, there is a chance to obtain a compression gain, which is given numerically by this formula. Thank you once more! Florin 93.118.212.93 (talk) 11:58, 14 April 2013 (UTC)

General File Compression

It seems strange that in the usage section there's sub-sections for audio and video but not just general file compression for things such as archiving or data transport. — Preceding unsigned comment added by 87.237.64.150 (talk) 11:23, 4 September 2013 (UTC)

Use of AST for data compression

(also posted at Talk:Time stretch dispersive Fourier transform)

http://www.sciencedaily.com/releases/2013/12/131219131237.htm

... [A] UCLA group, led by Bahram Jalali, holder of the Northrop Grumman Opto-Electronic Chair in Electrical Engineering, and including postdoctoral researcher Mohammad Asghari, created an entirely new method of data compression. The technique reshapes the signal carrying the data in a fashion that resembles the graphic art technique known as anamorphism, which has been used since the 1500s to create optical illusions in art and, later, film. The Jalali group discovered that it is possible to achieve data compression by stretching and warping the data in a specific fashion prescribed by a newly developed mathematical function. The technology, dubbed "anamorphic stretch transform," or AST, operates both in analog and digital domains. In analog applications, AST makes it possible to not only capture and digitize signals that are faster than the speed of the sensor and the digitizer, but also to minimize the volume of data generated in the process. AST can also compress digital records -- for example, medical data so it can be transmitted over the Internet for a tele-consultation. The transformation causes the signal to be reshaped in such a way that "sharp" features -- its most defining characteristics -- are stretched more than data's "coarse" features.

-- Jo3sampl (talk) 16:36, 26 December 2013 (UTC)

"Compression is driven by low bandwidth paths between high bandwidth parts of the world" Section

This section seems very out of place. The ideas it discusses are interesting, but poorly worded and seemingly speculative, with no citations. It should be reviewed. — Preceding unsigned comment added by 157.166.175.129 (talk) 14:45, 24 January 2014 (UTC)

missing answers to common questions

People may come here searching for: what is/was a packer, what is/was a cruncher? How do they relate to modern compression types, etc.? Terms used in the eighties and nineties for compression should, I think, get some mention in here. — Preceding unsigned comment added by 82.156.142.118 (talk) 11:40, 2 March 2014 (UTC)

Compression of Random data

https://sites.google.com/site/rubikcompression/strictly-long and more explanation on that site's home page. Anyone up for an independent code-up? Given the "logic" of the usual response, an independent coding is the most obvious way to eliminate accusations of trickery. — Preceding unsigned comment added by 188.29.166.89 (talk) 00:23, 31 March 2014 (UTC)

Stellar brilliance

Lossless compression reduces bits by identifying and eliminating statistical redundancy... Lossy compression reduces bits by identifying unnecessary information and removing it.

What is the difference between identifying and eliminating statistical redundancy and identifying unnecessary information and removing it?

Looks like there was once a message here, but it got filtered through tapioca and turned into goo.

Noisy channel? 178.38.152.228 (talk) 22:14, 20 November 2014 (UTC)

Lossy removes information; lossless only removes redundancy. Dicklyon (talk) 22:24, 20 November 2014 (UTC)

Bill Nye for National Science Foundation

Compression Technology by Bill Nye for National Science Foundation

Suggested file to add to this article. — Cirt (talk) 16:59, 6 October 2015 (UTC)

Hello fellow Wikipedians,

I have just modified one external link on Data compression. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

 Y An editor has reviewed this edit and fixed any errors that were found.

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 07:55, 7 December 2016 (UTC)

Vocoder audio compression

For voice-range audio compression, vocoder-based compression goes back further than u-Law and a-Law compression. However, it is debatable whether this is true "audio compression", since it reduces the audio to tonal and noise patterns typical of speech. --Robert Wm "Ruedii" (talk) 21:43, 25 September 2017 (UTC)

Terry Welch

OrdinaryArtery has added cited information indicating that LZW was based on earlier work by Terry Welch. Lempel–Ziv–Welch, citing Welch's 1984 paper (a primary source), claims LZW was based on earlier work by Lempel and Ziv. I don't have access to OrdinaryArtery's source so am unable to sort this out. ~Kvng (talk) 13:55, 30 March 2018 (UTC)

WP:SUMMARY

Because of the many {{Main}} instances, this is clearly a WP:SUMMARY article, so I believe Maestro2016's recent and welcome improvements should be moved to the supporting article Image compression and then summarized here. ~Kvng (talk) 12:54, 23 October 2019 (UTC)

I've made an attempt at a summary, trimming it down to a basic summary of the three image compression types (DCT, LZW, DWT). Maestro2016 (talk) 16:14, 23 October 2019 (UTC)

increasing information density

What term should I use for: increasing information density by storing the same number of bits, but in a smaller physical space?

My understanding is that "data compression" involves only representing digital data in fewer bits, which is one but not the only way of increasing information density.

(The terms information density, ephemeralization, miniaturization, MOSFET scaling, Kryder's Law, data compaction, etc. all seem related but are apparently either more specialized or more general than "storing the same number of bits, but in a smaller physical space"). --DavidCary (talk) 21:59, 15 November 2019 (UTC)

I have seen data density or compression density used informally. The technical terms include code rate or entropy rate. Shannon's source coding theorem gives an upper bound on data density for lossless compression. I don't know of a general all-encompassing term for all the ways to store data in a smaller physical volume. It is probably better to simply explain what is meant. --{{u|Mark viking}} {Talk} 22:25, 15 November 2019 (UTC)

Singularity Compression Theorem Peer review?

I came up with this idea of "singularity compression" for data compression. I wrote a small paper on it, here (Google Document): Singularity Compression Theorem.

The main idea:

An unlimited amount of data can be compressed to a finite hash.

The method for achieving it:

To compress, simply take each character of the file and hash it individually. Each hash represents a node that will be used to build a tree. Each consecutive pair of hash nodes are children of the same parent node. In the simplified version, the two are hashed together, creating a resulting hash that becomes the parent node. When repeated for all pairs in the data leaf layer, the next level is complete. Repeat the pairing for all consecutive pairs of parent nodes for the next layer, repeat again and again until there is only 1 node left, the root hash.

The root hash is the compressed file.

To decompress, you must keep in mind that the string that was hashed to create the root is composed of two hashes, both equal in length to the root hash. Splitting this string in half yields the first child node in the first half of the string, and the second half yields the second child node. Using the process of elimination, each and every possible string can be hashed and checked to see if the resulting hash equals the root hash. Since there are a finite number of unique strings of the specified length, there are a finite number of hashes to check. There may be more than one string that equals the same hash; if this is the case, then the search branches out to try the candidates, until the bottom of the tree is reached and can be confirmed to be composed of hashes of the original characters.

To verify the file is correct upon reconstruction, a hash of the original file might be included as an additional means of guaranteeing data integrity.

A more complicated version of the compression method is intended to eliminate all other strings that, when hashed, equal the parent node, like the solution, but are otherwise incorrect. It is similar to the first method, but when pairs of hashes are hashed, the original hashes are concatenated and the new hash is appended to the end to make a new string that is three times longer than an individual hash. Finally, the new string is hashed to get the parent node. Using this extra layer of complexity, all strings that are discovered to hash to the solution (parent node) are invalid if the first two thirds of the string, when hashed, fail to equal the last third of the string. When the hash of the first two thirds equals the last third, there is a match. With the internal consistency being present, the odds are great that this is not just a coincidence, but is in fact the solution. To further improve the odds, a pair of hashes could be concatenated with their hash, the resulting string hashed and appended to the end, and so on, as many times as desired, to ensure redundant data integrity at time of decompression. This would limit the branching factor of any search in the tree from one layer of nodes to the next.
It is possible that any given root hash can be overloaded to have more than one valid solution. For example, given that the hash of the original file does not match the first valid decompressed file (the final layer in the tree corresponds perfectly at each point to a valid character), perhaps a new layer can be added to continue the process until each node in the layer represents a valid character and the characters, when hashed, equal the hash of the original file. — Preceding unsigned comment added by MatthewFelsted (talkcontribs) 06:30, 29 July 2018 (UTC)


That's an interesting idea. What you've described sounds a lot like a Merkle tree, which is very useful in several applications. It also has some similarity to torrent files, which are very useful for segmented file transfer. It also reminds me of the hash functions used in rzip and some other large-window data compression algorithms such as IZO.

Alas, if I'm understanding your description correctly, "singularity compression" sounds a lot like the "recursive compression" approach that many people have tried before. A relatively simple counting argument in the "Compression FAQ" by Jean-loup Gailly shows recursive compression doesn't work for general-purpose lossless data compression. --DavidCary (talk) 07:58, 24 November 2019 (UTC)
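
A tiny illustration of that counting argument: a fixed-length digest cannot identify a unique longer input. The digest below is truncated to one byte purely to make the collisions visible at small scale; a full-length hash has the same problem, just at astronomically larger input sizes.

```python
import hashlib
from collections import defaultdict

buckets = defaultdict(list)
for a in range(256):
    for b in range(256):
        msg = bytes([a, b])
        digest = hashlib.sha256(msg).digest()[:1]   # toy 1-byte "root hash"
        buckets[digest].append(msg)

print(len(buckets), "distinct digests for", 256 * 256, "distinct 2-byte inputs")
print("largest collision bucket:", max(len(v) for v in buckets.values()))
```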