Data vs. information
[This is a guest post by Conal Boyce]
The following was drafted as an Appendix to a project whose working title is "The Emperor's New Information" (after Penrose, The Emperor's New Mind). It's still a work-in-progress, so feedback would be welcome. For example: Are the two examples persuasive? Do they need technical clarification or correction? Have others at LL noticed how certain authors "who should know better" use the term information where data is dictated by the context, or employ the two terms at random, as if they were synonyms?
For context, here is part of the email I sent VHM recently, by way of introducing him to this data/information project:
"For your amusement, here is 'Appendix P' to a manuscript that I've been working at, on and off, over a five-year period. The MS is motivated by the phenomenon of biologists and physicists who are unable to grasp the difference between data and information. For example, Richard Dawkins (The Blind Watchmaker, 1987) and Antoine Danchin (La barque de Delphes, 1998, tr. to English 2002) are both computer savvy, even proficient as programmers, yet they remain oddly clueless about the data / information distinction, and both sing the praises of DNA as a kind of 'information technology' (when DNA is really just a big blob of mindless data in my opinion — a concept that both Dawkins and Danchin would agree with, yet their use of terminology clashes with it). Even more annoying are the astrophysicists such as Hawking and Susskind, who yammer about 'information lost in a black hole', an absurd idea for anyone who actually knows what information is. What Hawking and Susskind are actually fretting about is the possibility of 'mass lost in a black hole', i.e., that Conservation of Mass might be violated, so they should just say so!"
Appendix P (for 'Polish'), which follows below, is one of several examples where I try to make it clear, even for a biologist or astrophysicist set in his ways, that there is a profound gulf separating data from information. And the 'punch line' of the example makes use of Victor Mair's 1990 translation of the Tao Te Ching, so I thought I'd pass along a copy of it for your amusement. (Someday I'll get the MS itself cleaned up and try submitting it to a journal somewhere — not in the realm of biology or physics, where it would be offensive, but perhaps in a computer science or data processing journal.)
Appendix P to: The Emperor’s New Information
Consider a string of zeroes and ones that starts and ends as follows:
01000100 01110010 01101111 01100111 01101001 00101100 00100000 …
01110111 01101001 01100101 01100011 01111010 01101110 01111001 01101101.
Then a series of hex values that looks like this:
44 72 6F 67 69 2C 20 6B 74 A2 72 79 6D 69 20 6D 6F BE 6E 61 20 63 68 6F 64 7A 69 86 2C 20
6E 69 65 20 73 A5 20 64 72 6F 67 A5 20 77 69 65 63 7A 6E A5
49 6D 69 6F 6E 61 2C 20 6B 74 A2 72 65 20 6D 6F BE 6E 61 20 6E 61 7A 77 61 86 2C 20
6E 69 65 20 73 A5 20 69 6D 69 65 6E 69 65 6D 20 77 69 65 63 7A 6E 79 6D.
And then a text that looks like this:
Drogi, którymi można chodzić,
nie są drogą wieczną
Imiona, które można nazwać,
nie są imieniem wiecznym
I trust that most readers will agree that the string 01000100…01101101 is only data, something comparable to the dits and dahs of Morse Code; not YET information. What about the string ‘44 72 6F 67 69 2C 20 … 77 69 65 63 7A 6E 79 6D’: is that information? Only in the very rudimentary sense that it reflects knowledge of how hex and binary relate to one another. (For example, binary 01000100 translates to ASCII code 44 hex, which later becomes the ‘D’ at the beginning of the text; binary 01110010 translates to 72 hex, which later becomes the ‘r’ in ‘Drogi’; and from ASCII 2C we obtain the comma after ‘Drogi’; and so on.) By performing all the substitutions, we arrive at the sixteen-word text shown above. At that point, do we have some information at last?
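The decoding chain just described can be replayed in a few lines. This is a sketch of mine, not part of the original Appendix; the `cp852` codec is an assumption here (the comment thread below identifies that code page as the source of the accented bytes), and the fragment shown uses only the first seven bytes.

```python
# A minimal sketch (not from the original post) of the decoding chain:
# 8-bit binary groups -> bytes (shown as hex) -> characters.
bits = "01000100 01110010 01101111 01100111 01101001 00101100 00100000"

# Step 1: each binary octet is one byte; in hex these are 44 72 6F 67 69 2C 20.
data = bytes(int(group, 2) for group in bits.split())
print(data.hex(" ").upper())    # 44 72 6F 67 69 2C 20

# Step 2: the bytes decode to text. The plain letters coincide with ASCII;
# the accented Polish letters elsewhere in the string (A2, BE, 86, A5)
# come from code page 852, as the comment thread below works out.
print(data.decode("cp852"))     # prints "Drogi, "
```

The same two steps, applied to the full string, yield the Polish text above.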
Let’s take this a step at a time. If one reads Polish, one will glean the following from the first half of the text: “The ways that can be walked are not the eternal way.” Fine. But is this practical advice or something philosophical? It sounds philosophical. Does the reader know why it sounds philosophical, and where it actually originates? Some readers will recognize that line as a Polish rendition of the first six characters shown here…
Dào kě dào,
fēi cháng dào.
Míng kě míng,
fēi cháng míng.
道可道，非常道。名可名，非常名。
…from the Dào Dé Jīng 道德經. Some readers might even suspect that the Polish rendition shows the influence of page 59 in Mair (Tao Te Ching, 1990). (And in fact, that is the genesis of the Polish text: I produced it by plugging two lines of Mair’s translation into Google Translate; then I rewrote the Polish in ASCII hex, then translated the hex to binary.)
Let’s step back and appreciate how many steps are involved in getting from the data to the information: First, one must have a suspicion that the string of zeroes and ones is the binary translation of some ASCII codes. Next, one must know how to get from ASCII to the Central European Alphabet (which includes exotica such as ‘lower case a with ogonek’). Call these two steps rudimentary if you like, but neither can be avoided. Next, one must either know Polish or recognize the text as ‘something like Polish’ so that one can get it translated to one’s own native tongue. But is the message just that the ways that can be walked are not the eternal way? Without a philosophical interpretation, that is near gibberish; not yet good information, not yet the message intended. For this particular message (01000100…01101101), the information ‘payload’ depends on the reader already knowing what the Dào Dé Jīng 道德經 is. Just “knowing Polish” is not enough to get the message.
The little story above may sound contrived and convoluted. Well, it is slightly contrived, but data and information often relate to one another in ways that are nearly this complex. The salient point is that there is no such thing as information just floating in a vacuum. A sentient being needs to ‘observe’ the data (shades of Berkeley and the tree falling down), and this sentient being must also bring with her a context into which to place the data. Then and only then does actual information come into play. The information step always involves the contribution of some such ‘outside’ element which will bring the dead data to life.
To further illustrate the point about the need for context, I will follow up with a seemingly very simple example, call it the minimal or ‘paradigm’ case:
01000111 01101111 00100001b
47 6F 21h
The binary (b) and hex (h) digits above decode to the Roman letters ‘G’ and ‘o’ followed by an exclamation mark, which is to say: ‘Go!’ At the trivial level, we may say that the information conveyed by ‘Go!’ is the imperative form of the verb ‘to go’. Could ‘Go!’ convey something else? Yes. Given a bit of context, the information content of ‘Go!’ could be: “Hurry up, children. It’s time to go! Otherwise, you might miss the school bus.” Or, in another context, the information content of ‘Go!’ could be the following: A military commander is ordering a pilot to take off from Tinian and drop a bomb on Hiroshima (or, to update the example, a commander orders a technician to launch a drone that will, collaterally, kill women and children). Which of these three instances of information (one trivial, one nontrivial, and one rather horrifying) is conveyed by ‘47 6F 21h’? Surely even a physicist (Susskind) who frets about what gets lost in a black hole (or biologists such as Dawkins and Danchin who glibly sing the praises of a supposed ‘information technology’ inherent in DNA) should be able to see that none of our three instances of information is conveyed by ‘47 6F 21’. That’s because ‘47 6F 21’ is just six digits of data. This has been a long way of saying: There really is such a thing as a data / information distinction.
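The paradigm case is easy to check mechanically. The sketch below is mine, not the author's: it confirms that the binary string and the hex string are the same three bytes, and that nothing in those bytes selects among the three interpretations.

```python
# Sketch (mine): the binary string and the hex string in the 'Go!' example
# are one and the same three bytes of data; the bytes carry no context.
data_from_bits = bytes(int(g, 2) for g in "01000111 01101111 00100001".split())
data_from_hex = bytes.fromhex("476F21")

assert data_from_bits == data_from_hex    # identical data either way
print(data_from_hex.decode("ascii"))      # prints "Go!" (pure 7-bit ASCII this time)
```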
Selected readings
- "'The data are': How fetishism makes us stupid" (1/1/13)
- "Data" (8/10/15)
- "The Data Says …" (10/24/19)
- "Plural data" (10/3/14)
- "The sparseness of linguistic data" (4/7/14) — "'Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.'"
Martin said,
February 7, 2021 @ 4:46 am
'Information' in physics is a well-established term of art (https://en.wikipedia.org/wiki/Physical_information) referring to the physical state of a system and governed by certain physical laws that appear to be violated in the case of a black hole — as distinct from data, which would be what we could measure about that system. It's not completely clear that the writer does 'actually know what information is' in this context, or whether it matters that you could come up with a different definition in a different context.
Jamie said,
February 7, 2021 @ 6:06 am
There may be some cases where data is (or can be) distinguished from information (aka meaningful data). But I don't think it is right to suggest that it is always true, or that the author's distinction is the only one.
Maybe the final sentence would be better as:
"There really can be such things as data / information distinctions."
~flow said,
February 7, 2021 @ 6:14 am
I think part of the problem here is that practitioners of different fields, including the non-specialist audience, will use terms with differing interpretations. As for 'information', there is definitely more than a single sense across different jargons.
The way I use it, 'data' is 'the marks written down', 'the sounds recorded on tape', 'the specific words used', as opposed to 'information', which I conceptualize as 'what you can do with data'.
As far as DNA is concerned, I'd say that the nucleotide sequences represent data for the cellular processes of various kinds to act upon; *what* a given process does with that data is what I'd call the 'information' of that nucleotide sequence. My impression is that the popular understanding of genetic information encoding is such that it vastly underestimates the importance of the environment: you cannot just drop some DNA somewhere and expect it to do anything useful, you need to have the conditions that prevail in the right cells, and this includes the myriads of contributing factors like presence of complex proteins and the right temperature, and therefore, ultimately, the existence of a source of heat and light (our sun). It is only under these very specific and, cosmologically speaking, certainly unlikely conditions that data as present in DNA can be expressed as information.
Neither 'data' nor 'information' makes much sense outside of a defined purpose; for example, the 'data' that 'is physically there' in an old manuscript may or may not include the kind of ink used, depending on whether you are concerned with a precise original wording an author used or whether you're doing a forgery analysis. Likewise, it would seem to me that what is 'information' for one way of looking at it can become the 'data' of another level of analysis.
As for the question whether 'information' can be 'transmitted' (transported), I tend to say that—strictly speaking—no, 'information' can only be reconstructed; what *is* being transported is 'data', and that transport always needs a physical medium such as a piece of paper or an electromagnetic wave. OTOH, in a looser sense, yes: information that could be gleaned from some data at point A at time t0, and can at a later time t1 be reconstructed from the data that were transmitted to point B, *can* be said to have been transmitted (using data transmission as its carrier if you will). This is certainly similar to the way we talk about 'generating' or 'producing energy', or 'to waste energy', where for all non-cosmological scales, energy is always preserved. It is not so much that these wordings are deeply at fault; it's more a matter of finding the right frame to look at these ways of expression.
As for the usage of 'information' in connection to black holes, I can recommend Dr. Hossenfelder's videos on Black Holes and Loss of Information (https://www.youtube.com/results?search_query=hossenfelder+information+black+hole&page&utm_source=opensearch). Spoiler: she's quite critical and outspoken about it.
Frédéric Grosshans said,
February 7, 2021 @ 6:30 am
Furthermore, recent developments of the blackhole problem show strong links with (a quantum generalisation of) Shannon information theory.
The examples given by Conal Boyce could be canonical examples from an information theory textbook of different sets of data containing the same information, with the context (Polish language, knowledge of the Dào Dé Jīng 道德經) being an example of the concept of conditional mutual information.
Note that I have no doubt that a full translation of the Dào Dé Jīng 道德經 into Polish, encoded in an arbitrary binary encoding (say EBCDIC), could easily be deciphered with reasonable effort, and contains all the relevant information. To take a real example of something similar which has actually happened: the Linear B tablets contained exactly the same amount of information before and after Ventris's decipherment. Calling this text “merely data” before 1952 makes no sense.
Philip Taylor said,
February 7, 2021 @ 6:32 am
For me (and I very much suspect that this goes right back to my school days), "data" are what one is given. "Information" may be given, may be derived, may even be inferred, but the concept of "given" is, for me, inseparable from the concept of "data".
Thomas Hutcheson said,
February 7, 2021 @ 7:38 am
Careful about physicists "yammering" about "information." It may be (I think it is the case) that they use "information" in a way that others would use "data," but lots of luck trying to change the profession's use of the word.
Isn't this a case of language being the way it is actually used? Does data/information necessarily cause any more confusion than set/sit?
Cervantes said,
February 7, 2021 @ 8:01 am
I think the distinction you really intend is between data/information, and understanding or interpretation. The binary code has exactly the same meaning as the Polish or Chinese, it's just in a different alphabet. It isn't any more or less information than the Polish or Chinese, it just requires more layers of knowledge to interpret it.
As for your last examples, this is a very well known phenomenon in sociolinguistics. Speech acts are underdetermined by their semantic content; you need context, including oftentimes a shared understanding between speaker and interlocutor that may not be available to others. That doesn't create any distinction between data and information, however. Both terms can describe the actual words; the words themselves ARE information — but there is additional data/information, not included in the words themselves, that one requires for proper interpretation.
milu said,
February 7, 2021 @ 9:44 am
Just an editing note: §3 has "even for a biologist or astrophysicist set in his ways", which is old-fashioned, and also inconsistent with the later "this sentient being must also bring with her a context into which to place the data".
Also, in a biology context, specifically a critical discussion of Dawkins' "selfish gene" theory, i have read it pointed out that "information" smacks of platonic essentialism. Apparently "in-form-ation" has philosophical baggage in the sense of "the process by which an essence (form) molds matter", where matter is understood as inert goo.
Don't know how useful that is to you, but i thought i'd mention it. The choice of one word over another could result from affiliation within a philosophical tradition as much as from common modern usage
Rodger C said,
February 7, 2021 @ 9:45 am
"Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?"
Peter Berry said,
February 7, 2021 @ 11:04 am
You ask for technical clarification. How did you convert that Polish text to hexadecimal? It doesn't correspond to any encoding for Polish that I know of. For example ó is supposedly encoded as A2, which is not a valid UTF-8 code unit by itself; the Unicode code point U+00A2 is CENT SIGN (¢), while A2 is ˘ (breve) in ISO 8859-2 and ą (lower case a with ogonek) in ISO 8859-16. In these encodings ó should be F3, represented in the variable-width UTF-8 as C3 B3.
Additionally, you refer to ASCII. The text certainly is not ASCII, which is a 7-bit encoding so it has no code points/units above 7F, and cannot represent Polish complete with ogoneks and so on. I recommend Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" which dates from 2003 but is still relevant.
Shimon Edelman said,
February 7, 2021 @ 11:36 am
The realization that the use of language in communication is more profitably seen as activating preexisting representations in the listener (rather than sending along a certain number of bits of information, which single out THIS message out of a list of possible agreed-upon ones) is not new. For some relevant references, see my "Language and other complex behaviors: unifying characteristics, computational models, neural mechanisms", Language Sciences 62:91-123 (2017). DOI: http://doi.org/10.1016/j.langsci.2017.04.003
Colin Watson said,
February 7, 2021 @ 11:49 am
Peter Berry: I think it's CP852 (https://en.wikipedia.org/wiki/Code_page_852). Pretty obsolete these days seeing as Unicode exists, though.
Jerry Friedman said,
February 7, 2021 @ 12:13 pm
The part about black holes appears to come from a complete misunderstanding. The "information loss" refers to the hypothesis that the only characteristics of a black hole are mass (or energy if you like), angular momentum, and electric charge. There is no way that observation of a black hole can reveal information about whether the material that went into it was hydrogen or neutron-star matter or "the complete works of Marcel Proust in ten leather-bound volumes, or a television set" (to quote Hawking from memory in the context of what a black hole could radiate). That seems to be what Conal Boyce means by "information".
For more on this, see "Seeking proof for the no-hair theorem": https://phys.org/news/2014-09-proof-no-hair-theorem.html
The idea that what Hawking and Susskind were or are fretting about is loss of mass, or violation of the non-existent physics principle of "conservation of mass", is absurd.
Jerry Friedman said,
February 7, 2021 @ 12:15 pm
Sorry, here's that article.
Jason M said,
February 7, 2021 @ 12:17 pm
To this biologist, it makes sense to distinguish data and information. The genetic code, written in strings comprising 4 text characters is certainly data. It is information only in context. At the most basic, the context would be: 1) knowing that the string of text represents a string of DNA nucleotides; 2) knowing which part of which creature’s genome the string of characters denoting nucleotides represents.
Beyond all the complexities of cellular, instantaneous context elaborated by @~flow, the latter is critical because each organism — and to a lesser extent each cell within each organism — subdivides its interpretation of DNA strings. The easiest of these subdivisions to explain is the one that directly, eventually, is transcribed and translated to encode a specific protein. In many organisms — like humans — the protein-encoding portion of the DNA ‘data’ is tiny. Even within that translatable portion, if one doesn’t know the open reading frame (i.e., where the protein translation is supposed to start and which direction it’s supposed to go in), one will get mostly short gibberish strings of amino acids out the other end and not a protein. And, in reality, the gibberish will be seen long before the cell wastes its time even trying to translate.
So strings of nucleotides represented by text characters are certainly not information; one cannot even know what protein they might encode without knowing a few other key facts…. Now, of course, there are tools we use all the time to take a sort of “Google Translate Detect the Language” approach to this problem, e.g., tools of the BLAST family, which will simply align the presumptive DNA string with every known organism’s DNA and protein sequences. If your string returns long, real-sounding matches across multiple organisms, you can guess that your string encodes a protein in some animal, assuming it represents an actual DNA molecule sequence in some actual cell somewhere.
Anyway. Fun. A Schrödingerian definition of data interpretation: that the cat is in a box with a mortal dilemma is data. Whether it actually winds up dead or alive is information?
Benjamin E. Orsatti said,
February 7, 2021 @ 2:11 pm
“ The binary code has exactly the same meaning as the Polish or Chinese”
Traduttore, traditore.
Peter Berry said,
February 7, 2021 @ 2:35 pm
Colin Watson: Ah yes, that checks out. It's a pet peeve of mine that people use "ASCII" to refer not only to ASCII proper but to all of its various mutually incompatible extensions (as well as Unicode which isn't even an encoding as such).
(Also, that hex string doesn't include the line breaks, so some words are smashed together with no space between them if you disregard its context where it has them.)
Calvin said,
February 7, 2021 @ 2:46 pm
Data is a representation, whereas information is an interpretation with some context.
There you can have "misinformation" and "disinformation". As for data, it can be "corrupted", unintentionally or maliciously. You can counter that with correction and authentication.
Conal said,
February 7, 2021 @ 3:20 pm
Many thanks to the baker's dozen of LL members who have provided feedback already, within minutes it seems!
For certain comments, it will take time for me to work through the ramifications off-line, so to speak (e.g., Martin's suggestion that I consult the Wikipedia page on 'physical information'). But to some I can respond now:
Thank you milu for pointing out 'his…' followed by 'her…' (So much for a 77-year-old trying to be au courant…)
Encoding: To me, this encoding business was just an innocent game, not meant as an exercise in computer science to produce a practical tool. But knowing that some may read it that way (Peter Berry), in an edited version of Appendix P I'll certainly point out that I'm not using Unicode, and not showing real ASCII 7-bit encoding. Colin is correct — what I did use was the CP852 code sheet found at ascii-codes.com. (The coding is all accurate, and triple-checked; it's just that it is amateurish and obsolete, from a certain "insider's" perspective.)
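For the record, the round trip can be checked in a couple of lines of Python (a sketch of mine; Python ships a `cp852` codec), using the first hex line of Appendix P:

```python
# Sketch: verify that the first hex line of Appendix P decodes to the
# Polish text under CP852, the code page identified above.
hex_line = ("44 72 6F 67 69 2C 20 6B 74 A2 72 79 6D 69 20"
            " 6D 6F BE 6E 61 20 63 68 6F 64 7A 69 86 2C 20")
text = bytes.fromhex(hex_line.replace(" ", "")).decode("cp852")
print(text)   # prints "Drogi, którymi można chodzić, "

# The code-page-specific bytes, spelled out:
assert "ó".encode("cp852") == b"\xA2"
assert "ż".encode("cp852") == b"\xBE"
assert "ć".encode("cp852") == b"\x86"
```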
The long tangent that we might pursue about encoding reminds me to mention this:
I was surprised that no one commented on my 'minimalist' example with 'Go!' and a drone that kills women and children. One thing I've learned already from your various reactions is that I must bring the 'Go!' example to the fore, and downplay the Polish/Dao De Jing example, which is meant only as entertainment (but can easily induce a reader to get bogged down in something irrelevant, such as the question of whether the whole DaoDeJing in Polish can be 'deciphered' or not — of course it can! And so what?)
So, everyone, please look carefully at the "Go!" example, by itself!
Thank you ~flow for mentioning Dr. Hossenfelder. And speaking of her, I'll now quote from her post of 18 November 2020:
"The Black Hole Information Loss Problem has actually nothing to do with information […] Really, the issue is not 'loss of information' which is an extremely vague phrase. The issue is time irreversibility" (@3:53-4:08 in https://www.youtube.com/watch?v=mqLM3JYUByM).
In other words, she is one of the exceedingly rare physicists who DOES seem to understand the distinction I'm talking about. (This is confirmed when later, @10:19-10:43, she uses the term 'data' three times, correctly.) In contrast (and more characteristically of the field), after Susskind delivered a lecture on black holes at Stanford, a biologist in the audience asked him why he didn't draw a distinction between bathtub temperature 'data' and bathtub temperature 'information', to which Susskind's response was: "Well, data and information are pretty much the same thing, aren't they?" thus blowing off the biologist's perfectly good question.
Speaking of biologists, thank you very much Jason M for your feedback.
Sniffnoy said,
February 7, 2021 @ 4:03 pm
No, sorry, the term is indeed "information" as in information theory. It doesn't matter if Dr. Boyce thinks it ought to be called "data theory", "information" is what it is called in all technical contexts. If he wants to introduce a distinction between what he calls "data" and what he calls "information", he had better come up with terms for them that do not conflict with existing technical usage.
As it is, I'm skeptical such a project can be made to work. The supposed distinction between "information" and "meaningful information" is one that the creationists would yammer on about when they were much in the public eye, but they never provided any serious account of the distinction when pressed. While I'll admit that such people might not have been the best to carry out such a research program, I remain skeptical that the distinction can be made meaningful.
Matt said,
February 7, 2021 @ 4:19 pm
Without knowing the wider context of the project to see why this is of relevance, as a stand-alone piece, I find it rather unnecessary:
It seems to state the obvious, that context is necessary to interpret/act on data.
And yet it is unnecessarily pedantic in insisting on a distinction between “data” and “information”, when ironically, context makes it perfectly clear what is meant in most cases.
I do not doubt, after all, that the highly intelligent physicists and biologists you have quoted understand such a distinction in concept and have made a deliberate language choice, knowing that their audience will also understand.
For example, I imagine it is Dawkins’ key point that interpretable information exists within DNA, and that is what he wants to convey to his readers… without the unnecessary circumlocution of saying that DNA contains data which can be interpreted to derive information (an irrelevant distinction for his purpose).
—
On a separate note, I struggled with this:
“A military commander is ordering a pilot to take off from Tinian and drop a bomb on Hiroshima (or, to update the example, a commander orders a technician to launch a drone that will, collaterally, kill women and children).”
Upon first reading, despite the brackets, this stuck in my head as two separate examples.
Which meant when I reached this:
“Which of these three instances of information (one trivial, one nontrivial, and one rather horrifying) is conveyed by ‘47 6F 21h’?”
I was left trying to work out which of the drone strike and the atomic bomb was merely “nontrivial” instead of “horrifying”.
It took me several re-reads over the preceding paragraph to realise the first example was “the imperative form of to go” which is buried in a much earlier sentence, and the bomb/drone was one example combined.
So I would suggest removing the Hiroshima reference and just using the latter generic drone strike, and perhaps breaking all 3 examples into a numbered list to make it easier to follow.
Gregory Kusnick said,
February 7, 2021 @ 5:11 pm
While the raw nucleotide sequence of DNA can in some sense be regarded as mere "data", it's nevertheless possible to model DNA transcription and protein synthesis as a communications channel with a well-defined information content (namely, enough to specify one of 20 amino acids per codon). This is a legitimate, objective metric that does not depend on the existence of sentient observers.
With regard to black-hole physics, Susskind and Hawking certainly know precisely what they mean by "information", and for that matter so does Hossenfelder. Her rhetoric about vagueness is just peevery about well-established terminology that she happens not to like.
By invoking sentient observers, Boyce seems to be confusing information with something more subjective such as knowledge or meaning. By invoking Penrose, he seems to be suggesting that there's something inherently mysterious about information that mere mechanistic models can never capture.
Henry Milner said,
February 7, 2021 @ 7:17 pm
With respect to Dr Boyce, this was a disappointing article, for the reasons others have articulated above. This attempt to make a categorical distinction between things that require interpretation and things that embody understanding is an old idea, fought over by deep thinkers like Searle and Dennett.
It’s hubris to suggest that preeminent physicists are misusing a term of art in their field, which (at least in my limited experience) many other physicists, statisticians, and computer scientists also use. Surely even the hardest prescriptivist would allow people their technical terms of art.
I suggest sticking to more limited claims: for example, that Hawking et al. use the word in a way that bamboozles laypersons, or that they are themselves naive to the philosophical argument you’re making. I think there is some truth to that, although no hard proof is offered of the meaningfulness of a dualist data/information distinction, so perhaps the physicists are wise to ignore it.
As it is, this article reminded me of the “Igon Value Problem” — I felt the author was straying unwittingly from his areas of expertise.
The girl from the bus said,
February 7, 2021 @ 9:34 pm
@Gregory Kusnick:
"By invoking sentient observers, Boyce seems to be confusing information with something more subjective such as knowledge or meaning."
Well…
Information is not knowledge.
Knowledge is not wisdom.
Wisdom is not truth.
Truth is not beauty.
Beauty is not love.
Love is not music.
Music is the best.
—Frank Zappa
Conal said,
February 8, 2021 @ 3:59 am
Point taken about the trivial, nontrivial and horrifying examples; I'll make the 3-way mapping explicit. Thank you, Matt, for the detailed critique. And yes, 'sentient' was a mistake, as pointed out by Gregory Kusnick. I'd be better off saying: "An outside entity such as a robot or a human is needed" (because the data alone is 'necessary but not sufficient' [usually] for achieving the information level).
A bit of background on the terms 'information theory' and 'the Shannon entropy'. Here is the fifth sentence in Shannon's landmark paper of 1948: "These semantic aspects of communication are irrelevant to the engineering problem." In other words, he warns the reader that the topic of his paper is a mathematical theory of data communication, nothing to do with information. But journalists had to make it sound appealing by calling it 'information theory' and soon Shannon himself (in the 1950s) caved in to the pressure. (Note that 'information theory' is now one of the most pernicious and 'successful' factoids of all time; there is no such thing as 'information theory' — not until Superintelligent AIs create it for us in the near future. And they will be sure to include the information content of Bach fugues and of Frank Zappa's music, by the way — referring here to The girl from the bus.)
As for 'the Shannon entropy', that goes back to a joke that von Neumann played on us all. When Shannon was casting about for a term to use, von Neumann suggested 'entropy' because "nobody knows what entropy really is, so in a debate you will always have the advantage" (as recounted by Tribus in Sci. Am. 225, 1971, 179-184). The so-called Shannon entropy, which has taken on a kind of mystique, is actually just a very pedestrian thing that I call the 'Data Encoding Richness Needed' value (DERN), for a given 'message source language'. Nothing woo-woo about it, and certainly nothing to do with Boltzmann and all that. It is because of such historical considerations that Danchin warns near the beginning of his 75-page chapter called "Information and Creation" that "The ground we will have to cross is a minefield" (The Delphic Boat, p. 171). Indirectly, what I'm saying here is that No, there is nothing simple and obvious about this topic, as assumed by some of my critics above. I love the word 'peevery' that I just learned from Gregory Kusnick (but it would be a shame to assume it applies to Hossenfelder's BHILP video, something I've been 'waiting for', as it were, for 10 years — i.e., a physicist who has a clue about the 'minefield' alluded to by Danchin).
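Whatever one thinks of the name, the quantity itself is easy to exhibit. Here is a sketch (my code and my function name, not Boyce's 'DERN' as he defines it) of the standard average-bits-per-symbol reading of Shannon's H:

```python
import math
from collections import Counter

# Sketch (mine): Shannon's H for a message source, read as the average
# number of bits per symbol needed to encode output with those symbol
# frequencies -- the pedestrian 'encoding richness' reading sketched above.
def entropy_bits_per_symbol(text: str) -> float:
    counts = Counter(text)
    n = len(text)
    # H = sum over symbols of p * log2(1/p), with p = count/n
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(entropy_bits_per_symbol("aaaa"))  # 0.0 -- one symbol, nothing to encode
print(entropy_bits_per_symbol("abab"))  # 1.0 -- one bit per symbol suffices
```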
Matt said,
February 8, 2021 @ 7:00 am
> But journalists had to make it sound appealing by calling it 'information theory' and soon Shannon himself (in the 1950s) caved in to the pressure.
So what you are saying is that a term was agreed upon 70 years ago, has been in continuous use ever since, and most people working in related fields will have heard no other term in their lifetime?
Not only has the horse bolted, but its descendants have long since settled a new continent and evolved into a different species.
Peter Taylor said,
February 8, 2021 @ 9:20 am
The distinction which is being drawn seems to me to be a distinction between encoding and meaning, not between data and information.
I am uneasy with the implied premise that a sequence of symbols (e.g. binary digits) is inherently data. I think that my intuition is that data is a structured encoding of facts: if a sequence of symbols has a lot of information in the Shannon sense then it probably will turn out to be data, but if it seems completely random then it could equally well be data or pure random noise, unless the context dictates otherwise.
This phrasing feels very unnatural to me. Taking a non-Shannon interpretation of information, I would say that the information conveyed by 'Go!' is that the speaker wishes the hearer either to move or to execute some pre-arranged plan of action. Although, as you say, it is contextual: another possibility is that the speaker has just remembered the name of the Japanese board game which had previously escaped their memory.
Jerry Friedman said,
February 8, 2021 @ 4:33 pm
Conal Boyce has omitted von Neumann's other reason for suggesting the name "entropy" (instead of "uncertainty"). Tribus's version of Shannon's recollection was, "You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage." So von Neumann had a perfectly sound reason for suggesting the name. I agree that his second reason, including "more important", was a joke.
(I found that quotation in a book called A Farewell to Entropy: Statistical Thermodynamics Based on Information by Arieh Ben-Naim, a book where I think Prof. Boyce would find a few things to agree with and a lot to disagree with.)
And the connection is not merely mathematical. Boltzmann's entropy is the amount of information in Shannon's sense needed to specify the molecular-level details at an instant of, say, a sample of gas, if the macroscopic facts (e.g., pressure, temperature and volume) are known.
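[A toy numerical illustration of that connection, assuming for simplicity W equally probable microstates — the sketch and its names are mine, not from the comment:]

```python
from math import log, log2

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact, 2019 SI definition)

def boltzmann_entropy(W: int) -> float:
    """S = k_B * ln(W) for W equally probable microstates."""
    return K_B * log(W)

def shannon_bits(W: int) -> float:
    """Bits needed to single out one of W equally likely microstates."""
    return log2(W)

# The two differ only by the constant factor k_B * ln(2): Boltzmann's
# entropy is (proportional to) the Shannon information needed to pin
# down the microstate once the macrostate is known.
W = 2**20
print(boltzmann_entropy(W))
print(K_B * log(2) * shannon_bits(W))  # same value, reached via bits
```

[For realistic gas samples W is not uniform and astronomically large, but the proportionality is the same.]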
Gregory Kusnick said,
February 8, 2021 @ 5:30 pm
Curious about who Conal Boyce is (VHM's intro doesn't give us much to go on), I did a bit of googling and found this, in which he expresses some (shall we say) unconventional views about the mathematical legitimacy of pi, e, the square root of two, and irrational numbers in general. (One can only guess about his opinion of complex numbers.)
Make what you will of that.
GH said,
February 8, 2021 @ 5:51 pm
I agree with other commenters who find the appendix unsatisfying. It appears to beg the question, asserting a fundamental distinction between data and information without ever clearly defining it, making an argument why this terminology is correct and others are wrong, or providing one iota of evidence that it is in fact how the words are used. (In contrast, the recent peevery in the comments over the "misuse" of "weight" for "mass" was able to provide all three of these elements… and was still refuted.) It can only be convincing to those who already agree.
As several others have said, Boyce appears to use "information" more or less as a synonym for "meaning." This strikes me as a quite restrictive sense of the word, but if we accept it, the question of how information relates to data—how symbols acquire meaning—can hardly be considered clear-cut. If we assume that "meaning" is something that only makes sense with reference to a mind, we're bang up against the big open questions in philosophy: the mind–body problem, the hard problem of consciousness, etc. (When Boyce in the comments accepts a robot as a possible receiver/generator of information, these questions become acute.)
An alternative distinction between "data" and "information" might be to say that information is data that controls (or acts as input to) some defined process. So, for example, by itself a pattern of holes on a reel of paper would be data, but if that reel was fed to a player piano to produce music, it would be, in that context, information—and only the aspects of the pattern that actually make a difference to the operation of the device. And the same obviously for computer punch cards, DNA, etc. Any sensory input to a brain would be information because it affects the brain's processing.
I'm not claiming this definition is actually in use, but it might get us somewhat closer to the distinction desired by Boyce without getting tangled up in unfathomable questions of what it means to think.
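[The proposed definition can be sketched with a toy 'player piano' — everything here is illustrative, not from the post:]

```python
# Under the definition above, the reel on its own is mere data: a
# pattern of holes. It functions as "information" only relative to a
# defined process that consumes it — here, a toy three-pipe player
# piano. The note mapping lives in the device, not in the reel.

reel = [
    (1, 0, 0),  # each tuple: which of three pipes opens at this step
    (0, 1, 0),
    (0, 0, 1),
    (1, 0, 1),
]

NOTES = ("C", "E", "G")

def play(reel):
    """The 'defined process': turns hole patterns into note names."""
    for step in reel:
        yield [NOTES[i] for i, hole in enumerate(step) if hole]

print(list(play(reel)))  # [['C'], ['E'], ['G'], ['C', 'G']]
```

[Only the columns the mechanism actually reads make a difference; any extra holes would remain data that never becomes information, matching GH's caveat above.]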
Jerry Friedman said,
February 8, 2021 @ 6:31 pm
GH: I missed peevery over "weight" versus "mass"? Do you remember where that was? I can't find it by searching for "mass".
Philip Taylor said,
February 8, 2021 @ 6:44 pm
Jerry — "How a physicist talks" (under "How a porcupine talks").
Jerry Friedman said,
February 8, 2021 @ 7:03 pm
Philip Taylor: Thanks. I did miss that at the time.
Charles Gaulke said,
February 9, 2021 @ 1:13 pm
Gregory Kusnick:
Oof. I didn't get too far into that before running into some real howlers – irrational numbers can't be constants, or included in ratios? And the business about algorithms that run forever – this is the kind of crankery I used to read about on Mark Chu-Carroll's GoodMath/BadMath blog. If that is the same person as this guest post, it doesn't seem likely there's a productive discussion to be had with them about the idea of words being used differently in different contexts.
Ian said,
February 11, 2021 @ 1:55 pm
Interesting parallels to what I do for work, a big part of which is "data modeling", the goal of which, I suppose, is to arrive at "information". We model data using the YANG language so that external APIs can interact with our system and know the relevance of the various data structures.