In each case there is a common sense notion of information that is a more restricted special case reflecting our interests and capacity to represent, as based in the relevant distinctions within the system under study. The various mathematical theories can be regarded as dealing with information capacities, that is, with the ability to carry information in the usual sense (for more detail on this, see the discussion of the combinatorial approach to information). So far, the intentional aspect of information lies outside the capabilities of rigorous information theory, though capturing it remains a highly prized philosophical goal. A construction within information theory would start with one of the standard formulations of information theory, taking the elements of that theory as foundational, and would add various philosophically justifiable constraints (and possibly further operators) to produce a theory that corresponds to our intuitions about intentionality, as mediated by a process of reflective equilibration (reflective readjustment of our intuitions and the introduced constraints and operators). Despite the difference between technical definitions of information and our everyday usage, it has become conventional to refer to information capacity simply as information. This can be confusing to neophytes, so some experts prefer to reserve the term "information" for intentional cases, and use "complexity" for the other cases. Unfortunately, "complexity" also has many disparate uses. There is currently some movement towards a unified approach to information, but several philosophically interesting issues remain unresolved.
In philosophy, information theory has been applied to logic, perception,
The fundamental quantitative notion of information is of a unit of distinction (Spencer Brown 1972), called a logon by MacKay (1969) which enables the isolation of a distinguishable group. A distinction is thus an operation, or possible operation. A classification places particular objects in the domain of the classification according to the types of the classification. The types are abstract, permitting further specification, whereas particulars are maximally determinate. For example, in order to classify a tree, a grouping of properties that is sufficient to define a tree must be distinguished from the domain of properties in general. To represent a specific tree, this grouping must be specific enough to pick out the tree in question. More information is required to distinguish a specific tree than a tree in general, since a specific tree is first of all a tree, and must be distinguished from other trees by at least some of its peculiar properties. This might not be obvious. We can, for example, pick out a specific tree by pointing, or saying "that tall skinny thing over there", which doesn't contain any representation of a tree. The problem is that pointing and demonstrative reference do not constitute representations, but can guide us to representations. (This is also true of definite descriptions, at least in their referential mode, as opposed to their descriptive mode, in which the description does not need to be true in order to achieve reference.) In general, the more specific the representation, the more information it requires, i.e. the more distinctions are involved. However, this holds only for representations ordered by determinateness on W.E. Johnson's determinable/determinate scale. Structures of the same determinateness involve less or more information, respectively, depending on whether they are more or less regular, or ordered.
To take a simple example, consider a frame cube and a spatial structure composed
of eight irregularly placed nodes with straight line connections between each
node. Both structures may encompass the same volume with the same number of
components, but the regularity of the cube reduces the amount of information
required to specify it. This information reduction results from the mutual constraints
on values in the system implied by the regularities in the cube: all the sides,
angles and nodes must be the same. This redundancy reduces, for example, the
amount of information required in a program that draws the cube over that required
by a program that draws the arbitrary eight node volume. On the other hand,
the notion of a cube involves more information than the notion of an eight node
volume. This is because "cube" is more determinate than the much more determinable
"eight node volume".
It has become popular to talk of complex structures as being midway between
highly ordered structures like perfect crystals and highly disordered structures
like ideal gases, along the lines of the figure to the left. The scale between
the two (the x axis) is a measure of the amount of information required to fully
determine the structure. The crystal requires determining the location of a
few atoms and the repetitive relations to the other atoms of the crystal, while
the gas requires the specification of the position and momentum of each molecule,
a very large amount of information indeed. Note, however, that in describing
the gas as low in complexity one is abstracting from the detailed behaviour
of its molecules, but this is not so in the case of the crystal, in which the
position and location of at least some individual atoms is determined. The comparison
is possible, but it is misleading. It is better to compare complexity at a comparable
level of determinateness, and recognise that there are two dimensions to the
middle realm, organisation and complexity, which have differing measures, as
indicated in the figure to the right. An interesting question is whether there
is an information theoretic quantification of organisation. Charles Bennett
has proposed logical depth as a suitable measure.
A sequence of 32 '7's requires a shorter program to produce (namely one specifying
5 doublings of an initial output of '7') than does an arbitrary sequence of
decimal digits. To take a less obvious case, any specific sequence of digits
in the expansion of the transcendental number π = 3.14159... can be produced
with a relatively short program, despite the apparent randomness of expansions
of π. The information
required to unambiguously describe certain types of structures can be compressed
due to the redundant information they contain; other structures can not be so
compressed. This is a property of the constraints contained in the structures,
not directly of any particular description of the structures, or language used
for description. The length of a shortest description of a structure encoded
as a string of 1s and 0s represents the amount of information in the structure.
This length is the minimal number of distinctions (logons) required to define
the structure.
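The contrast between compressible and incompressible sequences can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the text: the helper names are hypothetical, and the length of a zlib-compressed string is used only as a crude, computable upper bound on description length. The true shortest description is not computable in general, so this is an illustration rather than a measure of logons.

```python
import random
import zlib


def doubled_sevens(doublings: int = 5) -> str:
    """Build the regular sequence by repeatedly doubling an initial '7'."""
    out = "7"
    for _ in range(doublings):
        out += out  # each pass doubles the length
    return out


def compressed_length(text: str) -> int:
    """Bytes needed by zlib at its highest setting: an upper-bound proxy
    for description length, not the true minimum."""
    return len(zlib.compress(text.encode("ascii"), 9))


regular = doubled_sevens(5)  # 2**5 = 32 sevens

# Longer strings make the effect visible despite zlib's fixed overhead.
rng = random.Random(0)  # fixed seed, used only for repeatability
long_regular = "7" * 4096
long_arbitrary = "".join(rng.choice("0123456789") for _ in range(4096))

print(len(regular))
print(compressed_length(long_regular), compressed_length(long_arbitrary))
```

The regular string compresses to a few dozen bytes, while the arbitrary digit string stays far larger, which mirrors the claim that redundancy is a property of the structure rather than of a particular description.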
Information, Logic and Computation
Anything that can be represented can be represented (in principle) as a string
of binary digits via an isomorphic mapping, as pictured to the right. The process
is similar to producing a string that is rather like the record of an extended
game of twenty questions, with 1s conventionally representing affirmation and
0s representing negation. A successful series of guesses produces a truth table
row that represents the original thing. The string, represented in the middle
in the picture to the right, has the same structure as the original structure
(more correctly, they are manifestations of the same abstract structure); that
is, for each set of distinctions in the structure, there is an equivalent set
of distinctions in the string. Each string has at least one most compressed
form that is non-redundant, represented by the bottom row in the picture to
the right. A most compressed string is a generator of all truth table rows implied
by what is true of whatever is represented, and can be thought of as a vector
in the minimal dimensional space in which the thing can be fully represented
(see Bell and Demopoulos 1996 for details of the notion of a generator). It is relatively
trivial to render truth-functional operators into information theoretic form.
Negation is just the complement of a string (replace 1s with 0s and 0s with 1s),
and conjunction is the mutual information between two strings (mutual information
is defined in most elementary works on information within the formalism of the
work; intuitively it is the information in both strings). The set of minimal
strings is equivalent to a set of equivalent truth-functional propositions.
This latter equivalence class can be thought of as the fundamental fact of the
structure. It has the advantage of being completely unambiguous, whereas individual
minimal length strings can equally well serve as truth functional generators
of the truth table rows that are satisfied by the original structure. Each proposition
in the set can form a basis for the binary linear Boolean space of truths about
the structure, and the variant propositions form alternative bases for this
space. Thus information theory and propositional logic have the same foundation,
converging in truth table rows together with the equivalence of the truth functional
operators with certain information theoretic functions.
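The correspondence between strings and truth table rows can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and conjunction is rendered as a plain digit-by-digit AND on truth-table columns, which is a simple stand-in for, not an implementation of, the mutual-information construction described above.

```python
from itertools import product


def truth_column(formula, n_vars: int) -> str:
    """Truth-table column of `formula` as a string of 1s and 0s,
    one digit per assignment of truth values to the variables."""
    return "".join(
        "1" if formula(*row) else "0"
        for row in product([True, False], repeat=n_vars)
    )


def complement(bits: str) -> str:
    """Negation: swap every 1 for 0 and every 0 for 1."""
    return "".join("0" if b == "1" else "1" for b in bits)


def bitwise_and(a: str, b: str) -> str:
    """Digit-wise AND of two equal-length strings (assumed stand-in
    for conjunction)."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


p = truth_column(lambda p, q: p, 2)  # rows TT, TF, FT, FF give "1100"
q = truth_column(lambda p, q: q, 2)  # "1010"

print(complement(p))      # 0011
print(bitwise_and(p, q))  # 1000
```

The complemented string matches the column for "not p", and the digit-wise AND matches the column for "p and q", so the string operations and the truth-functional operators agree row by row.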
A logon might be considered to be an element of the uncompressed string, but some of the logons would then be redundant. The redundancy means that the digit may not make a difference, or that its difference is not a distinct 1 or 0 value. The redundancy is eliminated only in the compressed form. Therefore, I define a logon as the value of a place in a maximally compressed string. The information represented by a digit of an uncompressed string can be one or fewer logons. A logon is therefore not the same as a bit, except for maximally compressed strings. Each digit of an uncompressed string is a bit of information, even though it may be redundant. The length of the compressed string is the information content in logons. In general, the number of bits in a string is not the same as its information content in logons.
Classifications distribute tokens among the types of a class according to "yes"
or "no" questions concerning whether they are of the type. As such, they
classify according to information, and a specific classification (assignment
of tokens to types) can be represented by a set of strings, one for each token.
A nonredundant classification would require that each of these strings is not
compressible. Ideally, classifications should be nonredundant, but this is often
not true in practice, and different classes (types) contain mutual information
that is not merely a consequence of their being subclasses within the same classification.
The "twenty question game" discussed in the previous section can be
thought of as a complete classification of some thing. If we have an ideal and
complete classification, the tokens will have equivalence classes of mutual
information that can be taken as the types of the classification. If we take
these to be predicates, and the tokens to be objects, we can abstract from particular
classifications and tokens to implicitly include all possible classifications
and tokens. We can then define existential quantification as the assertion that
a type is not empty, and universal quantification as the assertion that all
members of a type exist. This gives, together with the interpretation of truth-functional
logic, an information-theoretic interpretation of predicate (first-order) logic.
The relation between information and logic can be seen from another perspective.
George Spencer Brown's Laws of Form, or calculus of distinctions (1972),
uses the calculus of distinctions to derive truth functional logic. The distinction
can be thought of as an operation represented by Spencer Brown's basic symbol,
the right corner, in which case there is one primitive to his system, as Brown
thought. Alternatively, a distinction can be thought of as a symbol, in which
case there are at least two primitives, distinction (represented by a right
corner) and non-distinction (represented by a blank) (Cull and Frank 1979).
In any case, both the making of a distinction and the failure to make a distinction
are required, so a second state is implied by the operator approach. The failure
to make a distinction is just the blank, which may be regarded as the only constant
(Cull and Frank 1979). Banaschewski (1977)
showed that truth functional logic implies the calculus of distinctions, proving
that they are notational variants, since Spencer Brown had earlier proven that
truth-functional logic follows from the logic of distinctions. Cull and Frank
(1979) made this more perspicuous by rendering Brown's axioms
into a more standard notation, showing the equivalence of the calculus of distinctions
and two-element Boolean algebra directly, up to notational variation. Thus it
is firmly established that the logic of distinctions is none other than truth-functional
logic, or, equivalently, the two-element Boolean algebra (it is known that there
is only one two-element Boolean algebra). Present digital computers process
information by distinguishing between "on" and "off" states at certain locations,
and using these states, through their circuitry, to control the state of other
locations in the computer.
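The equivalence with two-element Boolean algebra can be made concrete with a small evaluator. This is a sketch under assumptions that go beyond the text: the right corner is written as a pair of parentheses, the blank as the empty string, and the Boolean reading used here (marked as true, a cross as negation, juxtaposition as disjunction) is one common interpretation, not the notation of Spencer Brown or of Cull and Frank. All function names are hypothetical.

```python
def _space(form: str, pos: int):
    """Evaluate the crosses standing side by side from `pos` onward.
    A space is marked if any cross in it is marked."""
    marked = False
    while pos < len(form) and form[pos] == "(":
        inner, pos = _space(form, pos + 1)
        if pos >= len(form) or form[pos] != ")":
            raise ValueError("unbalanced form")
        pos += 1
        # A cross is marked exactly when its interior is unmarked.
        marked = marked or not inner
    return marked, pos


def is_marked(form: str) -> bool:
    """True if the whole form reduces to the marked state."""
    marked, pos = _space(form, 0)
    if pos != len(form):
        raise ValueError("unbalanced form")
    return marked


# Assumed Boolean reading: TRUE is a single cross, FALSE is the blank.
TRUE, FALSE = "()", ""


def NOT(a: str) -> str:
    return "(" + a + ")"


def OR(a: str, b: str) -> str:
    return a + b


def AND(a: str, b: str) -> str:
    return NOT(OR(NOT(a), NOT(b)))


print(is_marked("()()"))  # True: two crosses side by side condense to one
print(is_marked("(())"))  # False: a cross within a cross cancels to the blank
```

The two printed cases correspond to the two initial equations of the calculus, and the derived `AND` reproduces the ordinary truth table, which is the sense in which the two systems are notational variants.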
There are several ways to approach information theory. The most familiar is
the statistical approach used in Shannon's communication theory. Since communication
theory is a rather high order application, I will leave full discussion of it
until later. It is closely related to the combinatorial approach. Both these
approaches are better suited to ensembles of system states rather than individual
states. Algorithmic information theory, on the other hand, is well suited to
individual states. It also has practical variants that try to estimate the complexity
of a data set. A third variant is a theoretical demonstration using Boolean
rings and sub-rings to demonstrate that probability is dispensable in information
theory. Due to the technical nature of these approaches, I have set them on
separate pages:
Randomness and Probability
The most refined approach to defining randomness is found within the algorithmic
complexity approach to information, and goes back to Kolmogorov (1968),
who also gave a standard axiomatisation of probability theory. The approach
is based on the noncomputability of incompressible strings by any program
shorter than the strings themselves. If a string has this characteristic, then
it is not distinguishable from a random string by any effective statistical
test. Some of the more important details are given here.
Organisation is the co-ordination or interdependence of parts or components,
especially in support of vital functioning (OED). A living body, for example,
is well organised when its organs so interrelate that the body as a whole can
maintain all its vital functions. Correlations entail descriptive redundancy;
if A, B and C are correlated in respect X we may replace their independent description
{A(X), B(X), C(X)} with {A(X), R(A,B,C)} and so on. So a formal characterisation
of organisation might well focus on a specification in terms of redundancy.
Following Shannon (1949), redundancy orders are determined
by the minimal number of elements in which a redundancy can be detected, so
the redundancy in a system can be decomposed into orders n based on the number
of components, k_n, required to detect the redundancy of order n. Order 1 redundancy
can be detected by examining elements of a system pairwise, whereas order n
redundancy is detectable over a minimum of 2n elements. Examples of low order
redundancies are the simple repetitions of molecular arrangement in a crystal
and the requirement that being a word of English places on sequences of letters.
An example of higher order redundancy is the long-range correlations imposed
by being a sequence from a possible lost Shakespearean play or being a sequence
of letters from a PhD thesis.
To be organised requires redundancy. But real systems show various combinations
of high and low order redundancy, local and global redundancy. This provides
an internal richness to the notion of organisation. It also undermines any attempt
to provide a simple univocal redundancy correlate of organisation. A significantly
organised system is not maximally complex, because of its redundancy (more internally
ordered than a gas), but it is not maximally ordered either, because of its
higher order correlations (less ordered than a crystal).
Because the information in organised systems involves large numbers of components
considered together without any possibility of simplification to logically additive
combinations of subsystems, computation of the surface form from the maximally
compressed form (typically an equation) requires many individual steps, i.e.
it has considerable logical depth (Li and Vitányi 1990: 238).
Of course, this measure applies whether or not we regard the order
as epistemically hidden or buried. Formally, logical depth is a measure of the
minimal computation time (in number of computational steps) required to compute
an uncompressed string from its maximally compressed form.
C.H. Bennett has proposed that logical depth is a suitable measure of the organisation
in a system. However, while adding more components to a system at the same redundancy
level will not increase the system organisation, only the size of the system
organised, it will increase its depth because the sheer length of the sequence
to be computed has increased. All sequences of n identical entries are intuitively
equally trivial; however, the depth of each string depends on the depth of n
itself. This effect can be made negligible if we consider only relative depth:
the depth of a sequence relative to the depth of the length of the sequence.
The relative depth itself of a sequence of n identical entries is no more than
the depth required to specify the entry itself (and negligible if the entry
is 0 or 1). In the case of adding identical components to a system the relative
depth does not increase since the depth of a component is already included in
the original system relative depth. It is not transparent whether relative depth
deals satisfactorily with all possible cases of this kind, but it is a reasonable,
and plausibly sufficient, refinement of logical depth simpliciter to adopt.
When we observe organisation we can reasonably infer that it is the result
of a dynamical process that can produce depth. The most likely source of the
complex connections in an organised system is an historically long dynamical
process. Bennett recognised this in the following conjecture:
"A structure is deep, if it is superficially random but subtly redundant,
in other words, if almost all its algorithmic probability is contributed by
slow-running programs. ... A priori the most probable explanation of 'organized
information' such as the sequence of bases in a naturally occurring DNA molecule
is that it is the product of an extremely long biological process." (Bennett
1985; quoted in Li and Vitányi 1990: 238)
The converse of Bennett's claim is not generally true: a system's being the
product of an extremely long process does not ensure that it will contain a
lot of organised information. There are further problems about how depth in
material systems might arise, and why it seems to be favoured. For further detailed
discussion, see Collier and Hooker (1999).
Despite this success, the theory has metaphysical difficulties, since it requires
quantification over an ensemble of states, most of which are often non-existent.
This is not a problem when we are concerned only with capacities or potentials
of communications channels, but presents problems when the theory is applied,
as it often is, to the information content of individual messages or even to
specific information sources. The problem concerns the grounding of the probabilities
used to compute the information contents. If a source is ergodic,
originally meaning that energy alone describes the dynamical state of the system,
but now usually interpreted as the ensemble average of the source almost certainly
equalling the time average of the source, the probabilities in the ensemble
can be understood in terms of potential emissions at some time. Unfortunately,
ergodicity in real sources is usually trivial or very hard to establish. The
problem of grounding the ensemble probabilities is often dealt with by using
operational procedures that vary according to the details of the case. The success
of these methods depends on the reliability of the approximations for the problem
to be solved as well as the nature of the source. Inappropriate choices can
lead to perfectly justifiable formal measures that are nonetheless intuitively
wildly unsatisfactory. For example, segments of the decimal expansion of
More recent approaches start with meaningful representations and try to specify
their interpretation, making use of available empirical constraints (Barwise
and Perry 1983, Dretske 1981,
Israel and Perry 1990, Devlin 1991).
On this view, the interpretation of a representation is given in terms of the
information it conveys. Unlike the formal approach, in which information content
is determined entirely by the structure of language (or other representational
system), information in this approach is the content (or factual content) of
a representation (or information-report).
The goal of this approach is to connect meaningful representations to the concrete
situations represented. In their situation semantics, Perry and Barwise
base this connection on nomic regularities, called constraints (Israel
and Perry 1990). Information is conveyed to us by causal
chains connecting situations in a lawful way. The information indicated by a
situation is relative to the causal chains connecting the indicating situation
both to our beliefs and to the situation the information is about. Thus, "[t]he
information a factual state of affairs carries is relative to a constraint".
Complete determination of the reference of a representation (at least in cases
involving indexicality) also requires specific circumstances. The information
content of a representation available to us is delimited by our ability to invoke
relevant constraints in the circumstances.
Situation semantics requires that there is something "out there in the world"
that can be transmitted to intelligent beings who can understand the information
it contains, and pass it around among themselves. The Barwise/Perry approach
needs an information-theoretic account of nomic regularities and causal interactions,
and of the transmission of the information these nomic regularities and causal
interactions contain. Collier (1990, 1999)
has offered one such approach based in physical interpretations of information.
Barwise and Seligman's (1997) seminal work on the mathematical
structure of information flow in terms of classifications of tokens under types
related through infomorphisms that retain the structure of a classification
of tokens across both changes in classifications and tokens approaches the issue
from a more formal direction.
Superficially, Dretske's (1981) approach to information
resembles the Carnap/Bar-Hillel approach. He also defines the information content
of a piece of evidence in terms of the cases ruled out. A major difference is
that Dretske does not try to specify representations purely syntactically. Rather
than calculating the information content of statements, he uses states of affairs
directly. His measure of information is similar to the inf definition (Dretske
1981: 52):
Dretske held that "[t]he ultimate source of intentionality inherent in the
transmission and receipt of information is, of course, the nomic regularities
on which the transmission of information depends" (1981:
76). This is similar to Barwise and Perry's placement of meaning in the world.
Dretske's definition of information in terms of the cases ruled out might seem
to fall afoul of the problem of background knowledge that plagues the Carnap/Bar-Hillel
approach. However, information is transmitted (perhaps indirectly) from structure
to structure according to causal laws. If it is transmitted to a structure with
the right order of intentionality, the causal constraints imply reliable
belief. The causal processes producing beliefs eliminate other possibilities
from consideration. Linguistic relativity and related problems are mitigated
by allowing information to exist in the non-mental world as well as in the mind
and as a purely formal abstraction. Dretske defines three orders of intentionality
in order to admit higher order cognitive states. The first requires that all
Fs are Gs, S has the content that t is F, and S does not
have the content that t is G, where S is a structure, and t an object.
The second order requires of the first condition that it is a natural law that
Fs are Gs, and the third requires that it is analytically
necessary that Fs are Gs. Dretske notes that the second
and third orders don't have a clearly defined boundary, but he calls any propositional
content exhibiting the third order of intentionality a semantic content. His
definitions require that beliefs have higher order intentionality than structures
with respect to information content; first order cases have the systematic content
required for belief (though they can qualify as awareness or sensation). Dretske
holds that the higher orders are formed from the lower orders through a process
of digitalisation of the information from the analogue form in
which it is received, where a digital representation has a form containing all
and only the information of its semantic content. First order intentionality
is analogue and vague (a little like C. I. Lewis's ineffability of the given,
or James's "blooming, buzzing confusion").
Perry has expanded on this idea by noting that we get information by making
discriminations or distinctions within a context, thereby specifying which of
several possibilities we mean. To be successful in these discriminations, the
distinctions must also exist elsewhere. To be successful in making use of our
semantic discriminations in our interactions with the world, there must be appropriately
correlated distinctions in the world. Much of the work in situation semantics
involves unpacking "discriminations", "interactions", "relevant" and "appropriate"
in logical and information theoretic terms (see Devlin 1991
for a recent account).
Szilard (1929) developed an idealised
argument involving a single particle on one side or another of a piston that
excludes a demon that detects molecules with radiation, showing that each molecule
detected required the production of an amount of entropy equal to
the amount lost by sorting it, thus tying detection to entropy increase. Schrödinger
(1944) proposed that order as found in macromolecules that
carry biological information was the negative of entropy, or negentropy. Schrödinger
said that he used negative entropy rather than free energy because of misunderstandings
of the relation of the technical notion of free energy to the common notions
of free and energy, and traced the idea back to Boltzmann. Brillouin (1962)
formalised this idea and related it to the Shannon information of communications
theory (others developing similar ideas were Gabor, Raymond and Rothstein, see
Leff and Rex, 1990). The negentropy
principle of information implies that no physical entity can use information
in a physical system to lower its entropy. In particular, it implies that any
measurement requires the dissipation of a minimal amount of energy.
It is worth noting that Shannon entropies have the same mathematical form as
entropy, but correspond more closely to negentropy in most applications. The
difference doesn't matter much to abstract communications theory, which quantifies
over ensembles of messages, most of which are fictional, but when we turn to
concrete particular messages, such as the information in a measurement, the
difference becomes crucial. Shannon entropy can be decreased by a passive filter,
but physical entropy, by the Second Law, cannot. This means that the Shannon
entropy of a source must be negentropic to be measured.
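The point about passive filters can be illustrated numerically. The following is a minimal sketch, assuming a toy source of four equiprobable symbols and a filter that simply passes two of them; the symbols and function name are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(messages) -> float:
    """Shannon entropy, in bits per symbol, of the empirical distribution
    of `messages`."""
    counts = Counter(messages)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


source = list("AABBCCDD")                    # four equiprobable symbols
filtered = [m for m in source if m in "AB"]  # passive filter: pass only A and B

print(shannon_entropy(source))    # 2.0
print(shannon_entropy(filtered))  # 1.0
```

The filter does no work on the source; it only discards messages, yet the Shannon entropy of what passes is lower. Nothing analogous is available for physical entropy, which is the asymmetry the passage relies on.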
A second approach to physical information is through the physics of computation.
Rolf Landauer, noting that some computations are logically reversible, asked
whether physical computation is reversible. He concluded that the only essentially
irreversible step is erasure. It is possible to make a computer without erasure,
as shown by Fredkin and others. However, for computations showing other than
logical equivalence, large amounts of waste storage are produced. A computer
can be implemented on a system of colliding elastic balls, so at least a reversible
physical implementation of a general purpose computer is possible in principle.
Erasure corresponds to loss of information, and waste of unusable storage, its
reversible equivalent, corresponds to information that cannot be used for
further computations without an equivalent loss. The parallel to the Second
Law of Thermodynamics did not go unnoticed, and Charles Bennett (1987)
argued that Maxwell's demon failed because it must erase information. Collier
(1990) argued that the demon fails because it can only
make accessible the information required for manipulating the macrostate so
as to reduce entropy by making an equivalent or greater amount of information
inaccessible in the sense mentioned previously, thereby lowering the entropy.
Earman and Norton (1999) argue for the irrelevance of
information theory to the explanation of the Second Law, echoing claims by the
Denbighs. The resolution of the issue requires a deeper understanding of the
problem the demon has to solve.
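The cost of erasure can be given a rough magnitude. The text above does not state a figure; the sketch below assumes the standard quantitative form usually associated with Landauer's argument, a minimum dissipation of k_B T ln 2 per erased bit, and the function name is hypothetical.

```python
from math import log

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_bound(temperature_kelvin: float, bits: float = 1.0) -> float:
    """Assumed minimum energy in joules dissipated when erasing `bits` bits
    at the given temperature: bits * k_B * T * ln 2."""
    if temperature_kelvin < 0:
        raise ValueError("temperature must be non-negative")
    return bits * BOLTZMANN * temperature_kelvin * log(2)


# Erasing one bit at roughly room temperature (300 K, an assumed value).
print(landauer_bound(300.0))
```

At 300 K the bound is a few zeptojoules per bit, far below what present computers dissipate, which is why the argument concerns limits in principle rather than engineering practice.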
Maxwell himself used the demon in an argument that the Second Law was statistical
in nature, and was subject to exceptions, though these were highly unlikely.
Explaining the statistical nature of entropy led to the ergodic
problem, which is the problem of how the state parameters of a system
with components with significant spatial and momentum parameters could depend
on energy alone. Research in ergodic theory has been extensive, but has
drifted away from the original problem, which remains unresolved except for
some very special cases (Sklar 1993, 1996).
The connections among computation, chance and probability, as well as the demon
problem, through information theory suggest information might play a central
role in understanding the Second Law, despite cogent arguments to the contrary.
Information theory is connected to the problem of the direction of time through
thermodynamics as well as through the asymmetry in our information about the
past and the future, and through related asymmetries in causal processes. The
significance of the asymmetries is a subject of much current debate.
Information theory with a physical interpretation has been applied to biology,
with limited success so far. Notable attempts are Gatlin (1976),
Holzmüller (1984), Küppers (1990),
Kauffman (1993) and Brooks and Wiley (1988).
None of this work has yet been widely accepted in the biology community.
Measurement involves getting information about a source via a physical process.
This requires the transmission of information from the source, or from something
that contains information about the source. Essentially, measurement is a co-ordination
problem, in which the mutual information of the source
and the result must be maximised through some physical process or processes.
This appears to be a problem in communication theory, and it is at least that,
but further issues involve the role of natural laws, theory,
and often tacit auxiliary assumptions in specifying both what is measured and
its significance. Some of these problems converge with the problems concerning
semantic information, discussed above. In particular,
tacit assumptions place constraints on the interpretation of observations, and
causal processes convey information from what is measured to the measuring device.
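As a rough illustration of the idea that measurement aims to maximise the mutual information of source and result, the following Python sketch computes mutual information in bits for a two-state source and a two-valued reading. The joint probability tables and the function name mutual_information are invented for illustration and are not drawn from any of the works cited.

```python
import math

def mutual_information(joint):
    """Mutual information in bits, where joint[i][j] is the probability
    that the source is in state i and the measuring device reads j."""
    p_source = [sum(row) for row in joint]
    p_result = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                total += p * math.log2(p / (p_source[i] * p_result[j]))
    return total

# Hypothetical joint distributions for three measuring devices.
perfect = [[0.5, 0.0], [0.0, 0.5]]      # reading always matches the source
noisy = [[0.4, 0.1], [0.1, 0.4]]        # reading is wrong one time in five
useless = [[0.25, 0.25], [0.25, 0.25]]  # reading is independent of the source

for name, table in [("perfect", perfect), ("noisy", noisy), ("useless", useless)]:
    print(name, round(mutual_information(table), 3))
```

On this toy model the perfect device recovers the full bit carried by the source, the useless device recovers nothing, and the noisy device falls in between.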
Independent of Quantum Mechanics, the non-existence of a Maxwellian demon places
limits on the total information that can be extracted in any particular measurement
process because measurement requires the expenditure of available energy, or
exergy. This places the sort of physical limits on the accuracy of measurement
discussed by Brillouin (1962, Chapter 16).
The measurement process itself can be thought of as a source, coding, channel
transmission and decoding process, much as in a communications channel. For
some purposes, for example in seismology, where the physics of the source and
its connection to the channel are well understood, this model can be quite useful.
It is also useful for determining the sensitivity of observations, and the amount
of information that can be conveyed by a particular experiment. In many if not
most cases, though, the analogues to coding and the
channel, not to mention decoding, are very unclear, and involve problems of
observational dependence on theory and related issues. For these same reasons,
Barwise and Seligman's (1997) approach to information flow
is not immediately helpful, depending, as it does, so heavily on knowing the
classifications in the infomorphisms.
One area that has not been investigated as well as it should be is the role of
distinctions in crucial experiments, with an eye to how theory-based semantic
distinctions connect to experimental distinctions. Testing in general involves
classification, and seems ripe for information theoretic analysis.
As usually conceived, natural laws have the role of axioms for the world. Therefore,
the question of the information content of natural laws makes sense in the same
way as the question of the information content of an axiomatic
formal system makes sense. In both cases, it is the abstract structure of
the system, whether laws or axioms, that is relevant to the algorithmic complexity.
Axiomatic theories and mathematical models can be treated in the same way. Some
attempts, e.g., by Brillouin, have been made to determine the information content
of empirical laws, and a number of others have noted that simplicity of theories
and compression are connected, but so far there is no canonical way to make the
connection. The situation is likely to be analytically intractable for reasons
mentioned in the discussions of various mathematical approaches to information
theory, but some progress has been made with mathematical models by using minimum
message length and minimum description length techniques.
Other approaches use evolutionary considerations to yield a naturalised epistemology
based on information flow, but relaxing various requirements of Dretske's account.
A completely different approach uses Bayesian methods together with the idea
that knowledge is a correlation of mental state with the world. This approach
gives up the requirement of probability one. It is also a naturalised approach,
since prior probabilities are required, and evolution is the most natural source
of these. One account of how an initial reliability can be established was given
by Mohan Matthen (1988). Grandy (1987)
has extended the correlation account to take practical considerations of survival
into account, which creates some problems for the pure correlation account,
including Dretske's account.
Two approaches that avoid evolutionary considerations make use of the idea of
compression. The Minimum Message Length (MML) approach
was developed by Wallace (see 1999) and the Minimum Description
Length (MDL) approach was developed by Rissanen (1989).
The basic idea of both accounts is to find a minimal message that encodes binary
coded data about the real world, though the best that can be achieved in most
cases, given the noncomputability of the shortest string, is a probability distribution
over a set of strings that gives a model of the probability that the data represented
by a string is true of the real world. The two approaches have some differences,
over which there has been some dispute. Part of this may stem from differing
intuitions about the nature of the task. Wallace and his colleague David Dowe
see their approach as fundamentally Bayesian, whereas Rissanen sees his approach
as giving an actual hypothesis about the world, suggesting he sees the process
embedded in an "epistemic engine" in which the strings have a natural interpretation.
When I mentioned this to Dowe, he found the idea preposterous. In any case,
both approaches have had some success with restricted data sets, such as DNA
strings.
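The compression idea can be sketched with a crude two-part code: the cost of stating a model plus the cost of encoding the data given that model. The Python below is only a toy under stated assumptions, and is neither Wallace's MML procedure nor Rissanen's MDL procedure. The data string, the 4-bit charge for stating the biased model's parameter, and the function names are all hypothetical.

```python
import math

def data_bits(bits, p_one):
    """Code length in bits of a binary string under a model that assigns
    probability p_one to each 1 (p_one must lie strictly between 0 and 1)."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return ones * -math.log2(p_one) + zeros * -math.log2(1 - p_one)

def two_part_length(bits, p_one, model_bits):
    """Total message length: bits to state the model plus bits for the data."""
    return model_bits + data_bits(bits, p_one)

# Hypothetical data: 100 binary observations, 90 of them ones.
data = [1] * 90 + [0] * 10

# Model A: fair coin, assumed to need no parameter (0 bits to state).
fair = two_part_length(data, 0.5, model_bits=0)

# Model B: biased coin with p_one = 14/16, charged 4 bits for stating the
# parameter to a precision of 1/16 (an illustrative coding assumption).
biased = two_part_length(data, 14 / 16, model_bits=4)

print(fair, round(biased, 2))
print("preferred model:", "biased" if biased < fair else "fair")
```

The biased model is preferred here because the saving in encoding the data outweighs the extra cost of stating the parameter. That trade-off is the basic one the compression-based approaches exploit.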
John Dorling (1991) has a more ambitious project of
basing theory construction on the minimisation of information relative to data,
the best theory being the one that most minimises the data. Brillouin earlier
tried a similar approach, trying to determine the information in a theory, but
it drew little attention, and had small success. Some of the problems are mentioned
in the previous section.
A second question is whether information theory has anything to say about traditional
problems in the Philosophy of Mind such as the mind-body problem, the problem
of intentionality, and the "hard problem" of consciousness. At this stage it
seems unlikely that it will help with these problems in their traditional form,
but it might be helpful in reformulating the problems in a more intelligible
form. For example, Dretske's three levels of intentionality, though hardly providing
a complete solution to the problem, suggest that the problem of intentionality
is not a single problem, and that different informational states have differing
causal and logical properties relevant to representation. Possibly, the traditional
questions are the wrong questions to ask, or are at least too confused to have
coherent answers.
References
Copyright © 1999
First posted: February 19, 1999
The close connection between information and probability is evident from the
statistical and Boolean approaches. Although historically probability theory
preceded Shannon entropy, it is possible to define Shannon information without
the explicit use of probability by using Boolean algebra. A brief summary of
a proof by Ingarden et al (1990: 25ff) is here.
It follows from this that the probability axioms can be explicitly defined within
information theory, though I won't give the proof here. Given the inseparability
of information theory and logic, probability theory is thus a branch of logic.
Given the close connection between information and logic, it seems reasonable
to conclude that information is the more fundamental notion. Hume, among others,
thought that chance was completely a consequence of ignorance, or lack of information.
We now know that this is unlikely, but the syntactic character of contemporary
information theory allows us to go beyond epistemic and even intentional characterisations
of information.
Organisation and Logical Depth
Communications Theory
Communications theory was the first applied mathematical theory of information
developed. For practical reasons involving technological applications in the communications
and computation industry, it is the one that has been pursued the furthest. Communications
theory is the theory of the evaluation and control of the probability of transmission
of messages with specified accuracy in the presence of noise, including transmission
failure, distortion and accidental additions. Its basic elements are a message
source, an encoder, a channel over which the message is transmitted, a decoder,
and a message recipient. Numerically, information is measured in bits (short for
binary digits). One bit is equivalent to the choice between two equally likely
choices. For several equally likely choices, the number of bits is the base two
logarithm of the number of choices. When the choices are not equally probable,
the information is the negative of the sum of the base two logarithms of the probabilities of the choices, each weighted
by the probability of that choice, yielding an equation similar in form to that for entropy
in Boltzmann's statistical thermodynamics. The greater the information in a message,
the more possible cases it rules out, i.e., the more specific it is, and the less
likely it is to be true. Because a less likely message is more surprising, the
information is sometimes called the surprisal. Any message of equal length
to a maximally unlikely message, but less than maximally unlikely, must contain
some redundancy. In a channel with no noise, the maximal information capacity
can be gained by coding to eliminate redundancies in the source. In the presence
of noise, which introduces equivocation into the message, reducing its probability
of transmission, clever coding can reduce the loss, but at the expense of greater
redundancy. This places an upper limit on the probability of transmission of a
message in a noisy channel. Because of the provable existence of maximally efficient
codings for any message, for any given channel there is a limiting capacity or
rate at which it can carry information, expressed in bits per second. Once the
information content and channel capacity are calculated, specific coding techniques
can be used to control errors in the channel. The communication problem is to
maximise the mutual information of the source and receiver. The mutual
information can be expressed as the intersection of the information in the source
and in the receiver, and in bits is the base two logarithm of the correlation
of the source and receiver. Most of the fundamentals were first presented in Claude
Shannon's painstaking seminal work (1949). The theory
does strikingly well in defining the engineering requirements and limitations
of communications systems.
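The numerical definitions above can be sketched in a few lines of Python. This is a minimal illustration of the standard formulas only. The function names and the example probabilities are chosen here for convenience and do not come from Shannon's text.

```python
import math

def bits_for_equal_choices(n):
    """Information in bits of a choice among n equally likely alternatives."""
    return math.log2(n)

def surprisal(p):
    """Information in bits carried by a message of probability p."""
    return -math.log2(p)

def entropy(probs):
    """Average information in bits per choice when the alternatives have
    the given probabilities (which should sum to one)."""
    return sum(p * surprisal(p) for p in probs if p > 0)

print(bits_for_equal_choices(2))    # one bit: two equally likely choices
print(bits_for_equal_choices(8))    # three bits: eight equally likely choices
print(entropy([0.5, 0.25, 0.25]))   # unequal choices carry less than log2(3) bits
print(surprisal(0.5), surprisal(0.125))  # the less likely message is more surprising
```

With equally likely alternatives the entropy reduces to the base two logarithm of their number. Any departure from equal probabilities lowers the average information per choice, and that shortfall is the redundancy that coding can remove.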
show no obvious regularity if sampled by standard statistical methods (very
recent work may have refuted this), but are highly correlated with ubiquitous
physical and mathematical functions involving
either explicitly or implicitly. This ergodic problem is part of the reason
for the proliferation of information and entropy measures. It should be noted
that an analogous problem is also foundational in statistical mechanics.
Semantic Information
A rigorous account of semantic information remains an elusive object of desire.
Early attempts were made within the Logical Empiricist approach to language. Carnap
and Bar-Hillel (Bar-Hillel, 1964) used the resources
of inductive logic to define the information content of a statement in a given
language in terms of the possible states it rules out. For "technical reasons"
they calculate the states ruled out as a number of state descriptions. A state
description is a conjunction of atomic statements assigning each primitive monadic
predicate or its negation (but never both) to each individual constant of the
language. The information content of a statement is thus relative to a language.
Evidence, in the form of observation statements, contains information in virtue
of the class of state descriptions the evidence rules out. (They assumed that
observation statements can be connected to experience unambiguously.) Information
content, then, is inversely related to probability, as intuition would suggest.
Our pre-systematic intuitions, though, confuse two different measures of information
content, both of which have plausible but incompatible properties. The first measure
of the information content of statement S is called the content measure, cont(S).
It is defined as the complement of the a priori probability that S is true:
[1] cont(S) = 1 - prob(S)
This measure fails the additivity condition, according to which the combined information
content of two inductively independent statements should be the sum of their individual
information contents (Bar-Hillel, 1964: 302). It also fails some natural assumptions
about conditional information. These problems motivated the introduction of another
measure, called the information measure, inf(S):
[2] inf(S) = log2 (1/(1 - cont(S))) = -log2 prob(S)
The value of this measure is in bits. Although inf satisfies additivity and conditionalisation
requirements, it has a property that some people find counterintuitive. If some
evidence E is negatively relevant to a statement S, then the information measure
of S conditional on E will be greater than the absolute information measure of
S. This violates a common intuition that the information of S given E must be
less than or equal to the absolute information of S. The content measure, cont(S),
does satisfy this intuition (Bar-Hillel, 1964: 306-7).
I do not share this widespread intuition since it requires effort to correct the
inference based on E that S is less likely. A more serious problem with the whole
approach is the linguistic relativity of information, and problems with the Logical
Empiricist program that supports it, such as what has somewhat misleadingly been
called the theory ladenness of observation (Collier 1990).
[3] I(s) = -log2 prob(s) (in bits)
where prob(s) is the probability of the state of affairs s. The use of states
of affairs has the potential to avoid the problems of relativity to language that
plague the formal approach (see Bell and Demopoulos 1998 for details of the logic
of the relativity problem).
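The contrast between the content measure cont and the information measure inf can be checked numerically. The Python sketch below uses made-up probabilities. It also assumes that the information of S conditional on E is -log2 prob(S given E), which is an assumption about the intended definition and is not stated in the text above.

```python
import math

def cont(prob_s):
    """Content measure: complement of the a priori probability of S."""
    return 1.0 - prob_s

def inf(prob_s):
    """Information measure in bits: -log2 of the probability of S."""
    return -math.log2(prob_s)

# Two inductively independent statements (hypothetical probabilities),
# so the probability of their conjunction is the product.
p_a, p_b = 0.5, 0.25
p_ab = p_a * p_b

# inf is additive over independent statements; cont is not.
print(inf(p_ab), inf(p_a) + inf(p_b))
print(cont(p_ab), cont(p_a) + cont(p_b))

# Negatively relevant evidence E lowers the probability of S
# (assumed values: prob(S) = 0.5, prob(S given E) = 0.25).
p_s, p_s_given_e = 0.5, 0.25
print(inf(p_s_given_e) > inf(p_s))  # conditional inf exceeds absolute inf
```

On these numbers inf gives 3 bits for the conjunction, the same as the sum of the individual values. The cont of the conjunction (0.875) falls short of the sum of the individual cont values (1.25).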
Physical Information
The physical information of a system is closely connected to its entropy, which is, very roughly,
a measure of the objective disorder of the system. The Second Law of Thermodynamics
requires that the entropy of an isolated system cannot decrease with time. This
means, again very roughly, that only some energy within an isolated system (and
more generally, in all connected systems) is available for work, and that this
energy never increases. This was deeply disturbing to the Victorian mind. Maxwell
posited "a very observant and neat-fingered being" that sits by a frictionless
door between two chambers A and B, initially at the same temperature. The demon
opens the door whenever either a relatively fast moving molecule moves towards
it from B, or a relatively slow moving molecule moves towards it from A. Gradually,
the manipulations of the demon lead, without the expenditure of available energy,
to a sorting of the fast moving molecules into A and the slow moving molecules
into B. This lowers the temperature in B relative to A, decreasing the total entropy
of the system, in apparent violation of the Second Law. It was quickly evident
that a purely mechanical demon was not possible.
Measurement
Causation
Information theory has a bearing on a number of the characteristics of physical
causation. Some involve the temporal asymmetries mentioned previously. One prominent
approach to causation, the mark approach initiated by Reichenbach (1956,
1958) and furthered by Salmon, defines a causal process
as one that can bear information, and causal interactions in terms of forks in
causal processes. Causal forks exhibit the probabilistic relations dealt with
in theories of probabilistic causation. Collier (1999)
has given a definition of causal process and causal forks in terms of physical
information theory based on the algorithmic model, from
which the necessity and other modal properties of causation and natural laws follow
naturally. The main problem with this approach, aside from the obscurity of the
resources it uses relative to commonly understood ideas, is a possible circularity
in the notion of information transfer that may also infect similar accounts.
Natural Laws
Perception and Epistemology
Dretske's (1981) account of perceptual knowledge remains
one of the most advanced philosophical accounts based on information theory. It
is based in causal constraints, requiring probability one that the information
represents the object for perceptual beliefs, and is thus a reliability account
(no justification required). His three levels of intentionality, distinguishing
between digital and analogue information, allow us to distinguish between simple
perception and perceptual beliefs.
Philosophy of Mind
The neutrality of syntactic information between the dynamical and logical, the
representational and the represented has been noted by a number of authors (e.g.,
Sayre, Maturana and Varela 1980, Kampis 1991
and Devlin 1991) as relevant to the Philosophy of Mind.
Some of the issues have been discussed under semantics,
causation and perception. Whether
there is more that information theory can offer the philosophy of mind is open
to debate. The Dretskean view of information flow dovetails nicely with computational
accounts of mind, and evolutionary accounts of perceptual information fit well
with dynamical accounts of mind. No current approach makes deep connections between
these larger approaches, however.
Game Theory and Economics
One of the main issues in game theory is how to deal with the imperfect information
each player has about the others' strategies. This is taken up in the article on
game theory. The problem is especially difficult in cases of changing information.
Since the information states are relevant to determining what game is being played,
the dynamics of information is fundamental to useful application of game theory.
Bibliography
Resources
A readable and non-mathematical introduction to the issues involving information
discussed here is Paul Young's The Nature of Information (1987). There
is no compendium of mathematical information theory that covers all aspects of
the topic. Kolmogorov's "Three Approaches to the Quantitative Definition of Information"
(1965) lays out the basics nicely. Li and Vitányi
(1993) review the scope of algorithmic information theory fairly
completely. Calude (1994) is a basic text on information and randomness. Ingarden
et al (1997) review some central principles, and applications
to dynamical systems. Shannon's original paper on communications theory (1949)
is still unsurpassed as a source on this topic. Chaitin's Algorithmic Information
Theory (1987) expresses the main results of metalogic
in terms of information theory. Unfortunately, his account is somewhat inaccessible
due to his use of LISP as his formal language. See Boolos and Jeffrey (xx) for
a more familiar approach. Keith Devlin's Logic and Information (1991)
reviews the basic results of mathematical information and its connections to logic,
mental states, perception and action and situation semantics. Jon Barwise and
Jerry Seligman's Information Flow (1997) sets new standards
for discussion of the links between information and classifications of tokens
by types. As the discussion in this entry has indicated, these issues are central
to the role of information theory in a range of philosophical and scientific endeavours.
Leon Brillouin's Science and Information Theory (1962)
is a classic source for the connections between information theory and physics.
A general but not completely reliable introduction to the issues is Jeremy Campbell's
Grammatical Man (1982). Leff and Rex (1990) have collected
central papers to 1990 in Maxwell's Demon: Entropy, Information, Computing.
A classic philosophical source on information and perception is Fred Dretske's
Knowledge and the Flow of Information (1981). The
other areas covered in this article are still too ill-formed or too controversial
to have reliable canonical texts.
Other Internet Resources
John Collier
pljdc@alinga.newcastle.edu.au
Last modified:
May 3, 2002