
## Simplicity

In critical rationalism, popper on 15/07/2011 at 9:32 am

Why should scientists prefer theories that are simple and not complex? There is a very simple answer to this complex problem: Leibniz points out in section VI of his Discourse on Metaphysics that a theory ought to be simpler than the data it sets out to explain, otherwise it does not explain anything. A theory becomes vacuous if an arbitrarily complex mathematical statement is permitted to count as a theory, for one can always construct a theory to fit the data, even if the data is random.

Today, complexity and simplicity can be treated far more rigorously than in Leibniz’s day: we talk about information, a concept used all the time when discussing computers. So how does this idea relate to scientific theories? The insight is an interpretation of scientific theories that treats them like software: the theory, coupled with some background assumptions and initial conditions, predicts observations in much the same way as a program is executed on a computer to produce an output.
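To make the software analogy concrete, here is a minimal sketch; the squared-numbers “law” and all names below are my own illustrative inventions, not anything from the post:

```python
# A "theory" as a short program that, run on initial conditions,
# reproduces the observations -- much as software produces output.

def theory(n):
    """A compact universal rule: the k-th observation is k squared."""
    return [k * k for k in range(n)]

# The data the theory sets out to explain:
observations = [0, 1, 4, 9, 16, 25]

# Executing the theory reproduces (predicts) the data:
assert theory(len(observations)) == observations

# A vacuous alternative "theory" is the data itself, stored verbatim.
# It fits perfectly, but it is exactly as large as the data,
# so on Leibniz's criterion it explains nothing.
lookup_table = dict(enumerate(observations))
assert [lookup_table[k] for k in range(6)] == observations
```

The lookup table also “fits” random data perfectly, which is exactly why Leibniz denies that it explains anything.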

A guiding rule of behavior that is not necessarily true, but useful in solving a problem is referred to as a ‘heuristic.’

For instance, when attempting to catch a baseball, the outfielder runs toward the ball while simultaneously keeping his eye trained on it. Since his neck bends in relation to the position of the ball, he is far more interested in the angle of his neck, detected by the inner ear, than in Newton’s laws of motion.

When his neck begins to bend back, the outfielder’s inner ear tells him that he is overrunning the ball, and he reverses direction; when he is running toward the ball and his inner ear tells him his neck is approaching level, the outfielder and the ball must be closing on each other very quickly.

Take the two following heuristics:

1. Occam’s razor [1]: Given two theories that each explain the data equally well at time t, one ought to prefer the simpler theory until the outcome of a test indicates that one of the two theories is false (a ‘crucial experiment’);
2. Leibniz’s lather [2]: If a theory is the same size in bits as, or larger than, the data it sets out to explain, then it is worthless, for any random string of data has a theory of that size.

These two heuristics provide the following methodological rule: seek out theories that can be expressed in a (comparatively) small number of bits. Since a strictly universal statement (“all x are y”) can be expressed in a small number of bits compared to a finite list of existential statements (“this x1 is a y, that x2 is a y … that xn is a y”), and since scientists are interested in theories that predict a great deal, when explaining a set of data scientists ought to prefer strictly universal statements that predict a great deal over a finite list of existential statements that predict very little. [3]
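One rough way to see the rule in action is to use compressed size as a crude stand-in for “number of bits” (zlib is an arbitrary choice of measure, and the statements are invented for illustration):

```python
import zlib

# Rough proxy for "number of bits": compressed size. This is only a
# stand-in -- true Kolmogorov complexity is uncomputable, and zlib is
# one arbitrary choice of interpreter.

# One strictly universal statement:
universal = b"all x are y"

# The corresponding finite list of existential statements:
existential = b"; ".join(b"x%d is a y" % i for i in range(1, 1001))

size_universal = len(zlib.compress(universal))
size_list = len(zlib.compress(existential))

# The universal statement is far cheaper in bits than the enumeration,
# even though the enumeration is highly repetitive and compresses well.
assert size_universal < size_list
```

The gap only grows as the list of existential statements gets longer, while the universal statement stays the same size.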

III.

I have engaged with this bit of pedantry in order to make the case clear that scientists are interested in strictly universal statements. Now that this is tentatively settled, I will get to the heart of the problem.

Since a strictly universal statement predicts a great deal (i.e., it is interesting), it prohibits a large number of states of affairs, meaning that it has a probability that approaches zero. In brief, if a theory T1 predicts “It will rain on Monday”, then T2, “It will rain on Monday and it will rain on Wednesday”, must necessarily be either as probable as or less probable than T1.

Continue adding predictions to arrive at Tn: an unbounded conjunction of predictions, whose probability approaches zero. In other words, the probability of a statement has an inverse relationship with its content. Furthermore, the more interesting you make a theory (the more it predicts), the easier it is to test, for there are more possible situations in which to test it.
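The arithmetic behind this can be sketched directly. Treating the predictions as independent, each with probability 0.5, is a simplifying assumption made purely for illustration:

```python
# Each conjoined prediction multiplies in a factor of at most 1, so the
# probability of the whole theory falls as its content grows.

def conjunction_probability(n_predictions, p_each=0.5):
    """Probability of a conjunction of n independent predictions."""
    prob = 1.0
    for _ in range(n_predictions):
        prob *= p_each  # conjoining another prediction can only lower it
    return prob

assert conjunction_probability(1) == 0.5    # "rain on Monday"
assert conjunction_probability(2) == 0.25   # "... and rain on Wednesday"
assert conjunction_probability(30) < 1e-9   # content up, probability toward 0
```

Even without independence the point survives: a conjunction is never more probable than any of its conjuncts.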

The problem: There are as many different theories that satisfy Tn as there are types of Bayesianism!

IV.

Question: Which theory that satisfies Tn should the scientist prefer?

Answer: A theory expressible in a form that satisfies Leibniz’s lather!

In other words, out of all the possible theories that satisfy Tn, we want to adopt a strictly universal statement: A theory that is simple, interesting and improbable. Is there anything objectionable (counter-productive, imprudent) about adopting this answer? I ask this question, for its consequences are currently controversial in the philosophy of science.

V.

Although the acceptance of this methodological rule is permitted, and often recommended, it is never demanded. Appealing to this rule only serves to minimize the number of uninteresting and highly probable theories that are difficult to test. The reason is as follows: we have a very limited amount of time at hand and, as Keynes remarked, in the long run we are all dead. Thus, we want interesting and highly improbable theories that are comparatively easy to test.

Now, the great reveal: what I have just done is reverse-engineer what Popper’s Critical Rationalism (CR) advocates (but does not demand), for no amount of corroborating evidence can indicate that a theory is true. ‘Testability’ is then equivalent to ‘falsifiability.’ The scientist is interested in originating bold conjectures that are easily falsified, if the theory is false!

We’ve moved from something uncontroversial to something radical. The argument goes as follows:

1. Scientists prefer theories that are simple and interesting. (I., II.)
2. If it is simple and interesting, then it is a strictly universal statement. (II.)
3. Strictly universal statements are highly improbable. (III.)
4. The more improbable a theory, the more testable a theory. (III.)
5. Testability is equivalent to falsifiability. (V.)
6. Scientists prefer theories that are highly falsifiable. (V.)

C: Critical Rationalism best describes scientific practice.

[1] An aside. I must give a caveat: I do not mean to imply that a scientist is prohibited from formulating a theory that presently violates Occam’s razor as traditionally understood, by positing presently unobserved regularities. I mean only that theories expressible in a small number of bits are to be preferred over others.

[2] I made the name up.

[3] This is a guiding principle, not a rule set in stone, for scientists who understand a problem in great detail often arrive at new theories through a flash of insight. I do not mean to say that there is a specific method to theory-formation, only that it is recommended that whatever theory a scientist posits should conform to these restrictions. Once the theory has been formed, the theory ought to receive — ahem! — a clean shave.

1. Very good. Write a book.

• I’d seriously think about it once I hit thirty. Besides, there are too many books these days.

2. You are right: there are too many books. I just want to read it. I also want a handy reference when I feel too lazy to explain CR to someone. You have a talent for clear and concise explanation.

• Thanks for the compliment. My writing needs a good editor, though.

3. Regarding [2], have you seen this: http://www.scribd.com/doc/2136072/entropy-in-cosmology

They’re both wrong in the same way ;) although I’ll take this back if I ever see a sensible way of measuring the number of bits that a theory has (which I won’t). Can you tell me how to measure the size of a subset of possible states of affairs?!?!!

A related thing pops up in multimodel inference: all the “razors” are heuristics (in a different way to what you mean, I think), and they have to be. They’re heuristics in the sense that there is a trade-off between predictive/descriptive power and the data required to define them, and the optimal trade-off is just pulled out of nowhere. Even the Bayes Information Criterion (it’s not Bayesian, btw), the only one that is justified in any way at all, is really only justified in the case of infinite data (which would defeat the point of making a model).

One always needs a cost function (a heuristic one), or, in the case of [2], an equation that relates theory-bits to observation-bits (I say it like there are different kinds of bits, but the point is that whenever you measure something, how you measure it is important for drawing a conclusion – measuring the height of a ladder and the height of the door to my house won’t tell me if I can get the ladder through it).

Anyway, you are doing exactly what Popper is often criticized for: substituting what you think science should be for what it actually is. Presenting a prescription as a description. Not that I’m not doing exactly the same thing when I’m modelling an organism’s behavior as an optimal strategy. I don’t buy your premise I/II.

• last bit was not intended to be related to first 3

• Lucas,

1. You said, “I’ll take this back if I ever see a sensible way of measuring the number of bits that a theory has (which I wont).” Are you aware of Kolmogorov complexity?

2. Could you explain ‘multimodel inference’ a bit more? I think I understand what you’re saying, but I don’t want to misstate your position.

3. You said, “you are doing exactly what Popper is often Criticized for: Substituting what you think science should be for what it actually is.” You assert that I do not accurately describe science. Could you explain where I have gone wrong? Are scientists not interested in strictly universal statements? Are strictly universal statements not simple in light of Kolmogorov complexity? Are strictly universal statements not highly informative?

4. 1) Sure, I know about Kolmogorov complexity. In this case, the problem that I am trying to get at manifests itself in its being defined only up to an additive constant. One can compare complexities if the interpreter is the same for both – but what the hell is the correct interpreter that is common between the observed universe and someone who has a theory? Is there one that isn’t just “the entire universe to some extent”? Is there one at all?

There are two posts on my blog that are very much related to this:
my friends post: http://jellymatter.com/2011/04/15/entropy-is-not-disorder/
my response: http://jellymatter.com/2011/04/18/entropy-is-disorder/
(we actually agree, despite the titles)

Same problem, different manifestation (formally, it all comes from measure theory – the connection of which to probability theory is due to Kolmogorov)
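The interpreter-dependence described here can be illustrated with compressors standing in for interpreters (a loose analogy of my own, not a measurement of Kolmogorov complexity):

```python
import bz2
import zlib

# Two different "interpreters" (compressors) assign different sizes to
# the same statement, illustrating why complexity is only defined
# relative to a choice of interpreter -- and, for Kolmogorov complexity,
# only up to an additive constant.
statement = b"the quick brown fox jumps over the lazy dog " * 20

size_zlib = len(zlib.compress(statement))
size_bz2 = len(bz2.compress(statement))

# Both compressors squeeze out the redundancy, but they generally
# disagree on the exact "size in bits" of the statement.
assert size_zlib < len(statement)
assert size_bz2 < len(statement)
print(size_zlib, size_bz2)
```

Neither number is “the” complexity of the statement; each is relative to its compressor, which is the point being made.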

2) Multimodel inference is a field which attempts to answer the question of how many parameters should be used when making a statistical model. Maybe I should have just said model selection, as the inference bit is sort of irrelevant here. Anyway, obviously one could describe a set of n data perfectly with n or more parameters, but perhaps you could do it with less. The idea is to select a model which is in some sense optimal in terms of a trade off between number of parameters and the ability to describe the data. The target function that is optimized is called an information criterion (AIC, BIC…) which are (despite what some people might tell you) heuristics, sometimes with a justification – though the most common, AIC, is outright made up.
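A minimal sketch of the kind of model selection described here, using synthetic data and the textbook AIC formula; everything below (the data, the linear “true model”, the degree range) is my own invention for illustration:

```python
import math
import numpy as np

# Fit polynomials of increasing degree to noisy linear data and score
# each with AIC = 2k + n * ln(RSS / n), where k counts parameters.
rng = np.random.default_rng(0)
n = 100
x = np.linspace(0.0, 1.0, n)
y = 2.0 * x + 0.1 * rng.standard_normal(n)  # truly linear, plus noise

def aic(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    k = degree + 1  # number of fitted parameters
    return 2 * k + n * math.log(rss / n)

scores = {d: aic(d) for d in range(1, 8)}
best = min(scores, key=scores.get)
# A higher-degree polynomial always lowers RSS, but the parameter
# penalty means the criterion need not prefer it.
```

The `2 * k` penalty is exactly the arbitrary-looking trade-off being criticized: nothing in the data dictates that particular exchange rate between parameters and fit.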

3) First of all, I agree with much of what you are saying, there is some truth to it. As for where I disagree, there are two different points.

First, there is the problem which I mention above: that you can’t measure KC usefully without specifying an interpreter. In other words, statements don’t have complexity in and of themselves. It is a constant battle in information theory to get people to understand that measures of information are not independent of how one decides to measure them (actually, it’s even worse in conventional statistics).

Second, “Are scientists not interested in strictly universal statements?”, move the negation and this is what I’m saying. I would say that it’s definitely not a given that scientists are interested in strictly universal statements. A model scientist may be, but this isn’t the case universally. Though I think that academia encourages this approach to a large extent.

Don’t think that I am unhappy with the approximation; it’s reasonable, just not exactly right: I don’t think it’s unfair to say scientists are often motivated by the act of going and getting some data, the practical problem-solving side of things – often they care very little about what it means. Actually, in biology, people are often incredibly skeptical about generalizations and potentially powerful theories, mainly because there’s nearly always a counterexample. In other words, people don’t like easily falsifiable theories because in all likelihood they will be falsified – and they will have wasted their time (of course, there are many notable exceptions). So, even if we accept that there is some well-defined way of measuring the complexity of a theory, there is still a trade-off between how universal a theory is and, effectively, how likely it is to get published. Maximizing personal utility is not the same as maximizing universality.

• Lucas,

I’m sorry your comment was picked up by the spam filter. I think the links may have triggered that action.

(1) You referred to ‘a sensible way of measuring the number of bits that a theory has’ — Kolmogorov complexity appears, at least to my eyes, to be ‘sensible’. I’ve been wary of Kolmogorov complexity since (as I understand it) it assumes a subjectivist account of the probability calculus. It’s not an objective interpretation, of course, but an objective one would require satisfying something impossible to obtain while still being of any use to scientists. By analogy, a correspondence theory of truth is helpful in defining ‘truth’, but it is not a criterion of truth. Does that make sense?

(2) You said, “The idea is to select a model which is in some sense optimal in terms of a trade off between number of parameters and the ability to describe the data.” I fail to see how a strictly universal statement does not fit the bill. Could you clarify?

(3) a. You said, “I would say that it’s definitely not a given that scientists are interested in strictly universal statements. A model scientist may be, but this isn’t the case universally.” If scientists search for interesting explanations, they will prefer strictly universal statements over other alternative explanations. If scientists don’t search for interesting theories, then I don’t think they’re doing science anymore.

b. You said, “people [in biology] don’t like easily falsifiable theories because in all likelihood they will be falsified.” I find this claim to be highly controversial. By decreasing the logical content of theories, their chances of being falsified decrease as well. Do you think scientists want theories that have as little logical content as possible? I suppose this may be true due to, as you note, sociological issues, but I’m not as interested in the sociology of science as the logical reconstruction of scientific practice: ‘Maximize logical content of theories’ looks like a guiding heuristic, no?

• I forgot to mention that a while back I had an attempt at formalising something a bit like what you’re saying (it’s quite Bayesian; hope that this doesn’t put you off):

http://jellymatter.com/2011/04/17/a-scientist-modelling-a-scientist-modelling-science/

Also, looking back at it, I realize I’m far more liberal with the word knowledge than some people I know would like me to be.

1) I’m vaguely aware of the correspondence theory, but not enough for the analogy to make sense to me. Can you expand?

2) You asked, “I fail to see how a strictly universal statement does not fit the bill. Could you clarify?”: It’s just that the details of the trade-off are completely arbitrary. The bias for simplicity over descriptive power can be anything between the two extremes. It’s not that I think this disagrees with your description of a strictly universal statement; I think it highlights the fact that there is absolutely nothing special about them other than their being preferred by someone with some prior understanding, some degree of preference for theory over data, and no other concerns.

3) a) You state: “If scientists don’t search for interesting theories, then I don’t think they’re doing science anymore.” You’d be right if there were only one scientist in the world. But in general, people work in groups, each member specializing in some way, and each one is valuable – even if they pretty much only collect data. This obviously applies more widely to the scientific community at large; there is nothing wrong with whole groups of people who just collect data (I quite like these guys: http://www.iapws.org/ ) – they have other people to make theories (up to a point).

b) “people [in biology] don’t like easily falsifiable theories because in all likelihood they will be falsified.”… OK, maybe I could have expressed that better. It’s just that some scientists would prefer to make no claim than a wrong claim. There is of course a whole spectrum.

I’m OK with ignoring the social side up to a point, but when you say something like “C: Critical Rationalism best describes scientific practice.” I have to take exception – replace “practice” with something like “method” and I am more inclined to agree. Maybe you don’t want to explain science in terms of sociology, and possibly sociological explanations get far too much emphasis outside of science (they’re obviously invaluable to anyone in science who actually wants a career), but if you completely ignore the social aspect you are ignoring an important part of what science actually is.

• Oh, with respect to (2) just now, the question is then what makes a scientist different from anyone else.

• Lucas,

I don’t think there is a difference. Science is but a refined form of problem-solving: trial and error; conjecture and refutation. Scientists, however, are dealing with problems in the abstract, arriving at explanations for phenomena in the broadest sense possible.

• Lucas,

I’m as liberal as you are, since I reject the account of knowledge as ‘justified true belief’ as demanding the impossible.

(1) I should probably have explained more fully: even if there exists an objective way to measure bits, it does us no good, since we are not omniscient. That’s identical to the distinction made between objective and subjective interpretations of the probability calculus: one takes probability as an objective property, while the other takes probability as existing only within the mind.

The same goes for the correspondence theory of truth: it tells us that statements are true under certain conditions (the statement, “snow is white” is true iff snow is in fact white), but we can never know (in the sense of having justified true beliefs) that these conditions obtain. They’re dealing with ontology, not with epistemology. Does that make sense?

(2) I don’t see it as a trade-off at all. Strictly universal statements are both simple and have a great deal of ‘descriptive power’, since they have the greatest logical content for the smallest number of bits. The statement, “All X are Y” is highly informative when compared to “Some X are Y”. Or am I misunderstanding you once more?

(3) a. I do agree with you on this point. After reading Rowbottom I’ve started to treat rational behavior as a function of a group: some scientists may, as Kuhn would say, engage in ‘puzzle solving’, some scientists may attempt to defend a theory come what may, and others may spend their time solely criticizing the theories of others. It may be possible to revise my initial comments so that we deal with group behavior, rather than individual behavior.

b. You’re right — perhaps the word ‘method’ would have worked better, but they’re just words after all, and we bring to them our own personal baggage. I tend to focus on the ‘logic’ or ‘method’ that underlies science, even though the sociological aspect of scientific practice is incredibly important, and hadn’t thought that the word would be an issue.

5. 1) Yes
2) Comparing “All X are Y” and “Some X are Y” is not quite the right comparison for what I’m talking about; it’s more like a sacrifice of detail for simplicity. It’s more like “All X are Y” compared to “Some X are Z”, where Z is a subset of Y. There is a choice: something that explains more simply, or something that explains better. Obviously, both simpler and better would be great – but this isn’t usually an option.

But back to what you said: “All X are Y” is definitely better than “Some X are Y” for the same Y (if it is true). I guess what I’m saying is that this is some kind of Pareto optimality: yes, it’s optimal, but there are other optima too – each one corresponding to a different way of relating/equating bits and logical content. Maximizing these produces some kind of Pareto front on which, in general, there is only “Some X are Y” if there isn’t “All X are Y”, but maybe “Some X are Z” is there instead of “All X are Y” (Z $\subset$ Y).

I don’t know if this makes it any clearer.

Perhaps, think of it this way: assume statements A and B have the same number of bits (equally, logical content), only differing in their logical content (bits). The question of how to relate bits and logical content doesn’t come up, because the bits (logical content) are the same and you just need to compare logical content (bits). But if the bits (logical content) of A and B were different, then to compare A and B you would have to find a way of talking about logical content (can you define this term for me, or was it an on-the-spot thing?) in terms of bits.

3)
a) Good.

I’m quite extreme in this respect: I have a tendency to describe science as a singular agent. But no one has bothered to disagree with me on it, so I don’t know how the idea will withstand scrutiny.
b) Every word is going to be an issue for someone somewhere.
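The Pareto-front point in (2) above can be sketched with invented numbers for “bits” and coverage; all values below are hypothetical, chosen only to make the dominance relations visible:

```python
# Score each candidate statement on two axes: simplicity (fewer "bits",
# however measured) and coverage (how much it explains).
candidates = {
    "All X are Y":  {"bits": 5, "coverage": 100},
    "Some X are Y": {"bits": 5, "coverage": 40},
    "Some X are Z": {"bits": 3, "coverage": 60},  # Z a subset of Y
}

def dominated(name):
    """True if some other candidate is at least as good on both axes
    and strictly better on at least one."""
    a = candidates[name]
    return any(
        b["bits"] <= a["bits"]
        and b["coverage"] >= a["coverage"]
        and (b["bits"] < a["bits"] or b["coverage"] > a["coverage"])
        for other, b in candidates.items()
        if other != name
    )

pareto = sorted(n for n in candidates if not dominated(n))
# "Some X are Y" is dominated by "All X are Y"; the other two are both
# Pareto-optimal, each under a different weighting of bits vs coverage.
assert pareto == ["All X are Y", "Some X are Z"]
```

With these numbers there is no single winner: which Pareto-optimal statement you pick depends entirely on how you choose to exchange bits for coverage.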

• Lucas,

2) You said, ““All X are Y” is definitely better than “Some X are Y” for the same Y (if it is true).” Would you agree that this also applies to ranking “All X are Z” and “Some X are Z” (where Z is a subset of Y)?

3) a. You said, “I’m quite extreme in this respect: I have a tendency to describe science as a singular agent. But no one has bothered to disagree with me on it, so I don’t know how the idea will withstand scrutiny.” That’s a highly admirable attitude to have.

Is there anything else you think we should discuss?

6. Dammit, my computer crashed, losing a significant amount of typing.

2) Yes, as long as they are both reasonable in the first place.

I think I can explain what I’m getting at – and how it relates to what you are saying – better now.

Take this pseudoformalisation of the problem (it’s incredibly hand-wavy actually): Let $A \subset B \subset C$ be sets of statements about some observation(s). Also,
Let $p(X)$ be the total probability of the statements $x \in X$ for some $X \in \{A, B, C\}$
Let $q(X)$ be a measure of the explanatory power of $X$.

Then I think that you will agree that your III is something like (hand wavy!!!):
$p(A) > p(B) > p(C)$
$q(A) < q(B) < q(C)$

maximise q => minimise p => falsifiability in some sense

I’m questioning the maximisation of q independently of p.

I’ve been trying to say that there is currently a general trend within science and statistics (applied epistemology ;) ) to create models (i.e. theories) by doing:

maximise f(p,q) for some monotonically increasing function f.

The choice of f is still rather arbitrary, and has to be if you ask me. But what is generally considered bad is choosing a function that doesn’t depend on p (-> probably useless) or on q (-> just a list). It is often argued that the choices of f that are used are close to what we are trying to do as scientists. I personally think there is something wrong with the justifications for the usual choices of f, but the procedure seems to be the right kind of thing.

“Is there anything else you think we should discuss?”
Are you going to finish your PhD?

• Lucas,

I’ll have to think over what you said. And … maybe?

7. I wrote $ instead of $latex, which meant it all got mixed up, and I wasn’t logged on; that’s annoying.
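The maximise-f(p, q) idea can be illustrated with one arbitrary choice of f; all theory names and numbers below are invented, and the weight w is precisely the arbitrary bias under discussion:

```python
# Rank illustrative theories by f(p, q) = q + w * p, monotonically
# increasing in both arguments. As w -> 0 this collapses to "maximise q"
# (the Popper-style rule); larger w rewards safer, more probable claims.
theories = {
    "just a list":    {"p": 0.99, "q": 0.10},
    "modest claim":   {"p": 0.60, "q": 0.50},
    "bold universal": {"p": 0.05, "q": 0.90},
}

def best_theory(w):
    """Winner under one (arbitrary) exchange rate between p and q."""
    return max(theories, key=lambda t: theories[t]["q"] + w * theories[t]["p"])

# Different (equally arbitrary) choices of w crown different winners:
assert best_theory(1.0) == "modest claim"     # probability weighted heavily
assert best_theory(0.1) == "bold universal"   # explanatory power dominates
```

Nothing in the data fixes w, which is the objection: each w picks a different theory, and every choice of f that depends on both p and q is such a w in disguise.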

• Bollocks bollocks bollocks…


8. WHAT!!!!

9. OK, If this doesn’t work I give up!


10. It seems the powers that be don’t want me to include a critical part of my post….

aha

It’s because I’ve been using left and right angle brackets, and it’s been wiping them, thinking they’re HTML tags.

This is what has been missing:

…. p(A) is more than p(B) is more than p(C)
q(A) is less than q(B) is less than q(C)

Then I think that you will agree that your III is something like (hand wavy!!!):

maximise q therefore minimise p therefore falsifiability in some sense

I’m questioning the maximisation of q independently of p….

Sorry to spam your comment section like this

• I don’t mind. If you want any of them deleted, just say the word.