A common and powerful way to compare data to a theory is to search for a theoretical curve that matches the data as closely as possible. You may suspect, for example, that friction causes a uniform deceleration of a spinning disk, so you have gathered data for the angular velocity of the disk as a function of time. If your hypothesis is correct, then these data should lie approximately on a straight line when angular velocity is plotted as a function of time. They won't be exactly on the line because your experimental observations are inevitably uncertain to some degree. They might look like the data shown in the figure below.
Our task is to find the best line that goes through these data. When we have found it, we would like answers to the following questions:
Associated with each data point is an error bar, which is the graphical representation of the uncertainty of the measured value. We assume that the errors are normally distributed, which means that they are described by the bell-shaped curve or Gaussian shown in the discussion of standard deviation. The height between the data point and the top or bottom of the error bar is $\sigma$, so about 2/3 of the time, the line or curve should pass within one error bar of the data point.
Sometimes the uncertainty of each data point is the same, but it is just as likely (if not more likely!) that the uncertainty varies from datum to datum. In that case the line should pay more attention to the points that have smaller uncertainty. That is, it should try to get close to those "more certain" points. When it can't, we should grow worried that the data and the line (or curve) fundamentally don't agree.
A pretty good way to fit straight lines to plotted data is to fiddle with a ruler, doing your best to get the line to pass close to as many data points as possible, taking care to count more heavily the points with smaller uncertainty. This method is quick and intuitive, and is worth practicing. Here's my attempt to fit a line by eye.
For more careful work, we need a way to evaluate how successfully a given line (or curve) agrees with the data. Each data point sets its own standard of agreement: its uncertainty. We can quantify the disagreement between a point and the line by measuring the (vertical) distance between the point and the line, in units of the error bar for each point. The data point at t = 10 s, for example, is about 1 error bar unit away from the line. It turns out that a very useful way of adding up all the discrepancies $[y_i - f(x_i)]/\sigma_i$ between the line and the data is to square them first. That way, all the terms in the sum are positive (after all, a point can't be correct with 200% probability!).
We define the function $\chi^2$ to be this sum of squares of discrepancies, each measured in units of error bars. Symbolically,

$$\chi^2 = \sum_{i=1}^{N} \left[ \frac{y_i - f(x_i)}{\sigma_i} \right]^2 \qquad (1)$$

Kaleidagraph and Origin can perform the operation of finding the line or curve that minimizes $\chi^2$. The result of performing this least-squares fit is shown in the red curve in the following figure.
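For readers who want to see what such a program does for a straight line, here is a minimal sketch in plain Python. It assumes the model f(x) = a + b x and uses the standard closed-form solution of the weighted least-squares problem; the function names and the sample numbers are hypothetical and are not the data in the figure.

```python
import math

def chi_squared(x, y, sigma, model):
    """Sum of squared discrepancies, each measured in units of its error bar."""
    return sum(((yi - model(xi)) / si) ** 2 for xi, yi, si in zip(x, y, sigma))

def weighted_line_fit(x, y, sigma):
    """Closed-form minimum of chi-squared for the line f(x) = a + b*x.

    Returns (a, b, sigma_a, sigma_b).
    """
    w = [1.0 / s**2 for s in sigma]  # points with small error bars count more
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx**2
    a = (Sxx * Sy - Sx * Sxy) / delta  # offset
    b = (S * Sxy - Sx * Sy) / delta    # slope
    return a, b, math.sqrt(Sxx / delta), math.sqrt(S / delta)

# Hypothetical measurements (assumed for illustration only)
t = [0.0, 5.0, 10.0, 15.0, 20.0]
omega = [99.0, 86.5, 69.0, 56.0, 39.5]
err = [1.0, 2.0, 1.0, 2.0, 1.0]

a, b, sa, sb = weighted_line_fit(t, omega, err)
chi2 = chi_squared(t, omega, err, lambda ti: a + b * ti)
print(f"offset = {a:.2f} +/- {sa:.2f}, slope = {b:.3f} +/- {sb:.3f}, chi2 = {chi2:.2f}")
```

Any other choice of offset and slope gives a larger value from `chi_squared` for the same data, which is what "best fit" means here.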

Evidently, my $\chi$-by-eye method was pretty good for the slope, but was off a bit in the offset. According to this fit, the acceleration is -3.10 ± 0.08 bar/s/s, which you can read off the fit results table made by Kaleidagraph. This is pretty neat! The plotting and analysis program found the best-fit line for me, and even estimated the confidence of the slope. What could be better?
Well, what about some assessment of the likelihood that these data are really trying to follow a straight line? We may have found the best line, in the sense of the one that minimizes the squared deviations of the data points, but it may well be that the data follow a different curve and so no line properly describes the data.
The value of $\chi^2$ tells us a great deal about whether we should trust this whole fitting operation. If our assumptions about normal errors and the straight line are correct, then the typical deviation between a data point and the line should be a little less than $1\sigma$. This means that the value of $\chi^2$ should be about equal to the number of data points.
Actually, we have to reduce the number of data points $N$ by the number of fit parameters $m$ because each fit parameter allows us to match one more data point exactly. In the pictured data set, there are 16 data points and 2 fit parameters. We can compute the reduced value of $\chi^2$, denoted $\tilde{\chi}^2$, by dividing $\chi^2$ by $N-m$. Hence, we find here that $\tilde{\chi}^2 = 2.1$. This strongly suggests that the data and the line do not agree!
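The arithmetic is short enough to sketch in Python. The function name is hypothetical, and the raw chi-squared of 29.4 is an assumption: it is back-computed from the quoted reduced value of 2.1, since the text does not state the raw value for the straight-line fit.

```python
def reduced_chi_squared(chi2, n_points, n_params):
    """Divide chi-squared by the number of degrees of freedom, N - m."""
    dof = n_points - n_params
    if dof <= 0:
        raise ValueError("need more data points than fit parameters")
    return chi2 / dof

# 16 points and 2 fit parameters (from the text) leave 14 degrees of freedom.
# chi2 = 29.4 is an assumed value, back-computed from the quoted 2.1.
print(reduced_chi_squared(29.4, 16, 2))
```

A reduced value near 1 is what a correct model with honest error bars should give; a value well above 1 is the warning sign discussed here.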
How can this be? They look so good together!
A good way to look more closely is to prepare a plot of residuals.
Residuals are the differences between each data point and the line or curve
at the corresponding value of x. Such a plot is shown at the right.
For a reasonable fit, about two-thirds of the points should be within one error bar from the black line at zero. In this fit we can see that several points are considerably more than one standard deviation from the line at zero. The first point is decidedly above the line, and the last point is clearly above the line, too. Almost all the other points are below the line, and a few of them are considerably below, again measured in units of their error bars. Maybe we need a curve that opens up a bit, instead of a line.
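The checks described above can be tabulated before plotting anything. The following is a rough sketch with hypothetical function names and made-up numbers: it divides each residual by its error bar, then reports the fraction of points within one error bar of the curve and how many lie above and below it.

```python
def normalized_residuals(x, y, sigma, model):
    """Residuals (data minus model), each in units of its own error bar."""
    return [(yi - model(xi)) / si for xi, yi, si in zip(x, y, sigma)]

def summarize_residuals(res):
    """Fraction within one error bar, and counts above and below the curve."""
    n = len(res)
    within = sum(1 for r in res if abs(r) <= 1.0)
    above = sum(1 for r in res if r > 0)
    below = sum(1 for r in res if r < 0)
    return {"fraction_within_1_sigma": within / n, "above": above, "below": below}

# Hypothetical data and a hypothetical straight-line model (illustration only)
t = [0.0, 5.0, 10.0, 15.0, 20.0]
omega = [102.0, 84.0, 68.5, 54.0, 41.5]
err = [1.0, 1.0, 1.0, 1.0, 1.0]
line = lambda ti: 100.0 - 3.0 * ti

res = normalized_residuals(t, omega, err, line)
print(res)
print(summarize_residuals(res))
```

In this made-up example the end points sit above the line and the middle points sit on or below it, the same kind of pattern that hints at a curve rather than a line.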
On more solid theoretical grounds, if the braking torque (twisting force) is proportional to the rotational speed, then we would expect a speed that decreases exponentially with time. Let's try an exponential curve of the form

$$\omega(t) = \omega_0 \, e^{-t/\tau}$$

where $\omega$ is the angular speed, $\omega_0$ is its initial value, and $\tau$ is the characteristic time of the deceleration. The result of performing such a fit is shown below.
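An exponential has no closed-form least-squares solution, so fitting programs iterate. The sketch below uses a plain Gauss-Newton iteration, which is one common choice and not necessarily what Kaleidagraph or Origin does internally. It assumes the two-parameter model (initial speed) × exp(−t / time constant); the function name, the starting guesses, and the synthetic data are all hypothetical.

```python
import math

def fit_exponential(t, y, sigma, w0, tau, n_iter=50, tol=1e-10):
    """Gauss-Newton minimization of chi-squared for y = w0 * exp(-t / tau).

    w0 and tau are starting guesses.
    Returns (w0, tau, sigma_w0, sigma_tau, chi2).
    """
    for _ in range(n_iter):
        a11 = a12 = a22 = b1 = b2 = 0.0
        for ti, yi, si in zip(t, y, sigma):
            e = math.exp(-ti / tau)
            r = (yi - w0 * e) / si          # discrepancy in error-bar units
            j1 = e / si                     # d(model)/d(w0), scaled by sigma
            j2 = w0 * e * ti / tau**2 / si  # d(model)/d(tau), scaled by sigma
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            b1 += j1 * r
            b2 += j2 * r
        # Solve the 2x2 normal equations for the parameter step
        det = a11 * a22 - a12 * a12
        d_w0 = (a22 * b1 - a12 * b2) / det
        d_tau = (a11 * b2 - a12 * b1) / det
        w0 += d_w0
        tau += d_tau
        if abs(d_w0) < tol * abs(w0) and abs(d_tau) < tol * abs(tau):
            break
    chi2 = sum(((yi - w0 * math.exp(-ti / tau)) / si) ** 2
               for ti, yi, si in zip(t, y, sigma))
    # Uncertainties from the inverse of the curvature matrix of the last
    # iteration (evaluated one tiny step before the final parameters)
    return w0, tau, math.sqrt(a22 / det), math.sqrt(a11 / det), chi2

# Synthetic data (assumed): 16 points from w0 = 100, tau = 24 plus small offsets
t = [2.0 * i for i in range(16)]
offsets = [0.4, -0.7, 0.9, -0.3, -0.6, 0.8, 0.1, -0.9,
           0.5, 0.2, -0.4, 0.7, -0.2, -0.5, 0.6, -0.1]
y = [100.0 * math.exp(-ti / 24.0) + d for ti, d in zip(t, offsets)]
sigma = [1.0] * len(t)

w0, tau, s_w0, s_tau, chi2 = fit_exponential(t, y, sigma, w0=90.0, tau=20.0)
print(f"w0 = {w0:.2f} +/- {s_w0:.2f}, tau = {tau:.2f} +/- {s_tau:.2f}, chi2 = {chi2:.2f}")
```

Gauss-Newton needs reasonable starting guesses; reading the initial speed and a rough decay time off the plot is usually good enough. A production fitter would add damping for cases where the first guess is poor.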

Does it look a bit better to the eye? Maybe. But it certainly looks better statistically. The value of $\chi^2 = 16.3$, which means $\tilde{\chi}^2 = 1.16$. It is a little higher than expected, but not alarmingly so. According to the table in Appendix D of An Introduction to Error Analysis, Second Edition, by John R. Taylor, the probability of getting a value of $\tilde{\chi}^2$ that is larger than 1.16 on repeating this experiment is about 31%. That is, slightly more than 2/3 of the time we should expect a value of $\tilde{\chi}^2$ that is smaller than this value. Not perfect, but quite reasonable.
By contrast, the same table gives a probability of only about 1% that a correct straight-line model would produce a reduced chi-squared as large as the 2.1 found for the straight-line fit shown above. It's hard to see by eye that the exponential fit is so much better than the linear fit.
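If the table is not at hand, these tail probabilities can be computed directly. The sketch below relies on a known closed form that holds only when the number of degrees of freedom is even, which is the case here (16 − 2 = 14). The function name is hypothetical, and the raw chi-squared for the line is an assumption, taken as 2.1 × 14 from the quoted reduced value.

```python
import math

def chi2_tail_probability_even_dof(chi2, dof):
    """P(chi-squared >= chi2) for an EVEN number of degrees of freedom.

    Exact finite sum: exp(-x/2) * sum over k < dof/2 of (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only holds for even, positive dof")
    half = chi2 / 2.0
    term, total = 1.0, 0.0
    for k in range(dof // 2):
        total += term           # add (x/2)^k / k!
        term *= half / (k + 1)  # advance to the next term of the series
    return math.exp(-half) * total

dof = 16 - 2                                             # N - m from the text
p_exp = chi2_tail_probability_even_dof(16.3, dof)        # exponential fit
p_line = chi2_tail_probability_even_dof(2.1 * dof, dof)  # line (assumed chi2)
print(f"exponential: {p_exp:.3f}   line: {p_line:.4f}")
```

The two results should land close to the roughly 30% and 1% read from the table. For an odd number of degrees of freedom a general routine based on the incomplete gamma function would be needed instead.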
A residual plot also shows a more even distribution of errors. Now about half the points are above the zero line, half below. The end points are still above the line, but not markedly so. The residual plot helps build confidence in our exponential analysis.
Now that we have a fit with a reasonable value for $\chi^2$, we can be more confident of the values determined by the fit. These values, and their uncertainties, are shown in the red table of the figure. (I hasten to add that such a means of presenting this information is informal; it is great for lab notebooks and notes, but in a formal presentation of data, such as in a technical report or journal article, such information is removed from the figure and the most important parts are placed in a caption below the figure.) In particular, the deceleration time constant is $\tau$ = (24.3 ± 0.7) s and the initial angular speed is (100.2 ± 0.6) bar/s.
| Conclusions |
|---|
| Based on the better behavior of the exponential fit we can conclude that |
Well, each data point is supposed to have some uncertainty, estimated as $\sigma_i$. It is fantastically improbable that the discrepancy between each point and the curve should vanish. When $\chi^2 = 0$, it means that you dry-labbed the experiment. Don't even think of trying it!
Updated 9/11/99 by Peter N. Saeta.