
Fitting Data

A common and powerful way to compare data to a theory is to search for a theoretical curve that matches the data as closely as possible. You may suspect, for example, that friction causes a uniform deceleration of a spinning disk, so you have gathered data for the angular velocity of the disk as a function of time. If your hypothesis is correct, then these data should lie approximately on a straight line when angular velocity is plotted as a function of time. They won't be exactly on the line because your experimental observations are inevitably uncertain to some degree. They might look like the data shown in the figure below.

fit1 picture

Our task is to find the best line that goes through these data. When we have found it, we would like answers to the following questions:

What do you mean, "best line"?

Associated with each data point is an error bar, which is the graphical representation of the uncertainty of the measured value. We assume that the errors are normally distributed, which means that they are described by the bell-shaped curve or Gaussian shown in the discussion of standard deviation. The height between the data point and the top or bottom of the error bar is σ, so about 2/3 of the time, the line or curve should pass within one error bar of the data point.

Sometimes the uncertainty of each data point is the same, but it is just as likely (if not more likely!) that the uncertainty varies from datum to datum. In that case the line should pay more attention to the points that have smaller uncertainty. That is, it should try to get close to those "more certain" points. When it can't, we should grow worried that the data and the line (or curve) fundamentally don't agree.

A pretty good way to fit straight lines to plotted data is to fiddle with a ruler, doing your best to get the line to pass close to as many data points as possible, taking care to count more heavily the points with smaller uncertainty. This method is quick and intuitive, and is worth practicing. Here's my attempt to fit a line by eye.

fit2 picture

Least-Squares Fitting

For more careful work, we need a way to evaluate how successfully a given line (or curve) agrees with the data. Each data point sets its own standard of agreement: its uncertainty. We can quantify the disagreement between a point and the line by measuring the (vertical) distance between the point and the line, in units of the error bar for each point. The data point at t = 10 s, for example, is about 1 error bar unit away from the line. It turns out that a very useful way of adding up all the discrepancies [yᵢ − f(xᵢ)]/σᵢ between the line and the data is to square them first. That way, all the terms in the sum are positive, so a point that lies above the line cannot cancel one that lies below it.

We define the function χ² to be this sum of squares of discrepancies, each measured in units of error bars. Symbolically,

$$\chi^2 = \sum_{i=1}^{N} \left[ \frac{y_i - f(x_i)}{\sigma_i} \right]^2 \qquad (1)$$

where the sum is over the N data points (xᵢ, yᵢ), σᵢ is the uncertainty of yᵢ, and f(x) is the equation of the line (or curve) we think models the data. Since it is the sum of squares, χ² cannot be negative. We would like χ² to be as small as possible. As we try different lines, we can calculate χ² for each one. The "best line" is the one with the smallest value of χ². That is, the best line is the one which has the "least squares."
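To make the definition concrete, here is a minimal sketch of Eq. (1) in Python. The function name `chi_squared`, the two trial lines, and the data values are all invented for illustration; they are not the measurements plotted on this page.

```python
def chi_squared(xs, ys, sigmas, model):
    """Sum of squared deviations, each measured in units of its own error bar."""
    return sum(((y - model(x)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))


# Hypothetical data, made up for this example only.
times = [0.0, 2.0, 4.0, 6.0]         # time in s
speeds = [100.0, 94.5, 87.0, 82.5]   # measured angular speed
errors = [1.0, 1.0, 2.0, 1.5]        # one-sigma uncertainty of each speed


def line_a(t):
    """Trial line with slope -3."""
    return 100.0 - 3.0 * t


def line_b(t):
    """Trial line with slope -2."""
    return 100.0 - 2.0 * t


chi2_a = chi_squared(times, speeds, errors, line_a)
chi2_b = chi_squared(times, speeds, errors, line_b)

# The trial line with the smaller chi-squared is the better of the two.
better = "line_a" if chi2_a < chi2_b else "line_b"
print(better, chi2_a, chi2_b)
```

Trying lines one at a time like this is exactly the "fiddle with a ruler" approach, only with a number attached to each attempt.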

Kaleidagraph and Origin can perform the operation of finding the line or curve that minimizes χ². The result of performing this least-squares fit is shown in the red curve in the following figure.

fit3 picture
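For a straight line, the minimum of χ² can be written in closed form, so no searching is needed. The sketch below uses the standard weighted least-squares formulas with weights 1/σ²; I am not claiming it is what Kaleidagraph or Origin does internally. The function name `weighted_line_fit` and the data are hypothetical.

```python
import math


def weighted_line_fit(xs, ys, sigmas):
    """Fit y = a + b*x by minimizing chi-squared.

    Returns (a, b, sigma_a, sigma_b), where the last two are the
    one-sigma uncertainties of the intercept and slope.
    """
    ws = [1.0 / s ** 2 for s in sigmas]            # weight of each point
    S = sum(ws)
    Sx = sum(w * x for w, x in zip(ws, xs))
    Sy = sum(w * y for w, y in zip(ws, ys))
    Sxx = sum(w * x * x for w, x in zip(ws, xs))
    Sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / delta              # intercept
    b = (S * Sxy - Sx * Sy) / delta                # slope
    return a, b, math.sqrt(Sxx / delta), math.sqrt(S / delta)


# Hypothetical data, made up for this example only.
times = [0.0, 2.0, 4.0, 6.0, 8.0]
speeds = [100.0, 94.5, 87.0, 82.5, 75.5]
errors = [1.0, 1.0, 2.0, 1.5, 1.0]

intercept, slope, intercept_err, slope_err = weighted_line_fit(times, speeds, errors)
print(f"slope = {slope:.2f} +/- {slope_err:.2f}")
print(f"intercept = {intercept:.1f} +/- {intercept_err:.1f}")
```

Points with small error bars get large weights, which is the quantitative version of "paying more attention" to the more certain points.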

Evidently, my χ-by-eye method was pretty good for the slope, but was off a bit in the offset. According to this fit, the acceleration is (-3.10 ± 0.08) bar/s/s, which you can read off the fit results table made by Kaleidagraph. This is pretty neat! The plotting and analysis program found the best-fit line for me, and even estimated the uncertainty of the slope. What could be better?

Well, what about some assessment of the likelihood that these data are really trying to follow a straight line? We may have found the best line, in the sense of the one that minimizes the squared deviations of the data points, but it may well be that the data follow a different curve and so no line properly describes the data.

The Meaning of χ²

The value of χ² tells us a great deal about whether we should trust this whole fitting operation. If our assumptions about normal errors and the straight line are correct, then the typical deviation between a data point and the line should be a little less than 1 σ. This means that the value of χ² should be about equal to the number of data points.

Actually, we have to reduce the number of data points N by the number of fit parameters m, because each fit parameter allows us to match one more data point exactly. In the pictured data set, there are 16 data points and 2 fit parameters, so N − m = 14. We can compute the reduced value of χ² by dividing χ² by N − m. Hence, we find here that the reduced χ² is 2.1. This strongly suggests that the data and the line do not agree!
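The bookkeeping is simple enough to sketch in a few lines. The function name `reduced_chi_squared` is hypothetical, and the input value 29.4 is just 2.1 × 14 worked backward from the rounded number quoted above, not a value read from the fit.

```python
def reduced_chi_squared(chi2, n_points, n_params):
    """Divide chi-squared by the number of degrees of freedom, N - m."""
    dof = n_points - n_params
    if dof <= 0:
        raise ValueError("need more data points than fit parameters")
    return chi2 / dof


# 16 data points and 2 fit parameters (slope and offset) leave 14 degrees of freedom.
# 29.4 is back-computed from the rounded reduced value 2.1 (an assumption).
linear_reduced = reduced_chi_squared(29.4, n_points=16, n_params=2)
print(round(linear_reduced, 2))
```

A value near 1 says the scatter is about what the error bars predict; a value well above 1 says either the model or the error bars are wrong.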

fit5 picture

How can this be? They look so good together! A good way to look more closely is to prepare a plot of residuals. Residuals are the differences between each data point and the line or curve at the corresponding value of x. Such a plot is shown at the right.

For a reasonable fit, about two-thirds of the points should be within one error bar from the black line at zero. In this fit we can see that several points are considerably more than one standard deviation from the line at zero. The first point is decidedly above the line, and the last point is clearly above the line, too. Almost all the other points are below the line, and a few of them are considerably below, again measured in units of their error bars. Maybe we need a curve that opens up a bit, instead of a line.
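The same checks can be done numerically. This sketch divides each residual by its error bar, then reports what fraction lie within one error bar of zero and how many fall above and below. The function names and the data are hypothetical, chosen so that the trial line is visibly too shallow.

```python
def normalized_residuals(xs, ys, sigmas, model):
    """Residuals (data minus model), each in units of its own error bar."""
    return [(y - model(x)) / s for x, y, s in zip(xs, ys, sigmas)]


def fraction_within_one_sigma(residuals):
    """Fraction of normalized residuals with magnitude at most 1."""
    return sum(1 for r in residuals if abs(r) <= 1.0) / len(residuals)


# Hypothetical data, made up for this example only.
times = [0.0, 2.0, 4.0, 6.0]
speeds = [100.0, 94.5, 87.0, 82.5]
errors = [1.0, 1.0, 2.0, 1.5]


def trial_line(t):
    """A deliberately poor trial line."""
    return 100.0 - 2.0 * t


res = normalized_residuals(times, speeds, errors, trial_line)
inside = fraction_within_one_sigma(res)
n_above = sum(1 for r in res if r > 0)
n_below = sum(1 for r in res if r < 0)

# For a reasonable fit, `inside` should be near 2/3 and the signs roughly balanced.
print(inside, n_above, n_below)
```

A lopsided count of signs, or a run of same-sign residuals in the middle of the range, is the numerical hint that a curve rather than a line is needed.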

On more solid theoretical grounds, if the braking torque (twisting force) is proportional to the rotational speed, then we would expect a speed that decreases exponentially with time. Let's try an exponential curve of the form

$$\omega(t) = \omega_0 \, e^{-t/\tau} \qquad (2)$$

where ω is the angular speed, ω₀ is the initial angular speed, and τ is the characteristic time of the deceleration. The result of performing such a fit is shown below.

fit4 picture
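A plotting program fits Eq. (2) by a nonlinear search. A simpler substitute, sketched here, is to take logarithms so that ln ω = ln ω₀ − t/τ is a straight line in t, and then do a weighted line fit with error bars σ/ω on ln ω. This is a different technique from the nonlinear fit and can give slightly different numbers; it assumes every ω is positive and every σ is small compared with ω. The function names and data are hypothetical.

```python
import math


def weighted_line_fit(xs, ys, sigmas):
    """Weighted least-squares line y = a + b*x; returns (a, b, sigma_a, sigma_b)."""
    ws = [1.0 / s ** 2 for s in sigmas]
    S = sum(ws)
    Sx = sum(w * x for w, x in zip(ws, xs))
    Sy = sum(w * y for w, y in zip(ws, ys))
    Sxx = sum(w * x * x for w, x in zip(ws, xs))
    Sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / delta
    b = (S * Sxy - Sx * Sy) / delta
    return a, b, math.sqrt(Sxx / delta), math.sqrt(S / delta)


def exponential_fit_via_logs(ts, omegas, sigmas):
    """Estimate omega0 and tau in omega = omega0 * exp(-t / tau).

    Returns (omega0, tau, sigma_omega0, sigma_tau).
    """
    logs = [math.log(w) for w in omegas]
    log_sigmas = [s / w for s, w in zip(sigmas, omegas)]   # error bar on ln(omega)
    a, b, sigma_a, sigma_b = weighted_line_fit(ts, logs, log_sigmas)
    omega0 = math.exp(a)          # intercept is ln(omega0)
    tau = -1.0 / b                # slope is -1/tau
    return omega0, tau, omega0 * sigma_a, sigma_b / b ** 2


# Hypothetical data, made up for this example only.
ts = [0.0, 5.0, 10.0, 15.0, 20.0]
omegas = [100.0, 81.5, 66.0, 54.2, 43.8]
sigmas = [1.0, 1.0, 1.0, 1.0, 1.0]

omega0, tau, omega0_err, tau_err = exponential_fit_via_logs(ts, omegas, sigmas)
print(f"tau = {tau:.1f} +/- {tau_err:.1f} s")
print(f"omega0 = {omega0:.1f} +/- {omega0_err:.1f}")
```

The uncertainties come from ordinary error propagation: an error σ on ω becomes σ/ω on ln ω, and an error on the slope b becomes that error divided by b² on τ = −1/b.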

Does it look a bit better to the eye? Maybe. But it certainly looks better statistically. The value of χ² is 16.3, which means the reduced χ² is 1.16. It is a little higher than expected, but not alarmingly so. According to the table in Appendix D of An Introduction to Error Analysis, Second Edition, by John R. Taylor, the probability of getting a reduced χ² that is larger than 1.16 on repeating this experiment is about 31%. That is, slightly more than 2/3 of the time we should expect a reduced χ² that is smaller than this value. Not perfect, but quite reasonable.

fit6 picture

By contrast, the same table shows that if the straight line were the correct model, a reduced χ² as large as the one we found for the linear fit would turn up only about 1% of the time. It's hard to see by eye that the exponential fit is so much better than the linear fit.
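If the table is not at hand, these probabilities can be estimated directly. For an even number of degrees of freedom d, the chance of exceeding a given χ² has a closed form: exp(−χ²/2) times the sum of (χ²/2)^k / k! for k from 0 to d/2 − 1. The sketch below handles only that even case, which covers the 14 degrees of freedom here. The function name is hypothetical, and 29.4 is again 2.1 × 14 worked backward from the rounded value.

```python
import math


def chi2_exceed_probability(chi2, dof):
    """Probability that chi-squared exceeds `chi2` for an even number of
    degrees of freedom, assuming the model and the error bars are right."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this sketch handles only positive, even dof")
    half = chi2 / 2.0
    term = 1.0      # the k = 0 term of the series
    total = 1.0
    for k in range(1, dof // 2):
        term *= half / k        # builds (chi2/2)**k / k! step by step
        total += term
    return math.exp(-half) * total


p_exponential = chi2_exceed_probability(16.3, 14)   # chi-squared of the exponential fit
p_linear = chi2_exceed_probability(29.4, 14)        # back-computed from 2.1 * 14

print(f"exponential: {p_exponential:.3f}")
print(f"linear:      {p_linear:.4f}")
```

The results should land close to the table values quoted above, roughly 0.3 for the exponential and roughly 0.01 for the line; small differences come from rounding and from interpolating in the table.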

A residual plot also shows a more even distribution of errors. Now about half the points are above the zero line, half below. The end points are still above the line, but not markedly so. The residual plot helps build confidence in our exponential analysis.

Fit results

Now that we have a fit with a reasonable value for χ², we can be more confident of the values determined by the fit. These values, and their uncertainties, are shown in the red table of the figure. (I hasten to add that such a means of presenting this information is informal; it is great for lab notebooks and notes, but in a formal presentation of data, such as in a technical report or journal article, such information is removed from the figure and the most important parts are placed in a caption below the figure.) In particular, the deceleration time constant is τ = (24.3 ± 0.7) s and the initial angular speed is ω₀ = (100.2 ± 0.6) bar/s.

Conclusions
Based on the better behavior of the exponential fit we can conclude that
  • The data are inconsistent with a model of uniform deceleration, but are probably consistent with a frictional torque that is proportional to the angular velocity.
  • The time constant for the exponential decay is (24.3 ± 0.7) s.
  • The initial angular speed is (100.2 ± 0.6) bar/s.

Pitfalls to avoid


Updated 9/11/99 by Peter N. Saeta.