Distributions in Excel

The properties of binomial and normal distributions are well-documented in statistics books, and they are tabulated in computer software like Excel, too.

We will continue our introduction to Excel by looking at ways to display distributions. If you are already comfortable using Excel, there are several steps you will be able to skim or even skip. If you are not familiar with basic Excel features, each step should be followed. Either way, you should feel free to experiment with different features. You can't break the computer by entering a strange function or clicking on an unknown button, and none of the data we are working with here is critical. Sometimes playing around with buttons and menu items is the best way to learn about software.

Graphing

The first feature of Excel to get used to is its ability to make graphs. Unfortunately, Excel was originally meant as a business software program, and many of its graphing features do not yet live up to the expectations of most scientists and statisticians. But you can still do quite a bit.

Your task is simply to create a histogram of the following frequency table. Note that you only need to use Excel's regular graphing features since this table is not the raw data. Later, we will learn how to make histograms directly from the raw data.

If you are already familiar with Excel's graphing features, go ahead and make either a column or bar graph of this data. Otherwise, follow the steps below.

Bin
Frequency
2
1
3
2
4
5
5
6
6
8
7
9
8
7
9
6
10
5
11
1
12
0
  1. Either select the table cells above, copy them, and paste them into Excel; or type the data into Excel.

  2. Now, in Excel, select the second column of these data and click on the Chart Wizard button or select Chart... from the Insert menu.

  3. In the dialog box that appears, you may choose either the Column or Bar chart type and the first chart sub-type. You could click the Finish button at this point, but it is useful to click the Next button to see what other options the Chart Wizard will take you through.

  4. If you only selected the second column of the data for Excel to chart, you should notice that the chart makes a nice histogram but that the numbers at the bottom are not correct. At this point, there should be a Chart menu at the top of the screen (If not, click once somewhere on the chart). Select Source Data... from the Chart menu.

  5. In the dialog box that appears, click in the field next to Category (X) axis labels: and then drag the mouse over the first column of the data. Click the OK button.

You should see that the numbers at the bottom of the graph have changed to what we want. The only problem is that the numbers are shown in the middle of the bins rather than at the edges. That may be all right for this histogram, but in general it is not what we want. But we have to live with it.

You can play with other types of charts, too. Of particular interest for many science applications is the XY (Scatter) chart type. Play around! See what you can do.

Calculations

Calculations are done in Excel cells by using an equal sign (=) as the first character in the cell. Anything after the equal sign is treated as and equation. To practice this and to reinforce the concepts of the mean and standard deviation of a random variable, consider the data you worked with above. Assume that data is a histogram of several trials where two six-sided dice were thrown and their sum was recorded. Given only the data presented, how could you determine the mean and standard deviation of the data? That is your task. If you can do it on your own, do so. Otherwise, follow these steps:

To find the mean, we must add up the product of the value times its frequency. The first step is to find this product.

  1. In the cell next to the first row of data {2,1}, type an equal sign and then click on the cell containing the '2'.

  2. Now type the multiplication operator * and then click on the cell containing the '1' and hit the enter or return key to end your input of the equation. The cell you just worked with should display the product of the first column and the second column.

  3. Instead of doing that for every cell in the third column, you can simply select the third column from the cell you just worked on down to the last row of data and choose Fill -> Down from the Edit menu. Excel automatically adjusts the equations as they are filled down the column. So now the third column should show the product of the first two columns.

  4. To get the mean, all we have to do is add up this third column and divide by the total number of observations. You can do this by selecting the cell below the data in column 2and typing '=SUM(' and then dragging over the cells you want to add up and then typing the closing parentheses ')' and hitting the enter or return key to end your input of the equation. This gives you the total number of observations.

  5. Now you can select the cell that has the sum you just calculated and the cell to the right (in column 3 below the products you just calculated) and choose Fill -> Right from the Edit menu. Again, Excel adjusts the equations automatically and now the cell below the products has the sum of the products.

  6. The final step is just to divide the sum of the products by the total number of observations. Choose an empty cell and type an equal sign. Then click on the cell with the sum of the products; type the division operator /; click on the cell with the total number of observations; and hit enter or return to end your input of the equation. You should now have the mean! It should be about 6.8.

To find the standard deviation, we must find the variance which is more involved, but certainly possible once you calculate the mean. Try it! You should find a standard deviation of about 2.34.

Functions

We already used some functions last week and above, but sometimes it is good to get used to inserting functions by using the Paste Function button or by selecting Function... from the Insert menu. This can be especially useful when the function has several parameters because Excel will guide you through the parameters. You can use this feature to get to know Excel's functions for the normal and binomial distribution.

Normal distribution

While we have the above data in our Excel spreadsheet, we can compare it to a normal distribution. It looks pretty normal (in the statistical sense), but we can use Excel to plot a normal distribution for comparison. You can do this with the NORMDIST() function. If you're up for exploring, try to make a column next to the data we already have which shows the values predicted for a normal distribution with a mean of 6.8 and a standard deviation of 2.34. If you like to follow steps, here they are.

  1. Select an empty cell in the first row of data and click on the Paste Function button or select Function... from the Insert menu.

  2. In the dialog box which comes up, choose Statistical on the left and then choose NORMDIST from the list on the right.

  3. Another dialog box should pop up with the parameters for a normal distribution: X (which is your bin value -- the first column of your data) and the mean and standard deviation of the data. The fourth parameter is either TRUE or FALSE depending on whether you want a cumulative probability or a discrete one. For what you are doing now, use FALSE. Usually, you will want to use TRUE unless you are making a graph. If you want clarification on this, ask!

  4. So, in the parameter dialog box, click in the X input field and choose the first cell in the bin column of your data (you may have to move the dialog box to be able to do this). Then enter the mean and standard deviation and type FALSE in the Cumulative field.

  5. Notice that Excel displays the result in the dialog box. You can use the dialog box as a "normal distribution calculator" if you want! For now, click the OK button.

  6. Now use the Fill -> Down feature to fill the column with calculations of the normal distribution.

  7. What you should notice is that the values for the normal distribution are relative frequencies. They are probabilities, so they are all between 0 and 1. Our original data are not relative frequencies. To remedy this, you must either convert the original data to relative frequencies by dividing each value by the total number of observations; or you must convert the normal distribution probabilities to frequencies by multiplying each value by the total number of observations.

  8. You can now add the data you just calculated to your graph (if it is still around) by clicking on the graph and selecting Add Data... from the Chart menu. For the dialog box that comes up, you can drag over the colum of normal distribution data and then click the OK button. That should add the normal distribution to your chart.

Binomial distribution

As our final exercise, we will return to the question from the end of the M&M's workshop about the probability of getting a certain number of red M&M's in a set of 5. We found that this question could be answered by looking at the binomial distribution. Calculating the binomial distribution is not that difficult for small numbers of trials, but a computer is very nice for calculations involving larger numbers of trials.

Your task, now that you know how to use functions and graphs, is to create a graph of the binomial distribution for five trials. Notice that this is all you have to be told. The binomial distribution is only a function of the number of trials. That's what makes it so useful!

Further investigations

If you get all of the above done, take some time to look at the investigations in these Flash simulations of rolling dice.