Grouping

Tutorial

Sometimes the data you want to visualize must be computed from the data you already have. We've talked about transforms and filtering, which are two types of computation that might be required. Both operate on single data points: for example, converting each temperature to a different unit independently of all the others, or removing any data points that come from the wrong weather station.

Another type of computation is one across the data set. For example, what is the average temperature across all days in July 2011? What was the total government spending in 2013? The aggregates themselves can be used in computations to transform data points. For example, you might transform the budget of each government office from a dollar amount to a percentage of total government spending. This requires computing the total amount spent and then dividing each individual amount by the total.

The Group that Always Exists

Implicitly, your existing data is already in one group: the entire dataset. D3 has a few convenience functions to help you compute statistics about any group of data, including the entire dataset. Each of these functions takes an optional accessor function of the form function(d) that lets you compute statistics across a single "column" or "attribute" of the data. Below is a table of the most common aggregate functions that D3 provides, followed by an example using the number of Oscar wins for made-up films.

Function Name	Description
d3.min	Finds the minimum element of the data.
d3.max	Finds the maximum element of the data.
d3.extent	Finds both the min and max of the data. Equivalent to [d3.min(data), d3.max(data)]. Useful for scaling data based on its range of values.
d3.sum	Returns the sum of the data.
d3.mean	Returns the arithmetic mean (commonly called the average) of the data.
d3.median	Returns the median, a.k.a. the 50th percentile, which is conceptually the value at the center of all of the values.

var numberOfOscarWins = [2,6,1,7,9]; // It's just fake data
var ex1 = d3.select("#example1");
ex1.append("h3").text("Example:");
ex1.append("p").text("Example data: " + numberOfOscarWins);
ex1.append("p").text("Minimum data point: " + d3.min(numberOfOscarWins));
ex1.append("p").text("Maximum data point: " + d3.max(numberOfOscarWins));
ex1.append("p").text("Data range: " + d3.extent(numberOfOscarWins));
ex1.append("p").text("Sum: " + d3.sum(numberOfOscarWins));
ex1.append("p").text("Mean value: " + d3.mean(numberOfOscarWins));
ex1.append("p").text("Median value: " + d3.median(numberOfOscarWins));
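
The example above passes a plain array of numbers. When each data point is an object, the accessor picks out the attribute to aggregate. The sketch below spells out in plain JavaScript what an accessor-based sum and maximum compute. The films array and its wins field are made up for illustration, and the equivalent D3 calls appear only in comments for comparison.

```javascript
// Made-up data: one object per film (hypothetical field names).
var films = [
  { title: "Film A", wins: 2 },
  { title: "Film B", wins: 6 },
  { title: "Film C", wins: 1 }
];

// The accessor receives one data point and returns the value to aggregate.
function winsAccessor(d) { return d.wins; }

// With D3 this would be: d3.sum(films, winsAccessor)
var totalWins = films.map(winsAccessor)
                     .reduce(function (a, b) { return a + b; }, 0);

// With D3 this would be: d3.max(films, winsAccessor)
var mostWins = Math.max.apply(null, films.map(winsAccessor));

console.log("Total wins: " + totalWins);
console.log("Most wins: " + mostWins);
```

The same accessor can be reused with any of the aggregate functions in the table, which is why it is worth defining it once as a named function.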

As you can see, these aggregate functions are straightforward to use. A common transform is converting values to a percent of the total. This can easily be accomplished with d3.sum and Array.map, as in the example below.

var numberOfOscarWins = [2,6,1,7,9];
var ex2 = d3.select("#example2");
var total = d3.sum(numberOfOscarWins);
var percentFormatter = d3.format("%");
var percentStrings = numberOfOscarWins.map(function(d) { return " " + percentFormatter(d/total); });
ex2.append("p")
   .text("Example data as Percent of Total: " + percentStrings);

For the statisticians

This second table of aggregates has some functions for the more statistically minded that are less well known.

Function Name	Description
d3.quantile(data, p)	A quantile helps describe the distribution of quantitative data. The p-quantile, with p given as a fraction between 0 and 1, is the value below which that fraction of the data falls; expressed as a percentage it is called a percentile. For example, the 95th percentile (p = 0.95) is the value that 95% of the values are below and 5% are above. Likewise, the 75th percentile is above 75% of the values and below 25% of them. The function requires that the numbers are already sorted in ascending order.
d3.variance	Returns the variance of the data. Variance measures how closely grouped the values are. A higher variance means the data are less closely grouped.
d3.deviation	Returns the standard deviation of the data. Standard deviation is the square root of variance. If your data is normally distributed (looks like a bell curve), then about 68% of the data is within 1 standard deviation of the mean in either direction, and about 95% is within 2 standard deviations of the mean.

var ex = d3.select("#statistics_examples");
var tightlyGrouped = [1,1,1,1,1,1,1,2];
var allOverThePlace = [10,50,-20,10000];

ex.append("p")
  .text(tightlyGrouped + " - Variance = " + d3.variance(tightlyGrouped)); 

ex.append("p")
  .text(allOverThePlace + " - Variance = " + d3.variance(allOverThePlace)); 

ex.append("p")
  .text(tightlyGrouped + " - Standard Deviation = " + d3.deviation(tightlyGrouped)); 

ex.append("p")
  .text(allOverThePlace + " - Standard Deviation = " + d3.deviation(allOverThePlace)); 

var ten = [1,2,3,4,5,6,7,8,9,10];

ex.append("p")
  .text(ten + " - 80% of values are below: " + d3.quantile(ten, .8));
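
To make the definitions in the table concrete, here is a minimal plain-JavaScript sketch of the arithmetic behind a sample variance and an interpolated quantile. It assumes the definitions given in D3's documentation (variance as an unbiased estimator that divides by n - 1, and quantiles computed by linear interpolation between neighboring sorted values). The function names sampleVariance and interpolatedQuantile are made up for illustration and are not part of D3.

```javascript
// Sample variance: sum of squared distances from the mean, divided by (n - 1).
function sampleVariance(values) {
  var n = values.length;
  if (n < 2) return NaN; // assumption: treated as undefined for fewer than two values
  var mean = values.reduce(function (a, b) { return a + b; }, 0) / n;
  var sumOfSquares = values.reduce(function (acc, v) {
    return acc + (v - mean) * (v - mean);
  }, 0);
  return sumOfSquares / (n - 1);
}

// Quantile by linear interpolation between the two nearest sorted values.
// p is a fraction between 0 and 1; sorted must be in ascending order.
function interpolatedQuantile(sorted, p) {
  var position = (sorted.length - 1) * p; // fractional zero-based index
  var lower = Math.floor(position);
  var upper = Math.min(lower + 1, sorted.length - 1);
  var fraction = position - lower;
  return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
}

var tightlyGrouped = [1, 1, 1, 1, 1, 1, 1, 2];
var variance = sampleVariance(tightlyGrouped);
var deviation = Math.sqrt(variance);

var ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
var q80 = interpolatedQuantile(ten, 0.8);

console.log("Variance: " + variance);
console.log("Standard deviation: " + deviation);
console.log("80th percentile: " + q80);
```

For tightlyGrouped the variance works out to 0.125, and the 80th percentile of ten lands about a fifth of the way between 8 and 9, which shows that a quantile does not have to be one of the original values.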

Grouping data with nest()

Grouping data together is very common. For example, say you had a data set with every home sale transaction in every state. You want to examine the differences in sales prices between states. What you would like to report is the average sales price in each state. This means you need to group the transactions by state, and then compute the average for each group. It sounds a bit tedious, and D3 helps you out with d3.nest().

d3.nest() transforms your flat dataset into a nested or hierarchical data set. Let's examine how we would use it for the sample data below.

"homeid","state","price"
"1","KA","89,900"
"2","CA","165,000"
"3","KA","77,450"
"4","CA","343,000"
"5","CA","615,000"
"6","KA","139,183"

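This is some CSV data of the home transactions. Once we load it with d3.csv() it becomes an array of objects, one per row, keyed by the column names. Every value is still a string, so the first row looks like {homeid: "1", state: "KA", price: "89,900"}.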

Now we need to group this data by state using d3.nest(). d3.nest() is used through a method chain. Two functions are essential in the chain: key() and entries(). key() takes an accessor as a parameter and specifies what to group by. In our case we want to group by state, so we provide an accessor that returns the state field of each data point. entries() takes the dataset to operate on and ends the chain, because it returns the nested result. For example, we could group the data by state:

d3.nest()
  .key(function (d) { return d.state; })
  .entries(data);

The nest function creates a hierarchy in the data. At the top level it produces one object per group, each with two fields: "key" and "values". "key" holds the value returned by the accessor function, while "values" is an array of all of the original objects that share that key.
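
To show the shape of that result without needing D3, the sketch below builds the same key/values structure by hand. groupByKey is a hypothetical helper written for this illustration, not a D3 function, and it simply keeps keys in order of first appearance; the order D3 returns may differ unless you sort the keys.

```javascript
// The home sales as d3.csv() would deliver them: every value is a string.
var data = [
  { homeid: "1", state: "KA", price: "89,900" },
  { homeid: "2", state: "CA", price: "165,000" },
  { homeid: "3", state: "KA", price: "77,450" },
  { homeid: "4", state: "CA", price: "343,000" },
  { homeid: "5", state: "CA", price: "615,000" },
  { homeid: "6", state: "KA", price: "139,183" }
];

// Hypothetical helper (not part of D3): groups rows into
// [{ key: ..., values: [...] }, ...] in order of first appearance.
function groupByKey(rows, keyAccessor) {
  var groups = [];
  var indexByKey = Object.create(null); // key -> position in groups
  rows.forEach(function (row) {
    var key = String(keyAccessor(row));
    if (!(key in indexByKey)) {
      indexByKey[key] = groups.length;
      groups.push({ key: key, values: [] });
    }
    groups[indexByKey[key]].values.push(row);
  });
  return groups;
}

var nested = groupByKey(data, function (d) { return d.state; });
console.log(JSON.stringify(nested, null, 2));
```

Each element of nested has a "key" (the state) and a "values" array holding the untouched original row objects for that state.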

With the objects for a particular key grouped into the "values" array, it is easy to run an aggregate function on the values. You can also replace the values array directly with the result of running an aggregate function on the values. For example, suppose you wanted to visualize the average home price for all of California. The individual home prices don't matter; you just want to know what the average for California is. You can add rollup() to the method chain to replace values with the result of an aggregate. rollup() takes a function as a parameter that specifies how to aggregate the array of objects that would otherwise comprise "values". It replaces the "values" array with its returned value. The function can also return an object if you want to compute more than one aggregate, such as both the mean and the standard deviation.

d3.nest()
  .key(function(d) { return d.state; })
  .rollup(function(values) {
    return d3.mean(values, function(d) {
      // Remove every comma from price with a global replace,
      // then turn it into a number (int) with parseInt in base 10.
      return parseInt(d.price.replace(/,/g, ""), 10);
    });
  })
  .entries(data);

In this example, I computed the mean (average) sales price by state. By using D3 I was easily able to perform grouping, transformation, and aggregate computations on my original data and put it into a form appropriate for visualization. Later we'll be able to take transformed data like this and turn it into a colored map of the US to enable us to visualize average sales price across the USA.
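
A rollup function can also return an object when one number per group is not enough. The sketch below shows such a function, here called summarize (a made-up name), together with a plain-JavaScript stand-in for the grouping step so that it runs without D3. With D3 you would pass summarize to rollup() instead of doing the grouping by hand.

```javascript
var data = [
  { homeid: "1", state: "KA", price: "89,900" },
  { homeid: "2", state: "CA", price: "165,000" },
  { homeid: "3", state: "KA", price: "77,450" },
  { homeid: "4", state: "CA", price: "343,000" },
  { homeid: "5", state: "CA", price: "615,000" },
  { homeid: "6", state: "KA", price: "139,183" }
];

// Strip every thousands separator before converting to a number.
function parsePrice(text) {
  return parseInt(text.replace(/,/g, ""), 10);
}

// The kind of function you would hand to rollup(): it receives the rows
// of one group and returns a summary object instead of a single number.
function summarize(values) {
  var prices = values.map(function (d) { return parsePrice(d.price); });
  var total = prices.reduce(function (a, b) { return a + b; }, 0);
  return {
    count: prices.length,
    mean: total / prices.length,
    max: Math.max.apply(null, prices)
  };
}

// Plain-JavaScript stand-in for the grouping step (assumption: keys are
// kept in order of first appearance).
var rowsByState = Object.create(null);
data.forEach(function (d) {
  (rowsByState[d.state] = rowsByState[d.state] || []).push(d);
});

// Mirror the key/values shape described above, with the summary
// object taking the place of the "values" array.
var summaryByState = Object.keys(rowsByState).map(function (state) {
  return { key: state, values: summarize(rowsByState[state]) };
});

console.log(JSON.stringify(summaryByState, null, 2));
```

Returning an object keeps related aggregates together, so a later visualization step can read the mean for color and the count for a tooltip from the same group.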

Sorting Keys and Values

It is common to also want to sort the keys and values of nested data. D3 provides two functions that can be added to the chain to help you do this. sortKeys() can be called after key(). It takes a standard sorting function that is used to sort the keys. Similarly, sortValues() takes a standard sorting function that is used to sort the values.

d3.csv("group_data.csv",
  function(error, data) {
    if (error) throw error; // stop if the file failed to load
    // Prices are strings such as "89,900": strip the commas, then parse as an int.
    var toInt = function(price) { return parseInt(price.replace(/,/g, ""), 10); };
    var nested = d3.nest()
                   .key(function(d) { return d.state; })
                   .sortKeys(d3.ascending)
                   .sortValues(function(a, b) {
                     // compare as numbers, highest price first
                     return d3.descending(toInt(a.price), toInt(b.price));
                   })
                   .entries(data);
    d3.select("#sortedKeysValues")
      .text(JSON.stringify(nested, null, "    "));
});

In this example, the data is grouped by State and the States are sorted in alphabetical order. In addition, the transactions within each State are sorted in descending order of the transaction price.

Quiz


Match the aggregate function to its description

d3.min - returns smallest element
d3.mean  - returns the average 
d3.extent - returns the range of values as a two element array
d3.sum - returns the result of adding up all data elements together
d3.median - returns the data element that is the middle of the dataset (50th percentile)


Match the nest function to its description

.key - sets what value is being grouped on
.entries - sets the dataset to nest
.rollup - applies a function to the values of a group
.sortKeys - sets the sorting function for the values being grouped on
.sortValues - sets the sorting function for the values within the group


d3.quantile(data, .95) returns the value that is less than 95% of the dataset
True/False

Things to do

The temperature dataset has data from 3 different stations in the Death Valley region. Nest the data by the DATE column so we can aggregate the data by date.
Compute the average maximum temperature (TMAX) across the three stations for each day using rollup. Remember that the data is in string format and will need to be converted using parseInt before it is used for computation. Convert the average from tenths of a degree Celsius to Fahrenheit. The formula is: (val/10)*9/5 + 32.
Sort the keys so that the dates are in order using sortKeys. Use d3.ascending along with the provided timeFmt.parse(dateStr) to create date objects from the strings in the data set. Date objects sort as you would expect when using d3.ascending.
Remove the debugging code and output a new "p" tag for each date in the format "Month, Day Year - temperature F". Use the provided timeOutFmt() to transform a Date object into a string of the format "Month, Day Year". Remember that the key objects are strings, so you will need to convert from String to Date and then Date to formatted String. Use Math.round to convert the average temperatures to integers.
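
As a starting point for the unit conversion step, here is a small sketch of the standard Celsius-to-Fahrenheit conversion (multiply by 9/5, then add 32) applied to a reading in tenths of a degree Celsius. The function name tenthsCelsiusToFahrenheit and the sample readings are made up for illustration and are not part of the provided exercise code.

```javascript
// Convert a reading in tenths of a degree Celsius to degrees Fahrenheit.
function tenthsCelsiusToFahrenheit(tenths) {
  var celsius = tenths / 10;
  return celsius * 9 / 5 + 32;
}

// Sanity checks against two well-known reference points.
var freezing = tenthsCelsiusToFahrenheit(0);    // freezing point of water
var boiling = tenthsCelsiusToFahrenheit(1000);  // boiling point of water

// A made-up hot-day reading of 45.6 degrees Celsius, rounded for display.
var hotDay = Math.round(tenthsCelsiusToFahrenheit(456));

console.log(freezing + " F, " + boiling + " F, " + hotDay + " F");
```

Checking a conversion against the freezing and boiling points of water is a quick way to catch a swapped 9/5 and 5/9.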

Extra Credit

Scroll through the output; some of the output values don't make sense. For example, "-404 F" is obviously some sort of error. Go back and, after converting TMAX to an integer, convert obviously bad values into NaN, which stands for "Not a Number". The aggregate functions in D3 will correctly ignore this value.