Tip: You will see quoted blocks like this throughout this example project with tips for constructing your reports. You should consider these quoted sections as outside of the example structure.
Tip: Unless there is a good exception, you will want to hide code and warnings from the output of the HTML. You should try to make your visualizations and tables interpretable without needing to analyze the code. In order to format your code chunks so that they do not show up in output, you can set the following parameters as global settings for the full document or in the chunk headers, e.g.:
{r echo=FALSE, message=FALSE, warning=FALSE}
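For the global-settings route, a minimal sketch is a setup chunk near the top of the document that calls knitr's opts_chunk$set(). The chunk name "setup" and the include=FALSE header are only a common convention, not something this project prescribes.

```r
# Body of a setup chunk, e.g. one whose header is {r setup, include=FALSE}.
# The options set here become the defaults for every later chunk.
library(knitr)
opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

Individual chunks can still override a default in their own header when a piece of code is worth showing.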
This report explores a dataset containing prices and attributes for approximately 54,000 diamonds.
## [1] 53940 10
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
Our dataset consists of ten variables, with almost 54,000 observations.
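The output above has the shape produced by base R's dim(), str(), and summary(). A sketch of how it could be regenerated, assuming the data come from the ggplot2 package:

```r
library(ggplot2)   # ships the diamonds data set

dim(diamonds)      # number of observations and variables
str(diamonds)      # type and first few values of each variable
summary(diamonds)  # quartiles for numeric columns, counts for factors
```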
Tip: When plotting on a log scale, it is useful to note that 3 is about halfway between 1 and 10. As a side note, try not to plot counts on a log scale since counts of 0 are undefined and counts of 1 have a value of 0 (no height).
I transformed the long-tailed price data to better understand its distribution. The transformed price distribution appears bimodal, with the price peaking around 800 or so and again at 5000 or so. Why is there a gap at 1500? Are there really no diamonds with that price? I wonder what this plot looks like across the categorical variables of cut, color, and clarity.
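A plot of this kind could be sketched as below. The binwidth, axis breaks, and the window used to probe the gap are assumed values for illustration, not the settings used for the original figure.

```r
library(ggplot2)

# Histogram of price on a log10 x-axis. With a log scale the binwidth is
# measured in log10 units, so 0.01 gives narrow bins that expose gaps.
p_price <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 0.01) +
  scale_x_log10(breaks = c(300, 1000, 3000, 10000)) +
  labs(x = "Price (USD, log10 scale)", y = "Number of diamonds")
p_price

# Count the diamonds priced in a window around $1500 to probe the gap directly.
n_near_1500 <- sum(diamonds$price >= 1450 & diamonds$price <= 1550)
n_near_1500
```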
Tip: You can change the height and width of plots in code chunks with the fig.height and fig.width parameters in the chunk options.
Most diamonds are of ideal cut, with gradually fewer diamonds of lesser-quality cut. A majority of diamonds are of color G or better (letters earlier in the alphabet indicate better color). Clarity is skewed to the right, with most diamonds at clarity VS2 or worse.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
##
## 0.3 0.31 1.01 0.7 0.32 1 0.9 0.41 0.4 0.71 0.5 0.33 0.51 0.34 1.02
## 2604 2249 2242 1981 1840 1558 1485 1382 1299 1294 1258 1189 1127 910 883
## 0.52 1.51 1.5 0.72 0.53 0.42 0.38 0.35 1.2 0.54 0.36 0.91 1.03 0.55 0.56
## 817 807 793 764 709 706 670 667 645 625 572 570 523 496 492
The lightest diamond is 0.2 carat and the heaviest diamond is 5.01 carats. Above, I plot the main body of carat weights, trimming the highest-carat diamonds. Some carat weights occur far more often than others. Many of the most common carat values end in x.x0 or x.x1. I wonder how carat is connected to price, and I wonder if the carat values are specific to certain cuts of diamonds.
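The summary and the table of most common carat values can be reproduced along these lines (a sketch; the variable name carat_counts is mine):

```r
library(ggplot2)

summary(diamonds$carat)

# Frequency of each distinct carat value, most common first.
carat_counts <- sort(table(diamonds$carat), decreasing = TRUE)
head(carat_counts, 30)
```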
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 61.00 61.80 61.75 62.50 79.00
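Most diamonds have a depth between 60% and 65%, with a median of 61.8% and a mean of 61.75%. Depth here is the total depth percentage (z relative to the mean of x and y), not a length in mm.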
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 56.00 57.00 57.46 59.00 95.00
##
## 56 57 58 59 55 60 54 61 62 63 53 64 65 66 52
## 9881 9724 8369 6572 6268 4241 2594 2282 1273 588 567 260 146 91 56
Setting a small binwidth shows that most table values are integers. Most diamonds have a table between 55% and 60% (table is the width of the top of the diamond relative to its widest point, not a length in mm). Again, I wonder if this has anything to do with the cut of a diamond. Cut is a quality of a diamond that may influence carat weight and is responsible for making a diamond sparkle. There are likely to be strong relationships among carat, table, cut, and price.
Most diamonds have an x dimension between 4 mm and 7 mm, a y dimension between 4 mm and 7 mm, and a z dimension between 2 mm and 6 mm. The y- and z- plots have a few high outliers so let’s zoom in.
Zooming in, we see that there are a few conspicuous points at value 0 in each of the three x, y, and z plots. Let’s investigate this further by finding these diamonds.
##
## FALSE TRUE
## 53932 8
##
## FALSE TRUE
## 53933 7
##
## FALSE TRUE
## 53920 20
There are eight diamonds with an x value of 0, seven with a y value of 0, and twenty with a z value of 0. A physical dimension cannot be 0, so I treat these as missing values.
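The three tables above are consistent with a logical test on each dimension, sketched here:

```r
library(ggplot2)

# TRUE counts the diamonds whose recorded dimension is exactly 0.
table(diamonds$x == 0)
table(diamonds$y == 0)
table(diamonds$z == 0)
```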
## Source: local data frame [20 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 1.00 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.60 0
## 3 1.10 Premium G SI2 63.0 59 3696 6.50 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.50 6.47 0
## 5 1.50 Good G I1 64.0 61 4731 7.15 7.04 0
## 6 1.07 Ideal F SI2 61.6 56 4954 0.00 6.62 0
## 7 1.00 Very Good H VS2 63.3 53 5139 0.00 0.00 0
## 8 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 9 1.14 Fair G VS1 57.5 67 6381 0.00 0.00 0
## 10 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 11 1.56 Ideal G VS2 62.2 54 12800 0.00 0.00 0
## 12 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 13 1.20 Premium D VVS1 62.1 59 15686 0.00 0.00 0
## 14 2.20 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 15 2.25 Premium H SI2 62.8 59 18034 0.00 0.00 0
## 16 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 17 2.80 Good G SI2 63.8 58 18788 8.90 8.85 0
## 18 0.71 Good F SI2 64.1 60 2130 0.00 0.00 0
## 19 0.71 Good F SI2 64.1 60 2130 0.00 0.00 0
## 20 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2130 3564 5352 8803 15470 18790
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 949 2401 3931 5323 18820
Whenever the x or y dimension is 0, the z dimension is also 0, but the reverse does not hold: twelve of these diamonds have a z value of 0 with nonzero x and y. Comparing the diamonds in this subset to all other diamonds, these diamonds tend to be very expensive or fall in the third quartile of the entire diamonds data set. Other variables such as carat, depth, table, and price are reported, so I'll assume those values can be trusted.
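A sketch of how this subset and the two price summaries might be built. The name noVolume is the one used for this subset later in the report; the exact filter is my assumption.

```r
library(ggplot2)

# Diamonds with at least one dimension recorded as 0.
zero_dim <- with(diamonds, x == 0 | y == 0 | z == 0)
noVolume <- diamonds[zero_dim, ]

# Check the direction of the implication: z is 0 in every row of the subset,
# while some rows still have nonzero x and y.
all(noVolume$z == 0)
sum(noVolume$x != 0 & noVolume$y != 0)

summary(noVolume$price)              # prices within the subset
summary(diamonds$price[!zero_dim])   # prices of all other diamonds
```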
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 553 967 1207 2887 2644 18700
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2170 2983 3420 4712 5023 17080
Above, we subset the diamonds with high quality in color, clarity, and cut. Let’s compare the prices (first summary) and prices per carat (second summary) to the diamonds with consistently low quality classes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 2808 4306 5747 7563 18530
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 2638 3324 3579 4281 7437
There are far fewer diamonds that score low in all of color, clarity, and cut. The price per carat also seems to be noticeably lower for the worst diamonds compared to the best diamonds, even though the regular price ranges are fairly similar. Later in my analysis, I'm going to create density plots similar to the earlier price histograms to examine the price for each level of cut, color, and clarity.
What about the volume of a diamond? Does it have any relationships with price and other variables in the data set? I’m going to use a rough approximation of volume by using x * y * z to approximate a diamond as if it were a rectangular prism, basically a box.
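The box approximation is a single product of the three dimensions. A sketch, using a copy of the data so that the packaged data set is left untouched (the name diamonds_box is hypothetical):

```r
library(ggplot2)

# Treat each diamond as a rectangular box; x, y, z are in mm, so the
# result is in cubic millimetres.
diamonds_box <- as.data.frame(diamonds)
diamonds_box$volume <- with(diamonds_box, x * y * z)

table(diamonds_box$volume == 0)    # how many volumes collapse to 0
subset(diamonds_box, volume == 0)  # the affected rows
```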
##
## FALSE TRUE
## 53920 20
## carat cut color clarity depth table price x y z volume
## 2208 1.00 Premium G SI2 59.1 59 3142 6.55 6.48 0 0
## 2315 1.01 Premium H I1 58.1 59 3167 6.66 6.60 0 0
## 4792 1.10 Premium G SI2 63.0 59 3696 6.50 6.47 0 0
## 5472 1.01 Premium F SI2 59.2 58 3837 6.50 6.47 0 0
## 10168 1.50 Good G I1 64.0 61 4731 7.15 7.04 0 0
## 11183 1.07 Ideal F SI2 61.6 56 4954 0.00 6.62 0 0
## 11964 1.00 Very Good H VS2 63.3 53 5139 0.00 0.00 0 0
## 13602 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0 0
## 15952 1.14 Fair G VS1 57.5 67 6381 0.00 0.00 0 0
## 24395 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0 0
## 24521 1.56 Ideal G VS2 62.2 54 12800 0.00 0.00 0 0
## 26124 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0 0
## 26244 1.20 Premium D VVS1 62.1 59 15686 0.00 0.00 0 0
## 27113 2.20 Premium H SI1 61.2 59 17265 8.42 8.37 0 0
## 27430 2.25 Premium H SI2 62.8 59 18034 0.00 0.00 0 0
## 27504 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0 0
## 27740 2.80 Good G SI2 63.8 58 18788 8.90 8.85 0 0
## 49557 0.71 Good F SI2 64.1 60 2130 0.00 0.00 0 0
## 49558 0.71 Good F SI2 64.1 60 2130 0.00 0.00 0 0
## 51507 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0 0
The twenty diamonds with at least one dimension equal to 0 end up with volumes equal to 0. Instead of using the dimensions x, y, and z, I now use the density of diamond to compute the volume: I can convert carat to grams and then divide by the density.
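First, 1 carat is equivalent to 0.2 grams (200 mg). The volume summaries below were computed with a factor of 2 grams per carat, so the values shown are ten times the volume in cm^3; because volume is only a rescaling of carat, the shape of the distribution and all correlations are unaffected. Using Google, I found that diamond density is typically between 3.15 and 3.53 g/cm^3, with pure diamonds having a density close to 3.52 g/cm^3. I'm going to use the midpoint of that range, 3.34 g/cm^3, to estimate the volume of the diamonds.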
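The conversion can be sketched as follows, assuming the metric carat of 200 mg and the 3.34 g/cm^3 density quoted above. The variable names are illustrative. The last lines check that the figures printed in this report correspond to a factor of 2 rather than 0.2, a rescaling that does not change any correlation.

```r
library(ggplot2)

grams_per_carat <- 0.2    # one metric carat is 200 mg
density_g_cm3   <- 3.34   # midpoint of the quoted 3.15 to 3.53 g/cm^3 range

# Volume in cm^3 = mass in grams / density.
volume_cm3 <- diamonds$carat * grams_per_carat / density_g_cm3
summary(volume_cm3)

# Values as printed in the report: same formula with 2 in place of 0.2.
volume_report <- diamonds$carat * 2 / density_g_cm3
summary(volume_report)

# A constant rescaling leaves the correlation with price unchanged.
all.equal(cor(volume_cm3, diamonds$price), cor(volume_report, diamonds$price))
```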
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1198 0.2395 0.4192 0.4778 0.6228 3.0000
##
## 0.18 0.186 0.605 0.419 0.192 0.599 0.539 0.246 0.24 0.425 0.299 0.198
## 2604 2249 2242 1981 1840 1558 1485 1382 1299 1294 1258 1189
## 0.305 0.204 0.611 0.311 0.904 0.898 0.431 0.317 0.251 0.228 0.21 0.719
## 1127 910 883 817 807 793 764 709 706 670 667 645
The histogram of volume is right skewed, so I'm going to transform the data using a log transform. The histogram and the counts of the most common values line up with those for carat, since volume is a linear transformation of carat.
Tip: Use the following section to summarize your observations during the univariate exploration of your dataset.
There are 53,940 diamonds in the dataset with 10 features (carat, cut, color, clarity, depth, table, price, x, y, and z). The variables cut, color, and clarity, are ordered factor variables with the following levels.
(worst) —————-> (best)
cut: Fair, Good, Very Good, Premium, Ideal
color: J, I, H, G, F, E, D
clarity: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF
Other observations:
The main features in the data set are carat and price. I’d like to determine which features are best for predicting the price of a diamond. I suspect carat and some combination of the other variables can be used to build a predictive model to price diamonds.
Carat, color, cut, clarity, depth, and table likely contribute to the price of a diamond. I think carat (the weight of a diamond) and clarity probably contribute most to the price after researching information on diamond prices.
I created a variable for the volume of diamonds using the density of diamonds and the carat weight of diamonds. This arose in the bivariate section of my analysis when I explored how the price of a diamond varied with its volume. At first volume was calculated by multiplying the dimensions x, y, and z together. However, the volume was a crude approximation since the diamonds were assumed to be rectangular prisms in the initial calculation.
To better approximate the volume, I used the density of diamond. 1 carat is equivalent to 0.2 grams, and diamond density is typically between 3.15 and 3.53 g/cm^3, with pure diamonds having a density close to 3.52 g/cm^3. I used the midpoint of that range, 3.34 g/cm^3, to estimate the volume of the diamonds.
I log-transformed the right-skewed price and volume distributions. The transformed distribution for price appears bimodal, with the price peaking around $800 or so and again around $5000. There are no diamonds priced at $1500.
When first calculating the volume using x, y, and z, some volumes were 0 or could not be calculated because data was missing. Additionally, some values for the dimensions x, y, and z seemed too large. In the subset called noVolume, all dimensions (x, y, and z) are missing or the z value is 0. The diamonds in this subset tend to be very expensive or fall in the third quartile of the entire diamonds data set.
## carat depth table price x y z volume
## carat 1.000 0.028 0.182 0.922 0.975 0.952 0.953 1.000
## depth 0.028 1.000 -0.296 -0.011 -0.025 -0.029 0.095 0.028
## table 0.182 -0.296 1.000 0.127 0.195 0.184 0.151 0.182
## price 0.922 -0.011 0.127 1.000 0.884 0.865 0.861 0.922
## x 0.975 -0.025 0.195 0.884 1.000 0.975 0.971 0.975
## y 0.952 -0.029 0.184 0.865 0.975 1.000 0.952 0.952
## z 0.953 0.095 0.151 0.861 0.971 0.952 1.000 0.953
## volume 1.000 0.028 0.182 0.922 0.975 0.952 0.953 1.000
The dimensions of a diamond tend to correlate with each other: the longer one dimension is, the larger the diamond is overall. The dimensions also correlate with carat weight, which makes sense. Price correlates strongly with carat weight and the three dimensions (x, y, z).
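A correlation table like the one above can be sketched with base R's cor(). The derived volume column is left out here because it is a rescaling of carat and simply repeats the carat row.

```r
library(ggplot2)

# Pearson correlations between the numeric variables, rounded as in the report.
num_vars <- c("carat", "depth", "table", "price", "x", "y", "z")
cor_mat <- round(cor(diamonds[, num_vars]), 3)
cor_mat
```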
Tip: Be mindful of the number of data points and variables that you put in a correlation matrix or plot matrix: you do not need to include all variables. In addition, you can use other packages not introduced in the associated course to conduct your exploration. Make sure you load them at the beginning of your document so that it is easiest to see which packages are necessary. (The above plot matrix comes from the psych package.)
From a subset of the data, cut, color and clarity do not seem to have strong correlations with price, but color and clarity are moderately correlated with carat. I want to look closer at scatter plots involving price and some other variables like carat, depth, and table.
As carat size increases, the variance in price increases. We still see vertical bands where many diamonds take on the same carat value at different price points. The relationship between price and carat appears to be exponential rather than linear.
##
## Call:
## lm(formula = price ~ carat, data = subset(diamonds, carat <=
## quantile(diamonds$carat, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10922.6 -818.3 -8.3 566.5 12703.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2317.86 12.94 -179.1 <2e-16 ***
## carat 7843.16 14.02 559.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1524 on 53885 degrees of freedom
## Multiple R-squared: 0.8532, Adjusted R-squared: 0.8532
## F-statistic: 3.131e+05 on 1 and 53885 DF, p-value: < 2.2e-16
Despite the fact that the relationship looks nonlinear, based on the R^2 value, carat still explains about 85 percent of the variance in price.
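The model call shown in the output can be rerun as below (a sketch; trimmed and fit_carat are names I chose):

```r
library(ggplot2)

# Drop the heaviest 0.1% of diamonds, then regress price on carat.
trimmed <- subset(diamonds, carat <= quantile(diamonds$carat, 0.999))
fit_carat <- lm(price ~ carat, data = trimmed)

summary(fit_carat)$r.squared   # share of price variance explained by carat
```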
Comparing depth to price, the first plot suffers from some overplotting. Most diamonds have a depth between 60 and 65 (no units), and the lack of correlation seen in the earlier table is easy to see here.
Again, the tall vertical strips indicate table values are mostly integers. Adding jitter, transparency, and changing the plot limits lets us see the slight correlation between table and price.
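One way to apply those three adjustments is sketched below. The alpha level, jitter width, and x-axis limits are assumed values, not the ones behind the original plot.

```r
library(ggplot2)

p_table <- ggplot(diamonds, aes(x = table, y = price)) +
  # Spread points horizontally only, and make them mostly transparent.
  geom_jitter(alpha = 1 / 20, width = 0.4, height = 0) +
  # Zoom in on the bulk of the table values without dropping any data.
  coord_cartesian(xlim = c(50, 70))
p_table
```

coord_cartesian() is used instead of xlim() so that points outside the window are hidden rather than removed before plotting.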
Next, I’ll look at how the categorical features vary with carat and price.
## cut: Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.220 0.700 1.000 1.046 1.200 5.010
## --------------------------------------------------------
## cut: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5000 0.8200 0.8492 1.0100 3.0100
## --------------------------------------------------------
## cut: Very Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4100 0.7100 0.8064 1.0200 4.0000
## --------------------------------------------------------
## cut: Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.200 0.410 0.860 0.892 1.200 4.010
## --------------------------------------------------------
## cut: Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.3500 0.5400 0.7028 1.0100 3.5000
It doesn’t look like particular cuts have a certain number of carats. However, it looks like most of the ideal cut diamonds are on the smaller side, less than one carat.
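Grouped summaries in this format come from base R's by(). A sketch for carat by cut, with the medians pulled out separately for easier comparison:

```r
library(ggplot2)

# Six-number summary of carat within each level of cut.
with(diamonds, by(carat, cut, summary))

# Median carat per cut as a named vector.
carat_median <- tapply(diamonds$carat, diamonds$cut, median)
carat_median
```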
The trend between carat and color is clearer, with the worst-color diamonds (best color is D and the worst color is J) having the largest median and largest range. Clarity shows a similar trend, and most of the diamonds of 3 carats or larger fall into the worst clarity groups (I1, SI2).
## diamonds$cut: Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 2050 3282 4359 5206 18570
## --------------------------------------------------------
## diamonds$cut: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327 1145 3050 3929 5028 18790
## --------------------------------------------------------
## diamonds$cut: Very Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 912 2648 3982 5373 18820
## --------------------------------------------------------
## diamonds$cut: Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 1046 3185 4584 6296 18820
## --------------------------------------------------------
## diamonds$cut: Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 878 1810 3458 4678 18810
Ideal diamonds have the lowest median price. This seems really unusual since I would expect diamonds with an ideal cut to have a higher median price compared to the other groups. There are many outliers. The variation in price tends to increase as cut improves and then decreases for diamonds with ideal cuts. What does price/carat look like for these cuts?
## diamonds$color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1838 3170 4214 18690
## --------------------------------------------------------
## diamonds$color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 882 1739 3077 4003 18730
## --------------------------------------------------------
## diamonds$color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342 982 2344 3725 4868 18790
## --------------------------------------------------------
## diamonds$color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 354 931 2242 3999 6048 18820
## --------------------------------------------------------
## diamonds$color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 984 3460 4487 5980 18800
## --------------------------------------------------------
## diamonds$color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 3730 5092 7202 18820
## --------------------------------------------------------
## diamonds$color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4234 5324 7695 18710
Here is another surprise. The lowest median price diamonds have a color of D, which is the best color in the data set. Price variance increases as the color decreases (best color is D and the worst color is J). The median price typically decreases as color improves. Now, I want to look at price per carat by color.
## diamonds$clarity: I1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 345 2080 3344 3924 5161 18530
## --------------------------------------------------------
## diamonds$clarity: SI2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 2264 4072 5063 5777 18800
## --------------------------------------------------------
## diamonds$clarity: SI1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 1089 2822 3996 5250 18820
## --------------------------------------------------------
## diamonds$clarity: VS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 900 2054 3925 6024 18820
## --------------------------------------------------------
## diamonds$clarity: VS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327 876 2005 3839 6023 18800
## --------------------------------------------------------
## diamonds$clarity: VVS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336.0 794.2 1311.0 3284.0 3638.0 18770.0
## --------------------------------------------------------
## diamonds$clarity: VVS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 816 1093 2523 2379 18780
## --------------------------------------------------------
## diamonds$clarity: IF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 369 895 1080 2865 2388 18810
Here again, there is a trend that goes against my intuition. The lowest median price occurs for the best clarity (IF). There also appear to be many more outliers for the better clarity diamonds. I'm not sure why great clarity diamonds are priced so low. Another trend to note here is that price variance increases then decreases significantly as the clarity improves.
I want to look at two things: price per carat, and the distribution of prices for diamonds with best levels of the categorical variables.
Price correlates strongly with carat weight and the three dimensions (x, y, z).
As carat size increases, the variance in price increases. In the plot of price vs carat, there are vertical bands where many diamonds take on the same carat value at different price points. The relationship between price and carat appears to be exponential rather than linear.
Based on the R^2 value, carat explains about 85 percent of the variance in price. Other features of interest can be incorporated into the model to explain the variance in the price.
Diamonds with better levels of clarity, cut, and color tend to occur more often at lower prices while diamonds with worse levels of clarity, cut, and color tend to occur more often at higher prices.
Ideal diamonds have the lowest median price. This seems really unusual since I would expect diamonds with an ideal cut to have a higher median price compared to the other groups. There are many outliers. The variation in price tends to increase as cut improves and then decreases for diamonds with ideal cuts.
The lowest median priced diamonds have a color of D, which is the best color in the data set. Price variance increases as the color decreases (best color is D and the worst color is J). The median price typically decreases as color improves.
The dimensions of a diamond (x, y, and z) tend to correlate with each other: the longer one dimension is, the larger the diamond. The dimensions also correlate with carat weight, which makes sense.
The price of a diamond is positively and strongly correlated with carat and volume. The variables x, y, and z also correlate with the price but less strongly than carat and volume. Either carat or volume could be used in a model to predict the price of diamonds. However, both variables should not be used together, since they measure the same quantity and are perfectly correlated.
Tip: Even when doing exploration, it can be good to select appropriate color palettes and set plot themes in order to make plots more readable. (The above plots use sequential color palettes from the RColorBrewer package; other variables might require qualitative or diverging palettes.)
These density plots elaborate on the odd trends that were seen in the box plots earlier. Diamonds with better levels of clarity, cut, and color tend to occur more often at lower prices while diamonds with worse levels of clarity, cut, and color tend to occur more often at higher prices. Let’s now take a look at price / carat.
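A density plot of this kind, with a sequential ColorBrewer palette, might look like the sketch below. The choice of clarity, the "Blues" palette, and the theme are assumptions for illustration.

```r
library(ggplot2)

# One price density curve per clarity level, on a log10 price axis.
p_density <- ggplot(diamonds, aes(x = price, colour = clarity)) +
  geom_density() +
  scale_x_log10() +
  scale_colour_brewer(palette = "Blues") +  # sequential palette for an ordered factor
  theme_minimal() +
  labs(x = "Price (USD, log10 scale)", y = "Density")
p_density
```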
## cut: Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1168 2743 3449 3767 4514 10910
## --------------------------------------------------------
## cut: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 2394 3613 3860 4787 15930
## --------------------------------------------------------
## cut: Very Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1139 2332 3606 4014 5016 17830
## --------------------------------------------------------
## cut: Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2592 3763 4223 5323 17080
## --------------------------------------------------------
## cut: Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1109 2456 3307 3920 4766 17080
Wow! Ideal diamonds still have the lowest median for price per carat. The variance across the groups seems to be about the same with Fair cut diamonds having the least variation for the middle 50% of diamonds.
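Price per carat needs no new column; the ratio can be passed straight to by(). A sketch, with the level holding the lowest median picked out at the end:

```r
library(ggplot2)

# Summary of price per carat within each cut.
with(diamonds, by(price / carat, cut, summary))

# Median price per carat for each cut, and the cut with the lowest median.
ppc_median <- tapply(diamonds$price / diamonds$carat, diamonds$cut, median)
names(which.min(ppc_median))
```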
## color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1128 2455 3411 3953 4749 17830
## --------------------------------------------------------
## color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1078 2430 3254 3805 4508 14610
## --------------------------------------------------------
## color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1168 2587 3494 4135 4947 13860
## --------------------------------------------------------
## color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1139 2538 3490 4163 5500 12460
## --------------------------------------------------------
## color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2397 3819 4008 5127 10190
## --------------------------------------------------------
## color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1152 2345 3780 3996 5197 9398
## --------------------------------------------------------
## color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 2563 3780 3826 4928 8647
The best color diamonds (D and E) still have the lowest medians on price per carat. Again, this is an unusual trend. This also seems strange since most diamonds in the data set are not of color D.
## clarity: I1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2112 2887 2796 3354 6353
## --------------------------------------------------------
## clarity: SI2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 3000 3951 4011 4738 9912
## --------------------------------------------------------
## clarity: SI1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1130 2362 3669 3849 4928 9693
## --------------------------------------------------------
## clarity: VS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1152 2438 3429 4081 5484 12460
## --------------------------------------------------------
## clarity: VS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1215 2412 3450 4156 5485 12400
## --------------------------------------------------------
## clarity: VVS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1339 2455 3169 4204 4939 13440
## --------------------------------------------------------
## clarity: VVS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1400 2545 2982 3851 4060 14500
## --------------------------------------------------------
## clarity: IF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1588 2865 3156 4260 4284 17830
This plot seems more reasonable. The lowest median price per carat has clarity I1 which is the lowest clarity rating. The median increases slightly then holds relatively constant before decreasing again for the highest clarity. The variance increases then decreases across the clarity levels from worst to best.
Let’s take another look at other variables and their correlations with price and try to work towards building a linear model to predict price.
Levels of cut cluster by table value. This may make sense based on the type of cut as certain cuts produce certain dimensions. The pattern generally holds across each level of clarity and each level of color with the exception of the lowest clarity.
Color and clarity are not correlated with table; nothing in particular stands out.
We look at the categorical variables against the main price vs. carat relationship. Applying a log transform to price and cube-root transform to carat produces a more linear trend. If we account for constant carat value, better clarity produces a higher-priced diamond.
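The two transformations can be sketched directly in the plot: the cube root is applied inside aes() and the log through the y scale. Point size, transparency, and palette are assumed values.

```r
library(ggplot2)

p_scatter <- ggplot(diamonds,
                    aes(x = carat^(1 / 3), y = price, colour = clarity)) +
  geom_point(alpha = 0.5, size = 0.75) +
  scale_y_log10() +                         # log10 transform of price
  scale_colour_brewer(palette = "Blues") +
  labs(x = "Cube root of carat", y = "Price (USD, log10 scale)")
p_scatter
```

The cube root is a natural choice because carat measures weight, which scales with volume, while price is more plausibly tied to a diamond's linear size.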
Diamonds with better color tend to be priced higher holding volume constant. This trend is not as clear when looking at price vs volume and clarity, but the trend is still present. Price does not vary as much on cut holding carat constant; the pattern is not noticeable here.
Tip: Note that performing statistical tests and creating data models are not a required component of the project. Below, the mtable() call uses 'sdigits = 3' to make sure that three digits are printed after the decimal point in the summary statistics.
The last 3 plots suggest that we can build a linear model and use those variables in the linear model to predict the price of a diamond.
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity,
## data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity +
## cut, data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity +
## cut + color, data = diamonds)
##
## ==============================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.464*** 0.391*** 0.415***
## (0.006) (0.019) (0.014) (0.014) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 9.319*** 9.376*** 9.144***
## (0.007) (0.032) (0.023) (0.023) (0.016)
## carat -1.137*** -1.260*** -1.274*** -1.093***
## (0.012) (0.008) (0.008) (0.006)
## clarity: .L 0.889*** 0.854*** 0.907***
## (0.005) (0.005) (0.003)
## clarity: .Q -0.255*** -0.239*** -0.240***
## (0.005) (0.005) (0.003)
## clarity: .C 0.143*** 0.129*** 0.131***
## (0.004) (0.004) (0.003)
## clarity: ^4 -0.086*** -0.080*** -0.063***
## (0.003) (0.003) (0.002)
## clarity: ^5 0.038*** 0.034*** 0.026***
## (0.003) (0.003) (0.002)
## clarity: ^6 0.001 0.004 -0.002
## (0.002) (0.002) (0.002)
## clarity: ^7 0.054*** 0.051*** 0.032***
## (0.002) (0.002) (0.001)
## cut: .L 0.125*** 0.120***
## (0.003) (0.002)
## cut: .Q -0.034*** -0.031***
## (0.003) (0.002)
## cut: .C 0.016*** 0.014***
## (0.002) (0.002)
## cut: ^4 -0.001 -0.002
## (0.002) (0.001)
## color: .L -0.441***
## (0.002)
## color: .Q -0.093***
## (0.002)
## color: .C -0.013***
## (0.002)
## color: ^4 0.012***
## (0.002)
## color: ^5 -0.003*
## (0.001)
## color: ^6 0.001
## (0.001)
## ------------------------------------------------------------------------------
## R-squared 0.924 0.935 0.967 0.968 0.984
## adj. R-squared 0.924 0.935 0.967 0.968 0.984
## sigma 0.280 0.259 0.185 0.181 0.129
## F 652012.063 387489.366 175093.345 125821.403 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 14605.945 15580.358 34091.272
## Deviance 4242.831 3613.360 1837.549 1772.344 892.214
## AIC 15930.999 7270.637 -29189.890 -31130.717 -68140.544
## BIC 15957.685 7306.220 -29092.038 -30997.282 -67953.736
## N 53940 53940 53940 53940 53940
## ==============================================================================
The variables in this linear model account for 98.4% of the variance in the log of diamond price. Even with the log transformation of price and cube-root transformation of carat alone, we account for 92.4% of the variance, compared to 85% for untransformed price against carat.
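The five nested models in the table can be rebuilt step by step with update(). This sketch extracts R-squared with base R; the comparison table itself is the kind produced by mtable() from the memisc package.

```r
library(ggplot2)

# Start from log(price) on the cube root of carat, then add one term at a time.
m1 <- lm(I(log(price)) ~ I(carat^(1 / 3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + clarity)
m4 <- update(m3, ~ . + cut)
m5 <- update(m4, ~ . + color)

# R-squared of each model, for comparison with the table above.
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5)
r2 <- sapply(models, function(m) summary(m)$r.squared)
round(r2, 3)
```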
Ideal diamonds also have the lowest median for price per carat. The variance across the groups seems to be about the same with Fair cut diamonds having the least variation for the middle 50% of diamonds.
Holding carat weight constant, diamonds with lower clarity are almost always cheaper than diamonds with better clarity (worst clarity is I1 and best clarity is IF). This also applies to a lesser extent with color.
The last three plots from the Multivariate section suggest that I can build a linear model and use those variables in the model to predict the price of a diamond. The results of the model are summarized below.
Levels of cut cluster by table value. This resonates with me because I think certain diamond cuts would produce particular dimensions (x, y, and z). The pattern holds across each level of clarity and each level of color with the exception of the lowest clarity.
Yes, I created a linear model starting from the log of Price and the Cube-Root of carat.
The variables in the linear model account for 98.4% of the variance in the price of diamonds. The addition of the cut variable to the model slightly improves the R^2 value by one tenth of a percent, which is expected based on the visualization above of Log10 Price vs. Cube-Root Carat and Cut. Clarity and color improved the model to greater degrees.
Tip: Polish up the plots that you explored earlier in the report in the final plots section. Make sure you label and title the plots, and provide a description of what can be observed in each of the chosen plots.
The distribution of diamond prices appears to be bimodal on log scale, perhaps due to the demand of diamonds and buyers purchasing in two different ranges of price points. There is a curious gap in prices at the $1500 point.
Diamonds with the best level of clarity (IF) have the lowest median price. A greater proportion of diamonds with the best clarity are priced lower compared to the proportion of diamonds in price distributions for worse levels of clarity. Price variance increases as the clarity improves (worst clarity is I1).
The plot indicates that a linear model could be constructed to predict the price of diamonds using log10(price) as the outcome variable and the cube root of carat as the predictor variable. Holding carat weight constant, diamonds with lower clarity levels (I1 is worst and IF is best) are almost always cheaper than diamonds with better clarity, so clarity can account for additional variability in prices.
The diamonds data set contains information on almost 54,000 diamonds across ten variables from around 2008. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the price of diamonds across many variables and created a linear model to predict diamond prices.
There was a clear trend between the volume or carat weight of a diamond and its price. I was surprised that depth or table did not have a strong positive correlation with price, but these variables are likely to be represented by categorical variables: color, cut, and clarity. I struggled to understand the decrease in median price as the level of cut and clarity improved, but this became clearer when I realized that most of the data contained ideal cut diamonds. For the linear model, all diamonds were included since information on price, carat, color, clarity, and cut was available for all the diamonds. After transforming price to log scale and taking the cube root of carat, the model was able to account for 98.4% of the variance in the log of price.
Some limitations of this model include the source of the data. Given that the diamonds date to 2008, the model would likely undervalue diamonds in the market today, either due to changes in demand and supply or inflation rates. To investigate this data further, I would examine how values of 0 were introduced into the data set for the variables x, y, and z, and the derived volume variable. I would be interested in testing the linear model to predict current diamond prices and to determine to what extent the model is accurate at pricing diamonds. A more recent dataset would be better to make predictions of diamond prices, and comparisons might be made between the other linear models to see if other variables may account for diamond prices.