# Analyzing Monthly Expenses with a Pareto Chart

This month, ASQ CEO Paul Borawski encourages us to share stories about “quality solutions in unexpected places.” This is such a fun question, because now I’ll be noticing these unexpected gems all month – and probably beyond!

Today’s gem comes from my former student Andy, who has heard me get excited about quality tools and continuous improvement – and the R statistical software – a LOT over the past few years! Even though he graduated in the spring of 2012, he’s still applying quality solutions to his own life – and this was a very unexpected place for me to find such a thing! I can’t hold back my own personal excitement for improvement and the pursuit of excellence, even as my standards for excellence evolve, and it’s so heartwarming to see how this has influenced Andy’s life.

A couple months ago, Andy posted about how he used a Pareto chart to explore his own monthly expenses, and brainstorm ways to improve his financial situation as a recent college graduate. Want to explore your own finances? Andy’s post can help you… and can also help you use R to produce nice charts and graphs to tell your story. Check it out!!

# Performance Measures for Classifiers: Precision, Recall, and F1

Here is a new, simple tutorial on how to evaluate the quality of a classifier. The attached doc shows you how to construct a confusion matrix, compute the precision, recall, and F1 scores for a classifier, and construct a precision/recall chart in R to compare the relative strengths and weaknesses of different classifiers.

performance-measures-classifiers-75-925

Granted, these measures are not perfect. Powers (2011), in the Journal of Machine Learning Technologies, advises that they should not be used without a clear understanding of the biases, especially considering the power of intelligent prediction vs. the power of the guess. However, they should provide a decent basis for practitioners to compare different classification strategies. (Notice that you don’t even need algorithms to do this… you can generate a confusion matrix from any plant operation or business activity where classification is performed!)
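Here's a minimal sketch of those calculations in base R, in case you want to try them before opening the doc. The spam/ham labels and the ten observations are made up for illustration, and treating "spam" as the positive class is an assumption.

```r
# Hypothetical actual and predicted labels for a binary classifier
actual    <- factor(c("spam", "spam", "spam", "spam",
                      "ham", "ham", "ham", "ham", "ham", "ham"),
                    levels = c("spam", "ham"))
predicted <- factor(c("spam", "spam", "spam", "ham",
                      "spam", "spam", "ham", "ham", "ham", "ham"),
                    levels = c("spam", "ham"))

# Confusion matrix: rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Treat "spam" as the positive class (an assumption for this example)
tp <- cm["spam", "spam"]  # predicted spam, actually spam
fp <- cm["spam", "ham"]   # predicted spam, actually ham
fn <- cm["ham", "spam"]   # predicted ham, actually spam

precision <- tp / (tp + fp)  # of everything flagged, how much was right?
recall    <- tp / (tp + fn)  # of all true positives, how many were found?
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean
```

The same three lines of arithmetic work on any 2x2 table of counts, whether it came from an algorithm or from tallying decisions made by hand.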

# Access METAR Weather Data in R Statistical Software

Although this is neither quality nor innovation related, as a multi-decade weather geek and degreed meteorologist, I still really love my weather data. Today I wanted to learn how to retrieve historical weather data from within the R statistical software. I managed to get a list within R that contains METAR observations for an entire day for one observing station. Here’s how I did it!

1. First, I signed up for a KEY to use the Weather Underground API at http://www.wunderground.com/weather/api/ – I’m not going to tell you what my personal key is, but it has 16 characters and looks kind of like this: d7000XXXXXXXXXXX

2. Next, I installed the rjson package into R

3. Then, I used this code to find out that there were 46 observations for August 11, 2012 (the date of interest). You’ll have to try it with YOUR new Weather Underground API key in place of the d7000XXX… :

```
library(rjson)
```

```
# BE SURE TO PUT THE URL ALL ON THE SAME LINE, NO SPACES,
# NO CARRIAGE RETURNS, AND USE YOUR OWN API KEY
# (rjson READS FROM A URL VIA THE file ARGUMENT)
x <- fromJSON(file = "http://api.wunderground.com/api/d7000XXX/history_20120811/q/VA/Charlottesville.json")

# THIS WILL TELL US HOW MANY OBSERVATIONS WE HAVE
length(x$history$observations)
```

```
# GET ALL METARS FOR THE WHOLE DAY AND STORE THEM IN A CHARACTER VECTOR
daily.metars <- character(length(x$history$observations))
for (n in seq_along(x$history$observations)) {
  daily.metars[n] <- x$history$observations[[n]]$metar
}
```

4. Now you have a character vector in R called daily.metars that holds all of your METARs for the day, one string per observation! Here are the first six entries from the one that I produced:
```
> head(daily.metars)
[1] "SPECI KCHO 110408Z AUTO 00000KT 5SM VCTS -RA BR SCT003 BKN030 OVC075 21/19 A2983 RMK AO2 LTG DSNT SW P0001"
[2] "SPECI KCHO 110433Z AUTO 20005KT 6SM -TSRA BR FEW003 BKN033 OVC110 21/19 A2984 RMK AO2 LTG DSNT NE AND S AND SW TSE11B27 P0002"
[3] "METAR KCHO 110453Z AUTO 00000KT 5SM -TSRA BR SCT018 SCT046 OVC100 21/19 A2983 RMK AO2 LTG DSNT NE-S TSE11B27 SLP093 P0003 T02060194 402940200"
[4] "SPECI KCHO 110529Z AUTO 00000KT 8SM -RA FEW070 SCT095 BKN110 21/19 A2982 RMK AO2 LTG DSNT NE-SE TSE23 P0001"
[5] "METAR KCHO 110553Z AUTO 00000KT 8SM FEW085 21/19 A2982 RMK AO2 LTG DSNT NE-SE TSE23RAE30 SLP090 P0001 60127 T02060194 10250 20200 58004"
[6] "METAR KCHO 110653Z AUTO 18005KT 8SM BKN090 21/19 A2983 RMK AO2 SLP093 T02060194"
```
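If you'd rather skip the explicit loop, the same extraction can be sketched with `vapply()`. The nested list below is a hypothetical stand-in for the parsed JSON, so the example runs without an API key. It only mimics the `history` / `observations` / `metar` nesting used above, with three of the reports shortened from the output shown.

```r
# Hypothetical stand-in for the parsed JSON; the real x comes from fromJSON()
x <- list(history = list(observations = list(
  list(metar = "SPECI KCHO 110529Z AUTO 00000KT 8SM -RA FEW070 SCT095 BKN110 21/19 A2982"),
  list(metar = "METAR KCHO 110553Z AUTO 00000KT 8SM FEW085 21/19 A2982"),
  list(metar = "METAR KCHO 110653Z AUTO 18005KT 8SM BKN090 21/19 A2983")
)))

# vapply() visits every observation and insists on exactly one string each
daily.metars <- vapply(x$history$observations,
                       function(obs) obs$metar,
                       character(1))

# The first token of each report is its type (routine METAR or special SPECI)
report.type <- sub(" .*", "", daily.metars)
type.counts <- table(report.type)
print(type.counts)
```

Because `vapply()` checks the type of every result, it fails loudly if an observation is missing its `metar` field, which is easier to debug than a silent `NA`.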

# Text Analysis Tutorial on Spam Email in R

Hi everyone – I just wrote a tutorial on text analysis in R using the tm and wordcloud packages. Thought some of you here might be interested in it:

# Normal Probability Plots (QQ Plots) in R

Here’s a tutorial on how to tell whether your data are (approximately) normally distributed!

qq-plot-75-925
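As a quick taste of what the tutorial covers, here's a minimal sketch using base R's `qqnorm()` and `qqline()`. The sample is simulated, and the seed, mean, and standard deviation are arbitrary choices made only so the example is reproducible.

```r
set.seed(42)  # arbitrary seed so the simulated sample is reproducible

# Hypothetical data: 100 draws from a normal distribution
sample.data <- rnorm(100, mean = 50, sd = 10)

# Plot sample quantiles against theoretical normal quantiles;
# qqnorm() also returns the plotted coordinates invisibly
qq <- qqnorm(sample.data, main = "Normal Q-Q Plot")

# Reference line through the first and third quartiles
qqline(sample.data, col = "red")
```

If the points hug the reference line, the data are plausibly normal. Systematic curvature at the ends suggests skew or heavy tails.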

# Bar Charts and Segmented Bar Charts in R

Here are a couple of tutorials I’ve written to help anyone who’s interested in learning how to produce simple bar charts or simple segmented bar charts in R, given that you have some data stored in a CSV file that you can use. Please leave any comments if there are ways I can make this information more clear and useful. Thanks!

bar-charts-75-925

segmented-bar-charts-75-925

mnm-data (Note: This is an Excel XLS file! You need to download it and re-save as a CSV for the R examples to work.)
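For a rough idea of what the tutorials build up to, here's a minimal sketch using base R's `barplot()`. The data frame is invented for illustration. Its column names (`bag`, `color`, `count`) and values are assumptions and don't necessarily match the layout of the mnm-data file, which you would normally load with `read.csv()` after re-saving it.

```r
# Hypothetical candy counts (column names and values are made up)
mnm <- data.frame(
  bag   = c("A", "A", "A", "B", "B", "B"),
  color = c("red", "blue", "green", "red", "blue", "green"),
  count = c(5, 8, 3, 7, 4, 6)
)

# Simple bar chart: total count per color across all bags
totals <- tapply(mnm$count, mnm$color, sum)
barplot(totals, main = "Candies by Color", ylab = "Count")

# Segmented (stacked) bar chart: one bar per bag, one segment per color
by.bag <- xtabs(count ~ color + bag, data = mnm)
barplot(by.bag, legend.text = TRUE,
        main = "Candies by Bag and Color", ylab = "Count")
```

Passing a vector to `barplot()` gives one bar per element, while passing a matrix gives one stacked bar per column.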

Nicole

# Pareto Charts in R

A Pareto Chart is a sorted bar chart that displays the frequency (or count) of occurrences that fall in different categories, from greatest frequency on the left to least frequency on the right, with an overlaid line chart that plots the cumulative percentage of occurrences. The vertical axis on the left of the chart shows frequency (or count), and the vertical axis on the right of the chart shows the cumulative percentage. A Pareto Chart is typically used to visualize:

• Primary types or sources of defects
• Most frequent reasons for customer complaints
• Amount of some variable (e.g. money, energy usage, time) that can be attributed to or classified according to a certain category

The Pareto Chart is typically used to separate the “vital few” from the “trivial many” using the Pareto principle, also called the 80/20 Rule, which asserts that approximately 80% of effects come from 20% of causes for many systems. Pareto analysis can thus be used to find, for example, the most critical types or sources of defects, the most common complaints that customers have, or the most essential categories within which to focus problem-solving efforts.

To find out how to implement a Pareto Chart in R, download my PDF at http://nicoleradziwill.com/r/pareto-charts-75-925-copy.pdf
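In the meantime, here's a minimal base-R sketch of the chart described above: sorted bars on the left axis with a cumulative-percentage line on the right. The expense categories and amounts are made up for illustration, and this is not necessarily the approach taken in the PDF.

```r
# Hypothetical monthly expenses by category (illustrative values only)
expenses <- c(Food = 300, Rent = 800, Phone = 60, Car = 250,
              Utilities = 120, Other = 20, Entertainment = 50)

# Sort from greatest to least, then compute cumulative percentages
sorted  <- sort(expenses, decreasing = TRUE)
total   <- sum(sorted)
cum.pct <- cumsum(sorted) / total * 100

# Leave room for rotated labels and for the right-hand axis
par(mar = c(8, 4, 4, 4))

# Bars on the left axis; stretch the y-range so 100% matches the grand total
mids <- barplot(sorted, ylim = c(0, total), las = 2,
                ylab = "Amount", main = "Pareto Chart of Monthly Expenses")

# Cumulative line, rescaled to the left axis, with a percent axis on the right
lines(mids, cum.pct / 100 * total, type = "b", pch = 19)
axis(4, at = seq(0, 1, by = 0.2) * total,
     labels = paste0(seq(0, 100, by = 20), "%"), las = 2)
```

With these numbers, the first three categories account for about 84% of spending, which is the "vital few" pattern the chart is meant to reveal. The qcc package also provides a `pareto.chart()` function if you prefer not to build the chart by hand.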