Why R? An illustration through the use of ggplot and magick

  • The R language provides a rich and flexible environment for working with data, especially data to be used for statistical modelling or graphics.

  • R is a comprehensive, free and open-source language for data analysis, with no licensing costs associated with it.

  • R runs on all major platforms, which makes it widely applicable and simple to integrate into existing IT structures.

You can download R for free from:
https://cran.r-project.org/

and RStudio, a free and open-source integrated development environment (IDE) for R, from:
https://www.rstudio.com/

Data Science

  • The vast amount of data that we are paying more and more attention to has given rise to the new discipline of data science.

  • The growing volume of data, and the demand for knowledge and insights extracted from it to be easy to understand, are the motivating forces of data science.

  • With the explosion of “Big Data” problems, data science has become a very hot field in many scientific areas as well as marketing, finance, and other business and social study disciplines. Hence, there is a growing demand for business and social scientific researchers with statistical, modelling and computing skills.

  • We can now identify patterns and regularities in data of all sorts that allow us to advance scholarship, improve the human condition, and create commercial and social value.

  • The field of data science is emerging at the intersection of statistics, computer science and design. R provides a great platform for this multidisciplinarity. It is incredibly powerful, and as such it should be the first language you look to build skills in for data manipulation, data analysis and visualisation if you want to move towards data science.

Why R?

  1. The R system has an extensive library of packages that offer state-of-the-art abilities.
    Many of the analyses they offer are not available in any of the standard statistical software packages.

  2. The functionalities implemented in R for:

  • data manipulation,
  • data analysis and
  • visualisation

    are incomparable.

  3. R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software packages.

  4. R enables easy experimentation and exploration, which improves data analysis.

  5. R is a tool for reporting modern data analyses in a reproducible manner, which makes an analysis more useful to others because the data and code that actually produced it can be made available.

R Community

“The R community is one of R’s best features!” (Revolutions: daily news about using open source R)

  • Supported by the R Foundation for Statistical Computing, and with the strong and open engagement of developers and users from all walks of life, from science to commerce, it is hard to envisage any commercial corporation developing a sustainable business model with the same innovative drive and power as the R community.

  • The collaboration amongst statisticians and other scientists engaged with statistical computing, together with the growing interest and engagement of large companies, creates an altruistic R community that drives R’s advance in the field of data analytics. As a result, R becomes a more powerful resource that is more usable and attractive to data scientists and analysts.

List of resources

ROpenSci: “R community is not just for ‘power users’ or developers. It’s a place for users and people interested in learning more about R”. It provides a list of useful links:

#rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter

R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters

Local R meetup groups — a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable

Rweekly — an incredible weekly recap of all things R

R-bloggers — an awesome resource to find posts from many different bloggers using R

DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research

Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)

Who uses R?

Some of the major domains using R include:

  • Financial Services,
  • Pharmaceuticals,
  • Telecom,
  • Life Sciences and
  • Education sector.

Top companies using R are :

How do we do it?

Tools needed in a typical data science project:

R for Data Science by Garrett Grolemund & Hadley Wickham

http://r4ds.had.co.nz/index.html.

Real Example

Does declawing (onychectomy) cause harm to cats? Analyzing 17 years’ worth of shelter admissions data.

The dataset captures specifics about the individual cat (declawed status, age, breed, coat color, etc.) as well as the primary reason for admission. Some of the admission reasons are unconnected to the animal (e.g., moving, can’t afford pet, allergies), but some reasons are based on problematic behaviors exhibited by the cat (e.g., house-soiling, aggressive to other animals, aggressive to people). Available to us is a CSV file containing 200 sample records.

Cat_Data

Do it in R

# Install and load packages and data 
# The tidyverse is a collection of R packages designed for data science
# Install the complete tidyverse with
# install.packages("tidyverse")
# load the complete tidyverse with
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# The ggplot2 package (grammar of graphics) is already attached as part of the tidyverse
# Install and load the magick package for advanced image-processing in R
# install.packages("magick")
library(magick)
## Linking to ImageMagick 6.9.11.32
## Enabled features: cairo, fontconfig, freetype, lcms, pango, rsvg, webp
## Disabled features: fftw, ghostscript, x11
# load the data saved on your computer
# cat_claw <- read.csv("declawing_data_sample.csv")
# or load the data directly from the website
cat_claw <- read.csv("declawing_data_sample.csv")
# Have a look at the data: head()
# let us look at the first three rows of the data
head(cat_claw, n = 3)
##   Animal.ID Animal.Name Species Gender Date.Of.Birth      Primary.Breed
## 1   1032415      HARLEY     Cat      M     9/18/1999 Domestic Shorthair
## 2   1032962     TRUCKER     Cat      M     4/10/1998 Domestic Shorthair
## 3   1033799                 Cat      M      2/2/2000  Domestic Longhair
##   Secondary.Breed Declawed Distinguishing.Markings Purebred BodyWeight
## 1             Mix     None                                0          0
## 2             Mix     None                                0          0
## 3             Mix     None                                0          2
##   BodyWeightUnit PrimaryColor SecondaryColor ColorPattern         Intake.Date
## 1           <NA>        Black          White         <NA> 03/18/2000 00:14:00
## 2           <NA>         Grey           <NA>        Tiger 04/06/2000 00:45:00
## 3          pound        Black           <NA>         <NA> 05/02/2000 00:37:00
##                Intake.Type Intake.Subtype        Reason Reason.Category
## 1 Owner/Guardian Surrender       Schedule                          <NA>
## 2                    Stray        Walk In                          <NA>
## 3 Owner/Guardian Surrender        Walk In Too Many Pets   Owner problem
# Have a look at the structure of the data: str()
# look at the structure of the data
str(cat_claw)
## 'data.frame':    200 obs. of  20 variables:
##  $ Animal.ID              : int  1032415 1032962 1033799 1033965 1038328 1048494 1052572 1053299 1054811 1057979 ...
##  $ Animal.Name            : chr  "HARLEY" "TRUCKER" "" "" ...
##  $ Species                : chr  "Cat" "Cat" "Cat" "Cat" ...
##  $ Gender                 : chr  "M" "M" "M" "M" ...
##  $ Date.Of.Birth          : chr  "9/18/1999" "4/10/1998" "2/2/2000" "3/7/2000" ...
##  $ Primary.Breed          : chr  "Domestic Shorthair" "Domestic Shorthair" "Domestic Longhair" "Domestic Longhair" ...
##  $ Secondary.Breed        : chr  "Mix" "Mix" "Mix" "Mix" ...
##  $ Declawed               : chr  "None" "None" "None" "None" ...
##  $ Distinguishing.Markings: chr  "" "" "" "" ...
##  $ Purebred               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BodyWeight             : num  0 0 2 0 0 0 0 12 0 0 ...
##  $ BodyWeightUnit         : chr  NA NA "pound" NA ...
##  $ PrimaryColor           : chr  "Black" "Grey" "Black" "Orange" ...
##  $ SecondaryColor         : chr  "White" NA NA "White" ...
##  $ ColorPattern           : chr  NA "Tiger" NA NA ...
##  $ Intake.Date            : chr  "03/18/2000 00:14:00" "04/06/2000 00:45:00" "05/02/2000 00:37:00" "05/07/2000 00:26:00" ...
##  $ Intake.Type            : chr  "Owner/Guardian Surrender" "Stray" "Owner/Guardian Surrender" "Owner/Guardian Surrender" ...
##  $ Intake.Subtype         : chr  "Schedule" "Walk In" "Walk In" "Walk In" ...
##  $ Reason                 : chr  "" "" "Too Many Pets" "Too Many Pets" ...
##  $ Reason.Category        : chr  NA NA "Owner problem" "Owner problem" ...
# Do it in a tidy way: glimpse()
# the previous output was messy as it didn't fit on the slide.
# we want to look at the structure of the data, see as much data
# as possible and identify the data type of each variable
glimpse(cat_claw)
## Rows: 200
## Columns: 20
## $ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 1038328, …
## $ Animal.Name             <chr> "HARLEY", "TRUCKER", "", "", "", "PUDDY TAT",…
## $ Species                 <chr> "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Ca…
## $ Gender                  <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", …
## $ Date.Of.Birth           <chr> "9/18/1999", "4/10/1998", "2/2/2000", "3/7/20…
## $ Primary.Breed           <chr> "Domestic Shorthair", "Domestic Shorthair", "…
## $ Secondary.Breed         <chr> "Mix", "Mix", "Mix", "Mix", "Mix", "Mix", "Mi…
## $ Declawed                <chr> "None", "None", "None", "None", "None", "Fron…
## $ Distinguishing.Markings <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ BodyWeight              <dbl> 0.00, 0.00, 2.00, 0.00, 0.00, 0.00, 0.00, 12.…
## $ BodyWeightUnit          <chr> NA, NA, "pound", NA, NA, NA, NA, "pound", NA,…
## $ PrimaryColor            <chr> "Black", "Grey", "Black", "Orange", "Black", …
## $ SecondaryColor          <chr> "White", NA, NA, "White", "White", "Brown", N…
## $ ColorPattern            <chr> NA, "Tiger", NA, NA, NA, "Tiger", "Tortoisesh…
## $ Intake.Date             <chr> "03/18/2000 00:14:00", "04/06/2000 00:45:00",…
## $ Intake.Type             <chr> "Owner/Guardian Surrender", "Stray", "Owner/G…
## $ Intake.Subtype          <chr> "Schedule", "Walk In", "Walk In", "Walk In", …
## $ Reason                  <chr> "", "", "Too Many Pets", "Too Many Pets", "",…
## $ Reason.Category         <chr> NA, NA, "Owner problem", "Owner problem", NA,…

What to focus on?

# Note that variable 'Declawed' is the main variable of interest
# with three possible outcomes
summary(cat_claw$Declawed)
##    Length     Class      Mode 
##       200 character character
# convert the dates (Date.Of.Birth and Intake.Date) to the same Date format
cat_claw$Date.Of.Birth <- as.Date(cat_claw$Date.Of.Birth, format='%m/%d/%Y')
cat_claw$Intake.Date <- as.Date(cat_claw$Intake.Date, format='%m/%d/%Y')
# How does it look?
# check the data
glimpse(cat_claw)
## Rows: 200
## Columns: 20
## $ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 1038328, …
## $ Animal.Name             <chr> "HARLEY", "TRUCKER", "", "", "", "PUDDY TAT",…
## $ Species                 <chr> "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Ca…
## $ Gender                  <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", …
## $ Date.Of.Birth           <date> 1999-09-18, 1998-04-10, 2000-02-02, 2000-03-…
## $ Primary.Breed           <chr> "Domestic Shorthair", "Domestic Shorthair", "…
## $ Secondary.Breed         <chr> "Mix", "Mix", "Mix", "Mix", "Mix", "Mix", "Mi…
## $ Declawed                <chr> "None", "None", "None", "None", "None", "Fron…
## $ Distinguishing.Markings <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ BodyWeight              <dbl> 0.00, 0.00, 2.00, 0.00, 0.00, 0.00, 0.00, 12.…
## $ BodyWeightUnit          <chr> NA, NA, "pound", NA, NA, NA, NA, "pound", NA,…
## $ PrimaryColor            <chr> "Black", "Grey", "Black", "Orange", "Black", …
## $ SecondaryColor          <chr> "White", NA, NA, "White", "White", "Brown", N…
## $ ColorPattern            <chr> NA, "Tiger", NA, NA, NA, "Tiger", "Tortoisesh…
## $ Intake.Date             <date> 2000-03-18, 2000-04-06, 2000-05-02, 2000-05-…
## $ Intake.Type             <chr> "Owner/Guardian Surrender", "Stray", "Owner/G…
## $ Intake.Subtype          <chr> "Schedule", "Walk In", "Walk In", "Walk In", …
## $ Reason                  <chr> "", "", "Too Many Pets", "Too Many Pets", "",…
## $ Reason.Category         <chr> NA, NA, "Owner problem", "Owner problem", NA,…
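
Because Declawed is stored as character, summary() reports only its length and type rather than its categories. One way to see the categories and their counts is a frequency table. The sketch below runs on a small made-up vector: apart from "None", the labels are placeholders and are not taken from the data set. With the real data the equivalent call would be table(cat_claw$Declawed, useNA = "ifany").

```r
# Toy stand-in for cat_claw$Declawed (hypothetical values, not the real data)
declawed <- c("None", "None", "Front", "Both", "None", NA)

# table() counts each category; useNA = "ifany" also reports missing values
declawed_counts <- table(declawed, useNA = "ifany")
print(declawed_counts)

# Proportions can help when the group sizes differ a lot
declawed_props <- round(prop.table(declawed_counts), 2)
print(declawed_props)
```

If the counts turn out to be very unbalanced, that is worth keeping in mind when comparing the groups later on.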

How old are the cats?

# calculate age in days
cat_claw$diff_in_days <- cat_claw$Intake.Date - cat_claw$Date.Of.Birth
summary(cat_claw$diff_in_days) # summary for class type: 'difftime'
##   Length    Class     Mode 
##      200 difftime  numeric
# summary for diff_in_days as numeric (does everything seem ok?)
summary(as.numeric(cat_claw$diff_in_days))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -322     122     730    1070    1822    5114
# Identify 'incorrect' observation(s)
# locate the most negative diff_in_days (which.min() returns a single position)
ind <- which.min(as.numeric(cat_claw$diff_in_days))
ind
## [1] 30
# remove the observation with an intake date before the date of birth
# save it as a new data set
cat <- cat_claw[-ind,]
# replace empty strings with NA
cat$Animal.Name[cat$Animal.Name == ""] <- NA
cat$Distinguishing.Markings[cat$Distinguishing.Markings == ""] <- NA
cat$Reason[cat$Reason == ""] <- NA
# Check the data
glimpse(cat)
## Rows: 199
## Columns: 21
## $ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 1038328, …
## $ Animal.Name             <chr> "HARLEY", "TRUCKER", NA, NA, NA, "PUDDY TAT",…
## $ Species                 <chr> "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Ca…
## $ Gender                  <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", …
## $ Date.Of.Birth           <date> 1999-09-18, 1998-04-10, 2000-02-02, 2000-03-…
## $ Primary.Breed           <chr> "Domestic Shorthair", "Domestic Shorthair", "…
## $ Secondary.Breed         <chr> "Mix", "Mix", "Mix", "Mix", "Mix", "Mix", "Mi…
## $ Declawed                <chr> "None", "None", "None", "None", "None", "Fron…
## $ Distinguishing.Markings <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ BodyWeight              <dbl> 0.00, 0.00, 2.00, 0.00, 0.00, 0.00, 0.00, 12.…
## $ BodyWeightUnit          <chr> NA, NA, "pound", NA, NA, NA, NA, "pound", NA,…
## $ PrimaryColor            <chr> "Black", "Grey", "Black", "Orange", "Black", …
## $ SecondaryColor          <chr> "White", NA, NA, "White", "White", "Brown", N…
## $ ColorPattern            <chr> NA, "Tiger", NA, NA, NA, "Tiger", "Tortoisesh…
## $ Intake.Date             <date> 2000-03-18, 2000-04-06, 2000-05-02, 2000-05-…
## $ Intake.Type             <chr> "Owner/Guardian Surrender", "Stray", "Owner/G…
## $ Intake.Subtype          <chr> "Schedule", "Walk In", "Walk In", "Walk In", …
## $ Reason                  <chr> NA, NA, "Too Many Pets", "Too Many Pets", NA,…
## $ Reason.Category         <chr> NA, NA, "Owner problem", "Owner problem", NA,…
## $ diff_in_days            <drtn> 182 days, 727 days, 90 days, 61 days, 3 days…
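
Note that which.min() returns only the position of the single smallest value, so if several records had an intake date before the date of birth, only one of them would be dropped. A possible alternative, sketched here on a small made-up data frame (the rows are hypothetical; only the column names follow the post), is to filter on the condition itself. The same sketch also replaces empty strings with NA in every character column at once, instead of naming each column.

```r
# Toy data (hypothetical rows), with the same column names as in the post
toy <- data.frame(
  Animal.Name   = c("HARLEY", "", "TOM", ""),
  Date.Of.Birth = as.Date(c("1999-09-18", "2000-02-02", "2001-05-01", "2002-01-10")),
  Intake.Date   = as.Date(c("2000-03-18", "2000-01-01", "2001-04-01", "2003-01-10")),
  stringsAsFactors = FALSE
)
toy$diff_in_days <- toy$Intake.Date - toy$Date.Of.Birth

# Keep every row whose age is non-negative: this removes all impossible
# records at once and leaves the data untouched when there are none
toy_clean <- toy[as.numeric(toy$diff_in_days) >= 0, ]

# Replace empty strings with NA in every character column
chr_cols <- vapply(toy_clean, is.character, logical(1))
toy_clean[chr_cols] <- lapply(
  toy_clean[chr_cols],
  function(x) replace(x, which(x == ""), NA)
)

print(toy_clean)
```

Filtering with a logical condition also avoids a pitfall of negative indexing: when the index vector is empty, x[-ind, ] returns zero rows rather than the full data.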

What to plot?

Plot Age vs Declawed using Boxplot

boxplot(as.numeric(diff_in_days) ~ Declawed, data = cat, horizontal = TRUE)
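
A boxplot is easier to read alongside the numbers behind it. The following is a minimal sketch, on made-up values, of how the group sizes and the median age in years per Declawed group could be tabulated in base R. The data frame toy, the label "Front" and all the values are assumptions for illustration; with the real data, cat would take the place of toy.

```r
# Toy data (hypothetical values), same column names as in the post
toy <- data.frame(
  Declawed     = c("None", "None", "None", "Front", "Front"),
  diff_in_days = c(365, 730, 1095, 1460, 2190),
  stringsAsFactors = FALSE
)

# Age in years, using the same 365-day approximation as the plots
toy$age_years <- toy$diff_in_days / 365

# Number of cats per group
group_sizes <- table(toy$Declawed)
print(group_sizes)

# Median age per group, returned as a data frame
median_age <- aggregate(age_years ~ Declawed, data = toy, FUN = median)
print(median_age)
```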

Can we make it more attractive looking?

Create graph using ggplot

# The image_graph() function opens a new graphics device, similar to e.g. png() or x11().
# It returns an image object to which the plot(s) will be written
fig <- image_graph(width = 600, height = 600, res = 96)

# plot age (in years) vs Declawed and save it as an image
# note: the rounding digits belong inside round(), not inside aes()
p <- ggplot(cat, aes(Declawed, round(as.numeric(diff_in_days) / 365, 2))) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(position = position_jitter(width = 0.30), shape = 20, size = 3, aes(colour = Declawed), alpha = 0.75) +
  stat_summary(fun = mean, shape = 23, size = 3, fill = "orange", col = "black", geom = "point") +
  labs(title = "Cats: Age vs Declawed", x = "Declawed", y = "Age (years)") +
  theme(panel.border = element_rect(fill = NA, colour = "black", size = 2)) +
  theme(plot.title = element_text(size = 20, vjust = 2))
# ggsave() is called on its own; adding it to a plot with `+` is not supported in newer ggplot2 releases
ggsave('~/Documents/my_R/RLadiesMNE/ggplot_image.png', plot = p)
## Saving 6.25 x 6.25 in image

Adding an animation to a graph:

Read gif and background files

# read cat gif file
cat_gif <- image_read("http://media.giphy.com/media/q0ujUmppx3Fu0/giphy.gif")  
#
# Background image
graph_bg <- image_read("~/Documents/my_R/RLadiesMNE/ggplot_image.png")
background <- image_background(image_scale(graph_bg, "650"), "white", flatten = TRUE)
# Combine and flatten frames
frames <- image_apply(cat_gif, function(frame) {
  image_composite(background, frame, offset = "+410+10")
})
# Turn frames into animation
animation <- image_animate(frames, fps = 10)
print(animation)
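
print(animation) only displays the result. To keep it as a file, magick's image_write() can be used. The sketch below is self-contained and uses two plain coloured frames as stand-ins for the composited frames built above. The frame sizes, the colours and the temporary output path are assumptions for illustration only.

```r
library(magick)

# Two plain frames stand in for the composited frames built in the post
frame_a <- image_blank(width = 100, height = 100, color = "white")
frame_b <- image_blank(width = 100, height = 100, color = "orange")
frames  <- image_join(frame_a, frame_b)

# Turn the frames into an animation, as in the post
animation <- image_animate(frames, fps = 10)

# Write the animation to disk; a temporary file is used here as a placeholder
out_file <- tempfile(fileext = ".gif")
image_write(animation, path = out_file, format = "gif")
file.exists(out_file)
```

In a real project, out_file would be replaced with a path of your choice, for example next to the saved ggplot image.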

Happy Plotting!

Tatjana Kecojevic
statistician and ever evolving data scientist

My research work has developed my knowledge and skills within the area of applied statistical modelling. As such, the area of my research enhances the opportunities for cross discipline projects.
