Data Visualization with ggplot2: A Comprehensive Guide

DATA VISUALISATION DA II
NAME: MABRIE TESFAYE ERKIHUN
REG NO: 18BCE2430
SLOT: E2

Data visualization packages in R

igraph

ggplot2


The following are the basic operations we perform on graphs.

  • Display graph vertices
  • Display graph edges
  • Add a vertex
  • Add an edge
  • Creating a graph


Display graph vertices

To display the graph vertices, we simply find the keys of the graph dictionary using the keys() method.

class graph:
    def __init__(self, gdict=None):
        # Default to an empty dictionary so that keys() is always available
        if gdict is None:
            gdict = {}
        self.gdict = gdict

    # Get the keys of the dictionary
    def getVertices(self):
        return list(self.gdict.keys())

# Create the dictionary with graph elements
graph_elements = {"a": ["b", "c"],
                  "b": ["a", "d"],
                  "c": ["a", "d"],
                  "d": ["e"],
                  "e": ["d"]
                  }

g = graph(graph_elements)

print(g.getVertices())


library(igraph)
library(Cairo)
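The basic graph operations listed earlier can also be sketched with igraph. This is only an illustrative sketch, assuming the igraph package is installed: the graph mirrors the graph_elements dictionary from the Python example, and the vertex name "f" is a hypothetical name chosen for the demonstration.

```r
library(igraph)

# Creating a graph: an undirected graph with the same vertices and edges
# as the graph_elements dictionary (a-b, a-c, b-d, c-d, d-e).
g <- graph_from_literal(a - b, a - c, b - d, c - d, d - e)

# Display graph vertices
print(V(g)$name)

# Display graph edges as a two-column matrix of vertex names
print(as_edgelist(g))

# Add a vertex named "f" (hypothetical name, for illustration only)
g <- add_vertices(g, 1, name = "f")

# Add an edge between the existing vertex "a" and the new vertex "f"
g <- add_edges(g, c("a", "f"))
```

After these calls, vcount(g) and ecount(g) report the updated number of vertices and edges.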


Data

The data component is straightforward: it is the dataset that the plot operates on.

Mapping

An aesthetic mapping defines how variables in the dataset are connected to visual properties or outputs. The terms “aesthetic” and “mapping” are often used interchangeably with the more formal “aesthetic mapping”. Just think of a mapping as defining properties of the output that depend upon variables. For example, coloring the points of a scatter plot based upon a categorical variable is a mapping, whereas coloring all points red is not.

The most basic useful mapping is aes(x = var1, y = var2). This tells ggplot which variable goes on which axis.


Basic plots

Let’s start with the simplest example:

library(ggplot2)
data(midwest)
ggplot(data = midwest, 
       mapping = aes(x = percbelowpoverty))
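On its own, this call draws only empty axes, because no geom layer has been added yet. A minimal sketch of the next step, adding geom_histogram(), might look like the following; the bins value is an arbitrary choice for illustration, not something taken from the original example.

```r
library(ggplot2)
data(midwest)

# Data and mapping are declared once in ggplot(); the geom layer inherits them.
p_hist <- ggplot(data = midwest,
                 mapping = aes(x = percbelowpoverty)) +
  geom_histogram(bins = 30)  # bins = 30 is an assumed, illustrative value

p_hist
```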



Now, geom_histogram knows everything that ggplot knows, so it requires no additional arguments. We could easily change this to a density plot:

ggplot(data = midwest,
       mapping = aes(x = percbelowpoverty)) +
  geom_density()



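Important: The + must end the previous line, not begin the next one. If a line ends without the +, R treats it as a complete expression, and the layer on the following line is not added to the plot.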

By changing from geom_histogram to geom_density, we’ve inherited the same mapping information and don’t need to change anything else. Contrast this with the built-in R functionality:

hist(midwest$percbelowpoverty)



plot(density(midwest$percbelowpoverty))



Scatter plots using the ggplot2 package
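A scatter plot needs both an x and a y mapping and a geom_point() layer. The sketch below is an assumption about what such a plot could look like: the variables percbelowpoverty and perchsd are borrowed from the later example in this document, and mapping color to state is an illustrative choice.

```r
library(ggplot2)
data(midwest)

# Each county is one point; color is mapped to the categorical variable state.
p_scatter <- ggplot(data = midwest,
                    mapping = aes(x = percbelowpoverty,
                                  y = perchsd,
                                  color = state)) +
  geom_point()

p_scatter
```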


Additional Aesthetic Mappings

We saw color used above. Let’s contrast how color behaves when it is used as an aesthetic versus when it is set as a fixed value.

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) +
  geom_point(color = 'green')



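Arguments inside a mapping should refer to variables.

If color = 'green' is placed inside aes() instead, ggplot2 finds no variable named 'green', so it creates a new constant variable whose only value is the string 'green'. The points are then drawn in the default color for a single-level variable, not in green, and a legend is added.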
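The difference can be checked with a short sketch. The object names p_fixed and p_const are hypothetical, and layer_data() is used here only to inspect the colors ggplot2 actually assigns.

```r
library(ggplot2)
data(midwest)

# Fixed value: color is set outside aes(), so every point is drawn in green.
p_fixed <- ggplot(midwest, aes(x = percbelowpoverty, y = perchsd)) +
  geom_point(color = 'green')

# Constant mapping: 'green' inside aes() is treated as data, not as a color.
p_const <- ggplot(midwest, aes(x = percbelowpoverty, y = perchsd,
                               color = 'green')) +
  geom_point()

# Inspect the colors each layer ends up with.
unique(layer_data(p_fixed)$colour)
unique(layer_data(p_const)$colour)
```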


ggplot2 chooses a color scale automatically based on the variable's type. With a continuous variable, it uses a color gradient rather than a set of distinct colors.
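A minimal sketch of the continuous case follows. The choice of percollege as the color variable is an assumption made for illustration; any numeric column of midwest would behave the same way.

```r
library(ggplot2)
data(midwest)

# percollege is numeric, so ggplot2 applies a continuous gradient scale.
p_cont <- ggplot(midwest, aes(x = percbelowpoverty,
                              y = perchsd,
                              color = percollege)) +
  geom_point()

p_cont
```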


Stat

Behind the scenes, certain geom_* functions transform the data before plotting. For example, geom_histogram() bins the x variable and counts the observations in each bin.
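One way to see the transformation is to inspect the data a layer actually draws. The sketch below uses layer_data() for that purpose and then builds the same layer from the stat side with stat_bin(); the bins value and the object names are illustrative assumptions.

```r
library(ggplot2)
data(midwest)

# geom_histogram() uses the "bin" stat: raw values become per-bin counts.
p_geom <- ggplot(midwest, aes(x = percbelowpoverty)) +
  geom_histogram(bins = 30)

# The computed data has one row per bin, not one row per county.
binned <- layer_data(p_geom)
head(binned[, c("xmin", "xmax", "count")])

# The same layer expressed from the stat side: bin first, draw as bars.
p_stat <- ggplot(midwest, aes(x = percbelowpoverty)) +
  stat_bin(bins = 30, geom = "bar")
```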


Position

The position argument is used to tweak how certain aspects of the plots are displayed. Its use depends heavily on the type of plot. For each geom, some positions will work, some will do nothing, and some will produce nonsense. They are most commonly used when trying to create grouped plots.

For example, we can look at a histogram of highway mileage (hwy) from the mpg dataset.

If we add fill = class, the histogram is grouped by class. The default is position = "stack"; let’s see what each position does.
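The comparison can be sketched as follows. This is a sketch under assumptions: the bins and alpha values and the object names are illustrative choices, not taken from the original plots.

```r
library(ggplot2)

# Shared data and mapping: highway mileage, grouped by vehicle class.
base <- ggplot(mpg, aes(x = hwy, fill = class))

# Default: bars for each class are stacked on top of each other.
p_stack <- base + geom_histogram(bins = 20)

# identity: bars are drawn exactly where they are, so they overlap;
# alpha makes the overlapping bars visible.
p_identity <- base + geom_histogram(bins = 20, position = "identity", alpha = 0.5)

# dodge: bars for each class are placed side by side within a bin.
p_dodge <- base + geom_histogram(bins = 20, position = "dodge")

# fill: stacked bars are rescaled so each bin shows proportions.
p_fill <- base + geom_histogram(bins = 20, position = "fill")
```

Printing each object in turn (for example, p_dodge) draws the corresponding plot.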


In general:

  • identity is useful when you want to plot things exactly as they are.
  • stack is useful when you want to look both at overall values and per-group values.
  • dodge is useful for group comparisons.
  • fill is useful for considering percentages instead of counts.
  • jitter is useful for scatter plots (and similar) when multiple values may be placed at the same point.

Coordinate systems

The default coordinate system is coord_cartesian(). Besides the default, two coordinate systems are broadly useful and two are niche.

  • coord_fixed forces the x and y axes to have a fixed ratio between units on each. The default ratio is 1:1; the ratio argument changes it.
  • coord_flip flips x and y. It is most useful for getting, for example, horizontal instead of vertical bar charts.
  • coord_map plots map data.
  • coord_polar plots on the polar coordinate system.
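A short sketch of the two more commonly used coordinate systems follows. The choice of the mpg dataset and of these particular variables is an assumption made for illustration.

```r
library(ggplot2)

# A vertical bar chart of the number of cars per class...
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# ...becomes a horizontal bar chart once x and y are flipped.
p_flip <- p_bar + coord_flip()

# cty and hwy share the same unit (miles per gallon), so a 1:1 ratio
# keeps one unit on x the same length as one unit on y.
p_fixed <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  coord_fixed(ratio = 1)
```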


Representing higher dimensions in scatter plots

Although scatter plots are inherently a two (or three) dimensional visualization, clever use of plot characteristics can represent higher dimensions. There is always a trade-off – the more information you represent in a single plot, the more likely you are to confuse the reader.

We’ll be using the mpg dataset from ggplot2. Our primary variables of interest will be city mileage (cty) versus highway mileage (hwy), and we’ll be adding other variables as we go.
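The progression might be sketched as below. The variables added beyond cty and hwy (class for color, displ for size) and the alpha value are assumptions chosen for illustration, not necessarily the ones used in the original plots.

```r
library(ggplot2)

# Two dimensions: city mileage versus highway mileage.
p_2d <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Three dimensions: color encodes the vehicle class.
p_3d <- ggplot(mpg, aes(x = cty, y = hwy, color = class)) +
  geom_point()

# Four dimensions: point size additionally encodes engine displacement.
# alpha makes overlapping points easier to see.
p_4d <- ggplot(mpg, aes(x = cty, y = hwy, color = class, size = displ)) +
  geom_point(alpha = 0.6)
```

Each added aesthetic also adds a legend, which is part of the trade-off described above: the plot carries more information but takes more effort to read.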
