Source: Photo by Engin Akyurt https://unsplash.com/photos/tYXBfhpZA1k

1 Introduction

In this tutorial, we cover basic data manipulation operations including

Reading data from variety of file formats.
Selecting variables.
Filtering rows.
Adding new variables.
Arranging rows by variables.
Summarizing and Grouping.
Reshaping data between wide and long format.
Joining tables using key(s) variable.
Saving data to desired file formats.

2 Load packages

Both R and python are open-source and very powerful for data-related work. There are many packages and libraries that can be used, the challenge is to learn and pick the right tool to the task in hand. In this tutorial, we choose standard tools in both language.

R
python

We use

dplyr for data manipulation
tidyr for reshaping, join data
readr for read csv files
readxl for read/import excel files
wriexl for write/import excel files
arrow for read/write parquet files
gapminder example data for demonstration

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(readr)
library(readxl)
library(writexl)
library(arrow, warn.conflicts = FALSE)
# make sure that you have packaged installed in your machine/environments
# you can install with install.packages("<library_name>")

We mainly use pandas, however additional packages required to be installed which are openpyxl and pyarrow package as we want to work with excel and parquet files.

# make sure that you have pandas pyarrow openpyxl 
# installed in your machine/environments
# !pip install pandas openpyxl pyarrow

import pandas as pd

3 Data: gapminder

One of the important step to conduct a data analytics or machine learning project is to understand the data.

Data used in this tutorial are from gapminder R package inspired by famous Hans Rosling ted’s talk. We use two tables which are

gapminder contain a data of countries from 1952 - 2007 with 3 variables including population (pop), life expectancy (LifeExp), and gdp per capita(gdpPercap).

#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows

country_codes contains country names and associated ISO 3166-1 country codes (alpha and numeric) will be used for join.

#> # A tibble: 187 × 3
#>    country     iso_alpha iso_num
#>    <chr>       <chr>       <int>
#>  1 Afghanistan AFG             4
#>  2 Albania     ALB             8
#>  3 Algeria     DZA            12
#>  4 Angola      AGO            24
#>  5 Argentina   ARG            32
#>  6 Armenia     ARM            51
#>  7 Aruba       ABW           533
#>  8 Australia   AUS            36
#>  9 Austria     AUT            40
#> 10 Azerbaijan  AZE            31
#> # … with 177 more rows

4 Reading data

There are many ways to get data into your machine for data analytics

files i.e. excel, csv, parquet, etc.
database
web api
web scraping

For the sake of simplicity, we use file approach. I will show how to read or write data in other approaches in future posts.

There are many file types, in this tutorial chooses the most common ones

csv
excel
parquet

4.1 csv

CSV is a text file that each columns are seperated by commas stored in .csv file extension. Aside from csv, there are similar text file that store spreadsheet data but used different seperator or fixed width to determine column by position. gapminder.csv file can be downloaded here.

R
python

We use function read_csv from readr package.

gapminder <- read_csv("gapminder.csv", show_col_types = FALSE)
# you can use file path in your local machine or url path.
gapminder

#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows

Note

Note <- is convention assignment in R, but = can also be used.

gapminder = pd.read_csv("gapminder.csv")
gapminder

#>           country continent  year  lifeExp       pop   gdpPercap
#> 0     Afghanistan      Asia  1952   28.801   8425333  779.445314
#> 1     Afghanistan      Asia  1957   30.332   9240934  820.853030
#> 2     Afghanistan      Asia  1962   31.997  10267083  853.100710
#> 3     Afghanistan      Asia  1967   34.020  11537966  836.197138
#> 4     Afghanistan      Asia  1972   36.088  13079460  739.981106
#> ...           ...       ...   ...      ...       ...         ...
#> 1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
#> 1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
#> 1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
#> 1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
#> 1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298
#> 
#> [1704 rows x 6 columns]

4.2 excel

MS excel is a spreadsheet program and store data in .xls (2003 version and before) or .xlsx. They are not text files like csv, but they are a kind of binary files (XML-like) format. An excel file can contain multiple sheets and within each sheet can have multiple tables. gapminder.xlsx file can be downloaded here.

Tip

Unstructured excel files are difficult to work with using programming. Excel are still widely used by many organizations, so at leaset, we should convince users to make a proper spreadsheet for example follow Broman and Woo’ guideline.

R
python

We use function read_excel from readxl package. As an excel file can have multiple worksheets, we can read one sheet at a time and we can specify sheet to import using sheet argument where it could be index or sheet name.

read fist sheet.

gapminder <- read_excel("gapminder.xlsx")
gapminder

#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows

read second sheet.

country_codes <- read_excel("gapminder.xlsx", sheet = 2)
country_codes

#> # A tibble: 187 × 3
#>    country     iso_alpha iso_num
#>    <chr>       <chr>       <dbl>
#>  1 Afghanistan AFG             4
#>  2 Albania     ALB             8
#>  3 Algeria     DZA            12
#>  4 Angola      AGO            24
#>  5 Argentina   ARG            32
#>  6 Armenia     ARM            51
#>  7 Aruba       ABW           533
#>  8 Australia   AUS            36
#>  9 Austria     AUT            40
#> 10 Azerbaijan  AZE            31
#> # … with 177 more rows

Note

Index in R starts with 1.

pandas has a function read_excel to import excel files.

read fist sheet.

gapminder = pd.read_excel("gapminder.xlsx")
gapminder

#>           country continent  year  lifeExp       pop   gdpPercap
#> 0     Afghanistan      Asia  1952   28.801   8425333  779.445314
#> 1     Afghanistan      Asia  1957   30.332   9240934  820.853030
#> 2     Afghanistan      Asia  1962   31.997  10267083  853.100710
#> 3     Afghanistan      Asia  1967   34.020  11537966  836.197138
#> 4     Afghanistan      Asia  1972   36.088  13079460  739.981106
#> ...           ...       ...   ...      ...       ...         ...
#> 1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
#> 1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
#> 1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
#> 1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
#> 1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298
#> 
#> [1704 rows x 6 columns]

read second sheet

country_codes = pd.read_excel("gapminder.xlsx", sheet_name = 1)
country_codes

#>                 country iso_alpha  iso_num
#> 0           Afghanistan       AFG        4
#> 1               Albania       ALB        8
#> 2               Algeria       DZA       12
#> 3                Angola       AGO       24
#> 4             Argentina       ARG       32
#> ..                  ...       ...      ...
#> 182             Vietnam       VNM      704
#> 183  West Bank and Gaza       PSE      275
#> 184         Yemen, Rep.       YEM      887
#> 185              Zambia       ZMB      894
#> 186            Zimbabwe       ZWE      716
#> 
#> [187 rows x 3 columns]

Note

Index in python starts with 0.

4.3 parquet

Parquet is an open file format that is great for speed and storage. Some organizeations have changed the way they sharing their data files to parquet instead of csv. gapminder.parquet file can be downloaded here.

R
python

We use function read_parquet from arrow package.

gapminder <- read_parquet("gapminder.parquet")
gapminder

#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows

pandas has a function read_parquet to import excel files.

gapminder = pd.read_parquet("gapminder.parquet")
gapminder

#>           country continent  year  lifeExp       pop   gdpPercap
#> 0     Afghanistan      Asia  1952   28.801   8425333  779.445314
#> 1     Afghanistan      Asia  1957   30.332   9240934  820.853030
#> 2     Afghanistan      Asia  1962   31.997  10267083  853.100710
#> 3     Afghanistan      Asia  1967   34.020  11537966  836.197138
#> 4     Afghanistan      Asia  1972   36.088  13079460  739.981106
#> ...           ...       ...   ...      ...       ...         ...
#> 1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
#> 1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
#> 1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
#> 1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
#> 1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298
#> 
#> [1704 rows x 6 columns]

5 Selecting variables

In real data wrangling work, we often get a table that contains too many variables. It is helpful to select only needed variables for further uses. In this tutorial, we select three variables inclduing country, year and pop.

R
python

We use select from dplyr package.

gapminder |> select(country, year, pop)

#> # A tibble: 1,704 × 3
#>    country      year      pop
#>    <fct>       <int>    <int>
#>  1 Afghanistan  1952  8425333
#>  2 Afghanistan  1957  9240934
#>  3 Afghanistan  1962 10267083
#>  4 Afghanistan  1967 11537966
#>  5 Afghanistan  1972 13079460
#>  6 Afghanistan  1977 14880372
#>  7 Afghanistan  1982 12881816
#>  8 Afghanistan  1987 13867957
#>  9 Afghanistan  1992 16317921
#> 10 Afghanistan  1997 22227415
#> # … with 1,694 more rows

Note

Pipe operator |> is an operator that take the previous object and apply it to the subsequent function as the first argument (default). We can use |> multiple times to the same object. This is called chaning This style of coding help coding more human readable as it avoids nested code. This is similar to method chainging in python.

|> was introduced in R version 4.0. Previous version used %>% from magrittr. You will see %>% in many tutorials or books.

To subset pandas, we can use .loc where the first argument for row, and the second argument is for columns. To select all rows used keyword :. To select columns, you can use character vector.

gapminder.loc[:, ["country", "year", "pop"]]

#>           country  year       pop
#> 0     Afghanistan  1952   8425333
#> 1     Afghanistan  1957   9240934
#> 2     Afghanistan  1962  10267083
#> 3     Afghanistan  1967  11537966
#> 4     Afghanistan  1972  13079460
#> ...           ...   ...       ...
#> 1699     Zimbabwe  1987   9216418
#> 1700     Zimbabwe  1992  10704340
#> 1701     Zimbabwe  1997  11404948
#> 1702     Zimbabwe  2002  11926563
#> 1703     Zimbabwe  2007  12311143
#> 
#> [1704 rows x 3 columns]

6 Filtering rows

We often interest in a subset of the data. There are many ways to filter data i.e. by index of rows. But one common way is to filter by conditions. In this tutorial, we want a data of Thailand since 1980 from gapminder.

R
python

We use filter from dplyr package.

gapminder |> filter(country == "Thailand" & year > 1980)

#> # A tibble: 6 × 6
#>   country  continent  year lifeExp      pop gdpPercap
#>   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Thailand Asia       1982    64.6 48827160     2393.
#> 2 Thailand Asia       1987    66.1 52910342     2983.
#> 3 Thailand Asia       1992    67.3 56667095     4617.
#> 4 Thailand Asia       1997    67.5 60216677     5853.
#> 5 Thailand Asia       2002    68.6 62806748     5913.
#> 6 Thailand Asia       2007    70.6 65068149     7458.

We use query method from pandas package.

gapminder.query('country == "Thailand" & year > 1980')

#>        country continent  year  lifeExp       pop    gdpPercap
#> 1530  Thailand      Asia  1982   64.597  48827160  2393.219781
#> 1531  Thailand      Asia  1987   66.084  52910342  2982.653773
#> 1532  Thailand      Asia  1992   67.298  56667095  4616.896545
#> 1533  Thailand      Asia  1997   67.521  60216677  5852.625497
#> 1534  Thailand      Asia  2002   68.564  62806748  5913.187529
#> 1535  Thailand      Asia  2007   70.616  65068149  7458.396327

Note

query method argument is string. If the condition used another string, we need to use '' and "" to differentiate between a string used for string comparison and a string used for query arguement as a whole.

7 Adding new variables

When we analyze data, we usually have to compute new variable(s) using original variables. In this tutorial, we compute two new vairables

pop_m is population in million.
gdp_m is total GDP in million USD where gdp is computed from gdpPercap multiply by pop.

R
python

We use mutate from dplyr package. It can use to make multiple variables in single function call.

gapminder |> mutate(
  pop_m = pop / 1e6,
  gdp_m = gdpPercap*pop/1e6
  )

#> # A tibble: 1,704 × 8
#>    country     continent  year lifeExp      pop gdpPercap pop_m  gdp_m
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <dbl>  <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.  8.43  6567.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.  9.24  7585.
#>  3 Afghanistan Asia       1962    32.0 10267083      853. 10.3   8759.
#>  4 Afghanistan Asia       1967    34.0 11537966      836. 11.5   9648.
#>  5 Afghanistan Asia       1972    36.1 13079460      740. 13.1   9679.
#>  6 Afghanistan Asia       1977    38.4 14880372      786. 14.9  11698.
#>  7 Afghanistan Asia       1982    39.9 12881816      978. 12.9  12599.
#>  8 Afghanistan Asia       1987    40.8 13867957      852. 13.9  11821.
#>  9 Afghanistan Asia       1992    41.7 16317921      649. 16.3  10596.
#> 10 Afghanistan Asia       1997    41.8 22227415      635. 22.2  14122.
#> # … with 1,694 more rows

We use assign method from pandas package.

gapminder.assign(
  pop_m = lambda df: df['pop']/1e6,
  gdp_m = lambda df: df['gdpPercap']*df['pop']/1e6
  )

#>           country continent  year  ...   gdpPercap      pop_m        gdp_m
#> 0     Afghanistan      Asia  1952  ...  779.445314   8.425333  6567.086330
#> 1     Afghanistan      Asia  1957  ...  820.853030   9.240934  7585.448670
#> 2     Afghanistan      Asia  1962  ...  853.100710  10.267083  8758.855797
#> 3     Afghanistan      Asia  1967  ...  836.197138  11.537966  9648.014150
#> 4     Afghanistan      Asia  1972  ...  739.981106  13.079460  9678.553274
#> ...           ...       ...   ...  ...         ...        ...          ...
#> 1699     Zimbabwe    Africa  1987  ...  706.157306   9.216418  6508.240905
#> 1700     Zimbabwe    Africa  1992  ...  693.420786  10.704340  7422.611852
#> 1701     Zimbabwe    Africa  1997  ...  792.449960  11.404948  9037.850590
#> 1702     Zimbabwe    Africa  2002  ...  672.038623  11.926563  8015.110972
#> 1703     Zimbabwe    Africa  2007  ...  469.709298  12.311143  5782.658337
#> 
#> [1704 rows x 8 columns]

8 Arranging rows by variables

Often, we have to sort or rearrange rows by values of particular columns to help us better understanding of data. We can sort from lower to larger (ascending) or the opposite (descending). Also multiple columns can be sorted. In this tutorial, we illustrate two cases

sort gapminder by continent and country ascending.
sort gapminder by year descending and continent ascending.

R
python

We use arrange from dplyr package. Default mode of sorting is ascending, if we want to sort descending, use desc(variable) syntax.

gapminder |> arrange(continent, country)

#> # A tibble: 1,704 × 6
#>    country continent  year lifeExp      pop gdpPercap
#>    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Algeria Africa     1952    43.1  9279525     2449.
#>  2 Algeria Africa     1957    45.7 10270856     3014.
#>  3 Algeria Africa     1962    48.3 11000948     2551.
#>  4 Algeria Africa     1967    51.4 12760499     3247.
#>  5 Algeria Africa     1972    54.5 14760787     4183.
#>  6 Algeria Africa     1977    58.0 17152804     4910.
#>  7 Algeria Africa     1982    61.4 20033753     5745.
#>  8 Algeria Africa     1987    65.8 23254956     5681.
#>  9 Algeria Africa     1992    67.7 26298373     5023.
#> 10 Algeria Africa     1997    69.2 29072015     4797.
#> # … with 1,694 more rows

gapminder |> arrange(desc(year), continent)

#> # A tibble: 1,704 × 6
#>    country                  continent  year lifeExp      pop gdpPercap
#>    <fct>                    <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Algeria                  Africa     2007    72.3 33333216     6223.
#>  2 Angola                   Africa     2007    42.7 12420476     4797.
#>  3 Benin                    Africa     2007    56.7  8078314     1441.
#>  4 Botswana                 Africa     2007    50.7  1639131    12570.
#>  5 Burkina Faso             Africa     2007    52.3 14326203     1217.
#>  6 Burundi                  Africa     2007    49.6  8390505      430.
#>  7 Cameroon                 Africa     2007    50.4 17696293     2042.
#>  8 Central African Republic Africa     2007    44.7  4369038      706.
#>  9 Chad                     Africa     2007    50.7 10238807     1704.
#> 10 Comoros                  Africa     2007    65.2   710960      986.
#> # … with 1,694 more rows

We use sort_values method from pandas package. Default mode of sorting is ascending, if we want to sort descending, use ascending arguements.

gapminder.sort_values(["continent", "country"])

#>           country continent  year  lifeExp       pop     gdpPercap
#> 24        Algeria    Africa  1952   43.077   9279525   2449.008185
#> 25        Algeria    Africa  1957   45.685  10270856   3013.976023
#> 26        Algeria    Africa  1962   48.303  11000948   2550.816880
#> 27        Algeria    Africa  1967   51.407  12760499   3246.991771
#> 28        Algeria    Africa  1972   54.518  14760787   4182.663766
#> ...           ...       ...   ...      ...       ...           ...
#> 1099  New Zealand   Oceania  1987   74.320   3317166  19007.191290
#> 1100  New Zealand   Oceania  1992   76.330   3437674  18363.324940
#> 1101  New Zealand   Oceania  1997   77.550   3676187  21050.413770
#> 1102  New Zealand   Oceania  2002   79.110   3908037  23189.801350
#> 1103  New Zealand   Oceania  2007   80.204   4115771  25185.009110
#> 
#> [1704 rows x 6 columns]

gapminder.sort_values(["year", "continent"], ascending=[True, False])

#>           country continent  year  lifeExp       pop     gdpPercap
#> 60      Australia   Oceania  1952   69.120   8691212  10039.595640
#> 1092  New Zealand   Oceania  1952   69.390   1994794  10556.575660
#> 12        Albania    Europe  1952   55.230   1282697   1601.056136
#> 72        Austria    Europe  1952   66.800   6927772   6137.076492
#> 108       Belgium    Europe  1952   68.000   8730405   8343.105127
#> ...           ...       ...   ...      ...       ...           ...
#> 1547         Togo    Africa  2007   58.420   5701579    882.969944
#> 1571      Tunisia    Africa  2007   73.923  10276158   7092.923025
#> 1595       Uganda    Africa  2007   51.542  29170398   1056.380121
#> 1691       Zambia    Africa  2007   42.384  11746035   1271.211593
#> 1703     Zimbabwe    Africa  2007   43.487  12311143    469.709298
#> 
#> [1704 rows x 6 columns]

9 Summarizing and Grouping

As data is huge, we need a way or a measure that let us have a better understanding about data. One way is to condense information into some sort of summary statistics i.e. average, min, max, sum, count, etc. These operations are basically reduce the number of rows. In addition, We can summaize data by group (categorial data). In gapminder, we will show how to do it without group and with group(s).

9.1 without group

We want to find population total and population mean (in millions) of all countries and from 1952 - 2007.

R
python

We use summarize or summarise from dplyr package. It can use to make multiple summarized variables in single function call.

gapminder |> summarize(
  pop_total_m = sum(pop)/1e6,
  pop_mean_m = mean(pop)/1e6
  )

#> # A tibble: 1 × 2
#>   pop_total_m pop_mean_m
#>         <dbl>      <dbl>
#> 1      50440.       29.6

We use assign method from pandas package. Alternative approach is to compute each summarization.

gapminder["pop"].agg(["sum", "mean"])/1e6

# alternative

#> sum     50440.465801
#> mean       29.601212
#> Name: pop, dtype: float64

gapminder["pop"].sum()/1e6

#> 50440.465801

gapminder["pop"].mean()/1e6

#> 29.601212324530515

9.2 by groups

We want to find population total and population mean (in millions) of all countries and from 1952 - 2007.

R
python

We use group_by from dplyr package before use summarize function. The result is grouped dataframe. I normally ungroup it to normal dataframe.

gapminder |>
  group_by(continent, year) |>
  summarize(
    pop_min_m = min(pop)/1e6,
    pop_max_m = max(pop)/1e6
  ) |> ungroup()

#> # A tibble: 60 × 4
#>    continent  year pop_min_m pop_max_m
#>    <fct>     <int>     <dbl>     <dbl>
#>  1 Africa     1952    0.0600      33.1
#>  2 Africa     1957    0.0613      37.2
#>  3 Africa     1962    0.0653      41.9
#>  4 Africa     1967    0.0708      47.3
#>  5 Africa     1972    0.0766      53.7
#>  6 Africa     1977    0.0868      62.2
#>  7 Africa     1982    0.0986      73.0
#>  8 Africa     1987    0.111       81.6
#>  9 Africa     1992    0.126       93.4
#> 10 Africa     1997    0.146      106. 
#> # … with 50 more rows

We use groupby method from pandas package by putting it before agg method.

gapminder.groupby(by=["continent", "year"])["pop"].agg(["min", "max"])/1e6

#>                      min          max
#> continent year                       
#> Africa    1952  0.060011    33.119096
#>           1957  0.061325    37.173340
#>           1962  0.065345    41.871351
#>           1967  0.070787    47.287752
#>           1972  0.076595    53.740085
#>           1977  0.086796    62.209173
#>           1982  0.098593    73.039376
#>           1987  0.110812    81.551520
#>           1992  0.125911    93.364244
#>           1997  0.145608   106.207839
#>           2002  0.170372   119.901274
#>           2007  0.199579   135.031164
#> Americas  1952  0.662850   157.553000
#>           1957  0.764900   171.984000
#>           1962  0.887498   186.538000
#>           1967  0.960155   198.712000
#>           1972  0.975199   209.896000
#>           1977  1.039009   220.239000
#>           1982  1.116479   232.187835
#>           1987  1.191336   242.803533
#>           1992  1.183669   256.894189
#>           1997  1.138101   272.911760
#>           2002  1.101832   287.675526
#>           2007  1.056608   301.139947
#> Asia      1952  0.120447   556.263527
#>           1957  0.138655   637.408000
#>           1962  0.171863   665.770000
#>           1967  0.202182   754.550000
#>           1972  0.230800   862.030000
#>           1977  0.297410   943.455000
#>           1982  0.377967  1000.281000
#>           1987  0.454612  1084.035000
#>           1992  0.529491  1164.970000
#>           1997  0.598561  1230.075000
#>           2002  0.656397  1280.400000
#>           2007  0.708573  1318.683096
#> Europe    1952  0.147962    69.145952
#>           1957  0.165110    71.019069
#>           1962  0.182053    73.739117
#>           1967  0.198676    76.368453
#>           1972  0.209275    78.717088
#>           1977  0.221823    78.160773
#>           1982  0.233997    78.335266
#>           1987  0.244676    77.718298
#>           1992  0.259012    80.597764
#>           1997  0.271192    82.011073
#>           2002  0.288030    82.350671
#>           2007  0.301931    82.400996
#> Oceania   1952  1.994794     8.691212
#>           1957  2.229407     9.712569
#>           1962  2.488550    10.794968
#>           1967  2.728150    11.872264
#>           1972  2.929100    13.177000
#>           1977  3.164900    14.074100
#>           1982  3.210650    15.184200
#>           1987  3.317166    16.257249
#>           1992  3.437674    17.481977
#>           1997  3.676187    18.565243
#>           2002  3.908037    19.546792
#>           2007  4.115771    20.434176

10 Reshaping table

Reshaping is used we want to change the way data is presenting. There are wide format and long format.

wide format is a format that have more columns but less rows. Good for making report orpresenting data.
long format is a format that have less columns but more rows. Good for data management, data warehousing.

We can convert back and forth as needed. For excel user, it is the famous pivot-table function.

10.1 Pivot wider

Convert from long format to wide format. In this tutorial, we want to make a report of each country of life extectancy by years where we want each year to be a column. We can achieve this by pivot gapminder to wide format.

R
python

We use pivot_wider from tidyr package.

gapminder |> pivot_wider(
  id_cols = c("continent", "country"), 
  names_from = "year",
  values_from = "lifeExp")

#> # A tibble: 142 × 14
#>    continent country     `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
#>    <fct>     <fct>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Asia      Afghanistan   28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8
#>  2 Europe    Albania       55.2   59.3   64.8   66.2   67.7   68.9   70.4   72  
#>  3 Africa    Algeria       43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8
#>  4 Africa    Angola        30.0   32.0   34     36.0   37.9   39.5   39.9   39.9
#>  5 Americas  Argentina     62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8
#>  6 Oceania   Australia     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3
#>  7 Europe    Austria       66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9
#>  8 Asia      Bahrain       50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8
#>  9 Asia      Bangladesh    37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8
#> 10 Europe    Belgium       68     69.2   70.2   70.9   71.4   72.8   73.9   75.4
#> # … with 132 more rows, and 4 more variables: `1992` <dbl>, `1997` <dbl>,
#> #   `2002` <dbl>, `2007` <dbl>

We use pivot method from pandas package.

gapminder.pivot(
  index = ["continent", "country"], 
  columns = "year", 
  values = "lifeExp"
  )

#> year                        1952    1957    1962  ...    1997    2002    2007
#> continent country                                 ...                        
#> Africa    Algeria         43.077  45.685  48.303  ...  69.152  70.994  72.301
#>           Angola          30.015  31.999  34.000  ...  40.963  41.003  42.731
#>           Benin           38.223  40.358  42.618  ...  54.777  54.406  56.728
#>           Botswana        47.622  49.618  51.520  ...  52.556  46.634  50.728
#>           Burkina Faso    31.975  34.906  37.814  ...  50.324  50.650  52.295
#> ...                          ...     ...     ...  ...     ...     ...     ...
#> Europe    Switzerland     69.620  70.560  71.320  ...  79.370  80.620  81.701
#>           Turkey          43.585  48.079  52.098  ...  68.835  70.845  71.777
#>           United Kingdom  69.180  70.420  70.760  ...  77.218  78.471  79.425
#> Oceania   Australia       69.120  70.330  70.930  ...  78.830  80.370  81.235
#>           New Zealand     69.390  70.260  71.240  ...  77.550  79.110  80.204
#> 
#> [142 rows x 12 columns]

10.2 Pivot longer

Convert from wide format to long format. In this tutorial, we want to normalize gapminder table by moving measurment variables including lifeExp pop and gdpPercap columns into longer format where there is a column to store the variable name and a column store their value. We can achieve this by pivot gapminder to long format.

R
python

We use pivot_longer from tidyr package.

gapminder |> pivot_longer(
  cols = c("lifeExp", "pop", "gdpPercap"), 
  names_to  = "variable",
  values_to = "value")

#> # A tibble: 5,112 × 5
#>    country     continent  year variable       value
#>    <fct>       <fct>     <int> <chr>          <dbl>
#>  1 Afghanistan Asia       1952 lifeExp         28.8
#>  2 Afghanistan Asia       1952 pop        8425333  
#>  3 Afghanistan Asia       1952 gdpPercap      779. 
#>  4 Afghanistan Asia       1957 lifeExp         30.3
#>  5 Afghanistan Asia       1957 pop        9240934  
#>  6 Afghanistan Asia       1957 gdpPercap      821. 
#>  7 Afghanistan Asia       1962 lifeExp         32.0
#>  8 Afghanistan Asia       1962 pop       10267083  
#>  9 Afghanistan Asia       1962 gdpPercap      853. 
#> 10 Afghanistan Asia       1967 lifeExp         34.0
#> # … with 5,102 more rows

We use melt method from pandas package.

gapminder.melt(
  id_vars = ["country", "continent", "year"],
  value_vars = ["lifeExp", "pop", "gdpPercap"], 
  var_name = "variable", 
  value_name = "value"
  )

#>           country continent  year   variable       value
#> 0     Afghanistan      Asia  1952    lifeExp   28.801000
#> 1     Afghanistan      Asia  1957    lifeExp   30.332000
#> 2     Afghanistan      Asia  1962    lifeExp   31.997000
#> 3     Afghanistan      Asia  1967    lifeExp   34.020000
#> 4     Afghanistan      Asia  1972    lifeExp   36.088000
#> ...           ...       ...   ...        ...         ...
#> 5107     Zimbabwe    Africa  1987  gdpPercap  706.157306
#> 5108     Zimbabwe    Africa  1992  gdpPercap  693.420786
#> 5109     Zimbabwe    Africa  1997  gdpPercap  792.449960
#> 5110     Zimbabwe    Africa  2002  gdpPercap  672.038623
#> 5111     Zimbabwe    Africa  2007  gdpPercap  469.709298
#> 
#> [5112 rows x 5 columns]

11 Join tables

We join tables together to enrich data i.e. adding new variables from other table or adding metadata from reference tables. In this tutorial, we add country codes to gapminder dataset.

Like SQL-join, there are many types of join. In this tutorial, we illustrate left-join where the left table rows are preserved and add only matched data from right table.

R
python

We use pivot_wider from tidyr package.

gapminder |> left_join(country_codes, by = c("country" = "country"))

#> # A tibble: 1,704 × 8
#>    country     continent  year lifeExp      pop gdpPercap iso_alpha iso_num
#>    <chr>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>       <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779. AFG             4
#>  2 Afghanistan Asia       1957    30.3  9240934      821. AFG             4
#>  3 Afghanistan Asia       1962    32.0 10267083      853. AFG             4
#>  4 Afghanistan Asia       1967    34.0 11537966      836. AFG             4
#>  5 Afghanistan Asia       1972    36.1 13079460      740. AFG             4
#>  6 Afghanistan Asia       1977    38.4 14880372      786. AFG             4
#>  7 Afghanistan Asia       1982    39.9 12881816      978. AFG             4
#>  8 Afghanistan Asia       1987    40.8 13867957      852. AFG             4
#>  9 Afghanistan Asia       1992    41.7 16317921      649. AFG             4
#> 10 Afghanistan Asia       1997    41.8 22227415      635. AFG             4
#> # … with 1,694 more rows

We use merge method from pandas package.


pd.merge(
  gapminder, country_codes,
  how = "left",
  left_on = "country",
  right_on = "country"
)

#>           country continent  year  ...   gdpPercap  iso_alpha  iso_num
#> 0     Afghanistan      Asia  1952  ...  779.445314        AFG        4
#> 1     Afghanistan      Asia  1957  ...  820.853030        AFG        4
#> 2     Afghanistan      Asia  1962  ...  853.100710        AFG        4
#> 3     Afghanistan      Asia  1967  ...  836.197138        AFG        4
#> 4     Afghanistan      Asia  1972  ...  739.981106        AFG        4
#> ...           ...       ...   ...  ...         ...        ...      ...
#> 1699     Zimbabwe    Africa  1987  ...  706.157306        ZWE      716
#> 1700     Zimbabwe    Africa  1992  ...  693.420786        ZWE      716
#> 1701     Zimbabwe    Africa  1997  ...  792.449960        ZWE      716
#> 1702     Zimbabwe    Africa  2002  ...  672.038623        ZWE      716
#> 1703     Zimbabwe    Africa  2007  ...  469.709298        ZWE      716
#> 
#> [1704 rows x 8 columns]

when both tables have the same key column name, can use only argument on

pd.merge(
  gapminder, country_codes,
  how = "left",
  on = "country"
)

12 Conclusion Remarks

There are many topics that do not cover in this tutorial i.e. date and datetime, categorial, text, panel data, missing data, advance group statistics, etc. Hope this tutorial help you get start with using programming language like R and python to do data manipulation task.

Grammar of Data Manipulation in R and python

1 Introduction

2 Load packages

3 Data: gapminder

4 Reading data

4.1 csv

4.2 excel

4.3 parquet

5 Selecting variables

6 Filtering rows

7 Adding new variables

8 Arranging rows by variables

9 Summarizing and Grouping

9.1 without group

9.2 by groups

10 Reshaping table

10.1 Pivot wider

10.2 Pivot longer

11 Join tables

12 Conclusion Remarks

12.1 Useful resource: