library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(readr)
library(readxl)
library(writexl)
library(arrow, warn.conflicts = FALSE)
# make sure that you have packaged installed in your machine/environments
# you can install with install.packages("<library_name>")
1 Introduction
In this tutorial, we cover basic data manipulation operations including
- Reading data from variety of file formats.
- Selecting variables.
- Filtering rows.
- Adding new variables.
- Arranging rows by variables.
- Summarizing and Grouping.
- Reshaping data between wide and long format.
- Joining tables using key(s) variable.
- Saving data to desired file formats.
2 Load packages
Both R
and python
are open-source and very powerful for data-related work. There are many packages and libraries that can be used, the challenge is to learn and pick the right tool to the task in hand. In this tutorial, we choose standard tools in both language.
We use
dplyr
for data manipulationtidyr
for reshaping, join datareadr
for read csv filesreadxl
for read/import excel fileswriexl
for write/import excel filesarrow
for read/write parquet filesgapminder
example data for demonstration
We mainly use pandas
, however additional packages required to be installed which are openpyxl
and pyarrow
package as we want to work with excel and parquet files.
# make sure that you have pandas pyarrow openpyxl
# installed in your machine/environments
# !pip install pandas openpyxl pyarrow
import pandas as pd
3 Data: gapminder
One of the important step to conduct a data analytics or machine learning project is to understand the data.
Data used in this tutorial are from gapminder
R package inspired by famous Hans Rosling ted’s talk. We use two tables which are
gapminder
contain a data of countries from 1952 - 2007 with 3 variables including population (pop), life expectancy (LifeExp), and gdp per capita(gdpPercap).
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # … with 1,694 more rows
country_codes
contains country names and associated ISO 3166-1 country codes (alpha and numeric) will be used for join.
#> # A tibble: 187 × 3
#> country iso_alpha iso_num
#> <chr> <chr> <int>
#> 1 Afghanistan AFG 4
#> 2 Albania ALB 8
#> 3 Algeria DZA 12
#> 4 Angola AGO 24
#> 5 Argentina ARG 32
#> 6 Armenia ARM 51
#> 7 Aruba ABW 533
#> 8 Australia AUS 36
#> 9 Austria AUT 40
#> 10 Azerbaijan AZE 31
#> # … with 177 more rows
4 Reading data
There are many ways to get data into your machine for data analytics
- files i.e. excel, csv, parquet, etc.
- database
- web api
- web scraping
For the sake of simplicity, we use file approach. I will show how to read or write data in other approaches in future posts.
There are many file types, in this tutorial chooses the most common ones
csv
excel
parquet
4.1 csv
CSV is a text file that each columns are seperated by commas stored in .csv
file extension. Aside from csv
, there are similar text file that store spreadsheet data but used different seperator or fixed width to determine column by position. gapminder.csv
file can be downloaded here.
We use function read_csv
from readr
package.
<- read_csv("gapminder.csv", show_col_types = FALSE)
gapminder # you can use file path in your local machine or url path.
gapminder
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # … with 1,694 more rows
= pd.read_csv("gapminder.csv")
gapminder gapminder
#> country continent year lifeExp pop gdpPercap
#> 0 Afghanistan Asia 1952 28.801 8425333 779.445314
#> 1 Afghanistan Asia 1957 30.332 9240934 820.853030
#> 2 Afghanistan Asia 1962 31.997 10267083 853.100710
#> 3 Afghanistan Asia 1967 34.020 11537966 836.197138
#> 4 Afghanistan Asia 1972 36.088 13079460 739.981106
#> ... ... ... ... ... ... ...
#> 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306
#> 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786
#> 1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
#> 1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
#> 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
#>
#> [1704 rows x 6 columns]
4.2 excel
MS excel is a spreadsheet program and store data in .xls
(2003 version and before) or .xlsx
. They are not text files like csv
, but they are a kind of binary files (XML-like) format. An excel file can contain multiple sheets and within each sheet can have multiple tables. gapminder.xlsx
file can be downloaded here.
We use function read_excel
from readxl
package. As an excel file can have multiple worksheets, we can read one sheet at a time and we can specify sheet to import using sheet
argument where it could be index or sheet name.
- read fist sheet.
<- read_excel("gapminder.xlsx")
gapminder gapminder
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # … with 1,694 more rows
- read second sheet.
<- read_excel("gapminder.xlsx", sheet = 2)
country_codes country_codes
#> # A tibble: 187 × 3
#> country iso_alpha iso_num
#> <chr> <chr> <dbl>
#> 1 Afghanistan AFG 4
#> 2 Albania ALB 8
#> 3 Algeria DZA 12
#> 4 Angola AGO 24
#> 5 Argentina ARG 32
#> 6 Armenia ARM 51
#> 7 Aruba ABW 533
#> 8 Australia AUS 36
#> 9 Austria AUT 40
#> 10 Azerbaijan AZE 31
#> # … with 177 more rows
pandas
has a function read_excel
to import excel files.
- read fist sheet.
= pd.read_excel("gapminder.xlsx")
gapminder gapminder
#> country continent year lifeExp pop gdpPercap
#> 0 Afghanistan Asia 1952 28.801 8425333 779.445314
#> 1 Afghanistan Asia 1957 30.332 9240934 820.853030
#> 2 Afghanistan Asia 1962 31.997 10267083 853.100710
#> 3 Afghanistan Asia 1967 34.020 11537966 836.197138
#> 4 Afghanistan Asia 1972 36.088 13079460 739.981106
#> ... ... ... ... ... ... ...
#> 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306
#> 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786
#> 1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
#> 1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
#> 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
#>
#> [1704 rows x 6 columns]
- read second sheet
= pd.read_excel("gapminder.xlsx", sheet_name = 1)
country_codes country_codes
#> country iso_alpha iso_num
#> 0 Afghanistan AFG 4
#> 1 Albania ALB 8
#> 2 Algeria DZA 12
#> 3 Angola AGO 24
#> 4 Argentina ARG 32
#> .. ... ... ...
#> 182 Vietnam VNM 704
#> 183 West Bank and Gaza PSE 275
#> 184 Yemen, Rep. YEM 887
#> 185 Zambia ZMB 894
#> 186 Zimbabwe ZWE 716
#>
#> [187 rows x 3 columns]
4.3 parquet
Parquet
is an open file format that is great for speed and storage. Some organizeations have changed the way they sharing their data files to parquet
instead of csv
. gapminder.parquet
file can be downloaded here.
We use function read_parquet
from arrow
package.
<- read_parquet("gapminder.parquet")
gapminder gapminder
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # … with 1,694 more rows
pandas
has a function read_parquet
to import excel files.
= pd.read_parquet("gapminder.parquet")
gapminder gapminder
#> country continent year lifeExp pop gdpPercap
#> 0 Afghanistan Asia 1952 28.801 8425333 779.445314
#> 1 Afghanistan Asia 1957 30.332 9240934 820.853030
#> 2 Afghanistan Asia 1962 31.997 10267083 853.100710
#> 3 Afghanistan Asia 1967 34.020 11537966 836.197138
#> 4 Afghanistan Asia 1972 36.088 13079460 739.981106
#> ... ... ... ... ... ... ...
#> 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306
#> 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786
#> 1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
#> 1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
#> 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
#>
#> [1704 rows x 6 columns]
5 Selecting variables
In real data wrangling work, we often get a table that contains too many variables. It is helpful to select only needed variables for further uses. In this tutorial, we select three variables inclduing country
, year
and pop
.
We use select
from dplyr
package.
|> select(country, year, pop) gapminder
#> # A tibble: 1,704 × 3
#> country year pop
#> <fct> <int> <int>
#> 1 Afghanistan 1952 8425333
#> 2 Afghanistan 1957 9240934
#> 3 Afghanistan 1962 10267083
#> 4 Afghanistan 1967 11537966
#> 5 Afghanistan 1972 13079460
#> 6 Afghanistan 1977 14880372
#> 7 Afghanistan 1982 12881816
#> 8 Afghanistan 1987 13867957
#> 9 Afghanistan 1992 16317921
#> 10 Afghanistan 1997 22227415
#> # … with 1,694 more rows
To subset pandas, we can use .loc
where the first argument for row, and the second argument is for columns. To select all rows used keyword :
. To select columns, you can use character vector.
"country", "year", "pop"]] gapminder.loc[:, [
#> country year pop
#> 0 Afghanistan 1952 8425333
#> 1 Afghanistan 1957 9240934
#> 2 Afghanistan 1962 10267083
#> 3 Afghanistan 1967 11537966
#> 4 Afghanistan 1972 13079460
#> ... ... ... ...
#> 1699 Zimbabwe 1987 9216418
#> 1700 Zimbabwe 1992 10704340
#> 1701 Zimbabwe 1997 11404948
#> 1702 Zimbabwe 2002 11926563
#> 1703 Zimbabwe 2007 12311143
#>
#> [1704 rows x 3 columns]
6 Filtering rows
We often interest in a subset of the data. There are many ways to filter data i.e. by index of rows. But one common way is to filter by conditions. In this tutorial, we want a data of Thailand since 1980 from gapminder
.
We use filter
from dplyr
package.
|> filter(country == "Thailand" & year > 1980) gapminder
#> # A tibble: 6 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Thailand Asia 1982 64.6 48827160 2393.
#> 2 Thailand Asia 1987 66.1 52910342 2983.
#> 3 Thailand Asia 1992 67.3 56667095 4617.
#> 4 Thailand Asia 1997 67.5 60216677 5853.
#> 5 Thailand Asia 2002 68.6 62806748 5913.
#> 6 Thailand Asia 2007 70.6 65068149 7458.
We use query
method from pandas
package.
'country == "Thailand" & year > 1980') gapminder.query(
#> country continent year lifeExp pop gdpPercap
#> 1530 Thailand Asia 1982 64.597 48827160 2393.219781
#> 1531 Thailand Asia 1987 66.084 52910342 2982.653773
#> 1532 Thailand Asia 1992 67.298 56667095 4616.896545
#> 1533 Thailand Asia 1997 67.521 60216677 5852.625497
#> 1534 Thailand Asia 2002 68.564 62806748 5913.187529
#> 1535 Thailand Asia 2007 70.616 65068149 7458.396327
7 Adding new variables
When we analyze data, we usually have to compute new variable(s) using original variables. In this tutorial, we compute two new vairables
pop_m
is population in million.gdp_m
is total GDP in million USD where gdp is computed from gdpPercap multiply by pop.
We use mutate
from dplyr
package. It can use to make multiple variables in single function call.
|> mutate(
gapminder pop_m = pop / 1e6,
gdp_m = gdpPercap*pop/1e6
)
#> # A tibble: 1,704 × 8
#> country continent year lifeExp pop gdpPercap pop_m gdp_m
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. 8.43 6567.
#> 2 Afghanistan Asia 1957 30.3 9240934 821. 9.24 7585.
#> 3 Afghanistan Asia 1962 32.0 10267083 853. 10.3 8759.
#> 4 Afghanistan Asia 1967 34.0 11537966 836. 11.5 9648.
#> 5 Afghanistan Asia 1972 36.1 13079460 740. 13.1 9679.
#> 6 Afghanistan Asia 1977 38.4 14880372 786. 14.9 11698.
#> 7 Afghanistan Asia 1982 39.9 12881816 978. 12.9 12599.
#> 8 Afghanistan Asia 1987 40.8 13867957 852. 13.9 11821.
#> 9 Afghanistan Asia 1992 41.7 16317921 649. 16.3 10596.
#> 10 Afghanistan Asia 1997 41.8 22227415 635. 22.2 14122.
#> # … with 1,694 more rows
We use assign
method from pandas
package.
gapminder.assign(= lambda df: df['pop']/1e6,
pop_m = lambda df: df['gdpPercap']*df['pop']/1e6
gdp_m )
#> country continent year ... gdpPercap pop_m gdp_m
#> 0 Afghanistan Asia 1952 ... 779.445314 8.425333 6567.086330
#> 1 Afghanistan Asia 1957 ... 820.853030 9.240934 7585.448670
#> 2 Afghanistan Asia 1962 ... 853.100710 10.267083 8758.855797
#> 3 Afghanistan Asia 1967 ... 836.197138 11.537966 9648.014150
#> 4 Afghanistan Asia 1972 ... 739.981106 13.079460 9678.553274
#> ... ... ... ... ... ... ... ...
#> 1699 Zimbabwe Africa 1987 ... 706.157306 9.216418 6508.240905
#> 1700 Zimbabwe Africa 1992 ... 693.420786 10.704340 7422.611852
#> 1701 Zimbabwe Africa 1997 ... 792.449960 11.404948 9037.850590
#> 1702 Zimbabwe Africa 2002 ... 672.038623 11.926563 8015.110972
#> 1703 Zimbabwe Africa 2007 ... 469.709298 12.311143 5782.658337
#>
#> [1704 rows x 8 columns]
8 Arranging rows by variables
Often, we have to sort or rearrange rows by values of particular columns to help us better understanding of data. We can sort from lower to larger (ascending) or the opposite (descending). Also multiple columns can be sorted. In this tutorial, we illustrate two cases
- sort
gapminder
bycontinent
andcountry
ascending. - sort
gapminder
byyear
descending andcontinent
ascending.
We use arrange
from dplyr
package. Default mode of sorting is ascending, if we want to sort descending, use desc(variable)
syntax.
|> arrange(continent, country) gapminder
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Algeria Africa 1952 43.1 9279525 2449.
#> 2 Algeria Africa 1957 45.7 10270856 3014.
#> 3 Algeria Africa 1962 48.3 11000948 2551.
#> 4 Algeria Africa 1967 51.4 12760499 3247.
#> 5 Algeria Africa 1972 54.5 14760787 4183.
#> 6 Algeria Africa 1977 58.0 17152804 4910.
#> 7 Algeria Africa 1982 61.4 20033753 5745.
#> 8 Algeria Africa 1987 65.8 23254956 5681.
#> 9 Algeria Africa 1992 67.7 26298373 5023.
#> 10 Algeria Africa 1997 69.2 29072015 4797.
#> # … with 1,694 more rows
|> arrange(desc(year), continent) gapminder
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Algeria Africa 2007 72.3 33333216 6223.
#> 2 Angola Africa 2007 42.7 12420476 4797.
#> 3 Benin Africa 2007 56.7 8078314 1441.
#> 4 Botswana Africa 2007 50.7 1639131 12570.
#> 5 Burkina Faso Africa 2007 52.3 14326203 1217.
#> 6 Burundi Africa 2007 49.6 8390505 430.
#> 7 Cameroon Africa 2007 50.4 17696293 2042.
#> 8 Central African Republic Africa 2007 44.7 4369038 706.
#> 9 Chad Africa 2007 50.7 10238807 1704.
#> 10 Comoros Africa 2007 65.2 710960 986.
#> # … with 1,694 more rows
We use sort_values
method from pandas
package. Default mode of sorting is ascending, if we want to sort descending, use ascending arguements.
"continent", "country"]) gapminder.sort_values([
#> country continent year lifeExp pop gdpPercap
#> 24 Algeria Africa 1952 43.077 9279525 2449.008185
#> 25 Algeria Africa 1957 45.685 10270856 3013.976023
#> 26 Algeria Africa 1962 48.303 11000948 2550.816880
#> 27 Algeria Africa 1967 51.407 12760499 3246.991771
#> 28 Algeria Africa 1972 54.518 14760787 4182.663766
#> ... ... ... ... ... ... ...
#> 1099 New Zealand Oceania 1987 74.320 3317166 19007.191290
#> 1100 New Zealand Oceania 1992 76.330 3437674 18363.324940
#> 1101 New Zealand Oceania 1997 77.550 3676187 21050.413770
#> 1102 New Zealand Oceania 2002 79.110 3908037 23189.801350
#> 1103 New Zealand Oceania 2007 80.204 4115771 25185.009110
#>
#> [1704 rows x 6 columns]
"year", "continent"], ascending=[True, False]) gapminder.sort_values([
#> country continent year lifeExp pop gdpPercap
#> 60 Australia Oceania 1952 69.120 8691212 10039.595640
#> 1092 New Zealand Oceania 1952 69.390 1994794 10556.575660
#> 12 Albania Europe 1952 55.230 1282697 1601.056136
#> 72 Austria Europe 1952 66.800 6927772 6137.076492
#> 108 Belgium Europe 1952 68.000 8730405 8343.105127
#> ... ... ... ... ... ... ...
#> 1547 Togo Africa 2007 58.420 5701579 882.969944
#> 1571 Tunisia Africa 2007 73.923 10276158 7092.923025
#> 1595 Uganda Africa 2007 51.542 29170398 1056.380121
#> 1691 Zambia Africa 2007 42.384 11746035 1271.211593
#> 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
#>
#> [1704 rows x 6 columns]
9 Summarizing and Grouping
As data is huge, we need a way or a measure that let us have a better understanding about data. One way is to condense information into some sort of summary statistics i.e. average, min, max, sum, count, etc. These operations are basically reduce the number of rows. In addition, We can summaize data by group (categorial data). In gapminder
, we will show how to do it without group and with group(s).
9.1 without group
We want to find population total and population mean (in millions) of all countries and from 1952 - 2007.
We use summarize
or summarise
from dplyr
package. It can use to make multiple summarized variables in single function call.
|> summarize(
gapminder pop_total_m = sum(pop)/1e6,
pop_mean_m = mean(pop)/1e6
)
#> # A tibble: 1 × 2
#> pop_total_m pop_mean_m
#> <dbl> <dbl>
#> 1 50440. 29.6
We use assign
method from pandas
package. Alternative approach is to compute each summarization.
"pop"].agg(["sum", "mean"])/1e6
gapminder[
# alternative
#> sum 50440.465801
#> mean 29.601212
#> Name: pop, dtype: float64
"pop"].sum()/1e6 gapminder[
#> 50440.465801
"pop"].mean()/1e6 gapminder[
#> 29.601212324530515
9.2 by groups
We want to find population total and population mean (in millions) of all countries and from 1952 - 2007.
We use group_by
from dplyr
package before use summarize
function. The result is grouped dataframe. I normally ungroup
it to normal dataframe.
|>
gapminder group_by(continent, year) |>
summarize(
pop_min_m = min(pop)/1e6,
pop_max_m = max(pop)/1e6
|> ungroup() )
#> # A tibble: 60 × 4
#> continent year pop_min_m pop_max_m
#> <fct> <int> <dbl> <dbl>
#> 1 Africa 1952 0.0600 33.1
#> 2 Africa 1957 0.0613 37.2
#> 3 Africa 1962 0.0653 41.9
#> 4 Africa 1967 0.0708 47.3
#> 5 Africa 1972 0.0766 53.7
#> 6 Africa 1977 0.0868 62.2
#> 7 Africa 1982 0.0986 73.0
#> 8 Africa 1987 0.111 81.6
#> 9 Africa 1992 0.126 93.4
#> 10 Africa 1997 0.146 106.
#> # … with 50 more rows
We use groupby
method from pandas
package by putting it before agg
method.
=["continent", "year"])["pop"].agg(["min", "max"])/1e6 gapminder.groupby(by
#> min max
#> continent year
#> Africa 1952 0.060011 33.119096
#> 1957 0.061325 37.173340
#> 1962 0.065345 41.871351
#> 1967 0.070787 47.287752
#> 1972 0.076595 53.740085
#> 1977 0.086796 62.209173
#> 1982 0.098593 73.039376
#> 1987 0.110812 81.551520
#> 1992 0.125911 93.364244
#> 1997 0.145608 106.207839
#> 2002 0.170372 119.901274
#> 2007 0.199579 135.031164
#> Americas 1952 0.662850 157.553000
#> 1957 0.764900 171.984000
#> 1962 0.887498 186.538000
#> 1967 0.960155 198.712000
#> 1972 0.975199 209.896000
#> 1977 1.039009 220.239000
#> 1982 1.116479 232.187835
#> 1987 1.191336 242.803533
#> 1992 1.183669 256.894189
#> 1997 1.138101 272.911760
#> 2002 1.101832 287.675526
#> 2007 1.056608 301.139947
#> Asia 1952 0.120447 556.263527
#> 1957 0.138655 637.408000
#> 1962 0.171863 665.770000
#> 1967 0.202182 754.550000
#> 1972 0.230800 862.030000
#> 1977 0.297410 943.455000
#> 1982 0.377967 1000.281000
#> 1987 0.454612 1084.035000
#> 1992 0.529491 1164.970000
#> 1997 0.598561 1230.075000
#> 2002 0.656397 1280.400000
#> 2007 0.708573 1318.683096
#> Europe 1952 0.147962 69.145952
#> 1957 0.165110 71.019069
#> 1962 0.182053 73.739117
#> 1967 0.198676 76.368453
#> 1972 0.209275 78.717088
#> 1977 0.221823 78.160773
#> 1982 0.233997 78.335266
#> 1987 0.244676 77.718298
#> 1992 0.259012 80.597764
#> 1997 0.271192 82.011073
#> 2002 0.288030 82.350671
#> 2007 0.301931 82.400996
#> Oceania 1952 1.994794 8.691212
#> 1957 2.229407 9.712569
#> 1962 2.488550 10.794968
#> 1967 2.728150 11.872264
#> 1972 2.929100 13.177000
#> 1977 3.164900 14.074100
#> 1982 3.210650 15.184200
#> 1987 3.317166 16.257249
#> 1992 3.437674 17.481977
#> 1997 3.676187 18.565243
#> 2002 3.908037 19.546792
#> 2007 4.115771 20.434176
10 Reshaping table
Reshaping is used we want to change the way data is presenting. There are wide
format and long
format.
wide
format is a format that have more columns but less rows. Good for making report orpresenting data.long
format is a format that have less columns but more rows. Good for data management, data warehousing.
We can convert back and forth as needed. For excel
user, it is the famous pivot-table function.
10.1 Pivot wider
Convert from long format to wide format. In this tutorial, we want to make a report of each country of life extectancy by years where we want each year to be a column. We can achieve this by pivot gapminder
to wide format.
We use pivot_wider
from tidyr
package.
|> pivot_wider(
gapminder id_cols = c("continent", "country"),
names_from = "year",
values_from = "lifeExp")
#> # A tibble: 142 × 14
#> continent country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Asia Afghanistan 28.8 30.3 32.0 34.0 36.1 38.4 39.9 40.8
#> 2 Europe Albania 55.2 59.3 64.8 66.2 67.7 68.9 70.4 72
#> 3 Africa Algeria 43.1 45.7 48.3 51.4 54.5 58.0 61.4 65.8
#> 4 Africa Angola 30.0 32.0 34 36.0 37.9 39.5 39.9 39.9
#> 5 Americas Argentina 62.5 64.4 65.1 65.6 67.1 68.5 69.9 70.8
#> 6 Oceania Australia 69.1 70.3 70.9 71.1 71.9 73.5 74.7 76.3
#> 7 Europe Austria 66.8 67.5 69.5 70.1 70.6 72.2 73.2 74.9
#> 8 Asia Bahrain 50.9 53.8 56.9 59.9 63.3 65.6 69.1 70.8
#> 9 Asia Bangladesh 37.5 39.3 41.2 43.5 45.3 46.9 50.0 52.8
#> 10 Europe Belgium 68 69.2 70.2 70.9 71.4 72.8 73.9 75.4
#> # … with 132 more rows, and 4 more variables: `1992` <dbl>, `1997` <dbl>,
#> # `2002` <dbl>, `2007` <dbl>
We use pivot
method from pandas
package.
gapminder.pivot(= ["continent", "country"],
index = "year",
columns = "lifeExp"
values )
#> year 1952 1957 1962 ... 1997 2002 2007
#> continent country ...
#> Africa Algeria 43.077 45.685 48.303 ... 69.152 70.994 72.301
#> Angola 30.015 31.999 34.000 ... 40.963 41.003 42.731
#> Benin 38.223 40.358 42.618 ... 54.777 54.406 56.728
#> Botswana 47.622 49.618 51.520 ... 52.556 46.634 50.728
#> Burkina Faso 31.975 34.906 37.814 ... 50.324 50.650 52.295
#> ... ... ... ... ... ... ... ...
#> Europe Switzerland 69.620 70.560 71.320 ... 79.370 80.620 81.701
#> Turkey 43.585 48.079 52.098 ... 68.835 70.845 71.777
#> United Kingdom 69.180 70.420 70.760 ... 77.218 78.471 79.425
#> Oceania Australia 69.120 70.330 70.930 ... 78.830 80.370 81.235
#> New Zealand 69.390 70.260 71.240 ... 77.550 79.110 80.204
#>
#> [142 rows x 12 columns]
10.2 Pivot longer
Convert from wide format to long format. In this tutorial, we want to normalize gapminder
table by moving measurment variables including lifeExp
pop
and gdpPercap
columns into longer format where there is a column to store the variable name and a column store their value. We can achieve this by pivot gapminder
to long format.
We use pivot_longer
from tidyr
package.
|> pivot_longer(
gapminder cols = c("lifeExp", "pop", "gdpPercap"),
names_to = "variable",
values_to = "value")
#> # A tibble: 5,112 × 5
#> country continent year variable value
#> <fct> <fct> <int> <chr> <dbl>
#> 1 Afghanistan Asia 1952 lifeExp 28.8
#> 2 Afghanistan Asia 1952 pop 8425333
#> 3 Afghanistan Asia 1952 gdpPercap 779.
#> 4 Afghanistan Asia 1957 lifeExp 30.3
#> 5 Afghanistan Asia 1957 pop 9240934
#> 6 Afghanistan Asia 1957 gdpPercap 821.
#> 7 Afghanistan Asia 1962 lifeExp 32.0
#> 8 Afghanistan Asia 1962 pop 10267083
#> 9 Afghanistan Asia 1962 gdpPercap 853.
#> 10 Afghanistan Asia 1967 lifeExp 34.0
#> # … with 5,102 more rows
We use melt
method from pandas
package.
gapminder.melt(= ["country", "continent", "year"],
id_vars = ["lifeExp", "pop", "gdpPercap"],
value_vars = "variable",
var_name = "value"
value_name )
#> country continent year variable value
#> 0 Afghanistan Asia 1952 lifeExp 28.801000
#> 1 Afghanistan Asia 1957 lifeExp 30.332000
#> 2 Afghanistan Asia 1962 lifeExp 31.997000
#> 3 Afghanistan Asia 1967 lifeExp 34.020000
#> 4 Afghanistan Asia 1972 lifeExp 36.088000
#> ... ... ... ... ... ...
#> 5107 Zimbabwe Africa 1987 gdpPercap 706.157306
#> 5108 Zimbabwe Africa 1992 gdpPercap 693.420786
#> 5109 Zimbabwe Africa 1997 gdpPercap 792.449960
#> 5110 Zimbabwe Africa 2002 gdpPercap 672.038623
#> 5111 Zimbabwe Africa 2007 gdpPercap 469.709298
#>
#> [5112 rows x 5 columns]
11 Join tables
We join tables together to enrich data i.e. adding new variables from other table or adding metadata from reference tables. In this tutorial, we add country codes to gapminder
dataset.
Like SQL-join, there are many types of join. In this tutorial, we illustrate left-join where the left table rows are preserved and add only matched data from right table.
We use pivot_wider
from tidyr
package.
|> left_join(country_codes, by = c("country" = "country")) gapminder
#> # A tibble: 1,704 × 8
#> country continent year lifeExp pop gdpPercap iso_alpha iso_num
#> <chr> <fct> <int> <dbl> <int> <dbl> <chr> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. AFG 4
#> 2 Afghanistan Asia 1957 30.3 9240934 821. AFG 4
#> 3 Afghanistan Asia 1962 32.0 10267083 853. AFG 4
#> 4 Afghanistan Asia 1967 34.0 11537966 836. AFG 4
#> 5 Afghanistan Asia 1972 36.1 13079460 740. AFG 4
#> 6 Afghanistan Asia 1977 38.4 14880372 786. AFG 4
#> 7 Afghanistan Asia 1982 39.9 12881816 978. AFG 4
#> 8 Afghanistan Asia 1987 40.8 13867957 852. AFG 4
#> 9 Afghanistan Asia 1992 41.7 16317921 649. AFG 4
#> 10 Afghanistan Asia 1997 41.8 22227415 635. AFG 4
#> # … with 1,694 more rows
We use merge
method from pandas
package.
pd.merge(
gapminder, country_codes,= "left",
how = "country",
left_on = "country"
right_on )
#> country continent year ... gdpPercap iso_alpha iso_num
#> 0 Afghanistan Asia 1952 ... 779.445314 AFG 4
#> 1 Afghanistan Asia 1957 ... 820.853030 AFG 4
#> 2 Afghanistan Asia 1962 ... 853.100710 AFG 4
#> 3 Afghanistan Asia 1967 ... 836.197138 AFG 4
#> 4 Afghanistan Asia 1972 ... 739.981106 AFG 4
#> ... ... ... ... ... ... ... ...
#> 1699 Zimbabwe Africa 1987 ... 706.157306 ZWE 716
#> 1700 Zimbabwe Africa 1992 ... 693.420786 ZWE 716
#> 1701 Zimbabwe Africa 1997 ... 792.449960 ZWE 716
#> 1702 Zimbabwe Africa 2002 ... 672.038623 ZWE 716
#> 1703 Zimbabwe Africa 2007 ... 469.709298 ZWE 716
#>
#> [1704 rows x 8 columns]
- when both tables have the same key column name, can use only argument on
pd.merge(
gapminder, country_codes,= "left",
how = "country"
on )
12 Conclusion Remarks
There are many topics that do not cover in this tutorial i.e. date
and datetime
, categorial
, text
, panel data, missing data, advance group statistics, etc. Hope this tutorial help you get start with using programming language like R
and python
to do data manipulation task.