Using R to analyze holidays around the world
Written by Lisa Cooper
The holiday season is fully upon us here in the United States, but how does our holiday schedule compare with the rest of the world? That's a question that's easy to answer with R!
The data set
As we don't have any specific questions in mind, what we'll do is called an exploratory analysis. That means we'll look at data in various ways and see if anything looks interesting. If so, we'll follow that lead to see what else we can learn. It's a fun way to get started with data analysis since there's no wrong way to go about it.
First, we'll need some data to import into R so we can start playing with it. Fortunately, there is a free data set available on Kaggle containing a list of (nearly) every public holiday around the world.
After downloading and unzipping the archive, we can begin our R script.
library("tidyverse")
holidays <- read_csv("holiday_calendar.csv")
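Before going further, it can help to confirm which columns came through the import. The sketch below is only a stand-in: it writes three illustrative rows to a temporary file so it runs without the Kaggle download. The column names (Country, Holiday Name, Date) are the ones used in the code later in this post; the rows themselves and the toy_ names are assumptions, not part of the real file.

```r
library(readr)

# Toy stand-in for holiday_calendar.csv: three illustrative rows written
# to a temporary file (NOT the real Kaggle data).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Holiday Name,Date",
  "sudan,Independence Day,2022-01-01",
  "cuba,Labour Day,2022-05-01",
  "india,Diwali,2022-10-24"
), toy_path)

toy_holidays <- read_csv(toy_path, show_col_types = FALSE)

# A column name containing a space needs backticks in later dplyr code
names(toy_holidays)
```

With the real file, running names(holidays) right after read_csv() does the same check.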
The most holidays by country
Next, we can do some simple grouping to get started creating tables. We'll start with something basic. For example, which countries have the most and the fewest holidays?
holidays_total <- holidays %>%
  group_by(Country) %>%
  summarize(Total = n()) %>%
  arrange(Total)
> head(holidays_total)
# A tibble: 6 x 2
  Country               Total
  <chr>                 <int>
1 sudan                     8
2 cape-verde                9
3 dr-congo                  9
4 sao-tome-and-principe     9
5 bermuda                  10
6 cuba                     10
> tail(holidays_total)
# A tibble: 6 x 2
  Country   Total
  <chr>     <int>
1 thailand     31
2 canada       35
3 australia    36
4 malaysia     51
5 spain        55
6 india       243
And with that, we've already got some answers: Sudan has the fewest holidays (according to officeholidays.com, where this data was sourced) at 8, with Cape Verde, Democratic Republic of the Congo, and Sao Tome and Principe right behind with 9 each. But that prompts another question: why so few holidays in these African countries? Is it a flaw in the data set? Africa is frequently undersampled, so that's quite possible. However, these countries also have very new governments. Perhaps the older a country is, the more likely it is to have more holidays? That would be a good question for an extended analysis.
At the other end of the scale, India has a whopping 243 public holidays, more than four times as many as next-highest Spain (55), which makes it incredibly difficult to graph with an evenly spaced gradient (trust me, I tried). Why so many? In fact, that's low compared to historical levels - India used to have a holiday literally every day! Now, many of those holidays are celebrated only regionally, or have been discarded in favor of spending more time working.
Conveniently, the above code also lets us know how many countries total are in the set: 222. However, there are only 195 countries in the world. What gives? A closer look at the data turns up "countries" like Puerto Rico and American Samoa, territories of the United States. It might be more accurate to say "region" rather than "country," but since this analysis is just for some holiday fun we don't need to worry too much about perfect accuracy.
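The count of countries is simply the number of rows in the grouped table. Here is a minimal sketch of two ways to get it, using a few made-up rows in place of the real data (the toy_ names and values are assumptions for illustration only):

```r
library(dplyr)

# Toy stand-in for the holidays data (made-up rows, NOT the Kaggle file)
toy_holidays <- tibble(
  Country = c("sudan", "sudan", "cuba", "puerto-rico"),
  `Holiday Name` = c("Independence Day", "Christmas Day",
                     "Labour Day", "Christmas Day")
)

# Same idea as group_by() + summarize(Total = n())
toy_totals <- toy_holidays %>% count(Country, name = "Total")

# One row per country/region in the summary table...
n_from_summary <- nrow(toy_totals)

# ...or count the distinct values straight from the raw data
n_from_raw <- n_distinct(toy_holidays$Country)
```

On the real data, nrow(holidays_total) should give the 222 quoted above.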
The most popular holidays worldwide
Okay, that's well and good, but which holiday is celebrated in the most countries/regions? As you might guess, we just need to tweak our previous code like so:
holidays_total <- holidays %>%
  group_by(`Holiday Name`) %>%
  summarize(Total = n()) %>%
  arrange(desc(Total))
> head(holidays_total)
# A tibble: 6 x 2
  `Holiday Name`   Total
  <chr>            <int>
1 New Year's Day     199
2 Christmas Day      172
3 Labour Day         141
4 Independence Day   123
5 Good Friday        116
6 Easter Monday      107
Three of those holidays are Christian holidays. In a deeper analysis (with a much more specialized data set) it would be interesting to see when those holidays were adopted around the world. Also interesting is the popularity of Christmas (celebrated by 77% of countries) compared with Good Friday/Easter (52%). As Easter is usually considered the more important holiday from a religious perspective, we can surmise that the countries observing Christmas but not Good Friday, about a quarter of all countries and roughly a third of those celebrating Christmas, are celebrating it largely as a secular holiday. Capitalism, the true colonizer!
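For anyone who wants to check the percentages, here is the arithmetic as a short sketch. The counts come from the tables in this post; treating each count as "one row per country" is an assumption, since a holiday could appear more than once for the same country.

```r
# Counts taken from the tables in this post
n_regions   <- 222   # countries/regions in the data set
christmas   <- 172   # rows named "Christmas Day"
good_friday <- 116   # rows named "Good Friday"

christmas_share   <- christmas / n_regions
good_friday_share <- good_friday / n_regions

# Gap measured in percentage points of ALL countries/regions
gap_points <- (christmas_share - good_friday_share) * 100

# Gap measured as a share of the Christmas-observing countries only
gap_within_christmas <- (christmas - good_friday) / christmas

round(c(christmas_share, good_friday_share, gap_within_christmas), 2)
```

The two gap figures answer different questions, which is why the percentage-point gap and the share of Christmas-celebrating countries don't match.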
But wait, what about those other holidays? Independence Day? That can't be right, can it? Only the United States celebrates July 4th, right? What's happening here is that many, many countries celebrate a day of independence, but the day differs by country.
# The column is called `Holiday Name` (see the table above), so it needs backticks
holidays %>%
  filter(`Holiday Name` == "Independence Day") %>%
  arrange(Date) %>%
  group_by(Date) %>%
  summarize(Total = n())
# A tibble: 98 x 2
   Date       Total
   <date>     <int>
 1 2022-01-01     2
 2 2022-01-04     1
 3 2022-02-07     1
 4 2022-02-17     1
 5 2022-02-18     1
 6 2022-02-22     1
 7 2022-02-24     1
 8 2022-02-27     1
 9 2022-03-01     1
10 2022-03-06     1
# ... with 88 more rows
It turns out Independence Day is celebrated on 98 different days of the year. Pick any day at random and you've got a better than one in four chance (98 out of 365) that someone is celebrating the anniversary of independence from someone else. The most popular date is September 15, with El Salvador, Guatemala, Honduras, and Nicaragua all celebrating. A quick search turns up that it's the anniversary of the Act of Independence of Central America. Costa Rica normally celebrates that day as well, but this particular year it moved the holiday to the 19th, a Monday.
Labor Day, however, isn't nearly so spread out. The vast majority of the world celebrates the holiday on May Day (May 1st). The United States is an outlier, celebrating on a Monday in September. Once again, a little research into why turns up something interesting: the US date was chosen because President Cleveland was concerned that celebrating in May would strengthen socialist movements.
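To surface the most popular date rather than the earliest ones, the table can be sorted by count instead of by date. A minimal sketch, using only the five Central American rows mentioned in the text (the country spellings are guesses at the data set's naming style, and the rest of the data is left out):

```r
library(dplyr)

# Toy rows based on the dates given in the text; country spellings are
# assumed, and this is NOT the full data set.
toy_independence <- tibble(
  Country = c("el-salvador", "guatemala", "honduras",
              "nicaragua", "costa-rica"),
  Date = as.Date(c("2022-09-15", "2022-09-15", "2022-09-15",
                   "2022-09-15", "2022-09-19"))
)

# count() groups and tallies in one step; sort = TRUE puts the busiest date first
by_date <- toy_independence %>%
  count(Date, name = "Total", sort = TRUE)

# Chance that a randomly chosen day is somebody's Independence Day
chance <- 98 / 365
```

On the real data, the same count() call after the filter() step would put September 15 at the top.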
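The same kind of check works for Labor Day. The sketch below measures what share of Labor Day rows fall on May 1, using four illustrative rows. The country spellings are assumptions, and so is the idea that the data set might spell the US holiday "Labor Day" rather than "Labour Day", which is why the filter accepts both.

```r
library(dplyr)

# Illustrative rows only (NOT the real data); spellings are assumptions
toy_labour <- tibble(
  Country = c("germany", "france", "brazil", "united-states"),
  `Holiday Name` = c("Labour Day", "Labour Day", "Labour Day", "Labor Day"),
  Date = as.Date(c("2022-05-01", "2022-05-01", "2022-05-01", "2022-09-05"))
)

labour_rows <- toy_labour %>%
  filter(`Holiday Name` %in% c("Labour Day", "Labor Day"))

# Share of Labor Day rows that fall on May Day
may_day_share <- mean(labour_rows$Date == as.Date("2022-05-01"))

# The rows that fall on any other date
outliers <- labour_rows %>% filter(Date != as.Date("2022-05-01"))
```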
Try it yourself
There are still plenty of insights to mine from such a seemingly innocuous data set, but further tangents will have to wait. For now, why not try exploring the data yourself? You only need the linked data set and RStudio, both available for free. If you find yourself wanting to do more, naturally we recommend Murach's R for Data Analysis, just published this month and perfect for coding newbies and beginning analysts. ;)
By the way, remember me mentioning that India's 243 holidays made it hard to plot? I solved that problem by binning the data into groups. It means the code isn't quite as simple and elegant as it could be, but it does what it's supposed to do and I consider that absolutely something to celebrate.
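# Count holidays per country; rename the key column to match map_data()
holidays_total <- holidays %>%
  group_by(Country) %>%
  summarize(Total = n()) %>%
  arrange(Total)
holidays_bins <- holidays_total %>% rename(region = Country)

# Bin the totals so India's 243 doesn't swamp the color scale.
# cut() intervals are right-closed: (0,1], (1,15], ..., (55,250]
holidays_bins <- mutate(holidays_bins,
  Bin = cut(Total, breaks = c(0,1,15,20,25,30,35,40,45,50,55,250),
            labels = c("NA", "1-15", "16-20", "21-25", "26-30","31-35",
                       "36-40", "41-45", "46-50", "51-55", "> 55")))

# World polygons (map_data() needs the maps package installed).
# Lowercase and hyphenate region names to match the data set's style,
# e.g. "cape-verde"; some names may still not line up.
worldmap <- map_data("world") %>%
  mutate(region = str_replace_all(str_to_lower(region), " ", "-"))
# left_join keeps every polygon and adds no rows without coordinates
worldmap <- left_join(worldmap, holidays_bins, by = "region")

ggplot() +
  geom_polygon(data = worldmap,
               mapping = aes(x = long, y = lat, group = group, fill = Bin),
               color = "black") +
  scale_fill_brewer(palette = "YlOrRd") +
  theme_void() +
  labs(title = "# of Holidays Per Country") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Holidays")) +
  coord_fixed()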
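If the binning step looks opaque, here is a small sketch of what cut() does with the same breaks and labels, applied to a handful of sample totals (the sample values are picked for illustration). By default each interval excludes its lower bound and includes its upper bound, so a total of 15 lands in "1-15" and 16 lands in "16-20". Note that "NA" here is just a text label for the lowest bin, not a missing value.

```r
# Sample totals chosen to sit on or near the bin edges
totals <- c(1, 8, 15, 16, 55, 243)

bins <- cut(totals,
            breaks = c(0, 1, 15, 20, 25, 30, 35, 40, 45, 50, 55, 250),
            labels = c("NA", "1-15", "16-20", "21-25", "26-30", "31-35",
                       "36-40", "41-45", "46-50", "51-55", "> 55"))

# Show each total next to the bin it was assigned
binned <- data.frame(Total = totals, Bin = bins)
binned
```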