Data To Coup
Military coups have been recorded as far back as 800 BC. They occur when an illegitimate claimant holds power within a state for at least seven days. Although coups have become less frequent in recent times, there have been about 460 coup attempts since 1950, roughly 50% of which were successful. In the past century, coups have been most common in Africa and the Americas. Most of the coup attempts of the past 60 years occurred in the 1960s, at the same time that the majority of African nations gained independence from European powers. In recent times, however, successful coups have been less frequent, either because governments have developed new methods of containing usurpers or because recent coup leaders have been less strategic in claiming the throne. That, however, is a data science question for another project.
Although successful coups are more prominent, attempted coups are more frequent than one would expect. In the past 16 years alone, there have been 30 attempted coups (seven of them successful), with the majority occurring in Africa. Many of these recent attempts did not gain as much international attention as the one in Turkey on July 15, 2016. That attempt encourages one to wonder whether there are ways to predict a coup attempt (successful or not), given the social and political climate in a state.
Considering the social and political instability of recent times, I want to predict the likelihood of a coup d'etat within a country, given specific economic, political and social factors:
- Economic factors: income per capita, growth rate
- Social factors: accounts of ethnic violence; religious, linguistic, and ethnic fractionalization; fatalities from violence; urbanization
- Political factors: polity score, length of stay for incumbent head of state
Data
I collected data mainly from the Center for Systemic Peace and The World Bank databank. I collected several datasets from both sources, cleaned them, and combined them into one final dataset for my analysis. The final dataset consists of 2,996 observations and 36 variables. It covers 69 countries from 1955 to 2015, although the records for some countries do not go back as far as 1955. The data dictionary for the final dataset is shown below:
Exploratory Data Analysis
The interactive graph below shows coup occurrences across the globe over 60 years. In a given year, light blue represents no coup and dark blue represents a coup occurrence. Countries in grey have no data available. You can use the graph to select any year between 1955 and 2015 and view where coups occurred in that year. From the map, you will notice that coups are most common in Africa, Asia, and South America. There were no instances of a coup in North America during the 60-year period. You will also see a high occurrence of coups in Africa between the late 1960s and the late 1970s. This may be because many African countries gained their independence during that period. Many of them faced political and economic instability a couple of years into their independence. These conditions may have made these countries more susceptible to a coup.
The polity score of a country plays an important role in determining the likelihood of a coup. The polity score measures a country's political regime on a scale from fully autocratic (-10) to fully democratic (+10). The graphs below show the trends between polity score and other factors such as GDP per capita, urban population, and fractionalization.
The first graph shows the relationship between polity score and average GDP per capita. We see a slight U-shaped curve: on average, countries that are neither clearly autocratic nor clearly democratic have lower incomes.
The second graph shows the relationship between polity score and the average number of years the incumbent head of state had been in power right before the coup. The overall result is quite intuitive: the average number of years in power decreases as the scale moves from autocratic to democratic. The exception is that countries in the dataset with a polity score of 10 have an abnormally high average number of years in power. Going back to the dataset, I found that the countries with a score of 10 are Chile, Cyprus, Hungary, and the United Kingdom. The United Kingdom is a monarchy in which Queen Elizabeth II has reigned since 1952. Her years in power raised the average for countries with a polity score of 10. Cyprus also had some heads of state in office for as long as 14 years.
The next graph displays the relationship between polity score and regime durability (i.e., the number of years the most recent regime before the coup lasted). We see another U-shaped curve: on average, countries with an ambiguous regime have lower durability. The fourth graph shows the relationship between polity score and urban population. I don't see a particular trend, but on average democratic countries have a higher urban population rate than autocratic countries.
The final graph shows relationships between polity and ethnic, religious and linguistic fractionalization. There appears to be no trend between polity and fractionalization. You're welcome to move your cursor around the graph to see the actual results for each polity score.
The table below displays the general attributes of data points with a coup versus data points without a coup. From the table, we see that data points with a coup tend to fall under more autocratic regimes than data points without a coup. On average, they also have lower GDP per capita, GDP growth rate, durability, urban population and years in power.
Preparing The Data for The Model
For my model, I did not include time as a determining factor. Each year of each country was treated independently. Before building the model, I wanted to explore the data a little more to see whether the target (coup or no coup) is balanced, i.e., is the number of data points with a coup similar to the number of data points with no coup? As anticipated, I found that "coup" data points were far fewer than "no coup" data points. Of my 2,996 data points, only 180 were "coup" data points. Because I had spent a great deal of time gathering the data and I apparently do not have a life, I decided to keep moving forward. I split the dataset into two: one for "coup" data points and the other for "no coup" data points. I then ran a t-test on the difference in means for each variable between the "coup" dataset and the "no coup" dataset. I found that the means of all variables in the "coup" dataset, except one (religious fractionalization), were statistically significantly different from the means in the "no coup" dataset. This result served as a green light for me to go ahead with my model. Before that, I categorized all countries into their respective regions and converted these regions to dummy variables. In the end, I had a dataset with 36 variables and 2,996 data points.
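To make the balance check and the comparison of means concrete, here is a minimal sketch using only the Python standard library. It is not the code used for this project: the rows are made up for illustration, and the choice of Welch's unequal-variance t-statistic is an assumption, since the post does not say which variant of the t-test was run. For reference, 180 "coup" rows out of 2,996 is roughly 6% of the data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for the difference between two sample means."""
    # statistics.variance returns the sample variance (divides by n - 1).
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical rows of (coup flag, polity score); the values are made up.
rows = [(1, -7), (1, -5), (1, -6), (0, 2), (0, 8), (0, 5), (0, 9), (0, 4)]

# Split into the two groups, as described in the paragraph above.
coup = [score for flag, score in rows if flag == 1]
no_coup = [score for flag, score in rows if flag == 0]

coup_share = len(coup) / len(rows)  # fraction of rows with a coup
t_stat = welch_t(coup, no_coup)     # negative: coup rows are more autocratic here

print(f"coup share: {coup_share:.3f}, t = {t_stat:.2f}")
```

A statistic far from zero suggests the two groups differ on that variable. Turning it into a p-value needs the t distribution, which the standard library does not provide, so that step is left out of this sketch.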
The Models - Ensemble Methods
I split my dataset into a training set (70%) and a test set (30%). I fit Random Forest and AdaBoost classifiers (using the best parameters found by grid search) to the training set and then used each model to predict the likelihood of a coup in the test set. I then compared which model (Random Forest or AdaBoost) was better at predicting a coup, based on the AUC score from its ROC curve. The ROC curve shows each model's true positive rate and false positive rate at every threshold. The dotted diagonal line represents an AUC score of 0.5, which corresponds to random guessing. Both the AdaBoost and Random Forest models are better than random guessing (thank goodness!). However, I want to select the model that is closest to the top left corner of the graph. The Random Forest model has a higher AUC score (0.79) than the AdaBoost model (0.76). The Random Forest model appears to be closer to the corner at some thresholds, but there are other thresholds where the AdaBoost model is closer.
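For readers who want a feel for what the AUC score measures, here is a small sketch in plain Python. It relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen "coup" data point receives a higher predicted probability than a randomly chosen "no coup" data point, with ties counted as one half. The labels and scores are made up, and the function name `roc_auc` is my own; in practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = coup) and predicted probabilities, for illustration only.
labels = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.02, 0.10, 0.35, 0.04, 0.08, 0.01, 0.60, 0.30]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

A model that scores every data point identically ends up at 0.5 under this definition, which is why the diagonal line is the baseline to beat.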

Let's see if these models are equally robust when I include only the nine most important features:

The AUC score of the Random Forest model is not very different from the previous one, while the AUC score of the AdaBoost model improves by five percentage points. This suggests that the Random Forest model is more robust to the choice of features than the AdaBoost model. There is, however, more overlap between the curves of the two models, making it rather difficult to select a better model (even though the AUC of the Random Forest model is higher). A confusion matrix might be helpful in making a decision.


The tables above are the confusion matrices for the Random Forest and AdaBoost models. The predictions for the Random Forest and AdaBoost models are based on probability thresholds of 0.06 and 0.42 respectively. From the results, we can see that the Random Forest model correctly predicts more cases of a coup (true positives) and incorrectly predicts fewer cases of no coup (false negatives) than the AdaBoost model. The Random Forest model, however, also incorrectly predicts a coup more often (false positives). Still, I am willing to accept a model that raises more false alarms if it misses fewer actual coups. For this reason, I would select the Random Forest model.
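The effect of the probability threshold on a confusion matrix can be sketched in a few lines of plain Python. This is an illustration rather than the project's code: the labels and probabilities are made up, the helper name `confusion_counts` is hypothetical, and treating a probability at or above the threshold as a predicted coup is an assumption. The thresholds 0.06 and 0.42 are the ones quoted above, applied here to one set of made-up scores only to show how the counts shift.

```python
def confusion_counts(labels, probabilities, threshold):
    """Count TP, FP, FN, TN when predicting a coup at or above the threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, prob in zip(labels, probabilities):
        predicted = 1 if prob >= threshold else 0  # assumption: >= means "coup"
        if predicted == 1 and actual == 1:
            counts["tp"] += 1  # coup correctly predicted
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1  # false alarm
        elif predicted == 0 and actual == 1:
            counts["fn"] += 1  # missed coup
        else:
            counts["tn"] += 1  # no coup correctly predicted
    return counts


# Made-up labels (1 = coup) and predicted probabilities, for illustration only.
labels = [0, 0, 1, 0, 1, 0, 1, 0]
probabilities = [0.02, 0.10, 0.35, 0.04, 0.08, 0.01, 0.60, 0.30]

low = confusion_counts(labels, probabilities, 0.06)
high = confusion_counts(labels, probabilities, 0.42)
print("threshold 0.06:", low)
print("threshold 0.42:", high)
```

On these toy numbers the lower threshold catches every coup at the cost of more false alarms, while the higher threshold raises no false alarms but misses coups. That is the same trade-off weighed in the paragraph above.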
Future Work:
More work can still be done on this project. The first place to start would be to get more data. I lost a lot of data points because, for some countries, I was unable to get certain variables as far back as 1955. As a result, I had to drop rows with missing values. In the future, I could conduct more research to find the data or, as a last resort, determine a formula that would best estimate the missing values. I would also like to collect information on military size and a transparency index for each data point. I believe these factors may help better predict the likelihood of a military coup.