A Century of Natural Disasters: Trends, Impact, and Response


Analysis of global disaster records from 1900 to 2021. Droughts and epidemics, less frequent than floods and storms, killed five times as many people.

Python · Pandas · Matplotlib · Seaborn · Statistical Analysis

I was six years old in July 2005 when monsoon rains submerged parts of Mumbai under a meter of water in a single afternoon. The image of an entire city stalled, transit halted and water lapping at second-floor windows, stayed with me long enough to make natural disaster data feel personal when I picked it up two decades later. This case study is what I produced once I had the technical skills to ask the questions that childhood memory had planted.

The dataset is the EM-DAT International Disaster Database, covering 121 years of recorded natural disasters from 1900 to 2021. The questions I came in with were the same ones a policymaker would ask: where do disasters concentrate, how has frequency changed, what kills the most people, what costs the most money, and where is the relationship between cause and response weakest.

Why this dataset is useful and where it is biased

EM-DAT is the most complete cross-country record available for natural disasters but it has structural biases worth naming up front. Reporting completeness has improved dramatically since the 1980s, which means the apparent rise in disaster frequency over the century is partly real and partly an artifact of better coverage. Smaller events in lower-income countries are systematically under-recorded. Casualty figures for events in mid-century are reconstructed from contemporary newspaper accounts and official statistics that vary in reliability.

I treated the dataset as authoritative for relative comparisons (epidemics versus floods, Asia versus Africa) but cautious for absolute trend claims (the 1900 baseline is not directly comparable to the 2020 baseline).

Frequency and severity, separated

The most useful single insight from the data is that frequency and severity are weakly correlated. Floods and storms dominate frequency: hundreds of recorded events per decade in recent years, with the volume curve bending upward over time. Floods alone account for over 6 million recorded deaths across the century, the largest single contribution from any high-frequency category.
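Separating those two questions is a single grouped aggregation. The sketch below uses a few invented rows with hypothetical column names (`disaster_type`, `total_deaths`); the real EM-DAT extract has different field names and far more records, but the shape of the computation is the same.

```python
import pandas as pd

# Illustrative rows only; real EM-DAT columns and values differ.
events = pd.DataFrame({
    "disaster_type": ["Flood", "Flood", "Flood", "Storm", "Drought", "Epidemic"],
    "total_deaths": [2000, 500, 1200, 800, 1_500_000, 2_000_000],
})

# Two separate questions: how often does each type occur,
# and how many people has it killed in total?
summary = events.groupby("disaster_type").agg(
    n_events=("total_deaths", "size"),
    deaths=("total_deaths", "sum"),
)
print(summary.sort_values("deaths", ascending=False))
```

Even in this toy table the pattern the post describes appears: floods dominate the event count while the single epidemic row dominates the death toll.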

But droughts and epidemics, despite being far less frequent, have killed more people in absolute terms. The combined toll from droughts and epidemics over the recorded period exceeds 20 million people. Two large mid-century episodes drive most of that figure: famines associated with drought and the 1917 to 1920 influenza epidemic, recorded in the dataset under the Soviet Union entry.

The implication for disaster preparedness is unintuitive. Public attention and emergency response capacity gravitate toward the high-frequency, high-visibility events. The events that drive the largest total mortality are slower, less televisable, and harder to mobilize sustained resources against.

Geographic concentration

Asia (specifically Southern, South-Eastern, and Eastern Asia) has recorded over 3,000 disaster events across the period, more than any other region. The Americas (North and South combined) and Africa (Eastern and Western regions in particular) round out the high-frequency areas.

Several factors compound in those regions: high population density on flood-prone river deltas and coastal plains, geological exposure to seismic activity along Pacific Rim and Himalayan fault systems, monsoonal climate systems that produce both floods and droughts, and economic constraints that reduce resilience and recovery capacity.

This geographic concentration has clear implications for international disaster preparedness funding. The events that produce the largest aggregate human cost are concentrated in regions that have historically received a smaller share of international risk-reduction investment. Closing that gap is partly a moral question and partly a question of where the marginal preparedness dollar produces the largest return.

Specific events that anchor the picture

A few historical events are large enough to skew any aggregate analysis. Naming them helps interpret the rest of the data.

The 1931 China floods, with mortality estimates ranging from 1 to 4 million depending on source, remain the single deadliest recorded natural disaster of the modern era. The 1917 to 1920 influenza pandemic, recorded in the EM-DAT entry I cited above, dwarfs everything else by total deaths. The 2004 Indian Ocean tsunami killed approximately 230,000 across 14 countries in a single morning. The 1995 Kobe earthquake in Japan caused approximately $100 billion USD in damage. The 2005 hurricane season in the United States, dominated by Hurricane Katrina, caused about $125 billion. The 2011 Tohoku earthquake and tsunami in Japan caused approximately $210 billion in direct damage and triggered the Fukushima nuclear accident, which is recorded separately.

These events are statistical outliers, but they are not noise to be trimmed away. They are reminders that the right scale of analysis for disaster planning is not the typical event but the rare, severe one that the policy environment rarely prepares for.

What economic damage data shows

The economic damage record is shorter and more biased than the human-impact record. Damage figures are systematically larger for events in wealthy countries because the economic stock at risk is larger. A flood that destroys 1,000 houses in Tokyo registers as more expensive than a flood that destroys 1,000 houses in Bangladesh, but the human displacement is comparable.

Adjusting for this bias is non-trivial in the EM-DAT data. The cleanest comparison is within-region year-over-year change, which controls for the level effect even if it does not capture absolute severity well.
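A minimal sketch of that within-region comparison, on invented damage figures (the real EM-DAT damage field is sparser and reported in nominal dollars): grouping by region before taking the percentage change means Tokyo and Dhaka are compared on growth, not on absolute stock at risk.

```python
import pandas as pd

# Toy damage table; real EM-DAT damages are sparser and nominal USD.
damage = pd.DataFrame({
    "region": ["Asia", "Asia", "Asia", "Americas", "Americas", "Americas"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "damage_usd": [10.0, 12.0, 9.0, 50.0, 55.0, 66.0],
})

# Year-over-year change computed within each region controls for the
# level effect: each region is its own baseline.
damage = damage.sort_values(["region", "year"])
damage["yoy_change"] = damage.groupby("region")["damage_usd"].pct_change()
print(damage)
```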

What the data does support is the observation that economic damage from weather-related events has grown faster than disaster frequency, a pattern consistent with both increasing wealth concentration in vulnerable areas and increasing storm severity in some regions.

Methods

The technical work used Python with Pandas for cleaning, Matplotlib and Seaborn for static visualizations, and a small set of derived tables for the temporal and geographic breakdowns. The notebook in the repository walks through the steps. Most of the analytical work was data preparation: harmonizing event-type taxonomies, geocoding country-level entries to regional groupings, and reconciling the dataset's mid-century reporting gaps. EM-DAT's country-name field is also not as clean as you would hope. "USSR", "Soviet Union", and "Russia" appear in different rows in the same decade, and joining them onto modern geographies takes a manual mapping I would happily share with anyone else trying to use the data.
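The country-name fix reduces to a manual alias table applied before any join. The fragment below is a hypothetical excerpt, not the full mapping from the repository; the real table is much longer and the modern-geography choices involve judgment calls.

```python
import pandas as pd

# Hypothetical excerpt of the manual historical-name mapping;
# the full table is longer and its choices are debatable.
COUNTRY_MAP = {
    "USSR": "Russia",
    "Soviet Union": "Russia",
    "Zaire": "DR Congo",
    "Burma": "Myanmar",
}

def harmonize_country(name: str) -> str:
    """Map a historical EM-DAT country label to a modern geography."""
    return COUNTRY_MAP.get(name, name)

raw = pd.Series(["USSR", "Soviet Union", "Russia", "India"])
print(raw.map(harmonize_country).value_counts())
```

Applying the map before joining onto modern geographies keeps the three Soviet-era labels from silently splitting one country's record across three rows.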

The visualization choices were deliberately simple. Bar charts for absolute counts, line charts for time series, choropleths for geographic distribution. The story was strong enough that I did not need to embellish it with chart sophistication.

What I got wrong on the first pass

The first version of the mortality chart plotted total deaths by event type without accounting for the two enormous mid-century outliers (1931 China floods, 1917-1920 influenza). The chart looked dramatic and was misleading. Plotting median per-event mortality on a log axis, alongside the absolute totals, told a more honest story: floods kill many people across many events, while a small number of catastrophic epidemics and droughts drag the mean upward. Both views are useful. Showing only one was a mistake I would catch today but did not at the time.
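The two views come from the same grouped column, one aggregated with `sum` and one with `median`. A sketch on synthetic data (many moderate floods, one catastrophic epidemic; the plotting step with the log axis is omitted here):

```python
import numpy as np
import pandas as pd

# Synthetic per-event deaths: 50 moderate floods plus 3 epidemics,
# one of them catastrophic. Not real EM-DAT values.
rng = np.random.default_rng(0)
deaths = pd.DataFrame({
    "disaster_type": ["Flood"] * 50 + ["Epidemic"] * 3,
    "total_deaths": list(rng.integers(100, 5000, 50)) + [1000, 2000, 2_500_000],
})

# View 1: absolute totals. View 2: the typical (median) event.
totals = deaths.groupby("disaster_type")["total_deaths"].sum()
medians = deaths.groupby("disaster_type")["total_deaths"].median()
print(totals, medians, sep="\n")
```

The totals view is dominated by the one catastrophic epidemic; the median view shows what a typical event of each type looks like. Reporting only one of the two hides half the story.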

I also nearly miscounted regional events because EM-DAT records "Sub-region" and "Region" inconsistently. A flood in Bangladesh might be tagged "Southern Asia" in one row and "South Asia" in another. The harmonization step that resolved that was not glamorous and added a couple of evenings to the project. It also turned the geographic comparisons from "interesting if you trust the labels" into "actually defensible."
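That step is another alias table, this time over sub-region labels. The aliases below are hypothetical examples of the kind of variants involved, not the project's actual list:

```python
import pandas as pd

# Hypothetical label variants; the real extract mixes spellings like
# "Southern Asia" and "South Asia" across rows.
SUBREGION_ALIASES = {
    "South Asia": "Southern Asia",
    "South-East Asia": "South-Eastern Asia",
    "East Asia": "Eastern Asia",
}

rows = pd.DataFrame({
    "country": ["Bangladesh", "Bangladesh", "Vietnam"],
    "subregion": ["Southern Asia", "South Asia", "South-East Asia"],
})
rows["subregion"] = rows["subregion"].replace(SUBREGION_ALIASES)

# After harmonization, both Bangladesh floods count toward one sub-region.
print(rows["subregion"].value_counts())
```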

The full interactive dashboard is embedded below. Filter by region, decade, or event type to explore the patterns directly.

A century of natural disasters (1900–2021). Interactive Tableau Public dashboard.

What I take away

Three observations from this analysis still inform how I read disaster reporting today.

Frequency is not severity. A category of event can be common and small or rare and catastrophic. Conflating those two questions produces poor policy. The drought-and-epidemic underinvestment is the clearest example in the dataset.

Reporting bias is the largest unfixed problem. A non-trivial share of historical disaster impact has never been fully recorded, particularly in regions and decades that lacked institutional capacity for reporting. Quantitative analysis of disaster trends has to acknowledge that.

The countries that most need preparedness investment are not the countries that produce the largest economic damage figures. Aligning international response funding to human impact rather than damage figures is a research question I have continued to find interesting outside of this project.

The methodological habit I carried out of this work was to always look for the disaggregated view before reporting an aggregate. The headline number is rarely the story. The within-group structure is. That habit is the same one I use now when reading clinical operations data at metricHEALTH, where a smooth monthly KPI almost always hides a cohort or a workflow that is failing.

GitHub: https://github.com/TirtheshJani/Case_Study_Natural_Disasters_1900_2021