Challenge provided by UrbanAnalytica

Patterns and predictive modeling of traffic accidents

Identifying dangerous zones so that they can be improved or patrolled more effectively.

The recent and projected growth of the population living and working in cities will put significant pressure on urban infrastructure, particularly roads. This will increase the probability of traffic accidents, which poses significant challenges for city mobility, transportation systems and, more importantly, human safety.

It is therefore of utmost importance to understand the infrastructural and environmental characteristics of traffic accidents and to predict them. This enables, for example, city emergency services to optimize their response to an emergency call and city managers to plan road traffic with the risk of accidents in mind.


Create an explainable predictive model of traffic accidents at street level by the moment of the day.

United Nations SDG 

GOAL 11: Sustainable Cities and Communities

  • Target 11.2: Provide access to safe, affordable, accessible, and sustainable transport systems for all.


The following datasets were provided to the participants:

  • Traffic collision database from Waterloo, Canada, covering 2005 to 2018. The dataset includes the street of each accident and the environmental and light conditions at the time of impact. It is published as open data by the City of Waterloo.


Most teams used only the provided dataset. One team also used weather data as input to the model. Teams argued that a more precise model could be built with more detailed information (e.g., the hour and severity of each accident), information on the quality of the roads and the locations of road signs, traffic and user behavior data (e.g., the demographics of the parties involved in the crash), and data on other means of transportation (e.g., cyclists and pedestrians).

Methods and Techniques

The teams used a variety of prediction methodologies. Among those that used supervised learning, some framed the problem as a regression task (predicting the number of car accidents on a segment, with a risk factor), while others framed it as a classification task (predicting whether an accident happened, a categorical target derived from the number of accidents, or the location of the accident). The teams also tested a wide range of models.
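The two supervised framings described above differ mainly in how the target is built. The following is a minimal sketch of that step, not any team's actual pipeline. The record layout, the street labels, and the moment-of-day categories are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical collision records as (street, moment_of_day) pairs.
# These values are made up; they are not taken from the Waterloo dataset.
records = [
    ("Street A", "evening"),
    ("Street A", "evening"),
    ("Street A", "morning"),
    ("Street B", "night"),
]
streets = ["Street A", "Street B", "Street C"]
moments = ["morning", "afternoon", "evening", "night"]


def build_targets(records, streets, moments):
    """Return {(street, moment): (count, had_accident)} for every pair.

    `count` is a regression target (number of accidents) and
    `had_accident` is a binary classification target (1 if any accident).
    """
    counts = Counter(records)
    targets = {}
    for street in streets:
        for moment in moments:
            n = counts[(street, moment)]  # Counter returns 0 for unseen pairs
            targets[(street, moment)] = (n, int(n > 0))
    return targets


targets = build_targets(records, streets, moments)
```

Enumerating every street and moment, rather than only the pairs that appear in the records, matters here: pairs with no recorded accident become the zero-count (or negative-class) examples that a model needs to learn from.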

One team compared five models: Random Forest with default and with tuned hyperparameters, Logistic Regression, and Gradient Boosting with default and with tuned hyperparameters. This team picked Logistic Regression for further prediction analysis, reporting that it had higher precision and lower recall. Another team used Random Forest and LASSO, while others used CatBoost and a Neural Network.
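Since precision and recall drove the model choice described above, it may help to recall how the two metrics are computed for a binary target. The sketch below uses made-up labels and two hypothetical sets of predictions (`model_a`, `model_b`); it does not reproduce any team's results.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = accident)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up ground truth and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 0, 0, 0, 1, 0],  # cautious: few positive predictions
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],  # liberal: many positive predictions
}

scores = {
    name: precision_recall(y_true, y_pred)
    for name, y_pred in predictions.items()
}
```

In this toy case the cautious model never raises a false alarm but misses half of the accidents, while the liberal model finds every accident at the cost of some false alarms. Which trade-off is preferable depends on how the predictions will be used.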

One team took an unsupervised approach, using DBSCAN to cluster areas where accidents concentrate. Another team took a time-series approach to predict the number of accidents per day, although it did not develop a street-level model.
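To illustrate the density-based clustering idea mentioned above, here is a simplified pure-Python sketch of DBSCAN. It is not the team's code, and in practice a library implementation would normally be used. The coordinates are hypothetical planar positions in metres, and the `eps` and `min_samples` values are arbitrary assumptions.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # core point: keep growing cluster
                queue.extend(nbrs)
    return labels


# Hypothetical accident positions in metres (made up for illustration).
points = [(0, 0), (10, 0), (0, 10), (10, 10),
          (500, 500), (510, 500), (500, 510),
          (2000, 2000)]
labels = dbscan(points, eps=20, min_samples=3)
```

With these assumed parameters the first four points form one cluster, the next three form a second, and the isolated last point is marked as noise. The neighbor search here is quadratic in the number of points, which is acceptable for a sketch but not for a full city-wide dataset.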

Main Insights from Data

One team extensively analyzed when most accidents occur and found that the number is generally higher during winter and on Fridays (see Figure 1). Most of the accidents happened at or near a private driveway or at a non-intersection location. Some areas of the city also showed a higher concentration of accidents than others.

Figure 1 - Distribution of the number of collisions by weekday and month in the city of Waterloo, Canada.
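A weekday-by-month breakdown like the one in Figure 1 can be obtained by counting collisions per (weekday, month) pair. The sketch below assumes each record carries a calendar date; the dates used are made up and are not taken from the dataset.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up collision dates, for illustration only.
collision_dates = [
    date(2018, 1, 5),
    date(2018, 1, 12),
    date(2018, 1, 8),
    date(2018, 7, 6),
]


def weekday_month_counts(dates):
    """Count collisions per (weekday name, month number) pair."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return Counter((WEEKDAYS[d.weekday()], d.month) for d in dates)


counts = weekday_month_counts(collision_dates)
```

The resulting counter can be laid out as a 7 by 12 grid, with weekdays on one axis and months on the other, to produce a heatmap-style view.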

Another team mapped the accidents, the most dangerous areas of the city (see Figure 2), and the most dangerous roads. Depending on the features used, the biggest contributing factors were the light conditions, the speed limit of the street, and the width of the road.

Figure 2 - Classification of each area of the city of Kitchener by how dangerous it is in terms of car crashes.

Social Impact

The main opportunity for the models proposed by the teams is to help local governments and law enforcement identify dangerous zones so that they can be improved or patrolled more effectively.

Open-source code
