This is a group project assignment at Refocus Digital Academy.
Published on April 04, 2023 by Ali
Data Analyst informative NHTSA SQL
Imagine yourself as a data analyst for PT Refocus Consultant. Refocus Consultant is a consultancy firm that handles various data analyses, such as business, environmental, health, and government data. Refocus Consultant is also often asked to provide opinions to assist the government in making policies. Refocus Consultant helps many clients obtain important information from the data provided by the clients.
You and your team in the data analytics department have a great opportunity to help the National Highway Traffic Safety Administration (NHTSA). NHTSA is a United States federal agency, part of the Department of Transportation, that focuses on reducing the number of traffic accidents on highways.
Currently, NHTSA is working on new regulations that will be implemented next year. Refocus Consultant is one of the companies asked to analyze the data collected during 2021, a complete record of the accidents that occurred that year.
The main objective of your study is to provide a number of recommendations on how to reduce the number of accidents on highways. To do this, you first need to identify the following data:
Reliable — The data has been made available by Refocus specifically for this project
Original — Subject to the terms and conditions of the Data License Agreement by NHTSA.
Comprehensive — Data dictionary has been provided through this link.
Current — The latest data is 2022.
Cited — All the data needed for this project can be accessed through this link.
We cleaned the data in PostgreSQL, using pgAdmin 4 as the client.
“Elimination & validation of data to other.”
This statement means that certain values in specific columns of a dataset will be eliminated or validated as “other”. The specific columns and values are:
Any rows with these values in the specified columns will either be eliminated from the dataset or validated as “other”. This is done to ensure the accuracy and reliability of the data analysis.
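As a rough illustration of this step, the sketch below shows both options, recoding a value as 'Other' and eliminating a row. It is not the query we actually ran: the table name `crash`, the column names, and the placeholder values are all hypothetical, and it uses Python's built-in SQLite instead of PostgreSQL so that it can run anywhere.

```python
import sqlite3

# Hypothetical table, columns, and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crash ("
    "id INTEGER PRIMARY KEY, atmospheric_conditions TEXT, light_condition TEXT)"
)
conn.executemany(
    "INSERT INTO crash (atmospheric_conditions, light_condition) VALUES (?, ?)",
    [
        ("Clear", "Daylight"),
        ("Not Reported", "Daylight"),
        ("Rain", "Unknown"),
        ("Reported as Unknown", "Dark - Lighted"),
    ],
)

# Option 1: recode placeholder values as 'Other' so the row is kept.
conn.execute(
    "UPDATE crash SET atmospheric_conditions = 'Other' "
    "WHERE atmospheric_conditions IN ('Not Reported', 'Reported as Unknown')"
)

# Option 2: eliminate rows whose value cannot be used at all.
conn.execute("DELETE FROM crash WHERE light_condition = 'Unknown'")

remaining = conn.execute(
    "SELECT atmospheric_conditions, light_condition FROM crash ORDER BY id"
).fetchall()
print(remaining)
```

The same `UPDATE` and `DELETE` statements should work unchanged in PostgreSQL, since they use only standard SQL.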
Convert the crash timestamps to Central Standard Time (CST)
CST stands for Central Standard Time, which is a time zone used in some parts of the United States, Canada, and Mexico. The reason for changing the time zone to CST may be to ensure consistency and accuracy in reporting the time of the crash across different states and jurisdictions.
When dealing with data from multiple states or regions, it is important to have a consistent time zone for reporting and analysis. This is especially true when dealing with time-sensitive data, such as traffic accidents, where the precise timing of events can be critical in understanding the cause and severity of the crash. Therefore, by converting the timestamps to CST, all the data will be standardized and easier to compare and analyze.
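To make the idea concrete, here is a minimal sketch of normalizing local timestamps to CST. It assumes each record comes with a known source time zone and, for simplicity, uses fixed standard-time offsets that ignore daylight saving time. The function name `to_cst` and the offset table are ours, introduced only for this example. In PostgreSQL, a similar conversion would typically rely on the `AT TIME ZONE` construct.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for standard time; daylight saving is ignored (assumption).
STANDARD_OFFSETS = {"EST": -5, "CST": -6, "MST": -7, "PST": -8}
CST = timezone(timedelta(hours=STANDARD_OFFSETS["CST"]), "CST")


def to_cst(local_time: str, zone: str) -> datetime:
    """Interpret a 'YYYY-MM-DD HH:MM' string in `zone` and return it in CST."""
    source_tz = timezone(timedelta(hours=STANDARD_OFFSETS[zone]), zone)
    naive = datetime.strptime(local_time, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=source_tz).astimezone(CST)


# A crash recorded late at night in Pacific time falls on the next day in CST.
converted = to_cst("2021-01-15 23:30", "PST")
print(converted.strftime("%Y-%m-%d %H:%M %Z"))
```

Note that the calendar date can change during conversion, which matters for the by-day and by-date counts later in the analysis.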
The number of accidents in the United States in 2021 was high: 35,414 accidents involving 85,311 people. This suggests that the existing regulations and traffic signs are still ineffective or not matched to the factors behind the accidents that occur. Therefore, an analysis of accident data from the previous year is needed to create more appropriate regulations.
The objective is to analyze the 2021 accident data so that the results can inform new regulations with the potential to reduce the number of accidents in the future.
The analysis aims to identify the factors associated with high accident counts. The first variables we analyzed are atmospheric conditions and light conditions. Under ‘Clear’ atmospheric conditions the number of accidents is high, and the figure differs significantly from other conditions such as ‘Cloudy’, ‘Rain’, ‘Fog’, ‘Snow’, and others. This is contrary to what we expected. One possible explanation is that some states have longer periods of sunlight, so more driving takes place in clear conditions. Combined with the time-of-day and day-of-week results, most accidents occur during the day, when drivers tend to be less wary yet in a hurry, which may explain the very high number of single-vehicle accidents compared to other collision types. In terms of light conditions, ‘Daylight’ has the highest number of accidents, followed by ‘Dark-lighted’. This is also contrary to what we assumed. A likely reason is that more people drive during daylight, so accidents are more frequent then.
Number of accidents based on lighting conditions.
Number of accidents based on atmospheric conditions.
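Counts like these come down to a `GROUP BY` over the relevant column. The sketch below shows the general shape of such a query on a handful of made-up rows, not the real NHTSA data. The table name `crash`, the column names, and the helper `count_by` are assumptions for illustration, and SQLite stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crash (atmospheric_conditions TEXT, light_condition TEXT)")
# Toy rows for illustration only; these are not actual accident records.
conn.executemany(
    "INSERT INTO crash VALUES (?, ?)",
    [
        ("Clear", "Daylight"),
        ("Clear", "Daylight"),
        ("Clear", "Dark - Lighted"),
        ("Rain", "Daylight"),
        ("Cloudy", "Dark - Lighted"),
    ],
)

ALLOWED_COLUMNS = {"atmospheric_conditions", "light_condition"}


def count_by(column: str) -> list:
    """Return (value, count) pairs for one column, most frequent first."""
    # The column name is interpolated into SQL, so restrict it to a whitelist.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    query = (
        f"SELECT {column}, COUNT(*) AS n FROM crash "
        f"GROUP BY {column} ORDER BY n DESC, {column}"
    )
    return conn.execute(query).fetchall()


by_weather = count_by("atmospheric_conditions")
by_light = count_by("light_condition")
print(by_weather)
print(by_light)
```

Raw counts like these do not account for how much driving happens under each condition, which is why the text above treats exposure (more people driving in daylight and clear weather) as the likely explanation.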
Next, we analyzed the variables of intersection type (road type) and manner of collision. After data processing, it can be seen that straight roads, recorded as ‘Not an intersection’, have a significantly higher number of accidents than other road types. A possible explanation is that drivers get bored and lose focus on straight roads, lose control of the vehicle, and crash. From the manner of collision, we found that the most common type of accident is a single-vehicle accident.
Number of accidents based on road type.
Number of accidents based on collision type.
Furthermore, we looked for the states with the most accidents. Based on the graph we created, the top three states are Texas, California, and Florida. Each of these states had around 3,000 accident cases, while other states had only around 1,000. These are the three most populous states in the USA, with California the most populous, followed by Texas and Florida. Texas nevertheless ranks first in number of accidents, which may be due to a large number of its drivers not observing the posted speed limits.
Top states with the highest accident rates.
The next finding is based on the time of the accidents. Accidents occur most often between 10 am and 4 pm. During these hours, many activities are taking place, such as work, school, and travel, and this is also the hottest part of the day, when heat can cause fatigue. The second highest number of accidents occurs between 7 am and 9 am and between 5 pm and 6 pm.
Number of accidents based on the time of the accidents.
In addition, we analyzed the number of accidents that may have been caused by drunk driving. We found that only 25% of all accidents that occurred in 2021 involved drunk drivers.
Percentage of accidents with and without drunk drivers.
We also found that accidents occur more frequently in urban areas, which accounted for 38.8%, than in rural areas, which accounted for 6.1%. This is consistent with urban areas having more activity around schools, workplaces, cafes, malls, and similar destinations.
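A share like this can be computed with a conditional sum divided by the total row count. The following is a minimal sketch on made-up rows. The column name `number_of_drunk_drivers` is an assumption about how the dataset records this, and SQLite stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crash (id INTEGER PRIMARY KEY, number_of_drunk_drivers INTEGER)"
)
# Toy rows for illustration only: one of four crashes involves a drunk driver.
conn.executemany(
    "INSERT INTO crash (number_of_drunk_drivers) VALUES (?)",
    [(0,), (1,), (0,), (0,)],
)

# Multiply by 100.0 (a float) so the division is not truncated to an integer.
(drunk_pct,) = conn.execute(
    """
    SELECT 100.0 * SUM(CASE WHEN number_of_drunk_drivers > 0 THEN 1 ELSE 0 END)
           / COUNT(*)
    FROM crash
    """
).fetchone()
print(f"{drunk_pct:.1f}% of crashes involved at least one drunk driver")
```

The `CASE` expression is standard SQL, so the query should carry over to PostgreSQL as written.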
Percentage of accidents in urban and rural areas.
Lastly, we analyzed accidents by day of the week, date of the month, and month. By day, accidents occur most frequently on Mondays, with 6,111 accidents, followed by Sundays, with 5,862. Monday is when routines such as school, work, and grocery shopping resume after the weekend break. Sunday is the day of preparation for those routines, when, for example, many people travel back to where they work. Sunday is also a day off for most people, so more people travel by car, which increases traffic and the likelihood of accidents. By date, the highest number of accidents occurred on the 19th and at the beginning of the month, while the number at the end of the month was relatively small. By month, accidents occur more frequently around the summer season.
Number of accidents by day of the week.
Number of accidents by date of occurrence.
Number of accidents by month of occurrence.
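For the time-based breakdowns, the timestamp first has to be split into its parts before counting. The sketch below shows one way to do that on a few made-up timestamps. The column name `timestamp_of_crash` is an assumption, and the example uses SQLite's `strftime` function, where PostgreSQL would more commonly use `EXTRACT` or `to_char`.

```python
import sqlite3

# SQLite's strftime('%w') returns 0 for Sunday through 6 for Saturday.
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crash (timestamp_of_crash TEXT)")
# Toy rows for illustration only; these are not actual accident records.
conn.executemany(
    "INSERT INTO crash VALUES (?)",
    [
        ("2021-07-04 14:10:00",),  # Sunday
        ("2021-07-05 08:30:00",),  # Monday
        ("2021-07-12 17:45:00",),  # Monday
        ("2021-01-06 11:00:00",),  # Wednesday
    ],
)

# Count crashes per day of the week, most frequent first.
rows = conn.execute(
    """
    SELECT CAST(strftime('%w', timestamp_of_crash) AS INTEGER) AS dow,
           COUNT(*) AS n
    FROM crash
    GROUP BY dow
    ORDER BY n DESC, dow
    """
).fetchall()
by_day = [(DAY_NAMES[dow], n) for dow, n in rows]

# Count crashes per calendar month, in calendar order.
by_month = conn.execute(
    """
    SELECT strftime('%m', timestamp_of_crash) AS month, COUNT(*) AS n
    FROM crash
    GROUP BY month
    ORDER BY month
    """
).fetchall()

print(by_day)
print(by_month)
```

Swapping `'%m'` for `'%d'` or `'%H'` gives the by-date and by-hour counts in the same way.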
The NHTSA should pay more attention to traffic conditions during peak hours when driver activity is high
Encourage drivers to be more cautious
The NHTSA should also conduct further research on the findings we have obtained. Due to limited data, the findings we have generated are also limited. If more detailed findings are desired, we suggest including data such as:
Presentation Project Assignment 1 group 8