Cloud Datasets Documentation

Overview

Cloud Datasets is a collection of curated datasets built on Google Cloud and Google BigQuery. The datasets are ready for analytics and AI/ML initiatives without the need for data engineering.


COVID Dataset

The Dataflix COVID dataset is a centralized repository of up-to-date, curated data focused on key tracking metrics and U.S. census data. The dataset is publicly readable and accessible on Google BigQuery, ready for analysis, analytics and machine learning initiatives.


The dataset is built on data from trusted sources such as CSSE at Johns Hopkins University and government agencies. It covers a wide range of metrics, including confirmed cases, new cases, % population, mortality rate and deaths, aggregated at various geographic levels including city, county, state and country. New data is published daily. Our objective is to make structured COVID data available to organizations and individuals to help in the fight against COVID-19.


Data Catalog

Sample Queries

Total confirmed cases and new cases by state

SELECT state_name AS State, SUM(confirmed) AS TotalCases, SUM(confirmed_new) AS NewCases
FROM `covid-assistant.covid.bi_usa_snapshot`
GROUP BY state_name

New cases trend in the U.S.

SELECT date, SUM(confirmed_new) AS NewCases
FROM `covid-assistant.covid.bi_usa_daily_trends`
GROUP BY date
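
Daily counts tend to be noisy, so a smoothed series is often easier to read. The following is a minimal sketch of a 7-day moving average, assuming the same `date` and `confirmed_new` columns used in the trend query above. The alias `NewCases7DayAvg` is illustrative and not part of the dataset.

```sql
-- Sketch: 7-day moving average of new U.S. cases.
-- Assumption: every calendar day is present in the table. The window counts
-- rows, not calendar days, so gaps in the dates would stretch the window.
WITH daily AS (
  SELECT date, SUM(confirmed_new) AS NewCases
  FROM `covid-assistant.covid.bi_usa_daily_trends`
  GROUP BY date
)
SELECT
  date,
  NewCases,
  AVG(NewCases) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS NewCases7DayAvg
FROM daily
ORDER BY date
```

For the first six dates the average is taken over fewer than seven rows, so those values are less smoothed than the rest of the series.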


Traffic & Safety Dataset

The Traffic & Safety dataset is a curated, high-demand automotive dataset that makes it easy to discover deep insights into vehicle safety, driver behavior and competitors. The dataset contains historical data from trusted sources such as the National Highway Traffic Safety Administration (NHTSA), the National Center for Statistics and Analysis (NCSA), and the Bureau of Economic Analysis (BEA).


The Traffic & Safety dataset supports a wide range of analyses, including design and liability risk, geo and demographics, driver behavior, crash analysis and competitor analysis.


Fatality Analysis Reporting System (FARS) data is made available to the public by NHTSA. Over the years, changes have been made to the type of data collected and to the way the data is presented in the data files. Some data files have been discontinued and new ones have been created. For the current data collection year there are 20 data files (tables).


Key Metrics Covered

Geo & Demographics

Driver Behavior

Crash Analysis


Tables

ACCIDENT: This table contains information about crash characteristics and environmental conditions at the time of the crash. There is one record per crash. Data is present from 1975 to 2018.


VEHICLE: This table contains information describing the in-transport motor vehicles involved in the crash and their drivers. There is one record per in-transport motor vehicle. Data is present from 1975 to 2018.


PERSON: This table contains information describing all persons involved in the crash, including motorists (i.e., drivers and passengers of in-transport motor vehicles) and non-motorists (e.g., pedestrians and pedal cyclists). It provides information such as age, sex, vehicle occupant restraint use, and injury severity. There is one record per person. Data is present from 1975 to 2018.


PARKWORK: This is a new table. Data is present from 2014. This table contains information about parked and working vehicles that were involved in crashes. There is one record per parked/working vehicle.


PBTYPE: Data is present from 2014. This table contains information about crashes between motor vehicles and pedestrians, people on personal conveyances and bicyclists. There is one record for each pedestrian, bicyclist or person on a personal conveyance.


CEVENT: Data is present from 2010. This table contains information for all of the qualifying events (i.e., both harmful and non-harmful events involving in-transport motor vehicles) which occurred in the crash. It details the chronological sequence of events resulting from an unstabilized situation that constitutes a motor vehicle traffic crash. There is one record per event. Each record includes a description of the event or object contacted, the vehicles involved, and the vehicles’ areas of impact.


VEVENT: Data is present from 2010. This table contains the sequence of events for each in-transport motor vehicle involved in the crash. In addition, this table has a data element that records the sequential event number for each vehicle (VEVENTNUM). There is one record for each event for each in-transport motor vehicle.


VSOE: Data is present from 2010. This table contains the sequence of events for each in-transport motor vehicle involved in the crash. There is one record for each event for each in-transport motor vehicle.


DAMAGE: Data is present from 2012. This table contains information about all of the areas on each vehicle that were damaged in the crash. There is one record per damaged area.


DISTRACT: Data is present from 2010. This table contains information about driver distractions. There is at least one record per in-transport motor vehicle. Each distraction is a separate record.


DRIMPAIR: Data is present from 2010. This table contains information about physical impairments of drivers of motor vehicles. There is one record per impairment and there is at least one record for each driver of an in-transport motor vehicle.


FACTOR: Data is present from 2010. This table contains information about vehicle circumstances which may have contributed to the crash. There is at least one record per in-transport motor vehicle. Each factor is a separate record.


MANEUVER: Data is present from 2010. This table contains information about actions taken by the driver to avoid something or someone in the road. There is at least one record per in-transport motor vehicle. Each maneuver is a separate record.


NMIMPAIR: Data is present from 2010. This table contains information about physical impairments of people who are not occupants of motor vehicles. There is one record per impairment and there is at least one record for each person who is not an occupant of a motor vehicle.


NMPRIOR: Data is present from 2010. This table contains information about the actions of people who are not occupants of motor vehicles (e.g., pedestrians and bicyclists) at the time of their involvement in the crash. There is one record per action and there is at least one record for each person who is not an occupant of a motor vehicle.


NMCRASH: Data is present from 2010. This table contains information about any contributing circumstances or improper actions of people who are not occupants of motor vehicles (e.g., pedestrians and bicyclists) noted on the police report. There is one record per action and there is at least one record for each person who is not an occupant of a motor vehicle.


SAFETYEQ: Data is present from 2010. This table contains information about safety equipment used by people who are not occupants of motor vehicles. There is one record for each person who is not an occupant of a motor vehicle.


VIOLATN: Data is present from 2010. This table contains information about violations which were charged to drivers. There is at least one record per in-transport motor vehicle. Each violation is a separate record.


VINDECODE: Data is present from 2013. This table contains vehicle descriptors for all vehicles, mainly passenger vehicles, trucks and motorcycles, based on the vehicle’s VIN which is decoded using the VINtelligence program. There is one record per vehicle.


VISION: Data is present from 2010. This table contains information about circumstances which may have obscured the driver’s vision. There is at least one record per in-transport motor vehicle. Each obstruction is a separate record.


DRUGS: Data is present from 2018. This table contains the specimens tested and the drug results from toxicology reports of all persons involved in the crash. There is one record per specimen tested and its corresponding drug result.
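
Because the tables are kept at different grains (one record per crash, per vehicle, per person, and so on), most analyses combine them with a join. The following is a minimal sketch that counts in-transport vehicles per crash for 2018. It rests on two assumptions that this document does not confirm: that the VEHICLE table is published as `traffic_safety.vehicle` alongside the `accident` table, and that both tables keep the FARS case identifier `ST_CASE`. The `L_YEAR` column is the partition column described under Performance.

```sql
-- Sketch: number of in-transport vehicles per crash in 2018.
-- Assumptions (hypothetical, not confirmed by this document):
--   * the VEHICLE table is available as `traffic_safety.vehicle`
--   * both tables carry the FARS case identifier ST_CASE
SELECT
  a.ST_CASE,
  COUNT(*) AS VehicleCount
FROM `dataflix-public-datasets.traffic_safety.accident` AS a
JOIN `dataflix-public-datasets.traffic_safety.vehicle` AS v
  ON v.ST_CASE = a.ST_CASE
  AND v.L_YEAR = a.L_YEAR
WHERE a.L_YEAR = 2018
  AND v.L_YEAR = 2018
GROUP BY a.ST_CASE
ORDER BY VehicleCount DESC
LIMIT 10
```

The year is filtered on both tables and included in the join condition so that each side reads a single partition and cases from different years are never matched to each other.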


Performance

All tables in the “Traffic and Safety” dataset are partitioned on the “L_YEAR” column. Filtering on this column improves query performance and reduces query cost. For example, the ACCIDENT table holds around 900 MB for the years 1975 to 2018; when filtered on year, the data size is reduced to between 20 MB and 25 MB.


Sample Queries

SELECT *
FROM `dataflix-public-datasets.traffic_safety.accident`
WHERE L_YEAR = 2018


SELECT *
FROM `dataflix-public-datasets.traffic_safety.accident`
WHERE L_YEAR BETWEEN 2010 AND 2018
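
The same partition filter works with aggregations. As a minimal sketch, the query below counts crashes per year. It relies only on the `L_YEAR` column and on the ACCIDENT table holding one record per crash; the alias `Crashes` is illustrative.

```sql
-- Sketch: crashes per year, restricted to the 2010-2018 partitions.
-- COUNT(*) equals the number of crashes because ACCIDENT has one record per crash.
SELECT
  L_YEAR,
  COUNT(*) AS Crashes
FROM `dataflix-public-datasets.traffic_safety.accident`
WHERE L_YEAR BETWEEN 2010 AND 2018
GROUP BY L_YEAR
ORDER BY L_YEAR
```

Naming only the columns a query needs, rather than using `SELECT *`, generally reduces the amount of data BigQuery scans further, since storage is columnar.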