On the very day of my discharge from military service, a dust-covered book caught my eye. It was a book from the weekly book discussion during my first internship. I wiped off the dust, rolled up my sleeves, and skimmed through the notes I had scribbled.
The book The Signal and the Noise by Nate Silver, who is well known for successfully calling the outcomes in 49 of the 50 states in the 2008 U.S. Presidential election, takes a comprehensive look at prediction across 13 fields, ranging from sports betting to earthquake forecasting. The book asks an ambitious question: What makes predictions succeed or fail?
Silver notes that the predictions of the people he met at the forefront of science and technology have often gone awry despite their best efforts. Meteorology, however, is an exception: weather forecasts are much better now than they were 10 or 20 years ago.
A quarter-century ago, for instance, the average error in a hurricane forecast made three days in advance of landfall was about 350 miles. That means a hurricane sitting in the Gulf of Mexico might just as easily have hit Houston as anywhere else along hundreds of miles of coastline, making evacuation and planning all but impossible. Today, although storms like Hurricane Isaac are still tricky for forecasters, the average miss is only about 100 miles.
Silver attributes this achievement to the healthy balance between computer modeling and human judgment in weather forecasting, a balance that is lacking in many other disciplines: we either take the output of poorly designed models too credulously, or we value our own subjective judgment much too highly (as baseball did before the advent of Moneyball).
Exploring storms data
Although I don't have the knowledge or skills to build computer models for weather forecasting, I was prompted to explore hurricanes within my capabilities. So I searched the internet for historical hurricane data and found that the dplyr R package ships with a nicely formatted dataset called storms. It is a subset of the NOAA Atlantic hurricane database best track data and includes the positions and attributes of 198 tropical storms, measured every six hours during the lifetime of each storm.
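If you just want a quick look at the dataset before moving it anywhere, a couple of lines of R are enough. This snippet is my own illustration rather than part of the project code:

```r
# Load dplyr and peek at the bundled storms dataset
library(dplyr)

glimpse(storms)          # column names and types: name, year, month, day, hour, lat, long, status, ...
n_distinct(storms$name)  # how many distinct storm names appear in the subset
```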
In this side-project, I took the following steps:
Transfer storms data from R to MySQL
Import data from MySQL to Elasticsearch using Logstash
Create a dashboard of storm data using Kibana
For those who are not familiar with the ELK stack:
Elasticsearch is a search and analytics engine.
Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch.
Kibana lets users visualize data with charts and graphs in Elasticsearch.
In other words, Elasticsearch is an open-source, distributed, full-text search engine, developed alongside a data-collection and log-parsing engine called Logstash and an analytics and visualization platform called Kibana. The three products are designed for use as an integrated solution, referred to as the "ELK stack". Elasticsearch, Logstash, Kibana! Get it?
I was first exposed to this stack during my first internship, where I helped build a self-reporting tool for users. As the name implies, the Self-Reporting Tool let non-technical users create their own dashboards and cut out a long, cumbersome workflow: writing a ticket to the BI team with requirements, waiting for the BI team to reply, re-explaining what needed to be added or removed, and so on. I have used the ELK stack for personal and school projects ever since.
So I decided to use the ELK stack again for this side-project!
Tools used:
R
MySQL
Logstash
Elasticsearch
Kibana
OS/Software version:
Windows 7 (32-bit)
Elasticsearch-5.5.3
Logstash-5.5.3
Kibana-5.5.3
Transfer data from R to MySQL:
This is the R script. Here, I fetch the storms data from the dplyr package, connect to the MySQL server at port 3301, and write the data to the storms database.
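A minimal sketch of what that script looks like is below; the host, user, and password values are placeholders for my local setup, and the target table name "storms" is my own choice.

```r
# Sketch of the R-to-MySQL transfer (credentials are placeholders)
library(dplyr)
library(DBI)
library(RMySQL)

# Fetch the storms data bundled with dplyr
storms_df <- as.data.frame(storms)

# Connect to the local MySQL server listening on port 3301
con <- dbConnect(
  RMySQL::MySQL(),
  dbname   = "storms",
  host     = "localhost",
  port     = 3301,
  user     = "root",
  password = "password"
)

# Write the data frame to a table named "storms"
dbWriteTable(con, "storms", storms_df, overwrite = TRUE, row.names = FALSE)

dbDisconnect(con)
```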
Logstash configuration:
This is the Logstash configuration file I ran to import the data from MySQL into Elasticsearch (a sketch of the file follows the list below). Each part carries out a distinct function:
Input: connects to the MySQL server using the JDBC driver and runs a SQL query to fetch the data.
Filter: converts field names and types.
Output: connects to Elasticsearch and creates a "storm" index.
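Here is a minimal sketch of such a configuration; the JDBC driver path, the credentials, and the handling of the lat/long columns as a nested location field are assumptions about my setup rather than a verbatim copy of the original file.

```
input {
  jdbc {
    # Path to the MySQL JDBC connector jar (adjust to your install)
    jdbc_driver_library => "C:/logstash/mysql-connector-java-5.1.44-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3301/storms"
    jdbc_user => "root"
    jdbc_password => "password"
    # Fetch every row of the storms table
    statement => "SELECT * FROM storms"
  }
}

filter {
  mutate {
    # Coordinates need to be numeric for the geo_point mapping
    convert => { "lat" => "float" "long" => "float" }
  }
  mutate {
    # Nest lat/long under a single "location" field
    rename => { "lat" => "[location][lat]" "long" => "[location][lon]" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # Documents land in the "storm" index, matching the mapping template below
    index => "storm"
  }
}
```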
geo-point mapping template:
I created a mapping template so that Elasticsearch stores the location data as the geo_point type, which is what the tile map in Kibana requires. The template needs to be applied in Kibana Dev Tools before running the Logstash configuration file.
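A sketch of such a template for Elasticsearch 5.x, run from the Dev Tools console, is shown below; the template name and index pattern are my assumptions, and the location field matches the one built in the Logstash filter above.

```
PUT _template/storm_template
{
  "template": "storm*",
  "mappings": {
    "_default_": {
      "properties": {
        "location": { "type": "geo_point" }
      }
    }
  }
}
```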
Below is a gif of the interactive dashboard I created:
There are more visualizations you can add to the dashboard beyond the pie chart, area chart, and tile map. I filtered the tile map down to three storms, but you can add more storms to the filter and analyze their tracks, sizes, intensities, and so on. This is what fascinates me about data visualization with the ELK stack: it lowers the barrier for non-technical users to create as many visualizations and dashboards as they need, limited only by their creativity and imagination!