Kaggle – Pivot Billions

Visualizing Gender Disparities in Hi-Tech, Engineering, and Science

pivotteam — Wed, 13 Mar 2019 18:38:07 +0000

We’ve all heard about the many improvements and changes being made to help get women into Hi-Tech and the Sciences, but where are these improvements being seen? Is the entire nation reaching gender equality or are there pockets of improvement and pockets of stagnation. These are the questions I set out to answer using the Safe Graph US Census Bureau Census Block Group American Community Survey Data from Kaggle.

This census data is an extremely comprehensive dataset at the census block group level (on the order of 1500 households each) containing detailed information on a multitude of demographic aspects of the people living in the US. However, the data is quite sizable and comes in the form of a very large zip file that makes the data hard to explore. Moreover, the zip file contains many internal files containing different demographic topics, each with its own schema. I used Pivot Billions to preprocess my data for Tableau due to its data ETL and joining features.

Using it, I quickly extracted the key columns I wanted from the various internal files in the census data while the data was still zipped and in S3.

Then I joined those key columns together by census block group and enhanced the data with a few custom columns such as the percentage of women and the degree of gender inequality.

This second custom column is used to describe how far away from equal (50% men, 50% women) the computer, engineering, and science workers are in each census block group. The values range from 0 to 5 with 5 meaning highly biased towards one gender. Now that the main data I was interested in was ready and loaded into Pivot Billions, I then downloaded the Pivot Billions preprocessed data as a csv file and then simply imported the data into Tableau as a text file.

Now all I had to do to explore the distribution of gender equality and inequality in the United States was to visualize this data. I quickly opened a new worksheet and told Tableau to treat the Census Block, latitude/longitude, and DegreeOfGenderInqeuality fields as Dimensions (categories). Then I added some of the interesting fields to the displayed tooltip in Tableau and colored by DegreeOfGenderInequality. Focusing on the extremes of Equality (Green) and Inequality (Orange), I was left with a very telling interactive visualization.

It was immediately clear that the coastal regions of the US and the Southeastern US region consistently maintained a high level of gender equality. On the other hand, the North/NorthEastern US regions leaned heavily towards just one of the genders (typically towards men).

Though many areas of the US seem to be making strides towards gender equality in Hi-Tech and the Sciences, many still remain heavily biased and stagnant. It may be worthwhile to look at the regulations and companies operating in the coastal and southeastern regions of the US to see what policies (governmental or private) may be having an effect. With time hopefully these policies will be implemented across the US to equalize the representation of both genders in these fields and truly create a more equal society.

To interact with this visualization yourself, feel free to go through my Gender Equality and Inequality workbook on Tableau Public.

San Francisco – A City Full of Psychopaths?

pivotteam — Thu, 31 Jan 2019 19:52:27 +0000

With a population of over 800,000 people within approximately 47 square miles, San Francisco is one of the most densely populated cities in the United States. Cities with this level of population density must contend with a high level of policing activity to ensure safety. I wanted to explore just how safe it actually is so I decided to analyze the SF Police Calls for Services and Incidents dataset from Kaggle which covers many years (2003-2017) of policing data and contains a variety of useful information about each record.

Starting with the police-department-incidents.csv file and loading it into Pivot Billions, I quickly renamed the x and y fields to what they truly were (longitude and latitude) and started to view the whole data. Getting a quick distribution of the Resolution column I was able to see the different ways each police incident was resolved and get an idea for how common each resolution was.

I was surprised to discover that a resolution of “Psychopathic Cases” occurred nearly 30,000 times across the data! What the heck?! Is San Francisco a city of psychopaths? Sure San Francisco has had its share of famous serial killers, but this hardly seemed likely.

To investigate this further, I simply applied a filter on the Resolution column in Pivot Billions to show only the data that ended up classified as a “Psychopathic Case” by the San Francisco PD. I then downloaded this filtered data to visualize in Tableau, choosing to also drop the Location column since Tableau has trouble interpreting it.

Now that my data was in a size and format manageable for Tableau, I loaded the data into Tableau as a Text file and graphed the data by the latitude and longitude columns. By adding the Address and Number of Rows data to the visualization and quickly integrating some interactive filters and highlighters, I ended up with a useful and informative visualization of my Policing data.

This interactive map shows where each “Psychopathic Case” police incident occured and how many incidents were recorded at that location. We can clearly see that there is a high concentration of these cases in Northeast San Francisco. By using the interactive filters in my visualization it was easy to determine that the Central, Northern, Southern, and Tenderloin districts are the main districts where such incidents were recorded the most.

So is San Francisco a city overrun by psychopaths? Hardly.

A little further digging reveals that the vast majority of these so called psychopathic resolutions were in fact “non-criminal - aided cases” where a mentally ill person was involved.

In fact if you look at the Psychopathic Case designation over time, you notice that it disappears almost entirely after 2014.

It looks like this is a case of poor classification of a very serious problem that police in all major cities have to deal with, and that is the rise of mental health related incident calls. There have been a slew of high profile incidents between law enforcement and the mentally ill in San Francisco in the past, so it is a promising sign that the SFPD is moving in the right direction and starting to limit the use of the “Psychopathic Case” designation.

If you want to explore more of this data please feel free to go through my workbook on Tableau Public.

Data Twofer: Exploring LA Web Traffic and Google Appstore Sentiment

pivotteam — Fri, 25 Jan 2019 22:20:41 +0000

Overview

It's been a few weeks since we posted something in our 5 Minute Analysis series, so we decided to do two quick analyses on two different datasets we found on Kaggle. Instead of doing the analysis locally using PivotBillions on Docker, we opted to run the analysis in the free cloud version available on the PivotBillions website.

Data # 1: Los Angeles Website Traffic

The first is Los Angeles Website Traffic data from the LA open data portal. This data provides web traffic statistics from January 2014 to January 2019.

The dataset was moderately large at 127MB, with about 2.5 millions rows. Doing some basic exploratory data analysis showed some interesting insights.

A quick pivot based on Date and Visitor counts shows a huge spike in the number of visitors to the LAcity.org website as seen above. By zooming in on the date range that the spike occurred in the graph, we can see that the specific date of the spike occurred on Dec 6, 2017.

A quick search for that date shows that a large number of wildfires had occurred in the LA County area.

By also including DeviceCategory into the pivot chart, we see that the majority of that spike was driven by mobile traffic.

Data # 2: Google Play Store Apps

The second dataset in this post that we examined was the Google Play Store Apps data from Kaggle. The size of the data was not large, but it was interesting because it required joining two separate files, and involved some sentiment analysis.

Once the combined data was loaded into PivotBillions, we pivoted using Category and Sentiment columns. The resulting pivot table showed that there were Null and nan counts that did not contribute to our analysis so we deselected them.

After a quick sort, we charted the data shown below. We can see that apps in the Game category had the most positive as well as negative reviews by far of any category, followed by Family apps.

By analyzing app Rating and Sentiment, we see that around the 4.4 rating level, the positive and negative reviews peak.

By filtering for only the Game category, we see that negative reviews slightly outweigh positive reviews at the 4.4 rating level.

The Wrap-up

Although we can't conclude anything definitive from the exploratory data analysis we conducted on either dataset, the results open up some intriguing paths to further investigate. Perhaps by enriching with other datasets, we can get some clearer confirmation about what was the actual cause of the traffic spike in the LA data and why is there so much positive and negative sentiment in Game Apps with ratings of 4.4.

Both these datasets are available for anyone to play with through the free public demo of our cloud version of PivotBillions. Feel free to analyze them for yourself, or upload your own datasets and see what sort of interesting insight you can find in five minutes.

Health Data Analysis: CDC Behavioral Risk Factor data says eat your green veggies

pivotteam — Wed, 19 Dec 2018 22:49:03 +0000

Everyone wants to be healthy but there are many competing claims as to how you can achieve this. With so many contradictory diets, exercise routines that take enormous amounts of time and dedication, and many other perceived paths to a healthy body and mind; tying these claims to actual data becomes very necessary and useful. That is why we decided to explore the CDC’s Behavioral Risk Factor Surveillance System data from Kaggle.

This data contains over 400,000 individuals’ responses to survey questions from 2011 to 2015. It is an extremely comprehensive dataset containing answers to hundreds of questions regarding the individuals’ behaviors and their resulting health. However, all of this data becomes quite tedious and can be arduous to dive into. With Pivot Billions we are able to easily load and enhance the dataset and immediately start to understand the data and pivot it in various ways to see the trends buried within.

For our Behavioral Risk Factor data, we created additional columns categorizing each individuals’ intake of fruit, fruit juice, and green or orange vegetables from the data using the column-creation feature. We can then pivot the data by some of these columns and the amount of activity limitation due to health problems to immediately see the average activity limitation by green vegetable and fruit intake.

The over 400,000 surveys clearly show health improvements due to either daily intake of green vegetables or fruit. Moreover, we can easily see that daily eating of fruit is better than not eating fruit or green vegetables, but daily eating of green vegetables is preferred over eating fruit and that eating both green vegetables and fruit at least once daily is the best option. In fact, the data reveals that daily intake of green vegetables and fruit leads to an over 40% reduction in experienced health limitations!

By pivoting the data by orange vegetable and fruit juice intake, we can now quickly analyze their impact.

It appears that orange vegetables and fruit juice do reduce experienced health limitations by approximately 16%, but green vegetables and fruit are still much more effective.

These are just some of the underlying trends in the data that we found using Pivot Billions. There are many more to explore in this dataset. Stay tuned for our next analysis or feel free to try it out yourself.

Completing the Picture: Who is the Fantasy Football GOAT for Offense?

pivotteam — Fri, 14 Dec 2018 23:08:25 +0000

Fantasy football can be a relaxing past time but for anyone who takes the competition seriously, data immediately becomes very necessary. While many people track their favorite players from their favorite teams, to truly put together a winning team you need to be able to explore and understand large amounts of data. Moreover, the data you need is typically spread across many files and needs to be put back together again to truly understand it. Pivot Billions dramatically improves this process and makes it easy to merge data and start to analyze it.

As an example of this usecase, we chose to work with the historical NFL Statistics data from Kaggle. For fun, we decided to try to determine who would be considered the greatest offensive players of all time to have on your fantasy team. The period that the data covered was from 1924 to 2016.

For our analysis, we decided to use the fantasy scoring system found at FantasyData.com. For offensive players, points are distributed as follows:

OFFENSIVE PLAYERS

Passing Yards: 1 point per 25 yards

Passing Touchdowns: 4 points

Passing Interceptions: -2 points

Rushing Yards: 1 point per 10 yards

Rushing Touchdowns: 6 points

Receptions: 1 points

Receiving Yards: 1 point per 10 yards

Receiving Touchdowns: 6 points

2-Point Conversions: 2 points

Fumbles Lost: -2 points

Fumble Recovered for a Touchdown: 6 points

We had to merge the 4 separate files in order to combine Receiving, Passing, Rushing, and Fumble stats for offensive players. For some reason, this dataset did not have the 2 point conversion data, so we omitted it from the calculations. We also had to perform some ETL to convert some of the data from strings to integers.

Then we select what fields the files have in common and that we want to join by. This is made extremely simple through Pivot Billions’ easy Column Preview interface:

To make our analysis more efficient, we only loaded the data that directly correlated to the fantasy football point scoring system. A copy of the cleansed and filtered dataset can be found here.

For our Fantasy Football usecase we created an additional Fantasy Points column from the data using our column-creation feature. This is the combined offensive point score based on the scoring system.

With the new data column, we created a pivot table based on PlayerID and FantasyPts. We can now easily compare the best offensive players based on their point scores.

Quarterbacks dominate the top of the list. Considering that they are almost always involved in offensive plays it makes sense. Interestingly, the only non-quarterback to have more than 5,000 points in career fantasy scoring is Jerry Rice. Jerry Rice is consistently considered by players, fans, and analysts as the G.O.A.T. But based on our analysis, Peyton Manning would be the greatest fantasy football offensive player.

Because of the versatility and speed of PivotBillions, we can also analyze based on position, decade, and team just as easily. It's also fun to try different scoring methods. Next time, we'll do a similar analysis for defensive players.

Completing the Picture: Uncovering NHL MVP’s in a pile of data

pivotteam — Wed, 05 Dec 2018 18:58:03 +0000

Data is rarely consistent. The most consistent attribute of data is that it is usually dispersed across many files and needs to be put back together again to truly understand it.

Pivot Billions dramatically improves this process and makes it easy to merge your data and start to analyze it. As an example of this use case, we chose to analyze the disjointed 2012- 2017 National Hockey League (NHL) data from Kaggle. This is very interesting and potentially valuable data that describes various aspects of past NHL hockey games. It provides many views into the data from team statistics to to play-by-play results. However these views are each in distinct files that need to be merged.

With Pivot Billions we are able to easily select the datasets that we want to merge and put the pieces back together. This simple process is completed in two steps.

First, we select all of the files we want to merge:

Then we select what fields the files have in common and that we want to join by. This is made extremely simple through Pivot Billions’ easy Column Preview interface:

The data is now quickly imported into Pivot Billions and you have full access to the merged files.

We can now quickly and effortlessly analyze the data. Pivot Billions provide quick-analysis features such as viewing the distribution of each column in the data as well as powerful analysis features including creating new columns from the data and pivoting all of the data by any of its columns to get a new view of the data. This is one of Pivot Billions’ unique features and it allows us to quickly draw insights from the data.

For our example usecase we need to create an additional column to combine the first and last name of players using our f(x) column-creation feature.

Then filter the event column for "Goal". Finally we pivot the data by player name, player position and player type.

We can now immediately see in the pivot table that in a goal event, there are three types of players involved. The scorer, any assist player, and the goalie. With regard to position, we see that the Center and the Goalie(obviously) are the most involved in goal events. Interestingly, the Defenseman position contributes overall to more goals than either wing position.

By deselecting the goalie player type and adding the full name, we can find out which players contribute the most toward a goal in either scores or assists.

Sidney Crosby (Center) and Patrick Kane (Right Wing) lead the pack of players from 2012-2016 in overall contribution to their teams' goals.

This extremely powerful analysis allows us to start to put together the ideal NHL fantasy team as well as compare existing teams by their individual players. Aside from the stats in the dataset, you can derive other common hockey metrics such as the Corsi and the Fenwick along with the provided plus minus stat. It could also be used by NHL recruiters to determine which players might be the best fit for their team to increase their win percentage.

Finding Underutilized Kaggle Data

pivotteam — Fri, 09 Nov 2018 17:55:14 +0000

Overview

In this 5 Minute Analysis we are exploring a Kaggle dataset about Kaggle datasets. This dataset lets us see a list of the datasets on Kaggle, and shows which ones have the most engagement and activity. Our goal is to explore and filter the data to find popular datasets with many downloads but very few kernels. As a contributor to Kaggle, you may want to add kernels to datasets that fit this criteria as opposed to datasets with lots of kernels where your submission might get lost among the crowd.

Simplifying Iowa Liquor Sales Data for Loading to Tableau

pivotteam — Fri, 19 Oct 2018 21:41:45 +0000

Overview

In this 5 minute analysis, we pre-process, map, and explore complicated public sales data for liquor stores in Iowa to extract relevant latitude and longitude from a problematic column in the data.

We want to filter the data for the city with the most sales and prepare it for easy loading into the popular BI (business intelligence) tool, Tableau. Finally we use the extracted location data to visualize sales within the selected city using the Tableau Map.

Steps

1. Perform ETL and Load

In this example, we need to first do a little ETL(extract, transform, load) on one of the data columns before we load the data. The data column we are targeting is the StoreLocation column whose contents include both the address of the store as well as latitude and longitude coordinates.

Example: 1414 48TH ST FORT MADISON 52627 (40.624226, -91.373211)

We want to extract only the latitude and longitude data and create two new columns with the extracted data. To do this we do the following:

Drag and drop the data file into Pivot Billions, select the Main checkbox option for that file and then click on Column Preview.
Click the Plus (+) icon on the bottom left hand side of the window to add a new column.
Click the edit mode icon to the left of the data type selector to switch to Advanced ETL mode.
Enter latitude into the Column Name box.
Enter -eval s:latitude '""' -eval - 'RxMap(StoreLocation,"^.*($.*$,$.*$).*$",latitude,"%%1%%")' into the Ess Syntax box.

What's this Ess Syntax? It is a simple scripting format that Pivot Billion uses that allows for data pre-processing. In this step, the "-eval" function that starts the script indicates that we are substituting our new variable latitude with the contents from somewhere else. In this case, the content is coming from what's inside the parentheses of StoreLocation.
Click the Plus (+) icon again to create a second new column.
Click the icon to the left of the new edit box to switch to Advanced ETL mode.
Enter longitude into the Column Name box.
Enter -eval s:longitude '""' -eval - 'RxMap(StoreLocation,"^.*($.*$,$.*$).*$",longitude,"%%2%%")' into the Ess Syntax box.
Confirm the output is what is expected by clicking Check.
Click Import at the top of the screen.

Note: This is a very large file, 3.22 GB with over 12 million rows, so we recommend that your system has at least 16 GB of memory and that your docker setting allocates at least 7.5 GB for docker containers.

2. View and Explore the Data
We now have access to all 12,591,077 rows of the Iowa Liquor Sales data. This contains the original data in an easy tabular format along with the two new latitude and longitude features we’ve created.

We viewed the distribution of sales by city for our data by clicking on the distribution charting icon in the City column. Not surprisingly, the capital of Iowa (Des Moines) with the highest population of any city in the state has the highest total number of liquor sales.

Now, for our analysis we’ll filter all of the data to only include data for the city of Des Moines.

Click on the filter icon in the City column.
Select Equals from the dropdown and then enter “Des Moines” and press enter. All of our data is immediately filtered to just Des Moines data.

The data is almost ready for Tableau or another business intelligence tool but the StoreLocation column is troublesome and prevents Tableau from reading the data. So we’ll go ahead and download the data without this column.

Click the icon on the top right.
Unselect the box next to StoreLocation to remove that column from your selected data.
Now click the icon on the top right.
Enter “DesMoinesSales” for the file name and then click Selected Columns.
When the file is ready for download click the at the bottom of the screen as seen below.

3. Import to Tableau and Graph the Data

We can now easily import this data into Tableau and use its powerful visualizations to dive into Des Moines liquor sales by location.

Unzip the “DesMoinesSales.csv.zip” file we just downloaded from PivotBillions.
Open Tableau and click Text under Connect.
Navigate to the folder where the “DesMoinesSales.csv” file is located and Open it.
Then click on Sheet1.
Right click Latitude under Measures and select Convert to Discrete and then Convert to Dimension.
Repeat this for Longitude.
Drag Longitude to Columns and Drag Latitude to Rows as seen below.
Drag Sale Dollars to Size and drag whatever features you want to Detail.

The Wrap-up

It is clear from this SymbolMap plot that the sales for Des Moines are highest near the city center. However, it is also clear that there are many high volume locations throughout the city. We can now visualize the data however we see fit in Tableau. We could also easily remove our city filter from PivotBillions and load the whole preprocessed data into Tableau, or change our filter(s) to select a different subset or arrangement of the data.

To view and interact with this visualization or download the workbook to Tableau, see my Iowa Liquor Sales Workbook on Tableau Public.

Getting the Whole Picture of Retail Sales

pivotteam — Mon, 08 Oct 2018 18:04:34 +0000

Overview

It should be apparent that we like to use data sets found in Kaggle. If you are not familiar with Kaggle, you should check out their website to peruse the thousands of great datasets available for free to play around with.

In this 5 Minute Analysis of the Kaggle Retail Data Analytics we explore historical sales data for 45 retail locations. In addition, we join two related supplementary data files to the main sales data file to dive into the relation between store type, fuel prices, and other indicators and how they might affect average weekly sales.

Steps

1. Load the Data and View its Structure

In this example, the steps for loading data is a little different from our previous posts because we are joining two supplementary data files to the main data set. The steps to follow are:

Download and unzip the data onto your machine and access the Pivot Billions UI on your browser.
Click the Add (+) icon from the data loader screen and select Drag & Drop.
Drag your downloaded Features data set.csv, sales data-set.csv, and stores data-set.csv files to the Drag & Drop box in Pivot Billions.
Select the left checkbox (Main) next to the sales data-set.csv file and the right checkbox (Join) next to the Features data set.csv file and the stores data-set.csv file.
Click Column Preview at the top of the screen. The resulting preview will show the column labels and data types for both the main dataset and the data we are joining to it.
Pivot Billions automatically will find common keys between the data sets which are identified by the Key symbol next to column labels. You can modify keys here if you want, but for this example you can just use the ones selected as is.
Click Import at the top of the screen to load your data into the Report UI.

3. Pivot and Report the Data
Next we used the Pivot tool to do some quick analysis of a single dimension's relationship to total and average weekly sales. The basic steps are:

Select and open the Pivot tool and then select a Dimension like IsHoliday or Temperature for the pivot table.
Select Weekly Sales as our Value and click View.
In the resulting table, select the Pivot View option and then drag and drop the dimension we want to analyze into the columns area.
Change the visualization to Table Barchart and then select the data Summation option to see the effect of the dimension on total Weekly Sales.
Next change from Summation to Average to see effect on average Weekly Sales.

We did this same type of analysis for a few other dimensions which you can see below. We chose the Bar Chart visualization for these.

The Wrap-up

We are able to see that the CPI (consumer price index) and the IsHoliday dimensions affect total sales but have less of an effect on average sales. The Store Type and Store Size, however, seem to have a stronger correlation to both the total and average weekly sales.

In five minutes, we were able to run these as well as similar pivot charts using Temperature and Unemployment dimensions. These last two did not show a very strong correlation to total or average weekly sales. The next steps if we wanted to dive deeper would be to try to use two or more dimensions and see how they might relate to sales behavior. It would be a good idea to set some filters to better manage the size of the analysis.

Olympics Data Reveals Rising Gender Equality

pivotteam — Fri, 28 Sep 2018 23:31:43 +0000

Kaggle – Pivot Billions

Visualizing Gender Disparities in Hi-Tech, Engineering, and Science

San Francisco – A City Full of Psychopaths?

Data Twofer: Exploring LA Web Traffic and Google Appstore Sentiment

Overview

Data # 1: Los Angeles Website Traffic

Data # 2: Google Play Store Apps

The Wrap-up

Health Data Analysis: CDC Behavioral Risk Factor data says eat your green veggies

Completing the Picture: Who is the Fantasy Football GOAT for Offense?

Completing the Picture: Uncovering NHL MVP’s in a pile of data

Finding Underutilized Kaggle Data

Overview

Steps

The Wrap-up

Simplifying Iowa Liquor Sales Data for Loading to Tableau

Overview

Steps

The Wrap-up

Getting the Whole Picture of Retail Sales

Overview

Steps

Store Type -Total

Store Size - Total

CPI - Total

Store Type - Average

Store Size - Average

CPI - Average

The Wrap-up

Olympics Data Reveals Rising Gender Equality

Overview

Steps

The Wrap-up