5 Minute Analysis – Pivot Billions
https://pivotbillions.com

5 Minute Analysis of Data.gov Datasets with PivotBillions
Fri, 19 Apr 2019
https://pivotbillions.com/5-minute-analysis-of-data-gov-datasets-with-pivotbillions/

Overview

Data.gov is an open data portal provided by the US government that offers a wealth of interesting information on agriculture, education, finance, health, science, and more. It currently lists over 200,000 datasets available to explore.

In this 5 Minute Analysis, we picked one dataset from the consumer data category, specifically the Financial Services Consumer Complaint Database. This dataset is provided by the Consumer Financial Protection Bureau, or CFPB, and is a record of complaints received by the CFPB about various financial products and services offered by a multitude of companies.

The data comprises over 1.2 million rows in CSV format and is about 700MB in size. Loading the data was simple using the drag-and-drop feature in PivotBillions to upload it to our cloud-based demo portal.

EDA

Doing some quick exploratory data analysis (EDA), here are some general trends that we can see in the data.

Since the launch of the CFPB back in 2011, the number of recorded complaints has gone up year over year.

In that time, the top financial products or services associated with complaints were the following:

By far, mortgage, debt collection, and credit reporting outpace other types of products or services in generating complaints.

The top recipients of these complaints include the major credit reporting companies and banks.

The top five states for consumer complaints are California, Florida, Texas, New York, and, surprisingly, Georgia.  Georgia is a bit of an outlier considering it is only the 8th largest US state by population, yet more complaints originated there than from Illinois, which has a larger population.

The overwhelmingly preferred method for reporting complaints is online, via the web.

Out of the top three major financial products/services reported on, we see a trend of mortgage-related complaints declining while debt collection and credit reporting complaints have grown significantly over the same period.

The Wrap-up

What insights did we gain through our EDA? The declining mortgage complaints seem to indicate a steady recovery from the subprime mortgage crisis of 2007-2008, although there is still friction with consumers trying to modify loans or avoid foreclosure. The slow rise of debt collection complaints may indicate that consumers are taking on more credit card or loan debt and that the rate of default is rising. The most drastic increase in complaints is with credit reporting/credit repair services. The huge spike in September 2017 is primarily a result of the Equifax data breach announcement, but even before then there appears to have been a steady rise in complaints about how the credit reporting agencies were handling information in consumer credit reports.

A lot more can be gleaned by delving deeper into the data, but that might have to be reserved for another post. Currently, this dataset is available in our public demo portal for anyone to play with.  Try it for yourself and see what insights you might find.

Data Twofer: Exploring LA Web Traffic and Google Appstore Sentiment
Fri, 25 Jan 2019
https://pivotbillions.com/data-twofer-exploring-la-web-traffic-and-google-appstore-sentiment/

Overview

It's been a few weeks since we posted something in our 5 Minute Analysis series, so we decided to do two quick analyses on two different datasets we found on Kaggle.  Instead of doing the analysis locally using PivotBillions on Docker, we opted to run the analysis in the free cloud version available on the PivotBillions website.

 

Data # 1: Los Angeles Website Traffic

The first is Los Angeles Website Traffic data from the LA open data portal. This data provides web traffic statistics from January 2014 to January 2019.

The dataset was moderately large at 127MB, with about 2.5 million rows. Some basic exploratory data analysis surfaced a few interesting insights.

A quick pivot based on Date and Visitor counts shows a huge spike in the number of visitors to the LAcity.org website as seen above. By zooming in on the date range of the spike in the graph, we can see that it occurred on Dec 6, 2017.

A quick search for that date shows that a large number of wildfires had occurred in the LA County area.

By also including DeviceCategory into the pivot chart, we see that the majority of that spike was driven by mobile traffic.
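The same spike hunt can be sketched in pandas. The rows below are toy values for illustration only, and the column names (Date, DeviceCategory, Visits) are assumptions, not necessarily the portal's exact schema:

```python
import pandas as pd

# Toy rows for illustration only; the real LA dataset is ~2.5 million rows
# and its exact column names may differ from the ones assumed here.
traffic = pd.DataFrame({
    "Date": ["2017-12-05", "2017-12-06", "2017-12-06", "2017-12-07"],
    "DeviceCategory": ["desktop", "mobile", "desktop", "mobile"],
    "Visits": [1000, 9000, 2000, 1200],
})

# Total visits per day, then locate the day of the spike.
daily = traffic.groupby("Date")["Visits"].sum()
spike_day = daily.idxmax()

# Break the spike day down by device to see what drove it.
by_device = (traffic[traffic["Date"] == spike_day]
             .groupby("DeviceCategory")["Visits"].sum())
```

On the toy data, `spike_day` comes back as the high-traffic date and `by_device` shows mobile dominating it, mirroring what the PivotBillions chart revealed.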

 


Data # 2: Google Play Store Apps

The second dataset in this post that we examined was the Google Play Store Apps data from Kaggle. The size of the data was not large, but it was interesting because it required joining two separate files, and involved some sentiment analysis.

Once the combined data was loaded into PivotBillions, we pivoted using the Category and Sentiment columns. The resulting pivot table showed Null and nan counts that did not contribute to our analysis, so we deselected them.

After a quick sort, we charted the data shown below. We can see that apps in the Game category had by far the most positive as well as negative reviews of any category, followed by Family apps.
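The deselect-then-pivot step translates naturally to a pandas crosstab. The review rows below are invented for illustration, not taken from the Kaggle files:

```python
import pandas as pd

# Toy review rows; the real Kaggle file has many more categories and columns.
reviews = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "FAMILY", "FAMILY"],
    "Sentiment": ["Positive", "Negative", "Positive", "Positive", None],
})

# Drop the Null/nan sentiment values that don't contribute to the analysis,
# then count reviews per Category and Sentiment.
clean = reviews[reviews["Sentiment"].isin(["Positive", "Neutral", "Negative"])]
counts = pd.crosstab(clean["Category"], clean["Sentiment"])
```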

By analyzing app Rating and Sentiment, we see that around the 4.4 rating level, the positive and negative reviews peak.

By filtering for only the Game category, we see that negative reviews slightly outweigh positive reviews at the 4.4 rating level.

The Wrap-up

Although we can't conclude anything definitive from the exploratory data analysis we conducted on either dataset, the results open up some intriguing paths for further investigation.  Perhaps by enriching with other datasets, we can get clearer confirmation of the actual cause of the traffic spike in the LA data and of why there is so much positive and negative sentiment in Game apps rated around 4.4.

Both these datasets are available for anyone to play with through the free public demo of our cloud version of PivotBillions.  Feel free to analyze them for yourself, or upload your own datasets and see what sort of interesting insight you can find in five minutes.

 

Finding Underutilized Kaggle Data
Fri, 09 Nov 2018
https://pivotbillions.com/finding-underutilized-kaggle-data/

Overview

In this 5 Minute Analysis we are exploring a Kaggle dataset about Kaggle datasets.  This dataset lets us see a list of the datasets on Kaggle and shows which ones have the most engagement and activity.  Our goal is to explore and filter the data to find popular datasets with many downloads but very few kernels.  As a contributor to Kaggle, you may want to add kernels to datasets that fit these criteria, as opposed to datasets with lots of kernels where your submission might get lost in the crowd.


Steps

Unlike in past posts, instead of listing the steps, we have simply included a video that shows the whole process from start to finish.  This includes downloading the data files, uploading to Pivot Billions, filtering the data, and then creating the pivot table.  The filters we used were kernel counts less than 5 and download counts greater than 100. We timed the process, and the whole thing takes slightly more than 3 minutes from start to finish.
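The same two filters can be sketched in pandas. The rows and column names below are illustrative guesses at the meta-dataset's schema, so adjust them to the real file:

```python
import pandas as pd

# Toy metadata rows; column names are assumptions about the Kaggle meta-dataset.
datasets = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "kernelCount": [0, 3, 12, 4],
    "downloadCount": [500, 80, 900, 300],
})

# Popular (well-downloaded) datasets that still have few or no kernels.
underutilized = datasets[(datasets["kernelCount"] < 5)
                         & (datasets["downloadCount"] > 100)]
```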

The Wrap-up

From the pivot table we can immediately see a variety of very popular datasets that have been downloaded thousands of times yet have very few or no kernels developed.

Many of these are likely underutilized datasets that may not be easily understood using existing tools and could benefit from additional exploration and analysis incorporating new tools such as PivotBillions. 

 

Simplifying Iowa Liquor Sales Data for Loading to Tableau
Fri, 19 Oct 2018
https://pivotbillions.com/simplifying-iowa-liquor-sales-data-for-loading-to-tableau/

 

Overview

In this 5 minute analysis, we pre-process, map, and explore complicated public sales data for liquor stores in Iowa to extract relevant latitude and longitude from a problematic column in the data.

We want to filter the data for the city with the most sales and prepare it for easy loading into the popular BI (business intelligence) tool, Tableau. Finally we use the extracted location data to visualize sales within the selected city using the Tableau Map.

Steps

1. Perform ETL and Load

In this example, we need to first do a little ETL (extract, transform, load) on one of the data columns before we load the data.  The data column we are targeting is the StoreLocation column, whose contents include both the address of the store and its latitude and longitude coordinates.

Example: 1414 48TH ST FORT MADISON 52627 (40.624226, -91.373211)

We want to extract only the latitude and longitude data and create two new columns with the extracted data.  To do this we do the following:

  1. Drag and drop the data file into Pivot Billions, select the Main checkbox option for that file and then click on Column Preview.
  2. Click the Plus (+) icon on the bottom left hand side of the window to add a new column.
  3. Click the edit mode icon to the left of the data type selector to switch to Advanced ETL mode.
  4. Enter latitude into the Column Name box.
  5. Enter -eval s:latitude '""' -eval - 'RxMap(StoreLocation,"^.*(\(.*\),\(.*\)).*$",latitude,"%%1%%")' into the Ess Syntax box.

    What's this Ess Syntax?  It is a simple scripting format that Pivot Billions uses to allow data pre-processing.  In this step, the "-eval" function that starts the script indicates that we are substituting our new variable latitude with content from somewhere else.  In this case, the content is coming from what's inside the parentheses of StoreLocation.

  6. Click the Plus (+) icon again to create a second new column.
  7. Click the icon to the left of the new edit box to switch to Advanced ETL mode.
  8. Enter longitude into the Column Name box.
  9. Enter -eval s:longitude '""' -eval - 'RxMap(StoreLocation,"^.*(\(.*\),\(.*\)).*$",longitude,"%%2%%")' into the Ess Syntax box.
  10. Confirm the output is what is expected by clicking Check.
  11. Click Import at the top of the screen.


Note:
This is a very large file, 3.22 GB with over 12 million rows, so we recommend that your system has at least 16 GB of memory and that your Docker settings allocate at least 7.5 GB for Docker containers.
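For readers who prefer scripting, the same coordinate extraction can be sketched with Python's re module. The pattern below is our own approximation of what the Ess RxMap call is doing, not Pivot Billions' internal implementation:

```python
import re

# The example StoreLocation value from above: address followed by "(lat, lon)".
store_location = "1414 48TH ST FORT MADISON 52627 (40.624226, -91.373211)"

# Pull the two coordinates out of the trailing parenthesized group.
match = re.search(r"\(([^,]+),\s*([^)]+)\)", store_location)
latitude, longitude = match.group(1), match.group(2)
```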

2. View and Explore the Data
We now have access to all 12,591,077 rows of the Iowa Liquor Sales data. This contains the original data in an easy tabular format along with the two new latitude and longitude features we’ve created.

We viewed the distribution of sales by city by clicking on the distribution charting icon in the City column.  Not surprisingly, Des Moines, the capital of Iowa and its most populous city, has the highest total number of liquor sales.

Now, for our analysis we’ll filter all of the data to only include data for the city of Des Moines.

  1. Click on the filter icon in the City column.
  2. Select Equals from the dropdown and then enter “Des Moines” and press enter. All of our data is immediately filtered to just Des Moines data.

The data is almost ready for Tableau or another business intelligence tool but the StoreLocation column is troublesome and prevents Tableau from reading the data. So we’ll go ahead and download the data without this column.

  1. Click the icon on the top right.
  2. Unselect the box next to StoreLocation to remove that column from your selected data.
  3. Now click the icon on the top right.
  4. Enter “DesMoinesSales” for the file name and then click Selected Columns.
  5. When the file is ready for download, click the icon at the bottom of the screen as seen below.

3. Import to Tableau and Graph the Data

We can now easily import this data into Tableau and use its powerful visualizations to dive into Des Moines liquor sales by location.

  1. Unzip the “DesMoinesSales.csv.zip” file we just downloaded from PivotBillions.
  2. Open Tableau and click Text under Connect.
  3. Navigate to the folder where the “DesMoinesSales.csv” file is located and Open it.
  4. Then click on Sheet1.
  5. Right click Latitude under Measures and select Convert to Discrete and then Convert to Dimension.
  6. Repeat this for Longitude.
  7. Drag Longitude to Columns and drag Latitude to Rows as seen below.
  8. Drag Sale Dollars to Size and drag whatever features you want to Detail.

 

The Wrap-up

It is clear from this symbol map plot that sales for Des Moines are highest near the city center. However, it is also clear that there are many high-volume locations throughout the city. We can now visualize the data however we see fit in Tableau. We could also easily remove our city filter from PivotBillions and load the whole preprocessed dataset into Tableau, or change our filter(s) to select a different subset or arrangement of the data.

To view and interact with this visualization or download the workbook to Tableau, see my Iowa Liquor Sales Workbook on Tableau Public.

 

Getting the Whole Picture of Retail Sales
Mon, 08 Oct 2018
https://pivotbillions.com/retail-sales/

Overview

It should be apparent that we like to use data sets found on Kaggle. If you are not familiar with Kaggle, you should check out their website to peruse the thousands of great datasets available for free to play around with.

In this 5 Minute Analysis of the Kaggle Retail Data Analytics we explore historical sales data for 45 retail locations. In addition, we join two related supplementary data files to the main sales data file to dive into the relation between store type, fuel prices, and other indicators and how they might affect average weekly sales.

 

Steps

1. Load the Data and View its Structure

In this example, the steps for loading data are a little different from our previous posts because we are joining two supplementary data files to the main data set. The steps to follow are:

  1. Download and unzip the data onto your machine and access the Pivot Billions UI on your browser.
  2. Click the Add (+) icon from the data loader screen and select Drag & Drop.
  3. Drag your downloaded Features data set.csv, sales data-set.csv, and stores data-set.csv files to the Drag & Drop box in Pivot Billions.
  4. Select the left checkbox (Main) next to the sales data-set.csv file and the right checkbox (Join) next to the Features data set.csv file and the stores data-set.csv file.
  5. Click Column Preview at the top of the screen. The resulting preview will show the column labels and data types for both the main dataset and the data we are joining to it.
  6. Pivot Billions will automatically find common keys between the data sets, identified by the Key symbol next to the column labels. You can modify the keys here if you want, but for this example you can just use the ones selected as is.
  7. Click Import at the top of the screen to load your data into the Report UI.
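The join the steps above describe can be sketched in pandas. The frames below are toy rows shaped like the three files, with Store (and Date) acting as the common keys:

```python
import pandas as pd

# Toy frames mimicking the shape of the sales, features, and stores files.
sales = pd.DataFrame({"Store": [1, 1, 2], "Date": ["2012-02-10"] * 3,
                      "Weekly_Sales": [24924.50, 1643.69, 41595.55]})
features = pd.DataFrame({"Store": [1, 2], "Date": ["2012-02-10"] * 2,
                         "Fuel_Price": [2.572, 2.548], "IsHoliday": [True, True]})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "A"],
                       "Size": [151315, 202307]})

# Join both supplementary files onto the main sales data, keeping every
# sales row even if a lookup key were missing (how="left").
joined = (sales.merge(features, on=["Store", "Date"], how="left")
               .merge(stores, on="Store", how="left"))
```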

2. View and Explore the Data
After the data has been imported, we can see a total of 421,569 rows. This contains the data and features from each of the three separate data files from our original dataset. We did some preliminary exploration by sorting the data to find highs and lows and selecting the Distribution option to see each column's distribution.

3. Pivot and Report the Data
Next we used the Pivot tool to do some quick analysis of a single dimension's relationship to total and average weekly sales. The basic steps are:

  1. Select and open the Pivot tool and then select a Dimension like IsHoliday or Temperature for the pivot table.
  2. Select Weekly Sales as our Value and click View.
  3. In the resulting table, select the Pivot View option and then drag and drop the dimension we want to analyze into the columns area.
  4. Change the visualization to Table Barchart and then select the data Summation option to see the effect of the dimension on total Weekly Sales.
  5. Next change from Summation to Average to see effect on average Weekly Sales.
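The Summation-versus-Average comparison in steps 4 and 5 maps directly onto pandas pivot tables. The rows below are invented for illustration, using IsHoliday as the single dimension:

```python
import pandas as pd

# Toy rows; the real dataset has 421,569 rows and more columns.
df = pd.DataFrame({
    "IsHoliday": [False, False, True, True],
    "Weekly_Sales": [100.0, 300.0, 50.0, 150.0],
})

# Effect of the dimension on total weekly sales (Summation)...
totals = df.pivot_table(values="Weekly_Sales", index="IsHoliday", aggfunc="sum")

# ...and on average weekly sales (Average).
averages = df.pivot_table(values="Weekly_Sales", index="IsHoliday", aggfunc="mean")
```

Swapping `aggfunc` is the scripted equivalent of toggling between Summation and Average in the pivot view.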

We did this same type of analysis for a few other dimensions which you can see below. We chose the Bar Chart visualization for these.

Store Type -Total

Store Size - Total

CPI - Total

Store Type - Average

Store Size - Average

CPI - Average

The Wrap-up

We are able to see that the CPI (consumer price index) and the IsHoliday dimensions affect total sales but have less of an effect on average sales.  The Store Type and Store Size, however, seem to have a stronger correlation to both the total and average weekly sales.

In five minutes, we were able to run these as well as similar pivot charts using Temperature and Unemployment dimensions.  These last two did not show a very strong correlation to total or average weekly sales.  The next steps if we wanted to dive deeper would be to try to use two or more dimensions and see how they might relate to sales behavior.  It would be a good idea to set some filters to better manage the size of the analysis.

 

Olympics Data Reveals Rising Gender Equality
Fri, 28 Sep 2018
https://pivotbillions.com/olympics-data-reveals-rising-gender-equality/

Overview

In this 5 Minute Analysis of the Kaggle historical Olympic athletic data we filtered and pivoted the data in various ways to understand the trends in participation by gender over time.

Steps

1. Load the Data and View its Structure

The steps for loading data were covered in a previous post.

2. View and Explore the Data
After the data has been imported, we can see a total of 271,116 rows. We clicked on the distribution icon for the Sex column to get some broad distributions of the data. The distribution of participation by sex shows that male participants outnumber female participants by 3 to 1.

3. Pivot and Report the Data
Next we used the Pivot tool to reorganize our data to dive into gender equality. The basic steps are to select and open the Pivot tool and then select Sex, Year, and Season as the Dimensions for the pivot table. We place the Sex dimension into the row area as our starting point of analysis. By dragging and dropping Season and then Year we see some interesting trends in the data.

Overall, the data shows that Summer games have a higher disparity between male and female participation compared to Winter games.  By adding Year into the columns area, we can clearly see the gradual trend of both genders nearing parity.  If you filter by Season and select Line Chart, we see that the Summer games have shown a more dramatic approach to parity compared to Winter.
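The Sex-by-Year pivot can be sketched as a crosstab in pandas; dividing out the female share per Games makes the trend toward parity explicit. The athlete entries below are toy rows, not the real Kaggle records:

```python
import pandas as pd

# Toy athlete entries; the real file has 271,116 rows.
athletes = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F", "F"],
    "Year": [1900, 1900, 1900, 2016, 2016, 2016],
})

# Participants per Year and Sex, then the female share of each Games.
parity = pd.crosstab(athletes["Year"], athletes["Sex"])
female_share = parity["F"] / (parity["F"] + parity["M"])
```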

The Wrap-up

The data is clear.  Participation in the Olympics is getting closer to gender parity with each successive Games.  Is it possible that we may even see more female athletes than male in the near future?  In 2016, the US team fielded more female athletes than male, and the US has historically sent far more athletes to the Olympics than any other country, so it's not outside the realm of possibility.

 

Breaking Down Los Angeles Health Code Violations
Fri, 21 Sep 2018
https://pivotbillions.com/breaking-down-los-angeles-health-code-violations/

This is the first in a series of practical application examples of Pivot Billions that we will be calling 5-Minute Analysis.  Each week, we'll take an interesting open data set and use Pivot Billions to find interesting insights within a five minute span.

Overview

Accessing and understanding the Kaggle LA Restaurant & Market Health Data in real time, exploring the data, and pivoting it to report the top violators of the health code and their violations.

Goals

  1. Load the data to Pivot Billions and view its structure.
  2. Explore the data using Pivot Billions built-in features.
  3. Pivot the data to organize it by violator name and violation to see the worst violators and report our findings.

Steps

Load the Data and View its Structure

  1. Download the dataset from Kaggle and unzip your downloaded data.
  2. Access the Pivot Billions URL for your machine and click the Plus icon on the top right hand side of the window.
  3. Select Drag & Drop and drag your downloaded “restaurant-and-market-health-violations.csv” file into Pivot Billions.
  4. Then select the left checkbox next to the file and click Preview at the bottom of the screen.
  5. You can now see the columns and types of the dataset and modify them as you see fit. You can also view or change which column or columns are set as primary keys.
  6. When you are done viewing or modifying the data structure to be imported, click Import.

View and Explore the Data
After the data has been quickly imported you can now see and access all 272,801 rows of the data. By hovering over each column name you can sort the data by that column, view that column’s distribution over all of the data, filter by the data in that column, or rename that column. We’ll view the distribution of the data by the facility name and owner name.

  1. Click on the distribution icon for the facility name column to see the distribution of total health code violations by facility.
  2. Click on the distribution icon for the owner name column to see the distribution of total health code violations by owner.

You can quickly see that Dodger Stadium is the facility with the most number of violations while Ralphs Grocery is the owner with the highest number of violations.

Pivot and Report the Data
Now that we know which facility had the highest number of violations, we want to drill down into the data and see which health codes were violated. This is made extremely simple and fast using Pivot Billions.

  1. First hover over the facility name column and click on the filter icon.
  2. Set the filter condition to "Contains" and enter "Dodger" in the field beneath.
  3. Click on the Pivot icon and select "violation_code" and "violation_description" as your dimensions.
  4. Click on the View button to create the pivot table.
  5. In the newly created table, select the Pivot View option.
  6. Drag and drop both dimension labels into the row section of the pivot table and sort vertically.
  7. Select the bar chart visualization to see a graphical comparison of the violations.
  8. Hover over any bar to get details for that item.
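The filter-then-pivot sequence above can be sketched in pandas. The column names follow the dataset, but the violation codes and descriptions below are invented for illustration:

```python
import pandas as pd

# Toy violation records; column names follow the dataset, values are invented.
violations = pd.DataFrame({
    "facility_name": ["DODGER STADIUM", "DODGER STADIUM",
                      "RALPHS", "DODGER STADIUM"],
    "violation_code": ["F044", "F033", "F044", "F044"],
    "violation_description": ["Floor cleaning", "Food storage",
                              "Floor cleaning", "Floor cleaning"],
})

# Filter facilities containing "Dodger", then count violations per
# code/description pair, most frequent first.
dodger = violations[violations["facility_name"].str.contains("DODGER", case=False)]
counts = (dodger.groupby(["violation_code", "violation_description"])
                .size().sort_values(ascending=False))
```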

The Wrap-up

So it looks like Dodger Stadium has some work to do.  I might have to think twice next time I consider getting a Dodger Dog.

That wraps up this 5-Minute Analysis.  Check back next week for another quick analysis of a new data set using Pivot Billions.
