5 Minute Analysis – Pivot Billions
https://pivotbillions.com

5 Minute Analysis of Data.gov Datasets with PivotBillions
Fri, 19 Apr 2019
https://pivotbillions.com/5-minute-analysis-of-data-gov-datasets-with-pivotbillions/

Overview

Data.gov is an open data portal provided by the US government that offers a wealth of interesting information on agriculture, education, finance, health, science, and more. It currently lists over 200,000 datasets available to explore.

In this 5 Minute Analysis, we picked one dataset from the consumer data category, specifically the Financial Services Consumer Complaint Database. This dataset is provided by the Consumer Financial Protection Bureau, or CFPB, and is a record of complaints received by the CFPB about various financial products and services offered by a multitude of companies.

The data comprises over 1.2 million rows in CSV format and is about 700MB in size. Loading the data was simple using the drag-and-drop feature in PivotBillions to upload it to our cloud-based demo portal.

EDA

Doing some quick exploratory data analysis (EDA), here are some general trends that we can see in the data.

Since the launch of the CFPB back in 2011, the number of recorded complaints has gone up year over year.

In that time, the top financial products or services associated with complaints were the following:

By far, mortgage, debt collection, and credit reporting outpace other types of products or services in generating complaints.

The top recipients of these complaints include the major credit reporting companies and banks.

The top five states for consumer complaints are California, Florida, Texas, New York, and, surprisingly, Georgia.  Georgia is a bit of an outlier considering it is only the 8th largest US state by population, yet more complaints originated there than from Illinois, which has a larger population.

The overwhelmingly preferred method for reporting complaints is online, via the web.

Out of the top three major financial products/services reported on, we see a trend of mortgage-related complaints declining while debt collection and credit reporting complaints have grown significantly over the same period.

The Wrap-up

What insights did we gain through our EDA? The declining mortgage complaints seem to indicate a steady recovery from the subprime mortgage crisis of 2007-2008, although there is still friction with consumers trying to modify loans or avoid foreclosure. The slow rise of debt collection complaints may indicate that consumers are taking on more credit card or loan debt and that the rate of default is rising. The most drastic increase in complaints is with credit reporting/credit repair services. The huge spike in September 2017 is primarily a result of the Equifax data breach announcement, but even before then there appears to have been a steady rise in complaints about how the credit reporting agencies were handling information in consumer credit reports.

A lot more can be gleaned by delving deeper into the data, but that might have to be reserved for another post. Currently, this dataset is available in our public demo portal for anyone to play with.  Try it for yourself and see what insights you might find.

Data Twofer: Exploring LA Web Traffic and Google Appstore Sentiment
Fri, 25 Jan 2019
https://pivotbillions.com/data-twofer-exploring-la-web-traffic-and-google-appstore-sentiment/

Overview

It's been a few weeks since we posted something in our 5 Minute Analysis series, so we decided to do two quick analyses on two different datasets we found on Kaggle.  Instead of doing the analysis locally using PivotBillions on Docker, we opted to run the analysis in the free cloud version available on the PivotBillions website.

 

Data # 1: Los Angeles Website Traffic

The first is Los Angeles Website Traffic data from the LA open data portal. This data provides web traffic statistics from January 2014 to January 2019.

The dataset was moderately large at 127MB, with about 2.5 million rows. Some basic exploratory data analysis surfaced a few interesting insights.

A quick pivot based on Date and Visitor counts shows a huge spike in the number of visitors to the LAcity.org website as seen above. By zooming in on the date range of the spike in the graph, we can see that it occurred on Dec 6, 2017.

A quick search for that date shows that a large number of wildfires had occurred in the LA County area.

By also including DeviceCategory into the pivot chart, we see that the majority of that spike was driven by mobile traffic.
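The same spike hunt can be sketched in pandas. The rows below are toy values for illustration only, and the column names (Date, DeviceCategory, Visits) are assumptions, not necessarily the portal's exact schema:

```python
import pandas as pd

# Toy rows for illustration only; the real LA dataset is ~2.5 million rows
# and its exact column names may differ from the ones assumed here.
traffic = pd.DataFrame({
    "Date": ["2017-12-05", "2017-12-06", "2017-12-06", "2017-12-07"],
    "DeviceCategory": ["desktop", "mobile", "desktop", "mobile"],
    "Visits": [1000, 9000, 2000, 1200],
})

# Total visits per day, then locate the day of the spike.
daily = traffic.groupby("Date")["Visits"].sum()
spike_day = daily.idxmax()

# Break the spike day down by device to see what drove it.
by_device = (traffic[traffic["Date"] == spike_day]
             .groupby("DeviceCategory")["Visits"].sum())
```

On the toy data, `spike_day` comes back as the high-traffic date and `by_device` shows mobile dominating it, mirroring what the PivotBillions chart revealed.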

 


Data # 2: Google Play Store Apps

The second dataset in this post that we examined was the Google Play Store Apps data from Kaggle. The size of the data was not large, but it was interesting because it required joining two separate files, and involved some sentiment analysis.

Once the combined data was loaded into PivotBillions, we pivoted using the Category and Sentiment columns. The resulting pivot table showed Null and nan counts that did not contribute to our analysis, so we deselected them.

After a quick sort, we charted the data shown below. We can see that apps in the Game category had by far the most positive as well as negative reviews of any category, followed by Family apps.
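The deselect-then-pivot step translates naturally to a pandas crosstab. The review rows below are invented for illustration, not taken from the Kaggle files:

```python
import pandas as pd

# Toy review rows; the real Kaggle file has many more categories and columns.
reviews = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "FAMILY", "FAMILY"],
    "Sentiment": ["Positive", "Negative", "Positive", "Positive", None],
})

# Drop the Null/nan sentiment values that don't contribute to the analysis,
# then count reviews per Category and Sentiment.
clean = reviews[reviews["Sentiment"].isin(["Positive", "Neutral", "Negative"])]
counts = pd.crosstab(clean["Category"], clean["Sentiment"])
```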

By analyzing app Rating and Sentiment, we see that around the 4.4 rating level, the positive and negative reviews peak.

By filtering for only the Game category, we see that negative reviews slightly outweigh positive reviews at the 4.4 rating level.

The Wrap-up

Although we can't conclude anything definitive from the exploratory data analysis we conducted on either dataset, the results open up some intriguing paths for further investigation.  Perhaps by enriching with other datasets, we can get clearer confirmation of the actual cause of the traffic spike in the LA data and of why there is so much positive and negative sentiment in Game apps rated around 4.4.

Both these datasets are available for anyone to play with through the free public demo of our cloud version of PivotBillions.  Feel free to analyze them for yourself, or upload your own datasets and see what sort of interesting insight you can find in five minutes.

 

Finding Underutilized Kaggle Data
Fri, 09 Nov 2018
https://pivotbillions.com/finding-underutilized-kaggle-data/

Overview

In this 5 Minute Analysis we are exploring a Kaggle dataset about Kaggle datasets.  This dataset lets us see a list of the datasets on Kaggle and shows which ones have the most engagement and activity.  Our goal is to explore and filter the data to find popular datasets with many downloads but very few kernels.  As a contributor to Kaggle, you may want to add kernels to datasets that fit these criteria, as opposed to datasets with lots of kernels where your submission might get lost in the crowd.


Steps

Unlike in past posts, instead of listing the steps, we have simply included a video that shows the whole process from start to finish.  This includes downloading the data files, uploading to Pivot Billions, filtering the data, and then creating the pivot table.  The filters we used were kernel counts less than 5 and download counts greater than 100. We timed the process, and the whole thing takes slightly more than 3 minutes from start to finish.
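The same two filters can be sketched in pandas. The rows and column names below are illustrative guesses at the meta-dataset's schema, so adjust them to the real file:

```python
import pandas as pd

# Toy metadata rows; column names are assumptions about the Kaggle meta-dataset.
datasets = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "kernelCount": [0, 3, 12, 4],
    "downloadCount": [500, 80, 900, 300],
})

# Popular (well-downloaded) datasets that still have few or no kernels.
underutilized = datasets[(datasets["kernelCount"] < 5)
                         & (datasets["downloadCount"] > 100)]
```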

The Wrap-up

From the pivot table we can immediately see a variety of very popular datasets that have been downloaded thousands of times yet have very few or no kernels developed.

Many of these are likely underutilized datasets that may not be easily understood using existing tools and could benefit from additional exploration and analysis incorporating new tools such as PivotBillions. 

 

Simplifying Iowa Liquor Sales Data for Loading to Tableau
Fri, 19 Oct 2018
https://pivotbillions.com/simplifying-iowa-liquor-sales-data-for-loading-to-tableau/

 

Overview

In this 5 minute analysis, we pre-process, map, and explore complicated public sales data for liquor stores in Iowa to extract relevant latitude and longitude from a problematic column in the data.

We want to filter the data for the city with the most sales and prepare it for easy loading into the popular BI (business intelligence) tool, Tableau. Finally we use the extracted location data to visualize sales within the selected city using the Tableau Map.

Steps

1. Perform ETL and Load

In this example, we need to first do a little ETL (extract, transform, load) on one of the data columns before we load the data.  The data column we are targeting is the StoreLocation column, whose contents include both the address of the store and its latitude and longitude coordinates.

Example: 1414 48TH ST FORT MADISON 52627 (40.624226, -91.373211)

We want to extract only the latitude and longitude data and create two new columns with the extracted data.  To do this we do the following:

  1. Drag and drop the data file into Pivot Billions, select the Main checkbox option for that file and then click on Column Preview.
  2. Click the Plus (+) icon on the bottom left hand side of the window to add a new column.
  3. Click the edit mode icon to the left of the data type selector to switch to Advanced ETL mode.
  4. Enter latitude into the Column Name box.
  5. Enter -eval s:latitude '""' -eval - 'RxMap(StoreLocation,"^.*(\(.*\),\(.*\)).*$",latitude,"%%1%%")' into the Ess Syntax box.

    What's this Ess Syntax?  It is a simple scripting format that Pivot Billions uses to allow data pre-processing.  In this step, the "-eval" function that starts the script indicates that we are substituting our new variable latitude with content from somewhere else.  In this case, the content is coming from what's inside the parentheses of StoreLocation.

  6. Click the Plus (+) icon again to create a second new column.
  7. Click the icon to the left of the new edit box to switch to Advanced ETL mode.
  8. Enter longitude into the Column Name box.
  9. Enter -eval s:longitude '""' -eval - 'RxMap(StoreLocation,"^.*(\(.*\),\(.*\)).*$",longitude,"%%2%%")' into the Ess Syntax box.
  10. Confirm the output is what is expected by clicking Check.
  11. Click Import at the top of the screen.


Note:
This is a very large file, 3.22 GB with over 12 million rows, so we recommend that your system has at least 16 GB of memory and that your Docker settings allocate at least 7.5 GB for Docker containers.
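For readers who prefer scripting, the same coordinate extraction can be sketched with Python's re module. The pattern below is our own approximation of what the Ess RxMap call is doing, not Pivot Billions' internal implementation:

```python
import re

# The example StoreLocation value from above: address followed by "(lat, lon)".
store_location = "1414 48TH ST FORT MADISON 52627 (40.624226, -91.373211)"

# Pull the two coordinates out of the trailing parenthesized group.
match = re.search(r"\(([^,]+),\s*([^)]+)\)", store_location)
latitude, longitude = match.group(1), match.group(2)
```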

2. View and Explore the Data
We now have access to all 12,591,077 rows of the Iowa Liquor Sales data. This contains the original data in an easy tabular format along with the two new latitude and longitude features we’ve created.

We viewed the distribution of sales by city by clicking on the distribution charting icon in the City column.  Not surprisingly, Des Moines, the capital of Iowa and its most populous city, has the highest total number of liquor sales.

Now, for our analysis we’ll filter all of the data to only include data for the city of Des Moines.

  1. Click on the filter icon in the City column.
  2. Select Equals from the dropdown and then enter “Des Moines” and press enter. All of our data is immediately filtered to just Des Moines data.

The data is almost ready for Tableau or another business intelligence tool but the StoreLocation column is troublesome and prevents Tableau from reading the data. So we’ll go ahead and download the data without this column.

  1. Click the icon on the top right.
  2. Unselect the box next to StoreLocation to remove that column from your selected data.
  3. Now click the icon on the top right.
  4. Enter “DesMoinesSales” for the file name and then click Selected Columns.
  5. When the file is ready for download, click the icon at the bottom of the screen as seen below.

3. Import to Tableau and Graph the Data

We can now easily import this data into Tableau and use its powerful visualizations to dive into Des Moines liquor sales by location.

  1. Unzip the “DesMoinesSales.csv.zip” file we just downloaded from PivotBillions.
  2. Open Tableau and click Text under Connect.
  3. Navigate to the folder where the “DesMoinesSales.csv” file is located and Open it.
  4. Then click on Sheet1.
  5. Right click Latitude under Measures and select Convert to Discrete and then Convert to Dimension.
  6. Repeat this for Longitude.
  7. Drag Longitude to Columns and drag Latitude to Rows as seen below.
  8. Drag Sale Dollars to Size and drag whatever features you want to Detail.

 

The Wrap-up

It is clear from this symbol map plot that sales for Des Moines are highest near the city center. However, it is also clear that there are many high-volume locations throughout the city. We can now visualize the data however we see fit in Tableau. We could also easily remove our city filter from PivotBillions and load the whole preprocessed dataset into Tableau, or change our filter(s) to select a different subset or arrangement of the data.

To view and interact with this visualization or download the workbook to Tableau, see my Iowa Liquor Sales Workbook on Tableau Public.

 

Getting the Whole Picture of Retail Sales
Mon, 08 Oct 2018
https://pivotbillions.com/retail-sales/

Overview

It should be apparent that we like to use data sets found on Kaggle. If you are not familiar with Kaggle, you should check out their website to peruse the thousands of great datasets available for free to play around with.

In this 5 Minute Analysis of the Kaggle Retail Data Analytics we explore historical sales data for 45 retail locations. In addition, we join two related supplementary data files to the main sales data file to dive into the relation between store type, fuel prices, and other indicators and how they might affect average weekly sales.

 

Steps

1. Load the Data and View its Structure

In this example, the steps for loading data are a little different from our previous posts because we are joining two supplementary data files to the main data set. The steps to follow are:

  1. Download and unzip the data onto your machine and access the Pivot Billions UI on your browser.
  2. Click the Add (+) icon from the data loader screen and select Drag & Drop.
  3. Drag your downloaded Features data set.csv, sales data-set.csv, and stores data-set.csv files to the Drag & Drop box in Pivot Billions.
  4. Select the left checkbox (Main) next to the sales data-set.csv file and the right checkbox (Join) next to the Features data set.csv file and the stores data-set.csv file.
  5. Click Column Preview at the top of the screen. The resulting preview will show the column labels and data types for both the main dataset and the data we are joining to it.
  6. Pivot Billions will automatically find common keys between the data sets, identified by the Key symbol next to the column labels. You can modify the keys here if you want, but for this example you can just use the ones selected as is.
  7. Click Import at the top of the screen to load your data into the Report UI.
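The join the steps above describe can be sketched in pandas. The frames below are toy rows shaped like the three files, with Store (and Date) acting as the common keys:

```python
import pandas as pd

# Toy frames mimicking the shape of the sales, features, and stores files.
sales = pd.DataFrame({"Store": [1, 1, 2], "Date": ["2012-02-10"] * 3,
                      "Weekly_Sales": [24924.50, 1643.69, 41595.55]})
features = pd.DataFrame({"Store": [1, 2], "Date": ["2012-02-10"] * 2,
                         "Fuel_Price": [2.572, 2.548], "IsHoliday": [True, True]})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "A"],
                       "Size": [151315, 202307]})

# Join both supplementary files onto the main sales data, keeping every
# sales row even if a lookup key were missing (how="left").
joined = (sales.merge(features, on=["Store", "Date"], how="left")
               .merge(stores, on="Store", how="left"))
```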

2. View and Explore the Data
After the data has been imported, we can see a total of 421,569 rows. This contains the data and features from each of the three separate data files from our original dataset. We did some preliminary exploration by sorting the data to find highs and lows and selecting the Distribution option to see each column's distribution.

3. Pivot and Report the Data
Next we used the Pivot tool to do some quick analysis of a single dimension's relationship to total and average weekly sales. The basic steps are:

  1. Select and open the Pivot tool and then select a Dimension like IsHoliday or Temperature for the pivot table.
  2. Select Weekly Sales as our Value and click View.
  3. In the resulting table, select the Pivot View option and then drag and drop the dimension we want to analyze into the columns area.
  4. Change the visualization to Table Barchart and then select the data Summation option to see the effect of the dimension on total Weekly Sales.
  5. Next change from Summation to Average to see effect on average Weekly Sales.
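The Summation-versus-Average comparison in steps 4 and 5 maps directly onto pandas pivot tables. The rows below are invented for illustration, using IsHoliday as the single dimension:

```python
import pandas as pd

# Toy rows; the real dataset has 421,569 rows and more columns.
df = pd.DataFrame({
    "IsHoliday": [False, False, True, True],
    "Weekly_Sales": [100.0, 300.0, 50.0, 150.0],
})

# Effect of the dimension on total weekly sales (Summation)...
totals = df.pivot_table(values="Weekly_Sales", index="IsHoliday", aggfunc="sum")

# ...and on average weekly sales (Average).
averages = df.pivot_table(values="Weekly_Sales", index="IsHoliday", aggfunc="mean")
```

Swapping `aggfunc` is the scripted equivalent of toggling between Summation and Average in the pivot view.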

We did this same type of analysis for a few other dimensions which you can see below. We chose the Bar Chart visualization for these.

Store Type -Total

Store Size - Total

CPI - Total

Store Type - Average

Store Size - Average

CPI - Average

The Wrap-up

We are able to see that the CPI (consumer price index) and the IsHoliday dimensions affect total sales but have less of an effect on average sales.  The Store Type and Store Size, however, seem to have a stronger correlation to both the total and average weekly sales.

In five minutes, we were able to run these as well as similar pivot charts using Temperature and Unemployment dimensions.  These last two did not show a very strong correlation to total or average weekly sales.  The next steps if we wanted to dive deeper would be to try to use two or more dimensions and see how they might relate to sales behavior.  It would be a good idea to set some filters to better manage the size of the analysis.

 

Olympics Data Reveals Rising Gender Equality
Fri, 28 Sep 2018
https://pivotbillions.com/olympics-data-reveals-rising-gender-equality/

Overview

In this 5 Minute Analysis of the Kaggle historical Olympic athletic data we filtered and pivoted the data in various ways to understand the trends in participation by gender over time.

Steps

1. Load the Data and View its Structure

The steps for loading data were covered in a previous post.

2. View and Explore the Data
After the data has been imported, we can see a total of 271,116 rows. We clicked on the distribution icon for the Sex column to get some broad distributions of the data. The distribution of participation by sex shows that male participants outnumber female participants by 3 to 1.

3. Pivot and Report the Data
Next we used the Pivot tool to reorganize our data to dive into gender equality. The basic steps are to select and open the Pivot tool and then select Sex, Year, and Season as the Dimensions for the pivot table. We place the Sex dimension into the row area as our starting point of analysis. By dragging and dropping Season and then Year we see some interesting trends in the data.

Overall, the data shows that Summer games have a higher disparity between male and female participation compared to Winter games.  By adding Year into the columns area, we can clearly see the gradual trend of both genders nearing parity.  If you filter by Season and select Line Chart, we see that the Summer games have shown a more dramatic approach to parity compared to Winter.
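The Sex-by-Year pivot can be sketched as a crosstab in pandas; dividing out the female share per Games makes the trend toward parity explicit. The athlete entries below are toy rows, not the real Kaggle records:

```python
import pandas as pd

# Toy athlete entries; the real file has 271,116 rows.
athletes = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F", "F"],
    "Year": [1900, 1900, 1900, 2016, 2016, 2016],
})

# Participants per Year and Sex, then the female share of each Games.
parity = pd.crosstab(athletes["Year"], athletes["Sex"])
female_share = parity["F"] / (parity["F"] + parity["M"])
```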

The Wrap-up

The data is clear.  Participation in the Olympics is getting closer to gender parity with each successive Games.  Is it possible that we may even see more female athletes than male in the near future?  In 2016, the US team fielded more female athletes than male, and the US has historically sent far more athletes to the Olympics than any other country, so it's not outside the realm of possibility.

 

Breaking Down Los Angeles Health Code Violations
Fri, 21 Sep 2018
https://pivotbillions.com/breaking-down-los-angeles-health-code-violations/

This is the first in a series of practical application examples of Pivot Billions that we will be calling 5-Minute Analysis.  Each week, we'll take an interesting open data set and use Pivot Billions to find interesting insights within a five minute span.

Overview

Accessing and understanding the Kaggle LA Restaurant & Market Health Data in real time, exploring the data, and pivoting it to report the top violators of the health code and their violations.

Goals

  1. Load the data to Pivot Billions and view its structure.
  2. Explore the data using Pivot Billions built-in features.
  3. Pivot the data to organize it by violator name and violation to see the worst violators and report our findings.

Steps

Load the Data and View its Structure

  1. Download the dataset from Kaggle and unzip your downloaded data.
  2. Access the Pivot Billions URL for your machine and click the Plus icon on the top right hand side of the window.
  3. Select Drag & Drop and drag your downloaded “restaurant-and-market-health-violations.csv” file into Pivot Billions.
  4. Then select the left checkbox next to the file and click Preview at the bottom of the screen.
  5. You can now see the columns and types of the dataset and modify them as you see fit. You can also view or change which column or columns are set as primary keys.
  6. When you are done viewing or modifying the data structure to be imported, click Import.

View and Explore the Data
After the data has been quickly imported you can now see and access all 272,801 rows of the data. By hovering over each column name you can sort the data by that column, view that column’s distribution over all of the data, filter by the data in that column, or rename that column. We’ll view the distribution of the data by the facility name and owner name.

  1. Click on the distribution icon for the facility name column to see the distribution of total health code violations by facility.
  2. Click on the distribution icon for the owner name column to see the distribution of total health code violations by owner.

You can quickly see that Dodger Stadium is the facility with the most number of violations while Ralphs Grocery is the owner with the highest number of violations.

Pivot and Report the Data
Now that we know which facility had the highest number of violations, we want to drill down into the data and see which health codes were violated. This is made extremely simple and fast using Pivot Billions.

  1. First hover over the facility name column and click on the filter icon.
  2. Set the filter condition to "Contains" and enter "Dodger" in the field beneath.
  3. Click on the Pivot icon and select "violation_code" and "violation_description" as your dimensions.
  4. Click on the View button to create the pivot table.
  5. In the newly created table, select the Pivot View option.
  6. Drag and drop both dimension labels into the row section of the pivot table and sort vertically.
  7. Select the bar chart visualization to see a graphical comparison of the violations.
  8. Hover over any bar to get details for that item.
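The filter-then-pivot sequence above can be sketched in pandas. The column names follow the dataset, but the violation codes and descriptions below are invented for illustration:

```python
import pandas as pd

# Toy violation records; column names follow the dataset, values are invented.
violations = pd.DataFrame({
    "facility_name": ["DODGER STADIUM", "DODGER STADIUM",
                      "RALPHS", "DODGER STADIUM"],
    "violation_code": ["F044", "F033", "F044", "F044"],
    "violation_description": ["Floor cleaning", "Food storage",
                              "Floor cleaning", "Floor cleaning"],
})

# Filter facilities containing "Dodger", then count violations per
# code/description pair, most frequent first.
dodger = violations[violations["facility_name"].str.contains("DODGER", case=False)]
counts = (dodger.groupby(["violation_code", "violation_description"])
                .size().sort_values(ascending=False))
```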

The Wrap-up

So it looks like Dodger Stadium has some work to do.  I might have to think twice next time I consider getting a Dodger Dog.

That wraps up this 5-Minute Analysis.  Check back next week for another quick analysis of a new data set using Pivot Billions.
