ETL – Pivot Billions https://pivotbillions.com

Real-time analysis of 50 billion records of IoT data https://pivotbillions.com/real-time-analysis-of-50-billion-records-of-iot-data/ Wed, 16 Sep 2020 23:59:46 +0000

The demand for analysis of IoT data has grown exponentially with the explosion of connected devices. Unfortunately, the cost and time required to analyze this data have also grown exponentially as data volumes keep getting larger and larger.

An enterprise client that collects data from millions of devices has been wrestling with a growing pain: the volume of data now exceeds the capacity of their conventional data processing systems.

Their pain stems not only from the size of the data but also from a lack of agility. Their analysis requirements change rapidly, and conventional systems simply could not adapt to such changes in a timely or economical manner.

Their latest requirement was to process 50 billion records of GPS data against a variety of analytic specifications and identify particular behavioral patterns of interest to their customers. Such dynamic requirements made conventional batch data processing very difficult and expensive.

They researched numerous products from different vendors and chose AuriQ's PivotBillions, a massively parallel, in-memory data processing and analysis solution. PivotBillions enabled the client to analyze their entire 50 billion records in real time: analytic queries, including ad-hoc queries, against the entire dataset could be processed in seconds or tens of seconds, which allowed data analysts to test their hypotheses very efficiently.


Fig 1: An example visualization of the 50 billion analyzed records of GPS data, showing how devices move before and after visiting the airport.

Because PivotBillions is a software solution that runs on Amazon Web Services, it did not require any special hardware. PivotBillions' Excel-like user interface allows analysts to work on data immediately, without coding or a learning curve.

The total system cost to analyze the 50 billion records using PivotBillions on AWS was easily less than one tenth that of conventional systems. It took only a few weeks to complete the analysis and visualization tasks for the various requirements.

Facts

  • Records: 50 billion records from a few million devices
  • Size: 6 TB in 365 compressed files
  • Repository: AWS S3
  • Instances: AWS EC2 m5.large (up to 500 concurrent EC2 instances)
  • Time to preprocess and load: 30 minutes (from original data in S3)
    • Conventional system: took a few days to load a partial sample of the data into a database
  • Response time of queries on the whole 50 billion records: a few seconds to a few tens of seconds
    • Conventional system: took a few hours to days to process a query against a partial sample of the data.
    • Performance varies slightly depending on AWS conditions.

 

A trial version of the PivotBillions service is available for free. Click here to request a demo or sign up for a free account.

Maximizing Sales with Market Basket Analysis https://pivotbillions.com/maximizing-sales-with-market-basket-analysis/ Mon, 08 Apr 2019 17:53:12 +0000

Sales data analysis can provide a wealth of insights for any business, but such data is rarely made available to the public. In 2018, however, a retail chain provided Black Friday sales data on Kaggle as part of a Kaggle competition. Although the store and product lines are anonymized, the dataset presents a great learning opportunity to find business insights! In this post, we'll cover how to prepare data, perform basic analysis, and glean additional insights via a technique called Market Basket Analysis.

Let's see what the data looks like. We use Pivot Billions to analyze and manipulate large amounts of data through an intuitive, familiar spreadsheet-style interface. After importing, we see at the bottom that the data contains over 500K rows, along with example data for each column.

Visualizing the distribution of each column is easy with a simple click of the column name. Overall the data is clean, but Product_Category_2 and Product_Category_3 have many missing values (NAs). They appear to be subcategories of Product_Category_1, which has no missing values, so let's ignore them for now.

Our next step is preparing the data for analysis. First, "Purchase" needs to be divided by 100 so that cents appear after the decimal point. We define a float column called Total_Purchase and set it to ToF(Purchase)/100.
Second, let's make a column combining gender with age for a bar plot. We define a string column called Gender_Age and set it to Gender + Age, as in the screenshot below.
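For readers working outside Pivot Billions, the same two derived columns can be sketched in pandas. The toy DataFrame below is a stand-in for the real file; only the column names (from the Kaggle dataset) are taken from the post.

```python
import pandas as pd

# Stand-in for the Black Friday dataset (column names as on Kaggle)
data = pd.DataFrame({
    'Gender': ['F', 'M', 'M'],
    'Age': ['0-17', '26-35', '26-35'],
    'Purchase': [837000, 1542000, 422000],  # stored as integer cents
})

# Total_Purchase: convert the integer cents into a float dollar amount
data['Total_Purchase'] = data['Purchase'].astype(float) / 100

# Gender_Age: concatenate the two categorical columns for plotting
data['Gender_Age'] = data['Gender'] + data['Age']

print(data[['Total_Purchase', 'Gender_Age']])
```

Pivot Billions applies the same transformation across all 500K+ rows without pulling the data into local memory.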

In Pivot Billions, we can pivot data as in Excel. This makes it easy to examine customer demographics, sales per city, occupation, product category, and so on. The bar chart below shows total purchases by product category.

We see that product category 1 is the most popular. It turns out that it generates about 40% of total purchases in this retail chain, and combined purchases from categories 1, 5, and 8 account for more than 70%.
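The category breakdown can be reproduced with a pandas pivot. This is a sketch on invented toy numbers, not the actual Kaggle values; the real analysis runs on the full dataset.

```python
import pandas as pd

# Toy sample with the dataset's Product_Category_1 and Purchase columns
data = pd.DataFrame({
    'Product_Category_1': [1, 1, 5, 8, 5, 1],
    'Purchase': [100.0, 200.0, 50.0, 80.0, 70.0, 150.0],
})

# Total purchases by product category, largest first
by_category = (data.pivot_table(index='Product_Category_1',
                                values='Purchase', aggfunc='sum')
                   .sort_values('Purchase', ascending=False))

# Share of total purchases per category
share = by_category['Purchase'] / by_category['Purchase'].sum()
print(share)
```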

Next, let's examine the data a little more closely! The bar chart below shows total purchases and customer demographics by city.

City B has the most sales, and in every city the majority of customers are males in their 20s and 30s. City A, however, has a lower share of total purchases from customers over 35 than the other two cities, suggesting an untapped opportunity to target older demographics in that area. Each city's overall demographics aren't revealed in this dataset, so understanding what actually drives these differences is a realistic next step. Still, this gives us solid insight to discuss with marketing teams.

Lastly, let's do a Market Basket Analysis, which uses association rule mining on transaction data to discover interesting associations between products! I'm going to use the Apriori algorithm in Python; R also has an Apriori implementation. For further information, please check out the following links:
Apriori (Python): http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/#frequent-itemsets-via-apriori-algorithm
Apriori (R): https://www.rdocumentation.org/packages/arules/versions/1.6-3/topics/apriori

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

#Read the dataset
data = pd.read_csv('./BlackFriday.csv')

#Create the basket: one row per customer, one column per product
basket_data = data.loc[:, ['User_ID', 'Product_ID']]
count = (basket_data.groupby(['User_ID', 'Product_ID'])
                    .size().reset_index(name='Count'))
basket = (count.pivot(index='User_ID', columns='Product_ID', values='Count')
               .fillna(0))

#For this dataset, all values are either 1 or 0, so this "basket" can be
#passed straight to the apriori algorithm.
#Keep itemsets with support >= 0.05
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)
#Filter to rules with lift >= 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
#Sort by lift, descending
rules = rules.sort_values(by='lift', ascending=False)

We’ll explain briefly what the metrics indicate.

  • Support represents how frequently the items appear in the data.
  • Confidence represents how frequently the if(antecedents)-then(consequences) statements are found to be true.
  • Lift represents how much larger or smaller the confidence is than the expected confidence. A lift greater than 1.0 implies that the association between the items is stronger than would be expected if they were independent; the larger the lift, the stronger the association.
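To make the three metrics concrete, here is a minimal hand computation on a toy set of baskets (invented items, not the Black Friday data) for the rule A → B:

```python
# Toy baskets: each set is one customer's purchased items
baskets = [
    {'A', 'B'},
    {'A', 'B', 'C'},
    {'A', 'C'},
    {'B', 'C'},
    {'A', 'B'},
]
n = len(baskets)

def support(items):
    """Fraction of baskets containing every item in `items`."""
    return sum(items <= b for b in baskets) / n

# Rule A -> B
support_ab = support({'A', 'B'})          # P(A and B): 3 of 5 baskets
confidence = support_ab / support({'A'})  # P(B | A)
lift = confidence / support({'B'})        # P(B | A) / P(B)

print(support_ab, confidence, lift)
```

Here lift comes out just below 1.0, so A and B co-occur slightly less often than independence would predict; mlxtend's `association_rules` computes the same quantities for every frequent itemset.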

Here's the result: the top 20 rules, sorted by lift in descending order.

In the Market Basket Analysis, we found strong cross-product purchasing patterns between certain products! These results can be the basis for further analysis and discussion once we know more about the retail chain, and can lead to opportunities for strategic pricing and promotions.

Exposing Potential Fraud in Amazon Reviews https://pivotbillions.com/exposing-potential-fraud-in-amazon-reviews/ Tue, 02 Apr 2019 22:59:06 +0000

Amazon continues to be one of the most popular marketplaces in the US as well as the world due, at least in part, to its variety of product categories and product reviews. But how accurate are these reviews?

Do sellers or their competitors try to influence them in any way? Does the Verified Purchase tag actually affect the ratings? These questions nagged me until I finally gave in and decided to analyze Amazon's Customer Review Dataset hosted on S3. This massive dataset contains over 130 million individual customer reviews, stored in tab-separated files on S3 and organized by product category.

I was mainly interested in the digital product reviews, since they are easily verifiable by Amazon, so I quickly connected to this data and created a category grouping for the digital product categories using Pivot Billions. Then I used Pivot Billions' column-creation feature to extract the month from the review's date column and loaded the data.

With access to the over 23 million reviews in Amazon's digital product categories, I could now explore each category's ratings and the effect of the Verified Purchase tag. I quickly pivoted my data by the product category, review month, and verified purchase columns to get an idea of the data's makeup.

Digital ebooks clearly made up the greatest proportion of reviews in the digital category. Given Amazon's roots as an online bookseller, this made a lot of sense. Now that I knew more about the distribution of the data and had confirmed that the number of reviews in each product category was large enough to use, I wanted to explore how the average star rating compared between categories. Switching the star rating column from count to average and viewing the pivoted data as a horizontal bar chart, I was left with a clear graphic of the ratings for each category over time.

I could clearly see a hierarchy among the digital product categories. Even with their variation over time, the video game and software categories were rated much lower than the others and significantly lower than the music category. However, digital software had an interesting ratings spike during the summer months. Wanting to dive deeper, I narrowed down to that category and added in the Verified Purchase tag to the pivot.

Surprisingly, the variation during the summer months came primarily from Non-Verified purchases, while Verified Purchases remained relatively steady. This could indicate attempts to influence software reviews by a seller or one of their competitors, or possibly a greater range of products without a verification system through Amazon.

So it appears that there are significant differences in the ratings of the digital product categories, with music typically rated much higher and software and video games rated significantly lower. Moreover, the Verified Purchase tag does have a large effect on the ratings in some instances. This could indicate cases of fraudulent reviews, so I dug deeper.

First, I re-pivoted the data by customer_id to get an idea of how many reviews each customer had.

Then I exported this data and joined it into my main data using Pivot Billions.

Now that my data was enhanced with the number of reviews each customer had submitted, I quickly restricted the data to only those customers with at least 1000 reviews in the data.

By quickly re-pivoting the data by the customer id, review month, and verified purchase columns and filtering the data to only the Non-Verified purchases, I started to see some suspicious behaviors.
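The pivot, export, join, and filter sequence corresponds to a groupby and merge. Here is a pandas sketch on invented toy review rows; column names follow the Amazon dataset, and the 1000-review threshold is lowered to 2 so the toy data triggers it.

```python
import pandas as pd

# Toy stand-in for the review data (columns follow the Amazon dataset)
reviews = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 3, 3],
    'product_parent': [10, 11, 12, 10, 13, 14],
    'verified_purchase': ['N', 'N', 'N', 'Y', 'N', 'Y'],
})

# Pivot by customer_id to count reviews per customer
counts = (reviews.groupby('customer_id').size()
                 .reset_index(name='review_count'))

# Join the counts back into the main data
enriched = reviews.merge(counts, on='customer_id')

# Restrict to prolific reviewers (>= 1000 in the real data; 2 here)
prolific = enriched[enriched['review_count'] >= 2]

# Keep only the Non-Verified purchases
suspicious = prolific[prolific['verified_purchase'] == 'N']
print(suspicious)
```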

Narrowing down this graph to just a few of the customers with the greatest degree of unverified reviews, I was able to isolate their behaviors and view them in more detail.

I could clearly see that some customers consistently submitted a high number of unverified reviews throughout the year (e.g., ID 37529167), whereas others concentrated theirs in bursts (e.g., ID 7080939). Due to their number of reviews and unverified status, these customers were highly likely to be fraudulent reviewers.

Now that I had a list of customers with suspicious behavior, I wanted to see which products were affected most, so I pivoted the data by product parent, customer id, and review count and sorted by number of reviews.

I now had a clear view into which products saw the greatest number of these suspicious reviews. In fact, one product had over 22 unverified reviews from just this limited set of customers!

While Amazon is extremely popular and does have a vast database of verified reviews, it's clear there are still a variety of fraudulent reviews dispersed throughout the data that can have isolated or cumulative effects on products. It is worth Amazon's time to look into these reviews in greater detail and to expand their Verified Purchase tag as much as possible. In the meantime, make full use of Amazon's extensive review system, but you might want to check that the reviews are Verified before buying an expensive item or when you're on the fence.

Visualizing Gender Disparities in Hi-Tech, Engineering, and Science https://pivotbillions.com/visualizing-gender-disparities-in-hi-tech-engineering-and-science/ Wed, 13 Mar 2019 18:38:07 +0000

We've all heard about the many improvements and changes being made to help bring women into Hi-Tech and the Sciences, but where are these improvements being seen? Is the entire nation reaching gender equality, or are there pockets of improvement and pockets of stagnation? These are the questions I set out to answer using the Safe Graph US Census Bureau Census Block Group American Community Survey data from Kaggle.

This census data is an extremely comprehensive dataset at the census block group level (on the order of 1500 households each) containing detailed information on a multitude of demographic aspects of the people living in the US. However, the data is quite sizable and comes in the form of a very large zip file that makes the data hard to explore. Moreover, the zip file contains many internal files containing different demographic topics, each with its own schema. I used Pivot Billions to preprocess my data for Tableau due to its data ETL and joining features.

Using it, I quickly extracted the key columns I wanted from the various internal files in the census data while the data was still zipped and in S3.

Then I joined those key columns together by census block group and enhanced the data with a few custom columns such as the percentage of women and the degree of gender inequality.

This second custom column is used to describe how far away from equal (50% men, 50% women) the computer, engineering, and science workers are in each census block group. The values range from 0 to 5 with 5 meaning highly biased towards one gender. Now that the main data I was interested in was ready and loaded into Pivot Billions, I then downloaded the Pivot Billions preprocessed data as a csv file and then simply imported the data into Tableau as a text file.
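The post doesn't give the exact formula for the inequality column; one natural reading, the distance from a 50/50 split rescaled to a 0 to 5 score, can be sketched in pandas. The column names and numbers below are invented for illustration.

```python
import pandas as pd

# Toy census-block rows: workers in computer/engineering/science occupations
blocks = pd.DataFrame({
    'census_block_group': ['A', 'B', 'C'],
    'male_workers': [50, 90, 10],
    'female_workers': [50, 10, 90],
})

total = blocks['male_workers'] + blocks['female_workers']
blocks['pct_women'] = 100 * blocks['female_workers'] / total

# Distance from a 50/50 split, rescaled from 0-50 points to a 0-5 score
# (hypothetical formula; the post doesn't state the exact one it used)
blocks['degree_of_gender_inequality'] = (blocks['pct_women'] - 50).abs() / 10

print(blocks[['census_block_group', 'degree_of_gender_inequality']])
```

Note the score is symmetric: a block that is 90% men and one that is 90% women both score 4.0, matching the post's description of 5 as "highly biased towards one gender."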

Now all I had to do to explore the distribution of gender equality and inequality in the United States was visualize this data. I quickly opened a new worksheet and told Tableau to treat the Census Block, latitude/longitude, and DegreeOfGenderInequality fields as Dimensions (categories). Then I added some of the interesting fields to the displayed tooltip in Tableau and colored by DegreeOfGenderInequality. Focusing on the extremes of Equality (Green) and Inequality (Orange), I was left with a very telling interactive visualization.

It was immediately clear that the coastal regions of the US and the Southeastern US region consistently maintained a high level of gender equality. On the other hand, the North/NorthEastern US regions leaned heavily towards just one of the genders (typically towards men).

Though many areas of the US seem to be making strides towards gender equality in Hi-Tech and the Sciences, many still remain heavily biased and stagnant. It may be worthwhile to look at the regulations and companies operating in the coastal and southeastern regions of the US to see what policies (governmental or private) may be having an effect. With time hopefully these policies will be implemented across the US to equalize the representation of both genders in these fields and truly create a more equal society.

To interact with this visualization yourself, feel free to go through my Gender Equality and Inequality workbook on Tableau Public.

Indoor vs Outdoor Activities? CDC Health Data Shows Which Is Better For You https://pivotbillions.com/indoor-vs-outdoor-activities-cdc-health-data-shows-which-is-better-for-you/ Tue, 05 Mar 2019 01:41:44 +0000

 

Diving into CDC Behavioral Risk Factor data using Pivot Billions to learn what exercise behaviors are associated with improved health.

 

Motivating yourself to go outside and get some exercise or play a sport can be hard, but it is worth it. I had trouble with this myself but after looking into the CDC’s Behavioral Risk Factor Surveillance System data from Kaggle I decided to redouble my efforts. The CDC dataset I examined contains over 400,000 individuals’ responses to survey questions for each year from 2011 to 2015 and is an extremely comprehensive dataset containing answers to hundreds of questions regarding the individuals’ behaviors and their resulting health.

However, the size and structure of the data make it hard to explore. In particular, the survey schema changes each year, and the English meaning of each survey question (and its corresponding answers) is found in a different PDF for each year of the data. I decided to use Pivot Billions to analyze the data since its ETL and data-joining features let me quickly narrow down on the data and survey questions I wanted to explore and discover some of the trends buried within.

One relationship I wanted to explore was whether various types of physical activity affect a person's physical, mental, and emotional health before and after retirement. Using Pivot Billions, I enhanced the data with three new columns based on the existing survey questions: the surveyed person's age, their main sport over the 30 days prior to the survey, and their degree of health limitation. I needed a lookup table I'd made to create the MainSport column (matching each survey answer to the corresponding activity), so I was grateful that it was easy to combine into my data.

Now that I had a full view of my data along with its enhancements, I quickly filtered the data to eliminate “Refused” and blank answers to the experienced health limitations survey question. This way I had a clear metric for my data: the average degree of experienced physical, mental, or emotional health limitations on a scale from 0 to 1.
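The filter-then-average step can be sketched in pandas on toy survey rows. The column names and answer values here are invented stand-ins for the actual BRFSS codes; only the logic (drop "Refused" and blank answers, then average a 0-to-1 limitation score per activity) comes from the post.

```python
import pandas as pd

# Toy stand-in for the BRFSS survey data (invented column names and values)
survey = pd.DataFrame({
    'age_group': ['18-64', '18-64', '65+', '65+', '18-64'],
    'main_sport': ['golf', 'housework', 'golf', 'housework', 'housework'],
    'health_limited': ['0', '1', 'Refused', '1', ''],
})

# Drop "Refused" and blank answers so the metric is well defined
clean = survey[~survey['health_limited'].isin(['Refused', ''])].copy()
clean['health_limited'] = clean['health_limited'].astype(float)

# Average degree of experienced limitation (0 to 1) per main sport
limitation = clean.groupby('main_sport')['health_limited'].mean()
print(limitation)
```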

My data was now ready for more detailed analysis so I used Pivot Billions’ pivot function to reorganize my data by age category and main sport. Viewing the result as a Bar Graph a picture started to form of what types of activities were associated with better overall health.

What I noticed first in this visualization is that most of the activities associated with worse health, for both age groups, were activities a person would do in or around the house. I wanted to see how this played out in each age group, so I quickly filtered the pivoted data to the 18-to-64 age group and sorted by decreasing health limitations (increasing overall health).

Sure enough, with a few exceptions such as swimming laps, the most health limitations were experienced pre-retirement when individuals exercised or did chores primarily around the house. There was a dramatic improvement in overall health when an individual consistently went outside, even if it was just to play golf.

Now, I wanted to see whether there was any effect on the post-retirement age group as well so I quickly re-filtered the pivoted data.

Again we see that staying inside or working around the house is associated with a much greater degree of physical, mental, and emotional limitations. In fact, getting outside of the house, even to go bowling or golfing, brought an over 40% reduction in experienced limitations for the more than 900,000 people surveyed. It turns out that finding a physical activity that consistently gets you out of the house can help not only your physical fitness but also dramatically improve your mental and emotional well-being.

Understanding 2 Billion Rows of Weblogs in Real-Time https://pivotbillions.com/understanding-2-billion-rows-of-weblogs-in-real-time/ Fri, 22 Feb 2019 22:23:15 +0000

Managing data just keeps getting tougher. The more we think we’ve gotten a handle on our data the more it grows and becomes too large for our existing analyses.

This issue became very clear to me after I undertook the task of trying to understand the effectiveness of ad campaigns using SiteCatalyst weblogs. Seeing as I’d analyzed weblogs before I didn’t think this would be much of an issue. The twist: the weblogs contained over 2 Billion rows!

Pivot Billions is ideal for analyzing this data due to its scalability with massive datasets. Taking the over 2 Billion rows of data, Pivot Billions loaded them into 500 Amazon c4.large instances in a matter of minutes. Then I started to explore the data using Pivot Billions' reorganization and transformation features. I was mainly interested in how the ad campaigns had performed throughout the data, so I used Pivot Billions' column-creation function to quickly extract the month and weekday from my date column (about 4 seconds). Then I did my first pivot.
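The month and weekday extraction is the same derivation you would do in pandas on a smaller dataset. A sketch on toy weblog rows (the date format and content types here are invented; SiteCatalyst exports vary):

```python
import pandas as pd

# Toy weblog rows with a date column (stand-in for SiteCatalyst logs)
logs = pd.DataFrame({
    'date': ['2018-06-04', '2018-08-06', '2018-08-07'],
    'content_type': ['Social', 'Media', 'Social'],
})

# Derive month and weekday columns from the date column
dates = pd.to_datetime(logs['date'])
logs['month'] = dates.dt.month_name()
logs['weekday'] = dates.dt.day_name()

print(logs[['month', 'weekday']])
```

With these two columns in place, pivoting by content type, month, and weekday gives exactly the view described below.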

All of my data was rearranged into a view by content type, month, and weekday. I was now able to interactively explore the distribution of my data by each of the combinations of these features. I wanted a quick overview of how each of the content types drove traffic each month so I viewed the content and month columns’ data as a Table in Pivot Billions’ PivotView.

This was a nice summary of my data but I wanted a more visual representation. I viewed the data as a Bar Graph so I could compare the content types and months more easily.

From this overview it appears that the traffic to the site experienced a significant jump during August for the Social and Media content categories. Focusing on the summer months, we can more clearly see the effect.

The Media and Social content categories saw an average 6% jump in traffic in August over the summer months. Seeing as these categories were already by far the best traffic generators this was pretty impressive.

Now I wanted to understand what caused this jump (and hopefully how to repeat it). My first guess was that this jump could correspond to the End-of-the-Summer campaign that was running at the start of each week (Monday) in August so I decided to dive a little deeper. By now viewing the data as a Table Barchart in Pivot Billions’ PivotView, dragging the weekday feature into my PivotView, and deselecting the other days of the week from the weekday feature, I was able to quickly visualize my data’s month-to-month Monday traffic.

Mondays in August did indeed see a large increase in Social and Media traffic, approximately 50% of the total August jump. This made it more likely that the End-of-Summer ad campaign was at least partially responsible for the increased traffic but I wanted a more complete view. After re-selecting the other days of the week I was able to see a more detailed view of how the ad campaign tracked with potential customers throughout the week.

It was now clear that the traffic had a very noticeable spike from social and media sources on Mondays in August, followed by high but declining traffic on Tuesdays and Wednesdays. This was not seen earlier in the summer since the campaign had not started. It is reasonable to conclude that the End-of-Summer ad campaign had a significant effect on social and media traffic.

This is already fairly useful information, but I'd really like to drill down into the ad campaign and see which sites were driving the most traffic. I quickly pivoted my data again, this time by protocol/domain and month, to get a closer view. Viewing the pivoted data as a Table Barchart again and sorting so the sites and months with the highest traffic were at the top and to the right, I was able to get a detailed look at the best-performing sites and which of them saw the highest impact from the ad campaign.

Note: The protocol/domain data has been anonymized for this post.

It’s clear that some sites had much higher impacts from the ad campaign than others. Even amongst the five highest performing sites, two weren’t affected by the ad campaign, one had a moderate improvement, and two others had sizable increases. The highest performing site saw an over 17% increase in traffic from the ad campaign and the third highest performing site saw a nearly 50% gain! Now that I know the types of ad campaigns that are most effective and have a full list of sites that they are most effective on, this analysis will be helpful in improving the ROI of future ad campaigns and making sure the investments are spent in the right places.

Real Net Profit: 150% in just 4 Months https://pivotbillions.com/real-net-profit-150-in-just-4-months/ Fri, 08 Feb 2019 13:49:51 +0000

 

Developing a post-commission profitable currency trading model using Pivot Billions and R.

Needle, meet haystack. Searching for the right combination of features for a consistent trading model can be quite difficult and takes many, many iterations. By incorporating Pivot Billions and R into my research process, I was able to dramatically improve the efficiency of each iteration, making finding that needle in a haystack actually possible. Pivot Billions provided the raw power and scalability, while R provided the higher-level manipulations and processes that allowed me to dive deep into my financial data and start to understand the underlying trends.

Utilizing Pivot Billions’ accurate financial backtesting simulator I was able to quickly test each version of my model as I developed it and see how it would perform in the real market. From testing initial general trading strategies to exploring individual and grouped features to see their distribution in my data and their effect on the trading strategies, my research process made great use of both tools. Adding features easily across all 143 Million rows of my data in Pivot Billions and being able to access, test, and simulate the effect of trading using these features from within my R code led to a very promising model ready for live trading.

After implementing this model in my real live trading account, I was able to achieve over 150% net profit in just four months! While there are still some small drawdowns, the overall profit is very consistent and achieves great profitability in a very small amount of time.

I am continuing to trade this model and follow its performance. In the meantime I am working on minimizing its drawdowns and maximizing my profit by incorporating AI. Check out my Pivot Billions and Deep Learning post to see some of my preliminary results.

Taming 1.5 Billion Rows of “Big Apple” Data https://pivotbillions.com/taming-1-5-billion-rows-of-big-apple-data/ Fri, 18 Jan 2019 00:03:25 +0000

The age of data has arrived. With it, more and more datasets are created and they just keep getting bigger. Whether dealing with private or open data, individuals and organizations across the world are realizing that there are enormous amounts of information and insights to be gained from massive data.

The public NYC Taxi and Limousine Commission Trip Record Data is a good example of an ever growing massive dataset.  Pivot Billions is ideal for analyzing this type of data due to its ability to scale to handle any size of data.

In order to explore passenger and taxi trends over the years, I used Pivot Billions to process more than 200 compressed csv files, loading 1.5 Billion rows of data into 170 Amazon c4.large instances in 3 minutes. With the data loaded, I explored it using Pivot Billions' reorganization and transformation features. One thing I noticed right away is that the data had tip and total taxi cost as separate columns. It's more useful to compare percentages, so I created a new tip percent metric from those columns using Pivot Billions' column-creation function (about 4 seconds). Another messy property I noticed was overlapping payment type codes. As seen in the following column distribution, the codes were modified over the years: in some years the complete spelling was used, in others an abbreviation, and more recently numeric values.

I quickly applied a lookup table in Pivot Billions to create a new, cleaner transformed column called PayType. Now that my data was clean and enhanced enough to draw some meaningful insights, I simply pivoted my data to get the number of taxi trips and taxi payment statistics by PayType, year, and tip_percent.
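Both enhancements, the tip percent metric and the lookup-table cleanup, translate directly to pandas. The rows and lookup codes below are invented examples; the real lookup table covers every payment-code variant that appears across the years of TLC data.

```python
import pandas as pd

# Toy taxi rows: tip and total cost as separate columns, messy payment codes
trips = pd.DataFrame({
    'total_amount': [10.0, 20.0, 8.0],
    'tip_amount': [2.0, 0.0, 1.0],
    'payment_type': ['CREDIT', 'Cas', '1'],
})

# New metric: tip as a percentage of the total fare
trips['tip_percent'] = 100 * trips['tip_amount'] / trips['total_amount']

# Lookup table collapsing the overlapping codes into a clean PayType
# (example codes only; the real table maps every variant in the data)
paytype_lookup = {'CREDIT': 'Credit', 'CRD': 'Credit', '1': 'Credit',
                  'CASH': 'Cash', 'Cas': 'Cash', '2': 'Cash'}
trips['PayType'] = trips['payment_type'].map(paytype_lookup)

print(trips[['tip_percent', 'PayType']])
```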

By entering Pivot Billions’ PivotView, I visualized the data using a heatmap to compare the distribution of tips for cash and credit card passengers.


It was immediately clear that people paying in cash generally did not pay a tip or, more likely, the driver did not report a tip. However, credit card users typically paid ~16% tip, possibly due to the ease of the taxis’ touch panel payment system for credit card users.

Now that I understood a bit more about passenger and taxi tip behavior I wanted to see if I could find any other trends underlying the data. Quickly pivoting the data again to view the number of trips and trip distance statistics by year and month, I started to draw some new insights. By visualizing the data as a LineChart in PivotView and comparing the total trip distances by year to the average trip distances by year, I noticed an interesting discrepancy.

Though the total trip distances were decreasing year over year, the average distances were rising slightly. The growing popularity of ridesharing in recent years is likely responsible for this trend: a nearly 40% reduction in total trip distances logged from 2009 to 2017.

By incorporating the rideshare data from the NYC Taxi and Limousine Commission Trip Record Data and doing a quick comparison of total January trips for 2015-2018, we see a pretty remarkable shift in the riding habits of NYC.

This chart seems to validate the underlying trend surmised in the prior analysis, that rideshare growth has eroded yellow taxi ridership by nearly 40%.

Diving into large datasets like these can be a challenge and incredibly time consuming, but with PivotBillions, it just takes a few minutes from start to finish.

 

R NewYorkers Feeling the Holiday Spirit? Here’s Your Tip https://pivotbillions.com/r-newyorkers-feeling-the-holiday-spirit-heres-your-tip/ Wed, 09 Jan 2019 16:53:12 +0000

The holiday season brings with it a degree of cheer and joy that many claim makes people act friendlier towards each other. I wanted to see how this effect translates to action, so I decided to look into tips for New York green taxis both during the holiday season and the rest of the year. To start, I streamed all of the green taxi data files from the public NYC Taxi and Limousine Commission Trip Record Data for 2017-07-01 to 2018-06-30 (the most recent year of green taxi data) into Pivot Billions and enhanced the data with two new columns: holidayseason and tip_percent.

Many rows weren’t relevant to this analysis since cash payments did not have records of tips, so I filtered out cash payments using the payment_type column in Pivot Billions, bringing the total to roughly 5 million rows.

To dive into the data, I used Pivot Billions’ pivot feature to quickly reorganize all of this filtered data by where the passengers were dropped off (DOLocationID) and by whether the trip occurred during the holiday season. My more than 9 million original rows of data were now condensed into a much more manageable 513-row summary. After downloading this new view of the data from Pivot Billions, I switched my focus to visualizing and analyzing the data in R.
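Conceptually, the pivot is a group-by aggregation: collapse millions of trip rows into one row per (DOLocationID, holidayseason) pair. A stdlib Python sketch with a few invented sample rows:

```python
from collections import defaultdict

# Sketch of the pivot: group trip rows by (DOLocationID, holidayseason)
# and average tip_percent within each group. Sample rows are invented
# purely for illustration.
rows = [
    {"DOLocationID": 132, "holidayseason": 1, "tip_percent": 22.0},
    {"DOLocationID": 132, "holidayseason": 1, "tip_percent": 18.0},
    {"DOLocationID": 132, "holidayseason": 0, "tip_percent": 15.0},
    {"DOLocationID": 7,   "holidayseason": 0, "tip_percent": 10.0},
]

groups = defaultdict(list)
for row in rows:
    groups[(row["DOLocationID"], row["holidayseason"])].append(row["tip_percent"])

# One summary row per (drop-off zone, season) pair: the average tip percent.
summary = {key: sum(vals) / len(vals) for key, vals in groups.items()}
print(summary[(132, 1)])  # 20.0
```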

Now that the data was small enough for R to handle easily, I loaded the Taxi Zone Shapefile and my newly downloaded DOLocationID_holiday_tips.csv file into R. This was a simple process of loading the shapefile from our data source, along with the Pivot Billions-processed file, onto my machine running R and then joining the two by setting LocationID equal to DOLocationID.
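The join itself is simple equality matching on the zone key. Sketched here in Python rather than R, with zone names invented for illustration:

```python
# Sketch of the join: attach each summary record to its taxi zone by
# matching the shapefile's LocationID to the summary's DOLocationID.
# Zone names here are for illustration only.
zones = {132: "JFK Airport", 7: "Astoria"}  # LocationID -> zone name

summary = [
    {"DOLocationID": 132, "holidayseason": 1, "avg_tip_percent": 20.0},
    {"DOLocationID": 7,   "holidayseason": 0, "avg_tip_percent": 10.0},
]

# Inner join: keep only records whose zone key appears in the shapefile.
joined = [
    {**rec, "zone": zones[rec["DOLocationID"]]}
    for rec in summary
    if rec["DOLocationID"] in zones
]
print(joined[0]["zone"])  # JFK Airport
```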

I then quickly defined a new metric, “Holiday Effect,” which tracks the percentage difference in average tips between the holiday season and the rest of the year, and added additional information to make the data informative and explorable. The result was a very clear and powerful visualization of the green taxi data.
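The “Holiday Effect” metric itself is straightforward; a minimal sketch:

```python
def holiday_effect(avg_tip_holiday: float, avg_tip_rest: float) -> float:
    """Percentage difference in average tip between the holiday season
    and the rest of the year (positive = more generous over the holidays)."""
    return 100.0 * (avg_tip_holiday - avg_tip_rest) / avg_tip_rest

# A drop-off zone averaging 22% tips over the holidays vs. 20% the rest
# of the year shows a +10% holiday effect.
print(holiday_effect(22.0, 20.0))  # 10.0
```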

It is immediately clear that some regions show a much greater occurrence of positive holiday effects (green areas) than negative effects (orange areas), while others show the reverse. Utilizing R’s powerful indexing abilities, it’s easy to narrow the data down by location and explore which areas of New York experience the effect the most. The Bronx and Brooklyn experience more negative effects, whereas Queens is evenly split between positive and negative. Manhattan and Newark Airport, however, have a much higher proportion of positive effects during the holiday season.

Though most of New York is being affected by the holidays for better or worse, people going to Manhattan and Newark Airport seem to be feeling the holiday spirit the most.

To create this visualization yourself, you can download my R code, DOLocationID_holiday_tips.csv, and the public data’s Shapefile. You can also run this code with "DOLocationID_holiday_tips.csv" replaced by "PULocationID_holiday_tips.csv" and DOLocationID replaced by PULocationID to view the holiday effect on tips by pick-up location.

Do the holidays mean bigger tips for NYC taxi drivers? https://pivotbillions.com/do-the-holidays-mean-bigger-tips-for-nyc-taxi-drivers/ Mon, 31 Dec 2018 19:51:23 +0000

The holiday season brings with it a degree of cheer and joy that many claim makes people act friendlier towards each other. I wanted to see how this effect translates to action, so I decided to look into tips for New York green taxis both during the holiday season and the rest of the year. To start, I streamed all of the green taxi data files from the public NYC Taxi and Limousine Commission Trip Record Data for 2017-07-01 to 2018-06-30 (the most recent year of green taxi data) into Pivot Billions and enhanced the data with two new columns: holidayseason and tip_percent.

Many rows weren’t relevant to this analysis since cash payments did not have records of tips, so I filtered out cash payments using the payment_type column in Pivot Billions, bringing the total to roughly 5 million rows.

To dive into the data, I used Pivot Billions’ pivot feature to quickly reorganize all of this filtered data by where the passengers were dropped off (DOLocationID) and by whether the trip occurred during the holiday season. My more than 9 million original rows of data were now condensed into a much more manageable 513-row summary. After downloading this new view of the data from Pivot Billions, I switched my focus to visualizing and analyzing the data in Tableau.

Now that the data was small enough for Tableau to handle, I loaded the Taxi Zone Shapefile and my newly downloaded DOLocationID_holiday_tips.csv file into Tableau. This was a simple process of loading the shapefile from our data source and then joining it to the Pivot Billions-processed file by setting LocationID equal to DOLocationID.

I then quickly defined a new metric, “Holiday Effect,” which tracks the percentage difference in average tips between the holiday season and the rest of the year, and added a few dynamic filters to make the data interactive and explorable. The result was a very clear and powerful visualization of the green taxi data.

It is immediately clear that some regions show a much greater occurrence of positive holiday effects (blue areas) than negative effects (orange areas), while others show the reverse. Utilizing Tableau’s dynamic filters, it’s easy to narrow the data down by location and explore which areas of New York experience the effect the most. The Bronx and Brooklyn experience more negative effects, whereas Queens is evenly split between positive and negative. Manhattan and Newark Airport, however, have a much higher proportion of positive effects during the holiday season.

Though most of New York is being affected by the holidays for better or worse, people going to Manhattan and Newark Airport seem to be feeling the holiday spirit the most.

To view and interact with this visualization or download the workbook to Tableau, see my Holiday Effect on Tips by Drop-Off Location workbook on Tableau Public. You can also explore or download my other workbook to see the Holiday Effect on Tips by Pick-Up Location.

 

 
