Pivot Billions (https://pivotbillions.com)

Real-time analysis of 50 billion records of IoT data
https://pivotbillions.com/real-time-analysis-of-50-billion-records-of-iot-data/ (Wed, 16 Sep 2020)

The demand for analysis of IoT data has grown exponentially with the explosion of connected devices. Unfortunately, the cost and time required to analyze this data have also grown exponentially as data volumes keep getting larger.

An enterprise client that collects data from millions of devices has been wrestling with a growing pain: the volume of data now exceeds the capacity of their conventional data processing systems.

Their pain stems not only from the size of the data but also from a lack of agility. Their analysis requirements change rapidly, and conventional systems simply cannot adapt to such changes in a timely or economical manner.

Their latest requirement was to process 50 billion records of GPS data against a range of analytic specifications and identify behavioral patterns of interest to their customers. That kind of dynamic requirement made conventional batch data processing very difficult and expensive.

They researched numerous products from different vendors and chose AuriQ's PivotBillions, a massively parallel, in-memory data processing and analysis solution. PivotBillions enabled the client to analyze their entire 50 billion records in real time: analytic queries, including ad-hoc queries, against the entire dataset could be processed in seconds or tens of seconds, which allowed data analysts to test their hypotheses very efficiently.


Fig 1: An example visualization of the analyzed 50 billion records of GPS data, showing how devices move before and after visiting the airport.

Because PivotBillions is a software solution that runs on Amazon Web Services, it did not require any special hardware. The Excel-like user interface of PivotBillions lets analysts work on data immediately without any coding or training.

The total system cost to analyze the 50 billion records using PivotBillions on AWS was easily less than one-tenth that of conventional systems. It took only a few weeks to complete the analysis and visualization tasks across all the different requirements.

Facts

  • Records: 50 billion records from a few million devices
  • Size: 6 TB in 365 compressed files
  • Repository: AWS S3
  • Instances: AWS EC2 m5.large (up to 500 concurrent EC2 instances)
  • Time to preprocess and load: 30 minutes (from the original data in S3)
    • Conventional system: took a few days to load partially sampled data into a database
  • Response time of queries on the whole 50 billion records: a few seconds to a few tens of seconds
    • Conventional system: took a few hours to a few days to process a query against partially sampled data
    • Performance varies slightly depending on AWS conditions


A trial version of the PivotBillions service is available for free. Click here to request a demo or sign up for a free account.

Enhancing Trading Models with AI
https://pivotbillions.com/enhancing-trading-models-with-ai/ (Thu, 13 Jun 2019)

In this day and age, it has become increasingly apparent that data is scaling and changing beyond what human-powered machine learning can make the most of. This makes enhancing current analyses with deep learning all the more necessary. Our team invested heavily in developing a trading model that could trade profitably in the currency exchange markets. Though we succeeded in developing such a model, there was still a lot of room for improvement, and we were reaching the limits of our machine learning methods. Given the growing power and versatility of deep learning, we decided to enhance our model with a deep neural network to achieve greater gains in profitability.

The first question we faced was how to represent our financial currency data as an image. There were many ways to reshape currency data into an image; however, each required a great deal of processing power and research. Working through 143 million raw tick data points and countless possible features was dramatically slowing down our research cycle. Pivot Billions allowed us to quickly load and analyze our data and reshape it into an image using their custom module. The entire process from modifying our raw data to converting it to image data and then running deep learning on it now took a matter of minutes. This also made our selection of input features much easier since it allowed us to add and quickly iterate through various features that could be relevant to our deep learning.
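
To make the reshaping step concrete, here is a minimal sketch of the idea, assuming minute-level feature data. The function name, column layout, and the choice of a 100-minute window are illustrative; this is not Pivot Billions' actual module, just one plausible way to build the "image" inputs.

import numpy as np

WINDOW = 100  # minutes of history attached to each signal, as described below

def windows_to_images(features, signal_times):
    """Stack the last WINDOW minutes of each feature into one 2D 'image' per signal.

    features: DataFrame indexed by minute, one column per engineered feature.
    signal_times: iterable of timestamps at which the base model fired a signal.
    Returns an array of shape (n_signals, n_features, WINDOW, 1) suitable for a CNN.
    """
    images = []
    for t in signal_times:
        window = features.loc[:t].iloc[-WINDOW:]  # most recent WINDOW minutes up to t
        images.append(window.to_numpy().T)        # shape (n_features, WINDOW)
    return np.stack(images)[..., np.newaxis]      # add a trailing channel axis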

Incorporating Pivot Billions into our Keras workflow quickly prepared our data: it simulated and created a profit label for our model's signals, attached the last 100 minutes prior to each signal to that signal's row for every input feature, and set up our training and testing datasets. After many iterations over the types of input features and the structure of our deep neural network, we arrived at a deep learning model that learned our base model's weaknesses and strengths and more accurately predicted the profitable signals.
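
As a rough illustration, a Keras classifier over these (features x 100 minutes) images might look like the sketch below. The layer sizes, filter shapes, and feature count are hypothetical since the post doesn't publish the real architecture.

from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 8  # hypothetical number of engineered input features

# Input: one "image" per signal, N_FEATURES rows x 100 minutes x 1 channel
model = keras.Sequential([
    layers.Input(shape=(N_FEATURES, 100, 1)),
    layers.Conv2D(32, kernel_size=(3, 5), activation="relu", padding="same"),
    layers.MaxPooling2D(pool_size=(1, 2)),
    layers.Conv2D(64, kernel_size=(3, 5), activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # P(signal is profitable)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20)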

Our raw base model (black line) was profitable but highly volatile due to periods of noisy and underperforming trades. Our deep learning enhanced model (blue line), however, achieved much greater and more consistent profit throughout our data!

We were happy to see that our deep neural network heavily reduced the periods of drawdown in the original model and produced much more stable profit. We'll continue to develop our deep learning models, so look forward to another blog post with even greater improvements!

5 Minute Analysis of Data.gov Datasets with PivotBillions
https://pivotbillions.com/5-minute-analysis-of-data-gov-datasets-with-pivotbillions/ (Fri, 19 Apr 2019)

Overview

Data.gov is an open data portal provided by the US government, offering a wealth of interesting information on topics ranging from agriculture and education to finance, health, and science. It currently lists over 200,000 datasets available to explore.

In this 5 Minute Analysis, we picked one dataset from the consumer data category: the Financial Services Consumer Complaint Database. This dataset is provided by the Consumer Financial Protection Bureau (CFPB) and is a record of complaints received by the CFPB about financial products and services offered by a multitude of companies.

The data comprises over 1.2 million rows in CSV format and is about 700 MB in size. Loading it was simple using PivotBillions' drag-and-drop feature to upload the data to our cloud-based demo portal.


EDA

Some quick exploratory data analysis (EDA) reveals a few general trends in the data.

Since the launch of the CFPB back in 2011, the number of recorded complaints has gone up year over year.

Over that time, the top financial products or services associated with complaints were the following:

By far, mortgage, debt collection and credit reporting outpace other types of products or services with regards to generating complaints.

The top companies who are the recipients of these complaints include the major credit reporting companies and banks.

The top five states for consumer complaints are California, Florida, Texas, New York and, surprisingly, Georgia. Georgia is a bit of an outlier: it is only the 8th most populous state in the US, yet more complaints originated there than from Illinois, which has a larger population.

The overwhelming preferred method for reporting complaints is online, via the web.

Among the top three financial products/services reported on, mortgage-related complaints have declined while debt collection and credit reporting complaints have grown significantly over that time.
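
For readers who prefer to reproduce this EDA in code, here is a minimal pandas sketch of the same pivots. It assumes the public CFPB complaint CSV and its standard column names ('Date received', 'Product', 'State'); adjust if your export differs.

import pandas as pd

# Public CFPB consumer complaint export
df = pd.read_csv('complaints.csv', parse_dates=['Date received'])

# Complaint volume per year: has it grown year over year?
per_year = df.groupby(df['Date received'].dt.year).size()

# Which products and which states generate the most complaints?
by_product = df['Product'].value_counts()
by_state = df['State'].value_counts().head(5)

# Month-by-month trend for the top three products
top3 = by_product.index[:3]
trend = (df[df['Product'].isin(top3)]
         .groupby([df['Date received'].dt.to_period('M'), 'Product'])
         .size().unstack())
print(per_year, by_product.head(10), by_state, sep='\n\n')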


The Wrap-up

What insights did we gain through our EDA? The declining mortgage complaints suggest a steady recovery from the subprime mortgage crisis of 2007-2008, although there is still friction with consumers trying to modify loans or avoid foreclosure. The slow rise of debt collection complaints might indicate that consumers are taking on more credit card or loan debt and that the rate of default is rising. The most drastic increase in complaints is with credit reporting/credit repair services. The huge spike in September 2017 is primarily a result of the Equifax data breach announcement, but even before then there was a steady rise in complaints about how the credit reporting agencies were handling information in consumer credit reports.

A lot more can be gleaned by delving deeper into the data, but that will have to wait for another post. This dataset is currently available in our public demo portal for anyone to play with. Try it for yourself and see what insights you might find.

Maximizing Sales with Market Basket Analysis
https://pivotbillions.com/maximizing-sales-with-market-basket-analysis/ (Mon, 08 Apr 2019)

Sales data analysis can provide a wealth of insights for any business, but such data is rarely made available to the public. In 2018, however, a retail chain provided Black Friday sales data on Kaggle as part of a Kaggle competition. Although the store and product lines are anonymized, the dataset presents a great learning opportunity for finding business insights! In this post, we'll cover how to prepare the data, perform basic analysis, and glean additional insights via a technique called Market Basket Analysis.

Let's see what the data looks like. We use Pivot Billions to analyze and manipulate large amounts of data via an intuitive, spreadsheet-like interface. After importing, the row count at the bottom shows that the data contains over 500K rows, along with example data for each column.

Visualizing the distribution of each column is easy with a simple click of the column name. Overall the data is clean, but Product_Category_2 and Product_Category_3 have many missing values (NAs). They appear to be subcategories of Product_Category_1, which has no missing values, so let's ignore them for now.

Our next step is preparing the data for analysis. First, "Purchase" needs to be divided by 100 to express cents as decimals: we define a float column called Total_Purchase and set it to ToF(Purchase)/100.
Second, let's make a column combining gender with age so we can visualize it in a bar plot: we define a string column called Gender_Age and set it to Gender + Age, as in the screenshot below.
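
For reference, the same two derived columns in pandas (ToF(...) above is PivotBillions' own expression syntax; this is just an illustrative translation):

import pandas as pd

df = pd.read_csv('BlackFriday.csv')

# Purchase is stored without a decimal point; divide by 100 to recover cents
df['Total_Purchase'] = df['Purchase'].astype(float) / 100

# Concatenate gender and age bracket into one label, e.g. 'F' + '0-17' -> 'F0-17'
df['Gender_Age'] = df['Gender'] + df['Age']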

In Pivot Billions, we can pivot data just like in Excel, which makes it easy to break down customer demographics and sales per city, occupation, product category, and so on. The bar chart below shows total purchases by product category.

We see that product category 1 is the most popular. It turns out that it generates about 40% of total purchases in this retail chain, and combined purchases from categories 1, 5, and 8 account for more than 70%.

Next, let's examine the data a little more closely! The bar chart below shows total purchases and customer demographics by city.

City B has the most sales, and in every city the majority of customers are males in their 20s and 30s. City A, however, has a lower ratio of total purchases from customers over 35 compared to the other two cities, so it may have an untapped opportunity to target older demographics in its area. Each city's overall demographics aren't revealed in this dataset, so understanding what causes these differences would be a realistic next step. Even so, this gives us solid insight to discuss with marketing teams.

Lastly, let's do Market Basket Analysis, which applies association rule mining to transaction data to discover interesting associations between products! I'm going to use the Apriori algorithm in Python; R has an Apriori implementation as well. For further information, please check out the following links:
Apriori (Python): http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/#frequent-itemsets-via-apriori-algorithm
Apriori (R): https://www.rdocumentation.org/packages/arules/versions/1.6-3/topics/apriori

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Read the dataset
data = pd.read_csv('./BlackFriday.csv')

# Build the basket: count how many times each user bought each product
basket_data = data.loc[:, ['User_ID', 'Product_ID']]
count = (basket_data.groupby(['User_ID', 'Product_ID'])
                    .size().reset_index(name='Count'))

# Pivot into a user x product matrix, filling missing combinations with 0
basket = (count.groupby(['User_ID', 'Product_ID'])['Count']
               .sum().unstack().reset_index().fillna(0)
               .set_index('User_ID'))

# For this dataset, all values are either 1 or 0, so the basket can be
# passed straight to the Apriori algorithm.

# Keep itemsets with support >= 0.05
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)

# Generate rules and filter to lift >= 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Sort by lift in descending order
rules = rules.sort_values(by='lift', ascending=False)

We’ll explain briefly what the metrics indicate.

  • Support represents how frequently the items appear in the data.
  • Confidence represents how often the if (antecedent) then (consequent) statements are found to be true.
  • Lift represents how much larger or smaller the confidence is than the expected confidence. A lift greater than 1.0 implies the relationship between the items is stronger than would be expected if they were independent; the larger the lift, the stronger the association (see the worked example below).
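
A quick worked example with made-up numbers (not from this dataset) shows how the three metrics relate:

# Hypothetical basket statistics, purely illustrative
support_A = 0.10    # A appears in 10% of baskets
support_B = 0.20    # B appears in 20% of baskets
support_AB = 0.05   # A and B appear together in 5% of baskets

confidence = support_AB / support_A  # 0.50: half of the baskets with A also have B
lift = confidence / support_B        # 2.5: B shows up 2.5x more often with A than expected
print(confidence, lift)              # 0.5 2.5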

Here are the top 20 rules, sorted by lift in descending order.

Through the Market Basket Analysis, we found strong cross-product purchasing patterns between certain products! These results can be the basis for further analysis or discussion if we can learn more about the retail chain, and can lead to opportunities for strategic pricing and promotions.

Exposing Potential Fraud in Amazon Reviews
https://pivotbillions.com/exposing-potential-fraud-in-amazon-reviews/ (Tue, 02 Apr 2019)

Amazon continues to be one of the most popular marketplaces in the US and the world, due at least in part to its variety of product categories and its product reviews. But how accurate are these reviews?

Do sellers or their competitors try to influence them in any way? Does the Verified Purchase tag actually affect the ratings? These questions nagged me until I finally gave in and decided to analyze Amazon's Customer Review Dataset hosted on S3. This massive dataset contains over 130 million individual customer reviews, stored in tab-separated files in S3 and organized by product category.

I was mainly interested in the digital product reviews, since they are easily verifiable by Amazon, so I quickly connected to this data and created a category grouping the digital product categories using Pivot Billions. Then I used Pivot Billions' column creation feature to extract the month from the review's date column and loaded the data.

With access to the over 23 million reviews in Amazon's digital product categories, I could explore each category's ratings and the effect of the Verified Purchase tag. I quickly pivoted my data by the product category, review month, and verified purchase columns to get an idea of the data's makeup.

Digital Ebooks clearly made up the greatest proportion of reviews in the digital category. Given Amazon's roots as an online bookseller, this made a lot of sense. Now that I knew more about the distribution of the data and had made sure that the number of reviews for each product category was large enough to use, I wanted to explore how the average star rating compared between categories. Switching the star rating column from count to average and viewing the pivoted data as a horizontal bar chart, I was left with a clear graphic of the ratings for each category over time.

I could clearly see a hierarchy among the digital product categories. Even with their variation over time, the video game and software categories were rated much lower than the others, and significantly lower than the music category. Digital software, however, had an interesting ratings spike during the summer months. Wanting to dive deeper, I narrowed down to that category and added the Verified Purchase tag to the pivot.

Surprisingly, the variation during the summer months came primarily from Non-Verified purchases, while Verified Purchases remained relatively steady. This could indicate attempts to influence software reviews by a seller or one of their competitors, or possibly a greater range of products that didn't have a verification system through Amazon.

So it appears that there are significant differences in the ratings of the digital product categories, with music typically rated much higher and software and video games rated significantly lower. Moreover, the Verified Purchase tag does have a large effect on the ratings in some instances. This could indicate cases of fraudulent reviews, so I dug deeper.

First, I re-pivoted the data by customer_id to get an idea of how many reviews each customer had.

Then I exported this data and joined it into my main data using Pivot Billions.

Now that my data was enhanced with the number of reviews each customer had submitted, I quickly restricted it to only those customers with at least 1,000 reviews in the data.
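
In pandas terms, this join-and-filter step looks roughly like the sketch below. Pivot Billions does the equivalent through its UI; the file name is hypothetical, the customer_id and verified_purchase columns follow the Amazon review dataset's schema, and review_month is the column derived earlier.

import pandas as pd

# Hypothetical extract of the digital-category reviews (TSV files in the real dataset)
reviews = pd.read_csv('digital_reviews.tsv', sep='\t')

# Count reviews per customer, then join that count back onto every row
counts = (reviews.groupby('customer_id').size()
                 .rename('review_count').reset_index())
reviews = reviews.merge(counts, on='customer_id')

# Restrict to prolific reviewers: at least 1,000 reviews in the data
prolific = reviews[reviews['review_count'] >= 1000]

# Their non-verified reviews by month, mirroring the pivot described below
suspicious = (prolific[prolific['verified_purchase'] == 'N']
              .groupby(['customer_id', 'review_month']).size())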

By quickly re-pivoting by the customer id, review month, and verified purchase columns and filtering to only the Non-Verified purchases, I started to see some suspicious behaviors.

Narrowing down this graph to just a few of the customers with the greatest numbers of unverified reviews, I was able to isolate their behaviors and view them in more detail.

I could clearly see that some customers consistently submitted a high number of unverified reviews throughout the year (e.g., ID 37529167), whereas others were concentrated in bursts (e.g., ID 7080939). Given their number of reviews and unverified status, these customers were highly likely to be fraudulent reviewers.
Now that I had a list of customers with suspicious behavior, I wanted to see which products were affected the most, so I pivoted the data by product parent, customer id, and review count and sorted by number of reviews.

I now had a clear view into which products saw the greatest number of these suspicious reviews. In fact, one product had over 22 unverified reviews from just this limited set of customers!

While Amazon is extremely popular and does have a vast database of verified reviews, it's clear there are still a variety of fraudulent reviews dispersed throughout the data that can have isolated or cumulative effects on products. It would be worth Amazon's time to look into these reviews in greater detail and to expand the Verified Purchase tag as much as possible. In the meantime, make full use of Amazon's extensive review system, but you may want to check that the reviews are Verified before buying an expensive item or when you're on the fence.

Visualizing Gender Disparities in Hi-Tech, Engineering, and Science
https://pivotbillions.com/visualizing-gender-disparities-in-hi-tech-engineering-and-science/ (Wed, 13 Mar 2019)

We've all heard about the many improvements and changes being made to help bring women into hi-tech and the sciences, but where are these improvements actually showing up? Is the entire nation approaching gender equality, or are there pockets of improvement and pockets of stagnation? These are the questions I set out to answer using the Safe Graph US Census Bureau Census Block Group American Community Survey data from Kaggle.

This census data is an extremely comprehensive dataset at the census block group level (on the order of 1,500 households each) containing detailed information on a multitude of demographic aspects of the people living in the US. However, the data is quite sizable and comes as a very large zip file, which makes it hard to explore. Moreover, the zip file contains many internal files covering different demographic topics, each with its own schema. I used Pivot Billions to preprocess my data for Tableau, thanks to its data ETL and joining features.

Using it, I quickly extracted the key columns I wanted from the various internal files in the census data while the data was still zipped and in S3.

Then I joined those key columns together by census block group and enhanced the data with a few custom columns such as the percentage of women and the degree of gender inequality.

This second custom column describes how far from equal (50% men, 50% women) the computer, engineering, and science workforce is in each census block group. The values range from 0 to 5, with 5 meaning highly biased toward one gender. With the main data ready and loaded into Pivot Billions, I downloaded the preprocessed data as a CSV file and simply imported it into Tableau as a text file.
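
The post doesn't spell out the exact formula, but a plausible construction of the two custom columns looks like this. The file and column names are assumptions, and the 0-to-5 bucketing is my guess at the scale described above:

import pandas as pd

acs = pd.read_csv('census_block_groups.csv')  # hypothetical joined extract

# Percentage of women among computer, engineering, and science workers
total = acs['ces_male'] + acs['ces_female']   # assumed column names from the ACS files
acs['PercentWomen'] = 100 * acs['ces_female'] / total

# Distance from a 50/50 split, mapped onto the 0-5 scale (50% women -> 0; 0% or 100% -> 5)
acs['DegreeOfGenderInequality'] = ((acs['PercentWomen'] - 50).abs() / 10).round()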

Now all I had to do to explore the distribution of gender equality and inequality in the United States was visualize this data. I quickly opened a new worksheet and told Tableau to treat the Census Block, latitude/longitude, and DegreeOfGenderInequality fields as Dimensions (categories). Then I added some of the interesting fields to the tooltip in Tableau and colored by DegreeOfGenderInequality. Focusing on the extremes of Equality (green) and Inequality (orange), I was left with a very telling interactive visualization.

It was immediately clear that the coastal regions and the Southeastern US consistently maintained a high level of gender equality. The North/Northeastern US, on the other hand, leaned heavily toward one gender (typically men).

Though many areas of the US seem to be making strides towards gender equality in Hi-Tech and the Sciences, many still remain heavily biased and stagnant. It may be worthwhile to look at the regulations and companies operating in the coastal and southeastern regions of the US to see what policies (governmental or private) may be having an effect. With time hopefully these policies will be implemented across the US to equalize the representation of both genders in these fields and truly create a more equal society.

To interact with this visualization yourself, feel free to go through my Gender Equality and Inequality workbook on Tableau Public.

Indoor vs Outdoor Activities? CDC Health Data Shows Which Is Better For You
https://pivotbillions.com/indoor-vs-outdoor-activities-cdc-health-data-shows-which-is-better-for-you/ (Tue, 05 Mar 2019)

Diving into CDC Behavioral Risk Factor data using Pivot Billions to learn what exercise behaviors are associated with improved health.

Motivating yourself to go outside and get some exercise or play a sport can be hard, but it's worth it. I had trouble with this myself, but after looking into the CDC's Behavioral Risk Factor Surveillance System data from Kaggle, I decided to redouble my efforts. The dataset I examined contains over 400,000 individuals' responses to survey questions for each year from 2011 to 2015 and is extremely comprehensive, with answers to hundreds of questions about the individuals' behaviors and their resulting health.

However, the size and structure of the data make it hard to explore. In particular, the survey schema changes each year, and the English meaning of each survey question (and the corresponding answers) is found in a different PDF for each year of the data. I decided to use Pivot Billions to analyze the data because its ETL and data joining features let me quickly narrow down to the data and survey questions I wanted to explore and discover some of the trends buried within.

One relationship I wanted to explore was whether various types of physical activity affect a person's physical, mental, and emotional health before and after retirement. Using Pivot Billions, I enhanced the data with three new columns based on the existing survey questions, capturing each surveyed person's age, main sport over the 30 days prior to the survey, and degree of health limitation. I needed a lookup table I'd made to create the MainSport column (matching each survey answer code to the corresponding activity), so I was grateful that it was easy to combine into my data.
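
A rough pandas sketch of that lookup-table join is below. The file name is hypothetical, and while EXRACT11 is a BRFSS exercise-activity variable, the specific code-to-activity mappings shown are illustrative, not the real codebook:

import pandas as pd

brfss = pd.read_csv('brfss_combined.csv')  # hypothetical combined extract

# Hypothetical lookup table: activity answer codes -> readable activity names
activity_lookup = pd.DataFrame({
    'EXRACT11': [1, 2, 38, 52],  # assumed activity-code column; codes illustrative
    'MainSport': ['Walking', 'Running', 'Golf', 'Swimming laps'],
})

# Join the lookup onto the survey rows; rows with unknown codes get NaN
brfss = brfss.merge(activity_lookup, on='EXRACT11', how='left')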

With a full view of my data and its enhancements, I quickly filtered out "Refused" and blank answers to the experienced-health-limitations survey question. That gave me a clear metric: the average degree of experienced physical, mental, or emotional health limitations on a scale from 0 to 1.

My data was now ready for more detailed analysis, so I used Pivot Billions' pivot function to reorganize it by age category and main sport. Viewing the result as a bar graph, a picture started to form of which types of activities were associated with better overall health.

What I noticed first in this visualization is that most of the activities associated with worse health in both age groups were activities a person would do in or around the house. I wanted to see how this played out in each age group, so I quickly filtered the pivoted data to the 18-to-64 age group and sorted by decreasing health limitations (increasing overall health).

Sure enough, with a few exceptions such as swimming laps, the greatest health limitations pre-retirement were experienced by individuals who exercised or did chores primarily around the house. There was a dramatic improvement in overall health when an individual consistently went outside, even if it was just to play golf.

Now I wanted to see whether there was a similar effect in the post-retirement age group, so I quickly re-filtered the pivoted data.

Again we see that staying inside or working around the house is associated with a much greater degree of physical, mental, and emotional limitations. In fact, getting out of the house, even to go bowling or golfing, brought an over 40% reduction in experienced limitations among the more than 900,000 people surveyed. It turns out that finding a physical activity that consistently gets you out of the house can help not only your physical fitness but also dramatically improve your mental and emotional well-being.

Understanding 2 Billion Rows of Weblogs in Real-Time
https://pivotbillions.com/understanding-2-billion-rows-of-weblogs-in-real-time/ (Fri, 22 Feb 2019)

Managing data just keeps getting tougher. The more we think we’ve gotten a handle on our data the more it grows and becomes too large for our existing analyses.

This issue became very clear to me when I undertook the task of understanding the effectiveness of ad campaigns using SiteCatalyst weblogs. Since I'd analyzed weblogs before, I didn't think this would be much of an issue. The twist: the weblogs contained over 2 billion rows!

Pivot Billions is ideal for analyzing this data because it scales to massive datasets. It loaded the more than 2 billion rows of data into 500 Amazon c4.large instances in a matter of minutes. Then I started to explore the data using Pivot Billions' reorganization and transformation features. I was mainly interested in how the ad campaigns had performed throughout the data, so I used Pivot Billions' column creation function to extract the month and weekday from my date column (it took about 4 seconds). Then I did my first pivot.
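
For intuition, the derived columns and the first pivot look roughly like this in pandas. The file and column names ('date', 'content_type') are assumptions for illustration; the real data was 2 billion rows spread across instances, not one CSV:

import pandas as pd

logs = pd.read_csv('weblogs_sample.csv', parse_dates=['date'])  # hypothetical extract

# Derive the month and weekday columns used in the pivots below
logs['month'] = logs['date'].dt.month_name()
logs['weekday'] = logs['date'].dt.day_name()

# First pivot: traffic counts by content type, month, and weekday
traffic = logs.groupby(['content_type', 'month', 'weekday']).size()
print(traffic.head())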

All of my data was rearranged into a view by content type, month, and weekday. I was now able to interactively explore the distribution of my data by each of the combinations of these features. I wanted a quick overview of how each of the content types drove traffic each month so I viewed the content and month columns’ data as a Table in Pivot Billions’ PivotView.

This was a nice summary of my data but I wanted a more visual representation. I viewed the data as a Bar Graph so I could compare the content types and months more easily.

From this overview it appears that the traffic to the site experienced a significant jump during August for the Social and Media content categories. Focusing on the summer months, we can more clearly see the effect.

The Media and Social content categories saw an average 6% jump in traffic in August relative to the other summer months. Since these categories were already by far the best traffic generators, this was pretty impressive.

Next, I wanted to understand what caused this jump (and hopefully how to repeat it). My first guess was that it corresponded to the End-of-Summer campaign that ran at the start of each week (Monday) in August, so I decided to dive a little deeper. By viewing the data as a Table Barchart in Pivot Billions' PivotView, dragging the weekday feature into the view, and deselecting the other days of the week, I was able to quickly visualize my data's month-to-month Monday traffic.

Mondays in August did indeed see a large increase in Social and Media traffic, accounting for approximately 50% of the total August jump. This made it more likely that the End-of-Summer ad campaign was at least partially responsible for the increased traffic, but I wanted a more complete view. After re-selecting the other days of the week, I could see in more detail how the ad campaign tracked with potential customers throughout the week.

It was now clear that the traffic had a very noticeable spike from social and media sources on Mondays in August, followed by high but declining traffic on Tuesdays and Wednesdays. This was not seen earlier in the summer since the campaign had not started. It is reasonable to conclude that the End-of-Summer ad campaign had a significant effect on social and media traffic.

This was already fairly useful information, but I really wanted to drill down into the ad campaign and see which sites were driving the most traffic. I quickly pivoted my data again, this time by protocol/domain and month, to get a closer view. Viewing the pivoted data as a Table Barchart again and sorting so that the sites and months with the highest traffic were at the top and to the right, I was able to get a detailed look at the best-performing sites and which of them saw the highest impact from the ad campaign.

Note: The protocol/domain data has been anonymized for this post.

It's clear that some sites saw much higher impacts from the ad campaign than others. Even among the five highest-performing sites, two weren't affected by the campaign, one saw a moderate improvement, and two others saw sizable increases. The highest-performing site saw an over 17% increase in traffic from the ad campaign, and the third highest saw a nearly 50% gain! Knowing which types of ad campaigns are most effective, and having a full list of the sites where they work best, will help improve the ROI of future ad campaigns and ensure the investments are spent in the right places.

Real Net Profit: 150% in just 4 Months
https://pivotbillions.com/real-net-profit-150-in-just-4-months/ (Fri, 08 Feb 2019)


Developing a post-commission profitable currency trading model using Pivot Billions and R.

Needle, meet haystack. Searching for the right combination of features to build a consistent trading model can be quite difficult and takes many, many iterations. By incorporating Pivot Billions and R into my research process, I was able to dramatically improve the efficiency of each iteration, making finding that needle in a haystack actually possible. Pivot Billions provided the raw power and scalability, while R provided the higher-level manipulations and processes that allowed me to dive deep into my financial data and start to understand the underlying trends.

Utilizing Pivot Billions' accurate financial backtesting simulator, I was able to quickly test each version of my model as I developed it and see how it would perform in the real market. From testing initial general trading strategies to exploring individual and grouped features to see their distribution in the data and their effect on the trading strategies, my research process made great use of both tools. Being able to add features easily across all 143 million rows of my data in Pivot Billions, and to access, test, and simulate trading on those features from within my R code, led to a very promising model ready for live trading.

After implementing this model in my real live trading account, I was able to achieve over 150% net profit in just four months! While there are still some small drawdowns, the overall profit is very consistent, and the model achieves great profitability in a very short amount of time.

I am continuing to trade this model and follow its performance. In the meantime, I am working on minimizing its drawdowns and maximizing profit by incorporating AI. Check out my Pivot Billions and Deep Learning post to see some of my preliminary results.

San Francisco – A City Full of Psychopaths?
https://pivotbillions.com/san-francisco-a-city-full-of-psychopaths/ (Thu, 31 Jan 2019)

With a population of over 800,000 people within approximately 47 square miles, San Francisco is one of the most densely populated cities in the United States. Cities with this level of density require a high level of policing activity to ensure safety. I wanted to explore just how safe the city actually is, so I decided to analyze the SF Police Calls for Services and Incidents dataset from Kaggle, which covers many years (2003-2017) of policing data and contains a variety of useful information about each record.

Starting with the police-department-incidents.csv file and loading it into Pivot Billions, I quickly renamed the x and y fields to what they really are (longitude and latitude) and started exploring the full dataset. Pulling a quick distribution of the Resolution column showed the different ways each police incident was resolved and how common each resolution was.

I was surprised to discover that a resolution of “Psychopathic Cases” occurred nearly 30,000 times across the data! What the heck?! Is San Francisco a city of psychopaths? Sure San Francisco has had its share of famous serial killers, but this hardly seemed likely.

To investigate further, I applied a filter on the Resolution column in Pivot Billions to show only the incidents classified as a "Psychopathic Case" by the San Francisco PD. I then downloaded this filtered data to visualize in Tableau, dropping the Location column since Tableau has trouble interpreting it.
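
The same distribution-then-filter step in pandas would look roughly like this (a sketch; the exact column casing and the "PSYCHOPATHIC CASE" label are assumptions about the Kaggle file's schema):

import pandas as pd

incidents = pd.read_csv('police-department-incidents.csv')

# Rename the x/y fields to what they really are
incidents = incidents.rename(columns={'X': 'longitude', 'Y': 'latitude'})

# Distribution of the Resolution column: how was each incident resolved?
print(incidents['Resolution'].value_counts())

# Keep only the "Psychopathic Case" incidents and drop the Location column
psycho = (incidents[incidents['Resolution'] == 'PSYCHOPATHIC CASE']
          .drop(columns=['Location']))
psycho.to_csv('psychopathic_cases.csv', index=False)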

Now that my data was a manageable size and format for Tableau, I loaded it in as a text file and graphed it by the latitude and longitude columns. By adding the Address and Number of Rows data to the visualization and quickly integrating some interactive filters and highlighters, I ended up with a useful and informative visualization of the policing data.

This interactive map shows where each "Psychopathic Case" police incident occurred and how many incidents were recorded at each location. We can clearly see a high concentration of these cases in northeast San Francisco. Using the interactive filters in my visualization, it was easy to determine that the Central, Northern, Southern, and Tenderloin districts recorded the most such incidents.

So is San Francisco a city overrun by psychopaths? Hardly.

A little further digging reveals that the vast majority of these so-called psychopathic resolutions were in fact "non-criminal - aided cases" involving a mentally ill person.

In fact, if you look at the Psychopathic Case designation over time, you notice that it disappears almost entirely after 2014.

It looks like this is a case of poor classification of a very serious problem that police in all major cities have to deal with: the rise of mental-health-related incident calls. There have been a slew of high-profile incidents between law enforcement and the mentally ill in San Francisco in the past, so it is a promising sign that the SFPD is moving in the right direction and has started to limit the use of the "Psychopathic Case" designation.

If you want to explore more of this data please feel free to go through my workbook on Tableau Public.
