Big Data – Pivot Billions
https://pivotbillions.com

Real-time analysis of 50 billion records of IoT data
https://pivotbillions.com/real-time-analysis-of-50-billion-records-of-iot-data/ (16 Sep 2020)

Demand for analysis of IoT data has been growing exponentially with the explosion of connected devices. Unfortunately, the cost and time required to analyze this data have also grown exponentially as data volumes keep getting larger and larger.

An enterprise client that collects data from millions of devices has been wrestling with a growing problem: the volume of data now exceeds the capacity of its conventional data processing systems.

The pain stems not only from the size of the data but also from a lack of agility. Their analysis requirements change rapidly, and conventional systems simply cannot adapt to those changes in a timely or economical manner.

Their latest requirement was to process 50 billion records of GPS data against a variety of analytic specifications and identify behavioral patterns of interest to their customers. Such dynamic requirements make conventional batch data processing very difficult and expensive.

They researched numerous products from different vendors and chose AuriQ's Pivot Billions, a massively parallel, in-memory data processing and analysis solution. Pivot Billions enabled the client to analyze their entire 50 billion records in real time: analytic queries, including ad-hoc queries, against the entire dataset could be processed in seconds or tens of seconds, which allowed data analysts to test their hypotheses very efficiently.


Fig. 1: An example visualization of the 50 billion analyzed GPS records, showing how devices move before and after visiting the airport.

Because Pivot Billions is a software solution that runs on Amazon Web Services, it did not require any special hardware. Pivot Billions' Excel-like user interface allows analysts to work on data immediately, without any coding and with almost no learning curve.

The total system cost to analyze the 50 billion records using Pivot Billions on AWS was easily less than one tenth that of conventional systems, and it took only a few weeks to complete the analysis and visualization tasks for a variety of requirements.

Facts

  • Records: 50 billion records from a few million devices
  • Size: 6 TB in 365 compressed files
  • Repository: AWS S3
  • Instances: AWS EC2 m5.large (up to 500 concurrent instances)
  • Time to preprocess and load: 30 minutes (from the original data in S3)
    • Conventional system: a few days to load partially sampled data into a database
  • Response time of queries on all 50 billion records: a few seconds to a few tens of seconds
    • Conventional system: a few hours to days to run a query against partially sampled data
    • Performance varies slightly depending on AWS conditions

 

A trial version of the Pivot Billions service is available for free. Click here to request a demo or sign up for a free account.

Enhancing Trading Models with AI
https://pivotbillions.com/enhancing-trading-models-with-ai/ (13 Jun 2019)

In this day and age, it has become increasingly apparent that data is scaling and changing beyond the ability of human-powered machine learning to make the most of it. This makes enhancing current analyses with deep learning all the more necessary. Our team invested heavily in developing a trading model that could trade profitably in the currency exchange markets. Though we succeeded in developing such a model, there was still a lot of room for improvement, and we were reaching the limits of our machine learning methods. Given the growing power and versatility of deep learning, we decided to enhance our model with a deep neural network to achieve greater gains in profitability.

The first question we faced was how to represent our financial currency data as an image. There were many ways to reshape currency data into an image; however, each required a great deal of processing power and research. Working through 143 million raw tick data points and countless possible features was dramatically slowing down our research cycle. Pivot Billions allowed us to quickly load and analyze our data and reshape it into an image using their custom module. The entire process from modifying our raw data to converting it to image data and then running deep learning on it now took a matter of minutes. This also made our selection of input features much easier since it allowed us to add and quickly iterate through various features that could be relevant to our deep learning.
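The exact reshaping module is part of Pivot Billions, but the underlying idea, stacking each sample's recent history into a 2-D array, can be sketched in a few lines of NumPy. The window length below matches the 100 minutes we used; the eight features and the random data are placeholders:

```python
import numpy as np

def ticks_to_images(features, window=100):
    """Stack each time step's preceding `window` rows into a 2-D 'image',
    producing one (window, n_features) array per sample."""
    n_rows, _ = features.shape
    return np.stack([features[i - window:i] for i in range(window, n_rows)])

# Toy-sized stand-in for the 143-million-row tick table (8 hypothetical features)
ticks = np.random.rand(1_000, 8)
images = ticks_to_images(ticks)
print(images.shape)  # (900, 100, 8)
```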

Incorporating Pivot Billions into our Keras workflow quickly prepared our data: it simulated trades to create a profit label for each of our model's signals, attached the last 100 minutes of every input feature to the row for the corresponding signal, and set up our training and testing datasets. After many iterations on the input features and the structure of our deep neural network, we arrived at a deep learning model that learned our base model's weaknesses and strengths and more accurately predicted which signals would be profitable.
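We haven't published the final network, but a minimal Keras sketch of the setup described above, with hypothetical layer sizes and a binary profit label, would look something like this:

```python
import numpy as np
from tensorflow import keras

WINDOW, N_FEATURES = 100, 8  # 100 minutes of history per signal; 8 features is a guess

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, N_FEATURES)),
    keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability the signal is profitable
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random placeholders for the Pivot Billions export (X) and the profit label (y)
X = np.random.rand(256, WINDOW, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, validation_split=0.2, epochs=3, batch_size=32)
```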

Our raw base model (black line) was profitable but highly volatile due to periods of noisy and underperforming trades. Our deep learning enhanced model (blue line), however, achieved much greater profit throughout our data!

We were happy to see that our deep neural network greatly reduced the original model's periods of drawdown and produced much more stable profit. We'll continue to develop our deep learning models, so look forward to another blog post with even greater improvements!

Exposing Potential Fraud in Amazon Reviews
https://pivotbillions.com/exposing-potential-fraud-in-amazon-reviews/ (2 Apr 2019)

Amazon continues to be one of the most popular marketplaces in the US as well as the world due, at least in part, to its variety of product categories and product reviews. But how accurate are these reviews?

Do sellers or their competitors try to influence them in any way? Does the Verified Purchase tag actually affect the ratings? These questions nagged me until I finally gave in and decided to analyze Amazon's Customer Review Dataset hosted on S3. This massive dataset contains over 130 million individual customer reviews, stored as tab-separated files in S3 and organized by product category.

I was mainly interested in the digital product reviews, since they were easily verifiable by Amazon, so I quickly connected to this data and created a category grouping the digital product categories using Pivot Billions. Then I used Pivot Billions' column creation feature to extract the month from the review date column and loaded the data.
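Pivot Billions handles this step through its interface, but for intuition, the equivalent month extraction in pandas might look like this; the file path is illustrative, and the column names follow the dataset's published schema:

```python
import pandas as pd

# One category file from the public dataset (path is illustrative)
df = pd.read_csv("amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz",
                 sep="\t", on_bad_lines="skip")

# Equivalent of the column creation step: extract the review month
df["review_month"] = pd.to_datetime(df["review_date"], errors="coerce").dt.month
```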

With access to the more than 23 million reviews in Amazon's digital product categories, I could now explore each category's ratings and the effect of the Verified Purchase tag. I quickly pivoted the data by the product category, review month, and verified purchase columns to get an idea of its makeup.

Digital Ebooks clearly made up the greatest proportion of reviews in the digital category; given Amazon's roots as an online bookseller, this made a lot of sense. Now that I knew more about the distribution of the data and had confirmed that each product category had enough reviews to be usable, I wanted to explore how the average star rating compared between categories. Switching the star rating column from count to average and viewing the pivoted data as a horizontal bar chart, I was left with a clear graphic of the ratings for each category over time.
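In pandas terms, continuing from the frame above, that pivot and the count-to-average toggle correspond to something like:

```python
pivot = pd.pivot_table(
    df,
    index=["product_category", "review_month"],
    columns="verified_purchase",
    values="star_rating",
    aggfunc=["count", "mean"],  # review volume and average rating side by side
)
print(pivot.head())
```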

I could clearly see a hierarchy among the digital product categories. Even with their variation over time, the video game and software categories were rated much lower than the others, and significantly lower than the music category. Digital software, however, showed an interesting ratings spike during the summer months. Wanting to dive deeper, I narrowed down to that category and added the Verified Purchase tag to the pivot.

Surprisingly, the variation during the summer months came primarily from Non-Verified purchases, while Verified Purchases remained relatively steady. This could indicate attempts by a seller or a competitor to influence software reviews, or possibly a wider range of products that lacked a verification system through Amazon.

So it appears that there are significant differences in the ratings of the digital product categories, with music typically rated much higher and software and video games rated significantly lower. Moreover, the Verified Purchase tag does have a large effect on the ratings in some instances. This could indicate cases of fraudulent reviews, so I dug deeper.

First, I re-pivoted the data by customer_id to get an idea of how many reviews each customer had.

Then I exported this data and joined it into my main data using Pivot Billions.

Now that my data was enhanced with the number of reviews each customer had submitted, I quickly restricted it to only those customers with at least 1,000 reviews.
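The export-and-join maps onto a groupby and merge; here is a sketch of the same enrichment and filter, again continuing from the frame above rather than showing Pivot Billions internals:

```python
# Count reviews per customer, then join the count back onto every review row
review_counts = (df.groupby("customer_id").size()
                   .rename("customer_review_count")
                   .reset_index())
df = df.merge(review_counts, on="customer_id")

# Restrict to prolific reviewers: 1,000 or more reviews in the dataset
prolific = df[df["customer_review_count"] >= 1000]
```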

By quickly re-pivoting the data by the customer id, review month, and verified purchase columns and filtering the data to only the Non-Verified purchases, I started to see some suspicious behaviors.

Narrowing down this graph to just a few of the customers with the greatest degree of unverified reviews, I was able to isolate their behaviors and view them in more detail.

I could clearly see that some customers consistently submitted a high number of unverified reviews throughout the year (e.g., ID 37529167), whereas others' reviews came in concentrated bursts (e.g., ID 7080939). Given their review volume and unverified status, these customers were highly likely to be fraudulent reviewers.

Now that I had a list of customers with suspicious behavior, I wanted to see which products were affected the most, so I pivoted the data by product parent, customer id, and review count and sorted by the number of reviews.

I now had a clear view into which products saw the greatest number of these suspicious reviews. In fact, one product had over 22 unverified reviews from just this limited set of customers!

While Amazon is extremely popular and does have a vast database of verified reviews, it's clear there are still a variety of fraudulent reviews dispersed throughout the data that can have isolated or cumulative effects on products. It would be worth Amazon's time to look into these reviews in greater detail and to expand the Verified Purchase tag as much as possible. In the meantime, make full use of Amazon's extensive review system, but you may want to check that the reviews are Verified before buying an expensive item or whenever you're on the fence.

Understanding 2 Billion Rows of Weblogs in Real-Time
https://pivotbillions.com/understanding-2-billion-rows-of-weblogs-in-real-time/ (22 Feb 2019)

Managing data just keeps getting tougher. The more we think we've gotten a handle on our data, the more it grows and outstrips our existing analyses.

This issue became very clear to me when I undertook the task of trying to understand the effectiveness of ad campaigns using SiteCatalyst weblogs. Since I'd analyzed weblogs before, I didn't think this would be much of an issue. The twist: the weblogs contained over 2 billion rows!

Pivot Billions is well suited to analyzing this data because it scales to massive datasets. It loaded the more than 2 billion rows into 500 Amazon c4.large instances in a matter of minutes. Then I started to explore the data using Pivot Billions' reorganization and transformation features. I was mainly interested in how the ad campaigns had performed throughout the data, so I used Pivot Billions' column creation function to quickly extract the month and weekday from my date column (it took about 4 seconds). Then I did my first pivot.
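For readers who think in code, the derived columns and that first pivot map onto a few lines of pandas; the field names below are invented stand-ins, since SiteCatalyst schemas vary by configuration:

```python
import pandas as pd

# Toy stand-in for the weblogs; real SiteCatalyst field names vary
logs = pd.DataFrame({
    "date_time": pd.to_datetime(["2018-08-06 10:00", "2018-08-07 11:30",
                                 "2018-08-13 09:15"]),
    "content_type": ["Social", "Media", "Social"],
})
logs["month"] = logs["date_time"].dt.month_name()
logs["weekday"] = logs["date_time"].dt.day_name()

# First pivot: traffic counts by content type, month, and weekday
traffic = (logs.groupby(["content_type", "month", "weekday"])
               .size().rename("hits").reset_index())
print(traffic)
```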

All of my data was rearranged into a view by content type, month, and weekday. I was now able to interactively explore the distribution of my data by each of the combinations of these features. I wanted a quick overview of how each of the content types drove traffic each month so I viewed the content and month columns’ data as a Table in Pivot Billions’ PivotView.

This was a nice summary of my data but I wanted a more visual representation. I viewed the data as a Bar Graph so I could compare the content types and months more easily.

From this overview it appears that the traffic to the site experienced a significant jump during August for the Social and Media content categories. Focusing on the summer months, we can more clearly see the effect.

The Media and Social content categories saw an average 6% jump in traffic in August relative to the other summer months. Seeing as these categories were already by far the best traffic generators, this was pretty impressive.

Now I wanted to understand what caused this jump (and hopefully how to repeat it). My first guess was that it corresponded to the End-of-Summer campaign that ran at the start of each week (Monday) in August, so I decided to dive a little deeper. By viewing the data as a Table Barchart in Pivot Billions' PivotView, dragging the weekday feature into my PivotView, and deselecting the other days of the week, I was able to quickly visualize my data's month-to-month Monday traffic.

Mondays in August did indeed see a large increase in Social and Media traffic, approximately 50% of the total August jump. This made it more likely that the End-of-Summer ad campaign was at least partially responsible for the increased traffic but I wanted a more complete view. After re-selecting the other days of the week I was able to see a more detailed view of how the ad campaign tracked with potential customers throughout the week.

It was now clear that the traffic had a very noticeable spike from social and media sources on Mondays in August, followed by high but declining traffic on Tuesdays and Wednesdays. This was not seen earlier in the summer since the campaign had not started. It is reasonable to conclude that the End-of-Summer ad campaign had a significant effect on social and media traffic.

This is already fairly useful information, but I really wanted to drill down into the ad campaign and see which sites were driving the most traffic. I quickly pivoted my data again, this time by protocol/domain and month, to get a closer view. Viewing the pivoted data as a Table Barchart again and sorting it so the sites and months with the highest traffic were at the top and to the right, I was able to get a detailed look at the best performing sites and which of them saw the highest impact from the ad campaign.

Note: The protocol/domain data has been anonymized for this post.

It’s clear that some sites had much higher impacts from the ad campaign than others. Even amongst the five highest performing sites, two weren’t affected by the ad campaign, one had a moderate improvement, and two others had sizable increases. The highest performing site saw an over 17% increase in traffic from the ad campaign and the third highest performing site saw a nearly 50% gain! Now that I know the types of ad campaigns that are most effective and have a full list of sites that they are most effective on, this analysis will be helpful in improving the ROI of future ad campaigns and making sure the investments are spent in the right places.
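For the curious, a lift figure like those above is just a ratio against pre-campaign traffic; here is a toy calculation with made-up numbers for two anonymized domains:

```python
# Hypothetical pre-campaign vs. August traffic per domain
baseline = {"site_a": 120_000, "site_b": 80_000}
august   = {"site_a": 140_400, "site_b": 119_200}

for site in baseline:
    lift = (august[site] - baseline[site]) / baseline[site]
    print(f"{site}: {lift:.1%} lift")  # site_a: 17.0% lift, site_b: 49.0% lift
```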

Real Net Profit: 150% in just 4 Months
https://pivotbillions.com/real-net-profit-150-in-just-4-months/ (8 Feb 2019)

 

Developing a post-commission profitable currency trading model using Pivot Billions and R.

Needle, meet haystack. Searching for the right combination of features to build a consistent trading model can be quite difficult and takes many, many iterations. By incorporating Pivot Billions and R into my research process, I was able to dramatically improve the efficiency of each iteration, making finding that needle in the haystack actually possible. Pivot Billions provided the raw power and scalability, while R provided the higher-level manipulations that allowed me to dive deep into my financial data and start to understand the underlying trends.

Utilizing Pivot Billions' accurate financial backtesting simulator, I was able to quickly test each version of my model as I developed it and see how it would have performed in the real market. From testing initial general trading strategies to exploring individual and grouped features, both for their distribution in my data and for their effect on the trading strategies, my research process made great use of both tools. Adding features easily across all 143 million rows of my data in Pivot Billions, and being able to access, test, and simulate trading on those features from within my R code, led to a very promising model ready for live trading.
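The simulator itself ships with Pivot Billions, but the post-commission accounting at its core is simple to sketch; the commission figure and the trades below are invented for illustration:

```python
COMMISSION = 0.00007  # illustrative round-trip cost per trade, in price units

def net_profit(trades):
    """trades: (entry, exit, direction) tuples, direction +1 long / -1 short."""
    gross = sum(d * (exit_p - entry_p) for entry_p, exit_p, d in trades)
    return gross - COMMISSION * len(trades)

trades = [(1.1000, 1.1012, +1), (1.1015, 1.1007, -1)]  # two winning EUR/USD trades
print(f"net: {net_profit(trades):.5f}")  # 0.00186
```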

After implementing this model in my live trading account, I was able to achieve over 150% net profit in just four months! While there are still some small drawdowns, the overall profit is very consistent and accumulates in a very short amount of time.

I am continuing to trade this model and follow its performance. In the meantime I am working on minimizing its drawdowns and maximizing my profit by incorporating AI. Check out my Pivot Billions and Deep Learning post to see some of my preliminary results.

Taming 1.5 Billion Rows of “Big Apple” Data
https://pivotbillions.com/taming-1-5-billion-rows-of-big-apple-data/ (18 Jan 2019)

The age of data has arrived. With it, more and more datasets are created and they just keep getting bigger. Whether dealing with private or open data, individuals and organizations across the world are realizing that there are enormous amounts of information and insights to be gained from massive data.

The public NYC Taxi and Limousine Commission Trip Record Data is a good example of an ever-growing massive dataset. Pivot Billions is well suited to analyzing this type of data because it scales to any size of data.

In order to explore passenger and taxi trends over the years, I used Pivot Billions to process more than 200 compressed CSV files and load 1.5 billion rows of data into 170 Amazon c4.large instances in 3 minutes. Once the data was loaded, I explored it using Pivot Billions' reorganization and transformation features. One thing I noticed right away was that the data had the tip and the total taxi cost as separate columns. It's more useful to compare percentages, so I created a new tip-percent metric from those columns using Pivot Billions' column creation function (it took about 4 seconds). Another messy property I noticed was overlapping payment-type codes. As seen in the following column distribution, the codes were modified over the years: some years used the complete spelling, others an abbreviation, and more recently numeric values.

I quickly applied a lookup table in Pivot Billions to create a new, cleaner transformed column called PayType. Now that my data was clean and enhanced enough to draw some meaningful insights, I simply pivoted my data to get the number of taxi trips and taxi payment statistics by PayType, year, and tip_percent.
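Both transformations, the derived tip percentage and the lookup-table cleanup, are easy to picture in pandas; the code variants in the mapping below are examples of the drift seen in the data, not an exhaustive list:

```python
import pandas as pd

trips = pd.DataFrame({
    "tip_amount":   [2.00, 0.00, 1.50],
    "total_amount": [12.00, 8.00, 10.00],
    "payment_type": ["CRD", "CASH", "1"],  # codes drift across the years
})

# Derived metric: tip as a percentage of the total fare
trips["tip_percent"] = 100 * trips["tip_amount"] / trips["total_amount"]

# Lookup table collapsing the variant codes into one clean PayType column
paytype_lookup = {"CRD": "Credit", "Credit": "Credit", "1": "Credit",
                  "CSH": "Cash", "CASH": "Cash", "Cash": "Cash", "2": "Cash"}
trips["PayType"] = trips["payment_type"].map(paytype_lookup)
```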

By entering Pivot Billions’ PivotView, I visualized the data using a heatmap to compare the distribution of tips for cash and credit card passengers.


It was immediately clear that people paying in cash generally did not pay a tip or, more likely, the driver did not report the tip. Credit card users, however, typically paid a tip of about 16%, possibly due to the ease of the taxis' touch-panel payment system.

Now that I understood a bit more about passenger and taxi tip behavior I wanted to see if I could find any other trends underlying the data. Quickly pivoting the data again to view the number of trips and trip distance statistics by year and month, I started to draw some new insights. By visualizing the data as a LineChart in PivotView and comparing the total trip distances by year to the average trip distances by year, I noticed an interesting discrepancy.

Though total trip distance was decreasing year over year, with a nearly 40% reduction in total distance logged from 2009 to 2017, the average trip distance was rising slightly. The growing popularity of ridesharing in recent years is likely responsible for this trend.

Incorporating the rideshare data from the NYC Taxi and Limousine Commission Trip Record Data and doing a quick comparison of total January trips from 2015 to 2018 shows a pretty remarkable shift in NYC riding habits.

This chart seems to validate the underlying trend surmised in the prior analysis, that rideshare growth has eroded yellow taxi ridership by nearly 40%.

Diving into large datasets like these can be challenging and incredibly time consuming, but with Pivot Billions it takes just a few minutes from start to finish.

 
