Datenexploration von Web-Analytics-Daten mit Spark

Data-Driven Marketing: Wie die kundenzentrierte Ansprache in Zukunft gelingt, Teil 2

Zur Blog-Serie https://www.shi-gmbh.com/herausforderung-customer-centricity/

Inhalt

  1. Vorbereitung und einlesen der Daten
  2. Datenstruktur
  3. Deskriptive Analyse - Beispiele
    • 3.1. Device
    • 3.2. GeoNetwork
    • 3.3. Traffic Source
    • 3.4. Channel Grouping
    • 3.5. VisitNumber
    • 3.6. Date
    • 3.7. FullVisitorId - Besucherprofile
  4. Webanalyitcs - Hits und Metriken
    • 4.1. Beliebteste Landing Pages, explode-Beispiel
    • 4.2. Bounce Rate
    • 4.3. Churn Analysis
    • 4.4. Schlüsselmerkmale von Bescuhen für das Revenue

1. Vorbereitung und einlesen der Daten

Damit jupyter den lokal installierten Spark mit Python nutzen kann, benutzen wir das Modul findspark. Ausserdem werden die pySpark-Module importiert, die für die Analyse benötigt werden.

In [3]:
import findspark
findspark.init('/usr/local/Cellar/apache-spark/2.3.2/libexec/')

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

Wir starten die Spark Applikation indem wir eine SparkSession erzeugen

In [4]:
sc = SparkSession.builder.appName("TestGA").getOrCreate()

Wir werden die Daten in einem sog. DataFrame reinladen, der im wesentlichen einer Tabelle mit benannten Spalten entspricht und SQL-ähnliche Abfragen der Daten emöglicht. DataFrames lassen sich für parallele Berechnungen ähnlich wie mit der grundlegenden Datenstruktur in Spark, sog. RDDs, auf einem Cluster gut verteilen, bieten aber eben den Vorteil einer Schnittstelle, die jedem, der mit einer klassischen Datenbank gearbeitet hat, bekannt vorkommen dürfte. Dank eines Query-Optimizer im Hintergrund, Catalyst, lassen sich Dataframes sehr effizient abfragen und transformieren.

Die mit gzip kompirmierten JSON-Daten lassen sich direkt einlesen, mit dem Stern-Operator im Pfad lassen sich mehrere Dateien in den selben Dataframe reinladen

In [5]:
dataF = sc.read.json('/Users/eduardoschumann/Documents/Projekte/E-CommerceAnalytics/ga_data/ga_sessions_2016*.json.gz')

2. Datenstruktur

In [6]:
dataF.count(), len(dataF.columns)
Out[6]:
(436393, 13)

Wir haben grob die Hälfte der Daten reingeladen, der gesamte Datensatz hat (903653, 13) Zeilen / Spalten. Jede Zeile stellt eine Sitzung dar, in der ein Benutzer den Store besucht. Wir können uns das Schema anzeigen lassen, um mehr über die Spalten zu erfahren.

In [7]:
dataF.printSchema()
root
 |-- channelGrouping: string (nullable = true)
 |-- customDimensions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- index: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- date: string (nullable = true)
 |-- device: struct (nullable = true)
 |    |-- browser: string (nullable = true)
 |    |-- browserSize: string (nullable = true)
 |    |-- browserVersion: string (nullable = true)
 |    |-- deviceCategory: string (nullable = true)
 |    |-- flashVersion: string (nullable = true)
 |    |-- isMobile: boolean (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- mobileDeviceBranding: string (nullable = true)
 |    |-- mobileDeviceInfo: string (nullable = true)
 |    |-- mobileDeviceMarketingName: string (nullable = true)
 |    |-- mobileDeviceModel: string (nullable = true)
 |    |-- mobileInputSelector: string (nullable = true)
 |    |-- operatingSystem: string (nullable = true)
 |    |-- operatingSystemVersion: string (nullable = true)
 |    |-- screenColors: string (nullable = true)
 |    |-- screenResolution: string (nullable = true)
 |-- fullVisitorId: string (nullable = true)
 |-- geoNetwork: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- cityId: string (nullable = true)
 |    |-- continent: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- latitude: string (nullable = true)
 |    |-- longitude: string (nullable = true)
 |    |-- metro: string (nullable = true)
 |    |-- networkDomain: string (nullable = true)
 |    |-- networkLocation: string (nullable = true)
 |    |-- region: string (nullable = true)
 |    |-- subContinent: string (nullable = true)
 |-- hits: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- appInfo: struct (nullable = true)
 |    |    |    |-- exitScreenName: string (nullable = true)
 |    |    |    |-- landingScreenName: string (nullable = true)
 |    |    |    |-- screenDepth: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |-- contentGroup: struct (nullable = true)
 |    |    |    |-- contentGroup1: string (nullable = true)
 |    |    |    |-- contentGroup2: string (nullable = true)
 |    |    |    |-- contentGroup3: string (nullable = true)
 |    |    |    |-- contentGroup4: string (nullable = true)
 |    |    |    |-- contentGroup5: string (nullable = true)
 |    |    |    |-- contentGroupUniqueViews1: string (nullable = true)
 |    |    |    |-- contentGroupUniqueViews2: string (nullable = true)
 |    |    |    |-- contentGroupUniqueViews3: string (nullable = true)
 |    |    |    |-- previousContentGroup1: string (nullable = true)
 |    |    |    |-- previousContentGroup2: string (nullable = true)
 |    |    |    |-- previousContentGroup3: string (nullable = true)
 |    |    |    |-- previousContentGroup4: string (nullable = true)
 |    |    |    |-- previousContentGroup5: string (nullable = true)
 |    |    |-- customDimensions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- customMetrics: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- customVariables: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- eCommerceAction: struct (nullable = true)
 |    |    |    |-- action_type: string (nullable = true)
 |    |    |    |-- option: string (nullable = true)
 |    |    |    |-- step: string (nullable = true)
 |    |    |-- eventInfo: struct (nullable = true)
 |    |    |    |-- eventAction: string (nullable = true)
 |    |    |    |-- eventCategory: string (nullable = true)
 |    |    |    |-- eventLabel: string (nullable = true)
 |    |    |-- exceptionInfo: struct (nullable = true)
 |    |    |    |-- isFatal: boolean (nullable = true)
 |    |    |-- experiment: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- hitNumber: string (nullable = true)
 |    |    |-- hour: string (nullable = true)
 |    |    |-- isEntrance: boolean (nullable = true)
 |    |    |-- isExit: boolean (nullable = true)
 |    |    |-- isInteraction: boolean (nullable = true)
 |    |    |-- item: struct (nullable = true)
 |    |    |    |-- currencyCode: string (nullable = true)
 |    |    |    |-- transactionId: string (nullable = true)
 |    |    |-- latencyTracking: struct (nullable = true)
 |    |    |    |-- domContentLoadedTime: string (nullable = true)
 |    |    |    |-- domInteractiveTime: string (nullable = true)
 |    |    |    |-- domLatencyMetricsSample: string (nullable = true)
 |    |    |    |-- domainLookupTime: string (nullable = true)
 |    |    |    |-- pageDownloadTime: string (nullable = true)
 |    |    |    |-- pageLoadSample: string (nullable = true)
 |    |    |    |-- pageLoadTime: string (nullable = true)
 |    |    |    |-- redirectionTime: string (nullable = true)
 |    |    |    |-- serverConnectionTime: string (nullable = true)
 |    |    |    |-- serverResponseTime: string (nullable = true)
 |    |    |    |-- speedMetricsSample: string (nullable = true)
 |    |    |-- minute: string (nullable = true)
 |    |    |-- page: struct (nullable = true)
 |    |    |    |-- hostname: string (nullable = true)
 |    |    |    |-- pagePath: string (nullable = true)
 |    |    |    |-- pagePathLevel1: string (nullable = true)
 |    |    |    |-- pagePathLevel2: string (nullable = true)
 |    |    |    |-- pagePathLevel3: string (nullable = true)
 |    |    |    |-- pagePathLevel4: string (nullable = true)
 |    |    |    |-- pageTitle: string (nullable = true)
 |    |    |    |-- searchCategory: string (nullable = true)
 |    |    |    |-- searchKeyword: string (nullable = true)
 |    |    |-- product: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- customDimensions: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- customMetrics: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- isClick: boolean (nullable = true)
 |    |    |    |    |-- isImpression: boolean (nullable = true)
 |    |    |    |    |-- localProductPrice: string (nullable = true)
 |    |    |    |    |-- localProductRevenue: string (nullable = true)
 |    |    |    |    |-- productBrand: string (nullable = true)
 |    |    |    |    |-- productListName: string (nullable = true)
 |    |    |    |    |-- productListPosition: string (nullable = true)
 |    |    |    |    |-- productPrice: string (nullable = true)
 |    |    |    |    |-- productQuantity: string (nullable = true)
 |    |    |    |    |-- productRevenue: string (nullable = true)
 |    |    |    |    |-- productSKU: string (nullable = true)
 |    |    |    |    |-- productVariant: string (nullable = true)
 |    |    |    |    |-- v2ProductCategory: string (nullable = true)
 |    |    |    |    |-- v2ProductName: string (nullable = true)
 |    |    |-- promotion: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- promoCreative: string (nullable = true)
 |    |    |    |    |-- promoId: string (nullable = true)
 |    |    |    |    |-- promoName: string (nullable = true)
 |    |    |    |    |-- promoPosition: string (nullable = true)
 |    |    |-- promotionActionInfo: struct (nullable = true)
 |    |    |    |-- promoIsClick: boolean (nullable = true)
 |    |    |    |-- promoIsView: boolean (nullable = true)
 |    |    |-- publisher_infos: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- referer: string (nullable = true)
 |    |    |-- social: struct (nullable = true)
 |    |    |    |-- hasSocialSourceReferral: string (nullable = true)
 |    |    |    |-- socialInteractionNetworkAction: string (nullable = true)
 |    |    |    |-- socialNetwork: string (nullable = true)
 |    |    |-- time: string (nullable = true)
 |    |    |-- transaction: struct (nullable = true)
 |    |    |    |-- affiliation: string (nullable = true)
 |    |    |    |-- currencyCode: string (nullable = true)
 |    |    |    |-- localTransactionRevenue: string (nullable = true)
 |    |    |    |-- localTransactionShipping: string (nullable = true)
 |    |    |    |-- localTransactionTax: string (nullable = true)
 |    |    |    |-- transactionCoupon: string (nullable = true)
 |    |    |    |-- transactionId: string (nullable = true)
 |    |    |    |-- transactionRevenue: string (nullable = true)
 |    |    |    |-- transactionShipping: string (nullable = true)
 |    |    |    |-- transactionTax: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- socialEngagementType: string (nullable = true)
 |-- totals: struct (nullable = true)
 |    |-- bounces: string (nullable = true)
 |    |-- hits: string (nullable = true)
 |    |-- newVisits: string (nullable = true)
 |    |-- pageviews: string (nullable = true)
 |    |-- timeOnSite: string (nullable = true)
 |    |-- totalTransactionRevenue: string (nullable = true)
 |    |-- transactionRevenue: string (nullable = true)
 |    |-- transactions: string (nullable = true)
 |    |-- visits: string (nullable = true)
 |-- trafficSource: struct (nullable = true)
 |    |-- adContent: string (nullable = true)
 |    |-- adwordsClickInfo: struct (nullable = true)
 |    |    |-- adNetworkType: string (nullable = true)
 |    |    |-- criteriaParameters: string (nullable = true)
 |    |    |-- gclId: string (nullable = true)
 |    |    |-- isVideoAd: boolean (nullable = true)
 |    |    |-- page: string (nullable = true)
 |    |    |-- slot: string (nullable = true)
 |    |-- campaign: string (nullable = true)
 |    |-- campaignCode: string (nullable = true)
 |    |-- isTrueDirect: boolean (nullable = true)
 |    |-- keyword: string (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- referralPath: string (nullable = true)
 |    |-- source: string (nullable = true)
 |-- visitId: string (nullable = true)
 |-- visitNumber: string (nullable = true)
 |-- visitStartTime: string (nullable = true)

Erläuterung Schema

Als erstes fällt auf, dass die meisten der dreizehn verschiedenen Spalten, die Eigenschaften des Besuchs erfassen, ihrerseits eingebettete Strukturen mit einem eigenen Unterschema haben. Außerdem werden alle Werte als String eingelesen.

Einfache Spalten mit einzelnen Werten lassen sich leicht von Namen her interpretieren: fullVisitorId, date, visitId, visitStartTime, visitNumber

Die meisten Spalten haben jedoch Unterspalten (Typ struct), um verwandte Merkmale zusammenzufassen, so z.B. die Spalte device mit Unterspalten für Merkmale des für den Besuch verwendenten Geräts. Verschiedene geographischen Informationen werden unter geoNetwork gesammelt, Details zur Herkunft des Besuchs unter trafficSource.

Die Folge der im Prinzip beliebig vielen Interaktionen innerhalb eines Besuchs werden in der Spalte hits gesammelt, die dafür von Typ array ist. Jedes Element in dieser Liste von hits ist wiederum ein Objekt mit einem festen, umfangreichen Schema, um damit zusammenhängende Informationen zu erfassen, etwa die besuchte Seite (page), vordefinierte Content-Gruppen (content), angezeigte Produkte (product), getätigte Transaktionen bzw. Käufe (transaction), usw.

Die Spalte totals fasst verschiedene Statistiken über die hits des Besuchs zusammen: Anzahl, die Gesamteinnahmen aller Transaktionen (totalTransactionRevenue), Anzahl der besuchten Seiten, usw.

Spark bietet verschiedene Funktionen, um den Datentyp einer Spalte anzupassen und mit den verschiedenen Arten von Verschachtelungen in einem DataFrame umzugehen. Damit können Abfragen ausgewertet, ohne dass dafür die Daten vorher denormalisiert werden müssen, wie wir bei der weiteren Erkundung sehen werden.

3. Deskriptive Analyse - Beispiele

Arbeiten mit DataFrames

Spark stellt eine funktionellen Sprache zu Verfügung, die sich an SQL orientiert, um ein DataFrame abzufragen. Histogramme bzw. Frequenzen für die Werte in einer Spalte lassen sich damit analog zu SQL berechnen.

In [8]:
dataF.groupBy('channelGrouping').count().orderBy("count").show()
+---------------+------+
|channelGrouping| count|
+---------------+------+
|        (Other)|     6|
|        Display|  2958|
|     Affiliates|  6642|
|    Paid Search| 11076|
|       Referral| 46007|
|         Direct| 56597|
| Organic Search|147126|
|         Social|165981|
+---------------+------+

Eingebaute Funktionen wie describe() berechnen grundlegende Verteilungs- / Streuungsparameter. Im Beispiel sieht man wie man auf die Unterspalten von totals zugreifen kann.

In [9]:
dataF.select('totals.hits', 'totals.timeOnSite', 'totals.transactions').describe().show()
+-------+------------------+------------------+-------------------+
|summary|              hits|        timeOnSite|       transactions|
+-------+------------------+------------------+-------------------+
|  count|            436393|            218901|               5201|
|   mean| 4.771146649923349|243.96800836908008| 1.0515285522014997|
| stddev|10.626958855819973| 476.8294417539209|0.34311647159360587|
|    min|                 1|                 1|                  1|
|    max|                99|               999|                  8|
+-------+------------------+------------------+-------------------+

Zeilen können mit show ausgegeben werden.

In [10]:
dataF.select('trafficSource').show(5, truncate=False)
+-------------------------------------------------------------------------------------------------------------------------+
|trafficSource                                                                                                            |
+-------------------------------------------------------------------------------------------------------------------------+
|[, [, not available in demo dataset,,,,], (not set),,, (not provided), organic,, google]                                 |
|[, [, not available in demo dataset,,,,], (not set),,, (not provided), organic,, google]                                 |
|[, [, not available in demo dataset,,,,], (not set),,, (not provided), organic,, google]                                 |
|[, [, not available in demo dataset,,,,], (not set),, true,, (none),, (direct)]                                          |
|[, [, not available in demo dataset,,,,], (not set),,,, referral, /2015/03/11/google-merch-store-new-url/, phandroid.com]|
+-------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

In [11]:
dataF.show(10)
+---------------+--------------------+--------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------------+
|channelGrouping|    customDimensions|    date|              device|      fullVisitorId|          geoNetwork|                hits|socialEngagementType|              totals|       trafficSource|   visitId|visitNumber|visitStartTime|
+---------------+--------------------+--------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------------+
| Organic Search|         [[4, APAC]]|20160831|[Chrome, not avai...|5478623230383885328|[(not set), not a...|[[[www.googlemerc...|Not Socially Engaged|[, 6, 1, 6, 148,,...|[, [, not availab...|1472632459|          1|    1472632459|
| Organic Search|         [[4, EMEA]]|20160831|[Chrome, not avai...|9612697236886680399|[not available in...|[[[shop.googlemer...|Not Socially Engaged|[, 13,, 13, 421,,...|[, [, not availab...|1472643849|          4|    1472643849|
| Organic Search|         [[4, EMEA]]|20160831|[Internet Explore...|0837658092652033256|[London, not avai...|[[[shop.googlemer...|Not Socially Engaged|[, 4, 1, 4, 1422,...|[, [, not availab...|1472659695|          1|    1472659695|
|         Direct|         [[4, EMEA]]|20160831|[Chrome, not avai...|6532055062759933865|[not available in...|[[[shop.googlemer...|Not Socially Engaged|[, 5,, 5, 1156,,,...|[, [, not availab...|1472648364|         23|    1472648364|
|       Referral|         [[4, EMEA]]|20160831|[Chrome, not avai...|0040335064426021006|[not available in...|[[[shop.googlemer...|Not Socially Engaged|[, 5, 1, 5, 135,,...|[, [, not availab...|1472659563|          1|    1472659563|
| Organic Search|         [[4, EMEA]]|20160831|[Chrome, not avai...|7977143020316134741|[London, not avai...|[[[shop.googlemer...|Not Socially Engaged|[, 7,, 7, 142,,,, 1]|[, [, not availab...|1472650713|          2|    1472650713|
| Organic Search|[[4, North America]]|20160831|[Chrome, not avai...|0703987475057178830|[not available in...|[[[shop.googlemer...|Not Socially Engaged|[, 7, 1, 3, 26,,,...|[, [, not availab...|1472702236|          1|    1472702236|
| Organic Search|[[4, North America]]|20160831|[Chrome, not avai...|8501295234755313921|[not available in...|[[[shop.googlemer...|Not Socially Engaged|[, 10, 1, 4, 54,,...|[, [, not availab...|1472666162|          1|    1472666162|
|         Direct|                  []|20160831|[Safari, not avai...|5407384644549670219|[Bangkok, not ava...|[[[shop.googlemer...|Not Socially Engaged|[, 11, 1, 10, 839...|[, [, not availab...|1472697195|          1|    1472697195|
| Organic Search|[[4, Central Amer...|20160831|[Chrome, not avai...|9740144896864972546|[not available in...|[[[shop.googlemer...|Not Socially Engaged|[, 11, 1, 11, 764...|[, [, not availab...|1472671647|          1|    1472671647|
+---------------+--------------------+--------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------------+
only showing top 10 rows

Auf die Elemente aus dem Array aus der Spalte Hits kann über einen Index zugegriffen werden.

In [12]:
dataF.select(expr('hits[1]')).show(3,truncate=False)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|hits[1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[www.googlemerchandisestore.com/home, www.googlemerchandisestore.com/home, 0, www.googlemerchandisestore.com/home], [(not set), (not set), (not set), (not set), (not set),,,, (not set), (not set), (not set), (not set), (not set)], [], [], [], [0,, 1],, [true], [], 2, 1,,, true,,, 34, [www.googlemerchandisestore.com, /home, /home, , , , Google Online Store,,], [], [],, [], https://www.google.com.sg/, [No,  : , (not set)], 5628,, PAGE]                                                                                                                                                                                                                                                                   |
|[[shop.googlemerchandisestore.com/google+redesign/electronics, www.googlemerchandisestore.com/home, 0, www.googlemerchandisestore.com/home], [(not set), (not set), (not set), (not set), (not set),,,, (not set), (not set), (not set), (not set), (not set)], [], [], [], [0,, 1],, [true], [], 2, 4,,, true,,, 44, [www.googlemerchandisestore.com, /home, /home, , , , Google Online Store,,], [], [],, [], https://www.google.it/, [No,  : , (not set)], 23406,, PAGE]                                                                                                                                                                                                                                              |
|[[shop.googlemerchandisestore.com/google+redesign/electronics, www.googlemerchandisestore.com/home, 0, www.googlemerchandisestore.com/home], [(not set), (not set), (not set), (not set), (not set),,,, (not set), (not set), (not set), (not set), (not set)], [], [], [], [0,, 1],, [true], [], 2, 9,,, true,,, 31, [www.googlemerchandisestore.com, /home, /home, , , , Google Online Store,,], [], [],, [], https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&sqi=2&ved=0ahUKEwjJuKfthOzOAhXKL8AKHZpHBhMQFggcMAA&url=https%3A%2F%2Fwww.googlemerchandisestore.com%2F&usg=AFQjCNGK7lG66tWBMQkleeCMprO_xNS6qw&sig2=bI7YzaxvZNBLmNas9gNX_A&bvm=bv.131286987,d.d24, [No,  : , (not set)], 1401314,, PAGE]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows

3.1 Device - eingesetztes Gerät

Anzahl der Besuche nach verschiedenen Merkmalen

In [13]:
dataF.groupBy("device.browser").count().orderBy(desc("count")).show()
+--------------------+------+
|             browser| count|
+--------------------+------+
|              Chrome|289040|
|              Safari|106647|
|             Firefox| 16364|
|   Internet Explorer|  8281|
|                Edge|  4212|
|               Opera|  3267|
|          Opera Mini|  2403|
|     Safari (in-app)|  2070|
|           YaBrowser|  1061|
|          UC Browser|   780|
|     Android Webview|   731|
|             Coc Coc|   327|
|     Android Browser|   261|
|         Amazon Silk|   240|
|            MRCHROME|   169|
|Mozilla Compatibl...|   133|
|             Maxthon|   120|
|          BlackBerry|    94|
|    Nintendo Browser|    70|
|       Nokia Browser|    31|
+--------------------+------+
only showing top 20 rows

In [14]:
dataF.agg(countDistinct("device.browser")).show()
+------------------------------+
|count(DISTINCT device.browser)|
+------------------------------+
|                            42|
+------------------------------+

... es gibt 42 browser...

In [15]:
dataF.groupBy("device.deviceCategory").count().orderBy(desc("count")).show()
+--------------+------+
|deviceCategory| count|
+--------------+------+
|       desktop|349772|
|        mobile| 74693|
|        tablet| 11928|
+--------------+------+

In [16]:
dataF.groupBy("device.operatingSystem").count().orderBy(desc("count")).show()
+---------------+------+
|operatingSystem| count|
+---------------+------+
|        Windows|177977|
|      Macintosh|145949|
|        Android| 45171|
|            iOS| 38154|
|          Linux| 15557|
|      Chrome OS| 10756|
|      (not set)|  1880|
|  Windows Phone|   590|
|     BlackBerry|   106|
|        Samsung|    92|
|   Nintendo Wii|    67|
|     Firefox OS|    57|
|           Xbox|    32|
|        FreeBSD|     2|
|          Nokia|     2|
|          SunOS|     1|
+---------------+------+

Durchschnitt-Revenue nach verwendeten Browser

Die Spalten eines DataFrame werden als String eingelesen. Um das Durschnittsrevenue zu berechnen bei der nächsten Queries casten wir die Spalte zu einem double und ersetzen null entsprechend

In [17]:
dataF.select("device.browser",expr("totals.transactionRevenue").cast("double").alias("revenue")).fillna(0.0).groupBy("browser").avg("revenue").orderBy(desc("avg(revenue)")).show()
+-------------------+------------------+
|            browser|      avg(revenue)|
+-------------------+------------------+
|             Chrome|2180227.5117630777|
|               Edge|  786156.220322887|
|            Firefox| 382613.6641407969|
|  Internet Explorer|303738.67890351405|
|             Safari|235921.21672433356|
|        Amazon Silk|124958.33333333333|
|    Android Webview| 64158.68673050615|
|    Safari (in-app)|21217.391304347828|
|              Opera| 17438.01652892562|
|         BlackBerry|               0.0|
|          Lunascape|               0.0|
|osee2unifiedRelease|               0.0|
|     LYF_LS_4002_12|               0.0|
|            Mozilla|               0.0|
|        ThumbSniper|               0.0|
|                ADM|               0.0|
|                 YE|               0.0|
|          YaBrowser|               0.0|
|          Konqueror|               0.0|
|  Hisense M20-M_LTE|               0.0|
+-------------------+------------------+
only showing top 20 rows

Visualisierung mit pandas und plotly

DataFrames können mit Hilfe der Funktion toPandas() zu einem Pandas-Objekt konvertiert werden, das als Input für eine der weitverbreiteten python Graphik-Bibliotheken wie matplolib oder plotly dient. Aber Vorsicht: mit Spark kann man prinzipiell sehr große Datenmengen verarbeiten, dieser Weg der Visualisierung eignet sich jedoch nur für kleinere Datenmengen, da ansonsten sehr große Pandas-Objekte erzeugt werden, die in Spark nicht parallelisiert verarbeitet werden können und einen Flaschenhals in der Berechnung darstellen.

In [6]:
import plotly.plotly as py
import plotly.graph_objs as go
import pandas as pd
import requests
from plotly.offline import iplot, init_notebook_mode
from plotly import tools
init_notebook_mode()


devCat = dataF.groupBy("device.deviceCategory").count().orderBy(desc("count")).toPandas()

iplot(go.Figure(layout=dict(title='Besuche nach Geräteart') , data= [go.Pie(labels=devCat["deviceCategory"], values=devCat["count"])]))

avgRev = dataF.select("device.browser",expr("totals.transactionRevenue").cast("double").alias("revenue")).fillna(0.0).groupBy("browser").agg(avg("revenue").alias("avg_revenue")).where(expr("avg_revenue > 0")).orderBy(desc("avg_revenue")).toPandas()

iplot(go.Figure(layout=dict(title='Durchschnittl. Einnahmen nach Browser') , data= [go.Bar( y = avgRev["avg_revenue"], x = avgRev["browser"])]))

3.2 GeoNetwork:

  • Besuche nach Kontinent, Land
  • Durschnitt revenue nach Kontinent, Land
In [18]:
dataF.groupBy("geoNetwork.continent").count().orderBy(desc("count")).show()
+---------+------+
|continent| count|
+---------+------+
| Americas|205240|
|     Asia|130872|
|   Europe| 85527|
|   Africa|  8102|
|  Oceania|  5828|
|(not set)|   824|
+---------+------+

In [19]:
dataF.groupBy("geoNetwork.country").count().orderBy(desc("count")).show()
+--------------+------+
|       country| count|
+--------------+------+
| United States|158556|
|         India| 21891|
|       Vietnam| 21881|
|      Thailand| 16823|
|        Turkey| 16221|
|United Kingdom| 14145|
|        Brazil| 13069|
|        Canada| 11064|
|        Mexico|  8481|
|         Japan|  8270|
|       Germany|  7728|
|   Philippines|  6105|
|        France|  6078|
|        Russia|  5837|
|     Indonesia|  5693|
|        Taiwan|  5259|
|     Australia|  4915|
|         Italy|  4858|
|       Romania|  4769|
|        Poland|  4675|
+--------------+------+
only showing top 20 rows

In [20]:
dataF.select("geoNetwork.continent",expr("totals.transactionRevenue").cast("double").alias("revenue")).fillna(0.0).groupBy("continent").avg("revenue").orderBy(desc("avg(revenue)")).show()
+---------+------------------+
|continent|      avg(revenue)|
+---------+------------------+
| Americas|3167925.7454687194|
|(not set)| 918070.3883495146|
|   Africa| 656727.9684028635|
|     Asia| 68805.24481936549|
|  Oceania| 66120.45298558682|
|   Europe|22756.088720521006|
+---------+------------------+

In [21]:
dataF.select("geoNetwork.country",expr("totals.transactionRevenue").cast("double").alias("revenue")).fillna(0.0).groupBy("country").avg("revenue").orderBy(desc("avg(revenue)")).show()
+-------------+--------------------+
|      country|        avg(revenue)|
+-------------+--------------------+
|      Curaçao|1.5871538461538462E7|
|    Venezuela|1.0885788546255507E7|
|        Kenya|1.0601006036217304E7|
|United States|  3886736.4212013423|
|  Puerto Rico|  2997040.8163265307|
|   Guadeloupe|  1941428.5714285714|
|       Canada|   1658882.863340564|
|    St. Lucia|  1268181.8181818181|
|    (not set)|   918070.3883495146|
|    Nicaragua|   876575.3424657534|
|      Ecuador|   598015.4639175257|
|  El Salvador|   474629.6296296296|
|    Hong Kong|  458525.34562211984|
|        Japan|   427592.5030229746|
|       Cyprus|  285757.57575757575|
|     Portugal|   260628.0193236715|
|        Chile|    244447.983014862|
|    Indonesia|   210537.5021956789|
|      Armenia|  195384.61538461538|
|  Switzerland|  184687.66684463428|
+-------------+--------------------+
only showing top 20 rows

In [7]:
contDate = dataF.groupBy("geoNetwork.continent").count().orderBy(desc("count")).toPandas()

iplot(go.Figure(layout=dict(title='Besuche nach Kontinent') , data= [go.Pie(labels=contDate["continent"], values=contDate["count"])]))

3.3 Traffic Source

  • Anzahl Besuche nach Kampagne, Quelle

Kampagne können über Google Analytics definiert werden. Die Quelle ist die Seite, aus der der Besuch des Stores kommt.

In [22]:
dataF.groupBy("trafficSource.campaign").count().orderBy(desc("count")).show()
+--------------------+------+
|            campaign| count|
+--------------------+------+
|           (not set)|426502|
|    Data Share Promo|  6642|
|AW - Dynamic Sear...|  3165|
|    AW - Electronics|    71|
|    AW - Accessories|     9|
|        All Products|     4|
+--------------------+------+

In [23]:
dataF.groupBy("trafficSource.source").count().orderBy(desc("count")).show()
+--------------------+------+
|              source| count|
+--------------------+------+
|            (direct)|182022|
|         youtube.com|162211|
|              google| 61476|
|            Partners|  6643|
|analytics.google.com|  5256|
|                 dfa|  2958|
|               baidu|  2242|
|          google.com|  1757|
|    sites.google.com|  1543|
|siliconvalley.abo...|  1235|
|        facebook.com|  1001|
|           qiita.com|   725|
|          reddit.com|   646|
|      m.facebook.com|   574|
|     mail.google.com|   572|
|           quora.com|   465|
|               yahoo|   456|
|                bing|   450|
|       dealspotr.com|   372|
|                t.co|   328|
+--------------------+------+
only showing top 20 rows

3.4 Channel Grouping

In [25]:
dataF.groupBy("channelGrouping").count().orderBy(desc("count")).show()
+---------------+------+
|channelGrouping| count|
+---------------+------+
|         Social|165981|
| Organic Search|147126|
|         Direct| 56597|
|       Referral| 46007|
|    Paid Search| 11076|
|     Affiliates|  6642|
|        Display|  2958|
|        (Other)|     6|
+---------------+------+

  • Organic Search: Traffic, der über Suchmaschinenergebnisse auf eine Website gelangt
  • Direct: Dirkete Eingabe der URL im Browser
  • Referral: Eingebaute Links auf einer Seite die auf den Store zeigen
  • Paid Search: Anzeige in der Google Suche
  • Affiliates: Traffic die über Affiliate-Partner reinkommen
  • Display: Display Werbung wie Banner
  • Social: Traffic der über Social Media reinkommt
In [8]:
channel = dataF.groupBy("channelGrouping").count().orderBy(desc("count")).toPandas()

iplot(go.Figure(layout=dict(title='Besuche nach Kanälen') , data= [go.Pie(labels=channel["channelGrouping"], values=channel["count"])]))

3.5 VisitNumber

VisitNumber ist ein Zähler, der die Besuche eines einzelnen Besuchers (einezlne fullVisitorId) hochzählt

In [26]:
dataF.describe("visitNumber").show()
+-------+-----------------+
|summary|      visitNumber|
+-------+-----------------+
|  count|           436393|
|   mean|2.189237682547612|
| stddev|8.586319156430145|
|    min|                1|
|    max|               99|
+-------+-----------------+

Berechnung Frequenz Nice To Have Binned Histogram Freqenzbereiche (1, 2-5, 5-10, 10-50, 50 -100)

In [27]:
dataF.groupBy("visitNumber").count().orderBy(desc("count")).show()
+-----------+------+
|visitNumber| count|
+-----------+------+
|          1|347719|
|          2| 40420|
|          3| 15960|
|          4|  8520|
|          5|  5153|
|          6|  3468|
|          7|  2422|
|          8|  1795|
|          9|  1396|
|         10|  1051|
|         11|   846|
|         12|   687|
|         13|   569|
|         14|   482|
|         15|   410|
|         16|   359|
|         17|   306|
|         18|   270|
|         19|   242|
|         20|   216|
+-----------+------+
only showing top 20 rows

3.6 Date

Die Spalte date muss mit Hilfe von to_date() von einem String zu einem date konvertiert werden. Mit dayofweek(), month(), kann man verschiedene Aggregationen berechnen. Z.B.

Anzahl pro Besuche, Revenue pro Wochentag

(1 = Sunday, 2 = Monday, ..., 7 = Saturday)

In [28]:
dataF.select(dayofweek(to_date("date", "yyyyMMdd")).alias("wday")).groupBy("wday").count().orderBy("wday").show()
+----+-----+
|wday|count|
+----+-----+
|   1|48436|
|   2|69128|
|   3|70069|
|   4|69848|
|   5|68241|
|   6|62418|
|   7|48253|
+----+-----+

In [29]:
dataF.select(dayofweek(to_date("date", "yyyyMMdd")).alias("wday"), expr("totals.transactionRevenue").cast("double").alias("revenue")).fillna(0.0).groupBy("wday").sum("revenue").orderBy("wday").show()
+----+------------+
|wday|sum(revenue)|
+----+------------+
|   1| 3.753569E10|
|   2|1.2631296E11|
|   3|1.1957369E11|
|   4|1.1725351E11|
|   5|1.2460317E11|
|   6|1.1251313E11|
|   7| 2.980652E10|
+----+------------+

In [11]:
revenue = dataF.select(dayofweek(to_date("date", "yyyyMMdd")).alias("wday"), expr("totals.transactionRevenue /100").cast("double").alias("revenue")).fillna(0.0).groupBy("wday").agg(avg("revenue").alias("avg_revenue")).orderBy("wday").toPandas()
visits = dataF.select(dayofweek(to_date("date", "yyyyMMdd")).alias("wday")).groupBy("wday").count().orderBy("wday").toPandas()

tr1= go.Bar( name ="Anzahl Besuche", y = visits["count"][::-1], x = visits["wday"][::-1], marker=dict(opacity=0.5, color="green"))
tr2 = go.Bar( name= "Durchschnitt. Einnahmen", y = revenue["avg_revenue"][::-1], x = revenue["wday"][::-1],  marker=dict(opacity=0.5, color="blue"))

fig = tools.make_subplots(rows=1, cols=2,  subplot_titles=["Anzahl Besuche ", "Durchschnitt. Einnahmen"], print_grid=False)
fig.append_trace(tr1, 1, 1)
fig.append_trace(tr2, 1, 2)

layout=dict(title='Anzahl der Besuche und Einnahmen nach Wochentag',showlegend=False)
fig['layout'].update(layout)
iplot(fig)
In [12]:
revenueM = dataF.select(to_date("date", "yyyyMMdd").alias("mday"), expr("totals.transactionRevenue /100").cast("double").alias("revenue")).fillna(0.0).groupBy("mday").agg(sum("revenue").alias("total_revenue")).orderBy("mday").toPandas()
visitsM = dataF.select(to_date("date", "yyyyMMdd").alias("mday")).fillna(0.0).groupBy("mday").count().orderBy("mday").toPandas()

tr1 = go.Scatter(mode="lines", x = visitsM["mday"].astype(str), y = visitsM["count"])
layout = go.Layout(title="Besuche nach Datum", height=400)
fig = go.Figure(data = [tr1], layout = layout)
iplot(fig)


tr2 = go.Scatter(mode="lines", x = revenueM["mday"].astype(str), y = revenueM["total_revenue"])
layout = go.Layout(title="Einnahmen nach Datum", height=400)
fig = go.Figure(data = [tr2], layout = layout)
iplot(fig)

3.7 FullVisitorId - Besucherprofile

  • Anzhal der Besucher die was gekauft haben
In [31]:
dataF.select("fullvisitorId","totals").fillna(0.0).filter(expr("totals.transactionRevenue > 0")).distinct().count()
Out[31]:
5151
In [32]:
dataF.select("fullVisitorId").distinct().count()
Out[32]:
354810

Wir haben 4562 Besucher die Revenue generiert haben von den insg. 354810

  • Top Besucher nach Revenue

Wir nutzen agg um zusätzliche Statistiken zu deren Besuch zu erhalten

In [37]:
dataF.select("fullvisitorId","totals").fillna(0.0).filter(expr("totals.transactionRevenue > 0")).groupBy("fullVisitorId").agg(sum("totals.bounces").alias("total_bounces"),
      sum("totals.hits").alias("total_hits"), sum("totals.newVisits").alias("new_visit_count"), sum("totals.pageviews").alias("total_pageviews"), sum("totals.transactionRevenue").alias("total_revenue"), sum("totals.transactions").alias("transaction_count"), avg("totals.transactions").alias("transaction_avg")
     ).orderBy(desc("total_revenue")).show(5,truncate=False)
+-------------------+-------------+----------+---------------+---------------+-------------+-----------------+------------------+
|fullVisitorId      |total_bounces|total_hits|new_visit_count|total_pageviews|total_revenue|transaction_count|transaction_avg   |
+-------------------+-------------+----------+---------------+---------------+-------------+-----------------+------------------+
|1956307607572137989|null         |542.0     |null           |395.0          |5.17871E9    |10.0             |1.4285714285714286|
|6760732402251466726|null         |632.0     |null           |481.0          |4.33092E9    |27.0             |1.8               |
|3751433429956400495|null         |50.0      |null           |43.0           |3.6793E9     |2.0              |1.0               |
|2446685875964479851|null         |632.0     |null           |466.0          |3.47836E9    |5.0              |1.0               |
|428994201200499894 |null         |89.0      |null           |69.0           |3.28657E9    |3.0              |1.0               |
+-------------------+-------------+----------+---------------+---------------+-------------+-----------------+------------------+
only showing top 5 rows

4. Webanalyitcs - Hits und Metriken

Wir betrachten hier die Verwendung der Seite und berechnen typische Metriken im Bereich Webanalytics.

4.1 Beliebteste Landing Pages, explode-Beispiel

Um die beliebtesten Seiten für den Start des Besuchs im Store zu berechnen müssen wir auf die Elemente des Array in der Spalte hits zugreifen.

Ein Hit ist eine Interaktion, die dazu führt, dass Daten an Google Analytics gesendet werden. Es gibt verschiedene Hits oder Treffer-Arten:

Pageview Hit (Seitenaufruf Treffer) Screenview Hit (Bildschirmaufruf Treffer) Event Hit (Ereignis Treffer) Social Interaction Hit (Soziale Interaktion Treffer) E-Commerce Hit (E-Commerce Treffer) User Timing Treffer (Nutzer Timing Treffer) Exception Hit (Ausnahme Treffer)

Da in jeder Zeile die Anzahl der Hits im Prinzip unterschiedlich lang ist, verwenden wir die Funktion expolode() um für jedes Element in Hits in einer nene Spalte eine neue Zeile in der Tabelle einzuführen, sodass dann mit den üblichen Aggregationsfunktionen Statistiken berechnet werden könenn.

In [38]:
dataF.select(explode("hits").alias("hit"), expr("hit.page.pagePath").alias("path")).where("hit.type == 'PAGE' AND hit.hitNumber == 1").groupBy("path").count().orderBy(desc("count")).show(truncate=False)
+-------------------------------------------------+------+
|path                                             |count |
+-------------------------------------------------+------+
|/home                                            |345481|
|/google+redesign/apparel/men++s/men++s+t+shirts  |14798 |
|/google+redesign/shop+by+brand/youtube           |13962 |
|/signin.html                                     |4746  |
|/google+redesign/drinkware                       |3923  |
|/google+redesign/apparel/men++s/men++s+outerwear |3649  |
|/google+redesign/bags                            |3494  |
|/basket.html                                     |3441  |
|/google+redesign/apparel                         |2375  |
|/store.html                                      |2052  |
|/google+redesign/brands/youtube/home             |1919  |
|/google+redesign/apparel/women+s/women+s+t+shirts|1696  |
|/google+redesign/electronics                     |1537  |
|/google+redesign/office                          |1348  |
|/home-2                                          |1312  |
|/asearch.html                                    |1191  |
|/google+redesign/apparel/headgear                |1159  |
|/google+redesign/bags/backpacks                  |827   |
|/google+redesign/shop+by+brand/google            |812   |
|/google redesign/apparel/men s/men s t shirts    |795   |
+-------------------------------------------------+------+
only showing top 20 rows

4.2 Bounce Rate

The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

also müssen wir die visits zählen wenn sie ein hit haben und ein bounce

In [6]:
dataF.select("trafficSource.source", expr("totals.bounces").cast("integer").alias("bounces"),expr("totals.pageviews").cast("integer").alias("pageviews"), expr("totals.hits").cast("integer").alias("hits") ).fillna(0).groupBy("source").agg(count("source").alias("total_visits"), count(when((expr("hits") == 1) & (expr("bounces") == 1)  , True)).alias("total_bounces")).orderBy(desc("total_visits")).select("source","total_visits","total_bounces", expr("(total_bounces / total_visits) * 100 ").alias("rate")).show()
+--------------------+------------+-------------+------------------+
|              source|total_visits|total_bounces|              rate|
+--------------------+------------+-------------+------------------+
|            (direct)|      182022|        71381|39.215589324367386|
|         youtube.com|      162211|       104907| 64.67317259618645|
|              google|       61476|        25072| 40.78339514607327|
|            Partners|        6643|         3095|  46.5903959054644|
|analytics.google.com|        5256|         2607| 49.60045662100457|
|                 dfa|        2958|         1016| 34.34753211629479|
|               baidu|        2242|         1507| 67.21677074041035|
|          google.com|        1757|          653| 37.16562322140011|
|    sites.google.com|        1543|          479| 31.04342190537913|
|siliconvalley.abo...|        1235|          385| 31.17408906882591|
|        facebook.com|        1001|          433|43.256743256743256|
|           qiita.com|         725|          310|42.758620689655174|
|          reddit.com|         646|          261| 40.40247678018576|
|      m.facebook.com|         574|          282| 49.12891986062718|
|     mail.google.com|         572|          217| 37.93706293706294|
|           quora.com|         465|           83|  17.8494623655914|
|               yahoo|         456|          198| 43.42105263157895|
|                bing|         450|          169| 37.55555555555555|
|       dealspotr.com|         372|          182|48.924731182795696|
|                t.co|         328|          225| 68.59756097560977|
+--------------------+------------+-------------+------------------+
only showing top 20 rows

In [14]:
bounces = dataF.select("trafficSource.source", expr("totals.bounces").cast("integer").alias("bounces"),expr("totals.pageviews").cast("integer").alias("pageviews"), expr("totals.hits").cast("integer").alias("hits") ).fillna(0).groupBy("source").agg(count("source").alias("total_visits"), count(when((expr("hits") == 1) & (expr("bounces") == 1)  , True)).alias("total_bounces")).select("source","total_visits","total_bounces", expr("(total_bounces / total_visits) * 100 ").alias("bounce_rate")).orderBy(desc("total_visits")).limit(10).toPandas()
tr1= go.Bar( name ="Anzahl Besuche", x = bounces["total_visits"][::-1], y = bounces["source"][::-1], orientation="h", marker=dict(opacity=0.5, color="green"))
tr2 = go.Bar( name= "Bounce Rate", x = bounces["bounce_rate"][::-1], y = bounces["source"][::-1], orientation="h", marker=dict(opacity=0.5, color="blue"))

fig = tools.make_subplots(rows=1, cols=2,  subplot_titles=["Anzahl Besuche ", "Bounce Rate"], print_grid=False)
fig.append_trace(tr1, 1, 1)
fig.append_trace(tr2, 1, 2)

layout=dict(title='Top 10 Quellen nach Anzahl der Besuche',showlegend=False)
fig['layout'].update(layout)

iplot(fig)

4.3 Churn Analysis

Churn bezeichnet "Abwanderung". Uns interessiert wie lange die User dem Store treu bleiben. Dafür interessiert uns ihr verhalten über der Zeit.

In [40]:
dataF.select(month(to_date("date", "yyyyMMdd")).alias("wday")).groupBy("wday").count().orderBy("wday").show()
+----+------+
|wday| count|
+----+------+
|   8| 74759|
|   9| 71032|
|  10| 97506|
|  11|113972|
|  12| 79124|
+----+------+

In [48]:
dataF.select(month(to_date("date", "yyyyMMdd")).alias("wday"), "fullVisitorId").groupBy("wday").agg(countDistinct("fullVisitorId")).orderBy("wday").show()
+----+-----------------------------+
|wday|count(DISTINCT fullVisitorId)|
+----+-----------------------------+
|   8|                        61699|
|   9|                        59121|
|  10|                        84901|
|  11|                        99734|
|  12|                        63839|
+----+-----------------------------+

Für August und September bauen wir eine Tabelle auf mit den Besucher die was gekauft haben und ihr revenue.

In [49]:
firstMonthsVisitors = dataF.select("fullVisitorId", expr("totals.transactionRevenue").cast("float").alias("revenue"), month(to_date("date", "yyyyMMdd")).alias("month")).where(((expr("month") == 8) | (expr("month") == 9)) & (expr("revenue") > 0)).groupBy("fullVisitorId").agg(sum("revenue").alias("first_revenue"))
In [50]:
firstMonthsVisitors.show(3)
+-------------------+-------------+
|      fullVisitorId|first_revenue|
+-------------------+-------------+
|9972043774359472649| 6.71219968E8|
|6332431001963915083|      6.367E7|
|4132503829043338043|      7.645E7|
+-------------------+-------------+
only showing top 3 rows

Analog bauen eine Tabelle für die zwei darauffolgenden Monate auf und tragen das Revenue aus den Vormonaten ein, falls der Besucher aus diesem Zeitraum bereits bekannt ist. Dafür machen wir einen left Join mit firstMonthsVisitors

In [51]:
nextMonths= dataF.select("fullVisitorId", expr("totals.transactionRevenue").cast("float").alias("revenue"), month(to_date("date", "yyyyMMdd")).alias("month")).where(((expr("month") == 10) | (expr("month") == 11)) & (expr("revenue") > 0)).groupBy("fullVisitorId").agg(sum("revenue").alias("current_revenue")).join(firstMonthsVisitors, "fullVisitorId", "left")
In [52]:
nextMonths.show()
+-------------------+---------------+-------------+
|      fullVisitorId|current_revenue|first_revenue|
+-------------------+---------------+-------------+
|8706126538675944851|        3.352E8|         null|
|3090631969460248576|        1.359E7|         null|
|  93310453484553750|        2.719E7|         null|
|8008255526797390672|        6.087E7|         null|
|0829523931932991940|   3.79750016E8|         null|
|4389802785459743912|        5.799E7|         null|
|9077832746989941740|        2.594E7|         null|
|2435691687281327196|        1.287E8|         null|
|3269834865385146569|        1.799E8|         null|
|0385969934087457197|        1.119E8|         null|
|8250541234138796403|        8.096E7|         null|
|2913538643362383211|        2.493E7|         null|
|4056519069919299079|      9480000.0|         null|
|1658702239532336072|        4.479E7|         null|
|2497486617423212304|        9.673E7|         null|
|9019087328365285075|        1.599E7|         null|
|1306270201462959225|       1.7916E8|         null|
|7441657642188873185|        4.479E7|         null|
|2864376493455301677|       4.5854E8|         null|
|5175327826829754798|        5.272E7|         null|
+-------------------+---------------+-------------+
only showing top 20 rows

Wie viele Besucher kennen wir aus der Vorperiode? Wieviele gibt es in der aktuellen?

In [53]:
nextMonths.filter(col("first_revenue").isNotNull()).count(), nextMonths.count()
Out[53]:
(61, 1634)

Durschnitt der Einnahmen bereits bekannter Besucher aus der Vorperiode.

In [54]:
nextMonths.filter(col("first_revenue").isNotNull()).agg(avg("current_revenue")).show()
+--------------------+
|avg(current_revenue)|
+--------------------+
|2.3332032629508197E8|
+--------------------+

Durschnitt der Einnahmen neuer Besucher:

In [55]:
nextMonths.filter(col("first_revenue").isNull()).agg(avg("current_revenue")).show()
+--------------------+
|avg(current_revenue)|
+--------------------+
|1.3865886873998728E8|
+--------------------+

4.4 Schlüsselmerkmale von Besuchen für das Revenue

In [56]:
dataF.select("trafficSource.source","device.operatingSystem","device.browser","device.deviceCategory", "geoNetwork.country","totals.transactionRevenue").groupBy("source","operatingSystem","browser","deviceCategory","country").agg(sum("transactionRevenue").alias("revenue")).orderBy(desc("revenue")).show()
+---------------+---------------+-------+--------------+-------------+------------+
|         source|operatingSystem|browser|deviceCategory|      country|     revenue|
+---------------+---------------+-------+--------------+-------------+------------+
|       (direct)|      Macintosh| Chrome|       desktop|United States|3.0546743E11|
|       (direct)|        Windows| Chrome|       desktop|United States|1.0013092E11|
|       (direct)|      Chrome OS| Chrome|       desktop|United States| 5.990554E10|
|         google|      Macintosh| Chrome|       desktop|United States| 2.438618E10|
|       (direct)|          Linux| Chrome|       desktop|United States|  2.00006E10|
|         google|        Windows| Chrome|       desktop|United States| 1.601503E10|
|       (direct)|      Macintosh| Safari|       desktop|United States| 1.407139E10|
|       (direct)|        Android| Chrome|        mobile|United States| 1.254485E10|
|       (direct)|      Macintosh| Chrome|       desktop|       Canada| 1.015902E10|
|         google|        Windows| Chrome|       desktop|    Venezuela| 1.000559E10|
|         google|      Chrome OS| Chrome|       desktop|United States|   8.69521E9|
|            dfa|      Macintosh| Chrome|       desktop|United States|   7.67516E9|
|mail.google.com|        Windows| Chrome|       desktop|United States|   7.40864E9|
|       (direct)|        Windows| Chrome|       desktop|        Kenya|    5.2687E9|
|mail.google.com|      Macintosh| Chrome|       desktop|United States|    4.2785E9|
|         google|        Windows| Chrome|       desktop|       Canada|   3.89583E9|
|       (direct)|      Macintosh| Chrome|       desktop|        Japan|   3.43474E9|
|       (direct)|            iOS| Safari|        mobile|United States|     3.381E9|
|       (direct)|        Windows|Firefox|       desktop|United States|   3.37258E9|
|       (direct)|        Windows|   Edge|       desktop|United States|   2.28119E9|
+---------------+---------------+-------+--------------+-------------+------------+
only showing top 20 rows