Data-Driven Marketing: How Customer-Centric Targeting Will Succeed in the Future, Part 4

Automatic Prediction of Bounces with Machine Learning

Blog series: https://www.shi-gmbh.com/herausforderung-customer-centricity/

Author: Dr. Eduardo Torres Schumann, SHI GmbH

1. Reading the Data and Selecting Features

The Google Analytics data is read from disk into a DataFrame. We start with the visits for one month:

In [1]:
import findspark
findspark.init('/usr/local/Cellar/apache-spark/2.3.2/libexec/')

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window
In [3]:
sc = SparkSession.builder.appName("TestGA").getOrCreate()

ga_data = sc.read.json('/Users/eduardoschumann/Documents/Projekte/E-CommerceAnalytics/ga_data/ga_sessions_201707*.json.gz')

Whether a visit is a bounce is encoded in the 'totals' column. We keep its sub-column 'bounces' together with the columns that will serve as features:

  • Origin of the visit: channel ('channelGrouping') and geographic region ('geoNetwork.subcontinent')
  • Characteristics of the device used by the visitor ('device.browser', 'device.deviceCategory')
  • Landing page of the visit: it has to be extracted from the first hit of the visit ('hit.page.pagePath'), so the hits are unfolded with explode and only the first one is kept ('hit.time' == 0)

This covers several different aspects of a visit. The features were chosen after an initial data exploration.
From that exploration we know that the features are independent of one another, which is important for training the model.
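
The landing-page extraction from the last bullet can be pictured on plain Python data. The sketch below only illustrates the logic and is not the Spark code used in this notebook: the two sessions and their hit values are made up, and only the field names ('hits', 'time', 'page', 'pagePath') follow the Google Analytics export.

```python
# Hypothetical sessions; only the field names mirror the Google Analytics export.
sessions = [
    {"visitId": 1, "hits": [
        {"time": 0, "page": {"pagePath": "/home"}},
        {"time": 5200, "page": {"pagePath": "/google+redesign/drinkware"}},
    ]},
    {"visitId": 2, "hits": [
        {"time": 0, "page": {"pagePath": "/google+redesign/apparel/headgear"}},
    ]},
]

# "Explode": one row per (session, hit) pair instead of one row per session.
exploded = [(s["visitId"], hit) for s in sessions for hit in s["hits"]]

# Keep only the first hit of each visit (time == 0); its page is the landing page.
landing_pages = {visit_id: hit["page"]["pagePath"]
                 for visit_id, hit in exploded if hit["time"] == 0}
print(landing_pages)
```

Exploding first and filtering afterwards keeps exactly one row per visit, which is why the visit count is unchanged by this step.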

In [117]:
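# explode() first, so that the alias 'hit' exists before 'hit.page.pagePath' is selected; keep only the first hit
hits = ga_data.select('fullVisitorId', 'visitId', 'device.browser', 'device.deviceCategory', 'channelGrouping', 'geoNetwork.subcontinent', expr("totals.bounces").cast("integer").alias("bounces"), explode('hits').alias('hit')).where('hit.time == 0')
data = hits.select('fullVisitorId', 'visitId', 'browser', 'deviceCategory', 'channelGrouping', 'subcontinent', 'hit.page.pagePath', 'bounces', 'hit').fillna(0)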
In [121]:
data = data.drop('fullVisitorId').drop('visitId').drop('hit')
data.show(5)
+-------+--------------+---------------+----------------+--------+-------+
|browser|deviceCategory|channelGrouping|    subcontinent|pagePath|bounces|
+-------+--------------+---------------+----------------+--------+-------+
| Chrome|        mobile|         Direct|Northern America|   /home|      0|
| Chrome|        mobile|         Direct|  Southeast Asia|   /home|      0|
| Chrome|       desktop| Organic Search|    Eastern Asia|   /home|      0|
| Chrome|        mobile| Organic Search|  Southeast Asia|   /home|      0|
| Chrome|       desktop| Organic Search|  Southeast Asia|   /home|      0|
+-------+--------------+---------------+----------------+--------+-------+
only showing top 5 rows

In [122]:
data.printSchema()
root
 |-- browser: string (nullable = true)
 |-- deviceCategory: string (nullable = true)
 |-- channelGrouping: string (nullable = true)
 |-- subcontinent: string (nullable = true)
 |-- pagePath: string (nullable = true)
 |-- bounces: integer (nullable = true)

Target Class: Bounces

In the selected data set, bounces ('bounces' == 1) and "real" visits ('bounces' == 0) are represented about equally often:

In [5]:
data.groupBy('bounces').count().show()
+-------+-----+
|bounces|count|
+-------+-----+
|      1|36408|
|      0|35466|
+-------+-----+

This matches what was observed in the initial data exploration. Techniques such as stratified sampling are therefore not needed to obtain a training set that is balanced with respect to the target class.
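
As a quick check of that balance, the share of bounces can be computed directly from the two counts above. This is a small plain-Python sketch using only those numbers; the variable names are illustrative.

```python
# Counts taken from the groupBy('bounces') output above.
counts = {1: 36408, 0: 35466}

total = sum(counts.values())
bounce_share = counts[1] / total

# A classifier that always predicts "bounce" would reach this accuracy.
print(total, round(bounce_share, 4))
```

The share of bounces is close to one half, so it also serves as a baseline: any useful model should be clearly better than always predicting the majority class.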

2. Feature Engineering

The selected features are all categorical; none of them has a numeric range. We take a closer look at the distinct values of each feature: their number should be small compared to the number of records, and individual values should not be sparsely populated, i.e. backed by only a handful of records.

The decision tree implementation we are going to use controls how many distinct values a categorical feature may have through the parameter "maxBins". Its default value is 32, which we use as a guideline when forming our categories.

Browser

'browser' has 29 distinct values, some of which are very sparsely populated.

We keep the browsers that exceed a certain frequency (here more than 150 visits) and attach them to the data with a left "join". All remaining rare browsers are encoded as OTHER.

In [126]:
browser = data.select("browser").groupBy("browser").count().orderBy("browser")
browser.show()
+--------------------+-----+
|             browser|count|
+--------------------+-----+
|           (not set)|    1|
|                   0|    7|
|         Amazon Silk|   47|
|     Android Browser|   41|
|     Android Runtime|    1|
|     Android Webview|  890|
|          BlackBerry|   10|
|              Chrome|50022|
|             Coc Coc|   52|
|                Edge|  994|
|             Firefox| 2770|
|   Internet Explorer| 1527|
|                Iron|    3|
|            MRCHROME|    5|
|             Maxthon|   14|
|             Mozilla|    1|
|Mozilla Compatibl...|   43|
|    Nintendo Browser|    5|
|               Opera|  323|
|          Opera Mini|  640|
+--------------------+-----+
only showing top 20 rows

In [127]:
selectedBrowser = browser.select(expr("browser").alias("normBrowser")).where(expr("count > 150"))
selectedBrowser.show()
+-----------------+
|      normBrowser|
+-----------------+
|  Android Webview|
|           Chrome|
|             Edge|
|          Firefox|
|Internet Explorer|
|            Opera|
|       Opera Mini|
|           Safari|
|  Safari (in-app)|
|       UC Browser|
+-----------------+

In [128]:
data = data.join( selectedBrowser, data.browser == selectedBrowser.normBrowser,how='left').fillna({'normBrowser':'OTHER'})
In [129]:
data.printSchema()
root
 |-- browser: string (nullable = true)
 |-- deviceCategory: string (nullable = true)
 |-- channelGrouping: string (nullable = true)
 |-- subcontinent: string (nullable = true)
 |-- pagePath: string (nullable = true)
 |-- bounces: integer (nullable = true)
 |-- normBrowser: string (nullable = false)
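
The same bucketing can be sketched without a join. The plain-Python version below is an illustration only: it uses a few of the counts listed above and the threshold of 150 from the filter, and the helper name normalize_browser is hypothetical.

```python
# A few browser counts copied from the listing above.
browser_counts = {"Chrome": 50022, "Firefox": 2770, "Opera": 323,
                  "Amazon Silk": 47, "BlackBerry": 10, "Iron": 3}

MIN_COUNT = 150  # same threshold as in the filter "count > 150"
frequent = {name for name, n in browser_counts.items() if n > MIN_COUNT}

def normalize_browser(name):
    """Return the browser name if it is frequent enough, otherwise 'OTHER'."""
    return name if name in frequent else "OTHER"

print([normalize_browser(b) for b in ["Chrome", "Opera", "Iron"]])
```

A browser that never occurred in the data also ends up in OTHER, which mirrors what the left join followed by fillna does.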

Device Category

The feature 'deviceCategory' is already well suited as a category:

In [131]:
data.select("deviceCategory").groupBy("deviceCategory").count().show()
+--------------+-----+
|deviceCategory|count|
+--------------+-----+
|       desktop|45646|
|        mobile|23118|
|        tablet| 3110|
+--------------+-----+

Channel

The same holds for 'channelGrouping':

In [132]:
data.select("channelGrouping").groupBy("channelGrouping").count().show()
+---------------+-----+
|channelGrouping|count|
+---------------+-----+
|        Display|  690|
|     Affiliates| 1790|
|         Social| 7749|
|         Direct|12318|
|       Referral| 9530|
|        (Other)|    1|
|    Paid Search| 2106|
| Organic Search|37690|
+---------------+-----+

Geographic Information

By encoding the geographic information with 'subcontinent' rather than, say, 'country', we obtain a categorical feature of a usable size:

In [133]:
data.agg(countDistinct("subcontinent")).show()
+----------------------------+
|count(DISTINCT subcontinent)|
+----------------------------+
|                          23|
+----------------------------+

Landing Page

The paths that identify the landing page vary a lot:

In [8]:
data.agg(countDistinct("pagePath")).show()
+------------------------+
|count(DISTINCT pagePath)|
+------------------------+
|                     436|
+------------------------+

In [134]:
data.select("pagePath").where( expr("pagePath") != "/home").show(20, truncate=False)
+------------------------------------------------+
|pagePath                                        |
+------------------------------------------------+
|/google+redesign/drinkware                      |
|/google+redesign/apparel/headgear               |
|/google+redesign/drinkware                      |
|/google+redesign/apparel/mens/mens+outerwear    |
|/google+redesign/accessories/stickers/home      |
|/google+redesign/drinkware/mugs+and+cups        |
|/google+redesign/apparel/womens/womens+t+shirts |
|/google+redesign/accessories/stickers/home      |
|/google+redesign/electronics                    |
|/google+redesign/apparel/headgear               |
|/google+redesign/apparel/headgear               |
|/google+redesign/drinkware/mugs+and+cups        |
|/google+redesign/bags/water+bottles+and+tumblers|
|/google+redesign/bags/backpacks/home            |
|/google+redesign/bags/backpacks/home            |
|/google+redesign/bags/backpacks/home            |
|/google+redesign/apparel/womens/womens+t+shirts |
|/google+redesign/apparel/kids/kids+youth        |
|/google+redesign/shop+by+brand/google           |
|/google+redesign/shop+by+brand/android          |
+------------------------------------------------+
only showing top 20 rows

To reduce this variety, we extract the product category that follows the second slash in the path and use it to group the landing pages. To do so, we apply the functions 'substring_index' and 'regexp_replace' to the column 'pagePath' one after the other and append the result as a new column 'category'. Further regular expressions normalize special characters and endings.

In [137]:
data = data.withColumn("category", regexp_replace(substring_index(regexp_replace(regexp_replace('pagePath',r'(^/)|([-+\)])|(\.html)|([&?\.].*$)', ''),r'google ?redesign/', ''), '/', 1),r'((?<=accessorie|office|apparel|home).*)|s$' ,''))
data.filter(expr("pagePath") != "/home").show()
+-------+--------------+---------------+----------------+--------------------+-------+-----------+-----------+
|browser|deviceCategory|channelGrouping|    subcontinent|            pagePath|bounces|normBrowser|   category|
+-------+--------------+---------------+----------------+--------------------+-------+-----------+-----------+
| Chrome|        mobile| Organic Search|  Western Europe|/google+redesign/...|      1|     Chrome|  drinkware|
| Chrome|        mobile| Organic Search|Northern America|/google+redesign/...|      1|     Chrome|    apparel|
| Chrome|       desktop| Organic Search|Northern America|/google+redesign/...|      1|     Chrome|  drinkware|
|Firefox|       desktop| Organic Search| Northern Europe|/google+redesign/...|      1|    Firefox|    apparel|
| Chrome|       desktop| Organic Search|    Eastern Asia|/google+redesign/...|      1|     Chrome| accessorie|
| Chrome|       desktop|         Direct|  Eastern Europe|/google+redesign/...|      1|     Chrome|  drinkware|
| Chrome|       desktop|         Direct|Northern America|/google+redesign/...|      1|     Chrome|    apparel|
| Safari|        mobile| Organic Search| Northern Europe|/google+redesign/...|      1|     Safari| accessorie|
| Chrome|        mobile| Organic Search|Northern America|/google+redesign/...|      1|     Chrome| electronic|
| Chrome|        mobile| Organic Search|Northern America|/google+redesign/...|      1|     Chrome|    apparel|
| Chrome|        mobile| Organic Search|Northern America|/google+redesign/...|      1|     Chrome|    apparel|
| Chrome|       desktop|         Direct|Northern America|/google+redesign/...|      1|     Chrome|  drinkware|
| Chrome|        mobile| Organic Search| Southern Africa|/google+redesign/...|      1|     Chrome|        bag|
| Chrome|       desktop|         Direct| Southern Europe|/google+redesign/...|      1|     Chrome|        bag|
| Chrome|       desktop|         Direct|    Western Asia|/google+redesign/...|      1|     Chrome|        bag|
| Chrome|       desktop|       Referral|Northern America|/google+redesign/...|      1|     Chrome|        bag|
|Firefox|       desktop| Organic Search|   Southern Asia|/google+redesign/...|      1|    Firefox|    apparel|
|Firefox|       desktop| Organic Search|Northern America|/google+redesign/...|      1|    Firefox|    apparel|
| Chrome|       desktop| Organic Search|Northern America|/google+redesign/...|      1|     Chrome|shopbybrand|
| Safari|        mobile|         Direct| Northern Europe|/google+redesign/...|      1|     Safari|shopbybrand|
+-------+--------------+---------------+----------------+--------------------+-------+-----------+-----------+
only showing top 20 rows
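
To make the nested expression easier to follow, the four steps can be replayed one at a time in plain Python. This is a sketch, not the code that produced the column: the function name landing_category is hypothetical, and because Python's re module only accepts fixed-width lookbehinds, the single lookbehind of the Spark (Java) pattern is split into one lookbehind per word.

```python
import re

def landing_category(page_path):
    """Replay the four normalization steps of the Spark expression on one path."""
    # 1. Drop the leading slash, the characters - + ), '.html' and any query or suffix.
    s = re.sub(r'(^/)|([-+\)])|(\.html)|([&?\.].*$)', '', page_path)
    # 2. Drop the 'google+redesign/' prefix (the '+' is already gone after step 1).
    s = re.sub(r'google ?redesign/', '', s)
    # 3. Keep only the first path segment (substring_index(..., '/', 1)).
    s = s.split('/', 1)[0]
    # 4. Cut everything after a few known stems and strip a trailing plural 's'.
    s = re.sub(r'(?:(?<=accessorie)|(?<=office)|(?<=apparel)|(?<=home)).*|s$', '', s)
    return s

for path in ['/google+redesign/apparel/mens/mens+outerwear',
             '/google+redesign/bags/backpacks/home',
             '/store.html']:
    print(path, '->', landing_category(path))
```

The somewhat odd stems such as 'accessorie' or 'electronic' are a side effect of step 4, which removes the final 's' without any further linguistic normalization.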

This lets us use the landing page, in the form of 'category', as a categorical feature:

In [139]:
data.select("category").groupBy("category").count().show(31)
+--------------+-----+
|      category|count|
+--------------+-----+
|        signin| 1473|
|        office|  927|
|     storeitem|   20|
|          shop|   37|
|       asearch|  359|
|      giftcard|  137|
|          nest|  121|
|     lifestyle|  344|
|    accessorie| 1059|
|           fun|    7|
|    electronic| 1697|
|   shopbybrand|18553|
|       apparel| 7429|
|     drinkware|  986|
|registersucces|    4|
|  storepolicie|  210|
|         brand|   52|
|           eco|    5|
|      yourinfo|    2|
|         store|  350|
|   revieworder|    3|
|           bag| 2000|
|     myaccount|   25|
|          home|35232|
|      wearable|   48|
|      register|    3|
|        basket|  776|
|   new2015logo|    8|
|     madeinusa|    2|
|       payment|    5|
+--------------+-----+
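
The effect of the feature engineering can be summarized by comparing the number of distinct values per column with the default bin limit. The sketch below uses only numbers read off the outputs in this section (11 for 'normBrowser' is the ten frequent browsers plus OTHER); the variable names are illustrative.

```python
DEFAULT_MAX_BINS = 32  # default of the "maxBins" parameter mentioned above

# Distinct values per column, as shown in the outputs of this section.
distinct_values = {
    "browser": 29,
    "normBrowser": 11,       # 10 frequent browsers plus OTHER
    "deviceCategory": 3,
    "channelGrouping": 8,
    "subcontinent": 23,
    "pagePath": 436,
    "category": 30,
}

# Columns that would not fit into the default number of bins.
too_large = sorted(name for name, n in distinct_values.items() if n > DEFAULT_MAX_BINS)
print(too_large)
```

Only the raw 'pagePath' exceeds the limit, which is why it is replaced by 'category'; 'browser' would fit numerically but is reduced because many of its values are backed by very few records.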

3. Building a Training Pipeline with the Spark ML Library

The Spark ML package is the newer Spark library for machine learning; it no longer builds on RDDs only but also works with DataFrames. It allows us to build a pipeline that transforms the data in a DataFrame into the format required by the machine learning algorithm, trains or applies the model, and evaluates the results.

Target format for applying the algorithm: categorical data encoded as numbers.

In [147]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

channelIndexer = StringIndexer(inputCol="channelGrouping", outputCol="channelIdx", handleInvalid='keep')
browserIndexer = StringIndexer(inputCol="normBrowser", outputCol="browserIdx", handleInvalid='keep')
deviceIndexer = StringIndexer(inputCol="deviceCategory", outputCol="deviceCategoryIdx", handleInvalid='keep')
geoIndexer = StringIndexer(inputCol="subcontinent", outputCol="subcontinentIdx", handleInvalid='keep')
categoryIndexer = StringIndexer(inputCol="category", outputCol="categoryIdx", handleInvalid='keep')


assembler = VectorAssembler(inputCols=["channelIdx", "browserIdx", "deviceCategoryIdx", "subcontinentIdx",
     "categoryIdx"], outputCol="features")

dtree = DecisionTreeClassifier( labelCol="bounces", maxBins=40, maxDepth=5)

pipeline = Pipeline(stages=[channelIndexer, browserIndexer, deviceIndexer, geoIndexer, categoryIndexer, assembler, dtree])
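
What a StringIndexer does to a column can be pictured in plain Python. By default it orders the labels by descending frequency, and with handleInvalid='keep' labels that were not seen during fitting go into one additional bucket. The sketch below is an illustration using the channel counts from section 2 (full data set, before the split); the helper encode and the label "Email" are hypothetical.

```python
# Channel counts copied from section 2 (full data set, before the split).
channel_counts = {"Display": 690, "Affiliates": 1790, "Social": 7749, "Direct": 12318,
                  "Referral": 9530, "(Other)": 1, "Paid Search": 2106, "Organic Search": 37690}

# The most frequent label gets index 0.0, the next one 1.0, and so on.
ordered = sorted(channel_counts, key=channel_counts.get, reverse=True)
channel_index = {label: float(i) for i, label in enumerate(ordered)}

# handleInvalid='keep': labels unseen during fitting share one extra bucket.
unseen_index = float(len(ordered))

def encode(label):
    """Map a channel label to its numeric index."""
    return channel_index.get(label, unseen_index)

print(encode("Organic Search"), encode("Display"), encode("Email"))
```

The indices carry no numeric meaning; the tree treats them as category identifiers, which is why its splits are expressed as sets of values.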
      

We split the data and reserve 20% for the final test. The remaining 80% is used for training and later for cross-validation.

In [24]:
develop, validation = data.randomSplit([0.8, 0.2], seed=12345)
In [25]:
develop.show()
+--------+---------------+---------------+---------------+--------------+---------------+----------------+--------------------+--------------------+----------+-------+-----------+---------------+
|  source|        browser|operatingSystem|channelGrouping|deviceCategory|        country|    subcontinent|              region|            pagePath|timeOnSite|bounces|   category|    normBrowser|
+--------+---------------+---------------+---------------+--------------+---------------+----------------+--------------------+--------------------+----------+-------+-----------+---------------+
|(direct)|    Amazon Silk|        Android|         Direct|        tablet|  United States|Northern America|            New York|/google+redesign/...|         0|      1|shopbybrand|          OTHER|
|(direct)|    Amazon Silk|        Android|         Direct|        tablet|  United States|Northern America|            New York|/google+redesign/...|       397|      0|shopbybrand|          OTHER|
|(direct)|    Amazon Silk|        Android|         Direct|        tablet|  United States|Northern America|not available in ...|               /home|         0|      1|       home|          OTHER|
|(direct)|Android Browser|        Android|         Direct|        mobile|          India|   Southern Asia|           Telangana|/google+redesign/...|       186|      0|shopbybrand|          OTHER|
|(direct)|Android Webview|        Android|         Direct|        mobile|     Bangladesh|   Southern Asia|      Dhaka Division|/google+redesign/...|        32|      0|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|         Brazil|   South America|not available in ...|/google+redesign/...|       211|      0|    apparel|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|        Germany|  Western Europe|not available in ...|/google+redesign/...|        44|      0|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|        Germany|  Western Europe|not available in ...|/google+redesign/...|         0|      1|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|          India|   Southern Asia|               Delhi|/google+redesign/...|         0|      1|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|          India|   Southern Asia|               Delhi|               /home|         0|      1|       home|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|          India|   Southern Asia|         West Bengal|         /store.html|       678|      0|      store|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|          India|   Southern Asia|not available in ...|/google+redesign/...|         0|      1|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|         Mexico| Central America|not available in ...|/google+redesign/...|         0|      1|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|Myanmar (Burma)|  Southeast Asia|not available in ...|/google+redesign/...|         0|      1|    apparel|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|    Netherlands|  Western Europe|not available in ...|/google+redesign/...|         0|      1|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|       Pakistan|   Southern Asia|not available in ...|/google+redesign/...|         0|      1|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|       Pakistan|   Southern Asia|not available in ...|         /store.html|       180|      0|      store|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|           Peru|   South America|         Lima Region|/google+redesign/...|        74|      0|        bag|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|   Saudi Arabia|    Western Asia|not available in ...|/google+redesign/...|        85|      0|shopbybrand|Android Webview|
|(direct)|Android Webview|        Android|         Direct|        mobile|    South Korea|    Eastern Asia|               Seoul|/google+redesign/...|         0|      1|  lifestyle|Android Webview|
+--------+---------------+---------------+---------------+--------------+---------------+----------------+--------------------+--------------------+----------+-------+-----------+---------------+
only showing top 20 rows

In [142]:
model = pipeline.fit(develop)
treeModel = model.stages[6]
treeModel.toDebugString
Out[142]:
'DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4a9c9d361cc90ae20eaf) of depth 5 with 63 nodes\n  If (feature 3 in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0})\n   If (feature 4 in {2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,21.0,22.0,23.0,25.0,26.0,27.0,29.0})\n    If (feature 2 in {1.0,2.0})\n     If (feature 4 in {3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,17.0})\n      If (feature 4 in {3.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,17.0})\n       Predict: 1.0\n      Else (feature 4 not in {3.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,17.0})\n       Predict: 1.0\n     Else (feature 4 not in {3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,17.0})\n      If (feature 3 in {3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,22.0})\n       Predict: 1.0\n      Else (feature 3 not in {3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,22.0})\n       Predict: 1.0\n    Else (feature 2 not in {1.0,2.0})\n     If (feature 1 in {1.0,2.0,3.0,4.0,8.0,9.0})\n      If (feature 1 in {1.0,3.0,4.0,8.0,9.0})\n       Predict: 0.0\n      Else (feature 1 not in {1.0,3.0,4.0,8.0,9.0})\n       Predict: 1.0\n     Else (feature 1 not in {1.0,2.0,3.0,4.0,8.0,9.0})\n      If (feature 4 in {3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,21.0,22.0,23.0,26.0,27.0,29.0})\n       Predict: 0.0\n      Else (feature 4 not in {3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,21.0,22.0,23.0,26.0,27.0,29.0})\n       Predict: 0.0\n   Else (feature 4 not in {2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,21.0,22.0,23.0,25.0,26.0,27.0,29.0})\n    If (feature 0 in {2.0,4.0,5.0,6.0})\n     If (feature 3 in {12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,22.0})\n      If (feature 3 in {14.0,15.0,19.0,22.0})\n       Predict: 0.0\n      
Else (feature 3 not in {14.0,15.0,19.0,22.0})\n       Predict: 1.0\n     Else (feature 3 not in {12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,22.0})\n      If (feature 3 in {1.0,2.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0})\n       Predict: 0.0\n      Else (feature 3 not in {1.0,2.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0})\n       Predict: 1.0\n    Else (feature 0 not in {2.0,4.0,5.0,6.0})\n     If (feature 1 in {2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0})\n      If (feature 1 in {4.0,5.0,7.0,8.0,9.0,10.0})\n       Predict: 1.0\n      Else (feature 1 not in {4.0,5.0,7.0,8.0,9.0,10.0})\n       Predict: 1.0\n     Else (feature 1 not in {2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0})\n      If (feature 2 in {1.0,2.0})\n       Predict: 1.0\n      Else (feature 2 not in {1.0,2.0})\n       Predict: 1.0\n  Else (feature 3 not in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0})\n   If (feature 4 in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0,28.0,30.0})\n    If (feature 0 in {2.0,4.0,5.0,6.0})\n     If (feature 2 in {1.0,2.0})\n      If (feature 0 in {2.0,5.0,6.0})\n       Predict: 0.0\n      Else (feature 0 not in {2.0,5.0,6.0})\n       Predict: 1.0\n     Else (feature 2 not in {1.0,2.0})\n      If (feature 4 in {1.0,4.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0})\n       Predict: 0.0\n      Else (feature 4 not in {1.0,4.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0})\n       Predict: 0.0\n    Else (feature 0 not in {2.0,4.0,5.0,6.0})\n     If (feature 0 in {3.0})\n      If (feature 2 in {1.0,2.0})\n       Predict: 1.0\n      Else (feature 2 not in {1.0,2.0})\n       Predict: 1.0\n     Else (feature 0 not in {3.0})\n      If (feature 2 in {0.0,2.0})\n       Predict: 0.0\n      Else (feature 2 not in {0.0,2.0})\n       Predict: 1.0\n   Else (feature 4 not in 
{1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0,28.0,30.0})\n    If (feature 0 in {0.0,2.0,3.0,4.0,5.0,6.0,7.0})\n     If (feature 0 in {3.0,5.0,6.0,7.0})\n      If (feature 0 in {6.0,7.0})\n       Predict: 0.0\n      Else (feature 0 not in {6.0,7.0})\n       Predict: 0.0\n     Else (feature 0 not in {3.0,5.0,6.0,7.0})\n      If (feature 2 in {1.0,2.0})\n       Predict: 0.0\n      Else (feature 2 not in {1.0,2.0})\n       Predict: 0.0\n    Else (feature 0 not in {0.0,2.0,3.0,4.0,5.0,6.0,7.0})\n     If (feature 2 in {1.0,2.0})\n      If (feature 1 in {2.0,3.0,5.0,6.0,7.0,8.0,9.0,10.0})\n       Predict: 0.0\n      Else (feature 1 not in {2.0,3.0,5.0,6.0,7.0,8.0,9.0,10.0})\n       Predict: 1.0\n     Else (feature 2 not in {1.0,2.0})\n      If (feature 1 in {2.0,3.0,4.0,9.0})\n       Predict: 1.0\n      Else (feature 1 not in {2.0,3.0,4.0,9.0})\n       Predict: 0.0\n'
In [148]:
predictions = model.transform(validation)
In [149]:
predictions.select('bounces','prediction').groupBy('bounces','prediction').count().show()
+-------+----------+-----+
|bounces|prediction|count|
+-------+----------+-----+
|      1|       0.0| 2333|
|      0|       0.0| 4024|
|      1|       1.0| 5018|
|      0|       1.0| 3011|
+-------+----------+-----+

In [ ]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator


evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='bounces')

evalF1 = MulticlassClassificationEvaluator(predictionCol="prediction",labelCol='bounces', metricName='f1')

evalRecall = MulticlassClassificationEvaluator(predictionCol="prediction",labelCol='bounces', metricName='weightedRecall' )

evalPrecision =  MulticlassClassificationEvaluator(predictionCol="prediction",labelCol='bounces', metricName='weightedPrecision') 

evalAccuracy =  MulticlassClassificationEvaluator(predictionCol="prediction",labelCol='bounces', metricName='accuracy' )
In [150]:
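# BinaryClassificationEvaluator reports areaUnderROC by default, so this value is the AUC rather than the accuracy
areaUnderROC = evaluator.evaluate(predictions)
areaUnderROC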
Out[150]:
0.5670936279985308
In [151]:
precision = evalPrecision.evaluate(predictions)
precision
Out[151]:
0.6289056431192772
In [152]:
recall = evalRecall.evaluate(predictions)
recall
Out[152]:
0.6285277352982066
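
The weighted precision and recall reported above can be reproduced by hand from the confusion matrix, which also shows how they relate to plain accuracy. This is a plain-Python sketch using only the four counts from the output; the helper names are illustrative.

```python
# Confusion matrix of the first model, copied from the output above;
# keys are (bounces, prediction).
cm = {(1, 0): 2333, (0, 0): 4024, (1, 1): 5018, (0, 1): 3011}
total = sum(cm.values())

accuracy = (cm[(0, 0)] + cm[(1, 1)]) / total

def precision(label):
    """Correct predictions of `label` divided by all predictions of `label`."""
    return cm[(label, label)] / (cm[(0, label)] + cm[(1, label)])

def recall(label):
    """Correct predictions of `label` divided by all true instances of `label`."""
    return cm[(label, label)] / (cm[(label, 0)] + cm[(label, 1)])

def weight(label):
    """Share of true instances of `label` in the validation set."""
    return (cm[(label, 0)] + cm[(label, 1)]) / total

weighted_precision = sum(weight(c) * precision(c) for c in (0, 1))
weighted_recall = sum(weight(c) * recall(c) for c in (0, 1))
print(accuracy, weighted_precision, weighted_recall)
```

Weighting each class's recall by its share of the data makes the class sizes cancel out, so the weighted recall is simply the accuracy.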

4. Parameter Optimization and Cross-Validation

In [154]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = ParamGridBuilder().addGrid(dtree.maxDepth, [3,4,5]).build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

cvModel = crossval.fit(develop)
In [155]:
bestTreeModel = cvModel.bestModel.stages[6]
bestTreeModel
Out[155]:
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4dc0b43aaa4541720839) of depth 3 with 15 nodes
In [87]:
bestTreeModel.toDebugString
Out[87]:
'DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4a9c9d361cc90ae20eaf) of depth 3 with 15 nodes\n  If (feature 3 in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0})\n   If (feature 4 in {2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,21.0,22.0,23.0,25.0,26.0,27.0,29.0})\n    If (feature 2 in {1.0,2.0})\n     Predict: 1.0\n    Else (feature 2 not in {1.0,2.0})\n     Predict: 0.0\n   Else (feature 4 not in {2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,21.0,22.0,23.0,25.0,26.0,27.0,29.0})\n    If (feature 0 in {2.0,4.0,5.0,6.0})\n     Predict: 1.0\n    Else (feature 0 not in {2.0,4.0,5.0,6.0})\n     Predict: 1.0\n  Else (feature 3 not in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0})\n   If (feature 4 in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0,28.0,30.0})\n    If (feature 0 in {2.0,4.0,5.0,6.0})\n     Predict: 0.0\n    Else (feature 0 not in {2.0,4.0,5.0,6.0})\n     Predict: 1.0\n   Else (feature 4 not in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0,28.0,30.0})\n    If (feature 0 in {0.0,2.0,3.0,4.0,5.0,6.0,7.0})\n     Predict: 0.0\n    Else (feature 0 not in {0.0,2.0,3.0,4.0,5.0,6.0,7.0})\n     Predict: 1.0\n'

'''
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4a9c9d361cc90ae20eaf) of depth 3 with 15 nodes
  If ('subcontinent' is not 'Northern America')
   If ('category' in {apparel,bag,electronic,signin,accessorie,drinkware,office,basket,asearch,lifestyle,store,storepolicie,giftcard,nest,brand,wearable,shop,storeitem,new2015logo,fun,eco,payment,register,revieworder,home2,madeinusa})
    If ('deviceCategory' in {mobile,tablet})
     Predict: 1.0
    Else ('deviceCategory' is 'desktop')
     Predict: 0.0
   Else ('category' in {home,shopbybrand,myaccount,registersucces,yourinfo})
    If ('channel' in {Referral,Paid Search,Affiliates,Display})
     Predict: 1.0
    Else ('channel' in {Organic Search,Direct,Social,(Other)})
     Predict: 1.0
  Else ('subcontinent' is 'Northern America')
   If ('category' other than {home,fun,madeinusa})
    If ('channel' in {Referral,Paid Search,Affiliates,Display})
     Predict: 0.0
    Else ('channel' in {Organic Search,Direct,Social,(Other)})
     Predict: 1.0
   Else ('category' in {home,fun,madeinusa})
    If ('channel' is not 'Direct')
     Predict: 0.0
    Else ('channel' is 'Direct')
     Predict: 1.0
'''

In [157]:
cvModel.avgMetrics
Out[157]:
[0.603665525166462, 0.5812802110796941, 0.5703925081627877]
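
The entries of avgMetrics follow the order of the parameter grid, so each value can be paired with the maxDepth it belongs to. The plain-Python sketch below does that pairing with the numbers from the output; since the evaluator measures the area under the ROC curve, larger is better.

```python
# Grid values from ParamGridBuilder and the averaged metrics from the output above.
max_depths = [3, 4, 5]
avg_metrics = [0.603665525166462, 0.5812802110796941, 0.5703925081627877]

# Pair each depth with its cross-validated metric and pick the largest one.
best_depth, best_metric = max(zip(max_depths, avg_metrics), key=lambda pair: pair[1])
print(best_depth, best_metric)
```

This agrees with the best model reported above, a tree of depth 3.
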
In [160]:
validationPreds = cvModel.transform(validation)
In [161]:
confusionMatrix = validationPreds.select('bounces','prediction').groupBy('bounces','prediction').count()
In [162]:
confusionMatrix.show()
+-------+----------+-----+
|bounces|prediction|count|
+-------+----------+-----+
|      1|       0.0| 1473|
|      0|       0.0| 2927|
|      1|       1.0| 5878|
|      0|       1.0| 4108|
+-------+----------+-----+

In [163]:
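# BinaryClassificationEvaluator reports areaUnderROC by default, so this value is the AUC rather than the accuracy
valAreaUnderROC = evaluator.evaluate(validationPreds)
valAreaUnderROC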
Out[163]:
0.5931925095744822
In [164]:
valPrecision = evalPrecision.evaluate(validationPreds)
valPrecision
Out[164]:
0.626084347937472
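
Putting the two confusion matrices side by side shows how the cross-validated tree differs from the first one. The sketch below is plain Python over the counts printed above; the labels "maxDepth=5" and "cross-validated" are only names chosen here for the two models.

```python
# Confusion matrices copied from the two outputs; keys are (bounces, prediction).
first_model = {(1, 0): 2333, (0, 0): 4024, (1, 1): 5018, (0, 1): 3011}
tuned_model = {(1, 0): 1473, (0, 0): 2927, (1, 1): 5878, (0, 1): 4108}

def summary(cm):
    """Accuracy plus recall and precision for the bounce class (bounces == 1)."""
    total = sum(cm.values())
    return {
        "accuracy": (cm[(0, 0)] + cm[(1, 1)]) / total,
        "bounce_recall": cm[(1, 1)] / (cm[(1, 1)] + cm[(1, 0)]),
        "bounce_precision": cm[(1, 1)] / (cm[(1, 1)] + cm[(0, 1)]),
    }

for name, cm in [("maxDepth=5", first_model), ("cross-validated", tuned_model)]:
    print(name, {k: round(v, 3) for k, v in summary(cm).items()})
```

On this validation set the shallower tree finds a larger share of the actual bounces, but it also flags more real visits as bounces, so its overall accuracy is slightly lower.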

Resolving the Numerically Encoded Features

In [96]:
debugPipeline = Pipeline(stages=[channelIndexer, browserIndexer, deviceIndexer, geoIndexer, categoryIndexer])
In [98]:
transformed = debugPipeline.fit(develop)
In [100]:
trdf = transformed.transform(develop)
In [104]:
# 0 channelIndexer = StringIndexer(inputCol="channelGrouping", outputCol="channelIdx", handleInvalid='keep')
# 1 browserIndexer = StringIndexer(inputCol="normBrowser", outputCol="browserIdx", handleInvalid='keep')
# 2 deviceIndexer = StringIndexer(inputCol="deviceCategory", outputCol="deviceCategoryIdx", handleInvalid='keep')
# 3 geoIndexer = StringIndexer(inputCol="subcontinent", outputCol="subcontinentIdx", handleInvalid='keep')
# 4 categoryIndexer = StringIndexer(inputCol="category", outputCol="categoryIdx", handleInvalid='keep')

# Feature 0

trdf.select('channelGrouping','channelIdx').distinct().orderBy('channelIdx').show(50)
+---------------+----------+
|channelGrouping|channelIdx|
+---------------+----------+
| Organic Search|       0.0|
|         Direct|       1.0|
|       Referral|       2.0|
|         Social|       3.0|
|    Paid Search|       4.0|
|     Affiliates|       5.0|
|        Display|       6.0|
|        (Other)|       7.0|
+---------------+----------+

In [105]:
# Feature 1

trdf.select('normBrowser','browserIdx').distinct().orderBy('browserIdx').show(50)
In [141]:
# Feature 2 

trdf.select('deviceCategory','deviceCategoryIdx').distinct().orderBy('deviceCategoryIdx').show(50)
+--------------+-----------------+
|deviceCategory|deviceCategoryIdx|
+--------------+-----------------+
|       desktop|              0.0|
|        mobile|              1.0|
|        tablet|              2.0|
+--------------+-----------------+

In [107]:
# Feature 3 

trdf.select('subcontinent','subcontinentIdx').distinct().orderBy('subcontinentIdx').show(50)
+------------------+---------------+
|      subcontinent|subcontinentIdx|
+------------------+---------------+
|  Northern America|            0.0|
|     Southern Asia|            1.0|
|   Northern Europe|            2.0|
|    Western Europe|            3.0|
|      Eastern Asia|            4.0|
|    Southeast Asia|            5.0|
|    Eastern Europe|            6.0|
|   Southern Europe|            7.0|
|     South America|            8.0|
|       Australasia|            9.0|
|      Western Asia|           10.0|
|   Central America|           11.0|
|   Northern Africa|           12.0|
|    Western Africa|           13.0|
|         Caribbean|           14.0|
|   Southern Africa|           15.0|
|    Eastern Africa|           16.0|
|         (not set)|           17.0|
|      Central Asia|           18.0|
|     Middle Africa|           19.0|
|Micronesian Region|           20.0|
|         Melanesia|           21.0|
|         Polynesia|           22.0|
+------------------+---------------+

In [108]:
# Feature 4 

trdf.select('category','categoryIdx').distinct().orderBy('categoryIdx').show(50)
+--------------+-----------+
|      category|categoryIdx|
+--------------+-----------+
|          home|        0.0|
|   shopbybrand|        1.0|
|       apparel|        2.0|
|           bag|        3.0|
|    electronic|        4.0|
|        signin|        5.0|
|    accessorie|        6.0|
|     drinkware|        7.0|
|        office|        8.0|
|        basket|        9.0|
|       asearch|       10.0|
|     lifestyle|       11.0|
|         store|       12.0|
|  storepolicie|       13.0|
|      giftcard|       14.0|
|          nest|       15.0|
|         brand|       16.0|
|      wearable|       17.0|
|          shop|       18.0|
|     myaccount|       19.0|
|     storeitem|       20.0|
|   new2015logo|       21.0|
|           fun|       22.0|
|           eco|       23.0|
|       payment|       24.0|
|      register|       25.0|
|   revieworder|       26.0|
|         home2|       27.0|
|registersucces|       28.0|
|     madeinusa|       29.0|
|      yourinfo|       30.0|
+--------------+-----------+
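
With these tables, the conditions in the tree's debug string can be translated back into readable labels. The following plain-Python sketch shows one possible way to do this for two of the features; the names FEATURE_NAMES, LABELS and decode are hypothetical, and the label lists are copied from the tables above.

```python
import re

# Order of the features as passed to the VectorAssembler.
FEATURE_NAMES = ["channelGrouping", "normBrowser", "deviceCategory", "subcontinent", "category"]

# Index-to-label lists copied from the tables above (only two features shown here).
LABELS = {
    0: ["Organic Search", "Direct", "Referral", "Social", "Paid Search",
        "Affiliates", "Display", "(Other)"],
    2: ["desktop", "mobile", "tablet"],
}

CONDITION = re.compile(r'feature (\d+) (not in|in) \{([^}]*)\}')

def decode(line):
    """Replace 'feature N in {..}' by the feature name and its labels where known."""
    def replace(match):
        feature = int(match.group(1))
        labels = LABELS.get(feature)
        if labels is None:
            return match.group(0)  # no table available, keep the original text
        names = []
        for value in match.group(3).split(','):
            index = int(float(value))
            # The extra index created by handleInvalid='keep' has no label.
            names.append(labels[index] if index < len(labels) else "<unseen>")
        return "{} {} {{{}}}".format(FEATURE_NAMES[feature], match.group(2), ", ".join(names))
    return CONDITION.sub(replace, line)

print(decode("If (feature 2 in {1.0,2.0})"))
print(decode("Else (feature 0 not in {2.0,4.0,5.0,6.0})"))
```

Instead of copying the tables by hand, the label lists could also be read from the fitted StringIndexerModel stages, which expose them through their labels attribute.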