Data-Driven Marketing: How Customer-Centric Targeting Will Succeed in the Future, Part 5

Targeted Customer Outreach with AI

Blog series: https://www.shi-gmbh.com/herausforderung-customer-centricity/

Author: Dr. Eduardo Torres Schumann, SHI GmbH

Contents:

1. Features for Describing Customers

  • Reading the Google Analytics data
  • Extracting product transactions from the visits
  • Building product categories and summing spend per category
  • Average basket size and time to repurchase
  • Final customer matrix

2. A Spark ML Pipeline for Clustering

  • Feature engineering
  • Clustering with K-Means
  • Parameter tuning and model selection

3. Analysis of the Customer Segments

  • Segment profiles
  • Visualization

1. Features for Describing Customers

Customers are described by their purchases across the different product categories: for each customer, all transactions are collected and the total purchase value per category is computed. In addition to the spend per category, the average number of products and value of the basket, as well as the average time in days until the next purchase, characterize each customer.

We consider all customers who made a purchase towards the end of 2016.
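The features described above can be sketched in plain Python with hypothetical toy transactions (the notebook computes the same quantities with Spark below; the record layout here is an assumption for illustration only):

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy transactions: (customer, date, category, revenue)
transactions = [
    ("A", date(2016, 11, 1), "Apparel", 50.0),
    ("A", date(2016, 11, 15), "Drinkware", 20.0),
    ("B", date(2016, 12, 3), "Office", 10.0),
]

spend = defaultdict(lambda: defaultdict(float))  # customer -> category -> total spend
dates = defaultdict(list)                        # customer -> purchase dates

for cust, d, cat, rev in transactions:
    spend[cust][cat] += rev
    dates[cust].append(d)

def avg_days_to_repurchase(ds):
    """Average gap in days between consecutive purchases; 0 for a single purchase."""
    ds = sorted(ds)
    gaps = [(b - a).days for a, b in zip(ds, ds[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

# One feature row per customer: spend per category plus the average repurchase time
features = {c: dict(spend[c], avg_time=avg_days_to_repurchase(dates[c])) for c in spend}
```

Customer "A" ends up with Apparel = 50.0, Drinkware = 20.0 and an average repurchase time of 14 days; customer "B" has a single purchase and therefore avg_time = 0.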

Reading the Google Analytics Data

In [1]:
import findspark
findspark.init('/usr/local/Cellar/apache-spark/2.3.2/libexec/')

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window
In [2]:
sc = SparkSession.builder.appName("TestGA").getOrCreate()

# before ga_sessions_201611* the category is mostly "(not set)"
# note: pyspark can misbehave with Java versions newer than 8

ga_data = sc.read.json('/Users/eduardoschumann/Documents/Projekte/E-CommerceAnalytics/ga_data/ga_sessions_2016*.json.gz')

Extracting Product Transactions from the Visits

Filtering the 'hits' that represent a transaction:

In [3]:
transactionsInfo = ga_data.select('fullVisitorId','visitId', 'visitNumber',
                                  expr("totals.transactions").cast('integer'),
                                  expr("totals.transactionRevenue").cast('integer'),
                                  explode('hits').alias('hit'),
                                  'hit.contentGroup',
                                  'hit.page',
                                  'hit.isExit',
                                  'hit.isInteraction',
                                  'hit.product',
                                  'hit.transaction',
                                  'hit.transaction.transactionId',
                                  'hit.product.productQuantity',
                                  'hit.type').fillna(0)\
            .filter("transactions > 0")\
            .filter(col('hit.transaction.transactionId').isNotNull())\
            .filter(col('hit.isExit').isNull())\
            .drop('hit')
In [4]:
transactionsInfo.show(3,truncate=False)
+-------------------+----------+-----------+------------+------------------+---------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+---------------+---------------+----+
|fullVisitorId      |visitId   |visitNumber|transactions|transactionRevenue|contentGroup                                                                                                   |page                                                                                                        |isExit|isInteraction|product                                                                                                                                        |transaction                                                                                                |transactionId  |productQuantity|type|
+-------------------+----------+-----------+------------+------------------+---------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+---------------+---------------+----+
|2324511444137196326|1472681213|1          |1           |61610000          |[(not set), (not set), (not set), (not set), (not set),,,, (not set), Apparel, (not set), (not set), (not set)]|[shop.googlemerchandisestore.com, /ordercompleted.html, /ordercompleted.html, , , , Checkout Confirmation,,]|null  |true         |[[[], [],,, 51990000, 52990000, (not set), (not set), 0, 51990000, 1, 52990000, GGOEGAFJ036217,  2XL, (not set), Google Men's Pullover Hoodie]]|[Google Merchandise Store, USD, 62610000, 7500000, 4120000,, ORD201608312125, 62610000, 7500000, 4120000]  |ORD201608312125|[1]            |PAGE|
|6810807631474905667|1472682771|1          |1           |15040000          |[(not set), (not set), (not set), (not set), (not set),,,, (not set), Office, (not set), (not set), (not set)] |[shop.googlemerchandisestore.com, /ordercompleted.html, /ordercompleted.html, , , , Checkout Confirmation,,]|null  |true         |[[[], [],,, 3990000, 14980000, (not set), (not set), 0, 3990000, 2, 14980000, GGOEGOCR017899,  RED, (not set), Recycled Paper Journal Set]]    |[Google Merchandise Store, USD, 22040000, 13500000, 7560000,, ORD201608312148, 22040000, 13500000, 7560000]|ORD201608312148|[2]            |PAGE|
|6810807631474905667|1472682771|1          |1           |15040000          |[(not set), (not set), (not set), (not set), (not set),,,, (not set), Office, (not set), (not set), (not set)] |[shop.googlemerchandisestore.com, /ordercompleted.html, /ordercompleted.html, , , , Checkout Confirmation,,]|null  |true         |[[[], [],,, 3990000,, (not set), (not set), 0, 3990000,,, GGOEGOCR017899,  RED, (not set), Recycled Paper Journal Set]]                        |[Google Merchandise Store, USD,,,,, ORD201608312148,,,]                                                    |ORD201608312148|[]             |PAGE|
+-------------------+----------+-----------+------------+------------------+---------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+---------------+---------------+----+
only showing top 3 rows

Product Revenue in the Transaction 'Hits'

In [5]:
productTransactions = transactionsInfo.select('fullVisitorId','transactionRevenue',
                                              explode('product').alias('sproduct'),
                                              'transaction.transactionId',
                                              'sproduct.v2ProductName',
                                              'sproduct.v2ProductCategory',
                                              expr('sproduct.productPrice').cast('integer').alias('price'),
                                              expr('sproduct.productQuantity').cast('integer').alias('quantity'),
                                              (col("sproduct.productPrice") * col("sproduct.productQuantity")).alias('productSales'),
                                              'transactionRevenue')\
                        .filter(col('productQuantity').isNotNull()).drop('sproduct')
In [6]:
productTransactions.show(truncate=False)
+-------------------+------------------+---------------+----------------------------------------------------+-----------------+--------+--------+------------+------------------+
|fullVisitorId      |transactionRevenue|transactionId  |v2ProductName                                       |v2ProductCategory|price   |quantity|productSales|transactionRevenue|
+-------------------+------------------+---------------+----------------------------------------------------+-----------------+--------+--------+------------+------------------+
|2324511444137196326|61610000          |ORD201608312125|Google Men's Pullover Hoodie                        |(not set)        |51990000|1       |5.199E7     |61610000          |
|6810807631474905667|15040000          |ORD201608312148|Recycled Paper Journal Set                          |(not set)        |3990000 |2       |7980000.0   |15040000          |
|6810807631474905667|15040000          |ORD201608312148|Recycled Paper Journal Set                          |(not set)        |3990000 |null    |null        |15040000          |
|6810807631474905667|15040000          |ORD201608312148|Recycled Paper Journal Set                          |(not set)        |3990000 |null    |null        |15040000          |
|0973104829844612521|91100000          |ORD201608312216|Google Laptop and Cell Phone Stickers               |(not set)        |1990000 |10      |1.99E7      |91100000          |
|0973104829844612521|91100000          |ORD201608312216|8 pc Android Sticker Sheet                          |(not set)        |1990000 |5       |9950000.0   |91100000          |
|0973104829844612521|91100000          |ORD201608312216|Android Sticker Sheet Ultra Removable               |(not set)        |2990000 |5       |1.495E7     |91100000          |
|0973104829844612521|91100000          |ORD201608312216|Google Ballpoint Pen Black                          |(not set)        |1990000 |20      |3.98E7      |91100000          |
|3797990038447668987|51080000          |ORD201608312276|Google Men's 100% Cotton Short Sleeve Hero Tee Navy |(not set)        |16990000|1       |1.699E7     |51080000          |
|3797990038447668987|51080000          |ORD201608312276|Google Men's 100% Cotton Short Sleeve Hero Tee Navy |(not set)        |16990000|1       |1.699E7     |51080000          |
|3797990038447668987|51080000          |ORD201608312276|Collapsible Shopping Bag                            |(not set)        |4990000 |1       |4990000.0   |51080000          |
|3797990038447668987|51080000          |ORD201608312276|Google Laptop and Cell Phone Stickers               |(not set)        |1990000 |1       |1990000.0   |51080000          |
|1704269846042149545|126200000         |ORD201608311837|Windup Android                                      |(not set)        |3990000 |30      |1.197E8     |126200000         |
|1704269846042149545|126200000         |ORD201608311837|Windup Android                                      |(not set)        |3990000 |null    |null        |126200000         |
|1124125006906712847|27160000          |ORD201608311905|Google 17oz Stainless Steel Sport Bottle            |(not set)        |18990000|1       |1.899E7     |27160000          |
|330289116549575054 |39950000          |ORD201608311935|Google Baby Essentials Set                          |(not set)        |27190000|1       |2.719E7     |39950000          |
|9972043774359472649|671220000         |ORD201608311928|Google 17oz Stainless Steel Sport Bottle            |(not set)        |15190000|35      |5.3165E8    |671220000         |
|3906774598490786925|356450000         |ORD201608311979|Google Men's 100% Cotton Short Sleeve Hero Tee Black|(not set)        |13590000|1       |1.359E7     |356450000         |
|3906774598490786925|356450000         |ORD201608311979|22 oz YouTube Bottle Infuser                        |(not set)        |3990000 |16      |6.384E7     |356450000         |
|3906774598490786925|356450000         |ORD201608311979|Google Zipper-front Sports Bag                      |(not set)        |15990000|2       |3.198E7     |356450000         |
+-------------------+------------------+---------------+----------------------------------------------------+-----------------+--------+--------+------------+------------------+
only showing top 20 rows

Building Product Categories and Summing Spend per Category

In [7]:
customerCategories = productTransactions.groupBy('fullVisitorId', 'v2ProductCategory').agg(sum('productSales').alias('categorySales'))
In [8]:
customerCategories.show(truncate=False)
+-------------------+-----------------+-------------+
|fullVisitorId      |v2ProductCategory|categorySales|
+-------------------+-----------------+-------------+
|2100286135111282949|(not set)        |4.038E7      |
|9540801604803443114|(not set)        |1.7667E8     |
|8084853554695349186|(not set)        |5.994E7      |
|1634397925109674668|(not set)        |7.395E7      |
|4932662498284478415|(not set)        |6.6591E8     |
|1999795518481333408|(not set)        |2.876E8      |
|788289060362109435 |Backpacks        |5.2789E8     |
|476894088375216381 |Apparel          |1.3354E8     |
|8461110695299432015|(not set)        |7.641E7      |
|0532638128006986594|Apparel          |1.699E7      |
|5575936581064854068|Drinkware        |7990000.0    |
|28920166996312450  |Apparel          |8.799E7      |
|4164473973717598175|Drinkware        |6.076E7      |
|087919605807565397 |Lifestyle        |3990000.0    |
|7247995041552365194|Office           |2.143E7      |
|1691262461712944619|Apparel          |1.1998E8     |
|6218917791895202821|Bags             |7.999E7      |
|9372618001980854712|Lifestyle        |3.5E7        |
|3158193721801238039|Office           |1590000.0    |
|8488247280044091366|Apparel          |3.199E7      |
+-------------------+-----------------+-------------+
only showing top 20 rows

In [9]:
customerCategories.groupBy("v2ProductCategory").count().show(30)
+--------------------+-----+
|   v2ProductCategory|count|
+--------------------+-----+
|          Housewares|   22|
|           Backpacks|   33|
|              Office|  772|
|                Waze|   40|
|                Bags|  318|
|             Apparel| 1735|
|         Electronics|  288|
|             Bottles|   57|
|           More Bags|   26|
|            Headgear|  121|
|              Google|    3|
|         Accessories|   73|
|           Lifestyle|  330|
|            Tumblers|    6|
|Notebooks & Journals|   13|
|${productitem.pro...|   45|
|             Android|    1|
|           (not set)| 2125|
|           Drinkware|  683|
|          Gift Cards|   16|
+--------------------+-----+

Grouping Categories into Top-Level Categories

Top-level categories:

  • Apparel
  • Lifestyle
  • Electronics
  • Brand
  • Office
  • Bags
  • Drinkware
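The mapping that the next cell implements with a cascade of nested `regexp_replace` calls can be sketched as a first-match pattern table in plain Python. The patterns are the same as in the cell below; `to_toplevel` is a hypothetical helper, and first-match semantics behave like the replacement cascade for these category names:

```python
import re

# pattern -> top-level category, in the same order as the nested
# regexp_replace chain (innermost replacement first)
TOPLEVEL = [
    (r"Apparel|Headgear", "Apparel"),
    (r"Accessories|Lifestyle|Fun", "Lifestyle"),
    (r"Electronics", "Electronics"),
    (r"Waze|Google|Android", "Brand"),
    (r"Office|Notebooks & Journals|Gift Cards", "Office"),
    (r"Backpacks|Bags|More Bags", "Bags"),
    (r"Housewares|Drinkware|Bottles|Tumblers", "Drinkware"),
]

def to_toplevel(category: str) -> str:
    """Return the top-level category for a raw v2ProductCategory value."""
    for pattern, top in TOPLEVEL:
        if re.search(pattern, category):
            return top
    return category  # leave unmatched categories unchanged

to_toplevel("Backpacks")  # "Bags"
```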
In [16]:
drinkware_regex = "Housewares|Drinkware|Bottles|Tumblers"
bags_regex = "Backpacks|Bags|More Bags"
office_regex = "Office|Notebooks & Journals|Gift Cards"
brand_regex = "Waze|Google|Android"
electronics_regex = "Electronics"
lifestyle_regex = "Accessories|Lifestyle|Fun"
apparel_regex = "Apparel|Headgear"

toplevelCategories = customerCategories.filter(~((col('v2ProductCategory') == '${productitem.product.origCatName}')|
                            (col('v2ProductCategory') == '(not set)'))).\
  withColumn("category",
    regexp_replace(
        regexp_replace(
            regexp_replace(
                regexp_replace(
                    regexp_replace(
                        regexp_replace(
                            regexp_replace(col("v2ProductCategory"),
                            apparel_regex, "Apparel"),
                        lifestyle_regex, "Lifestyle"),
                    electronics_regex,"Electronics"),
                brand_regex,"Brand"),
            office_regex,"Office"),
        bags_regex, "Bags"),
    drinkware_regex, "Drinkware"))


toplevelCategories.show()

#groupBy("category").count().show(30)
+-------------------+-----------------+-------------+---------+
|      fullVisitorId|v2ProductCategory|categorySales| category|
+-------------------+-----------------+-------------+---------+
| 788289060362109435|        Backpacks|     5.2789E8|     Bags|
| 476894088375216381|          Apparel|     1.3354E8|  Apparel|
|0532638128006986594|          Apparel|      1.699E7|  Apparel|
|5575936581064854068|        Drinkware|    7990000.0|Drinkware|
|  28920166996312450|          Apparel|      8.799E7|  Apparel|
|4164473973717598175|        Drinkware|      6.076E7|Drinkware|
| 087919605807565397|        Lifestyle|    3990000.0|Lifestyle|
|7247995041552365194|           Office|      2.143E7|   Office|
|1691262461712944619|          Apparel|     1.1998E8|  Apparel|
|6218917791895202821|             Bags|      7.999E7|     Bags|
|9372618001980854712|        Lifestyle|        3.5E7|Lifestyle|
|3158193721801238039|           Office|    1590000.0|   Office|
|8488247280044091366|          Apparel|      3.199E7|  Apparel|
|5216295436197840308|          Apparel|      8.396E7|  Apparel|
| 254413001205574315|        Drinkware|      2.499E7|Drinkware|
|5358177725781959243|          Apparel|      3.118E7|  Apparel|
|9029794295932939024|        Backpacks|     5.2789E8|     Bags|
|  64739735733627926|           Office|      6.092E7|   Office|
|1531540148956296102|        Drinkware|      3.998E8|Drinkware|
|4471415710206918415|             Bags|      3.995E8|     Bags|
+-------------------+-----------------+-------------+---------+
only showing top 20 rows

Summing Spend per Category and Pivoting the Table

Pivoting DataFrames in Spark: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
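Conceptually, the pivot turns the long (visitor, category, sales) table into one row per visitor with one column per category, with missing combinations filled with 0.0 (the `fillna(0.0)` step). A minimal sketch with hypothetical values:

```python
from collections import defaultdict

# Long format: one row per (visitor, category) pair
rows = [
    ("v1", "Apparel", 59.99),
    ("v2", "Drinkware", 7.19),
    ("v2", "Office", 1.59),
]
categories = sorted({cat for _, cat, _ in rows})

# Wide format: every visitor gets every category column, default 0.0
wide = defaultdict(lambda: dict.fromkeys(categories, 0.0))
for visitor, cat, sales in rows:
    wide[visitor][cat] += sales
```

Visitor "v1" then has Apparel = 59.99 and 0.0 in the other columns, just like the `visitorRevenueByCategory` DataFrame below.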

In [17]:
visitorRevenueByCategory = toplevelCategories.groupBy('fullVisitorId').pivot('category').agg(sum('categorySales')).fillna(0.0)

visitorRevenueByCategory.show()
+-------------------+--------+-------+-----+---------+-----------+---------+---------+
|      fullVisitorId| Apparel|   Bags|Brand|Drinkware|Electronics|Lifestyle|   Office|
+-------------------+--------+-------+-----+---------+-----------+---------+---------+
|7983896832980033675| 5.999E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|3220580369738167789|     0.0|    0.0|  0.0|7190000.0|        0.0|      0.0|1590000.0|
|4186017304980916856| 9.198E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|8706126538675944851|     0.0|    0.0|  0.0| 1.5975E8|        0.0|      0.0| 1.7545E8|
|6402267118449500708|1.1598E8|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|3090631969460248576| 1.359E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|0829523931932991940|     0.0|    0.0|  0.0| 3.7975E8|        0.0|      0.0|      0.0|
|8008255526797390672|     0.0|    0.0|  0.0|      0.0|        0.0|      0.0|  6.087E7|
|  93310453484553750| 2.719E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|0385969934087457197|     0.0|    0.0|  0.0|  1.119E8|        0.0|      0.0|      0.0|
|9077832746989941740|     0.0|    0.0|  0.0|      0.0|    1.197E7|      0.0|  1.397E7|
|8250541234138796403|     0.0|    0.0|  0.0|  8.096E7|        0.0|      0.0|      0.0|
|2435691687281327196|     0.0|7.994E7|  0.0|      0.0|    4.317E7|5590000.0|      0.0|
|3782396217900147405| 3.358E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|7758585801799209122| 5.999E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|3269834865385146569| 1.799E8|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
| 419161784882208907|2.0292E8|5.492E7|  0.0| 1.0986E8|        0.0|   3.98E7| 1.6675E8|
|2497486617423212304| 1.999E7|7.674E7|  0.0|      0.0|        0.0|      0.0|      0.0|
|1658702239532336072| 4.479E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
|9019087328365285075| 1.599E7|    0.0|  0.0|      0.0|        0.0|      0.0|      0.0|
+-------------------+--------+-------+-----+---------+-----------+---------+---------+
only showing top 20 rows

Average Basket Size and Time to Repurchase

In [18]:
transactions = ga_data.select('fullVisitorId',
                               to_date("date", "yyyyMMdd").alias("pdate"),
                               expr("totals.transactions").cast('integer'),
                               expr("totals.totalTransactionRevenue").cast('float')).fillna(0).filter("transactions > 0")


#transactions.show(50)

timeintertrans = transactions.withColumn("time_intertrans",datediff(transactions.pdate, lag(transactions.pdate, 1).over(Window.partitionBy("fullVisitorId").orderBy("pdate")))).fillna(0)
timeintertrans.show()
+-------------------+----------+------------+-----------------------+---------------+
|      fullVisitorId|     pdate|transactions|totalTransactionRevenue|time_intertrans|
+-------------------+----------+------------+-----------------------+---------------+
|0829523931932991940|2016-11-18|           1|           3.87750016E8|              0|
|1944533426362788043|2016-09-21|           1|                1.499E7|              0|
|3090631969460248576|2016-10-27|           1|                2.259E7|              0|
|3213840074316400693|2016-08-01|           1|                1.704E8|              0|
|3220580369738167789|2016-12-20|           1|                1.378E7|              0|
|3745754200111893080|2016-08-09|           1|               1.7177E8|              0|
|4132503829043338043|2016-08-15|           1|                8.345E7|              0|
|4186017304980916856|2016-12-11|           1|                9.298E7|              0|
|4389802785459743912|2016-10-02|           1|                6.499E7|              0|
| 439449392549956950|2016-09-20|           1|           3.63390016E8|              0|
|4681864752414054401|2016-08-04|           1|                2.947E7|              0|
|6332431001963915083|2016-08-01|           1|                3.219E7|              0|
|6332431001963915083|2016-08-15|           1|                3.748E7|             14|
|6402267118449500708|2016-12-12|           1|               1.2198E8|              0|
|6845776877174113141|2016-08-21|           1|               2.3166E8|              0|
|7285721176134268388|2016-08-13|           1|                5.008E7|              0|
|7709768845264385801|2016-09-21|           1|               2.2105E8|              0|
|7983896832980033675|2016-12-08|           1|                6.399E7|              0|
|8008255526797390672|2016-11-09|           1|                6.287E7|              0|
|8706126538675944851|2016-11-29|           1|                3.362E8|              0|
+-------------------+----------+------------+-----------------------+---------------+
only showing top 20 rows
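The `time_intertrans` column above is the day difference between a visit and the customer's previous visit (`lag` over a window partitioned by `fullVisitorId` and ordered by date), with 0 for the first purchase via `fillna(0)`. For a single customer this is, roughly, the following sketch:

```python
from datetime import date

def inter_purchase_days(dates):
    """Days between consecutive purchases; the first purchase gets 0,
    mirroring lag(pdate, 1) over the per-customer window plus fillna(0)."""
    ds = sorted(dates)
    return [0] + [(b - a).days for a, b in zip(ds, ds[1:])]

inter_purchase_days([date(2016, 8, 1), date(2016, 8, 15)])  # [0, 14]
```

This matches the output above: visitor 6332431001963915083 bought on 2016-08-01 and again on 2016-08-15, giving `time_intertrans` values of 0 and 14.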

In [19]:
visitorTransactionKPIs = timeintertrans.groupBy('fullVisitorId').agg(avg("totalTransactionRevenue").alias("avg_revenue"),
                                            sum("totalTransactionRevenue").alias("sum_revenue"),
                                            sum("transactions").alias("total_transactions"),
                                            avg("time_intertrans").alias("avg_time"))

Final Customer Matrix

In [13]:
customerMatrix = visitorTransactionKPIs.join(visitorRevenueByCategory,"fullVisitorId")
In [20]:
customerMatrix.limit(15).toPandas()
Out[20]:
fullVisitorId avg_revenue sum_revenue total_transactions avg_time Apparel Bags Brand Drinkware Electronics Lifestyle Office
0 7983896832980033675 63990000.0 63990000.0 1 0.0 59990000.0 0.0 0.0 0.0 0.0 0.0 0.0
1 3220580369738167789 13780000.0 13780000.0 1 0.0 0.0 0.0 0.0 7190000.0 0.0 0.0 1590000.0
2 4186017304980916856 92980000.0 92980000.0 1 0.0 91980000.0 0.0 0.0 0.0 0.0 0.0 0.0
3 8706126538675944851 336200000.0 336200000.0 1 0.0 0.0 0.0 0.0 159750000.0 0.0 0.0 175450000.0
4 6402267118449500708 121980000.0 121980000.0 1 0.0 115980000.0 0.0 0.0 0.0 0.0 0.0 0.0
5 3090631969460248576 22590000.0 22590000.0 1 0.0 13590000.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0829523931932991940 387750016.0 387750016.0 1 0.0 0.0 0.0 0.0 379750000.0 0.0 0.0 0.0
7 8008255526797390672 62870000.0 62870000.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 60870000.0
8 93310453484553750 35190000.0 35190000.0 1 0.0 27190000.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0385969934087457197 114900000.0 114900000.0 1 0.0 0.0 0.0 0.0 111900000.0 0.0 0.0 0.0
10 9077832746989941740 26940000.0 26940000.0 1 0.0 0.0 0.0 0.0 0.0 11970000.0 0.0 13970000.0
11 8250541234138796403 85960000.0 85960000.0 1 0.0 0.0 0.0 0.0 80960000.0 0.0 0.0 0.0
12 2435691687281327196 133700000.0 133700000.0 1 0.0 0.0 79940000.0 0.0 0.0 43170000.0 5590000.0 0.0
13 3782396217900147405 36580000.0 36580000.0 1 0.0 33580000.0 0.0 0.0 0.0 0.0 0.0 0.0
14 7758585801799209122 68990000.0 68990000.0 1 0.0 59990000.0 0.0 0.0 0.0 0.0 0.0 0.0

2. A Spark ML Pipeline for Clustering

Feature Engineering

For the ML algorithm, the features must be encoded as a vector of numbers. Since the basket metrics and the per-category spend lie in very different ranges, they are scaled.
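With `withStd=True, withMean=False`, the scaler divides each feature by its sample standard deviation without centering it. A minimal sketch of that operation for a single column (pure Python; assuming the corrected, n-1 sample standard deviation):

```python
import math

def scale_by_std(column):
    """Divide each value by the column's sample standard deviation,
    like StandardScaler(withStd=True, withMean=False) does per feature."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [x / std for x in column]
```

Not centering (`withMean=False`) keeps sparse feature vectors sparse, which matters here because most customers have spend in only a few categories.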

In [21]:
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
In [22]:
#scaling
sScaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
In [23]:
# the input to the algorithm must be a single vector column

vectorAssembler = VectorAssembler()\
    .setInputCols(["avg_revenue", "sum_revenue", "total_transactions", "avg_time",
               "Apparel", "Bags", "Brand", "Drinkware", "Electronics", "Lifestyle", "Office"])\
    .setOutputCol("features")

# assembling, scaling
transformationPipeline = Pipeline().setStages([vectorAssembler,sScaler])
In [24]:
# collect statistics for scaling
fittedPipeline = transformationPipeline.fit(customerMatrix)
In [25]:
# actually transform
transformedMatrix = fittedPipeline.transform(customerMatrix)
In [26]:
# cache it, since it will be accessed many times; this makes repeated access efficient
transformedMatrix.cache()
Out[26]:
DataFrame[fullVisitorId: string, avg_revenue: double, sum_revenue: double, total_transactions: bigint, avg_time: double, Apparel: double, Bags: double, Brand: double, Drinkware: double, Electronics: double, Lifestyle: double, Office: double, features: vector, scaledFeatures: vector]
In [27]:
transformedMatrix.select("features", "scaledFeatures").show()
+--------------------+--------------------+
|            features|      scaledFeatures|
+--------------------+--------------------+
|(11,[0,1,2,4],[6....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,7,10],...|(11,[0,1,2,7,10],...|
|(11,[0,1,2,4],[9....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,7,10],...|(11,[0,1,2,7,10],...|
|(11,[0,1,2,4],[1....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,4],[2....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,7],[3....|(11,[0,1,2,7],[1....|
|(11,[0,1,2,10],[6...|(11,[0,1,2,10],[0...|
|(11,[0,1,2,4],[3....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,7],[1....|(11,[0,1,2,7],[0....|
|(11,[0,1,2,8,10],...|(11,[0,1,2,8,10],...|
|(11,[0,1,2,7],[8....|(11,[0,1,2,7],[0....|
|(11,[0,1,2,5,8,9]...|(11,[0,1,2,5,8,9]...|
|(11,[0,1,2,4],[3....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,4],[6....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,4],[1....|(11,[0,1,2,4],[0....|
|[7.3116E8,7.3116E...|[2.58141286470510...|
|(11,[0,1,2,4,5],[...|(11,[0,1,2,4,5],[...|
|(11,[0,1,2,4],[5....|(11,[0,1,2,4],[0....|
|(11,[0,1,2,4],[2....|(11,[0,1,2,4],[0....|
+--------------------+--------------------+
only showing top 20 rows

Clustering with K-Means
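K-Means (Lloyd's algorithm) alternates two steps: assign every point to its nearest center, then move every center to the mean of its assigned points. A minimal one-dimensional sketch of that idea (not Spark's implementation, which runs in parallel and uses k-means|| initialization):

```python
import random

def kmeans_1d(points, k, iters=20, seed=1):
    """Minimal Lloyd's algorithm on scalars: assign each point to the
    nearest center, then move each center to the mean of its points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest center by squared distance
            j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        # update step: center becomes the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)
```

For two well-separated groups such as `[1.0, 1.1, 0.9, 10.0, 10.2, 9.8]` with k = 2, the centers converge to the group means 1.0 and 10.0.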

In [28]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
In [29]:
# Trains a k-means model.
# Test with k = 10
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(transformedMatrix)
In [30]:
# Make predictions
predictions = model.transform(transformedMatrix)
In [31]:
model.computeCost(transformedMatrix)
Out[31]:
1.4004123823026964e+20
In [32]:
predictions
Out[32]:
DataFrame[fullVisitorId: string, avg_revenue: double, sum_revenue: double, total_transactions: bigint, avg_time: double, Apparel: double, Bags: double, Brand: double, Drinkware: double, Electronics: double, Lifestyle: double, Office: double, features: vector, scaledFeatures: vector, prediction: int]
In [33]:
# Evaluate the clustering by computing the silhouette score.

# The silhouette is a measure of the consistency within clusters.
# It ranges between 1 and -1, where a value close to 1 means that the points
# in a cluster are close to the other points in the same cluster and far
# from the points of the other clusters.
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Silhouette with squared euclidean distance = 0.834235010594473
Cluster Centers:
[5.81771813e+07 6.24261807e+07 1.10599284e+00 7.06425164e-01
 3.52531798e+07 4.42479875e+06 1.84950805e+05 5.73668157e+06
 2.29698569e+06 1.87020125e+06 5.75765206e+06]
[7.23102426e+08 6.79384497e+09 1.85000000e+01 4.84761905e+00
 9.48220000e+08 1.70685000e+08 0.00000000e+00 1.99618000e+09
 2.61890000e+08 2.25350000e+08 1.04896000e+09]
[7.98711772e+08 3.16860749e+09 6.12500000e+00 1.19705357e+01
 2.16402500e+08 2.16368750e+08 6.98750000e+05 5.13008750e+08
 3.50368750e+08 3.60990000e+08 9.77097500e+08]
[3.10956333e+09 3.41733833e+09 1.33333333e+00 3.66666667e+00
 3.57371667e+08 6.74608333e+08 0.00000000e+00 6.74525000e+08
 7.92983333e+08 6.64166667e+07 5.78566667e+08]
[1.53355001e+09 4.60065003e+09 6.00000000e+00 4.33333333e+00
 4.58265000e+09 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[2.95605388e+08 3.74175101e+08 1.59595960e+00 4.10867003e+00
 1.25032626e+08 2.23641919e+07 8.07575758e+04 5.41484343e+07
 3.31446970e+07 1.47384848e+07 8.93552525e+07]
[7.00350003e+09 7.00350003e+09 1.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 6.99650000e+09
 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[1.35352462e+09 1.98355091e+09 1.95454545e+00 6.44696970e+00
 6.71865000e+08 4.42477727e+08 0.00000000e+00 2.74834091e+08
 1.37850455e+08 1.23099091e+08 2.26979091e+08]
[6.96652430e+08 9.58375627e+08 2.03125000e+00 5.88076637e+00
 1.42125469e+08 1.30504688e+08 7.40937500e+06 1.82208281e+08
 1.16909688e+08 6.23792188e+07 1.63022656e+08]
[7.23027102e+08 1.73297200e+09 2.70000000e+00 1.61433333e+01
 6.09420000e+07 1.07547000e+08 0.00000000e+00 1.44440000e+08
 1.63446000e+08 1.27230000e+08 9.41692000e+08]

Parameter Tuning and Model Selection

In [34]:
bestK = 0
bestSilhouette = 0.0
bestModel = None

evaluator = ClusteringEvaluator()

for k in range(5,16):
    kmeans = KMeans().setK(k).setSeed(1)
    model = kmeans.fit(transformedMatrix)
    predictions = model.transform(transformedMatrix)
    silhouette = evaluator.evaluate(predictions)
    print("k: ", k, "silhouette: ", silhouette )
    if (silhouette > bestSilhouette):
            bestK = k
            bestSilhouette = silhouette
            bestModel = model

print("best k: ", bestK, "silhouette: ", bestSilhouette )

k:  5 silhouette:  0.8884372518274559
k:  6 silhouette:  0.9143857363097061
k:  7 silhouette:  0.8922666959148186
k:  8 silhouette:  0.8911821945231404
k:  9 silhouette:  0.8217613365683168
k:  10 silhouette:  0.834235010594473
k:  11 silhouette:  0.8220060929467548
k:  12 silhouette:  0.8808577205951386
k:  13 silhouette:  0.8749689752418214
k:  14 silhouette:  0.7761558116055427
k:  15 silhouette:  0.7896723885115109
best k:  6 silhouette:  0.9143857363097061
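The silhouette value reported by `ClusteringEvaluator` measures how much closer each customer lies to its own cluster than to the nearest other cluster (1 is ideal, values near 0 indicate overlapping clusters). As a rough illustration of the metric only, not of Spark's implementation, here is a minimal one-dimensional sketch with a hypothetical helper `silhouette_score_1d`:

```python
import numpy as np

def silhouette_score_1d(points, labels):
    """Mean silhouette over all points (1-D, Euclidean).

    s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance
    to the other points of i's own cluster and b(i) is the smallest mean
    distance to the points of any other cluster.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = points[(labels == lab) & (np.arange(len(points)) != i)]
        if len(own) == 0:  # singleton cluster: s(i) is defined as 0
            scores.append(0.0)
            continue
        a = np.abs(own - p).mean()
        b = min(np.abs(points[labels == other] - p).mean()
                for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated clusters on a line -> silhouette close to 1
print(round(silhouette_score_1d([1.0, 1.5, 10.0, 10.5], [0, 0, 1, 1]), 3))  # -> 0.944
```

With k = 6 the score of roughly 0.91 is the best in the scanned range, so that model is kept.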
In [35]:
bestModel.summary.clusterSizes
Out[35]:
[2372, 2, 1, 2, 131, 40]
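The cluster sizes are strikingly unbalanced. A quick back-of-the-envelope calculation on the sizes printed above shows that one segment contains almost all customers, while clusters 1, 2 and 3 consist of single extreme buyers or pairs of them:

```python
sizes = [2372, 2, 1, 2, 131, 40]  # bestModel.summary.clusterSizes from above
total = sum(sizes)
shares = [round(100 * s / total, 1) for s in sizes]
print(shares)  # -> [93.1, 0.1, 0.0, 0.1, 5.1, 1.6] (percent of all customers)
```

In other words, the substantive segments are 0, 4 and 5; the tiny clusters isolate outlier purchases rather than genuine customer groups.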

3. Analysis of the Customer Segments

Segment Profiles

In [36]:
predictions = bestModel.transform(transformedMatrix)
In [37]:
predictions
Out[37]:
DataFrame[fullVisitorId: string, avg_revenue: double, sum_revenue: double, total_transactions: bigint, avg_time: double, Apparel: double, Bags: double, Brand: double, Drinkware: double, Electronics: double, Lifestyle: double, Office: double, features: vector, scaledFeatures: vector, prediction: int]
In [38]:
clusterProperties = predictions.groupBy('prediction').agg(avg(expr('avg_revenue / 1.0E6')).alias('avgRev'),
                                     avg(expr('sum_revenue / 1.0E6')).alias('sumRev'),
                                     avg('total_transactions').alias('transactions'),
                                     avg('avg_time').alias('days'),
                                     avg(expr('Apparel / 1.0E6')).alias('Apparel'),
                                     avg(expr('Bags / 1.0E6')).alias('Bags'),
                                     avg(expr('Brand / 1.0E6')).alias('Brand'),
                                     avg(expr('Drinkware / 1.0E6')).alias('Drinkware'),
                                     avg(expr('Electronics / 1.0E6')).alias('Electronics'),
                                     avg(expr('Lifestyle/ 1.0E6')).alias('Lifestyle'),
                                     avg(expr('Office / 1.0E6')).alias('Office'),
                                     count('prediction').alias('size')).orderBy('prediction')
In [39]:
clusterProperties.show()
+----------+------------------+------------------+------------------+------------------+-----------------+-----------------+-------------------+------------------+------------------+------------------+------------------+----+
|prediction|            avgRev|            sumRev|      transactions|              days|          Apparel|             Bags|              Brand|         Drinkware|       Electronics|         Lifestyle|            Office|size|
+----------+------------------+------------------+------------------+------------------+-----------------+-----------------+-------------------+------------------+------------------+------------------+------------------+----+
|         0| 68.68862881590783| 76.35014331871841|1.1332209106239461|0.8954820123664979|40.09289629005059|5.320876897133218|0.17434654300168634| 7.648912310286676|3.9968001686340604|2.3556197301854955| 8.512647554806067|2372|
|         1| 723.1024258285714| 6793.844967999999|              18.5| 4.847619047619047|           948.22|          170.685|                0.0|           1996.18|            261.89|            225.35|           1048.96|   2|
|         2|       7003.500032|       7003.500032|               1.0|               0.0|              0.0|              0.0|                0.0|            6996.5|               0.0|               0.0|               0.0|   1|
|         3|       4431.340032|       4431.340032|               1.0|               0.0|         1072.115|         1224.825|                0.0|          1312.025|               0.0|              99.5|           718.875|   2|
|         4| 566.1913733675028| 767.8419863206109|1.8473282442748091| 5.158796801163214|144.2908396946565|76.11045801526716| 3.7419083969465654| 131.3081679389313| 74.06809160305342| 40.82541984732824| 195.6112977099236| 131|
|         5|1270.6264768233339|2396.9127420000004|               3.0| 9.873273809523809|        509.51175|           355.54|0.13974999999999999|322.08174999999994|          305.6405|          180.6125|486.09274999999997|  40|
+----------+------------------+------------------+------------------+------------------+-----------------+-----------------+-------------------+------------------+------------------+------------------+------------------+----+
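The table rows can also be turned into relative category profiles, which makes segments easier to compare regardless of their absolute spend. As an illustration, using the rounded values for cluster 0 copied from the table above:

```python
# Rounded category spend for cluster 0, taken from the clusterProperties table
cluster0 = {'Apparel': 40.09, 'Bags': 5.32, 'Brand': 0.17, 'Drinkware': 7.65,
            'Electronics': 4.00, 'Lifestyle': 2.36, 'Office': 8.51}
total = sum(cluster0.values())
shares = {cat: round(100 * v / total, 1) for cat, v in cluster0.items()}
print(shares)  # Apparel accounts for well over half of this segment's category spend
```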

Visualization

In [40]:
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
In [41]:
clusterSizes = clusterProperties.select(col('prediction').cast('string'),'size').toPandas()

iplot(go.Figure( layout=dict(title='Cluster Sizes', width=500, height=500),

                data= [go.Bar( y = clusterSizes["size"],
                                x = clusterSizes["prediction"],
                                text= clusterSizes["size"],
                                textposition='auto')]))
In [42]:
clusterProperties.select(max('avgRev'),
                         max('sumRev'),
                         max('transactions'),
                         max('days'),
                         max('Apparel'),
                         max('Bags'),
                         max('Brand'),
                         max('Drinkware'),
                         max('Electronics'),
                         max('Lifestyle'),
                         max('Office')).toPandas()
Out[42]:
max(avgRev) max(sumRev) max(transactions) max(days) max(Apparel) max(Bags) max(Brand) max(Drinkware) max(Electronics) max(Lifestyle) max(Office)
0 7003.500032 7003.500032 18.5 9.873274 1072.115 1224.825 3.741908 6996.5 305.6405 225.35 1048.96
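These per-attribute maxima are the natural basis for the axis ranges of the radar charts below. One possible heuristic, assuming we simply round each maximum up to a "nice" bound with a hypothetical helper `nice_upper` (the notebook instead picks shared bounds by hand so that comparable axes use the same scale), would be:

```python
import math

# Per-attribute maxima, rounded from the table above
maxima = [7003.5, 7003.5, 18.5, 9.87, 1072.1, 1224.8, 3.74, 6996.5, 305.6, 225.4, 1049.0]

def nice_upper(x):
    """Round x up at its leading digit, e.g. 1224.8 -> 2000, 18.5 -> 20."""
    mag = 10 ** int(math.floor(math.log10(x)))
    return int(math.ceil(x / mag) * mag)

ranges = [[0, nice_upper(m)] for m in maxima]
print(nice_upper(1224.8))  # -> 2000
```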
In [43]:
# modified from https://www.kaggle.com/fabiendaniel/customer-segmentation


def _scale_data(data, ranges):
    # Map all values onto the coordinate range of the first axis so that
    # a single polar axis can plot every attribute.
    (x1, x2) = ranges[0]
    return [(d - y1) / (y2 - y1) * (x2 - x1) + x1 for d, (y1, y2) in zip(data, ranges)]

class RadarChart():
    def __init__(self, fig, location, sizes, variables, ranges, n_ordinate_levels = 6):

        angles = np.arange(0, 360, 360./len(variables))

        ix, iy = location[:] ; size_x, size_y = sizes[:]

        axes = [fig.add_axes([ix, iy, size_x, size_y], polar = True,
        label = "axes{}".format(i)) for i in range(len(variables))]

        _, text = axes[0].set_thetagrids(angles, labels = variables)

        for txt, angle in zip(text, angles):
            if angle > -1 and angle < 181:
                txt.set_rotation(angle - 90)
            else:
                txt.set_rotation(angle - 270)

        for ax in axes[1:]:
            ax.patch.set_visible(False)
            ax.xaxis.set_visible(False)
            ax.grid(False)

        for i, ax in enumerate(axes):
            grid = np.linspace(*ranges[i],num = n_ordinate_levels)
            grid_label = [""]+["{:.0f}".format(x) for x in grid[1:-1]]
            ax.set_rgrids(grid, labels = grid_label, angle = angles[i])
            ax.set_ylim(*ranges[i])

        self.angle = np.deg2rad(np.r_[angles, angles[0]])
        self.ranges = ranges
        self.ax = axes[0]

    def plot(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.plot(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def fill(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.fill(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def legend(self, *args, **kw):
        self.ax.legend(*args, **kw)

    def title(self, title, *args, **kw):
        self.ax.text(0, 1.1, title, transform = self.ax.transAxes, *args, **kw)
In [44]:
merged_df = clusterProperties.toPandas()

n_clusters = 6

fig = plt.figure(figsize=(12,12))

attributes = ['avgRev', 'sumRev', 'transactions', 'days', 'Apparel', 'Bags', 'Brand', 'Drinkware', 'Electronics', 'Lifestyle', 'Office']
# axis ranges chosen with the per-attribute maxima computed above in mind
ranges = [[0, 7100], [0, 7100], [0, 20], [0, 10], [0, 1500], [0, 1500], [0, 10], [0, 7100], [0, 1500], [0, 1500], [0, 1500]]

index = list(range(n_clusters))  # merged_df is ordered by prediction, so row i is cluster i

n_groups = n_clusters ; i_cols = 3
i_rows = n_groups//i_cols
size_x, size_y = (1/i_cols), (1/i_rows)

for ind in range(n_clusters):
    cluster_name = merged_df.loc[index[ind], 'prediction']
    cluster_size = merged_df.loc[index[ind], 'size']
    ix = ind % i_cols ; iy = i_rows - ind // i_cols
    pos_x = ix*(size_x + 0.1) ; pos_y = iy*(size_y + 0.0)
    location = [pos_x, pos_y]  ; sizes = [size_x, size_y]
    #______________________________________________________
    data = np.array(merged_df.loc[index[ind], attributes])
    radar = RadarChart(fig, location, sizes, attributes, ranges)
    radar.plot(data, color = 'b', linewidth=2.0)
    radar.fill(data, alpha = 0.2, color = 'b')
    radar.title(title = 'cluster {}, size = {}'.format(cluster_name, cluster_size), color = 'r', size=14)