Monday, April 20, 2020

Google Analytics serverless ETL function

How to build your own Google Analytics ETL?

Sometimes the Google Analytics web interface is just not enough (or maybe you want to combine the data with other sources?). Here we will build a working Lambda function that extracts data from a Google Analytics account. For one day and one account view, the function creates a CSV file with the transaction ID and other marketing columns.

The demo is done on macOS Catalina.

Requirements:
1) Google Account
2) Google Analytics account
3) AWS account
4) Docker installed

Here are the main steps:
1) Create Google API service account and download credentials file
2) Enable Google Analytics and Google Analytics reporting for this API
3) Grant Google Analytics access to your service account email
4) Create a Python function that writes the API query result to S3

Step 1
Go to your Google Cloud credentials page (https://console.cloud.google.com/apis/credentials), click "Create Credentials" and select "Service account".


Enter the desired service account name:


Skip the optional part 2:


Click "CREATE KEY" and generate JSON key file. It should download to your computer.


Step 2

Go to https://console.cloud.google.com/apis/library and enable both Analytics services (Google Analytics API and Google Analytics Reporting API).


Step 3

Go to your Google Analytics account. In Admin -> Account User Management, add your service account email (e.g. mytesting@local-storm-255611.iam.gserviceaccount.com) and give it "Read & Analyse" permissions.
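To confirm that the permission works (and to look up the view ID you will need in Step 4), a small sketch against the Management API can list everything the service account can see. This assumes the Google Analytics API was enabled in Step 2 and the key file is saved locally as cred.json:

from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
credentials = ServiceAccountCredentials.from_json_keyfile_name('cred.json', SCOPES)

# The Management API (v3) exposes account, property and view metadata
analytics_v3 = build('analytics', 'v3', credentials=credentials, cache_discovery=False)

summaries = analytics_v3.management().accountSummaries().list().execute()
for account in summaries.get('items', []):
    for prop in account.get('webProperties', []):
        for view in prop.get('profiles', []):
            print(account['name'], '->', prop['name'], '-> view id:', view['id'])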

Step 4

Create an AWS account (or log in if you already have one) and create an S3 bucket:
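If you prefer to create the bucket from code rather than the console, a rough boto3 equivalent (the bucket name is a placeholder, and bucket names are globally unique) looks like this:

import boto3

s3 = boto3.client('s3', region_name='eu-north-1')

# Replace 'mytestbucket' with your own, globally unique bucket name
s3.create_bucket(
    Bucket='mytestbucket',
    CreateBucketConfiguration={'LocationConstraint': 'eu-north-1'}
)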


Go to https://eu-north-1.console.aws.amazon.com/lambda/home?region=eu-north-1#/functions and press "Create Function"

Then name your function and choose the Python 3.8 runtime.
Create a file called cred.json in your Lambda function and paste into it the contents of the credentials JSON file downloaded in Step 1.



Copy this code into lambda_function.py:


import json
import boto3
import csv
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = 'cred.json'

def initialize_analyticsreporting():
  """Initializes an Analytics Reporting API V4 service object.
  Returns:
    An authorized Analytics Reporting API V4 service object.
  """
  credentials = ServiceAccountCredentials.from_json_keyfile_name(
      KEY_FILE_LOCATION, SCOPES)
  # Build the service object.
  analytics = build('analyticsreporting', 'v4', credentials=credentials, cache_discovery=False)
  return analytics

def get_report(analytics, VIEW_ID, dt):
  """Queries the Analytics Reporting API V4.
  Args:
    analytics: An authorized Analytics Reporting API V4 service object.
    VIEW_ID: The Google Analytics view ID to query.
    dt: The report date in YYYY-MM-DD format.
  Returns:
    The Analytics Reporting API V4 response.
  """
  return analytics.reports().batchGet(
      body={
        'reportRequests': [
        {
          'viewId': VIEW_ID,
          'dateRanges': [{'startDate': dt, 'endDate': dt}],
          'hideTotals': True,
          'metrics': [{'expression': 'ga:itemQuantity'}],
          'dimensions': [{'name': 'ga:transactionId'},
                         {'name': 'ga:dateHour'},
                         {'name': 'ga:adContent'},
                         {'name': 'ga:sourceMedium'},
                         {'name': 'ga:campaign'},
                         {'name': 'ga:keyword'}
                         ]
        }]
      }
  ).execute()
def lambda_handler(event, context):
    VIEW_ID = str(event['view_id'])
    dt = str(event['dt'])
    analytics = initialize_analyticsreporting()
    response = get_report(analytics, VIEW_ID, dt)
    if response['reports'][0]['data'].get('rows'):
        # Keep only the dimension values (transaction id, date hour, ad content, ...)
        rows = [row['dimensions'] for row in response['reports'][0]['data']['rows']]

        # Lambda functions may only write to /tmp
        with open('/tmp/temp.csv', 'w', newline='') as file:
            writer = csv.writer(file)
            writer.writerows(rows)

        # Upload the CSV as google/GA/YYYY/MM/DD/<view_id>.csv
        with open('/tmp/temp.csv', 'rb') as f:
            s3 = boto3.client('s3')
            fn = 'google/GA/' + dt[0:4] + '/' + dt[5:7] + '/' + dt[8:10] + '/' + VIEW_ID + '.csv'
            s3.upload_fileobj(f, str(event['bucket']), fn)

If you try to run the function in Lambda now, it will not work: the Google API client library is not part of the standard Lambda runtime, so we need to add it through a layer.

First, we will build the layer package using Docker Desktop. Open a terminal and paste these commands:


cd Documents/
mkdir google_layer
cd google_layer/
mkdir lambda_function lambda_layers
mkdir -p lambda_layers/python/lib/python3.8/site-packages
cd lambda_layers
docker run -v "$PWD":/var/task "lambci/lambda:build-python3.8" /bin/sh -c "pip install google-api-python-client -t python/lib/python3.8/site-packages/; pip install oauth2client -t python/lib/python3.8/site-packages/; exit"
zip -r ga.zip python > /dev/null



After this you will find ga.zip inside Documents/google_layer/lambda_layers.
Upload this file as a Lambda layer.

In the Lambda menu, go to Layers and press "Create layer".
Upload the .zip file we created in the previous step.


Name it and choose Python 3.8 as a compatible runtime.

Go back to the function you created, click on the Layers section of the Designer, and click "Add a layer".


Choose the layer you created and click "Add". Do not forget to save the function afterwards.
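If you would rather publish and attach the layer from code instead of clicking through the console, a rough boto3 sketch (the layer and function names are placeholders, and it assumes you run it from the lambda_layers folder created above) might look like this:

import boto3

lambda_client = boto3.client('lambda', region_name='eu-north-1')

# Publish the zip we built with Docker as a new layer version
with open('ga.zip', 'rb') as f:
    layer = lambda_client.publish_layer_version(
        LayerName='google-api-layer',
        Content={'ZipFile': f.read()},
        CompatibleRuntimes=['python3.8']
    )

# Attach the layer to the function created earlier ('ga-etl' is a placeholder name)
lambda_client.update_function_configuration(
    FunctionName='ga-etl',
    Layers=[layer['LayerVersionArn']]
)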



Let's test our function.

Click on "Test" and configure your test event (paste this and change the values where needed):

{
  "view_id": 1111111,
  "dt": "2020-04-20",
  "bucket": "mytestbucket"
}

You can find your own view_id on this page: https://ga-dev-tools.appspot.com/account-explorer/
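Once everything is set up (including the S3 permissions granted below), the same event shape makes it easy to backfill several days from a small script. A rough sketch, where the function name, view ID and bucket are placeholders:

import json
from datetime import date, timedelta

import boto3

lambda_client = boto3.client('lambda', region_name='eu-north-1')

# Backfill the last 7 days for one view ('ga-etl' is a placeholder function name)
for days_back in range(1, 8):
    dt = (date.today() - timedelta(days=days_back)).isoformat()
    event = {'view_id': 1111111, 'dt': dt, 'bucket': 'mytestbucket'}
    lambda_client.invoke(
        FunctionName='ga-etl',
        InvocationType='Event',  # asynchronous, fire-and-forget
        Payload=json.dumps(event).encode()
    )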
Finally, you need to give your Lambda role access to your S3 bucket.

Go to https://console.aws.amazon.com/iam/home?region=eu-north-1#/home
From the left menu, select Roles and click on your Lambda role:


Click on "Attach policies":


Start typing "S3", check the box next to "AmazonS3FullAccess" and attach the policy.
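AmazonS3FullAccess is the quickest way to get going; if you would rather keep the role narrower, a rough sketch of an inline policy scoped to the destination prefix (role, policy and bucket names are placeholders) could be attached with boto3 like this:

import json

import boto3

iam = boto3.client('iam')

# Limit the Lambda role to uploading into the single destination prefix
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': ['s3:PutObject'],
        'Resource': 'arn:aws:s3:::mytestbucket/google/GA/*'
    }]
}

iam.put_role_policy(
    RoleName='my-ga-lambda-role',  # placeholder: use your function's execution role
    PolicyName='ga-etl-s3-write',
    PolicyDocument=json.dumps(policy)
)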
