Zombie Annihilation Workshop


The AWS Signal Corps has built the beginning of a data lake, but their colony was infected before they could finish. In this workshop, we pick up where they left off.

We discover and organize the various datasets using AWS Glue, query against them using Amazon Athena, and empower business users through Amazon QuickSight to analyze the data without having to write SQL.

Because some of the data being streamed into the data lake is coming in real time through Amazon Kinesis Firehose, we add real-time analytics using Kinesis Analytics.

Lastly, we run Spark analytics using Amazon EMR, which is connected to the AWS Glue Data Catalog.


Prereqs

Create and download an EC2 key pair (if you don't already have one to use)

  1. Log in to the AWS Management Console and navigate to the EC2 Dashboard
  2. Make a note of the AWS region name; for this lab you will need to choose the US East (N. Virginia) region
  3. Click on Key Pairs in the EC2 Dashboard Navigation pane
  4. Click on Create Key Pair
  5. Enter a name (e.g. zombieLabKey) for the Key pair name in the pop-up window
  6. Click Create
  7. Save the key as a file on your local machine with a ".pem" extension (e.g. zombieLabKey.pem).

Deploying CloudFormation Stack

In this section you will use the CloudFormation template to create the resources used in the remaining activities.


Activity 1: Discovering, Organizing, and Understanding the Data

The previous organization started to build a data lake, but it really turned into a data swamp. The survivors were quickly overrun by zombies and weren't able to categorize the data or make it discoverable for analytics.

In this section, we'll first discover and organize the data by automatically crawling the data sources using AWS Glue. The crawl will let us create an initial catalog of what data exists, what the datasets contain, how to read them, and where to access them. The crawlers will also automatically maintain that metadata for us as the lake changes.

Next -- you will help survivors start to query and analyze the data using Amazon Athena. We'll query against the virus data, survivor stats, and other information such as weather and population to perform descriptive analytics.

Constants/Names

Create Catalog

  1. Navigate to the Crawlers section of the AWS Glue console and click the Add crawler button.


  1. Fill out the add crawler wizard.

    1. Crawler info
      • Crawler name: zombie_annihilation
    2. Data store
      • Data store: S3
      • Crawler data in: Specified path in another account
      • Include path: s3://serverless-analytics/zombie-datalake/swamp/
    3. Select 'Next'
    4. Data store (Add another data store)
      • No
    5. Select 'Next'
    6. IAM Role
      • Choose an existing IAM role
      • IAM role: GlueJobRole-Zombies
    7. Select 'Next'
    8. Schedule
      • Frequency: Run on demand
    9. Select 'Next'
    10. Configure Crawlers Output
      • Add database: zombie_annihilation
      • NOTE: it is very important to name the database zombie_annihilation
      • Prefix added to tables: (blank)
    11. Select 'Next'
    12. Review all steps
      • Select 'Finish'
  2. Select the new crawler and click Run crawler.


Explore the tables

  1. Navigate to the Databases>Tables section of the AWS Glue console. Click on the names of the tables to see details on each schema.
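
You can also inspect the new catalog entries with SQL. As a quick check (a sketch -- it assumes the zombie_annihilation database name used above; flumart is one of the crawled tables we use below), run the following in the Athena query editor:

SHOW TABLES IN zombie_annihilation;
SHOW CREATE TABLE zombie_annihilation.flumart;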


  1. Navigate to the Amazon Athena Console.
    Select zombie_annihilation under DATABASE.

  2. Execute the following queries

    SELECT *
    FROM flumart limit 10;
    SELECT *
    FROM flumart
    WHERE DATE '2015-12-21' BETWEEN sdate AND edate;

    For the second query we get an error because Athena treats sdate and edate as varchars, not dates.
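
    If you want to query before fixing the schema, one workaround (a sketch -- it assumes the dates are stored as ISO yyyy-mm-dd strings) is to cast inline:

    SELECT *
    FROM flumart
    WHERE DATE '2015-12-21' BETWEEN CAST(sdate AS DATE) AND CAST(edate AS DATE);

    The cleaner fix is to correct the column types in the Glue Data Catalog, which we do next.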

Update the flumart Schema

  1. Navigate to the Databases>Tables section of the AWS Glue console.
  2. Click into the flumart table.
  3. Click Edit schema.
  4. Click on the first column, whose name appears as "country" with stray characters in front of it (a byte-order mark picked up by the crawler), and rename it "country".
  5. Click on string next to sdate and update it to DATE.
  6. Repeat for edate.
  7. Click Save Schema

  8. Next, select Compare versions in the console to see the changes we made to the table.

  9. Scroll down and confirm/inspect the changes that were made.
  10. In the Amazon Athena console rerun the date query.
    SELECT *
    FROM flumart
    WHERE DATE '2015-12-21' BETWEEN sdate AND edate;

Explore the data

  1. Run queries in Amazon Athena to find potentially safe countries.

Let's first query the flu data -- notice that when we query, we don't need to know anything about the underlying storage formats.

SELECT country, SUM(all_inf) AS infections
FROM flumart
WHERE sdate > DATE '2017-01-01'
GROUP BY country
ORDER BY infections DESC;

We can also join multiple datasets together, each of which could be in a different format.

SELECT flu.country, 1.0 * infections / population * 10e6 AS inf_per_mm
FROM world_factbook
  INNER JOIN (
    SELECT flumart.country, SUM(flumart.all_inf) AS infections
    FROM flumart
    WHERE flumart.sdate > DATE '2017-01-01'
    GROUP BY flumart.country
    ORDER BY infections DESC
  ) flu
  ON flu.country = world_factbook.country
ORDER BY inf_per_mm DESC;

Activity 1b (Extra Credit) - Convert to Parquet

Constants/Names

Create the Reencoding Job

  1. Navigate to the Jobs section of the AWS Glue console.
  2. Click the Add job button
  3. Fill out the Add job wizard

    1. Job properties
      • Name: zombies-flunet-parquet
      • IAM role:
      • This job runs: A proposed script generated by AWS Glue
      • Temporary directory: s3://[Zombie Data Lake S3 Bucket from cfn template]/tmp
    2. Data sources
      • flumart
    3. Data targets
      • "Create tables in your data target"
      • Data Store: Amazon S3
      • Format: Parquet
      • Target path: s3://[Zombie Data Lake S3 Bucket from cfn template]/flumart/
    4. Schema
      • Leave as is and click Next
    5. Review
      • Click Finish
  4. Click Save, then Run job, and confirm by clicking Run job again in the dialog.

  5. Wait for the job to finish

Index the new table

  1. Navigate to the Crawlers section of the AWS Glue console. Click the Add crawler button.

  2. Fill out the add crawler wizard.

    1. Crawler info
      • Crawler name: zombies-parquet
    2. Data store
      • Data store: S3
      • Crawler data in: Specified path in my account
      • Include path: s3://[Zombie Data Lake S3 Bucket from cfn template]/flumart/
    3. Data store (Add another data store)
      • No
    4. IAM Role
      • Create an IAM role
      • IAM role: AWSGlueServiceRole-zombies-parquet-crawler
    5. Schedule
      • Frequency: Run on demand
    6. Output
      • Add database: zombies-parquet
      • Prefix added to tables: (blank)
    7. Review all steps
      • Finish
  3. Select the new crawler and click Run crawler.

Compare the new and old tables

  1. Navigate to the Amazon Athena console.
  2. Run queries against the original table and the new table; note the time the queries take to run and the amount of data scanned.
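
For example (a sketch -- the database names assume those created above, and the Parquet table name depends on what the crawler derives from the flumart/ S3 path), run the same aggregation against both copies and compare the run time and data scanned that Athena reports:

-- Original CSV-backed table
SELECT country, SUM(all_inf) AS infections
FROM zombie_annihilation.flumart
GROUP BY country;

-- Parquet copy written by the Glue job
SELECT country, SUM(all_inf) AS infections
FROM "zombies-parquet".flumart
GROUP BY country;

Because Parquet is columnar and compressed, the second query should scan considerably less data.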

Activity 2 - Empowering non-technical survivors to help drive zombie insights

Now that we know what data exists and how to access it, we want to get the help of all the survivors to start analyzing the data. Unfortunately, some of the survivors aren't technical and asking them to perform queries using SQL or writing code won't work.

Amazon QuickSight is going to help us here.

In this activity, you'll create a set of datasets within QuickSight that use the Glue Data Catalog and Athena as a backend. This lets these users easily query the data in S3 without having to understand the raw data formats, how to read the data, or how to write SQL.

These non-technical users will then be able to create a set of dashboards to help us understand the zombie movements better.

Let’s create the various datasets that we’ll visualize.

  1. Bring up QuickSight


  2. In the top right, select “Manage data”


  3. Now select “New data set” in the top left corner


  4. From New Data Sources, Select “Athena”

  5. Type in “Zombie Data Lake” for the Data Source Name
  6. Select “Create Data Source”


Now we’ll use that datasource to create a dataset
  1. Select the Database that we were using in the first lab, and then select the world_factbook table.


  2. And hit the “Select” button on the right.

  3. Select "Directly query your data" and then the "Visualize" option


Building our first visualizations:

On the left hand side, you’ll notice each field for the table or dataset we just created:


Go ahead and select the “country”, “naturalgasproduction”, and “oilproduction” fields.

This will render an initial visualization of those fields.

Notice that QuickSight automatically selected a chart type. Suppose we want to see this as a treemap instead; select the treemap view under Visual Types.

Now you should see a chart where the size represents the amount of natural gas produced and the color represents the amount of oil produced in each country.

Across the bottom of the visualization (when the browser is maximized), you'll see these field assignments called out.

Next, let’s resize this visual to only take the top half of the display. Click and hold the resize icon in the bottom left of the visual and drag

Console

Now let's create a new dataset in QuickSight, joining the factbook and flu data together. We'll calculate the amount of flu, based on the population of the regions.

  1. Click the QuickSight logo in the top left corner to go back to the main homepage
  2. Select "Manage data"
  3. Select "New Data Set"
  4. Scroll to the bottom and select the Athena source created prior named "Zombie Data Lake"
  5. Select Create Data Set
  6. Select zombie_annihilation database
  7. Select Edit/Preview Data
  8. Now select the custom SQL tool under the Tables section.

  9. Enter ZombieRegionTrend for the SQL name and the following SQL:

    SELECT flu.sdate, flu.whoregion AS region, 1.0 * SUM(flu.all_inf) / SUM(population) * 10e6 AS inf_per_mm
    FROM zombie_annihilation.world_factbook wf
    JOIN zombie_annihilation.flumart flu ON (flu.country = wf.country)
    GROUP BY sdate, whoregion
  10. Select Finish
  11. Save it with the name "ZombieRegionTrend" and select "Save"

Now we'll add this new dataset to the existing analysis we were doing...

  1. Go into the world factbook analysis that you were working on
  2. Select the drop down for world_factbook, and select Edit Data sets


  3. Select "Add data set"

  4. Select the "ZombieRegionTrend" dataset

Now we can have multiple datasets in one analysis. Let's start visualizing this one now, too.

  1. Select all 3 fields under the ZombieRegionTrend dataset
  2. Resize the visualizations so you can see both charts on the same display.


Activity 3 - We need to know what's happening now - with real-time data

Overview

In Part 3 we are going to look at the chat events coming into the chat application from the Zombie Apocalypse workshop. Messages are being sent from people across the globe informing each other of the status of their countries and the level of infection. In this simulation, events are streamed to a Kinesis Firehose delivery stream. You will build a Kinesis Analytics application on top of it to evaluate and aggregate the sentiment of the messages and push the results to another delivery stream that persists them to S3.

Start the simulator

At this point the deployment of the CloudFormation template should be complete and all resources deployed. There is a simulator API deployed to the EC2 instance. We are going to validate that it is up and running and then start it.

To validate the simulator is running you can open in the browser:

http://<ec2-public-ip>:8080/health

This will return the health of the endpoint and you should see a status of:

{"status":"UP"}

To start and stop the simulator you can:

    # go to
    http://<ec2-public-ip>:8080/swagger-ui.html

Click on either start/stop simulator

All application logs for the simulator get pushed to CloudWatch Logs in the log group zombie-chat-simulator.

What's in the box

The components of the Zombie Annihilation simulator consist of:

You will be creating in this lab:

Validate records are being sent to the Kinesis Firehose delivery stream and S3

Now that the simulator is running, we want to validate that records are flowing to the Kinesis stream. Follow the steps below to validate.

  1. From the Services menu click on Kinesis.
  2. Click Kinesis Firehose and select the delivery stream labeled zombie-annihilation-simula-IngestionFirehoseStream-*.
  3. Click the "Monitoring" tab. On the IncomingRecords and DeliveryToS3 graphs you should have positive increasing values.
  4. Also check the S3 Logs tab and ensure you have no errors there.
  5. Finally we will take a look at the S3 location where the files are persisted to see the directory structure.

Create Kinesis Analytics application to aggregate chat message sentiment by country

The first task is to create the Kinesis Firehose delivery stream that the Kinesis Analytics application will use as its output. The application uses a Lambda function to pre-process the records coming into the Kinesis Analytics application, and you will write SQL to aggregate the Positive, Negative, and Total counts by Country and send them to a Kinesis Firehose destination.

Create output destination for the sentiment analysis

To create the destination Kinesis Firehose for the Analytics application follow the instructions below:

  1. Click "Go to the Firehose console"
  2. Create Delivery Stream
  3. Name: ZombieAnnihilationSentimentAggregationStream
  4. Direct PUT or other sources selected.
  5. Click "Next".
  6. Leave the defaults and Click "Next" again.
  7. Destination: "S3"
  8. Select S3 Bucket
  9. Prefix: "sentiment/"
  10. Click "Next"
  11. Change Buffer Interval to 60 seconds.
  12. Leave defaults for everything else on this page.
  13. Click create new IAM role.
  14. Leave IAM Role as "firehose_delivery_role" and Policy Name to "Create a New Role Policy" and Click "Allow"
  15. Click "Next"
  16. Click "Create Delivery Stream"

Once you have created the delivery stream and validated that records are streaming to the source stream, we can create the Kinesis Analytics application.

To get started from the AWS console go to Kinesis and select Kinesis Analytics.
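
The exact SQL depends on the schema your Lambda pre-processor emits, but a sketch of the aggregation described above might look like the following (the country and sentiment column names and the 'POSITIVE'/'NEGATIVE' values are assumptions -- adjust them to match your in-application input stream):

-- In-application output stream for the per-country sentiment counts
CREATE OR REPLACE STREAM "SENTIMENT_AGG_STREAM" (
    "country"        VARCHAR(64),
    "positive_count" INTEGER,
    "negative_count" INTEGER,
    "total_count"    INTEGER
);

-- Pump the pre-processed chat messages into the output stream,
-- aggregating counts per country over a one-minute tumbling window.
-- Sentiment values are assumed to arrive as 'POSITIVE' / 'NEGATIVE' strings.
CREATE OR REPLACE PUMP "SENTIMENT_AGG_PUMP" AS
INSERT INTO "SENTIMENT_AGG_STREAM"
SELECT STREAM "country",
       SUM(CASE WHEN "sentiment" = 'POSITIVE' THEN 1 ELSE 0 END),
       SUM(CASE WHEN "sentiment" = 'NEGATIVE' THEN 1 ELSE 0 END),
       COUNT(*)
FROM "SOURCE_SQL_STREAM_001"
GROUP BY "country",
         STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);

Point the application's destination at the ZombieAnnihilationSentimentAggregationStream delivery stream you just created so the aggregated results land under the sentiment/ prefix in S3.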


Extra Credit - Discover and Visualize real-time data

Let the analytics application run for a few minutes and check S3 to ensure you have data in the sentiment prefix of the working bucket.

  1. Navigate to the Crawlers section of the AWS Glue console. Click the Add crawler button.
  1. Fill out the add crawler wizard.

    1. Crawler info
      • Crawler name: ZombieAnnihilationSentimentCrawler
    2. Data store
      • Data store: S3
      • Crawler data in: Specified path in my account
      • Include path: /sentiment
    3. Data store (Add another data store)
      • No
    4. IAM Role
      • Create an IAM role
      • IAM role: AWSGlueServiceRole-zombies-crawler
    5. Schedule
      • Frequency: Run on demand
    6. Output
      • Add database: zombie_annihilation
      • Prefix added to tables: zombie_
    7. Review all steps
      • Finish
  2. Select the new crawler and click Run crawler.

    This is going to create the zombie_sentiment table in the zombie_annihilation database in the Glue Data Catalog. We will use this table with the Athena connector in QuickSight to visualize the zombie infection spreading across the globe.

Modify the schema

When Glue crawls the sentiment data it cannot infer the column names from the Kinesis Analytics application output, so the crawled table will have generic column names. Click on the table in the Glue console, click "Edit Schema", and rename the columns to match the fields the analytics application emits (the QuickSight steps below assume columns such as latitude, longitude, and negative_count).

Verify table contents with Athena

Verify the table is created and validate you can query the table with Athena.

  1. Open Athena from the Service console.
  2. Select the database: zombie_annihilation
  3. Verify the zombie_sentiment table exists, click the ... next to the table, and select "Preview table". Look at the results to see the returned columns.
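
"Preview table" simply runs a SELECT ... LIMIT query; you can also run a quick aggregate to confirm the counts are populating (the column names below assume the schema edits made above):

SELECT * FROM zombie_annihilation.zombie_sentiment LIMIT 10;

SELECT country, SUM(negative_count) AS negative_messages
FROM zombie_annihilation.zombie_sentiment
GROUP BY country
ORDER BY negative_messages DESC;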

Visualize with QuickSight

We are now going to open the QuickSight console and build a dashboard with the results from the Kinesis Analytics application. Ensure QuickSight is in the N. Virginia region.

From the QuickSight console we are going to create a new data set to visualize.

  1. Click "Manage Data"
  2. Click "New Data set"
  3. Select the Athena connector.
  4. Choose data source name "ZombieSentiment" and click "Create".
  5. Select the zombie_annihilation database and the zombie_sentiment table.
  6. Select directly query your data and click "Edit/Preview data".
  7. In the "Fields" left hand menu select the Latitude column and select the "..." menu and select "Add to hierarchy".
  8. Select "Create a new geospatial hierarchy" and click "Add".
  9. Select the "longitude" field for the Longitude field of the hierarchy and click "Update".
  10. Click "Save & Visualize"
  11. In Visual Types select the Globe icon
  12. In the field wells menu select "Latitude" for geospatial, "negative_count" for Size, and "longitude" for color.

This will draw the dashboard visualization and show you where the most negative messages are coming from around the globe, helping you plot your next move to escape from the zombies.


Activity 4 - Predicting Zombie Zone using Spark ML

Some of the survivors come from a rich research background and want to run advanced analytics on the data lake.

In this lab, we are going to show how they can launch an EMR cluster integrated with the data lake and Glue Data Catalog.

Using a Zeppelin Notebook, the survivors will run ML algorithms using SparkML in order to predict areas of the highest spread of Zombies and where we want to move to with the greatest survivor rates.
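
Because the cluster uses the Glue Data Catalog as its metastore, a quick sanity check (a sketch -- it assumes the zombie_annihilation database from Activity 1 and Zeppelin's %sql interpreter) is to query one of the catalog tables directly from a notebook paragraph:

%sql
-- Resolved through the Glue Data Catalog rather than a cluster-local Hive metastore
SELECT country, SUM(all_inf) AS infections
FROM zombie_annihilation.flumart
GROUP BY country
ORDER BY infections DESC
LIMIT 10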

Note: It may take approximately 1 minute for both crawlers to parse the data in CSV and Parquet format.

  1. To open the web interfaces, in your browser's address bar, type master-public-dns followed by the port number or URL.

Running the Spark Application from the Zeppelin Notebook

  1. Download the following JSON:
     https://s3.amazonaws.com/serverless-analytics/zombie-datalake/deploy/SparkMLLab.json
  2. Select your Amazon CloudFormation stack (ZombieDataLake)
  3. Click on the Outputs tab
  4. Click on the URL link for the ZepplinNoteBookUrl output to open the Zeppelin notebook
  5. In the Zeppelin page click on Import note
  6. Click on Choose a json here to select and import the SparkMLLab.json that you downloaded

  7. Click on SparkMLLab notebook you just imported.

Note: There are 4 paragraphs in this Lab

  1. To run the K-Means Model Creation using SparkML paragraph, copy the value of ZombieBucket from the Outputs tab of the CloudFormation template.

  2. Click the |> in the K-Means Model Creation using SparkML paragraph to generate the prediction model

  3. After the execution is complete, with the cursor in the K-Means Model Creation using SparkML paragraph, press Ctrl+Option+R to disable Run
  4. Click the |> in the Running prediction based on the K-Means model using Spark Streaming paragraph to start the Spark Streaming job.
  5. After the execution is complete, with the cursor in the Running prediction based on the K-Means model using Spark Streaming paragraph, press Ctrl+Option+R to disable Run

Note: |> is hidden in the paragraph when the Run option is disabled.

  1. Ensure that the run option is disabled for the Terminate Spark Streaming Job paragraph as well.
  2. Ensure that the run option is enabled for Prediction Report

Note: The run option is disabled for all paragraphs except Prediction Report

  1. Click on the Run scheduler (clock icon in the menu bar) and set the Preset value to 1m (i.e. 1 min) in the pop-up menu. This will schedule the Prediction Report to run every minute, refreshing as new zombie data comes in.

Cleanup

  1. Delete the zombie data lake S3 bucket
  2. Delete the CFN template out of CloudFormation
  3. Delete the Kinesis Analytics applications
  4. Delete the Glue crawlers and databases (zombie_annihilation, plus zombies-parquet if you did the bonus)
  5. Delete the Glue ETL job (if you did the bonus)
  6. Delete the QuickSight dataset, Athena data source, and analysis