The AWS Signal Corps has built the beginning of a data lake, but their colony was infected before they could finish. In this workshop, we pick up where they left off.
We will discover and organize the various datasets using AWS Glue, query them using Amazon Athena, and empower business users through Amazon QuickSight to analyze the data without having to write SQL.
Because some of the data is streamed into the data lake in real time through Amazon Kinesis Firehose, we add real-time analytics using Kinesis Analytics.
Lastly, we run Spark analytics using Amazon EMR, which is connected to the AWS Glue Data Catalog.
In this section you will use the CloudFormation template to create the following:
Amazon Kinesis Firehose to deliver the simulated data to Amazon S3.
Open the AWS CloudFormation console with the template.
This will open CloudFormation and pre-fill the template URL:
https://s3.amazonaws.com/serverless-analytics/zombie-datalake/deploy.yaml
Click Next
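If you would rather launch the stack from code than click through the console, here is a minimal sketch using boto3 (the stack name is a placeholder, and CAPABILITY_IAM is assumed because the template creates IAM roles):

# Sketch: launch the workshop CloudFormation stack programmatically.
import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="zombie-datalake",  # hypothetical stack name
    TemplateURL="https://s3.amazonaws.com/serverless-analytics/zombie-datalake/deploy.yaml",
    Capabilities=["CAPABILITY_IAM"],  # assumption: the template creates IAM roles
)

# Block until the stack finishes creating before moving on.
cfn.get_waiter("stack_create_complete").wait(StackName="zombie-datalake")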
The previous organization started to build a data lake, but it quickly turned into a data swamp. The survivors were overrun by zombies before they could categorize the data and make it discoverable so that analytics could be run against it.
In this section, we'll first discover and organize the data by automatically crawling the data sources using AWS Glue. The crawl will initially create a catalog of what data exists, what the datasets contain, how to read them, and where to access them. The crawlers will also automatically maintain the metadata for us as the data lake changes.
Next, you will help the survivors start to query and analyze the data using Amazon Athena. We'll query the virus data, survivor stats, and other information such as weather and population to perform descriptive analytics.

Fill out the add crawler wizard.
Select the new crawler and click Run crawler.
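The same crawler can also be created and started through the Glue API. A minimal sketch; the crawler name, IAM role, and S3 path are placeholders for the values you enter in the wizard:

# Sketch: create and run a Glue crawler over the data lake bucket.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="zombie-datalake-crawler",            # hypothetical crawler name
    Role="AWSGlueServiceRoleDefault",          # hypothetical IAM role
    DatabaseName="zombie_annihilation",
    Targets={"S3Targets": [{"Path": "s3://<your-data-bucket>/"}]},  # placeholder path
)

glue.start_crawler(Name="zombie-datalake-crawler")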


Navigate to the Amazon Athena Console.
Amazon Athena
Select zombie_annihilation under DATABASE
Execute the following queries
SELECT * FROM flumart LIMIT 10;
SELECT * FROM flumart WHERE DATE '2015-12-21' BETWEEN sdate AND edate;
For the second query we get an error because Athena thinks sdate and edate are varchars, not dates.
In the flumart table, open the schema, change the type of sdate to DATE, do the same for edate, and click Save.
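The same type change can be made through the Glue API instead of the console. A sketch, assuming the database and table names used in this lab:

# Sketch: change the sdate and edate columns of flumart to the date type.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="zombie_annihilation", Name="flumart")["Table"]

# TableInput only accepts a subset of the fields returned by get_table.
allowed = ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters")
table_input = {k: v for k, v in table.items() if k in allowed}

for column in table_input["StorageDescriptor"]["Columns"]:
    if column["Name"] in ("sdate", "edate"):
        column["Type"] = "date"

glue.update_table(DatabaseName="zombie_annihilation", TableInput=table_input)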

Next, select Compare versions in the console to see the changes we made to the table, then re-run the query:
SELECT * FROM flumart WHERE DATE '2015-12-21' BETWEEN sdate AND edate;
Let's first query the flu data. Notice that when we query, we don't need to know anything about the underlying storage formats.
SELECT country, SUM(all_inf) AS infections
FROM flumart
WHERE sdate > DATE '2017-01-01'
GROUP BY country
ORDER BY infections DESC;
We can also join multiple datasets together, each of which could be stored in a different format.
SELECT flu.country, 1.0 * infections / population * 10e6 AS inf_per_mm
FROM world_factbook
INNER JOIN (
  SELECT flumart.country, SUM(flumart.all_inf) AS infections
  FROM flumart
  WHERE flumart.sdate > DATE '2017-01-01'
  GROUP BY flumart.country
  ORDER BY infections DESC
) flu ON flu.country = world_factbook.country
ORDER BY inf_per_mm DESC;
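The same queries can also be run programmatically through the Athena API. A minimal sketch; the query-results location is a placeholder bucket you own:

# Sketch: run the infections-by-country query with boto3 and print a few rows.
import time
import boto3

athena = boto3.client("athena")

query = """
SELECT country, SUM(all_inf) AS infections
FROM flumart
WHERE sdate > DATE '2017-01-01'
GROUP BY country
ORDER BY infections DESC;
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "zombie_annihilation"},
    ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/"},  # placeholder
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
for row in rows[:5]:
    print([col.get("VarCharValue") for col in row["Data"]])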
Fill out the Add job wizard
Click Save then Run job and Run job again.
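If you prefer the Glue API over the wizard, the job can be created and started with code like the sketch below; the job name, IAM role, and script location are placeholders for the values you chose in the wizard:

# Sketch: create and run a Glue ETL job equivalent to the wizard steps above.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="zombie-datalake-etl",                # hypothetical job name
    Role="AWSGlueServiceRoleDefault",          # hypothetical IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-script-bucket>/scripts/etl-job.py",  # placeholder
    },
)

run = glue.start_job_run(JobName="zombie-datalake-etl")
print(run["JobRunId"])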
Navigate to the Crawlers section of the AWS Glue console. Click the Add crawler button.
Fill out the add crawler wizard.
Select the new crawler and click Run crawler.
Now that we know what data exists and how to access it, we want to get the help of all the survivors to start analyzing the data. Unfortunately, some of the survivors aren't technical and asking them to perform queries using SQL or writing code won't work.
Amazon QuickSight is going to help us here.
In this activity, you'll create a set of datasets within QuickSight that use the Glue Data Catalog and Athena as a backend. This enables these users to easily query the data in S3 without having to understand the raw data formats, how to read the data, or how to write SQL.
These non-technical users will then be able to create a set of dashboards to help us better understand the zombie movements.
Let’s create the various datasets that we’ll visualize.
Bring up QuickSight
In the top right, select “Manage data”

Now select “New data set” in the top left corner

From New Data Sources, Select “Athena”
Select “Create Data Source”

Now we’ll use that datasource to create a dataset
Select the Database that we were using in the first lab, and then select the world_factbook table.

And hit the “Select” button on the right.
Select the option to query the data directly, then choose the “Visualize” option

Building our first visualizations:
On the left hand side, you’ll notice each field for the table or dataset we just created:

Go ahead and select the “country”, “naturalgasproduction”, and “oilproduction” fields.
This will show a visualization that looks like this:

Notice it automatically selected a chart type. Suppose we wanted to see this as a treemap view, let’s select that under Visual Types:

And now you should see a chart where the size is the amount of natural gas produced and the color is the amount of oil produced in each country.

Across the bottom of the visualization (when the browser is maximized), you’ll see it called out here:

Next, let’s resize this visual to only take the top half of the display. Click and hold the resize icon in the bottom left of the visual and drag

Now let's create a new dataset in QuickSight, joining the factbook and flu data together. We'll calculate the amount of flu based on the population of the regions.
Now select the custom SQL tool under the Tables section:

Enter in ZombieRegionTrend for the SQL name and the following SQL:
SELECT flu.sdate, flu.whoregion as region,
       1.0 * sum(flu.all_inf) / sum(population) * 10e6 AS inf_per_mm
FROM zombie_annihilation.world_factbook wf
join zombie_annihilation.flumart flu on (flu.country = wf.country)
group by sdate, whoregion
Save it as the name "ZombieRegionTrend" and Select "Save"

Now we'll add this new dataset to the existing analysis we were doing...
Select the drop down for world_factbook, and select Edit Data sets

Select "Add data set"
Now we can have multiple datasets in one analysis. Let's start visualizing this one too.
Resize the visualization so you can see both charts on the same display:

In Part 3 we are going to look at the chat events coming into the chat application from the Zombie Apocalypse workshop. Messages are being sent from people across the globe informing each other of the status of their countries and the level of infection. In this simulation, events are streamed to a Kinesis Firehose delivery stream, on top of which you will build a Kinesis Analytics application to evaluate and aggregate the sentiment value of the messages and push the results to another delivery stream that persists them to S3.
At this point the deployment of the CloudFormation template should be complete and all resources deployed. There is a simulator API deployed to the EC2 instance. We are going to validate it is up and running and will start it.
To validate the simulator is running you can open in the browser:
http://<ec2-public-ip>:8080/health
This will return the health of the endpoint and you should see a status of:
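You can also check the endpoint from code; a minimal sketch using the Python standard library (replace the placeholder with the instance's public IP):

# Sketch: poll the simulator health endpoint.
import urllib.request

with urllib.request.urlopen("http://<ec2-public-ip>:8080/health") as response:
    print(response.status, response.read().decode())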
To start and stop the simulator you can:
Go to http://<ec2-public-ip>:8080/swagger-ui.html
Click on either start/stop simulator
All application logs for the simulator are pushed to CloudWatch Logs in the log group zombie-chat-simulator.
The components of the Zombie Annihilation simulator consist of:
You will be creating in this lab:
(Extra Credit) QuickSight dashboard showing number of negative messages by geographic location.

Now that the simulator is running, we want to validate that records are flowing to the Kinesis stream. Follow the steps below to validate.
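One way to validate this from code is to read the source delivery stream's IncomingRecords metric from CloudWatch; the delivery stream name below is a placeholder for the one created by the CloudFormation stack:

# Sketch: confirm records are arriving on the source Firehose delivery stream.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="IncomingRecords",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "<source-delivery-stream>"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=10),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])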
The first task is to create the Kinesis Firehose delivery stream that the Kinesis Analytics application will use as its output. The application will use a Lambda function to pre-process the records coming into the Kinesis Analytics application, and you will write SQL to aggregate the positive, negative, and total counts by country and send them to a Kinesis Firehose destination.
To create the destination Kinesis Firehose for the Analytics application follow the instructions below:
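For reference, the destination delivery stream could also be created through the Firehose API. In the sketch below the stream name, role ARN, and bucket are placeholders, and the sentiment/ prefix matches where the lab expects the aggregated records to land:

# Sketch: create the destination Firehose delivery stream writing to S3.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="zombie-sentiment-delivery",   # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::<account-id>:role/<firehose-role>",  # placeholder
        "BucketARN": "arn:aws:s3:::<your-working-bucket>",            # placeholder
        "Prefix": "sentiment/",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
    },
)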
Once you have created the delivery stream and validated that records are streaming to the source stream, we can create the Kinesis Analytics application.
To get started from the AWS console go to Kinesis and select Kinesis Analytics.
In the SQL text box replace the existing text with the SQL below:
-- Creates an output stream and defines a schema
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    INGEST_TIME timestamp,
    COUNTRY VARCHAR(140),
    LATITUDE DOUBLE,
    LONGITUDE DOUBLE,
    NEGATIVE_COUNT INTEGER,
    POSITIVE_COUNT INTEGER,
    MESSAGE_COUNT INTEGER);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '30' SECOND) AS "ingest_time",
    "country", "latitude", "longitude",
    COUNT(CASE WHEN "sentiment" = 'negative' THEN 1 ELSE NULL END) as NEGATIVE_COUNT,
    COUNT(CASE WHEN "sentiment" <> 'negative' THEN 1 ELSE NULL END) as POSITIVE_COUNT,
    COUNT(*) AS MESSAGE_COUNT
FROM "SOURCE_SQL_STREAM_001"
GROUP BY "country", "latitude", "longitude",
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '30' SECOND);
Click "Save and Continue"
Looking at the SQL you can see it looks very much like ANSI SQL. The differences lie in how a Kinesis Analytics application processes a continuous stream of data. It does this by creating a series of STREAMs and PUMPs. The STREAM parses the data coming into the application, and the PUMP is a way to move aggregates of data over time to be evaluated. In this example we are creating a "DESTINATION_SQL_STREAM" STREAM to send the aggregates to. We analyze the streaming data coming in every 30 seconds and write a record out to the destination stream. This destination stream will persist the data to S3 to be used in your data lake to determine the spread of zombie infection based on the analysis of messages coming in from each country.
Let the analytics application run for a few minutes and check S3 to ensure you have data in the sentiment prefix of the working bucket.
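A quick way to check from code (the bucket name is a placeholder for the working bucket from the stack outputs):

# Sketch: list objects under the sentiment/ prefix of the working bucket.
import boto3

s3 = boto3.client("s3")

response = s3.list_objects_v2(Bucket="<your-working-bucket>", Prefix="sentiment/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])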

Fill out the add crawler wizard.
Select the new crawler and click Run crawler.

This is going to create the zombie_sentiment table in the zombie_annihilation database in the Glue Data Catalog. We will use this table with the Athena connector in QuickSight to visualize the zombie infection spreading across the globe.
When Glue discovers the sentiment data it cannot infer the column names from the Kinesis Analytics application. You will need to click on the table in the Glue console and click "Edit Schema". The schema of the table should look like:

You will make the changes to look like:

Verify the table is created and validate you can query the table with Athena.
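For example, a minimal programmatic check (the output location is a placeholder; poll for completion as in the earlier Athena sketch before reading results):

# Sketch: confirm the zombie_sentiment table is queryable from Athena.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT * FROM zombie_sentiment LIMIT 10;",
    QueryExecutionContext={"Database": "zombie_annihilation"},
    ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/"},  # placeholder
)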
We are now going to open the QuickSight console and build a dashboard with the results from the Kinesis Analytics application. Ensure QuickSight is in the N. Virginia region.
From the QuickSight console we are going to create a new data set to visualize.
This will draw the dashboard visualization, show you where the most negative messages are coming from around the globe, and help you plot your next move to escape from the zombies.

Some of the survivors come from a rich research background and want to run advanced analytics on the data lake.
In this lab, we are going to show how they can launch an EMR cluster integrated with the data lake and Glue Data Catalog.
Using a Zeppelin notebook, the survivors will run ML algorithms using SparkML in order to predict the areas with the highest spread of zombies and where we should move to for the greatest survival rates.
Note: It may take approximately 1 minute for both crawlers to parse the data in CSV and Parquet format.
https://s3.amazonaws.com/serverless-analytics/zombie-datalake/deploy/SparkMLLab.json
Click on Choose a JSON here to select and import the SparkMLLab.json file that you downloaded.
Click on SparkMLLab notebook you just imported.
Note: There are 4 paragraphs in this Lab
- K-Means Model Creation using SparkML - This paragraph generates a K-Means model based on the Birthrate, Deathrate, and Infantmortalityrate columns from the WorldFactBook.csv data. The Spark application uses the AWS Glue Data Catalog table zombiedatalakeworldfactbook_csv to read the data from Amazon S3 (a rough sketch of this step appears after this list).
- Running prediction based on the K-Means model using Spark Streaming - As simulated data generated by the Lambda function lands in the Amazon S3 bucket, it is used to run predictions against the generated model using Spark Streaming. The result is stored in a temp table "PredictionTable" that prediction queries use to generate reports. The prediction classifies the zombie data based on the model into 3 categories: i. 0 - Safe Zone - with lower mortality rates; ii. 1 - Warning Zone - with moderate birth and mortality rates; iii. 2 - Danger Zone - with high mortality rates.
- Terminate Spark Streaming Job - Helper paragraph to gracefully terminate the Spark Streaming job.
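To give a sense of what the model-creation paragraph does, here is a rough PySpark sketch; the notebook's actual code may differ, the column names follow the description above (Glue typically lowercases them), and the output path is a placeholder based on the ZombieBucket output:

# Sketch: cluster countries into 3 zones (Safe / Warning / Danger) with K-Means
# using birth rate, death rate, and infant mortality rate from the factbook
# table registered in the AWS Glue Data Catalog.
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the CSV-backed table through the Glue Data Catalog and keep only the
# three numeric features used for clustering.
factbook = (spark.table("zombiedatalakeworldfactbook_csv")
            .select(col("birthrate").cast("double"),
                    col("deathrate").cast("double"),
                    col("infantmortalityrate").cast("double"))
            .na.drop())

assembler = VectorAssembler(
    inputCols=["birthrate", "deathrate", "infantmortalityrate"],
    outputCol="features",
)

# k=3 matches the Safe / Warning / Danger zones described above.
model = KMeans(k=3, seed=1).fit(assembler.transform(factbook))
model.write().overwrite().save("s3://<ZombieBucket>/models/kmeans")  # placeholder path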
To run the K-Means Model Creation using SparkML paragraph, copy the value of ZombieBucket from the Outputs tab of the CloudFormation template.
Click the |> (run) button in the K-Means Model Creation using SparkML paragraph to generate the prediction model.
Note: |> is hidden in a paragraph when the Run option is disabled.
Note: The Run option is disabled for all paragraphs except Prediction Report.