AWS Machine Learning using Spark & EMR

Nº 1 Amazon EMR

Amazon Elastic MapReduce (EMR)

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Nº 2 Apache Spark & Amazon EMR

Run Spark on EMR!

Apache Spark is a distributed processing framework and programming model that helps you do machine learning, stream processing, or graph analytics using Amazon EMR clusters. Similar to Apache Hadoop, Spark is an open-source, distributed processing system commonly used for big data workloads. However, Spark has several notable differences from Hadoop MapReduce. Spark has an optimized directed acyclic graph (DAG) execution engine and actively caches data in-memory, which can boost performance, especially for certain algorithms and interactive queries.


Nº 3 Demos

Demo 1: News Article Analysis

See how Apache Spark, Amazon EMR, and Amazon SageMaker can be integrated to cluster news stories, so that users can be notified of recommended stories similar to others they are interested in. This demo uses a public dataset called GDELT. GDELT captures news articles from around the world, and extracts and exposes metadata such as names, dates, locations, and themes for analysis. The machine learning libraries built into Spark, called SparkML, are then used to run a clustering analysis with the KMeans algorithm. This demo was built by Ron Weinstein, AWS Professional Services.
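To make the clustering step more concrete, here is a minimal sketch of the KMeans idea in plain Python. It is not the demo's code and does not use Spark: the `kmeans` function, the `articles` feature vectors, and the choice of the first `k` points as starting centers are all assumptions made for illustration. Spark's own KMeans runs distributed across the cluster and uses a different default initialization (k-means||).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_center(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center nearest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points: Sequence[Point], k: int,
           max_iterations: int = 20) -> Tuple[List[Point], List[int]]:
    """Toy k-means (Lloyd's algorithm) on a small in-memory dataset."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Naive initialization (an assumption of this sketch): first k points.
    centers = list(points[:k])
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [closest_center(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centers.append(mean_point(members) if members else centers[i])
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    # Final labels, consistent with the centers that are returned.
    labels = [closest_center(p, centers) for p in points]
    return centers, labels


# Hypothetical two-dimensional feature vectors for six news articles.
articles = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(articles, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Articles that share a label fall into the same cluster, which is the kind of grouping a recommendation of similar stories could be built on. In the demo, this role is played by Spark's KMeans running on the EMR cluster over features derived from the GDELT metadata.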

To log in:

Username: demo
Password: Sp@rkEmr

View the Architecture Diagram
