News Article Recommendation

1. Architecture

Architecture Diagram

2. Data Flow

The data flow is:

  1. GDELT data is stored in Amazon S3 and streamed into Amazon EMR, where data scientists work with it.
  2. A data scientist uses a notebook for exploratory analysis, data formatting, cleaning, and algorithm tuning, and uses SparkML to call and tune the clustering algorithms.
  3. Cluster predictions and news data are loaded into Amazon Elasticsearch Service (Amazon ES) for search and analysis.
  4. Users access the index through Kibana, a web interface for dashboards, search, and visualization.
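
Step 3 does not say how the predictions reach the index. One common option is the Elasticsearch bulk API, which accepts newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that request body and does not contact a cluster. The index name `news-clusters`, the field names (`id`, `title`, `cluster`), and the sample records are all assumptions made for illustration; they are not taken from the GDELT schema or from this solution.

```python
import json


def to_bulk_payload(articles, predictions, index_name="news-clusters"):
    """Build an Elasticsearch _bulk request body (NDJSON).

    articles:    list of dicts, each with a hypothetical "id" field
    predictions: dict mapping article id -> cluster label
    index_name:  hypothetical index name, not specified by the source
    """
    lines = []
    for article in articles:
        doc = dict(article)  # copy so the caller's record is not mutated
        doc["cluster"] = predictions[article["id"]]
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": article["id"]}}))
        # Source line: the document itself, with its cluster label attached.
        lines.append(json.dumps(doc))
    # The bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up records standing in for cleaned news data and model output.
sample_articles = [
    {"id": "a1", "title": "Example headline one"},
    {"id": "a2", "title": "Example headline two"},
]
sample_predictions = {"a1": 3, "a2": 0}

payload = to_bulk_payload(sample_articles, sample_predictions)
print(payload, end="")
```

The resulting string could then be sent to the domain's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. Once indexed, the `cluster` field would be available in Kibana for filtering and aggregation.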