This repository was archived by the owner on Dec 30, 2024. It is now read-only.

Commit 6de8b46

Merge pull request #51 from aws-solutions/develop
Update to version v1.7.0
2 parents: a056642 + 4bd97f6

212 files changed: +397,422 / -20,487 lines


CHANGELOG.md (+56 -36)

@@ -5,96 +5,116 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.7.0] - 2022-02-14
+
+### Added
+
+- The capability to ingest custom data by uploading files as JSON, XLSX, or CSV files
+
+### Updated
+
+- Use Amazon Kinesis Data Firehose partition projection to store and partition the data by source date (instead of system processing date)
+- Use Amazon Athena dynamic partitioning features to run SQL queries on data stored in S3 bucket
+- AWS CDK version 1.137.0
+- AWS SDK version 2.1067.0
+
+### Removed
+
+- Creating of AWS Glue partitions (replaced with Amazon Athena dynamic partitions)
+
 ## [1.6.1] - 2021-10-26
+
 ### Fixed
-- GitHub [issue #42](https://github.com/aws-solutions/discovering-hot-topics-using-machine-learning/issues/42). To fix the issue, RSS feed ingestion lambda function and SQLs related to the Amazon QuickSight dashboard were updated.
+
+- GitHub [issue #42](https://github.com/aws-solutions/discovering-hot-topics-using-machine-learning/issues/42). To fix the issue, RSS feed ingestion lambda function and SQLs related to the Amazon QuickSight dashboard were updated.
 
 ### Updated
-- AWS CDK version to 1.125.0
-- AWS SDK version to 2.1008.0
+
+- AWS CDK version to 1.125.0
+- AWS SDK version to 2.1008.0
 
 ## [1.6.0] - 2021-09-27
 
 ### Added
 
-- Capability to ingest YouTube comments
+- Capability to ingest YouTube comments
 
 ### Updated
 
-- AWS CDK version to 1.121.0
-- AWS SDK version to 2.991.0
-- Updated Amazon QuickSight analysis and dashboard to reflect the new ingestion source
+- AWS CDK version to 1.121.0
+- AWS SDK version to 2.991.0
+- Updated Amazon QuickSight analysis and dashboard to reflect the new ingestion source
 
 ## [1.5.0] - 2021-07-22
 
 ### Added
 
-- Ingest RSS feeds from over ~3000+ news websites across the world
+- Ingest RSS feeds from over ~3000+ news websites across the world
 
 ### Updated
 
-- AWS CDK version to 1.110.1
-- AWS SDK version to 2.945.0
-- Updated Nodejs Lambda runtimes to use Nodejs 14.x
-- Updated Amazon QuickSight analysis and dashboard to reflect the new ingestion source
-- Updated AWS StepFunction workflows to handle parallel ingestion (tweets from Twitter and RSS feeds from News websites)
+- AWS CDK version to 1.110.1
+- AWS SDK version to 2.945.0
+- Updated Nodejs Lambda runtimes to use Nodejs 14.x
+- Updated Amazon QuickSight analysis and dashboard to reflect the new ingestion source
+- Updated AWS StepFunction workflows to handle parallel ingestion (tweets from Twitter and RSS feeds from News websites)
 
 ### Fixed
 
-- Truncated tweets through merging [GitHub pull request #26](https://github.com/awslabs/discovering-hot-topics-using-machine-learning/pull/26)
+- Truncated tweets through merging [GitHub pull request #26](https://github.com/awslabs/discovering-hot-topics-using-machine-learning/pull/26)
 
 ## [1.4.0] - 2021-02-04
 
 ### Added
 
-- Capability to use geo coordinates when invoking the Twitter API to filter tweets returned by its Search API
-- New visuals and sheets (tabs) on Amazon QuickSight to perform analysis using geo coordinates (when available with tweets)
-- Additional remediation to handle throttling conditions from Twitter v1.1 API calls and push additional information to Amazon CloudWatch Logs that can be used to create alarms or notifications using CloudWatch Metric Filters
+- Capability to use geo coordinates when invoking the Twitter API to filter tweets returned by its Search API
+- New visuals and sheets (tabs) on Amazon QuickSight to perform analysis using geo coordinates (when available with tweets)
+- Additional remediation to handle throttling conditions from Twitter v1.1 API calls and push additional information to Amazon CloudWatch Logs that can be used to create alarms or notifications using CloudWatch Metric Filters
 
 ### Updated
 
-- Switched to AWS Managed KMS keys for AWS Glue Security Configuration
-- AWS CDK version to 1.83.0
-- AWS SDK version to 2.828.0
+- Switched to AWS Managed KMS keys for AWS Glue Security Configuration
+- AWS CDK version to 1.83.0
+- AWS SDK version to 2.828.0
 
 ## [1.3.0] - 2020-11-24
 
 ### Changed
 
-- Implementation to refactor and to reuse the following architecture patterns from [AWS Solutions Constructs](https://aws.amazon.com/solutions/constructs/)
-  - aws-kinesisfirehose-s3
-  - aws-kinesisstreams-lambda
-  - aws-lambda-step-function
+- Implementation to refactor and to reuse the following architecture patterns from [AWS Solutions Constructs](https://aws.amazon.com/solutions/constructs/)
+  - aws-kinesisfirehose-s3
+  - aws-kinesisstreams-lambda
+  - aws-lambda-step-function
 
 ### Updated
 
-- The join condition for Topic Modeling in Amazon QuickSight dataset to provide accurate topic identification for a specific run
-- ID and name generation for Amazon QuickSigh resource to use dynamic value based on the stack name
-- AWS CDK version to 1.73.0
-- AWS SDK version to 2.790.0
+- The join condition for Topic Modeling in Amazon QuickSight dataset to provide accurate topic identification for a specific run
+- ID and name generation for Amazon QuickSigh resource to use dynamic value based on the stack name
+- AWS CDK version to 1.73.0
+- AWS SDK version to 2.790.0
 
 ## [1.2.0] - 2020-10-29
 
 ### Added
 
-- New and simplified interactive Amazon QuickSight dashboard that is now automatically generated through an AWS CloudFormation deployment and that customers can extend to suit their business case
+- New and simplified interactive Amazon QuickSight dashboard that is now automatically generated through an AWS CloudFormation deployment and that customers can extend to suit their business case
 
 ### Updated
 
-- Updated to AWS CDK v1.69.0
-- Consolidate Amazon S3 access Log bucket across the solution. All access log files have a prefix that corresponds to the bucket for which they are generated
+- Updated to AWS CDK v1.69.0
+- Consolidate Amazon S3 access Log bucket across the solution. All access log files have a prefix that corresponds to the bucket for which they are generated
 
 ## [1.1.0] - 2020-09-29
 
 ### Updated
 
-- S3 storage for inference outputs to use Apache Parquet
-- Add partitioning to AWS Glue tables
-- Update to AWS CDK v1.63.0
-- Update to AWS SDK v2.755.0
+- S3 storage for inference outputs to use Apache Parquet
+- Add partitioning to AWS Glue tables
+- Update to AWS CDK v1.63.0
+- Update to AWS SDK v2.755.0
 
 ## [1.0.0] - 2020-08-28
 
 ### Added
 
-- Initial release
+- Initial release
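The 1.7.0 "Updated" entries above replace explicit AWS Glue partition creation with Amazon Athena partition projection and Kinesis Data Firehose dynamic partitioning by source date. As a rough illustration of what partition projection looks like, the sketch below creates a hypothetical table whose partitions are derived from table properties at query time; the database, table, column, and bucket names are invented and this is not the solution's actual DDL.

```
# Illustrative only: a minimal Athena partition-projection table, not the DDL the
# solution itself generates. Database, table, column, and bucket names are made up.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.sentiment (
    id_str string,
    text string,
    sentiment string
)
PARTITIONED BY (created_at string)
STORED AS PARQUET
LOCATION 's3://demo-inference-bucket/sentiment/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.created_at.type' = 'date',
    'projection.created_at.format' = 'yyyy-MM-dd',
    'projection.created_at.range' = '2020-01-01,NOW',
    'storage.location.template' = 's3://demo-inference-bucket/sentiment/created_at=${created_at}/'
)
"""

# With partition projection, Athena computes partitions from the table properties at
# query time, so no Glue CreatePartition calls are needed as new data arrives.
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"}
)
```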

NOTICE.txt (+3)

@@ -25,12 +25,15 @@ boto3 - Apache-2.0
 botocore - Apache-2.0
 chai - MIT license
 crhelper - Apache-2.0
+googleapis - Apache-2.0
 jest - MIT license
+jmespath - MIT License
 momentjs - MIT license
 moto - Apache-2.0
 newscatcher - MIT license
 nock - MIT license
 node - MIT license
+openpyx - MIT license
 pytest-cov - MIT license
 pytest - MIT license
 requests - Apache-2.0

README.md (+29 -27)

@@ -6,21 +6,21 @@ The solution automates digital asset (text and image) ingestion from twitter, RS
 
 The solution performs the following key features:
 
-- **Performs topic modeling to detect dominant topics**: identifies the terms that collectively form a topic from within customer feedback
-- **Identifies the sentiment of what customers are saying**: uses contextual semantic search to understand the nature of online discussions
-- **Determines if images associated with your brand contain unsafe content**: detects unsafe and negative imagery in content
-- **Helps customers identify insights in near real-time**: you can use a visualization dashboard to better understand context, threats, and opportunities almost instantly
+- **Performs topic modeling to detect dominant topics**: identifies the terms that collectively form a topic from within customer feedback
+- **Identifies the sentiment of what customers are saying**: uses contextual semantic search to understand the nature of online discussions
+- **Determines if images associated with your brand contain unsafe content**: detects unsafe and negative imagery in content
+- **Helps customers identify insights in near real-time**: you can use a visualization dashboard to better understand context, threats, and opportunities almost instantly
 
 This solution deploys an AWS CloudFormation template that supports Twitter, RSS feeds, and YouTube comments as data source options for ingestion, but the solution can be customized to aggregate other social media platforms and internal enterprise systems.
 
 For a detailed solution deployment guide, refer to [Discovering Hot Topics using Machine Learning](https://aws.amazon.com/solutions/implementations/discovering-hot-topics-using-machine-learning)
 
 ## On this Page
 
-- [Architecture Overview](#architecture-overview)
-- [Deployment](#deployment)
-- [Source Code](#source-code)
-- [Creating a custom build](#creating-a-custom-build)
+- [Architecture Overview](#architecture-overview)
+- [Deployment](#deployment)
+- [Source Code](#source-code)
+- [Creating a custom build](#creating-a-custom-build)
 
 ## Architecture Overview
 
@@ -54,13 +54,13 @@ After you deploy the solution, use the included Amazon QuickSight dashboard to v
 
 [AWS CDK Solutions Constructs](https://aws.amazon.com/solutions/constructs/) make it easier to consistently create well-architected applications. All AWS Solutions Constructs are reviewed by AWS and use best practices established by the AWS Well-Architected Framework. This solution uses the following AWS CDK Constructs:
 
-- aws-events-rule-lambda
-- aws-kinesisfirehose-s3
-- aws-kinesisstreams-lambda
-- aws-lambda-dynamodb
-- aws-lambda-s3
-- aws-lambda-step-function
-- aws-sqs-lambda
+- aws-events-rule-lambda
+- aws-kinesisfirehose-s3
+- aws-kinesisstreams-lambda
+- aws-lambda-dynamodb
+- aws-lambda-s3
+- aws-lambda-step-function
+- aws-sqs-lambda
 
 ## Deployment
 
@@ -78,12 +78,12 @@ The solution is deployed using a CloudFormation template with a lambda backed cu
 ├── bin [entrypoint of the CDK application]
 ├── lambda [folder containing source code the lambda functions]
 │   ├── capture_news_feed [lambda function to ingest news feeds]
-│   ├── create-partition [lambda function to create glue partitions]
 │   ├── firehose_topic_proxy [lambda function to write topic analysis output to Amazon Kinesis Firehose]
 │   ├── firehose-text-proxy [lambda function to write text analysis output to Amazon Kinesis Firehose]
-│   ├── ingestion-consumer [lambda function that consumes messages from Amazon Kinesis Data Stream]
+│   ├── ingestion-consumer [lambda function that consumes messages from Amazon Kinesis Data Streams]
+│   ├── ingestion-custom [lambda function that reads files from Amazon S3 bucket and pushes data to Amazon Kinesis Data Streams]
 │   ├── ingestion-producer [lambda function that makes Twitter API call and pushes data to Amazon Kinesis Data Stream]
-│   ├── ingestion-youtube [lambda function that ingests comments from YouTube videos and pushes data to Amazon Kinesis Data Stream]
+│   ├── ingestion-youtube [lambda function that ingests comments from YouTube videos and pushes data to Amazon Kinesis Data Streams]
 │   ├── integration [lambda function that publishes inference outputs to Amazon Events Bridge]
 │   ├── layers [lambda layer function library for Node and Python layers]
 │   │   ├── aws-nodesdk-custom-config
@@ -106,10 +106,12 @@ The solution is deployed using a CloudFormation template with a lambda backed cu
 │   ├── ingestion [CDK constructs for data ingestion]
 │   ├── integration [CDK constructs for Amazon Events Bridge]
 │   ├── quicksight-custom-resources [CDK construct that invokes custom resources to create Amazon QuickSight resources]
+│   ├── s3-event-notification [CDK construct that configures S3 events to be pushed to Amazon EventBridge]
 │   ├── storage [CDK constructs that define storage of the inference events]
 │   ├── text-analysis-workflow [CDK constructs for text analysis of ingested data]
 │   ├── topic-analysis-workflow [CDK constructs for topic visualization of ingested data]
 │   └── visualization [CDK constructs to build a relational database model for visualization]
+├── discovering-hot-topics.ts
 ```
 
 ## Creating a custom build
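The tree above adds a new ingestion-custom function that reads uploaded files from an S3 bucket and pushes the records into Amazon Kinesis Data Streams. Its implementation is not shown in this commit view; the sketch below is a hypothetical version of that flow using boto3, with made-up bucket, key, and stream names, and it assumes newline-delimited JSON input rather than the JSON/XLSX/CSV handling the real function provides.

```
# Hypothetical sketch of an "uploaded S3 file -> Kinesis Data Streams" flow, in the
# spirit of the ingestion-custom function listed above. Names are made up and this is
# not the solution's actual implementation.
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")


def handler(event, context):
    # Assume the Lambda is triggered by an S3 event notification for the uploaded file.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Assume newline-delimited JSON for simplicity; the real function also accepts
        # XLSX and CSV uploads.
        for line in filter(None, body.splitlines()):
            item = json.loads(line)
            kinesis.put_record(
                StreamName="demo-ingestion-stream",
                Data=json.dumps(item).encode("utf-8"),
                PartitionKey=str(item.get("id", key))
            )
```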
@@ -124,22 +126,22 @@ Clone this git repository
 
 ### 2. Build the solution for deployment
 
-- To run the unit tests
+- To run the unit tests
 
 ```
 cd <rootDir>/source
 chmod +x ./run-all-tests.sh
 ./run-all-tests.sh
 ```
 
-- Configure the bucket name of your target Amazon S3 distribution bucket
+- Configure the bucket name of your target Amazon S3 distribution bucket
 
 ```
 export DIST_OUTPUT_BUCKET=my-bucket-name
 export VERSION=my-version
 ```
 
-- Now build the distributable:
+- Now build the distributable:
 
 ```
 cd <rootDir>/deployment
@@ -148,7 +150,7 @@ chmod +x ./build-s3-dist.sh
 
 ```
 
-- Parameter details
+- Parameter details
 
 ```
 $DIST_OUTPUT_BUCKET - This is the global name of the distribution. For the bucket name, the AWS Region is added to the global name (example: 'my-bucket-name-us-east-1') to create a regional bucket. The lambda artifact should be uploaded to the regional buckets for the CloudFormation template to pick it up for deployment.
@@ -158,13 +160,13 @@ $CF_TEMPLATE_BUCKET_NAME - The name of the S3 bucket where the CloudFormation te
 $QS_TEMPLATE_ACCOUNT - The account from which the Amazon QuickSight templates should be sourced for Amazon QuickSight Analysis and Dashboard creation
 ```
 
-- When creating and using buckets it is recommeded to:
+- When creating and using buckets it is recommeded to:
 
-  - Use randomized names or uuid as part of your bucket naming strategy.
-  - Ensure buckets are not public.
-  - Verify bucket ownership prior to uploading templates or code artifacts.
+  - Use randomized names or uuid as part of your bucket naming strategy.
+  - Ensure buckets are not public.
+  - Verify bucket ownership prior to uploading templates or code artifacts.
 
-- Deploy the distributable to an Amazon S3 bucket in your account. _Note:_ you must have the AWS Command Line Interface installed.
+- Deploy the distributable to an Amazon S3 bucket in your account. _Note:_ you must have the AWS Command Line Interface installed.
 
 ```
 aws s3 cp ./global-s3-assets/ s3://my-bucket-name-<aws_region>/discovering-hot-topics-using-machine-learning/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
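The README's bucket recommendations above (randomized names, non-public buckets, ownership verification before uploading) can also be scripted. The snippet below is one hedged way to do that with boto3; the account ID, region, bucket prefix, and artifact path are placeholders and this is not part of the solution's build tooling.

```
# Illustrative only: create a randomized, non-public staging bucket and verify
# ownership before uploading build artifacts. Account ID, region, bucket prefix and
# artifact path are placeholders, not values used by the solution's build scripts.
import uuid
import boto3

REGION = "us-east-1"
EXPECTED_OWNER = "111122223333"  # your AWS account ID

s3 = boto3.client("s3", region_name=REGION)

# Randomized bucket name, as recommended above.
bucket = f"my-dist-{uuid.uuid4().hex[:12]}-{REGION}"
s3.create_bucket(Bucket=bucket)  # us-east-1 needs no CreateBucketConfiguration

# Ensure the bucket is not public.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True
    }
)

# Fail fast if the bucket is not owned by the expected account before uploading.
s3.head_bucket(Bucket=bucket, ExpectedBucketOwner=EXPECTED_OWNER)

with open("global-s3-assets/example.template", "rb") as f:  # placeholder artifact
    s3.put_object(
        Bucket=bucket,
        Key="discovering-hot-topics-using-machine-learning/my-version/example.template",
        Body=f,
        ExpectedBucketOwner=EXPECTED_OWNER
    )
```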

source/.eslintignore (new file, +12)

@@ -0,0 +1,12 @@
+# don't ever lint node_modules
+node_modules
+cdk.out
+# don't lint build output (make sure it's set to your correct build folder name)
+dist
+# don't lint test folders
+test
+# don't lint coverage output
+coverage
+*.config.js
+# dont lint eslint config
+*.eslint*

source/.gitignore (new file, +18)

@@ -0,0 +1,18 @@
+*.js
+!jest.config.js
+*.d.ts
+node_modules
+
+# CDK asset staging directory
+.cdk.staging
+cdk.out
+
+# Parcel build directories
+.cache
+.build
+
+# JS files in the lambda folder should not be ignored
+!lambda/**/*.js
+
+# Coverage reports
+test/coverage-reports

source/.prettierignore (new file, +1)

@@ -0,0 +1 @@
+node_modules

source/.prettierrc.yml (new file, +8)

@@ -0,0 +1,8 @@
+# .prettierrc or .prettierrc.yaml
+proseWrap: 'preserve'
+trailingComma: 'none'
+tabWidth: 4
+semi: true
+singleQuote: true
+quoteProps: 'preserve'
+printWidth: 120

source/images/architecture.png (binary file, 635 KB)

source/lambda/.eslintrc.js (new file, +14)

@@ -0,0 +1,14 @@
+module.exports = {
+    root: true,
+    parserOptions: {
+        ecmaVersion: 2021
+    },
+    env: {
+        node: true
+    },
+    extends: ['eslint:recommended'],
+    rules: {
+        indent: ['error', 4],
+        quotes: ['warn', 'single']
+    }
+};
(test requirements file; file name not shown in this view)

@@ -1,5 +1,6 @@
-moto==2.2.2
-pytest==6.2.4
-pytest-cov==2.12.1
+moto==2.3.1
+pytest==6.2.5
+pytest-cov==3.0.0
 botocore
-mock==4.0.3
+mock==4.0.3
+responses==0.16.0
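The updated test dependencies above add the `responses` package, which stubs HTTP calls made through `requests`. A minimal, hypothetical usage sketch follows; it is not taken from the solution's test suite, and the feed URL and payload are made up.

```
# Minimal, hypothetical example of the `responses` package added above; not copied
# from the solution's tests. The feed URL and payload are invented.
import requests
import responses


@responses.activate
def test_fetch_feed_is_stubbed():
    # Register a fake endpoint so no real network call is made.
    responses.add(
        responses.GET,
        "https://example.com/feed.json",
        json={"items": [{"title": "hello"}]},
        status=200
    )

    resp = requests.get("https://example.com/feed.json")

    assert resp.status_code == 200
    assert resp.json()["items"][0]["title"] == "hello"
    assert len(responses.calls) == 1
```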

source/lambda/capture_news_feed/test/test_stream_helper.py (+3 -3)

@@ -15,11 +15,11 @@
 import json
 import os
 import unittest
-from datetime import datetime
+from datetime import datetime, timezone
 
 from moto import mock_kinesis
 from shared_util.service_helper import get_service_client
-from util.stream_helper import buffer_data_into_stream
+from shared_util.stream_helper import buffer_data_into_stream
 
 
 @mock_kinesis
@@ -48,7 +48,7 @@ def test_buffer_data_into_stream(self):
     "platform": "fakeplatform",
     "search_query": "query_str",
     "feed": {
-        "created_at": datetime.now().timestamp(),
+        "created_at": datetime.now(timezone.utc).timestamp(),
         "id": "fakeid",
         "id_str": "fakeid",
         "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
