
accelerated-data-lake's Introduction

AWS Accelerated Data Lake (3x3x3)

A packaged Data Lake solution that builds a highly functional Data Lake, with a data catalog queryable via Elasticsearch.

License

This library is licensed under the Apache 2.0 License.

3x3x3 DataLake installation instructions

These are the steps required to provision the 3x3x3 Packaged Datalake Solution and watch the ingress of data.

  • Provision the Data Lake Structure (5 minutes)
  • Provision the Visualisation
    • Provision Elasticsearch (15 minutes)
    • Provision Visualisation Lambdas (5 minutes)
  • Provision the Staging Engine and add a trigger (10 minutes)
  • Configure a sample data source and add data (5 minutes)
    • Configure the sample data source
    • Ingress a sample file for the new data source
    • Ingress a sample file that has an incorrect schema
  • Initialise and use Elasticsearch / Kibana (5 minutes)

Once these steps are complete, the provided sample data file (or any file matching the sample data source criteria) can be dropped in the DataLake's raw bucket and will be immediately staged and recorded in the data catalog.

If a file dropped into the raw bucket fails staging (for example, it has an incorrect schema or an unrecognised file name structure), it will be moved to the failed bucket and recorded in the data catalog.

Whether staging is successful or not, the file's ingress details can be seen in the datalake's elasticsearch / kibana service.

NOTE: There are clearly improvements that can be made to this documentation and to the CloudFormation templates. These are being actioned now.

1. Provisioning the Data Lake Structure

This section creates the basic structure of the datalake, primarily the S3 buckets and the DynamoDB tables.

Execution steps:

  • Go to the CloudFormation section of the AWS Console.
  • Think of an environment prefix for your datalake. This prefix will make your S3 buckets globally unique (so it must be lower case) and will help identify your datalake components if multiple datalakes share an account (not recommended - the number of resources will lead to confusion and potential security holes). Ideally the prefix should contain the datalake owner / service and the environment - a good example is: wildrydes-dev-
  • Create a new stack using the template /DataLakeStructure/dataLakeStructure.yaml
  • Enter the stack name. For example: wildrydes-dev-datalake-structure
  • Enter the environment prefix, in this case: wildrydes-dev-
  • Add a KMS Key ARN if you want your S3 Buckets encrypted (recommended; further improvements with other encryption options are imminent in this area).
  • All other options are self-explanatory, and the defaults are acceptable when testing the solution.
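If you want to confirm the stack created its resources before moving on, a quick check with boto3 is shown below (a sketch only - it simply lists the buckets and DynamoDB tables that carry your prefix):

import boto3

prefix = "wildrydes-dev-"

# The structure stack creates S3 buckets (raw, staging, failed, etc.) and
# DynamoDB tables (dataSources, dataCatalog) that all carry the prefix.
s3 = boto3.client("s3")
buckets = [b["Name"] for b in s3.list_buckets()["Buckets"] if b["Name"].startswith(prefix)]
print("Buckets:", buckets)

dynamodb = boto3.client("dynamodb")
tables = [t for t in dynamodb.list_tables()["TableNames"] if t.startswith(prefix)]
print("Tables:", tables)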

2. Provisioning the Visualisations

This step is optional, but highly recommended. Even if the customer does not want Elasticsearch long term, a temporary cluster will allow debugging while the datalake is set up and will illustrate its value.

If you want elasticsearch visualisation, both of the following steps are required (in order).

2.1 Provision Elasticsearch

This step creates the Elasticsearch cluster the datalake will use. For development environments, a single T2 instance is acceptable. For production, standard Elasticsearch best practices apply.

NOTE: Elasticsearch is a very easy service to over-provision. To assist in determining the correct resources, the cloudformation template includes a CloudWatch dashboard to display all relevant cluster metrics.

Execution steps:

  • Go to the CloudFormation section of the AWS Console.
  • Create a new stack using the template /Visualisation/elasticsearch/elasticsearch.yaml
  • Enter the stack name. For example: wildrydes-dev-datalake-elasticsearch
  • Enter the environment prefix, in this case: wildrydes-dev-
  • Enter your IP addresses as comma-separated CIDR ranges, in this case: <your_ip/32>,<another_cidr>
  • Change the other parameters as per your requirements / best practice. The default values will provision a single t2.medium node - this is adequate for low-TPS development and testing.
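The cluster takes a while to come up. If you prefer to poll its status from a script rather than the console, a small boto3 sketch follows (the domain name below is a placeholder - substitute whatever name the stack gave your domain):

import boto3

es = boto3.client("es")
# "wildrydes-dev-datalake" is a placeholder - use the domain name created by the stack.
status = es.describe_elasticsearch_domain(DomainName="wildrydes-dev-datalake")["DomainStatus"]
print("Still processing:", status["Processing"])
print("Endpoint:", status.get("Endpoint", "not yet available"))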

2.2 Provision the Visualisation Lambdas

This step creates a lambda which is triggered by changes to the data catalog DynamoDB table. The lambda takes the changes and sends them to the elasticsearch cluster created above.
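The lambda itself ships with the repository, but as a rough illustration of the pattern it implements (a simplified sketch, not the actual code - the endpoint, index name and document flattening below are placeholders), a DynamoDB-stream-to-Elasticsearch forwarder looks roughly like this:

import boto3
import requests
from requests_aws4auth import AWS4Auth

# Placeholders - the real values come from your environment / stack outputs.
ES_ENDPOINT = "https://<your-elasticsearch-endpoint>"
INDEX = "wildrydes-dev-datacatalog"

session = boto3.Session()
creds = session.get_credentials()
# Sign requests to the Elasticsearch domain with the lambda's IAM credentials
# (region is taken from the lambda's environment).
auth = AWS4Auth(creds.access_key, creds.secret_key, session.region_name, "es",
                session_token=creds.token)

def handler(event, context):
    # Each record describes an insert/update on the dataCatalog DynamoDB table.
    for record in event["Records"]:
        image = record.get("dynamodb", {}).get("NewImage")
        if not image:
            continue
        # Naive flattening of the DynamoDB attribute-value map; real catalog
        # items may contain nested or non-string attributes.
        doc = {name: next(iter(value.values())) for name, value in image.items()}
        requests.put(f"{ES_ENDPOINT}/{INDEX}/_doc/{record['eventID']}",
                     auth=auth, json=doc,
                     headers={"Content-Type": "application/json"})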

Execution steps:

  • Create a data lake IAM user, with CLI access.
  • Configure the AWS CLI with the user's access key and secret access key.
  • Install AWS SAM.
  • Open a terminal / command line and move to the Visualisation/lambdas/ folder
  • Package and deploy the lambda functions. There are two ways to deploy:
    • Execute the ./deploy.sh <environment_prefix> script OR
    • Execute the AWS SAM package and deploy commands detailed in: deploy.txt

For this example, the commands should be:

sam package --template-file ./lambdaDeploy.yaml --output-template-file lambdaDeployCFN.yaml --s3-bucket wildrydes-dev-visualisationcodepackages

sam deploy --template-file lambdaDeployCFN.yaml --stack-name wildrydes-dev-datalake-elasticsearch-lambdas --capabilities CAPABILITY_IAM --parameter-overrides EnvironmentPrefix=wildrydes-dev-

3. Provision the Staging Engine and add a trigger

This is the workhorse of 3x3x3 - it creates lambdas and a step function, that takes new files dropped into the raw bucket, verifies their source and schema, applies tags and metadata, then copies the file to the staging bucket.

On both success and failure, 3x3x3 updates the DataCatalog table in DynamoDB. All changes to this table are sent to elasticsearch, allowing users to see the full history of all imports and see what input files were used in each DataLake query.
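As a very rough idea of what ends up in the catalog, a sketch of recording one ingress outcome is shown below. The attribute names are purely illustrative and the real item structure and key schema are defined by the staging engine lambdas, so treat this as a sketch only:

import datetime
import boto3

catalog = boto3.resource("dynamodb").Table("wildrydes-dev-dataCatalog")

# Illustrative attributes only - not the engine's actual item schema.
catalog.put_item(Item={
    "rawKey": "rydebookings/rydebooking-1234567890.json",
    "dataSource": "rydebookings",
    "stagingKey": "rydebookings/2018/10/26/rydebooking-1234567890.json",
    "status": "STAGED",
    "processedAt": datetime.datetime.utcnow().isoformat(),
})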

Execution steps (the first three steps can be skipped if they were already completed during the visualisation step):

  • Create a data lake IAM user, with CLI access.
  • Configure the AWS CLI with the user's access key and secret access key.
  • Install AWS SAM.
  • Open a terminal / command line and move to the StagingEngine/ folder
  • Package and deploy the lambda functions. There are two ways to deploy:
    • Execute the ./deploy.sh <environment_prefix> script OR
    • Execute the AWS SAM package and deploy commands detailed in: deploy.txt

For this example, the commands should be:

sam package --template-file ./stagingEngine.yaml --output-template-file stagingEngineDeploy.yaml --s3-bucket wildrydes-dev-stagingenginecodepackages

sam deploy --template-file stagingEngineDeploy.yaml --stack-name wildrydes-dev-datalake-staging-engine --capabilities CAPABILITY_IAM --parameter-overrides EnvironmentPrefix=wildrydes-dev-

3.1 Add the Staging trigger

CloudFormation cannot attach S3 event triggers to existing buckets, so the Staging Engine's StartFileProcessing lambda must have its S3 PUT trigger attached manually.

Execution steps:

  • Go into the AWS Console, Lambda screen.
  • Find the lambda named: <ENVIRONMENT_PREFIX>datalake-staging-StartFileProcessing-<RANDOM CHARS ADDED BY SAM>
  • Manually add an S3 trigger, generated from PUT events on the RAW bucket you created (in this example, this would be wildrydes-dev-raw)

NOTE: Do not use "Object Created (All)" as a trigger - 3x3x3 copies new files when it adds their metadata, so a trigger on All will cause the staging process to begin again after the copy.
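If you prefer to wire the trigger up from a script rather than the console, a boto3 sketch is shown below. The function name is the placeholder pattern from the step above, the statement id is arbitrary, and note that put_bucket_notification_configuration replaces any notification configuration the bucket already has:

import boto3

# Placeholder - use the full name SAM generated for the StartFileProcessing lambda.
function_name = "<ENVIRONMENT_PREFIX>datalake-staging-StartFileProcessing-<RANDOM CHARS>"
raw_bucket = "wildrydes-dev-raw"
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Allow S3 to invoke the lambda.
lambda_client = boto3.client("lambda")
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId="raw-bucket-put-trigger",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{raw_bucket}",
    SourceAccount=account_id,
)

# Attach a PUT-only notification - not ObjectCreated:*, so the staging engine's
# own copy does not re-trigger the process.
function_arn = lambda_client.get_function(FunctionName=function_name)["Configuration"]["FunctionArn"]
boto3.client("s3").put_bucket_notification_configuration(
    Bucket=raw_bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:Put"],
        }]
    },
)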

Congratulations! 3x3x3 is now fully provisioned! Now let's configure a datasource and add some data.

4. Configure a sample data source and add data

4.1 Configure the sample data source

Execution steps:

  • Open the file DataSources/RydeBookings/ddbDataSourceConfig.json
  • Copy the file's contents to the clipboard.
  • Go into the AWS Console, DynamoDB screen.
  • Open the DataSource table, which for the environment prefix used in this demonstration will be: wildrydes-dev-dataSources
  • Go to the Items tab, click Create Item, switch to 'Text' view and paste in the contents of the ddbDataSourceConfig.json file.
  • Save the item.

You now have a fully configured DataSource. The individual config attributes will be explained in the next version of this documentation.
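As an alternative to pasting into the console, the same item can be written with boto3. This assumes ddbDataSourceConfig.json holds a DynamoDB attribute-value map (the format the console's Text view accepts); if it does not, the call will need adjusting:

import json
import boto3

with open("DataSources/RydeBookings/ddbDataSourceConfig.json") as f:
    item = json.load(f)

# Assumes the file is a DynamoDB attribute-value map, e.g. {"name": {"S": "..."}}.
boto3.client("dynamodb").put_item(
    TableName="wildrydes-dev-dataSources",
    Item=item,
)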

4.2 Ingress a sample file for the new data source

Execution steps:

  • Go into the AWS Console, S3 screen, open the raw bucket (wildrydes-dev-raw in this example)
  • Create a folder "rydebookings" - the data source is configured to expect its data to be ingressed into a folder with this name (just use the bucket's default settings for the new folder).
  • Using the console, upload the file DataSources/RydeBookings/rydebooking-1234567890.json into this folder.
  • Confirm the file has appeared in the staging folder, with a path similar to: wildrydes-dev-staging/rydebookings/2018/10/26/rydebooking-1234567890.json

If the file is not in the staging folder, one of the earlier steps has been executed incorrectly.
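The same ingress can also be scripted. A small boto3 sketch, using the bucket names from this example:

import boto3

s3 = boto3.client("s3")

# Drop the sample file into the raw bucket, under the data source's folder.
s3.upload_file(
    "DataSources/RydeBookings/rydebooking-1234567890.json",
    "wildrydes-dev-raw",
    "rydebookings/rydebooking-1234567890.json",
)

# Once the staging engine has run, the file should appear under a dated path
# in the staging bucket.
response = s3.list_objects_v2(Bucket="wildrydes-dev-staging", Prefix="rydebookings/")
for obj in response.get("Contents", []):
    print(obj["Key"])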

4.3 Optional. Ingress a sample file that has an incorrect schema

Execution steps:

  • Go into the AWS Console, S3 screen, open the raw bucket (wildrydes-dev-raw in this example)
  • Create a folder "rydebookings" if it does not already exist.
  • Using the console, upload the file DataSources/RydeBookings/rydebooking-2000000000.json into this folder.
  • Confirm the file has appeared in the failed folder, with a path similar to: wildrydes-dev-failed/rydebookings/rydebooking-2000000000.json

If the file is not in the failed folder, one of the earlier steps has been executed incorrectly.

5. Initialise and use Elasticsearch / Kibana

The above steps will have resulted in the rydebookings files being entered into the DynamoDB dataCatalog table (wildrydes-dev-dataCatalog). The visualisation steps subscribed to this table's stream, and all updates are now sent to Elasticsearch.

The data will already be in Elasticsearch; we just need to create a suitable index pattern.

Execution steps:

  • Go to the Kibana URL (found in the AWS Console, under Elasticsearch).
  • You will see there is no data - this is because a Kibana index pattern needs to be created (the data is already present in the Elasticsearch index, which Kibana will detect).
  • Click on the management tab, on the left.
  • Click "Index Patterns"
  • Paste in: wildrydes-dev-datacatalog (so <ENVIRONMENT_PREFIX>datacatalog). You will see this name in the available index patterns at the base of the screen.
  • Click "Next step"
  • Select @Timestamp in the "Time Filter field name" field - this is very important, otherwise you will not get the excellent kibana timeline.
  • Click "Create Index Pattern" and the index will be created. Click on the Discover tab to see your data catalog and details of your failed and successful ingress.

accelerated-data-lake's People

Contributors

cchew, grusy, jpeddicord, paulmacey1, sukenshah, tpbrogan


accelerated-data-lake's Issues

Acceptable error threshold [Enhancement]

What is an error threshold?

The accelerated data lake framework follows a strict validation policy: a data file is considered failed if there is even a single validation error. An error threshold is a mechanism to allow an acceptable number of validation errors per data file.

Does the error threshold apply to all validations?

The error threshold only applies to semantic validation rules, e.g. number ranges, enumeration values, etc. The data file is rejected outright if there are syntactic errors.

How is the error threshold configured?

The error threshold can be configured via the data source config file. It is represented as a percentage of allowed errors relative to the total number of records in a file.

When do I need an error threshold?

An error threshold can be useful when dealing with low-quality data files. With an error threshold, customers can still build the data lake without having to correct the errors in each file.

cc @paulmacey1 @tpbrogan
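A minimal sketch of how such a check might work (hypothetical - this is a proposed enhancement, and the names below are illustrative only):

def within_error_threshold(record_count, semantic_error_count, threshold_percent):
    # Accept the file if the share of records with semantic errors is within the threshold.
    if record_count == 0:
        return semantic_error_count == 0
    return (semantic_error_count / record_count) * 100 <= threshold_percent

# Example: 3 bad records out of 200 is 1.5%, so a 2% threshold accepts the file.
print(within_error_threshold(200, 3, 2.0))  # True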

Sample cannot be queried through Athena

Whilst the sample "rydebooking-1234567890.json" is successfully ingested and staged, it cannot be queried by Athena. Athena does not support multi-line JSON. Attempting to query results in an error:

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

Stripping the newlines from the input file allows you to query via Athena.

Suggest adding a further step at the end of the instructions for querying via Athena.
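Until such a step exists, one workaround is to rewrite the sample as single-line JSON before ingress, for example:

import json

# Athena's JSON SerDe expects one JSON object per line, so rewrite the
# multi-line sample onto a single line.
with open("DataSources/RydeBookings/rydebooking-1234567890.json") as src:
    record = json.load(src)

with open("rydebooking-1234567890.single-line.json", "w") as dst:
    dst.write(json.dumps(record, separators=(",", ":")) + "\n")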

Visualisation Template File

When generating the Visualisation Lambdas with sam package --template-file ./lambdaDeploy.yaml --output-template-file lambdaDeployCFN.yaml,

the template file created has an incorrect AWSTemplateFormatVersion line at the end of the file and won't deploy:

"\xEF\xBB\xBFAWSTemplateFormatVersion": '2010-09-09')

When changed to

"AWSTemplateFormatVersion": '2010-09-09')

I was able to deploy. This was also happening for the Staging Engine, but when I downloaded the latest version yesterday the Staging Engine template didn't have that problem.
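If you hit the same problem, the \xEF\xBB\xBF prefix is a UTF-8 byte-order mark; a quick sketch that strips it from the generated template before deploying:

# "\ufeff" is the decoded form of the EF BB BF byte sequence (UTF-8 BOM).
with open("lambdaDeployCFN.yaml", encoding="utf-8") as src:
    content = src.read()

with open("lambdaDeployCFN.yaml", "w", encoding="utf-8") as dst:
    dst.write(content.replace("\ufeff", ""))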

Multi Part Causing Loop

I have been working on implementing the accelerated data lake and was having an issue with larger files that were using the multipart upload.

Once the process hit the staging-engine-AttachTagsAndMetaDataToFi Lambda step, it would start the whole process again for the same file, over and over. If I disabled multipart upload through the AWS CLI and uploaded a large file, the process went through fine. After some testing I was able to narrow it down to this section in the Lambda:

s3.copy(
    copy_source, bucket, key,
    ExtraArgs={"Metadata": metadata, "MetadataDirective": "REPLACE"}
)

Changing it to this stopped the looping problem:

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource=copy_source,
    Metadata=metadata,
    MetadataDirective="REPLACE"
)

Updating existing file meta data tags

Hi,

I saw in the presentation describing the accelerated data lake that updating the DataSource in DynamoDB should work retrospectively on existing data, e.g. if I add a metadata tag, all existing files would be updated to contain this metadata tag. I don't seem to be able to get this to work.

Is this actually a feature of the accelerated data lake?
Is it possible to update all existing files with new or updated metadata?

Thanks in advance

Update to default botocore impacting streaming to ElasticSearch

An update to the botocore version inside Lambda prevents the ElasticSearch streaming function from working. You will see an error in the Cloudwatch logs similar to "cannot import name 'BotocoreHTTPSession'"

This can be fixed by adding a Lambda Layer specific to the region the function is running in:

ap-northeast-1: arn:aws:lambda:ap-northeast-1:249908578461:layer:AWSLambda-Python-AWS-SDK:1
us-east-1: arn:aws:lambda:us-east-1:668099181075:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-1: arn:aws:lambda:ap-southeast-1:468957933125:layer:AWSLambda-Python-AWS-SDK:1
eu-west-1: arn:aws:lambda:eu-west-1:399891621064:layer:AWSLambda-Python-AWS-SDK:1
us-west-1: arn:aws:lambda:us-west-1:325793726646:layer:AWSLambda-Python-AWS-SDK:1
ap-east-1: arn:aws:lambda:ap-east-1:118857876118:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-2: arn:aws:lambda:ap-northeast-2:296580773974:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-3: arn:aws:lambda:ap-northeast-3:961244031340:layer:AWSLambda-Python-AWS-SDK:1
ap-south-1: arn:aws:lambda:ap-south-1:631267018583:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-2: arn:aws:lambda:ap-southeast-2:817496625479:layer:AWSLambda-Python-AWS-SDK:1
ca-central-1: arn:aws:lambda:ca-central-1:778625758767:layer:AWSLambda-Python-AWS-SDK:1
eu-central-1: arn:aws:lambda:eu-central-1:292169987271:layer:AWSLambda-Python-AWS-SDK:1
eu-north-1: arn:aws:lambda:eu-north-1:642425348156:layer:AWSLambda-Python-AWS-SDK:1
eu-west-2: arn:aws:lambda:eu-west-2:142628438157:layer:AWSLambda-Python-AWS-SDK:1
eu-west-3: arn:aws:lambda:eu-west-3:959311844005:layer:AWSLambda-Python-AWS-SDK:1
sa-east-1: arn:aws:lambda:sa-east-1:640010853179:layer:AWSLambda-Python-AWS-SDK:1
us-east-2: arn:aws:lambda:us-east-2:259788987135:layer:AWSLambda-Python-AWS-SDK:1
us-west-2: arn:aws:lambda:us-west-2:420165488524:layer:AWSLambda-Python-AWS-SDK:1
cn-north-1: arn:aws-cn:lambda:cn-north-1:683298794825:layer:AWSLambda-Python-AWS-SDK:1
cn-northwest-1: arn:aws-cn:lambda:cn-northwest-1:382066503313:layer:AWSLambda-Python-AWS-SDK:1
us-gov-west-1: arn:aws-us-gov:lambda:us-gov-west-1:556739011827:layer:AWSLambda-Python-AWS-SDK:1
us-gov-east-1: arn:aws-us-gov:lambda:us-gov-east-1:138526772879:layer:AWSLambda-Python-AWS-SDK:1
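The layer can be attached in the Lambda console, or with a short boto3 sketch like the one below (the function name is a placeholder, and note that this call replaces the function's existing layer list):

import boto3

# Placeholder - use the actual name of the Elasticsearch streaming function deployed by SAM,
# and the layer ARN for your region from the list above.
function_name = "<your-elasticsearch-streaming-function>"
layer_arn = "arn:aws:lambda:eu-west-1:399891621064:layer:AWSLambda-Python-AWS-SDK:1"

# Replaces the function's Layers list - include any other layers it already uses.
boto3.client("lambda").update_function_configuration(
    FunctionName=function_name,
    Layers=[layer_arn],
)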
