Notebook Automation in the Cloud
Purpose: In this post, we discuss how to set up a custom Python environment in AWS with the specific permissions needed to run Jupyter notebooks in an automated, headless manner on a compute instance and schedule of your choosing. In my experience, this is a very flexible, cost-efficient, and reliable way to automate repetitive notebook runs.
This post assumes:
- Basic familiarity with cloud setup
- You already have a working notebook that you want to automate; notebook development and testing precede this setup.
Without further ado, let’s dive in!
Term Glossary
| Term | Definition |
|---|---|
| EventBridge | AWS service for scheduling events and triggers |
| ECR | Elastic Container Registry, AWS service for storing the container images built from your Dockerfile |
| AWS Lambda | Serverless compute service; here it runs a small Python function in the cloud that launches the notebook job |
| SageMaker | AWS machine learning platform; hosts Jupyter notebooks and runs Processing Jobs |
| DLQ | Dead-letter queue, which captures events that failed to be delivered or processed |
Contents
- Background
- Birds Eye View
- Setup
- Parameterize & Test Notebook
- Schedule Notebook
- Bonus: Debugging
- Bonus: Updates
- Conclusion
Background
The beautiful simplicity of UNIX cron jobs: running something that just works in the background without having to think about it. Can we do something like that for our notebooks in the cloud?
This post covers the steps needed to set up scheduled Jupyter notebook runs in the cloud, on hardware and a schedule of your choosing. Why? It can be useful for several reasons:
- Generating a report based on a daily data feed, possibly with some parameters
- Retraining a shadow-mode model every day and getting the performance report by email
- Automating a repetitive analysis
Having a reliable setup in the cloud means you don’t have to keep your machine running or worry about the network or something going wrong with your computer. And because this setup runs your notebook as a SageMaker Processing Job, you only pay for the time the notebook actually runs. Let’s see how it looks!
The entire setup shouldn’t take more than 30 minutes.
Birds Eye View
The system uses EventBridge to trigger a Lambda function, which spins up a SageMaker Processing Job using a container image stored in ECR.
Setup
This is typically done in a terminal in SageMaker Studio or on a running instance.
Configure Run Environment
There’s a handy library called `sagemaker-run-notebook` that helps you set this up. Please clone it and then make the specific changes you need for your project. I had a custom requirement for my project, so I made modifications to the original library.
Check these commits for reference.
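For reference, getting a local copy of the upstream library (the aws-samples repository) looks like this:

```bash
# Clone the upstream library, then apply your project-specific changes on top
git clone https://github.com/aws-samples/sagemaker-run-notebook.git
cd sagemaker-run-notebook
```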
Add Permissions
Refer to the note on permissions. The package creates a very basic, minimal role. If your notebook accesses databases, Glue connectors, etc., you need to attach the corresponding policies to this role. Please configure the policies accordingly.
I would consider adding `AWSLambdaBasicExecutionRole` to the permissions here. I found that the author set it up without CloudWatch logging permissions, which caused issues for me while debugging.
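As a sketch, attaching that extra managed policy with the AWS CLI might look like this; the role name below is a placeholder, so substitute whatever role the package created for you:

```bash
# Attach CloudWatch logging permissions to the notebook execution role.
# "ExecuteNotebookRole" is a placeholder: use the role name the package created.
aws iam attach-role-policy \
  --role-name ExecuteNotebookRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```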
Another thing to consider is attaching a DLQ to the EventBridge target so you can capture and debug failed triggers.
Create AWS Setup
Once you have tweaked the library code to your liking, run `pip install .` in the library folder to install it. Then run the following to set everything up on your AWS account:
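Per the library’s documented CLI (your commands may differ if you modified the code), the setup is roughly:

```bash
# Create the Lambda function and execution roles via the bundled CloudFormation template
run-notebook create-infrastructure

# Build the notebook execution container and push it to ECR;
# --requirements bakes your extra Python packages into the image
run-notebook create-container --requirements requirements.txt
```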
Parameterize & Test Notebook
Now you can test the notebook! Run the command below and check the CloudWatch logs to verify everything ran smoothly. The cool thing is that you can specify parameters for each schedule that the notebook should use at runtime. This library uses Papermill under the hood, so it is very important that the notebook has one cell containing your parameters, with the tag `parameters` added to it. See the steps on how to do that here. Once you do this, the value provided via `-p "python_variable=value"` below will be substituted into the tagged cell at runtime.
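A one-off test run with the library’s CLI might look like this; `my_notebook.ipynb` and the parameter are placeholders for your own:

```bash
# Execute the notebook once as a SageMaker Processing Job;
# -p overrides the variable in the cell tagged "parameters"
run-notebook run my_notebook.ipynb -p "python_variable=value"
```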
Schedule Notebook
If everything went smoothly, you can now schedule your notebook. Please refer to the EventBridge cron expression reference.
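A sketch of scheduling a nightly run, with the schedule name, notebook, and parameter again as placeholders:

```bash
# Run every day at 1:15 AM UTC; EventBridge cron expressions have six fields
run-notebook schedule --at "cron(15 1 * * ? *)" --name nightly \
    my_notebook.ipynb -p "python_variable=value"
```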
To unschedule:
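Again assuming the upstream CLI and the placeholder name from above:

```bash
# Remove the EventBridge schedule
run-notebook unschedule nightly
```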
Bonus: Debugging
If something went wrong, you can download the exact output notebook from the failed run and inspect where it stopped.
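Assuming the upstream CLI, you can list recent runs and then pull down the executed notebook, which contains the traceback at the failing cell:

```bash
# Find the failed run, then download its output notebook
run-notebook list-runs
run-notebook download <run-name>
```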
Bonus: Updates
If you change your notebook code and want to reschedule it, simply unschedule and schedule again with the same `schedule_name`:
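Sketching that with the CLI, with the same placeholders as before:

```bash
# Rescheduling re-uploads the updated notebook under the same schedule name
run-notebook unschedule schedule_name
run-notebook schedule --at "cron(15 1 * * ? *)" --name schedule_name \
    my_notebook.ipynb -p "python_variable=value"
```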
Conclusion
We learned how to set up a custom Python environment in AWS with the specific permissions needed to run your Jupyter notebooks in an automated, headless manner on a compute instance and schedule of your choosing. I hope this setup unblocks you in automating your work and gives you a powerful framework for custom, frugal machine learning workflow automation. Happy experimenting!