
Custom Image Creation for Geospatial Analysis with Amazon SageMaker Distribution in Amazon SageMaker Studio

Introduction

Amazon SageMaker Studio is a comprehensive suite of fully managed integrated development environments (IDEs) for machine learning (ML) that provides data scientists and data engineers with a collaborative development environment. To effectively utilize geospatial data in ML and analytics workflows, access to the right tools and libraries is crucial. In this post, we’ll show you how to build and use custom container images tailored for geospatial analysis in SageMaker Studio by extending SageMaker Distribution with specialized geospatial libraries.

Geospatial data, such as satellite images, coordinate traces, and aerial maps enriched with attributes from other business and environmental datasets, is becoming increasingly available. It has valuable use cases in fields such as environmental monitoring, urban planning, agriculture, disaster response, transportation, and public health.

The following sections explain how to build, deploy, and use a custom container image that extends SageMaker Distribution. You install additional dependencies on top of the base image, publish the resulting image, and then access it from a SageMaker Studio domain.

Solution Overview

The solution comprises the following steps:

  1. Create a Dockerfile that includes the additional Python libraries and tools.
  2. Build a custom container image from the Dockerfile.
  3. Push the custom container image to a private repository on Amazon Elastic Container Registry (Amazon ECR).
  4. Attach the image to your Amazon SageMaker Studio domain.
  5. Access the image from your JupyterLab space.
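
Steps 3 to 5 can be sketched in Python as follows. This is a minimal illustration, not the post's own implementation: the function names, the "-config" suffix, and the file system defaults (chosen to match NB_UID=1000 and NB_GID=100 in the Dockerfile) are assumptions, and `sm_client` is expected to behave like `boto3.client("sagemaker")`. Waiting for resources to become ready and error handling are omitted.

```python
from typing import Any


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Return the URI of an image in a private Amazon ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def attach_image_to_domain(
    sm_client: Any, *, domain_id: str, image_name: str, image_uri: str, role_arn: str
) -> None:
    """Register an ECR image with SageMaker and attach it to a Studio domain.

    Hypothetical helper; `sm_client` should behave like boto3.client("sagemaker").
    """
    # Create the SageMaker image resource and a first version pointing at ECR.
    sm_client.create_image(ImageName=image_name, RoleArn=role_arn)
    sm_client.create_image_version(ImageName=image_name, BaseImage=image_uri)

    # The app image config describes how Studio runs the image as a JupyterLab app.
    # The UID/GID values are an assumption that mirrors the Dockerfile arguments.
    config_name = f"{image_name}-config"
    sm_client.create_app_image_config(
        AppImageConfigName=config_name,
        JupyterLabAppImageConfig={
            "FileSystemConfig": {"DefaultUid": 1000, "DefaultGid": 100}
        },
    )

    # Make the image selectable for JupyterLab spaces in the domain.
    sm_client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings={
            "JupyterLabAppSettings": {
                "CustomImages": [
                    {"ImageName": image_name, "AppImageConfigName": config_name}
                ]
            }
        },
    )
```

Note that `update_domain` replaces the list of custom images in the default user settings, so a domain that already has custom images attached would need them included in the same call.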

The diagram below illustrates the solution architecture.

The solution uses AWS CodeBuild, a fully managed service that compiles source code and produces deployable software artifacts, to build a new container image from a Dockerfile.
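
Starting such a build from Python might look like the sketch below. It assumes a CodeBuild project already exists and that its buildspec reads a `DISTRIBUTION_TYPE` environment variable and forwards it to the Docker build; the project name and that buildspec behavior are hypothetical, and `codebuild_client` is expected to behave like `boto3.client("codebuild")`.

```python
from typing import Any


def start_image_build(codebuild_client: Any, project_name: str, distribution_type: str) -> str:
    """Start a CodeBuild run for the given variant and return the build ID."""
    if distribution_type not in ("cpu", "gpu"):
        raise ValueError(f"unsupported distribution type: {distribution_type!r}")

    response = codebuild_client.start_build(
        projectName=project_name,
        # Override the variant for this build only (assumed to be read by the buildspec).
        environmentVariablesOverride=[
            {"name": "DISTRIBUTION_TYPE", "value": distribution_type, "type": "PLAINTEXT"}
        ],
    )
    return response["build"]["id"]
```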

Extend SageMaker Distribution

By default, SageMaker Studio provides a selection of curated pre-built Docker images as part of SageMaker Distribution. These images include popular frameworks for ML, data science, and visualization, including deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like NumPy, scikit-learn, and pandas; and IDEs like JupyterLab and Code Editor. Each distribution version is available in two variants, CPU and GPU, and is hosted on the Amazon ECR Public Gallery. To be able to work with geospatial data in SageMaker Studio, you need to extend SageMaker Distribution by adding the required geospatial libraries like GDAL, geopandas, leafmap, or rioxarray and make it accessible to users through SageMaker Studio.

Here is the custom Dockerfile for the geospatial image.

# set distribution type (cpu or gpu)
ARG DISTRIBUTION_TYPE

# get SageMaker Distribution base image
# use fixed version for reproducibility, use "latest" for most recent version
FROM public.ecr.aws/sagemaker/sagemaker-distribution:1.8.0-$DISTRIBUTION_TYPE

#set SageMaker specific parameters and arguments
#see here for supported values: 
ARG NB_USER="sagemaker-user"
ARG NB_UID=1000
ARG NB_GID=100

ENV MAMBA_USER=$NB_USER

# switch to the root user to install system packages
USER root

#set environment variables required for GDAL
ARG CPLUS_INCLUDE_PATH=/usr/include/gdal
ARG C_INCLUDE_PATH=/usr/include/gdal

#install GDAL and other required Linux packages
RUN apt-get --allow-releaseinfo-change update -y -qq \
   && apt-get update \
   && apt-get install -y software-properties-common \
   && add-apt-repository --yes ppa:ubuntugis/ppa \
   && apt-get update \
   && apt-get install -qq -y groff unzip libgdal-dev gdal-bin ffmpeg libsm6 libxext6 \
   && apt-get install -y --reinstall build-essential \
   && apt-get clean \
   && rm -fr /var/lib/apt/lists/*

# use micromamba package manager to install required geospatial python packages
USER $MAMBA_USER

RUN micromamba install gdal==3.6.4 --yes --channel conda-forge --name base \
   && micromamba install geopandas==0.13.2 rasterio==1.3...
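
The Dockerfile expects the `DISTRIBUTION_TYPE` build argument to be supplied at build time. As a rough sketch, a local build command could be assembled like this; the image tag and the helper name are illustrative assumptions, and the post itself runs the build in CodeBuild rather than locally.

```python
import shlex
from typing import List


def docker_build_command(image_tag: str, distribution_type: str, context_dir: str = ".") -> List[str]:
    """Assemble a `docker build` command that selects the CPU or GPU base image."""
    if distribution_type not in ("cpu", "gpu"):
        raise ValueError(f"unsupported distribution type: {distribution_type!r}")
    return [
        "docker", "build",
        # Fills in the ARG declared at the top of the Dockerfile.
        "--build-arg", f"DISTRIBUTION_TYPE={distribution_type}",
        "--tag", image_tag,
        context_dir,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(shlex.join(docker_build_command("geospatial-sm-dist:cpu", "cpu")))
```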

In-Notebook Interactive Development using a Custom Image

After choosing the custom geospatial image as the base image for your JupyterLab space, SageMaker provides you with access to many geospatial libraries that can now be imported without the need for additional installs.
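
A quick way to confirm this from a notebook cell is to check that the expected modules can be found without importing them. The module list below is an assumption based on the libraries named in this post (`osgeo` is the Python module name used by the GDAL bindings); adjust it to whatever your image installs.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumed module names for the libraries mentioned in this post.
GEOSPATIAL_MODULES = ("osgeo", "geopandas", "rasterio", "leafmap", "rioxarray")


def check_modules(names: Iterable[str] = GEOSPATIAL_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to whether it is importable in this environment."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_modules().items():
        print(f"{name:12s} {'available' if available else 'MISSING'}")
```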

Highly Parallelized Geospatial Processing Pipelines using a SageMaker Processing Job and a Custom Image

You can also specify the custom image as the container image for a SageMaker Processing job, so that batch pipelines run with the same geospatial libraries as your notebooks.
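
As a hedged sketch of what this could look like, the function below builds a request for the SageMaker `CreateProcessingJob` API that runs a script on several instances, with the input objects sharded across them. The script name `process.py`, the container paths, the instance type, and the instance count are illustrative assumptions, not values from the post.

```python
from typing import Any, Dict


def processing_job_request(
    *, job_name: str, image_uri: str, role_arn: str, code_s3_uri: str,
    input_s3_uri: str, output_s3_uri: str,
    instance_count: int = 4, instance_type: str = "ml.m5.xlarge",
) -> Dict[str, Any]:
    """Build a CreateProcessingJob request that uses the custom image."""
    base = "/opt/ml/processing"

    def s3_input(name: str, uri: str, path: str, distribution: str) -> Dict[str, Any]:
        return {"InputName": name, "S3Input": {
            "S3Uri": uri, "LocalPath": path, "S3DataType": "S3Prefix",
            "S3InputMode": "File", "S3DataDistributionType": distribution}}

    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {
            "ImageUri": image_uri,
            # Assumes code_s3_uri points at a single script named process.py.
            "ContainerEntrypoint": ["python3", f"{base}/input/code/process.py"],
        },
        "ProcessingResources": {"ClusterConfig": {
            "InstanceCount": instance_count,
            "InstanceType": instance_type,
            "VolumeSizeInGB": 30}},
        "ProcessingInputs": [
            # Every instance receives the script.
            s3_input("code", code_s3_uri, f"{base}/input/code", "FullyReplicated"),
            # Each instance receives only a subset of the input objects.
            s3_input("data", input_s3_uri, f"{base}/input/data", "ShardedByS3Key"),
        ],
        "ProcessingOutputConfig": {"Outputs": [{
            "OutputName": "results",
            "S3Output": {"S3Uri": output_s3_uri, "LocalPath": f"{base}/output",
                         "S3UploadMode": "EndOfJob"}}]},
    }
```

The resulting dictionary can be passed to `boto3.client("sagemaker").create_processing_job(**request)`. Sharding by S3 key is what spreads the work: with four instances, each one processes roughly a quarter of the input objects.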

Clean Up

After you’re done running the notebook, don’t forget to stop the SageMaker Studio JupyterLab application to avoid incurring unnecessary costs.
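
If you prefer to do this programmatically, one possible approach is to delete the running JupyterLab app of a space through the SageMaker API, as sketched below. The helper name is hypothetical, `sm_client` is expected to behave like `boto3.client("sagemaker")`, and pagination of the app list is ignored for brevity.

```python
from typing import Any, List


def stop_jupyterlab_apps(sm_client: Any, domain_id: str, space_name: str) -> List[str]:
    """Delete running JupyterLab apps in a space and return their names."""
    response = sm_client.list_apps(DomainIdEquals=domain_id, SpaceNameEquals=space_name)
    deleted = []
    for app in response.get("Apps", []):
        # Only touch JupyterLab apps that are currently running.
        if app["AppType"] == "JupyterLab" and app["Status"] == "InService":
            sm_client.delete_app(
                DomainId=domain_id,
                SpaceName=space_name,
                AppType="JupyterLab",
                AppName=app["AppName"],
            )
            deleted.append(app["AppName"])
    return deleted
```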

Conclusion

This post has equipped you with the knowledge and tools to build and use custom container images tailored for geospatial analysis in SageMaker Studio.

About the Authors

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS.

Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies.

Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning.

Frequently Asked Questions

Q1. What is the role of AWS CodeBuild in this solution?

AWS CodeBuild is a fully managed service that compiles source code and produces deployable software artifacts. In this solution, it builds a new container image from a Dockerfile and pushes the custom image to a private repository on Amazon Elastic Container Registry (Amazon ECR).

Q2. Why is geospatial data increasingly important for businesses?

Geospatial data is becoming increasingly important for businesses due to its ability to unlock valuable insights from data that are enriched with geographical characteristics. This allows businesses to leverage location-based services, optimize logistical routes, and develop targeted marketing strategies, among other uses.

Q3. What are the different libraries that can be installed to enhance the functionalities of SageMaker Distribution?

SageMaker Distribution is a set of pre-built Docker images that you can extend with additional libraries and tools. This post adds geospatial libraries such as GDAL, geopandas, leafmap, and rioxarray.

Q4. Can I extend SageMaker Distribution to include other libraries?

Yes. Add the libraries to the Dockerfile and rebuild the image, for example with CodeBuild, then attach the new image version to your SageMaker Studio domain.

Q5. Are there any potential drawbacks to extending SageMaker Distribution with additional libraries?

Extending SageMaker Distribution with additional libraries can increase deployment complexity, introduce version conflicts between dependencies, and lengthen build and deployment times. Weigh these factors before customizing the base image.
