We were recently engaged by a charitable foundation to come up with a simple way to generate Pentaho reports on an Amazon Web Services (AWS) platform. The charitable foundation had already created the reports with Pentaho, but preferred not to deploy a full Pentaho BI Server on an EC2 instance. As it happens, this was an excellent opportunity to employ a Lambda function.
Why Lambda?
If you are unfamiliar with the concept of a Lambda function, it can best be described as a chunk of remote code that can be executed at-will. This was desirable because it eliminated the maintenance and cost of setting up an EC2 instance, and replaced it with a metered service that only incurs cost during actual execution time.
The Big Picture
A complete solution involved integrating the Lambda function as part of an AWS service stack along with an S3 Bucket for storage and an API Gateway to serve as a simple RESTful endpoint. The diagram below shows how these service layers work together to generate a report.
Components
If this structure seems familiar to you, it’s probably because it can easily be translated to the classic Model-View-Controller (MVC) paradigm. The S3 files compose the primary Model elements needed to control execution and store results. Files include Report definitions (*.prpt), properties, and outputs such as PDF, CSV, and HTML. Logging of process and exception conditions is routed to the CloudWatch service and can also be considered part of the Model.
The API Gateway provides View services by allowing the users to trigger the function and providing necessary feedback. This can be configured from within the Amazon console.
Finally, the Lambda function itself is the Control layer, receiving requests from the View and processing the Model accordingly.
About That Lambda Function
Although S3 and API Gateway services are fully customizable within the API console, the Lambda function is custom code we developed in house. Here you can see how the Lambda function controls the process.
Sequence
Report execution with this method is generally fairly quick; a few seconds for a report based on a low-latency query, but an initial use is somewhat delayed by the initialization process of the reporting engine. In a Lambda configuration, the processing power assigned to the function is directly related to the memory allocated to it. Since the resulting report must be contained in memory before it can be written to the S3 bucket, we recommend 2 gigs be allocated initially.
You can pull down our Lambda-Pentaho project from github at pentaho-lambda report generator.