Execution Information Logging is one of the major new features in the Apache Hop 2.1.0 release. What exactly is execution information logging, and how does it help you as an Apache Hop user?
Developing workflows and pipelines is an important aspect of a successful data project. However, the goal of any data project is to be deployed to production and to process data repeatedly and correctly.
Apache Hop helps you, wherever possible, to guarantee that your workflows and pipelines do exactly that: visual design lets you focus on what you want to do with your data instead of how, unit tests verify your data is processed exactly the way you want, and the separation of code (projects) and configuration (environments) helps you manage your project's life cycle.
All of these features let you work proactively: visual design lets you build pipelines that are easy to understand, projects and environments help you deploy quickly and easily to multiple environments, and unit tests let you cover problem scenarios you know about or expect.
While working proactively is great and an absolute necessity, reactively finding out what is going on can't be ignored. Basic logging and monitoring were already possible in Apache Hop, but 2.1.0 takes a major leap forward with the introduction of a new execution information and data profiling platform.
Finding out what happened during workflow and pipeline execution is crucial to understand how your data flows through your project. The new execution information platform does precisely that.
As the name implies, Execution Information Logging allows Hop users to store workflow and pipeline execution information, but there's more.
A lot more, actually. Let's take a closer look.
The new workflow and pipeline execution platform decouples the actual workflow or pipeline execution from the client (Hop Gui, hop-run, Hop Server) that executes it. You can now, for example, start a pipeline through hop-run on a remote server and follow up on its progress through Hop Gui.
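To make that concrete, here's a minimal sketch of launching such a headless run. It assumes hop-run.sh is on your PATH and uses placeholder file, project and run configuration names; check `hop-run.sh --help` in your install for the exact options available.

```python
# Sketch: start a pipeline headlessly with hop-run, so Hop Gui can later
# inspect the captured execution information. The pipeline file, project
# and run configuration names below are placeholders.
import subprocess

cmd = [
    "hop-run.sh",                            # Hop's command-line runner
    "--file", "pipelines/my-pipeline.hpl",   # pipeline to execute (placeholder)
    "--project", "samples",                  # project context (placeholder)
    "--runconfig", "local",                  # run configuration with execution logging enabled
]
print(" ".join(cmd))

# To actually launch it on the server, uncomment:
# subprocess.run(cmd, check=True)
```

Because the execution information is written to the configured location rather than kept in the client, any Hop client pointed at the same location can pick up the results afterwards.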
The execution information and data profiling features add various new metadata items and metadata options to your Apache Hop installation:
We'll walk through the basic steps to capture execution information and data profiling on your local system and explore the results after some executions.
As discussed earlier, we'll start by creating metadata items for our Execution Information Location and Execution Data Profile.
The Execution Information Location metadata type takes a couple of parameters:
To configure basic execution logging, go to the metadata perspective, right-click "Execution Information Location", select "New", and enter the following parameters: execution-logging as the name, File location as the location type, and "${PROJECT_HOME}/logging/execution-logging" as the root folder.
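Note that Hop resolves variables like ${PROJECT_HOME} against the active project at run time, so the same metadata item works across environments. As a minimal illustration of that expansion (the project path below is a placeholder, not a real value from this article):

```python
# Sketch of how a variable like ${PROJECT_HOME} resolves in the root
# folder setting; the project path is a placeholder.
project_home = "/home/user/projects/samples"  # placeholder value
root_folder = "${PROJECT_HOME}/logging/execution-logging"

resolved = root_folder.replace("${PROJECT_HOME}", project_home)
print(resolved)  # /home/user/projects/samples/logging/execution-logging
```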
Creating a data profile is similar: right-click "Execution Data Profile" in the metadata perspective and hit "New". Give your data profile a name, "local-data-profile" in the example below, and add the samplers you need. We've added all available samplers with their default options.
The last thing we need to do is enable execution logging and data profiling in the pipeline and workflow run configurations. We'll use Hop's native engine for this example, but the same configuration options are available for any of the supported Beam pipeline engines.
Open your local run configuration settings (right-click -> Edit, or double-click) and select the Execution information location and Execution data profile we just created. Your workflow run configuration settings are similar; the only difference is that data profiling isn't available for workflow run configurations.
You now have everything in place to start collecting logging information and data profiling information. After you've run a number of workflows and pipelines, you'll have execution logging and data profiling information available in the folder you configured.
Switch to the Execution Information perspective to explore the information you just captured.
In this example, we ran a couple of pipelines from the samples project:
The Execution Information perspective provides a ton of information:
The workflow or pipeline view allows you to drill up or down to the parent or child pipeline or workflow. Select an action or transform and hit either the drill up or drill down icon to go to a parent or child execution. The arrow button to the left of the drill up and down buttons takes you directly to the workflow or pipeline editor.
In the lower half of the perspective, you'll find the Info, Log, Metrics (pipelines only) and Data tabs:
Even though you'll rarely, if ever, need to change or even consult the raw execution or profiling data directly, it is available as a set of JSON files on your local or server file system: a separate folder (with a hash for the execution as the folder name) is created for each execution, containing the following files:
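If you ever do want to poke around in that folder programmatically, the sketch below shows one way to do it. It builds a tiny stand-in directory tree first so it is self-contained; the hash and the file name inside it are made up for illustration and will differ from what a real execution writes:

```python
# Sketch: walk the execution information root folder and list the JSON
# files per execution. The layout follows the article (one hash-named
# folder per execution, holding JSON files); the demo hash and file name
# below are invented for this example.
import json
import pathlib
import tempfile

# Stand-in for ${PROJECT_HOME}/logging/execution-logging
root = pathlib.Path(tempfile.mkdtemp()) / "execution-logging"
demo = root / "3f9a1c0d"                    # hash-named execution folder (made up)
demo.mkdir(parents=True)
(demo / "execution.json").write_text(json.dumps({"name": "demo-pipeline"}))

# One subfolder per captured execution; list the JSON files in each
for execution_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    files = sorted(f.name for f in execution_dir.glob("*.json"))
    print(execution_dir.name, files)
```

Point `root` at the folder you configured in your Execution Information Location to explore real captured executions the same way.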
As always, we'll be happy to help you get the most out of Apache Hop.
Get in touch if you'd like to discuss running your Apache Hop implementation in production, training or custom development.