Hi there! My name is Paul, co-founder of Acheron. For 2,5 years now, I’ve been using Matillion. In this blog, I will share some of my best practises to make Matillion more performant, cheaper and easier to manage in terms of life-cycle-management.
Backend
First, let’s talk about the backend of Matillion. Once we understand the inner mechanism, the best-practises will make much more sense.
Webserver (Apache Tomcat); ETL-flows are created by dragging and dropping components in a web browser. On the event of a click, a drag or a drop, an internal webserver is busy trying to update the web page of the user that performed the event. It’s Matillion’s goal to give all users a fluent experience in their browser. Quote a task when there a multiple users building an ETL flow.
Database (PostgreSQL); Matillion depends heavily on its database to act as brain. Every event happening in Matillion is captured as a database change, whether it is a job being scheduled, a change in an ETL flow or a user logging in. Because every change is automatically saved, you will not find a save button in Matillion. Note that the Database runs on the same machine as the webserver, taking up some of Matillion’s resources.
API; Matillion ships with an advanced API, that allows programs and users to interact with Matillion outside of the browser. Many operations that a user can perform in the Matillion frontend have their equivalent in the API.
Quartz Job Scheduler; this component is responsible for the order in which Matillion-jobs are run. When more jobs like to start, Quartz will decide which one will go first. Matillion is configured to only allow a handful of jobs to be executed in parallel. Note that the Quartz Job Scheduler also keeps its state in the database.
Too many responsibilities
The default Matillion installation ships on one single virtual machine. This machine has quite a lot of responsibilities:
- handle requests of all users currently logged in,
- run ETL-flows,
- run a job scheduler,
- update the database when some change is happening
- or retrieve data from source systems.
When available resources (CPU & memory) are shrinking, Matillion will not be able to handle all its responsibilities.
On a previous project of mine, I received warnings that Matillion’s CPU was at 99%. At the same time, BI-developers complaint that they were not able to login to Matillion. On top of that, I started to receive messages from Matillion’s job scheduler that it was being overloaded.
IT-problems rarely occur on their own. When CPU is high, it means that some process is taking a lot of the computing resources of the machine. In this particular case, Matillion was being overloaded with events saying that a new file is ready to be picked up by Matillion, caused by a big batch of files that was put on cloud storage. Matillion did its utmost best to handle all requests, explaining the 99% CPU usage. As a consequence, Matillion did not have resource available to give users a fluent experience, explaining why the BI-developers could not login anymore. Even worse, the UI was not responding at all, which limited my options of how to resolve this issue.
In my opinion, Matillion is doing too many things on one single VM. Let’s try to change that.
Speed-up by focus
Matillion has many responsibilities. Let’s try to prioritize & focus there.
- HIGH PRIO: Matillion has a great UI that allows BI-developers to easily create complex flows. In a Matillion-development environment, giving these developers a flawless UI-experience is the most essential. Other responsibilities are less important (Best Practise #1)
- HIGH PRIO: Matillion needs its brain to function. While PostgreSQL is an efficient database, it takes up precious resources as well. We need to make sure that the brain is not competing with the UI (Best-Practise #2)
- HIGH PRIO: A Matillion-production environment should favour speed of the ETL. User experience is less relevant (Best-Practise #3).
- LOWER PRIO: Matillion has many connectors to extract data from data sources. They take up precious space for jobs to run and CPU at the expense of the previous mentioned priorities (Best-Practise #4).
Best-Practise #1: use a single node cluster. The more cluster-nodes Matillion uses, the more resources Matillion spends on keeping the nodes in sync, slowing down the UI-experience.
Best-Practise #2: separate PostgreSQL from the Matillion-application. This will make sure that the brain never needs to compete with the UI for resources. In my next blog-post, I will share the steps how to do this yourself #cliff-hanger
Best-Practise #3: if your load is not so heavy on production, schedule your load outside of business hours.
Best-Practise #4: I advise to keep the number of data sources that Matillion is extracting to a bare minimum. Instead, adopt other solutions that have data-extraction as their core functionality e.g. Apache Nifi . Note that extra solutions always come at the cost of yet another tool that needs to managed and maintained.
Cost-Saving
Matillion charges by the hour that the VM is up.
At my current customer, we discovered some strange patterns in charged costs. Initially, we setup Matillion’s Powerschedule to wake Matillion at 7:00 sharp and let Matillion sleep at 19:00. We expected the price per day to be the same, but the charged costs showed otherwise. Some days were charged one hour more than others.
Matillion charges by whole hours. If the Matillion VM starts the shutdown exactly at 19:00, which can cause Matillion to charge for the hour from 19:00-20:00.
Best-Practise #5: Turn of Matillion 5 minutes before the whole hour to be sure that there are no accidental costs involved for the following hour.
The heavier the VM Matillion is using the higher the hourly-rate.
Best-Practise #6: separate PostgreSQL from the Matillion-application (emphasizing BP#2). This will reduce the need for a heavier machine. Additionally, the cloud-costs for a separate PostgreSQL database are far less than the extra license costs for a Matillion-VM on steroids.
Easier life-cycle-management
Matillion has a high frequency of releases. Installing a new release comes with a risk that Matillion behaves differently then before.
Effective life-cycle-management comes with covering acceptance tests: after every update, we need to validate that Matillion behaves as expected (in other words: an upgrade should not impact the ETL flows that already have been developed).
Best-Practise #7: separate PostgreSQL from the Matillion-application (emphasizing Best-Practise #2 and #6).
Say, we like to upgrade Matillion from Version v1 to Version v2. In my experience, the best way to do this is to add a shadow node with the newer version to the environment for acceptance testing. In the picture below, I depicted the shadow node in grey.
Notice that the shadow node is interacting with the same database as the active node, colored in green. This allows the shadow node to access all ETL-flows. This can come in handy when a flow is being run on the shadow node, to see if the result is as expected. When the acceptance tests are successful, the shadow node will become the single active node. The now deactivated node can be removed at a convenient time.
Best-Practise #8: use a shadow node for acceptance tests
Best-Practise #9: always make a backup of the database and the node that will be upgraded, before an upgrade is initiated (or a shadow node is added to the network).
Summary
Matillion has many tasks that it tries to perform on a single node. I advise to reduce these tasks as much as possible in order to have the best Matillion-UI experience.
The main advise to make Matillion more efficient, cheaper and to easy life cycle management, is to separate the database from the Matillion-VM. In my next blog, I will walk you through the steps how to do this.