“Most of our customers ask, ‘Why do we need CI/CD for our Data Warehouse?’ We ask: why shouldn’t you?
Continuous Integration (CI) and Continuous Deployment (CD) are best practices for implementing the agile methodology that drives DevOps teams. They enable teams to deliver code changes more frequently and reliably. The basis for CI/CD is automation: a so-called CI/CD pipeline guarantees that every change is automatically tested and automatically deployed. This saves time and allows DevOps teams to focus on realizing business requirements.
CI/CD is a well-known standard in the software world for shipping higher-quality code at a faster pace. Developing a Data Warehouse is very similar to developing software. So why isn’t CI/CD used as frequently in the world of data? Keep reading for the answer!”
Why do data engineers need CI/CD?
The move to the cloud opens up many new possibilities, such as scaling computing power up or down depending on demand. These possibilities also affect Data Warehouse environments, as many companies have become Big Data processors. With greater utilization of a cloud Data Warehouse, new business needs emerge.
A project manager shouts: “I want this change in production in three hours.” The pressure on data engineers to deliver fast is high; there is no time to waste, and these expectations need to be managed properly. This is where CI/CD comes in!
A CI/CD pipeline is the tool in the modern data engineer’s toolbelt for delivering changes quickly and with a high degree of quality.
- Instead of skipping a deployment to the Acceptance environment because of time pressure, the CI/CD pipeline automatically propagates a change from Test to Acceptance and, finally, to Production.
- Instead of taking a shortcut and delivering a change without properly validating that no test fails, the CI/CD pipeline runs the validation automatically for every change you give it.
- Instead of taking screenshots as proof that all tests were executed successfully, the CI/CD pipeline logs every action performed by a user or automated process.
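As a minimal sketch of these three ideas (promotion through environments, automatic validation, and logging), consider the Python outline below; the `deploy` and `run_tests` callables are hypothetical placeholders for your own deployment and test tooling.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cicd")

# The environments a change is promoted through, in order.
ENVIRONMENTS = ["test", "acceptance", "production"]

def promote(change_id, deploy, run_tests):
    """Deploy a change through every environment, testing and logging each step.

    `deploy` and `run_tests` are hypothetical callables that stand in for
    your actual deployment and test tooling.
    """
    for env in ENVIRONMENTS:
        log.info("Deploying change %s to %s", change_id, env)
        deploy(change_id, env)
        log.info("Running tests for change %s on %s", change_id, env)
        if not run_tests(change_id, env):
            log.error("Tests failed on %s; stopping the pipeline", env)
            raise SystemExit(1)  # never skip an environment; fail instead
    log.info("Change %s reached production", change_id)
```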
CI/CD enables data engineers to spend less time on testing, deployment, and administration. The time saved can be spent developing code changes, allowing them to deliver faster and to a high quality standard. Meeting these new expectations requires automation: CI/CD gives you a solution for managing your Data Warehouse environment in a controlled way, enabling more effective change management.
Business needs are continuously changing, and so is your data landscape. Because of this, your company is only as effective as how quickly your Data Warehouse can deliver changes.
What are the challenges of CI/CD for Data Warehouses?
The process of developing software is very similar to developing Data Warehouses, so a delivery/deployment process fits both worlds. The main difference we experience is that every change in a Data Warehouse also involves data that needs to be managed. Consequently, a more complicated CI/CD solution is required, as the sketch below illustrates.
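To make this concrete, here is a minimal sketch of a migration step in which a schema change and the data change it implies must be deployed as one unit. The table and column names are hypothetical, and `run_sql` stands in for whatever database client your pipeline uses.

```python
def migrate(run_sql):
    """One deployable unit: a schema change plus the data change it implies.

    `run_sql` is a hypothetical helper that executes a statement against the
    warehouse and returns the scalar result of a query.
    """
    # 1. Schema change: add the new column.
    run_sql("ALTER TABLE sales ADD COLUMN net_amount DECIMAL(12, 2);")

    # 2. Data change: backfill existing rows from columns that already exist.
    run_sql(
        "UPDATE sales "
        "SET net_amount = gross_amount - tax_amount "
        "WHERE net_amount IS NULL;"
    )

    # 3. Validation: the deployment only succeeds if the backfill is complete.
    remaining = run_sql("SELECT COUNT(*) FROM sales WHERE net_amount IS NULL;")
    if remaining != 0:
        raise RuntimeError(f"Backfill incomplete: {remaining} rows left")
```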
Different technologies
A complex landscape creates management challenges. Below, we provide insight into why the data landscape is so complex.
A Data Warehouse is managed with a challenging pipeline of different technologies, systems, and subsystems. We identify the following categories:
- Data Warehouse solutions, like Snowflake, Google BigQuery and Azure SQL
- Ingestion tools, like AWS DMS, Fivetran, Matillion Data Transformation and Talend Data Fabric
- Scripting languages, like SQL, Python and Spark
- Reporting tools, like Power BI and Tableau
Because a typical Data Warehouse consists of many different technologies that are all linked together, the result is a complex landscape with many dependencies. If that complexity is not managed correctly, there is an increased chance of manual actions to quickly fix a small problem, with the risk that the landscape and the codebase run out of sync and that production deployments are ultimately accompanied by unexpected outcomes. When dependencies are not managed properly, they can cause numerous data quality problems. Moreover, each technology has its own methodology that must be fitted into the CI/CD process, while the Data Warehouse, your single point of truth, must remain reliable.
How do you stay compliant in a complex data environment?
Data engineers are generally proactive people. With an “I’ll do that” mentality, they are able to solve many problems quickly, but they sometimes lose sight of major issues such as security, privacy, and code quality.
A CI/CD-driven way of working stimulates this proactivity while helping engineers implement changes quickly, yet in a controlled manner. This allows an engineer to focus on speed, but within the set guidelines and frameworks.
At companies that consider compliance to be of paramount importance, we see that engineers used to spend most of their time on administration, followed by deployment and, only then, development. With a CI/CD-driven way of working we see a turnaround, in which engineers can focus the majority of their time on development: deployment is largely automated away by CI/CD, and the administration around compliance is secured in the deployment process.
Compliance points where a CI/CD pipeline can make the difference:
- Security: who is allowed to access or change the data, and with what rights? Do you really need to give a developer God mode to make changes to the Data Warehouse? Without controls, developers can delete or change code, with a potentially large impact on the data landscape, without anybody knowing about the change.
- Reliability: is it OK for developers to make changes to the Data Warehouse without those changes being reviewed or checked before deployment? Developers try to deliver changes as fast as possible to comply with business needs, sometimes overlooking the impact a change can have.
- Auditability: how can you audit who made which change to the Data Warehouse, and when? The warehouse’s own logs cannot always provide the insight needed, for example because the retention period has been exceeded and the log has been removed.
- Standardization: how do you ensure that objects of the same kind also carry the same name, and that naming conventions are properly followed? A minimal example of such a check is sketched below.
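As an illustration of the standardization point, here is a minimal sketch of a naming convention check that a CI/CD pipeline could run on every change. The prefixes and table names are hypothetical examples, not a prescribed standard.

```python
import re
import sys

# Hypothetical convention: staging tables start with "stg_",
# dimension tables with "dim_", and fact tables with "fact_".
NAMING_RULES = [re.compile(p) for p in (r"^stg_", r"^dim_", r"^fact_")]

def check_names(table_names):
    """Return the table names that violate the naming convention."""
    return [
        name for name in table_names
        if not any(rule.match(name) for rule in NAMING_RULES)
    ]

if __name__ == "__main__":
    # In a real pipeline these names would be parsed from the DDL in Git.
    tables = ["stg_orders", "dim_customer", "SalesTemp"]
    violations = check_names(tables)
    if violations:
        print(f"Naming convention violated: {violations}")
        sys.exit(1)  # a non-zero exit code fails the CI/CD stage
```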
The demand for speed and flexibility versus the need for controlled changes presents a conflict that must be handled properly.
What can Acheron add to the Data Warehouse CI/CD?
Acheron provides functionality for:
- Managing DDL and permissions with easy templates
- Setting architecture standards
- Validating DDL/permission objects
- Dry runs against test stages
- Audit logs
- Automated execution
- Detecting drift between Git and the Data Warehouse
- Importing an existing database into Git
The above-mentioned functionality runs from a managed CI/CD pipeline and is interoperable with commonly used Git environments. As a conceptual illustration, the sketch below shows what one of these features, drift detection, boils down to.
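This is a conceptual sketch of comparing the DDL tracked in Git against the DDL live in the warehouse, not Acheron’s actual implementation; the input shape and object names are assumptions for illustration.

```python
def detect_drift(git_ddl, live_ddl):
    """Compare object definitions tracked in Git with those live in the warehouse.

    Both arguments are hypothetical dicts mapping an object name to its DDL,
    e.g. {"dim_customer": "CREATE TABLE dim_customer (...)"}.
    """
    drift = {}
    for obj, ddl in git_ddl.items():
        if obj not in live_ddl:
            drift[obj] = "tracked in Git but missing from the warehouse"
        elif live_ddl[obj].strip() != ddl.strip():
            drift[obj] = "definition differs between Git and the warehouse"
    for obj in live_ddl:
        if obj not in git_ddl:
            drift[obj] = "live in the warehouse but not tracked in Git"
    return drift
```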
Acheron integrates with companies’ existing Git environments, such as:
- GitHub
- Azure DevOps
- GitLab
- Jenkins
It therefore integrates easily into your DevOps process, enabling Git support for your Data Warehouse and bringing features such as version control, version management, and merge workflows.
Acheron Environment