In one of my projects, the Pentaho Data Integration tool was used to implement data extraction, transformation and loading (ETL) processes. Before you read further, let me make it clear that I'm not an ETL expert and have no experience with other ETL tools, so some of the strengths and weaknesses I share here may apply to other tools as well.
What is PDI?
Pentaho Data Integration (PDI) is an open-source project also known as Kettle. It is a very powerful ETL tool built in Java. The performance numbers we saw in the current project were extremely good under certain test conditions. It has a number of other things going in its favor, and you should be able to find those from an Internet search or from our Open Source Enterprise Stack recommendations.
Issue 1 – Java skills are a MUST
We made the mistake of staffing the entire team with resources skilled in other ETL tools, but not necessarily in Java. While the team members had extremely good ETL skills, PDI needs decent Java programmers. If the functionality being built is complex, you will run into issues that can be understood and resolved only by attaching the Kettle source code to the Eclipse IDE and debugging. We cannot expect a non-Java developer to do this. Hence, staff at least 25% of your team members with Java skills, GOOD Java skills.
Issue 2 – Version control system is inadequate
PDI stores code in three possible formats – file, database and enterprise repository. File-based code is good and can easily be integrated with the version control system your client uses (such as Subversion, Perforce or PVCS), but it has some significant challenges. Folder structure maintenance is not possible: all code gets saved in the root folder, and this is a maintenance nightmare. Also, because testing ETL code requires data that may be available only on servers, developers have to check in code after every small change, check it out on the server and then unit test there. A very long process.

The enterprise repository, on the other hand, is a very sophisticated solution: it maintains multiple versions of code, connections and other server properties are shared across jobs, and remote submission is very easy. The problem with the enterprise repository is that it is very slow. Export and import take a lot of time. Also, you cannot easily integrate the code with the client's version control tool. We even had instances where developers deleted the entire repository by mistake.
Issue 3 – JSON plugin is slow
JSON is the source data format in our case, so the PDI plugin that handles JSON data is the first step in the job flow. The plugin's implementation apparently converts the whole incoming string into a JSON tree with all nodes identified, and only then applies the parsing the developer configured. This is slow but can be forgiven for smaller data strings. In our case, we had JSON strings running into megabytes, and that step started choking big time. We replaced the PDI-provided step with a custom Java step that uses the Jackson API to parse the JSON data.
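The core of our replacement was Jackson's streaming parser, which walks the tokens one at a time instead of building a full tree for a multi-megabyte string. Below is a minimal sketch of that pattern; the "id" field name and the top-level-array layout are hypothetical examples, not our actual feed format.

```java
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class StreamingJsonReader {

    // Pulls the "id" field out of each object in a top-level JSON array
    // without materialising the whole document as a tree. The field name
    // and layout are illustrative, not the real source format.
    public static List<String> extractIds(String json) throws Exception {
        List<String> ids = new ArrayList<>();
        JsonFactory factory = new JsonFactory();
        try (JsonParser p = factory.createParser(json)) {
            if (p.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalArgumentException("Expected a JSON array");
            }
            while (p.nextToken() == JsonToken.START_OBJECT) {
                while (p.nextToken() != JsonToken.END_OBJECT) {
                    String field = p.getCurrentName();
                    p.nextToken(); // advance to the field's value
                    if ("id".equals(field)) {
                        ids.add(p.getText());
                    } else {
                        p.skipChildren(); // skip nested values we don't need
                    }
                }
            }
        }
        return ids;
    }
}
```

Because the parser never holds more than the current token in memory, throughput stays flat as the input grows, which is exactly where the tree-building plugin was choking.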
Issue 4 – Error handling is not consistent
Not all plugins provided by PDI support error handling, so if something fails in one of them, the entire job comes crashing down. This is a major issue when job runs go for hours and one trivial issue can stop them entirely. Also, there is no global try-catch mechanism that can trap exceptions in a particular block of code. Instead, you have to select each step, configure it for errors and redirect that error to a different step. It is a very indirect way to handle errors and makes the code difficult to read.
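For readers unfamiliar with PDI's per-step error routing, here is the same idea expressed in plain Java: failing rows are diverted to an error stream instead of aborting the whole run. This is only an illustration of the pattern you end up configuring step by step, not PDI code.

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorRouting {

    // Illustrative stand-in for a PDI step: parse each incoming row and
    // route failures to an error stream instead of failing the whole job.
    public static List<Integer> parseRows(List<String> rows, List<String> errors) {
        List<Integer> ok = new ArrayList<>();
        for (String row : rows) {
            try {
                ok.add(Integer.parseInt(row.trim()));
            } catch (NumberFormatException e) {
                // Equivalent of redirecting the failing row to an error step.
                errors.add(row + " -> " + e.getMessage());
            }
        }
        return ok;
    }
}
```

In a language this is three lines of try-catch; in PDI it becomes a per-step configuration plus an extra hop in the flow, which is what makes the jobs harder to read.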
Issue 6 – Enterprise repository connections choke when triggering jobs
Pentaho Scheduler provides basic job scheduling features, and dependencies can be configured inside the specific jobs. This is a good feature and works for most cases where batch job management is required. However, we ran into significant problems when the scheduler started skipping runs. After a good number of attempts and some severe escalations from the client, we found the real reason: the scheduler was submitting the jobs (configured to run every minute) properly, but the jobs have additional sub-jobs and transformations embedded, and PDI goes to the repository to fetch the code each time. Due to the high number of hits on the fragile repository database (Jackrabbit), jobs started freezing. Our solution was to move all jobs that need to be triggered very frequently to the file system and not invoke them from the repository.
Issue 7 – Shell Script plugin can cause lot of confusion
The step that enables us to run shell scripts is a very useful one. It gives us the flexibility to reuse existing scripts and to integrate tasks that otherwise cannot be implemented using PDI steps. But you have to be careful with the 'Working Directory' field in the step configuration. We simply left it as '/tmp' and the job started failing in Production, although it worked in all sub-Prod environments and the shell script ran fine when executed on the server directly. The problem was the '/tmp' path: PDI loads everything that is in that path, and since '/tmp' is a directory commonly used by many other applications, it usually contains a number of jars, libraries and so on. If those jars conflict with the ones needed by your shell script, the results will be chaotic. A better practice is to use a working directory like '/tmp/job-a-step-b' so that no other job or process dumps unnecessary files into that location.
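The same isolation principle applies whenever you launch a script from Java: set the working directory explicitly to a job-specific path instead of inheriting a shared one. A minimal sketch using the standard `ProcessBuilder` API (the `/tmp/job-a-step-b` path is just the example from the text, and this is not how PDI itself implements the step):

```java
import java.io.File;
import java.io.IOException;

public class ScriptRunner {

    // Runs a shell script with an explicit, job-specific working directory,
    // mirroring the 'Working Directory' field of PDI's shell script step.
    public static int runScript(String scriptPath, String workDir)
            throws IOException, InterruptedException {
        File dir = new File(workDir);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("Cannot create working directory: " + workDir);
        }
        ProcessBuilder pb = new ProcessBuilder("/bin/sh", scriptPath);
        pb.directory(dir);   // isolate from whatever other processes drop in /tmp
        pb.inheritIO();      // forward the script's stdout/stderr to ours
        return pb.start().waitFor();
    }
}
```

Creating the directory up front also means each run starts from a known-empty location, so a library or jar left behind by an unrelated process can never leak into the script's environment.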