Pentaho Data Integration – Few things to remember

In one of my projects, Pentaho Data Integration tool was  used to implement data extraction, transformation and loading processes. Before  you read further, let me make it clear that I’m not an ETL expert and do not  have experience working with other ETL tools. So, some of the points I share  here, strengths or weaknesses, may be applicable to other tools as well.

What is PDI?
Pentaho Data Integration is open source project named Kettle. It is a very  powerful ETL tool built using Java. The performance numbers we have seen in the  current project are extremely good under certain test conditions. It has a  number of other positive things going in its favor and you should be able to  get those good things from an Internet search or from our Open Source Enterprise  Stack recommendations.

Issue 1 – Java skills are a MUST
We made the mistake of staffing the entire team with resources skilled in  other ETL tools, not necessarily Java. While the team members had extremely good  ETL skills, PDI needs decent Java programmers. If the functionalilty being built  is complex, you will run into issues that can be understood/resolved only if you  attach the Kettle source code to Eclipse IDE and start debugging. Now, we cannot  expect a non-Java developer to do this. Hence, have at least 25% of you team  members with Java skills, GOOD Java skills.

Issue 2 – Version control system is inadequate
PDI stores code in 3 possible formats – file, database and enterprise  repository. While file based code control is good and can be easily integrated  with the version control system used by your client (like Subversion, Perforce,  PVCS etc), it has some significant challenges – folder structure maintenance is  not possible. All code gets saved in the root folder and this is  a maintenance  nightmare. Also, testing ETL code requires data that maybe available only on  servers, developers will have to checkin code after every small change, checkout  the code on server and then unit test on server. A very long process. Enterprise  repository on the other had is a very sophisticated solution. It maintains  multiple versions of code, connections and other server properties are shared  across jobs and remote submission is very easy. Problem with enterprise  repository is it is very slow. Export and import takes a lot of time. Also, you  cannot integrate code with the clients version control tool easily. We had  instances where developers by mistake deleted the entire repository.

Issue 3 – JSON plugin is slow
JSON is the source data format in our case and hence the PDI plugin to handle  JSON data is the first thing in the job flow. Now, JSON implementation probably  deals with converting the whole incoming string into a JSON tree with all nodes  identified. Then the developer mentioned parsing is applied. This is slow but  can be forgiven for smaller data strings. In our case, we had JSON strings  running into megabytes. That step started choking big time. We implemented a  standard Java step where Jackson API is used to parse the JSON data instead of  PDI provided one

Issue 4 – Error handling is not consistent
Not all plugins provided by PDI support error handling. So, if something  fails there, the entire job comes crashing down. This is a major issue when your  job runs may go for hours and 1 trivial issue can stop it entirely. Also, there  is no global try-catch logic that can be used to trap exceptions in a particular  block of code. This means you have to select each step and configure it for  errors. Then redirect that error to a different step. Very indirect way to  handle errors and makes it difficult to read the code.

Issue 5 – JavaScript is widely supported by the tool but has  significant performance impact
JavaScript is something very easy for developers to write custom code in. PDI  comes with a number of handy methods to be used in JavaScript code. We used it  extensively in the project. But the Pentaho official recommendation is not to  use JavaScript but use Java instead. This is a valid recommendation but needs  good Java skills in your team

Issue 6 – Enterprise repository connections choke when triggering  jobs
Pentaho Scheduler provides basic job scheduling features. Dependencies can be  configured inside the specific jobs. This is a good feature and works for most  cases when batch job management is required. We ran into significant problems  when the scheduler stared skipping runs. After a good number of attempts  and some severe escalations from client, we found out the real reason –  scheduler was submitting the jobs (configured to run every minute) properly. But  the jobs have additional sub-jobs and transformations embedded. PDI goes to  repository to fetch code each time. Due to high hits to the fragile repository  DB (Jackrabbit), jobs started freezing. Our solution is to move all job that  need to be triggered very frequently into file system. Do not invoke them from  repository.

Issue 7 – Shell Script plugin can cause lot of confusion

The step which enables us to run Shell scripts is a very useful one. It gives  us the flexibility to reuse existing scripts and also integrate certain tasks  that otherwise cannot be implemented using PDI steps. But you have to be careful  with the ‘Working Directory’ field present in the step configuration. We simply  left it as ‘/tmp’ and the job started failing in Production. Worked in all  sub-Prod environments. The shell script executes fine when run on the server  directly. Problem was with the ‘/tmp’ path. PDI loads everything that is there  in that path. ‘/tmp’ being a very commonly used directly by many other  applications, we usually see a number of jars, libraries etc in that path. In  case those jars conflict with the ones needed by your shell script, results will  be chaotic. A better practice is to have a working directory like  ‘/tmp/job-a-step-b’ so that no other job/process dumps unnecessary files into  this location.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s