ETL Performance troubleshooting with Pentaho Data interchange

February 17, 2014

System monitoring, memory, partitioning, CPU balancing and disk arrays are all part of making the ETL performance of your Pentaho Data Interchange fast and efficient. System monitoring is a critical first step and catch all for performance issues. As the amount of your data grows, processing the data becomes more of a challenge and can significantly slow your system . Do you have the right processes and tools in place to quickly assess data access issues?

There are a few basic things that one should address to start the performance troubleshooting process. Making sure you have enough memory throughout your process is an area that also demands a lot of attention and should be checked at regular intervals to make sure it is being utilized efficiently and there is enough allocated for the task. Disk partitioning is another area that can be problematic. Is your disk partition correctly to allow load balancing to occur properly and your recovery to work as planned in the event of a system outage? When it comes to storage of your data, are your disk arrays functioning the way they should and are you testing them regularly to ensure data is being stored in the manner in which you intended. It is vital to your organization and your clients to test your storage capabilities on a regular, monthly or bi-monthly basis to ensure that the information being managed is not lost in the event of hardware failure. As for the quick check items, you should review the latest version, known problems, and patches with Pentaho Data Integration.

After you have performed the basic system level items, then it is typically beneficial to look at the specific functions and techniques that have been used for the extraction and integration program. Here are a few items to consider:

Hitting Database Too Many Times

Alternative: Create a ‘cache’ table once, then consume that data and eliminate. Of course, this has to be a special case where the run time of the onetime ‘cache’ table is offset by the savings of the repeated execution.

Speed versus Dynamic

When to store versus when to calculate is an ongoing challenge when extracting data for analysis. There are many factors that can affect the decision; user needs, network limitations, the amount of data, but there should not be a hard a fast rule throughout the process since business needs change. Speed is also a major factor in this decision but not always. If more network resources are need to generate data outputs, then there is always a cost associated with it so ROI and importance of the data must be considered.
If reports take 5 seconds to generate and you need to move fast, should you have temp storage?
Example: Total record count is high (millions) and or the number of detail reports is very high (hundreds/thousands).

Pulling Data From Multiple Sources

When to merge and when to pull from separate database is also major consideration. Mixing databases slows queries and pulling from multiple databases can require a lot more calculation and also slow your process. You should consider how many times you need to access the database, how much maintenance and management is required and the ability of your network to process the work flow.

As you list options to consider to troubleshoot your PDI program you may want to check http://infocenter.pentaho.com/help/index.jsp