The last few months I’ve spent on an interesting Data Mining project. The aim of this data science exercise was to help business reduce customer handling time in a contact center. The project itself and the results we’ve achieved will be covered in my next article, but today I would like to share how I solved a technical issue that came up during the project.
For the modeling phase I’ve used R studio and RPART library for building of the classification tree. For further interpretation and usage of rules generated by RPART we used rpart.utils package by Craig Varrichio. I’ve modeled three classification trees with different number of classes. All models were pushed to Oracle database with rpart.rules.push. Models with a little number of classes were pushed to FRAME, RULES and SUBRULES tables without any problems, but one of the models with 28 classes was crashing each time when populating data to the FRAME table. After a bit of research of rpart.rules.push source code I concluded that the reason of crashes was this call to sqlSave function:
For a model with 28 classes rpart.rules.push generates data frame with over 50 columns, and as per my understanding sqlSave function from RODBC package has a limitation for the number of columns for a created table, and this limitation causes this crash of rpart.rules.push. For tests I’ve tried to save data frames with different numbers of columns to a table with the help of sqlSave, and confirmed that it can’t create and load table for data frames with large number of columns.
After that I’ve decided to try the dbWriteTable function from RJDBC library. And a miracle happened! dbWriteTable worked just fine and pushed data frames of different sizes to database without any issues. So I decided to refactor rpart.rules.push utility to use dbWriteTable. Code for rpart.rules.push you can find in my GitHub repository.
Usage of the utility is almost the same as with the original version:
rpart.rules.push.jdbc(fit, jdbcConnection, rulePrefix = "", tablePrefix = "")
The only restriction for this refactored version is this: for each model you have to maintain its own set of FRAME, RULES and SUBRULES tables.
Enjoy! And if you have any questions or comments – be sure to get in touch.
|About the author: Vladimir Loiterstein is an experienced Business Intelligence and DWH consultant, his prime competence is system analysis, data warehouses design, ETL, data migrations, reports and dashboards developments, as well as SQL performance tuning. Being a founder at Idea Port Riga, he has been with the company since the very beginning in 2007.|