After Tableau announced the beta release of TabPy (https://github.com/tableau/TabPy), a new API that enables evaluation of Python code from within a Tableau workbook, I integrated Python with Tableau to build two predictive models: one that predicts whether a patient diagnosed with Hepatitis will survive, and one on HR data that predicts employee attrition, i.e. whether an employee is likely to leave. My colleague Puxin and I created the dashboards following this blog post (https://www.tableau.com/about/blog/2017/1/building-advanced-analytics-applications-tabpy-64916) and tweaked the notebook to show the results.
For the Hepatitis dashboard we used a publicly available dataset with 155 samples, each having 1 target and 20 features. After correlation analysis, we reduced the number of features to 7 to avoid overfitting: 'ALKPHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME', 'BILIRUBIN', 'LIVERBIG', and 'LIVERFIRM'. Scaler transformation and cross-validation were used to make the model more stable and robust. We started by trying a linear SVM and an RBF-kernel SVM, measuring the models with statistics such as f1, accuracy, precision, recall, and roc_auc. With a grid search over the hyperparameters, we reached 0.858 accuracy and 0.706 roc_auc, which is not bad. We then worked on logistic regression and Naive Bayes, going through the same process, but they did not give us a better result. At the end of the experiments, the gradient boosting model gave us the best result, 0.994 accuracy and 0.984 roc_auc, so we decided to use it as the final model in the dashboard. When users input the feature values on the dashboard, the data is sent to the model backend, which returns a prediction; under Result you can see whether the patient is predicted to be Dead or Alive.
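As a rough sketch (not the exact notebook code), here is how the scaler transformation, the cross-validated grid search over the SVM hyperparameters, and the gradient boosting comparison can be wired up with scikit-learn. The synthetic data below is just a stand-in for the real 155-sample, 7-feature Hepatitis data, and the parameter grids are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 155-sample, 7-feature Hepatitis data
X, y = make_classification(n_samples=155, n_features=7, random_state=0)

# Scaler + RBF-kernel SVM pipeline; hyperparameters tuned by
# grid search with 5-fold cross-validation, scored by roc_auc
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    svm,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)

# Gradient boosting, cross-validated the same way for comparison
gbt = GradientBoostingClassifier(random_state=0)
gbt_auc = cross_val_score(gbt, X, y, scoring="roc_auc", cv=5).mean()
```

Running the same grid-search/cross-validation loop for each candidate model keeps the comparison fair, since every model is scored on identical folds.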
For the HR dashboard we used similar models. The dataset contains 1470 samples, each with 1 target, 'Attrition', and 34 associated features. After correlation analysis, we reduced the number of features to 10 to avoid overfitting: 'EducationField', 'OverTime', 'EnvironmentSatisfaction', 'JobSatisfaction', 'StockOptionLevel', 'WorkLifeBalance', 'Age', 'TotalWorkingYears', 'MonthlyIncome', and 'YearsWithCurrManager'. Scaler transformation and cross-validation were again used to make the models more stable and robust. We started with a linear SVM and logistic regression, measuring the models with statistics such as f1, precision, recall, and roc_auc, and chose the best SVM and logistic regression via grid search over the hyperparameters. We also used LightGBM, currently one of the best gradient boosting machines. Finally, we ensembled these 3 models to predict whether an employee is going to stay or leave.
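One way to ensemble the three HR models is soft voting, which averages the predicted class probabilities. The sketch below shows that idea with scikit-learn; the data is synthetic, the exact ensembling method is an assumption (the post does not say how the models were combined), and `GradientBoostingClassifier` stands in for LightGBM so the example is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 1470-sample HR data (10 selected features)
X, y = make_classification(n_samples=1470, n_features=10, random_state=0)

# Soft-voting ensemble of the three tuned models; SVC needs
# probability=True so it can contribute class probabilities
ensemble = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("gbm", GradientBoostingClassifier(random_state=0)),  # stand-in for LightGBM
    ],
    voting="soft",
)
ensemble.fit(X, y)
preds = ensemble.predict(X[:5])  # one stay/leave label per employee
```

Wrapping the scaler into each pipeline keeps the scaling inside the cross-validation boundary, so no information leaks from the held-out folds.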
Below is the calculated field, along with the attached Jupyter notebook containing all the Python code used for model training and evaluation.
Calculated field:
SCRIPT_STR("return tabpy.query('Predict', _arg1, _arg2, _arg3, _arg4, _arg5, _arg6, _arg7, _arg8, _arg9, _arg10)['response']",
[age], [totalWorkingYears], [monthlyIncome], [yearsWithCurrManager], [environmentSatisfaction],
[jobSatisfaction], [stockOptionLevel], [workLifeBalance], [educationField], [overTime])
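For the calculated field to work, the endpoint name 'Predict' in the SCRIPT_STR above must match a function deployed to the TabPy server. Below is a minimal sketch of such a wrapper in the shape TabPy expects: each argument arrives as a list (one value per Tableau row) and the function returns a list of labels. A placeholder rule stands in for the trained ensemble so the sketch is self-contained, and the deployment call is commented out because it needs a running TabPy server:

```python
def predict_attrition(age, total_working_years, monthly_income,
                      years_with_curr_manager, environment_satisfaction,
                      job_satisfaction, stock_option_level,
                      work_life_balance, education_field, over_time):
    # Tableau passes each _argN as a list of values, one per row;
    # assemble the rows, score each one, and return one label per row.
    results = []
    for row in zip(age, total_working_years, monthly_income,
                   years_with_curr_manager, environment_satisfaction,
                   job_satisfaction, stock_option_level,
                   work_life_balance, education_field, over_time):
        # Placeholder rule standing in for ensemble.predict(); the real
        # deployment would encode the row and call the fitted model.
        over_time_flag, employee_age = row[9], row[0]
        results.append("Yes" if over_time_flag == "Yes" and employee_age < 30 else "No")
    return results

# Deployment (requires a TabPy server running on localhost:9004):
# import tabpy_client
# client = tabpy_client.Client('http://localhost:9004/')
# client.deploy('Predict', predict_attrition,
#               'Predicts employee attrition', override=True)
```

Once deployed, every workbook that references tabpy.query('Predict', ...) hits the same model, so redeploying with override=True updates all dashboards at once.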