End to end model of Data Analysis & Prediction using Python on SAP HANA Table data:

This blog helps to connect with SAP HANA DB (Version 1.0 SPS12) then extract the data from HANA table/View and analyze the data using Python Pandas library. Then you can clean and select independent variables/features data to feed the Machine learning algorithms to predict dependent variables or find insights.

In today’s digital economy, businesses cannot take action on stale insights, thus a true in-memory data platform should support real-time processing for transactions and analytics for all of a company’s data. SAP HANA helps to manage data in a single in-memory platform, so you can take action in the moment. Accelerate the pace of innovation and run live in this new digital economy. SAP HANA Capabilities include database services, advanced analytics processing, app development, data access, administration, and openness.

Python is becoming popular in analytics and data science. It is also portable, free, and easy .Python lets you work quickly and integrate systems more effectively. Python has large library base that you can use so you don’t have to write your own code for every single thing. There are libraries for regular expressions, documentation-generation, unit-testing, web browsers, threading, databases, CGI, email, image manipulation, and a lot of other functionality.

Scenario: I am taking the state wise startup company’s expenditure (R&D Spend, Administration Spend, and Marketing Spend) and profit data.
Goal is to predict the Profit for the given set of expenditure values. I am not explaining details about the ML Algorithm and the parameter tuning here. I would like to show the end to end process of Data extraction from SAP HANA DB, analyzing, cleaning, feature selection, and applying machine learning model and finally write back the results and ML algorithm performance metrics to the HANA tables.
The linear regression is the most commonly used model in research and business and is the simplest to understand, so using the random forest regression method we will predict the Profit.
The below diagram shows ML Prediction life cycle and steps fallowed in the use case.


The basic steps involved in this process are:
1.      Check the HANA Table data and analyze it using SQL in HANA Studio/WEB IDE.
(Make sure you have required privileges to do DMLOperations on the tables in SAP HANA Database.)
2.      Import pyodbc, pandas, Sklearn, Matplotlib, seaborn libraries in python.
3.      Create connection to HANA data base and execute required SQL.
4.      Extract all the historical data into data frame object and start analyzing it in Python using pandas.
5.      Do the feature engineering, data cleaning and then feed the final set of independent variables to Machine learning algorithm (Random Forest) to predict dependent variable (Profit).
6.      Analyze the Machine learning algorithm metrics and fine tune for the better accuracy by repeating the step 5.Store the Machine learning algorithm metrics in log table and also update the predicted value of historical data into the HANA Table.
7.      For the new data set, create the python program which reads the new data using pyodbc connection and predict the dependent variable (Profit) and updates the actual transactional table for reporting.
8.      Schedule this program and keep monitoring the model metrics and predicted value.

1.      Check the HANA Table data and analyze it using SQL.

I have created two tables, one contains the actual company data which is used to store the transactional data. Second one to store the metrics of the Machine learning algorithm (Ex: MAE, R Squared, MAPE, RMSE, Accuracy etc.)

Please find the structures of both tables below.
Main transactional data table structure.

ML Metrics table to store the evaluation parameters.

         The historical data (which is lesser than current month) before prediction with state wise expenditure and profit shown below. I added Predicted profit column as well, to store the predicted value by ML Algorithm using python and it is updated as NULL now.
         SELECT * FROM "SCHEMA"."STARTUP_DATA2"

Let us analyze the data using SQL in HANA Studio with available functions, we can find mean, standard deviation,median,max,min and count of nulls using below SQL. (You can try for all the measures)
Select State, sum(Profit),avg(Profit),count(Profit),max(Profit),
count( distinct profit) as "unique",stddev(Profit),min(Profit),median(Profit),
sum(case when Profit IS NULL THEN 1 ELSE 0 END) AS "NO OF NULLS"
from "SCHEMA"."STARTUP_DATA2" group by State
order by State;



2.      Import pyodbc, pandas, Sklearn libraries in python.

Now I am coming to Python scripting interface (Jupyter or spyder).
Import the required libraries.
PYODBC is an open source Python module that makes accessing ODBC databases simple. It implements the DB API 2.0 specification but is packed with even more Pythonic convenience. Using pyodbc, you can easily connect Python applications to data sources with an ODBC driver.
           Typically, pyodbc is installed like any other Python package by running:
           pip install pyodbc
          
            from a Windows DOS prompt or Unix shell.
          
           To install this package with conda run:
           
           conda install -c anaconda pyodbc 
          
           For more information on pyodbc, see the Github Pyodbc Wiki

Seaborn: Seaborn is a graphic library built on top of Matplotlib. It allows to make your charts prettier, and facilitates some of the common data visualization needs
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Numpy: NumPy is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data
Sklearn: Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It features various classificationregression and clustering algorithms including support vector machinesrandom forestsgradient boostingk-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy

3.      Create connection to HANA data base and execute required SQL.
Establish the connection and once a connection has been established, your application can execute selects, inserts, or other ODBC operations supported by your driver and database.

conn = pyodbc.connect('DRIVER={HDBODBC};SERVERNODE=10.90.261.59:30115;SERVERDB=HD1;UID=USERID;PWD=PASSWORD')

#Open connection to SAP HANA and check for count of records.
# check if table has entries
cursor = conn.cursor()
sql_query1 = 'SELECT count(*) FROM SCHEMA.STARTUP_DATA2'
dcount = pd.read_sql(sql_query1, conn)
dcount
print ('Table exists and contains', dcount.head(1), 'records' )

4.      Extract the data into data frame object and start analyzing it in Python using pandas.

Execute the Select SQL and read the data and save it to data frame. Once you get the data into data frame, you can apply all statistical functions to analyze the data as shown below.

#querying the sap hana db data and store in dataframe
sql_query2 = 'SELECT * FROM SCHEMA.STARTUP_DATA2'
df = pd.read_sql(sql_query2, conn)
df.head()

#USING GROUP BY CLAUSE TO ANALYZE THE DATA
df1.groupby('STATE').agg({'PROFIT':['sum','mean','count','max','std','min','median']}).round(0)


5.      Do the feature engineering, data cleaning, feed the final set of independent variables to Machine learning algorithm (Random Forest) to predict dependent variable
Check for the missing values or nulls using the functions and replace the values with mean/mode accordingly.

For Example I have two empty values in Rnd Spend column, so replacing with Mean/Median value. If the column is having more null values, then you can drop the column form the analysis.

Encode the nominal values, for example State has 3 unique values ('New York','California','Florida') replacing them with 0,1,2 codes using map function

Map ({'New York':0,'California':1,'Florida':2})

Now find the correlation between the dependent variable and independent variables to finalize the final independent variables/features for prediction. After observing the data choosing (R&D Spend, Administration Spend, and Marketing Spend) as independent variables and profit as a dependent variable.
Now I am using random forest regression to predict the profit.
Random forests or random decision forests are an ensemble learning method for classification and regression (and other tasks) which operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees .Random forests are bagged decision tree models that train on a subset of features on each split (in each iteration).




6.      Store the Machine learning algorithm metrics in log table and also update the predicted value into the HANA Table.

Using the insert and update statements you can send the data back to the data base tables (Ex: ML Metrics and Predicted values). Take care of the primary key while doing insertions.






Check the data in SAP HANA table to see the updated values in predicted profit column.

7.      For the new data set, create the python program which reads the new data using pyodbc connection and predict the dependent variable (Profit) and updates the actual transactional table for reporting.

In a business new transactional data will be flowing into Tables/data ware house. (In my case SAP HANA SQL data ware house). For the new data set, suppose you want to know what could be the profit if I spend “X” expenditure for next month.

Analyze whether this cost increase/decrease gives you return on investment/profit. Then,

Create the Python script in which you perform following tasks
·         fetch the new dataset and process it through the predictor/model
·         Update predicted value back to the DB.
·         Schedule this python script using windows Scheduler/ python scheduler


8.      Schedule this program and keep monitoring the model metrics and predicted value using HANA Models/reports.

You can schedule the Python script using Windows Scheduler/ using the scheduler as shown below.
# sample piece of code
$ pip install schedule
import schedule
import time

def predict_job(x,y,z):
    Y_pred = rf.predict([[x,y,z]])
    print("Predicted Value is”, Y_pred)

#@daily schedule
schedule.every().day.at("10:30").do(predict_job)

schedule.every().monday.do(predict_job)

while True:
    schedule.run_pending()
    time.sleep(1)

Now in SAP HANA Client tools WEB IDE/ HANA Studio, Create the HANA model to join the main transactional data with dimension tables like (Time, Location) for reporting purpose.
State wise Actual Vs Predicted Profit Comparison Chart.


When using Python IDE’s such as Jupyter, the data is persisted to the client with the above approach and this means more processing time when you have large data set, which leads to drop the productivity of Data Scientists.
This is where the SAP HANA Data Frame can add real value to a Data Scientist’s work. More features and capabilities are included in SAP HANA 2.0 SPS03 Version to analyze / address the data science use cases with Python driver (hdbcli) and then the Python Client API for machine learning algorithms.

Please find the python code in GitHub for reference
References:

Comments

Popular posts from this blog

SAP BW DeltaQueue (RSA7)