End to end model of Data Analysis & Prediction using Python on SAP
HANA Table data:
This blog helps to connect with SAP HANA DB (Version 1.0
SPS12) then extract the data from HANA table/View and analyze the data using Python
Pandas library. Then you can clean and select independent variables/features data
to feed the Machine learning algorithms to predict dependent variables or find
insights.
In today’s digital economy, businesses cannot take action on
stale insights, thus a true in-memory data platform should support real-time
processing for transactions and analytics for all of a company’s data. SAP HANA
helps to manage data in a single in-memory platform, so you can take action in
the moment. Accelerate the pace of innovation and run live in this new digital
economy. SAP HANA Capabilities include database services, advanced analytics
processing, app development, data access, administration, and openness.
Python is becoming popular in analytics and data science. It
is also portable, free, and easy .Python lets you work quickly and integrate
systems more effectively. Python has large library base that you can use so you
don’t have to write your own code for every single thing. There are libraries
for regular expressions, documentation-generation, unit-testing, web browsers,
threading, databases, CGI, email, image manipulation, and a lot of other functionality.
Scenario: I am taking the state wise startup company’s
expenditure (R&D Spend, Administration Spend, and Marketing Spend) and
profit data.
Goal is to predict the Profit for the given set of expenditure
values. I am not explaining details about the ML Algorithm and the parameter
tuning here. I would like to show the end to end process of Data extraction from
SAP HANA DB, analyzing, cleaning, feature selection, and applying machine
learning model and finally write back the results and ML algorithm performance
metrics to the HANA tables.
The linear regression is the most commonly used model in
research and business and is the simplest to understand, so using the random
forest regression method we will predict the Profit.
The below diagram shows ML Prediction life cycle and steps
fallowed in the use case.
The basic steps involved in this process are:
1.
Check the HANA Table data and analyze it
using SQL in HANA Studio/WEB IDE.
(Make sure you have required privileges
to do DMLOperations on the tables in SAP HANA Database.)
2.
Import pyodbc, pandas, Sklearn,
Matplotlib, seaborn libraries in python.
3.
Create connection to HANA data base and
execute required SQL.
4.
Extract all the historical data into
data frame object and start analyzing it in Python using pandas.
5.
Do the feature engineering, data
cleaning and then feed the final set of independent variables to Machine
learning algorithm (Random Forest) to predict dependent variable (Profit).
6.
Analyze the Machine learning algorithm
metrics and fine tune for the better accuracy by repeating the step 5.Store the
Machine learning algorithm metrics in log table and also update the predicted
value of historical data into the HANA Table.
7.
For the new data set, create the python
program which reads the new data using pyodbc connection and predict the
dependent variable (Profit) and updates the actual transactional table for
reporting.
8.
Schedule this program and keep
monitoring the model metrics and predicted value.
1.
Check the HANA Table data and analyze it using SQL.
I have created two tables, one contains
the actual company data which is used to store the transactional data. Second
one to store the metrics of the Machine learning algorithm (Ex: MAE, R Squared, MAPE, RMSE, Accuracy etc.)
Please find the structures of both
tables below.
Main transactional data table structure.
ML Metrics table to store the evaluation
parameters.
The historical
data (which is lesser than current month) before prediction with state wise
expenditure and profit shown below. I added Predicted profit column as well, to
store the predicted value by ML Algorithm using python and it is updated as
NULL now.
SELECT * FROM
"SCHEMA"."STARTUP_DATA2"
Let us analyze the data using SQL in HANA Studio with
available functions, we can find mean, standard deviation,median,max,min and
count of nulls using below SQL. (You can try for all the measures)
Select State, sum(Profit),avg(Profit),count(Profit),max(Profit),
count( distinct profit) as "unique",stddev(Profit),min(Profit),median(Profit),
sum(case when Profit IS NULL THEN 1 ELSE 0 END) AS "NO OF
NULLS"
from "SCHEMA"."STARTUP_DATA2" group by State
order
by State;
2.
Import pyodbc, pandas, Sklearn libraries in python.
Now I am coming to Python scripting
interface (Jupyter or spyder).
Import the required libraries.
PYODBC is an open source Python
module that makes accessing ODBC databases simple. It implements the DB API 2.0
specification but is packed with even more Pythonic convenience. Using pyodbc,
you can easily connect Python applications to data sources with an ODBC driver.
Typically, pyodbc is installed like
any other Python package by running:
pip install pyodbc
from a Windows DOS prompt or Unix
shell.
To install this package with conda
run:
conda install -c anaconda
pyodbc
For more information on pyodbc, see
the Github Pyodbc Wiki.
Seaborn: Seaborn
is a graphic library built on top of Matplotlib. It allows to make your charts
prettier, and facilitates some of the common data visualization needs
Pandas: pandas is
an open source, BSD-licensed library providing high-performance, easy-to-use
data structures and data analysis tools for the Python programming language.
Numpy: NumPy is
the fundamental package for scientific computing with Python. Besides its
obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data
Sklearn: Scikit-learn provides a range of supervised and
unsupervised learning algorithms via a consistent interface in Python. It
features various classification, regression and clustering algorithms
including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy
3.
Create connection to HANA data base and execute required SQL.
Establish the connection and once a
connection has been established, your application can execute selects, inserts,
or other ODBC operations supported by your driver and database.
conn = pyodbc.connect('DRIVER={HDBODBC};SERVERNODE=10.90.261.59:30115;SERVERDB=HD1;UID=USERID;PWD=PASSWORD')
#Open connection to SAP HANA and check
for count of records.
# check if table has entries
cursor = conn.cursor()
sql_query1 = 'SELECT count(*) FROM
SCHEMA.STARTUP_DATA2'
dcount = pd.read_sql(sql_query1, conn)
dcount
print ('Table exists and contains',
dcount.head(1), 'records' )
4.
Extract the data into data frame object and start analyzing it in Python
using pandas.
Execute the Select SQL and read the data
and save it to data frame. Once you get the data into data frame, you can apply
all statistical functions to analyze the data as shown below.
#querying the sap hana db data and store
in dataframe
sql_query2 = 'SELECT * FROM SCHEMA.STARTUP_DATA2'
df = pd.read_sql(sql_query2, conn)
df.head()
#USING GROUP BY CLAUSE TO ANALYZE THE
DATA
df1.groupby('STATE').agg({'PROFIT':['sum','mean','count','max','std','min','median']}).round(0)
5.
Do the feature engineering, data cleaning, feed the final set of
independent variables to Machine learning algorithm (Random Forest) to predict
dependent variable
Check for the missing values or nulls
using the functions and replace the values with mean/mode accordingly.
For Example I have two empty values in
Rnd Spend column, so replacing with Mean/Median value. If the column is having
more null values, then you can drop the column form the analysis.
Encode the nominal values, for example
State has 3 unique values ('New York','California','Florida') replacing them
with 0,1,2 codes using map function
Map ({'New
York':0,'California':1,'Florida':2})
Now find the correlation between the dependent variable and
independent variables to finalize the final independent variables/features for
prediction. After observing the data choosing (R&D Spend, Administration
Spend, and Marketing Spend) as independent variables and profit as a dependent
variable.
Now I am using random forest regression to predict the profit.
Random forests or random decision forests are
an ensemble learning method
for classification and regression (and other tasks) which operates by constructing
a multitude of decision trees at training time and outputting the class that
is the mode of the classes (classification) or mean prediction
(regression) of the individual trees .Random forests are bagged
decision tree models that train on
a subset of features on
each split (in each iteration).
6.
Store the Machine learning algorithm metrics in log table and also
update the predicted value into the HANA Table.
Using the insert and update statements
you can send the data back to the data base tables (Ex: ML Metrics and
Predicted values). Take care of the primary key while doing insertions.
Check the data in SAP HANA table to see the updated values in predicted
profit column.
7.
For the new data set, create the python program which reads the new data
using pyodbc connection and predict the dependent variable (Profit) and updates
the actual transactional table for reporting.
In a business new transactional data
will be flowing into Tables/data ware house. (In my case SAP HANA SQL data ware
house). For the new data set, suppose you want to know what could be the profit
if I spend “X” expenditure for next month.
Analyze whether this cost increase/decrease
gives you return on investment/profit. Then,
Create the Python script in which you
perform following tasks
·
fetch the new dataset and process it
through the predictor/model
·
Update predicted value back to the DB.
·
Schedule this python script using
windows Scheduler/ python scheduler
8.
Schedule this program and keep monitoring the model metrics and
predicted value using HANA Models/reports.
You can schedule the Python script using
Windows Scheduler/ using the scheduler as shown below.
# sample piece of code
$ pip install schedule
import schedule
import time
def predict_job(x,y,z):
Y_pred =
rf.predict([[x,y,z]])
print("Predicted
Value is”, Y_pred)
#@daily schedule
schedule.every().day.at("10:30").do(predict_job)
schedule.every().monday.do(predict_job)
while True:
schedule.run_pending()
time.sleep(1)
Now in SAP HANA Client tools WEB IDE/ HANA Studio, Create the
HANA model to join the main transactional data with dimension tables like (Time,
Location) for reporting purpose.
State wise Actual Vs Predicted Profit Comparison Chart.
When using Python IDE’s such as Jupyter, the data is
persisted to the client with the above approach and this means more processing
time when you have large data set, which leads to drop the productivity of Data
Scientists.
This is where the SAP HANA Data Frame can add real value to a
Data Scientist’s work. More features and capabilities are included in SAP HANA
2.0 SPS03 Version to analyze / address the data science use cases with Python driver
(hdbcli) and then the Python Client API for machine learning algorithms.
Please find the python code in GitHub for reference
Published in Medium:
Blog post at SAP:
References:
Comments
Post a Comment