Data Analysis with PL/Python

Date: 2016-09-15
Time: 13:30 - 14:20
Room: Ft. Worth 2
Level: Intermediate

PostgreSQL has extensive functionality for allowing users to extend functionality with user-defined functions in a wide-range of languages. Most applications segment their data-storage and data analysis tasks and manage them independently. Doing so is often more difficult and less performant than performing the analysis within the database and can slow the development process. PostgreSQL is traditionally used for storing the data that is the subject of the analysis, but it can also be used to execute the data analysis without having to segment the process into a data collection step followed by a data analysis step. Integrating the data analysis into the database allows for more robust interaction between the data and the analysis and simplifies interaction between different analyses. Python has extensive well-maintained scientific computing and data analysis libraries such as NumPy and SciPy which can be used to quickly develop data analysis applications. The PL/Python procedural language allows users to write functions in python and make use of the vast python data analysis ecosystem within the PostgreSQL database environment.

Case Study of In-Database Analysis: Regression Analysis in PostgreSQL

Performing regression analysis on a data set is a common data analysis task that is often used to fit curves, analyze trends and make projections for the future. A function written in PL/Python using the NumPy and SciPy modules was used to extend PostgreSQL to allow PostgreSQL to compute mean and quantile regressions on a set of x, y points using an arbitrary target function. The function takes the target function (an equation string) with an arbitrary number of unknowns, initial guesses for the unknowns being solved for, an array of x points, and an array of y points as inputs and outputs the regression. The equation form and number of parameters can vary, so the output is provided as JSON.

The PyParsing module allows you to define and execute grammars which makes it well suited to interpreting user input. For example, this allows basic mathematical equations to be easily interpreted. To allow for generalized regression analysis using any target function, the target function needed to be interpreted and executed during the regression analysis. A basic grammar was defined that could interpret basic math and the execution result was used to dynamically build a python function that implemented the equation.

By combining regression analysis and equation parsing in a PostgreSQL PL/Python function a simple query can be used to get the optimized values for b, D, and q0 for the example equation: SELECT regression({“equation”:”q0 / ((1+bDx)^(1/b))”, “initial_guess”: {“b”: 1, “D”: 0.2, “q0”:500}}, array_agg(x_column), array_agg(y_column)) FROM datatable; The result can then be used to construct curves, make projections, and analyze the relationship between variables.


Keith Robertson