Scalable in-database machine learning with PL/Python

Date: 2017-09-08
Time: 11:30 - 12:20
Room: Market Street
Level: Intermediate

Enterprises that wish to build data-driven machine learning applications need technology that can process large volumes and varieties (structured, unstructured) of data efficiently and economically. The PyData ecosystem offers vast collections of libraries for many specialized domains, and these are almost universally adopted by data scientists and engineers. Harnessing these libraries on a scalable computation platform would help enterprises rapidly derive value from their data. In-database analytics brings the computation to where the data resides, thereby reducing transport costs and I/O bottlenecks. PL/Python is the glue that binds the rich set of libraries in the PyData stack to the data residing in a Postgres database.

In this talk I'll demonstrate how to harness the power of Python and its ecosystem of data science and machine learning libraries to build statistical models for machine learning applications, in-database on Postgres. I will begin with an overview of PL/Python on Postgres and describe concepts such as User Defined Functions (UDFs) and User Defined Aggregates (UDAs) in detail. I will then describe data-parallel and model-parallel machine learning problems with real-world applications in Natural Language Processing (NLP), and illustrate with examples how to leverage libraries such as numpy, scipy, and scikit-learn to solve them through PL/Python.
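
To give a flavor of the aggregate idea, here is a minimal sketch in plain Python of the pieces a data-parallel UDA is typically built from: a transition step that folds one row into a running state, a merge step that combines states from different data segments, and a final step that turns the state into a result. It is an illustration under stated assumptions, not the code from the talk: the function names are hypothetical, the example fits a one-variable least-squares line rather than an NLP model, and in Postgres each body would presumably be wrapped in a `CREATE FUNCTION ... LANGUAGE plpython3u` definition and registered through `CREATE AGGREGATE`.

```python
# Sketch only: all function names here are hypothetical.
# Each function stands in for the body of a PL/Python function; the loop
# at the bottom simulates what the database would do across two segments.

def linreg_transition(state, x, y):
    """Fold one (x, y) row into the running sums [n, sum_x, sum_y, sum_xx, sum_xy]."""
    n, sx, sy, sxx, sxy = state
    return [n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y]


def linreg_merge(a, b):
    """Combine two partial states computed independently on different segments."""
    return [u + v for u, v in zip(a, b)]


def linreg_final(state):
    """Convert the accumulated sums into (slope, intercept) of the least-squares line."""
    n, sx, sy, sxx, sxy = state
    denom = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def aggregate_segment(rows):
    """Run the transition step over every row of one segment."""
    state = [0, 0.0, 0.0, 0.0, 0.0]
    for x, y in rows:
        state = linreg_transition(state, x, y)
    return state


# Toy data lying exactly on y = 2x + 1, split across two simulated segments.
rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
seg_a = aggregate_segment(rows[:2])
seg_b = aggregate_segment(rows[2:])

slope, intercept = linreg_final(linreg_merge(seg_a, seg_b))
print(slope, intercept)
```

Because the state is just a handful of sums, each segment can be processed independently and only the small states need to be combined, which is what makes this pattern data-parallel.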


Srivatsan Ramanujam