Review of "Towards a Unified Architecture for in-RDBMS Analytics"
As the use of statistical analysis in enterprise applications increases, database vendors have begun implementing new statistical techniques from scratch inside the RDBMS, which leads to a lengthy and complex development process. This paper proposes Bismarck, a unified architecture for implementing a variety of machine learning analytics inside existing RDBMSes.
The paper identifies a classical algorithm from the mathematical programming canon, incremental gradient descent (IGD), which is used to solve convex programming problems. The key observation is that IGD's data-access pattern is essentially identical to that of any SQL aggregation function, e.g., SQL AVG. Leveraging this observation, the authors build a unified architecture showing that IGD can be implemented for different models using the user-defined aggregate (UDA) features available in every major RDBMS. Supported analytics tasks include logistic regression, support vector machine classification, recommendation (low-rank matrix factorization), labeling (conditional random fields), Kalman filters, and portfolio optimization.
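To illustrate the observation, here is a minimal sketch (my own, not the paper's actual code) of how IGD fits the initialize/transition/finalize hooks of a user-defined aggregate: the RDBMS streams each row through the transition function exactly as it would for AVG, with the aggregate state being the model itself. Logistic regression is used as the example model, and all names and data are illustrative.

```python
import math

def initialize(dim):
    # Aggregate state: the current model (a weight vector).
    return [0.0] * dim

def transition(w, x, y, lr=0.1):
    # One IGD step per row, here for logistic regression:
    # w <- w + lr * (y - sigmoid(w . x)) * x
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]

def finalize(w):
    # For IGD the state is the result; AVG would instead divide sum by count.
    return w

# Simulated table scan: each row is (feature vector, label),
# and each epoch corresponds to one invocation of the aggregate.
rows = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
w = initialize(2)
for _ in range(50):
    for x, y in rows:
        w = transition(w, x, y)
model = finalize(w)
```

In an actual deployment these three functions would be registered as a UDA and invoked from SQL, so training reduces to a single aggregation query over the data table.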
The paper shows that this approach is 2–4x faster than existing in-database analytics tools on simple tasks, and an order of magnitude faster on some newly added tasks such as matrix factorization.
Although the method proposed in this paper is both easier to implement than existing approaches (building analytics tools from scratch for each RDBMS) and faster at solving the same problems, I have some doubts about how usable the programming interface for different machine learning models will be for end developers. Will analysts buy into programming these complex models in SQL? And are the models flexible enough to cover varying requirements across different kinds of data?