Utilizing #PredictiveAnalytics & #BigData To Improve Accuracy of #EPM Forecasting Process

Great #ApacheSpark use case included for Telecommunications providers

Musings on ML, Deep Learning & AI

I was amazed when I read the @TidemarkEPM awesome new white paper on the “4 Steps to a Big Data Finance Strategy.” This is an area I am very passionate about; some might say, it’s become my soap-box since my days as a Business Intelligence consultant. I saw the dawn of a world where EPM, specifically, the planning and budgeting process was elevated from gut feel analytics to embracing #machinelearning as a means of understanding which drivers are statistically significant from those that have no verifiable impact , and ultimately using those to feed a more accurate forecast model.

Big Data Finance

Traditionally (even still today), finance teams sit in a conference room with Excel spreadsheets from Marketing, Customer Service etc., and basically, define the current or future plans based on past performance mixed with a sprinkle of gut feel (sometimes, it was more like a gallon of gut feel to every tablespoon of historical data). In these same meetings just…

View original post 811 more words


*FREE* edX Courses Offered by Berkeley on Spark Fundamentals & Programming with PySpark

**Did I mention it is free?**



Machine learning aims to extract knowledge from data and enables a wide range of applications. With datasets rapidly growing in size and complexity, learning techniques are fast becoming a core component of large-scale data processing pipelines. This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including feature extraction, supervised learning, model evaluation, and exploratory data analysis. Students will gain hands-on experience applying these principles by using Apache Spark to implement several scalable learning pipelines.


Programming background; comfort with mathematical and algorithmic reasoning; familiarity with basic machine learning concepts; exposure to algorithms, probability, linear algebra and calculus; experience with Python (or the ability to learn it quickly). All exercises will use PySpark, but previous experience with Spark or distributed computing is NOT required. You should take this Python quiz before the course and take this Python mini-course if you need to learn Python or refresh your Python knowledge. This self-assessment document provides online resources that review additional relevant background material.

WEEK 1: Course Overview and Introduction to Machine Learning – Launches June 29 at 16:00 UTC

  • Topics: Course goals, Apache Spark overview, basic machine learning concepts, steps of typical supervised learning pipelines, linear algebra review, computational complexity / big O notation review.
  • Lab 1: NumPy, Linear Algebra, and Lambda Function Review. Gain hands on experience using Python’s scientific computing library to manipulate matrices and vectors, and learn about lambda functions which will be used throughout the course. (Due July 11, 2015 at 07:00 UTC)

WEEK 2: Introduction to Apache Spark – Launches June 29 at 16:00 UTC

  • Topics: Big data and hardware trends, history of Apache Spark, Spark’s Resilient Distributed Datasets (RDDs), transformations, and actions.
  • Lab 2: Learning Apache Spark. Perform your first course lab where you will learn about the Spark data model, transformations, and actions, and write a word counting program to count the words in all of Shakespeare’s plays.  (Due July 11, 2015 at 07:00 UTC)
  • Note to CS100.1x students: This material is identical to Week 2 of BerkeleyX CS100.1x, and if you’ve already completed Lab 1 of CS100.1x you can simply submit your completed notebook to receive credit.

WEEK 3: Linear Regression and Distributed Machine Learning Principles – Launches July 6 at 16:00 UTC

  • Topics: Linear regression formulation and closed-form solution, distributed machine learning principles (related to computation, storage, and communication), gradient descent, quadratic features, grid search.
  • Lab 3: Millionsong Regression Pipeline. Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. You will implement a gradient descent solver for linear regression, use Spark’s machine Learning library ( mllib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition. (Due July 18, 2015 at 07:00 UTC)

WEEK 4: Logistic Regression and Click-through Rate Prediction  – Launches July 13 at 16:00 UTC

  • Topics: Online advertising, linear classification, logistic regression, working with probabilistic predictions, categorical data and one-hot-encoding, feature hashing for dimensionality reduction.
  • Lab 4: Click-through Rate Prediction Pipeline. Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. You will extract numerical features from the raw categorical data using one-hot-encoding, reduce the dimensionality of these features via hashing, train logistic regression models using mllib, tune hyperparameter via grid search, and interpret probabilistic predictions via a ROC plot. (Due July 25, 2015 at 07:00 UTC)

WEEK 5: Principal Component Analysis and Neuroimaging – Launches July 20 at 16:00 UTC

  • Topics: Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulations and solution, distributed PCA.
  • Lab 5: Neuroimaging Analysis via PCA – Identify patterns of brain activity in larval zebrafish. You will work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish’s neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch and gaining intuition by working with synthetic data, you will use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli.  (Due August 1, 2015 at 07:00 UTC)