Skip to content

Latest commit

 

History

History
132 lines (89 loc) · 7.53 KB

README.md

File metadata and controls

132 lines (89 loc) · 7.53 KB

CE_JSM2017: Preparing Statistician/Statistics Graduates to Be Data Scientist

General information

Learning Outcome

  • Understand data science is the process of defining the problem with business knowledge, gathering needed data from various sources, developing models, extracting insight and recommend actions.
  • Get familiar with the cloud-based big data platform (Hadoop/ Hive/ Spark/ GPU etc.) that are widely used in the development and production setting for industry and know how to quickly transit from academia environment to enterprise environment.
  • Get familiar with data extraction, transformation and load from various database systems such that participates can learn how to become self- sufficient to get needed data in enterprise environment.
  • Understand how to leverage interactive dashboard to present insight and results and communicate efficient with business partners and customers.
  • Understand what data scientists “in the wild” are doing to better prepared to be successfully data scientist in the future.
  • Learn how to encode a real problem to a data science problem, search for the right data, preprocess data and deploy of analytical results through a case study.
  • Get familiar with how to achieve high-performance computing for standard statistical procedures with big data infrastructure.

Event

Joint Statistical Meetings 2017, Continuing Education Course, Baltimore, Maryland

Venue

H-Holiday Ballroom 1

Instructor

Hui Lin (website, Github, Twitter)

Ming Li Email: ([email protected])

Requirements

The prerequisite knowledge is MS level education in statistics and entry level of R-Studio. Please bring a laptop with the latest version of R and Rstudio installed (see more below for the technical setup).

Time schedule

Time Topic
Slot 1 8.30 a.m. - 10.15 a.m. Introduction, The art of data science, Big Data Cloud Platform
Slot 2 10.30 a.m. - 12.30 a.m. Big Data Cloud Platform, Data Wrangling
Slot 3 2.00 p.m. - 3.15 p.m. Data Pre-processing, Markdown, HTML5
Slot 4 3.30 p.m. - 5.00 p.m. Shiny Dashboard, Soft Skills for Data Scientists, Question and Evaluation

R Code

Accompanying Book

Set up

Obtaining R and RStudio

To run the code in this workshop, you will need to install both R and the RStudio IDE. Both are open source, free, and easy to install. They are already available on our lab computers, but you can install manually on your own machine from the following sources:

Required packages

We are going to need a couple of packages from CRAN: You can install them all by executing the following code chunk in RStudio:

p_needed <- c("plyr", "dplyr", "caret","e1071","gridExtra","lattice","imputeMissings",
              "RANN","corrplot","nnet","car","gpairs","reshape2","psych","tidyr",
              "ggplot2","sparklyr","readr","mvtnorm","MASS","data.table","magrittr",
              "shiny","metricsgraphics","DT","leaflet","dygraphs","xts","lubridate",
              "highcharter","rbokeh","networkD3","threejs","knitr","formatR")
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}

lapply(p_needed, require, character.only = TRUE)

Apply Free Databricks Community Edition Account

There are many cloud computing environments such as Amazon’s AWS which provides a complete list of functions for heavy-duty enterprise applications. For beginners, Databricks provides an easy to use cloud system for learning purpose. Databricks is a company founded by the creators of Apache Spark and it provides a user-friendly web-based notebook environment that can create Hadoop/Spark/GPU cluster on the fly to run R/Python/Scala/SQL. We will use Databricks’ community edition to run demos for this training.

Addtional Setup

In case we cannot connect to Databricks during the training. We can also simulate Spark System and use Sparklyr in personal computer such as your own laptop and desktop. To use the simulated Spark system, we will need the following steps

# Installing sparklyr takes a few minutes, 
# because it installs +10 dependencies.

if (!require("sparklyr")) {
  install.packages("sparklyr")  
}

# Load sparklyr package.
library(sparklyr)

## download and create a local Spark environment, it will take a little while to finish
spark_install(version = "1.6.2")

# create a sparklyr connection, it will take a little while to run. 
# If there is no error message, the installation is sucessful.

sc <- spark_connect(master = "local")

Live updates