# Tutorial 7 - Pipelines and regression

## 11 important questions on Tutorial 7 - Pipelines and regression

### What are the 5 parts of big data analysis?

- Understanding the business/the problem
- Loading data
- Exploring data
- Visualizing data
- Preparing data

### How can you read a bunch of files into a single dataframe?

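Pass the folder path instead of a single file path; Spark reads every CSV file in the folder into one dataframe:

**data = spark.read.csv('.../foldername/', header=True, inferSchema=True)**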

### How can you check the types of the columns?

- **data.dtypes** returns a list of (column name, type) pairs.
- **data.printSchema()** prints the schema as a tree.

### How can you get a basic statistical summary of your data? What does it show?

- Use **data.describe().show()**.
- It shows count, mean, stddev, min and max per column.
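To make the five summary values concrete, here is a minimal plain-Python sketch that computes them for one column. It is not Spark code: `describe_column` and the sample values are hypothetical, and it assumes that stddev means the sample standard deviation (dividing by n - 1), which to my understanding is what Spark reports.

```python
import statistics


def describe_column(values):
    """Return the summary values that describe() lists for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical column with four values.
summary = describe_column([2.0, 4.0, 4.0, 6.0])
print(summary)
```

On a real dataframe, data.describe().show() prints one such set of values per column.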

### How can you easily visualize the dataframe?

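**display(data)** (a built-in function of Databricks notebooks that renders the dataframe as an interactive table or chart)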

### How can you prepare the data for machine learning?

- Import **VectorAssembler** from pyspark.ml.feature and use it to combine the input columns into a single vector column of features.
- Then type the code:

- from pyspark.ml.feature import VectorAssembler
- columns = ['column1', 'column2', 'column3', 'column4']
- vectorizer = VectorAssembler(inputCols=columns, outputCol="features")
- dataset = vectorizer.transform(data)

### How do you prepare the data and evaluate how well your linear regression model predicts power output?

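Use the **randomSplit()** function to divide the data into a training set and a test set. Train the model on the training set only, then compare its predictions on the test set with the actual values.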
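The notes do not say which metric to use for the comparison. Root mean squared error (RMSE) is one common choice for regression, and the plain-Python sketch below shows what it measures. The function `rmse` and the sample values are hypothetical and for illustration only; in PySpark, RegressionEvaluator from pyspark.ml.evaluation can compute the same metric on a dataframe of predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted power output values for four test rows.
actual_pe = [480.0, 445.0, 460.0, 470.0]
predicted_pe = [478.0, 449.0, 460.0, 466.0]
print(rmse(actual_pe, predicted_pe))
```

A lower RMSE means the predictions are closer to the actual values, in the same unit as the label column.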

### How do you create a linear regression model?

- from pyspark.ml.regression import **LinearRegression**
- from pyspark.ml.regression import **LinearRegressionModel**
- from pyspark.ml import **Pipeline**
- **lr = LinearRegression()**

### Which 2 parameters are not optional in linear regression model and how do you set them?

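- Name of the **label column**, which holds the values to **learn**.
- Name of the **prediction column**, where the **prediction values** should be stored.

Both have default names (label and prediction), so set them whenever your column names differ: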

**lr.setPredictionCol('predicted_PE').setLabelCol('PE')**

### What is a pipeline?

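A pipeline chains a series of stages (transformers and estimators, here the vectorizer and the linear regression) so that they run in order as one unit.

**lrPipeline = Pipeline()**

**lrPipeline.setStages([vectorizer, lr])**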

### How is a linear regression model created that has been trained with the training dataset?

Call fit() on the pipeline with the training set; it returns the trained model:

**lrModel = lrPipeline.fit(trainingSetDF)**
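The steps from these notes can be put together as in the sketch below. It is one possible arrangement, not the tutorial's own code: the function name `train_and_evaluate`, the 80/20 split, the seed and the choice of RMSE as metric are all assumptions. The column names 'PE' and 'predicted_PE' follow the notes. The PySpark imports sit inside the function so that the definition loads even where Spark is not installed.

```python
def train_and_evaluate(data, feature_cols, label_col="PE", seed=42):
    """Fit a linear regression pipeline on `data`; return (model, test RMSE)."""
    # Imported here so that defining the function does not require Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    prediction_col = "predicted_" + label_col

    # Assumed 80/20 split into training and test sets, with a fixed seed.
    training_df, test_df = data.randomSplit([0.8, 0.2], seed=seed)

    # Stage 1: combine the input columns into one vector column.
    vectorizer = VectorAssembler(inputCols=feature_cols, outputCol="features")

    # Stage 2: the regression, told which columns to learn from and write to.
    lr = LinearRegression(
        featuresCol="features",
        labelCol=label_col,
        predictionCol=prediction_col,
    )

    # Chain both stages and train them on the training set only.
    pipeline = Pipeline(stages=[vectorizer, lr])
    model = pipeline.fit(training_df)

    # Predict on the held-out test set and score the predictions.
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(
        labelCol=label_col,
        predictionCol=prediction_col,
        metricName="rmse",
    )
    return model, evaluator.evaluate(predictions)
```

With a dataframe loaded as above, a call could look like lrModel, rmse = train_and_evaluate(data, ['column1', 'column2', 'column3', 'column4']).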

