Classification Case Study
Caret package:
- createDataPartition
- preProcess
- knn3
- trainControl
- train
Tidyverse package:
- slice
Classification Analysis
For predictive modeling, we want to construct a model and evaluate how accurate we expect that model to be on new, unseen data. The outline of a predictive analysis is:
1. Split the data into train and test sets.
2. Transform features in training data.
3. Fit model to training data.
4. Evaluate model on test data.
Train / Test set split
The very first step of predictive modeling is the train/test split. After making the train/test split, we do not want to touch the test data until we are ready to evaluate the model.
The caret library provides a helpful function called createDataPartition that creates the train/test split for us.
# createDataPartition
library(caret)           # createDataPartition, preProcess, knn3, trainControl, train
library(tidyverse)       # slice, mutate, select, summarize
library(palmerpenguins)  # assumed source of the penguins data

set.seed(83234)
train_indices <- createDataPartition(y = penguins$species,
                                     p = 0.7,
                                     list = FALSE)

train_df <- penguins %>%
  slice(train_indices)
test_df <- penguins %>%
  slice(-train_indices)
createDataPartition
This randomly samples the indices of the observations that should go in the training set. Here we’ve specified that 70% of the data go into the training set, leaving 30% for the test set. The $ operator is used to access a specific data frame column: penguins$species pulls out the species column as a vector.
createDataPartition splits the data into training and test sets. The y argument takes the outcome data, p the proportion of the data to use for training, and list whether to return the result as a list (TRUE returns a list, FALSE returns a matrix).
set.seed
set.seed makes sure this random split will be the same every time we run the code, so someone else can run your code and get the same answers. It helps to think of it like a Minecraft world seed.
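As a quick illustration (not from the original note), the same seed reproduces the same random draw:

```r
# setting the same seed before each call makes the draws identical
set.seed(83234)
first_draw <- sample(1:10, 3)
set.seed(83234)
second_draw <- sample(1:10, 3)
identical(first_draw, second_draw)  # TRUE
```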
slice
slice selects rows like the filter function does, but by their index (row number) instead of by a condition on the values of their variables. It keeps the rows at the given positions; passing negative indices excludes those rows instead.
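A minimal sketch of both uses, assuming the penguins data is loaded:

```r
# positive indices keep the listed rows
penguins %>% slice(1:3)     # rows 1 through 3

# negative indices drop the listed rows and keep everything else
penguins %>% slice(-(1:3))  # all rows except 1 through 3
```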
Feature Standardization
Feature is a synonym for variable or column of a data frame. For most data analyses, we transform the original variables to create new variables that are better suited to the task we are interested in. This is often called feature engineering or feature transformation. One of the most common feature transformations is standardization.
We create a standardized version of each variable x with the following code:
# standardization
train_df %>%
  mutate(bill_length_stand = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm)) %>%
  select(bill_length_mm, bill_length_stand)
These values are called z-scores; a z-score measures how far above or below the average bill length a penguin is, on a scale relative to the natural variability of penguin bill lengths.
Feature standardization is important for KNN because different variables can be measured on different scales.
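To see why the scales matter, we can compare the spread of two raw features (a sketch; the column names are from the penguins data, and this comparison is not in the original note):

```r
# body_mass_g varies by hundreds of grams while bill_length_mm varies by a
# few millimeters, so unstandardized Euclidean distance in KNN would be
# dominated by body mass
train_df %>%
  summarize(sd_bill_length = sd(bill_length_mm, na.rm = TRUE),
            sd_body_mass   = sd(body_mass_g, na.rm = TRUE))
```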
Feature standardization code
At this stage, we only want to use the training data. When we eventually evaluate the model on the test dataset, we will need to apply the same feature transformation to the test data that we applied to the training data.
1. Compute and save the mean/standard deviation of each variable from the training data.
2. Transform the training data.
3. Eventually transform the test data using the same mean/std.
We will use the caret package to do all of this.
# Feature standardization code
# compute the mean/std of each variable in the training data frame
standardize_params <- preProcess(train_df, method = c("center", "scale"))

# transform each variable in the training data frame
train_stand <- predict(standardize_params, train_df)
This is the standardization step: each value is transformed so that every variable has mean 0 and standard deviation 1. When preProcess's method includes "center", the mean is subtracted from each variable; when it includes "scale", each variable is divided by its standard deviation. As in the code above, the two methods can be combined to create a single standardization object.
# verify that each column of train_stand has mean 0
train_stand %>%
  select(-species) %>%
  summarize_all(mean)
The summarize_all() function works like summarize(), but applies the given function to every variable.
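As a complementary check (an addition, not in the original code), the same pattern verifies that each standardized column has standard deviation 1:

```r
# verify that each column of train_stand has sd 1
# (assumes train_stand from the preProcess step above)
train_stand %>%
  select(-species) %>%
  summarize_all(sd)
```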
Fitting KNN
Now we fit the KNN model.
# fitting a KNN model using the knn3 function
knn_fit <- knn3(species ~ ., data = train_stand, k = 1)
The notation species ~ . means predict species from all of the remaining variables. If you want to specify particular variables instead, write something like species ~ bill_length_mm + bill_depth_mm.
Evaluating test set error (incomplete)
Finally, we are evaluating our KNN model on the test data set. Since we fit the KNN model on the standardized training data (train_stand), we also need to standardize the test data.
# process the test data
test_stand <- predict(standardize_params, test_df)

# test prediction
species_pred_test <- predict(knn_fit, newdata = test_stand, type = "class")

# test set accuracy
test_stand %>%
  mutate(species_pred = species_pred_test,
         correct_prediction = species_pred == species) %>%
  summarize(accuracy = mean(correct_prediction))
Hyperparameter tuning
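The original note ends here unfinished. As a hedged sketch of what this section would likely cover, caret's trainControl and train functions (both listed at the top of this note) can choose k by cross-validation on the training data. The 5-fold setting and the grid of k values below are illustrative assumptions, not from the original.

```r
# sketch: tune k by 5-fold cross-validation (fold count and k grid assumed)
cv_control <- trainControl(method = "cv", number = 5)

knn_tuned <- train(species ~ .,
                   data = train_stand,
                   method = "knn",
                   trControl = cv_control,
                   tuneGrid = data.frame(k = seq(1, 21, by = 2)))

knn_tuned$bestTune  # the k with the best cross-validated accuracy
```

Because k = 1 (as fit above) memorizes the training data, cross-validation typically selects a larger k that generalizes better to the test set.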