☞ http://ya-n-ds.tistory.com/3230 : Deep Learning from Scratch (1) - Python, Perceptron
☞ http://ya-n-ds.tistory.com/3231 : Deep Learning from Scratch (2) - Neural Network
# Reference : Deep Learning from Scratch ( Koki Saito, Hanbit Media )
Chap 4. Neural Network Training
cf. Perceptron : can learn linearly separable problems
4.1.1 Data-driven learning
- Exclude human intervention as much as possible : a strength in pattern recognition
- Feature extraction (converter, vector format) -> learning
- Machine learning approaches
. Human-designed algorithm (e.g. Perceptron) -> result
. Human-designed features (SIFT, HOG, etc.) -> machine learning (SVM, KNN, etc.) -> result
. Neural network (deep learning) -> result // End-to-end machine learning : every problem is approached in the same way
4.1.2 Training data and test data
- Training data to optimize the parameters + test data to evaluate the trained parameters
- Overfitting : parameters over-optimized for one specific dataset
4.2 Loss Function ( or Cost Function )
- The criterion used to search for the optimal parameters
4.2.1 Mean Squared Error(MSE)
E = sum_k( (y_k - t_k)^2 ) / 2
e.g. y_k (k-th output of the network), t_k (k-th element of the label)
y = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]
t = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] // One-hot encoding
def mean_squared_error(y, t):
return 0.5 * np.sum((y-t)**2)
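Quick check with the arrays above; a second output array, where '7' wrongly gets the highest score, is added here for comparison (output values approximate):
mean_squared_error(np.array(y), np.array(t)) # ≈ 0.0975 : '2' gets the highest score -> small error
y_bad = [0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0] # '7' gets the highest score
mean_squared_error(np.array(y_bad), np.array(t)) # ≈ 0.5975 : wrong prediction -> larger error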
4.2.2 Cross Entropy Error(CEE)
E = -sum_k( t_k * log(y_k) )
def cross_entropy_error(y, t):
delta = 1e-7 # Prevention of log(0)
return -np.sum(t * np.log(y+delta))
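Using the same y, t and y_bad arrays as in the MSE example above (output values approximate):
cross_entropy_error(np.array(y), np.array(t)) # ≈ 0.51 ( ≈ -log(0.6) : confident, correct prediction )
cross_entropy_error(np.array(y_bad), np.array(t)) # ≈ 2.30 ( ≈ -log(0.1) : wrong prediction -> larger loss )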
4.2.3 Mini-batch learning
E = -sum_n( sum_k( t_nk * log(y_nk) ) ) / N // mean loss over N samples -> independent of the number of data
- Random selection of data for mini-batch
import sys, os
sys.path.append(os.pardir)
import numpy as np
from dataset.mnist import load_mnist # from dataset/mnist.py
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)
print(x_train.shape) # (60000, 784)
print(t_train.shape) # (60000, 10)
train_size = x_train.shape[0] # size of 1st dimension (60000)
batch_size = 10
batch_mask = np.random.choice(train_size, batch_size) # random 'batch_size' selection out of 'train_size' data
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]
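Quick shape check of the selected mini-batch (follows from the slicing above):
print(x_batch.shape) # (10, 784)
print(t_batch.shape) # (10, 10)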
4.2.4 CEE for batch data
def cross_entropy_error(y, t):
if y.ndim == 1:
        t = t.reshape(1, t.size) # same as np.reshape(t, (1, t.size))
y = y.reshape(1, y.size)
delta = 1e-7 # Prevention of log(0)
batch_size = y.shape[0]
return -np.sum(t * np.log(y+delta)) / batch_size
    # return -np.sum( np.log(y[np.arange(batch_size), t] + delta) ) / batch_size
    # in case 't' holds integer class labels instead of one-hot vectors
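A small usage sketch with a made-up 2-sample batch; with one-hot 't' the active return is used:
y_ex = np.array([[0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0],
                 [0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0]])
t_ex = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]) # one-hot : answers are '2' and '7'
cross_entropy_error(y_ex, t_ex) # ≈ 0.51 ( mean of -log(0.6) and -log(0.6) )
# with integer labels np.array([2, 7]) and the commented-out return, the result is the same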
4.2.5 Why Loss Function?
- Find the parameters (weights, biases) that make the loss function value as small as possible
- Derivative (gradient) values are used for the optimization
- Derivative of the loss function with respect to a parameter
If < 0 -> increase the parameter value to decrease the loss
If > 0 -> decrease the parameter value to decrease the loss
If == 0 -> optimum candidate (the update stops)
- 'Accuracy' as the criterion : its derivative is '0' at most places and its value changes discontinuously -> the network cannot be trained with it
4.3 Numerical differentiation
def numerical_diff(f, x):
    h = 1e-4 # too small a value (e.g. np.float32(1e-50)) is rounded to '0.0'
    return (f(x+h) - f(x-h)) / (2*h) # central difference <-> forward difference (f(x+h)-f(x))/h
cf. dy/dx : analytic differentiation
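A tiny check (not from the book) of why the central difference is preferred; for f(x) = x^2 the true derivative at x = 3 is 6:
def f_sq(x):
    return x**2
h = 1e-4
print((f_sq(3+h) - f_sq(3)) / h) # ≈ 6.0001 : forward difference, error ~h
print((f_sq(3+h) - f_sq(3-h)) / (2*h)) # ≈ 6.0000 : central difference, much closer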
4.3.2 Example of numerical differentiation
import numpy as np
import matplotlib.pylab as plt
def function_1(x):
return 0.01*x**2 + 0.1*x
def tangent_line(f, x):
d = numerical_diff(f, x)
print(d)
y = f(x) - d*x # y intercept
    return lambda t: d*t + y # 'lambda' makes a small anonymous function (like a one-line 'def')
x = np.arange(0.0, 20.0, 0.1)
y = function_1(x)
numerical_diff(function_1, 5) # ≈ 0.2 ( analytic : 0.02*5 + 0.1 )
numerical_diff(function_1, 10) # ≈ 0.3 ( analytic : 0.02*10 + 0.1 )
tf = tangent_line(function_1, 5)
y2 = tf(x)
tf = tangent_line(function_1, 10)
y3 = tf(x)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.plot(x, y)
plt.plot(x, y2)
plt.plot(x, y3)
plt.show()
4.3.3 Partial differentiation
f(x0, x1) = x0^2 + x1^2
def function_2(x, y):
    return x**2 + y**2 # same as np.sum(x**2) when x = np.array([x0, x1]) is passed
grid = np.arange(-3.0, 3.0, 0.1)
x0, x1 = np.meshgrid(grid, grid)
z = function_2(x0, x1)
fig = plt.figure()
ax = fig.add_subplot(projection='3d') # 3D axes ( fig.gca(projection='3d') in older matplotlib )
ax.plot_surface(x0, x1, z)
plt.show()
// Partial derivative with respect to x0 at x0=3, x1=4
def function_tmp1(x0): # x1 fixed at 4.0
    return x0*x0 + 4.0**2.0
numerical_diff(function_tmp1, 3.0) # at x0 = 3.0 -> ≈ 6.0
// Partial derivative with respect to x1 at x0=3, x1=4
def function_tmp2(x1): # x0 fixed at 3.0
    return 3.0**2 + x1*x1
numerical_diff(function_tmp2, 4.0) # at x1 = 4.0 -> ≈ 8.0
4.4 Gradient
- gradient : the vector of partial derivatives with respect to all variables
def function_2(x):
if x.ndim == 1:
return np.sum(x**2)
else:
return np.sum(x**2, axis=1)
def numerical_gradient(f, x):
h = 1e-4
grad = np.zeros_like(x) # same-type array as 'x'
for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = tmp_val + h # perturb only the idx-th element : (x0+h, x1), then (x0, x1+h)
        fxh1 = f(x) # f(x0+h, x1) for idx=0, f(x0, x1+h) for idx=1
        x[idx] = tmp_val - h
        fxh2 = f(x) # f(x0-h, x1) for idx=0, f(x0, x1-h) for idx=1
        grad[idx] = (fxh1 - fxh2) / (2*h) # df/dx0, then df/dx1
x[idx] = tmp_val
return grad
numerical_gradient(function_2, np.array([3.0, 4.0])) # array([6., 8.])
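Two more sample points (the gradient of x0^2 + x1^2 is simply (2*x0, 2*x1)):
numerical_gradient(function_2, np.array([0.0, 2.0])) # array([0., 4.])
numerical_gradient(function_2, np.array([3.0, 0.0])) # array([6., 0.])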
x0 = np.arange(-2, 2.5, 0.25)
x1 = np.arange(-2, 2.5, 0.25)
X, Y = np.meshgrid(x0, x1)
X = X.flatten()
Y = Y.flatten()
# numerical_gradient above handles only 1-D input, so evaluate it point by point
grad = np.array([numerical_gradient(function_2, np.array([x, y])) for x, y in zip(X, Y)]).T # grad[0]=df/dx0, grad[1]=df/dx1
plt.figure()
plt.quiver(X, Y, -grad[0], -grad[1], angles="xy",color="#666666")
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('x0')
plt.ylabel('x1')
plt.grid()
plt.legend()
plt.draw()
plt.show()
4.4.1 Gradient method (gradient descent)
- Optimal parameters are where the loss function takes its minimum value
cf. Gradient = 0 at a global minimum/maximum, a local minimum/maximum, or a saddle point
cf. plateau : flat region where learning makes no progress
- Learning rate : too big -> divergence, too small -> too many iterations required
x0 = x0 - η(df/dx0)
x1 = x1 - η(df/dx1)
def gradient_descent(f, init_x, lr=0.01, step_num=100): # lr : learning rate (η), step_num : number of updates
x = init_x
for i in range(step_num):
grad = numerical_gradient(f, x)
x -= lr*grad
return x
def function_2(x):
return x[0]**2 + x[1]**2
init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=0.01, step_num=100)
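The runs below mirror the learning-rate remark above (output values approximate):
init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=0.1, step_num=100) # ≈ array([-6.1e-10, 8.1e-10]) : converges near (0, 0)
init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=10.0, step_num=100) # huge values : diverges
init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=1e-10, step_num=100) # ≈ array([-3.0, 4.0]) : barely moves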
4.4.2 Gradient in a neural network
e.g. weight W (shape 2x3), loss function L -> dL/dW (same shape 2x3)
dL/dw11, dL/dw12, dL/dw13
dL/dw21, dL/dw22, dL/dw23
import sys, os
sys.path.append(os.pardir)
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient
class simpleNet:
def __init__(self):
self.W = np.random.randn(2,3) # Standard Normal Distribution ( mean=0, std=1 )
def predict(self, x):
return np.dot(x, self.W) # x : 1x2 or Nx2(batch)
def loss(self, x, t):
z = self.predict(x)
        y = softmax(z) # y_k = exp(a_k)/sum_i(exp(a_i)) ; implemented with max(a) subtracted to prevent overflow
loss = cross_entropy_error(y, t) # E = -sum(t_k*log(y_k))
return loss
net = simpleNet()
print(net.W)
x = np.array([0.6, 0.9])
p = net.predict(x)
np.argmax(p) # Index of Max element
t = np.array([0, 0, 1]) # Label of correct value
net.loss(x,t)
def f(W): # 'W' is a dummy argument : net.loss(x, t) reads the perturbed net.W directly
    return net.loss(x, t)
# f = lambda w: net.loss(x, t)
dW = numerical_gradient(f, net.W) # common.gradient's version handles the 2x3 array
print(dW)
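A minimal sketch (beyond this section of the book) of using dW for one gradient-descent update, assuming a learning rate of 0.1:
net.W -= 0.1 * dW # move W against the gradient
print(net.loss(x, t)) # the loss should be slightly smaller than before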
------