Feedforward Neural Networks

This page explains various ways of implementing single-layer and multi-layer neural networks, as supplementary material for this lecture. The implementations are ordered from explicit to abstract so that one can understand the internal processing that deep learning frameworks usually hide as a black box.

In order to focus on the internals, this page uses a simple and classic example: threshold logic units. Interpreting $x=0$ as false and $x=1$ as true, single-layer neural networks can realize logic units such as AND ($\wedge$), OR ($\vee$), NOT ($\lnot$), and NAND ($|$). Multi-layer neural networks can realize logical compounds such as XOR.

| $x_1$ | $x_2$ | AND | OR | NAND | XOR |
|:-----:|:-----:|:---:|:--:|:----:|:---:|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |

Single-layer perceptron

We consider a single-layer perceptron that predicts a binary label $\hat{y} \in \{0, 1\}$ for a given input vector $\boldsymbol{x} \in \mathbb{R}^d$ ($d$ denotes the number of input dimensions) by using the following formula,

\[\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x} + b) = g(w_1 x_1 + w_2 x_2 + ... + w_d x_d + b)\]

Here, $\boldsymbol{w} \in \mathbb{R}^d$ is a weight vector; $b \in \mathbb{R}$ is a bias weight; and $g(\cdot)$ denotes the Heaviside step function (we assume $g(0)=0$).

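Concretely, the step function used throughout this page is,

\[g(a) = \begin{cases} 1 & (a > 0) \\ 0 & (a \le 0) \end{cases}\]
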
Let’s train a perceptron to act as a NAND gate with two inputs ($d = 2$). More specifically, we want to find a weight vector $\boldsymbol{w}$ and a bias weight $b$ of a single-layer perceptron that realize the truth table of the NAND gate: $\{0,1\}^2 \to \{0,1\}$.

We convert the truth table into a training set consisting of all mappings of the NAND gate,

\[\boldsymbol{x}_1 = (0, 0), y_1 = 1 \\ \boldsymbol{x}_2 = (0, 1), y_2 = 1 \\ \boldsymbol{x}_3 = (1, 0), y_3 = 1 \\ \boldsymbol{x}_4 = (1, 1), y_4 = 0 \\\]

In order to train a weight vector and bias weight in a unified code, we include a bias term as an additional dimension to inputs. More concretely, we append $1$ to each input,

\[\boldsymbol{x}'_1 = (0, 0, 1), y_1 = 1 \\ \boldsymbol{x}'_2 = (0, 1, 1), y_2 = 1 \\ \boldsymbol{x}'_3 = (1, 0, 1), y_3 = 1 \\ \boldsymbol{x}'_4 = (1, 1, 1), y_4 = 0 \\\]

Then, the formula of the single-layer perceptron becomes,

\[\hat{y} = g((w_1, w_2, w_3) \cdot \boldsymbol{x}') = g(w_1 x_1 + w_2 x_2 + w_3)\]

In other words, $w_1$ and $w_2$ are the weights for $x_1$ and $x_2$, respectively, and $w_3$ is the bias weight.

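For example, the augmented weight vector $(w_1, w_2, w_3) = (-1, -1, 1.5)$ realizes the NAND gate: $g(-0 - 0 + 1.5) = 1$, $g(-0 - 1 + 1.5) = 1$, $g(-1 - 0 + 1.5) = 1$, and $g(-1 - 1 + 1.5) = 0$, reproducing the truth table above. The perceptron algorithm finds such a weight vector automatically from the training data.
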
The code below implements Rosenblatt’s perceptron algorithm with a fixed number of iterations (100 passes over the training data). We use a constant learning rate of 0.5 for simplicity.

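For every training instance $(\boldsymbol{x}'_i, y_i)$, the algorithm predicts $\hat{y}_i = g(\boldsymbol{w} \cdot \boldsymbol{x}'_i)$ with the current weights and then applies the update,

\[\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta (y_i - \hat{y}_i) \boldsymbol{x}'_i\]

Here, $\eta$ denotes the learning rate. The weights change only when the prediction is wrong, because $y_i - \hat{y}_i = 0$ whenever the prediction is correct.
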
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    for i in range(len(y)):
        y_pred = np.heaviside(np.dot(x[i], w), 0)
        w += (y[i] - y_pred) * eta * x[i]
w
array([-1. , -0.5,  1.5])
np.heaviside(np.dot(x, w), 0)
array([1., 1., 1., 0.])

Single-layer perceptron with mini-batch

It is better to reduce the amount of computation executed directly by the Python interpreter, which is relatively slow. A common technique for speeding up machine-learning code written in Python is to push computations into the matrix library (e.g., numpy).

The single-layer perceptron makes predictions for four inputs,

\[\hat{y}_1 = g(\boldsymbol{x}_1 \cdot \boldsymbol{w}) \\ \hat{y}_2 = g(\boldsymbol{x}_2 \cdot \boldsymbol{w}) \\ \hat{y}_3 = g(\boldsymbol{x}_3 \cdot \boldsymbol{w}) \\ \hat{y}_4 = g(\boldsymbol{x}_4 \cdot \boldsymbol{w}) \\\]

Here, we define $\hat{Y} \in \mathbb{R}^{4 \times 1}$ and $X \in \mathbb{R}^{4 \times d}$ as,

\[\hat{Y} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \hat{y}_4 \\ \end{pmatrix}, X = \begin{pmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \boldsymbol{x}_3 \\ \boldsymbol{x}_4 \\ \end{pmatrix}\]

Then, we can write the four predictions as a single matrix-vector computation, \(\hat{Y} = g(X \boldsymbol{w})\), applying the step function $g(\cdot)$ element-wise.

The code below implements this idea. The function np.heaviside() applies the step function to every element of its argument, yielding a vector that contains the four predictions.

This technique is frequently used in mini-batch training, where gradients for a small number (e.g., 4 to 128) of instances are computed.

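Similarly, the perceptron updates for the whole batch can be accumulated into a single matrix computation,

\[\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta X^\top (Y - \hat{Y})\]

Here, $Y \in \mathbb{R}^{4 \times 1}$ stacks the training labels in the same way as $\hat{Y}$; the code below computes $X^\top (Y - \hat{Y})$ with np.dot(y - y_pred, x).
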
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 1.0
for t in range(100):
    y_pred = np.heaviside(np.dot(x, w), 0)  # Predictions for all four inputs at once.
    w += eta * np.dot((y - y_pred), x)      # Accumulated perceptron update for the whole batch.
w
array([-1., -1.,  2.])
np.heaviside(np.dot(x, w), 0)
array([1., 1., 1., 0.])

Stochastic gradient descent (SGD) with mini-batch

Next, we consider a single-layer feedforward neural network with a sigmoid activation function. In essence, we replace the Heaviside step function with the sigmoid function when predicting $\hat{Y}$, and use the gradient of the loss function (stochastic gradient descent) when updating $\boldsymbol{w}$.

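Although the loss function does not appear explicitly in the code, the update below is (full-batch) gradient descent on the binary cross-entropy loss $L = -\sum_i \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)$ with $\hat{Y} = \sigma(X \boldsymbol{w})$, whose gradient takes the simple form,

\[\frac{\partial L}{\partial \boldsymbol{w}} = X^\top (\hat{Y} - Y)\]

Hence the update rule $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta X^\top (\hat{Y} - Y)$, implemented as np.dot(y_pred - y, x) in the code.
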
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 1.0
for t in range(100):
    y_pred = sigmoid(np.dot(x, w))          # Forward computation with the sigmoid.
    w -= eta * np.dot((y_pred - y), x)      # Gradient-descent update on the cross-entropy loss.
w
array([-5.59504346, -5.59504346,  8.57206068])
sigmoid(np.dot(x, w))
array([0.99981071, 0.95152498, 0.95152498, 0.06798725])

Automatic differentiation

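The examples below all compute the same quantity with different frameworks: the loss $l = -\log \sigma(\boldsymbol{x} \cdot \boldsymbol{w})$ of a single positive instance $\boldsymbol{x} = (1, 1, 1)$ under the weights $\boldsymbol{w} = (1, 1, -1.5)$, together with its gradient. By the chain rule,

\[\frac{\partial l}{\partial \boldsymbol{w}} = \left( \sigma(\boldsymbol{x} \cdot \boldsymbol{w}) - 1 \right) \boldsymbol{x}\]

With $\boldsymbol{x} \cdot \boldsymbol{w} = 0.5$ and $\sigma(0.5) \approx 0.622$, this gives $l \approx 0.474$ and a gradient of about $(-0.378, -0.378, -0.378)$. Automatic differentiation obtains these values from the forward computation alone, without requiring the gradient to be derived by hand.
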
autograd

import autograd
import autograd.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = autograd.grad(loss)
print(loss(w, x))
print(grad_loss(w, x))
0.47407698418010663
[-0.37754067 -0.37754067 -0.37754067]

PyTorch

import torch

dtype = torch.float

x = torch.tensor([1, 1, 1], dtype=dtype)
w = torch.tensor([1.0, 1.0, -1.5], dtype=dtype, requires_grad=True)

loss = -torch.dot(x, w).sigmoid().log()
loss.backward()
print(loss.item())
print(w.grad)
0.4740769565105438
tensor([-0.3775, -0.3775, -0.3775])

Chainer

import numpy as np
from chainer import Variable
import chainer.functions as F

dtype = np.float32

x = np.array([1,1,1], dtype=dtype)
w = Variable(np.array([1.0,1.0,-1.5], dtype=dtype), requires_grad=True)

loss = -F.log(F.sigmoid(np.dot(x,w)))
loss.backward()
print(loss.data)
print(w.grad)
0.47407696
[-0.37754062 -0.37754062 -0.37754062]

TensorFlow

import tensorflow as tf

x = tf.constant([1., 1., 1.])
w = tf.Variable([1.0, 1.0, -1.5])

loss = -tf.log(tf.sigmoid(tf.tensordot(x, w, axes=1)))
grad = tf.gradients(loss, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_value = sess.run(loss)
    grad_value = sess.run(grad)
    print(loss_value)
    print(grad_value)
0.47407696
[array([-0.37754062, -0.37754062, -0.37754062], dtype=float32)]

MXNet

import mxnet as mx
from mxnet import nd, autograd, gluon

x = nd.array([1., 1., 1.])
w = nd.array([1.0, 1.0, -1.5])
w.attach_grad()

with autograd.record():
    loss = -nd.dot(x, w).sigmoid().log()
loss.backward()
print(loss)
print(w.grad)
[0.47407696]
<NDArray 1 @cpu(0)>

[-0.37754065 -0.37754065 -0.37754065]
<NDArray 3 @cpu(0)>

Single-layer neural network using automatic differentiation

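Each example below trains the single-layer model $\hat{y} = \sigma(\boldsymbol{x} \cdot \boldsymbol{w})$ on the NAND data. The variable ll in the code is the likelihood of the correct label, $y \hat{y} + (1 - y)(1 - \hat{y})$, so the loss is the negative log-likelihood (binary cross-entropy),

\[L = -\sum_i \log \left( y_i \hat{y}_i + (1 - y_i)(1 - \hat{y}_i) \right)\]

The framework computes $\partial L / \partial \boldsymbol{w}$ by automatic differentiation; the only hand-written part of each training loop is the SGD update $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \, \partial L / \partial \boldsymbol{w}$.
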
PyTorch
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
w = torch.randn(3, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = x.mm(w).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()      # The loss value.
    #print(t, loss.item())
    loss.backward()             # Compute the gradients of the loss.

    with torch.no_grad():
        w -= eta * w.grad       # Update weights using SGD.        
        w.grad.zero_()          # Clear the gradients for the next iteration.
w
tensor([[-4.2327],
        [-4.2320],
        [ 6.5405]])
x.mm(w).sigmoid()
tensor([[ 0.9986],
        [ 0.9096],
        [ 0.9095],
        [ 0.1274]])
Chainer
import numpy as np
import chainer
from chainer import Variable
import chainer.functions as F

dtype = np.float32

# Training data for NAND
x = Variable(np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype))
y = Variable(np.array([[1], [1], [1], [0]], dtype=dtype))
w = Variable(np.random.rand(3, 1).astype(dtype=dtype), requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = F.sigmoid(F.matmul(x, w))
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -F.sum(F.log(ll))    # The loss value.
    #print(t, loss)
    loss.backward()             # Compute the gradients of the loss.

    with chainer.no_backprop_mode():
        w -= eta * w.grad       # Update weights using SGD.
        w.cleargrad()           # Clear the gradients for the next iteration.
w
variable([[-4.238654 ],
          [-4.2391624],
          [ 6.550305 ]])
F.sigmoid(F.matmul(x, w))
variable([[0.99857235],
          [0.90979564],
          [0.90983737],
          [0.12702626]])
TensorFlow
import tensorflow as tf

# Training data for NAND
x_data = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y_data = [[1], [1], [1], [0]]

x = tf.placeholder(tf.float32, [4, 3])
y = tf.placeholder(tf.float32, [4, 1])
w = tf.Variable(tf.random_normal([3,1]))

# y_pred = \sigma(x \cdot w)
y_pred = tf.sigmoid(tf.matmul(x, w))
ll = y * y_pred + (1 - y) * (1 - y_pred)
loss = -tf.reduce_sum(tf.log(ll))
grad = tf.gradients(loss, w)

eta = 0.5
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for t in range(100):
        grads = sess.run(grad, feed_dict={x: x_data, y: y_data})
        sess.run(w.assign_sub(eta * grads[0]))
    print(sess.run(w))
    print(sess.run(y_pred, feed_dict={x: x_data, y: y_data}))
[[-4.203982 ]
 [-4.2044005]
 [ 6.4987044]]
[[0.9984969 ]
 [0.90840423]
 [0.90843904]
 [0.12901704]]
MXNet
import mxnet as mx
from mxnet import nd, autograd

# Training data for NAND.
x = nd.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = nd.array([[1], [1], [1], [0]])
w = nd.random.normal(0, 1, shape=(3, 1))
w.attach_grad()

eta = 0.5
for t in range(100):
    with autograd.record():
        # y_pred = \sigma(x \cdot w).
        y_pred = nd.dot(x, w).sigmoid()
        ll = y * y_pred + (1 - y) * (1 - y_pred)
        loss = -ll.log().sum()      # The loss value.
        #print(t, loss)
    loss.backward()                 # Compute the gradients of the loss.
    w -= eta * w.grad               # Update weights using SGD.
w
[[-4.2020216]
 [-4.20314  ]
 [ 6.4963117]]
<NDArray 3x1 @cpu(0)>
nd.dot(x, w).sigmoid()
[[0.9984933 ]
 [0.90831   ]
 [0.90840304]
 [0.12911019]]
<NDArray 4x1 @cpu(0)>

Multi-layer neural network using automatic differentiation

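XOR is not linearly separable, so a single layer cannot realize it. The examples below therefore insert a hidden layer of two sigmoid units; writing the parameters as $W_1 \in \mathbb{R}^{3 \times 2}$, $\boldsymbol{w}_2 \in \mathbb{R}^{2 \times 1}$, and $b_2 \in \mathbb{R}$, the forward computation is,

\[\hat{y} = \sigma\left( \sigma(\boldsymbol{x} W_1) \boldsymbol{w}_2 + b_2 \right)\]

The loss is the same binary cross-entropy as in the single-layer case, and automatic differentiation backpropagates the gradients through both layers. With an unlucky random initialization, training can get stuck in a poor solution, as one of the recorded outputs below shows.
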
PyTorch
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
w1 = torch.randn(3, 2, dtype=dtype, requires_grad=True)
w2 = torch.randn(2, 1, dtype=dtype, requires_grad=True)
b2 = torch.randn(1, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(w_2 \cdot \sigma(x \cdot w_1) + b_2)
    y_pred = x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()
    #print(t, loss.item())
    loss.backward()
    
    with torch.no_grad():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad
        
        # Clear the gradients for the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()
        b2.grad.zero_()
print(w1)
print(w2)
print(b2)
tensor([[ 5.1994,  7.0126],
        [ 5.2027,  7.0301],
        [-7.9606, -3.1728]])
tensor([[-12.1454],
        [ 11.3713]])
tensor([[-5.2753]])
x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()
tensor([[ 0.0080],
        [ 0.9942],
        [ 0.9941],
        [ 0.0062]])
Chainer
import numpy as np
import chainer
from chainer import Variable
import chainer.functions as F

dtype = np.float32

# Training data for XOR.
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = np.array([[0], [1], [1], [0]],dtype=dtype)
w1 = Variable(np.random.randn(3, 2).astype(dtype),requires_grad=True)
w2 = Variable(np.random.randn(2, 1).astype(dtype),requires_grad=True)
b2 = Variable(np.random.randn(1).astype(dtype), requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(w_2 \cdot \sigma(x \cdot w_1) + b_2)
    y_pred = F.sigmoid(F.bias(F.matmul(F.sigmoid(F.matmul(x, w1)), w2), b2))
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -F.sum(F.log(ll))
    #print(t, loss.data)
    loss.backward()
    with chainer.no_backprop_mode():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad

        # Clear the gradients for the next iteration.
        w1.cleargrad()
        w2.cleargrad()
        b2.cleargrad()
print(w1)
print(w2)
print(b2)
variable([[-6.898038  -6.5185432]
          [ 7.0828695  6.2416263]
          [ 3.471656  -3.3139799]])
variable([[-11.242552]
          [ 11.863684]])
variable([5.270227])
F.sigmoid(F.bias(F.matmul(F.sigmoid(F.matmul(x,w1)) ,w2), b2))
variable([[0.00539303],
          [0.9949782 ],
          [0.9927317 ],
          [0.00462809]])
TensorFlow
import tensorflow as tf

# Training data for XOR.
x_data = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y_data = [[0], [1], [1], [0]]

x = tf.placeholder(tf.float32, [4, 3])
y = tf.placeholder(tf.float32, [4, 1])
w1 = tf.Variable(tf.random_normal([3, 2]))
w2 = tf.Variable(tf.random_normal([2, 1]))
b2 = tf.Variable(tf.random_normal([1, 1]))

y_pred = tf.sigmoid(tf.add(tf.matmul(tf.sigmoid(tf.matmul(x, w1)), w2), b2))
ll = y * y_pred + (1 - y) * (1 - y_pred)
log = tf.log(ll)
loss = -tf.reduce_sum(log)
grad = tf.gradients(loss, [w1, w2, b2])

eta = 0.5
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(1000):
        w1_grad, w2_grad, b2_grad = sess.run(grad, feed_dict={x: x_data, y: y_data})
        sess.run(tf.assign_sub(w1, eta * w1_grad))
        sess.run(tf.assign_sub(w2, eta * w2_grad))
        sess.run(tf.assign_sub(b2, eta * b2_grad))
        
    print(sess.run(w1))
    print(sess.run(w2))
    print(sess.run(b2))
    print(sess.run(y_pred, feed_dict={x: x_data, y: y_data}))
[[ 8.545845   8.527924 ]
 [ 4.8393583 -4.2591324]
 [-1.4335978  2.7701557]]
[[ 7.452827 ]
 [-7.1250057]]
[[-0.32773858]]
[[0.00369266]
 [0.9962198 ]
 [0.49852544]
 [0.5015694 ]]
MXNet
import mxnet as mx
from mxnet import nd, autograd

# Training data for XOR.
x = nd.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = nd.array([[0], [1], [1], [0]])

w1 = nd.random.normal(0, 1, shape=(3, 2))
w2 = nd.random.normal(0, 1, shape=(2, 1))
b2 = nd.random.normal(0, 1, shape=(1, 1))
w1.attach_grad()
w2.attach_grad()
b2.attach_grad()

eta = 0.5
for t in range(1000):
    with autograd.record():
        # y_pred = \sigma(w_2 \cdot \sigma(x \cdot w_1) + b_2)
        y_pred = (nd.dot(nd.dot(x, w1).sigmoid(), w2) + b2).sigmoid()
        ll = y * y_pred + (1 - y) * (1 - y_pred)
        loss = -ll.log().sum()
    loss.backward()
    
    # Update weights using SGD.
    w1 -= eta * w1.grad
    w2 -= eta * w2.grad
    b2 -= eta * b2.grad
print(w1)
print(w2)
print(b2)
[[  4.5155373   4.041809 ]
 [ -7.8481655   7.4811954]
 [  5.9294176 -10.316905 ]]
<NDArray 3x2 @cpu(0)>

[[-5.775163 ]
 [-7.2728887]]
<NDArray 2x1 @cpu(0)>

[[5.775924]]
<NDArray 1x1 @cpu(0)>
(nd.dot(nd.dot(x, w1).sigmoid(), w2) + b2).sigmoid()
[[0.50396055]
 [0.99037385]
 [0.4968157 ]
 [0.00550799]]
<NDArray 4x1 @cpu(0)>

Single-layer neural network with high-level NN modules

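In the examples below, the network outputs a raw score (logit) and the sigmoid is folded into the loss function: torch.nn.BCEWithLogitsLoss and chainer.functions.sigmoid_cross_entropy both compute,

\[\mathrm{loss}(z, y) = -\left( y \log \sigma(z) + (1 - y) \log (1 - \sigma(z)) \right)\]

directly from the logit $z$, which is numerically more stable than applying the sigmoid and the logarithm separately. This is why the code applies the sigmoid explicitly only at the end, when inspecting the predictions.
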
PyTorch
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

eta = 0.5
for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.
model.state_dict()
OrderedDict([('0.weight', tensor([[-4.3067, -4.3060]])),
             ('0.bias', tensor([ 6.6506]))])
model(x).sigmoid()
tensor([[ 0.9987],
        [ 0.9125],
        [ 0.9124],
        [ 0.1232]])
Chainer
import chainer
import numpy as np
from chainer import Variable, Function
import chainer.functions as F
import chainer.links as L

dtype = np.float32

# Training data for NAND
x = Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = Variable(np.array([[1], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
model = chainer.Sequential(
    L.Linear(2, 1, nobias=False)            # 2 dims (with bias) -> 1 dim
)
# Binary cross-entropy loss after sigmoid function.
loss_fn=F.sigmoid_cross_entropy

eta = 0.5
for t in range(100):
    y_pred = model(x)                       # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)
    # print(t, loss.data)
    model.cleargrads()                      # Zero-clear the gradients.
    loss.backward()                         # Compute the gradients.

    with chainer.no_backprop_mode():
        for para in model.params():
            para.data -= eta * para.grad    # Update the parameters using SGD.
for para in model.params():
    print(para)
variable b([3.2144895])
variable W([[-1.9686245 -1.9545643]])
F.sigmoid(model(x))
variable([[0.96137595],
          [0.7790133 ],
          [0.77658325],
          [0.32988632]])

Multi-layer neural network with high-level NN modules

PyTorch
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

eta = 0.5
for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.
model.state_dict()
OrderedDict([('0.weight', tensor([[ 7.0281,  7.0367],
                      [ 5.1955,  5.1971]])),
             ('0.bias', tensor([-3.1767, -7.9526])),
             ('2.weight', tensor([[ 11.4025, -12.1782]])),
             ('2.bias', tensor([-5.2898]))])
model(x).sigmoid()
tensor([[ 0.0079],
        [ 0.9942],
        [ 0.9942],
        [ 0.0061]])
Chainer
import chainer
import numpy as np
from chainer import Variable, Function
import chainer.functions as F
import chainer.links as L

dtype = np.float32

# Training data for XOR.
x = chainer.Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = chainer.Variable(np.array([[0], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
init=chainer.initializers.HeNormal()
model = chainer.Sequential(
    L.Linear(2, 2, nobias=False, initialW=init), # 2 dims (with bias) -> 2 dims
    F.sigmoid,                                   # Sigmoid function
    L.Linear(2, 1, nobias=False, initialW=init), # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn=F.sigmoid_cross_entropy

eta = 0.5
for t in range(1000):
    y_pred = model(x)                            # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)
    # print(t, loss.data)
    model.cleargrads()                           # Zero-clear the gradients.
    loss.backward()                              # Compute the gradients.

    with chainer.no_backprop_mode():
        for para in model.params():
            para.data -= eta * para.grad     # Update the parameters using SGD.
for para in model.params():
    print(para)
variable W([[-5.0357113  4.694917 ]
            [ 5.884984  -6.006705 ]])
variable b([-2.5778637 -3.3951974])
variable W([[7.5983677 7.613726 ]])
variable b([-3.705806])
F.sigmoid(model(x))
variable([[0.05105227],
          [0.95592314],
          [0.9653981 ],
          [0.04323387]])

Single-layer neural network with an optimizer

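The optimizer object encapsulates the update rule, so switching from plain SGD to another algorithm only changes the line that constructs the optimizer. The following is a minimal sketch (not part of the original examples) of the same NAND setup trained with Adam in PyTorch; the learning rate 0.1 is hand-picked for illustration.

import torch

# Hypothetical variant: the same single-layer NAND model, trained with Adam instead of SGD.
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[1.], [1.], [1.], [0.]])
model = torch.nn.Sequential(torch.nn.Linear(2, 1, bias=True))
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')       # Sum over instances, as below.
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)    # Adam with a hand-picked learning rate.

for t in range(100):
    loss = loss_fn(model(x), y)     # Forward computation and loss.
    optimizer.zero_grad()           # Zero-clear gradients.
    loss.backward()                 # Compute the gradients.
    optimizer.step()                # Adam update.
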
PyTorch
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.
model.state_dict()
OrderedDict([('0.weight', tensor([[-4.2726, -4.2721]])),
             ('0.bias', tensor([ 6.6000]))])
model(x).sigmoid()
tensor([[ 0.9986],
        [ 0.9112],
        [ 0.9111],
        [ 0.1251]])
Chainer
import chainer
import numpy as np
from chainer import functions as F
from chainer import links as L
chainer.config.train = True

dtype=np.float32

# Training data for NAND.
x = chainer.Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = chainer.Variable(np.array([[1], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
model = chainer.Sequential(
    L.Linear(2, 1, nobias=False),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = F.sigmoid_cross_entropy

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = chainer.optimizers.SGD(lr=0.5)
optimizer.setup(model)

for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)   # Compute the loss.
    #print(t, loss.data)
    
    model.cleargrads()          # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.update()          # Update the parameters using the gradients.
for para in model.params():
    print(para)
variable b([3.4636898])
variable W([[-2.134361  -2.1379907]])
F.sigmoid(model(x))
variable([[0.9696368 ],
          [0.79012835],
          [0.7907295 ],
          [0.30817568]])
TensorFlow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Activation
from tensorflow.keras import optimizers

# Training data for NAND.
x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_data = np.array([[1], [1], [1], [0]])

# Define a neural network using high-level modules.
model = Sequential([
    Flatten(),
    Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=optimizers.SGD(lr=0.5),
    loss='binary_crossentropy',
    metrics=['accuracy']
    )

model.fit(x_data, y_data, epochs=100)
Epoch 1/100
4/4 [==============================] - 0s 68ms/step - loss: 0.8059 - acc: 0.2500
Epoch 2/100
4/4 [==============================] - 0s 471us/step - loss: 0.7065 - acc: 0.5000
Epoch 3/100
4/4 [==============================] - 0s 3ms/step - loss: 0.6347 - acc: 0.5000
Epoch 4/100
4/4 [==============================] - 0s 718us/step - loss: 0.5838 - acc: 0.7500
Epoch 5/100
4/4 [==============================] - 0s 805us/step - loss: 0.5477 - acc: 0.7500
Epoch 6/100
4/4 [==============================] - 0s 1ms/step - loss: 0.5217 - acc: 0.7500
Epoch 7/100
4/4 [==============================] - 0s 3ms/step - loss: 0.5024 - acc: 1.0000
Epoch 8/100
4/4 [==============================] - 0s 426us/step - loss: 0.4875 - acc: 1.0000
Epoch 9/100
4/4 [==============================] - 0s 437us/step - loss: 0.4756 - acc: 1.0000
Epoch 10/100
4/4 [==============================] - 0s 1ms/step - loss: 0.4657 - acc: 1.0000
Epoch 11/100
4/4 [==============================] - 0s 499us/step - loss: 0.4572 - acc: 1.0000
Epoch 12/100
4/4 [==============================] - 0s 2ms/step - loss: 0.4496 - acc: 1.0000
Epoch 13/100
4/4 [==============================] - 0s 1ms/step - loss: 0.4426 - acc: 1.0000
Epoch 14/100
4/4 [==============================] - 0s 2ms/step - loss: 0.4362 - acc: 0.7500
Epoch 15/100
4/4 [==============================] - 0s 648us/step - loss: 0.4301 - acc: 0.7500
Epoch 16/100
4/4 [==============================] - 0s 960us/step - loss: 0.4244 - acc: 0.7500
Epoch 17/100
4/4 [==============================] - 0s 422us/step - loss: 0.4188 - acc: 0.7500
Epoch 18/100
4/4 [==============================] - 0s 538us/step - loss: 0.4135 - acc: 0.7500
Epoch 19/100
4/4 [==============================] - 0s 560us/step - loss: 0.4084 - acc: 1.0000
Epoch 20/100
4/4 [==============================] - 0s 524us/step - loss: 0.4034 - acc: 1.0000
Epoch 21/100
4/4 [==============================] - 0s 1ms/step - loss: 0.3986 - acc: 1.0000
Epoch 22/100
4/4 [==============================] - 0s 713us/step - loss: 0.3939 - acc: 1.0000
Epoch 23/100
4/4 [==============================] - 0s 630us/step - loss: 0.3894 - acc: 1.0000
Epoch 24/100
4/4 [==============================] - 0s 519us/step - loss: 0.3849 - acc: 1.0000
Epoch 25/100
4/4 [==============================] - 0s 523us/step - loss: 0.3806 - acc: 1.0000
Epoch 26/100
4/4 [==============================] - 0s 779us/step - loss: 0.3764 - acc: 1.0000
Epoch 27/100
4/4 [==============================] - 0s 661us/step - loss: 0.3723 - acc: 1.0000
Epoch 28/100
4/4 [==============================] - 0s 481us/step - loss: 0.3683 - acc: 1.0000
Epoch 29/100
4/4 [==============================] - 0s 2ms/step - loss: 0.3644 - acc: 1.0000
Epoch 30/100
4/4 [==============================] - 0s 1ms/step - loss: 0.3606 - acc: 1.0000
Epoch 31/100
4/4 [==============================] - 0s 684us/step - loss: 0.3569 - acc: 1.0000
Epoch 32/100
4/4 [==============================] - 0s 660us/step - loss: 0.3533 - acc: 1.0000
Epoch 33/100
4/4 [==============================] - 0s 419us/step - loss: 0.3497 - acc: 1.0000
Epoch 34/100
4/4 [==============================] - 0s 519us/step - loss: 0.3463 - acc: 1.0000
Epoch 35/100
4/4 [==============================] - 0s 2ms/step - loss: 0.3429 - acc: 1.0000
Epoch 36/100
4/4 [==============================] - 0s 1ms/step - loss: 0.3396 - acc: 1.0000
Epoch 37/100
4/4 [==============================] - 0s 584us/step - loss: 0.3363 - acc: 1.0000
Epoch 38/100
4/4 [==============================] - 0s 503us/step - loss: 0.3331 - acc: 1.0000
Epoch 39/100
4/4 [==============================] - 0s 382us/step - loss: 0.3300 - acc: 1.0000
Epoch 40/100
4/4 [==============================] - 0s 2ms/step - loss: 0.3270 - acc: 1.0000
Epoch 41/100
4/4 [==============================] - 0s 1ms/step - loss: 0.3240 - acc: 1.0000
Epoch 42/100
4/4 [==============================] - 0s 6ms/step - loss: 0.3211 - acc: 1.0000
Epoch 43/100
4/4 [==============================] - 0s 799us/step - loss: 0.3182 - acc: 1.0000
Epoch 44/100
4/4 [==============================] - 0s 2ms/step - loss: 0.3154 - acc: 1.0000
Epoch 45/100
4/4 [==============================] - 0s 1ms/step - loss: 0.3127 - acc: 1.0000
Epoch 46/100
4/4 [==============================] - 0s 814us/step - loss: 0.3100 - acc: 1.0000
Epoch 47/100
4/4 [==============================] - 0s 547us/step - loss: 0.3073 - acc: 1.0000
Epoch 48/100
4/4 [==============================] - 0s 663us/step - loss: 0.3047 - acc: 1.0000
Epoch 49/100
4/4 [==============================] - 0s 578us/step - loss: 0.3022 - acc: 1.0000
Epoch 50/100
4/4 [==============================] - 0s 508us/step - loss: 0.2997 - acc: 1.0000
Epoch 51/100
4/4 [==============================] - 0s 373us/step - loss: 0.2972 - acc: 1.0000
Epoch 52/100
4/4 [==============================] - 0s 1ms/step - loss: 0.2948 - acc: 1.0000
Epoch 53/100
4/4 [==============================] - 0s 812us/step - loss: 0.2925 - acc: 1.0000
Epoch 54/100
4/4 [==============================] - 0s 530us/step - loss: 0.2902 - acc: 1.0000
Epoch 55/100
4/4 [==============================] - 0s 504us/step - loss: 0.2879 - acc: 1.0000
Epoch 56/100
4/4 [==============================] - 0s 521us/step - loss: 0.2856 - acc: 1.0000
Epoch 57/100
4/4 [==============================] - 0s 680us/step - loss: 0.2834 - acc: 1.0000
Epoch 58/100
4/4 [==============================] - 0s 745us/step - loss: 0.2813 - acc: 1.0000
Epoch 59/100
4/4 [==============================] - 0s 583us/step - loss: 0.2791 - acc: 1.0000
Epoch 60/100
4/4 [==============================] - 0s 2ms/step - loss: 0.2770 - acc: 1.0000
Epoch 61/100
4/4 [==============================] - 0s 954us/step - loss: 0.2750 - acc: 1.0000
Epoch 62/100
4/4 [==============================] - 0s 536us/step - loss: 0.2729 - acc: 1.0000
Epoch 63/100
4/4 [==============================] - 0s 352us/step - loss: 0.2709 - acc: 1.0000
Epoch 64/100
4/4 [==============================] - 0s 461us/step - loss: 0.2690 - acc: 1.0000
Epoch 65/100
4/4 [==============================] - 0s 699us/step - loss: 0.2670 - acc: 1.0000
Epoch 66/100
4/4 [==============================] - 0s 581us/step - loss: 0.2651 - acc: 1.0000
Epoch 67/100
4/4 [==============================] - 0s 409us/step - loss: 0.2633 - acc: 1.0000
Epoch 68/100
4/4 [==============================] - 0s 676us/step - loss: 0.2614 - acc: 1.0000
Epoch 69/100
4/4 [==============================] - 0s 639us/step - loss: 0.2596 - acc: 1.0000
Epoch 70/100
4/4 [==============================] - 0s 840us/step - loss: 0.2578 - acc: 1.0000
Epoch 71/100
4/4 [==============================] - 0s 970us/step - loss: 0.2561 - acc: 1.0000
Epoch 72/100
4/4 [==============================] - 0s 476us/step - loss: 0.2543 - acc: 1.0000
Epoch 73/100
4/4 [==============================] - 0s 576us/step - loss: 0.2526 - acc: 1.0000
Epoch 74/100
4/4 [==============================] - 0s 726us/step - loss: 0.2509 - acc: 1.0000
Epoch 75/100
4/4 [==============================] - 0s 496us/step - loss: 0.2493 - acc: 1.0000
Epoch 76/100
4/4 [==============================] - 0s 437us/step - loss: 0.2476 - acc: 1.0000
Epoch 77/100
4/4 [==============================] - 0s 418us/step - loss: 0.2460 - acc: 1.0000
Epoch 78/100
4/4 [==============================] - 0s 482us/step - loss: 0.2444 - acc: 1.0000
Epoch 79/100
4/4 [==============================] - 0s 568us/step - loss: 0.2428 - acc: 1.0000
Epoch 80/100
4/4 [==============================] - 0s 542us/step - loss: 0.2413 - acc: 1.0000
Epoch 81/100
4/4 [==============================] - 0s 436us/step - loss: 0.2397 - acc: 1.0000
Epoch 82/100
4/4 [==============================] - 0s 434us/step - loss: 0.2382 - acc: 1.0000
Epoch 83/100
4/4 [==============================] - 0s 2ms/step - loss: 0.2367 - acc: 1.0000
Epoch 84/100
4/4 [==============================] - 0s 1ms/step - loss: 0.2353 - acc: 1.0000
Epoch 85/100
4/4 [==============================] - 0s 789us/step - loss: 0.2338 - acc: 1.0000
Epoch 86/100
4/4 [==============================] - 0s 1ms/step - loss: 0.2324 - acc: 1.0000
Epoch 87/100
4/4 [==============================] - 0s 473us/step - loss: 0.2310 - acc: 1.0000
Epoch 88/100
4/4 [==============================] - 0s 463us/step - loss: 0.2296 - acc: 1.0000
Epoch 89/100
4/4 [==============================] - 0s 489us/step - loss: 0.2282 - acc: 1.0000
Epoch 90/100
4/4 [==============================] - 0s 1ms/step - loss: 0.2269 - acc: 1.0000
Epoch 91/100
4/4 [==============================] - 0s 627us/step - loss: 0.2255 - acc: 1.0000
Epoch 92/100
4/4 [==============================] - 0s 2ms/step - loss: 0.2242 - acc: 1.0000
Epoch 93/100
4/4 [==============================] - 0s 906us/step - loss: 0.2229 - acc: 1.0000
Epoch 94/100
4/4 [==============================] - 0s 628us/step - loss: 0.2216 - acc: 1.0000
Epoch 95/100
4/4 [==============================] - 0s 373us/step - loss: 0.2203 - acc: 1.0000
Epoch 96/100
4/4 [==============================] - 0s 585us/step - loss: 0.2190 - acc: 1.0000
Epoch 97/100
4/4 [==============================] - 0s 490us/step - loss: 0.2178 - acc: 1.0000
Epoch 98/100
4/4 [==============================] - 0s 508us/step - loss: 0.2166 - acc: 1.0000
Epoch 99/100
4/4 [==============================] - 0s 588us/step - loss: 0.2153 - acc: 1.0000
Epoch 100/100
4/4 [==============================] - 0s 718us/step - loss: 0.2141 - acc: 1.0000





<tensorflow.python.keras.callbacks.History at 0x7f9f80286da0>
model.get_weights()
[array([[-2.202066],
        [-2.160888]], dtype=float32), array([3.5287492], dtype=float32)]
model.predict(x_data)
array([[0.9714948],
       [0.7970343],
       [0.7902915],
       [0.3027567]], dtype=float32)
MXNet
import mxnet as mx
from mxnet import nd, autograd, gluon

# Training data for NAND.
x = nd.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = nd.array([[1], [1], [1], [0]])

# Define a neural network using high-level modules.
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(1))
net.collect_params().initialize(mx.init.Normal(sigma=1.))
  
# Binary cross-entropy loss after sigmoid function.
loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

# Optimizer based on SGD
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})

for t in range(100):
    with autograd.record():
        # Make predictions.
        y_pred = net(x)
        # Compute the loss.
        loss = loss_fn(y_pred, y)
    # Compute the gradients of the loss.
    loss.backward()
    # Update weights using SGD.
    # the batch_size is set to one to be consistent with the slide.
    trainer.step(batch_size=1)
for v in net.collect_params().values():
    print(v, v.data())
Parameter sequential0_dense0_weight (shape=(1, 2), dtype=float32) 
[[-4.182336  -4.1832795]]
<NDArray 1x2 @cpu(0)>
Parameter sequential0_dense0_bias (shape=(1,), dtype=float32) 
[6.466928]
<NDArray 1 @cpu(0)>
net(x).sigmoid()
[[0.9984484 ]
 [0.90751374]
 [0.9075929 ]
 [0.13025706]]
<NDArray 4x1 @cpu(0)>

Multi-layer neural network with an optimizer

PyTorch
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.
model.state_dict()
OrderedDict([('0.weight', tensor([[ 7.3702, -7.1611],
                      [-6.6066,  6.9133]])),
             ('0.bias', tensor([ 3.6234,  3.3088])),
             ('2.weight', tensor([[-9.1519, -9.2072]])),
             ('2.bias', tensor([ 13.4994]))])
model(x).sigmoid()
tensor([[ 0.0134],
        [ 0.9826],
        [ 0.9824],
        [ 0.0118]])
Chainer
import chainer
import numpy as np
from chainer import functions as F
from chainer import links as L

dtype=np.float32

# Training data for XOR.
x = chainer.Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = chainer.Variable(np.array([[0], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
model = chainer.Sequential(
    L.Linear(2, 2, nobias=False),
    F.sigmoid,
    L.Linear(2, 1, nobias=False),
)

# Binary cross-entropy loss after sigmoid function.
loss_fn=F.sigmoid_cross_entropy

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = chainer.optimizers.SGD(lr=0.5)
optimizer.setup(model)

for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)  # Compute the loss.
    #print(t, loss.data)
    
    model.cleargrads()          # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.update()          # Update the parameters using the gradients.
for param in model.params():
    print(param)
variable b([-2.2869124 -5.1613417])
variable W([[5.922025  5.878332 ]
            [3.4194984 3.4142656]])
variable b([-3.3487024])
variable W([[ 7.281869 -7.459597]])
F.sigmoid(model(x))
variable([[0.06181788],
          [0.93281394],
          [0.93301374],
          [0.08725712]])
TensorFlow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Activation
from tensorflow.keras import optimizers

# Training data for XOR.
x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_data = np.array([[0], [1], [1], [0]])

# Define a neural network using high-level modules.
model = Sequential([
    Flatten(),
    Dense(2, activation='sigmoid'),    # 2 dims (with bias) -> 2 dims
    Dense(1, activation='sigmoid')     # 2 dims (with bias) -> 1 dim
])

model.compile(
    optimizer=optimizers.SGD(lr=0.5),
    loss='binary_crossentropy',
    metrics=['accuracy']
    )

model.fit(x_data, y_data, epochs=1000, verbose=0)
<tensorflow.python.keras.callbacks.History at 0x7fd00b4d00f0>
model.get_weights()
[array([[2.0502675, 5.6302657],
        [1.0301563, 4.942775 ]], dtype=float32),
 array([-1.615101 , -1.5490855], dtype=float32),
 array([[-4.0722337],
        [ 5.53848  ]], dtype=float32),
 array([-2.3617785], dtype=float32)]
model.predict(x_data)
array([[0.11236145],
       [0.8234226 ],
       [0.6484979 ],
       [0.4670412 ]], dtype=float32)
MXNet
import mxnet as mx
from mxnet import nd, autograd, gluon

# Training data for XOR.
x = nd.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = nd.array([[0], [1], [1], [0]])

# Define a neural network using high-level modules.
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2))
    net.add(gluon.nn.Activation('sigmoid'))
    net.add(gluon.nn.Dense(1))
net.collect_params().initialize(mx.init.Normal(sigma=1.))

# Binary cross-entropy loss after sigmoid function.
loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

# Optimizer based on SGD
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})

for t in range(1000):
    with autograd.record():
        # Make predictions.
        y_pred = net(x)
        # Compute the loss.
        loss = loss_fn(y_pred, y)
    # Compute the gradients of the loss.
    loss.backward()
    # Update weights using SGD.
    # the batch_size is set to one to be consistent with the slide.
    trainer.step(batch_size=1)
for v in net.collect_params().values():
    print(v, v.data())
Parameter sequential0_dense0_weight (shape=(2, 3), dtype=float32) 
[[ 8.537011  -4.4268546  1.9961522]
 [ 8.598794   4.8988695 -1.3610638]]
<NDArray 2x3 @cpu(0)>
Parameter sequential0_dense0_bias (shape=(2,), dtype=float32) 
[ 0.9527137  -0.12632234]
<NDArray 2 @cpu(0)>
Parameter sequential0_dense1_weight (shape=(1, 2), dtype=float32) 
[[-7.1571183  7.3222814]]
<NDArray 1x2 @cpu(0)>
Parameter sequential0_dense1_bias (shape=(1,), dtype=float32) 
[-0.16509056]
<NDArray 1 @cpu(0)>
net(x).sigmoid()
[[0.00362506]
 [0.9962937 ]
 [0.49854442]
 [0.5015438 ]]
<NDArray 4x1 @cpu(0)>

Single-layer neural network with a customizable NN class

PyTorch
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class SingleLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super(SingleLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.linear1(x)

model = SingleLayerNN(2, 1)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.
model.state_dict()
OrderedDict([('linear1.weight', tensor([[-4.2693, -4.2689]])),
             ('linear1.bias', tensor([ 6.5951]))])
model(x).sigmoid()
tensor([[ 0.9986],
        [ 0.9110],
        [ 0.9110],
        [ 0.1253]])
Chainer
import chainer
import numpy as np
from chainer import Variable, Function
from chainer import optimizers
import chainer.functions as F
import chainer.links as L

x = Variable(np.array([[0,0],[0,1],[1,0],[1,1]],dtype=np.float32))
y = Variable(np.array([[1],[1],[1],[0]],dtype=np.int32))

class Linear(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(2,1)
    def __call__(self,x):
        return self.l1(x)

model = Linear()

optimizer = optimizers.SGD(lr=0.5).setup(model)
for t in range(1000):
    y_pred = model(x)
    loss = F.sigmoid_cross_entropy(y_pred,y)
    #print(t,loss.data)
    model.cleargrads()
    loss.backward()
    optimizer.update()
F.sigmoid(model(x))
variable([[0.99989665],
          [0.95996976],
          [0.95997   ],
          [0.05611408]])

Multi-layer neural network with a customizable NN class

PyTorch
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class ThreeLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super(ThreeLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_hidden, bias=True)
        self.linear2 = torch.nn.Linear(d_hidden, d_out, bias=True)

    def forward(self, x):
        return self.linear2(self.linear1(x).sigmoid())

model = ThreeLayerNN(2, 2, 1)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.
model.state_dict()
OrderedDict([('linear1.weight', tensor([[ 6.6212, -6.8110],
                      [ 6.7129, -6.4369]])),
             ('linear1.bias', tensor([-3.5404,  3.2040])),
             ('linear2.weight', tensor([[ 11.6606, -11.1694]])),
             ('linear2.bias', tensor([ 5.2589]))])
model(x).sigmoid()
tensor([[ 0.0058],
        [ 0.9921],
        [ 0.9947],
        [ 0.0049]])
Chainer
import chainer
import numpy as np
from chainer import Variable
from chainer import functions as F
from chainer import links as L
from chainer import optimizers

x = Variable(np.array([[0,0],[0,1],[1,0],[1,1]],dtype=np.float32))
y = Variable(np.array([[0],[1],[1],[0]],dtype=np.int32))

class Linear(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(2,2)
            self.l2 = L.Linear(2,1)
      
    def __call__(self,x):
        h = F.sigmoid(self.l1(x))
        o = self.l2(h)
        return o
    
model = Linear()
optimizer = optimizers.SGD(lr=0.5).setup(model)
for t in range(1000):
    y_pred = model(x)
    loss = F.sigmoid_cross_entropy(y_pred,y)
    #print(t,loss.data)
    model.cleargrads()
    loss.backward()
    optimizer.update()
F.sigmoid(model(x))
variable([[0.33073208],
          [0.5335392 ],
          [0.8444923 ],
          [0.26925102]])