AI Tutorials for Software Engineers

A series of tutorials that dive deep into the fundamental concepts underlying modern artificial intelligence with the theory and implementations in Go and Rust.

These tutorials should be accessible to anyone and only a basic understanding of mathematics is assumed. All the prerequisites and foundational concepts are introduced to allow an understanding from the basic to the advanced.

Prerequisites and Foundations

Probability & Statistics: An introduction to probability and statistics and the fundamental concepts that will be important in many of the tutorials to follow. Starting with the basic concepts of probability of an event occurring to probability density functions, odds, log odds and the odds ratios among other things.

Differential Calculus: A deep dive into the rate of change of a quantity for a given input. This tutorial on differential calculus walks through the concepts of slope of a line and the techniques used to calculate the slope of a curve at a given point on the curve. Derivatives of several important functions are introduced along with proofs as well as the chain rule and product rule of derivatives which will be important in tutorials to follow. Finally partial derivatives are explained with an eye to understanding the derivative of a function with respect to a given parameter.

Vectors & Matrices: Vectors, Matrices and Matrix algebra are foundational to understanding the concepts behind AI and this tutorial walks through the building blocks and fundamental concepts of matrices and vectors and the key axioms of linear algebra using matrices.

Linear Regression

Simple Linear Regression: Simple linear regression is a statistical technique of making predictions from data. The tutorial introduces a linear model in two dimensions that examines the relation ship between one dependent variable and one independent variable and then finds line that best fits the data. Once the line of best fit is determined it can be used to predict values of y (the dependent variable) for any value of x (the independent variable)

Simple Linear Regression - Proof of the Closed Form: This tutorial walks through a detailed proof of the the closed form solution of simple linear regression introduced above. This proof walks through solving two partial differential equations to compute the values of the two parameters. While this proof provides a detailed explanation (and requires an understanding of differential calculus), for the curious reader of how the closed form solution is obtained, it isn't strictly necessary for the material ahead and can be skipped if needed.

Simple Linear Regression in Go: A Go implementation of Simple Linear Regression.

Multiple Regression: Multiple regression extends the two dimensional linear model introduced in Simple Linear Regression to k + 1 dimensions with one dependent variable, k independent variables and k+1 parameters. The general matrix form of the model is introduced along with a closed form solution that depends on the feature matrix being invertible (which is not always possible) and sets the stage for using gradient descent to solve linear regression.

Multiple Regression - Deriving the Matrix Form: The closed matrix form for multiple regression is a solution of the partial differential equation w.r.t to the parameter vector (B). This tutorial walks through the steps of solving the matrix partial differential equation to derive the closed form solution which depends on the feature matrix being invertible (which is not always possible).

Multiple Regression in Go: A Go implementation of Multiple Regression.

Gradient Descent

Gradient Descent for Simple Linear Regression: An introduction to gradient descent and its application to simple linear regression. Gradient optimizes the model parameters by successively computing the slope of the cost function (the gradient) and then iteratively walking down the cost curve till the gradient is close to zero or a specific number of iterations.

Gradient Descent for Multiple Regression: Gradient descent is applied to the general matrix form of multiple regression and a model with k+1 parameters is optimized by successively computing the gradient vector and then walking down the cost function. The matrix closed form solution for multiple regression requires inverting a matrix which is computationally expensive in most cases (and impossible in many cases). Gradient descent is computationally more efficient and is preferred especially when working with large datasets and models.

Logistic Regression

Logistic Regression - Fundamentals: An introduction to binary classification using Logistic Regression. Logistic Regression uses the Logistic Function to convert a linear combination of input features into probabilities along with a threshold to classify the predictions into one of two values (binary classification). The logistic function and the general matrix form for Logistic Regression is introduced with a model with parameters B and W.

Cost Function & Gradient Descent for Logistic Regression: An introduction to the Cost function for Logistic Regression long with its partial derivative (the gradient vector). The model parameters (B & W) are then optimized using Maximum Likelihood Estimation and Gradient Descent.

Logistic Regression - Derivative of the Cost Function: This tutorial walks through the partial derivative of the cost function for logistic regression. This cost function can then be used to optimize the model parameters using Maximum Likelihood Estimation & Gradient Descent.

Logistic Regression in Go: A Go implementation of Logistic Regression

Neural Networks

Neural Networks - Fundamentals: An introduction to Neural Networks starting from a foundation of linear regression, logistic classification and multi class classification models. The matrix representation of a neural network generalized to k layers with n neurons per layer is introduced along with an introduction to activation functions. This sets the stage for training in neural networks using Gradient Descent.

Forward & Back Propagation in Neural Networks: Neural networks are trained via repeated iterations of Gradient Descent (forward and backward propagation). Inputs flow through the NN layers using current weights and biases to produce predictions, which are compared to observations to calculate loss. Backward propagation then computes gradients by working backward through the network, determining how each parameter should change to reduce loss. The algorithm iterates till the loss is minimized.

Activation & Cost Functions in Neural Networks: To learn complex patterns, neural networks require non-linearity, which is introduced by activation functions applied to each neuron's output. In this tutorial, we cover a variety of activation functions and their properties. We also explore the loss functions used during training for different tasks: linear regression, binary classification, and multi-class classification.

Derivation of the Backpropagation Equations: Training a neural network requires computing how the loss changes with respect to each weight and bias—this is achieved through backpropagation. In this tutorial, we derive the gradient equations using the chain rule for a feedforward network for binary classification (with binary cross-entropy loss and sigmoid output activation) as well as for multi class classification (with categorical cross entropy and softmax output activation). We show how gradients flow backward layer by layer and present the final vectorized formulas for updating weights and biases across multiple observations. We show that the backpropagation equations have the same form for both binary and multi-class classification

Simple Feedforward Neural Network in Go: A detailed walk through of a Go implementation of a simple 3 layer Feedforward neural network using categorical cross entropy cost function with softmax activation in the output layer for multi class classification.

Recurrent Neural Networks

Recurrent Neural Networks - Fundamentals: While Feedforward Neural Networks excel at processing independent data points, they cannot capture temporal dependencies and sequential ordering in time-series data. This lecture introduces Recurrent Neural Networks (RNNs), which address this limitation through recurrent connections. Topics include RNN notation, mathematical representation (with batching), and various architectures (sequence-to-sequence, sequence-to-vector, vector-to-sequence, and encoder-decoder) for different applications.

RNN Training & Backpropagation Through Time: Training Recurrent Neural Networks requires computing gradients through sequential dependencies, which presents unique challenges. This lecture covers Backpropagation Through Time (BPTT), the algorithm that extends standard backpropagation to handle temporal dependencies. Topics include the BPTT algorithm, computational graphs, gradient flow through time, and the vanishing/exploding gradient problem with mitigation strategies.

LSTM Networks: While RNNs can model sequential data, they struggle to learn long-term dependencies due to vanishing gradients during backpropagation. This lecture introduces Long Short-Term Memory (LSTM) networks, a specialized RNN architecture that addresses this limitation through a sophisticated gating mechanism. Topics include the LSTM cell architecture, the three gates (input, forget, and output) that control information flow, the cell state that preserves information across many time steps, the mathematical formulation of LSTM operations, and practical applications in language modeling, speech recognition, and time-series prediction where long-term context is crucial.

Bidirectional RNNs and GRU Networks: Sequential processing can benefit from both historical and future context, especially in tasks like speech recognition and named entity recognition where the entire sequence is available. This lecture introduces Bidirectional RNNs, which simultaneously process sequences forward and backward, combining information from both directions. We also cover Gated Recurrent Units (GRUs), a streamlined alternative to LSTMs that achieves comparable performance with simpler architecture and fewer parameters. Topics include the bidirectional RNN architecture, combining forward and backward hidden states, the GRU cell with reset and update gates, mathematical formulation of GRU operations, computational and memory trade-offs between GRU and LSTM, and practical applications in natural language processing, speech recognition, and sequence labeling tasks.

arrsingh.com

Distributed Systems & Artificial Intelligence