dev-resources.site
My (highly caffeinated) journey to unlock the hidden knowledge of AI.
Welcome to my (highly caffeinated) journey to unlock the hidden knowledge of AI.
I have no idea how AI or ML works, but I’m here to document my learning journey. I started off reading the Blitz PyTorch course, mostly because I was told it was the fastest way to get from 0 knowledge to a working model. Still a long way off publishing a paper, but it’s a start.
It was going great until the 13th word: "Matrices." After a small sidetrack (thank you, The Organic Chemistry Tutor) and a tea break, I now understand what a matrix is. Time to go back to PyTorch. It turns out that while tensors are a simple concept, they’re hard to get my head around. Something wasn’t clicking. I put some code together to help me visualise the 3D tensor. So, a tensor with the shape (4, 2, 3):
A tensor is a way of storing numbers. A 0D tensor is a single number, like saying x = 9. If you’re doing physics or mathematics, it would store a scalar value.
A 1D tensor is similar to an array of numbers (or a list in Python). In maths and physics, it would be a vector:
x = [2, 4, 5, 6]
A 2D tensor (a matrix) is a two-dimensional array of numbers in rows and columns:
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
A 3D tensor can be pictured like matrices stacked on top of matrices. The image below is something I made using Matplotlib.
Of course, higher-dimensional tensors exist, but they are hard to picture. I currently don’t have a way to draw in the 4th and 5th dimensions. Luckily, computers can just look at the numbers, so we don’t need to visualise them.
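To make the dimensions concrete, here is a minimal sketch that builds one tensor of each kind described above and prints its number of dimensions and its shape. The variable names are just for illustration, and the 3D values are filler numbers from torch.arange.

```python
import torch

scalar = torch.tensor(9)                      # 0D: a single number
vector = torch.tensor([2, 4, 5, 6])           # 1D: a list of numbers
matrix = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12]])      # 2D: 3 rows, 4 columns
stack = torch.arange(24).reshape(4, 2, 3)     # 3D: 4 matrices, each 2 rows by 3 columns

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("stack", stack)]:
    print(name, t.ndim, tuple(t.shape))
```

Reading the shape (4, 2, 3) left to right: pick one of 4 matrices, then one of its 2 rows, then one of the 3 numbers in that row.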
Tensors seem like they can do a lot of things. They feel powerful, and I love how they can bridge over to NumPy.
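A small sketch of that NumPy bridge, assuming everything lives on the CPU: torch.from_numpy and Tensor.numpy do not copy the data, so both sides look at the same memory.

```python
import numpy as np
import torch

array = np.ones(3)
tensor = torch.from_numpy(array)   # shares memory with `array`, no copy made

array += 1                         # an in-place change on the NumPy side...
print(tensor)                      # ...shows up in the tensor too

back = tensor.numpy()              # and back to NumPy, still the same memory
```

The flip side is that an accidental in-place edit on one side changes the other, so call .clone() or .copy() when you want an independent copy.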
Next up on my highly caffeinated journey: autograd. Training a neural network happens in two steps, Forward Propagation and Backward Propagation, and autograd is the part of PyTorch that does the gradient calculations for the second one.
So, Forward Propagation is exactly what you think it is, as long as you think about the way input data is passed through the multiple layers of a neural network. By running the input through each layer in turn, the network makes an educated guess about the output.
Backward Propagation is a bit more complicated. The network adjusts the parameters used to make the guess based on the error in that guess. It does this by traversing backwards from the output, calculating the gradient of the error with respect to each parameter, and then nudging the parameters in the direction that decreases the error (this is gradient descent).
Autograd keeps track of all the calculations done on a piece of data and records them, in order, in a graph (a directed acyclic graph, to be precise). This graph is built up during Forward Propagation. Once it is built, Backward Propagation walks it in reverse to calculate the gradient at each stage. An optimiser can then use those gradients to change the parameters, lower the error, and make our model more accurate. The graph is rebuilt from scratch on every forward pass, so autograd also allows for dynamic computation: the model can use ordinary Python control flow and change its layout depending on the input.
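Here is about the smallest autograd example I can think of, as a sketch rather than anything from the course. The function y = x₁² + x₂² + x₃² is my own choice because its gradient, 2x, is easy to check by hand.

```python
import torch

# requires_grad=True asks autograd to record every operation done on x
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()   # forward pass: 1 + 4 + 9
y.backward()         # backward pass: walk the recorded operations in reverse

print(y.item())      # value of the function
print(x.grad)        # gradient dy/dx = 2x, one entry per element of x
```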
Now, the fun stuff. A Feed-Forward Network is a form of multi-layer neural network that takes inputs, passes them through multiple layers of neurons, and produces an output. It’s important to note there are no loops; all the data is always passed forward. A multi-layer Feed-Forward Network has one input layer, at least one hidden layer, and one output layer.
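As a rough sketch, a feed-forward network with a single hidden layer could look like this in PyTorch. The sizes are assumptions for illustration: 28x28 greyscale images in, 64 hidden neurons, 10 possible classes out.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Flatten(),              # input layer: a 28x28 image becomes 784 numbers
    nn.Linear(28 * 28, 64),    # hidden layer with 64 neurons
    nn.ReLU(),                 # non-linearity between the layers
    nn.Linear(64, 10),         # output layer: one score per class
)

batch = torch.rand(5, 1, 28, 28)   # 5 fake images, 1 channel each
logits = model(batch)              # data only ever flows forward
print(logits.shape)
```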
To be honest, I’ve gotten a bit bored of reading and not coding, so let’s try to make an OCR model (Optical Character Recognition model) that I can link up to OpenCV to read characters from my webcam.
To start off, we create a class for our dataset. Basically, this is just a fancy way to load an image as a tensor, but we also pass in our transforms (I’ll mention these in a minute) as well as a flag called is_training (this is the way I’m doing it). If this flag is true, then we also return the label with the image. In addition to deciding what data to return (e.g., label or no label), we resize all the images to be the same size, convert the image into an array, and then finally apply the transforms. Transforms are operations applied to an image, such as a small random rotation, changes to hue and brightness, and greyscaling, that help your model recognise characters in more situations. While these are not mandatory, they help lower the amount of data you need to train your model.
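The shape of such a class, as a hedged sketch and not the code from this project: CharacterDataset, to_greyscale, and the 28x28 target size are all hypothetical names and values, and the images are random in-memory tensors so the example runs without any files.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset


class CharacterDataset(Dataset):
    """Hypothetical dataset that keeps images in memory instead of reading files."""

    def __init__(self, images, labels=None, transform=None,
                 is_training=True, size=(28, 28)):
        self.images = images            # list of (channels, height, width) tensors
        self.labels = labels
        self.transform = transform
        self.is_training = is_training
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        # Resize so every sample has the same shape (interpolate wants a batch dim).
        image = F.interpolate(image.unsqueeze(0), size=self.size,
                              mode="bilinear", align_corners=False).squeeze(0)
        if self.transform is not None:
            image = self.transform(image)
        if self.is_training:
            return image, self.labels[index]   # training: image plus label
        return image                           # prediction: image only


def to_greyscale(image):
    """Toy transform: average the colour channels into one."""
    return image.mean(dim=0, keepdim=True)


images = [torch.rand(3, 40, 30), torch.rand(3, 32, 32)]
train_set = CharacterDataset(images, labels=[0, 1], transform=to_greyscale)
predict_set = CharacterDataset(images, transform=to_greyscale, is_training=False)

image, label = train_set[0]
print(image.shape, label)
```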
Step two involves loading our data using the class. When we load the data, we specify a batch_size, which is the number of samples (pieces of data) passed through the model in one go. An epoch represents one complete pass through the training dataset, and within it, the model processes the data in these smaller groups or "batches." A higher batch size can make each epoch faster, but very large batches may also lead to decreased accuracy in the model.
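To see how batch size and epochs relate, here is a small sketch using PyTorch's DataLoader on 10 fake samples. The data and the batch size of 4 are made up for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

images = torch.rand(10, 1, 28, 28)          # 10 fake greyscale images
labels = torch.randint(0, 10, (10,))        # 10 fake labels
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

# One loop over the loader is one epoch.
batch_sizes = [len(batch_images) for batch_images, batch_labels in loader]
print(len(loader), batch_sizes)             # 3 batches per epoch: 4, 4 and 2 samples
```

The last batch is smaller because 10 does not divide evenly by 4; passing drop_last=True to the DataLoader would skip it.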
Epochs are the number of times we loop through the data, and while there is no upper limit, you can do too many loops. Overfitting happens when the model doesn’t just learn the patterns but also learns all the background noise. This means the model can only spot the data it was trained on. There’s also a point where the results improve so marginally that it’s not worth the power spent.
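One way to decide when extra epochs stop being worth the power is to watch how much the loss improves. The helper below is a hypothetical sketch of that idea (the name should_stop and its thresholds are mine). It assumes you collect one loss value per epoch, ideally measured on data the model was not trained on.

```python
def should_stop(losses, min_improvement=0.001, patience=3):
    """True when each of the last `patience` epochs improved the loss
    by less than `min_improvement` (or made it worse)."""
    if len(losses) <= patience:
        return False                     # not enough history yet
    recent = losses[-(patience + 1):]
    improvements = [earlier - later for earlier, later in zip(recent, recent[1:])]
    return all(step < min_improvement for step in improvements)


print(should_stop([1.0, 0.5, 0.3, 0.2]))               # still improving
print(should_stop([1.0, 0.5, 0.5, 0.4999, 0.4998]))    # barely moving
```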
In each epoch, we do backpropagation and optimisation. Optimisation is the process of changing the parameters to lower the loss (the measure of how wrong the model's guesses are). I’ve come across two optimisers: SGD (Stochastic Gradient Descent) and Adam. SGD updates the model parameters using a single learning rate for all parameters, which is simple but sometimes slow to converge (convergence is the point at which a model's performance stabilises, with the loss function settling at a minimal, roughly constant value). Adam (Adaptive Moment Estimation), on the other hand, adapts the step size for each parameter individually, often leading to faster convergence on complex problems.
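Here is a sketch of one training loop that can use either optimiser. It is a toy problem of my own choosing (learning y = 3x + 1 with a single linear layer), not the OCR model, and the learning rate of 0.1 is just a value that works for this example.

```python
import torch
from torch import nn


def train(optimizer_name, epochs=200):
    torch.manual_seed(0)                        # same data and starting weights each run
    inputs = torch.rand(64, 1)
    targets = 3 * inputs + 1                    # the pattern to learn
    model = nn.Linear(1, 1)
    loss_fn = nn.MSELoss()
    if optimizer_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()                   # clear gradients from the previous step
        loss = loss_fn(model(inputs), targets)  # forward propagation
        loss.backward()                         # backward propagation
        optimizer.step()                        # optimisation: update the parameters
        losses.append(loss.item())
    return losses


sgd_losses = train("sgd")
adam_losses = train("adam")
print("SGD :", sgd_losses[0], "->", sgd_losses[-1])
print("Adam:", adam_losses[0], "->", adam_losses[-1])
```

Swapping optimisers is a one-line change, which makes it cheap to try both on a real model and compare the loss curves.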
Now that we have our training model, we let it run for a few hours and listen to our laptop cry. (I find this is a good opportunity to go to the gym, the pub, or both. Trust me—let your laptop have its time.) Due to this break, it’s time for a quick message from our sponsor: ME. There’s a small "Buy Me a Coffee" button in the bottom right. I’d greatly appreciate it if you took a look.
Now for the most important final step: saving the model. This one line can save your sanity: torch.save(model.state_dict(), 'model.pth'). I may or may not have fully trained my model and then realised I didn’t have this SINGLE line of code to save it. This led to a small cry and me spending the next FOUR HOURS retraining the model so I could save it.
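A sketch of the full save-and-reload round trip, using a tiny nn.Linear as a stand-in for the trained model and a temporary folder so nothing on disk gets overwritten:

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)                        # stand-in for the trained model

path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)           # saves the weights only, not the class

restored = nn.Linear(4, 2)                     # must be built with the same architecture
restored.load_state_dict(torch.load(path))
restored.eval()                                # switch to inference mode before predicting
```

Because the state dict holds only the weights, the code that defines the model still has to be available when you load it.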
Finally, using the model is straightforward:
- Load the model we made earlier.
- Load the saved weights that we just trained.
- Take a photo using OpenCV and pass it through a fun bit of code to make it greyscale.
- Input it into the fancy black-box AI we made and wait for it to spit out an output.
- Celebrate that it works!
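The steps above can be sketched roughly like this. To keep it runnable without a webcam or a trained model, a random array stands in for the frame that cv2.VideoCapture(0).read() would return, an untrained stand-in network replaces the real one, and the greyscale conversion is done with NumPy. The function name frame_to_tensor and the 28x28 size are assumptions.

```python
import numpy as np
import torch
from torch import nn


def frame_to_tensor(frame):
    """Turn a BGR frame (height, width, 3), the layout OpenCV uses,
    into a (1, 1, height, width) greyscale float tensor."""
    blue, green, red = frame[..., 0], frame[..., 1], frame[..., 2]
    # Standard luminance weights; with OpenCV installed,
    # cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) does roughly the same job.
    grey = 0.114 * blue + 0.587 * green + 0.299 * red
    tensor = torch.from_numpy(grey.astype(np.float32) / 255.0)
    return tensor.unsqueeze(0).unsqueeze(0)    # add batch and channel dimensions


# Stand-in model. In practice, rebuild the trained architecture here and call
# model.load_state_dict(torch.load('model.pth')) to load the saved weights.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

# Stand-in for a webcam frame that has already been cropped to 28x28.
frame = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)

with torch.no_grad():                          # no gradients needed for inference
    logits = model(frame_to_tensor(frame))
predicted_class = logits.argmax(dim=1).item()  # index of the highest score
print(predicted_class)
```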
So that’s a very high-level overview of an OCR model. If you’re interested, here’s the code. If you have any questions, head over to our Discord.
Next, I need to make an LSTM (Long Short-Term Memory) and an ARIMA (Autoregressive Integrated Moving Average) for my research project. However, this feels like a natural break, so let’s make it a 2-parter. If you’d like me to make some YT videos on this, let me know, and as always, thank you for reading.
If you enjoyed the blog, feel free to join the Discord or Reddit to get updates and talk about the articles. Or follow the RSS feed.