<h1 id="shame">Shame</h1>

<p>2020 has turned out to be a completely outrageous year. Not at all what I expected when I joined OpenAI’s Scholars Program. All told, it’s been extremely hard to concentrate on my work between personal medical issues and the world falling apart. While I wish so many of the things that happened this year hadn’t, I am grateful that racial inequality is getting such high-profile and desperately needed attention. And honestly, being distracted in my program is such a tiny price for me, as a white person, to pay for the good that I hope may come out of this revolutionary time.</p>
<p>With that said, my original vision for this blog (as a narrative arc where I share and de-mystify my process of learning) is no longer practical. I’ve been too distracted to keep up blogging as I’ve learned, so I’m not going to try to go back and write months’ worth of posts all at once. Instead I’m going to jump forward to where I am now, write about what I’m learning, and try to backfill what’s needed in the future.</p>
<p>There is one thing I’ve learned over the past few months that I do think is important to highlight here, though. That’s about shame. In mid-February, I started struggling with asthma. My asthma made me unable to concentrate or work productively, which made me feel ashamed that I wasn’t doing as much or as well as I wanted to, and knew I could. I got wrapped up in that shame, and it turned into anxiety, which exacerbated the asthma, and it was a downward spiral. Through this process I did manage to solve the original environmental causes of the asthma, and I finally managed to get my anxiety under control as well. To do that, I had to embrace the parts of me that I was ashamed of. I had to accept that maybe I wasn’t going to be able to finish the program or my project at all, that maybe I was just going to fail and maybe that really was the best I could do.</p>
<p>Only when I started to look at that as a real possibility and make peace with the idea of failure, did I finally start taking the steps I needed to take to care for my health. I got a prescription for anxiety medication and I also cut myself off from the news for a while. This felt dangerous as a trans person, but it was an urgent self-care decision. When I stayed in touch with the news, my heart hurt. I don’t mean emotionally. I mean, reading the news literally caused strong physical pain in my chest. My heart started skipping beats regularly, and I stopped sleeping. For a while, my body was also losing the ability to regulate my temperature. I would be bundled up and freezing at 75 degrees but my thermometer showed I wasn’t running a fever. My allergist told me it was the first time she’d ever seen me so fragile.</p>
<p>And through all of that, I felt like it was my one shot to finally follow the career I’d been dreaming of for so long. I felt like I needed to impress my mentor and my co-workers and show them what I knew I could do. And I felt like that chance was slipping out of my hands. But the reality was so much kinder and more open-ended than that. I remember talking to Christina, one of the recruiters who runs the program, back in April and I just broke down crying. She was so kind. She told me they knew how hard this period was for all of us and that she and my mentor, Christine, were working to get us more support and an extension for our projects. I was so grateful for the compassion in my period of fear and struggle. One of the things that’s impressed me the most about OpenAI is the company’s compassion. Every person I’ve talked to there has been an incredible combination of smart and kind. On the weeks when I couldn’t get any work done at all, Christine gently supported and encouraged me. She helped me relax and set aside my belief that I had to be perfect all the time and she just told me to do what I can. That calmed me down enough that, in the end, I was able to put together a project that’s gotten a lot of praise and that I’m really quite proud of.</p>
<p>So, I’m writing this post for other people dreaming of going into AI, especially as the world teeters on the brink of precipices that might threaten you personally. If you see my work in the future and think you have to prove yourself every minute of every day to do what I’ve done, you don’t. You can be weak. You can be afraid. You can struggle. And you can still succeed. I know because I did it. You just need to be kind to yourself and you need to have people around you who are also kind to you. You deserve that kindness. If you’re part of a minority group, you may sometimes forget that you deserve kindness because you may be less accustomed to receiving it than others. But you do. And if you accept your failings, and love yourself despite them, and do your best to surround yourself with people who will treat you with kindness, then you can be successful and you can be happy, which is so much better than perfect.</p>

<h1 id="looking-for-grammar">Looking for Grammar in all the Right Places</h1>

<h2 id="interpretability">Interpretability</h2>
<p>Over the course of the OpenAI Scholars Program, I became fascinated with interpretability. Interpretability is like “mind reading” for neural networks. It’s about looking inside of networks to understand how they represent and process information. This is difficult because of how different deep learning is from traditional software engineering. In traditional software engineering, a human being writes software, and that software takes inputs and gives outputs. (If the software is a word processor, the inputs are keystrokes and clicks and the outputs are documents. If the software is a search engine, the inputs are search queries and the outputs are links to webpages.)</p>
<p><img src="/images/looking-for-grammar/traditional_software_engineering.png" alt="Traditional Software Engineering" /></p>
<p>In deep learning, a human being creates math and feeds a bunch of training data through that math. Together, the data, the math, and the computer(s) create a piece of software, which takes inputs and gives outputs. The human being does not directly create the software and consequently does not plan how it will work and (most likely) does not even know how it works:</p>
<p><img src="/images/looking-for-grammar/deep_learning_software.png" alt="Deep Learning Software" /></p>
<p>It’s much harder to understand software created by math than it is to understand software created by human beings.</p>
<p>But it’s also extremely important to understand how the AI systems in our lives work, because these systems have a huge impact on us.</p>
<h2 id="gpt-2-interpretation">GPT-2 Interpretation</h2>
<p>I’m particularly fascinated by transformer-based language models, so I decided to try my hand at interpreting GPT-2, a transformer-based language model that had state-of-the-art performance when it was released by OpenAI in early 2019. As a tractable first project, I decided to look for how GPT-2 understands English grammar. For a layperson-friendly explanation of transformers, and GPT-2 in particular, I encourage you to watch the short talk I gave about this project, here:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/J1rRYpmnUVE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>For a deeper dive into transformers, I recommend <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>.</p>
<h2 id="datasets">Datasets</h2>
<p>I decided to look at GPT-2’s representations of simple part of speech, detailed part of speech, and syntactic dependencies. To accomplish this, I began by building three large datasets of sentences, one for each of these categories. First I started with the <a href="https://nyu-mll.github.io/CoLA/">Corpus of Linguistic Acceptability</a>, which is a widely known dataset that labels sentences as grammatically correct or incorrect. I dropped all of the grammatically incorrect sentences and kept only the grammatically correct ones. Then I tokenized these sentences using <a href="https://spacy.io/usage/linguistic-features">spaCy</a> and the <a href="https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer">GPT-2 BPE tokenizer</a>. I only kept sentences whose spaCy tokenization resulted in the same number of tokens as the GPT-2 tokenization. In other words, sentences that had a one-to-one correspondence between words + punctuation marks and GPT-2 BPE tokens. Then, for each sentence, I labeled that sentence with the list of spaCy <code class="language-plaintext highlighter-rouge">token.pos_</code> (simple part of speech), <code class="language-plaintext highlighter-rouge">token.tag_</code> (detailed part of speech), and <code class="language-plaintext highlighter-rouge">token.dep_</code> (syntactic dependency) tags for each token in the sentence. For the final punctuation mark, I kept the punctuation itself as the label. So here’s an example of how a sentence would get labeled for each dataset:</p>
<p><strong>Example sentence: “I enjoyed this project!”</strong></p>
<table>
<tbody>
<tr>
<td><strong>Dataset</strong></td>
<td><strong>Sentence Label</strong></td>
<td><strong><code class="language-plaintext highlighter-rouge">I</code></strong></td>
<td><strong><code class="language-plaintext highlighter-rouge">enjoyed</code></strong></td>
<td><strong><code class="language-plaintext highlighter-rouge">this</code></strong></td>
<td><strong><code class="language-plaintext highlighter-rouge">project</code></strong></td>
<td><strong><code class="language-plaintext highlighter-rouge">!</code></strong></td>
</tr>
<tr>
<td><strong>simple parts of speech</strong></td>
<td><code class="language-plaintext highlighter-rouge">PRON|VERB|DET|NOUN|!</code></td>
<td>PRON (pronoun)</td>
<td>VERB (verb)</td>
<td>DET (determiner)</td>
<td>NOUN (noun)</td>
<td>!</td>
</tr>
<tr>
<td><strong>detailed parts of speech</strong></td>
<td><code class="language-plaintext highlighter-rouge">PRP|VBD|DT|NN|!</code></td>
<td>PRP (pronoun, personal)</td>
<td>VBD (verb, past tense)</td>
<td>DT (determiner)</td>
<td>NN (noun, singular or mass)</td>
<td>!</td>
</tr>
<tr>
<td><strong>syntactic dependencies</strong></td>
<td><code class="language-plaintext highlighter-rouge">nsubj|ROOT|det|dobj|!</code></td>
<td>nsubj (nominal subject)</td>
<td>ROOT (None)</td>
<td>det (determiner)</td>
<td>dobj (direct object)</td>
<td>!</td>
</tr>
</tbody>
</table>
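<p>To make this concrete, here’s a minimal sketch of the labeling step. This is an illustration rather than the exact code from the project; it assumes spaCy’s <code class="language-plaintext highlighter-rouge">en_core_web_sm</code> model and uses the simple length-matching check described above as the alignment criterion:</p>

<pre><code class="language-python">import spacy
from transformers import GPT2Tokenizer

nlp = spacy.load("en_core_web_sm")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def label_sentence(sentence):
    doc = nlp(sentence)
    # Keep only sentences where the spaCy and GPT-2 BPE tokenizations
    # produce the same number of tokens (the alignment criterion above).
    if len(doc) != len(gpt2_tokenizer.tokenize(sentence)):
        return None
    pos = [t.pos_ for t in doc]  # simple part of speech
    tag = [t.tag_ for t in doc]  # detailed part of speech
    dep = [t.dep_ for t in doc]  # syntactic dependency
    # Keep the final punctuation mark itself as its own label.
    if doc[-1].is_punct:
        pos[-1] = tag[-1] = dep[-1] = doc[-1].text
    return "|".join(pos), "|".join(tag), "|".join(dep)

# Expected output (exact tags can vary by spaCy version and model):
# ('PRON|VERB|DET|NOUN|!', 'PRP|VBD|DT|NN|!', 'nsubj|ROOT|det|dobj|!')
print(label_sentence("I enjoyed this project!"))
</code></pre>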
<p>Then I used <a href="https://github.com/jcpeterson/openwebtext">openwebtext</a> to download a few months’ worth of web pages. I extracted the sentences from those pages and added each sentence to my datasets if and only if (1) the number of spaCy tokens in the sentence matched the number of GPT-2 tokens in the sentence, and (2) the label for that sentence matched a pre-existing label in that dataset for a grammatically correct sentence from CoLA. Finally, I split each of my three datasets into train, validation, and test sets and ended up with:</p>
<table>
<tbody>
<tr>
<td><strong>Dataset</strong></td>
<td><strong>Labels/Grammatical Structures</strong></td>
<td><strong>Training labels with > 500 sentences each</strong></td>
<td><strong>Total Training Sentences</strong></td>
<td><strong>Total Validation Sentences</strong></td>
<td><strong>Total Test Sentences</strong></td>
</tr>
<tr>
<td><strong>simple parts of speech</strong></td>
<td>2,837</td>
<td>104</td>
<td>222,794</td>
<td>15,903</td>
<td>16,017</td>
</tr>
<tr>
<td><strong>detailed parts of speech</strong></td>
<td>3,162</td>
<td>60</td>
<td>125,756</td>
<td>8,948</td>
<td>9,226</td>
</tr>
<tr>
<td><strong>syntactic dependencies</strong></td>
<td>2,643</td>
<td>157</td>
<td>333,427</td>
<td>23,962</td>
<td>24,119</td>
</tr>
</tbody>
</table>
<h2 id="measuring-grammatical-understanding">Measuring Grammatical Understanding</h2>
<p>After building these three datasets, I replaced the <a href="https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel">language modeling linear layer</a> on top of GPT-2 with three other, independent linear layers, which output probabilities of grammatical labels like this:</p>
<p><img src="/images/looking-for-grammar/grammar_classifiers.gif" alt="" /></p>
<p>I then froze GPT-2 and trained each of my grammatical classifier linear layers using cross-entropy loss. I recorded the loss after each epoch of training. In addition, I repeated this training process on top of each transformer layer, as well as on top of the input embedding (using no transformer layers of GPT-2). I looked at how slow/difficult it was to train these classifiers on each transformer layer and found that they trained the fastest and achieved the best loss on the middle layers of the network. Here’s what that looked like for syntactic dependency loss:</p>
<p><img src="/images/looking-for-grammar/dep_loss.png" alt="Dependency loss by layer" /></p>
<p>The horizontal axis here is epochs and the vertical axis is the number of transformer layers of GPT-2, from 0 transformer layers (directly on top of the input embedding) to 12 transformer layers (all of GPT-2 small). The color goes from red (high loss) to green (low loss). The syntactic dependency classifier had a progressively easier time classifying the incoming sentence as we increased through the first five transformer layers of GPT-2, until it got its best overall score (in the fewest epochs) at layer 5. Then at higher layers, it started to have a harder time again.</p>
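<p>Concretely, each of these classifiers was just a single linear layer trained on frozen GPT-2 activations. Here’s a minimal sketch of one such probe, written against recent versions of the Huggingface transformers library (the project itself was built on v3.0.1, whose API differs slightly); <code class="language-plaintext highlighter-rouge">LAYER</code> and <code class="language-plaintext highlighter-rouge">NUM_LABELS</code> are placeholders:</p>

<pre><code class="language-python">import torch
import torch.nn as nn
from transformers import GPT2Model

LAYER = 5        # which transformer layer's activations to probe
NUM_LABELS = 50  # e.g., the number of syntactic dependency labels

gpt2 = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
for p in gpt2.parameters():
    p.requires_grad = False  # freeze GPT-2; only the probe trains

probe = nn.Linear(gpt2.config.n_embd, NUM_LABELS)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(input_ids, labels):
    # hidden_states[0] is the embedding output; hidden_states[k] is layer k
    with torch.no_grad():
        hidden = gpt2(input_ids).hidden_states[LAYER]
    logits = probe(hidden)  # shape: (batch, sequence, NUM_LABELS)
    loss = loss_fn(logits.view(-1, NUM_LABELS), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre>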
<p>This made me think the first half of the network might be focused on understanding the incoming tokens and the second half of the network might be focused on producing the outgoing tokens. Since GPT-2 is built to generate probabilities of subsequent tokens in each position, I tested my theory by shifting the grammatical labels one position to the left (so they would better match the outgoing tokens) and repeating the experiment. And here’s what I saw:</p>
<p><img src="/images/looking-for-grammar/dep_loss_shifted.png" alt="Outgoing dependency loss by layer" /></p>
<p>When the classifier was trying to produce the grammatical structure of a likely output sentence, its loss was high in the first half of the network and only got better in the second half, with the best loss score coming at layer 8. This was convincing evidence that the first half of the network was, indeed, more focused on the grammar of the incoming tokens and the second half was more focused on the grammar of the outgoing (probable) tokens. In addition, this gave a small validation that the training results for my linear layer do serve as a viable measure of informational availability at each transformer layer.</p>
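<p>In code, the shifted experiment is a tiny change. Reusing the names from the probe sketch above, the pairing of activations and labels becomes roughly:</p>

<pre><code class="language-python"># Pair each position's activation with the label of the *next* token, so
# the probe predicts the grammar of the outgoing token, not the incoming one.
hidden_in = hidden[:, :-1, :]  # drop the final position's activation
labels_out = labels[:, 1:]     # drop the first label
loss = loss_fn(probe(hidden_in).reshape(-1, NUM_LABELS),
               labels_out.reshape(-1))
</code></pre>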
<p>It’s also interesting (and makes sense) that the outgoing classifier scored a much higher loss than the incoming classifier. Since the network does not deterministically produce output tokens, and since future input tokens are not allowed to impact past output positions (by means of attention masking in GPT-2), this means that the output is more flexible and less precisely defined than the input. So it makes sense that it would be harder to definitively say that the output token will match the grammatical structures of the input sentences. This is all a long-winded way of saying that it’s harder to describe the future than the past, because the future is not precisely known the way the past is.</p>
<p>I also ran this experiment for simple part of speech and detailed part of speech and found these results:</p>
<p><img src="/images/looking-for-grammar/pos_loss.png" alt="Simple POS loss by layer" /></p>
<p><img src="/images/looking-for-grammar/tag_loss.png" alt="Detailed POS loss by layer" /></p>
<p>We can see that it was slightly easier for the classifier to understand simple part of speech than to understand detailed part of speech. This makes sense because simple part of speech is, well, simpler. In addition, both simple and detailed part of speech have the best loss scores at layer 3, which indicates that part of speech is easier to extract from the initial embedding vectors than syntactic dependency is. This would be expected because part of speech can (often) be determined from simply knowing an individual token, but syntactic dependency requires that token, its position, and likely the other tokens in nearby positions. The embedding space could contain some part of speech information but it’s unlikely to contain much, if any, syntactic dependency information.</p>
<h2 id="hunting-wabbit-heads">Hunting Wabbit (Heads)</h2>
<p>Training the classifiers told me that incoming part of speech is understood in layers 1-3 and incoming syntactic dependencies are understood in layers 1-5. So, I used truncated versions of GPT-2 (with the grammatical classifiers I trained on those truncated versions) to search for which heads played the biggest role in understanding grammar.</p>
<p>Initially, I followed the method laid out in the paper <a href="https://arxiv.org/abs/1905.10650">Are Sixteen Heads Better Than One?</a>, which was also implemented by Huggingface in their <a href="https://github.com/huggingface/transformers/blob/v3.0.1/examples/bertology/run_bertology.py">Bertology example</a>. This technique involves creating a ones tensor whose dimensions are the number of transformer layers and the number of attention heads per layer. Then I set this tensor to require gradients and multiplied the <code class="language-plaintext highlighter-rouge">[i, j]</code>th element of the tensor by the output of the <code class="language-plaintext highlighter-rouge">j</code>th attention head in layer <code class="language-plaintext highlighter-rouge">i</code>. Then I used back-propagation to calculate the Jacobian of the grammatical classification loss with respect to the head coefficients. The idea is that if the derivative of the loss with respect to a given head’s coefficient is low, that head was probably not very important for labeling the grammatical structure. By contrast, if the derivative of the loss with respect to a head’s coefficient is high, that head should be important for understanding the grammar of the incoming tokens.</p>
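<p>Here’s a rough sketch of that technique using the <code class="language-plaintext highlighter-rouge">head_mask</code> argument that Huggingface models expose (in their implementation the mask scales each head’s attention weights, which serves the same purpose). It reuses <code class="language-plaintext highlighter-rouge">probe</code>, <code class="language-plaintext highlighter-rouge">loss_fn</code>, and <code class="language-plaintext highlighter-rouge">NUM_LABELS</code> from the earlier sketch:</p>

<pre><code class="language-python">import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
n_layers, n_heads = model.config.n_layer, model.config.n_head

# A coefficient of 1.0 per head, with gradients enabled so we can read off
# d(loss)/d(coefficient) for every head after a single backward pass.
head_coef = torch.ones(n_layers, n_heads, requires_grad=True)

hidden = model(input_ids, head_mask=head_coef).last_hidden_state
logits = probe(hidden)
loss = loss_fn(logits.view(-1, NUM_LABELS), labels.view(-1))
loss.backward()

importance = head_coef.grad.abs()  # (n_layers, n_heads) local estimates
</code></pre>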
<p>Since the loss can be (and almost certainly is) a non-linear function of these coefficients, though, the derivatives with respect to the coefficients are only local, linear approximations of the importance of the heads. So, I decided to compare the results of this strategy with the impact on the loss when I pruned each individual head (which gives a definitive measurement of that head’s importance). And sadly, I found that the coefficient derivative strategy did not reflect head importance as accurately as pruning did.</p>
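<p>Measuring importance by pruning is more expensive (it takes one evaluation pass per candidate head), but it’s direct. A sketch, again with names reused from above:</p>

<pre><code class="language-python">import torch
from transformers import GPT2Model

def loss_without_head(layer, head, input_ids, labels):
    # Load a fresh model each time so pruning doesn't accumulate across calls.
    model = GPT2Model.from_pretrained("gpt2")
    model.prune_heads({layer: [head]})  # physically remove one attention head
    with torch.no_grad():
        hidden = model(input_ids).last_hidden_state
        loss = loss_fn(probe(hidden).view(-1, NUM_LABELS), labels.view(-1))
    return loss.item()
</code></pre>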
<p>So, in the end, I just looked at the impact on the loss for each grammatical structure from pruning each head. And here are the impact maps for the top 30 syntactic dependency structures in my dataset:</p>
<p><img src="/images/looking-for-grammar/head_impacts_dep.png" alt="Head Impacts" /></p>
<p>Using these maps, I slowly pruned more and more heads out of GPT-2 for each syntactic dependency structure, to find which combination of pruned heads produced the lowest loss. Here are the best masks for the top 30 syntactic dependency structures (black means the head is pruned; white means it’s retained):</p>
<p><img src="/images/looking-for-grammar/head_impact_masks_dep.png" alt="Head Impact Masks" /></p>
<p>These masks seem extremely likely to give a map of which heads are involved in understanding each of these incoming syntactic dependency structures. We can see that similar structures require similar collections of heads, and that simpler grammatical structures require fewer heads than more complex ones. Finally, I looked for which heads have the most impact for structures containing each syntactic dependency label, and I found that many labels each seem to be understood by a very small number of heads:</p>
<p><img src="/images/looking-for-grammar/dep_head_impact_by_label.png" alt="Head impact for each dependency label" /></p>
<p>In the future, I would like to test whether the heads needed for a given sentence structure are the same as the union of the heads needed for the labels comprising that structure. I would also like to open up the individual heads for each part of speech label and understand how the query, key, and value weights map clusters of incoming tokens from the embedding space to clusters in the much lower-dimensional key/value spaces. I have a suspicion that tokens in the embedding space will cluster into parts of speech, and that these cluster boundaries will be critical in the projections to lower-dimensional key/value spaces performed by grammatical heads.</p>

<h1 id="deep-learning-hardware">Deep Learning Hardware</h1>

<p>Deep learning requires special kinds of computer hardware. This is a post about what makes that hardware so different from traditional computer architecture, and how to get access to the right kind of hardware for deep learning. If you just want practical recommendations about how to get your hands on the right hardware, feel free to jump to <a href="#so-what-should-i-do">the end</a>. If you want to understand what makes GPUs (and TPUs) such a philosophically different approach to computing than CPUs, and why that matters for deep learning, then read on…</p>
<h2 id="history">History</h2>
<p>To really understand the difference between hardware that works well for deep learning and hardware that doesn’t, we need to cover some history. In the mid 19th century, there was a growing crisis in the mathematical world. The historical techniques that had worked well for reasoning about algebra and geometry were producing contradictions when mathematicians tried to apply them to some of the more subtle infinities and infinitesimals of calculus. So mathematicians started looking for new, more precise ways of reasoning that could resolve these contradictions.</p>
<p>This was the birth of modern formal logic. At its core, it was a way to translate math into strings of symbols, and rules for using prior strings of symbols to derive new strings of symbols that represented consistent statements about mathematics. You’re probably familiar with something very similar to this kind of deductive symbol processing from algebra:</p>
\[\begin{align}
x &= (4+5)/x\\
x &= 9 /x\\
x*x &= 9\\
x^2 &= 9\\
x &= \sqrt{9}\\
x &= 3
\end{align}\]
<p>Each step here is a string of symbols, and there are specific rules for how to convert from the string of symbols in one step to the string of symbols in the next step. For instance, to go from step 1 to step 2, we follow a rule that tells us we’re allowed to convert these five symbols: “\((4+5)\)” into this one: “\(9\)”. And similarly, to go from step 3 to step 4, we follow a rule that tells us we’re allowed to convert “\(x*x\)” into “\(x^2\)”.</p>
<p>Formal mathematical logic can be thought of as a generalization of this process that allows one to express all of mathematics and all mathematical proofs as sequences of strings of symbols, where each string is derived from prior strings in accordance with rules of symbolic deduction. As mathematicians came to understand how these systems of symbolic logic worked, they started describing them in terms of machines that could perform the operations that transition from one string of symbols to the next. These theoretical machines were the basis of modern computers. In fact, most computers built today still roughly follow the general architecture laid out by the mathematician John von Neumann in 1945.</p>
<h2 id="von-neumann-computers-and-central-processing-units-cpus">von Neumann Computers and Central Processing Units (CPUs)</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture">von Neumann architecture</a> consists of a few basic building blocks:</p>
<ul>
<li><strong>Input</strong> - This is a way for a person to give the computer data and instructions for processing that data. On modern desktop computers, this might be a keyboard and mouse, and on servers, this might be a network connection that takes requests.</li>
<li><strong>Memory</strong> - This is where the data and instructions are stored. Today this means both RAM and hard drives. RAM is <em>much</em> faster than hard drives (usually thousands of times faster), but it can’t retain information without a constant supply of electricity. So long-term storage is on hard drives, and short-term storage is in RAM.</li>
<li><strong>Output</strong> - This is how the computer returns the results of the computation. On a modern desktop, this would be a screen or speakers. For a modern server this would be a reply sent through the network connection.</li>
<li><strong>Arithmetic Unit</strong> - This is what actually performs the symbolic manipulation to go from one or more strings of symbols to the next string of symbols.</li>
<li><strong>Control Unit</strong> - This keeps track of which string(s) the Arithmetic Unit needs to process for the current step.</li>
</ul>
<p><strong><em>Today, the Arithmetic Unit and Control Unit are bundled together into a “Central Processing Unit” or CPU.</em></strong></p>
<p>The von Neumann architecture, and von Neumann-style CPUs in particular, are excellent at performing the kinds of sequential symbolic processing that computers were originally designed for. This worked well for many years while the primary computer interface was a prompt where the user would input a string of symbols, which the computer would process to calculate the next string of symbols.</p>
<p><img src="/images/terminal_math.gif" alt="Computer Terminal" /></p>
<p>Eventually though, it turned out that formal logic was only a tiny fraction of the diversity of ways that human beings engage with information (who knew?!), and it wasn’t long before people wanted to use these new, fast electrical information machines for more diverse tasks. So the Graphical User Interface (GUI) was born:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Cn4vC80Pv6Q" frameborder="10" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>With the growth of graphical interfaces, CPUs were used to compute more abstract and complex things than before, including windows and icons on higher resolution screens. This was only possible because the CPUs of the early 1980s had become fast enough to perform millions of symbolic processing steps per second.</p>
<p>For a number of years, transistor sizes shrank according to <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore’s Law</a>, and CPU speed continued to increase accordingly. And so CPUs were able to draw more and more complex and rapidly changing user interfaces including many video games. But ultimately, this was a <em>hack</em>. The basic design of CPUs was developed by mathematicians to perform the kind of sequential symbolic operations that happen in mathematical proofs, which are a profoundly different kind of task than drawing complex and rapidly changing images on a screen. As the video game industry took off, it was clear that CPUs could not perform sequential operations fast enough, and a new type of parallel computational unit would have to be invented. And so the Graphical Processing Unit (or GPU) was born.</p>
<h2 id="graphical-processing-units-gpus">Graphical Processing Units (GPUs)</h2>
<p>GPUs were a fundamental departure from the von Neumann architecture. Where CPUs were designed to operate on strings of symbols, performing <em>one computation at a time</em>, GPUs were designed to operate <em>in parallel</em> on huge collections of numbers to calculate whole expanses of a visual field simultaneously. But GPUs were originally thought of as just a tool to draw pictures on the screen.</p>
<p>Additionally, the long history of von Neumann computers meant that the majority of programming languages were designed for such computers. Today this includes languages like C, C++, Java, Javascript, Python, Swift, Go, Ruby, Rust, etc. Since these programming languages were all designed for von Neumann computers, this means that all of the software written in these languages runs on CPUs. This includes every major operating system, every word processor, every web browser, every email client, etc. This software legacy means that any GPU attached to a modern computer will be a <em>peripheral</em> component. And today, GPUs connect to traditional CPU-based computers using a protocol known as PCI (Peripheral Component Interconnect).</p>
<p>In servers and desktops, a GPU is one or more chips attached to a physical expansion card known as a “graphics card” or a “video card”. These expansion cards get their name from their card-like shape, which plugs into a PCI slot on a computer’s motherboard:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/nyDxrTHDjXQ?start=177" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>In laptops, televisions, smart phones, and smartwatches today, there are no PCI slots and consequently no expansion cards to plug into them. People do still sometimes refer to GPUs in laptops as “graphics cards” out of habit, even though this is technically inaccurate. Laptop GPUs are soldered directly onto motherboards as shown here in red:</p>
<p><img src="http://www.laptoprepair101.com/wp-images/motherboard-NVIDIA-problem/fix-failed-laptop-nvidia-chip-01.jpg" alt="" /></p>
<p>The computing paradigm of GPUs was so radically different from that of traditional CPUs that GPUs were originally largely isolated from the software running on CPUs. GPUs were only exposed to such software via higher-level graphics APIs designed for rendering images on screens. Eventually, though, GPU manufacturers started to understand the unique computational power of GPUs, and they created programming languages like CUDA, with low-level APIs that let programmers write code that executes directly, in parallel, on GPUs.</p>
<p><strong>The subsequent shift from von Neumann style symbolic processing CPUs to massively parallel calculations on GPUs was the core technological advancement that made the modern Deep Learning Era possible.</strong></p>
<p>Deep learning is inspired by the neural information processing patterns of the human brain. And human brains are fundamentally not von Neumann machines. Human beings evolved from single cell organisms and we inherited the cellular structure of most life on Earth. Evolution didn’t have high speed, high energy silicon transistors available to build our brains, and formal mathematical logic wasn’t the top priority for the early stages of our ancestors’ cognitive development. We had to develop ways to use cells to perform analog, spatial tasks like moving toward light, or recognizing and fleeing from predators. And consequently, animal and human brains are large collections of cellular neurons that process information in massively parallel ways, much more like GPUs than CPUs. This is why GPUs are so much better suited for running software simulations of neural networks than CPUs are.</p>
<p>To quote Jeremy Howard, “<a href="https://www.youtube.com/watch?v=4u8FxNEDUeg&t=44m03s">When we say Python <em>[on a CPU]</em> is too slow, we don’t mean twenty percent too slow; we mean thousands of times too slow</a>.”</p>
<h2 id="nvidia">Nvidia</h2>
<p>While many companies today make GPUs for video games, one company in particular, Nvidia, realized the revolutionary importance of GPU computing for deep learning earlier than the others, and they prioritized making high quality programming languages and software libraries for deep learning tools to build on. In the video game space, AMD is the biggest competitor to Nvidia, and their GPUs have the hardware capabilities needed for deep learning, but the AMD software stack is nowhere near as robust as Nvidia’s. So nearly all modern deep learning libraries today are built for Nvidia GPUs.</p>
<p>At this point it’s worth mentioning that Python is a very popular language for deep learning. But Python runs on CPUs, not GPUs. What makes Python a viable candidate for deep learning is its ability to transparently call code written in other languages, including languages for GPUs. So for instance, in <a href="https://pytorch.org/">PyTorch</a> when you see <code class="language-plaintext highlighter-rouge">model.to(torch.device("cuda"))</code>, this is where Python is loading code written in CUDA into the GPU to execute there. If you run a PyTorch deep learning program without this line, you’ll find it’s thousands of times slower, because the code executes on the CPU rather than the GPU.</p>
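<p>Here’s a minimal, runnable example of what that handoff looks like in PyTorch (falling back to the CPU if no GPU is available):</p>

<pre><code class="language-python">import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4096, 4096)
x = torch.randn(64, 4096)

model = model.to(device)  # copy the model's weights into GPU memory
x = x.to(device)          # copy the input tensor into GPU memory
y = model(x)              # this matrix multiply now runs on the GPU
</code></pre>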
<p>Since nearly all deep learning libraries today are (ultimately) built on CUDA, Nvidia’s GPU programming language, this means Nvidia effectively has a monopoly on GPUs suitable for running deep learning programs. Interestingly though, Nvidia does not have a monopoly on GPUs capable of running video games. Nvidia must compete on price with AMD in the desktop/laptop GPU space, but doesn’t have to compete with anyone in the deep learning server space. This has led Nvidia to differentiate their GPUs into two lines: one for personal computers and one for servers. While these two lines are not drastically different in computational capacity, the licensing for the personal computer GPUs forbids using them in datacenters. This allows Nvidia to charge much more money for datacenter GPUs than it does for comparable personal computer GPUs.</p>
<p>Given how critical GPUs are for deep learning, one of the most important advancements for deep learning today would be porting deep learning libraries like <a href="https://www.tensorflow.org/">TensorFlow</a> to run on AMD GPUs. This would break Nvidia’s monopoly on commercial deep learning hardware and drastically drive down pricing.</p>
<h2 id="so-what-should-i-do">So what should I do?</h2>
<p>If you’re new to deep learning, you’re going to need access to some kind of computer with a fast Nvidia GPU. You can either access such a computer online through a cloud provider, or you can buy or build your own. As I described above, Nvidia has a monopoly on deep learning GPUs, so cloud providers have to pay much more for the GPUs in their datacenters than you would pay for a comparable personal GPU in your own computer. This might lead you to assume that buying your own computer with a GPU is the way to go. That would probably be true if not for Google. Google is offering very basic, limited access to cloud servers with GPUs for the low, low cost of… NOTHING.</p>
<h3 id="google-colab">Google Colab</h3>
<p>That’s right, Google has a cloud service called <a href="https://colab.research.google.com/notebooks/intro.ipynb">Colaboratory</a> (or “Colab”, for short), which offers access to basic GPUs through a <a href="https://jupyter.org/">Jupyter Notebook</a> interface, for up to 12 hours at a time, for free. For someone just starting in deep learning, this is an excellent way to get easy access to a server with the right kind of hardware (and already pre-configured with the right software). The service offered through Colab is fast enough to start learning about deep learning, but there’s a good chance you’ll eventually hit the limits of what you can do with it. If those limits are primarily about speed, you can try upgrading to the $10/month Colab Pro plan, which guarantees you faster GPUs and longer execution times.</p>
<h3 id="your-own-hardware">Your own hardware</h3>
<p>If you outgrow Colab, then it may be time to buy or build your own server. Rather than dig deep into the details of which GPU you should get, I’ll point you to <a href="https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/">Tim Dettmers’s excellent page</a>, which he updates with each new hardware release. As of Spring 2020, Tim recommends an RTX 2070 for most newish users (or an RTX 2080 Ti, or potentially an RTX Titan, as upgrades). These recommendations will likely change when Nvidia releases new video cards in mid-2020. If you would like to buy a pre-configured desktop or laptop, I would recommend buying from:</p>
<ul>
<li><a href="https://system76.com/">System76</a> - They sell high quality Linux machines with options to add the kinds of GPUs you need for deep learning. This is a good source for a person wanting to break into the field.</li>
<li><a href="https://lambdalabs.com/">Lambda Labs</a> - They sell <em>high end</em> deep learning workstations, preconfigured with multiple GPUs and deep learning software. This is a good source for a small business trying to start a small AI team.</li>
</ul>
<p>If you want something a little cheaper and have a desktop around that you’re willing to run Linux on and install a GPU in, then you should go for <a href="https://ubuntu.com/download">Ubuntu Linux</a>, which is the most popular Linux distribution used in AI today, or <a href="https://system76.com/pop">Pop!_OS</a>, which is a compatible derivative of Ubuntu with a lot of additional polish built in. As for where to buy a GPU, <a href="https://www.newegg.com/">Newegg</a> and <a href="https://www.ebay.com">eBay</a> are both good choices. If you’re buying multiple GPUs to put in a desktop, make sure to buy “<a href="https://www.youtube.com/watch?v=0domMRFG1Rw">blower style</a>” GPUs, which vent the heat out the back of the machine.</p>
<h3 id="cloud-services">Cloud Services</h3>
<p>If you outgrow a desktop or two, especially if maintaining the OSes on them becomes too complex, then you may want to consider renting cloud based compute from <a href="https://cloud.google.com/">Google Cloud Platform</a>, <a href="https://aws.amazon.com/">Amazon Web Services</a>, or <a href="https://azure.microsoft.com/">Microsoft Azure</a>. Note that cloud based servers are much more expensive on an ongoing basis than running your own desktop, and they also use a completely different line of GPUs.</p>
<p>If you’re going with cloud servers, it’s important to understand the idea of virtual machines (or VMs). A virtual machine is a piece of software that acts like a physical computer, even though it’s just software running on another computer. Here’s a picture of Microsoft Windows running inside of virtual machine software, which is in turn running on macOS, which is in turn running on a physical (Mac) computer:</p>
<p><img src="/images/Virtual_Machine.png" alt="Image of Microsoft Windows running on a virtual machine, which is running on macOS, which is running on a physical compuetsr" /></p>
<p>When you rent servers from a cloud provider, you’re renting access to this kind of virtual machine, which is in turn running on the physical computers that the cloud company has installed in their datacenters. For your purposes, the machine you rent will act like a physical computer (though it may be slightly slower than a physical computer, since some of its hardware is emulated).</p>
<p>People sometimes refer to cloud virtual machines as “VMs”, “servers”, “nodes”, or “compute nodes”, and cloud providers have different “size” VMs you can rent. For instance, on the Azure Sponsorship account we get through OpenAI, Azure offers what they call “NC6_Promo” VMs, which each have 6 (virtual) CPUs and 1 GPU. The next size up that they offer is the “NC12_Promo”, which has 12 (virtual) CPUs and 2 GPUs. Then the next size up (and the maximum size we have access to) is an “NC24_Promo”, which has 24 (virtual) CPUs and 4 GPUs.</p>
<p>For reference, here are lists of popular Nvidia GPUs, in rough order of performance, as of early 2020:</p>
<p><strong>Personal Computer Nvidia GPUs (fastest to slowest):</strong></p>
<ol>
<li>RTX Titan</li>
<li>RTX 2080 Ti</li>
<li>RTX 2080 Super</li>
<li>RTX 2080</li>
<li>RTX 2070</li>
<li>RTX 2060</li>
<li>GTX 1080 Ti</li>
<li>GTX 1080</li>
<li>GTX 1070</li>
<li>GTX 1650</li>
<li>GTX 1060</li>
</ol>
<p><strong>Cloud GPUs (fastest to slowest):</strong></p>
<ol>
<li>V100</li>
<li>P100</li>
<li>P4</li>
<li>T4</li>
<li>K80 (this is really just two K40 GPUs on one graphics card)</li>
<li>K40</li>
</ol>
<h3 id="limited-benchmarks">Limited benchmarks</h3>
<p>As part of the OpenAI Scholars program, we receive some Microsoft Azure credit. I decided to use some of my credit to benchmark K40/K80 GPUs against Google Colab and a GTX 1080 Ti in my home desktop: I built an image recognition network (similar to <a href="https://arxiv.org/abs/1512.03385">ResNet-34</a>) and I trained it on a few thousand images. I timed each epoch and averaged the times to get these numbers. Note that this may not quite reflect the performance of the actual video cards, since all of the cloud benchmarks were running inside of virtual machines and the benchmark on my desktop was not:</p>
<table>
<thead>
<tr>
<th>Server</th>
<th>GPU Kind</th>
<th># GPUs</th>
<th>Avg. time per epoch (lower is better)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Azure NC6_Promo</td>
<td>K40</td>
<td>1</td>
<td>61.5 seconds</td>
</tr>
<tr>
<td>Google Colab</td>
<td>P100</td>
<td>1</td>
<td>37.0 seconds</td>
</tr>
<tr>
<td>Azure NC24_Promo</td>
<td>K40</td>
<td>4</td>
<td>22.4 seconds</td>
</tr>
<tr>
<td>Google Colab Pro</td>
<td>P100</td>
<td>1</td>
<td>22.4 seconds</td>
</tr>
<tr>
<td>My home desktop</td>
<td>1080 Ti</td>
<td>1</td>
<td>20.6 seconds</td>
</tr>
</tbody>
</table>
<p>Note that the Azure servers technically have K80 graphics cards, but a K80 graphics card is just two K40 GPUs on one card, and Azure describes each virtual machine GPU as “half of a K80”, which means in practice that they are K40s. Also note that the Google Colab Pro server had more RAM than the free Google Colab server did.</p>
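<p>Per-epoch timings like these can be collected with something like the following sketch, where <code class="language-plaintext highlighter-rouge">train_one_epoch</code> is a hypothetical stand-in for the actual training loop:</p>

<pre><code class="language-python">import time

epoch_times = []
for epoch in range(num_epochs):
    start = time.time()
    train_one_epoch(model, train_loader)  # hypothetical training loop
    epoch_times.append(time.time() - start)

print(sum(epoch_times) / len(epoch_times), "seconds per epoch on average")
</code></pre>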
<h2 id="bonus-round-tpus">Bonus Round: TPUs</h2>
<p>I mentioned above that Nvidia has a monopoly on deep learning GPUs. That’s true, but there is technically one other kind of hardware built for deep learning that you may want to know about: Google’s Tensor Processing Units (TPUs). TPUs are custom built by Google for deep learning, and they’re only available to users through Google Cloud Platform and Google Colaboratory. TPUs are sufficiently restrictive that many deep learning programs built for GPUs won’t easily run on them, but if you can <a href="https://towardsdatascience.com/running-pytorch-on-tpu-a-bag-of-tricks-b6d0130bddd4">get your code to run on them</a>, they’re much faster than even the fastest Nvidia GPUs. And they’re also available for free or very cheap through Colab, so it could be worth checking out!</p>

<h1 id="starting-from-scratch">Starting From Scratch</h1>
<p>I’ve been excited about AI since I was about seven years old. When I was a kid, my Dad told me stories about the sci-fi books he read, and I always thought the most interesting ones were about AI. I loved imagining that one day computers might think like people do, and that this could give us new insights into humanity. By the time I was in high school, I was dreaming of how to transfer my mind into a robot body and wondering whether I would still be conscious. I don’t know if these things will be possible in my lifetime, but they sparked my imagination, and my interest in AI has only grown over time.</p>
<p>I’ve spent many years reading news stories about AI and thinking “I wish I could do that”. But I haven’t studied machine learning and I don’t have a PhD, and honestly I’ve felt intimidated pursuing the field. I didn’t know whether someone like me could even get into it. So, after a number of years of doing other work and dreaming of AI, I finally decided I had to try (with some encouragement from my wonderful partner). :-) So, I applied for the <a href="https://openai.com/blog/openai-scholars-spring-2020/">OpenAI Scholars Program</a> and, much to my surprise, I was accepted into the 2020 cohort.</p>
<p>Now that I’m in the program, I want to share my process of learning about something exciting and new to me. My hope is that seeing someone else starting from scratch will inspire you to start (this or something else) from scratch as well. That’s why I’ve called this blog “Learning Out Loud”. I plan to learn out loud in front of you (and of course the pun is that I’m also going to be talking out loud about learning, both human and artificial). I know, my witty puns are <em>suuuuper</em> funny. ;-P Anyway, I hope you enjoy reading about my process of discovery, and I especially hope that if you’re excited and unsure whether to pursue this kind of work, then this blog can help give you the spark of courage and excitement to push you forward too. :-)</p>