How to implement a research paper from start to finish

I’ve just finished making a video course that walks through how to implement a research paper:

Actually, it’s more like implementing several research papers:

  • GPT-style transformer network
  • T5 relative position bias
  • Transformer-XL recurrence
  • RAG(ish) retrieval

Plus a smattering of topics:

  • Reading papers effectively
  • Reading papers critically
  • Tips for implementing
  • Data processing
  • Identifying performance bottlenecks
  • Why and how attention works (or why it’s good enough)
  • Torch gotchas and medium-deep internals (broadcasting, copying, contiguity)
  • Batch and multihead operations
  • Position bias
  • Faiss tutorial
  • Considerations for vector database build and search
  • Einsum
  • Recurrence
  • Model architecture and build
  • Running on GPUs

Why?

Implementing a paper is a weird skill. I haven’t found any tutorial for how to do it, and assume that you either learn it by osmosis at university or in a research job, or you dive in the deep end. The only way to get good at it is fairly obvious: do it a bunch of times. But doing it the very first time is hard:

There are a lot of tips and more-of-an-art things that make it painful, and not educational, to learn on your own. Papers are underdetermined, details are missing (in fairness: there are a lot of details), and even if you have the code implementation you can’t always replicate results. So it’s a poorly explained skillset.

It’s also an important skillset. Being able to pull some paper and implement it to solve your big problem is incredibly useful. But more so, inevitably you’ll need to implement your own idea, or modify an existing model, or try something weird. When that time comes, if you’ve already had plenty of experience putting together and debugging new ideas, you’ll be in a much better position.

So it’s both an important skill and a poorly explained skill. For me that made trying to teach it to others an important goal.

It was a lot of fun but much harder than I anticipated for a few reasons:

  • Maintaining a story. A story is important. Framing something as a story improves retention and understanding, and helps motivate you to ask questions and continue on to the next lesson. It’s how our brains work, so it’s silly not to leverage it to help yourself teach or learn something. Maintaining a story about, effectively, a seven-hour coding session is hard. Ideally, it means that every topic you cover is motivated by some problem or question that came as a result of the previous topic. But sometimes boilerplate code is boilerplate code, and there’s no apparent reason an author did X, or a library function does Y, or you’re building a boring component, so you have to work hard to keep things interesting.
    • (The downside of storifying things is that a compelling dramatization can dominate the truth. How much should you sugar a bland subject with dramatization in order to give a reader handholds and keep them interested? Is the Thirty Years’ War a result of: a guy getting thrown out of a window in Prague? A religious dispute coming to a head? A proxy battle between Bourbons and Habsburgs? Or some combination of fifty things whose complexity you can only make sense of once you’ve got a dramatic arc to hang all the detail and nuance on?)
  • Finding the right level of detail. You have to make a lot of assumptions about the audience. I found it helps to be explicit about this:
    • Decide on the probable distribution of audience knowledge. For this case, anyone clicking on the video is probably more advanced than a beginner but less advanced than a research engineer. This is the range.
    • Inside of that range, weight each topic by how important it is to understand against how difficult it is to explain, and that is how you spend your time. Advanced users can always just skip ahead to the next section, so it’s better to lean towards the beginner audience.
    • In the case that you have an important topic that is difficult to explain but is not directly relevant to the wider lesson, just provide a taste of the subject to get the audience interested and then link out to a good resource. For example, it makes no sense for me to give an einsum tutorial, or show how to write a training loop, because while they are both important and take time to explain, there are already a thousand good resources out there.
  • Knowing every detail: the compulsion is to know what every line of code is doing, and what every line of code inside that line of code is doing. This is because you don’t want a student to ask “why” about something that you yourself have never asked “why” about.
    • As far as I can tell, the strength of the Feynman technique is that you are driven by the fear of being embarrassed by your students. You don’t want to set yourself up as the teacher and then have to say “I dunno” because you didn’t look hard enough at your own material.
    • Teaching others is a great way to teach yourself because you reexamine your tools through the eyes of a beginner. You uncover everything that you take for granted, and I found that I take a lot of deep learning / software things for granted. I won’t pretend I wasn’t embarrassed by some of the things I thought I knew but didn’t.
    • Knowing every detail is not a good goal. There are a lot of layers of “why” and “how” under every line of code, and there’s a certain impulse to want to know it all. Should you? I think the answer is clearly no. Your time is limited. To reframe this in a more general, related way: is attaining expertise a good goal in and of itself? Especially in technical fields I think many people just default into thinking it is, but in fact it is always a means and should be viewed that way. You should aim only for expertise that is sufficient to get some goal accomplished. Even when we think of expertise as a goal or good, there’s often an implied actual goal that it’s subsidiary to: a career, making a new discovery, advising others, problem solving in that field… I think it also works in practice this way: when you see expertise you admire, it’s usually just a byproduct of the person’s effort, not the focus of it. All this to say: I’ve backed off of thinking that, e.g., learning about PyTorch internals to satisfy curiosity and a completionist impulse would be a good use of time. When I do have a reason to learn about it because it’s blocking the path from A to B, then I will dive in.
      • (There’s an argument for expertise in itself, or craftsmanship as an end in itself. To me this still sounds like the expertise is subsidiary to purpose, or duty, or well-being, or the act of dedicating yourself to something at all. I don’t think the world-class violinist’s fulfillment is dependent on him having chosen the violin rather than the viola.)
  • Presenting to no one. Talking into a webcam for several hours with no feedback is a strange format. I find it much harder than presenting in person, or presenting over webcam with at least a few people nodding along or asking questions. There’s no way to read interest and energy. Staying upbeat and energetic for long periods doesn’t work well, but too little energy and it becomes deathly boring.

I went back to re-record quite a few episodes because I felt they weren’t up to standard. There’s a tight balance between over- and under-explaining code: it’s easy to become too boring or too brief. There’s no perfect solution besides using judgment and trusting that the audience is here because they’re motivated, and will fill in the gaps themselves. Otherwise you’re explaining everything in a twelve-hour video with high information but low information density for any given viewer.

There are already ten things I would do differently if I could do it again, but I’m happy enough with the result. I hope you get something out of it.