Jupyter Notebook Fatigue

11 Feb 2025

I have written a lot of Jupyter notebooks, about 1,300 in total the last time I checked, and I don’t want to do it any more. I’ve tried a few different tools, including the regular Jupyter Notebook web interface, Jupyter Lab, Jupytext, VSCode’s Jupyter plugin and PyCharm’s notebook and scientific mode. Nothing really felt great.

Maybe it’s because I picked up Data Science many years after becoming a Software Engineer, but I have never liked the linear workflow that notebooks suggest, not to mention how easy that linearity is to break. I want to build out components of the analysis and then compile them into a single report.

It’s also virtually impossible to read old notebooks, let alone run them. Code, output and exposition are all mixed together, and grepping is a nightmare because notebooks are serialized as JSON. (I’ve cooked up some jq incantations that should never see the light of day, but those weren’t enough.)

I’ve contemplated whether or not this is a skill issue on my part, but I’ve come to the conclusion that there’s just something about the format that encourages sloppy output. After about ten cells of non-trivial work, you end up with many potential branches for the analysis, charts and tables that are not entirely useful, but also not worth deleting (yet), and multiple versions of models with various parameter tweaks.

An append-mostly log is not a good data structure for this kind of thing. You really need a tree. Folding cells can help, but that’s too ephemeral; the structure needs to be baked in. You can drill down to parts of a tree in isolation, without worrying about (or affecting) everything else.

Jupytext is an improvement, but it still suffers from the linearity problem, and you still have to use the notebook format if you want to capture output. PyCharm’s scientific mode was the best of the approaches that I tried, mostly because it doesn’t enforce a linear sequence of notebook cells. But it doesn’t really enforce anything, so I ended up reinventing some kind of structure for each analysis, which became just as messy as notebooks.

I just wanted to do my analytics work in the terminal, using tools that I already use for software development. All of these tools are more or less nice wrappers around the IPython shell, so firing that up alongside vim in a tmux session actually wasn’t too bad - more or less equivalent to PyCharm’s scientific mode, minus its nice management of charts and data frames.

Missing those features was kind of annoying at first, but after a while, it occurred to me that they were more of a hindrance than a help for my goal: do tree-like analyses, which can compile to reports and be readable later on. This new approach forced me to factor the code that generated those charts and tables into functions, choose good names for things, and store output in a well-defined location.

Now, that might sound like extra friction that slows down the process. But it’s not. The extra friction forces you to do the thing that you’re supposed to be doing: thinking. Plus you end up with a nice library of composable functions, which makes compiling the final report much easier.
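To illustrate the idea (the function names here are hypothetical, not part of jove), each piece of an analysis can be a small function that returns a Markdown fragment, and the final report is just a composition of those pieces:

```python
# Hypothetical sketch of composable analysis functions (not jove's actual
# API): each step returns a Markdown fragment, and the report is just the
# concatenation of the fragments.
from pathlib import Path


def fig_section(title: str, fig_path: Path) -> str:
    """A report section that embeds a previously saved chart."""
    return f"## {title}\n\n![{title}]({fig_path})\n"


def table_section(title: str, csv_path: Path) -> str:
    """A report section that links a previously saved table."""
    return f"## {title}\n\nSee [{csv_path.name}]({csv_path}).\n"


def compile_report(*sections: str) -> str:
    """Join the sections into a single Markdown document."""
    return "\n".join(sections)
```

Because each section is just a function of its inputs, reordering or pruning branches of the analysis is a matter of editing one composition call, rather than surgery on a cell sequence.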

I packaged this up into a little library called jove (a shorter Jupiter, get it?), which bootstraps a basic directory structure:

(base) ➜  /tmp jove start myanalysis
INFO:jove:Created /tmp/myanalysis
INFO:jove:Created /tmp/myanalysis/data
INFO:jove:Created /tmp/myanalysis/figures
INFO:jove:Created /tmp/myanalysis/README.md
INFO:jove:Created /tmp/myanalysis/libjove.py
INFO:jove:Created /tmp/myanalysis/code.py
INFO:jove:Created /tmp/myanalysis/shell.sh
  • data is where CSVs, JSON files, etc. go
  • figures is where chart PNGs go
  • code.py is where analysis code / functions go
  • libjove.py contains some helper functions
  • shell.sh boots up an IPython shell and loads libjove.py and code.py
  • README.md contains all the analysis exposition / notes

The libjove.py file contains a few functions, which make it easier to get code and data out of the IPython session:

  • save_csv saves a DataFrame to CSV to the data directory, using sequential numbering (e.g. data/table-0.csv, data/table-1.csv, etc.)
  • save_fig saves a Matplotlib Figure to the figures directory, using sequential numbering (e.g. figures/fig-0.png, figures/fig-1.png, etc.)
  • save_wip clears the IPython history and dumps the code into code.py, so it can be refactored into functions

The main document is named README.md so you can still get a nicely rendered document when looking at the directory in GitHub. As analysis progresses, tables and figures can be saved to their respective directories, and linked in the README.md file.

The library is available here, if you’re interested. It’s not particularly impressive at the moment, but it’s at least a good starting point for structured analyses outside of the notebook format.


Building Japanese Cars

19 Jan 2025

There’s a common sentiment that “Germans make the best cars; Japanese make cars the best.” If you’re unfamiliar, this blog post captures it pretty well:

German cars are known for their robust engineering and attention to detail, but they can be more complex and expensive to maintain.

Japanese cars, particularly those from brands like Toyota and Honda, are renowned for their exceptional reliability and durability, often requiring less maintenance over the long term.

https://www.supaquick.com/blog/the-difference-between-german-and-japanese-carmakers

I think that there’s an obvious connection between this and how software is built. There’s also a less obvious connection to how we think that we build software.

I suspect that most teams, consciously or otherwise, tend towards the German approach. Just consider the microservices trend over the past decade. Clearly, robust engineering has been involved. In fact it’s usually necessary to achieve reliability.

It would be hard to find someone who wouldn’t admit that microservices are complex and expensive to maintain, though, even among their proponents.

On the opposite end of the spectrum, you have the Majestic Monolith. (Ruby, coincidentally, is also Japanese.) I admit that I tend to prefer that approach, only breaking out chunks into “macro-services” if/when it makes sense. But unless you’re starting a greenfield project, it’s difficult to go the other direction (technically as well as politically).

Nobody sets out to build a system that is complex and expensive to maintain, but we certainly have a lot of them. Whether it’s robust engineering, attention to detail or something else, how we get there isn’t as important as being able to climb out of the hole.

I don’t foresee the tendency changing any time soon, and this is where the analogy to car manufacturing breaks down. Notwithstanding software updates, it’s pretty difficult to dramatically change vehicles after they leave the assembly line. But that happens all the time with software.

So given that we inevitably end up with slop, what can be done? The knee-jerk answer is always “rewrite the whole system,” which is almost always the wrong answer. If the project is optimized for deletion, then cleaning up is tractable, but there’s no guarantee the project is structured that way.

Another complicating factor is that no two messes of code are alike, which rules out generic solutions. Ironically, attempting a generic solution to this problem would probably just yield another mess to clean up.

There is also never enough time for refactoring. It is extremely difficult to frame in terms of business value, so it almost never gets prioritized, until at some point it’s virtually impossible to implement anything in a reasonable amount of time. (Time for a rewrite?)

The only option left is to engineer a way out of the mess.

Let’s say (hypothetically, of course) that we bolted on a React frontend, via auto-generated Apollo GraphQL Typescript hooks, to a Graphene endpoint in a brownfield Django application. I’m sure that there is more than one project out there with a similar configuration, but certainly not enough to warrant an open source solution, let alone a commercial service to help rein in this polyglot Audi after 75,000 miles.

Fortunately there is a technique that can simultaneously scratch the itch to write complex code, bring sanity to the current system, and avoid complicating things even further: metaprogramming in one-off scripts.

Continuing the example, we have a few things working for us. GraphQL is fundamentally type-oriented, so there is some hope that we can connect the dots, despite the sources having different kinds of ASTs. Given a type, we need three things to construct the call graph:

  • The file/class on the server containing the GraphQL resolver
  • The GraphQL query/mutation definition files on the client, which compile to hooks
  • The hook usages across all of the frontend Typescript files

Once you’ve got that, it’s easy to identify dead and deprecated code paths. Just look for instances that don’t have all three of those things. To get there, import ast and graphql in Python, and start hammering on ChatGPT.

Here’s a quick-and-dirty script that solves this particular case, and an example invocation: https://gist.github.com/brandtg/761f8735ccf3389935cd76f949063c8b

./analyzegraphene.py ./my-project ./usages.csv --exclude 'my-client.tsx'

It’s not perfect, but it’s a good-enough, 80/20 way to quickly reason about the call graph and identify opportunities to clean things up. On top of that, as long as the architecture doesn’t change too dramatically, you can run it repeatedly instead of having someone wade through the mess again and again.


When I Gave Up IntelliJ for Lent

05 Jan 2025

I know that it’s a controversial opinion among programmers under 50, but I actually like Java.

All of the newer language features, like streams, records and local variable type inference, make writing Java feel as nice as any other modern language. The JVM is amazing, and there are so many high-quality open source libraries available (h/t Dropwizard).

However, working with Java can be very annoying. There are basically only two options: stay in the warm, fuzzy IDE, or learn everything about build systems, Javadoc and editor configuration to be able to work from the terminal.

Many years ago, I was frustrated by how magical and complicated building Java projects felt, so I decided to give up IntelliJ for Lent. That left me with vim and Gradle in a terminal.

The first hurdle was understanding how to navigate the codebase. This was before LSPs, but it was easy enough to figure out with ctags and grep. (These days things are pretty good with jdtls in Neovim.) The hard part was understanding third party libraries.

A lot of documentation is published online, but it can be hard to find for specific dependency versions. At any rate, doing that is much slower than the “go to definition” functionality that IDEs provide. So I figured that I could use Gradle to do something similar.

It turned out that getting the Javadoc (and sources) was straightforward using the idea plugin:

idea {
  module {
    downloadJavadoc = true
    downloadSources = true
  }
}

After that runs, the dependency code and documentation are downloaded to the ~/.gradle/caches directory:

~/.gradle/caches$ find . -name 'commons-io*.jar'
./modules-2/files-2.1/commons-io/commons-io/2.18.0/44084ef756763795b31c578403dd028ff4a22950/commons-io-2.18.0.jar
./modules-2/files-2.1/commons-io/commons-io/2.18.0/e2281d62ae24acd84de1ef7273e70bfb38c75658/commons-io-2.18.0-javadoc.jar
./modules-2/files-2.1/commons-io/commons-io/2.18.0/9ab23bf96cc41b731cb123fbee97198f901af68e/commons-io-2.18.0-sources.jar

The Javadoc artifact contains a full static website, so all that you have to do is extract it and load up index.html in the browser:

~/Desktop$ mkdir commons-io
~/Desktop$ cd commons-io/
~/Desktop/commons-io$ jar xf ../commons-io-2.18.0-javadoc.jar
~/Desktop/commons-io$ tree | head -25
.
├── allclasses-index.html
├── allpackages-index.html
├── constant-values.html
├── deprecated-list.html
├── element-list
├── help-doc.html
├── index-all.html
├── index.html
├── jquery-ui.overrides.css
├── legal
│   ├── ADDITIONAL_LICENSE_INFO
│   ├── ASSEMBLY_EXCEPTION
│   ├── jquery.md
│   ├── jqueryUI.md
│   └── LICENSE
├── member-search-index.js
├── META-INF
│   ├── MANIFEST.MF
│   └── maven
│       └── commons-io
│           └── commons-io
│               ├── pom.properties
│               └── pom.xml
├── module-search-index.js
...

It was annoying to do that every time that I wanted to look at documentation, let alone navigate from a simple class name reference in code to the right place, so I put together some simple tooling.

I wanted something similar to pydoc for Python, and it had to do three main things:

  • Extract - Extract the documentation / sources into a common place.
  • Index - Parse documentation for class names and build an index.
  • Search - Match a query against the list of class names and return documentation.

The tool creates a root directory, walks the local repository to find *-sources.jar and *-javadoc.jar artifacts, then extracts them under ${root}/sources/${artifact-name} and ${root}/javadoc/${artifact-name}. At this point, it’s easy to explore the documentation top-down if you know the artifact name, but it’s hard to find a specific class.

Fortunately there are pages like allclasses-index.html that list all of the class names and link to their docs (though naming isn’t always consistent). The tool looks for these files, then builds a mapping of class name (really, Maven coordinates) to the documentation file path. This mapping gets dumped to a JSON file under the root directory.
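Parsing those pages doesn’t need a full HTML library; the stdlib html.parser is enough for a best-effort pass. A sketch (not jdoc’s actual implementation) that collects each documentation link’s text and href:

```python
# Sketch: pull (class_name, href) pairs out of an allclasses-index.html
# page using only the stdlib. Javadoc layouts vary across versions, so this
# is best-effort: it grabs the text of any <a> whose href ends in .html.
from html.parser import HTMLParser


class ClassIndexParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # First non-empty text inside a doc link is the simple class name.
        if self._href and self._href.endswith(".html") and data.strip():
            self.links.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None
```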

Finally, the tool can search that JSON file using simple keyword matching and regular expressions to output matching class names along with their Javadoc and sources.

I put this together and called it jdoc. It’s not very complicated - only about 250 lines of code. After Javadoc and sources are downloaded via the build tool, just run jdoc --index, and search by passing one or more patterns to it:

$ jdoc XmlStreamWriter
   name: org.apache.commons.io.output.XmlStreamWriter
    jar: commons-io-2.18.0-javadoc.jar
javadoc: file:///home/greg-brandt/jdoc/javadoc/commons-io-2.18.0-javadoc.jar/org/apache/commons/io/output/XmlStreamWriter.html
 source: file:///home/greg-brandt/jdoc/sources/commons-io-2.18.0-sources.jar/org/apache/commons/io/output/XmlStreamWriter.java

   name: org.apache.commons.io.output.XmlStreamWriter.Builder
    jar: commons-io-2.18.0-javadoc.jar
javadoc: file:///home/greg-brandt/jdoc/javadoc/commons-io-2.18.0-javadoc.jar/org/apache/commons/io/output/XmlStreamWriter.Builder.html
 source: file:///home/greg-brandt/jdoc/sources/commons-io-2.18.0-sources.jar/org/apache/commons/io/output/XmlStreamWriter.Builder.java

Here is also my favorite vim keybinding for it, which uses the -e flag to do an exact class name match, and targets the token under the cursor:

nnoremap <leader>j :execute '!jdoc -e ' . expand('<cword>')<CR>

You can download the tool here, if you’re interested: https://github.com/brandtg/jdoc