Building Japanese Cars
19 Jan 2025
There’s a common sentiment that “Germans make the best cars; Japanese make cars the best.” If you’re unfamiliar, this blog post captures it pretty well:
> German cars are known for their robust engineering and attention to detail, but they can be more complex and expensive to maintain.
>
> Japanese cars, particularly those from brands like Toyota and Honda, are renowned for their exceptional reliability and durability, often requiring less maintenance over the long term.

https://www.supaquick.com/blog/the-difference-between-german-and-japanese-carmakers
I think that there’s an obvious connection between this and how software is built. There’s also a less obvious connection to how we think that we build software.
I suspect that most teams, consciously or otherwise, tend towards the German approach. Just consider the microservices trend over the past decade. Clearly, robust engineering has been involved; in fact, it’s usually necessary to achieve reliability.
It would be hard to find someone, though, even among their proponents, who wouldn’t admit that microservices are complex and expensive to maintain.
On the opposite end of the spectrum, you have the Majestic Monolith. (Ruby, coincidentally, is also Japanese.) I admit that I tend to prefer that approach, only breaking out chunks into “macro-services” if/when it makes sense. But unless you’re starting a greenfield project, it’s difficult to go the other direction (technically as well as politically).
Nobody sets out to build a system that is complex and expensive to maintain, but we certainly have a lot of them. Whether it’s robust engineering, attention to detail or something else, how we get there isn’t as important as being able to climb out of the hole.
I don’t foresee the tendency changing any time soon, and this is where the analogy to car manufacturing breaks down. Notwithstanding software updates, it’s pretty difficult to dramatically change vehicles after they leave the assembly line. But that happens all the time with software.
So given that we inevitably end up with slop, what can be done? The knee-jerk answer is always “rewrite the whole system,” which is almost always the wrong answer. If the project is optimized for deletion, then cleaning up is tractable, but there’s no guarantee the project is structured that way.
Another complicating factor is that no two messes of code are alike, which rules out generic solutions. Ironically, attempting a generic solution to this problem would probably just yield another mess to clean up.
There is also never enough time for refactoring. It is extremely difficult to frame in terms of business value, so it almost never gets prioritized, until at some point it’s virtually impossible to implement anything in a reasonable amount of time. (Time for a rewrite?)
The only option left is to engineer a way out of the mess.
Let’s say (hypothetically, of course) that we bolted on a React frontend, via auto-generated Apollo GraphQL TypeScript hooks, to a Graphene endpoint in a brownfield Django application. I’m sure that there is more than one project out there with a similar configuration, but certainly not enough to warrant an open source solution, let alone a commercial service to help rein in this polyglot Audi after 75,000 miles.
Fortunately there is a technique that can simultaneously scratch the itch to write complex code, bring sanity to the current system, and avoid complicating things even further: metaprogramming in one-off scripts.
Continuing the example, we have a few things working for us. GraphQL is fundamentally type-oriented, so there is some hope that we can connect the dots, despite the sources having different kinds of ASTs. Given a type, we need three things to construct the call graph:
- The file/class on the server containing the GraphQL resolver
- The GraphQL query/mutation definition files on the client, which compile to hooks
- The hook usages across all of the frontend TypeScript files
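As a sketch, those three artifacts could be collected into one record per operation, with a completeness check on top (all names here are hypothetical, not from the actual script):

```python
from dataclasses import dataclass, field

@dataclass
class CallGraphEntry:
    """One GraphQL query/mutation, traced across the stack."""
    name: str
    resolver_file: str = ""            # server: where the Graphene resolver lives
    definition_file: str = ""          # client: the .graphql file that compiles to a hook
    usage_files: list = field(default_factory=list)  # client: hook call sites

    def is_complete(self) -> bool:
        # A live code path has all three; anything missing one is a
        # candidate for cleanup.
        return bool(self.resolver_file and self.definition_file and self.usage_files)

entry = CallGraphEntry("currentUser",
                       resolver_file="schema.py",
                       definition_file="CurrentUser.graphql")
print(entry.is_complete())  # no hook usages -> dead or deprecated path
```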
Once you’ve got that, it’s easy to identify dead and deprecated code paths: just look for instances that don’t have all three of those things. To get there, import `ast` and `graphql` in Python, and start hammering on ChatGPT.
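To give a flavor of the metaprogramming involved, here is a stdlib-only sketch of the three extraction passes. Everything in it is an assumption, not the actual script: it leans on Graphene’s `resolve_<field>` naming convention and Apollo codegen’s `use<Operation>Query`/`use<Operation>Mutation` hook names, and it regexes the `.graphql` files where graphql-core’s `parse()` would give you a real AST.

```python
import ast
import re

def find_resolvers(py_source: str) -> dict:
    """Server side: map field name -> Graphene class, using the
    resolve_<field> naming convention and Python's own AST."""
    resolvers = {}
    for node in ast.walk(ast.parse(py_source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name.startswith("resolve_"):
                    resolvers[item.name[len("resolve_"):]] = node.name
    return resolvers

# Client side: the operation definitions that codegen turns into hooks.
# (graphql-core's parse() gives a real AST; a regex is the 80/20 version.)
DEF_RE = re.compile(r"\b(query|mutation)\s+([A-Z]\w*)")

def find_definitions(graphql_source: str) -> set:
    return {name for _, name in DEF_RE.findall(graphql_source)}

# Client side: call sites of the generated hooks. The lazy quantifier
# means the bare Apollo useQuery()/useMutation() calls don't match.
HOOK_RE = re.compile(r"\buse([A-Z]\w*?)(?:Query|Mutation)\b")

def find_hook_usages(ts_source: str) -> set:
    return {m.group(1) for m in HOOK_RE.finditer(ts_source)}

print(find_resolvers("class Query:\n    def resolve_current_user(self, info): ..."))
print(find_definitions("query CurrentUser { currentUser { id } }"))
print(find_hook_usages("const { data } = useCurrentUserQuery();"))
```

One wrinkle the sketch glosses over: joining the layers by name requires normalizing the server’s snake_case field names against the camelCase that Graphene exposes to the client.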
Here’s a quick-and-dirty script that solves this particular case, and an example invocation: https://gist.github.com/brandtg/761f8735ccf3389935cd76f949063c8b
`./analyzegraphene.py ./my-project ./usages.csv --exclude 'my-client.tsx'`
It’s not perfect, but it’s a good-enough, 80/20 way to quickly reason about the call graph and identify opportunities to clean things up. On top of that, as long as the architecture doesn’t change too dramatically, you can re-run it repeatedly instead of having someone wade through the mess again and again.