After my most recent F# work, I’ve gone back to working primarily in Python and other dynamic languages. I think I’ve picked up on quite a few new problems, but I’ve also seen that the type differences alone aren’t enough to make or break Python for me. Many of my complaints about F# were tooling and consistency related, and I think Python suffers from the same problems but in very different ways. I don’t think I would have noticed this without switching between the two styles of languages in two deep dives.

Tools and Types

I haven’t yet had the pleasure of working with libraries that have type annotations, but I’d be excited if they improved the experience with the tools. I’m not so excited about going through and annotating my own code, since I’d be concerned that I’d end up debugging working code just to figure out the names of the types it already uses without problems. A problem with most statically typed languages is the verbosity and excessive constraints on type usages, which can make development feel slower. Where I find explicit type constraints more helpful than hurtful is in a single specific return type. I love being able to always know exactly what type to expect back from a function with minimal overhead. For many common libraries, this is left completely unspecified in the documentation, and it’s very common for functions to return different types depending on the type of the inputs. This makes them easy to use in some cases, but hard to predict in others.
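To make the return type complaint concrete, here is a minimal sketch. Both functions are invented for illustration (they are not from any real library): the first returns a different type depending on its input, the second commits to one annotated return type.

```python
from typing import List, Optional


def first_match(items, wanted):
    """Hypothetical flexible helper: the return type depends on the input."""
    if isinstance(wanted, (list, tuple)):
        # A sequence of targets returns a list of indexes.
        return [items.index(w) for w in wanted if w in items]
    if wanted in items:
        # A single target returns a bare int...
        return items.index(wanted)
    # ...or None when nothing matches.
    return None


def first_match_strict(items: List[str], wanted: str) -> Optional[int]:
    """Hypothetical annotated variant: one input shape, one documented return type."""
    return items.index(wanted) if wanted in items else None
```

The flexible version is pleasant at the call site, but a caller has to know which of three shapes comes back. The annotated version tells both the reader and the tooling up front.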

I’ve complained before about Python linters leaving too much room for error, and I’ve discovered that a more powerful and opinionated linter (pylint over flake8!) makes a huge difference in productivity. It really closed the gap on stupid errors such that I’m not constantly debugging and testing for minimal progress and functionality once I use any non-trivial abstractions. I’ve even moved to VS Code over Notepad++ for tooltips and autocompletion via plugins, and it’s really changed my perspective on how to write Python. But even with these new tools, it isn’t enough to alleviate my frustrations when working with incompletely documented functions and places where it’s just wrong at runtime. The same return type troubles I want solved by type annotations are only partially solved by linters and docstrings.

I’d much rather spend time type casting and converting than guessing and checking on functionality. Determining the correct return type in Python requires either runtime inspection (via REPL) or preferably tests. Just reading the implementation can be a rabbit hole of dependencies that end in a call to a compiled C library. I’ve read far more code in my Python dependencies than I have in any other language to just grasp proper type usage. I don’t enjoy the exercise of exploring a library’s code structure via debugger or running code in my head when all I want to know is why the result is different with similar inputs. With the flexible nature of many Python functions, I’d even find myself doing this with my own code where I failed to initially see how the implementation would produce unexpectedly different results depending on the types provided. A level of runtime testing is just inescapable, and it’s hard to be exhaustive, so the code stays flexible in many ways while the tests constrain it only around the expected inputs. Whether this behavior is desired depends on the context and conventions of the entire system.
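As a small illustration of pinning a return type with a test, the sketch below leans on the standard library’s json.loads, which returns whatever type the document encodes. The wrapper load_ids and its test are hypothetical names for this example only.

```python
import json


def load_ids(payload: str) -> list:
    """Hypothetical wrapper that normalizes json.loads output to a list."""
    parsed = json.loads(payload)
    # json.loads hands back an int, str, list, dict, ... depending on the input,
    # so wrap scalars to give callers a single shape to rely on.
    return parsed if isinstance(parsed, list) else [parsed]


def test_load_ids_always_returns_list():
    # The test constrains the return type only around the inputs we expect.
    for payload in ("1", "[1, 2]", '"x"'):
        assert isinstance(load_ids(payload), list)
```

The test says nothing about inputs it does not list, which is exactly the limitation described above.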

For Python documentation, examples need to be really exhaustive for me not to care what is going on behind the scenes, because as soon as a different type comes out, I’m going to have to dig into the exact difference in context that causes the change. It’s really the cascade of unknowns that makes situations like that difficult to debug, since finding specific information is much more time consuming in a system designed to be flexible. In some cases where the platform or implementation differs between the dev and prod environments, results from REPL based design can even be misleading. For everything else that just blazes along in Python for development speed, I still haven’t found a workflow or toolset that gives me the ability to be safe and complete when I build non-trivial systems.

True Stories

What I still like about Python is the power growth from test script to a small system. If well architected from the start(!), the pieces can really grow smoothly in complexity and functionality. I’m going to compare a few different Python projects in how this works and doesn’t work.

Naive Scripts to Simple App

This project was to take another group’s internal project public. It was an Express app where some of the data was generated via Python’s scientific libraries. The Python scripts that did this were short (less than 2k lines total) with minimal structure. There weren’t many functions, most of the state was global, it had no tests and no accompanying documentation or inline comments. It tackled enough business complexity such that I didn’t want to take the time to write it from scratch, so I knew from the start I could probably just finesse the context and outputs to make it ready for production.

Instead of just papering it over with a single huge facade and calling it done, I broke it down one step at a time to create something I’d be proud of maintaining. First by splitting the code into functions with documentation, then parameterizing the hardcoded configurations, then replacing mutable globals with parameters, then collecting functions and state into classes. At each step, I introduced some necessary complexity that pushed it to the next stage of a mature design. I thought about adding tests, but I realized that I wasn’t looking to standardize internal interfaces or preserve backwards compatibility. I was reorganizing code with non-functional changes. If I had added tests at an early stage, I would have had to continually rewrite them as I quickly evolved my whole design. I really didn’t have problems with accidentally introducing breaking changes just with help from the linter.
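The stages above can be sketched on a toy example. None of this is the project’s code: FilterConfig, filter_values, and Pipeline are invented names, and the original global-state style is shown only as a comment.

```python
from dataclasses import dataclass

# Starting point (illustrative only): hardcoded config plus a mutable global.
#   THRESHOLD = 0.5
#   results = []
#   for v in data:
#       if v > THRESHOLD:
#           results.append(v)


@dataclass(frozen=True)
class FilterConfig:
    """Parameterized replacement for the hardcoded configuration."""
    threshold: float = 0.5


def filter_values(values, config):
    """A documented function; state arrives as parameters, not globals."""
    return [v for v in values if v > config.threshold]


class Pipeline:
    """Final stage: related functions and state collected into a class."""

    def __init__(self, config):
        self.config = config

    def run(self, values):
        kept = filter_values(values, self.config)
        return {"kept": kept, "dropped": len(values) - len(kept)}
```

Each stage leaves the behaviour alone and only changes where the state lives, which is why a linter was enough of a safety net in a case like this.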

This wasn’t just a refactoring project; I did have other requirements that would require new functionality. The goal was to make it easier to add these features and integrate into a larger system that would have a longer life. I’d say it took longer to polish it to completion than if I had just hacked the integration code on the end because I was so focused on keeping it clean throughout the whole process. As I learned more about the final purpose and organization of the application, I’d go back to clean the code for consistency. Features like Python’s attribute decorators and easily mockable interfaces allowed different parts of the application to evolve in different directions. Some parts were very flexible with few type requirements, while others had to meet very specific interfaces for integration. The ability to change functionality in a single place and have the change flow through all parts of the application was a massive timesaver. As I added more integration points I was able to easily expand the interface without breaking older usages. I didn’t need to introduce explicit shims or adapter code if I knew I was changing the interface in a way it supported, which was great when the interfaces were small and I knew them very well. The new code and interfaces I was adding churned quickly, and I found that beneficial for the consistency of the system.
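Assuming “attribute decorators” means the built-in property decorator, a minimal sketch of expanding an interface without breaking older usages might look like this. The Sample class and its attributes are hypothetical.

```python
class Sample:
    """Hypothetical class whose plain attribute later became a property."""

    def __init__(self, celsius):
        self._celsius = float(celsius)

    @property
    def temperature(self):
        # Older callers keep reading `sample.temperature` like a plain attribute.
        return self._celsius

    @temperature.setter
    def temperature(self, value):
        # Validation or conversion can be added here without touching call sites.
        self._celsius = float(value)

    @property
    def fahrenheit(self):
        # A new integration point added next to the old one, with no adapter code.
        return self._celsius * 9 / 5 + 32
```

Existing code that reads or assigns temperature is untouched, while new callers get fahrenheit for free.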

While this was a solo project, I always intended to keep the changes understandable and maintainable by the original authors, who didn’t have as much experience in writing and running production applications. I immediately started reaping the simplicity benefit once the app hit user testing. The changes had expanded on the lack of structure that was there, so it was easy to find the familiar code snippets from before. As I introduced these design and functional changes with documentation I focused on describing the general layout as a network of small components. Keeping each component quick to explain meant that even if the magic that tied the parts together was complicated, it was easy for someone who didn’t understand the whole system to make changes in an isolated component for a big impact. That level of shared understanding made it feel like a safe place to start a long term project. A good chunk of the original code was unchanged, where most of the refactoring work was in proper encapsulation, naming, and data flow. A complete rewrite could have been a mess, but I was able to meet my requirements with a minimal amount of effort and discipline.

Python’s learning curve made this project possible in the first place by enabling the first contributor to make a system that was useful enough to be a standalone product! The language and environment made it easy enough to meet all of the initial requirements and eventually grew to meet the final system requirements. It took some time to manually test this work, but it was surprisingly painless for what it could have become. It started with the simplest tools available and was able to use progressively more powerful dynamic features as necessary. The final product wasn’t all that different in terms of functionality from where it started. There were some problems with deploying it to the intended environment, but it performed as expected without pain and within estimates.

Stubborn ETL Script

On the flip side of this smooth growth was an ETL script that had always been a hack. Being built for a database that was still in beta and only running on local test machines meant that it acquired too many sloppy habits from “It broke overnight, quick patch and it’s up again.” So when I went to productize this code, the mess was all over from just 6 months of neglect. It was so much harder to re-architect this app that already had crumbling structure than to work with a set of files that had none. Breaking just a few implicit assumptions had the whole thing teetering on the brink of collapse. I honestly considered rewriting the app entirely out of Python, but was able to patch it up with more comments than code.

Python’s power bit back here. Because it was so easy to make quick changes to solve problems, the true impact of the changes was not fully assessed until they nearly outgrew their underpinnings. Every change needed to deal with another layer of redirection that needed slightly different conventions for each database interface. There was so much that ‘just worked, save for this one exception’ for the first few iterations that the process grew to expect a very specific context and order of special cases that was more fragile than anticipated. The linter enabled using more complex logic and structure to meet increasingly implicit contracts until it couldn’t support them all in concert (as they became conditional and dynamic). This trivial project stretched the tools to their limit, and they broke when needed most. I couldn’t rely on them to save or fix the sloppy code. The tools only helped as pieces came back together, again accelerating positive developments just as they had with dangerous practices. The safe growth of the system was really dependent on discipline, and not just on tools.

According to the linter, this highly structured ETL script was of much higher code quality than the set of files with all globals, but this one took far more effort to improve. This kind of experience with Python leaves a similar taste to C++ in my mouth. It’s a very powerful language with lots of great features and tools to get work done, but it has a number of easily uncovered design pitfalls that the tools can’t prevent and cause productivity to crash. I used the dynamic features to hide the differences in complexity, so the code appeared to be more generic than it was. I was always able to find just one more clever way to manipulate the evolving query results to fit the first expected format. It was more of a bad design that was enabled by a patchwork of patterns than a flaw in the type system or language semantics. The tools had given me more than enough rope to do very painful damage, which wasn’t readily apparent until it was too late. I want to fall on the side of preventing future pain rather than seeking programmer happiness because I’d much rather work with a naive system that’s underutilized than a system that stretches ease of getting things done to the limit.

I had a few unit tests and integration tests, which brought my code coverage number high, but those weren’t enough to catch the problems I was seeing in production. Adding tests to help stabilize the problem code didn’t help much since changes in one part would cause a different area to break because the contracts were so ill-specified. With the small Express application above, the churn was spread across the code base and the code slowly solidified into good interfaces. With this configuration, most of the changes were in three or four hot spots that really controlled the data flow format. The interfaces between the components were already defined, so the trick was getting those exactly right to avoid breaking all the other interfaces that depended on them. A consistent bad interface was worse for development productivity in this case than an untested, constantly evolving one. Changing the bad interface would have been nearly impossible because most of the application relied on it implicitly as a matter of convention. I didn’t expect this when designing, and I didn’t enjoy how difficult it was to correct at the source.

This project ended up meeting its deployment requirements, but a number of system compromises left it in a precarious state for the future. Near the end of this I was really yearning for static types and strong interfaces, since many of the abuses that had caused the system to end up in such a state would have been easier to troubleshoot with more static type information. The great power of dynamic Python wasn’t well applied here and it made the system much more brittle than it should have been. It made me think of Python as more fragile than flexible, since there was so much complexity hidden in function contracts. Trading off early development time cost much more frustration later.

From Standalone to Distributed

This project was much broader in scope than just Python. It started as a standalone application to show, send and save hardware sensor data. The first pass was carefully designed and purposefully simple. It started with multiple classes and well-organized modules that used very few dynamic features, mostly leveraging the close at hand visualization libraries. Like many early Python experiences, it was extraordinarily quick to go from concept to production. It again had minimal tests, but extensive documentation on every function, class, and module. It worked well enough for what it was designed to do, but then the requirements changed. It was no longer feasible to have the UI and the sensor data collection running on the same machine, so the application had to be broken up.

The application was tightly coupled when viewed from this new requirement. It made quite a number of simplifying assumptions about direct and sole control of the hardware. On one level, that assumption would hold true: there would only be a single piece of hardware that would need to be controlled at a time. But the distributed aspect raised the question of multiple clients. It also meant that changes from one client must be communicated to the others via the server, a complication that wasn’t necessary for single-user mode. The current setup would need to evolve with some major changes while keeping the existing functionality, a good challenge for a short timeframe.

It took four or five iterations to really complete it. Built-in tests and continuous rework meant that each iteration could change more of the system without breaking things, but getting to that point where the interfaces stabilized caused quite a few headaches. The real stumbling blocks were with timing control and IPC plumbing. The first few iterations relied on the GIL and the built-in TCP server to alleviate concurrency concerns, but I started to stray from the simple to the more complex parts of the standard library faster than I expected. While it’s possible to do concurrent work correctly in Python, I was often guessing at some edge cases for yield points. The last few iterations were mostly setting up IPC with multiprocessing for some additional control that wasn’t available with threads. The amount of overhead for hooking it all up was pretty minimal, but it did require careful planning up front about what messages would be needed. I did have to make some compromises on possible functionality to keep things simple, but it wasn’t a show-stopper.
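Planning the messages up front can be sketched as a small, closed set of message types plus one loop that handles them. The message names (SetRate, Reading, Shutdown) and the serve function are hypothetical, not from the project. The sketch uses an in-process queue.Queue as a stand-in; multiprocessing.Queue offers the same put and get calls for crossing a process boundary.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class SetRate:
    """Client request: change the sampling rate (hypothetical message)."""
    hz: float


@dataclass(frozen=True)
class Reading:
    """Sensor data flowing toward the clients (hypothetical message)."""
    value: float


@dataclass(frozen=True)
class Shutdown:
    """Sentinel telling the server loop to stop."""


def serve(inbox, outbox):
    """Drain `inbox` until Shutdown, publishing state changes to `outbox`."""
    rate = 1.0
    while True:
        msg = inbox.get()  # blocks until a message is available
        if isinstance(msg, Shutdown):
            break
        if isinstance(msg, SetRate):
            rate = msg.hz
            outbox.put(("rate", rate))  # other clients learn about the change
        elif isinstance(msg, Reading):
            outbox.put(("reading", msg.value))
    return rate
```

Keeping the set of messages this small is the kind of compromise on functionality mentioned above: anything the loop does not recognize is simply not part of the protocol.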

Again, growing slowly and continually rewriting to keep the application as a whole consistent were what made the difference here. If I had to keep a specific interface from when the system was designed before, it would have been a much different story in terms of changes. I did take advantage of quickly mocking and wrapping objects for new interfaces, even monkey patching functionality in when working with less complete versions of the hardware libraries. What this really exposed isn’t that Python is fast to develop because of its dynamic type system, but that relying on convention leads to fewer complex interfaces. The lack of explicit constraints brings a necessity of consistency in the name of agility. The ability to quickly type out 3 or 4 lines to test an interface out, even if it isn’t in a REPL, was a massive boost in design productivity. This reinforces my experience with projects in more verbose languages accumulating cruft and complexity because it’s so time-consuming and visible to make and test changes with broad impact.
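A rough sketch of both tricks, using the standard library’s unittest.mock. SensorDriver is a hypothetical stand-in for an incomplete hardware library, and average_reading is an invented consumer of it.

```python
from unittest import mock


class SensorDriver:
    """Hypothetical stand-in for a hardware library that isn't finished yet."""

    def read(self):
        raise NotImplementedError("hardware not attached")


def average_reading(driver, samples=3):
    """Code under development that only needs something with a read() method."""
    return sum(driver.read() for _ in range(samples)) / samples


# Mocking: build a fake object constrained to the driver's interface.
fake = mock.Mock(spec=SensorDriver)
fake.read.side_effect = [1.0, 2.0, 3.0]
mocked_average = average_reading(fake)

# Monkey patching: temporarily replace the missing method on the real class.
with mock.patch.object(SensorDriver, "read", return_value=4.0):
    patched_average = average_reading(SensorDriver())
```

Outside the with block the original read method is restored, so the patch cannot leak into other code.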

I used tests sparingly here, mostly for setting up and benchmarking communication performance between the new components. Because most interfaces other than 0MQ were parsable by the linter, only the most fragile interfaces were tested, and only with production style integration workloads. I had very few problems with breaking changes causing bugs here, even when I’d change the meaning of an internal interface. Most of this was because everything was so cleanly encapsulated that changes in one part of the application wouldn’t cause other areas to break. I had enforced this by the design of the application, ensuring that most functions had a single specific output whose type wasn’t dependent on other functions within the application. This caused quite a bit of boilerplate for processing inputs and some duplicated code within each function, but the resulting ability to quickly iterate interfaces was undeniably beneficial for design maturity and functional progress.
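The “single specific output” rule, and the input-processing boilerplate it costs, might look something like this. The function to_samples is a hypothetical example, not code from the application.

```python
from typing import Iterable, List, Union

Number = Union[int, float]


def to_samples(raw: Union[Number, str, Iterable]) -> List[float]:
    """Hypothetical example: always return a list of floats, whatever comes in."""
    # The boilerplate lives here, at the input edge of the function.
    if isinstance(raw, (int, float)):
        return [float(raw)]
    if isinstance(raw, str):
        # Accept comma-separated text such as "1, 2.5".
        return [float(part) for part in raw.split(",") if part.strip()]
    return [float(item) for item in raw]
```

Every branch ends in the same type, so callers never need to inspect what they got back.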

Complex Prototype

The goal for this project was another application cleanup and productization with the eventual goal of integrating much of the functionality in this application into another (possibly F#). This had more structure than the global scripts, in that it had a single god object which communicated from a K-V store to the frontend. Attempting to break this up was much more time consuming than expected.

In this case, dynamic features were used to build a pseudo-ORM on top of the KV store, treating the backing store as a single object that had a few common access patterns. This means the linter couldn’t help at all on attribute names or usages. The god object which contained this backing store was passed by reference to every function, which caused data dependencies across the application to be a nightmare. While the logic itself was generally straightforward and uncomplicated, the sheer mass of it deserved far more isolated structure than was in place. I put tests around pieces as I went along, but they generally just exposed how intertwined the backend logic was with the presentation and representation layers. In short, another design problem.
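A minimal sketch of this kind of pseudo-ORM, assuming it was built on attribute hooks such as __getattr__ (the real implementation isn’t shown here, and the Store class is invented). It also shows why the linter is blind: attribute names only exist as keys at runtime.

```python
class Store:
    """Hypothetical pseudo-ORM: attribute access is forwarded to a key-value backend."""

    def __init__(self, backend):
        # Bypass our own __setattr__ when storing the backend itself.
        object.__setattr__(self, "_backend", backend)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self._backend[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Every assignment becomes a write to the backing store.
        self._backend[name] = value
```

A typo such as store.user_cuont is perfectly valid as far as static tools can tell. It fails only when that line actually runs.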

In comparison to previous projects, this had all the tough qualities of the overgrown script with all of the requirements to grow an application to stability. The tools were already failing and it was difficult to get a handle on any aspect of the code from which to recover control. It wasn’t simple or short enough to just add structure around the edges, and it didn’t have just a few problem spots to nail down. I had to take a methodical incremental test and refactor approach, where I created tests for each public method and then refactored without changing the interface.
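The test-then-refactor loop can be sketched as follows. legacy_summary stands in for an existing public method, refactored_summary for its replacement, and CASES for behaviour recorded before any refactoring; all three are hypothetical.

```python
from collections import Counter


def legacy_summary(records):
    """Stand-in for an existing public method whose behaviour must be preserved."""
    out = {}
    for r in records:
        out[r["kind"]] = out.get(r["kind"], 0) + r["n"]
    return out


def refactored_summary(records):
    """Refactored internals behind the same interface."""
    totals = Counter()
    for r in records:
        totals[r["kind"]] += r["n"]
    return dict(totals)


# Inputs and the outputs recorded from the legacy code before touching it.
CASES = [
    ([], {}),
    ([{"kind": "a", "n": 1}, {"kind": "a", "n": 2}, {"kind": "b", "n": 5}],
     {"a": 3, "b": 5}),
]


def check(summary_fn):
    """Run the recorded cases against any implementation of the interface."""
    for records, expected in CASES:
        assert summary_fn(records) == expected
```

Running check against both versions is the whole safety net: the tests describe what the code does today, not what it should do.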

I yearned to break the existing interfaces and put together something more compact by separating concerns. I attempted this a few times and stopped shortly after realizing that any useful change was very difficult to make all across the application. The number of assumptions about the program state and requirements was unchecked and very difficult to document, even in a working state. In my statically typed programs, I could quickly fix broken interfaces since I would be immediately confronted with all of the call sites so that I could enforce a level of consistency. Without any tools to make guarantees about the suitability of the program, I was forced to a guess and check mentality by writing tests. The very same implicit assumptions that made this application quick to develop in the first place made it very slow to evolve in a different direction. By accumulating features faster than structure, it had become a victim of its own success.

Remedying this required excessive diligence. I ruthlessly abstracted what I could as I tested the interfaces, taking on even more complexity in the hope that multiple refactoring passes would benefit from a less tightly coupled structure. My normal rule of thumb with refactoring is line count savings, but I forced myself to abandon that in pursuit of a format that would have more explicit contracts and assumptions for later changes. As I mentioned before, it was very disappointing to have to turn down ideas for new abstractions that wouldn’t fit with the old interfaces. I sometimes found myself experiencing Stockholm Syndrome with the code, thinking it was good enough now that I had slightly re-written and wrapped it in a few new layers. I’d only come to my senses after looking at what test cases were required to fully exercise the code paths. The number of possibilities for things to go wrong at the interface was just massive. I had to take such liberties about what could happen that I was often rethinking the caller design as much as the interface.

This whole experience made me dread refactoring in Python. Where refactoring in statically typed languages hurts because of compile times and the verbosity of changing type names and call sites, refactoring when the types are unchecked means you have no real idea what needs to be changed until it breaks at runtime, and then you don’t know if the thing that broke should really be where the fix should go. With decent types, it’s much easier to know that you’ve successfully satisfied an interface. Some OO anti-patterns make this contract-based debugging less useful, but I’m comparing primarily functional-style typed languages to Python’s duck types.

While I might extol the virtues of TDD in cases like this, I think that addresses the wrong issue. TDD could have wholesale prevented the design of the application like this, but that’s not helpful once the application already exists. The entire implementation could have been tested as is, and that would have only made it more tedious to make interface changes with multiple changing assumptions in client code and tests. It would have required double the amount of effort to make the application wide changes since it would affect the code and the tests! My primary motivation for using Python is a really great effort-reward ratio when making applications!

I’m still in the process of slogging through this project, and I expect I’ll rewrite more than I intended to as I encounter many necessary design changes. Even with this low expectation of progress, Python’s ability to do quick mocks, fast tests, and big changes will be key to pushing this project to completion. I’m waiting for the time when I can make contained changes in parts of the application without side effects elsewhere, and make small changes in key parts of the application for more complex functionality. I really do love that Python makes both types of improvements easy, even if I hate that it makes messes easier to create and harder to clean up.

Python Takeaways

This has probably been my deepest dive into Python application design, and my experiences have been split about 50/50. Until now, most of my Python experience has been in one-off utility scripts and services. It was the glue to connect systems with a bit of logic and handy magic, not something with the requirements of a standalone application. I’d grown small scripts into more complex featureful systems before, but they had been less focused solo projects evolving continuously over a number of months. The projects above were great to document because they had clear before and after points to make progress and explicit decisions.

I had undervalued tests in Python. They are essentially a long form runnable dynamic analysis. I had been able to get by with minimal tests before because it was easy to be consistent and provide robust interfaces when expected. I had only reached for tests before when faced with problems where running code in the REPL wasn’t enough to understand the functionality. Tests are absolutely required when working with custom classes and complex interfaces. They can be a huge pain to do well and they do make bad interfaces harder to change, but at the very least they provide sanity checks that linters cannot.

I think this boils down to my problem with tooling in Python. By requiring every non-trivial application to have exhaustive tests to maintain sanity during refactoring, you are essentially building your own custom analysis tooling for every new project that you maintain. While this can be awesome in some sense for long-term development velocity, the effort outlay required from the start of the project negates part of what made Python so attractive and easy to use in the first place!

Expressive power via flexibility is a dangerous tool to leverage, explicit types or not. Python makes a number of extremely useful features easy to use (dictionaries and comprehensions) but also makes some very powerful features appear simple when they are not (classes and duck types). This hidden power causes some of my programs to become lopsided in terms of complexity, where I’ll have large portions of the code (by line count) doing very trivial things with setup and data handling and then there will be a part where “magic happens”. This is where the complexity is packed into just a few lines that make the whole rest of the program possible and simple. Because Python just makes it all look so easy, this complexity isn’t apparent and if it continues to work under the correct assumptions, functionally invisible. The trap then comes when that region becomes a bottleneck and bogged down in complexity from both limitations of Python and the domain. It can affect the whole application organization since everything was relying on this component to make anything happen. Recovering from this state without powerful tools to restrict unexpected changes is a major undertaking. I underestimated how easy it was to build a very complex Python program without ever realizing it. Design mistakes in Python are much more difficult to recover from in comparison to simpler, statically checked languages.

Refactoring significant Python code brings a very different set of challenges than building it up for the first time. Python is still my go-to toolbox for knocking things out quickly. The more experience I have with its pitfalls, the harder I have to work to realize that Python has answers to these problems, even if I dislike their requirements. The deeper I go into Python’s ecosystem the more I find that it is designed for real world constraints and tradeoffs. I still have quite a number of reservations about Python for certain situations, but putting my lessons learned from other systems in Python has proven very fruitful.