
This post focuses on practical techniques for bridging the gap between a free-form PySpark notebook and modular, production-ready code. The goal is twofold:
We will walk through an example notebook that trains a model to predict taxi ride tip amounts. The notebook works, but most of the logic lives in a single long PySpark query that is difficult to understand. Even the author will struggle to untangle the logic after some time on another project.
We will progressively refactor this notebook, breaking the logic into functions, modules, and tests. Each stage represents a different level of maintainability. You can stop at any stage once your project’s requirements are satisfied. Projects have very different needs: from a one-off analysis with no maintenance requirements to a business-critical system with strict SLAs. The stages below intentionally scale from minimal to rigorous.
Assume the following (artificial but realistic) task. We want to train a model that predicts the tip amount for New York City taxi rides based on trip information. The requirements are:
Suitable only for one-time analysis or proof-of-concept work.
The full prototype notebook for the project can be found here. Setup instructions for running the notebook locally are in the README.md of the same repo.
At this stage, correctness matters more than structure. The notebook produces the desired output, and maintainability is not yet a concern. However, getting the desired output once does not guarantee successful runs in the future, debugging will be challenging, and extending the application's scope will require extra work.
Suitable for non-critical code that needs to be rerun occasionally.
Once the notebook produces correct results for a single run, it is time to make it readable and easier to reason about. The main goals are:
Extracting dataset loading
We start by removing duplication in dataset loading:
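A minimal sketch of what this can look like (the dataset names and file paths are illustrative, not the exact ones from the repo):

```python
# Hypothetical paths; the repo loads the NYC taxi data similarly.
dataset_paths = {
    "taxi_rides": "data/yellow_tripdata.parquet",
    "taxi_zone_geo": "data/taxi_zone_geo.parquet",
}

# One loading loop instead of a copy-pasted read per dataset.
sdfs = {name: spark.read.parquet(path) for name, path in dataset_paths.items()}
```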
Datasets now live in a single dictionary (sdfs), keyed by name, which simplifies downstream logic.
Splitting the logic into functional pieces
The core processing logic in the prototype notebook is implemented as a single chained PySpark query.
While concise, it hides intent and makes modification risky.
We group related filters into functions and expose their parameters explicitly:
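A sketch, assuming the data follows the standard NYC yellow taxi schema (the function and parameter names are illustrative):

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

# Parameters live at the top level, outside the functions.
max_trip_distance = 100.0
min_fare_amount = 0.0

def filter_valid_trips(
    sdf: DataFrame, max_trip_distance: float, min_fare_amount: float
) -> DataFrame:
    """Keep only rides with plausible distance and fare values."""
    return sdf.filter(
        (F.col("trip_distance") > 0)
        & (F.col("trip_distance") <= max_trip_distance)
        & (F.col("fare_amount") >= min_fare_amount)
    )
```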
Parameters are kept outside functions to make them easy to override later (e.g., via runtime configuration).
Chaining logic with .transform()
Instead of creating a new dataframe variable for each step, we use .transform() (available in PySpark since 3.0):
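For example, reusing filter_valid_trips from above and adding a second illustrative step:

```python
def drop_zero_passenger_trips(sdf: DataFrame) -> DataFrame:
    return sdf.filter(F.col("passenger_count") > 0)

# The pipeline reads top to bottom; steps can be commented out while debugging.
sdf_rides = (
    sdfs["taxi_rides"]
    .transform(lambda sdf: filter_valid_trips(sdf, max_trip_distance, min_fare_amount))
    .transform(drop_zero_passenger_trips)
)
```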
This pattern keeps the pipeline linear, readable, and easy to modify during debugging. The .transform() method can also take additional parameters:
Thus, we can reuse the function for both the pickup and drop-off locations:
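A sketch (PySpark 3.3+ forwards the extra positional arguments to the function; the function name is illustrative):

```python
def filter_known_location(sdf: DataFrame, location_col: str) -> DataFrame:
    """Drop rows whose location id is missing."""
    return sdf.filter(F.col(location_col).isNotNull())

sdf_rides = (
    sdf_rides
    .transform(filter_known_location, "PULocationID")
    .transform(filter_known_location, "DOLocationID")
)
```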
Grouping logic into abstraction levels
Filtering and feature engineering form logical units in the transformation step of the code:
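For instance, the individual steps can be grouped behind two higher-level entry points (a sketch; the real notebook's filters and features differ):

```python
def filter_rides(sdf: DataFrame) -> DataFrame:
    """All row-level filtering behind a single entry point."""
    return (
        sdf.transform(
            lambda s: filter_valid_trips(s, max_trip_distance, min_fare_amount)
        )
        .transform(filter_known_location, "PULocationID")
        .transform(filter_known_location, "DOLocationID")
    )

def add_features(sdf: DataFrame) -> DataFrame:
    """All feature engineering behind a single entry point."""
    return sdf.withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
```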
At this point, the executable logic shrinks to an ETL-like skeleton:
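Something along these lines, where load_datasets and train_model stand in for the loading and modeling code:

```python
# Extract
sdfs = load_datasets()
# Transform
sdf_model_input = sdfs["taxi_rides"].transform(filter_rides).transform(add_features)
# Train
model = train_model(sdf_model_input)
```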
print() statements are acceptable in prototypes but not in production. Replacing them with logging early avoids churn later. A minimal setup below would allow for keeping the printed output while replacing the print() statements with logger.info():
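A minimal sketch:

```python
import logging

# Bare messages, so the output looks the same as print().
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger(__name__)

logger.info("Rows after filtering: %s", sdf_model_input.count())
```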
An extended setup allows consistent formatting and future redirection to files or logging systems:
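For example:

```python
import logging
import sys

logger = logging.getLogger("tip_amount_model")
logger.setLevel(logging.INFO)

# A stream handler for now; a FileHandler or an external system can be added later.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
)
logger.addHandler(handler)
```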
The notebook with the refactored code can be found here.
Suitable for code maintained by more than one person.
Notebooks are convenient for exploration but awkward for collaboration and version control. Reading flow is inverted (low-level functions first) and merge request diffs are noisy.
Moving code into Python modules improves readability, collaboration, and testability. I added the functionality to .py files in the src directory of the repo.
Most logic lives in tip_amount_model.py.
Shared parameters are grouped into a configuration dataclass:
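A sketch with illustrative fields and defaults (the repo's dataclass will differ):

```python
from dataclasses import dataclass, field

@dataclass
class TipAmountModelConfig:
    max_trip_distance: float = 100.0
    min_fare_amount: float = 0.0
    test_fraction: float = 0.2
    feature_cols: list = field(
        default_factory=lambda: ["trip_distance", "pickup_hour"]
    )
```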
Another class holds the logic and shared variables to enable the computations:
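Roughly:

```python
from pyspark.sql import SparkSession

class TipAmountModelJob:
    def __init__(self, config: TipAmountModelConfig, spark: SparkSession):
        self.config = config
        self.spark = spark
        self.sdfs = {}  # shared datasets, inspectable after (partial) runs
```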
A single run() method defines the table of contents for the job:
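Assuming method names along these lines:

```python
# Inside TipAmountModelJob:
def run(self) -> None:
    """Top-level table of contents for the job; each stage is one method call."""
    self.load_datasets()
    self.transform()
    self.train_model()
    self.validate_model()
```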
Lower-level methods follow immediately after their callers, preserving top-down readability:
Running the job from the command line with parameters makes it easy to iterate on experiments in an automated and reproducible way. This requires a few small additions that allow the module to be executed as a script, for example:
The function below creates command-line arguments for all fields defined in the configuration dataclass:
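One possible implementation, restricted to fields with simple scalar defaults:

```python
import argparse
import dataclasses

def add_config_arguments(parser: argparse.ArgumentParser, config_cls) -> None:
    """Expose each field of the configuration dataclass as a CLI argument."""
    for f in dataclasses.fields(config_cls):
        if f.default is dataclasses.MISSING:
            continue  # skip fields that only have a default_factory (e.g. lists)
        parser.add_argument(f"--{f.name}", type=type(f.default), default=f.default)
```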
This approach ensures that the command-line interface stays in sync with the configuration: every field of the dataclass is automatically exposed as an argument, with its default value preserved.
Main entry point
The __main__ block allows the tip_amount_model.py module to be executed directly, with parameters passed from the command line:
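A sketch consistent with the names above, placed at the bottom of tip_amount_model.py:

```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train the tip amount model.")
    add_config_arguments(parser, TipAmountModelConfig)
    args = parser.parse_args()

    config = TipAmountModelConfig(**vars(args))
    spark = SparkSession.builder.appName("tip_amount_model").getOrCreate()
    TipAmountModelJob(config=config, spark=spark).run()
```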
To allow Python to resolve imports correctly when running the script from outside the project directory, add empty __init__.py files to the project directory and to the src directory.
This marks both directories as Python packages and enables execution via python -m ... instead of relying on relative paths or manual PYTHONPATH manipulation.
Reusable infrastructure such as logging deserves its own module. A shared logging setup ensures consistent conventions across jobs by calling:
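For example, a shared helper (the module and function names are illustrative) that every job calls at startup:

```python
# e.g. src/logging_utils.py
import logging
import sys

def setup_logging(name: str) -> logging.Logger:
    """Return a logger configured with the project-wide conventions."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger

# In each job module:
# logger = setup_logging(__name__)
```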
Any function used across jobs should follow the logging module's example: defined in a shared module and imported, rather than defined within a job module. One caveat with such standalone functions is that all parameters must be passed explicitly instead of being available through self.
Even with production code in modules, notebooks remain useful for debugging and experimentation.
Debugging module-based code from a notebook is primarily about shortening the feedback loop between code changes and observed behavior.
The typical debugging workflow consists of three steps:
Make modules available for import
Because the notebook lives outside the src directory, Python does not automatically know where to find the project modules. One simple way to fix this in a notebook is to add the project root to Python’s import path:
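For example:

```python
import sys
from pathlib import Path

# Two levels above the notebooks directory is blog_posts.
sys.path.append(str(Path.cwd().parents[1]))
```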
Here we add the blog_posts directory (two levels above notebooks) to Python’s module search path, making pyspark_to_production importable.
Create job objects
Next, import the main classes and create the configuration and job instances:
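Along these lines (the import path follows the repo layout; class names as in the sketches above):

```python
from pyspark.sql import SparkSession
from pyspark_to_production.src.tip_amount_model import (
    TipAmountModelConfig,
    TipAmountModelJob,
)

spark = SparkSession.builder.getOrCreate()
config = TipAmountModelConfig()
job = TipAmountModelJob(config=config, spark=spark)
```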
At this point, the configuration is accessible as job.config, and the pipeline stages are available as methods on the job object.
Run the job
Finally, run the full pipeline to ensure that everything works end-to-end.
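With the names from the sketches above, that is simply:

```python
job.run()
```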
If an error occurs, the stack trace will point directly to the failing method in the module code, making it clear where to investigate.
Inspect intermediate results
One advantage of the class-based design is that intermediate artifacts are preserved on the job instance. Any dataset stored in self.sdfs can be inspected interactively.
For example, to inspect the training dataset:
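For example ("train" is an illustrative key; use whatever name the job actually stores):

```python
job.sdfs["train"].show(5)
job.sdfs["train"].printSchema()
```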
Iterate on fixes
After fixing a bug in the module code, you have two options: restart the kernel and re-import everything, or reload the affected module in place (e.g., with importlib.reload or the %autoreload extension) and re-create the job instance.
One of the main advantages of interacting with the job through a notebook is the ability to experiment quickly while still using production-ready code. Experimentation typically falls into two categories: parameter tuning and prototyping new logic.
Parameter-level experimentation
Many experiments can be performed by simply modifying configuration values and rerunning the relevant stages. This is the safest and fastest way to explore model behavior, as it requires no code changes.
For example, to evaluate the impact of a smaller test set, update the configuration directly on the job instance and rerun the training and validation stages:
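A sketch using the illustrative config field and method names from above:

```python
# Use a smaller test set and rerun only the affected stages.
job.config.test_fraction = 0.1
job.train_model()
job.validate_model()
```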
Because the pipeline stages are independent, only the affected parts need to be rerun. This makes parameter tuning fast and encourages systematic experimentation.
Local prototyping
Some experiments go beyond parameter changes and require trying out new ideas that are not yet part of the production code. For example, you may want to quickly try out a different model or a feature-processing tweak. In such cases, the goal is to prototype without modifying the repository, keeping experiments local to the notebook.
Suppose we want to try a Gradient-Boosted Tree regressor instead of the default model. We can define a modified training function directly in a notebook cell:
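A sketch, assuming the job assembles features into a "features" vector column and stores the training data under self.sdfs["train"]:

```python
from pyspark.ml.regression import GBTRegressor

def train_model_gbt(self) -> None:
    """Notebook-local replacement for the default training step."""
    gbt = GBTRegressor(featuresCol="features", labelCol="tip_amount")
    self.model = gbt.fit(self.sdfs["train"])
```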
The function can then be dynamically bound to the existing job instance:
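Binding can be done with types.MethodType:

```python
import types

# Replace the training step on this instance only; the module stays untouched.
job.train_model = types.MethodType(train_model_gbt, job)
```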
After that, the modified logic can be exercised by rerunning the relevant stages:
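With the illustrative method names:

```python
job.train_model()
job.validate_model()
```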
This approach allows rapid exploration of alternative implementations while preserving a clean separation between experimental code and the production codebase. However, if the experiment grows, becomes reproducible, or needs team review, a proper repository branch will be a better choice.
All examples shown in this section are available in the playground notebook.
Required for time-critical or frequently changing systems.
Code changes introduce bugs far more often than we would like. The goal of unit testing is to catch those bugs before they reach production, reducing rollbacks, hot fixes, and firefighting.
The fastest way to start unit testing is to write simple behavioral checks directly in the module-interaction notebook from the previous stage. As an example, let’s verify that the add_features() function actually produces the feature columns listed in feature_cols:
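A sketch (method and attribute names follow the earlier sketches):

```python
# Apply the transformation to a small sample and check the output columns.
sdf_out = job.add_features(job.sdfs["taxi_rides"].limit(100))

missing = set(job.config.feature_cols) - set(sdf_out.columns)
assert not missing, f"Missing feature columns: {missing}"
```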
This test builds a small input dataframe, applies add_features(), and asserts that every column listed in feature_cols is present after the transformation. Even simple checks like this are effective at catching accidental column renames or mismatches between transformation logic and configuration.
While inline notebook checks work, they quickly become problematic: the fake input data gets duplicated across cells, the checks are not self-contained, and nothing runs them automatically.
To address this, we refactor the test logic into self-contained test functions and introduce reusable fake data generators.
Reusable fake data generation
We start by defining lightweight schema classes and a helper function to generate PySpark Row objects with defaults:
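A sketch with an illustrative subset of the taxi ride schema:

```python
from dataclasses import asdict, dataclass
from pyspark.sql import Row

@dataclass
class FakeTaxiRide:
    trip_distance: float = 2.5
    fare_amount: float = 10.0
    passenger_count: int = 1
    PULocationID: int = 42
    DOLocationID: int = 43
    tip_amount: float = 2.0

def make_row(schema_cls, **overrides) -> Row:
    """Build a Row from a schema dataclass, overriding only the given fields."""
    return Row(**{**asdict(schema_cls()), **overrides})
```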
This setup allows each test to override only the fields it cares about, while the rest are filled with sensible defaults.
Test function
With reusable data generation in place, we can write proper unit tests that exercise larger parts of the pipeline. The test function then looks as follows:
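A sketch (the dataset keys and the taxi_zone_geo columns are illustrative):

```python
def test_transform_keeps_valid_rides(job):
    rides = job.spark.createDataFrame([
        make_row(FakeTaxiRide),                      # valid ride
        make_row(FakeTaxiRide, trip_distance=-1.0),  # should be filtered out
    ])
    zones = job.spark.createDataFrame([Row(LocationID=42), Row(LocationID=43)])

    job.sdfs = {"taxi_rides": rides, "taxi_zone_geo": zones}
    job.transform()

    assert job.sdfs["model_input"].count() == 1
```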
This test verifies behavior at the level of the transform() stage, which is why we also provide a minimal fake taxi_zone_geo dataset.
Additional tests
The data-generation boilerplate becomes worthwhile once multiple tests reuse it. For example, we can validate that airport trips continue to be excluded:
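For example (the airport zone id and the filtering function name are illustrative):

```python
AIRPORT_LOCATION_ID = 132  # an illustrative airport zone id

def test_airport_trips_are_excluded(job):
    rides = job.spark.createDataFrame([
        make_row(FakeTaxiRide),                                    # regular trip
        make_row(FakeTaxiRide, PULocationID=AIRPORT_LOCATION_ID),  # airport pickup
    ])

    filtered = job.filter_rides(rides)

    assert filtered.count() == 1
```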
Test functions can be executed directly from a notebook cell:
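For example:

```python
test_transform_keeps_valid_rides(job)
test_airport_trips_are_excluded(job)
```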
This approach keeps the feedback loop short while building tests that can later be moved into a proper test suite with minimal changes.
All examples in this section are available in the test notebook.
Notebook-based tests are far better than having no tests at all. They work well for small projects or exploratory code, but they do not scale to large, actively developed, production-facing systems.
As the codebase grows, testing must be automated. A scalable testing setup should live next to the code in version control, run the whole suite with a single command, and avoid repeating expensive setup such as creating a Spark session for every test.
To achieve this, the notebook-based test functions are moved into test modules. In this example, the tests are placed in tests/unittests/test_tip_amount_model.py. The tests themselves remain largely unchanged. The example uses pytest, while the standard unittest library could be used as well.
Sharing a Spark session across tests
First, let’s define a single shared Spark session for all the tests because recreating it for every test is slow and wasteful. Adding a pytest fixture with session-level scope to the global settings module conftest.py will do the trick:
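A typical fixture looks like this:

```python
# tests/conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()
```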
An additional benefit of this setup is that tests can now depend on the Spark session directly, instead of pulling it from a job instance. The test signature changes accordingly:
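Sketched against the example test above:

```python
def test_airport_trips_are_excluded(spark):
    # `spark` is injected by the session-scoped fixture from conftest.py
    rides = spark.createDataFrame([
        make_row(FakeTaxiRide),
        make_row(FakeTaxiRide, PULocationID=AIRPORT_LOCATION_ID),
    ])
    ...
```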
Parameterizing tests
pytest also allows running the same test logic with multiple input combinations. This is especially useful for validating edge cases and understanding exactly which inputs cause failures. For example, airport filtering can be tested as follows:
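A sketch (filter_airport_trips is an illustrative name for the function under test):

```python
import pytest

@pytest.mark.parametrize(
    "pickup_id, dropoff_id, expected_rows",
    [
        (42, 43, 1),                                    # regular trip kept
        (AIRPORT_LOCATION_ID, 43, 0),                   # airport pickup dropped
        (42, AIRPORT_LOCATION_ID, 0),                   # airport dropoff dropped
        (AIRPORT_LOCATION_ID, AIRPORT_LOCATION_ID, 0),  # both ends at airports
    ],
)
def test_airport_filtering(spark, pickup_id, dropoff_id, expected_rows):
    rides = spark.createDataFrame(
        [make_row(FakeTaxiRide, PULocationID=pickup_id, DOLocationID=dropoff_id)]
    )
    assert filter_airport_trips(rides).count() == expected_rows
```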
Here we test each combination of pickup and dropoff locations separately, also passing the number of rows we expect after filtering.
Running the test suite
Before running the tests, ensure that __init__.py files are present in both the project root and test directories so that Python can correctly resolve imports.
Tests can then be executed from the project root with:
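For example:

```
python -m pytest tests/unittests
```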
Optional flags:
- -vv to display test names;
- -s to show print statements and logs.
Closing the loop
Once the test infrastructure is in place, additional test functions should be added to cover the rest of the pipeline logic. Tests should be updated alongside code changes and run automatically as part of the development workflow.
Notebooks are an excellent medium for exploration, but they are a poor long-term container for production logic. The problem is not the use of notebooks themselves; it is the lack of a clear path from experimentation to maintainable code.
In this post, we walked through that path step by step: from a working prototype notebook, to a refactored and readable notebook, to modular code in .py files, to notebook-driven debugging and experimentation, and finally to an automated test suite.
Each stage represents a trade-off between speed and rigor. Not every project needs to reach the final stage, but every project benefits from knowing when to stop. A one-off analysis, a recurring internal report, and a business-critical pipeline all justify different levels of structure and testing.
The key idea is not to “productionize everything”, but to design notebooks and modules so that evolution is possible. When code is structured with clear stages, explicit parameters, and testable units, the handoff between data scientists and engineers becomes collaboration instead of reimplementation.
Sr. Data Scientist at Xomnia
