Source code for my diploma thesis, written in 2020 without any professional programming experience.
The repository contains code for training and inference of a 3D CNN model that detects and classifies myeloma in femur CT scans. Besides classic supervised learning, it also includes code for a transfer learning approach and an attempt at multiple instance learning (MIL).
In hindsight, many things in the code seem outdated or impractical. The repository is pretty bare and falls short on coding best practices, structure and automation. I intentionally left it in this state as a benchmark against which to measure the progress of my programming skills.
I went through the code and compiled a (non-exhaustive) list of improvements I would make if I wrote this today.
In descending order of priority:
- use formatting tools like `black`, `isort` etc.
- use type hints and a static type checker like `mypy`.
- I'd set up a `pre-commit` routine to run the aforementioned checks and formatters (and a few others) automatically.
- I'd sacrifice some performance and use `pydantic` or `attrs` for automatic data validation.
- I'd set up a simple CI pipeline to test new code using GitHub Actions or a similar tool.
- write better structured docstrings, which would allow HTML documentation to be generated with Sphinx, for example.
- write unit tests from day one (omitted back then due to lack of time) and pursue a reasonable coverage level.
- use a more sophisticated dependency manager (like Poetry).
- use convenience libraries like `pathlib` (instead of `os`), `typer`/`click` (instead of `argparse`), `loguru` etc.
- I'd consider using PyTorch Lightning to reduce boilerplate code and to get access to some additional goodies (the callback system, robust logging, useful classes etc.).
- better configuration loading and processing. Today, I'd use Hydra (which didn't exist back then) or a simpler tool like Dynaconf.
- I'd add a `Dockerfile` if there was time left.
- I'd probably set up a simple MLOps pipeline using MLflow or ClearML to be able to store and compare models (it gets messy with just TensorBoard and local file system storage).
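
As a small illustration of two items from the list (type hints and `pathlib` instead of `os`), here is a minimal sketch. It is not taken from the thesis code: the function names `find_scans` and `output_path`, the `.npy` suffix and the `_pred.json` naming are hypothetical and only serve the example.

```python
from __future__ import annotations

from pathlib import Path


def find_scans(root: Path, suffix: str = ".npy") -> list[Path]:
    """Return all files under `root` with the given suffix, sorted by path.

    The suffix default is an assumption made for this sketch.
    """
    # Path.rglob walks the directory tree recursively, replacing an os.walk loop.
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


def output_path(scan: Path, out_dir: Path) -> Path:
    """Build a (hypothetical) prediction file path for a given scan."""
    # The / operator and .stem replace os.path.join and os.path.splitext.
    return out_dir / f"{scan.stem}_pred.json"
```

With the annotations in place, a static checker such as `mypy` can flag a call like `find_scans("data")`, which passes a `str` where a `Path` is expected, before the code ever runs.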