Alma: Free Tool to Speed Up your PyTorch Model | Auto-Benchmark 50+ Options
Impact
Inference optimization is a massive part of modern ML and AI. The faster you can run your model, the less money you have to spend on compute and the lower your application latency, making your end users that much happier.
However, it’s difficult to know which conversion option to use for any given combination of model, data, and hardware. alma is a free, open-source Python package that lets you find the best conversion option for your situation with one function call.
Background
My friend Saif Haq (MLOps wizard) and I set ourselves a challenge: the 1B row challenge, but for a neural network doing MNIST classification. Spoiler: we have not accomplished that.
However, along the way we did spend a lot of time diving into PyTorch model conversion options. I have a big interest in neural network quantization, but I didn’t know a ton about torch.export, compiling to CUDA Graphs via torch.compile, or what OpenXLA was.
As we experimented with how to speed the model up as much as possible, we ended up creating a project with:
- Scalable structure where we could mix and match different conversion options.
- 50+ supported conversion options.
- Easy to use CLI.
- Multiprocessing setup that allowed us to benchmark each process sequentially in isolation, ensuring that no option affected the global torch state.
- Graceful failures.
- Easy CI integration.
- Model-agnostic setup.
Having built all of that, we figured we should open source it, and so alma was born: a Python library for benchmarking PyTorch model speed across different conversion options 🚀
Alma is designed as a one-stop shop, where with one function call you can benchmark your PyTorch model inference speed across all of the dozens of different conversion options, so that you get the best option for your model, data, and hardware.
More than that, it’s intended to help people learn about these options. All of the code is very modular, and the code for each conversion option is made to be easy to understand.
Here’s the GitHub link. For installation, it’s as simple as:
pip install alma-torch
Usage
The core API for alma is benchmark_model, which is used to benchmark the speed of a model for different conversion options. The usage is as follows:
import torch

from alma import benchmark_model
from alma.benchmark import BenchmarkConfig
from alma.benchmark.log import display_all_results
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Load the model
model = ...
# Load the dataloader used in benchmarking
data_loader = ...
# Set the configuration (this can also be passed in as a dict)
config = BenchmarkConfig(
n_samples=2048,
batch_size=64,
device=device, # The device to run the model on
)
# Choose which conversions to benchmark
conversions = ["EAGER", "TORCH_SCRIPT", "COMPILE_INDUCTOR_MAX_AUTOTUNE", "COMPILE_OPENXLA"]
# Benchmark the model
results = benchmark_model(model, config, conversions, data_loader=data_loader)
# Print all results
display_all_results(results)
Depending on your hardware, your printed results may look like this (we used an NVIDIA Titan GPU):
EAGER results:
Device: cuda
Total elapsed time: 0.0206 seconds
Total inference time (model only): 0.0074 seconds
Total samples: 2048 - Batch size: 64
Throughput: 275643.45 samples/second
TORCH_SCRIPT results:
Device: cuda
Total elapsed time: 0.0203 seconds
Total inference time (model only): 0.0043 seconds
Total samples: 2048 - Batch size: 64
Throughput: 477575.34 samples/second
COMPILE_INDUCTOR_MAX_AUTOTUNE results:
Device: cuda
Total elapsed time: 0.0159 seconds
Total inference time (model only): 0.0035 seconds
Total samples: 2048 - Batch size: 64
Throughput: 592801.70 samples/second
COMPILE_OPENXLA results:
Device: xla:0
Total elapsed time: 0.0146 seconds
Total inference time (model only): 0.0033 seconds
Total samples: 2048 - Batch size: 64
Throughput: 611865.07 samples/second
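As noted in the config comment above, the configuration can also be passed in as a plain Python dict instead of a BenchmarkConfig object. A minimal sketch, assuming the dict takes the same fields shown above (reusing model, device, conversions, and data_loader from the earlier snippet):

# Assumption: the dict mirrors the BenchmarkConfig fields used earlier
config = {
    "n_samples": 2048,
    "batch_size": 64,
    "device": device,
}
results = benchmark_model(model, config, conversions, data_loader=data_loader)
display_all_results(results)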
Documentation and advanced usages
We have lots of examples of how to use alma, including discussion of advanced usages and design decisions. For example, we support multiprocessing so that each conversion method is benchmarked inside a dedicated environment, optional graceful or fast failures, full control over which device each conversion option runs on (with controllable graceful fallbacks in case of incompatibilities), and more.
We have multiple example scripts (e.g. here), documentation READMEs, and a Jupyter notebook.
Available conversion options
We currently support 50+ options, including:
- torch.compile (with different backends)
- torch.export
- torchao (GPU quantization)
- HuggingFace’s optimum quanto
- torch.ao.quantization (edge device quantization)
- TensorRT
- OpenXLA
- TVM
- ONNX
- ONNX RT
- CUDA Graphs
- JIT tracing
- TorchScript
- Half-precision (fp16, bf16) options
- Extensive combinations of the above!
We are constantly adding new options! See here for an up-to-date table of options.
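To give a flavour of what some of these options do under the hood, here is roughly what the Inductor max-autotune variant looks like in plain PyTorch. This is a minimal sketch using standard torch.compile arguments, not alma’s internal code; model and batch are placeholders for your own model and input tensor:

import torch

# Sketch (not alma internals): compile with the Inductor backend and max-autotune
compiled_model = torch.compile(model, backend="inductor", mode="max-autotune")

with torch.inference_mode():
    output = compiled_model(batch)  # first call triggers compilation; later calls run the optimized graph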
If you’re interested, please check out the repo. Again, the link is here.
If there is any particular option you are interested in, please open an issue, or if you want to have a crack at implementing it, contributions as PRs are very welcome! Adding new conversion options is very simple.
If alma is helpful to you in any way, please send us a message, leave a star on the repo, or let us know somehow; it’s always appreciated! alma is under the open-source MIT license, so you’re free to use it in any of your projects however you wish!
Please feel free to connect with us on LinkedIn:
Saif Haq: https://www.linkedin.com/in/saifhaq99/
Oscar Savolainen: https://www.linkedin.com/in/oscar-savolainen-phd-b88277121/