Performance
Theano uses several tricks to obtain good performance:
- common sub-expression elimination
- custom-generated C code for many operations
- pre-allocation of temporary storage
- loop fusion (which gcc normally can’t do)
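To make loop fusion and pre-allocation concrete, here is a minimal plain-Python sketch. It is not Theano code and does not reflect how Theano generates C; the function names `unfused` and `fused` are hypothetical, chosen only for illustration. It computes `a * b + c` element-wise, first with two passes and a temporary, then with a single pass that can write into storage allocated ahead of time.

```python
def unfused(a, b, c):
    # Two separate loops: the first one allocates a temporary list
    # that exists only to feed the second loop.
    tmp = [x * y for x, y in zip(a, b)]
    return [t + z for t, z in zip(tmp, c)]


def fused(a, b, c, out=None):
    # One loop, no intermediate list. If the caller passes `out`,
    # the result is written into that pre-allocated storage.
    if out is None:
        out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i] + c[i]
    return out
```

Both functions return the same values; the fused version touches each element once and can reuse the same output buffer across calls, which is the effect the optimizations in the list above aim for.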
In my neural-net experiments for course projects, I saw roughly 10x speedups over plain NumPy by using Theano. [More specific speed tests would be nice.]
With a little work, Theano could also implement more sophisticated optimizations:
- automatic ordering of matrix multiplications
- profile-based memory layout decisions (e.g. row-major vs. col-major)
- gcc intrinsics to exploit MMX and SSE2 parallelism for faster element-wise arithmetic
- conditional expressions
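As an illustration of what automatic ordering of matrix multiplications could involve, the sketch below uses the classic matrix-chain dynamic program. This is a textbook technique, not something the text says Theano implements; the function name `matrix_chain_order` and the `A0, A1, ...` labels are assumptions made for the example. Given the shapes of a chain of matrices, it picks the parenthesization with the fewest scalar multiplications.

```python
def matrix_chain_order(dims):
    """Matrix i has shape dims[i] x dims[i + 1].

    Returns (minimum scalar multiplications, parenthesization string).
    """
    n = len(dims) - 1  # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]

    # Solve for progressively longer sub-chains i..j.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):
                # Cost of (i..k) times (k+1..j), plus the final product.
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    def show(i, j):
        # Rebuild the chosen parenthesization from the split table.
        if i == j:
            return "A%d" % i
        k = split[i][j]
        return "(%s %s)" % (show(i, k), show(k + 1, j))

    return cost[0][n - 1], show(0, n - 1)
```

For shapes 10x100, 100x5 and 5x50, `matrix_chain_order([10, 100, 5, 50])` chooses `((A0 A1) A2)` at 7500 multiplications, versus 75000 for the other ordering, which shows why the order can matter by an order of magnitude.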