
General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.

License

Apache-2.0 (LICENSE-APACHE), MIT (LICENSE-MIT)

Ramla-I/matrixmultiply

 
 


matrixmultiply

General matrix multiplication for f32 and f64 matrices.

Supports matrices with arbitrary row and column strides.
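To make "arbitrary row and column strides" concrete, here is a minimal reference sketch of a strided matrix product. It is not this crate's implementation or API: the function name `gemm_strided_ref`, the use of slices, and the `usize` strides are assumptions made for illustration only. Element (i, j) of each matrix is read from index `i * row_stride + j * col_stride`, so the same routine handles row-major, column-major, and transposed views.

```rust
/// Naive reference GEMM with general strides (hypothetical helper, not the
/// crate's API): C <- alpha * A * B + beta * C.
///
/// A is m x k, B is k x n, C is m x n. Element (i, j) of a matrix lives at
/// index `i * row_stride + j * col_stride` of its backing slice.
pub fn gemm_strided_ref(
    m: usize, k: usize, n: usize,
    alpha: f64,
    a: &[f64], rsa: usize, csa: usize,
    b: &[f64], rsb: usize, csb: usize,
    beta: f64,
    c: &mut [f64], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0;
            for p in 0..k {
                acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            let idx = i * rsc + j * csc;
            c[idx] = alpha * acc + beta * c[idx];
        }
    }
}
```

A row-major m x k matrix uses strides `(k, 1)` and a column-major one uses `(1, m)`; swapping the two strides of a matrix gives its transpose without copying any data.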

Uses the same microkernel algorithm as BLIS, but in a much simpler and less featureful implementation. See their multithreading page for a very good diagram of how the algorithm partitions the matrix. (Note: this crate does not implement multithreading.)
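The partitioning described above can be sketched roughly as follows. This is a simplified illustration of a BLIS-style blocked product, not the crate's code: the block sizes `MC`, `KC`, `NC`, the 4x4 tile, the helper names, and the restriction to contiguous row-major f64 matrices are all assumptions chosen to keep the example short. The operands are cut into blocks, each block is packed into contiguous panels, and a small microkernel computes one tile of the result at a time.

```rust
// Hypothetical tile and block sizes, chosen only for illustration.
const MR: usize = 4;
const NR: usize = 4;
const MC: usize = 32;
const KC: usize = 64;
const NC: usize = 64;

/// Pack an `mc x kc` block of row-major `a` (row length `lda`) into panels
/// of MR rows. Rows past the edge are zero-padded.
fn pack_a(a: &[f64], lda: usize, mc: usize, kc: usize, out: &mut Vec<f64>) {
    out.clear();
    for i0 in (0..mc).step_by(MR) {
        for p in 0..kc {
            for i in i0..i0 + MR {
                out.push(if i < mc { a[i * lda + p] } else { 0.0 });
            }
        }
    }
}

/// Pack a `kc x nc` block of row-major `b` (row length `ldb`) into panels
/// of NR columns. Columns past the edge are zero-padded.
fn pack_b(b: &[f64], ldb: usize, kc: usize, nc: usize, out: &mut Vec<f64>) {
    out.clear();
    for j0 in (0..nc).step_by(NR) {
        for p in 0..kc {
            for j in j0..j0 + NR {
                out.push(if j < nc { b[p * ldb + j] } else { 0.0 });
            }
        }
    }
}

/// Microkernel: one MR x NR tile of products, accumulated over `kc` steps.
fn kernel(kc: usize, a_panel: &[f64], b_panel: &[f64]) -> [[f64; NR]; MR] {
    let mut ab = [[0.0; NR]; MR];
    for p in 0..kc {
        let a = &a_panel[p * MR..(p + 1) * MR];
        let b = &b_panel[p * NR..(p + 1) * NR];
        for i in 0..MR {
            for j in 0..NR {
                ab[i][j] += a[i] * b[j];
            }
        }
    }
    ab
}

/// C += A * B for contiguous row-major A (m x k), B (k x n), C (m x n).
pub fn gemm_blocked(m: usize, k: usize, n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    let (mut apack, mut bpack) = (Vec::new(), Vec::new());
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            pack_b(&b[pc * n + jc..], n, kc, nc, &mut bpack);
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                pack_a(&a[ic * k + pc..], k, mc, kc, &mut apack);
                for (jp, jr) in (0..nc).step_by(NR).enumerate() {
                    let b_panel = &bpack[jp * kc * NR..(jp + 1) * kc * NR];
                    for (ip, ir) in (0..mc).step_by(MR).enumerate() {
                        let a_panel = &apack[ip * kc * MR..(ip + 1) * kc * MR];
                        let ab = kernel(kc, a_panel, b_panel);
                        // Masked write-back: only the in-bounds part of the tile.
                        for i in 0..MR.min(mc - ir) {
                            for j in 0..NR.min(nc - jr) {
                                c[(ic + ir + i) * n + jc + jr + j] += ab[i][j];
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Packing costs extra copies, but it lets the innermost loop read both operands sequentially and makes the tile size the only thing the microkernel needs to know about.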

Please read the API documentation here

Blog posts about this crate:


NOTE: Compile this crate using RUSTFLAGS="-C target-cpu=native" so that the compiler can produce the best output.
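The reason this flag matters is that `-C target-cpu=native` tells the compiler to enable the instruction set extensions of the build machine, and code can then select a kernel at compile time based on those features. The sketch below shows the general mechanism with the standard `cfg!(target_feature = "...")` macro. The function name is hypothetical, and the two shapes are taken from the changelog entries for 0.1.13 and 0.1.5; whether the non-AVX fallback is still 4x8 is an assumption.

```rust
/// Returns the (rows, columns) of a hypothetical f32 microkernel tile,
/// chosen at compile time from the target features enabled for the build.
pub fn sgemm_kernel_shape() -> (usize, usize) {
    if cfg!(target_feature = "avx") {
        // Wider tile when the build enables AVX, for example through
        // RUSTFLAGS="-C target-cpu=native" on an AVX-capable machine.
        (8, 8)
    } else {
        // Assumed fallback tile when AVX is not enabled.
        (4, 8)
    }
}
```

Because `cfg!` is evaluated at compile time, a binary built without the flag keeps the fallback shape even when it later runs on a CPU that supports AVX.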

Recent Changes

  • 0.1.14
    • Avoid an unused code warning
  • 0.1.13
    • Pick 8x8 sgemm (f32) kernel when AVX target feature is enabled (with Rust 1.14 or later, no effect otherwise).
    • Use rawpointer, a µcrate with raw pointer methods taken from this project.
  • 0.1.12
    • Internal cleanup with retained performance
  • 0.1.11
    • Adjust sgemm (f32) kernel to optimize better on recent Rust.
  • 0.1.10
    • Update doc links to docs.rs
  • 0.1.9
    • Work around an optimization regression in Rust nightly (1.12-ish) (#9)
  • 0.1.8
    • Improved docs
  • 0.1.7
    • Reduce overhead slightly for small matrix multiplication problems by using only one allocation call for both packing buffers.
  • 0.1.6
    • Disable manual loop unrolling in debug mode (quicker debug builds)
  • 0.1.5
    • Update sgemm to use a 4x8 microkernel (“still in simplistic rust”), which improves throughput by 10%.
  • 0.1.4
    • Prepare support for aligned packed buffers
    • Update dgemm to use an 8x4 microkernel, still in simplistic Rust, which improves throughput by 10-20% when using AVX.
  • 0.1.3
    • Silence some debug prints
  • 0.1.2
    • Major performance improvement for sgemm and dgemm (20-30% when using AVX). Since it all depends on what the optimizer does, I'd love to get issue reports that report good or bad performance.
    • Made the kernel masking generic, which is a cleaner design
  • 0.1.1
    • Minor improvement in the kernel
