Flash attention is a performance optimization for attention layers, making it especially useful for large language models (LLMs), which benefit from faster and more memory-efficient attention computation.
Once enabled, supported layers such as layer_multi_head_attention() will attempt to use flash attention for faster computations. By default, this feature is enabled.
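A minimal sketch of toggling the setting and checking it, assuming this page documents config_enable_flash_attention() and using the counterparts listed under See also:

library(keras3)

# Flash attention is enabled by default
config_is_flash_attention_enabled()   # TRUE (the default)

# Turn the feature off for layers created afterwards
config_disable_flash_attention()
config_is_flash_attention_enabled()   # FALSE

# Re-enable it (assumed to be the function documented on this page)
config_enable_flash_attention()
config_is_flash_attention_enabled()   # TRUE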
Note that enabling flash attention does not guarantee it will always be used. Typically, the inputs must be in float16 or bfloat16 dtype, and input layout requirements may vary depending on the backend.
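For example, one way to satisfy the dtype requirement is to select a half-precision dtype policy before building the layer. This is a sketch only (the shapes, num_heads, and key_dim values are illustrative), and whether flash attention is actually dispatched still depends on the backend and input layout:

library(keras3)

# Use a bfloat16 compute dtype so attention inputs meet the dtype requirement
config_set_dtype_policy("mixed_bfloat16")

attn <- layer_multi_head_attention(num_heads = 4L, key_dim = 64L)

x <- random_normal(c(2, 16, 64))   # (batch, sequence, features)
out <- attn(x, x)                  # query = value = x; may use flash attention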
See also
config_disable_flash_attention()
config_is_flash_attention_enabled()
Other config: config_backend()
config_disable_flash_attention()
config_disable_interactive_logging()
config_disable_traceback_filtering()
config_dtype_policy()
config_enable_interactive_logging()
config_enable_traceback_filtering()
config_enable_unsafe_deserialization()
config_epsilon()
config_floatx()
config_image_data_format()
config_is_interactive_logging_enabled()
config_is_traceback_filtering_enabled()
config_set_backend()
config_set_dtype_policy()
config_set_epsilon()
config_set_floatx()
config_set_image_data_format()