Bytes Are All You Need: Transformers Operating Directly On File Bytes

This paper investigates whether deep learning classification can be performed directly on file bytes, without decoding files at inference time. The authors present ByteFormer, a transformer-based model that achieves 77.33% ImageNet Top-1 accuracy when trained and tested directly on TIFF file bytes, and 95.42% classification accuracy when operating on WAV files. Because the model consumes raw bytes, it also lends itself to privacy-preserving inference: it can run on obfuscated input representations with no loss of accuracy. The authors additionally demonstrate ByteFormer's ability to perform inference on input from a privacy-preserving camera. Removing modality-specific input preprocessing simplifies model development, and the ability to preserve privacy makes ByteFormer an attractive option for private inference. Code for ByteFormer is available at https://github.com/apple/ml-cvnets/tree/main/examples/byteformer.
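To make the idea of operating on file bytes concrete, here is a minimal sketch of the input side only. It treats each byte of a file as an integer token in the range 0 to 255, and illustrates one possible form of obfuscation: a fixed, seeded permutation of byte values. The function names (`bytes_to_tokens`, `make_byte_permutation`, `obfuscate`) and the choice of a byte-value permutation are assumptions made for illustration. They are not taken from the ByteFormer code base, and the model itself is not shown.

```python
import random

VOCAB_SIZE = 256  # one token id per possible byte value


def bytes_to_tokens(data: bytes) -> list[int]:
    """Treat every raw byte as an integer token in [0, 255]; nothing is decoded."""
    return list(data)


def make_byte_permutation(seed: int) -> list[int]:
    """Hypothetical obfuscation key: a fixed bijection on the 256 byte values."""
    rng = random.Random(seed)
    perm = list(range(VOCAB_SIZE))
    rng.shuffle(perm)
    return perm


def obfuscate(data: bytes, perm: list[int]) -> bytes:
    """Remap each byte through the permutation; the sequence length is unchanged."""
    return bytes(perm[b] for b in data)


# Little-endian TIFF magic bytes ("II", then 42) followed by a dummy payload.
file_bytes = bytes([0x49, 0x49, 0x2A, 0x00]) + bytes(range(16))

tokens = bytes_to_tokens(file_bytes)
perm = make_byte_permutation(seed=0)
obfuscated_tokens = bytes_to_tokens(obfuscate(file_bytes, perm))
```

In this sketch the obfuscated sequence has the same length and the same repetition structure as the original, since equal bytes map to equal tokens. That is one intuition for why a model trained on consistently remapped bytes could still learn from them, though the sketch itself does not demonstrate any accuracy claim.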