Depending on what you want the "feature" to represent, choose a model:
Subtract the mean and divide by the standard deviation (specific to the dataset the model was trained on).
You can average the vectors from all sampled frames (Global Average Pooling) to create one unique "fingerprint" for the entire file. 5. Implementation (Python Snippet)
Use ResNet-50 or ViT (Vision Transformer) pre-trained on ImageNet.