Human Presence Classification
A CLIP-based linear probe (logistic regression) that detects the presence of humans in fashion-domain images.
@author: Adham Elarabawy (www.adhamelarabawy.com)
Overview
I needed a human presence classifier to help structure a very large scraped dataset of fashion imagery. CLIP-based similarity scoring was not sufficient: at the precision I wanted, it would have dropped a substantial fraction of the images. Instead, I trained a logistic regression model on top of CLIP image features as a linear probe, using paired images from DeepFashion. It achieved 100% accuracy on the held-out test set (20% of the data, ~2k images). The model is almost certainly overfit to fashion imagery, but that is acceptable since fashion imagery is the downstream use case. Inference is extremely low latency, especially if you have already encoded your images with the ViT-B/32 CLIP variant.
On an A10, it takes ~23 milliseconds to encode an image and ~0.28 milliseconds to classify its features.
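To make the linear-probe idea concrete, here is a minimal sketch of fitting a logistic regression on frozen image features with an 80/20 split. It is not the original training script: the helper name `train_linear_probe`, the `max_iter` setting, and the synthetic 512-dimensional features (standing in for ViT-B/32 embeddings) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_linear_probe(features, labels, test_size=0.2, seed=0):
    """Fit logistic regression on frozen features; return (model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000)  # hyperparameters are assumptions
    probe.fit(X_train, y_train)
    return probe, probe.score(X_test, y_test)


# Synthetic stand-in for CLIP image embeddings (hypothetical data); real
# features would come from clip_model.encode_image(...).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400).astype(bool)  # True = has human
features = rng.normal(size=(400, 512)) + labels[:, None] * 1.0
probe, accuracy = train_linear_probe(features, labels)
```

Because the CLIP encoder stays frozen, only the small logistic regression is trained, which is why classification adds so little latency on top of encoding.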
Dataset
I used a subset of DeepFashion v1 to curate a dataset of paired images: a garment on its own, and the same garment on a person. I then used this pairing to create the final dataset with binary labels of human presence. Some notes:
- The images seem to be predominantly of women.
- The human models seem to cover most ethnicities and body types well. Early analysis has not shown any bias by ethnicity or body type.
- Most, if not all, of the images have a white background. In my testing, the model still generalizes quite well to other domains (natural or diverse backgrounds and poses).
- My hypothesis is that the paired nature of the data allowed the model to pick up on the correct features, which has made it very robust.

|Presence Case|Absence Case|
|---|---|
|<img src="https://datasets-server.huggingface.co/cached-assets/forgeml/viton_hd/--/forgeml--viton_hd/train/226/image/image.jpg" width="100px">|<img src="https://datasets-server.huggingface.co/cached-assets/forgeml/viton_hd/--/forgeml--viton_hd/train/226/cloth/image.jpg" width="100px">|
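The step from paired images to binary labels can be sketched roughly as follows. This is an illustration under assumptions, not the actual curation code: the function name `pairs_to_binary_examples` and the file paths are hypothetical.

```python
from typing import List, Tuple


def pairs_to_binary_examples(pairs: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Flatten (garment_only_path, on_person_path) pairs into (path, has_human) examples."""
    examples = []
    for garment_path, person_path in pairs:
        examples.append((garment_path, False))  # garment alone: no human
        examples.append((person_path, True))    # garment worn: human present
    return examples


# Hypothetical file names, for illustration only.
pairs = [("cloth/0001.jpg", "image/0001.jpg"), ("cloth/0002.jpg", "image/0002.jpg")]
examples = pairs_to_binary_examples(pairs)
```

Each pair contributes one positive and one negative example, so the resulting dataset is balanced by construction.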
Usage:
import clip
import torch
import pickle
import sklearn
import time
from PIL import Image
from huggingface_hub import hf_hub_download
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device)
repo_id = "adhamelarabawy/fashion_human_classifier"
model_path = hf_hub_download(repo_id=repo_id, filename="model.pkl")
with open(model_path, 'rb') as file:
    human_classifier = pickle.load(file)
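# load the image to classify (placeholder path; replace with your own file)
img = Image.open("path/to/image.jpg")
# time the prediction
start = time.time()
features = clip_model.encode_image(clip_preprocess(img).unsqueeze(0).to(device)).detach().cpu().numpy()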
encode_time = time.time() - start
pred = human_classifier.predict(features) # True = has human, False = no human
pred_time = time.time() - start - encode_time # time elapsed since encoding finished
print(f"Encode time: {encode_time*1000:.3f} milliseconds")
print(f"Prediction time: {pred_time*1000:.3f} milliseconds")
print(f"Prediction (has_human): {pred}")
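
If your images are already encoded, a whole batch of features can be classified in one call. The sketch below assumes the pickled classifier is a scikit-learn estimator that exposes `predict_proba` and uses `True` as the positive class. The helper `filter_has_human`, the threshold values, and the synthetic stand-in classifier are hypothetical and only there to make the example runnable on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def filter_has_human(classifier, features, threshold=0.5):
    """Return (indices of rows with P(has_human) >= threshold, all probabilities)."""
    positive_col = list(classifier.classes_).index(True)
    probs = classifier.predict_proba(features)[:, positive_col]
    return np.flatnonzero(probs >= threshold), probs


# Stand-in classifier trained on synthetic 512-dim features (hypothetical data);
# in practice, pass the loaded human_classifier and your stacked CLIP features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(bool)
X = rng.normal(size=(200, 512)) + y[:, None] * 1.0
stand_in = LogisticRegression(max_iter=1000).fit(X, y)

kept, probs = filter_has_human(stand_in, X, threshold=0.5)
kept_strict, _ = filter_has_human(stand_in, X, threshold=0.9)
```

Raising the threshold trades recall for precision, so it is worth checking on a labeled sample how many images a stricter setting drops.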