In this blog post, let's explore event detection in crowded scenes, specifically in a shopping mall setting, with models capable of running in real time as well as models that offer more open-ended scene understanding.
As robotics technology advances, security robots have the potential to become ubiquitous in public spaces such as shopping malls.
To enhance their capabilities and help build such a future, we can leverage computer vision and machine learning algorithms to detect unusual patterns of movement or objects that indicate potential security breaches or unusual events.
Such unusual events could include running, falling down, burglar-like full-face coverage, sudden gatherings, objects left unattended for a long time, and so on.
However, there is a trade-off: models with more open scene understanding tend to handle unseen events more accurately, while smaller, specialized models detect their target events faster in real time.
If the robots rely on internet connectivity to access larger, more accurate models, the video streaming adds latency. On the other hand, smaller models require careful training for specific events/tasks before deployment.
A concept of a future shopping mall with security robots
Vision Models
Vision models, commonly based on convolutional neural networks, are usually the first choice of algorithms to be tested and deployed in today’s AI-based vision systems.
For a security robot in a mall, many pretrained vision models can be fine-tuned to add the capability of detecting certain events like ‘cash payment at the counter’, 'fall detection', etc. For specific use cases, this requires data to train or fine-tune the vision model on. A recent use case of fall detection in a warehouse setting, tested at WISIO, is shown in the video below:
The trained vision models can perform inference at up to 17 frames per second (FPS) on a CPU and up to 60 FPS on an NVIDIA T4 GPU, allowing them to detect objects in real time. At WISIO, we increase the confidence of these vision models by carefully curating the data and the training pipelines. Once trained, the models can perform multilabel inference in a fraction of a second.
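To make this concrete, here is a minimal sketch of how such a per-frame detection loop might look, using an off-the-shelf torchvision detector as a stand-in for a fine-tuned event model; the video path, the commented-out checkpoint name, and the confidence threshold are illustrative placeholders, not the actual WISIO pipeline.

```python
# Minimal sketch: running a (fine-tuned) detection model frame by frame.
# Assumes torchvision and OpenCV are available; "mall_camera.mp4" and
# "fall_detector.pth" are hypothetical placeholders.
import cv2
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
# In practice the head would be replaced and fine-tuned on event-specific data, e.g.:
# model.load_state_dict(torch.load("fall_detector.pth"))  # hypothetical checkpoint
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

cap = cv2.VideoCapture("mall_camera.mp4")  # placeholder video source
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float().div(255).to(device)
    with torch.no_grad():
        detections = model([tensor])[0]
    keep = detections["scores"] > 0.6  # assumed confidence threshold
    boxes = detections["boxes"][keep].cpu().numpy()
    # ...event logic / overlays would consume `boxes` here
cap.release()
```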
Vision models can also be used to detect anomalous activities in fixed places that the model has not seen before, e.g., crowd gathering, escalator malfunction, unauthorized access/intrusion, fire/smoke, or any other contingency. Once an anomaly is detected, the necessary follow-up actions are triggered. For instance, gathering may not be allowed in a passageway, and a region counter helps detect such a situation, as shown in the example video below:
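As a rough illustration of the region-counting logic (independent of the detector that produces the boxes), the sketch below counts how many detected people stand inside a restricted polygon and flags a gathering when a threshold is exceeded; the polygon coordinates and the occupancy threshold are assumptions for illustration only.

```python
# Minimal sketch of a region counter: flag an anomaly when more than
# MAX_ALLOWED detected people are inside a restricted polygon.
import cv2
import numpy as np

# Illustrative restricted passageway region in pixel coordinates.
RESTRICTED_REGION = np.array(
    [[100, 400], [500, 400], [500, 650], [100, 650]], dtype=np.int32
).reshape(-1, 1, 2)
MAX_ALLOWED = 3  # assumed occupancy threshold

def count_in_region(person_boxes, region=RESTRICTED_REGION):
    """Count person boxes whose bottom-center point falls inside the region."""
    count = 0
    for (x1, y1, x2, y2) in person_boxes:
        foot_point = ((x1 + x2) / 2.0, float(y2))  # approximate ground contact point
        if cv2.pointPolygonTest(region, foot_point, False) >= 0:
            count += 1
    return count

def is_gathering(person_boxes):
    """Return True when the region is more crowded than allowed."""
    return count_in_region(person_boxes) > MAX_ALLOWED
```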
Vision-Language Models
To train our autonomous watchman roaming around a shopping mall for anomalous event detection, a supervised learning approach with a vision model, such as training a ResNet101, can work in real time and achieve excellent accuracy. But it depends heavily on data diversity, and the training is expensive for a real-world application. There are vision foundation models pre-trained on large datasets, but as soon as a new class appears that the model has not seen during training, the model may not be able to infer it.
To address this lack of openness and generalization to new scenes, recent studies encode text into the vision architectures so that the contextual richness of natural language helps the model recognize potentially unseen classes at inference time. This allows for good zero-shot performance (where no training data was provided for the class). A pretrained vision-language foundation model can be further fine-tuned on a dataset for a specific use case.
Vision-language model taking an image, text phrase, and corresponding label as input for fine-tuning
After fine-tuning, a different prompt applied to a different scene capturing different people's behavior
For example, in a shopping mall deployment, a vision-language model could be trained to recognize common events like “entering/exiting the mall”, "customer browsing," "cashier processing payment," and "security guard patrolling." However, if people are loitering near a busy area where it is not allowed, the model might still detect the activity if it is described in the text prompt. Leveraging the linguistic context, the model can better understand the situation and classify it as an anomaly, even without prior training examples of loitering in the mall. This approach makes the autonomous watchman system more robust and adaptable than purely vision-based methods. A list of well-suited text prompts is provided once, and the model is ready for inference. Next time a new event needs to be added, a new text prompt for that event should ideally be enough for the robot to detect it without further fine-tuning.
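As an illustration of this prompt-based workflow, the sketch below scores a single frame against a list of event prompts using an off-the-shelf CLIP model; CLIP here is only a stand-in for whichever vision-language model is actually deployed, and the prompt list and frame path are assumptions for the example.

```python
# Minimal sketch: zero-shot event scoring of one frame against text prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

EVENT_PROMPTS = [
    "people entering or exiting the mall",
    "a customer browsing store shelves",
    "a cashier processing a payment",
    "a security guard patrolling",
    "people loitering in a busy passageway",  # new event added as text only
]

image = Image.open("frame.jpg")  # placeholder frame from the robot's camera
inputs = processor(text=EVENT_PROMPTS, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
best = probs.argmax().item()
print(f"most likely event: {EVENT_PROMPTS[best]} ({probs[best]:.2f})")
```

Adding a new event then amounts to appending another phrase to `EVENT_PROMPTS`, with fine-tuning reserved for cases where zero-shot scores are not reliable enough.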
At WISIO, such near-real-time vision-language models are being actively tested for different use cases, reaching more than 50 FPS on an NVIDIA V100 GPU. In a crowded scene with many objects detected in parallel, the added latency compared to single-object detection is around 15-20%. Even with good generalization abilities, some fine-tuning is still performed for different cases. These models do not have tens or hundreds of billions of parameters and can be deployed on edge devices for real-time performance, making them suitable for a moving robot.
Multimodal Video Models
Moving towards slower foundation models with general scene understanding capabilities in a multimodal setting, WISIO is actively testing the development and deployment of such models. With video foundation models, spatiotemporal events can be queried, understood, and used in applications. For example, in the passageway video above, when we asked one of the multimodal foundation models whether it sees some people not walking, the (zero-shot) response was:
Yes, some people in the video are not walking. There is a small group of people, including some children, clustered together near the center-left of the frame. They appear to be stationary, likely talking or interacting with each other, while the rest of the people in the video are walking across the mall floor.
These natural language responses can assist security personnel with general event capturing without any fine-tuning. For future security robots in shopping malls, detailed instructions can be provided as natural language prompts to cover essentially any event.
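As a sketch of how such a query could be wired up, the snippet below samples a few frames from a clip and pairs them with a natural-language question. Since the specific multimodal model is not named here, the final model call is left as a hypothetical placeholder (`VideoQAModel` does not refer to a real library).

```python
# Illustrative sketch only: frame sampling for a video question to a
# video-language foundation model; the model wrapper is hypothetical.
import cv2

def sample_frames(video_path, num_frames=8):
    """Evenly sample a handful of frames so the whole clip fits in one query."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, max(total, 1), max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if len(frames) == num_frames:
            break
    cap.release()
    return frames

frames = sample_frames("passageway.mp4")  # placeholder clip
question = "Do you see some people who are not walking?"
# answer = VideoQAModel().ask(frames, question)  # hypothetical call to the deployed model
```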
The accuracy of the model families discussed above is continuously improving across the community, and we can both leverage and contribute to this ongoing effort.