Efficiency Follows Capability: A Decade of Video Understanding Research Trends

Abstract

Using arXiv CS publication trends (2015–2025), this article tracks the growth of video understanding research and the subset that explicitly frames contributions around computational constraints (efficiency, real-time, or lightweight).

  • Video understanding grows from 5 papers (2015) to 621 papers (2025).
  • The efficiency/real-time/lightweight subset reaches 245 papers in 2025 (39.5% of the video understanding corpus).
  • In 2024, growth is explosive in both series (+152% VU vs +181% efficient), suggesting efficiency has become embedded rather than trailing behind.
  • Growth-rate dynamics indicate a recurring lag: efficiency work tends to surge 1–2 years after capability surges.

Counts come from keyword-based queries over arXiv CS and should be read as a thermometer of attention, not a complete census.

Understanding the Data

This analysis tracks arXiv computer science publications from 2015 to 2025 across two primary research themes: video understanding (the broad field of building models that comprehend video content) and efficient video understanding (work that explicitly addresses computational constraints such as FLOPs, parameters, latency, or memory footprint). The data serves as a thermometer of research attention rather than a complete census of the literature.1

Publication counts are influenced by terminology: what researchers choose to call their work shapes what appears in keyword searches. Even with that caveat, sustained growth in counts usually coincides with broader changes: more researchers entering the area, shared tooling becoming available, and practical applications creating demand.2 The patterns in this dataset follow that shape.

The Mainstreaming of Video Understanding

Video understanding has expanded significantly over the past decade. Early work in the 2010s relied on hand-crafted features and relatively shallow architectures applied to carefully curated datasets like UCF101 and HMDB51 for action recognition.3, 4 The field remained somewhat specialized, with most computer vision research focusing on static images rather than temporal sequences.

The field accelerated in the late 2010s and through the 2020s. Large-scale benchmarks such as Kinetics-400/600/700, spatiotemporal architectures like I3D and SlowFast networks, and video transformers such as TimeSformer and Video Swin Transformer contributed to higher research throughput.5, 6, 7, 8 Video understanding moved from a niche specialization toward a more mainstream research domain, with applications in autonomous systems, content moderation, accessibility, and human-computer interaction.

Figure 1: Growth of video understanding publications on arXiv (2015–2025), rising from 5 papers in 2015 to 621 in 2025.

Figure 1 shows low volumes through 2019 and sustained expansion after 2020, with the largest jump coming in 2024–2025, when annual output roughly doubled again (a near-104% increase in 2025). This period overlaps with broader shifts in the ML ecosystem: more video-language work, tighter integration into multimodal foundation models, and widely used pretraining datasets like HowTo100M and WebVid.9,10

Efficiency Emerges as a Core Research Priority

Raw publication counts can mask structural changes when an entire field is expanding. Figure 2 shows both video understanding and efficiency-focused work surging together, particularly from 2023 onward, but this parallel growth could simply reflect the overall expansion of ML research on arXiv, not a genuine shift in priorities.

Figure 2: The rise of efficient video understanding, comparing total video understanding papers against those explicitly addressing efficiency, demonstrating efficiency's emergence from niche (0 papers in 2015) to mainstream (245 in 2025).

A more revealing view is the relative share of efficiency-focused work within the video understanding corpus (Figure 3). This normalization filters out the effect of overall field growth and reveals whether efficiency is becoming a more common framing in the literature.
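
For concreteness, this normalization is just the per-year ratio of the efficiency subset to the full corpus; a minimal sketch, using only the 2015 and 2025 counts stated above (other years would come straight from the query results):

```python
# Minimal sketch of the normalization behind Figure 3: the per-year share of
# efficiency-framed papers within the video understanding corpus. Only the
# 2015 and 2025 counts below come from the article; other years would be
# filled in directly from the query results.
vu_counts = {2015: 5, 2025: 621}    # total video understanding papers
eff_counts = {2015: 0, 2025: 245}   # efficiency / real-time / lightweight subset

def efficiency_share(total: dict[int, int], subset: dict[int, int]) -> dict[int, float]:
    """Fraction of the corpus that is efficiency-framed, per year."""
    return {year: subset[year] / total[year] for year in total if total[year] > 0}

print(efficiency_share(vu_counts, eff_counts))
# -> {2015: 0.0, 2025: 0.394...}, i.e. roughly 39.5% in 2025
```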

Figure 3: Share of efficiency-focused papers within video understanding research (2015–2025), demonstrating the field's transition from efficiency as an afterthought (0–17%) to a central priority (nearly 40%).

Figure 3 reveals the structural shift: efficiency-focused work has grown from zero in 2015 to nearly two-fifths of all video understanding papers by 2025. The steepest increase occurs after 2020, coinciding with deployment pressures from real-world applications. This isn't just "more papers because the field is bigger"; it's a fundamental reallocation of research attention toward computational constraints as a first-order design consideration.

Why Efficiency Becomes Central as Fields Mature

Video understanding is inherently expensive from a computational perspective. Unlike static images, videos present multiple frames per second of high-dimensional visual data, creating sequences that can span hundreds or thousands of frames for even short clips. Temporal modeling requires architectures that capture motion and long-range dependencies, further multiplying computational costs. State-of-the-art video transformers can require hundreds of billions of FLOPs per inference, with memory footprints that exceed what is feasible on edge devices or even single GPUs.
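
A back-of-envelope count makes this concrete. The settings in the sketch below (16 sampled frames, 224×224 input, 16×16 patches) are illustrative assumptions rather than figures for any particular model, but they show how quickly full space-time self-attention grows with token count:

```python
# Illustrative back-of-envelope: token counts and attention cost for a vanilla
# video transformer. All settings here are assumptions chosen for illustration,
# not numbers from the article or from any specific architecture.

frames = 16            # frames sampled from the clip
height = width = 224   # input resolution
patch = 16             # spatial patch size

tokens_per_frame = (height // patch) * (width // patch)   # 14 * 14 = 196
total_tokens = frames * tokens_per_frame                   # 3,136 tokens

# Full space-time self-attention scales quadratically in token count, so
# doubling the number of sampled frames roughly quadruples this term.
attention_pairs = total_tokens ** 2

print(f"{total_tokens} tokens, ~{attention_pairs:,} attention pairs per layer")
# -> 3136 tokens, ~9,834,496 attention pairs per layer
```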

As model capabilities have expanded, so too has deployment pressure. Real-world applications demand real-time inference (autonomous vehicles, robotics), low-latency streaming (live content moderation, video conferencing), and on-device processing (mobile applications, privacy-sensitive scenarios). These constraints make efficiency not just desirable but essential.

What "Efficiency" Means in Practice

Efficiency-focused video understanding research encompasses a diverse set of approaches, all aimed at reducing resource consumption while preserving semantic performance:

  • Faster inference: Reducing wall-clock latency through architectural optimizations, pruning, and quantization
  • Lower memory footprints: Compressing models and intermediate activations to fit tighter memory budgets
  • Improved scaling to long videos: Processing minute- or hour-long content without exhausting compute or memory
  • Reduced frame and token processing: Adaptive sampling, temporal pooling, and token merging to limit redundant computation
  • Hardware-aware designs: Architectures optimized for specific deployment targets (mobile GPUs, edge TPUs, cloud infrastructure)

These tactics vary in implementation, but they share a common goal: making video understanding systems deployable under real-world constraints.13,14
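
The simplest of these to illustrate is reduced frame processing. The sketch below implements plain uniform frame sampling as a deliberately minimal stand-in for the adaptive and learned strategies in the literature; the function and its settings are illustrative assumptions, not any particular published method:

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample `num_frames` frames from a (T, H, W, C) clip.

    A deliberately simple stand-in for adaptive sampling or token merging:
    the model then processes num_frames frames instead of all T, cutting
    token count (and attention cost) roughly proportionally.
    """
    t = video.shape[0]
    indices = np.linspace(0, t - 1, num=num_frames).round().astype(int)
    return video[indices]

clip = np.zeros((300, 224, 224, 3), dtype=np.uint8)  # e.g. a 10 s clip at 30 fps
print(sample_frames(clip, num_frames=8).shape)       # -> (8, 224, 224, 3)
```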

In this dataset, efficiency moves from a relatively rare framing to a common one. Many papers now report FLOPs, parameter counts, inference latency, and throughput alongside accuracy, and a growing subset explicitly frames contributions as trading small accuracy changes for large computational savings.15, 16

Growth Dynamics: Identifying Regime Changes

Absolute publication counts show growth; growth rates reveal when the pace changes and, more importantly, how the two series move relative to each other. Figure 4 compares year-over-year changes for video understanding and the efficiency subset from 2020 onward, when volumes are large enough to produce stable signals.
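
For reference, the growth rates discussed in this section are plain year-over-year percentage changes over consecutive yearly counts; a minimal sketch (the example counts are hypothetical and chosen only to mirror the reported rates):

```python
# Year-over-year growth as plotted in Figure 4: the percentage change between
# consecutive yearly counts. The counts below are hypothetical placeholders
# chosen only to mirror the +152% / +104% rates reported in the text; they are
# not the underlying query results.
def yoy_growth(counts: dict[int, int]) -> dict[int, float]:
    """Percent change from the previous year, for each year after the first."""
    years = sorted(counts)
    return {
        year: 100.0 * (counts[year] - counts[prev]) / counts[prev]
        for prev, year in zip(years, years[1:])
    }

example = {2023: 100, 2024: 252, 2025: 514}   # hypothetical counts
print(yoy_growth(example))                     # -> {2024: 152.0, 2025: ~104.0}
```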

Figure 4: Year-over-year growth rates for video understanding and efficient video understanding (2020–2025). Both categories show explosive growth in 2024, with efficiency-focused work growing even faster (+181%) than the parent category (+152%).

Three phases stand out. (1) 2020–2021: video understanding grows modestly in 2020 (+16%) before surging in 2021 (+128%), likely driven by the arrival of video transformers and large-scale pretraining. Efficiency work grows steadily in both years (+30%, then +77%) but does not keep pace with the 2021 capability boom. (2) 2022–2023: the field contracts slightly in 2022 (−5% VU) and recovers modestly in 2023 (+30%). During this consolidation, efficiency work continues to grow (+35% in 2022, +19% in 2023), quietly gaining share while overall output stalls. (3) 2024–2025: both categories explode, with VU more than doubling in a two-year span.

A pattern emerges from these phases: efficiency research lags capability surges by roughly one to two years. When the field expands rapidly (2021), the initial wave of papers focuses on pushing accuracy and generality. Efficiency catches up afterward, during the consolidation that follows (2022–2023). By the time the next expansion arrives (2024), efficiency has become embedded in the research agenda rather than trailing behind it.

The 2024–2025 data supports this interpretation. In 2024, video understanding grew +152% while efficiency-focused work grew +181%, outpacing the parent category by nearly 30 percentage points. Unlike the 2021 boom, efficiency is no longer playing catch-up; it is expanding faster than the field itself. By 2025, growth rates moderate but the gap persists (+136% efficiency vs +104% VU), suggesting that computational constraints have become a first-order design consideration rather than an afterthought.

Across the six years shown, efficiency-focused work outpaces video understanding in four (2020, 2022, 2024, 2025). The two exceptions (2021, 2023) are both years of strong overall VU growth, precisely the capability-driven surges where new architectures and benchmarks dominate attention before efficiency work absorbs and optimizes them.

Future Outlook and Open Questions

The efficiency share has climbed from near zero to roughly 40% in a decade. If the current trajectory holds, efficiency-focused work could represent a majority of video understanding publications within the next two to three years. But there are reasons to expect the curve to flatten rather than cross 50%.

Efficiency may become invisible. As techniques like token merging, adaptive frame sampling, and quantization-aware training mature, they risk becoming standard practice, baked into architectures rather than highlighted as contributions. When efficiency is the default, papers stop advertising it. The share metric would plateau or even decline, not because efficiency stopped mattering, but because it stopped being novel enough to foreground. This would be a sign of success, not retreat.

The lag pattern may repeat. The growth dynamics section identified a recurring pattern: capability surges precede efficiency surges by one to two years. If a new architectural paradigm emerges (like state-space models for video, or native video generation as a pretraining objective), the initial wave of publications will likely focus on what the new approach can do, with efficiency work following as the community works out how to make it deployable. Watching for this lag in future data would test whether the pattern is structural or coincidental.

Benchmarking remains fragmented. Efficiency metrics are still reported heterogeneously: different input resolutions, hardware platforms, batch sizes, and measurement protocols make direct comparisons difficult. The community would benefit from standardized efficiency benchmarks analogous to MLPerf17, but specialized for video understanding, covering not just throughput but latency, memory, and energy under realistic deployment conditions.
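
As a sketch of what a shared protocol would need to pin down, the snippet below times a generic inference callable with declared warm-up and trial counts. It illustrates the kind of measurement discipline an MLPerf-style benchmark enforces rather than proposing a standard; the dummy workload is purely a placeholder:

```python
import statistics
import time

def measure_latency(run_inference, num_warmup: int = 5, num_trials: int = 50) -> dict:
    """Median / p95 / mean wall-clock latency of a zero-argument inference callable.

    A shared benchmark would additionally fix and report input resolution, clip
    length, batch size, hardware, numeric precision, and memory/energy use.
    """
    for _ in range(num_warmup):          # warm caches and runtimes before timing
        run_inference()
    timings_ms = []
    for _ in range(num_trials):
        start = time.perf_counter()
        run_inference()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": statistics.quantiles(timings_ms, n=20)[-1],
        "mean_ms": statistics.fmean(timings_ms),
    }

# Dummy workload standing in for a video model's forward pass.
print(measure_latency(lambda: sum(i * i for i in range(100_000))))
```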

Hardware could redefine the problem. Most current efficiency work optimizes within the constraints of existing accelerators. But emerging approaches (from analog compute to physically realized gradient computation that sidesteps backpropagation entirely18) could shift the efficiency frontier in ways that make today's software-level optimizations less relevant. If training and inference costs drop by orders of magnitude at the hardware level, the research community's definition of "efficient" will need to be recalibrated around new bottlenecks: data, annotation, or evaluation rather than compute.

A concrete test for the next update: does the efficiency share continue to climb, or does it plateau around 40%? A continued rise would indicate the field is still actively shifting its priorities. A plateau would suggest efficiency has been absorbed into the baseline expectations of the community: a different kind of victory.

Methodological Notes and Caveats

Data Collection and Limitations

Keyword-based categorization: Papers are grouped based on keyword searches of arXiv abstracts and titles using boolean query operators. Terminology shifts can create artificial "birth" events in time series. For example, work that would previously have been labeled "compact video models" may now be labeled "efficient video understanding," creating apparent growth that partly reflects relabeling rather than genuinely new work.

Search methodology: The primary category is identified by requiring explicit mentions of "video understanding" or "audio-visual understanding" in titles or abstracts. The efficiency subset is identified by intersecting the video understanding query with papers containing any of the terms "efficiency", "efficient", "real-time", "realtime", "light-weight", or "lightweight". This ensures the subset covers only papers that explicitly address both video and efficiency.
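
The exact query strings are not reproduced in this article, but an equivalent boolean query can be issued against the public arXiv API. The sketch below is an approximate reconstruction under that assumption, restricted to the cs.CV category for brevity even though the counts above span arXiv CS:

```python
import re
import urllib.parse
import urllib.request

# Approximate reconstruction of the boolean queries described above. The field
# prefixes (ti:, abs:, cat:), quoted phrases, and AND/OR operators follow the
# public arXiv API's query syntax; the exact strings behind this article's
# counts are not reproduced here.
video_terms = (
    '(ti:"video understanding" OR abs:"video understanding" '
    'OR ti:"audio-visual understanding" OR abs:"audio-visual understanding")'
)
efficiency_terms = (
    '(abs:efficiency OR abs:efficient OR abs:"real-time" OR abs:realtime '
    'OR abs:"light-weight" OR abs:lightweight)'
)
# Per-year counts would add a date filter, e.g.
# submittedDate:[202401010000 TO 202412312359].
url = "https://export.arxiv.org/api/query?" + urllib.parse.urlencode({
    "search_query": f"{video_terms} AND {efficiency_terms} AND cat:cs.CV",
    "max_results": 1,  # only the feed header is needed; totalResults has the count
})

with urllib.request.urlopen(url) as response:
    feed = response.read().decode("utf-8")

match = re.search(r"totalResults[^>]*>(\d+)<", feed)
print(match.group(1) if match else "no count found")
```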

arXiv vs. peer review: arXiv captures preprints, not necessarily peer-reviewed publications. Submission behavior varies across research communities and over time. Some groups post every experiment; others post only after conference acceptance. This analysis treats arXiv as a measure of research attention and activity, not necessarily quality or impact.

Potential overlap: A single paper may plausibly contribute to multiple search queries. A work on "efficient video transformers for action recognition" appears in both the "video understanding" and "efficient video understanding" categories. Overlaps are expected and do not invalidate trend analysis, but they do mean absolute counts should not be summed naively.

Geographical and linguistic biases: arXiv predominantly captures English-language research from institutions with strong preprint cultures. Work published primarily in non-English venues or conferences with less arXiv adoption may be underrepresented.

Despite these limitations, arXiv trends remain a valuable signal. They capture the direction and velocity of research activity, even if they do not provide a complete picture. The patterns observed here are robust enough to survive reasonable changes in categorization or sampling methodology.

References

[1] Harvey, C. (2024). "The Evolution of AI Research: Analyzing arXiv Submission Trends." Blog post, November 2024.
[2] arXiv Publication Trends Analysis (2022). "The number of AI papers on arXiv per month grows exponentially." r/singularity discussion, October 2022.
[3] Soomro, K., Zamir, A. R., & Shah, M. (2012). "UCF101: A dataset of 101 human actions classes from videos in the wild." arXiv:1212.0402.
[4] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). "HMDB: A large video database for human motion recognition." ICCV 2011 (pp. 2556–2563).
[5] Carreira, J., & Zisserman, A. (2017). "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset." CVPR 2017.
[6] Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). "SlowFast Networks for Video Recognition." ICCV 2019.
[7] Bertasius, G., Wang, H., & Torresani, L. (2021). "Is Space-Time Attention All You Need for Video Understanding?" ICML 2021 (TimeSformer).
[8] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). "Video Swin Transformer." CVPR 2022.
[9] Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., & Sivic, J. (2019). "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips." ICCV 2019.
[10] Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021). "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval." ICCV 2021.
[11] NeurIPS (2024). "Faster Video Transformers with Run-Length Tokenization." NeurIPS 2024.
[12] Meta AI (2024). "How Meta Deployed Super Resolution at Scale to Transform Video Quality." @Scale Conference, October 2024.
[13] Jin, S. et al. (2024). "Efficient Multimodal Large Language Models: A Survey." arXiv preprint.
[14] Milvus (2024). "What are the challenges in building multimodal AI systems?" AI Quick Reference, December 2024.
[15] Yang, M., Jia, Z., Dai, Z., Guo, S., & Wang, L. (2025). "MobileViCLIP: An Efficient Video-Text Model for Mobile Devices." ICCV 2025.
[16] Luo, W., Zhang, D., Tang, Y., Wu, F., & Zhang, Y. (2025). "EdgeOAR: Real-Time Online Action Recognition on Edge Devices." IEEE Transactions on Mobile Computing, 24(12), 13426–13440.
[17] Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., Chukka, R., Coleman, C., Davis, S., Deng, P., Diamos, G., Duke, J., Fick, D., Gardner, J. S., Hubara, I., … Zhou, Y. (2020). "MLPerf Inference Benchmark." 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (pp. 446–459). IEEE.
[18] Pourcel, G., & Ernoult, M. (2025). "Learning long range dependencies through time reversal symmetry breaking." arXiv preprint arXiv:2506.05259.