GTC Insights | Visual AI Agents for Real Time Video Understanding
Vision AI is a not-new-but-hotter-than-it-used-to-be capability. What changed is the opportunity in physical AI, as physical AI gets increasingly real. As a refresher, vision AI is AI that can look at images or video and understand what is in them, while physical AI does not just understand information – it can sense, reason, and act in the real world through machines like robots, cars, drones, or factory systems. Vision AI might see a box, but physical AI can see the box, decide what to do with it, and then pick it up or move around it.
The Problems Working with Video
Low search accuracy is a longstanding problem with traditional video search, because traditional search is limited to trained attributes. With a single embedding model, we can move from retrieval based on trained attributes to generative search. Traditional approaches might fire alerts when a triggering event happens in a video (think a dog walks by your security camera and you get an alert). This is great, except we all get “alert fatigue”, amirite? At first you look at every alert and then, after a while, you look at nothing. It is difficult for a human to filter the important alerts (a sketchy-looking human passing the camera in the middle of the night) from the unimportant (squirrel!). We want to find the true positives: the things that need to be escalated and the things where we need to act. It’s the classic needle-in-a-haystack problem.
What is Video Search and Summarization (VSS)?
Video search and summarization (VSS) is a set of vision-aware tools that can connect to agents so the agents can understand what they are seeing. Think tools for decomposing, searching, retrieving, critiquing, and summarizing the content of a large body of video. VSS is a hard problem – think about what you would have to do to summarize a video: you’d have to find it, watch it, figure out what’s important, maybe watch it again, pick out some highlights, summarize it, and then check your work. That’s what a VSS system does. And it can do that for a one-hour video in about six minutes. Wow.
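That decompose-then-summarize loop can be sketched as a pipeline skeleton. The chunking logic below is real, but `caption_chunk` and `summarize` are hypothetical stand-ins for VLM/LLM calls – this is my sketch of the shape of the problem, not any actual NVIDIA API:

```python
# Sketch of a VSS-style pipeline: split a long video into chunks,
# caption each chunk with a vision model, then summarize the captions.
# caption_chunk() and summarize() are placeholders, not a real API.

def split_into_chunks(duration_s, chunk_s=60):
    """Decompose a video timeline into fixed-length segments."""
    return [(start, min(start + chunk_s, duration_s))
            for start in range(0, duration_s, chunk_s)]

def caption_chunk(start, end):
    # Stand-in for a VLM call describing what happens in [start, end).
    return f"events from {start}s to {end}s"

def summarize(captions):
    # Stand-in for an LLM call that condenses per-chunk captions.
    return f"{len(captions)} segments summarized"

def vss_summarize(duration_s):
    chunks = split_into_chunks(duration_s)
    captions = [caption_chunk(s, e) for s, e in chunks]
    return summarize(captions)
```

For a one-hour (3600-second) video, this decomposes into sixty one-minute segments before anything gets summarized, which is why the chunks can be processed in parallel.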
We will see this capability enter mortgage through policy and procedure generation, job redesign, and opportunity analysis.
The Value Proposition
The KPIs here were pretty strong: 80% quicker onboarding for a training company, an 80% reduction in incident-reporting fatigue at a manufacturing company, and a 95% cost reduction for a training company. Let’s take an example: imagine you want to find all the places in a soccer game where a particular person scored a goal. You know how you would do that and can estimate how long it would take. If the game is an hour, it would take… at least an hour.
This will be a little like the Ronco rotisserie - set it and forget it. Get the pipeline going and continuously feed it data. Wake up the next day and see what it found. Amazing.
Now imagine that you could process the video in six minutes and then ask natural language questions and get verified evidence of every time that person scored a goal. All in about six minutes to process the video, and two seconds to answer the questions (with evidence!). That’s literally a 90% performance gain, conservatively.
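The arithmetic behind that claim is easy to check:

```python
# 60 minutes of manual review vs ~6 minutes of automated processing
# is a 90% reduction in time, before counting the seconds-fast Q&A.
manual_min = 60
vss_min = 6
reduction = (manual_min - vss_min) / manual_min  # 0.90
```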
The Technical Solution Architecture
It won’t surprise you by now that “there’s an NVIDIA solution for that” and it’s their VSS blueprint. NVIDIA VSS turns video from passive footage into an AI-readable system you can search, summarize, and act on. It is interesting because it treats video as a first-class enterprise data source, combining computer vision, VLMs, LLMs, and RAG so organizations can move from watching footage to querying and operationalizing it.
I know it looks a little overwhelming, but it really isn't. Just read the icons top to bottom, left to right.
GTC Insights | Open Source and Agentic Development
I vividly remember last year having a light bulb moment when Jensen provided his definition of an AI agent. An AI agent is an AI that can perceive, plan, reason, and act. For some reason, it took me hearing it from Jensen in context to really get it. I attended multiple sessions at GTC this year on a family of open source models designed specifically for agentic development, which deepened my understanding of accelerated computing, what it means to put AI “at the edge”, how a generative model is trained, and what agentic development really is.
Approximating Human Agency
Perceive, plan, reason, and act. As humans, we do these things intuitively and without thinking. We just know how to do it. Our ability to perceive our environment relies on vision, hearing, our sense of touch, and our ability to translate this sensory information into meaning. Planning requires us to have executive functioning. This higher order mental ability allows us to make sense of data and organize it into a logical framework. As we reason through a problem, we consider alternatives and think back on prior experiences. And when we act, we intersect our inner world with the physical world. We use tools. We apply skills we have honed over time and based on our experience.
This intuitive set of human processes relies on multiple complicated systems. It takes a system of systems to make this possible. This is really the essence of agentic development – how to create a system of systems that proactively work together to meet our objectives without being explicitly told how.
I want you to really think about how you do what you do and then put that in the context of how agents work.
Let's look at an example with a deep research agentic system. NVIDIA provides solution blueprints, models to implement, and packaged AI inference microservices. (They call these NIM – NVIDIA Inference Microservices). You may want to ask your vendor partners if they are using these blueprints; they really are a wealth of actionable information and tools. They also memorialize industry best practices.
This is a zoomed out view of the "AIQ" blueprint from NVIDIA. It's an agentic system for deep research.
To put this in context, think about a set of helpdesk deep research agents in a call center. There could be an intake agent, a triage agent, an escalation agent, and a prior cases agent. All of these are deep research agents. These four agents would each be designed with skills and tools. They are small and contained to maximize their effectiveness and prevent the agent from “wandering off”.
For an agent to be effective, it must have authority, which makes the agent more difficult to secure. This is called the agent paradox. The default pattern in good agent design is to DENY permissions, and ALLOW only as a thoughtful, secured choice.
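A minimal sketch of that deny-by-default pattern, assuming a simple allowlist keyed by agent name. The agent and tool names are made up for illustration and echo the call-center example above:

```python
# Deny-by-default tool permissions: everything is blocked unless an
# agent was explicitly granted a tool. Names here are hypothetical.

ALLOWED_TOOLS = {
    "intake_agent": {"read_ticket", "create_case"},
    "triage_agent": {"read_case", "assign_priority"},
}

def authorize(agent, tool):
    """Return True only if this agent was explicitly granted this tool.
    Unknown agents and unlisted tools fall through to DENY."""
    return tool in ALLOWED_TOOLS.get(agent, set())
```

The design choice is that absence means denial: an agent that is not in the table, or a tool that is not in its set, gets nothing, so every grant is a deliberate, reviewable decision.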
Human Analogies for Open Source, Accelerated Computing, and the Edge
You can think of a human as a closed source model. We are opaque by default. I perceive a set of inputs and take an action as a result. Even if I explain how I did it, the true technical mechanisms are largely closed to other humans. Even a neurologist only has a very selective view into the complex organic relationships that make human agency possible.
It only took four iterations to get NotebookLM to give me this version. It came out OK – not perfect, but it will do.
But what if our inner world was transparent? What if we could break ourselves down and inspect the way we do things? What if we could look at all the experiences we had over a lifetime that caused us to make a particular decision? What if we could double click on each other? Think of that as open source. If we could understand the pieces and parts, we could identify weaknesses, improve results, remove bottlenecks. We could make systems work more effectively, faster. Think of that as accelerated computing. And if we could do that not just with internal processes, but within the context of physical reality, we could also impact the world. You can think of that as “the edge” – that place in human reality that intersects with physical reality. (I'm definitely not saying I would want open source people, by the way. Just trying to create an analogy).
Open Source Model Development
I had not realized that NVIDIA was in the model-making business. NVIDIA Nemotron is a family of open-source foundation models and datasets designed to build and deploy agentic AI systems. Now that we understand how open source allows us to better understand and optimize how things work, we can start to see why NVIDIA might want to build models and expose them to others.
Building models allows them to anticipate what their clients will need by experiencing those needs firsthand. Designing, pre-training, training, post-training, deploying, and supporting a family of models requires a kind of learning that can only be gained by doing. It required NVIDIA to become experts at what is needed to design the requisite AI infrastructure and to accelerate the ecosystem around that infrastructure. Knowing what to build and having a perspective on what needs to be built. Frankly, this is one of the many reasons we build products at PhoenixTeam, and why I spend weekends and early hours tinkering.
Making models open source invites others into the process who likely see things differently and can make different contributions to the process. The OpenClaw moment was a shining example of the impact a small, community project can have. It showed us how one small project can create a robust and diverse ecosystem (that requires NVIDIA solutions, of course). Hence the rationale for making open source models.
What is the significance of data in this equation, and of the fact that NVIDIA released the data? They found that 75% of the compute required during the process is actually spent on synthetic data generation and running experiments. Remember, we are running out of data, and data scarcity for training is a big deal.
Base model development (pretraining) is all about KNOWLEDGE and post-training is all about BEHAVIOR and ensuring the model behaves in the way "we think it should".
There are four basic stages to training a generative model, and Nemotron as a family of models is really an approach. In addition to the models, NVIDIA has also released the training and fine-tuning data and the techniques. This goes with the model family. The model family includes everything you need to design and deploy agents, most of which can run locally if you wish (except the largest model in the family, which is 253B parameters and too big to run on a DGX Spark). But it also has smaller, specialized models for reasoning, speech, and RAG. I think my new DGX Spark comes Nemotron-ready.
Accelerated Computing
Accelerated computing is this idea that you can take certain parts of a system workload and offload them to accelerated processing that uses a different type of processor (the GPU – graphics processing unit). Accelerated computing speeds up the right workloads enormously, but if you apply it to the wrong things, it can slow most things down. That is why Jensen talks so much about identifying the parts of the stack that are truly “accelerable” instead of assuming every workload belongs on a GPU. Wasted effort makes for dumber AI. Think of accelerated computing as steps in a chain, and the goal is to accelerate as many of those steps as possible. So, the data set used for training is a link in the chain of acceleration.
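There is a classic way to make the "truly accelerable" point precise: Amdahl's law, which says overall speedup is capped by the fraction of work you cannot accelerate. A minimal sketch:

```python
def amdahl_speedup(parallel_fraction, accel_factor):
    """Overall speedup when only `parallel_fraction` of the workload
    is accelerated by `accel_factor` (Amdahl's law). The serial
    remainder caps the gain no matter how fast the accelerator is."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / accel_factor)
```

For example, a 100x accelerator applied to only half the workload yields under 2x overall, which is exactly why identifying the accelerable links in the chain matters more than raw GPU speed.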
Accelerated computing, applied to the wrong workloads, slows most things down. This really resonates with me. The solution to every problem is not a large language model. Every problem doesn’t need an agent. Accelerated computing has a point of view, focus, and specialization. So the idea is to identify which computations will be important in the future and can/should be accelerated. Think beyond the GPU and stick with first principles. Preach!
You can design for acceleration, which is what they did with Nemotron. It was one of the up front design principles. Nemotron was also designed for mixture of experts (MOE) from the beginning and designed to deal with “numerics” and “sparsity”. To be designed for numerics means NVIDIA designed Nemotron so it works well with the messy realities of GPU math, not just idealized floating-point math on paper. To be designed for sparsity means the models are structured so that only the needed parts fire, and the system is built to make that selective firing efficient instead of wasteful.
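A toy sketch of the sparsity idea: a router picks the top-k experts for each token and the rest stay idle, so most of the model's weights never touch that token. The scoring and expert functions here are illustrative stand-ins, not Nemotron internals:

```python
# Sparse MoE routing in miniature: only the top-k experts "fire" for a
# given token. Router scores and experts are toy placeholders.

def top_k_experts(router_scores, k=2):
    """Pick the k highest-scoring experts; the rest stay idle."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

def moe_forward(token, router_scores, experts, k=2):
    """Run the token through only the selected experts and average."""
    active = top_k_experts(router_scores, k)
    outputs = [experts[i](token) for i in active]
    return sum(outputs) / len(outputs)
```

The efficiency only materializes if the hardware and kernels are built for that selective firing, which is the "designed for sparsity" point.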
In a model that is more closed, you can really only alter the behavior through prompt engineering and model fine-tuning (and whatever other capabilities the model provider chooses to allow you). There can be very inefficient parts of the model, its processes, or its data that simply cannot be altered. Accelerated computing is about making the chain go faster; if parts of the chain are hidden, then you lose that opportunity for acceleration.
Open Source at the Edge
I never really understood what the edge even was until last year. And I really didn't get the so-called "internet of things" (IoT). The edge is where digital meets our world. A scanner in a grocery store. A sensor on a shelf in a warehouse. To put generative capabilities on "the edge" is to deploy models that are small enough to be practical and also smart enough to be worth it. That's a serious engineering problem.
This is what makes my new DGX Spark so exciting. Some of you will remember my journey to deploy a 70 billion parameter model on my massive Mac Pro. I tried to run a bigger one, but my machine did not have sufficient compute. DGX Spark is a small desktop AI computer where you can run, test, and fine-tune large AI models locally. It’s basically a very powerful “AI box for your desk,” built specifically for AI workloads rather than general-purpose computing. And the best part is you can air gap it. I picked one up at the NVIDIA swag store for an amount I will only tell you if you ask me directly. I can say it's fully five times less expensive than my Mac Pro.
I call him Sparky. He will be a safe, secure place to run my OpenClaw agents and take my tinkering to the next level.
I also learned about something I want to look into for our commercial product, which is Llama Nemotron Nano VL. I had to do a bit of research on this, but they described it as a “tiny open VLM that rivals closed models in doc extraction” (VLM is vision language model). I want to try this one out for sure.
Open source at the edge means more opportunity to accelerate. Given that the edge requires seriously compact and optimized solutions, you can see that being able to engineer the inner parts of the model could be the difference needed to make edge computing feasible.
Open Source as a Component of Mixture of Experts (MOE) Architecture
Model selection is not an “or” but an “and”. It’s not open or closed, this model or the other, it’s open AND closed, this model AND these other models. We are creating or at least using systems of models. When you use any of the (closed) foundation models, you are really using multiple models to do different things – that complexity, however, is abstracted from that interaction. That is part of the value of working with a closed foundation model.
However, it’s not one size fits all. More and more (I’d say approaching table stakes), we are seeing “mixture of experts (MOE)” approaches where we use different models to do different things in the same overall system. Maybe a VLM for document analysis, Claude for general knowledge, and a domain-specific model to solve a particular type of problem. We are seeing more and more usefulness of this pattern when we have specialized AI needs (i.e., a general model isn’t the best fit) or when we are looking for efficiency (again, a general model is not terrifically efficient).
There are also MOE approaches where you use different models for different problems, as opposed to the pattern where multiple models are used to come up with a composite score.
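A minimal sketch of that different-models-for-different-problems pattern, assuming a hypothetical registry of model callables keyed by task type:

```python
# Simple model router: dispatch each request to the model registered
# for its task type, falling back to a general model. The registry
# entries are hypothetical placeholders, not real model endpoints.

def route(task_type, request, registry):
    """Pick the specialist for this task, or the generalist otherwise."""
    model = registry.get(task_type, registry["general"])
    return model(request)

registry = {
    "document": lambda r: f"vlm:{r}",      # e.g. a VLM for doc analysis
    "domain": lambda r: f"domain:{r}",     # a domain-specific model
    "general": lambda r: f"general:{r}",   # general-knowledge fallback
}
```

The fallback entry is the important design choice: every request gets an answer, but specialized work goes to the cheaper, better-fitting specialist.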
About the data – the most valuable data in an organization ALWAYS has the most restrictions. And the argument for open here is that we need more diversity to optimize for unique and specialized data and domain needs. We simply may not trust a closed model, whose inner workings we do not understand and whose training data we have no visibility into, to handle our most valuable data.
Nemotron Performance Engineering
For the more technical of you readers, Nemotron is a pretty solid model family according to the benchmarks. Here is how that was achieved:
Hybrid transformer architecture – A hybrid transformer architecture means the AI model uses more than one kind of “thinking part” instead of just the standard transformer setup. In Nemotron, that helps it handle long inputs and run more efficiently while still staying strong at reasoning and language tasks.
Multi-token prediction (take advantage of free tokens) – multi-token prediction helps the model look a few words ahead, and when those guesses check out, you get more output with less waiting. The tokens are not literally free in money terms; they are “free” in the sense of extra tokens for almost the same decoding work.
1M context length – this means the model can keep about 1 million tokens of text in its working memory at one time (~750,000 words). The model can reason across very long documents, large codebases, long chat histories, or many retrieved documents at once instead of chopping everything into lots of smaller pieces.
Latent MOE – latent means a smaller internal representation of the token, not the full version the model normally uses. In latent MOE, the model compresses the token into the latent space, does the work, then projects it back up, saving bandwidth and compute.
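The multi-token-prediction intuition above can be illustrated with a toy acceptance loop: a cheap draft proposes several tokens and the verifier keeps the matching prefix, so every matched token is near-free output. This is an analogy for the idea, not the actual decoder internals:

```python
# Toy illustration of multi-token prediction: draft ahead, then keep
# only the prefix the verifier agrees with. Real decoders do this with
# logits, not strings; this just shows why matched guesses are "free".

def accept_prefix(draft, target):
    """Keep draft tokens while they match what the verifier would emit."""
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted.append(d)
    return accepted
```

If the draft guesses three tokens and two check out, you emitted two tokens for roughly one verification pass instead of two full decoding steps.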
It’s fastest on PinchBench and has frontier level accuracy. PinchBench is a benchmark for testing how well language models perform as OpenClaw agents. Seems worth trying out for me. DGX Spark here I come.
You’ve heard me talk before about how coming to NVIDIA especially is like stepping into the future. The robots, the level of technical acumen, the global perspective, and, of course, the tech. This year I found myself, well, depressed. AI is eating everything, absolutely everything. Most products out there today will be eaten by either (1) a foundation model provider, (2) a fantastically well-funded, AI-native application provider, or (3) a mega-consultancy. I found myself feeling very excluded. I wondered about how we fit in and what version of the future includes us. It was a little bit defeating. But we power on and take this as a moment to adapt our original pivot and think very carefully about what our niche is in the AI future.
Everything is about tokens to Jensen. It was last year and it was this year again. Tokens, as you know if you’ve taken one of my classes, are numerical representations of words or parts of words. They are the language of generative AI. The machine of AI as we know it today runs on tokens and ultimately, Jensen’s perspective is that every company’s success in the AI future is based on how they optimize the value associated with these tokens. If we traverse the AI stack, it starts at the bottom with power and ends at the top with value. And tokens are the digital fuel.
For context, the “AI stack” is a five-layer cake with land, power, and shell at the bottom. This is the layer of energy and the shelter for energy. The next layer is compute silicon (microchips), which turns electricity into work. Next is infrastructure (platforms); in the NVIDIA context you can think software and libraries (CUDA), integrated systems, and AI factories. Only then do we get to models, the actual embodied algorithms, compute, and systems that generate insights. And finally, the application layer – this is the layer where value is created. The model and application layers are the ones that are eating everything.
Not actually from the keynote but I wanted to provide some context for readers on how NVIDIA positions itself so you can understand the keynote better.
NVIDIA describes itself as a platform company, so let’s talk about platforms for a minute. At the software and libraries level, we have CUDA, which stands for Compute Unified Device Architecture and is a parallel computing platform and programming model developed by NVIDIA that allows software to use NVIDIA graphical processing units (GPUs) for general-purpose processing. CUDA was that original NVIDIA breakthrough 25 years ago that has powered accelerated computing as we know it today. Then we have systems, those integrated machines and infrastructures that NVIDIA sells (that are effectively powering all of AI today globally). And then AI factories, a newer Jensen metaphor, these are large scale infrastructure solutions that output, yes, tokens.
Of course, it wouldn’t be GTC without a discussion of the flywheel powered by CUDA. CUDA has a massive installed base attracting developers who create breakthroughs that power the ecosystem. And the flywheel continues.
It all started with video games and GeForce, invented 25 years ago. Jensen’s joke was that “your parents paid for you to become an NVIDIA customer”, and he’s not wrong. A graphics card every year was an investment in the future. From pixel shader to the big bang of AI. From here, he spent a lot of time talking about structured data as the foundation of trustworthy AI, the ground truth.
Of note, the largest number of participants at GTC this year came from the financial services sector. NVIDIA identifies nine industry verticals – automotive, financial services, healthcare and life sciences, industrial, media and entertainment, quantum, retail and supply chain, robotics, and telco. This to me is a very strong (additional) signal that AI is coming for mortgage. I mostly see this in how we compete in the services space. I routinely find myself competing with the likes of McKinsey and Boston Consulting Group. We also see that skills dropped by foundation model providers are wiping out whole products. There is a tail to this.
In terms of where we are, Jensen looked back on 2023 (which was really 2022) as starting it all with OpenAI and ChatGPT. Then in 2024 reasoning AI came out, also from OpenAI, and then Claude Code in 2025. So the journey is AI that could generate, to AI that could perceive, and finally AI that could do work. This brings us to the inflection point of inference. The inference inflection has arrived, and it was made that much more significant by the OpenClaw moment we are having now in 2026 (which really started in November 2025, when OpenClaw dropped).
Jensen introduced a new scaling law, so we do have to talk briefly about scaling laws. Let me explain them first with the one thing you have to know: More Compute = Better AI. That’s it.
Pre-training scaling law – more data, more parameters, and more compute during the initial training phase results in a more intelligent “base” model.
Post-training scaling law – more compute applied during reinforcement learning and fine-tuning allows models to perfect specific skills and improve accuracy.
Test-time scaling (inference-time scaling) law – also known as “long thinking”, allowing a model more compute at the moment it answers (enabling it to “reason” through multiple steps before outputting a result) significantly boosts performance on complex tasks.
Agentic scaling law – the newest law, compute demand scales when AIs interact not just with humans, but with other AIs and tools. These systems handle long contexts and perform multi-step, autonomous workflows, requiring high-speed, low-latency inference at a massive scale.
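One concrete flavor of the test-time scaling law above is self-consistency: sample several answers and take a majority vote, spending more inference compute to buy accuracy. A sketch, with `sample_answer` as a hypothetical model call:

```python
# Test-time scaling in miniature: more samples at answer time means
# more compute spent, and (for many tasks) a more reliable answer.
from collections import Counter

def majority_answer(sample_answer, n_samples=5):
    """Call the model n_samples times and return the most common
    final answer. n_samples is the test-time compute knob."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Turning `n_samples` up is literally the "more compute at the moment it answers" dial: each extra sample costs a full inference pass, which is why this law drives inference demand.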
And then there were all the chip and infrastructure announcements. Yawn. Yes, they are awe inspiring and I’m sure they are quite important, but this is always where you lose me during the keynote. I can only hang in for so long, and the chips are where I reach my full AI saturation.
Notwithstanding significant, valid criticism of NVIDIA and Jensen in the geopolitical sphere, I am still a (somewhat guarded) Jensen superfan. I am always in awe of Jensen’s ability and willingness to pivot. He described the decision to rearchitect Hopper, right while Hopper was in its prime adoption. But I guess when you have almost five trillion in market cap, that’s a much easier decision than when you are scratching to make a living.
Process and Technical Hurdles to Achieve the Mortgage AI Future
My take on 2026 is that it will be a race to achieve what I call the mortgage AI "FOMO" strategy as fast as possible. FOMO, of course, is fear of missing out. It's a perfectly reasonable strategy since achieving FOMO faster than everyone else is a differentiating edge. FOMO has things like voice agents, knowledge bots, generative AI for data gathering, and a bunch of other things (reach out if you want the list).
The AI Future
We can contrast FOMO with what I see as the AI future, a radically more effective mortgage organization that embraces controlled autonomy. The AI future has things like seamless compliance change integration, a consumer data flywheel generating new opportunities in addition to uncovering service quality limitations, and a move from reactive lines of defense to proactive lines of offense. (Again reach out if you want more details).
The gap between FOMO and the AI future is large.
The long term differentiating value won't come from FOMO. It will come from a total reimagination of how mortgage works. We are very focused today on task level automation. Removing friction from the current process. Improving the value of KPIs we already have. That is not the same as looking at mortgage end-to-end to find and forge the new way of working.
Survival, Differentiation, or Domination
Kind of stark I suppose, but we really need to think existentially about our business. The questions I ask of my clients are:
From a strategy perspective, do you want to survive, differentiate, or dominate?
Is your business as it exists right now relevant in three years time?
To achieve your strategy, what does your business need to look like in three years time?
What are the jobs of your AI future and how are they different from today?
What actions do you need to take now to ensure you achieve your AI future?
Gartner poses the first question in terms of "defend, extend, and upend", which is a really helpful framing although a little too watered down. We can blend these ideas and express them in terms of spend - maybe we want to survive for now, differentiate in the next two years, and completely crush everything ultimately. That would say we need to "catch up" first and lean into our differentiation points, while also setting in motion at least one major reimagination. No one has unlimited time and unlimited budget; we have to prioritize.
The three year horizon is a useful thought exercise because it gives the illusion of time, thereby removing some fear and enabling a bit of clear thinking. The reality is that no one has three years to wait for AI.
Differentiation Through Reimagination
This process reimagination is taking different forms, and will be unique and proprietary to each company (which means I won't write about it). It is a perfect example of the truly unique human value we can offer as experienced operators and experts in our crafts. Yes we can use genAI to expedite our process, come up with some idea remixes - but the real innovation will come from operators, not machines.
The Jagged Frontier
In addition to process reimagination, there are real technical hurdles to overcome. Ethan Mollick, of course, calls this the jagged frontier. My take on this is playing out in my lab this week. I'm working on four different challenges at the moment. I'll talk more about the use case I'm working on as I get it figured out.
Right Scoping
Agent mania has officially taken hold, and the insanity it is bringing has reached a fever pitch. Let's step back a moment. So you can build an agent. So what. You can build an agent in 45 minutes if you want to; we do it in our classes all the time. Not everything needs an agent, which is the first (and perhaps the most important) thing. Once we pass that gate, how we scope agents matters a lot. Scope them narrow and scope them atomic.
Agent Orchestration
I think (one of the) major technical hurdles of 2026 is agent orchestration (although some say this is not needed if agents are scoped and designed right). How do you reliably create networks of agents that work together to achieve an objective or related set of objectives? What is the contract between the agents, and how is it memorialized, enforced, and made transparent? I'm looking at a few things here - dust, crew, and gastown. I'll report back with what I discover. Maybe agent orchestration doesn't even matter. (Gasp!)
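One way to memorialize a contract between agents is a typed message schema that every handoff must validate against. The field names and intents below are illustrative, not from dust, crew, or gastown:

```python
# A contract between agents, memorialized as a frozen schema and
# enforced at handoff time. Field names and intents are hypothetical.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentMessage:
    sender: str
    recipient: str
    intent: str            # e.g. "escalate", "lookup_prior_cases"
    payload: dict = field(default_factory=dict)

def validate_handoff(msg, allowed_intents):
    """Enforce the contract: reject intents the recipient never agreed to."""
    if msg.intent not in allowed_intents.get(msg.recipient, set()):
        raise ValueError(f"{msg.recipient} does not accept '{msg.intent}'")
    return msg
```

Freezing the dataclass and raising on unknown intents makes the contract both memorialized (it is code you can read) and enforced (a bad handoff fails loudly instead of wandering off).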
Context Management
Oh context engineering... maybe the most important consideration. How to shove just the right amount of context into the context window? How to compress a long and circuitous history into a meaningful collection of right-sized bites? This is really the trick, I think, to solving the memory problem. We humans have such an interesting and not well understood way of remembering. How does our brain differentiate between important and unimportant? Why do you remember what you remember and forget the things you forget? How do we get a machine to do that reliably?
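A crude sketch of one context-budgeting tactic: keep the newest messages that fit a token budget and fold the overflow into a single summary slot. `count_tokens` here is a toy proxy for a real tokenizer, and the summary is a stub where a summarizer call would go:

```python
# Minimal context-budget packer: newest-first, with older overflow
# collapsed into one summary placeholder. Toy stand-ins throughout.

def count_tokens(text):
    return len(text.split())  # crude proxy for a real tokenizer

def pack_context(history, budget):
    """Keep recent messages under the token budget; older history
    becomes a single summary stub instead of being silently dropped."""
    kept, used = [], 0
    for msg in reversed(history):
        cost = count_tokens(msg)
        if used + cost > budget:
            kept.append(f"[summary of {len(history) - len(kept)} older messages]")
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

The interesting design question is exactly the one in the paragraph above: the budget decides *how much* survives, but deciding *what* is important enough to survive is the unsolved part.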
Systematic Evals
This is an old new problem. Congratulations to our colleagues at Promptfoo on their transition to OpenAI, I saw that coming years ago, truthfully. I only thought it would happen faster. Ian Webster and Michael D'Angelo deserve all the awesomeness that I hope will come to them for being out in front on this. I hope they will remain as accessible, approachable, and passionate as they have been to date. In any event, doubling down on agent evals is a thing I continue to explore using the open source promptfoo framework.
Neurosymbolism
And last but not least, the intersection of deterministic and probabilistic - neurosymbolic AI. This is the idea that we use reasoning capabilities when it makes sense, and code-based solutions when it doesn't. Yes we have an amazing magical hammer (large language models), but not every problem is a nail amirite? Code-based solutions are still super useful. Maybe you need 100% accuracy 100% of the time. Guess what - probabilistic technology probably isn't the right solution. So I'm working this too.
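A miniature of that neurosymbolic split: exact, rule-bound questions go to deterministic code, fuzzy questions go to a model. The `llm_answer` parameter is a hypothetical placeholder for a probabilistic model call, and the fee calculation is just an example of "100% accuracy, 100% of the time" work:

```python
# Neurosymbolic routing sketch: deterministic code for exact math,
# a model (hypothetical llm_answer callable) for everything else.

def exact_fee(principal, rate):
    """Deterministic calculation - no model, no sampling, no drift."""
    return round(principal * rate, 2)

def answer(question, llm_answer):
    """Route by question kind: calculation -> code, otherwise -> model."""
    if question["kind"] == "calculation":
        return exact_fee(question["principal"], question["rate"])
    return llm_answer(question["text"])
```

The point is the dispatch, not the fee formula: when the requirement is exactness, the probabilistic hammer never even gets picked up.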
Man, now I'm tired. I wonder if a week is enough to put all these things together? Might need two. Anyway, hope to hear from some of you with your thoughts. We will be at NVIDIA GTC next week, come find me if you are too.