{"posts":[{"id":20247,"title":"Large Geospatial Models, OVRMaps and the Future of Physical AI","excerpt":"What Are Physical AI and Large Geospatial Models? Since the launch of ChatGPT in November 2022, we have witnessed an explosion of interest in AI, and specifically in Large Language Models (LLMs), given their capability to understand and manipulate human language by ingesting internet-scale amounts of text. The capabilities of LLMs cannot be overstated, and [&hellip;]","content":"<h4><b>What Are Physical AI and Large Geospatial Models?<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Since the launch of ChatGPT in November 2022, we have witnessed an explosion of interest in AI, and specifically in Large Language Models (LLMs), given their capability to understand and manipulate human language by ingesting internet-scale amounts of text. The capabilities of LLMs cannot be overstated, and in the near future, they will automate most white-collar jobs. But a new frontier is emerging: <\/span><b>Physical AI<\/b><span style=\"font-weight: 400;\">. 
The vision behind this new paradigm is to go beyond language, empowering machines and robots to interact with and complete tasks in the physical world, fostering a new industrial revolution and an era of unprecedented abundance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Jensen Huang, NVIDIA\u2019s CEO, described this future with a bold statement during his March 2025 keynote: \u201cEverything that moves will be autonomous.\u201d This paints Physical AI as the next logical evolution of AI.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-20248 size-large\" src=\"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.29.37-1024x653.png\" alt=\"\" width=\"1024\" height=\"653\" srcset=\"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.29.37-1024x653.png 1024w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.29.37-300x191.png 300w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.29.37-768x490.png 768w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.29.37-1536x980.png 1536w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.29.37.png 1866w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Empowering machines and robots to interact with the physical world requires a broad range of capabilities, starting with an understanding of the 3D space around them. We tend to underestimate how valuable and complex this capability is: humans downplay the importance of everything physical and overestimate everything that has to do with language. This bias is reflected and reinforced in society, where intellectual (white-collar) jobs are commonly valued more highly and command higher status than physical (blue-collar) jobs. 
Yet, if we look at evolution and the structure of our brain, the picture is very different. The area of our brain that controls language is small and evolutionarily recent. Considering this, it&#8217;s not surprising that machines can handle language so well. Language, unlike the physical world, is a human construct; it is purely generative and doesn&#8217;t exist in nature.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But vision is a totally different story. It has been shaped by millions of years of evolution. Every form of life that developed an eye had to solve the same hard problem: converting 2D images into a 3D predictive reconstruction of the physical space around them in order to survive. This is what enables animals and humans to thrive in the physical world and effectively interact with it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But it does not end here: the understanding of the 3D structure of our world is also what powers an important part of our reasoning capabilities. Creativity\u2014across design, movie production, architecture, industrial design, and even science\u2014is inherently visual, perceptual, and spatial. 
When Francis Crick and James Watson co-discovered the beautiful DNA double helix, they didn&#8217;t just reason through language; they were able to infer the DNA molecule&#8217;s 3D structure from X-ray diffraction patterns.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-20249 size-large\" src=\"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.31.33-1024x653.png\" alt=\"\" width=\"1024\" height=\"653\" srcset=\"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.31.33-1024x653.png 1024w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.31.33-300x191.png 300w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.31.33-768x490.png 768w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.31.33-1536x980.png 1536w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-15.31.33.png 1866w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As Fei-Fei Li puts it, language is a \u201clossy way to capture the 3D physical world.\u201d If you are blindfolded in a room, a verbal description of the room is not enough to carry out tasks in it. With sight, however, the brain immediately reconstructs the 3D space in the mind\u2019s eye, enabling efficient manipulation and interaction. The entire evolutionary history of animals is built upon perceptual and embodied intelligence, which humans further leverage to construct and change the world. 
This is what <\/span><b>Spatial Intelligence<\/b><span style=\"font-weight: 400;\"> is about, and <\/span><b>Large Geospatial Models (LGMs)<\/b><span style=\"font-weight: 400;\"> tackle this fundamental capability: enabling machines to understand the 3D structure of the world.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Just like our brain, an LGM can take a 2D view and create a full 3D representation, including what&#8217;s unseen. This means one can manipulate, move, measure, and stack objects within the computer, which enables applications in robotics, architecture, design, and gaming, and even makes it possible to fill in missing data. Machine perception\u2014the ability to reconstruct 3D representations from 2D inputs\u2014is inherently a generative task: it&#8217;s about filling the informational gaps that arise when mapping 2D data into 3D space. To achieve this, LGMs need to learn priors about the 3D structure of the world. These models are trained using multiview images of objects and environments, and <\/span><b>OVER has built one of the world&#8217;s largest datasets<\/b><span style=\"font-weight: 400;\"> of such pictures, with <\/span><b>over 130,000 locations mapped<\/b><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The primary application of LGMs is in robotics, which encompasses all embodied machines. These machines need to understand, and be trained in, 3D space to perform independent or collaborative tasks. The problem is fundamentally 3D because physics happens in 3D, and interaction happens in 3D. Navigating behind objects and composing the world, physically or digitally, both require 3D understanding. While humans can reconstruct 3D from 2D video, a computer program or robot needs explicit 3D information to perform spatial tasks like measuring distance or grabbing objects. LGMs bridge this gap.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But the applications of LGMs do not stop here. 
If you can fill the gaps between 2D and 3D space, you can also generate infinite, coherent worlds. Early demonstrations of this capability can be seen in models trained by companies such as <\/span><a href=\"https:\/\/odyssey.world\/introducing-interactive-video\"><span style=\"font-weight: 400;\">Odyssey<\/span><\/a><span style=\"font-weight: 400;\">. Applications in gaming, the metaverse, and simulation are virtually endless.<\/span><\/p>\n<div style=\"width: 640px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-20247-2\" width=\"640\" height=\"360\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/odyssey.world\/introducing-interactive-video\/world-simulator.mp4?_=2\" \/><a href=\"https:\/\/odyssey.world\/introducing-interactive-video\/world-simulator.mp4\">https:\/\/odyssey.world\/introducing-interactive-video\/world-simulator.mp4<\/a><\/video><\/div>\n<h4><b>The Role of OVRMaps in Training LGMs<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Let\u2019s go back to the high-level task accomplished by LGMs: inferring 3D from 2D inputs. How can such a capability be achieved? LLMs are trained to predict words that are hidden from them, such as the next word in a sentence. In the same way, LGMs are trained on multiview images of objects and spaces to learn priors about the 3D structure of the world. From a frontal view of a table, the model needs to learn to predict what the back of the table looks like, even without a view of it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The raw data used to create OVRMaps\u2014multiview images of locations and objects\u2014is exactly the kind of data needed to train LGMs, and our dataset is massive. Ilya Sutskever, co-founder of OpenAI, once famously said, \u201cModels just want to learn.\u201d The implication is that if you give them enough data, they will learn its underlying structure, unlocking emergent capabilities that once seemed unthinkable. 
This has been empirically demonstrated with language and LLMs. Leveraging the OVRMaps dataset to train LGMs will allow the field to move far beyond current capabilities.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The chart below compares the size of the datasets used to train DUSt3R, a state-of-the-art LGM, with the size of OVER\u2019s dataset.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-20260 size-large\" src=\"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-16.55.46-1024x371.png\" alt=\"\" width=\"1024\" height=\"371\" srcset=\"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-16.55.46-1024x371.png 1024w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-16.55.46-300x109.png 300w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-16.55.46-768x278.png 768w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-16.55.46-1536x556.png 1536w, https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-05-at-16.55.46.png 1774w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Join our mapping community to become part of the Physical AI revolution.<\/p>\n<p>Start mapping: <a href=\"https:\/\/link.ovr.ai\/download\">https:\/\/link.ovr.ai\/download<\/a><\/p>\n","permalink":"large-geospatial-models","date":"2025-08-05 
13:37:19","image_small":"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/VISUAL-LGM-07-1-150x150.jpg","image_medium":"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/VISUAL-LGM-07-1-300x169.jpg","image_large":"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/VISUAL-LGM-07-1-1024x576.jpg","image_full":"https:\/\/blog.ovr.ai\/wp-content\/uploads\/2025\/08\/VISUAL-LGM-07-1.jpg","single_url":"https:\/\/blog.ovr.ai\/large-geospatial-models\/","translations":{"en":{"single_url":"https:\/\/blog.ovr.ai\/large-geospatial-models\/","permalink":"large-geospatial-models"},"fr":{"single_url":"https:\/\/blog.ovr.ai\/large-geospatial-models\/","permalink":"large-geospatial-models"},"es":{"single_url":"https:\/\/blog.ovr.ai\/large-geospatial-models\/","permalink":"large-geospatial-models"},"tr":{"single_url":"https:\/\/blog.ovr.ai\/large-geospatial-models\/","permalink":"large-geospatial-models"},"zh":{"single_url":"https:\/\/blog.ovr.ai\/large-geospatial-models\/","permalink":"large-geospatial-models"}}}]}