Video: December Town Hall | Duration: 3437s | Summary: December Town Hall | Chapters: Community Highlights 2025 (0s), DataHub's Major Upgrades (331.945s), 2026 Product Roadmap (743.05s), DataHub's Context Vision (1166.395s), Building Context Platforms (1920.495s), Integrating DataHub Tools (2872.08s), Conclusion and Roadmap (3181.99s)
Transcript for "December Town Hall": So for those of you I've not met, I'm Maggie Hays. I'm the founding product manager over here at DataHub. Today, I'm gonna talk you through a couple of things. One, we're gonna dig into a DataHub community highlight, taking a look at what's been going on within the DataHub community and the code base. And then we're also gonna take a look at what's to come in 2026. So without further ado, we will kick things off with our community highlights. So, as you are likely aware, DataHub is an open source project, and we are an extremely busy open source project. My goodness. We crunched the numbers. Within the past twelve months, we had nearly 2,400 pull requests. That is so many, so many, so many, so many. I don't think we've ever had that high of a volume. Within those PRs, we had 1,300,000 net lines of credit. No. No. No. That's a callback from my banking days. Lines of code submitted, from 177 unique contributors. Now keep in mind, that's 177 wonderful humans across the world, but 67% of those contributors are actually coming from the community. So 119 of them, in fact, were folks just like you in our open source community, and only 58 of our DataHub team members contributed to that. When we take a look a little bit closer at what those PRs contained, well over 50% were focused on metadata ingestion. Not surprising, because it's all about ingesting that metadata. On the commit type side of things, we had about 42% focused on bug fixes, 32% on feature work. Very modest, but very important, 9% in our docs. And month over month, we were averaging between two hundred and three hundred PRs. It's just pretty incredible. So what we like to do is celebrate some special folks in our community who are keeping the project moving forward.
So this year, we're gonna highlight four folks in our community, and I'm gonna go ahead and start with Rahul. First of all, Rahul, I gotta say, I love your GitHub. Your GitHub handle, relax boy, is so great. Rahul submitted 47 pull requests, and over half of those have already been merged. That's so many. Honestly, probably more than I've contributed in the past, I don't know, three years combined. But what's amazing about this work is that a lot of it focused on addressing critical vulnerabilities across Spring, Postgres, and MySQL. He also did some work to strengthen our SSL and TLS functionality. And so just overall, massive thank you to Rahul. We genuinely, really, deeply appreciate that hard work. Benjamin Maquette, you are up next. Benjamin submitted nine pull requests this year. All nine of those have already been merged. And the area of focus was on our ingestion side. So looking at our Superset/Preset connector, some enhancements to our dbt Cloud source, Airflow lineage, specifically hooking up with BigQuery, and then also some improvements to our CLI warnings. So a big thanks to you, Benjamin, for contributions across the board. Austin, you are up next. 10 PRs from you, and five of those have already been merged. Thank you so much for all of your work on making some big improvements to our S3 ingestion source. There's also some work that you did in there to improve readability, so some refactor work to make our code base easier to navigate for other folks. We always deeply appreciate that type of support. Last but certainly not least, we have my buddy, Lance. Lance has been joining us at office hours for years at this point. I've genuinely enjoyed every conversation I've ever had with him. He's always come into the conversation with some big, really well thought out ideas, and questions and kind of stumpers for us.
But on top of that, Lance also submitted 10 PRs. Three of those have already been merged. Some of the areas of focus have been around Superset and Preset. And also, I haven't checked these out, but a little birdie tells me that there are some draft PRs that are focused on our glossary improvements. So this is certainly not all of the humans that contributed from the community, but these are folks who have really gone above and beyond. So genuinely from the bottom of my heart, and I know that I'm also speaking on behalf of the broader DataHub team, we genuinely could not keep this project up and running, and kind of up to date at times, without you, because things are moving so fast. So we genuinely appreciate all of your support. So we talked about who the people behind the community are. Let's talk about some of the things that we shipped in 2025. So of those 2,400 PRs, guess what? I'm not gonna be able to describe everything that we shipped. But there are a few things that I wanna call out. Earlier this year, I think it was in January, during our town hall, we celebrated DataHub's fifth birthday. To celebrate that landmark, we rolled out DataHub 1.0, which on one hand is a huge mark of maturity for any open source project. But even more exciting, we actually completely overhauled our user experience, focused on search, discovery, governance, observability, and making all of those workflows really clean and easy for folks to navigate. This screen in particular is showing you our simplified homepage view. We drastically simplified what we're presenting to get people into the most important workflows that are relevant to them, also providing customization support. I have John over here hyping me up, or hyping someone up.
In addition to that, we also drastically simplified our search experience by pulling in some quick filters, also giving, the ability to kind of search and navigate by, by kind of browse path. And then also within the search experience, we're showing a preview of those assets so you can quickly get a sense, if the assets or or the datasets are relevant to you before you start digging in. And let's talk about what happens when you start digging into an asset. More often than not, folks come to DataHub because we have extremely robust and detailed data lineage. And and guess what? In a production environment, lineage graphs get extremely complex extremely fast. So we focus on making sure that those were easy to navigate, that they're performant, and just overall an impactful experience. Last but certainly not least, this year, we put a ton of investment in data observability within open source DataHub. So if you are, executing data quality checks within maybe DBT tests or Great Expectations or maybe you're you're kinda running your own data quality, platform in your organization, we will surface up that information directly within DataHub. So not only can you find the relevant resources, in your data discovery journey, but you can also see the full history of all the data quality checks that have been executed, their outcomes. And then along with that, we have a full incident management feature. So you can easily create, assign, categorize, pry prioritize, and manage the status or the state of data incidents when those data quality issues inevitably pop up. Now beyond a flashy new UI, we also made some pretty big inroads in the, LLM slash MCP space. So earlier this summer, we launched the DataHub MCP server, and and we've just been blown away by the adoption by the community, also our DataHub Cloud customers. So what this does is it provides, metadata context to LLMs, makes it incredibly easy and powerful to execute, impact analysis or and or agent based development. 
We found that this is really useful for both data practitioners who are building or managing data pipelines, but also we're hearing some pretty amazing success stories from software engineers who are actually referencing it in their day to day code development as well. One thing I will say is that we actually dogfood this on a daily basis internally, and it has genuinely made my life easier. So if you haven't tried it yet, highly recommend checking it out. On the ingestion side, we invested quite a bit in performance tuning our top sources, making sure that metadata ingestion is scalable within true production environments. But we also started to expand our investment in ingesting AI specific sources, particularly Vertex AI and MLflow. We also rolled out Hex. Again, this is one that I use daily. My teammate and I get tremendous value out of this one. Then I'll also call out these seven sources that are currently in development. We have Azure Data Factory, Confluent Cloud, Flink, excuse me, among others. While those aren't technically merged, they will be merged soon, and I just wanted to make sure that those are on your radar, and we'll count them towards our 2025 progress. Another area of really exciting progress was rolling out the Iceberg REST catalog. So for folks who are implementing a data lakehouse, one common pain point is that access policies can quickly get scattered across systems. It just creates extra work. It's super fragile. Access policies drift from one platform to the other despite folks' best intentions. So what we built out was a one stop shop for defining your policy in one spot in DataHub. And then the Iceberg REST API actually enforces that everywhere else. So Jen mentioned this at the top of the show, where earlier this fall we had our Context virtual conference.
And Ryan from Demandbase was actually one of the speakers who came to share his experience digging into the Iceberg REST catalog integration with DataHub. So Demandbase has actually been using Iceberg, I think, for about four years, somewhere around a petabyte or more of data processing in there. So they're running at some serious scale. And what they had found is that while they had all this really robust infrastructure, they just wanted a clean REST catalog endpoint to integrate with so that they could build that into their existing workflows. So by doing so with DataHub's Iceberg REST catalog, they got that central spot for defining and managing policies while also getting the automatic bonus of lineage across the entirety of their ecosystem. One other session during Context was with Vikram from Foursquare. So this is actually a totally different use case for using the Iceberg REST catalog and DataHub. So the Foursquare team built out a data marketplace for surfacing up their geospatial datasets, or data products, excuse me. And so they were actually able to use DataHub as that central governance layer to manage all this really, really complex data product ecosystem. So I definitely recommend checking out both of those if you're curious about the Iceberg REST catalog, or just their experiences otherwise. So that's a lot of cool stuff that we did in 2025. Let's talk a little bit about 2026. On the discovery and governance side, we are so excited, so so excited, to be taking part in Snowflake's Open Semantic Interchange. So if you've heard of Snowflake OSI, if you kinda heard that floating around, it's basically a working group of leaders in the AI and BI and data tooling space to finally wrangle a common definition of a metric so that there's interoperable movement of data between systems.
So we're working we're part of that, kind of working group working alongside, like I said, both, vendors within the space, but then also kind of enterprise level, customers who are are, you know, kind of deeply motivated to, to actually leverage something like this. So we're gonna be kind of working side by side with them to make sure that kind of the data hub perspective is well understood and represented. As soon as that standard is set, we will absolutely start moving forward in the implementation. So more to come there. On the metadata ingestion side, a couple of areas that I'm very excited about personally. We have, we have work queued up for the 2026 to start digging into Microsoft Fabric. So Microsoft Fabric is a very complex and very involved ecosystem. So where we're gonna start is getting support for, Microsoft Fabric Data Factory as well as One Lake. But the more that we understand kind of what our open source or kind of the DataHub core community or, you know, community members, like what your fabric environment looks like and kind of other connectors that you would require in there, as well as from our DataHub Cloud customers, We'll iterate through that and expand that out, over time. The other one I'll call out is that we are starting to, tee up work to support ingesting data quality run results from Monte Carlo. So this is gonna be a really nice way to round out our kind of observability support. So in addition to our, support for great expectations and DBT tests, if your team, is thinking about or is currently using Monte Carlo for your data discovery observability, that'll that'll, kinda surface up in there. And now you might be asking yourself, why is she only showing four metadata sources when every time I blink there's 15 new data tools out in the wild that we wanna bring into DataHub? 
Well, I can't tell you too many details yet, but I can tell you that we are currently working on building out some drastically impactful, drastically better tools and resources to make it easier than ever and faster than ever, most importantly, to, start building production grade connectors to start bringing data into DataHub. So we will be announcing more soon. This will be coming out in the next couple of months. But 2026 will I have a feeling we're gonna go from 75 ish connectors to more than we can you know, more than that. Quite a bit. Also, on the observability side, here's what we have queued up for the open source community. There's a few different areas or kind of workflows that we like to or that we tend to focus on within observability. So number one, we wanna make sure that that it's easy for you to detect data quality issues early, so that you can go and address them before they you know, before ripple effects start to set in. So what we will be, in order to kind of, bolster your your early detection, we're gonna be rolling out a data health dashboard to make it really easy for you to see all of your assertions at kind of a a, you know, bird's eye view, any incidents that are raised along with that, and really just make that triage process a breeze. So once you've detected your issues, now you're on call, you wake up, and it's five you know, there's five new incidents for you. Where where do you even start? Right? What we are also gonna be rolling out is within an incident, you can actually define a run book and or just kind of provide, notes or context along with that. So that doesn't matter, you know, who's on call or what you know, how the team structure responsibility of those tests evolve. You can capture all of that, kind of context around here's, you know, here are the steps to follow once you, once you are ready to start resolving it. 
So, the other kind of added bonus here is that this is tremendous context for all the AI agents that we're starting to that folks are starting to deploy within their environment to really rapidly, and deeply accelerate that root cause analysis and then ideally stage PRs for you so that you can go back to sleep and, you know, have your make your on call rotation a breeze. So you've detected incidents early. You have kind of crowdsource guidance or runbooks around what to do when when they pop up. We also wanna make sure that within the observability flow, we are providing robust reporting and prevention mechanisms. So on the prevention side of things, we're gonna be expanding out our data contracts to support structured properties. So, this is gonna be a really useful way for organizations to kinda start adding a little bit more qualitative information or detail on on how a contract is enforced. You can imagine if you're sharing third party data and you wanna make sure that it's not shared outside of a specific reason, or if they're SLA windows that can't be breached, you can basically set those those properties that are specific to your organization or specific to that, to that data contract to make your your contract enforcement and your contract monitoring, really robust. And then on the reporting side, we know that there's always gonna be slices and dices of, ways that you wanna look at outcomes. So we're gonna be making some big improvements to our SDKs, to make sure that we have endpoints to extract that information. So what I will say is that this has been a very fast and very high level overview of what's to come in 2026. I held off from talking any mentioning anything about, kinda AI driven, workflows because now it's time to hand it over to mister Shirshanka and mister John, and I believe Nick is gonna be joining us as well, to get us up to speed on what's in that happening in that domain. So yeah. Happy happy 2025, everybody. 
Always happy to do this at the end of the year, everyone. There you go. And thanks, Maggie, for running us through the year. It seemed like it went by pretty fast, and yet here we are making big plans for what's coming up ahead. Yeah. Yep. Every year goes faster than the last. One of my favorite parts of the year was Context. We heard so many great stories from folks from around the world on what they're doing with DataHub and where they're taking it. And I thought that I would spend a little bit of time doing a quick debrief of what I shared at Context and the direction in which DataHub is going. And then instead of talking, actually let John and Nick share what they've been building and see how some of this vision is actually coming to life. So what is DataHub doing for context, and why are humans and AI agents important? Well, we gotta start from the beginning. This is where we all started. Brock, fictional but completely real human being, sales analyst at a famous company. He's been asked by his CRO Monday morning: sales are down, I have a board meeting in forty-eight hours, Brock, figure out what's going on. Of course, Brock works hard, runs through 50 Slack messages, figures out there's so many tables called sales v two final final v seven, and, you know, pulls an all nighter. He doesn't get access to a bunch of tables. He has to file a bunch of service tickets. And finally, when he actually gets access to the dataset, he realizes it doesn't have the region field that he needed to do the breakdown. And, of course, he decides to build his own SQL. He doesn't quite know what he's doing because, you know, it's his first time working with these tables specifically. And Monday morning, I mean, Wednesday morning, he's got most of everything done, and then the CRO says, hey, Brock, I have one more question. It's a bit too familiar for a lot of us who've been in the data space.
I think every time Maggie and I talk about her past life, she gets a little bit of PTSD thinking about those times when those data pull requests didn't quite go as planned. And we are all thinking, wait a minute. AI is gonna solve all this. AI is gonna come in and fix all of our problems. But we're also a little bit worried, because all these AI agents are gonna come in. We're gonna build them, and they're gonna start processing all this data super fast. But they're actually gonna run into a lot of the same problems that Brock ran into. They're not gonna know what data to actually use. They're gonna run into the same fragmentation challenges. They're gonna run into the same access control challenges, and they're gonna run into the same trust challenges. And in the end, what we've found is that AI agents and AI models in particular tend to be extremely susceptible to the human that's guiding their work. So they tend to agree with us. They tend to confirm our worst suspicions. And so I think the situation might actually be a lot worse than we think. Instead of one confused but hardworking Brock, we're gonna get a bunch of very delighted AI agents that will be confused on the inside but present a very happy exterior and give us a lot of wrong insights very quickly if we are not careful. And that's where DataHub comes in. As you know, DataHub today makes humans a lot more productive. It helps them to find, understand, and build on the entire enterprise data supply chain. It helps them understand where data came from, how it was transformed, and how to use it in the right way. And we're well on our way to making DataHub do the same thing for agents. So as the digital twins show up, data engineer agents that are accelerating data engineering outcomes, data analyst agents, software engineer agents.
You know, the world is our oyster right now in terms of the number of agents that we can dream up, and how they can help humans and work hand in hand with humans. And DataHub's job really is gonna evolve to not only providing context to humans, but also to agents, and help them take the same advantage that humans have been taking. So, really, we see DataHub as becoming this context platform for building applications and agents that are going to be data intensive in nature, the ones that really need to work on a lot of data in a short amount of time, and they really have to be working in a very trustworthy way. But, of course, I'm an architect, and I like to think about building things, and I like to think about how things are layered on top of each other. So let's think about the personas that we have to design for if we were to build this context platform from scratch. We have humans. They continue to be important because in some cases, they are actually driving the agent. And in other cases, they are creating work that agents are going to work on autonomously, but then, you know, collaborate with the humans on from time to time. Below that lives this magical thing called the context platform, and this is the thing that's going to make context available to agents in the most trustworthy way possible. And below the context platform, of course, live all of the assets that the enterprise needs to work on. And our definition of how big this thing is is changing every day. When we started in, you know, 2020, 2021, we were focused on the warehouse systems. And then over time, we started focusing on the BI tools, and then we moved upstream and started focusing on streaming systems and operational systems. And now we're saying, guess what? AI and AI agents need access to pretty much everything. They need an understanding of every single system that your enterprise is using.
And so we see the scope of the context platform in terms of the systems that it integrates with just expanding to cover the entire, breadth of all of the data assets at the company, all the way from production systems. This could be your, SaaS systems, sales systems, etcetera, and as well as your operational databases to your ingestion systems, transformation, lake warehouse, AI systems, BI systems, etcetera. But how is the context platform actually architected? What does it provide? How are what are the components of this system? First off, we need a context graph. And then, you know, I use the word graph very intentionally because a lot of times we think of context as just a bunch of disconnected nodes with a lot of attributes, but, really, it's a graph. Where did this asset come from? Who owns it? Who transformed it? Why was it created? All of these things are important context that have to be linked together for AI agents to be able to work with them. And what's in this context graph? Of course, technical context, the names of things, the identity of things, the identity of actors, the structure of those things, all of the things that we normally call technical metadata, is really technical context. And in addition to that, we need business context, runbooks, operational guidelines, why was this thing created, a small Slack conversation that explained, hey. We actually don't do it this way. We do it a different way, or this is what this really means. All of that is business context. And then new identities have to show up on the context graph, the new tools that agents have to work with data, the MCP servers, as we call them, as well as the agents themselves. If users are part of the context graph and humans are part of the context graph, then agents need to be as well because they are important actors in the enterprise graph. And on top of the context graph lives some important capabilities. 
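To make the "it's a graph, not a pile of disconnected nodes" point concrete, here is a minimal sketch in Python. It is purely illustrative and is not DataHub's data model or API: the `ContextGraph` class, the node kinds, the relation names (`derived_from`, `owned_by`, `documented_by`, `monitors`), and the example asset names are all assumptions made up for this sketch. The idea it shows is only that datasets, humans, agents, and business context such as runbooks sit in the same linked structure, so "where did this come from?" and "who owns it?" become edge lookups.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One entity in the (hypothetical) context graph."""
    id: str
    kind: str  # e.g. "dataset", "user", "agent", "runbook"


class ContextGraph:
    """Toy context graph: typed nodes plus labeled, directed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # source id -> [(relation, target id)]

    def add_node(self, node_id, kind):
        self.nodes[node_id] = Node(node_id, kind)

    def link(self, source, relation, target):
        # Refuse edges to unknown nodes so context never dangles.
        for node_id in (source, target):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.edges[source].append((relation, target))

    def related(self, node_id, relation):
        """Return ids reachable from node_id over one edge of this relation."""
        return [t for r, t in self.edges.get(node_id, []) if r == relation]


graph = ContextGraph()
graph.add_node("sales_orders", "dataset")
graph.add_node("raw_orders", "dataset")
graph.add_node("brock", "user")
graph.add_node("incident_agent", "agent")   # agents are actors too
graph.add_node("orders_runbook", "runbook")  # business context

graph.link("sales_orders", "derived_from", "raw_orders")    # technical context
graph.link("sales_orders", "owned_by", "brock")
graph.link("sales_orders", "documented_by", "orders_runbook")
graph.link("incident_agent", "monitors", "sales_orders")

owners = graph.related("sales_orders", "owned_by")
upstream = graph.related("sales_orders", "derived_from")
```

In this toy version, technical context, business context, and agent identities all go through the same `link` call, which is the property the talk is pointing at.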
Context persistence, the ability to actually record context, not just read it from external systems. And that, of course, means you need to have versioning of the context that you're storing. You need to have tiering of that particular storage because this can be pretty large amount of context. You need to be able to cache it for efficient performance, and you need to have a way to subscribe to changes that are happening in context. This is important because you might have many, many, many agents running in your enterprise, and not all of them are able to continuously connect up to the platform, but they might want to subscribe to certain subsets of context that they're operating on. Imagine an agent that's, you know, responding to, an incident. It needs to know if something very far upstream of that particular asset it's monitoring had an issue or if those issues have been happening consistently. It doesn't need to know about many other unrelated things. And so for agents to be able to operate, quote, unquote, in the field, they need to have a way of having their own subset of memories, their own subset of context that they care about, and a way to keep it continuously refreshed. And the ability to subscribe to the context graph is a super important part of that. And, of course, on top of that lives a lot of the techniques that we are now calling context engineering. The ability to filter context. So during retrieval, how do I make sure I only get the context I want? Context compression, the ability to make sure that I'm able to compact and only retain the relevant parts of context as I have longer and longer sessions. Context performance, the ability to make sure that, you know, my interactions with the context back end is super fast and super efficient. 
And finally, context observability, the ability for me to record not just, sessions or memories, but also to record every single, operational detail about what happened so that monitoring and improvements can happen in the same way that we do app observability or data observability. And so this really is the breadth of what it takes to build a real context platform, a context platform that AI agents and users can actually use to build and operate AI agents at scale. But that's just a lot of talk and a lot of boxes and arrows. Let's talk about what DataHub is actually doing to get to that vision. Turns out, DataHub already has a lot of the ingredients of a context platform. If you kind of look at the architecture of DataHub and you look at how it's built, but also what things it does, it already has integrations to 70 plus systems in the enterprise AI and data stack. We just heard about a bunch of AI integrations that happened this year and also a bunch of BI integrations that happened. It pulls it all in into a single metadata graph that includes a lot of this technical context and, increasingly, a lot of business context is coming into this graph. And on top of that, it offers kind of these distinct capabilities, discovery, observability of data and metadata signals, as well as governance. And so these foundations make it super easy and super convenient for DataHub to expand and offer additional capabilities and evolve towards this context platform vision. And to do that, I'm super excited to welcome on stage, repeat offender, John Joyce, cofounder at, DataHub, and, a new entrant, Nick Adams. John and Nick, take it away. Awesome. I'm gonna share my screen if you don't mind, Shirshanka. Great. So I'm gonna talk through how we're actually building the context platform for humans and AI, both across DataHub's cloud offering and the open source DataHub product as well. When we first set out to build the context platform, we decided to start with actually an agent. 
If we're building for humans and agents, we wanted to develop some intuition for what an agent would need to be productive and useful on top of the context that DataHub has today and will have tomorrow. And so we built Ask DataHub. Ask DataHub is an agent that is embedded directly inside of DataHub and also available where you work, in Slack and Microsoft Teams, that enables your teams to build data, find data on their own. So specifically, Ask DataHub empowers your team to find the right data for a particular use case across your entire data ecosystem. It enables you to understand the impact of changes before you make them, again, in natural language. You can generate accurate SQL and dbt models by having all of this powerful context that DataHub already has: lineage, queries, descriptions, and more. And then you can make changes to your data assets. So you can actually manage the metadata about your assets directly from the agent as well. So what questions might you want to ask this agent we've built? Well, here's a few examples. Right? So you can ask DataHub things like, what is the dashboard for monitoring the marketing email click through rates? Right? Or how is user retention rate calculated? DataHub will search through all of the context that it has, glossary terms, domains, your data asset graph, to try to build an answer to this question. How do I calculate the number of orders that were returned last month in the EU? Right? So being able to actually understand the data graph, understand the context, and generate accurate SQL. If I remove the date partition column of our purchase events table, what will happen? Right? What will be impacted? And finally, actually taking action. So you can create glossary terms, you can create tags, you can attach them to tables. So you can say something like create a new glossary term to represent email addresses, and then find relevant tables to add it to. Right?
Maybe any of them that have email addresses in them. And for those of you who are interested in how things are built, I'll just quickly cover how we built Ask DataHub. You know, Ask DataHub is one agent available in multiple surface areas: DataHub, Slack, and Microsoft Teams. It sits on top of the DataHub MCP server, which we'll talk a little bit more about in a later section of the presentation, but you can think of it as a pluggable API that agents can use to tap into the rich context and the data graph that DataHub already has. And so what is that? Well, obviously, DataHub has a picture of the data assets, the tables, the dashboards, the notebooks, all of the actual physical things. It also has access to all of this rich context that we bring in during ingestion: data lineage, data documentation, data quality, usage, ownership. Right? All of that kind of who, what, when, where, and why around the data. And I'd like to do a quick demo of Ask DataHub, but before that, let's maybe set up the example. So you can imagine that, you know, maybe we work at a fictional bank. We'll call it fictionbank.com, very creatively. And, you know, what the bank cares about is not only the actual data that we have. Right? It's also these higher level concepts like reference data, right, which helps us understand how to use the data, and critical data elements, which are the most important attributes inside of our data landscape. And then most importantly, we care about actual concepts. Right? We care about trading and loans. We care about our customers. We care about risk and compliance. And so, you know, it's not just the data that we care about, it's actually the meaning of the data that we care about as well. And so with that, I want to jump into a quick demo of Ask DataHub. I'll just kind of narrate through it here. So what you'll notice in DataHub is that we now have a chat interface where we can paste a question.
For example, what data should I use to understand market trading activity this morning? It's gonna search through your entire DataHub catalog and give you a response. In this case, it found a market data feed table. It also tells us it's a gold tier asset, so maybe it's important. We've tagged it as gold. Here's a look at the columns. You can see we have market securities that are trading. We have an ask price, a bid price. And what we can ask on top of that data is, how do I generate the average spread between the bid and the ask for today's trading session? Or maybe for a specific security, Demo Corporation. And you can see that DataHub will actually use the context it has to generate the SQL you need to answer the question. And I wanna pause here just for a second, because one of the things that we've worked hard on is ensuring that when DataHub doesn't have all of the context it needs, it's able to ask the user for additional clarification instead of just hallucinating a query. And so what you're seeing here is that DataHub actually doesn't have all of the context it needs. It's asking the user, hey, what should the security ID format be for this table, and when should the trading start time be? And so this will become important in the next section as well. But you can see we can generate some SQL here, and then we'll move on to our next question, which is around impact. So this table's on Oracle. If I wanted to migrate it to Snowflake, would anything break? DataHub will search through all of your lineage information and it'll tell you, yes, something will break if we move this to Snowflake. Specifically, there's a credit risk metrics table downstream of this table that may break. And then finally, I'll actually ask DataHub to make some changes. So I'll say, hey, please add the trading assets glossary term to this dataset, and it'll go ahead and add that glossary term directly from within Ask DataHub.
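The impact question in the demo ("if I migrate this table, would anything break?") comes down to walking lineage downstream from one asset. The following is a minimal sketch of that idea, not DataHub's implementation: the `LINEAGE` graph, the asset names, and the `downstream_impact` helper are all hypothetical and only echo the tables mentioned in the demo.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets that read from it.
# The names echo the demo, but the graph itself is invented for illustration.
LINEAGE = {
    "oracle.market_data_feed": ["snowflake.credit_risk_metrics"],
    "snowflake.credit_risk_metrics": ["looker.risk_dashboard"],
    "looker.risk_dashboard": [],
}


def downstream_impact(graph, start):
    """Return every asset reachable downstream of `start`, breadth-first."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # guard against cycles and repeat visits
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    for asset in downstream_impact(LINEAGE, "oracle.market_data_feed"):
        print("may break:", asset)
```

A breadth-first walk lists direct consumers before indirect ones, which matches how an impact answer is usually read: nearest breakage first.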
And you can see we've added it here. The final thing I want to cover is that you can actually customize Ask DataHub's behavior. So we've gone into settings here and we can see that we can provide base instructions for the AI agent. So here we're telling the agent to prioritize assets that have been tagged with the gold tier tag. And so this will instruct DataHub, when it's searching for data, to prioritize that type of data. Super powerful to customize it to your needs. And then finally, Ask DataHub is also available in the search bar, so you can jump right into a conversation directly. Cool. And just a quick recap of what we saw. Ask DataHub enables you to do a bunch of useful things. You can find trustworthy assets across your entire data landscape. You can generate accurate SQL using all of the context we already have. You can dive deeper and understand lineage and impact analysis. You can actually make changes to your metadata directly from within Ask DataHub, and it's fully customizable to suit your organization's needs. And you can see it's available in Slack, it's available in DataHub, and in Teams as well. Cool. So the second thing I wanna walk through is another project we've been working on, inspired by the experience of building the Ask DataHub agent. When we started rolling out Ask DataHub to our initial beta customers, one piece of feedback we got is that it would be great to give DataHub access to more context, specifically context that typically lives in people's heads or in unstructured documents on Notion or Confluence or in Google Drive. And so we started to think about how to expand the footprint of DataHub's context graph beyond just structured data or your data supply chain. And so that's what we're calling the data context graph. Now what exactly is that?
Well, it's really a way to bring in unstructured documents, or create unstructured documents on DataHub, so that agents can consistently and reliably provide answers. So an agent won't just have access to your underlying data graph, it will have access to a knowledge repository on top of all of your data assets. And so specifically, you can now create documents directly on DataHub. You can also bring in and index documents that live in external platforms like Notion, Confluence, and Google Drive. You can connect data assets to unstructured documents, so you can actually link them in a graph structure. And then finally, we make all of this unstructured context easily accessible to agents through semantic search and our MCP server. So agents can not only read through all of this unstructured context to provide better answers to your questions and to answer a much broader set of questions, they can also start to record their own context back into the graph. So what types of questions can we answer with the context graph that we couldn't answer just with Ask DataHub before? We can start to answer higher level questions like, which metrics are approved for executive reporting? Maybe we have an FAQ document floating around somewhere that actually defines that. Or, what data quality checks should I add to my new table? Maybe we have a quality runbook in the data engineering doc space. How should I label or handle sensitive or regulated data? Or, what's the process for requesting new data access? So these questions are not just about a specific data asset or a specific group of data assets. They're oftentimes higher level, about how to actually perform your task. The main difference between this new architecture with the context graph and what we saw with Ask DataHub is that we're adding a new type of context into the graph, and that is unstructured context, documents. Right?
Which can either be created by a user directly on DataHub or ingested and indexed from an external provider that you already have: Notion, Google Drive, Confluence. And then at the MCP server layer, we're adding one new, very powerful capability: the ability to semantically search across the document space. And for those of you who don't know what semantic search is, it's just a fancy way to say it's really easy for agents to query and find the right information from the context space, from these unstructured documents. And so now I'd like to jump into another demo of just the context space, and I'm gonna start by going over to our Notion space for our fictional bank. You can see we've got a bunch of different documents in here about concepts that the bank cares about, how different data is related, how processes are run. And the first thing you'll notice is that you can now access those Notion documents directly from DataHub. So what we just did is we searched for this security document in the search bar and we were able to navigate to Notion. The second thing we'll look at is a new capability to define docs directly within DataHub. So you'll find this new context section on the nav bar that enables you to create unstructured documents directly inside of DataHub. So you can see we've got one about retention policies, and maybe we'll create another one, just as an example, about credit risk. We can give it a type, which is super important because it enables the agent to narrow down the space of docs that it cares about. So in this case, we're creating one saying, hey, it's a definition of credit risk. It could be a runbook, it could be an FAQ, it could be anything really. We can see the change history of the doc here in the change history drawer. We can move docs around to different parents, and then, maybe most powerfully, we can link assets to the docs.
So this will enable the agent to bounce between the asset graph and the unstructured context graph. And what you're seeing is that you can also navigate from the asset page to the docs that are related to that asset. Okay. Let's go through some questions we can now answer by virtue of bringing in that unstructured context. So I can start by asking a very broad question that, again, isn't specific to a particular piece of data. Do all tables require retention policies? DataHub will search across the space and use the retention policies doc that we had created in that sidebar to answer my question. So no, not all tables require retention policies. We can click on the retention policies and see that it's actually pulling directly from that doc. Now let's try another question, about generating SQL. And this is a mouthful, but it's important that we understand it. So, generate a query to compute the total trade volume on commercial real estate loan trades over the past quarter with a safe loan-to-value ratio. Why is this an interesting question? It's interesting because you have to understand concepts, like what is a safe loan-to-value ratio for the bank? How much collateral does someone need to put up in order to get a real estate loan? A second thing you need to understand is a specific type of loan. A real estate loan, not just any type of loan. And so what we'll see here is that DataHub will search through all of our Notion documents which define these things, real estate loans, loan-to-value ratio, and it'll use that to generate SQL. So it'll say, based on your organization's commercial loan underwriting guidelines, which is a document in Notion, a safe LTV for commercial real estate is 75% or less. And you can see it actually appear in the SQL statement.
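To make that last step concrete, here is a rough sketch of what such a query could look like once the 75% threshold from the underwriting document is applied. This is not the SQL shown in the demo: the `loan_trades` table, its columns, the sample rows, and the `total_safe_cre_volume` helper are hypothetical, and SQLite stands in for the bank's warehouse only so the example runs on its own.

```python
import sqlite3

# Threshold quoted in the demo: a safe LTV for commercial real estate is 75% or less.
SAFE_LTV_MAX = 0.75

# Hypothetical schema: loan_trades(loan_type, ltv_ratio, trade_volume, trade_date).
QUERY = """
SELECT COALESCE(SUM(trade_volume), 0)
FROM loan_trades
WHERE loan_type = 'commercial_real_estate'
  AND ltv_ratio <= :safe_ltv
  AND trade_date >= :quarter_start
"""


def total_safe_cre_volume(conn, quarter_start):
    """Total trade volume on commercial real estate loans with a safe LTV."""
    params = {"safe_ltv": SAFE_LTV_MAX, "quarter_start": quarter_start}
    return conn.execute(QUERY, params).fetchone()[0]


# Invented sample data, purely to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loan_trades "
    "(loan_type TEXT, ltv_ratio REAL, trade_volume REAL, trade_date TEXT)"
)
conn.executemany(
    "INSERT INTO loan_trades VALUES (?, ?, ?, ?)",
    [
        ("commercial_real_estate", 0.70, 100.0, "2025-11-03"),  # counted
        ("commercial_real_estate", 0.80, 250.0, "2025-11-04"),  # LTV too high
        ("auto", 0.50, 75.0, "2025-11-05"),                     # wrong loan type
        ("commercial_real_estate", 0.75, 50.0, "2025-08-01"),   # before the quarter
        ("commercial_real_estate", 0.75, 40.0, "2025-10-01"),   # counted
    ],
)

if __name__ == "__main__":
    print(total_safe_cre_volume(conn, "2025-10-01"))
```

The point of the sketch is the `ltv_ratio <= :safe_ltv` filter: the number comes from a document, not from the table schema, which is why the agent needs the unstructured context to write it.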
So why this is important is the contrast with what we saw in the previous SQL generation example, where it didn't have enough context to create exactly the right SQL and it asked the user, hey, can you fill in the gaps? DataHub can now fill in the gaps on its own, using that unstructured context that we've brought in. And it actually references, in this case, three or four different documents that it scanned through in order to generate that SQL. So it's a super powerful way to combine all of this context. Alright, I'm gonna go back to the slides here. Just a quick recap of what we saw. With the DataHub context graph, which is available in both cloud and open source, you'll be able to create unstructured context documents on DataHub. You'll also be able to connect third-party unstructured document sources like Notion, Google Drive, and Confluence. We're starting with Notion. And then finally, you'll be able to semantically search across all of this unstructured context, and that'll be available through our agent, Ask DataHub, as well as through the MCP server as a tool. Now I want to hand it over to my teammate Nick to describe how you can build your own version of Ask DataHub using DataHub's primitives. And this is what we call the DataHub Agent Toolkit. Thank you, John. Yeah. In addition to Ask DataHub, all of these tools are available via... we can go to the next slide. Sorry. Yep. All these tools are available via DataHub's MCP server, in open source as well as cloud, to connect to existing LLM apps, to build your own agents, or to connect to any agent that you've built in your organization. The MCP server will allow for searching assets and exploring asset details, including lineage and popular queries. And with these new features for unstructured documents, it can also do semantic search over context docs. It can update asset metadata directly from the agent. So you can do mutations and writes from the conversational agent directly.
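The semantic search over context docs mentioned above can be pictured as ranking documents by how close their embedding vectors are to the query's vector. The sketch below shows only that ranking step, under loud assumptions: the three-dimensional vectors are hand-made stand-ins for what an embedding model would produce, the document titles are borrowed from the demo, and `search` is a hypothetical helper, not an MCP tool.

```python
import math

# Hand-made stand-ins for embeddings. A real system would compute these
# vectors with an embedding model; the values here are invented.
DOCS = {
    "Retention policies": [0.9, 0.1, 0.0],
    "Commercial loan underwriting guidelines": [0.1, 0.9, 0.2],
    "Data access request process": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, top_k=2):
    """Return the titles of the `top_k` documents most similar to the query."""
    ranked = sorted(docs, key=lambda title: cosine(query_vec, docs[title]), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # A query vector that leans toward the "loan underwriting" direction.
    print(search([0.2, 0.8, 0.1], DOCS))
```

Because the match is made on vector closeness rather than shared keywords, a question about "safe loan-to-value" can land on an underwriting document even if the wording differs.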
And this open source MCP server is self-hosted, and then the hosted MCP server is available in DataHub Cloud. So, a couple of options for integration. One example we have is integrating with Claude Desktop. So you can configure the Claude Desktop config JSON to connect to the DataHub MCP server with a token and have it be able to chat with the data through the MCP server. And we have a little video of an example here. So it's showing all the customer tables, and it's able to execute search queries. It can do semantic search. It can browse entities. It can load data and present full results to you inside the Claude Desktop app. We're not stopping at just providing MCP tools for LLM apps like Claude and hoping you know how to use them. We're building integrations for some of the most popular AI frameworks out there, LangChain, LangGraph, Google ADK, CrewAI, and others as they get developed. Our goal is to make sure that whatever agent framework you're programming in or using, you can easily use the DataHub platform as your context platform for both retrieval and persistence. We'll be open sourcing a repository of example agents to help you get started on this journey and make it really easy for you to build agents. And we have a video demo of a console agent that we built, not using Ask DataHub, but using LangChain. And this is the same query we saw before from John: generate a query to compute the total trade volume on real estate loans with a safe loan-to-value ratio. And so this agent's able to execute search queries. It's calling search documentation to do semantic search with specific terms for real estate loan trade volume. It's gonna be loading some entities, doing some further searches of the documentation to really build out the query and generate the end result.
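For the Claude Desktop integration mentioned earlier, the config JSON amounts to one entry under `mcpServers` that tells the app how to launch the server and which credentials to pass. The sketch below builds that entry in Python. Treat the specifics as assumptions to check against the DataHub MCP server documentation: the `uvx mcp-server-datahub@latest` launch command and the `DATAHUB_GMS_URL` and `DATAHUB_GMS_TOKEN` variable names are my best understanding, and the URL and token values are placeholders.

```python
import json


def build_claude_desktop_config(gms_url, token):
    """Build an `mcpServers` entry for a self-hosted DataHub MCP server.

    Assumed, not confirmed by the talk: the launch command and the
    environment variable names. Verify both against the DataHub docs.
    """
    return {
        "mcpServers": {
            "datahub": {
                "command": "uvx",
                "args": ["mcp-server-datahub@latest"],
                "env": {
                    "DATAHUB_GMS_URL": gms_url,
                    "DATAHUB_GMS_TOKEN": token,
                },
            }
        }
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own DataHub URL and access token.
    config = build_claude_desktop_config("https://datahub.example.com/gms", "<your-token>")
    # Paste the printed JSON into Claude Desktop's claude_desktop_config.json.
    print(json.dumps(config, indent=2))
```

Keeping the token in the `env` block rather than in `args` avoids it showing up in process listings, though it still sits in plain text in the config file.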
We can zoom ahead on this a little bit. And so it's able to generate the full query, and it has a safe threshold in it. And then we're also able to tell it to bookmark this. And so what it'll do is save the results of this query and a summary back into DataHub as a context page, as an unstructured document, for future queries. And so this is just an example of that same document that we generated and stored back into DataHub, which we're able to see in the UI. Having an agent be able to access the rest of the context graph is super great. But being able to persist these memories and data back into the context graph, so they can be used for other agents, other queries, and future analysis, is really, really powerful. And so we had the LangChain agent remember something, and then we're able to view it in the DataHub UI as well as use it in Ask DataHub and with other agents in the future. And with that, I would like to hand it back to John to wrap up our demo. Hey, John. You're muted, buddy. Hey, guys. I got auto-muted or something, but I'm back. So I just wanna wrap up by summarizing the focus areas that we're currently working on for the context platform. So, obviously, Ask DataHub, which is an agent that helps you find data, understand impact, generate SQL, and take action using natural language. This is gonna be cloud only. The context graph, which allows you to create docs directly on DataHub, ingest unstructured docs from third parties, and then make them accessible to agents via semantic search. That'll be available in both open source and cloud. And then the agent build kit, which is a way to build your own agents with DataHub, on top of LangChain, Google ADK, CrewAI. It includes the DataHub MCP server, SDKs, and documentation as well. This is just the beginning.
Obviously, what Shirshanka presented at the very beginning is a big and bold vision, and this is our current focus, but you'll see it evolve over time as we make progress and start to add new concepts like agents and MCP servers and all of the great things that Shirshanka outlined at the very beginning. So when is all of this happening? Just a quick note I want to leave you with before we get off here. Ask DataHub is already live in DataHub Cloud. It's available in private beta. If you're interested in trying it out, just reach out to us and we'll get it turned on for you. The context graph is coming in December in open source release 1.4.0, which is the next release. Should be 1.4.0. The Notion connector will also be released in December, as well as the agent build kit, which is the SDKs, the MCP tools, and also recipes and example code for how to build on top of these things. And then more unstructured data sources will come in 2026. Alright. I want to thank everybody for sticking with us here, and I'll hand it back to Jen. Cool. That was fun. Hope everyone felt the same kind of excitement as we looked back at what we did this year, and also at what we're setting up to build. I was super excited to see some of the capabilities that we are starting to give back to the community: the ability to pull structured metadata and unstructured docs together in one place, and then unleash agents to go off and actually build interesting things together. That's gonna be really, really killer. As you might imagine, we are the first beneficiaries of a lot of this technology internally at DataHub, the company. And we are so excited that we're able to bring these capabilities to the whole industry at large. So thank you, John, for leading us through this amazing demo. And, Nick, congratulations on landing your first demo at town hall and your first presentation. I think that went really well.
And thanks to all of you who've joined today, and to everyone in the chat, Jars and Ben and Vipin and all of the folks that engaged. Thank you so much. See you on Slack. Let's get excited about building together. We have a lot of agents and agent infrastructure to build, and we'd love to get collaboration going on building a lot of these connectors. We'll, of course, open source a bunch of the work that we're doing and help you start integrating your unstructured data sources. But let's get building, and let's get building this context graph together. Amazing. Another year in review on the books. Thank you all so much. Happy 2025. I almost said '24. Whatever. Happy holidays. Safe travels to everyone. Take a little break. Not too long, we've got stuff to build, but take a little break. And, yeah, we'll see you on Slack. Bye, guys. Good one.