Name: August Town Hall: The Latest in DataHub Lineage
Uploaded: 2025-08-21T17:31:11.657Z
Duration: 1 h 3 min 31 s
Description: August Town Hall: The Latest in DataHub Lineage

Transcript for "August Town Hall: The Latest in DataHub Lineage":

Hello, everyone, and welcome to DataHub's August town hall. We're excited to have you with us today. Before we dive in, just a few quick housekeeping notes. Today's session is being recorded, and we'll share the recording afterwards so you can watch on demand. If you have questions throughout the session, pop into the Q&A section to the right of the chat, and we'll do our best to answer. Also, let us know where you're dialing in from. If this is your first time at our town hall, welcome. These sessions are all about sharing updates, learning from the community, and providing a chance to connect with the team behind DataHub. I also wanna give a special thank you to those of you who are regulars at town halls. We see you. We're grateful, and we love that you keep showing up and sharing with us. And, of course, a huge thank you to everyone here who supports open source. Your contributions, feedback, and participation are what make this community thrive. And if you haven't already, we'd love for you to join the DataHub community. You'll find a call to action in the navigation to sign up, and, also, Mackenzie will drop the link in the chat.

Let's get into the agenda. It's packed with great content. We'll kick off with John Joyce and Ryan Novakovsky of Demandbase. You'll hear Ryan's story about migrating to the Iceberg REST catalog sitting on top of DataHub, and the learnings along the way. We'll hear from Mike Burke in our champion spotlight. I had the chance to sit down with Mike to hear about his experience exploring DataHub MCP. Maggie Hays will take us home with a deep dive on lineage and the latest on our product road map, and we'll wrap with closing remarks from Shirshanka. We're thrilled to share all of this with you today, so let's dive into John and Ryan's session. It's packed with insights and lessons learned.
And the session runs about twenty-seven minutes, and trust me, every minute is worth it. So let's roll the recording.

Hey, everybody. Welcome to DataHub August Town Hall. I'm super excited to have Ryan Novakowski here with me today to talk about how Demandbase has deployed DataHub and also tried the new Iceberg REST catalog on top of DataHub. Maybe, Ryan, you can give a quick introduction of yourself, and then I'd love to get into some questions I have for you.

Yeah. Absolutely. Thank you guys for having me. So, I am the manager of our data systems team here at Demandbase. We oversee what we refer to as our unified data platform. The idea when we started building this was that it's our internal source of truth for all of our data at Demandbase. When we set out to do that, we built this entirely around Apache Iceberg, because we saw a lot of the value in the new lakehouse technologies, you know, the vast cost efficiency, the scalability, and all of that. So that has now expanded to all of our internal data at Demandbase. And with that, of course, we need a catalog, and we adopted the Iceberg REST catalog a short time ago and, more recently, have moved to DataHub as our Iceberg REST catalog.

Awesome. Yeah. It was really cool to hear that you guys were thinking about using DataHub for your REST catalog. I guess, can you just walk us through some of the key challenges or maybe pain points that you were facing prior to considering DataHub?

Yeah. Definitely. So the biggest one certainly was data discoverability. Right? Like, that is one of the biggest downsides of this Iceberg architecture: it is very distributed amongst teams. So there is no central Snowflake or central BigQuery or central database that everyone can go to, log in, and see everything there immediately. It just doesn't exist. But that's kind of a feature of this. Right? Like, we have this very strong separation of storage and compute.
We can store our data in any number of AWS accounts, in GCP, and in AWS. But because of that, it has this very, you know, distributed nature. So there's also no single place you can go and say, what data exists, who owns it, what should it be used for, what's the structure of it. It makes it much more difficult to do that. So we had very low data discoverability, where even though we were moving a lot of our data into this unified data platform and that was making it much easier to utilize that data, you know, access was streamlined, governance was streamlined, the knowledge of what data was there and what it should be used for was very, very difficult to come by. So that was by far our biggest challenge: okay, now that it is accessible, how do we democratize that information, and how do we enable teams to know what's there, know how to use it, and keep that up to date very, very regularly, obviously.

Yeah. Absolutely. And I think you've touched on this, but maybe we can take a step back and just go over how you came to Iceberg in the first place, you know, versus something like Snowflake where maybe you do have some of that discoverability built in.

Yeah. Certainly. So like I said, when we set out on this, and we were actually somewhat early to kind of the Iceberg trend, because we started building with this probably about March, maybe four years ago now. So before the wave had really kind of taken off. And when we set out on this, we knew that we needed to build this unified platform that was eventually gonna be used for all of our internal data. And one of the main things we knew we needed was cost efficiency. We have a ton of internal data at Demandbase. You know, our data is the backbone of our application. We knew that we would be getting even more.
We had just gone through an acquisition, and we knew that we were gonna continue to expand our data. And we knew that, in the past, some of these other data warehouse solutions have become quite expensive. So cost efficiency was a big one right off the bat, and that's what really drove us to, I guess, initially, a file-based solution. But then, looking at some of the drawbacks of that, we kind of immediately pulled back from it. And that, in particular, is because our very first use case for this was a change data capture (CDC) use case. Doing change data capture into raw Parquet files is very, very difficult. You have to do all of the management yourself. You have to do all of the reconciliation of the rows yourself. You have to keep track of everything. It becomes very, very cumbersome. So immediately, we were looking for how we can do that CDC use case, how we can implement it in an efficient and cost-effective manner. So very quickly, we moved to this lakehouse space with Iceberg, Hudi, and Delta. From there, it really was looking at the three of them and deciding which one was going to be the best for us long term. So we looked at a number of different things. Obviously, we looked at the functionality as is. You know, what could it do that day? We obviously needed MERGE INTO functionality, which most of them supported. But then, more broadly, what was the state of the project? How confident were we that we were going to be able to pick up this tool and use it? How confident were we in the long term of it? And that's what really drove us toward Iceberg. It had a flourishing community. It was truly open source.
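
To make the CDC point above concrete, here is a minimal sketch of the row reconciliation a team would otherwise maintain by hand on raw Parquet files, and which a table format's MERGE INTO takes over. The event shape, the `apply_changes` helper, and the sample rows are assumptions made for illustration; they are not Demandbase's pipeline or any Iceberg API.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change event: (operation, primary key, row payload).
ChangeEvent = Tuple[str, int, dict]


def apply_changes(table: Dict[int, dict], events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Reconcile an ordered stream of CDC events against the current table state."""
    result = dict(table)  # copy so the input snapshot is left untouched
    for op, key, row in events:
        if op in ("insert", "update"):
            result[key] = row        # upsert: the latest version of the row wins
        elif op == "delete":
            result.pop(key, None)    # tolerate deletes for keys we never saw
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


current = {1: {"name": "acme"}, 2: {"name": "globex"}}
events = [
    ("update", 1, {"name": "acme corp"}),
    ("delete", 2, {}),
    ("insert", 3, {"name": "initech"}),
]
merged = apply_changes(current, events)
print(merged)
```

With plain files, this logic plus the bookkeeping of which files hold which keys lives in your own code; with a table format, the same upsert and delete semantics are expressed as a single merge statement against the table.
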
It had all the functionality we needed, plus functionality they had on their road map that they've now delivered, stuff we were very, very interested in. And we felt like it was the right project for us at the right time. So that's how we picked up Iceberg, implemented it for that initial use case as we were doing this big migration, and since then have built this entire unified data platform around Iceberg.

Awesome. Yeah. It sounds like Iceberg was kind of the perfect fit for you guys at the time. It's funny that you say you were looking at using raw files directly and then found this emerging ecosystem of Hudi, Iceberg, etcetera, right at the right time. So before DataHub, you were using a different Iceberg REST catalog, presumably, and you decided then to move all of that into DataHub. Hopefully, I guess, to address that discoverability challenge that you had called out. Were there other things that prompted that move? Or, you know, what was the motivation behind making that choice?

Yeah. For sure. So our previous REST catalog was Tabular, which, as I'm sure you're aware, was acquired by Databricks about a year and a half ago now, maybe a little less than that. So with that came, obviously, uncertainty. You know, our catalog is now being acquired by a much, much larger company that has their whole ecosystem. Obviously, in the lakehouse space, it was, I think, a little bit of a contentious acquisition as well, as they are the creators of a competing format in Delta. So it really kind of opened up this question of, like, okay, what should we do long term? And a big part of it was that there was no, like, no-op path. There was no path where we just did nothing and it continued to work.
Databricks obviously wanted us to move to their catalog, the Unity Catalog, which in its own right is a great option, but there was no zero-effort path here. So that really allowed us to take a step back and say, okay, we have to do a migration anyway. Right? Like, there's no world where we get to just not do anything, which would have been so nice. So we have to do a migration anyway. What really is the right long-term path for us? And, you know, we looked at a bunch of different options. We went out and looked at a variety of open source options, Polaris and Lakekeeper and Gravitino. And this is such a new space. There are a lot of projects out there, but all of them are pretty new. Right? And pretty immature in a lot of ways. You know, I think Polaris only just cut their first major release version a few weeks ago, and many other ones don't have major release versions yet. And because of how critical the unified data platform is for Demandbase and how big our use case is, we were a little bit hesitant to go with an unproven open source option. Right. So we continued to look around and tried to understand, okay, what can we do here, and what would be good? And it had always kind of been in the back of my mind that, for a while, we had DataHub, but we were also using this other REST catalog. And we hadn't really gotten around to fully integrating all of our Iceberg data from our old REST catalog into DataHub using the DataHub integrations. So, you know, it always kind of struck me, like, well, maybe this would come around. Maybe, if we could have both here, that would be so much easier. We'd unlock so much by not having to do those integrations, not having to have that extra step of complexity.
And, somewhat serendipitously, I think about a month and a half, maybe two months before we started to execute on this migration, DataHub announced it's coming: Iceberg REST catalog support will be here soon. So I immediately reached out to you all. We were very, very interested in getting started on that, and I was really glad you all were able to support us with a pretty quick POC, even before I think it had been released. And it just made sense from there. You know? We didn't have to do that extra step, like you said, of doing the Iceberg REST catalog integration. And we got all of our data in DataHub for free, and we're able to start utilizing it now as our business catalog and get our schemas in there and start documenting them and all that.

And that is exactly what we had in mind, obviously, when we started to talk about supporting the Iceberg REST catalog in DataHub: hey, how can we actually get on the critical path and use DataHub not just as a discovery layer, but as a place that you actually keep as the source of truth for your data? So I guess you mentioned you were using DataHub previously, and you were integrating it with your existing Iceberg REST catalog. Can you give us a sense of what you were using DataHub for before the migration? Was it primarily data documentation, lineage, or how were you using the catalog?

Yeah. Primarily data documentation, and from, I guess, a few different angles. On my side of things, primarily as owners of the data platform, trying to make sense of all of the things that we have and help teams access that data and democratize it more. We also had a strong push from our security and governance side, being able to document what existed, where did it come from, and what data was stored there? Was there PII in it or not? What was our, like, retention period?
So there was a strong push there from our governance team around that data as well, and a lot of value and use cases there. But, yeah, from my perspective, it was kind of that business catalog, being able to document what was there and how to access it. You know? But with Iceberg, it was a little slow going having to do that extra step of that integration, and we had so much else going on that we did part of it, but didn't quite get all of our warehouses integrated.

Yep. That makes sense. So I guess you had to then take on this migration from the existing REST catalog to DataHub. I'd love to hear a little bit about how that migration process went. How many tables did you have to migrate, and how did you achieve it?

Yeah. For sure. I think we have, at this point, something upwards of 200 tables. So those are distinct tables. And there are three sets of those: dev, stage, and prod. And those are spread out amongst probably a dozen different warehouses. Each warehouse is generally owned by a team. Some teams will have more than one, depending on their use cases. So, like I said originally, a pretty distributed architecture. So there are a lot of downstream implications, a lot of teams we need to coordinate with, which was definitely the hardest part of this migration. The migration itself is relatively straightforward, which is really, really nice. The majority of it can be done by the team that owns the warehouse. The parts that need to be done by other teams reading from that warehouse can be done relatively asynchronously, because the old catalog will still work as long as it's pointing to a valid snapshot.
So it makes it kinda nice, but, yeah, it was this process of coordinating with all of these different teams, helping them with tooling around registering their tables across catalogs, helping them with access management. How do we make sure that we have one-to-one access management from our old catalog to our new catalog? How do we have the same kind of authorization? We actually built some things around DataHub to allow us to have the same authentication scheme as we had in our old catalog. So it was a lot of those pieces. A lot of heavy coordination, helping teams understand what they needed to do, helping them understand failure modes, building tooling around the process, and then building some additional functionality as well to help us have a one-to-one migration so we can continue to use it and be as seamless as possible.

And I guess how long did it take you to get through the migration of those, you know, 200 tables across the different environments?

Yeah. All told, it took us about two to three weeks, working with various teams. It was definitely a tighter timeline than I wanted. That was mostly pushed by our need to get off our old catalog. But it ultimately went fairly smoothly. We had a few hiccups here and there, but stuff that was relatively easy to get through. And all in all, it was a major success and has been great ever since.

And would you say you're now, like, in full production, and the migration is fully closed at this point?

Yep. Everything is fully in production. We've been in production since, I think, July. With all of our tables, all of our warehouses, everything is there.

Awesome.
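
The reason readers could follow asynchronously is easier to see with a toy model of a catalog as nothing more than a pointer from a table name to its current metadata file. The sketch below is only an illustration under that assumption: `PointerCatalog`, `migrate`, the table name, and the `s3://example-bucket/...` paths are all hypothetical, and a real migration would go through an actual catalog client rather than an in-memory dictionary.

```python
from typing import Dict, List


class PointerCatalog:
    """Toy catalog that only tracks each table's current metadata file."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tables: Dict[str, str] = {}

    def register_table(self, identifier: str, metadata_location: str) -> None:
        # Registration copies no data; it only records where the metadata lives.
        if identifier in self._tables:
            raise ValueError(f"{identifier} already exists in {self.name}")
        self._tables[identifier] = metadata_location

    def commit(self, identifier: str, new_metadata_location: str) -> None:
        # A write swaps the pointer to a newer metadata file.
        if identifier not in self._tables:
            raise KeyError(identifier)
        self._tables[identifier] = new_metadata_location

    def load_table(self, identifier: str) -> str:
        return self._tables[identifier]

    def identifiers(self) -> List[str]:
        return list(self._tables)


def migrate(old: PointerCatalog, new: PointerCatalog) -> None:
    """Register every table from the old catalog in the new one."""
    for identifier in old.identifiers():
        new.register_table(identifier, old.load_table(identifier))


old_catalog = PointerCatalog("old")
old_catalog.register_table(
    "sales.accounts", "s3://example-bucket/accounts/metadata/v2.metadata.json"
)

new_catalog = PointerCatalog("new")
migrate(old_catalog, new_catalog)

# Writers move to the new catalog first; readers can follow later.
new_catalog.commit(
    "sales.accounts", "s3://example-bucket/accounts/metadata/v3.metadata.json"
)

print(old_catalog.load_table("sales.accounts"))  # older metadata, still readable
print(new_catalog.load_table("sales.accounts"))  # latest metadata
```

In this model, a reader still on the old catalog sees slightly stale data rather than an error, which holds only while the snapshot behind the old pointer has not been expired.
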
So it sounds like what you really did is you kinda took this business catalog that you had in DataHub and this more technical catalog that you had in your existing REST catalog for Iceberg, and you kinda merged them into one operational catalog. Oftentimes in the industry, people talk about these things as separate things, you know, business catalogs and technical catalogs. Do you believe this is a meaningful distinction? Like, why or why not?

Yeah. It's a great question, and it's such an interesting one, because I see, especially, LinkedIn posts all the time talking about data catalogs and talking about the uses of them. And even prior to DataHub supporting the Iceberg REST catalog, I would see articles comparing DataHub to Polaris and to these other ones. And while they are in spirit somewhat the same, they really, especially at that time, were so different in their usage, so different in the problems that they solve. So I think in principle, yes. Like, there is certainly a distinction in terms of functionality and, like, what they are for. But I think in practice is where that gets really interesting. So in principle, absolutely. A business catalog, at least in my view of things, is much more of a UI application. It is a place that you go to search your datasets. It is a place that you go to write documentation about them. It is a place where you go to catalog quality about your datasets, catalog your properties about them, all of these things. And then a technical catalog is, as in its name, much more on the technical side of things and really about the operation of the data itself. So disambiguating writes, everything that Iceberg can do with ACID transactions, managing snapshots, everything that Iceberg can do with time travel, is really what that technical catalog is for.
But, you know, as you kinda talk about that, those two are completely disjoint functionalities. Right? So there is no reason why they can't be the same tool, and there's a lot of really good value when they are the same tool. And that's why it's been so nice utilizing DataHub in this way as both, because now we are in a place where, when we make any change to any of our tables in our unified platform, it is immediately available in DataHub. Teams can see it. Teams can see what the change is. As we get into more advanced functionality, like safe schema changes and things like that, it'll be super helpful, and it really allows us to have everything in sync at all times and never have this kind of lagging problem with information. So they are, I think, in principle different things, but when you can have them be the same, and have it all in one place, it provides a lot of additional value.

And did you see any change in how people's workflows worked before the migration and after the migration, with respect to some of the things around data documentation and some of the human context around the data? Did it feel like having them in the same platform has made for a lower-friction experience for your end users?

Yeah. Definitely. And I think for a number of different reasons. Especially the initial stuff, a lot of it was nontechnical users talking to technical users. Because previously, a lot of that was done very ad hoc over DMs, you know, in Slack, and it just got lost in translation that way. Right? You have this conversation. Someone calls a dataset by one name, then they go and talk to a bunch of other people, and they're asking about it, and they're not really sure what you mean or where it's coming from.
And there was really no way previously for nontechnical users to self-serve a lot of this information, because, again, because of the distributed nature, you had to deeply integrate with these technical tools to understand and to find what was there. Whereas now, what I've seen in the last month and a half is that nontechnical users and technical users are both able to go into DataHub, and we're speaking the same language. We're talking about the same datasets. They can see what is in our unified data platform, what is in Iceberg. They can also see where they've been using things that are not in the unified data platform. We had a number of processes that were using raw BigQuery tables, and that was fine a year and a half or two years ago for us. But now, when you're asking things like, well, how can I expand this? Or how can I operationalize the delivery of this data? Or, this other team needs access to it, how do I give them access to this BigQuery table? The answer is, well, it's not in the unified data platform. We really need to get this into Iceberg and get it into the UDP so that it can be accessed in all of the great ways that we already have. So it has made it really, really easy for teams to have this single place to go and understand what is there, what are they using, how is it supposed to be used, see documentation about it, and all of that stuff, which, particularly for, like I said, that nontechnical-to-technical user interaction, has been a huge leap forward for us.

Yeah. That's great to hear. I think especially the point around having that consistent language or way to talk about the data and the concepts around them.
Oftentimes, we talk about DataHub as an accessibility plane, really, for data from dbt and all of these technical tools that maybe your less technical business folks are not deeply familiar with or in every single day. How do we bring those to a place that's way more accessible? And I think DataHub is kind of that place. I'd love to maybe share with the audience a quick demo of what you guys have done with the Iceberg REST catalog implementation. If you wouldn't mind walking us through that, that would be great.

Yeah. Absolutely. Let's go ahead and share here real quick. We'll jump into our dev instance.

Awesome.

Alrighty. So this is, like I said, our dev DataHub. Most of what we have here right now is Iceberg, our dev, you know, unified data platform. So we've got, yeah, just shy of 300, 269 different tables and such. And these are all of the different warehouses that we have across various different teams. So immediately you have, like I said, the single place where you can see all the data we have. You can see the names of the warehouses. You can see the structure of them. As we build this out, we're able to then tag them and tag the different types of data that are part of this, who owns it, all that great stuff. And then, coming in further, we're able to also see different aspects of the dataset itself, you know, different stats and whatever else. If we go to this one here, we can see that now we have a great place to view our schema. What are the types? What is this partitioned on? The partitioning is one that I wanna call out, because this has been such an important aspect when using Iceberg, because, at the end of the day, Iceberg is just files on S3. Right?

Right. Yep.
Which, by and large, if you do it in a naive manner, can be really cumbersome and really costly to query, because there's not a lot of ability to smartly prune the data that you're retrieving. But one of the powers of Iceberg is around partitioning and around your ability to have more efficient queries. And the heaviest hammer of those, and the easiest to use, is partition keys. So having this here... before we had this in DataHub, I can't tell you how many times I had conversations with teams where they come to us and say, this query is running so slowly, and I just don't know why. It seems to be taking forever. We look at their query, and I'm like, well, you're not filtering on any of your partition keys, so of course it is. It's scanning tons and tons of data, way more than you need. Whereas if you put some simple predicates here on your partition key, this is gonna be way, way faster. So this is a pretty simple one based on tenant ID. We have some other tables as well that are partitioned on other columns, including transforms on other columns, like some of our date-based data. And it's been hugely helpful having this just in DataHub, where teams can come here and see, oh, that's a partition key, I really should be querying on that. It's also been great to have all of the different statistics here, you know, row count, and seeing things increase slowly over time, especially in our dev environment. And we're definitely looking forward to leveling up our usage and using some of these things like assertions. A big one for us as well is storage size, in particular because one of the biggest problems that I think a lot of places run into with Iceberg is either not having maintenance or not having correctly configured Iceberg maintenance.
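
The effect of filtering on a partition key can be sketched with a toy scan planner. This is a simplified illustration, not how any particular engine plans scans: the `DataFile` record, the `plan_scan` helper, the file sizes, and the `s3://example-bucket/...` paths are assumptions, with `tenant_id` borrowed from the table shown in the demo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    path: str
    tenant_id: int  # partition value shared by every row in the file
    size_mb: int


def plan_scan(files: List[DataFile], tenant_id: Optional[int] = None) -> List[DataFile]:
    """Return the files a query must read, pruning by partition value when possible."""
    if tenant_id is None:
        return list(files)  # no predicate on the partition key: read everything
    return [f for f in files if f.tenant_id == tenant_id]


# 100 tenants with 4 files of 128 MB each (made-up numbers).
FILES = [
    DataFile(f"s3://example-bucket/events/tenant_id={t}/part-{i}.parquet", t, 128)
    for t in range(1, 101)
    for i in range(4)
]

full_scan_mb = sum(f.size_mb for f in plan_scan(FILES))
pruned_scan_mb = sum(f.size_mb for f in plan_scan(FILES, tenant_id=42))
print(full_scan_mb, pruned_scan_mb)
```

In this made-up layout the unfiltered query reads every file, while the query with a predicate on the partition key reads only the four files for that tenant, which is the difference the slow-query conversations kept coming back to.
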
And one of the only ways that you can really find this, and particularly find maybe one of the most insidious parts of it, not having snapshot retention properly configured, is that the storage size of your bucket is going to increase and increase and increase exponentially over time. And it's where we found some of the biggest waste, money-wise, for us when it comes to Iceberg: we don't have snapshot retention, or we have really long retention, and the storage size grows over time, and suddenly we're paying for terabytes of data for a table that's really a gig.

Yep.

So having these kinds of things here and being able to see them visually is really, really nice and such a huge advancement for us. And then, as well, being able to see the other parts of the table, you know, being able to see our table properties. This is an aspect of Iceberg that is, at times, I think, a little opaque. You know, not really being able to tell, like, okay, well, what is the partition limit, or what is the compression of this table? Being able to have this here and see it in the application makes it really, really easy then to understand what's going on with the table, what's the most recent snapshot of the table, and really self-serve a lot of that information, even technical information, out of the same catalog as we're serving all of our nontechnical users and everything else.

So it sounds like you're positioning DataHub inside of Demandbase as sort of the starting point for finding the right data to use for a given use case. And then getting that next level of context, whether it's the technical context or some of the semantic context, right, about how to build queries and what are the right columns used for partitioning and that sort of thing. Awesome. Yeah.
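
A back-of-the-envelope model shows why missing snapshot retention inflates storage. It is a deliberately simple sketch: the `stored_gb` helper, the assumption that every commit rewrites a fixed amount of data, and the numbers are all illustrative, not measurements from Demandbase.

```python
from typing import Optional


def stored_gb(
    live_gb: float,
    rewritten_gb_per_commit: float,
    commits: int,
    retained_snapshots: Optional[int] = None,
) -> float:
    """Approximate storage held after `commits` commits on top of the initial load.

    Each commit that rewrites data leaves the replaced files on storage until
    the older snapshot that references them is expired.
    """
    if retained_snapshots is None:
        old_snapshots = commits  # nothing is ever expired
    else:
        # Keep the current snapshot plus (retained_snapshots - 1) older ones.
        old_snapshots = min(commits, max(retained_snapshots - 1, 0))
    return live_gb + rewritten_gb_per_commit * old_snapshots


# A 1 GB table fully rewritten every hour for 90 days (made-up workload).
hourly_commits = 24 * 90
print(stored_gb(1.0, 1.0, hourly_commits))                         # no retention policy
print(stored_gb(1.0, 1.0, hourly_commits, retained_snapshots=24))  # keep one day of snapshots
```

Under these assumptions the unmaintained table holds a little over two terabytes for one gigabyte of live data, while keeping a day of snapshots caps it at 24 GB, which is why a storage-size chart is a useful early warning.
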
Do you have any, like... I guess, what are the next steps for this integration from your point of view? What would you like to get out of DataHub that maybe you're not getting out of it yet, or features you're not taking advantage of yet?

Yeah. Absolutely. So, like I said, definitely building out more of the quality aspects is, I think, the next big thing for us. Now that we can really clearly see all of the datasets that exist, it's making it much easier for us to start to build out our version of the medallion architecture, or, I think it came out of Airbnb, the Midas standard, you know, different ways of categorizing the health of your data, the quality of it, what should and shouldn't be used by other teams, and really moving in that direction. Now that we have a place where everything is documented, we can enhance all of that with this additional information about quality, about lineage, additional statistics, additional tagging and glossary terms for these columns, to help us with, like, relationships between datasets and what specific columns mean. So really then moving into all of that additional documentation.

Awesome. Yeah. And looking forward, just continuing on that theme, if you could wave a magic wand and add one feature or capability to what you have currently set up, what would it be? What else could DataHub ideally help Demandbase with?

Yeah. I think the next big thing for us is really that automated lineage would be amazing. You know, the ability to have the Iceberg client, either itself or with a plug-in, be able to report, when it's doing a write, what Iceberg datasets came into that final data frame, and being able to track that automatically.
You know, even going so far as column- or row-based lineage would be amazing. But even just at the dataset level, this is something that, I think, we have really struggled with in the past: how is this data being used, who is using it, and being able to get that directly now that, like we said, we have our technical catalog and our business catalog in one place. Being able to see both of those things, and being able to have that lineage be automatic for our Iceberg tables, would be incredible.

Awesome. Well, Ryan, thank you so much for sharing this with all of us. I think this will be very, very informative to everybody in the community. So I really appreciate your time, and we appreciate you being a great partner along the way here. I know we had a few hiccups as we rolled this out, but it was really great working with you to get this all the way to production. So thank you.

Yeah. Absolutely. Thank you so much, John. And, yeah, a few stumbles along the way, but nothing that the DataHub team couldn't take care of. So I really appreciate all your help here. And it's been great. I'm looking forward to using the tool even more.

Awesome. Alright. We're gonna head back to the regular agenda for DataHub Town Hall. Thank you all for tuning in.

Thank you so much, Ryan and John. What a phenomenal session. For our next segment, I'm excited to spotlight another member of our community. I had the chance to sit down with Mike Burke, a DataHub champion and senior developer, to hear about his journey exploring DataHub MCP. Mike's been using DataHub since 2023, joined the community shortly after, and has been an active data champion for a year now. He's been such a thoughtful contributor, and it was a real pleasure learning about his experience. Let's take a look.

Thank you so much, Mike, for joining me today to chat about MCP and just your experience exploring DataHub MCP and what it's been like for you.
Yeah. So prior to this project, I hadn't worked on MCP at all, and everything MCP-related was new to me. I hear the words "MCP server," and right away I think, well, is it a server? Is it a container? Where does it run? What does it do? Just trying to understand the architecture to begin with was quite a bit of fun, and then working through it from start to finish was also a really interesting process.
I know that you and I had a chance to connect for a few minutes before our time today, and I'm curious. When you first started working with the MCP server, you mentioned being excited and cautious. What were some of the security considerations that you had in mind, and how did you address those?
Yeah. Excited and cautious is how I start most projects. What I found interesting with this was just trying to understand how everything works. I usually start at the very beginning and try to understand how the architecture works. I think it was a couple of town halls ago that there was a demo that showed how everything worked, and it looked extremely cool, but it was also pretty slick. So understanding all the pieces from left to right was pretty cool. As far as the security went, you go through and try to understand where everything's running and what you're giving access to, and just work through it.
Awesome. And I know that you're very much a self-study person, and you took it upon yourself to really upskill your knowledge of MCP through courses and self-study. Were there any resources that were extra helpful that you might recommend to the community?
Yeah. YouTube is a great place to look for this stuff. MCP is pretty new, so there wasn't a lot of really good content, but there were a few really good videos on how you can use MCP.
There were good overviews of the clients, whether you use Claude or Cursor or something else. And I did take one Udemy course, and it was also really good. It was pretty in-depth, about six hours or so, but I really liked it. I thought it covered the topics well. And for a new product, it's not easy to find training, so I was pretty happy with that.
Well, you're clearly thinking beyond just getting MCP to work, and I know that you're exploring how to potentially productionalize this. What are your steps in moving from a staging setup to more of a secure and scalable production setup?
Yeah. So for this, the POC was just: hey, can I get it running on my laptop? Does it work? What's the user experience gonna be like? And then from there, it's: where does it actually live? I think of just an easy server, or maybe a container, or maybe part of where the application is running. So there's still work to do there. There are a lot of best-practice documents on MCP architectures online, so it's a lot of: how does this work? How do I wanna support it? How is it gonna stay running? So far so good, but yeah, there's work to do.
And you described your current implementation as pieces on a workbench. When I think of this concept of all these different opportunities that you have to bring in tools in your day-to-day as a professional, can you walk us through what you've set up so far?
Yeah. So for this, I started with just the AI client. I use Claude Desktop, and I really liked it. The free version does give you quite a few messages, but I would kind of go in stages. In the morning, if I was working on this, I would work on it consistently for an hour or two. And if I ran out of messages, it would give you a four-hour wait, but you could still work on it again afterward.
So it gives you a couple of hours to do something else. I thought the way the engineers tied in the authentication token from DataHub was really good. It does give you some security: you have to generate a token, insert it into your JSON, and insert that into the AI client. I thought that was pretty good. There are probably more layers of security that could be added, but I thought that was a really nice start. And that's probably one of the next steps for me: how much is enough before you can turn it over to users?
Very cool. One thing that we talked about last time is that there's so much to DataHub that there are a lot of different paths you can take, and it could be overwhelming, especially for non-engineers. How has the MCP server and AI assistant experience changed the way you might explore metadata as an engineer?
What I really liked about it is, you know, the DataHub UI is really good. The refresh that was done this year has really gone over well. But what I would say is, when you load up DataHub with a lot of data, there are lots of screens. There's lots of pointing and clicking. You go into those lineage graphs, and there's lots to look through. And sometimes instead of clicking and pointing, you can just have a conversation. I really like that because it's sometimes easier to just stop and allow your brain to process a little bit. Whether you're an engineer or a non-engineer, having a conversation and not clicking and going through stuff is nice sometimes.
If you had one piece of advice for someone in your position who might want to get started with the DataHub MCP server, what would you share with them?
I would say try out the AI clients first. I think most people use AI in some part of their life somewhere. Try out Claude, try out Cursor, find something that you're comfortable with, and then start from there.
And the DataHub documentation is really good. It covers the basics and it gets you on your way. And if you run into trouble, hop on to Slack and talk to the community, and I'm sure someone's been where you've been before.
Thank you, Mike.
You bet.
I'm on mute. Hello. There we go. Thank you so much, Mike and Jen, for that awesome conversation. We're gonna move on to our next topic. For those of you I haven't met, I'm Maggie, the founding product manager over here on the DataHub team. A couple of sessions for you today: I'm gonna go through what you can do to make the most of lineage in DataHub, and then we'll also do a deep dive into our roadmap. So let me go ahead and share my screen, and we should be good there. Alright. For folks who have maybe never seen DataHub before, one of the areas where we really shine is our data lineage. Data practitioners know that the cheaper and faster it is to produce and store more data, the more dependencies we have and the more complexity we build for ourselves. So our data pipelines are growing in number and also in complexity across various platforms. Each of those platforms tends to have its own source-specific logic or language. Maybe they do a great job of helping you understand the dependencies within that tool, but getting that 30,000-foot view can be really tough. So data pipelines in general, and our modern data practitioner workflows, get super messy, super fast. But the reality is DataHub makes it super easy to tackle that. Currently, DataHub supports over 70 ingestion sources. I did a quick scan yesterday: for more than half of our sources, we automatically extract and produce lineage within each platform and across platforms. We've also deeply invested in our own ability to parse and identify those lineage connections.
So our own, kind of proprietary, lineage parser has up to 99.5% accuracy. The first time I saw that number, I thought it was a typo, but it's not. The reason it's so precise is that DataHub has all the context about the physical structure of datasets, so we're able to interpret the dependencies from system to system. As one point of comparison, there are a couple of other open source lineage extraction tools out there. We did a benchmark within the last year and found that our lineage accuracy was up at 99.5%; best case, they were hitting about 80%. What we've seen time and time again is that lineage is only useful when it is as complete as possible. All that is to say, DataHub goes really broad, with 70 or more connectors, and really, really deep, so we're able to extract all the nuance and edge cases that make data pipelines really complex. The other thing I'll call out is that DataHub is backed by open source. To date in our code base, we have over 600 code contributors, I just looked yesterday and it was 664, and more than half of them are contributing back to our ingestion project. So what does that mean? While we are the central DataHub team backing the project, we are also small in numbers. But we have this tremendous community behind us that can help by contributing new ingestion sources or by providing subject matter expertise on how each source fits into modern data stacks and how we should be modeling that within our stack. So in terms of lineage in DataHub, why do we care about this? Right?
So like I said, DataHub does a fantastic job of automatically extracting and detecting that lineage, so you don't have to go in and manually instrument or define it. At the end of the day, once you have that cross-platform view of all of the interdependencies, it's so much faster to start resolving data quality issues as they pop up. On the flip side, in your day-to-day workflows of building data resources, wherever those might exist and whichever tool it might be, DataHub makes it super easy to understand the impact of breaking changes. So if you're dropping a column, changing a column type, or changing the underlying business logic of how something is derived or calculated, DataHub makes it super easy to understand the ripple effects of that downstream. The last thing I'll call out is that in DataHub Cloud, we also have the ability to automatically propagate your metadata enrichment across those lineage edges. If you've been in the DataHub community for a while, or if you've ever seen me in one of these videos, you've probably heard me talk about shift left. We talk about shift left as the principle that, in a lineage graph, you wanna move your documentation as close as possible to the source where data is produced. So instead of treating documentation as an afterthought, you wanna make sure that you're incorporating the annotation and enrichment at the source. Once you've shifted left, you've already done the hard work of documenting and annotating your resources. In DataHub Cloud, we have automations that propagate that enrichment across the lineage graph. You can think of it as "don't repeat yourself," or DRY, documentation. That really just leverages that amazing and rich knowledge graph.
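As a rough illustration of the propagation idea described above, here is a minimal Python sketch. It is not DataHub Cloud's implementation and does not call any DataHub API; it assumes a toy column-level lineage graph held in a plain dictionary, and the column names and the `propagate_terms` helper are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: each upstream column maps to the
# downstream columns derived from it. All names are invented for illustration.
COLUMN_LINEAGE = {
    "raw.customers.postal_code": ["staging.customers.postal_code"],
    "staging.customers.postal_code": [
        "marts.customer_details.postal_code",
        "marts.shipping.zip",
    ],
    "raw.customers.signup_date": ["staging.customers.signup_date"],
}


def propagate_terms(lineage, source_column, terms):
    """Copy `terms` from `source_column` to every column downstream of it.

    Returns a dict mapping each downstream column to the propagated terms
    and the column they were propagated from.
    """
    propagated = {}
    queue = deque([source_column])
    seen = {source_column}
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            if child in seen:
                continue  # guard against cycles and diamond-shaped graphs
            seen.add(child)
            propagated[child] = {
                "terms": sorted(terms),
                "propagated_from": source_column,
            }
            queue.append(child)
    return propagated


if __name__ == "__main__":
    result = propagate_terms(COLUMN_LINEAGE, "raw.customers.postal_code", {"PII"})
    for column, info in result.items():
        print(column, info)
```

The point of the sketch is the "document once at the source" pattern: the annotation is written on one upstream column and every column reachable through lineage edges inherits it, along with a record of where it came from.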
So I'm flying against guidance here, and I'm gonna do a live demo, so we're gonna actually see what this looks like in DataHub. So say some prayers to the demo gods for me, and we will jump on into this. Alright. Let's first talk about data quality. Let's say I'm the owner of this dataset called pet details, and I'm getting reports from teammates, or maybe my stakeholders, that there's something going on with the status value: we're not seeing the volume of adoptable pets that we would expect. In DataHub, from your home page here, you go to our lineage visualization, and this gives you a really nice visual way to understand how everything is connected. We also make it really easy to automatically expand out all of the edges of your lineage graph. So you'll start to see that this pet details table that we started with has dependencies down in Looker, and it goes on and on. We also have some dependencies in Power BI and additional Snowflake tables as well. So in terms of navigating that lineage graph, we make it really easy to expand out and navigate all of the various stages of transformation. But let's talk about that status field. Something's going on here. When we're looking at our pet details table, we don't have any active incidents on it, and all of our assertions, within DataHub Cloud, are passing as expected. So what could be going on? What's really great about our lineage coverage is that we automatically derive all the column-level lineage as well. If we take a look at status, we can see that it's derived upstream from this table called pets.
Just at a glance, what I can see, number one, is that there's an active incident and a DataHub Cloud assertion is failing. But then I can also see that this asset has been deprecated. So instead of me having to trace through code bases to figure out how this one dataset was derived (okay, it came from these five dbt jobs, which sit on top of these four tables), I can just go in and see right off the bat that, hey, first of all, this dataset's been deprecated, so it likely hasn't been updated. And I have a much stronger foothold in understanding the root cause of that data quality issue. So let's talk about how we move into proactive treatment of this. Let's say I need to go in and change the status here. Maybe before, the values were "up for adoption" and "adopted," and now the statuses are gonna be, I don't know, "in shelter," "placed in home," and "in foster care." That's gonna have a ripple effect downstream. As you saw when I expanded this out on the right-hand side, there's a ton of downstream resources, but I don't know how many of those are using that field or how many of those I should really be concerned about. So another thing that we can do in the DataHub UI is go into our impact analysis tab and start looking at either the upstream or downstream dependencies, with various degrees of dependency. With this quick view, I can now see that there are 14 downstream assets. And if I want to, I can hone in on that status field and see which of those 14 resources are leveraging it: we have seven of them. So now I have a concrete list of resources that I know I need to act on, and I can go from there. For the sake of time, I'm gonna go a little bit more quickly.
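The difference between "everything downstream of the dataset" and "only the assets that actually use the status field" can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the impact analysis tab is implemented: the two lineage dictionaries, every asset name other than `pet_details` and `status`, and the `reachable` and `impact_of_column_change` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical dataset-level lineage: asset -> direct downstream assets.
DOWNSTREAM = {
    "pet_details": ["pet_profiles", "adoptions_report", "shelter_dashboard"],
    "pet_profiles": ["pet_status_chart"],
    "adoptions_report": [],
    "shelter_dashboard": [],
    "pet_status_chart": [],
}

# Hypothetical column-level lineage:
# (asset, column) -> downstream (asset, column) pairs derived from it.
COLUMN_DOWNSTREAM = {
    ("pet_details", "status"): [
        ("pet_profiles", "status"),
        ("adoptions_report", "adoption_status"),
    ],
    ("pet_profiles", "status"): [("pet_status_chart", "status")],
    ("pet_details", "name"): [("shelter_dashboard", "pet_name")],
}


def reachable(edges, start):
    """Breadth-first search returning every node downstream of `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


def impact_of_column_change(asset, column):
    """Return (all downstream assets, downstream assets that use the column)."""
    all_downstream = reachable(DOWNSTREAM, asset)
    column_hits = reachable(COLUMN_DOWNSTREAM, (asset, column))
    using_column = {downstream_asset for downstream_asset, _ in column_hits}
    return all_downstream, using_column


if __name__ == "__main__":
    everything, affected = impact_of_column_change("pet_details", "status")
    print(f"{len(everything)} downstream assets, {len(affected)} use the column")
    print(sorted(affected))
```

Walking the column-level edges instead of the dataset-level edges is what narrows the full downstream list to the shorter list of assets that need attention before the change ships.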
In DataHub Cloud, we do offer column documentation, glossary term, and tag propagation across your lineage graph. If you're interested in learning more about this, feel free to reach out and we can talk to you all about it. But it's basically just flipping a switch, and then as an asset is annotated, all of its downstreams are gonna be updated automatically. You'll notice when we go back to this resource here that postal code, for example, has a little lightning bolt, and it'll tell you that this is propagated and which resource it came from. This is really valuable for PII, GDPR, or compliance initiatives, so that you can have consistent documentation across the board. Finally, this is all well and good, right? But you're still pointing and clicking, and you have to go in and do this discovery on your own. Another option is for folks that are using our MCP server, and let me go ahead and make this a little bit bigger. On DataHub Cloud, we have a fully hosted MCP server as well as a fully hosted DataHub Slack bot, or discovery AI assistant. So I'm gonna go in here and say, "Hey, DataHub. I noticed that our customer lifetime value is high. Is it possible we aren't excluding returns?" DataHub is going to automatically start scanning across your metadata graph, and it'll update you as it goes to explain which resources it is taking a look at. I know I zoomed in pretty big here, but I wanted to make sure that this stands out. In the meantime, I'm also gonna kick off another one, to say, "If I update the logic or the calculation for customer LTV, what will be impacted, and who should I contact?" So you'll see that DataHub is very busy over here.
All of the metadata that we've ingested and all the connections that we've made between resources are candidates for DataHub to evaluate. We're also looking at ownership. We're looking at classification. We're looking at which domain it lives within, what its status is, how recently it was updated, and how it's queried. Basically, it's training off of everything that we know about this in our knowledge graph. So, once again, I asked: our customer lifetime value is high; is it possible we aren't excluding returns? DataHub comes back and says that in this explorer, we see this field called is returned, but there's no indication that returns are being excluded from that calculation. What's really cool here is that it gives you some recommendations on which pieces you can start looking at and validating. It will also give you some suggested follow-up questions. And as you're working in this thread, you can ask follow-up questions; it's no different than having a conversation with ChatGPT or Cursor or wherever you're working. So once again, I'll call out that the Slack bot itself is available for DataHub Cloud, but if you have your own deployed MCP server, you can absolutely hook this up with Cursor or anywhere that you have agents running. Last but not least, I'll show the impact analysis side of things. Again, I asked what happens if I update the calculation of this field. Keep in mind, it's looking at that column-level lineage. So it's not just listing anything that ever touched the one dataset that this calculation is derived from. It's actually looking at the field-level information.
So it's gonna list out which resources should be evaluated, or which resources are dependent there. It also does a really great job of telling you which groups you might wanna reach out to and which individuals are owners of those resources. So long story short, and I'll be honest, folks, we've been using this internally. I'm also heading up our data team, and my teammate and I are building out dbt models galore. We use this on a daily basis, and I'm personally blown away by how much it's helped us move so much more quickly, but also identify and act on issues as they come up. It's been unbelievable. So, a little bit of a shameless plug there: I'm also a very happy user of it. Alright. We are getting close on time. And, yes, the demo gods were with us today. Thank you. I spilled a full iced coffee on the ground this morning and didn't get a single sip of it, so maybe that was my sacrifice to the demo gods. Alright. Let's talk about the roadmap. I know that we are getting close on time. I apologize, but that lineage stuff is just too good to skip through. If you've been to one of our sessions before, you may recall that we break down our initiatives into four pillars: data discovery, data governance, and observability, and the last one is the metadata graph that fuels all three of those. Starting with discovery, the key focuses this year for us are human-centered insights, so making sure that human context is available next to the physical or technical context; encouraging intelligent exploration; and then, of course, our end-to-end lineage. On the lineage extraction side of things, these are the data sources that we have coming up.
So we're starting to work on RudderStack and Snowplow. We have Azure Data Lake coming up as well. Hex users out there, just a reminder that we shipped a connector for that earlier this year. We're also using Hex internally, so I can vouch for that one. It's been really awesome to see that in there. In terms of improving how we navigate these lineage graphs, we're starting to look at hierarchical lineage, so showing the parent and child, or container-based, hierarchy to understand different levels of nesting. If John hasn't already, he's gonna be posting in announcements in DataHub Slack. We're looking for feedback and contributors or design partners here. So if you're interested in partnering with us and helping us shape what that looks like, please do let us know. The other thing is, I've officially kicked off discovery for our metrics catalog. If you are using or implementing an external metrics catalog, or are looking for a native implementation, this is something that we will be doing discovery on over the next couple of weeks, hopefully kicking off development in the next month or so. I'm also looking for feedback on that one. I think John also called that out. Another reminder: our MCP server is live. For DataHub Cloud customers, we have a fully hosted version. DataHub Core is self-hosted, but everything's available for you. Just so you know, a lot of the investment that we're putting into this is, one, relevancy or prompt tuning, so that we're returning the most relevant details from our metadata graph.
But then we're also taking an extremely scientific approach here to make sure that we have a high level of precision, and that we're actually collectively getting value out of this and not just some convincing LLM results. It sounds like we've also had a lot of really great contributions coming in from the community. We're so appreciative of folks giving us that feedback and those contributions, and we're methodically parsing through them and continually refining that resource. This is what I just gave you a little demo of: our Slack AI discovery assistant is available in public beta for DataHub Cloud. Another thing that I wanted to call out is that on DataHub Cloud, we're currently in private beta for a feature where you can customize your home page. We know that every single organization is truly like a snowflake: no two organizations are alike in terms of how they manage their data stack or how they think about data. So this is just a really great way to curate that landing page for users. If this is something that you're interested in on DataHub Cloud, please do reach out. On the governance side of things, our key focuses this year are on continuing to ensure that DataHub is set up to be a universal data registry that is also used for central compliance and policy enforcement. With that in mind, we are continually working on our bidirectional syncing ability between DataHub and external systems, making sure that the external system reflects what is presented within DataHub and vice versa. Another item that we're currently working on, and we should have some announcements for the community over the next month or so, is support for logical datasets.
So, a parent-child asset relationship. Again, this goes with the theme of document once and propagate that out. If you've defined a central model and you're materializing it across various places, this is a great candidate for you. John recently took the lead on introducing initial support for access request workflows in DataHub Cloud. This is a way to build native workflows within DataHub that allow end users to request access to datasets. So they've found the dataset, and it looks like this is the right one for them. This is where we can start generating and addressing data access requests. I also know I'm talking very fast; I just went a little too long on the first one. On the observability side of things, we are continuing to make sure that data observability is accessible, collaborative, and contextual. So what does that look like? Actually, I'm gonna skip over that one. Just recently, we rolled out improvements to our Python SDK, specifically for creating and managing assertions. This is really great for teams who are working with a high volume of assertions at scale, so you can programmatically create assertions as your data ecosystem evolves. The other thing we rolled out in DataHub Cloud is the ability to bulk-create field-level assertions. You can imagine you have a super wide dataset with hundreds of columns and you wanna create data monitors on multiple columns. Before, you'd have to go in and do them one at a time; we now support going into DataHub and creating those in bulk. Last but not least, we'll talk about our platform. We just recently rolled out some massive improvements to our quick start performance.
So for folks who are running quick start, you should see much faster build times and fewer failures due to ZooKeeper. Later this year, we're gonna be rolling out service accounts, so you can have those central keys for managing programmatic workflows. We also just rolled out some improvements to our ingestion run summary, so you'll have improved visibility into the outcomes of your ingestion runs. I'll be announcing this in our upcoming release to the open source community, and let's go. Thank you all for listening. Shirshanka, what do you think? Let me stop sharing my screen.
Well, clearly, DataHub is moving fast, and the updates are rolling in fast and quick. It was super exciting to see the MCP updates. Thanks, everyone, for joining us today. Of course, what we've shared today represents more than just features and timelines. It's really about empowering every one of you to build more trustworthy data systems and AI ecosystems. We're already seeing in the chat that a lot of people are building on top of DataHub: building interesting UIs for internal use, building internal applications. It's really cool to see all of this innovation in the community, and super exciting to see all of the innovation that we are able to power. It's also exciting to see DataHub evolving in real-world production environments, like we saw today with the DataHub REST catalog going live at Demandbase. I'm confident we'll be hearing more success stories like this from the community soon. And speaking of innovation, of course, we heard a lot about the DataHub MCP server. Across the board, we're seeing a lot of usage of the MCP server, both in the open source community and among our cloud customers: use cases around data understanding, use cases around data exploration, and all the way into data development workflows. So it's amazing to see how quickly this adoption has gone and where the applications of this can be.
But before I wrap up, I did wanna say: please mark your calendars for October 29. Context, which is our conference, is right around the corner. It's our biggest event of the year. We are bringing together leaders across FinServ, the public sector, and tech, of course, to share how they're solving the toughest data challenges for AI. You won't want to miss it, so save the date. There's a registration link that'll be shared very shortly, or you can just grab that QR code. And whether you're running open source or leveraging our cloud offering, you're part of this community, and we'd love for you to be part of this conference. A few other announcements. I think we mentioned this already: we just released v1.2.0. So check out the new release, and, of course, visit DataHub.com for updates. Let's go and make some data and AI magic happen.
Amazing. Thank you so much, everybody, for your time and your attention. I realized today that this is my fourth year of town halls. It's been a wild ride, so I appreciate you all for coming along with me. Wow. But, yeah, happy summer, everybody. Hope you had some nice time off, and we'll see you at the next one. Adios. See you.