1. Data Lakehouses are here to stay
With Lakehouses as the central theme of the summit, several sessions focused on the concept and how it is evolving. The Lakehouse is the standard when building an end-to-end data platform for data engineers, data analysts, and data scientists. Recently, we have seen many improvements that make its technical implementation on Databricks even better. One interesting feature is the Photon query engine within Databricks, which delivers the fastest query response times on Databricks so far. This paper explains the engine in further detail. At Codit, our reference architecture for a data platform is also based on the Lakehouse and is fully compatible with the data mesh principles for building even larger data platforms. With these improvements, we can build better and faster solutions for our customers, using a simplified architecture.
2. Delta Lake 2.0
The complete Lakehouse principle is based on the use of Delta Lake, which brings structure back into your Data Lake. During the summit, it was announced that Delta Lake will become fully open source, with the aim of driving further adoption in the market. In the past, Delta Lake was closely tied to Databricks, to the extent that you could only use all of its features if you were using Databricks. However, we are now seeing other platforms, such as Azure Synapse, adopt Delta Lake. Starting with Delta Lake 2.0, others will also be able to implement all features, making it less dependent on Databricks. With this evolution, we hope that we can also onboard smaller companies onto a Lakehouse for their data platform.
3. Improved data governance
When building a data platform, governance is of utmost importance, especially when your dataset is growing exponentially. Knowing which data you have and understanding its quality is key to a successful data platform implementation. Databricks has now implemented Unity Catalog, a centralized, unified governance solution for all data & AI assets. This means you can build a catalog of your files, tables, and dashboards, but also of your ML models. Additional features, such as fully automated lineage across all your developed workloads (SQL, R, Python, Scala), are included as well, which you can read more about here. To extend this further, Databricks has partnered with Monte Carlo to improve overall data observability. Inspired by the proven best practices of application observability in DevOps, data observability is an organization’s ability to fully understand the health of the data in its systems. Data observability, just like its DevOps counterpart, uses automated monitoring, alerting, and triaging to identify and evaluate data quality issues. For further details on this, read more here.
4. Latest version of MLflow
If you want to do machine learning on top of your Lakehouse, then MLflow is the tool that covers the complete ML lifecycle end to end. During the summit, MLflow 2.0 was announced, bringing even more capabilities that will accelerate your ML solutions. The biggest new component in MLflow 2.0 is the introduction of pipelines. With this feature, it will become even easier to move your models from development into production. Being able to implement MLOps for our customers with fewer workarounds is a big improvement. The model monitoring capabilities are also improved, making it easier to check the performance of your models and evaluate whether they need to be retrained.
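To make the retraining decision mentioned above a bit more concrete, here is a minimal sketch in plain Python. It is not MLflow's monitoring API: the function name `needs_retraining`, the `tolerance` parameter, and the example scores are all hypothetical and only illustrate the idea of comparing recent model performance against a baseline.

```python
from statistics import mean
from typing import Sequence


def needs_retraining(
    baseline: float,
    recent_scores: Sequence[float],
    tolerance: float = 0.05,
) -> bool:
    """Return True when the mean recent score falls more than
    `tolerance` below the baseline score.

    Hypothetical helper for illustration only; the metric, the
    threshold, and the window of scores are assumptions.
    """
    if not recent_scores:
        raise ValueError("recent_scores must contain at least one value")
    # Higher scores are assumed to be better (e.g. accuracy or F1).
    return mean(recent_scores) < baseline - tolerance
```

In practice, the scores would come from whatever monitoring you have in place, for example metrics recorded per evaluation run, and the tolerance would be agreed with the business rather than fixed in code.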
5. Spark Connect
Spark was already well known as a unified engine for large-scale data analysis, especially thanks to its automatic scaling to handle ever larger data sets. However, we see that demands are changing and that edge computing is becoming more important in solutions. Spark is often too large to run on smaller edge devices, but now Databricks has announced Spark Connect. This is a client-server interface for Apache Spark, based on the DataFrame API, that decouples the client from the server. With this feature, developers are able to build solutions and gain access to Spark from any device.
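As an illustration of what this decoupling could look like from the client side, here is a minimal Python sketch. It assumes a PySpark release that ships the Spark Connect client (3.4 or later, installed with the `connect` extra) and a Spark Connect server that is already running; the helper names `spark_connect_url` and `count_even_ids`, as well as the host and port values, are hypothetical.

```python
def spark_connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port.

    15002 is the default Spark Connect port; host is an assumption
    that depends on where your server runs.
    """
    if not host:
        raise ValueError("host must be a non-empty string")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"sc://{host}:{port}"


def count_even_ids(url: str, n: int) -> int:
    """Count the even ids in range(n) on a remote Spark Connect server."""
    # Imported lazily so this module still loads on a device
    # where PySpark is not installed.
    from pyspark.sql import SparkSession

    # remote() creates a thin client session; no local JVM-based
    # Spark driver is started on this device.
    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        # The DataFrame operations are described on the client and
        # executed on the server; only the count travels back.
        return spark.range(n).filter("id % 2 = 0").count()
    finally:
        spark.stop()
```

Calling `count_even_ids(spark_connect_url("localhost"), 10)` would then run the query on the server, provided one is listening on that address. The client only needs the lightweight Python package, which is what makes this approach attractive for smaller devices.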
The summit was packed with many more features and announcements to help us build even better data solutions for our customers. If you are interested in starting or building on your data journey, contact us for a chat on the future of data platforms and how they can help your organization become more data-driven. You can also watch the summit on-demand here.
*Cover image from Databricks.