
The Most Painful And Repetitive Job Of A Data Engineer

Why we should do something about JDBC


I remember that my first job as a data analyst was to use Microsoft SSIS, a GUI ETL tool. We had only one data source back then, which was plenty for the business use cases.

In today’s world, that sounds like a joke, or a utopia. Even a small business quickly ends up with many different services and places from which to consume, process, and analyze data.

You end up doing the most boring data engineering job: moving data.

How did we get into this situation?


Column-oriented databases are the most common place where we consume analytics today, aka cloud data warehouses (BigQuery, Snowflake, and co.). The same is true of today’s file formats, like Apache Parquet, Delta Lake, and Apache Iceberg. Arrow is, therefore, in the right place to be used alongside these tools.

Practical Tips to Get the Most Out of Arrow 🏹

Consider implementing the following strategies:

  1. Evaluate your existing data stack: Assess your current data stack to identify areas where Apache Arrow can be integrated to optimize data movement and processing. Determine which systems and tools are compatible with Arrow and can benefit from its columnar data format.

  2. Embrace open-source columnar file formats: Use formats like Parquet, Delta Lake, Apache Hudi, or Apache Iceberg to enable better data compatibility and interoperability.

  3. Leverage modern data tools: Choose modern data tools that support Apache Arrow, such as Polars, DuckDB, Apache Flink, or Apache Spark, to take advantage of its performance benefits.

  4. Stay informed about new developments: Keep an eye on Apache Arrow’s ongoing developments and improvements and its growing adoption in the data community. 

What does the future look like?

The future of database protocols is looking brighter than ever! While standard file formats do carry some performance tradeoffs, Arrow’s role as a proper interface for data has huge potential.

With its growing adoption, Arrow is expected to simplify moving data between different systems, minimizing the need for extra serialization and deserialization. Its columnar format makes data transfer efficient, and its support for multiple programming languages and platforms makes it incredibly versatile.

Soon, you’ll be able to spend less time on the mundane task of moving data and more time generating valuable insights for your business.

To quote Tristan, CEO of dbt Labs, during an interview I did last October: “I want Apache Arrow to take over the world.”

In the meantime, may the data be with you.
