Big data analytics is the new black in business, and these days you will hear talk of “data pipelines” around your office more often than hearing classic lines such as, “We broke production.” So what’s the trick, and why is everyone talking about it?
Evan Thomas, a Lead Software Engineer at Tilting Point, told us about the common tasks of data engineers, the specifics and arc of their careers, and together with his team, shared personal development resources that should help one’s professional growth in Data Engineering.
What does a Data Engineer do?
“The functions of data engineers depend directly on the specifics of the business but still have a lot in common. One of the main goals is creating systems for collecting and processing data. Data validation is also a big part of the process, as it is necessary before the data can be transformed and used in R&D and machine learning models by data science teams and business analysts. We do the things that help companies grow. From the technical side, it is necessary to create versatile solutions, not ones suitable for just one specific data source. Product flexibility and performance are also essential. The questions we are asking ourselves are: how can we work faster, how do we cut costs, how can we make the system more usable for different teams, and how do we ensure reliable pipelines built off of many data sources?” — Evan Thomas, Lead Software Engineer at Tilting Point.
What are the top technology trends in Data Engineering?
I would highlight two significant trends in the industry right now:
- Using tools like dbt (data build tool) that significantly expand who can build ETL pipelines and work with data. With dbt, data analysts are empowered to build their own ETL pipelines, while engineers are only responsible for operating the underlying infrastructure and bringing data into the system. This frees up time for engineers to focus on more complex ETLs and on building a data platform.
- Transition from Data Lakes and Data Warehouses to Lakehouses. The industry is moving away from centralizing everything in warehouses such as BigQuery, Redshift, and Snowflake toward more decentralized data, creating new systems that combine the best of the existing solutions.
Lakehouse: a new term in the industry for an architecture that combines the data structure and management controls of Data Warehouses with the economical storage used for Data Lakes.
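To make the first trend concrete: a dbt model is essentially a named SELECT statement that the tool materializes as a table or view. Below is a rough, hypothetical analogue of that idea in plain Python with SQLite — not dbt's actual API; the model, table, and column names are invented for illustration.

```python
import sqlite3

# A dbt "model" is, at its core, a SELECT statement that the tool turns into
# a table or view. This toy runner mimics that with SQLite.
MODELS = {
    # model name -> SELECT that defines it (names here are made up)
    "daily_revenue": """
        SELECT day, SUM(amount) AS revenue
        FROM raw_payments
        GROUP BY day
    """,
}

def materialize(conn: sqlite3.Connection, name: str, select_sql: str) -> None:
    """Materialize a model as a table, replacing any previous build."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_payments (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_payments VALUES (?, ?)",
    [("2023-01-01", 10.0), ("2023-01-01", 5.0), ("2023-01-02", 7.5)],
)
for name, sql in MODELS.items():
    materialize(conn, name, sql)

print(conn.execute("SELECT * FROM daily_revenue ORDER BY day").fetchall())
# → [('2023-01-01', 15.0), ('2023-01-02', 7.5)]
```

In real dbt, each model lives in its own .sql file, models reference each other with `ref()`, and the tool resolves the dependency graph — which is exactly why analysts who know SQL can own their own pipelines.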
What makes a good Data Engineer?
The main trait of a successful data engineer is the basic ability to solve complex engineering problems by breaking them down into simple, manageable parts. Due to the specifics of working with Big Data, it is important to design software with high adaptability to changes.
The primary soft skills, besides communication, are attention to detail and curiosity. As for the hard skills, the following are key for working as a Data Engineer:
- Proficiency in SQL.
- Experience in programming besides SQL. The most popular languages are Python and Scala.
- Working with databases – design, configuration, knowledge of the nuances of engines, troubleshooting. Popular DBMSs: MySQL, PostgreSQL, Oracle, MongoDB.
- Knowledge of task orchestration tools: Airflow, Prefect, Oozie.
- Understanding how large volumes of data are stored in Redshift, BigQuery, Snowflake, and Delta Lake.
- Expertise in data processing on a large scale – approaches and tools.
- The ability to understand how a specific business works, how the system makes money, and what is crucial for the business.
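As a small, self-contained illustration of the SQL fluency the first bullet refers to, here is a window-function query — a per-user running total — run through Python's built-in sqlite3. The `events` table and its values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "t1", 10.0), (1, "t2", 20.0), (2, "t1", 5.0)],
)

# A window function computes a running total per user without collapsing
# rows — the kind of query data engineers and analysts write daily.
rows = conn.execute(
    """
    SELECT user_id, ts, amount,
           SUM(amount) OVER (PARTITION BY user_id ORDER BY ts) AS running_total
    FROM events
    ORDER BY user_id, ts
    """
).fetchall()
print(rows)
# → [(1, 't1', 10.0, 10.0), (1, 't2', 20.0, 30.0), (2, 't1', 5.0, 5.0)]
```

The same query runs essentially unchanged on PostgreSQL, Redshift, BigQuery, or Snowflake, which is why solid SQL transfers so well across the tools in the list above.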
My favorite work tool for orchestration is Airflow, and for processing large amounts of data, Spark. These tools allow you to easily build and manage data flows using the programming language that is most convenient for you. Among programming languages, I would single out Scala.
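Conceptually, an orchestrator like Airflow models a pipeline as a directed acyclic graph (DAG) of tasks and starts each task only after its upstream dependencies have finished. The sketch below illustrates just that ordering idea in plain Python with the standard library's `graphlib` — it is not Airflow's API, and the extract/transform/load task names are placeholders.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

ran = []

# Placeholder tasks standing in for real pipeline steps.
def extract(): ran.append("extract")
def transform(): ran.append("transform")
def load(): ran.append("load")

# task -> set of upstream tasks it depends on (the DAG)
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
tasks = {"extract": extract, "transform": transform, "load": load}

# Run tasks in dependency order: every task starts only after its
# upstream dependencies have completed.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()

print(ran)  # → ['extract', 'transform', 'load']
```

Real orchestrators add scheduling, retries, backfills, and distributed execution on top of this core idea, but the dependency graph is the heart of it.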
What can I do to grow as a Data Engineer?
The profession of a Data Engineer is relatively new, so there is no standard career path. But I would suggest there are two common ways you can advance as a Data Engineer.
- First – advancing from working with SQL queries or DBMS administration to building ETL pipelines. This is a fairly standard way of learning: write SQL queries, understand the analytical part and the purpose of data, and move on to building the data pipelines themselves. Here it is vital to master one of the programming languages.
- Second – transitioning to data engineering from backend engineering. This is a common practice, as the two professions share many principles. It was this “second scenario” that happened in my case.
Together with Tilting Point’s data engineering team, I have compiled a list of valuable learning and development resources.
Books:
- “Clean Code” by Robert C. Martin – a fantastic book that will teach you how to write good code.
- “Design Patterns” by Erich Gamma – this book teaches you how to make top-notch architecture.
- “Database System Concepts” (7th Edition) by Avi Silberschatz.
- “Designing Data-Intensive Applications” by Martin Kleppmann – this book offers the fundamental principles, algorithms, and trade-offs of developing data-intensive applications.
- “Principles of Distributed Database Systems” by M. Tamer Özsu – the book describes distributed and parallel database technology in detail.
- “Learning Spark: Lightning-Fast Big Data Analysis” by Holden Karau – a practical book for novice engineers, as it dives into Spark infrastructure, core concepts, API, types of operations, and data structures.
Blogs:
- Towards Data Science – a platform with useful articles from the developer community.
- Functional Data Engineering Post – a blog from the creator of Apache Superset and Apache Airflow.
- InfoQ (https://www.infoq.com/) – a blog with a selection of articles from developers and programmers on all sorts of IT topics.
- Medium Spotify Insights – a fascinating blog about Spotify: from UI/UX design to analytics. It is especially interesting for data engineers to read about the domain area, how analysts at Spotify work with data, and how the data processing is configured.
- Medium Airbnb Engineering – an incredible resource for programmers and data engineers from the company, known for its code and data quality standards.
Podcasts:
- Data Engineering Podcast – a weekly podcast hosted by Tobias Macey about new approaches to data management, with detailed analysis of real cases.
- Software Engineering Daily – a podcast that features daily interviews about IT.
Telegram channels:
- Spark in me – a data industry channel with many links to engaging articles, videos, and blogs.
- DataEng – a channel about Data Engineering & Distributed Systems. It’s all you wanted to know about building an infrastructure for storing, processing, and efficiently analyzing a massive amount of data.
- Data Engineering – a channel for anyone interested in or working with data and analytics.
Also, here at Tilting Point, we are giving away 3 premium tickets to the conference for Data Engineers – “AI and Big Data Online Day.” Others who fill out the form below with some handy information will receive a 50% discount on the premium tickets.
To do that, you just need to share your best resources for training and development in Data Engineering & Big Data with us.