Hello and welcome to the Data Engineer onboarding process at Mashey!
- Mashey Team Culture
- What is a Data Engineer?
- ELT vs ETL
- Agile + Scrum
- Data Engineering Practices
- The Mashey Stack
- Data Modeling
- Environment Setup
- Practice
Things about our team culture. Include:
- Feedback
- Code Quality
- Testing
- Supporting each other
- Bring your whole self to work
You are now a Data Engineer. What is that?
From Real Python:
Data engineering is a very broad discipline that comes with multiple titles. In many organizations, it may not even have a specific title. Because of this, it’s probably best to first identify the goals of data engineering and then discuss what kind of work brings about the desired outcomes.
The ultimate goal of data engineering is to provide organized, consistent data flow to enable data-driven work, such as:
- Training machine learning models
- Doing exploratory data analysis
- Populating fields in an application with outside data
This data flow can be achieved in any number of ways, and the specific tool sets, techniques, and skills required will vary widely across teams, organizations, and desired outcomes. However, a common pattern is the data pipeline. This is a system that consists of independent programs that do various operations on incoming or collected data.
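The pipeline pattern described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular Mashey pipeline is built: the stage names (`drop_incomplete`, `normalize_email`), the `run_pipeline` helper, and the sample records are all hypothetical.

```python
from typing import Callable, Iterable, List

Record = dict
Stage = Callable[[Iterable[Record]], Iterable[Record]]


def drop_incomplete(records: Iterable[Record]) -> Iterable[Record]:
    """Keep only records that have a non-empty 'email' field."""
    return (r for r in records if r.get("email"))


def normalize_email(records: Iterable[Record]) -> Iterable[Record]:
    """Strip whitespace from each email and lower-case it."""
    return ({**r, "email": r["email"].strip().lower()} for r in records)


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Pass the records through each independent stage, in order."""
    for stage in stages:
        records = stage(records)
    return list(records)


# Hypothetical incoming data.
raw = [
    {"id": 1, "email": "  Ada@Example.com "},
    {"id": 2, "email": ""},
    {"id": 3, "email": "grace@example.com"},
]

clean = run_pipeline(raw, [drop_incomplete, normalize_email])
print(clean)
```

Each stage knows nothing about the others, so stages can be tested, reordered, or replaced on their own.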
Please take a few minutes to read the entire article.
Data Engineers at Mashey are responsible for a broad range of engineering practices, including aspects of the roles outlined below. During your onboarding process as a new Data Engineer at Mashey you will not be responsible for the overlapping roles, but it is beneficial to understand how the Data Engineer role will grow as you gain experience.
It is important to understand where an Analytics Engineer fits between Data Engineers and Data Analysts. Saira Barles, Analytics Engineer at HubSpot, describes the differences in this blog post from Dataform:
“Data engineers build the cupboard, they gather together the wood and the tools and put it together. The Analytics Engineers open the cupboard and start putting in the plates, mugs, bowls, and arrange them in a certain order. This could be arranging them into particular colours, shapes or sizes. Data analysts then go into the cupboard and they know where everything lives as it is arranged nicely. They can then grab the small blue mug they were looking for and go make a cup of tea!”
Understanding how to design and model data, similar to an Analytics Engineer, will be an extension to the role of Data Engineer at Mashey. It is a separate discipline and takes time to learn, but the ability to fulfill both roles will provide incredible value to our team.
The role of a DevOps engineer will vary from one organization to another, but invariably entails some combination of release engineering, infrastructure provisioning and management, system administration, security, and DevOps advocacy. The Atlassian team has a great DevOps Engineer overview.
A DevOps engineer is an IT generalist who should have a wide-ranging knowledge of both development and operations, including coding, infrastructure management, system administration, and DevOps toolchains. DevOps engineers should also possess interpersonal skills since they work across company silos to create a more collaborative environment.
DevOps engineers need to have a strong understanding of common system architecture, provisioning, and administration, but must also have experience with the traditional developer toolset and practices such as using source control, giving and receiving code reviews, writing unit tests, and familiarity with agile principles.
There are generally three primary components of Cloud Engineering, which are outlined in this Northeastern University blog post.
- Cloud Architecture
- Cloud Development
- Cloud Administration
Cloud engineers must refine specific cloud computing skills in order to be successful in their roles. These skills range from software development and database administration to change management and data security.
At Mashey, our Data Engineers live in the Cloud, and it's important to eventually become comfortable fulfilling the above roles within our core Cloud technologies.
Extract Load Transform vs Extract Transform Load.
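The difference between the two orderings can be sketched with Python's built-in `sqlite3` module standing in for a data warehouse. This is an illustrative assumption only: the table names, the `etl` and `elt` helpers, and the sample rows are hypothetical, and a real warehouse would take the place of SQLite.

```python
import sqlite3

# Hypothetical extracted rows: (order_id, amount as raw text).
raw_orders = [("A-1", " 19.99 "), ("A-2", "5.00"), ("A-3", "bad")]


def etl(conn, rows):
    """ETL: transform in Python first, then load only the cleaned rows."""
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
    cleaned = []
    for order_id, amount in rows:
        try:
            cleaned.append((order_id, float(amount)))
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load the raw rows untouched, then transform with SQL."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
        FROM raw_orders
        WHERE TRIM(amount) GLOB '[0-9]*'
        """
    )


conn = sqlite3.connect(":memory:")
etl(conn, raw_orders)
elt(conn, raw_orders)

etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both approaches end with the same cleaned table. The ELT path also keeps the unmodified `raw_orders` table, so the transformation can be rerun or changed later without extracting the data again.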
Things about Agile and Scrum.
Things about Asana.
Our data engineering practices.
Our core stack.
How we test and maintain code quality.
CircleCI info.
Coveralls info.
Codacy or Hound info based on which one I pick.
The languages we use.
Python Info.
- autopep8
- coverage
- coveralls
- poetry
- pylint
- pytest
- pytest-cov
- pytest-mock
- pytest-vcr
- python-dotenv
- requests
- singer-python
- singer-tools
- SQLAlchemy
- vcrpy
SQL Info.
Singer info.
Meltano info.
Airflow info.
DBT info.
Fivetran info.
Docker info.
Kubernetes info.
Cloud Functions info.
Cloud SQL info.
Cloud Storage info.
Artifact Registry info.
ECS info.
EKS info.
ECR info.
S3 info.
The data warehouses we use.
GCP BigQuery.
Snowflake.
Concepts to architect and model data.
About Star Schemas.
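A star schema places a central fact table (measurements) next to dimension tables (descriptive attributes), joined by keys. The sketch below uses Python's built-in `sqlite3` module for illustration. The table names (`fact_sales`, `dim_customer`, `dim_date`) and the sample rows are hypothetical and do not describe an actual Mashey model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds measurements plus a key into each dimension.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        amount REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO dim_date VALUES (20230101, 2023), (20240101, 2024);
    INSERT INTO fact_sales VALUES
        (1, 20230101, 100.0), (1, 20240101, 50.0), (2, 20240101, 25.0);
    """
)

# A typical query joins the fact table to its dimensions and aggregates.
totals = conn.execute(
    """
    SELECT c.name, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY c.name, d.year
    ORDER BY c.name, d.year
    """
).fetchall()
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.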
About Snowflake Schemas.
Setting up a development environment.
Things to practice, and how to practice them.