We are seeking an experienced data engineer who uses a variety of methods to transform raw data into useful data systems. You will be responsible for designing, developing, testing, and deploying data pipelines, data warehouses, data lakes, and data products that support business needs. You will also work closely with data analysts, data scientists, and other stakeholders to ensure data quality, reliability, and availability. For example, you will develop and configure programs and jobs for data ingestion, transformation, and enrichment, with efficient database design and storage that keeps data accessible to end consumers. Overall, you will strive for efficiency by aligning data systems with business goals. In this data engineering position, you should have strong data analysis skills and the ability to combine, correlate, and resolve data conflicts across different sources. You will need experience with several programming languages; knowledge of machine learning methods is a plus.

Job Duties

  • Utilize and optimize Apache Spark for distributed data processing, handling both batch and stream processing
    workloads.
  • Design, develop, and maintain scalable data pipelines for processing and analyzing large datasets.
  • Collaborate with cross-functional teams to understand data requirements and implement effective solutions.
  • Implement ETL (Extract, Transform, Load) processes to ingest and transform data from various sources into usable formats (a brief sketch follows this list).
  • Implement data quality checks, data validation, and data governance processes to ensure data accuracy and
    consistency.
  • Develop and maintain data models, schemas, and metadata to support data analysis and reporting.
  • Create and manage data warehouses, data lakes, and data marts using cloud platforms such as AWS, Azure, or GCP.
  • Collaborate with data analysts, data scientists, and other business users to understand their data needs and provide data solutions.
  • Collaborate with technical teams, including DevOps, engineering, and compliance, to ensure seamless cloud implementation and adherence to best practices.
  • Develop data and cloud architecture documentation, including diagrams, guidelines, and best practices for reference and knowledge sharing.
  • Troubleshoot and resolve data pipeline issues, ensuring minimal downtime and data integrity.
  • Optimize data pipelines for performance, reliability, and data quality, utilizing best practices in data engineering.
  • Build algorithms and prototypes that combine raw information from different sources.
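
To make the ETL duty above concrete, here is a minimal PySpark sketch of a batch job with a simple data quality gate. It is an illustration only: the paths, column names, and checks are hypothetical placeholders, not references to any specific system you would work on.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical example: ingest raw order events, validate them, and
    # write a curated table. All paths and column names are placeholders.
    spark = SparkSession.builder.appName("orders_etl_example").getOrCreate()

    # Extract: read raw JSON events from a (hypothetical) landing zone.
    raw = spark.read.json("s3://example-bucket/landing/orders/")

    # Transform: normalize types and derive a partition column.
    orders = (
        raw.withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("amount", F.col("amount").cast("double"))
           .withColumn("order_date", F.to_date("order_ts"))
    )

    # Data quality check: fail the run if required fields are missing.
    bad_rows = orders.filter(
        F.col("order_id").isNull() | F.col("amount").isNull()
    ).count()
    if bad_rows > 0:
        raise ValueError(f"Data quality check failed: {bad_rows} invalid rows")

    # Load: write the curated dataset partitioned by date.
    orders.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3://example-bucket/curated/orders/"
    )

The same DataFrame API runs unchanged from a laptop to a cluster, which is why Spark appears throughout these duties; a streaming variant of this job would typically swap spark.read for spark.readStream against a source such as Kafka.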

Required Skills

  • Bachelor’s degree in Computer Science, Engineering, Mathematics, or related field, or equivalent work experience.
  • 3+ years of experience in data engineering or related roles.
  • Extensive experience with Apache Spark for large-scale data processing, including RDDs, DataFrames, and Spark SQL.
  • Familiarity with Hadoop ecosystem components such as HDFS, MapReduce, Hive, and HBase.
  • Experience with both SQL databases (such as MySQL or PostgreSQL) and NoSQL databases (such as DynamoDB).
  • Proficiency in SQL and at least one programming language, such as Python.
  • Experience with data pipeline orchestration and scheduling tools such as AWS Step Functions, Apache Airflow, etc. (a brief Airflow sketch follows this list).
  • Experience with cloud-based data platforms and services such as AWS, Azure, or GCP.
  • Experience with data warehouse and data lake design and architecture.
  • Experience with data quality, data testing, and data governance methodologies and tools.
  • Strong analytical, problem-solving, and communication skills, with close attention to detail.
  • Ability to work independently and collaboratively in a fast-paced environment.
  • Experience working with a modern data catalog such as Alation, Collibra, etc. is a plus.
  • Ability to prepare data for prescriptive and predictive modeling is a plus.
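
To illustrate the orchestration and scheduling experience listed above, here is a minimal Apache Airflow DAG sketch. The DAG name, schedule, and task callables are hypothetical placeholders, and the schedule argument assumes Airflow 2.4 or later.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical placeholder callables; a real pipeline would trigger
    # Spark jobs, warehouse loads, quality checks, and so on.
    def extract():
        print("pull raw data from source systems")

    def transform():
        print("clean and enrich the extracted data")

    def load():
        print("load curated data into the warehouse")

    with DAG(
        dag_id="example_daily_etl",      # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",               # assumed cadence; Airflow 2.4+ syntax
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Run the steps strictly in order: extract -> transform -> load.
        t_extract >> t_transform >> t_load

AWS Step Functions expresses the same dependency graph as a state machine definition rather than Python code; either way, the orchestration layer owns scheduling, retries, and alerting so the pipeline logic does not have to.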