Description
Design, build, and optimize scalable ETL pipelines using Apache Spark on Amazon EMR.
Work closely with data scientists, analysts, and other engineering teams to define, implement, and maintain high-performance data infrastructure.
Develop and maintain automated data workflows and processes for efficient data ingestion, transformation, and loading.
Implement best practices for data engineering, including monitoring, logging, and alerting for data pipelines.
Collaborate with stakeholders to understand business requirements and translate them into technical solutions.
Optimize the performance of data processing jobs and troubleshoot issues in large-scale distributed systems.
Drive innovation in data infrastructure, evaluating and integrating new tools, frameworks, and approaches.

Qualifications:
5+ years of experience in data engineering, with at least 3 years working with Apache Spark and Amazon EMR.
Strong programming skills in Python and Scala, with a focus on performance tuning and optimization of Spark jobs.
Deep understanding of distributed computing concepts, data partitioning, and resource management in large-scale data processing systems.
Proficiency in building and maintaining ETL pipelines for structured and unstructured data.
Hands-on experience with AWS services such as S3, Lambda, EMR, Glue, and RDS.
Strong problem-solving skills and the ability to debug complex systems.
Experience with DevOps practices, including CI/CD, infrastructure as code (e.g., Terraform, CloudFormation), and containerization (e.g., Docker).
Experience with Kubernetes and container orchestration for Spark jobs.
Familiarity with streaming data processing using tools like Kafka, Kinesis, or Flink.
Experience with modern data lake architectures, including Delta Lake or Iceberg.
AWS Certification (e.g., AWS Certified Big Data – Specialty, AWS Certified Solutions Architect) is a plus.