Data Engineer (PySpark)

Noida, Uttar Pradesh, India
Feb 14, 2025
Feb 14, 2026
Onsite
Full-Time
3 Years
Job Description

We are seeking a highly skilled and motivated Data Engineer with expertise in PySpark to join our dynamic team. In this role, you will be responsible for designing, developing, and maintaining scalable data solutions to support data generation, collection, and processing. Your day-to-day activities will involve building data pipelines, ensuring data quality, and implementing ETL (Extract, Transform, Load) processes for seamless data migration and deployment across multiple systems. You will play a key role in optimizing data infrastructure and working closely with stakeholders such as other data engineers, data scientists, and business teams to develop robust and efficient solutions.

If you have a passion for big data processing, hands-on experience with PySpark and Apache Spark, and a strong understanding of cloud platforms such as AWS, Azure, or Google Cloud, this is a great opportunity to be part of a fast-growing team working on cutting-edge data technologies.

Experience: 3 Years

Key Responsibilities

Data Pipeline Development

  • Design, build, and optimize large-scale data processing pipelines using Apache Spark and PySpark (see the sketch after this list).
  • Develop efficient ETL workflows to extract, transform, and load data across various platforms.
  • Ensure scalability, efficiency, and reliability of data pipelines.
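
As a minimal, illustrative sketch of the kind of PySpark ETL pipeline this covers (the bucket paths, column names, and business rules below are hypothetical placeholders, not an existing pipeline):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: read raw CSV data (hypothetical input path and columns).
raw = spark.read.option("header", True).csv("s3a://raw-bucket/orders/")

# Transform: cast types, derive a revenue column, and drop invalid rows.
orders = (
    raw.withColumn("quantity", F.col("quantity").cast("int"))
       .withColumn("unit_price", F.col("unit_price").cast("double"))
       .withColumn("revenue", F.col("quantity") * F.col("unit_price"))
       .filter(F.col("quantity") > 0)
)

# Load: write the curated data as Parquet, partitioned by an assumed
# order_date column for efficient downstream reads.
orders.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://curated-bucket/orders/"
)
```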

Data Quality & Processing

  • Implement best practices for data validation, cleansing, and transformation (see the sketch after this list).
  • Monitor and improve data quality, integrity, and consistency across different sources.
  • Work with structured and unstructured datasets, ensuring data is properly formatted and stored.
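
One hedged sketch of what validation and cleansing can look like in PySpark (the dataset, columns, and rules are hypothetical examples only):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("s3a://curated-bucket/customers/")  # hypothetical path

# Cleansing: trim whitespace, normalize case, and de-duplicate on the key.
clean = (
    df.withColumn("email", F.lower(F.trim(F.col("email"))))
      .dropDuplicates(["customer_id"])
)

# Validation: split valid and invalid records instead of silently dropping them.
is_valid = F.col("customer_id").isNotNull() & F.col("email").contains("@")
valid = clean.filter(is_valid)
rejected = clean.filter(~is_valid)

# A simple quality metric that could be logged or alerted on.
total = clean.count()
print(f"rejected {rejected.count()} of {total} rows")
```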

Collaboration & Stakeholder Engagement

  • Work closely with data scientists, analysts, and software engineers to provide clean and structured datasets for analytical and machine learning purposes.
  • Collaborate with business stakeholders to understand data requirements and deliver solutions that align with organizational goals.

Performance Optimization & Troubleshooting

  • Identify and resolve performance bottlenecks in data pipelines and processing workflows.
  • Optimize Spark jobs for better efficiency, reducing latency and improving execution times (a tuning sketch follows this list).
  • Debug and resolve data processing errors and job failures.
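
A brief sketch of common Spark tuning techniques referenced above; the table names, sizes, and partition counts are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

facts = spark.read.parquet("s3a://curated-bucket/transactions/")  # large table
dims = spark.read.parquet("s3a://curated-bucket/merchants/")      # small table

# Broadcast the small dimension table to avoid an expensive shuffle join.
joined = facts.join(F.broadcast(dims), on="merchant_id", how="left")

# Repartition by the aggregation key to balance work before a wide operation,
# and cache the result if it is reused by several downstream actions.
daily = (
    joined.repartition(200, "merchant_id")
          .groupBy("merchant_id", "txn_date")
          .agg(F.sum("amount").alias("total_amount"))
          .cache()
)

daily.write.mode("overwrite").parquet("s3a://curated-bucket/daily_totals/")
```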

Cloud & Database Integration

  • Work with cloud platforms (AWS, Azure, or Google Cloud) for storage, processing, and deployment (see the sketch after this list).
  • Utilize relational (SQL) and NoSQL databases, as well as data lakes, to manage large-scale datasets.
  • Implement and maintain data security measures to ensure compliance with industry standards.
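
As one illustrative example of cloud and database integration (the bucket, JDBC URL, table, and credentials below are hypothetical; real credentials would come from a secrets manager, and a JDBC driver must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloud-to-db").getOrCreate()

# Read raw events from cloud object storage (S3 shown as one example).
events = spark.read.json("s3a://raw-bucket/events/2025/02/")

# Write the result to a relational database via JDBC.
(
    events.write.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/analytics")
          .option("dbtable", "public.events")
          .option("user", "etl_user")     # placeholder credentials only;
          .option("password", "***")      # use a secrets manager in practice
          .mode("append")
          .save()
)
```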

Version Control & Agile Practices

  • Use Git and other version control systems for collaborative development and tracking changes.
  • Follow Agile methodologies, participating in Scrum meetings, sprint planning, and retrospectives to improve efficiency.

Required Skills & Qualifications

Programming & Data Processing

  • Proficiency in Python and PySpark, with hands-on experience in writing efficient Spark jobs.
  • Strong understanding of data processing concepts, distributed computing, and parallel processing.

ETL & Big Data Technologies

  • Experience in designing and managing ETL pipelines for large-scale data ingestion and processing.
  • Knowledge of Hadoop ecosystem, Kafka, and other big data frameworks is a plus.

Database Management & SQL Expertise

  • Strong command of SQL, with experience in relational and NoSQL databases such as MySQL, PostgreSQL, MongoDB, or Cassandra.
  • Experience in working with data lakes and cloud-based storage solutions.

Cloud Platforms

  • Hands-on experience with cloud providers like AWS, Azure, or Google Cloud.
  • Familiarity with cloud data services such as Amazon S3, Google BigQuery, Azure Data Lake Storage, or Amazon Redshift.

Debugging & Performance Optimization

  • Strong problem-solving skills with the ability to troubleshoot complex data processing issues.
  • Experience in performance tuning and optimizing Spark jobs for better efficiency.

Version Control & Agile Development

  • Proficiency in Git and CI/CD pipelines for seamless deployment and collaboration.
  • Experience working in Agile/Scrum environments, with knowledge of JIRA, Confluence, or similar tools.

Good to Have (Bonus Skills)

  • Knowledge of Machine Learning workflows and how data engineering supports AI/ML models.
  • Experience with data governance, security, and compliance in large-scale organizations.
  • Familiarity with streaming data processing using tools like Kafka, Flink, or Spark Streaming (a Structured Streaming sketch follows this list).
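
A hedged sketch of Kafka ingestion with Spark Structured Streaming (the topic, brokers, schema, and paths are hypothetical, and the spark-sql-kafka connector must be available on the cluster):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("clickstream").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("duration", DoubleType()),
])

# Read a Kafka topic as a streaming DataFrame and parse the JSON payload.
stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "clickstream")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Write micro-batches to Parquet with checkpointing for fault tolerance.
query = (
    stream.writeStream.format("parquet")
          .option("path", "s3a://curated-bucket/clickstream/")
          .option("checkpointLocation", "s3a://curated-bucket/_chk/clickstream/")
          .outputMode("append")
          .start()
)
query.awaitTermination()
```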

Why Join Us?

  • Work on cutting-edge big data technologies in a fast-paced and innovative environment.
  • Collaborate with industry-leading experts in data engineering and analytics.
  • Competitive salary and benefits package.
  • Opportunities for continuous learning, certifications, and career growth.
  • Work on challenging and high-impact projects that drive business success.

If you're passionate about data engineering, PySpark, and big data technologies, and you love solving complex challenges, we encourage you to apply and be part of our growing team!