Requirements:
- Experience building and optimizing data pipelines using PySpark (a minimal pipeline sketch appears after this list).
- Experience with big data tools: Hadoop, HDFS, PySpark, Hive, Kafka, YARN.
- Good understanding of Python programming.
- Good understanding of Spark's internal architecture.
- Good working knowledge of at least one scheduler tool (Airflow, Oozie, TWS, or Autosys); see the Airflow sketch after this list.
- Experience performing root cause analysis on internal and external data and processes to answer specific business questions and identify opportunities for improvement.
- Strong analytical skills for working with structured and unstructured datasets.
- Experience building processes that support data transformation, data structures, metadata, dependency, and workload management.
- A successful history of manipulating, processing, and extracting value from large disconnected datasets.
- Advanced SQL knowledge and experience with relational databases, including query authoring, as well as working familiarity with a variety of database systems.
- Working knowledge of message queuing, stream processing, and highly scalable big data stores.
- Strong project management and organizational skills.
- Experience supporting and working with cross-functional teams in a fast-paced, dynamic environment.
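
For illustration only, a minimal sketch of the kind of PySpark pipeline work referenced above (read, transform, write). The paths, column names, and application name are hypothetical placeholders, not details from this posting.

```python
# Minimal PySpark pipeline sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily_order_aggregation")  # hypothetical app name
    .getOrCreate()
)

# Read raw records from HDFS (structured source).
orders = spark.read.parquet("hdfs:///data/raw/orders")

# Transform: drop bad records, derive a date column, aggregate.
daily_totals = (
    orders
    .filter(F.col("amount").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "customer_id")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("order_count"),
    )
)

# Write partitioned output back to HDFS in a Hive-friendly layout.
(daily_totals
    .write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("hdfs:///data/curated/daily_order_totals"))

spark.stop()
```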
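Likewise, a minimal Airflow sketch of the scheduler knowledge mentioned above: a daily DAG that submits a Spark job to YARN. The DAG id, schedule, and spark-submit command are assumptions for illustration.

```python
# Minimal Airflow DAG sketch; dag_id, schedule, and job path are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_order_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark aggregation job to the YARN cluster.
    run_spark_job = BashOperator(
        task_id="run_spark_aggregation",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/daily_order_totals.py"
        ),
    )
```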