Job Description:
Core Responsibilities:
- Strong understanding of Linux administration.
- Strong understanding of AWS/GCP/Azure cloud.
- In-depth understanding of networking.
- Proficiency in Python.
- Strong understanding of the machine learning lifecycle.
- Design and automate a process for MLOps implementation.
- Implement data versioning, model versioning, and code versioning.
- Deploy models at scale, with monitoring, alerting, and drift management.
- Create model retraining pipelines.
- Evaluate and choose technology stacks that best fit the client's data strategy and constraints.
- Write or rewrite code to scale model training and deployment.
Qualifications:
3+ years' experience in Software Engineering and DevOps, 2+ years' experience in machine learning development and deployment.
Technical Skills:
- Strong Python and Bash/shell scripting skills. (Must)
- Understanding of machine learning model development. (Must)
- MLflow, DVC, Databricks, Kubeflow, distributed model training (Ray), Grafana.
- Hands-on experience with data versioning and monitoring tools.
- Machine learning experiment tracking.
- Familiarity with distributed model training techniques. (Must)
- Experience with Azure/GCP (Google Cloud Platform)/AWS. (Must)
- Experience with Linux. (Must)
- Experience with Ansible/Chef/Puppet. (Must)
- Log management tools such as ELK (Elasticsearch, Logstash, Kibana) or Splunk. - Added Advantage.
- Knowledge of big data systems such as InfluxDB, Elasticsearch, or Cassandra. - Added Advantage.
- In-depth knowledge of networking, UNIX, and low-level OS internals. - Added Advantage.