Data science is a term that escapes any single complete definition, which makes it difficult to use correctly. If you are learning data science or entering the data science job market for the first time, you’d be forgiven for some confusion about what a career in data science actually entails.
In this two-part series, we’ll explore various aspects of data science, from the skills required to teams and specific roles. This series is based on the recently published Workera report AI Career Pathways: Put yourself on the right track.
Part I: The AI development life cycle and skills needed
To understand the skills needed to work in data science, it’s best to start with how the AI project development life cycle usually works.
First, someone prepares data for modelling. Then someone trains a model on this data. Once that happens, the model is delivered to the customer. Team members then analyse the model to determine whether it brought value to the business and/or the user. If all goes well, the cycle will repeat itself with new data, models, and analyses. All the while, people working in AI infrastructure build software to improve the cycle’s efficiency.
Data engineering
Data is the foundation on which data science, machine learning, deep learning and AI are built. Traditional data is stored across a variety of databases and files, while big data is structured or unstructured data that ranges in format from numbers and text to images, video or audio, arrives in large volumes (terabytes, petabytes or even exabytes), and is stored in specialised data warehouses.
Data engineers are responsible for preparing data and transforming it into formats that other team members can use. They need strong coding and software engineering skills, ideally combined with machine learning skills to help them make good design decisions related to data. They commonly use big data tools such as Hadoop and Hive, query languages such as SQL, and object-oriented programming (OOP) languages such as Python, Java and C++.
Common tasks in data engineering include the following (a short cleaning sketch follows the list):
- Defining data requirements
- Collecting data
- Labelling data
- Inspecting and cleaning data
- Augmenting data
- Moving data and building data pipelines
- Querying data
- Tracing data
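To make the inspecting-and-cleaning task concrete, here is a minimal sketch, assuming pandas and a hypothetical CSV file of user records (the filename and column names are illustrative):

```python
# A minimal inspect-and-clean sketch, assuming pandas and a
# hypothetical file of user records named "users.csv".
import pandas as pd

df = pd.read_csv("users.csv")

# Inspect: shape, column types and missing values
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Clean: drop exact duplicates and rows missing a required field
df = df.drop_duplicates()
df = df.dropna(subset=["user_id"])

# Write the cleaned data out for the modelling team
df.to_csv("users_clean.csv", index=False)
```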
This part of the development cycle forms the foundation on which the next steps are built, and it largely determines the results of the project as a whole. As a common saying in machine learning goes, “garbage in, garbage out”.
Modelling
The modelling stage is my favourite part, and what many people think of when they talk about data science. I like this part the most because it is where art and science merge to produce an outcome. I believe data science to be an art, as two different data scientists will approach the same problem differently through their feature engineering and choice of algorithms; that in itself is beautiful.
People assigned to modelling look for patterns in data that can help an organisation predict the outcomes of various business decisions, identify risks and opportunities, or determine cause-and-effect relationships.
Modelling can be done in Python, R, Java, MATLAB, C++ or any other programming language desired. Here, a strong foundation in mathematics, statistics and machine learning is important, as well as a dose of creative problem-solving.
Common tasks in modelling include:
- Fitting probabilistic and statistical models
- Training machine learning and deep learning models
- Accelerating training
- Defining evaluation metrics
- Speeding up prediction time
- Iterating over the virtuous cycle of machine learning projects
- Searching hyperparameters
- Keeping your knowledge up to date
The most common machine learning methods used include: linear regression, logistic regression, decision trees, random forest, XGBoost, support vector machines, K-means, K-nearest neighbours, neural networks, principal component analysis. Deep learning skills are required by companies focusing on computer vision, natural language processing or speech recognition.
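As a small illustration of two of the tasks above, training a model and defining an evaluation metric, here is a minimal sketch assuming scikit-learn and one of its bundled toy datasets:

```python
# A minimal modelling sketch: logistic regression on scikit-learn's
# built-in breast cancer dataset, evaluated with the F1 score.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000)  # raise max_iter so training converges
model.fit(X_train, y_train)

print(f"F1 score: {f1_score(y_test, model.predict(X_test)):.3f}")
```

In practice, the choice of algorithm and metric depends on the problem; the F1 score here is simply one reasonable choice for a classification task.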
Deployment
This part of the cycle turns a good model into a useful product. Streams of data are combined with a model and tested before going into production. Cloud technologies such as AWS and Azure can make deployment faster and more successful.
Tasks in deployment include the following (a minimal serving sketch follows the list):
- Converting prototype code into production code
- Setting up a cloud environment to deploy the model
- Branching (version control) using a tool like GitHub
- Improving response times and saving bandwidth
- Encrypting files that store model parameters, architecture and data
- Building APIs for an application to use a model
- Retraining machine learning models
- Fitting models on resource-constrained devices
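As a sketch of the API-building task, assuming Flask and a trained scikit-learn model already pickled to a file named model.pkl (a hypothetical name), a minimal deployment could look like this:

```python
# A minimal model-serving sketch, assuming Flask and a pickled
# scikit-learn model stored in "model.pkl" (hypothetical filename).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once, at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A real deployment would add input validation, authentication, logging and monitoring, but the core idea of wrapping a model behind an API endpoint is the same.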
Business analysis
The aim of any data science project is to provide value, and that usually means business value. Starting out in data science, you might ask what happens to the models after deployment. That’s where business analysis comes in. Team members in this stage either suggest changes to increase the benefit a model brings or recommend abandoning unproductive models.
In this part of the development cycle, team members need strong communication skills and business acumen, as well as a grasp of the analytics principles relevant to the given data science project.
Tasks in business analysis include:
- Building data visualisations
- Building dashboards for business intelligence
- Presenting technical work to clients or colleagues
- Translating statistics to actionable business insights
- Analysing datasets
- Running experiments to analyse deployed models
- Running A/B testing campaigns
For example, a team is tasked to build a recommendation engine to provide jokes to users for an online comedy series. People responsible for business analysis will use this data to evaluate the performance of the recommendation system and measure how much value it creates for the client.
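Continuing the jokes example, here is a hedged sketch of analysing an A/B test between two versions of the recommender, assuming statsmodels and made-up click counts (all numbers are illustrative):

```python
# A sketch of an A/B test analysis: comparing click-through rates of
# two recommender variants with a two-proportion z-test (statsmodels).
# All counts below are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

clicks = [620, 710]            # jokes clicked in variant A and variant B
impressions = [10000, 10000]   # jokes shown in each variant

stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("The difference in click-through rate is statistically significant.")
else:
    print("No significant difference detected between the two variants.")
```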
AI infrastructure
The team working in AI infrastructure builds and maintains reliable, fast, secure, and scalable software systems to help people working in data engineering, modelling, deployment, and business analysis. They build the infrastructure that supports the project.
Continuing with the example of a jokes recommender, someone in AI infrastructure would ensure that the recommender system is available 24/7 for global users, that the underlying model is stored securely, and that user interactions with the model on the website can be tracked reliably.
Working on AI infrastructure requires strong and broad software engineering skills to write production code, as well as an understanding of cloud technologies such as AWS and Azure.
Tasks in AI infrastructure include the following (a small testing sketch follows the list):
- Making software design decisions
- Building distributed storage and database systems
- Designing for scale
- Maintaining software infrastructure
- Networking
- Securing data and models
- Writing tests
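As a sketch of the test-writing task, assuming pytest and a hypothetical preprocessing helper that fills missing values with the column mean, a unit test might look like this:

```python
# A minimal testing sketch, assuming pytest and pandas. The helper
# function below is hypothetical, standing in for a real pipeline step.
import pandas as pd

def fill_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: replace NaNs with column means."""
    return df.fillna(df.mean(numeric_only=True))

def test_fill_missing_with_mean():
    df = pd.DataFrame({"age": [20.0, None, 40.0]})
    result = fill_missing_with_mean(df)
    assert result["age"].isna().sum() == 0  # no missing values remain
    assert result.loc[1, "age"] == 30.0     # the NaN was replaced by the mean
```

Running pytest against a file like this checks that a pipeline step behaves as expected before it ships.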
With an understanding of the AI development life cycle in place, we can now look at how different job roles contribute to different parts of the cycle. Check in soon for part two, where we explore data science roles in more detail.
This two-part series is based on the recently published Workera report AI Career Pathways: Put yourself on the right track. Check out workera.ai for more information and resources on interview preparation and tests for data science and machine learning roles.
Derick Kazimoto is completing his MPhil in Data Science, specialising in Financial Technology, at the University of Cape Town. He is Zindi’s student ambassador at UCT, and chair of the UCT Cryptocurrency and Artificial Intelligence Society. He plans to work in financial services after completing his degree.