For AI companies, bias in data collection is a big issue.
Here’s the bottom line: “Bias in data produces biased models which can be discriminatory and harmful to humans”. – source
The reason for the problems? The sources of data used to train them.
To combat bias, we’re going to take you through the different types of bias you should be aware of in data collection. Once you have identified and corrected data bias, your model is far more likely to reflect your investigation accurately and ethically.
What types of bias are there?
This type of bias occurs in user-generated data, e.g. posts on social media (Facebook, Twitter, Instagram), reviews on eCommerce websites, and so on.
As the people who contribute to user-generated data are a small percentage of the entire population, it is unlikely that their opinions and preferences will reflect those of the majority.
Societal bias is generated purely by content produced by humans. Whether that content appears on social media or in curated news articles, this type of bias still exists. One instance is the use of gender or race stereotypes. This can also be known as label bias.
Omitted Variable Bias
This is the bias that occurs in data when critical attributes that influence the outcome are missing. Usually, this happens when data generation relies on human input, which leaves more room for mistakes. It can also happen when the data recording process does not have access to key attributes.
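To make the effect concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the variable names (`experience`, `education`, `income`) and the coefficients are invented for illustration, and only the Python standard library is used. The outcome depends on two correlated attributes; when one of them is left out, the estimated effect of the other absorbs part of the missing attribute’s influence.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (single predictor)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def residuals(xs, ys):
    """What is left of ys after removing the linear effect of xs."""
    b = slope(xs, ys)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [(y - my) - b * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)
n = 5000

# Hypothetical data: education is correlated with experience,
# and income depends on BOTH (true effect of experience is 2.0).
experience = [rng.gauss(0, 1) for _ in range(n)]
education = [0.5 * e + rng.gauss(0, 1) for e in experience]
income = [2.0 * e + 3.0 * z + rng.gauss(0, 1)
          for e, z in zip(experience, education)]

# Education omitted: the slope also picks up education's effect.
naive = slope(experience, income)

# Education accounted for: regress it out of both sides first,
# then fit the slope on what remains.
adjusted = slope(residuals(education, experience),
                 residuals(education, income))

print(f"effect of experience, education omitted:  {naive:.2f}")
print(f"effect of experience, education included: {adjusted:.2f}")
```

In this sketch the estimate with the attribute omitted lands well above the true value of 2.0, while the adjusted estimate sits close to it. Real datasets rarely reveal which attribute is missing, so treat this as an illustration of the mechanism rather than a detection method.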
Feedback Loop/Selection Bias
This bias involves the model influencing the data that is used to train it.
Where does this occur? Often, selection bias takes place in ranked content (think ad personalization, etc.), where items are presented to certain users over others.
The labels for these items are then generated from users’ responses to the items that were presented. Items that were not presented are left with unknown responses. User responses, however, can themselves be influenced by any element of how the item is shown, whether that is its position on the page, the font, or the media.
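The position effect described above can be sketched with a small simulation. This is an illustrative example only: the item names, the appeal values, and the `EXAMINE_PROB` table (the assumed chance that a user even looks at each position) are all hypothetical, and in practice examination probabilities have to be estimated rather than known. Both items are equally appealing by construction, yet the one shown lower on the page collects fewer clicks until the counts are divided by the chance of being seen (a simple form of inverse propensity weighting).

```python
import random

rng = random.Random(1)

# Assumed values, chosen only for illustration.
EXAMINE_PROB = {1: 1.0, 2: 0.4}      # chance a user looks at each position
TRUE_APPEAL = {"A": 0.3, "B": 0.3}   # both items are equally appealing
POSITION = {"A": 1, "B": 2}          # item A is always ranked first

def simulate_clicks(item, impressions):
    """Count clicks: a user must examine the slot AND like the item."""
    pos = POSITION[item]
    clicks = 0
    for _ in range(impressions):
        examined = rng.random() < EXAMINE_PROB[pos]
        if examined and rng.random() < TRUE_APPEAL[item]:
            clicks += 1
    return clicks

impressions = 20000
naive_ctr = {}
corrected_ctr = {}
for item in TRUE_APPEAL:
    clicks = simulate_clicks(item, impressions)
    # Raw click-through rate: confounds appeal with position.
    naive_ctr[item] = clicks / impressions
    # Divide by the probability of being examined to undo the position effect.
    corrected_ctr[item] = naive_ctr[item] / EXAMINE_PROB[POSITION[item]]

print("naive click rate:    ", naive_ctr)
print("corrected click rate:", corrected_ctr)
```

A model trained on the naive rates would learn that item B is less appealing and rank it even lower, which is how the loop feeds itself.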
System Drift Bias
This type of bias occurs when the system that generates the data changes over time. These changes include changes to the attributes captured in the data (including the outcome), or changes to the underlying model or algorithm that cause users to interact with the system differently altogether.
Introducing new modes of user interaction (e.g. like or share buttons), or adding a search feature to your system, are some of the many ways system drift bias can occur.
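One simple way to notice this kind of drift is to compare the mix of recorded interaction types before and after a system change. The sketch below uses made-up event logs (`before`, `after`) and total variation distance as the comparison; both the data and the choice of metric are assumptions for illustration, not a prescribed method.

```python
from collections import Counter

def shares(events):
    """Fraction of the log taken up by each event type."""
    counts = Counter(events)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()}

def total_variation(p, q):
    """Half the summed absolute difference between two distributions (0 = identical, 1 = disjoint)."""
    kinds = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in kinds)

# Hypothetical logs: "like" and "share" buttons did not exist in the first period.
before = ["view"] * 80 + ["comment"] * 20
after = ["view"] * 60 + ["comment"] * 10 + ["like"] * 20 + ["share"] * 10

drift = total_variation(shares(before), shares(after))
print(f"distribution shift between periods: {drift:.2f}")
```

A value near zero suggests the two periods can be pooled; a large value is a prompt to check whether data from the older period still describes how users behave now.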
How to Identify Bias
There are three stages at which bias can enter an investigation: data collection, data preparation, and data analysis.
Data collection is one of the most common places to find biases. Why? Data is typically collected by humans, therefore lending more opportunity for error and bias.
The common biases found in data collection can be categorized into:
- Selection Bias – the selection of data isn’t representative of the population as a whole, and therefore presents bias.
- Systematic Bias – consistent error that repeats itself throughout the model.
- Response Bias – participants answer questions in a way that is false or inaccurate.
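For the selection bias item in the list above, a basic check is to compare the make-up of your sample with known figures for the population. The sketch below assumes you have such figures; the age groups, the `population_share` values, and the `sample_counts` are all invented for illustration. It reports how over- or under-represented each group is and derives a simple post-stratification weight for each.

```python
# Hypothetical reference figures for the population you want to describe.
population_share = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

# Hypothetical counts of respondents actually collected.
sample_counts = {"18-34": 60, "35-54": 30, "55+": 10}

total = sum(sample_counts.values())
sample_share = {group: count / total for group, count in sample_counts.items()}

# Weight = how much each respondent in a group should count
# so that the weighted sample matches the population mix.
weights = {
    group: population_share[group] / sample_share[group]
    for group in population_share
}

for group in population_share:
    print(
        f"{group}: sample {sample_share[group]:.0%}, "
        f"population {population_share[group]:.0%}, "
        f"weight {weights[group]:.2f}"
    )
```

Weights far from 1 indicate groups that are badly over- or under-sampled. Reweighting can help with mild imbalance, but a group with very few respondents is usually better fixed by collecting more data.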
Data preparation is where you get the data ready to be analyzed. Think of this as an extra step toward ethical and unbiased data.
First, you will need to determine whether there are any outliers within the data that would have an unnatural impact on the model itself.
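One common rule of thumb for spotting outliers is the interquartile range (IQR) rule: flag anything more than 1.5 IQRs outside the middle half of the data. The sketch below applies it to a made-up list of values; the data and the 1.5 multiplier are illustrative assumptions, and other rules may suit your data better.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious entry.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)
print("flagged as outliers:", outliers)
```

Flagging is only the first step. Whether a flagged value is an error to remove or a genuine rare case to keep is a judgment call, and removing genuine cases can itself introduce bias.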
How you handle missing values can also introduce bias. If missing values are ignored, or instead replaced with the ‘average’ of the data, you are effectively altering the results. Your dataset would then be skewed toward the values that happened to be recorded, rather than reflecting the population as a whole.
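A small worked example shows what replacing missing values with the average does. The numbers below are made up for illustration. The mean stays the same after filling, but the spread of the data shrinks, so the filled dataset looks more uniform than the recorded values suggest.

```python
import statistics

# Hypothetical column with half of its values missing (None).
raw = [2, 4, 6, 8, 10, None, None, None, None, None]

observed = [v for v in raw if v is not None]
fill_value = statistics.mean(observed)

# Mean imputation: every gap gets the same average value.
imputed = [fill_value if v is None else v for v in raw]

observed_mean = statistics.mean(observed)
imputed_mean = statistics.mean(imputed)
observed_var = statistics.pvariance(observed)
imputed_var = statistics.pvariance(imputed)

print(f"mean:     observed {observed_mean}, after imputation {imputed_mean}")
print(f"variance: observed {observed_var}, after imputation {imputed_var}")
```

Comparing summary statistics before and after any fill step, as done here, is a cheap way to see how much the handling of missing values has changed the data.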
And sometimes, data is simply filtered too much! Over-filtering can leave you with data that no longer represents the population you originally set out to study.
Even after working through the two earlier stages, you may still find bias at the data analysis stage. These are the most typical biases seen during analysis:
- Confirmation bias – holding preconceptions, and focusing on the information that supports them.
- Misleading charts (or graphs) – a distorted display of information that incorrectly represents data. From this, an incorrect conclusion is formed based on the model.
How to Correct Bias
Once you have figured out the source of bias, you have won half the battle. It is then your decision whether to remove the bias or to account for it.
For example, if there is a class imbalance within your data that makes your model more biased, then you could look into ways of resampling.
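As one example of resampling, the sketch below performs random oversampling: rows from the smaller class are duplicated at random until every class is the same size. The function name `oversample_minority`, the `label` field, and the example rows are all hypothetical, and this is only one of several resampling approaches (undersampling the larger class is another).

```python
import random
from collections import Counter

def oversample_minority(rows, label_key, seed=0):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical, heavily imbalanced dataset.
rows = [{"label": "approved"}] * 90 + [{"label": "rejected"}] * 10

balanced = oversample_minority(rows, "label")
counts = Counter(row["label"] for row in balanced)
print(dict(counts))
```

If you use this approach, apply it to the training split only. Oversampling before splitting can place copies of the same row in both training and evaluation data and make the model look better than it is.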
Working with a reputable and ethical source from the beginning will reduce the risk of bias in your AI data collection.
Twine can help build you a dataset that is free of bias.
From our global marketplace of over 400,000 diverse freelancers, we are able to provide a wide selection of data to build your model from the ground up. Our team will work on audio and video datasets, entirely customized to your requirements.