Before you start collecting data and training your model, there are many things to consider. One of the most important is the ethics of your dataset.
How sure are you that your data has been ethically sourced? And why should you care?
In this article, we’re going to break down the importance of ethically sourced data, and why not having it can be disastrous for your model.
In 1996, the Health Insurance Portability and Accountability Act (HIPAA) was enacted, marking the start of digital data collection.
This act was created to protect sensitive, identifying personal health data after medical treatment: data was shared strictly on a “need to know” basis, with patients signing a consent form for its use. However, in the interest of the “common good”, some exceptions were made (e.g. crime-related injuries and infectious diseases) that marked the start of unethical data-sharing.
HIPAA was later updated by the Omnibus Final Rule of 2013, which altered the law to create heavier financial penalties for organizations caught violating the law. This sort of mistreatment of data sets a precedent for how data collection is often viewed…
Then came the introduction of a data protection law you’re probably very familiar with: the General Data Protection Regulation (GDPR). Generally, GDPR goes much further in protecting personal data. Although there has been plenty of discussion over the effectiveness of this law, there is no doubt that it remains one of the most stringent data protection laws in the world. Unlike HIPAA or other data protection laws, GDPR requires organizations to use the highest possible privacy settings by default and limits data usage to six lawful bases: consent, performance of a contract, legal obligation, vital interests, public task, and legitimate interests.
#1: Data Protection
Now, where processing relies on consent, no data can be collected until explicit consent for that purpose has been given, and that consent can be withdrawn at any time. This also means that a Terms of Service agreement cannot give a company free rein over a user’s data indefinitely. Organizations that violate the GDPR can be heavily fined: up to 20 million euros or 4% of the previous year’s total worldwide revenue, whichever is higher.
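To make the size of that penalty concrete, here is a minimal sketch of the upper limit. It assumes the cap is the higher of the two figures, which is how the regulation words it, and the function name `gdpr_max_fine_eur` is our own illustration rather than anything official. Actual fines are set case by case and are usually well below the cap.

```python
def gdpr_max_fine_eur(annual_turnover_eur: float) -> float:
    """Upper limit of a fine in euros (a ceiling, not a prediction)."""
    fixed_cap_eur = 20_000_000                       # 20 million euros
    turnover_cap_eur = annual_turnover_eur * 4 / 100  # 4% of last year's turnover
    # Assumption: the higher of the two figures applies.
    return max(fixed_cap_eur, turnover_cap_eur)


# A company turning over 100 million euros is capped by the fixed amount,
# while one turning over 2 billion euros is capped by the 4% figure.
print(gdpr_max_fine_eur(100_000_000))
print(gdpr_max_fine_eur(2_000_000_000))
```

For smaller organizations the fixed 20 million euro figure dominates, which is why the cap is not only a concern for large enterprises.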
British Airways, for example, faced a proposed fine of 183 million pounds (later reduced to 20 million pounds) after poor security led to a skimming attack targeting 500,000 of its users. Another instance involves Clearview AI: a BuzzFeed investigation reported that employees at multiple law enforcement agencies across the U.S. had used the firm’s controversial facial recognition technology. The report found that officers had used the technology without the knowledge of their departments or the consent of the individuals in the images.
This is what we call bad data protection. Anyone using a dataset needs assurance that the data contains no characteristics that could identify individuals. Not knowing where the data has come from, which is often the case with free datasets, means that you don’t know whether the proper checks and balances have taken place, and you risk tainting your algorithm.
But how easy is it to collect data ethically? What factors do we have to consider?
#2: Consent
Consent means allowing people the choice and control over how you use their data. If the individual has no real choice, consent is not freely given and therefore will be invalid.
Individuals must be able to refuse consent without detriment or prejudice, as well as withdraw consent easily, at any time.
For consent to be freely given, it should be kept separate from other terms and conditions, with separate consent options for different types of processing.
The GDPR is clear that consent should not be bundled together as a condition of service unless it is necessary for the service. This is set out in Article 7(4) of the UK GDPR, which states:
“When assessing whether consent is freely given, utmost account shall be taken of whether… the performance of a contract, including the provision of a service, is conditional on consent to the processing of personal data that is not necessary for the performance of that contract.”
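The consent rules above can be sketched as a simple record that stores consent per purpose and lets a participant withdraw at any time. This is an illustrative data structure only: `ConsentRecord` and the purpose names are hypothetical, and a real system would also need to store the wording shown to the participant and an audit trail.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class ConsentRecord:
    """Hypothetical per-participant consent ledger, one entry per purpose."""
    participant_id: str
    # purpose -> time consent was granted, or None once it has been withdrawn
    granted_at: Dict[str, Optional[datetime]] = field(default_factory=dict)

    def grant(self, purpose: str) -> None:
        # Consent is recorded separately for each type of processing.
        self.granted_at[purpose] = datetime.now(timezone.utc)

    def withdraw(self, purpose: str) -> None:
        # Withdrawal always succeeds, with no conditions attached.
        self.granted_at[purpose] = None

    def allows(self, purpose: str) -> bool:
        # No entry means no consent: processing is opt-in, never assumed.
        return self.granted_at.get(purpose) is not None


record = ConsentRecord("participant-001")
record.grant("voice_model_training")
print(record.allows("voice_model_training"))  # True
print(record.allows("marketing"))             # False
record.withdraw("voice_model_training")
print(record.allows("voice_model_training"))  # False
```

Keeping one entry per purpose is what prevents bundling: agreeing to one use of the data never implies agreement to another.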
#3: Anonymity & Transparency
Data anonymization is the process of protecting private or sensitive information by erasing or encrypting the identifiers that connect an individual to stored data. Personally Identifiable Information (PII), such as names, social security numbers, and addresses, can be run through a data anonymization process to keep the source of the data anonymous.
There should be a clear distinction between personal data and anonymized research data. As a rule, personal data should be destroyed when it is no longer required, and your participants should be made aware of this. Anonymized research data that is stored for the purpose of the model and algorithm can be held indefinitely and made available to others.
The GDPR outlines a specific set of rules that protect user data for this purpose, creating transparency. While GDPR is strict, there is a catch: companies are permitted to collect anonymized data without consent, use it for any purpose, and store it for an indefinite time. However, they are only permitted to do this if all identifiers are removed from the data.
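As a rough illustration of removing identifiers, the sketch below drops direct-identifier fields from each record. The field names in `DIRECT_IDENTIFIERS` are assumptions about a hypothetical schema, and this is only a first step: the remaining fields can still single someone out when combined (for example, a rare job title plus a postcode), so it should not be treated as full anonymization on its own.

```python
from typing import Any, Dict, Iterable, List

# Assumed direct-identifier field names; adapt these to your own schema.
DIRECT_IDENTIFIERS = {"name", "social_security_number", "address"}


def strip_identifiers(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with direct-identifier fields removed."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in rows
    ]


participants = [
    {"name": "Jane Doe", "address": "1 Example St", "age_band": "25-34", "response": "yes"},
]
print(strip_identifiers(participants))
```

The function returns new dictionaries rather than editing the originals, so the raw personal data can be destroyed separately once it is no longer required.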
Bias in data collection is a distortion that results in information not being truly representative of the population you are attempting to study.
Essentially, bias occurs when you ‘hand pick’ your subjects when collecting data. Some people make the mistake of thinking a little bit of bias is harmless; what they don’t realize is how much it can distort the results of their study.
To avoid bias, data needs to be collected objectively. If you’re collecting data via surveys or interviews, for instance, you should use well-prepared questions that do not lead respondents toward a specific answer. Or, if you are selecting a sample of people for your research, you need to make sure the sample group is representative of the population you are studying.
Data should also be collected and recorded in exactly the same way from every participant. By planning the data collection process carefully, instances of bias can be stopped in their tracks before they harm the overall model.
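One simple way to check whether a sample is representative is to compare each group's share of the sample with its share of the population. The sketch below assumes you already know the population shares; the function name, group labels, and numbers are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def representation_gaps(
    sample_groups: Iterable[str], population_shares: Dict[str, float]
) -> Dict[str, float]:
    """Sample share minus population share per group (positive = over-represented)."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    # Only groups listed in population_shares are reported.
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical sample of 100 participants, heavily skewed toward one age band.
sample = ["18-34"] * 70 + ["35-54"] * 20 + ["55+"] * 10
population = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
for group, gap in representation_gaps(sample, population).items():
    print(group, round(gap, 2))
```

A large positive or negative gap is a prompt to recruit more participants from the under-represented groups before training, not proof that the final model will be biased.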
#6: Accountability (Model Training)
Accountability means that the way in which a result was derived from a model, through an end-to-end system, can be understood, traced, and reproduced.
Given the many societal implications and concerns of artificial intelligence systems, demand for transparency and accountability has increased. Datasets that power machine learning are often created, used, and shared with minimal visibility into the processes of their creation.
Ultimately, those collecting data are also accountable for upholding basic human rights in their processes, and it should be easy to hold data collection practices to account against those rights. In addition, those collecting data need to provide clear, openly accessible information about the processes used in their investigation. This does not necessarily mean openly accessible to the public, unless the investigation falls under a government program, in which case the data should be completely anonymized if it is open to the public.
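One lightweight way to give a dataset visibility into how it was created is to store a provenance entry next to each file. The sketch below is an assumption about what such an entry might contain, not an established standard: the field names and example values are hypothetical, and the SHA-256 checksum simply lets anyone confirm later that the file they hold is the one that was documented.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, source: str, collection_method: str) -> dict:
    """Build a small, hypothetical provenance entry for one dataset file."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact bytes
        "source": source,
        "collection_method": collection_method,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


entry = provenance_record(
    b"participant_id,response\n1,yes\n",
    source="survey wave 1",
    collection_method="online form, signed consent",
)
print(json.dumps(entry, indent=2))
```

Because the checksum changes if even one byte of the file changes, a mismatch signals that the data used for training is not the data that was documented.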
Acquiring Ethically-Sourced Data
Luckily, acquiring ethically sourced data for your research and AI model has never been easier with Twine. With over 410,000 freelancers, Twine’s data service is a simple way to create a dataset that is built to your specifications and ethically sourced from this global pool of participants.
Our Account Management team ensures that each participant fully understands the nature of your project and what their data will be used for, before obtaining a signed consent form.
Should the project requirements change, we will revisit each participant to regain consent for the additional or supplementary requirements, making sure that everything stays above board.
The digital and AI learning age will continue to move forward and become more present in our day-to-day lives. Without proper data collection, scandals and major data breaches will only continue to increase. Knowing how to protect data privacy, in a world growing increasingly devoid of it, is an essential part of creating a long-lasting, ethical dataset.
Make sure all of your data collection meets the above criteria for being ethically sourced. In the long term, paid data collection pays off: rather than relying on scraping tools, you need to ensure your data is people-first. Otherwise, you’re doing a major disservice to your AI model. Free datasets, on the other hand, should be avoided at all costs unless they have been vetted for consent and protection, like our database of 100+ open datasets, so you don’t fall into the traps of bias or a breach in data security.