Data Ethics: The Most Important Part of Data Usage and Implementation

Prasansha Satpathy
6 min read · Feb 23, 2022

Ethics is an integral part of any implementation that involves data. Even though data appears to be an objective window into reality, it can reflect existing anomalies and biases in society. Let’s walk through why it is important to collect the right data and why ethics is essential for data science.

Photo by José Martín Ramírez Carrasco on Unsplash

Data is the currency and asset of the 21st century. It is highly valuable and can have a huge impact on society. The great power of leveraging data is intertwined with massive responsibility.

There are 2.5 quintillion bytes of data created each day at our current pace, and more than 90% of the world’s data was generated in the last two years. [IBM]

Many facial recognition systems, when initially introduced to the market with claims of more than 90 percent accuracy, were found to be accurate only for certain sections of society. Even though the accuracy was high only for a smaller section of society, the systems were promoted as universally accurate.
In the 2018 “Gender Shades” project, an intersectional approach was applied to assess three gender classification algorithms, including those developed by IBM and Microsoft. Subjects were grouped into four categories: darker-skinned females, darker-skinned males, lighter-skinned females, and lighter-skinned males. All three algorithms performed worst on darker-skinned females, with error rates up to 34% higher than for lighter-skinned males. The maximum error rate for lighter-skinned males was only 0.8%.

Gender classification confidence scores from IBM. Scores are near 1 for lighter male and female subjects, while they range from ~0.75–1 for darker females. [Source: Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification]
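Auditing a classifier for this kind of disparity is straightforward once predictions are labeled by subgroup. Below is a minimal sketch of such an audit; the group names and toy predictions are purely illustrative, not the actual Gender Shades data:

```python
from collections import defaultdict

# Hypothetical audit records: (subgroup, true_label, predicted_label).
# These values are made up for illustration only.
predictions = [
    ("darker_female", "female", "male"),
    ("darker_female", "female", "female"),
    ("darker_female", "female", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
]

def error_rates_by_group(rows):
    """Return the misclassification rate for each subgroup."""
    totals, errors = defaultdict(int), defaultdict(int)
    for group, truth, pred in rows:
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

rates = error_rates_by_group(predictions)
# darker_female: 2 errors out of 3; lighter_male: 0 errors out of 3
```

Reporting error rates per subgroup, rather than a single aggregate accuracy, is exactly what exposes a model that is "90% accurate" overall but fails badly for one group.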

This suggests that the dataset the systems were trained on was not representative enough; it was clearly evident that the training relied on inappropriate data, resulting in highly biased models.

Auditing five face recognition technologies. [Source: https://sitn.hms.harvard.edu/flash/2020/racial-discrimination-in-face-recognition-technology/]

The market for facial recognition services is projected to double by 2024 as developers work to improve human-robot interaction and target ads to shoppers more precisely. Retailers may try to recognize a shopper’s gender so they can target their audience and sell what they deem appropriate for that gender. Most of the time, this is done without the consent of the shopper. Data taken without users’ consent may be a breach of their privacy.
Also, when such models are applied to real-life situations, they can widen pre-existing social biases and stereotypes.

Data used for RAIs (Risk Assessment Instruments) in criminal justice has been claimed to be biased and discriminatory. The recidivism scores were based on historical datasets skewed toward showing people of color committing crimes in the US. Scores based on old and irrelevant datasets can adversely affect present society. Even when they are statistical analyses, context and data provenance should be taken into consideration, as there is evidence of a biased justice system in the past.

Data is the key input for any data-driven implementation, such as models and artificial intelligence. If the data is not representative, the resulting implementation will be highly biased and unfair. So capturing the right data is another key point: the data ingested should be random but representative of the true population.

For a simple example, suppose a company wants to build a product for women in India. It starts a survey to understand the needs of women in a region. The region may have 70% of its population living in urban areas and 30% combined in rural, semi-rural, and indigenous areas. If the company collects data only from cities, it implicitly assumes the population of women is solely urban; this data fails to represent the other 30%. If the data focuses on housewives, it fails to represent working women. Even if a group is small, omitting its existence entirely may result in a bad product, or one that doesn’t reflect the needs of all women in the region.
If the company understands these percentages and collects data randomly but in the right proportions, keeping diversity in mind, then the data can be highly representative. A product, or even a facial recognition system, built on it is more likely to be accurate universally rather than overfitted to one particular section of society.
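The proportional collection described above is essentially stratified sampling. Here is a minimal Python sketch; the area labels and counts are made up to mirror the 70/30 split in the example:

```python
import random

random.seed(0)

# Hypothetical survey frame: each record tagged with the area it came from.
population = (
    [{"area": "urban"}] * 700
    + [{"area": "rural"}] * 200
    + [{"area": "semi_rural"}] * 70
    + [{"area": "indigenous"}] * 30
)

def stratified_sample(records, key, sample_size):
    """Draw a random sample that preserves each stratum's share of the population."""
    strata = {}
    for record in records:
        strata.setdefault(record[key], []).append(record)
    sample = []
    for members in strata.values():
        n = round(sample_size * len(members) / len(records))
        sample.extend(random.sample(members, n))
    return sample

sample = stratified_sample(population, "area", 100)
# The sample keeps roughly the same area proportions as the population.
```

A purely random sample would, on average, reach similar proportions, but stratifying guarantees that small groups are not dropped by chance.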

Representation is necessary: only a fair implementation built on unbiased data provides correct results. Most commonly used algorithms are stochastic machine learning models that learn the structure of the training data and predict an outcome. As a result, algorithms are reflections of the data they are fed, and if the data is biased, so will be the algorithms.
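This "reflection" effect can be demonstrated with even the simplest possible model. The sketch below fits a majority-label lookup on deliberately skewed, entirely hypothetical counts, and the bias of the training data comes straight back out in the predictions:

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately skewed training data: the "ceo" role is almost
# always labeled with one gender, reflecting a biased collection process.
training = (
    [("ceo", "male")] * 90
    + [("ceo", "female")] * 10
    + [("assistant", "female")] * 85
    + [("assistant", "male")] * 15
)

def fit_majority(pairs):
    """Learn the most frequent label per input: the simplest possible 'model'."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}

model = fit_majority(training)
# The model now always predicts "male" for "ceo" and "female" for "assistant",
# reproducing the stereotype baked into the data.
```

Real models are far more complex, but the principle is the same: nothing in the training procedure corrects a skew that the data itself carries.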

Privacy is another ethical issue tied to the misuse of data. Recording someone’s data without their consent is wrong and raises serious ethical questions. Data should be captured with the informed consent of participants. If a data breach occurs due to irresponsible handling of data, it should be communicated, and the responsible authority should be held accountable. Many companies have been charged in the past with misusing data that users provided for other purposes, sometimes even aggressively steering user actions. Whistleblowers from such companies, including some large ones, have made claims of such practices both recently and in the past.

Some recent data breaches in the last three years. [Source: Wikipedia]

Financial data is highly sensitive, and many reforms have been introduced recently to govern its use. In India, the RBI (Reserve Bank of India) has recently regulated users’ financial data through Account Aggregators, which personal finance applications, or FIUs (Financial Information Users), must use to access a user’s financial details only with consent. Such regulations can ensure transparency and, to some extent, privacy.

FAT (Fairness, Accountability, and Transparency) is another contemporary concept that defines the three main areas for ensuring unbiased systems.
This is a good resource on the same.

Data ethics is essential for real-life applications.

There are also unconscious biases in society that can affect the actions of data users. Even when people understand their actions and take care to be inclusive, a bias can be absolutely unintentional. Picture the words “CEO” or “head of the office”: what comes to mind? A male figure? But a CEO can be a female too, can’t she? And what about transgender and other genders? Why didn’t another picture come to mind?
Another word is “Assistant”: what comes to mind? A female figure? But an assistant can be anyone, including men and other genders.

Such pre-existing stereotyped roles, and other areas like them, can produce unconscious bias if not properly discussed and introspected. Developers working on such systems may unconsciously form biases, so it is important to stay vigilant before any implementation.

Data science has many technical parts, including statistics and probability, modeling and prediction, artificial intelligence, and advanced applications of data. Thus data science practitioners should have a strong grasp of data ethics and its implications. Big data ethics, also known simply as data ethics, refers to systemizing, defending, and recommending concepts of right and wrong conduct in relation to data, in particular personal data.

Data ethics is concerned with the following principles:

  1. Ownership — Individuals own their own data.
  2. Transaction transparency — If an individual’s personal data is used, they should have transparent access to the algorithm design used to generate aggregate datasets.
  3. Consent — If an individual or legal entity would like to use personal data, one needs informed and explicitly expressed consent from the owner of the data about what personal data moves to whom, when, and for what purpose.
  4. Privacy — If data transactions occur, all reasonable effort must be made to preserve privacy.
  5. Currency — Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions.
  6. Openness — Aggregate datasets should be freely available.

In modeling and algorithms, future implementations absolutely require data-centered models built on unbiased data, so that they can be both technically accurate and ethically sound.

Thank you for reading! Please share your thoughts on this blog in the comments section. Provide other examples of ethical issues with data usage or possible future misuse, or suggestions on how they can be addressed!


Prasansha Satpathy

AI and data science enthusiast, interested in finance, business intelligence, and blockchain. I also have an interest in astrophysics.