What are 3 potential problems, limitations, or challenges to data collected in the community?

A recent McKinsey survey reported that AI adoption has increased rapidly worldwide (see Figure 1). However, the glass is still half empty: almost half of the world has yet to leverage the power of AI.

This gap may stem from the challenges and barriers businesses face while developing and implementing the technology. Some of those challenges arise during the data collection (harvesting) phase of developing an AI/ML model.

This article explores four data collection challenges and ways to overcome them, to help business leaders and developers streamline AI development and implementation.

Figure 1. AI adoption from 2020 to 2021

Source: McKinsey

1. Selecting the right dataset

Data can be considered the fuel for an AI/ML model, and choosing the dataset is one of the most crucial steps of data collection. A key risk at this stage is myopic data: data that does not cover the full scope of the project and is not aligned with the real-world tasks the model will perform.

A study by the University of California and Google found that, in the machine learning development community, the majority of the datasets used to train models are reused or borrowed. Such datasets can be misaligned with the project's objectives, which can result in an inaccurate finished product.

Solution

To overcome this challenge, the following steps can be taken:

Assign a dedicated team for data collection. A dedicated team will know the project inside and out and will be able to choose the right dataset.
Ensure that the team understands the objectives and goals of the project.
If prepackaged datasets do not cover the scope of the project, opt for another data collection method that suits the project better.
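
One way to make the scope check concrete is to compare the scenarios the model must handle against the labels actually present in a candidate dataset. The following is a minimal sketch, not a prescribed method: the function name `coverage_gaps`, the `min_examples` threshold, and the example labels are all hypothetical and chosen for illustration only.

```python
from collections import Counter


def coverage_gaps(required_labels, dataset_labels, min_examples=1):
    """Return required labels that have fewer than min_examples in the dataset.

    required_labels: scenarios the deployed model is expected to handle
    dataset_labels:  one label per example in the candidate dataset
    """
    counts = Counter(dataset_labels)
    return {
        label: counts[label]
        for label in required_labels
        if counts[label] < min_examples
    }


# Hypothetical example: a reused dataset that barely covers night or rain scenes.
required = ["day", "night", "rain"]
candidate = ["day"] * 5 + ["night"]
print(coverage_gaps(required, candidate, min_examples=2))
```

A non-empty result suggests the dataset is too narrow for the project and that additional collection may be needed for the listed scenarios.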

2. Avoiding data bias

Collecting biased data can lead to a biased and erroneous AI/ML model and thus should be avoided.

For instance, if the dataset used to train a patient referral system does not include male patients or patients with lower income levels, it will provide biased and erroneous outcomes when implemented in a real clinic.

This bias can unintentionally be transferred by the data collector into the AI model.

Solution

The following steps can be taken to overcome data bias while harvesting data:

Ensure that the dataset is comprehensive and all-inclusive. For instance, a quality inspection system must be trained with data on both defective and working items.
Ensure that the participants in data collection and revision include people from diverse backgrounds. The dataset must represent the total population on which the AI/ML system will be deployed.
Use crowdsourcing to expand the range of the data, since it offers fast access to large amounts of human-generated data. Because the data collectors are located in different countries, the resulting datasets tend to be more diverse.
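
As a rough illustration of the representation check described above, the share of each group in the collected data can be compared with its expected share in the deployment population. This is a minimal sketch under stated assumptions: the function name `representation_gaps`, the `tolerance` parameter, and the group names and shares are hypothetical, and a real bias audit would need more than a single-attribute count.

```python
from collections import Counter


def representation_gaps(samples, target_shares, tolerance=0.05):
    """Flag groups whose share in the data falls short of the target share.

    samples:       one group value per collected record (e.g. a demographic attribute)
    target_shares: expected share of each group in the deployment population
    tolerance:     allowed shortfall before a group is flagged (assumed value)
    """
    total = len(samples)
    if total == 0:
        raise ValueError("dataset is empty")
    counts = Counter(samples)
    gaps = {}
    for group, target in target_shares.items():
        actual = counts[group] / total
        if target - actual > tolerance:
            gaps[group] = (actual, target)
    return gaps


# Hypothetical patient data in which one group is heavily under-represented.
records = ["female"] * 90 + ["male"] * 10
print(representation_gaps(records, {"female": 0.5, "male": 0.5}))
```

Groups returned by the function are candidates for targeted additional collection before training.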


3. Data protection and legal issues

This section explains some ethical and legal constraints to data collection:

Data protection

Not all data is publicly and readily available to use. Some data is sensitive in nature and cannot be accessed easily, which makes it challenging to collect.

For instance, training a computer vision system for radiology requires thousands of medical images. This type of data can be expensive to collect and can have various ethical constraints attached.

Legal issues

Data collection is not as easy as it used to be. As people and government bodies recognize the risks of data exploitation, they make more efforts to regulate data collection and improve data protection.

Solution

In order to avoid these issues, considering the following questions prior to data collection can be helpful:

What data will be collected?
- To answer this, check what type of data is required. For instance, is it biometric data, such as face images, voice data, or thumbprint scans? This helps clarify the ethical and legal factors to consider.
How should legal stipulations (related to the collection of the dataset) be addressed?
- To answer this question, study the country-specific regulations regarding data collection.
How will the data be collected?
- Different data collection methods have different legal considerations attached to them. For instance, different countries have their own rules regarding web scraping.
How will the data be stored?
- Since cyber threats are rising, it is important to consider where the data will be safest. Will the cloud be more efficient, or will physical hard drives be safer?
How will the data be used?
- To answer this question, determine how the data will be used and who within the organization will have access to it, and communicate this information to the data provider.

Answering these questions and clearly explaining the answers to the participants can make the whole data collection process transparent. It is also important to check the data collection rules issued by the relevant regulatory body in the country in which the data is being collected.
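
The five questions above can be tracked as a simple record so that nothing is left unanswered before collection starts. This is only a sketch: the `CollectionPlan` class, its field names, and the `unanswered` helper are hypothetical names introduced here for illustration, and filling in the record does not replace legal review.

```python
from dataclasses import dataclass, fields


@dataclass
class CollectionPlan:
    """One free-text answer per question from the list above (hypothetical fields)."""
    data_types: str = ""               # What data will be collected?
    applicable_regulations: str = ""   # How should legal stipulations be addressed?
    collection_method: str = ""        # How will the data be collected?
    storage_location: str = ""         # How will the data be stored?
    intended_use_and_access: str = ""  # How will the data be used, and by whom?


def unanswered(plan):
    """Return the names of the questions that still have no answer."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Hypothetical plan with only two questions answered so far.
plan = CollectionPlan(data_types="face images", collection_method="in-person capture")
print(unanswered(plan))
```

The completed record can also serve as the summary that is shared with participants.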

4. Underestimating the costs

Large datasets require a large number of data collectors. In this case, the costs can pose a barrier. For instance, if a company opts for in-house data collection for an ML project, it will have to perform the following tasks:

Hire a dedicated team of data collectors
Ensure the level of diversity and skillsets match the requirements of the project
Go through onboarding and training for the data collectors
Acquire all relevant resources for data collection
Track and manage the progress of data collection tasks from all participants

This process can be unaffordable or even overwhelming for some businesses, thus thwarting the entire process.

Solution

The following considerations can help overcome this challenge:

Consider data collection costs during the planning phase of the AI/ML development project.
If the costs cannot be accommodated in the budget, consider outsourcing the operation.
Use prepackaged datasets if the project does not require highly personalized data. These are generally cheaper to purchase.
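
To bring costs into the planning phase, the in-house tasks listed earlier can be turned into a rough estimate and compared with the budget. The sketch below is illustrative only: the function name `in_house_cost`, its parameters, and every figure in the example are assumptions, not numbers from any survey or vendor.

```python
def in_house_cost(num_collectors, hourly_rate, hours_per_collector,
                  onboarding_cost_per_collector, fixed_resources_cost):
    """Rough in-house estimate: labor + onboarding/training + equipment and tooling."""
    labor = num_collectors * hourly_rate * hours_per_collector
    onboarding = num_collectors * onboarding_cost_per_collector
    return labor + onboarding + fixed_resources_cost


# Hypothetical figures in an arbitrary currency.
estimate = in_house_cost(
    num_collectors=10,
    hourly_rate=20,
    hours_per_collector=100,
    onboarding_cost_per_collector=500,
    fixed_resources_cost=5000,
)
budget = 25000  # assumed project budget for data collection

if estimate > budget:
    print(f"Estimate {estimate} exceeds budget {budget}: consider outsourcing or prepackaged data")
else:
    print(f"Estimate {estimate} fits within budget {budget}")
```

Even a coarse estimate like this makes it easier to decide early between in-house collection, outsourcing, and prepackaged datasets.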

