McKinsey’s recent survey reported that the adoption of AI has rapidly increased globally (see Figure 1). However, the glass is still half-empty: almost half of the world has yet to leverage the power of AI.
This gap may stem from the various challenges and barriers businesses face while developing and implementing the technology. Some of those challenges arise during the data collection (harvesting) phase of developing an AI/ML model.
This article explores four data collection challenges and ways to overcome them, to help business leaders and developers streamline the AI development and implementation process.
Figure 1. AI adoption from 2020 to 2021
1. Selecting the right dataset
Data can be considered the fuel for an AI/ML model, so determining the dataset is one of the most crucial steps of data collection. A key challenge at this stage is myopic data: data that does not cover the full scope of the project and is not aligned with the real-world activities the model will perform.
A study by the University of California and Google found that, in the machine learning development community, the majority of the datasets used to train models are reused or borrowed. Because such datasets were gathered for other purposes, they can be misaligned with the project's objectives and result in an inaccurate finished product.
Solution
To overcome this challenge, the following steps can be taken:
- Assign a dedicated team for data collection. A dedicated team will know the project inside and out and will be able to choose the right dataset.
- Ensure that the team understands and knows the objectives and goals of the project.
- If prepackaged datasets do not cover the scope of the project, then opt for another data collection method that best suits the project.
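One way to test whether a prepackaged dataset covers the project scope is to list the conditions the model is expected to encounter and count how often each appears in the candidate dataset. The snippet below is a minimal sketch of that idea; the function name `coverage_gaps`, the `min_count` threshold, and the example labels are all hypothetical and not taken from any specific tool.

```python
from collections import Counter


def coverage_gaps(labels, required, min_count=1):
    """Return the required categories that appear fewer than min_count times.

    labels    -- one category label per record in the candidate dataset
    required  -- categories the deployed model is expected to handle
    min_count -- assumed minimum number of examples per category
    """
    counts = Counter(labels)
    return {
        category: counts.get(category, 0)
        for category in required
        if counts.get(category, 0) < min_count
    }


# Hypothetical example: a reused image dataset checked against the
# conditions a new project expects to see in production.
dataset_labels = ["daylight", "daylight", "night", "daylight", "rain"]
required_conditions = ["daylight", "night", "rain", "snow"]

gaps = coverage_gaps(dataset_labels, required_conditions, min_count=2)
print(gaps)
```

Any category reported here is a candidate for additional, project-specific data collection. The threshold is a judgment call that depends on the model and the task.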
2. Avoiding data bias
Collecting biased data can lead to a biased and erroneous AI/ML model and thus should be avoided.
For instance, if the dataset used to train a patient referral system does not include male patients or patients with lower income levels, it will provide biased and erroneous outcomes when implemented in a real clinic.
This bias can unintentionally be transferred by the data collector into the AI model.
Solution
The following steps can be taken to overcome data bias while harvesting data:
- Ensure that the dataset is comprehensive and all-inclusive. For instance, a quality inspection system must be trained with data on both defective and working items.
- Ensure that the participants for data collection and revision include people from diverse backgrounds. The dataset must represent the total population on which the AI/ML system will be deployed.
- Use crowdsourcing to expand the range of the data, since it offers fast access to large amounts of human-generated data. Because crowdsourced data collectors are often located in different countries, the resulting datasets tend to be more diverse.
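A simple way to check whether a dataset represents the target population is to compare the share of each group in the data with its expected share in the population. The following is a rough sketch under stated assumptions: the function `representation_gaps`, the 10-percentage-point tolerance, and the patient figures are illustrative only, and real projects would typically examine several attributes at once.

```python
from collections import Counter


def representation_gaps(samples, target_shares, tolerance=0.10):
    """Flag groups whose share in the data deviates from the target share.

    samples       -- one group label per record (e.g. a demographic attribute)
    target_shares -- expected share of each group in the deployment population
    tolerance     -- assumed acceptable absolute deviation (0.10 = 10 points)
    Returns {group: actual_share - target_share} for flagged groups.
    """
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            gaps[group] = round(actual - target, 3)
    return gaps


# Hypothetical patient referral dataset that under-represents male patients.
patients = ["female"] * 80 + ["male"] * 20
target = {"female": 0.5, "male": 0.5}

flagged = representation_gaps(patients, target)
print(flagged)
```

A positive value means the group is over-represented relative to the target, and a negative value means it is under-represented.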
3. Data protection and legal issues
This section explains some ethical and legal constraints on data collection:
Data protection
Not all data is readily and publicly available to use. Some data is sensitive in nature and cannot be accessed easily, which makes it challenging to collect.
For instance, in order to train a computer vision system for radiology, thousands of medical images are required. This type of data can be expensive to collect and can have various ethical constraints attached.
Legal issues
Data collection is not as easy as it used to be. As people and government bodies recognize the risks of data exploitation, they make more efforts to regulate data collection and improve data protection.
Solution
In order to avoid these issues, considering the following questions prior to data collection can be helpful:
- What data will be collected?
  - To answer this, check what type of data is required. For instance, is it biometric data, such as face images, voice data, or thumbprint scans? This helps clarify which ethical and legal factors to consider.
- How should legal requirements related to the collection of the dataset be met?
  - To answer this question, study the country-specific regulations regarding data collection.
- How will the data be collected?
  - Different data collection methods have different legal considerations attached to them. For instance, different countries have their own rules regarding web scraping.
- How will the data be stored?
  - Since cyber threats are rising, it is important to consider where the data will be safest. Will the cloud be more efficient, or will physical hard drives be safer?
- How will the data be used?
  - To answer this question, determine how the data will be used and who within the organization will have access to it, and communicate this information to the data provider.
Answering these questions and clearly explaining the answers to the participants can make the whole data collection process transparent. It is also important to check the data collection rules set by the relevant regulatory body in the country in which the data is being collected.
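The five questions above can be recorded as a short checklist, so that collection does not start while any of them is still unanswered. The sketch below is one possible way to do that; the `CollectionPlan` class and its field names are hypothetical and simply mirror the questions in this section.

```python
from dataclasses import dataclass, fields


@dataclass
class CollectionPlan:
    """Hypothetical record of answers to the five pre-collection questions."""

    data_types: str = ""              # What data will be collected?
    applicable_regulations: str = ""  # Which legal requirements apply?
    collection_method: str = ""       # How will the data be collected?
    storage: str = ""                 # How will the data be stored?
    usage_and_access: str = ""        # How will it be used, and by whom?

    def unanswered(self):
        """Return the names of questions that still have a blank answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example plan with only two of the five questions answered so far.
plan = CollectionPlan(
    data_types="voice recordings",
    collection_method="crowdsourced mobile app",
)

missing = plan.unanswered()
print(missing)
```

The completed record can also be shared with participants as part of explaining how their data will be handled.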
4. Underestimating the costs
Large datasets require a large number of data collectors. In this case, the costs can pose a barrier. For instance, if a company opts for in-house data collection for an ML project, it will have to perform the following tasks:
- Hire a dedicated team of data collectors
- Ensure the level of diversity and skillsets match the requirements of the project
- Go through onboarding and training for the data collectors
- Acquire all relevant resources for data collection
- Track and manage the progress of data collection tasks from all participants
This process can be unaffordable or even overwhelming for some businesses, and can stall the entire project.
Solution
The following considerations can help overcome this challenge:
- Consider data collection costs during the planning phase of the AI/ML development project
- If the costs cannot be adjusted in the budget, consider outsourcing the operation
- Use prepackaged datasets if the project does not require highly personalized data. These are typically cheaper than collecting custom data.
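To account for collection costs during planning, it can help to put rough numbers on the in-house tasks listed above. The following is a back-of-the-envelope sketch, not a pricing model: the function `estimate_in_house_cost`, the 15% management overhead, and every figure in the example are assumptions chosen purely for illustration.

```python
def estimate_in_house_cost(
    num_collectors,
    hours_per_collector,
    hourly_rate,
    onboarding_cost_per_collector,
    equipment_cost,
    management_overhead=0.15,
):
    """Rough in-house data collection cost estimate.

    management_overhead is an assumed fraction added on top of direct costs
    to cover tracking and managing the collectors' work.
    """
    labor = num_collectors * hours_per_collector * hourly_rate
    onboarding = num_collectors * onboarding_cost_per_collector
    subtotal = labor + onboarding + equipment_cost
    return round(subtotal * (1 + management_overhead), 2)


# Hypothetical figures: 10 collectors working 100 hours each.
in_house = estimate_in_house_cost(
    num_collectors=10,
    hours_per_collector=100,
    hourly_rate=20.0,
    onboarding_cost_per_collector=500.0,
    equipment_cost=3000.0,
)

budget = 25000.0  # hypothetical budget set aside for data collection
over_budget = in_house > budget
print(f"Estimated in-house cost: {in_house:.2f} (over budget: {over_budget})")
```

If the estimate exceeds the budget, that is a signal to compare it against quotes for outsourcing or the price of a prepackaged dataset.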
Shehmir Javaid is an industry analyst at AIMultiple. He has a background in logistics and supply chain management research and loves learning about innovative technology and sustainability. He completed his MSc in logistics and operations management at Cardiff University, UK, and his bachelor's in international business administration at Cardiff Metropolitan University, UK.