Edition No.22 Why Are Machine Learning and Deep Learning Difficult? - AMORE STORIES - ENGLISH
#Digital
2018.08.24


Columnist: Kim Haksin
Digital IT Innovation Team


# Intro

 They say AI will change everything. Artificial intelligence (AI) has penetrated many aspects of our lives, providing new customer experiences we have never seen before and solving problems that until now were intractable. AI is expected to be used in more and more areas: automating simple, repetitive manual work, making predictions about the future based on data, and even understanding unstructured data such as text and images. At the core of AI lie machine learning and deep learning, the latter modeled on the neural networks of the human brain. Both are becoming a bigger part of our lives.

 Through Amorepacific's new program, Theme Hyecho, I had the opportunity to experience first-hand Amazon's various AI services, which are expected to be applied to many areas from a B2C perspective. I also learned about machine learning and deep learning from beginning to end. Three months is probably too short to understand everything about these technologies, but the journey of learning them was a process of solving challenges from start to finish. Through that experience, I came to understand why machine learning and deep learning are difficult, and I would like to share why.

# Why is machine learning difficult?

1. Data problems

 Machine learning and deep learning are about building a model (a function) that gets closer to the answer by learning repeatedly from data. Because the model is learned from data, the outcome can be completely different depending on the data it is given. It is no exaggeration to say that the quality of the data determines the outcome of machine learning and deep learning. Then what is high-quality data? We can look at it from two perspectives: data quality and data volume. When I asked what qualifies as good data quality, people at Amazon replied, "Good data must have continuity, with no missing values, and it should be consistent." I was also told that "there should be a meaningful amount of good data on the problem you seek to solve through machine learning or deep learning." As for volume, machine learning and deep learning require a significant amount of data. For example, say you want to develop a model that categorizes images. The required data varies with how many categories you want, but you need dozens, if not hundreds, of images for just one object.
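The two quality checks mentioned above (no missing values, enough examples per class) can be automated. Below is a minimal sketch of such an audit; the dataset, field names, and thresholds are all made up for illustration.

```python
from collections import Counter

REQUIRED_FIELDS = ["age", "purchases"]   # hypothetical schema
MIN_EXAMPLES_PER_CLASS = 3               # real tasks need far more

def audit(dataset):
    """Report missing values and classes with too few examples."""
    missing = 0
    labels = Counter()
    for features, label in dataset:
        labels[label] += 1
        for field in REQUIRED_FIELDS:
            if features.get(field) is None:
                missing += 1
    too_small = [c for c, n in labels.items() if n < MIN_EXAMPLES_PER_CLASS]
    return {"missing_values": missing, "under_sampled_classes": too_small}

data = [
    ({"age": 31, "purchases": 4}, "buyer"),
    ({"age": None, "purchases": 1}, "browser"),  # a gap in continuity
    ({"age": 27, "purchases": 0}, "browser"),
    ({"age": 45, "purchases": 9}, "buyer"),
    ({"age": 38, "purchases": 2}, "browser"),
    ({"age": 52, "purchases": 7}, "buyer"),
]
print(audit(data))   # flags the one missing value
```

An audit like this is cheap to run every time new data arrives, which is one way to enforce the "continuity and consistency" the Amazon engineers described.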

 In reality, obtaining data that satisfies both quality and volume is very challenging. Collecting data takes a great deal of time and effort; it is not something that can be done in the short term, and it requires a lot of resources. Many businesses now recognize that data is both a competitive advantage and an asset, so securing data from external sources comes at a high cost. This only proves that the importance of data keeps growing.

2. Painful data preprocessing

 Data preprocessing is essential when building machine learning or deep learning models. Simply put, it covers understanding and analyzing the data, refining it, converting it, and formatting it. To understand and analyze the data, you need knowledge of the relevant domain; based on that knowledge, you work out what the data means, which takes up quite a significant amount of time. Only after this step can you delete unnecessary data or derive new features that you believe will be significant from the data you have. Next, you fill in missing values, and if the variance of the values is large, you normalize the data to reduce its spread. This is also the point where you convert text into numbers, because machine learning and deep learning perform calculations on numbers. Finally, there is the formatting step, where you reshape the data into the format the machine learning system expects. Preprocessing takes up a significant part of any machine learning or deep learning project, although how much varies with data quality. And if the outcome of the model is not good enough and you try a different approach, you will often have to do some of the preprocessing again.
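The middle steps described above can be sketched in a few lines. This is a toy example with made-up column names: fill a missing value with the column mean, squeeze a wide-ranging numeric column into [0, 1] with min-max normalization, and encode text categories as integers.

```python
rows = [
    {"price": 120.0,  "category": "skincare"},
    {"price": None,   "category": "makeup"},    # missing value to fill
    {"price": 4500.0, "category": "skincare"},  # large variance
    {"price": 80.0,   "category": "fragrance"},
]

# 1. Fill missing values with the column mean.
known = [r["price"] for r in rows if r["price"] is not None]
mean_price = sum(known) / len(known)
for r in rows:
    if r["price"] is None:
        r["price"] = mean_price

# 2. Min-max normalization: map prices into the range [0, 1].
lo = min(r["price"] for r in rows)
hi = max(r["price"] for r in rows)
for r in rows:
    r["price"] = (r["price"] - lo) / (hi - lo)

# 3. Encode text categories as integers (models compute on numbers).
codes = {name: i for i, name in
         enumerate(sorted({r["category"] for r in rows}))}
for r in rows:
    r["category"] = codes[r["category"]]

print(rows)
```

Real pipelines do the same things at scale (and with more care about leakage between training and test data), but the shape of the work is exactly this.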

3. Too many different recipes

 Machine learning and deep learning do not offer a single solution to a problem; there is a diverse range of them. Say you are creating a model that distinguishes spam mail from legitimate mail. You can choose among many algorithms, including XGBoost, decision trees, support vector machines, and k-NN (k-nearest neighbors). And even after choosing an algorithm, the issue of many recipes remains: there are values that a person must set based on past experience or know-how, called hyperparameters. You need to set every hyperparameter, but the number of possible combinations of values is astronomical, and the outcome of learning changes depending on them. In other words, you go through a lot of trial and error to find an appropriate combination of hyperparameters; with experience and know-how on similar problems, you can start to solve problems a bit faster.
  • Illustration showing why machine learning is difficult: arriving at the outcome only when there is enough data, the right algorithm and implementation
    Source : http://ai.stanford.edu
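The trial-and-error over hyperparameters can at least be systematized as a grid search: enumerate every combination and keep the best-scoring one. In this sketch the parameter names are typical but arbitrary, and the scoring function is a stand-in for training a real model and measuring validation accuracy.

```python
from itertools import product

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100],
}

def validate(params):
    # Stand-in for "train a model, return validation accuracy";
    # it simply rewards deeper trees and a moderate learning rate.
    return params["max_depth"] - abs(params["learning_rate"] - 0.01) * 10

best_score, best_params = float("-inf"), None
keys = list(grid)
for values in product(*(grid[k] for k in keys)):  # 3 * 3 * 2 = 18 combos
    params = dict(zip(keys, values))
    score = validate(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)
```

Even this tiny grid has 18 combinations; with a real model each one costs a full training run, which is why the combinatorics quickly become astronomical and why experience about which regions of the grid to search first is so valuable.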

4. No way to guarantee the optimum outcome

 Another frequently raised challenge of machine learning and deep learning is whether the outcome of a given recipe is the best one possible. In technical terms, the best outcome is called the global minimum: the point where the error over the entire training data is at its lowest. But machine learning and deep learning sometimes settle at a local minimum (a point that is only the lowest within its neighborhood of the error landscape), not the global minimum, and even when the global minimum is reached, it is almost impossible to prove that it has been. Therefore, whether a model's outcome has significance for a business comes down to whether the business accepts the model's results.
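The local-minimum problem can be made concrete with a toy example (the curve and all numbers here are made up for illustration): plain gradient descent on a one-dimensional non-convex function lands in a different valley depending on where it starts, even though the recipe is identical.

```python
def f(x):
    """A non-convex curve with a shallow valley (right) and a deep one (left)."""
    return x**4 - 3 * x**2 + x

def grad(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=500):
    """Plain gradient descent from a chosen starting point."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

from_right = descend(2.0)    # settles in the shallow valley (local minimum)
from_left = descend(-2.0)    # settles in the deep valley (global minimum)
print(round(from_right, 2), round(from_left, 2))
```

Both runs follow the same rule with the same learning rate; only the starting point differs, yet `f(from_left) < f(from_right)`. In the million-dimensional landscapes of deep learning, there is no way to inspect the whole surface, which is why the global minimum can never be guaranteed.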

5. Learning process that requires an incredible amount of time

 As mentioned above under data problems, machine learning and deep learning need a significant amount of data, and learning from that much data takes a lot of time. It depends on the problem, but say you are developing a model that classifies images. You need a very costly graphics processing unit (GPU), and even with one, training takes anywhere from a few hours to a few days. If the result is good, great; if not, you need a different recipe and another round of training. And of course, re-training does not guarantee a better outcome.

6. Difficult to explain outcome

 In machine learning and deep learning, only a few kinds of models, such as decision trees, lend themselves to interpreting their outcomes. This holds for both supervised learning* and unsupervised learning*. For deep learning in particular, it is almost impossible to pinpoint the reasons behind an outcome, because what happens in the hidden layers is effectively a black box: there is no way of knowing why the model produced a certain result. In fact, someone I met at Amazon told me that interpreting and analyzing why a model produced a certain outcome takes more time and effort than training the model and validating the outcome. Because of this, there is a lot of research into explainable AI, but it is still in its early stages, and we seem to have a long way to go.

- Supervised learning* : A method that automates decision-making by building a generalized model from examples whose answers are already known (e.g. teach the computer to distinguish spam from legitimate mail using labeled examples, then classify future mail based on what it learned).
- Unsupervised learning* : Unlike supervised learning, no answers are provided. This method is mainly used to understand how data is organized or correlated (e.g. clustering).
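The supervised setting from the spam example can be sketched in a few lines. Here the features (number of links, number of exclamation marks) and the training mails are invented for illustration: because labels are given, we can compute one centroid per class and classify new mail by the nearest centroid. In the unsupervised setting we would drop the labels and could only cluster similar mails together.

```python
# Labeled training data: ((num_links, num_exclamations), label).
labeled = [((5, 7), "spam"), ((4, 6), "spam"),
           ((0, 1), "ham"),  ((1, 0), "ham")]

def centroid(points):
    """Mean point of a list of 2-D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Supervised learning: one centroid per known class.
centroids = {label: centroid([p for p, l in labeled if l == label])
             for label in {"spam", "ham"}}

def classify(mail):
    """Assign the class whose centroid is nearest (squared distance)."""
    return min(centroids, key=lambda c: (mail[0] - centroids[c][0]) ** 2
                                        + (mail[1] - centroids[c][1]) ** 2)

print(classify((6, 5)))   # near the spam examples
print(classify((0, 0)))   # near the ham examples
```

The decisive ingredient is the labels: remove them and the same geometry can still group the mails into two clusters, but nothing in the data says which cluster is "spam".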

# Trends in machine learning and deep learning to overcome such challenges

 As explained above, it is very difficult to write a machine learning or deep learning program from scratch; it requires an expert. However, the number of qualified experts is finite, and demand significantly exceeds supply. Machine learning and deep learning engineers are hard to hire even in Silicon Valley, and they are naturally highly paid. These sought-after engineers usually work at IT giants such as Google, Amazon, and Microsoft, or at startups aiming to become unicorns. In other words, applying machine learning and deep learning is out of reach for all but a few big companies. This is exactly why IT giants are releasing AutoML tools to make machine learning and deep learning more accessible, as explained in the previous column (Edition No. 21, AI-Creating Artificial Intelligence, AutoML). At Google Cloud Next 2018, the Google Cloud computing conference recently held here in San Francisco, Google unveiled an AutoML capability for BigQuery: it builds machine learning or deep learning models automatically once you query your data and define the problem you want to solve on it. Amazon offers SageMaker, and IBM has released Watson Studio. It is true that these platforms are aimed mostly at data scientists* rather than average developers, but it is also true that machine learning and deep learning are quickly becoming more accessible and easier to use for everyone, developers included. I believe that quickly solving a variety of problems on these platforms will be possible in the near future.

- Data scientist* : Data scientists are responsible for finding insights that help achieve certain business performance or goals from data. Their main roles include organizing and analyzing huge amounts of data.
  • AutoML process
    Source : Data Robot


# Epilogue to column edition no. 22

 Although the project was a short three-month experience, in this column I have shared what I learned and the challenges of machine learning and deep learning I heard about from the people at Amazon. The column has focused mostly on the difficulties these technologies pose, but such difficulties are inevitable as machine learning and deep learning develop. They must be solved, and the know-how gained along the way will let us solve ever more problems. I will apply what I learned from Theme Hyecho to potential machine learning and deep learning services from a B2C perspective and do my best to provide a variety of customer experiences.

