On any fast-moving topic there is always something new to learn, and machine learning is no exception. This article points out five things about machine learning that you may not know, may not have realized, or may once have known and since forgotten.
Please note that the title of this article is not "the most important 5 things" or "the first 5 things" to know about machine learning; it is just "5 things." The list is not authoritative. It is simply a collection of five things that might be useful.
1. Data preparation is 80% of machine learning, so…
Data preparation takes up a significant share of the time in machine learning tasks, or at least a seemingly large share. That is the common belief, anyway.
We often discuss the details of how to carry out data preparation and why it is important, but there is something more to pay attention to: why you should care about data preparation at all. I don't mean only in order to get conforming data; I mean something closer to a philosophical argument for why you should embrace data preparation. Do data preparation well, and become someone known for it.
Data preparation in the CRISP-DM model.
Some of the best machine learning advice I can think of is this: since you are destined to spend a lot of time preparing data for the big project, setting out to be the best data preparer around is a pretty good goal. Data preparation is not only time-consuming and laborious, it is also of great importance to the steps that follow (garbage in, garbage out), so having a reputation as an excellent data preparer would not be the worst thing in the world.
So, yes, although data preparation may take a while to carry out and to master, that is really not a bad thing. The need for it creates plenty of opportunity, whether to stand out from the crowd as a professional or to demonstrate the intrinsic value of your work.
2. The value of a performance baseline
Suppose you have modeled some data with a particular algorithm, spent a lot of time tuning your hyperparameters, and done some feature engineering and/or feature selection. You are happy because you have worked your way to a training accuracy of, say, 75%. You are very satisfied with the work you have done.
But what are you comparing your results to? If you don't have a baseline (a simple sanity check, along the lines of a rule of thumb, against which to compare your results), then you are not actually comparing that hard work to anything. Is there any reason to assume that an accuracy figure is meaningful when it is not compared to anything else? Obviously not.
Random guessing is not the best choice of baseline; instead, there are widely accepted methods for determining a baseline accuracy. For example, Scikit-learn provides a series of baseline classifiers in its DummyClassifier class:
stratified: generates random predictions by respecting the class distribution of the training set.
most_frequent: always predicts the most frequent label in the training set.
prior: always predicts the class that maximizes the class prior (like most_frequent), and predict_proba returns the class prior.
uniform: generates predictions uniformly at random.
constant: always predicts a constant label that is provided by the user.
Baselines are not just for classifiers, either; baseline statistical methods exist for regression tasks as well. After exploratory data analysis, data preparation, and preprocessing, establishing a baseline is a logical next step in the machine learning workflow.
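As a rough illustration of the baseline strategies listed above, the following sketch scores several of them on a held-out split. The synthetic data set, the 70/30 class imbalance, and the variable names are assumptions made for the example; only DummyClassifier and its strategy names come from Scikit-learn itself.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data (an assumption made for illustration).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit each baseline strategy and record its accuracy on the held-out data.
baseline_scores = {}
for strategy in ("stratified", "most_frequent", "prior", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X_train, y_train)
    baseline_scores[strategy] = clf.score(X_test, y_test)

for strategy, score in baseline_scores.items():
    print(f"{strategy:>13}: {score:.3f}")
```

A real model's accuracy only becomes meaningful once it clearly beats numbers like these. On imbalanced data, most_frequent can already score well above 50%, which is why a figure such as 75% says little on its own.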
3. Validation: more than training and testing
When we build machine learning models, we train them using training data. When we test the resulting models, we use test data. So where does validation come in?
Rachel Thomas of fast.ai recently wrote an article on how and why to create a good validation set, in which she describes the following three types of data set:
A training set, used to train a given model
A validation set, used to choose between models
(For example, does a random forest or a neural network better solve your problem? Do you want a random forest with 40 or 50 trees?)
A test set, which tells you how you have done. If you have tried many different models, one of them may do well on the validation set purely by chance, and a test set helps make sure that is not the case.
So, is it always a good idea to randomly split the data into training, validation, and test sets? It turns out that the answer is no. Rachel answers this question in the context of time series data: Kaggle currently has a competition to predict the volume of grocery sales in Ecuador. Kaggle's "training data" runs from January 1, 2013 to August 15, 2017, and the test data spans August 16, 2017 to August 31, 2017. A good approach is to use August 1 to August 15, 2017 as your validation set, and all the earlier data as your training set.
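The date-based split described above can be sketched with nothing more than the standard library. The rows here are hypothetical placeholders (one made-up value per day), not real competition data; only the date boundaries come from the text.

```python
from datetime import date, timedelta

# Hypothetical daily records: one (day, sales) row per day. The sales values
# are placeholders, not real competition data.
start, end = date(2013, 1, 1), date(2017, 8, 15)
n_days = (end - start).days + 1
rows = [(start + timedelta(days=i), float(i % 7)) for i in range(n_days)]

# Hold out the most recent days for validation; train on everything earlier.
validation_start = date(2017, 8, 1)
train_rows = [row for row in rows if row[0] < validation_start]
valid_rows = [row for row in rows if row[0] >= validation_start]

print(len(train_rows), "training days;", len(valid_rows), "validation days")
```

Splitting on a date rather than at random keeps the validation set in the same position relative to the training data as the real test set: strictly later in time.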
Much of the rest of her article relates the splitting of data sets to Kaggle competition data, which is very useful, and also covers cross-validation, which I will leave for readers to explore on their own.
Many other times, a random split of the data will be useful; it depends on further factors, such as the state of the data when you get it (has it already been divided into training and test data?) and what type of data it is (see the time series example above).
For the cases where a random split is appropriate, Scikit-learn may not have a train_validate_test_split method, but you can use the standard Python library to create your own.
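One possible home-made version, using only the standard library, is sketched below. The function name mirrors the one mentioned above, but its signature, default fractions, and rounding behavior are assumptions made for this example rather than any library's API.

```python
import random


def train_validate_test_split(rows, valid_frac=0.15, test_frac=0.15, seed=None):
    """Shuffle a copy of rows and cut it into (train, validation, test) lists."""
    if not 0 <= valid_frac + test_frac < 1:
        raise ValueError("valid_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_valid - n_test
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = train_validate_test_split(range(100), seed=42)
print(len(train), len(valid), len(test))
```

With Scikit-learn available, a similar result can be had by calling train_test_split twice: once to carve off the test set and once more to carve a validation set out of what remains.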
4. There is more to ensembles than trees
Choosing an algorithm can be a challenge for machine learning novices. When building a classifier, beginners in particular tend to approach a problem with a single instance of a single algorithm. However, in a given situation it can be more effective to combine classifiers, using techniques such as voting, weighting, and combination to pursue the most accurate classifier possible. Ensemble learners are classifiers that provide this functionality in a variety of ways.
Random forests are a very prominent example of an ensemble learner, using many decision trees in a single predictive model. Random forests have been applied successfully to a variety of problems and have achieved good results. But they are not the only ensemble method that exists, and many others are worth a try.
Bagging rests on a simple idea: build multiple models, observe their results, and go with the majority result. I recently had a problem with the rear axle assembly in my car. I was not convinced by the diagnosis from the dealer, so I took the car to two other repair shops, and both of them agreed that the problem was something different from what the dealer had suggested. That is bagging in real life. Random forests are built on the bagging technique.
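The voting step in that story can be sketched in a few lines. This shows only the majority vote; a full bagging implementation would also train each model on a different bootstrap sample of the data. The diagnosis labels are invented for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models (ties go to the first seen)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical diagnoses from three "models": the dealer and two repair shops.
diagnoses = ["rear axle", "wheel bearing", "wheel bearing"]
print(majority_vote(diagnoses))
```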
Boosting is similar to bagging, with one slight conceptual difference. Instead of assigning equal weights to the models, boosting assigns different weights to the classifiers and derives its final result from a weighted vote. Take my car problem again. Perhaps I have been to one of the repair shops many times in the past and trust its diagnosis more than the others. Suppose also that I have never dealt with the dealer before and have less faith in its ability. The weights I assign would reflect this.
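The weighted vote described above might look like the sketch below. It covers only the final combination step; real boosting algorithms such as AdaBoost also train their models one after another and derive the weights from each model's errors. The labels and trust weights here are made up.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical opinions: the dealer, an unfamiliar shop, and a trusted shop.
diagnoses = ["rear axle", "rear axle", "wheel bearing"]
trust = [0.2, 0.3, 0.6]  # assumed weights; the trusted shop counts the most

print(weighted_vote(diagnoses, trust))
```

With these weights the single trusted opinion outweighs the other two combined, whereas equal weights would fall back to a plain majority vote.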
Stacking differs from the previous two techniques in that it trains multiple distinct classifiers rather than a collection of variations on the same learner. While bagging and boosting use numerous models built from different instances of the same classification algorithm (such as decision trees), stacking builds its models using different classification algorithms (perhaps decision trees, logistic regression, ANNs, or some other combination).
A combiner algorithm is then trained to make the final prediction from the predictions of the other algorithms. This combiner can be any ensemble technique, but logistic regression is often found to be an adequate and simple algorithm for performing the combination. Along with classification, stacking can also be used in unsupervised learning tasks such as density estimation.
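As a minimal sketch of stacking, Scikit-learn's StackingClassifier can combine two different base algorithms with a logistic regression combiner. The synthetic data and the particular base learners and settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (an assumption made for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two different base algorithms, combined by a logistic regression model
# that is trained on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy: {stack_accuracy:.3f}")
```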
5. Google Colab?
Finally, let's look at something more practical. Jupyter Notebook has become the de facto tool for data science development, and most people run it on a personal computer or through some more complex configuration (such as a Docker container or a virtual machine). Then there is Google's Colaboratory, which allows Jupyter-style, Jupyter-compatible notebooks to run directly from your Google Drive without any configuration.
Colaboratory comes pre-configured with a number of recently popular Python libraries, and because package management is supported, you can install others yourself from within a notebook. For example, at the time of writing TensorFlow is included but Keras is not; installing Keras via pip, however, takes only a few seconds.
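To see which libraries a given notebook environment already provides, a small check like the following may help. The helper name is hypothetical, and the package names are just the two mentioned above; what is actually installed depends on the environment.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported here."""
    return importlib.util.find_spec(package_name) is not None


# In a notebook cell, a missing package could then be installed with a shell
# escape such as:  !pip install keras
for name in ("tensorflow", "keras"):
    print(name, "installed" if is_installed(name) else "missing")
```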
Better still, if you are working with neural networks, you can enable GPU hardware acceleration for training, free of charge, for up to 12 hours at a time. This is not quite as perfect as it first sounds, but it is an added bonus and a good start toward making GPU acceleration widely available.