Both classification and regression are done by Estimator objects (classifiers and regressors extend this class). For most classifiers and regressors, an Estimator looks more or less like the following:
class Estimator(object):
    def __init__(self, *args, **kwargs):
        # Initialization of the object
        pass

    def fit(self, X, y):
        """Train the estimator.

        Arguments:
            X (numpy array-like): training data
            y (numpy array-like): labels
        """
        # The learning algorithm goes here; fit does not return
        # predictions, it updates the Estimator object in place
        pass

    def predict(self, X):
        """Predict on the test data.

        Arguments:
            X (numpy array-like): test data

        Returns:
            y (numpy array): predicted labels
        """
        # compute and return the predictions
        return predictions
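To make the interface concrete, here is a toy estimator that fills in this skeleton. It is a hypothetical example, not part of scikit-learn: a majority-class classifier whose fit simply memorizes the most frequent training label and whose predict returns it for every test instance.

```python
import numpy as np

class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learn the most common label in the training set
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self  # scikit-learn estimators conventionally return self

    def predict(self, X):
        # Predict the stored majority label for every test instance
        return np.full(len(X), self.majority_)

clf = MajorityClassifier()
clf.fit(np.array([[0], [1], [2]]), np.array([1, 1, 0]))
print(clf.predict(np.array([[5], [6]])))  # → [1 1]
```

Returning self from fit is what lets calls be chained, e.g. clf.fit(X, y).predict(X_test).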
If we want to summarize Scikit Learn in three lines:
est = Estimator()
est.fit(X_train, y_train)
est.predict(X_test)
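Those three lines run unchanged with any concrete estimator. As one example (assuming scikit-learn is installed), here they are with LogisticRegression on the bundled iris dataset; the specific estimator and dataset are my choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The same three lines, with a concrete estimator
est = LogisticRegression(max_iter=1000)
est.fit(X_train, y_train)
predictions = est.predict(X_test)
```

Swapping LogisticRegression for, say, a decision tree or an SVM changes only the first of the three lines.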
First, we initialize the estimator, then fit it by providing the training dataset and its labels. Then we predict on the test dataset to get the predictions. Classification produces discrete labels (one of a fixed number of classes) whereas regression produces real numbers. However, the API stays the same for classifiers and regressors in supervised learning. For unsupervised learning, there is no target variable, so we do not have a predict function; we have fit and transform functions instead.
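As an example of the unsupervised side (again assuming scikit-learn is installed, and picking StandardScaler purely for illustration): fit learns per-column statistics from X alone, with no labels, and transform applies them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Unsupervised: fit takes no label argument, it learns the
# per-column mean and scale from X itself
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
```

After transforming, each column has zero mean and unit variance; fit_transform(X) is a common shorthand for the two calls.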