How to Use Sklearn train_test_split in Python effectively?

What is Sklearn train_test_split?

The train_test_split() function in scikit-learn (sklearn) splits a dataset into training and test sets. Holding out a test set lets you evaluate the model on data it did not see during training, which is how you detect overfitting. The split does not prevent overfitting by itself.

`train_test_split()` Syntax

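The most commonly used parameters of train_test_split() are:

X: The features of the dataset.
y: The target labels of the dataset.
test_size: The proportion of the dataset to include in the test set, as a float between 0 and 1, or an integer giving the absolute number of test samples. If neither test_size nor train_size is given, it defaults to 0.25.
random_state: An integer seed that controls the shuffling applied before the split, so that the same split is produced on every run.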
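
As a minimal sketch of how test_size and random_state behave, the snippet below splits an illustrative toy dataset twice with the same seed and once with an integer test_size. The data values and the seeds 42 and 0 are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (values are illustrative only)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same seed produces the same split on every call
_, _, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(y_test_a, y_test_b)

# test_size may also be an int: the absolute number of test samples
_, _, _, y_test_c = train_test_split(X, y, test_size=3, random_state=0)

print(same_split)     # True
print(len(y_test_a))  # 2
print(len(y_test_c))  # 3
```

Without random_state, each call shuffles differently, so results that depend on the split are not reproducible between runs.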

When called with X and y, the function returns four arrays, in this order:

X_train: The training features.
X_test: The test features.
y_train: The training labels.
y_test: The test labels.
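
The return order generalizes: train_test_split() accepts any number of arrays of equal length and returns a train part and a test part for each, in input order. The sketch below illustrates this with a third array, sample_weight, which is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(16).reshape(8, 2)           # row i is [2*i, 2*i + 1]
y = np.arange(8)                          # label i belongs to row i
sample_weight = np.linspace(0.1, 0.8, 8)  # hypothetical third array

splits = train_test_split(X, y, sample_weight, test_size=0.25, random_state=0)

# Two outputs per input array, in input order: train part first, then test part
X_train, X_test, y_train, y_test, w_train, w_test = splits

print(len(splits))                  # 6
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)

# Rows stay aligned across arrays: each test row still matches its label
rows_aligned = np.array_equal(X_test[:, 0] // 2, y_test)
print(rows_aligned)                 # True
```

Unpacking in the wrong order (for example X_train, y_train, X_test, y_test) raises no error but silently mixes features and labels, so the order is worth double-checking.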

Example of using Sklearn train_test_split

The following example uses train_test_split() to split a small dataset into training and test sets:


    import numpy as np
    from sklearn.model_selection import train_test_split

    # Create a sample dataset
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([9, 10, 11, 12])

    # Split the dataset into training and test sets with a test size of 25%
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # Print the shape of the training and test sets
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)

Output:

    (3, 2)
    (1, 2)
    (3,)
    (1,)
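
To connect the split to model evaluation, here is a rough sketch that fits a classifier on the training set and scores it on both sets. The synthetic data from make_classification, the choice of DecisionTreeClassifier, and the seed values are assumptions made for illustration, not part of the example above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training set only; the test set stays unseen during training
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

A large gap between the two scores is the usual sign of overfitting, and it is only visible because the test set was held out.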

Using train_test_split with stratified datasets

A stratified split preserves the distribution of the target labels in both the training and test sets. This matters for classification tasks, especially with imbalanced classes, where a purely random split could leave a class under-represented or missing in one of the sets.

To make a stratified split, pass the stratify parameter to train_test_split(). It should be an array-like of class labels, usually y itself.

Here is an example of a stratified split into training and test sets:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Create a sample dataset with two balanced classes (4 samples each)
    X = np.arange(16).reshape(8, 2)
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

    # Split with a test size of 25%, stratified by the target labels.
    # Each set must hold at least one sample per class, so the test set
    # needs at least 2 samples here; a 1-sample test set raises ValueError.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)

    # Print the shape of the training and test sets
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)

Output:

    (6, 2)
    (2, 2)
    (6,)
    (2,)
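
Shapes alone do not show that class proportions are preserved. A minimal sketch, assuming an illustrative imbalanced toy dataset with a 3:1 class ratio, counts the labels on each side of a stratified split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 15 samples of class 0 and 5 of class 1 (3:1 ratio)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Count how many samples of each class landed in each set
train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()

print(train_counts)  # [12, 4]
print(test_counts)   # [3, 1]
```

Both sets keep the 3:1 ratio of the full dataset. Without stratify=y, the four test samples could by chance all come from class 0.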

Conclusion

The train_test_split() function is a convenient way to split a dataset into training and test sets in Python. Use test_size to control the size of the test set, random_state to make the split reproducible, and stratify to preserve class proportions in classification tasks, so that the test score is a fair estimate of performance on unseen data.
