Text Classification With Keras and Scikit-Learn

Background

We are going to build a model that classifies customer reviews as having positive or negative sentiment, using the Women’s E-Commerce Clothing Reviews Dataset. We will walk through how we would organize this task in Metaflow. Concretely, we will demonstrate the following steps:

  1. Read data from a parquet file.
  2. Show a branching workflow that records a baseline and trains a model in parallel.
  3. Evaluate the model:
    • on a holdout set, comparing it against the baseline
    • with a smoke test
  4. If the model passes both checks, tag the run as a deployment_candidate.
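
The gate described in steps 3 and 4 can be sketched as a plain function, independent of Metaflow. This is only an illustration of the decision rule: the function name `is_deployment_candidate`, its parameters, and the numbers in the usage lines are hypothetical and are not part of the flow.

```python
def is_deployment_candidate(model_auc, baseline_auc, neg_score, pos_score, threshold=0.5):
    """Return True only if the model beats the baseline AND passes the smoke test.

    neg_score / pos_score are the model's predicted probabilities for an
    obviously negative and an obviously positive review (hypothetical inputs).
    """
    beats_baseline = model_auc > baseline_auc
    passed_smoke_test = neg_score < threshold and pos_score > threshold
    return beats_baseline and passed_smoke_test


# Made-up numbers, for illustration only.
print(is_deployment_candidate(0.90, 0.50, neg_score=0.1, pos_score=0.9))  # True
print(is_deployment_candidate(0.90, 0.50, neg_score=0.6, pos_score=0.9))  # False: fails the smoke test
```

Requiring both conditions means a model with a good aggregate metric is still rejected if it gets an obvious example wrong.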

Constructing The Metaflow Flow

from metaflow import FlowSpec, step, Flow, current

class MyFlow(FlowSpec):

    @step
    def start(self):
        "Read the data"
        import pandas as pd
        self.df = pd.read_parquet('train.parquet')
        print(f'num of rows: {self.df.shape[0]}')
        self.next(self.baseline, self.train)

    @step
    def baseline(self):
        "Compute the baseline"
        from sklearn.metrics import accuracy_score, roc_auc_score
        baseline_predictions = [1] * self.df.shape[0]
        self.base_acc = accuracy_score(self.df.labels, baseline_predictions)
        self.base_rocauc = roc_auc_score(self.df.labels, baseline_predictions)
        self.next(self.join)

    @step
    def train(self):
        "Train the model"
        from tensorflow.keras.utils import set_random_seed
        from sklearn.feature_extraction.text import CountVectorizer
        from model import get_model  # defined in a separate `model` module (not shown here)
        set_random_seed(2022)  # make training reproducible

        # Bag-of-words features: ignore terms found in fewer than 0.5% or more than 75% of reviews.
        self.cv = CountVectorizer(min_df=.005, max_df=.75, stop_words='english', strip_accents='ascii')
        res = self.cv.fit_transform(self.df['review'])
        self.model = get_model(len(self.cv.vocabulary_))
        self.model.fit(x=res.toarray(), 
                       y=self.df['labels'],
                       batch_size=32, epochs=10, validation_split=.2)

        self.next(self.join)
        
    @step
    def join(self, inputs):
        "Compare the model results with the baseline."
        from sklearn.metrics import accuracy_score, roc_auc_score
        import pandas as pd

        # Carry forward the artifacts we need from the train branch.
        self.model = inputs.train.model
        self.cv = inputs.train.cv
        self.train_df = inputs.train.df
        self.holdout_df = pd.read_parquet('holdout.parquet')
        
        self.predictions = self.model.predict(self.cv.transform(self.holdout_df['review']).toarray())
        labels = self.holdout_df['labels']
        
        self.model_acc = accuracy_score(labels, self.predictions > .5)
        self.model_rocauc = roc_auc_score(labels, self.predictions)
        
        print(f'Baseline Acccuracy: {inputs.baseline.base_acc:.2%}')
        print(f'Baseline AUC: {inputs.baseline.base_rocauc:.2}')
        print(f'Model Acccuracy: {self.model_acc:.2%}')
        print(f'Model AUC: {self.model_rocauc:.2}')
        self.beats_baseline = self.model_rocauc > inputs.baseline.base_rocauc
        print(f'Model beats baseline (T/F): {self.beats_baseline}')
        
        # Smoke test: make sure the model does the right thing on obvious examples.
        _tst_reviews = ["poor fit its baggy in places where it isn't supposed to be.",
                        "love it, very high quality and great value"]
        _tst_preds = self.model.predict(self.cv.transform(_tst_reviews).toarray())
        self.passed_smoke_test = _tst_preds[0][0] < .5 and _tst_preds[1][0] > .5
        print(f'Model passed smoke test (T/F): {self.passed_smoke_test}')
        
        if self.beats_baseline and self.passed_smoke_test:
            run = Flow(current.flow_name)[current.run_id]
            run.add_tag('deployment_candidate')
        self.next(self.end)
        
    @step
    def end(self): ...

if __name__ == '__main__':
    MyFlow()

Save the code above as flow.py, then run the flow:

!python flow.py --no-pylint run
Metaflow 2.7.1 executing MyFlow for user:hamel
Validating your flow...
    The graph looks good!
2022-07-19 23:01:03.867 Workflow starting (run-id 1658296863862787):
2022-07-19 23:01:03.876 [1658296863862787/start/1 (pid 29926)] Task is starting.
2022-07-19 23:01:04.909 [1658296863862787/start/1 (pid 29926)] num of rows: 20377
2022-07-19 23:01:05.013 [1658296863862787/start/1 (pid 29926)] Task finished successfully.
2022-07-19 23:01:05.023 [1658296863862787/baseline/2 (pid 29931)] Task is starting.
2022-07-19 23:01:05.032 [1658296863862787/train/3 (pid 29932)] Task is starting.
2022-07-19 23:01:06.442 [1658296863862787/baseline/2 (pid 29931)] Task finished successfully.
2022-07-19 23:01:08.339 [1658296863862787/train/3 (pid 29932)] 2022-07-19 23:01:08.339044: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
2022-07-19 23:01:08.421 [1658296863862787/train/3 (pid 29932)] Epoch 1/10
510/510 [==============================] - 1s 1ms/step - loss: 0.3523 - accuracy: 0.8507 - val_loss: 0.2988 - val_accuracy: 0.8754
2022-07-19 23:01:09.284 [1658296863862787/train/3 (pid 29932)] Epoch 2/10
510/510 [==============================] - 0s 958us/step - loss: 0.2945 - accuracy: 0.8818 - val_loss: 0.2956 - val_accuracy: 0.8771
2022-07-19 23:01:09.773 [1658296863862787/train/3 (pid 29932)] Epoch 3/10
510/510 [==============================] - 0s 956us/step - loss: 0.2860 - accuracy: 0.8840 - val_loss: 0.2989 - val_accuracy: 0.8768
2022-07-19 23:01:10.261 [1658296863862787/train/3 (pid 29932)] Epoch 4/10
510/510 [==============================] - 0s 954us/step - loss: 0.2761 - accuracy: 0.8891 - val_loss: 0.2951 - val_accuracy: 0.8803
2022-07-19 23:01:10.748 [1658296863862787/train/3 (pid 29932)] Epoch 5/10
510/510 [==============================] - 0s 954us/step - loss: 0.2676 - accuracy: 0.8956 - val_loss: 0.2991 - val_accuracy: 0.8759
2022-07-19 23:01:11.235 [1658296863862787/train/3 (pid 29932)] Epoch 6/10
510/510 [==============================] - 0s 949us/step - loss: 0.2623 - accuracy: 0.9005 - val_loss: 0.2996 - val_accuracy: 0.8793
2022-07-19 23:01:11.720 [1658296863862787/train/3 (pid 29932)] Epoch 7/10
510/510 [==============================] - 0s 947us/step - loss: 0.2549 - accuracy: 0.9056 - val_loss: 0.3044 - val_accuracy: 0.8754
2022-07-19 23:01:12.203 [1658296863862787/train/3 (pid 29932)] Epoch 8/10
510/510 [==============================] - 0s 957us/step - loss: 0.2472 - accuracy: 0.9109 - val_loss: 0.3137 - val_accuracy: 0.8759
2022-07-19 23:01:12.691 [1658296863862787/train/3 (pid 29932)] Epoch 9/10
510/510 [==============================] - 0s 971us/step - loss: 0.2376 - accuracy: 0.9134 - val_loss: 0.3151 - val_accuracy: 0.8776
2022-07-19 23:01:13.187 [1658296863862787/train/3 (pid 29932)] Epoch 10/10
510/510 [==============================] - 1s 1ms/step - loss: 0.2298 - accuracy: 0.9187 - val_loss: 0.3276 - val_accuracy: 0.8744
2022-07-19 23:01:13.846 [1658296863862787/train/3 (pid 29932)] To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-19 23:01:13.847 [1658296863862787/train/3 (pid 29932)] 2022-07-19 23:01:13.846758: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2022-07-19 23:01:13.876 [1658296863862787/train/3 (pid 29932)] WARNING:absl:Function `_wrapped_model` contains input name(s) Input with unsupported characters which will be renamed to input in the SavedModel.
2022-07-19 23:01:14.384 [1658296863862787/train/3 (pid 29932)] Task finished successfully.
2022-07-19 23:01:14.395 [1658296863862787/join/4 (pid 29940)] Task is starting.
2022-07-19 23:01:16.806 [1658296863862787/join/4 (pid 29940)] 2022-07-19 23:01:16.806559: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
2022-07-19 23:01:17.227 [1658296863862787/join/4 (pid 29940)] Baseline Acccuracy: 77.04%
2022-07-19 23:01:17.228 [1658296863862787/join/4 (pid 29940)] Baseline AUC: 0.5
2022-07-19 23:01:17.251 [1658296863862787/join/4 (pid 29940)] Model Acccuracy: 87.77%
2022-07-19 23:01:17.251 [1658296863862787/join/4 (pid 29940)] Model AUC: 0.91
2022-07-19 23:01:17.251 [1658296863862787/join/4 (pid 29940)] Model beats baseline (T/F): True
2022-07-19 23:01:17.251 [1658296863862787/join/4 (pid 29940)] Model passed smoke test (T/F): True
2022-07-19 23:01:17.406 [1658296863862787/join/4 (pid 29940)] To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-19 23:01:17.406 [1658296863862787/join/4 (pid 29940)] 2022-07-19 23:01:17.405995: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2022-07-19 23:01:17.434 [1658296863862787/join/4 (pid 29940)] WARNING:absl:Function `_wrapped_model` contains input name(s) Input with unsupported characters which will be renamed to input in the SavedModel.
2022-07-19 23:01:17.865 [1658296863862787/join/4 (pid 29940)] Task finished successfully.
2022-07-19 23:01:17.877 [1658296863862787/end/5 (pid 29944)] Task is starting.
2022-07-19 23:01:18.633 [1658296863862787/end/5 (pid 29944)] Task finished successfully.
2022-07-19 23:01:18.633 Done!
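
The baseline figures in the log follow from how the baseline step works. Predicting the positive class for every review gives an accuracy equal to the share of positive labels, and a constant score cannot rank any review above another, so its ROC AUC is 0.5. Below is a minimal pure-Python sketch of that reasoning, not the flow's own code: the helpers `accuracy` and `roc_auc` and the four toy labels are made up for illustration, whereas the flow itself relies on scikit-learn's `accuracy_score` and `roc_auc_score`.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def roc_auc(labels, scores):
    """Pairwise ROC AUC: the chance that a positive example outscores a
    negative one, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0]               # toy data: 75% positive
always_positive = [1] * len(labels)  # same idea as the flow's baseline

print(accuracy(labels, always_positive))  # 0.75, the share of positives
print(roc_auc(labels, always_positive))   # 0.5, every pair is a tie
```

This is why the flow compares AUC rather than accuracy when deciding whether the model beats the baseline: on an imbalanced dataset a constant predictor can look strong on accuracy while having no ability to rank reviews.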