Generating Realistic Testing Data for Manufacturing Analytics Software

In the development of a software that includes data tools for providing analytics on batches in chemical manufacturing, the use of realistic testing data is critical to ensure accurate and reliable results. In this blog, I exemplify how such testing data can be generated, starting with a basic example of “ideal batches” and then incorporating some deviations such as data outages, variability in process step implementation, and equipment malfunctioning. By using the resulting close-to-real data testing time series, one can ensure that even early versions of the software are able to handle real-world scenarios.

Unit test

Before diving into the two implementations of the any data generating function, I introduce a unit test to ensure that the generated data meets the desired criteria. Our basic scenario will include generating electrical current data; I want to be able to choose the number of batches to generate data for, the approximate duration of each batch in minutes, the approximate time of transition between batches (“window size”), a lower_bound, which is the lower value for the current data, and the upper_bound, which defines the upper limit for the current data. Thus, in terms of a unit test, I would like to check if the function outputs the expected number of batches, with the correct duration, window size, and current values. It is important to note that if the test should check statistical parameters of randomly generated samples, it should employ .assertAlmostEqual(), .assertTrue(), and .all() methods.

The next test class verifies the basic functionality of the future generate_current_data function, i.e. that: the number of batches returned is as expected, the duration of each batch is as expected, and the window size and current parameters affect the generated data as expected. The setup of the parameters is done in the setUp method which is automatically called before each test.

class TestGenerateCurrentData(unittest.TestCase):
    
    def setUp(self):
        self.num_batches = 10
        self.duration = 100
        self.window_size = 20
        self.lower_bound = 10
        self.upper_bound = 11
        self.tested_data = generate_current_data(
            self.num_batches,
            self.duration,
            self.window_size,
            self.lower_bound,
            self.upper_bound
        )

    def test_shape(self):
        self.assertEqual(
            len(self.tested_data),
            (self.num_batches*(self.duration + self.window_size))
        )

    def test_value_range(self):
        self.assertTrue(
            np.all(self.tested_data >= 0) and np.all(self.tested_data <= self.upper_bound)
        )
    
    def test_generate_current_data_batch_stats(self):
        batch_num, batch_avg_dur, avg_window = self.__slice_into_batches(self.tested_data)
        self.assertEqual(batch_num, self.num_batches)
        self.assertAlmostEqual(batch_avg_dur, self.duration, delta = 10)
        self.assertAlmostEqual(avg_window, self.window_size, delta = 10)
    
    @staticmethod
    def __slice_into_batches(df: pd.DataFrame):
        # Add a column indicating if the current is non-zero
        df['current_on'] = df['current'].apply(lambda x: False if x == 0 else True)
        # Add a column indicating if the current is zero
        df['current_off'] = df['current'].apply(lambda x: True if x == 0 else False)
        # Add a column indicating if the current state changed from the previous row
        df['state_changed'] = df['current_on'].ne(df['current_on'].shift())
        df['state_changed_to_on'] = np.where(
            df['current']* df['state_changed'], True, False
        )
        # Add a column indicating the batch number
        df['batch_number'] = df['state_changed_to_on'].cumsum()
        # number of batches
        batch_count = df['batch_number'].max()
        # average duration
        mean_duration = pd.pivot_table(
            df, index='batch_number', values='current_on', aggfunc='sum'
        ).mean()[0]
        # average window size between batches
        mean_window = pd.pivot_table(
            df, index='batch_number', values='current_off', aggfunc='sum'
        ).mean()[0]
        return batch_count, mean_duration, mean_window

if __name__ == '__main__':
    unittest.main()

Ideal Batch Model

The first version of our current data generating function will represent an “ideal” case. The function is starting by initializing an empty list called current_data. It then is setting the start time to the current time. For each batch, the function is generatimg duration minutes of current data with values randomly between lower_bound and upper_bound using the random.uniform function. Then it is generating window_size minutes of current data with values equal to 0. The start time is being updated after each iteration of the outer loop. Finally, the function is building and returning a pandas dataframe for which the current data is stored in the ‘current’ column and the time stamp is set to be the index.

def generate_current_data(
        num_batches: int,
        duration: int,
        window_size: int,
        lower_bound: float,
        upper_bound: float
    ) -> pd.DataFrame:
    """Generate electricity current data for a given number of batches.

    :param num_batches: number of batches to generate data for
    :param duration: duration of each batch in minutes
    :param window_size: time period between batches
    :param lower_bound: lower threshold value for the current data
    :param upper_bound: range for the current data values

    :return: a time series dataframe, where values represent the current value
    """
    current_data = []
    start_time = datetime.strptime(
        datetime.now().isoformat(timespec='minutes'), '%Y-%m-%dT%H:%M'
    )
    for i in range(num_batches):
        for j in range(duration):
            current_data.append((
                start_time + timedelta(minutes=j),
                random.uniform(lower_bound, upper_bound)
            ))
        for j in range(window_size):
            current_data.append((start_time + timedelta(minutes=duration + j), 0))
        start_time = start_time + timedelta(minutes=duration + window_size)
    current_data = pd.DataFrame(current_data)
    current_data = current_data.rename(columns={0: 'time', 1: 'current'}).set_index('time')
    return current_data

Below is a function to visualize a timeseries dataframe using the plotly library:

def visualize_current_data(df: pd.DataFrame):
    # Create a trace for the current data
    trace = go.Scatter(
        x=df.time,
        y=df.current,
        mode='lines',
        name='Current'
    )

    # Create a layout for the plot
    layout = go.Layout(
        title='Current Data',
        xaxis=dict(title='Time'),
        yaxis=dict(title='Current (A)')
    )

    # Create a Figure object
    fig = go.Figure(data=[trace], layout=layout)

    # Show the plot
    fig.show()

The generated dataframe meets the desired criteria:

============================= test session starts =============================
collecting ... collected 3 item

current.py::TestGenerateCurrentData::test_generate_current_data_batch_stats PASSED  [100%]
current.py::TestGenerateCurrentData::test_shape PASSED  [100%]
current.py::TestGenerateCurrentData::test_value_range PASSED  [100%]

============================== 1 passed in 0.039s ==============================

Process finished with exit code 0

Let’s generate a time series with the generate_current_data function and plot it:

# Generate current data
current_data = generate_current_data(
    num_batches=5,
    duration=30,
    window_size=5,
    lower_bound=10,
    upper_bound=11
).reset_index()
# Visualize
visualize_current_data(current_data)

In this example, I use the generate_current_data function to generate current data with 5 batches, each with a duration of 30 minutes, a window size of 5 minutes, and values between 10 and 11. With more batches it can look as follows:

visualize_current_data(generate_current_data(25, 120, 20, 10, 11))

Introducing Data Omissions

Now, I need to take into acount that multiple reasons can lead to missing values appearing in the raw data. Let’s take a look on how data omissions can be introduced into the generate_current_data function result:

def generate_current_data(
        num_batches: int,
        duration: int,
        window_size: int,
        lower_bound: float,
        upper_bound: float,
        irregularity_rate: float
    ) -> pd.DataFrame:
    """Generate electricity current data for a given number of batches.

    :param num_batches: number of batches to generate data for
    :param duration: duration of each batch in minutes
    :param window_size: time period between batches
    :param lower_bound: lower threshold value for the current data
    :param upper_bound: range for the current data values
    :param irregularity_rate: rate of data omissions (between 0 and 1)

    :return: a time series dataframe, where values represent the current value
    """
    current_data = []
    start_time = datetime.strptime(
        datetime.now().isoformat(timespec='minutes'), '%Y-%m-%dT%H:%M'
    )
    for i in range(num_batches):
        for j in range(duration):
            if random.random() > irregularity_rate:
                current_data.append((
                    start_time + timedelta(minutes=j),
                    random.uniform(lower_bound, upper_bound)
                ))
            else:
                current_data.append((start_time + timedelta(minutes=j), None))
        for j in range(window_size):
            current_data.append((start_time + timedelta(minutes=duration + j), 0))
        start_time = start_time + timedelta(minutes=duration + window_size)
    current_data = pd.DataFrame(current_data)
    current_data = current_data.rename(
        columns = {0: 'time', 1: 'current'}
    ).set_index('time')
    return current_data

In this example, I add a parameter called irregularity_rate to the generate_current_data function and a check before appending each data point to the current_data list. This check works as follows: using the random.random() function which returns a random float between 0 and 1, I generate a random number; if this number is greater than the irregularity_rate passed to the function, we append the current data point to the current_data list; otherwise we are appending None to the list instead, indicating an omitted data point. The irregularity_rate parameter takes value between 0 and 1, which allows controlling the rate of data omissions in the generated data. To illustrate this version of the generate_current_data function, I pass a value of 0.3 to this parameter:

However, actually, the data omissions rarely take place randomly; thus, I want to account for omissions of sequences of data points instead (or in addition to) of random data point omissions:

def generate_current_data(
        num_batches: int,
        duration: int,
        window_size: int,
        lower_bound: float,
        upper_bound: float,
        irregularity_rate: float,
        irregularity_length: int
    ) -> pd.DataFrame:
    """Generate electricity current data for a given number of batches.

    :param num_batches: number of batches to generate data for
    :param duration: duration of each batch in minutes
    :param window_size: time period between batches
    :param lower_bound: lower threshold value for the current data
    :param upper_bound: range for the current data values
    :param irregularity_rate: rate of data omissions (between 0 and 1)
    :param irregularity_length: the length of the time period to omit data for in minutes

    :return: a time series dataframe, where values represent the current value
    """
    current_data = []
    irregularity_rate /= irregularity_length
    current_time = datetime.strptime(
        datetime.now().isoformat(timespec='minutes'), '%Y-%m-%dT%H:%M'
    )
    i = 0
    j = 0
    while i < num_batches:
        while j < duration:
            if random.random() > irregularity_rate:
                current_data.append((
                    current_time, random.uniform(lower_bound, upper_bound)
                ))
                current_time += timedelta(minutes=1)
                j += 1
            else:
                for m in range(irregularity_length):
                    current_data.append((current_time + timedelta(minutes=m), np.NaN))
                current_time += timedelta(minutes=irregularity_length)
                j += irregularity_length
        if j < duration + window_size:
            for m in range(duration + window_size - j):
                current_data.append((current_time + timedelta(minutes=m), 0))
            current_time += timedelta(minutes=duration + window_size - j)
            j = 0
            i = i + 1
        else:
            i = i + 1 + j//(duration + window_size)
            j = j - (duration + window_size) * j // (duration + window_size)
    current_data = pd.DataFrame(current_data)
    current_data = current_data.rename(
        columns = {0: 'time', 1: 'current'}
    ).set_index('time')
    return current_data

In this example, in addition to the irregularity_rate parameter, I usr a parameter called irregularity_length in the generate_current_data function. I still check the former before appending each data point to the current_data list; however, whenever the generated random number is lower than the irregularity_rate passed to the function, I append irregularity_length number of missing data points to the list instead, thus, leaving out a chunk of data. I then have to check whether the current batch is already over and account for the number of batches possibly skipped, to appropriately restart the regular data generation. To handel the grown complexity of the condition check and uncertain loop length, I swithed to while loops. The irregularity_rate parameter still takes value between 0 and 1; it is normalized by the length of the period to be omitted at the very start of the function. To illustrate this version of the generate_current_data function, I pass a value of 0.1 to this parameter and the value of 200 to the irregularity_length parameter:

Variability in Process Step Duration

Now that I can generate test data for ideal batch sequences and incorporate omissions in it, it is important to acknowledge that consider the chemical manufacturing process as a perfectly timed operation is pretty unreasonable. There can be deviations in the timing of each process step, which can impact the overall accuracy of batch data analysis. To accurately reflect real-world conditions in our testing data, it is necessary to incorporate these delays into our generated test time series. To this end, I use a random number from a truncated normal distribution to add variability to the duration parameter value:

def generate_truncated_normal_vector(
    mean: float,
    std_dev: float,
    size: int,
    lower_bound: float,
    upper_bound: float
) -> np.array:
    """Generates a vector of random values with a truncated normal distribution.

    Uses scipy.stats.truncnorm function to truncate values outside the specified bounds.

    :param mean: the mean of the normal distribution.
    :param std_dev: the standard deviation of the normal distribution.
    :param size: the number of random values to generate.
    :param lower_bound: the lower bound of the truncation.
    :param upper_bound: the upper bound of the truncation.
    :return: a numpy vector.
    
    """
    return truncnorm(
        (lower_bound - mean) / std_dev,
        (upper_bound - mean) / std_dev,
        loc=mean,
        scale=std_dev
    ).rvs(size)

To adjust the duration parameter value, I can generate a 1-dimensional array from a distribution with the following parameter values:

mean=1,
std_ved=0.2,
lower_bound=0,
upper_bound=2.

generate_truncated_normal_vector(1, 0.2, 1, 0, 2)[0]

The updated version of the generate_current_data function will look as follows:

def generate_current_data(
        num_batches: int,
        mean_duration: int,
        window_size: int,
        lower_bound: float,
        upper_bound: float,
        irregularity_rate: float,
        irregularity_length: int
    ) -> pd.DataFrame:
    """Generate electricity current data for a given number of batches.

    :param num_batches: number of batches to generate data for
    :param mean_duration: average duration of batches in minutes
    :param window_size: time period between batches
    :param lower_bound: lower threshold value for the current data
    :param upper_bound: range for the current data values
    :param irregularity_rate: rate of data omissions (between 0 and 1)
    :param irregularity_length: the length of the time period to omit data for, in minutes

    :return: a time series dataframe, where values represent the current value
    """
    current_data = []
    irregularity_rate /= irregularity_length
    current_time = datetime.strptime(
        datetime.now().isoformat(timespec='minutes'), '%Y-%m-%dT%H:%M'
    )
    i = 0
    j = 0
    while i < num_batches:
        duration = int(mean_duration*generate_truncated_normal_vector(1, 0.2, 1, 0, 2)[0])
        while j < duration:
            if random.random() > irregularity_rate:
                current_data.append((
                    current_time, random.uniform(lower_bound, upper_bound)
                ))
                current_time += timedelta(minutes=1)
                j += 1
            else:
                for m in range(irregularity_length):
                    current_data.append((current_time + timedelta(minutes=m), np.NaN))
                current_time += timedelta(minutes=irregularity_length)
                j += irregularity_length
        if j < duration + window_size:
            for m in range(duration + window_size - j):
                current_data.append((current_time + timedelta(minutes=m), 0))
            current_time += timedelta(minutes=duration + window_size - j)
            j = 0
            i = i + 1
        else:
            i = i + 1 + j//(duration + window_size)
            j = j - (duration + window_size) * j // (duration + window_size)
    current_data = pd.DataFrame(current_data)
    current_data = current_data.rename(
        columns = {0: 'time', 1: 'current'}
    ).set_index('time')
    return current_data

I have refactored the duration parameter into mean_duration to correctly reflect its role of the batch average duration. Let’s illustrate the difference in the generated time series:

Equipment Failures

When it comes to equipment failures, the sensor signal should either go off (in which case we are dealing with missing data again) or experience some major irregularities. I can use the same generate_truncated_normal_vector function to model signal irregularities of this kind, with more “extreme” values, e.g.:

generate_truncated_normal_vector(1, 0.9, 1, 0, 100)[0]

I apply the resulted random number as a multiplier, while generating regular current values instead of the missing values in the last version of the generate_current_data function:

def generate_current_data(
        num_batches: int,
        mean_duration: int,
        window_size: int,
        lower_bound: float,
        upper_bound: float,
        irregularity_rate: float,
        irregularity_length: int
    ) -> pd.DataFrame:
    """Generate electricity current data for a given number of batches.

    :param num_batches: number of batches to generate data for
    :param duration: duration of each batch in minutes
    :param window_size: time period between batches
    :param lower_bound: lower threshold value for the current data
    :param upper_bound: range for the current data values
    :param irregularity_rate: rate of irregularity occurrence (between 0 and 1)
    :param irregularity_length: time period of irregularity, in minutes

    :return: a time series dataframe, where values represent the current value
    """
    current_data = []
    irregularity_rate /= irregularity_length
    current_time = datetime.strptime(
        datetime.now().isoformat(timespec='minutes'), '%Y-%m-%dT%H:%M'
    )
    i = 0
    j = 0
    while i < num_batches:
        duration = int(mean_duration*generate_truncated_normal_vector(1, 0.2, 1, 0, 2)[0])
        while j < duration:
            if random.random() > irregularity_rate:
                current_data.append((
                    current_time, random.uniform(lower_bound, upper_bound)
                ))
                current_time += timedelta(minutes=1)
                j += 1
            else:
                for m in range(irregularity_length):
                    irreg_coef = generate_truncated_normal_vector(1, 0.9, 1, 0, 100)[0]
                    current_data.append((
                        current_time + timedelta(minutes=m),
                        random.uniform(lower_bound, upper_bound)*irreg_coef
                    ))
                current_time += timedelta(minutes=irregularity_length)
                j += irregularity_length
        if j < duration + window_size:
            for m in range(duration + window_size - j):
                current_data.append((current_time + timedelta(minutes=m), 0))
            current_time += timedelta(minutes=duration + window_size - j)
            j = 0
            i = i + 1
        else:
            i = i + 1 + j//(duration + window_size)
            j = j - (duration + window_size) * j // (duration + window_size)
    current_data = pd.DataFrame(current_data)
    current_data = current_data.rename(
        columns = {0: 'time', 1: 'current'}
    ).set_index('time')
    return current_data

Now, the irregularity_rate parameter controls the frequency with which the data exhibits equipment malfunctioning behaviour. By using different combination of the parameters I can change the resulting irregularity pattern:

Conclusion

In this blog post, I modeled real-world electric sensor data from a process manufacturing equipment unit, while focusing on the three most common irregularities which can be observed, forexample, in chemical manufacturing production lines. These include variations in batch duration, missing chunks of data, and equipment malfunctions. The actual time series can be a combination of these irregularities, including varying batch durations and the presence or absence of missing data and equipment malfunctions in different proportions.

As far as further improvement of the data model, variability should be introduced to the time window between the batches and to the length of irregularity periods. One additional feature to consider can be simulating load patterns within batches. This can help to replicate real-world scenarios where there may be varying levels of equipment utilization during specific process steps.

Following steps would improve further the code:

merge the three versions of the generate_current_data function into a single method;
build a dedicated class and add additional methods as needed (the generate_truncated_normal_vector function should be one of them);
improve test coverage to cover all the irregularity cases discussed above.