Measurement | ShanghaiTech-China

Overview

Our project used a variety of measurement tools to help the Hardware team program the functions for the cell density measurement module. To obtain the function relating the spectral module reading to the product OD value, we collected 50 sets of 100 mL product solution at different OD values, together with the corresponding spectral module readings. Each set contains 10 measurements, for a total of 500 data points.

For these data, we carried out missing-value processing, outlier processing, and noise processing. We then fitted the pre-processed data with the linear least squares method and evaluated the quality of the fit.

Our Measurement tools provide Hardware teams with more accurate functions for developing inspection and maintenance devices.

Pre-processing

Missing value processing

When experimental data are collected, values can be lost or left blank for various reasons. We handle missing values differently depending on the distribution of each variable and on its importance (information content and predictive power).

The main treatment methods are as follows:

1. Delete variables.

2. Fill with fixed value.

3. Statistic filling: If the missing rate is low (less than 95%) and the importance is low, the missing values are filled according to the data distribution. If the data are evenly (symmetrically) distributed, the mean of the variable is used to fill in the missing values; if the data have a skewed distribution, the median is used instead.

4. Interpolation method filling: including random imputation, multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation, etc.

5. Model filling: Using regression, Bayes, random forest, decision tree and other models to predict missing data.

6. Dummy variable filling: If the variable is discrete and has only a few distinct values, it can be converted into dummy variables.
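As an illustration of statistic filling (method 3), the following minimal sketch replaces missing entries with the mean or the median of the observed values. The function name `fill_missing`, the use of `None` to mark a missing value, and the sample numbers are all assumptions made for this example, not part of our pipeline.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a fill value from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

# Hypothetical, roughly symmetric data: fill with the mean.
readings = [0.25, None, 0.75, 0.5, None]
filled_mean = fill_missing(readings, "mean")

# Hypothetical skewed data: the median is less affected by the large value.
skewed = [1.0, 2.0, None, 3.0, 100.0]
filled_median = fill_missing(skewed, "median")
```

The median is preferred for skewed data because a single extreme value can pull the mean far away from the bulk of the observations.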

While collecting the data, our measurement team was not able to adjust the product OD value in evenly spaced steps, so the 500 data points were not distributed densely enough. To ensure that there is at least one data point in every interval of length 0.02 for OD values between 0 and 1, we inserted 50 sets of data using Lagrange interpolation.

Lagrange interpolation formula

The basic idea of Lagrange interpolation is that for n points in the plane with pairwise distinct x-coordinates, there is exactly one polynomial of degree at most n - 1 that passes through all of them. The formula is as follows:

The coordinates of n points are:

`(x_{1},y_{1}),(x_{2},y_{2}),(x_{3},y_{3})...(x_{n},y_{n})`

Let:

`y_{1}=a_{0}+a_{1}x_{1}+a_{2}x_{1}^{2}+...+a_{n-1}x_{1}^{n-1}`

`y_{2}=a_{0}+a_{1}x_{2}+a_{2}x_{2}^{2}+...+a_{n-1}x_{2}^{n-1}`

......

`y_{n}=a_{0}+a_{1}x_{n}+a_{2}x_{n}^{2}+...+a_{n-1}x_{n}^{n-1}`

The Lagrange interpolation polynomial is solved as follows:

`L(x)=\sum_{i=1}^{n}y_{i}\prod_{j=1,j\ne i}^{n}\frac{x-x_{j}}{x_{i}-x_{j}}`
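The formula above can be evaluated directly with two nested loops. The sketch below is a plain illustration; the function name `lagrange_interpolate` and the sample points (taken from y = x^2) are assumptions chosen so the result is easy to check by hand.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through (xs[i], ys[i]) at x."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(set(xs)) != len(xs):
        raise ValueError("x-coordinates must be pairwise distinct")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial l_i(x): equals 1 at x_i and 0 at every other x_j.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

# Three hypothetical points on y = x^2; the interpolant reproduces the parabola.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
value = lagrange_interpolate(xs, ys, 1.5)
```

A polynomial through many points at once can oscillate strongly between them, so a sketch like this is usually applied to a few neighbouring points around each gap.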

Outlier processing

Outliers are a normal occurrence in data distributions: data that lie outside a specific distribution region or range are usually defined as anomalies or noise. The main detection methods are as follows:

1. Simple statistical analysis: Determine whether any abnormal values exist from the box plot and the quantiles.

2. 3`\sigma` principle: If the data are normally distributed, an outlier is a value that deviates from the mean μ by more than 3`\sigma`. Such values are rare, since P(| x - μ | > 3`\sigma`) <= 0.003.

3. Based on the median absolute deviation (MAD): This is a distance measure that is robust against outliers. It takes the median of the absolute deviations of the observations from their median. Other approaches include distance-based, density-based, and cluster-based detection.

Within each group of 10 measurements, we use the 3`\sigma` principle to screen out abnormal data.
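A minimal sketch of 3`\sigma` screening is shown below. The function name `three_sigma_filter`, the choice of the population standard deviation, and the sample readings are assumptions for illustration only.

```python
from statistics import mean, pstdev

def three_sigma_filter(values, k=3.0):
    """Split values into (kept, rejected) using the k-sigma rule."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return list(values), []  # all values identical, nothing to reject
    kept = [v for v in values if abs(v - mu) <= k * sigma]
    rejected = [v for v in values if abs(v - mu) > k * sigma]
    return kept, rejected

# Hypothetical module readings: 19 consistent values and one gross error.
readings = [769.0, 770.0, 771.0] * 6 + [770.0, 900.0]
kept, rejected = three_sigma_filter(readings)
```

The example uses 20 readings because the outlier itself inflates `\sigma`: among n values, no point can lie more than sqrt(n - 1) population standard deviations from the mean, so for small groups the threshold or the estimate of `\sigma` needs some care.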

Noise treatment

Noise is the random error or variance in a measured variable, that is, the difference between the observed value and the true value: obs = x + ε.

The main treatment methods are as follows:

1. Binning: Divide the data into equal-frequency or equal-width bins, then replace all the values in each bin with the bin's mean, median, or boundary value (the choice depends on the data distribution), which smooths the data.

2. Regression: Build a regression model between the variable and its predictors, then use the regression coefficients and the predictors to compute an approximate value of the variable.

We averaged each group of data, divided the results into equal-width bins with a width of 0.1, and took the average within each bin to eliminate noise and make the data smoother. After processing, we obtained 10 data points for linear least squares fitting.
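Equal-width binning followed by averaging can be sketched as follows. The function name `bin_average`, the representation of the data as (OD, reading) pairs, and the sample values are assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

def bin_average(points, width=0.1):
    """Average (x, y) points inside equal-width bins along x."""
    bins = defaultdict(list)
    for x, y in points:
        # round() guards against float artefacts such as
        # 0.3 / 0.1 == 2.9999999999999996 before taking the floor.
        index = math.floor(round(x / width, 9))
        bins[index].append((x, y))
    smoothed = []
    for index in sorted(bins):
        xs, ys = zip(*bins[index])
        smoothed.append((mean(xs), mean(ys)))
    return smoothed

# Hypothetical (OD, reading) pairs falling into the bins [0, 0.1) and [0.1, 0.2).
points = [(0.02, 772.0), (0.06, 774.0), (0.14, 771.0), (0.18, 773.0)]
smoothed = bin_average(points)
```

Each bin collapses to a single averaged point, so a range of OD values from 0 to 1 with a bin width of 0.1 yields at most 10 points.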

LLSF: Linear Least Squares Fit

Given a series of points `(x_{i},y_{i}),\ i=1,\dots,N`, assume that they have a linear relationship, that is, they can be fitted as `y=kx+b`. If we find the parameters k and b for which the points `(x_{i},y_{i})` and the line `y=kx+b` are as close as possible, we have a fitting line. To measure how well given parameters fit, we define an objective function, the "residual sum of squares", whose expression is as follows:

`f=\sum_{i=1}^{N}(y_{i}-kx_{i}-b)^{2}`

The problem then becomes finding the values of k and b at which f attains its minimum. Setting the partial derivatives of f with respect to k and b to zero gives:

`k=\frac{\sum x_{i}y_{i}-N\cdot\overline{x}\cdot\overline{y}}{\sum x_{i}^{2}-N\cdot\overline{x}^{2}}`

`b=\overline{y}-k\overline{x}`

Using linear least squares fitting, we obtained a function of the spectral module reading with respect to the product OD value: y = -6.4692x + 772.61.
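The closed-form expressions for k and b translate directly into code. The sketch below uses the hypothetical name `linear_fit` and invented sample points lying exactly on y = 2x + 1; it does not use our measured data.

```python
def linear_fit(xs, ys):
    """Return (k, b) of the least squares line y = k*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the closed-form slope.
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    k = sxy / sxx
    b = y_bar - k * x_bar
    return k, b

# Hypothetical points on y = 2x + 1.
k, b = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```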

Effectiveness Evaluation

We can use the coefficient of determination `R^2` as an evaluation index of fit quality, which is defined as follows:

`0\le R^{2}=\frac{SSR}{SST}=\frac{SST-SSE}{SST}\le1`

Here SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the residual sum of squares. The closer `R^2` is to 1, the smaller the error and the better the fit.

For our fit, we calculated `R^2` = 0.9993 > 0.999.

Our evaluation is that the fit is excellent and meets the needs of the engineering code.
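For reference, `R^2` can be computed from a fitted line as 1 - SSE/SST. The sketch below uses the hypothetical name `r_squared` and invented sample points whose least squares line is y = 0.5x + 1/6; it does not reproduce our measured data.

```python
def r_squared(xs, ys, k, b):
    """Coefficient of determination of the line y = k*x + b on (xs, ys)."""
    y_bar = sum(ys) / len(ys)
    # SSE: squared residuals between observations and the fitted line.
    sse = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    # SST: squared deviations of the observations from their mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    if sst == 0:
        raise ValueError("all y values are identical; R^2 is undefined")
    return 1.0 - sse / sst

# Hypothetical points and their least squares line y = 0.5x + 1/6.
r2 = r_squared([0.0, 1.0, 2.0], [0.0, 1.0, 1.0], 0.5, 1.0 / 6.0)
```

For these three points SSE = 1/6 and SST = 2/3, so `R^2` = 0.75, whereas points lying exactly on the line give `R^2` = 1.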