Detailed Description

A well-known saying holds that 'the customer is God'. When developing products, it is of great value to determine quickly whether users are interested in our fragrances. The rapid development of artificial intelligence offers an opportunity here: why not apply AI to consumer behavior analysis? We envisage an automatic detection system that identifies changes in users' microexpressions while they sample our fragrances, infers their attitude toward the product, and thereby guides us in refining formulas and adjusting product ratios to make products that users like.

After clarifying the goal, we studied the relevant literature and implementation techniques. We framed the problem as an image recognition and computer vision task and decided to use convolutional neural networks (CNNs), a mature family of algorithms in this field, to recognize and classify facial images. Finally, we use continuous frame capture from a computer camera to detect user expressions from video; detection completes within about a second, which in practice feels close to zero latency.

Algorithm Description

A convolutional neural network (CNN) is a deep learning model used mainly for image recognition and other computer vision tasks. It extracts features from an image through stacked convolution and pooling operations, then classifies the image through fully connected layers and a softmax classifier.

The following are the main implementation steps of the algorithm:

1. Convolutional layer: the convolutional layer is the core of a CNN, used to extract local features from images. It performs convolution operations between the input image and a set of learnable convolution kernels to generate a set of feature maps.

The formula for the convolution operation is as follows:

output(i,j) = ∑(m,n) input(i + m, j + n) * kernel(m,n)

where output(i, j) is a pixel of the output feature map, input(i + m, j + n) is a pixel of the input image, and kernel(m, n) is a weight of the convolution kernel.
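
For concreteness, here is a minimal sketch of a single convolutional layer in PyTorch (the toolkit we train with); the choice of 16 kernels of size 3 x 3 is an illustrative assumption, not our final setting:

import torch
import torch.nn as nn

# A single convolutional layer: 3 input channels (RGB), 16 learnable
# 3x3 kernels, padding of 1 so the spatial size is unchanged.
# Note that PyTorch uses channels-first layout (batch, channels, H, W).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 244, 244)    # one 244 x 244 RGB image
feature_maps = conv(x)             # 16 feature maps
print(feature_maps.shape)          # torch.Size([1, 16, 244, 244])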

2. Pooling layer: the pooling layer reduces the spatial resolution of the feature maps, cutting the number of parameters and the computational cost. The most common pooling operations are max pooling and average pooling.

The formula for maximum pooling is as follows:

output(i,j) = max(input(2i,2j), input(2i,2j+1), input(2i+1,2j), input(2i+1,2j+1))

where output(i, j) is a pixel of the output feature map, and input(2i, 2j) through input(2i + 1, 2j + 1) are the four pixels of the corresponding 2 x 2 window in the input feature map.
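
A matching PyTorch sketch of 2 x 2 max pooling with stride 2, which halves each spatial dimension (the feature map sizes continue the illustrative example above):

import torch
import torch.nn as nn

# 2x2 max pooling with stride 2: each output pixel is the maximum of a
# 2x2 window of the input feature map, exactly as in the formula above.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 16, 244, 244)   # 16 feature maps from the conv layer
print(pool(x).shape)               # torch.Size([1, 16, 122, 122])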

3. Fully connected layer: the fully connected layer flattens the feature maps from the pooling layer into a one-dimensional vector and classifies it through a series of fully connected operations. Each neuron is connected to all neurons in the previous layer and outputs a scalar value.

The formula for fully connected layers is as follows:

output = activation(input · weights + bias)

where output is the output of the fully connected layer, input is its input vector, weights is the weight matrix, bias is the bias vector, and activation is the activation function (e.g. ReLU).
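
A minimal PyTorch sketch of the flatten-and-connect step; the hidden size of 128 and the ReLU activation are illustrative assumptions:

import torch
import torch.nn as nn

# Flatten the pooled feature maps into a 1-D vector, then apply
# output = activation(input . weights + bias) via nn.Linear.
x = torch.randn(1, 16, 122, 122)            # pooled feature maps
flat = torch.flatten(x, start_dim=1)        # shape: [1, 16 * 122 * 122]
fc = nn.Linear(in_features=16 * 122 * 122, out_features=128)
hidden = torch.relu(fc(flat))               # ReLU as the activation
print(hidden.shape)                         # torch.Size([1, 128])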

4. Softmax classifier: the last layer is usually a softmax classifier, which converts the output of the fully connected layer into a probability distribution over the classes.

The formula for the Softmax function is as follows:

output_i = e^(input_i) / ∑(j) e^(input_j)

where output_i is the predicted probability of the i-th category and input_i is the input score (logit) for the i-th category.
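
A short PyTorch illustration of the softmax step, with made-up scores for our three expression labels:

import torch

# Softmax turns raw scores (logits) into a probability distribution:
# output_i = e^(input_i) / sum_j e^(input_j)
logits = torch.tensor([2.0, 0.5, -1.0])   # e.g. "happy", "sad", "so-so"
probs = torch.softmax(logits, dim=0)
print(probs)         # tensor of probabilities
print(probs.sum())   # sums to 1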

This is the working principle of a convolutional neural network. By stacking convolutional, pooling, and fully connected layers, a CNN can learn complex features from raw images and use them for tasks such as image classification and object detection.
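
Putting the pieces together, a minimal sketch of such a stacked CNN for our three expression labels could look as follows; the depth, channel counts, and layer sizes are illustrative assumptions rather than our production architecture:

import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Minimal CNN: (conv -> ReLU -> pool) twice, then a fully connected layer."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 244 -> 122
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 122 -> 61
        )
        self.classifier = nn.Linear(32 * 61 * 61, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)   # raw logits; softmax is applied afterwards

model = ExpressionCNN()
logits = model(torch.randn(1, 3, 244, 244))
probs = torch.softmax(logits, dim=1)   # probabilities over ["happy", "sad", "so-so"]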

Training process

1. Establish a dataset

We take photos directly through the computer camera and manually annotate each image with its expression label.
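
As a sketch of this capture step (the use of OpenCV and the file layout are our assumptions; any annotation tool can then be used to label the saved images):

import cv2

# Capture frames from the default camera and save them for manual
# annotation. Press 's' to save the current frame, 'q' to quit.
# Assumes the directory dataset/raw/ already exists.
cap = cv2.VideoCapture(0)
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("capture", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord('s'):
        cv2.imwrite(f"dataset/raw/img_{count:04d}.jpg", frame)
        count += 1
    elif key == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()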


2. Model training

We use the PyTorch toolkit to train on this dataset and generate a model that can be called for facial expression recognition.
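
A minimal sketch of the training step, reusing the ExpressionCNN sketch from the algorithm section; the ImageFolder layout (dataset/train/<label>/*.jpg), the hyperparameters, and the checkpoint name are illustrative assumptions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize every image to the model's expected input size.
transform = transforms.Compose([
    transforms.Resize((244, 244)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("dataset/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = ExpressionCNN()   # the sketch defined in the algorithm section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()   # applies log-softmax internally

for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

torch.save(model.state_dict(), "expression_cnn.pt")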


The following is a visualization of the model structure:

Input structure:

Float32, shape: [null, 244, 244, 3] (batch size, height, width, RGB channels; null means any batch size)

Output structure:

Float32, a probability for each of the image labels

labels: ["happy", "sad", "so-so"]

Accuracy:

99%


3. Model application
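
The original section leaves this step unspecified, so below is a hedged sketch of applying the trained model to live camera frames; OpenCV for capture, the checkpoint name, and the ExpressionCNN class are carried over from the assumptions in the sketches above:

import cv2
import torch
from torchvision import transforms

LABELS = ["happy", "sad", "so-so"]

model = ExpressionCNN()   # the sketch defined in the algorithm section
model.load_state_dict(torch.load("expression_cnn.pt"))
model.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((244, 244)),
    transforms.ToTensor(),
])

cap = cv2.VideoCapture(0)
with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # OpenCV frames are BGR
        x = preprocess(rgb).unsqueeze(0)               # [1, 3, 244, 244]
        probs = torch.softmax(model(x), dim=1)[0]
        label = LABELS[int(probs.argmax())]
        confidence = float(probs.max())
        cv2.putText(frame, f"{label} ({confidence:.2f})", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("expression", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()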
