Highest likelihood of what? They're both faces, but the images are labeled by class. If an image is marked "trap" and it contains one trap and one man, then it's 50/50 which face gets marked trap and which one gets discarded.

If you are using FaceNet, then it probably has a way to handle multiple faces in an image. Just pick the face with the highest likelihood. Regardless, unless the data is massively tainted, a classifier should still work despite some false positives. Ditto for Asians: unless they are 90% of the dataset, the network will figure out how to deal with them. If you have labels for ethnicity, it might help to add an additional race class output to the network, which can boost performance.
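To make the "pick the face with the highest likelihood" idea concrete, here is a minimal sketch. It assumes "likelihood" means the face detector's confidence score, which does not tell you which face the class label refers to, so the 50/50 objection still stands. `Detection` and `pick_best_face` are hypothetical names for illustration, not part of FaceNet or any detector's API.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Detection(NamedTuple):
    """One detected face (hypothetical container, not a library type)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    confidence: float               # detector score, assumed to lie in [0, 1]


def pick_best_face(
    detections: Sequence[Detection],
    min_confidence: float = 0.0,
) -> Optional[Detection]:
    """Return the detection with the highest confidence, or None.

    Detections scoring below min_confidence are ignored, so an image
    with only weak detections can be dropped from the training set.
    """
    candidates = [d for d in detections if d.confidence >= min_confidence]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.confidence)


# Example: two faces in one image; keep only the stronger detection.
faces = [
    Detection(box=(0, 0, 10, 10), confidence=0.62),
    Detection(box=(20, 20, 50, 50), confidence=0.97),
]
best = pick_best_face(faces)
```

If the label noise this introduces turns out to matter, an alternative is to drop every image with more than one detected face instead of guessing.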
I am using sklearn's MLPClassifier; the inputs are FaceNet's outputs. The accuracy is slightly better than with an SVM. Adding noise to the images won't change anything.

Transfer learning should already work with a small dataset, especially when you have thousands of images. Dimensionality reduction won't help in any way. What is the input to your network? Maybe try using something like YOLO to cut a square around every person and feed that to your network, to reduce background noise.
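The "cut a square around every person" step could look roughly like the sketch below. It assumes you already have a bounding box from some detector (YOLO or otherwise) as `(x1, y1, x2, y2)` pixel coordinates. `square_crop_box` and its `margin` parameter are made up for illustration and don't come from any library.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def square_crop_box(
    box: Box,
    image_width: int,
    image_height: int,
    margin: float = 0.1,
) -> Box:
    """Expand a detection box to a square that stays inside the image.

    The square is centered on the detection, its side is the longer box
    edge plus a relative margin on each side, and it is shifted back
    inside the image if it would cross a border.
    """
    x1, y1, x2, y2 = box
    # Side length: longer edge plus margin, capped by the image size.
    side = round(max(x2 - x1, y2 - y1) * (1 + 2 * margin))
    side = int(min(side, image_width, image_height))

    # Center the square on the detection.
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    left = int(round(center_x - side / 2))
    top = int(round(center_y - side / 2))

    # Shift the square so it does not leave the image.
    left = min(max(left, 0), image_width - side)
    top = min(max(top, 0), image_height - side)
    return left, top, left + side, top + side


# Example: a tall, narrow detection in a 200x100 image, no extra margin.
crop = square_crop_box((40, 20, 60, 80), 200, 100, margin=0.0)
```

The returned tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` takes, so it can be passed straight through if you load images that way.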