Lessons from a series of breast cell annotation contests for schoolchildren

Evaluation criteria

Accuracy of cell annotation in the competition:

A cell was accepted as correctly annotated if the coordinates of the annotated point fell within 14 pixels (i.e. within 3.5 µm, the images having been scanned at ×40 magnification with a resolution of 0.25 µm/pixel) of the ground truth. Accuracy was calculated by dividing the total number of correct annotations by the total number of ground-truth cells at each level. Because this fraction was multiplied by 100, accuracy ranged from 0 to 100.

$${\mathrm{Accuracy}}=\frac{\text{Number of correctly annotated cells}}{\text{Number of all cells}} \times 100$$
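As an illustration of this criterion, the Python sketch below matches each ground-truth cell to its nearest unused annotated point and counts it as correct if the distance is at most 14 pixels. The greedy nearest-point matching and the function/variable names are assumptions for illustration only; the actual implementation is in the linked repository.

```python
import numpy as np

def annotation_accuracy(annotations, ground_truth, max_dist=14):
    """Percentage of ground-truth cells that have an annotated point within
    `max_dist` pixels (14 px = 3.5 µm at 0.25 µm/pixel).

    annotations, ground_truth: arrays of shape (N, 2) holding (x, y) points.
    """
    annotations = np.asarray(annotations, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    matched = np.zeros(len(annotations), dtype=bool)
    correct = 0
    for gt in ground_truth:
        if len(annotations) == 0:
            break
        # Distance from this ground-truth cell to every annotated point.
        dists = np.linalg.norm(annotations - gt, axis=1)
        dists[matched] = np.inf          # each annotation may be matched only once
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            matched[best] = True
            correct += 1
    # Accuracy is expressed on a 0-100 scale, as in the competition.
    return 100.0 * correct / max(len(ground_truth), 1)
```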

Other evaluation parameters:

In this article, we further use the F1 metric to assess performance. Instead of only calculating the percentage of correctly annotated cells, the F1 metric considers both false positives and false negatives. It is defined as the harmonic mean of precision and recall, F1 = 2 × TPR × PPV / (TPR + PPV), where TPR is the recall or true positive rate (TPR = TP / (TP + FN)) and PPV is the precision or positive predictive value (PPV = TP / (TP + FP)); TP is the number of true positives, FP the number of false positives and FN the number of false negatives.
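A minimal sketch of these definitions, assuming the TP, FP and FN counts have already been obtained from the point-matching step above (names are illustrative, not taken from the competition code):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute PPV (precision), TPR (recall) and their harmonic mean, F1."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # positive predictive value
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate / recall
    f1 = 2 * tpr * ppv / (tpr + ppv) if (tpr + ppv) else 0.0
    return ppv, tpr, f1

# Example: 40 correctly annotated cells, 10 spurious annotations, 5 missed cells.
ppv, tpr, f1 = precision_recall_f1(tp=40, fp=10, fn=5)
print(f"precision={ppv:.2f}, recall={tpr:.2f}, F1={f1:.2f}")
```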

Mobilization and participation

As the pilot phase of the competition was announced and launched over the summer, many schools were not in active contact with parents and students. A total of 28 students entered the competition and completed the training task, but only 5 of them chose to participate in the competition itself. We believe that the low number of people who made the transition from training to competition was partly related to the high degree of complexity of the training task and the early competition levels; it was too much of a leap for students without any experience of cell annotation and pathology. Among the five participants, one student passed the Mild level and entered the Hot level. No participant reached the final Spicy level. Again, the difficulty level was felt to be a factor here. From the pilot experiment, we concluded that participants needed a step-by-step approach to understanding the appearance of the different cell categories.

In the main competition, we simplified the practical part and the earlier competition levels. As a result, a total of 98 students entered the competition and 95 of them (97%) continued to participate. Figure 3(a) gives the distribution of the different cell types at each level of the competition. An additional category of cells was added to the annotations at each level, starting with the most easily recognized cells, i.e. tumor positive cells, and moving to the most difficult, non-tumor negative cells. At the Supercharged level, which included four categories of cells, the majority (81%) of the cells were tumor cells (60% tumor negative and 21% tumor positive), with negative cells arguably being more challenging than the obvious brown positive cells. We plot the accuracy scores achieved by participants at each level of the competition in Figure 3(b) and (c). The majority of participants (n = 91, 96%) achieved scores above the threshold of 50 at the Mild level, and 61 of them chose to continue to the Hot level. 52 participants (85%) then passed the Hot level and 28 of them went on to join the Spicy level. Among the 28 participants who attempted the Spicy level, 22 of them (81%) managed to unlock the last, Supercharged, level of the competition. This was a significant improvement over the pilot phase, with the staged approach making it easier for participants to progress to higher levels. As level complexity increased, the percentage of participants who were able to unlock the next level decreased. The peak accuracy at each level also decreased, from 99 to 87. This supported the design of the main competition, in which the level sequence followed increasing cell annotation complexity. The relevant code is available at https://github.com/TIA-Lab/pathcomp.

Figure 3

(a) Percentage of different cell types in the four levels of the competition: Mild, Hot, Spicy and Supercharged. (b) Percentage of participants who passed the current level and joined the next level. (c) Accuracy of all participants at each competition level.

Lessons learned from the pilot phase of the competition

The successful launch of the main competition was a result of lessons learned from the pilot phase of the competition. These lessons can be summarized as follows:

Gradual increase in task complexity Following the pilot edition, which took place during the summer of 2020, we investigated why 82% of those who registered did not take part in the competition and why 80% of those who started could not pass the first level. We solicited in-depth feedback from two school-aged children (ages 10 and 12) whom we asked to enter the contest. Their comments were that, even after watching the introductory video and reading the instructions, they were still not clear about what to do. The "jump" straight into annotating four cell types at level 1 (the Mild level) was too big, so both subjects suggested introducing a new cell type at each level, starting with the simplest cell type (PT) and moving to the most difficult (NNT), as a way to gradually increase complexity. We also introduced such progressive learning in the practice sessions.

Launch schedule The launch of the pilot edition during the summer of 2020, intended to offer an activity over the summer break, turned out to be sub-optimal timing because many schools were not actively communicating with parents and children and so could not promote the activity.

Launch event Launching the main competition at the Oxford Virtual Science Festival, with a video of the launch event, helped promote the competition to a wider audience. In addition, specific promotional material aimed at schools was created to promote the contest.

Comparison with pathologists

In Fig. 4, we give some examples of images annotated by the pathologists and by the three participants who achieved the highest accuracies at the Supercharged level. The pathologists' annotations were considered the ground truth (GT). As observed, each participant could detect tumor cells with high accuracy. However, participants tended to confuse positive non-tumor cells (yellow) with positive tumor cells (red), a distinction that can be difficult and subjective; some of these detection errors are marked with a dotted black circle. These errors may also result from the lack of training on non-tumor cells in the Practical part. Additionally, some artifacts were wrongly detected as cells.

Figure 4

Examples of cell annotation results by the pathologists and the three participants who achieved the top three accuracies at the Supercharged level. (Tumor positive: red; Tumor negative: green; Non-tumor positive: yellow; Non-tumor negative: blue). Pathologists' annotations were used as ground truth.

Evaluation metrics for cell detection by the participants who placed in the top three at each level are shown in Table 2. Focusing on the F1 measure, which incorporates both precision and recall, we observe that lower performance is obtained on the detection of non-tumor cells. At the Supercharged level, F1 is 0.75 on PNT and 0.59 on NNT, compared with 0.82 on PT and 0.80 on NT. This is mainly because there was not enough training on identifying non-tumor cells and their level of difficulty is high. We should add more training sessions on non-tumor cells, especially negative non-tumor cells, to the Practical part in later editions of the competition.

Table 2 F1 score on cell detection by participants ranked in the top three at each level.

In Fig. 5, we further assess the F1 distribution among all participants at each level of the competition. The F1 measure yields similar values (around 0.80) for the two tumor cell categories. However, it shows greater variance and a lower mean value for the identification of non-tumor cells. At the Supercharged level, F1 on PNT is 0.60, ranging from 0.24 to 0.75, while F1 on NNT is 0.53, ranging from 0.36 to 0.61.

Figure 5

Breakdown of F1 across the different cell categories and competition levels. (a) Mild level; (b) Hot level; (c) Spicy level; (d) Supercharged level. (PT: Tumor positive; NT: Tumor negative; PNT: Non-tumor positive; NNT: Non-tumor negative).

Machine Learning Algorithm Performance

We assessed the competition performance of two popular neural networks, AlexNet27 and VGG1628, after training them on the same images as the students. Table 3 gives the cell detection performance for both neural networks and both training strategies. Similar results are observed for the two networks. Interestingly, compared to transfer learning, training from scratch yields a much higher F1, especially in detecting non-tumor cells. VGG16 trained from scratch gives the best performance, with an F1 of 0.91 on PT, 0.93 on NT, 0.70 on PNT and 0.72 on NNT. We also plot the F1 obtained by the neural networks trained from scratch and by the participants in Fig. 6. As observed in Table 3, the neural networks achieve a higher F1 than the average F1 of the participants in all four cell categories. It should be noted that on PT, NT, and NNT, the neural networks trained from scratch also outperform the best participants. These observations demonstrate the learning superiority of the neural networks.
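To make the two training strategies concrete, the PyTorch sketch below sets up a VGG16 model over the four cell categories either with random initialization ("from scratch") or from ImageNet-pretrained weights ("transfer learning"). This is only an illustrative patch-classification setup under assumed settings; the actual detection pipeline, hyper-parameters and data handling used for the baselines are in the linked repository.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # PT, NT, PNT, NNT

def build_vgg16(from_scratch: bool) -> nn.Module:
    """VGG16 classifier: random initialization ('from scratch') or
    ImageNet-pretrained weights ('transfer learning')."""
    weights = None if from_scratch else models.VGG16_Weights.IMAGENET1K_V1
    model = models.vgg16(weights=weights)
    # Replace the final fully connected layer with a 4-class head.
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)
    return model

scratch_model = build_vgg16(from_scratch=True)    # trained from scratch
transfer_model = build_vgg16(from_scratch=False)  # fine-tuned from ImageNet weights
```

Both models would then be trained on the same annotated images provided to the students, with the only difference being the initialization of the convolutional weights.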

Table 3 F1 score on cell detection using different neural networks and strategies.
Figure 6

Comparison between the F1 achieved by the participants and that of the neural networks trained from scratch at the Supercharged level. The purple triangle represents results using VGG16, while the purple square represents results from AlexNet.