Item-level monitoring, response style stability, and the hard-easy effect

Cognitive and metacognitive processes during learning depend on accurate monitoring. This investigation examines the influence of immediate item-level knowledge of correct response feedback on cognition monitoring accuracy. In an optional end-of-course computer-based review lesson, participants (n = 68) were randomly assigned to receive either immediate item-by-item feedback (IF) or no immediate feedback (NF). Item-by-item monitoring consisted of confidence self-reports. Two days later, participants completed a retention test (IF = NF, no significant difference). Monitoring accuracy during the review lesson was low and, contrary to expectations, was significantly lower with immediate feedback (IF < NF, Cohen's d = .62). Descriptive data show that (1) monitoring accuracy can be attributed to cues beyond actual item difficulty, (2) a hard-easy effect was observed in which item difficulty was related to confidence judgements as a non-monotonic function, (3) response confidence was predicted by the Coh-Metrix dimension Word Concreteness in both the IF and NF treatments, and (4) significant autocorrelations (hysteresis) of the confidence measures were observed for NF but not for IF. It seems likely that monitoring is based on multiple and sometimes competing cues, with the salience of each cue related in some degree to content difficulty, but also that the stability of individual response styles plays a substantive role in monitoring. This investigation shows the need for new applications of technology that monitor multiple measures on the fly to better understand SRL processes and so support all learners.


Introduction
It has been noted that metacognitive skills enable people to be strategic, and that these skills can be learned. Self-regulation is a critical metacognitive life skill that is central to individual wellbeing and to achieving both immediate and long-term goals (Bandura, 1991). Self-regulated learning (SRL) is an educationally important subset of self-regulation that includes setting and updating learning goals, ongoing planning, monitoring, occasional strategy-shifts, and ongoing progress evaluation (Butler & Winne, 1995; Panadero, 2017). Monitoring is a central aspect of SRL and is the emphasis of this experimental investigation, because improving technology-supported monitoring in SRL is important for designers, educators, and researchers (Brady et al., 2020; Kavousi et al., 2020; Reid et al., 2017; Zhu et al., 2020). Panadero (2017) reviewed six actively researched models of SRL including those of Boekaerts; Efklides; Hadwin, Järvelä and Miller; Winne and Hadwin; Pintrich; and Zimmerman. Although these models differ in various ways, all rely on learners' ability to gauge their understanding, for example as judgements of learning (JOLs), in order to select and use appropriate cognitive and metacognitive strategies (Reid et al., 2017). But Nelson and Dunlosky (1991) note that, "… the nearly universal finding of the literature … is that accuracy of JOLs is far from perfect, in fact closer to nil than to perfect" (p. 267). Because of the fundamental relationship between monitoring accuracy and good decision making in SRL, monitoring accuracy is important and requires further research and thought (Bol & Hacker, 2012; Hacker et al., 2008).
A vital monitoring task asks, how am I doing now? Because some idea units are easier and some are more difficult, the actual and the perceived difficulty of lesson materials likely varies from moment to moment, so collecting monitoring data more frequently, for example at the individual idea level, could improve monitoring accuracy (Butler & Winne, 1995; Dunlosky & Lipko, 2007; Hartwig & Dunlosky, 2017).
Monitoring accuracy can be improved further when accompanied by performance feedback that shows whether the monitoring judgement matches actual performance, and this feedback may then improve subsequent monitoring judgements (Händel et al., 2020; Nietfeld et al., 2005, 2006). Dunlosky and Rawson (2015) reviewed the prior literature on SRL with feedback, with a focus on using test-like events during lesson review of previous course content. They report that feedback during review lessons improves memory and final course performance for that specific course content, transfers to comprehension and application outcomes, and influences decision choices in the lesson (Merriman et al., 2012).
A major goal of monitoring research is to find the basis of students' judgments by asking, "What factors are students making inferences about when they construct their judgments?" (Dunlosky & Thiede, 2013, p. 59). Monitoring as a manifestation of perceived difficulty may be inaccurate since it is likely to be based on multiple cues (Hertzog et al., 2013) such as memory of past test performance (Finn & Metcalfe, 2007). Attention to multiple conflicting cues, or to no cues, is likely to establish monitoring inertia, a form of measurement bias referred to as the stability of individual response styles (SIRS; Javaras & Ripley, 2007; Weijters et al., 2010). SIRS judgements may be an individual trait variable that remains fairly constant or stable over time "…despite external influences and direct evidence to the contrary" (Kornell & Bjork, 2009, p. 460; also see Bjork et al., 2013).
This investigation examines the influence of immediate feedback on monitoring accuracy measured as item-by-item confidence self-reports (Dunlosky & Lipko, 2007). The central question is, will immediate item-level feedback improve monitoring accuracy relative to a no-immediate-feedback control? Most SRL studies collect only one or a few pieces of monitoring data, typically at the start or the end of a lesson or unit of instruction. Such sparse monitoring data do not allow for richer descriptive analyses, such as the extent of the relationship between the difficulty of an idea unit and learners' judgments of that idea unit, nor do they allow comparison of confidence across time or of the influence of previous confidence on future confidence (e.g., as hysteresis, the influence of a recent past outcome on a current outcome). Thus this investigation adds the following descriptive/exploratory questions: Is item-level confidence non-monotonically related to item-level difficulty (hard-easy effect)? Is an individual's item-level confidence stable across responses within the review lesson (SIRS)? Do item-level confidence responses influence follow-on confidence responses (hysteresis)? What immediate context cues influence confidence judgements (as Coh-Metrix item-level text features)? Monitoring accuracy and the literature bases of each of these four exploratory questions are presented next.

Monitoring accuracy in SRL
Monitoring accuracy is the degree of fit between a person's judgment of performance towards a goal and his or her actual performance related to that goal (Bol & Hacker, 2012; Keren, 1991). Fleming and Lau (2014) point out that cognitive psychologists as early as 1885 were interested in how well people could monitor their own knowledge, and that confidence ratings were a mainstay of such analysis. Several broad areas can be monitored within the five SRL phases of setting learning goals, planning, monitoring, strategy-shifts, and progress evaluation (reflection), including regulation of cognition, of motivation, of behavior, and of context (Greene & Azevedo, 2007; Pintrich et al., 2000). Even after many decades of research, Rutherford (2014) has noted, "although calibration is a growing area of research within Educational Psychology, unanswered questions remain about the nature of calibration: how it should be measured, its role as a dynamic aspect of metacognition, and how best to improve it" (p. xv). Of most interest in this investigation is cognition monitoring accuracy (i.e., calibration).
Various interventions, including providing guidance, practice, clear performance-expectation criteria, and feedback, have been proposed to reduce gaps between perceived and actual performance (Burson et al., 2006; Händel et al., 2020; Nietfeld et al., 2005, 2006; Stone, 2000). Bandura (1991) notes that the informativeness of performance feedback is necessary for students to have a clear idea of how they are doing (p. 251). Combining feedback with certitude-based JOLs has been actively researched for at least 40 years (Kulhavy et al., 1976; Vasilyeva et al., 2008), but these studies have mainly considered how certitude can influence the effectiveness of feedback (Stock et al., 1992), not the reverse: how feedback influences ongoing certitude and thus monitoring accuracy. Extending this to SRL settings, Nugteren, Jarodzka, Kester, and van Merriënboer (2018) propose that "Future studies could therefore incorporate feedback to improve the students' self assessments…." (p. 375).
Content micro-to-macro level grain size is also an important consideration for cognition monitoring. It is not the same to ask a student how well will you do in this course or how well will you do on this exam (global monitoring) versus how well will you do on this exam item (local monitoring), and so on (Hartwig & Dunlosky, 2017). "In comparison to macrolevel analysis, item-by-item analysis allows a more detailed, and likely more accurate, view of the process of forming metacognitive judgments" (Rutherford, 2014, p. 22). Butler and Winne (1995) comment that

in general, research investigating feedback and self-regulation has focused on behaviors at too large a grain size - for example, studying whole passages or answering sets of test items after studying is over - and has thereby collected data that fail to reflect the variance in behavior that is regulation. (p. 246)

Dunlosky and Lipko (2007) note, "Will even more specific judgments (e.g., at the level of individual idea units in each concept) serve to further reduce overconfidence and support even higher levels of accuracy?" (p. 231). Thus, it is not yet clear from the research base how item-level monitoring with feedback influences monitoring accuracy.

Multiple cues that may influence monitoring accuracy beyond actual difficulty
Actual measured difficulty and perceived difficulty are not identical measures. Review of the literature points to several areas that consider perceived and actual difficulty. Prior familiarity could influence judgements, but when such information is incomplete, what other cues influence learners' perceptions of content difficulty? Specifically in this section, first, the hard-easy effect is presented to provide one account for past mixed and no-difference findings for monitoring accuracy. Next, stability of individual response style (SIRS), observed in traditional survey research, is extended here to item-level confidence self-report measures as an exploratory and descriptive measure related to monitoring accuracy. Because multiple confidence measures are collected over time, the possible influence of confidence responses on subsequent confidence responses (hysteresis) is considered. And last, Coh-Metrix text features of the lesson items are examined as a context cue that can influence perceived difficulty and thus influence monitoring.

Hard-easy effect
A premise of this investigation, and perhaps of all SRL investigations, is that of a monotonic relationship between the observed difficulty of the lesson content based on actual performance scores and the learner's judgement of the difficulty of that content - that difficult material is perceived to be difficult and easy material is perceived to be easy. But previous SRL research shows that students tend to be overconfident on difficult things and underconfident on easy things, the hard-easy effect (Arnold et al., 2017; Juslin et al., 2000). So what is perceived difficulty? At a minimum, confidence perceptions of difficulty depend on multiple cues including familiarity, prior knowledge, characteristics of the text, characteristics of the items, guessing, and combinations of these (Dinsmore & Parkinson, 2013). The cues attended to will influence judgements of difficulty and confidence. Reid et al. (2017) note that although most learners appear to be poorly calibrated, high-achieving students are generally more accurate than low achievers at estimating their performance, but also that high-performing students are more likely to be underconfident than low-performing students (Bol et al., 2010; Grabowski, 2004; Kruger & Dunning, 1999; Krueger & Mueller, 2002; Rutherford, 2014; Stone, 2000). For example, Hacker et al. (2000), in a semester-long course, placed students into five groups based on ability and asked them to make prediction and postdiction judgements of their performance on successive multiple-choice exams. At exam 1, monotonic and nearly linear relationships between actual group-level performance and anticipated performance, whether elicited before (predicted) or after completing the exam (postdicted), were observed (see Fig. 1), but by the third exam several weeks later, predicted and postdicted judgements were non-monotonic, and postdicted judgements had become considerably more overconfident than predicted judgements.
The lower ability groups clearly show substantial overconfidence with both predicted and postdicted judgements.

Monitoring inertia due to stability of individual response style and hysteresis
Stability of individual response styles (SIRS) observed in survey research has been shown to hold over the course of a single questionnaire administration (Javaras & Ripley, 2007) and is both a within-survey and a longitudinal phenomenon. Weijters et al. (2010) surveyed Belgian members of the public (n = 604) with non-overlapping surveys a year apart; the first survey asked items such as "Television is my primary form of entertainment" and "Air pollution is an important worldwide problem," while the second asked such questions as "I understand myself" and "The things I possess are not that important to me." Weijters et al. reported four Likert-scale response styles that to a large extent appear to be stable individual characteristics: giving mainly positive, negative, middling, or extreme responses. Monitoring judgements, such as predicting an exam score, are basically survey-type responses, so an individual's confidence responses may exhibit stability within and across trials (Bjork et al., 2013). In other words, individuals' confidence responses vary around an individual set point, and so these measures would be consistent across their own responses and thus consistently higher or lower relative to the group average. This is biologically plausible, since living systems offer numerous examples of such closed-loop control systems (Illingworth, 2011).
Similarly, Hacker et al. (2000) reported that judgments of performance were influenced by prior judgments, and so besides an individual response stability component (SIRS), confidence responses may also be influenced by recent previous confidence responses. This is also referred to as response history (Hertzog et al., 2013), or hysteresis, defined as the dependence of the current state of a system on its recent history. For example, confidence response has been described as biased towards previous responses, where "recent confidence represents a mental shortcut (heuristic) which informs self-reflection when more relevant information is unavailable" (Benwell et al., 2020, p. 18). In this present investigation, item-by-item confidence measures are examined for hysteresis using SPSS Forecasting-Autocorrelation of the confidence measures.

Coh-Metrix measures of test item cues
Besides actual measured item difficulty, confidence judgements can likely be influenced by the immediately available text features (Mills et al., 2015), for example Weaver and Bryant (1995) reported that metacognitive accuracy for reading comprehension depended on text readability level (Kelemen et al., 2000, p. 93). A multilevel theoretical framework of text features has been operationalized as Coh-Metrix that builds from and considerably extends the early readability measures such as Flesch, Dale-Chall, Gunning, and SMOG formulas (Dowell et al., 2016). The Coh-Metrix analysis tool is a well-established and well-researched computational tool that amalgamates a number of previously separate measures of linguistic and discourse features of a text. Coh-Metrix provides more than one hundred measures of a text in five discrete dimensions: Narrativity, Syntactic Simplicity, Word Concreteness, Referential Cohesion, and Deep Cohesion. Which of these measures and dimensions are salient here? Of these five dimensions, Word Concreteness is the only one that aligns with the review items in this investigation, because this lesson content does not have narrativity, the syntactic form of items has no variability because the item format is standard across all of the items, and the items are too brief to exhibit referential or deep cohesion.
Most Coh-Metrix investigations typically use long text portions, but a study by Walkington et al. (2018) used Coh-Metrix measures of sentence-long mathematics word problems from 20 years of archived test data from the National Assessment of Educational Progress (NAEP). They used pilot studies to narrow down to four Coh-Metrix measures, specifically the Word Concreteness dimension plus three individual measures including word count, pronoun density, and presence of second person pronouns. These measures were related to performance in various ways, for example students who were weaker in mathematics tended to benefit more from factors such as word concreteness that make the math problems easier to read (pp. 403-404). Following Walkington et al. (2018), this present investigation seeks to determine the influence of the Coh-Metrix item-level text measures on response confidence using two of these measures, word count and Word Concreteness. Note that pronoun density and presence of second person pronouns used by Walkington et al. could not be considered because the review items in this present investigation contain only three pronouns.

Purpose
This experimental investigation with random assignment to treatment examines the influence of item-by-item immediate knowledge of correct response feedback compared to no immediate feedback (control) on cognition monitoring accuracy in an end-of-course review lesson. There is only one experimental question: does immediate feedback improve monitoring accuracy? In addition, to reconcile past mixed findings, descriptive analyses (1) seek evidence of non-monotonicity as the hard-easy effect, (2) examine the influence of response stability (both SIRS and hysteresis), and (3) consider the possible role of selected Coh-Metrix text features on confidence responses.

Participants
Students (n = 68) were a sample of convenience from an instructional design program at an eastern U.S. university who voluntarily participated in this experimental investigation. Most participants self-reported as female (78%), and most were working professionals (78%; the remainder were full-time students).

Materials and procedures
This review lesson approach follows a series of SRL investigations by Dunlosky and colleagues that used definitions of key concepts in an introductory undergraduate psychology course, provided as a computer-delivered review towards the end of the course (e.g., Dunlosky & Rawson, 2012). The content of this review lesson consisted of instructional design terminology from the course textbook. The lesson items were arranged in the order of primary occurrence in the textbook. Students could drop in to any campus computer lab at any time to complete this unmonitored review lesson. After logging in to the lab computer, the software randomly assigned them to the IF or NF treatment group. Students under IF completed the end-of-course review lesson with immediate item-by-item feedback (n = 31), while students under NF received the same items in the same order but with no immediate feedback (n = 37). There was no participant mortality; the unequal group sizes are due to true random assignment by the software without regard to past assignment to group. Item responses consisted of providing a confidence judgement and then selecting the term that matches the definition from four alternatives. Confidence judgements had 5 levels ranging from "I am just guessing" through "I am about 50% confident that I know the answer" to "I am certain I know the answer".
Both IF and NF here are classic knowledge of correct response (KCR) feedback that informs the learner of the correct answer to a specific problem by displaying the question and correct response with no additional information (Clariana, 1990; Clariana et al., 1991). For IF, the correct answer was provided immediately after the response entry, while for the NF treatment, the correct-answer feedback was given en masse at the end of that lesson section (i.e., NF is actually delayed feedback). The review lesson also offered a second section that covered other course content; that section is not included in this analysis, but its data are reported in a separate investigation by Follmer and Clariana (2020). Two days later, participants completed a paper-based retention test. The lesson and retention test data were matched for analysis using user-created logins known only to the participants.

Results
The results are presented in two sections, the first section includes the individual-level cognitive monitoring accuracy data and the retention test performance. The second section presents the review lesson data descriptive analyses to consider relationships between confidence and performance (e.g., the hard-easy effect), the potential downstream influence of response confidence (hysteresis), and stepwise multiple regression to consider how Coh-Metrix measures of text features can predict lesson confidence responses beyond actual item difficulty.

Cognitive monitoring accuracy
Participants' monitoring accuracy in the review lesson was calculated as the relationship between item-level confidence (1, 2, 3, 4, 5) and lesson-item performance (0, 1) data for each item. Because the confidence data are ordinal, there is a continuing debate regarding which parametric or non-parametric tests may be used in these kinds of studies. Typical analyses use Spearman rho, Goodman's gamma, or the Receiver Operating Characteristic (ROC, as area under the curve). Following the analysis approach used by Bol and Hacker (2012) and Lin and Zabrucky (1998), this study used Spearman rho.
Individuals' relative monitoring accuracy was calculated as the Spearman rho of an individual's confidence (1-5) with their actual performance (0, 1) across the 18 review lesson items, yielding one monitoring accuracy value per participant. The observed reliability of these monitoring data was KR-20 = .96. As in most studies of this type, some individuals were excluded from the correlation analysis due to correlation indeterminacy, for example by having no variance in item confidence replies (e.g., giving a 4 for every item). Four participants were excluded due to indeterminacy, reducing the individual-level analysis sample from n = 37 to n = 35 for NF and from n = 31 to n = 29 for IF. Because Spearman rho values are not interval-level data, these rho values were converted to Fisher r-to-Z transformation values, which are interval-level measures, before averaging. The averaged individual-level monitoring accuracy as Fisher Z-transformed data are: NF (control) M = .27, SD = .30 and IF (treatment) M = .10, SD = .21. The data were analyzed by analysis of variance with one between-subjects factor, Feedback (IF or NF). Levene's test was not significant, p = .126. Feedback was significant, F(1,62) = 6.629, MSe = .068, p = .012, eta squared = .097 (i.e., NF > IF, Cohen's d = .62). Although the mean difference between NF and IF is significant, monitoring accuracy was very low for both groups.
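As a sketch of the computation described above (the responses below are simulated, not the study's data, and function names are illustrative), per-participant accuracy as Spearman rho with the Fisher r-to-Z transform might look like:

```python
import numpy as np
from scipy import stats

def monitoring_accuracy(confidence, correct):
    """Spearman rho between one participant's item-level confidence
    ratings (1-5) and item-level performance (0/1); returns np.nan
    when the correlation is indeterminate (no variance in either)."""
    if np.std(confidence) == 0 or np.std(correct) == 0:
        return np.nan  # excluded from analysis, as in the study
    rho, _ = stats.spearmanr(confidence, correct)
    return rho

def fisher_z(rho):
    """Fisher r-to-Z transform, applied before averaging rho values."""
    return np.arctanh(rho)

# Illustrative (simulated) responses for one participant, 18 items.
conf = [3, 4, 2, 5, 1, 4, 3, 5, 2, 4, 3, 1, 5, 4, 2, 3, 4, 5]
perf = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1]

z = fisher_z(monitoring_accuracy(conf, perf))
```

The Fisher-transformed z values, one per retained participant, would then be averaged within each treatment group and compared, as in the ANOVA reported above.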

Retention test performance
The retention test given two days after the computer-based review lesson consisted of the same 18 four-alternative multiple-choice items as the review lesson, but the items were randomly ordered. The observed reliability of the retention test was KR-20 = .77. Lesson and retention test performance for IF (n = 31) were M = .78 and M = .83, SD = .11, improvement ES = .45 (p = .003), and for NF (n = 37) were M = .79 and M = .86, SD = .12, improvement ES = .58 (p = .005), indicating that students learned from both the IF and NF lessons and did well on the retention test. The retention test performance data were analyzed by analysis of variance with the between-subjects factor Treatment (NF or IF). Levene's test of equality of error variances was not significant, p = .617. No main effect was observed, F(1,67) = 1.339, MSe = .014, p = .251, eta squared = .020. The IF and NF (delayed feedback) retention test means were statistically equivalent, which is a normal finding in the literature (Clariana et al., 1991).

The hard-easy effect
For descriptive purposes, confidence rank data are treated as close approximations of interval data (from Tekin & Roediger, 2017). To convert item-level confidence ranks to interval data, we estimated as follows: for the 1-5 scale with the descriptions where 1 is "I am guessing", 3 is "I am about 50% confident that I know the answer", and 5 is "I am certain I know the answer", then 1 = 0%, 2 = 25%, 3 = 50%, 4 = 75%, and 5 = 100% (see Table 1).
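A minimal sketch of this rank-to-interval conversion (the mapping is taken from the text above; names are illustrative):

```python
# Map the 5-point confidence ranks onto an interval (proportion) scale,
# following the assignment above: 1 -> 0%, 2 -> 25%, ..., 5 -> 100%.
RANK_TO_PCT = {1: 0.00, 2: 0.25, 3: 0.50, 4: 0.75, 5: 1.00}

def to_interval(ranks):
    """Convert a sequence of 1-5 confidence ranks to 0-1 proportions."""
    return [RANK_TO_PCT[r] for r in ranks]
```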
Item-level scatterplots of item difficulty (item P) against item confidence were prepared to check for linearity/monotonicity (Laerd, 2018; see Fig. 2). The calibration ideal is represented by the perfect-accuracy diagonal line; the classic hard-easy effect is apparent for both the NF and IF treatments, with overconfidence for difficult items (left side of Fig. 2) and underconfidence for easy items (right side of Fig. 2).
The inflection points of the two curves are consistent with Lichtenstein and Fischhoff (1977), who reported overconfidence for relatively more difficult items below .75 item difficulty and underconfidence for relatively easier items above .75 item difficulty. Notice the fairly linear relationship between confidence judgements and item difficulty for easy items above .75 item difficulty. A nonlinear, non-monotonic relationship between confidence and item difficulty is statistically problematic for determining cognition monitoring accuracy. Perhaps the fairly low values for monitoring accuracy observed here and in previous studies can be attributed to the item difficulty of the lesson content, which may be a common phenomenon for studies that use content spanning a range of difficulty.

Descriptive charts of the lesson data
Sequence charts of the item-level performance and confidence self-reports are provided in Fig. 3. There is substantial similarity between the IF and NF groups' item-level confidence (IF-NF confidence, Pearson r = .85) as well as their item performance (IF-NF item P, Pearson r = .91). Inspection of the averaged item-level difficulty and item confidence measures indicates that the IF treatment (top of Fig. 3) and the NF treatment (bottom of Fig. 3) are quite alike. An item that was difficult for the NF group was also difficult for the IF group, and likewise a low-confidence response to an item by the NF group was also a low-confidence response by the IF group. Within both treatments there is also a noticeable pattern between confidence and performance, with the confidence measures being less extreme than the performance measures (i.e., confidence nearer to the .76 average-confidence horizontal line).

Hysteresis and autocorrelation
Does confidence at a response moment influence confidence for later items, and if so, how far downstream (Hacker et al., 2000)? Since the 18 items occurred in a sequence, the average confidence measures for each item displayed in Fig. 3 form a sequence chart. Separate autocorrelation analyses of the item P data and of the item confidence data were conducted with SPSS version 25 using the Forecasting analyses. After inspection of the sequence charts, it was determined to use the typical analysis assumptions of zero differencing (i.e., no trend detected) and not to use the natural log transform. Analysis of the ACF and PACF curves and the autocorrelation values for up to 4 lags indicated no autocorrelation for either item difficulty (P) series. But for the item confidence data, a significant Lag 1 was observed for the NF treatment (Lag 1, r = −.45, p = .03) but not for the IF treatment (see Table 2).
These autocorrelation values support that item confidence exhibited a downstream influence under no immediate feedback (NF) but not with immediate feedback (IF). The negative Lag 1 autocorrelation for the NF (control) treatment suggests confidence response stability around an average setpoint, where a confidence response inversely influences the next confidence response. But with immediate feedback (IF), there is no observed measurable lag (hysteresis) in the ongoing confidence responses.
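The SPSS Forecasting analysis used here can be approximated with the standard sample autocorrelation estimator. This sketch (the series is simulated, not the study's data; no differencing or log transform, matching the stated assumptions) shows how responses that alternate around a setpoint yield a negative Lag 1 value like the one observed for NF:

```python
import numpy as np

def acf(series, max_lag=4):
    """Sample autocorrelation function: lag-k autocovariance about the
    overall mean, divided by the lag-0 variance (the usual estimator)."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    var = np.sum(x * x)
    return [1.0 if k == 0 else float(np.sum(x[k:] * x[:-k]) / var)
            for k in range(max_lag + 1)]

# A series oscillating around a setpoint of 3.5 produces a strongly
# negative Lag 1 autocorrelation (each response pulls the next back).
alternating = [4, 3, 4, 3, 4, 3, 4, 3]
lags = acf(alternating)  # lags[1] is negative
```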

Coh-Metrix text features and item-level confidence
Stepwise multiple regression was used to determine the roles of actual observed item difficulty (item P) and of selected text features in predicting response confidence. The web-based Coh-Metrix tool (version 3.0, http://tool.cohmetrix.com/) was used to calculate the text features of each review lesson item. Then the group-level lesson item difficulty (P) along with the two selected text features, Word Concreteness (a dimension) and word count (a measure), were used to predict item-level confidence (see Table 1). Two separate stepwise multiple regressions were conducted, one for the IF treatment and one for the NF treatment. For the IF treatment, two predictor variables entered the regression equation, which significantly predicted item-level confidence, F(2,15) = 11.734, p < .001. As would be expected, the actual lesson item difficulty (item P) entered the analysis at Step 1 (R² = .380). Next, Word Concreteness z-score entered at Step 2, adding a substantial amount of accounted variance (R² = .610). Item word count did not enter the regression model. For the NF (control) treatment, only one predictor variable entered the regression equation, Word Concreteness z-score (R² = .258), which significantly predicted item-level confidence, F(1,16) = 5.573, p = .031; contrary to expectations, the actual NF lesson item difficulty (item P) did not enter the regression (t = 1.215, p = .243). The Word Concreteness regression component was negative in both models, but for NF, text surface features were a better predictor of confidence response than was item difficulty.
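The stepwise procedure can be sketched outside SPSS. The forward-selection sketch below is a minimal illustration under stated assumptions: the predictor names and simulated data are hypothetical, and the entry criterion of p < .05 is an assumption mirroring a common stepwise default, not the study's documented settings.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS fit (X excludes the intercept); returns coefficients,
    two-sided p-values, and R^2."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    df = n - Xd.shape[1]
    s2 = resid @ resid / df
    cov = s2 * np.linalg.inv(Xd.T @ Xd)
    t = beta / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t), df)
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, p, r2

def forward_stepwise(X, y, names, alpha=0.05):
    """Forward selection: at each step, enter the candidate whose
    coefficient p-value in the augmented model is smallest and below
    alpha; stop when no candidate qualifies."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            _, p, _ = ols_pvalues(X[:, selected + [j]], y)
            pvals[j] = p[-1]  # p-value of the newly added term
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected]

# Hypothetical illustration: 18 items, three candidate predictors.
rng = np.random.default_rng(1)
item_p = rng.uniform(0.3, 1.0, 18)
word_conc = rng.normal(size=18)
word_count = rng.normal(size=18)
confidence = 0.4 + 0.5 * item_p + rng.normal(scale=0.05, size=18)
X = np.column_stack([item_p, word_conc, word_count])
entered = forward_stepwise(
    X, confidence, ["item_P", "word_concreteness_z", "word_count"])
```

In the study's actual analyses the candidate predictors were item P, the Word Concreteness z-score, and word count; this sketch only illustrates the entry logic.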

Discussion
This investigation examined item-level monitoring with immediate item-level feedback that allows for local idea-level reflection and recalibration, which should improve monitoring accuracy judgements. Contrary to expectations, immediate feedback did not improve monitoring accuracy; monitoring accuracy was low with or without immediate feedback, and monitoring accuracy with immediate feedback was significantly less accurate than with no immediate feedback (IF < NF, Cohen's d = .62, p = .012). A non-monotonic relationship between item performance and confidence was observed (hard-easy effect; Arnold et al., 2017; Juslin et al., 2000) that attenuates the findings here for relative monitoring accuracy (see Fig. 2).
The item-level descriptive data show that cognitive monitoring accuracy in this investigation depends on both internal cues (i.e., recent monitoring responses) and external cues (i.e., text features, feedback). It is reasonable and even likely that item-level monitoring is based on multiple and sometimes competing cues, and that the salience of each cue relates in some degree to content difficulty as well as to text features of the materials. The stability of individual response styles also plays a substantive role in confidence measures.

Descriptive findings
A substantial correlation between the NF and IF groups' item difficulties (P) was observed, r = .91, and also between the NF and IF item confidence self-report means, r = .85, indicating that items that were difficult for one group were difficult for the other, that items drawing low confidence in one group drew low confidence in the other, and vice versa (see Fig. 3). This may be a generally under-reported but obvious finding: group average item-level confidence measures, like group average item-level difficulty (P) measures, are similar across treatment groups. The correlation observed between the group-level confidence measures of the two groups seems an important probative finding, and so should be considered in future SRL investigations.
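The group-level agreement described above is a Pearson correlation between the two groups' per-item means. A minimal sketch with simulated, hypothetical data (not the study data) shows how shared item properties alone produce such agreement:

```python
# Sketch of group-level item agreement: Pearson r between two groups'
# per-item means (hypothetical simulated data).
import numpy as np

def item_level_agreement(group_a, group_b):
    """Rows = participants, columns = items; correlate the item means."""
    return float(np.corrcoef(group_a.mean(axis=0), group_b.mean(axis=0))[0, 1])

# Two groups answering the same 18 items that share true difficulties.
rng = np.random.default_rng(2)
true_p = rng.uniform(0.2, 0.9, 18)
nf_scores = rng.binomial(1, true_p, size=(34, 18))   # 0/1 correctness
if_scores = rng.binomial(1, true_p, size=(34, 18))
r = item_level_agreement(nf_scores, if_scores)        # high, as in the study
```

Because both groups sample the same underlying item difficulties, their per-item means correlate strongly even though no participant is shared between groups.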

Internal cues
Confidence judgements inversely influenced follow-on confidence judgements when performance feedback was not provided (see Table 2). Specifically, at the item level (micro level), confidence self-reports in the no feedback condition exhibited hysteresis in the form of autocorrelation, where a confidence response inversely influenced the next confidence response at Lag 1. Since most learning happens without immediate feedback, future SRL investigations should consider the influence of confidence responses on follow-on confidence responses (hysteresis).
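The Lag 1 pattern described here is simply the correlation of the confidence series with itself shifted by one item. A minimal sketch with made-up numbers:

```python
# Lag 1 autocorrelation of one participant's confidence self-reports
# (hypothetical values). A negative value is the hysteresis pattern above:
# a high report tends to be followed by a lower one, and vice versa.
import numpy as np

def lag1_autocorrelation(series):
    """Pearson r between the series and itself shifted by one response."""
    x = np.asarray(series, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

alternating = [90, 60, 85, 55, 80, 50, 88, 58]   # oscillating response style
r_lag1 = lag1_autocorrelation(alternating)        # strongly negative
```

An oscillating response style like the hypothetical series above yields a strongly negative Lag 1 value, whereas a steadily rising or falling series yields a positive one.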
The hard-easy effect observed for item difficulty and confidence is consistent with past investigations finding that individuals are often overconfident on difficult items and underconfident on easy items. This can be parsimoniously explained by the stability of individual response styles as a form of regression to the response confidence mean, which can also be thought of as an individual confidence setpoint. It would be prudent for future SRL investigations to report content difficulty estimates and also monotonicity between the monitoring and performance measures, at least as a scatter plot, as qualifying assumptions before reporting relative or absolute monitoring accuracy.
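A toy simulation makes the setpoint account concrete: if reported confidence is pulled toward a personal setpoint and only weakly tracks item difficulty, overconfidence on hard items and underconfidence on easy items falls out automatically. All parameters below are hypothetical, chosen only to illustrate the mechanism:

```python
# Toy illustration of the setpoint account of the hard-easy effect
# (hypothetical parameters, not fitted to the study data).
import numpy as np

item_p = np.linspace(0.20, 0.95, 16)         # true item difficulty (P), hard -> easy
setpoint = 0.70                               # individual confidence setpoint
w = 0.30                                      # weak sensitivity to actual difficulty

confidence = w * item_p + (1 - w) * setpoint  # reports regress toward the setpoint
overconfidence = confidence - item_p          # positive = overconfident

# Hard items (low P) draw overconfident reports; easy items underconfident ones.
```

No item-level miscalibration mechanism is needed here: the non-monotonic-looking pattern emerges purely from a stable response style compressed toward the setpoint.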

External cues
Lesson confidence was predicted by the Word Concreteness dimension for both the immediate and no feedback treatments. In both regression equations, the Word Concreteness Coh-Metrix dimension had a counter-intuitive inverse relationship with confidence. A concrete word refers to a perceptible entity (e.g., a dog). Concrete words are easier to remember than abstract words (e.g., love) and are likely to be more familiar and thus "easier". So it is unclear why increased item concreteness decreased confidence self-report values for the item. Perhaps this is a 'feeling of knowing' or 'tip of the tongue' cue that can confound accurate monitoring. Further research should consider the influence of Coh-Metrix text features on monitoring with both new (unfamiliar) and review lesson (familiar) content.

Summary
Taken together, the significant item-level confidence response autocorrelation for no feedback (hysteresis), the extraordinarily large reliability value for the item-level confidence measures (Cronbach's alpha of .96), and the fact that item-level confidence responses tended to be less extreme than the item P values across the lesson (see Fig. 3) support the likelihood of stable individual response styles for confidence judgements. This may be a form of stability bias (Kornell & Bjork, 2009); note that Bjork et al. (2013) have discussed the troubling implications of stability bias for monitoring in self-regulated learning (p. 429). Future investigations should consider the influence of stability of individual response styles on monitoring measures (Bjork et al., 2013;Javaras & Ripley, 2007). Perhaps stability bias can be statistically controlled in future SRL research.
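For reference, a reliability figure like the alpha cited above is computed from the participant-by-item confidence matrix. A minimal Cronbach's alpha sketch with simulated, hypothetical data shows how a stable personal response level alone drives alpha high:

```python
# Cronbach's alpha for a participant-by-item confidence matrix
# (hypothetical simulated data; rows = participants, columns = items).
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                                  # number of items
    item_var_sum = scores.var(axis=0, ddof=1).sum()      # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)           # variance of row totals
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# A stable response style: each person's reports cluster around a personal level.
rng = np.random.default_rng(3)
personal_level = rng.normal(70, 15, size=(40, 1))        # one setpoint per person
reports = personal_level + rng.normal(0, 5, size=(40, 18))
alpha = cronbach_alpha(reports)                           # very high
```

When between-person setpoint variance dominates within-person item-to-item variation, alpha approaches the very large value reported here, which is exactly what a stable response style predicts.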
Because micro-level monitoring depends on incomplete and even competing information, there is likely a monitoring trade-off between the bioenergetic commitment needed for constant diligence and monitoring accuracy (Rae et al., 2003), with the mental system favoring macro-level monitoring and surrendering monitoring control to an external system when immediate feedback is available (both require less effort). Carver and Scheier (2000) note that for self-regulation in general, "It is pointless and maybe even counterproductive to plan too far ahead too fully … Thus, it makes sense to plan in general terms, chart a few steps, get there, reassess, and plan the next bits." (p. 67).
On the other hand, micro-level hypervigilance may pay off in some extreme settings. There would thus be an advantage for human monitoring accuracy to be "just good enough", seeking a balance between micro-level accuracy, effort costs, and macro-level accuracy. The importance of explicit micro-level monitoring would likely be amplified in high stakes decisions.

Limitations of this study
Because content familiarity is likely strongly related to feeling of knowing, and both are suspected cues for explicit and implicit monitoring (Bjork et al., 2013;Kornell & Bjork, 2009), these results should apply only under conditions of familiar content (i.e., students in an end-of-course review of key vocabulary). In settings where content is unfamiliar or less familiar to the learners, preexisting low familiarity is likely to be an important monitoring cue. Because the Word Concreteness of recently studied material likely manifests differently than that of unstudied materials, especially for technical vocabulary, and because familiarity may falsely increase confidence as feeling of knowing, these findings for Word Concreteness do not automatically apply when learning new or unfamiliar material.
The results observed here must also be limited to recognition lesson tasks. Pressley et al. (1990), in contrast, report that short-answer questions can increase metacognitive accuracy relative to multiple-choice questions. Thus these findings should not be extended beyond recognition tasks.
A ramification of this study is that immediate feedback may deter some students from monitoring. Students may rely on the lesson feedback as explicit external monitoring rather than engaging in intuitive implicit monitoring (Azevedo & Hadwin, 2005;Corbett & Anderson, 2001). Future research should consider whether immediate feedback subverts self-regulation and whether this manifests as self versus system control of instructional strategy selection for some students.

Implications for design and theory
What seemed to be a simple data set of item performance and confidence self-report measures turns out to be interesting, complex, and difficult to interpret. We fully agree with Butler and Winne (1995), who note that: SRL is a process that unfolds step-by-step over time. It is also recursive; that is, internal monitoring of a current state in a task, the trigger for engaging SRL, generates feedback that, in turn, is input contributing to the learner's regulation of subsequent cognitive engagement. (p. 246)

New ways of measuring monitoring and strategy shifts on the fly, such as automated emotion recognition through facial expressions and bodily movements (Azevedo, 2014;López & Tucker, 2018), offer promise for measuring implicit, unconscious regulation that contrasts with self-report measures (e.g., Kahneman, 2013; System 1 as fast, instinctive, and emotional; System 2 as slow, deliberative, and logical). Such new measurement approaches could provide abundant real-time data to better understand SRL processes. New data sources, both explicit and implicit, and more robust analysis approaches that capture activity across multiple scales, as well as levels of these scales across time, may soon drive SRL theory and research. There is a need now to refine and integrate multiple measures and methodologies to inform new theory building in SRL (Cascallar et al., 2006).

Conflict of interest
The authors declare that they have no conflict of interest. This research was conducted without external funding.

Ethical approval
All participation was voluntary and anonymous, and the research was conducted under an approved institutional IRB.