Utah’s Early Intervention Reading
Software Program
K-3 Program Evaluation Findings
Submitted to the Utah State Board of Education
September 2018
Evaluation and Training Institute
100 Corporate Pointe, Suite 387
Culver City, CA 90230
www.eticonsulting.org
Table of Contents
Executive Summary ............................................................................................................................ 1
What are the goals of this report?
Evaluation Purpose & Evaluation Questions ..................................................................................... 4
What do we know about why the program was created and who participated?
Program Background and Enrollment ............................................................................................... 5
Evaluation Methods ........................................................................................................................... 8
What do we know about how students used the software programs this year?
Program Implementation ........................................................................................................................... 13
What impact did the program have on students’ literacy achievement this year?
Program-Wide Impacts: The big picture ..................................................................................... 15
Impact of Each Software Program .............................................................................................. 21
What trends do we see in program enrollment, use and impact across the years?
Multi-year Findings .......................................................................................................................... 25
Discussion, Limitations and Recommendations .............................................................................. 28
References ....................................................................................................................................... 31
Appendix A: Analyses Samples ........................................................................................................ 32
Appendix B. Data Processing & Merge Summary ........................................................................... 36
Appendix C: DIBELS Next Measures ................................................................................................ 39
Appendix D: Determining Effect Size Benchmark ........................................................................... 40
Appendix E. Program Use by Vendor and Grade ............................................................................ 41
Appendix F: Multi-Year Dosage Recommendations ....................................................................... 43
List of Figures
Figure 1. DIBELS Indicator & Literacy Skill Measures ............................................................... 10
Figure 2: Students who met vendors' minimum dosage recommendations ................................... 14
Figure 3: Students who met the dosage recommendations by grade ............................................. 14
Figure 4. DIBELS Literacy Domain Effect Sizes by Grade, Highest Dosage Sample ................. 18
Figure 5. % Change in Benchmark Status from BOY to EOY, Kindergarten .............................. 19
Figure 6. % Change in Benchmark Status from BOY to EOY, 1st Grade ..................................... 19
Figure 7. % Change in Benchmark Status from BOY to EOY, 2nd Grade ................................... 20
Figure 8. % Change in Benchmark Status from BOY to EOY, 3rd Grade ................................... 21
Figure 9. Impact of Individual Vendors on DIBELS Composite Scores, Effect Sizes by Grade . 24
Figure 10. Grade Levels with Significant Effect Sizes by Vendor and Program Year ................. 27
List of Tables
Table 1. 2017-2018 Program Enrollment Overview ....................................................................... 5
Table 2. 2017-2018 Program Enrollment by Vendor and Grade .................................................... 6
Table 3. Vendor 2017-2018 Minimum Dosage Recommendations ................................................ 7
Table 4. Predicted Means of DIBELS Composite Scores for Matched Treatment and Control,
Program-wide, Highest Dosage Sample ................................................................................ 16
Table 5. Predicted Means of EOY DIBELS Literacy Domains for Matched Treatment and
Control, Highest Dosage Sample ........................................................................................... 17
Table 6. Mean Score Differences on EOY DIBELS Composite Scores by Grade and Subgroup ...... 21
Table 7. Predicted Means of EOY DIBELS Composite for Matched Treatment and Control, by
Vendor, OLS Regression Model ........................................................................................... 23
Table 8. Est. Program Enrollment from 2013/2014 – 2017/2018 ................................................. 25
Table 9. Multi-year Trends in Program Use .................................................................................. 26
Table 10. Trends in program-wide impacts, effect sizes by dosage sample ................................. 26
Acknowledgements
The Evaluation and Training Institute (ETI) thanks Sarah Young (Coordinator, Digital Teaching
and Learning) from the Utah State Board of Education (USBE) for her ongoing collaboration
and direction throughout this evaluation project.
We also acknowledge Kristen Campbell and Malia Mcllvenna at USBE for helping us
understand the ins and outs of the data and appreciate all their efforts preparing and transferring
the student data used for the analyses.
Finally, the software vendor representatives played a key role in helping us understand their
software programs, sharing their data, and working patiently with us to prepare the data in a
consistent and streamlined format. In particular, we give special thanks to Robert Rubin from
Istation, Nari Carter from Imagine Learning, Haya Shamir from Waterford, Sean Mulder from
SuccessMaker, Sarah Franzén from Lexia, Dave McMullen from MyOn and Sarah Hoerr and
Kevin Sheridan from ReadingPlus. Each of these individuals provided necessary data from
their products that were used to complete the evaluation project.
Acronyms
BOY Beginning-of-Year
C Control group/non-program group
EISP Early Intervention Software Program
EOY End-of-Year
ES Effect Size
ETI Evaluation and Training Institute
IS Insufficient Sample
LEA Local Education Agency
NS Non-significant statistical coefficient
OLS Ordinary Least Squares analyses
Tr. Treatment group/program group
USBE Utah State Board of Education
Executive Summary
Evaluation Purpose
The Early Intervention Software Program (EISP) provides Utah’s Local Education Agencies
(LEAs) with the opportunity to select from among seven adaptive computer-based literacy
software programs. The program’s goal is to increase the literacy skills of all students in K-1 and
struggling readers in Grades 2-3. As the EISP external evaluator, the Evaluation and Training
Institute (ETI) studied three aspects of the EISP: 1) students' use of the program during the school
year (“program implementation”); 2) the effects the program had on increasing students’ literacy
achievement (“program impacts”), including program effects across all seven software programs
(program-wide) and between each software vendor (vendor-specific); and 3) trends in program
implementation and impacts across multiple years of program implementation.
Program Implementation Findings
Program vendors provided recommendations on program dosage for students to achieve the
benefits in literacy skill development from their participation in the software programs. The
implementation study was designed to determine the extent to which students met each vendor's
recommendations for average weekly use and total weeks of use. A majority of students (72-
83%) using five of the seven software programs met the requirements for total weeks of use,
which ranged from 15-28 weeks, an indication of students' consistent use of the software.
Although a majority of students across programs used the software for the recommended total
weeks, fewer students met their respective vendors' recommended minutes per week. For only
three of the seven vendors did more than half of the students meet the recommendations for
average weekly minutes of use.
Program-wide Impacts Findings
The program had a positive impact on students’ literacy skill development in kindergarten and
first grade, regardless of their program dosage, and in 3rd grade for students with the highest
program dosage. There were no statistically significant positive effects for students in second
grade. In general, the effectiveness of the program increased in strength as dosage increased from
the lowest to highest dosage. The program was most effective for students in kindergarten who
had the highest program dosage (ES=.16), which is also higher than the average effect size seen
in similar intervention programs. In addition, K-1 students with the highest program dosage
ended the year above benchmark (mean composite scores of 157 and 209 respectively) for their
grade. Students who scored above benchmark had a 90-99% likelihood of achieving subsequent
early literacy goals (Dynamic Measurement Group, 2016).
Vendor Impacts Findings
In addition to examining program-wide impacts, we studied the impacts of individual program
vendors on students' literacy achievement. In kindergarten, students who met a minimum
program dosage threshold had higher literacy achievement, as measured by their mean literacy
composite scores, compared to a group of non-program students with similar characteristics. To
measure the strength of these effects, we looked at the average effect sizes produced by similar
education intervention programs. In kindergarten, the effects were stronger than those found in
similar intervention programs for all five software vendors included in this analysis1. In addition,
two vendors had positive impacts on students in first grade and one vendor had a positive impact
in second grade; however, these effects were smaller than the average effect size benchmark.
Multi-Year Evaluation Findings
We studied the trends in program enrollment, students’ program use, and the program impacts on
student achievement over the past few years of program implementation. Program enrollment
and program use increased substantially each year, indicating that LEAs are making incremental
improvements in students' usage as the program continues. The trends in program impacts were
more complex and varied each year depending on the vendor and students' grade level. We
consistently saw strong impacts for students in kindergarten for multiple vendors, but not in
Grades 1-3. In addition, when comparing the strength of the program impacts across years using
effect sizes, we found that the strength of the effect sizes was decreasing each year. However,
we caution readers against concluding that the program is less effective now than it
was in the beginning of its implementation. For example, we know that schools in Utah are
increasing their use of computer-based intervention programs and it’s possible that more of our
control students are using programs similar to those being measured. In addition, through our
2016-2017 qualitative study of program implementation, we now understand that students need
to be monitored by teachers to ensure that they are progressing through the curriculum
appropriately and that time in the program may not tell the complete story.
Discussion & Recommendations
The 2017-2018 program had a positive effect in kindergarten and first grade (looking at the
program as a whole), and had mixed effects on students in first through second grade depending
on the specific vendor. When reviewing our current evaluation results with those from previous
years, it is easy to recommend that the program be continued for kindergarten students. It is more
difficult to endorse the program’s use with students in first through third grade due to mixed
results from year-to-year and the complexities involved with making vendor comparisons (e.g.
differences in vendor sample sizes, etc.). With select vendors, however, there were indicators
that students in these upper-early grades benefited from the program, so we are recommending
that more data be collected and results reviewed for future cohorts. Future research is needed to
increase our understanding of the conditions which lead to improvements in literacy achievement
for specific vendors and students, and we recommend combining students across multiple
1 ReadingPlus was used only in the upper-early grade levels, and MyOn did not have enough kindergarten students
to be included in this analysis.
program years as one approach for increasing the sample sizes of specific vendors. Combining
cohorts of students would allow us to measure the program impacts for all vendors and grades
among students who met the same program dosage criteria. We also propose studying additional
implementation details and their link to program outcomes in order to make targeted
recommendations to improve the efficacy and impacts of the program. For example, studying the
connection between students’ progression through the program content and time spent on the
software would help us determine if students are learning during their time in the software. This
information could also be used to study the relationship between the amount of program content
covered and students’ literacy achievement.
Evaluation Purpose & Evaluation Questions
The Utah state legislature established the Early Intervention Software Program (EISP) to aid in
the development of Utah students’ literacy skills through computer-based, adaptive reading
software programs designed to meet students’ individualized learning needs. The programs were
supplied by multiple vendors and were implemented in schools, grades K-3. The Evaluation and
Training Institute (ETI) conducted its annual evaluation of the EISP, which focused on how the
reading software programs were used and the impact they had on students’ literacy achievement.
The evaluation included results for the combined impact of all the software programs taken
together (“program-wide” impacts) and a comparison of the relative effects on literacy
achievement for each of the software providers (“individual vendor impacts”).
This report includes findings from the 2017-2018 academic year, the EISP’s fifth year of
implementation, as well as an overview of cumulative program findings from previous program
years. These findings are intended to help the Utah State Board of Education (USBE) and Local
Education Agencies (LEAs) understand how the program is working, to identify potential areas
for program improvement, and to make evidence-based decisions about future iterations of the
program.
The following research questions were used to guide our evaluation and organize the findings in
this report:
1. Did students use the software as intended?
2. Did the program have an overall effect across all vendors?
3. Did the program effects differ based on student or school characteristics?
4. Were there differences in treatment effects among vendors?
5. What are the trends in implementation and literacy achievement across the years?
The EISP annual reports are disseminated to a wide audience of stakeholders, including
educators, researchers, policy staff and non-technical reviewers, and we structured this report to
be accessible to all of these groups.
In this report we include a description of the EISP and 2017-2018 student enrollment, a summary
of our research methods, findings related to each research question and the two study objectives
(program implementation and program impacts), and trends in findings across the program years.
Finally, we discuss the key findings and the study limitations.
Program Background & Enrollment
Utah passed legislation in 2012 (HB513) to supplement students’ classroom learning with
additional reading support in the form of computer-based adaptive reading programs. The intent
of the legislation was to increase the number of students reading at grade level each year, and to
ensure that students were on target in literacy achievement prior to the end of the third grade.
The legislation provided funding to use for the programs with students in kindergarten and in
first grade, and as an intervention for students reading below grade level in second and third
grade. To participate in the EISP, LEAs (districts and charter schools) submitted applications to
the USBE requesting funding for the use of specific reading software programs prior to the start
of each school year.
Seven software vendors provided software and training to schools through the EISP in 2017-
2018. The seven vendors were (in alphabetical order): Imagine Learning, Istation, Lexia® Core5®
Reading (Core5), MyOn, Pearson ("SuccessMaker"), Reading Plus and Waterford.
These software programs were used in 79 LEAs and 403 schools by approximately 100,951
students. Core5 was the most frequently used program (188 schools, 50,000+ students), while
Istation was used the least (7 schools; 1,238 students).
Tables 1-2 present the 2017-2018 enrollment of LEAs and students who used each vendor.
While the EISP was intended for second and third grade students reading below grade level
(referred to as “intervention” throughout the report), some educators implemented the program
with their entire class, and in these instances, students reading at grade level (“non-intervention”)
also had access to the software programs. Our report focused on intervention students in Grades
2-3; however, we have provided enrollment information for both types of students so readers
may understand how the program was implemented both in practice and as intended.
Table 1. 2017-2018 Program Enrollment Overview

Program             LEAs   Schools   Students, All K-3   Students, All K-1 & 2-3 Intervention
Istation               5         7               1,238                                    926
Waterford             23        52               6,398                                  5,712
Imagine Learning      45       168              33,035                                 23,997
SuccessMaker           8        19               2,015                                  1,220
Core5                 39       188              52,807                                 32,136
Reading Plus           2        14               1,246                                    174
MyOn                   8        33               4,211                                  1,512

Note. Counts of LEAs/schools are not unique due to instances where multiple programs were used within a
LEA/school. Data source: pre-merged data in K-1 and vendor data merged to DIBELS in Grades 2-3.
The percent of participants per grade varied by program, and three vendors had a greater
percentage of students who used the program in the third grade than the other grades (Table 2).
Table 2. 2017-2018 Program Enrollment by Vendor and Grade

                                        2nd Grade                3rd Grade
Program             Kinder     1st      All     Intervention     All      Intervention
Istation               350     356      334              125     198               95
Waterford            2,731   2,588      868              283     211              110
Imagine Learning     8,357  11,013    7,880            2,446   5,785            2,181
SuccessMaker           192     586      581              185     656              257
Core5               11,337  13,441   14,341            3,518  13,688            3,840
ReadingPlus            N/A     N/A      218                7   1,028              167
MyOn                   123     582    1,655              367   1,851              440
Total               23,090  28,566   25,877            6,931  23,417            7,090

Note. Grades 2-3 intervention students included those with scores below benchmark for their grade at the
beginning of the year.
Usage Recommendations
Each vendor provided recommendations for using the software program in order for it to have an
impact on students' literacy achievement (Table 3). Recommended weekly use ranged from 20 to
80 minutes per week, and suggested weeks of use ranged from 15 to 28 weeks.
For LEAs to continue to receive program funding, the state requires that at least 80 percent of the
students within a school meet 80% of vendors' average use or weeks of use recommendations
within two years of implementation2 (a sketch of this check follows Table 3).
2 ETI submitted a separate report to the USBE on school level fidelity.
Table 3. Vendor 2017-2018 Minimum Dosage Recommendations

Program            Kindergarten       First Grade        Second Grade       Third Grade        Suggested
                   (ALL students)     (ALL students)     (Intervention)     (Intervention)     Minimum Weeks
Istation           60 min/week        60 min/week        60 min/week        60 min/week        28 weeks
Waterford          60 min/week        80 min/week        80 min/week        45-60 min/week     28 weeks
Imagine Learning   40 min/week        45 min/week        45 min/week        45 min/week        18 weeks
SuccessMaker       45 min/week        45 min/week        45 min/week        45 min/week        15 weeks
Core5              20-60 min/week*    20-60 min/week*    20-60 min/week*    20-60 min/week*    20 weeks
Reading Plus       45 min/week        45 min/week        45 min/week        45 min/week        15 weeks
MyOn               45-60 min/week     45-60 min/week     45-60 min/week     45-60 min/week     20 weeks

Note. *Core5 based its usage recommendations on student performance, and students who were working below grade
level were assigned usage recommendations that were greater than those for students who worked at or above grade
level.
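To make the funding requirement described above concrete, the sketch below shows one way a school-level check could be computed in Python. The column names and the interpretation of "meets 80% of the recommendation" are assumptions for illustration, not USBE's actual fidelity calculation (which ETI documented in a separate report).

```python
import pandas as pd

def school_meets_fidelity(students: pd.DataFrame, rec_minutes: float, rec_weeks: int) -> bool:
    """True if at least 80% of a school's students met 80% of the vendor's
    average weekly use or weeks of use recommendation (assumed columns)."""
    met = ((students["avg_weekly_minutes"] >= 0.8 * rec_minutes) |
           (students["weeks_of_use"] >= 0.8 * rec_weeks))
    return met.mean() >= 0.80   # share of students meeting the relaxed recommendation

# Example using Imagine Learning's Table 3 values (40 min/week, 18 weeks):
school = pd.DataFrame({"avg_weekly_minutes": [35, 50, 20], "weeks_of_use": [16, 20, 10]})
print(school_meets_fidelity(school, rec_minutes=40, rec_weeks=18))
```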
Evaluation Methods
We provide an overview of the research methods, samples, and data sources used to answer each
research question. Methods are described for the two studies that informed the program
evaluation: the impact study of students' achievement outcomes and the implementation study of
students' program use. Appendices A-C provide additional details on our
methods, data processing procedures and samples.
Which program participants were included in our study?
Implementation Study Samples
The goal of the implementation study was to examine the extent to which students used the
software as intended by each program vendor. We included as many students who used the
programs as possible to provide the most accurate depiction of students’ program use, and the
samples used for the implementation analyses were the most inclusive of all the samples. For K-
1 students, we used the vendor data, and did not remove students with inaccurate SSIDs, students
who used multiple software providers, or students with incomplete DIBELS data. In Grades 2-3,
our focus was on struggling readers, and we needed valid SSIDs in the vendor and DIBELS data
as well as beginning-of-year DIBELS scores to identify the students reading below grade level.
Impact Study Samples
For the impact analyses, we selected a group of student participants (students who used the
software) within the larger pool of program students to create an “analytic sample,” which is the
group of students with whom we ran our statistical analyses (see Appendix A for descriptive
statistics of the students included in our samples). Our analytic samples changed based on the
specific analyses goals, or out of necessity in response to barriers found with the data, such as
small enrollment numbers for specific vendors. In second and third grade, the program was
designed to target intervention students only (students performing below grade benchmark
literacy levels), and we constrained our samples to include participants who were below grade
level literacy benchmarks at the beginning of the year across all analyses. Students needed to
have accurate state student IDs (SSIDs) and complete DIBELS data (outcome data) to be a viable
case for our sample. We excluded students who may have used multiple software programs in
order to study the individual impacts of each software vendor.
Control Student Matching Process. Our impact study relied on comparing program students’
achievement outcomes to non-program students’ outcomes (known as “control students”), so that
we could analyze what impact the program had on learning achievement. Control students were
drawn from schools across the state of Utah who did not participate in the EISP. Program
students were matched to control students using Coarsened Exact Matching (CEM; Iacus et al.,
2008). The students were matched on data from the beginning of the school year, and across
several important characteristics (covariates used included: grade, beginning-of-year
achievement scores, gender, race, English Language Learner status, and poverty status). If no
matches could be made, children were removed from the sample. CEM minimized differences
between the two groups prior to enrollment in the program, creating treatment and control groups
that were balanced across covariates.
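As an illustration of the matching step, the following Python sketch shows the basic CEM logic: the continuous baseline score is coarsened into bins, students are grouped into strata defined by the coarsened and categorical covariates, and only strata containing both program and control students are kept. The column names are hypothetical and the full CEM procedure also weights students within strata; this is a sketch of the idea rather than ETI's implementation.

```python
import pandas as pd

def cem_match(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only students in strata that contain both program and control students."""
    df = df.copy()
    # Coarsen the continuous covariate (BOY composite score) into bins.
    df["boy_bin"] = pd.cut(df["boy_composite"], bins=10, labels=False)
    strata_cols = ["grade", "boy_bin", "gender", "race", "ell", "low_income"]
    # A stratum is matched if it has at least one treated (1) and one control (0) student.
    counts = df.groupby(strata_cols)["treatment"].agg(["min", "max"])
    matched = counts[(counts["min"] == 0) & (counts["max"] == 1)].index
    return df.set_index(strata_cols).loc[matched].reset_index()
```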
Program-Wide Samples. Each program vendor provided schools with a recommendation for
how much time students should use the program before benefits are observed. This minimum use
recommendation was an important predictor of literacy achievement, and we wanted to
determine how students' dosage characteristics affected their outcomes. We operationally defined
the combination of weekly use and weeks of use as “program dosage”. We created three matched
samples of students with three levels of program dosage (Low, Medium, High) to study the
effects of increased program use on students’ test scores across vendors:
• The Highest Dosage sample was comprised of students who met the vendors'
recommended use (in minutes) for at least 80% of the weeks the software was used. In
addition, students must have used the software for at least the minimum number of weeks
suggested by each program vendor (see the sketch after this list). In past reports this
sample was referred to as the optimal (OPTI) sample.
• The Medium Dosage sample was comprised of students who used the program
greater than or equal to 80% of vendors' recommended use3. Students in this sample had
the second highest program dosage. In past reports this sample was referred to as the
relaxed optimal (ROPT) sample.
• The Lowest Dosage sample includes all students who used the program for any amount
of time, and shows how effective the program was irrespective of use. In past reports this
sample was referred to as the intent to treat (ITT) sample.
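As a concrete illustration of these definitions, the sketch below assigns a student to a dosage sample from a weekly usage log. The field names, and the reading of the Medium Dosage rule as average weekly use of at least 80% of the recommendation, are assumptions rather than the evaluation's actual code.

```python
from dataclasses import dataclass

@dataclass
class StudentUsage:
    weekly_minutes: list[int]   # minutes of use for each week the program was used
    rec_minutes_per_week: int   # vendor's recommended weekly minutes
    rec_min_weeks: int          # vendor's suggested minimum weeks of use

def dosage_sample(u: StudentUsage) -> str:
    weeks_used = len(u.weekly_minutes)
    threshold = 0.8 * u.rec_minutes_per_week            # "met" = 80% of recommended minutes
    weeks_met = sum(m >= threshold for m in u.weekly_minutes)
    avg_weekly = sum(u.weekly_minutes) / weeks_used if weeks_used else 0
    if weeks_used >= u.rec_min_weeks and weeks_met >= 0.8 * weeks_used:
        return "Highest Dosage"    # met the recommendation in >= 80% of weeks used,
                                   # and used the program for the suggested minimum weeks
    if avg_weekly >= threshold:
        return "Medium Dosage"     # average weekly use >= 80% of the recommendation
    return "Lowest Dosage"         # any amount of use

# Example: 20 weeks averaging about 50 minutes against a 60 min/week, 20-week recommendation.
print(dosage_sample(StudentUsage([50] * 20, rec_minutes_per_week=60, rec_min_weeks=20)))
```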
Individual Vendor Samples. For the individual vendor analyses, our goal was to create a sample
of students who used the software long enough for improvements in literacy skill development to
occur. If we created our sample from students who met the program vendors' exact dosage
recommendations for average minutes of use and minimum weeks of use, we would not have
enough students to study each software program. Instead, we studied a subset of students who
met a relaxed version of vendors’ recommendations (students who used the software greater than
or equal to 80% of vendors' recommended use; "Medium Dosage"). Although we lowered our
minimum dosage threshold, there were certain instances, for certain vendors and grades, in
which the sample size was still too small for us to detect small program effects4. For these
instances, we used the Lowest Dosage sample (all students, regardless of use) and reported any
findings which were statistically significant. Similar to our program-wide approach, we created
seven matched samples, one for each program vendor, which allowed us to have a tightly matched
control group for each program vendor.
3 "Met the vendors' recommended use (in minutes)" is defined as 80% of the recommended weekly minutes. For
example, if a vendor recommended 60 minutes, the student must have used the program for at least 48 minutes per week.
4 We identified all instances in which we had an insufficient sample size using the nomenclature IS (insufficient sample).
What sources of data were used in our analyses?
We collected data from nine different sources to create our master dataset for the EISP analyses.
The data sources included: seven program vendors, who provided us with usage information for
each student who used their programs; state Dynamic Indicators of Basic Early Literacy skills
(DIBELS Next) testing data; and student information system (SIS) demographic data provided
by the Utah State Board of Education (USBE). See Appendix B for details on how we created
our master dataset.
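For illustration, the sketch below shows how such a master dataset could be assembled in Python (pandas). The file names and the use of the SSID as the merge key for every source are assumptions; the actual processing steps are documented in Appendix B.

```python
import pandas as pd

# Vendor usage records (one file per vendor, hypothetical names), stacked into one table.
usage = pd.concat(
    [pd.read_csv(f) for f in ["istation.csv", "waterford.csv", "imagine.csv",
                              "successmaker.csv", "core5.csv", "readingplus.csv", "myon.csv"]],
    ignore_index=True,
)
dibels = pd.read_csv("dibels_next.csv")          # BOY/EOY DIBELS Next scores
sis = pd.read_csv("sis_demographics.csv")        # USBE student information system data

master = (
    usage.merge(dibels, on="ssid", how="inner")  # require valid SSIDs in both sources
         .merge(sis, on="ssid", how="left")      # attach demographics where available
)
```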
Which instruments did we use to
measure literacy achievement?
We measured literacy achievement using
the DIBELS Next, which was administered
in schools throughout the state in Grades
K-3. The DIBELS Next measures were
used throughout Utah, and are strong
predictors of future reading achievement.
DIBELS Next is comprised of six
measures that function as indicators of
critical skills students must master to
become proficient readers, including: First
Sound Fluency (FSF), Letter Naming
Fluency (LNF), Phoneme Segmentation
Fluency (PSF), Nonsense Word Fluency
(NWF), DIBELS Oral Reading Fluency
(DORF), and reading comprehension
(DAZE). In addition to scores for the six
subscale measures described above, we
used reading composite scores and
benchmark levels, or criterion-referenced
target scores that represent adequate
reading progress. See Appendix C for
additional detail on the DIBELS Next
measures.
How did we study program
implementation?
Our program implementation findings focused on program usage in relation to its intended
use, as described through vendors' dosage recommendations. Program usage data included the
following: total minutes of software use, from log-in to log-off, for each week the program was
used during the school year; total weeks of use; and average weekly use. Program vendors supplied the
usage data.
Figure 1. DIBELS Indicator & Literacy Skill Measures

Reading Comprehension: Oral Reading Fluency (ORF, 1st-3rd); Daze (3rd)
Fluency: Oral Reading Fluency (ORF, 1st-3rd)
Phonics: Nonsense Word Fluency (NWF, K-2nd); Oral Reading Fluency (ORF, 1st-3rd)
Informs Competencies: Letter Naming Fluency (LNF, K-1st)
Phonemic Awareness: First Sound Fluency (FSF, K); Phoneme Segmentation Fluency (PSF, K-1st)
How did we study the program-wide impacts across all vendors?
Our study relied on three types of statistical analyses, all based on comparing the program
samples to matched groups of their peers, which included: hierarchical linear models,
independent t-test mean score comparisons, and benchmark outcome visual analyses.
Hierarchical linear regression model. We studied the program-wide impacts by comparing a
sample of treatment group students drawn from all vendors to a matched sample of control
students. We determined that using a two-level regression model (also known as a “hierarchical
linear regression model”, or HLM) allowed us to study the differences in treatment and control
group student outcomes, while controlling for other student-level predictors, and also allowed us
to control for Title 1 status school effects. A two-level random intercept statistical model with
school as the level-2 grouping variable was used to regress student outcomes on our predictor variables.
Our independent variable was treatment group status (1/0), and we included other predictor
variables in the model, including beginning-of-year (BOY) test scores, gender, special education
status, economically disadvantaged status, and ethnicity, to adjust for their influence on
end-of-year reading scores. By accounting for these
additional predictor variables, we increased our ability to show a causal link between program
use and outcomes, while holding other factors unrelated to the program constant.
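A minimal sketch of such a two-level random-intercept model, using the statsmodels formula interface, is shown below. The file and variable names are hypothetical stand-ins for the covariates listed above, and the specification is illustrative rather than the exact model used in the evaluation.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical matched analytic sample: one row per student, with an end-of-year
# composite score, a 0/1 treatment flag, covariates, and a school identifier.
analytic_sample = pd.read_csv("matched_highest_dosage.csv")

# Two-level random-intercept model: students (level 1) nested in schools (level 2).
model = smf.mixedlm(
    "eoy_composite ~ treatment + boy_composite + gender + sped + low_income + ethnicity",
    data=analytic_sample,
    groups=analytic_sample["school_id"],
)
result = model.fit()
print(result.summary())   # the 'treatment' coefficient is the adjusted program effect
```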
In addition, we used regression analyses to study how program participation impacted students
with specific characteristics, such as English Language Learners, special education students,
economically disadvantaged students, non-white students, and students from Title 1 schools. We
included students who met our criteria for the highest program dosage in this analysis sample.
Benchmark Outcome Visual Analyses. To present our findings in an intuitive and applicable
context, we measured the change in treatment and control students' reading proficiency at the
beginning and end of the school year. Changes in students’ reading proficiency benchmark levels
were reported for the highest dosage matched sample. Although we used a sample in which
students were similar on average, descriptive statistics did not allow us to control for pre-existing
differences between groups, so these results need to be interpreted with caution.
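A simple way to produce these descriptive comparisons is sketched below: restrict each group to students who began the year below benchmark and tabulate their end-of-year benchmark categories. The column names and category labels are assumed for illustration.

```python
import pandas as pd

# Hypothetical columns: group ("treatment"/"control"), boy_benchmark, eoy_benchmark.
matched_sample = pd.read_csv("matched_highest_dosage.csv")

def eoy_status_pct(df: pd.DataFrame, group: str) -> pd.Series:
    """Percent of a group's at-risk students (Below/Well Below at BOY) in each EOY category."""
    at_risk = df[(df["group"] == group) &
                 (df["boy_benchmark"].isin(["Below Benchmark", "Well Below Benchmark"]))]
    return (at_risk["eoy_benchmark"].value_counts(normalize=True) * 100).round(0)

print(eoy_status_pct(matched_sample, "treatment"))
print(eoy_status_pct(matched_sample, "control"))
```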
How did we study individual vendor impacts?
We used an Ordinary Least Squares (OLS) regression model to predict the differences in mean
scores between treatment and control students while controlling for demographic characteristics
and baseline scores. We controlled for students’ beginning-of-year (BOY) reading scores,
gender, special education status, economic disadvantaged status, ethnicity, English Language
Learner status, and Title 1 school status in the models. Some covariates were dropped in certain
models due to collinearity.
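The sketch below illustrates this kind of vendor-specific OLS model with statsmodels; the subset shown (Core5, kindergarten) and all file and variable names are hypothetical examples, not the evaluation's actual code.

```python
import pandas as pd
import statsmodels.formula.api as smf

matched_sample = pd.read_csv("matched_medium_dosage.csv")   # hypothetical file and columns

# One model is fit per vendor and grade; a single illustrative subset is shown here.
subset = matched_sample.query("vendor == 'Core5' and grade == 'K'")
ols_fit = smf.ols(
    "eoy_composite ~ treatment + boy_composite + gender + sped + low_income"
    " + ethnicity + ell + title1",
    data=subset,
).fit()
print(ols_fit.params["treatment"])   # adjusted treatment-control difference in predicted means
```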
How did we study the multi-year trends in program implementation and
program impacts?
There were several changes made to the evaluation design and methods throughout the duration
of the evaluation, so we focused on the past three implementation cycles (2015-2016, 2016-
2017, and 2017-2018) to report findings for which our analysis methods were consistent across
years. We reported the effect sizes for three levels of program dosage to study the entire program
over time, and visually identified the grade levels in which vendors had an impact on students'
literacy achievement. To study the trends in program implementation, we reported students'
average weekly use, total minutes of use, and total weeks of use. Program usage descriptives
reported prior to 2015-2016 were estimated from students’ total minutes and the program start-
and-end dates, while usage reported after 2015-2016 was calculated from actual weeks of use5.
What statistics do we provide in our results?
Where appropriate, we provided predicted mean scores and mean score differences for our
treatment and control groups, which are meaningful when comparing treatment and control
groups from the same sample. Statistical significance testing allowed us to determine the
likelihood that a finding was a result of chance, or due to the treatment effect. We also provided
treatment effect sizes (ES; based on Cohen's d6) to help readers understand the
magnitude of treatment effects. Presenting effect sizes enabled us to provide a standardized scale
to compare results based on different samples, and measure the relative strengths of program
impacts. Descriptive statistics, such as percentages, were presented to describe students’ program
use and change in reading proficiency benchmark status.
When interpreting our findings, it is important to note that effect sizes can be used to measure the
strength of program impacts in multiple ways. A commonly used method is Cohen’s (1988)
characterization of effect sizes as small (.2), medium (.5) and large (.8). However, recent studies
have suggested using a more targeted approach for determining the magnitude of the program
impacts. For example, Lipsey et al. (2012) suggested that effect size comparisons should be based on
"comparable outcome measures from comparable interventions targeted on comparable
samples," and noted that effect sizes in educational program research are rarely above .3 and that
an effect size of .25 may be considered large (p. 4). For the purposes of this study, we have
chosen to contextualize our findings using similar instructional programs as our benchmark. The
mean effect size for similar instructional programs is .13, and we consider this the standard by
which to compare our results. Effect sizes larger than this are stronger than average, which we
note in our results.7 More information on how we selected our ES benchmark is provided in
Appendix D.
5 Beginning in 2015-2016, we received weekly program use data from vendors and calculated more accurate
descriptives.
6 Effect sizes are calculated by dividing the difference in the two groups' means by their pooled standard deviation.
7 This interpretation is based on a review of 829 effect sizes from 124 education research studies conducted by
researchers at the Institute of Education Sciences (IES) (Lipsey et al., 2012).
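For readers who want the calculation spelled out, the sketch below implements the effect-size formula described in footnote 6 (Cohen's d as the mean difference divided by the pooled standard deviation); the example values are invented for illustration.

```python
import numpy as np

def cohens_d(treatment_scores, control_scores) -> float:
    """Difference in group means divided by the pooled standard deviation."""
    t = np.asarray(treatment_scores, dtype=float)
    c = np.asarray(control_scores, dtype=float)
    pooled_sd = np.sqrt(
        ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1))
        / (len(t) + len(c) - 2)
    )
    return (t.mean() - c.mean()) / pooled_sd

# Example: a 13-point mean difference against a pooled SD of 100 gives d = 0.13,
# the benchmark effect size used in this report.
print(cohens_d([113, 213, 313], [100, 200, 300]))
```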
Program Implementation Findings
It is important for evaluators to study program implementation prior to measuring the program
impacts on student learning, because conclusions about program impacts are more meaningful
when we understand how the program was implemented. For the
EISP, the most important aspect of program implementation is dosage, which is how much of the
program a student received during the school year, as students must use the program for a long
enough period of time for it to have an impact on their literacy skill development. We explored
the differences in usage across software programs and grade levels in order to better understand
the nuances of program implementation based on these factors. We used the recommendations
provided by each program vendor on average weekly use and total weeks of use to determine if
students were using the program as it was intended. A more detailed summary of student use is
included in Appendix E.
Did students use the software as intended?
As shown in Figure 2, a majority of students used their
respective software programs for the minimum weeks8
recommended: 72-83% of students for five of the seven
vendors. This finding indicates that LEAs are facilitating
students' use of the software on a weekly basis and for the
minimum number of weeks that vendors recommended.
While LEAs made sure that their students used the software
regularly, it was more difficult for them to meet vendors'
weekly minutes of use targets.9 For only three of the seven
vendors (Core5, Imagine Learning, and SuccessMaker) did at
least half of the students use the software for the
recommended minutes per week, on average. ReadingPlus and
MyOn had the lowest percentages of students who met the
average minutes recommendations.
The percentage of students who met vendors’
recommendations for both average minutes and total weeks is presented in the last column of
Figure 2. These students used the programs as intended on both aspects of dosage: weekly
minutes and total weeks. Over half of the students who used Core5 met both recommendations,
and almost half of Imagine Learning and SuccessMaker students reached this goal.
8 Vendor recommendations for total weeks of use ranged from 15-28 weeks.
9 Vendor recommendations for average minutes per week ranged from 45-80 minutes. Core5 had lower
recommendations for non-intervention students: 20 minutes per week.
Key Finding: The percentage of
students who met vendors’
weeks of use and average use
recommendations increased from
last year within each grade:
• Students who met the
average minutes recs
increased by 10% in
kindergarten; 17% in 1st
grade; 14% in 2nd grade; and
10% in 3rd grade.
• Students who met the weeks
of use recs increased by 17%
in kindergarten; 12% in 1st
grade; 11% in 2nd grade; and
10% in 3rd grade.
Figure 2: Students who met vendors' minimum dosage recommendations

Program             Met Ave Weekly Use   Met Weeks of Use   Met Ave Min & Wks Recs
Core5                             64%                 83%                      58%
Imagine Learning                  54%                 81%                      48%
SuccessMaker                      52%                 78%                      46%
Istation                          41%                 72%                      31%
Waterford                         34%                 76%                      30%
ReadingPlus                       20%                 47%                      16%
MyOn                              16%                 31%                      12%

N: Istation (926); Waterford (5,712); IL (23,997); SM (1,220); Core5 (32,136);
RP (174); MyOn (1,512)

Figure 3 provides an overview of program use within each grade. Forty-five to 61 percent of
students met the average minutes recommendations across grades, while 72 to 85% met the
minimum weeks requirements. Third grade had the lowest percentage of students (45%) meeting
the average weekly use recommendation; however, kindergarten had the lowest percentage of
students (47%) meeting both the minutes and weeks recommendations.

Figure 3: Students who met the dosage recommendations by grade

Grade     Met Ave Weekly Use   Met Weeks of Use   Met Ave Min & Wks Recs
K                       53%                 77%                      47%
1st                     61%                 85%                      56%
2nd                     54%                 80%                      56%
3rd                     45%                 72%                      50%
Total                   56%                 80%                      40%

N: K (23,090); 1st (28,566); 2nd (6,931); 3rd (7,090)
Impacts on Literacy Achievement
We studied the effectiveness of the program on literacy achievement by comparing groups of
students who used the program to groups of students who did not. We present our findings in two
sections: 1) Program-wide impacts, and 2) Individual vendor impacts. The first section includes
findings on the impact of the EISP across all seven software programs, providing a global view
of how the program performed as it was used across the state, while in the second section, we
explore the relative impacts of each program vendor.
Program-Wide Impacts
We begin the program-wide analyses studying the program impacts for three samples
representing different levels of program use (from lowest to highest use). This analysis helps
illustrate the relationship between program effects and program use (or dosage) and depicts
program effects for literacy composite scores for each grade. Following this analysis, we
examine the program effects on individual literacy subscales for the highest usage group, then
determine how the program affects changes in students’ benchmark status, an indication of
students reading risk. We completed our analyses with an examination of program effects for
specific groups of students.
Did the program have an overall effect across all vendors?
Dosage (or amount of software use) is the most important determinant of program-wide
treatment effects. As seen in Table 4, the statistically significant program-wide effects on
DIBELS Next end-of-year (EOY) composite scores increase with dosage, and the more a student
used the program the better his/her EOY outcomes.
• In kindergarten, the treatment effect tripled when moving from the lowest dosage to
the highest dosage sample.
• In first grade, students in the highest dosage sample had more than four times the
effect size of the lowest dosage sample.
• In second grade, there were no statistically significant treatment effects.
• In third grade, only the highest dosage sample produced a statistically significant effect.
Who is included in each dosage sample?
• Highest Dosage: students who met vendors' recommendations for at least 80% of the
weeks the software was used, and used it for the total weeks recommended by vendors.
• Medium Dosage: students who met at least 80% of vendors' recommended dosage.
• Lowest Dosage: all students, regardless of usage.
Effect sizes (ES) describe the magnitude of the difference between two groups on an outcome
and are often interpreted as meaningful if they reach a certain minimum threshold. For the
purposes of this report, we define this threshold as any effect size equal to or greater than .13, which
is the average effect size seen in similar intervention programs (Lipsey et al., 2012). Students
with the highest program dosage in kindergarten, first and third grade had the highest treatment
effect sizes overall, as measured by their average DIBELS Next Composite scores (ES: .16, .09
and .1, respectively). The .16 effect size in kindergarten is meaningful when compared against
the average effect size of .13 produced by similar intervention programs.
Table 4. Predicted Means of DIBELS Composite Scores for Matched Treatment and Control,
Program-wide, Highest to Lowest Dosage Samples

                   Kindergarten         1st Grade            2nd Grade Intervention   3rd Grade Intervention
                   Tr.   C     ES       Tr.   C     ES       Tr.   C     ES            Tr.   C     ES
Highest Dosage     N=12,152             N=17,250             N=3,542                   N=2,772
                   157   144   .16*     209   198   .09*     170   166   NS            278   268   .1*
Medium Dosage      N=23,150             N=26,682             N=7,188                   N=6,174
                   149   140   .09*     191   184   .06*     158   159   NS            260   257   NS
Lowest Dosage      N=31,362             N=28,252             N=8,750                   N=9,162
                   145   140   .05*     187   184   .02*     154   156   NS            254   257   NS

Note. NS (not significant) in a cell means the program did not have a statistically significant effect. ES: effect size
(based on Cohen's d). Effect sizes greater than .13, the average for similar intervention programs, are considered
meaningful. *p ≤ .05.
In addition to examining the program effects on composite measures of literacy, we examined
the program’s benefits on specific literacy skill development in Table 5. Program students had
higher mean scores than their control group counterparts across all grade levels and literacy
measures, although these differences were small (from 1 to 6 points). The largest difference in
mean scores was observed for developing kindergarten students' alphabetic principles and basic
phonics skills (NWF: CLS), with program students scoring 6 points higher, on average, than the
control group. Program participation had less of an impact in the upper-early grade levels.
Program students did slightly better than non-program students on measures of basic phonics
(NWF) and reading comprehension (ORF) in first grade, fluency in second grade, and fluency
and reading comprehension in third grade.
Table 5. Predicted Means of EOY DIBELS Literacy Domains for Matched Treatment and
Control, Highest Dosage Sample

                                     Kindergarten          1st Grade             2nd Grade    3rd Grade
                                     (N=12,106-12,152)     (N=17,200-17,246)     (N=3,542)    (N=2,722)
DIBELS Scale                         Tr.   C    Dif.       Tr.   C    Dif.       Tr.  C  Dif. Tr.   C    Dif.
First Sound Fluency (FSF)            40**  37   3          N/A                   N/A          N/A
Letter Naming Fluency (LNF)          54**  50   4          N/A                   N/A          N/A
Phoneme Segmentation Fluency (PSF)   53**  51   2          N/A                   N/A          N/A
Nonsense Word Fluency - CLS          50**  44   6          89**  85   4          N/A          N/A
Nonsense Word Fluency - WWR          9**   8    1          28**  26   2          N/A          N/A
Oral Reading Fluency (ORF)           N/A                   70**  68   2          NS           76**  74   2
DAZE                                 N/A                   N/A                   N/A          14**  13   1

Note. NS (not significant) in a cell means the program did not have a statistically significant effect. N/A: measure
not administered in grade. *p ≤ .05. **p ≤ .01.
In Figure 4, we present the effect sizes for each statistically significant DIBELS literacy domain
in which the treatment group had higher mean scores compared to the matched control group to
aid in interpreting the practical significance of the findings. Effect sizes increase in strength from
the left to the right of the figure and are plotted by grade. As expected, we see significant
treatment impacts for grade levels in which the DIBELS composite reading scores were also
significant (kindergarten, first and third grade). There were no statistically significant treatment
effects for either the composite or for specific literacy domains in second grade. Two subscales
in kindergarten produced effects greater than or equal to similar intervention programs: Letter
Naming Fluency (LNF) and Nonsense Word Fluency: Correct Letter Sounds (NWF: CLS). Letter
naming fluency measures students’ ability to recognize letters and Nonsense-Word Fluency
measures students’ understanding of alphabetic principles and blending.
Figure 4. DIBELS Literacy Domain Effect Sizes by Grade, Highest Dosage Sample

Kindergarten: NWF-CLS .16, LNF .13, FSF .09, PSF .06, NWF-WWR .06
1st Grade: ORF .09, NWF-CLS .08, NWF-WWR .07
2nd Grade: no statistically significant effects
3rd Grade: DAZE .12, ORF .09

Note: Effect sizes above 0.13 are stronger than those of similar programs. All data points displayed in the figure
were statistically significant at p ≤ .05.
What were the differences in treatment and control group outcomes for
at-risk students across all vendors?
DIBELS Next benchmark levels serve as an indicator of students’ reading level. Benchmark
categories are designated as “At or Above Benchmark”, “Below Benchmark”, and “Well Below
Benchmark.” Students with DIBELS Next composite scores below “At or Above Benchmark”
for their grade level may be at-risk compared to their peers. To determine how the program
affected the outcomes of at-risk students, we depict the percent of students who started the year
Well Below Benchmark or Below Benchmark for their grade, and follow their change in reading
status in comparison to their non-program counterparts (Figures 5-8). The two bars on the left of
each figure portray the percentage of students who began the year Below or Well Below
benchmark in the treatment and control group (“BOY Tr” vs. “BOY Cntrl”), and the two bars on
the right portray the percentage of students who ended the year in each benchmark category
(“EOY Tr” vs. “EOY Cntrl”). Similar to the trends found in the regression analyses, descriptive
analyses showed that program students had the highest growth compared to their comparison
group counterparts in kindergarten and first grade, followed by a small difference in third grade.
We describe the findings for each grade level in more detail in the following paragraphs.
Kindergarten: In kindergarten, 360 EISP students and 360 comparison students in the matched
Highest Dosage sample began the school year below grade level based on their beginning of year
reading DIBELS scores. Of these, 59 percent in the treatment group ended the year reading at
grade level, compared to 33 percent of comparison students (a difference of 26 percentage points).
Figure 5. % Change in Benchmark Status from BOY to EOY, Kindergarten

                BOY Tr   BOY Cntrl   EOY Tr   EOY Cntrl
Well Below        55%       58%        21%       41%
Below             45%       42%        21%       26%
At/Above           0%        0%        59%       33%

Data source: Students reading below benchmark at BOY,
matched kindergarten Highest Dosage sample. N: 720
First Grade: Among the program students in first grade who started the year reading below
grade level, 48 percent (1,182/2,450) were reading at grade level by year end (Figure 6). In
comparison, 40 percent of the non-program students (963/2,417) moved from reading below
grade level to reading at grade level from beginning (BOY) to the end of the school year (EOY).
The difference in growth between the treatment and comparison groups was 8 percentage points.
Figure 6. % Change in Benchmark Status from BOY to EOY, 1st Grade

                BOY Tr   BOY Cntrl   EOY Tr   EOY Cntrl
Well Below        56%       57%        33%       41%
Below             44%       43%        19%       19%
At/Above           0%        0%        48%       40%

Data source: Students reading below benchmark at BOY,
matched 1st Grade Highest Dosage sample. N: 4,863
Second Grade: As shown in Figure 7, the difference in growth between program and non-
program struggling readers in second grade was negligible: 3 percentage points more program students were
reading at grade level than their non-program peers. Moreover, within the sample of struggling
2nd grade readers, only a small percentage reached At/Above Benchmark status (26% of
treatment students vs. 23% of control students). Approximately half of the students in both
groups fell within the Well Below benchmark category at EOY, indicating that there is a 10-20%
likelihood of these students achieving subsequent reading goals without intensive support outside
of core curriculum (Dynamic Measurement Group, 2016).
Figure 7. % Change in Benchmark Status from BOY to EOY, 2nd Grade

                BOY Tr   BOY Cntrl   EOY Tr   EOY Cntrl
Well Below        64%       64%        50%       52%
Below             36%       36%        24%       25%
At/Above           0%        0%        26%       23%

Data source: Students reading below benchmark at BOY,
matched 2nd Grade Highest Dosage sample. N: 3,372
Third Grade: In Figure 8 we can see that slightly more third grade program students were
reading at grade level compared to non-program students by the end of the school year (a
difference of 6 percentage points). Thirty-seven percent of program students and 31 percent of non-program students in
the matched Highest Dosage sample identified as Below or Well Below Benchmark at the
beginning of the school year reached At/Above benchmark status by year end.
Figure 8. % Change in Benchmark Status from BOY to EOY, 3rd Grade

                BOY Tr   BOY Cntrl   EOY Tr   EOY Cntrl
Well Below        68%       67%        42%       47%
Below             32%       33%        21%       23%
At/Above           0%        0%        37%       31%

Data source: Students reading below benchmark at BOY,
matched 3rd Grade Highest Dosage sample. N: 2,626
Did the program effects differ based on student or school
characteristics?
Table 6 shows the mean score differences in DIBELS composite scores at program exit for
certain subgroups of program students. Program students who were identified as low-income,
special education (SPED), and English Language Learners had lower predicted means scores
than their higher income, general education, and English speaking program counterparts in
specific grades. These differential treatment effects were the most pronounced for special
education students in Grades 1 and 3: in 1st grade they scored 34 points lower and in 3rd grade
they scored 47 points lower than general education treatment students.
Table 6. Mean Score Differences on EOY DIBELS Composite Scores by Grade and Subgroup,
Highest Dosage Sample

Subgroup                     Kindergarten   1st Grade   2nd Grade   3rd Grade
Low-income                        -2            -13         -8          -6
Special Education (SPED)         -13            -34        -23         -47
Title I Schools                    8             NS         NS          NS
ELL                               NS            -20        -10          NS
Non-white                         -2             NS         NS          16

Note. NS (not significant) in a cell means the program did not have a significant effect.
Kindergarten (N=12,024); 1st Grade (N=17,250); 2nd Grade (N=3,542); 3rd Grade (N=2,722).
All mean differences displayed in the table were statistically significant at p ≤ .05.
Individual Vendor Impacts
The vendor-specific analyses were designed to help program stakeholders understand the
effectiveness of the individual programs and make informed decisions. With this in mind, we
have done our best to conduct comprehensive analyses that help readers understand program
effectiveness from several angles. We must also stress that differences among program
vendors' samples (e.g., sample size, types of students who used the programs) make it
difficult to conduct a fair comparison among vendors. To help the reader understand these
limitations, we indicate when different samples are used in our findings and discuss these
limitations in the beginning of sections (where applicable) and at the conclusion of the report.
The vendor-specific findings in this section include a mean comparison between each program
and a matched control group that shows program effects on overall literacy scores.
What were the differences in treatment and control group outcomes
among vendors?
Table 7 presents the predicted means and mean score differences of program and non-program
students in the matched medium dosage sample for each vendor and grade. Vendors with sample
sizes that may be too small to detect small program effects were identified with “IS”, insufficient
sample, and findings that were not statistically significant were identified as “NS”, not
significant. Five vendors had a positive impact on students in kindergarten (Istation, Waterford,
Imagine Learning, SuccessMaker, Core5), followed by two vendors in first grade (Imagine
Learning; Core5), and one vendor in second grade (Imagine Learning). There were no
statistically significant findings in third grade for the vendor specific analyses. In kindergarten
and first grade, the average predicted DIBELS composite means for both program and non-
program students fell within or above the At Benchmark range for their grade (119-151 in
kindergarten and 155-207 in first grade), which signifies a 70-85% likelihood of achieving
subsequent reading outcomes (Dynamic Measurement Group, 2016). Second grade students who
began the year reading below grade level and who received program benefits were still at
risk based on their end-of-year reading level: predicted mean scores fell within the Well Below
Benchmark range (0-179) at end-of-year.
Table 7. Predicted Means of EOY DIBELS Composite for Matched Treatment and Control, by
Vendor, OLS Regression Model

                   Kindergarten             1st Grade                2nd Grade               3rd Grade
                   N        Tr.  C    Dif.  N        Tr.  C    Dif.  N       Tr.  C    Dif.  N       Tr.  C    Dif.
Istation           244      160  148  12    436      NS              136     NS              IS
Waterford          2,734    146  137  9     1,816    NS              204     NS              IS
Imagine Learning   8,110    146  137  9     13,546   184  182  2     2,940   161  156  5     1,806   NS
SuccessMaker       322†     141  131  9     696      NS              178     NS              272     NS
Core5              12,454   151  145  6     16,268   200  195  5     4,196   NS              4,114   NS
Reading Plus       N/A                      N/A                      IS                      IS
MyOn               104      IS              250      NS                      NS                      NS

Note. Model covariates were gender, Hispanic, special education, school Title I status, low-income, ELL and BOY
Composite score. IS: insufficient sample. NS (not significant) in a cell means the program did not have a significant
effect. † Lowest dosage sample reported for SM in kindergarten. *p ≤ .05
Like the program-wide analyses, we present effect sizes for the individual analyses to identify
the strength of the treatment effects in relationship to similar intervention programs. Effect sizes
increased in strength from the left to the right of the Figure 9 and are plotted by grade. Effect
sizes to the right of the dotted line are stronger than the average effect sizes produced by similar
intervention programs and are therefore more meaningful based on this frame of reference. As
displayed in Figure 9, all vendors that were used with kindergarten students produced effect
sizes greater than the effect size benchmark: SuccessMaker (ES = .44), Istation (.32),
Waterford (.25), Imagine Learning (.24), and Core5 (.15). Vendors in Grades 1-3
had small effect sizes, none of which were greater than the effect size benchmark.
Figure 9. Impact of Individual Vendors on DIBELS Composite Scores, Effect Sizes by Grade

Kindergarten: SuccessMaker .44†, Istation .32, Waterford .25, Imagine Learning .24, Core5 .15
1st Grade: Core5 .08, Imagine Learning .04
2nd Grade: Imagine Learning .08
3rd Grade: no statistically significant effects

Effect sizes above 0.13 are stronger than those of similar programs.
Note. IS for medium dosage group: Istation 3rd grade (n=16); WF 3rd grade (n=8); MyOn 1st grade (n=16); RP in
2nd/3rd grade (n=0-50). † Lowest dosage sample used for SM in kindergarten. IS for lowest dosage group: WF 3rd
grade (n=34); RP 2nd grade (n=6). All data points displayed in the figure were statistically significant at p ≤ .05.
Multi-year Findings
In this section we identify the key trends in program enrollment, student program use, and its
impacts on student achievement across the past few years of program implementation.
What are the multi-year trends in program enrollment?
Table 8 depicts program enrollment of Local Education Agencies, schools, and students over the
past five years of the EISP. Program enrollment has continued to increase substantially, with
approximately 64,000 more students enrolled in 2017/2018 than in 2014/2015.
Table 8. Est. Program Enrollment from 2013/2014 – 2017/2018

            2013-2014   2014-2015   2015-2016   2016-2017   2017-2018
LEAs               32          45          72          79          79
Schools           145         218         388         338         403
Students       38,553      36,790      68,891      86,723     100,951

Note. Data reported prior to 2015-2016 include all K-3 students (non-intervention students in Grades 2-3 are
included); data from 2015-2016 onward include all K-1 students and Grades 2-3 intervention students only.
Student counts may contain duplicates and should be seen as estimates.
What are the multi-year trends in students’ program use?
Table 9 presents the change in average usage from 2013-2014 to 2017-2018. Prior to 2015-2016,
we estimated students’ average weekly use and total weeks of use10, and we did not present these
usage statistics for those years. In 2015-2016, we received weekly program use data from
vendors, which provided us with more accurate usage statistics. As displayed in Table 9, LEAs
appear to be doing a better job overall with program implementation from 2015-2016 to 2017-
2018 in all three areas (average minutes, total weeks, and total minutes).
10 Averages were calculated from students’ total minutes and the program start-and-end dates prior to
2015-2016.
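A sketch of this pre-2015-2016 estimation approach, under the assumption that weeks of enrollment were taken as the span between the program start and end dates, is shown below; the function and field names are hypothetical.

```python
from datetime import date

def estimated_avg_weekly_minutes(total_minutes: float, start: date, end: date) -> float:
    """Approximate average weekly use from total minutes and the program start/end dates."""
    weeks_enrolled = max((end - start).days / 7, 1)   # guard against a zero-length span
    return total_minutes / weeks_enrolled

# Example: 900 total minutes over an 18-week span works out to 50 minutes per week.
print(estimated_avg_weekly_minutes(900, date(2015, 1, 5), date(2015, 5, 11)))
```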