Mining budget testimony for themes

Over the course of two weeks, hundreds of Connecticut residents, nonprofit providers, lobbyists and state agency leaders journeyed to the state Capitol complex to testify about Gov. Dannel P. Malloy's proposed state budget, which would cut state spending by $569.5 million in the upcoming fiscal year.

Most testified against the cuts proposed to programs that assist the disabled, foster children and the poor throughout the state.

While hundreds submitted their testimony, many spoke off the cuff to members of the Appropriations Committee about the impact the proposed cuts would have. We found health and education were among the most-discussed topics. Testimony was peppered with language emphasizing urgency, and children appear to have been discussed more than other groups of people.

Here at Trend CT, we combed through the submitted testimony to identify trends.

We converted more than 1,200 scanned documents into text (more than 5 million words in all) and counted the most frequently occurring terms. We avoided focusing too narrowly on the frequency of single phrases, except to narrow down the list of phrases we examined (we looked only at phrases occurring more than 100 times and longer than two letters). We wouldn’t suggest a term occurring 120 times in written testimony is quantitatively more ‘significant’ than another term occurring 100 times. Instead, we looked for themes that emerged when we categorized words qualitatively.
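The shortlist rule described above (more than 100 occurrences, longer than two letters) can be sketched in a few lines of Python. This is only an illustration, not our actual tool: the function name `filter_phrases` and the toy counts are made up for the example, and the sketch assumes spaces don't count toward a phrase's length.

```python
from collections import Counter


def filter_phrases(counts, min_count=100, min_letters=2):
    """Keep phrases seen more than min_count times and longer than min_letters letters.

    The comparison is strict ("more than 100"); switch to >= if a count of
    exactly 100 should also qualify.
    """
    return {
        phrase: n
        for phrase, n in counts.items()
        # Assumption: spaces are ignored when measuring phrase length
        if n > min_count and len(phrase.replace(" ", "")) > min_letters
    }


# Toy counts invented purely for illustration (not the testimony data)
toy_counts = Counter(
    {"health": 150, "of": 900, "mental health": 120, "budget": 100, "rare term": 3}
)
shortlist = filter_phrases(toy_counts)
print(shortlist)
```

In this toy run only "health" and "mental health" survive: "of" is too short, "budget" does not exceed the cutoff, and "rare term" is too infrequent.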

Health

The word “health” occurred in the text far more frequently than any other meaningful single-word phrase (we ignored words such as “the” and “and”) – we counted it 4,960 times. “Mental health” occurred more than any other meaningful two-word phrase – 1,213 times. Here are some other health-related phrases that occurred frequently:

Addiction (227) • addiction services (131) • age (201) • aging (210) • ahec network (242) • area health (100) • behavioral (289) • behavioral health (189) • brain (250) • brain injury (188) • community health (282) • disabilities (459) • disability (158) • health (4,960) • health care (550) • health services (211) • health professionals (148) • health service (138) • health boards (108) • health center (131) • health careers (108) • health education (147) • healthcare (258) • home care (135) • hospital (391) • hospitals (233) • human services (193) • illness (173) • injury (228) • medicaid (552) • medical (651) • medicine (160) • mental (1,425) • mental illness (134) • nursing (456) • nursing home (153) • oral health (127) • patient (158) • primary care (283) • public health (251) • social services (193) • substance abuse (214) • treatment (322) • uconn health (398). In total the phrases occurred 16,515 times.

Education

Education came up a lot, too. The word “education” appeared 1,668 times, and “school” was even more frequent, appearing 1,919 times. We found these related terms showed up frequently:

Academic (202) • adult education (126) • childhood (371) • college (518) • community college (177) • developmental (210) • early childhood (327) • education programs (122) • education (1,668) • educational (343) • high school (211) • higher education (205) • learn (267) • learning (438) • public schools (119) • school (1,919) • schools (602) • student (480) • students (1,676) • teachers (238) • uconn (858) • university (519) • total (11,596).

Urgency

Those giving testimony used words that suggest a sense of urgency:

Crisis (260) • critical (392) • emergency (250) • essential (240) • important (690) • invaluable (171) • must (422) • necessary (224) • need (1,379) • needed (441) • needs (1,039) • please (620) • possible (191) • significant (259) • urge you (283) • urge (393) • vital (228) • vulnerable (238) • total (7,720).
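For readers who want to check or extend a theme total, the arithmetic is a plain sum over the listed counts. The sketch below uses the urgency figures exactly as listed above; the variable names are our own choice for the example.

```python
# Counts copied from the urgency list above
urgency_counts = {
    "crisis": 260, "critical": 392, "emergency": 250, "essential": 240,
    "important": 690, "invaluable": 171, "must": 422, "necessary": 224,
    "need": 1379, "needed": 441, "needs": 1039, "please": 620,
    "possible": 191, "significant": 259, "urge you": 283, "urge": 393,
    "vital": 228, "vulnerable": 238,
}

# Every listed entry is summed, including overlapping ones
# such as "urge" and "urge you"
theme_total = sum(urgency_counts.values())

# The three most frequent terms in the theme
top_three = sorted(urgency_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(theme_total)  # 7720
print(top_three)    # [('need', 1379), ('needs', 1039), ('important', 690)]
```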

Thinking of the children

We noticed a lot of terms were used to describe groups of people. “Children” and “childrens” (we dropped all non-alphabetical characters, including apostrophes, for our analysis) appeared a combined 2,162 times. Here are some of the other groups of people discussed most frequently in the testimony.

Adult (377) • adults (228) • asnuntuck community (114) • caregivers (164) • child (565) • children (1,925) • childrens (237) • citizens (323) • clients (270) • connecticut residents (126) • families (1,366) • family (1,027) • governor (461) • human (324) • individual (166) • individuals (613) • kids (199) • most vulnerable (149) • myself (162) • parent (313) • parents (586) • patient (158) • patients (366) • people (1,363) • population (295) • president (224) • professionals (294) • ranking members (115) • representative (702) • resident (220) • senator (656) • society (167) • son (242) • staff (801) • teen (180) • the community (493) • the public (140) • the people (131) • victims (266) • women (290) • worker (194) • workers (286) • young (437) • youth (585) • total (18,300).
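The apostrophe handling mentioned at the top of this section explains why “childrens” shows up as its own term. A minimal sketch of that kind of cleanup follows; lowercasing and keeping whitespace are assumptions on our part for the example, and `normalize` is a hypothetical name.

```python
import re


def normalize(text):
    """Lowercase the text and drop every character that is not a letter or whitespace."""
    # Assumption: whitespace is kept so that words stay separated
    return re.sub(r"[^a-z\s]", "", text.lower())


print(normalize("Children's services"))  # childrens services

# The combined figure quoted above is the sum of the two listed counts
combined_children = 1925 + 237
print(combined_children)  # 2162
```

One side effect of this approach is that digits and punctuation vanish too, so figures such as dollar amounts never appear among the counted terms.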

Methodology, resources

We’ve admittedly only scratched the surface of this large data set. Please feel free to use our work as a launching point for your own research.

To complete this analysis, we downloaded more than 1,200 PDFs. Because the PDFs were scanned paper documents rather than files exported from a word processor, we had to use text-recognition software to turn them into text, which you can download here.

We used the Tesseract software to convert the documents. Text recognition isn’t perfect. It can stumble on documents with handwritten portions, imperfections such as smudges, or even on characters that are visually similar, such as lowercase “l” and the numeral “1”. Some of the text, as a result, came out garbled, so it’s not particularly useful to focus on terms that appear infrequently. But with so many source documents, it was good enough for a big-picture view.
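For anyone who wants to try the text-recognition step, here is a rough sketch of calling the Tesseract command-line program from Python. It is not the exact pipeline we ran. It assumes Tesseract is installed and that each PDF page has already been converted to an image (Tesseract reads images, not PDFs); the file name `page-1.png` and both function names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    # "stdout" as the output base tells Tesseract to print instead of writing a file
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise an error if Tesseract exits with a failure code
    )
    return result.stdout


# Show the command that would be run for a hypothetical page image
print(tesseract_command("page-1.png"))
```

Calling `ocr_image("page-1.png")` would then return that page's text, garbled characters included.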

From there we built a tool to count words and two-word phrases using the natural language processing library NLTK.
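A counting tool along these lines might look like the sketch below. It is a simplified stand-in rather than our actual code: it splits on whitespace instead of using a full tokenizer, the sample sentence is made up, and the small fallback exists only so the example still runs where NLTK is not installed.

```python
try:
    from nltk import FreqDist
    from nltk.util import ngrams
except ImportError:
    # Fallback so the sketch runs without NLTK; Counter offers the same counting interface
    from collections import Counter as FreqDist

    def ngrams(sequence, n):
        """Yield each run of n adjacent items from a list."""
        return zip(*(sequence[i:] for i in range(n)))


def count_terms(text):
    """Count single words and adjacent word pairs (bigrams) in a text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    words = FreqDist(tokens)
    pairs = FreqDist(" ".join(pair) for pair in ngrams(tokens, 2))
    return words, pairs


# Made-up sample sentence, not taken from the testimony
sample = "mental health services and mental health care"
words, pairs = count_terms(sample)
print(words["health"], pairs["mental health"])  # 2 2
```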

What do you think?

  • Joseph Brzezinski

    Have you tried using a “Word Cloud”?
    I presume most testimony was in favor of eliminating some proposed budget cut. Does your analysis include associating testimony with particular budget revenue increases or spending reductions to get some overall measure of financial impact? In other words, what is the prognosis for how big a tax increase could be considered from the tax and spend legislature?

  • Jake Kara

    Word clouds make a clear visual impact, but we didn’t want to put too much emphasis on the number of occurrences of any single word. Instead, we looked for sizable groups of related words that each appear many times (more than 100 was our cutoff) to try to establish the most dominant themes. There are definitely more sophisticated analyses that could be done, which is why we made our cleaned-up files available.