Q&A: Talking about the state of open data

Before the state’s data portal launched last year, we asked a handful of people in the data community about their expectations of the portal. We reached out to a few of them to see what they think of it now. We talked with:

  • Sheryl Horowitz: The community research and evaluation director at the Connecticut Association for Human Services. She went to CAHS to implement “results based accountability evaluation.” She was also co-chair of a data committee for the Early Childhood Cabinet, where they discussed ways to make data public.
  • Mark Abraham: Executive director of DataHaven, a New Haven nonprofit that compiles and shares community indicator data.
  • Scott Gaul: Director of the Hartford Foundation’s Community Indicators Project, an initiative to gather and analyze community data focused on the workforce and education.

    Has the data portal met your expectations?

    Sheryl Horowitz: Well, it is a massive undertaking, and knowing that many CT agencies are understaffed and lagging in technology, my expectations were low, so I think they are making a good-faith effort. It seems like the approach taken is a form of crowdsourcing, meaning that data sets are added based on what is easily available and what somebody is willing, and gets permission, to provide. This is in contrast to developing a systematic approach to a set of datasets that are essential to understanding the workings of CT government and services. For example, the only hospitalizations I found were for asthma, and while this is important, there are other diseases that could tell interesting stories about the exposure and condition of populations in CT. The data for diabetes, for example, is parsed differently and is about diagnosis. It would be so much more useful if there were parallel data sets for each disease, including diagnosis, hospitalization and death by cause. The organization of the data sets is also very confusing, and only with a good search algorithm can you find what you might be looking for. Unlike in a library, where browsing near the book you are looking for can turn up many other similar resources, that strategy is not possible on the website.

    Mark Abraham: The portal has generally met our expectations, especially in terms of the large volume of raw data posted there, and we look forward to seeing future enhancements.

    Scott Gaul: Generally yes. We have to be realistic about the expectations for government. Increasing transparency is both the most basic and most valuable thing they can do. Government collects the data; they can provide it back to the public. In a perfect world, government would be active in sharing, using, analyzing and acting upon that same data, but I think it’s more realistic to expect civil society to play that role. We try to do that work in the Hartford region through our community indicators project. The portal has increased transparency and been responsive to requests for data. If they don’t publish something, at least they explain why.

    Where has it had shortcomings?

    Sheryl Horowitz: Besides the ones mentioned above, other shortcomings include data sets that are not timely (data from DPH last updated in 2011, for example) and several sets that do not have clear dates. More disaggregation is needed below the state level to capture income inequalities (economic status) in addition to ethnicity, gender, and age differences. The town level is the only unit with real meaning, but with 169 (maybe 168) towns there needs to be another way to capture differences between large urban areas and rural towns rather than only looking at the state level. Finally, there need to be data dictionaries for each of the datasets that give formal definitions of the variable names.

    (Followup question: It’s interesting how it has changed your workflow, in that you work with the hope that the data portal is comprehensive enough that you don’t need to ask for a dataset. But when that fails, what does your workflow look like? Do you inquire about data and its associated metadata with specific agencies? And how do you know what data exists? I know your previous experience helps, but is that the primary source of knowledge?)

    My normal pattern to search for data starts from a question that I am trying to answer, or from trying to understand a distribution or pattern of information that is used to understand child or family welfare. For example, we think that it is important that children have stable and safe environments while they are growing. So I might look for indicators that might measure the effects of that on the child, such as mental health status. Then I would start to look at what is available nationally and is being collected by other researchers and policy makers. Many of these would be proxy measures, such as the number of children that have been seen by mental health providers, which of course leaves out those who don’t have access to those services. There is also a network of agencies throughout the country (e.g. Child Trends) that I would reference to understand what types of data are already being collected. I would then start asking agencies if they are collecting anything like that and what fields (definitions), frequencies, and levels of disaggregation they use. For our Kids Count work we are often stymied by the need for town-level data that can end up being suppressed or not available because of confidentiality concerns.

    You are also correct that I rely on past history at my agency and prior relationships to inquire, but all in all there is no formula, and it can be really frustrating if the people we deal with leave, or we are given a new contact, and the data we receive is slightly different. That is why we are starting to go the route of MOUs so that it is in writing what we are asking for and what we are receiving.

    Mark Abraham: In terms of shortcomings, I feel that raw datasets that are posted on the portal are not always as current as they could be. For example, I checked today and the State Department of Education had 2014 data posted on its Bureau of Data Collection, Research, & Evaluation website, but the Connecticut Open Data portal was still carrying 2013 data as the most recent year.

    Scott Gaul: It’s not particularly user-friendly, but I don’t see that as a tremendous failing. Other sites are filling that gap, like the CT Data Collaborative and the State Data Center. I don’t really want state government putting a lot of resources into building a slick website.

    I would like to see an inventory of the data that state agencies collect. If the low-hanging fruit are now published, what else is behind the scenes? This is basically the ‘metadata’ that was talked about in the initial interviews. I believe they are working on this, but it’s not there yet. That inventory would also help to understand the systems and costs that the state incurs to collect data. Are there redundancies or areas where we could be more efficient? Data collection not only costs money, but filling out forms, applications and questionnaires puts a burden on the public, especially the most disadvantaged.

    Has it changed the way you do your job?

    Sheryl Horowitz: I do look first at this site before I inquire about data.

    Mark Abraham: I’m not sure the portal has changed the way we do our job at DataHaven. Although it has facilitated access to some raw data sets, DataHaven staff still need to investigate and thoroughly understand each dataset before using it – just to begin, by confirming that the most recent year of data is posted on the site and that there haven’t been updates or corrections, and by developing “metadata” to understand potential strengths and limitations in terms of how data are collected over time. From there comes the difficult work that we do to make the raw data more useful as information for our partner institutions, agencies, and community groups. The portal could be enhanced in ways that would make this easier, such as by flagging whether the data posted is for the most recent year published, or by linking to concrete examples of how the raw data are used to support decision-making within government agencies.

    Scott Gaul: Yes – I now know who to ask for data, at least when it comes from a state agency. (I ask Tyler.) That’s pretty basic, but it really helps to have a clear point of contact to whom you can make public requests, rather than trying to hunt down the right person within each agency’s org chart and hoping they are helpful.

    Do you think it shows a commitment to transparency?

    Sheryl Horowitz: This is uneven between the departments. Some departments seem to be providing data on difficult topics, like the Department of Education (chronic absences, cohort graduation) and DCF (abuse and neglect). The public health department and hospital data seem to be lagging in this area.

    Mark Abraham: Yes, I think the portal does show a commitment to transparency, as do many other state websites that post public records. The transparency embodied in the portal is also based on legislation, for instance, in the case of The Alvin W. Penn Racial Profiling Prohibition Act, which prohibits law enforcement agencies from stopping motorists solely on the basis of race, age, or other demographic characteristic. The Open Data portal hosts a dataset that was created in accordance with this law. Communities should consider passing additional legislation to require that important datasets be collected and posted, and government agencies could do much more to disaggregate their data by neighborhood, district, or demographic group, as I suggested earlier — this has to be done in a way that protects confidentiality, but allows us to use the data to produce much more useful information.

    Scott Gaul: There’s certainly a basic level of commitment through the creation of the portal and the public process to suggest and publish new datasets. Has that permeated the being of all state employees? I don’t know. There are a lot of requested data sets that are not yet published.

    Other states have made a more substantial commitment – not just to transparency, but to using data for performance management or raising public awareness about key issues. Maryland is a good example of this – when you go to their site, you get a better sense of ‘how Maryland is doing.’ But their commitment has not just been to transparency as an ideal; they have committed resources and staff to support the portal and to allow a more ambitious set of activities.

    (Followup question: Do you think it’s important for the state to financially invest in open data staff?)

    I would say that ‘If the state has ambitions to go beyond transparency to making active use of its data, that requires an investment in staff.’ I haven’t actually fact-checked that Maryland has staff for the open data portal, but I’ve heard people say they have several (maybe 9 – 10 people?), whereas we just have one person (to my knowledge).

    What data would you still like to see on the portal?

    Sheryl Horowitz: More surveys conducted by towns and agencies, e.g. food insecurity and the new DataHaven health survey.

    Mark Abraham: Additional data would certainly make the portal more useful. I would like to see all public information that is currently collected by the State Department of Education, Department of Labor, Department of Public Health, and other key agencies on the portal, and more importantly, updated on a frequent schedule so that it does not conflict with or become less current than the information that is already being posted on each agency website.

    Scott Gaul: An inventory of data is the most important piece – then residents would know what’s available and could set priorities. I’d also like to see access to longitudinal or cross-agency data through the portal as the state develops the systems that help to link records over time and across agencies.

    Cohort counts for high school graduation rates would be nice too.

