Using generative AI for data analysis, understanding open answers, part 2

In the first part of these operational tips for data analysis, we discussed the importance of avoiding wow-effect traps, the need to clean the data, and the value of investing time in preparation by following a methodological approach that includes explaining to the AI the file we are about to analyse.

Now, let's see how to proceed in a more advanced way.

Every AI requires learning

In the examples I provide, I chose ChatGPT because it is the tool I have used most often since the beginning, and it is the one I know best. If you prefer Claude, Gemini, or others, it is important to remember that each model has its own learning curve. The time you invest in understanding what works and what doesn't with each chatbot will pay off in terms of time savings and efficiency, but only after a while.

However, if you decide to switch chatbots, you will likely need to start over, though with some additional knowledge. In other words, a prompt that works for ChatGPT may not work the same way for Gemini, for example.

Generative artificial intelligences (AIs) promise to become important tools for journalists and publishers. These technologies shine particularly in analysing data, evaluating survey responses, and dissecting comments and sentiments on various topics.

Working on open answers

What we are about to see is a process that can be replicated whenever we need to analyse natural-language text that could benefit from sentiment analysis, an overview of the written content, summarisation, or clustering of responses.

To begin, we need to “prepare” the chatbot as usual. I continue to work with ChatGPT: feel free to use the chatbot you prefer, using the same method.

For convenience, I work on a CSV that contains only the two columns of open responses to start with. I could work on the complex file, but I prefer to proceed in an orderly fashion. I also preferred to translate the responses I obtained into English for various reasons: it is more useful for this article, and ChatGPT's performance in English is still better than in Italian.
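As a rough illustration of this preparation step, the sketch below keeps only the open-response columns of a survey export, using Python's standard csv module. The column names (role, doubts, suggestions) and the sample rows are hypothetical stand-ins, not the real survey file.

```python
import csv
import io


def keep_columns(source, wanted):
    """Read a CSV file object and keep only the wanted columns, trimmed."""
    reader = csv.DictReader(source)
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Empty or missing cells become empty strings.
    return [{name: (row[name] or "").strip() for name in wanted} for row in reader]


def write_csv(rows, wanted, target):
    """Write the reduced rows to a CSV file object."""
    writer = csv.DictWriter(target, fieldnames=wanted)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample: one profile column and two open-response columns.
SAMPLE = """role,doubts,suggestions
journalist,How to use them practically,More hands-on sessions
editor,Will I still have a job?,
"""

OPEN_COLUMNS = ["doubts", "suggestions"]  # assumed column names
rows = keep_columns(io.StringIO(SAMPLE), OPEN_COLUMNS)

out = io.StringIO()
write_csv(rows, OPEN_COLUMNS, out)
print(out.getvalue())
```

With a real file, the two io.StringIO objects would be replaced by open(..., newline="", encoding="utf-8") handles for the input and output files.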

The first thing to do is to remind ChatGPT what is in the responses we are analysing and what we expect to obtain. For consistency and practical reasons, I will work with the same case study as in the first part of this guide: a survey about an event on generative AIs.

For example:

Prompt: These are the open responses from a survey conducted among participants at an event. The question is: “What doubts do you still have? Write any questions you want about generative AIs. I will use them to build a useful information archive.” The responses may be in natural language and contain punctuation and stop words. Analyse the responses and categorise them into sets.
Classification examples:
“How to use them practically” → Practical training requests
“How to save creativity from the AI threat” → Concerns
“Who handles the models and algorithms that train the AI?” → Foundational questions
“How to really use the many AI programs?” → Orientation
“Will I still have a job?” → Concerns
“Copyright and intellectual properties” → Legal issues

As you can see, I need to provide the AI with some classification examples. This way, I guide it towards the type of output I expect to obtain.
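If the list of examples grows, it can be convenient to keep them in one place and assemble the prompt from them. The following is only a sketch of that idea: the function name build_prompt is hypothetical, and the resulting text is meant to be pasted into the chatbot, not sent through any particular API.

```python
QUESTION = (
    "What doubts do you still have? Write any questions you want about "
    "generative AIs. I will use them to build a useful information archive."
)

# (response, category) pairs taken from the prompt above.
EXAMPLES = [
    ("How to use them practically", "Practical training requests"),
    ("How to save creativity from the AI threat", "Concerns"),
    ("Who handles the models and algorithms that train the AI?", "Foundational questions"),
    ("How to really use the many AI programs?", "Orientation"),
    ("Will I still have a job?", "Concerns"),
    ("Copyright and intellectual properties", "Legal issues"),
]


def build_prompt(question, examples):
    """Assemble the classification prompt with one line per example."""
    lines = [
        "These are the open responses from a survey conducted among "
        "participants at an event.",
        f'The question is: "{question}"',
        "The responses may be in natural language and contain punctuation "
        "and stop words.",
        "Analyse the responses and categorise them into sets.",
        "Classification examples:",
    ]
    lines += [f'"{text}" -> {category}' for text, category in examples]
    return "\n".join(lines)


prompt = build_prompt(QUESTION, EXAMPLES)
print(prompt)
```

Keeping the examples as data makes it easy to add a new pair each time the chatbot misclassifies something, then regenerate the prompt.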

You will notice that many responses end up under “Others,” so it’s necessary to push the AI to go deeper.

For example, like this: “When you find 'no', 'no doubt', 'all clear' and similar, category: 'No doubts.' Find and propose categories for the 'others' classification.”
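A rule this simple could also be applied to the file before it reaches the chatbot. The sketch below is one possible way to do that, not part of the workflow described here: it flags a response as “No doubts” only when the whole answer matches a short phrase. The phrase list extends the three phrases from the prompt with a few assumed variants.

```python
import re

# 'no', 'no doubt' and 'all clear' come from the prompt; the rest are assumed variants.
NO_DOUBT_PHRASES = {"no", "no doubt", "all clear", "none", "no doubts", "nothing"}


def normalise(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(cleaned.split())


def is_no_doubt(text):
    """True only when the entire response is one of the known phrases."""
    return normalise(text) in NO_DOUBT_PHRASES


responses = [
    "No.",
    "All clear!",
    "No doubt, but I would need more practical mini-sessions on the tools",
    "Privacy",
]
flags = [is_no_doubt(response) for response in responses]
for response, flag in zip(responses, flags):
    print("No doubts" if flag else "ask the chatbot", "|", response)
```

Matching the whole answer, not just its first words, keeps longer responses such as the third one available for a more specific category.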

As you can see, the AI makes suggestions. In this case, I decided to accept them all (I am always in charge, never the machines!).

The “Others” category is still too large.

This is normal: it’s not a failure of the AI, nor mine. The working relationship is like a dialogue with semi-finished products. At first, it requires more time, but the time savings become increasingly evident as you become more skilled.

So, here is the new prompt. ChatGPT needs a few more examples.

Prompt: “Let's take the 'Others' category and break it down into other categories.

“Privacy” is a 'Legal issue' or 'Concern.'

“No doubt, but I would need more practical mini-sessions on the tools” is a 'Request for practical...'

When mentioning someone (example: 'Alberto', 'Prof. Attardi', 'Pilhofer'), categorize as 'Specific request for a speaker.'

'A big doubt remains: AI needs an ethical shift that does not align with the business needs of the companies that generate it. How to reconcile the two aspects' is 'Concern.'

'The only concern is that everything is paid' is 'Concern.'

'What do you think will be the future developments of AI?' is 'Vision for the future.'

Reclassify everything.”

ChatGPT performs the reclassification, which we must always verify. Once we are satisfied, we can move on. For example, request a graph.
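One way to support this verification is to copy the reclassified responses back into a script, check that no unexpected category names have crept in, and count the categories before asking for a graph. The sketch below assumes a hypothetical list of (response, category) pairs. The category names come from the prompts above, while the pairings themselves are invented for illustration.

```python
from collections import Counter

# Category names used in the prompts above.
ALLOWED = {
    "Practical training requests", "Concerns", "Foundational questions",
    "Orientation", "Legal issues", "No doubts",
    "Specific request for a speaker", "Vision for the future",
}

# Hypothetical chatbot output, pasted back as (response, category) pairs.
labelled = [
    ("How to use them practically", "Practical training requests"),
    ("Will I still have a job?", "Concerns"),
    ("How to save creativity from the AI threat", "Concerns"),
    ("Copyright and intellectual properties", "Legal issues"),
    ("What do you think will be the future developments of AI?", "Vision for the future"),
    ("Privacy", "Legal"),  # a label that was never agreed on
]


def unexpected_labels(pairs, allowed):
    """Return category names that are not in the agreed list."""
    return sorted({category for _, category in pairs if category not in allowed})


def tally(pairs):
    """Count how many responses fall into each category."""
    return Counter(category for _, category in pairs)


def text_chart(counts):
    """Render the counts as a plain-text bar chart, largest first."""
    width = max(len(name) for name in counts)
    return "\n".join(
        f"{name.ljust(width)} {'#' * n} {n}" for name, n in counts.most_common()
    )


unexpected = unexpected_labels(labelled, ALLOWED)
counts = tally(labelled)
print("Labels to review:", unexpected)
print(text_chart(counts))
```

The counts only show how the labels are distributed. Whether each response sits in the right category still has to be checked by reading a sample.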

There is no doubt that the request for practical training is the most popular (although sometimes it is difficult to understand exactly what someone who asks for 'more practice' really wants, because in a training event it is clearly impossible to satisfy everyone's specific request).

And in the end, I can also ask the AI to give me a brief summary of what we've discussed together.

Of course, this is a draft, and I'll need to work on it if I want to publish an excerpt or an executive summary. But for my needs, for now, it’s enough.

ChatGPT does not distinguish certain things well and cannot grasp certain nuances. For example, I believe that 'prompting' appears so few times in the responses only because it is a technical term. The draft executive summary is far from usable, but then I gave the AI few details: I did not specify the desired tone of voice or the target audience, and I only gave a very sparse prompt. And as we have seen, the less preparation, the worse the result.

Now, by combining the instructions and methods from this part of the work with those from the first part of the guide, we could, for example, investigate which categories of people made which training requests, look for correlations, and similar tasks. In this case, clearly, we need to go back to working on the original CSV (not just the columns with open responses), so that the AI can also assist us with this. We can even integrate different AI tools for different tasks, if needed.
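A simple first step in that direction is a cross-tabulation of respondent profile against response category. The sketch below assumes the original CSV has gained a category column after classification. The column names role and category, and the sample rows, are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict


def cross_tab(rows, group_field, category_field):
    """Count categories separately for each value of the grouping field."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[group_field]][row[category_field]] += 1
    return table


# Hypothetical extract of the full file after classification.
SAMPLE = """role,category
journalist,Practical training requests
journalist,Concerns
editor,Practical training requests
journalist,Practical training requests
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
table = cross_tab(rows, "role", "category")

for role, counts in sorted(table.items()):
    for category, n in counts.most_common():
        print(f"{role}: {category} = {n}")
```

A table like this only shows who asked for what. Any claim about correlation would need more data and a proper statistical test.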

Best practices

Here are the best practices that you should add to those from the first part.

Understand your AI tool:

Familiarise yourself with the capabilities and limitations of your chosen AI tool.
Invest time in learning how to craft effective prompts and queries.
Remember that different AI models may require different approaches.

Define clear objectives:

Clearly outline the goals of your data analysis project.
Specify the type of output you expect from the AI (e.g., sentiment analysis, clustering, summarisation).
Provide detailed instructions and examples to guide the AI in generating accurate results.

Iterative approach:

Start with simple queries and gradually increase complexity as you refine your analysis.
Regularly review and verify AI-generated outputs to ensure accuracy and relevance.
Be prepared to adjust your prompts and methods based on the AI’s performance.

Utilise examples for classification:

Provide concrete examples to help the AI understand your classification criteria.
Use a variety of examples to cover different scenarios and categories.
Continuously refine and expand your examples to improve the AI’s accuracy over time.

Combine methods:

Integrate multiple analysis methods to gain comprehensive insights (e.g., combining open answers analysis with correlation studies).
Leverage different AI tools if necessary to complement each other’s strengths.

