6 Must-Try ChatGPT Prompts For Data Analysts
Photo Source: @siva_photography via Unsplash
The role of a data analyst can be complex and challenging, as it requires a unique blend of technical expertise, analytical skills, and business acumen. Data analysts are often tasked with sifting through vast amounts of data to extract meaningful insights that can inform business decisions.
But no two analysts are alike, and while some analysts can be automation experts, others may find themselves doing things more manually.
In this article, I wanted to share a few real-world examples of how a data analyst (or aspiring data analyst) can use ChatGPT to help sharpen their skills and work smarter.
All of the examples I’ll cover in this article were carried out using ChatGPT, some with GPT-3.5 and others with GPT-4 (which currently requires a paid subscription). So, for each example below, I’ve indicated which GPT model was used as well as the exact prompts.
I’ve also created a video walk-through of this article, so if you’d rather watch / listen you can check out the video below.
If you’re reading, here’s a quick summary of the 6 use-cases.
1. Use AI to create mock datasets
If you’re a student or someone looking to break into a career in data, one of the challenges you might face in the early stages is access to relevant datasets. The good news is there’s probably more access to data today than when I started out in my career ten years ago. Websites like Kaggle are a great resource for data analysts looking to sharpen their skills.
But sometimes, you need access to very specific types of data or structures of data, which may not always be available on free sources like Kaggle.
This is where ChatGPT can be really useful. Using a simple prompt, you can ask ChatGPT to create a mock dataset for almost any type of data and with any type of data structure. Here’s a sample prompt you can use. Be sure to tailor it to your specific context and needs.
Prompt (using GPT-3.5 and a free plan):
Create a mock dataset for training purposes that focuses on eCommerce sales data for a small business specializing in online smartphone case sales. The dataset should comprise the following columns:
- Transaction ID
- Transaction date
- Product SKU
- Product name
- Quantity
- Unit price
- Total amount
Please ensure that the mock dataset is realistic and representative of typical eCommerce sales data for the specified small business.
As you can see in the prompt above, giving ChatGPT some direction is useful. I specified the context of the data (i.e. eCommerce, small business, smartphone products) and asked for some specific column headers. You don’t have to be this specific, but if you’re too broad, then the output may not be relevant to your needs.
The image below shows the output from my prompt.
Mock dataset from ChatGPT
This is a relatively small dataset, but you can ask ChatGPT for more rows or columns of data. Keep in mind that if you’re using a free plan, there will be limits on how big the dataset can be.
The table ChatGPT provided above can be pasted into Excel fairly easily. However, if you’re on a paid plan and using GPT-4, you can actually ask ChatGPT for a downloadable CSV or XLS.
If you’d like to read through the full chat thread, you can do so via this link.
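If you hit the size limits, you can also generate a larger mock dataset yourself. Here’s a minimal Python sketch that builds rows with the same seven columns as my prompt. To be clear, the product catalog, prices, date range, and function names are all made up for illustration; this isn’t what ChatGPT produced.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical catalog: SKUs, names, and prices are invented for illustration.
CATALOG = [
    ("CASE-001", "Clear Slim Case", 12.99),
    ("CASE-002", "Rugged Armor Case", 24.99),
    ("CASE-003", "Leather Wallet Case", 29.99),
]


def make_mock_sales(n_rows, seed=42):
    """Return n_rows mock transactions using the columns from the prompt."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    start = date(2024, 1, 1)   # assumed start of a 90-day sales window
    rows = []
    for i in range(1, n_rows + 1):
        sku, name, price = rng.choice(CATALOG)
        quantity = rng.randint(1, 3)
        rows.append({
            "Transaction ID": f"T{i:05d}",
            "Transaction date": (start + timedelta(days=rng.randrange(90))).isoformat(),
            "Product SKU": sku,
            "Product name": name,
            "Quantity": quantity,
            "Unit price": price,
            "Total amount": round(quantity * price, 2),
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text that can be saved and opened in Excel."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = make_mock_sales(50)
csv_text = to_csv(rows)
```

Write csv_text to a file ending in .csv and it should open directly in Excel. Change the catalog and the row count to suit your own context.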
2. Use AI to learn and understand new concepts
Even a seasoned data scientist will come across new concepts or ideas they’re unfamiliar with or need to brush up on. Whether you’re attempting to crack a complex function in Excel or trying to wrap your head around a complex statistical concept, ChatGPT can be a great help with this.
For me, ChatGPT has totally changed the way I learn on the go. I find that explanations from ChatGPT are easier to follow than combing through various articles online. And the fact that I can go back and forth and clarify specific points with ChatGPT is very powerful.
Below is an example prompt where I asked about a type of statistical test known as an ANCOVA.
Prompt (using GPT-3.5 and a free plan):
In simple terms, please explain what an ANCOVA statistics test is. Assume that I have limited understanding of statistics and explain the following:
1. What is an ANCOVA test?
2. When should I use it?
3. How does it work?
4. What are the main benefits and limitations compared to other statistical tests?
Your explanation should be clear, concise, and easy to understand for someone with limited statistical knowledge. Additionally, please provide practical examples or analogies to illustrate your points and ensure that the response is informative and educational.
Similar to the prompt in example 1, you can see that I gave ChatGPT clear direction. You could just ask “explain what ANCOVA is” and you’ll probably get something useful. But I find that the more specific and clear you are, the better the output will be. In this case, I clarified my knowledge level (i.e. beginner) and provided some specific questions I wanted answers to.
Below is the output. Note that the image is cut off because the response is long, but I’ve included a link below where you can read the full thread.
Using ChatGPT to explain complex topics
As I mentioned earlier, the great thing about learning through ChatGPT is the ability to ask follow-up questions and go deeper into concepts you may be struggling with. Staying with the ANCOVA example above, let’s say I want to ask a follow-up where ChatGPT provides a detailed case study of how to run an ANCOVA test. Here’s another sample prompt:
Follow-up prompt (using GPT-3.5 and a free plan):
Please provide a real-world example illustrating the application of ANCOVA. The example should include a specific case study, calculations, outputs, and interpretation of an ANCOVA test. Your explanation should demonstrate how ANCOVA is used to analyze data in a practical scenario, incorporating relevant details and steps involved in the process.
With this follow-up prompt, ChatGPT created a comprehensive case study complete with a hypothesis and detailed steps on how to run the calculations. It’s really quite impressive.
ChatGPT case study on ANCOVA
You can view the entire chat thread via this link.
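If you want something to check a case study like this against, the one-way ANCOVA model with a single covariate is usually written as below. This is the common textbook notation, not something taken from the ChatGPT output, so treat it as a reference sketch.

```latex
% Y_{ij}: outcome for respondent j in group i
% \mu: overall mean, \tau_i: effect of group i
% X_{ij}: covariate value, \bar{X}: overall covariate mean
% \beta: common slope for the covariate, \varepsilon_{ij}: random error
Y_{ij} = \mu + \tau_i + \beta \, (X_{ij} - \bar{X}) + \varepsilon_{ij}

% Group means adjusted for the covariate:
\bar{Y}_{i,\mathrm{adj}} = \bar{Y}_i - \hat{\beta} \, (\bar{X}_i - \bar{X})
```

In plain terms, the test compares group means after removing the part of the outcome that the covariate explains.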
3. Use AI to clean your data
ChatGPT can be very helpful when it comes to cleaning and processing raw datasets. However, I will say upfront that this is usually where you’re going to need a paid plan with GPT-4. Aside from giving you a more powerful model, the paid plan also allows you to upload files and datasets, which is going to be important for this use case.
With that said, I’m still going to show an example of data cleaning that can be accomplished on a free plan using GPT-3.5.
Let’s say you ran a survey and asked an open-ended question about top-of-mind brands. In the world of market research, we usually call this spontaneous or unaided awareness. This is often a good way to understand which brand within a sector has the greatest mindshare among consumers. The problem is that cleaning and preparing this type of data can be a pain.
One issue you’ll encounter is that survey respondents will often mistype common brand names. For example, the smartphone brand Samsung could be mistyped as "Sumsung," "Samsng," or "Samsing."
The good news is that ChatGPT can clean this data so we can quickly calculate how often specific brands are mentioned.
Prompt (using GPT-3.5 and a free plan):
I have some raw text data from a survey where I asked respondents about a brand that comes to mind when they think about smartphones. However, the raw text data contains various misspellings of the brands. For example, the brand Samsung can sometimes be typed as "Sumsung", "Samsng", or "Samsing". I want to be able to count the frequency that each brand was typed, but to do this I need your help with combining all of the common misspellings of the various brands into the correct spelling, with a count of how many times the brand was typed. Here is the dataset:
I’ve removed the brand list that I included in the original prompt as it includes 200 rows of open-ended text. But as you can see, this prompt is a little more comprehensive. Since we’re cleaning a very specific type of data (i.e. quant survey data), I need to provide information about the dataset as well as clear instructions about what I want ChatGPT to do.
This prompt's output was impressive (shown below), even using GPT-3.5 on a free plan. This is a task that’s still being done manually by some researchers and which can take hours. ChatGPT did it in about 20 seconds, for free.
Data cleaning with ChatGPT
You can view the entire chat thread via this link.
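It’s worth spot-checking counts like these. Here’s a minimal Python sketch that does a similar clean-up locally using fuzzy string matching from the standard library. This is a different technique from whatever ChatGPT does internally. The brand list, the sample responses, the 0.75 similarity cutoff, and the "Other" bucket are all my assumptions for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical list of correct brand spellings; replace it with your own.
KNOWN_BRANDS = ["Samsung", "Apple", "Nokia"]


def normalize_brand(raw, known=KNOWN_BRANDS, cutoff=0.75):
    """Map a raw response to the closest known brand, or "Other" if none is close."""
    lookup = {brand.lower(): brand for brand in known}
    # cutoff is an assumed similarity threshold between 0 and 1; tune it on your data.
    matches = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else "Other"


def count_brands(responses):
    """Count how often each cleaned brand name appears."""
    return Counter(normalize_brand(response) for response in responses)


# Invented sample responses, including the misspellings mentioned above.
sample = ["Samsung", "Sumsung", "Samsng", "Samsing", "Apple", "Aple", "Nokia", "xyz"]
brand_counts = count_brands(sample)
```

Anything that lands in "Other" is worth reviewing by hand, since a strict cutoff will miss heavy misspellings and a loose one can merge brands with similar names.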
4. Use AI to process and transform data
Aside from cleaning datasets, you can also use ChatGPT to reformat or transform your data to facilitate data analysis. This example will require the ability to upload files to ChatGPT, so you will require a paid plan and access to GPT-4.
For this example, let’s assume you’re working with a survey dataset. A common type of analysis is the creation of what’s called a crosstab or crossbreak. This is where you take one variable (i.e. question) within your survey and break it down by other variables. The image below shows an example of what a crosstab looks like.
Sample crosstab
Creating crosstabs in commercial software like SPSS or JMP is usually pretty straightforward. But this software is expensive. And if you don’t have access to commercial-grade statistics software, creating crosstabs in Excel can be tedious and time-consuming.
This is where ChatGPT can be quite helpful. In this example, I uploaded a survey dataset (in a raw, respondent-level format) to ChatGPT and then used the following prompt.
Prompt (using GPT-4 on a paid plan):
Attached is a survey dataset. Each row in the file represents a unique survey respondent, and every column contains variables for different questions asked in the survey. Each column header includes a unique question ID (e.g. Q1, Q2, Q3, etc) and the full question text. Below is an example of the column headers with the question text and question IDs.
- Q1: What is your gender?
- Q2: What is your age?
- Q3: What is your favourite colour?
The cell values in the survey contain numbers or text. Text values denote a specific response option that was selected (e.g. Strongly disagree). Numeric values are used for multi-select questions where 1 = selected and 0 = not selected. I would like you to create a crosstab that shows Q3 as the base question, and then Q2 (gender) across columns. Also, please show the cell values as percentages and not counts.
As with all examples, you can see my prompt provides information about the dataset and its structure and clear instructions about what I want. GPT-4 is astonishingly good at reading and understanding a dataset with limited instruction. Even still, I always recommend you take a few extra minutes to flesh out your prompt and explain the dataset structure.
Based on this prompt, the image below shows what ChatGPT produced. In this case, the output is in a table format within the chat that I can copy and paste into Excel. But as I mentioned earlier, GPT-4 does have the ability to give you a downloadable CSV or XLS file.
ChatGPT creating crosstabs
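To verify a crosstab like this, or to build one without paid software, here’s a minimal Python sketch that computes column percentages from respondent-level rows. The field names ("segment" and "favourite_colour") and the six sample respondents are invented for illustration; they are not from my survey file.

```python
from collections import Counter, defaultdict


def crosstab_percent(rows, row_var, col_var):
    """Return {column value: {row value: percentage}} using column percentages.

    Each column sums to 100, i.e. the share of respondents within that
    column group who gave each answer to the base question.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[col_var]][row[row_var]] += 1

    # Include every answer in every column so missing combinations show as 0.
    row_values = sorted({row[row_var] for row in rows})

    table = {}
    for col_value, answer_counts in counts.items():
        total = sum(answer_counts.values())
        table[col_value] = {
            answer: round(100 * answer_counts[answer] / total, 1)
            for answer in row_values
        }
    return table


# Hypothetical respondent-level data: one dict per respondent.
respondents = [
    {"segment": "A", "favourite_colour": "Blue"},
    {"segment": "A", "favourite_colour": "Blue"},
    {"segment": "A", "favourite_colour": "Red"},
    {"segment": "A", "favourite_colour": "Green"},
    {"segment": "B", "favourite_colour": "Red"},
    {"segment": "B", "favourite_colour": "Red"},
]
crosstab = crosstab_percent(respondents, row_var="favourite_colour", col_var="segment")
```

I used column percentages here because that’s the usual reading of a crosstab (what share of each group picked each answer). If you want row percentages instead, swap the two variables.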
5. Use AI to analyze and summarize your data
An important part of being a data analyst is drawing conclusions after cleaning and processing your data. And this is something I’ve found ChatGPT is rather good at.
Text summarization is, after all, one of the most common ways people are using generative AI and LLMs today. However, there’s a big difference between producing words from words (e.g. making a short summary from long-form text) and producing words from numbers (e.g. creating a short summary from stats). And yet, I stand impressed at what ChatGPT can do here.
In this first example, I’m going to upload a survey dataset to ChatGPT and ask it to make some general observations. This example is actually sourced from the same chat thread in example 4 above. So I’ve already described the data to ChatGPT. Here’s my prompt.
Prompt (using GPT-4 on a paid plan):
Based on the provided dataset, please analyze and present the top 3 to 5 most interesting observations, highlights, or trends. This may include identifying segments (e.g. age, gender, etc) that were more likely to respond in a certain way, significant patterns, or unexpected insights from the data.
Your analysis should be detailed and insightful, focusing on the most compelling aspects of the dataset. Provide a clear and concise summary that highlights the key findings, ensuring that the trends or observations are presented in an engaging and informative manner.
Please ensure that your response encourages creativity and originality in identifying and presenting the most compelling insights from the dataset while maintaining accuracy and relevance.
As with any prompt, the outputs will vary from one run to the next. But compared to other use cases, I find that asking ChatGPT to analyze data produces especially varied and unpredictable responses. So, consider re-running the prompt a few times to see what you get back. Or, you can simply go deeper into the chat thread and ask ChatGPT more specific follow-up questions.
The image below shows the output from my prompt above.
ChatGPT and data analysis
What I like about this output is the structure. ChatGPT has given me observations about the data broken down by a few areas, from demographic trends to insights related to open-ended data. It’s really quite good for the first response to my prompt.
But you shouldn’t stop here. The real power of ChatGPT will shine when you ask follow-up questions and dig deeper. So, make sure you’re not stopping at a single prompt.
6. Use AI to visualize your data
ChatGPT can also create charts! But before we go deeper into this one, you need to know a few things. First, like examples 4 and 5, you will need a paid subscription to ChatGPT. The second point is that ChatGPT charting doesn’t afford much control over formatting and visuals. So, if you want to do a lot of special formatting, then this may not be the most effective way to create charts and data visualization.
But, if you want to create some quick visuals from a dataset, this could be a great solution.
I’ll be building on the survey example from earlier. So, you can assume that I’ve already uploaded and described a dataset to ChatGPT. Here’s an example prompt.
Prompt (using GPT-4 on a paid plan):
Based on the provided dataset, I want you to create charts for Q1 and Q2. Please use horizontal bar charts to visualize the data.
Straight out of the gate, here’s what ChatGPT gave me.
Charting with ChatGPT
Not bad, but there are a few things we need to fix.
These charts use counts when I wanted the results shown as a percentage (though I didn’t mention this to ChatGPT). The response attributes are also ranked highest to lowest, which is odd for this type of data since it’s based on an ordinal scale (i.e. you would typically show this using the original order in the survey). I also wanted to add some simple formatting to highlight the most selected option.
To address these points, I followed up with this prompt.
Follow-up prompt (using GPT-4 on a paid plan):
Modify the two provided charts according to the following specifications:
1. Ensure that all bars in the chart are the same color (#d4d4d4), except for the bar representing the greatest response, which should be a different color (#3c8db5).
2. Display the percentages (based on the total sample) on the bars instead of the counts.
3. Re-order the bars based on the scale order. For Q1, "I love them" should be at the top and "I hate them" at the bottom. For Q2, "very likely" should be on the top and "very unlikely" at the bottom. Please ensure that all other options are ordered accordingly (e.g., "I love them," "I like them a lot," "I like them a little," etc).
Again, you can see that I was specific about what I wanted, down to the HEX codes for the colours that I want to see. The image below shows the revised output.
Formatted charts with ChatGPT
These charts are definitely looking better.
Again, this may not be the most efficient or effective way to build charts. Creating charts directly in Excel is usually faster and gives you more control.
But what excites me about this is that I created these charts from a raw dataset. When visualizing data in a tool like Excel, you usually need to spend time cleaning, formatting and processing the data first. But with this example, all of these steps (i.e. cleaning, processing, visualizing) were achieved entirely through a series of prompts in a chat window, starting with a raw unprocessed dataset.
This represents a totally different paradigm for data analysts. Although ChatGPT and Gen-AI have a ways to go when it comes to crunching datasets at scale, I’m really impressed at what they can do today.
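For comparison, the three formatting rules from my follow-up prompt (percentages instead of counts, bars in scale order, one highlighted bar) can also be sketched in Python. This is a minimal sketch, not the code ChatGPT ran. The four-point scale and the sample responses are hypothetical, and the plotting function assumes matplotlib 3.4 or later is installed.

```python
from collections import Counter

HIGHLIGHT = "#3c8db5"  # colour for the most selected option (from the prompt)
NEUTRAL = "#d4d4d4"    # colour for all other bars (from the prompt)


def chart_spec(responses, scale_order):
    """Return labels, percentages, and bar colours in the survey's scale order."""
    counts = Counter(responses)
    total = len(responses)
    percentages = [100 * counts[label] / total for label in scale_order]
    top = max(percentages)
    colours = [HIGHLIGHT if pct == top else NEUTRAL for pct in percentages]
    return list(scale_order), percentages, colours


def plot_horizontal_bars(labels, percentages, colours, title, path):
    """Draw the chart and save it to a file (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    bars = ax.barh(labels, percentages, color=colours)
    ax.invert_yaxis()                  # first scale item at the top
    ax.bar_label(bars, fmt="%.0f%%")   # percentage labels on the bars
    ax.set_title(title)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


# Hypothetical scale and responses; the middle options are assumed.
SCALE = ["Very likely", "Somewhat likely", "Somewhat unlikely", "Very unlikely"]
answers = (
    ["Very likely"] * 2
    + ["Somewhat likely"] * 5
    + ["Somewhat unlikely"] * 2
    + ["Very unlikely"]
)
labels, percentages, colours = chart_spec(answers, SCALE)
```

Calling plot_horizontal_bars(labels, percentages, colours, "Q2", "q2.png") would then save the chart. Keeping the data preparation separate from the plotting makes it easy to check the percentages before you draw anything.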
That’s a wrap
That’s all for today. I hope these examples and prompts will help you along the way. If you liked this post, I’d greatly appreciate it if you could share it on your socials.
Thanks and see you next time.