The Challenge of Distinguishing Reality: An Insight into Real vs. AI-Generated Images
Unraveling the enigma of perception, this exploration dives into our ability to discern genuine imagery from AI-crafted illusions, revealing startling insights into human intuition and the prowess of modern technology.
In today's digital age, where technology is rapidly evolving and artificial intelligence is creating content that rivals human-made productions, the question arises: Can we genuinely distinguish between real and machine-generated images? My son Ryan embarked on a journey to find out. Through an experiment, he sought to understand how well individuals can differentiate between real photographs and those generated by AI algorithms.
In this post we will explore the results, which occasionally dip into statistical analysis. I am a Six Sigma black belt and have studied statistical methods, but the statistics summarized below were produced by ChatGPT using its data analysis feature. I simply pointed the AI assistant at the data in an Excel file and asked questions: How many participants accurately identified the images? Did age or gender sway the outcomes, and if so, were the differences statistically significant? Can you plot these results? Could the high scores be explained by random guessing? Throughout my commentary I will include snippets of the Python code ChatGPT wrote to manipulate the data so it could answer these questions.
Interestingly, ChatGPT occasionally made mistakes in its code. When that happened, it read the resulting error message, rewrote the code, and tried again without being prompted.
At the end of the post, we will see the correct answers and the percentage of those who answered correctly.
Overall Results
The findings were intriguing. Out of all the participants:
- Real Images: Approximately 39.6% correctly identified real images.
- AI-Generated Images: Approximately 31.2% correctly identified AI-generated images.
import pandas as pd
# Load the Excel file into a DataFrame
df = pd.read_excel("/mnt/data/Real or AI-Generated Survey (Responses).xlsx")
# Display the first few rows to get an overview of the data
df.head()
# Extract the column names that indicate whether the image was real or AI-generated
image_labels = df.columns[4:]
# Create a dictionary to store the correct answers for each image
correct_answers = {}
for label in image_labels:
    # Extract the real label from the column name (either "Real" or "AI-generated")
    real_label = label.split(".")[0]
    correct_answers[label] = real_label
# Calculate the number of correct identifications by users for each image
correct_counts = {}
for label, real_label in correct_answers.items():
    correct_counts[label] = (df[label] == real_label).sum()
# Calculate the percentage of correct identifications for real and AI-generated images
real_columns = [col for col in image_labels if "Real" in col]
ai_columns = [col for col in image_labels if "AI-generated" in col]
real_correct_percentage = sum(correct_counts[col] for col in real_columns) / (len(df) * len(real_columns)) * 100
ai_correct_percentage = sum(correct_counts[col] for col in ai_columns) / (len(df) * len(ai_columns)) * 100
real_correct_percentage, ai_correct_percentage
These numbers underscore the remarkable capabilities of modern AI in creating lifelike visuals. The participants all faced challenges distinguishing between real and AI-generated images, highlighting the blurry line between reality and AI-produced content.
Does Age Matter?
When diving deeper into the data, we observed variations across age groups:
- Younger participants, specifically those aged 12 and under and 13-17, had accuracy rates of around 25.0% for real images.
- Interestingly, the age groups 50-64 and 18-24 showcased a higher accuracy rate nearing 43.75%.
# Calculate the percentage of correct identifications by age group for both real and AI-generated images
# Function to calculate the correct percentage for a given subset of data
def calculate_correct_percentage(subset_df, image_type_columns):
    return sum((subset_df[col] == correct_answers[col]).sum() for col in image_type_columns) / (len(subset_df) * len(image_type_columns)) * 100
# Calculate percentages by age group
age_groups = df['Age'].unique()
age_correct_percentages = {
    "Real": {},
    "AI-generated": {}
}
for age_group in age_groups:
    subset_df = df[df['Age'] == age_group]
    age_correct_percentages["Real"][age_group] = calculate_correct_percentage(subset_df, real_columns)
    age_correct_percentages["AI-generated"][age_group] = calculate_correct_percentage(subset_df, ai_columns)
age_correct_percentages
However, were these differences statistically significant? The Chi-Square tests ChatGPT ran gave a clear answer. A p-value below 0.05 would suggest age made a difference in answering correctly. In this case, the p-values were well above 0.05: 0.39 for real images and 0.29 for AI-generated images. The variations in accuracy among age groups were not statistically significant, so age does not appear to affect the ability to answer correctly.
from scipy.stats import chi2_contingency
# Create a contingency table for real images based on age groups
contingency_real_age = []
for age_group in age_groups:
    subset_df = df[df['Age'] == age_group]
    correct_count = sum((subset_df[col] == correct_answers[col]).sum() for col in real_columns)
    incorrect_count = len(subset_df) * len(real_columns) - correct_count
    contingency_real_age.append([correct_count, incorrect_count])
# Create a contingency table for AI-generated images based on age groups
contingency_ai_age = []
for age_group in age_groups:
    subset_df = df[df['Age'] == age_group]
    correct_count = sum((subset_df[col] == correct_answers[col]).sum() for col in ai_columns)
    incorrect_count = len(subset_df) * len(ai_columns) - correct_count
    contingency_ai_age.append([correct_count, incorrect_count])
# Conduct Chi-Square tests for age groups
chi2_stat_real_age, p_val_real_age, _, _ = chi2_contingency(contingency_real_age)
chi2_stat_ai_age, p_val_ai_age, _, _ = chi2_contingency(contingency_ai_age)
p_val_real_age, p_val_ai_age
This surprised me; I was hoping to see whether life experience made a difference in the results. When we review the images at the end of the post, we will see that AI makes identifiable mistakes.
Does Gender Matter?
We also asked participants to list their gender:
- Males had a slight edge on real images, with an accuracy rate of 39.73% compared to 35.39% for females.
Yet, when tested with a Chi-Square test, this difference was also not statistically significant, so the data does not support a gender effect. This did not surprise me; I can't imagine how gender would make a difference in identifying real content.
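The gender test followed the same recipe as the age test above. The snippet below is my reconstruction of that approach rather than ChatGPT's exact code, and it assumes the survey stored responses in a 'Gender' column:
from scipy.stats import chi2_contingency
# Reconstruction (not ChatGPT's exact code): Chi-Square test by gender for real images.
# Assumes a 'Gender' column exists in the survey DataFrame.
gender_groups = df['Gender'].unique()
contingency_real_gender = []
for gender in gender_groups:
    subset_df = df[df['Gender'] == gender]
    correct_count = sum((subset_df[col] == correct_answers[col]).sum() for col in real_columns)
    incorrect_count = len(subset_df) * len(real_columns) - correct_count
    contingency_real_gender.append([correct_count, incorrect_count])
chi2_stat_real_gender, p_val_real_gender, _, _ = chi2_contingency(contingency_real_gender)
p_val_real_gender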
Diving into the Distribution: Understanding User Performance
Another aspect to consider is how individual users performed. The survey results were anonymous, so we used each response's unique timestamp as a placeholder for a user. I asked ChatGPT to visualize the distribution of scores using a histogram, a box plot, and a density plot. The density plot was used to compare the actual results to what the distribution would look like if everyone just guessed randomly.
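All three charts rely on two series: each participant's percentage of correct answers and a set of simulated scores from pure guessing. The snippet below is a sketch of how those could be computed; the variable names user_correct_percentages and simulated_scores match the plotting code further down, but the simulation details (a 50/50 guess for each of the 10 images, repeated for 10,000 simulated participants) are my assumption about ChatGPT's approach rather than its exact code.
import numpy as np
# Percentage of the 10 images each participant identified correctly
user_correct_percentages = df[image_labels].apply(
    lambda row: sum(row[col] == correct_answers[col] for col in image_labels),
    axis=1
) / len(image_labels) * 100
# Simulate participants who guess "Real" or "AI-generated" at random (50/50) on every image
np.random.seed(42)
simulated_scores = np.random.binomial(n=len(image_labels), p=0.5, size=10000) / len(image_labels) * 100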
A Peek at the Histogram
When we visualize the data using a histogram, a pattern emerges: a peak in the 20-30% range, meaning a sizable chunk of our participants correctly identified only about a quarter of the images. This suggests distinguishing between real and AI-generated images is no walk in the park, which aligns with the comments participants left on LinkedIn, where Ryan recruited them:
"I am not a good AI detector" - JT
"Even though it was just a survey, I still feel like I failed. 😂" - Sean A.
"The funny part about the survey is that I am not even able to spot any difference amongst the images." - Adewale A.

The Box Plot
For a more summarized view, the box plot shown above helps us see who scored above or below the median and which quartile each participant landed in. Here, the median score (the dark blue line inside the box) lies below 40%, meaning half of our participants scored below that mark and half scored above it. The box itself, representing the interquartile range, stretches from roughly 20% to 60%, so the middle 50% of participants (the 25th to the 75th percentile) scored between 20% and 60%.
Interestingly, the box plot's whiskers (indicating the data's general spread) show that most users scored between 10% and 70%. Scores outside this range, especially on the higher end, appear as outliers. The highest score was 90%. This person may have a keen eye, or it could be simple randomness. Below is a visualization comparing the distribution of actual scores (in blue) against simulated scores (in red) representing random guessing.
The Density Distribution With Random Guessing

import matplotlib.pyplot as plt
import seaborn as sns
# Create a combined plot of actual scores vs. simulated scores
plt.figure(figsize=(12, 7))
# Plot density of simulated scores
sns.kdeplot(simulated_scores, fill=True, label="Simulated (Random Guessing)", color="#e74c3c")
# Plot density of actual scores
sns.kdeplot(user_correct_percentages, fill=True, label="Actual Scores", color="#3498db")
plt.title("Distribution of Actual vs. Simulated Scores")
plt.xlabel("Percentage Correct")
plt.ylabel("Density")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
This suggests the highest score could have been achieved by lucky guessing. To rule that out, we would need to increase the number of images so a high score becomes much harder to reach by chance alone. Getting 90% correct on 20 images is far more likely to reflect skill than luck.
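To put numbers on that intuition, here is a quick back-of-the-envelope check (my addition, not part of ChatGPT's analysis) of how likely a pure 50/50 guesser is to reach 90% on 10 images versus 18 of 20 on a longer survey:
from scipy.stats import binom
# Probability of a pure guesser scoring 90% or better
p_10_images = binom.sf(8, 10, 0.5)   # P(at least 9 of 10 correct), about 1.1%
p_20_images = binom.sf(17, 20, 0.5)  # P(at least 18 of 20 correct), about 0.02%
print(p_10_images, p_20_images)
With only 10 images, roughly 1 in 100 random guessers reaches 90% or better; with 20 images, that drops to about 1 in 5,000.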
Inference from the Distribution
What does this distribution tell us? For one, it underscores the complexity and effectiveness of modern AI in generating images. Many participants gravitated towards the lower score ranges, emphasizing the difficulty. Yet, some participants seemed to have a sharp eye, scoring remarkably well and thus pushing the boundaries of the distribution.
As AI continues to evolve, its creations will only become harder to distinguish from reality. Let's look at the images, ordered from the most correctly identified to the least. There were 5 real and 5 AI-generated images, and to add variety, 2 of the 5 in each category were cats.
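The ordering below comes from the per-image correct counts computed in the first snippet; one way to produce it (my sketch, reusing the variables defined earlier):
# Percentage correct for each image, sorted from most to least recognizable
per_image_pct = {label: correct_counts[label] / len(df) * 100 for label in image_labels}
for label, pct in sorted(per_image_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {pct:.2f}% correct")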
Image #4 | Real | 54.61% Correct

Image #1 | Real | 48.68% Correct

Image #6 | AI | 47.37% Correct

Image #7 | Real | 40.79% Correct

Image #3 | Real | 36.18% Correct

Image #8 | AI | 36.18% Correct

Image #2 | AI | 28.29% Correct

Image #9 | AI | 25.66% Correct

Image #5 | AI | 18.42% Correct

Image #10 | Real | 17.76% Correct

Conclusion
Thank you to all who participated and helped bring this experiment to life. Ryan's experiment offered a compelling glimpse into the challenges presented by AI advancements. As technology continues to push boundaries, our ability to identify what is real will become more difficult. We may need new tools to navigate the digital realm with a discerning eye since ours may be unable to tell the difference.