In-Video Analysis: What Social Videos Actually Reveal

Mya Achidov
May 31, 2026

The caption says "first impressions of the new launch." The creator's tone says something else, the product on screen is sitting next to a competitor's, and the comment section is mostly laughing.

A text-based social listening tool sees the caption and logs a positive mention. Everything that mattered happened inside the video, and the dashboard never noticed.

This is the gap in-video analysis exists to close, because most of what a social video communicates about a brand lives outside the caption, in the speech, the visuals, the audio, and the audience reaction. Reading all of it is the work dig was built to do.

What you'll learn

  • The three things every social video communicates: what's being said, what's being shown, and what's being felt

  • Why measuring sentiment from text alone misses most of what a video actually says

  • How in-video analysis picks up sarcasm, emotion, and story at scale

  • The authenticity gap: how to tell whether a reaction is real or manufactured

  • What it takes to read in-video content reliably, and trace every insight back to its source

What is in-video analysis?

In-video analysis is the practice of looking at what's actually happening inside a social video: the speech, the visuals, the emotion, the audience reaction, and how all four connect. It treats the video itself as the thing to study, not the caption or hashtags around it.

The framework breaks down into three layers, each carrying signals a text-only tool can't read.

  • What's being said. Every spoken word, line of dialogue, voiceover, on-screen text, and bit of background audio gets transcribed and sorted into categories like promotional, testimonial, tutorial, complaint, and critique.

  • What's being shown. Products, environment, on-screen text overlays, gestures, facial expressions, and brand placement in the background all get identified and turned into data you can search and act on.

  • What's being felt. The tone in someone's voice, the feeling behind their delivery, the way the comment section reacts, and how the story unfolds across the clip all get read together to land on a single take on the emotion.

A platform that reads one or two of these layers gives you partial intelligence. A platform that reads all three, and then asks whether the emotion is real or manufactured, gives you social media intelligence.

How does in-video analysis differ from social listening?

Where social listening reads the text around a video, in-video analysis reads the video itself.

Social listening tracks mentions, hashtags, comments, and captions. It scores sentiment from the words people typed and counts engagement from likes and shares, and it's genuinely useful for measuring volume and trend on text-heavy platforms.

In-video analysis sits at a different layer, reading the speech inside the video, the products and scenes on screen, the creator's tone, and the audience reaction together. It catches a sarcastic positive that text sentiment would log as a win, flags a competitor's logo in the background of an unboxing video, and surfaces a tutorial that misrepresents a feature before the comment section turns into a critique.

The two stack: social listening tells you something is being said, and in-video analysis tells you what it actually means.

What is being said in social videos?

The first layer is what's said out loud, where the spoken word gets transcribed, the intent behind it gets read, and the content gets grouped into themes that matter for the brand.

Accuracy on the transcription itself is the floor for everything that comes after, because a misheard transcript sends the rest of the analysis in the wrong direction: wrong sentiment, wrong topic, wrong account. Accuracy here isn't a nice-to-have. It's the starting point.
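To make "transcription accuracy" concrete, here is a minimal sketch of word error rate (WER), one common way to score a transcript against a human reference. It is an illustration only: the function name `word_error_rate` is hypothetical, and nothing here is meant to describe how dig measures its own accuracy.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One misheard word ("launch" heard as "lunch") out of six.
wer = word_error_rate(
    "best launch of the year honestly",
    "best lunch of the year honestly",
)
```

A single substitution like this is a small error rate on paper, but it is exactly the kind of mistake that moves a mention from the product-launch topic to something unrelated, which is why the transcript is the floor for every later layer.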

How does in-video analysis detect sarcasm and tone?

Sarcasm is what breaks text-based sentiment most often, like when a creator says "Best launch of the year, honestly" while rolling their eyes, and the words score positive while the video tells a different story.

The way to catch it is to read three things at once: the words being said, how they're being said (the tone, the pitch, the pacing), and what's happening on screen (facial expression, body language, gesture). When the three don't line up, the real feeling underneath is the one to trust, not the surface words. The same approach catches dry humor, sincerity that sounds like sarcasm, and culturally specific cues that text-only tools take literally.
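The mismatch idea can be sketched in a few lines. This is a simplified illustration, not dig's implementation: it assumes each signal has already been scored between -1.0 (negative) and 1.0 (positive) by some upstream model, and the names `ModalityScores` and `read_sentiment` and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModalityScores:
    """Sentiment per signal, each assumed to be in [-1.0, 1.0]."""
    words: float    # what the transcript says
    tone: float     # how it is said: pitch, pacing, delivery
    visuals: float  # what is on screen: expression, gesture, overlays


def read_sentiment(scores: ModalityScores, mismatch_threshold: float = 0.8) -> dict:
    values = [scores.words, scores.tone, scores.visuals]
    # A wide gap between the most positive and most negative signal
    # means the three do not line up.
    mismatch = (max(values) - min(values)) >= mismatch_threshold
    if mismatch:
        # Trust delivery and visuals over the surface words.
        sentiment = (scores.tone + scores.visuals) / 2
    else:
        sentiment = sum(values) / len(values)
    return {"sentiment": round(sentiment, 2), "mismatch": mismatch}


# "Best launch of the year, honestly" said with an eye roll.
sarcastic = read_sentiment(ModalityScores(words=0.8, tone=-0.6, visuals=-0.7))
sincere = read_sentiment(ModalityScores(words=0.7, tone=0.6, visuals=0.8))
```

In the sarcastic case the words alone would score positive, while the combined read comes out negative and the mismatch flag marks the clip for a closer look.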

What is being shown in social videos?

The second layer is what's on screen, from the products and the environment to the on-screen text overlays, the gestures, and the brand placement in the background, where a creator's framing of a story actually lives.

In-video analysis identifies logos, products, and people on screen, and reads the environment around them (kitchen, storefront, street, studio, wherever the video is filmed). It also follows what's happening across frames: the movement, the cuts, and the on-screen text the creator typed into the video itself, which often contradicts or sharpens what's being said out loud.

For brand teams, this layer is where you catch trademark misuse, impersonation, and counterfeit goods. A counterfeit unboxing video reads as a sincere product review if you only look at the caption, and as a clear infringement when you actually watch the video. Whether your legal team has a real case often depends on whether your monitoring tools can tell those two things apart.

How does in-video sentiment differ from text sentiment?

Where text sentiment scores the words, in-video sentiment scores the experience.

A text-based tool reads "love this brand" as positive, while an in-video tool reads the same words coming from a creator with crossed arms, in a sarcastic tone, with the on-screen text saying "yeah, right," and lands on the opposite score, which is the accurate one.

The gap matters most in three places. The first is critique dressed up as praise, where sarcasm, irony, and cultural inside jokes read as positive in text and negative in video. The second is engagement that looks like endorsement, where comments often contradict the creator's surface message and only an in-video read picks up the contradiction. The third is visual context that text can't see, things like a positive review filmed next to a competitor's product, an unboxing that subtly mocks a feature, or a tutorial that shows the product failing, all of which read as positive in text and negative when you actually watch the video.

If a tool only scores sentiment from text, the more of the conversation that lives in video, the less you can trust the score.

Is what's being felt actually real?

In-video sentiment tells you what audiences feel, but real social media intelligence asks one more question: is the feeling organic, or is it engineered?

Bot networks, deepfakes, and paid amplification campaigns can manufacture emotional reactions at scale. A surge of outraged comments on a launch video might come from real customers, or from a network of accounts created in the last 30 days posting in coordinated waves. A flood of glowing testimonials might be genuine word-of-mouth, or a paid campaign with the same five sentences rephrased across two thousand profiles. The sentiment read looks identical either way, but the right response is the opposite.

This is the layer most video analysis tools skip, and covering it takes three checks running together. The first looks at the accounts themselves: when they were created, how they post, and how they're connected to each other. The second looks at the content, and whether it's a deepfake or AI-generated. The third looks at how the post is spreading, and whether the curve matches how real people share things or is too clean to be organic. When the three point at coordination rather than a real audience, the moment gets flagged as manufactured and the response goes down a different path.
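Here is a rough sketch of how three checks like these could be combined. Everything in it is an assumption made for illustration: the `Account` type, the helper names, the thresholds, and the `synthetic_content_score` input (a stand-in for the output of a separate deepfake or AI-content detector). The spread check is also a deliberately simple proxy that looks for posts bunched into a short burst. It is not a description of dig's checks.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Account:
    created: date


def share_of_new_accounts(accounts, as_of, max_age_days=30):
    """Fraction of accounts created within the last `max_age_days`."""
    if not accounts:
        return 0.0
    new = sum(1 for a in accounts if (as_of - a.created).days <= max_age_days)
    return new / len(accounts)


def burst_share(post_minutes, window=10):
    """Largest fraction of posts that fall inside any `window`-minute span."""
    if not post_minutes:
        return 0.0
    times = sorted(post_minutes)
    best, start = 0, 0
    for end in range(len(times)):
        # Slide the window start forward until the span is under `window`.
        while times[end] - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best / len(times)


def looks_manufactured(accounts, as_of, synthetic_content_score, post_minutes):
    flags = {
        "accounts": share_of_new_accounts(accounts, as_of) >= 0.5,
        "content": synthetic_content_score >= 0.5,
        "spread": burst_share(post_minutes) >= 0.6,
    }
    # Flag only when all three checks point the same way.
    return all(flags.values()), flags


as_of = date(2026, 5, 31)
suspicious_accounts = [Account(date(2026, 5, 20))] * 8 + [Account(date(2024, 1, 1))] * 2
flagged, flags = looks_manufactured(
    suspicious_accounts, as_of, 0.9, [0, 1, 2, 3, 4, 5, 6, 200, 400, 600]
)
organic, _ = looks_manufactured(
    [Account(date(2023, 3, 1))] * 10, as_of, 0.1, [0, 45, 130, 300, 610, 900]
)
```

Requiring all three flags keeps a single noisy signal, such as a genuinely viral burst from long-standing accounts, from being labeled as manufactured on its own.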

The result is a sentiment read brand teams can actually act on, because it comes with an answer to the question that decides the response: is the feeling real or manufactured?

In-video analysis: dig vs text-only social listening

Capability | Text-only social listening | dig in-video analysis
What it reads | Captions, hashtags, comments | Speech, visuals, audio, comments, on-screen text
Sentiment basis | Words around the video | Words, tone, and visuals, read together
Sarcasm handling | Read literally | Caught when words, tone, and visuals don't line up
Visual context | Invisible | Logos, scenes, products, gestures, on-screen text
Authenticity layer | None | Account checks, deepfake detection, spread-pattern checks
Accuracy | Caption-bound | 95% on transcription, sentiment scored across everything else
Source traceability | Comment-level | 100% to the original video, frame, and account

Key takeaways

  • In-video analysis reads three layers of every social video: what's being said, what's being shown, and what's being felt. Text-only social listening reads none of them directly.

  • Most of a social video's meaning lives outside the caption. A monitoring tool built on text alone is, by design, blind to most of what matters to your brand in 2026.

  • Reading words, tone, and visuals together beats reading text alone on sarcasm, visual context, and audience reaction. It's the only kind of sentiment read you can trust when the conversation lives in video.

  • The authenticity layer is what separates social listening from social media intelligence. Checking who's posting, whether the content is real, and how it's spreading tells brand teams whether a sentiment surge is real or manufactured, which changes the response.

  • dig transcribes in-video content at 95% accuracy with 100% source traceability, so every insight links back to the original video, frame, and account.

The bigger picture

Your brand is already inside thousands of social videos right now, and reading the caption only shows you the surface. Reading the speech, the visuals, the audio, the audience response, and whether the momentum behind it is real shows you what your stakeholders are actually hearing.

The teams that win the next cycle of brand reputation are the ones that stop treating video like a text problem.

See what dig finds inside your videos.

FAQs

What is in-video analysis in social media intelligence?

In-video analysis is the practice of reading what's actually happening inside a social video (the speech, the visuals, the audio, and the audience reaction) instead of just looking at the caption and comments around it. In a social media intelligence context, it's the layer that turns raw video into structured insight: who's saying what, what's on screen, how the audience is reacting, and whether the engagement is real or coordinated.

How does sentiment analysis work inside social videos?

In-video sentiment analysis combines three signals to land on a single emotional read. The first is the words themselves, transcribed and read for tone and intent. The second is the way those words are spoken: the pitch, pacing, and vocal cues that flag sarcasm, urgency, or sincerity. The third is what's on screen: the facial expression, gesture, and visual context. When the three point in the same direction, the read is high confidence. When they don't, the mismatch itself becomes the signal, which is how the system catches sarcasm and irony that text-only tools miss.

What is the difference between social listening and in-video analysis?

Social listening reads the text around a video, including captions, hashtags, mentions, and comments, while in-video analysis reads the video itself, including speech, visuals, audio, and audience response. Social listening tells you something is being said, while in-video analysis tells you what it actually means and whether the reaction is real. The two stack: social listening produces volume and trend data, and in-video analysis produces interpretation you can act on.

How accurate is dig's in-video analysis?

dig transcribes across 95+ languages with 95% accuracy on the transcription itself, then layers sentiment analysis across the words, tone, and visuals on top of that. Every insight traces back to the original video, frame, and account, with 100% traceability across the platform. The accuracy of the transcription matters because every layer that comes after (sentiment, topic, account) depends on the words being heard correctly in the first place.

What social media platforms does dig in-video analysis cover?

dig in-video analysis covers the major video-first social platforms where brand narratives form, including TikTok, Instagram Reels, YouTube and YouTube Shorts, Facebook video, X video, and LinkedIn video. The system also handles long-form video and live-stream content, plus the comment threads and remix trees that spread from each.

What is the authenticity gap in social video?

The authenticity gap is the difference between sentiment that's real and sentiment that's engineered. A flood of outraged comments on a launch video can come from real customers, or from a coordinated network of bot accounts, and the sentiment read looks identical even though the action the brand should take is opposite. dig closes the gap by checking the accounts, the content, and the spread pattern, so brand teams know whether they're responding to organic reaction or manufactured momentum.

Mya Achidov

Mya leads product and content marketing at dig, writing at the intersection of culture, brand, and social video. She helps global organizations go beyond the text, surfacing the narratives, signals, and reactions happening inside social video so they can shape the conversation on their terms, in real time.
