Calculating your visitors' reading time on a blog article effectively

A few days back, I started wondering how I could measure the actual time users spend on my blog posts as accurately as possible. Well, that's easy, right? You record the time when the user visits the page and the time when the user leaves it. But this solution is not even close. In my opinion, this is quite a complicated problem once you think about all the possible scenarios: page refreshes, multiple tabs, idle time, scroll position, etc. One might say that Google Analytics already has this metric. That's true, but the metric is not accurate: I might stare at the header or footer and GA will still count that time. This article is about the actual reading time of your article, not of any other element on the page.

Assumptions

I am going to make a few assumptions before diving deep into the solution. It's a big challenge to understand user intentions, and if we do not make certain assumptions, our data is going to contain a lot of noise. For instance, let's say you took 4 mins to read a post for the first time. If you return and read it again after a few hours or the next day, you are going to take less time, let's say 2 mins. If we consider both reading times, the average comes to 3 mins. This is not so bad for one reader, but it gets significantly worse across a bigger user base if you do not account for such scenarios. So it's better to eliminate such repeat reading times. Here are five more assumptions:

If the user visits the page and starts reading the article, but becomes inactive after x mins and does not become active again within 20 mins, we will record only 1 min of time. Here the assumption is that the article was not convincing enough to hold the user's attention.
If the user is reading the article and reaches the bottom, we will record the time spent between the start and the end of the article. Even if the user then scrolls back up to the middle of the article, we are not going to record this. Here we want to avoid over-capturing time.
The internet connection may drop while the user is reading the post, the device might shut down because the battery ran out, or the browser might deprioritise the tab and kill the timers. So we will send events every 10 secs to capture the reading time incrementally.
A few users with technical knowledge will try to manipulate the requests. So we need to identify them and remove the noise.
If the same user visits again after 3 days to read the blog, we will capture the reading time of this user.

Characteristics of a good solution

Now that we know what we want to achieve, it's a good idea to define a few attributes of the solution that make the measured time more accurate.

When not to record time?

The article is not in the viewport of the user.
The user is idle for x secs.
The user clicks on a different tab or the reading tab loses focus.
The article left the viewport of the user.
The user navigated to another page.
The user came back to the page after x mins.
If the user reaches the end of the article, the recording comes to an end.
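
The rules above can be read as a single guard function. The following is only a sketch: the `ReadingState` shape and its field names are hypothetical, the 10 second idle limit is borrowed from the `IDLE_TIMEOUT` constant in the component shown later in this post, and the 20 minute return limit comes from the server-side check. Navigating to another page is not modelled, since the page (and its timers) simply goes away in that case.

```typescript
// Hypothetical snapshot of what the page knows about the reader right now.
interface ReadingState {
  articleInViewport: boolean;    // the article is currently visible
  tabFocused: boolean;           // the reading tab has focus
  idleSeconds: number;           // seconds since the last user activity
  secondsSinceLastEvent: number; // gap since the previously recorded event
  reachedEnd: boolean;           // the end-of-article marker has been seen
}

const IDLE_LIMIT_SECONDS = 10;        // "x secs" in the list (assumed value)
const RETURN_LIMIT_SECONDS = 20 * 60; // "x mins" in the list (assumed value)

// Returns true only when none of the "do not record" rules applies.
function shouldRecord(state: ReadingState): boolean {
  if (!state.articleInViewport) return false;
  if (!state.tabFocused) return false;
  if (state.idleSeconds >= IDLE_LIMIT_SECONDS) return false;
  if (state.secondsSinceLastEvent > RETURN_LIMIT_SECONDS) return false;
  if (state.reachedEnd) return false;
  return true;
}
```

Keeping the rules in one pure function like this makes them easy to unit test separately from the browser APIs that feed it.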

It is important to note that users do not log in to read the post. So one could produce a lot of spam by posing as new users, adding bias to this metric. That's why it's extremely important to also record the IP address.

Sending events from browser

Below is the code that will run on each article. You can assign an id to the article that you want to monitor. We will use IntersectionObserver for this. We will also add an empty div at the bottom; if it becomes visible, the user has reached the end of the post. The rest of the code is self-explanatory, so I am not going into too much detail.

const EVENT_INTERVAL = 10; // seconds between two heartbeat events
const IDLE_TIMEOUT = 10; // seconds of inactivity before the user counts as idle

export const TimeSpent: FC<Props> = memo(
  ({ id, type, idleTimeMs = IDLE_TIMEOUT * 1000 }) => {
    const nodeRef = useRef<HTMLDivElement | null>(null);
    const isIntersecting = useIntersectionObserver(nodeRef, {});
    const isIdle = useIdle(idleTimeMs);
    const heartBeatRef = useRef<NodeJS.Timeout>();
    const endRef = useRef<HTMLDivElement>(null);
    const isEnd = useIntersectionObserver(endRef, {});

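    // Stop the heartbeat. Defined before startTimer because startTimer uses it.
    const stopTimer = useCallback(() => {
      clearInterval(heartBeatRef.current);
      heartBeatRef.current = undefined;
    }, []);

    const startTimer = useCallback(() => {
      // Clear any running interval first so two heartbeats never run at once.
      stopTimer();
      heartBeatRef.current = setInterval(async () => {
        const res = await sendEvent({
          id,
          type,
        });
        // The server tells us to stop, e.g. when the last event is too old.
        if (!res.continue) {
          setEndCookie(id);
          stopTimer();
        }
      }, EVENT_INTERVAL * 1000);
    }, [id, type, stopTimer]);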

    useEffect(() => {
      if (isEnd) {
        setEndCookie(id);
        stopTimer();
      }
    }, [id, isEnd, type]);

    useEffect(() => {
      nodeRef.current = document.querySelector('#article-content');
    }, []);

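    useEffect(() => {
      // The article was already read to the end; do not restart the timer.
      if (Cookies.get(getCookieName(id))) {
        return;
      }
      if (isIdle || !isIntersecting) {
        stopTimer();
        return;
      }
      startTimer();
      // Stop the heartbeat when the state changes or the component unmounts.
      return stopTimer;
    }, [isIdle, isIntersecting, startTimer, id]);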

    return <div ref={endRef} />;
  }
);

Handling Events in Database

From the frontend, we send an event every 10 seconds while the user is not idle and the article is in view. 10 seconds is a good interval because it also filters out users who visit only briefly. You may change it to 15 or 20. Our request body may look like this:

// api/page-time/route.ts

{
  postId: 3
}

On the server, we receive these packets and store them in the database. But you can imagine that on a big website with good traffic, this is going to flood your database. So always use a separate server and database for this logging. Also, we will be deleting all records that are older than 3 days; more about this later. In the database, we will have one row per visitor per article. Each row will look like this:

ip_address  x.x.x.x
post_id     5
page_time   140 (seconds)
createdAt   2024-02-21 13:29:54.290
updatedAt   2024-02-21 13:34:18.109

Every time we receive a packet, we add 10 seconds to page_time. We know that each packet arrives only after 10 seconds of reading. So if a request has arrived, it means the user was reading the article for at least 10 more seconds.

In this example, 140 seconds equals about 2.3 mins. We will also apply a check on the server to decide when not to accept these requests. Every time we update the record, updatedAt changes, so we know the time of the last request. If the current request comes more than 20 mins after that, we will ignore it.
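
To make the bookkeeping concrete, here is a small in-memory sketch of the same rule, with no database involved. The `PageTimeRow` shape, `applyHeartbeat` and `simulateReading` are hypothetical names used only for illustration; the 10 second increment and the 20 minute cutoff are the values described above.

```typescript
const HEARTBEAT_SECONDS = 10;     // each accepted packet is worth 10 seconds
const CUTOFF_MS = 20 * 60 * 1000; // packets later than 20 mins after the last one are ignored

interface PageTimeRow {
  pageTime: number; // accumulated reading time in seconds
  updatedAt: Date;  // time of the last accepted packet
}

interface HeartbeatResult {
  row: PageTimeRow | null;
  accepted: boolean;
}

// Applies one incoming packet to the stored row (or creates the row).
function applyHeartbeat(row: PageTimeRow | null, now: Date): HeartbeatResult {
  if (row === null) {
    return { row: { pageTime: HEARTBEAT_SECONDS, updatedAt: now }, accepted: true };
  }
  if (now.getTime() - row.updatedAt.getTime() > CUTOFF_MS) {
    // Too long since the last packet: leave the row untouched.
    return { row, accepted: false };
  }
  return {
    row: { pageTime: row.pageTime + HEARTBEAT_SECONDS, updatedAt: now },
    accepted: true,
  };
}

// Replays `packets` heartbeats sent 10 seconds apart and returns the stored seconds.
function simulateReading(packets: number): number {
  let row: PageTimeRow | null = null;
  const startMs = Date.UTC(2024, 1, 21, 13, 30, 0);
  for (let i = 0; i < packets; i++) {
    row = applyHeartbeat(row, new Date(startMs + i * HEARTBEAT_SECONDS * 1000)).row;
  }
  return row === null ? 0 : row.pageTime;
}
```

With this sketch, `simulateReading(14)` returns 140, which matches the example row: 14 packets of 10 seconds each.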

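import { headers } from "next/headers";
import { NextResponse } from "next/server";
// `prisma` is the app's shared PrismaClient instance; its import path depends on the project.

export async function GET(req: Request) {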
  const head = headers();
  const params = new URL(req.url).searchParams;
  const id = params.get("id");

  let ipAddress = head.get("x-real-ip") as string;

  const forwardedFor = head.get("x-forwarded-for") as string;
  if (!ipAddress && forwardedFor) {
    ipAddress = forwardedFor?.split(",").at(0) ?? "Unknown";
  }

  if (ipAddress && id) {

    const log = await prisma.pageTimeLog.findUnique({
      select: {
        updatedAt: true,
        page_time: true,
      },
      where: {
        ip_post_id: {
          ip: ipAddress,
          post_id: Number(id)
        },
      }
    });
    if (log && log.updatedAt) {
      if (isMinsAgo(log.updatedAt, 20)) {
        return NextResponse.json({ continue: false }, { status: 200 });
      }
    }
    await prisma.pageTimeLog.upsert({
      create: {
        ip: ipAddress,
        page_time: 10,
        snapshot: "",
        post: {
          connect: {
            id: Number(id)
          }
        }
      },
      update: {
        // Atomic increment avoids losing time when two requests overlap.
        page_time: { increment: 10 }
      },
      where: {
        ip_post_id: {
          ip: ipAddress,
          post_id: Number(id)
        }
      }
    })
    return NextResponse.json({ continue: true }, { status: 200 });
  }
  return NextResponse.json({ continue: false }, { status: 400 });
}


// Returns true when `date` is more than `ago` minutes in the past.
function isMinsAgo(date: Date, ago: number) {
  const diffMs = Date.now() - new Date(date).getTime();
  return diffMs > ago * 60_000;
}
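
Since the whole scheme depends on the IP address, the header lookup in the handler can also be pulled out into a small pure function, which is easier to test. This is a sketch only: `clientIp` is a hypothetical helper, and it assumes the app sits behind a trusted proxy that sets `x-real-ip` or `x-forwarded-for`. Without such a proxy, a client can send these headers itself and spoof its address.

```typescript
// Picks the client IP from the two header values used in the handler above.
// Returns null when neither header carries a usable value.
function clientIp(realIp: string | null, forwardedFor: string | null): string | null {
  if (realIp !== null && realIp.trim() !== "") {
    return realIp.trim();
  }
  if (forwardedFor !== null) {
    // x-forwarded-for is a comma-separated list; the first entry is the original client.
    const [first = ""] = forwardedFor.split(",");
    if (first.trim() !== "") {
      return first.trim();
    }
  }
  return null;
}
```

Returning `null` instead of a placeholder string keeps requests without any usable address from all being counted as one visitor.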

Removing noise

It's possible that someone might open the Chrome console and write a request loop that fires 100 times. We won't be able to prevent this. But since we track the IP, the user won't be able to continue this for the same post. Once we have sufficient users reading the post, we will be able to see the reading time of each user. We can collect all these reading times in an array and remove the odd ones out. Below is a simple function that sorts the array, takes the median, calculates the interquartile range (IQR), and keeps only the values that lie within 1.5 times the IQR of the median.

function removeNoise(numbers: number[]) {
  // Sort a copy so the caller's array is not mutated, then take the median
  const sorted = [...numbers].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];

  // Calculate the interquartile range (IQR)
  const q1 = sorted[Math.floor(sorted.length * 0.25)];
  const q3 = sorted[Math.floor(sorted.length * 0.75)];
  const iqr = q3 - q1;

  // Define a threshold for outliers
  const threshold = 1.5 * iqr;

  // Keep only the values within the threshold around the median
  return sorted.filter(item => Math.abs(item - median) <= threshold);
}

Once we have this array, we can take the average to get the actual reading time of the post.
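
Here is a small worked sketch of the two steps together. It restates the filter in compact form so that it runs on its own; `filterOutliers` and `averageSeconds` are hypothetical names, and the sample reading times are made up for illustration.

```typescript
// Compact restatement of the median/IQR filter described above.
function filterOutliers(times: number[]): number[] {
  const sorted = [...times].sort((a, b) => a - b);
  // Value at a given fraction of the sorted array (0.5 is the median).
  const at = (fraction: number): number => sorted[Math.floor(sorted.length * fraction)] ?? 0;
  const median = at(0.5);
  const threshold = 1.5 * (at(0.75) - at(0.25));
  return sorted.filter((t) => Math.abs(t - median) <= threshold);
}

// Plain arithmetic mean; returns 0 for an empty array.
function averageSeconds(times: number[]): number {
  if (times.length === 0) return 0;
  return times.reduce((sum, t) => sum + t, 0) / times.length;
}

// Made-up reading times in seconds: one bounce (10) and one forgotten tab (1000).
const sampleTimes = [10, 120, 130, 140, 150, 160, 1000];
const kept = filterOutliers(sampleTimes); // [120, 130, 140, 150, 160]
const average = averageSeconds(kept);     // 140 seconds, about 2.3 mins
```

For this sample the median is 140 and the IQR is 160 - 120 = 40, so values further than 60 seconds from the median are dropped. Without the filter, the average would be 1710 / 7, roughly 244 seconds, which shows how much two odd visits can distort the result.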

Cleanup

Since we are going to capture the reading time of a user who visits again after 3 days, we need to clean up the stored IPs. We do so by updating the relevant post in the posts table with the averaged reading time, and then deleting that post's log records. We can set up a cron job for this.
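
The selection step of such a job could look like the sketch below. It works on plain in-memory rows so that it stays independent of any database client; `LogRow` and `splitByAge` are hypothetical names, and using `createdAt` as the age of a record is an assumption.

```typescript
const RETENTION_DAYS = 3;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical in-memory view of one log record.
interface LogRow {
  ip: string;
  postId: number;
  pageTime: number; // seconds
  createdAt: Date;
}

// Splits rows into those older than the retention window and those still fresh.
function splitByAge(rows: LogRow[], now: Date): { expired: LogRow[]; fresh: LogRow[] } {
  const cutoffMs = now.getTime() - RETENTION_DAYS * MS_PER_DAY;
  const expired: LogRow[] = [];
  const fresh: LogRow[] = [];
  for (const row of rows) {
    if (row.createdAt.getTime() < cutoffMs) {
      expired.push(row);
    } else {
      fresh.push(row);
    }
  }
  return { expired, fresh };
}
```

A cron job would then fold the `expired` reading times into the post's stored reading time and delete only those rows, so that a returning visitor is counted again as a new reading.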

Summary

Calculating a realistic reading time is not easy. There can be a lot of edge cases. However, it's an interesting experiment that you can conduct to gauge the quality of your posts and how much attention they draw from readers. I have implemented this in Letterpad and am waiting for data to validate my change. If you want to read the full source code of how I have tackled this on the client side, you can visit this GitHub link.

I hope this post has shed some light on how a seemingly trivial problem becomes more challenging as we think deeper about all the ways it can go wrong. That's all folks, thanks for reading.