From: hu-po
This article explores the accuracy and utility of specialized PDF summarizer AI tools, specifically PDFGPT.io and ChatPDF.com, in comparison to a generalist AI model like GPT-4. The evaluation uses two distinct academic papers: a generic deep learning survey and an obscure computational archaeology paper.
Initial Observations on Processing [02:25:00]
Upon upload, both PDFGPT.io and ChatPDF.com processed the documents almost instantly, which raised the suspicion that they do not scrape the entire PDF for text right away. They may instead acknowledge the upload and process the document in the background, or run a per-query similarity search rather than reading the whole document upfront [02:29:00], [02:42:42].
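Neither service documents its internals, so the following is only a minimal sketch of the retrieval pattern suspected here: split the PDF into page-sized chunks, embed them, and answer each query from the few most similar chunks. The `embed`, `cosine`, and `top_chunks` helpers and the sample `pages` dict are illustrative assumptions, and the toy bag-of-words embedding stands in for whatever real embedding model these tools use.

```python
# Minimal sketch: chunk a PDF per page, embed the chunks, and answer queries
# from the most similar chunks instead of reading the whole document.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_chunks(pages: dict[int, str], query: str, k: int = 2) -> list[tuple[int, float]]:
    """Score every page-sized chunk against the query and return the k best (page, score) pairs."""
    q = embed(query)
    scored = [(page, cosine(embed(text), q)) for page, text in pages.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical page texts standing in for a parsed PDF; keeping the page number
# with each chunk is what makes page-cited answers (as in PDFGPT.io) possible.
pages = {
    1: "Deep learning is a subfield of machine learning based on deep neural networks.",
    7: "Supervised, semi-supervised, unsupervised, and deep reinforcement learning are surveyed.",
    12: "AlexNet introduced dropout and local response normalization.",
}
print(top_chunks(pages, "What is deep learning?"))
```

Storing the source page alongside each chunk, as in the `pages` dict above, is also the simplest way to produce the clickable page references described in the feature comparison below.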
User Interface and Feature Comparison
- PDFGPT.io: Features a user interface that displays the PDF alongside the chat, allowing for direct visual reference [03:03:00]. It also provides clickable references to specific page numbers in the PDF where information was sourced [06:49:00], [17:51:00].
- ChatPDF.com: Does not display the PDF directly in the interface once uploaded, requiring separate viewing [03:18:00]. References are embedded directly within the answer text [18:01:00].
Comparative Performance on Information Retrieval
The evaluation involved asking specific questions derived from the content of the uploaded PDFs and comparing the answers from PDFGPT.io, ChatPDF.com, and GPT-4.
Test 1: “What is deep learning?” [03:39:00]
- GPT-4: Provided a general, accurate definition [04:21:00].
- ChatPDF.com: Offered a definition very similar to GPT-4’s, suggesting a reliance on general knowledge rather than specific PDF content [05:06:00].
- PDFGPT.io: Provided a definition more specific to the paper, including direct citations to page numbers, indicating it pulled information directly from the PDF [06:38:00].
- Verdict: PDFGPT.io was superior in leveraging the PDF content.
Test 2: “What are the different types of supervised learning?” [08:32:00]
- GPT-4: Gave a succinct and correct answer, identifying regression and classification [09:06:00].
- ChatPDF.com: Given a mistyped query (“soy provised” instead of “supervised”), ChatPDF struggled, listing model architectures (e.g., CNN, RNN) rather than types of supervised learning [10:40:00].
- PDFGPT.io: Provided the types of learning as categorized in the paper: supervised, semi-supervised, and unsupervised, also mentioning deep reinforcement learning [09:32:00].
- Verdict: PDFGPT.io was better at extracting paper-specific information. GPT-4 provided the most generally correct answer.
Test 3: “What is a key difference between traditional ML and DL?” [13:18:00]
- All three models (GPT-4, PDFGPT.io, ChatPDF.com) accurately identified feature extraction as a key difference, demonstrating effective similarity search and information retrieval [14:04:00], [15:36:00], [16:44:00].
Test 4: “When did Fukushima first describe the CNN?” [19:57:00]
- PDFGPT.io: Failed to find the information, likely due to the exact phrase “CNN” not being in the sentence containing the 1988 date [21:12:00].
- ChatPDF.com: Correctly found the 1988 date from the PDF [21:46:00].
- GPT-4: Provided a date (1979-1980) that contradicted the PDF’s 1988 date [20:40:00]. However, an external web search confirmed GPT-4’s date was actually correct regarding the original Neocognitron paper, revealing an inaccuracy in the survey PDF itself [23:30:00].
- Verdict: This highlights that generalist AIs might sometimes correct inaccuracies in the source material due to their broader training data.
Test 5: “What new types of architecture/design did AlexNet introduce?” [25:36:00]
- GPT-4: Identified larger/deeper networks, ReLU activation, and Dropout regularization [26:30:00].
- PDFGPT.io: Identified Local Response Normalization (LRN) and Dropout [27:13:00].
- ChatPDF.com: Identified deeper/wider models, convolutional layers, ReLU, and Dropout, but missed LRN [27:58:00].
- Verdict: PDFGPT.io was the only tool to extract both details specifically mentioned in the paper (LRN and Dropout).
Test 6: “Who proposed the Parametric ReLU and when?” [37:51:00]
- GPT-4: Accurately provided the full names (Kaiming He, et al.) and the publication year (2015), along with the paper’s title [38:50:00].
- PDFGPT.io & ChatPDF.com: Both correctly cited “Kim et al. in 2015” as found in the paper [39:31:00], [40:59:00].
- Verdict: GPT-4 demonstrated superior external knowledge, providing more comprehensive authorship details. The PDF bots faithfully reproduced the information as presented in the document, even if less complete.
Test 7: “What are different optimization methods for DL?” [43:13:00]
- GPT-4: Provided the most comprehensive list of optimization methods, including several not explicitly listed in the paper [46:36:00].
- ChatPDF.com: Accurately listed the five specific methods mentioned in the paper (SGD, AdaGrad, AdaDelta, RMSprop, Adam) in the exact order [43:52:00]. This suggests ChatPDF has a higher similarity threshold for information retrieval [44:53:00].
- PDFGPT.io: Gave a more generic, less specific answer, suggesting a lower similarity threshold at which it stops searching once it finds something “kind of similar” [45:47:00] (see the sketch after this list).
- Verdict: GPT-4 was the best source for general knowledge. ChatPDF was best at extracting specific lists from the paper.
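The threshold hypothesis above can be pictured with a short sketch. Nothing here reflects either tool’s actual implementation; the chunks, scores, and `retrieve` helper are assumptions used only to show how a higher or lower similarity cutoff changes what gets handed to the language model.

```python
# Made-up (chunk, similarity) scores for the optimization question; the point is
# only that the retrieval cutoff decides which chunks reach the language model.
def retrieve(scored_chunks: list[tuple[str, float]], threshold: float) -> list[str]:
    """Keep chunks whose similarity clears the threshold, best match first."""
    hits = [(text, score) for text, score in scored_chunks if score >= threshold]
    return [text for text, _ in sorted(hits, key=lambda h: h[1], reverse=True)]

scored_chunks = [
    ("SGD, AdaGrad, AdaDelta, RMSprop, and Adam are described in the paper.", 0.82),
    ("Optimization of deep networks is discussed in general terms.", 0.55),
    ("Deep learning is a subfield of machine learning.", 0.30),
]
print(retrieve(scored_chunks, threshold=0.8))  # strict cutoff: only the exact optimizer list
print(retrieve(scored_chunks, threshold=0.5))  # loose cutoff: the generic passage gets in too
```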
Test 8: “Can you provide a link to the Stanford Question Answering Dataset (SQuAD)?” [52:22:00]
- PDFGPT.io & ChatPDF.com: Both found the exact specific link within the PDF [53:02:00], [53:15:00].
- GPT-4: Surprisingly, GPT-4 also knew the exact, obscure link without being provided the PDF, demonstrating its vast training data and knowledge base [53:31:00].
- Verdict: All performed well, but GPT-4’s ability to recall such niche information without the document was particularly striking.
Test 9 (Obscure Paper): “Who and when recorded the original field notes for ashlars in Tiwanaku, Bolivia?” [58:29:00]
This test used a complex computational archaeology paper.
- ChatPDF.com: Successfully identified J.P. Protzen (mid-1990s), Léonce Angrand (1848), and Max Uhle (1893) from the paper [59:23:00].
- PDFGPT.io: Crashed and became unresponsive [59:58:00].
- GPT-4: Identified Max Uhle (with a slightly different but close date of 1895) and other researchers, again showcasing its broad general knowledge [01:01:03].
- Verdict: ChatPDF was the most reliable of the PDF tools for this obscure query.
Test 10 (Obscure Paper): “What is a Z-Corp Z310?” [01:02:59]
- ChatPDF.com: Provided a succinct summary of the printer’s technology as described in the PDF [01:03:21].
- PDFGPT.io: Provided details from the paper, including its use in creating 3D models [01:04:00].
- GPT-4: Provided a much more detailed and externally sourced answer, including manufacturer, acquisition by 3D Systems, and operational details, demonstrating knowledge beyond the specific PDF [01:05:07].
- Verdict: GPT-4 again offered a superior, richer answer due to its vast knowledge base.
Test 11 (Obscure Paper): “What university houses the School of Art and Architecture as well as the Young Research Library?” [01:09:32]
- ChatPDF.com: Initially stated it could not find the information in the PDF, but then performed an internet search and correctly identified UCLA [01:11:04]. This unexpected tool-use capability, falling back to a web search, was a surprise [01:11:55] (a minimal sketch of this fallback pattern follows this list).
- PDFGPT.io: Successfully found the information within the PDF and identified UCLA [01:12:12].
- GPT-4: Immediately knew the answer was UCLA [01:13:37].
- Verdict: All eventually provided the correct answer, but ChatPDF’s internet search capability was a notable feature.
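ChatPDF’s web-search fallback is not documented, so this is only a minimal sketch of the pattern it appeared to follow: try retrieval over the PDF first and call an external search tool when nothing relevant is found. `web_search`, `answer`, and the sample chunk are hypothetical stand-ins, not the service’s actual code.

```python
# Minimal sketch of a PDF-first, web-search-second answering loop.
def web_search(query: str) -> str:
    """Stub standing in for an external web-search tool."""
    return f"(web result for: {query!r})"

def answer(query: str, scored_chunks: list[tuple[str, float]], threshold: float = 0.5) -> str:
    """Answer from the best PDF chunk if retrieval clears the threshold, else search the web."""
    hits = [text for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True)
            if score >= threshold]
    if hits:
        return f"From the PDF: {hits[0]}"
    return f"Not found in the PDF; falling back to search. {web_search(query)}"

chunks = [("The scanned ashlars were reproduced on a Z-Corp Z310 printer.", 0.7)]
print(answer("What university houses the Young Research Library?", []))  # triggers the fallback
print(answer("What is a Z-Corp Z310?", chunks))                          # answered from the PDF
```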
Conclusion and Implications
The overall conclusion is that generalist models like GPT-4 often outperform specialized PDF AI tools in terms of both accuracy and comprehensiveness of information retrieval [01:17:02]. GPT-4’s vast training data allows it to answer obscure questions, even correcting inaccuracies found in the source PDF.
While PDF-specific tools can excel at extracting precise information directly from the document via similarity search, their utility is questioned when a generalist AI can provide superior or equally good answers without needing to “read” the specific PDF [01:15:19]. This raises broader implications for the future of information consumption and content creation:
- Reading Papers/Books: The increasing capability of large language models (LLMs) suggests a future where traditional reading might be replaced by asking AI about content [01:15:02].
- Content Formatting: The format of academic papers or APIs might evolve to be more consumable by LLMs rather than solely by humans [01:15:42], potentially affecting software development and benchmarking AI agents.
- Specialized vs. Generalist AI: The trend indicates that general AI models, trained on the most extensive data, are becoming more effective than task-specific AIs, leading to a shift in market dynamics and the purpose of niche AI tools [01:07:47], [01:08:02].
- Inference cost and efficiency: Although not discussed in detail, the cost and efficiency of running such large generalist models at inference time remains a consideration.
Ultimately, the generalist GPT-4 was deemed the superior tool for information retrieval across a wide range of queries, often making dedicated PDF AI tools redundant for most purposes [01:18:03].