diff --git a/.gitattributes b/.gitattributes index 1ef325f..9ed71b7 100644 --- a/.gitattributes +++ b/.gitattributes @@ -57,3 +57,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text # Video files - compressed *.mp4 filter=lfs diff=lfs merge=lfs -text *.webm filter=lfs diff=lfs merge=lfs -text +EPS_FILES_20K_NOV2025.txt filter=lfs diff=lfs merge=lfs -text diff --git a/EPS_FILES_20K_NOV2025.txt b/EPS_FILES_20K_NOV2025.txt new file mode 100644 index 0000000..d0624e5 --- /dev/null +++ b/EPS_FILES_20K_NOV2025.txt @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:64b960a2a77aac6d66bf174654115341ec995cb2d3854279f042a92c14210c21 +size 105577791 diff --git a/README.md b/README.md index 81d1f71..ce98ce1 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,92 @@ ---- -license: cc0-1.0 ---- +# U.S. House Oversight Epstein Estate Documents + +The motivation for curating this dataset is to enable transparent exploration of the Epstein estate documents. +The goal is to empower AI practitioners, researchers, and enthusiasts to build RAG based systems +that can identify patterns, connections, and insights that are difficult to obtain through manual inspection. + +## Usage and Responsibilities (Required Reading) + +This dataset is provided for **research and exploratory analysis** with a focus on: + +- Evaluating information retrieval and retrieval augmented generation (RAG) systems. +- Developing and testing search, clustering, and summarization methods on a real world corpus. + +**Users are responsible for:** + +- Using the dataset only for lawful purposes and in accordance with institutional and ethical review requirements. +- Treating individuals mentioned in the documents with respect, and avoiding sensationalism or misuse of sensitive material. +- Clearly distinguishing model generated content and exploratory findings from verified facts, and citing primary sources where appropriate. + +It is **not** intended for: + +- Fine-tuning language models. +- Harassment, doxing, or targeted attacks on any individual or group. +- Attempts to deanonymize redacted information or circumvent existing redactions. +- Making or amplifying unverified allegations as factual claims. + +All use must comply with applicable law, institutional policies, and the terms of the original House releases. See the “Legal and copyright status” and “Ethical and content warning” sections below before working with this corpus. + + +## Source (All data derived from publicly released materials) + +All documents originate from the public release **“Oversight Committee Releases Additional Epstein Estate Documents”** on the official House Oversight Committee website (press release dated **November 12, 2025**): + +https://oversight.house.gov/release/oversight-committee-releases-additional-epstein-estate-documents/ + +The underlying materials are distributed via a Google Drive structure maintained by the Committee. This dataset is an independent derivative collection built from that release and is **not** an official product of the U.S. House of Representatives or the Committee on Oversight and Government Reform. + + +## Dataset contents + +- **Documents**: Over 25,000 plain text files derived from the committee’s public releases organized in a single csv file +- **Source Folders**: + - `TEXT/` – Files that were originally text-based (e.g., PDFs, emails) and converted to plain text. + - `IMAGES/` – Image files (primarily JPG) converted to text via OCR. + +**Filenames preserve the relative path and naming conventions from the original Google Drive release, to facilitate cross-referencing back to the official source files.** + + +## Processing + +- All image files under the `IMAGES/` directory (approximately 20,000 JPGs) were converted to machine-readable text using the open-source **Tesseract** OCR engine. +- Native text-based files under `TEXT/` were converted to plain text using standard tools (e.g., PDF/text extraction) without manual editing. +- No manual content editing, summarization, or redaction has been performed beyond: + - basic file organization, + - text extraction / OCR, + - and any redactions already present in the official House releases. + +As a result, the corpus may contain: + +- OCR noise and misrecognized characters +- Broken formatting +- Redaction blocks, stamps, or markers inherited from the original scans + +## Legal and copyright status (non-authoritative) + +- The original underlying documents were created by various private individuals and entities, not by the dataset maintainer. +- The documents are sourced from releases published by the U.S. House Committee on Oversight and Government Reform. The release webpages themselves carry standard copyright notices (© 2025 Committee on Oversight and Government Reform), and many individual documents are likely protected by copyright held by their original authors or rights holders. +- This dataset: + - **Does not** assert any ownership over the underlying documents. + - **Does not** grant any license to reproduce, distribute, or create derivative works from the underlying texts beyond what may already be permitted by law (e.g., fair use or similar doctrines in your jurisdiction). +- Users are solely responsible for ensuring that their use of this corpus complies with applicable copyright law, privacy law, institutional policies, and the terms of the original House releases. + +Nothing in this dataset card constitutes legal advice. If you plan to use this corpus in a public-facing product, for model training, or at scale, you should seek independent legal counsel. + +## Ethical and content warning + +The documents contain material related to: + +- Sexual abuse and exploitation +- Trafficking +- Violence and other highly sensitive topics +- Unverified allegations, opinions, or speculation + + +## Intended use and limitations + +Recommended / common use cases include: + +- Text mining and exploratory analysis of the *public record* surrounding the Epstein estate documents. +- Search / retrieval experiments (e.g., indexing, ranking, IR/RAG prototypes) conducted in controlled or research settings. +- Qualitative review by journalists, historians, or legal scholars. +