docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied), notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization; flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,76 @@
+# DataGemma
+
+LLM grounding with Google **Data Commons** — a public knowledge graph of 240B+ statistical data points (economics, health, demographics, science). Built on **Gemma 2 27B**. No Gemma 3 or 4 generation yet.
+
+## What it is
+
+Two flavors:
+
+- **DataGemma RIG** (Retrieval-Interleaved Generation): Model is fine-tuned to emit inline Data Commons queries wrapped around its own claims. Outputs look like `The population of Sunnyvale is [__DC__("population of Sunnyvale") --> "152,200"]`. An external resolver substitutes the real stat.
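+- **DataGemma RAG** (Retrieval-Augmented Generation): The model is fine-tuned to turn the user's question into natural-language Data Commons queries. The retrieved tables are injected into the prompt of a long-context LLM (Gemini 1.5 Pro in the paper), which writes the final answer.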
+
+## Sizes
+
+- **27B instruct** only (`datagemma-rig-27b-it`, `datagemma-rag-27b-it`).
+
+## Model cards
+
+- Google AI for Developers: https://ai.google.dev/gemma/docs/datagemma
+- DeepMind: https://deepmind.google/models/gemma/datagemma/
+- HF RIG: https://huggingface.co/google/datagemma-rig-27b-it
+- HF RAG: https://huggingface.co/google/datagemma-rag-27b-it
+- Paper: https://docs.datacommons.org/papers/DataGemma-FullPaper.pdf
+
+## Performance claim
+
+Baseline Gemma 2 factuality on the 101-query statistical eval: **5–17%**. DataGemma RIG: **~58%**. The improvement is narrow (statistical claims only) but real.
+
+## Prompt format
+
+No special template. Plain natural-language input. The difference is in the **training** and the **output format**.
+
+**RIG output example:**
+```
+Sunnyvale has [__DC__("total population of Sunnyvale CA") --> "152,200"]
+residents as of 2020, with a median age of [__DC__("median age of
+Sunnyvale CA") --> "34.8"].
+```
+
+Post-processing: regex out the `[__DC__("...") --> "..."]` blocks and either (a) replace with resolved Data Commons values, or (b) render as inline citations.
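Annotation (not part of the committed file): a minimal sketch of the post-processing step, assuming the annotation format shown in the RIG output example above. The names `DC_BLOCK`, `extract_annotations`, and `render_as_citations` are illustrative and do not come from any DataGemma or Data Commons library. The sketch covers option (b), rendering the annotations as inline citations while keeping the model's own value.

```python
import re

# One RIG annotation: [__DC__("query") --> "model value"].
# [^"]* also matches newlines, so annotations wrapped across lines are caught.
DC_BLOCK = re.compile(r'\[__DC__\("([^"]*)"\)\s*-->\s*"([^"]*)"\]')


def extract_annotations(text):
    """Return (query, model_value) pairs in order of appearance."""
    return [
        (" ".join(m.group(1).split()), m.group(2))  # collapse wrapped whitespace
        for m in DC_BLOCK.finditer(text)
    ]


def render_as_citations(text):
    """Keep the model's value and append a numbered citation marker per annotation."""
    queries = []

    def repl(match):
        queries.append(" ".join(match.group(1).split()))
        return f"{match.group(2)} [{len(queries)}]"

    body = DC_BLOCK.sub(repl, text)
    if not queries:
        return body
    footer = "\n".join(
        f"[{i}] Data Commons query: {q}" for i, q in enumerate(queries, 1)
    )
    return f"{body}\n\n{footer}"


# The RIG output example from this README, including its line wrap.
sample = (
    'Sunnyvale has [__DC__("total population of Sunnyvale CA") --> "152,200"]\n'
    'residents as of 2020, with a median age of [__DC__("median age of\n'
    'Sunnyvale CA") --> "34.8"].'
)
print(render_as_citations(sample))
```

The values shown in the citations are still the model's own guesses; replacing them with verified figures requires a lookup against Data Commons.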
+
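+**RAG flow:** the RAG model generates Data Commons queries from the user question, Data Commons returns the matching tables, and those tables are placed in the prompt ahead of the question before the answering model is called.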
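Annotation (not part of the committed file): a minimal sketch of the "inject tabular context" step in the RAG flow. Everything here is an assumption made for illustration: the `(title, rows)` table shape, the pipe-delimited rendering, the instruction wording, and the function names. The README does not specify how DataGemma's own pipeline formats retrieved tables, and the sample row reuses the illustrative figure from the RIG example rather than a verified statistic.

```python
def format_table(title, rows):
    """Render one retrieved table (a list of dicts) as pipe-delimited text."""
    if not rows:
        return f"{title}\n(no rows)"
    header = " | ".join(rows[0].keys())
    lines = [" | ".join(str(value) for value in row.values()) for row in rows]
    return "\n".join([title, header, *lines])


def build_rag_prompt(question, tables):
    """Place the retrieved tables ahead of the question in a single prompt string."""
    context = "\n\n".join(format_table(title, rows) for title, rows in tables)
    return (
        "Answer the question using only the tables below. "
        "If the tables do not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieval result; the values are placeholders, not verified data.
tables = [
    ("Total population of Sunnyvale CA", [{"date": "2020", "value": "152,200"}]),
]
print(build_rag_prompt("How many people live in Sunnyvale?", tables))
```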
+
+## Minimum invocation — RIG
+
+```python
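+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "google/datagemma-rig-27b-it"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# 27B parameters in bfloat16 need roughly 54 GB for the weights alone;
+# device_map="auto" spreads them across whatever devices are available.
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, device_map="auto", torch_dtype=torch.bfloat16
+)
+
+prompt = "What are the demographic trends in Sunnyvale, California?"
+# Move inputs to the model's device rather than assuming a single "cuda" device.
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+out = model.generate(**inputs, max_new_tokens=1024)
+# Decode only the newly generated tokens, not the echoed prompt.
+print(tokenizer.batch_decode(
+    out[:, inputs["input_ids"].shape[1]:],
+    skip_special_tokens=True
+)[0])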
+```
+
+Then run a resolver that extracts each `[__DC__("query") --> "value"]` block and sends the query to the Data Commons API.
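Annotation (not part of the committed file): a minimal sketch of such a resolver. The Data Commons call is deliberately left as an injected `lookup` callable, because this README does not name a specific client or endpoint. `resolve_rig_output`, `fake_lookup`, and `FAKE_STATS` are hypothetical names, and the dictionary stands in for a real API client. Falling back to the model's value with an "(unverified)" marker is one possible design choice, not documented DataGemma behavior.

```python
import re
from typing import Callable, Optional

# One RIG annotation: [__DC__("query") --> "model value"].
DC_BLOCK = re.compile(r'\[__DC__\("([^"]*)"\)\s*-->\s*"([^"]*)"\]')


def resolve_rig_output(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each annotation with the looked-up value.

    When the lookup returns None, the model's own value is kept and marked
    as unverified so the reader can tell it was not grounded.
    """

    def repl(match):
        query = " ".join(match.group(1).split())  # collapse wrapped whitespace
        model_value = match.group(2)
        resolved = lookup(query)
        return resolved if resolved is not None else f"{model_value} (unverified)"

    return DC_BLOCK.sub(repl, text)


# Stand-in for a real Data Commons client; the value is a placeholder.
FAKE_STATS = {"total population of Sunnyvale CA": "<value from Data Commons>"}


def fake_lookup(query: str) -> Optional[str]:
    return FAKE_STATS.get(query)


text = (
    'Population: [__DC__("total population of Sunnyvale CA") --> "152,200"], '
    'median age: [__DC__("median age of Sunnyvale CA") --> "34.8"].'
)
print(resolve_rig_output(text, fake_lookup))
```

Passing the lookup as a parameter keeps the parsing logic testable offline and lets a real client be swapped in later.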
+
+## When to choose it over base Gemma 4
+
+- You're building a **statistics-grounded assistant** (government data, public health, economic indicators) and need low hallucination on numbers.
+- You're okay with a **27B model** — DataGemma only ships at this size.
+- Your domain overlaps Data Commons coverage (US-heavy, but growing internationally).
+
+Base Gemma 4 + a conventional RAG pipeline can do the same thing if you bring your own retriever. DataGemma's value is the **trained inline-citation behavior** (RIG) — Gemma 4 won't emit that format without prompting gymnastics.
+
+## Homelab fit
+
+Low. No current Seth project leans on statistical grounding. It is a niche fit for a news-summary use case (POS-Automation daily print) if Seth ever wants interjections like "US inflation was X% as of Y", but at that point a simple Data Commons API call from the script is cheaper than running a 27B model.