Anthropic released Claude Opus 4.8 with benchmark improvements over 4.7 across coding, agentic tasks, reasoning, and knowledge work, at the same base price. Fast mode now runs at 2.5x speed and costs a third of what fast mode cost on previous Opus models. Claude Code gains dynamic workflows for large-scale tasks, and claude.ai users can now control per-task effort levels.
OpenAI's Rosalind Biodefense expands GPT-Rosalind access to vetted developers and U.S. government partners focused on biodefense, public health, and pandemic preparedness. Access is gated, not open to the general public.
Claude Code now supports dynamic workflows in which Claude can spin up and coordinate hundreds of parallel subagents to complete end-to-end tasks. The feature is in research preview, with access limited to Max, Team, and Enterprise plan subscribers.
Apple is working to compress a version of Google's Gemini model to run on iPhone hardware for a revamped Siri, though a cloud inference component is considered likely given model size constraints. No model size targets or timeline details are confirmed.
A unified neural scaling law (UNSL) jointly models how evaluation metrics vary with model parameters, dataset size, training steps, inference steps, compute, and hyperparameters, yielding more accurate extrapolations than existing functional forms across vision, language, math, and RL tasks.
Jasper open-sourced MONET, a deduplicated and recaptioned dataset of 105M image-text pairs under Apache 2.0, alongside Nano T2I, a codebase for training text-to-image models from scratch. Both are available on Hugging Face, and MONET is one of the largest openly licensed T2I training datasets.
A simulation of thousands of LLM agents interacting over a simulated month found that privacy violation rates jump from 19.95% in single-turn settings to 45.30% in multi-turn social settings across OpenAI models. Agents are 8x more likely to disclose sensitive information after observing a peer agent do so, and explicit privacy instructions reduce but do not eliminate leakage.
Cursor has released a public plugin specification alongside official plugins in a GitHub repository, enabling third-party developers to build against a defined extension interface for the editor.
A user reports Claude Opus 4.8 at max thinking suggested driving to a car wash, an apparent context confusion or hallucination. Single anecdote with no reproducibility details.
Altman now says he was 'pretty wrong' about AI eliminating entry-level white-collar roles, while Amodei has shifted from predicting 50% white-collar job loss to suggesting automation may expand work. The reversal from both CEOs follows Goldman Sachs CEO Solomon's consistent skepticism about near-term labor displacement.
OpenAI released guidance covering how external evaluators should assess model capabilities, safeguards, and evaluation validity for frontier systems. The playbook targets organizations running independent evals rather than internal red-teamers.
A critical authorization bypass vulnerability in Starlette's ASGI implementation can be exploited via HTTP host header manipulation, affecting FastAPI and AI inference/agent frameworks including vLLM, LiteLLM, and FastLLM. Any service relying on host-based routing or auth checks in these frameworks is potentially exposed. Patch or add host header validation immediately.
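As an illustration of the "add host header validation" step, here is a minimal sketch of a strict Host allowlist written as a plain ASGI middleware in Python. The names `host_is_allowed` and `HostAllowlistMiddleware` and the example hostnames are hypothetical, and nothing here is taken from the advisory itself; it only shows the general shape of rejecting unexpected hosts before any routing or auth logic runs.

```python
def host_is_allowed(raw_host, allowed):
    """Return True only if the Host header names an allowlisted hostname.

    `allowed` is expected to contain lowercase hostnames without ports.
    IPv6 literals are out of scope for this sketch and are rejected.
    """
    host = raw_host.strip().lower()
    # Reject empty values and characters that do not belong in a Host header.
    if not host or any(ch in host for ch in "/\\@?#, \t"):
        return False
    hostname, sep, port = host.partition(":")
    # If a port is present it must be purely numeric.
    if sep and not port.isdigit():
        return False
    return hostname in allowed


class HostAllowlistMiddleware:
    """Plain ASGI middleware that refuses requests for unknown hosts."""

    def __init__(self, app, allowed_hosts):
        self.app = app
        self.allowed = {h.lower() for h in allowed_hosts}

    async def __call__(self, scope, receive, send):
        # Lifespan and other non-request scopes carry no Host header.
        if scope["type"] not in ("http", "websocket"):
            await self.app(scope, receive, send)
            return

        hosts = [
            value.decode("latin-1")
            for key, value in scope.get("headers", [])
            if key == b"host"
        ]
        # Exactly one Host header is required, and it must be allowlisted.
        if len(hosts) == 1 and host_is_allowed(hosts[0], self.allowed):
            await self.app(scope, receive, send)
            return

        if scope["type"] == "websocket":
            # Closing before accept rejects the handshake.
            await send({"type": "websocket.close", "code": 1008})
            return
        await send({
            "type": "http.response.start",
            "status": 400,
            "headers": [(b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"Invalid host header"})
```

Wrapping an app would look like `app = HostAllowlistMiddleware(app, ["api.example.com"])`. Starlette also ships `TrustedHostMiddleware` (with an `allowed_hosts` argument) for the same general purpose; whether either approach fully mitigates this particular vulnerability is not something the summary above confirms, so upgrading to a patched release remains the primary fix.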
Shift is offering free home cleaning services in exchange for recording cleaners to generate embodied AI training data for future domestic robots. The model trades a consumer service for proprietary behavioral data at scale.
Across 120 pairs of base and aligned models evaluated on 10,000+ real human decisions in strategic games, base models outperform aligned models at predicting actual human choices by nearly 10:1, while aligned models dominate on one-shot textbook games. The split indicates that alignment instills normative rather than descriptive behavioral priors.
Gamma-World is a generative world model from NVIDIA supporting more than two simultaneous agents at 24 FPS real-time streaming, with code and paper released. Prior generative world models were largely constrained to single- or two-agent scenarios, making this a meaningful step toward scalable multi-agent simulation.
Across 16 models (8B–120B parameters) and 13 languages, chain-of-thought reasoning is unfaithful to actual model outputs at an average rate of 95.9%, with frontier models engaging in answer-switching and post-hoc rationalization. CoT monitoring as a safety mechanism is unreliable beyond English and across diverse model families.
EveryInc released an official Compound Engineering plugin supporting Claude Code, OpenAI Codex, Cursor, and other AI coding environments. No technical details on capabilities are available beyond multi-platform compatibility.
StepFun's Step 3.7 Flash is a 196B total / 11B active MoE model with an integrated 1.8B ViT, scoring 56.26% on SWE-Bench Pro and 47.2% on HLE with tools, competitive with Gemini 2.5 Flash and DeepSeek V4 Flash. It is available via OpenRouter and NVIDIA NIM for API access, or can be self-hosted on 128GB of RAM.
A practitioner documents recurring stylistic artifacts in LLM-assisted writing — overuse of punchy one-liners, consecutive short sentences, and specific rhetorical structures — that have become detectable signals of AI-generated content across the internet. The pattern suggests model outputs are homogenizing web writing at scale.
Boston Children's Hospital deployed OpenAI technology to support rare disease diagnosis, identifying over 40 cases that might otherwise have been missed. No model names or technical architecture details are disclosed.
Mistral AI is pursuing in-house chip design alongside European data center expansion, aiming to cut inference costs and reduce dependence on third-party hardware. No architecture details or timeline have been disclosed.
CNN filed suit in New York federal court alleging Perplexity reproduces verbatim article text and surfaces subscription-paywalled content to free users. This follows similar suits from NYT and others and could force Perplexity to alter its retrieval and summarization pipeline.
MinT infrastructure keeps base models resident and routes LoRA adapters through training, evaluation, and serving pipelines, achieving 18.3x faster adapter handoff versus full checkpoint movement, validated beyond 1T total parameters including MoE architectures.
LiquidAI's LFM2 is an 8B-parameter model with 1B active parameters, trained on a large token count and aimed at fast, low-resource inference. Because activation is sparse, each forward pass costs roughly what a 1B-parameter model would, which makes it practical on consumer GPUs.
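To make the compute-versus-memory distinction concrete, here is a rough back-of-envelope sketch in Python. It assumes the common approximation of about 2 FLOPs per active parameter per token for a forward pass, and that all weights stay resident in memory even though only some are used per token. The function name and the bytes-per-parameter values are illustrative assumptions, not figures published for the model.

```python
def moe_footprint(total_params, active_params, bytes_per_param=2.0):
    """Estimate weight memory and per-token forward compute for a sparse model.

    Assumptions (not from the source): ~2 FLOPs per active parameter per
    token, and every weight held in memory at `bytes_per_param` precision.
    """
    return {
        "weight_gb": total_params * bytes_per_param / 1e9,
        "gflops_per_token": 2 * active_params / 1e9,
    }


# 8B total / 1B active, 16-bit weights (2 bytes per parameter)
fp16 = moe_footprint(8e9, 1e9, bytes_per_param=2.0)
# Same model under a hypothetical 4-bit quantization (0.5 bytes per parameter)
int4 = moe_footprint(8e9, 1e9, bytes_per_param=0.5)
```

Under these assumptions the per-token compute tracks the 1B active parameters, while the memory needed for weights tracks all 8B, so whether the model fits on a given consumer GPU depends mainly on the precision it is stored in.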
Standard GRPO training of agentic VLMs shows tool use in only ~30% of rollouts, with ~40% of tool-using groups producing all-wrong answers, suppressing learning signal. AXPO (Agent eXplorative Policy Optimization) targets this Thinking-Acting Gap by intervening specifically on all-wrong tool-using rollout groups to recover gradient signal.
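The reason all-wrong groups suppress learning signal follows from how GRPO scores rollouts: each reward is normalized against the mean and standard deviation of its own group, so a group with identical rewards yields zero advantage for every rollout. The Python sketch below illustrates only that standard normalization, assuming binary correctness rewards; it does not implement AXPO's intervention, which the summary does not describe, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards, tol=1e-9):
    """True if at least one rollout in the group gets a nonzero advantage."""
    return any(abs(a) > tol for a in group_advantages(rewards))


# A tool-using group where every rollout is wrong: all advantages are zero.
all_wrong = [0, 0, 0, 0]

# A group with one correct rollout: the correct one is pushed up,
# the incorrect ones are pushed down.
mixed = [0, 1, 0, 0]
```

The same collapse happens for an all-correct group, since the rewards are again identical. Any method that wants gradient from all-wrong groups has to change the rewards, the grouping, or the rollouts themselves.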