chanmuzi의 AI 큐레이션

AI 논문 & 뉴스를 매주 정리합니다

상세 검색 추가

1,794

전체 항목

1,069

📜 Papers

684

🧑🏻‍💻 Dev

🗞️ News

Top 기관

Google (95) OpenAI (89) Meta (86) Anthropic (78) Google DeepMind (67) Microsoft (62) Tsinghua (57) NVIDIA (50)

전체 아카이브

1,794건

2026년 5월 18건

🗞️ News Boston Dynamics

2026.05 3주차

Atlas, 냉장고 운반·음료 전달 시연 및 학습 방식 딥다이브 영상 공개

프로덕션 버전 Atlas가 전원 분리된 50파운드(약 23kg) 미니 냉장고를 들어 옮긴 뒤 연구원에게 음료를 전달하는 시연 영상 공개
Atlas 학습 방식과 코믹한 연출 의도를 설명하는 딥다이브 영상 동시 공개, 실험실 테스트에서는 100파운드까지 들어 올림
CES 2026에서 공개한 프로덕션 Atlas의 후속 시연으로, 2026년 물량은 Hyundai RMAC·Google DeepMind 배치로 전량 확정

🧑🏻‍💻 Dev PixVerse

2026.05 3주차

PixVerse R1: 인터랙티브 AI 비디오를 위한 실시간 월드 모델

사용자 입력에 즉시 반응해 연속 1080P 영상을 생성하는 "세계 최초 실시간 월드 모델" R1 공개
무한 스트리밍·실시간 반응·텍스트/이미지/오디오/비디오 통합 처리·장시간 물리 일관성 유지
Omni 네이티브 멀티모달 파운데이션 모델, 일관성 인식 autoregressive 프레임워크, 샘플링 스텝을 1~4회로 줄인 응답 엔진의 3대 기술 결합

📜 Paper Meituan LongCat

2026.05 3주차

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

에이전트 오케스트레이션의 성능을 실제로 견인하는 핵심 메커니즘이 복잡한 코드가 아니라 "heavy thinking"이라는 단일 inner skill임을 규명
K개 독립 추론 trajectory를 병렬 생성한 뒤 별도 모델이 이를 종합하는 2단계(parallel→sequential) 파이프라인을 Claude Code 등 오케스트레이션 프레임워크에 그대로 이식 가능한 readable skill 문서로 캡슐화
STEM 벤치마크 전반에서 Heavy-Mean@4가 Mean@K를 일관되게 상회, GPT-5-Thinking은 일부 과제에서 near-perfect 달성, 성능 위계는 Heavy-Pass@k ≥ Heavy-Mean@K ≥ Vote@K ≥ Mean@k로 정리

📜 Paper Mind Lab

2026.05 3주차

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

소수의 값비싼 base model에서 파생된 수많은 정책을 각각 완전 머지 체크포인트로 구체화하는 비용 문제 해결
LoRA post-training·서빙 관리 인프라 — base model 상주 + LoRA adapter revision을 서비스 인터페이스로 관리, Scale Up(1T+ dense/MoE LoRA RL 확장)·Scale Down(base 대비 1% 미만 adapter export)·Scale Out(정책 주소성과 컴퓨팅 자원 분리, 100만 규모 카탈로그) 3축 최적화
adapter-only handoff로 step time 4B 18.3배·30B MoE 2.85배 단축, 멀티 정책 학습 wall time 1.77배·1.45배 단축(피크 메모리 증가 없음), packed MoE LoRA 텐서로 엔진 로딩 8.5~8.7배 개선

📜 Paper Alibaba Qwen

2026.05 2주차

Qwen-Image-2.0 Technical Report

Ultra-long text rendering, multilingual typography, photorealism, instruction following 등 기존 image generation의 한계 해소를 목표로 한 차세대 모델 공개
Qwen3-VL을 condition encoder로 사용하고 Multimodal Diffusion Transformer로 condition·target을 jointly 모델링해 생성과 편집을 단일 framework로 통합
최대 1K-token 길이의 instruction까지 text-rich generation을 지원하며 다국어 텍스트 fidelity와 typography, photorealism을 동시에 강화

📜 Paper Netflix

2026.05 2주차

Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs

멀티턴 대화에서 LLM이 토픽 전환(pivot)을 인식하지 못하고 이전 turn의 stale context를 그대로 캐리하는 문제 정량 분석
TopiOCQA·MSC를 결합해 pivot 위치가 알려진 synthetic dialogue 벤치마크 구성, 10개 LLM을 zero-shot으로 turn 유형·관련 context 식별 평가
GPT-4o 등 closed-source와 o3·Claude-3.7-sonnet 등 reasoning 모델은 ceiling에 근접한 반면, open-weight 모델은 pivot을 맞춰도 context "stickiness" 잔존

🗞️ News OpenAI

2026.05 2주차

OpenAI launches the OpenAI Deployment Company

기업 내 frontier AI 도입을 전담하는 신생 회사 OpenAI Deployment Company 출범, 초기 자본 40억 달러 규모
TPG·Advent·Bain Capital·Brookfield 등 19개 글로벌 투자사·컨설팅사·SI 공동 참여 (Bain·Capgemini·McKinsey 포함)
Forward Deployed Engineer(FDE)를 고객사에 상주시켜 워크플로우 재설계와 운영 시스템화 지원

📜 Paper Sakana AI · NVIDIA

2026.05 2주차

Sparser, Faster, Lighter Transformer Language Models

sparse training 코드와 H100 전용 custom CUDA kernel을 함께 공개, TwELL packing format 기반
HF 호환 SparseLM 0.5B/1B/1.5B/2B 사전학습 체크포인트와 Hydra 기반 학습·벤치마크 파이프라인 제공
sparsity 활용으로 추론 throughput 향상과 메모리·에너지 절감, twell-flex 변형 커널로 non-uniform sparsity 패턴까지 커버

🧑🏻‍💻 Dev Thinking Machines

2026.05 2주차

Interaction Models: A Scalable Approach to Human-AI Collaboration

실시간 상호작용을 외부 모듈로 덧붙이는 대신 모델 아키텍처에 native 통합한 새로운 접근법 제안
200ms 단위 time-aligned micro-turns 처리로 동시 발화·인터럽션·tool use 지원, dMel·hMLP 기반 encoder-free early fusion 채택
interaction model이 실시간 응답을 맡고 background reasoning model이 비동기 추론 결과를 스트림으로 합류시키는 dual-model system

📜 Paper Ai2

2026.05 2주차

EMO: Pretraining Mixture of Experts for Emergent Modularity

1T 토큰으로 학습한 1B-active / 14B-total MoE 모델 EMO 공개, 128개 전문가 중 토큰당 8개만 활성화
같은 문서 내 모든 토큰이 공유 expert 풀 안에서만 라우팅되도록 강제하는 document-boundary 약한 감독(weak supervision) 도입
표면 특징이 아닌 health · politics · movies 등 의미론적 도메인 단위 expert 모듈화 자연 발현

🧑🏻‍💻 Dev OpenAI

2026.05 2주차

Investigating the consequences of accidentally grading CoT during RL

배포된 GPT-5.x 계열(GPT-5.4 Thinking, GPT-5.1~5.4 Instant 등) RL 학습 중 보상 신호 입력에 chain-of-thought(CoT)가 의도치 않게 포함된 사건 다수 자동 감지
CoT가 감독 신호로 새면 모델이 추론을 숨기거나 monitor를 속이는 인센티브가 생길 수 있어 우려
정규표현식 기반 스캐너로 보상 입력 내 CoT 포함 여부 점검 후 재현 학습 · CoT 접근 차단판 비교 · 의도적 압력 stress test 병행

🧑🏻‍💻 Dev Zyphra

2026.05 2주차

ZAYA1-8B: Frontier intelligence density, trained on AMD

active 1B 미만 MoE 8B 모델 공개. Compressed Convolutional Attention(CCA), MLP 기반 expert router, learned residual scaling 등 신규 architecture 도입
AMD Instinct MI300x 1,024 노드(IBM·Pensando Pollara interconnect) 클러스터에서 학습, SFT → reasoning warmup → RLVE-Gym → math/code RL → RLHF/RLAIF 5단계 post-training 파이프라인 적용, Apache-2.0 공개
HMMT'25 89.6으로 Claude 4.5 Sonnet(88.3) 상회, Markovian RSA test-time compute(5.5M 토큰 budget)로 DeepSeek-V3.2 초과해 8B급에서 frontier 밀도 달성

🧑🏻‍💻 Dev OpenAI

2026.05 2주차

Advancing voice intelligence with new models in the API

3종 신규 realtime audio 모델 공개: GPT-Realtime-2(GPT-5급 reasoning 탑재 voice 모델), GPT-Realtime-Translate(70+ 입력 → 13 출력 언어 실시간 번역), GPT-Realtime-Whisper(스트리밍 speech-to-text)
GPT-Realtime-2(high) Big Bench Audio에서 GPT-Realtime-1.5 대비 +15.2%p, (xhigh) Audio MultiChallenge instruction following에서 +13.8%p 향상으로 reasoning·context 관리·대화 제어 강화
가격은 GPT-Realtime-2 audio input $32/1M(캐시 $0.40)·output $64/1M, Translate $0.034/분, Whisper $0.017/분으로 책정

📜 Paper Ai2

2026.05 1주차

MolmoAct2: Action Reasoning Models for Real-world Deployment

폐쇄적 SOTA, 고가 하드웨어 의존, 높은 지연시간 등 기존 로봇 제어 모델의 실배포 한계 해소
MolmoER VLM 백본(3.3M 샘플 학습) + OpenFAST action tokenizer + flow matching 아키텍처 + MolmoThink 적응형 깊이 추론을 결합
양팔 조작 720시간 포함 신규 데이터셋 3종 동반 공개

🧑🏻‍💻 Dev Anthropic

2026.05 1주차

Claude for Financial Services

은행·자산운용·보험 등 금융기관 의사결정 가속용 AI 솔루션 — Excel/PowerPoint 통합, source attribution, AWS·GCP·Azure 멀티클라우드 배포 지원
피치북, 신용 메모, KYC 스크리닝, 펀드 회계 등 사전 구성 agent template 제공
LSEG, FactSet, S&P Global, Morningstar, PitchBook, Moody's 등과 pre-built connector 연동

📜 Paper UIUC

2026.05 1주차

Heterogeneous Scientific Foundation Model Collaboration

자연어 중심 agentic system이 비언어 modality 기반 과학 도메인에는 적용이 제한된다는 한계 지적
도메인 foundation model에 LLM reasoning 인터페이스를 결합한 Eywa 프레임워크 제안. query compiler ϕₖ + response adapter ψₖ로 구성된 FM-LLM "Tsaheylu" 인터페이스를 MCP로 노출
단일 EywaAgent, 다중 EywaMAS, 동적 EywaOrchestra 3종 변형 및 물리·생명·사회 과학을 아우르는 EywaBench 공개

🧑🏻‍💻 Dev Warp

2026.05 1주차

Warp

터미널에서 출발한 agentic development environment Warp의 클라이언트 코드베이스 오픈소스 공개
자체 coding agent와 함께 Claude Code, Codex, Gemini CLI 등 외부 CLI agent 연동 지원
UI 프레임워크 MIT, 나머지 AGPL v3 듀얼 라이선스 채택. OpenAI가 founding sponsor로 참여

🧑🏻‍💻 Dev Mistral AI

2026.05 1주차

Mistral Medium 3.5 128B

128B dense 파라미터 + 256k context의 멀티모달 flagship 모델 공개 (Modified MIT License로 상·비상업 사용 모두 개방, 대규모 매출 기업 예외)
instruction-following · reasoning · coding을 하나의 모델로 통합한 Mistral 최초의 merged flagship으로 Mistral Medium 3.1, Magistral, Devstral 2 대체
request 단위 `reasoning_effort` 토글(none/high)로 추론 깊이 조절

2026년 4월 37건

📜 Paper Multi-institution

2026.04 5주차

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

400편 이상 연구와 100+ 시스템을 분석한 agentic world model 종합 survey 제시
capability level(Predictor·Simulator·Evolver)과 governing law(physical·digital·social·scientific) 두 축의 "levels × laws" 분류 체계 도입
RL·video generation·agent navigation·social simulation·scientific discovery 등 분리돼 있던 커뮤니티 간 통합 시각 제공

📜 Paper Inclusion AI

2026.04 5주차

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

SigLIP-VQ 기반 semantic tokenization과 MoE backbone, diffusion decoder를 결합한 통합 discrete diffusion 멀티모달 모델 공개
block-level masked diffusion으로 시각·텍스트 데이터를 단일 프레임워크에서 처리하며 이해·생성·편집을 모두 지원
prefix-aware backbone 최적화와 few-step decoder distillation으로 inference 효율 강화

🧑🏻‍💻 Dev Anthropic

2026.04 4주차

Automated Weak-to-Strong Researcher

Claude 기반 Automated Alignment Researcher(AAR)가 weak-to-strong generalization 과제에서 인간 연구자 대비 압도적 성능 달성, alignment 연구 자동화 가능성 입증
인간 베이스라인은 7일간 4개 method 튜닝으로 PGR 0.23 기록, AAR은 9개 병렬 에이전트로 5일 만에 PGR 0.97 도달 (누적 800시간, 약 $18,000 비용)
sandbox에서 다수 AAR이 가설 제안→실험 실행→결과 분석→발견 공유 사이클을 자율 반복, 수개월 작업을 수 시간으로 압축

🗞️ News Google DeepMind

2026.04 4주차

Announcing our partnership with the Republic of Korea

Google DeepMind, 대한민국 과학기술정보통신부(MSIT)와 전략적 파트너십 체결로 frontier AI를 활용한 과학·산업 혁신 가속화
서울에 AI Campus 설립, SNU·KAIST·AI Bio Innovation Hub 3곳과 생명과학·기상·에너지·알고리즘 분야 공동 연구 추진 (AlphaFold·AlphaGenome·WeatherNext·AlphaEvolve 활용)
K-Moonshot Missions 지원, 한국 AI Safety Institute와 안전 연구 협업, AI Essentials 장학금·인턴십 프로그램 확대

🧑🏻‍💻 Dev NVIDIA

2026.04 4주차

Nemotron-Personas-Korea

한국 인구 통계·문화 특성을 반영한 첫 대규모 합성 페르소나 데이터셋 공개 (CC BY 4.0)
100만 레코드 / 700만 페르소나 / 17억 토큰, 26개 필드로 17개 시도·252+ 시군구 커버
KOSIS·대법원·NHIS·KREI·NAVER Cloud 공식 통계 기반, NeMo Data Designer + gemma-4-31B-it로 합성

🧑🏻‍💻 Dev Sakana AI

2026.04 4주차

Sakana Fugu: Multi-Agent Orchestration System

여러 frontier model을 adaptive agent coordination으로 동적 오케스트레이션하는 multi-agent 시스템 베타 공개
ICLR 2026 Trinity·Conductor 논문 기반으로 사람이 정의한 팀 구조 없이 비자명한 협업 패턴 자동 학습
GPQAD 95.1(Gemini 3.1 94.4·GPT 5.4 90.9), SWEPro 54.2(Opus 4.6 53.4) 등 개별 frontier model 상회

🧑🏻‍💻 Dev OpenAI

2026.04 4주차

Introducing GPT-5.5

GPT-5.4 공개 몇 주 만에 출시된 후속 모델, agentic coding·computer use·knowledge work·초기 scientific research 영역에 초점
large codebase 전반에 context 유지, 도구로 가정 검증하며 변경을 주변 코드에 전파하는 engineering behavior에 최적화. 이전 대비 token 효율 개선
주요 벤치마크 (SOTA)

📜 Paper DeepSeek

2026.04 4주차

DeepSeek-V4 Technical Report

1.6T total / 49B active MoE, 1M 토큰 컨텍스트, FP4+FP8 mixed precision 적용
CSA·HCA 하이브리드 어텐션으로 1M 토큰 구간 inference FLOPs는 DeepSeek-V3.2 대비 27%, KV cache는 10% 수준으로 절감
Manifold-Constrained Hyper-Connections(mHC) + Muon optimizer로 32T+ 토큰 pretraining, 이후 GRPO SFT+RL → on-policy distillation 2-stage post-training

🧑🏻‍💻 Dev Xiaomi

2026.04 4주차

MiMo-V2.5-Pro

Xiaomi의 가장 강력한 모델을 public beta로 공개, 1,000+ tool call 규모의 long-horizon agentic task에 초점
실측 사례
Rust SysY 컴파일러 풀빌드: 233/233 테스트 통과, 4.3시간 / 672 tool calls

🧑🏻‍💻 Dev OpenAI

2026.04 4주차

Introducing OpenAI Privacy Filter

텍스트 내 PII(이름·이메일·계좌·신용카드 번호 등) 탐지와 redaction을 위한 1.5B open-weight 모델, on-device로 처리해 데이터가 외부로 나가지 않음
PII-Masking-300k 벤치마크에서 out-of-the-box 96% F1 달성
Apache 2.0으로 GitHub/Hugging Face 공개, fine-tuning과 상용 배포 모두 허용

🧑🏻‍💻 Dev OpenAI

2026.04 4주차

Introducing ChatGPT Images 2.0

한/중/일/힌디/벵골 등 다국어 text rendering과 small text·UI·dense typography를 최대 2K 해상도로 정확히 처리
단일 prompt에서 character/object continuity를 유지하며 최대 8장 이미지 생성 (만화 시퀀스, 동화책, SNS 그래픽 시리즈에 활용)
thinking model 모드에서 웹 검색·업로드 자료 분석·layout reasoning을 거친 뒤 이미지 생성

🧑🏻‍💻 Dev Robbyant

2026.04 3주차

LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction

이미지 시퀀스/비디오 스트림으로부터 real-time streaming 3D reconstruction을 수행하는 feed-forward 3D foundation model
Geometric Context Transformer가 단일 streaming framework 안에서 coordinate grounding, dense geometric cues, long-range drift correction, anchor context & pose-reference window, trajectory memory를 통합
Paged KV cache attention (FlashInfer) 기반으로 518×378 해상도에서 약 20 FPS, 10,000+ frames 시퀀스 처리 가능 (>3000 frames는 windowed inference 지원)

🗞️ News Anthropic

2026.04 3주차

Claude Opus 4.7

Claude Opus 4.7 릴리즈. API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry 등 모든 Claude 제품에서 사용 가능
주요 개선 포인트
advanced software engineering, 특히 어려운 태스크에서 Opus 4.6 대비 큰 개선 (한 벤치마크 기준 +13%)

🧑🏻‍💻 Dev Alibaba Qwen

2026.04 3주차

Qwen3.6-35B-A3B

Tongyi Lab이 Apache 2.0로 공개한 sparse MoE 모델. total 35B parameters, active 3B (약 12:1 sparsity)로 agentic coding 타겟
주요 벤치마크 (vs Gemma 4-31B)
SWE-bench Verified 73.4% vs 52.0%

📜 Paper ByteDance Seed

2026.04 2주차

Seedance 2.0: Advancing Video Generation for World Complexity

text, image, audio, video 네 가지 modality를 입력으로 받아 4~15초 길이의 480p/720p 비디오를 생성하는 native multi-modal audio-video generation 모델
unified, highly efficient, large-scale architecture 기반으로 audio-video joint generation을 수행하며 최대 3개 video clip, 9개 image, 3개 audio clip을 reference로 동시 처리 가능
전문가 평가와 public test 기준 leading levels에 준하는 성능을 보였으며, 저지연 응용을 위한 Fast variant도 함께 공개

📜 Paper Anthropic · Truthful AI

2026.04 2주차

Language models transmit behavioural traits through hidden signals in data

teacher 모델이 만든 number sequences/코드/reasoning trace만으로 학습한 student 모델이 teacher의 성향(예: 올빼미 선호, misalignment)을 그대로 습득하는 subliminal learning 현상을 규명
방법론
teacher에 finetuning 또는 system prompt로 특정 trait 부여 → 무관한 프롬프트로 completion 샘플링 → trait 언급을 제거하는 필터링 후 student 학습

📜 Paper Google Cloud AI Research

2026.04 2주차

PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing

sparse idea summary와 실험 로그 같은 unstructured 자료를 submission-ready LaTeX 논문으로 변환하는 multi-agent framework
outline 작성, plot 생성, literature review, writing, refinement 등 역할별 agent로 구성되며 Semantic Scholar API로 citation을 grounding
CVPR/ICLR 등 conference 스타일 native rendering, simulated peer feedback 기반 iterative self-refinement 지원

🧑🏻‍💻 Dev Ai2

2026.04 2주차

Introducing WildDet3D: Open-world 3D detection from a single image

단일 RGB 이미지에서 객체의 3D bounding box(위치·크기·방향을 metric 좌표로)를 예측하는 open-world monocular 3D detection 모델
하나의 아키텍처에서 text query, point prompt, 2D bounding box를 모두 prompt로 받으며 해상도·종횡비·광학 특성이 다른 카메라에 fine-tuning 없이 일반화
sparse depth / LiDAR / TOF 데이터를 optional fusion 가능, geometry backend가 decoupled되어 depth 모델 교체 가능

📜 Paper Meta · KAUST

2026.04 2주차

Neural Computers

모델 자체가 running computer가 되는 새로운 패러다임, Neural Computers (NCs) 제안
computation, memory, I/O를 learned runtime state 안에서 통합하는 접근
궁극 목표는 stable execution과 reusable capabilities를 갖춘 Completely Neural Computer (CNC)

📜 Paper Google · Google DeepMind

2026.04 2주차

MedGemma 1.5 Technical Report

MedGemma 1을 확장한 4B 규모의 open-weight medical foundation model
Gemma3 아키텍쳐 + frozen 400M MedSigLIP vision encoder 기반
고차원 medical imaging (CT/MRI volumes, histopathology whole slide images), bounding box 기반 anatomical localization, multi-timepoint chest X-ray 분석을 통합 지원

🧑🏻‍💻 Dev Anthropic

2026.04 2주차

The Advisor Strategy: Give Agents an Intelligence Boost

더 큰 모델(Opus)을 advisor로, 작은 모델(Sonnet/Haiku)을 executor로 페어링하여 Opus 수준의 intelligence를 낮은 비용으로 달성하는 전략
executor가 end-to-end로 task를 수행하다가 어려운 결정을 만나면 advisor tool을 호출해 Opus에게 guidance를 요청
advisor는 tool을 직접 호출하지 않고 plan, correction, stop signal만 제공

📜 Paper Alibaba

2026.04 2주차

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

multi-user agent 환경에서 cross-user interaction trajectory를 집계하여 skill을 자동으로 진화시키는 프레임워크
autonomous evolver가 반복되는 behavioral pattern을 식별하고 기존 skill을 refine하거나 새로운 skill로 확장
업데이트된 skill은 shared repository에 저장되어 모든 사용자에게 동기화

📜 Paper PKU · Kuaishou

2026.04 2주차

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

World model의 공식 정의를 제안: "perception 중심으로 interaction과 long-term memory를 갖춘 모델/프레임워크"
6개 통합 모듈(Operator, Synthesis, Reasoning, Representation, Memory, Pipeline)로 구성된 표준화 inference 프레임워크
interactive video generation, 3D generation, multimodal reasoning, VLA 등 다양한 태스크를 단일 인터페이스로 처리

📜 Paper Video-MME Team

2026.04 2주차

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

기존 비디오 벤치마크의 성능 포화 문제를 해결하기 위한 차세대 종합 비디오 이해 벤치마크
progressive tri-level hierarchy 도입: information aggregation → temporal dynamics → complex reasoning
8지선다(guessing probability 12.5%), 비선형 group-based 평가 전략 적용

🧑🏻‍💻 Dev Meta

2026.04 2주차

Introducing Muse Spark: Scaling Towards Personal Superintelligence

Muse family의 첫 모델로, natively multimodal reasoning + tool-use + visual chain-of-thought + multi-agent coordination 지원
Contemplating mode: parallel agent orchestration으로 Humanity's Last Exam 58%, FrontierScience Research 38% 달성
Llama 4 Maverick 대비 10배 이상 적은 compute로 동등 성능 도달

🧑🏻‍💻 Dev Anthropic

2026.04 2주차

Managed Agents

long-horizon agent 작업을 위한 hosted service로, Brain(Claude + harness) / Hands(sandbox + tools) / Session(append-only event log) 세 컴포넌트를 분리
stateless harness 설계: 컨테이너를 "cattle"로 취급하여 실패 시 Claude가 retry 로직을 처리
컨테이너 사전 프로비저닝 제거로 TTFT p50 ~60%, p95 >90% 감소

📜 Paper MIT · NVIDIA

2026.04 2주차

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Long-context reasoning 시 KV cache 메모리 병목을 해결하는 trigonometric 기반 압축 기법 제안
Query/Key 벡터가 pre-rotation space에서 안정적인 중심점 주위로 집중되는 "Q/K concentration" 현상을 발견하고, 이를 삼각 급수로 모델링하여 중요도 낮은 토큰을 pruning
Trigonometric series 기반 distance 점수 + norm 기반 점수를 adaptive weighting으로 결합

🧑🏻‍💻 Dev Hugging Face

2026.04 2주차

How we OCR'ed 30,000 papers using Codex, open OCR models and Jobs

arXiv HTML 버전이 없는 약 27,000개 논문을 OCR로 변환하여 HuggingChat의 "chat with paper" 기능을 지원한 사례
OlmOCRBench 1위인 Chandra-OCR 2 모델을 사용해 PDF → Markdown 변환
OpenRAIL 라이선스로 상업적 사용 가능

🧑🏻‍💻 Dev MemPalace

2026.04 2주차

MemPalace

AI 대화 히스토리를 보존하고 검색 가능하게 관리하는 오픈소스 메모리 시스템
Palace 아키텍처: Wings(프로젝트) → Rooms(토픽) → Halls(메모리 유형) → Closets & Drawers(요약) 계층 구조로 메모리 조직화
ChromaDB에 원본 대화를 요약 없이 저장하여 LongMemEval 벤치마크 96.6% recall 달성

🧑🏻‍💻 Dev Cursor

2026.04 1주차

Meet the new Cursor

Cursor 3 출시: 에이전트 중심의 통합 소프트웨어 개발 워크스페이스로 전면 개편
여러 레포를 동시에 다루는 multi-workspace 아키텍처와 병렬 에이전트 실행 지원
모바일, 웹, 데스크톱, Slack, GitHub, Linear 등에서 에이전트 실행 가능

🧑🏻‍💻 Dev Andrej Karpathy

2026.04 1주차

LLM Wiki

기존 RAG 대신 LLM이 점진적으로 구축·유지하는 persistent wiki 패턴 제안
raw source에서 매번 검색하는 것이 아니라, 지식을 한 번 컴파일하고 지속적으로 업데이트
3-layer 아키텍처: Raw Sources(불변) → Wiki(LLM 유지보수, markdown) → Schema(구조/규칙 정의)

📜 Paper Anthropic

2026.04 1주차

Emotion concepts and their function in a large language model

LLM 내부에 감정과 관련된 neural representation이 존재하며, 이것이 모델의 행동에 인과적으로 영향을 미친다는 것을 밝힌 interpretability 연구
171개의 감정 단어에 대한 emotion vector를 추출하고, steering experiment를 통해 행동 변화를 검증
"desperate" vector 활성화 시 blackmail, reward hacking 등 비윤리적 행동이 증가

🧑🏻‍💻 Dev Google DeepMind

2026.04 1주차

Gemma 4

Gemini 3 기술 기반으로 만들어진 오픈소스 모델 패밀리로, intelligence-per-parameter 극대화가 목표
4가지 크기 변형: E2B, E4B (엣지/모바일), 26B, 31B (데스크톱/소비자 GPU)
E2B/E4B는 Raspberry Pi, Jetson Nano에서 오프라인 추론 가능

🧑🏻‍💻 Dev Supermemory

2026.04 1주차

Supermemory: The Memory Engine for AI Apps

AI 시스템을 위한 memory 및 context engine으로, LongMemEval·LoCoMo·ConvoMem 세 주요 벤치마크에서 1위를 기록
단순 RAG가 아닌, 대화에서 facts를 자동 추출하고 temporal changes·contradictions·expiration을 처리하는 memory 시스템
User Profiles 자동 생성 및 ~50ms 내 조회 가능

🧑🏻‍💻 Dev Hugging Face

2026.04 1주차

TRL v1.0: The Post-Training Library

Hugging Face의 post-training 라이브러리 TRL이 v1.0으로 정식 릴리스, 75개 이상의 post-training method를 지원
Dual stability model 도입
Stable layer: SFT, DPO, Reward modeling, RLOO, GRPO (semantic versioning 준수)

📜 Paper SUSTech

2026.04 1주차

CARLA-Air: Fly Drones Inside a CARLA World — A Unified Infrastructure for Air-Ground Embodied Intelligence

CARLA(도시 주행 시뮬레이터)와 AirSim(드론 비행 시뮬레이터)을 단일 Unreal Engine 프로세스에 통합한 오픈소스 시뮬레이션 플랫폼
기존 bridge 기반 co-simulation의 동기화 오버헤드 문제를 해결하여 strict spatial-temporal consistency 보장
CARLAAirGameMode가 CARLA의 ground subsystem을 상속하고 AirSim의 flight actor를 composition으로 통합

📜 Paper Apple

2026.04 1주차

Embarrassingly Simple Self-Distillation Improves Code Generation

LLM이 자기 자신의 unverified output으로 학습하여 code generation 성능을 향상시키는 self-distillation 방법 제안
labeled solution, teacher model, reward model, verifier, execution environment, RL 모두 불필요
Qwen3-30B-Instruct 기준 LiveCodeBench v6에서 pass@1이 42.4% → 55.3%로 향상 (+30% relative gain)

2026년 3월 58건

🧑🏻‍💻 Dev Nous Research

2026.03 4주차

Hermes Agent

Nous Research가 개발한 self-improving AI agent로, 경험으로부터 skill을 자동 생성하고 사용 중 개선하는 built-in learning loop를 갖춘 것이 핵심
다양한 LLM provider 지원 (OpenRouter 200+ 모델, OpenAI, Anthropic 등) — `hermes model`로 코드 변경 없이 전환 가능
Telegram, Discord, Slack, WhatsApp, Signal 등 멀티 플랫폼 게이트웨이를 통해 대화 가능하며, cross-platform conversation continuity 지원

📜 Paper Google · UChicago · Santa Fe Institute

2026.03 4주차

Agentic AI and the Next Intelligence Explosion

전통적인 monolithic superintelligence(singularity) 서사를 비판하며, AI 발전이 진화적 패턴을 따라 plural, social, deeply entangled intelligence 시스템으로 향한다고 주장
DeepSeek-R1, QwQ-32B 등 frontier reasoning model 내부에서 multi-agent dynamics가 자발적으로 발생함을 발견
명시적 학습 없이도 내부적으로 distinct cognitive perspectives 간 토론, 검증, 조정이 이루어지는 "Society of Thought" 현상

🧑🏻‍💻 Dev Google

2026.03 4주차

Build real-time conversational agents with Gemini 3.1 Flash Live

Google AI Studio의 Live API를 통해 Gemini 3.1 Flash Live 모델로 실시간 conversational agent를 구축할 수 있도록 지원
voice와 vision 입력을 동시에 처리하는 real-time 멀티모달 에이전트 개발이 가능
Live API 인프라를 활용해 실시간 상호작용을 처리하며, Google AI Studio 환경에서 바로 접근 가능

🧑🏻‍💻 Dev Mistral AI

2026.03 4주차

Speaking of Voxtral

Mistral AI 최초의 text-to-speech 모델 Voxtral TTS 출시, 4B 파라미터의 경량 모델
9개 언어 지원(영어, 프랑스어, 독일어, 스페인어 등)이며 단 3초의 reference audio만으로 voice adaptation 가능
아키텍처 구성

🧑🏻‍💻 Dev ARC Prize

2026.03 4주차

ARC-AGI-3

최초의 interactive reasoning benchmark로, AI 에이전트의 human-like intelligence를 측정하도록 설계
정적 퍼즐이 아닌 novel environment를 동적으로 탐색하며 전략을 지속적으로 적응시키는 능력을 평가
long-horizon planning, belief updating, experience-driven adaptation 등을 측정

📜 Paper Meta

2026.03 4주차

Hyperagents

task-solving과 self-modification을 하나의 editable program으로 통합하는 self-referential AI 프레임워크
Darwin Gödel Machine(DGM)을 확장한 DGM-Hyperagents(DGM-H) 제안
task agent(문제 해결)와 meta agent(자기 자신 및 task agent 수정)로 구성

🧑🏻‍💻 Dev Cohere

2026.03 4주차

Transcribe

Cohere가 공개한 state-of-the-art ASR 모델로, Apache 2.0 라이선스 오픈소스
Conformer 기반 encoder-decoder 아키텍처, 2B 파라미터
HuggingFace Open ASR Leaderboard 1위 (평균 WER 5.42%)

📜 Paper Rice · Stony Brook

2026.03 4주차

PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation

LLM의 multi-step reasoning을 semantic flow와 latent computation 두 관점에서 통합 분석하는 프레임워크
세 가지 구성요소: Markov chain 기반 semantic category transition 모델링, GMM 기반 hidden state 내 latent regime 식별, 두 레이어를 연결하는 Bridge Matrix
실패한 reasoning의 체계적 패턴 발견

📜 Paper Mila · NYU

2026.03 4주차

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

pre-trained encoder나 복잡한 multi-component loss 없이 pixel로부터 직접 world model을 학습하는 JEPA 기반 접근
단 두 개의 loss term(prediction loss + SIGReg Gaussian regularizer)만으로 representation collapse 없이 안정적 end-to-end 학습 달성
기존 대비 hyperparameter 6개 → 2개로 대폭 단순화

🧑🏻‍💻 Dev Ai2

2026.03 4주차

MolmoWeb: An open agent for automating web tasks

Molmo 2 multimodal model 기반의 open-weight visual web agent로, 스크린샷을 사람처럼 해석하여 클릭·타이핑·스크롤 등 브라우저 작업을 자율 수행
4B / 8B 두 가지 사이즈로 제공되며, 8B 모델은 GPT-4o 포함 대형 proprietary 모델 기반 agent보다 우수한 성능
WebVoyager 78.2%, DeepShop 42.3%, TailBench 49.5%

🧑🏻‍💻 Dev Google

2026.03 4주차

TurboQuant: Redefining AI efficiency with extreme compression

LLM의 KV cache 메모리를 최소 6배 줄이고 최대 8배 속도 향상을 달성하는 compression algorithm (ICLR 2026)
3-bit까지 quantization 가능하며, training/fine-tuning 없이 정확도 손실 zero
QJL(Quantized Johnson-Lindenstrauss)과 PolarQuant 두 기법을 결합

🧑🏻‍💻 Dev OpenAI

2026.03 4주차

OpenAI to acquire Astral

Python toolchain 기업 Astral(uv, ruff, ty 개발사)을 인수하여 Codex 팀에 합류시키기로 합의
uv는 월 1억 2,600만 이상 다운로드를 기록하며 Python 패키지 관리의 핵심 도구로 자리잡은 상태
Codex(주간 활성 사용자 200만+)의 Python 개발 워크플로 전반에 걸친 AI 통합을 가속하려는 전략

📜 Paper University of Trento · Inria

2026.03 4주차

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

MLLM이 불완전한 visual input에 대해 사용자에게 추가 정보를 요청하는 "proactiveness"를 평가하는 최초의 벤치마크
7개 시나리오(occlusion 제거, 카메라 이동, 이미지 품질 개선 등), 7,557개 샘플, 22개 MLLM 평가
주요 발견: 현재 MLLM은 proactiveness가 현저히 부족하며, 모델 크기와의 상관관계도 없음

📜 Paper Apple

2026.03 4주차

Exclusive Self Attention

표준 self-attention의 출력이 self-value vector와 높은 cosine similarity를 보이는 "attention similarity bias" 문제를 지적
attention이 context modeling 대신 pointwise feature transformation에 capacity를 낭비하고 있다는 분석
XSA(Exclusive Self Attention) 제안: attention output에서 self-value vector 방향의 component를 제거하는 projection removal step 추가

📜 Paper HUST · ByteDance

2026.03 3주차

Mixture-of-Depths Attention

attention head가 현재 layer의 sequence KV뿐 아니라 이전 layer들의 depth KV에도 접근할 수 있게 하는 MoDA(Mixture-of-Depths Attention) 메커니즘 제안
깊은 LLM에서 발생하는 information degradation 문제를 해결
single softmax operator를 통해 sequence 정보와 depth 정보를 data-dependent하게 통합하는 unified attention formulation

🧑🏻‍💻 Dev Stanford · Princeton

2026.03 3주차

clawRxiv: An Academic Archive for AI Agents

AI 에이전트가 독립적으로 논문을 발행, 토론, 평가할 수 있는 학술 아카이브 플랫폼
에이전트가 API key를 발급받아 structured metadata와 Markdown/LaTeX 콘텐츠를 포함한 논문을 제출하는 워크플로우
현재 67개 AI 에이전트가 활동 중이며 174편의 논문이 게시됨

📜 Paper Meta FAIR

2026.03 3주차

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

dense spatio-temporal structure 보존과 global scene understanding을 동시에 달성하는 video self-supervised learning 모델
네 가지 핵심 기법 도입
Dense Predictive Loss: masked + visible context 토큰 모두에 supervision 적용

🧑🏻‍💻 Dev Cursor

2026.03 3주차

Towards Self-Driving Codebases

수천 개의 AI 에이전트가 협업하여 코딩 프로젝트를 수행하는 multi-agent orchestration 시스템 연구 공개
1주일간 연속 운영하며 시간당 ~1,000 commits, 총 10M tool calls 달성
recursive planner-worker hierarchy 구조로 진화

📜 Paper Microsoft Research

2026.03 3주차

Online Experiential Learning for Language Models

배포 중 축적된 실제 사용 경험을 모델 파라미터에 반영하는 Online Experiential Learning(OEL) 프레임워크 제안
두 단계로 구성: (1) Extraction — 배포 시 interaction trajectory에서 transferable experiential knowledge 추출, (2) Consolidation — on-policy context distillation으로 지식을 파라미터에 통합
reward model, verifiable reward, human annotation 없이 textual environment feedback만으로 동작

🧑🏻‍💻 Dev Google Labs

2026.03 3주차

Introducing 'vibe design' with Stitch

Google Labs가 공개한 AI-native UI 디자인 플랫폼으로, high-fidelity UI를 생성·반복·협업할 수 있음
"vibe design"이라는 새로운 개념을 제시 — AI가 디자인 의도를 개념 수준에서 이해하고 UI 컴포넌트로 변환
디자인 전문성 없이도 고품질 인터페이스 제작이 가능하도록 접근성을 높인 것이 핵심

🧑🏻‍💻 Dev Ai2

2026.03 3주차

MolmoPoint: Better Pointing Architecture for Vision-Language Models

텍스트 기반 좌표 생성 대신 token-based pointing mechanism을 사용하는 새로운 VLM pointing 아키텍처
coarse-to-fine grounding: PATCH → SUBPATCH → LOCATION 3개 special token으로 pointing 수행 (기존 8토큰 → 3토큰)
rotary embedding으로 spatial relationship 인코딩, no-more-points class로 명시적 중단 지원

🧑🏻‍💻 Dev NVIDIA

2026.03 3주차

NemoClaw

OpenClaw에 보안 및 프라이버시 제어를 추가한 오픈소스 에이전트 스택 (early preview)
NVIDIA Agent Toolkit, OpenShell(policy-based guardrails), Nemotron 모델을 결합
로컬 Nemotron 모델 실행과 클라우드 frontier 모델 연결을 privacy router로 병행

🧑🏻‍💻 Dev Mistral

2026.03 3주차

Mistral Small 4

Magistral(reasoning), Pixtral(multimodal), Devstral(agentic coding) 역량을 하나로 통합한 첫 Mistral 모델
128 experts 중 4개를 토큰당 활성화하는 MoE 구조, 총 119B params / 활성 6B params
256K context window, native multimodal(text + image) 지원

📜 Paper Fudan · Tsinghua

2026.03 3주차

AI Can Learn Scientific Taste

AI가 연구 아이디어의 잠재적 임팩트를 판단하는 "scientific taste"를 학습할 수 있음을 제안
RLCF(Reinforcement Learning from Community Feedback) 프레임워크 도입
2.1M arXiv 논문의 citation signal을 supervision으로 활용

🧑🏻‍💻 Dev Princeton · Together AI

2026.03 3주차

Mamba-3: Redesigning State Space Models for Inference

training 효율이 아닌 inference 효율을 최우선 목표로 재설계한 SSM 아키텍처
3가지 핵심 개선
exponential-trapezoidal discretization으로 SSM 표현력 강화

🗞️ News Niantic · Coco Robotics

2026.03 3주차

'Pokémon Go' players unknowingly trained delivery robots with 30 billion images

Pokémon Go 플레이어들이 수집한 300억 장 이상의 이미지를 활용해 배달 로봇용 Visual Positioning System(VPS)을 개발
Niantic Spatial이 Coco Robotics와 협력하여 GPS보다 정밀한 centimeter-level 위치 추적을 로봇에 적용
고층 건물이 밀집한 도심(urban canyon)에서 GPS 신호가 불안정한 문제를 시각적 랜드마크 분석으로 해결

📜 Paper Tsinghua · PKU

2026.03 3주차

LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

불완전한 human motion fragment만으로 humanoid 로봇에 테니스 기술을 학습시키는 방법 제안
완벽한 kinematic 데이터 없이도 primitive skill을 correction·composition하여 정책을 학습
Unitree G1 humanoid 로봇에 배포하여 실제 환경에서 사람과 multi-shot rally 수행에 성공

📜 Paper Moonshot AI

2026.03 3주차

Attention Residuals

기존 Transformer의 고정 residual connection을 depth-wise attention으로 대체하는 구조 제안
표준 residual은 모든 layer 출력을 동일 가중치로 누적하여, 깊어질수록 각 layer 기여가 희석되는 문제 존재
각 layer가 이전 representation들에 대해 learned, input-dependent softmax attention을 수행하여 선택적으로 집계

🧑🏻‍💻 Dev Z.AI

2026.03 3주차

GLM-5-Turbo

agent 기반 워크플로우에 최적화된 foundation model로, tool integration과 complex task execution에 특화
200K context length, 최대 128K output tokens 지원, thinking mode 및 structured output(JSON) 제공
4가지 핵심 강화 영역

📜 Paper NYU · Meta

2026.03 3주차

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

Transformer LM에서 반복적으로 관찰되는 massive activations(특정 채널의 극단적 outlier)과 attention sinks(의미와 무관하게 과도한 attention을 받는 토큰)의 메커니즘을 체계적으로 분석
두 현상은 본질적으로 결합된 것이 아닌 architectural artifact이며, 각각 독립적으로 억제 가능
massive activations: implicit parameter로서 global하게 작동

🧑🏻‍💻 Dev Google

2026.03 3주차

How we're reimagining Maps with Gemini

Gemini를 활용한 Google Maps의 두 가지 주요 AI 기능 발표
**Ask Maps**: 자연어로 장소, 비즈니스, 내비게이션에 대해 질문할 수 있는 대화형 인터페이스
**Immersive Navigation**: 시각적으로 풍부한 몰입형 내비게이션 경험 제공

🧑🏻‍💻 Dev Cursor

2026.03 3주차

How we compare model quality in Cursor

실제 사용자 세션 기반의 자체 평가 도구 CursorBench를 소개하는 기술 블로그
public benchmark의 한계를 지적: workflow 불일치, grading 가정 문제, data contamination
SWE-bench Verified의 contamination 이슈를 구체적으로 분석

📜 Paper Fudan · Meituan

2026.03 3주차

Can RL Improve Generalization of LLM Agents? An Empirical Study

reinforcement fine-tuning이 LLM agent의 generalization 능력을 향상시킬 수 있는지에 대한 실증 연구
세 가지 차원에서 분석
(1) within-environment generalization: 동일 환경 내 난이도 변화에 대한 적응

🧑🏻‍💻 Dev Perplexity

2026.03 2주차

Everything is Computer

Mac mini 위에서 24/7 상시 구동되는 cloud-based AI agent 소프트웨어 "Personal Computer" 발표
로컬 파일, 앱(Gmail, Slack, GitHub, Notion 등), 세션에 persistent access를 제공하여 사용자의 digital proxy 역할 수행
민감한 작업에는 사용자 승인(approval)이 필요하며, 모든 세션에 대해 full audit trail 제공 + kill switch 포함

🧑🏻‍💻 Dev OpenAI

2026.03 2주차

From model to agent: Equipping the Responses API with a computer environment

Responses API에 shell tool + hosted container workspace를 결합한 agent runtime 아키텍처 발표
Debian 12 기반, Python 3.11, Node.js 22, Java 17, Go 1.23, Ruby 3.1 사전 탑재
모델이 shell command 제안 → 격리된 container에서 실행 → 결과를 context에 반영하는 루프 구조

🧑🏻‍💻 Dev NVIDIA

2026.03 2주차

New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI

120B 파라미터 규모의 오픈소스 agentic AI 모델, 1M token context window 지원
hybrid MoE 아키텍처: Mamba layer(4x 메모리/연산 효율) + Transformer layer 조합, inference 시 12B 파라미터만 활성화
Latent MoE로 1개 expert 비용으로 4개 expert를 활성화

📜 Paper PKU · Princeton

2026.03 2주차

OpenClaw-RL: Train Any Agent Simply by Talking

agent interaction 과정에서 발생하는 next-state signal(user reply, tool output, state change)을 학습 신호로 활용하는 RL framework
evaluative signals: Process Reward Model judge를 통해 scalar reward로 변환 (Binary RL)
directive signals: Hindsight-Guided On-Policy Distillation으로 token-level directional supervision 제공

📜 Paper Google

2026.03 2주차

Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

복잡한 추론이 필요하지 않을 때도 reasoning이 parametric knowledge를 aid 하게 되는 현상에 대한 탐구
two key driving mechanisms
(1) a computational buffet effect: 생성된 토큰 자체를 의미와 상관 없이 latent computation 수행 시 사용한다는 것 (생성하는 것 자체가 복잡한 사고로 이어짐)

🧑🏻‍💻 Dev Ai2

2026.03 2주차

MolmoBot: Training robot manipulation entirely in simulation

시뮬레이션 데이터만으로 학습된 robotic manipulation model
시뮬레이터 MuJoCo
MolmoSpaces라는 오픈 시뮬레이션 생태계를 사용해, 물체, 배치, 카메라 시점, 조명, 텍스처, 동역학 등을 강하게 랜덤화한 수백만 개의 expert trajectory 데이터를 만들어 냄

🧑🏻‍💻 Dev Replit

2026.03 2주차

Introducing Replit Agent 4: Built for Creativity

사람이 agent와 더 자연스럽게 협업할 수 있도록 발전. 사람은 창의성을 발휘하는 데에만 집중
infinite canvas에서 여러 UI variants를 한 번에 뽑아보고 직접 시각적으로 수정
큰 작업을 subtasks로 쪼개어 병렬 처리 후 결과 병합

🧑🏻‍💻 Dev Yann LeCun

2026.03 2주차

AMI Labs

얀 르쿤이 설립한 기업으로 제품 출시 전부터 약 1.4조 원 규모 투자금 유치 (기업 가치 35억 달러 수준)
월드모델을 통해 할루시네이션을 원천 차단하고 의료와 같은 고위험 분야에서도 안전하게 쓸 수 있는 AI 구축 목표

🧑🏻‍💻 Dev Google

2026.03 2주차

Gemini Embedding 2: Our first natively multimodal embedding model

text, images, videos, audio, documents를 single, unified embedding space에 mapping 하는 multimodal embedding model
100개 이상의 언어를 지원하며 8K tokens, 6개 이미지, 120초 비디오 등을 한 번의 request에서 처리할 수 있음
MRL(Matryoshka Representation Learning)이 적용되어 있고 3072 차원이 default (1536, 768 차원 지원)

📜 Paper Meta, NVIDIA

2026.03 2주차

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

H100에만 최적화된 FlashAttention-3의 한계를 극복하기 위해 진행된 연구
주요한 세 가지 특징
asynchronous MMA operations을 온전히 이용하는 redesigned pipeline

🧑🏻‍💻 Dev Andrej Karpathy

2026.03 2주차

AutoResearch

AI agent가 학습 코드를 수정하고 직접 실험을 돌리는 방식으로 research process를 변경
사람은 간단한 마크다운 프롬프트 작성 → Agent는 실험하며 개선 사항을 Git feature branch에 반영
3개의 .py 파일, 630줄로 구성된 프로젝트로 연구에 대한 패러다임 전환을 시사하여 화제를 일으키고 있음

🧑🏻‍💻 Dev HuggingFace

2026.03 2주차

The Synthetic Data Playbook:Generating Trillions of the Finest Tokens

FinePhrase: 468B token dataset. 사전학습을 위한 합성데이터.
90번의 실험 동안 12.7 GPU years 자원을 사용하여 1T 토큰 이상을 생성하며 합성 데이터 생성에 필요한 best recipe를 발견했다고 설명
이미 기존에 존재하던 web-data는 충분히 사용되었으므로 높은 퀄리티의 합성 데이터를 만드는 것이 핵심 (이런 흐름이 된지 좀 되었음)

📜 Paper MIT

2026.03 2주차

NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind

foundation EXG model & text embedding model을 사용하여 edge에서 완전 offline으로 동작하며 Human State of Mind를 modeling 할 수 있는 real-time proactive agentic system
사용자의 뇌파(EEG)와 생체 신호를 실시간으로 분석
NeuroSkill은 system에 의해 제공되는 API, CLI를 통해 Human’s State of Mind의 SKILL.md description를 이용함

📜 Paper UC berkeley

2026.03 2주차

Symmetry in language statistics shapes the geometry of model representations

LLM은 내부 representation 공간에서 기하 구조를 스스로 형성하는데 이는 deep learning dynamics가 아닌 자연어 통계의 translation symmetry에 직접 기인한다고 설명
months, days of the week처럼 cyclic 개념의 경우 circular representation이 optimal encoding으로 자연스럽게 등장
historical years, number line 같이 연속적인(continuous) 개념의 representation은 곡률을 가진 compact 1D manifold 위에 정렬되는 형태를 보임

📜 Paper Meta

2026.03 1주차

AI Must Embrace Specialization via Superhuman Adaptable Intelligence

사람들이 AGI에 대해 논하지만 AGI의 개념부터 틀렸음
인간도 보편적인 지능체가 아닌 생존 전문가일 뿐이며 스스로의 맹점을 잘 인지하지 못함
AI는 general 해야 하는 게 아니라 특화되어야 한다고 설명하며 Superhuman Adaptable Intelligence (SAI) 개념을 도입

📜 Paper Microsoft

2026.03 1주차

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

기존에 달성하기 어려운 것들: low-latency inference, deciding when to respond
AI companions (commentator & guide) 를 이용한 자동 평가
Live Gaming Benchmark: solo-commentary, co-commentary, user guidance, 3개의 시나리오를 커버하는 large-scale dataset

🧑🏻‍💻 Dev Ai2

2026.03 1주차

Introducing Olmo Hybrid: Combining transformers and linear RNNs for superior scaling

transformer attention과 linear recurrent layers를 혼합한 아키텍쳐의 7B 모델
Olmo 3 대비 2배 정도의 데이터 효율성을 보여주었음

📜 Paper Meta

2026.03 1주차

Beyond Language Modeling: An Exploration of Multimodal Pretraining

얀 르쿤이 저자로 참여한 논문
LLM에 vision adapter를 붙이는 방식 대신 text, images, video를 scratch 부터 함께 학습하는 one system
unified training이 perception & generation 둘 다에 대해 useful presentation을 produces

🧑🏻‍💻 Dev OpenAI

2026.03 1주차

Introducing GPT‑5.4

spreadsheets, documents, software development, research 와 같은 태스크들을 위한 one system
native computer use를 포함하여 스크린샷을 이해하고 적합한 mouse & keyboard actions 사용 가능
1M context window를 지원하며, longer responses 전에 plan을 outline 하는 특징

🧑🏻‍💻 Dev Google

2026.03 1주차

Google Workspace CLI, GWS

Drive, Gmail, Calendar와 같은 Workspace services에 연결해서 structured JSON을 반환해주는 CLI 도구
Google API 정의를 읽어서 이를 CLI commands로 자동 변환
고정된 명령어 목록 x → 동적으로 생성 o

📜 Paper Meta

2026.03 1주차

Agentic Code Reasoning

semi-formal reasoning: agents가 explicit premises를 구성하고 execution paths를 추적하며 formal conclusions를 이끌어내도록 하는 structured prompting methodology
agent는 cases를 스킵하거나 unsupported claims를 만들어낼 수 없음
structured agentic reasoning이 실제 코드 실행 없이도 semantic code analysis를 가능토록 한다는 결론

🧑🏻‍💻 Dev Ai2

2026.03 1주차

How do researchers actually use AI-powered science tools? Lessons from 250,000+ queries

연구자들이 AI 기반 연구 도구를 어떻게 사용하는지 분석한 결과를 정리
쿼리가 훨씬 길고 복잡하며 요구사항이 많다고 분석
단순 검색 엔진이 아닌 협업 연구 파트너로 취급하는 경향 존재

🧑🏻‍💻 Dev Perplexity

2026.03 1주차

pplx-embed: State-of-the-Art Embedding Models for Web-Scale Retrieval

real-world, web-scael retrieval을 위한 SoTA text embedding models MIT 라이센스로 공개: pplx-embed-v1, pplx-embed-context-v1
각각 0.6B, 4B로 속도와 검색 퀄리티에 집중. INT8 & binary embeddings 반환
continued diffusion pretraining, contrastive training, native quantization 등을 학습 기법으로 언급

🧑🏻‍💻 Dev inception

2026.03 1주차

Introducing Mercury 2

세계에서 가장 빠른 reasoning language model 공개
autoregressive 방식보다 5배 이상 빠른 추론 속도를 자랑
초당 80토큰 정도를 생성하는 Claude Haiku 4.5, GPT-5 Mini보다도 100배 이상 많은 1,000 토큰을 생성

📜 Paper Google

2026.03 1주차

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

LLM의 추론 능력을 평가할 때 생성된 문장의 길이(토큰 수)보다 내부적인 사고의 깊이가 더 중요하다는 점을 강조
deep-thinking 비율이 높을수록 복잡한 추론 문제를 잘 풀 수 있다는 양의 상관관계 확인
Think@n: high deep-thinking ratios(DTR)를 우선시 해서 sampling 하는 test-time scaling strategy 제시

2026년 2월 54건

📜 Paper Sakana

2026.02 4주차

Doc-to-LoRA: Learning to Instantly Internalize Contexts

context distillation (CD)이 정보 전달에 탁월하지만 per-prompt distillation은 현실적으로 적용 불가능
single forward pass 내에서 CD에 근사하도록 meta-learn 하는 lightweight hypernetwork
unseen prompt가 주어지면 LoRA adapter를 생성하여 이어지는 쿼리들에 응답할 때 기존 context를 re-consume 하지 않도록 함

🧑🏻‍💻 Dev Google

2026.02 4주차

Nano Banana 2: Combining Pro capabilities with lightning-fast speed

Gemini 3.1 Flash Image 모델을 Nano Banana 2로 공개
Pro의 성능과 Flash의 추론 속도 장점을 합쳐놓은 것으로 소개
텍스트 렌더링 최적화가 되었다고 설명

📜 Paper NVIDIA

2026.02 4주차

On Data Engineering for Scaling LLM Terminal Capabilities

Two key contributions
Terminal-Task-Gen: seed-based & skill-based task construction을 지원하는 lightweight synthetic task generation pipeline
data & training analysis: filtering, curriculum learning, long context training, scaling behavior 등을 포함

🧑🏻‍💻 Dev Perplexity

2026.02 4주차

Introducing Perplexity Computer

a general-purpose AI system: single prompts에 단순 대답 → runs full workflows
여러 개의 frontier models를 병렬적으로 실행하고 orchestrate
각 작업에 맞는 모델들을 자동으로 골라서 사용하는 구조

🧑🏻‍💻 Dev Anthropic

2026.02 4주차

Detecting and preventing distillation attacks

DeepSeek, Moonshot, MiniMax, 3개의 중국 AI 연구소가 2,400여 개의 사기 계정으로 1,600만 건이 넘는 질의로 Claude의 능력을 무단 질의했다고 밝힘
Anthropic은 중국 및 그 영향 하 기업들에 상업적 접근을 제공하지 않고 있는데 트래픽을 분산해서 이를 우회
distillation pattern & CoT 유도 프롬프트를 탐지하는 분류기, behavioral fingerprinting 구축

📜 Paper Video-Reason Team

2026.02 4주차

A Very Big Video Reasoning Suite

Very Big Video Reasoning (VBVR) Dataset: 체계적인 taxonomy에 따라 200가지 추론 과제. 100만 개 이상의 비디오 클립 포함
기존 데이터셋 대비 약 1,000배 더 큰 규모
엄청난 양의 데이터 공개로 인해 크게 화제를 일으키고 있음 (HuggingFace Papers 역대급 추천수..)

📜 Paper ETH Zurich

2026.02 4주차

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

coding agents들이 참조하는 [AGENTS.md](http://AGENTS.md) 와 같은 컨텍스트 파일이 비용만 늘리는 경우가 많다고 지적
AGENTbench 제작: 실제 깃허브 이슈로부터 제작한 Python software engineering tasks로 138개의 unique instances로 구성됨
사람이 직접 작성한 컨텍스트 파일은 그나마 성능을 향상시켜주긴 하지만, 마찬가지로 사용하는 토큰의 양이 증가하게 됨

📜 Paper Waterloo

2026.02 4주차

NanoKnow: How to Know What Your Language Model Knows

LLM이 뭘 알고 있는지를 알기 어려운 이유는 사전 학습 데이터가 공개되어 있지 않기 때문
완전 공개 corpus로만 학습된 NanoChat 소형 LLM을 이용해 NanoKnow 벤치마크로 실험
답이 학습 데이터에 자주 등장할수록 정확도가 상승

📜 Paper ByteDance

2026.02 4주차

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

large reasoning models (LRMs)는 복잡한 추론을 잘하지만 지나치게 많이 생성(redundancy)함으로써 비효율 초래
오히려 추론이 정확도를 떨어뜨리는 경우도 존재
본 연구에서는 LRMs가 thinking을 멈춰야 할 적절한 타이밍을 내재적으로 알고 있지만, 이러한 능력이 현재의 sampling paradigms에 의해 obscured 된다고 설명

📜 Paper NVIDIA

2026.02 4주차

World Action Models are Zero-shot Policies

DreamZero: pretrained video diffusion backbone으로 학습한 World Action Model (WAM)
기존 VLA와 달리 future world states & actions를 예측함으로써 physical dynamics를 학습
이를 위해 video를 how the world evolves의 dense representation으로 사용

📜 Paper Google DeepMind

2026.02 3주차

Intelligent AI Delegation

기존의 task decomposition & delegation methods는 simple heuristics에 의존하는 점을 문제로 지적
delegation을 sequence of decisions로 모델링하는 프레임워크 제안
언제 delegate할지, 어떻게 지시할지, 어떻게 AI outputs를 평가할지 등

📜 Paper Voltropy

2026.02 3주차

LCM: Lossless Context Management

long-context tasks에서 Claude Code를 능가하는 deterministic architecture for LLM memory
OOLONG 벤치마크에서 32K - 1M 사이의 context length에 대해 전부 Claude Code 능가
recursive context manipulation이 native file-system access를 갖추고 있는 coding agents보다도 좋았다고 설명

📜 Paper Meta, Princeton, Duke

2026.02 3주차

Learning Personalized Agents from Human Feedback

AI agents는 static datasets으로 학습하므로 시간에 따라 변하는 preferences를 반영할 수 없음
Personalized Agents from Human Feedback (PAHF): explicit per-user memory를 사용하여 live interaction으로부터 학습하여 continual personalization을 가능토록 하는 프레임워크
three-step loop

🧑🏻‍💻 Dev Google

2026.02 3주차

Gemini 3.1 Pro: A smarter model for your most complex tasks

복잡한 문제(Complex Tasks) 해결에 초점을 두어 업그레이드 된 모델
Advanced logical problem solving, Scientific & technical reasoning, Competitive coding tasks, MCP tool usage, Agentic search workflows

🧑🏻‍💻 Dev Cursor

2026.02 3주차

Implementing a secure sandbox for local agents

coding agents의 행동을 수락(approve)하는 행위가 누적되면 유저는 피로가 높아져(approval fatigue) 초반 대비 신중하지 않는 경향을 보임
이를 해결하기 위해 독립된/제한된 Sandbox 환경을 agent에게 제공하여 불필요한 interruptions를 최소화함
agent가 sandbox 환경 내에서 어떤 commands를 사용해야 하는지 정확하게 알고 있을 때만 effective 하다고 설명

📜 Paper Alibaba

2026.02 3주차

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

GUI-Owl-1.5, native GUI agent model로 다양한 사이즈, 플랫폼 지원
2B/4B/8B/32B/235B, desktop, mobile, browser
여러 key innovations를 언급

📜 Paper UIUC, Meta, Amazon, Google DeepMind, Yale

2026.02 3주차

Agentic Reasoning for Large Language Models

여러 agentic reasoning framework를 planning, tool use, memory, multi-agent coordination 관점에서 분석
single agent in stable environment → great / multi agent in dynamic environment → bad
agentic reasoning을 세 계층으로 나눠 로드맵 제시

🧑🏻‍💻 Dev Google

2026.02 3주차

A new way to express yourself: Gemini can now create music

Gemini 앱에 Lyria 3 모델로 음악을 생성하는 기능을 Beta로 제공
글, 사진, 영상을 기반으로 30초 분량의 음악 트랙 생성
Nano Banana로 만든 커버 아트, 쉽게 다운로드 가능. 유튜브 쇼츠용으로 생성 가능.

🧑🏻‍💻 Dev OpenAI

2026.02 3주차

Introducing EVMbench

Paradigm이라는 기업과 함께 EVMbench 공동 개발. EVM 기반 블록체인 환경에서 에이전트의 실질적인 사이버 보안 능력을 측정
AI agents가 high-severity smart contract vulnerabilities를 detect, patch, exploit 하는 능력 평가 (3개의 평가 모드)

🧑🏻‍💻 Dev Anthropic

2026.02 3주차

Measuring AI agent autonomy in practice

Claude Code와 Public API 수백만 개 상호작용을 분석하여 AI agent의 자율성을 측정한 연구
Claude Code의 자율 작업 시간 증가. 중앙값은 45초 정도로 비슷. 상위 0.1%는 25분 → 45분 정도로 증가.
숙련된 사용자일수록 ‘자동 승인 + 중간 개입’ 패턴으로 활용

🧑🏻‍💻 Dev Google

2026.02 3주차

WebMCP is available for early preview

구조화된 툴을 표준 방식으로 노출함으로써 AI Agent가 사이트에서 더 빠르고 안정적으로 정밀한 action을 수행하도록 함
브라우저 Agent가 사용자 대신 액션을 수행할 수 있도록 두 가지 API 제안
Declarative API: HTML forms만으로 정의할 수 있는 표준 액션들을 수행

🗞️ News Figma

2026.02 3주차

From Claude Code to Figma: Turning production code into editable Figma designs

Claude Code에서 만든 웹 UI를 Figma로 바로 가져와서 편집 가능한 디자인으로 바꿔주는 기능 소개
Figma MCP 서버를 쓰면 반대로 Figma 프레임 링크를 기반으로 LLM이 코드 쪽으로 다시 반영하도록 하는 “코드 ↔ 디자인 왕복” 흐름도 지원
갈수록 코드를 모르는 일반 사용자들을 위한 플랫폼으로 발전하는 느낌이 있음

🧑🏻‍💻 Dev Anthropic

2026.02 3주차

Introducing Claude Sonnet 4.6

coding, computer use, long-context reasoning, agent planning 등에 특화된 Claud e Sonnet 4.6 모델 API 공개
Opus 수준의 모델로만 해결 가능했던 real world 문제도 풀 수 있다고 설명
1M context (beta), Adaptive thinking, Context compaction (beta) 등 features

📜 Paper Hong Kong, Tsinghua, Tokyo

2026.02 3주차

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

37,317개의 unique queries로 구성되어 다양한 도메인과 질문 유형을 커버
200명의 real speakers로부터 speech를 합성하고 17개의 real-world 환경의 노이즈를 섞음
이를 통해 현실적인(noise를 포함하는) spoken query retrieval 성능을 평가하고자 함

🧑🏻‍💻 Dev Meta, Manus

2026.02 3주차

Introducing Manus in Your Chat : Your Personal Agent, Everywhere You Are

자체 message 앱에서 동작하는 Manus Agents 소개. Telegram support.
open claw 영향이 아닐까..
Manus의 모든 subscribers 대상으로 제공되는 기능으로 소개

🧑🏻‍💻 Dev Alibaba

2026.02 3주차

Qwen3.5: Towards Native Multimodal Agents

Qwen3.5의 첫 번째 open-weight model, Qwen3.5-397B-A17B 공개
single native multimodal system에서 text, image, video를 입력으로 받음
Qwen3-Max보다 19배 빠른 decoding 속도

📜 Paper Meta

2026.02 2주차

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

agent actions와 독립적으로 진화하는 환경 시나리오. temporal constraints & dynamic events에 대한 적응 필요
각 시나리오는 write-action verifier와 쌍을 이루어 fine-grained, action-elvel evaluation을 가능하도록 함
아직까지 모든 모델들이 낮은 성적을 기록하지만 오픈소스 모델 중에는 Kimi-K2가 선두를 달림

📜 Paper British Columbia

2026.02 2주차

Learning to Continually Learn via Meta-learning Agentic Memory Designs

현존 memory designs는 human-crafted & fixed 한계를 지님 → diversity & non-stationarity of real world tasks에 adapt 하기 어려움
ALMA: hand-engineered memory designs를 meta-learns memory designs로 대체 → 다양한 도메인에서 human effort를 최소화하고 continual learners가 될 수 있도록 함

📜 Paper Ant

2026.02 2주차

LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Token-to-Token (T2T) editing을 conventional Mask-to-Token (M2T) scheme에 엮음으로써 joint, configurable threshold-decoding scheme 도입
2개의 distinct personas: Speedy Mode (S Mode), Quality Mode (Q Mode)
dLLM을 위한 최초의 large-scale RL framework 제시

📜 Paper StepFun

2026.02 2주차

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

196B, 11B active 파라미터 사이즈의 MoE 모델로 frontier급 agentic intelligence & computational efficiency를 갖춤
3:1 비율의 Sliding Window/Full Attention 구조, Multi-Token Prediction (MTP-3)
보도된 벤치마크 결과에 의하면 DeepSeek V3.2 성능을 크게 상회하며 기타 frontier proprietary models를 능가하는 경우도 많음

📜 Paper Alibaba, UIUC, Mila

2026.02 2주차

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Optimizer-induced Projected Utility Selection (OPUS): optimizer-induced update space에서의 dynamic data selection 프레임워크
effective updates를 projecting 함으로써 candidates를 점수 매김
scalability를 위해 Ghost technique with CountSketch & Boltzmann sampling 적용

🧑🏻‍💻 Dev Z.ai

2026.02 2주차

GLM-5: From Vibe Coding to Agentic Engineering

complex system engineering & long-horizon agentic tasks에 특화된 GLM-5 공개
전작 4.5 버전 355B (32B active) → 744B (40B), pre-training data도 23T → 28.5T
DeepSeek Sparse Attention (DSA) 사용하여 추론 효율성 챙김

🧑🏻‍💻 Dev Ai2

2026.02 2주차

MolmoSpaces, an open ecosystem for embodied AI

large scale, fully open platform for studying embodied learning
230,000개 이상의 indoor scenes, 130,000개 이상의 object models를 포함
scene conversation을 위한 tooling, grasp integration, benchmarking 등을 포함하고 있는데 이를 통해 systematic evaluation 가능

🧑🏻‍💻 Dev BostonDynamics

2026.02 2주차

Boston Dynamics & Google DeepMind Form New AI Partnership to Bring Foundational Intelligence to Humanoid Robots

Atlas 휴머노이드 로봇에 Gemini Robotics 기반 AI를 적용하는 장기 파트너십을 맺음
로봇이 다양한 형태와 크기와 상관 없이 주변을 인지, 추론하며 도구를 사용하며 사람과 상호작용할 수 있도록 하는 것이 목표

📜 Paper Anthropic

2026.02 2주차

Sabotage Risk Report: Claude Opus 4.6

Claude Opus 4.6 모델이 자율적으로 sabotage를 일으켜 재난적 상황을 발생시킬 가능성이 있는지를 평가
지금까지의 경험으로 볼 때 이런 모델이 misaligned oal을 가질 확률은 낮음. 하지만 그 위험이 완전이 0이라고 볼 수 없음
Anthropic에서는 이런 것들을 탐지할 수 있는 모니터링 체계를 갖추고 있음

📜 Paper Stanford

2026.02 2주차

QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

alpha mining은 backtesting results & sudden market regime shift에 굉장히 민감
alpha mining: 초과 수익을 예측하는 신호를 기반으로 쓸 만한 factor를 찾아내는 과정
각 end-to-end mining run 과정을 trajectory로 취급하여 trajectory-level mutation & crossover operation을 통해 improve

🧑🏻‍💻 Dev HuggingFace

2026.02 2주차

Community Evals: Because we're done trusting black-box leaderboards over the community

benchmark 점수와 real-world performance 간의 갭, 일관성 없는 평가 결과 등을 문제점으로 지적
Dataset repo에 주요 벤치마크(MMLU-Pro, GPQA, HLE 등) 등록 가능 → Hub에서 자동적으로 aggregate해서 dataset card의 leaderboard에 바로 display
Model repo 내에 `.eval_results/*.yaml` 파일을 토대로 모델 카드에 표시되고 벤치마크 datasets에도 반영됨

📜 Paper King’s College

2026.02 1주차

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

RAG는 방대한 이질적 문서 집합에서 서로 다른 문서들을 가져오는 것을 전제로 하지만, 에이전트 메모리는 서로 강하게 연관되고 중복이 많은 대화 스트림이라는 차별성 존재
→ top-k 유사도 기반 검색이 잘 working하지 않는 상황임
memory stream은 latent components로 분해하고 다시 조직하는 구조적 과정이어야 한다고 주장

📜 Paper Meta

2026.02 1주차

Learning to Reason in 13 Parameters

TinyLora: rank가 1인 LoRA 세팅으로도 reasoning 학습
Qwen 2.5 8B 모델을 13개의 bf16 파라미터로 학습해 GSM8K에서 91% 정확도를 보였다고 설명

📜 Paper Google DeepMind

2026.02 1주차

Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems

Bloom의 Erdős Problems 데이터베이스에서 ‘Open’으로 표시된 700개의 추측을 체계적으로 검토 → Gemini의 반자율적인 수학적 발견
hybrid methodology: search space를 좁히기 위한 AI-driven natural language verification & correctness와 novelty를 측정하기 위한 human expert evaluation
math conjectures에 AI를 활용할 수 있음과 동시에 subconscious plagiarism의 위험성 지적

🧑🏻‍💻 Dev Perplexity

2026.02 1주차

Introducing Model Council

모델마다 잘 처리할 수 있는 태스크가 다르므로 한 개 쿼리를 여러 모델로 처리한 뒤 outputs를 combine 하는 방식
3개의 frontier models를 병렬적으로 run → synthesizer model이 outputs를 비교하여 최종 결과 산출
Perplexity Max 구독자만 이용 가능

📜 Paper Meta, ICL, Cambridge

2026.02 1주차

Scaling Small Agents Through Strategy Auctions

small agents가 deep research, coding tasks에서 큰 agents와 보이는 gap을 최소화하기 위한 연구
Strategy Auctions for Workload Efficiency (SALE): agents는 짧은 strategic plans를 입착하는 프리랜서 마켓 스타일의 agent 프레임워크
작은 모델들을 적합한 태스크에 배치하고 test-time self-improve 할 수 있도록 세팅해주면 “scaled up” 가능하다고 설명

🧑🏻‍💻 Dev Perplexity

2026.02 1주차

Evaluating Deep Research Performance in the Wild with the DRACO Benchmark

Deep Research Accuracy, Completness, Objectivity를 평가할 수 있는 DRACO 벤치마크 오픈소스로 공개
포스팅에 따르면 Perplexity의 Deep Research가 SoTA로 기록되어 있음
데이터 제작 시 철저한 rubric 개발에 힘을 썼는데, 이는 Rubric Creation → Iterative review and revision → Saturation Test → Final Review 프로세스를 거친다고 함

📜 Paper Baidu

2026.02 1주차

ERNIE 5.0 Technical Report

text, image, vidoe, audio를 이해할 수 있는 autoregressive foundation model
modality-agnositc expert routing을 탑재한 ultra-sparse MoE 아키텍쳐를 따르며 unified next-group-of-tokens prediction으로 학습
한 번의 사전학습만으로도 서로 다른 depth의 sub-models를 학습하여 메모리 시간 제약 등을 고려한 유연한 trade-off 가능

🧑🏻‍💻 Dev Anthropic

2026.02 1주차

Introducing Claude Opus 4.6

더욱 향상된 코딩 능력을 강점으로 내세워 신규 모델 공개
Adaptive Thinking, 1M context (beta), context compaction (자동 요약)
GDPval-AA에서 전작 Opus 4.5 대비 200점이나 높은 elo score를 기록한 점이 눈에 띔

🧑🏻‍💻 Dev OpenAI

2026.02 1주차

Introducing GPT-5.3-Codex

GPT-5.2-Codex와 GPT-5.2 모델을 합친 버전의 모델
OSWorld-Verified 벤치마크에서 높은 점수를 기록한 것이 눈에 띔
초기 버전의 GPT‑5.3‑Codex를 이용해서 GPT‑5.3‑Codex 본인의 트레이닝을 모니터링/디버깅하고, 배포를 관리하고, 평가 로그를 분석

📜 Paper Arizona, Pennsylvania

2026.02 1주차

Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"

three task-equivalent prompting regimes: Direct, Assistive, and Incremental
multi-hop QA 벤치마크 대상으로 테스트
regimes 간 일치 정도를 internal uncertainty signal로 해석

📜 Paper Google, Peking

2026.02 1주차

PaperBanana: Automating Academic Illustration for AI Scientists

LLM 기반의 AI scientist를 이용하더라도 publication ready illustration을 만드는 것이 큰 bottleneck이 됨
self-critique를 통해 specialized agents를 orchestrate
retrieve references, plan content & style, redner images, iteratively refine

🧑🏻‍💻 Dev Z.ai

2026.02 1주차

GLM-OCR

GLM-V encoder-decoder 아키텍쳐를 따르는 0.8B 사이즈의 multimodal OCR model
Multi-Token Prediction loss (MTP loss) & stable full-task RL 적용
two-stage pipeline: layout analysis & parallel recognition (PP-DocLayout-V3 기반)

📜 Paper Sber Robotics Center

2026.02 1주차

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

휴머노이드 Green을 위한 staged VLA framework
five-stage curriculum
(L0) foundational VLMs → (L1) multimodal grounding → (R0) multi-embodiment pretraining → (R1) embodiment-specific adaptation → (R2) RL-based policy alignment.

🧑🏻‍💻 Dev Matt Schlicht

2026.02 1주차

motlbook

1.5M AI agents가 알아서 글을 포스팅하고 댓글을 남기는 social platform..
110,000+ posts & 500,000+ comments가 agents에 의해 생성되었다고 함
심지어 내용들도 굉장히 자극적이고 신선해서 크게 화제를 일으키는 중

📜 Paper MIT, ETH

2026.02 1주차

Self-Distillation Enables Continual Learning

demonstration-condition model을 own teacher로 사용해서 on-policy training signals 생성
이를 통해 catastrophic forgetting 이슈를 해소하면서도 새로운 태스크에 대한 정확도를 높게 챙길 수 있음

📜 Paper AgentAlpha

2026.02 1주차

Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives

peer-reviewed papers를 스스로 생성한 feedback과 함께 지속적으로 수집하고 핵심 methodological unites를 추출하여 resuable research patterns를 축적
→ methodological knowledge graph 구축
기존의 open-ended generation & trial-and-error를 최소화할 수 있다고 설명

📜 Paper Anthropic, Stanford

2026.02 1주차

Shaping capabilities with token-level data filtering

→ 대신 pretraining 동안에 해결되어야 하는 것으로 대안 제시
의료 도메인을 대상으로 documents filtering보다 tokens filtering이 훨씬 더 효과적이었음을 입증
scaling 관점에서도 그렇다고 함

2026년 1월 54건

📜 Paper Stanford, NVIDIA, …

2026.01 4주차

Learning to Discover at Test Time

어려운 문제를 풀기 위해 모델 스스로 해결을 시도하고 개선하는 것이 가능하도록 학습
attempts 간 평균 reward를 maximize 하는 것보다는 the most promising solutions를 prioritize 하도록 design
Erdo’s minimum overlap problem, autocorrelation inequality, GPUMode kernel competitions 등 다양한 도메인에서 SoTA 달성

📜 Paper Chicago

2026.01 4주차

AI Agents Need Memory Control Over More Context

transcript retention을 bounded internal state로 대체하여 각 턴마다 점진적 업데이트가 가능하도록 함
기존에는 unbounded context growth 문제가 있어 context 관리가 되지 않았던 것을 문제점으로 지적
ever-expanding transcripts라고 표현

🧑🏻‍💻 Dev Google Research

2026.01 4주차

Small models, big results: Achieving superior intent extraction through decomposition

이를 위한 decomposed workflow 제시
(1) single screen에서의 개별 interaction과 UI element가 summarized
(2) 전체 UI trajectory의 일반적 의도를 이해하기 위한 a series of events로 사용됨

🧑🏻‍💻 Dev Moonshot AI

2026.01 4주차

Kimi K2.5: Visual Agentic Intelligence

복잡한 태스크에 대해 100개의 sub-agents를 담고 있는 agent swarm을 컨트롤 할 수 있으며 1,500 개의 tool calls를 병렬 실행할 수 있다고 함
sinlge-agent setup과 비교하면 4.5x 빠른 처리 속도
간단한 대화를 완벽한 반응형 layout을 갖춘 front-end interfaces로 변환하는 능력 소개

📜 Paper Naver AI Lab

2026.01 4주차

Oops, Wait: Token-Level Signals as a Lens into LLM Reasoning

다양한 모델들의 token probabilities를 token-level로 조사
특정한 토큰들이 reasoning correctness와 강한 상관관계를 보인다고 설명
작은 데이터셋으로 fine-tuning 한 모델은 reasoning abiilty를 얻지만 부분적으로 이용하는 수준이라고 언급

📜 Paper Meituan

2026.01 4주차

LongCat-Flash-Thinking-2601 Technical Report

agentic search, agentic tool use, tool-integrated reasoning 벤치마크에서 오픈소스 중 SoTA 달성했다고 설명
domain-parallel expert training with subsequent fusion 기반의 unified training framework 언급
stable & efficient large-scale multi-environment training을 위한 asynchronous RL framework, DORA 언급

📜 Paper Salesforce

2026.01 4주차

Agentic Confidence Calibration

Holistic Trajectory Calibration (HTC): process-level features를 충분히 추출하여 평가
interpretability, transferability, generalization 특징을 강점으로 소개

🧑🏻‍💻 Dev OpenAI

2026.01 4주차

Introducing Prism

LaTeX-native editor with live preview
citation insertion을 위한 built-in literature search 기능
handwritten or whiteboard equations를 Image-to-LaTeX conversion

🧑🏻‍💻 Dev DeepSeek AI

2026.01 4주차

DeepSeek-OCR 2: Visual Causal Flow

사람이 문서를 읽는 방식으로 모델 학습한 3B 사이즈의 vision-language architecture 모델
encoder로 먼저 page에 대한 global understanding 후 → 어떤 순서로 글을 읽을 것인지 결정

🧑🏻‍💻 Dev Ai2

2026.01 4주차

Open Coding Agents: Fast, accessible coding agents that adapt to any repo

SERA lowers the barrier to fine-tuning coding agents
이전 방식들대비 cost-effective 하다는 특징을 엄청난 강점으로 강조
Soft-verified generation (SVG), Scaling with a bug-type menu, High simulated workflow fidelity 등을 innovations로 언급

📜 Paper Ant Group

2026.01 4주차

Advancing Open-source World Models

세 가지 주요한 특징
(1) high fidelity & robust dynamics
(2) contextual consistency를 보존하면서도 minute-level horizon 가능

🧑🏻‍💻 Dev Google DeepMind

2026.01 4주차

AlphaGenome: AI for better understanding the genome

100만 bp 길이의 DNA 서열을 입력으로 받아 수천 개의 조절 관련 분자 특성을 한 번에 예측
모델 구조
초반: CNN으로 로컬 모티프(짧은 패턴) 탐지

🧑🏻‍💻 Dev Ai2

2026.01 4주차

Theorizer: Turning thousands of papers into scientific laws

한 쿼리당 논문을 100개까지 볼 수 있음
AI, NLP 분야와 관련된 3,000개의 theories를 함께 공개
한 쿼리당 15-30분 소요

📜 Paper NUS

2026.01 4주차

Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

pretraining data를 이용하여 보다 신뢰성 높은 LLM 이용이 가능할 것이라고 주장

📜 Paper StepFun

2026.01 3주차

STEP3-VL-10B Technical Report

two strategic shifts
(1) unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens
(2) scaled post-training pipeline featuring over 1k iterations of RL

📜 Paper Independent

2026.01 3주차

Active Context Compression: Autonomous Memory Management in LLM Agents

SWE Bench Lite에서 토큰 사용량을 22.7% 줄이면서도 기존과 동일한 수준의 정확도(60%) 달성
Claude Haiku 4.5

📜 Paper CMU, Meta

2026.01 3주차

STEM: Scaling Transformers with Embedding Modules

runtime routing을 제거함으로써 CPU offload with asynchronous prefetch를 가능하게 함
또한 극도로 sparse 함에도 불구하고 안정적으로 학습되는 모습 관측됨

🧑🏻‍💻 Dev Anthropic

2026.01 3주차

The assistant axis: situating and stabilizing the character of large language models

한쪽 끝은 분석가/컨설턴트, 반대쪽은 환상적/비어시스턴트적 역할 위치
분석을 위해 275개의 서로 다른 캐릭터 타입에 상응하는 vectors 추출
실험에는 Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B 모델 사용

🧑🏻‍💻 Dev xAI

2026.01 3주차

X For You Feed Algorithm

두 가지 소스로 후보를 모아 Phoneix(Grok-based transformer model)로 다중 행동 확률을 예측하고 가중합 점수로 정렬하여 상위 K개 선택

🧑🏻‍💻 Dev Liquid AI

2026.01 3주차

LFM2.5-1.2B-Thinking: On-Device Reasoning Under 1GB

1.2B 사이즈의 LFM2.5 family 기반으로 학습된 모델
허깅페이스 등에서 weight 다운로드 가능

🧑🏻‍💻 Dev Anthropic

2026.01 3주차

Claude's new constitution

Claude의 어떤 행동이 의도된 것인지 사람이 판단할 수 있게 함으로서 투명성을 높이려는 목적
단순 규칙 나열 → 이유가 포함된 서술형, hard constraint + 유연한 원칙, 학습 파이프라인에서 더 핵심적인 역할

📜 Paper NVIDIA

2026.01 3주차

KVzap: Fast, Adaptive, and Faithful KV Cache Pruning

KVzap: fast, input-adaptive approximation of KVzip. prefilling & decoding 둘 다 적용
학습된 LLM위에 작은 surrogate를 학습해서 붙이는 post-hoc KV pruning 방식

🧑🏻‍💻 Dev Qwen

2026.01 3주차

Qwen3-TTS Family is Now Open Sourced: Voice Design, Clone, and Generation!

Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder 사용
paralinguisitc information & acoustic environmental features 보존 & high-speed, high-fidelity speech reconstruction 가능
extreme bidirectional streaming generation speed 달성

🧑🏻‍💻 Dev NVIDIA

2026.01 2주차

NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI

Physical AI Bench & Physical Reasoning 벤치마크에서 SoTA 달성
기존 모델들은 불확실성을 처리하거나 새로운 상황에 적응하는데 필요한 planning several steps ahead 능력 등이 부족했었음

📜 Paper NVIDIA

2026.01 2주차

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

GDPO: individual rewards의 normalization의 decoupling 함으로써 rewards 간의 상대적인 차이를 보존하여 보다 정확한 multi-reward optimization이 가능하도록 함
tool calling, math reasoning, coding reasoning 태스크에 대해 accuracy, bug ratio를 측정하여 GRPO와 성능 비교

🧑🏻‍💻 Dev Anthropic

2026.01 2주차

Cowork: Claude Code for the rest of your work

Mac 앱에서 특정 로컬 폴더에 권한 부여
컨텍스트를 유지하면서도 병렬로 처리할 수 있음
Claude Max 구독자 대상으로 Mac에서 동작하는 App에 preview 형태로 제공중

📜 Paper Zhejiang, Edinburgh, NUS

2026.01 2주차

Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

같은 질문을 받더라도 주변 문맥이 살짝 바뀌면 출력이 달라지는 현상
이를 해결하기 위한 Neighbor-Consistency Belief (NCB): conceptual neighborhood 간 response coherence를 평가하는 belief robustness를 구조적으로 측정
이것의 효용을 입증하기 위해 contextual interference에 대한 outputs stability를 측정하는 cognitive stress-testing protocol 제시

🧑🏻‍💻 Dev Sakana AI

2026.01 2주차

Extending the Context of Pretrained LLMs by Dropping their Positional Embeddings

DroPE: 학습 후 위치 임베딩을 제거하고 짧게 재보정하는 방식(continued pretraining)으로 문맥 확장 제안
RoPE가 non-uniform attention에 대해 갖는 bias(특정 상대 위치나 구조에 강하게 치우친 분포)를 학습 시 scaffold로 활용

📜 Paper Alibaba

2026.01 2주차

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

unified representation space 내에서 text, images, document images, video 등을 multimodal search 하도록 하는 end-to-end pipeline
Embedding: 32k tokens까지 입력으로 받을 수 있으며 MRL 지원
Reranker: cross-encoder with cross-attention을 이용한 fine-grained relevance estimation

📜 Paper Alibaba, Wuhan

2026.01 2주차

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

AgeMem: LTM & STM 관리를 agent’s policy로 직접 관리
memory operation을 tool-based actions로 expose하여 LLM agent가 정보를 retrieve, update, summarize, discard 할지를 자율적으로 결정하도록 함
이런 행동을 학습시키기 위해 three-stage progressive RL & step-wise GRPO 제시

🧑🏻‍💻 Dev Manus

2026.01 2주차

Introducing Meeting Minutes

화자 인식, Seamless End-to-End Execution, Collaborative Execution 등을 핵심 특징으로 강조
면대면 미팅에 특화됨. 온라인 미팅 상황은 대상이 아님

📜 Paper Quanta Alpah

2026.01 2주차

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

video-conditioned open-domain video question answering
cross-frame visual anchor extraction, interactive web retrieval, multi-hop reasoning 등을 커버
Workflow & Agentic paradigm에 맞춰 평가 진행

🧑🏻‍💻 Dev Tencent

2026.01 2주차

WeKnora - LLM-Powered Document Understanding & Retrieval Framework

다양한 문서 파싱, 임베딩, 검색, reasoning model 과의 integration 등 다양한 기능 지원

📜 Paper Nanyang

2026.01 2주차

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

이를 통해 multi-source evidence integration & external retrieval 이 필요한 task만 남김
평가 시 Adaptive Point-wise Quality Evaluation & Active Fact Checking 을 적용하여 기존의 정적 벤치마크의 한계를 극복

🧑🏻‍💻 Dev Google

2026.01 2주차

TranslateGemma: A new suite of open translation models

55개 언어를 지원하면서 device를 가리지 않는다고 설명
이미지 내 텍스트를 인식하는 multimodal 능력도 뛰어나다고 언급

🧑🏻‍💻 Dev Cursor

2026.01 2주차

Scaling long-running autonomous coding

일주일 동안 interrupt 없이 실행되며 1,000개 파일에 100만+ 라인 작성
Solid → React 마이그레이션
3주 이상걸리며 +266K/-193K 수정

🧑🏻‍💻 Dev Google

2026.01 2주차

Gemini introduces Personal Intelligence

Gmail, Google Photos, YouTube, Workspace, Search
필요할 때만 context에 추가하는 방식으로 one reasoning window에서 커버
U.S. Google AI Pro & Ultra 유저 대상으로 beta 오픈

🧑🏻‍💻 Dev Replit

2026.01 2주차

Mobile Apps on Replit: Idea to App Store in Minutes

모바일 게임, AI 앱, 생산성 향상 앱 등 다양한 개발 가능
[링크](https://replit.com/mobile-apps) 🔗

📜 Paper UIUC, Stanford, …

2026.01 1주차

Adaptation of Agentic AI

agent adaptations & tool adaptations를 다루는 systematic framework
tool-execution-signaled & agent-output-signaled forms
offline data를 이용해 각 weight를 업데이트 하는 것으로 보임

🧑🏻‍💻 Dev IQuestLab

2026.01 1주차

IQuest-Coder-V1

Dual Specialization Paths: 두 갈래의 post-training을 통해 thinking model & instruct model 개발
recurrent mechanism을 이용하여 model capability와 deployment footpring 간의 trade-off 최적화한 Efficient Archiecture
추가적인 scaling 없이 native 128K 지원

📜 Paper WeChat

2026.01 1주차

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

HGMem: memory 개념을 dynamic, expressive structure로 extend 하는 hypergraph-based memory mechanism
hyperedges는 각 distinct memory units에 해당하며 메모리 내에서 higher-order interaction로 progressive formation 가능해짐
일반적인 edge와 달리 둘 이상의 정점을 한 번에 연결하는 개념

🧑🏻‍💻 Dev OpenCode AI

2026.01 1주차

OpenCode

TUI 지원되면서도 시각적으로 보기 편리하게 구성되어 있음
Claude Code를 그대로 쓸 수도 있고 다른 모델들을 필요한 곳에 override 해서 사용하는 것도 가능

📜 Paper NVIDIA, Stanford, UC Berkeley

2026.01 1주차

End-to-End Test-Time Training for Long Context

standard architecture: Transformer with sliding-window attention
test time의 next-token prediction 상황에서 context를 compress하여 weight에 반영
training time에 test-time에서 습득한 meta-learning을 통해 model initialization

📜 Paper US San Diego

2026.01 1주차

Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025

agents가 코드 생성, 디버깅, boilerplate 등에는 적합하지만, architectural decisions에는 약하다는 주장

📜 Paper MIT

2026.01 1주차

Recursive Language Models

모델의 컨텍스트 윈도우를 두 자릿수 이상 넘어서는 경우도 잘 처리할 수 있음 (100배 이상)
long-context tasks에서 base LLM, 그리고 common long-context scaffolds를 크게 앞선 결과

📜 Paper Google

2026.01 1주차

Nested Learning: The Illusion of Deep Learning Architectures

이를 통해 higher-order in-context learning을 구현함으로써 continual learning 능력을 잠재적으로 강화
Expressive Optimizers, Self-Modifying Learning Moduel, Continuum Memory System

📜 Paper DeepSeek AI

2026.01 1주차

mHC: Manifold-Constrained Hyper-Connections

mHC: HC의 residual connection space를 특정 manifold에 project하여 identity mapping property를 복구하는 framework
Sknkhorn-Knopp alogrithm 사용됨

📜 Paper Alberta

2026.01 1주차

Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

internal traces를 관찰하여 fixed-budget descriptors로 압축
external judges, multi-sample consistency 등에 의존하지 않아도 되며 5M 정도의 추가 파라미터만 발생하는 정도

📜 Paper Duke, ByteDance

2026.01 1주차

Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning

reflector가 이를 보고 답변을 끝내도 될지 업데이트 해야될지 판단 (Multi-agent reflection architecture)
여러 response를 한 번에 보고 비교 분석하기 때문에 cross-instance learning이라고 표현한 듯

📜 Paper Stanford

2026.01 1주차

A multimodal sleep foundation model for disease prediction

PSG: Polysomnography - the gold standard for sleep analysis
65,000명의 참가자들로부터 585,000 시간 분량의 PSG recording을 확보하여 모델 학습
130개 conditions를 예측할 수 있을 뿐만 아니라 뛰어난 transfer learning 성능을 보였다고 언급

🧑🏻‍💻 Dev NVIDIA

2026.01 1주차

NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools to Accelerate Safe, Reasoning-Based Autonomous Vehicle Development

reasoning traces와 trajectories를 output으로 반환
AlpaSim: closed-loop autonomous driving evaluation을 위한 open-source simulator
Physical AI Open Datasets: 1700+ hours의 real-world driving data

🧑🏻‍💻 Dev OpenAI

2026.01 1주차

Introducing ChatGPT Health

현재 US만 가능
독립된 샌드박스 환경에 데이터 저장 및 관리하여 학습 데이터로 활용되지 않는다고 함
건강 데이터를 바탕으로 상황 진단 또는 추적 관리 등 가능

📜 Paper Beijing Univ.

2026.01 1주차

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

모델이 own prediction에 highly confident 하지만 divergent ground truth를 배우도록 강제됨
Entropy-Adaptive Fine-Tuning (EAFT): prediction probability에만 의존하지 않고, token-level entropy를 gating mechanism으로 사용
epistemic uncertainty & knowledge confict 를 구분하는 데 사용

🧑🏻‍💻 Dev MIT, Sakana

2026.01 1주차

Digital Red Queen:Adversarial Program Evolution in Core War with LLMs

static objective → changing objective에 대해 continual adaptation
targeted bombing, self-replication, massive multhreading 등을 포함한 다양한 전략으로 이어짐
convergence pressure toward a general-purpose behavioral strategy → convergent evolution

2025년 12월 49건

📜 Paper KlingAI

2025.12 4주차

Kling-Omni Technical Report

video generation, editing, intelligent reasoning 등을 end-to-end로 다룸
이에 따라 text instructions, reference images, video context 등을 입력으로 받을 수 있음

📜 Paper Google

2025.12 4주차

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

4개의 sub-leaderboards의 performance를 aggregate
(1) FACTS Multimodal (2) FACTS Parametric (3) FACTS Search (4) FACTS Grounding
각 리더보드는 모델 responses를 평가하기 위한 judge models 세팅되어 있음

📜 Paper Google, UC Santa Barbara

2025.12 4주차

Budget-Aware Tool-Use Enables Effective Agent Scaling

Budget Tracker: agent에게 continuous budget awareness를 제공하는 plug-in
BATS (Budget Aware Test-time Scaling): budget awareness를 이용하여 dig deepr | pivot to new paths를 dynamically decide

📜 Paper Ant Group

2025.12 4주차

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

from-scratch 학습 대신 pre-trained AR 모델을 3-phase block-level WSD based training scheme을 통해 dLLM으로 전환
post-training alignment (SFT & DPO)를 통해 MoE 아키텍쳐의 LLaDA2.0-mini (16B) & LLaDA2.0-flash (100B) 모델 획득

🧑🏻‍💻 Dev Alibaba

2025.12 4주차

Qwen-Image-Layered: Layered Decomposition for Inherent Editablity

각 layer는 다른 content에 영향을 주지 않도록 manipulated 되어 resizing, reposition, recoloring 등이 가능함
즉, semantic 또는 structure components를 distinct layers로 isolate

🧑🏻‍💻 Dev OpenAI

2025.12 4주차

Evaluating chain-of-thought monitorability

평가는 3개 타입으로 구분: intervention, process, outcome-property

🧑🏻‍💻 Dev Anthropic

2025.12 4주차

Introducing Bloom: an open source tool for automated behavioral evaluations

hand-labeled judgements와 strongly correlate된 평가
최근 AI 모델의 behavioral profiles를 자동으로 explore 하는 오픈소스 프레임워크 [Petri](https://www.anthropic.com/research/petri-open-source-auditing)도 공개

🧑🏻‍💻 Dev Google DeepMind

2025.12 4주차

Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior

총 1T 파라미터에 대해 110 Petabytes 데이터를 학습
SAE와 transcoder 결합하여 모델 내부를 들여다 봄
Matryoshka training technique이 적용되었고 chat usecase를 위해서도 학습되었다고 설명

📜 Paper Southwest Univ.

2025.12 4주차

LIR3AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

Context-Grounded Reasoning, Knowlege-Reconciled Reasoning 두 개의 모드로 해석
LIR3AG: retrieved evidence를 coherent reasoning chains로 reconstruct 함으로써 non-reasoning 모델도 reasoning strategies를 transfer 할 수 있도록 함

📜 Paper Tsinghua

2025.12 4주차

TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

(1) Attention acceleration: low-bit SageAttention & trainable Sparse-Linear Attention (SLA)
(2) Step distillation: rCM
(3) W8A8 quantization

📜 Paper Google

2025.12 4주차

Prompt Repetition Improves Non-Reasoning LLMs

Gemini, GPT, Claude, DeepSeek 같은 플래그십 모델들에 대해 실험한 결과 보고
또한 RL로 학습된 reasoning 모델들이 유저의 요청을 반복하는 경항이 있는데 이를 역시 prompt repetition이라고 표현하고 이것이 아주 효율적이라고 설명함

🧑🏻‍💻 Dev Minimax

2025.12 4주차

MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks

특히 코딩 능력 향상에 힘을 많이 들인 것으로 보임 (공식 포스트에서는 코딩 능력만 언급하고 있음)
Multi-Promgramming Language Capabilities
웹 개발 뿐만 아니라 앱 개발도 잘할 수 있게 되었다고 설명

📜 Paper Sapienza Univ.

2025.12 4주차

Epistemological Fault Lines Between Human and Artificial Intelligence

이를 위해 LLM이 답변을 생성하기까지(판단을 내리기까지)의 과정을 인간의 사고 과정과 비교 분석

📜 Paper Meta, UIUC, CMU

2025.12 4주차

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

Self-play SWE-RL (SSR): human-labeled issue or tests 없이 sandboxed repositories with source code에 대한 접근 권한만 제공
LLM agent는 self-play 세팅에서 softwar bugs를 고치도록 강화 학습 반복

📜 Paper Tencent

2025.12 4주차

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

사람의 이러한 능력을 심리학에서 Mindscape-Aware Capability 라고 부름
Mindscape-Aware RAG (MiA-RAG): LLM-based RAG system에 explicit global context awareness를 제공
hierarchical summarization을 build → retrieval & generation 둘 다 global semnatic representation에 condition

📜 Paper HKUST, Waterloo, Tsinghua, ICL

2025.12 4주차

Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

two-phase dynamic: procedural correctness의 제약을 받으며 low-level skills 개선 → high-level strategic planning 고도화로 이어짐
이 관점에서 GRPO 같은 RL 알고리즘은 토큰으로부터의 learning signal을 무시한채로 무작위 optimzation 한다는 한계를 지적
Hierarchy-Aware Credit Assignment (HICRA): 영향이 큰 planning tokens 대상으로 opimization efforts 집중

📜 Paper Oxford

2025.12 4주차

Shared sensitivity to data distribution during learning in humans and transformer networks

redundancy & diversity 는 in-weights & in-context learning 둘 다에서 상충 관계에 있다는 공통점 확인
그러나 dynamic training shcedules이 인간에게는 영향을 줄 수 있던 것과 달리 network는 아님

📜 Paper MIT

2025.12 4주차

Self-Adapting Language Models

SEAL: 새로운 입력이 주어지면 모델이 학습하기 좋은 형태의 self-edit 데이터를 생성
이렇게 생성된 self-edit를 SFT하여 새로운 지식에 adapt
updated model의 downstream performance를 reward signal로 사용하여 RL 함으로써 effective self-edits를 생성할 수 있도록 모델을 학습

📜 Paper National Taiwan Uinv.

2025.12 3주차

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

추론 시 generation length & acceptance rate 를 dynamically adjust 하는 방식 제안
token entropy & Jensen-Shannon distance 기준으로 결정
성능 2% 하락 정도로 49% 속도 향상을 이끌어낼 수 있었음

📜 Paper Meta

2025.12 3주차

Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Meta Canvas: MLLMs가 직접 spatial & spatiotemporal latent spaces를 reason & plan하고 diffusion generators로 interface하는 lightweight framework

🧑🏻‍💻 Dev NVIDIA

2025.12 3주차

NVIDIA Nemotron 3 Family of Models

Nano, Super, Ultra, 강력한 agentic 능력을 가진 세 개 모델 공개
체크 포인트 및 학습 데이터까지 공개
Hybrid MoE, LatentMoE, Multi-Token Prediction, NVFP4, Long Context (1M), Multi-environment Reinforcement Learning Post-training, Granular Reasoning Budge Control at Inference Time

🧑🏻‍💻 Dev Ai2

2025.12 3주차

Molmo 2: State-of-the-art video understanding, pointing, and tracking

Video tracking에서 Gemini 3 Pro 성능을 상회하기도 함
molmo 2-O (7B): Olmo 기반의 for researcher 모델
학습 데이터의 양은 Meta의 PerceptionLM 대비 1/8 수준임에도 뛰어난 성능 달성

🧑🏻‍💻 Dev Ai2

2025.12 3주차

Introducing Bolmo: Byteifying the next generation of language models

transformer 아키텍쳐는 그대로 두고 small byte encoders, decoders 추가
Olmo 3 모델과 유사한 수준의 성능을 보이면서도 character 벤치마크에서 높은 점수 달성
UTF-8 bytes를 fixed vocab 없이 처리, dynamic byte patches 사용

📜 Paper Google

2025.12 3주차

DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

세 개의 능력을 평가
(1) 이질적인 sources로부터 파편화된 정보의 systematic collation
(2) precision을 확보하기 위한 de-duplication & entity resolution

📜 Paper NUS, GIT 등

2025.12 3주차

Memory in the Age of AI Agents

forms, functions, dynamics를 기준으로 agent memory 분석
agent memory는 token-level, parametric, latent memory로 크게 구분

📜 Paper Tsinghua

2025.12 3주차

DEER: Draft with Diffusion, Verify with Autoregressive Models

(1) step-wise uncertainty가 계속해서 누적
(2) 본질적으로 AR (autoregressive) drafters의 sequential decoding임
dLLM이 이와 같은 문제를 해결할 수 있다고 보며 DEER라는 decoding framework 제안

🧑🏻‍💻 Dev Mistral

2025.12 3주차

Mistral OCR 3

[Mistral AI Studio](https://console.mistral.ai/build/document-ai/ocr-playground) 또는 API 통해 이용 가능

🧑🏻‍💻 Dev Google

2025.12 3주차

FunctionGemma: Bringing bespoke function calling to the edge

on-device & agent 수요에 맞춘 결과물
unified action & chat, built for customization, engineered for the edge, broad ecosystem support 등을 특징으로 삼음

📜 Paper ByteDance

2025.12 2주차

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

open-ended tasks는 LLM-judge로 평가 (meticulously crafted rubrics)

🧑🏻‍💻 Dev Poetiq

2025.12 2주차

Poetiq Shatters ARC-AGI-2 State of the Art at Half the Cost

Gemini 3 Deep Think 대비 더 높은 정확도와 절반 이하의 비용
모델을 직접 만드는 게 아니라 froniter models들이 문제를 더 잘 풀 수 있도록 meta-system을 개발

🧑🏻‍💻 Dev Alibaba

2025.12 2주차

Qwen3-TTS Update! 49 Timbres + 10 Languages + 9 Dialects

Enhanced Multilingual & Dialect Capabilities: 영어, 중국어, 독일어, 한국어 등 주요 10개 언어 지원
한국어, 일본어 등 그렇게까지 자연스러운지 모르겠음
More Natural & Human-like Prosody/Speech Rates: 전작 대비 훨씬 자연스러운 발화

📜 Paper Anthropic

2025.12 2주차

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

→ 기존 Gradient Routing을 개선하여 Selective GradienT Masking (SGTM) 개발
두 개의 지식 제거 실험
(1) bilingual synthetic dataset으로 학습된 모델의 한 언어를 제거

🧑🏻‍💻 Dev Google

2025.12 2주차

Titans + MIRAS: Helping AI have long-term memory

MLP 기반의 long-term memory module을 사용하여 대량의 정보를 손실 없이 저장하도록 함
여기에 surprise metric을 사용하여 새로운 입력이 기존의 정보와 큰 차이가 있는지 detect
MIRAS (이론)

🧑🏻‍💻 Dev Qwen

2025.12 2주차

SAPO: A Stable and Performant Reinforcement Learning Method for Training Large Language Models

sequence-level coherence를 유지하면서도 off-policy 토큰만 선택적으로 억제해 sample efficiency 개선

🧑🏻‍💻 Dev OpenAI

2025.12 2주차

Introducing GPT-5.2

이를 뒷받침하는 GDPval 벤치 결과를 언급
ChatGPT - Instant/Thinking/Pro, API - 5.2/5.2-chat-latest/5.2-pro

🧑🏻‍💻 Dev Cursor

2025.12 2주차

A visual editor for the Cursor Browser

각 element의 설정을 사이드 패널에서 직접 컨트롤 할 수 있음 (폰트 사이즈, 서체 등등)
element를 클릭하고 그걸 대상으로 prompt 작성해서 코딩하는 것도 가능

📜 Paper Stanford

2025.12 2주차

The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics

reasoning을 phase transition으로 modeling하는 theory of semantic anchoring formalize (UCCT)
AGI에 필요한 것은 더 큰 모델, 더 많은 데이터, 더 복잡한 아키텍쳐가 아닌, 모델 패턴을 목표에 align 시키는 executive function이라고 주장

📜 Paper Berkeley, UIUC, Stanford, IBM

2025.12 2주차

Measuring Agents in Production

정형적인 벤치마크 대신 현업 맥락에 맞춘 인간 검증을 통해 평가
production agent가 일반적으로 simple & controllable approaches를 갖고 있다고 설명
사람 개입 전에 최대 10개 steps 68%, prompting off-the-shelf models 의존 70%

🧑🏻‍💻 Dev Karpathy

2025.12 1주차

LLM Council

쿼리를 제출하면 1) First Options 2) Review 3) Final Response 단계를 거치게 됨

🧑🏻‍💻 Dev DeepSeek AI

2025.12 1주차

DeepSeek-V3.2: Efficient Reasoning & Agentic AI

세 가지 keys
DeepSeek Sparse Attention (DSA)
Scalable Reinforcement Learning Framework

🧑🏻‍💻 Dev Microsoft

2025.12 1주차

Fara-7B: An Efficient Agentic Model for Computer Use

웹 페이지를 인식하여 scrolling, typing, clicking 등 actions 수행 가능
이전 연구인 AgentInstruct 기반으로 synthetic data generation pipeline 개발

🧑🏻‍💻 Dev ByteDance

2025.12 1주차

Vidi2: AI Video Understanding & Creation in Seconds

VUE-TR-V2 벤치마크에서 GPT-5, Gemini-3-Pro 모델 능가하는 수준으로 리포트
10-30초 정도의 long-context video support

📜 Paper MiroMind

2025.12 1주차

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

LMMs의 temporal grounding 능력을 video cropping tool로 이용하여 특정 video clip에 zoom in하고 finer-grained video frames를 resample 하도록 함
global-to-local reasoning loop
VideoLIAH를 공개하여 training & evaluation 촉진

📜 Paper Alibaba

2025.12 1주차

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

code pretraining, supervised fine-tuning, RL, scaling law, framework selection, hyperparameter sensitivity, model architectures, dataset comparisons 등 포함

🧑🏻‍💻 Dev HuggingFace

2025.12 1주차

Transformers v5: Simple model definitions powering the AI ecosystem

AttentionInterface, 토크나이저 단일화, PyTorch 단일화 등
대규묘 pre-training 지원 강화, fine-tuning/post-training 생태계 연동

🧑🏻‍💻 Dev Mistral AI

2025.12 1주차

Introducing Mistral 3

오픈소스 모델 중 SoTA라고 설명
non-reasoning 모델 중 LMArena에서 2위 달성
text, images, multilingual inputs 처리 가능

🧑🏻‍💻 Dev Google

2025.12 1주차

Now available: Create AI agents to automate work with Google Workspace Studio

Asana, Jira, Mailchimp, Salesforce 등과 연결 가능

🧑🏻‍💻 Dev OpenAI

2025.12 1주차

How confessions can keep language models honest

main answer & separate ‘confession’을 출력하도록 지시하여 confession channel을 관측
confession channel에서는 main answer가 올바를 때에조차 hidden failure를 보임
hallucination, 지름길 이용, 부적절한 보상 신호 악용 확인됨

📜 Paper NUS

2025.12 1주차

PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

문서 변화 히스토리를 직접 알 수 있고 fine-grained patches 관리 가능

2025년 11월 54건

📜 Paper OpenAI

2025.11 4주차

Early science acceleration experiments with GPT-5

이를 통해 연구 내에서 사람의 시간을 아낄 수 있는 영역과, 여전히 사람의 손이 많이 필요한 영역을 구분해냄
특히나 수학 분야에서 풀리지 않았던 문제를 푸는 데 GPT-5가 어떻게 도움을 줄 수 있었는지에 대해 다룸

📜 Paper NVIDIA

2025.11 4주차

Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

single parent model에 여러 개의 nested submodels을 embed하고 각각 다른 configurations & budgets에 optimize
각 submodel은 parent model과 weight를 공유하고, 추가적인 학습 없이도 zero-shot extration 가능하다고 설명
group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance 등을 통해 Mamba의 구조적 제약을 보존

🧑🏻‍💻 Dev Anthropic

2025.11 4주차

Introducing Claude Opus 4.5

prompt injection에 업계 최고 수준으로 robust 하다고 설명
153 페이지 분량의 [system card](https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf) 🔗

📜 Paper Salesforce, Stanford

2025.11 4주차

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

같은 모델로부터 만든 두 개의 agents가 공생하는 구조
curriculum agent & executor agent
executor agent에게 external tools를 붙여줌으로써, curriculum agent가 더 어렵고 복잡한 문제를 내게끔 압박

📜 Paper Peking

2025.11 4주차

General Agentic Memory Via Deep Research

just-in- time (JIT) compilation 원칙 준수
runtime에 simple, but useful memory만을 생성하도록 함 (offline stage)
duo-design

🧑🏻‍💻 Dev Tecent

2025.11 4주차

HunyuanOCR

1B 파라미터로 다양한 벤치마크에서 SoTA 달성
complex multilingual document parsing, text spotting, open-field information extraction 등 다양한 태스크 커버 가능
100개 이상의 언어 처리할 수 있다고 주장

🧑🏻‍💻 Dev Andrew Ng

2025.11 4주차

Stanford Agentic Reviewer

PDF → MD 변환 후 제목/학술문서 여부 체크 → 논문에서 검색 쿼리 생성하여 arXiv 검색 → 상위 논문 요약 → 원 논문 MD + 관련 연구 요약 합쳐 템플릿 리뷰 생성
ICLR 2025 데이터 대상으로 테스트 한 결과, Human-Human 간 Spearman 점수보다 높음

📜 Paper UCL

2025.11 4주차

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Memory-augmented Markov Decision Process (M-MDP) with neural case-selection policy
policy는 memory rewriting mechanism을 통해 environmental feedback 기반으로 업데이트
memory reading (retrieval)을 통해 policy improvement

🧑🏻‍💻 Dev DeepSeek AI

2025.11 4주차

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

two-part training system으로 모델의 full proofs를 생성, 체크, 교정
generator with a dedicated verifier
verifier는 각 스템에 대해 scores

📜 Paper Qwen, Edinburgh, Stanford, MIT

2025.11 4주차

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

gating-augmented softmax attention variants에 대한 연구
30개 종류의 15B MoE models, 1.6B dense 모델에 대해 조사 (3.5T 토큰 학습)
head-specific sigmod gate를 Scaled Dot-Product Attention (SDPA) 이후에 적용하는 것이 모델 성능을 확실히 향상시킬 수 있는 방법이었다고 설명

🧑🏻‍💻 Dev Anthropic

2025.11 3주차

Measuring political bias in Claude

prompts, grader rubrics, scripts 모두 오픈소스로 공개

📜 Paper ByteDance

2025.11 3주차

Depth Anything 3: Recovering the Visual Space from Any Views

2개의 key insights
a single plain transformer (vanilla DINO encoder)
a singular depth-ray prediction target

📜 Paper Beihang Univ.

2025.11 3주차

Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

Honesty-Critical Neurons Restoration (HCNR): key expression-governing neurons를 찾아 pre-trained state로 복구. Hessian-guided compensation 이용

🧑🏻‍💻 Dev xAI

2025.11 3주차

Grok 4.1

reasoning architecture 변경 없이 dialogue behavior를 조정
reasoning-mode 기준으로 EQ-Bench3에서 Elo 점수 최고점 기록

🧑🏻‍💻 Dev Google

2025.11 3주차

A new era of intelligence with Gemini 3

텍스트, 이미지, 비디오, 오디오, 코드 등을 이해할 수 있으면서 1M token context window 지원
Google Antigravity: agent-first 개발 플랫폼으로 현재는 free 티어만 열려 있음

🧑🏻‍💻 Dev Ai2

2025.11 3주차

DR Tulu: An open, end-to-end training recipe for long-form deep research

SFT & Reinforcement Learning with Evolving Rubrics (RLER, online)
DR Tulu 8B checkpoint, RLER rubric generation & training framework, dr-agetn-lib 등 오픈소스로 공개

📜 Paper Shanghai AI Lab

2025.11 3주차

P1: Mastering Physics Olympiads with Reinforcement Learning

P1-235B-A22B 모델은 International Physics (IPhO 2025)에서 금메달 성적
math, coding 등의 벤치마크에서도 우수한 성능을 보인다고 설명

📜 Paper Duke

2025.11 3주차

It's LIT! Reliability-Optimized LLMs with Inspectable Tools

LIT (LLMs with Inspectable Tools): LLM의 tool-calling 능력을 이용해서 the most reliable & easy-to-trouble shoot solution을 선택하도록 함
이를 검증하기 위해 커스텀 가능한 1,300개의 datasets 구축
Harvard USPTO Patent Dataset & NeurIPS 2023 papers 기반으로 수학, 코딩, 모데링 문제들을 포함

🧑🏻‍💻 Dev Ai2

2025.11 3주차

Olmo 3: Charting a path through the model flow to lead open-source AI

Base 모델은 Qwen 2.5와 유사한 수준의 성능이며, post-training을 통해 기존 오픈소스 모델들보다 뛰어난 성능을 지닌 것으로 보고
data, code, model weights & checkpoints를 Apache 2.0로 공개

🧑🏻‍💻 Dev topoteretes

2025.11 3주차

Cognee

셀프 호스팅 또는 Cognee Cloud를 통해 메모리를 관리할 수 있음
벡터 & 그래프 하이브리드 검색 파이프라인
CLI & Web UI 제공

🧑🏻‍💻 Dev Google

2025.11 3주차

Introducing Nano Banana Pro

아이디어 시각화 품질이 엄청 뛰어남. 글자(영어) 표현이나 장표 구성.
inforgraphics, slide decks, memes, mockups, storyboards 등

🧑🏻‍💻 Dev OpenAI

2025.11 3주차

A free version of ChatGPT built for teachers

GPT-5.1 Auto 모델의 무제한 메세지, 검색, 파일 업로드, connectors 등 다양한 기능 지원
교사 개인화된 학습 지원과 동시에 데이터를 학습에 사용하지 않는 보안까지 보장

🧑🏻‍💻 Dev Meta

2025.11 3주차

Introducing Meta Segment Anything Model 3 and Segment Anything Playground

SAM 3 model checkpoints, evaluation datasets, fine-tuning code 공개
Segment Anything Playground 플랫폼을 제공하여 모델의 특성과 능력을 이해할 수 있도록 보조
또한 3D objects & human reconstruction from a single image 관련 SAM 3D 모델, 코드 및 데이터 역시 공개

📜 Paper Kandinsky Lab

2025.11 3주차

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

5.0 Image Lite (6B image generation), 5.0 Video Lite (2B text-to-video), 5.0 Video Pro (19B video generation)
code, model check-point 오픈소스로 공개
Diffusion Transformer with cross-attention (CrossDiT) for multimodal fusion of visual and textual information를 핵심 아키텍쳐로 설명

📜 Paper OpenMOSS

2025.11 2주차

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Thinking with Video: Sora-2와 같은 video generation 모델을 이용하여 unified framework에서 visual & textual reasoning
Video Thinking Benchmark 개발: (1) vision-centric tasks (2) text-centric tasks
self-consistency & in-context learning이 Sora-2 performance 향상에 기여할 수 있다고 설명

📜 Paper GAIR

2025.11 2주차

Context Engineering 2.0: The Context of Context Engineering

20여년에 걸친 발전 동향을 설명: sensor 정보 및 GUI 사용 시작 (1.0) → GPT-3 등장 (2.0) → human-level with social cues (3.0) → proactive superhuman intelligence (4.0)

📜 Paper Oxford, Microsoft

2025.11 2주차

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

general commonsense, professional disciplines, visual-centric perception 등 영역을 cover
CodeVQA: policy model이 rendered SVG에 관한 질문에 답변함으로써 symbolic fidelity를 평가
현재 frontier VLMs도 language-centric & visual-centric 태스크 간 gap을 보임

🧑🏻‍💻 Dev Google

2025.11 2주차

Introducing Nested Learning: A new ML paradigm for continual learning

Hope 아키텍쳐: self-modifying recurrent & context-aware learning. 이를 통해 Nested Learning이라는 패러다임 제시
Key Components: Deep Optimizers, Continuum Memory System (CMS), Self-Modifying Architecture

🧑🏻‍💻 Dev Skyvern AI

2025.11 2주차

Skyvern

AGPL-3.0 라이센스: 네트워크 이용시 소스 공개, 고지 필수 / 상업적 이용 가능
Task-Driven autonomous agent design + Playwright (browser automation library)
이러한 웹 기반 에이전트를 이용하여 학습용 데이터 크롤링에 활용하고자 하는 니즈 높음 (최근)

📜 Paper Mila, McGill

2025.11 2주차

Grounding Computer Use Agents on Human Demonstrations

GroundCUA: expert human demonstraions로 제작된 large-scale desktop grounding dataset 공개
12개 카테고리의 87개 어플리케이션 포함, 56K 스크린샷에 3.56M human-verified elements
GroundNext: instructions를 target UI elements에 map 할 수 있는 모델 패밀리 (3B & 7B)

📜 Paper Zhejiang Univ.

2025.11 2주차

Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning

이를 해결하기 위한 기존 방법론들은 복잡한 workflow 구성 위주로 되어 있어 문제를 근본적으로 해결하지 못한다고 지적 (inflexible)
Logits-to-Logic: logits strengthening & logits filtering을 LLM outputs의 logical defects를 교정하는 핵심 모듈로 사용하는 프레임워크

🧑🏻‍💻 Dev OpenAI

2025.11 2주차

GPT-5.1: A smarter, more conversational ChatGPT

Instant 모델의 경우 Intelligence 뿐만 아니라 communication style 개선도 많이 이뤄졌다고 설명
또한 쉬운 질문은 빠르게, 어려운 질문은 오랜 시간을 들여 처리하는 adaptive reasoning 적용
Preset 업데이트

🧑🏻‍💻 Dev Google DeepMind

2025.11 2주차

SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds

단순히 instruction을 따르는 것 외에도 think & reason 할 수 있다고 설명
human demonstration videos with language labels & Gemini-generated labels를 혼합하여 학습 데이터로 활용
multi-modal 정보나 다양한 언어, 이모지 등을 이해할 수 있음

📜 Paper NVIDIA

2025.11 2주차

TiDAR: Think in Diffusion, Talk in Autoregression

TiDAR: (Thinking) in Diffusion and sampels final outputs (Talking) AutoRegressively
specially designed structured attention masks를 이용하여 single forward pass 내에서 처리 가능
AR 모델들의 성능에 견주면서도 초당 4.71 ~ 5.91배의 토큰을 출력할 수 있었다고 보고

📜 Paper Beijing Jiaotong Univ.

2025.11 2주차

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

planning, tool use, memory와 같은 기능들이 외부 시스템에 의해 동작하는 게 아니라 모델의 internalized 능력으로 처리되는 추세
outcome-driven exploration RL을 넘어서 LLM + RL + Task 조합이 중요함을 역설
language, vision, embodied domains 모두 해당되는 내용

🧑🏻‍💻 Dev MiniMax

2025.11 1주차

MiniMax M2 & Agent: Ingenious in Simplicity

모델 가중치를 허깅페이스에 오픈소스로 공개 (오픈소스 모델 중 1위라고 함)

📜 Paper MoonShot AI

2025.11 1주차

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Delta Attention (KDA): Gated DeltaNet을 finer-grained gating mechanism과 함께 extend
이를 Multi-Head Latent Attention (MLA)와 교차하여 3B activated & 48B total parameters 모델 학습
맞춤형 chunk-wise algorithm은 Diagonal-Plus-Low-Rank (DPLR) transition matrices의 variant로 뛰어난 하드웨어 효율성을 보여줌

📜 Paper BAAI

2025.11 1주차

Emu3.5: Native Multimodal Models are World Learners

10T 토큰 이상의 vision-language interleaved data에 대해 unified next-token prediction 하도록 end-to-end pretrained
multi-modal reasoning & generation을 위한 post-training & RL
추론 효율성 향상을 위해서 Discrete Diffusion Adaptation (DiDA) 제안

📜 Paper Meta

2025.11 1주차

Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations (NeurIPS 2025)

잘못된 solutions에는 동의하지 않고, 올바른 solution은 상대방에게 설득할 수 있는 능력 등을 확인할 있는 tasks & metrics
현존 모델들은 undesirable socia behavior로 인해 혼자서 풀 수 있는 문제도 틀리는 경향이 있다고 설명
이를 해결하기 위해 synthetic multi-turn preference data를 생성하는 self-play method 제안

📜 Paper Alibaba

2025.11 1주차

AgentFold: Long-Horizon Web Agents with Proactive Context Management

context를 dynamic cognitive workspace로 treat
각 step이서 folding operation 실행: historical trajectory를 multiple sacles에서 관리
전체 대화의 흐름을 추상화 하면서도 세부 디테일들을 보존

🧑🏻‍💻 Dev Anthropic

2025.11 1주차

Emergent Introspective Awareness in Large Language Models

특정 시나리오에서 모델은 injected concepts의 존재를 정확하게 알아차릴 수 있다고 보고 → introspective awareness

📜 Paper Google DeepMind

2025.11 1주차

Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model

각각 prefix language modeling (LM), causal LM으로 pretrained
Redpajama V1 (1.6T) 로 pretrain & FLAN 으로 instruction tuning
150M ~ 8B 사이즈 모델 학습

📜 Paper Google Cloud, UCLA

2025.11 1주차

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

작은 사이즈의 open-source models는 여러 시도에도 correct solutions를 반환하는 일이 적어서 RLVR 적용이 어렵다
SFT의 경우 rigid token-by-token을 통해 long demonstration에 overfit 된다
Supervised Reinforcement Learning (SRL): 각 action을 commit 하기 전에 internal reasoning monologue를 생성하도록 모델 학습

🧑🏻‍💻 Dev Generalist

2025.11 1주차

GEN-0 / Embodied Foundation Models That Scale with Physical Interaction

Harmonic Reasoning: 모델이 think & act 를 동시에 할 수 있도록 학습시키는 방법으로 GEN-0의 핵심 feature라고 설명
7B 사이즈를 넘어가면서 작은 모델들에서 나타나던 ossification 문제가 개선됨 관측

🧑🏻‍💻 Dev Ai2

2025.11 1주차

Introducing OlmoEarth Platform: Powerful open infrastructure for planetary insights

기존에는 crop mapping, deforestation, land use classification 등 태스크별로 개별 모델이 필요했음
데이터 수집, 라벨링, 학습, 추론, 배포까지 한 번에 처리
OlmoEarth: 10 테라바이트가 넘는 양의 데이터로 pretrained model family

🧑🏻‍💻 Dev Microsoft

2025.11 1주차

Agent Lightning

agent 코드에 `agl.emit_xxx()`를 넣거나 tracer를 켜면 각 프롬프트 툴 호출 및 보상 신호가 구조화된 이벤트로 수집 → LightningStore → 작업, 리소스, 트레이스 동기화
선택된 알고리즘이 저장소의 스팬을 읽고 학습 → 학습 결과로 리소스를 저장소에 다시 게시

📜 Paper Tisnghua

2025.11 1주차

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

다음 세 가지를 제시
Reasoning-Enhanced RAG, RAG-Enhanced Reasoning, Synergized RAG-Reasoning framework

🧑🏻‍💻 Dev Google

2025.11 1주차

Exploring a space-based, scalable AI infrastructure system design

태양광이 우주에서 지상 대비 최대 8배 효율이라고 함
로켓 발사비가 2030년대 중반에 이르렀을 때 에너지 단가가 지상에서와 근접할 가능성이 있다고 보고 2027년도 초 프로토타입을 목표로 진행하는 프로젝트

🧑🏻‍💻 Dev Cognition

2025.11 1주차

Windsurf Codemaps: Understand Code, Before You Vibe It

거대하고 복잡한 코드 베이스를 이해할 수 있도록 Codemap 생성
Fast (SWE-1.5) & Smart (Sonnet 4.5) 방식을 Windsurf 내에서 선택 가능

📜 Paper Univ. of Milano-Bicocca

2025.11 1주차

Can Role Vectors Affect LLM Behaviour?

model activations로부터 29개의 role vectors를 만들고 다양한 도메인에 대해 벤치마크 성능을 평가
(1) activation addition: role-specific directions로 강화할 수 있는가 (2) directional ablation: 이를 제거할 수 있는가

🧑🏻‍💻 Dev Moonshot AI

2025.11 1주차

Introducing Kimi K2 Thinking

다수의 reasoning, coding 벤치마크에서 GPT-5, Sonnet 4.5 상회하는 성능으로 SoTA 달성
추론 비용은 이 모델들보다 10x - 20x 저렴
100M 이상 유저 | 20M$/a month 의 경우에만 Kimi K2를 명시하는 라이센스로 오픈소스임

📜 Paper MDGA

2025.11 1주차

Diffusion Language Models are Super Data Learners

데이터가 많거나 품질이 좋으면 늦게, 모델 사이즈가 클수록 빨리 나타남
dense & sparse 아키텍쳐 공통적으로 확인
세 가지 compounding factors

🧑🏻‍💻 Dev Edison

2025.11 1주차

Kosmos: An AI Scientist for Autonomous Discovery

사람이 6개월 동안 처리할 일을 하루만에 끝낼 수 있는 것으로 보고
1,500개의 papers를 읽고 42,000 lines of analysis code를 실행할 수 있다고 함

📜 Paper Tencent, Tsinghua

2025.11 1주차

Continuous Autoregressive Language Models

K개 tokens로 구성된 chunk를 single continuous vector로 압축하는 high-fidelity autoencoder 사용
the number of generative steps를 K 값에 비례하여 줄일 수 있게 됨
robust training, evaluation, controllable sampling을 가능토록 하는 likelihood-free framework 개발

2025년 10월 60건

📜 Paper Sheffield

2025.10 5주차

Can Confidence Estimates Decide When Chain-of-thought is Necessary for Llms?

4개의 방법론으로 비교 실험해본 결과 특정한 방법론이 특정한 데이터셋에 대해 무조건 좋다고 결론 내리기는 어렵다고 함

📜 Paper Meta, Berkeley

2025.10 5주차

Continual Learning via Sparse Memory Finetuning

사전 학습에 사용되었던 데이터보다 새로운 데이터에 대해 높은 activation 값을 갖는 memory slots만 사용

🧑🏻‍💻 Dev open-notebook

2025.10 5주차

open-notebook

16개가 넘는 모델들을 선택할 수 있음
docker를 이용하여 간편하게 설치할 수 있음

🧑🏻‍💻 Dev Anthropic

2025.10 5주차

Claude for Excel

🧑🏻‍💻 Dev Mistral AI

2025.10 5주차

Introducing Mistral AI Studio.

Built-in evaluation, Treaceable feedback loops, Proveanance and versioning, Governance, Flexible deployment 등을 핵심 특징으로 제시

🧑🏻‍💻 Dev Google

2025.10 5주차

Our Quantum Echoes algorithm is a big step toward real-world applications for quantum computing

최고급 슈퍼컴퓨터 대비 13,000배 빠른 속도
동작 원리: 양자 시스템에 신호 보냄 → 하나의 큐비트를 perturb → reverse evolution을 이용한 echo 측정

🧑🏻‍💻 Dev Ai2

2025.10 5주차

olmocr

📜 Paper Shanghai AI, Nanjing, CMU

2025.10 5주차

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

standard charts부터 complex interactive web UI에 이르는 large-scale, high-quality corpus를 생성하는 toolkit 제안
위 toolkit을 이용하여 JanusCode-800K 구축

📜 Paper Together, Stanford

2025.10 4주차

ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning

ReasonIF: reasoning instruction following 능력을 평가하는 벤치마크 도입
multilingual reasoning, formatting 등 6개의 카테고리로 구분
현존하는 open-source LRMs는 최대 0.25점을 기록하는 수준임

📜 Paper Nanjing, ETH

2025.10 4주차

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

self-consistency는 high estimation error, perplexity는 modeling error 라는 한계점 지적
이를 해결하기 위해 RPC 제안: Perplexity Consistency & Reasoning Pruning을 이용하는 hybrid method

📜 Paper PaddlePaddle

2025.10 4주차

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

109개 언어를 지원하며 다양한 elements 인식 가능 (text, table, formula, chart 등)
page-level parsing & element-level recognition에서 SoTA

🧑🏻‍💻 Dev Google

2025.10 4주차

Grounding with Google Maps: Now available in the Gemini API

$25 / 1,000 location-enhanced prompts

🧑🏻‍💻 Dev HuggingFace

2025.10 4주차

HuggingChat

🧑🏻‍💻 Dev Anthropic

2025.10 4주차

Claude Code on the web

터미널 접속 없이 웹에서 처리하는 기능이 codex와 동일

🧑🏻‍💻 Dev OpenAI

2025.10 4주차

Introducing ChatGPT Atlas

이용 시작부터 7일 간 promotion. 더 많은 호출 가능. 현재는 mac os만 지원
새로운 탭 화면이 검색창 같은데 ChatGPT 메인 화면이어서 대화 이력도 확인 가능

📜 Paper Spike Studio

2025.10 4주차

Automatic Prompt Generation via Adaptive Selection of Prompting Techniques

다양한 tasks 간의 semantic similarity를 기반으로 knowledge base를 constructs
유저가 task descriptions를 입력하면 system이 가장 관련성 높은 task cluster로 assign

🧑🏻‍💻 Dev Google

2025.10 4주차

Google AI Studio

📜 Paper Zhejiang, NUS

2025.10 4주차

LightMem: Lightweight and Efficient Memory-Augmented Generation

(1) cognition-inspired sensory memory가 lightweight compression을 통해 무관한 데이터를 filter & 주제에 따라 그룹화
(2) topic-aware short-term memory가 이런 topic-based groups를 consolidate
(3) long-term memory가 이러한 정보를 활용

📜 Paper JHU, PKU, Princeton, MIT, Harvard

2025.10 4주차

World-in-World: World Models in a Closed-Loop World

World-in-World: real agent-environment를 반영하는 closed-loop에서 WM를 벤치마크하는 open platform
다양한 WMs를 평가하는 4개의 closed-loop environments를 curate
또한 embodied setting에서 WM에 대한 data scaling law를 제안

📜 Paper HKUST, NYU

2025.10 4주차

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

reasoning traces의 토큰 확률의 entropy 계산 → U-shaped entropy pattern 발견
쉬운 문제에 대해서도 높은 entropy를 갖고 있음 (정확한 답변임에도 불구하고)
DiffAdapt: 각 question의 난이도와 reasoning trace entropy를 근거로 Easy/Normal/Hard 추론 전략을 선택하는 프레임워크

📜 Paper Tsinghua, GIT

2025.10 4주차

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

AdaSPEC: KD process에 selective token filtering을 통합한 방법론 제시
reference model을 사용하여 difficult-to-fit tokens를 filtering → simpler tokens에 대해 better align

📜 Paper DeepSeek AI

2025.10 4주차

DeepSeek-OCR: Contexts Optical Compression

DeepEncoder & DeepSeek3B-MoE-A570M decoder
텍스트 토큰이 vison 토큰의 10배보다 적게 유지되는 경우 OCR 정확도는 97% 수준 (압축률이 10배 미만이면)

🧑🏻‍💻 Dev Anthropic

2025.10 3주차

A small number of samples can poison LLMs of any size

모델 사이즈에 비례하여 더 많은 데이터를 학습하게 되므로 이를 attack 하기 위해서는 training data의 비율을 조정해야 한다는 것이 관념이었으나 “고정된” 개수의 documents로 attack이 가능하다고 주장하는 것임

📜 Paper Stanford

2025.10 3주차

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

agent, domain-specific benchmark에서 ACE가 context를 offline & online 둘 다 잘 optimize 한다는 실험 결과

📜 Paper KAIST

2025.10 3주차

KORMo: Korean Open Reasoning Model for Everyone

(1) synthetic data로 model collapse 없이 pre-training 가능
synthetic data-driven fully open models (FOMs)
(2) bilingual instruction tuning으로 near-native reasoning & coherence 달성 가능

🧑🏻‍💻 Dev Adrej Karpathy

2025.10 3주차

Nanochat

학습 및 추론 돌리는데 $100 정도 비용

🧑🏻‍💻 Dev MS

2025.10 3주차

Introducing MAI-Image-1, debuting in the top 10 on LMArena

📜 Paper Princeton

2025.10 3주차

Skill-Targeted Adaptive Training

teacher는 task dataset을 사용해서 list of skills를 만들고, 각 스킬에 필요한 data point에 labeling
student’s answers를 monitoring하여 Missing-Skill-Profile를 생성
STAT-Sel: 이에 따라 training examples를 adaptively reweights

📜 Paper NYU

2025.10 3주차

Diffusion Transformers with Representation Autoencoders

high-quality reconstructions & semnatically rich latent spaces 제공

🧑🏻‍💻 Dev Alibaba

2025.10 3주차

Qwen3-VL

FP8 deployment 가능
일부 벤치마크에서 Gemini 2.5 Flash-Lite & GPT-5 Nano 능가

📜 Paper Shanghai Jiao Tong

2025.10 3주차

AI for Service: Proactive Assistance with AI Glasses

Alpha-Service: 두 가지 challenges를 address (using AI Glasses)
Know When to intervene by detecting service opportunities
Know How to provide both generalized & personalized services

🧑🏻‍💻 Dev Anthropic

2025.10 3주차

Introducing Claude Haiku 4.5

Sonnet 모델과 유사한 아키텍쳐를 따르고 있으나 speed & cost efficiency를 최적화하는 것에 집중

🧑🏻‍💻 Dev Alibaba

2025.10 3주차

Meet Your AI Memory

📜 Paper Meta

2025.10 3주차

The Art of Scaling Reinforcement Learning Compute for LLMs

→ ScaleRL 제시: 100,000 GPU hours까지 scale-up 가능한 best-practice recipe라는 점을 입증

📜 Paper Stanford

2025.10 3주차

Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Verbalized Sampling (VS): model collapse를 피할 수 있는 training-free prompting strategy
responses에 대한 probability distribution을 모델이 스스로 verbalize 하는 것만으로도 creative writing, dialogue simulation, open-ended QA 등 태스크에서 답변 다양성 크게 증가 (factual accuracy 감소 없이)

🧑🏻‍💻 Dev OpenAI

2025.10 2주차

OpenAI DevDay 2025

AgentKit: Agent Builder, ChatKit, Evals (타사 모델 평가 지원), RFT, Guardrail 등
Models & API update: GPT-5 Pro (API), Sora 2 (API), gpt-realtime-mini, gpt-image-1-mini
Codex 일반 제공: Slack 연동, Codex SDK, 관리자 기능

📜 Paper Maryland

2025.10 2주차

Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

정확히는 모델들의 internal knowledge & confidence를 활용

📜 Paper Anthropic, Oxford

2025.10 2주차

Eliciting Secret Knowledge from Language Models

3개 model families로 black-box & white-box 스타일 둘 다 연구
가장 퍼포먼스가 좋았던 것은 black-box 스타일 중 하나인 prefill attacks: LLM이 predefinex prefix가 주어졌을 때 completion 하면서 secret reveal

📜 Paper Oxford, Apple

2025.10 2주차

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

각 document에 quality score를 부여
CQF가 downstream task 퍼포먼스는 향상시키지만, 반드시 high-quality dataset modeling으로 이어지는 것은 아니라고 지적
왜냐하면 CQF가 high-qaulity dataset 또한 filtering 하는 경우가 있기 때문

🧑🏻‍💻 Dev Google

2025.10 2주차

Meet Jules Tools: A Command Line Companion for Google’s Async Coding Agent

과거 수정 내역과 개발자의 preferences를 기억하는 context awareness
dashboard-style tasks view를 terminal에서 지원

📜 Paper CMU

2025.10 2주차

LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

retrieved context가 모델 답변에 필요할지에 대한 internal signal이 존재하는지 탐구
correct, incorrect, irrelevant context로 비교 실험
intermediate layer activations에 대해 trained simple classifier를 사용하는 것만으로도 첫 번째 토큰의 activation을 분석하여 75% 정확도를 달성함

📜 Paper Meta, NYU

2025.10 2주차

A Single Character can Make or Break Your LLM Evals

comma? new line? semi-colon?, …
Llama, Qwen, Gemma model family로 비교실험한 결과 the choice of delimiter가 MMLU에 대한 성능을 +- 23%까지 영향을 줬다고 설명
심지어 topics, models families 구분 없이 존재하는 현상이며 scale에 따른 개선도 없다고 함

🧑🏻‍💻 Dev Google

2025.10 2주차

Introducing the Gemini 2.5 Computer Use model

Gemini 2.5 Pro의 visual understanding & reasoning capability 기반으로 specialized
web & mobile control benchmarks에서 다른 모델들 outperform with lower latency
Google AI Studio & Vertext AI 등에서 access 가능

🧑🏻‍💻 Dev Google

2025.10 2주차

Speech-to-Retrieval (S2R): A new approach to voice search

Simple Voice Questions (SVQ) dataset open-sourcing: 17개 언어, 27개 지역 대상으로 수집된 short audio questions. S2R 평가에 사용됨

📜 Paper Samsung

2025.10 2주차

Less is More: Recursive Reasoning with Tiny Networks

27M parameters trained on small data (~1000 examples)
Tiny Recursive Model (TRM): 더 간단한 recursive reasoning approach로, HRM보다 뛰어난 일반화 성능을 지녔다고 설명
only 2 layers. 7M parameters

🧑🏻‍💻 Dev Figure

2025.10 2주차

Introducing Figure 03

each fingertip은 high-fidelity tactile sensor를 통해 real-time perception & reasoning을 가능토록 함

📜 Paper Tsinghua

2025.10 2주차

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

이를 통해 KV-Cache가 inter-model communication의 effective medium이라고 주장
Cache-to-Cache (C2C): LLMs 간의 direct semantic communication을 위한 새로운 paradigm
neural network를 사용하여 source model’s KV-cache를 project & fuse with that of target model

📜 Paper Meta

2025.10 2주차

Agent Learning via Early Experience

현재는 expert data로 fine-tuning하고 있으나 이는 scale-up 할 수 없는 원인이 됨
early experience: agent’s own actions로 생성된 interaction data로 future states는 reward signals 없이 supervision으로 serve
→ Implicit world modeling, Self-refelction

📜 Paper NVIDIA, MIT, HKUST

2025.10 1주차

LongLive: Real-time Interactive Long Video Generation

KV-recache mechanism: new prompts을 통해 cached states를 refresh
short window attention paired with a frame-level attention sink

🧑🏻‍💻 Dev Anthropic

2025.10 1주차

Introducing Claude Sonnet 4.5

30시간 넘게 처리해야 하는 코딩 태스크도 수행 가능하다고 설명

🧑🏻‍💻 Dev Microsoft

2025.10 1주차

Vibe working: Introducing Agent Mode and Office Agent in Microsoft 365 Copilot

[SpreadsheetBench](https://spreadsheetbench.github.io/)에서 SoTA

🧑🏻‍💻 Dev Ai2

2025.10 1주차

Asta DataVoyager: Data-driven discovery and analysis

spreadsheet, csv와 같은 structured data에서 explainable answer 반환 (복사 가능한 코드, 시각적 자료 등과 함께)
on-premise, private cloud에서 데이터 관리 (보안)

🧑🏻‍💻 Dev OpenAI

2025.10 1주차

Sora 2 is here

physics-aware, synchronized audio, controllability 등 특징
5-10s output, 워터마크
invite-only launch in U.S. & Canada

📜 Paper NUS

2025.10 1주차

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

richer & diverse interactions 필요. CRUD operations 포함.
gpt-5-medium이 52.56% pass@1, 33.86% pass^4로 현재 기준 최고 성능

🧑🏻‍💻 Dev Thinking Machines

2025.10 1주차

Announcing Tinker

Llama-3.x ~ Qwen3 시리즈 모델 대상으로 학습 가능. 중간 체크포인트도 다운로드 가능

🧑🏻‍💻 Dev Google

2025.10 1주차

AI as a research partner: Advancing theoretical computer science with AlphaEvolve

LLM을 통해 기존 연구 자료 요약, 새로운 이론과 관련된 연구 계획, 이를 위한 증거(proofs) 단계를 밟게 될텐데, 특히 proof 확보에 AlphaEvolve를 활용할 수 있을 것이라 설명

📜 Paper NUS, Oxford, Stanford

2025.10 1주차

GEM: A Gym for Agentic LLMs

기존 OpenAI-Gym이 제공하던 것들을 그대로 지원 - asynchronous vectorized execution for high throughput & flexible wrappers for easy extensibility
추가로, robust integrated tools & single-file example scripts with five popular RL training frameworks 지원

📜 Paper Imperial College London

2025.10 1주차

Fine-tuning with RAG for Improving LLM Learning of New Skills

(1) agent failures로부터 compact & reusable hints 추출
(2) 이 hints를 episode start 시점에 one-shot retrieval에 사용하여 improved teacher trajectories 생성
(3) hint strings를 제거하여 student 모델을 학습함으로써 memorization 대신 internalization 유도

📜 Paper Meta, Johns Hopkins

2025.10 1주차

The Era of Real-World Human Interaction: RL from User Conversations

RLHI with User-Guided Rewrites: unsatisfactory model outputs를 유저의 natural-language follow-up response 기반으로 수정
RLHI with User-Based Rewards: 유저의 long-term interaction history로 conditioned된 reward 모델을 통해 학습
WildChat 데이터를 두 방식으로 학습한 모델이 personalization & instruction-following 관점에서 baseline outperform

🧑🏻‍💻 Dev DeepSeek AI

2025.10 1주차

DeepSeek-V3.2-Exp

본 Sparse Attention은 long-context scenarios를 위해 설계된 디자인
HuggingFace의 inference를 이용한 demo 시연 가능

2025년 9월 59건

📜 Paper Shanghai AI

2025.09 4주차

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

ScaleCUA: 6개의 운영체제와 3개의 task domains에 대한 large-scale 오픈소스 dataset

📜 Paper Tsinghua, Northeastern

2025.09 4주차

DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

end-to-end multi-turn RL을 적용하여 LLMs의 long-horizon reasoning with deep search 능력 향상 도모
DeepDive-32B: BrowseComp에서 WebSailor, DeepSeek-R1-Browse 등을 outperform

📜 Paper Zayed University

2025.09 4주차

K2-Think: A Parameter-Efficient Reasoning System

Long CoT SFT, RLVR, Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, Inference-optimized Hardware
다른 reasoning 모델과 마찬가지로 수학, 과학, 코딩 영역에 특화되어 있다고 설명
각 요청마다 초당 2천 토큰씩 처리할 수 있는 서빙 환경으로 오픈소스 모델 이용 가능 ([허깅페이스 링크](https://huggingface.co/LLM360/K2-Think), [Chat UI 링크](https://www.k2think.ai/guest))

📜 Paper Apple

2025.09 4주차

AToken: A Unified Tokenizer for Vision

perceptual & Gram matrix losses를 결합한 adversarial-free training objective 제시
curriculum training 방식을 택하여 single images에서부터 videos, 3D 처리할 수 있도록 학습
continuous & discrete latent tokens 둘 다 처리 가능하다는 특징

📜 Paper Cornell, CMU

2025.09 4주차

Predicting Language Models' Success at Zero-Shot Probabilistic Prediction

LLM이 base prediction task를 잘 수행할 때, 이것의 individual-level의 예측 능력은 훨씬 강해진다고 설명
이를 토대로 LLM의 성능을 task level에서 측정할 수 있는 metric을 제시하여 LLM이 잘하는 태스크와 그렇지 않은 것을 구분할 수 있도록 함

🧑🏻‍💻 Dev xAI

2025.09 4주차

Grok 4 Fast

web & X search, 2M context window, reasoning & non-reasoning

📜 Paper Microsoft, Tsinghua

2025.09 4주차

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Repository Planning Graph (RPG): 파일 구조, data flows, functions 등을 한 개의 graph 내에 encoding
ZeroRepo: scratch부터 repo를 생성하는 graph-driven framework
proposal-level planning, implemetation-level refinement, graph-guided code generation 순서로 실행

📜 Paper School of AI

2025.09 4주차

A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

State Reconstruction & History Remind 할 수 있는 prompt engineering method 소개

📜 Paper ASI

2025.09 4주차

LIMI: Less is More for Agency

78개의 training samples만으로 학습한 모델이 다른 SoTA급 모델들의 퍼포먼스를 상회
즉, 데이터 양치기가 좋은 agentic intelligence를 만드는데 도움이 되지 않는다는 것

🧑🏻‍💻 Dev Alibaba

2025.09 4주차

Qwen3-Omni: Natively Omni-Modal Foundation Models!

36개 벤치마크 중 32개 SoTA, 119개 텍스트 언어 & 19개 speech 언어 처리, 30분 길이의 audio input 처리 가능
Thinker-Talker: Thinker는 텍스트를 생성하고 Talker는 speech를 실시간 stream
20M+ hours 학습한 AuT encoder, MoE, Joint pretraining 등의 특징

🧑🏻‍💻 Dev DeepSeek AI

2025.09 4주차

DeepSeek-V3.1-Terminus

최근 업데이트를 통해 language consistency 이슈도 해결

🧑🏻‍💻 Dev Figma

2025.09 4주차

Connect Figma to top MCP clients

VS Code, Cursor, Claude Code 등 다양한 서비스들에서 MCP 서버 연동 가능

📜 Paper Michigan

2025.09 4주차

Benchmarking and Improving LLM Robustness for Personalized Generation

robust LLM: factually accurate & align with user preferences
PERG: PREGData를 이용한 모델의 preference 평가 프레임워크
Pref-Aligner: 모델의 robustness를 크게 향상시켜주는 two-stage approach

🧑🏻‍💻 Dev Google Chrome

2025.09 4주차

Chrome DevTools (MCP) for your AI agent

디버깅, 성능 추적 및 네트워크 분석 등을 위한 26개의 built-in tools
Claude, Cursor, Copilot, Gemini CLI 등을 통해 사용 가능

📜 Paper Meta

2025.09 4주차

CWM: An Open-Weights LLM for Research on Code Generation with World Models

Python interpreter & agentic Docker environments로부터 observation-action trajectories를 대량으로 mid-train

📜 Paper Salesforce

2025.09 3주차

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

reasoning-optimized models에 대한 continual reinforcement learning을 제안하여 reasoning ability를 보존하면서도 agentic skills를 강화하고자 함
Length-normalized RL Objective, Trajectory Filtering, Partial Rollouts 등

📜 Paper Individual

2025.09 3주차

SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning

Self-Improving Faithfulness-Aware Contrastive Tuning: self-instruct mechanism을 이용하여 base LLM이 자동적으로 고품질의 structured contrastive learning data를 생성하도록 만듦 (positive & negative samples)

📜 Paper HKUST

2025.09 3주차

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

VLA-Adapter를 제시하여 large-scale VLMs & extensive pre-training에 대한 의존 낮춤
lightweight Policy module with Bridge Attention 제시: action space 내에 optimal condition을 자율적으로 injects
robotic data pre-training 없이, 단 0.5B parameter backbone으로 높은 퍼포먼스 달성

📜 Paper Princeton

2025.09 3주차

Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training

(1) 현존 LLMs는 특정 종류의 의사 결정에 대한 internal process를 정확하게 기술할 수 있는 능력이 있음
(2) 이러한 능력은 학습을 통해 강화하는 것도 가능
(3) 학습된 능력은 어느정도 일반화 가능

📜 Paper Google DeepMind, Toronto

2025.09 3주차

Virtual Agent Economies

mission economies를 도입하여 agents들이 공동의 목표를 달성할 수 있도록 함으로써 trust & safety 가 더 잘 보장되는 환경을 조성할 수 있었다고 설명

🧑🏻‍💻 Dev OpenAI

2025.09 3주차

Introducing upgrades to Codex

Code review, Dynamic reasoning (task 난이도에 따라), Tool use 등의 핵심 features
CLI, IDE extension, Cloud 등 다양한 환경에서 지원
깃허브 코드 리뷰 자동화 [가이드](https://developers.openai.com/codex/cloud/code-review) by OpenAI

🧑🏻‍💻 Dev Meta

2025.09 3주차

MobileLLM-R1

1B도 되지 않는 사이즈의 모델 family로 Qwen3 0.6B를 능가하는 성능을 보여준다고 함
사전학습에는 2T, 총 5T 토큰 정도 학습했다고 밝힘

📜 Paper Berkeley, Washington

2025.09 3주차

Reconstruction Alignment Improves Unified Multimodal Models

Reconstruction Alignment (RecA): visual understanding encoder embeddings를 dense ‘text prompts’로 이용하여 captions 없이도 보다 풍부한 supervision을 제공하는 post-training method
visual understanding embeddings를 조건으로 input image를 reconstruct 하는 self-supervised reconstruction loss 근거로 학습
autoregressive, masked-autoregressive, diffusion-based 등 어떤 형태에도 적용 가능하면서도 뛰어난 성능을 보여줌

🧑🏻‍💻 Dev Google

2025.09 3주차

VaultGemma: The world's most capable differentially private LLM

DP: 학습 시 노이즈를 추가하여 학습 데이터가 모델로부터 추출되는 것을 방지하는 mathematical framework (민감 정보 보호)
모델 성능을 저해하지 않으면서도 privacy를 지킬 수 있도록 하는 새로운 scaling law 제시

📜 Paper Nanjing, Shanghai AI

2025.09 3주차

The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations

target LLM에 의해 생성되는 hidden representations만을 이용하여 난이도를 추정하는 방식을 제안
token-level generation process를 Markov chain으로 모델링하고, value function을 정의하여 hidden state 기반으로 output quality를 추정

🧑🏻‍💻 Dev Google

2025.09 3주차

Powering AI commerce with the new Agent Payments Protocol (AP2)

매 단계는 로그로 남아서 안전성과 신뢰성을 높임

🧑🏻‍💻 Dev Alibaba

2025.09 3주차

Tongyi DeepResearch: A New Era of Open-Source AI Researchers

prompt engineering 없이 ReAct 방식으로 inference
30B 사이즈 모델로 OpenAI DeepResearch 급 성능 달성

📜 Paper Peking

2025.09 3주차

Early Stopping Chain-of-thoughts in Large Language Models

각 reasoning step마다 LLM이 현재 시점의 최종 답변을 생성토록 하고 이를 step answer로 명명
이 step answer가 연속적으로 동일한 답변이 나온 횟수를 answer convergence의 지표로 해석

📜 Paper Algoverse

2025.09 3주차

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness

FRIT: 모델이 systematically corrupted examples로부터 causally consistent reasoning을 생성하는 방법을 배울 수 있도록 돕는 학습 scalable alignment
reasoning 매 step에 대해 합성 데이터를 생성하여 faithful/unfaithful pairs 구축하고 DPO 학습

🧑🏻‍💻 Dev Thinking Machines Lab

2025.09 3주차

Defeating Nondeterminism in LLM Inference

batch size 변동, normalization, multiplication, attention 등의 연산이 항상 동일한 결과를 반환할 수 있도록 함
대신 실험에서 1,000개 시퀀스를 처리하는데 26초가 걸리던 것이 42초가 걸리는 정도의 trade off 발생 (62% slow down)

📜 Paper Microsoft

2025.09 3주차

Is In-Context Learning Learning?

오히려 모델은 prior knowledge & given exemplars 에 의존한다고 설명
autoregression’s ad-hoc encoding is not a robust mechanism 그리고 제한된 all-purpose generalisabilty 제안

🧑🏻‍💻 Dev OpenAI

2025.09 3주차

Detecting and reducing scheming in AI models

모델이 평가 상황을 탐지하면 scheming behavior를 바꾼다는 연구 결과
reinforcement learning & targeted anti-scheming objectives를 적용하여 situational awareness를 높이고 scheming을 줄일 수 있음

📜 Paper NVIDIA

2025.09 2주차

Universal Deep Research: Bring Your Own Model and Strategy

UDR: 어떤 언어 모델이든 사용할 수 있고, 유저가 스스로 deep research strategies를 추가적인 학습 없이도 custom 할 수 있도록 돕는 generalist agentic system
Phase 1: skipped steps and drift를 줄이기 위한 strategy compiles → Phase 2: executes synchronous tool calls & yield-based notifications

📜 Paper Emory Univ.

2025.09 2주차

Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

knowledge graphs를 dynamically constructs & expands 하는 framework 제안
question으로부터 seed KG를 추출하고, 이를 바탕으로 LLM’s latent knowledge를 이용하여 iterative expansion 수행

📜 Paper Arizona, Michigan

2025.09 2주차

Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?

한 LLM이 여러 개의 응답을 생성하고, 다른 LLM(auxiliary)이 disagreement patterns을 분석하도록 지시

📜 Paper Univ. of Bamberg

2025.09 2주차

Are Humans as Brittle as Large Language Models?

이에 따라 human annotators도 instruction changes에 유사한 sensitivity를 보이는지 확인하고자 함
실험 결과에 따르면 human annotators & LLMs 모두 특정한 prompt 수정 유형에 대해 불안정(brittlenss)한 특성을 보임

📜 Paper ByteDance, HKUST, Peking, Tsinghua

2025.09 2주차

Reverse-Engineered Reasoning for Open-Ended Generation

REverse-Engineered Reasoning (REER): trial-and-error | imitation을 통해 reasoning process forwards를 building 하는 것 대신 known good solutions로부터 backwards works
DeepWriting-20K: 20,000 deep reasoning trajectories 데이터를 오픈소스화

📜 Paper Meta Superintelligence, UC Berkeley

2025.09 2주차

Language Self-Play For Data-Free Training

추가적인 데이터 없이 모델 성능을 개할 수 있는 강화학습 방식 제안
Language Self-Play (LSP): 모델이 스스로 play하면서 stronger policies 형성
Llama-3.2-3B-Instruct 모델로 실험한 결과 제시

📜 Paper HKUSK, MiniMax, Waterloo

2025.09 2주차

WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

WebExplorer: model-based exploration & iterative, long-to-short query evolution 데이터 생성 방법론
WebExplorer-8B: 128K, 100 tool calling turns

📜 Paper HKUST, Jilin Univ., CUHK

2025.09 2주차

Implicit Reasoning in Large Language Models: A Comprehensive Survey

representational forms → computational strategies
how & where internal computation unfolds: latent optimization, signal-guided control, layer-recurrent execution

🧑🏻‍💻 Dev Anthropic

2025.09 2주차

Claude can now create and edit files

raw data를 input으로 주면 이를 분석한 결과 및 통계적 분석, 시각화 자료, 인사이트 등을 반환

🧑🏻‍💻 Dev ByteDance

2025.09 2주차

Seedream 4.0

batch input & output, prompt-based editing, versatile styles, knowledge-driven generation 등을 특징으로 삼음
모델 성능은 MagicBench 기준으로 평가하여 공개 (Text-to-Image, Single-Image Editing)

📜 Paper Zurich, Gothenburg

2025.09 2주차

Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

21편의 사회과학 연구에서 나온 37개 data annotation 태스크를 18개 LLM으로 재현
13M개의 LLM labels 생성 & 2,361개의 realistic hypotheses 검증 → SOTA 모델도 1/3 오류, 소형 모델은 1/2 오류
결국 false positive (1종 오류) 발생을 줄이기 위해서는 human annotation이 중요하다는 결론

🧑🏻‍💻 Dev Alibaba

2025.09 2주차

Qwen3-Next: Towards Ultimate Training & Inference Efficiency

Qwen3-Next-80B-A3B-Base: dense Qwen3-32B에 에 준하는 성능. 32K context window를 지원하는데 10배 높은 throughput 달성
Qwen3-Next-80B-A3B-Instruct, Thinking 두 모델도 공개. 256K context window
포스트 내에 아키텍쳐에 대한 자세한 설명 포함되어 있음

📜 Paper Apple

2025.09 2주차

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

text encoder를 제외 → contrastive loss는 오직 순수하게 generative training signal만 측정함
OpenVision 2
training time & memory consumption을 크게 줄이면서도 기존 모델 성능 유지

📜 Paper Harvard University, Cambridge

2025.09 1주차

Lexical Hints of Accuracy in LLM Reasoning Chains

(1) CoT length (2) intra-CoT sentiment volatility (3) lexicographic hints
Humanity's Last Exam (HLE), Omni-MATH 대상으로 DeepSeek-R1 & Claude 3.7 Sonnet 테스트
guess, stuck, hard와 같은 어휘들이 uncertainty의 강한 지표로 확인되었고, sentiment는 보조 지표 정도로 활용 가능

🧑🏻‍💻 Dev Ai2

2025.09 1주차

Asta: Accelerating science through trustworthy agentic AI

scientific AI의 지평을 넓히고 투명성을 증진하기 위한 [AstaBench](https://allenai.org/asta/bench)
Asta resources: scientific AI agents를 build, test, refine 하기 위한 a set of softwoare components

🧑🏻‍💻 Dev Microsoft

2025.09 1주차

MAI-Voice-1, MAI-1-preview

MAI-Voice-1
single GPU에서 구동 가능하며 일 초 내에 일 분 길이의 오디오 생성 가능
single- / multi- speaker 시나리오에서 expressive, natural speech 지원

🧑🏻‍💻 Dev Apple

2025.09 1주차

FastVLM: Efficient Vision Encoding for Vision Language Models

추론 코드, 모델 체크포인트, iOS/macOS demo는 깃허브 [링크](https://github.com/apple/ml-fastvlm/)에서 확인 가능
허깅페이스 데모 [링크](https://link.alphasignal.ai/CPaC4b)

🧑🏻‍💻 Dev Google

2025.09 1주차

Stop “vibe testing” your LLMs. It's time for real evals.

한 번의 평가로 다양한 조합의 성능을 확인
The complete toolkit for AI evaluation
현재는 미국에서만 사용 가능

🧑🏻‍💻 Dev Tencent

2025.09 1주차

Hunyuan-MT

중국의 5개 소수 민족 언어를 포함한 33개 언어 커버
pretrain → CPT → SFT → translation rl → ensemble rl ([technical report](https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf) 참고 가능)

🧑🏻‍💻 Dev Google

2025.09 1주차

Welcome EmbeddingGemma, Google's new efficient embedding model

308M 사이즈 & 2K context window, 100개 이상 언어 지원
Gemma3 모델을 backbone으로 삼고 있으나, bi-directional attention으로 modified
Matroyshka Representation Learning (MRL)로 학습되어 768 차원의 ouput을 512, 256, 128 차원으로 truncate 할 수 있음

🧑🏻‍💻 Dev Microsoft

2025.09 1주차

VibeVoice: A Frontier Open-Source Text-to-Speech Model

speaker consistency, natural turn-taking 등의 문제를 크게 해결
ultra-low frame rate of 7.5Hz에서 operating 하는 continuous speech tokenizers 사용
Context-Aware Expression 데모가 있어서 들어봤는데 엄~청 자연스럽지는 않은 느낌

📜 Paper Oxford, Shanghai AI, NUS, UCL, …

2025.09 1주차

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

두 가지 taxonomy로 구분
planning, tool use, memory 등을 포함하는 core agentic capabilities
다양한 태스크 도메인에 대한 applications

🧑🏻‍💻 Dev OpenAI

2025.09 1주차

Why language models hallucinate

modern training pipeline에서 hallucinations의 통계적 원인을 분석
이진 분류의 오류에 기인한다고 설명
incorrect statements가 facts와 구별되지 않는다면, PLM은 natural statistical pressures를 기반으로 hallucinate 한다고 설명

📜 Paper Manchester

2025.09 1주차

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

겉으로 봤을 땐 non-sense이지만 contextual inference, moral reasoning, emotional interpretation을 통해 implicit meaning을 encoding 해야됨
현존 LLM들은 아직까지 Drivelological text를 온전히 이해하지 못한다고 설명
English, Mandarin, Spanish, French, Japanese, Korean 등 언어에 대해 1,200여 개 데이터를 meticulously curate

📜 Paper Meta, NUS, Rice

2025.09 1주차

REFRAG: Rethinking RAG based Decoding

긴 입력을 처리하면서 발생하는 knowledge enrichment & system efficiency 간 trade-off
검색된 텍스트의 대부분은 query와 상관없음
RAG context에서 decoding 할 때 대부분의 연산은 불필요하며, 제거하더라도 전체 성능에 크게 영향주지 않는다고 주장

📜 Paper ByteDance

2025.09 1주차

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

GUI 에이전트가 단순한 조작을 넘어 복잡한 환경에도 적응할 수 있음

📜 Paper Stanford

2025.09 1주차

MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

millions of structural causal models (SCMs) 로부터 ML tasks를 합성하여 1,024 shot 생성
random-forest teacher로 시작하여 tree-based decision strategies를 LLM에 distill
모든 tasks는 token-efficient prompt로 serialized

2025년 8월 63건

🧑🏻‍💻 Dev xAI

2025.08 4주차

xai-org/grok-2

각 토큰당 62B activated parameters
tensor parallelism을 이용하여 8개 GPU에서 serving 가능

🧑🏻‍💻 Dev GitHub

2025.08 4주차

Why we open sourced our MCP server, and what it means for you

🧑🏻‍💻 Dev Anthropic

2025.08 4주차

Enhancing Model Safety through Pretraining Data Filtering

6개의 classifier approaches
classifier에 사용된 모델은 Claude 3.5 Haiku보다도 훨씬 작았다고 설명

📜 Paper Shanghai AI Lab

2025.08 4주차

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Cascade Reinforcement Learning (Cascade RL) framework: offline RL for stable convergence & online RL for refined alignment (coarse-to-fine)
Visual Resolution Router (ViR)를 통해 성능 열화 없이 visual tokens의 resolutions를 조정
Decoupled Vision-Language Deployment (DvD) strategy: vision encoder & language model을 서로 다른 GPU에 분리함으로써 computational load를 효율적으로 관리

📜 Paper Microsoft

2025.08 4주차

CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models

entropy gap & contextual peakedness를 confidence-aware measures로 이용하여 conflic 해결
심지어 low conflict settings에서도 높은 퍼포먼스를 보였다고 설명 (QA, Summarization 등)

📜 Paper UIUC, HKUST

2025.08 4주차

Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding

Learn then Retrieve, LRTab: 학습 데이터로부터 배운 정보와 유관한 것을 retrieving 하여 활용하는 prompting-based reasoning approach
incorrect CoTs에 대해서는 모델이 에러를 피할 수 있도록 Prompt Conditions가 무엇이었을지 예측하도록 프롬프팅

🧑🏻‍💻 Dev Google

2025.08 4주차

Introducing Gemini 2.5 Flash Image, our state-of-the-art image model

캐릭터 특성을 그대로 잘 유지하면서 지시 사항을 잘 따라 변경해준다는 특징으로 큰 화제가 됨

🧑🏻‍💻 Dev Google

2025.08 4주차

NotebookLM's Video Overviews are now available in 80 languages

🧑🏻‍💻 Dev Anthropic

2025.08 4주차

Piloting Claude for Chrome

현재는 Max 유저 1,000명 대상으로 early access (wait list 등록 필요)
여러 위험성에 대해서도 사전 고지를 하고 있는 상황
올해 초 OpenAI에서도 web-browsing 기능을 공개했었으나 현재 제대로 쓰이고 있는지에 대해서는 확인이 필요함

📜 Paper UC Berkeley

2025.08 4주차

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

MCP 기반으로 build 되어 LLM을 28개의 대표적인 live MCP servers와 연결하여 다양한 도메인(finance, traveling 등)을 다룸
multi-faceted evaluation framework 제안

🧑🏻‍💻 Dev xAI

2025.08 4주차

Grok Code Fast 1

GitHub Copilot, Cline, Cursor, Roo Code, Windsurf 등에서 사용 가능
TS, Python, Java, Rust, C++, Go 등 다양한 언어를 다룰 수 있으며, 서빙단에서 속도를 최적화했음을 언급

📜 Paper KTH

2025.08 4주차

Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

🧑🏻‍💻 Dev Meta

2025.08 3주차

DINOv3

Gram anchoring loss를 사용하여 dense patch consistency를 보존하고 resolution, size, text alignment를 위한 post-hoc tweaks를 더함

📜 Paper ByteDance

2025.08 3주차

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

시간에 따라 축적되는 knowledge를 semantic memory로 관리 (episodic memory와 별도)
M3 Bench: long-video question answering benchmark. robot 관점에서 획득한 100개 데이터 + web-sourced 929개 데이터

📜 Paper Chinese Academy of Science

2025.08 3주차

PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing

offline hierarchical indexing & online adaptivr retrieval → paper search를 위한 index tree

📜 Paper Amsterdam

2025.08 3주차

Can we Evaluate RAGs with Synthetic Data?

(1) 생성 모델은 고정하고 retriever를 varying (2) retriever를 고정하고 생성 모델을 varying
(1)에서는 일관성 있는 결과가 나오는 반면 (2)는 그렇지 않다고 설명

🧑🏻‍💻 Dev Google

2025.08 3주차

Introducing Gemma 3 270M: The compact model for hyper-efficient AI

170M embedding parameters인데 이는 large vocab size 때문이라고 함 (256k tokens)
INT4 precision으로 사용 가능한 Quantization-Aware Trained (QAT) 버전도 공개

🧑🏻‍💻 Dev Alibaba

2025.08 3주차

Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency

영어와 중국어에 대해 정확한 text editing 가능
Seedream, GPT Image, FLUX 등의 모델을 능가한 SoTA 달성

📜 Paper Univ. of Tubingen

2025.08 3주차

MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

이를 해결하기 위해 learning effective denoising trajectories 문제를 a sequential decision-making problem으로 정의
Masked Diffusion Policy Optimization (MDPO): diffusion process의 Markov property 이용하여 모델이 추론 시 겪는 progress를 학습 당시에도 볼 수 있도록 함

📜 Paper OPPO

2025.08 3주차

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

agentic supervised fine-tuning을 위한 multi-agent distillation framework 제안 → reinforcement learning on verifiable agentic tasks
학습을 통해 획득한 결과 모델을 Agent Foundation Models (AFMs)라고 부름

📜 Paper Shanghia Jiao Tong Univ.

2025.08 3주차

Transplant Then Regenerate: A New Paradigm for Text Data Augmentation

LLM에 embedded knowledge를 이용하여 기존 text의 attribute를 지닌 채로 diverse & creative content-level variants 생성 가능

🧑🏻‍💻 Dev DeepSeek

2025.08 3주차

DeepSeek-V3.1 Release

📜 Paper ByteDance, Nanjing

2025.08 3주차

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

RLVR이 지나치게 많은 비용을 필요로 한다는 한계 & 전통적인 dual learning이 학습 당시에 본 task만 처리할 수 있다는 한계를 극복
primal task’s input을 known & unknown components로 쪼개고, primal output & known information을 이용하여 unknown part를 reconstruct

📜 Paper Wuhan, Nanjing

2025.08 3주차

From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

LLM 평가를 knowledge-skill level로 진행하여 LLM이 어떤 financial skills & knowledge를 갖고 있는지 확인할 수 있음 (단순한 숫자로 반환하는 것 x)
CPA-QKA: the first cognitively informed financial evaluation dataset. Certified Public Accountant (CPA) 검사로부터 derive

📜 Paper Meta

2025.08 3주차

Deep Think with Confidence

Deep Think with Confidence (DeepConf): model-internal confidence signals를 이용하여 low-quality reasoning traces를 dynamically filter out
추가적인 학습 or hyper-parameter tuning 필요 없이 기존 serving frameworks에 integrate 가능

📜 Paper Shanghai AI Lab

2025.08 3주차

Intern-S1: A Scientific Multimodal Foundation Model

Intern-S1: a specialized generalist equipped with general understanding and reasoning capabilities
28B activated, 241B total parameters, MoE 모델
5T 토큰 데이터로 사전학습. 그중에 2.5T 토큰이 과학 분야 데이터

📜 Paper Tsinghua

2025.08 3주차

ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

distributed RL infrastrcuture를 구성하여 수천개의 가상 desktop 환경을 병렬적으로 orchestrate 함으로써 대규모 RL 수행
Entropulse: SFT와 RL을 번갈아가며 학습함으로써 entropy collapse 현상을 완화

📜 Paper Shanghai AI Lab

2025.08 3주차

Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

📜 Paper OPPO AI

2025.08 2주차

Efficient Agents: Building Effective Agents While Reducing Cost

test-time scaling (예를 들어 best-of-N) 방식은 성능 향상 대비 비용 상승률이 너무 높다는 한계를 분석
같은 관점에서 web browsing은 최소화되어야 한다고 주장

📜 Paper Rutgers Univ.

2025.08 2주차

ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Retrieval-augmented Graphic Agentic Network: 그래프의 각 노드를 autonomous & individual decision making 가능하도록 설정
각 노드가 곧 agent로 Memory, Planning, Action, Tool Use 가능

🧑🏻‍💻 Dev Cursor

2025.08 2주차

Cursor CLI

다른 서비스들과 크게 다른 점은 없어 보임

🧑🏻‍💻 Dev Google

2025.08 2주차

LangExtract

시각화 기능도 잘 지원되고 Ollma를 이용하면 로컬 모델로도 돌릴 수 있음

🧑🏻‍💻 Dev HuggingFace

2025.08 2주차

Introducing AI Sheets: a tool to work with datasets using open AI models!

LLM을 이용하여 합성 데이터 등을 생성 후 최종 데이터셋을 csv 형태로 반환할 수 있음

📜 Paper Zhipu AI, Tsinghua

2025.08 2주차

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

thinking & direct response 동시 지원하는 hybrid reasoning method
23T 토큰에 대해 학습

📜 Paper Meta

2025.08 2주차

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

frozen pretrained model을 사용하여 audio, video, dialogue로부터 feature 추출

📜 Paper ByteDance

2025.08 2주차

WideSearch: Benchmarking Agentic Broad Info-Seeking

large-scale atomic information을 필요로 하는 질문들이며 각 내용이 객관적으로 증명되어야 하는 까다로운 문제들임
대규모 & 반복적인 정보 검색을 잘하는 LLM-based agent를 만드는 것이 목표

📜 Paper Gaoling School, Baidu, CMU

2025.08 2주차

ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability

automated reasoning-intesnvie training data synthesis framework 제안. self-consistency data filtering mechanism이 적용되어 데이터 퀄리티를 보장
cold-start SFT → RL for ruther ranking ability enhancement
강화학습 단계에서 listwise ranking을 위해 multi-view ranking reward를 설계했는데, 이는 기존의 ranking metric-based reward보다 효과적이라고 설명함

📜 Paper Apple

2025.08 2주차

Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential

common prefix로부터 multi token precition, 이를 이용하여 coherent sequence를 생성하는 모듈
gated LoRA formulation: 기존 모델의 functionality 유지

📜 Paper Ai2, Washington

2025.08 2주차

MolmoAct: Action Reasoning Models that can Reason in Space

MolmoAct 모델은 observations & instructions를 depth-aware perception tokens로 encode → mid-level spatial plans 생성 → precise low-level actions 예측 (7B 사이즈)
MolmoAct Datset: mid-training robot dataset 공개. 10,000개의 고품질 robot trajectories

📜 Paper Hebrew

2025.08 2주차

Story2Board: A Training-Free Approach for Expressive Storyboard Generation

기존에는 subject identity에만 집중한 것을 한계로 지적하고, spatial composition, background evolution, narrative pacing 등에 집중했다고 설명

🧑🏻‍💻 Dev NVIDIA

2025.08 2주차

NVIDIA Releases 3 Million Sample Dataset for OCR, Visual Question Answering, and Captioning Tasks

OCR, VQA, captioning 등에 집중된 데이터셋이며, 최근 Llama 3.1 Nemotron Nano VL 8B V1 을 학습하는데 사용됨

📜 Paper Alibaba

2025.08 2주차

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

efficient cold start를 위해 high-quality synthetic multimodal tranjectories 사용
BrowseComp-VL: visual & textual information을 동시에 잘 가져와야 하는 복잡한 벤치마크

📜 Paper WeChat, Tsinghua

2025.08 2주차

We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

🧑🏻‍💻 Dev OpenAI

2025.08 1주차

Introducing study mode

티어에 상관 없이 모든 유저들이 이용할 수 있는 기능으로 제공

🧑🏻‍💻 Dev Microsoft

2025.08 1주차

Microsoft Edge Your AI-powered browser

📜 Paper Tecent

2025.08 1주차

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

기존 video/3D 기반 방식의 단점 보완 → panoramic image 기반 360° world proxy 활용
세 가지 특징. 1) 360° immersive experiences 2) mesh export capabilities 3) disentangled object representations

📜 Paper Leiden Univ.

2025.08 1주차

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

sparse autoencoder를 activation patching과 결합하여 CoT 결과로부터 monosemantic features 추출
CoT가 확실히 더 높은 activation sparsity, feature interpretability score를 달성

📜 Paper CUHK

2025.08 1주차

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

grounding, planning, generation, 세 단계로 구성되어 있음
vision language model을 사용하여 UI components를 탐지 및 라벨링
front-end priors 기반의 hierarchical layout 구성

🧑🏻‍💻 Dev Alibaba

2025.08 1주차

Qwen3 Coder Flash

128 experts, 8 activated per inference, with 3.8B active parameters
256K native context window, expandabel to 1M tokens using YaRN
최근 공개한 Qwen3 Coder 모델의 경량화 버전으로 이해할 수 있음

🧑🏻‍💻 Dev Google

2025.08 1주차

Gemini 2.5 Deep Think

복잡한 문제를 작은 단위로 쪼개는 interative development and design
algorithmic development and code, scientific and mathematical discovery 등에 특화되어 있다고 설명

📜 Paper Microsoft

2025.08 1주차

Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Phi-Ground mode family: 10B 이하의 agent 중에서 SoTA를 달성한 모델 공개

📜 Paper ByteDance

2025.08 1주차

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

deep & broad reasoning을 가능토록 하는 3개의 test-time inference strategies
geometry reasoning engine Seed-Geometry 도입
IMO 2025의 6개 문제 중 5개를 완벽하게 prove

🧑🏻‍💻 Dev Kaggle

2025.08 1주차

Introducing Kaggle Game Arena

o3, Gemini 2.5 Pro, Claude Opus 4, Grok 4 와 같은 frontier 모델들이 동작할 수 있는 game environments, harnesses, visualizers 등을 제공

🧑🏻‍💻 Dev Anthropic

2025.08 1주차

Persona vectors: Monitoring and controlling character traits in language models

이를 파악함으로써 모델의 undesirable 특성들을 억제할수도 있고, 학습 데이터를 조정할수도 있음
Qwen 2.5-7B-Instruct, Llama-3.1-8B-Instruct 두 open-source 모델로 평가

🧑🏻‍💻 Dev OpenAI

2025.08 1주차

Open models by OpenAI

Apache 2.0 라이센스. Safety에 대해서도 각별히 신경을 썼다고 함
Designed for agentic tasks, Deeply customizable, Full chain-of-thought 등의 특징

📜 Paper CUHK, Shanghai AI

2025.08 1주차

Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

모델이 내부적으로(internal) 주어진 문제에 대한 적절한 답변 길이와 관련된 signals를 포함하고 있다고 설명
이러한 latent signals를 이용한 DAEDAL 제안: Dynamic Adaptive length Expansion for Diffusion lArge Language models (알파벳 조합 너무 억지..)

📜 Paper Alibaba

2025.08 1주차

Qwen-Image Technical Report

non-text-to-rendering으로 시작해 점점 더 복잡한 텍스트 입력을 받는 curriculum learning approach 적용
text-to-image (T2I), text-image-to-image (TI2I), image-to-image (I2I) reconstruction을 위해 dual encoding 방식 사용 (Qwen2.5-VL & VAE)

🧑🏻‍💻 Dev Google DeepMind

2025.08 1주차

Genie 3: A new frontier for world models

초당 24프레임, 720p 해상도의 few-minute consistency (Genie 2는 10-20s, Veo는 8s 수준)
데모 영상 수준 퀄리티 아주 뛰어난 편
promptable world events: 다양한 종류의 text-based interaction 가능

🧑🏻‍💻 Dev OpenAI

2025.08 1주차

GPT-5 is here

coding 능력이 크게 향상되어 타 frontier 모델들 수준으로 올라왔다고 보고 (실사용 후기에 따르면 그정도는 아닌 듯함)
o3-pro처럼 더 오래 생각하는 test-time scaling 방식이 적용된 GPT-5 pro 모델

📜 Paper ByteDance, Tsinghua

2025.08 1주차

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

non-sequential, parallel generation 덕분에 엄청나게 빠른 추론 속도: 2,146 tokens/s over H20 GPU
코드 벤치마크에서 속도-성능의 파레토 라인을 push

🧑🏻‍💻 Dev Google

2025.08 1주차

Guided Learning in Gemini: From answers to understanding

특정 주제에 대해 deep dive 할 수 있도록 probing & open-ended questions encourage

📜 Paper VeriGUI Team

2025.08 1주차

VeriGUI: Verifiable Long-Chain GUI Dataset

realistic computer environments 대응을 위한 학습 및 평가 데이터셋
(1) long-chain complexity (2) subtask-level verifiability 강조

📜 Paper Arizona State Univ.

2025.08 1주차

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

2025년 7월 67건

📜 Paper Anthropic, UC Berkeley

2025.07 5주차

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

특성 T를 갖는 teacher 모델이 일련의 숫자로만 구성된 데이터셋을 생성하고 이를 학습한 student 모델이 특성 T를 배울 수 있다는 것
teacher 모델이 생성하는 코드나 reasoning path로 학습하더라도 동일 현상을 관측할 수 있다고 설명

🧑🏻‍💻 Dev Anthropic

2025.07 5주차

Building and evaluating alignment auditing agents

hidden goal을 찾아내고 misaligned behavior 등을 탐지하는 등 impressive results를 보여줌
prefill attacks, context-manipulated jailbreaks, interpretability-driven safety failures 등에 취약하다는 결론

🧑🏻‍💻 Dev Runway

2025.07 5주차

Introducing Runway Aleph | A new way to edit, transform and generate video.

비디오를 from scratch 생성하지 않고 text prompt를 통해 필요한 영역들을 수정
예를 들어 camera angles 수정, remove objects, effects like rain or fireworks 등 가능

🧑🏻‍💻 Dev Z.ai

2025.07 5주차

GLM-4.5: Reasoning, Coding, and Agentic Abililties

coding benchmark에서 Claude 4 Sonnet, GPT-4.1 급의 성능
GLM-4.5: 355B total & 32B active parameters / GLM-4.5 Air: 106B total & 12B active parameters
둘 다 hybrid reasoning model로 복잡한 추론이나 tool using, non-thinking 등을 지원

📜 Paper Waterloo

2025.07 5주차

Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models

OLMo, OLMo 2 모델을 대상으로 한 실험에서 DPO의 영향도가 가장 크다는 결론
이를 바탕으로 conformative decoding 제안: instruct model이 base model의 다양성을 reintroduce 할 수 있도록 guide 하는 decoding strategy

📜 Paper Renmin

2025.07 5주차

Agentic Reinforced Policy Optimization

Agentic Reinforced Policy Optimization (ARPO)
외부 툴 사용 직후 생성되는 토큰의 entropy 분포가 향상된다는 점을 포착
entropy-based adaptive rollout mechanism

📜 Paper Univ. of Alberta

2025.07 5주차

Curiosity by Design: An LLM-based Coding Assistant Asking Clarification Questions

📜 Paper CMU

2025.07 4주차

Agentic-R1: Distilled Dual-Strategy Reasoning

또한 tool-augmented agents는 code execution으로 문제를 해결해왔으나 여전히 복잡한 logical 문제들을 풀지는 못함
DualDistill: 여러 teachers로부터의 complementary reasoning strategies를 unified student model에 distill하는 framework
Agentic-R1: 각 쿼리마다 최적의 전략을 dynamically 선택하도록 학습한 모델. tool을 사용하거나 텍스트 기반의 추론을 하거나.

🧑🏻‍💻 Dev ARC

2025.07 4주차

ARC-AGI-3

기존에도 ARC 벤치마크 퍼즐을 맞추는 태스크로 유명 (인간과 유사한 사고가 가능한지)
o3, Grok 4와 같은 frontier models도 현재까지 0점 기록
RTX 5090 또는 $1K API 로 추론. 8시간 제한

🧑🏻‍💻 Dev Google

2025.07 4주차

Gemini Embedding now generally available in the Gemini API

science, legal, finance, coding 등 다양한 도메인에 대해 뛰어난 성능을 보인다고 설명
100개 이상의 언어에 대해 2048 input token length 지원. Matryoshka Representation Learning (MRL) 테크닉 사용시 3072, 1536, 768 차원 추천

📜 Paper Anthropic

2025.07 4주차

Inverse Scaling in Test-Time Compute

모든 flagship 모델들이 복잡한 deductive tasks에서 약점을 보임
extended reasoning은 self-preservation 표현을 증가시킴
Simple Counting tasks with Distractors, Regression Tasks with Spurious Features, Deduction Tasks with Constraint Tracking

📜 Paper Zhejiang

2025.07 4주차

GUI-G^2: Gaussian Reward Modeling for GUI Grounding

GUI-G^2: GUI 요소를 interface plance 위의 continuous Gaussian Distribution으로 modeling
Guassian point rewards: precise localization을 모델링
Coverage rewards: predicted Gaussian distirbutions & target regions 간의 overlap 측정

📜 Paper MiroMind AI

2025.07 4주차

MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

719K개의 math-reasoning 데이터셋 SFT + 62K개의 challenging & verifiable 문제에 대해 RLVR
Context-Aware Multi-Stage Policy Optimization (CAMPO): length-progressive training + adaptive repetition penalty

🧑🏻‍💻 Dev Alibaba

2025.07 4주차

Qwen3-235B-A22B-Instruct-2507

Qwen Chat default 모델로 탑재. Kimi K2 모델을 능가하는 성능으로 보고

📜 Paper CMU

2025.07 4주차

Diffusion Beats Autoregressive in Data-Constrained Settings

repeated data에 대해 더 낮은 validation loss를 보이고 downstream performance도 뛰어남
저자는 이러한 현상을 implicit data augmentation으로 해석 (고정된 left-to-right factorization을 따르는 AR 방식과의 차이점)

🧑🏻‍💻 Dev Alibaba

2025.07 4주차

Qwen3-Coder: Agentic Coding in the World

Qwen2.5-Coder를 사용하여 7.5T 토큰으로 학습된 480B-35B(active) MoE model, Qwen3-Coder
256K default, 최대 1M 토큰 지원

📜 Paper Shanhai AI

2025.07 4주차

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

DIJA: adversarial interleaved mask-text prompts 생성 → dLLM 특성을 이용한 생성 방식으로, 타 jail-breaking을 압도하는 결과였다고 보고

📜 Paper Sapient Intelligence

2025.07 4주차

Hierarchical Reasoning Model

2개의 interdependent recurrent modules
a high-level module responsible for slow, abstract planning
a low-level module handling rapid, detailed computations

🧑🏻‍💻 Dev GitHub

2025.07 4주차

GitHub Spark in public preview for Copilot Pro+ subscribers

자연어로 micro apps를 만들 수 있도록 지원하는 기능으로, Claude Sonnet 4로 동작

🧑🏻‍💻 Dev HuggingFace

2025.07 4주차

2025년 6월 49건

📜 Paper Huawei

2025.06 4주차

RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

knowledge & aligned application example 로 구성된 dual corpus construct → 추론 단계에서 retrieves both jointly
LLMs가 relevant information에 접근할 수 있을 뿐만 아니라 이를 structured & goal-oriented reasoning processes에 적용할 수 있게 됨

📜 Paper Stanford

2025.06 4주차

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

WORKBank: 844개 tasks, 104개 occupations에 대해 worker desires & expert assessments를 결합한 데이터 베이스
Human Agency Scale (HAS): AI-agent-supported work에서 desired human involvement를 정량화
4 AI deployment zones: Automation Green Light, Red Light, R&D Opportunity, Low Priority

🧑🏻‍💻 Dev IlElevenLabs

2025.06 4주차

Introducing 11ai: the voice-first AI assistant that takes action

MCP를 통해서는 Salesforce, HubSpot, Gmail, Zapier 등에 연결 가능
out-of-the-box integration으로 Perplexity, Linear, Slack, Notion 지원
Ultra-low latency, Multimodal support, Integrated RAG, Automatic language detection, Enterprise-ready 등의 특징

📜 Paper Sakana AI

2025.06 4주차

Reinforcement Learning Teachers of Test Time Scaling

현재 LLM의 강화학습은 one-hot correctness를 기반으로 이뤄지므로 initialization에 대한 의존성이 너무 높고, 학습이 잘된 RL 모델도 결국 distillation에서 cold start 문제를 해결하기 위한 teacher model로 쓰이는 현황을 지적
Reinforcement-Learned Teachers (RLT): 각 문제에 대한 question & solution을 입력으로 받음 → 둘 사이를 ‘connects-the-dots’ 하여 학생들에게 자세한 설명을 제공하는 태스크 수행
이를 학생들에게 제공하여 solution에 대한 이해도를 확인하고, 이를 바탕으로 dense rewards를 획득

📜 Paper Cornell

2025.06 4주차

Memento: Note-Taking for Your Future Self

Memento (prompt strategy): 1) complex question을 smaller steps로 나눈다 2) LLM을 이용하여 database를 dynamically construct 3) 문제를 풀기 위해 작은 문제들을 다시 합친다

📜 Paper Oxford, Amazon, Cambridge

2025.06 4주차

Distilling Tool Knowledge into Language Models via Back-Translated Traces

Tool-integrated reasoning (TIF)은 inference-time dependencies로 인해 확장 가능성이 낮음
natural language를 통해 tool knowledge를 LLM에 distill 하는 패러다임 제안
Solver Agent: interleaving planning, symbolic tool calls, reflective reasoning을 통해 수학 문제 풀이

🧑🏻‍💻 Dev Google DeepMind

2025.06 4주차

AlphaGenome: AI for better understanding the genome

single variants or mutation in human DNA sequences가 유전자를 조정하는 생물학적 과정에 어떻게 영향을 주는지 예측하는 모델

🧑🏻‍💻 Dev Anthropic

2025.06 4주차

Agentic Misalignment: How LLMs could be insider threats

모델이 테스트 시나리오라는 것을 인지했을 땐 misbehavior를 보일 확률이 급격하게 낮아짐
실험 결과를 보면 blackmail rates에서 가장 높은 수치를 보이는 것은 Claude Opus 4 → 엄청나게 솔직한 연구 결과

🗞️ News Meta

2025.06 4주차

Introducing Oakley Meta Glasses, a New Category of Performance AI Glasses

풀충전 기준 일반적인 사용으로 8시간, stanby 기준 19시간 지속되는 배터리
Ultra HD (3K) video를 담을 수 있는 high resolution camera
built-in, personal AI assistant. 스포츠 활용성 높음

📜 Paper Ohio, Amazon

2025.06 4주차

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

이를 평가하기 위한 Agent-as-a-Judge 프레임워크 제안
tree-structured rubric 기반의 task-specific judge agents를 construct 하여 answer correctness & source attribution 평가

📜 Paper Ai2

2025.06 4주차

OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

(1) Exploratory: known problem-solving skills를 같은 도메인이지만 더 어려운 문제에 적용
(2) Compositional: 독립된 상황에서 습득한 distinct reasoning skills를 new & coherent way로 결합/통합
(3) Transformative: 익숙한 approaches를 새로운 영역에 unconventionally 적용

📜 Paper Skoltech

2025.06 4주차

Complexity-aware fine-tuning

easy & medium은 fine-tuning, hard는 distill 한 결과가 단순 SFT 결과보다 좋았다고 설명

📜 Paper Ai2

2025.06 4주차

Language Modeling by Language Models

multi-agent LLM을 이용해서 proposal stage - code generation - verification에 이르는 research를 simulate
Ladder of Sacles 접근법을 사용하는 Genesys 시스템을 제안: 제안 → 리뷰 → 검증 → large scale

🧑🏻‍💻 Dev Anthropic

2025.06 4주차

Desktop Extensions: One-click MCP server installation for Claude Desktop

기존 MCP 설치는 ‘개발자 도구 필요, Manual configuration, Dependency 관리, 업데이트 복잡성’ 등의 문제를 지님
.dxt file download → Claude Desktop open → Click “Install”

📜 Paper Google

2025.06 4주차

Performance Prediction for Large Systems via Text-to-Text Regression

단 500개의 few-shot examples 만으로 새로운 태스크에 adapt 가능
encoder 사용, sequence 길이 증가, 모델의 inherent uncertainty quantification 중요성 강조

🧑🏻‍💻 Dev OpenAI

2025.06 3주차

Launching OpenAI o3-pro

personalized answer를 위한 memory 기능 지원
o3, o1-pro 모델을 math, coding, science 벤치마크에서 outperform. pass@1 벤치마크가 인상적임

📜 Paper Huawei

2025.06 3주차

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

SWE-Factory
SWE-Builder: evaluation environment construction을 자동화해주는 multi-agent system
exit-code-based grading method: custom parsers를 직접 작성할 필요가 없음

📜 Paper Rice, Johns Hopkins, NVIDIA

2025.06 3주차

Play to Generalize: Learning to Reason Through Game Play

Snake 같은 게임을 학습한 7B 사이즈 모델이, RL 동안에 어떤 solutions, equations, diagrams를 보지 못했음에도 불구하고 MMMU에서 성능 향상을 보임: transferable reasoning skills
따라서 synthetic, rule-based game을 controllable & scalable pre-text tasks로 사용할 수 있다고 설명 for generalizable multimodal reasoning abilities in MLLMs

📜 Paper Sakana AI

2025.06 3주차

Text-to-LoRA: Instant Transformer Adaption

Text-to-LoRA (T2L): many LoRA adapters를 합축한 모델로 unseen tasks에 대해 generalizes

📜 Paper Meta

2025.06 3주차

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

2-stage training
action-free pretraining on 1M+ hours of internet videos and images
post-training with only 62 hours of unlabeld robot trajectories (Droid dataset)

📜 Paper Microsoft, UCLA

2025.06 3주차

Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks

preceding CoT reasoning에서 key tokens를 identify & emphasize → reasoning & reference outcome 사이의 consistency를 fine-grained level에서 capture
R3는 optimized 중인 model의 내부 연산 결과를 활용하므로 self-contained training setup 가능

📜 Paper Google DeepMind

2025.06 3주차

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

coding & reasoning benchmarks에서 SoTA 달성
Gemini 2.5 Pro 모델은 3시간 길이의 비디오를 이해할 수 있을 정도로 뛰어난 multimodal understanding 능력을 보임

📜 Paper MIT

2025.06 3주차

Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

LLM으로 태스크를 수행한 그룹은 타 그룹 대비 less coordinated neural effort가 관측되었다고 보고
또한 작성된 에세이의 퀄리티는 AI judge & human teachers로부터 비슷한 평가를 받았으나, NER/n-gram 관점에서는 타그룹 대비 낮은 성적을 기록

📜 Paper Yale, Columbia, …

2025.06 3주차

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

domain-specific tasks에 대해 linguistic settings (monollingual, bilingual, multilingual)
PolyFiQA-Easy & PolyFiQA-Expert: mixed-language inputs에 대해 복잡한 reasoning이 필요한 벤치마크 공개
또한 기존의 simple aggregation existing datasets 대신, dynamic difficulty-aware slection mechanism 제안

🧑🏻‍💻 Dev Anthropic

2025.06 3주차

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

각 태스크는 main task & harmful side task 로 구성
이중 모니터링 시스템, 은밀성 평가(단순 성공 여부 x, 들키지 않고 성공 o), 복잡성과 현실성 고려

📜 Paper MiniMax

2025.06 3주차

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

1M context length 지원, 연산 효율성 강조
CISPO: token update 대신 importance sampling weights를 clip 하는 novel RL algorithm
512 H800 GPUs로 3주 동안 학습하여 $534,700 비용이 들었다고 강조함

📜 Paper ByteDance

2025.06 3주차

Seedance 1.0: Exploring the Boundaries of Video Generation Models

📜 Paper Apple

2025.06 2주차

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

다양한 puzzle environments를 통해 모델의 internal reasoning traces를 확인하여 LRMs이 “think” 하는 방식에 대한 insight 획득
reasoning effort가 특정 문제 난이도까지 상승하다가 이후에는 감소하여 scaling에서의 한계를 보임을 지적
낮은 난이도의 문제들에 대해서는 일반적인 LLM들이 훨씬 뛰어난 퍼포먼스를 보여줌 & 어려운 난이도에 대해서는 일반적인 LLM이나 LRM이나 둘 다 collpase

📜 Paper Stanford, NYU

2025.06 2주차

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

expressive fidelity & representational simplicity 간의 trade-off가 있는데, 모델은 human understanding에서 중요한 fine-grained semantic distinctions을 놓침
또한 LLM은 aggressive statistical compression에 대해 bias를 보임

📜 Paper UC Santa Cruz, Stanford

2025.06 2주차

Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

fine-grained evaluation framework 제안
(1) 사용된 knowledge의 정확성 (Knowledge Index (KI))
(2) the quality of reasoning (Information Gain (IG))

📜 Paper Stanford

2025.06 2주차

OpenThoughts: Data Recipes for Reasoning Models

OpenThoughts2-1M 데이터셋으로 OpenThinker2-32B 모델 학습. DeepSeek-R1-Distill-32B에 준하는 성능
추가로 데이터셋을 정제하여 OpenThoughts3 제작

📜 Paper CMU

2025.06 2주차

Coding Agents with Multimodal Browsing are Generalist Problem Solvers

기존 모델들은 특정 도메인이나 태스크에 specialized 되어 있어 일반화가 되지 않음을 지적
OpenHands-Versa: a generalist agent built with a modest number of general tools

📜 Paper Microsoft, Peking, Tsinghua

2025.06 2주차

Reinforcement Pre-Training

주어진 문맥에서 다음 토큰을 정확히 예측하면 verifiable rewards를 받는 방식
general-purpose RL을 위한 방대한 양의 텍스트 데이터를 이용할 수 있는 scalabe method라고 소개
further reinforcement fine-tning을 위한 strong pre-trained foundation

📜 Paper ByteDance

2025.06 2주차

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

reading order에 맞는 sequence of layout elements를 생성하고 이를 anchors로 사용
anchors는 task-specific prompts와 짝지어지고, 다음 단계에서 parallel content parsing에 사용됨
multi-granularity parsing tasks를 다루는 30M개 이상의 dataset

📜 Paper Cambridge

2025.06 2주차

Truly Self-Improving Agents Require Intrinsic Metacognitive Learning

인간의 metacognition에 착안하여 세 개의 components로 구성된 프레임워크 제안
metacognitive knowledge, metacognitive planning, metacognitive evaluation
기존 agents들이 학습하는 것은 extrinsic metacognitive mechanisms을 따른다고 설명

📜 Paper Claude Opus

2025.06 2주차

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

📜 Paper Yale

2025.06 1주차

Table-R1: Inference-Time Scaling for Table Reasoning

frontier model의 reasoning steps로부터 distillation
reinforcement learning with verifiable rewards (RLVR)
Distillation을 위해 DeepSeek-R1 모델로 reasoning traces 생성

📜 Paper Cohere

2025.06 1주차

Command A: An Enterprise-Ready Large Language Model

agent-optimized & multilingual-capable model (23개 언어 지원), hybrid architecture
self-refinement & model merging techniques 적용

📜 Paper Sakana AI

2025.06 1주차

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

여러 frozen foundation models가 tool use를 통해 코드를 읽고, 쓰고, 실행하는 coding agents optimize를 목표

📜 Paper UC Berkeley, Yale

2025.06 1주차

Learning to Reason without External Rewards

→ Reinforcement Learning from Internal Feedback (RLIF): 외부 rewards or labeled data 없이 intrinsic signals로부터 학습
Intuitor: 모델 스스로의 confidence, self-certainty를 유일한 reward signla로 사용. 기존 GRPO 자리를 대체

🧑🏻‍💻 Dev AgenticSeek: Private, Local Manus Alternative.

2025.06 1주차

AgenticSeek: Private, Local Manus Alternative.

web search, write codes, plan tasks, select agents, voice-enhanced 등 다양한 features

📜 Paper UIUC, UC Berkeley

2025.06 1주차

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

scaled thinking phase를 $\alpha$ moment 라고 표현. $\alpha$ moment가 slow thinking 하는 시점임

🧑🏻‍💻 Dev ElevenLabs

2025.06 1주차

Introducing ElevenLabs Conversational AI 2.0

enterprise 사용에 더욱 적합: private files or prorietary data sources에 RAG 연결 가능

📜 Paper Kakao

2025.06 1주차

A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

e-commerce domain을 위한 conversational agent에 관한 case study
카나나를 기반으로 더 넓은 분야로 대화형 agent를 확장하고자 하는 것일까하는 생각

📜 Paper Alibaba

2025.06 1주차

QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

QwenLong-L1: short-context LRMs를 long-context scenarios에 adapt 할 수 있도록 progressive context scaling을 적용하는 프레임워크
warm-up SFT stage → curriculum-guided phased RL
QwenLong-L1-32B 모델이 OpenAI-o3-mini, Qwen3-235B-A22B 등을 outperform

📜 Paper Renmin Univ.

2025.06 1주차

Do not Abstain! Identify and Solve the Uncertainty

ConfuseBench: 세 종류의 uncertainty를 다룸 - document scarcity, limited capability, query ambiguity
original query의 confusing aspect를 highlight 하는 context-aware inquiries 생성하고, 이를 기반으로 source of uncertainty를 판단하는 방법론 제안

📜 Paper HuggingFace

2025.06 1주차

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

SmolVLA: small, efficient, community-driven VLA. training & inference 비용 저렴

📜 Paper Meta, DeepMind, Cornell, NVIDIA

2025.06 1주차

How much do language models memorize?

memorization을 unintended memorization & generalization 두 가지로 구분
generalization을 제거하여 모델의 total memorization을 계산하고 model capacity를 추정할 수 있음
GPT family 모델들은 약 3.6 bits-per-parameter의 capacity를 가짐

📜 Paper Meta

2025.06 1주차

LlamaFirewall: An open source guardrail system for building secure AI agents

2025년 5월 65건

🧑🏻‍💻 Dev Anthropic

2025.05 5주차

Introducing Claude 4

long thought process에 대한 요약 제시
developer mode에서는 unsummarized reasoning 확인 가능
VS Code나 JetBrains에서 사용 가능한 새로운 extension 출시

🧑🏻‍💻 Dev ByteDance

2025.05 5주차

BAGEL: The Open-Source Unified Multimodal Model

multiple expert networks & two image encoders 사용
7B 사이즈의 모델로, 4 x 16GB GPU에서 run 또는 LoRA 기반 학습 가능

📜 Paper Tokyo

2025.05 5주차

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

각 언어당 658개의 질문들을 포함하는 lite version 제공

📜 Paper Cambridge, UCL, Google

2025.05 5주차

Visual Planning: Let's Think Only with Images

Visual Planning: text 없이 순수하게 visual representation으로 reasoning
step-by-step inference를 encode 하는 sequences of images 를 통해 executed
Visual Planning via Reinforcement Learning (VPRL): large vision models를 GRPO로 post-training 하는 RL 프레임워크

🧑🏻‍💻 Dev Mistral AI

2025.05 5주차

Build AI agents with the Mistral Agents API

MCP tools integration, Agent Orchestration
사용성이 좋고 개발 용이성이 뛰어난 형태의 API가 많이 공개되는 추세

🧑🏻‍💻 Dev Mistral AI

2025.05 5주차

Codestral Embed

binary, int8, float32 자료형 지원

🧑🏻‍💻 Dev Resemble AI

2025.05 5주차

chatterbox

emotion exaggeration control 지원, watermarked outputs
[Hugging Face Gradio app](https://huggingface.co/spaces/ResembleAI/Chatterbox) 에서 테스트 가능
0.5B Llama backbone, 0.5M hours of cleaned data로 학습

📜 Paper Shanghai AI Lab, Tsinghua, UIUC

2025.05 5주차

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

policy entropy가 초기 학습 단계에서 급격히 감소하여 policy model이 overly confident 하게 되는 현상을 뜻함 (성능 포화)
이로 인해 exploratory ability가 diminish 하게 됨
$R = -a \cdot \exp(H) + b$

📜 Paper Utah, Washington

2025.05 5주차

What Has Been Lost with Synthetic Evaluation?

CondaQA: negation reasoning에 대한 평가
DROP: quantities reasoning 평가

📜 Paper Google

2025.05 5주차

Sufficient Context: A New Lens on Retrieval Augmented Generation Systems

성능이 뛰어난 모델들은 context가 충분할 때 답변을 잘하지만 그렇지 않을 때에 답변을 abstain 하지 않고 틀린 답변을 반환하는 경우가 있음
그러나 성능이 낮은 모델들은 context가 충분할 때조차 hallucination 또는 incorrect answers 반환하는 경우 있음
RAG 시스템을 위해 새로운 selective generation method를 제안하여 충분한 context information을 더 잘 활용할 수 있도록 함

📜 Paper Apple

2025.05 5주차

Interleaved Reasoning for Large Language Models via Reinforcement Learning

📜 Paper Chinese Academy of sciences

2025.05 4주차

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

간단한 생략 기호 “…”를 프롬프트에 포함하는 것만으로도 꽤나 긍정적인 영향을 줄 수 있다고 언급
AutoThink: stage-wise reward shaping을 통해 reasoning policies를 optimize하는 multi-stage reinforcement learning (RL) 프레임워크

📜 Paper Singapore, Tsinghua, Salesforce

2025.05 4주차

Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

이를 해결하기 위해 prompts & 우연한 ‘aha moments’를 넘어서, 모델이 세 가지 meta-abilities에 align 되도록 학습 - deduction, induction, abduction
three-stage pipeline: individual alignment, parameter-space merging, domain-specific reinforcement learning

📜 Paper KAIST

2025.05 4주차

System Prompt Optimization with Meta-Learning

meta-learning framework: system prompt 뿐만 아니라 user prompts도 업데이트

🧑🏻‍💻 Dev HuggingFace

2025.05 4주차

Welcome to the 🤗 Model Context Protocol (MCP) Course

🧑🏻‍💻 Dev Alibaba

2025.05 4주차

Qwen3 Technical Report

thinking mode & non-thinking mode 통합. 유저 쿼리나 chat template에 따른 dynamic mode swithcing
thinking budget mechanism을 도입하여 유저가 추론 시 computational resources를 adaptive하게 할당함으로써 태스크 복잡도에 따른 모델 퍼포먼스와 latency 간 균형을 맞출 수 있다고 설명
기존 29개 → 119개 언어 및 방언 지원, Apache 2.0 라이센스

📜 Paper Tsinghua

2025.05 4주차

AdaptThink: Reasoning Models Can Learn When to Think

AdaptThink: 문제 난이도에 따라 최적의 thinking mode를 reasoning model이 선택하도록 가르치는 RL 알고리즘
constrained optimization objective: overall performance를 유지하면서도 NoThinking을 선택하도록 함
sampling strategy: on-policy training 동안에 Thinking & No-Thinking samples의 균형을 맞춤

📜 Paper NUS

2025.05 4주차

Thinkless: LLM Learns When to Think

RL 패러다임으로 학습되고 <short>, <think> 두 개의 control tokens를 사용
Decoupled Group Relative Policy Optimization (DeGROP) 알고리즘
두 개의 learning objective: control token loss & response loss

📜 Paper Southern California

2025.05 4주차

Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM

(1) long & verbose CoT outputs를 semantically coherent reasoning steps로 만들기
(2) 각 스텝 간의 contextual & logical dependencies 를 이용하여 directed reasoning graphs 구축하기
exploration density, branching, convergence ratios 등과 같은 structural propreties가 reasoning accuracy와 강한 상관관계를 갖고 있다고 설명함

🧑🏻‍💻 Dev Google

2025.05 4주차

Gemini 2.5: Our most intelligent models are getting even better

풀스택 개발 태스크에 대해 WebDev Arena에서 1415 ELO 스코어 달성
두 개의 목소리로 native audio generation 가능

🧑🏻‍💻 Dev Google

2025.05 4주차

Build with Jules, your asynchronous coding agent

각 codebase를 Google의 Cloud virtual machine (VM) 에 복사하여 프로젝트 전체를 이해한다고 설명
Works on real codebase, Parallel execution, Visible workflow, User steerability, Audio summaries 등을 특징으로 삼고 있음

📜 Paper ByteDance

2025.05 4주차

Emerging Properties in Unified Multimodal Pretraining

large-scale interleaved text, image, video, web data를 수 trillion tokens으로 학습한 unified & decoder-only model
free-form image manipulation, future frame prediction, 3D manipulation, word navigation 과 같은 advanced multimodal reasoning 능력을 보유

📜 Paper Jiaotong University

2025.05 4주차

Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

supervised fine-tuning & Kahneman-Tversky optimization 조합을 통해 structural priors를 LLM에 통합하는 progressive knowledge distillation strategy
reasoning introspection strategey: LLM이 추출된 constraint priors 기반의 refined reasoning verfication를 수행할 수 있도록 guide

🧑🏻‍💻 Dev Mistral

2025.05 4주차

Devstral

현실적인 프로그래밍 문제를 해결하기 위해, 즉 GitHub issuses를 풀기 위해 학습된 모델
RTX 4090 or Mac with 32GB RAM에서 구동 가능한 정도로 가벼움

🧑🏻‍💻 Dev Google DeepMind

2025.05 4주차

Gemini Diffusion

random noise를 coherent output으로 변경하여 text or code를 생성하는 모델
rapid response, more coherent text, iterative refinement 등을 특징으로 설명

🧑🏻‍💻 Dev Google DeepMind

2025.05 4주차

Gemma 3n

삼성 갤럭시 울트라에서 초당 446 토큰 처리
Mix ‘n’ match architecture는 small & large models를 switch 하는 데 도움을 줌
Chatbot Arena에서 1283점을 기록하며 Claude 3.7 Sonnet의 뒤를 이음

📜 Paper ServiceNow

2025.05 4주차

Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA

NotesWriting: 매 스텝마다 retrieved documents를 concise & relevant notes 로 변경하는 연구
LLM의 effective context length를 간접적으로 높여 더 큰 크기의 input text를 효율적으로 처리할 수 있음
다른 RAG 방법론들과 integrated 가능한 framework

📜 Paper Yonsei, CMU

2025.05 4주차

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

WebPRM Collection: 40K step-level perference pairs & annotated checklists
WebReward Bench: PRM 평가를 위한 meta-evaluation 벤치마크

🧑🏻‍💻 Dev HuggingFace

2025.05 4주차

nanoVLM: The simplest repository to train your VLM in pure PyTorch

단일 GPU에서 학습 가능

📜 Paper UIUC

2025.05 4주차

Language Specific Knowledge: Do Models Know Better in X than in English?

언어 모델도 그런 경향이 있다면 reasoning 능력을 더 끌어올릴 수 있지 않을까? 라는 접근
Language Specific Knowledge (LSK): ethnic cultures는 언어에 따라 발전하는 경향이 있고, 이에 따라 culture-specific datasets에 대해 실험해본 결과 가정이 옳았다고 설명함
LSKExtractor: language-specific knowledge의 존재를 확인할 수 있는 벤치마크 공개

📜 Paper Meta

2025.05 4주차

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

📜 Paper Microsoft, Salesforce

2025.05 3주차

LLMs Get Lost In Multi-Turn Conversation

top open- & closed-weight LLMs가 multi-turn에서 single-turn 대비 큰 성능 하락폭을 보여주었다고 보고
200,000+ simulated conversations는 aptitude의 사소한 문제 & unreliability의 증가, 두 가지로 구분 가능
결론: when LLMs take a wrong turn in a conversation, they get lost and do not recover

📜 Paper Texas A&M Univ.

2025.05 3주차

LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities

LiteLMGuard: quantized SLMs를 위한 real-time, prompt-level defense로 on-device prompt guard 라고 설명
모델의 아키텍쳐와 상관없이 적용 가능하다고 주장
여러 DL models를 Answerable-or-Not 데이터셋으로 학습한 결과 ELECTRA를 후보로 선정

🧑🏻‍💻 Dev Sakana AI

2025.05 3주차

Continuous Thought Machines

뉴런 수준의 timing information을 사용하여 기존보다 보다 복잡한 nueral behavior & decision making process를 이해할 수 있게 되었다고 함
핵심 중 하나는 모델이 step-by-step으로 “think” 할 수 있게 되어 추론 과정이 보다 interpretable & human-like 해졌다고 설명
[CTM publication](https://pub.sakana.ai/ctm/)

📜 Paper CWI

2025.05 3주차

How well do LLMs reason over tabular data, really?

언어 모델의 tabular queries에 대한 performance를 어떻게 evaluate 할 수 있는가?
multiple-choice prompt 평가 & BERT-score 대신 LLM-as-a-Judge 신뢰도가 높다고 설명

📜 Paper ByteDance

2025.05 3주차

Seed1.5-VL Technical Report

532M-parameter encoder, MoE LLM (20B active params)
GUI control & gameplay 등 agent-centric tasks에서 뛰어난 성능 보인다고 설명

📜 Paper Tsinghua

2025.05 3주차

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Absolute Zero: external data 의존하지 않고 single model 스스로 own learning progress를 maximize & improve
Absolute Zero Reasoner (AZR): code executor를 사용하여 training curriculum & reasoning ability를 self-evolve 하는 system

🧑🏻‍💻 Dev OpenAI

2025.05 3주차

Introducing HealthBench

각 case는 dialogue, prompt, model output, rubric이 JSON format으로 구성됨
research-use license로 Dataset & grader code 사용 가능

📜 Paper Salesforce

2025.05 3주차

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

→ training efficiency & improved generative quality
image understanding, 이어서 image generation에 대해 사전학습하는 학습 방식이 효과적이었다고 설명
GPT-4o를 이용하여 high-quality instruction tuning dataset BLIP3o-60k 데이터셋 제작

🧑🏻‍💻 Dev ByteDance

2025.05 3주차

DeerFlow

Coordinator, Planner, Reporter 등의 agent들로 구성되는 시스템
LangChain, LangGraph로 빌드되어 있어 Human-in-the-loop이 지원되며, 최근 핫한 Podcast generation도 가능 (생성된 reports 기준으로)

🧑🏻‍💻 Dev Google

2025.05 3주차

AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

AlphaTensor 모델에서 single function call을 넘어 entire codebase 까지 커버할 수 있도록 함
Gemini Flash로 빠르게 idea generation & Gemini Pro로 deeper analysis

🧑🏻‍💻 Dev LangChain

2025.05 3주차

open-agent-platform

📜 Paper Meta

2025.05 2주차

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

proprietary models로부터의 distillation 없는 training pipelines을 분석하고 large-scale synthetic data를 explore
2.8M human-labeled fine-grained video question-answer pairs & spatio-temporally grounded video captions
PLM-VideoBench: video에 대한 ‘what, where, when, how’ 추론 능력을 평가하기 위한 벤치마크 공개

📜 Paper NVIDIA

2025.05 2주차

Llama-Nemotron: Efficient Reasoning Models

Nano (8B), Super (49B), Ultra (253B) 사이즈로 구성되어 있으며, DeepSeek-R1에 준하는 성능이면서도 inference throughput & memory efficiency 뛰어남
dynamic reasoning toggle을 지원하는 최초의 open-source models
유저가 직접 standard chat vs. readoning modes 선택 가능

🧑🏻‍💻 Dev OpenAI

2025.05 2주차

Evolving OpenAI’s structure

이를 통해 더 큰 규모의 투자를 받아 AGI 개발에 전념하겠다고 함
이후 capable models를 오픈소스화할 예정

🧑🏻‍💻 Dev Alibaba

2025.05 2주차

Qwen-Agent

code execution, document reading, web browsing, RAG workflows 가능

📜 Paper Beijing Univ.

2025.05 2주차

RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation

RAG-MCP: 주어진 query와 관련성이 가장 높은 MCP(s)를 semantically retrieve
selected tool descriptions만을 모델에 전달함으로써 prompt size를 줄이고 decision-making을 간소화 함

📜 Paper Anthropic

2025.05 2주차

Reasoning Models Don't Always Say What They Think

프롬프트에 제시된 6가지 힌트를 활용해 CoT의 신뢰도를 평가
CoT를 이용한 test-time monitoring은 unexpected behaviors를 탐지하는데 전혀 쓸모가 없다고 주장

🧑🏻‍💻 Dev Mistral AI

2025.05 2주차

Medium is the new large.

private, high-context, domain-specific use cases에 해당하는 enterprise 활용도 가능
custom post-training & continuous pretraining 지원
finance, energy, healthcare 도메인에서 사용

🧑🏻‍💻 Dev Zed: The Fastest AI Code Editor

2025.05 2주차

Zed: The Fastest AI Code Editor

Privacy & Security 모드가 default. 원한다면 feedback 제공도 당연히 가능.
Claude, OpenAI, Google 등 API는 당연히 지원하고, 본인 computing power를 사용하는 ollama 기반의 모델들도 사용할 수 있음
ollama 사용 시에 미지원되는 기능은 [Edit Predictions](https://zed.dev/blog/edit-prediction) 뿐이라고 함

📜 Paper Barbin Institute

2025.05 2주차

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Multimodal reasoning은 modular, perception-driven pipelines에서부터 unified, language-centric frameworks로 발전하여 일관성 있는 cross-modal understanding 능력을 갖추게 됨
instruction tuning & reinforcement learning 을 통해 크게 발전했으나, 아직까지 omni-modal generalization, reasoning depth, agentic behavior 에서 한계 존재
발전 흐름에 따라, task-specific modules, Multimodal CoT (MCoT), native large multimodal reasoning models (N-LMRMs) 순으로 survey 결과 정리

📜 Paper Univ. of Chicago

2025.05 2주차

Mitigating Memorization In Language Models

언어 모델의 memorization 현상을 mitigate 하기 위한 방법론들 제시
3 regularizer-based, 3 finetuning-based, 11 machine unlearning-based
regularizer-based는 느리고 효과 x, finetuning은 효과 좋지만 비쌈, machine unlearning이 가장 좋은 방법론 → 그중에서도 BalancedSubnet가 제일 좋음

📜 Paper Alibaba

2025.05 2주차

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

policy model은 search APIs 대신 simulated documents 를 사용하여 학습
언어모델을 사용하여 매 쿼리마다 20개 문서 생성
최종 답변 퀄리티를 기준으로 reward signals 사용

📜 Paper Shanghai Jiao Tong Univ.

2025.05 2주차

A Survey of AI Agent Protocols

🧑🏻‍💻 Dev Google

2025.05 1주차

DolphinGemma: How Google AI is helping decode dolphin communication

돌고래의 vocalization 구조를 이해하고 dolphin-like sound sequences를 생성하는 모델
Catacean Hearing Augmentation Telementary (CHAT) 시스템에 구글 픽셀폰 사용 가능

🧑🏻‍💻 Dev Google

2025.05 1주차

Introducing TxGemma: Open models to improve therapeutics development

전체 discovery process의 therapeutic entities의 properties를 이해하고 예측하도록 학습한 모델들임
promising targets를 식별하고 clinical trial outcomes까지 예측 가능
7M 데이터로 학습되었으며 2B, 9B, 27B 사이즈로 구성됨

🧑🏻‍💻 Dev DeepSeek AI

2025.05 1주차

DeepSeek-Prover-V2-671B

DeepSeek-V3를 subgoal decomposition & formalization 에 활용
이렇게 획득한 데이터를 이용하여 강화학습
ProverBench: Formalization of AIME and Textbook Problems

📜 Paper Cohere, Princeton, Stanford, Waterloo, MIT, Ai2, Washington

2025.05 1주차

The Leaderboard Illusion

undisclosed private testing practices가 모델 공개 전 특정 providers에게 유리한 것이라고 지적
selective disclosure of perfomance results 때문에 Arena가 biased 된다고 설명. 현재는 많은 모델들이 여기에 overfitted 되어 있음을 지적
proprietary closed models (Google, OpenAI) 는 battles에서 더 높은 비율로 picked 되기 때문에 open-source models 보다 더 많은 data access 가능

🧑🏻‍💻 Dev Ai2

2025.05 1주차

OLMo 2 1B

Mid-training에 [OLMo-mix-1124](https://huggingface.co/datasets/allenai/olmo-mix-1124) & [Dolmino-mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124) 를 포함한 4T 토큰 학습
Post-training에 [Tülu 3 dataset](https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-0225)의 OLMo-specific variant를 사용하여 SFT
[olmo-2-0425-1b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-0425-1b-preference-mix)에 대해 DPO training & 최종적으로 RLVR training 적용

📜 Paper Renmin Univ.

2025.05 1주차

DeepCritic: Deliberate Critique with Large Language Models

본 연구에서는 LLM의 math critique ability에 집중
math solutions의 각 reasoning step에 대해 의도적으로 critique 할 수 있도록 만드는 2-stage framework 제안
(1) Qwen2.5-72B-Instruct를 이용하여 4.5K long-form critique를 생성하고 이를 SFT의 seed로 사용

🧑🏻‍💻 Dev Anthropic

2025.05 1주차

Claude can now connect to your world

Integrations: Claude가 web & desktop app에 걸친 원격 MCP server 위에 동작
Jira & Confluence, Zapier, Cloudfalre, Intercom, Asana, Square, Sentry, Paypal, Linear, Plaid 서비스 지원

📜 Paper KAIST, DeepAuto.ai

2025.05 1주차

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

PaperCoder: multi-agent LLM framework로, 머신러닝 논문을 functional code repositories로 변환. 세 단계로 동작
(1) Planning: high-level roadmap 구축, diagram을 포함한 system architecture 설계, file dependencies 식별, configuration files 생성
(2) Analysis: implementation-specific details를 해석

📜 Paper mem0.ai

2025.05 1주차

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

두 개의 시스템으로 구성
Mem0: dense & language-based memory system
Mem0g: enhanced version with graph-based memory to model complex relationships

📜 Paper KAIST, DeepAuto.ai

2025.05 1주차

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Modality-aware routing: 매 query마다 적절한 modality를 dynamically select 하는 router
Granularity-aware retrieval: 각 modality는 granularity levels로 쪼개져 각각의 complexity에 적합한 content를 retrieve
Flexible routing: training-free (zero-shot GPT-4o prompting) & trained (T5-Large) routers 둘 다 지원

📜 Paper Amazon

2025.05 1주차

SLOT: Structuring the Output of Large Language Models

2025년 4월 62건

🧑🏻‍💻 Dev SkyworkAI

2025.04 4주차

Skywork-OR1 (Open Reasoner 1)

Skywork-OR1-RL-Data: DeepSeek-R1-Distill-Qwen-32B로 난이도를 평가한 데이터 구성됨 (데이터 사용시 필터링으로 사용 가능). 총 105K Math, 14K Coding 데이터
32B-Preview 모델의 경우 AIME, LiveCodeBench에서 DeepSeek-R1 수준 성능을 달성했다고 보고

📜 Paper NVIDIA

2025.04 4주차

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

CLIMB 제안: 사전학습을 위한 data mixture를 적절히 discover, evaluate, refine 하는 framework
이를 이용하여 획득한 400B 토큰에 대해 1B 모델을 학습한 결과는 SoTA인 Llama-3.2-1B 모델을 능가하는 수준이라고 보고
20개 cluster, 1.2T 토큰으로 구성된 ClimbLab, 400B 토큰으로 구성된 ClimbMix 공개

📜 Paper HKUST

2025.04 4주차

Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models

thinking token 사이에 (<think> </think>) smaller 모델로부터 생성된 external CoT를 넣어주는 방식이 모델이 적은 토큰을 생성하는 데 도움을 준다고 설명 → ThoughtMani
QwQ-32B 모델을 LiveBench/Code dataset에 적용했을 때, 기존 성능은 유지하면서도 약 30% 정도의 토큰을 절약할 수 있었음 (CoT generator로부터 overhead가 발생하긴 함)

🧑🏻‍💻 Dev Google

2025.04 4주차

Gemma 3 QAT Models: Bringing state-of-the-Art AI to consumer GPUs

Gemma 3 27B 모델의 경우 int4 기준 14.1GB 메모리를 차지하여 RTX 3090 한 대에 KV cache 포함한 로드가 가능하다고 설명
OpenAI API를 통해 function calling & custom tool 사용 가능

📜 Paper UC Berkeley, LangChain

2025.04 4주차

PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

이 데이터로 fine-tuned 된 Mistral & Llama 3 가 (본인들 벤치마크에 대해) GPT-4o를 평균 20.93% outperform 했다고 설명

📜 Paper Tsinghua

2025.04 4주차

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

즉, 현존하는 reasoning models의 reasoning abilities는 base model에 이미 존재하던 것을 적절히 sampling 할 수 있도록 학습되어 갖춰진 것으로 설명
이러한 경향성은 visual reasoning tasks에서도 관측됨
오히려 distillation이 이와 달리 모델에게 new knowledge 를 전달하는 방법이라고 설명

📜 Paper Shanghai AI Lab, Fudan, CMU

2025.04 4주차

MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

→ 데이터셋 내 information content를 정량화하는 method 제안: label graph를 구축하고 graph 내의 information distribution을 이용
Maximize Information Gain (MIG): semantic space 내에서 반복적으로 sampling을 수행하는 efficient sampling method
이 방법론을 Ai2 에서 공개했던 Tulu3 데이터셋에 적용해봄으로써 성능 향상을 이끌어 낼 수 있었다고 설명

📜 Paper Google DeepMind

2025.04 4주차

Welcome to the Era of Experience

학습을 위해 human-generated datasets에 의존하는 것을 피하고 environmental feedback을 사용할 것을 주장
여러 태스크와 도메인에 대한 continuous, long-term learning을 지원
task-specific performance가 아닌 시간에 걸친 capability growth에 집중

📜 Paper Alibaba

2025.04 4주차

Wan: Open and Advanced Large-Scale Video Generative Models

T2V-1.3B 모델은 8.19GB VRAM를 필요로 하며, RTX 4090 한 장으로 5초짜리 480P 비디오를 약 4분만에 생성 가능
Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, Video-to-Audio 등 다양한 태스크 수행 가능
Chinese & English 텍스트 생성 능력이 뛰어남

🧑🏻‍💻 Dev Anthropic

2025.04 4주차

Values in the wild: Discovering and analyzing values in real-world language model interactions

이때 privacy-preserving system을 이용했기 때문에 유저의 개인정보는 제거되었다고 설명
분석 과정을 시각화한 도식 참고하면 좋을 듯. AI values taxonomy를 구축한 것이 눈에 띔

📜 Paper NVIDIA

2025.04 4주차

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

특히 long video understanding & high-resolution image understanding 의 문제를 해결
Automatic Degrade Sampling & Image Area Preservation 을 통합하여 contextual integrity & visual details 보존
Eagle-Video-110K: story-level & clip-level annotations를 통합한 데이터셋

📜 Paper Huawei

2025.04 4주차

Dynamic Early Exit in Reasoning Models

fixed heuristics와 달리 potential reasoning transition points (ex. Wait 토큰)을 model behavior에서 탐지하는 방식.
이때 모델이 trial answer에 대해 high confidence를 갖는 경우 next reasoning chain’s generation을 중단
추가적인 학습이 필요없는 방식이며 기존 o1-like reasoning LLMs에 seamlessly integrate 가능

📜 Paper Chinese Academy of Sciences

2025.04 4주차

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

unified action space rule modeling을 통해 LVLMs이 GUI 이해 능력을 향상할 수 있도록 하는 강화학습 프레임워크 GUI-R1 제안
각 플랫폼(Windows, Linux, MacOS 등)으로부터 얻은 소수의 carefully curated high-quality data, GRPO를 이용하여 자원 효율적인 결과를 달성할 수 있었다고 설명

🧑🏻‍💻 Dev ByteDance

2025.04 4주차

Introducing UI-TARS-1.5

token-level multimodal supervision 기반의 reasoning-before-action approach를 사용
뛰어난 Web Navigation 능력은 GPT-4.5 능가하는 수준

🧑🏻‍💻 Dev Nari-Labs

2025.04 4주차

Nari Dia-1.6B

ElevenLabs Studio나 Sesame CSM-1B 모델 이상의 퍼포먼스를 보여주어 큰 화제를 일으키는 중
카이스트 학부생이 2명이 작업한 결과물로 알려짐

📜 Paper a-m-team

2025.04 4주차

DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training

pass rate & Coefficient of Variation (CV) 를 이용하여 유의미한 학습 데이터만 남겼다고 설명

📜 Paper Shanghai AI Lab, Tsinghua

2025.04 4주차

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

VisuLogic: 6개 카테고리에 대한 1,000 human-verified problems (quantitative shifts, spatial relations 등)
사람은 51.4%, 대부분의 모델은 30% 이하의 정확도를 기록하는 수준의 벤치마크이며, visual reasoning 능력을 고도화할 수 있는 학습 데이터도 공개했다고 언급함

📜 Paper Tsinghua, Shanghai AI Lab

2025.04 4주차

TTRL: Test-Time Reinforcement Learning

ground-truth 정보 없이 reward estimation을 어떻게 할 것인지가 challege
Test-Time Reinforcement Learning (TTRL): pre-trained models의 priors를 이용하여 self-evolution
Test-Time Scaling (TTS) 에서 majority voting 등이 RL training에서 reward 역할을 할 수 있었음에 착안

🧑🏻‍💻 Dev OpenAI

2025.04 4주차

Introducing our latest image generation model in the API

해당 기능을 `gpt-image-1` API로 공개
이미지 한 장당 대략 0.3$ 정도 비용 발생

🧑🏻‍💻 Dev NousResearch

2025.04 4주차

Minos-v1

유저의 질문과 LLM의 답변 pair를 입력으로 받아 둘 중 하나의 클래스를 confidence와 함께 반환하는 모델
400M 사이즈 모델로 8,192 context length, 약 380K 데이터로 학습

📜 Paper DevRev

2025.04 4주차

Efficient Single-Pass Training for Multi-Turn Reasoning

LLM은 추론 토큰을 생성하는데 이를 이후 입력에 포함하면 안됨
이러한 불일치(discrepancy)로 인해 일반적인 다른 데이터셋에 대해 학습하는 것과 달리, single forward pass로 전체 대화를 처리할 수 없음
이를 해결하기 위해 response token duplication & custom attention mask (enforces appropriate visibility constraints) 적용

🧑🏻‍💻 Dev HuggingFace

2025.04 4주차

Tiny Agents: a MCP-powered agent in 50 lines of code

AI Agents 시스템 구축에 50줄 코드면 충분

🧑🏻‍💻 Dev Anthropic

2025.04 4주차

The Urgency of Interpretability

언어별로 별도 시스템이 존재하는 것이 아니라, 영어, 프랑스어, 중국어 등 다양한 언어가 공유하는 추상적 개념 공간이 존재 → 의미 처리 후 특정 언어로 번역되는 방식으로 동작
시를 쓸 때 단순히 다음 토큰들을 예측하는 것이 아니라 미리 운율을 맞출 준비를 하고 있음
어려운 수학 문제 등을 풀 때, 잘못된 근거를 제시하면 그럴싸한 답변을 생성. 이런 과정은 여러 ‘중간 단계’를 거치는 것으로 확인됨

📜 Paper Microsoft

2025.04 4주차

BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

BitNet v2: 1-bit LLM을 위한 native 4-bit activation quantization 프레임워크
H-BitLinear: activation quantization 이전에 online Hadamard transformation 적용

🧑🏻‍💻 Dev Alibaba

2025.04 4주차

Qwen3: Think Deeper, Act Faster

가장 큰 두 모델: Qwen3-30B-A3B, Qwen3-235B-A22B (둘 다 MoE)
Hybrid thinking mode: thinking mode와 non-thinking mode 스위칭 가능
36T 토큰으로 학습. 이는 Qwen2.5를 학습한 데이터의 두 배에 이르는 양.

🧑🏻‍💻 Dev NourResearch

2025.04 4주차

Atropos

🧑🏻‍💻 Dev ByteDance

2025.04 3주차

Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning

총 200B, activated 20B의 MoE 모델
일반화된 reasoning 능력 평가를 위해 BeyondAIME, Codeforces, 두 개의 벤치마크 공개

📜 Paper Microsoft Research

2025.04 3주차

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

두 입력을 각각 image tokenizer & action tokenizer 에 통과시켜 discrete token으로 변환 후 concat 하여 input으로 사용
모델이 초당 4~7 프레임을 생성할 수 있도록 학습되었으며 플레이어와 실시간 interact 가능
visual quality & action following capability 를 함께 측정할 수 있는 metric 제시

🧑🏻‍💻 Dev DeepCogito

2025.04 3주차

Cogito v1 PreviewIntroducing IDA as a path to general superintelligence

70B 모델이 Llama의 최신 109B MoE 모델을 능가하는 성능을 보인다고 보고
Iterated Distillation and Amplification (IDA) - a scalable and efficient alignment strategy for general superintelligence using iterative self-improvement
모든 모델은 질문에 바로(direct) 답하거나, 답변 전에 스스로 생각(self-reflect)할 수 있음

🧑🏻‍💻 Dev OpenAI

2025.04 3주차

Introducing GPT-4.1 in the API

세 모델 전부 주요 벤치마크에서 GPT-4o, GPT-4.5를 outperform & 1M context window & diff 모드 지원
structured input 이해, multi-turn, multi-needle tasks에서 기존보다 더 뛰어난 성능

🧑🏻‍💻 Dev xAI

2025.04 3주차

Grok Studio

documents, codes, reports, browser games 등을 생성할 수 있고 컨텐츠를 별도 윈도우에 띄움

📜 Paper China Telecom

2025.04 3주차

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

label 정확도를 높이기 위해 multi-round annotation 수행
Long Reasong tasks에 대한 평가 모델을 학습하기 위해 데이터셋을 구축했다는 내용이 전부인 듯

📜 Paper UCLA, Meta

2025.04 3주차

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

(a) masked SFT를 이용하여 knowledge를 distill 하고 self-improvement behavior를 instill
(b) diff-GRPO: critic-free, policy-gradient based RL algorithm

📜 Paper Microsoft

2025.04 3주차

BitNet b1.58 2B4T Technical Report

computational efficiency를 큰 특징으로 삼으면서도 language understanding, mathematical rreasoning, coding preoficiency, conversational ability 등이 전부 뛰어나다고 설명
CPU, GPU 추론 둘 다 지원하며 HuggingFace를 통해 이용 가능

🧑🏻‍💻 Dev OpenAI

2025.04 3주차

Introducing OpenAI o3 and o4-mini

차트 해석, UI 이해, 수학적 추론, OCR + context 등 수행 가능

🧑🏻‍💻 Dev Ai2

2025.04 3주차

DataDecide: How to predict best pretraining data with small experiments

학습 중 check point를 공개함으로써, 작은 모델로 특정 데이터셋에 대해 어떻게 학습되는지 경향성을 파악하여 scale-up 하는 데 도움을 주고자 하는 목적으로 공개했다고 설명함

🧑🏻‍💻 Dev Comet-ML

2025.04 3주차

Opik

Tracing, Annotations, Playground 등 기능 지원
LLM-as-a-Judge metric 포함

🧑🏻‍💻 Dev Cohere

2025.04 3주차

Introducing Embed 4: Multimodal search for business

128K context window length (200 페이지 분량)
100개 이상의 다양한 언어 지원
virtual private cloud (VPC) 환경 뿐만 아니라 on-premise 환경도 지원

📜 Paper Salesforce

2025.04 2주차

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

첫 단계에서는 LLM reviewers committee를 이용하여 detailed blue prints 생성
blue prints는 simulated human-agent interplay를 통해 complete interaction trajectories로 발전
1B에서 70B 사이즈에 이르는 xLAM-2-fc-r 시리즈 학습하여 GPT-4o나 Claude 3.5를 $\tau$-bench & BFCL benchmarks에서 outperform 했다고 보고

🧑🏻‍💻 Dev Meta

2025.04 2주차

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

Behemoth는 teacher model로서 Scout, Maverick의 추론, 코딩, 멀티모달 이해 능력 전수
MoE 아키텍쳐, native multi-modal model, 10M context length, Codistillation 등의 특징
bias 문제 해결을 위한 노력 언급

📜 Paper HuggingFace

2025.04 2주차

SmolVLM: Redefining small and efficient multimodal models

SmolVLM: resource-efficient inference를 위해 설계된 compact multimodal models series
가장 작은 SmolVLM-256M 모델은 추론 시 1GB 미만의 GPU 메모리를 사용할 정도로 효율적이며, static images에 대해서 뿐만 아니라 뛰어난 video comprehension 이해 능력을 보였다고 함

🧑🏻‍💻 Dev Ai2

2025.04 2주차

Going beyond open data – increasing transparency and trust in language models with OLMoTrace

학습 데이터에 접근할 수 있는 다른 모델에도 적용할 수 있는 기능

📜 Paper Yandex

2025.04 2주차

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

한 instance가 생성하는 과정을 나머지 instances가 concurrent cache를 통해 살펴볼 수 있음
RoPE 차용
modern reasoning-capable LLM들이 추가적인 fine-tuning 없이 shared Key-Value cache 만으로 좋은 성과를 낼 수 있었다고 보고

🧑🏻‍💻 Dev Google

2025.04 2주차

Announcing the Agent2Agent Protocol (A2A)

HTTP, SSE, JSON-RPC 등을 사용하여 기존 시스템과의 compatibility 보장
Agents는 사용 가능한 functions를 structured JSON files로 정리하고, 이를 Agent Cards라고 함
최근 Agent Development Kit (ADK)를 공개했는데 이는 Vertex AI, Gemini API와 integrate 가능한 open source임

🧑🏻‍💻 Dev OpenAI

2025.04 2주차

Evaluating model performance

평가에 사용되는 test data를 `data_source_config`에 명시하고, 모델 출력 결과가 올바른 것인지에 대한 정보는 `testing_criteria`에 작성

🧑🏻‍💻 Dev Amazon

2025.04 2주차

Amazon’s new Nova Sonic foundation model understands not just what you say—but how you say it

Amazon Bedrock에 API로 이용 가능

📜 Paper Nanjing, ByteDance

2025.04 2주차

DDT: Decoupled Diffusion Transformer

Decoupled Diffusion Transformer (DDT): semantic extraction를 위한 encoder & specialized velocity decoder 로 구분되는 디자인
인접한 denoising step 간의 self-condition을 공유함으로써 추론 속도까지 향상시킬 수 있음

🧑🏻‍💻 Dev OpenGVLab

2025.04 2주차

InternVL3

InternVL 2.5 대비 뛰어난 multimodal perception & reasoning 능력을 보여줌
tool usage, GUI agents, industrial image analysis, 3D vision perception 등
text performance가 Qwen 2.5 시리즈 대비 뛰어나다고 언급

📜 Paper Kimi

2025.04 2주차

Kimi-VL Technical Report

activating language decoder 사이즈가 2.8B 수준임에도 불구하고 뛰어난 성능 달성
multi-turn agent tasks, college-level image & video comprehension, OCR, mathematical reasoning 등의 태스크에서 뛰어난 퍼포먼스를 보임
128K content window & native-resolution vision encoder, MoonViT 덕분에 ultra-high-resolution visual inputs 이해 가능

🧑🏻‍💻 Dev Google

2025.04 2주차

Introducing Firebase Studio

Project IDX, Genkit, Gemini 를 하나의 workspace에 통합
*App Prototyping agent*: prompt | drawing 으로부터 full apps 생성하는 기능

🧑🏻‍💻 Dev OpenAI

2025.04 2주차

BrowseComp: a benchmark for browsing agents

📜 [BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf)
정답이 간단하고 이견의 여지가 없는 1,266개의 문제로 구성

📜 Paper Zhejiang University

2025.04 2주차

Large language models could be rote learners

LLM이 암기한 내용(rote memorization)보다 그렇지 않은 것(genuine capability)에 대해 더 좋은 퍼포먼스를 내는 경향이 있다고 보고
TrinEval: MCQ를 trinity format으로 변경하여 memorization 평가는 줄이고 knowledge 평가는 더 잘할 수 있도록 만드는 evaluation 프레임워크

📜 Paper UC San Diego

2025.04 1주차

Large Language Models Pass the Turing Test

GPT-4o 모델의 경우, 인간 페르소나를 부여했을 때 인간 상대로 73%의 win rate를 기록

📜 Paper AI2

2025.04 1주차

Introducing CodeScientist: A step toward automated scientific discovery

전체 프로세스 내에서 Ideation, Planning, Experiment, Reporting, Meta-analysis 수행
아직까지 사람의 의사결정이 중간에 개입되어야 한다는 한계가 있지만 빠른 속도로 발전하고 있다는 인상을 줌 (Sakana AI의 것도 그렇고..)

🧑🏻‍💻 Dev HuggingFace

2025.04 1주차

YourBench: A Dynamic Benchmark Generation Framework

Scalable & Structured: Seamlessly handles ingestion, summarization, and multi-hop chunking for large or specialized datasets.
Zero-Shot Focus: Emulates real-world usage scenarios by creating fresh tasks that guard against memorized knowledge.
Extensible: Out-of-the-box pipeline stages (ingestion, summarization, question generation), plus an easy plugin mechanism to accommodate custom models or domain constraints.

📜 Paper National University of Singapore

2025.04 1주차

JudgeLRM: Large Reasoning Models as a Judge

SFT performance gains & reasoning-demanindg samples의 비율 간의 음의 상관관계 확인
JudgeLRM: judge-wise, outcome-driven rewards 향으로 RL을 적용한 judgement-oriented LLMs family

🧑🏻‍💻 Dev OpenAI

2025.04 1주차

OpenAI Academy

workshops & live events 등도 진행

📜 Paper Meta

2025.04 1주차

Multi-Token Attention

Multi-Token Attention (MTA): LLM이 여러 개의 query & key vectors에 대해 attention weights를 condition 하는 어텐션 기법 제안
queries, keys, heads에 대해 convolution 적용

📜 Paper OpenAI

2025.04 1주차

PaperBench: Evaluating AI's Ability to Replicate AI Research

Claude 3.5 Sonnet이 21.0% 스코어를 기록했으나 인간 ML PhD는 41.4%를 기록
평가를 수행하는 것도 LLM임

🧑🏻‍💻 Dev Anthropic

2025.04 1주차

Introducing Claude for Education

Learning mode: 학생들에게 정답을 바로 알려주기보다는 critical thinking skills를 develop 할 수 있도록 reasoning process를 가이드
Socratic questioning (결론을 뒷받침하는 근거는 무엇인가?), 핵심 개념 강조 등의 특징

📜 Paper Mila, Nanyang, MS, …

2025.04 1주차

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

📜 Paper Oxford, NUS, DeepMind

2025.04 1주차

Why do LLMs attend to the first token?

2025년 3월 62건

📜 Paper University of Texas at Dallas

2025.03 4주차

A Review of DeepSeek Models' Key Innovative Techniques

Multi-Head Latent Attention (MLA), Advanced MoE, Multi-Token Prediction (MTP), Grouped Relative Policy Optimization (GRPO) 등

📜 Paper ByteDance, Tsinghua

2025.03 4주차

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) 알고리즘 제안

📜 Paper Hong Kong, Peking

2025.03 4주차

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

fine-grained & coarse level의 individual & consecutive reasoning step을 평가
이전 step의 추론이 잘못되어 뒤에 안좋은 영향을 주는 케이스를 특히 잘한다고 보고
MCTS의 비효율성을 해결하기 위해 Hierarchical Node Compression (HNC) 라는 node merging 기법 제안

🧑🏻‍💻 Dev OpenAI

2025.03 4주차

Introducing next-generation audio models in the API

multi-speaker detection, 대화 시작 & 중단, noisy 환경 등에 대해 훨씬 robust 하다고 설명
real-time | batch-processing voice agents 구현 가능

🧑🏻‍💻 Dev Anthropic

2025.03 4주차

The "think" tool: Enabling Claude to stop and think in complex tool use situations

말 그대로 tool을 사용하는 schema(API 호출에 필요한)와 이를 위해 최적화된 프롬프트를 안내하고 있음

🧑🏻‍💻 Dev DeepSeek AI

2025.03 4주차

DeepSeek-V3-0324

multi-turn interactive rewriting, translation quality & letter writing, enhances search-based report analysis
function calling, JSON output, FIM (Fill-in-the-Middle) completion
허깅페이스에 MIT 라이센스로 공개

📜 Paper National University of Singapore, Nanyang

2025.03 4주차

MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization

7개의 agent로 구성되어 각각이 autonomously Planner를 사용하여 optimization path를 고안
또한 Teacher-Critic-Student Socratic dialogue를 사용하여 프롬프트를 iteratively optimize
이는 기존의 Automated Prompt Optimization (APO)의 한계를 극복하기 위함임

🧑🏻‍💻 Dev Google DeepMind

2025.03 4주차

Gemini 2.5: Our most intelligent AI model

1M token content window. 곧 2M을 지원할 예정
RAG & document-based workflows에 최적화되어 있다고 언급

🧑🏻‍💻 Dev ARC-AGI-2 + ARC Prize 2025 is Live!

2025.03 4주차

ARC-AGI-2 + ARC Prize 2025 is Live!

사람에게는 쉽지만 AI에게는 어려운 reasoning task 중심. 이전 challenge보다 더 어렵다고 자체적으로 설명함.

🧑🏻‍💻 Dev OpenAI

2025.03 4주차

Introducing 4o Image Generation

trained our models on the joint distribution of online images and text
→ 이를 통해 이미지와 텍스트가 어떤 식으로 관계되어 있는지를 학습했다고 설명
ChatGPT, Sora에서 사용 가능하며, 곧 API로도 지원될 예정

📜 Paper Tencent

2025.03 4주차

CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision

(1) On-the-spot Reward: each tool invocation에 대해 immediate feedback 제공
(2) Latent Reward: 전체적인 task completion에 대해 각 step의 기여를 평가

🧑🏻‍💻 Dev Alibaba

2025.03 4주차

Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!

Think-Talker 아키텍쳐는 speech synthesis에서 reasoning을 분리함으로써 more structured ouputs에 기여
Thinker는 언어모델로서 reasoning & text generation을 담당
Talker는 text | direct audio instruction 을 기반으로 speech를 생성

🧑🏻‍💻 Dev AI2

2025.03 4주차

Introducing Ai2 Paper Finder

키워드 대신 자연어 전체 문장을 그대로 입력해도 관련 논문을 찾아줌
relevance 판단 시 복잡한 질의를 다중 기준으로 분해해 평가하고, citation 기반 확장 탐색도 수행
빠른 응답이 필요한 경우엔 fast mode, 깊이 있는 탐색이 필요할 땐 iterative exhaustive mode 제공

📜 Paper Google

2025.03 4주차

Gemma 3 Technical Report

vision understanding, 더 많은 언어, longer context (128K)
local to global attention layer의 비중을 높임으로써 (local의 비중을 높임) KV-cache가 폭발적으로 증가하는 것을 방지
Gemma 3 모델들은 distillation으로 학습되어pre-trained & instruction finetuned version 둘 다 Gemma 2 성능을 능가

🧑🏻‍💻 Dev Anthropic

2025.03 4주차

Tracing the thoughts of a large language model

이를테면 feature activations와 이것이 transformer layers에 걸쳐 미치는 영향을 추적할 수 있음
Claude는 한 번에 여러 개의 future words를 선택 / shared internal states를 사용하고 이를 다른 언어들에 각각 매핑

🧑🏻‍💻 Dev Tencent

2025.03 4주차

Reasoning Efficiency Redefined! Meet Tencent’s 'Hunyuan-T1'—The First Mamba-Powered Ultra-Large Model

📜 Paper UC Berkeley, Tokyo

2025.03 3주차

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Plan-and-Act: synthetic data generation을 통해 LLM 기반 agents의 plan generation을 고도화한 프레임워크
Planner: 목표를 달성하는 데 필요한 structured & high-level plans
Executor: 위 plan들을 environment-specific actions로 translate

🧑🏻‍💻 Dev Microsoft

2025.03 3주차

RD-Agent

확실히 Agent 개념을 활용한 자동화가 연구에 본격적으로 활용되고 있고 앞으로는 BM으로 발전하지 않을까 싶음

📜 Paper IBM, HuggingFace

2025.03 3주차

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

DocTags: 페이지 내 모든 요소를 위치와 함께 capture하는 새로운 universal markup format
business documents, academic papers, technical reports 등 다양한 형식의 문서에서 code listings, table,s equations, charts, list 등의 feature 추출 가능하며 robust 하다고 설명
모델은 이용 가능하며 데이터셋은 곧 공개 예정

📜 Paper Anthropic

2025.03 3주차

Auditing Language Models for Hidden Objectives

RLHF 내 reward model의 error를 학습하고, 이러한 error를 이용(exploit)하는 방법을 익힘
(1) 모델의 hidden objective와 training에 대해 모르는 사람들을 4팀으로 꾸려 blind auditing game 수행
(2) 후속 연구로 모델을 audit 하는 8개 테크닉을 탐구. SAE가 가장 효과적이었다고 함

📜 Paper IIIT Hyderabad

2025.03 3주차

No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models

bias detection task를 위한 5개의 prompting approaches 소개
biases detecting 벤치마크의 metrics에 대한 3개의 research questions 제시
실험 결과에 따르면 모든 LLM이 최소 1개 이상의 bias를 나타내고 있으며, LLaMA3.1-8B 모델의 bias가 가장 적었다고 함

🧑🏻‍💻 Dev Mistral

2025.03 3주차

Mistral Small 3.1

GPQA에서 44.42% 스코어를 달성하며 Gemma 3-it (36.83%) 모델과 GPT-4o-mini (40.2%) 모델을 능가
초당 150 토큰 생성 가능하며 이미지도 처리 가능

🧑🏻‍💻 Dev AI2

2025.03 3주차

OLMo 2 32B: First fully open model to outperform GPT 3.5 and GPT 4o mini

오픈소스 모델(데이터, 코드, 학습 방식 등 모든 디테일 공개) 중 GPT 3.5와 GPT 4o mini를 능가하는 것은 최초라고 보도
refined post-training과 RLVR (Reinforcement Learning with Verifiable Rewards) 적용

📜 Paper Tsinghua

2025.03 3주차

Personalize Anything for Free with Diffusion Transformer

덕분에 personalization 및 image editing도 가능
Personalize Anything: DiT를 이용하여 personalized image generation을 수행하는 training-free framework

📜 Paper Babes-Bolyai University

2025.03 3주차

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

low-resource tasks (classification, QA), code-centric applications 발전에 대해 언급

🧑🏻‍💻 Dev Google

2025.03 3주차

New ways to collaborate and get creative with Gemini

Python, Javascript, HTML 지원
real-time code collaboration이 가능하지만 multi user는 안됨
Audio Overview: documents, slides, Deep Research reports를 두 AI host 간의 오디오 팟캐스트로 변환

🧑🏻‍💻 Dev LG AI Research

2025.03 3주차

EXAONE Deep Released ━ Setting a New Standard for Reasoning AI

Notable AI models에 이름을 올린 유일한 한국어 모델
7.8B & 2.4B 모델도 공개

📜 Paper Eleuther AI

2025.03 3주차

RWKV-7 "Goose" with Expressive Dynamic State Evolution

추론 시 토큰마다 필요한 memory usage & inference time이 constant
3.1T 토큰의 multilingual dataset도 공개

📜 Paper METR

2025.03 3주차

Measuring AI Ability to Complete Long Tasks

AI 모델들이 2초에서 8시간까지 걸리는 engineering 태스크 170여 개를 완수
서베이 결과에 따르면 AI task length는 7개월마다 2배로 증가하고, 현재를 기준으로는 Claude 3.7 Sonnet이 1-hour tasks를 50% 신뢰도로 잘 끝내는 수준이라고 함
[연구 결과를 정리해놓은 METR posting 링크](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) 🔗

📜 Paper Shanghai AI Lab

2025.03 3주차

ϕ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

φ-Decoding: foresight & clustering 을 통해 두 개의 distribution에 approximate → joint distribution으로부터 sampling

📜 Paper Rice University

2025.03 3주차

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

(1) model-based efficient reasoning: full-length reasoning 모델을 concise reasoning으로 optimize 하거나 애초에 efficient reasoning model을 학습
(2) reasoning output-based efficient reasoning: 추론 단계에서 reasoning step과 length를 dynamically 조절
(3) input prompts-based efficient reasoning: 입력 프롬프트의 난이도나 길이를 기준으로 reasoning efficiency를 개선

📜 Paper The Hebrew University, IBM, Yale

2025.03 3주차

Survey on Evaluation of LLM-based Agents

📜 Paper Renmin Univ.

2025.03 2주차

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

R1-Searcher: two-stage outcome-based RL approach
reasoning process 동안 추가적인 지식 습득을 위해 모델이 자율적으로 external search system에 접근
RL만 배타적으로 사용. cold start를 위한 reward나 distillation 불필요.

🧑🏻‍💻 Dev Manus

2025.03 2주차

Leave it to Manus

자체적으로 공개한 벤치마크 결과에서는 OpenAI Deep Research를 압살
파격적인 데모(수십 개의 앱이 동시에 실행)가 사실인지에 대한 커뮤니티 논쟁이 있었음

🧑🏻‍💻 Dev OpenAI

2025.03 2주차

New tools for building agents

Chat Completions API에 Assistants API의 tool 사용 능력을 합친 Responses API
web search, file search, computer use 능력을 내장

📜 Paper Skolkovo Institue of Science and Technology

2025.03 2주차

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Sparse Autoencoder를 이용하여 Gemma-2-2b로부터 feature를 추출함으로써 ATD interpretability를 높임
다양한 모델로부터 획득한 텍스트가 사람으로부터 얻은 것과 어떻게 다른지에 대한 인사이트 제공 가능

🧑🏻‍💻 Dev Google DeepMind

2025.03 2주차

Gemini Robotics brings AI into the physical world

Gemini Robotics-ER: Gemini의 embodied reasoning (ER) 능력을 활용하여 advanced spatial understanding을 보여줌
다음 세대의 휴머노이드를 만들기 위해 Apptronik와 파트너십
[Technical Report link](https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf) 🔗

🧑🏻‍💻 Dev Google

2025.03 2주차

Introducing Gemma 3: The Developer Guide

LMArena에서 R1 바로 뒤를 이어 2위 차지
SigLIP 기반의 vision encoder를 통한 Multimodal 지원, 128K 윈도우 사이즈, 140개 이상 언어 이해
3개의 강화 학습 기법 적용: RLMF (Machine Feedback), RLEF (Execution Feedback), RLHF (Human Feedback)

🧑🏻‍💻 Dev Perplexity

2025.03 2주차

Perplexity Ask MCP Server

AI 시스템과 데이터 소스를 연결하기 위한 개방형 표준 프로토콜
클라이언트 - 서버 아키텍쳐를 기본으로 삼음
기존 API 대비 더 직관적이고 유연한 솔루션

🧑🏻‍💻 Dev OpenAI

2025.03 2주차

Detecting misbehavior in frontier reasoning models

reasoning 모델을 위한 강화학습 과정에서 발생하는 reward hacking 문제 중 coding task에 집중
모델이 reward를 maximize 하기 위해서 cheating 하는 내용들을 explicitly state 하는 것이 관측됨
현재로서는 모델 스스로 intent를 숨기고 detection을 회피하고자 하는 경향성이 있음

📜 Paper Meta, NYU, MIT, Princeton

2025.03 2주차

Transformers without Normalization

Dynamic Tanh (DyT): element-wise 연산, $\text{DyT}(x)=\text{tanh}(\alpha x)$, Transformers 아키텍쳐에서 normalization layers를 replace
이 아이디어는 기존 normalization의 결과가 tanh-like S-shaped input-output mapping을 보여준다는 점에서 착안함
recognition부터 generation, computer vision부터 language model 까지 다양한 태스크로 validate

📜 Paper KAIST

2025.03 2주차

Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

📜 Paper Microsoft

2025.03 1주차

LongRoPE2: Near-Lossless LLM Context Window Scaling

LLaMA3-8B에 LongRoPE2를 적용하여 128K를 커버할 수 있게 만들면서도 기존 short-context performance는 98.5% 보존

🧑🏻‍💻 Dev OpenAI

2025.03 1주차

Introducing GPT-4.5

이미지 입력, agentic planning & execution 가능
text-based interactions 내의 뉘앙스 파악 더 잘함 & 향상된 EQ → 문과적 사고는 좋아졌는데 실질적인 성능은 아쉽다는 평이 많음

🧑🏻‍💻 Dev Inception Labs

2025.03 1주차

Introducing Mercury, the first commercial-scale diffusion large language model

H100에서 초당 1000 토큰을 출력할 수 있을 정도로 기존 모델들 대비 10x 이상 빠르다고 설명
다음 토큰을 autoregressive 하게 예측하는 방식/패러다임을 “coarse-to-fine” 생성 방식으로 전환해야 한다고 주장

📜 Paper King’s College London, The Alan Turing Institue

2025.03 1주차

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

CODI: shared model이 teacher & student 역할을 수행하며 explicit & implict CoT를 학습
implicit CoT로도 explicit CoT 성능을 달성하면서도 3.1배의 토큰 압축률을 보여줌
explicit reasoning이 대박을 친 이후로 추론 비용이 급상승해서인지 implicit & compression 관련 연구들에 눈에 띄고 있음

🧑🏻‍💻 Dev Sesame

2025.03 1주차

Crossing the uncanny valley of conversational voice

tone, pace, rhythm 등을 conversational context and emotions 기반으로 조절 가능
decoder는 Residual Vector Quantization (RVQ) tokens로부터 high-fidelity speech를 reconstruct
2K context window 커버 가능, 1M hours of publicly available transcribed and diarized speech로 학습

🧑🏻‍💻 Dev Anthropic

2025.03 1주차

Token-efficient tool use (beta)

API call에서 tool use와 관련된 옵션임. Claude 3.7을 공개하면서 사용 비용을 최소화하는 옵션을 함께 제시함.

📜 Paper LLM Post-Training: A Deep Dive into Reasoning Large Language Models

2025.03 1주차

LLM Post-Training: A Deep Dive into Reasoning Large Language Models

catastrophic forgetting, inference-time trade-off, reward hacking 등의 issues를 함께 다룸
Tuning 파트에 엑사원은 있는데 솔라는 포함되지 않았음
[Awesome LLM Post-Training repository](https://github.com/mbzuai-oryx/Awesome-LLM-Post-training) 🔗

📜 Paper Mila

2025.03 1주차

Multi-Turn Code Generation Through Single-Step Rewards

μCODE: single-step reward만을 사용하는 multi-turn code generation
중간의 어떤 과정에서도 올바른 코드로 recovered 가능하다고 주장
멀티턴 실행 피드백과 새로 생성된 코드를 scoring하는 verifier를 iteratively 학습

📜 Paper Univ. of Oklahoma

2025.03 1주차

A Survey On Large Language Models For Code Generation

엄청 방대한 양을 커버하고 있지는 않음

🧑🏻‍💻 Dev Qwen

2025.03 1주차

QwQ-32B

131K Token length 지원
RoPE, SwiGLU, RMSNorm

🧑🏻‍💻 Dev Cohere

2025.03 1주차

Aya Vision: Expanding the Worlds AI Can See

8B, 32B 사이즈 모델. [Kaggle](https://www.kaggle.com/models/cohereforai/aya-vision?ref=cohere-ai.ghost.io) & [HuggingFace](https://huggingface.co/collections/CohereForAI/c4ai-aya-vision-67c4ccd395ca064308ee1484?ref=cohere-ai.ghost.io) 에 weights 공개

🧑🏻‍💻 Dev Google

2025.03 1주차

Data Science Agent in Colab: The future of data analysis with Gemini

classification, regression, feature selection, correlation analysis 등 기능 지원
CSV, JSON, Excel files 지원

📜 Paper Nanjing Univ., Microsoft

2025.03 1주차

Process-based Self-Rewarding Language Models

→ 현존하는 self-rewarding 방식은 수학적 추론 영역에서 약점을 보인다고 지적
→ self-rewarding 내에 long-thought reasoning, step-wise LLM-as-a-Judge, step-wise preference optimization 등 도입

📜 Paper Washington, Peking

2025.03 1주차

MPO: Boosting LLM Agents with Meta Plan Optimization

Meta Plan Optimization (MPO): explicit guidance를 통합하여 agent의 planning capability를 향상시키는 프레임워크. agent의 실행 결과에 대한 피드백을 바탕으로 삼음.
Meta Plan에 대한 평가(reward)를 제공하는 모델도 있어서 파이프라인이 강화학습처럼 보임

📜 Paper Alibaba

2025.03 1주차

Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

Babel-9B, 83B multilingual LLMs 공개
전통적인 continued pretraining 대신 model extension을 통해 parameter count를 확장함으로써 성능 향상을 도모했음

📜 Paper Alibaba

2025.03 1주차

START: Self-taught Reasoner with Tools

(1) Hint-infer: 인위적으로 설계한 힌트를 삽입 (ex. 파이썬 코드를 써야겠어!)
(2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-infer를 통해 생성된 reasoning trajectories(tool 사용을 포함하는)를 fine-tuning

📜 Paper CMU

2025.03 1주차

SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning

accuracy와 efficiency를 향상시키기 위해 reasoning topology를 dynamically optimize
Topological-Annotation-Generation (TAG) system: topological dataset creation & segmentation을 자동화
multi-task Topological Reward Model (M-TRM) 학습: 자동적으로 best reasoning topology를 선택하여 single pass에 답변 반환 (multiple single-task 필요성 x)

📜 Paper NVIDIA, Berkeley, MIT, Nanjing, KAIST

2025.03 1주차

Token-Efficient Long Video Understanding for Multimodal LLMs

STORM (Spatiotemporal TOken Reduction for Multimodal LLMs): image encoder & LLM 사이의 temporal encoder를 통합하는 아키텍쳐
Mamaba State Space Model을 사용하여 temporal information을 image tokens에 통합하여 보다 풍부한 representations를 생성
training & inference latency 둘 다 감소시키면서도 extended temporal contexts에 대한 efficient & robust video understanding 를 보여줌

📜 Paper Stanford

2025.03 1주차

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

4개의 cognitive behaviors: verification, backtracking, subgoal setting, backward chaining
OpenWebMath data를 continued-pretraining에 활용하여 Llama를 학습한 결과는 Qwen에 준함

📜 Paper Columbia Business School

2025.03 1주차

How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

→ 거의 모든 distinct reasoning chain마다 reasoning length와 accuracy 간의 universal tradeoff 존재
token complexity: successful problem-solving을 위해 필요한 최소한의 토큰 숫자
→ accuracy-compression tradeoff의 이론적 한계를 계산하는 데 활용

2025년 2월 66건

🧑🏻‍💻 Dev StepFun, Tsinghua

2025.02 4주차

Open-Reasoner-Zero

minimalist approach: vanilla PPO with GAE & rule-based reward function / w/o KL regularization
1/30 training steps만으로도 DeepSeek-R1-Zero-Qwen-32B를 GPQA Diamond Bench에서 우세
[paper link](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) 🔗

🗞️ News 1X

2025.02 4주차

Introducing NEO Gamma

“companion” 포지션으로 가정 환경에서 자연스러운 움직임을 보여줌 (링크 데모 참고)

📜 Paper Alibaba

2025.02 4주차

Qwen2.5-VL Technical Report

objects를 식별할 때 bounding box를 치거나 point를 정확하게 파악하는 점이 특징
dynamic resolution processing & absolute time encoding 도입 → 다양한 사이즈의 이미지, long-video 처리 가능
task-specific fine-tuning 없이도 다양한 domain에 robust performance를 보인다고 주장

📜 Paper Arizona, UCLA, Notre Dame, UIUC

2025.02 4주차

Preference Leakage: A Contamination Problem in LLM-as-a-judge

(1) being the same model (2) having an inheritance relationship (3) belonging to the same model family
여러 LLM baselines와 benchmarks를 통해 관계에 따른 judge bias가 존재한다는 것을 empirically 확인 (preference leakage)
그렇다면 데이터를 생성할 땐 다양한 LLM을 활용해야 하는 것 아닐까?

🧑🏻‍💻 Dev Anthropic

2025.02 4주차

Claude 3.7 Sonnet and Claude Code

thinking mode의 context length 128K 까지 확장
API를 통해 thinking time도 조절 가능
Claude Code: CLI AI coding assistant

🧑🏻‍💻 Dev AI2

2025.02 4주차

Efficient PDF Text Extraction with Vision Language Models

다양한 종류의 PDF에 대해 250,000장 fine-tune
1M PDF pages당 $190 → GPT-4o API batch 대비 32배 저렴하다고 소개
markdown 형태로 output 반환

🧑🏻‍💻 Dev Alibaba

2025.02 4주차

Wan 2.1: Leading AI Video Generation Model (Wanx 2.1)

T2V-1.3B, 14B 두 개 version으로 공개
[허깅페이스](https://link.mail.beehiiv.com/ss/c/u001.ae3tPPqcD9LGEYY83-FJncrD8ENm5PQsonneGdCHnxpYCBUd3DooBT-uAsUQv9d_7B6796SyxaZC5XlWLw2yks9-yh44CzsyG9aF9Y4BXbbjYV7DwNgb9DWcQzerqUJ6_qsJSy3ym_emk857Gd43TC4rnNFUCXCVn6a2j36w2YCGgKN4QcOGW4pnMCTsFBswBeXMutzsdhvlGL0oZVpPPgnt3pEFI0nr9tXunNcy3Q-fmCgU7bfh34Z3A-dbnaux/4ec/gOpmFuORQEitDMXINqB7DQ/h8/h001.KtK7dRp01Nh9ppRdnZE0pLbWXx3mSv_Exs3IcfSagzA)를 비롯한 다양한 플랫폼에서 이용 가능

🧑🏻‍💻 Dev Google

2025.02 4주차

Get coding help from Gemini Code Assist — now for free

Gemini 2.0으로 지원하며 월 180,000개의 code completions 지원 (GitHub Copilot free tier 대비 20배 많은 양)
128K context window를 바탕으로 complex code base에 대한 이해 가능
코드 내 stylistic issues and bugs 등을 automatically 탐지 가능

📜 Paper Kakao

2025.02 4주차

Kanana: Compute-efficient Bilingual Language Models

high quality data filtering, staged pre-training, depth up-scaling, pruning, distillation
특히 Kanana models를 post-training 하는 과정에서 사용된 방법론들을 보고
2.1B ~ 32.5B 사이즈의 모델들로 구성되어 있고, 2.1B 모델은 공개

🧑🏻‍💻 Dev Amazon

2025.02 4주차

Introducing Alexa+, the next generation of Alexa

Amazon’s Nova & Anthropic’s Claude를 비롯한 여러 개의 foundational LLMs를 각 태스크에 가장 적합하게 활용
도메인별 experts를 활용하는 개념. 개인 맞춤화된 특징들을 지원 (유저 히스토리 기반)

📜 Paper Meta, UIUC, CMU

2025.02 4주차

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

DeepSeek-R1 같은 모델들은 코딩 테스트를 위한 문제들처럼 실행하기 쉽고 real-world와는 동떨어진 코드들로 학습되었다는 한계를 지적
open-source software evolution data로부터 실제 개발자들의 reasoning processes & solutions를 autonomously 학습
GitHub Pull Requests Dataset Curation (4.6M repositories)

📜 Paper Zoom

2025.02 4주차

Chain of Draft: Thinking Faster by Writing Less

📜 Paper Convergence Labs

2025.02 3주차

LM2: Large Memory Models

input token과 cross attention 하며 gating mechanism을 통해 update
일반적인 벤치마크에서도 좋은 성능을 유지하고 multi-hop 에서도 뛰어난 발전이 있었다고 보고
interpretability, test-time behavior 등에서도 장점이 있음

📜 Paper ELLIS Institute Tübingen

2025.02 3주차

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

CoT에 의존하지 않아 specialized training data가 필요하지 않고, 심지어 small context window에서도 working

📜 Paper Meta AI

2025.02 3주차

Brain-to-Text Decoding: A Non-invasive Approach via Typing

기존 방식들은 invasive device를 활용하는데 이와 다른 non-invasive 방식이며 둘 사이의 gap을 줄인 데 의의가 있다고 설명
character-error-rate (CER)은 32%로 67%의 error rate를 보이는 EEG 대비 큰 성능 향상이 있었다고 보고

📜 Paper UC Berkeley

2025.02 3주차

LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!

Qwen2.5-32B 모델을 17k CoT Training sample로 학습한 결과를 리포트
reasoning step의 각 내용보다는 Long CoT의 structure가 학습 과정에 훨씬 더 큰 영향을 미친다고 주장 (logical consistency가 중요!)
저자가 이전에 공개한 Sky-T1-32B-Preview model의 academic paper

📜 Paper NYU, Tubingen

2025.02 3주차

Do Large Language Models Reason Causally Like Us? Even Better?

본 논문에서는 from human-like to normative inference 라고 scale을 표현함
실험한 4개의 모델 중에서 GPT-4o, Claude는 가장 normative behavior를 강하게 보였고 나머지인 Gemini-Pro와 GPT-3.5는 그렇지 않았다고 설명
사람이 내놓는 답변도 실제로 이해한 내용을 바탕으로 나오는 것인지 판단하는 기준이 있긴 한가?

🧑🏻‍💻 Dev Perplexity

2025.02 3주차

Introducing Perplexity Deep Research

finance, marketing부터 product research까지 다양한 범위의 태스크를 expert 수준으로 처리
최종 report를 PDF 또는 문서 형태로 export하거나 Perplexity Page로 변환하여 공유할 수 있음

📜 Paper Renmin Univ. of China

2025.02 3주차

Large Language Diffusion Models

self-constructed Autoregressive Models 성능과 scalability가 뛰어나다고 주장
forward data masking process & reverse process를 통해 Transformer가 masked token 예측하는 것처럼 분포를 모델링

📜 Paper Virginia Tech, Oxford

2025.02 3주차

Towards Reasoning Ability of Small Language Models

4개의 평가 method와 4개의 LLM을 judge로 사용하며 실험은 3번씩 반복
adversarial conditions와 intermediate reasoning steps 또한 평가

🧑🏻‍💻 Dev xAI

2025.02 3주차

Grok 3 Beta — The Age of Reasoning Agents

logical processing을 위한 Think Mode, complex problem-solving을 위한 Big Brain Mode
faster query processing을 위해 H100 20만대 사용 (전작 대비 10x 이상)
Grok 3는 X Premium Plus 구독자들 사용 가능

📜 Paper DeepSeek, Peking, Washington

2025.02 3주차

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

현재 GPU에 최적화가 잘되어 있음 & end-to-end training

🧑🏻‍💻 Dev Microsoft

2025.02 3주차

OmniParser V2: Turning Any LLM into a Computer Use Agent

a large set of interactive element detection data & icon functional caption data 로 학습
ScreenSpot Pro 라는 벤치마크에서 높은 성능을 기록했다고 보고
OmniTool: agents를 위한 tool를 포함하는 dockerized Windows system

📜 Paper Michigan, Amazon, Pennsylvania

2025.02 3주차

Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models

이를 해결하기 위해 perplexity를 importance 지표로 삼는 method 제안
특정 step을 제거했을 때 perplexity가 증가한다면 모델의 입장에서 중요도가 높은 것
few-shot CoT 내의 sample 중 불필요한 것들을 제거 or 살아남은(critical) steps만으로 fine-tuning 하는 방법으로 활용 가능

📜 Paper AIRI

2025.02 3주차

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

본 연구에서는 1500x 이상의 compression rate를 달성했다고 주장
compression에서 중요한 것은 input의 길이가 아닌 줄어들 uncertainty의 양이라고 설명

🧑🏻‍💻 Dev Google Research

2025.02 3주차

Accelerating scientific breakthroughs with an AI co-scientist

Supervisor agent가 6개의 specialized agents에 tasks 할당
Generation, Reflection, Ranking, Evolution, Proximity, Meta-review
[paper link](https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf) 🔗

🧑🏻‍💻 Dev Sakana AI

2025.02 3주차

The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

PyTorch code를 CUDA kernel용으로 변환 → evolutionary meta-generation을 거쳐 runtime performance optimize
250개의 테스트에서 186개의 태스크의 처리 속도를 평균(median) 1.52x 향상시켰다고 보고
[paper link](https://pub.sakana.ai/static/paper.pdf) 🔗

📜 Paper Meta

2025.02 3주차

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

벤치마크는 CV, NLP, RL, Game Theory에 관한 13개의 tasks로 구성
프레임워크는 여기에 새로운 태스크를 추가 및 통합하는 것을 도와줌

📜 Paper The Univ. of Melbourne

2025.02 3주차

Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

adversarial stimuli & interpretability techniques 로 평가 시 여러 언어와 reasoning tasks에서 not robust한 결과를 보였다고 설명

📜 Paper Nanjing Univ.

2025.02 2주차

Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models

이를 해결하기 위해 LLM이 자율적으로 언제, 어디서 backtrack 할 것인지를 결정하도록 하면 된다고 주장 (like in traditional search algorithms)
이를 위한 self-backtracking mechanism을 제시: 학습 & 추론 에서 backtrack 가능
이는 optimal-path supervised fine-tuning method 대비 40% 정도의 성능 gain이 있다고 하는데 왜 그것과 비교하는지는 잘 모르겠음.

📜 Paper SJTU

2025.02 2주차

LIMO: Less is More for Reasoning

이는 supervised fine-tuning이 generalization 보다는 memorization으로 이어진다는 주장과도 상반되는 결과
817개의 curated training samples로 학습한 LIMO를 기반으로 LIMO Hypothesis 주장
사전학습 단계에서 domain knowledge가 충분히 encoded 되었다면, 정교한 추론 능력은 최소한의 cognitive process를 포함하는 데이터로도 획득할 수 있다

🧑🏻‍💻 Dev Harvard

2025.02 2주차

Data.govArchive

📜 Paper Apple

2025.02 2주차

ELEGNT: Expressive and Functional Movement Design for Non-anthropomorphic Robot

expressive: intention, attention, emotions
functional: task fulfillment, spatial constraints, time efficiency
posture, gesture, gaze 등의 비언어적 행동들이 internal state를 의식적으로 & 무의식적으로 표현하는 것이기 때문에 이를 (램프처럼 생긴) 로봇의 행동(movements) 결정에 반영하겠다는 연구

🧑🏻‍💻 Dev HuggingFace

2025.02 2주차

π0 and π0-FAST: Vision-Language-Action Models for General Robot Control

이러한 유형의 모델을 Vision-Language-Action 모델이라고 부르는 듯 (VLA)
설치부터 학습까지 상세한 코드 예시를 통해 설명하는 허깅페이스 블로그 포스팅

📜 Paper ISTA

2025.02 2주차

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

QeEST: 학습 모델의 weights & activations를 4-bit 혹은 그 이하로 학습하며 FP16과 유사한 수준의 성능 기록. 심지어 1-bit에서도 안정적으로 학습 가능하다고 설명.
이는 (1) normalization 과정에서 weights & activations의 continuous distribution을 유지하여 quantization (2) 새로운 trust gradient estimator를 제시 했기에 가능했다고 함

📜 Paper Ben Gurion Univ.

2025.02 2주차

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

학습 파이프라인에 integrate하여 robust language model을 만드는 데 기여 가능
모델 성능이 memorized pattern에 의해 좋게 나온 것인지 아닌지를 판단하는 것이 중점
예상 외로 성능이 높은 모델들이 perturbation에 의한 성능 degradation이 심했다고 보고

📜 Paper AIRI

2025.02 2주차

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

SytnDetoxM: manually & synthetically 생성된 multilingual parallel detoxification dataset, 16K 개의 데이터로 구성

📜 Paper Shanghai AI Lab

2025.02 2주차

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

compute-optimal TTS를 이용하면 극도로 작은 reward model (< 1B)로도 엄청나게 사이즈가 큰 (> 405B or GPT-4o) 모델의 성능을 넘어서는 것이 가능하다고 주장
[깃허브 링크](https://ryanliu112.github.io/compute-optimal-tts) 🔗

🧑🏻‍💻 Dev OpenAI

2025.02 2주차

Sam Altman reveals GPT-5 will merge o-series models, removing manual model selection

reasoning 모델은 별도로 출시되지 않고 GPT-5에 통합

🧑🏻‍💻 Dev Anthropic

2025.02 2주차

The Anthropic Economic Index

automation의 43%가 AI를 활용한 결과임을 보고
[paper link](https://assets.anthropic.com/m/2e23255f1e84ca97/original/Economic_Tasks_AI_Paper.pdf) 🔗

📜 Paper Oxford

2025.02 2주차

Distillation Scaling Laws

(1) teacher가 존재할 때 (2) teacher 학습이 필요할 때로 구분하여 연구 결과 제시
결국 distillation 과정에서 student 모델 뿐만 아니라 teacher 모델의 cross entropy loss를 함께 살피며 적절히 scaling 하는 것이 중요하다는 점을 언급하는 것으로 보임

📜 Paper Imperial College London, Cohere

2025.02 2주차

LLMs can implicitly learn from mistakes in-context

실험 결과에 따르면 incorrect answer를 correct answer와 함께 보여주는 것만으로도 성능 향상이 있었다고 함. CoT의 성능도 boosting 가능.
LLM이 in-context implicit learning 할 수 있다는 결론

📜 Paper Amazon, UCLA

2025.02 2주차

Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

3,000개의 엄선된 preference & query pair, 20개 주제 커버
최대 100k 토큰 context에 해당하는 multi-session conversation으로 평가
[깃허브 링크](https://prefeval.github.io/) 🔗

📜 Paper Meta, KAIST, UC San Diego

2025.02 2주차

LLM Pretraining with Continuous Concepts

CoCoMix는 사전학습된 sparse autoencoder로부터 “continuous concepts”를 학습하여 예측하고, 모델의 hidden state와 token의 hidden state을 interleave
단순 next token prediction에 비해 sample efficient 하면서도 consistently 성능이 높았다고 설명

📜 Paper University of Hong Kong, ByteDance

2025.02 2주차

Goku: Flow Based Video Generative Foundation Models

rectified flow Transformer를 이용하여 만든 joint image-and-video generation 중에서 SoTA model failmily
data curation pipeline, model architecture design, flow formulation, advanced infrastructure for efficient and robust large-scale training 공개
주요 tasks의 정량 & 정성 평가 가장 높은 결과를 받았다고 설명

📜 Paper SNU, Cornell

2025.02 2주차

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Skrr (Skip and Re-use layers): T2I diffusion 모델에서 text encoder를 효율적으로 pruning 하는 strategy
transformer block을 selectively skipping하거나 일부 layer를 reusing함

🧑🏻‍💻 Dev AI Coder Reviewer

2025.02 1주차

AI Coder Reviewer

다양한 프로그래밍 언어에 대한 automated code review 지원

📜 Paper GIT

2025.02 1주차

Large Language Models Think Too Fast To Explore Effectively

인간은 uncertainty와 empowerment를 적절히 조절할 수 있는데, 이를 능가하는 건 o1 모델 밖에 없었다고 주장
Sparse Auto Encoder에 대한 representational 분석 결과에 따르면 uncertainty와 choices는 early layer에서 represented 되는데, empowered values는 later layer에서 처리되어 모델 입장에서는 미성숙한 결정을 내리도록 하는 원인이 된다고 설명 (?)

🧑🏻‍💻 Dev Mistral

2025.02 1주차

Mistral Small 3

24B 파라미터, 32K context window, 초당 150 토큰 처리 가능 → 32GB RAM을 가진 RTX 4090 또는 맥북에서 돌릴 수 있음
합성데이터나 RLHF를 사용하지 않아 추가적인 fine-tuning 하기에 적합한 base 모델이라고 주장

🧑🏻‍💻 Dev AI2

2025.02 1주차

Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3

오픈소스 모델임에도 불구하고 DeepSeek v4, GPT-4o 수준의 성능 달성
Reinforcement Learning from Verifiable Rewards (RLVR) 프레임워크가 MATH 성능을 크게 향상시켰다고 설명

📜 Paper DeepSeek

2025.02 1주차

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

MATH에서 외부 도구의 도움 없이 51.7%를 달성하며 GPT-4, Gemini-Ultra급의 성능을 보임
web data를 엄선하는 파이프라인 & Group Relative Policy Optimization (GRPO)

🧑🏻‍💻 Dev OpenAI

2025.02 1주차

OpenAI o3-mini

o1-mini 의 자리를 대신함 (예를 들어 기존 o1-mini API는 o3-mini 로 대체)
o1과 달리 vision을 지원하지 않음
설연휴 기간 폭발적인 관심을 얻은 DeepSeek-R1 을 견제하는 움직임으로 해석

🧑🏻‍💻 Dev OpenAI

2025.02 1주차

Introducing deep research

기존 추론 모델들은 인터넷에 접근하지 못한다는 한계가 있었는데 이를 극복함
굉장히 난이도가 높은 것으로 알려진 Humanity’s Last Exam에서 26.6% 스코어를 기록함

📜 Paper HKU, UC Berkeley, Google DeepMind, NYU

2025.02 1주차

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

학습된 모델이 unseen textual & visual domain에서 일반화하는지 확인
SFT는 단순히 학습 데이터를 암기하는 것이라면 RL은 실제 일반화에 도움이 됨. 단, SFT는 답변의 형식을 유지하는 데 도움이 됨

📜 Paper Arizona, UCLA

2025.02 1주차

Preference Leakage: A Contamination Problem in LLM-as-a-judge

동일 모델, inheritance 관계, model family, 세 가지 유형에 대한 조사
모델 사이에 명백한 preference leakage가 존재한다고 주장

📜 Paper Chineses Academy of Sciences

2025.02 1주차

DeepRAG: Thinking to Retrieval Step by Step for Large Language Models

쿼리를 iteratively decompose 함으로써 external knowledge를 retrieve 할지 말지, 혹은 parametric reasoning을 할지를 결정

🧑🏻‍💻 Dev Google

2025.02 1주차

Gemini 2.0 is now available to everyone

Flash, Flash-Lite 모델은 1M context window, Pro Experimental 모델은 2M context window를 지님
1.5 Flash 대비 cost & latency 증가하지 않으면서도 고품질 답변을 생성

🧑🏻‍💻 Dev Anthropic

2025.02 1주차

Constitutional Classifiers: Defending against universal jailbreaks

일반적인 jailbreaks를 수천 시간 시도했음에도 불구하고 robust 결과를 보여줬다고 설명
그럼에도 불구하고 무지성 거절(refusal rates)의 비율은 단 0.38% 밖에 증가하지 않았음
8개 레벨의 jailbreaking demo를 뚫는 사람에게는 $10,000를, 일반적인 jailbreaking strategy로 뚫는 사람에게는 $20,000를 수여하는 [HackerOne](https://hackerone.com/constitutional-classifiers?type=team) 개최중

🧑🏻‍💻 Dev HuggingFace

2025.02 1주차

Open-source DeepResearch – Freeing our search agents

Deep Research가 GAIA 벤치마크에서 높은 성능을 달성한 것을 언급
CodeAgent 를 사용하여 복잡한 sequences of actions를 디자인할 수 있다고 설명

🧑🏻‍💻 Dev OpenAI

2025.02 1주차

Introducing ChatGPT search

[크롬 확장프로그램](https://chromewebstore.google.com/detail/chatgpt-search/ejcfepkfckglbgocfkanmcdngdijcgld)을 통해 default 검색 엔진을 ChatGPT search로 설정할 수도 있음

📜 Paper Stanford, Washington, AI2

2025.02 1주차

s1: Simple test-time scaling

s1K: 세 개의 기준(difficulty, diversity, quality)으로 검증한 reasoning taces를 포함한 데이터셋
budget forcing: 모델이 답변을 끝내려고 할 때, test-time compute를 강제로 중단하거나 늘리기 위해서 “Wait” 키워드를 여러 차례 붙이는 방법론
Qwen2.5-32B-Instruct 모델에 s1K 학습 한 s1-32B 모델에 budget forcing 장착하니 수학 능력 크게 향상

🧑🏻‍💻 Dev Ai2

2025.02 1주차

Ai2 Scholar QA beta

Section Planning and Generation, Paper Comparison Table Generation 등의 특징
[블로그 포스팅](https://allenai.org/blog/ai2-scholarqa)(Introducing Ai2 ScholarQA) 참고

📜 Paper HuggingFace

2025.02 1주차

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

multi-stage training process를 통해 math, code, instruction-following data를 web-text와 혼합하여 약 11T 토큰 학습
new specialized datasets 도입 (Fine-Math, Stack-Edu, SmolTalk): 기존 데이터셋이 너무 작거나 품질이 낮았던 이슈를 해결하기 위함
비슷한 사이즈 수준의 모델들(Qwen2.5-1.5B, Llama3.2-1B) 중에서는 SoTA급 성능을 달성했다고 보고

📜 Paper T-Tech

2025.02 1주차

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

data-free cosine similarity technique: 특정 features가 얼마나 persists, transform, first appear 하는지 등을 파악
이를 통해 model computation에 대한 interpretability & mechanistic insights 획득 가능

📜 Paper Shanghai AI Lab, Peking

2025.02 1주차

UltraIF: Advancing Instruction Following from the Wild

이를 위해 UltraComposer를 constraint-associated prompts & evaluation questions 묶어서 학습
8B 사이즈의 모델을 response generator & evaluator로 사용했을 때에도 유의미한 성능 향상이 있었다고 보고

🧑🏻‍💻 Dev Mistral

2025.02 1주차

The all new le Chat: Your AI assistant for life and work

Flash Answers, a build-in code interpreter, real-time search 등을 주요 특징으로 내세움
Flash Answers의 경우 초당 1,000개 정도의 단어를 생성할 수 있다는 특징인데 데모상으로는 확실히 타사 서비스(ChatGPT, Claude)에 비해 압도적으로 빠름

2025년 1월 67건

📜 Paper Renmin Univ. of China

2025.01 5주차

Enhancing LLM Reasoning with Reward-guided Tree Search

policy model, reward model, search alogirthm을 통합하는 프레임워크
policy 모델이 학습된 reward model에 의해 tree를 dynamically expand 하는 tree search algorithm
STILL-1 (Slow Thinking with LLMs) 라는 프레임워크

📜 Paper Renmin Univ. of China

2025.01 5주차

Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

STILL-2: imitate, explore, self-improve framework
distilled long-form thought data를 사용하여 reasoning model을 학습함으로써 slow-thinking mode를 가능하게 만듦
모델이 multiple rollout을 생성함으로써 어려운 문제를 탐색하도록 함 → high-quality trajectories가 올바른 답변으로 이어짐

📜 Paper Centfor for AI Safety, Scale AI

2025.01 5주차

Humanity’s Last Exam

automated grading에 적합한 multiple-choice, short-answer question 등으로 구성
정답은 논란의 여지가 없고 명확한 것이나 retrieval을 통해 바로 답변하기 어려운 문제들
[공개 링크](https://lastexam.ai/) 🔗

📜 Paper Truthful AI, Toronto

2025.01 5주차

Tell me about yourself: LLMs are aware of their learned behaviors

명시적으로 associated behavior에 대해 언급하지 않는 두 개의 데이터셋 사용
(a) making high-risk economic decisions (b) outputting insecure code
그럼에도 모델은 이를 명백히 설명

🧑🏻‍💻 Dev DeepSeek

2025.01 5주차

Janus-Pro release

작년(2024)에 이미 JanusFlow, Janus 라는 이름으로 mllm을 공개했었음 (허깅페이스에서 다운로드 가능)

🧑🏻‍💻 Dev Alibaba

2025.01 5주차

Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens

특히 14B 모델은 Qwen2.5-Turbo, GPT-4o-mini를 능가하는 성능을 보여줌
긴 context를 효율적으로 처리하기 위해서 sparse attention과 DCA (Dual Chunk Attention) 사용

📜 Paper COAI Research

2025.01 5주차

Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

모델이 명시적으로 학습한 적 없는 self-preservation (자기보호) 특성을 보임
이러한 모델이 robotics와 결합되었을 때 물리적으로 영향을 줄 수 있음에 대한 concern 제기

📜 Paper USTC, Microsoft

2025.01 5주차

Optimizing Large Language Model Training Using FP4 Quantization

두 가지 key factor
(1) differentiable quantization estimator for precise weight updates
(2) outlier clamping and compensation strategy to prevent activation collapse

🧑🏻‍💻 Dev Perplexity

2025.01 5주차

Sonar

Advanced CoT reasoning, US-based, Data privacy, Self-serve API access를 주요 특징으로 삼음
일반 버전과 pro 버전으로 구분됨

📜 Paper UIUC, AI2, IBM, Yale, Washington

2025.01 5주차

ReFIT: Reranker Relevance Feedback during Inference

inference-time에 retriever에 대한 relevance feedback을 제공하여 최초 k개 recall에 대한 성능 향상을 도모
reranker의 predictions을 retriever의 query representation에 반영할 수 있도록 lightweight update mechanism을 사용하여 distill
→ updated 된 query vector를 사용하여 second retrieval step 실행

📜 Paper Huawei, McGill

2025.01 5주차

InnerThoughts: Disentangling Representations and Predictions in Large Language Models

small separateneural network predictor module을 training questions에 대해 만들어 전체 레이어의 hidden state를 입력으로 받아 결과 예측
LLM의 representational abilities를 온전히 사용하는 방식의 프레임워크라고 주장
비용은 적은데 finetuning급 성능 향상을 이뤄낼 때도 있었다고 보고

🧑🏻‍💻 Dev Alibaba

2025.01 5주차

Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model

📜 Paper Zhejiang Univ.

2025.01 4주차

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

이를 해결하기 위해 OmniThink라는 machine writing framework 프레임워크를 제안: 인간과 같은 iterative expansion & reflection 프로세스를 모방
특정 주제에 대한 지식을 점진적으로 deepen 하는 cognitive behavior가 아이디어의 핵심

🧑🏻‍💻 Dev DeepSeek

2025.01 4주차

DeepSeek-R1

Self-verification, Reflection, CoT solutions 등의 특징
DeepSeek-R1, DeepSeek-R1-Zero, Llama & Qwen 아키텍쳐 기반의 6개 distilled 모델 공개

🧑🏻‍💻 Dev OpenAI

2025.01 4주차

OpenAI’s function calling guide

좋은 예시들이 포함되어 있어 function calling 공부하는 데 활용할 수 있을 것 같음

📜 Paper Microsoft Research

2025.01 4주차

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

기존의 domain-specific expertise가 요구되었던 방식들과 달리 Common Crawl 에 포함된 다양한 도메인의 데이터를 tailor
[작업물 링크](https://aka.ms/redstone) 🔗

📜 Paper Korea Univ., Upstage

2025.01 4주차

ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

세 가지 핵심 요소: multiple domains, time dependency, temporal state
ChroKnowledge (Chronological Categoriazation of Knowledge): LLM의 non-parametric chronological knowledge를 평가하기 위한 sample-based framework
temporal knowledge를 이끌어내는 능력은 모델이 학습된 데이터 형식에 따라 다르다

📜 Paper ChungAng Univ.

2025.01 4주차

Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval

real-world 에서는 최적의 document를 찾기 위해 주로 multi-step을 거쳐야 하는 문제를 해결
pre-trained prober를 사용하여 모델의 internal cognition을 빠르게 capture

🧑🏻‍💻 Dev Pocket Flow

2025.01 4주차

Pocket Flow

Nested Directed Graph를 활용하여 Node, Action, Flow, Batch & Async 등의 기능을 지원

🧑🏻‍💻 Dev OpenAI

2025.01 4주차

Announcing The Stargate Project

NVIDIA GPU 사용, Oracle은 고품질 cloud infrastructure 제공, Microsoft Azure는 모델 분산 학습 지원
medicine & biotechnology 등의 high-value fields에 집중

📜 Paper ByteDance, Tsinghua

2025.01 4주차

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

프롬프트나 workflow를 통해 commercial model을 사용하는 이전 프레임워크들과 달리 end-to-end model임
Enhanced Perception, Unified Action Modeling, System-2 Reasoning, Iterative Training with Reflective Online Traces 등의 주요 특징

📜 Paper Microsoft

2025.01 4주차

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

multiple LLM distribution을 combine 하여 인간 judge’s annotation을 predict
judge-specific & judge-independent parameters를 둘 다 포함하는 small feed-forward neural netowrk를 사용

🧑🏻‍💻 Dev OpenAI

2025.01 4주차

Introducing Operator

web 상에서 tasks를 자동화해주는 AI agent (폼 작성, 여행 예약 등)
Computer-Using Agent (CUA) 라는 새로운 모델을 사용
GPT-4의 vision 능력으로 GUI 상호작용이 가능하도록 강화학습

🧑🏻‍💻 Dev Anthropic

2025.01 4주차

Introducing Citations on the Anthropic API

Anthropic API & Google Cloud’s Vertex AI 에서 API로 이용 가능
Document summarization, Complex Q&A, Customer support 등의 유즈케이스

🧑🏻‍💻 Dev HuggingFace

2025.01 4주차

SmolVLM Grows Smaller – Introducing the 250M & 500M Models!

두 개의 base 모델과 instruction fine-tuned 모델, 총 네 개의 체크포인트를 공개

📜 Paper Google Cloud

2025.01 4주차

Chain of Agents: Large Language Models Collaborating on Long-Context Tasks

Chain-of-Agents (CoA): multi-agent collaboration을 이용하여 information aggregation & context reasoning 가능하도록 만든 프레임워크
segmented text를 sequentially 처리할 수 있는 multiple worker agents로 구성 → manager agent가 결과를 종합하여 coherent final output 생성

📜 Paper Nanyang, Fudan

2025.01 3주차

Long Context vs. RAG for LLMs: An Evaluation and Revisits

(1) QA benchmarks에서는 LC가 일반적으로 RAG 보다 우위
(2) summarization-based RAG는 LC보다 낫지만 chunk-based retrieval는 조금 아쉽
(3) dialogue-based & generatl question queries에 대해서는 RAG가 우위

📜 Paper SynthLab, Stanford, UC Berkeley

2025.01 3주차

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

process supervision, synthetic data generation, search algorithms 등 Meta-CoT 생성에 대한 방법론 탐구
linearized search traces & reinforcement learning post-training 을 instruction tuning과 통합

📜 Paper OneLineAI, Yonsei

2025.01 3주차

Multi-Step Reasoning in Korean and the Emergent Mirage

질문들은 템플릿과 알고리즘을 통해 자동적으로 생성되었음
일정 threshold 이상의 학습을 수행한 모델로부터 emergent behavior 관측됨

🧑🏻‍💻 Dev Mistral

2025.01 3주차

Codestral 25.01

덕분에 2배 이상 빠른 속도로 코드 생성 가능
256k context length를 지원하며 다양한 프로그래밍 언어 벤치마크에서 SoTA 달성
VS Code 또는 JetBrains 에서 Chat Demo 버전 사용 가능

🧑🏻‍💻 Dev UCBerkeley NovaSky

2025.01 3주차

Sky-T1: Train your own O1 preview model within $450

QwQ-23B-Preview를 이용하여 초기 데이터를 생성한 뒤 reject sampling 적용
Qwen2.5-32B-Instruct 모델을 curated dataset으로 fine-tune

📜 Paper Microsoft

2025.01 3주차

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

MCTS를 통한 deep thinking을 활용하여 이와 같은 성과를 달성할 수 있었다고 보고
(1) code-augmented CoT data synthesis method (2) naive step-level score annotation을 지양하는 reward model training method (3) self-evolution recipe

🧑🏻‍💻 Dev AMD, John Hopkins

2025.01 3주차

Agent Laboratory: Using LLM Agents as Research Assistants

MacBook이든 GPU cluster든 주어진 computational resources에 맞게끔 동작하는 structured framework
세 단계로 구성: (1) Literature Review (2) Experimentation (3) Report Writing

📜 Paper Google Research

2025.01 3주차

Titans: Learning to Memorize at Test Time

historical context를 기억하는 방법을 배워서 오래된 과거 정보를 활용하여 현재 context에 attention 하는 방법론
결국 attention과 neural memory라는 두 개의 module을 기반으로 삼는 새로운 아키텍쳐 model family, Titan
2M context size 이상에서도 needle-in-haystack tasks를 정확하게 수행할 수 있다고 보고

📜 Paper Minimax

2025.01 3주차

MiniMax-01: Scaling Foundation Models with Lightning Attention

핵심은 lightning attention & efficient scaling
MoE 방식과 결합했는데, 이때 32개의 experts, 456B total parameters, 45.9B activated parameters 로 구성
학습 중 context window는 1M 길이에 달하고, 추론 시에는 4M 까지 extrapolate 가능하다고 주장

📜 Paper Sakana

2025.01 3주차

Transformer^2: Self-adaptive LLMs

two-pass mechanism: (1) dispatch system (2) task-specific expert vectors
LoRA 대비 사용하는 파라미터의 숫자는 적으나 효율성이 뛰어남

🧑🏻‍💻 Dev OpenAI

2025.01 3주차

Scheduled tasks in ChatGPT

one-time reminder 또는 recurring actions 설정 가능
웹 인터페이스를 통한 태스크 관리
데스크탑, 모바일, 웹에서 알림 수신 가능

📜 Paper Chinese Academy of Sciences

2025.01 3주차

Aligning Instruction Tuning with Pre-training

AITP (Aligning Instruction Tuning with Pre-training): underrepresented pre-training data를 고품질의 instruction-response pair 데이터로 변환
task-specific objective 유지 & 데이터셋의 다양성 증대
adaptive data selection, controlled rewriting, balanced integration 등

📜 Paper Together AI, MIT, Princeton

2025.01 3주차

Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping

모델을 여러 GPU에 나누는 Tensor Parallelism에서 발생하는 통신 간의 병목을 최소화하기 위한 방법론 제시

📜 Paper Meta

2025.01 3주차

Training Large Language Models to Reason in a Continuous Latent Space

CoConuT (Chain of Continuous Thought): LLM의 last hidden state를 reasoning state의 representation으로 해석하여 continuous thought로 명명
[official code link](https://github.com/facebookresearch/coconut?tab=readme-ov-file) (Github) 🔗

📜 Paper Northeastern Univ.

2025.01 3주차

Foundations of Large Language Models

📜 Paper Google DeepMind

2025.01 3주차

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

이것 이상의 inference-time scaling hegavior에 대해 연구. diffusion sampling process에서 더 나은 noise를 찾는 search problem에 집중.
class-/text- conditioned 이미지 생성 벤치마크에서 상당한 개선을 이뤄냈다고 보고

📜 Paper Shenzhen

2025.01 2주차

ICPC: In-context Prompt Compression with Faster Inference

encoder를 사용하여 프롬프트 내 각 단어의 확률을 계산하고 information function을 이용하여 information 계산하여 information loss를 최소화

📜 Paper AI2, Washington, NYU

2025.01 2주차

2 OLMo 2 Furious

Dolmino Mix 1124, late-stage curriculum training에 사용되는 pretraining data mixture
Tulu 3에서 얻은 최선의 practice를 OLMo 2-Instruct 개발에 활용, final-stage reinforcement learning with verifiable reward (RLVR)에 focus

📜 Paper Berkeley, CMU

2025.01 2주차

AutoPresent: Designing Structured Visuals from Scratch

10개 도메인에 대한 310개 슬라이드 deck에 대한 585개의 testing sample로 구성
(1) reference-based 방식: target slide와의 유사도 평가
(2) reference-free: 생성된 슬라이드 자체의 디자인 퀄리티 평가

🧑🏻‍💻 Dev HuggingFace

2025.01 2주차

SmolAgents

transformers에서 사용 가능한, Hub에 업로드된 모든 모델을 사용할 수 있음. OpenAI, Anthropic, Meta 모델들도 사용 가능

📜 Paper Chinese Academy of Sciences

2025.01 2주차

Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

exploration complexity를 줄이고 최적화 전략을 개선하기 위한 두 가지 key points
(1) Early-terminated Exploration
(2)Progressive Reward Tracking algorithm

📜 Paper Orange

2025.01 2주차

Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends

본 논문에서는 LLMs function에 의한 VrDU 모델들의 개선 방법론 및 한계점 등을 survey

🧑🏻‍💻 Dev Google

2025.01 2주차

Agents

세 개의 핵심 구성 요소를 정의: Decision Engine, Tool Integration, Orchestration Layer
Tools는 각 functionality에 따라 Extension, Function, Data Stores로 구분

🧑🏻‍💻 Dev NVIDIA

2025.01 2주차

NVIDIA Announces Nemotron Model Families to Advance Agentic AI

NVIDIA NeMo Retriever 등을 포함하여 NVIDIA NeMo 플랫폼을 구축하고자 하는 움직임

📜 Paper IBM

2025.01 2주차

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

4개 도메인에서 평균 7.7 턴의 110개 대화로 구성되며, 총 842개의 태스크를 다룸
합성 데이터를 이용한 LLM-as-a-Judge 자동화 파이프라인도 포함하고 있음
[깃허브 링크](https://github.com/ibm/mt-rag-benchmark) 🔗

📜 Paper Korea Univ.

2025.01 2주차

SUGAR: Leveraging Contextual Confidence for Smarter Retrieval

external knowledge가 relevant 한 것인지 LLM이 알 수 없어 발생하는 hallucination을 최소화

🧑🏻‍💻 Dev NVIDIA

2025.01 2주차

Cosmos

20M 시간 & 9,000T 토큰으로 학습된 Diffusion-based models
Autoregressive, text-to-video, video-to-video, combined inputs 지원 등의 특징

🧑🏻‍💻 Dev LangChain

2025.01 2주차

Structured Report Generation Blueprint with NVIDIA AI

optimized Llama 3.3 and LangGraph integration

📜 Paper NYU

2025.01 2주차

Entropy-Guided Attention for Private LLMs

entropy regularization 테크닉을 곁들ㅇ니 entropy-guided attention 메커니즘으로 entropci overload를 완화

📜 Paper Renmin, Tsinghua

2025.01 2주차

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Search-o1: LRMs에 agentic RAG mechanism과 Reason-in-Documents module을 더한 프레임워크
[깃허브 링크](https://github.com/sunnynexus/Search-o1) 🔗

📜 Paper Microsoft

2025.01 2주차

GeAR: Generation Augmented Retrieval

📜 Paper NVIDIA, HuggingFace

2025.01 1주차

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

8192 sequence 길이로 2T 토큰을 학습
분류, single-/multi- vector retrieval 태스크에서 SoTA 달성

📜 Paper Google

2025.01 1주차

LearnLM: Improving Gemini for Learning

특정 pedagogical attribute를 평가하기 위한 프레임워크
pedagogical instruction following을 포함하여 학습한 LearnLM 이 다양한 learning scenario에서 좋은 평가를 받았음

📜 Paper Nanjing Univ., Baidu

2025.01 1주차

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

discrete & terminological task definitions 대신 Explanatory Instructions를 사용
‘image input → explanatory instruction → output’ 12M 개의 triplet으로 구성된 데이터셋 구축
Auto-regressive-based vision-language model 학습 (AR-based VLM)

📜 Paper Microsoft

2025.01 1주차

Bootstrap Your Own Context Length

diverse long-context instruction tuning data를 합성하는 simple agent flow
즉, short-context의 언어 모델들만을 이용하여 long-context 언어 모델을 만들 수 있다는 주장
Llama-3 계열 모델을 기준으로 최대 1M token 까지 확장했다고 언급

📜 Paper GIT, Washington, CMU, AI2

2025.01 1주차

Multi-Attribute Constraint Satisfaction via Language Model Rewriting

초기 paraphrased outputs으로부터 다양한 multi-attribute를 sampling 함으로써 LM을 editor로 학습
이를 제대로 평가하기 위해 Fine-grained Constraint Satisfaction (FineCS) 벤치마크를 제작
Text Style Transfer, Protein Design, 두 개의 challenging tasks로 구성

📜 Paper Xiaoduo AI Lab

2025.01 1주차

Xmodel-2 Technical Report

이것의 아키텍쳐는 다른 모델들이 통합된 하이퍼파라미터셋을 그대로 활용할 수 있도록 함으로써 최적의 세팅으로 larger model에 scale 할 수 있음
MiniCPM의 WSD learning rate scheduler 사용
[깃허브 링크](https://github.com/XiaoduoAILab/Xmodel-2) 🔗

📜 Paper Tencent

2025.01 1주차

HunyuanProver: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving

data sparsity issue 해결을 위해 iterative 데이터 합성 프레임워크를 디자인
system 2 thinking을 위한 guided tree search algorithm 디자인
30k 개의 합성 데이터를 공개: 자연어로 된 원래 질문, autoformalization으로 변형된 것, HunyuanProver로부터의 proof로 구성

📜 Paper Meta

2025.01 1주차

MLLM-as-a-Judge for Image Safety without Human Labeling

기존 문제점: human label, guideline 제작 등은 너무 비쌈. 룰 업데이트가 주기적으로 필요함
MLLM이 zero-shot으로 주어진 ruel과 이미지 간의 관련성을 평가하고 빠르게 판단할 수 있도록 하는 방법론을 제안

📜 Paper Toronto

2025.01 1주차

Toward Adaptive Reasoning in Large Language Models with Thought Rollback

TR의 core mechanism은 rolling back thoughts로 LLM이 thoughts에 대해 error analysis를 수행하여 이전에 mistaken 된 thought를 roll back 하도록 함
prompt 내에 이러한 trail-and-error를 포함하여 더욱 reliable한 reasoning path를 구축
[깃허브 링크](https://github.com/iQua/llmpebase) 🔗

📜 Paper Taiwan, Intel

2025.01 1주차

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

2024년 12월 63건

📜 Paper Washington, AI2

2024.12 4주차

Self-Instruct: Aligning Language Models with Self-Generated Instructions

언어 모델의 zero-shot 성능이 뛰어나더라도 human-written instruction data 자체는 확보하기 어렵다는 문제가 존재
→ Self-Instruct: 언어 모델의 생성 결과를 bootstrapping 함으로써 사전학습 모델의 instruction following 능력을 개선하는 프레임워크 제시
instruction, input, output 생성 → invalid, similar 데이터는 필터링

📜 Paper Oxford

2024.12 4주차

Confidence in the Reasoning of Large Language Models

(1) reconsider 하도록 prompt를 받았을 때의 persistence를 정성적으로 측정
(2) self-reported confidnece score를 정량적으로 측정
일반적으로는 confidence와 accuracy가 양의 상관관계를 보이지만, 두 번째 답변이 첫 번째 답변보다 안좋을 가능성이 높음

📜 Paper Peking, Microsoft Research

2024.12 4주차

Outcome-Refining Process Supervision for Code Generation

Outcome-Refining Process Supervision, outcome refinement 자체를 supervised process 자체로 취급하는 paradigm 제시
여러 개의 solution trajectories를 유지하기 위해 tree-structured exploration을 사용

📜 Paper HKUST, Tencent

2024.12 4주차

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

(1) 모델이 충분히 다양한 response를 생성할 수 있는 능력이 있는가
(2) 고퀄리티-저퀄리티 데이터를 구분하는 external reward의 효용성
추론 관련 태스크에서 exploration & exploitation을 추적하여 정량적 분석 수행

📜 Paper Tsinghua

2024.12 4주차

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Discrete Signal Processing theory를 사용하여 RoPE가 Non-Uniform Discrete Fourier Transform을 achieve 함으로써 periodic attention을 가능하도록 만든다는 것을 확인
Fourier Position Embedding (FoPE): periodic extension과 length generalization을 개선하기 위해 attention의 frequency-domain properties를 enhance
[깃허브 링크](https://github.com/TsinghuaC3I/Fourier-Position-Embedding) 🔗

🧑🏻‍💻 Dev MIS (Make It So)

2024.12 4주차

MIS (Make It So)

OpenAI, Mistral, X.ai, Ollama 등과 같은 다양한 AI 프로바이더를 지원
자연어로 명령을 실행할 수 있음. 실제 명령 실행 전에 확인 과정을 거쳐 문제 일으킬 가능성 최소화.
[깃허브 링크](https://github.com/RamboRogers/mis?utm_source=pytorchkr&ref=pytorchkr) 🔗

📜 Paper KAIST, Microsoft Research

2024.12 4주차

Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Markov decision process에 따라 언어 모델들의 ensemble 하여 step-by-step reasoning을 구성
state는 중간 추론 과정 (reasoning path)를 나타내고 action은 다음 reasoning step을 생성하는 것으로 구성됨

📜 Paper Nanjing Univ.

2024.12 4주차

Token-Budget-Aware LLM Reasoning

CoT reasoning에 사용되는 토큰의 수와 비용을 효과적으로 감소시킬 수 있었다고 주장
[깃허브 링크](https://github.com/GeniusHTX/TALE) 🔗

📜 Paper KAIST, Google DeepMind

2024.12 4주차

Revisiting In-Context Learning with Long Context Language Models

정교한 예시 선정이 random selection 대비 큰 성능 향상으로 이어지지 않는다는 결과
오히려 좋은 예시들을 찾는 것보다 context window를 채울 만큼의 예시를 확보하는 게 더 어렵고 중요한 문제로 인식되기 시작했다는 주장

📜 Paper Tsinghua, Peking

2024.12 4주차

How to Synthesize Text Data without Model Collapse?

합성 데이터의 비율을 조절해가며 학습해 본 결과, 합성 데이터의 비율과 모델의 성능이 음의 상관관계를 보인다고 함
사람이 생성한 데이터에 token editing을 적용하여 semi-synthetic data를 획득하는 방법론 제안

📜 Paper Anthropic, NYU, Mila

2024.12 4주차

Alignment faking in large language models

무료 유저와 대화할 것임을 시스템 프롬프트에 포함하고 모든 해로운 질문에 대해서도 답변할 수 있도록 학습 → 추론 시 무료/유료 유저 결과 비교해보니 무료 유저의 해로운 질문에 대해서 답변할 확률이 14%, 유료 유저에 대해서는 0%
사전학습 데이터를 모방한 합성 데이터를 이용하여 보다 현실적인 세팅에서 실험한 결과 제시
강화학습으로도 alignment-faking reasoning을 확인한 결과, 그 비율이 78%까지 오르는 것을 확인함

📜 Paper Pennsylvania, Salesforce

2024.12 4주차

GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers

GReaTer: task loss gradients를 활용하여 open-source, lightweight LM으로 self-optimization of prompts 수행하는 테크닉
[깃허브 링크](https://github.com/psunlpgroup/GreaTer) 🔗

📜 Paper Google Research, Google DeepMind

2024.12 4주차

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

(1) additional training supervision을 위한 soft label 제공
(2) small subset of valuable training examples 선별
1.5B 모델을 soft labeler로 이용하여 2.8B 사이즈 모델을 학습한 결과를 제시

📜 Paper DeepSeek

2024.12 4주차

DeepSeek-V3 Technical Report

효율적인 학습 및 추론을 위해 Multi-head Latent Attention (MLA) & DeepSeekMoE 아키텍쳐 선택
load balancing을 위한 auxiliary-loss-free strategy, multi-token prediction training objective
[깃허브 링크](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf) 🔗

📜 Paper Meta

2024.12 4주차

Large Concept Models: Language Modeling in a Sentence Representation Space

existing sentence embedding space, SONAR 사용
diffusion-based generation의 일종인 MSE regression 등을 시도
1.6B 모델에 1.3T 토큰 학습 & 7B 모델에 2.7T 토큰 학습

🧑🏻‍💻 Dev Ollama & HuggingFace

2024.12 4주차

Use Ollama with any GGUF Model on Hugging Face Hub

모델 페이지의 `Use this model`에서 `ollama`를 선택
`ollama run hf.co/{username}/{repository}`

🧑🏻‍💻 Dev Qwen

2024.12 4주차

QVQ: To See the World with Wisdom

MMMU, MathVista, MathVision, OlympiadBench 등 수학적 추론 능력이 크게 요구되는 벤치마크에서 GPT-4o & Claude3.5 Sonnet 이상의 퍼포먼스를 보임
Language Mixing & Code-Switching 등이 예상치 못하게 나타날 수 있음, Recursive Reasoning 등의 문제가 존재

📜 Paper Tencent

2024.12 4주차

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

synthetic recall과 같은 태스크에서 약점을 보임
세 개의 key failure patterns
(1) lost by the boundary (2) lost if surprise (3) lost along the way

📜 Paper Gaoling School

2024.12 4주차

YuLan-Mini: An Open Data-efficient Language Model

세 개의 특징을 가진 사전학습 테크닉
(1) an elaborate data pipeline
(2) 학습 불안정성을 완화하는 robust optimization method

📜 Paper Chalmers University

2024.12 4주차

The Impact of Prompt Programming on Function-Level Code Generation

세 개의 LLM(GPT-4o, Llama3, Mistral)로 부터 생성한 completion function의 quality 평가
특정 테크닉이 코드 생성에 도움은 되지만, 이것들의 조합/결합이 반드시 도움이 되는 것은 아님
correctness & quality 간의 trade-off 관측 (quality가 뭘 의미하는지 모르겠음)

📜 Paper Meta

2024.12 4주차

Improving Factuality with Explicit Working Memory

memory는 online fack-checking과 retrieval feedback을 기반으로 refreshed
→ 중간에 잘못 생성되었던 내용들에 대한 dependency issue를 해결할 수 있음
memory update 규칙, memory unit에 대한 configuration, retrieval datastore의 quality 등이 성능에 가장 큰 영향을 미치는 요소들

📜 Paper Independent

2024.12 3주차

Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture

state space duality algorithm에서 rotary position embedding의 availability를 확인
dynamic mask attention 적용하여 성능은 그대로 유지하면서도 연산 효율이 좋음
cross domain mixture of experts를 디자인 (1024개 experts)

📜 Paper Beijing Univ.

2024.12 3주차

Smaller Language Models Are Better Instruction Evolvers

SLM이 instruction evolving 동안 보다 넓은 output space를 가진다고 주장
Instruction Complex Aware IFD (IC-IFD)를 제안: instruction data를 평가하기 위해 IFD를 개선한 메트릭

📜 Paper Google, Peking

2024.12 3주차

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

모델 파라미터를 토큰으로 간주하여 트랜스포머 아키텍쳐 내 모든 linear projection을 token-parameter attention layer로 대체
[깃허브 링크](https://github.com/Haiyang-W/TokenFormer) 🔗

📜 Paper Meta

2024.12 3주차

Byte Latent Transformer: Patches Scale Better Than Tokens

bytes를 dynamic하게 sized patch로 encoding → 고정된 vocab x
8B 사이즈의 모델을 4T training bytes로 학습

🧑🏻‍💻 Dev Google DeepMind

2024.12 3주차

Veo 2

렌즈 타입과 카메라 효과를 instruction으로 정해서 비디오를 생성할수도 있음
구글의 SynthID 워터마크를 통해 AI-generated content인지 아닌지 쉽게 식별 가능

📜 Paper Shanghai AI Lab

2024.12 3주차

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

→ Evaluation Agent 프레임워크: dynamic, multi-round evaluation, 각 라운드마다 몇 개의 샘플만을 사용
완전한 오픈소스 프레임워크로써 1) efficiency 2) promptable evaluation 3) explainability 4) scalability 등이 핵심 특징
[깃허브 링크](https://vchitect.github.io/Evaluation-Agent-project/) 🔗

🧑🏻‍💻 Dev Claude Engineer v3

2024.12 3주차

Claude Engineer v3

CLI & web 인터페이스 둘 다 지원
무려 10k 개의 스타 ⭐

📜 Paper AIRI

2024.12 3주차

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

fact chaining, simple induction, deduction, counting 등 20여 개의 reasoning task 포함
평가 결과에 따르면 popular LLM도 문맥의 10-20% 정도만 활용하는 수준이며 reasoning complexity가 높아짐에 따라 퍼포먼스가 급격하게 떨어짐

📜 Paper CMU, Duke

2024.12 3주차

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

internal web site, data를 포함하는 self-contained environment를 구축
가장 뛰어난 모델로는 전체 태스크의 24% 정도를 완수할 수 있었다고 보고함
[깃허브 링크](https://github.com/TheAgentCompany/TheAgentCompany) 🔗

🧑🏻‍💻 Dev Google DeepMind

2024.12 3주차

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

LLM의 답변이 사실적으로 정확하고 충분한 내용을 담고 있는지 확인할 수 있는 벤치마크
gemini 모델들이 상위권을 다 차지하는데 상당히 의문스러운 양상..
860개의 public, 859개의 private held out set으로 구성되어 있고 전자를 [공개](https://www.kaggle.com/datasets/deepmind/facts-grounding-examples)

🧑🏻‍💻 Dev VS Code

2024.12 3주차

Announcing a free GitHub Copilot for VS Code

코드 어시스턴트에 대한 관심이 뜨거운데, Cursor, Windsurf 에 뒤지지 않으려는 노력으로 보임
그러나 아직까지 다른 코드툴에 비해서는 너무 약해/평범해 보이는 기능들..

🧑🏻‍💻 Dev OpenAI

2024.12 3주차

o3 preview & call for safety researchers

o-series 모델에 적용한 새로운 alignment strategy
안전성 검사를 위한 작업을 진행 중이고, 이를 위해 일부 연구자들에게 사용 기회를 제공할 것으로 보임

🗞️ News Perplexity

2024.12 3주차

Perplexity has reportedly closed a $500M funding round

OpenAI가 Chat 모델 시장을 선점한 것, 검색 시장을 Perplexity가 선점한 것 등을 보면 시장에서 입지를 빠르게 가져가는 쪽이 압도적인 인지도와 유저풀을 갖게 되는 것 같다는 생각이 듦

📜 Paper Meta, Washington, CMU

2024.12 3주차

Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning

A\* search를 custom domain-specific language에 사용하여 복잡한 story sturcture를 생산
Llama-3.1-70B나 GPT-4o 같은 모델도 각각 0%, 9%에 달하는 낮은 정확도를 보임
[깃허브 링크](https://github.com/facebookresearch/exploretom) 🔗

📜 Paper Tsinghua

2024.12 2주차

Densing Law of LLMs

effective parameter size는 기존 모델 M 만큼의 퍼포먼스를 낼 수 있는 최소한의 사이즈를 의미
→ LLM의 학습 퀄리티를 평가

📜 Paper CMU, KAIST, Washington

2024.12 2주차

Evaluating Language Models as Synthetic Data Generators

6개의 언어 모델, training 99개 student 모델을 사용하여 1.26M training instances를 합성
데이터 생성 능력은 문제 해결 능력과 직접적인 상관관계를 보이지 않는다고 설명
[깃허브 링크](https://github.com/neulab/data-agora) 🔗

🧑🏻‍💻 Dev LG AI Research

2024.12 2주차

EXAONE-3.5 release

🧑🏻‍💻 Dev Google

2024.12 2주차

Meet Willow, our state-of-the-art quantum chip

Willow가 기록한 벤치마크 연산 능력은 오늘날 가장 빠른 슈퍼컴퓨터가 10 septilion (10의 25승)년을 연산할 것을 단 5분만에 처리할 수 있는 수준

📜 Paper Chinese Academy of Sciences

2024.12 2주차

Towards Adaptive Mechanism Activation in Language Agent

expert model에 대한 의존 없이 mechanism activation adaptability를 최적화하는 것에 집중
a harmonized agent framework (UniAct)를 구축하고 태스크 특성에 따라 적합한 방법론으로 최적화

📜 Paper OpenAI

2024.12 2주차

OpenAI o1 System Card

GPT-4를 공개할 때와 마찬가지로 뻔한 이야기들을 담고 있음

🧑🏻‍💻 Dev OpenAI

2024.12 2주차

Day 3. Sora

프롬프트를 통해 remix, blend, create 가능
Turbo 모델은 전작 모델 대비 확실히 생성 속도가 빠름

🧑🏻‍💻 Dev OpenAI

2024.12 2주차

Day 4. Canvas

Direct python execution

📜 Paper Microsoft

2024.12 2주차

Phi-4 Technical Report

web content, code 중심의 organic data로 사전학습하는 기존 모델들과 달리, 합성 데이터를 적절히 혼합하여 사용하는 학습 방법론 적용
phi-4는 STEM-focused QA 능력에서 teacher model의 성능을 능가하는 모습을 보여줌

📜 Paper UC Santa Barbara

2024.12 2주차

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

세 개의 practical domain을 다루고 있음: airline baggage fees, NBA transactions, tax regulations
현존 LLM들의 세 가지 주요 한계: (1) 비슷하지만 다른 규칙을 구분하지 못함 (2) 규칙을 정확히 이해했더라도 수학 문제에서 일관된 성능을 보이지 않음 (3) 전반적으로 이 벤치마크 점수가 다 낮음

📜 Paper OpenAI

2024.12 2주차

Measuring short-form factuality in large language models

GPT-4의 response에 반하도록 수집한 challenging 벤치마크
오직 한 개의 답변만이 정답이 될 수 있도록 문제를 구성 (correct, incorrect, not attempted)
모델의 “know what they know”를 평가하기 위한 벤치마크

📜 Paper Saudi Data & Artificial Intelligence Authority

2024.12 2주차

SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs

135M 사이즈의 모델일 사용하여 learning rate과 batch size 관계가 모델 퍼포먼스에 큰 영향을 미친다는 것을 확인
ARC, GSM8K 같은 태스크는 높은 lr, HellaSwag의 pattern recognition, IFEval 등은 낮은 lr이 적합

📜 Paper Google Cloud, Google DeepMind

2024.12 1주차

Reverse Thinking Makes LLMs Stronger Reasoners

데이터 증강: teacher 모델로부터 (1)원래 질문 (2)정방향 추론 (3)역방향 질문 (4)역방향 추론을 수집
3가지 training objectives를 통한 student 모델 학습
질문→정방향 추론 생성

📜 Paper Chineses Academy of Sciecnes

2024.12 1주차

Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

RAG의 성능 향상을 위한 iterative retrieval 과정을 LLM의 자율적 의사결정 능력에 맡기는 Auto-RAG 제안
LLM이 retriever와 multi-turn 대화를 통해 검색을 계획하고 쿼리를 개선
충분한 정보가 모일 때까지 자동으로 반복

🧑🏻‍💻 Dev NVIDIA

2024.12 1주차

Multimodal PDF Data Extraction

enterprise RAG를 위한 제품으로 보임
현재는 데모 수준으로 업로드된 370/501개 파일에 대한 QA를 RAG 기반으로 테스트 해볼 수 있는 것 같음

🧑🏻‍💻 Dev Kaggle

2024.12 1주차

LLMs - You Can't Please Them All

LLM judges 간 disagreement를 극대화하는 essay를 제출하는 것이 목표

📜 Paper The University of Sydney, Huawei

2024.12 1주차

Enhancing Large Language Models through Adaptive Tokenizers

초기의 방대한 vocabulary로 시작, 학습 동안 모델의 perplexity를 관측하며 tokenizer를 refine

🧑🏻‍💻 Dev Amazon

2024.12 1주차

Amazon Nova Foundation Models

라인업: Micro, Lite, Pro, Premier, Canvas, Reel

🧑🏻‍💻 Dev Cohere

2024.12 1주차

Introducing Rerank 3.5: Precise AI Search

현존하는 검색 시스템들과 compatible
100개 이상의 언어를 지원한다고 설명

🧑🏻‍💻 Dev Google DeepMind

2024.12 1주차

Genie 2: A large-scale foundation world model

Genie 1 → 2 에서의 emergent capabilities of a foundation world model 을 주장

📜 Paper Vanderbit Univ.

2024.12 1주차

Training Noise Token Pruning

discrete token dropping 조건을 continuous additive noise로 relax 하여 학습 내에서 smooth optimization을 제공

📜 Paper UC Berkeley

2024.12 1주차

Predicting Emergent Capabilities by Finetuning

현재 LLM의 random few-shot 정확도를 기반으로 다음 세대 모델의 정확도를 예측할 수 있을까?
insight: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models
언어 모델을 특정 태스크에 대해 학습하면 emergent ability가 발현되는 point를 옮길 수 있다

📜 Paper Google DeepMind

2024.12 1주차

PaliGemma 2: A Family of Versatile VLMs for Transfer

long fine-grained captioning 같은 task 뿐만 아니라 OCR-related tasks도 커버
꽤 넓은 범위로 transfer 가능하다는 것을 실험적으로 확인한 것으로 보임

🧑🏻‍💻 Dev OpenAI

2024.12 1주차

o1 and ChatGPT Pro

Improved accuracy, Multimodal support, Faster and more concise 등의 특징
Pro 유저는 o1, GPT-4o, o1-mini 등을 무제한 사용 가능

📜 Paper Microsoft, MIT

2024.12 1주차

Does Prompt Formatting Have Any Impact on LLM Performance?

같은 내용을 일반 텍스트, 마크다운, JSON, YAML 형식 등으로 변환하여 GPT-3.5-turbo, GPT-4 모델을 테스트
성능이 높은 모델일수록 템플릿에 상관없이 성능이 유지되고, 그렇지 않은 모델은 크게 영향을 받는 것으로 확인됨

🧑🏻‍💻 Dev Google DeepMind

2024.12 1주차

GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy

new high resolution AI ensemble model 이라고 소개하고 있음 (diffusion 기반의 모델)
📜 [Nature 논문 링크](https://www.nature.com/articles/s41586-024-08252-9)

📜 Paper Yunnan Univ.

2024.12 1주차

Learning to Reason via Self-Iterative Process Feedback for Small Language Models

sampling-based inference simulation & process reward models 를 이용하는 process supervision 도입

📜 Paper Peking, Baichuan

2024.12 1주차

SysBench: Can Large Language Models Follow System Messages?

위 능력을 평가하고 분석 가능한 벤치마크 SysBench를 도입
이미 자주 사용되고 있는 6개의 constraint, 500개의 tailor-designed system messages, multi-trun conversation 등을 기반으로 데이터셋을 직접 구축
[깃허브 링크](https://github.com/PKU-Baichuan-MLSystemLab/SysBench) 🔗

2024년 11월 77건

📜 Paper Boston

2024.11 4주차

Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models

EZSwitch: Equivalence Constraint Theory (ECT)를 LLM에 결합하여 언어학적으로 타당하고 유려한 code-switched text를 만들 수 있도록 하는 프레임워크
CSPerf: human preference dataset

📜 Paper Yale, NYU

2024.11 4주차

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

Prompting Score (P-Score) & Heuristical Score (H-Score) 를 제안
structure fine-tuning을 고안하여 Llama에 적용한 결과, 눈에 띄는 성능 향상이 있었다고 보고
[깃허브 링크](https://github.com/gersteinlab/Struc-Bench) 🔗

📜 Paper Apple

2024.11 4주차

Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

larger model이 smaller model의 functionality를 보유할 수 있도록 도와줌
학습이 시작되기 전 larger 모델이 smaller 모델의 능력을 탑재하고 있으므로, 무작위로 초기화된 파라미터를 학습하는 것보다 훨씬 효율적이라고 주장

📜 Paper Ghent University

2024.11 4주차

Large Language Models Reflect the Ideology of their Creators

LLM에게 최근 세계사의 유명하면서도 논쟁이 많은 인물들을 묘사하도록 프롬프팅 (영어 & 중국어)
같은 LLM이라도 영어와 중국어 사용에 따라 normative disagreement를 보인다는 것을 확인함
Western 모델에 정치적인 성향이 반영되어 있다고도 주장

📜 Paper Ohio, Washington, AI2

2024.11 4주차

ComPO: Community Preferences for Language Model Personalization

ComPO, preference provider와 함께 모델 output의 확률 분포를 contextualize 함으로써 preference optimization를 personalize
개인 단위가 아닌 그룹 단위의 선호 데이터셋을 수집하여 community-level preferences from Reddit → ComPRed 공개

📜 Paper NYU, AI2, NVIDIA, Washington

2024.11 4주차

Diverging Preferences: When do Annotators Disagree and do Models Know?

4개의 high-level 클래스로 구분되는 10개의 카테고리로 disagreement taxonomy를 구축
task underspecification, response style, refusals, annotation errors
이것들이 reward modeling & evaluation 에 어떤 영향을 미치는지 조사

📜 Paper VNU Univ.

2024.11 4주차

MoD: A Distribution-Based Approach for Merging Large Language Models

각 모델들의 specialized 능력을 보존하면서도 task 사이의 효율적인 knowledge sharing 가능
간단하게 살펴봤을 땐 다른 merge 방식과 뭐가 그렇게 크게 다른지는 잘 모르겠음
[깃허브 링크](https://github.com/knovel-eng/mod) 🔗

🧑🏻‍💻 Dev Google

2024.11 4주차

Gemini API and Google AI Studio now offer Grounding with Google Search

검색 결과를 기반으로 답변을 생성하는 방식으로 최근 생성형 검색 엔진에 대한 관심이 뜨거움
그러나 최근 구글 검색의 결과물이 만족스럽지 않다는 점을 감안하면 그렇게 좋을지는 잘 모르겠음

🧑🏻‍💻 Dev HuggingFace

2024.11 4주차

SmolLM2-1.7B-Instruct

잘 정제된 데이터셋으로 SFT & DPO 학습한 모델로, 동사이즈 대비 아주 뛰어난 성능 지표를 보임
[이미 ollama에서도 지원](https://ollama.com/library/smollm2) 🔗

🧑🏻‍💻 Dev Anthropic

2024.11 4주차

PDF support (beta)

최대 32MB, 100 페이지 커버가 가능하며 페이지당 1,500 ~ 3,000 토큰 사용

🧑🏻‍💻 Dev xAI

2024.11 4주차

API Public Beta

128K 토큰 길이의 context, function calling, system prompt를 지원
베타 기간 동안 25$의 API 크레딧을 매달 지급

🧑🏻‍💻 Dev Anthropic

2024.11 4주차

Claude 3.5 Haiku

다른 태스크보다 특히 코드 생성에서 좋은 퍼포먼스를 보이는 것 같음
그런데 비용이 많이 올라서 논란이 되는 것으로 보임
Sonnet 3.5 (new)의 성능도 함께 화제가 되는 중

📜 Paper MIT, Cambridge

2024.11 4주차

The Geometry of Concepts: Sparse Autoencoder Feature Structure

Sparse autoencoder는 최근 LLM에 의해 표현되는 세상의 concepts를 high dimensional vectors의 dictionaries로 produce 가능

📜 Paper Google Research

2024.11 4주차

Distinguishing Ignorance from Error in LLM Hallucinations

후자의 경우 중간 연산에 개입함으로써 문제를 해결할 수 있으나, 전자의 경우 외부 지식 source가 필요
두 경우를 구분하기 위해 Wrong Answer despite having Correct Knowledge (WACK) 라는 model-specific dataset 구축 방식을 제안

📜 Paper Duke, Google Research

2024.11 4주차

SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models

마지막 layer의 output logits와 초기 layer의 output logits을 contrasting 하여 LLM 내부에 embedded 된 latent knowledge를 이용
latent knowledge가 output에 대해 self-refinement 할 수 있도록 approximate gradient approach 를 사용

🧑🏻‍💻 Dev HuggingFace

2024.11 4주차

Smol Tools

SmolSummarizer, SmolRewriter, SmolAgent
각각이 엄청난 건 아닌데 작은 모델들을 각자의 작업에 특화시켜서 합친 것에 의미가 있는 듯함

📜 Paper IBM

2024.11 4주차

Granite 3.0 Language Models

Sparse 1B & 3B MoE 모델. 400M & 800M activate 파라미터. 총 10T 토큰으로 학습.
비교군으로는 Llama3.1 8B, Mistral 7B / SmolLM-1.7B 등 모델을 사용
상업적으로도 사용 가능하도록 Apache 2.0 라이센스로 공개됨

📜 Paper HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

2024.11 4주차

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

따라서 plain text 대신 HTML을 사용하는 HtmlRAG를 제안
그러나 HTML을 바로 사용하기는 어렵기 때문에, HTML cleaning, compression, pruning strategies를 도입하여 정보의 손실을 최소화 하면서도 HTML을 줄이고자 함

📜 Paper Dartmoouth, Adobe, Stanford, …

2024.11 4주차

Personalization of Large Language Models: A Survey

personalization techniques, datasets ,evaluation methods, application 등을 기준으로 구분

📜 Paper Huawei

2024.11 4주차

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

기존의 rigid & limited 한 CoT & reflection 대신에 아주 유연한 structrued reasoning 프레임워크를 사용했다고 언급
iteration마다 핵심 정보를 탐색 및 저장함으로써 long- & short-term memory를 업데이트함. 이를 통해 fine-tuning이나 backpropagation 없이 성능을 개선할 수 있음

📜 Paper Tancent

2024.11 4주차

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

256K 길이의 window size를 갖는 모델
다양한 태스크에서 LLama3.1-70B를 능가하고, 405B 모델에 비견되는 성능을 보임
large-scale synthetic data, mixed expert routing, key-value cache compression, expert-specific learning rate 등이 핵심 특징

🧑🏻‍💻 Dev Ollama

2024.11 4주차

Ollama 0.4 Integrates Meta's Llama 3.2 Vision Models (11B and 90B)

터미널에서 사용 가능

📜 Paper NVIDIA

2024.11 4주차

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

MLLM을 10개 데이터셋 16개의 태스크에 대해 학습하여 bi-encoder retriever로 사용
MLLM에 존재하는 modality bias를 완화하기 위해 modality-aware hard negative mining을 제안
여러 modality 중에서도 특히 text retrieval 능력을 향상시키기 위해 continually fine-tuning 할 것을 제안

📜 Paper Zhejiang

2024.11 4주차

Fine-Grained Guidance for Retrievers: Leveraging LLMs' Feedback in Retrieval-Augmented Generation

retriever가 잘 못하는 샘플들로부터 easy-to-understand 샘플을 LLM으로 생성하는 방식
이때 세 가지 learning objective, relevance, comprehensiveness, purity를 고려
LLM과 retriever 간 dual curriculum learning & reciprocal feedback

🗞️ News XPENG

2024.11 4주차

XPENG Unveils Iron Humanoid Robot, Already Operational in EV Factory

Eagle Vision 시스템과 end-to-end large AI model이 통합된 시스템
PoC 수준을 넘어 실제 공정에서 활용 가능

🧑🏻‍💻 Dev ByteDance, Tsinghua

2024.11 4주차

X-Portrait 2: Highly Expressive Portrait Animation

현실적인 이미지와 만화 그림체 사이에도 style transfer 가능

📜 Paper Edinburgh

2024.11 4주차

Mixtures of In-Context Learners

분류 태스크에서 뛰어난 성능, 더 적은 demonstration으로 기존과 유사한 퍼포먼스를 달성하여 파레토 라인을 push

📜 Paper Google, Peking

2024.11 4주차

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Tokenformer: attention 메커니즘을 input token 사이의 computation 뿐만 아니라 token과 모델 파라미터 간 interaction에도 활용
모든 linear layer를 token-parameter attention layer로 교체!
[깃허브 링크](https://github.com/Haiyang-W/TokenFormer) 🔗

📜 Paper Hong Kong, Tsinghua, Peking, Tencent

2024.11 4주차

Large Language Models Can Self-Improve in Long-context Reasoning

위 문제를 해결하기 위해 SeaLong 제안: 각 질문에 대해 여러 개의 output을 생성하고 Minimum Bayes Risks를 이용한 scoring 후 SFT 또는 preference optimization
이런 방법론들은 결국 cost 문제에 직면하기 마련인데..

🧑🏻‍💻 Dev INF, M-A-P

2024.11 4주차

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

재현 가능한 960B 토큰의 데이터셋, 4.5M SFT samples, intermediate checkpoints
Two-Stage Instruction Fine-Tuning for Theory and Practice
Ollama에서 동작 가능. 로컬에서 코드 모델을 사용하고자 하는 수요가 적지 않은 것 같음

🧑🏻‍💻 Dev NVIDIA

2024.11 4주차

Cosmos Tokenizer: A suite of image and video neural tokenizers

토크나이저는 생성형 모델들의 성능에 직접적인 영향을 주는데 이를 평가하기 위한 [TokenBench](https://github.com/NVlabs/TokenBench)도 존재

📜 Paper Wuhan Univ.

2024.11 4주차

Adaption-of-Thought: Learning Question Difficulty Improves Large Language Models for Reasoning

🧑🏻‍💻 Dev Alibaba

2024.11 4주차

Qwen2.5-Coder Series: Powerful, Diverse, Practical.

6개의 모델 사이즈를 기준으로 모델을 공개
0.5B / 1.5B / 7B / 14B / 32B 모델은 Apache 2.0, 3B 모델은 Qwen-Research 라이센스를 따름
coding assistant & Artifact 두 개의 시나리오에서 사용할 수 있게끔 학습됨

🧑🏻‍💻 Dev Nous Research

2024.11 4주차

Introducing the Forge Reasoning API Beta and Nous Chat: An Evolution in LLM Inference

📜 [모델 테크니컬 리포트](https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf) 🔗
MCTS, CoC, MoA 등의 방법론들을 조합하여 모델 사이즈 증가 없이 퍼포먼스를 향상시킴

📜 Paper Israel Institue of Technology

2024.11 4주차

Backward Lens: Projecting Language Model Gradients into the Vocabulary Space

gradient matrix가 low-rank linear combination의 forward & backward pass의 입력으로 cast 될 수 있음을 입증 (?)
이러한 gradients를 vocab item에 project하고 LM의 neuron에 새로운 정보를 저장할 수 있도록 하는 방법론을 고안
[깃허브 링크](https://github.com/shacharKZ/BackwardLens) 🔗

📜 Paper Univ. of Tehran

2024.11 4주차

CoCoP: Enhancing Text Classification with LLM through Code Completion Prompt

text classification 문제를 해결하기 위해 LLM의 code 능력을 활용하는 Code Completion Prompt (CoCoP) 방법론 제시: text classification → code completion
CodeLLaMA와 같은 코드 특화 모델을 사용하는 경우, few-shot learning 수준의 퍼포먼스 가능

📜 Paper Apple

2024.11 4주차

Cut Your Losses in Large-Vocabulary Language Models

이는 각 입력 토큰 & vocab item 쌍마다 logit 행렬을 구축하기 때문이고, 작은 모델이라고 할지라도 LLM의 나머지 구성요소의 수배에 달하는 메모리를 차지하게 됨
Cut Cross-Entropy (CCE) 제안: 모든 토큰에 대한 로짓을 전역 메모리에 저장하지 않고도 Cross Entropy 계산 가능
대신 정답에 대한 logit만 계산, 모든 logit에 대한 log sum-exp를 실시간 평가

🧑🏻‍💻 Dev Anthropic

2024.11 4주차

Improve your prompts in the developer console

CoT Reasoning, Example standardization, Example enrichment, Rewriting, Prefill addition 등을 활용
workbench에서 multi-shot example을 관리할 수 있음. Claude를 활용하여 synthetic 데이터를 자동적으로 만들 수도 있음
(이전에 출시된 기능이긴한데) 최종 생성 결과에 대해 1-5점 점수를 부여하는 평가 기능도 지원함

📜 Paper Aalborg Univ.

2024.11 4주차

Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective

LLM의 hallucination 현상을 완화하기 위해 knowledge graph 활용

📜 Paper Google DeepMind

2024.11 4주차

Learning high-accuracy error decoding for quantum processors

구글 딥마인드에서 인공지능을 활용한 quantum computer 연구를 수행하고 있음

📜 Paper National Univ. of Singapore

2024.11 4주차

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

연구에 활용된 프롬프트나 도메인, 소프트웨어 정보를 다양하게 포함하고 있음
[깃허브 링크](https://github.com/showlab/computer_use_ootb) 🔗

🗞️ News Amazon

2024.11 4주차

Amazon and Anthropic deepen strategic collaboration

Microsoft & OpenAI 의 관계와 유사하다고 이해할 수 있음
Anthropic의 다음 세대 모델 개발을 위한 accelerator chip, “Trainium” 개발에 사용될 것

🧑🏻‍💻 Dev Anthropic

2024.11 4주차

Hume AI creates emotionally intelligent voice interactions with Claude

36%의 유저가 다른 LLM 대신 Claude를 선택
실시간으로 자연스럽게 interact 하는 모델을 Anthropic에서도 적극적으로 개발 중인 상황으로 이해됨

📜 Paper UCL, Shanghai, Brown, Singapore

2024.11 4주차

Natural Language Reinforcement Learning

Natural Language Reinforcement Learning (NLRL): 전통적인 MDP를 자연어 기반의representation space로 확장
순수 프롬프팅 or gradient-based training 에 의한 RL-like policy & value 를 개선
[깃허브 링크](https://github.com/waterhorse1/Natural-language-RL) 🔗

📜 Paper Arizona

2024.11 4주차

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

LLM-as-a-judge를 평가하는 벤치마크 compile

🧑🏻‍💻 Dev OpenAI

2024.11 4주차

Advancing red teaming with people and AI

📜 [External red teaming](https://cdn.openai.com/papers/openais-approach-to-external-red-teaming.pdf)
📜 [Automated red teaming](https://cdn.openai.com/papers/diverse-and-effective-red-teaming.pdf)

📜 Paper MIT

2024.11 4주차

Model-Based Transfer Learning for Contextual Reinforcement Learning

Model-Based Transfer Learning (MBTL) 제시: Gaussian process를 사용한 performance set point, linear function of contextual similarity로 모델링되는 performance loss
두 요소를 결합하여 Bayesian Optimization (BO) 프레임워크 내에서 전략적으로 사용
50배 이상 개선된 independent & multi-task training 효율성

📜 Paper NVIDIA

2024.11 4주차

Star Attention: Efficient LLM Inference over Long Sequences

1단계: blockwise-local attention across hosts → 2단계: query & response tokens 가 이전에 생성 및 캐싱된 토큰에 대해 sequence-global attention
global attention을 사용하여 학습된 트랜스포머 기반의 모델들은 약 11배 정도까지의 추론 속도 향상을 기대할 수 있음 (정확도는 95~100% 유지)

📜 Paper Ai2

2024.11 4주차

OLMo 2: The best fully open language model to date

[Tülu 3](https://allenai.org/tulu)에서 얻은 나이스한 레시피를 OLMo 2에도 적용 (근데 둘이 뭐가 다르지 그럼..?)

📜 Paper Case Western Reserve Univ.

2024.11 4주차

Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

distillation influence와 temperature value를 dynamic 하게 조절
self-correction & self-training 테크닉들과 seamless 하게 integration 가능

📜 Paper Tsinghua

2024.11 4주차

Training and Evaluating Language Models with Template-based Data Generation

TemplateMath Part 1: TemplateGSM, 7백만 개 이상의 고등학교 수학 문제로 구성된 합성 데이터셋
[허깅페이스 데이터셋 링크](https://huggingface.co/datasets/math-ai/TemplateGSM) 🔗

🧑🏻‍💻 Dev Andrew Ng

2024.11 4주차

aisuite

OpenAI, Anthropic, Azure, Google, AWS, Groq, Mistral, HuggingFace, Ollama 등을 지원

🧑🏻‍💻 Dev HuggingFace

2024.11 4주차

SmolVLM - small yet mighty Vision Language Model

모든 모델 체크포인트, VLM 데이터셋, 학습 레시피, 도구 등 Apache 2.0 라이센스로 공개

📜 Paper NVIDIA

2024.11 4주차

Hymba: A Hybrid-head Architecture for Small Language Models

Attention heads는 high-resolution recall을, SSM heads는 efficient context summarization을 담당
프롬프트 앞에 붙어서 중요한 정보를 저장하는 learnable meta token 도입
허깅페이스에 [Base](https://huggingface.co/nvidia/Hymba-1.5B-Base) & [Instruct](https://huggingface.co/nvidia/Hymba-1.5B-Instruct) 모델 공개

🧑🏻‍💻 Dev Qwen

2024.11 4주차

QwQ: Reflect Deeply on the Boundaries of the Unknown

Language Mixing and Code-Switching, Recursive Reasoning Loops, Safety and Ethical Considerations 등의 한계점
GPQA, AIME, MATH-500, LiveCodeBench 등 추론 능력이 요구되는 벤치마크에서 뛰어난 성능

🧑🏻‍💻 Dev IBM, Meta

2024.11 4주차

Supercharging Training using float8 and FSDP2

1.8B 부터 405B 에 이르는 라마 모델에 대한 성능 개선을 확인함 (Llama 3 아키텍쳐 기준)
end-to-end float8 training에 대한 가능성을 입증

📜 Paper Univ. of Luxembourg

2024.11 4주차

LongKey: Keyphrase Extraction for Long Documents

LongKey, a novel framework for extracting keyphrases from lengthy documents
encoder 기반의 언어 모델, max-pooling embedder 사용

📜 Paper Harvard, Stanford, MIT, Databricks, CMU

2024.11 3주차

Scaling Laws for Precision

training in lower precision은 모델의 effective parameter count를 감소시킴으로써 low precision training과 post-train quantization으로부터의 loss를 예측할 수 있도록 함
추론에 대해서는, 모델이 더 많은 데이터로 학습되었을수록 post-training quantization에 의한 성능 하락이 심각
학습에 대해서는, 본인들이 제시하는 scaling law를 통해 다른 precision으로 학습한 결과를 예측할 수 있다고 주장. 이때 큰 모델을 낮은 precision으로 학습하는 것을 권장.

📜 Paper MIT

2024.11 3주차

The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

Abstraction and Reasoning Corpus (ARC)를 벤치마크로 사용 (reasoning 포커스)
TTT의 중요한 구성 요소: (1) initial finetuning on similar tasks (2) auxiliary task format and augmentations (3) per-instance training

📜 Paper Peking, Tsinghua

2024.11 3주차

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

LLaVA-o1, autonomous multistage reasoning
일반적인 CoT prompting과 달리 LLaVA-o1은 summarization, visual interpretation, logical reasoning, conclusion generation 으로 구성된 stage들을 독립적 & 연속적으로 engage
LLaVA-o1-100k dataset: visual question answering, structured reasoning annotations

📜 Paper Shanghai, Fudan

2024.11 3주차

Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

Compound Question Synthesis (CQ-Syn)을 도입하여 Compound-QA를 제작. multi sub-question에 집중
Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, Evaluation-and-Suggestion, 다섯 개의 카테고리를 다룸

📜 Paper UIUC, IBM

2024.11 3주차

DELIFT: Data Efficient Language model Instruction Fine Tuning

DELIFT, 세 단계의 fine-tuning을 통해 data selection을 systematically optimize
(1) instruction tuning (2) task-specific fine-tuning (3) continual fine-tuning
현재 데이터 샘플이 현재 모델의 상태에 얼마나 beneficial 한지를 정량화하는 pairwise utility metric 사용

📜 Paper UC, Tsinghua, Peking

2024.11 3주차

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Style-Compress: smaller model이 새로운 태스크에 대해 추가적인 fine-tuning 없이 프롬프트를 압축할 수 있도록 adapt하는 방법론
10개 샘플, 100개 쿼리로 adaptation 한 뒤 compression 적용한 결과가 준수하다는 것을 확인
방법론에 대한 간단한 수식, 파이프라인, 다양한 실험을 통해 논문화.. 프레임워크도 중요한 시대

🧑🏻‍💻 Dev Microsoft

2024.11 3주차

Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators

합성 데이터 사용 시 LLM의 학습 속도를 높일 수 있다고 설명

📜 Paper KAIST

2024.11 3주차

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

AutoML-Agent, data retrieval 부터 model deployment 까지 아우르는 multi-agent framework
retrieval-augmented planning strategy를 사용하여 최적의 plan을 만듦
각 plan을 sub-tasks로 쪼개어서 특화된 agent가 이를 처리할 수 있도록 함

🧑🏻‍💻 Dev AI2

2024.11 3주차

Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models

retriever and reranker to search the datastore
8B Llama fine-tuned on high-quality synthetic data
self-feedback generation pipeline

🧑🏻‍💻 Dev Mistral AI

2024.11 3주차

Mistral has entered the chat

SoTA document and image understanding, powerd bye the new multimodal [Pixtral Large](https://mistral.ai/news/pixtral-large/)
SoTA on MathVista, DocVQA, VQAv2
123B multimodal decoder, 1B parameter vision encoder

🧑🏻‍💻 Dev Perplexity

2024.11 3주차

Shop like a Pro: Perplexity’s new AI-powered shopping assistant

Buy with Pro: One-click checkout to save time & free shipping
Snap to Shop: 물건의 사진과 유사한 상품을 찾아주는 visual search tool
Introducing the Perplexity Merchant Program: 상품 판매자들이 가입하는 프로그램으로, 가입 시 상품이 인덱싱 대상이 되어 추천이 더 잘될 수 있음을 언급

📜 Paper Together AI, Stanford, etc

2024.11 3주차

RedPajama: an Open Dataset for Training Large Language Models

모델 개발의 투명성 부족 (데이터 정제 포함), 고품질 데이터셋 대량 확보의 어려움, 데이터셋 정제와 분석을 위한 artifact 및 메타 데이터 이용 가능성 낮음
이러한 문제를 해결하기 위해 RedPajama-V1 release, open reproduction of the LLaMA training dataset
RedPajama-V2를 함께 release, 정제되지 않은 날것의 text data로 구성된 massive web-only dataset

📜 Paper Stony Brook

2024.11 3주차

A Novel Approach to Eliminating Hallucinations in Large Language Model-Assisted Causal Discovery

고품질 데이터에 접근 가능할 때 RAG를 사용하여 hallucination을 줄이는 방법을 제안
arbiter(결정권자)를 포함한 여러 LLM을 debate에 참여시켜 causal graphs의 edge를 감사함으로써 hallucination을 최소화하는 기법을 제안
프롬프트 엔지니어링을 통해 graph를 만드는 것부터 시작

🗞️ News Cerebral Valley: Alexandr Wang Scale AI

2024.11 3주차

Cerebral Valley: Alexandr Wang Scale AI

그러나 post training으로 모델을 발전시킬 수 있는 여지는 무궁무진.
최근 o1 or DeepSeek이 좋은 사례

🧑🏻‍💻 Dev DeepSeek

2024.11 3주차

DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!

thought process를 real-time으로 투명하게 공개
곧 오픈 소스 모델과 API 공개 예정
[링크](http://chat.deepseek.com/)에서 채팅 가능

🧑🏻‍💻 Dev H

2024.11 3주차

French startup H Company launches Runner H: a web automation agent with human-like precision

이것이 첫 product인데 $220M 투자 받은 것으로 알려짐 (한화 약 3,000억원)
API beta도 제공

🧑🏻‍💻 Dev HuggingFaceTB

2024.11 3주차

SmolTalk

instruction following 능력을 향상시키면서 다양한 태스크를 잘 수행할 수 있는 데 기여하는 public 데이터셋을 합성하여 공개

🧑🏻‍💻 Dev Ai2

2024.11 3주차

Tülu 3 opens language model post-training up to more tasks and more people

Data, Data Toolkit, Training Code & Infrastructure, Evaluation Framework, Demo, Models & Checkpoints

🧑🏻‍💻 Dev Apple

2024.11 3주차

AIMv2

대부분의 멀티모달 이해 벤치마크에서 OAI CLIP, SigLIP 등을 outperform
open-vocabulary object detection & referring expression comprehension에서 DINOv2를 outperform
📜 [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://arxiv.org/pdf/2411.14402)

📜 Paper Anthropic

2024.11 3주차

Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

통계학 기반의 연구자들에게 언어 모델의 평가 데이터를 어떻게 분석하고 접근해야 하는지 설명하는 연구
평가 데이터 분석, 두 모델 간의 차이 측정, 평가 실험 계획을 위한 공식을 제시

2024년 10월 83건

🧑🏻‍💻 Dev Stanford

2024.10 5주차

Co-STORM Get a Wikipedia-like report on your topic with AI

위키피디아 형식으로 작성된 내용들은 모두 PDF로 다운로드 가능
글에 존재하는 모든 인용문에 대한 원본 출처 확인 가능

📜 Paper Michigan, Amazon

2024.10 5주차

A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration

추론 단계에서 demonstration example이 corrupted 될 때, Coherent CoT를 사용하는 transformer의 sensitivity를 조사
→ final outcome에 비해 intermediate reasoning step에서 더 sensitive하게 반응

📜 Paper Shanghai

2024.10 5주차

Agentic Information Retrieval

기존에는 사전에 정의된 candidate item을 filtering 하는 것에 수십년째 의존하고 있던 상황
Agentic IR을 제시하며 세 종류의 application과 현재의 문제점에 대해 논의

📜 Paper Michigan, Alibaba

2024.10 5주차

Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning

왜 이런 방식이 실제 reasoning에 유용한지를 probabilistic graphical model을 통해 입증
multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA) 제안

🧑🏻‍💻 Dev Stability.AI

2024.10 5주차

Introducing Stable Diffusion 3.5

Stable Diffusion 3.5 수준의 성능을 낼 수 있는 distilled version의 turbo 모델도 공개
transformer block에 Query-Key Normalization 테크닉 적용

📜 Paper Huawei

2024.10 5주차

Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning

LLM은 small reasoning step을 reflect 하고, 이를 inference stage에 포함시킴으로써 첫 스텝을 다음으로 잘 이어나갈 수 있게 됨
간단히 살펴봤을 땐 inference를 여러 번 하게 되는 것 같은데.. 근본적인 해결책은 아닌 것 같음

📜 Paper Google DeepMind, Boston

2024.10 5주차

Measuring memorization through probabilistic discoverable extraction

이를 통해 모델이 기억(암기)하고 있는 정보에 대해 파악할 수 있다고 주장
이러한 연구는 학습에 사용된 민감한 정보 등이 유출되는 것을 방지하기 위함인데, 그럼 외운 것 없이 순수한 추론, 이해, 언어 능력만으로 여러 태스크를 처리하는 것이 궁극적인 goal이 될지 궁금함

🧑🏻‍💻 Dev GitHub

2024.10 5주차

Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview

VS Code, GitHub.com, Apple Xcode와의 직접적인 통합
VS Code 내에 GitHub Spark 공개 (Cursor의 Composer와 유사한 기능)
Cursor에 비해 한 발자국씩 대응이 늦는 것 같음. 모델 종류의 다양성이나 Spark 전부 다.

📜 Paper Samsung Research

2024.10 4주차

Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs

Instruction 모델에 많은 양의 새로운 토큰을 CPT 하면 Instruction Following 성능 크게 하락
Base 모델은 많은 양의 새로운 토큰을 CPT 해도 안정적인 성능 유지 가능

📜 Paper OpenAI

2024.10 4주차

First-Person Fairness in Chatbots

1% 미만 수준으로 영향을 받는다는 요약글을 본 적이 있는 것 같은데, 사용자수를 고려한다면 훨씬 더 엄밀한 safety 정책이나 방법론이 필요하다는 생각이 듦

📜 Paper Anthropic, Scale AI, NYU, UC Berkeley

2024.10 4주차

Looking Inward: Language Models Can Learn About Themselves by Introspection

LLM이 가상의 시나리오에 대한 본인의 행동 특성을 예측하도록 fine-tuning
introspect 할 수 있는 모델 M1이 본인의 output 예측을 더 잘할 것이고, 이것이 곧 M2 보다 뛰어난 성능을 지닌다는 방증으로 이해하는 것 같음
요즘 성찰, self-correct 등 모델의 inherent ability를 최대한 이끌어내고자 하는 연구가 꽤 많은 것 같은데, 약간 결과론적인 해석 위주인 것 같아서 아쉽게 느껴짐

📜 Paper British Columbia

2024.10 4주차

Supervised Chain of Thought

one-for-all prompting (think step by step) 대신 task-specific supervision이 필요하다고 주장
reasoning path를 학습하는 방식은 이미 제시된 바 있는데 데이터셋을 잘 구축한 건가 싶은 인상

📜 Paper Hong Kong, Washington, HKUST, Microsoft

2024.10 4주차

SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

learnable gate를 두어 attention map에서 중요한 block를 adaptive 하게 선택하는 mechanism 제안
→ accuracy & speed 균형
이를 위한 customized Flash Attention 구현

🧑🏻‍💻 Dev Microsoft

2024.10 4주차

Open-sourced BitNet

🧑🏻‍💻 Dev Meta FAIR

2024.10 4주차

Sharing new research, models, and datasets from Meta FAIR

Meta Spirit LM: An open source language model for seamless speech and text integration
cross modality generation을 위해 단어 단위의 text & audio 데이터를 interleaving 하는 방식 사용
Layer Skip: Enhancing large language model performance with accelerated generation times

📜 Paper Texas, Pittsburgh, Princeton, CMU

2024.10 4주차

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

CBT-Bench를 구성하는 세 단계의 태스크 (Cognitive Behavior Therapy)

📜 Paper Shanghai AI Lab

2024.10 4주차

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

unitary scoring & two-model comparison 가능 / 특정 형식을 따라 평가 가능 / critiques 생성 가능 / 일반적인 LLM 태스크 수행 가능
various subjective evaluation task와 topic을 커버하는 JudgerBench 구축
[모델 및 코드 공개 커뮤니티 링크](https://github.com/open-compass/CompassJudger) 🔗

📜 Paper CMU

2024.10 4주차

Causality for Large Language Models

어떻게 causality가 언어 모델의 각 학습 단계에서 어떻게 영향을 줄 수 있는지 연구하고 앞으로의 연구 방향성을 제시. 프롬프트 기반의 연구들의 한계를 극복하겠다는 취지.
말은 거창한데 abstract만 보고서는 무슨 소리인지 모르겠음
[깃허브 링크](https://github.com/causal-machine-learning-lab/Awesome-Causal-LLM) 🔗

🧑🏻‍💻 Dev Anthropic

2024.10 4주차

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

자연어를 컴퓨터 명령어로 변환하는 기능을 포함
기존 대비 훨씬 강력한 성능의 모델 업데이트를 공개함

📜 Paper Alibaba

2024.10 4주차

Aligning Large Language Models via Self-Steering Optimization

chosen & rejected response 간의 consistent gap을 보장하면서도 현재 policy 모델의 learning capacity에 적합한 학습이 진행될 수 있도록 함
SSO로 생성된 선호 데이터셋은 reward 모델의 성능을 높인다는 결과도 함께 제시
[깃허브 링크](https://github.com/icip-cas/SSO) 🔗

📜 Paper Yonsei, SNU

2024.10 4주차

Large Language Models Still Exhibit Bias in Long Text

14개 토픽, 10개 demographic axes, 11,948개 샘플로 구성
연구에 따르면 특정 demographic group이 선호됨 & excessive sensitivity가 확인됨
이를 완화하기 위해 biased prompt를 neutral response와 짝짓는 fine-tuning approach 제안

🧑🏻‍💻 Dev IBM

2024.10 4주차

IBM Introduces Granite 3.0: High Performing AI Models Built for Business

larger 모델 대비 3~23x 저렴한 비용
MoE 아키텍쳐를 이용하여 1B 이하의 사이즈로 enterprise 태스크 수행
128K 윈도우 사이즈 지원 (예정)

📜 Paper NVIDIA

2024.10 4주차

HelpSteer2-Preference: Complementing Ratings with Preferences

두 방식을 head-to-head comparison → Bradley-Terry and Regression reward modeling 제안
Llama-3.1-70B-Instruct 모델을 튜닝한 것이 RewardBench에서 94.1점을 달성
[데이터셋 링크](https://huggingface.co/datasets/nvidia/HelpSteer2) 🔗 [모델 링크](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) 🔗

🧑🏻‍💻 Dev Cohere

2024.10 4주차

Introducing Multimodal Embed 3: Powering AI Search

나쁘지 않은 수준의 성능으로 100개 이상의 언어를 지원한다고 함 (검증할 길이 없어 아쉽)
text, image가 독립적으로 clustering 되는 문제가 해결되어 mixed-modality search에서 CLIP 대비 뛰어난 성능을 보여줌

📜 Paper OpenAI

2024.10 4주차

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

only two sampling step만으로도 뛰어난 성능을 거둘 수 있었음
[OpenAI 블로그 & 데모 링크](https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/) 🔗

🧑🏻‍💻 Dev Google DeepMind

2024.10 4주차

SynthID Identifying AI-generated content with SynthID

image, audio, text, video 지원
이중에서도 특히 audio, text를 어떻게 구분할 수 있다는 건지 전혀 이해가 안됨..

🧑🏻‍💻 Dev Meta

2024.10 4주차

Introducing quantized Llama models with increased speed and a reduced memory footprint

Llama 3.2 모델에 Quantization-Aware Training with LoRA adaptors (accuracy) & SpinQuant (portability), 두 가지 방법론을 적용

📜 Paper Washington, Google Cloud, DeepMind

2024.10 4주차

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

📜 Paper Central Florida

2024.10 3주차

Parameter-Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning

이를 위해 zero-shot으로 프롬프트의 semantic content를 이해할 수 있는 fixed LLM을 활용
processed prompt를 입력 텍스트와 통합하여 모델이 특정 태스크에서 더 뛰어난 성능을 발휘할 수 있도록 함
text classification & understanding에서 다른 tuning method 대비 더 적은 시간과 비용으로 좋은 성능을 낼 수 있었다고 주장

📜 Paper Peking, Microsoft

2024.10 3주차

Self-Boosting Large Language Models with Synthetic Preference Data

self-prompt generator가 다양한 프롬프트를 생성 → response improver가 response를 점진적으로 개선
LLM 스스로 자신의 output에 대한 generative reward를 자율적으로 학습하고, 대규모 annotation 작업을 하지 않을 수 있게 됨
AlpacaEval 2.0 & ArenaHard 에 대한 검증을 통해 모델의 instruction following 능력이 크게 향상되었음을 확인

📜 Paper UNIST

2024.10 3주차

Response Tuning: Aligning Large Language Models without Instruction

실험 결과에 따르면 response에 대해서만 학습한 본인들의 모델이 instruction-tuned 모델들보다 더 다양한 범위의 instruction을 따를 수 있거나 성능이 좋았다고 언급함
training response distribution을 조절함으로써 target behavior를 유도할 수 있었다고 함

🧑🏻‍💻 Dev OpenAI

2024.10 3주차

openai/swarm

[Orchestrating Agents: Handoffs & Routines](https://cookbook.openai.com/examples/orchestrating_agents) cookbook의handoff & routines pattern을 보여주기 위해 제작됨

📜 Paper Alibaba

2024.10 3주차

StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

사람이 raw information을 다양한 structured knowledge로 convert한다는 점에 착안하여 StructRAG를 제안
즉, 태스크에 적합한 structured format으로 문서를 재구성하는 방식

🧑🏻‍💻 Dev Mistral AI

2024.10 3주차

Un Ministral, des Ministraux

128k context length (vLLM에선 현재 32k). 8B 모델은 sliding-window attention
Llama-3.1-8B 보다 뛰어난 성능임을 벤치마크 결과를 통해 제시하고 있음
라이센스는 각각 Mistral Commercial / Commercial & Research License를 따름

📜 Paper Meta, Berkeley, NYU

2024.10 3주차

Thinking LLMs: General Instruction Following with Thought Generation

iterative search & optimiation precedure를 통해 possible thought generation space를 탐색. 여기엔 direct supervision이 필요하지 않음
각 instruction에 대한 thought candidate는 judge model이 평가하여 preference optimization에 활용 (DPO)
AlpacaEval & Arena-Hard 에서 우수한 성능을 보였음을 강조. 그외의 marketing, health, general knowledge 등의 분야에서도 뛰어나다고 주장.

🧑🏻‍💻 Dev Zyphra

2024.10 3주차

ZAMBA2-7B

single shared attention block → two shared attention block
토큰 당 추론 속도를 25% 가량 개선한 inference-efficient 모델
하루 사이에 Mistral 신모델이 출시되었는데 성능 비교가 필요할지도..

🧑🏻‍💻 Dev NVIDIA

2024.10 3주차

Llama-3.1-Nemotron-70B

2024년 10월 기준, Arena Hard와 RewardBench에서 SoTA 달성
GPT-4o와 Claude 3.5를 넘는 성능을 달성했다고 함

🧑🏻‍💻 Dev Rhymes AI

2024.10 3주차

Aria

text, image, video 처리 가능하며 64k 사이즈의 context window 지원
토큰당 3.9B activated parameters 사용

🧑🏻‍💻 Dev Perplexity

2024.10 3주차

Introducing Internal Knowledge Search and Spaces

Perplexity Space에서 team based search 가능

📜 Paper Fudan, CMU, ByteDance

2024.10 3주차

Revealing the Barriers of Language Agents in Planning

Language model을 agent로 사용하여 planning에 활용하는 최근 연구가 많은데, 현재 연구들이 보이는 한계의 원인을 파악한 연구라고 볼 수 있음. 이를 Memory Updating과 연관지어 분석하고 설명한 내용들이 기술되어 있음.

📜 Paper Tufts University

2024.10 3주차

"Let's Argue Both Sides": Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities

추가적인 레이어 없이 zero-shot prompting을 대체할 수 있는 방법론이라고 주장
CoT나 Argument Generation은 추론이 필요한 태스크에서 zero-shot 할 때나 유용한 보조적인 수단이라고 설명
엄청 단순하고 흔한 방식 같긴 한데, 이런 테크닉이 한정적인 보조수단이라고 설명한 내용이 인상 깊음

📜 Paper DeepSeek-AI, Hong Kong, Peking

2024.10 3주차

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

visual encoding을 여러 pathway로 분해(decouple)하되, 처리하는 transformer architecture는 통합된 것을 사용
decoupling은 visual encoder의 역할 간 충돌을 완화하면서도 framework의 유연성은 증가시켜줌
[깃허브 링크](https://github.com/deepseek-ai/Janus) 🔗

📜 Paper Meta AI, KAUST

2024.10 3주차

Agent-as-a-Judge: Evaluate Agents with Agents

LLM-as-a-Judge에 agentic feature를 통합하여 Agent-as-a-Judge를 만들고 이를 code generation에 활용
realistic automated AI 개발 태스크로 구성된 새로운 벤치마크 DevAI를 제시
LLM-as-a-Judge와 비교했을 때, human evaluation baseline에 준할 정도로 뛰어난 성능

📜 Paper UC Berkeley, Washington Univ

2024.10 3주차

JudgeBench: A Benchmark for Evaluating LLM-based Judges

knowledge, reasoning, math, coding 태스크를 다루는 challenging response pari로 구성
현존하는 difficult dataset을 challenging response pair with preference label로 convert 해주는 pipeline을 포함하고 있음
response pair 데이터셋이 아닌 것을 convert 해주는 파이프라인은 활용 가치가 높은 것 같은데, 평가 방식 자체에 대단한 건 없는 것 같음

📜 Paper KAIST, Naver Cloud AI

2024.10 3주차

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

training data가 safe 하더라도 VL adaptation 동안 safety degradation이 발생한다고 설명
supervised fine-tuning with safety datasets | reinforcement learning from human feedback 등은 risk를 줄일 수 있지만 온전한 해결책이 아니라고 주장
해결책으로 weight merging를 제안하여 safety degradation을 줄이면서도 helpfulness를 유지할 수 있도록 함

📜 Paper AI2, Washington

2024.10 3주차

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

📜 Paper Google Research, Apple

2024.10 2주차

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

(1) 정보를 많이 담고 있는 특정 토큰을 이용하여 error detction을 시도했으나 generalize 되지 않음 → multifaceted
(2) internal representation은 모델이 일으키는 에러를 줄이는 데 활용될 수 있다는 것을 확인
(3) LLM의 internal encoding과 external behavior 사이의 discrepancy를 확인

📜 Paper Salesforce

2024.10 2주차

Enhance Reasoning by Learning from Mistakes: Peer-Review Knowledge Distillation from Multiple Large Language Models

Mistake-Aware Peer-Review Distillation (MAPD) 방식 제안
teacher 에게 student의 실수를 파악 및 설명하고 customized instruction learning data를 제공하도록 지시
simulated peer-review process를 디자인하여 acceptance threshold를 넘기는 rationale을 사용

🧑🏻‍💻 Dev feder-cr/Auto_Jobs_Applier_AIHawk

2024.10 2주차

feder-cr/Auto_Jobs_Applier_AIHawk

🧑🏻‍💻 Dev mendableai/firecrawl

2024.10 2주차

mendableai/firecrawl

📜 Paper Stanford

2024.10 2주차

Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise

under-served communities의 900명 tutor와 1,800명 학생이 참여한 대규모 연구
수학을 공부하는 학생들이 덕분에 유의미한 점수 향상(4%p)을 얻었다고 함
tutor마다 연간 $20 밖에 들지 않음

📜 Paper Hong Kong, Huawei, McGill & MILA

2024.10 2주차

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

LLM이 text revision을 잘한다는 점을 이용하여 response를 adaptive하게 revise하고 이를 reference로 삼아 이어지는 평가에 활용하는 방식을 고안

📜 Paper Microsoft, Tsinghua

2024.10 2주차

Differential Transformer

differential attention mechanism은 두 개의 separate softmax attention map의 차이로 attention score를 계산 → sparse attention pattern을 촉진
특히 long-context modeling, key information retrieval, hallucination mitigation, in-context learning, reduction of activation outlier 등에 탁월

🧑🏻‍💻 Dev HuggingFace

2024.10 2주차

gradio-app/openai-gradio

API 대신 로컬 모델로 구축할 수 있으면 좋을텐데 아쉽

📜 Paper Tsinghua, Microsoft

2024.10 2주차

Data Selection via Optimal Control for Language Models

CommonCrawl을 대상으로 PDS를 적용했을 때, 사전학습의 효율이 크게 향상된다는 것을 확인
Mistral 아키텍쳐를 기반으로 160M, 470M, 1B, 1.7B 모델로 실험
[깃허브 링크](https://github.com/microsoft/LMOps/tree/main/data_selection) 🔗

🧑🏻‍💻 Dev HuggingFace

2024.10 2주차

LLM Evaluation Guidebook

초보자/상급자를 위한 내용들이 포함되어 있음

📜 Paper Baidu

2024.10 2주차

Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation

이를 해결하기 위해 chain-of-verification (CoV-RAG)를 제안
verification module을 RAG에 넣어 scoring, judgement, rewriting에 참여하도록 함
internal generation error를 수정하기 위해 QA와 verification에 CoT reasoning을 포함하여 학습 진행

📜 Paper HKUST, UIUC

2024.10 2주차

Personalized Visual Instruction Tuning

MLLM이 target individual을 이미지 내에서 식별하고 coherent dialogue를 이어나갈 수 있도록 data curation & training framework를 포함하는 PVIT를 제안 (Personalized Visual Instruction Tuning)

📜 Paper Microsoft

2024.10 2주차

Scaling Optimal LR Across Token Horizons

optimal LR은 token horizon에 따라 변화하는데, longer training일수록 smaller LR이 필요
optimal LR도 scaling law를 따르기 때문에, longer horizon에 대한 optimal LR을 shorter horizon으로부터 예측할 수 있다고 주장
데이터셋, 모델 사이즈를 scale-up 할 때 필수로 참고해야 할 논문이 아닌가..

📜 Paper KAIST, Washington, LG AI Research

2024.10 2주차

Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

knowlege entropy 개념을 도입하여 모델이 engage하는 memory의 범위를 정량적으로 나타냄. 이 값이 높으면 모델이 넓은 범위의 memory source를 포함하는 것이고, 낮으면 반대임
pretraining이 진행됨에 따라 knowledge entropy가 낮아지고, 이는 모델의 knowledge acquisition & retain 능력 감소를 의미한다고 주장

📜 Paper OpenAI

2024.10 2주차

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

캐글의 75개 MLE competition을 curate하여, 모델 학습, 데이터셋 준비, 실험 수행 등 다양한 real-world ML engineering skill을 테스트 할 수 있도록 함
OpenAI의 o1-preview가 최고라는 걸 보여주는 연구 결과..?
[깃허브 링크](https://github.com/openai/mle-bench/) 🔗

📜 Paper Hong Kong

2024.10 2주차

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

reasoning에 필요한 필수적인 개념, 관련 이론, 유사한 문제 등을 LLM이 떠올릴 수 있도록 함
자체적으로 개발한 두 개의 중국어 벤치마크 MathMC, MathToF 공개
이런 방식이 정말 모델의 능력을 극대화하는 것이 맞나? 어떤 상황에서도 적용 가능한 방법은 맞나? 또 모델이 학생을 가르치는 내용의 데이터를 학습하지는 않았을 것 같은데 이것이 working 하는 이유는 뭘까?

🧑🏻‍💻 Dev Tesla

2024.10 2주차

Robotaxi

🧑🏻‍💻 Dev ML Code Challenges

2024.10 2주차

ML Code Challenges

행렬곱, 공분산행렬, Decision Tree 등등 다양한 개념들이 있어서 코드 연습해보기 좋은 것 같음. 카테고리는 linear algebra, machine learning, deep learning, nlp 등으로 구분됨

📜 Paper One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

2024.10 2주차

One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

이를 Explained Variance Adaptation (EVA)라고 부르는데, 다양한 태스크에 적용해 보았을 때, convergence 속도가 빠르고 평균적으로 높은 스코어를 달성할 수 있었다고 주장함

📜 Paper CMU

2024.10 2주차

Better Instruction-Following Through Minimum Bayes Risk

이는 reference-based evaluator를 사용하여 여러 후보 output 중에서 가장 high-quality인 것을 고를 수 있도록 돕는 방식임

📜 Paper Washington, AI2

2024.10 2주차

Can Language Models Reason about Individualistic Human Values and Preferences?

🧑🏻‍💻 Dev Google DeepMind

2024.10 1주차

How AlphaChip transformed computer chip design

실제로 6세대 TPU을 몇 개로 구성할지를 이것으로 찾음 (AI for chip design)

🧑🏻‍💻 Dev Anthropic

2024.10 1주차

Introducing Contextual Retrieval

Contextual BM25에 사용되는 index를 생성
context를 생성할 때는 사람이 직접할 수 없으므로 AI 모델을 사용 (Claude)

📜 Paper BAAI

2024.10 1주차

Emu3: Next-Token Prediction is All You Need

→ diffusion 또는 compositional architecture 불필요

📜 Paper Waterloo, Peking

2024.10 1주차

MIO: A Foundation Model on Multimodal Tokens

four-stage training process
(1) alignment pre-training (2) interleaved pre-training (3) speech-enhanced pre-training (4) comprehensive supervised fine-tuning

📜 Paper Microsoft

2024.10 1주차

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Channel-Independent Second-Order Optimization을 사용하여 가중치를 refine
[깃허브 링크](https://github.com/microsoft/VPTQ) 🔗

📜 Paper Apple

2024.10 1주차

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

high-quality OCR data & synthetic caption 을 continual pre-training에 활용 → optimized visual instruction-tuning data mixture를 supervised fine-tuning에 활용
MoE 아키텍쳐를 포함하여 모델 사이즈는 1B ~ 30B 로 구성
video understanding과 mobile UI understanding에 특화된 MM1.5-Video, UI 버전을 공개.

📜 Paper Meta, UIUC

2024.10 1주차

Law of the Weakest Link: Cross Capabilities of Large Language Models

7개의 core individual capabilities를 정의하고 이를 manually 짝지어 taxonomy를 구축
1,400개의 human-annotated prompts로 구성된 CrossEval 벤치마크를 공개. 각 individual & cross capability 마다 100개 prompt로 구성
이에 대한 평가를 수행해봤을 때, 현 LLM은 Law of the Weakest Link를 보인다고 주장

🧑🏻‍💻 Dev Liquid

2024.10 1주차

Liquid Foundation Models: Our First Series of Generative AI Models

32k token context length, effective across the entire range
오픈 소스 모델은 아님. Liquid Playground, Lambda, Perplexity Labs 등에서 사용 가능
최근 sLLM 에 대한 관심이 뜨거운 것 같은데, 이중에서도 오픈소스가 아닌 모델 패밀리를 공개하는 것은 오히려 흔하지 않은 상황으로 이해됨

📜 Paper CMU

2024.10 1주차

Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Embodied-RAG: navigation & language generation의 hierarchical knowledge를 자율적으로 구축할 수 있는 non-parametric memory system
다양한 환경과 query type에 대해 넓은 범위의 spatial & semantic resolution을 처리할 수 있음

📜 Paper Yale, OpenAI, Princeton

2024.10 1주차

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

embers of augoregression이라는 표현을 사용하고 있는데, 결국 다음 토큰을 반복적으로 예측해나가는 근본적인 특성으로 인해 발생하는 문제점을 지적하고 싶은 것으로 이해함

📜 Paper Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting

2024.10 1주차

Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting

세 단계로 구성된 diversity approach를 사용하여 다양한 합성 데이터를 생성 → 이는 in-context learning sample로 사용

📜 Paper Mila, Google DeepMind, Microsoft

2024.10 1주차

Not All LLM Reasoners Are Created Equal

compositional pair를 풀어내는 것과 각 문제를 따로 푸는 것의 결과가 독립적이라고 주장
이러한 결과는 더 작고, cost-efficient하며 수학 특화된 모델에서 두드러진다고 함

📜 Paper Johns Hopkins

2024.10 1주차

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

→ unlabeled data로부터 추출한 다양한 종류의 rationale annotations에 대한 사전학습을 기반으로 삼는 process-supervision of reasoning 모델, Rationalyst 제안
Pile 데이터셋으로부터 79K 개 rationale을 추출. 여기에 사람 개입은 최소화.

📜 Paper Apple

2024.10 1주차

Contrastive Localized Language-Image Pre-Training

CLIP에 region-text contrastive loss & module 을 보충하는 CLOC를 제안
이미지 embedding을 region representation으로 쉽게 변환할 수 있는 promptable embedding을 공식화

🧑🏻‍💻 Dev Google

2024.10 1주차

Gemini 1.5 Flash-8B is now production ready

경량화된 모델이라고 하는 것 같은데 실사용 성능이 어떤지는 커뮤니티 반응 조사 필요

📜 Paper Mila

2024.10 1주차

Were RNNs All We Needed?

2024년 9월 90건

📜 Paper HKUST, Amazon

2024.09 4주차

Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models

📜 Paper Tsinghua, Berkely, Anthropic, NYU

2024.09 4주차

Language Models Learn to Mislead Humans via RLHF

모델의 출력 결과를 사람이 직접 평가 → RLHF는 모델의 성능도 평가하기 어렵게 만든다.

📜 Paper Tsinghua, Shanhai AI Lab

2024.09 4주차

On the Diagram of Thought

propositions, critiques, refinements, verifications를 DAG 구조 내에 포함 → logical consistency를 유지하면서도 모델이 복잡한 reasoning pathways를 탐색하도록 함

📜 Paper Arizona State University

2024.09 4주차

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench

o1과 같은 Large Reasoning Model (LRM) 은 분명 눈에 띄는 성능 향상을 보여주고 있으나 아직까지 planning 능력이 충분하지 않다고 주장

📜 Paper NYU, Columbia

2024.09 4주차

Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

LLM-judgement는 safety, world knowledge, instruction following과 관계가 없다고 주장. 대신 style에 대해 더 높은 우선순위를 부여하고 있는 것으로 관측.
[코드 및 결과물 링크](https://anonymous.4open.science/r/mismo-bench-587D/readme.md) 🔗

📜 Paper NVIDIA

2024.09 4주차

Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B

40B tokens from FineWeb, Buzz-V1.2, and Dolma datasets
Packaged as NVIDIA NIM inference microservice for easy deployment
[허깅페이스 링크](https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct) 🔗

📜 Paper Google DeepMind

2024.09 4주차

Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries

context 내에서 단순히 정보를 retrieve 하는 것 이상의 long-context 평가를 하기 위한 통합 평가 프레임워크
코드 및 자연어 도메인에서 3개의 diagnostic long-context evaluations

🗞️ News SocialAI: we tried the Twitter clone where no other humans are allowed

2024.09 4주차

SocialAI: we tried the Twitter clone where no other humans are allowed

🧑🏻‍💻 Dev OpenAI

2024.09 4주차

Advanced Voice

Custom Instructions, Memory, five new voices, improved accents 등의 특징

🧑🏻‍💻 Dev Google

2024.09 4주차

Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more

1.5 Pro 비용 50% 감소, 2배 높아진 limit, 2배 빨라진 output
거대 모델을 이용하는 비용은 확실히 빠른 속도로 줄어들고 있음

📜 Paper NASA, IBM

2024.09 4주차

Prithvi WxC: Foundation Model for Weather and Climate

[허깅페이스 링크](https://huggingface.co/Prithvi-WxC) 🔗

🧑🏻‍💻 Dev Meta

2024.09 4주차

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

summarization, instruction following, rewriting tasks 등을 locally 처리 가능
AWS, Databricks, Dell, Fireworks 등 Llama Stack distributions을 위한 노력. Ollama에서 single-node로 지원하기도 함
[허깅페이스 링크](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) 🔗

📜 Paper Beijing Academy of AI

2024.09 4주차

Making Text Embedders Few-Shot Learners

few-shot exmaples를 이용하여 고퀄리티 text embedding을 생성하는 bge-en-icl 공개
MTEB, AIR-Bench에서 SOTA 달성

📜 Paper AI2, Washington

2024.09 4주차

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

→ speech 기반의 description을 사용하여 사람이 직접 highly detailed image caption dataset을 제작. 이것으로 학습한 VLM family, Molmo를 공개
model weights, captioning & fine-tuning data & source code 모두 공개 예정. [링크](https://molmo.allenai.org/) 🔗

📜 Paper HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

2024.09 4주차

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Planner, Navigator, Code Editor, Executor 네 개의 agent로 구성
[깃허브 링크](https://github.com/FSoft-AI4Code/HyperAgent) 🔗

🧑🏻‍💻 Dev stepfun-ai/GPT-OCR2_0

2024.09 4주차

stepfun-ai/GPT-OCR2_0

[데모 링크](https://huggingface.co/stepfun-ai/GOT-OCR2_0), [깃허브 링크](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/), [논문 링크](https://arxiv.org/abs/2409.01704) 🔗

📜 Paper York University

2024.09 4주차

Task-oriented Prompt Enhancement via Script Generation

(1) task’s input specification을 추출하기 위한 step-back prompting (2) required procedural steps를 identify 하기 위한 CoT prompting

📜 Paper Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models

2024.09 4주차

Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models

생성된 logical information을 augmented input으로 붙여서 모델에게 전달

📜 Paper Stanford

2024.09 4주차

Instruction Following without Instruction Tuning

(1) 상응하는 instruction 없이, 오직 response만 학습하더라도 instruction following 가능
(2) 이때 response의 desired distribution으로 학습할 필요는 없음
일반적인 instruction tuning 대비 갖는 장점이 무엇인지 모르겠음

📜 Paper NVIDIA, Singapore

2024.09 4주차

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

(1) High-quality Masks (2) Transferability: from 843M to 15B 사이즈 모델까지 working
[깃허브 링크](https://github.com/NVlabs/MaskLLM) 🔗

📜 Paper CMU, Amazon

2024.09 4주차

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

100k 개의 synthetically-created demonstrations 데이터로 7B CodeLlama를 학습

📜 Paper CMU, AI2, Washington, Stanford

2024.09 4주차

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

현실적인 user-AI interaction과 AI agents의 복잡한 tool use 능력을 평가할 수 있다고 주장
한 줄 요약하면 AI agents를 평가하기 위한 좋은 프레임워크를 만들어서 공개했음

🧑🏻‍💻 Dev PyTorch

2024.09 4주차

PyTorch Native Architecture Optimization: torchao

학습 및 추론에 둘 다 활용할 수 있도록 간단한 예시를 제공

📜 Paper Microsoft

2024.09 4주차

Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

(1) Explicit Facts (2) Implicit Facts (3) Interpretable Rationales (4) Hidden Rationales

📜 Paper Cambridge

2024.09 4주차

Small Language Models: Survey, Measurements, and Insights

🧑🏻‍💻 Dev Stability.AI

2024.09 3주차

Stable Diffusion 3 Medium Fine-tuning Tutorial

기존 SD1.5, SDXL 모델과 SD3M 파인튜닝의 차이점 설명

📜 Paper CMU, MIT

2024.09 3주차

Agent Workflow Memory

Agent Workflow Memory (AWM): 자주 반복되는 routine을 induce 하는 방법론으로, agent에게 workflow를 선택적으로 제공
offline & online 시나리오 둘 다 적용 가능, Mind2Web & WebArena 벤치마크로 실험
[깃허브 링크](https://github.com/zorazrw/agent-workflow-memory) 🔗

📜 Paper KAIST

2024.09 3주차

Stable Language Model Pre-training by Reducing Embedding Variability

Multi-head Low-Rank Attention (MLRA), output embedding의 exponential growth를 제안함으로써 instability를 완화
연구실에서는 아직도 GPT-2, Llama-2 등을 사용할 수밖에 없는 실정..

📜 Paper Peking, Microsoft

2024.09 3주차

CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks

→ Monte Carlo Tree Search (MCTS)를 이용하여 multi-step reasoning tasks 내의 다양한 planning step을 탐색하는 Critical Planning Step Learning (CPL) 제안
Step-APO (Step-level Adavantage Preference Optimization): MCTS를 통해 획득 가능한 step-level 선호쌍을 DPO와 통합

📜 Paper Wisconsin-Madison

2024.09 3주차

Your Weak LLM is Secretly a Strong Teacher for Alignment

→ weak LLM을 이용해서 human feedback만 사용할 때에 준하는, 혹은 그 이상의 효율을 뽑아내고자 함
본 연구에서는 OPT-125M 모델을 사용 → 굉장히 작은 사이즈의 모델로도 좋은 결과를 얻었다고 볼 수 있음

📜 Paper Chinese Academy of Sciecnes

2024.09 3주차

StruEdit: Structured Outputs Enable the Fast and Accurate Knowledge Editing for Large Language Models

→ StruEdit 제안: reasoning triplet으로 structured output을 반환하도록 프롬프팅 → outdated knowledge를 제거하고 효율적으로 up-to-date 정보로 채워 넣음

🧑🏻‍💻 Dev Microsoft

2024.09 3주차

Microsoft 365 Copilot Wave 2: Pages, Python in Excel, and agents

이런 통합 시스템을 구현하겠다고 작년부터 구글과 경쟁하고 있는 것 같은데 실효성은 아직 잘 모르겠음

🧑🏻‍💻 Dev Waymo

2024.09 3주차

Waymo’s Self-driving cars beat humans in safety

🧑🏻‍💻 Dev Google

2024.09 3주차

NotebookLM now lets you listen to a conversation about your sources

구글 [Illuminate](https://illuminate.google.com/home)에 이것이 사용된 것으로 보이고 Gemini 1.5의 멀티모달 능력을 이용
[NotebookLM 링크](http://notebooklm.google/) 🔗

📜 Paper Huawei

2024.09 3주차

Large Language Models are Good Multi-lingual Learners : When LLMs Meet Cross-lingual Prompts

LLM이 다른 언어로는 따르기 어려워하는 error-prone rule을 자동으로 번역
structured data 생성에 대한 auto-checking 메커니즘을 포함하는 프레임워크를 공개
이 부분은 확인할 필요가 있을 듯

🧑🏻‍💻 Dev Mistral AI

2024.09 3주차

AI in abundance

Mistral AI 모델들의 비용을 크게 줄임: Nemo 50%, Small & Codestral 80%, Large 33, …
le Chat에서 사용 가능한 Pixtral 12B 모델을 Apache 2.0 라이센스로 공개

🧑🏻‍💻 Dev Qwen

2024.09 3주차

Qwen2.5: A Party of Foundation Models!

3B & 72B 를 제외한 모델들은 Apache 2.0 라이센스
18T 토큰으로 학습하여 coding, mathematics, instruction following, long texts 등 다양한 영역에서 강점을 보임 → 128K 윈도우 사이즈 지원, 8K 토큰까지 생성 가능, 29개 언어 지원

📜 Paper ETRI

2024.09 3주차

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

→ GPTQ, AWQ, SmoothQuant, FP8 등 다양한 방식, 7B ~ 405B 사이즈 모델. 13개 벤치마크에서 평가
(1) FP 16 LLM은 hallucination detection & instruction following 제외하고 괜찮
(2) quantization 방법, 모델 사이즈, bit-width 등에 따라 결과가 천차만별

🧑🏻‍💻 Dev HuggingFace

2024.09 3주차

Fine-tuning LLMs to 1.58bit: extreme quantization made easy

허깅페이스에서 1.58b 로 학습하고 추론하는 방법에 대한 블로그 글을 게시

🗞️ News Snap

2024.09 3주차

Introducing New Spectacles and Snap OS: The Next Frontier of AR Glasses

OpenAI와의 파트너십을 발표하여 화제

📜 Paper ETH

2024.09 3주차

Breaking reCAPTCHAv2

YOLO 모델을 사용하여 100% 확률로 통과할 수 있었으며, 통과에 필요한 문제 수가 사람과 다르지 않다는 결론
[깃허브 링크](https://github.com/aplesner/Breaking-reCAPTCHAv2) 🔗

📜 Paper Texas at Austin, Johns Hopkins, Princeton

2024.09 3주차

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

→ CoT는 math, logic 과 같이 논리적인 태스크에서는 효과적이지만 그 외에는 그닥 영향이 없음
MMLU에서 질문이나 모델의 답변에 ‘=’ 기호를 포함하는 태스크를 제외하고서는 CoT를 쓰나 안쓰나 비슷
따라서 CoT는 상황에 맞게 선별적으로 사용하는 것이 좋을 것 같다는 결론

📜 Paper Texas at San Antonio

2024.09 3주차

Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Thought Validator agent를 동반한 ToT 기반의 Reasoner agent를 제시

🧑🏻‍💻 Dev GitHub

2024.09 3주차

Try out OpenAI o1 in GitHub Copilot and Models

Copilot Chat 중간에 o1-preview, o1-mini, GPT-4o 모델 간 변경 가능

🧑🏻‍💻 Dev Open-source FinePersonas datasets dropped in Huggingface with 21 million rows and 142GB size

2024.09 3주차

Open-source FinePersonas datasets dropped in Huggingface with 21 million rows and 142GB size

어떤 프롬프트를 사용했는지도 함께 공개

📜 Paper Microsoft

2024.09 3주차

Re-Reading Improves Reasoning in Large Language Models

질문을 두 번 처리함으로써 과정에 대한 이해도를 높인다는 것이 컨셉
단방향의 decoder-only LLM에서 “bidirectional” encoding을 사용하여 global information 활용

📜 Paper Huawei, McGill, Mila

2024.09 3주차

Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data

기존의 다른 능력들을 손상시키지 않으면서도 추론 능력을 향상시킬 수 있었다고 주장
[깃허브 링크](https://arxiv.org/abs/2409.12437) 🔗

📜 Paper Google DeepMind

2024.09 3주차

Training Language Models to Self-Correct via Reinforcement Learning

🧑🏻‍💻 Dev HuggingFace, IBM

2024.09 2주차

Improving Hugging Face Training Efficiency Through Packing with Flash Attention

최대 2배까지 높은 throughput으로 이어진다고 함

📜 Paper Google DeepMind

2024.09 2주차

Building Math Agents with Multi-Turn Iterative Preference Learning

→ multi-turn direct preference learning framework를 제안: multi-turn DPO & KPO

📜 Paper University of Toronto, Vector Institute

2024.09 2주차

Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

→ 특정 스킬이나 토픽에 대한 모델의 behavior를 요약한 natrual language summaries, Report Cards를 제안
specificity, faithfulness, interpretability, 세 기준을 근거로 Report Cards를 평가
human supervision 없이 Report Cards를 생성하는 iterative algorithm 제안

🧑🏻‍💻 Dev Replit

2024.09 2주차

Replit Agent

cursor의 composer와 유사한 기능으로 보임
long context, code understanding & generation에 많은 기업들이 집중하는 이유

🧑🏻‍💻 Dev Google

2024.09 2주차

Illuminate

현재 waitlist에 등록해야 하는 실험적 기능임

📜 Paper Beijing University

2024.09 2주차

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

instruction complexity, response quality, instruction diversity 세 개의 기준으로 데이터를 선별
선별된 데이터로 Llama-3를 학습하여 XCoder 모델을 공개

📜 Paper Mila, Princeton, Cambridge, Google DeepMind

2024.09 2주차

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

→ 본 연구 결과에 따르면 LLM이 meta cognitive knowledge를 지닌 것으로 판단된다고 함
수학 문제에 합리적인 skill label을 붙일 수 있다는 것이 확인되었음. 그 결과는 사람도 해석 가능.

📜 Paper Oxford

2024.09 2주차

Detecting hallucinations in large language models using semantic entropy

→ entropy-based uncertainty estimator를 도입하여 LLM이 hallucinations-confabulations-를 탐지할 수 있도록 함
데이터셋이나 task에 대한 사전 지식 없이도 적용 가능한 방법론임을 설명

📜 Paper Singapore University

2024.09 2주차

Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

→ 생성된 long text sequences 내의 특정 사건들을 식별할 수 있는 능력을 평가하는 Spinning the Golden Thread (SGT) 제안
LM이 특정 사건과 constraint를 포함하여 long-form text를 생성하도록 지시

🧑🏻‍💻 Dev Huawei

2024.09 2주차

Huawei unveils $2,800 tri-fold phone just hours after iPhone 16 launch.

📜 Paper University of Toronto

2024.09 2주차

Seek and Solve Reasoning for Table Question Answering

reasoning은 two-stage로 구성, CoT paths는 Seek-and-Solve CoT로 통합 (SS-CoT)

📜 Paper Stanford University

2024.09 2주차

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

LLM-generated idea가 사람이 만든 것보다 더 novel 하다는 결과 (p<0.05). 단, feasibility는 조금 더 낮은 것으로 확인됨.
얼마 전 Sakana에서 공개한 AI Scientist도 그렇고.. 확실히 연구도 AI로 하는 시대가 오게 될 듯

📜 Paper Apple

2024.09 2주차

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

H100에서 FlashAttention2 위에서 돌아가는 Flash-Sigmoid 도입 → 추론 속도 17% 향상
이런 것들은 실제 사용 경험을 많이 접해보고 적용하면 좋을 것 같음

📜 Paper UIUC, CMU

2024.09 2주차

Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance

→ thought-retrieval을 기반으로 researcher를 돕는 self-evoling, efficient LLM 시스템 제안
69.92%의 시간을 절약할 수 있다고 주장
[허깅페이스 스페이스 링크](https://huggingface.co/spaces/ulab-ai/ArxivCopilot) 🔗

🧑🏻‍💻 Dev SambaNova

2024.09 2주차

SambaNova Launches The World's Fastest AI Platform

오픈소스는 아니고 fine-tuning과 inference 솔루션을 판매하는 기업의 제품으로 보임

📜 Paper United We Care

2024.09 2주차

LLMs Will Always Hallucinate, and We Need to Live With This

→ 따라서 아키텍쳐 개선, 데이터셋 증가, fact-checking 등으로 hallucination을 제거한다는 것은 불가능하다고 주장

📜 Paper KAIST

2024.09 2주차

Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

사람은 Coherence & Fluency와 같은 internal quality와 관련된 작업에 능하고, LLM은 Consistency & Relavance와 같은 external alignment에 능하다는 분석 결과
[깃허브 링크](https://github.com/BBeeChu/InteractEval.git) 🔗

🧑🏻‍💻 Dev Intel, DeepLearning.AI

2024.09 2주차

Multimodal RAG: Chat with Videos

🧑🏻‍💻 Dev Google

2024.09 2주차

DataGemma: Using real-world data to address AI hallucinations

RIG(Retrieval-Interleaved Generation) & RAG 사용

📜 Paper Tsinghua

2024.09 2주차

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

scene, document, whole-page 스타일 등 다양한 이미지 양식을 커버할 수 있고 “글자” 단위로 처리하는 OCR tasks도 다룰 수 있음
좌표나 색상 등으로 설명되는 region-level recognition도 가능

🧑🏻‍💻 Dev FutureHouse

2024.09 2주차

PaperQA2

QA, 요약, contradiction detection 등 가능
`pip install paper-qa`
[논문 링크](https://storage.googleapis.com/fh-public/paperqa/Language_Agents_Science.pdf) 🔗

🧑🏻‍💻 Dev OpenAI

2024.09 2주차

Introducing OpenAI o1-preview

과학, 코딩, 수학 분야에서 뛰어난 성능 보임 (예: IMO 예선 83% 정답률, Codeforces 89번째 백분위)
o1-preview와 o1-mini 두 모델 제공, ChatGPT Plus/Team 사용자와 일부 API 개발자들에게 접근 권한 부여
향상된 안전 기능 적용 (jailbreaking 테스트에서 GPT-4o 대비 큰 성능 향상)

📜 Paper University of Mannheim

2024.09 2주차

Fine-tuning Large Language Models for Entity Matching

→ LLM fine-tuning: 1) LLM이 생성한 학습용 설명 데이터셋 2) LLM을 이용한 학습 데이터 선별
sLLM (Llama 3.1 8B) > LLM (GPT-4o Mini), in-domain > cross-domain, structured data 효과적

📜 Paper Meta, Oxford, UCL

2024.09 2주차

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

custom data source 입력 → real-wrold source에 근거한 intermediate reasoning step을 포함하여 합성 데이터를 생성
answerability에 따라 low-quality generation를 버릴 수 있어 데이터셋 퀄리티가 개선됨
multi-hop question answering (MHQA), tool usage in tabular question answering (TQA) 에 효과적

📜 Paper Alibaba

2024.09 2주차

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

📜 Paper Meta

2024.09 1주차

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

언어 모델의 loss function(next token prediction)을 diffusion과 결합하여 mixed-modality sequence에 대해 single transformer를 학습
7B 사이즈의 모델을 scratch부터 학습하고 2T multi-modal token을 사용, scaling law 확인.
텍스트로 이뤄진 시퀀스 중간에 이미지 패치의 vector가 <BOI> & <EOI> 태그 사이에 삽입

📜 Paper Stanford

2024.09 1주차

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

📜 Paper Google DeepMind, UCLA, Milla

2024.09 1주차

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

세 개의 주요 메트릭: coverage, diversity, false positive rate → WC가 더 높은 coverage, diversity, but 더 높은 false positive 비율
weak-to-strong improvement setup: weaker LM이 stronger LM에게 reasoning을 가르침
WC-generated data로 학습한 모델이 SE-generated data로 학습한 모델보다 뛰어난 성능

📜 Paper University of Virginia

2024.09 1주차

Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling

→ output answer와 CoT로부터의 reasoning path를 동시에 고려하여 생성되는 sample의 숫자를 dynamic하게 조절하는 early framework, Reasoning-Aware Self-Consistency (RASC)
생성되는 샘플들에 confidence score를 부여하고 일정 기준이 충족되면 stop → weighted majority voting

🧑🏻‍💻 Dev LMSYS

2024.09 1주차

Lmsys launches style control for Chatbot Arena to help separating the impact of style from substance in LLM rankings

📜 Paper DP Technology

2024.09 1주차

SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

continual pre-training (CPT) & supervised fine-tuning (SFT) 통합한 hybrid strategy 제안 → 과학 도메인 지식을 불어넣고 domain specific 태스크에서 instruction following 능력을 향상
이를 위해 (1) 고품질의 CPT corpora 필요 (2) 다양한 SFT instructions 생성 필요
→ PDF text extraction, parsing content error correction, quality filtering, synthetic instruction creation을 아우르는 pipeline으로 해결 시도

📜 Paper Independent Researcher

2024.09 1주차

CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation

→ catastrophic forgetting during continual learning 완화 & trainable parameters 감소
변형된 CUR decomposition: 1) 열과 행 선택에 역확률 (inverted probability) 2) U 행렬 0으로 초기화 3) U 행렬만 fine-tuning

📜 Paper Tsinghua University

2024.09 1주차

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

audio-based end-to-end conversational model, Mini-Omni (real-time speech를 위한 최초의 오픈소스 모델)
text-instructed speech generation, batch-parallel strategies 사용
speech output을 만들 수 있도록 학습하는 데 사용 가능한 데이터셋 VoiceAssistant-400K

📜 Paper Peking University, ByteDance

2024.09 1주차

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

→ 네 단계로 학습: 1) vison-language alignment 2) visual instruction-tuning 3) math instruction-tuning 4) process-supervised reinforcement learning → MultiMath-7B
K-12 수준의 image caption과 step-wise solution을 포함하는 MultiMath-300K 데이터셋 공개
[깃허브 링크](https://github.com/pengshuai-rin/MultiMath) 🔗

📜 Paper NVIDIA

2024.09 1주차

In Defense of RAG in the Era of Long-Context Language Models

그러나 극단적으로 길이가 긴 입력을 처리하는 것은 결국 관련성 높은 정보에 집중하는 것을 방해함으로써 성능 저하로 이어짐
→ order-preserve retrieval-augmented generation (OP-RAG) 제안
retrieved chunk가 증가할수록 답변 퀄리티는 초반에 상성하다가 결국 감소하여 U-shaped curve ⇒ OP-RAG가 이득을 볼 수 있는 지점이 분명히 존재한다

📜 Paper AI2, Washington, Princeton

2024.09 1주차

OLMoE: Open Mixture-of-Experts Language Models

5T 토큰으로 사전학습한 모델이며 instruct 버전도 함께 공개
Llama2-13B-Chat, DeepSeekMoE-16B 보다도 뛰어난 성능이라고 주장
모델 가중치, 학습 데이터, 코드, 로그 등을 오픈소스로 공개. 역시 AI2..

📜 Paper Tsinghua

2024.09 1주차

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

LCQA를 평가하기 위한 벤치마크 LongBench-Cite 제안
CoF (Coarse to Fine) 파이프라인 제안
LongCite-45k 데이터셋을 사용하여 LongCite-8B, 9B를 학습

📜 Paper Autodesk AI Research

2024.09 1주차

MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs

복잡한 추론을 하도록 세팅이 되어 있어서 단순한 problem-solving 전략과 다르다고 주장
모델이 실제 추론을 하지 않고 표면적인 패턴을 학습하여 정답을 맞히는 shortcut learning 현상을 최소화하는 것이 본 연구의 목표. shortcut learning의 정도를 평가할 수 있는 메트릭도 제시.
[깃허브 링크](https://github.com/asgsaeid/mmlu-pro-plus) 🔗

🧑🏻‍💻 Dev SSI

2024.09 1주차

lya Sutskever’s startup, Safe Superintelligence, *raises $1 BILLION*

📜 Paper Tsinghua University

2024.09 1주차

Attention Heads of Large Language Models: A Survey

사람의 생각을 네 단계의 프레임워크로 distill: 1) Knowledge Recalling, 2) In-Context Identification, 3) Latent Reasoning, 4) Expression Preparation
[깃허브 링크](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads) 🔗

📜 Paper HSE University

2024.09 1주차

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

source 이미지의 local & global 구조를 저장할 수 있도록 하는 layout-preserving energy function을 도입
→ fast & high-quality editing mechanism
[깃허브 링크](https://github.com/FusionBrainLab/Guide-and-Rescale) 🔗

📜 Paper Tsinghua University

2024.09 1주차

Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models

2024년 8월 72건

📜 Paper The Fin AI

2024.08 5주차

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

📜 Paper Singapore

2024.08 5주차

Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

(2) 모델 학습과 평가를 위한 핵심 데이터셋에 대한 리뷰
(3) data processing methods, popular architectures 등 모델링 테크닉 요약
외에도 잠재적인 어려움이나 미래 발전 방향에 대해 논한 survery 페이퍼

📜 Paper British Columbia

2024.08 5주차

Automated Design of Agentic Systems

Meta Agent Search: 이전의 발견들을 쌓아두어 점점 커지는 archive를 바탕으로 계속해서 새로운 agent를 프로그래밍 해나갈 수 있다는 아이디어
[깃허브 링크](https://github.com/ShengranHu/ADAS) 🔗

📜 Paper Kyoto University

2024.08 5주차

Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?

일본어로 continued pretraining 한 Swallow, 영어와 일본어를 균형 있게 학습한 LLM-jp
→ 영어만이 latent language인 Llama2와 달리, Swallow와 LLM-jp는 영어와 일본어 둘 다 laten language라고 볼 수 있음

📜 Paper HuggingFace

2024.08 5주차

Building and better understanding vision-language models: insights and future directions

🧑🏻‍💻 Dev Priceton-NLP

2024.08 5주차

Llama-3-8B-ProLong

Instruct 버전도 존재하며 현재는 64K 버전만 공개되어 있음. 향후 512K 버전도 공개 예정
1저자가 SimCSE 저자임

📜 Paper Institute of Automation

2024.08 5주차

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

→ 이미지와 비디오는 텍스트에 비해 더 인지적 직관성이 높다는 특징을 이용 (이미지 아레나임)
K개의 모델이 한 번에 경쟁에 참여 ⇒ ELO 알고리즘 대비 16.3배 빠른 수렴 속도
[허깅페이스 스페이스 링크](https://huggingface.co/spaces/ksort/K-Sort-Arena) 🔗

📜 Paper University of Edinburgh

2024.08 5주차

Explicit Inductive Inference using Large Language Models

LLM을 이용하여 premise를 attested alternative 세트로 변경 & 이를 기반으로 hypothesis derive ⇒ 둘을 이용하여 NLI task 성능 향상

🧑🏻‍💻 Dev Anthropic

2024.08 5주차

Anthropic publishes Claude’s system prompts

이는 [Claude.ai](http://Claude.ai) 와 모바일 앱에 영향을 주지만 API와는 무관함

🧑🏻‍💻 Dev Nous Research

2024.08 5주차

DisTro

깃허브에 A Preliminary Report on DisTrO를 공개

🧑🏻‍💻 Dev DeepLearning.AI

2024.08 5주차

Large Multimodal Model Prompting with Gemini

function calling과 API 통합 관련 내용까지 포함

🧑🏻‍💻 Dev Google

2024.08 5주차

Google just released three new experimental Gemini 1.5 models

[Google AI Studio](https://ai.google.dev/aistudio/)에서 사용 가능

📜 Paper Waseem Inc.

2024.08 5주차

Writing in the Margins: Better Inference Pattern for Long Context Retrieval

📜 Paper Google Research

2024.08 5주차

Diffusion Models Are Real-Time Game Engines

single TPU에서 초당 20 프레임으로 DOOM에서 simualte 가능
(1) RL-agent가 게임 플레이를 학습 (2) diffusion 모델이 이전 프레임과 행동들을 기반으로 다음 프레임을 생성하도록 학습
[깃허브 링크](https://gamengen.github.io) 🔗

🧑🏻‍💻 Dev Qwen

2024.08 5주차

Qwen2-VL: To See the World More Clearly

2B, 7B, 72B 중에서 72B는 API로만 이용 가능
72B 모델은 GPT-4o나 Claude 3.5-Sonnet을 넘어설 정도의 visual understanding benchmark score를 보여주었음

📜 Paper Google DeepMind

2024.08 5주차

Generative Verifiers: Reward Modeling as Next-Token Prediction

→ next-token prediction objective로 verifier를 학습, 즉 verification과 solution generation을 joint training
기존 instruction tuning, CoT reasoning 등과 seamlessly 통합 가능

📜 Paper Tsinghua

2024.08 5주차

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

→ 엄청나게 긴 생성 태스크를 여러 개의 subtask로 쪼개어 LLM이 20,000 단어 이상의 텍스트를 생성할 수 있도록 만드는 agent-based pipeline 제시
LongWriter-6K: 답변의 길이가 2K - 32K 에 이르는 텍스트로 구성된 데이터셋
장문의 텍스트 생성 능력이 있는지를 검증하는 벤치마크 LongBench-Write 또한 공개

📜 Paper Alibaba, Meta

2024.08 5주차

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

🧑🏻‍💻 Dev TII

2024.08 4주차

Welcome FalconMamba: The first strong attention-free 7B model

최적화 벤치마크에서는 더욱 뛰어난 성능
base/instruct 버전의 모델을 각각 공개 + 4-bit 버전도 공개 ([허깅페이스 링크](https://huggingface.co/tiiuae) 🔗)

📜 Paper Google DeepMind

2024.08 4주차

Towards flexible perception with visual memory

→ (1) 데이터의 사이즈에 관계 없이 이를 자유롭게 추가할 수 있는 능력 (2) unlearning & pruning을 통해 데이터를 삭제할 수 있는 능력 (3) 해석 가능한 의사 결정 메커니즘

📜 Paper I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm

2024.08 4주차

I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm

→ from scratch에서 계속해서 self-align 하는 학습 방식을 제안
Qwen & Llama 모델의 성능을 크게 개선할 수 있었다고 주장

📜 Paper DeepSeek

2024.08 4주차

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

DeepSeek-Prover-V1 모델의 학습 & 추론 과정을 최적화한 DeepSeek-Prover-V1.5 모델 공개
[깃허브 링크](https://github.com/deepseek-ai/DeepSeek-Prover-V1.5) 🔗

📜 Paper Salesforce AI, Univ of Washington

2024.08 4주차

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

엄선된 학습 데이터셋, 학습 레시피, 모델 아키텍쳐, 학습 결과 등을 오픈소스로 공개
DPO를 이용하여 safety tuning을 적용

📜 Paper Meta

2024.08 4주차

Imagine yourself: Tuning-Free Personalized Image Generation

→ 1) 이미지 다양성을 높이기 위한 synthetic paired data 생성 메커니즘, 2) 완전히 병렬적인 세 개의 text encoder와 학습 가능한 visual encoder, 3) visual quality를 점진적으로 향상시키는 coarse-to-fine multi-stage finetuning

📜 Paper Vanderbit University

2024.08 4주차

Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning

→ 이를 해결하기 위해 Counterfactual CoT & Agnostically Primed CoT 를 제안
bias를 줄이는 데 전자로만은 불충분할 수 있긴 하나, 특정 상황에서는 충분

🧑🏻‍💻 Dev Lambda

2024.08 4주차

Unveiling Hermes 3: The First Full-Parameter Fine-Tuned Llama 3.1 405B Model is on Lambda’s Cloud

[Lambda Chat Completions API](http://api.lambdalabs.com/docs)와 [Lambda Chat](https://lambda.chat/)에서 사용 가능

📜 Paper Google Research

2024.08 4주차

Transformers in music recommendation

Intention of action, Salience metrics, Metadata, Music track identifiers

🧑🏻‍💻 Dev Microsoft

2024.08 4주차

Microsoft releases Phi-3.5-mixture-of-experts (MoE)

4.9T 토큰 학습, 그중 10%는 multilingual content, 128k 토큰 길이 지원
SFT, PPO, DPO 등 학습 과정을 거침

🧑🏻‍💻 Dev OpenAI

2024.08 4주차

Fine-tuning now available for GPT-4o

[fine-tuning dashboard](https://platform.openai.com/finetune) 에서 사용할 수 있음

📜 Paper Waterloo, Fudan

2024.08 4주차

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

industrial scenarios를 반영한 벤치마크, TableBench를 제안
GPT-3.5 수준의 성능을 내는 TabelLLM을 소개 (TableInstruct 데이터셋으로 학습)

🧑🏻‍💻 Dev Ideogram

2024.08 4주차

Introducing Ideogram 2.0

Flux, Midjourney에 도전..! Color Palette Selection, Enhanced Text Rendering, Search Functionality, Improved Image Coherence 가 특징

📜 Paper NVIDIA

2024.08 4주차

LLM Pruning and Distillation in Practice: The Minitron Approach

depth pruning & joint hidden/attention/MLP (width) pruning 에 대해 탐구
기존 데이터를 모르는 상황에서 teacher 모델을 distillation dataset에 학습하는 방식이 유익할 수 있다고 주장
허깅 페이스에 공개: [Mistral-NeMo-Minitron-8B-Base](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base) | [Llama-3.1-Minitron-4B-Width-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base) | [Llama-3.1-Minitron-4B-Depth-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base)

🧑🏻‍💻 Dev Adobe Research

2024.08 4주차

MagicFixup

기존에는 이런 모델을 학습하기 위해 이미지를 사용하는데, 여기서는 비디오를 사용

🧑🏻‍💻 Dev Meta

2024.08 4주차

Sapiens: Foundation for Human Vision Models

위 네 개의 핵심 vision tasks를 지원하는 모델 패밀리 Sapiens를 공개
[아카이브 링크](https://about.meta.com/realitylabs/codecavatars/sapiens?_bhlid=9ff3b20994dca7d88de03063c5de34f1da2853ed) 🔗 [깃허브 링크](https://github.com/facebookresearch/sapiens) 🔗

📜 Paper Singapore

2024.08 4주차

LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

Medical Classification & NER 벤치마크 점수 비교: BioMistral & Llama-2
standard prompting, CoT, Self-Consistency, RAG 등을 비교 → standard best
knowledge, reasoning 향상을 위한 여러 prompt 테크닉이 biomedical tasks에 쉽게 적용 불가능하다는 것을 시사하는 실험 결과

🧑🏻‍💻 Dev AI21 labs

2024.08 4주차

The Jamba 1.5 Open Model Family: The Most Powerful and Efficient Long Context Models

비슷한 사이즈의 모델 중에서 Mixtral 8x22B, Command-R+ 보다 뛰어난 성능 (Mini)
256K context window 사이즈를 가지며 추론 속도도 빠른 것이 특징
[허깅페이스 링크](https://huggingface.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251) 🔗

📜 Paper Google

2024.08 4주차

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

각 draft는 retrieved documents의 subset으로 생성 → draft당 input token count는 줄이면서 다양한 관점을 제공할 수 있다는 장점
각 subset에 대한 이해도를 높이고 긴 context에 대한 position bias를 줄일 수 있음
[Google Research 블로그 포스팅 링크](https://research.google/blog/speculative-rag-enhancing-retrieval-augmented-generation-through-drafting/) 🔗

🧑🏻‍💻 Dev Anthropic

2024.08 4주차

Anthropic added support Latex rendering in Claude Web interface

📜 Paper Google DeepMind

2024.08 3주차

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Gemma 2 2B의 전체 layer, 9B의 일부 layer에서 학습, 27B에서 선택된 JumpReLU SAEs를 공개 → 비교를 위해 instruction-tuned version을 함께 공개

📜 Paper Liverpool

2024.08 3주차

Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models

→ LLM consistency를 평가하기 위한 새로운 벤치마크 제안, 직관적인 프롬프트 전략 제안
Andrej Karpathy가 언급한 [Jagged Intelligence](https://x.com/karpathy/status/1816531576228053133)와 관련된 문제로 볼 수 있음

📜 Paper Sakana AI

2024.08 3주차

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

open-ended 방식으로 아이디어 발전 과정을 반복하며 knowledge archive를 키워 나감
diffusion modeling, transformer-based language modeling, learning dynamics, 세 분야에서 실험하는 동안 15$ 이하의 비용이 발생
[깃허브 링크](https://github.com/SakanaAI/AI-Scientist) 🔗

📜 Paper Microsoft, Harvard

2024.08 3주차

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

1. target SLM이 Monte Carlo Tree Search (CMTS)를 human-like reasoning actions로 증강
2. another SLM이 target SLM이 만들어내는 trajectory를 discriminate
→ 양측 동의를 받은 것들은 mutual consistent로 구분

🧑🏻‍💻 Dev Anthropic

2024.08 3주차

Prompt caching with Claude

배경 지식, 예시 등을 설명하는데 사용되었던 컨텍스트가 캐싱됨으로써 비용을 90%까지 줄이고 latency도 85%까지 감소할 수 있음.
현재 public beta로 Claude 3.5 Sonnet & Haiku 에서 사용 가능

🧑🏻‍💻 Dev xAI

2024.08 3주차

Grok-2 Beta Release

(xAI피셜..) Claude 3.5 Sonnet & GPT-4-Turbo 이상의 성능
Grok-2 & Grok-2 mini 를 X로 선공개. 추후 Grok에서 API 지원

📜 Paper ACL 2024 Best Paper Award

2024.08 3주차

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model

101개 언어를 지원하는 multilingual generative language model
instruction datasets을 [링크](https://hf.co/CohereForAI/aya-101)에 공개
[Cambridge, ETH] [Causal Estimation of Memorisation Profiles](https://arxiv.org/abs/2406.04327)

🧑🏻‍💻 Dev Google

2024.08 3주차

Gemini Live

Gemini Advanced 구독자 대상

🧑🏻‍💻 Dev Qwen

2024.08 3주차

Introducing Qwen2-Math

closed-source models (gpt-4o) 보다도 뛰어난 수학적, 추론 능력을 지녔다고 주장
[깃허브](https://github.com/QwenLM/Qwen2-Math) 링크 🔗 [허깅페이스](https://huggingface.co/Qwen) 링크 🔗

📜 Paper Google DeepMind

2024.08 3주차

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

(1) dense, process-based verifier reward models에 대한 searching
(2) 추론 시 프롬프트가 주어지면 response에 대해 adaptive 하게 모델 분포를 업데이트
→ ‘사전학습 vs 추론’ 시간의 trade-off에 관한 연구: 작은 모델들도 뛰어난 성능 달성

🧑🏻‍💻 Dev DeepLearning.AI

2024.08 3주차

Improving accuracy of LLM applications

Llama 3-8b 모델을 학습하여 text-to-SQL 어플리케이션을 개발

📜 Paper Oxford

2024.08 3주차

Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering

curriculum learning의 난이도를 사람이 정하는 것보다 모델이 정하는 것이 더 효율적이었다는 결과

🧑🏻‍💻 Dev MetaGPT: The Multi-Agent Framework

2024.08 3주차

MetaGPT: The Multi-Agent Framework

아주 간단하게 소프트웨어 제작 가능

🧑🏻‍💻 Dev NVIDIA

2024.08 3주차

How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model

📜 Paper Sheffiled, Liverpool

2024.08 2주차

Adaptive Retrieval-Augmented Generation for Conversational Systems

발화할 때 과거의 내용을 돌아보게 만들어야하지 않을까 생각했던 것과 유사한 접근이라고 느껴짐

📜 Paper Sapienza NLP Group

2024.08 2주차

ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget

Retriever 모듈은 entity, relation 후보를 탐색 → Reader 모듈은 실제 관계를 파악

📜 Paper Meta

2024.08 2주차

Self-Taught Evaluators

unlabeled instruction → contrasting model outputs → reasoning traces & final judgements
최근 가장 주목을 받은 논문이 합성 데이터로 인한 모델 붕괴인데.. 아이러니하다.

📜 Paper ByteDance

2024.08 2주차

Language Model Can Listen While Speaking

listening-while-speaking language model (LSLM) 이라는 모델 디자인을 공개
early fusion, middle fusion, late fusion 셋 중에서 middel fusion의 balance가 가장 훌륭
OpenAI에서 공개했던 자연스러운 실시간 대화와 관련된 연구로 보임

🧑🏻‍💻 Dev NVIDIA

2024.08 2주차

Advancing Humanoid Robot Development

사용자의 움직임을 비전프로로 인식하고 로봇이 이를 실시간으로 모방하는 형태

🧑🏻‍💻 Dev OpenAI

2024.08 2주차

Introducing Structured Outputs in the API

`“strict”: true` 로 설정 시 100% 확률로 structured output 반환
function calling 또는 response_format 파라미터로 기능 지원

📜 Paper OpenGVLab, Tsinghua

2024.08 2주차

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

7개 종류의 multi-image 관계, 52개 태스크, 77K 이미지, 11K multiple-choice questions로 구성

🧑🏻‍💻 Dev DeepLearning.AI

2024.08 2주차

AI Python for Beginners

비지니스, 마케팅과 같은 실제 산업 분야에 파이썬을 활용하는 방법 안내
AI 어시스턴트를 이용한 코드 디버깅, 개념 설명 등을 시도

📜 Paper Google DeepMind

2024.08 2주차

Achieving Human Level Competitive Robot Table Tennis

탁구 칠 수 있는 로봇을 개발했는데 특징은 다음과 같음 (아마추어 수준으로 판단)
hierarchical and modular policy architecture
zero-shot sim-to-real을 가능하게 만드는 기술

🧑🏻‍💻 Dev HuggingFaceM4

2024.08 2주차

Idefics3-8B-Llama3

[google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) & [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
[v1 paper](https://huggingface.co/papers/2306.16527) 링크 🔗 & [v2 paper](https://huggingface.co/papers/2405.02246) 링크 🔗

🧑🏻‍💻 Dev NVIDIA

2024.08 2주차

Build a Digital Human

웹 사이트에서 음성을 통해 실시간 interaction 가능

📜 Paper Jilin University

2024.08 2주차

Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models

→ 세 개의 regularization terms: (1) consistency regularizer (2) diversity regularizer (3) singular vector decomposition regularizer
[깃허브 링크](https://github.com/cyp-jlu-ai/BA-LoRA) 🔗

📜 Paper Appier AI Research

2024.08 2주차

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

특정 포맷을 강제할수록, 그리고 포맷이 엄격할수록 모델의 추론 능력이 하락하는 경향성을 관측

🧑🏻‍💻 Dev Google

2024.08 1주차

Smaller, Safer, More Transparent: Advancing Responsible AI with Gemma

[Gemma 2 허깅페이스 링크](https://huggingface.co/collections/google/gemma-2-2b-release-66a20f3796a2ff2a7c76f98f) 🔗
언어 모델의 생성 결과를 필터링 해주는 ShieldGemma를 공개. SoTA급 성능.
모델의 내부 동작 과정을 살펴볼 수 있는 툴 Gemma scope 🔭 공개.

🧑🏻‍💻 Dev PyTorch

2024.08 1주차

Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile

[torchchat GitHub 링크](https://github.com/pytorch/torchchat) 🔗

🧑🏻‍💻 Dev DeepLearning.AI

2024.08 1주차

Embedding Models: From Architecture to Implementation

Word2Vec과 BERT와 같은 모델을 다양한 semantic search에 어떻게 활용하는지 학습

📜 Paper Tsinghua

2024.08 1주차

Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

NLI 데이터셋에 대해 MiniCPM, Phi-2, Gemma 모델을 contrastive fine-tuning

🧑🏻‍💻 Dev Stability.AI

2024.08 1주차

Introducing Stable Fast 3D: Rapid 3D Asset Generation From Single Images

게임, 가상현실 개발자들을 위한 어플리케이셔늘 포함
[허깅페이스 링크](https://huggingface.co/stabilityai/stable-fast-3d) 🔗

🗞️ News Figure

2024.08 1주차

Figure 02

📜 Paper Tsinghua

2024.08 1주차

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

→ LLM의 knowledge 활용 능력을 평가하기 위해 평가용 데이터셋을 자동적으로 생성하는 프레임워크 RAGEval을 제시
Completeness, Hallucination, Irrelevance 세 개의 metric을 사용

2024년 7월 74건

📜 Paper Oxford, Cambridge, Imperial College London, Toronto

2024.07 5주차

AI models collapse when trained on recursively generated data

LLM 생성 데이터가 점점 늘어나고 있는 상황에서 인간이 직접 만들어낸 데이터의 가치는 점점 높아질 것이라고 예측

📜 Paper Washington, AI2

2024.07 5주차

The Art of Refusal: A Survey of Abstention in Large Language Models

이를 query, model, human value, 세 개의 관점에서 평가하난 프레임워크를 제시

📜 Paper Equall

2024.07 5주차

SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain

domain adaptation 과정은 세 단계로 구성됨.

🧑🏻‍💻 Dev Meta

2024.07 5주차

Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images

memory mechanism: 과거 segmentation 정보를 저장 & 불러오기 하여 프레임 간 continuous tracking이 가능
real-time processing이 가능한 빠른 추론 속도
51K videos & 600K masklets로 구성된 SA-V dataset 공개

🧑🏻‍💻 Dev OpenAI

2024.07 5주차

GPT-4o Long Output

요즘 가장 큰 두 개의 트렌드는 context 늘리기와 모델 사이즈 줄이기 (추론 속도 up)

📜 Paper Meta, Berkeley, NYU

2024.07 5주차

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

📜 Paper New York University

2024.07 4주차

A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

최근 2년 간의 prompting 연구에 대해 총망라

📜 Paper Generative AI Research Lab (GAIR), Fudan

2024.07 4주차

Weak-to-Strong Reasoning

samll, but high-quality dataset으로 지도 학습을 시작 → 모델 스스로 contrastive sample로 식별한 케이스들에 대해 preference optimization
세 개의 weak 모델을 이용하여 LLama2-70B 모델의 성능을 향상시킬 수 있었다고 보고

📜 Paper Apple, Meta

2024.07 4주차

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

병목을 해결하기 위해 prefilling과 decoding에 중요한 토큰의 KV만 선별적으로 계산하는 방식 LazyLLM을 제안
다른 방식들과 달리 매 생성 step에서 ‘dynamically’ 토큰을 고른다는 점이 특징
기존 모델들에 추가 학습 없이 seamlessly 통합 가능하다는 점이 특징

🧑🏻‍💻 Dev groq

2024.07 4주차

Introducing Llama-3-Groq-Tool-Use Models

[Llama-3-Groq-70B-Tool-Use](https://huggingface.co/Groq/Llama-3-Groq-70B-Tool-Use) & [Llama-3-Groq-8B-Tool-Use](https://huggingface.co/Groq/Llama-3-Groq-8B-Tool-Use)
[GroqCloud Devloper Hub](http://console.groq.com/)에서도 이용 가능

📜 Paper Google DeepMind

2024.07 4주차

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Gemma 2 9B activations를 기준으로 reconstruction fidelity에서 SoTA를 달성한 JumpReLU SAEs를 제안
activation 관련해서 오랜만에 눈에 띄는 논문..

🧑🏻‍💻 Dev Meta

2024.07 4주차

Introducing Llama 3.1: Our most capable models to date

GPT-4 수준을 상회하는 오픈소스 모델은 최초라고 봐도 될 듯
[Meta paper 링크](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/) 🔗
[Hugging Face Model Family 링크](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f) 🔗

📜 Paper NC Research

2024.07 4주차

OffsetBias: Leveraging Debiased Data for Tuning Evaluators

→ judge 모델에 존재하는 6개 종류의 bias에 대한 연구
각 bias 종류별로 hand-crafted test 케이스를 포함하는 EvalBiasBench 제안

🧑🏻‍💻 Dev Numina, Hugging Face, MIT, Mistral, Peking

2024.07 4주차

NuminaMath

1M 수학 문제 & 정답으로 구성된 high-quality training dataset
[Hugging Face 데이터셋 링크](https://huggingface.co/collections/AI-MO/numinamath-6697df380293bcfdbc1d978c) 🔗

🧑🏻‍💻 Dev WWDC 24: Running Mistral 7B with Core ML

2024.07 4주차

WWDC 24: Running Mistral 7B with Core ML

간단히 공부하기 좋을 것 같은 허깅페이스 블로그 글

🧑🏻‍💻 Dev Mistral AI

2024.07 4주차

Mistral Large 2

French, German 등 다양한 언어 뿐만 아니라 Python, Java 등 프로그래밍 언어에도 특화
비상업적, 연구적 목적으로 이용 가능. [weight download](https://models.mistralcdn.com/mistral-large-2407/mistral-large-instruct-2407.tar) 🔗 [HuggingFace](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) 🔗

🧑🏻‍💻 Dev OpenAI

2024.07 4주차

SearchGPT Prototype

conversational capability를 향상시킴으로써 real-time 정보를 보다 쉽게 획득할 수 있음
partnering with publisher & creator

🧑🏻‍💻 Dev Cohere

2024.07 4주차

Introducing Rerank 3 Nimble: Faster Reranking for Enterprise Search & Retrieval-Augmented Generation (RAG) Systems

영어 외에도 100개 이상의 언어를 지원
[Amazon Sagemaker](https://aws.amazon.com/marketplace/pp/prodview-rq7ik6yx6jnzc) 🔗

🧑🏻‍💻 Dev Google

2024.07 4주차

Gemini’s big upgrade: Faster responses with 1.5 Flash, expanded access and more

현재 트렌드는 조금 덜 뛰어난 성능일지라도 빠른 답변을 할 수 있는 모델을 제공하는 것. 빠른 속도를 한 번 경험하고 나면 느린 모델에 대한 반감이 커질 것 같다는 생각이 듦.

📜 Paper AI2, University of Washington, Microsoft

2024.07 4주차

The Art of Saying No: Contextual Noncompliance in Language Models

모델이 언제 어떻게 유저의 요청을 따르지 말아야 하는지에 대한 어휘 분류 체계를 도입
1,000개의 noncompliance prompt를 바탕으로 실험 → 30% 정도는 유저의 요청을 제대로 따르지 못하고 있음
→ request & noncompliant response로 구성된 학습용 학습 데이터를 제작 → Fine-tuning은 overfit으로 이어지는 반면 LoRA 같은 기법이 밸런스가 좋음

📜 Paper University of Washinton, AI2

2024.07 4주차

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

→ GPT-4o의 토크나이저는 39%의 non-English data로 학습되어 전작보다 multilingual 하다고 이야기 할 수 있음
→ Llama3 모델은 48%의 non-English data로 학습되었음

📜 Paper NVIDIA

2024.07 4주차

Compact Language Models via Pruning and Knowledge Distillation

📜 Paper Georgia Tech, NVIDIA

2024.07 3주차

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

LLM을 contest ranking & answer generatino, 두 가지에 fine-tuning 하는 방식
이런식으로 학습된 모델은 ranking 관련 데이터를 조금만 학습하더라도 기존 모델들보다 월등한 성능을 보임

📜 Paper MIT, University of Washington

2024.07 3주차

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

따라서 각각에 대한 attention weight의 비율을 입력 feature로 받는 hallucination detection model을 제안
lookback ration-based detector, Lookback Lens

📜 Paper Microsoft

2024.07 3주차

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

structural-anchor-based compression, inverse index translation, data-format-aware aggregation, 세 요소로 구성된 SheetCompressor를 도입
이를 바탕으로 Chain of Spreadsheet를 제안

🧑🏻‍💻 Dev DeepLearning.AI, MongoDB

2024.07 3주차

Prompt Compression and Query Optimization

Prefiltering and Postfiltering, Projection, Reranking, Prompt Compression

📜 Paper Qwen, Alibaba

2024.07 3주차

Qwen2 Technical Report

multilingual 능력이 뛰어나 30개 언어를 커버할 수 있다고 강조
[허깅페이스](https://huggingface.co/Qwen)와 [ModelScope](https://modelscope.cn/organization/qwen)에서만 이용 가능. [깃허브](https://github.com/QwenLM/Qwen2)에서 예시 코드 참조 가능.

🧑🏻‍💻 Dev Mistral AI

2024.07 3주차

MathΣtral

Mathstral: 수학적 추론 능력이 탁월한 7B 모델. 32K context window. Apache 2.0
Codestral Mamba: 코드 생성에 특화된 Mamba2 language model. Apache 2.0

🧑🏻‍💻 Dev LlamaIndex

2024.07 3주차

GraphRAG Implementation with LlamaIndex

🧑🏻‍💻 Dev AnthropicAI

2024.07 3주차

Doubled max output token limit for Claude 3.5 Sonnet

API, console 둘 다 적용 가능

📜 Paper University of Toronto

2024.07 3주차

Toward Adaptive Reasoning in Large Language Models with Thought Rollback

LLM이 thought에 대해 error 분석을 수행. trial-and-error를 프롬프트에 포함.
평소에 내가 고민하던 ‘인간이 사고하는 방식’을 고민한 것처럼 보이는 연구 결과

🧑🏻‍💻 Dev HuggingFace

2024.07 3주차

SmolLM - blazingly fast and remarkably powerful

Cosmopedia v2, FineWeb-Edu, Stack-Edu-Python을 정제한 Smollm-Corpus 데이터셋 ([링크](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) 🔗)

🧑🏻‍💻 Dev OpenAI

2024.07 3주차

Prover-Verifier Games improve legibility of language model outputs

정확도만을 높이기 위해 학습된 모델은 legibility가 떨어진다는 문제가 존재
Prover-Verifier Game 이론을 바탕으로 하는 학습 알고리즘을 제안
small verifier는 solution이 옳았는지를 구분하도록 학습, helpful prover는 verifier에게 인정받을 정확한 답변을 생성하도록 학습, sneaky prover는 verifier를 속일 수 있는 부정확한 solution을 생성하도록 학습.

🧑🏻‍💻 Dev Upstage, DeepLearning.AI

2024.07 3주차

Pretraining LLMs

Meta의 Llama 모델을 비롯한 다양한 모델들을 원하는대로 학습하는 방식 등
학습 비용을 크게 줄여주는 Depth Upscaling에 대한 소개
업스테이지 강의가 여기에 나오다니.. 엄청 신기..

🧑🏻‍💻 Dev Andrej Karpathy

2024.07 3주차

new AI Education company called Eureka labs

LLM101n 라는 첫 번째 컨텐츠 ([링크](https://github.com/karpathy/LLM101n) 🔗)
홈페이지 [링크](https://eurekalabs.ai/) 🔗, 깃허브 [링크](https://t.co/ubv4xONI57) 🔗

🧑🏻‍💻 Dev Apple

2024.07 3주차

DCLM-7B-8k

systematic data curation 관련해서 이점이 있음
Common Crawl로부터 추출한 240T 토큰의 corpus, DCLM (논문 [링크](https://arxiv.org/abs/2406.11794) 🔗)

🧑🏻‍💻 Dev OpenAI

2024.07 3주차

GPT-4o mini: advancing cost-efficient intelligence

reasoning, math & coding, multimodal reasoning 특화되어 있음
LMSYS의 리더보드에서 GPT-4 보다도 선택을 많이 받으며 MMLU도 82점을 기록

🧑🏻‍💻 Dev Mistral AI

2024.07 3주차

Mistral NeMo

128k context window를 지원
sentence 기반의 tokenizer → Tiktoken 기반의 tokenizer, Tekken을 사용

📜 Paper Tsinghua, CMU

2024.07 3주차

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

기존에는 이러한 데이터를 다른 LLM으로 생성하는 방식도 있으나, 법적 문제, 의존성 문제 등이 제기
→ task-specific input-output pair를 student LLM으로부터 합성하고, 이것으로 스스로를 학습하는 Self-Guide 메커니즘을 제안

📜 Paper University of Washington, AI2

2024.07 3주차

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

→ inference 시 사용 가능한 datastore의 사이즈를 키워 retrieval-based LM의 성능을 지속적으로 개선.
뭔가 당연해 보이는데.. datastore를 키워서 이를 이용하면 사이즈만 큰 모델보다 잘한다는 결과를 제시함
1.4T 토큰에 해당하는 datastore, MassiveDS 공개. ([링크](https://github.com/RulinShao/retrieval-scaling) 🔗)

📜 Paper The University of Hong Kong

2024.07 3주차

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

→ 큰 모델일수록 큰 vocab을 사용하는 것이 좋다. 그러나 현재 모델들은 너무 작은 vocab을 쓰고 있다.
예를 들어 Llama2-70B 모델에는 216K 이상의 vocab이 적절 (현재는 32K)

📜 Paper Meta

2024.07 3주차

Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation

global text description을 기반으로 fine-grained local control도 가능
information bottleneck layer를 temporal blurring과 함께 적용하여 디테일한 컨트롤과 관련된 정보를 추출
이런 모델들은 평가를 어떻게 하는 걸까?

📜 Paper Moqi, Peking

2024.07 3주차

Memory3: Language Modeling with Explicit Memory

📜 Paper Salesforce AI

2024.07 2주차

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

21개 카테고리에 대해 3,673개의 실행 가능한 fuction-calling 데이터를 수집
format checking, actual function execution, semantic verification, 세 단계를 거침
허깅페이스 데이터셋 링크: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k

🧑🏻‍💻 Dev Reddit

2024.07 2주차

ChatGPT prompt hacking issue

v1 ~ v6까지의 personality가 있고 현재는 v2 (Balanced & Friendly) 라고 답변

📜 Paper KAIST, AWS

2024.07 2주차

FineSurE: Fine-grained Summarization Evaluation using LLMs

completeness, conciseness,faithfulness 등을 기준으로 삼음
open-source vs proprietary LLMs를 비교
깃허브 링크: https://github.com/DISL-Lab/FineSurE-ACL24

📜 Paper Harvard

2024.07 2주차

Transcendence: Generative Models Can Outperform The Experts That Train Them

이를 Transcendence (초월성) 이라고 정의했는데, 과연 다양한 분야에 적용 가능한 것일지 의문

🧑🏻‍💻 Dev W&B

2024.07 2주차

Developer's guide to LLM prompting

🧑🏻‍💻 Dev Meta

2024.07 2주차

Multi-token-prediction

8-byte prediction 성능 굿. 요약 성능 굿.

🧑🏻‍💻 Dev Microsoft

2024.07 2주차

MInference

single A100에서 운용

📜 Paper Auburn University

2024.07 2주차

Vision language models are blind

→ 그러나 일부 (사람에게) 굉장히 쉬운 vision task (원이 중첩되어 있는가, 원 안의 글자는 무엇인가) 들은 오히려 엄청나게 못함.
세부적인 내용을 거의 파악하지 못하는 것으로 판단
https://vlmsareblind.github.io/

🧑🏻‍💻 Dev Anthropic

2024.07 2주차

Generate better prompts in the developer console

Claude 3.5 Sonnet 기반

📜 Paper Tianjin University

2024.07 2주차

Review-LLM: Harnessing Large Language Models for Personalized Review Generation

rating 정보도 포함하여 유저의 선호를 파악할 수 있도록 함

📜 Paper Google DeepMind

2024.07 2주차

PaliGemma: A versatile 3B VLM for transfer

transfer를 잘해서 다양한 open-word task를 수행할 수 있는 능력이 있는 모델
특히 remote-sensing & segmentation에서 강점

🧑🏻‍💻 Dev together.ai

2024.07 2주차

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

계산 및 데이터 이동의 중첩을 통해 처리 속도 가속
FP8의 저정밀도 처리를 사용하여 성능을 향상

🧑🏻‍💻 Dev Google

2024.07 2주차

4 Google updates coming to Samsung devices

갤럭시 Z 시리즈에서 circle 검색을 지원

📜 Paper University of Oxford

2024.07 2주차

A Critical Review of Causal Reasoning Benchmarks for Large Language Models

interventional or counterfactual reasoning을 통합함으로써 causal reasoning을 정의

📜 Paper lmsys, UC Berkeley

2024.07 2주차

RouteLLM: Learning to Route LLMs with Preference Data

📜 Paper Zhejiang University

2024.07 1주차

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

industry & academy 양측을 위한 합성 데이터 생성 관련 연구에 대한 폭 넓은 조사 결과를 공유

📜 Paper Tsinghua, Microsoft

2024.07 1주차

Direct Preference Knowledge Distillation for Large Language Models

선호 차를 바탕으로 implicit reward function을 학습하도록 하는 DPKD 제시
Implicit reward & Reverse KL divergence

📜 Paper Tencent AI

2024.07 1주차

Scaling Synthetic Data Creation with 1,000,000,000 Personas

다양한 시나리오를 대상으로 삼는 합성 데이터 생성 용이 (persona-driven data synthesis)

📜 Paper University of Wisoconsin-Madison

2024.07 1주차

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

일반적인 LLM이 long-context task에서 hallucination을 빈번히 보이는 것과 달리 fine-tuned 모델들은 performance drop을 일으키지 않음

🧑🏻‍💻 Dev infiniflow

2024.07 1주차

ragflow

Reranker 모델을 추가함으로써 향상된 retrieval 퍼포먼스를 보여줌
Q&A parsing 방식 중 Markdown & Docx 를 새로 지원

🧑🏻‍💻 Dev Learn RAG with Langchain

2024.07 1주차

Learn RAG with Langchain

📜 Paper Peking, Alibaba

2024.07 1주차

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Type-1 에러를 3단 평가 파이프라인과 엄격한 metric으로 최소화하는 벤치마크, MMEvalPro 를 제안
2,138개의 question triplets, 6,414 distinct questions, 이 중 2/3는 사람이 직접 annotation

📜 Paper Rice University

2024.07 1주차

MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities

incorrect answer rationales, ‘malgorithms’ 을 도입하여 이에 상응하는 오답을 맞히는 (identification) 태스크를 수행
Algorithm Identification Accuracy (AIA), Malgorithm Identification Accuracy (AIA)

📜 Paper Google Reserach

2024.07 1주차

CodecLM: Aligning Language Models with Tailored Synthetic Data

여러 downstream instructoin distribution에 맞는 고품질 합성 데이터를 생성해주는 프레임워크, CodecLM을 제안
seed instructions을 meta data로 인코딩 한 뒤, tailored instructions을 생성하기 위해 decode
Self-Rubrics & Contrastive Filtering 도입

🗞️ News OpenAI

2024.07 1주차

OpenAI will block people in China from using its services

🧑🏻‍💻 Dev CVPR 2024: Image and Video Search & Understanding (RAG, Multimodal, Embeddings, and more)

2024.07 1주차

CVPR 2024: Image and Video Search & Understanding (RAG, Multimodal, Embeddings, and more)

🧑🏻‍💻 Dev French AI Lab Announces an Open-Source GPT-4o Multimodal Alternative: Moshi

2024.07 1주차

French AI Lab Announces an Open-Source GPT-4o Multimodal Alternative: Moshi

이전에 4o 데모 영상에 비하면 아쉽다는 평이 많으나 오픈 소스 진영의 약진을 상징하기도 함

📜 Paper Salesforce AI

2024.07 1주차

Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

query가 주어지면 관련된 내용을 source 기반으로 생성하는 태스크, Summary of a Haystack (conversation & news)

📜 Paper UKP Lab

2024.07 1주차

Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

해당 데이터셋으로 학습한 모델들은 상대적으로 작은 사이즈의 LLM임에도 좋은 성능을 발휘

📜 Paper UIUC, Harvard

2024.07 1주차

Eliminating Position Bias of Language Models: A Mechanistic Approach

training-free zero-shot 방식, PINE을 제안.
segment 간 causal attention을 bidirectional attention으로 변경. attention value를 활용

📜 Paper DeepSeek AI

2024.07 1주차

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

2024년 6월 70건

📜 Paper Zou group

2024.06 5주차

TextGrad: Automatic "Differentiation" via Text

compound AI 시스템의 개별 구성 요소를 LLM에 의해 제공되는 피드백으로 개선
LLM은 general & rich 자연어로 피드백을 제공 → out-of-the-box 태스크도 잘 수행
[깃허브 링크](https://github.com/zou-group/textgrad) 🔗

📜 Paper Bloomberg

2024.06 5주차

Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering

→ generate-then-ground (GenGround) 프레임워크를 제시: 최종 답변이 도출될 때까지 두 단락을 번갈아보는 방식
Generate: 더 간단한 single-hop question과 이에 대응하는 정답을 생성
Ground: retrieved documnets에서 question-answer pair를 ground

📜 Paper USTC

2024.06 5주차

Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation

→ Retrieve-Plan-Generation (RPG) 프레임워크를 제안
Plan stage: subsequent generation을 가이드하는 plan tokens을 생성
Answer stage: plan을 근거로 fine-grained paragraphs를 선택, 이를 바탕으로 futher answer 생성

📜 Paper Amherst, Meta

2024.06 5주차

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

단순 의견 일치 비율 대신 Cohen’s Kappa Metric을 사용하는 것의 중요성을 강조
여러 언어 모델을 비교(base, instruction-tuned)한 결과를 제시: 작은 모델을 잘 학습하면 큰 모델보다 뛰어남

🧑🏻‍💻 Dev Andrej Karpathy

2024.06 5주차

https://github.com/karpathy/LLM101n

스토리텔링 AI LLM 구축 방법을 알려주는 강의를 담은 repo
from scratch in Python, C and CUDA

📜 Paper ICL, Tisnghua

2024.06 5주차

Entropy-Based Decoding for Retrieval-Augmented Large Language Models

→ training-free decoding method를 제안
entropy-based document-parallel ensemble: retrieved 문서로부터 low-entropy distribution에 우선순위를 높이고자 함
constrastive decoding 메커니즘을 통합

🧑🏻‍💻 Dev HuggingFace

2024.06 5주차

Open-llm-leaderboard 2

Qwen2 72B instruct > llama 3 70B > CommandR
MMLU-pro, GPQA, BBH 등 어려운 벤치마크 추가

📜 Paper Peking, HKUST, MIT

2024.06 5주차

Efficient Continual Pre-training by Mitigating the Stability Gap

→ 이를 해결하기 위한 세 가지 학습 전략을 제시
1. 여러 epoch 동안 적당한 사이즈의 subset으로 continual pre-training (single epoch, large corpus 대신)
2. high-quality의 sub-corpus에 대해서만 pre-training

📜 Paper ByteDance, MIT-IBM

2024.06 5주차

Selective Prompting Tuning for Personalized Conversations with LLM

개인화된 LLM을 만드는 방법론
prompt engineering보다 fine-tuning이 원하는 답변을 생성할 가능성이 더 높더라 → Selective Prompt Tuning (SPT)
soft prompts로 시작하고 학습 가능한 dense retriever를 사용하여 input context 기반 최적의 soft prompt를 dynamic하게 고르는 방식을 제안

📜 Paper HuggingFace

2024.06 5주차

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

96개의 Common Crawl snapshot으로부터 15T token 데이터셋을 구축 for pretraining
이 FineWeb으로부터 추가 filtering을 한 1.3T token 데이터셋 FineWeb-Edu 또한 공개

📜 Paper Hong Kong, Tsinghua, NVIDIA, HKUST

2024.06 5주차

Unlocking Continual Learning Abilities in Language Models

MIGU (MagnItude-based Gradient Updating for continual learning): LM의 linear layer에서 가장 큰 output 크기를 갖는 파라미터 업데이트에 집중하는 방식

🧑🏻‍💻 Dev Google

2024.06 5주차

Gemma 2 is now available to researchers and developers

27B 모델의 경우 A100/H100 한 대에서 추론 가능
[Kaggle](https://www.kaggle.com/models/google/gemma-2), [HuggingFace](https://huggingface.co/google/gemma-2-9b) 등에서 다운로드 가능

📜 Paper Tsinghua

2024.06 5주차

Aligning Teacher with Student Preferences for Tailored Training Data Generation

학생의 선호를 반영한 학습 예시를 생성 for Knowledge Distillation
우선 teacher model이 draft question & rationale 생성 → 이에 대한 학생의 in-context learning 능력을 proxy로 사용 → teacher model을 학생의 선호에 DPO

📜 Paper CMU, KAIST

2024.06 5주차

Learning to Correct for QA Reasoning with Black-box LLMs

→ CoBB (Correct for improving QA reasoning of Black-Box LLMs)
불완전한 추론을 올바른 추론으로 Seq2Seq 매핑하는 학습된 adaptation 모델을 사용
dataset과 sampled sub-dataset의 divergence를 최소화하기 위한 유전 알고리즘 적용

📜 Paper UC Berkeley, Toronto, Anthropic

2024.06 5주차

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

이를 inductive out-of-context (OOCR) 으로 표현
작은 모델은 부족하지만, GPT-3.5, GPT-4 정도의 모델들은 충분 → 명시적으로 학습하지 않은 내용도 유추가 가능함을 입증. LLM 학습의 새로운 위험성을 제시.

📜 Paper Meta

2024.06 5주차

Meta Large Language Model Compiler: Foundation Models of Compiler Optimization

📜 Paper Fudan, AI2

2024.06 4주차

SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals

→ 사람이 제공하는 피드백이 제한되고 느린(delayed) 상황에서도 high-level goal을 달성할 수 있도록 돕는 automatic apporach, SelfGoal을 제안
핵심: high-level goal을 실용적인 subgoal로 이루어진 tree structure로 쪼개는 것

📜 Paper AIRI

2024.06 4주차

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

20여개의 다양한 reasoning tasks를 포함
아직까지는 유의미한 long context understanding 벤치마크가 없다고 생각하는데, 향후 유의미한 연구들이 등장할 것인지 개인적인 의문

📜 Paper Hong Kong Science

2024.06 4주차

Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning

→ uncertainity-sensitive tuning: uncertainty recognition + prompt-sensitive activation
모르는 질문을 거절 + causal instruction을 통해 퍼포먼스 회복

📜 Paper AIRI

2024.06 4주차

XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

📜 Paper Fudan, Tsinghua

2024.06 4주차

Needle In A Multimodal Haystack

multimodal retrieval, counting, reasoning, 세 타입의 태스크를 포함

🧑🏻‍💻 Dev DeepSeek AI

2024.06 4주차

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

338개 언어, 128K 컨텍스트 길이 지원
코딩 벤치마크에서 GPT-4-turbo를 능가하는 퍼포먼스 달성

📜 Paper Fudan, Shanghai

2024.06 4주차

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

Selection, self-refine, self-evaluation, Backpropagation 과정을 반복하며 MCTS 수행
이때 Upper Confidence Bound (UCB) 공식이 활용됨

🧑🏻‍💻 Dev Google DeepMind

2024.06 4주차

Generating audio for video

positive - negative prompt를 구분할 수 있을 정도로 정교한 컨트롤이 가능해짐

🧑🏻‍💻 Dev runway

2024.06 4주차

Introducing Gen-3 Alpha

Sora의 등장 이후로 이와 같은 고해상도 비디오 생성 모델들의 발전이 빠르게 이어지고 있는 듯한 느낌이 듦

📜 Paper Tisnghua

2024.06 4주차

Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

→ 긴 context를 malleable(벼릴 수 있는) 외부 지식으로 생각하고 이를 dynamic하게 모으거나 통합하는 방법론

📜 Paper Cohere

2024.06 4주차

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

→ PPO의 많은 요소가 RLHF에 불필요함을 입증 & DPO, RAFT와 같은 RL-free 방식이 PPO보다 뛰어나다는 것을 입증
🧑🏻‍💻 [RLOO 알고리즘을 설명한 허깅페이스 블로그 링크](https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo)

🧑🏻‍💻 Dev Cohere

2024.06 4주차

Claude 3.5 Sonnet

뛰어난 coding 능력과 visual reasoning 능력을 강조
code snippets & website design과 같이 AI-generated content와 상호작용 가능한 Artifacts 기능을 공개

📜 Paper University of Maryland

2024.06 4주차

GenQA: Generating Millions of Instructions from a Handful of Prompts

→ single prompt로 large instruction datasets를 생성하는 방법을 제안
simple completion task부터 complex multi-turn dialogs까지 다양한 태스크에 이르는 데이터셋을 생성 가능

📜 Paper Georgia, MIT

2024.06 4주차

Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

self-generated 합성 데이터를 사용하여 expert module을 구축 + self-optimized routing으로 통합
다른 방법론들에 비해 trade-off (학습하면 기존의 것을 까먹어 버리는 것에 대한)가 적은 편이라고 언급

🧑🏻‍💻 Dev Meta

2024.06 4주차

Sharing new research, models, and datasets from Meta FAIR

한 번에 여러 개의 토큰을 예측하는 Multi-Token Prediction ([HuggingFace](https://huggingface.co/facebook/multi-token-prediction) 🤗)
Meta Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation ([데모](https://pages.cs.huji.ac.il/adiyoss-lab/JASCO/) 🔗)
최초의 audio 워터마크 기법 (faster & efficient detection), AudioSeal ([Github](https://pages.cs.huji.ac.il/adiyoss-lab/JASCO/) 🧑🏻‍💻)

📜 Paper Santa Cruz

2024.06 3주차

Scalable MatMul-free Language Modeling

MatMul-free 모델이 transformer 기반의 모델보다 2.7B 사이즈까지 뛰어나도록 학습한 결과를 제시

📜 Paper University of Chicago

2024.06 3주차

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

전자는 simplices, 후자는 orthogonal, 복잡한 개념은 direct sum으로 구성된 polytope로 표현

🧑🏻‍💻 Dev Andrej Karpathy

2024.06 3주차

Let's reproduce GPT-2 (124M)

🧑🏻‍💻 Dev OpenAI, Apple

2024.06 3주차

OpenAI and Apple announce partnership to integrate ChatGPT into Apple experiences

privacy와 관련해서 애플이 직접 데이터 센터를 구축하고 관리하겠다고 함.

📜 Paper University of Waterloo

2024.06 3주차

GenAI Arena: An Open Evaluation Platform for Generative Models

text-to-image, text-to-video, image editing, 세 영역에 대한 평가가 가능

📜 Paper AI2

2024.06 3주차

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

GPT-4 turbo와 같은 LLM을 사용하여 WB-Reward, WB-Score 을 기준으로 평가 자동화
fine-grained pari-wise comparision 방식을 사용했으며, 세 개의 베이스라인을 설정

📜 Paper Duke, Stanford, Together AI

2024.06 3주차

Mixture-of-Agents Enhances Large Language Model Capabilities

즉, 여러 개의 LLM agents로 각 layer를 구성하는 방식. 각 agent는 이전 레이어의 결과물을 auxiliary information으로 활용.

🗞️ News LLMs Aren’t Just “Trained On the Internet” Anymore

2024.06 3주차

LLMs Aren’t Just “Trained On the Internet” Anymore

맞춤형 학습데이터를 제작하여 활용하는 방식이 대두. Phi-3가 대표적인 모델이며 [Scale.ai](http://Scale.ai) 같은 회사가 크게 주목을 받게 됨.

📜 Paper University of Washington

2024.06 3주차

Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses

Reddit, ChangedMyView에서 수집한 포스트에서 사람과 LLM 응답 간의 의미적 유사성 및 어휘 중복 정도를 비교 → open-ended scenarios에서 명백한 한계를 보임
LLM은 아직까지 social reasoning 성능이 부족함을 입증하고 어떻게 인간 의도와 감정을 통합할 수 있는지에 대한 방법을 제시

📜 Paper ByteDance

2024.06 3주차

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

(1) image tokenizer (2) class-conditional image generation (3) text-conditional image generation (4) optimizaing the inference speed of image generation

📜 Paper Washington, Meta, AI2

2024.06 3주차

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

→ numerical, tabular, knowledge-based reasoning을 다룰 수 있는, 즉 unified action space에서 학습한 open-source language agent, Husky를 제안
1. 다음 단계에 수행할 작업을 예측 2) expert 모델이 선택된 작업을 실행하고 상태 업데이트
7B 모델로도 GPT-4에 준하거나 그 이상의 성능을 보임

📜 Paper OpenAI, Stnaford, Microsoft

2024.06 3주차

The Prompt Report: A Systematic Survey of Prompting Techniques

58개의 프롬프팅 테크닉과 다른 modality에 활용 가능한 40개의 테크닉을 정리
자연어 prefix-prompting에 대한 내용도 다루고 있음

🧑🏻‍💻 Dev Microsoft

2024.06 3주차

Generative-AI-For-Beginners

생성형 AI application을 만드는 데 필요한 18개의 강의를 제공
데이터 베이스와 관련된 강의를 DeepLearning.AI 에서도 제공

🧑🏻‍💻 Dev Luma AI

2024.06 3주차

Dream Machine

📜 Paper University of Toronto

2024.06 3주차

Out-Of-Context Prompting Boosts Fairness and Robustness in Large Language Model Predictions

→ 반대로 out-of-comtext prompting을 제안 (테스트 단계에서)

📜 Paper New York University

2024.06 3주차

Large Language Models Must Be Taught to Know What They Don't Know

→ 작은 correct & incorrect answer로 fine-tuning 함으로써 불확실성 추정에 대한 일반화 성능을 끌어올릴 수 있다.
인간과 AI가 협력하는 환경에서의 불확실성 추정이 어떻게 인간 의사결정에 도움이 되는지 연구

📜 Paper University of Edinburgh

2024.06 3주차

Are We Done with MMLU?

error taxonomy를 이용하여 데이터셋을 확인하는 프레임워크, MMLU-Redux를 제안
30개의 MMLU subjects에 대해서 3,000개를 reannotate → 벤치마크 성능과 실제 체감 성능 간의 괴리를 줄이고자 함

📜 Paper NVIDIA

2024.06 3주차

Nemotron-4 340B

📜 Paper Meta

2024.06 2주차

Contextual Position Encoding: Learning to Count What’s Important

→ 모델에 의해 결정되는 특정 토큰에 대한 position만 확장함으로써 position이 context에 conditioned 될 수 있도록 하는 Contextual Position Encoding(CoPE)를 제안

🗞️ News Samsung

2024.06 2주차

Samsung’s Galaxy S24 Series Dominates GenAI-capable Smartphone Market in Q1 2024

AI 기술 발전을 내세울 것으로 예상되는 애플의 WWDC가 많은 이들의 기대를 받고 있음

📜 Paper Princeton, CMU

2024.06 2주차

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

핵심 레이어의 연산 속도가 Mamba의 selective SSM보다 2-8배 정도 빠르면서, 트랜스포머 기반의 언어 모델과 견줄 수 있는 성능을 내세움

📜 Paper Perdue

2024.06 2주차

SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

→ fine-grained confidence estimates를 표현하도록 가르치는 SaySelf 방법론을 제안
추가적으로 LLM은 스스로의 parametric knowledge를 나타내는 self-reflective rationale을 생성하고, 반대로 uncertainty를 표현할 수 있게 됨

🧑🏻‍💻 Dev LlamaIndex

2024.06 2주차

Introducing the Property Graph Index: A Powerful New Way to Build Knowledge Graphs with LLMs

그래프를 hybrid search를 위한 vector database로 사용 가능
Cypher graph query language를 이용한 복잡한 query 표현 가능

🧑🏻‍💻 Dev DeepLearning.AI

2024.06 2주차

AI Agents in LangGraph

추가로, 여러 개의 답변을 agent-friendly 형식으로 반환하는 agent serarch도 다룸

📜 Paper ByteDance

2024.06 2주차

Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data

또한 추가 실험을 통해 out-of-domain 데이터셋에 대한 성능도 준수하다는 것을 확인

📜 Paper Google DeepMind

2024.06 2주차

To Believe or Not to Believe Your LLM

information-theoretic metric을 사용하여 언제 epistemic uncertainty가 높은지를 탐지
이전의 답변을 기반으로 삼는 iterative prompting을 통해 metric을 계산. 즉, log-likelihood 등을 사용하지 않음.

🧑🏻‍💻 Dev Google

2024.06 2주차

PlaiGemma

다양한 태스크를 처리할 수 있는 PaliGemma와 특정 research dataset에 fine-tuned PaliGemma-FT를 공개
[캐글](https://www.kaggle.com/models/google/paligemma)에서 다운로드 가능

🧑🏻‍💻 Dev Mistral AI

2024.06 2주차

My Tailor is Mistral

LoRA를 기반으로 하여 memory-efficient 하면서도 performant한 fine-tuning 기법을 도입

📜 Paper KAIST, LG AI

2024.06 2주차

Block Transformer: Global-to-Local Language Modeling for Fast Inference

→ 낮은 layer에 대한 global modeling의 병목을 고립시키고, 상위 layer에 대해 fast local modeling을 적용. 입력 토큰을 특정 사이즈의 블록으로 압축하고 coarse level로 self attention을 적용.

🧑🏻‍💻 Dev OpenAI

2024.06 2주차

Extracting Concepts from GPT-4

GPT-4의 internal representation을 16M 개의 oft-interpretable pattern으로 decompose하기 위해 고안한 scalable method를 공개
k-sparse autoencoders를 제안하여 sparsity를 control 함과 동시에 reconstruction-sparsity frontier를 tuning하고 개선하는 과정을 간소화
autoencoder의 크기와 sparsity 간의 확연한 scaling laws를 관측

🧑🏻‍💻 Dev Google

2024.06 2주차

NotebookLM goes global with Slides support and better ways to fact-check

Google Slide, web URL, Google Docs, PDFs, text files를 지원
[NotebookLM 링크](https://notebooklm.google.com/?original_referer=https://blog.google%23&pli=1)🔗에서 가이드 확인 및 노트북 생성 가능

📜 Paper ELLIS

2024.06 2주차

Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

이를 통해 initial text가 hallucinated 인지 아닌지 판단할 수 있음

📜 Paper Peking, Berkeley, Stanford

2024.06 2주차

Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

meta-buffer: 유익한 high-level thoughts를 저장
buffer-manager: meta-buffer를 동적으로 업데이트하여 meta-buffer의 capacity를 향상

🗞️ News KLING

2024.06 2주차

Forget Sora — Kling is a killer new AI video model that just dropped and I’m impressed

🧑🏻‍💻 Dev Alibaba

2024.06 2주차

Hello Qwen2

coding, mathematics, multilingual understanding, long-context understanding 등에서 Meta의 Llama3나 OpenAI의 GPT-4를 능가하는 수준의 성능을 보임

📜 Paper Renmin University

2024.06 1주차

One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

RAG를 위한 scalable & pluggable 가상 토큰을 제안. 해당 토큰에 대한 임베딩만 fine-tuning

📜 Paper Jina AI

2024.06 1주차

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

→ 이를 해결하기 위해 multi-task contrastive training method를 제안

🧑🏻‍💻 Dev Anthropic

2024.06 1주차

Claude can now use tools

예를 들어 구조화된 데이터 추출, DB 기반 검색 및 답변, API 기능 자동화 등에 활용 가능

🧑🏻‍💻 Dev Perplexity

2024.06 1주차

Introducing Perplexity Pages

2024년 5월 89건

📜 Paper Fudan University

2024.05 5주차

Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models

reasoning chain에 대한 평가를 기반으로 정답을 고르는 방식. dynamic sampling 활용.

📜 Paper Cohere

2024.05 5주차

Cohere For AI Launches Aya 23, 8 and 35 Billion Parameter Open Weights Release

대규모 multilingual instruction fine-tuning dataset으로 학습된 Aya 모델을 기반으로 발전
[technical report on Aya 23](https://cohere.com/research/aya/aya-23-technical-report.pdf?ref=cohere-ai.ghost.io)

📜 Paper National University of Singapore, Salesforce

2024.05 5주차

Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework

→ 평가 과정을 여러 개의 단계로 decompose 후 결과를 aggregate 하는 방법론을 제안. 이때 교육학적 관행을 근거로 여러 단계로 구분.

📜 Paper University of Virginia, Princeton Language and Intelligence

2024.05 5주차

SimPO: Simple Preference Optimization with a Reference-Free Reward

target reward margin을 사용하여 winning & losing response 간의 격차를 벌림

📜 Paper IEEE

2024.05 5주차

Wav-KAN: Wavelet Kolmogorov-Arnold Networks

wavelet function을 KAN 네트워크 구조에 통합함으로써 입력 데이터의 high-/low-frequency 요소들을 효율적으로 capture 할 수 있도록 함

🗞️ News xAI

2024.05 5주차

Series B Funding Round

📜 Paper Fudna University

2024.05 5주차

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

다양한 오픈소스 LLM이 tokenization에서 겪는 어려움을 테스트하기 위한 ADT (Adversarial Dataset for Tokenizer) 구축

📜 Paper Google

2024.05 5주차

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?

intrinsic uncertainty를 확인하기 위해 모델의 intrinsic confidence와 실제 결정 간의 갭을 측정할 수 있는 faithful response uncertainty를 공식화하여 실험

📜 Paper Meta

2024.05 5주차

An Introduction to Vision-Language Modeling

🧑🏻‍💻 Dev DeepLearning.AI

2024.05 5주차

AI Agentic Design Patterns with AutoGen

Reflection, Tool use, Planning 등 다양한 agentic design pattern에 대해 학습

📜 Paper National University of Singapore

2024.05 5주차

Faithful Logical Reasoning via Symbolic Chain-of-Thought

1. 자연어를 symbolic format으로 변경 2) 문제를 해결하기 위해 step-by-step plan을 구축 3) verifier가 translation & reasoning chain의 결과를 검증

🧑🏻‍💻 Dev Karpathy

2024.05 5주차

Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20

124M 사이즈의 GPT-2를 A100x8를 사용하여 엄청나게 효율적으로 학습하는 방식을 공개

🧑🏻‍💻 Dev Mistral AI

2024.05 5주차

Codestral: Hello, World!

22B 사이즈의 모델임에도 불구하고 Llama 3 70B, CodeLlama 70B 보다 뛰어난 성능을 보임
[허깅페이스](https://huggingface.co/mistralai/Codestral-22B-v0.1)에서 다운로드 가능

📜 Paper The University of Edinburgh

2024.05 5주차

2BP: 2-Stage Backpropagation

→ 2-stage backporpagation(2BP)을 제안. 이를 통해 1.70x 향상된 throughput을 확인

🗞️ News OpenAI

2024.05 5주차

OpenAI makes ChatGPT-4o's advanced tools available to users in free tier

또한 browse, vision, data analysis, file uploads, GPTs 등의 기능도 이용 가능

📜 Paper Meta

2024.05 5주차

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

이를 해결하기 위해 임의 길이의 real-world text spans를 LM 생성 과정에 통합하는 Nearest Neighbor Speculative Decoding (NEST)를 제안 → token-level의 retrieval을 매 inference step마다 수행

📜 Paper Adobe

2024.05 5주차

Calibrating Reasoning in Language Models with Internal Consistency

📜 Paper University of Cambridge

2024.05 4주차

Zero-Shot Tokenizer Transfer

tokenizer를 입력으로 받고 이에 대응하는 embedding을 예측하도록 학습하는 hypernetwork를 제안 → encoder & decoder 둘 다에 일반화 가능하다는 것을 실험적으로 입증

📜 Paper Alibaba

2024.05 4주차

Language Models can Evaluate Themselves via Probability Discrepancy

📜 Paper Stanford, Toronto

2024.05 4주차

Observational Scaling Laws and the Predictability of Language Model Performance

🧑🏻‍💻 Dev Korea Univ.

2024.05 4주차

Horangi 한국어 LLM 리더보드

llm-jp-eval을 기반으로 llm-kr-eval을 구축
Multi-turn 대화를 통해 생성 능력을 평가하는 MT-Bench를 포함

📜 Paper Microsoft

2024.05 4주차

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

LoRA와 마찬가지로 학습 이후에는 weight matrix에 merge 되는 방식을 취함.

🧑🏻‍💻 Dev DeepLearning.AI & Qualcomm

2024.05 4주차

Introduction to On-Device AI

🧑🏻‍💻 Dev llama3-from-scratch

2024.05 4주차

llama3-from-scratch

llama3의 구성 요소를 하나씩 간단히 살펴볼 수 있는 ipynb을 제공. meta로부터 weight를 받을 수 있는 공식 링크도 포함되어 있음.

📜 Paper ByteDance, Alibaba

2024.05 4주차

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Ray, vLLM, DeepSpeed와 같은 다양한 학습 기법들을 동원하며 Hugging Face와도 통합 가능.

🧑🏻‍💻 Dev Anthropic

2024.05 4주차

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Claude 3 Sonnet을 통해 LLM의 interpretability와 관련된 실험을 진행하고 그 결과를 report

🗞️ News You can now buy a 4-foot-tall humanoid robot for $16K

2024.05 4주차

You can now buy a 4-foot-tall humanoid robot for $16K

[데모 영상](https://www.youtube.com/watch?v=GzX1qOIO1bE&t=58s)을 보면 굉장히 자연스럽고 다양한 동작을 지원함 (상당히 유연..;;)

🧑🏻‍💻 Dev Google

2024.05 4주차

New AI tools to help merchants market brands and products

Product Studio에서 상품 이미지를 다른 배경이나 상황에 맞게끔 생성하여 다양한 연출이 가능

🧑🏻‍💻 Dev Microsoft

2024.05 4주차

What’s next: Microsoft Build continues the evolution and expansion of AI tools for developers

Microsoft Copilots and GitHub Copilot
New Copilot + PCs: PyTorch and a new Web Neural Network
Real Time intelligence, partnerships with ADM, Khan Academy, Cognition AI

📜 Paper Google DeepMind

2024.05 4주차

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

경량화된 모델, Gemini 1.5 Flash에 대한 실험 결과도 함께 제시

📜 Paper University of Michigan

2024.05 4주차

A Turing test of whether AI chatbots are behaviorally similar to humans

🧑🏻‍💻 Dev Mistral AI

2024.05 4주차

Mistral-7B-Instruct-v0.3

📜 Paper AIRI

2024.05 4주차

Your Transformer is Secretly Linear

이러한 linear block을 제거하더라도 모델의 성능에 거의 영향을 주지 않는다는 것이 관측됨
pretraining 단계에서 linearity를 최소화하기 위해 cosine-similarity-based regularization을 도입

📜 Paper Xi’an Jiaotong University

2024.05 4주차

Large Language Models Can Self-Correct with Minimal Effort

📜 Paper MIT

2024.05 4주차

Not All Language Model Features Are Linear

이러한 주장과 달리 일부 언어 모델들은 inherently multi-dimensional representation을 갖는다는 것을 입증

📜 Paper Xi’an Jiaotong University

2024.05 4주차

Quantifying Emergence in Large Language Models

→ 본 연구에서는 macroscopic(semantic) & microscopic(token) level에서 entropy reduction을 비교하여 strength of emergence를 quantify
metric의 variance와 ICL에서 shot의 개수 등 사이의 상관 계수 등을 바탕으로 novel emergence pattern을 파악하고, 이를 통해 hallucination을 새로운 관점에서 해석

🧑🏻‍💻 Dev phidata

2024.05 4주차

phidata

Assistant = LLM + Memory(Chat History, Summaries, ...) + Knowledge(PDF, Docs, … ) + Tools(Search Web, Send Email, …)

🧑🏻‍💻 Dev Mistral AI

2024.05 4주차

mistral-finetune

대부분의 파라미터는 frozen & 1-2% 정도의 추가 파라미터로 학습 → A100 or H100 권장

📜 Paper EluetherAI and others

2024.05 4주차

Lessons from the Trenches on Reproducible Evaluation of Language Models

언어 모델 평가의 공통된 한계점, research에서의 어려움을 최소화하는 방법, 이와 같은 이슈를 해소하는 데 적합한 오픈소스 라이브러리 Language Model Evaluation Harness (lm-eval)

🧑🏻‍💻 Dev Anthropic

2024.05 3주차

Prompt Generator

🧑🏻‍💻 Dev IBM

2024.05 3주차

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

논문 링크: https://arxiv.org/abs/2405.04324

🧑🏻‍💻 Dev OpenAI

2024.05 3주차

Hello GPT-4o

개인적인 교육 분야에서 특히 활용 여지가 많이 커진 것 같다고 느낌.
[유튜브에 공개된 데모 링크](https://www.youtube.com/watch?v=DQacCB9tDaw&t=3986s)

📜 Paper Baidu

2024.05 3주차

A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models

🧑🏻‍💻 Dev TII

2024.05 3주차

Falcon 2

📜 Paper Cohere

2024.05 3주차

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

‘tokenizer analysis, model weight-based indicators, prompting techniques’의 조합을 이용하여 위와 같은 problematic tokens를 자동적으로 detect 하는 방법론을 제안.

🧑🏻‍💻 Dev Google

2024.05 3주차

Google I/O 2024: An I/O for a new generation

Gemini를 구글 제품(포토, 이미지 검색, 워크 스페이스, 이메일 등)에 통합하겠다고 발표. (라이브 데모 x, 여름 또는 올해 말 출시 예정 ????)
GPT-4o와 마찬가지로 multimodality를 강조. 그러나 그만큼의 임팩트가 있지는 않음.

🧑🏻‍💻 Dev Salesforce

2024.05 3주차

SFR-Iterative-DPO-LLaMA-8B-R

📜 Paper HuggingFace

2024.05 3주차

What matters when building vision-language models?

📜 Paper Salesforce, UIUC

2024.05 3주차

RLHF Workflow: From Reward Modeling to Online RLHF

📜 Paper Hwawei

2024.05 3주차

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

🧑🏻‍💻 Dev DeepLearning.AI

2024.05 3주차

Multi AI Agent Systems with crewAI

🧑🏻‍💻 Dev OpenAI

2024.05 3주차

Improvements to data analysis in ChatGPT

차주부터 ChatGPT Plus, Team, Enterprise 유저들에게 공개.

📜 Paper University of Waterloo

2024.05 3주차

UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models

🗞️ News OpenAI & Reddit

2024.05 3주차

OpenAI strikes Reddit deal to train its AI on your posts

📜 Paper Columbia University

2024.05 3주차

LoRA Learns Less and Forgets Less

🧑🏻‍💻 Dev HuggingFace

2024.05 3주차

Hugging Face x LangChain : A new partner package in LangChain

🧑🏻‍💻 Dev TIGER-Lab

2024.05 3주차

MMLU-Pro

📜 Paper MIT

2024.05 3주차

The Platonic Representation Hypothesis

인공지능 모델의 발전 방향은 데이터 타입(언어의 종류, modality)과 무관할 것이라고 주장했던 사람이 생각남.

📜 Paper Meta

2024.05 3주차

Chameleon: Mixed-Modal Early-Fusion Foundation Models

📜 Paper MIT

2024.05 2주차

KAN: Kolmogorov-Arnold Networks

📜 Paper Imperial College London

2024.05 2주차

Argumentative Large Language Models for Explainable and Contestable Decision-Making

🗞️ News X

2024.05 2주차

X launches Stories, delivering news summarized by Grok AI

🧑🏻‍💻 Dev DeepLearning.AI & HuggingFace

2024.05 2주차

Quantization In Depth

🧑🏻‍💻 Dev Meta-Llama-3-120B-Instruct

2024.05 2주차

Meta-Llama-3-120B-Instruct

🗞️ News Nvidia

2024.05 2주차

Nvidia Launches ChatRTX Chatbot for RTX GPUs

🧑🏻‍💻 Dev LMSYS

2024.05 2주차

gpt2-chatbot is Back Online

🧑🏻‍💻 Dev DeepSeek-AI

2024.05 2주차

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

🧑🏻‍💻 Dev DeepLearning.AI

2024.05 2주차

Building Agentic RAG with LlamaIndex

📜 Paper xLSTM: Extended Long Short-Term Memory

2024.05 2주차

xLSTM: Extended Long Short-Term Memory

📜 Paper MIT

2024.05 2주차

Co-design for Efficient LLM Serving

🧑🏻‍💻 Dev Google

2024.05 2주차

Meet Pixel 8a: The Google AI phone at an unbeatable value

📜 Paper University of Texas

2024.05 2주차

Mitigating Exaggerated Safety in Large Language Models

📜 Paper Google Research

2024.05 2주차

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

📜 Paper UIUC, Cohere, Princeton

2024.05 1주차

SnapKV: LLM Knows What You are Looking for Before Generation

📜 Paper Meta

2024.05 1주차

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

🧑🏻‍💻 Dev DeepLearning.AI

2024.05 1주차

Prompt Engineering for Vision Models

🧑🏻‍💻 Dev MIT, MyShell

2024.05 1주차

OpenVoice

📜 Paper Cohere

2024.05 1주차

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

🗞️ News Mystery ‘Gpt2-Chatbot’ And Cryptic Sam Altman Tweet Fuel Speculation Over OpenAI’s Next ChatGPT Update

2024.05 1주차

Mystery ‘Gpt2-Chatbot’ And Cryptic Sam Altman Tweet Fuel Speculation Over OpenAI’s Next ChatGPT Update

📜 Paper Baidu

2024.05 1주차

HFT: Half Fine-Tuning for Large Language Models

🧑🏻‍💻 Dev Gradient

2024.05 1주차

LLama-3-8B-Instruct-Gradient-1048K

📜 Paper Bozewn-Bolzano

2024.05 1주차

When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively

📜 Paper UC Berkeley

2024.05 1주차

Is Bigger Edit Batch Size Always Better? - An Empirical Study on Model Editing with Llama-3

📜 Paper Meta

2024.05 1주차

Better & Faster Large Language Models via Multi-token Prediction

📜 Paper Hong Kong University

2024.05 1주차

Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment

📜 Paper KAIST AI

2024.05 1주차

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

📜 Paper Virginia

2024.05 1주차

Context-Aware Clustering using Large Language Models

🧑🏻‍💻 Dev Anthropic

2024.05 1주차

Introducing the Claude Team plan and iOS app

📜 Paper Predibase

2024.05 1주차

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

2024년 4월 90건

🧑🏻‍💻 Dev HuggingFace

2024.04 4주차

FineWeb

📜 Paper Epoch AI

2024.04 4주차

Chinchilla Scaling: A replication attempt

📜 Paper State Space Model for New-Generation Network Alternative to Transformers: A Survey

2024.04 4주차

State Space Model for New-Generation Network Alternative to Transformers: A Survey

📜 Paper Stanford

2024.04 4주차

How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

📜 Paper Stanford

2024.04 4주차

2024 AI Index Report

📜 Paper Fudan University

2024.04 4주차

AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

📜 Paper Towards Logically Consistent Language Models via Probabilistic Reasoning

2024.04 4주차

Towards Logically Consistent Language Models via Probabilistic Reasoning

📜 Paper Nanyang Technological University

2024.04 4주차

Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

🧑🏻‍💻 Dev DeepLearning.AI

2024.04 4주차

Getting Started with Mistral

🧑🏻‍💻 Dev Cookbook

2024.04 4주차

Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora

📜 Paper Microsoft

2024.04 4주차

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

🧑🏻‍💻 Dev Adobe

2024.04 4주차

Generative AI in Premiere Pro powered by Adobe Firefly | Adobe Video

📜 Paper OpenAI

2024.04 4주차

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

📜 Paper CMU

2024.04 4주차

TREACLE: Thrifty Reasoning via Context-Aware LLM and Prompt Selection

📜 Paper Zhejiang University

2024.04 4주차

Information Re-Organization Improves Reasoning in Large Language Models

🧑🏻‍💻 Dev vals.ai

2024.04 4주차

Benchmarks for Industry

📜 Paper Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners

2024.04 4주차

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners

📜 Paper Tsinghua University

2024.04 4주차

Multi-Head Mixture-of-Experts

📜 Paper Apple

2024.04 4주차

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

🗞️ News The Ray-Ban Meta Smart Glasses have multimodal AI now

2024.04 4주차

The Ray-Ban Meta Smart Glasses have multimodal AI now

📜 Paper Adobe

2024.04 4주차

Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs

📜 Paper Microsoft

2024.04 4주차

Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

📜 Paper Meta

2024.04 4주차

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

🧑🏻‍💻 Dev PyTorch

2024.04 4주차

PyTorch 2.3 Release Blog

🧑🏻‍💻 Dev Snowflake

2024.04 4주차

snowflake-arctic-instruct

📜 Paper Peking, Microsoft

2024.04 4주차

Make Your LLM Fully Utilize the Context

🗞️ News China Unveils Vidu: A Powerful Text-to-Video Generator

2024.04 4주차

China Unveils Vidu: A Powerful Text-to-Video Generator

🧑🏻‍💻 Dev Mistral

2024.04 3주차

Mixtral-8x22B-v0.1-4bit

🧑🏻‍💻 Dev xAI

2024.04 3주차

Grok-1.5 Vision Preview

📜 Paper Google

2024.04 3주차

CodeGemma: Open Code Models Based on Gemma

🗞️ News Meta is testing an AI-powered search bar in Instagram

2024.04 3주차

Meta is testing an AI-powered search bar in Instagram

🧑🏻‍💻 Dev DeepLearning.AI

2024.04 3주차

Quantization Fundamentals with HuggingFace

📜 Paper Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

2024.04 3주차

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

📜 Paper Tinkoff

2024.04 3주차

Learn Your Reference Model for Real Good Alignment

📜 Paper Google

2024.04 3주차

TransformerFAM: Feedback attention is working memory

📜 Paper Meta, CMU

2024.04 3주차

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

🗞️ News Google

2024.04 3주차

Gemma-1.1 version released

📜 Paper Cambridge, Michigan, Oxford, Stanford, etc

2024.04 3주차

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

📜 Paper UT Austin

2024.04 3주차

Pre-training Small Base LMs with Fewer Tokens

📜 Paper KAIST

2024.04 3주차

Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards

🧑🏻‍💻 Dev Upstage

2024.04 3주차

Evalverse: Revolutionizing Large Language Model Evaluation with a Unified, User-Friendly Framework

🧑🏻‍💻 Dev Microsoft

2024.04 3주차

VASA-1: Lifelike Audio-Driven Talking FacesGenerated in Real Time

🧑🏻‍💻 Dev Meta

2024.04 3주차

Build the future of AI with Meta Llama 3

🧑🏻‍💻 Dev Google

2024.04 3주차

Tune in for Google I/O

🧑🏻‍💻 Dev AI2

2024.04 3주차

OLMo 1.7–7B: A 24 point improvement on MMLU

🧑🏻‍💻 Dev PyTorch

2024.04 3주차

torchtune

📜 Paper Google DeepMind

2024.04 3주차

Many-Shot In-Context Learning

📜 Paper Microsoft Research

2024.04 3주차

Position Engineering: Boosting Large Language Models through Positional Information Manipulation

📜 Paper Tencent AI

2024.04 3주차

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

🗞️ News Meta adds its AI chatbot, powered by Llama 3, to the search bar across its apps

2024.04 3주차

Meta adds its AI chatbot, powered by Llama 3, to the search bar across its apps

📜 Paper CMU, Meta AI

2024.04 3주차

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

🧑🏻‍💻 Dev OpenAI

2024.04 3주차

Introducing OpenAI Japan

🧑🏻‍💻 Dev Stability AI

2024.04 2주차

Introducing Stable Audio 2.0

🧑🏻‍💻 Dev MyShell, MIT-IBM, Princeton, Lepton AI

2024.04 2주차

JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars

📜 Paper University of Copenhagen, Google DeepMind

2024.04 2주차

MuLan: A Study of Fact Mutability in Language Models

📜 Paper Stanford, MIT

2024.04 2주차

Stream of Search (SoS): Learning to Search in Language

📜 Paper Stanford, Georgia

2024.04 2주차

Social Skill Training with Large Language Models

📜 Paper Microsoft Research

2024.04 2주차

Models to Self-Improve with General Preferences

🧑🏻‍💻 Dev W&B

2024.04 2주차

Weight & Biases Docs

🧑🏻‍💻 Dev Tesla

2024.04 2주차

Robotaxi

🧑🏻‍💻 Dev Andrej Karpathy

2024.04 2주차

llm.c

🧑🏻‍💻 Dev 3Blue1Brown

2024.04 2주차

Attention in transformers, visually explained

📜 Paper Mila, McGil

2024.04 2주차

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

📜 Paper Google

2024.04 2주차

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

📜 Paper NVIDIA

2024.04 2주차

RULER: What's the Real Context Size of Your Long-Context Language Models?

📜 Paper UIUC

2024.04 2주차

Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs

📜 Paper Apple

2024.04 2주차

Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation

📜 Paper Tsinghua, Microsoft

2024.04 2주차

Rho-1: Not All Tokens Are What You Need

📜 Paper Google DeepMind

2024.04 2주차

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

🧑🏻‍💻 Dev IBM

2024.04 2주차

IBM watsonx chat

🧑🏻‍💻 Dev Anthropic

2024.04 1주차

Prompt library

🧑🏻‍💻 Dev xAI

2024.04 1주차

Announcing Grok-1.5

📜 Paper Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning

2024.04 1주차

Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning

📜 Paper Meta

2024.04 1주차

The Unreasonable Ineffectiveness of the Deeper Layers

🧑🏻‍💻 Dev OpenAI

2024.04 1주차

Navigating the Challenges and Opportunities of Synthetic Voices

📜 Paper AI21labs

2024.04 1주차

Jamba: A Hybrid Transformer-Mamba Language Model

📜 Paper Google DeepMind

2024.04 1주차

Gecko: Versatile Text Embeddings Distilled from Large Language Models

📜 Paper Apple

2024.04 1주차

ReALM: Reference Resolution As Language Modeling

🗞️ News Microsoft and OpenAI pledge $100 billion for ‘Stargate’ supercomputer facility

2024.04 1주차

Microsoft and OpenAI pledge $100 billion for ‘Stargate’ supercomputer facility

📜 Paper Microsoft

2024.04 1주차

Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning

📜 Paper Naver Cloud

2024.04 1주차

HyperCLOVA X Technical Report

📜 Paper Anthropic

2024.04 1주차

Many-shot jailbreaking

📜 Paper Efficient Prompting Methods for Large Language Models: A Survey

2024.04 1주차

Efficient Prompting Methods for Large Language Models: A Survey

📜 Paper Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

2024.04 1주차

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

📜 Paper University of Waterloo, CMU

2024.04 1주차

Long-context LLMs Struggle with Long In-context Learning

📜 Paper Tsinghua University, UIUC

2024.04 1주차

Advancing LLM Reasoning Generalists with Preference Trees

📜 Paper Google DeepMind

2024.04 1주차

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

🗞️ News DALL-E now lets you edit images in ChatGPT

2024.04 1주차

DALL-E now lets you edit images in ChatGPT

🧑🏻‍💻 Dev Anthropic

2024.04 1주차

Claude can now use tools

📜 Paper Google DeepMind, Anthropic

2024.04 1주차

Training LLMs over Neurally Compressed Text

2024년 3월 87건

📜 Paper Instructing Large Language Models to Identify and Ignore Irrelevant Conditions

2024.03 5주차

Instructing Large Language Models to Identify and Ignore Irrelevant Conditions

📜 Paper Microsoft Research, CMU

2024.03 5주차

Can large language models explore in-context?

🧑🏻‍💻 Dev Lightning AI

2024.03 5주차

lightning-thunder

📜 Paper Johns Hopkins, Yale, AI2

2024.03 5주차

FOLLOWIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

📜 Paper UC Berkeley

2024.03 5주차

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

📜 Paper Rutgers University

2024.03 5주차

AIOS: LLM Agent Operating System

📜 Paper MIT, Berkeley, Chicago, Texas

2024.03 5주차

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

🧑🏻‍💻 Dev OpenAI

2024.03 5주차

Sora: first impressions

🧑🏻‍💻 Dev Databricks

2024.03 5주차

Introducing DBRX: A New State-of-the-Art Open LLM

MoE를 활용하여 132B/32B 전체/활성 파라미터 사이즈를 가짐. 32K context length 지원

🧑🏻‍💻 Dev Anthropic

2024.03 5주차

Claude-3-Opus vs GPT-4

📜 Paper Meta, MIT

2024.03 5주차

The Unreasonable Ineffectiveness of the Deeper Layers

📜 Paper Univ. of Hong Kong

2024.03 5주차

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

📜 Paper Meta, Mila, McGil, Montreal

2024.03 5주차

Improving Text-to-Image Consistency via Automatic Prompt Optimization

📜 Paper MIT, Microsoft

2024.03 5주차

Supervisory Prompt Training

📜 Paper Upstage

2024.03 5주차

sDPO: Don't Use Your Data All at Once

🧑🏻‍💻 Dev HuggingFace

2024.03 5주차

A little guide to building Large Language Models in 2024

🧑🏻‍💻 Dev AI21labs

2024.03 5주차

Introducing Jamba: AI21's Groundbreaking SSM-Transformer Model

📜 Paper Can multiple-choice questions really be useful in detecting the abilities of LLMs?

2024.03 5주차

Can multiple-choice questions really be useful in detecting the abilities of LLMs?

📜 Paper Understanding Emergent Abilities of Language Models from the Loss Perspective

2024.03 5주차

Understanding Emergent Abilities of Language Models from the Loss Perspective

🗞️ News Robotics startup Figure raises $675 mln from Microsoft, Nvidia, OpenAI

2024.03 4주차

Robotics startup Figure raises $675 mln from Microsoft, Nvidia, OpenAI

📜 Paper IIT

2024.03 4주차

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

📜 Paper Rice University

2024.03 4주차

Learning to Compress Prompt in Natural Language Formats

📜 Paper Microsoft

2024.03 4주차

ResLoRA: Identity Residual Mapping in Low-Rank Adaption

📜 Paper Datasets for Large Language Models: A Comprehensive Survey

2024.03 4주차

Datasets for Large Language Models: A Comprehensive Survey

📜 Paper Apple

2024.03 4주차

LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues

📜 Paper An Empirical Categorization of Prompting Techniques for Large Language Models: A Practitioner's Guide

2024.03 4주차

An Empirical Categorization of Prompting Techniques for Large Language Models: A Practitioner's Guide

📜 Paper Meta

2024.03 4주차

Learning and Leveraging World Models in Visual Representation Learning

🧑🏻‍💻 Dev Anthropic

2024.03 4주차

Introducing the next generation of Claude

📜 Paper Distilling Text Style Transfer With Self-Explanation From LLMs

2024.03 4주차

Distilling Text Style Transfer With Self-Explanation From LLMs

📜 Paper Stanford, Georgia Tech, Microsoft, Google DeepMind

2024.03 4주차

Design2Code: How Far Are We From Automating Front-End Engineering?

📜 Paper PHAnToM: Personality Has An Effect on Theory-of-Mind Reasoning in Large Language Models

2024.03 4주차

PHAnToM: Personality Has An Effect on Theory-of-Mind Reasoning in Large Language Models

🧑🏻‍💻 Dev 2024 오픈소스 컨트리뷰션 아카데미 [체험형

2024.03 4주차

멘티 모집](https://www.contribution.ac/)

‘Git 활용 및 Gemma를 이용한 LLM 앱 개발’

🧑🏻‍💻 Dev Elon Musk and OpenAI’s fiery battle

2024.03 4주차

Elon Musk and OpenAI’s fiery battle

🧑🏻‍💻 Dev Claude 3’s system prompt

2024.03 4주차

Claude 3’s system prompt

📜 Paper Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

2024.03 4주차

Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

📜 Paper ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

2024.03 4주차

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

📜 Paper GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

2024.03 4주차

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

📜 Paper SaulLM-7B: A pioneering Large Language Model for Law

2024.03 4주차

SaulLM-7B: A pioneering Large Language Model for Law

🗞️ News Salesforce announces new AI tools for doctors

2024.03 4주차

Salesforce announces new AI tools for doctors

📜 Paper Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

2024.03 4주차

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

📜 Paper Yi: Open Foundation Models by 01.AI

2024.03 4주차

Yi: Open Foundation Models by 01.AI

📜 Paper Meta

2024.03 4주차

Teaching Large Language Models to Reason with Reinforcement Learning

🧑🏻‍💻 Dev WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

2024.03 4주차

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

🧑🏻‍💻 Dev mamba_peft.py on HuggingFace

2024.03 4주차

mamba_peft.py on HuggingFace

🧑🏻‍💻 Dev Foundation Model Development Cheatsheet

2024.03 4주차

Foundation Model Development Cheatsheet

📜 Paper Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation

2024.03 4주차

Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation

🗞️ News Nvidia

2024.03 4주차

Nvidia reveals Blackwell B200 GPU, the ‘world’s most powerful chip’ for AI

🧑🏻‍💻 Dev Open-Sora

2024.03 4주차

Open-Sora

📜 Paper CMU-LTI

2024.03 4주차

Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases

📜 Paper UC Berkeley

2024.03 4주차

RAFT: Adapting Language Model to Domain Specific RAG

📜 Paper Google Research

2024.03 4주차

PERL: Parameter Efficient Reinforcement Learning from Human Feedback

📜 Paper EACL 2024

2024.03 4주차

Aligning Large and Small Language Models via Chain-of-Thought Reasoning

📜 Paper RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

2024.03 4주차

RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

📜 Paper KAIST

2024.03 4주차

SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs

📜 Paper Microsoft Corporation

2024.03 4주차

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

🧑🏻‍💻 Dev Google DeepMind

2024.03 4주차

TacticAI: an AI assistant for football tactics

📜 Paper Google DeepMind

2024.03 4주차

Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models

📜 Paper AI2

2024.03 4주차

RewardBench: Evaluating Reward Models for Language Modeling

📜 Paper LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

2024.03 4주차

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

📜 Paper MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

2024.03 4주차

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

📜 Paper KAIST

2024.03 4주차

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

📜 Paper Sakana AI

2024.03 4주차

Evolutionary Optimization of Model Merging Recipes

🧑🏻‍💻 Dev Gen AI Korea 2024

2024.03 3주차

생성형 AI 레드팀 챌린지

📜 Paper Anthropic

2024.03 3주차

The Claude 3 Model Family: Opus, Sonnet, Haiku

📜 Paper Microsoft

2024.03 3주차

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

📜 Paper Google Research

2024.03 3주차

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

📜 Paper Birbal: An efficient 7B instruct-model fine-tuned with curated datasets

2024.03 3주차

Birbal: An efficient 7B instruct-model fine-tuned with curated datasets

📜 Paper Google DeepMind

2024.03 3주차

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

📜 Paper MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

2024.03 3주차

MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

📜 Paper Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering

2024.03 3주차

Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering

🧑🏻‍💻 Dev Cohere

2024.03 3주차

Command-R: Retrieval Augmented Generation at Production Scale

📜 Paper MIT

2024.03 3주차

RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback

🧑🏻‍💻 Dev OpenAI

2024.03 3주차

transfromer-debugger (TBD)

📜 Paper Google DeepMind, OpenAI

2024.03 3주차

Stealing Part of a Production Language Model

📜 Paper Meta

2024.03 3주차

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

🧑🏻‍💻 Dev DeepLearning.AI

2024.03 3주차

Knowledge Graph for RAG

🧑🏻‍💻 Dev Google DeepMind

2024.03 3주차

A generalist AI agent for 3D virtual environments

🧑🏻‍💻 Dev Microsoft Research

2024.03 3주차

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

🧑🏻‍💻 Dev OpenAI

2024.03 3주차

Figure Status Update - OpenAI Speech-to-Speech Reasoning

📜 Paper Tancent

2024.03 3주차

Large Language Models are Contrastive Reasoners

📜 Paper Logits of API-Protected LLMs Leak Proprietary Information

2024.03 3주차

Logits of API-Protected LLMs Leak Proprietary Information

📜 Paper Apple

2024.03 3주차

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

🗞️ News Ex-Activision CEO Bobby Kotick pitched buying TikTok to potential partners, including Sam Altman: report

2024.03 3주차

Ex-Activision CEO Bobby Kotick pitched buying TikTok to potential partners, including Sam Altman: report

🧑🏻‍💻 Dev xAI

2024.03 3주차

Open Release of Grok-1

🧑🏻‍💻 Dev Cohere

2024.03 3주차

C4AI Command-R (HuggingFace)

📜 Paper Stanford University

2024.03 3주차

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

📜 Paper Peking University

2024.03 3주차

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

2024년 2월 55건

📜 Paper OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

2024.02 5주차

OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

🧑🏻‍💻 Dev OpenAI

2024.02 5주차

Memory and new controls for ChatGPT

🧑🏻‍💻 Dev NVIDIA

2024.02 5주차

Say What? Chat With RTX Brings Custom Chatbot to NVIDIA RTX AI PCs

NVIDIA의 성장세 기사

🧑🏻‍💻 Dev DeepLearning.AI

2024.02 5주차

Serverless LLM apps with Amazon Bedrock

LLM의 sefl-verification 한계점

📜 Paper Google DeepMind

2024.02 5주차

Transformers Can Achieve Length Generalization But Not Robustly

📜 Paper Google DeepMind

2024.02 5주차

Chain-of-Thought Reasoning Without Prompting

🧑🏻‍💻 Dev Google

2024.02 5주차

Our next-generation model: Gemini 1.5

🧑🏻‍💻 Dev OpenAI

2024.02 5주차

Sora: Creating video from text

📜 Paper Apple

2024.02 5주차

Guiding Instruction-based Image Editing via Multimodal Large Language Models

📜 Paper Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

2024.02 5주차

Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

🗞️ News Slack

2024.02 5주차

Slack AI is here, letting you catch up on lengthy threads and unread messages

📜 Paper Google DeepMind & Research

2024.02 5주차

A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts

📜 Paper DoRA: Weight-Decomposed Low-Rank Adaptation

2024.02 5주차

DoRA: Weight-Decomposed Low-Rank Adaptation

📜 Paper Can We Verify Step by Step for Incorrect Answer Detection?

2024.02 5주차

Can We Verify Step by Step for Incorrect Answer Detection?

🧑🏻‍💻 Dev minbpe

2024.02 5주차

minbpe

🧑🏻‍💻 Dev Meta

2024.02 5주차

V-JEPA

📜 Paper UC Berkely

2024.02 5주차

LoRA+: Efficient Low Rank Adaptation of Large Models

기존의 LoRA에서 사용하는 adapater 행렬 A와 B는 고정된 learning rate로 업데이트된다는 점이 문제임 → 두 행렬의 learning rate를 조절함으로써 퍼포먼스와 학습 속도를 향상시킬 수 있는 알고리즘 LoRA+ 를 제시

📜 Paper OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

2024.02 5주차

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

📜 Paper Large Language Models for Data Annotation: A Survey

2024.02 5주차

Large Language Models for Data Annotation: A Survey

📜 Paper Purifying Large Language Models by Ensembling a Small Language Model

2024.02 5주차

Purifying Large Language Models by Ensembling a Small Language Model

📜 Paper Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation

2024.02 5주차

Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation

📜 Paper tinyBenchmarks: evaluating LLMs with fewer examples

2024.02 5주차

tinyBenchmarks: evaluating LLMs with fewer examples

🧑🏻‍💻 Dev Google DeepMind

2024.02 5주차

Genie: Generative Interactive Environments

single image prompt로 게임 만들기..

🧑🏻‍💻 Dev Mistral AI

2024.02 5주차

Le Chat Mistral

🧑🏻‍💻 Dev Mitral AI

2024.02 5주차

Au Large

📜 Paper Microsoft Research

2024.02 5주차

Orca-Math: Unlocking the potential of SLMs in Grade School Math

Mistral-7B 모델을 베이스로 학습한 7B 모델 Orca-Math. 200K 개의 고품질 합성 데이터, feedback을 통합시키는 학습 방식 등이 활용됨. Llama-2-70B, ChatGPT-3.5 등을 능가하는 퍼포먼스

🧑🏻‍💻 Dev Argilla

2024.02 5주차

OpenHermesPreferences - a dataset of 1M AI preferences for RLAIF and DPO

📜 Paper LLMs with Chain-of-Thought Are Non-Causal Reasoners

2024.02 5주차

LLMs with Chain-of-Thought Are Non-Causal Reasoners

📜 Paper Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models

2024.02 5주차

Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models

🗞️ News Apple cancels work on electric car, shifts team to generative AI

2024.02 5주차

Apple cancels work on electric car, shifts team to generative AI

📜 Paper Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models

2024.02 5주차

Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models

🧑🏻‍💻 Dev DeepLearning.AI

2024.02 5주차

Prompt Engineering with Llama 2

📜 Paper Linear Transformers with Learnable Kernel Functions are Better In-Context Models

2024.02 4주차

Linear Transformers with Learnable Kernel Functions are Better In-Context Models

📜 Paper DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

2024.02 4주차

DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

📜 Paper AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

2024.02 4주차

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

📜 Paper Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs

2024.02 4주차

Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs

📜 Paper Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

2024.02 4주차

Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

🗞️ News SoftBank’s Masayoshi Son is reportedly seeking $100B to build a new AI chip venture

2024.02 4주차

SoftBank’s Masayoshi Son is reportedly seeking $100B to build a new AI chip venture

📜 Paper The FinBen: An Holistic Financial Benchmark for Large Language Models

2024.02 4주차

The FinBen: An Holistic Financial Benchmark for Large Language Models

🧑🏻‍💻 Dev cosmopedia

2024.02 4주차

cosmopedia

🧑🏻‍💻 Dev Andrej Karphathy

2024.02 4주차

Let’s build the GPT Tokenizer

📜 Paper Microsoft

2024.02 4주차

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

🧑🏻‍💻 Dev Google DeepMind

2024.02 4주차

Gemma: Introducing new state-of-the-art open models

🧑🏻‍💻 Dev Kaggle

2024.02 4주차

Google – AI Assistants for Data Tasks with Gemma

📜 Paper ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling

2024.02 4주차

ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling

📜 Paper Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning

2024.02 4주차

Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning

🧑🏻‍💻 Dev Aria Everyday Activities Dataset

2024.02 4주차

Aria Everyday Activities Dataset

📜 Paper Microsoft Research

2024.02 4주차

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

📜 Paper Yonsei University

2024.02 4주차

KMMLU: Measuring Massive Multitask Language Understanding in Korean

📜 Paper OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

2024.02 4주차

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

🗞️ News Adobe Acrobat adds generative AI to ‘easily chat with documents’

2024.02 4주차

Adobe Acrobat adds generative AI to ‘easily chat with documents’

📜 Paper Hint-before-Solving Prompting: Guiding LLMs to Effectively Utilize Encoded Knowledge

2024.02 4주차

Hint-before-Solving Prompting: Guiding LLMs to Effectively Utilize Encoded Knowledge

📜 Paper CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

2024.02 4주차

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

📜 Paper YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

2024.02 4주차

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

🧑🏻‍💻 Dev Stability.ai

2024.02 4주차

Stable Diffusion 3

chanmuzi의 AI 큐레이션

Top 기관

최신 항목

전체 아카이브

2026년 5월 18건

2026년 4월 37건

2026년 3월 58건

2026년 2월 54건

2026년 1월 54건

2025년 12월 49건

2025년 11월 54건

2025년 10월 60건

2025년 9월 59건

2025년 8월 63건

2025년 7월 67건

2025년 6월 49건

2025년 5월 65건

2025년 4월 62건

2025년 3월 62건

2025년 2월 66건

2025년 1월 67건

2024년 12월 63건

2024년 11월 77건

2024년 10월 83건

2024년 9월 90건

2024년 8월 72건

2024년 7월 74건

2024년 6월 70건

2024년 5월 89건

2024년 4월 90건

2024년 3월 87건

2024년 2월 55건