@@ -18,7 +18,7 @@ <h1 class="text-3xl md:text-4xl font-bold text-slate-800 mb-8 max-w-4xl mx-auto"
18
18
</ p >
19
19
< p >
20
20
To explore the core knowledge representation in MLLMs, we introduce < strong > CoreCognition</ strong > , a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science.
21
- We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones.
21
+ We evaluate 230 models with 11 different prompts, leading to a total of 1503 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones.
22
22
</ p >
23
23
< p >
24
24
Finally, we propose < strong > Concept Hacking</ strong > , a novel controlled evaluation method, that reveals MLLMs fail to progress toward genuine core knowledge understanding, but instead rely on shortcut learning as they scale.
@@ -68,9 +68,9 @@ <h1 class="text-3xl md:text-4xl font-bold text-slate-800 mb-8 max-w-4xl mx-auto"
68
68
< sup > 1</ sup > University of California San Diego  
69
69
< sup > 2</ sup > Johns Hopkins University  
70
70
< sup > 3</ sup > Emory University  
71
- < sup > 4</ sup > University of North Carolina at Chapel Hill  
72
71
</ div >
73
72
< div class ="mb-4 font-bold ">
73
+ < sup > 4</ sup > University of North Carolina at Chapel Hill  
74
74
< sup > 5</ sup > Stanford University  
75
75
< sup > 6</ sup > Ben-Gurion University of the Negev  
76
76
</ div >
@@ -135,7 +135,7 @@ <h3 class="text-2xl font-bold text-gray-900">Dataset Curation</h3>
135
135
</ div >
136
136
< h4 class ="text-lg font-bold text-gray-900 "> Discriminativeness</ h4 >
137
137
</ div >
138
- < p class ="text-sm text-gray-700 leading-relaxed text-justify ">
138
+ < p class ="text-sm text-gray-700 leading-relaxed text-left ">
139
139
Instances should be structured such that models lacking the targeted core knowledge necessarily select the < strong class ="text-red-600 "> incorrect answers</ strong > , thereby ensuring the discriminative power.
140
140
</ p >
141
141
</ div >
@@ -150,7 +150,7 @@ <h4 class="text-lg font-bold text-gray-900">Discriminativeness</h4>
150
150
</ div >
151
151
< h4 class ="text-lg font-bold text-gray-900 "> Minimal Confounding</ h4 >
152
152
</ div >
153
- < p class ="text-sm text-gray-700 leading-relaxed text-justify ">
153
+ < p class ="text-sm text-gray-700 leading-relaxed text-left ">
154
154
Questions should minimize reliance on confounding capabilities, such as < strong > object recognition</ strong > , and must avoid conceptual overlap with other core knowledge included in the benchmark.
155
155
</ p >
156
156
</ div >
@@ -166,7 +166,7 @@ <h4 class="text-lg font-bold text-gray-900">Minimal Confounding</h4>
166
166
</ div >
167
167
< h4 class ="text-lg font-bold text-gray-900 "> Minimal Text Shortcut</ h4 >
168
168
</ div >
169
- < p class ="text-sm text-gray-700 leading-relaxed text-justify ">
169
+ < p class ="text-sm text-gray-700 leading-relaxed text-left ">
170
170
Instances should be crafted so that answers cannot be derived through textual shortcuts alone but require < strong class ="text-blue-600 "> genuine multimodal comprehension</ strong > .
171
171
</ p >
172
172
</ div >
@@ -190,6 +190,13 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Expert Collaboration</h4>
190
190
</ div >
191
191
</ div >
192
192
193
+ <!-- Data Curation Process Figure -->
194
+ < div class ="paper-figure-container mx-auto max-w-5xl mb-12 hover:shadow-xl transition-all duration-300 transform hover:-translate-y-1 cursor-pointer ">
195
+ < img src ="{{ '/assets/images/Data_curation.png' | relative_url }} "
196
+ alt ="Data Curation Process and Methodology "
197
+ class ="w-full h-auto rounded-lg shadow-sm ">
198
+ </ div >
199
+
193
200
<!-- Twelve Core Concepts Card -->
194
201
< div class ="bg-white rounded-2xl p-8 shadow-xl border border-gray-100 hover:shadow-2xl transition-all duration-300 transform hover:-translate-y-1 ">
195
202
< div class ="flex items-center mb-6 ">
@@ -210,7 +217,7 @@ <h3 class="text-2xl font-bold text-gray-900">Twelve Core Concepts</h3>
210
217
</ div >
211
218
< div >
212
219
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Permanence</ h4 >
213
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Objects do not cease to exist when they are no longer perceived.</ p >
220
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Objects do not cease to exist when they are no longer perceived.</ p >
214
221
</ div >
215
222
</ div >
216
223
</ div >
@@ -223,7 +230,7 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Permanence</h4>
223
230
</ div >
224
231
< div >
225
232
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Continuity</ h4 >
226
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Objects persist as unified, cohesive entities across space and time.</ p >
233
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Objects persist as unified, cohesive entities across space and time.</ p >
227
234
</ div >
228
235
</ div >
229
236
</ div >
@@ -236,7 +243,7 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Continuity</h4>
236
243
</ div >
237
244
< div >
238
245
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Boundary</ h4 >
239
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> The transition from one object to another.</ p >
246
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> The transition from one object to another.</ p >
240
247
</ div >
241
248
</ div >
242
249
</ div >
@@ -249,10 +256,10 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Boundary</h4>
249
256
</ div >
250
257
< div >
251
258
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Spatiality</ h4 >
252
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> The < em > a priori</ em > understanding of the Euclidean properties of the world.</ p >
259
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> The < em > a priori</ em > understanding of the Euclidean properties of the world.</ p >
260
+ </ div >
261
+ </ div >
253
262
</ div >
254
- </ div >
255
- </ div >
256
263
257
264
<!-- 5. Perceptual Constancy -->
258
265
< div class ="p-6 rounded-xl hover:shadow-lg transition-all duration-300 shadow-md " style ="background-color: #D7F0FE; ">
@@ -262,7 +269,7 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Spatiality</h4>
262
269
</ div >
263
270
< div >
264
271
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Perceptual Constancy</ h4 >
265
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Changes in appearances don't mean changes in physical properties.</ p >
272
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Changes in appearances don't mean changes in physical properties.</ p >
266
273
</ div >
267
274
</ div >
268
275
</ div >
@@ -272,10 +279,10 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Perceptual Constancy</h4>
272
279
< div class ="flex items-start ">
273
280
< div class ="rounded-lg mr-4 mt-1 flex-shrink-0 " style ="background-color: #BEE4FD; ">
274
281
< span class ="w-14 h-14 flex items-center justify-center font-bold text-4xl text-gray-800 "> 6</ span >
275
- </ div >
276
- < div >
282
+ </ div >
283
+ < div >
277
284
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Intuitive Physics</ h4 >
278
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Intuitions about the laws of how things interact in the physical world.</ p >
285
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Intuitions about the laws of how things interact in the physical world.</ p >
279
286
</ div >
280
287
</ div >
281
288
</ div >
@@ -288,9 +295,9 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Intuitive Physics</h4>
288
295
</ div >
289
296
< div >
290
297
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Perspective</ h4 >
291
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> To see what others see.</ p >
292
- </ div >
293
- </ div >
298
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> To see what others see.</ p >
299
+ </ div >
300
+ </ div >
294
301
</ div >
295
302
296
303
<!-- 8. Hierarchy -->
@@ -301,7 +308,7 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Perspective</h4>
301
308
</ div >
302
309
< div >
303
310
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Hierarchy</ h4 >
304
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Understanding of inclusion and exclusion of objects and categories.</ p >
311
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Understanding of inclusion and exclusion of objects and categories.</ p >
305
312
</ div >
306
313
</ div >
307
314
</ div >
@@ -314,7 +321,7 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Hierarchy</h4>
314
321
</ div >
315
322
< div >
316
323
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Conservation</ h4 >
317
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Invariances of properties despite transformations.</ p >
324
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Invariances of properties despite transformations.</ p >
318
325
</ div >
319
326
</ div >
320
327
</ div >
@@ -327,7 +334,7 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Conservation</h4>
327
334
</ div >
328
335
< div >
329
336
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Tool Use</ h4 >
330
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> The capacity to manipulate specific objects to achieve goals.</ p >
337
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> The capacity to manipulate specific objects to achieve goals.</ p >
331
338
</ div >
332
339
</ div >
333
340
</ div >
@@ -340,10 +347,10 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Tool Use</h4>
340
347
</ div >
341
348
< div >
342
349
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Intentionality</ h4 >
343
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> To see what others want.</ p >
344
- </ div >
345
- </ div >
346
- </ div >
350
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> To see what others want.</ p >
351
+ </ div >
352
+ </ div >
353
+ </ div >
347
354
348
355
<!-- 12. Mechanical Reasoning -->
349
356
< div class ="p-6 rounded-xl hover:shadow-lg transition-all duration-300 shadow-md " style ="background-color: #CFFAFE; ">
@@ -353,10 +360,10 @@ <h4 class="text-lg font-bold text-gray-900 mb-2">Intentionality</h4>
353
360
</ div >
354
361
< div >
355
362
< h4 class ="text-lg font-bold text-gray-900 mb-2 "> Mechanical Reasoning</ h4 >
356
- < p class ="text-sm text-gray-700 leading-relaxed text-justify "> Inferring actions from system states and vice versa.</ p >
363
+ < p class ="text-sm text-gray-700 leading-relaxed text-left "> Inferring actions from system states and vice versa.</ p >
357
364
</ div >
358
365
</ div >
359
- </ div >
366
+ </ div >
360
367
</ div >
361
368
</ div >
362
369
@@ -386,14 +393,14 @@ <h3 class="text-2xl font-bold text-gray-900">Dataset Statistics</h3>
386
393
387
394
<!-- Statistic 3 -->
388
395
< div >
389
- < div class ="text-5xl font-bold text-purple-500 mb-2 "> >26k </ div >
390
- < div class ="text-gray-700 font-medium "> Total Judgments </ div >
396
+ < div class ="text-5xl font-bold text-blue-500 mb-2 "> 1503 </ div >
397
+ < div class ="text-gray-700 font-medium "> Image-Question Pairs </ div >
391
398
</ div >
392
399
393
400
<!-- Statistic 4 -->
394
401
< div >
395
- < div class ="text-5xl font-bold text-blue-500 mb-2 "> 2,530 </ div >
396
- < div class ="text-gray-700 font-medium "> Image-Question Pairs </ div >
402
+ < div class ="text-5xl font-bold text-purple-500 mb-2 "> >3800k </ div >
403
+ < div class ="text-gray-700 font-medium "> Total Judgments </ div >
397
404
</ div >
398
405
</ div >
399
406
</ div >
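Review note: the figures in the Dataset Statistics hunk can be sanity-checked with simple arithmetic. The sketch below assumes (the diff does not state this) that each of the 230 models is run with each of the 11 prompts on all 1503 image-question pairs. The variable names are illustrative only and do not come from the project.

```python
# Figures quoted in the page text and in the Dataset Statistics hunk.
NUM_MODELS = 230
NUM_PROMPTS = 11
NUM_IMAGE_QUESTION_PAIRS = 1503

# Assumption: every model is evaluated with every prompt on every pair.
model_prompt_combinations = NUM_MODELS * NUM_PROMPTS
total_judgments = model_prompt_combinations * NUM_IMAGE_QUESTION_PAIRS

print(model_prompt_combinations)  # 2530
print(total_judgments)            # 3802590
```

Under that assumption the product is 3,802,590, which agrees with the ">3800k" Total Judgments figure. The intermediate product, 2,530, is the number of model-prompt combinations.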
@@ -412,7 +419,7 @@ <h3 class="text-2xl font-bold text-gray-900">Dataset Statistics</h3>
412
419
< div class ="max-w-7xl mx-auto px-4 sm:px-6 lg:px-8 ">
413
420
< div class ="text-center mb-12 ">
414
421
< h2 class ="text-3xl md:text-4xl font-bold text-gray-900 mb-4 "> Key Findings</ h2 >
415
- < p class ="text-xl text-gray-600 max-w-4xl mx-auto leading-relaxed text-justify ">
422
+ < p class ="text-xl text-gray-600 max-w-4xl mx-auto leading-relaxed text-center ">
416
423
Our study uncovers < strong > four primary shortcomings</ strong > shared by state-of-the-art MLLMs:
417
424
</ p >
418
425
</ div >
@@ -575,7 +582,7 @@ <h2 class="text-3xl md:text-4xl font-bold text-gray-900 mb-4">Concept Hacking: A
575
582
</ div >
576
583
< h4 class ="text-lg font-bold text-gray-900 "> Core Knowledge</ h4 >
577
584
</ div >
578
- < p class ="text-sm text-gray-700 leading-relaxed text-justify ">
585
+ < p class ="text-sm text-gray-700 leading-relaxed text-left ">
579
586
Correct responses on both controlled and manipulated tasks indicate genuine conceptual understanding.
580
587
</ p >
581
588
</ div >
@@ -590,7 +597,7 @@ <h4 class="text-lg font-bold text-gray-900">Core Knowledge</h4>
590
597
</ div >
591
598
< h4 class ="text-lg font-bold text-gray-900 "> Shortcut-taking</ h4 >
592
599
</ div >
593
- < p class ="text-sm text-gray-700 leading-relaxed text-justify ">
600
+ < p class ="text-sm text-gray-700 leading-relaxed text-left ">
594
601
Models exploiting training data similarities perform well on controlled tasks but fail when familiar patterns are paired with inverted labels.
595
602
</ p >
596
603
</ div >
@@ -605,7 +612,7 @@ <h4 class="text-lg font-bold text-gray-900">Shortcut-taking</h4>
605
612
</ div >
606
613
< h4 class ="text-lg font-bold text-gray-900 "> Core Deficits</ h4 >
607
614
</ div >
608
- < p class ="text-sm text-gray-700 leading-relaxed text-justify ">
615
+ < p class ="text-sm text-gray-700 leading-relaxed text-left ">
609
616
Incorrect responses to controlled tasks, regardless of manipulation performance, indicate the absence of core knowledge.
610
617
</ p >
611
618
</ div >
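Review note: the three outcome cards in the Concept Hacking section (Core Knowledge, Shortcut-taking, Core Deficits) amount to a small decision rule. A minimal sketch follows, assuming each model's result on a paired item is reduced to two booleans: correct on the control task and correct on the manipulated task. The function name and the reading of "Shortcut-taking" as control-correct but manipulation-incorrect are an interpretation of the card text, not the authors' code.

```python
def classify_concept_hacking(control_correct: bool, manipulated_correct: bool) -> str:
    """Map a (control, manipulated) outcome pair to one of the page's three labels."""
    if not control_correct:
        # Wrong on the control task counts as a core deficit,
        # regardless of performance on the manipulated task.
        return "Core Deficits"
    if manipulated_correct:
        # Correct on both tasks.
        return "Core Knowledge"
    # Correct on the control task but wrong once the familiar pattern is manipulated.
    return "Shortcut-taking"


# Enumerate all four outcome combinations.
for control in (True, False):
    for manipulated in (True, False):
        print(control, manipulated, classify_concept_hacking(control, manipulated))
```

Checking the control task first mirrors the "regardless of manipulation performance" wording on the Core Deficits card.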
@@ -625,7 +632,7 @@ <h4 class="text-lg font-bold text-gray-900">Core Deficits</h4>
625
632
< div class ="max-w-7xl mx-auto px-4 sm:px-6 lg:px-8 ">
626
633
< div class ="text-center mb-8 ">
627
634
< h2 class ="text-3xl md:text-4xl font-bold mb-4 "> Citation</ h2 >
628
- < p class ="text-xl text-gray-300 max-w-3xl mx-auto text-justify ">
635
+ < p class ="text-xl text-gray-300 max-w-3xl mx-auto text-center ">
629
636
If you find this project useful in your research, please consider citing:
630
637
</ p >
631
638
</ div >