11.01 Friday
Morning: literature survey
Afternoon: literature survey
Evening: literature survey; organized the findings into a PPT
11.02 Saturday
Morning: haircut
Afternoon: emmmmmmmmmmmmmmmm
Evening: LLM GNN code. Make the work competition-oriented, score-oriented, standardized, and goal-oriented.
AI4se: https://github.com/saltudelft/ml4se/tree/master
[Ph.D. admissions] I want to recruit two fully funded PhD students working on LLMs – article by 捉木马的蓝胖纸 – Zhihu
https://zhuanlan.zhihu.com/p/3368066426
11.03 Sunday
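Morning: tried to reproduce vul-LLMGNN; the environment is half set up, on an AutoDL 3090.
Afternoon: rested a little at noon, then kept watching group-meeting talks on Bilibili, walked through the GraphCodeBERT code, and finished the Vul-LLMGNN environment setup. Only the Joern adaptation is left.
Evening: spent time with Xinxin and brought Mayuan home.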
11.04 Monday
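Morning: hunted for bugs in the Vul-LMGGNN code and fixed it with the help of the fork's commit history and the Devign codebase it depends on.
Afternoon: Joern is configured. The remaining problem is the embedding: Devign takes its embeddings from Word2Vec, while here they are replaced by CodeBert's embedding layer. That involves configuring a large model, so I ran into many problems.
Evening: the problem is still unresolved. Went through the Uni-Attack paper; lessons learned:
1. State what problem is being solved and where its difficulty and challenges lie; build the contributions around that point.
2. The method framework figure is too cluttered and unclear; the names are too generic and not specific.
3. Every paragraph must follow a clear line of reasoning; do not pile up sentences arbitrarily.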
11.05 Tuesday
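Morning: hunted Vul-LMGGNN bugs and migrated to an A800, because the 3090 does not have enough GPU memory.
Afternoon: got the code running, but data preprocessing is CPU-intensive, so it has to move to the local 13900K CPU.
Evening: cleaned the dorm, set up the cat cage, washed the quilt.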
11.06 Wednesday
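Morning: cleared the obstacles in Vul-LMGGNN data generation and compiled the Uni-Attack revision to-do list.
Afternoon: got Vul-LMGGNN data generation running on the server. The issue was in Python's multithreaded execution: adding "joern_process.stdin.flush()" right after "joern_process.stdin.write(import_cpg_cmd)" fixes it. Once processing finishes, back up the code to GitHub.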
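A minimal sketch of why the flush matters, assuming the parent talks to the child through a buffered text pipe. The child here is a stand-in Python process that echoes commands in upper case, not Joern, and `send_command`, `run_demo`, and the placeholder command string are hypothetical names used only for illustration.

```python
import subprocess
import sys

# Stand-in child process: echoes each stdin line back in upper case.
# In the log the child is an interactive Joern shell; that is not reproduced here.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


def send_command(proc: subprocess.Popen, command: str) -> str:
    """Write one command to the child's stdin, flush, and read one reply line."""
    proc.stdin.write(command + "\n")
    # Without flush() the command can stay in the parent's write buffer,
    # so the child never receives it and readline() below would block.
    proc.stdin.flush()
    return proc.stdout.readline().strip()


def run_demo() -> str:
    """Start the stand-in child, send one command, and return its reply."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        return send_command(proc, "import_cpg_cmd placeholder")
    finally:
        proc.stdin.close()
        proc.wait()
```

The same write-then-flush pattern applies to any long-lived interactive child process; whether Joern needs extra handling (prompts, ANSI output) is not covered by this sketch.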
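Code auditing: CodeQL is the commercial product, Joern the open-source one. Read papers.
The Alibaba Cloud ECS server (this blog) expires on January 3.
Learning Scala helps with understanding the syntax rules of most modern programming languages, and with using Joern.
Security researcher blog, "代码审计之道" (The Way of Code Auditing): https://evilpan.com/2022/01/22/code-audit
Papers to read: https://github.com/saltudelft/ml4se?tab=readme-ov-file#bugvulnerability-detection
Competition leaderboard: https://microsoft.github.io/CodeXGLUE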
Tree-sitter is for syntax parsing (e.g., code editors, IDEs).
Joern is for security analysis (e.g., vulnerability detection in source code).
Use these Devign baselines to dig out the hard points and summarize the shortcomings of existing methods.
11.07 Thursday
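Morning: the Xingyun account expired; re-applied to restore it. When a problem comes up, solve it; complaining is useless. In Vul-LMGGNN data preprocessing, some samples exceed GPU memory.
Afternoon: wrapped data preprocessing in try/except to skip the problematic samples. It now runs end to end.
Evening: left for Wuhan.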
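A minimal sketch of the try/except skip used in preprocessing on 11.07, assuming each sample is embedded independently. `embed_all` and `toy_embed` are hypothetical names, and the simulated failure stands in for the GPU out-of-memory errors mentioned above. Recording the skipped indices keeps the data loss visible instead of silent.

```python
from typing import Callable, Iterable, List, Tuple


def embed_all(
    samples: Iterable[str],
    embed_fn: Callable[[str], List[float]],
) -> Tuple[List[List[float]], List[Tuple[int, str]]]:
    """Embed every sample, skipping (and recording) the ones that raise."""
    embedded: List[List[float]] = []
    skipped: List[Tuple[int, str]] = []
    for index, sample in enumerate(samples):
        try:
            embedded.append(embed_fn(sample))
        except (RuntimeError, MemoryError, ValueError) as err:
            # Keep the index and the error text so skipped samples can be audited later.
            skipped.append((index, repr(err)))
    return embedded, skipped


def toy_embed(code: str) -> List[float]:
    """Stand-in embedder: fails on long inputs to simulate running out of memory."""
    if len(code) > 20:
        raise RuntimeError("out of memory (simulated)")
    return [float(len(code))]
```

Usage: `embedded, skipped = embed_all(code_snippets, toy_embed)`; checking `len(skipped)` against the dataset size shows how much data the skip actually costs.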
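Addendum:
‼‼ 2024/11/6 Revision tasks in response to the Uni-Attack review comments:
⚪01. The challenges and difficulties of adversarial attacks on VLMs in ADS applications are not explained clearly; emphasize what is uniquely challenging and difficult about attacking VLMs in the autonomous driving domain.
⚪02. The comparative experimental analysis lacks depth; analyze further why the new method outperforms the baselines, and in which stage or on which metric it does so.
⚪03. Provide rich side-by-side visualizations of original and adversarial samples.
⚪04. The experimental data should reflect real driving environments, e.g. varied weather conditions, fluctuating traffic density, fast-moving vehicles, or unexpected pedestrian intrusions. The CVPR’2024 dataset was collected with the Carla simulator; consider replacing it so the paper is not so closely tied to the competition, either by finding more complex autonomous driving datasets or by setting up a Carla environment and collecting the data ourselves.
⚪05. Analyze the computational bottlenecks introduced by the key modules.
⚪06. Offer suggestions for suitable defenses.
⚪07. The contributions are not described in enough detail, and no novel angle is offered for analyzing the weaknesses of existing work.
⚪08. Explain clearly the rationale for choosing the surrogate models (ViT-B/16, CLIP, and BLIP-2) in the transferable adversarial attack module.
⚪09. Explain the parameter settings clearly.
⚪10. Analyze the effectiveness and generalizability of the attack method.
⚪11. The differences between attack methods for LVLMs and attack methods for vision models are not explained.
⚪12. The font in Fig. 5 is too small.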
‼‼ ‼‼ ‼‼ ‼‼ ‼
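‼‼‼ Task assignments ‼‼‼
⚪@Ap0lie Wang Hang: for revision task 07, survey the limitations of adversarial attack methods against VLM applications from the last two years, paying particular attention to where these attacks fall short in autonomous driving. Ideally, categorize the existing methods from a novel angle and identify their shortcomings.
⚪@Arthur Wang Yifei: for revision task 04, collect autonomous driving datasets. They must contain images and should cover as many scenarios as possible: varied weather conditions, fluctuating traffic density, fast-moving vehicles, unexpected pedestrian intrusions, and so on. If none can be found, consider setting up a Carla environment and collecting the data ourselves.
⚪Some of the references are in this PPT
⚪Note: VLM: Vision Language Models
⚪Note: tasks in autonomous driving: perception, planning, control, trajectory prediction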
Details:
⚪01. The challenges and difficulties of adversarial attacks on VLMs in ADS applications are not explained clearly; emphasize what is uniquely challenging and difficult about attacking VLMs in the autonomous driving domain. This addresses the review comments “However, more clarity on the specific challenges in autonomous driving systems could strengthen the argument for why Uni-Attack is crucial.” and “the novelty of these contributions, specifically in the context of LVLMs applied to autonomous driving, is unclear. Are these methods generally applicable to LVLM attacks, or are they specifically tailored for autonomous driving? The unique aspects of the autonomous driving scenario need to be further clarified. The paper should explain why these attack methods are particularly relevant in autonomous driving and how they differ from general LVLM attack approaches. Additionally, the paper should elaborate on the specific challenges LVLMs face in autonomous driving attacks”.
⚪02. The comparative experimental analysis lacks depth; analyze further why the new method outperforms the baselines, and in which stage or on which metric it does so. This addresses the review comment “Could you provide more insights into why these baselines perform worse than Uni-Attack and what unique features give Uni-Attack the edge? Please explain why specific metrics improved and what aspects of Uni-Attack led to these improvements.”
⚪03. Provide rich side-by-side visualizations of original and adversarial samples. This addresses the review comment “showing a side-by-side comparison of original vs. attacked outputs with qualitative examples would provide clearer insight into the attack’s effectiveness”
⚪04. The experimental data should reflect real driving environments, e.g. varied weather conditions, fluctuating traffic density, fast-moving vehicles, or unexpected pedestrian intrusions. The CVPR’2024 dataset was collected with the Carla simulator; consider replacing it so the paper is not so closely tied to the competition, either by finding more complex autonomous driving datasets or by setting up a Carla environment and collecting the data ourselves. This addresses the review comments “How does Uni-Attack perform in fully realistic scenarios, particularly with complex traffic conditions or environmental changes” and “Given the dynamic nature of environments where autonomous driving systems function, it is advisable that the authors assess the performance of Uni-Attack under dynamic conditions”.
⚪05. Analyze the computational bottlenecks introduced by the key modules. This addresses the review comment “Could you explain the computational overhead introduced by the gradient noise-based attack and typographic attacks?”
⚪06. Offer suggestions for suitable defenses. This addresses the review comment “The paper focuses heavily on attacking methodologies, but there is no discussion on possible defenses.”
⚪07. The contributions are not described in enough detail, and no novel angle is offered for analyzing the weaknesses of existing work. See the review comments “How do you ensure the balance between attack efficacy and computational efficiency in real-time autonomous driving environments?” and “The paper lacks in-depth discussion of its unique contributions and fails to provide a new perspective. Please clarify the limitations of existing methods in the context of autonomous driving and demonstrate how Uni-Attack addresses these challenges. A more thorough discussion is needed to highlight the limitations of current LVLM adversarial attack frameworks in autonomous driving applications and how Uni-Attack stands out in overcoming these specific challenges.”
⚪08. Explain clearly the rationale for choosing the surrogate models (ViT-B/16, CLIP, and BLIP-2) in the transferable adversarial attack module. This addresses the review comment “For the transferability of adversarial patches, why were ViT-B/16, CLIP, and BLIP-2 chosen as surrogate models, and how do they compare in terms of their susceptibility to attacks in both white-box and black-box environments?”
⚪09. Explain the parameter settings clearly. This addresses the review comment “What was the rationale behind setting the perturbation limit at ϵ = 16/255 in the gradient noise-based module?”
⚪10. Analyze the effectiveness and generalizability of the attack method. This addresses the review comment “It is recommended that the authors delve into potential defensive strategies to mitigate the proposed adversarial methods and discuss the breadth of models targeted by these attacks, encompassing rule-based systems, learning-based models, and large model-driven frameworks.”
⚪11. The differences between attack methods for LVLMs and attack methods for vision models are not explained. This addresses the review comment “More explanation is needed regarding how LVLMs differ from traditional vision models in terms of attack vulnerabilities. Can traditional methods still generate effective attacks against LVLMs, or are there fundamental differences?”
⚪12. The font in Fig. 5 is too small. Review comment: “The font size of the labels in Fig. 5 should be enlarged.”
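For task 09, a small sketch of what an ϵ = 16/255 budget means in practice. It assumes the limit is an L∞ bound on pixel values scaled to [0, 1]; the review comment only gives the number, so the norm is an assumption here. Under that reading each pixel may move by at most 16 intensity levels on a 0-255 scale. `project_linf` is a hypothetical helper, not code from the paper.

```python
from typing import List

EPSILON = 16 / 255  # perturbation limit quoted in the review comment


def project_linf(
    original: List[float],
    perturbed: List[float],
    eps: float = EPSILON,
) -> List[float]:
    """Clip `perturbed` so every pixel stays within eps of `original` and inside [0, 1]."""
    projected = []
    for x, x_adv in zip(original, perturbed):
        # Limit the per-pixel change to [-eps, eps] (assumed L-infinity budget).
        delta = max(-eps, min(eps, x_adv - x))
        # Keep the result a valid pixel intensity.
        projected.append(max(0.0, min(1.0, x + delta)))
    return projected
```

Iterative gradient attacks typically apply a projection like this after every update step, which is why the choice of ϵ directly trades attack strength against visibility.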
11.08 Friday
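Morning: left the hotel for headquarters. First went to Building 3, room 5011, to meet Ms. Yang Qi. She took me next door to 5013 to meet Section Chief Guo, deputy head of the General Affairs Department, and then next door again to meet Director Zhou. Director Zhou asked how I had been lately. I said fine, that I was writing my thesis and job hunting at the same time, and expressed my willingness to stay with the group. The question actually caught me off guard: "How have you been lately?" depends heavily on context. Different people know different things about my situation, so the answer has to go in different directions, and knowing both yourself and the other side is hard. Then I went to Building 1 and waited in the meeting room for the meeting to start. Meanwhile I met Dr. Bao Yuanyi and added Zhang Yichi of North China Electric Power University, the student representative speaker for the next cohort, as a contact.
Noon: had lunch with Zixu, Shilong, and Dr. Bao, along with another PhD, a short guy with quite a strong presence who talked a lot. We chatted about job hunting, pay and benefits, and the cities and visions we have for the future. We parted at Changdian (长电); they know each other well and are all based in Wuhan. After saying goodbye I went back to the hotel.
Afternoon: checked out right away and took the metro to the airport. The annoying thing about Wuhan airport is that the signs for Terminal 3 are easy to find, while T2 feels like an afterthought. I asked at the information desk and learned that T2 is reached through T3. I went in, but still ended up in the wrong security queue, because once inside, the T2 guide signs are too small to notice and it is hard to realize that this is not T2. T2 means Terminal 2, signposted in Chinese as “2号航站楼”. I really should have used some common sense: a terminal building has to be a building, after all. Even without seeing the T2 signs inside T3, I should have doubted that following the T3 signs for a few steps could take me across to another building; two terminals ought to be fairly far apart. Luckily there was enough time. When my turn came at security I was told I was in the wrong place, so I hurried off to find T2. T3 security is on the 4th floor; you go down to the 2nd floor and turn left to where the shuttle carts are, then ask one to take you. I was short on time and the distance really is long, about from the east end of my home village to the west end, or even farther. At T2 there is yet another fork, with security split into two opposite areas by boarding gate, and amazingly I went the wrong way again. I had assumed that, flying often, I would no longer run into problems. What I overlooked is that every city is different and each operator has its own solutions and rules, so each case has to be analyzed on its own terms rather than by habit. This also reminded me to do everything early and leave some margin for error. Never be bold but careless, and never underestimate a problem; stay humble and careful.
Evening: the plane landed and I got back to the dorm; it was already past 8. Voice-called Meng Xin until 11. I am getting more practiced at this. Her personality is very sensitive, like my own state at age 10 when I had just arrived at the after-school boarding house: I could not take even a slightly sensitive joke, because I was homesick. Her case is different: it comes from a lack of love, learned helplessness (her words), and fear of the abyss of being abandoned. So when I joked about deleting each other on WeChat after a quarrel, her eyes instantly reddened and filled with tears. Truly instantly! It felt like a helpless infant hung by a thin thread at the edge of a cliff, with every sensitive remark from others a small knife cutting at that thread. She looks down at the abyss below, and that sense of weightlessness and abandonment is recalled in a flash, from years ago, from another parallel universe, from deep in memory. So never joke about topics that carry that feeling of weightlessness or abandonment. To her it is no different from murder, and unforgivable.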
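Summary of taboo topics:
1. Studies
2. Pain
3. Abandonment: deleting WeChat, the silent treatment
4. Ex-boyfriend
5. Mother
6. Never breach privacy; respect privacy and mind boundaries
7. Trust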
11.09 Saturday
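Morning: organized notes; uploaded Vul-LMGGNN to gitee.com as a backup.
Afternoon: started tidying the room as soon as I got back to the dorm at noon and kept at it until 4 or 5 pm, then talked with Xiaoxin on the phone until 7.
Evening: went to Sichuan University to fetch the access-card fob, continued cleaning the bathroom, installed a clothes rack, and did laundry, until midnight.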
11.10 Sunday
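Morning: had breakfast; at the office, continued watching graph neural network videos on Bilibili. Fell asleep at 9. Got up and called Mom, then researched fitness bands and watches and settled on the Xiaomi Band 9 NFC edition. The Apple Watch is so-so: its features are simple and superfluous, whether SE or S9. I did not look at the Ultra, but it seems unnecessary.
Noon: lunch; need to reach the South Railway Station by 12:30 for the 1:00 train. Bring the canned food and an umbrella.
Evening: brought Xiaoxin back; dinner; sleep.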
11.11 Monday
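Morning: learned the basic principles of the word2vec model. Devign uses 【torch_geometric.nn.conv.GatedGraphConv】 to learn the graph structure and 【gensim.models.word2vec.Word2Vec】 to encode the code. LLGGNN builds on this and replaces Word2Vec with CodeBert as the code encoder. Nothing else changes.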
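For reference, the gated graph update that GatedGraphConv performs, written as I recall it from the PyTorch Geometric documentation (treat the exact notation as an assumption and check the docs). Here $\mathbf{x}_i$ is the initial node feature, which is the only part that changes when Word2Vec is swapped for CodeBert; $\mathbf{\Theta}$ is the learned weight, $e_{j,i}$ an optional edge weight, and $\mathcal{N}(i)$ the neighbors of node $i$.

```latex
\begin{align}
\mathbf{h}_i^{(0)}   &= \mathbf{x}_i \,\Vert\, \mathbf{0} \\
\mathbf{m}_i^{(l+1)} &= \sum_{j \in \mathcal{N}(i)} e_{j,i} \cdot \mathbf{\Theta} \cdot \mathbf{h}_j^{(l)} \\
\mathbf{h}_i^{(l+1)} &= \mathrm{GRU}\left(\mathbf{m}_i^{(l+1)}, \mathbf{h}_i^{(l)}\right)
\end{align}
```

The zero padding in the first line is why the input feature dimension must not exceed the layer's output dimension, which matters when moving from small Word2Vec vectors to larger CodeBert embeddings.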
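Afternoon: retrain the LMGGNN method and run the baselines to produce a leaderboard. Revise FedVC.
FedVC has three sets of review comments. The original LaTeX needs revising, with the changed parts marked in red, followed by a separate response to each reviewer.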
《《《《《《《Reviewer-1》》》》》》》》》》:
Summary: This paper aims to achieve the performance for imbalanced dataset under the federated learning (FL) on task autonomous deriving. The main work is to propose a virtual client (VC) for FL, which can control the models’ updating process by sharing the raw data from real clients. Here are some suggestions:
1. Some definitions are not clear to me: data imbalance, local data imbalance, virtual client, and so on.
• a. What is the data imbalance as mentioned in Line 55 right on Page 1?
• b. What is the local data imbalance in Line 5 left on Page 2? Citations are also recommended, such as papers with a definition about data imbalance (e.g. A comprehensive survey of image augmentation techniques for deep learning).
• c. What is the real client? Is it the data from a car? I also suggest consistent definitions in the paper.
• d. What is the missing data on Line 51 right on Page 3?
• e. Page 3 Line 55 left, citations for the static method should be given, as well as dynamic training method on Line 69. Their simple definitions or descriptions are encouraged.
• f. As mentioned in Page 4 Line 5 left, “every virtual client maintaining a separate model”, my question is whether the number of models equals the number of clients. And then what is the definition of a model or what kinds of data can the model access?
• g. What is the difference between data silos, real and virtual clients?
• h. In the second paragraph on Page 4 right panel, the proof with citations or necessary analysis, and the reasons are recommended to give.
• i. For the motivation to propose the virtual clients as mentioned in Page 4 Lines 9-28, is better to provide at the beginning the methodology? I also suggest an extra subsection in Section 4, before introducing your method, to demonstrate the addressing challenge and the motivation of your idea.
2. Citations should be added as mentioned before. In Sections 3 and 4, there is no reference.
• a. Citations for K-means should be given on Page 3 Line 31 right.
3. How do you use K-means to generate the pseudo-class? Necessary details should be given. And how many classes do you have?
4. It seems that Figure 2 is the main work on the paper, rather than Figure 1. I suggest exchanging their place. Please explain the meaning of different shapes in Figure 2.
5. In the abstract, I suggest replacing synthetic datasets with the raw dataset in Line 25. “raw” is from Page 4 Line 36 left.
6. A potential advantage for your method is beyond the application of autonomous driving. Additional experiments on other applications and datasets are encouraged. Or you can mention it at least.
7. To avoid misunderstanding, I suggest replacing “proposed” in Line 52 left on Page 5 with “existing”.
《《《《《《《Reviewer-2》》》》》》》》》》:
The paper presents FedVC, a federated learning framework designed to address the challenges of imbalanced label distribution in autonomous driving. FedVC introduces virtual clients (VC), which select data subsets dynamically to approximate global data distribution without compromising privacy. This method enhances learning by aligning local training with the global data distribution. The results highlight that FedVC outperforms other methods, such as FedAvg and FedSAM, especially under imbalanced conditions. However, FedVC introduces increased communication costs that require further optimization for real-world deployment. The following are some major concerns:
The paper currently references several traditional FL methods. To strengthen the framework’s relevance, the authors should integrate recent advancements in FL that focus on:
• Reducing communication costs (e.g., gradient compression techniques or sparse model updates).
• Improving scalability to handle a higher number of clients efficiently.
Expand the discussion by addressing practical deployment challenges in limited-resource autonomous systems. Specifically, the following aspects need further exploration:
• Communication bandwidth limitations and how FedVC can maintain efficiency in low-bandwidth networks.
• The computational overhead on individual vehicles and the feasibility of deploying the proposed virtual client mechanism without degrading vehicle performance.
《《《《《《《Reviewer-3》》》》》》》》》》:
The paper presents a novel federated learning framework, FedVC, which utilizes Virtual Clients in a peer-to-peer network to achieve global distribution-aware training. The framework decouples forward and backward computations to better align local training with global data distribution, improving model performance without the need for dataset condensation. The application to steering angle prediction in autonomous driving demonstrates the framework’s effectiveness on three datasets with different levels of data imbalance. The idea is innovative and has practical implications for federated learning in privacy-sensitive applications.
• Strengths:
1. Innovative Approach: The concept of Virtual Clients and decoupling computations is a fresh contribution that addresses key challenges in federated learning, such as data privacy, communication costs, and data distribution alignment.
2. Practical Application: The focus on steering angle prediction for autonomous driving is relevant and timely, with potential real-world applications.
3. Experimental Validation: The paper demonstrates the proposed framework’s effectiveness across multiple datasets, showcasing its adaptability to different data imbalance scenarios.
• Areas for Improvement:
1. Expand Experimental Setup Details: The manuscript lacks sufficient detail about the experimental settings, such as the number of Virtual Clients and Real Clients used, model architecture specifics, and training parameters. Providing a more thorough description would enhance the reproducibility of the experiments and help readers better understand the conditions under which the method was tested.
2. Clarify Threshold Mechanism: The explanation regarding the use of thresholds for reconstructing the global data view is not detailed enough. The paper should include more information on how these thresholds are set, why they were chosen, and their effect on the training process. A comparison with alternative strategies could provide more context and strengthen the justification for this approach.
3. Expand the Discussion of Limitations and Future Directions: The paper would benefit from a more in-depth discussion of the framework’s limitations. For example, potential computational overhead associated with the use of Virtual Clients or challenges in synchronizing large datasets could be addressed. Additionally, suggesting future research directions, such as optimizing the threshold mechanism or extending the approach to other federated learning tasks, would add depth to the paper.
4. Better Integration with Related Work: The discussion section should more thoroughly compare the results with existing methods for federated learning, particularly in the context of autonomous driving. Drawing clearer connections between FedVC and other approaches would help demonstrate the novelty and advantages of the proposed method.
5. Improve Clarity in Some Sections: Certain explanations, such as the role of Virtual Clients in data partitioning and the communication flow in the peer-to-peer network, could be presented more clearly. Including visual aids like diagrams or flowcharts could help illustrate these concepts and improve reader comprehension.
6. Standardize Abbreviation Usage and Terminology: Make sure that all abbreviations are defined at their first occurrence, and used consistently throughout the manuscript. Avoid reintroducing abbreviations unnecessarily. Consistency in terminology will make the paper easier to read and follow.
7. Ensure Consistency in Figures and Numerical Notation: Verify that all figures, tables, and numerical results follow a consistent format, including the use of units, significant figures, and labeling conventions. This will contribute to a more professional presentation of the research.
8. Language and Grammar Refinement: Some sections could benefit from language refinement to improve clarity and grammatical accuracy. Editing for shorter, more concise sentences, and ensuring proper punctuation would enhance the manuscript’s readability.
9. Address Ethical Considerations: While the paper emphasizes privacy in the federated learning context, there is little discussion about potential ethical concerns, such as data security or user consent. Including a section that addresses these issues would make the paper more robust and relevant for real-world deployment considerations.
11.12 Tuesday
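Morning: baseline algorithms vul-LLMGGNN and Devign. Filled in the engineering master's self-check form.
Noon: revised the self-check form; phone call after phone call after phone call.
Afternoon: Heyue Plaza (和悦广场); filled in the ledger (台账). What is a ledger, an Excel sheet? Excel!!!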
11.13 Wednesday
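Morning: prepared the ledger of advisor meetings
Afternoon: continued filling in the advisor-meeting record book
Evening: dinner, noisy chewing, anger, helplessness, a cowardly man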
11.14 Thursday
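Morning: finished the advisor meeting records, up to 11.13
Noon: saw Xinxin off
Afternoon: feeling emo; in the old days everything was slow, carts and horses were slow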
11.15 Friday
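Morning: vul-LMGGNN, all algorithms
Noon:
Afternoon: the direction must not change any more. Based on vul-LMGGNN, which I have been reading these past few days, similar papers: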
Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs
Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation
Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
VulSim: Leveraging Similarity of Multi-Dimensional Neighbor Embeddings for Vulnerability Detection
Might as well build a benchmark.
The Devign code needs simplifying; once it is fully understood, merge it with vul-LMGGNN.
11.16 Saturday
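Morning: merged create_cpg and embed_task from LMGGNN and Devign, and simplified config.py
Afternoon: merge the models and training code of the two.
New idea: open problem: data imbalance; approach: incorporate the FedSAM method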